Human Image Generation through Multimodal Diffusion Models (Project)
Abstract
Pose-guided human image generation remains a challenging yet significant task in computer vision. To address limitations in existing Human-Object Interaction (HOI) datasets, we introduce an HOI dataset containing 29K images with structured object annotations, detailed captions, and diverse interactions. Leveraging this dataset, we propose HOIGEN, a diffusion-based multimodal model that generates realistic human-object interaction images conditioned on textual descriptions and object appearance. Extensive benchmarking demonstrates that HOIGEN synthesizes structurally coherent, style-controllable, and photorealistic images, significantly advancing pose-conditioned image generation.