
Recent advances in human fashion video generation have transformed the field and produced many promising results. However, existing methods mainly focus on pose control and cannot offer sketch-based control, largely because existing datasets lack appearance-consistent yet shape-varying examples. Moreover, the need for sequential structural inputs to control video generation hinders real-world applications. To address these limitations, we introduce Sketch2HumanVideo, an approach that, for the first time, achieves sketch-controllable human video generation with three conditions: temporally sparse sketches, a spatially sparse pose sequence, and a reference appearance image. Our key contribution is a sparse sketch encoder, which takes the first two conditions as input and enables precise, multi-view control of shape motion. To provide the missing knowledge, we leverage the expertise of two pretrained models to synthesize a dataset of shape-varying yet appearance-consistent examples for model training. Furthermore, we introduce an enlarging-and-resampling scheme that enhances high-frequency details of local regions in resource-constrained scenarios, thereby promoting the generation of realistic videos. Qualitative and quantitative experiments show that our method outperforms state-of-the-art approaches while offering flexible control.
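To make the conditioning interface concrete, below is a minimal PyTorch sketch of one way a sparse sketch encoder could fuse a dense pose sequence with the two boundary sketches; the module name, layer choices, and the linear interpolation of sketch features over time are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch (not the released code). Assumed inputs: N pose maps P_1..P_N,
# two boundary sketches S_1 and S_N, producing per-frame shape features.
import torch
import torch.nn as nn

class SparseSketchEncoder(nn.Module):
    """Toy stand-in: fuses a dense pose sequence with two temporally sparse sketches."""
    def __init__(self, ch=64):
        super().__init__()
        self.pose_conv = nn.Conv2d(3, ch, 3, padding=1)
        self.sketch_conv = nn.Conv2d(1, ch, 3, padding=1)

    def forward(self, poses, sketch_first, sketch_last):
        # poses: (N, 3, H, W); each sketch: (1, 1, H, W)
        n = poses.shape[0]
        pose_feat = self.pose_conv(poses)  # (N, ch, H, W)
        sk = self.sketch_conv(torch.cat([sketch_first, sketch_last], dim=0))  # (2, ch, H, W)
        # One simple choice: linearly interpolate the two sketch features over time.
        w = torch.linspace(0, 1, n).view(n, 1, 1, 1)
        sketch_feat = (1 - w) * sk[0:1] + w * sk[1:2]  # (N, ch, H, W)
        return pose_feat + sketch_feat

# Example usage with random tensors.
N, H, W = 8, 64, 64
encoder = SparseSketchEncoder()
poses = torch.randn(N, 3, H, W)
s1, sN = torch.randn(1, 1, H, W), torch.randn(1, 1, H, W)
shape_feats = encoder(poses, s1, sN)
print(shape_feats.shape)  # torch.Size([8, 64, 64, 64])
```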
An illustration of the three parts of our method. (Left) Given an individual video, we first fine-tune Stable Diffusion (SD) and ControlNet with LoRA to customize the appearance. Then, given arbitrary sketch inputs and prompts, the fine-tuned model generates the reference images for subsequent training. (Middle) Our Sketch2HumanVideo consists of three main modules. First, the sparse sketch encoder restores complete shape motion from the pose sequence \(\{P_{1:N}\}\) and the sparse sketches \((S_1, S_N)\). Second, the appearance net learns appearance features from a reference image \(\hat{I}_{\text{ref}}\) randomly selected from our generated data. Third, the text-to-video backbone assembles the appearance and shape information, provided separately by the above two modules, to denoise the noisy input \(\{z_{1:N,0}\}\). (Right) At inference time, we first generate a coarse full-body video \(\{\hat{I}^{\text{full}}_{1:N}\}\) as a canvas and then enlarge specific regions (face/body) with overlaps, followed by a resampling strategy. To ensure smooth stitching of these regions, we average the overlapping features and replace them with the averaged result.
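The overlap-averaging step at inference can be illustrated with a short sketch: enlarged region features (e.g., face and body) are pasted back onto the canvas, and any area covered by more than one region receives the average of the contributing features. The function name, box coordinates, and tensor shapes below are assumptions for clarity, not the paper's exact procedure.

```python
# Hypothetical illustration of overlap averaging for smooth stitching.
import torch

def paste_regions_with_overlap_average(canvas_shape, regions):
    """regions: list of (feat, (top, left)), where feat is (C, h, w).
    Areas covered by several regions receive the average of their features;
    uncovered areas remain zero in this toy example."""
    acc = torch.zeros(canvas_shape)
    count = torch.zeros(canvas_shape[-2:])
    for feat, (t, l) in regions:
        c, h, w = feat.shape
        acc[:, t:t + h, l:l + w] += feat
        count[t:t + h, l:l + w] += 1
    return acc / count.clamp(min=1)

# Example: a refined face crop and a refined body crop sharing an overlap band.
face = torch.randn(4, 24, 24)
body = torch.randn(4, 48, 40)
stitched = paste_regions_with_overlap_average((4, 64, 64), [(face, (0, 12)), (body, (16, 8))])
print(stitched.shape)  # torch.Size([4, 64, 64])
```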
For each appearance in the existing human video dataset, we provide several synthesized reference images that showcase variations in shape, including hairstyles, body shapes, and clothing types.
@article{qu2025controllable,
  title={Controllable Human Video Generation from Sparse Sketches},
  author={Qu, Linzi and Shang, Jiaxiang and Lam, Miu-Ling and Fu, Hongbo},
  journal={IEEE Transactions on Visualization and Computer Graphics},
  year={2025},
  publisher={IEEE}
}