Controllable character animation from a reference image and motion guidance remains challenging due to the complexity of injecting appearance and motion cues into video diffusion models. Existing methods rely heavily on complex architectures, explicit guidance modules, or multi-stage processing pipelines, introducing structural overhead and deployment difficulties. Inspired by the powerful contextual understanding capability inherent in pretrained video diffusion transformers, we propose FramePrompt, a minimalist yet highly effective framework that unifies reference images, skeleton-guided motion inputs, and target video segments into a single coherent visual sequence. By reframing animation as a conditional future prediction task, we eliminate the need for explicit guider networks and architectural modifications. Our experiments demonstrate significant improvements over representative baselines, achieving approximately 5.93% higher SSIM, 20.65% higher PSNR, 34.95% lower LPIPS, and 53.87% lower FVD scores, while also simplifying training and deployment. Our findings underscore the remarkable effectiveness of treating heterogeneous visual inputs within a unified sequential framework, fully leveraging pretrained model contextual dynamics without additional conditioning mechanisms.
Our approach, FramePrompt, introduces a unified sequential framework for controllable character animation. Following Figure 1b of the paper, all input modalities (the reference image, the skeleton-guided motion sequence, and the noisy target video placeholders) are concatenated along the temporal axis into a single input sequence.
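The concatenation described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the names `build_sequence`, `Frame`, `roles`, and `predict_mask` are hypothetical, frames are modeled as flat lists of floats rather than latent tensors, and the one-target-frame-per-skeleton-frame layout is an assumption.

```python
import random
from typing import List, Tuple

Frame = List[float]  # hypothetical stand-in for a per-frame latent


def build_sequence(
    reference: Frame, skeleton: List[Frame], seed: int = 0
) -> Tuple[List[Frame], List[str], List[bool]]:
    """Lay out [reference | skeleton frames | noisy target placeholders] in time."""
    dim = len(reference)
    if any(len(frame) != dim for frame in skeleton):
        raise ValueError("all frames must share the same latent size")

    rng = random.Random(seed)
    # Assumption: one noisy placeholder per skeleton frame.
    targets = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in skeleton]

    frames = [reference] + skeleton + targets
    roles = ["reference"] + ["skeleton"] * len(skeleton) + ["target"] * len(targets)
    # Only target positions are meant to be predicted; the rest is context.
    predict_mask = [role == "target" for role in roles]
    return frames, roles, predict_mask


if __name__ == "__main__":
    ref = [0.5] * 4
    poses = [[0.0] * 4, [1.0] * 4, [2.0] * 4]
    frames, roles, predict_mask = build_sequence(ref, poses)
    print(roles)
    print(predict_mask)
```

In this layout the model sees a single sequence, and the role of each frame is carried only by its position. That mirrors the idea of dropping a separate guidance module.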
This design leverages the pretrained video diffusion transformer's inherent capacity for contextual understanding. By reframing character animation as a future sequence prediction task conditioned on the initial visual context (reference image and motion), FramePrompt eliminates the need for specialized encoder networks or complex architectural modifications. This significantly simplifies the model design, training, and deployment while enhancing performance.
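One way to read "future prediction conditioned on the initial visual context" is that the training objective is evaluated only on the target positions, while the reference and skeleton frames act purely as context. The sketch below illustrates that reading with a masked mean squared error. It is an assumption made for illustration: the text above does not specify the loss, and `masked_mse` and its arguments are hypothetical names.

```python
from typing import Sequence


def masked_mse(
    pred: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
    predict_mask: Sequence[bool],
) -> float:
    """Mean squared error over frames whose mask entry is True (assumed: target frames)."""
    if not (len(pred) == len(target) == len(predict_mask)):
        raise ValueError("pred, target, and predict_mask must have equal length")

    total, count = 0.0, 0
    for p, t, keep in zip(pred, target, predict_mask):
        if not keep:
            continue  # context frames (reference, skeleton) contribute no loss
        if len(p) != len(t):
            raise ValueError("frame size mismatch")
        total += sum((a - b) ** 2 for a, b in zip(p, t))
        count += len(p)

    if count == 0:
        raise ValueError("predict_mask selects no frames")
    return total / count


if __name__ == "__main__":
    pred = [[9.0, 9.0], [1.0, 1.0], [2.0, 2.0]]
    truth = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
    # The first frame is context, so its large error is ignored.
    print(masked_mse(pred, truth, [False, True, True]))
```

Under this reading, no extra conditioning branch is needed: the context frames influence the prediction only through the model's attention over the shared sequence.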