
CoDance: An Unbind-Rebind Paradigm for Robust Multi-Subject Animation

Shuai Tan Biao Gong Ke Ma Yutong Feng Qiyuan Zhang Yan Wang Yujun Shen Hengshuang Zhao

Abstract

Character image animation is gaining increasing importance across many domains, driven by the growing demand for robust and flexible rendering of multiple subjects. While existing methods excel at animating a single person, they struggle to handle an arbitrary number of subjects, varied character types, and spatial misalignments between the reference image and the driving poses. We attribute these limitations to an overly rigid spatial binding, which enforces strict pixel-wise alignment between the pose and the reference image, and to an inability to coherently reassign motion to the intended subjects. To overcome these challenges, we propose CoDance, a novel Unbind-Rebind framework for animating an arbitrary number of subjects of diverse types and spatial configurations, conditioned on a single, possibly misaligned, pose sequence. Specifically, the Unbind module uses a novel pose shift encoder to break the rigid spatial binding between the pose and the reference by introducing stochastic perturbations on both the poses and their latent features, forcing the model to learn a location-agnostic motion representation. To ensure precise control and reliable subject association, we then design a Rebind module that leverages semantic guidance from text prompts and spatial guidance from subject masks to direct the learned motion toward the desired characters. Furthermore, to facilitate comprehensive evaluation, we introduce a new multi-subject benchmark called CoDanceBench. Extensive experiments on CoDanceBench and existing datasets show that CoDance achieves state-of-the-art performance, demonstrating remarkable generalization across a wide variety of subjects and spatial layouts. The source code and model weights will be made publicly available.

One-sentence Summary

The authors from The University of Hong Kong, Ant Group, Huazhong University of Science and Technology, Tsinghua University, and the University of North Carolina at Chapel Hill propose CoDance, a novel Unbind-Rebind framework that enables pose-controllable multi-subject animation from a single misaligned pose sequence by learning location-agnostic motion representations through stochastic pose perturbations and re-binding motion via text and mask guidance, achieving state-of-the-art results on diverse subject counts and spatial configurations.

Key Contributions

  • Existing character animation methods fail in multi-subject scenarios due to rigid spatial binding between reference images and driving poses, leading to misaligned or entangled outputs when handling arbitrary subject counts, types, or spatial configurations.
  • CoDance introduces an Unbind-Rebind framework: the Unbind module uses stochastic pose and feature perturbations via a novel pose shift encoder to learn location-agnostic motion representations, while the Rebind module leverages text prompts and subject masks for precise, controllable animation of intended characters.
  • Extensive evaluations on the new CoDanceBench and existing datasets demonstrate state-of-the-art performance, with strong generalization across diverse subjects, spatial layouts, and misaligned inputs, validated through both quantitative metrics and perceptual quality.

Introduction

Character image animation is increasingly vital for applications like virtual performances, advertising, and educational content, where animating multiple subjects with diverse types and spatial arrangements is common. Prior methods, while effective for single-person animation, fail in multi-subject scenarios due to rigid spatial binding between pose and reference image, which requires precise alignment and limits scalability beyond two subjects. This leads to motion entanglement, poor generalization to non-human characters, and sensitivity to misalignment. The authors introduce CoDance, a novel Unbind-Rebind framework that decouples motion from appearance by applying stochastic perturbations to pose skeletons and their latent features during training, enabling the model to learn location-agnostic motion semantics. To restore precise control, the Rebind module uses text prompts for semantic guidance and subject masks for spatial localization, allowing accurate animation of arbitrary subject counts and configurations even under misalignment. The approach is validated on a new benchmark, CoDanceBench, demonstrating state-of-the-art performance and strong generalization across diverse layouts and character types.

Dataset

  • The dataset comprises public TikTok and Fashion datasets, augmented with approximately 1,200 self-collected TikTok-style videos.
  • For the Rebind stage, training is enhanced with 10,000 text-to-video samples to strengthen semantic associations and 20 multi-subject videos to guide spatial binding.
  • Evaluation includes single-person and multi-subject settings:
    • Single-person: 10 videos from TikTok and 100 from Fashion, following prior work [13, 14, 75].
    • Multi-subject: Uses the Follow-Your-Pose-V2 benchmark and a new curated benchmark, CoDanceBench, consisting of 20 multi-subject dance videos.
  • To ensure fair comparison, quantitative results are reported using a model trained only on solo dance videos, excluding multi-subject data from CoDanceBench.
  • Training mixes the different data sources at specific ratios, though the exact proportions are not detailed.
  • No explicit cropping strategy is described, but videos are processed to align with pose and motion inputs.
  • Metadata construction involves aligning video content with pose sequences and textual descriptions, particularly for text-to-video samples and multi-subject coordination.
  • Evaluation metrics include frame-level measures (PSNR, SSIM, L1, LPIPS) for perceptual fidelity and distortion, and video-level metrics (FID, FID-VID, FVD) to assess distributional realism.
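
As a rough illustration of how the frame-level metrics above could be computed, the sketch below uses scikit-image for PSNR/SSIM and the lpips package for LPIPS; these library choices and the uint8-frame assumption are ours, not the paper's.

```python
# Minimal sketch of the frame-level metrics, assuming frames are HxWx3 uint8 RGB
# numpy arrays; library choices (scikit-image, lpips) are ours, not the paper's.
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # perceptual distance network

def frame_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Compute PSNR, SSIM, L1, and LPIPS between a generated and a ground-truth frame."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    l1 = np.abs(gt.astype(np.float32) - pred.astype(np.float32)).mean() / 255.0

    # LPIPS expects NCHW tensors scaled to [-1, 1]
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    with torch.no_grad():
        lp = lpips_fn(to_tensor(pred), to_tensor(gt)).item()

    return {"PSNR": psnr, "SSIM": ssim, "L1": l1, "LPIPS": lp}
```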

Method

The authors leverage a diffusion-based framework for multi-character animation, structured around a Diffusion Transformer (DiT) backbone. The overall pipeline, as illustrated in the framework diagram, begins with a reference image $I^{r}$, a driving pose sequence $I_{1:F}^{p}$, a text prompt $\mathcal{T}$, and a subject mask $\mathcal{M}$. A Variational Autoencoder (VAE) encoder extracts the latent feature $f_{e}^{r}$ from the reference image. This latent representation, along with the subject mask, forms the initial input for the generative process. The driving pose sequence is processed by a Pose Shift Encoder, composed of stacked 3D convolutional layers that capture both temporal and spatial features. The extracted pose features are then concatenated with the patchified tokens of the noisy latent input, forming the conditioned input for the DiT backbone.
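
The sketch below illustrates, under our own assumptions about layer counts, channel widths, and patchification, what a Pose Shift Encoder built from stacked 3D convolutions and the subsequent token concatenation could look like; it is not the released architecture.

```python
# Hedged sketch of the conditioning path described above: a Pose Shift Encoder
# built from stacked 3D convolutions whose output is concatenated with the
# patchified noisy-latent tokens. Layer counts, channel widths, and the token
# flattening are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class PoseShiftEncoder(nn.Module):
    def __init__(self, in_ch: int = 3, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(                         # stacked 3D conv layers
            nn.Conv3d(in_ch, dim // 2, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(dim // 2, dim, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(dim, dim, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, pose_video: torch.Tensor) -> torch.Tensor:
        # pose_video: (B, C, F, H, W) rendered skeleton frames
        feat = self.net(pose_video)                       # (B, dim, F, H/4, W/4)
        # flatten the spatio-temporal grid into tokens: (B, N, dim)
        return feat.flatten(2).transpose(1, 2)

def condition_tokens(noisy_latent_tokens: torch.Tensor,
                     pose_tokens: torch.Tensor) -> torch.Tensor:
    """Concatenate pose tokens with patchified noisy-latent tokens along the
    sequence dimension before feeding the DiT backbone."""
    return torch.cat([noisy_latent_tokens, pose_tokens], dim=1)
```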

The core innovation of the method lies in the Unbind-Rebind paradigm, which is designed to overcome the limitations of rigid spatial alignment. The Unbind module, implemented as a Pose Shift Encoder, operates at both the input and feature levels to decouple the pose from its original spatial context. At the input level, the Pose Unbind module stochastically disrupts the alignment by randomly translating and scaling the skeleton of the driving pose. At the feature level, the Feature Unbind module further augments the pose features by applying random translations and duplicating the feature region corresponding to the pose, thereby forcing the model to learn a more abstract, semantic understanding of motion. This process ensures the model does not rely on a fixed spatial correspondence between the reference image and the target pose.
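
A minimal sketch of what the training-time Unbind perturbations could look like is given below; the perturbation ranges and the duplication rule are illustrative assumptions, since the paper summary does not specify exact hyperparameters.

```python
# Hedged sketch of the Unbind perturbations, applied only during training.
# The random scale/shift ranges and the region-duplication rule are our
# assumptions for illustration.
import torch

def pose_unbind(keypoints: torch.Tensor, img_w: int, img_h: int) -> torch.Tensor:
    """Randomly scale and translate 2D skeleton keypoints (..., J, 2) in pixel space."""
    scale = torch.empty(1).uniform_(0.7, 1.3)
    shift = torch.stack([
        torch.empty(1).uniform_(-0.2 * img_w, 0.2 * img_w),
        torch.empty(1).uniform_(-0.2 * img_h, 0.2 * img_h),
    ], dim=-1)                                     # (1, 2)
    center = keypoints.mean(dim=-2, keepdim=True)  # skeleton centroid
    return (keypoints - center) * scale + center + shift

def feature_unbind(pose_feat: torch.Tensor, box: tuple) -> torch.Tensor:
    """Translate and duplicate the feature region covered by the pose.

    pose_feat: (B, C, H, W) pose feature map; box = (x0, y0, x1, y1) region
    occupied by the skeleton in feature coordinates.
    """
    x0, y0, x1, y1 = box
    region = pose_feat[:, :, y0:y1, x0:x1]
    h, w = region.shape[-2:]
    out = pose_feat.clone()
    # paste a copy of the region at a random offset (kept inside the feature map)
    dy = int(torch.randint(-y0, pose_feat.shape[-2] - y1 + 1, (1,)))
    dx = int(torch.randint(-x0, pose_feat.shape[-1] - x1 + 1, (1,)))
    out[:, :, y0 + dy:y0 + dy + h, x0 + dx:x0 + dx + w] = region
    return out
```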

Following the Unbind operation, the Rebind module re-establishes the correct correspondence between the learned motion and the target subjects in the reference image through two complementary mechanisms. From a semantic perspective, a text prompt $\mathcal{T}$ is processed by a umT5 text encoder, and the resulting features are injected into the DiT blocks via cross-attention layers. This provides explicit guidance on the identity and number of subjects to be animated. To address figure-ground ambiguity and ensure precise spatial control, a spatial rebind mechanism is employed. A subject mask $\mathcal{M}$, extracted from the reference image using a pretrained segmentation model such as SAM, is fed into a Mask Encoder. The resulting mask features are added element-wise to the noisy latent vector, confining the animation to the specified regions. To enhance the model's semantic comprehension and prevent overfitting to animation-specific prompts, the training process alternates between animation data and a diverse text-to-video (TI2V) dataset. The DiT backbone is initialized from a pretrained T2V model and fine-tuned using Low-Rank Adaptation (LoRA) layers. Finally, the denoised latent representation is passed to the VAE decoder to reconstruct the output video $I_{1:F}^{g}$. It is important to note that the Unbind modules and the mixed-data training strategy are applied exclusively during the training phase.
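
Below is a hedged sketch of the spatial Rebind path, with a small Mask Encoder whose output is added element-wise to a per-frame noisy latent; module sizes and the cross-attention wiring noted in the comments are our assumptions rather than the released implementation.

```python
# Hedged sketch of the spatial Rebind path: a small Mask Encoder whose output is
# added element-wise to the noisy latent, alongside text features injected via
# cross-attention. Channel counts and downsampling factors are our assumptions.
import torch
import torch.nn as nn

class MaskEncoder(nn.Module):
    """Encode the subject mask down to the latent resolution and channel count."""
    def __init__(self, latent_ch: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, hidden, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, latent_ch, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, mask: torch.Tensor) -> torch.Tensor:
        # mask: (B, 1, H, W) binary subject mask -> (B, latent_ch, h, w)
        return self.net(mask)

def spatial_rebind(noisy_latent: torch.Tensor, mask: torch.Tensor,
                   mask_encoder: MaskEncoder) -> torch.Tensor:
    """Add mask features element-wise to a per-frame noisy latent (B, C, h, w)
    before the DiT blocks, confining motion to the masked regions."""
    return noisy_latent + mask_encoder(mask)

# Semantic rebind (conceptually): umT5 text features attend to the latent tokens
# through the DiT's cross-attention layers, e.g.
#   attn_out = cross_attention(query=latent_tokens, key=text_feat, value=text_feat)
```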

Experiment

  • Evaluated against SOTA methods (AnimateAnyone, MusePose, ControlNeXt, MimicMotion, UniAnimate, Animate-X, StableAnimator, UniAnimate-DiT) on Follow-Your-Pose-V2 and the proposed benchmark; our method outperforms all of them in perceptual similarity (LPIPS), identity consistency (PSNR/SSIM), and motion fidelity (FID-VID, FVD), demonstrating superiority in multi-subject animation despite being trained on single-person data.
  • In a challenging setting where a single-person driving pose is applied to a multi-subject reference image, our method maintains identity preservation and motion coherence, while baselines fail due to rigid binding and their inability to handle the cardinality mismatch.
  • Qualitative results (Fig. 4, Fig. 5) show our method consistently generates accurate, semantically correct animations across 1–5 subjects, avoiding artifacts like identity confusion and motion distortion, unlike competing methods that suffer from shape distortion and uncoordinated motion.
  • User study (Tab. 3) with 10 participants shows our method achieves the highest preference rates across video quality, identity preservation, and temporal consistency, confirming perceptual superiority.
  • Ablation study confirms the necessity of both Unbind and Rebind modules: Unbind enables identity preservation by breaking rigid alignment, Spatial Rebind enables motion localization, and the full model achieves coherent, subject-specific animation by combining both components.

Results show that CoDance outperforms all compared methods across key metrics, achieving the lowest LPIPS, FID, and FVD scores and the highest PSNR and SSIM values. The authors use this table to demonstrate that their method significantly improves perceptual similarity, identity consistency, and motion fidelity compared to existing single-person models, which fail in multi-subject scenarios due to rigid alignment constraints.

Results show that CoDance achieves the highest scores across all three evaluation metrics—Video Quality, Identity Preservation, and Temporal Consistency—outperforming all compared methods. This demonstrates the effectiveness of the proposed Unbind-Rebind strategy in maintaining identity and motion coherence in multi-subject animation scenarios.

Results show that CoDance outperforms all compared methods across multiple metrics, achieving the lowest LPIPS, FID-VID, and FVD scores and the highest PSNR and SSIM scores, indicating superior perceptual similarity, identity consistency, and motion fidelity. The authors use a multi-subject animation setup to demonstrate that conventional single-person models fail due to rigid alignment, while CoDance's Unbind-Rebind strategy enables accurate and coherent motion generation for each subject despite being trained under the same single-person data constraint.

