HyperAIHyperAI

Command Palette

Search for a command to run...

3D 인식 임플리시트 모션 제어를 통한 시점 적응형 인간 영상 생성

Zhixue Fang Xu He Songlin Tang Haoxian Zhang Qingfeng Li Xiaoqiang Liu Pengfei Wan Kun Gai

초록

영상 생성에서 인간의 운동 제어를 위한 기존 방법들은 일반적으로 2D 자세 또는 명시적인 3D 파라메트릭 모델(SMPL 등)을 제어 신호로 사용한다. 그러나 2D 자세는 운동을 드라이빙 시점에 고정시켜 새로운 시점에서의 합성(노바르-뷰 싱테시스)을 불가능하게 한다. 명시적인 3D 모델은 구조적 정보를 제공하지만, 깊이 모호성과 정확하지 않은 역학적 특성 등의 내재적 오류를 가지고 있으며, 이러한 모델이 강한 제약 조건으로 사용될 경우, 대규모 영상 생성 모델이 내재적으로 지닌 강력한 3D 인식 능력을 억압하게 된다. 본 연구에서는 3D 인식 관점에서 운동 제어를 재검토하며, 외부에서 재구성된 제약 조건에 의존하지 않고 생성 모델의 공간적 사전 지식과 자연스럽게 일치하는 암묵적이고 시점에 무관한 운동 표현을 제안한다. 이를 위해 우리는 3DiMo를 제안하며, 사전 학습된 영상 생성 모델과 함께 운동 인코더를 공동으로 학습시켜 드라이빙 프레임을 컴팩트하고 시점에 무관한 운동 토큰으로 압축하고, 크로스 어텐션을 통해 의미적으로 주입한다. 3D 인식을 강화하기 위해 단시점, 다시점, 그리고 움직이는 카메라 영상 등 다양한 시점의 풍부한 감독 신호를 사용하여 다양한 시점 간의 운동 일관성을 강제한다. 또한, 보조적인 기하학적 감독을 도입하여 SMPL을 초기화용으로만 사용하고, 학습 과정 중에 점차 감쇠시키는 방식으로, 외부 3D 안내에서 데이터와 생성 모델의 사전 지식을 기반으로 진정한 3D 공간 운동 이해를 학습할 수 있도록 전환한다. 실험 결과, 3DiMo는 유연한 텍스트 기반 카메라 제어를 통해 드라이빙 운동을 정확히 재현하며, 기존 방법에 비해 운동 정확도와 시각적 품질 측면에서 뚜렷한 성능 향상을 보였다.

One-sentence Summary

Researchers from Kuaishou Technology, Tsinghua University, and CASIA propose 3DiMo, a view-agnostic motion representation that enables faithful 3D motion reproduction from 2D videos via text-guided camera control, outperforming prior methods by aligning with generator priors rather than relying on rigid 3D models.

Key Contributions

  • 3DiMo introduces a 3D-aware motion control framework that extracts implicit, view-agnostic motion tokens from 2D driving videos, enabling novel-view synthesis and text-guided camera control without rigid geometric constraints from external 3D models.
  • The method jointly trains a motion encoder with a pretrained video generator using view-rich supervision—including single-view, multi-view, and moving-camera videos—to enforce motion consistency across viewpoints and align with the generator’s intrinsic spatial priors.
  • Auxiliary geometric supervision from SMPL is used only during early training and gradually annealed to zero, allowing the model to transition from external guidance to learning genuine 3D motion understanding, resulting in state-of-the-art motion fidelity and visual quality on benchmark tasks.

Introduction

The authors leverage large-scale video generators’ inherent 3D spatial awareness to rethink human motion control, moving beyond rigid 2D pose conditioning or error-prone explicit 3D models like SMPL. Prior methods either lock motion to the driving viewpoint or impose inaccurate 3D constraints that override the generator’s native priors, limiting viewpoint flexibility and motion fidelity. Their main contribution is 3DiMo, an end-to-end framework that learns implicit, view-agnostic motion tokens from 2D driving frames via a Transformer encoder, injected through cross-attention to align with the generator’s spatial understanding. Training uses view-rich supervision—single-view, multi-view, and moving-camera videos—plus an annealed auxiliary loss tied to SMPL for early geometric guidance, enabling the model to transition toward genuine 3D motion reasoning. This yields high-fidelity, text-guided camera control while preserving 3D consistency across views.

Dataset

  • The authors construct a view-rich 3D-aware dataset combining three sources: large-scale internet videos for expressive motion diversity, synthetic UE5-rendered sequences for precise camera motion control, and real-world multi-view captures (including proprietary recordings) for authentic 3D supervision.
  • Internet videos dominate in volume and train expressive motion dynamics under single-view reconstruction; synthetic and real multi-view data are smaller but critical for enforcing 3D consistency via cross-view motion reproduction and moving-camera supervision.
  • Text prompts for camera viewpoints and movements are generated using Qwen2.5-VL, unifying annotations across all data types for text-guided camera control.
  • Training follows a three-stage progressive strategy: Stage 1 uses only single-view reconstruction for stable motion initialization; Stage 2 mixes reconstruction and cross-view reproduction to transition toward 3D semantics; Stage 3 focuses entirely on multi-view and camera-motion data to strengthen view-agnostic motion features.
  • Reference images are taken from the first frame of the target video to align generated motion with subject orientation, avoiding explicit camera alignment or SMPL regression.
  • Auxiliary geometric supervision is introduced early via a lightweight MLP decoder predicting SMPL/MANO pose parameters (excluding global root orientation) to stabilize training and initialize motion representations; this supervision is gradually annealed and removed by Stage 3.
  • Proprietary real-world captures use a three-camera array with randomized camera motions—including static, linear, zoom, and complex trajectories—paired with identical human motions to maximize view-agnostic learning signals.

Method

The authors leverage a pretrained DiT-based video generation model as the backbone, which is conditioned on a reference image, a driving video, and a text prompt to synthesize a target video that reenacts the driving motion while preserving text-guided camera control. The core innovation lies in the design of an implicit motion encoder, trained end-to-end with the generator, that extracts view-agnostic motion representations from 2D driving frames and injects them via cross-attention to enable semantically coherent motion transfer.

Refer to the framework diagram, which illustrates the full pipeline. The motion encoder consists of two specialized components: a body encoder Eb\mathcal{E}_bEb and a hand encoder Eh\mathcal{E}_hEh, each designed to capture coarse body motion and fine hand dynamics, respectively. Driving frames are first augmented with random perspective transformations and appearance perturbations to encourage the encoder to discard view-specific and identity-dependent cues. Each frame is patchified into visual tokens and concatenated with a fixed number KKK of learnable latent tokens. After several attention layers, only the latent tokens are retained as the motion representation zbz_bzb and zhz_hzh, forming a semantic bottleneck that strips away 2D structural information while preserving the intrinsic 3D motion semantics.

These motion tokens are then concatenated and injected into the DiT generator via cross-attention layers appended after each full self-attention block. This design allows motion tokens to interact semantically with video tokens without imposing rigid spatial alignment constraints, enabling flexible coexistence with the generator’s native text-driven camera control. The text prompt, which may include camera movement instructions, is processed through the same self-attention mechanism as the visual tokens, allowing joint control over motion and viewpoint.

To endow the motion representations with 3D awareness, the authors introduce early-stage geometric supervision using parametric human models (SMPL for body, MANO for hands). During training, auxiliary decoders regress the motion tokens to external 3D pose parameters θb\theta_bθb and θh\theta_hθh, providing geometric alignment signals that accelerate spatial understanding. The framework is trained on a view-rich dataset that includes single-view, multi-view, and camera-motion sequences, enabling the model to learn cross-view motion consistency and genuine 3D motion priors. Training alternates between same-view reconstruction and cross-view motion reproduction tasks, reinforcing the emergence of expressive, 3D-aware motion representations.

At inference, the motion encoder directly processes 2D driving frames to extract motion tokens, which are then used to animate any reference subject under arbitrary text-guided camera trajectories. This eliminates the need for external pose estimation or explicit 3D reconstruction, enabling high-fidelity, view-adaptive motion-controlled video generation.

Experiment

  • 3DiMo outperforms state-of-the-art 2D and 3D methods in motion control and visual fidelity, particularly in LPIPS, FID, and FVD, demonstrating superior 3D-aware reenactment despite slightly lower PSNR/SSIM due to intentional geometric consistency over pixel alignment.
  • Qualitative results show 3DiMo produces physically plausible, depth-accurate motions, avoiding limb ordering errors common in 2D methods and pose inaccuracies in SMPL-based approaches, especially under dynamic camera prompts.
  • User studies confirm 3DiMo excels in motion naturalness and 3D plausibility, validating its learned motion representation better captures spatial and dynamic realism than parametric or 2D alignment methods.
  • Ablation studies reveal that implicit motion representation, multi-stage view-rich training, cross-attention conditioning, and auxiliary geometric supervision are critical for stable training and 3D-aware control; removing them degrades quality or causes collapse.
  • The model enables novel applications: single-image human novel view synthesis with stable subject rendering, video stabilization by enforcing static camera prompts, and automatic motion-image alignment via view-agnostic motion transfer without manual calibration.

The authors evaluate ablation variants of their model and find that removing key components such as geometric supervision, hand motion encoding, or using channel concatenation instead of cross-attention consistently degrades performance. The full model achieves the best balance across metrics, particularly in FID and FVD, indicating superior visual fidelity and motion coherence. Results confirm that multi-stage training with view-rich data and auxiliary supervision is essential for stable learning and 3D-aware motion control.

The authors use a 3D-aware implicit motion representation to achieve superior motion control and visual fidelity compared to both 2D pose-based and SMPL-based baselines. Results show their method outperforms others in perceptual metrics and user studies, particularly in motion naturalness and 3D plausibility, despite slightly lower pixel-wise scores due to enforced geometric consistency. Ablation studies confirm that multi-stage view-rich training and cross-attention conditioning are critical for stable, high-fidelity 3D motion synthesis.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp