Bridging Semantic and Kinematic Conditions with a Diffusion-based Discrete Motion Tokenizer
Chenyang Gu Mingyuan Zhang Haozhe Xie Zhongang Cai Lei Yang Ziwei Liu
Abstract
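Prior work on motion generation largely follows two paradigms: continuous diffusion models, which excel at kinematic control, and discrete token-based generators, which are effective for semantically conditioned generation. To combine the strengths of both, we propose a three-stage framework consisting of condition feature extraction (Perception), discrete token generation (Planning), and diffusion-based motion synthesis (Control). At the core of this framework is MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from fine-grained reconstruction by delegating motion recovery to a diffusion decoder. This enables compact single-layer tokens while preserving motion fidelity. For kinematic conditions, coarse constraints guide token generation during the planning stage, and fine-grained constraints are enforced during the control stage through diffusion-based optimization. This design prevents kinematic details from interfering with semantic token planning. On the HumanML3D benchmark, our method substantially improves controllability and fidelity over MaskControl while using only one-sixth of the tokens, reducing trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029. Unlike prior methods, whose performance degrades under stronger kinematic constraints, our method improves fidelity, lowering FID from 0.033 to 0.014.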
One-sentence Summary
Researchers from Nanyang Technological University and The Chinese University of Hong Kong propose MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from kinematic reconstruction to enable compact tokenization and superior trajectory control in human motion generation.
Key Contributions
- The paper introduces a three-stage Perception-Planning-Control paradigm for controllable motion generation that unifies autoregressive and discrete diffusion planners under a single interface to separate high-level planning from low-level kinematics.
- This work presents MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from fine-grained reconstruction by delegating motion recovery to a diffusion decoder, enabling compact single-layer tokens with a significantly reduced token budget.
- A coarse-to-fine conditioning scheme is developed to inject kinematic signals as coarse constraints during token planning and enforce fine-grained constraints during diffusion denoising, which experiments on HumanML3D show improves controllability and fidelity while reducing trajectory error from 0.72 cm to 0.08 cm.
Introduction
Human motion generation is critical for applications in animation, robotics, and embodied agents, yet existing methods struggle to balance high-level semantic intent with fine-grained kinematic control. Prior token-based approaches often entangle semantic abstraction with low-level motion details, requiring high token rates and causing performance degradation when strong kinematic constraints are applied. The authors propose a three-stage Perception-Planning-Control framework centered on MoTok, a diffusion-based discrete motion tokenizer that decouples semantic planning from motion reconstruction. By delegating fine-grained recovery to a diffusion decoder and applying kinematic constraints in a coarse-to-fine manner across stages, their method achieves compact single-layer tokenization while significantly improving both controllability and motion fidelity.
Method
The authors propose a unified motion generation framework that bridges the strengths of continuous diffusion models for kinematic control and discrete token-based generators for semantic conditioning. This approach follows a three-stage Perception-Planning-Control paradigm, as illustrated in the overview diagram below.
At the core of this framework is MoTok, a diffusion-based discrete motion tokenizer. Unlike conventional VQ-VAE tokenizers that directly decode continuous motion from discrete codes, MoTok factorizes the representation into a compact discrete code sequence and a diffusion decoder for fine-grained reconstruction. This design allows discrete tokens to focus on semantic structure while offloading low-level details to the diffusion process.
Refer to the detailed architecture diagram below for the specific components of the MoTok tokenizer and the unified generation pipeline.
The MoTok tokenizer consists of three primary components. First, a convolutional encoder $E(\cdot)$ extracts latent features from the input motion sequence $\theta_{1:T}$ through progressive temporal downsampling:

$$h_{1:N} = E(\theta_{1:T}), \quad h_{1:N} \in \mathbb{R}^{N \times d},$$

where $N$ is the compressed sequence length and $d$ is the latent dimension. Second, a vector quantizer $Q(\cdot)$ maps these latents to a discrete token sequence $z_{1:N}$ by finding the nearest entry in a shared codebook $C$:

$$z_n = \arg\min_{k \in \{1, \dots, K\}} \lVert h_n - c_k \rVert_2^2, \quad q_n = c_{z_n}.$$

Third, instead of direct regression, the decoder employs a conditional diffusion model. A convolutional decoder $D(\cdot)$ first upsamples the quantized latents $q_{1:N}$ into a per-frame conditioning signal $s_{1:T}$. A neural denoiser $f_\phi$ then reconstructs the clean motion $\hat{x}_0$ from a noisy input $x_t$ conditioned on $s_{1:T}$:

$$\hat{x}_0 = f_\phi(x_t, t, s_{1:T}).$$

This diffusion-based decoding provides a natural interface for enforcing fine-grained constraints during the reconstruction phase.
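The quantization step above can be illustrated with a minimal sketch in plain Python. It covers only the nearest-codebook lookup; the encoder, the upsampling decoder, and the denoiser are omitted. The function names (`squared_l2`, `quantize`) and the toy codebook are assumptions made for illustration and are not taken from the paper. Indices are 0-based here, whereas the formula uses $1, \dots, K$.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    latents: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[Vector]]:
    """Map each latent h_n to its nearest codebook entry.

    Returns the token indices z_n (0-based) and the quantized
    vectors q_n = c_{z_n}.
    """
    tokens: List[int] = []
    quantized: List[Vector] = []
    for h in latents:
        # argmin over codebook indices of ||h_n - c_k||^2
        z = min(range(len(codebook)), key=lambda k: squared_l2(h, codebook[k]))
        tokens.append(z)
        quantized.append(codebook[z])
    return tokens, quantized


# Toy example (hypothetical values): K = 3 entries, d = 2, N = 3 latents.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]
latents = [(0.1, -0.2), (0.9, 1.2), (-0.8, 1.7)]
tokens, quantized = quantize(latents, codebook)
print(tokens)  # [0, 1, 2]
```

Only the index sequence `tokens` is what a planner would generate; the vectors in `quantized` are what the decoder upsamples into the per-frame conditioning signal.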
The unified conditional generation pipeline supports both discrete diffusion and autoregressive planners through a shared conditioning interface. Conditions are categorized into global conditions $c^g$ (e.g., text descriptions) and local conditions $c^s_{1:T}$ (e.g., target trajectories). Global conditions are encoded into a sequence-level feature $M^g$, while local conditions are encoded into a token-aligned feature sequence $M^s_{1:N}$.
During planning in discrete token space, these conditions are injected into the Transformer-based generator. For discrete diffusion planning, a token embedding sequence is constructed where the global condition occupies the first position, and local condition features are added via additive fusion to the motion token positions. For autoregressive planning, the global condition similarly occupies the first position, with local conditions aligned to preceding token positions to preserve temporal causality.
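The two input layouts described above can be sketched as follows. This is one plausible reading rather than the authors' implementation: features are plain lists of floats, and for the autoregressive case the sketch assumes that the input at each position carries the local feature of the token it is about to predict, so the global-condition slot also receives the first local feature. The function names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def add(a: Sequence[float], b: Sequence[float]) -> Vector:
    """Element-wise sum of two equal-length vectors (additive fusion)."""
    return [x + y for x, y in zip(a, b)]


def build_diffusion_inputs(
    global_feat: Sequence[float],
    token_embs: Sequence[Sequence[float]],
    local_feats: Sequence[Sequence[float]],
) -> List[Vector]:
    """Global condition at position 0; local feature n is added to token n."""
    assert len(token_embs) == len(local_feats)
    fused = [add(e, m) for e, m in zip(token_embs, local_feats)]
    return [list(global_feat)] + fused


def build_autoregressive_inputs(
    global_feat: Sequence[float],
    token_embs: Sequence[Sequence[float]],
    local_feats: Sequence[Sequence[float]],
) -> List[Vector]:
    """Assumed layout: input position i predicts token i and carries its
    local feature, so each local feature sits one position before its token."""
    n = len(local_feats)
    prefix = [list(global_feat)] + [list(e) for e in token_embs[: n - 1]]
    return [add(p, m) for p, m in zip(prefix, local_feats)]


# Toy example (hypothetical values): d = 2, N = 3 tokens.
g = [1.0, 0.0]
embs = [[0.0, 1.0], [0.0, 2.0], [0.0, 3.0]]
local = [[10.0, 0.0], [20.0, 0.0], [30.0, 0.0]]

diffusion_inputs = build_diffusion_inputs(g, embs, local)
ar_inputs = build_autoregressive_inputs(g, embs, local)
print(len(diffusion_inputs), len(ar_inputs))  # 4 3
```

In the diffusion layout every token sees its own local feature at its own position. In the autoregressive layout the shift keeps the constraint for a token visible before that token is sampled, without exposing the token itself.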
Finally, control is enforced during the diffusion decoding stage. After the discrete tokens are generated, they are decoded into the conditioning sequence $s_{1:T}$. To ensure adherence to local kinematic constraints, an auxiliary control loss $L_{\mathrm{ctrl}}$ is optimized during the denoising process. At each diffusion step $k$, the motion estimate $\hat{x}_k$ is refined via gradient descent:

$$\hat{x}_k \leftarrow \hat{x}_k - \eta \nabla_{\hat{x}_k} L_{\mathrm{ctrl}}(\hat{x}_k, c^s_{1:T}),$$

where $\eta$ controls the refinement strength. This mechanism allows the system to achieve precise low-level control without burdening the discrete planner with high-frequency details.
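The refinement update can be illustrated with a small, self-contained sketch. The source does not specify the form of the control loss, so this example assumes a squared error between the motion estimate and target values at a few constrained frames, with the motion reduced to one scalar per frame. The names `control_loss`, `control_grad`, and `refine` are hypothetical, and the denoiser that would surround this step is omitted.

```python
from typing import Dict, List


def control_loss(x: List[float], targets: Dict[int, float]) -> float:
    """Assumed control loss: squared error at the constrained frames only."""
    return sum((x[t] - y) ** 2 for t, y in targets.items())


def control_grad(x: List[float], targets: Dict[int, float]) -> List[float]:
    """Analytic gradient of control_loss with respect to x."""
    grad = [0.0] * len(x)
    for t, y in targets.items():
        grad[t] = 2.0 * (x[t] - y)
    return grad


def refine(
    x: List[float], targets: Dict[int, float], eta: float, steps: int = 1
) -> List[float]:
    """Apply x <- x - eta * grad(L_ctrl) for a number of steps."""
    x = list(x)
    for _ in range(steps):
        g = control_grad(x, targets)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x


# Toy example (hypothetical values): 4 frames, frames 1 and 3 constrained.
estimate = [0.0, 0.5, 1.0, 1.5]
targets = {1: 1.0, 3: 1.0}
refined = refine(estimate, targets, eta=0.25)
print(refined)  # [0.0, 0.75, 1.0, 1.25]
print(control_loss(estimate, targets), control_loss(refined, targets))  # 0.5 0.125
```

With this loss, unconstrained frames receive zero gradient and are left untouched by the update, while a larger `eta` or more steps pulls constrained frames closer to their targets.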
Experiment
- Controllable motion generation experiments on HumanML3D and KIT-ML validate that MoTok achieves superior trajectory alignment and motion realism compared to baselines, even with significantly fewer tokens.
- Text-to-motion generation tests confirm that MoTok produces higher quality motions with lower FID scores while operating under a reduced token budget, demonstrating efficient semantic planning.
- Ablation studies reveal that diffusion-based decoders outperform convolutional ones by better recovering fine-grained motion details under noisy generation conditions.
- Configuration analysis shows that moderate temporal downsampling and specific kernel sizes optimize the balance between reconstruction quality and planning stability.
- Dual-path conditioning experiments show that injecting low-level control signals in both the generator and decoder is essential for achieving high fidelity and precise constraint adherence.
- Two-stage training evaluations demonstrate that MoTok tokens encode richer semantic information and allow for better detail recovery than standard VQ-VAE approaches.
- Efficiency comparisons highlight that MoTok generates sequences substantially faster than competing methods while maintaining high performance.