Unified Number-Free Text-to-Motion Generation via Flow Matching

Guanhe Huang Oya Celiktutan

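Abstract

Generative models excel at motion synthesis for a fixed number of agents but struggle to generalize to a variable number of agents. Existing methods, built on limited domain-specific data, apply autoregressive models recursively to generate motion, which leads to inefficiency and error accumulation. We therefore propose Unified Motion Flow (UMF), which consists of Pyramid Motion Flow (P-Flow) and Semi-Noise Motion Flow (S-Flow). UMF decomposes number-free motion generation into a single-pass motion prior generation stage and a multi-pass reaction generation stage. Specifically, UMF uses a unified latent space to bridge the distribution gap between heterogeneous motion datasets, enabling effective unified training. In the motion prior generation stage, P-Flow operates on hierarchical resolutions conditioned on different noise levels, which reduces computational overhead. In the reaction generation stage, S-Flow learns a joint probabilistic path that adaptively performs reaction transformation and context reconstruction, which alleviates error accumulation. Extensive experiments and user studies demonstrate the effectiveness of UMF as a generalist model for text-driven multi-person motion generation. Project page: https://githubhgh.github.io/umf/.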

One-sentence Summary

Authors from King's College London propose Unified Motion Flow, a generalist model for multi-person motion generation that replaces inefficient autoregressive methods with a single-pass prior and multi-pass reaction stages. By leveraging a unified latent space and hierarchical noise conditioning, it effectively handles variable agent counts while mitigating error accumulation.

Key Contributions

  • The paper introduces Unified Motion Flow (UMF), a framework that utilizes a unified latent space to bridge distribution gaps between heterogeneous motion datasets, enabling effective unified training for number-free text-to-motion generation.
  • A Pyramid Motion Flow module is presented to generate motion priors across hierarchical resolutions conditioned on noise levels, which mitigates computational overheads by handling different resolutions within a single Transformer.
  • The method incorporates a Semi-Noise Motion Flow component that learns a joint probabilistic path to adaptively perform reaction transformation and context reconstruction, thereby alleviating error accumulation in multi-pass reaction generation.

Introduction

Text-conditioned human motion synthesis is critical for applications like virtual reality and animation, yet existing methods often struggle with scalability and flexibility. Prior approaches typically focus on single or dual-agent scenarios, while unified models that handle variable numbers of actors frequently suffer from inefficiency and error accumulation during autoregressive generation. To address these challenges, the authors propose a unified number-free text-to-motion generation framework based on flow matching that eliminates the need for explicit agent counting and avoids the error propagation issues found in previous multi-person systems.

Dataset

  • The authors utilize two primary datasets for evaluating text-conditioned motion generation: InterHuman, which contains 7,779 interaction sequences, and HumanML3D, which includes 14,616 individual sequences.
  • Each sequence in both datasets is paired with three textual annotations, while the InterHuman-AS subset adds specific actor-reactor order annotations to the standard InterHuman data.
  • The paper evaluates on these datasets with established metrics: Fréchet Inception Distance (FID), R-precision, Multimodal Distance (MM Dist), Diversity, and Multimodality, which together assess fidelity and variety.
  • Model training relies on the AdamW optimizer with an initial learning rate of $10^{-4}$ and a cosine decay schedule, utilizing a mini-batch size of 128 for the VAE stage and 64 for the flow matching stages.
  • The training process spans 6K epochs for the VAE stage, followed by 2K epochs each for the P-Flow and S-Flow stages, with no specific cropping strategies or metadata construction details mentioned beyond the existing annotations.
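
As a rough illustration of the cosine decay schedule mentioned above, the sketch below computes a learning rate that starts at $10^{-4}$ and decays toward a floor. The function name `cosine_decay_lr`, the per-step granularity, the absence of warmup, and the floor `min_lr = 0` are assumptions made for illustration; the summary does not state them.

```python
import math


def cosine_decay_lr(step: int, total_steps: int,
                    base_lr: float = 1e-4, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr (step 0) down to min_lr (step == total_steps).

    base_lr = 1e-4 follows the summary; min_lr = 0 and the lack of warmup
    are assumptions, not details taken from the paper.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that steps outside [0, total_steps] stay on the curve's ends.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 250, 500, 750, 1000):
        print(s, cosine_decay_lr(s, total_steps=1000))
```

In a real training loop this role is usually filled by the scheduler that ships with the deep learning framework in use; the standalone function is only meant to make the shape of the schedule concrete.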

Method

The authors propose Unified Motion Flow (UMF), a generalist framework designed for number-free text-to-motion generation. Unlike standard methods restricted to fixed agent counts or autoregressive approaches that suffer from error accumulation, UMF leverages a heterogeneous motion prior as the adaptive start point of the reaction flow path. This design mitigates error accumulation and allows for effective unified training across heterogeneous datasets. Refer to the framework diagram for a visual comparison of naive methods, inherent prior approaches, and the proposed UMF architecture.

To bridge the distribution gap between heterogeneous motion datasets, UMF establishes a unified latent space. As shown in the figure below, the framework consists of three main stages. The first stage involves a single motion tokenizer that encodes raw motions from heterogeneous datasets into a regularized multi-token latent representation. This VAE-based encoder-decoder utilizes latent adapters to decouple internal token representation from the final latent dimension. The training loss of the VAE is defined as:

$$\mathcal{L}_{\mathrm{VAE}} = \mathcal{L}_{\mathrm{geometric}} + \mathcal{L}_{\mathrm{reconstruction}} + \lambda_{\mathrm{KL}}\,\mathcal{L}_{\mathrm{KL}}.$$

This regularized latent space stabilizes flow matching training on heterogeneous single-agent and multi-agent datasets.
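
The weighted sum in the VAE loss can be sketched in a few lines. The summary does not define the individual terms, so the code below makes several assumptions: the KL term is the standard closed form for a diagonal Gaussian posterior against a standard normal prior, the reconstruction term is a mean squared error, the geometric term is passed in as a precomputed number, and the default `lambda_kl` is a placeholder rather than the authors' setting.

```python
import math
from typing import Sequence


def gaussian_kl(mu: Sequence[float], logvar: Sequence[float]) -> float:
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error, used here as an assumed reconstruction term."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def vae_loss(recon: Sequence[float], target: Sequence[float],
             mu: Sequence[float], logvar: Sequence[float],
             geometric_loss: float, lambda_kl: float = 1e-4) -> float:
    """L_VAE = L_geometric + L_reconstruction + lambda_KL * L_KL.

    geometric_loss is supplied by the caller because its definition is not
    given in the summary; lambda_kl = 1e-4 is a hypothetical default.
    """
    return geometric_loss + mse(recon, target) + lambda_kl * gaussian_kl(mu, logvar)


if __name__ == "__main__":
    print(vae_loss([0.9, 2.1], [1.0, 2.0], mu=[0.1, -0.2], logvar=[0.0, 0.1],
                   geometric_loss=0.05))
```

A small KL weight is the usual way to keep the latent space regularized without letting the KL term dominate reconstruction quality, though the value actually used by the authors is not stated here.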

For efficient individual motion synthesis, the authors introduce Pyramid Motion Flow (P-Flow). This module operates on hierarchical resolutions conditioned on the noise level to alleviate computational overheads. P-Flow decomposes the motion prior generation into continuous hierarchical stages based on the timestep. It processes downsampled, low-resolution latents for early timesteps and switches to original-resolution latents for later stages within a single transformer model. The model is trained to regress the flow model $G_{\theta}^{P}$ on the conditional vector field with the following objective:

$$\mathcal{L}_{\mathrm{P\text{-}Flow}} = \mathbb{E}_{k,t,\hat{z}_{e_k},\hat{z}_{s_k}} \left\| G_{\theta}^{P}(\hat{z}_t; t, c) - (\hat{z}_{e_k} - \hat{z}_{s_k}) \right\|^{2}.$$
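
To make the regression target concrete, the sketch below evaluates one sample of the P-Flow objective for a single stage $k$. It assumes that $\hat{z}_t$ is a linear interpolation between the stage start point $\hat{z}_{s_k}$ and end point $\hat{z}_{e_k}$, as is common in flow matching; the summary does not spell this out, nor how the start and end points are built from noise and downsampled latents. The names `interpolate`, `stage_flow_loss`, and the callable `model` are hypothetical stand-ins, with `model` playing the role of $G_{\theta}^{P}$.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for G_theta^P: maps (z_t, t, condition) to a velocity.
FlowModel = Callable[[Sequence[float], float, object], Sequence[float]]


def interpolate(z_start: Sequence[float], z_end: Sequence[float], t: float) -> List[float]:
    """Assumed linear path: z_t = (1 - t) * z_start + t * z_end, with t in [0, 1]."""
    return [(1.0 - t) * s + t * e for s, e in zip(z_start, z_end)]


def stage_flow_loss(model: FlowModel, z_start: Sequence[float],
                    z_end: Sequence[float], t: float, cond: object) -> float:
    """One-sample estimate of || G(z_t; t, c) - (z_end - z_start) ||^2."""
    z_t = interpolate(z_start, z_end, t)
    target_velocity = [e - s for s, e in zip(z_start, z_end)]
    predicted = model(z_t, t, cond)
    return sum((p - v) ** 2 for p, v in zip(predicted, target_velocity))


if __name__ == "__main__":
    def zero_model(z_t, t, cond):
        # Untrained placeholder that always predicts zero velocity.
        return [0.0 for _ in z_t]

    print(stage_flow_loss(zero_model, [0.0, 1.0], [1.0, 3.0], t=0.5, cond="a person waves"))
```

In training, this quantity would be averaged over sampled stages, timesteps, and start and end points, which is what the expectation in the objective expresses.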

For reaction and interaction synthesis, the framework employs Semi-Noise Motion Flow (S-Flow). S-Flow learns a joint probabilistic path by balancing reaction transformation and context reconstruction. Instead of using generated motions as a static condition, S-Flow integrates them to define the context distribution. This source distribution initializes the reaction generation path, enabling the model to focus on learning the dynamic transformation between motion distributions while simultaneously reconstructing the context from noise distributions as a regularizer. The S-Flow training objective is a weighted sum of the reaction transformation loss and the context reconstruction loss:

$$\mathcal{L}_{\mathrm{S\text{-}Flow}} = \mathcal{L}_{\mathrm{trans}} + \lambda_{\mathrm{recon}}\,\mathcal{L}_{\mathrm{recon}}.$$

This joint training balances between reaction prediction and context awareness, making the model less prone to error accumulation during autoregressive generation.
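
One possible reading of the S-Flow objective is sketched below: the transformation term regresses the velocity along a path from the context latent to the reaction latent, and the reconstruction term regresses the velocity along a path from noise to the context latent. The linear paths, the use of a single shared `model` for both terms, the function names, and the default `lambda_recon` are all assumptions made for illustration; the summary gives only the weighted sum and a verbal description of the two terms.

```python
from typing import Callable, Sequence

# Hypothetical velocity predictor: maps (z_t, t, condition) to a velocity.
FlowModel = Callable[[Sequence[float], float, object], Sequence[float]]


def fm_loss(model: FlowModel, source: Sequence[float], target: Sequence[float],
            t: float, cond: object) -> float:
    """One-sample flow matching loss on an assumed linear path source -> target."""
    z_t = [(1.0 - t) * s + t * x for s, x in zip(source, target)]
    velocity = [x - s for s, x in zip(source, target)]
    predicted = model(z_t, t, cond)
    return sum((p - v) ** 2 for p, v in zip(predicted, velocity))


def s_flow_loss(model: FlowModel, context: Sequence[float], reaction: Sequence[float],
                noise: Sequence[float], t: float, cond: object,
                lambda_recon: float = 1.0) -> float:
    """L_S-Flow = L_trans + lambda_recon * L_recon (illustrative reading only)."""
    l_trans = fm_loss(model, context, reaction, t, cond)  # context -> reaction
    l_recon = fm_loss(model, noise, context, t, cond)     # noise -> context
    return l_trans + lambda_recon * l_recon


if __name__ == "__main__":
    def zero_model(z_t, t, cond):
        # Untrained placeholder that always predicts zero velocity.
        return [0.0 for _ in z_t]

    print(s_flow_loss(zero_model, context=[1.0, 0.0], reaction=[2.0, 2.0],
                      noise=[0.0, 0.0], t=0.3, cond="two people shake hands",
                      lambda_recon=0.5))
```

Under this reading, the reconstruction term acts as a regularizer: the model must still be able to recover the context from noise, so it cannot treat a possibly imperfect generated context as a fixed, trusted condition.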

Experiment

  • Quantitative evaluations on InterHuman and InterHuman-AS benchmarks demonstrate that the method substantially outperforms generalist and specialist baselines in text adherence, motion fidelity, and diversity, validating its superior ability to generate realistic multi-agent interactions.
  • Qualitative comparisons and user studies confirm the model's capacity to produce coherent physical interactions and correct agent positioning in complex scenarios, including zero-shot generation for variable group sizes where baseline methods fail.
  • Ablation studies validate that leveraging heterogeneous priors from single-agent datasets enhances multi-agent performance, while the proposed latent adapter and multi-token design are essential for effective number-free generation.
  • Efficiency analysis reveals that the Pyramid Flow structure significantly reduces computational cost and inference time compared to baselines, with asymmetric step allocation identified as the optimal strategy for balancing speed and quality.
  • Component analysis confirms that the semi-noise flow design, context adapter, and reconstruction loss are critical for preventing error accumulation and maintaining high generation fidelity.
