
Unified Number-Free Text-to-Motion Generation via Flow Matching

Guanhe Huang, Oya Celiktutan

Abstract

Generative models achieve excellent performance in motion synthesis with a fixed number of agents, but struggle to generalize when the number of agents varies. Existing methods adopt autoregressive models that generate motion recursively from limited, domain-specific data, and thus suffer from low computational efficiency and error accumulation. In this work, we propose Unified Motion Flow (UMF), composed of Pyramid Motion Flow (P-Flow) and Semi-Noise Motion Flow (S-Flow). UMF decomposes number-free motion generation into a single-pass motion prior generation stage and a multi-pass reaction generation stage. Specifically, UMF leverages a unified latent space to bridge the distribution gap between heterogeneous motion datasets, enabling effective unified training. For motion prior generation, P-Flow operates on hierarchical resolutions conditioned on different noise levels, reducing computational overhead. For reaction generation, S-Flow learns a joint probabilistic path that adaptively performs reaction transformation and context reconstruction, mitigating error accumulation. Extensive experiments and user studies demonstrate that UMF is an effective generalist model for text-driven multi-person motion generation. Project page: https://githubhgh.github.io/umf/

One-sentence Summary

Authors from King's College London propose Unified Motion Flow, a generalist model for multi-person motion generation that replaces inefficient autoregressive methods with a single-pass prior and multi-pass reaction stages. By leveraging a unified latent space and hierarchical noise conditioning, it effectively handles variable agent counts while mitigating error accumulation.

Key Contributions

  • The paper introduces Unified Motion Flow (UMF), a framework that utilizes a unified latent space to bridge distribution gaps between heterogeneous motion datasets, enabling effective unified training for number-free text-to-motion generation.
  • A Pyramid Motion Flow module is presented to generate motion priors across hierarchical resolutions conditioned on noise levels, which mitigates computational overheads by handling different resolutions within a single Transformer.
  • The method incorporates a Semi-Noise Motion Flow component that learns a joint probabilistic path to adaptively perform reaction transformation and context reconstruction, thereby alleviating error accumulation in multi-pass reaction generation.

Introduction

Text-conditioned human motion synthesis is critical for applications like virtual reality and animation, yet existing methods often struggle with scalability and flexibility. Prior approaches typically focus on single or dual-agent scenarios, while unified models that handle variable numbers of actors frequently suffer from inefficiency and error accumulation during autoregressive generation. To address these challenges, the authors propose a unified number-free text-to-motion generation framework based on flow matching that eliminates the need for explicit agent counting and avoids the error propagation issues found in previous multi-person systems.

Dataset

  • The authors utilize two primary datasets for evaluating text-conditioned motion generation: InterHuman, which contains 7,779 interaction sequences, and HumanML3D, which includes 14,616 individual sequences.
  • Each sequence in both datasets is paired with three textual annotations, while the InterHuman-AS subset adds specific actor-reactor order annotations to the standard InterHuman data.
  • These datasets serve for both training and evaluation; fidelity and variety are assessed with established metrics such as Fréchet Inception Distance (FID), R-precision, Multimodal Distance (MM Dist), Diversity, and Multimodality.
  • Model training relies on the AdamW optimizer with an initial learning rate of $10^{-4}$ and a cosine decay schedule, utilizing a mini-batch size of 128 for the VAE stage and 64 for the flow matching stages.
  • The training process spans 6K epochs for the VAE stage, followed by 2K epochs each for the P-Flow and S-Flow stages, with no specific cropping strategies or metadata construction details mentioned beyond the existing annotations.
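The bullets above name only the optimizer, learning rate, and schedule. As a minimal sketch of the stated cosine decay from $10^{-4}$, one could write (the function name and a decay floor of zero are assumptions; the paper's exact schedule endpoints are not given here):

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float = 1e-4) -> float:
    """Cosine decay from base_lr down to 0 over total_steps (illustrative sketch)."""
    progress = min(step / total_steps, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

In practice this would be driven by the framework's scheduler rather than called by hand; it is shown here only to make the decay shape concrete.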

Method

The authors propose Unified Motion Flow (UMF), a generalist framework designed for number-free text-to-motion generation. Unlike standard methods restricted to fixed agent counts or autoregressive approaches that suffer from error accumulation, UMF leverages a heterogeneous motion prior as the adaptive start point of the reaction flow path. This design mitigates error accumulation and allows for effective unified training across heterogeneous datasets. Refer to the framework diagram for a visual comparison of naive methods, inherent prior approaches, and the proposed UMF architecture.

To bridge the distribution gap between heterogeneous motion datasets, UMF establishes a unified latent space. As shown in the figure below, the framework consists of three main stages. The first stage involves a single motion tokenizer that encodes raw motions from heterogeneous datasets into a regularized multi-token latent representation. This VAE-based encoder-decoder utilizes latent adapters to decouple internal token representation from the final latent dimension. The training loss of the VAE is defined as:

$$\mathcal{L}_{\mathrm{VAE}} = \mathcal{L}_{\mathrm{geometric}} + \mathcal{L}_{\mathrm{reconstruction}} + \lambda_{\mathrm{KL}}\,\mathcal{L}_{\mathrm{KL}}.$$

This regularized latent space stabilizes flow matching training on heterogeneous single-agent and multi-agent datasets.
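To make the composite objective concrete, here is a minimal sketch of the VAE loss assembly with a closed-form diagonal-Gaussian KL term. The weight value and function names are assumptions, not taken from the paper:

```python
import math

def kl_diag_gaussian(mu, logvar):
    """KL(q || N(0, I)) for a diagonal Gaussian, summed over latent dims."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))

def vae_loss(l_geometric, l_reconstruction, mu, logvar, lambda_kl=1e-4):
    """L_VAE = L_geometric + L_reconstruction + lambda_KL * L_KL (weight assumed)."""
    return l_geometric + l_reconstruction + lambda_kl * kl_diag_gaussian(mu, logvar)
```

The KL term vanishes when the posterior matches the standard normal prior, which is what keeps the shared latent space regularized across datasets.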

For efficient individual motion synthesis, the authors introduce Pyramid Motion Flow (P-Flow). This module operates on hierarchical resolutions conditioned on the noise level to alleviate computational overheads. P-Flow decomposes the motion prior generation into continuous hierarchical stages based on the timestep. It processes downsampled, low-resolution latents for early timesteps and switches to original-resolution latents for later stages within a single transformer model. The model is trained to regress the flow model $G_{\theta}^{P}$ on the conditional vector field with the following objective:

$$\mathcal{L}_{\mathrm{P\text{-}Flow}} = \mathbb{E}_{k,t,\hat{z}_{e_k},\hat{z}_{s_k}} \left\| G_{\theta}^{P}(\hat{z}_t;\, t, c) - \left(\hat{z}_{e_k} - \hat{z}_{s_k}\right) \right\|^2 .$$
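The objective is a standard conditional flow-matching regression: the model sees a linearly interpolated latent and must predict the straight-line velocity between the stage endpoints. A single-sample sketch follows, with the per-stage resolution switching omitted and `model` a stand-in callable (both assumptions for illustration):

```python
import random

def p_flow_loss(model, z_start, z_end, cond):
    """One Monte Carlo sample of the P-Flow objective: regress the model's
    predicted velocity onto the straight-line target (z_end - z_start)."""
    t = random.random()                                           # t ~ U(0, 1)
    z_t = [(1 - t) * s + t * e for s, e in zip(z_start, z_end)]   # linear interpolant
    target = [e - s for s, e in zip(z_start, z_end)]              # conditional vector field
    pred = model(z_t, t, cond)
    return sum((p - g) ** 2 for p, g in zip(pred, target))
```

A model that outputs the exact endpoint difference drives this loss to zero, which is the fixed point the training seeks.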

For reaction and interaction synthesis, the framework employs Semi-Noise Motion Flow (S-Flow). S-Flow learns a joint probabilistic path by balancing reaction transformation and context reconstruction. Instead of using generated motions as a static condition, S-Flow integrates them to define the context distribution. This source distribution initializes the reaction generation path, enabling the model to focus on learning the dynamic transformation between motion distributions while simultaneously reconstructing the context from noise distributions as a regularizer. The S-Flow training objective is a weighted sum of the reaction transformation loss and the context reconstruction loss:

$$\mathcal{L}_{\mathrm{S\text{-}Flow}} = \mathcal{L}_{\mathrm{trans}} + \lambda_{\mathrm{recon}}\,\mathcal{L}_{\mathrm{recon}}.$$

This joint training balances between reaction prediction and context awareness, making the model less prone to error accumulation during autoregressive generation.
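The "semi-noise" coupling can be sketched as follows: the reaction channel starts from pure Gaussian noise, while the context channel starts from the previously generated motion prior rather than noise. Function names, the default weight, and the exact treatment of the context channel (here kept clean) are assumptions for illustration:

```python
import random

def semi_noise_source(context_latent, dim):
    """Joint source state for S-Flow: reaction channel drawn from noise,
    context channel initialized from the already-generated motion prior."""
    reaction0 = [random.gauss(0.0, 1.0) for _ in range(dim)]
    return reaction0, list(context_latent)

def s_flow_loss(l_trans, l_recon, lambda_recon=1.0):
    """Weighted sum of reaction transformation and context reconstruction losses."""
    return l_trans + lambda_recon * l_recon
```

Because the context term only regularizes, its weight trades off context fidelity against reaction quality; the paper's chosen value is not reproduced here.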

Experiment

  • Quantitative evaluations on InterHuman and InterHuman-AS benchmarks demonstrate that the method substantially outperforms generalist and specialist baselines in text adherence, motion fidelity, and diversity, validating its superior ability to generate realistic multi-agent interactions.
  • Qualitative comparisons and user studies confirm the model's capacity to produce coherent physical interactions and correct agent positioning in complex scenarios, including zero-shot generation for variable group sizes where baseline methods fail.
  • Ablation studies validate that leveraging heterogeneous priors from single-agent datasets enhances multi-agent performance, while the proposed latent adapter and multi-token design are essential for effective number-free generation.
  • Efficiency analysis reveals that the Pyramid Flow structure significantly reduces computational cost and inference time compared to baselines, with asymmetric step allocation identified as the optimal strategy for balancing speed and quality.
  • Component analysis confirms that the semi-noise flow design, context adapter, and reconstruction loss are critical for preventing error accumulation and maintaining high generation fidelity.
