HyperAIHyperAI

Command Palette

Search for a command to run...

Unified Number-Free Text-to-Motion Generation Via Flow Matching

Guanhe Huang Oya Celiktutan

Abstract

Generative models excel at motion synthesis for a fixed number of agents but struggle to generalize with variable agents. Based on limited, domain-specific data, existing methods employ autoregressive models to generate motion recursively, which suffer from inefficiency and error accumulation. We propose Unified Motion Flow (UMF), which consists of Pyramid Motion Flow (P-Flow) and Semi-Noise Motion Flow (S-Flow). UMF decomposes the number-free motion generation into a single-pass motion prior generation stage and multi-pass reaction generation stages. Specifically, UMF utilizes a unified latent space to bridge the distribution gap between heterogeneous motion datasets, enabling effective unified training. For motion prior generation, P-Flow operates on hierarchical resolutions conditioned on different noise levels, thereby mitigating computational overheads. For reaction generation, S-Flow learns a joint probabilistic path that adaptively performs reaction transformation and context reconstruction, alleviating error accumulation. Extensive results and user studies demonstrate UMF' s effectiveness as a generalist model for multi-person motion generation from text. Project page: https://githubhgh.github.io/umf/.

One-sentence Summary

Authors from King's College London propose Unified Motion Flow, a generalist model for multi-person motion generation that replaces inefficient autoregressive methods with a single-pass prior and multi-pass reaction stages. By leveraging a unified latent space and hierarchical noise conditioning, it effectively handles variable agent counts while mitigating error accumulation.

Key Contributions

  • The paper introduces Unified Motion Flow (UMF), a framework that utilizes a unified latent space to bridge distribution gaps between heterogeneous motion datasets, enabling effective unified training for number-free text-to-motion generation.
  • A Pyramid Motion Flow module is presented to generate motion priors across hierarchical resolutions conditioned on noise levels, which mitigates computational overheads by handling different resolutions within a single Transformer.
  • The method incorporates a Semi-Noise Motion Flow component that learns a joint probabilistic path to adaptively perform reaction transformation and context reconstruction, thereby alleviating error accumulation in multi-pass reaction generation.

Introduction

Text-conditioned human motion synthesis is critical for applications like virtual reality and animation, yet existing methods often struggle with scalability and flexibility. Prior approaches typically focus on single or dual-agent scenarios, while unified models that handle variable numbers of actors frequently suffer from inefficiency and error accumulation during autoregressive generation. To address these challenges, the authors propose a unified number-free text-to-motion generation framework based on flow matching that eliminates the need for explicit agent counting and avoids the error propagation issues found in previous multi-person systems.

Dataset

  • The authors utilize two primary datasets for evaluating text-conditioned motion generation: InterHuman, which contains 7,779 interaction sequences, and HumanML3D, which includes 14,616 individual sequences.
  • Each sequence in both datasets is paired with three textual annotations, while the InterHuman-AS subset adds specific actor-reactor order annotations to the standard InterHuman data.
  • The paper employs these datasets strictly for evaluation rather than training, using established metrics such as Frechet Inception Distance (FID), R-precision, Multimodal Distance (MM Dist), Diversity, and Multimodality scores to assess fidelity and variety.
  • Model training relies on the AdamW optimizer with an initial learning rate of 10410^{-4}104 and a cosine decay schedule, utilizing a mini-batch size of 128 for the VAE stage and 64 for the flow matching stages.
  • The training process spans 6K epochs for the VAE stage, followed by 2K epochs each for the P-Flow and S-Flow stages, with no specific cropping strategies or metadata construction details mentioned beyond the existing annotations.

Method

The authors propose Unified Motion Flow (UMF), a generalist framework designed for number-free text-to-motion generation. Unlike standard methods restricted to fixed agent counts or autoregressive approaches that suffer from error accumulation, UMF leverages a heterogeneous motion prior as the adaptive start point of the reaction flow path. This design mitigates error accumulation and allows for effective unified training across heterogeneous datasets. Refer to the framework diagram for a visual comparison of naive methods, inherent prior approaches, and the proposed UMF architecture.

To bridge the distribution gap between heterogeneous motion datasets, UMF establishes a unified latent space. As shown in the figure below, the framework consists of three main stages. The first stage involves a single motion tokenizer that encodes raw motions from heterogeneous datasets into a regularized multi-token latent representation. This VAE-based encoder-decoder utilizes latent adapters to decouple internal token representation from the final latent dimension. The training loss of the VAE is defined as:

LVAE=Lgeometric+Lreconstruction+λKLLKL.\mathcal { L } _ { \mathrm { V A E } } = \mathcal { L } _ { \mathrm { g e o m e t r i c } } + \mathcal { L } _ { \mathrm { r e c o n s t r u c t i o n } } + \lambda _ { \mathrm { K L } } \, \mathcal { L } _ { \mathrm { K L } } \, .LVAE=Lgeometric+Lreconstruction+λKLLKL.

This regularized latent space stabilizes flow matching training on heterogeneous single-agent and multi-agent datasets.

For efficient individual motion synthesis, the authors introduce Pyramid Motion Flow (P-Flow). This module operates on hierarchical resolutions conditioned on the noise level to alleviate computational overheads. P-Flow decomposes the motion prior generation into continuous hierarchical stages based on the timestep. It processes downsampled, low-resolution latents for early timesteps and switches to original-resolution latents for later stages within a single transformer model. The model is trained to regress the flow model GθPG_{\theta}^{P}GθP on the conditional vector field with the following objective:

LPFlow=Ek,t,z^ek,z^skGθP(z^t;t,c)(z^ekz^sk)2.\mathcal { L } _ { \mathrm { P - F l o w } } = \mathbb { E } _ { k , t , \hat { z } _ { e _ { k } } , \hat { z } _ { s _ { k } } } \left\| G _ { \theta } ^ { P } ( \hat { z } _ { t } ; t , c ) - ( \hat { z } _ { e _ { k } } - \hat { z } _ { s _ { k } } ) \right\| ^ { 2 } .LPFlow=Ek,t,z^ek,z^skGθP(z^t;t,c)(z^ekz^sk)2.

For reaction and interaction synthesis, the framework employs Semi-Noise Motion Flow (S-Flow). S-Flow learns a joint probabilistic path by balancing reaction transformation and context reconstruction. Instead of using generated motions as a static condition, S-Flow integrates them to define the context distribution. This source distribution initializes the reaction generation path, enabling the model to focus on learning the dynamic transformation between motion distributions while simultaneously reconstructing the context from noise distributions as a regularizer. The S-Flow training objective is a weighted sum of the reaction transformation loss and the context reconstruction loss:

LSFlow=Ltrans+λreconLrecon.\mathcal { L } _ { \mathrm { S - F l o w } } = \mathcal { L } _ { \mathrm { t r a n s } } + \lambda _ { \mathrm { r e c o n } } \mathcal { L } _ { \mathrm { r e c o n } } .LSFlow=Ltrans+λreconLrecon.

This joint training balances between reaction prediction and context awareness, making the model less prone to error accumulation during autoregressive generation.

Experiment

  • Quantitative evaluations on InterHuman and InterHuman-AS benchmarks demonstrate that the method substantially outperforms generalist and specialist baselines in text adherence, motion fidelity, and diversity, validating its superior ability to generate realistic multi-agent interactions.
  • Qualitative comparisons and user studies confirm the model's capacity to produce coherent physical interactions and correct agent positioning in complex scenarios, including zero-shot generation for variable group sizes where baseline methods fail.
  • Ablation studies validate that leveraging heterogeneous priors from single-agent datasets enhances multi-agent performance, while the proposed latent adapter and multi-token design are essential for effective number-free generation.
  • Efficiency analysis reveals that the Pyramid Flow structure significantly reduces computational cost and inference time compared to baselines, with asymmetric step allocation identified as the optimal strategy for balancing speed and quality.
  • Component analysis confirms that the semi-noise flow design, context adapter, and reconstruction loss are critical for preventing error accumulation and maintaining high generation fidelity.

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding
Ready-to-use GPUs
Best Pricing

HyperAI Newsletters

Subscribe to our latest updates
We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning
Powered by MailChimp