
Unified Number-Free Text-to-Motion Generation via Flow Matching

Guanhe Huang Oya Celiktutan

Abstract

Generative models excel at synthesizing motions for a fixed number of agents, but struggle to generalize to variable agent counts. Relying on limited, domain-specific data, existing methods resort to autoregressive models to generate motions iteratively, leading to inefficiency and error accumulation. In this work, we propose Unified Motion Flow (UMF), which consists of Pyramid Motion Flow (P-Flow) and Semi-Noise Motion Flow (S-Flow). UMF decomposes number-free motion generation into a single motion prior generation stage and two or more iterative reaction generation stages. Specifically, UMF leverages a unified latent space to bridge the distribution gap between heterogeneous motion datasets, enabling effective unified training. For motion prior generation, P-Flow operates on hierarchical resolutions conditioned on different noise levels, reducing computational overhead. For reaction generation, S-Flow learns a joint probabilistic path that adaptively transforms reactions and reconstructs context, mitigating error accumulation. Extensive results and user studies demonstrate the effectiveness of UMF as a generalist model for text-driven multi-person motion generation. Project page: https://githubhgh.github.io/umf/.

One-sentence Summary

Authors from King's College London propose Unified Motion Flow, a generalist model for multi-person motion generation that replaces inefficient autoregressive methods with a single-pass prior and multi-pass reaction stages. By leveraging a unified latent space and hierarchical noise conditioning, it effectively handles variable agent counts while mitigating error accumulation.

Key Contributions

  • The paper introduces Unified Motion Flow (UMF), a framework that utilizes a unified latent space to bridge distribution gaps between heterogeneous motion datasets, enabling effective unified training for number-free text-to-motion generation.
  • A Pyramid Motion Flow module is presented to generate motion priors across hierarchical resolutions conditioned on noise levels, which mitigates computational overheads by handling different resolutions within a single Transformer.
  • The method incorporates a Semi-Noise Motion Flow component that learns a joint probabilistic path to adaptively perform reaction transformation and context reconstruction, thereby alleviating error accumulation in multi-pass reaction generation.

Introduction

Text-conditioned human motion synthesis is critical for applications like virtual reality and animation, yet existing methods often struggle with scalability and flexibility. Prior approaches typically focus on single or dual-agent scenarios, while unified models that handle variable numbers of actors frequently suffer from inefficiency and error accumulation during autoregressive generation. To address these challenges, the authors propose a unified number-free text-to-motion generation framework based on flow matching that eliminates the need for explicit agent counting and avoids the error propagation issues found in previous multi-person systems.

Dataset

  • The authors utilize two primary datasets for evaluating text-conditioned motion generation: InterHuman, which contains 7,779 interaction sequences, and HumanML3D, which includes 14,616 individual sequences.
  • Each sequence in both datasets is paired with three textual annotations, while the InterHuman-AS subset adds specific actor-reactor order annotations to the standard InterHuman data.
  • The paper assesses fidelity and variety on these datasets using established metrics, including Frechet Inception Distance (FID), R-Precision, Multimodal Distance (MM Dist), Diversity, and MultiModality.
  • Model training relies on the AdamW optimizer with an initial learning rate of $10^{-4}$ and a cosine decay schedule, using a mini-batch size of 128 for the VAE stage and 64 for the flow matching stages.
  • The training process spans 6K epochs for the VAE stage, followed by 2K epochs each for the P-Flow and S-Flow stages, with no specific cropping strategies or metadata construction details mentioned beyond the existing annotations.
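The learning-rate schedule described above can be sketched as a plain cosine decay. This is a minimal illustration, not the authors' code; the function name and the `min_lr` floor are assumptions, while the `1e-4` initial rate comes from the training setup:

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float = 1e-4, min_lr: float = 0.0) -> float:
    # Cosine decay from base_lr (10^-4, per the reported setup) down to min_lr.
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

At step 0 this returns the full initial rate; at the final step it reaches `min_lr`.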

Method

The authors propose Unified Motion Flow (UMF), a generalist framework designed for number-free text-to-motion generation. Unlike standard methods restricted to fixed agent counts or autoregressive approaches that suffer from error accumulation, UMF leverages a heterogeneous motion prior as the adaptive start point of the reaction flow path. This design mitigates error accumulation and allows for effective unified training across heterogeneous datasets. Refer to the framework diagram for a visual comparison of naive methods, inherent prior approaches, and the proposed UMF architecture.

To bridge the distribution gap between heterogeneous motion datasets, UMF establishes a unified latent space. As shown in the figure below, the framework consists of three main stages. The first stage involves a single motion tokenizer that encodes raw motions from heterogeneous datasets into a regularized multi-token latent representation. This VAE-based encoder-decoder utilizes latent adapters to decouple internal token representation from the final latent dimension. The training loss of the VAE is defined as:

$$\mathcal{L}_{\mathrm{VAE}} = \mathcal{L}_{\mathrm{geometric}} + \mathcal{L}_{\mathrm{reconstruction}} + \lambda_{\mathrm{KL}}\,\mathcal{L}_{\mathrm{KL}}.$$

This regularized latent space stabilizes flow matching training on heterogeneous single-agent and multi-agent datasets.
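As a rough sketch of how the three terms combine, here is a minimal NumPy version. The MSE reconstruction term, the diagonal-Gaussian KL, and the value of $\lambda_{\mathrm{KL}}$ are all assumptions; the paper's geometric loss on decoded motion is left as a placeholder argument:

```python
import numpy as np

def kl_divergence(mu, logvar):
    # KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian, averaged over the batch
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1).mean()

def vae_loss(x, x_hat, mu, logvar, geometric=0.0, lambda_kl=1e-4):
    # L_VAE = L_geometric + L_reconstruction + lambda_KL * L_KL
    reconstruction = np.mean((x - x_hat) ** 2)  # MSE stand-in for the reconstruction term
    return geometric + reconstruction + lambda_kl * kl_divergence(mu, logvar)
```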

For efficient individual motion synthesis, the authors introduce Pyramid Motion Flow (P-Flow). This module operates on hierarchical resolutions conditioned on the noise level to alleviate computational overhead. P-Flow decomposes motion prior generation into continuous hierarchical stages based on the timestep: it processes downsampled, low-resolution latents at early timesteps and switches to original-resolution latents for later stages within a single Transformer. The model is trained to regress the flow model $G_{\theta}^{P}$ onto the conditional vector field with the following objective:

$$\mathcal{L}_{\mathrm{P\text{-}Flow}} = \mathbb{E}_{k,t,\hat{z}_{e_k},\hat{z}_{s_k}} \left\| G_{\theta}^{P}(\hat{z}_t;\, t, c) - (\hat{z}_{e_k} - \hat{z}_{s_k}) \right\|^{2}.$$
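To make the objective concrete, the following sketch implements one flow-matching training step under the usual linear-interpolation path assumption. The toy "oracle" model, shapes, and the choice of `t` are illustrative, not the paper's network:

```python
import numpy as np

def p_flow_loss(model, z_start, z_end, t, cond):
    # Interpolate along the assumed linear path between stage-k source and target
    # latents, then regress the model output onto the conditional vector field.
    z_t = (1.0 - t) * z_start + t * z_end
    target = z_end - z_start
    return np.mean((model(z_t, t, cond) - target) ** 2)

rng = np.random.default_rng(0)
z_s = rng.normal(size=(4, 8))  # stage-k source latents (e.g. noise or a low-res latent)
z_e = rng.normal(size=(4, 8))  # stage-k target latents
oracle = lambda z_t, t, c: z_e - z_s  # toy model that already outputs the true velocity
loss = p_flow_loss(oracle, z_s, z_e, t=0.3, cond=None)  # → 0.0
```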

For reaction and interaction synthesis, the framework employs Semi-Noise Motion Flow (S-Flow). S-Flow learns a joint probabilistic path by balancing reaction transformation and context reconstruction. Instead of using generated motions as a static condition, S-Flow integrates them to define the context distribution. This source distribution initializes the reaction generation path, enabling the model to focus on learning the dynamic transformation between motion distributions while simultaneously reconstructing the context from noise distributions as a regularizer. The S-Flow training objective is a weighted sum of the reaction transformation loss and the context reconstruction loss:

$$\mathcal{L}_{\mathrm{S\text{-}Flow}} = \mathcal{L}_{\mathrm{trans}} + \lambda_{\mathrm{recon}}\,\mathcal{L}_{\mathrm{recon}}.$$

This joint training balances reaction prediction against context awareness, making the model less prone to error accumulation during multi-pass reaction generation.
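The weighted sum above can be sketched as follows; this assumes both terms are simple velocity-regression MSEs, and the value of $\lambda_{\mathrm{recon}}$ is an assumption, not taken from the paper:

```python
import numpy as np

def s_flow_loss(pred_react_v, target_react_v, pred_ctx_v, target_ctx_v, lambda_recon=0.5):
    # L_S-Flow = L_trans + lambda_recon * L_recon
    l_trans = np.mean((pred_react_v - target_react_v) ** 2)  # reaction transformation
    l_recon = np.mean((pred_ctx_v - target_ctx_v) ** 2)      # context reconstruction (regularizer)
    return l_trans + lambda_recon * l_recon
```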

Experiment

  • Quantitative evaluations on InterHuman and InterHuman-AS benchmarks demonstrate that the method substantially outperforms generalist and specialist baselines in text adherence, motion fidelity, and diversity, validating its superior ability to generate realistic multi-agent interactions.
  • Qualitative comparisons and user studies confirm the model's capacity to produce coherent physical interactions and correct agent positioning in complex scenarios, including zero-shot generation for variable group sizes where baseline methods fail.
  • Ablation studies validate that leveraging heterogeneous priors from single-agent datasets enhances multi-agent performance, while the proposed latent adapter and multi-token design are essential for effective number-free generation.
  • Efficiency analysis reveals that the Pyramid Flow structure significantly reduces computational cost and inference time compared to baselines, with asymmetric step allocation identified as the optimal strategy for balancing speed and quality.
  • Component analysis confirms that the semi-noise flow design, context adapter, and reconstruction loss are critical for preventing error accumulation and maintaining high generation fidelity.
