
SARAH: Spatially Aware Real-time Agent Humans

Evonne Ng, Siwei Zhang, Zhang Chen, Michael Zollhoefer, Alexander Richard

Abstract

As embodied agents become central to virtual reality (VR), telepresence, and digital-human applications, their motion must go beyond speech-synchronized gestures: these agents must turn toward the user, react to the user’s movements, and maintain natural gaze. Current methods lack this spatial awareness. We close this gap with the first real-time, fully causal method for spatially aware conversational motion, deployable on a VR headset in streaming. Given the user’s position and dyadic audio, our approach generates full-body motion, synchronizing gestures with speech while orienting the agent according to the user’s position. Our architecture combines a causal transformer-based VAE with interleaved latent tokens for streaming inference, and a flow matching model conditioned on the user’s trajectory and audio. To accommodate varied gaze preferences, we introduce a gaze-scoring mechanism with classifier-free guidance that decouples learning from control: the model naturally captures spatial alignment from the data, while users can adjust eye-contact intensity at inference time. On the Embody 3D dataset, our method achieves state-of-the-art motion quality at over 300 FPS, three times faster than non-causal baselines, while capturing the subtle spatial dynamics of natural conversation. We validate our approach on a live VR system, making spatially aware conversational agents available for real-time deployment. For more details, please see https://evonneng.github.io/sarah/.

One-sentence Summary

Meta Reality Labs researchers propose a real-time, causal method for spatially aware conversational motion using a transformer VAE and flow matching, enabling VR agents to dynamically align gestures and gaze with users via audio and position—achieving 300+ FPS and natural interaction in live deployment.

Key Contributions

  • We introduce the first real-time, fully causal method for generating spatially aware conversational motion in virtual agents, enabling them to dynamically orient toward users and align gestures with speech using only past and present user position and dyadic audio.
  • Our architecture combines a causal transformer-based VAE with interleaved latent tokens for streaming inference and a flow matching model conditioned on user trajectory and audio, plus a classifier-free gaze guidance mechanism that decouples learned spatial behavior from user-adjustable eye contact intensity.
  • Evaluated on the Embody 3D dataset, our method achieves state-of-the-art motion quality at over 300 FPS—three times faster than non-causal baselines—and successfully deploys on a live VR system, demonstrating real-time spatial reactivity without future frame access.

Introduction

The authors leverage real-time, causal generative modeling to enable virtual agents in VR and telepresence systems to dynamically orient toward users during conversation—turning, gazing, and gesturing naturally in response to both speech and spatial movement. Prior methods either ignore spatial context, assume static participants, or rely on non-causal models that can’t stream in real time, limiting their use in interactive systems. Their main contribution is SARAH: a streaming architecture combining a causal transformer-based VAE with interleaved latent tokens and a flow matching model conditioned on user trajectory and dyadic audio, plus a classifier-free gaze guidance mechanism that lets users adjust eye contact intensity at inference—achieving state-of-the-art motion quality at over 300 FPS.

Dataset

  • The authors use the dyadic conversation subset from the Embody 3D dataset [McLean et al. 2025], which contains approximately 50 hours of multiview dome-captured interactions covering casual, work, and social conversations.
  • Participants represent diverse age groups, genders, and ethnicities, and the dataset includes both audio and 3D motion annotations.
  • Unlike prior monadic datasets (e.g., Speech2Gesture, BEAT) that capture single speakers without spatial context, or dyadic datasets (e.g., Audio2Photoreal, Panoptic Studio) where participants remain stationary and face each other, Embody 3D records natural, dynamic interactions where individuals walk freely and shift positions.
  • The dataset is used to train models on 3D spatial proxemics in conversation, leveraging its unique capture of movement and spatial relationships between speakers.
  • No cropping or metadata construction details are specified; processing focuses on utilizing raw audio and 3D motion annotations directly from the source.

Method

The authors leverage a real-time, autoregressive motion synthesis pipeline that conditions on dyadic conversational audio and user spatial position to generate spatially and conversationally aware 3D motion for an AI agent. The system is built around a causal transformer-based variational autoencoder (VAE) and a flow matching generator, both designed for streaming inference with strict temporal causality.

The overall framework begins with input conditioning: the user’s floor-projected head position $\mathbf{p}_y \in \mathbb{R}^{T \times 2}$, and audio features $\mathbf{a}, \mathbf{b} \in \mathbb{R}^{T \times D_a}$ extracted via HuBERT from the agent and user speech streams. These are fed into a generative model $\mathcal{G}$ that outputs the agent’s motion sequence $\mathbf{x} \in \mathbb{R}^{T \times D_x}$. Refer to the framework diagram for a visual overview of the end-to-end pipeline, including the VAE encoder-decoder structure and the flow matching generator.

To enable efficient and stable training, the authors adopt a fully Euclidean motion representation. Instead of traditional joint-angle parameterizations, each joint $j$ is encoded as a 3D icosahedron: the centroid of its 12 vertices yields the global position $\Pi_j$, and the global orientation $\Omega_j$ is recovered via SVD against a reference icosahedron. This representation avoids error propagation from local rotations and improves convergence. The full pose is thus encoded as $x_t \in \mathbb{R}^{J \times 12 \times 3}$, where $J$ is the number of joints, and is normalized relative to the first frame to prevent drift. As shown in the figure, this geometric encoding provides a robust and differentiable motion representation.
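A minimal numpy sketch of decoding one joint from this representation, assuming a canonical unit-circumradius icosahedron as the reference shape (the paper does not specify its reference vertices). The orientation is recovered with the standard Kabsch/SVD alignment:

```python
import numpy as np

def reference_icosahedron():
    # Canonical 12-vertex icosahedron centered at the origin, unit circumradius.
    phi = (1 + np.sqrt(5)) / 2
    v = np.array([[-1,  phi, 0], [ 1,  phi, 0], [-1, -phi, 0], [ 1, -phi, 0],
                  [ 0, -1,  phi], [ 0, 1,  phi], [ 0, -1, -phi], [ 0, 1, -phi],
                  [ phi, 0, -1], [ phi, 0, 1], [-phi, 0, -1], [-phi, 0, 1]],
                 dtype=np.float64)
    return v / np.linalg.norm(v[0])

def decode_joint(verts):
    """verts: (12, 3) predicted icosahedron vertices for one joint.
    Returns (position, rotation): the centroid (global position Pi_j) and the
    rotation best aligning the reference icosahedron to the prediction,
    recovered via SVD (global orientation Omega_j)."""
    ref = reference_icosahedron()
    pos = verts.mean(axis=0)                  # centroid of the 12 vertices
    H = ref.T @ (verts - pos)                 # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T   # best-fit rotation (Kabsch)
    return pos, R
```

Because the 12 reference vertices sum to zero and are rotationally isotropic, a rigidly transformed icosahedron decodes back to its exact translation and rotation.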

The core of the architecture is a causal transformer-based VAE. Unlike standard VAEs that place global latents at the sequence start, this model interleaves latent tokens $\mu_k, \sigma_k \in \mathbb{R}^{D_z}$ at a fixed temporal stride $s$, enabling causal attention. The encoder $\mathcal{E}$ processes input in blocks: $(\mathbf{x}_{1:s}, \mu_1, \sigma_1, \mathbf{x}_{s+1:2s}, \mu_2, \sigma_2, \ldots)$, where each $\mu_k/\sigma_k$ token attends only to preceding frames and earlier latents. The decoder $\mathcal{D}$ mirrors this pattern. The model is trained with a VAE loss combining reconstruction and KL divergence:

$$\mathcal{L}_{\mathrm{VAE}} = \| \mathbf{x} - \hat{\mathbf{x}} \|_2^2 + \beta \sum_{k=1}^{K} \mathrm{KL}\big( q_{\phi}(z_k \mid \mathbf{x}_{1:ks}) \,\|\, \mathcal{N}(\mathbf{0}, \mathbf{I}) \big),$$

where $K = T/s$, $\hat{\mathbf{x}}$ is the reconstruction, and $z_k \sim \mathcal{N}(\mu_k, \sigma_k^2)$ is the sampled latent for block $k$. After training, the encoder is used to extract the latent sequence $\mathbf{z} = (z_1, \ldots, z_K) \in \mathbb{R}^{K \times D_z}$ for generation.
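The interleaving and the loss above can be sketched in a few lines of numpy. The token labels and the small-network details are illustrative only; the point is that plain lower-triangular (causal) attention over this token order gives each latent pair access to its own block and everything before it:

```python
import numpy as np

def interleave_order(T, s):
    """Token order for the causal VAE: s motion frames, then the (mu, sigma)
    latent pair for that block, repeated K = T/s times. With a standard
    lower-triangular attention mask over this order, mu_k / sigma_k attend
    only to frames x_{1:ks} and earlier latents."""
    order = []
    for k in range(T // s):
        order += [f"x{t}" for t in range(k * s, (k + 1) * s)]
        order += [f"mu{k}", f"sigma{k}"]
    return order

def vae_loss(x, x_hat, mu, sigma, beta=1e-3):
    """L_VAE = ||x - x_hat||_2^2 + beta * sum_k KL(N(mu_k, sigma_k^2) || N(0, I)),
    using the closed-form KL for diagonal Gaussians."""
    recon = np.sum((x - x_hat) ** 2)
    kl = 0.5 * np.sum(mu ** 2 + sigma ** 2 - 2.0 * np.log(sigma) - 1.0)
    return recon + beta * kl
```

With a perfect reconstruction and standard-normal posteriors (mu = 0, sigma = 1), both terms vanish and the loss is zero.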

The motion generator is a transformer-based flow matching model that operates on the latent space. It transports noise $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ to data by predicting a velocity field $\mathbf{v}_{\theta}(\mathbf{z}^{\tau}, \tau, \mathbf{c})$, where $\tau \in [0,1]$ is flow time and $\mathbf{c} = [\mathbf{p}_y; \mathbf{a}; \mathbf{b}]$ is the conditioning. The interpolated latent is formed as:

$$\mathbf{z}^{\tau} = \tau \mathbf{z} + (1 - \tau)\, \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).$$

Training uses $x_1$-prediction with loss:

$$\mathcal{L}_{\mathrm{flow}} = \mathbb{E}_{\tau, \boldsymbol{\epsilon}, \mathbf{z}} \left[ \| \mathcal{G}(\mathbf{z}^{\tau}, \tau, \mathbf{c}) - \mathbf{z} \|_2^2 \right],$$

where $\tau \sim \mathcal{U}[0,1]$. Classifier-free guidance is applied by dropping each modality independently with 5% probability. For real-time streaming, causal attention masking is enforced, and temporal consistency is maintained via imputation: at each step, previously predicted latents are imputed into the noisy sequence before denoising proceeds.
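One training example under this objective can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation: `model` stands in for the transformer generator $\mathcal{G}$, and the conditioning is a dictionary of modality arrays whose entries are independently dropped (set to `None`) for classifier-free guidance:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_step(z, model, cond, p_drop=0.05):
    """One x1-prediction flow-matching training example.
    z: (K, Dz) clean latents from the VAE encoder.
    model(z_tau, tau, cond) -> (K, Dz) prediction of the clean latents.
    Returns the squared-error loss against z."""
    tau = rng.uniform()                             # flow time ~ U[0, 1]
    eps = rng.standard_normal(z.shape)              # noise sample
    z_tau = tau * z + (1.0 - tau) * eps             # linear interpolation path
    # Classifier-free guidance: drop each modality independently (5% default).
    cond = {k: (None if rng.uniform() < p_drop else v) for k, v in cond.items()}
    pred = model(z_tau, tau, cond)                  # x1-prediction
    return float(np.mean((pred - z) ** 2))
```

A model that already returns the clean latents incurs zero loss, matching the objective above.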

To enable controllable gaze behavior, the authors introduce a tunable gaze guidance mechanism. The gaze score $g$ is computed as the dot product between the agent’s facing direction $d_x$ and the direction toward the user $d_y$:

$$d_x = \frac{h_f - h_b}{\| h_f - h_b \|}, \quad d_y = \frac{p_y - h_b}{\| p_y - h_b \|}, \quad g = d_x \cdot d_y.$$

This score ranges from -1 (facing away) to 1 (direct eye contact). During training, $\mathbf{g} \in \mathbb{R}^{T \times 1}$ is concatenated with the conditioning $\mathbf{c}$, and classifier-free guidance drops $\mathbf{g}$ with 5% probability. At inference, a target gaze score can be specified to steer eye contact intensity while preserving natural variation. As illustrated in the figure, this allows fine-grained control over non-verbal engagement cues.
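The gaze score is straightforward to compute per frame. In this sketch, $h_f$ and $h_b$ are taken as 2D floor-projected front-of-head and back-of-head points (the precise head keypoints are an assumption; the paper only names the symbols):

```python
import numpy as np

def gaze_score(h_f, h_b, p_y):
    """g = d_x . d_y in [-1, 1].
    h_f, h_b: front- and back-of-head points defining the agent's facing
    direction d_x; p_y: user position defining the direction toward the
    user d_y. All inputs are 2D floor-projected points."""
    d_x = (h_f - h_b) / np.linalg.norm(h_f - h_b)  # agent facing direction
    d_y = (p_y - h_b) / np.linalg.norm(p_y - h_b)  # direction toward user
    return float(d_x @ d_y)
```

A user directly ahead of the agent scores 1 (direct eye contact), directly behind scores -1, and perpendicular scores 0.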

For deployment, motion is generated in chunks of $s = 4$ frames, with the last 2 tokens retained for temporal continuity. The system uses a midpoint solver with 4 iterations per chunk, achieving 60 FPS for real-time streaming. Photorealistic rendering is handled via a separate learning-based method that synthesizes geometry and texture from joint parameters and facial expressions derived from speech.
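The midpoint solver for one chunk can be sketched as below. Since the generator is trained with $x_1$-prediction, a velocity must be recovered from the prediction; for the linear interpolation path above, the standard conversion is $v = (\hat{x}_1 - z^{\tau}) / (1 - \tau)$ (this conversion is our assumption; the paper does not spell it out):

```python
import numpy as np

def midpoint_sample(model, z0, cond, n_steps=4):
    """Integrate the flow from noise z0 (tau = 0) to latents (tau = 1) with
    a fixed-step midpoint solver, matching the 4 iterations per chunk.
    model(z_tau, tau, cond) returns the x1-prediction of the clean latents."""
    z, tau = z0, 0.0
    dt = 1.0 / n_steps
    for _ in range(n_steps):
        # Velocity recovered from the x1-prediction on the linear path.
        v = (model(z, tau, cond) - z) / (1.0 - tau)
        z_mid = z + 0.5 * dt * v                     # half step
        tau_mid = tau + 0.5 * dt
        v_mid = (model(z_mid, tau_mid, cond) - z_mid) / (1.0 - tau_mid)
        z = z + dt * v_mid                           # full step with midpoint slope
        tau += dt
    return z
```

For a model whose $x_1$-prediction is a fixed target, the induced path is exactly linear and the solver reaches the target after its 4 steps.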

Experiment

  • The model generates spatially-aware conversational motion that is competitive with state-of-the-art methods, including non-causal and non-real-time approaches, while maintaining real-time, causal inference.
  • It excels in gaze alignment, significantly outperforming retrieval and generative baselines by orienting the agent toward the user in natural conversational contexts.
  • Unlike retrieval methods, it generates novel motion that jointly optimizes realism, expressiveness, and spatial awareness, rather than being limited to dataset examples.
  • Compared to diffusion-based baselines (MDM, A2P), it achieves better motion dynamics, lower foot sliding, and higher expressiveness, while running faster and without requiring future context.
  • Against audio-only models (SHOW), it demonstrates superior full-body coordination and spatial awareness by explicitly conditioning on user position.
  • Ablation studies confirm the value of the Euclidean motion representation (vs. joint angles) for precise positioning and the VAE’s role in capturing motion distribution without compromising physical plausibility.
  • Gaze direction is controllable at inference time via classifier-free guidance, enabling adjustable social engagement—from avoiding eye contact to fully facing the user—while preserving motion quality.
  • Qualitative video results show natural transitions between speaking/listening modes, context-appropriate emotional gestures, and seamless integration with real-time VR applications using off-the-shelf LLMs and TTS.

The authors demonstrate that their causal, real-time model generates spatially aware conversational motion with realism and diversity competitive with non-causal, non-real-time baselines, while outperforming them in gaze alignment and physical plausibility. Explicit gaze control at inference allows flexible adjustment of the agent’s orientation toward the user: moderate guidance improves both alignment and motion quality, while strict alignment trades off some natural variation. Unlike retrieval methods, the model generates novel motion that jointly satisfies multiple criteria rather than merely sampling from existing data. Ablations confirm that the Euclidean motion representation and latent compression via the causal VAE are critical for spatial accuracy, expressive dynamics, and physical plausibility without compromising speed.

