
Recurrent-Depth VLA: Implicit Test-Time Compute Scaling of Vision-Language-Action Models via Latent Iterative Reasoning

Yalcin Tur Jalal Naghiyev Haoquan Fang Wei-Chuan Tsai Jiafei Duan Dieter Fox Ranjay Krishna

Abstract

Current vision-language-action (VLA) models rely on a fixed computational depth, spending the same amount of compute on simple adjustments as on complex multi-step manipulations. While Chain-of-Thought (CoT) reasoning enables variable computation, it incurs linear memory growth and is poorly suited to continuous action spaces. We introduce RD-VLA (Recurrent-Depth VLA), an architecture that achieves computational adaptivity through latent iterative refinement rather than explicit token generation. RD-VLA uses a recurrent, weight-tied action head, enabling arbitrary inference depth with a constant memory footprint. The model is trained with truncated backpropagation through time (TBPTT) to efficiently supervise the refinement process. At inference, RD-VLA dynamically allocates compute using an adaptive stopping criterion based on latent convergence. Experiments on demanding manipulation tasks show that recurrent depth is crucial: tasks that fail entirely (0% success) under single-iteration inference exceed 90% success with four iterations, while simpler tasks plateau quickly. RD-VLA thus offers a scalable path to inference-time compute in robotics, replacing token-based reasoning with latent reasoning, which yields constant memory consumption and up to an 80× inference speedup over prior reasoning-based VLA models. Project page: https://rd-vla.github.io/

One-sentence Summary

Researchers from NVIDIA and Stanford introduce RD-VLA, a vision-language-action model that dynamically scales computation via latent-space iterative refinement rather than token-based reasoning, enabling 80× faster inference and 90%+ success on complex robotic tasks while maintaining constant memory usage.

Key Contributions

  • RD-VLA introduces a recurrent, weight-tied action head that enables adaptive test-time compute via latent iterative refinement, eliminating fixed-depth constraints and supporting arbitrary inference depth with constant memory usage.
  • The model is trained with truncated backpropagation through time and dynamically stops inference based on latent convergence, allowing it to allocate more compute to complex tasks while rapidly saturating on simpler ones—boosting success rates from 0% to over 90% on challenging manipulation tasks.
  • Evaluated on LIBERO and CALVIN benchmarks, RD-VLA achieves state-of-the-art performance with 93.0% success on LIBERO and 45.3% task-5 success on CALVIN, while delivering up to 80× faster inference than prior reasoning-based VLA models.

Introduction

The authors leverage a recurrent, weight-tied architecture to enable adaptive test-time compute in Vision-Language-Action (VLA) models, addressing a key limitation of prior systems that expend fixed computational resources regardless of task complexity. Existing reasoning-based VLAs rely on explicit token generation—like Chain-of-Thought—which scales memory linearly and forces reasoning through lossy, discretized output spaces, making them inefficient for continuous robotic control. RD-VLA instead performs iterative refinement entirely within a fixed-dimensional latent space, using truncated backpropagation through time for training and an adaptive stopping criterion at inference to dynamically allocate compute. This design enables up to 80x faster inference than token-based reasoning models while maintaining constant memory usage, and achieves state-of-the-art performance on benchmarks like LIBERO and CALVIN, including real-world transfer to tasks like towel folding and bread toasting.

Method

The authors leverage a novel architecture called Recurrent-Depth Vision-Language-Action (RD-VLA) to decouple computational depth from the fixed structural constraints of pretrained vision-language backbones. Rather than relying on fixed-depth MLP heads or output-space iterative methods such as diffusion, RD-VLA shifts the computational burden into a weight-tied recurrent transformer core that operates entirely within a continuous latent manifold. This enables dynamic test-time compute scaling by unrolling the recurrent block to an arbitrary depth $r$, allowing the model to allocate more computation to complex tasks and less to simpler ones, as illustrated in the performance curve on LIBERO-10.

The RD-VLA action head is designed to be backbone-agnostic and is instantiated here using a Qwen2.5-0.5B-based VLM, augmented with 64 learned latent tokens that attend to the multi-modal context during the LLM's forward pass. After VLM execution, hidden states are partitioned into task/vision representations $h_{vis} \in \mathbb{R}^{512 \times D}$ and latent-specific representations $h_{lat} \in \mathbb{R}^{64 \times D}$. These are concatenated with proprioception $p$ to form a static conditioning manifold $[h_{vis+lat}^{(24)}; p]$, which grounds the recurrent reasoning process.
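To make the bookkeeping concrete, here is a short PyTorch sketch of the partition and concatenation. Shapes follow the text; the hidden size (896 for Qwen2.5-0.5B) and the pre-projected proprioception token are assumptions, not the paper's code.

```python
import torch

B, D = 1, 896                       # hidden size of Qwen2.5-0.5B (assumed here)
h = torch.randn(B, 512 + 64, D)     # stand-in for the final-layer hidden states

h_vis, h_lat = h[:, :512], h[:, 512:]       # task/vision vs. latent-token split
p = torch.randn(B, 1, D)                    # proprioception, pre-projected to D
cond = torch.cat([h_vis, h_lat, p], dim=1)  # static conditioning manifold [h_vis+lat; p]
```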

The architecture follows a functional triplet: Prelude, Recurrent Core, and Coda. The Prelude $P_{\phi}$ consumes $K=8$ learned queries, which first self-attend bidirectionally and then cross-attend to the VLM's middle-layer features $h_{vis+lat}^{(12)}$ to produce a grounded latent foundation:

$$S_{pre} = P_{\phi}(\mathrm{Queries},\, h_{vis+lat}^{(12)}) \in \mathbb{R}^{K \times D}$$
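Illustratively, a minimal PyTorch sketch of such a prelude module; the head count, initialization scale, and residual wiring are assumptions:

```python
import torch
import torch.nn as nn

class Prelude(nn.Module):
    """P_phi: K learned queries self-attend, then cross-attend to the
    VLM's middle-layer features h^(12). Head count is an assumption."""
    def __init__(self, dim: int, k: int = 8, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(k, dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, h_mid: torch.Tensor) -> torch.Tensor:
        # h_mid: (B, 576, D) middle-layer features h_vis+lat^(12)
        q = self.queries.unsqueeze(0).expand(h_mid.size(0), -1, -1)
        q = q + self.self_attn(q, q, q, need_weights=False)[0]           # bidirectional
        q = q + self.cross_attn(q, h_mid, h_mid, need_weights=False)[0]  # grounding
        return q                                                         # S_pre: (B, K, D)
```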

In parallel, a latent scratchpad $S_0$ is initialized from a high-entropy truncated normal distribution:

$$S_0 \sim \operatorname{TruncNormal}(0,\, \gamma_{init} \cdot \sigma_{init})$$

This noisy initialization ensures the model learns a stable refinement operator rather than overfitting to a fixed starting point.
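A minimal sketch of this initialization using torch.nn.init.trunc_normal_; the default values and truncation bounds are assumptions:

```python
import torch

def init_scratchpad(K: int = 8, D: int = 896, gamma_init: float = 1.0,
                    sigma_init: float = 0.02) -> torch.Tensor:
    """Sample the latent scratchpad S_0 ~ TruncNormal(0, gamma_init * sigma_init).

    Default values are illustrative assumptions. The noisy start forces
    the core to learn a refinement operator, not a fixed trajectory.
    """
    std = gamma_init * sigma_init
    s0 = torch.empty(K, D)
    # Truncate at +/- 2 std, the usual trunc_normal_ convention.
    torch.nn.init.trunc_normal_(s0, mean=0.0, std=std, a=-2 * std, b=2 * std)
    return s0
```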

The Recurrent Core $R_{\theta}$ performs iterative refinement by maintaining representational stability through persistent Input Injection. At each iteration $k$, the current scratchpad state $S_{k-1}$ is concatenated with the static foundation $S_{pre}$, mapped back to the manifold dimension via a learned adapter, and normalized:

$$x_k = \mathrm{RMSNorm}\left( \gamma_{adapt} \cdot W_{adapt}\, [S_{k-1}; S_{pre}] \right)$$

The scratchpad is then updated via the weight-tied transformer block:

$$S_k = R_{\theta}\left(x_k,\, [h_{vis+lat}^{(24)}; p]\right)$$

Here, $R_{\theta}$ performs bidirectional self-attention across the $K$ queries and gated cross-attention using keys/values derived from the concatenated conditioning manifold. This ensures the model remains grounded in the physical observation throughout unrolling.
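A minimal PyTorch sketch of one weight-tied refinement step, combining the adapter equation above with gated cross-attention. The tanh gate, head count, and residual wiring are assumptions, and nn.RMSNorm requires PyTorch ≥ 2.4:

```python
import torch
import torch.nn as nn

class RecurrentStep(nn.Module):
    """One application of the weight-tied core R_theta (illustrative)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.adapter = nn.Linear(2 * dim, dim)            # W_adapt over [S_{k-1}; S_pre]
        self.gamma_adapt = nn.Parameter(torch.ones(dim))  # learned per-channel scale
        self.norm = nn.RMSNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))          # gated cross-attention

    def forward(self, s_prev, s_pre, cond):
        # Input injection: re-concatenate the static foundation every step.
        x = self.norm(self.gamma_adapt * self.adapter(torch.cat([s_prev, s_pre], -1)))
        # Bidirectional self-attention across the K queries.
        x = x + self.self_attn(x, x, x, need_weights=False)[0]
        # Keys/values come from the conditioning manifold [h_vis+lat^(24); p].
        x = x + torch.tanh(self.gate) * self.cross_attn(x, cond, cond, need_weights=False)[0]
        return x  # S_k, same shape as s_prev: (B, K, D)
```

Because the core is weight-tied, the same `RecurrentStep` instance is applied at every iteration, which is why memory does not grow with depth.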

Once the recurrence reaches depth $r$, the converged scratchpad $S_r$ is processed by the Coda $C_{\psi}$, which performs final decoding through self-attention and attention to high-level VLM features. The output is projected to the robot's action space:

$$\mathbf{a} = W_{out} \cdot \mathrm{RMSNorm}\left(C_{\psi}\left(S_r,\, [h_{vis}^{(24)}; h_{lat}^{(24)}; p]\right)\right)$$

Training employs randomized recurrence: the number of iterations $N$ is sampled from a heavy-tailed log-normal Poisson distribution with $\mu_{rec} = 32$. Truncated BPTT is used, propagating gradients only through the final $d=8$ iterations, forcing the model to learn iterative refinement from any noisy initialization into a stable manifold.
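A sketch of this training loop under the definitions above; the sigma value and the no-grad burn-in split are assumptions, but the gradient window matches the $d=8$ truncation:

```python
import torch

def sample_depth(mu_rec: float = 32.0, sigma: float = 0.5) -> int:
    """Heavy-tailed log-normal Poisson: draw a log-normal rate centred
    near mu_rec, then N ~ Poisson(rate). sigma is an assumed value."""
    rate = torch.exp(torch.randn(()) * sigma + torch.log(torch.tensor(mu_rec)))
    return max(1, int(torch.poisson(rate.reshape(1)).item()))

def unroll_tbptt(step, s, s_pre, cond, n_iters: int, d: int = 8):
    """Unroll the weight-tied core n_iters times, building a graph only
    for the final d steps (truncated BPTT), so memory is constant in depth."""
    with torch.no_grad():                    # gradient-free burn-in
        for _ in range(max(0, n_iters - d)):
            s = step(s, s_pre, cond)
    s = s.detach()
    for _ in range(min(d, n_iters)):         # supervised refinement window
        s = step(s, s_pre, cond)
    return s
```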

At inference, adaptive computation is implemented by monitoring convergence via the squared L2 distance between consecutive predicted actions:

$$\|\mathbf{a}_k - \mathbf{a}_{k-1}\|_2^2 < \delta$$

where $\delta = 10^{-3}$. This allows the model to self-regulate compute: terminating early for simple tasks and allocating more iterations for complex ones.
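A sketch of this adaptive stopping loop, where `step` is the weight-tied core and `decode` stands for the Coda plus output projection; the `max_iters` safety cap is an assumption, not a value from the paper:

```python
import torch

def adaptive_depth_inference(step, decode, s, s_pre, cond,
                             max_iters: int = 64, delta: float = 1e-3):
    """Refine until consecutive actions converge: ||a_k - a_{k-1}||_2^2 < delta."""
    a_prev = decode(s)
    for k in range(1, max_iters + 1):
        s = step(s, s_pre, cond)
        a = decode(s)
        if torch.sum((a - a_prev) ** 2) < delta:  # latent convergence test
            return a, k                           # k* = iterations actually used
        a_prev = a
    return a_prev, max_iters
```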

Adaptive execution further couples reasoning depth with action horizon. For high-uncertainty states ($k^* > \tau$), the execution horizon is truncated to $H_{short}$; otherwise, it remains $H_{long}$. Alternatively, a linear decay schedule reduces the horizon inversely with the iteration count:

$$H_{exec}(k^*) = \max\left( H_{min},\; H_{max} - \max(0,\, k^* - \tau_{base}) \right)$$

This ensures the agent replans more frequently under high computational demand, prioritizing safety in complex scenarios.
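The linear-decay variant is a one-liner; a sketch with illustrative constants (the specific values of $H_{min}$, $H_{max}$, and $\tau_{base}$ are assumptions):

```python
def exec_horizon(k_star: int, h_min: int = 4, h_max: int = 16,
                 tau_base: int = 8) -> int:
    """H_exec(k*) = max(H_min, H_max - max(0, k* - tau_base)).

    More refinement iterations signal higher uncertainty, so the executed
    action chunk shrinks and the agent replans sooner.
    """
    return max(h_min, h_max - max(0, k_star - tau_base))
```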

Experiment

  • Recurrent computation significantly boosts performance on manipulation tasks, with gains plateauing after 8–12 iterations, indicating diminishing returns beyond that point.
  • Task complexity varies widely, requiring different numbers of reasoning steps; adaptive computation dynamically allocates compute based on task difficulty without predefined rules.
  • Adaptive strategies (especially Binary Adaptation) match fixed-depth performance while cutting inference cost by up to 34%, confirming that condition-dependent compute allocation is more effective than uniform budgets.
  • Latent reasoning outperforms both end-to-end and token-level reasoning methods, achieving state-of-the-art results on LIBERO and CALVIN benchmarks with a smaller model size (0.5B parameters).
  • Real-world deployment on a bimanual robot shows strong robustness across household tasks, with fixed-depth variants excelling in most scenarios and adaptive variants remaining competitive while reducing inference cost.
  • The approach demonstrates viability for physical systems and opens pathways for uncertainty-aware execution, though depth generalization and saturation remain key limitations for future work.

The authors evaluate their latent reasoning approach against end-to-end and token-level reasoning methods, showing that RD-VLA achieves state-of-the-art performance on the LIBERO benchmark with significantly fewer parameters. Results indicate that both fixed and adaptive variants of RD-VLA outperform prior methods across task categories, with the adaptive version maintaining strong performance while reducing average inference cost. The findings support that iterative latent-space reasoning is more parameter-efficient and effective for robotic manipulation than token-based or direct action prediction approaches.

The authors evaluate adaptive computation strategies for their recurrent reasoning model, showing that dynamic allocation of inference steps based on task complexity achieves performance comparable to fixed-depth models while reducing average compute cost. Results indicate that different task categories naturally require varying numbers of iterations, and adaptive methods like Binary Adaptation strike the best balance between efficiency and success rate. The model’s ability to self-calibrate computation based on state uncertainty demonstrates practical viability for real-world deployment.

The authors evaluate their RD-VLA model on the CALVIN ABC→D benchmark, showing it completes longer task sequences than prior methods, achieving the highest average chain length of 3.39. Despite using only 0.5B parameters—significantly fewer than most baselines—the model demonstrates strong sequential planning capability through latent iterative reasoning. Results confirm that recurrent depth enhances performance on long-horizon manipulation tasks without requiring large-scale architectures.

