
BayesianVLA: Bayesian Decomposition of Vision-Language-Action Models via Latent Action Queries

Shijie Lian Bin Yu Xiaopeng Lin Laurence T. Yang Zhaolong Shen Changti Wu Yuzhuo Miao Cong Huang Kai Chen

Abstract

Vision-Language-Action (VLA) models have shown promising results in robotic manipulation, yet they often struggle to generalize to novel instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms: goal-driven data collection introduces a dataset bias in which language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term "Information Collapse". As a consequence, models degenerate into vision-only policies that ignore language instructions and fail in out-of-distribution (OOD) scenarios. To address this problem, we propose BayesianVLA, a novel approach that enforces instruction following through a Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture that estimates both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $\pi(a \mid v, \ell)$. We then optimize the policy to maximize the conditional pointwise mutual information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language instruction. Without any additional data, BayesianVLA achieves a substantial improvement in generalization. Extensive experiments on SimplerEnv and RoboCasa show significant performance gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, underscoring our approach's ability to robustly ground language in actions.

One-sentence Summary

Researchers from HUST, ZGCA, and collaborators propose BayesianVLA, a framework that combats instruction-ignoring in VLA models via Bayesian decomposition and Latent Action Queries, boosting OOD generalization by 11.3% on SimplerEnv without new data.

Key Contributions

  • We identify "Information Collapse" in VLA training, where goal-driven datasets cause language instructions to become predictable from visuals alone, leading models to ignore language and fail in OOD settings.
  • We introduce BayesianVLA, a dual-branch framework using Latent Action Queries to separately model vision-only priors and language-conditioned posteriors, optimized via conditional PMI to enforce explicit instruction grounding.
  • BayesianVLA achieves state-of-the-art results without new data, including an 11.3% OOD improvement on SimplerEnv, while preserving the text-only conversational abilities of the backbone VLM.

Introduction

The authors leverage Vision-Language-Action (VLA) models to enable robots to follow natural language instructions, but identify a critical flaw: in goal-driven datasets, visual observations alone can predict instructions, causing models to ignore language and rely on “vision shortcuts.” This leads to poor generalization in out-of-distribution or ambiguous scenarios. To fix this, they introduce BayesianVLA, which uses Bayesian decomposition and learnable Latent Action Queries to train a dual-branch policy—one estimating a vision-only prior and the other a language-conditioned posterior—optimized to maximize the mutual information between actions and instructions. Their method requires no new data and significantly improves OOD performance, while preserving the model’s core language understanding, making it more robust and reliable for real-world deployment.

Method

The authors leverage a Bayesian framework to address the vision shortcut problem in Vision-Language-Action (VLA) models, where the conditional mutual information between instructions and actions collapses due to deterministic mappings from vision to language in goal-driven datasets. To counteract this, they propose maximizing the conditional Pointwise Mutual Information (PMI) between actions and instructions, which is formalized as the Log-Likelihood Ratio (LLR) between the posterior policy and the vision-only prior. This objective, derived from information-theoretic principles, encourages the model to learn action representations that carry instruction-specific semantics not predictable from vision alone.
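In symbols, and simply restating the definitions above, the objective can be written as the following log-likelihood ratio between the language-conditioned posterior and the vision-only prior:

```latex
% Conditional PMI between action a and instruction \ell given vision v,
% expressed as the log-likelihood ratio (LLR) between the posterior policy
% and the vision-only prior.
\operatorname{PMI}(a; \ell \mid v)
  \;=\; \log \frac{\pi(a \mid v, \ell)}{p(a \mid v)}
  \;=\; \log \pi(a \mid v, \ell) \;-\; \log p(a \mid v)
```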

As illustrated in the paper's framework figure, the proposed BayesianVLA framework operates through a dual-branch training strategy that shares a single Large Language Model (LLM) backbone while maintaining distinct input structures for each branch. The core innovation lies in the use of Latent Action Queries, learnable tokens appended to the input sequence that serve as a dedicated bottleneck interface between the LLM and the continuous action head. This design enables precise control over the information flow by leveraging the causal masking of decoder-only models, allowing the queries to attend to different subsets of the input depending on their position.
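As a concrete illustration, a minimal sketch of such learnable query tokens is shown below; the class name, the number of queries, and the hidden size are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class LatentActionQueries(nn.Module):
    """Learnable query tokens acting as a bottleneck between the LLM and the
    continuous action head (illustrative sketch; sizes are assumed)."""

    def __init__(self, num_queries: int = 16, hidden_dim: int = 1024):
        super().__init__()
        # K learnable embeddings, shared across the batch.
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)

    def forward(self, batch_size: int) -> torch.Tensor:
        # Returns (B, K, D) query tokens to concatenate into the LLM input sequence.
        return self.queries.unsqueeze(0).expand(batch_size, -1, -1)
```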

In the Priori Branch, the input sequence is structured as $[v, Q, \ell]$, where $v$ is the visual observation, $Q$ is the set of action queries, and $\ell$ is the language instruction. Due to the causal attention mask, the queries can attend to the visual input but not to the language instruction, resulting in hidden states $\mathbf{H}_{\mathcal{Q}}^{\text{prior}}$ that encode only vision-dependent information. These features are used to predict the action $a$ via a flow-matching loss $\mathcal{L}_{\text{prior}}$, effectively learning the dataset's inherent action bias $p(a \mid v)$.

In the Posteriori Branch, the input sequence is arranged as $[v, \ell, Q]$, allowing the queries to attend to both the visual and language inputs. This produces hidden states $\mathbf{H}_{\mathcal{Q}}^{\text{post}}$ that encode the full context of vision and language, which are used to predict the expert action $a$ with a main flow-matching loss $\mathcal{L}_{\text{main}}$. The two branches are trained simultaneously, sharing the same LLM weights.
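The sketch below shows how the two branch orderings could be assembled and how the query hidden states could be sliced out of a shared decoder-only LLM. It assumes a HuggingFace-style model that accepts `inputs_embeds` and applies a causal mask; the function names, shapes, and slicing conventions are illustrative assumptions.

```python
import torch

def build_branch_inputs(v_emb: torch.Tensor,
                        l_emb: torch.Tensor,
                        q_emb: torch.Tensor):
    """Arrange token embeddings (each of shape (B, T, D)) for both branches.

    With a causal mask, tokens attend only to earlier positions, so:
      prior branch    [v, Q, l]: queries see vision only          -> models p(a | v)
      posterior branch [v, l, Q]: queries see vision and language -> models pi(a | v, l)
    """
    prior_seq = torch.cat([v_emb, q_emb, l_emb], dim=1)
    post_seq = torch.cat([v_emb, l_emb, q_emb], dim=1)
    return prior_seq, post_seq


def query_hidden_states(llm, prior_seq, post_seq, num_v: int, num_l: int, num_q: int):
    """Run the shared LLM on both sequences and slice out the query positions."""
    h_prior = llm(inputs_embeds=prior_seq).last_hidden_state
    h_post = llm(inputs_embeds=post_seq).last_hidden_state
    hq_prior = h_prior[:, num_v:num_v + num_q]                # queries follow vision
    hq_post = h_post[:, num_v + num_l:num_v + num_l + num_q]  # queries end the sequence
    return hq_prior, hq_post
```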

To explicitly maximize the LLR objective, the framework computes the difference in log-probabilities of the language tokens between the two branches. The LLR loss is defined as $\mathcal{L}_{\mathrm{LLR}} = \log p(\boldsymbol{\ell} \mid \boldsymbol{v}, \mathbf{H}_{\mathcal{Q}}^{\text{prior}}) - \mathrm{sg}(\log p(\boldsymbol{\ell} \mid \boldsymbol{v}))$, where the stop-gradient operator prevents the model from degrading the baseline language-model capability. This term is optimized to force the action representations to carry information that explains the instruction.
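A minimal sketch of that term, assuming the per-example sums of language-token log-probabilities under both conditions have already been computed, might look as follows; the function name and argument layout are assumptions.

```python
import torch

def llr_term(logp_l_given_v_and_hq_prior: torch.Tensor,
             logp_l_given_v: torch.Tensor) -> torch.Tensor:
    """L_LLR = log p(l | v, H_Q^prior) - sg(log p(l | v)), averaged over the batch.

    The stop-gradient (detach) on the vision-only baseline keeps the underlying
    language-modelling capability from being degraded while the action
    representations are pushed to explain the instruction.
    """
    return (logp_l_given_v_and_hq_prior - logp_l_given_v.detach()).mean()
```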

The total training objective combines the action prediction losses from both branches with the LLR regularization term: $\mathcal{L}_{\mathrm{total}} = (1 - \lambda)\, \mathcal{L}_{\mathrm{FM}}(\psi; \mathbf{H}_{\mathcal{Q}}^{\text{post}}) + \lambda\, \mathcal{L}_{\mathrm{FM}}(\psi; \mathbf{H}_{\mathcal{Q}}^{\text{prior}}) - \beta\, \mathcal{L}_{\mathrm{LLR}}$. The action decoder is trained with the Rectified Flow Matching objective, in which the Diffusion Transformer predicts the velocity field for action trajectories conditioned on the query features. During inference, only the Posteriori Branch is executed to generate actions, so there is no additional computational overhead compared to standard VLA baselines.
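The sketch below illustrates a generic rectified-flow-matching loss for the action decoder and the weighted combination of the three terms; the `dit` call signature and the example values of `lam` and `beta` are assumptions, not values reported by the paper.

```python
import torch

def rectified_fm_loss(dit, h_q: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    """Rectified flow matching: regress the constant velocity of the straight-line
    path between Gaussian noise and the expert action chunk, conditioned on the
    query features h_q (dit is a hypothetical Diffusion Transformer module)."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, 1, device=actions.device)  # time in [0, 1)
    x_t = (1.0 - t) * noise + t * actions                          # point on the path
    target_velocity = actions - noise                              # constant velocity of the path
    pred_velocity = dit(x_t, t, cond=h_q)                          # assumed signature
    return torch.mean((pred_velocity - target_velocity) ** 2)


def total_loss(l_fm_post, l_fm_prior, l_llr, lam: float = 0.1, beta: float = 0.01):
    """L_total = (1 - lambda) * L_FM(post) + lambda * L_FM(prior) - beta * L_LLR."""
    return (1.0 - lam) * l_fm_post + lam * l_fm_prior - beta * l_llr
```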

Experiment

  • Pilot experiments reveal that standard VLA models often learn vision-only policies p(a|v) instead of true language-conditioned policies π(a|v,ℓ) when trained on goal-driven datasets.
  • On RoboCasa (24 tasks), vision-only models achieve 44.6% success vs. 47.8% for language-conditioned baselines, showing minimal reliance on instructions due to visual-task correlation.
  • On LIBERO Goal, vision-only models drop to 9.8% success (vs. 98.0% baseline) when scenes map to multiple tasks, exposing failure to resolve ambiguity without language.
  • On BridgeDataV2, vision-only models reach a training loss comparable to full models (0.13 vs. 0.08) but fail catastrophically (near 0%) on OOD SimplerEnv, confirming overfitting to visual shortcuts.
  • BayesianVLA achieves 66.5% avg success on SimplerEnv (vs. 55.2% baseline), with +11.3% absolute gain, excelling in tasks like “Put Carrot on Plate” (+13.6%) and “Put Eggplant in Yellow Basket” (+15.0%).
  • On RoboCasa, BayesianVLA reaches 50.4% avg success (vs. 47.8% baseline), outperforming all competitors and notably improving on ambiguous tasks like “PnP Novel From Placemat To Plate” (70.0% vs. 34.0% vision-only).
  • Ablations confirm Bayesian decomposition drives core gains (+6.0% over +Action Query alone), while latent action queries improve efficiency by reducing DiT complexity from O(N²) to O(K²).
  • Future work includes scaling to larger models (e.g., Qwen3VL-8B), real-world testing, and expanding to RoboTwin/LIBERO benchmarks.

The authors evaluate BayesianVLA on the SimplerEnv benchmark after training on the BridgeDataV2 and Fractal datasets. BayesianVLA achieves a state-of-the-art average success rate of 66.5%, outperforming the QwenGR00T baseline (55.2%) by 11.3 percentage points as well as other strong competitors. The gains are particularly notable in tasks requiring precise object manipulation, such as "Put Carrot on Plate" and "Put Eggplant in Yellow Basket," confirming that the proposed Bayesian decomposition mitigates the vision shortcut by encouraging the model to rely on language instructions rather than visual cues alone.

The authors use the RoboCasa benchmark to evaluate VLA models, where the VisionOnly baseline achieves a high success rate of 44.7%, indicating that the model can perform well without language instructions due to visual shortcuts. BayesianVLA surpasses all baselines, achieving an average success rate of 50.4% and demonstrating significant improvements in tasks where the vision-only policy fails, such as "PnP Novel From Placemat To Plate," confirming that the method effectively mitigates the vision shortcut by leveraging language for disambiguation.

