
BayesianVLA: Bayesian Decomposition of Vision-Language-Action Models with Latent Action Queries

Shijie Lian Bin Yu Xiaopeng Lin Laurence T. Yang Zhaolong Shen Changti Wu Yuzhuo Miao Cong Huang Kai Chen

Abstract

Vision-Language-Action (VLA) models have shown promising results in robotic manipulation, but they struggle to generalize to novel instructions and complex multi-task environments. In this work, we identify a critical pathology in current training frameworks: goal-driven data collection biases the datasets so that language instructions become highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we call "Information Collapse." As a result, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose BayesianVLA, a novel framework built on a Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture that estimates a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $\pi(a \mid v, \ell)$. The policy is then optimized to maximize the conditional pointwise mutual information (PMI) between actions and instructions. This objective effectively penalizes visual shortcuts and rewards actions that explicitly explain the language instruction. Without requiring any new data collection, BayesianVLA substantially improves generalization. Extensive experiments on SimplerEnv and RoboCasa, including an 11.3% gain on the SimplerEnv benchmark under particularly challenging OOD settings, demonstrate that the method robustly grounds language in action.

One-sentence Summary

Researchers from HUST, ZGCA, and collaborators propose BayesianVLA, a framework that combats instruction-ignoring in VLA models via Bayesian decomposition and Latent Action Queries, boosting OOD generalization by 11.3% on SimplerEnv without new data.

Key Contributions

  • We identify "Information Collapse" in VLA training, where goal-driven datasets cause language instructions to become predictable from visuals alone, leading models to ignore language and fail in OOD settings.
  • We introduce BayesianVLA, a dual-branch framework using Latent Action Queries to separately model vision-only priors and language-conditioned posteriors, optimized via conditional PMI to enforce explicit instruction grounding.
  • BayesianVLA achieves state-of-the-art results without new data, including an 11.3% OOD improvement on SimplerEnv, while preserving the text-only conversational abilities of the backbone VLM.

Introduction

The authors leverage Vision-Language-Action (VLA) models to enable robots to follow natural language instructions, but identify a critical flaw: in goal-driven datasets, visual observations alone can predict instructions, causing models to ignore language and rely on “vision shortcuts.” This leads to poor generalization in out-of-distribution or ambiguous scenarios. To fix this, they introduce BayesianVLA, which uses Bayesian decomposition and learnable Latent Action Queries to train a dual-branch policy—one estimating a vision-only prior and the other a language-conditioned posterior—optimized to maximize the mutual information between actions and instructions. Their method requires no new data and significantly improves OOD performance, while preserving the model’s core language understanding, making it more robust and reliable for real-world deployment.

Method

The authors leverage a Bayesian framework to address the vision shortcut problem in Vision-Language-Action (VLA) models, where the conditional mutual information between instructions and actions collapses due to deterministic mappings from vision to language in goal-driven datasets. To counteract this, they propose maximizing the conditional Pointwise Mutual Information (PMI) between actions and instructions, which is formalized as the Log-Likelihood Ratio (LLR) between the posterior policy and the vision-only prior. This objective, derived from information-theoretic principles, encourages the model to learn action representations that carry instruction-specific semantics not predictable from vision alone.
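To make the objective concrete, conditional PMI can be rewritten via Bayes' rule as a log-likelihood ratio over the instruction; the short derivation below uses the paper's notation but is a standard identity rather than material taken from the paper itself:

$$
\mathrm{PMI}(a; \ell \mid v)
= \log \frac{\pi(a \mid v, \ell)}{p(a \mid v)}
= \log \frac{p(a, \ell \mid v)}{p(a \mid v)\, p(\ell \mid v)}
= \log \frac{p(\ell \mid v, a)}{p(\ell \mid v)}.
$$

The rightmost form is the ratio the LLR loss estimates in practice: an action, or its latent representation, is rewarded exactly to the extent that it makes the instruction more predictable than vision alone.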

As illustrated in the paper's framework figure, the proposed BayesianVLA framework operates through a dual-branch training strategy that shares a single Large Language Model (LLM) backbone while maintaining distinct input structures for each branch. The core innovation lies in the use of Latent Action Queries, which are learnable tokens appended to the input sequence to serve as a dedicated bottleneck interface between the LLM and the continuous action head. This design enables precise control over the information flow by leveraging the causal masking of decoder-only models, allowing the queries to attend to different subsets of the input depending on their position.
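A minimal sketch of how such learnable query tokens might be defined, assuming a decoder-only backbone that consumes input embeddings; class and method names here are illustrative, not the authors' implementation:

```python
import torch
import torch.nn as nn

class LatentActionQueries(nn.Module):
    """K learnable tokens appended to the input sequence; their hidden
    states form the bottleneck read out by the continuous action head."""

    def __init__(self, num_queries: int = 16, hidden_dim: int = 1024):
        super().__init__()
        self.num_queries = num_queries
        # Initialized like ordinary token embeddings and trained end to end.
        self.embed = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)

    def expand(self, batch_size: int) -> torch.Tensor:
        """Tile the query embeddings for a batch: returns (B, K, D)."""
        return self.embed.unsqueeze(0).expand(batch_size, -1, -1)

    def read_out(self, hidden_states: torch.Tensor, start: int) -> torch.Tensor:
        """Slice the query hidden states H_Q out of the backbone output,
        given the position at which the queries were inserted."""
        return hidden_states[:, start:start + self.num_queries, :]
```

Because the backbone is causal, the same parameter tensor yields vision-only features when the queries are placed before the instruction tokens and language-conditioned features when placed after them, which is exactly how the two branches described next differ.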

In the Priori Branch, the input sequence is structured as $[v, Q, \ell]$, where $v$ is the visual observation, $Q$ is the set of action queries, and $\ell$ is the language instruction. Due to the causal attention mask, the queries can attend to the visual input but not to the language instruction, resulting in hidden states $\mathbf{H}_{\mathcal{Q}}^{\text{prior}}$ that encode only vision-dependent information. These features are used to predict the action $a$ via a flow-matching loss $\mathcal{L}_{\text{prior}}$, effectively learning the dataset's inherent action bias $p(a \mid v)$.

In the Posteriori Branch, the input sequence is arranged as $[v, \ell, Q]$, allowing the queries to attend to both the visual and language inputs. This produces hidden states $\mathbf{H}_{\mathcal{Q}}^{\text{post}}$ that encode the full context of vision and language, which are used to predict the expert action $a$ with the main flow-matching loss $\mathcal{L}_{\text{main}}$. The two branches are trained simultaneously and share the same LLM weights.
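The two orderings can be sketched as follows, assuming a Hugging Face-style backbone that accepts `inputs_embeds` and returns `last_hidden_state`; the function is an illustrative reconstruction, not the released code:

```python
import torch

def dual_branch_query_states(backbone, query_embeds, vision_embeds, lang_embeds):
    """Run the shared causal LLM with two token orders and return the
    query hidden states of each branch.

    Prior branch     [v, Q, l]: causal masking lets Q attend to vision only.
    Posterior branch [v, l, Q]: Q attends to both vision and language.
    """
    batch = vision_embeds.size(0)
    K = query_embeds.size(0)
    Q = query_embeds.unsqueeze(0).expand(batch, -1, -1)

    # Prior branch: instruction tokens come after the queries.
    seq_prior = torch.cat([vision_embeds, Q, lang_embeds], dim=1)
    h_prior = backbone(inputs_embeds=seq_prior).last_hidden_state
    start = vision_embeds.size(1)
    H_Q_prior = h_prior[:, start:start + K, :]   # vision-only features ~ p(a | v)

    # Posterior branch: queries come last and see the full context.
    seq_post = torch.cat([vision_embeds, lang_embeds, Q], dim=1)
    h_post = backbone(inputs_embeds=seq_post).last_hidden_state
    H_Q_post = h_post[:, -K:, :]                 # language-conditioned ~ pi(a | v, l)

    return H_Q_prior, H_Q_post
```

Note that the prior-branch forward pass also yields logits for the instruction tokens conditioned on the queries, which is what the LLR term below reuses.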

To explicitly maximize the LLR objective, the framework computes the difference in log-probabilities of the language tokens between the two branches. The LLR loss is defined as $\mathcal{L}_{\mathrm{LLR}} = \log p(\boldsymbol{\ell} \mid \boldsymbol{v}, \mathbf{H}_{\mathcal{Q}}^{\text{prior}}) - \mathrm{sg}\!\left(\log p(\boldsymbol{\ell} \mid \boldsymbol{v})\right)$, where the stop-gradient operator prevents the model from degrading its baseline language-modeling capability. This term is optimized to force the action representations to carry information that explains the instruction.
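A sketch of how this term could be computed, assuming the instruction-token logits of the prior branch (conditioned on the queries) and of a query-free vision-only pass are available and already aligned with the instruction token ids; function and argument names are assumptions:

```python
import torch
import torch.nn.functional as F

def llr_loss(logits_with_queries, logits_vision_only, lang_ids, lang_mask):
    """L_LLR = log p(l | v, H_Q^prior) - sg(log p(l | v)).

    logits_with_queries: instruction-token logits from the [v, Q, l] pass,
                         i.e. conditioned on the latent action queries.
    logits_vision_only:  instruction-token logits from a pass without
                         queries, used as the stop-gradient baseline.
    lang_ids:  (B, L) instruction token ids.
    lang_mask: (B, L) 1 for real tokens, 0 for padding.
    """
    def sequence_log_prob(logits, ids, mask):
        logp = F.log_softmax(logits, dim=-1)
        token_logp = logp.gather(-1, ids.unsqueeze(-1)).squeeze(-1)
        return (token_logp * mask).sum(dim=-1)   # sum over instruction tokens

    logp_queries = sequence_log_prob(logits_with_queries, lang_ids, lang_mask)
    logp_base = sequence_log_prob(logits_vision_only, lang_ids, lang_mask).detach()
    # Maximized during training, so it enters the total loss with a minus sign.
    return (logp_queries - logp_base).mean()
```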

The total training objective combines the action prediction losses from both branches with the LLR regularization term: $\mathcal{L}_{\mathrm{total}} = (1 - \lambda)\, \mathcal{L}_{\mathrm{FM}}(\psi; \mathbf{H}_{\mathcal{Q}}^{\text{post}}) + \lambda\, \mathcal{L}_{\mathrm{FM}}(\psi; \mathbf{H}_{\mathcal{Q}}^{\text{prior}}) - \beta\, \mathcal{L}_{\mathrm{LLR}}$. The action decoder is trained using the Rectified Flow Matching objective, where the Diffusion Transformer predicts the velocity field for action trajectories conditioned on the query features. During inference, only the Posteriori Branch is executed to generate actions, ensuring no additional computational overhead compared to standard VLA baselines.
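A compact sketch of how these pieces could be combined, with a rectified-flow action loss; the DiT call signature and the lambda and beta values are placeholders rather than the paper's settings:

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(dit, H_Q, actions):
    """Flow-matching loss: the DiT predicts the constant velocity (a - z)
    along the straight path from noise z to the action chunk a,
    conditioned on the query features H_Q."""
    z = torch.randn_like(actions)                       # noise sample
    t = torch.rand(actions.size(0), 1, 1, device=actions.device)
    x_t = (1.0 - t) * z + t * actions                   # point on the straight path
    v_target = actions - z                              # target velocity field
    v_pred = dit(x_t, t, cond=H_Q)                      # assumed DiT interface
    return F.mse_loss(v_pred, v_target)

def total_loss(dit, H_Q_post, H_Q_prior, actions, L_llr, lam=0.1, beta=0.05):
    """L_total = (1 - lambda) L_FM(post) + lambda L_FM(prior) - beta L_LLR."""
    L_main  = rectified_flow_loss(dit, H_Q_post,  actions)
    L_prior = rectified_flow_loss(dit, H_Q_prior, actions)
    return (1.0 - lam) * L_main + lam * L_prior - beta * L_llr
```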

Experiment

  • Pilot experiments reveal that standard VLA models trained on goal-driven datasets often learn vision-only policies p(a|v) instead of true language-conditioned policies π(a|v,ℓ).
  • On RoboCasa (24 tasks), vision-only models achieve 44.6% success vs. 47.8% for language-conditioned baselines, showing minimal reliance on instructions due to visual-task correlation.
  • On LIBERO Goal, vision-only models drop to 9.8% success (vs. 98.0% baseline) when scenes map to multiple tasks, exposing failure to resolve ambiguity without language.
  • On BridgeDataV2, vision-only models reach a training loss comparable to full models (0.13 vs. 0.08) but fail catastrophically (near 0% success) on OOD SimplerEnv, confirming overfitting to visual shortcuts.
  • BayesianVLA achieves 66.5% avg success on SimplerEnv (vs. 55.2% baseline), with +11.3% absolute gain, excelling in tasks like “Put Carrot on Plate” (+13.6%) and “Put Eggplant in Yellow Basket” (+15.0%).
  • On RoboCasa, BayesianVLA reaches 50.4% avg success (vs. 47.8% baseline), outperforming all competitors and notably improving on ambiguous tasks like “PnP Novel From Placemat To Plate” (70.0% vs. 34.0% vision-only).
  • Ablations confirm Bayesian decomposition drives core gains (+6.0% over +Action Query alone), while latent action queries improve efficiency by reducing DiT complexity from O(N²) to O(K²).
  • Future work includes scaling to larger models (e.g., Qwen3VL-8B), real-world testing, and expanding to RoboTwin/LIBERO benchmarks.

The authors use the SimplerEnv benchmark to evaluate BayesianVLA, which is trained on BridgeDataV2 and Fractal datasets. Results show that BayesianVLA achieves a state-of-the-art average success rate of 66.5%, significantly outperforming the baseline QwenGR00T (55.2%) and other strong competitors, with notable improvements in tasks requiring precise object manipulation.

A further SimplerEnv comparison against baseline models reports an average success rate of 63.5% for BayesianVLA, 8.3 percentage points above the QwenGR00T baseline. The improvement is particularly notable in tasks requiring precise object manipulation, such as "Put Carrot on Plate" and "Put Eggplant in Yellow Basket," where BayesianVLA shows significant gains over the baseline. These results confirm that the proposed Bayesian decomposition mitigates the vision shortcut by encouraging the model to rely on language instructions rather than on visual cues alone.

The authors use the RoboCasa benchmark to evaluate VLA models, where the VisionOnly baseline achieves a high success rate of 44.7%, indicating that the model can perform well without language instructions due to visual shortcuts. BayesianVLA surpasses all baselines, achieving an average success rate of 50.4% and demonstrating significant improvements in tasks where the vision-only policy fails, such as "PnP Novel From Placemat To Plate," confirming that the method effectively mitigates the vision shortcut by leveraging language for disambiguation.

