Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models
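Abstract
Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for robotic manipulation. In VLA models, reliable action prediction depends critically on accurately interpreting and integrating visual observations conditioned on language instructions. Although recent work has sought to improve the visual capabilities of VLA models, most approaches treat the LLM backbone as a black box and offer limited insight into how visual information is grounded in action generation. We therefore conduct a systematic analysis of multiple VLA models across different action-generation paradigms and observe that sensitivity to visual tokens progressively decreases in deeper layers during action generation. Motivated by this observation, we propose DeepVision-VLA, built on a Vision-Language Mixture-of-Transformers (VL-MoT) framework. The framework enables shared attention between a vision foundation model and the VLA backbone, injecting multi-level visual features from the vision expert into the deeper layers of the VLA backbone to strengthen visual representations for precise and complex manipulation. In addition, we introduce Action-Guided Visual Pruning (AGVP), which uses shallow-layer attention to remove irrelevant visual tokens while preserving task-relevant ones, reinforcing the visual cues essential for manipulation with minimal computational overhead. In experiments, DeepVision-VLA outperforms prior state-of-the-art methods by 9.0% and 7.5% on simulated and real-world tasks, respectively, offering new insights for the design of visually enhanced VLA models.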
One-sentence Summary
Researchers from Peking University, Simplexity Robotics, and The Chinese University of Hong Kong propose DeepVision-VLA, a Vision-Language Mixture-of-Transformers framework that injects multi-level visual features into deeper layers and employs Action-Guided Visual Pruning to significantly outperform prior methods in complex robotic manipulation tasks.
Key Contributions
- The paper introduces DeepVision-VLA, a framework built on a Vision-Language Mixture-of-Transformers architecture that injects multi-level visual features from a dedicated vision expert into deeper layers of the VLA backbone to counteract the progressive loss of visual sensitivity during action generation.
- An Action-Guided Visual Pruning strategy is presented to refine information flow by leveraging shallow-layer attention to identify and preserve task-relevant visual tokens while removing irrelevant background data with minimal computational overhead.
- Experimental results demonstrate that the proposed method outperforms prior state-of-the-art approaches by 9.0% on simulated tasks and 7.5% on real-world manipulation benchmarks, validating the effectiveness of enhanced visual grounding in complex robotic control.
Introduction
Vision-Language-Action (VLA) models are critical for robotic manipulation as they translate visual observations and language instructions into precise physical actions. However, prior approaches often treat the underlying Large Language Model backbone as a black box, failing to address a key limitation where the model's sensitivity to task-relevant visual tokens progressively degrades in deeper layers. To solve this, the authors introduce DeepVision-VLA, which leverages a Vision-Language Mixture-of-Transformers framework to inject multi-level visual features from a dedicated vision expert directly into the deeper layers of the VLA backbone. They further enhance this architecture with Action-Guided Visual Pruning, a technique that filters irrelevant visual tokens using shallow-layer attention to ensure only critical cues influence action generation.
Method
The authors build upon the QwenVLA-OFT baseline, which utilizes a visual encoder (SigLIP2-Large) and an LLM backbone (Qwen3-VL) to map observations and instructions to actions. However, standard VLA models often suffer from sensitivity attenuation in deep layers, where visual grounding becomes diffuse and less effective for precise manipulation. To address this, the authors propose the DeepVision-VLA framework, which enhances visual grounding by injecting multi-level knowledge from a Vision Expert into the deep layers of the VLA.
Refer to the framework diagram for a high-level comparison of the vanilla architecture against the proposed DeepVision-VLA. While the vanilla model relies solely on the LLM backbone, the proposed method introduces a Vision Expert branch that processes high-resolution inputs to capture fine-grained spatial details. This design aims to counteract the loss of visual sensitivity in deeper network layers.
The detailed architecture is depicted in the figure below. The model consists of a Vision Expert branch (using DINOv3) and the standard LLM Backbone. The Vision Expert is connected only to the deepest n layers of the VLA, where visual grounding is typically weakest. To integrate these features, the authors employ a Vision-Language Mixture-of-Transformers (VL-MoT) design. Instead of simple concatenation, the intermediate Query, Key, and Value (QKV) representations from the Vision Expert are exposed and integrated with the corresponding QKV of the deep VLA layers via a shared-attention mechanism.
To ensure the model focuses on task-relevant regions, the authors introduce Action-Guided Visual Pruning (AGVP). This strategy leverages attention maps from the shallow layers of the VLA, where visual grounding is most reliable, to identify Regions of Interest (ROIs). These attention cues are aggregated over shallow layers and interpolated to match the Vision Expert's resolution. The model then retains only the top-K most relevant visual tokens, filtering out redundant background features before they are integrated into the deep layers.
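The three AGVP steps described above (aggregate shallow-layer attention, resize it to the Vision Expert's token grid, keep the top-K tokens) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of a plain mean for aggregation, and nearest-neighbour resizing as a stand-in for interpolation are all assumptions made for illustration, and plain Python lists are used only to keep the example self-contained.

```python
def aggregate_attention(layer_maps):
    """Average action-to-vision attention maps over shallow layers.

    layer_maps: list of 2D grids (lists of rows), one per shallow layer.
    A plain mean is an assumption; the source only says "aggregated".
    """
    n = len(layer_maps)
    rows, cols = len(layer_maps[0]), len(layer_maps[0][0])
    return [[sum(m[r][c] for m in layer_maps) / n for c in range(cols)]
            for r in range(rows)]


def resize_nearest(grid, out_rows, out_cols):
    """Nearest-neighbour resize to a finer token grid.

    Stand-in for the interpolation step; the actual scheme is not
    specified in the source.
    """
    in_rows, in_cols = len(grid), len(grid[0])
    return [[grid[r * in_rows // out_rows][c * in_cols // out_cols]
             for c in range(out_cols)]
            for r in range(out_rows)]


def top_k_token_indices(grid, k):
    """Return flat indices of the k highest-scoring tokens, in raster order."""
    flat = [v for row in grid for v in row]
    ranked = sorted(range(len(flat)), key=lambda i: flat[i], reverse=True)
    return sorted(ranked[:k])


# Toy example (made-up numbers): two shallow layers, each with a 2x2 map
# whose top-right cell receives most of the attention.
shallow = [
    [[0.1, 0.6], [0.2, 0.1]],
    [[0.1, 0.8], [0.0, 0.1]],
]
scores = aggregate_attention(shallow)          # 2x2 mean map
expert_scores = resize_nearest(scores, 4, 4)   # hypothetical 4x4 expert grid
keep = top_k_token_indices(expert_scores, k=4)
print(keep)  # flat indices of the expert tokens that survive pruning
```

In this toy case the retained indices all fall in the top-right quadrant of the 4x4 grid, which is the region the shallow layers attended to most. Only the tokens at those indices would be passed on to the deep-layer integration.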
The integration of these pruned visual tokens is handled via the Vision-Language Shared Attention mechanism. In this module, the QKV projections from both the Vision Expert and the LLM backbone are concatenated. The attention is computed over this combined set, enabling cross-branch information exchange while preserving separate processing pathways. This allows the deep layers to access high-level, object-centric representations from the Vision Expert, significantly enhancing action prediction precision. The model is trained end-to-end on a large-scale cross-embodiment dataset, and during inference, the pipeline remains fully executable without additional external supervision.
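As a rough illustration of the shared-attention idea, the sketch below concatenates the per-branch Q, K, and V token lists, runs one attention pass over the combined set, and splits the output back into the two branches. It is a minimal single-head version under stated assumptions: both branches are assumed to share the same head dimension, no causal or padding mask is applied, and the function name `shared_attention` is hypothetical rather than taken from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def shared_attention(q_exp, k_exp, v_exp, q_llm, k_llm, v_llm):
    """Single-head attention over concatenated expert + LLM tokens.

    Each argument is a list of per-token vectors that its own branch has
    already projected, so the projection weights stay separate and only
    the attention computation is shared. Returns (expert_out, llm_out).
    """
    q = q_exp + q_llm
    k = k_exp + k_llm
    v = v_exp + v_llm
    scale = 1.0 / math.sqrt(len(k[0]))
    out = []
    for qi in q:
        # Every query sees the keys of both branches.
        weights = softmax([dot(qi, kj) * scale for kj in k])
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    n_exp = len(q_exp)
    # Split back so each branch continues through its own layers.
    return out[:n_exp], out[n_exp:]


# Toy example (made-up vectors): one expert token and one LLM token.
expert_out, llm_out = shared_attention(
    q_exp=[[1.0, 0.0]], k_exp=[[1.0, 0.0]], v_exp=[[1.0, 0.0]],
    q_llm=[[0.0, 1.0]], k_llm=[[0.0, 1.0]], v_llm=[[0.0, 1.0]],
)
```

Because the keys and values are pooled, the LLM token's output is a mixture that includes the expert token's value, which is the cross-branch exchange the paragraph describes. A practical implementation would use a tensor library with multi-head attention and masking.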
Experiment
- Layer-wise analysis of existing VLA models reveals that while shallow layers effectively ground actions in task-relevant visual regions, deeper layers increasingly rely on diffuse and less relevant features, leading to reduced action reliability.
- Simulation experiments demonstrate that the proposed DeepVision-VLA significantly outperforms multiple baselines across diverse manipulation tasks by integrating a Vision-Language Mixture-of-Transformers framework and an Action-Guided Visual Pruning strategy.
- Ablation studies confirm that coupling a high-resolution Vision Expert with deeper LLM layers and utilizing action-to-vision attention for token pruning are critical for maintaining strong visual grounding and achieving superior performance.
- Real-world evaluations on complex single-arm tasks show that the model achieves high success rates in precise manipulation scenarios, such as writing and pouring, where it maintains stability and accuracy even in multi-stage sequences.
- Generalization tests under unseen backgrounds and varying lighting conditions indicate that the model effectively decouples task-relevant objects from environmental noise and maintains robust performance where baseline methods fail.