HyperAIHyperAI

Command Palette

Search for a command to run...

실용적인 VLA 기초 모델

초록

로봇 조작 분야에서 큰 잠재력을 지닌 비전-언어-행동(Vision-Language-Action, VLA) 기반 모델은 다양한 작업과 플랫폼 간에 충실하게 일반화할 뿐만 아니라 적절한 비용 효율성(예: 적응을 위한 데이터 및 GPU 시간)을 보장해야 한다. 이를 위해 우리는 9종의 대표적인 이중 팔 로봇 구성에서 수집한 약 2만 시간의 실세계 데이터를 기반으로 한 LingBot-VLA를 개발하였다. 3종의 로봇 플랫폼에서 각각 100개의 작업을 수행하고, 작업당 130회의 사후 훈련 에피소드를 수행한 체계적인 평가를 통해, 본 모델은 경쟁 모델들에 비해 뚜렷한 우수성을 입증하며 강력한 성능과 광범위한 일반화 능력을 보여주었다. 또한, 8개 GPU를 사용한 훈련 환경에서 GPU당 초당 261개의 샘플을 처리할 수 있는 효율적인 코드베이스를 구축하였으며, 기존의 VLA 중심 코드베이스 대비 1.5~2.8배(사용하는 VLM 기반 모델에 따라 달라짐)의 성능 향상을 달성하였다. 이러한 특징들은 본 모델이 실세계 적용에 매우 적합함을 보장한다. 로봇 학습 분야의 발전을 위해, 우리는 코드, 기반 모델, 벤치마크 데이터를 모두 공개하여 보다 도전적인 작업 수행과 신뢰할 수 있는 평가 기준 확립을 지원하고자 한다.

One-sentence Summary

Robbyant researchers propose LingBot-VLA, a Vision-Language-Action foundation model trained on 20,000 real-world robot hours across 9 platforms, achieving state-of-the-art generalization via Mixture-of-Transformers and spatial-aware depth alignment, plus 1.5–2.8× faster training throughput, enabling scalable, deployable robotic manipulation.

Key Contributions

  • LingBot-VLA is trained on 20,000 hours of real-world dual-arm robot data from 9 platforms, demonstrating that VLA performance scales favorably with data volume and shows no saturation at this scale, enabling stronger generalization across tasks and embodiments.
  • The model achieves state-of-the-art results in a rigorous real-world evaluation across 3 robotic platforms, completing 100 diverse tasks with 130 episodes each, outperforming competitors and establishing a new benchmark for multi-platform VLA assessment.
  • An optimized training codebase delivers 261 samples per second per GPU on an 8-GPU setup, offering 1.5–2.8× speedup over existing VLA frameworks, reducing computational cost and accelerating deployment-ready model development.

Introduction

The authors leverage Vision-Language-Action (VLA) foundation models to enable robots to perform diverse manipulation tasks from natural language instructions, aiming to bridge the gap between large-scale pre-training and real-world deployment. Prior work lacks systematic evaluation on real robots at scale and suffers from inefficient training codebases that bottleneck data scaling and multi-platform testing. Their main contribution is LingBot-VLA, trained on 20,000 hours of real-world dual-arm robot data across 9 platforms, which demonstrates consistent performance gains with data scale and achieves state-of-the-art generalization across 3 robotic embodiments on 100 tasks. They also release an optimized training codebase delivering 1.5–2.8x speedup over existing frameworks, enabling faster iteration and lower costs while promoting open science through public release of code, models, and benchmarks.

Dataset

  • The authors use a large-scale pre-training dataset built from teleoperated data across 9 dual-arm robot platforms, including AgiBot G1, AgileX, Galaxea R1Lite/Pro, Realman Rs-02, Leju KUAVO 4 Pro, Oinglong, ARX Lift2, and Bimanual Franka—each with varying DoF arms, camera setups, and gripper configurations.

  • For language labeling, human annotators segment multi-view robot videos into clips aligned with atomic actions, trimming static start/end frames. They then use Qwen3-VL-235B-A22B to generate precise task and sub-task instructions for each clip.

  • For the GM-100 benchmark, 150 raw trajectories per task are collected via teleoperation across three platforms; the top 130 (by task completion, smoothness, and protocol adherence) are retained. Objects are standardized per GM-100 specs, and object poses are randomized per trajectory to boost environmental diversity.

  • Teleoperation follows strict guidelines: maintain end-effector clearance, slow motion during contact, and ensure distinct start/end visual states. Automated filtering removes technical anomalies, followed by manual review using multi-view video to exclude off-protocol or cluttered episodes.

  • The test set includes ~50% atomic actions not among the top 100 most frequent training actions, ensuring strong generalization evaluation. Word clouds in Figures 3a and 3b visualize action category distributions across train/test splits.

  • All trajectories are processed to align with GM-100 task specs, and metadata includes standardized object info, randomized poses, and quality-ranked trajectory labels. No cropping is mentioned; processing focuses on segmentation, filtering, and instruction annotation.

Method

The authors leverage a Mixture-of-Transformers (MoT) architecture to integrate a pre-trained vision-language model (VLM) with an action generation module, forming the core of LingBot-VLA. This framework processes vision-language and action modalities through distinct transformer pathways, which are coupled via a shared self-attention mechanism to enable layer-wise unified sequence modeling. The VLM, specifically Qwen2.5-VL, encodes multi-view operational images and task instructions, while the action expert processes proprioceptive sequences including initial states and action chunks. This design ensures that high-dimensional semantic priors from the VLM guide action generation across all layers while minimizing cross-modal interference through modality-specific processing.

The joint modeling sequence at timestamp ttt is constructed by concatenating the observation context Ot\mathbf{O}_tOt and the action chunk At\mathbf{A}_tAt. The observation context is defined as Ot=[It1,It2,It3,Tt,st]\mathbf{O}_t = [\mathbf{I}_t^1, \mathbf{I}_t^2, \mathbf{I}_t^3, \mathbf{T}_t, \mathbf{s}_t]Ot=[It1,It2,It3,Tt,st], incorporating tokens from three-view operational images of dual-arm robots, the task instruction Tt\mathbf{T}_tTt, and the robot state st\mathbf{s}_tst. The action sequence is denoted as At=[at,at+1,,at+T1]\mathbf{A}_t = [\mathbf{a}_t, \mathbf{a}_{t+1}, \ldots, \mathbf{a}_{t+T-1}]At=[at,at+1,,at+T1], where TTT is the action chunk length, set to 50 during pre-training. The training objective is to model the conditional distribution p(AtOt)p(\mathbf{A}_t|\mathbf{O}_t)p(AtOt) using Flow Matching, which enables smooth and continuous action modeling for precise robotic control.

For a flow timestep s[0,1]s \in [0, 1]s[0,1], the intermediate action At,s\mathbf{A}_{t,s}At,s is defined through linear interpolation between the ground-truth action At\mathbf{A}_tAt and Gaussian noise ϵN(0,I)\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})ϵN(0,I), resulting in At,s=sAt+(1s)ϵ\mathbf{A}_{t,s} = s\mathbf{A}_t + (1-s)\epsilonAt,s=sAt+(1s)ϵ. The conditional distribution of At,s\mathbf{A}_{t,s}At,s is formulated as p(At,sAt)=N(sAt,(1s)I)p(\mathbf{A}_{t,s}|\mathbf{A}_t) = \mathcal{N}(s\mathbf{A}_t, (1-s)\mathbf{I})p(At,sAt)=N(sAt,(1s)I). The action expert vθv_{\boldsymbol{\theta}}vθ is trained to predict the conditional vector field by minimizing the Flow Matching objective:

LFM=EsU[0,1],At,ϵvθ(At,s,Ot,s)(Atϵ)2,\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{s \sim \mathcal{U}[0,1], \mathbf{A}_t, \epsilon} \left\| v_{\boldsymbol{\theta}}(\mathbf{A}_{t,s}, \mathbf{O}_t, s) - (\mathbf{A}_t - \epsilon) \right\|^2,LFM=EsU[0,1],At,ϵvθ(At,s,Ot,s)(Atϵ)2,

where the target velocity is derived from the ideal vector field Atϵ\mathbf{A}_t - \epsilonAtϵ.

To ensure proper information flow, blockwise causal attention is implemented for the joint sequence [Ot,At][\mathbf{O}_t, \mathbf{A}_t][Ot,At]. The sequence is partitioned into three functional blocks: [It1,It2,It3,Tt][\mathbf{I}_t^1, \mathbf{I}_t^2, \mathbf{I}_t^3, \mathbf{T}_t][It1,It2,It3,Tt], [st][\mathbf{s}_t][st], and [at,at+1,,at+T1][\mathbf{a}_t, \mathbf{a}_{t+1}, \ldots, \mathbf{a}_{t+T-1}][at,at+1,,at+T1]. A causal mask restricts attention such that tokens within each block can only attend to themselves and preceding blocks, while tokens within the same block use bidirectional attention. This prevents information leakage from future action tokens into observation representations.

To enhance spatial awareness and execution robustness, a vision distillation approach is adopted. Learnable queries [Qt1,Qt2,Qt3][\mathbf{Q}_t^1, \mathbf{Q}_t^2, \mathbf{Q}_t^3][Qt1,Qt2,Qt3] corresponding to the three-view images are processed by the VLM and aligned with depth tokens [Dt1,Dt2,Dt3][\mathbf{D}_t^1, \mathbf{D}_t^2, \mathbf{D}_t^3][Dt1,Dt2,Dt3] from LingBot-Depth. This alignment is achieved by minimizing the distillation loss Ldistill\mathcal{L}_{\mathrm{distill}}Ldistill:

Ldistill=EOtProj(Qt)Dt,\mathcal{L}_{\mathrm{distill}} = \mathbb{E}_{\mathbf{O}_t} \left| \mathrm{Proj}(\mathbf{Q}_t) - \mathbf{D}_t \right|,Ldistill=EOtProj(Qt)Dt,

where Proj()\mathrm{Proj}(\cdot)Proj() is a projection layer using cross-attention for dimensional alignment. This integration infuses geometric information into the model, improving perception for complex manipulation tasks.

Experiment

  • Evaluated LingBot-VLA across 25 robots on 3 platforms (AgileX, Agibot G1, Galaxea R1Pro) using GM-100 benchmark (100 tasks, 39K demos), achieving 22.5K controlled trials vs. 3 baselines under identical conditions.
  • On real-world GM-100, LingBot-VLA w/ depth outperformed π0.5 by +4.28% SR and +7.76% PS averaged across platforms; surpassed WALL-OSS and GR00T N1.6 consistently, with GR00T showing platform-specific gains due to pre-training alignment.
  • In RoboTwin 2.0 simulation, LingBot-VLA w/ depth exceeded π0.5 by +5.82% SR (clean) and +9.92% SR (randomized), leveraging depth-enhanced spatial priors for robust multi-task generalization.
  • Training throughput analysis showed LingBot’s codebase achieved fastest sample/s across Qwen2.5-VL-3B-π and PaliGemma-3B-pt-224-π, scaling efficiently up to 256 GPUs and outperforming StarVLA, Dexbotic, and OpenPI.
  • Scaling experiments revealed consistent SR and PS gains as pre-training data increased from 3K to 20K hours, with cross-platform alignment confirming robust generalization.
  • Data-efficiency tests on Agibot G1 showed LingBot-VLA with only 80 demos/task outperformed π0.5 using 130 demos/task, with performance gap widening as data increased.

The authors use a large-scale real-world benchmark to evaluate LingBot-VLA against three state-of-the-art baselines across three robotic platforms, with results showing that LingBot-VLA without depth outperforms WALL-OSS and GR00T N1.6 in both success rate and progress score. By incorporating depth information, LingBot-VLA with depth achieves a 4.28% higher success rate and a 7.76% higher progress score on average compared to π₀.₅ across all platforms.

The authors use a data-efficient post-training experiment on the Agibot G1 platform to compare LingBot-VLA with the π₀.₅ baseline, showing that LingBot-VLA achieves higher success and progress rates with fewer demonstrations. Results show that LingBot-VLA outperforms π₀.₅ even when trained on only 80 demonstrations per task, and the performance gap widens as the number of demonstrations increases, demonstrating superior data efficiency.

Results show that LingBot-VLA outperforms the π0.5\pi_{0.5}π0.5 baseline across all platforms, with the variant incorporating depth information achieving the highest average success rate of 86.68%. The authors use a controlled evaluation protocol to ensure fair comparisons, demonstrating that depth-based spatial information significantly improves real-world task performance.

Results show that LingBot-VLA outperforms all baselines across the three robotic platforms in both success rate and progress score, with the variant incorporating depth information achieving the highest performance. The model demonstrates strong generalization, as evidenced by consistent improvements over the baselines on each platform and in the average across all embodiments.

The authors use a large-scale real-world benchmark to evaluate LingBot-VLA against three state-of-the-art baselines across three robotic platforms, with results showing that LingBot-VLA without depth outperforms WALL-OSS and GR00T N1.6 in both success rate and progress score. By incorporating depth information, LingBot-VLA with depth achieves a 4.28% higher success rate and a 7.76% higher progress score than π₀.₅ across all platforms, demonstrating improved performance through enhanced spatial understanding.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp