X-VLA: Open-Source Robot Model Shatters Performance Records
The Institute for AI Industry Research (AIR) at Tsinghua University and the Shanghai AI Laboratory have jointly released X-VLA, a new general-purpose, cross-embodiment embodied foundation model. The model is fully open source, with all data, code, and model parameters released. It completes a 120-minute autonomous shirt-folding task without any external assistance and sets new performance records across five major benchmark environments. Despite its compact size of just 0.9 billion parameters, X-VLA outperforms existing models in both efficiency and capability, establishing a new open, high-performance standard for embodied intelligence.

While multimodal large language models (MLLMs) have made impressive strides, from image captioning to video understanding, questions remain about their true understanding: can they genuinely reason and make decisions in complex, multi-step visual tasks? The new X-VLA model, developed by Professor Yang Liu's team at AIR in collaboration with Tsinghua University's Department of Computer Science and Fudan University, aims to answer this question. To probe it, the team introduced EscapeCraft, a 3D room-escape environment designed to challenge MLLMs with real-world reasoning. The results were revealing: models often saw the door but kept walking around the walls, picked up keys but failed to use them, and even attempted to "grab" a sofa on the reasoning that it might hide a secret compartment. These were not isolated errors; they pointed to a systemic issue: seeing is not the same as understanding. Even GPT-4o, a leading model, completed only a small fraction of subtasks through genuine reasoning, with the rest succeeding coincidentally or superficially. This highlights a critical gap in current models: they can process visual input but lack deep, grounded cognitive alignment.

X-VLA addresses this gap through three core innovations:

1. Efficient model architecture: a lightweight, scalable design built on a simplified Transformer with a novel Soft-Prompt mechanism, enabling high performance with minimal parameters (an illustrative sketch of soft-prompt conditioning appears at the end of this article).

2. Large-scale, high-quality heterogeneous data pretraining: training on diverse real-world and simulated data that bridges vision, language, and action.

3. Customized post-training pipeline: a carefully designed fine-tuning strategy with adaptive learning rates and a slow-start mechanism, ensuring stable and efficient knowledge transfer from general capabilities to specific tasks (see the warmup-schedule sketch at the end of this article).

The model's pretraining follows clear scaling laws: performance increases roughly linearly as data and model size grow, demonstrating the architecture's scalability. In post-training, X-VLA is highly data-efficient, achieving state-of-the-art (SOTA) results on benchmarks such as LIBERO and SIMPLER with only moderate amounts of task-specific data. It also excels in real-world robotic deployment, successfully performing complex desktop manipulation and autonomous clothing-folding tasks. Notably, X-VLA completed a full 120-minute autonomous folding sequence without human intervention and can transfer zero-shot to entirely new environments, demonstrating its robustness in long-horizon, real-world scenarios.

Project page: https://thu-air-dream.github.io/X-VLA/
Code and model weights: https://github.com/2toinf/X-VLA.git

Authors: Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, Ya-Qin Zhang, Jiangmiao Pang, Jingjing Liu, Tai Wang, Xianyuan Zhan
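To make the Soft-Prompt idea mentioned above concrete, the sketch below shows one common way soft-prompt conditioning can be wired into a transformer policy: each robot embodiment gets its own small set of learnable prompt tokens that are prepended to the observation tokens. All class names, shapes, and hyperparameters here (SoftPromptPolicy, prompt_len, action_dim, etc.) are assumptions for illustration only and do not reflect the actual X-VLA architecture or code.

```python
# Illustrative sketch (NOT the X-VLA implementation): soft-prompt conditioning
# for a cross-embodiment transformer policy. Every name and shape is assumed.
import torch
import torch.nn as nn


class SoftPromptPolicy(nn.Module):
    """Toy policy: learnable per-embodiment prompt tokens are prepended to the
    observation tokens, so one transformer can serve multiple robot bodies."""

    def __init__(self, num_embodiments=4, prompt_len=8, d_model=256,
                 nhead=8, num_layers=4, action_dim=7):
        super().__init__()
        # One learnable soft prompt (prompt_len x d_model) per embodiment.
        self.soft_prompts = nn.Parameter(
            torch.randn(num_embodiments, prompt_len, d_model) * 0.02)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, obs_tokens, embodiment_id):
        # obs_tokens: (batch, seq_len, d_model) -- already-embedded observations.
        # embodiment_id: (batch,) integer index selecting each sample's prompt.
        prompts = self.soft_prompts[embodiment_id]        # (batch, prompt_len, d_model)
        tokens = torch.cat([prompts, obs_tokens], dim=1)  # prepend the soft prompt
        encoded = self.encoder(tokens)
        # Read the action from the last token's representation (one possible choice).
        return self.action_head(encoded[:, -1])


if __name__ == "__main__":
    policy = SoftPromptPolicy()
    obs = torch.randn(2, 16, 256)        # 2 samples, 16 observation tokens each
    emb_id = torch.tensor([0, 3])        # two different robot embodiments
    print(policy(obs, emb_id).shape)     # torch.Size([2, 7])
```

The design intent illustrated here is that only the prompt table grows with the number of embodiments, while the transformer backbone and action head stay shared, which is one plausible reason such a scheme stays parameter-efficient.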

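The "slow-start" fine-tuning mechanism referenced in the third innovation can be illustrated with a generic learning-rate schedule: a linear warmup that ramps the rate up from zero, followed by a smooth cosine decay. This is a standard recipe for stabilizing transfer, not the specific schedule used by X-VLA; all step counts and rates below are assumed values.

```python
# Hedged sketch of a slow-start fine-tuning schedule (linear warmup + cosine
# decay). Generic recipe only; not X-VLA's actual post-training configuration.
import math
import torch

model = torch.nn.Linear(256, 7)                      # stand-in for a policy network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

warmup_steps, total_steps = 1_000, 50_000

def lr_lambda(step):
    if step < warmup_steps:
        # Slow start: ramp the learning rate linearly from 0 to its peak value.
        return step / max(1, warmup_steps)
    # Afterwards, decay smoothly toward zero with a cosine curve.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Typical usage inside a fine-tuning loop (loader and compute_loss are placeholders):
# for step, batch in enumerate(loader):
#     loss = compute_loss(model, batch)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
#     scheduler.step()
```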