EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control

Abstract
The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general-purpose embodied intelligent systems. Recent vision-language-action (VLA) models, which are co-trained on large-scale robot and visual-text data, have demonstrated notable progress in general robot control. However, they still fail to achieve human-level flexibility in interleaved reasoning and interaction. In this work, we introduce EO-Robotics, consisting of the EO-1 model and the EO-Data1.5M dataset. EO-1 is a unified embodied foundation model that achieves superior performance in multimodal embodied reasoning and robot control through interleaved vision-text-action pre-training. The development of EO-1 is based on two key pillars: (i) a unified architecture that processes multimodal inputs (image, text, video, and action) indiscriminately, and (ii) a massive, high-quality multimodal embodied reasoning dataset, EO-Data1.5M, which contains over 1.5 million samples with an emphasis on interleaved vision-text-action comprehension. EO-1 is trained through synergies between auto-regressive decoding and flow matching denoising on EO-Data1.5M, enabling seamless robot action generation and multimodal embodied reasoning. Extensive experiments demonstrate the effectiveness of interleaved vision-text-action learning for open-world understanding and generalization, validated on a variety of long-horizon, dexterous manipulation tasks across multiple embodiments. This paper details the architecture of EO-1, the data construction strategy of EO-Data1.5M, and the training methodology, offering valuable insights for developing advanced embodied foundation models.
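To make the combination of auto-regressive decoding and flow matching denoising concrete, the following is a minimal, illustrative sketch rather than the EO-1 implementation: a single transformer backbone processes an interleaved sequence of text tokens and noisy action embeddings, a language-model head is trained with next-token cross-entropy, and a flow-matching head regresses the velocity of a linear noise-to-action path. All names, dimensions, and hyperparameters (UnifiedEmbodiedModel, d_model, action_dim, the chunk lengths) are assumptions for this example only.

```python
import torch
import torch.nn as nn

class UnifiedEmbodiedModel(nn.Module):
    """Hypothetical sketch: one shared backbone, two heads.
    The LM head decodes discrete text/vision tokens auto-regressively;
    the flow head predicts velocities to denoise continuous action chunks."""
    def __init__(self, d_model=512, vocab_size=32000, action_dim=7,
                 n_layers=4, n_heads=8):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # Project noisy action (+ flow time t) into the shared token space.
        self.action_proj = nn.Linear(action_dim + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)    # auto-regressive decoding
        self.flow_head = nn.Linear(d_model, action_dim)  # flow-matching velocity

    def forward(self, text_tokens, noisy_actions, t):
        txt = self.token_embed(text_tokens)                        # (B, Lt, D)
        t_feat = t[:, None, None].expand(-1, noisy_actions.size(1), 1)
        act = self.action_proj(torch.cat([noisy_actions, t_feat], dim=-1))
        # Interleave both modalities into one sequence for the shared backbone.
        h = self.backbone(torch.cat([txt, act], dim=1))
        h_txt, h_act = h[:, :txt.size(1)], h[:, txt.size(1):]
        return self.lm_head(h_txt), self.flow_head(h_act)

# Joint training objective: next-token cross-entropy + flow-matching regression.
model = UnifiedEmbodiedModel()
B, Lt, La, A = 2, 16, 8, 7
tokens = torch.randint(0, 32000, (B, Lt))
actions = torch.randn(B, La, A)                  # ground-truth action chunk
noise = torch.randn_like(actions)
t = torch.rand(B)                                # flow time in [0, 1]
noisy = (1 - t)[:, None, None] * noise + t[:, None, None] * actions
target_velocity = actions - noise                # linear-path flow-matching target

logits, velocity = model(tokens, noisy, t)
ar_loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 32000), tokens[:, 1:].reshape(-1))
fm_loss = nn.functional.mse_loss(velocity, target_velocity)
loss = ar_loss + fm_loss
```

At inference, such a model would decode text tokens step by step while integrating the predicted velocity field from pure noise to an action chunk; the abstract's claim is that training both objectives on interleaved vision-text-action data lets the two modes reinforce each other.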