EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control

Abstract
The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general-purpose embodied intelligent systems. Recent vision-language-action (VLA) models, which are co-trained on large-scale robot and visual-text data, have demonstrated notable progress in general robot control. However, they still fail to achieve human-level flexibility in interleaved reasoning and interaction. In this work, we introduce EO-Robotics, consisting of the EO-1 model and the EO-Data1.5M dataset. EO-1 is a unified embodied foundation model that achieves superior performance in multimodal embodied reasoning and robot control through interleaved vision-text-action pre-training. The development of EO-1 is based on two key pillars: (i) a unified architecture that processes multimodal inputs (image, text, video, and action) indiscriminately, and (ii) a massive, high-quality multimodal embodied reasoning dataset, EO-Data1.5M, which contains over 1.5 million samples with an emphasis on interleaved vision-text-action comprehension. EO-1 is trained through synergies between auto-regressive decoding and flow matching denoising on EO-Data1.5M, enabling seamless robot action generation and multimodal embodied reasoning. Extensive experiments demonstrate the effectiveness of interleaved vision-text-action learning for open-world understanding and generalization, validated on a variety of long-horizon, dexterous manipulation tasks across multiple embodiments. This paper details the architecture of EO-1, the data construction strategy of EO-Data1.5M, and the training methodology, offering valuable insights for developing advanced embodied foundation models.
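To make the combination of auto-regressive decoding and flow matching denoising concrete, the following is a minimal, illustrative sketch rather than the EO-1 implementation: a single transformer backbone processes an interleaved sequence of text tokens and noisy action embeddings, a language-model head is trained with next-token cross-entropy, and a flow-matching head regresses the velocity of a linear noise-to-action path. All names, dimensions, and hyperparameters (UnifiedEmbodiedModel, d_model, action_dim, the chunk lengths) are assumptions for this example only.

```python
import torch
import torch.nn as nn

class UnifiedEmbodiedModel(nn.Module):
    """Hypothetical sketch: one shared backbone, two heads.
    The LM head decodes discrete text/vision tokens auto-regressively;
    the flow head predicts velocities to denoise continuous action chunks."""
    def __init__(self, d_model=512, vocab_size=32000, action_dim=7,
                 n_layers=4, n_heads=8):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # Project noisy action (+ flow time t) into the shared token space.
        self.action_proj = nn.Linear(action_dim + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)    # auto-regressive decoding
        self.flow_head = nn.Linear(d_model, action_dim)  # flow-matching velocity

    def forward(self, text_tokens, noisy_actions, t):
        txt = self.token_embed(text_tokens)                        # (B, Lt, D)
        t_feat = t[:, None, None].expand(-1, noisy_actions.size(1), 1)
        act = self.action_proj(torch.cat([noisy_actions, t_feat], dim=-1))
        # Interleave both modalities into one sequence for the shared backbone.
        h = self.backbone(torch.cat([txt, act], dim=1))
        h_txt, h_act = h[:, :txt.size(1)], h[:, txt.size(1):]
        return self.lm_head(h_txt), self.flow_head(h_act)

# Joint training objective: next-token cross-entropy + flow-matching regression.
model = UnifiedEmbodiedModel()
B, Lt, La, A = 2, 16, 8, 7
tokens = torch.randint(0, 32000, (B, Lt))
actions = torch.randn(B, La, A)                  # ground-truth action chunk
noise = torch.randn_like(actions)
t = torch.rand(B)                                # flow time in [0, 1]
noisy = (1 - t)[:, None, None] * noise + t[:, None, None] * actions
target_velocity = actions - noise                # linear-path flow-matching target

logits, velocity = model(tokens, noisy, t)
ar_loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 32000), tokens[:, 1:].reshape(-1))
fm_loss = nn.functional.mse_loss(velocity, target_velocity)
loss = ar_loss + fm_loss
```

At inference, such a model would decode text tokens step by step while integrating the predicted velocity field from pure noise to an action chunk; the abstract's claim is that training both objectives on interleaved vision-text-action data lets the two modes reinforce each other.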