LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks

Real-world embodied agents face long-horizon tasks, characterized by high-level goals demanding multi-step solutions beyond single actions. Successfully navigating these requires both high-level task planning (i.e., decomposing goals into sub-tasks) and low-level motion control (i.e., generating precise robot actions). While existing vision-language-action (VLA) models and hierarchical architectures offer potential in embodied tasks, the former often falter in planning, and the latter can suffer from coordination issues, both hampering performance. We introduce a new unified VLA framework for long-horizon tasks, dubbed LoHoVLA, to overcome these limitations. LoHoVLA leverages a large pretrained vision-language model (VLM) as the backbone to jointly generate language and action tokens for sub-task generation and robot action prediction, respectively. This shared representation promotes better generalization across tasks. Additionally, LoHoVLA embraces a hierarchical closed-loop control mechanism to mitigate errors originating from both high-level planning and low-level control. To train LoHoVLA, we introduce LoHoSet, a dataset built on the Ravens simulator, containing 20 long-horizon tasks, each with 1,000 expert demonstrations composed of visual observations, linguistic goals, sub-tasks, and robot actions. Experimental results show that LoHoVLA significantly surpasses both hierarchical and standard VLA approaches on long-horizon embodied tasks in the Ravens simulator. These findings underscore the promise of unified architectures for advancing generalizable embodied intelligence.
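
To make the described architecture concrete, the sketch below illustrates, under stated assumptions, how a single unified model might re-plan a sub-task and predict an action from each new observation in a hierarchical closed loop, alongside a LoHoSet-style demonstration record. It is a minimal illustration only: the interface names (UnifiedVLA, generate_subtask, generate_action), the environment API, and the Demonstration field types are assumptions, not the paper's actual code.

# Hypothetical sketch of LoHoVLA-style hierarchical closed-loop control.
# All names and interfaces here are illustrative assumptions.
from dataclasses import dataclass
from typing import Protocol
import numpy as np


@dataclass
class Demonstration:
    """One LoHoSet-style step: observation, goal, sub-task, action (assumed layout)."""
    observation: np.ndarray   # visual observation, e.g., an RGB image
    goal: str                 # linguistic long-horizon goal
    subtask: str              # annotated sub-task for this step
    action: np.ndarray        # low-level robot action, e.g., a pick/place pose


class UnifiedVLA(Protocol):
    """A single VLM backbone that emits both language and action tokens."""
    def generate_subtask(self, observation: np.ndarray, goal: str) -> str: ...
    def generate_action(self, observation: np.ndarray, goal: str,
                        subtask: str) -> np.ndarray: ...


def closed_loop_rollout(model: UnifiedVLA, env, goal: str,
                        max_steps: int = 50) -> bool:
    """Re-plan the sub-task and re-predict the action from every new
    observation, so errors at either level can be corrected online."""
    obs = env.reset()
    for _ in range(max_steps):
        subtask = model.generate_subtask(obs, goal)          # high-level planning
        action = model.generate_action(obs, goal, subtask)   # low-level control
        obs, done = env.step(action)
        if done:
            return True
    return False

Re-querying both levels at every step is what distinguishes this closed-loop scheme from open-loop hierarchical pipelines, where a fixed plan is executed regardless of intermediate failures.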