
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, XinQiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin
Abstract

Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial, and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces dynamic-region-guided world knowledge prediction, integrated with spatial and semantic cues, which provides compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial, and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments in both real-world and simulation environments demonstrate that DreamVLA achieves a 76.7% success rate on real robot tasks and a 4.44 average length on the CALVIN ABC-D benchmark.
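
The block-wise structured attention described above can be illustrated with a small sketch. The code below is not the authors' implementation; it is a minimal PyTorch example of one way to build such a mask, assuming a token layout of a shared prefix (observation, language, robot state) followed by separate dynamic, spatial, and semantic query blocks, with all token counts chosen for illustration only.

```python
# Illustrative sketch of a block-wise structured attention mask (assumed layout,
# not the released DreamVLA code): each world-knowledge block attends to the
# shared prefix and to itself, while cross-block attention is masked so the
# dynamic, spatial, and semantic representations stay disentangled.
import torch
import torch.nn.functional as F

def build_blockwise_mask(n_prefix: int, n_dyn: int, n_spa: int, n_sem: int) -> torch.Tensor:
    """Return a boolean mask of shape (L, L) where True means attention is allowed."""
    sizes = [n_prefix, n_dyn, n_spa, n_sem]
    L = sum(sizes)
    mask = torch.zeros(L, L, dtype=torch.bool)

    # Prefix tokens (observation / language / state) attend among themselves.
    mask[:n_prefix, :n_prefix] = True

    # Each knowledge block attends to the prefix and to itself; the cross-block
    # entries are left False, which prevents information leakage between blocks.
    start = n_prefix
    for n in sizes[1:]:
        end = start + n
        mask[start:end, :n_prefix] = True   # block -> shared prefix
        mask[start:end, start:end] = True   # block -> itself
        start = end
    return mask

# Example: 32 prefix tokens and 8 query tokens per knowledge block.
attn_mask = build_blockwise_mask(32, 8, 8, 8)

# This True-means-attend convention matches F.scaled_dot_product_attention;
# other attention APIs may expect the inverted or additive (-inf) form.
q = k = v = torch.randn(1, 4, attn_mask.shape[0], 16)  # (batch, heads, tokens, dim)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```

Whether the prefix also attends back into the knowledge blocks, and how the action queries are wired in, are design choices not specified in the abstract; the sketch simply shows the mutual masking among the three blocks.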