villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

Visual-Language-Action (VLA) models have emerged as a popular paradigm for learning robot manipulation policies that can follow language instructions and generalize to novel scenarios. Recent work has begun to explore the incorporation of latent actions, an abstract representation of visual change between two frames, into VLA pre-training. In this paper, we introduce villa-X, a novel Visual-Language-Latent-Action (ViLLA) framework that advances latent action modeling for learning generalizable robot manipulation policies. Our approach improves both how latent actions are learned and how they are incorporated into VLA pre-training. Together, these contributions enable villa-X to achieve superior performance across simulated environments including SIMPLER and LIBERO, as well as on two real-world robot setups including gripper and dexterous hand manipulation. We believe the ViLLA paradigm holds significant promise, and that our villa-X provides a strong foundation for future research.