villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

Visual-Language-Action (VLA) models have emerged as a popular paradigm for learning robot manipulation policies that can follow language instructions and generalize to novel scenarios. Recent work has begun to explore the incorporation of latent actions, an abstract representation of visual change between two frames, into VLA pre-training. In this paper, we introduce villa-X, a novel Visual-Language-Latent-Action (ViLLA) framework that advances latent action modeling for learning generalizable robot manipulation policies. Our approach improves both how latent actions are learned and how they are incorporated into VLA pre-training. Together, these contributions enable villa-X to achieve superior performance across simulated environments including SIMPLER and LIBERO, as well as on two real-world robot setups including gripper and dexterous hand manipulation. We believe the ViLLA paradigm holds significant promise, and that our villa-X provides a strong foundation for future research.