
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, Zhiquan Qi, Yitao Liang, Yuanpei Chen, Yaodong Yang
Abstract

The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation have sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of action tokens that progressively encode more grounded and actionable information, ultimately generating executable actions. We further determine that the primary design choice distinguishing VLA models lies in how action tokens are formulated, which can be categorized into language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning. However, there remains a lack of comprehensive understanding regarding action tokens, significantly impeding effective VLA development and obscuring future directions. Therefore, this survey aims to categorize and interpret existing VLA research through the lens of action tokenization, distill the strengths and limitations of each token type, and identify areas for improvement. Through this systematic review and analysis, we offer a synthesized outlook on the broader evolution of VLA models, highlight underexplored yet promising directions, and contribute guidance for future research, hoping to bring the field closer to general-purpose intelligence.
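To make the unified framework described in the abstract concrete, the sketch below shows vision and language inputs flowing through a chain of modules, each refining a list of action tokens until executable actions emerge. It is a minimal illustration only: the class and function names (ActionToken, run_vla_pipeline, language_planner, low_level_policy) are hypothetical and do not correspond to any specific VLA model or library from the survey.

```python
# Minimal, illustrative sketch of the unified VLA framework: a chain of
# modules that progressively turn vision + language inputs into action tokens.
# All names here are hypothetical and for illustration only.

from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ActionToken:
    """An intermediate representation between perception and motor commands.

    `kind` could be any category named in the survey, e.g. "language_description",
    "code", "affordance", "trajectory", "goal_state", "latent", "raw_action",
    or "reasoning".
    """
    kind: str
    payload: object


# A VLA "module" is anything that refines the current chain of action tokens,
# given the (fixed) vision and language inputs.
VLAModule = Callable[[object, str, List[ActionToken]], List[ActionToken]]


def run_vla_pipeline(image: object,
                     instruction: str,
                     modules: Sequence[VLAModule]) -> List[ActionToken]:
    """Pass vision/language inputs through a chain of modules.

    Each module extends or rewrites the token chain with progressively more
    grounded information, until the final tokens are directly executable.
    """
    tokens: List[ActionToken] = []
    for module in modules:
        tokens = module(image, instruction, tokens)
    return tokens


# Hypothetical modules: a planner that emits a language sub-goal, then a
# low-level policy that grounds it into raw joint-space actions.
def language_planner(image, instruction, tokens):
    return tokens + [ActionToken("language_description", f"plan for: {instruction}")]


def low_level_policy(image, instruction, tokens):
    return tokens + [ActionToken("raw_action", [0.0] * 7)]  # e.g. 7-DoF deltas


if __name__ == "__main__":
    actions = run_vla_pipeline(image=None,
                               instruction="pick up the red block",
                               modules=[language_planner, low_level_policy])
    print([t.kind for t in actions])  # ['language_description', 'raw_action']
```

Under this reading, the survey's token taxonomy corresponds to different choices for the intermediate `kind` values and for which modules produce and consume them.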