SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Vision-language models (VLMs) pretrained on large-scale multimodal datasets encode rich visual and linguistic knowledge, making them a strong foundation for robotics. Rather than training robotic policies from scratch, recent approaches adapt VLMs into vision-language-action (VLA) models that enable natural language-driven perception and control. However, existing VLAs are typically massive--often with billions of parameters--leading to high training costs and limited real-world deployability. Moreover, they rely on academic and industrial datasets, overlooking the growing availability of community-collected data from affordable robotic platforms. In this work, we present SmolVLA, a small, efficient, and community-driven VLA that drastically reduces both training and inference costs while retaining competitive performance. SmolVLA is designed to be trained on a single GPU and deployed on consumer-grade GPUs or even CPUs. To further improve responsiveness, we introduce an asynchronous inference stack that decouples perception and action prediction from action execution, allowing higher control rates with chunked action generation. Despite its compact size, SmolVLA achieves performance comparable to VLAs that are 10x larger. We evaluate SmolVLA on a range of both simulated and real-world robotic benchmarks and release all code, pretrained models, and training data.
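
As a rough illustration of the asynchronous inference idea described above, the Python sketch below runs action prediction in a background thread while a fixed-rate control loop consumes previously predicted action chunks. The policy, observation, and robot objects here are hypothetical placeholders, not SmolVLA's actual interfaces.

```python
"""Minimal sketch of asynchronous, chunked inference (illustrative only;
the policy/observation/robot stand-ins below are hypothetical, not SmolVLA's API)."""
import queue
import threading
import time

CHUNK_SIZE = 10          # actions produced per inference call
CONTROL_PERIOD_S = 0.05  # execute actions at 20 Hz

class DummyPolicy:
    """Stand-in for a VLA policy: returns a chunk of placeholder actions."""
    def predict_chunk(self, observation, n):
        time.sleep(0.3)  # simulate slow model inference
        return [f"action_{i}" for i in range(n)]

def inference_worker(policy, chunk_queue, stop):
    """Perception + action prediction run here, decoupled from execution."""
    while not stop.is_set():
        observation = "latest_camera_and_state"      # placeholder observation
        chunk = policy.predict_chunk(observation, CHUNK_SIZE)
        chunk_queue.put(chunk)                       # blocks while a chunk is still pending

def control_loop(chunk_queue, steps=40):
    """Executes one action per tick; only waits on the model between chunks."""
    pending = []
    for _ in range(steps):
        if not pending:
            pending = list(chunk_queue.get())        # swap in the next predicted chunk
        action = pending.pop(0)
        print("executing", action)                   # stand-in for sending the action to the robot
        time.sleep(CONTROL_PERIOD_S)

if __name__ == "__main__":
    q = queue.Queue(maxsize=1)
    stop = threading.Event()
    worker = threading.Thread(target=inference_worker,
                              args=(DummyPolicy(), q, stop), daemon=True)
    worker.start()
    control_loop(q)
    stop.set()
```

Because the next chunk is predicted while the current one is still being executed, the control loop can run at a higher rate than the model's inference latency would otherwise allow.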