0.9B X-VLA Achieves 5 State-of-the-Art Results in Embodied AI
If you didn’t know his research focus, it would be hard to connect Jin-Yuan Zhan with the field of embodied intelligence. He began his academic journey at Tsinghua University studying civil engineering, then pursued a PhD in transportation engineering at Purdue University, where he spent half his time in the computer science department working on machine learning. After graduation, he joined Microsoft Research Asia, and later followed his former supervisor to JD Technology, where he led a project on offline reinforcement learning for power plant optimization that was successfully productized and deployed across multiple power stations in China. In 2021, he returned to Tsinghua University to focus full-time on academic research. “To be honest,” he says with a smile, “I just wanted the freedom to explore what truly interests me.”

On the surface, his career path, from civil engineering to transportation, and from industrial control to autonomous driving and embodied intelligence, looks like a series of career shifts. Abstracted one level up, however, a clear thread emerges: how to use data-driven decision optimization to enable intelligent agents to solve real-world problems more effectively.

It is this underlying mission that led him to recognize a critical bottleneck in the era of large models: the heterogeneity across different robotic embodiments. Differences in hardware, sensing, and control systems create data silos, making so-called “general VLA” models fragile when transferred across platforms. X-VLA was born from this insight.

Over the past 11 months, Zhan and his students experimented with dozens of model architectures, ranging from unified action spaces to various intermediate representation mappings, before settling on a key innovation: handling embodiment heterogeneity at the model’s input stage.
They introduced a learnable soft prompt to encode each robot’s unique “body characteristics,” allowing the Transformer backbone to focus on learning universal task patterns across diverse platforms. The result was striking: a 0.9-billion-parameter model that achieved state-of-the-art (SOTA) performance across five major simulation benchmarks. It learned to fold clothes, a complex long-horizon task, using only 1,200 demonstration trajectories. Even more impressively, it demonstrated zero-shot transfer to entirely new environments. At the IROS 2025 AGIBOT World Challenge in Hangzhou, Zhan’s team partnered with the Shanghai Artificial Intelligence Laboratory and won first place.

Why Return to Academia?

Zhan explains that while industry allows impactful, practical work, academic research offers greater freedom to pursue frontier topics. “I wanted to explore things I’m passionate about without constraints.”

From Industrial Control to Embodied Intelligence

He sees industrial control, autonomous driving, and robotics as variations of the same core challenge: decision optimization under uncertainty. The shift to embodied intelligence wasn’t a departure but a natural evolution. With large models advancing robot cognition and decision-making, the field is now ripe for tackling complex, general-purpose tasks that were previously out of reach.

Will This Research Be Commercialized?

Zhan believes the current moment is critical for foundational research. While some humanoid robots are showing promise, true productization, especially in homes or public services, will likely take another 3 to 5 years. “Right now, we need to build robust, scalable, and transferable frameworks,” he says. “We must ensure that performance keeps improving with more data and compute: what we call a strong scaling law. Without that, scaling up blindly won’t help.”

Why “Small but Strong”?

Zhan emphasizes that 0.9B parameters is small by today’s standards; most comparable models range from 3B to 72B.
But in embodied intelligence, size matters less than efficiency and deployability. “Future robots must run on-device. We can’t rely on cloud computing for every action.” X-VLA’s compact size, combined with high performance, makes it ideal for real-world deployment.

What Makes X-VLA Different?

Unlike many large VLA models built on generic vision-language models (VLMs), X-VLA uses Florence, a smaller but more embodiment-aware model trained on visual grounding, object relationships, and physical reasoning. This foundation, paired with the soft prompt mechanism and a streamlined Transformer architecture, enables high efficiency and strong generalization.

Scaling: Data or Model?

Zhan favors a dual approach. While the model architecture can still be improved, expanding data, especially with full-body humanoids, will be key. The team is also working to integrate embodied reasoning into the model, enabling it to make logical decisions during long-horizon tasks.

Unexpected Successes

Two results surprised the team. First, the model learned a full clothing-folding policy from just 1,200 demonstrations, behaving in a remarkably human-like way and adapting when things went wrong. Second, when tested in a high-variability exhibition environment, far from the lab’s controlled setup, the model worked flawlessly without any fine-tuning. “We didn’t expect it,” Zhan admits. “It showed true generalization.” Another breakthrough came from a minimal LoRA fine-tuning experiment: using only 9MB of trainable parameters, the model matched full fine-tuning performance on two benchmarks. “That’s when I truly believed in our approach,” says PhD student Jin-Liang Zheng.

Real-World Applications

Zhan expects near-term deployment in semi-open environments, such as sorting, assembly, and table-top manipulation, where conditions are controlled and tasks are well-defined. Full home automation remains years away.
But with continued scaling and refinement, X-VLA could soon reach commercial readiness in specific, high-impact scenarios. The team’s work, detailed in arXiv:2510.10274, marks a significant step toward truly general, efficient, and deployable embodied intelligence.
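The core idea of the soft-prompt mechanism described above can be illustrated with a minimal sketch. This is not X-VLA’s actual code; the class name, dimensions, and backbone configuration are illustrative assumptions. It shows the general pattern: each robot embodiment gets its own learnable prompt tokens, which are prepended to the fused input sequence so that a single shared Transformer can condition on body-specific characteristics.

```python
# Minimal illustrative sketch (NOT the X-VLA implementation): one learnable
# soft prompt per robot embodiment, prepended to the token sequence so a
# shared Transformer backbone can condition on body-specific characteristics.
import torch
import torch.nn as nn

class SoftPromptedBackbone(nn.Module):
    def __init__(self, num_embodiments: int, prompt_len: int = 8, d_model: int = 256):
        super().__init__()
        # One learnable (prompt_len x d_model) prompt per embodiment.
        self.prompts = nn.Parameter(
            torch.randn(num_embodiments, prompt_len, d_model) * 0.02
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens: torch.Tensor, embodiment_id: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, d_model) fused vision/language/state features.
        prompt = self.prompts[embodiment_id]        # (B, prompt_len, d_model)
        x = torch.cat([prompt, tokens], dim=1)      # prepend body-specific context
        out = self.backbone(x)
        return out[:, prompt.shape[1]:]             # drop the prompt positions

model = SoftPromptedBackbone(num_embodiments=3)
feats = torch.randn(2, 10, 256)                     # batch of 2, 10 tokens each
ids = torch.tensor([0, 2])                          # each sample's embodiment
out = model(feats, ids)
print(out.shape)                                    # torch.Size([2, 10, 256])
```

The appeal of this design is that only the small prompt tensors carry embodiment-specific information, while the backbone’s weights are shared across all platforms, which is consistent with the article’s account of why a compact model can still generalize across robots.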
