Meta's V-JEPA 2 Model Trains Robots Using Raw Video, 30x Faster than NVIDIA Cosmos
On Wednesday, Meta announced V-JEPA 2, a new AI model that builds on the original V-JEPA model released last year. V-JEPA 2 is a "world model" designed to help AI agents and robots understand their physical surroundings, a notable step forward for embodied AI and robotics. The model's defining ability is predicting how physical interactions will play out, much the way young children and animals intuitively grasp concepts such as gravity and trajectory.

Overview of V-JEPA 2

V-JEPA 2 is a 1.2-billion-parameter model trained in two stages. In the first stage, it learns from more than 1 million hours of video and 1 million images without any human supervision, picking up patterns in how objects move and interact. This large-scale pre-training gives the model a broad foundation for how objects behave across a wide range of scenarios.

In the second stage, the model undergoes action-conditioned training on a much smaller dataset of roughly 62 hours of robot control data. This step is crucial because it lets the model factor an agent's own actions, such as a robot arm's movements, into its predictions. For simple tasks like pick-and-place, V-JEPA 2 generates candidate actions and scores them by their predicted outcomes. For more complex tasks, such as placing an object accurately in a specific location, the model plans against a series of visual subgoals that guide the robot step by step.
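The planning loop described above, generating candidate actions and scoring them by predicted outcomes, resembles standard sampling-based planning with a learned dynamics model. The sketch below illustrates only that general idea: `encode`, `predict`, and all of the dimensions are hypothetical placeholders, not V-JEPA 2's actual interfaces, and Meta's planner may search over actions quite differently.

```python
# Toy sketch of planning by prediction with a learned world model.
# All components are hypothetical stand-ins, not Meta's released code or API.
import numpy as np

ACTION_DIM, LATENT_DIM = 7, 128                  # e.g., a 7-DoF arm command; toy latent size
_rng = np.random.default_rng(42)
A = 0.05 * _rng.normal(size=(LATENT_DIM, ACTION_DIM))  # toy action-to-latent coupling

def encode(image):
    # Placeholder encoder: flatten the image into a fixed-size latent vector.
    flat = np.asarray(image, dtype=np.float32).ravel()
    return flat[:LATENT_DIM] if flat.size >= LATENT_DIM else np.pad(flat, (0, LATENT_DIM - flat.size))

def predict(state, action):
    # Placeholder action-conditioned predictor: nudge the latent state as a function of the action.
    return state + A @ np.tanh(action)

def plan_first_action(current_image, goal_image, horizon=5, num_candidates=256, seed=0):
    """Random-shooting planner: sample candidate action sequences, roll each one out
    with the predictor, and return the first action of the sequence whose predicted
    final state lands closest to the goal embedding."""
    rng = np.random.default_rng(seed)
    start, goal = encode(current_image), encode(goal_image)
    best_cost, best_action = np.inf, None
    for _ in range(num_candidates):
        actions = rng.normal(size=(horizon, ACTION_DIM))
        state = start
        for a in actions:
            state = predict(state, a)
        cost = float(np.linalg.norm(state - goal))   # distance to the goal in latent space
        if cost < best_cost:
            best_cost, best_action = cost, actions[0]
    return best_action

# Example: plan toward a goal frame, execute one action, then observe and re-plan.
first_action = plan_first_action(np.zeros((16, 16)), np.ones((16, 16)))
```

In a real control loop the robot would execute only that first action, observe the new camera frame, and re-plan; for longer tasks, as noted above, the model works toward a sequence of visual subgoals rather than a single distant goal.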
Testing and Performance

Meta has tested V-JEPA 2 internally on Franka robot arms in its labs. In those tests the model handled common manipulation tasks with high success rates, reaching 65% to 80% on pick-and-place tasks in environments it had not seen before. The results are notable because the model works only from visual data captured from the robot's perspective, without extensive domain-specific training or data collection.

Still, a clear gap remains between V-JEPA 2 and human performance. Meta acknowledges this and says further advances are needed, including models that reason over multiple timescales and incorporate additional modalities, such as audio and tactile information, for a more complete understanding of the physical world.

Industry Impact and Future Directions

Yann LeCun, Meta's chief AI scientist, argues that world models will usher in a new era for robotics, enabling real-world AI agents to help with household chores and physical tasks without needing vast amounts of robotic training data. That could significantly lower the barriers to entry for companies and researchers building more capable robots.

To encourage broader work on world models, Meta has released the V-JEPA 2 code and model checkpoints for both commercial and research use. It has also published three new benchmarks for assessing physical understanding from video, giving researchers and developers a standardized way to measure and improve their own models.

Comparison with Competitors

Meta reports that V-JEPA 2 is 30 times faster than Nvidia's Cosmos model, another prominent world model for embodied AI, though the two companies may be measuring against different benchmarks, so the figures may not be directly comparable. Google DeepMind's Genie and World Labs' large world models likewise aim to give AI a better grasp of the physical world, and the competition underscores how central world models have become to progress in robotics and AI.

Limitations and Next Steps

Despite these advances, V-JEPA 2 still has clear limitations. Chief among them is its reliance on vision alone: in long-horizon planning tasks, prediction errors accumulate, and the lack of other sensor inputs compounds the problem. To address this, Meta plans to explore adding modalities such as hearing and touch to build more capable and reliable agents.

Industry Insights

Industry observers see V-JEPA 2 as a meaningful step toward robots that adapt efficiently to real-world environments, and they point to the release of its code and checkpoints as a boon for collaboration among researchers and developers. With Meta, Google DeepMind, and World Labs all pushing on world models, the field is expected to advance quickly over the coming years.

Amanda Silberling Profile

Amanda Silberling is a senior writer at TechCrunch covering the intersection of technology and culture. She has written for publications including Polygon, MTV, the Kenyon Review, NPR, and Business Insider, and previously worked as a grassroots organizer, museum educator, and film festival coordinator. She holds a B.A. in English from the University of Pennsylvania and was a Princeton in Asia Fellow in Laos. At TechCrunch she often explores the broader societal impact of technological innovations, such as the ethical and practical considerations of AI and robotics.