Boost Robot Training with NVIDIA’s World Foundation Models and Advanced Workflows in R²D²
NVIDIA Research has introduced a new suite of World Foundation Models (WFMs) under the NVIDIA Cosmos platform, designed to transform robot training by overcoming the limitations of real-world data collection. As physical AI systems like robots and autonomous vehicles grow more complex, the need for vast, diverse, and accurately labeled training data has outpaced what can be gathered manually. Cosmos addresses this challenge by enabling synthetic data generation and intelligent data curation through three core model types: Cosmos Predict, Cosmos Transfer, and Cosmos Reason.

Cosmos Predict generates realistic future video frames from inputs such as images, videos, or text prompts. These models simulate physically accurate world states, making them well suited for training AI systems in complex, dynamic environments. One notable application is Single2MultiView, a post-trained version of Cosmos Predict that synthesizes multiple synchronized camera views from a single front-facing autonomous driving video. This aids autonomous vehicle development by creating rich, multi-perspective datasets without relying on expensive real-world sensor arrays.

Another key application is GR00T-Dreams, where Cosmos Predict is used to generate neural trajectories for robot tasks. For example, it enables a robot to plan and execute a plant-watering motion based on synthetic data, demonstrating strong sim-to-real transfer. Additionally, DiffusionRenderer, built on Cosmos, allows for advanced image and video re-lighting by applying novel lighting conditions to rendered scenes, improving realism and diversity in synthetic datasets.

Cosmos Transfer enables precise control over synthetic data generation using multimodal inputs such as segmentation maps, depth, edge maps, LiDAR scans, keypoints, and HD maps.
By combining these with natural language prompts, users can generate diverse and realistic scenarios, such as a snowy day or a nighttime driving scene, from the same source video. This capability is critical for testing edge cases and ensuring robustness in robotics and autonomous systems.

Cosmos Reason is a vision-language-action model trained for long-horizon reasoning in physical environments. It uses chain-of-thought reasoning to understand physical constraints, predict action sequences, and evaluate the quality of synthetic data. Trained via supervised fine-tuning and reinforcement learning, it can be adapted to specific tasks like robotics visual question answering. It also acts as a critic during synthetic data generation, ensuring that the data produced is not only visually realistic but also semantically and physically valid.

These models are part of NVIDIA's broader effort to accelerate physical AI development through scalable, high-fidelity simulation. Developers can access the models and related tools via public repositories on GitHub, Hugging Face, and project websites, including detailed documentation and research papers. NVIDIA will showcase these advancements at SIGGRAPH 2025, offering a glimpse into the future of AI-driven robotics. For those interested in exploring the technology, free courses on NVIDIA Robotics Fundamentals are available to help beginners get started.

The work is the result of collaboration across NVIDIA Research, with contributions from over 150 researchers and engineers. Their collective efforts are paving the way for a new era of intelligent, adaptive, and autonomous systems powered by synthetic data and world foundation models.
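To make the three workflows above concrete, the sketches below illustrate each one with toy code. First, Cosmos Predict: at its core, a world model rolls a state forward autoregressively, predicting what comes next from what came before. This minimal stand-in uses a hand-written 1-D bouncing-ball "physics" instead of a learned video model; it is not the Cosmos Predict API, only an illustration of autoregressive future-state generation.

```python
# Toy autoregressive "world model" rollout, illustrating the idea behind
# Cosmos Predict: repeatedly predict the next state from the current one.
# This is a hand-written physics stand-in, NOT the real Cosmos API, which
# generates video frames from image, video, or text inputs.

def step(state, width=10):
    """Advance a 1-D bouncing ball one tick; state is (position, velocity)."""
    x, v = state
    x += v
    if x < 0:           # bounce off the left wall
        x, v = -x, -v
    elif x > width:     # bounce off the right wall
        x, v = 2 * width - x, -v
    return (x, v)

def rollout(initial_state, horizon, width=10):
    """Autoregressively generate a trajectory of future states."""
    states = [initial_state]
    for _ in range(horizon):
        states.append(step(states[-1], width))
    return states

def render(state, width=10):
    """Render one state as a 1-D ASCII 'frame'."""
    cells = ["."] * (width + 1)
    cells[round(state[0])] = "o"
    return "".join(cells)

# Generate a short "video": 1 conditioning frame plus 6 predicted frames.
frames = [render(s) for s in rollout((0, 3), horizon=6)]
```

A real WFM replaces the hand-written `step` with a learned model, but the loop structure (condition on the past, emit the next frame, feed it back in) is the same.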
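Second, Cosmos Transfer's multimodal conditioning. The article describes combining spatial control signals (segmentation, depth, edges, and so on) with a text prompt. The sketch below bundles such inputs and blends the spatial maps with per-modality weights; the class, function names, and blending rule are illustrative assumptions, not the real Transfer interface.

```python
# Sketch of multimodal conditioning in the style of Cosmos Transfer:
# several spatial control signals (tiny 2x2 "maps" here) plus a text
# prompt are bundled, then blended with per-modality weights. All names
# and the blending rule are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class ControlInputs:
    prompt: str                                  # natural-language scenario
    segmentation: list = field(default_factory=list)
    depth: list = field(default_factory=list)
    edges: list = field(default_factory=list)

def blend_controls(inputs: ControlInputs, weights: dict) -> list:
    """Weighted per-pixel blend of whichever control maps are present."""
    maps = {name: m for name, m in
            [("segmentation", inputs.segmentation),
             ("depth", inputs.depth),
             ("edges", inputs.edges)] if m}
    h = len(next(iter(maps.values())))
    w = len(next(iter(maps.values()))[0])
    total = sum(weights[name] for name in maps)
    out = [[0.0] * w for _ in range(h)]
    for name, m in maps.items():
        for i in range(h):
            for j in range(w):
                out[i][j] += weights[name] * m[i][j] / total
    return out

# Same source geometry, new scenario via the prompt; absent modalities
# (edges here) are simply skipped by the blend.
cond = ControlInputs(
    prompt="same intersection, heavy snowfall at dusk",
    segmentation=[[1.0, 0.0], [0.0, 1.0]],
    depth=[[0.5, 0.5], [0.5, 0.5]],
)
control = blend_controls(cond, {"segmentation": 0.75, "depth": 0.25, "edges": 0.0})
```

The point of the structure is the one the article makes: the same source signals can be re-prompted ("snowy day", "nighttime driving") to produce many controlled variants of one scene.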
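Third, Cosmos Reason's critic role. The real model is a vision-language-action model that reasons step by step about physical validity; as a rough illustration of what "acting as a critic over synthetic data" means, this sketch runs a chain of hand-written physical-plausibility checks over a candidate trajectory and emits per-check verdicts plus an overall accept/reject. The heuristics and thresholds are assumptions for illustration only.

```python
# Sketch of a synthetic-data critic, illustrating the role the article
# describes for Cosmos Reason. The real model is a vision-language-action
# model; these hand-written heuristics only stand in for its checks.

def critique(trajectory, max_step=2.0, floor=0.0):
    """Run a chain of simple checks over a 1-D trajectory of heights."""
    checks = {}
    # Check 1 -- continuity: no "teleporting" between consecutive frames.
    steps = [abs(b - a) for a, b in zip(trajectory, trajectory[1:])]
    checks["continuity"] = all(s <= max_step for s in steps)
    # Check 2 -- physical constraint: object never sinks below the floor.
    checks["above_floor"] = all(y >= floor for y in trajectory)
    # Overall verdict: accept only if every check passes.
    checks["accepted"] = checks["continuity"] and checks["above_floor"]
    return checks

# A smooth fall is accepted; a clip that jumps 5 units in one frame and
# dips below the floor is rejected.
good = critique([5.0, 4.0, 3.0, 2.0, 1.0, 0.0])
bad = critique([5.0, 4.0, -1.0, 0.0])
```

In a generation pipeline, a critic like this sits after the generator: only clips it accepts are kept for training, which is how the article describes Cosmos Reason filtering for semantically and physically valid data.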