
NVIDIA Unveils Cosmos Reason 2: Advanced Open Model for Physical AI with Enhanced Vision, Reasoning, and Robotics Capabilities

NVIDIA has unveiled Cosmos Reason 2, the latest evolution in its open reasoning vision-language models designed for physical AI applications. The new model outperforms its predecessor and claims the top spot on both the Physical AI Bench and Physical Reasoning leaderboards, making it the leading open model for real-world AI tasks.

Cosmos Reason 2 is a state-of-the-art vision-language model that empowers robots and AI agents to perceive, comprehend, plan, and act in dynamic physical environments with human-like reasoning. It integrates common sense, physics-based knowledge, and contextual awareness to anticipate object behavior over time and space, enabling step-by-step problem solving, adaptation to novel situations, and robust decision-making under uncertainty.

Key advancements in Cosmos Reason 2 include support for OCR, 2D and 3D point localization, and enhanced text understanding within video content. This allows the model to extract meaningful insights from complex visual data, such as analyzing road conditions during a rainstorm by reading signs and markings in video footage.

The model supports a range of high-impact use cases. In video analytics, it powers AI agents that can search, summarize, and interpret vast video datasets. Salesforce is using Cosmos Reason 2 with the Agentforce and Video Search and Summarization (VSS) blueprint to improve workplace safety by analyzing footage from Cobalt robots.

For data annotation, the model enables automated, time-stamped, and detailed labeling of training videos, which is critical for building high-quality datasets. Uber is leveraging Cosmos Reason 2 to generate accurate, searchable captions for autonomous vehicle training data, significantly improving performance across multiple metrics. Fine-tuning on AV datasets led to a 10.6% increase in BLEU scores, a 0.67 percentage point gain in MCQ-based VQA, and a 13.8% rise in LingoQA, demonstrating strong domain-specific adaptation.
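Point-localization output from a vision-language model typically arrives as text and must be parsed before a downstream system can use it. As a minimal sketch, assuming the model emits pixel coordinates in a plain "(x, y)" format (the actual output schema is defined by the model's documentation, not confirmed here):

```python
import re

# Assumed output format: coordinates written as "(x, y)" in the answer text.
# Real deployments should follow the schema in the model's documentation.
POINT_RE = re.compile(r"\((\d+),\s*(\d+)\)")

def parse_points(answer: str) -> list[tuple[int, int]]:
    """Extract (x, y) pixel coordinates from a model's answer text."""
    return [(int(x), int(y)) for x, y in POINT_RE.findall(answer)]

answer = "Move the gripper to (412, 288), then release over (120, 96)."
print(parse_points(answer))  # → [(412, 288), (120, 96)]
```

A regex pass like this is a common post-processing step when a VLM localizes objects by describing coordinates in free text rather than returning structured JSON.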
In robotics, Cosmos Reason 2 acts as a core reasoning engine for vision-language-action (VLA) systems. It now provides not just action steps but also precise trajectory coordinates for robotic arms, enabling accurate manipulation tasks such as guiding a gripper to place painter's tape into a basket. Encord has integrated Cosmos Reason 2 into its Data Agent library, allowing developers to use it for robotics and other physical AI applications. Companies including Hitachi, Milestone, and VAST Data are applying the model to advance autonomous driving, traffic monitoring, and safety analytics.

Developers can now try Cosmos Reason 2 on build.nvidia.com, where they can upload their own videos and images for analysis. The 2B and 8B versions are available on Hugging Face, and cloud deployment will soon be supported on AWS, Google Cloud, and Microsoft Azure. Comprehensive documentation and the Cosmos Cookbook provide guidance for implementation and customization.

Other models in the Cosmos family include Cosmos Predict 2.5 for forecasting physical world states from visual inputs, Cosmos Transfer 2.5 for video-to-world style transfer, and NVIDIA GR00T N1.6, a specialized VLA model for humanoid robots that uses Cosmos Reason for enhanced reasoning and control.

For more information, visit the Cosmos Cookbook, explore models and datasets on GitHub, try the hosted catalog, join the community on Discord, or contribute to the open-source project.
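To give a feel for pairing an uploaded video with a question, here is a minimal sketch of an OpenAI-style multimodal chat payload, a convention many VLM serving stacks accept. The model identifier "cosmos-reason2" and the "video_url" content type are placeholders for illustration, not confirmed names from NVIDIA's API:

```python
# Sketch of a multimodal chat-completion payload for video question answering.
# "cosmos-reason2" and the "video_url" content type are assumed placeholders;
# check the serving endpoint's documentation for the real identifiers.

def build_video_qa_request(video_url: str, question: str,
                           model: str = "cosmos-reason2") -> dict:
    """Assemble a chat payload that pairs one video with one question."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": video_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
    }

request = build_video_qa_request(
    "https://example.com/dashcam.mp4",
    "Are the road markings legible in this rain?",
)
```

The resulting dictionary can be serialized to JSON and posted to whatever chat-completions endpoint hosts the model.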
