How NVIDIA Builds Open Data for AI

To address critical bottlenecks in AI development, NVIDIA has adopted a collaborative strategy centered on open data, aiming to accelerate the creation of trustworthy and capable agentic systems. Recognizing that model performance relies heavily on the quality and accessibility of training data, the company is releasing large-scale, permissively licensed datasets alongside its open models, tools, and training techniques. This approach is designed to shorten the months-long, costly processes of data collection and annotation that typically hinder progress. To date, NVIDIA has shared more than two petabytes of AI-ready data across 180 datasets, along with more than 650 open models.

The company's open data portfolio spans diverse domains including robotics, autonomous driving, biology, and sovereign AI. Notable examples include the Physical AI Collection, which features over 500,000 robotics trajectories and 17,000 hours of multi-sensor autonomous driving data from 25 countries. This collection, used by companies like Runway and Lightwheel, enables the development of advanced reasoning models and robust perception benchmarks. Another key release is the Nemotron Personas Collection, which provides synthetic, culturally diverse population data to support sovereign AI. This collection has improved translation accuracy and legal question answering for global partners such as CrowdStrike and NTT Data. In the life sciences, the La Proteina dataset offers 455,000 synthetic protein structures to aid drug discovery without licensing constraints or privacy issues.

For evaluation and efficiency, NVIDIA introduced SPEED-Bench, a standardized benchmark for speculative decoding, and Retrieval-Synthetic-NVDocs-v1, a dataset designed to enhance retrieval-augmented generation systems. Additionally, the Nemotron-ClimbMix dataset, created using a novel clustering algorithm, has significantly reduced compute time for training smaller models while improving performance on leaderboards.

NVIDIA also details the evolution of the data stacks used to train its own Nemotron model family. Pre-training data has shifted from general web corpora to high-signal domains like mathematics, code, and STEM to enhance reasoning capabilities. Post-training datasets now focus on multilingual diversity, structured reasoning, and agentic interactions, enabling the development of models that excel at complex, multi-step tasks. These efforts have supported partnerships with organizations like ServiceNow and Hugging Face, resulting in models that compete with or surpass leading commercial offerings.

This initiative relies on what NVIDIA calls extreme co-design, a methodology in which data strategists, researchers, and engineers work together to solve problems holistically. By releasing both data and the methodologies behind it, NVIDIA invites the community to stress-test datasets, identify edge cases, and contribute to iterative improvements. Through consortia like ViDoRe and CVDP, the company further collaborates with academic and industry partners to establish open evaluation frameworks.

Ultimately, NVIDIA views open data as the shared foundation necessary for the next generation of AI. By making both ingredients and recipes visible, much like an open kitchen, the company encourages developers to explore its datasets on Hugging Face, participate in labs, and engage with the community to build more capable and safe AI systems.
