Nemotron-Pretraining-Code-v1 Code Dataset
License: Other
Nemotron-Pretraining-Code-v1 is a curated, large-scale collection of code data sourced from GitHub and released by NVIDIA in 2025. The associated paper is "NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model".
The dataset was filtered through multi-stage deduplication, license enforcement, and heuristic quality checks, and it contains LLM-generated code question-answer pairs in 11 programming languages. It includes 175.1B tokens of high-quality synthesized code, along with metadata (approximately 747.4B tokens) to help users reproduce the data.
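To make the three filtering stages concrete, here is a minimal Python sketch of a license check, a heuristic quality check, and exact deduplication applied in sequence. It is an illustration only and not NVIDIA's pipeline: the `CodeFile` record, the `PERMISSIVE_LICENSES` allow-list, and the thresholds in `passes_quality` are all hypothetical choices made for this example.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical allow-list; the dataset's actual license policy is not stated here.
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}


@dataclass
class CodeFile:
    """Hypothetical record for one source file."""
    path: str
    license: str
    text: str


def content_hash(text: str) -> str:
    """Hash the file after stripping indentation and blank lines,
    so trivially reformatted copies produce the same digest."""
    normalized = "\n".join(
        line.strip() for line in text.splitlines() if line.strip()
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def passes_quality(text: str, max_line_len: int = 1000,
                   min_alpha_frac: float = 0.25) -> bool:
    """Toy heuristics (assumed thresholds): reject empty files, files with
    very long lines, and files that are mostly non-alphabetic."""
    lines = text.splitlines()
    if not lines:
        return False
    if max(len(line) for line in lines) > max_line_len:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / len(text) >= min_alpha_frac


def filter_corpus(files):
    """Apply license enforcement, quality checks, then exact deduplication."""
    seen = set()
    kept = []
    for f in files:
        if f.license.lower() not in PERMISSIVE_LICENSES:
            continue
        if not passes_quality(f.text):
            continue
        digest = content_hash(f.text)
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(f)
    return kept
```

Production pipelines typically add near-duplicate detection (for example MinHash) on top of exact hashing; this sketch covers only the exact-match case.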