Nemotron-Pretraining-Code-v1 Code Dataset

Date: 2 months ago

Organization: NVIDIA

Paper URL: 2508.14444

License: Other

Nemotron-Pretraining-Code-v1 is a curated, large-scale code dataset built from GitHub and released by NVIDIA in 2025. The associated paper is "NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model".

The dataset is filtered through multi-stage deduplication, license enforcement, and heuristic quality checks, and contains LLM-generated code question-answer pairs in 11 programming languages. It includes 175.1 B tokens of high-quality synthesized code, along with metadata (approximately 747.4 B tokens) to facilitate reproduction by users.
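To make the three filtering stages named above more concrete, here is a minimal sketch of how a license check, a heuristic quality check, and exact deduplication might be chained. It is not NVIDIA's actual pipeline: the `SourceFile` type, the `PERMISSIVE_LICENSES` allowlist, the thresholds, and all function names are hypothetical and chosen only for illustration.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical allowlist; the licenses actually accepted for the dataset
# are not stated on this page.
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}


@dataclass
class SourceFile:
    path: str
    license: str
    text: str


def content_hash(text: str) -> str:
    """Hash the text after stripping trailing whitespace on each line,
    so trivially different copies map to the same digest."""
    normalized = "\n".join(line.rstrip() for line in text.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def passes_quality(text: str, max_line_len: int = 1000,
                   min_alpha_ratio: float = 0.25) -> bool:
    """Illustrative heuristics: reject empty files, files with very long
    lines, and files that are mostly non-alphabetic characters."""
    lines = text.splitlines()
    if not lines:
        return False
    if max(len(line) for line in lines) > max_line_len:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / len(text) >= min_alpha_ratio


def filter_files(files):
    """Apply the license check, the quality check, then exact dedup."""
    seen = set()
    kept = []
    for f in files:
        if f.license.lower() not in PERMISSIVE_LICENSES:
            continue
        if not passes_quality(f.text):
            continue
        digest = content_hash(f.text)
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(f)
    return kept
```

Exact hashing only catches identical files after whitespace normalization; a multi-stage deduplication step would typically add near-duplicate detection as well, which this sketch leaves out.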
