Nemotron-CC-v2 pre-training Dataset

Date: 2 months ago

Organization: NVIDIA

Paper URL: 2508.14444

License: Other


Nemotron-CC-v2 is the follow-up to Nemotron-CC, released by NVIDIA in 2025. The accompanying paper is "NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model".

The dataset extends the existing English web corpus with eight Common Crawl snapshots from 2024–2025, applying global deduplication and English-language filtering. It also uses Qwen3-30B-A3B to synthesize and rephrase web content, and supplements it with Diverse Question Answering (Diverse QA) data, which is further translated into 15 languages to strengthen multilingual logical reasoning and general knowledge during pre-training. The release takes the "high-quality English webpages → synthesized Diverse QA" approach further by combining fresher web crawls with multilingual expansion in one systematic pipeline. Because the data is deduplicated, filtered, and distributed reproducibly, it can be integrated directly into a range of pre-training pipelines.
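To make "global deduplication" concrete, here is a minimal sketch in Python. It is an illustration only: the description above does not say how NVIDIA deduplicates, so this uses plain exact-match hashing of normalized text, and the names `normalize`, `global_dedup`, `snapshot_a`, and `snapshot_b` are hypothetical. The point it shows is that a single set of seen hashes is shared across all snapshots, so a page repeated in a later crawl is dropped.

```python
import hashlib
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivially different copies hash alike.
    # (Assumed normalization; the real pipeline's rules are not described here.)
    return " ".join(text.split()).lower()


def global_dedup(snapshots: Iterable[Iterable[str]]) -> Iterator[str]:
    # One `seen` set is shared across every snapshot, which is what makes the
    # deduplication "global" rather than per-snapshot.
    seen = set()
    for snapshot in snapshots:
        for doc in snapshot:
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).digest()
            if digest not in seen:
                seen.add(digest)
                yield doc


# Toy stand-ins for two crawl snapshots (hypothetical data).
snapshot_a = ["Hello world.", "Mamba is a state-space model."]
snapshot_b = ["hello   world.", "A new page."]

unique_docs = list(global_dedup([snapshot_a, snapshot_b]))
```

In this toy run the second snapshot's "hello   world." is discarded because it normalizes to the same string as the first snapshot's "Hello world.". Exact hashing only catches identical text after normalization; near-duplicate detection would need a different technique.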
