Nemotron-CC-v2 pre-training Dataset

Date

7 days ago

Organization

NVIDIA

Publish URL

huggingface.co

License

Other

Nemotron-CC-v2 is the follow-up to Nemotron-CC, released by NVIDIA in 2025. The associated paper is "NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model".

The dataset extends the existing English web corpus with eight Common Crawl snapshots from 2024–2025, to which global deduplication and English-language filtering are applied. It also uses Qwen3-30B-A3B to synthesize rephrased versions of web content and adds Diverse Question Answering (Diverse QA) data, which is further translated into 15 languages to strengthen multilingual logical reasoning and general knowledge during pre-training. Its significance lies in taking the "high-quality English webpages → synthesized Diverse QA" approach further by combining fresher web crawls with multilingual expansion in one systematic pipeline. Rigorous deduplication, filtering, and reproducible distribution make it straightforward to integrate into existing pre-training pipelines.
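To make "global deduplication and English filtering" concrete, here is a minimal Python sketch of the idea. It is not NVIDIA's pipeline: the snapshot names, the `looks_english` heuristic (a crude stand-in for a real language-identification model), and the exact-hash deduplication are all assumptions chosen for illustration. The point it shows is that "global" means a single set of seen hashes is shared across all snapshots, so a page that reappears in a later crawl is dropped.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def looks_english(text: str, min_ascii_letter_ratio: float = 0.8) -> bool:
    """Hypothetical heuristic: share of ASCII letters among all letters.

    A real pipeline would use a trained language-ID classifier instead.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    ascii_letters = sum(1 for ch in letters if ch.isascii())
    return ascii_letters / len(letters) >= min_ascii_letter_ratio


def global_dedup(snapshots: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Keep the first occurrence of each English document across ALL snapshots."""
    seen: set[str] = set()  # shared across snapshots, hence "global"
    kept: list[tuple[str, str]] = []
    for name, docs in snapshots.items():
        for doc in docs:
            if not looks_english(doc):
                continue
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest in seen:
                continue
            seen.add(digest)
            kept.append((name, doc))
    return kept


# Illustrative snapshot names and documents (assumed, not from the dataset).
snapshots = {
    "snapshot-a": ["Hello world.", "你好，世界"],
    "snapshot-b": ["hello   WORLD.", "A new page."],
}
result = global_dedup(snapshots)
```

In this toy input, the non-English document is filtered out and the near-identical page in the second snapshot is treated as a duplicate of the first. Exact hashing only catches documents that are identical after normalization; catching near-duplicates would need a fuzzy method, which this sketch does not attempt.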