Date

10 months ago

Organization

Paper URL

2508.14444

License

Other

Tags

NVIDIA

Nemotron-CC-v2 is the successor to Nemotron-CC, released by NVIDIA in 2025. It is described in the paper "NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model". The dataset extends the existing English web corpus with eight Common Crawl snapshots from 2024–2025, to which global deduplication and English filtering are applied. It also uses Qwen3-30B-A3B to synthesize rephrased versions of web content and supplements them with Diverse Question Answering (Diverse QA) data, which is further translated into 15 languages to strengthen multilingual logical reasoning and general knowledge during pre-training. Its significance lies in taking the "high-quality English webpages → synthesized Diverse QA" approach further by combining it with fresher web crawls and multilingual expansion in one systematic pipeline. With rigorous deduplication, filtering, and reproducible distribution, the dataset is intended to integrate directly into a variety of pre-training pipelines.
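To make "global deduplication" concrete, the sketch below shows one simple way to deduplicate documents across several crawl snapshots at once. It is an illustration only: the description above does not say which method NVIDIA uses, so the exact-match SHA-256 hashing, the `normalize` helper, the `global_dedup` function, and the snapshot names are all assumptions made for this example.

```python
import hashlib
from typing import Dict, Iterable, Iterator, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def global_dedup(
    snapshots: Dict[str, Iterable[str]],
) -> Iterator[Tuple[str, str]]:
    """Yield (snapshot_id, document) pairs, keeping the first copy of each text.

    One `seen` set is shared across all snapshots, which is what makes the
    deduplication global rather than per-snapshot. This is an assumed
    exact-match scheme, not the pipeline used to build the dataset.
    """
    seen = set()
    for snapshot_id, docs in snapshots.items():
        for doc in docs:
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # already emitted from this or an earlier snapshot
            seen.add(digest)
            yield snapshot_id, doc


# Hypothetical snapshot names and toy documents.
snapshots = {
    "snapshot-a": ["Hello  world", "Unique A"],
    "snapshot-b": ["hello world", "Unique B"],
}
kept = list(global_dedup(snapshots))
```

Here the second snapshot's "hello world" is dropped because its normalized form matches a document already seen in the first snapshot. A production pipeline would more likely combine exact matching with fuzzy (near-duplicate) detection and run it in a distributed setting.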

This dataset is contributed by community users and is intended for educational and informational purposes only. If any content involves copyright infringement, please contact us at [email protected] for prompt review and removal.


Related Datasets

Nemotron Personas France (French Synthetic Personas Dataset)

2 months ago

CHIMERA General Inference Synthetic Dataset

4 months ago

Nemotron-Personas-Brazil Brazilian Synthetic Character Dataset

5 months ago

RoVid-X Robot Video Generation Dataset

2 months ago

LightOnOCR-mix-0126 Text Transcription Dataset

5 months ago

Nemotron-Math-v2 Mathematical Inference Dataset

5 months ago

TxT360-3efforts Multi-Task Inference Dataset

5 months ago
