HyperAI

Nemotron-CC-Math is a high-quality, large-scale pre-training dataset focused on mathematics, released by NVIDIA and Boston University in 2025. The related paper results are "Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset”, aims to preserve and display high-value mathematical and code content, thereby driving the next wave of intelligent, globally capable language models.

This dataset, containing 133 billion tokens, was built from Common Crawl using an extraction and normalization pipeline based on NVIDIA Lynx and a lightweight LLM. While preserving the structure of equations and code, the mathematical content is standardized into an editable LaTeX format. This represents the first reliable coverage of diverse (including long-tail) mathematical formats at web scale; its advantages have been validated in multiple benchmarks.

Nemotron-CC-Math Mathematics pre-training Dataset

Build AI with AI

Hyper Newsletters

Command Palette

Nemotron-CC-Math Mathematics pre-training Dataset

Build AI with AI

Hyper Newsletters