Nemotron-Post-Training-Dataset-v2 Post-training Dataset
Date:
Size:
Paper URL:
License: CC BY 4.0
Nemotron-Post-Training-Dataset-v2 is a 2025 release from NVIDIA that builds on its existing post-training corpus. It extends the SFT and RL data to five target languages (Spanish, French, German, Italian, and Japanese) and covers math, code, STEM, chat, and other scenarios, with the goal of improving a model's reasoning and instruction-following capabilities. The dataset supports metadata-based filtering and includes representative subset examples. It is one of the public post-training corpora behind the Nemotron-Nano-9B-v2 model series, supporting alignment research, experiment reproduction, and further improvement. The accompanying paper is "NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model".
Sample distribution and metadata:
- Filtered download: supports quick filtering and downloading by metadata such as category, language, and source model (see the sketch after this list)
- Categories and sample counts: math (239,467); code (175,000); stem (355,000); chat (627,720)
- Multi-language coverage: ja, de, it, es, fr
- Sources: synthesized by multiple large models (e.g., DeepSeek-R1-0528 and the Qwen 2.5/3 series)
- Annotation format: some samples provide two responses, one with reasoning on and one with reasoning off; the reasoning traces are in English
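
A minimal sketch of the metadata-based filtering described above, using the Hugging Face `datasets` library. The Hub ID `nvidia/Nemotron-Post-Training-Dataset-v2`, the split name `chat`, and the `language` column name are assumptions based on this description; check the dataset card for the actual schema before running.

```python
# Sketch: stream the dataset and filter by a metadata column.
# Hub ID, split name, and column names are assumptions -- verify
# against the dataset card's actual configs and fields.
from datasets import load_dataset

# Streaming avoids downloading the full corpus up front.
ds = load_dataset(
    "nvidia/Nemotron-Post-Training-Dataset-v2",  # assumed Hub ID
    split="chat",                                # assumed split name
    streaming=True,
)

# Keep only Japanese samples (assumed "language" metadata column).
ja_only = ds.filter(lambda ex: ex.get("language") == "ja")

# Inspect a few filtered examples.
for example in ja_only.take(3):
    print(example)
```

The same pattern applies to the other metadata fields (category or source model): swap the lambda's predicate for whichever column the dataset card documents.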