Nemotron-Pretraining-SFT-v1 Supervised fine-tuning Dataset
Nemotron-Pretraining-SFT-v1 is a synthetic supervised fine-tuning (SFT) dataset released by NVIDIA in 2025. It accompanies the paper "NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model" and aims to strengthen model capabilities in tasks such as instruction following, reasoning, coding, and general question answering.
The dataset targets STEM, academic, logical-reasoning, and multilingual scenarios. It is expanded and generated from high-quality mathematics and science materials, combining graduate-level academic texts with instruction-tuned SFT data to construct complex multiple-choice and analytical questions (with complete answers and solution steps), covering tasks such as mathematics, coding, general knowledge, and logical reasoning.
In the official statistics for the Nemotron pre-training data, SFT-related categories (such as Math SFT, Code SFT, and General SFT) account for a significant share, and per-sample metadata makes it straightforward to filter the subsets needed for reproducible experiments.
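To illustrate the metadata-based filtering described above, here is a minimal Python sketch. The `category` field name and its values are hypothetical stand-ins for the dataset's actual metadata schema; consult the dataset card for the real field names before adapting this.

```python
# Hypothetical sketch: filtering SFT samples by a metadata category field.
# The field name "category" and values like "Math SFT" are assumptions,
# not the dataset's documented schema.

def filter_by_category(samples, wanted):
    """Return only the samples whose metadata category is in `wanted`."""
    return [s for s in samples if s.get("category") in wanted]

# Toy records standing in for dataset rows.
samples = [
    {"category": "Math SFT", "text": "Solve x^2 - 4 = 0."},
    {"category": "Code SFT", "text": "Write a function that reverses a string."},
    {"category": "General SFT", "text": "Summarize the water cycle."},
]

math_only = filter_by_category(samples, {"Math SFT"})
print(len(math_only))  # → 1
```

The same pattern scales to real dataset loaders (for example, a streaming iterator over the published files), since the filter only inspects one metadata key per record.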