Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models
Abstract
Building general-purpose reasoning models with reinforcement learning (RL) raises significant challenges stemming from cross-domain heterogeneity, notably large variations in inference-time response length and in verification latency. Such variability complicates the RL infrastructure, slows down training, and makes it difficult to design the training curriculum (e.g., response-length extension) and to choose hyperparameters. In this work, we propose a cascaded domain-wise reinforcement learning framework (Cascade RL) for developing general-purpose reasoning models, Nemotron-Cascade, that operate in both instruct and deep-thinking modes. Unlike conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates a sequence of domain-wise RL stages, reducing engineering complexity while delivering state-of-the-art performance across a broad range of benchmarks. Notably, using RLHF (reinforcement learning from human feedback) as a preliminary stage, rather than limiting it to simple preference optimization, substantially improves the model's reasoning ability. Subsequent domain-wise RLVR stages generally preserve, and can even improve, the performance attained in earlier domains (see Figure 1 for an illustration). After RL training, our 14B-parameter model surpasses its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro, and achieves silver-medal performance at the 2025 International Olympiad in Informatics (IOI). We publicly and transparently release our training recipe and data.
One-sentence Summary
NVIDIA researchers propose Cascade RL, which trains the Nemotron-Cascade reasoning models through sequential domain-wise reinforcement learning instead of blending heterogeneous prompts, taming cross-domain differences in response length and verification latency, reducing engineering complexity, achieving state-of-the-art results on coding benchmarks such as LiveCodeBench along with a silver medal at the 2025 IOI, and showing that an initial RLHF stage substantially enhances reasoning beyond standard alignment.
Key Contributions
- Cross-domain heterogeneity in reinforcement learning for reasoning models causes significant infrastructure complexity and training slowdowns due to varying response lengths and verification latency across tasks.
- The proposed cascaded domain-wise reinforcement learning (Cascade RL) method sequentially trains on distinct domains instead of blending heterogeneous prompts, enabling dual instruct/deep thinking modes while preventing performance degradation between stages.
- Evaluated on benchmarks including LiveCodeBench v5/v6/Pro and the 2025 International Olympiad in Informatics, their 14B Nemotron-Cascade model surpassed its teacher (DeepSeek-R1-0528) and achieved silver-medal performance without degrading prior domain results.
Introduction
Training general-purpose reasoning models with reinforcement learning faces significant hurdles due to cross-domain heterogeneity—varying response lengths and verification latencies across tasks like math, coding, and alignment. This diversity complicates RL infrastructure, slows training, and hinders curriculum design and hyperparameter tuning. Prior approaches that blend heterogeneous prompts from multiple domains simultaneously often degrade performance in unified models, forcing compromises between thinking-mode reasoning and instruct-mode responsiveness.
The authors leverage cascaded domain-wise reinforcement learning (Cascade RL) to address these challenges. Their method sequences RL stages per domain—starting with RLHF for alignment, followed by sequential math, code, and software engineering RL—reducing engineering complexity while minimizing catastrophic forgetting. Crucially, early-stage RL (e.g., RLHF) unexpectedly boosts reasoning beyond alignment, and subsequent domain training rarely degrades prior gains. This enables a single unified model (Nemotron-Cascade) to operate effectively in both instruct and deep-thinking modes, with their 14B variant outperforming its teacher model on coding benchmarks and achieving competitive results in the 2025 IOI.
Dataset
- Dataset composition and sources: The authors use a multi-stage supervised fine-tuning (SFT) curriculum spanning math, coding, science, tool use, software engineering, and general domains (e.g., dialogue, knowledge QA, creative writing). Sources include public datasets like AceMath, NuminaMath, TACO, APPS, SWE-Bench variants, and Llama-Nemotron tool-calling data, supplemented by synthetically generated samples.
- Key subset details:
- General-domain: 2.8M samples (3.2B tokens) from diverse sources (e.g., Lian et al., 2023; Xu et al., 2024), with parallel thinking/non-thinking responses. Filtered for response quality, length, and stylistic consistency.
- Math: Stage 1 (16K tokens): 353K prompts → 2.77M samples (DeepSeek-R1); Stage 2 (32K): 163K filtered "hard" prompts → 1.88M samples (DeepSeek-R1-0528). Decontaminated via 9-gram overlap removal.
- Code: Stage 1: 172K prompts → 1.42M samples; Stage 2: 79K prompts → 1.39M samples. Sources include TACO, APPS, and OpenCodeReasoning.
- Science: 226K prompts → 289K Stage-1 samples; 345K Stage-2 samples. Filtered for complex reasoning and decontaminated.
- Tool calling: 310K conversations (1.41M turns) from Llama-Nemotron, with tools listed in system prompts.
- Software engineering: 127K code repair instances (e.g., SWE-Bench-Train, SWE-Smith), filtered via patch similarity (Unidiff ≥0.5).
- Usage in training: The SFT curriculum runs in two stages: Stage 1 (16K tokens) trains on general-domain + math/science/code data for one epoch; Stage 2 (32K tokens) recombines general data with new Stage-2 reasoning data, tool-calling, and software engineering datasets (also one epoch). Science data is upsampled 2×; software engineering data is upsampled 3× in Stage 2. All reasoning/tool data uses thinking-mode formatting.
- Processing details: Responses are generated via DeepSeek models (e.g., R1-0528), with multi-response sampling per prompt (avg. 7–17 responses). Data undergoes 9-gram decontamination against benchmarks, ground-truth verification (e.g., discarding mismatched MCQ answers), and cross-validation with auxiliary models (e.g., Qwen2.5-32B). For software engineering RL, prompts exceed SFT context limits (up to 60K tokens via YaRN scaling) and include noisy localized files to simulate real-world complexity.
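The 9-gram decontamination mentioned above can be implemented as a token-level n-gram overlap filter. The sketch below is one straightforward way to do this; the whitespace tokenizer and the toy benchmark set are illustrative assumptions, not the authors' exact pipeline.

```python
# Illustrative 9-gram decontamination sketch (not the authors' exact pipeline).
# A training prompt is dropped if any of its 9-grams also appears in a benchmark prompt.
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 9) -> Set[Tuple[str, ...]]:
    """Whitespace-tokenized n-grams; a real pipeline would normalize text more carefully."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_prompts: Iterable[str], n: int = 9) -> Set[Tuple[str, ...]]:
    """Collect every n-gram that occurs in any benchmark prompt."""
    index: Set[Tuple[str, ...]] = set()
    for prompt in benchmark_prompts:
        index |= ngrams(prompt, n)
    return index


def decontaminate(train_prompts: Iterable[str], benchmark_index: Set[Tuple[str, ...]], n: int = 9):
    """Keep only training prompts with no n-gram overlap against the benchmark index."""
    return [p for p in train_prompts if not (ngrams(p, n) & benchmark_index)]


if __name__ == "__main__":
    bench = ["compute the sum of the first ten positive integers and report the result here"]
    train = [
        "compute the sum of the first ten positive integers and report the result here please",  # overlaps
        "prove that the square root of two is irrational",                                       # clean
    ]
    index = build_benchmark_index(bench)
    print(decontaminate(train, index))  # -> only the clean prompt survives
```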
Method
The authors leverage a cascaded reinforcement learning (Cascade RL) framework to progressively refine model capabilities across increasingly specialized domains. The overall training pipeline begins with a base model that undergoes multi-stage supervised fine-tuning (SFT) to establish foundational skills. From this SFT checkpoint, the model enters a sequential RL pipeline: first, Reinforcement Learning from Human Feedback (RLHF) is applied to align outputs with human preferences and reduce verbosity; this is followed by Instruction-Following RL (IF-RL) to enhance precise adherence to user directives. Subsequent stages—Math RL, Code RL, and finally Software Engineering RL (SWE RL)—target domain-specific reasoning and generation tasks, culminating in the Nemotron-Cascade model. This staged progression from general to specialized domains is designed to mitigate catastrophic forgetting by ensuring that reward structures across stages are aligned and that prompt overlap is minimized.
Refer to the framework diagram for the full training pipeline.
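To make the stage ordering concrete, the sketch below lays out the cascade as a plain sequential loop; the stage objects, reward labels, and `run` hooks are illustrative placeholders rather than the authors' actual training code.

```python
# Illustrative sketch of the Cascade RL stage ordering; trainer hooks are placeholders.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class RLStage:
    name: str                  # domain-wise stage name
    reward_source: str         # reward described in the Method section (or assumed)
    run: Callable[[Any], Any]  # placeholder for the actual GRPO training loop


def run_cascade(sft_checkpoint: Any, stages: list) -> Any:
    """Run each RL stage sequentially, always resuming from the previous stage's checkpoint."""
    checkpoint = sft_checkpoint
    for stage in stages:
        checkpoint = stage.run(checkpoint)
    return checkpoint


# Stage order from the paper: RLHF -> IF-RL -> Math RL -> Code RL -> SWE RL,
# applied on top of the multi-stage SFT checkpoint.
STAGES = [
    RLStage("RLHF",    "72B reward model score",                 run=lambda ckpt: ckpt),
    RLStage("IF-RL",   "instruction-adherence checks (assumed)", run=lambda ckpt: ckpt),
    RLStage("Math RL", "rule-based answer verifier",             run=lambda ckpt: ckpt),
    RLStage("Code RL", "execution-free similarity verifier",     run=lambda ckpt: ckpt),
    RLStage("SWE RL",  "execution-free patch similarity",        run=lambda ckpt: ckpt),
]

final_model = run_cascade("sft_checkpoint", STAGES)  # the string stands in for a real checkpoint
```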

Each RL stage employs the Group Relative Policy Optimization (GRPO) algorithm under a strict on-policy regime, with no KL divergence term, simplifying the objective to a group-normalized REINFORCE formulation. At each iteration, the policy generates a group of G rollouts, and the advantage for each token is computed relative to the group’s mean and standard deviation of rewards. This design ensures stable updates and avoids entropy collapse. The reward functions vary by domain: RLHF uses a scalar score from a 72B reward model trained on human preferences; Math RL assigns binary rewards based on answer correctness via a rule-based verifier; Code RL and SWE RL use execution-free verifiers that compute lexical and semantic similarity between generated and ground-truth patches.
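For reference, the group-normalized advantage and the resulting REINFORCE-style loss can be written in a few lines. The sketch below is a minimal illustration under the setup described above (strictly on-policy, no KL term); the epsilon constant and tensor shapes are assumptions.

```python
# Group-normalized advantages for GRPO under a strict on-policy regime with no KL term
# (illustrative sketch; the epsilon constant and tensor shapes are assumptions).
import torch


def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (G,), one scalar reward per rollout in the group for a single prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def reinforce_loss(logprobs: torch.Tensor, mask: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """
    logprobs:   (G, T) per-token log-probabilities under the current policy.
    mask:       (G, T) 1 for generated tokens, 0 for prompt/padding tokens.
    advantages: (G,)  group-normalized advantages, shared by every token of a rollout.
    With no KL penalty, the objective reduces to group-normalized REINFORCE.
    """
    per_token = -logprobs * advantages.unsqueeze(-1)
    return (per_token * mask).sum() / mask.sum().clamp(min=1)


if __name__ == "__main__":
    rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])  # e.g., binary math-verifier rewards, G = 4
    print(grpo_advantages(rewards))               # positive for correct rollouts, negative otherwise
```

In actual training, the rollout group for each prompt would be regenerated at every iteration, consistent with the strict on-policy regime described above.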
For interaction control, the authors adopt a simplified ChatML-based template with explicit /think and /no_think flags appended to each user prompt, enabling fine-grained, turn-level control over the reasoning mode. This contrasts with prior work that embeds mode control in the system prompt or relies on redundant template-based cues. For tool calling, available functions are declared within dedicated tags in the system prompt, and model-generated calls are enclosed in <tool_call> tags, as illustrated in the paper's system-prompt example.
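The turn-level mode control can be pictured with a small prompt-building helper; the special tokens, the <tools> tag name, and the example tool schema below are assumptions for illustration, not the verbatim Nemotron-Cascade template.

```python
# Illustrative ChatML-style prompt builder with per-turn /think and /no_think flags.
# The special tokens, the <tools> tag name, and the tool schema are assumptions.
import json


def build_prompt(system: str, user_msg: str, thinking: bool, tools=None) -> str:
    sys_block = system
    if tools:
        # Available functions are declared inside the system prompt (tag name assumed).
        sys_block += "\n<tools>\n" + json.dumps(tools, indent=2) + "\n</tools>"
    flag = "/think" if thinking else "/no_think"  # appended to each user turn for turn-level control
    return (
        f"<|im_start|>system\n{sys_block}<|im_end|>\n"
        f"<|im_start|>user\n{user_msg} {flag}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )


if __name__ == "__main__":
    tools = [{"name": "get_weather", "parameters": {"city": "string"}}]  # hypothetical tool
    print(build_prompt("You are a helpful assistant.", "What's the weather in Paris?",
                       thinking=False, tools=tools))
    # A model-generated call would then appear wrapped in <tool_call> ... </tool_call> tags.
```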

In the SWE RL stage, the authors employ a simplified Agentless framework that decomposes software repair into localization, repair, and patch validation. The repair phase generates targeted, diff-style patches by concatenating localized files and surrounding context into a unified prompt, preserving code structure to reduce hallucinations. Patch validation proceeds through regression, reproduction, and majority voting phases to ensure functional correctness and robustness. For training stability, a two-stage curriculum extends input context from 16K to 24K tokens, allowing the model to gradually develop multi-file reasoning capabilities without early degradation.
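The patch-validation phase (regression tests, reproduction tests, then majority voting) condenses into a small selection routine. The sketch below is a simplified illustration in which the two check callables are hypothetical stand-ins for the actual test harnesses.

```python
# Simplified view of patch validation: regression check, reproduction check, majority voting.
# The two check callables are hypothetical stand-ins for the actual test harnesses.
from collections import Counter
from typing import Callable, List, Optional


def validate_and_select(candidates: List[str],
                        passes_regression: Callable[[str], bool],
                        reproduces_fix: Callable[[str], bool]) -> Optional[str]:
    """Keep candidate patches that pass regression and issue-reproduction tests,
    then return the most frequently generated surviving patch (majority voting)."""
    surviving = [p for p in candidates if passes_regression(p) and reproduces_fix(p)]
    if not surviving:
        return None
    return Counter(surviving).most_common(1)[0][0]


if __name__ == "__main__":
    # Toy demo: three sampled diff-style patches; one fails regression, voting picks the repeated fix.
    patches = [
        "--- a/f.py\n+++ b/f.py\n+    return x + 1",
        "--- a/f.py\n+++ b/f.py\n+    return x + 1",
        "--- a/f.py\n+++ b/f.py\n+    raise NotImplementedError",
    ]
    chosen = validate_and_select(
        patches,
        passes_regression=lambda p: "NotImplementedError" not in p,
        reproduces_fix=lambda p: True,
    )
    print(chosen)
```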
Experiment
- Cascade RL framework validated across human-feedback alignment, instruction following, math reasoning, competitive programming, and software engineering, demonstrating minimal catastrophic forgetting and domain-specific performance gains.
- Nemotron-Cascade-14B-Thinking achieved 78.0/74.8 on LiveCodeBench v5/v6, surpassing DeepSeek-R1-0528 (74.8/73.3) and Gemini-2.5-Pro-06-05 despite using a 64K-token inference budget.
- Nemotron-Cascade-8B unified model matched DeepSeek-R1-0528 on LiveCodeBench v5/v6 (75.3/71.5) with only 8B parameters versus 671B, while achieving silver-medal performance on IOI 2025.
- Nemotron-Cascade-14B reached 43.1% pass@1 on SWE-bench Verified, exceeding specialized models like DeepSWE-32B (42.2%) and general-purpose 14B models (Qwen3-14B: 27.4%).
The authors evaluate the impact of maximum prompt length on code repair performance using a 14B model across four conditions. Results show that increasing prompt length from 16K to 32K improves repair accuracy, particularly when ground-truth file localization is provided, but performance degrades at 40K, suggesting diminishing returns beyond 32K context.

The authors evaluate their Nemotron-Cascade models on a series of Codeforces contests, reporting scores, penalties, and estimated Elo ratings across multiple divisions. Results show consistent performance across contests, with estimated Elo ratings ranging from approximately 1500 to 2600, indicating a competitive standing among human participants. Performance varies by contest difficulty and division, with higher scores and rankings typically observed in lower divisions and more recent rounds.

The authors evaluate different reward functions for SWE RL, finding that semantic similarity-based rewards outperform lexical similarity in code repair tasks, especially when ground-truth file localization is provided. Reward shaping improves performance with lexical similarity but offers no additional benefit with semantic similarity, suggesting the latter provides more reliable training signals even at low similarity scores.
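A lexical patch-similarity reward of the kind compared here can be approximated with a sequence-matching ratio. The sketch below uses Python's difflib as an assumed stand-in for the paper's metric, and the threshold-based shaping is illustrative rather than the authors' exact scheme.

```python
# Illustrative lexical similarity reward between a generated and a ground-truth patch.
# difflib's ratio is an assumed stand-in for the paper's lexical similarity metric;
# the shaping threshold is likewise illustrative.
import difflib


def lexical_similarity(generated_patch: str, reference_patch: str) -> float:
    """Character-level similarity in [0, 1] between two unified-diff patches."""
    return difflib.SequenceMatcher(None, generated_patch, reference_patch).ratio()


def shaped_reward(similarity: float, threshold: float = 0.5) -> float:
    """Simple reward shaping: zero out low-similarity patches, rescale the rest to [0, 1]."""
    if similarity < threshold:
        return 0.0
    return (similarity - threshold) / (1.0 - threshold)


if __name__ == "__main__":
    gen = "--- a/utils.py\n+++ b/utils.py\n@@\n-return x\n+return x + 1\n"
    ref = "--- a/utils.py\n+++ b/utils.py\n@@\n-return x\n+return x + 1  # off-by-one fix\n"
    sim = lexical_similarity(gen, ref)
    print(f"similarity={sim:.2f}, raw reward={sim:.2f}, shaped reward={shaped_reward(sim):.2f}")
```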

The authors apply SWE RL as the final stage in their Cascade RL pipeline and observe substantial gains on SWE-bench Verified, with the 14B-Thinking model achieving 43.1% pass@1, outperforming specialized 32B models. While SWE RL improves software engineering performance, it has minimal impact on other domains, with most changes attributable to evaluation variance. After full training, the unified 8B model narrows the gap with its dedicated 8B-Thinking counterpart on SWE-bench Verified, reaching 37.2% versus 38.5%.

The authors also report per-contest Codeforces results, including scores, penalties, and estimated ranks across multiple divisions. Results show consistently high scores in many rounds, particularly in Div. 2 contests, and a competitive standing against other participants based on estimated Elo ratings. The data reflects the model's ability to solve algorithmic problems under contest conditions, with performance varying by round difficulty and division.
