Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models
Abstract
Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop general-purpose reasoning models, Nemotron-Cascade, capable of operating in both instruct and deep thinking modes. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, RLHF for alignment, when used as a pre-step, boosts the model's reasoning ability far beyond mere preference optimization, and subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.
One-sentence Summary
NVIDIA researchers propose Cascade RL, a sequential domain-wise reinforcement learning recipe for training the Nemotron-Cascade reasoning models that avoids blending heterogeneous prompts, thereby sidestepping cross-domain variation in response lengths and verification latency, reducing engineering complexity, and delivering state-of-the-art results on coding benchmarks such as LiveCodeBench plus silver-medal performance at the 2025 IOI, with an RLHF pre-step that boosts reasoning well beyond standard alignment.
Key Contributions
- Cross-domain heterogeneity in reinforcement learning for reasoning models causes significant infrastructure complexity and training slowdowns due to varying response lengths and verification latency across tasks.
- The proposed cascaded domain-wise reinforcement learning (Cascade RL) method sequentially trains on distinct domains instead of blending heterogeneous prompts, enabling dual instruct/deep thinking modes while preventing performance degradation between stages.
- Evaluated on benchmarks including LiveCodeBench v5/v6/Pro and the 2025 International Olympiad in Informatics, their 14B Nemotron-Cascade model surpassed its teacher (DeepSeek-R1-0528) and achieved silver-medal performance without degrading prior domain results.
Introduction
Training general-purpose reasoning models with reinforcement learning faces significant hurdles due to cross-domain heterogeneity—varying response lengths and verification latencies across tasks like math, coding, and alignment. This diversity complicates RL infrastructure, slows training, and hinders curriculum design and hyperparameter tuning. Prior approaches that blend heterogeneous prompts from multiple domains simultaneously often degrade performance in unified models, forcing compromises between thinking-mode reasoning and instruct-mode responsiveness.
The authors leverage cascaded domain-wise reinforcement learning (Cascade RL) to address these challenges. Their method sequences RL stages per domain—starting with RLHF for alignment, followed by sequential math, code, and software engineering RL—reducing engineering complexity while minimizing catastrophic forgetting. Crucially, early-stage RL (e.g., RLHF) unexpectedly boosts reasoning beyond alignment, and subsequent domain training rarely degrades prior gains. This enables a single unified model (Nemotron-Cascade) to operate effectively in both instruct and deep-thinking modes, with their 14B variant outperforming its teacher model on coding benchmarks and achieving competitive results in the 2025 IOI.
Dataset
- Dataset composition and sources: The authors use a multi-stage supervised fine-tuning (SFT) curriculum spanning math, coding, science, tool use, software engineering, and general domains (e.g., dialogue, knowledge QA, creative writing). Sources include public datasets like AceMath, NuminaMath, TACO, APPS, SWE-Bench variants, and Llama-Nemotron tool-calling data, supplemented by synthetically generated samples.
- Key subset details:
- General-domain: 2.8M samples (3.2B tokens) from diverse sources (e.g., Lian et al., 2023; Xu et al., 2024), with parallel thinking/non-thinking responses. Filtered for response quality, length, and stylistic consistency.
- Math: Stage 1 (16K tokens): 353K prompts → 2.77M samples (DeepSeek-R1); Stage 2 (32K): 163K filtered "hard" prompts → 1.88M samples (DeepSeek-R1-0528). Decontaminated via 9-gram overlap removal.
- Code: Stage 1: 172K prompts → 1.42M samples; Stage 2: 79K prompts → 1.39M samples. Sources include TACO, APPS, and OpenCodeReasoning.
- Science: 226K prompts → 289K Stage-1 samples; 345K Stage-2 samples. Filtered for complex reasoning and decontaminated.
- Tool calling: 310K conversations (1.41M turns) from Llama-Nemotron, with tools listed in system prompts.
- Software engineering: 127K code repair instances (e.g., SWE-Bench-Train, SWE-Smith), filtered via patch similarity (Unidiff ≥0.5).
- Usage in training: The SFT curriculum runs in two stages: Stage 1 (16K tokens) trains on general-domain + math/science/code data for one epoch; Stage 2 (32K tokens) recombines general data with new Stage-2 reasoning data, tool-calling, and software engineering datasets (also one epoch). Science data is upsampled 2×; software engineering data is upsampled 3× in Stage 2. All reasoning/tool data uses thinking-mode formatting.
- Processing details: Responses are generated via DeepSeek models (e.g., R1-0528), with multi-response sampling per prompt (avg. 7–17 responses). Data undergoes 9-gram decontamination against benchmarks, ground-truth verification (e.g., discarding mismatched MCQ answers), and cross-validation with auxiliary models (e.g., Qwen2.5-32B). For software engineering RL, prompts exceed SFT context limits (up to 60K tokens via YaRN scaling) and include noisy localized files to simulate real-world complexity.
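To make the 9-gram decontamination mentioned above concrete, here is a minimal sketch of how such an overlap filter could look. The whitespace tokenization, lowercasing, and "drop on any single shared 9-gram" rule are illustrative assumptions, not the authors' exact pipeline.

```python
from typing import Iterable, List, Set

def ngrams(tokens: List[str], n: int = 9) -> Set[tuple]:
    """Return the set of n-grams (as token tuples) contained in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_index(benchmark_texts: Iterable[str], n: int = 9) -> Set[tuple]:
    """Collect every n-gram that appears in any benchmark prompt or solution."""
    index: Set[tuple] = set()
    for text in benchmark_texts:
        index |= ngrams(text.lower().split(), n)
    return index

def is_contaminated(sample_text: str, benchmark_index: Set[tuple], n: int = 9) -> bool:
    """Flag a training sample if it shares at least one n-gram with a benchmark."""
    return not ngrams(sample_text.lower().split(), n).isdisjoint(benchmark_index)

# Usage (hypothetical variable names): drop overlapping training prompts.
# benchmark_index = build_benchmark_index(all_benchmark_texts)
# clean = [s for s in train_samples if not is_contaminated(s, benchmark_index)]
```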
Method
The authors leverage a cascaded reinforcement learning (Cascade RL) framework to progressively refine model capabilities across increasingly specialized domains. The overall training pipeline begins with a base model that undergoes multi-stage supervised fine-tuning (SFT) to establish foundational skills. From this SFT checkpoint, the model enters a sequential RL pipeline: first, Reinforcement Learning from Human Feedback (RLHF) is applied to align outputs with human preferences and reduce verbosity; this is followed by Instruction-Following RL (IF-RL) to enhance precise adherence to user directives. Subsequent stages—Math RL, Code RL, and finally Software Engineering RL (SWE RL)—target domain-specific reasoning and generation tasks, culminating in the Nemotron-Cascade model. This staged progression from general to specialized domains is designed to mitigate catastrophic forgetting by ensuring that reward structures across stages are aligned and that prompt overlap is minimized.
Refer to the framework diagram for the full training pipeline.

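The sequential structure of the pipeline can be summarized in a short sketch. The stage names follow the paper; the `run_rl_stage` callable and `stage_configs` mapping are assumed placeholders for illustration, not the authors' code.

```python
# Hypothetical sketch of the cascaded, domain-wise RL schedule.
CASCADE_STAGES = ["RLHF", "IF-RL", "Math RL", "Code RL", "SWE RL"]

def cascade_rl(model, stage_configs, run_rl_stage):
    """Run RL stages sequentially: each stage starts from the previous stage's
    checkpoint, so prompts and rewards from different domains are never mixed."""
    for stage in CASCADE_STAGES:
        model = run_rl_stage(model, stage_configs[stage])
    return model
```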
Each RL stage employs the Group Relative Policy Optimization (GRPO) algorithm under a strict on-policy regime, with no KL divergence term, simplifying the objective to a group-normalized REINFORCE formulation. At each iteration, the policy generates a group of G rollouts per prompt, and the advantage for each token is computed by normalizing the rollout's reward with the group's mean and standard deviation. This design helps stabilize updates and mitigate entropy collapse. The reward functions vary by domain: RLHF uses a scalar score from a 72B reward model trained on human preferences; Math RL assigns binary rewards based on answer correctness via a rule-based verifier; Code RL and SWE RL use execution-free verifiers that compute lexical and semantic similarity between generated and ground-truth patches.
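As a concrete illustration of the group-normalized advantage, here is a minimal sketch assuming scalar, sequence-level rewards and a small epsilon for numerical stability; clipping, masking, and other implementation details are omitted and not taken from the paper.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: normalize each rollout's reward by the mean
    and standard deviation of its group of G rollouts for the same prompt.

    rewards: shape (G,), one scalar reward per rollout.
    Returns: shape (G,), the advantage assigned to every token of each rollout.
    """
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + eps)

# Example: G = 4 rollouts for one math prompt with binary correctness rewards.
adv = grpo_advantages(np.array([1.0, 0.0, 0.0, 1.0]))  # -> [ 1., -1., -1.,  1.]
# With strict on-policy updates and no KL term, the loss is simply
# -advantage * log pi(token), summed over each rollout's tokens.
```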
For interaction control, the authors adopt a simplified ChatML-based template with explicit /think and /no_think flags appended to each user prompt, enabling fine-grained, turn-level control over the reasoning mode. This contrasts with prior work that embeds mode control in the system prompt or relies on redundant template-based cues. For tool calling, available functions are declared within dedicated tags in the system prompt, and model-generated calls are enclosed in <tool_call> tags, as illustrated in the system prompt example.
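To illustrate the turn-level mode control, below is a hypothetical sketch of how a ChatML-style prompt with a per-turn thinking flag might be assembled; the token names and exact layout are assumptions for illustration, not the authors' verbatim template.

```python
from typing import Iterable, Tuple

def build_prompt(system: str, turns: Iterable[Tuple[str, str]], think: bool = True) -> str:
    """Assemble a ChatML-style conversation where each user turn carries an
    explicit /think or /no_think flag (hypothetical layout for illustration)."""
    flag = "/think" if think else "/no_think"
    parts = [f"<|im_start|>system\n{system}<|im_end|>"]
    for role, content in turns:
        if role == "user":
            content = f"{content} {flag}"          # turn-level reasoning-mode flag
        parts.append(f"<|im_start|>{role}\n{content}<|im_end|>")
    parts.append("<|im_start|>assistant\n")         # generation starts here
    return "\n".join(parts)

# Usage: deep-thinking mode for a proof question.
print(build_prompt("You are a helpful assistant.",
                   [("user", "Prove that sqrt(2) is irrational.")], think=True))
```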
In the SWE RL stage, the authors employ a simplified Agentless framework that decomposes software repair into localization, repair, and patch validation. The repair phase generates targeted, diff-style patches by concatenating localized files and surrounding context into a unified prompt, preserving code structure to reduce hallucinations. Patch validation proceeds through regression, reproduction, and majority voting phases to ensure functional correctness and robustness. For training stability, a two-stage curriculum extends input context from 16K to 24K tokens, allowing the model to gradually develop multi-file reasoning capabilities without early degradation.
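The majority-voting step of patch validation can be sketched as follows. Normalizing patches before voting (stripping trailing whitespace) is an assumption made for illustration and may differ from the authors' exact procedure.

```python
from collections import Counter
from typing import List, Optional

def normalize_patch(patch: str) -> str:
    """Canonicalize a diff so trivially different but equivalent patches vote together."""
    return "\n".join(line.rstrip() for line in patch.strip().splitlines())

def majority_vote(candidate_patches: List[str]) -> Optional[str]:
    """Among candidates that survived the regression and reproduction phases,
    return the most frequently generated (normalized) patch."""
    survivors = [normalize_patch(p) for p in candidate_patches if p.strip()]
    if not survivors:
        return None
    patch, _count = Counter(survivors).most_common(1)[0]
    return patch
```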
Experiment
- Cascade RL framework validated across human-feedback alignment, instruction following, math reasoning, competitive programming, and software engineering, demonstrating minimal catastrophic forgetting and domain-specific performance gains.
- Nemotron-Cascade-14B-Thinking achieved 78.0/74.8 on LiveCodeBench v5/v6, surpassing DeepSeek-R1-0528 (74.8/73.3) and Gemini-2.5-Pro-06-05 despite using a 64K-token inference budget.
- Nemotron-Cascade-8B unified model matched DeepSeek-R1-0528 on LiveCodeBench v5/v6 (75.3/71.5) with only 8B parameters versus 671B, while achieving silver-medal performance on IOI 2025.
- Nemotron-Cascade-14B reached 43.1% pass@1 on SWE-bench Verified, exceeding specialized models like DeepSWE-32B (42.2%) and general-purpose 14B models (Qwen3-14B: 27.4%).
The authors evaluate the impact of maximum prompt length on code repair performance using a 14B model across four conditions. Results show that increasing the maximum prompt length from 16K to 32K tokens improves repair accuracy, particularly when ground-truth file localization is provided, but performance degrades at 40K, indicating that extending the context beyond 32K does not yield further gains.

The authors evaluate their Nemotron-Cascade models on a series of Codeforces contests, reporting scores, penalties, and estimated Elo ratings across multiple divisions. Results show consistent performance across contests, with estimated Elo ratings ranging from roughly 1500 to 2600, indicating a competitive standing among human participants. Performance varies by contest difficulty and division, with higher scores and ranks typically observed in lower divisions and more recent rounds.

The authors evaluate different reward functions for SWE RL, finding that semantic similarity-based rewards outperform lexical similarity in code repair tasks, especially when ground-truth file localization is provided. Reward shaping improves performance with lexical similarity but offers no additional benefit with semantic similarity, suggesting the latter provides more reliable training signals even at low similarity scores.
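As a rough sketch of what a lexical patch-similarity reward could look like, the snippet below uses difflib's sequence ratio as the lexical metric and a simple thresholded shaping rule; both are assumed stand-ins for illustration, not the authors' verifier, and a semantic variant would instead require an embedding- or model-based scorer.

```python
import difflib

def lexical_patch_reward(generated_patch: str, reference_patch: str) -> float:
    """Execution-free lexical reward: similarity ratio in [0, 1] between the
    generated diff and the ground-truth diff (assumed metric for illustration)."""
    return difflib.SequenceMatcher(None, generated_patch, reference_patch).ratio()

def shaped_reward(similarity: float, threshold: float = 0.5) -> float:
    """Optional reward shaping (hypothetical): zero out very dissimilar patches
    so that noisy low-similarity signals do not dominate training."""
    return similarity if similarity >= threshold else 0.0
```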

The authors apply SWE RL as the final stage in their Cascade RL pipeline and observe substantial gains on SWE-bench Verified, with the 14B-Thinking model achieving 43.1% pass@1, outperforming specialized 32B models. While SWE RL improves software engineering performance, it has minimal impact on other domains, with most changes attributable to evaluation variance. After full training, the unified 8B model narrows the gap with its dedicated 8B-Thinking counterpart on SWE-bench Verified, reaching 37.2% versus 38.5%.
