On the Non-Decoupling of Supervised Fine-Tuning and Reinforcement Learning in Post-Training
Xueyan Niu, Bo Bai, Wei Han, Weixi Zhang
Abstract
Post-training of large language models typically proceeds by alternating supervised fine-tuning (SFT) and reinforcement learning (RL). These two methods pursue different objectives: while SFT minimizes the cross-entropy between model outputs and expert responses, RL maximizes reward signals derived from human preferences or rule-based verifiers. Modern reasoning models have widely adopted the practice of alternating SFT and RL training phases. However, there has been no theoretical justification for whether these two steps can be decoupled. We prove that decoupling is impossible in either order: (1) in the SFT-then-RL combination, RL increases the SFT loss even under SFT optimality, and (2) in the RL-then-SFT combination, SFT decreases the reward attained by RL. Experiments on the Qwen3-0.6B model confirm the predicted performance degradation, showing that SFT and RL cannot be separated in the post-training process without losing previously attained performance.
One-sentence Summary
The authors from Huawei's Central Research Institute prove that supervised fine-tuning (SFT) and reinforcement learning (RL) in post-training of large language models cannot be decoupled without performance degradation, as SFT undermines RL rewards and RL worsens SFT loss, with experiments on Qwen3-0.6B confirming this fundamental trade-off in alternating training pipelines.
Key Contributions
- Post-training of large language models typically alternates supervised fine-tuning (SFT) and reinforcement learning (RL), but this work proves theoretically that these stages cannot be decoupled: SFT minimizes cross-entropy loss on expert responses, while RL maximizes reward signals, leading to conflicting objectives that prevent independent optimization.
- Theoretical analysis shows that in the SFT-then-RL pipeline, RL training increases the SFT loss even when SFT is optimal, and in the RL-then-SFT pipeline, subsequent SFT reduces the reward achieved by the previously optimized RL model, demonstrating a fundamental incompatibility between the two stages.
- Experiments on Qwen3-0.6B confirm both theoretical predictions: RL degrades SFT performance (increased cross-entropy loss), and SFT following RL leads to reward degradation, validating that SFT and RL must be treated as an integrated optimization process rather than separate steps.
Introduction
The authors investigate the interplay between supervised fine-tuning (SFT) and reinforcement learning (RL) in post-training large language models, a common practice in modern reasoning models like DeepSeek-R1 and Qwen3. While SFT aligns model outputs with expert responses by minimizing cross-entropy loss, RL optimizes for human preferences or rule-based rewards, often leading to conflicting objectives. Prior work has shown inconsistent empirical results, with some observing performance gains from alternating SFT and RL, while others report catastrophic forgetting or limited synergy. The key limitation is the lack of a theoretical understanding of whether these stages can be decoupled without performance degradation. The authors prove that decoupling is fundamentally impossible: performing RL after SFT increases the SFT loss, and performing SFT after RL reduces the reward achieved by RL. Experiments on Qwen3-0.6B validate these findings, showing that SFT and RL must be treated jointly rather than as independent stages to preserve prior performance, highlighting a critical constraint in current LLM post-training pipelines.
Method
The authors investigate the interaction between supervised fine-tuning (SFT) and reinforcement learning (RL) in post-training pipelines for language models, focusing on the non-decoupling of these stages. The overall framework begins with a pretrained base model, which undergoes either SFT or RL as a post-training step, followed by the other, before reaching test-time computation. The two primary sequential strategies are SFT-then-RL and RL-then-SFT, which are analyzed to determine if the stages can be treated as independent optimizations.
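To make the two orderings concrete, here is a minimal schematic sketch (not from the paper) of sequential post-training; `sft_step` and `rl_step` are placeholders for complete SFT and RL training loops:

```python
from typing import Callable, TypeVar

Model = TypeVar("Model")

def post_train(
    base_model: Model,
    sft_step: Callable[[Model], Model],
    rl_step: Callable[[Model], Model],
    order: str = "sft_then_rl",
) -> Model:
    """Run the two post-training stages sequentially in the given order.

    `sft_step` and `rl_step` stand in for full SFT / RL training loops;
    this only encodes the two sequential pipelines compared in the paper.
    """
    stages = [sft_step, rl_step] if order == "sft_then_rl" else [rl_step, sft_step]
    model = base_model
    for stage in stages:
        model = stage(model)
    return model
```

The paper's claim is about exactly this structure: whichever stage runs second degrades the objective optimized by the first.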

In the SFT stage, the pretrained model $p_\theta$ is adapted to task-specific knowledge using labeled data $\mathcal{D}_{\mathrm{SFT}}$. The objective is to minimize the negative log-likelihood, which is equivalent to the cross-entropy loss for next-token prediction. This is formalized as
$$\mathcal{L}_{\mathrm{SFT}}(p_\theta) = -\sum_{(x,y)\in\mathcal{D}_{\mathrm{SFT}}} \sum_{j=1}^{|y|} \log p_\theta\bigl(y_j \mid x, y_{<j}\bigr).$$
The resulting model $p_{\theta_{\mathrm{SFT}}}$ is optimized to generate outputs $y$ given a prompt $x$, and this process is effective for in-distribution tasks.
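For illustration, a minimal sketch of the per-example negative log-likelihood term, assuming a Hugging Face-style causal LM and tokenizer (the function name and the simplified prompt masking are illustrative, not the authors' code):

```python
import torch
import torch.nn.functional as F

def sft_loss(model, tokenizer, prompt, response):
    """Negative log-likelihood of `response` given `prompt` under `model`.

    L_SFT sums this quantity over all (x, y) pairs in D_SFT.
    NOTE: special-token handling is simplified for illustration.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    response_ids = tokenizer(response, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)

    logits = model(input_ids).logits                 # (1, T, vocab)
    # Predict token j from everything before it; score only response tokens.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    shift_labels[:, : prompt_ids.shape[1] - 1] = -100  # mask prompt positions
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
        reduction="sum",
    )
```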
The RL stage, typically used for aligning models with human preferences, treats the language model as a policy $p_\theta$. It aims to maximize the expected reward $r_G(x,y)$ over the output distribution. When the ground-truth reward is not available, a proxy reward model $r(\cdot,\cdot)$ is trained on preference data $\mathcal{D}_{\mathrm{RL}}$, which consists of prompts paired with positive and negative responses. The policy is updated using a policy-gradient objective, such as PPO, which maximizes the expected reward while regularizing against drift from a reference model $\pi_{\mathrm{ref}}$. The objective is
$$I_{\mathrm{RL}}(\theta) = \mathbb{E}_{x\sim p_{\mathcal{D}_{\mathrm{RL}}},\, y\sim p_\theta(\cdot\mid x)}\bigl[r(x,y)\bigr] - \beta\,\mathbb{E}_{x\sim p_{\mathcal{D}_{\mathrm{RL}}}}\Bigl[D_{\mathrm{KL}}\bigl(p_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\bigr)\Bigr].$$
The closed-form solution for the updated policy is
$$p_{\theta_{\mathrm{RL}}}(y\mid x) = \frac{1}{Z_\beta(x)}\,\pi_{\mathrm{ref}}(y\mid x)\,\exp\bigl(r(x,y)/\beta\bigr),$$
where $Z_\beta(x)$ is the normalizing partition function.
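The following toy sketch spells out this KL-regularized objective and its closed-form tilted solution for a single prompt with an enumerable set of candidate responses (an assumption made only for illustration; LLM-scale RL does not enumerate responses):

```python
import numpy as np

def kl_regularized_objective(p_theta, pi_ref, reward, beta):
    """I_RL for one prompt over an enumerable set of candidate responses.

    p_theta, pi_ref: probability vectors over candidate responses y;
    reward: vector of r(x, y); beta: strength of the KL regularizer.
    """
    expected_reward = float((p_theta * reward).sum())
    kl = float((p_theta * np.log(p_theta / pi_ref)).sum())
    return expected_reward - beta * kl

def closed_form_policy(pi_ref, reward, beta):
    """p_RL(y|x) proportional to pi_ref(y|x) * exp(r(x, y) / beta)."""
    weights = pi_ref * np.exp(reward / beta)
    return weights / weights.sum()   # normalization by Z_beta(x)

# The tilted policy attains a higher objective value than the reference:
pi_ref = np.array([0.5, 0.3, 0.2])
reward = np.array([0.0, 1.0, 2.0])
p_rl = closed_form_policy(pi_ref, reward, beta=1.0)
assert kl_regularized_objective(p_rl, pi_ref, reward, 1.0) >= \
       kl_regularized_objective(pi_ref, pi_ref, reward, 1.0)
```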
The analysis reveals that the two stages are fundamentally coupled. In the SFT-then-RL pipeline, even if the SFT stage has converged, the subsequent RL phase inevitably degrades the SFT loss. This is because the RL update, which maximizes reward, shifts the model's output distribution away from the SFT-optimized distribution, leading to a non-trivial increase in the SFT loss. Conversely, in the RL-then-SFT pipeline, the SFT stage, which aims to fit the SFT data, can create a persistent performance gap that decreases the reward achieved by the RL stage. This is shown by the fact that any SFT update from an RL policy cannot increase the expected reward by more than a constant controlled by the distribution shift, and under stronger assumptions, it can lead to a measurable reward deficit. Therefore, the authors conclude that SFT and RL cannot be decoupled and should be treated as a single joint optimization problem.
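A toy numerical illustration (not from the paper) of the first direction: if the SFT-optimal model matches the expert distribution exactly, any non-trivial exponential tilting of that distribution by the reward necessarily raises the cross-entropy loss.

```python
import numpy as np

# Expert (SFT data) distribution over a 3-token toy vocabulary.
q = np.array([0.6, 0.3, 0.1])
p_sft = q.copy()                    # SFT-optimal model matches the data
reward = np.array([0.0, 1.0, 2.0])  # the reward favors different tokens
beta = 1.0

# KL-regularized RL closed form: tilt the SFT model by exp(reward / beta).
p_rl = p_sft * np.exp(reward / beta)
p_rl /= p_rl.sum()

def cross_entropy(target, model):
    return float(-(target * np.log(model)).sum())

print(cross_entropy(q, p_sft))  # ~0.90 nats: the minimum achievable SFT loss
print(cross_entropy(q, p_rl))   # ~1.17 nats: the RL update raised the SFT loss
```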
Experiment
- SFT-then-RL experiment: Fine-tuning the Qwen3-0.6B model on a CoLA-style SFT dataset followed by GRPO-based RL leads to a sharp increase in cross-entropy loss, exceeding the base model’s loss, validating Theorem 3.1 on non-decoupling.
- RL-then-SFT experiment: Applying SFT after RL on the same dataset causes a significant drop in mean@1 reward from 0.385 (≈69.5% accuracy) to 0.343 (≈67.2% accuracy) under robust evaluation, confirming Theorem 4.1 and demonstrating performance degradation due to objective mismatch.
- Both pipelines show performance deterioration in the second stage, empirically validating the inherent coupling between SFT and RL, with results consistent across both orders of training (a rough sketch of how such degradation can be measured follows below).
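As a rough sketch of how the reported mean@1 degradation can be quantified, with hypothetical `sample_fn` and `reward_fn` helpers (this is not the authors' evaluation code):

```python
import numpy as np

def mean_at_1(model, prompts, sample_fn, reward_fn):
    """mean@1: average reward of one sampled response per prompt.

    `sample_fn(model, prompt)` and `reward_fn(prompt, response)` are
    assumed helpers (e.g. decoding plus a rule-based verifier).
    """
    rewards = [reward_fn(x, sample_fn(model, x)) for x in prompts]
    return float(np.mean(rewards))

# Degradation check for either pipeline: re-evaluate the first stage's
# metric after the second stage and compare, e.g.
#   mean_at_1(rl_model, prompts, ...)  vs  mean_at_1(sft_after_rl_model, prompts, ...)
```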