

Why Does Self-Distillation (Sometimes) Degrade the Reasoning Ability of LLMs?

Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, Yuqing Yang

Abstract

Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. In mathematical reasoning, however, we find that although this method reduces answer length, it degrades performance. We attribute this drop to the suppression of epistemic verbalization, i.e., the model's expression of uncertainty during the reasoning process. Through controlled experiments that vary the richness of the conditioning context and the task coverage, we show that a teacher conditioned on rich information suppresses the expression of uncertainty. This enables fast in-domain optimization when task coverage is limited, but harms OOD performance, where expressing uncertainty and adapting accordingly is beneficial on unsolved problems. On Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our results underscore that exhibiting appropriate levels of uncertainty is essential for robust reasoning and highlight the need to optimize reasoning behavior beyond merely reinforcing correct answer traces.

One-sentence Summary

Researchers from Microsoft Research, KAIST, and Seoul National University reveal that self-distillation harms mathematical reasoning in LLMs by suppressing epistemic verbalization. They demonstrate that rich teacher conditioning reduces uncertainty expression, causing significant performance drops on out-of-distribution tasks despite shorter reasoning traces.

Key Contributions

  • The paper identifies that self-distillation in mathematical reasoning degrades performance by suppressing epistemic verbalization, which is the model's expression of uncertainty during the reasoning process.
  • Controlled experiments varying conditioning context richness and task coverage demonstrate that conditioning a teacher on rich information enables rapid in-domain optimization but harms out-of-distribution performance where uncertainty expression is beneficial.
  • Empirical results across Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct show performance drops of up to 40%, providing evidence that optimizing reasoning behavior requires preserving appropriate levels of uncertainty beyond reinforcing correct answer traces.

Introduction

Self-distillation is a popular post-training paradigm for LLMs that typically improves performance while shortening reasoning traces, yet it unexpectedly degrades mathematical reasoning capabilities in certain scenarios. Prior work often assumes that compressing reasoning into concise, confident outputs is universally beneficial, but this approach fails to account for the loss of epistemic verbalization where models express uncertainty to navigate complex problems. The authors identify that conditioning the teacher model on rich ground-truth information suppresses these uncertainty signals, leading to significant performance drops of up to 40% on out-of-distribution tasks. They demonstrate that preserving appropriate levels of uncertainty expression is critical for robust reasoning and propose that future training objectives must optimize for reasoning behavior beyond mere answer correctness.

Dataset

  • The authors incorporate the DAPO-Math-17k dataset alongside their experimental setup to enhance task coverage and model performance.
  • The dataset contains 17,000 math problems; over 100 training steps the model draws 25,600 samples, which, due to repeated sampling, cover roughly 14,000 distinct problems (about 78% of the dataset).
  • Unlike the Chemistry dataset which relies on only six problem types or LiveCodeBench v6 with just 131 problems, DAPO-Math-17k exposes the model to a broad, non-overlapping range of problem types.
  • The data is processed using a specific prompt format that instructs the model to solve problems step by step and format the final output as "Answer: $Answer" on its own line.
  • Evaluation is conducted on unseen problem types to ensure the model generalizes beyond the training distribution.
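The prompt convention described above (solve step by step, then put the result on its own line as "Answer: $Answer") can be sketched as a small formatting-and-parsing helper. This is a minimal illustration, not the paper's actual code: the template wording and function names are assumptions; only the "Answer: $Answer" final-line convention comes from the text.

```python
import re

# Hypothetical prompt template; only the "Answer: $Answer" final-line
# convention is taken from the paper's description.
PROMPT_TEMPLATE = (
    "Solve the following math problem step by step. "
    "Put your final result on its own line in the form "
    '"Answer: $Answer".\n\nProblem: {problem}'
)


def build_prompt(problem: str) -> str:
    """Format a problem with the step-by-step instruction."""
    return PROMPT_TEMPLATE.format(problem=problem)


def extract_answer(completion: str):
    """Return the last 'Answer: ...' line from a completion, or None."""
    matches = re.findall(r"^Answer:\s*(.+?)\s*$", completion, flags=re.MULTILINE)
    return matches[-1] if matches else None
```

A parser like this makes evaluation on unseen problem types mechanical: grade by comparing the extracted string against the reference answer.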

Method

The authors leverage a self-distillation framework to enhance the reasoning capabilities of language models. In this setup, a single model π_θ functions as both student and teacher under different conditioning contexts. The student generates a sequence y based solely on the input x, while the teacher policy is conditioned on a richer context c that provides additional information such as solutions or environment feedback. The training objective minimizes the divergence between the student and teacher next-token distributions:

$$\mathcal{L}_{\mathrm{SD}}(\theta) = \sum_{t} \mathrm{KL}\big( \pi_\theta(\cdot \mid x, y_{<t}) \,\big\|\, \mathrm{stopgrad}\big( \pi_\theta(\cdot \mid x, c, y_{<t}) \big) \big).$$

This objective encourages the student to match the teacher's predictions under the richer context, enabling the model to improve by distilling information available at training time without requiring an external teacher. A key component of this approach is the handling of uncertainty during the reasoning process. Math reasoning is treated as self-Bayesian reasoning where the model iteratively updates its belief over intermediate hypotheses.
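The per-token KL objective above can be illustrated with a minimal pure-Python sketch. The toy distributions and function names are assumptions for illustration; the stop-gradient is represented by simply treating the teacher distributions as constant targets, which is what it amounts to in the paper's objective.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def self_distillation_loss(student_dists, teacher_dists):
    """L_SD = sum_t KL(student_t || stopgrad(teacher_t)).

    student_dists[t]: next-token distribution pi_theta(. | x, y_<t)
    teacher_dists[t]: pi_theta(. | x, c, y_<t), the same model conditioned
    on the richer context c, treated as a constant target (the stop-grad).
    """
    return sum(kl(p, q) for p, q in zip(student_dists, teacher_dists))
```

Because student and teacher share parameters θ, the loss is zero whenever the extra context c does not change the model's predictions, and positive exactly where c makes the teacher more confident than the unconditioned student.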

As shown in the figure below, the authors distinguish between procedural reasoning and reasoning with epistemic verbalization.

Reasoning without epistemic signals often leads to premature commitment to incorrect hypotheses with limited opportunity for recovery. In contrast, epistemic verbalization allows the model to express uncertainty, which serves as an informative signal rather than mere stylistic redundancy. This approach helps maintain alternative hypotheses and supports gradual uncertainty reduction. The challenge lies in filtering out non-informative content while retaining epistemic expressions that enable iterative belief refinement, rather than blindly compressing the reasoning process.
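One way to quantify the epistemic verbalization discussed above is to count hedging markers in a reasoning trace. The marker list below is a hypothetical stand-in; the paper's exact set of epistemic tokens is not given in this summary.

```python
import re

# Hypothetical epistemic markers; the paper's actual token set is an
# assumption here, chosen to illustrate the measurement.
EPISTEMIC_MARKERS = [
    "wait", "maybe", "perhaps", "let me check",
    "i'm not sure", "alternatively", "hmm", "double-check",
]


def epistemic_rate(trace: str) -> float:
    """Marker occurrences per word in a reasoning trace (crude substring count)."""
    text = trace.lower()
    hits = sum(len(re.findall(re.escape(m), text)) for m in EPISTEMIC_MARKERS)
    return hits / max(len(text.split()), 1)
```

A metric like this makes the suppression effect measurable: traces produced after self-distillation would score near zero, while the base model's traces on hard problems would score noticeably higher.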

Experiment

  • Experiments on LLM reasoning under varying information richness demonstrate that providing richer conditioning context (e.g., full solutions) significantly reduces response length and the usage of epistemic tokens, leading to more concise and confident outputs.
  • Supervised fine-tuning using solution-guided responses, which lack epistemic markers, causes substantial performance degradation on math benchmarks, whereas training on unguided responses preserves reasoning capability, indicating that epistemic verbalization is critical for autonomous error correction.
  • On-policy self-distillation (SDPO) consistently suppresses epistemic tokens and shortens responses compared to GRPO, resulting in severe out-of-distribution performance drops on challenging math tasks, particularly when the base model relies on uncertainty expression for complex reasoning.
  • The negative impact of self-distillation is linked to task coverage; while concise reasoning improves efficiency on small, narrow datasets, it hinders generalization on larger, diverse problem sets where expressing uncertainty is necessary for adaptation.
  • Ablation studies confirm that using a fixed teacher policy mitigates but does not eliminate the performance degradation caused by epistemic suppression, and these findings hold across multiple model families including DeepSeek, Qwen, and OLMo.
