2달 전

Jeonghye Kim Xufang Luo Minbeom Kim Sangmook Lee Dohyung Kim Jiwon Jeon Dongsheng Li Yuqing Yang

초록

LLM(대형 언어 모델) 의 효율적인 사후 학습(paradigm) 으로 부상한 자기 증류(self-distillation) 는 일반적으로 추론 경로를 단축하면서도 성능을 향상시키는 것으로 알려져 있습니다. 그러나 수학 추론 분야에서는 응답 길이를 줄이는 대신 성능이 저하되는 현상이 관찰됩니다. 본 연구는 이러한 성능 저하가 모델의 추론 과정에서 불확실성을 표현하는 '인식적 언어화(epistemic verbalization)'가 억제되기 때문임을 규명했습니다. 조건부 컨텍스트의 풍부함과 작업 범위를 체계적으로 변형한 통제 실험을 통해, 풍부한 정보를 기반으로 한 교사 모델(teacher) 의 조건부 학습은 불확실성 표현을 억제하여 제한된 작업 범위 내에서도 빠른 도메인 내(in-domain) 최적화를 가능하게 하지만, 미관측 문제(unseen problems) 에 대한 도메인 외(OOD) 성능에는 부정적인 영향을 미친다는 사실을 확인했습니다. 이는 미관측 문제의 경우 불확실성을 표현하고 이에 대응하는 조정이 성능 향상에 기여하기 때문입니다. Qwen3-8B, DeepSeek-Distill-Qwen-7B, Olmo3-7B-Instruct 등 다양한 모델에서 수행한 실험 결과, 최대 40% 에 달하는 성능 저하가 관찰되었습니다. 본 연구의 주요 시사점은 적절한 수준의 불확실성 노출이 견고한 추론에 필수적이며, 단순히 정답 추론 경로를 강화하는 것을 넘어 추론 행동 자체를 최적화하는 것이 중요하다는 점입니다.

One-sentence Summary

Researchers from Microsoft Research, KAIST, and Seoul National University reveal that self-distillation harms mathematical reasoning in LLMs by suppressing epistemic verbalization. They demonstrate that rich teacher conditioning reduces uncertainty expression, causing significant performance drops on out-of-distribution tasks despite shorter reasoning traces.

Key Contributions

The paper identifies that self-distillation in mathematical reasoning degrades performance by suppressing epistemic verbalization, which is the model's expression of uncertainty during the reasoning process.
Controlled experiments varying conditioning context richness and task coverage demonstrate that conditioning a teacher on rich information enables rapid in-domain optimization but harms out-of-distribution performance where uncertainty expression is beneficial.
Empirical results across Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct show performance drops of up to 40%, providing evidence that optimizing reasoning behavior requires preserving appropriate levels of uncertainty beyond reinforcing correct answer traces.

Introduction

Self-distillation is a popular post-training paradigm for LLMs that typically improves performance while shortening reasoning traces, yet it unexpectedly degrades mathematical reasoning capabilities in certain scenarios. Prior work often assumes that compressing reasoning into concise, confident outputs is universally beneficial, but this approach fails to account for the loss of epistemic verbalization where models express uncertainty to navigate complex problems. The authors identify that conditioning the teacher model on rich ground-truth information suppresses these uncertainty signals, leading to significant performance drops of up to 40% on out-of-distribution tasks. They demonstrate that preserving appropriate levels of uncertainty expression is critical for robust reasoning and propose that future training objectives must optimize for reasoning behavior beyond mere answer correctness.

Dataset

The authors incorporate the DAPO-Math-17k dataset alongside their experimental setup to enhance task coverage and model performance.
This subset contains 17,000 math problems derived from a pool of 25,600 samples, with 14,000 distinct problems representing 78% of the total due to repeated sampling over 100 training steps.
Unlike the Chemistry dataset which relies on only six problem types or LiveCodeBench v6 with just 131 problems, DAPO-Math-17k exposes the model to a broad, non-overlapping range of problem types.
The data is processed using a specific prompt format that instructs the model to solve problems step by step and format the final output as "Answer: $Answer" on its own line.
Evaluation is conducted on unseen problem types to ensure the model generalizes beyond the training distribution.

Method

The authors leverage a self-distillation framework to enhance the reasoning capabilities of language models. In this setup, the model $\pi_\theta$ functions as both a student and a teacher under different conditioning contexts. The student generates a sequence $y$ based solely on the input $x$ , while the teacher policy is conditioned on a richer context $c$ that provides additional information such as solutions or environment feedback. The training objective minimizes the divergence between the student and teacher next-token distributions:

$\mathcal { L } _ { \mathrm { S D } } ( \theta ) = \sum _ { t } \mathrm { K L } \big ( \pi _ { \theta } ( \, \cdot \mid x , y _ { < t } ) \parallel \mathrm { s t o p g r a d } \big ( \pi _ { \theta } ( \, \cdot \mid x , c , y _ { < t } ) \big ) \big ) \, .$

This objective encourages the student to match the teacher's predictions under the richer context, enabling the model to improve by distilling information available at training time without requiring an external teacher. A key component of this approach is the handling of uncertainty during the reasoning process. Math reasoning is treated as self-Bayesian reasoning where the model iteratively updates its belief over intermediate hypotheses.

As shown in the figure below, the authors distinguish between procedural reasoning and reasoning with epistemic verbalization.

Reasoning without epistemic signals often leads to premature commitment to incorrect hypotheses with limited opportunity for recovery. In contrast, epistemic verbalization allows the model to express uncertainty, which serves as an informative signal rather than mere stylistic redundancy. This approach helps maintain alternative hypotheses and supports gradual uncertainty reduction. The challenge lies in filtering out non-informative content while retaining epistemic expressions that enable iterative belief refinement, rather than blindly compressing the reasoning process.

Experiment

Experiments on LLM reasoning under varying information richness demonstrate that providing richer conditioning context (e.g., full solutions) significantly reduces response length and the usage of epistemic tokens, leading to more concise and confident outputs.
Supervised fine-tuning using solution-guided responses, which lack epistemic markers, causes substantial performance degradation on math benchmarks, whereas training on unguided responses preserves reasoning capability, indicating that epistemic verbalization is critical for autonomous error correction.
On-policy self-distillation (SDPO) consistently suppresses epistemic tokens and shortens responses compared to GRPO, resulting in severe out-of-distribution performance drops on challenging math tasks, particularly when the base model relies on uncertainty expression for complex reasoning.
The negative impact of self-distillation is linked to task coverage; while concise reasoning improves efficiency on small, narrow datasets, it hinders generalization on larger, diverse problem sets where expressing uncertainty is necessary for adaptation.
Ablation studies confirm that using a fixed teacher policy mitigates but does not eliminate the performance degradation caused by epistemic suppression, and these findings hold across multiple model families including DeepSeek, Qwen, and OLMo.

소스 PDF 코드 보기

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

2달 전

Jeonghye Kim Xufang Luo Minbeom Kim Sangmook Lee Dohyung Kim Jiwon Jeon Dongsheng Li Yuqing Yang

초록

One-sentence Summary

Key Contributions

The paper identifies that self-distillation in mathematical reasoning degrades performance by suppressing epistemic verbalization, which is the model's expression of uncertainty during the reasoning process.
Controlled experiments varying conditioning context richness and task coverage demonstrate that conditioning a teacher on rich information enables rapid in-domain optimization but harms out-of-distribution performance where uncertainty expression is beneficial.
Empirical results across Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct show performance drops of up to 40%, providing evidence that optimizing reasoning behavior requires preserving appropriate levels of uncertainty beyond reinforcing correct answer traces.

Introduction

Dataset

The authors incorporate the DAPO-Math-17k dataset alongside their experimental setup to enhance task coverage and model performance.
This subset contains 17,000 math problems derived from a pool of 25,600 samples, with 14,000 distinct problems representing 78% of the total due to repeated sampling over 100 training steps.
Unlike the Chemistry dataset which relies on only six problem types or LiveCodeBench v6 with just 131 problems, DAPO-Math-17k exposes the model to a broad, non-overlapping range of problem types.
The data is processed using a specific prompt format that instructs the model to solve problems step by step and format the final output as "Answer: $Answer" on its own line.
Evaluation is conducted on unseen problem types to ensure the model generalizes beyond the training distribution.

Method

As shown in the figure below, the authors distinguish between procedural reasoning and reasoning with epistemic verbalization.

Experiment

Experiments on LLM reasoning under varying information richness demonstrate that providing richer conditioning context (e.g., full solutions) significantly reduces response length and the usage of epistemic tokens, leading to more concise and confident outputs.
Supervised fine-tuning using solution-guided responses, which lack epistemic markers, causes substantial performance degradation on math benchmarks, whereas training on unguided responses preserves reasoning capability, indicating that epistemic verbalization is critical for autonomous error correction.
On-policy self-distillation (SDPO) consistently suppresses epistemic tokens and shortens responses compared to GRPO, resulting in severe out-of-distribution performance drops on challenging math tasks, particularly when the base model relies on uncertainty expression for complex reasoning.
The negative impact of self-distillation is linked to task coverage; while concise reasoning improves efficiency on small, narrow datasets, it hinders generalization on larger, diverse problem sets where expressing uncertainty is necessary for adaptation.
Ablation studies confirm that using a fixed teacher policy mitigates but does not eliminate the performance degradation caused by epistemic suppression, and these findings hold across multiple model families including DeepSeek, Qwen, and OLMo.

소스 PDF 코드 보기

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

Command Palette

LLM 의 추론 능력을 (때때로) 저하시키는 이유는 무엇인가?

Jeonghye Kim Xufang Luo Minbeom Kim Sangmook Lee Dohyung Kim Jiwon Jeon Dongsheng Li Yuqing Yang

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

LLM 의 추론 능력을 (때때로) 저하시키는 이유는 무엇인가?

Jeonghye Kim Xufang Luo Minbeom Kim Sangmook Lee Dohyung Kim Jiwon Jeon Dongsheng Li Yuqing Yang

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

LLM 의 추론 능력을 (때때로) 저하시키는 이유는 무엇인가?

Jeonghye Kim Xufang Luo Minbeom Kim Sangmook Lee Dohyung Kim Jiwon Jeon Dongsheng Li Yuqing Yang

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters