4 hours ago

Jonas Hübotter Frederike Lübeck Lejs Behric Anton Baumann Marco Bagatella Daniel Marta Ido Hakimi Idan Shenfeld Thomas Kleine Buening Carlos Guestrin

Abstract

Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. Many verifiable environments actually provide rich textual feedback, such as runtime errors or judge evaluations, that explain why an attempt failed. We formalize this setting as reinforcement learning with rich feedback and introduce Self-Distillation Policy Optimization (SDPO), which converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model. SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model's ability to retrospectively identify its own mistakes in-context. Across scientific reasoning, tool use, and competitive programming on LiveCodeBench v6, SDPO improves sample efficiency and final accuracy over strong RLVR baselines. Notably, SDPO also outperforms baselines in standard RLVR environments that only return scalar feedback by using successful rollouts as implicit feedback for failed attempts. Finally, applying SDPO to individual questions at test time accelerates discovery on difficult binary-reward tasks, achieving the same discovery probability as best-of-k sampling or multi-turn conversations with 3x fewer attempts.
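To make the contrast between a scalar outcome reward and a dense signal concrete, here is a minimal toy sketch of a per-token distillation loss. It assumes the self-teacher's and the policy's next-token logits are already available for each position of an attempt, and it uses KL(teacher || policy) as the divergence. The abstract does not state which divergence or direction SDPO uses, so that choice, the function names, and the toy logits are all illustrative assumptions rather than the authors' implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def per_token_kl(teacher_logits, student_logits):
    """Return KL(teacher || student) for every token position.

    teacher_logits: logits of the model conditioned on the feedback
                    (hypothetical input, one row per position).
    student_logits: logits of the policy without the feedback.
    """
    losses = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_lp = log_softmax(t_row)
        s_lp = log_softmax(s_row)
        losses.append(sum(math.exp(t) * (t - s) for t, s in zip(t_lp, s_lp)))
    return losses


# Toy example: vocabulary of 3 tokens, attempt of 2 positions.
student = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
# After seeing the feedback, the teacher agrees at position 0
# but strongly prefers token 1 at position 1.
teacher = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]

dense_signal = per_token_kl(teacher, student)
print(dense_signal)  # one value per position instead of one per attempt
```

In this toy case the loss is zero where the feedback does not change the prediction and positive where it does, which is the sense in which the signal localizes credit to individual tokens.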




Reinforcement Learning via Self-Distillation
