Command Palette
Search for a command to run...
Zone der proximalen Politikoptimierung: Lehrer in Prompts, nicht in Gradienten
Zone der proximalen Politikoptimierung: Lehrer in Prompts, nicht in Gradienten
Zusammenfassung
Wissensdistillation überträgt die Kompetenz eines Lehrers auf einen kleinen Schüler, ist jedoch im Regime kleiner Schüler anfällig: Das Erzwingen der Imitation von Logits eines deutlich größeren Lehrers konzentriert den Schüler auf die schärfsten Modi des Lehrers, was die Generalisierung auf Benchmark-Familien außerhalb des Trainingskorpus beeinträchtigt. Verstärkungslernen (RL) vermeidet die Imitation von Logits, indem es auf den eigenen Rollouts des Schülers trainiert. Bei Fragen jedoch, bei denen jeder Rollout fehlschlägt – was einen Advantage-Wert von Null ergibt und die Rollouts stillschweigend verworfen werden –, verletzt das Einbetten der Antwort eines stärkeren Lehrers in den Policy Gradient die On-Policy-Annahme und induziert eine Drift. Wir stellen die Zone of Proximal Policy Optimization (ZPPO) vor, die sich an Vygotskis Zone der proximalen Entwicklung orientiert und den Lehrer im Prompt statt im Policy Gradient hält. Bei schwierigen Fragen konstruiert ZPPO zwei reformulierte Prompts: Eine Binary Candidate-included Question (BCQ) paart eine korrekte Lehrerantwort mit einer inkorrekten Schülerantwort als anonymisierte Kandidaten, die der Schüler unterscheiden muss, und eine Negative Candidate-included Question (NCQ) aggregiert die fehlerhaften Rollouts des Schülers zu einem einzigen Prompt, um deren gemeinsame Fehlermodi sichtbar zu machen. Ein Prompt-Replay-Puffer zirkuliert jede schwierige Frage erneut, bis sie entweder abgeschlossen ist – die durchschnittliche Rollout-Genauigkeit des Schülers dabei die Hälfte erreicht hat – oder bei endlicher Kapazität nach FIFO-Prinzip entfernt wird, wodurch BCQ und NCQ innerhalb der aktuellen Zone der proximalen Entwicklung des Schülers verstärkt werden. Auf der Qwen3.5-Familie bei vier Schüler-Skalen (0.8B-9B) mit einem 27B-Lehrer, die als Vision-Language-Modelle nachtrainiert und an einer 31-Benchmark-Suite (16 VLM, 10 LLM, 5 Video) evaluiert wurden, übertrifft ZPPO die Off-/On-Policy-Distillation sowie GRPO, wobei die größten Verbesserungen auf der kleinsten Skala erzielt werden.
One-sentence Summary
Zone of Proximal Policy Optimization (ZPPO) addresses the brittleness of logit imitation and the on-policy drift of reinforcement learning by routing teacher guidance into prompts instead of gradients, utilizing Binary Candidate-included Questions to pair correct teacher responses with incorrect student outputs as anonymized candidates and Negative Candidate-included Questions to aggregate student failures, thereby enabling small student models to learn through targeted prompt discrimination without violating policy assumptions.
Key Contributions
- Introduces Zone of Proximal Policy Optimization (ZPPO) to overcome small-student distillation brittleness and policy drift by relocating teacher guidance from the policy gradient directly into the prompt. This architecture ensures that every token processed by the policy gradient remains student-generated, thereby preserving strict on-policy training dynamics.
- Dynamically constructs two reformulated prompts for challenging questions that yield zero advantage during reinforcement learning. These include Binary Candidate-included Questions (BCQ) that pair anonymized correct teacher responses with incorrect student rollouts for discrimination, and Negative Candidate-included Questions (NCQ) that aggregate failed student attempts to surface shared failure modes.
- Extends reinforcement learning post-training across math, science, broad knowledge, and multimodal reasoning domains while circumventing the generalization collapse typical of small-student regimes. The framework maintains on-policy guarantees through dynamic candidate generation and a targeted prompt replay buffer that amplifies reformulated prompts within the student's zone of proximal development.
Introduction
Efficiently post-training compact vision-language and language models for complex reasoning is essential for deploying scalable AI, yet existing methods struggle to transfer knowledge effectively. Knowledge distillation becomes brittle for smaller students, often triggering memorization and poor generalization, while standard RL post-training silently discards prompts where the model consistently fails due to zero group advantage. Hybrid approaches that splice teacher responses into the policy gradient violate on-policy assumptions and induce severe drift, whereas prompt-based scaffolding typically relies on static hints that encourage shortcut copying. The authors leverage a prompt-centric framework called Zone of Proximal Policy Optimization to bypass these bottlenecks. By reformulating failed prompts with teacher and self-generated candidates and replaying them, the method transfers teacher knowledge exclusively through the prompt, ensuring all gradient updates remain strictly on-policy while dynamically scaffolding the student within its current learning frontier.
Dataset
- Dataset Composition & Sources: The authors construct ZPPO-77K, a multimodal reinforcement learning corpus containing approximately 77,000 triples of input images, text questions, and gold answers. The data is aggregated from two public repositories: the Vero-600k collection, which spans 34 sub-datasets across STEM, chart and OCR, and general visual question answering, and the MMFineReason-SFT-586K collection, a chain-of-thought corpus annotated with a per-sample success rate generated by a 4B teacher model.
- Subset Details & Filtering: The authors organize the corpus into two tiers to balance reasoning depth and auxiliary knowledge. Tier 1 prioritizes direct reasoning tasks like mathematics and diagram analysis, capping each sub-dataset at 2,800 samples. Tier 2 covers auxiliary grounding and recognition tasks, capping each at 1,400 samples. To emphasize genuinely difficult problems, the authors discard any MMFineReason examples where the 4B teacher model achieved a success rate above 0.5. Cross-source duplicates are resolved by prioritizing the Vero repository, while per-sample filters enforce a maximum answer length of 512 characters and a minimum image resolution of 100 by 100 pixels.
- Training Usage & Mixture Strategy: The authors use this curated dataset to train the student policy through reinforcement learning. By applying tier-based caps, they construct a controlled mixture that heavily weights complex multimodal reasoning over general recognition. During training, the model generates rollouts using high-temperature sampling to encourage exploration, while a standardized prompt template forces an internal reasoning process followed by a strictly formatted final answer.
- Input Processing & Evaluation Pipeline: The authors apply consistent image scaling constraints across the pipeline, requiring inputs to stay within a 256 by 32 by 32 to 1280 by 32 by 32 pixel range. They strip all task-specific formatting instructions from upstream prompts and apply a unified RL closer during both training and evaluation. This ensures the policy optimizes against the exact answer-extraction rules it encounters during testing, with evaluation metrics relying on deterministic parsers that fall back to a dedicated judge model only when strict formatting cannot be parsed.
Experiment
Evaluated across LLM, VLM, and video benchmarks at multiple model scales, the main experiments validate that ZPPO consistently enhances generalization where standard reinforcement learning and distillation methods often degrade performance. Component ablations confirm that while prompt replay alone is insufficient, pairing it with contrastive candidate selection and collective negative failure analysis yields a super-additive learning signal that sustains exploration on difficult questions. Training dynamics and candidate audits further validate that the reformulation strategy extracts actionable insights from previously unrecoverable errors without relying on trivial answer matching or off-policy shortcuts. Collectively, these findings demonstrate that the method successfully bridges the capability gap between smaller students and larger teachers, delivering robust cross-domain improvements that scale with model capacity.
The experiment evaluates multiple training strategies on vision-language models across sixteen diverse benchmarks. The results demonstrate that the proposed ZPPO method consistently outperforms alternative approaches, including policy distillation and standard reinforcement learning variants. This superior performance is observed across both smaller and larger model scales, indicating robust generalization capabilities. ZPPO achieves the highest average performance across all benchmarks compared to distillation and GRPO methods. The method shows consistent improvements on individual benchmarks, with positive gains observed in nearly every category. Scaling up the model size preserves the performance advantage of ZPPO over other training techniques.
The authors evaluate various training methods on language model and video benchmarks across different model scales. Results demonstrate that distillation techniques generally fail to improve generalization and often degrade performance compared to the base model. In contrast, the proposed ZPPO method consistently yields the largest performance gains across all benchmarks and scales, significantly outperforming both standard reinforcement learning and distillation approaches. Distillation methods typically underperform the base model on video benchmarks and show minimal gains on language tasks. ZPPO achieves the highest average scores across all evaluated benchmarks, with substantial improvements over baseline reinforcement learning. Performance gains from ZPPO are most pronounced at smaller model scales, though they remain robust at larger capacities.
The authors evaluate the proposed ZPPO method against several baseline and ablation variants across different model scales and benchmark types. Results show that ZPPO consistently achieves the highest performance across all categories, demonstrating the effectiveness of combining its core components. The findings highlight the superiority of the full recipe over isolated modifications or simpler prompt-based guidance strategies. ZPPO outperforms all baseline and ablation methods across LLM, VLM, and Video benchmarks for both model sizes. Incorporating BCQ into the GRPO framework yields substantial improvements over the standard GRPO baseline. Prompt-based guidance methods like Hint and Prefix show limited gains compared to the full ZPPO recipe.
The authors evaluate the ZPPO method across various model scales on LLM, VLM, and Video benchmarks. Results show that ZPPO consistently outperforms the base model across all scales and benchmark families. The performance gains are most pronounced for smaller models, where the gap between the student and teacher is widest, and diminish as model size increases. This indicates that ZPPO is particularly effective at enhancing generalization and learning from hard examples in smaller students. ZPPO consistently improves performance across all model sizes and benchmark categories compared to the base model. Smaller models achieve the largest relative gains, highlighting the method's effectiveness for weaker students. The approach demonstrates strong generalization capabilities, yielding positive results on LLM and Video tasks beyond the training domain.
The authors analyze how inner-loop iteration counts and batch normalization choices affect performance across language, vision-language, and video benchmarks. The data shows that moderate iteration settings maximize accuracy, whereas larger counts degrade results by exacerbating policy drift. Furthermore, normalization methods that omit zero-advantage groups consistently surpass both unnormalized and fully inclusive variants. Accuracy peaks at moderate inner-loop iteration counts, with higher values causing performance drops due to off-policy drift. Batch normalization that excludes zero-advantage groups reliably outperforms both unnormalized settings and those that retain trivial groups. The optimal training configuration balances update frequency and stability, delivering consistent gains across all evaluated benchmark families.
The experiments evaluate the proposed ZPPO method against distillation, standard reinforcement learning, and prompt-based baselines across diverse language, vision-language, and video benchmarks at varying model scales. Results demonstrate that ZPPO consistently outperforms all competing approaches, delivering particularly substantial gains for smaller models while maintaining robust improvements at larger capacities. Ablation studies confirm that the full methodological combination is essential, as isolated components yield significantly lower performance. Finally, hyperparameter analysis reveals that moderate inner-loop iterations and batch normalization excluding zero-advantage groups are critical for maximizing training stability and cross-domain accuracy.