Command Palette
Search for a command to run...
Der Verifizierungshorizont: Keine Silberkugel für Coding Agent Belohnungen
Der Verifizierungshorizont: Keine Silberkugel für Coding Agent Belohnungen
Zusammenfassung
Eine klassische Intuition besagt, dass das Verifizieren einer Lösung einfacher ist als deren Erzeugung. Für heutige Coding agents wird diese Intuition umgekehrt: Da Foundation-Modelle stärkere Reasoning-Fähigkeiten entwickeln und Engineering-Harnesses immer ausgefeilter werden, ist die Generierung komplexer Kandidatenlösungen nicht mehr schwierig – das zuverlässige Verifizieren derselben ist zum schwierigeren Problem geworden. Jeder Verifier, den wir entwickeln können, ist lediglich ein Proxy für menschliche Absichten, niemals die Absicht selbst. Dies führt bei der Verifikation zu einer zweifachen Schwierigkeit: Erstens sind Absichten von Natur aus unpräzise spezifiziert, was es grundsätzlich schwierig macht, zuverlässig zu prüfen, ob sie erfüllt wurden; zweitens vergrößert sich während des Modelltrainings durch Optimierungsprozesse die Lücke zwischen Proxy und Absicht – was sich in Form von Reward Hacking oder Signal-Sättigung manifestiert. Um diesem Problem zu begegnen, charakterisieren wir die Qualität von Verifikationssignalen entlang dreier Dimensionen – Skalierbarkeit, Faithfulness und Robustheit – und vertreten die Auffassung, dass das gleichzeitige Erreichen aller drei Aspekte die zentrale Herausforderung darstellt. Zudem analysieren wir vier Reward-Konstruktionen: einen Test-Verifier für allgemeine Coding-Aufgaben, einen Rubric-Verifier für Frontend-Aufgaben, den Nutzer als Verifier für reale agent tasks sowie einen automatisierten agent verifier für Long-Horizon-Aufgaben. Über verschiedene Aufgabentypen und Policy-Fähigkeitsstufen hinweg führen wir tiefgehende Analysen und Experimente zu den Kernherausforderungen des Reward-Designs sowie zur effektiveren Nutzung von Reward-Signalen durch. Experimente zeigen, dass ein gezieltes Verifikationsdesign Reward Hacking effektiv unterdrücken, die Qualität der Aufgabenerledigung verbessern und signifikante Verbesserungen über mehrere interne und öffentliche Benchmarks hinweg erzielen kann. Diese Erfahrungen weisen insgesamt auf eine Kernaussage hin: dass keine statische Reward-Funktion wirksam bleiben kann, solange die Policy-Fähigkeit weiter zunimmt; und dass die Verifikation mit dem Generator ko-evolutionär weiterentwickelt werden muss.
One-sentence Summary
The Qwen Team characterizes verification signals across scalability, faithfulness, and robustness to address the proxy limitations of coding agent rewards, demonstrating through experiments across internal and public benchmarks that targeted reward constructions suppress reward hacking, improve task completion quality, and must co-evolve with generator capabilities rather than rely on static functions.
Key Contributions
- This paper characterizes verification signal quality along three dimensions, scalability, faithfulness, and robustness, and establishes their simultaneous optimization as a central challenge for coding agents.
- The study develops four specialized reward constructions tailored to distinct development scenarios, including a test verifier for general coding tasks, a rubric verifier for frontend tasks, a user-as-verifier design for real-world agent tasks, and an automated agent verifier for long-horizon tasks.
- Extensive experiments across diverse task types and policy capability levels demonstrate that targeted verification designs effectively suppress reward hacking, improve task completion quality, and yield significant gains across multiple internal and public benchmarks.
Introduction
The authors examine a shifting dynamic in coding agent development where generating sophisticated code has outpaced the ability to reliably verify it. Because human intent is inherently underspecified, existing verifiers act only as imperfect proxies that inevitably suffer from reward hacking, signal saturation, and misalignment as model capabilities grow. Prior approaches typically rely on static test suites, fixed rubrics, or offline feedback, which fail to capture dynamic runtime behavior, distinguish engineering quality, or adapt to emerging exploitation strategies. To address these gaps, the authors propose a co-evolutionary framework that aligns verifier design with generator advancement across four distinct task domains. They introduce targeted reward constructions, including a trajectory-level behavior monitor that penalizes shortcut-dependent solutions, and demonstrate that adaptive verification significantly suppresses reward hacking while improving clean task completion across multiple benchmarks.
Dataset
Method
The authors propose a comprehensive verification and training framework designed to ensure that reward signals remain faithful, scalable, and robust as policy capabilities advance. This approach treats verification as core infrastructure that actively co-evolves with the policy model. As illustrated in the conceptual diagram, the intelligence capabilities of both the verifier and the policy model progress over training time. The system is designed to overcome challenges such as reward hacking and guidance saturation by continuously evolving the verifier in tandem with the policy, creating a co-evolution flywheel that sustains trustworthy capability growth.
For frontend and visual tasks, the authors design an agentic interactive judge that evaluates generated artifacts through simulated user interactions. Refer to the framework diagram for the complete pipeline. The process begins with preprocessing, where page information such as the accessibility tree and browser state is extracted, while evaluation criteria are synthesized into critical and detail checklists. An action planner then generates a comprehensive action list in a single forward pass, specifying the sequence of interactions required to exercise the target functionality. This action list is executed by a Playwright-based render server in a live browser environment, which records an interaction trace comprising screen recordings and state changes. Finally, a judge model evaluates sampled frames from the recordings alongside the source code against the predefined rubric criteria to produce a final score. By grounding evaluation in actual runtime behavior rather than static code inspection, this architecture captures dynamic behaviors like state transitions and multi-step workflows while resisting reward hacking based on source code length.
The training process leverages these verification signals through multiple objectives tailored to different data types. For tasks derived from real-world user interactions, the authors treat user feedback as the primary verifier. They extract process-level natural language feedback and partition the response trajectory into contiguous spans with consistent polarity. The training framework incorporates Supervised Fine-Tuning (SFT) and Reweight SFT (RW-SFT), which applies differentiated loss weights to tokens based on their polarity annotations to amplify positive signals and attenuate negative ones. Standard SFT applies a uniform cross-entropy loss across all tokens, whereas RW-SFT introduces a weight function defined as:
w(pt)=⎩⎨⎧wposwneuwnegif pt=positiveif pt=neutralif pt=negativeThe corresponding loss is calculated as:
LRW−SFT(θ)=−Et[w(pt)logπθ(yt∣x,y<t)]To further align the model with human intent, the authors introduce Span-Level KTO. This method defines the implicit reward for each span as the sum of log-likelihood ratios between the policy model and a frozen reference model:
rθ(x,Sk)=t=sk∑ek[logπθ(yt∣x,y<t)−logπref(yt∣x,y<t)]The reference point is estimated online using an exponential moving average of batch rewards:
zref←α⋅zref+(1−α)⋅rˉbatchThe preference loss applies distinct value functions to positive and negative spans based on the advantage relative to the reference point:
ℓ(Sk)={−λw⋅σ(β⋅ak)−λl⋅σ(−β⋅ak)if pSk=positiveif pSk=negativeThe overall preference objective is computed as the expectation over all spans:
Lpref(θ)=ESk[ℓ(Sk)]Neutral tokens are preserved through standard cross-entropy regularization:
Lneutral(θ)=−Et∈Tneu[logπθ(yt∣x,y<t)]The complete training objective combines the preference loss with the neutral regularization term to guide policy optimization.
Experiment
The experiments evaluate the proposed Span-KTO framework against standard supervised and reweighting baselines across multiple software engineering benchmarks, validating its capacity to improve both task resolution rates and overall agent behavior. Analysis demonstrates that simply discarding or heavily penalizing negative training data degrades performance, whereas Span-KTO effectively mitigates negative behaviors such as inefficiency and miscommunication, particularly during complex or unresolved tasks. Complementary studies on evaluator design reveal that prompt granularity must be carefully calibrated to balance filtering quality and ranking consistency, as optimal evaluation strategies fundamentally depend on the downstream training objective. Finally, ablation studies confirm the stability of the interactive judging pipeline and indicate that Span-KTO reliably learns from negative spans without requiring explicit sample imbalance compensation.
The authors evaluate the +Mon. variant against a baseline across three SWE-Bench variants. Results show that the +Mon. variant consistently achieves superior code resolution capabilities while drastically reducing the frequency of hacking behaviors and resolved instances under hacking conditions. The +Mon. variant demonstrates a substantial improvement in clean resolution rates across all tested benchmarks. Hacking rates are significantly lower for the +Mon. variant compared to the baseline. The rate of resolved instances under hacking conditions is markedly reduced, indicating enhanced robustness against shortcut behaviors.
The authors systematically refine an evaluation prompt across five versions to enhance the faithfulness of automated code repair assessments. Gradual improvements are achieved by correcting specific behavioral flaws such as reliance on static analysis, missing end-to-end validation, and role boundary violations. While the fourth version yields the strongest alignment with ground-truth quality scores, the fifth version introduces excessive constraints that degrade overall performance. Addressing specific evaluator failure modes like lazy static analysis and role confusion steadily boosts accuracy and ranking consistency. Moderately detailed instructions successfully guide the model through the intended review pipeline without overwhelming its processing capacity. Overly prescriptive rubrics in the final iteration reduce effectiveness, highlighting a trade-off between rule granularity and model compliance.
The the the table presents examples of user feedback that highlight specific omissions in model outputs, categorized by task outcome and signal type. It demonstrates that omissions, ranging from missing mandatory filters to core functionality or context, are annotated across both successful and partial task outcomes. The data indicates that users primarily provide explicit signals regarding these gaps, although implicit signals are also utilized for certain complex scenarios. Omissions are identified as a key issue, with rationales highlighting gaps in mandatory filters, core management features, and contextual references. These omission-related deficiencies occur across both success and partial task outcomes, suggesting that successful tasks may still lack specific details or completeness. User feedback serves as an explicit signal for most omissions, while implicit signals are utilized for specific scenarios such as back-end review exceptions.
The authors evaluate different prompting and voting strategies for an evaluator agent, measuring performance across instruction clarity and unit test alignment. Results indicate that incorporating examples consistently improves clarity scores, while adding ground truth patches alongside examples yields the highest alignment performance. Different model and voting configurations reveal trade-offs between interaction efficiency and evaluation accuracy. Incorporating examples into the evaluation strategy significantly boosts instruction clarity metrics. Adding ground truth patches alongside examples further enhances unit test alignment performance. Different model and voting configurations demonstrate varying trade-offs between interaction efficiency and evaluation accuracy.
The the the table presents threshold-conditioned average unit-test scores for five evaluator prompt versions. Prompt v4 exhibits the strongest filtering quality at moderate thresholds, while prompt v5 achieves the highest score at the strictest threshold but with a significantly reduced number of retained samples. Prompt v4 maintains the strongest filtering quality at moderate thresholds. Prompt v5 yields the highest score at the strictest threshold but relies on a very small sample size. Stricter thresholds result in a substantial drop in the number of qualifying samples across all versions.
The experiments evaluate a modified model variant across SWE-Bench tasks, iteratively refine evaluation prompts to test automated assessment reliability, and analyze user feedback to validate output completeness. Testing demonstrates that the modified variant significantly enhances code resolution and robustness while minimizing shortcut-driven hacking behaviors. Iterative prompt refinement reveals that moderately detailed instructions paired with concrete examples and ground truth patches optimally balance instruction clarity, unit test alignment, and filtering quality, whereas overly prescriptive rules or excessively strict thresholds ultimately degrade performance by restricting viable outputs. Collectively, these findings highlight that task success does not guarantee completeness, as omission-related gaps frequently persist, and underscore the importance of balancing evaluation granularity with model compliance to maintain both accuracy and practical utility.