Command Palette
Search for a command to run...
أفق التحقق: لا توجد رصاصة فضية لمكافآت البرمجة agent
أفق التحقق: لا توجد رصاصة فضية لمكافآت البرمجة agent
الملخص
تتبنى حدسية كلاسيكية فكرة أن التحقق من حلٍّ ما أسهل من إنتاجه. وفيما يخص وكلاء البرمجة agents اليوم، تنعكس هذه الحدسية؛ فمع تطور نماذج الأساس في امتلاك قدرات استدلال أقوى، وتطور أطر الهندسة لتصبح أكثر تعقيداً، لم يعد توليد حلول مرشحة معقدة أمراً شاقاً، بل أصبح التحقق الموثوق منها يمثل المشكلة الأصعب. ويُعد كل مدقق نبنيه مجرد وسيط يعكس النية البشرية، ولا يمثل النية ذاتها أبداً. ويؤدي هذا إلى خضوع عملية التحقق لصعوبة مزدوجة؛ أولاً، أن النية غير محددة بدقة بطبيعتها، مما يجعل التحقق من وفائها أمراً صعباً جوهرياً؛ ثانياً، أن عملية التحسين أثناء تدريب النموذج توسع الفجوة بين الوسيط والنية، مما يتجلى في ظواهر مثل اختراق المكافأة أو تشبع الإشارة. ولتجاوز هذه الإشكالية، نحدد جودة إشارات التحقق من خلال ثلاثة أبعاد رئيسية، وهي: قابلية التوسع، والوفاء، والمتانة؛ ونؤكد أن تحقيق هذه الأبعاد الثلاثة في آن واحد يمثل التحدي الجوهري. ونستعرض أيضاً أربعة نماذج لتصميم المكافأة: مدقق قائم على الاختبار للمهام البرمجية العامة، ومدقق قائم على المعايير للمهام المتعلقة بواجهات المستخدم الأمامية، والمستخدم نفسه كمدقق للمهام الواقعية لـ agents، ومدقق آلي لـ agents للمهام طويلة المدى. وعلى امتداد أنواع المهام المختلفة ومستويات قدرات السياسات المتباينة، نُجري تحليلاً معمقاً وتجارب عملية حول التحديات الجوهريّة في تصميم المكافأة، وكيفية الاستفادة من إشارات المكافأة بفعالية أكبر. وتُظهر النتائج التجريبية أن تصميم التحقق المستهدف يمكنه كبح اختراق المكافأة بفعالية، ورفع جودة إنجاز المهام، وتحقيق قفزات ملحوظة عبر مجموعة متنوعة من المعايير الداخلية والعامة. وتُشير هذه الخبرات مجتمعة إلى ملاحظة جوهرية مفادها أنه لا يمكن لدالة مكافأة ثابتة أن تحافظ على فعاليتها مع استمرار تطور قدرات السياسات؛ بل يجب أن يتطور التحقق بالتزامن مع المولد.
One-sentence Summary
The Qwen Team characterizes verification signals across scalability, faithfulness, and robustness to address the proxy limitations of coding agent rewards, demonstrating through experiments across internal and public benchmarks that targeted reward constructions suppress reward hacking, improve task completion quality, and must co-evolve with generator capabilities rather than rely on static functions.
Key Contributions
- This paper characterizes verification signal quality along three dimensions, scalability, faithfulness, and robustness, and establishes their simultaneous optimization as a central challenge for coding agents.
- The study develops four specialized reward constructions tailored to distinct development scenarios, including a test verifier for general coding tasks, a rubric verifier for frontend tasks, a user-as-verifier design for real-world agent tasks, and an automated agent verifier for long-horizon tasks.
- Extensive experiments across diverse task types and policy capability levels demonstrate that targeted verification designs effectively suppress reward hacking, improve task completion quality, and yield significant gains across multiple internal and public benchmarks.
Introduction
The authors examine a shifting dynamic in coding agent development where generating sophisticated code has outpaced the ability to reliably verify it. Because human intent is inherently underspecified, existing verifiers act only as imperfect proxies that inevitably suffer from reward hacking, signal saturation, and misalignment as model capabilities grow. Prior approaches typically rely on static test suites, fixed rubrics, or offline feedback, which fail to capture dynamic runtime behavior, distinguish engineering quality, or adapt to emerging exploitation strategies. To address these gaps, the authors propose a co-evolutionary framework that aligns verifier design with generator advancement across four distinct task domains. They introduce targeted reward constructions, including a trajectory-level behavior monitor that penalizes shortcut-dependent solutions, and demonstrate that adaptive verification significantly suppresses reward hacking while improving clean task completion across multiple benchmarks.
Dataset
Method
The authors propose a comprehensive verification and training framework designed to ensure that reward signals remain faithful, scalable, and robust as policy capabilities advance. This approach treats verification as core infrastructure that actively co-evolves with the policy model. As illustrated in the conceptual diagram, the intelligence capabilities of both the verifier and the policy model progress over training time. The system is designed to overcome challenges such as reward hacking and guidance saturation by continuously evolving the verifier in tandem with the policy, creating a co-evolution flywheel that sustains trustworthy capability growth.
For frontend and visual tasks, the authors design an agentic interactive judge that evaluates generated artifacts through simulated user interactions. Refer to the framework diagram for the complete pipeline. The process begins with preprocessing, where page information such as the accessibility tree and browser state is extracted, while evaluation criteria are synthesized into critical and detail checklists. An action planner then generates a comprehensive action list in a single forward pass, specifying the sequence of interactions required to exercise the target functionality. This action list is executed by a Playwright-based render server in a live browser environment, which records an interaction trace comprising screen recordings and state changes. Finally, a judge model evaluates sampled frames from the recordings alongside the source code against the predefined rubric criteria to produce a final score. By grounding evaluation in actual runtime behavior rather than static code inspection, this architecture captures dynamic behaviors like state transitions and multi-step workflows while resisting reward hacking based on source code length.
The training process leverages these verification signals through multiple objectives tailored to different data types. For tasks derived from real-world user interactions, the authors treat user feedback as the primary verifier. They extract process-level natural language feedback and partition the response trajectory into contiguous spans with consistent polarity. The training framework incorporates Supervised Fine-Tuning (SFT) and Reweight SFT (RW-SFT), which applies differentiated loss weights to tokens based on their polarity annotations to amplify positive signals and attenuate negative ones. Standard SFT applies a uniform cross-entropy loss across all tokens, whereas RW-SFT introduces a weight function defined as:
w(pt)=⎩⎨⎧wposwneuwnegif pt=positiveif pt=neutralif pt=negativeThe corresponding loss is calculated as:
LRW−SFT(θ)=−Et[w(pt)logπθ(yt∣x,y<t)]To further align the model with human intent, the authors introduce Span-Level KTO. This method defines the implicit reward for each span as the sum of log-likelihood ratios between the policy model and a frozen reference model:
rθ(x,Sk)=t=sk∑ek[logπθ(yt∣x,y<t)−logπref(yt∣x,y<t)]The reference point is estimated online using an exponential moving average of batch rewards:
zref←α⋅zref+(1−α)⋅rˉbatchThe preference loss applies distinct value functions to positive and negative spans based on the advantage relative to the reference point:
ℓ(Sk)={−λw⋅σ(β⋅ak)−λl⋅σ(−β⋅ak)if pSk=positiveif pSk=negativeThe overall preference objective is computed as the expectation over all spans:
Lpref(θ)=ESk[ℓ(Sk)]Neutral tokens are preserved through standard cross-entropy regularization:
Lneutral(θ)=−Et∈Tneu[logπθ(yt∣x,y<t)]The complete training objective combines the preference loss with the neutral regularization term to guide policy optimization.
Experiment
The experiments evaluate the proposed Span-KTO framework against standard supervised and reweighting baselines across multiple software engineering benchmarks, validating its capacity to improve both task resolution rates and overall agent behavior. Analysis demonstrates that simply discarding or heavily penalizing negative training data degrades performance, whereas Span-KTO effectively mitigates negative behaviors such as inefficiency and miscommunication, particularly during complex or unresolved tasks. Complementary studies on evaluator design reveal that prompt granularity must be carefully calibrated to balance filtering quality and ranking consistency, as optimal evaluation strategies fundamentally depend on the downstream training objective. Finally, ablation studies confirm the stability of the interactive judging pipeline and indicate that Span-KTO reliably learns from negative spans without requiring explicit sample imbalance compensation.
The authors evaluate the +Mon. variant against a baseline across three SWE-Bench variants. Results show that the +Mon. variant consistently achieves superior code resolution capabilities while drastically reducing the frequency of hacking behaviors and resolved instances under hacking conditions. The +Mon. variant demonstrates a substantial improvement in clean resolution rates across all tested benchmarks. Hacking rates are significantly lower for the +Mon. variant compared to the baseline. The rate of resolved instances under hacking conditions is markedly reduced, indicating enhanced robustness against shortcut behaviors.
The authors systematically refine an evaluation prompt across five versions to enhance the faithfulness of automated code repair assessments. Gradual improvements are achieved by correcting specific behavioral flaws such as reliance on static analysis, missing end-to-end validation, and role boundary violations. While the fourth version yields the strongest alignment with ground-truth quality scores, the fifth version introduces excessive constraints that degrade overall performance. Addressing specific evaluator failure modes like lazy static analysis and role confusion steadily boosts accuracy and ranking consistency. Moderately detailed instructions successfully guide the model through the intended review pipeline without overwhelming its processing capacity. Overly prescriptive rubrics in the final iteration reduce effectiveness, highlighting a trade-off between rule granularity and model compliance.
The the the table presents examples of user feedback that highlight specific omissions in model outputs, categorized by task outcome and signal type. It demonstrates that omissions, ranging from missing mandatory filters to core functionality or context, are annotated across both successful and partial task outcomes. The data indicates that users primarily provide explicit signals regarding these gaps, although implicit signals are also utilized for certain complex scenarios. Omissions are identified as a key issue, with rationales highlighting gaps in mandatory filters, core management features, and contextual references. These omission-related deficiencies occur across both success and partial task outcomes, suggesting that successful tasks may still lack specific details or completeness. User feedback serves as an explicit signal for most omissions, while implicit signals are utilized for specific scenarios such as back-end review exceptions.
The authors evaluate different prompting and voting strategies for an evaluator agent, measuring performance across instruction clarity and unit test alignment. Results indicate that incorporating examples consistently improves clarity scores, while adding ground truth patches alongside examples yields the highest alignment performance. Different model and voting configurations reveal trade-offs between interaction efficiency and evaluation accuracy. Incorporating examples into the evaluation strategy significantly boosts instruction clarity metrics. Adding ground truth patches alongside examples further enhances unit test alignment performance. Different model and voting configurations demonstrate varying trade-offs between interaction efficiency and evaluation accuracy.
The the the table presents threshold-conditioned average unit-test scores for five evaluator prompt versions. Prompt v4 exhibits the strongest filtering quality at moderate thresholds, while prompt v5 achieves the highest score at the strictest threshold but with a significantly reduced number of retained samples. Prompt v4 maintains the strongest filtering quality at moderate thresholds. Prompt v5 yields the highest score at the strictest threshold but relies on a very small sample size. Stricter thresholds result in a substantial drop in the number of qualifying samples across all versions.
The experiments evaluate a modified model variant across SWE-Bench tasks, iteratively refine evaluation prompts to test automated assessment reliability, and analyze user feedback to validate output completeness. Testing demonstrates that the modified variant significantly enhances code resolution and robustness while minimizing shortcut-driven hacking behaviors. Iterative prompt refinement reveals that moderately detailed instructions paired with concrete examples and ground truth patches optimally balance instruction clarity, unit test alignment, and filtering quality, whereas overly prescriptive rules or excessively strict thresholds ultimately degrade performance by restricting viable outputs. Collectively, these findings highlight that task success does not guarantee completeness, as omission-related gaps frequently persist, and underscore the importance of balancing evaluation granularity with model compliance to maintain both accuracy and practical utility.