Autoreason: Self-Refinement That Knows When to Stop

Abstract

Iterative self-refinement fails for three main reasons: prompt bias, where adversarial critique prompts push models to hallucinate nonexistent problems; scope creep, where each revision cycle expands the document's scope without control; and lack of restraint, where models almost always agree to make changes even when the output is already high quality. Together, these factors guarantee progressive degradation.

We present autoreason, a method that addresses all three issues by structuring each iteration as a three-way choice: the current incumbent unchanged (A), an adversarial revision (B), and a synthesis (AB). Fresh agents, with no prompt history or session context, evaluate the candidates via a blind Borda count. This process guarantees that "do nothing" is always treated as a first-class option, and ensures that the judges share none of the biases that produced the revisions.

The method's benefit is greatest for mid-tier models, where the gap between generation capability and self-evaluation capability is widest. Using Haiku 3.5 (~10× cheaper than Sonnet 4), autoreason achieved a perfect Borda score of 42/42 across three tasks, while standard refinement baselines degraded the same model's outputs below a single pass with no refinement at all. Across five model tiers (Llama 8B, Gemini Flash, Haiku 3.5, Haiku 4.5, Sonnet 4), the advantage peaks in the middle of the range and fades at both ends: models too weak to generate diverse alternatives (Llama) or too capable to need external evaluation (Sonnet 4.6). Haiku 4.5 confirms this transition: at 60% accuracy on private code tests (matching a single pass of Sonnet), autoreason's gains on held-out tests vanish entirely, despite a 4-percentage-point improvement on visible tests.

One-sentence Summary

The authors propose Autoreason, a self-refinement framework that mitigates prompt bias, scope creep, and lack of restraint by structuring each iteration as a three-way choice between the unchanged incumbent, an adversarial revision, and a synthesis, evaluated via blind Borda count by independent agents to enable models to stop refining when outputs are optimal.

Key Contributions

  • The paper introduces autoreason, a method that structures iterative refinement as a three-way choice between an unchanged incumbent, an adversarial revision, and a synthesis of both.
  • This approach utilizes fresh agents with no prior prompt history or session context to judge candidates via a blind Borda count, which prevents the biases and scope creep common in standard self-refinement.
  • Experimental results demonstrate that autoreason achieves a perfect Borda score of 42/42 on three tasks using Haiku 3.5, significantly outperforming standard refinement baselines that cause output degradation.

Introduction

Iterative self-refinement is a common technique used to improve large language model outputs, but it often suffers from progressive degradation due to prompt bias, uncontrolled scope creep, and a lack of restraint where models feel compelled to change even perfect outputs. Existing methods frequently fail because they rely on single-agent loops or lack mechanisms to allow a model to opt for no change at all. The authors leverage a structured three-way choice framework called autoreason to solve these issues. By presenting an unchanged incumbent, an adversarial revision, and a synthesis to independent judges using a blind Borda count, the method ensures that "doing nothing" remains a viable option. This approach is particularly effective for mid-tier models that possess the ability to generate diverse alternatives but lack the self-evaluation capability to select the best one.

Method

The authors leverage a structured iterative framework known as autoreason, which operates as a closed-loop system driven by three distinct agent roles: Critic, Author, and Synthesizer, all operating within an iteration loop. The process begins with a Task Prompt that initializes the incumbent document, denoted as A. At each pass, a fresh Critic agent evaluates the current incumbent A and generates a critique, identifying shortcomings without proposing solutions. This critique is then passed to a fresh Author agent, which revises the incumbent to produce a new adversarial revision, labeled B. Simultaneously, a fresh Synthesizer agent combines the original incumbent A and the adversarial revision B to create a synthesized candidate, AB, by integrating the strongest elements from both.
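A single pass can be sketched as follows. The `llm(system, user)` helper and the exact prompt wording are illustrative assumptions, not the paper's prompts; the key property shown is that each role receives only the task prompt and its inputs, with no shared session history:

```python
def one_pass(task_prompt, incumbent, llm):
    """One autoreason pass with fresh agents.

    `llm(system, user)` is a hypothetical stateless call: each role
    sees only the task prompt and its inputs, so no critique bias
    or session context can propagate between roles.
    """
    # Fresh Critic: identify shortcomings without proposing solutions.
    critique = llm(
        "You are a critic. Identify shortcomings; do not propose fixes.",
        f"Task: {task_prompt}\n\nDocument:\n{incumbent}",
    )
    # Fresh Author: produce the adversarial revision B from the critique.
    revision = llm(
        "You are an author. Revise the document to address the critique.",
        f"Task: {task_prompt}\n\nDocument:\n{incumbent}\n\nCritique:\n{critique}",
    )
    # Fresh Synthesizer: merge the strongest elements of A and B into AB.
    synthesis = llm(
        "You are a synthesizer. Merge the strongest elements of both drafts.",
        f"Task: {task_prompt}\n\nDraft A:\n{incumbent}\n\nDraft B:\n{revision}",
    )
    return revision, synthesis
```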

The autoreason loop framework

As shown in the figure above, the three candidates—A (unchanged incumbent), AB (synthesis), and B (revision)—are submitted to a Judge Panel composed of three blind judges. The judges rank the candidates using a Borda count system, assigning 3, 2, and 1 points for first, second, and third place, respectively, with ties broken in favor of the incumbent. The winner is selected based on the aggregated scores, and the process continues unless the incumbent A wins two consecutive passes, which triggers convergence.
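The panel's scoring can be sketched in a few lines. The ranking format and the incumbent-favoring tie-break follow the description above; everything else is illustrative:

```python
from collections import defaultdict

def borda_winner(rankings, incumbent="A"):
    """Aggregate judge rankings with a Borda count.

    `rankings` is a list of per-judge orderings, best first,
    e.g. [["A", "AB", "B"], ...]. First, second, and third place
    earn 3, 2, and 1 points; aggregate ties are broken in favor
    of the unchanged incumbent.
    """
    points = defaultdict(int)
    for order in rankings:
        for rank, candidate in enumerate(order):  # rank 0 = first place
            points[candidate] += 3 - rank
    best = max(points.values())
    tied = [c for c, p in points.items() if p == best]
    # Tie-break: the incumbent wins any tie it is part of.
    return incumbent if incumbent in tied else tied[0]
```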

The framework ensures that each agent role is a fresh, isolated instance with no shared context beyond the task prompt, promoting independence and reducing bias propagation. The iterative process is formally defined as follows: d_t denotes the incumbent document at pass t, and each pass generates the candidate set {d_t, B(d_t), S(d_t, B(d_t))}, where B is the adversarial revision operator and S is the synthesis operator. The winner at each step is determined by maximizing the Borda aggregation score over n judges, expressed as:

d_{t+1} = argmax_{c ∈ {A, B, AB}} Σ_{i=1}^{n} (3 − r_i(c))

where r_i(c) is the rank assigned to candidate c by judge i. The system converges when the incumbent wins k = 2 consecutive passes, ensuring stability in the output.
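Putting the pieces together, the loop with the k = 2 stopping rule might look like this schematic, where `revise`, `synthesize`, and `judge_panel` are hypothetical callables wrapping the fresh agents and the blind panel:

```python
def autoreason_loop(incumbent, revise, synthesize, judge_panel, k=2, max_passes=10):
    """Schematic autoreason iteration (hypothetical helper callables).

    revise(doc)        -> adversarial revision B
    synthesize(a, b)   -> combined candidate AB
    judge_panel(cands) -> winning label, one of "A", "B", "AB"
    Converges when the incumbent wins k consecutive passes.
    """
    streak = 0
    for _ in range(max_passes):
        b = revise(incumbent)
        ab = synthesize(incumbent, b)
        winner = judge_panel({"A": incumbent, "B": b, "AB": ab})
        if winner == "A":
            streak += 1
            if streak >= k:
                break  # stable: "do nothing" won k passes in a row
        else:
            streak = 0
            incumbent = {"B": b, "AB": ab}[winner]
    return incumbent
```

Note that because A is always on the ballot and wins ties, a high-quality document survives unchanged rather than being revised for revision's sake.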

Experiment

The experiments evaluate the autoreason method across subjective writing tasks, competitive programming, and various model tiers to determine when structured iterative refinement succeeds. The results demonstrate that autoreason effectively bridges the gap between a model's generation and evaluation capabilities, particularly for mid-tier models where traditional self-refinement often leads to quality degradation or unchecked verbosity. Ultimately, the study concludes that the method's effectiveness is maximized when tasks provide sufficient decision space and bounded scope, allowing a tournament structure to recover from initial failures through structured reasoning rather than mere reactive editing.

The authors compare autoreason variants against baseline methods, showing that autoreason with a margin requirement achieves higher Borda scores than other approaches. The critique-and-revise method leads in first-place rankings, but autoreason variants demonstrate improved overall performance through structured evaluation.

Autoreason outperforms baselines

Autoreason achieves higher scores across multiple tasks compared to single-pass and critique-and-revise methods, in both constrained and open-ended settings. The method's advantage is consistent, with significant gains in average performance over simpler iterative approaches.

Autoreason outperforms baselines

The table compares the performance of autoreason against baseline methods on constrained writing tasks. Autoreason achieves the highest scores on two of three tasks, with the conservative baseline performing best on the postmortem task, indicating that task constraints influence which method is most effective.

Constrained task performance comparison

Autoreason achieves higher private-test pass rates and better performance on medium and hard problems than critique-and-revise and single-pass strategies. The gains are consistent across difficulty levels and largest on difficult problems, where iterative refinement without structured evaluation degrades.

Autoreason outperforms baselines

The table compares convergence speed across judge variants on two tasks. Chain-of-thought judges converge significantly faster than baseline holistic judges, while decomposed specialists show intermediate performance on Task 1 but fail to converge on Task 2. This suggests that structured reasoning improves evaluation efficiency, but specialized roles are not universally effective.

Judge variant convergence comparison

The experiments compare various autoreason variants against single-pass, critique-and-revise, and conservative baseline methods across constrained, open-ended, and varying difficulty tasks. The results demonstrate that autoreason variants provide superior overall performance and consistency, particularly on difficult problems where simpler iterative approaches often degrade. While structured reasoning through chain-of-thought judges improves evaluation convergence speed, the effectiveness of specialized roles varies depending on the specific task requirements.

