Autoreason: Self-Refinement That Knows When to Stop
Abstract
Iterative self-refinement fails for three reasons. The first is prompt bias: adversarial critique prompts lead the model to hallucinate problems that do not exist. The second is scope creep: the document's scope expands unchecked with each revision pass. The third is lack of restraint: models almost never decline to make changes, even when the output is already good enough. Together, these factors cause progressive degradation.
This paper proposes autoreason, which addresses all three problems by structuring each iteration as a choice among three options: the unchanged incumbent (A), an adversarial revision (B), and a synthesis of the two (AB). Fresh agents with no prompt history or session context evaluate the candidates via a blind Borda count, which guarantees that "do nothing" is always a first-class option. The judges also share none of the bias introduced during revision.
The method's value is greatest for mid-tier models, where the gap between generation ability and self-evaluation ability is widest. With Haiku 3.5 (roughly 10x cheaper than Sonnet 4), autoreason scored a perfect 42/42 Borda on three tasks, while standard refinement baselines degraded the same model's outputs below its unrefined single-pass result. Across five model tiers (Llama 8B, Gemini Flash, Haiku 3.5, Haiku 4.5, Sonnet 4), the advantage peaks at the mid-tier and shrinks at both extremes: the effect was comparatively small for a model too weak to generate diverse alternatives (Llama) and for a model strong enough not to need external evaluation (Sonnet 4). Haiku 4.5 confirms this transition: once private-test accuracy on code reaches 60% (on par with Sonnet's single pass), autoreason's advantage on held-out tests disappears entirely, even though it still gains 4pp on visible tests.
One-sentence Summary
The authors propose Autoreason, a self-refinement framework that mitigates prompt bias, scope creep, and lack of restraint by structuring each iteration as a three-way choice between the unchanged incumbent, an adversarial revision, and a synthesis, evaluated via blind Borda count by independent agents to enable models to stop refining when outputs are optimal.
Key Contributions
- The paper introduces autoreason, a method that structures iterative refinement as a three-way choice between an unchanged incumbent, an adversarial revision, and a synthesis of both.
- This approach utilizes fresh agents with no prior prompt history or session context to judge candidates via a blind Borda count, which prevents the biases and scope creep common in standard self-refinement.
- Experimental results demonstrate that autoreason achieves a perfect Borda score of 42/42 on three tasks using Haiku 3.5, significantly outperforming standard refinement baselines that cause output degradation.
Introduction
Iterative self-refinement is a common technique used to improve large language model outputs, but it often suffers from progressive degradation due to prompt bias, uncontrolled scope creep, and a lack of restraint where models feel compelled to change even perfect outputs. Existing methods frequently fail because they rely on single-agent loops or lack mechanisms to allow a model to opt for no change at all. The authors leverage a structured three-way choice framework called autoreason to solve these issues. By presenting an unchanged incumbent, an adversarial revision, and a synthesis to independent judges using a blind Borda count, the method ensures that "doing nothing" remains a viable option. This approach is particularly effective for mid-tier models that possess the ability to generate diverse alternatives but lack the self-evaluation capability to select the best one.
Method
The authors propose a structured iterative framework, autoreason, that runs as a closed loop driven by three distinct agent roles: Critic, Author, and Synthesizer. The process begins with a Task Prompt that initializes the incumbent document, denoted A. At each pass, a fresh Critic agent evaluates the current incumbent A and writes a critique that identifies shortcomings without proposing solutions. The critique is passed to a fresh Author agent, which revises the incumbent to produce an adversarial revision, labeled B. A fresh Synthesizer agent then combines the incumbent A and the adversarial revision B into a synthesized candidate, AB, by integrating the strongest elements of both.
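The three roles described above can be sketched as one function that produces the candidates for a single pass. This is a minimal illustration, not the authors' implementation: `call_model`, the `ModelFn` type, and the prompt wording are all hypothetical, and each call is simply assumed to start a fresh session with no shared history.

```python
from typing import Callable, Dict

# Hypothetical signature: takes a prompt and returns the model's text.
# Assumption: every call runs in a fresh session with no prior context.
ModelFn = Callable[[str], str]


def generate_candidates(task: str, incumbent: str, call_model: ModelFn) -> Dict[str, str]:
    """Build the three candidates (A, B, AB) for one pass."""
    # Critic: identify shortcomings only, without proposing fixes.
    critique = call_model(
        f"Task:\n{task}\n\nDocument:\n{incumbent}\n\n"
        "List the document's shortcomings. Do not propose solutions."
    )
    # Author: revise the incumbent in response to the critique, giving B.
    revision = call_model(
        f"Task:\n{task}\n\nDocument:\n{incumbent}\n\nCritique:\n{critique}\n\n"
        "Revise the document to address the critique."
    )
    # Synthesizer: merge the strongest elements of A and B, giving AB.
    synthesis = call_model(
        f"Task:\n{task}\n\nVersion 1:\n{incumbent}\n\nVersion 2:\n{revision}\n\n"
        "Combine the strongest elements of both versions."
    )
    # A is the incumbent, returned unchanged.
    return {"A": incumbent, "B": revision, "AB": synthesis}
```

Passing the model as a plain callable keeps the sketch independent of any particular API client; the incumbent is never sent through a rewriting step, which is what keeps "no change" available as a candidate.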

As shown in the figure below, the three candidates—A (unchanged incumbent), AB (synthesis), and B (revision)—are submitted to a Judge Panel composed of three blind judges. The judges rank the candidates using a Borda count system, assigning 3, 2, and 1 points for first, second, and third place, respectively, with ties broken in favor of the incumbent. The winner is selected based on the aggregated scores, and the process continues unless the incumbent A wins two consecutive passes, which triggers convergence.
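The judging step can be illustrated with a small Borda aggregation. This is a sketch under stated assumptions: the function name `borda_winner` and the input format (one best-to-worst list of labels per judge) are illustrative, blinding of candidate labels is not modeled, and the rule for ties that do not involve the incumbent is an assumption, since the text only specifies that ties favor the incumbent.

```python
from typing import Dict, List

POINTS = (3, 2, 1)  # points for first, second, and third place


def borda_winner(rankings: List[List[str]], incumbent: str = "A") -> str:
    """Return the winning label given each judge's best-to-worst ranking."""
    scores: Dict[str, int] = {}
    for ranking in rankings:
        for place, label in enumerate(ranking):
            scores[label] = scores.get(label, 0) + POINTS[place]
    best = max(scores.values())
    tied = [label for label, score in scores.items() if score == best]
    # Ties are broken in favor of the incumbent.
    if incumbent in tied:
        return incumbent
    # Assumption: remaining ties are resolved alphabetically (not from the source).
    return sorted(tied)[0]


# Three judges: A scores 3+2+2=7, AB scores 2+3+1=6, B scores 1+1+3=5.
print(borda_winner([["A", "AB", "B"], ["AB", "A", "B"], ["B", "A", "AB"]]))  # A
```

The formal definition in this section scores each candidate as 3 minus its rank, which gives 2, 1, and 0 points rather than 3, 2, and 1. The two schemes differ by a constant per judge, so they select the same winner.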
The framework ensures that each agent role is a fresh, isolated instance with no shared context beyond the task prompt, promoting independence and reducing bias propagation. Formally, let $d_t$ denote the incumbent document at pass $t$. Each pass generates the candidate set $\{d_t, B(d_t), S(d_t, B(d_t))\}$, where $B$ is the adversarial revision operator and $S$ is the synthesis operator. The winner at each step maximizes the Borda aggregation score over $n$ judges:
$$d_{t+1} = \arg\max_{c \in \{A, B, AB\}} \sum_{i=1}^{n} \bigl(3 - r_i(c)\bigr)$$
where $r_i(c)$ is the rank assigned to candidate $c$ by judge $i$. The system converges when the incumbent wins $k = 2$ consecutive passes, ensuring stability in the output.
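The convergence rule can be sketched as an outer loop that counts consecutive incumbent wins. This is a hedged illustration only: `autoreason_loop`, `make_candidates`, and `pick_winner` are hypothetical names, and the `max_passes` safety cap is an assumption that does not appear in the source.

```python
from typing import Callable, Dict


def autoreason_loop(
    initial: str,
    make_candidates: Callable[[str], Dict[str, str]],
    pick_winner: Callable[[Dict[str, str]], str],
    k: int = 2,
    max_passes: int = 20,  # assumption: safety cap, not from the source
) -> str:
    """Iterate until the incumbent (label "A") wins k consecutive passes."""
    incumbent = initial
    streak = 0  # consecutive passes won by the incumbent
    for _ in range(max_passes):
        candidates = make_candidates(incumbent)  # expected keys: "A", "B", "AB"
        winner = pick_winner(candidates)
        if winner == "A":
            streak += 1
            if streak >= k:
                break  # converged: incumbent held for k passes in a row
        else:
            streak = 0
            incumbent = candidates[winner]  # winner becomes the new incumbent
    return incumbent
```

Resetting the streak whenever B or AB wins means the new incumbent must itself survive k passes before the loop stops.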
Experiment
The experiments evaluate the autoreason method across subjective writing tasks, competitive programming, and various model tiers to determine when structured iterative refinement succeeds. The results demonstrate that autoreason effectively bridges the gap between a model's generation and evaluation capabilities, particularly for mid-tier models where traditional self-refinement often leads to quality degradation or unchecked verbosity. Ultimately, the study concludes that the method's effectiveness is maximized when tasks provide sufficient decision space and bounded scope, allowing a tournament structure to recover from initial failures through structured reasoning rather than mere reactive editing.
The authors compare autoreason variants against baseline methods, showing that autoreason with a margin requirement achieves higher Borda scores than other approaches. The critique-and-revise method leads in first-place rankings, but autoreason variants demonstrate improved overall performance through structured evaluation.
- Autoreason with a margin requirement achieves higher Borda scores than other methods.
- Critique-and-revise leads in first-place rankings, but autoreason variants show better overall performance.
- Autoreason variants outperform conservative and baseline methods in Borda scoring.

Autoreason achieves higher scores across multiple tasks compared to single-pass and critique-and-revise methods. The method's advantage is consistent, with significant gains in average performance over simpler iterative approaches.
- Autoreason outperforms single-pass and critique-and-revise baselines across all tasks.
- The method achieves higher average scores compared to all other approaches.
- Autoreason shows consistent superiority in both constrained and open-ended tasks.

The table compares the performance of autoreason against baseline methods on constrained writing tasks. Autoreason achieves the highest scores on two out of three tasks, with the conservative baseline performing best on the postmortem task, indicating that task constraints influence which method is most effective.
- Autoreason outperforms baselines on two of three constrained tasks.
- The conservative baseline wins on the postmortem task, suggesting task-specific effectiveness.
- Performance differences highlight the impact of task constraints on method success.

Autoreason achieves higher private-test pass rates and better performance on medium and hard problems compared to critique-and-revise and single-pass strategies. The method shows consistent gains across difficulty levels, particularly in constrained domains where iterative refinement without evaluation leads to degradation.
- Autoreason leads in private-test pass rates across all problem types.
- The method outperforms critique-and-revise and single-pass strategies on medium and hard problems.
- Autoreason maintains higher performance on difficult problems where baselines degrade.

The table compares convergence speed across different judge variants for two tasks. Chain-of-thought judges converge significantly faster than baseline holistic judges, while decomposed specialists show intermediate performance on Task 1 but no convergence on Task 2. This suggests that structured reasoning improves evaluation efficiency, but specialized roles may not always be effective.
- Chain-of-thought judges converge faster than baseline holistic judges.
- Decomposed specialist judges converge on Task 1 but fail to converge on Task 2.
- Structured reasoning improves convergence speed, but specialized roles may not be universally effective.

The experiments compare various autoreason variants against single-pass, critique-and-revise, and conservative baseline methods across constrained, open-ended, and varying difficulty tasks. The results demonstrate that autoreason variants provide superior overall performance and consistency, particularly on difficult problems where simpler iterative approaches often degrade. While structured reasoning through chain-of-thought judges improves evaluation convergence speed, the effectiveness of specialized roles varies depending on the specific task requirements.