Command Palette
Search for a command to run...
Agentic Systems als Verstärkung schwacher Modellierungsmodelle
Agentic Systems als Verstärkung schwacher Modellierungsmodelle
Varun Sunkaraneni Pierfrancesco Beneventano Riccardo Neumarker Tomer Galanti
Zusammenfassung
Kann ein Gremium aus schwachen Reasoning-Modellen die Leistungsfähigkeit viel stärkerer Modelle erreichen? Wir untersuchen die verifizergestützte Suche in Gremien (verifier-backed committee search) als Methode zur Leistungssteigerung bei Inference (inference-time boosting) für Reasoning-Language-Modelle (LLMs). Der zugrunde liegende Mechanismus ist nicht einfach „Mehr Agenten helfen“: Durch die Generierung von Samples werden latente korrekte Lösungen offengelegt, während Kritiker (critics) und Komparatoren (comparators) diese ohne Zugriff auf den verborgenen Verifizierer identifizieren müssen. Wir formalisieren diese Sichtweise, indem wir Proposal Coverage (Abdeckung der Vorschläge), lokale Identifizierbarkeit, Fortschritt und Diversität voneinander trennen. Wir beweisen, dass die Coverage durch wiederholtes Sampling verstärkt werden kann, allein aber keine nützlichen Kritiker oder Komparatoren erzeugen kann; eine zuverlässige Verstärkung erfordert ein zusätzliches Signal lokaler Korrektheit (local soundness signal), wie beispielsweise Ausführung (execution), Beweiskontrolle (proof checking), Typüberprüfung (type checking), Tests oder Constraint Solving. Wir stellen rankenbasierte Schranken vor, die aufzeigen, wann lokale Selektionsfehler zu zuverlässigen Trajektorien komposiert werden können, und charakterisieren die Obergrenze auf Seiten der Proposal-Generierung: Der oracle best-of-k-Ansatz konvergiert nur zur Masse der Aufgabenslices, auf denen das Proposal-System eine nicht-null, nutzbare Wahrscheinlichkeit zuweist. Empirisch zeigt sich, dass ein einzelnes GPT-5.4-nano-Proposal auf SWE-bench Verified 67,0 % der Aufgaben löst. Durch unsere Kritik-Komparator-Orchestrierung unter Verwendung desselben nano-Modells werden bei k=8 Proposals 76,4 % erreicht, was der Einzelleistung von Gemini 3 Pro und Claude Opus 4.5 Thinking entspricht und sich an der oberen oracle best-of-8-Grenze von 79,0 % orientiert.
One-sentence Summary
The authors propose verifier-backed committee search as an inference-time boosting mechanism for weak reasoning models, demonstrate that reliable amplification requires a local soundness signal beyond sampling coverage, and show that orchestrating critic-comparator systems with GPT-5.4 nano proposals achieves 76.4% accuracy on SWE-bench Verified with k=8 proposals, matching the standalone performance of Gemini 3 Pro and Claude Opus 4.5 Thinking.
Key Contributions
- This work formalizes verifier-backed committee search by separating proposal coverage, local identifiability, progress, and diversity to create a diagnostic lens for multi-agent systems. Theoretical analysis proves that coverage amplification requires local soundness signals and provides rank-based bounds on how selection errors compose into reliable trajectories.
- The method employs verifier-backed committee search as inference-time boosting where critics and comparators recover latent correct solutions from samples without access to a hidden verifier. This orchestration allows weak reasoning models to reach the performance of much stronger models through repeated sampling and local selection.
- Empirical evaluation on SWE-bench Verified demonstrates that a critic-comparator orchestration using a single GPT-5.4 nano model reaches 76.4% task resolution with eight proposals. This performance matches standalone results from Gemini 3 Pro and Claude Opus 4.5 Thinking while approaching the 79.0% oracle best-of-8 upper bound.
Introduction
Verifier-backed reasoning tasks such as code repair and theorem proving require systems to generate intermediate moves and validate them without access to hidden ground truth labels. Existing inference-time methods often rely on repeated sampling or voting, yet these strategies struggle when models share blind spots or lack the ability to distinguish correct partial solutions from flawed ones. The authors formalize agentic systems as inference-time boosting by separating proposal coverage from local identifiability and proving that reliable amplification requires additional soundness signals like execution or type checking. Their theoretical bounds characterize the limits imposed by shared blind spots while empirical results on SWE-bench Verified show that orchestrating weak models can match the performance of substantially stronger standalone systems.
Method
The authors model verifier-backed agent systems as a bounded-depth search process over partial objects, such as intermediate proof states or partially written programs. They define a valid state system where a state represents a partial reasoning object, and a rank function dx(s) measures the "distance from solution." The goal is to transition from an initial state s0(x) to a terminal state where the verifier accepts the solution. This process relies on the existence of progressing-sound actions, which preserve validity and strictly decrease the rank function.
The core architecture is the Committee Protocol Πk,m,r, which separates generation from identification. At each reachable non-terminal state s, the protocol executes a three-stage loop to select the next action. First, a proposer harness samples k candidate actions. Second, m independent critic calls are applied to each candidate to filter out locally refutable errors. Third, among the surviving candidates, a Copeland winner is selected using r comparator votes per pair. This selected action transitions the system to the next state st+1. The process repeats for a bounded number of steps L.
The theoretical guarantee of this architecture depends on two distinct resources: proposal coverage and local identifiability. Assumption 1 posits that a proposer portfolio exists such that a progressing-sound action can be sampled with non-zero probability α0. Assumption 2 ensures efficient local identifiability, meaning critics can reject unsound actions with probability at least β0, and comparators can prefer sound actions over unsound ones with probability at least 1/2+σ0.
The authors derive a local error decomposition for the probability of selecting an unsound action or failing locally. This error εloc(s) is bounded by the sum of proposal failure and identification failure. Specifically, the identification error decreases exponentially with the number of critic calls m and comparator votes r, governed by the edges β and σ. The global failure probability over a trajectory of length Lx is bounded by the sum of local errors at each step.
Further analysis of the proposal term reveals a "blind-spot floor." The probability of proposal failure converges to a value Bs as k→∞, representing latent subpopulations where the proposal system assigns zero probability to sound actions. Increasing k reduces the finite-sampling residual but cannot eliminate the blind-spot floor. To achieve reliable amplification, the system requires both a diverse proposer portfolio to cover the search space and a sound verification signal to identify valid moves. In the experimental instantiation, the critic is implemented as a binary patch judge, and the comparator performs pairwise evaluations of code patches to determine which is more likely to resolve the issue without breaking existing functionality.
Experiment
The experiments evaluate inference-time orchestration on SWE-bench Verified using a fixed pool of candidate patches to isolate selection quality from generation capabilities. Results indicate that while proposal diversity exposes latent correct solutions, a combined harness of critics and comparators is necessary to recover most of this potential, with critics filtering flawed patches and comparators ranking plausible ones. Failure decomposition reveals that remaining errors stem primarily from proposal coverage limitations rather than selection failures, suggesting that future gains require more diverse proposers alongside robust selection mechanisms.
The authors evaluate selector ablations to determine the impact of aggregation rules and critic thresholds on solve rates. Results show that all-pairs aggregation methods like Copeland round-robin and strict dominance outperform single-elimination brackets. Additionally, introducing a permissive critic gate significantly improves performance over using comparators alone, with optimal results found at lower thresholds before performance declines with stricter filtering. All-pairs aggregation methods yield better results than single-elimination brackets. Filtering out patches with zero critic support improves performance compared to using no gate. Stricter critic thresholds beyond the optimal level lead to slight performance decreases.
The authors evaluate selector ablations to determine the impact of aggregation rules and critic thresholds on solve rates. Results demonstrate that all-pairs aggregation methods, including Copeland round-robin and strict dominance, outperform single-elimination brackets. Furthermore, introducing a permissive critic gate enhances performance compared to using comparators alone, although performance declines with stricter filtering.