Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning
Abstract
Multi-agent systems have matured into practical LLM-driven collaborators across many application domains, leveraging diversity and cross-verification for robustness. However, multi-agent reinforcement learning (MARL) training is resource-intensive and unstable: non-stationarity arises as co-adapting agents adjust to one another, and rewards are often sparse and high-variance. We therefore propose Multi-Agent Test-Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL assembles a team of specialists suited to multi-round discussion, retrieves and integrates test-time experience, and drives the team toward consensus on a final decision. We further study credit-assignment methods for constructing a turn-level experience pool and reinject the resulting experiences into the dialogue process. On challenging benchmarks in medicine, mathematics, and education, MATTRL improves accuracy by an average of 3.67% over multi-agent baselines and 8.67% over comparable single-agent baselines. Ablation studies compare the effects of different credit-assignment schemes in detail and analyze their impact on outcomes. MATTRL offers a stable, effective, and efficient route to multi-agent reasoning that requires no hyperparameter tuning and is robust to distribution shift.
One-sentence Summary
The authors, affiliated with MIT, NUS, NYU, Microsoft, UW, Columbia, and NTU, propose MATTRL, a test-time reinforcement learning framework that enhances multi-agent reasoning by injecting structured textual experiences into deliberation, enabling consensus through a multi-expert team and turn-level credit assignment, achieving robust performance gains across medical, mathematical, and educational benchmarks without retraining.
Key Contributions
- We introduce Multi-Agent Test-Time Reinforcement Learning (MATTRL), a framework that enhances multi-agent reasoning at inference time by injecting structured textual experience, avoiding the instability and high cost of traditional multi-agent RL training while maintaining robustness under distribution shift.
- MATTRL employs a multi-expert team of specialized agents that collaboratively deliberate, using turn-level credit assignment to construct a dynamic experience pool from high-scoring utterances, which is then reinjected to refine subsequent reasoning steps.
- On benchmarks in medicine, math, and education, MATTRL achieves an average accuracy gain of 3.67% over multi-agent baselines and 8.67% over single-agent baselines, with ablation studies demonstrating the impact of different credit-assignment strategies on performance.
Introduction
The authors address the challenge of robust, scalable reasoning in multi-agent systems driven by large language models (LLMs), where collaboration improves performance through diversity and cross-checking but is hindered by the instability and high cost of multi-agent reinforcement learning (MARL). Prior approaches struggle with non-stationarity from co-adapting agents and sparse, high-variance rewards, limiting generalization and requiring extensive training. To overcome these issues, the authors introduce Multi-Agent Test-Time Reinforcement Learning (MATTRL), a framework that enhances reasoning at inference time by injecting structured textual experience into multi-turn agent deliberations. Instead of updating model weights, MATTRL conditions agent behavior using a dynamically built experience pool derived from high-scoring utterances, with credit assignment strategies determining which contributions are retained. This enables rapid, distribution-shift-robust adaptation without sacrificing original capabilities. Experiments across medical diagnosis, math, and education benchmarks show MATTRL improves accuracy by 3.67% over multi-agent baselines and 8.67% over single-agent models, demonstrating its effectiveness and efficiency.
Dataset
- The dataset is composed of three domain-specific subsets: Medicine, Math, and Education, each designed to evaluate different aspects of multi-agent collaboration.
- In Medicine, the dataset uses RareBench (Chen et al., 2024b) Task 4, focusing on differential diagnosis for rare diseases, with 2,185 patient cases covering 421 diseases. The task is framed as a multi-agent consultation where an attending agent coordinates domain specialists to propose, critique, and refine diagnoses.
- For Math, the dataset uses HLE (Humanity's Last Exam) with 856 text-only expert-level problems, assessing collaborative problem solving through exact-match solve rate judged by an LLM.
- In Education, the dataset is derived from 300 questions sampled from SuperGPQA (Du et al., 2025), simulating a three-stage teaching interaction: pre-test, instruction, and post-test. A GPT-4o student responds initially, a GPT-5 teacher provides two rounds of feedback, and the student re-answers; learning gains are measured as the difference in post-test and pre-test accuracy.
- The specialist pool for medicine includes 24 departments from core inpatient and outpatient specialties, selected to balance breadth and depth for efficient multi-disciplinary team (MDT) formation.
- The pedagogy specialist pool includes experts from academic, teaching, and cross-disciplinary domains, enabling targeted team assembly for instructional support.
- The authors use the data in a training-free, test-time experience setup: multiple agents (3 experts + attending) engage in up to 3 conversation turns, independently generating and refining responses with periodic synchronization.
- For experience construction, the authors extract the top 25% of scored utterances from agent interactions across 30 randomly selected cases to build a refined experience corpus used for guiding subsequent deliberation.
- The model’s performance is evaluated using domain-specific metrics: Hit@k and MRR for medicine, exact-match accuracy for math, and learning gain (ΔAcc) for education.
- All models, including baselines and the proposed MATTRL framework, are built on GPT-5 (OpenAI, 2025).
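The top-quartile experience extraction described above can be sketched as follows. This is a minimal illustration, assuming scored utterances are already available as parallel lists; the function and variable names are hypothetical, not from the paper.

```python
import numpy as np

def build_experience_corpus(utterances, scores, quantile=0.75):
    """Keep the top 25% of scored utterances as the experience corpus.

    Mirrors the paper's setup of extracting the top quartile of judged
    utterances (drawn from 30 sampled cases); names are illustrative.
    """
    scores = np.asarray(scores, dtype=float)
    cutoff = np.quantile(scores, quantile)  # score at the 75th percentile
    return [u for u, s in zip(utterances, scores) if s >= cutoff]

# Toy usage: only utterances in the top quartile survive.
corpus = build_experience_corpus(["a", "b", "c", "d"], [0.2, 0.9, 0.5, 0.7])
print(corpus)  # → ['b']
```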
Method
The authors leverage a multi-expert team collaboration framework designed for structured, bounded, and evidence-augmented decision-making across diverse domains. The overall architecture operates in three distinct stages, each contributing to a coherent and auditable process. The framework begins with a task record X, a coordinator agent LLMCoo, a catalog of specialist agents SP, and a test-time experience pool E. The process is initiated by the coordinator, which selects a team of specialists from the catalog based on the task context, a process referred to as team formation. This stage is grounded in a predefined expert catalog, ensuring role selection is constrained and interpretable.

The selected team then engages in a synchronized, multi-round discussion process, with a maximum of Rmax rounds. In each round, non-converged specialists retrieve relevant experiences from the pool E to inform their updated opinions. The retrieval mechanism uses a dense vector index, employing a shared encoder (Qwen3-Embedding-4B) and a FAISS index to select top-K entries based on cosine similarity. These retrieved experiences are appended to the specialist's prompt using a standardized "EXPERIENCE HINTS" template, which provides consultative guidance without requiring verbatim reproduction. After each round, the specialists' incremental updates are aggregated into a shared bulletin, which is then disseminated to all team members to align beliefs and prevent redundant discussion. A specialist is marked as converged when it proposes no further changes.

The process terminates when all specialists converge or the round limit is reached. In the final stage, the coordinator synthesizes the team's cumulative evidence into a discussion report, which is then used to generate the final decision. This separation of evidence aggregation from decision-making enhances controllability and auditability.
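The top-K retrieval step can be sketched in a few lines. This is a minimal stand-in, using NumPy cosine similarity in place of the paper's FAISS index over Qwen3-Embedding-4B vectors; the embeddings are assumed to be precomputed, and all names here are illustrative.

```python
import numpy as np

def top_k_experiences(query_vec, entry_vecs, entries, k=3):
    """Rank experience entries by cosine similarity and return the top-k.

    query_vec: 1-D query embedding; entry_vecs: 2-D matrix, one row per
    stored experience entry. Stands in for a FAISS inner-product index
    over L2-normalized vectors.
    """
    q = query_vec / np.linalg.norm(query_vec)
    E = entry_vecs / np.linalg.norm(entry_vecs, axis=1, keepdims=True)
    sims = E @ q                      # cosine similarity per entry
    order = np.argsort(-sims)[:k]     # indices of the k most similar
    return [entries[i] for i in order]

def format_hints(hints):
    # Standardized "EXPERIENCE HINTS" block appended to a specialist prompt.
    body = "\n".join(f"- {h}" for h in hints)
    return f"EXPERIENCE HINTS (consultative, not verbatim):\n{body}"
```

In the full system the same shared encoder embeds both the discussion context and the pool entries, so similarity is computed in one vector space.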
The framework is instantiated in various domains, including medical diagnosis, mathematical problem-solving, and teaching, demonstrating its domain-general applicability. 
The test-time experience construction process is a critical component that enables the system to learn from past interactions and reuse valuable insights. Given a multi-agent transcript, the framework evaluates each specialist's utterance using an LLM judge based on domain-relevant rubrics, yielding an individual score si,t. This score is combined with a terminal team-level outcome signal G to compute a turn-level reward ri,t for each agent. The reward is a weighted combination of the individual score and a decayed, contribution-weighted share of the terminal outcome, where later turns receive higher weight. High-value utterances, defined as those with a reward above a threshold τ, are selected for experience extraction. These utterances are then distilled into structured textual experience entries using an LLM summarizer. Each entry is a compact, retrievable record that includes minimal task context, the actionable step taken, and a short rationale for the assigned credit. These entries are stored in a test-time experience pool E, which is used to augment the reasoning of specialists during subsequent discussions. 
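One plausible reading of the reward rule above can be written out concretely. The paper specifies only the ingredients (a judge score s_it, a decayed, contribution-weighted share of the terminal outcome G, and a threshold τ), so the mixing constants alpha and gamma below, and the exact functional form, are assumptions for illustration.

```python
def turn_level_reward(s_it, t, T, w_it, G, alpha=0.5, gamma=0.9):
    """Combine an individual utterance score with a share of the team outcome.

    gamma ** (T - t) is largest at the final turn t = T, so later turns
    receive a higher share of the terminal outcome G; w_it is the agent's
    contribution weight. alpha and gamma are illustrative constants.
    """
    return alpha * s_it + (1.0 - alpha) * (gamma ** (T - t)) * w_it * G

def select_experiences(rewards, utterances, tau=0.7):
    # High-value utterances (reward above threshold tau) feed the pool.
    return [u for r, u in zip(rewards, utterances) if r > tau]
```

With these defaults, an utterance at the final turn of a winning episode (s_it = 0.8, G = 1) scores 0.9, while the same utterance at an earlier turn scores less, matching the stated recency weighting.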
The framework's design emphasizes structured outputs and minimal, role-specific prompts to simplify downstream aggregation and evaluation. In the medical domain, the system is instantiated as a multi-disciplinary team (MDT) workflow for rare-disease differential diagnosis, where specialists produce a strict top-10 list each round. In the mathematical domain, the team formation process is adapted to allow for free recruitment, where the coordinator proposes a small set of specialist descriptions tailored to the current problem. The collaboration protocol includes structured peer review, where each specialist's attempt is evaluated by peers, and acceptance is only granted if all verdicts are positive and no issues remain. This ensures a high standard of reasoning and convergence. The experience-augmented prompting uses a standardized injection template to integrate retrieved experiences into the base diagnostic instruction, improving calibration and coverage of edge patterns. 
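The unanimous-acceptance rule in the peer-review protocol is simple enough to state as code. This is a sketch under assumed data shapes (a list of verdict strings and a list of open issues); the names are hypothetical.

```python
def accept_attempt(verdicts, open_issues):
    """A specialist's attempt is accepted only if every peer verdict is
    positive and no issues remain open, per the collaboration protocol."""
    return all(v == "accept" for v in verdicts) and not open_issues
```

Requiring unanimity rather than a majority is what enforces the "high standard of reasoning and convergence" described above: a single dissenting peer or unresolved issue sends the attempt back for revision.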
In the teaching domain, the framework is adapted for multi-specialist collaboration to guide students through problem-solving. The system involves a diagnostician, a pedagogy strategist, and a subject matter expert, who engage in a multi-round teaching session. The process begins with a student's pre-test answer and reasoning, which the specialists use to guide the student through a series of questions. The experience construction process is adapted to generate teaching experiences, which are used to inform the strategic thinking of the teacher agents. The retrieved experiences are used to identify patterns in student errors and adapt successful teaching strategies to the specific student's reasoning and error pattern. This approach ensures that the teaching process is both effective and personalized. 
Experiment
- MATTRL outperforms single-agent and multi-agent baselines across medicine, math, and education tasks by leveraging test-time collaborative adaptation and structured experience integration.
- On the medicine task (RareBench), MATTRL achieves an average Hit@k of 0.565 (over k = 1, 3, 5, 10) and an MRR of 0.51, surpassing MDAgent (0.515) and RareAgents-Refined (0.528), with the largest gains at Hit@1 and Hit@10, indicating improved top-rank precision and shortlist coverage.
- In math (HLE), MATTRL reaches 0.36 exact-match accuracy, a 0.03 gain over multi-agent deliberation (0.33) and 0.09 over the single-agent baseline (0.27), demonstrating enhanced problem-solving through test-time experience.
- In education (SuperGPQA), MATTRL achieves post-test accuracy of 0.77 (ΔAcc = 0.33), more than doubling the single-agent baseline’s gain (ΔAcc = 0.16), highlighting its effectiveness in teaching and misconception correction.
- Difference Rewards outperform Naive and Shapley-style approximations in credit assignment, achieving the highest Hit@1/3 (0.40/0.53) due to sharper, lower-variance credit signals that reduce free-riding and improve top-rank precision.
- An adaptive router that selects between single-agent CoT and MATTRL improves performance by 10% over single-agent and 5.5% over MATTRL, routing cases based on complexity and specialty divergence.
- Team size analysis shows optimal performance at three agents: Hit@1 peaks at three agents and declines with scale, while Hit@10 benefits most from larger teams, indicating a trade-off between precision and recall.
- MATTRL’s structured experience integration (general and disease-specific) significantly outperforms few-shot prompting, which only marginally improves Hit@1 and harms broader recall, confirming that experience quality and integration matter more than raw context.
On the medicine task, MATTRL achieves the highest Hit@10 (0.75) and MRR (0.51), outperforming MDAgent and RareAgents-Refined. The gains in top-rank precision and shortlist coverage indicate that test-time collaborative adaptation provides benefits beyond prompt optimization alone.

Results show that the Difference method achieves the highest Hit@1 and Hit@3 scores, outperforming Naive and Shapley across these metrics, while all methods perform similarly at Hit@5 and Hit@10. The authors attribute the superior performance of Difference to its ability to isolate decisive turns and produce sharper credit peaks, which enhances top-rank precision.
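The Difference scheme compared here follows the classic difference-reward idea: credit each agent with the team outcome minus the counterfactual outcome with that agent's contribution removed. The sketch below uses a toy evaluator (plain summation) in place of the paper's LLM-judged team outcome; only the signature matters.

```python
def difference_rewards(team_outcome, contributions):
    """Difference-reward credit assignment.

    team_outcome: callable scoring a list of contributions (here any
    function with that signature; the paper uses a judged team score).
    Returns one credit per agent: G(all) - G(all minus agent i).
    """
    full = team_outcome(contributions)
    credits = []
    for i in range(len(contributions)):
        without_i = contributions[:i] + contributions[i + 1:]
        credits.append(full - team_outcome(without_i))
    return credits

# Toy usage with summation as the outcome: each agent's credit is exactly
# its own marginal contribution, and free-riders (0.0) get zero credit.
print(difference_rewards(sum, [1.0, 0.0, 2.0]))  # → [1.0, 0.0, 2.0]
```

This counterfactual subtraction is what produces the sharper, lower-variance credit peaks cited above: an agent whose removal does not change the outcome receives no credit, which directly penalizes free-riding.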

Results show that MATTRL outperforms RareAgents + Fewshot across all metrics, achieving higher Hit@1, Hit@3, Hit@5, and Hit@10. The improvement over RareAgents + Fewshot is most pronounced at Hit@1 and Hit@3, indicating better top-rank precision, while the gains are smaller at Hit@5 and Hit@10, suggesting that MATTRL's advantage stems from structured experience integration rather than simply adding more context.

Results show that the Adaptive method achieves the highest Hit@1 score of 0.45, outperforming both the Single-Agent and MATTRL baselines. It also attains the best performance on Hit@3, Hit@5, and Hit@10, indicating that the adaptive routing strategy effectively combines the strengths of single-agent and multi-agent approaches across different retrieval thresholds.
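The adaptive routing decision can be sketched as a simple gate. The paper says routing is based on case complexity and specialty divergence; the feature names, thresholds, and return labels below are assumptions for illustration, not the authors' implementation.

```python
def route(case_complexity, specialty_divergence,
          c_thresh=0.5, d_thresh=0.5):
    """Hypothetical adaptive router: send easy, low-divergence cases to
    single-agent CoT and everything else to MATTRL. Thresholds are
    illustrative; in practice they would be tuned on held-out cases."""
    if case_complexity < c_thresh and specialty_divergence < d_thresh:
        return "single_agent_cot"
    return "mattrl"

# An easy, single-specialty case stays with the cheaper single agent;
# a complex or cross-specialty case triggers the multi-agent pipeline.
print(route(0.2, 0.1))  # → single_agent_cot
print(route(0.8, 0.6))  # → mattrl
```

The reported numbers (a 10% gain over single-agent and 5.5% over MATTRL) are consistent with such a gate: each regime handles the cases where it is strongest, so the mixture beats either policy alone.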

Results show that MATTRL achieves the highest exact-match accuracy of 0.36 on HLE math problems, outperforming the single-agent baseline at 0.27 and the multi-agent approach at 0.33, indicating that test-time experience enhances collaborative problem solving beyond deliberation alone.
