Multi-agent cooperation through in-context co-player inference

Marissa A. Weis Maciej Wołczyk Rajai Nasser Rif A. Saurous Blaise Agüera y Arcas João Sacramento Alexander Meulemans

Abstract

Achieving cooperation among self-interested agents remains a fundamental challenge in multi-agent reinforcement learning. Recent work has shown that mutual cooperation can be induced between "learning-aware" agents that account for and shape the learning dynamics of their co-players. However, existing approaches typically rely on hardcoded, often inconsistent, assumptions about co-player learning rules, or enforce a strict separation between "naive learners" that update on fast timescales and "meta-learners" that observe these updates. In this paper, we demonstrate that the in-context learning capabilities of sequence models allow for co-player learning awareness without hardcoded assumptions or explicit timescale separation. We show that training sequence model agents against a diverse distribution of co-players naturally induces in-context best-response strategies, which effectively function as learning algorithms on the fast intra-episode timescale. We further find that the cooperative mechanism identified in prior work, in which vulnerability to extortion drives mutual shaping, emerges naturally in this setting: in-context adaptation renders agents vulnerable to extortion, and the resulting mutual pressure to shape the opponent's in-context learning dynamics resolves into the learning of cooperative behavior. Our results suggest that standard decentralized reinforcement learning on sequence models, combined with co-player diversity, provides a scalable path to learning cooperative behaviors.

One-sentence Summary

By training sequence model agents against a diverse distribution of co-players, the researchers demonstrate that in-context co-player inference naturally induces cooperative behaviors and best-response strategies without the need for hardcoded learning rules or explicit timescale separation.

Key Contributions

  • The paper introduces a decentralized multi-agent reinforcement learning setup where sequence model agents are trained against a diverse pool of co-players to induce in-context co-player inference and cooperation.
  • This work presents a new reinforcement learning method that leverages self-supervised learning of predictive sequence models to learn the in-context best-response policies required for mixed-pool training.
  • The research demonstrates that training against diverse co-players enables robust cooperation in the Iterated Prisoner's Dilemma by bridging in-context learning with co-player learning awareness without requiring explicit timescale separation or meta-gradient machinery.

Introduction

As autonomous agents based on foundation models move from isolated systems to interacting entities, ensuring cooperation in mixed-motive environments is critical for scalable multi-agent systems. Previous attempts to achieve cooperation through co-player learning awareness often rely on rigid assumptions about an opponent's learning rules or require a strict separation between fast-updating naive learners and slow-updating meta-learners. The authors leverage the in-context learning capabilities of sequence models to bridge this gap, demonstrating that training agents against a diverse distribution of co-players naturally induces in-context best-response strategies. This approach allows agents to function as both naive learners through intra-episode adaptation and learning-aware agents through parameter updates, enabling cooperative behaviors to emerge naturally through mutual extortion dynamics without complex meta-gradient machinery.

Dataset

The authors use an Iterated Prisoner's Dilemma (IPD) environment to evaluate agent performance. The environment characteristics are summarized below:

  • Dataset Composition and Environment Rules: The environment consists of games played over 100 rounds. In each round, two agents choose between two actions: cooperate (C) or defect (D).
  • Observation and State Construction: The environment provides five distinct observations. These include the initial state $s_0$ and four subsequent observations based on the action pairs from the previous round: (C, C), (C, D), (D, C), and (D, D). While tabular agents only process the most recent observation $o_t$, the PPI and A2C agents are trained to leverage the full history $x_{\leq t}$.
  • Data Processing and Perspective: Agents receive observations from a first-person perspective, meaning an agent's own action is always listed first in the observed action pair.
  • Reward Mechanism: Rewards are assigned to agents at each round based on a single-round payoff matrix.
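The environment rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the summary does not give the payoff matrix, so the conventional values (3 for mutual cooperation, 1 for mutual defection, 5 and 0 for unilateral defection) are an assumption, and `play_episode`, `tit_for_tat`, and `always_defect` are hypothetical names introduced here.

```python
# ASSUMPTION: the payoff values are the conventional Prisoner's Dilemma
# numbers; the summary does not state the matrix actually used.
# Keys are (own action, co-player action); values are (own, co-player) rewards.
PAYOFFS = {
    ("C", "C"): (3.0, 3.0),
    ("C", "D"): (0.0, 5.0),
    ("D", "C"): (5.0, 0.0),
    ("D", "D"): (1.0, 1.0),
}


def play_episode(policy_a, policy_b, rounds=100):
    """Run one IPD episode; each policy maps an observation to 'C' or 'D'."""
    obs_a = obs_b = "s0"  # initial observation before any action pair exists
    total_a = total_b = 0.0
    for _ in range(rounds):
        act_a, act_b = policy_a(obs_a), policy_b(obs_b)
        rew_a, rew_b = PAYOFFS[(act_a, act_b)]
        total_a += rew_a
        total_b += rew_b
        # First-person perspective: each agent sees its own action first.
        obs_a, obs_b = (act_a, act_b), (act_b, act_a)
    return total_a, total_b


def tit_for_tat(obs):
    """Tabular agent: cooperate first, then copy the co-player's last action."""
    return "C" if obs == "s0" else obs[1]


def always_defect(obs):
    """Tabular agent that defects regardless of the observation."""
    return "D"


if __name__ == "__main__":
    print(play_episode(tit_for_tat, tit_for_tat))
    print(play_episode(tit_for_tat, always_defect))
```

Both example policies read only the most recent observation, which matches the tabular agents described above; a history-based agent would instead keep the whole sequence of observations.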

Method

The authors propose Predictive Policy Improvement (PPI) agents, which serve as a practical approximation of embedded Bayesian agents. The core of the PPI framework is the integration of a learned sequence model with a planning-based policy improvement mechanism, moving away from the standard reinforcement learning paradigm where a separate critic is used.

Sequence Model Architecture

The PPI agent utilizes a sequence model designed to act simultaneously as a world model and a policy prior. This model is implemented as a Gated Recurrent Unit (GRU) with a 128-dimensional hidden state. The input pipeline processes observations, actions, and rewards through modality-specific linear layers, projecting them into a shared 32-dimensional embedding space. Prior to this projection, observations and actions are one-hot encoded.

The embeddings are fed into the GRU, and the resulting outputs are processed using the Swish activation function. To facilitate multi-modal prediction, distinct linear output heads decode the hidden states to predict future tokens for each specific modality. Specifically, the model predicts:

  • Actions $p_{\phi}(a_{t} \mid x_{\leq t})$ using a categorical distribution.
  • Observations $p_{\phi}(o_{t} \mid x_{<t}, a_{t-1})$ using a categorical distribution.
  • Rewards $p_{\phi}(r_{t} \mid x_{<t}, a_{t-1}, o_{t})$ using a normal distribution with fixed variance.

Training Process

The training of the sequence model follows an iterative, multi-phase approach. The authors employ a performative prediction strategy where the model is trained on a dataset $\mathcal{D}$ that accumulates interaction histories from all previous and current phases. This ensures more stable training as the agent's own policy influences the data distribution.

In each of the 30 training phases, the model parameters $\phi$ are re-initialized and optimized to minimize a joint next-token prediction loss:

$$L_{\text{train}} = \lambda_{\text{obs}} L_{\text{obs}} + \lambda_{\text{act}} L_{\text{action}} + \lambda_{\text{reward}} L_{\text{reward}}$$

The individual loss components are defined as:

$$L_{\text{obs}} = - \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} \log p_{\phi}(o_{t}^{(n)} \mid x_{\leq t-1}^{(n)})$$

$$L_{\text{reward}} = - \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} \log p_{\phi}(r_{t}^{(n)} \mid x_{\leq t-1}^{(n)}, o_{t}^{(n)})$$

$$L_{\text{action}} = - \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} \log p_{\phi}(a_{t}^{(n)} \mid x_{\leq t-1}^{(n)}, o_{t}^{(n)}, r_{t}^{(n)})$$
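As a rough illustration of how the three terms combine, the sketch below computes the joint loss in plain Python from already-computed model predictions. The data layout (one dict per time step), the function names, the unit loss weights, and the fixed standard deviation `sigma=1.0` are all assumptions made for this example; the summary does not give the values used by the authors.

```python
import math


def categorical_nll(probs, target):
    """Negative log-likelihood of index `target` under categorical `probs`."""
    return -math.log(probs[target])


def gaussian_nll(mean, target, sigma=1.0):
    """Negative log-likelihood under a normal with fixed std `sigma`."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (target - mean) ** 2 / (2 * sigma ** 2)


def joint_loss(steps, w_obs=1.0, w_act=1.0, w_rew=1.0, sigma=1.0):
    """Weighted sum of the per-modality losses, averaged over all steps.

    `steps` flattens the N trajectories and T time steps into one list, so
    dividing by len(steps) plays the role of the 1/(NT) factor.
    Loss weights and sigma are ASSUMED values, not taken from the paper.
    """
    n = len(steps)
    l_obs = sum(categorical_nll(s["obs_probs"], s["obs"]) for s in steps) / n
    l_act = sum(categorical_nll(s["act_probs"], s["act"]) for s in steps) / n
    l_rew = sum(gaussian_nll(s["rew_mean"], s["rew"], sigma) for s in steps) / n
    return w_obs * l_obs + w_act * l_act + w_rew * l_rew


if __name__ == "__main__":
    # One step with uniform predictions over 5 observations and 2 actions.
    example = [{
        "obs_probs": [0.2] * 5, "obs": 1,
        "act_probs": [0.5, 0.5], "act": 0,
        "rew_mean": 3.0, "rew": 3.0,
    }]
    print(joint_loss(example))
```

Because the reward variance is fixed, the reward term reduces to a scaled squared error plus a constant, so it behaves like a regression loss alongside the two classification-style terms.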

Optimization is conducted using the AdamW optimizer over 10 epochs per phase, with a batch size of 256 and gradient clipping at a norm of 1.0.
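The phase structure described above (accumulate data, re-initialize, refit) can be outlined as a small driver loop. This is only a schematic under stated assumptions: `init_model`, `collect`, and `fit` are hypothetical callbacks standing in for parameter initialization, environment interaction, and one epoch of AdamW training, and the choice to collect phase data with the most recently fitted model is a guess at the ordering rather than something the summary specifies.

```python
def run_phases(init_model, collect, fit, num_phases=30, epochs=10):
    """Refit a fresh model each phase on data accumulated over all phases."""
    dataset = []          # plays the role of D, growing across phases
    model = init_model()  # ASSUMPTION: an untrained model acts in phase 0
    for phase in range(num_phases):
        dataset.extend(collect(model, phase))  # add current-phase histories
        model = init_model()                   # parameters are re-initialized
        for _ in range(epochs):
            fit(model, dataset)                # one pass over accumulated data
    return model, dataset


# Stub callbacks so the loop can be run end to end without a real model.
def stub_init():
    return {"updates": 0}


def stub_collect(model, phase):
    return [phase, phase]  # two placeholder "episodes" per phase


def stub_fit(model, dataset):
    model["updates"] += 1


if __name__ == "__main__":
    final_model, data = run_phases(stub_init, stub_collect, stub_fit,
                                   num_phases=3, epochs=2)
    print(len(data), final_model["updates"])
```

In a real setup, the batch size of 256 and gradient clipping at norm 1.0 would live inside the `fit` step; they are omitted here to keep the sketch free of any deep learning dependency.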

Inference and Policy Improvement

During deployment, the agent does not rely on a traditional value function. Instead, it estimates $Q$ values by performing Monte Carlo roll-outs into the future using the learned sequence model as a simulator. By sampling future trajectories from the model, the agent evaluates the expected return of potential actions based on its internal representation of environment dynamics and co-player responses.
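A minimal sketch of this roll-out estimate follows, with a hand-written toy simulator in place of the learned sequence model. Everything named here is an assumption for illustration: `simulate_step` is a hypothetical interface, the return is an undiscounted finite-horizon sum, the payoff values are the conventional ones, and `toy_step` models a tit-for-tat co-player with an always-cooperate action prior.

```python
import random

# ASSUMPTION: conventional payoffs for (own action, co-player action).
PAY = {("C", "C"): 3.0, ("C", "D"): 0.0, ("D", "C"): 5.0, ("D", "D"): 1.0}


def estimate_q(simulate_step, history, action, horizon=10, num_rollouts=64, rng=None):
    """Monte Carlo estimate of the return from taking `action` after `history`.

    `simulate_step(history, action, rng)` stands in for sampling from the
    learned model: it returns (reward, next_action), where next_action is
    drawn from the model's own action prior.
    """
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(num_rollouts):
        hist, act, ret = list(history), action, 0.0
        for _ in range(horizon):
            reward, next_act = simulate_step(hist, act, rng)
            ret += reward
            hist.append((act, reward))
            act = next_act
        total += ret
    return total / num_rollouts


def toy_step(hist, act, rng):
    """Toy simulator: the co-player copies our previous action (tit-for-tat)."""
    co_player_act = hist[-1][0] if hist else "C"
    return PAY[(act, co_player_act)], "C"  # the prior always continues with C


if __name__ == "__main__":
    for first_action in ("C", "D"):
        print(first_action, estimate_q(toy_step, [], first_action, horizon=3))
```

With this toy simulator, defecting first earns more immediately but is punished on the next simulated round, so the estimate for cooperating comes out higher over a three-step horizon.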

The final action selection is performed by a policy $\pi(a \mid x_{\leq t})$ that re-weights the model's prior probability $p(a \mid x_{\leq t}; \phi)$ using the estimated value $\hat{Q}^{p}(x_{\leq t}, a)$ through a Boltzmann distribution:

$$\pi(a \mid x_{\leq t}) = \frac{1}{Z} p(a \mid x_{\leq t}; \phi) \exp(\beta \hat{Q}^{p}(x_{\leq t}, a))$$

In this formulation, $\beta$ acts as an inverse temperature parameter that defines a trust region around the behavioral prior $p_{\phi}$. This mechanism allows the agent to improve its policy by selecting actions that the sequence model predicts will yield higher cumulative rewards.
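The re-weighting step itself is short enough to write out directly. The sketch below assumes the prior and the $Q$ estimates are given as dictionaries keyed by action; the function name `improved_policy` and the example numbers are illustrative, not taken from the paper.

```python
import math


def improved_policy(prior, q_values, beta=1.0):
    """Return pi(a) proportional to prior(a) * exp(beta * Q(a))."""
    # Subtracting the max Q leaves the normalized result unchanged but
    # avoids overflow in exp for large beta or large returns.
    q_max = max(q_values.values())
    weights = {a: prior[a] * math.exp(beta * (q_values[a] - q_max)) for a in prior}
    z = sum(weights.values())  # normalizing constant Z
    return {a: w / z for a, w in weights.items()}


if __name__ == "__main__":
    prior = {"C": 0.5, "D": 0.5}
    q_hat = {"C": 9.0, "D": 8.0}  # illustrative values only
    for beta in (0.0, 1.0, 10.0):
        print(beta, improved_policy(prior, q_hat, beta))
```

Setting `beta` to 0 returns the prior unchanged, while large values concentrate probability on the action with the highest estimated value among those the prior supports, which is one way to read the trust-region interpretation above.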

Experiment

The researchers evaluate the emergence of cooperation in the Iterated Prisoner's Dilemma by training agents in a mixed population of learning models and static tabular agents. Using both Predictive Policy Improvement and Independent A2C, the study validates that training against a diverse pool of opponents induces robust in-context inference capabilities. The findings demonstrate a causal chain where diversity drives in-context best-response mechanisms, which in turn creates a vulnerability to extortion that ultimately settles into mutual cooperation through reciprocal shaping.

The table lists the hyperparameters used for the A2C algorithm across four experimental steps, including batch size, reward rescaling, and learning rate. Batch sizes increase from the first two steps to the final two steps. The reward rescaling factor decreases progressively across the four steps. The learning rate is adjusted differently across the steps, with the lowest value appearing in step three.

The evaluation utilizes the A2C algorithm across four experimental steps with controlled variations in batch size, reward rescaling, and learning rates. These adjustments are designed to test the impact of different hyperparameter configurations on agent performance. The setup ensures a systematic investigation into how scaling and learning dynamics influence the stability and effectiveness of the reinforcement learning process.