HyperAIHyperAI

Command Palette

Search for a command to run...

정렬은 언어 모델을 기술적이지 않고 규범적으로 만듭니다.

Eilam Shapira Moshe Tennenholtz Roi Reichart

초록

훈련 후 정렬(post-training alignment)은 언어 모델을 인간의 선호 신호와 일치하도록 최적화하지만, 이 목표는 관찰된 인간 행동을 모델링하는 것과 동일하지 않다. 우리는 120 개의 베이스 모델과 정렬된 모델 쌍을 bargaining, 설득, 협상, 반복 행렬 게임 등 다수 라운드의 전략적 게임에서 10,000 건 이상의 실제 인간 의사결정을 기준으로 비교 분석하였다. 이러한 환경에서 베이스 모델은 정렬된 모델에 비해 인간 선택을 예측하는 성능이 거의 10:1 비율로 우월했으며, 이는 모델 계열, 프롬프트 구성, 게임 설정 전반에 걸쳐 견고하게 나타났다. 그러나 인간 행동이 규범적 예측을 따를 가능성이 높은 환경에서는 이 패턴이 반전된다. 정렬된 모델은 테스트된 12 가지 유형 전반에서 일회성 교과서적 게임과 비전략적 복권 선택에서 우위를 점했으며, 다수 라운드 게임 내에서도 상호작용 이력이 형성되기 전인 1 라운드에서조차 우위를 보였다. 이러한 경계 조건 패턴은 정렬이 규범적 편향(normative bias)을 유발함을 시사한다. 즉, 인간 행동이 규범적 해법으로 비교적 잘 설명될 때는 예측 성능을 향상시키지만, 상호성, 보복, 이력에 의존적 적응과 같은 기술적 역동성(descriptive dynamics)에 의해 행동이 형성되는 다수 라운드 전략적 환경에서는 예측 성능을 저하시킨다. 이러한 결과는 인간 활용을 위한 모델 최적화와 인간 행동의 대리자(proxy)로서의 모델 활용 사이에 근본적인 상충 관계가 존재함을 드러낸다.

One-sentence Summary

Researchers from Technion reveal that post-training alignment introduces a normative bias, causing aligned models to fail at predicting human behavior in multi-round strategic games where base models excel by a 10:1 margin, while aligned versions succeed only in one-shot or non-strategic scenarios.

Key Contributions

  • The paper introduces a systematic comparison of 120 base-aligned model pairs across more than 10,000 real human decisions in four multi-round strategic game families, revealing that base models outperform aligned counterparts by a nearly 10:1 margin in predicting human choices.
  • This work presents a deterministic token probability extraction method that bypasses text generation to directly measure how well a model's internal distribution matches human decision distributions, enabling a fair comparison between base and aligned variants on identical inputs.
  • Results demonstrate that alignment induces a normative bias that improves prediction accuracy in one-shot textbook games and non-strategic lottery choices but significantly degrades performance in multi-round strategic settings where human behavior is driven by reciprocity and history-dependent adaptation.

Introduction

Researchers increasingly use aligned Large Language Models as behavioral proxies to simulate human opinions and predict experimental outcomes, yet this practice assumes alignment preserves behavioral fidelity. Prior studies indicate that alignment techniques like RLHF and instruction tuning often collapse opinion diversity, introduce cognitive biases, and degrade reasoning capabilities, but these findings have largely been limited to static judgments rather than dynamic strategic interactions. The authors address this gap by conducting the first systematic comparison of base versus aligned models across 120 model pairs in four game families, demonstrating that alignment makes models normative rather than descriptive and significantly reduces their ability to predict the full range of human strategic behavior.

Top Figure
Top Figure

Dataset

  • Dataset Composition and Sources The authors evaluate models across four strategic game families and two boundary condition datasets. The primary strategic data comes from the GLEE benchmark (Shapira et al., 2024b) and Akata et al. (2025), while the boundary tests utilize procedurally generated matrix games from Zhu et al. (2025) and binary lottery choices from Marantz and Plonsky (2025).

  • Key Details for Each Subset

    • Bargaining: Based on the Rubinstein model, this subset includes 1,788 human decisions where players alternate offers with optional free-text messages and face discount factors framed as inflation.
    • Persuasion: A repeated cheap talk game over 20 rounds containing 3,180 human decisions where buyers decide to purchase based on seller messages, featuring both long-living and myopic buyer variants.
    • Negotiation: A bilateral price negotiation with 1,182 human decisions involving ternary choices to accept, reject, or take an outside option with a third party.
    • Repeated Matrix Games: Includes 3,900 decisions (1,950 per game) from 195 participants playing 10 rounds of Prisoner's Dilemma and Battle of the Sexes against pre-computed GPT-4 strategies.
    • One-shot Matrix Games: A procedurally generated set of 2,416 games spanning 12 topologies with approximately 93,000 aggregated human decisions, reduced to 71 valid pairs after filtering.
    • Binary Lottery Choices: Comprises 1,001 non-strategic problems where participants choose between gambles, resulting in 90 valid pairs after filtering.
  • Data Usage and Processing The authors use these datasets to generate over 2.4 million total predictions across 120 same-provider base-aligned model pairs. For the GLEE games, human participants interacted with LLM opponents via a web interface without knowing the opponent's nature, ensuring decisions were uncontaminated by bias. The repeated matrix games are formatted as multi-turn prompts that include the payoff matrix and round history. The one-shot matrix games use a counterbalanced format to control for position bias, while lottery choices are presented as verbal descriptions.

  • Filtering and Metadata Construction The study excludes model pairs where the base and aligned checkpoints are identical or where the aligned model lacks a chat template. For the boundary condition datasets, the authors apply strict filtering to retain only valid same-provider pairs, resulting in 71 pairs for the one-shot matrix games and 90 pairs for the lottery choices. The evaluation covers 10,050 human decisions per model to ensure statistical robustness across the different game structures.

Method

The authors frame human decision prediction as a token probability extraction task. For each human decision point within a game, they construct a prompt consisting of a system message that describes the game rules and the participant's role. This is followed by the dialogue history up to the specific decision point. The model then performs a single forward pass to extract the log-probabilities assigned to each decision token from the next-token distribution at the final position. For example, in bargaining scenarios, the tokens might be "accept" versus "reject".

They normalize the extracted probabilities to obtain a predicted decision distribution. The formula for the probability of acceptance is defined as:

paccept=p(yes)dp(d)p_{\mathrm{accept}} = \frac{p(\mathrm{yes})}{\sum_{d} p(d)}paccept=dp(d)p(yes)

where ddd ranges over all decision tokens for a given family. This includes two tokens for bargaining, persuasion, and matrix games, while negotiation adds a third token for the outside option. The resulting pacceptp_{\text{accept}}paccept value falls within the range [0,1][0, 1][0,1] and captures the model's relative preference for the affirmative action. This normalization removes the influence of non-decision tokens.

This method requires no text generation and no sampling. It is a deterministic extraction of the model's internal probability distribution over decision tokens. The approach is applicable to both base and aligned models without requiring different decoding strategies. However, the normalization is robust only when decision tokens receive substantial probability mass. When the model distributes mass primarily to non-decision tokens, the normalized probabilities become unreliable. To address this, the authors apply two pair-level filters per game family. The first is a mass filter that excludes pairs where either model assigns less than 80% average probability mass to decision tokens. The second is a minimum correlation filter that excludes pairs where both models correlate below 0.3 with human decisions. These filters are applied independently per family to ensure the base advantage remains robust across threshold choices.

Experiment

  • Multi-round strategic games (bargaining, persuasion, negotiation, repeated matrix games) validate that base models significantly outperform aligned models in predicting actual human behavior, as alignment introduces a normative bias that suppresses descriptive dynamics like reciprocity and retaliation.
  • One-shot textbook games and non-strategic lottery choices validate that aligned models excel in settings where human behavior aligns with normative predictions, demonstrating that alignment improves accuracy when interaction history is absent.
  • Round-by-round analysis within multi-round games confirms that aligned models perform better in the initial round before history accumulates, but lose predictive advantage as interaction history develops and strategic dynamics emerge.
  • Robustness tests across model families, prompt formats, and game parameters validate that the performance gap is driven by model weights and alignment effects rather than prompt formatting or specific game configurations.
  • The overall conclusion establishes a fundamental trade-off where alignment optimizes models for human use and normative ideals but degrades their utility as proxies for describing complex, history-dependent human behavior.

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp
정렬은 언어 모델을 기술적이지 않고 규범적으로 만듭니다. | 문서 | HyperAI초신경