The Potential of LLM Personas to Serve as a Substitute for Field Experiments in Method Benchmarking

Enoch Hyunwook Kang

Abstract

Field experiments (A/B tests) are widely used as the credible standard for evaluating methods in societal systems, but their cost and latency are a major bottleneck for iterative method development. Persona simulation with large language models (LLMs) offers a low-cost synthetic alternative, yet it is unclear whether swapping humans for personas preserves the benchmark interface that adaptive methods optimize against. This work proves a theorem giving necessary and sufficient conditions: if and only if (i) the method observes only aggregate outcomes (aggregate-only observation) and (ii) evaluation depends only on the submitted artifact and not on the algorithm's identity or provenance (algorithm-blind evaluation), replacing humans with personas is, from the method's perspective, indistinguishable from simply changing the evaluated population (for example, from New York to Jakarta). The work then moves from validity to usefulness: it defines the information-theoretic discriminability of the induced aggregate channel and shows that whether a persona benchmark carries decision relevance equivalent to a field experiment is essentially a question of sample size. This yields an explicit bound on the number of independent persona evaluations needed to reliably distinguish meaningfully different methods at a given resolution.

One-sentence Summary

The authors propose that LLM-based persona simulation validly replaces human A/B testing under aggregate-only observation and algorithm-blind evaluation, proving that the swap is indistinguishable from changing the evaluated population, and define an information-theoretic notion of discriminability showing that, with sufficiently many independent persona evaluations, synthetic benchmarks become as decision-relevant as field experiments for reliably distinguishing methods at a desired resolution.

Key Contributions

  • Field experiments (A/B tests) for societal systems are credible but costly and slow, creating a bottleneck for iterative development, while LLM-based persona simulations offer a cheap alternative whose validity as a drop-in benchmark substitute remains uncertain due to potential mismatches in the evaluation interface.
  • The paper proves that persona simulations become indistinguishable from a simple population panel change (e.g., New York to Jakarta) if and only if two conditions hold: methods observe only aggregate outcomes (aggregate-only observation) and evaluation depends solely on the submitted artifact, not the algorithm's origin (algorithm-blind evaluation).
  • It introduces an information-theoretic discriminability metric for the aggregate channel, showing that achieving decision-relevant persona benchmarking equivalent to field experiments requires sufficient independent persona evaluations, with explicit sample-size bounds derived to reliably distinguish meaningfully different methods at a specified resolution.

Introduction

Field experiments are the gold standard for benchmarking methods in societal systems like marketplace design or behavioral interventions, but their high cost and slow execution severely bottleneck iterative development. Prior attempts to use LLM-based persona simulations as cheaper alternatives face critical uncertainty: it remains unclear whether swapping humans for personas preserves the benchmark's core interface that methods optimize against, especially given evidence of confounding in causal applications where prompt manipulations inadvertently alter latent scenario aspects.

The authors prove that persona simulation becomes a theoretically valid drop-in substitute for field experiments if and only if two conditions hold: (i) methods observe only aggregate outcomes (not individual responses), and (ii) evaluation depends solely on the submitted artifact, not the algorithm's identity or provenance. Crucially, they extend this identification result to practical usefulness by defining an information-based measure of discriminability for the persona-induced evaluation channel. This yields explicit sample-size bounds—showing how many independent persona evaluations are required to reliably distinguish meaningful method differences at a target resolution—turning persona quality into a quantifiable budget question.

Method

The authors leverage a formal framework to model algorithm benchmarking as an interactive learning process, where an algorithm iteratively selects method configurations and receives feedback from an evaluator. This process is structured around three core components: the configuration space, the evaluation pipeline, and the feedback-driven adaptation mechanism.

At the heart of the method is the concept of a method configuration $\theta \in \Theta$, which encapsulates all controllable degrees of freedom (such as model weights, prompts, hyperparameters, decoding rules, or data curation policies) that define a system or procedure. Deploying $\theta$ yields an artifact $w(\theta) \in \mathcal{W}$, which is the object submitted to the benchmark for evaluation. The artifact space $\mathcal{W}$ is flexible, accommodating single outputs, stochastic distributions, interaction policies, or agent rollouts, depending on the task.
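As a concrete illustration, the following minimal Python sketch mirrors this abstraction; the class name, its fields, and the `deploy` helper are illustrative assumptions rather than notation from the paper.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Illustrative sketch only: the field names below are assumptions, chosen to
# mirror the "controllable degrees of freedom" listed above.

@dataclass
class MethodConfig:            # a point theta in the configuration space Theta
    prompt: str
    hyperparams: Dict[str, Any] = field(default_factory=dict)
    decoding: Dict[str, Any] = field(default_factory=dict)

def deploy(theta: MethodConfig) -> str:
    """Map a configuration theta to the artifact w(theta) submitted for evaluation.

    Here the artifact space W is plain text; in other tasks it could be a policy,
    a distribution over outputs, or an agent rollout.
    """
    return theta.prompt   # a real system might render templates or sample outputs
```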

The evaluation process is modeled as a two-stage pipeline: micro-level judgments are first elicited and then aggregated into a single feedback signal. This pipeline is fully specified by a tuple $(P, I, \Gamma, L)$, where $P$ is a distribution over evaluators (human or LLM personas), $I(\cdot \mid w, p)$ is a micro-instrument that generates individual responses from an evaluator $p$ given artifact $w$, $\Gamma$ is a deterministic aggregation function mapping $L$ micro-responses to a single observable feedback $o \in \mathcal{O}$, and $L$ is the panel size. The entire evaluation call induces a Markov kernel $Q_{P,I}(\cdot \mid w)$ over $\mathcal{O}$, which represents the distribution of the aggregate feedback for artifact $w$.
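A minimal sketch of a single evaluation call under these definitions may help; the function signature and type aliases below are assumptions chosen to mirror the tuple $(P, I, \Gamma, L)$, not an API from the paper.

```python
import statistics
from typing import Callable, List, Sequence

# Sketch of one call to the evaluation pipeline (P, I, Gamma, L); names are
# illustrative assumptions, not the paper's API.

Persona = dict      # an evaluator profile p drawn from P (human descriptor or LLM persona)
Artifact = str      # the submitted object w

def evaluate(
    w: Artifact,
    sample_persona: Callable[[], Persona],                   # draws p ~ P
    micro_instrument: Callable[[Artifact, Persona], float],  # one response ~ I(. | w, p)
    aggregate: Callable[[Sequence[float]], float],           # Gamma: L responses -> one o
    panel_size: int,                                         # L
) -> float:
    """A single draw from the induced aggregate kernel Q_{P,I}(. | w)."""
    responses: List[float] = [
        micro_instrument(w, sample_persona()) for _ in range(panel_size)
    ]
    return aggregate(responses)

# Example wiring: a mean rating over a panel of 50 personas.
# o = evaluate(w, sample_persona, micro_instrument, statistics.mean, panel_size=50)
```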

The algorithm operates as an adaptive learner in a repeated “submit-observe” loop. At each round $t$, it selects a configuration $\theta_t$ (or equivalently, artifact $w_t$) based on a decision kernel $\pi_t(\cdot \mid H_{t-1}, S)$, where $H_{t-1}$ is the observable history of past submissions and feedback, and $S$ represents any side information available before benchmarking begins. The feedback $o_t$ received at round $t$ is drawn from $Q_{P,I}(\cdot \mid w_t)$, and the algorithm updates its strategy accordingly.
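The loop itself is simple; the sketch below assumes a generic `algorithm` object with hypothetical `propose_next` and `update` methods standing in for whatever adaptive method is being benchmarked.

```python
# Sketch of the submit-observe loop; `algorithm.propose_next` and `algorithm.update`
# are stand-ins for the adaptive method under test (a bandit, prompt optimizer,
# search procedure, etc.) and are assumptions rather than the paper's API.

def benchmark_loop(algorithm, evaluate_fn, n_rounds: int):
    history = []                               # H_t holds only (w_t, o_t) pairs
    for t in range(n_rounds):
        w_t = algorithm.propose_next(history)  # plays the role of pi_t(. | H_{t-1}, S)
        o_t = evaluate_fn(w_t)                 # one draw from Q_{P,I}(. | w_t)
        history.append((w_t, o_t))
        algorithm.update(w_t, o_t)
    return history
```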

Two benchmark hygiene conditions are critical to ensure the integrity of this interface. The first, Aggregate-only observation (AO), mandates that the algorithm observes only the aggregate feedback $o_t$ and not any micro-level details such as panel identities or raw votes. The second, Algorithm-blind evaluation (AB), requires that the feedback distribution depends solely on the submitted artifact $w_t$ and not on the identity or provenance of the algorithm that produced it. Together, these conditions ensure that the evaluation behaves as a well-defined oracle channel, enabling the method to treat the benchmark as a stable environment.
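One way to read these conditions operationally is as constraints on the benchmark's call signature and return value, as in the following thin hypothetical wrapper (not part of the paper's formalism).

```python
class BenchmarkOracle:
    """Minimal sketch of a benchmark wrapper that enforces the two hygiene conditions.

    Algorithm-blind (AB): the call accepts only the artifact w, never the identity
    or history of the algorithm that produced it.
    Aggregate-only (AO): only the aggregate feedback o is returned; persona
    identities and raw micro-responses never leave the oracle.
    """

    def __init__(self, evaluate_fn):
        self._evaluate = evaluate_fn   # wraps the (P, I, Gamma, L) pipeline

    def __call__(self, w) -> float:
        return self._evaluate(w)       # a single draw from Q(. | w)
```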

Under these conditions, swapping human evaluators for LLM personas is equivalent to a “just panel change” (JPC) from the method’s perspective: the interaction structure remains unchanged, and the only difference is in the induced artifact-to-feedback kernel $Q(\cdot \mid w)$. This equivalence is formalized through transcript laws that factorize into submission kernels and artifact-dependent feedback kernels, preserving the method’s information structure regardless of the evaluator type.

To assess the usefulness of such a benchmark—beyond its validity—the authors introduce the concept of discriminability $\kappa_Q$, defined as the infimum of the Kullback-Leibler divergence between feedback distributions of artifacts that differ by at least a resolution threshold $r$ under a metric $d_{\mathcal{W}}$. Under a homoscedastic Gaussian assumption, this reduces to the worst-case pairwise signal-to-noise ratio (SNR), which is empirically estimable from repeated evaluations. The sample complexity for reliable pairwise comparisons scales inversely with $\kappa_Q$, requiring approximately $L \geq \frac{2}{\kappa_Q} \log \frac{1}{\delta}$ independent evaluations to achieve a misranking probability of at most $\delta$.
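Plugging numbers into this bound is straightforward; the helper below applies the expression above, with an illustrative budget calculation in the comment.

```python
import math

def required_evaluations(kappa_q: float, delta: float) -> int:
    """Apply the bound L >= (2 / kappa_Q) * log(1 / delta) stated in the text above.

    kappa_q: estimated discriminability of the aggregate feedback channel.
    delta:   acceptable misranking probability for a pairwise comparison.
    """
    return math.ceil((2.0 / kappa_q) * math.log(1.0 / delta))

# Illustration: kappa_Q = 0.05 and delta = 0.01 gives
# ceil((2 / 0.05) * ln(100)) = ceil(40 * 4.605...) = 185 independent evaluations.
```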

The choice of $d_{\mathcal{W}}$ and $r$ is method-specific and should reflect the developer’s degrees of freedom and the minimal meaningful unit of iteration. For example, in prompt tuning, $d_{\mathcal{W}}$ may be the Levenshtein distance over instruction clauses, and $r = 1$ corresponds to a single atomic edit. This operationalization allows practitioners to estimate $\kappa_Q$ from pilot runs and derive the required dataset size for stable method comparison.
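Under the homoscedastic Gaussian simplification described above, a pilot-run estimate of $\kappa_Q$ can be sketched as follows; the data format and function name are assumptions for illustration.

```python
import itertools
import statistics
from typing import Dict, List

def estimate_kappa_q(pilot: Dict[str, List[float]]) -> float:
    """Estimate kappa_Q as the worst-case pairwise SNR from pilot feedback.

    `pilot` maps each artifact (e.g. a prompt variant at distance >= r from the
    others under d_W) to a list of repeated aggregate feedback values. Under the
    homoscedastic Gaussian simplification, the KL divergence between two feedback
    distributions is (mean difference)^2 / (2 * variance), so kappa_Q is the
    minimum of that quantity over artifact pairs.
    """
    means = {w: statistics.mean(scores) for w, scores in pilot.items()}
    pooled_var = statistics.mean(statistics.variance(scores) for scores in pilot.values())
    return min(
        (means[a] - means[b]) ** 2 / (2.0 * pooled_var)
        for a, b in itertools.combinations(pilot, 2)
    )
```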

In summary, the framework provides a rigorous, modular structure for modeling adaptive benchmarking, grounded in information-theoretic principles and practical design guidelines. It enables systematic analysis of when persona-based evaluation is a valid and useful substitute for human judgment, while also quantifying the data requirements for reliable method optimization.

Experiment

  • Compared human benchmark (human evaluators with micro-instrument) and persona benchmark (LLM judges with persona profiles) setups
  • Verified that both setups expose the same observable interface to the method: an artifact-dependent aggregate feedback kernel (Q_hum or Q_pers)
  • Confirmed that, under aggregate-only observation and algorithm-blind evaluation, the algorithm's view of the benchmark is unchanged whether the aggregate feedback originates from humans or personas
