
VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments

Zelai Xu, Zhexuan Xu, Xiangmin Yi, Huining Yuan, Xinlei Chen, Yi Wu, Chao Yu, Yu Wang
Published: June 4, 2025
Abstract

Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenarios often involve multiple agents interacting within rich visual and linguistic contexts, posing challenges with both multimodal observations and strategic interactions. To bridge this gap, we introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark that evaluates VLMs for strategic reasoning and decision-making in multi-agent environments. VS-Bench comprises eight vision-grounded environments spanning cooperative, competitive, and mixed-motive interactions, designed to assess agents' ability to predict others' future moves and optimize for long-term objectives. We consider two complementary evaluation dimensions, including offline evaluation of strategic reasoning by next-action prediction accuracy and online evaluation of decision-making by normalized episode return. Extensive experiments on fourteen leading VLMs reveal a significant gap between current models and optimal performance, with the best models attaining 47.8% prediction accuracy and 24.3% normalized return. We further conduct in-depth analyses on multimodal observations, test-time scaling, social behaviors, and failure cases of VLM agents. By standardizing the evaluation and highlighting the limitations of existing models, we envision VS-Bench as a foundation for future research on strategic multimodal agents. Code and data are available at https://vs-bench.github.io.
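The abstract names two evaluation metrics: next-action prediction accuracy (offline) and normalized episode return (online). The sketch below is a hypothetical illustration of how such metrics are commonly computed; the exact normalization bounds and function names are assumptions, not taken from the VS-Bench codebase.

```python
# Hypothetical sketch of the two metric styles named in the abstract.
# The normalization bounds (baseline vs. optimal return) are assumed;
# VS-Bench's actual definitions may differ.

def prediction_accuracy(predicted_actions, actual_actions):
    """Offline metric: fraction of other agents' next actions
    that the model predicted correctly."""
    correct = sum(p == a for p, a in zip(predicted_actions, actual_actions))
    return correct / len(actual_actions)

def normalized_return(episode_return, baseline_return, optimal_return):
    """Online metric: episode return rescaled so the baseline
    maps to 0 and the optimal policy maps to 1."""
    return (episode_return - baseline_return) / (optimal_return - baseline_return)

# Toy usage: 2 of 3 predictions correct; a return of 12 on a 0-50 scale.
print(prediction_accuracy(["up", "left", "up"], ["up", "right", "up"]))
print(normalized_return(12.0, 0.0, 50.0))
```

With this convention, a score of 1.0 on either metric corresponds to perfect prediction or optimal play, which is how the reported 47.8% accuracy and 24.3% normalized return can be read as distance from optimal performance.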