
Tsinghua University Team Releases VS-Bench, a Multi-Agent Evaluation Benchmark


A team led by Professor Wang Yu at Tsinghua University has introduced VS-Bench, a new benchmark for assessing the reasoning and decision-making capabilities of Vision-Language Models (VLMs) in multi-agent environments. As large models move beyond static, single-turn tasks such as question answering and basic reasoning, they are increasingly applied to complex, interactive, multi-step scenarios like software development, computer interaction, and game-based tasks. Existing benchmarks, however, have largely focused on single-agent or text-only settings, leaving a significant gap in the evaluation of multi-agent, multimodal AI systems. To address this, the research team, led by Ph.D. student Ze-Lai Xu and collaborators, developed VS-Bench, a comprehensive benchmark comprising eight diverse multi-agent environments that span cooperation, competition, and mixed interactions.

The motivation for evaluating models in such settings is that real-world environments often contain multiple agents that coexist and interact dynamically. These interactions introduce two major challenges: strategic reasoning and adaptive decision-making. In multi-agent systems, outcomes depend on the joint actions of all agents, so each agent must not only choose good individual actions but also anticipate the behavior of others, a capability akin to "theory of mind." Moreover, the mix of cooperation and competition, combined with constantly shifting strategies, makes these environments non-stationary. This demands robust long-term planning and decision-making under uncertainty, pushing the limits of current AI models.

To evaluate these capabilities, the team introduced two complementary assessments. The first, offline strategic reasoning, measures how accurately a model predicts the next actions of other agents. The second, online decision-making, measures the long-term rewards a model achieves in the dynamic environment.

The researchers tested 14 state-of-the-art VLMs, including reasoning models, dialogue models, and open-source models, on VS-Bench. The results show that while existing models exhibit some degree of strategic reasoning, they still fall well short of accurately predicting other agents' behavior. All 14 models outperformed random agents, but the best-performing model, o4-mini, achieved only 47.8% average accuracy in action prediction. Reasoning models consistently led on this task, while dialogue and open-source models performed comparably to each other and lagged behind.

More strikingly, the models showed weak decision-making in multi-agent settings. Ten of the 14 models performed on par with random agents, and only three reasoning models significantly outperformed them. Even the top model, o4-mini, scored just 24.3% in overall decision-making performance.

An unexpected insight emerged in social dilemma tasks, such as a multi-agent version of the Prisoner's Dilemma. In these scenarios cooperation yields mutual benefit, but an individual agent can defect for higher personal gain. Reasoning models tended to act "rationally" and often chose to defect, whereas open-source models showed a stronger inclination toward cooperation. This cooperative behavior allowed them to achieve higher collective rewards and, in some cases, even surpass certain reasoning models.
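To make the two evaluation settings concrete, the sketch below uses a toy two-player Prisoner's Dilemma. It is not the actual VS-Bench API or environment set; the payoff matrix, policies, and function names are illustrative assumptions. Offline strategic reasoning is scored as action-prediction accuracy, online decision-making as accumulated reward over repeated rounds, and the final comparison shows why mutual cooperation yields a higher collective reward than mutual defection, the effect observed with the open-source models.

```python
# Illustrative sketch only -- not the VS-Bench API. Payoff values, policies,
# and metric definitions are simplified assumptions meant to mirror the two
# evaluation settings described in the article.
import random

# Toy 2-player Prisoner's Dilemma payoffs: (my_reward, other_reward)
PAYOFF = {
    ("C", "C"): (3, 3),  # mutual cooperation: best collective outcome
    ("C", "D"): (0, 5),  # I cooperate, opponent defects
    ("D", "C"): (5, 0),  # I defect, opponent cooperates
    ("D", "D"): (1, 1),  # mutual defection: individually "rational", collectively poor
}

def offline_prediction_accuracy(predictions, actual_actions):
    """Offline strategic reasoning: how often the model predicts the other
    agent's next action correctly (analogous to the average action-prediction
    accuracy reported in the article)."""
    correct = sum(p == a for p, a in zip(predictions, actual_actions))
    return correct / len(actual_actions)

def online_episode_return(policy_a, policy_b, steps=100, seed=0):
    """Online decision-making: accumulate long-term reward for both agents
    while they act simultaneously each round."""
    rng = random.Random(seed)
    total_a = total_b = 0
    history = []
    for _ in range(steps):
        a, b = policy_a(history, rng), policy_b(history, rng)
        r_a, r_b = PAYOFF[(a, b)]
        total_a += r_a
        total_b += r_b
        history.append((a, b))
    return total_a, total_b

# Two stylized policies: always defect ("rational" play) vs. always cooperate.
always_defect = lambda history, rng: "D"
always_cooperate = lambda history, rng: "C"

if __name__ == "__main__":
    # Offline setting: guessing the moves of an always-cooperating opponent.
    rng = random.Random(1)
    preds = [rng.choice("CD") for _ in range(100)]  # stand-in for model predictions
    actual = ["C"] * 100
    print("prediction accuracy:", offline_prediction_accuracy(preds, actual))

    # Online setting: two defectors earn less jointly than two cooperators,
    # which is the social-dilemma effect noted in the article.
    print("defect vs. defect:   ", online_episode_return(always_defect, always_defect))
    print("cooperate vs. coop.: ", online_episode_return(always_cooperate, always_cooperate))
```

Under these assumed payoffs, mutual defection yields (100, 100) over 100 rounds while mutual cooperation yields (300, 300), which is the sense in which a cooperation-leaning model can end up with higher returns than a more individually "rational" one.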
The researchers attribute this to the fact that, although open-source models may have weaker individual capabilities, their tendency toward cooperative strategies leads to better outcomes in environments where mutual trust and coordination are key.

Looking ahead, the team aims to establish VS-Bench as a standard benchmark for evaluating multi-agent VLMs, fostering advances in algorithms and in real-world applications such as game AI and human-machine collaboration. Future work includes human subject experiments to benchmark human performance, expanding the environment set to more complex and diverse scenarios, and testing newer models as they emerge. The goal is a more comprehensive and realistic evaluation framework that reflects the true potential and limitations of large models in interactive, multi-agent worlds.
