Tsinghua AI Challenge: Testing the Reasoning Capabilities of Multimodal Large Models
In recent years, multimodal large language models (MLLMs) have advanced rapidly, appearing capable of tasks ranging from image captioning to video understanding. A crucial question remains, however: do these models truly comprehend what they see, and can they think through complex, multi-step visual reasoning tasks the way humans do? To address this question, a team led by Professor Yang Liu, executive dean of Tsinghua University's Institute for Artificial Intelligence Research (AIR), in collaboration with Tsinghua University's Department of Computer Science and Fudan University, has introduced EscapeCraft, a 3D escape-room environment designed to test and evaluate the reasoning capabilities of MLLMs.

EscapeCraft challenges models to perform tasks that require integrating visual, spatial, and logical information: finding keys, opening boxes, solving puzzles, and ultimately escaping the room. The results were surprising and often comical: models saw doors yet circled along walls, picked up keys but forgot how to use them, and even tried to pick up a sofa, speculating that it might hide a secret compartment. These failures point to a systemic issue: seeing is not the same as understanding. Even an advanced model like GPT-4o completes only a fraction of sub-tasks through genuine comprehension; the rest are coincidental successes.

The project homepage and GitHub repository are publicly available. This research has been accepted at the International Conference on Computer Vision (ICCV) 2025, with lead authors Ziyue Wang and Yurui Dong contributing equally.

The EscapeCraft Environment

EscapeCraft is a procedurally generated, configurable 3D environment that gives models a realistic scenario in which to act out their reasoning. Each room can vary in style, complexity, and difficulty, and tasks such as finding keys, solving puzzles, and ultimately escaping can be customized.
The environment is not limited to escape tasks; it can be extended to other challenges such as question answering, logical reasoning, and narrative reconstruction. The primary goal of EscapeCraft is to assess how models explore and make decisions during the escape process, focusing on their ability to integrate and use multiple types of information. A model must correctly interpret visual cues, navigate the space logically, and use tools effectively to reach its objective.

Model Reasoning and Process Evaluation

Unlike traditional evaluations that only check whether a model's final answer is correct, EscapeCraft evaluates the entire process: whether the model explores autonomously, avoids repeating errors, and uses props correctly. The aim is to capture the model's "human-like reasoning" rather than lucky outcomes. The paper introduces several metrics for evaluating the reasoning process:

- Intent-Outcome Consistency: whether interaction outcomes match the model's intended actions, i.e. whether the model performs the right action at the right place.
- Prop Gain / Grab Ratio / GSR: indicators that capture the model's behavior patterns during exploration and reasoning, reflecting its interaction quality, reasoning efficiency, and overall capability.

Evaluation Results

Evaluating a range of models, including GPT-4o, Gemini-1.5 Pro, Claude 3.5, LLaMA-3.2, Qwen, and Phi-3, yielded striking results. At Difficulty 3, for example, GPT-4o achieved only 26.5% of its sub-goals through genuine understanding; the majority of its successes were accidental.

Interesting Failures

Several notable failure modes emerged during testing:

- Visual Perception Mistakes: even when models identified objects correctly, they often struggled to use them appropriately.
- Logical Reasoning Errors: models sometimes behaved illogically, such as attempting to move a sofa in search of a hidden compartment.

These errors fall into two main categories:

- Reasoning issues: most common in models like Claude 3.5, where 61.1% of mistakes were attributed to flawed reasoning and only 38.9% to visual perception.
- Visual perception issues: the model fails to recognize objects or their attributes accurately.

Conclusion

EscapeCraft provides a robust and flexible platform for evaluating the reasoning capabilities of MLLMs in complex, interactive environments. The findings suggest that while current models can recognize objects and extract surface-level information, they often fall short in deeper understanding and effective problem solving. The study underscores the need for more nuanced and comprehensive evaluation methods that move beyond final-answer correctness to assess genuine reasoning and decision-making. As AI continues to evolve, platforms like EscapeCraft will play a crucial role in identifying and addressing the limitations of current models.
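As a closing illustration, the process-oriented indicators discussed above could be computed from a per-step interaction log roughly as follows. Both the log format and the exact metric definitions here (e.g. reading GSR as the fraction of grab attempts that succeed) are assumptions made for this sketch, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One step of a model's trajectory (hypothetical log format)."""
    intended_action: str    # what the model said it wanted to do
    executed_action: str    # what the environment actually registered
    is_grab: bool = False   # was this a grab interaction?
    grab_succeeded: bool = False

def process_metrics(trajectory: list[Step]) -> dict[str, float]:
    """Compute illustrative process metrics over a trajectory.

    - intent_outcome_consistency: share of steps where the executed action
      matches the model's stated intent (right action at the right place).
    - grab_ratio: share of all steps that are grab attempts.
    - gsr: share of grab attempts that actually succeed (assumed reading
      of a "grab success rate").
    """
    n = len(trajectory)
    grabs = [s for s in trajectory if s.is_grab]
    return {
        "intent_outcome_consistency":
            sum(s.intended_action == s.executed_action for s in trajectory) / n,
        "grab_ratio": len(grabs) / n,
        "gsr": (sum(s.grab_succeeded for s in grabs) / len(grabs)) if grabs else 0.0,
    }

# Example: four steps, two grab attempts, one of which succeeds.
log = [
    Step("move_forward", "move_forward"),
    Step("grab key", "grab key", is_grab=True, grab_succeeded=True),
    Step("open door", "bump wall"),                # intent != outcome
    Step("grab sofa", "grab sofa", is_grab=True),  # grab attempt fails
]
m = process_metrics(log)
print(m)  # {'intent_outcome_consistency': 0.75, 'grab_ratio': 0.5, 'gsr': 0.5}
```

Scoring the trajectory rather than only the final escape is what lets a benchmark separate genuine understanding from the accidental successes described above.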
