MindWatcher: The Road to Smarter Multimodal Tool-Integrated Reasoning
Abstract
Traditional workflow-based agents exhibit only limited intelligence on real-world problems that require calling external tools. Tool-Integrated Reasoning (TIR) agents, which enable autonomous reasoning and tool invocation, have rapidly emerged as a powerful approach for complex decision-making tasks involving multi-step interaction with external environments. In this work, we propose MindWatcher, a TIR agent that integrates interleaved thinking with multimodal chain-of-thought (CoT) reasoning. MindWatcher autonomously decides which tools to call and how to use them, and coordinates them without relying on human prompts or workflows. The interleaved-thinking architecture lets the model switch freely between thinking and tool-call phases at any intermediate step, while the multimodal CoT capability allows it to manipulate images during reasoning to obtain more accurate retrieval results. We implement an automated data auditing and evaluation pipeline, train on a manually curated high-quality dataset, and construct the MindWatcher-Evaluate Bench (MWE-Bench) for performance evaluation. MindWatcher is equipped with a comprehensive suite of auxiliary reasoning tools covering a broad range of multimodal problems, together with a large, high-quality local image retrieval database spanning eight categories (e.g., vehicles, animals, plants), which gives the model robust object recognition despite its small size. We further design a more efficient training infrastructure for MindWatcher, improving training speed and hardware utilization. Experiments show that MindWatcher matches or exceeds larger or more recent models thanks to its superior tool-calling ability, and they reveal important insights for agent training, such as the existence of a "genetic inheritance phenomenon" in agentic reinforcement learning (agentic RL).
One-sentence Summary
The authors from Li Auto Inc. introduce MindWatcher, a multimodal Tool-Integrated Reasoning agent that leverages interleaved thinking and multimodal chain-of-thought reasoning to autonomously coordinate diverse tools—including image region manipulation, local visual search, and code execution—without relying on human workflows. By adopting continuous reinforcement learning with a novel step-wise normalization strategy and a large-scale, expert-curated local image retrieval database, MindWatcher achieves state-of-the-art performance on MWE-Bench, outperforming larger models in complex, real-world tasks while enabling efficient distillation into smaller variants.
Key Contributions
- MindWatcher introduces an interleaved thinking and multimodal chain-of-thought framework that enables autonomous, dynamic decision-making between reasoning and tool invocation, supporting fine-grained visual manipulation and cross-modal reasoning for real-world multimodal tasks beyond text-only capabilities.
- The agent is trained via continuous reinforcement learning with automated data pipelines for image-text pair generation, avoiding the pitfalls of supervised fine-tuning such as rigid imitation and excessive tool calls, while achieving superior performance on complex, multi-step tasks.
- MindWatcher is evaluated on MWE-Bench, a newly constructed benchmark, and achieves state-of-the-art results with its 2B, 3B, and 4B distilled variants, outperforming larger or more recent models through efficient tool coordination and a local image retrieval database covering eight domains.
Introduction
The authors address the growing need for intelligent agents capable of autonomous, multi-step reasoning in real-world environments, where traditional workflow-based systems and even modern multi-agent architectures struggle with adaptability, latency, and cross-modal integration. Prior TIR agents are largely limited to text-based tasks, with minimal support for fine-grained visual reasoning, and suffer from training instability, redundant tool use, and reliance on expensive external APIs. To overcome these challenges, the authors introduce MindWatcher, a multimodal TIR agent that combines interleaved thinking with multimodal chain-of-thought reasoning, enabling dynamic, context-aware tool invocation and direct image manipulation. The system leverages continuous reinforcement learning instead of supervised fine-tuning to avoid rigid imitation, employs a novel RL algorithm with step-wise normalization for stable training, and integrates a cost-effective local visual corpus and automated data pipelines. MindWatcher achieves state-of-the-art performance on a new multimodal benchmark, MWE-Bench, outperforming larger models while demonstrating that foundation model limitations impose a fundamental "genetic constraint" on agentic reasoning, even after extensive RL optimization.
Dataset
- The dataset is composed of multiple sources, including a private, high-quality multimodal database, authoritative sports portals, and curated open-source benchmarks. The core of the dataset is the MindWatcher Multi-modal Retrieval Database (MWRD), which spans eight categories: Person, Car, Plant, Animal, Logo, Landmark, Fruit & Vegetable, and Dish. It contains 50,000 retrieval entities, with 3–10 high-quality images per entity, totaling over 300,000 images. The images are sourced from the internet and professional museum databases, and undergo expert-level filtering and curation to ensure over 99% precision.
- The MWRD is used to support multimodal reasoning and tool use in training. For the training data, the authors construct a cross-modal QA dataset through a three-stage pipeline: source knowledge anchoring and generation, QA quality validation, and difficulty grading. This includes fine-grained visual-knowledge mapping via object localization and retrieval labeling, followed by knowledge graph augmentation using web search to enrich factual context. Initial QA pairs are generated based on both visual and external knowledge.
- A dedicated pipeline is used for domain-specific ingestion, particularly for sports data. A focused crawler collects articles from authoritative sports portals, retaining only those with non-empty text and at least one relevant image. This creates a repository of event-centric multimedia documents, which are later used to generate QA pairs.
- The MWE-Bench benchmark is constructed with strict separation from training data to prevent data leakage. It includes 373 car, 351 animal, 397 plant, 63 person, 90 landmark, and 142 sports-related instances. For non-sports categories, knowledge entries from the internal database are excluded from training. The construction process involves expanding the knowledge base with web data, applying category-specific constraints, and using closed-source models for "uniqueness deconstruction" to extract core identifying facts. These facts form the basis of single-turn QA pairs, which are then synthesized into multi-step reasoning tasks. All samples undergo automated and manual verification for quality and temporal accuracy.
- For sports data, text and image corpora from non-overlapping time points are merged, and atomic facts are extracted using a powerful LLM. Complex queries are built from these facts, followed by data cleaning and filtering similar to the non-sports pipeline.
- The training data is split across online and offline environments. Online training includes 1,639 VQA samples from private images and 2,949 from public news sources. Open-source data, drawn from WebSailor, ToolStar, and SimpleDeepSearcher, totals 5,000 samples after rigorous filtering to remove outdated or leaked content. Offline RL training uses approximately 20,000 samples. A curriculum learning strategy guides training on the Qwen2.5-VL-32B model for one epoch, with asynchronous tool invocation and reward computation.
- The fully trained MindWatcher-32B model is used to distill multimodal reasoning and tool-use capabilities into smaller models. This involves generating 1–3 TIR trajectories per sample from a base dataset of 181K samples (VLAA SFT, WebWalker silver, and a self-built RAG QA dataset). After filtering, the final distillation dataset contains 124K samples—100K multimodal and 24K text-only—used to train the MindWatcher-2B, -3B, and -4B models.
- The MWE-Bench and filtered open-source benchmarks (MMSearch subset: 221 samples; SimpleVQA subset: 823 samples) are used for evaluation under the ReAct/Agent paradigm. All tests use a temperature of 0.7 and top-p of 0.95, with correctness assessed via LLM-as-Judges and the primary metric being pass@1.
Method
The authors leverage a reinforcement learning framework to train MindWatcher, an agent designed for multimodal reasoning and autonomous tool use. The core of the system is an autoregressive generation loop that models the task as a Markov Decision Process (MDP), where the agent iteratively generates actions and processes environment observations. At each step, the policy model, parameterized by a large language model (LLM), conditions on the full history of interactions to produce a unified action, which can be either a reasoning step (thought) or a tool call. These actions are serialized using dedicated tags, enabling interleaved thinking and action generation within a single decoding sequence. This approach supports a multimodal chain-of-thought mechanism, allowing the agent to reason with images by embedding image-dependent operations into its reasoning process.
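As a concrete illustration of how such a tag-serialized action stream could be parsed back into discrete thought and tool-call segments, here is a minimal sketch; the specific tag names (`<think>`, `<tool_call>`, `<answer>`) and the JSON encoding of tool calls are assumptions, since the text states only that actions are serialized with dedicated tags within a single decoding sequence.

```python
import json
import re

# Sketch of splitting an interleaved decoding sequence into unified actions.
# Tag names and the JSON format of tool calls are illustrative assumptions.
ACTION_PATTERN = re.compile(r"<(think|tool_call|answer)>(.*?)</\1>", re.DOTALL)

def parse_actions(decoded_text: str) -> list[dict]:
    """Return the ordered list of thought / tool-call / answer segments."""
    actions = []
    for tag, body in ACTION_PATTERN.findall(decoded_text):
        if tag == "tool_call":
            # Assumes the tool call body is a JSON object such as
            # {"name": "visual_search", "args": {...}}.
            actions.append({"type": "tool_call", "call": json.loads(body)})
        else:
            actions.append({"type": tag, "text": body.strip()})
    return actions
```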

As shown in the figure below, the agent begins with an initial user prompt and an image, then enters an iterative planning and tool-call process. After each tool call, the resulting observation is appended to the context, informing the next action. The agent continues this cycle until it generates a final action containing the concluding response. The framework integrates a large-scale local retrieval corpus and a comprehensive multimodal toolkit, including tools for region cropping, visual search, external text retrieval, webpage content extraction, and local code execution. This enables the agent to perform complex multimodal question answering by dynamically interacting with the environment and leveraging external knowledge sources.
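A minimal sketch of this plan, tool-call, and observe cycle is given below. The `policy.next_action` interface, the tool names, and the step budget are hypothetical; only the overall loop structure and the toolkit capabilities (region cropping, visual search, text retrieval, webpage extraction, code execution) follow the description above.

```python
def run_agent(policy, tools: dict, question: str, image, max_steps: int = 10) -> str:
    """Iterate thinking and tool calls until the policy emits a final answer."""
    # `tools` might map names such as "crop_region", "visual_search",
    # "text_search", "fetch_webpage", and "run_code" to callables
    # (names are assumptions mirroring the capabilities listed above).
    context = [{"role": "user", "question": question, "image": image}]
    for _ in range(max_steps):
        # The policy decodes one interleaved segment: a thought,
        # a tool call, or the concluding answer.
        action = policy.next_action(context)
        context.append(action)
        if action["type"] == "answer":
            return action["text"]
        if action["type"] == "tool_call":
            call = action["call"]  # e.g. {"name": "visual_search", "args": {...}}
            observation = tools[call["name"]](**call["args"])
            # The observation (text, retrieved entries, or a cropped image)
            # is appended to the context and conditions the next action.
            context.append({"role": "tool", "name": call["name"], "result": observation})
    return "No final answer produced within the step budget."
```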

The training process employs an enhanced version of Group Relative Policy Optimization (GRPO) as the core learning algorithm. To address the challenges of interleaved thinking, where trajectories consist of multiple "Think and Tool-call" cycles with varying lengths, the authors propose a step-wise normalization mechanism. This approach optimizes the objective function on individual action segments rather than the global token stream, ensuring balanced supervision across every reasoning step. The optimization objective incorporates two normalization factors: action-step normalization, which weights each trajectory equally regardless of the number of cycles, and token-length normalization, which averages the loss within each episode. This dual-normalization mechanism prevents longer episodes from dominating the optimization process.
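One plausible way to write this dual-normalized objective (the notation and exact form are a reconstruction from the description above, not the paper's verbatim formula) is:

$$
\mathcal{J}(\theta) \,=\, \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{K_i}\sum_{k=1}^{K_i}\frac{1}{|a_{i,k}|}\sum_{t\in a_{i,k}}\min\!\Big(r_{i,t}(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big)\right]
$$

where $G$ is the group size, $K_i$ the number of Think-and-Tool-call cycles in trajectory $i$, $a_{i,k}$ its $k$-th action segment, $r_{i,t}(\theta)$ the token-level importance ratio, and $\hat{A}_i$ the group-relative advantage. Under this reading, the $1/K_i$ factor is the action-step normalization (each trajectory counts equally regardless of how many cycles it contains), and the $1/|a_{i,k}|$ factor is the token-length normalization that averages the loss over tokens so that longer episodes do not dominate the update.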
The reward function is designed to guide the model toward both syntactic correctness and factual accuracy through a hybrid reward system. It consists of three components: Outcome Accuracy Reward, Format Reward, and Hallucination Tool-call Penalty. The Outcome Accuracy Reward is a sparse reward computed at termination, evaluated by a model-based judge to assess factual consistency. The Format Reward enforces strict schema adherence using a regex-based parser, penalizing format errors and non-whitespace residues outside valid tags. The Hallucination Tool-call Penalty discourages the model from generating consecutive tool calls without waiting for environmental feedback, penalizing discrepancies between the number of model calls and actual responses. The final reward is a weighted sum of these components, with specific coefficients set to balance their influence.
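To illustrate how these three components could combine, here is a hedged sketch; the component weights, penalty form, and function signatures are placeholders, since the text specifies only that the final reward is a weighted sum of the three terms.

```python
# Hedged sketch of the hybrid reward described above. Weights and penalty
# magnitudes are placeholders, not the paper's coefficients.

def format_reward(is_schema_valid: bool, has_residue_outside_tags: bool) -> float:
    """Reward strict tag-schema adherence; penalize stray text outside valid tags."""
    return 1.0 if (is_schema_valid and not has_residue_outside_tags) else 0.0

def hallucination_penalty(num_tool_calls_emitted: int, num_tool_responses: int) -> float:
    """Penalize tool calls issued without waiting for environment feedback."""
    return float(abs(num_tool_calls_emitted - num_tool_responses))

def total_reward(outcome_correct: bool, fmt: float, halluc: float,
                 w_acc: float = 1.0, w_fmt: float = 0.1, w_halluc: float = 0.1) -> float:
    """Weighted sum of outcome accuracy, format reward, and hallucination penalty."""
    r_acc = 1.0 if outcome_correct else 0.0   # sparse reward, judged at termination
    return w_acc * r_acc + w_fmt * fmt - w_halluc * halluc
```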
Experiment
- Main experiments validate the effectiveness of tool-augmented reasoning in multimodal agents, particularly through the MindWatcher series, which achieves SOTA performance on MWE-Bench with a global score of 75.35, outperforming models like Gemini 2.5 Flash and GPT-5 mini.
- On MWE-Bench, MindWatcher-32B surpasses Qwen3-VL 32B and demonstrates strong performance across four domains (Vehicle, Animal, Plant, Person), while distilled variants (MindWatcher-2B, 3B, 4B) match or exceed the base model performance, proving that agentic capabilities can compensate for limited parameter size.
- Tool integration dramatically boosts performance: Qwen3-VL 32B’s score nearly triples with tool access, and GPT-5 mini’s score in Sports jumps from 13.38 to 80.28, highlighting the critical role of external tools in overcoming internal knowledge limitations.
- On filtered benchmarks (MMSearch, SimpleVQA, WebWalkerQA), MindWatcher-32B achieves SOTA results across all, including outperforming Qwen3-VL 32B on SimpleVQA and remaining competitive on the pure-text WebWalkerQA, indicating successful enhancement of multimodal agentic reasoning without sacrificing text-based capabilities.
- The choice of search engine significantly impacts performance, with Quark outperforming Sogou by 42.86% on Chinese football queries, demonstrating that agent performance is highly sensitive to tool environment and not solely dependent on model architecture.
- Case studies reveal that internal world knowledge critically influences tool-use success: Qwen3-VL 32B leverages prior knowledge to formulate effective queries, while MindWatcher (based on Qwen2.5-VL) fails without such knowledge, showing that benchmark results can be confounded by long-tail knowledge disparities.
- Genetic inheritance analysis shows that both agentic RL and SFT preserve the base model’s cognitive ceiling: while RL enables stable, structured tool-use behavior with synchronized performance decay, SFT leads to volatile tool-call patterns and less consistent optimization, indicating RL’s superior alignment for complex reasoning.
The authors use a win-tie-loss analysis to compare MindWatcher-32B against four models across public benchmarks and the MWE-Bench, showing that MindWatcher-32B consistently outperforms its 32B counterparts and achieves superior results compared to SOTA closed-source models like Gemini 2.5 Flash and GPT-5 mini. The results indicate that MindWatcher-32B wins in 15.7% of comparisons against Qwen3-VL 32B Thinking, 30.2% against WebWatcher-32B, 18.9% against Gemini 2.5-Flash, and 14.4% against GPT-5 mini, with the majority of outcomes being ties.

The authors use the MWE-Bench to evaluate the performance of MindWatcher distilled models against their foundation models across multiple domains. Results show that MindWatcher-2B, -3B, and -4B achieve significantly higher average scores than their respective base models, with MindWatcher-4B attaining the highest overall score of 69.63, demonstrating that the distilled models outperform their larger counterparts in most categories.

The authors use the provided bar charts to compare the performance of various models across six domains on the MWE-Bench, with MindWatcher-32B achieving the highest accuracy in four of them: Car, Animal, Plant, and Person. Results show that MindWatcher-32B outperforms prominent closed-source models such as Gemini 2.5 Flash and GPT-5 mini across all domains, demonstrating its superior agentic capabilities.

The authors use the MWE-Bench to evaluate model performance under direct inference and ReAct/Agent modes, showing that internal knowledge cutoffs do not consistently predict performance, as models like Gemini 2.5 Pro outperform more recent models in direct inference. Results show that tool-augmented inference significantly boosts performance, with MindWatcher-32B achieving the highest overall score of 75.35, outperforming both open-source and closed-source models across multiple domains.

Results show that the choice of search engine significantly impacts agent performance on sports-related queries, with Quark Search achieving the highest average score of 34.81 across all domains and languages. The performance varies substantially depending on the query domain and language, indicating that tool capacity is highly dependent on the specific context rather than being universally superior.
