4 months ago

Mingyang Song Haoyu Sun Jiawei Gu Linjie Li Luxin Xu Ranjay Krishna Yu Cheng

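Abstract

When humans face problems beyond their immediate capabilities, they rely on tools. This offers a promising paradigm for improving the visual reasoning ability of Multimodal Large Language Models (MLLMs). Effective reasoning therefore depends on knowing which tools to use, when to invoke them, and how to compose them over multiple steps, even when facing new tools or new tasks. We introduce AdaReasoner, a family of multimodal models that learn tool use as a general reasoning skill rather than as tool-specific or explicitly supervised behavior. AdaReasoner is enabled by (i) a scalable data curation pipeline that exposes models to long-horizon, multi-step tool interactions, (ii) Tool-GRPO, a reinforcement learning algorithm that optimizes tool selection and sequencing based on end-task success, and (iii) an adaptive learning mechanism that dynamically regulates tool usage. Together, these components let the model infer tool utility from task context and intermediate results, coordinate multiple tools, and generalize to unseen tools. Empirically, AdaReasoner shows strong tool adaptivity and generalization: despite never being explicitly trained to do so, it autonomously adopts beneficial tools, suppresses irrelevant ones, and adjusts tool-usage frequency to the demands of the task. These capabilities translate into state-of-the-art performance on challenging benchmarks, improving the 7B base model by +24.9% on average and surpassing strong proprietary systems such as GPT-5 on several tasks, including VSP and Jigsaw.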

One-sentence Summary

Researchers from Fudan, Tongji, NUS, UW, and UESTC propose AdaReasoner, a multimodal model that autonomously learns tool use via scalable data, Tool-GRPO reinforcement learning, and adaptive regulation, enabling generalization to unseen tools and outperforming GPT-5 on VSP and Jigsaw benchmarks.

Key Contributions

- AdaReasoner introduces a scalable data pipeline and Tool-GRPO reinforcement learning algorithm to train multimodal models on long-horizon, multi-step tool interactions, enabling them to autonomously select, sequence, and adapt tool usage based on task context and intermediate outcomes.
- The model incorporates an adaptive learning mechanism that decouples tool-use logic from specific tasks, allowing generalization to unseen tools and novel task distributions without explicit supervision or predefined invocation patterns.
- Evaluated on challenging benchmarks, AdaReasoner improves a 7B base model by +24.9% on average and outperforms proprietary systems like GPT-5 and Claude Sonnet 4 on tasks such as VSP and Jigsaw, demonstrating robust tool-adaptive reasoning in open-source multimodal models.

Introduction

The authors leverage external tools to enhance visual reasoning in multimodal large language models, recognizing that human-like problem solving often requires dynamic tool selection and multi-step coordination. Prior approaches either relied on rigid, pre-defined tool invocation patterns or were limited to single-tool use, failing to adapt to unseen tools or novel tasks. AdaReasoner addresses this by introducing a scalable data pipeline for long-horizon tool interactions, a reinforcement learning algorithm (Tool-GRPO) that optimizes tool sequencing based on task outcomes, and an adaptive learning mechanism that decouples tool logic from specific tasks. This enables the model to autonomously select, suppress, and modulate tools based on context and feedback, achieving state-of-the-art performance—even outperforming proprietary models like GPT-5—while generalizing to unseen tools and tasks.

Dataset

The authors use a curated dataset of high-fidelity, tool-augmented reasoning trajectories across three tasks: VSP, Jigsaw, and GUIQA. Each trajectory follows a structured, human-like problem-solving blueprint and includes reflection, backtracking, and tool failure cases to promote robustness.

VSP (Navigation & Verification):
- Source: Procedurally generated environments using Gymnasium, with training grids sized 4x4, 6x6, 8x8; test grids 5x5, 7x7, 9x9.
- Trajectory: Uses POINT to locate start/end/obstacles, textual reasoning, and DRAW2DPATH for path verification. Includes failure cases with self-correction.
- Processing: Tool calls executed programmatically; CoT generated via Gemini 2.5 Flash.
Jigsaw:
- Source: Images from COCO-2017 training set, split into 3x3 grids; one patch removed as target, one of five others as distractor.
- Trajectory: DETECTBLACKAREA to locate missing region, then iterative INSERTIMAGE attempts with visual feedback.
- Variations: Randomized patch order, tool failure cases, and tool-free solvable instances to encourage adaptive tool use.
GUIQA:
- Source: Filtered from 44k Guichat instances using Qwen2.5-VL-72B; retained the 7,100 instances on which the model failed.
- Processing: Manual inspection reduced to 1,800 valid cases; ground-truth answer coordinates rendered as bounding boxes.
- Trajectory: CROP to isolate relevant region, then OCR for extraction. CoT and tool sequence generated via Gemini 2.5 Flash; final dataset: 1,139 validated instances.

All trajectories are built via a two-stage pipeline: (1) programmatic tool execution to populate inputs/outputs, (2) LLM-generated CoT reasoning. The final dataset supports supervised fine-tuning (TC stage) and is designed to teach not just tool use, but strategic reasoning between steps, including fallback behavior during tool failure.
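The two-stage construction can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Step` record, the `<COT>` placeholder marker, the toy `fake_crop` and `fake_ocr` tools, and the `stub_writer` function standing in for the LLM are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

COT_PLACEHOLDER = "<COT>"  # hypothetical marker for reasoning not yet written


@dataclass
class Step:
    tool: str                           # tool name, e.g. "CROP"
    args: dict                          # tool arguments
    observation: Optional[str] = None   # filled in by stage 1
    thought: str = COT_PLACEHOLDER      # filled in by stage 2


def run_tools(steps: List[Step], tools: Dict[str, Callable[..., str]]) -> None:
    """Stage 1: execute each tool call and record its output or its failure."""
    for step in steps:
        try:
            step.observation = tools[step.tool](**step.args)
        except Exception as exc:
            # Failures are kept on purpose: the dataset includes tool-failure cases.
            step.observation = f"TOOL_ERROR: {exc}"


def fill_reasoning(steps: List[Step], write_cot: Callable[[Step], str]) -> None:
    """Stage 2: replace each placeholder with generated reasoning text."""
    for step in steps:
        if step.thought == COT_PLACEHOLDER:
            step.thought = write_cot(step)


# Toy stand-ins for real tools and for the LLM that writes the reasoning.
def fake_crop(box):
    return f"cropped region {box}"


def fake_ocr(region):
    if not region:
        raise ValueError("empty region")
    return "Settings"


def stub_writer(step: Step) -> str:
    return f"I call {step.tool} and observe: {step.observation}"


trajectory = [Step("CROP", {"box": [10, 20, 110, 60]}), Step("OCR", {"region": ""})]
run_tools(trajectory, {"CROP": fake_crop, "OCR": fake_ocr})
fill_reasoning(trajectory, stub_writer)
```

Running the tools before writing the reasoning means the generated text can refer to real observations, including the failed OCR call in the second step.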

Method

The authors leverage a multi-stage training framework to develop AdaReasoner, a multimodal large language model (MLLM) capable of effective tool-augmented reasoning. The overall architecture is designed around a sequential decision-making process where the model, represented as a policy $\pi_{\theta}$, generates a reasoning trajectory $\tau$ to solve a problem. This trajectory is a sequence of state-action-observation tuples, where the state $s_t$ represents the current problem context, the action $a_t$ is a tool call encapsulated by special tokens, and the observation $o_t$ is the result from the tool's execution. The policy transitions from state $s_t$ to $s_{t+1}$ based on the new information. The framework is built upon a diverse suite of visual tools categorized into perception (e.g., POINT, OCR), manipulation (e.g., DRAWLINE, INSERTIMAGE), and calculation (e.g., AStar), which are integrated into the reasoning process. The training process consists of two primary stages: a cold-start stage (TC) and a reinforcement learning stage (TG).
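The state-action-observation loop described above might look like the sketch below. It assumes the policy is a plain callable that returns a dictionary, that the state is a growing text context, and that tools are looked up by name; `Turn`, `rollout`, `toy_policy`, and the action dictionary layout are illustrative names, not part of the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Turn:
    state: str        # s_t: the problem context visible to the policy
    action: dict      # a_t: a tool call or the final response
    observation: str  # o_t: tool output ("" on the final turn)


def rollout(policy: Callable[[str], dict],
            tools: Dict[str, Callable[..., str]],
            question: str,
            max_turns: int = 8) -> List[Turn]:
    """Generate one trajectory by alternating policy actions and tool results."""
    state = question
    trajectory: List[Turn] = []
    for _ in range(max_turns):
        action = policy(state)
        if action["type"] == "response":
            trajectory.append(Turn(state, action, ""))
            break
        observation = tools[action["name"]](**action["args"])
        trajectory.append(Turn(state, action, observation))
        # s_{t+1} is s_t extended with the new observation.
        state = state + "\n" + f"[{action['name']}] {observation}"
    return trajectory


# Toy policy: call POINT once, then answer.
def toy_policy(state: str) -> dict:
    if "[POINT]" not in state:
        return {"type": "tool_call", "name": "POINT", "args": {"target": "start"}}
    return {"type": "response", "text": "start is at (0, 0)"}


toy_tools = {"POINT": lambda target: f"{target} at (0, 0)"}
traj = rollout(toy_policy, toy_tools, "Where is the start cell?")
```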

The first stage, Tool Cold Start (TC), is a supervised fine-tuning (SFT) phase. It begins with the generation of high-quality training data using the AdaDataCuration module, which leverages a Tool Server to execute tool calls and integrate results into a coherent dialogue. This stage uses a dataset derived from tasks like VSP, Jigsaw, and GUIQA. The model is trained to generate multi-turn trajectories that follow a specific structure: a thinking phase enclosed in <think> tags, followed by either a tool call enclosed in <tool_call> tags or a final response in <response> tags. The training data includes abstract problem-solving blueprints with chain-of-thought (CoT) placeholders, which are filled by a large language model to create detailed reasoning steps. The model learns to correctly format its output and execute the initial steps of the reasoning process.
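A checker for this turn structure could be sketched as below. Two things are assumptions rather than facts from the text: the tool-call body is treated as JSON, and tag matching is case-insensitive as a precaution, since the exact capitalization of the tool-call tag is not certain. `parse_turn` is a hypothetical helper name.

```python
import json
import re
from typing import Any, Optional, Tuple


def _extract(tag: str, text: str) -> Optional[str]:
    """Return the stripped body of the first <tag>...</tag> pair, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else None


def parse_turn(text: str) -> Tuple[str, str, Any]:
    """Return (thought, kind, payload), or raise ValueError if the turn is malformed."""
    thought = _extract("think", text)
    call = _extract("tool_call", text)
    response = _extract("response", text)
    if thought is None:
        raise ValueError("missing <think> block")
    # Exactly one of tool call or response must follow the thinking phase.
    if (call is None) == (response is None):
        raise ValueError("expected exactly one of a tool call or a response")
    if call is not None:
        return thought, "tool_call", json.loads(call)  # JSON body is an assumption
    return thought, "response", response


thought, kind, payload = parse_turn(
    '<think>Locate the start cell first.</think>'
    '<tool_call>{"name": "POINT", "arguments": {"target": "start"}}</tool_call>'
)
```

A check of this kind is also what a binary format gate in the reinforcement learning stage would need: a turn either parses or it does not.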

The second stage, Tool-GRPO (TG), is a reinforcement learning phase that refines the model's ability to plan and execute complex, multi-turn tool sequences. This stage employs a Group Relative Policy Optimization (GRPO) algorithm. The policy $\pi_{\theta}$ samples a group of $N$ candidate trajectories for a given problem. Each trajectory is evaluated by a reward function, and a group-relative advantage is calculated for each trajectory by normalizing its reward against the group's mean and standard deviation. The policy is then updated to favor trajectories with higher relative advantages, using a clipped surrogate objective function that includes a Kullback-Leibler (KL) divergence penalty to ensure stable updates. The reward function is designed to be multi-faceted, with a total reward $R_{\text{total}}$ that is a composite of a format reward, a tool reward, and an accuracy reward. The format reward acts as a binary gate, ensuring that rewards for tool usage and final answer accuracy are only granted if the output adheres to the required structural syntax. The tool reward provides a fine-grained evaluation of the tool-calling process, with a hierarchical scoring system from 0 to 4 based on the correctness of the invocation structure, tool name, parameter names, and parameter content. The accuracy reward is granted only if the final answer is correct.
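To make the reward gating and the group-relative advantage concrete, here is a small sketch. The text does not give the exact formula, so several choices are assumptions: one point per hierarchy level in `tool_score`, the additive combination and the weights `w_tool` and `w_acc` in `total_reward`, and the small `eps` added to the standard deviation. The clipped surrogate objective and the KL penalty are not shown.

```python
from statistics import mean, pstdev
from typing import List


def tool_score(call, reference: dict) -> int:
    """Hierarchical 0-4 score; each level is checked only if the previous one passed."""
    if not (isinstance(call, dict) and "name" in call
            and isinstance(call.get("arguments"), dict)):
        return 0  # malformed invocation structure
    if call["name"] != reference["name"]:
        return 1  # structure correct, wrong tool name
    if set(call["arguments"]) != set(reference["arguments"]):
        return 2  # tool name correct, wrong parameter names
    if call["arguments"] != reference["arguments"]:
        return 3  # parameter names correct, wrong parameter content
    return 4      # everything matches


def total_reward(format_ok: bool, tool_pts: int, answer_correct: bool,
                 w_tool: float = 1.0, w_acc: float = 1.0) -> float:
    """Format acts as a binary gate; the weights are illustrative assumptions."""
    if not format_ok:
        return 0.0
    return w_tool * (tool_pts / 4) + w_acc * float(answer_correct)


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of N = 4 sampled trajectories for one problem.
rewards = [
    total_reward(True, 4, True),    # well-formed, perfect tool call, correct answer
    total_reward(True, 2, False),   # well-formed, partly correct call, wrong answer
    total_reward(False, 4, True),   # malformed output: gated to zero
    total_reward(True, 4, False),   # well-formed, perfect call, wrong answer
]
advantages = group_advantages(rewards)
```

Because advantages are computed relative to the group, a trajectory is rewarded for being better than its siblings on the same problem, not for its absolute score.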

To enhance the model's generalization capabilities, the authors introduce an Adaptive Learning strategy that is applied during both the TC and TG stages. This strategy randomizes tool definitions at two levels. At the token level, functional identifiers such as tool names and argument names are replaced with random alphanumeric strings (e.g., "GetWeather" becomes "Func_X7a2"), forcing the model to rely on the semantic descriptions rather than the identifiers themselves. At the semantic level, the descriptions of tools and arguments are paraphrased using a large language model to create diverse phrasings while preserving the original functional meaning. This prevents the model from overfitting to specific identifiers or phrasings and encourages robustness to variations in tool documentation. The effectiveness of this approach is demonstrated in evaluations where the model trained with randomized definitions shows significant improvements in generalizing to new tasks and new tools, outperforming models trained with standard definitions. The overall framework, AdaReasoner, is an end-to-end system that orchestrates the entire life-cycle of the model, from data curation to evaluation, with a central Tool Server managing all available tools.
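Token-level randomization can be illustrated with the sketch below. The tool-definition layout (a dictionary with `name`, `description`, and `parameters`), the `Func_` and `Arg_` prefixes, and the helper names are assumptions made for the example. Semantic-level paraphrasing would need an LLM and is not shown.

```python
import random
import string
from typing import Dict, Set, Tuple


def random_identifier(rng: random.Random, prefix: str, used: Set[str],
                      length: int = 4) -> str:
    """Draw a random alphanumeric identifier that has not been used yet."""
    alphabet = string.ascii_letters + string.digits
    while True:
        candidate = prefix + "".join(rng.choice(alphabet) for _ in range(length))
        if candidate not in used:
            used.add(candidate)
            return candidate


def randomize_tool(tool: dict, rng: random.Random) -> Tuple[dict, Dict[str, str]]:
    """Rename the tool and its arguments; descriptions are left untouched."""
    used: Set[str] = set()
    mapping = {tool["name"]: random_identifier(rng, "Func_", used)}
    new_tool = {
        "name": mapping[tool["name"]],
        "description": tool["description"],
        "parameters": {},
    }
    for arg_name, arg_desc in tool["parameters"].items():
        mapping[arg_name] = random_identifier(rng, "Arg_", used)
        new_tool["parameters"][mapping[arg_name]] = arg_desc
    return new_tool, mapping


def rename_call(call: dict, mapping: Dict[str, str]) -> dict:
    """Apply the same renaming to a recorded tool call so the trajectory stays consistent."""
    return {
        "name": mapping[call["name"]],
        "arguments": {mapping[k]: v for k, v in call["arguments"].items()},
    }


rng = random.Random(0)
weather_tool = {
    "name": "GetWeather",
    "description": "Return the current weather for a city.",
    "parameters": {"city": "Name of the city to look up."},
}
randomized_tool, mapping = randomize_tool(weather_tool, rng)
randomized_call = rename_call({"name": "GetWeather", "arguments": {"city": "Seoul"}}, mapping)
```

After renaming, the only way to tell what the tool does is to read its description, which is the behavior the strategy is meant to encourage.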

Experiment

- Evaluated on VSP, Jigsaw, GUIQA, and Visual Search tasks using Qwen2.5-VL-3B/7B; tool augmentation (TC + TG) boosted the 7B model by +38.66% on average, lifting VSP from ~31.64% to 97.64% and surpassing Direct SFT (46.64%) and Direct GRPO (30.18%).
- Achieved state-of-the-art on VSP (97.64%) and Jigsaw (96.60%), outperforming GPT-5 (80.10%) and closing the gap between the 3B and 7B models: both reached ~95%+ accuracy, showing that tools overcome scale limitations.
- AdaReasoner learns adaptive tool use: it adopts A* for navigation (96.33% score), suppresses irrelevant tools (verification stays at 99.20%), and modulates call frequency (e.g., Point tool: 3.2 calls/sample for navigation, ~1.0 for verification).
- Generalizes to unseen tools and tasks: on Jigsaw, achieved 88.60% accuracy with 3.54 CPS and 98.50% success; on VStar (no tool supervision), invoked tools 1.47 times/sample and scored 70.68%, outperforming baselines under zero-shot tool adaptation.
- Outperformed proprietary models (GPT-5, Claude, Gemini) and large open-source models (Qwen-32B/72B) on structured tasks; on WebMMU's agent action subset, achieved 72.97% with pure GRPO, showing that SFT can hinder open-ended domains.
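Tool-usage statistics of the kind quoted above could be computed from call logs along these lines. This sketch assumes that CPS stands for calls per sample and that "success" is the fraction of tool calls that executed without error; neither definition is spelled out in the text, and the log layout is hypothetical.

```python
from typing import Dict, List


def tool_usage_stats(samples: List[List[Dict]]) -> Dict[str, float]:
    """samples[i] holds the tool calls of sample i; each call has 'name' and 'ok'."""
    total_calls = sum(len(calls) for calls in samples)
    ok_calls = sum(1 for calls in samples for call in calls if call["ok"])
    return {
        "calls_per_sample": total_calls / len(samples) if samples else 0.0,
        "call_success_rate": ok_calls / total_calls if total_calls else 0.0,
    }


def calls_per_sample_for(samples: List[List[Dict]], tool: str) -> float:
    """Average number of calls to one specific tool per sample."""
    if not samples:
        return 0.0
    return sum(1 for calls in samples for c in calls if c["name"] == tool) / len(samples)


logs = [
    [{"name": "Point", "ok": True}] * 3,
    [{"name": "Point", "ok": True}],
    [{"name": "Point", "ok": True}, {"name": "AStar", "ok": False}],
    [],  # a sample solved without any tool call
]
stats = tool_usage_stats(logs)
point_rate = calls_per_sample_for(logs, "Point")
```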

The authors use a tool-augmented approach to enhance the performance of multimodal models on visual reasoning tasks, demonstrating that their method achieves near-perfect accuracy (97.6%) on the VSP benchmark, significantly surpassing the baseline models and even outperforming larger proprietary models like GPT-5. Results show that tool augmentation effectively overcomes scale-based limitations, enabling smaller models to reach performance levels comparable to much larger ones by shifting the bottleneck from internal reasoning to effective tool planning.

The authors use the Qwen2.5-VL-7B-Instruct model with a frozen vision tower and full finetuning, training on 332,649 samples with a maximum sequence length of 35,536 and 64 preprocessing workers. The model is trained for 3 epochs using a cosine learning rate scheduler with a warmup ratio of 0.1, a mixed precision of bfloat16, and a batch size of 1 per device, with logging and checkpoint saving every 10 and 100 steps respectively, and evaluated on a 90%/10% train/validation split with a validation batch size of 1.
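For readers unfamiliar with the schedule, a cosine learning rate with a warmup ratio of 0.1 can be sketched as follows. The linear shape of the warmup, the decay to zero, the peak learning rate of 1e-5, and the step count are assumptions for illustration; the text states only the scheduler type and the warmup ratio.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float, warmup_ratio: float = 0.1) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero over the remaining steps."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run of 1,000 optimizer steps with an assumed peak of 1e-5.
schedule = [lr_at(s, 1000, 1e-5) for s in (0, 50, 100, 550, 1000)]
```

With these settings the first 100 steps ramp up, step 100 reaches the peak, and the rate then falls along a half cosine until the final step.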

The authors use a combination of tool cold-start (TC) and tool-GRPO (TG) training to enhance model performance on visual reasoning tasks. Results show that the Rnd TC + Rnd TG method achieves the highest scores across all benchmarks, with significant improvements over the base model, particularly in VSP and WebMMU, where it outperforms other configurations by large margins.

The authors use a comprehensive set of hyperparameters to train their model, with key settings including a maximum prompt length of 8192 tokens, a train batch size of 32, and a PPO micro-batch size of 16384 per GPU. The training employs a vLLM engine, FSDP for model parallelism, and a GRPO advantage estimator, with specific configurations for policy, rollout, and critic components to ensure stable and efficient reinforcement learning.

The authors use a controlled ablation study to evaluate the impact of reflection and tool availability on adaptive tool usage in the VSP task. Results show that enabling reflection during training significantly improves performance on both navigation and verification tasks, with the best results achieved when reflection is combined with A* availability at inference, leading to a 99.80% verification accuracy and a 96.33% navigation accuracy. The A* tool's effectiveness is highly dependent on its context, as its inclusion without reflection leads to unstable behavior and performance degradation, highlighting the importance of stable training signals for reliable tool adaptation.

AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning