Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

Abstract

Multimodal large language models (MLLMs) have achieved remarkable success across a wide range of visual tasks. However, because they are limited by the capacity of their internal world knowledge, prior work has augmented MLLMs with a "reason-then-tool-call" approach that integrates visual and textual search engines, yielding significant gains on fact-intensive tasks. Yet these approaches typically cast multimodal search in a naive setting, assuming that a single full-image or entity-level query, together with a few text queries, is enough to retrieve the key evidence needed to answer the question, an unrealistic assumption in real-world scenarios dominated by heavy visual noise. Moreover, they are often limited in reasoning depth and search breadth, making it hard to solve complex questions that require aggregating evidence from diverse visual and textual sources. Motivated by these observations, we propose Vision-DeepResearch, a new multimodal deep-research paradigm that performs iterative, multi-entity, and multi-scale searches over visual and textual content, ensuring robust effectiveness against real-world search engines even under heavy noise. Vision-DeepResearch supports dozens of reasoning steps and hundreds of engine interactions, and internalizes deep-research capabilities into the MLLM through cold-start supervision and reinforcement learning (RL). This process yields a powerful MLLM capable of end-to-end multimodal deep research. The model significantly outperforms existing multimodal deep-research MLLMs as well as workflows built on high-performing closed-source foundation models such as GPT-5, Gemini-2.5-pro, and Claude-4-Sonnet. The source code will be released at https://github.com/Osilly/Vision-DeepResearch.

One-sentence Summary

Researchers from CUHK MMLab, USTC, and collaborators propose Vision-DeepResearch, a new multimodal paradigm enabling multi-turn, multi-entity, multi-scale search via deep-reasoning MLLMs trained with SFT and RL, outperforming GPT-5 and Gemini-2.5-pro on noisy real-world image retrieval tasks with fewer parameters.

Key Contributions

  • Vision-DeepResearch introduces a new multimodal deep-research paradigm that performs multi-turn, multi-entity, and multi-scale visual and textual searches to overcome the low hit-rate problem in noisy real-world search engines, where prior methods rely on simplistic full-image or single-entity queries.
  • The framework internalizes deep-research capabilities into MLLMs via cold-start supervised fine-tuning and reinforcement learning, enabling dozens of reasoning steps and hundreds of engine interactions—significantly expanding reasoning depth and search breadth beyond existing approaches.
  • Evaluated on six benchmarks, Vision-DeepResearch achieves state-of-the-art performance with smaller models (8B and 30B-A3B scales), outperforming both open-source multimodal deep-research MLLMs and workflows built on closed-source models like GPT-5, Gemini-2.5-pro, and Claude-4-Sonnet.

Introduction

The authors leverage multimodal large language models (MLLMs) to tackle complex, fact-intensive visual question answering by enabling deep-research capabilities that go beyond single-query, single-scale retrieval. Prior methods treat visual search as a one-off operation using full-image or entity-level queries, ignoring real-world noise and search engine variability—leading to low hit rates and shallow reasoning. They also limit training to short trajectories, preventing models from performing iterative, multi-step evidence gathering. The authors’ main contribution is Vision-DeepResearch, a new paradigm that synthesizes long-horizon, multi-turn trajectories involving multi-entity, multi-scale visual and textual search. Through cold-start supervision and RL training, they equip MLLMs to perform dozens of reasoning steps and hundreds of engine interactions, achieving state-of-the-art results on six benchmarks—even outperforming agent workflows built on closed-source models like GPT-5 and Gemini-2.5-Pro.

Dataset

  • The authors use a curated collection of real-world, high-quality images from multiple open-source datasets, filtering out those smaller than 224×224 pixels. They apply an MLLM to select visually complex, non-trivial images and discard any that can be answered without external evidence or that return exact matches via image search.

  • From the retained images, they generate “Fuzzy Multi-hop VQA” instances: first prompting an MLLM to propose entity-level bounding boxes, then cropping regions at multiple scales and verifying entity consistency via image search. Entity-level questions (e.g., “What is the name of the cat?”) are generated, then deliberately obfuscated via two techniques: (1) answer chaining to deepen reasoning, and (2) entity replacement via random walks over webpages to simulate multi-hop knowledge. These are interleaved to avoid templated or shortcut-prone patterns.

  • The final synthesis pipeline emulates human question design: extracting keywords, retrieving external evidence, generating multiple candidate questions, and selecting the best via a judge MLLM (see the sketch after this list). This produces complex, realistic VQA problems paired with answers, used for both trajectory synthesis and RL training.

  • For supervised fine-tuning (SFT), the authors construct 30K high-quality multimodal deep-research trajectories: 16K from verified fact-centric VQA problems (augmented with trajectories), 8K text-only QA trajectories, and 6K fuzzy VQA trajectories. All are trained using autoregressive CE loss to teach multi-turn, multi-scale, cross-modal reasoning and planning.

  • For RL training, they use 15K verified VQA instances, sampling trajectories via interaction with a live search environment. Reward is computed via an LLM-as-Judge evaluating answer correctness and adherence to ReAct formatting, optimized using the rllm framework.

  • The training data is processed using Ms-Swift for SFT and rllm for RL, applied to Qwen3-VL-30B-A3B and Qwen3-VL-8B models. The pipeline emphasizes visual cropping, external search, and long-horizon reasoning behaviors, with evaluation on six benchmarks including VDR-Bench, FVQA, and BC-VL.
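
The paper describes the synthesis loop only at a high level; the following is a minimal sketch of how it could be wired together. The `mllm` client, the `search_engine`, and every method called on them (`extract_keywords`, `retrieve`, `generate_candidates`, `obfuscate`, `judge_best`) are hypothetical placeholders, not the authors' actual implementation.

```python
# Hypothetical sketch of one question-synthesis step, assuming placeholder
# MLLM and search-engine clients; not the authors' actual code.
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class VQAInstance:
    image_path: str
    question: str
    answer: str


def synthesize_vqa(image_path: str, entity: str, mllm, search_engine) -> VQAInstance | None:
    # 1. Extract keywords describing the cropped entity.
    keywords = mllm.extract_keywords(image_path, entity)

    # 2. Retrieve external evidence so questions are grounded in web facts.
    evidence = search_engine.retrieve(keywords)

    # 3. Propose several candidate questions, each paired with a verified answer.
    candidates = mllm.generate_candidates(image_path, entity, evidence, n=5)

    # 4. Obfuscate: chain answers and replace entities via webpage random walks,
    #    interleaving both techniques to avoid templated, shortcut-prone patterns.
    candidates = [mllm.obfuscate(c, evidence) for c in candidates]

    # 5. A judge MLLM keeps only the best candidate; discard the image if none qualifies.
    best = mllm.judge_best(candidates, criteria=("non-trivial", "answerable", "multi-hop"))
    if best is None:
        return None
    return VQAInstance(image_path, best.question, best.answer)
```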

Method

The authors leverage a highly automated data pipeline to construct long-horizon multimodal deep-research trajectories, enabling their Vision-DeepResearch agent to perform complex vision-language reasoning in noisy web environments. The pipeline integrates visual search with text-based reasoning, bridged through image descriptions, and is structured into two primary phases: visual evidence gathering and text-based deep-research extension.

As shown in the figure below, the process begins with an input image and question. The model first generates reasoning steps and localizes relevant regions via multi-entity and multi-scale cropping, producing a set of bounding-box crops $S_b = \{I_b^1, \ldots, I_b^n\}$. Each crop triggers a visual action $A^t = \text{Tool-Call}(S_b^t)$, submitted to a visual search tool pipeline. The pipeline returns observations $\mathcal{O}^{t_v}$, which are accumulated into visual evidence $\mathcal{V}^{t_v} = \{\mathcal{O}^1, \ldots, \mathcal{O}^{t_v}\}$. The pipeline chains three tools in sequence: visual search (to retrieve URLs), website visit (to fetch markdown content), and website summary (to extract relevant text while filtering noise).
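
A minimal sketch of one such visual-evidence step is shown below, assuming simple `visual_search`, `visit_website`, and `summarize_page` tool wrappers (hypothetical names standing in for the actual tool pipeline):

```python
# Hypothetical sketch of a single visual-evidence step: crop the image at the
# proposed boxes, then run the three chained tools for each crop.
from PIL import Image


def visual_step(image: Image.Image, boxes: list[tuple[int, int, int, int]],
                visual_search, visit_website, summarize_page) -> list[dict]:
    observations = []
    for box in boxes:                           # multi-entity, multi-scale crops S_b
        crop = image.crop(box)                  # one crop I_b^i
        urls = visual_search(crop)              # tool 1: retrieve candidate URLs
        for url in urls[:3]:                    # keep only the top hits (assumed cutoff)
            markdown = visit_website(url)       # tool 2: fetch page content as markdown
            summary = summarize_page(markdown)  # tool 3: filter noise, keep relevant text
            observations.append({"box": box, "url": url, "summary": summary})
    return observations                         # accumulated into V^{t_v}
```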

To control search depth, an external judge model evaluates whether the accumulated evidence $\mathcal{V}^{t_v}$ is sufficient to support downstream reasoning, outputting a binary hit signal $h^{t_v} = \mathrm{Judge}(I, q, \mathcal{V}^{t_v}, a_{\mathrm{true}}) \in \{0, 1\}$. If $h^{t_v} = 0$, the pipeline continues; if $h^{t_v} = 1$, the visual phase terminates at step $T_v$. The resulting visual trajectory is denoted $\mathcal{C}_{\mathrm{vision}} = \{I, q, p_v, R^1, A^1, \mathcal{O}^1, \ldots, p_v, R^{T_v}, A^{T_v}, \mathcal{O}^{T_v}\}$.
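
Under those definitions, the depth-control loop reduces to repeatedly acting and asking the judge. A hedged sketch, with `propose_boxes`, `run_visual_step`, and `judge` as assumed callables:

```python
# Hypothetical sketch of the judge-gated visual phase: keep searching until the
# judge decides the accumulated evidence supports the ground-truth answer.
def visual_phase(image, question, answer_true,
                 propose_boxes, run_visual_step, judge, max_steps: int = 20):
    evidence = []                                            # V^{t_v}
    for t in range(1, max_steps + 1):
        boxes = propose_boxes(image, question, evidence)     # reasoning R^t -> action A^t
        evidence.extend(run_visual_step(image, boxes))       # observations O^t
        hit = judge(image, question, evidence, answer_true)  # h^{t_v} in {0, 1}
        if hit == 1:
            return evidence, t                               # visual phase ends at T_v = t
    return evidence, max_steps                               # fallback if evidence never suffices
```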

The authors then bridge this visual trajectory to text by replacing the original image $I$ with a detailed textual description $D$, while preserving the reasoning, actions, and observations. This bridged context is fed into a text-based deep-research foundation LLM, which extends the trajectory using tools such as web search, website visit & summary, and Python code execution. The textual trajectory is denoted $\mathcal{C}_{\mathrm{text}} = \{D, q, R^1, A^1, \mathcal{O}^1, \ldots, R^{T_v}, A^{T_v}, \mathcal{O}^{T_v}, p_t, R^{T_v+1}, A^{T_v+1}, \mathcal{O}^{T_v+1}, \ldots, R^{T_v+T_t}, A^{T_v+T_t}, a_{\mathrm{output}}\}$, where $T_t$ is the number of text-based steps and $a_{\mathrm{output}}$ is the final answer.
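
A sketch of this bridging step, assuming a hypothetical `describe_image` helper and a simple chat-style message list as the serialization format:

```python
# Hypothetical sketch of bridging the visual trajectory C_vision to a text-only
# context: the image I is replaced by a detailed description D, while the
# reasoning/action/observation steps are kept verbatim.
def bridge_to_text(image, question: str, vision_steps: list[dict],
                   describe_image, text_prompt: str) -> list[dict]:
    description = describe_image(image)     # D, a detailed textual stand-in for I
    context = [{"role": "user", "content": f"{description}\n\nQuestion: {question}"}]
    for step in vision_steps:                # preserve R^t, A^t, O^t from the visual phase
        context.append({"role": "assistant", "content": step["reasoning"] + step["action"]})
        context.append({"role": "tool", "content": step["observation"]})
    context.append({"role": "user", "content": text_prompt})  # p_t: switch to text tools
    return context                           # fed to the text-based deep-research LLM
```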

The full multimodal trajectory $\mathcal{C}_{\mathrm{multimodal}}$ merges both phases and undergoes rejection sampling: an LLM verifies whether $a_{\mathrm{output}}$ matches the ground-truth $a_{\mathrm{true}}$, retaining only consistent trajectories for training. The authors also incorporate text-only trajectories generated directly from the original question.
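
Rejection sampling here amounts to an LLM-verified filter over synthesized trajectories; a minimal sketch, with `verify_answer` standing in for the LLM-based consistency check:

```python
# Hypothetical sketch of rejection sampling: keep a synthesized trajectory only
# if an LLM verifier judges its produced answer consistent with the ground truth.
def filter_trajectories(trajectories: list[dict], verify_answer) -> list[dict]:
    kept = []
    for traj in trajectories:
        if verify_answer(traj["a_output"], traj["a_true"]):  # LLM-based equivalence check
            kept.append(traj)
    return kept
```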

For training, the authors combine supervised fine-tuning (SFT) with reinforcement learning (RL). The RL phase employs a high-throughput asynchronous rollout architecture built on the rLLM framework, enabling concurrent tool calls and achieving over 10× higher throughput than synchronous methods. Training uses Group Relative Policy Optimization (GRPO) with a Leave-One-Out trick, applied to 15K high-quality VQA instances. The model interacts with a real online search environment, sampling long-horizon trajectories capped at 50 turns, 64K context tokens, and 4K response tokens per turn.
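
The Leave-One-Out trick replaces the group-mean baseline with, for each rollout, the mean reward of the other rollouts sampled for the same question. A small numerical sketch is given below; it is not the authors' exact implementation, which also handles token-level masking and clipping:

```python
# Hypothetical sketch of GRPO-style advantages with a Leave-One-Out baseline:
# each rollout in a group is compared against the mean reward of its siblings.
import numpy as np


def leave_one_out_advantages(rewards: np.ndarray) -> np.ndarray:
    """rewards: shape (group_size,), rewards of rollouts sampled for one question (group_size > 1)."""
    g = rewards.shape[0]
    total = rewards.sum()
    baselines = (total - rewards) / (g - 1)  # mean reward of the other g-1 rollouts
    return rewards - baselines               # advantage assigned to every token of the rollout


# Example: 4 rollouts of one VQA instance with binary LLM-as-Judge rewards.
print(leave_one_out_advantages(np.array([1.0, 0.0, 0.0, 1.0])))  # [+2/3, -2/3, -2/3, +2/3]
```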

Reward is determined via an LLM-as-Judge paradigm: a reward of 1.0 is assigned if the final answer is correct, 0.0 otherwise. To ensure training stability, the authors implement several engineering safeguards: trajectory interruption for repetitive text or cascading tool-call failures, masking of anomalous trajectories from gradient updates, and training in BF16 precision to avoid numerical overflow from long contexts.
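
A compact sketch of how the reward and the anomaly mask could be combined, with `judge_correct` standing in for the LLM-as-Judge call:

```python
# Hypothetical sketch of the binary LLM-as-Judge reward plus the masking of
# anomalous trajectories (repetitive text or cascading tool-call failures).
def compute_reward(trajectory: dict, judge_correct) -> tuple[float, bool]:
    # Trajectories interrupted for repetition or repeated tool failures are
    # excluded from gradient updates entirely (mask = False).
    if trajectory.get("interrupted", False):
        return 0.0, False
    correct = judge_correct(trajectory["question"], trajectory["a_output"], trajectory["a_true"])
    return (1.0 if correct else 0.0), True   # 1.0 for a correct final answer, 0.0 otherwise
```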

Refer to the framework diagram for an overview of the end-to-end Vision-DeepResearch paradigm, including the integration of factual VQA synthesis, multi-turn trajectory generation, and the long-horizon ReAct-style reasoning loop.

The final training data includes both multimodal trajectories and text-only deep-research trajectories, enabling the agent to generalize across modalities and perform robust, long-horizon reasoning in complex, real-world web environments.

Experiment

  • The approach outperforms existing open models and rivals strong proprietary systems on multimodal deep-research tasks, particularly when using agentic workflows that combine reasoning with tool use.
  • Ablation studies confirm that multi-scale visual cropping and text search are jointly essential: cropping improves object-level grounding, while text search provides missing factual context, together enabling balanced performance across benchmarks.
  • Data ablation shows that supervised fine-tuning with tool-augmented trajectories significantly improves performance, and reinforcement learning further refines long-horizon decision making, yielding the best overall results.
  • RL training reduces trajectory length while increasing reward, indicating more efficient tool usage, with further gains expected from larger-scale RL optimization.
  • Direct answering without tools performs poorly, while ReAct-style agentic reasoning consistently delivers substantial improvements, validating the necessity of iterative evidence gathering for complex multimodal tasks.

The authors use a multimodal agent framework combining multi-scale visual cropping and text search to significantly improve open-domain reasoning performance over baseline models. Results show that their approach outperforms both proprietary and open-source models under agentic workflows, with gains driven by better long-horizon tool-use behavior and evidence grounding. Ablation studies confirm that both visual localization and textual retrieval are jointly necessary, and reinforcement learning further refines decision-making beyond supervised fine-tuning.

The authors evaluate different retrieval strategies in multimodal reasoning, finding that combining multi-scale visual cropping with text search (CIS+TS) yields the strongest and most balanced performance across benchmarks. Results show that relying solely on direct answers or whole-image search leads to poor outcomes, while integrating localized visual anchors with textual evidence significantly improves accuracy. This indicates that effective multimodal reasoning requires both precise visual grounding and complementary factual retrieval.

The authors use a combination of supervised fine-tuning with tool-augmented trajectories and reinforcement learning to significantly improve multimodal reasoning performance. Results show that adding verified and fuzzy multi-hop trajectories boosts accuracy, while RL further refines long-horizon decision-making, leading to the best overall scores across benchmarks. The final model outperforms the base version by a substantial margin, demonstrating the value of iterative tool use and reward-driven optimization.

