
Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation

Chung, Jiwan ; Kim, Junhyeok ; Kim, Siyeol ; Lee, Jaeyoung ; Kim, Min Soo ; Yu, Youngjae
Published: 6/2/2025
Abstract

We present v1, a lightweight extension to Multimodal Large Language Models (MLLMs) that enables selective visual revisitation during inference. While current MLLMs typically consume visual input only once and reason purely over internal memory, v1 introduces a simple point-and-copy mechanism that allows the model to dynamically retrieve relevant image regions throughout the reasoning process. This mechanism augments existing architectures with minimal modifications, enabling contextual access to visual tokens based on the model's evolving hypotheses. To train this capability, we construct v1g, a dataset of 300K multimodal reasoning traces with interleaved visual grounding annotations. Experiments on three multimodal mathematical reasoning benchmarks -- MathVista, MathVision, and MathVerse -- demonstrate that v1 consistently improves performance over comparable baselines, particularly on tasks requiring fine-grained visual reference and multi-step reasoning. Our results suggest that dynamic visual access is a promising direction for enhancing grounded multimodal reasoning. Code, models, and data will be released to support future research.
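The point-and-copy idea described above can be sketched minimally: during decoding, the model emits a pointer selecting a region of the cached visual tokens, and that region is copied back into the active context so later reasoning steps can attend to it directly. The function and token names below are illustrative assumptions, not the authors' actual API.

```python
# Hypothetical sketch of selective visual revisitation via point-and-copy.
# The model "points" at a slice of cached visual tokens; those tokens are
# copied into the reasoning context for subsequent steps.
from typing import List, Tuple

def point_and_copy(
    context: List[str],
    visual_tokens: List[str],
    pointer: Tuple[int, int],
) -> List[str]:
    """Copy the pointed-to slice of visual tokens into the reasoning context."""
    start, end = pointer
    region = visual_tokens[start:end]  # region selected by the model's pointer
    # Delimiter tokens are illustrative; a real system would use learned markers.
    return context + ["<copy>"] + region + ["</copy>"]

# Toy example: the model points at visual tokens 2..4 of the image encoding.
ctx = ["question:", "what", "is", "x?"]
vis = ["patch0", "patch1", "patch2", "patch3", "patch4"]
new_ctx = point_and_copy(ctx, vis, (2, 4))
```

This is only a schematic of the data flow; the paper's mechanism operates on embedding-level visual tokens inside the transformer, not on string tokens.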