Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Abstract
Multimodal large language models (MLLMs) excel at broad visual understanding, but they still struggle with fine-grained perception, where the decisive evidence is subtle and easily drowned out by the holistic context. Recent “Thinking-with-Images” approaches mitigate this by iteratively zooming in and out on regions of interest at inference time, but the repeated tool calls and visual re-encoding incur high latency. To address this, we propose Region-to-Image Distillation, which internalizes agentic zooming into a single forward pass of the MLLM by treating the zoom operation as a training-time building block rather than an inference-time tool. Specifically, strong teacher models zoom into micro-cropped regions to generate high-quality VQA data, and this region-grounded supervision is then distilled back onto the original full images. A smaller student model trained on such data improves its “at-a-glance” fine-grained perception without any tool use. To evaluate this ability rigorously, we introduce ZoomBench, a hybrid-annotated benchmark of 845 VQA samples spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the “zooming gap” between global and regional views. Experiments show that our models achieve state-of-the-art performance on multiple fine-grained perception benchmarks and also improve general multimodal cognition tasks such as visual reasoning and GUI agents. We further discuss when “Thinking-with-Images” is truly necessary and when its benefits can be distilled into a single forward pass. Our code is available at https://github.com/inclusionAI/Zooming-without-Zooming.
One-sentence Summary
Researchers from Shanghai Jiao Tong University, Ant Group, and collaborators propose Region-to-Image Distillation, which lets MLLMs internalize fine-grained perception during training and replaces costly iterative zooming with single-pass inference; the approach is validated on ZoomBench and boosts performance across multimodal tasks without runtime tools.
Key Contributions
- We introduce Region-to-Image Distillation, a training method that uses micro-cropped regions to generate high-quality, region-grounded VQA supervision for full-image models, enabling single-pass fine-grained perception without inference-time tool use or re-encoding.
- We present ZoomBench, a hybrid-annotated benchmark of 845 VQA samples spanning six fine-grained perceptual dimensions, paired with a dual-view protocol to quantify the global-regional “zooming gap” and evaluate model acuity rigorously.
- Experiments show our distilled models (ZwZ-4B/7B/8B) achieve state-of-the-art performance on fine-grained perception tasks, outperforming larger MLLMs and agentic “Thinking-with-Images” methods while improving general multimodal cognition across visual reasoning and GUI agent benchmarks.
Introduction
The authors leverage Region-to-Image Distillation to address the persistent challenge in multimodal large language models (MLLMs) of detecting fine-grained visual details—like tiny text or subtle attributes—where global context drowns out critical micro-evidence. Prior methods, such as “Thinking-with-Images,” rely on iterative, tool-based zooming during inference, which improves accuracy but introduces high latency due to repeated visual re-encoding and tool calls. The authors’ key contribution is reframing zooming as a training-time operation: they generate high-quality VQA data from micro-cropped regions using strong teacher models, then distill that region-grounded supervision back into smaller student models trained on full images, enabling single-pass fine-grained perception at inference without tool use. They also introduce ZoomBench, a hybrid-annotated benchmark with a dual-view protocol to quantify the global-regional “zooming gap,” and show their models outperform larger MLLMs and agentic baselines while maintaining low latency.
Dataset

- The authors construct ZoomBench using a Region-to-Image Distillation method: a powerful MLLM (Gemini-2.5-Pro) generates questions and candidate answers from cropped micro-regions, then maps them to full images to form spatially ungrounded but evidence-backed QA pairs.
- ZoomBench contains 845 high-quality, diverse, and challenging QA pairs drawn from high-resolution images sourced across multiple public datasets (detailed in appendices), with no overlap between training and benchmark splits to prevent data leakage.
- Each instance includes a full image and a small-ratio cropped region (typically <10% of image area) that serves as visual evidence; this dual-view setup enables evaluation of “zooming” ability and attention-based interpretation (see the sketch after this list).
- Human verification is applied: three PhD-level annotators validate each QA pair for clarity, answerability, and correctness against both full and cropped views; ~1,960 raw samples are filtered down to 845, removing overly easy or ambiguous cases.
- The benchmark covers six fine-grained perception dimensions: Fine-Grained Counting, OCR, Color Attributes, Structural Attributes, Material Attributes, and Object Identification.
- Evaluation includes 224 open-ended questions with canonical answers and 621 multiple-choice questions; scoring follows a hybrid protocol detailed in Appendix 9.3.
- For training data, the authors introduce explicit visual grounding (bounding boxes) to resolve ambiguity observed in cropped views, unlike the benchmark which intentionally omits spatial grounding to test model robustness.
- Core generation rules require image-based, concise, factual answers; encourage diversity in question types (counting, OCR, structure, material, etc.); and reject low-quality images by returning empty lists.
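A minimal sketch of the dual-view evaluation loop implied by this setup, assuming a generic `answer_fn(image_path, question)` call and a simple correctness checker as stand-ins for the benchmark's actual interface and hybrid scoring protocol (all names here are illustrative):

```python
from typing import Callable, Dict, List

def dual_view_eval(
    samples: List[dict],                     # each: {"full_image", "crop", "question", "answer"}
    answer_fn: Callable[[str, str], str],    # answer_fn(image_path, question) -> predicted answer
    is_correct: Callable[[str, str], bool],  # stand-in for the benchmark's hybrid scoring protocol
) -> Dict[str, float]:
    """Score each question on the full image and on the cropped evidence region."""
    global_hits = regional_hits = 0
    for s in samples:
        global_hits += is_correct(answer_fn(s["full_image"], s["question"]), s["answer"])
        regional_hits += is_correct(answer_fn(s["crop"], s["question"]), s["answer"])
    n = len(samples)
    global_acc, regional_acc = global_hits / n, regional_hits / n
    # The "zooming gap" is how much accuracy the model loses without the zoomed-in view.
    return {"global_acc": global_acc, "regional_acc": regional_acc,
            "zooming_gap": regional_acc - global_acc}
```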
Method
The authors leverage Region-to-Image Distillation (R2I) to synthesize high-veracity, fine-grained VQA training data from unlabeled image corpora, enabling single-pass inference without test-time tool use. The core idea is to distill regional-view expertise from strong teacher models into a student model’s global-view predictions, thereby internalizing the benefits of “zooming in” during training while preserving inference efficiency.
The pipeline begins with object-centric region proposal. Given a raw image $I$, an object recognition and segmentation system generates candidate bounding boxes $\{B_1,\dots,B_n\}$, each covering at least one visible object. To target fine-grained perception, only micro-regions $R_i$ satisfying $\mathrm{Area}(B_i)/\mathrm{Area}(I) < \tau$ (e.g., $\tau = 0.1$) are retained, ensuring the decisive visual evidence is sparse and easily overlooked in the global view. For each such region $R$, a teacher model generates perception-centric questions $Q_R$ that are strictly answerable from $R$ alone, focusing on subtle cues such as tiny text, symbols, or small-instance counts.
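A minimal sketch of this area-ratio filter, assuming axis-aligned boxes in (x1, y1, x2, y2) pixel coordinates (variable names are illustrative, not from the paper):

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

def filter_micro_regions(
    boxes: List[Box],
    image_size: Tuple[int, int],   # (width, height)
    tau: float = 0.1,              # area-ratio threshold tau from the paper
) -> List[Box]:
    """Keep only boxes whose area is a small fraction of the full image."""
    img_w, img_h = image_size
    image_area = img_w * img_h
    micro = []
    for (x1, y1, x2, y2) in boxes:
        box_area = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        if box_area / image_area < tau:
            micro.append((x1, y1, x2, y2))
    return micro
```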
To ensure label veracity without manual annotation, multiple teacher models independently answer each question $Q \in Q_R$ on the cropped region $R$. The authors employ majority voting across teacher responses; only triplets $(R, Q, A)$ with high consensus (e.g., more than 6 of 8 responses in agreement) are retained, substantially reducing hallucinated or invalid samples.
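A minimal sketch of the consensus filter, assuming answers are short strings that can be compared after a crude normalization (the normalization and the more-than-6-of-8 threshold are illustrative readings of the setup described above):

```python
from collections import Counter
from typing import List, Optional

def normalize(answer: str) -> str:
    """Crude normalization so trivially different phrasings can still agree."""
    return " ".join(answer.lower().strip().rstrip(".").split())

def consensus_answer(
    teacher_answers: List[str],
    min_votes: int = 7,  # strictly more than 6 of 8 teachers must agree
) -> Optional[str]:
    """Return the majority answer if consensus is high enough, else None (drop the sample)."""
    votes = Counter(normalize(a) for a in teacher_answers)
    answer, count = votes.most_common(1)[0]
    return answer if count >= min_votes else None
```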
Refer to the framework diagram: the distillation phase maps these region-level QA pairs back to the full image. A grounding transformation $G(I, Q, B)$ overlays the bounding box $B$ onto the original image $I$ to form $I'$, and appends a spatial constraint to $Q$ to form $Q'$. This yields an augmented training triplet $(I', Q', A)$, where $I'$ and $Q'$ jointly anchor the question to the intended micro-region, resolving the referential ambiguity that arises when the question is viewed in the global context. The authors further filter the synthetic dataset using a smaller multimodal model to remove overly easy samples, producing the final distilled dataset $\mathcal{D}_{\mathrm{syn}}$.
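A minimal sketch of such a grounding transformation using Pillow, assuming the box is drawn in a salient color and the spatial constraint is appended as plain text (the exact overlay style and prompt wording are assumptions, not specified here):

```python
from typing import Tuple

from PIL import Image, ImageDraw

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

def ground_question(image: Image.Image, question: str, box: Box,
                    outline: str = "red", width: int = 4) -> Tuple[Image.Image, str]:
    """G(I, Q, B): overlay the evidence box on the image and anchor the question to it."""
    grounded_image = image.copy()
    ImageDraw.Draw(grounded_image).rectangle(box, outline=outline, width=width)
    grounded_question = (
        f"{question} Answer based on the content inside the {outline} bounding box."
    )
    return grounded_image, grounded_question

# Usage (hypothetical path, question, and box):
# img = Image.open("example.jpg")
# img_prime, q_prime = ground_question(img, "What is written on the small sign?",
#                                      (120, 340, 210, 395))
```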

The student model is then trained to maximize the expected task reward over this synthetic data:
$$\max_{\theta}\; \mathbb{E}_{(I',Q',A)\sim \mathcal{D}_{\mathrm{syn}},\ \hat{A}\sim \pi_{\theta}(\cdot \mid I',Q')}\big[\, r(\hat{A}, A)\,\big],$$

where $\pi_{\theta}$ is the student policy and $r(\hat{A}, A)$ is a task-specific reward function. During inference, the bounding box is removed, but the model retains the ability to attend to the critical micro-region due to the structural hint provided during training. This aligns with the privileged-information paradigm: the model learns $P(A \mid I, Q, B)$ during training and generalizes to $P(A \mid I, Q)$ at test time.
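A minimal sketch of one plausible reward $r(\hat{A}, A)$ for this objective, assuming exact-match scoring for multiple-choice answers and normalized string match for open-ended ones (the paper's actual reward design may differ):

```python
def task_reward(predicted: str, reference: str, is_multiple_choice: bool) -> float:
    """r(A_hat, A): 1.0 for a correct answer, 0.0 otherwise (a simple verifiable reward)."""
    norm = lambda s: " ".join(s.lower().strip().rstrip(".").split())
    if is_multiple_choice:
        # Compare only the selected option letter, e.g. "B" vs "B) a red bicycle".
        return 1.0 if predicted.strip().upper()[:1] == reference.strip().upper()[:1] else 0.0
    return 1.0 if norm(predicted) == norm(reference) else 0.0
```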
As shown in the figure below, the overall architecture contrasts with prior “Thinking-with-Images” methods that require iterative tool calls at inference. R2I decouples the tool-use phase (zoom-in synthesis) from inference, enabling direct, single-pass reasoning on the full image. The authors formalize this as a tool-action distillation framework: a tool-call action $f(\cdot)$ (e.g., zoom-in) generates an altered observation $I' = f(I)$, from which a teacher synthesizes $(Q, A)$; an inverse transformation $f^{-1}$ then maps $(I', Q)$ back to the original image $I$, yielding a distilled dataset that trains the student to solve the task directly from the full image.
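A minimal sketch of this abstraction, assuming a generic tool action with a forward transform and an inverse mapping back to the original observation (the interface and names are illustrative, not the authors' implementation):

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class DistilledSample:
    image_path: str   # original full image I
    question: str     # question re-anchored to the full-image view
    answer: str       # consensus answer from the teacher(s)

def distill_tool_action(
    images: List[str],
    apply_tool: Callable[[str], str],                 # f: I -> I' (e.g., crop/zoom)
    synthesize_qa: Callable[[str], Tuple[str, str]],  # teacher on I': returns (Q, A)
    map_back: Callable[[str, str], str],              # f^{-1}: (I, Q) -> Q grounded on I
) -> List[DistilledSample]:
    """Build a dataset that trains the student to answer from the full image directly."""
    dataset = []
    for image in images:
        altered = apply_tool(image)               # I' = f(I)
        question, answer = synthesize_qa(altered)  # (Q, A) from the zoomed-in view
        grounded_q = map_back(image, question)     # re-anchor the question to the full image
        dataset.append(DistilledSample(image, grounded_q, answer))
    return dataset
```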

The authors instantiate this framework using Qwen3-VL-235B for region proposal and question generation, and Qwen3-VL-235B and GLM-4.5V as answer generators. They curate a high-resolution image pool from SA-1B, LAION, MetaCLIP, Visual Genome, CC12M, and STPLS3D, and synthesize 74K training samples after consensus and difficulty filtering. The method is generalizable to other tool actions such as flipping, 3D grounding, or expert model calls, as the core distillation mechanism remains agnostic to the specific tool used during synthesis.
Experiment
- Region-to-Image Distillation enables models to internalize zooming expertise, achieving fine-grained perception in a single forward pass without iterative tool use.
- ZwZ variants consistently outperform baseline Qwen-VL models across general, specific, and OOD benchmarks, including surpassing larger open-source models and rivaling closed-source SOTA in accuracy.
- Training on distilled synthetic data proves more effective than using larger public datasets or proxy-task synthetic data, highlighting the value of fine-grained, high-quality supervision over data volume.
- The method narrows the “zooming gap” between global and regional view performance, particularly improving on structure, material, and counting tasks where attention dilution is common.
- Visual grounding via bounding boxes overlaid on images during training significantly enhances attention localization, leading to better real-world generalization without requiring boxes at test time.
- ZwZ models outperform agentic and tool-use baselines while being substantially faster, demonstrating that inference-time zooming benefits can be internalized into model weights.
- Attention map analysis confirms ZwZ models concentrate more relevant visual attention on key regions, aligning with improved perception and reduced perceptual oversight.
- The approach validates that “information-neutral” image actions (like zooming) can be effectively distilled into models, while “information-gain” actions (like web search) remain essential for external interaction.
The authors use Region-to-Image Distillation to train compact vision-language models that achieve fine-grained perception in a single forward pass, without relying on inference-time zooming tools. Results show these models consistently outperform larger open-source baselines and match or exceed closed-source models on diverse benchmarks, while also demonstrating superior attention localization and generalization to real-world tasks. This approach effectively internalizes the benefits of tool-based zooming into model weights, achieving higher accuracy with significantly lower inference latency.

The authors use Region-to-Image Distillation to train compact vision-language models on synthetic data, enabling them to achieve fine-grained perception from full images in a single forward pass. Results show consistent improvements over baseline models and outperform larger open-source systems across general, specific, and out-of-distribution benchmarks, even surpassing agentic models that rely on iterative zooming. The method also demonstrates superior data efficiency and internalizes tool-use benefits while maintaining significantly lower inference latency.

The authors use Region-to-Image Distillation to train models that internalize fine-grained perception without requiring test-time zooming. Results show that ZwZ variants consistently achieve higher attention coverage on key image regions compared to their Qwen-VL baselines, indicating improved localization of task-relevant visual evidence. This enhanced focus directly contributes to narrowing the performance gap between global and regional views in fine-grained tasks.

The authors use Region-to-Image Distillation with bounding box overlays on images to train models that internalize fine-grained perception without requiring inference-time zooming. Results show that this approach significantly outperforms direct synthesis and alternative grounding strategies, narrowing the performance gap between global and regional views while improving attention focus on task-relevant regions. The method proves more effective than training on larger public datasets or proxy-task synthetic data, achieving strong generalization across diverse benchmarks with minimal data.
