HyperAIHyperAI

Command Palette

Search for a command to run...

시각이 텍스트가 되는 지점: Vision-Language Models에서 OCR Routing Bottleneck의 위치 파악

Jonathan Steinberg Oren Gal

초록

시각-언어 모델(Vision-language models, VLMs)은 이미지에서 텍스트를 읽을 수 있지만, 이러한 광학 문자 인식(OCR) 정보가 언어 처리 스트림의 어느 지점으로 유입되는지는 명확히 밝혀지지 않았습니다. 본 연구에서는 2B–8B 파라미터 규모의 5개 데스(dense) 모델을 대상으로 인과적 개입(causal interventions)을 수행하여, 세 가지 아키텍처 계열(Qwen3-VL, Phi-4, InternVL3.5)에 걸친 OCR 라우팅 메커니즘을 조사합니다.원본 이미지와 텍스트가 인페인팅(in-painted)된 버전 간의 활성화 차이(activation differences)를 계산함으로써, 우리는 시각-언어 통합 전략에 따라 주된 위치가 달라지는 아키텍처별 OCR 심도 대역(depth bands)을 식별했습니다. DeepStack 모델(Qwen)은 장면 텍스트(scene text)에 대해 중간 심도(~50%)에서 피크 민감도를 보이는 반면, 단일 단계 프로젝션(single-stage projection) 모델(Phi-4, InternVL)은 초기 레이어(6–25%)에서 피크를 나타냈습니다. 다만, 최대 효과가 나타나는 정확한 레이어는 데이터셋에 따라 차이가 있었습니다.OCR 신호는 놀라울 정도로 저차원적입니다. 주성분 1(PC1)이 분산의 72.9%를 포착합니다. 결정적으로, 한 데이터셋에서 학습된 주성분 분석(PCA) 방향이 다른 데이터셋에도 전이(transfer)된다는 점은 텍스트 처리 경로가 공유되고 있음을 입증합니다. 놀랍게도, 모듈형 OCR 회로를 갖춘 모델(특히 Qwen3-VL-4B)에서는 OCR을 제거했을 때 오히려 카운팅(counting) 성능이 향상(최대 +6.9pp)되는 현상이 나타났습니다. 이는 해당 모델 규모 범위 내에서 충분히 모듈화된 아키텍처의 경우, OCR이 다른 시각적 처리 과정을 방해할 수 있음을 시사합니다.

One-sentence Summary

By applying causal interventions and principal component analysis to Qwen3-VL, Phi-4, and InternVL3.5, the authors identify architecture-specific OCR routing depth bands, revealing that single-stage projection models process text signals at earlier layers than DeepStack architectures and that modular OCR circuits can occasionally interfere with visual counting performance.

Key Contributions

  • The paper identifies architecture-specific OCR routing mechanisms by using causal interventions and activation patching to locate depth bands where text information is processed in 2B–8B parameter vision-language models.
  • This research demonstrates that OCR signals are concentrated in a low-dimensional subspace, with the primary component capturing 72.9% of the variance and showing transferable properties across different datasets.
  • The study shows that surgically removing this OCR-specific subspace can improve performance in other visual tasks, such as counting, which suggests a capacity competition between OCR and other visual processing functions in modular architectures.

Introduction

Vision-language models (VLMs) rely on optical character recognition (OCR) to interpret text within images, a capability essential for tasks ranging from document understanding to autonomous web browsing. While these models demonstrate strong text-reading performance, the internal routing mechanisms that transition visual text into the language processing stream remain poorly understood. This lack of clarity poses significant safety risks, as adversarial text embedded in images can be used to hijack model behavior through typographic prompt injections. The authors leverage causal interventions and Principal Component Analysis (PCA) to identify architecture-specific depth bands where OCR information is processed. They demonstrate that the OCR signal is remarkably low-dimensional and that its processing pathways are consistent across different datasets. Furthermore, the authors show that surgically removing this OCR subspace can actually improve performance on non-textual tasks like counting, suggesting that OCR circuits can interfere with other visual reasoning processes in certain model architectures.

Dataset

The authors utilize several datasets to evaluate the impact of OCR interventions on visual models:

  • EgoTextVQA (Primary Dataset): The authors use this egocentric dataset, which features images containing natural text such as signs, labels, and screens. To study how text affects model activations, they create a paired dataset consisting of the original image and an inpainted version where the text has been removed.
  • Inpainting Processing: To generate the inpainted versions, the authors extract text region bounding boxes from existing annotations and expand them by 10 percent to ensure total coverage. They then apply a standard inpainting model to fill these masked regions. The quality of this process was verified through manual spot-checks of 50 random samples.
  • Retention Benchmarks: To measure collateral damage to non-OCR tasks, the authors employ three specific benchmarks using complete datasets and deterministic greedy decoding:
    • CountBench: A benchmark used to measure object counting abilities.
    • EmbSpatial: A benchmark designed to test spatial reasoning and understanding for embodied tasks.
    • RealWorldQA: A general visual question answering benchmark based on real-world images.

Method

To investigate how vision-language integration affects OCR routing, the authors examine three distinct architectural families of Vision-Language Models (VLMs). The first category consists of DeepStack models, including Qwen3-VL-4B-Instruct, Qwen3-VL-2B-Instruct, and Qwen3-VL-8B-Instruct. These models utilize a progressive injection strategy where visual features are introduced into multiple layers of the Large Language Model (LLM) backbone. In contrast, the second category comprises single-stage models, such as Phi-4-multimodal-instruct and InternVL3.5-4B. These architectures inject all visual tokens exclusively at the input layer. This variation in integration strategy allows for the identification of how the OCR bottleneck is distributed across the model layers.

As shown in the figure below:

The authors employ a PCA-based subspace removal method to isolate the components of the residual stream responsible for text processing. For each layer \ell, the activation difference is computed across a held-out training split using the formula: Δh=horiginalhinpainted\Delta h _ { \ell } = h _ { \ell } ^ { \mathrm { o r i g i n a l } } - h _ { \ell } ^ { \mathrm { i n p a i n t e d } }Δh=horiginalhinpainted where horiginalh_{\ell}^{\mathrm{original}}horiginal represents the activations from images containing text and hinpaintedh_{\ell}^{\mathrm{inpainted}}hinpainted represents activations from images where text has been removed. By performing Principal Component Analysis (PCA) on these differences, the authors identify principal components that capture the specific changes occurring when text is present.

During inference, the authors intervene by projecting out the top NNN principal components from the residual stream. The intervened activation hinth_{\ell}^{\mathrm{int}}hint is calculated as: \nhint=hαi=1N(hpci)pci\nh _ { \ell } ^ { \mathrm { i n t } } = h _ { \ell } - \alpha \sum _ { i = 1 } ^ { N } ( h _ { \ell } \cdot \mathbf { p c } _ { i } ) \, \mathbf { p c } _ { i }\nhint=hαi=1N(hpci)pci In this equation, hh_{\ell}h is the residual-stream activation at layer \ell, pci\mathbf{pc}_ipci is the iii-th principal component, NNN is the number of components removed, and α\alphaα is a scaling factor for intervention strength. The authors set α=1\alpha=1α=1 to achieve full projection removal. These interventions are implemented via forward hooks on the transformer layer outputs, specifically targeting the post-MLP residual stream.

Complementing the subspace removal, the authors also perform head ablation to identify attention heads specialized for OCR. They define a selectivity ratio for each attention head, calculated as the ratio of attention weights assigned to OCR regions versus background regions. Heads with a ratio greater than 1.0 are considered to preferentially attend to text. The authors rank all 720 heads based on this selectivity and ablate the top NNN heads by zeroing their output projections during both the prefill and decode stages to assess their causal importance to text reading tasks.

Experiment

The researchers evaluated five vision-language models across multiple OCR and visual reasoning benchmarks to identify and intervene upon text-processing bottlenecks. The experiments demonstrate that OCR signals are concentrated in low-dimensional subspaces and that the location of these bottlenecks is determined by model architecture, with DeepStack models showing mid-depth sensitivity and single-stage models showing early-layer sensitivity. Results indicate that targeted OCR removal can actually improve performance on certain visual tasks like object counting by reducing representational interference, particularly in larger models that exhibit more modular circuit organization.

The authors conduct control experiments to ensure that the observed OCR suppression stems from actual text removal rather than artifacts introduced by inpainting. They compare three different inpainting methods against a random box baseline to validate the specificity of their PCA-based intervention. Different inpainting techniques yield similar levels of explained variance and OCR suppression The random box control shows minimal impact on both variance and OCR accuracy Results confirm that the identified signals are specific to text removal rather than general image perturbations

The authors evaluate different layer-wise interventions on the Qwen3-VL-2B model to observe the impact on OCR and visual reasoning tasks. Results show that interventions in the mid-network range can suppress OCR while simultaneously improving performance on certain visual tasks like counting. Interventions at specific mid-layers achieve significant OCR suppression while increasing counting accuracy Certain interventions provide a trade-off where counting improves but spatial reasoning and general visual QA scores decrease Multi-layer interventions can effectively reduce OCR with varying impacts on downstream visual reasoning tasks

The authors identify the most OCR-selective attention heads by ranking them based on their selectivity ratio. These heads are primarily concentrated within specific mid-network layers. The most selective heads are located within layers 16, 18, 8, 12, and 17 Selectivity ratios for the top ranked heads range from approximately 1.10 to 1.41 A significant cluster of highly selective heads is found in the mid-layer range between layer 16 and layer 20

The authors evaluate how targeted OCR interventions affect various visual reasoning tasks across different model architectures. Results show that while some models achieve improved performance in counting tasks with minimal impact on other metrics, others experience significant degradation in spatial reasoning and general visual QA. Qwen3-VL-4B achieves a favorable trade-off by improving counting performance with only minor impacts on spatial reasoning and general visual QA. Smaller models like Qwen3-VL-2B and Phi-4 show gains in counting but suffer from notable declines in spatial reasoning and general visual tasks. InternVL exhibits a complete degradation in all measured metrics when interventions are applied to its early-layer bottleneck.

The authors evaluate different PCA-based interventions on the Qwen3-VL-4B model to identify optimal layers for OCR suppression. Results show that targeting specific mid-layer components can improve object counting performance while maintaining or slightly adjusting other visual reasoning capabilities. Multi-layer interventions at layers 17 through 19 achieve a favorable balance by improving counting and general visual QA with minimal impact on spatial reasoning. Single-layer interventions at layer 18 lead to a significant collapse in general visual QA performance. Certain mid-layer interventions, such as pca_L17_pc5, demonstrate effective trade-offs by increasing counting accuracy while preserving most other metrics.

The authors conduct control experiments and layer-wise interventions to validate that their PCA-based method specifically suppresses OCR without being caused by general image perturbations. By identifying highly selective attention heads in the mid-network layers, they demonstrate that targeted interventions can suppress text recognition while potentially improving certain visual tasks like counting. The effectiveness of these interventions varies across architectures, with some models achieving a favorable balance between OCR suppression and reasoning capabilities while others experience performance trade-offs.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp