
CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

Yuling Shi Chaoxiang Xie Zhensu Sun Yeheng Chen Chenxu Zhang Longfei Yun Chengcheng Wan Hongyu Zhang David Lo Xiaodong Gu

Abstract

Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Current models rely on a text-based paradigm that treats source code as a linear sequence of tokens, so context length and the associated computational cost grow linearly with code size. The rapid advancement of Multimodal LLMs (MLLMs) introduces an opportunity to optimize efficiency by representing source code as rendered images. Unlike text, which is difficult to compress without losing semantic meaning, the image modality is inherently suitable for compression: by adjusting resolution, images can be scaled to a fraction of their original token cost while remaining recognizable to vision-capable models. To explore the feasibility of this approach, we conduct the first systematic study of the effectiveness of MLLMs for code understanding. Our experiments reveal that: (1) MLLMs can effectively understand code with substantial token reduction, achieving up to 8x compression; (2) MLLMs can effectively leverage visual cues such as syntax highlighting, improving code completion performance under 4x compression; and (3) code-understanding tasks like clone detection exhibit exceptional resilience to visual compression, with some compression ratios even slightly outperforming raw text inputs. Our findings highlight both the potential and the current limitations of MLLMs in code understanding, pointing toward image-modality code representation as a pathway to more efficient inference.

One-sentence Summary

Researchers from Shanghai Jiao Tong University, Singapore Management University, and others propose using Multimodal LLMs to process source code as compressed images, achieving up to 8× token reduction while preserving performance, leveraging visual cues like syntax highlighting for efficient code understanding and clone detection.

Key Contributions

  • LLMs face efficiency bottlenecks with growing codebases due to linear token growth; this work pioneers using rendered code images as a compressible alternative, enabling up to 8× token reduction while preserving semantic understanding.
  • Multimodal LLMs effectively exploit visual features like syntax highlighting, boosting code completion accuracy under 4× compression, demonstrating that visual cues enhance performance beyond raw text inputs.
  • Code clone detection tasks show strong resilience to image compression, with some ratios outperforming raw text inputs, validating image-based representation as a viable, efficient pathway for scalable code understanding.

Introduction

The authors leverage multimodal large language models (MLLMs) to explore representing source code as images rather than text, aiming to reduce computational costs as codebases grow. Traditional LLMs treat code as a linear token sequence, which is hard to compress without losing semantics, and prior work lacks a systematic evaluation of rendered code images despite MLLMs' visual capabilities. Their main contribution is a comprehensive empirical study showing that MLLMs can understand code images effectively, even at up to 8x compression, with some tasks like clone detection outperforming text baselines. They also introduce CODEOCR, a tool for rendering code into compressible, visually enhanced images, and identify model-specific resilience patterns and optimal rendering strategies.

Dataset

The authors use a curated set of four benchmark tasks to evaluate visual code understanding, each targeting different comprehension levels and processed with specific metrics and context handling:

  • Code Completion:
    Uses 200 Python and 200 Java samples from LongCodeCompletion (Guo et al., 2023), drawn from a challenging subset (Shi et al., 2025a). Applies RAG to inject relevant code context (avg. 6,139 tokens for Python, 5,654 for Java). Evaluated via Exact Match and Edit Similarity (see the metric sketch at the end of this section).

  • Code Summarization:
    Leverages LongModule-Summarization (Bogomolov et al., 2024), with 109 samples averaging 6,184 tokens. Uses CompScore — an LLM-as-judge metric via DeepSeek-V3.2 — with bidirectional averaging for fairness (scores 0–100).

  • Code Clone Detection:
    Draws from GPT-CloneBench (Alam et al., 2023), focusing on Type-4 semantic clones. Samples 200 balanced pairs (100 positive, 100 negative) per language (Python avg. 125 tokens, Java avg. 216 tokens). Evaluated via Accuracy and F1.

  • Code Question Answering:
    Constructs a new 200-sample dataset to avoid leakage in LongCodeQA (Rando et al., 2025). Crawls 35 post-August-2025 GitHub repos, generates 1,000 candidate QA pairs via DeepSeek-V3.2, then filters via 3 PhD validators ensuring question validity, context necessity, and unambiguous correctness. Final set shuffled to remove positional bias. Evaluated via Accuracy.

All token lengths are computed using Qwen-3-VL’s tokenizer. The datasets are used to probe code understanding across languages, with Python as primary and Java for extended analysis.
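
To make the completion metrics above concrete, here is a minimal sketch of Exact Match and Edit Similarity. The Edit Similarity shown uses a character-level similarity ratio, which is an assumption and may differ from the paper's exact normalization.

```python
from difflib import SequenceMatcher

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the predicted completion equals the reference after whitespace stripping, else 0.0."""
    return float(prediction.strip() == reference.strip())

def edit_similarity(prediction: str, reference: str) -> float:
    """Character-level similarity in [0, 1]; an assumed stand-in for the paper's Edit Similarity metric."""
    return SequenceMatcher(None, prediction.strip(), reference.strip()).ratio()

pred = "return sum(values) / len(values)"
ref = "return sum(values) / max(len(values), 1)"
print(exact_match(pred, ref), round(edit_similarity(pred, ref), 3))
```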

Method

The authors leverage a multimodal pipeline to enable large language models to process source code as visual input, bypassing traditional text tokenization. The framework begins with rendering source code into high-resolution images—2240×2240 pixels by default—to ensure compatibility with common patch-based visual encoders. This resolution allows for clean division into fixed-size patches (e.g., 14×14 pixels) without partial tiles, preserving visual fidelity during tokenization. The rendered image, which may include syntax highlighting or bold styling for enhanced visual cues, is then processed alongside a text-based instruction prompt.
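As a rough illustration of why this resolution pairs cleanly with 14-pixel patches, the sketch below counts the raw patches and the visual tokens left after a 2×2 merge; the exact token accounting depends on each model's vision encoder, so treat the numbers as an approximation.

```python
# Back-of-the-envelope token accounting for a 2240x2240 render
# with 14x14 patches and 2x2 patch merging (actual encoders may differ).
image_side = 2240
patch_size = 14
merge = 2  # 2x2 adjacent-patch merging in the V-L adapter

patches_per_side = image_side // patch_size      # 160, no partial tiles
raw_patches = patches_per_side ** 2              # 25,600 patches
visual_tokens = raw_patches // (merge * merge)   # 6,400 tokens after merging

print(patches_per_side, raw_patches, visual_tokens)
```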

Refer to the framework diagram, which illustrates the four-stage processing pipeline. In Stage 1, the input consists of a rendered code image $I \in \mathbb{R}^{H \times W \times 3}$ and a natural language instruction. Stage 2 involves encoding: the image is split into patches, and a Vision Transformer encoder generates a sequence of visual embeddings $V = \operatorname{Encoder}(I) = \{v_1, v_2, \ldots, v_N\}$, where each $v_i$ represents a patch's visual features. Simultaneously, the text prompt is tokenized into subword units and mapped to text embeddings via a lookup table.
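
For concreteness, here is a minimal NumPy sketch of the Stage 2 patch split, with a random linear projection standing in for the Vision Transformer encoder; the actual encoder architecture and weights are not part of this sketch.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 14) -> np.ndarray:
    """Split an (H, W, 3) image into flattened, non-overlapping patch vectors."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "resolution must divide evenly into patches"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

# A 560x560 toy render (smaller than the 2240x2240 default, for brevity)
image = np.random.rand(560, 560, 3)
patches = patchify(image)                        # (1600, 588): one row per 14x14 patch
proj = np.random.randn(patches.shape[1], 1024)   # toy linear "encoder"
V = patches @ proj                               # visual embeddings v_1 ... v_N
print(V.shape)                                   # (1600, 1024)
```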

In Stage 3, alignment and fusion occur. A V-L Adapter applies pooling operations such as $2 \times 2$ patch merging to compress adjacent visual embeddings into denser representations. For example, a pooling operation combines the four patches of a $2 \times 2$ block via concatenation and an MLP projection:

$$T_v = \mathrm{MLP}\big(\mathrm{Concat}(v_{i,j},\, v_{i+1,j},\, v_{i,j+1},\, v_{i+1,j+1})\big)$$

This reduces the visual token count while preserving semantic structure. The aligned visual embeddings are then concatenated with the text embeddings to form a unified input sequence:

$$\mathrm{Input} = [\,T_v;\ T_{\text{text}}\,]$$
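
Continuing the sketch above, the 2×2 merge and fusion step might look as follows, with a random matrix standing in for the MLP and toy text embeddings; the shapes and dimensions are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def merge_2x2(V: np.ndarray, grid_side: int, w_mlp: np.ndarray) -> np.ndarray:
    """Concatenate each 2x2 block of patch embeddings and project it: T_v = MLP(Concat(...))."""
    d = V.shape[1]
    grid = V.reshape(grid_side, grid_side, d)
    blocks = grid.reshape(grid_side // 2, 2, grid_side // 2, 2, d)
    blocks = blocks.transpose(0, 2, 1, 3, 4).reshape(-1, 4 * d)  # concat the four neighbors
    return blocks @ w_mlp  # a single linear layer standing in for the MLP projection

grid_side, d = 40, 1024                         # 40x40 patch grid from the previous sketch
V = np.random.randn(grid_side * grid_side, d)
w_mlp = np.random.randn(4 * d, d)
T_v = merge_2x2(V, grid_side, w_mlp)            # (400, 1024): 4x fewer visual tokens
T_text = np.random.randn(32, d)                 # toy embeddings of the text instruction
fused = np.concatenate([T_v, T_text], axis=0)   # Input = [T_v ; T_text]
print(T_v.shape, fused.shape)
```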

Stage 4 involves the MLLM backbone, typically composed of self-attention layers, which processes the fused sequence to generate a response. Unlike text-only models that rely on discrete syntax, MLLMs interpret continuous visual patterns—such as indentation, bracket alignment, and color coding—directly from pixel data, enabling spatial reasoning over code structure.

To optimize cost and efficiency, the authors introduce dynamic resolution compression. Starting from the high-resolution base image, they apply bilinear downsampling to achieve target compression ratios ($1\times$, $2\times$, $4\times$, $8\times$), where at ratio $k\times$ the visual token count is reduced to $1/k$ of the original text token count. This allows users to trade visual fidelity for token savings while maintaining performance on downstream tasks.
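
A hedged Pillow sketch of what dynamic resolution compression could look like: for a target ratio k, both image sides are scaled by 1/√k so the patch (and thus token) count drops by roughly k. The authors' exact scaling rule and resampling settings are not specified here, so these are assumptions.

```python
from math import sqrt
from PIL import Image

def compress_for_ratio(img: Image.Image, k: float, patch: int = 14) -> Image.Image:
    """Bilinearly downsample so the patch-token count shrinks by ~k, keeping sides patch-aligned."""
    scale = 1.0 / sqrt(k)
    new_w = max(patch, int(img.width * scale) // patch * patch)
    new_h = max(patch, int(img.height * scale) // patch * patch)
    return img.resize((new_w, new_h), Image.BILINEAR)

base = Image.new("RGB", (2240, 2240), "white")   # stand-in for a rendered code image
for k in (1, 2, 4, 8):
    small = compress_for_ratio(base, k)
    tokens = (small.width // 14) * (small.height // 14)
    print(f"{k}x -> {small.size}, ~{tokens} raw patches")
```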

The CODEOCR middleware operationalizes this pipeline. As shown in the workflow diagram, users supply code and instructions; CODEOCR renders the code into a syntax-highlighted image and dynamically compresses it to meet specified token budgets. The resulting image is passed to the MLLM alongside the text instruction, and the model’s output is returned to the user. Internally, CODEOCR uses Pygments for syntax analysis and Pillow for rendering, supporting six core languages with extensibility to 500+ via Pygments’ lexer ecosystem.
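
The rendering step described above maps naturally onto Pygments' ImageFormatter followed by a Pillow resize. The sketch below shows one plausible wiring and is not the authors' CODEOCR code; the font size, style names, and square target resolution are assumptions.

```python
import io
from pygments import highlight
from pygments.lexers import get_lexer_by_name
from pygments.formatters import ImageFormatter
from PIL import Image

def render_code_image(code: str, language: str = "python",
                      highlighted: bool = True, target_side: int = 2240) -> Image.Image:
    """Render source code to an (optionally syntax-highlighted) image, resized to a square target."""
    formatter = ImageFormatter(
        style="default" if highlighted else "bw",  # "bw" approximates a plain, uncolored rendering
        line_numbers=False,
        font_size=18,
    )
    png_bytes = highlight(code, get_lexer_by_name(language), formatter)
    img = Image.open(io.BytesIO(png_bytes)).convert("RGB")
    # Resizing to a fixed square keeps the sketch simple; a real pipeline would preserve aspect ratio.
    return img.resize((target_side, target_side), Image.BILINEAR)

img = render_code_image("def add(a, b):\n    return a + b\n")
img.save("code.png")
```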

The empirical design, depicted in the study overview, evaluates five research questions across four downstream tasks: code summarization, completion, clone detection, and question answering. The authors compare visual code input against raw text baselines, assess resilience to compression, evaluate rendering strategies (plain, bold, highlighted), test cross-language generalization, and analyze information degradation under compression. This systematic evaluation confirms that visual code representation not only matches but often exceeds text-based performance, particularly under compression, and adapts robustly across programming languages.

The visual rendering and compression process is further illustrated in the transformation diagram, which shows how 110 text tokens are converted into 110 visual tokens via rendering, then compressed to 27 visual tokens through resolution adjustment—enabling significant token savings without sacrificing task performance.

This architecture enables MLLMs to treat code as a visual artifact, unlocking new modalities for structured content representation while maintaining compatibility with existing text-based instruction paradigms.

Experiment

  • MLLMs can effectively understand code images, matching or surpassing text-based performance on key tasks such as clone detection and code completion, especially for models like Gemini-3-Pro and GPT-5.1.
  • Visual compression (up to 8x) is tolerated well in summarization and clone detection, but code completion and QA degrade beyond 2x–4x for most models, except top-tier ones like Gemini-3-Pro which improve under compression.
  • Syntax highlighting and bold rendering enhance code image understanding significantly at low-to-moderate compression (1x–4x), particularly for advanced models, but offer diminishing returns at high compression.
  • Findings generalize across programming languages—core patterns in performance, compression resilience, and enhancement benefits hold for Java as well as Python.
  • Compression degrades code information hierarchically: token errors appear first, line errors emerge at moderate compression, and block errors dominate at high compression; Gemini-3 models maintain low block error rates, explaining their consistent downstream performance.
  • Visual code processing introduces negligible latency overhead compared to text, enabling practical deployment with potential speedups via compression.

Across the seven multimodal LLMs evaluated, the results show that code presented as images can be understood effectively, with some models matching or exceeding text-based performance on tasks such as clone detection and summarization. Performance varies considerably by model and task: Gemini-3-Pro remains strong on both raw and compressed visual inputs, whereas open-weight models such as Qwen-3-VL and GLM-4.6v degrade noticeably under visual or compressed conditions. Visual enhancements such as syntax highlighting and bold rendering improve performance at moderate compression levels, but their benefits diminish at higher compression ratios where image clarity is reduced. The findings generalize across programming languages, and visual code processing adds minimal latency overhead, supporting practical deployment.

