LLaVA-UHD v4: What Makes Visual Encoding Effective in Multimodal Large Language Models (MLLMs)?
Kechen Fang Yihua Qin Chongyi Wang Wenshuo Ma Tianyu Yu Yuan Yao
Abstract
Visual encoding is a major computational bottleneck in Multimodal Large Language Models (MLLMs), particularly for high-resolution image inputs. Prevailing practice follows global encoding followed by post-ViT compression: global encoding produces prohibitively long token sequences, while post-ViT compression incurs the ViT's full quadratic attention cost before any token reduction takes place. In this work, we revisit this convention along two dimensions: the encoding strategy and visual token compression. First, controlled experiments show that slice-based encoding outperforms global encoding across a range of benchmarks, suggesting that preserving local detail through sliced views can matter more for fine-grained perception than applying global attention. Second, we introduce intra-ViT early compression, which reduces the token count in the shallow layers of the ViT, substantially lowering visual-encoding FLOPs while preserving downstream performance. Integrating intra-ViT compression into the slice-based encoding framework, we present LLaVA-UHD v4, an efficient and compute-controllable visual encoding scheme tailored to high-resolution inputs. Across diverse benchmarks covering document understanding, OCR, and general VQA, LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% while matching or even surpassing baseline performance.
One-sentence Summary
The authors present LLaVA-UHD v4, an efficient and compute-controllable visual encoding scheme for multimodal large language models tailored for high-resolution inputs that integrates slice-based encoding with intra-ViT early compression to reduce visual-encoding FLOPs by 55.8% while matching or surpassing baseline performance across document understanding, OCR, and general VQA benchmarks.
Key Contributions
- Controlled experiments demonstrate that slice-based encoding outperforms global encoding across benchmarks by preserving local details through sliced views. This finding suggests that sliced views are more beneficial than applying global attention for fine-grained perception.
- The paper introduces a novel parameter-reusing intra-ViT early compression module that reduces tokens in shallow ViT layers. This technique substantially lowers visual-encoding FLOPs while maintaining downstream performance.
- The paper presents LLaVA-UHD v4, an efficient visual encoding scheme integrating intra-ViT compression into the slice-based framework. Experiments across document understanding, OCR, and general VQA benchmarks show a 55.8% reduction in visual-encoding FLOPs while matching or surpassing baseline performance.
Introduction
Multimodal Large Language Models increasingly rely on high-resolution inputs for fine-grained perception, yet the standard global encoding paradigm creates a quadratic computational bottleneck as image area grows. Existing efficiency solutions typically apply token compression after the vision encoder completes its heavy computation, and attempts to compress tokens internally often disrupt pretrained visual representations. The authors challenge the global encoding convention by demonstrating that slice-based strategies offer superior detail preservation without the quadratic overhead. They further introduce a parameter-reuse early compressor inserted into the shallow layers of the vision encoder that initializes using adjacent pretrained weights to maintain representation stability. This approach powers LLaVA-UHD v4, enabling aggressive token reduction inside the encoder itself to achieve significant speedups while maintaining competitive accuracy.
Method
The authors propose LLaVA-UHD v4, a visual encoding scheme designed to address the computational bottlenecks associated with high-resolution inputs in Multimodal Large Language Models. The architecture fundamentally shifts from the prevailing global encoding paradigm to a slice-based approach combined with intra-ViT early compression.
[Figure: overview of the LLaVA-UHD v4 framework]
The framework begins by decomposing the input image into a low-resolution thumbnail and a set of high-resolution slices selected by an aspect-ratio-aware policy. Unlike previous works that process the entire image globally before token reduction, LLaVA-UHD v4 rescales and concatenates these views along the sequence dimension. This allows the Vision Transformer (ViT) to process them in a single forward pass while preserving per-view attention locality.
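The decomposition above can be sketched as follows. This is a simplified illustration, not the paper's implementation: the grid-selection rule, view size (336), and patch size (14) are assumed stand-ins for the actual aspect-ratio-aware policy, and the nearest-neighbour resize is a placeholder for proper interpolation.

```python
import numpy as np

def nn_resize(img, out_h, out_w):
    """Nearest-neighbour resize (placeholder for proper interpolation)."""
    h, w = img.shape[:2]
    ys = np.arange(out_h) * h // out_h
    xs = np.arange(out_w) * w // out_w
    return img[ys][:, xs]

def choose_grid(h, w, max_slices=4):
    """Pick a (rows, cols) slice grid whose aspect ratio best matches the
    image -- a simplified stand-in for the aspect-ratio-aware policy."""
    best, best_err = (1, 1), float("inf")
    for r in range(1, max_slices + 1):
        for c in range(1, max_slices + 1):
            if r * c <= max_slices and abs(w / h - c / r) < best_err:
                best, best_err = (r, c), abs(w / h - c / r)
    return best

def encode_views(img, view=336, patch=14):
    """Thumbnail + slices, patchified and concatenated along the sequence dim."""
    h, w = img.shape[:2]
    rows, cols = choose_grid(h, w)
    views = [nn_resize(img, view, view)]          # low-res global thumbnail
    sh, sw = h // rows, w // cols
    for r in range(rows):                         # high-res local slices
        for c in range(cols):
            views.append(nn_resize(img[r*sh:(r+1)*sh, c*sw:(c+1)*sw], view, view))
    n = view // patch
    seqs = [v.reshape(n, patch, n, patch, -1).transpose(0, 2, 1, 3, 4)
             .reshape(n * n, -1) for v in views]  # one patch-token row per patch
    return np.concatenate(seqs, axis=0), len(views)
```

For a 672×1344 input this yields one thumbnail plus a 1×2 slice grid, i.e. three 336×336 views of 576 patch tokens each, concatenated into a single 1728-token sequence for one ViT forward pass.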
A key innovation is the integration of an intra-ViT early compressor module, denoted as D. This module is inserted directly into the ViT backbone, specifically after layer k=6, to reduce the token sequence length before the deeper layers process the data. This design choice adheres to the principle that compression should reduce the ViT's own compute, not just the downstream LLM's load. By reducing the token count early, the majority of the ViT layers operate on a significantly smaller sequence, slashing visual-encoding FLOPs.
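A rough FLOPs model shows why compressing early pays off. The layer count, hidden width, and token count below are illustrative assumptions rather than the paper's exact configuration, and the per-layer cost counts only the dominant matrix multiplications.

```python
def vit_flops_ratio(num_layers=27, k=6, d=1152, n=576, r=4):
    """Compare full-length encoding against compressing n -> n/r tokens
    after layer k. Per-layer cost ~ QKV/output projections (4*n*d^2),
    attention scores and values (2*n^2*d), and a 4d-wide MLP (8*n*d^2)."""
    def layer_cost(nt):
        return 12 * nt * d * d + 2 * nt * nt * d
    full = num_layers * layer_cost(n)
    early = k * layer_cost(n) + (num_layers - k) * layer_cost(n // r)
    return early / full
```

With these toy numbers the deep layers run on a quarter of the tokens and the encoder cost drops to roughly 40% of the full-length run, in the same ballpark as the paper's reported 55.8% end-to-end reduction.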
The internal structure of the compressor D consists of two stages. First, a window-attention operator WinAttn_{2×2} is applied to the input token sequence X_k ∈ R^{N×d}. Attention is restricted to non-overlapping 2×2 windows, so each token interacts only with its three spatial neighbors, enriching local context. Second, a downsample-and-fuse block follows: a 2×2 PixelUnshuffle operation reshapes the intermediate representation Y into Z ∈ R^{(N/4)×4d}, and an MLP fuses these concatenated channels back to the original dimension d, producing a compressed sequence of length N/4.
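A minimal single-head sketch of the two stages, assuming a square token grid and omitting LayerNorms, biases, multi-head splitting, and the MLP's hidden layer (the fusing MLP here is a single linear map for brevity):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def compressor_d(x, wq, wk, wv, w_fuse, grid):
    """x: (grid*grid, d) token sequence laid out row-major on a square grid.
    Stage 1: attention restricted to non-overlapping 2x2 windows.
    Stage 2: PixelUnshuffle (concat the 4 window tokens) + fusing projection."""
    n, d = x.shape
    g = grid
    # gather each 2x2 spatial window into its own group: (n/4, 4, d)
    xg = (x.reshape(g // 2, 2, g // 2, 2, d)
           .transpose(0, 2, 1, 3, 4)
           .reshape(n // 4, 4, d))
    q, k, v = xg @ wq, xg @ wk, xg @ wv
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))
    y = attn @ v                      # each token mixes with its 3 neighbours
    z = y.reshape(n // 4, 4 * d)      # PixelUnshuffle: Z in R^{(N/4) x 4d}
    return z @ w_fuse                 # fuse channels back to dimension d
```

With zeroed query/key weights the window attention degenerates to averaging over each 2×2 window, which is a convenient sanity check on the grouping logic.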
To ensure that inserting this new module does not disrupt the pretrained representation manifold of the ViT, the authors employ a parameter-reuse initialization strategy. Rather than random initialization, D is initialized using the weights of the preceding ViT layer k. Specifically, the attention projections and LayerNorms are copied directly, while the MLP weights are constructed to mimic applying the original Feed-Forward Network independently to each patch within a window followed by averaging. The weight matrices are defined as:
$$W_1 = \mathrm{BlockDiag}\!\big(F_1^{(k)},\, F_1^{(k)},\, F_1^{(k)},\, F_1^{(k)}\big), \qquad W_2 = \tfrac{1}{4}\big[\,F_2^{(k)} \,\big|\, F_2^{(k)} \,\big|\, F_2^{(k)} \,\big|\, F_2^{(k)}\,\big],$$

where $F_1^{(k)}$ and $F_2^{(k)}$ denote the up- and down-projection matrices of the FFN in ViT layer $k$, and the $\tfrac{1}{4}$ factor implements the averaging over the four patches in a window. This initialization allows fine-tuning to begin on or near the pretrained manifold, avoiding the need to recover it from scratch.
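This construction can be checked numerically. In the sketch below, F1/F2 are the layer-k FFN's up/down projections and a tanh-approximate GELU is assumed (the real backbone's activation may differ); at initialization the fused MLP on a concatenated window exactly reproduces "apply the FFN to each patch, then average".

```python
import numpy as np

def gelu(x):
    """Tanh-approximate GELU (assumed activation)."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, f1, f2):
    """The original per-patch feed-forward network of ViT layer k."""
    return gelu(x @ f1) @ f2

def reuse_init(f1, f2):
    """W1 = BlockDiag(F1, F1, F1, F1); W2 = (1/4) * [F2; F2; F2; F2]."""
    d, m = f1.shape
    w1 = np.zeros((4 * d, 4 * m))
    for i in range(4):
        w1[i * d:(i + 1) * d, i * m:(i + 1) * m] = f1
    w2 = np.concatenate([f2] * 4, axis=0) / 4
    return w1, w2

rng = np.random.default_rng(0)
d, m = 8, 32
f1, f2 = rng.standard_normal((d, m)), rng.standard_normal((m, d))
w1, w2 = reuse_init(f1, f2)
window = rng.standard_normal((4, d))            # a 2x2 window of tokens
fused = gelu(window.reshape(-1) @ w1) @ w2      # compressor MLP on the concat
averaged = np.mean([ffn(t, f1, f2) for t in window], axis=0)
assert np.allclose(fused, averaged)             # identical at initialization
```

The block-diagonal W1 applies F1 independently to each of the four patches, and the tiled-and-scaled W2 sums the per-patch FFN outputs with weight 1/4, which is exactly the averaging described above.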
Following the intra-ViT compression, the encoded visual features pass through a post-ViT MLP connector. This stage further reduces the token count and projects the features into the language model space. The combination of the intra-ViT compressor and the post-ViT connector achieves an end-to-end 16× reduction in token count.
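The 16× figure composes two 4× stages. How the reduction splits between compressor and connector is an assumption here, chosen as one split consistent with the stated total; the 576-patch view size is likewise illustrative.

```python
def token_budget(patches_per_view=576, intra_ratio=4, connector_ratio=4):
    """Per-view token count: the 2x2 intra-ViT merge gives 4x, and a
    further 2x2 spatial merge in the connector (assumed) gives another 4x,
    for the stated 16x overall reduction."""
    assert patches_per_view % (intra_ratio * connector_ratio) == 0
    return patches_per_view // intra_ratio // connector_ratio
```

Under these assumptions, 576 patches per view become 144 after the intra-ViT compressor and 36 after the connector.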
The training process follows a four-stage recipe to optimize the model. First, vision-language alignment is performed on large-scale image-text pairs, updating only the projector and the new compressor D. Second, knowledge injection occurs via OCR, document, and chart data with only the ViT unfrozen. Third, interleaved training on image-text sequences facilitates multi-image and long-context reasoning. Finally, supervised instruction tuning is applied on a diverse mixture of general VQA, math, and conversational tasks to refine the model's capabilities.
Experiment
Controlled experiments on high-resolution MLLMs reveal that slice-based encoding consistently outperforms global encoding by preserving locality, while spatial-merging MLP connectors prove superior to query-based resamplers by maintaining explicit spatial correspondence. Furthermore, shifting compression inside the ViT pipeline substantially reduces computational costs without compromising accuracy when utilizing local window attention and weight reuse at an intermediate layer depth. These findings collectively establish a more efficient architecture that balances visual encoding quality with compute savings.
The authors evaluate various intra-ViT compression designs to determine how to effectively reduce visual tokens inside the encoder. Simple merging strategies such as pixel-unshuffle MLPs and average pooling underperform or merely match the post-ViT baseline, whereas designs incorporating local window attention and weight reuse fare better: window-attention variants outperform both cross-attention variants and plain MLP projections, and combining window attention with a reused MLP projector achieves the highest average accuracy, surpassing the post-ViT baseline. This demonstrates that preserving local context and staying aligned with pretrained weights are critical for effective early compression.
The authors evaluate naive in-ViT compression strategies against a post-ViT baseline to assess the trade-off between efficiency and accuracy. Moving compression inside the ViT drastically reduces computational cost, but simple merging methods fall short of the baseline: both average pooling and pixel-unshuffle show a noticeable drop in average accuracy, with pixel-unshuffle performing slightly better than average pooling yet still inferior to the post-ViT baseline. This indicates that early token reduction requires more careful design to preserve representational quality.
The authors evaluate intra-ViT compression strategies using cross-attention against a post-ViT baseline. Internal compression achieves a substantial reduction in FLOPs while maintaining competitive accuracy: the top-left query strategy preserves accuracy at levels nearly matching the baseline, while the mean query strategy shows a minor drop relative to the top-left approach.
The authors conduct a robustness study replacing the standard SigLIP 2 backbone with MoonViT and testing an alternative higher-resolution slicing schedule under a 16× compression rate. In all of these settings, slice-based encoding matches or outperforms global encoding: it maintains its advantage with the MoonViT backbone across data scales, the higher-resolution slicing schedule further widens the gap in its favor, and increasing the training data volume from 8M to 16M samples amplifies the benefit. The advantage is most pronounced on OCR-intensive benchmarks requiring fine-grained recognition, indicating that the effectiveness of slice-based encoding generalizes across visual encoder architectures and training data volumes.
The study assesses intra-ViT compression designs and slice-based encoding strategies against post-ViT baselines to balance efficiency with accuracy. Findings reveal that naive merging strategies degrade performance, while incorporating local window attention and weight reuse preserves representational quality for superior results. Furthermore, robustness evaluations show that slice-based encoding consistently outperforms global encoding across different architectures and data scales, with advantages increasing as training volume grows.