ViQ: Text-Aligned Visual Quantized Representations at Any Resolution

Xumin Yu Zuyan Liu Zhenyu Yang Yuhao Dong Shengsheng Qian Jiwen Lu Han Hu Yongming Rao

Abstract

A unified representation for text and vision is a natural pursuit, since it enables simpler multimodal modeling and more efficient training. However, representing images as discrete signals in the same way as text inevitably causes severe information loss. Existing research struggles to balance low-level details and high-level semantics in discrete representations: reconstruction-oriented representations often lack semantic information, while semantically stronger features typically suffer severe loss of detail. We present ViQ, a framework for visual quantized representations designed to balance semantics and details in discrete representations while supporting native-resolution inputs, which allows it to serve as a unified, general discrete representation for diverse visual inputs. Our approach organizes quantization learning into two stages: text-aligned pre-training and feature discretization. Through text-aligned pre-training, we strengthen the semantically rich supervision of the vision encoder derived from the pre-trained language model and enable it to process visual inputs at native resolution. During discretization, we propose a proximal representation learning strategy that progressively compresses the feature space, together with a position-aware head-wise quantization mechanism that allows flexible handling of different resolutions. Extensive experiments on multimodal tasks show that ViQ achieves competitive performance compared with state-of-the-art multimodal vision encoders that use continuous, high-dimensional visual features, while maintaining high fidelity in low-level reconstruction. We also show that multimodal training with quantized visual representations substantially improves efficiency, achieving speedups of 20%-70% across various large language model backbones and training recipes.

One-sentence Summary

ViQ is a visual quantization framework that balances high-level semantics and low-level details by integrating text-aligned pre-training for native-resolution processing with a proximal representation learning strategy and position-aware head-wise quantization, yielding a unified discrete representation that enables simpler multimodal modeling and more efficient training across arbitrary resolutions.

Key Contributions

  • The paper introduces ViQ, a visual quantized representations framework that balances low-level details and high-level semantics in discrete formats while natively supporting arbitrary input resolutions.
  • The framework structures quantization learning into text-aligned pre-training and feature discretization phases, employing a proximal representation learning strategy to compact the latent space and a position-aware head-wise quantization mechanism to preserve resolution flexibility.
  • Evaluated across nine multimodal benchmarks and reconstruction tasks, the framework outperforms existing quantized models and achieves performance competitive with continuous encoders such as InternViT, AIMv2, and SigLIP2. Training efficiency improves by 20% to 70% across varying sequence lengths, while reconstruction fidelity achieves a PSNR of 22.73 and an rFID score of 0.62.

Introduction

Multimodal large language models benefit from unified representations that align vision and text, enabling simpler modeling and improved training efficiency. However, continuous visual features mismatch the discrete token structure of language and impose heavy computational costs, while existing discrete quantization methods often fail to balance high-level semantics with low-level visual details. The authors introduce ViQ, a framework that produces text-aligned visual quantized representations capable of handling native resolutions. By combining text-aligned pre-training with a proximal representation learning strategy and position-aware head-wise quantization, ViQ achieves competitive multimodal performance and reconstruction fidelity while delivering significant training speedups.

Method

The authors introduce ViQ, a visual quantization framework designed to bridge raw pixels and compact latent representations for multimodal learning. The architecture processes images at any resolution through a specific pipeline that converts continuous features into discrete codes.

As shown in the framework diagram, input images first pass through a ViQ Encoder to extract high-dimensional features. These features are then compressed by a Bottleneck Compressor. The core module, Position-Aware FSQ, maps the compressed continuous features to discrete code embeddings, which a Bottleneck Decoder then decodes. The output branches into two paths: an MM Projector that produces multimodal features for language models, and a Pixel Decoder for image reconstruction tasks.

The training process follows a two-stage approach to ensure robust alignment and effective quantization. In the first stage, text-aligned pre-training aligns the visual encoder with language embeddings. To support native-resolution inputs, the model replaces fixed positional embeddings with positional embeddings that are resized to match the dimensions of each input. The optimization combines a text-guided cross-entropy loss and a self-distillation loss. The text loss is defined as:

$$\mathcal{L}_{\mathrm{text}} = \mathrm{CrossEntropy}\left[\mathrm{LLM}(\mathrm{ViQ}(I),\, T),\, A\right]$$

The self-distillation loss ensures semantic consistency with a fixed-resolution teacher model using cosine similarity:

$$\mathcal{L}_{\mathrm{distill}} = 1 - \cos\left(\mathbf{z}_s^{\mathrm{student}}, \mathbf{z}_s^{\mathrm{teacher}}\right)$$
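The distillation term above can be illustrated with a minimal standard-library sketch. It assumes the student and teacher features are plain lists of floats of equal length; the function names `cosine_distill_loss` and `batch_distill_loss`, the `eps` guard, and the averaging over a batch are illustrative assumptions, not details confirmed by the source.

```python
import math


def cosine_distill_loss(student, teacher, eps=1e-8):
    """Return 1 - cos(student, teacher) for two equal-length feature vectors."""
    if len(student) != len(teacher):
        raise ValueError("feature vectors must have the same length")
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    # eps (an assumption) guards against division by zero for all-zero vectors.
    return 1.0 - dot / max(norm_s * norm_t, eps)


def batch_distill_loss(student_feats, teacher_feats):
    """Average the per-vector loss over a batch (averaging is an assumption)."""
    losses = [cosine_distill_loss(s, t) for s, t in zip(student_feats, teacher_feats)]
    return sum(losses) / len(losses)
```

The loss is 0 when the two vectors point in the same direction and grows to 2 when they point in opposite directions, so it constrains direction only, not magnitude.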

The second stage involves progressive quantization of the continuous features.

This stage begins with proximal representation learning. High-dimensional features are compressed via a bottleneck layer and constrained to a hypercube surface using the $L_\infty$ norm to reduce feature space complexity. The feature transformation is formulated as:

$$f_1 = L_\infty(\mathrm{BN}(f)), \quad \hat{f} = \mathrm{BN}'(f_1)$$
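A rough sketch of this transformation follows. It makes two assumptions that the summary does not confirm: that $\mathrm{BN}$ and $\mathrm{BN}'$ can be stood in for by plain linear maps, and that the $L_\infty$ step divides a vector by its largest absolute coordinate, which places it on the surface of the hypercube $[-1, 1]^d$. All function names are hypothetical.

```python
def linear(feature, weight):
    """Apply a linear map; `weight` is a list of rows, one per output channel."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weight]


def linf_normalize(feature, eps=1e-8):
    """Scale so the largest absolute coordinate equals 1 (hypercube surface)."""
    peak = max(abs(x) for x in feature)
    return [x / max(peak, eps) for x in feature]


def proximal_bottleneck(feature, down_weight, up_weight):
    """Compute f_1 = L_inf(BN(f)) and f_hat = BN'(f_1) with linear stand-ins."""
    f1 = linf_normalize(linear(feature, down_weight))
    f_hat = linear(f1, up_weight)
    return f1, f_hat
```

After normalization every coordinate of `f1` lies in [-1, 1] and at least one coordinate has magnitude 1, which keeps the continuous features inside a bounded range before any rounding step is applied.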

Following this, the model employs Multi-Head Finite Scalar Quantization (FSQ). To enhance representational capacity, a multi-head attention mechanism expands each visual patch into a $2 \times 2$ grid of codes. Additionally, 2D Rotary Position Embedding (RoPE) is applied to encode spatial resolution information:

$$\tilde{f}_m = f_m \odot e^{i(h\theta_h + w\theta_w)}$$
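The rotation in the formula above can be sketched with Python complex numbers. This sketch assumes each feature channel is a complex number (in practice, a pair of real channels), that `h` and `w` are the row and column indices of the patch, and that `theta_h` and `theta_w` are per-channel frequencies. The function name `rope_2d` and the example frequencies are hypothetical.

```python
import cmath


def rope_2d(feature, h, w, theta_h, theta_w):
    """Rotate each complex channel by the angle h * theta_h + w * theta_w."""
    return [
        z * cmath.exp(1j * (h * th + w * tw))
        for z, th, tw in zip(feature, theta_h, theta_w)
    ]


def score(a, b):
    """Complex inner product between two rotated feature vectors."""
    return sum(x * y.conjugate() for x, y in zip(a, b))


# Example frequencies and features (assumed values, for illustration only).
theta_h = [1.0, 0.1]
theta_w = [0.5, 0.05]
q = [1 + 2j, 0.5 - 1j]
k = [2 - 1j, -1 + 0.5j]
rotated = rope_2d(q, 3, 4, theta_h, theta_w)
```

Because the rotation multiplies each channel by a unit-magnitude phase, it preserves feature magnitudes, and the inner product between two rotated vectors depends only on the offset between their positions rather than on absolute coordinates. This property is what makes the encoding usable across grids of different sizes.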

To preserve low-level details, the training incorporates a reconstruction loss supervised by a pre-trained visual autoencoder. The total objective combines the text loss, distillation loss, and reconstruction loss:

$$\mathcal{L}_{\mathrm{total}} = \lambda_{\mathrm{text}} \mathcal{L}_{\mathrm{text}} + \lambda_{\mathrm{distill}} \mathcal{L}_{\mathrm{distill}} + \lambda_{\mathrm{recon}} \mathcal{L}_{\mathrm{recon}}$$
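As a small illustration of how the three terms combine, the sketch below computes a weighted sum. The summary does not state the weight values, so the defaults here are placeholders. It also assumes the reconstruction term is a mean squared error between predicted latents and the latents of the pre-trained autoencoder, which is one plausible reading of the "VAE latent loss" mentioned in the ablations, not a confirmed definition.

```python
def latent_mse(predicted, target):
    """Mean squared error between two equal-length latent vectors (assumed form)."""
    if len(predicted) != len(target):
        raise ValueError("latent vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def total_loss(l_text, l_distill, l_recon, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the text, distillation, and reconstruction losses.

    The default weights are placeholders; the source does not give the lambdas.
    """
    w_text, w_distill, w_recon = weights
    return w_text * l_text + w_distill * l_distill + w_recon * l_recon
```

Raising the reconstruction weight pushes the codes toward low-level fidelity, while raising the text and distillation weights favors semantic alignment, which is the trade-off the framework is designed to balance.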

Experiment

The evaluation integrates ViQ with varying-scale language models and benchmarks it against general, multimodal-specialized, and quantized visual encoders across comprehensive understanding tasks, training efficiency tests, and image reconstruction pipelines. These experiments validate that ViQ achieves compact visual representations that preserve strong perceptual and semantic capabilities, particularly excelling in text- and document-centric tasks while delivering substantial training speed-ups through precomputed discrete codes. Ablation studies further confirm that gradually regularizing the latent space, employing non-learnable codebooks, and combining targeted reconstruction losses effectively balance low-level fidelity with high-level alignment. Overall, the findings establish ViQ as an efficient visual encoder that successfully navigates the trade-off between aggressive compression and robust multimodal understanding.

The authors evaluate different reconstruction loss formulations to optimize the training process. The VAE latent loss achieves the highest average performance among the loss types tested and outperforms the baseline configuration with no reconstruction loss, while requiring less computation time than the combination of MSE and LPIPS.

The authors compare the FSQ and SimVQ quantization methods across varying codebook sizes. FSQ consistently achieves higher average performance than SimVQ. Smaller codebook sizes improve results for both methods, and SimVQ performance declines as the codebook size increases.
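For readers unfamiliar with FSQ, the following is a minimal sketch of generic finite scalar quantization, not the position-aware multi-head variant described in this paper. Each latent dimension is bounded with `tanh` and rounded to one of a fixed number of levels, so the codebook is implicit and non-learnable, and its size is the product of the per-dimension level counts. For simplicity the sketch supports only odd level counts; the function names and example levels are assumptions.

```python
import math


def fsq_quantize(z, levels):
    """Bound each dimension with tanh and round it to an integer level."""
    codes = []
    for value, n_levels in zip(z, levels):
        if n_levels % 2 == 0:
            raise ValueError("this sketch only handles odd level counts")
        half = (n_levels - 1) // 2
        # tanh maps to (-1, 1); scaling by `half` gives integers in [-half, half].
        codes.append(round(math.tanh(value) * half))
    return codes


def fsq_index(codes, levels):
    """Pack per-dimension codes into one token index (mixed-radix encoding)."""
    index = 0
    for code, n_levels in zip(codes, levels):
        half = (n_levels - 1) // 2
        index = index * n_levels + (code + half)
    return index


def codebook_size(levels):
    """The implicit codebook size is the product of the level counts."""
    return math.prod(levels)
```

With three dimensions of five levels each, for example, the implicit codebook has 125 entries, and changing the codebook size amounts to changing the number of dimensions or levels rather than learning more embedding vectors.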

The authors evaluate ViQ against various visual encoders on multimodal understanding benchmarks. ViQ achieves higher average scores than most continuous visual encoders across different model sizes and significantly outperforms previous quantized baselines. It is particularly strong on text- and document-centric tasks, which supports its use as a compact visual representation.

The authors investigate the impact of bottleneck width on the model's average performance across multiple benchmarks. A significantly narrower bottleneck achieves performance comparable to the widest configuration, with no significant degradation, which suggests that the model remains robust under substantial dimensionality reduction.

The evaluation setup systematically tests key training and architectural components, validating that the VAE latent loss optimizes both reconstruction quality and computational efficiency compared to complex alternatives. Quantization analysis reveals that FSQ consistently outperforms SimVQ, with smaller codebook sizes further enhancing results across both methods. When benchmarked against standard visual encoders, the proposed ViQ model matches or exceeds continuous approaches on multimodal tasks while significantly surpassing prior quantized baselines, particularly in text and document understanding. Finally, bottleneck width tests confirm that substantial dimensionality reduction preserves overall performance, validating the framework's robustness and compact design.

