A Modality-Gap-Guided Training Paradigm for Subspace Alignment in Multimodal Large Language Models

Abstract

Despite the success of multimodal contrastive learning in aligning visual and language representations, a persistent geometric flaw remains, known as the "modality gap": embeddings from different modalities that express the same meaning lie in systematically separated regions. Previous methods that attempt to reduce this gap are largely limited by oversimplified isotropic assumptions, which hinders their application at scale. In this paper, we address these limitations by precisely characterizing the shape of the modality gap and exploiting this geometric characterization to enable efficient model scaling. First, we propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a fixed reference frame into stable components (systematic biases) and anisotropic residuals. Building on this precise model, we introduce ReAlign, a training-free modality alignment strategy. Using statistics drawn from massive amounts of unpaired data, ReAlign maps the text representation onto the distribution of the image representation through a three-step process of Anchor Alignment, Trace Alignment, and Centroid Alignment, explicitly correcting the geometric deviation. Building on ReAlign, we propose ReVision, a scalable paradigm for Multimodal Large Language Models (MLLMs). ReVision integrates ReAlign into the pretraining stage, enabling the model to learn the distribution of visual representations from unpaired text before visual instruction tuning, without requiring massive, high-quality image-text pairs. Our framework shows that statistically aligned unpaired data can effectively substitute for costly image-text pairs, opening a robust and efficient path toward scaling multimodal large language models.

One-sentence Summary

Researchers from HKUST(GZ), NUS, and collaborators propose ReAlign and ReVision to resolve the modality gap in multimodal learning by modeling its geometric structure—stable bias and anisotropic residuals—enabling efficient, training-free alignment and scalable MLLM pretraining using unpaired data instead of costly image-text pairs.

Key Contributions

  • We introduce the Fixed-frame Modality Gap Theory, which decomposes the geometric misalignment between visual and text embeddings into stable biases and anisotropic residuals within a frozen reference frame, correcting prior oversimplified isotropic assumptions.
  • We propose ReAlign, a training-free alignment method that uses statistics from unpaired data to map text representations into the image embedding space via Anchor, Trace, and Centroid Alignment, enabling precise geometric correction without additional training.
  • We present ReVision, a scalable MLLM pretraining paradigm that leverages ReAlign to substitute expensive image-text pairs with unpaired text data, allowing models to learn visual distributions before instruction tuning while maintaining performance.

Introduction

The authors leverage multimodal contrastive learning’s known geometric flaw—the Modality Gap—to rethink how Multimodal Large Language Models (MLLMs) are trained. Prior methods assume this gap is isotropic noise and correct only the mean shift, which fails to capture the true anisotropic structure of representation misalignment and limits scalability. The authors’ key contribution is a precise geometric decomposition of the gap into stable biases and anisotropic residuals, enabling ReAlign: a training-free, three-step statistical alignment that maps text embeddings into the visual distribution using only unpaired data. Building on this, they introduce ReVision, a scalable two-stage training paradigm that substitutes expensive image-text pairs with massive text during pretraining, then refines with real images—proving text alone can encode rich visual semantics when aligned correctly.

Method

The authors leverage a fixed-reference framework to decompose the modality gap into interpretable geometric components, enabling precise characterization of persistent misalignments in dual-encoder contrastive models. This decomposition is grounded in a data-driven subspace construction that separates bias and residual terms, allowing independent estimation and reducing theoretical claims to second-moment conditions estimable from finite samples.

The framework begins by training a dual-encoder model using the InfoNCE loss on a large-scale image-text dataset. At a fixed reference time $t_0$, a task-relevant subspace $U$ is constructed from a held-out probe set by computing the empirical covariance of unit-normalized embeddings for both modalities:

$$\hat{\Sigma}(t_0) := \mathrm{Cov}_{\mathbf{p}}\big(e_x(t_0)\big) + \mathrm{Cov}_{\mathbf{p}}\big(e_y(t_0)\big),$$

where $U := \mathrm{span}\{q_1, \ldots, q_r\}$ is spanned by the top $r$ eigenvectors of $\hat{\Sigma}(t_0)$, with $r$ determined by an energy threshold. The orthogonal complement $V := U^\perp$ is then defined, and fixed orthogonal projectors $P_U$ and $P_V$ are maintained for all subsequent analysis. This fixed frame enables consistent decomposition of the modality gap $\Delta(t) := e_x(t) - e_y(t)$ into bias and residual components.
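A minimal sketch of this construction, assuming `e_x` and `e_y` are matrices of unit-normalized probe embeddings with one row per sample; the function name and the 0.90 energy threshold are illustrative choices rather than values taken from the paper:

```python
import numpy as np

def build_fixed_frame(e_x, e_y, energy=0.90):
    """Construct the fixed reference frame (U, V) from held-out probe embeddings.

    e_x, e_y : (n, d) arrays of unit-normalized image / text embeddings at t0.
    energy   : fraction of spectral energy retained when choosing the rank r.
    Returns the orthogonal projectors P_U and P_V.
    """
    # Empirical covariance of each modality, summed as in the definition of Sigma_hat(t0).
    sigma_hat = np.cov(e_x, rowvar=False) + np.cov(e_y, rowvar=False)

    # Eigendecomposition of the symmetric PSD matrix; sort eigenpairs in descending order.
    eigvals, eigvecs = np.linalg.eigh(sigma_hat)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

    # Smallest r whose leading eigenvalues capture the requested energy fraction.
    r = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), energy)) + 1
    Q = eigvecs[:, :r]                      # orthonormal basis q_1, ..., q_r of U

    P_U = Q @ Q.T                           # projector onto U
    P_V = np.eye(sigma_hat.shape[0]) - P_U  # projector onto V = U^perp
    return P_U, P_V
```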

The bias terms are defined as projected mean components: $\beta(t) := P_U \mathbb{E}[\Delta(t)] \in U$ (Principal Modality Bias, PMB) and $\gamma(t) := P_V \mathbb{E}[\Delta(t)] \in V$ (Constant Orthogonal Bias, COB). The zero-mean residuals are $\delta(t) := P_U(\Delta(t) - \mathbb{E}[\Delta(t)]) \in U$ and $\zeta(t) := P_V(\Delta(t) - \mathbb{E}[\Delta(t)]) \in V$. This yields the exact orthogonal decomposition:

$$\Delta(t) = \beta(t) + \delta(t) + \gamma(t) + \zeta(t).$$
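Given the fixed projectors, the four components follow directly from the per-pair gaps; a minimal sketch, assuming `e_x` and `e_y` are row-aligned paired embeddings and `P_U`, `P_V` come from the construction above:

```python
import numpy as np

def decompose_gap(e_x, e_y, P_U, P_V):
    """Exact orthogonal decomposition Delta(t) = beta + delta + gamma + zeta.

    e_x, e_y : (n, d) row-aligned paired embeddings at time t.
    P_U, P_V : fixed orthogonal projectors from the reference frame at t0.
    """
    gap = e_x - e_y                          # per-pair modality gap Delta(t)
    mean_gap = gap.mean(axis=0)              # E[Delta(t)]

    beta = P_U @ mean_gap                    # Principal Modality Bias (PMB), lives in U
    gamma = P_V @ mean_gap                   # Constant Orthogonal Bias (COB), lives in V

    centered = gap - mean_gap                # zero-mean fluctuations
    delta = centered @ P_U                   # residual delta(t) in U (P_U is symmetric)
    zeta = centered @ P_V                    # residual zeta(t) in V

    # Sanity check: the four parts reconstruct Delta(t) exactly, up to float error.
    assert np.allclose(gap, beta + gamma + delta + zeta)
    return beta, gamma, delta, zeta
```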

As shown in the authors' figure, the residual structure exhibits extreme anisotropy in both subspaces, with the orthogonal component $V$ hosting a stable bias vector that remains geometrically decoupled from the highly stretched noise structure of $\zeta(t)$.

The authors further analyze the evolution of these components. Gradients are empirically observed to concentrate within the instantaneous task subspace $U_t$, with the reference leakage ratio $\|P_V g_t\| / \|g_t\|$ tightly tracking the geometric baseline $\sin\theta(U_t, U)$. This concentration implies that direct optimization in $V$ is negligible, leading to passive drift of $\gamma(t)$: a slow, cumulative evolution driven by subspace rotation rather than active gradient correction. In contrast, within $U$, the residual $\delta(t)$ exhibits extreme anisotropy ($\kappa(\Sigma_U) > 10^3$) and rapidly aligns with the gradient covariance structure, indicating that fluctuations are semantically coupled to the task.
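These diagnostics can be computed directly from the fixed frame; a minimal sketch, assuming `g_t` is a flattened gradient sample, `U_t` and `U_ref` are orthonormal bases of the instantaneous and reference subspaces, and `delta` is the in-subspace residual from the decomposition above:

```python
import numpy as np

def leakage_ratio(g_t, P_V):
    """Fraction of the gradient norm that escapes into the orthogonal complement V."""
    return np.linalg.norm(P_V @ g_t) / np.linalg.norm(g_t)

def max_principal_sine(U_t, U_ref):
    """sin(theta) of the largest principal angle between span(U_t) and span(U_ref).

    U_t, U_ref : (d, r) matrices with orthonormal columns.
    """
    # Singular values of U_ref^T U_t are the cosines of the principal angles.
    cosines = np.linalg.svd(U_ref.T @ U_t, compute_uv=False)
    return np.sqrt(max(0.0, 1.0 - cosines.min() ** 2))

def condition_number_in_U(delta):
    """kappa(Sigma_U): anisotropy of the residual covariance restricted to U."""
    eigvals = np.linalg.eigvalsh(np.cov(delta, rowvar=False))
    eigvals = eigvals[eigvals > 1e-12]   # drop numerically-zero directions outside U
    return eigvals.max() / eigvals.min()
```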

To address these geometric misalignments without retraining, the authors propose ReAlign, a three-stage, training-free alignment strategy. The method operates in Euclidean space before spherical projection and consists of the following steps (a sketch in code follows the list):

  1. Anchor Alignment: Centering the source modality and shifting its mean to match the target's centroid: $\dot{e}_y = (e_y - \mu_y) + \mu_x$.
  2. Trace Alignment: Scaling the residuals to match the target's global energy via $s = \sqrt{\mathcal{T}_x / \mathcal{T}_y}$, yielding $\tilde{e}_y = \mu_x + s(e_y - \mu_y)$, which preserves the source's anisotropic structure.
  3. Centroid Alignment: Correcting the non-linear drift induced by spherical projection by explicitly re-centering the distribution, $e_y'' = e_y' - \mu' + \mu_x$, followed by re-normalization.
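A hedged sketch of the three steps, assuming the per-modality means $\mu_x, \mu_y$ and traces $\mathcal{T}_x, \mathcal{T}_y$ have already been estimated from unpaired image and text corpora; the function is an illustrative reading of the formulas above, not the authors' released implementation:

```python
import numpy as np

def realign(e_y, mu_x, mu_y, trace_x, trace_y):
    """Map text embeddings e_y (n, d) toward the image distribution, training-free.

    mu_x, mu_y       : modality means estimated from unpaired image / text data.
    trace_x, trace_y : total variances (covariance traces) of the two modalities.
    """
    # 1. Anchor Alignment: center the text embeddings on their own mean.
    centered = e_y - mu_y

    # 2. Trace Alignment: rescale so the total energy matches the image side while
    #    keeping the anisotropic shape of the text covariance (this step also applies
    #    the anchor shift onto the image centroid mu_x).
    s = np.sqrt(trace_x / trace_y)
    e_trace = mu_x + s * centered

    # 3. Centroid Alignment: spherical projection shifts the mean non-linearly, so
    #    re-center the projected points on the image centroid and re-normalize.
    e_sphere = e_trace / np.linalg.norm(e_trace, axis=1, keepdims=True)
    e_centered = e_sphere - e_sphere.mean(axis=0) + mu_x
    return e_centered / np.linalg.norm(e_centered, axis=1, keepdims=True)
```

The required statistics (two means and two traces) can be accumulated in a single streaming pass over unpaired corpora, which is consistent with the linear-time, constant-memory profile reported in the experiments below.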

As illustrated in the ReAlign pipeline, this sequence ensures that the final aligned representation matches both the first- and second-order statistics of the target modality on the unit hypersphere, while preserving semantic hierarchy and mitigating phantom drift.

This geometrically grounded approach forms the foundation for ReVision, a scalable two-stage training paradigm for multimodal LLMs. In Stage 1, ReAlign is used to synthesize pseudo-visual embeddings from unpaired text, enabling pretraining on vast textual corpora. In Stage 2, the model is fine-tuned on real image-instruction pairs to refine fine-grained visual reasoning. During inference, the model directly consumes real images, benefiting from the asymmetric alignment established during pretraining.
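For orientation only, a hypothetical outline of the two-stage data flow; every name below (the encoders, `realign`, `train_step`) is a placeholder rather than the authors' implementation:

```python
from typing import Callable, Iterable, Tuple

def revision_training(
    text_corpus: Iterable[Tuple[list, list]],
    image_instruction_pairs: Iterable[Tuple[list, list, list]],
    embed_text: Callable,
    embed_image: Callable,
    realign: Callable,
    train_step: Callable,
) -> None:
    # Stage 1: text-only pretraining; ReAlign turns text embeddings into
    # pseudo-visual features that stand in for image features.
    for captions, targets in text_corpus:
        pseudo_visual = realign(embed_text(captions))
        train_step(pseudo_visual, targets, None)

    # Stage 2: instruction tuning on real image-instruction pairs, now feeding
    # genuine visual features through the same interface.
    for images, instructions, targets in image_instruction_pairs:
        train_step(embed_image(images), targets, instructions)
```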

Experiment

  • ReAlign effectively reduces the modality gap by modeling anisotropic covariance, outperforming isotropic methods such as C³ that hit geometric bottlenecks.
  • ReVision achieves superior performance in multimodal LLM training by preserving semantic hierarchy and angular topology, leading to better reasoning and reduced hallucinations.
  • Scaling unpaired text data (2M samples) with ReAlign surpasses performance of 1M real image-text pairs at 26% lower cost, proving cost-efficient scaling is viable.
  • ReAlign demonstrates rapid convergence with ~10K samples, stable numerical precision in Float64, linear time complexity, and constant memory usage, enabling scalable deployment.
  • Domain-specific statistics are essential; cross-domain alignment degrades performance, highlighting the need for tailored calibration.
  • Blockwise covariance alignment fails due to numerical instability and semantic distortion, while ReAlign’s isotropic scaling preserves local structure and ensures robustness.
  • Short captions outperform long ones due to compact, low-rank covariance and higher signal-to-noise ratio, revealing that linguistic density can harm geometric alignment.
  • Qualitative analysis confirms ReVision excels in abstract reasoning, spatial understanding, fine-grained perception, and knowledge integration across diverse cognitive tasks.

ReVision outperforms BC Align across all evaluated categories, including general perception, complex reasoning, and hallucination resistance, achieving a higher average score. The results confirm that ReVision’s geometric alignment strategy preserves semantic structure better than blockwise covariance methods, which suffer from numerical instability and semantic distortion. This leads to more robust and accurate multimodal understanding without requiring paired image-text data.

Results show that ReVision, which uses anisotropic geometric alignment, consistently outperforms its long-caption variant across general perception, reasoning, and hallucination benchmarks. The performance gap highlights that concise, geometrically compact captions enable more stable modality alignment than linguistically rich but noisy long captions. This suggests that signal-to-noise ratio and manifold structure matter more than raw textual length for effective cross-modal training.

The authors use ReVision to align text and image modalities by modeling anisotropic covariance, which significantly reduces the geometric gap compared to isotropic methods like C³. Results show that ReVision outperforms baselines across general perception, reasoning, and hallucination benchmarks, demonstrating that precise geometric alignment preserves semantic structure and improves downstream task performance. Scaling up text-only pretraining with ReVision also surpasses paired image-text training in performance while reducing data acquisition costs.

ReVision-2M achieves the highest average score across general, reasoning, and hallucination benchmarks, outperforming both text-only and paired image-text baselines while reducing data acquisition cost by 26%. The method’s geometric alignment strategy preserves semantic structure better than isotropic noise-based approaches, enabling superior performance even without real image inputs. Scaling up unpaired text data with precise alignment proves more cost-effective and competitive than relying on expensive paired datasets.

