HyperAIHyperAI

Command Palette

Search for a command to run...

시각 정보 증가를 통한 대규모 비전 언어 모델의 선택적 훈련

Seulbi Lee Sangheum Hwang

초록

대규모 시각 언어 모델(LVLMs)은 놀라운 진전을 이루었으나, 종종 언어 편향(language bias) 문제를 겪으며, 시각적 증거에 의존하지 않고 답변을 생성하는 경우가 많다. 기존 연구들은 디코딩 전략, 아키텍처 수정, 또는 정제된 지시 데이터를 활용하여 이 문제를 완화하려 했으나, 대부분 개별 학습 샘플이나 토큰이 이미지로부터 실제로 얼마나 도움을 받는지를 정량적으로 측정할 수 있는 방법이 부족하다는 한계를 지닌다. 본 연구에서는 시각 입력이 예측 불확실성의 감소에 기여하는 정도를 측정하는 퍼플렉서티 기반 지표인 시각 정보 이득(Visual Information Gain, VIG)을 제안한다. VIG는 샘플 수준과 토큰 수준에서 세밀한 분석이 가능하게 하며, 색상, 공간 관계, 속성과 같은 시각적으로 기반을 둔 요소들을 효과적으로 부각시킨다. 이를 바탕으로, 고VIG 값을 가지는 샘플과 토큰을 우선순위에 두는 VIG 기반 선택적 학습 방식을 제안한다. 이 접근법은 시각적 기반성(visual grounding)을 강화하고 언어 편향을 완화하며, 시각적으로 정보가 풍부한 샘플과 토큰에만 집중함으로써 훨씬 적은 감독 데이터로도 우수한 성능을 달성한다.

One-sentence Summary

Seulbi Lee and Sangheum Hwang of Seoul National University of Science and Technology propose Visual Information Gain (VIG), a perplexity-based metric that quantifies image contribution per token, enabling selective training to reduce language bias and enhance visual grounding in LVLMs with less supervision.

Key Contributions

  • We introduce Visual Information Gain (VIG), a perplexity-based metric that quantifies how much visual input reduces prediction uncertainty, enabling fine-grained analysis of visual dependency at both sample and token levels across multimodal datasets.
  • VIG reliably identifies visually grounded elements such as colors, spatial relations, and attributes, distinguishing them from tokens driven by textual priors, and aligns with benchmark-level modality dependencies to validate its effectiveness as a grounding indicator.
  • Leveraging VIG, we design a selective training scheme that prioritizes high-VIG samples and tokens, improving visual grounding and reducing language bias while achieving superior performance with significantly less supervision compared to full-data training.

Introduction

The authors leverage a new metric called Visual Information Gain (VIG) to quantify how much visual input reduces prediction uncertainty in Large Vision Language Models (LVLMs), addressing the persistent problem of language bias—where models ignore images and rely on textual priors. Prior work mitigates this through architectural tweaks or decoding tricks, but none measure visual dependency at the sample or token level, leaving models prone to hallucinations and weak grounding. Their main contribution is VIG-guided selective training, which prioritizes visually informative samples and tokens, improving grounding and reducing supervision needs while maintaining performance—offering a data-centric, model-agnostic solution that complements existing methods.

Top Figure
Top Figure

Dataset

The authors use a curated mix of benchmarks and instruction-tuning datasets to evaluate and train their model. Here’s how the data is structured and applied:

  • Evaluation Benchmarks (Visual Understanding):

    • LLaVA-W (LLaVA-Bench In-the-Wild): 24 images, 60 questions covering diverse visuals like memes, paintings, and sketches. Evaluated via GPT-4 (gpt-4o-2024-11-20).
    • MMVet: 200 images, 218 questions with ground-truth references. Assesses conversational reasoning using GPT-4 (gpt-4-0613) for precision and utility scoring.
    • MMBench (English subset): ~3,000 multiple-choice questions spanning 20 skills. Uses GPT-3.5 (gpt-3.5-turbo-0613) to extract answer choices (A–D). Reported on dev split.
    • DocVQA: Focuses on document image understanding (forms, invoices, reports). Evaluated on official validation split; accuracy is reported.
  • Hallucination Evaluation Benchmarks:

    • POPE: Built from MSCOCO, A-OKVQA, and GQA. 27,000 QA pairs from 500 images each. Tests object hallucination with 50:50 existent/non-existent object queries. Uses three negative sampling strategies (random, popular, adversarial); six questions per image. Metrics: Accuracy and F1 averaged across strategies.
    • CHAIR: Measures caption hallucination via two metrics: CHAIR_I (instance-level hallucination ratio) and CHAIR_S (sentence-level hallucination rate). Formulae provided for both.
    • MMHal: 96 challenging queries from OpenImages. Graded by GPT-4 (gpt-4-0613) on 0–5 scale. Reports average score and hallucination rate (score <3 = hallucinated).
  • Instruction-Tuning Data:

    • For LLaVA-1.5 family: Uses instruction dataset from Liu et al. [2].
    • For ShareGPT4V variant: Replaces “detailed description” samples in LLaVA with high-quality captions from ShareGPT4V [7].
    • Selection thresholds τ_p (at p=70) are determined per model and listed in Table C.1.

No cropping or metadata construction is mentioned. The datasets are used as-is for evaluation, while instruction-tuning data is modified per protocol for training.

Method

The authors leverage a standard large vision-language model (LVLM) architecture comprising three core components: a pre-trained vision encoder Ev\mathcal{E}_vEv, an adapter P\mathcal{P}P, and a pre-trained language model D\mathcal{D}D. Training proceeds in two stages: pre-training and instruction tuning. In the pre-training phase, the adapter P\mathcal{P}P is optimized on large-scale image-caption pairs formatted as single-turn instructions. For each image III and its caption, a simple question QQQ (e.g., “Describe this image”) is sampled, and the caption serves as the target answer AAA. This stage aligns the visual feature space with the language model’s semantic space while keeping Ev\mathcal{E}_vEv and D\mathcal{D}D frozen. The visual feature and its projected embedding are computed as fv=Ev(I)f_v = \mathcal{E}_v(I)fv=Ev(I) and zv=P(fv)z_v = \mathcal{P}(f_v)zv=P(fv), respectively. The model’s predictive distribution over answer tokens is denoted qθ(a<t,Q,zv)q_\theta(\cdot \mid a_{<t}, Q, z_v)qθ(a<t,Q,zv), parameterized by θ\thetaθ. The per-sample instruction tuning objective is defined as:

L(AQ,I;θ)=1Tt=1Tlogqθ(ata<t,Q,zv)\mathcal { L } ( A \mid Q , I ; \theta ) = - \frac { 1 } { T } \sum _ { t = 1 } ^ { T } \log q _ { \theta } ( a _ { t } \mid a _ { < t } , Q , z _ { v } )L(AQ,I;θ)=T1t=1Tlogqθ(ata<t,Q,zv)

where ata_tat is the ttt-th token in answer AAA and TTT is the sequence length. For notational convenience, the authors denote qQ()=qθ(Q)q_{Q}(\cdot) = q_{\theta}(\cdot \mid Q)qQ()=qθ(Q) and qI,Q()=qθ(I,Q)q_{I,Q}(\cdot) = q_{\theta}(\cdot \mid I,Q)qI,Q()=qθ(I,Q), representing predictions without and with visual input, respectively.

To quantify the contribution of visual information at the sample level, the authors introduce Visual Information Gain (VIG), defined as the log-ratio of perplexities (PPL) with and without visual conditioning:

VIG=log(PPL(AQ)PPL(AQ,I))\mathrm { V I G } = \log \left( { \frac { \mathrm { P P L } ( A \mid Q ) } { \mathrm { P P L } ( A \mid Q , I ) } } \right)VIG=log(PPL(AQ,I)PPL(AQ))

PPL(AQA|QAQ) is computed using a blurred image to simulate the absence of visual cues, following prior work. A higher VIG indicates greater reduction in model uncertainty when visual input is provided. Reformulating VIG in terms of cross-entropy loss yields:

VIG=L(AQ)L(AQ,I).\mathrm { V I G } = \mathcal { L } ( A | Q ) - \mathcal { L } ( A | Q , I ) .VIG=L(AQ)L(AQ,I).

Under deterministic supervision (as in VQA and captioning datasets), where the target distribution ppp is a Dirac delta, VIG simplifies to the absolute difference in KL divergences:

VIG=DKL(pAQqQ)DKL(pAI,QqI,Q).\mathrm { V I G } = \left| D _ { \mathrm { K L } } ( p _ { A | Q } \| q _ { Q } ) - D _ { \mathrm { K L } } ( p _ { A | I , Q } \| q _ { I , Q } ) \right| .VIG=DKL(pAQqQ)DKL(pAI,QqI,Q).

This formulation shows that VIG empirically measures how much visual information reduces the divergence between the model’s predictions and the ground truth. Expanding further, VIG decomposes into token-level contributions:

VIG=1Tt=1T[logqθ(ata<t,Q)][logqθ(ata<t,Q,zv)]\mathrm { V I G } = \frac { 1 } { T } \sum _ { t = 1 } ^ { T } [ - \log q _ { \theta } ( a _ { t } \mid a _ { < t } , Q ) ] - [ - \log q _ { \theta } ( a _ { t } \mid a _ { < t } , Q , z _ { v } ) ]VIG=T1t=1T[logqθ(ata<t,Q)][logqθ(ata<t,Q,zv)]

Each term represents the token-wise cross-entropy loss with and without visual conditioning, revealing that VIG aggregates per-token visual gains. This decomposition enables fine-grained analysis of which response tokens are most dependent on visual input.

To demonstrate VIG’s practical utility, the authors implement VIG-guided selective training. For each training sample (Ii,Qi,Ai)(I_i, Q_i, A_i)(Ii,Qi,Ai), they compute sample-level VIG VIGi\mathrm{VIG}_iVIGi and token-level VIG VIGi,t\mathrm{VIG}_{i,t}VIGi,t, where:

VIGi=1Tit=1TiVIGi,t.\mathrm { V I G } _ { i } = \frac { 1 } { T _ { i } } \sum _ { t = 1 } ^ { T _ { i } } \mathrm { V I G } _ { i , t } .VIGi=Ti1t=1TiVIGi,t.

They rank samples by VIGi\mathrm{VIG}_iVIGi and select the top p%p\%p%, defining the selected set Sp={iVIGiτp}\mathcal{S}_p = \{ i \mid \mathrm{VIG}_i \geq \tau_p \}Sp={iVIGiτp}, where τp\tau_pτp is the threshold. Within this subset, they further select tokens using the same threshold: for each iSpi \in \mathcal{S}_piSp, the visually informative tokens are Ti+={tVIGi,tτp}\mathcal{T}_i^+ = \{ t \mid \mathrm{VIG}_{i,t} \geq \tau_p \}Ti+={tVIGi,tτp}. During instruction tuning, the loss is computed only over tokens in iSpTi+\bigcup_{i\in\mathcal{S}_p}\mathcal{T}_i^+iSpTi+, ensuring gradients are updated exclusively on the most visually informative regions. This dual-level selection—sample and token—focuses optimization on data with substantial visual grounding, enhancing visual reasoning efficiency.

Experiment

  • VIG effectively measures visual grounding at sample and token levels, aligning with benchmark characteristics: COCO and POPE show strong visual dependency, while GQA and SQA lean toward text reliance.
  • VIG-guided selective training improves performance across vision understanding and hallucination benchmarks while reducing training data by up to 70%, with token-level filtering proving critical for gains.
  • Larger models benefit more from VIG selection, achieving higher performance with fewer tokens, demonstrating improved data efficiency.
  • VIG training enhances visual attention across model layers and reduces “blind faith in text,” making models more robust to misleading textual cues.
  • VIG outperforms or matches existing visual grounding methods without architectural changes and combines well with them for additive gains.
  • Ablation studies confirm that VIG-based selection (sample + token level) consistently outperforms random or sample-only selection, with p=70 offering the best balance of efficiency and performance.
  • Qualitative results show VIG training suppresses object and attribute hallucinations, forcing models to ground responses in actual visual content rather than textual priors.

The authors use VIG-guided selective training to filter both samples and tokens based on visual information gain, achieving improved performance across vision understanding and hallucination benchmarks while reducing training data volume. Results show that models trained with this method allocate more attention to visual tokens and resist misleading textual cues, indicating stronger visual grounding. The approach consistently outperforms random data reduction and complements existing visual grounding methods without architectural changes.

The authors use VIG-guided selective training to filter both samples and tokens based on visual information gain, achieving improved performance with significantly fewer training tokens. Results show that models trained under this strategy exhibit stronger visual grounding, reduced hallucinations, and greater resistance to misleading text, even when using only 70% of the original data. The method consistently outperforms random data reduction and complements existing visual grounding techniques without requiring architectural changes.

The authors use VIG-guided selective training to prioritize visually grounded samples and tokens during instruction tuning, resulting in improved performance across vision understanding and hallucination benchmarks while reducing the number of active tokens by 34% to 79%. Results show that this data-centric approach enhances visual grounding without architectural changes, outperforming both random data reduction and existing training-free or training-based methods. The gains are consistent across model sizes and architectures, indicating that focusing supervision on visually informative content strengthens multimodal alignment and reduces reliance on spurious textual cues.

The authors use identical training configurations for both pretraining and instruction tuning stages across models, maintaining consistent hyperparameters such as learning rate, optimizer, and scheduler to ensure fair comparison. Results show that instruction tuning requires less training time than pretraining, despite using a smaller batch size, indicating computational efficiency in the fine-tuning phase.

The authors use VIG-guided selective training to prioritize visually informative samples and tokens, achieving stronger performance across vision understanding and hallucination benchmarks while using significantly fewer training tokens. Results show that even aggressive filtering (30% selection) maintains or improves performance on open-ended and hallucination tasks, though broader coverage tasks benefit from moderate filtering (70%). This approach consistently reduces reliance on spurious textual cues and enhances visual grounding without architectural changes.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp