Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement

Abstract

Large vision-language models (LVLMs) have achieved impressive results in visual question-answering and reasoning tasks through vision instruction tuning on specific datasets. However, there remains significant room for improvement in aligning visual and language modalities. Existing methods often depend on external models or data, leading to uncontrollable and unstable alignment results. In this paper, we propose SIMA, a self-improvement framework that enhances visual and language modality alignment without external dependencies. SIMA leverages existing vision instruction tuning datasets to self-generate responses, incorporating an in-context self-critic mechanism that constructs preference pairs for tuning. Crucially, our approach allows LVLMs to act as critics by designing effective critic prompts, eliminating the need for additional fine-tuning with external instruction data. We introduce three novel visual metrics within the self-critic process to guide judgment, significantly improving the accuracy of the self-critic. Through extensive experiments across 14 hallucination and comprehensive benchmarks, we demonstrate that SIMA significantly improves LVLM performance and outperforms previous approaches, achieving superior modality alignment.
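The self-improvement loop described above — self-generating candidate responses, then using the same model as an in-context critic to build preference pairs — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`generate_response`, `critic_judge`, `build_preference_pairs`), the two-candidate sampling scheme, and the critic prompt wording are all hypothetical placeholders, and a real run would plug in an actual LVLM rather than the toy model used here.

```python
def generate_response(model, image, question, temperature):
    # Placeholder: stands in for LVLM decoding at a given sampling temperature.
    return model(image, question, temperature)

def critic_judge(model, image, question, resp_a, resp_b):
    # Placeholder in-context self-critic: the same LVLM is prompted to pick the
    # response that is more visually faithful. (The paper additionally embeds
    # three visual metrics in the critic prompt to guide this judgment.)
    prompt = (
        "Compare the two responses to the question about the image and "
        "choose the one with better visual faithfulness.\n"
        f"Question: {question}\nA: {resp_a}\nB: {resp_b}\nAnswer A or B:"
    )
    return model(image, prompt, temperature=0.0)

def build_preference_pairs(model, dataset):
    """Self-generate two candidates per example, let the model critique them,
    and collect (chosen, rejected) pairs for preference tuning."""
    pairs = []
    for image, question in dataset:
        a = generate_response(model, image, question, temperature=0.2)
        b = generate_response(model, image, question, temperature=1.0)
        verdict = critic_judge(model, image, question, a, b)
        chosen, rejected = (a, b) if verdict.strip() == "A" else (b, a)
        pairs.append({"question": question, "chosen": chosen, "rejected": rejected})
    return pairs

# Toy stand-in "model" so the sketch runs without a real LVLM: it answers
# deterministically and, when acting as critic, always prefers response A.
def toy_model(image, text, temperature=0.0):
    if text.startswith("Compare"):
        return "A"
    return f"answer(t={temperature})"

pairs = build_preference_pairs(toy_model, [("img0", "What color is the cat?")])
print(pairs[0]["chosen"], pairs[0]["rejected"])
```

Note that no external critic model or extra instruction data enters the loop: the same `model` callable plays both generator and judge, which is the self-contained property the abstract emphasizes.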
