Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement
Abstract
Large vision-language models (LVLMs) have achieved impressive results in visual question-answering and reasoning tasks through vision instruction tuning on specific datasets. However, there remains significant room for improvement in aligning the visual and language modalities. Existing methods often depend on external models or data, leading to uncontrollable and unstable alignment results. In this paper, we propose SIMA, a self-improvement framework that enhances visual and language modality alignment without external dependencies. SIMA leverages existing vision instruction tuning datasets to self-generate responses and incorporates an in-context self-critic mechanism that constructs preference pairs for tuning. Crucially, our approach allows LVLMs to act as critics through carefully designed critic prompts, eliminating the need for additional fine-tuning on external instruction data. We introduce three novel visual metrics within the self-critic process to guide judgment, significantly improving the accuracy of self-criticism. Through extensive experiments across 14 hallucination and comprehensive benchmarks, we demonstrate that SIMA significantly improves LVLM performance and outperforms previous approaches, achieving superior modality alignment.
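To make the self-improvement loop described above concrete, the following is a minimal sketch of how self-generated responses and an in-context self-critic could yield preference pairs. It assumes two sampled candidates per prompt and a single comparative critic prompt; all function names, the prompt wording, and the metric phrasing are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a SIMA-style self-improvement loop: the LVLM answers
# prompts from an existing vision instruction tuning dataset, then critiques its
# own candidate responses to build preference pairs for preference tuning.
# Names, the critic prompt, and the metric wording are assumptions for
# illustration only.

from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class PreferencePair:
    image: str     # image path or identifier
    prompt: str    # instruction from the existing tuning dataset
    chosen: str    # response the self-critic preferred
    rejected: str  # response the self-critic rejected

# Assumed critic prompt; the paper's three visual metrics are only paraphrased
# here generically as visual-faithfulness criteria.
CRITIC_PROMPT = (
    "Given the image and the question '{prompt}', compare the two responses "
    "below on visual faithfulness (objects, attributes, relations). "
    "Answer 'A' if Response A is better, otherwise 'B'.\n"
    "Response A: {a}\nResponse B: {b}"
)

def build_preference_pairs(
    dataset: List[Tuple[str, str]],        # (image, prompt) pairs
    generate: Callable[[str, str], str],   # the LVLM acting as responder
    critique: Callable[[str, str], str],   # the same LVLM acting as critic
) -> List[PreferencePair]:
    pairs: List[PreferencePair] = []
    for image, prompt in dataset:
        # Self-generate two candidate responses (e.g., via sampling).
        resp_a = generate(image, prompt)
        resp_b = generate(image, prompt)
        # In-context self-critic: the model judges its own outputs.
        verdict = critique(
            image, CRITIC_PROMPT.format(prompt=prompt, a=resp_a, b=resp_b)
        )
        if verdict.strip().upper().startswith("A"):
            chosen, rejected = resp_a, resp_b
        else:
            chosen, rejected = resp_b, resp_a
        pairs.append(PreferencePair(image, prompt, chosen, rejected))
    return pairs

# The resulting pairs would then feed a preference-tuning step (e.g., a
# DPO-style objective), closing the loop without any external model or data.
```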