Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement
Abstract
Large vision-language models (LVLMs) have achieved impressive results in visual question-answering and reasoning tasks through vision instruction tuning on specific datasets. However, there remains significant room for improvement in aligning the visual and language modalities. Existing methods often depend on external models or data, leading to uncontrollable and unstable alignment results. In this paper, we propose SIMA, a self-improvement framework that enhances visual and language modality alignment without external dependencies. SIMA leverages existing vision instruction tuning datasets to self-generate responses and incorporates an in-context self-critic mechanism that constructs preference pairs for tuning. Crucially, our approach allows LVLMs to act as critics through carefully designed critic prompts, eliminating the need for additional fine-tuning on external instruction data. We introduce three novel visual metrics within the self-critic process to guide judgment, significantly improving the accuracy of self-criticism. Through extensive experiments across 14 hallucination and comprehensive benchmarks, we demonstrate that SIMA significantly improves LVLM performance and outperforms previous approaches, achieving superior modality alignment.
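To make the self-improvement loop described above concrete, the following is a minimal sketch of how self-generated responses and an in-context self-critic could yield preference pairs. It assumes two sampled candidates per prompt and a single comparative critic prompt; all function names, the prompt wording, and the metric phrasing are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a SIMA-style self-improvement loop: the LVLM answers
# prompts from an existing vision instruction tuning dataset, then critiques its
# own candidate responses to build preference pairs for preference tuning.
# Names, the critic prompt, and the metric wording are assumptions for
# illustration only.

from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class PreferencePair:
    image: str     # image path or identifier
    prompt: str    # instruction from the existing tuning dataset
    chosen: str    # response the self-critic preferred
    rejected: str  # response the self-critic rejected

# Assumed critic prompt; the paper's three visual metrics are only paraphrased
# here generically as visual-faithfulness criteria.
CRITIC_PROMPT = (
    "Given the image and the question '{prompt}', compare the two responses "
    "below on visual faithfulness (objects, attributes, relations). "
    "Answer 'A' if Response A is better, otherwise 'B'.\n"
    "Response A: {a}\nResponse B: {b}"
)

def build_preference_pairs(
    dataset: List[Tuple[str, str]],        # (image, prompt) pairs
    generate: Callable[[str, str], str],   # the LVLM acting as responder
    critique: Callable[[str, str], str],   # the same LVLM acting as critic
) -> List[PreferencePair]:
    pairs: List[PreferencePair] = []
    for image, prompt in dataset:
        # Self-generate two candidate responses (e.g., via sampling).
        resp_a = generate(image, prompt)
        resp_b = generate(image, prompt)
        # In-context self-critic: the model judges its own outputs.
        verdict = critique(
            image, CRITIC_PROMPT.format(prompt=prompt, a=resp_a, b=resp_b)
        )
        if verdict.strip().upper().startswith("A"):
            chosen, rejected = resp_a, resp_b
        else:
            chosen, rejected = resp_b, resp_a
        pairs.append(PreferencePair(image, prompt, chosen, rejected))
    return pairs

# The resulting pairs would then feed a preference-tuning step (e.g., a
# DPO-style objective), closing the loop without any external model or data.
```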