Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee

Abstract

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model, and code base publicly available.
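The abstract describes connecting a vision encoder to an LLM and training the combined model end to end on instruction-following data. A minimal sketch of one way such a connection can work is shown below; the dimensions, the single linear projection, and the class name are illustrative assumptions for this sketch, not the exact components used by the paper.

```python
import torch
import torch.nn as nn


class VisionLanguageConnector(nn.Module):
    """Sketch: project vision-encoder patch features into the LLM's
    token-embedding space so images can be fed to the language model
    as a prefix of "visual tokens" alongside text embeddings."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # A single linear layer mapping patch features to LLM embedding size
        # (hypothetical dimensions chosen only for illustration).
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)


# Usage sketch: projected visual tokens would be concatenated with text
# token embeddings before being passed to the language model.
connector = VisionLanguageConnector()
dummy_patches = torch.randn(1, 256, 1024)  # e.g. a 16x16 grid of patch features
visual_tokens = connector(dummy_patches)   # (1, 256, 4096)
```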

