BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi

Abstract

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
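The frozen/trainable split described in the abstract can be illustrated with a small bookkeeping sketch. It is not the authors' implementation: the `Module` class, the component names, and every parameter count below are hypothetical placeholders chosen only to show why training a lightweight bridge between two frozen models keeps the trainable parameter count small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    """A model component with a parameter count and a trainable flag."""
    name: str
    num_params: int  # hypothetical size, NOT a figure from the paper
    trainable: bool


def trainable_params(modules):
    """Sum the parameters of the components that receive gradient updates."""
    return sum(m.num_params for m in modules if m.trainable)


def total_params(modules):
    """Sum the parameters of all components, frozen or not."""
    return sum(m.num_params for m in modules)


# Placeholder components: two frozen models bridged by one trainable module.
image_encoder = Module("frozen_image_encoder", 1_000_000_000, trainable=False)
q_former = Module("querying_transformer", 100_000_000, trainable=True)
language_model = Module("frozen_language_model", 3_000_000_000, trainable=False)

# Stage 1 pairs the bridge with the frozen image encoder;
# stage 2 additionally connects it to the frozen language model.
stage_1 = [image_encoder, q_former]
stage_2 = [image_encoder, q_former, language_model]

stage_1_trainable = trainable_params(stage_1)
stage_2_trainable = trainable_params(stage_2)
stage_2_total = total_params(stage_2)
trainable_fraction = stage_2_trainable / stage_2_total

print(f"stage 1 trainable: {stage_1_trainable:,}")
print(f"stage 2 trainable: {stage_2_trainable:,} of {stage_2_total:,}")
print(f"trainable fraction in stage 2: {trainable_fraction:.2%}")
```

In this sketch the trainable count is the same in both stages, because only the bridging module is marked trainable while the image encoder and language model stay frozen.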

