Flamingo: a Visual Language Model for Few-Shot Learning

Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.
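
To make the few-shot prompting setup concrete, the sketch below shows one plausible way to assemble an interleaved image-text prompt from a handful of support examples plus a query image. The `<image>` placeholder, the `build_few_shot_prompt` helper, and the "Output:" prefix are illustrative assumptions, not the paper's exact prompt format.

```python
# Illustrative sketch (assumptions, not the paper's exact format): building an
# interleaved few-shot prompt for a Flamingo-style VLM. Each support example is
# an image placeholder followed by its target text; the query image comes last
# and the model is expected to complete the text.

from typing import List, Tuple


def build_few_shot_prompt(
    support: List[Tuple[str, str]],  # (image_path, target_text) support examples
    query_image: str,
    task_prefix: str = "Output:",
) -> Tuple[List[str], str]:
    """Return the ordered image paths and the interleaved text prompt."""
    images: List[str] = []
    text_parts: List[str] = []
    for image_path, target_text in support:
        images.append(image_path)
        text_parts.append(f"<image> {task_prefix} {target_text}")
    # Append the query image with an open-ended continuation slot.
    images.append(query_image)
    text_parts.append(f"<image> {task_prefix}")
    return images, " ".join(text_parts)


if __name__ == "__main__":
    support_examples = [
        ("cat.jpg", "A cat sleeping on a sofa."),
        ("dog.jpg", "A dog catching a frisbee."),
    ]
    images, prompt = build_few_shot_prompt(support_examples, "query.jpg")
    print(images)   # ['cat.jpg', 'dog.jpg', 'query.jpg']
    print(prompt)   # "<image> Output: A cat sleeping ... <image> Output:"
```

The same prompt construction covers the whole spectrum of tasks described above: captioning uses image-caption support pairs, open-ended VQA uses question-answer targets, and close-ended tasks score each candidate completion.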