
Flamingo: a Visual Language Model for Few-Shot Learning

Alayrac, Jean-Baptiste ; Donahue, Jeff ; Luc, Pauline ; Miech, Antoine ; Barr, Iain ; Hasson, Yana ; Lenc, Karel ; Mensch, Arthur ; Millican, Katie ; Reynolds, Malcolm ; Ring, Roman ; Rutherford, Eliza ; Cabi, Serkan ; Han, Tengda ; Gong, Zhitao ; Samangooei, Sina ; Monteiro, Marianne ; Menick, Jacob ; Borgeaud, Sebastian ; Brock, Andrew ; Nematzadeh, Aida ; Sharifzadeh, Sahand ; Binkowski, Mikolaj ; Barreira, Ricardo ; Vinyals, Oriol ; Zisserman, Andrew ; Simonyan, Karen
Abstract

Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.
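To make "prompting the model with task-specific examples" concrete, here is a minimal sketch of how a few-shot prompt with interleaved images and text might be assembled for visual question-answering. It is an illustration only, not Flamingo's actual interface: the `IMAGE_TOKEN` placeholder, the `Example` class, the `build_few_shot_prompt` helper, the "Question/Answer" template, and the file names are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical placeholder marking where an image sits in the text stream.
IMAGE_TOKEN = "<image>"


@dataclass
class Example:
    """One annotated support example (a 'shot'). Illustrative type only."""
    image_path: str
    question: str
    answer: str


def build_few_shot_prompt(
    shots: List[Example], query_image: str, query_question: str
) -> Tuple[str, List[str]]:
    """Interleave support examples with the query.

    Returns the text prompt and the image paths in the order their
    placeholders appear in the text.
    """
    parts: List[str] = []
    images: List[str] = []
    for shot in shots:
        images.append(shot.image_path)
        parts.append(
            f"{IMAGE_TOKEN}Question: {shot.question} Answer: {shot.answer}"
        )
    # The query follows the same template but leaves the answer open,
    # so the model's continuation is its prediction.
    images.append(query_image)
    parts.append(f"{IMAGE_TOKEN}Question: {query_question} Answer:")
    return "\n".join(parts), images


# Made-up file names and questions, for illustration only.
shots = [
    Example("cat.jpg", "What animal is this?", "a cat"),
    Example("dog.jpg", "What animal is this?", "a dog"),
]
prompt, images = build_few_shot_prompt(shots, "query.jpg", "What animal is this?")
print(prompt)
```

Under these assumptions, adapting to a new task changes only the support examples and the template; no model weights are updated.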
