Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training

Visual question answering (VQA) is a hallmark of vision and language reasoning and a challenging task under the zero-shot setting. We propose Plug-and-Play VQA (PNP-VQA), a modular framework for zero-shot VQA. In contrast to most existing works, which require substantial adaptation of pretrained language models (PLMs) for the vision modality, PNP-VQA requires no additional training of the PLMs. Instead, we propose to use natural language and network interpretation as an intermediate representation that glues pretrained models together. We first generate question-guided informative image captions, and pass the captions to a PLM as context for question answering. Surpassing end-to-end trained baselines, PNP-VQA achieves state-of-the-art results on zero-shot VQAv2 and GQA. With 11B parameters, it outperforms the 80B-parameter Flamingo model by 8.5% on VQAv2. With 738M PLM parameters, PNP-VQA achieves an improvement of 9.1% on GQA over FewVLM with 740M PLM parameters. Code is released at https://github.com/salesforce/LAVIS/tree/main/projects/pnp-vqa
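
The following is a minimal illustrative sketch of the two-stage idea described above: caption the image with an off-the-shelf vision-language model, then hand the captions to a text-only PLM as context for answering the question, with no parameter updates anywhere. It is not the authors' implementation (which also uses question-guided patch selection via network interpretation, BLIP captioning, and UnifiedQA); the specific Hugging Face model names and the prompt format are assumptions made for illustration.

```python
# Sketch of a captions-as-context zero-shot VQA pipeline.
# Assumptions: Hugging Face BLIP captioner and UnifiedQA checkpoints; prompt
# format "question \n context" is an assumed UnifiedQA-style convention.
from PIL import Image
import torch
from transformers import (
    BlipProcessor, BlipForConditionalGeneration,
    AutoTokenizer, AutoModelForSeq2SeqLM,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stage 1: an image captioning model produces textual descriptions of the image.
cap_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
cap_model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)

# Stage 2: a text-only QA model answers the question from the captions alone.
qa_tokenizer = AutoTokenizer.from_pretrained("allenai/unifiedqa-t5-base")
qa_model = AutoModelForSeq2SeqLM.from_pretrained("allenai/unifiedqa-t5-base").to(device)


def zero_shot_vqa(image: Image.Image, question: str, num_captions: int = 5) -> str:
    # Sample several diverse captions to serve as textual context for the question.
    cap_inputs = cap_processor(images=image, return_tensors="pt").to(device)
    caption_ids = cap_model.generate(
        **cap_inputs,
        do_sample=True,
        top_p=0.9,
        num_return_sequences=num_captions,
        max_new_tokens=30,
    )
    captions = cap_processor.batch_decode(caption_ids, skip_special_tokens=True)

    # Feed question + captions to the PLM; neither model is trained or adapted.
    context = " ".join(captions)
    prompt = f"{question} \n {context}"
    qa_inputs = qa_tokenizer(prompt, return_tensors="pt", truncation=True).to(device)
    answer_ids = qa_model.generate(**qa_inputs, max_new_tokens=10)
    return qa_tokenizer.decode(answer_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    img = Image.open("example.jpg").convert("RGB")  # hypothetical input image
    print(zero_shot_vqa(img, "What color is the bus?"))
```

The released code linked above implements the full method, including the question-guided caption generation that this sketch omits.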