
Abstract

Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question. Early studies retrieve the required knowledge from explicit knowledge bases (KBs), which often introduces information irrelevant to the question and thus restricts the performance of their models. Recent works have resorted to using a powerful large language model (LLM) as an implicit knowledge engine to acquire the necessary knowledge for answering. Despite the encouraging results achieved by these methods, we argue that they have not fully activated the capacity of the "blind" LLM, as the provided textual input is insufficient to depict the visual information required to answer the question. In this paper, we present Prophet, a conceptually simple, flexible, and general framework designed to prompt an LLM with answer heuristics for knowledge-based VQA. Specifically, we first train a vanilla VQA model on a specific knowledge-based VQA dataset without external knowledge. After that, we extract two types of complementary answer heuristics from the VQA model: answer candidates and answer-aware examples. The two types of answer heuristics are jointly encoded into a formatted prompt to facilitate the LLM's understanding of both the image and the question, thus generating a more accurate answer. By incorporating the state-of-the-art LLM GPT-3, Prophet significantly outperforms existing state-of-the-art methods on four challenging knowledge-based VQA datasets. Prophet is general: it can be instantiated with combinations of different VQA models (i.e., both discriminative and generative ones) and different LLMs (i.e., both commercial and open-source ones). Moreover, Prophet can also be integrated with modern large multimodal models at different stages, a variant named Prophet++, to further improve capabilities on knowledge-based VQA tasks.
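To make the prompting step concrete, the following is a minimal sketch of how answer candidates and answer-aware examples might be encoded into one text prompt. It is an illustration under stated assumptions, not the authors' implementation: the `Example` class, the helper names, the instruction header, and the `Question / Candidates / Answer` layout are all hypothetical, and the abstract does not specify how the examples are selected, so they are simply passed in.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    """A question plus the VQA model's answer candidates (hypothetical container)."""
    question: str
    candidates: list  # list of (answer, confidence) pairs
    answer: Optional[str] = None  # known answer for in-context examples, None for the test input


def top_k_candidates(scores: dict, k: int) -> list:
    """Keep the k highest-scoring answers from the VQA model's output as candidates."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


def format_block(ex: Example) -> str:
    """Render one example; the answer slot is left empty for the test input."""
    cands = ", ".join(f"{ans} ({conf:.2f})" for ans, conf in ex.candidates)
    answer = ex.answer if ex.answer is not None else ""
    return f"Question: {ex.question}\nCandidates: {cands}\nAnswer: {answer}".rstrip()


def build_prompt(examples: list, test: Example) -> str:
    """Join the in-context examples and the test input into a single prompt string."""
    header = "Answer the question using the candidates as hints."  # assumed wording
    blocks = [format_block(ex) for ex in examples] + [format_block(test)]
    return header + "\n\n" + "\n\n".join(blocks)


if __name__ == "__main__":
    # Made-up scores standing in for a trained VQA model's output.
    scores = {"surfing": 0.70, "swimming": 0.20, "sailing": 0.05}
    shot = Example("What sport needs this board?", [("skateboarding", 0.9)], "skateboarding")
    test = Example("What is the man doing?", top_k_candidates(scores, k=2))
    print(build_prompt([shot], test))
```

The resulting string would then be sent to whichever LLM is in use; the candidates and their confidences give the text-only model a summary of what the VQA model inferred from the image.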
