HyperAI

SearchLVLMs Framework

The SearchLVLMs framework is a plug-and-play solution jointly proposed in 2024 by Shanghai Artificial Intelligence Laboratory (OpenGVLab), Beijing Institute of Technology, Zhejiang University, and the University of Hong Kong. It aims to enhance the ability of existing large vision-language models (LVLMs) to handle visual question answering (VQA) involving up-to-date knowledge. The framework was introduced in the paper "SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up-to-Date Internet Knowledge".

Large vision-language models (such as the LLaVA family) perform poorly in many situations because they cannot be updated frequently and are therefore unaware of the latest knowledge (e.g., the singer of the theme song of a newly released movie). The SearchLVLMs framework addresses this problem by augmenting the inference stage with internet search, helping LVLMs acquire up-to-date knowledge.

The SearchLVLMs framework consists of three main stages: query generation, search engine invocation, and hierarchical filtering. In the query generation stage, the framework interprets the question together with the image and converts them into a text query suitable for search engines. In the search engine invocation stage, the search engine category to invoke is chosen based on the question type. Finally, in the hierarchical filtering stage, a trained model efficiently identifies the most helpful content among the web pages returned by the search engine.
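The three stages above can be sketched as a simple pipeline. This is a minimal, illustrative sketch only: all function names are hypothetical, the search call is stubbed with canned snippets, and a crude word-overlap score stands in for the trained hierarchical filtering model described in the paper.

```python
# Hypothetical sketch of the SearchLVLMs three-stage pipeline.
# In the real framework, query generation uses an LVLM and filtering
# uses a trained model; both are simplified here for illustration.

def generate_query(question: str, image_caption: str) -> str:
    """Stage 1: convert the question and image content into a text search query."""
    return f"{image_caption} {question}"

def call_search_engine(query: str) -> list[str]:
    """Stage 2: invoke a search engine (stubbed here with canned snippets)."""
    return [
        "The film premiered in 2024; its theme song was performed by a pop singer.",
        "Unrelated page about cooking recipes.",
        "Box-office figures for the new movie release.",
    ]

def hierarchical_filter(query: str, snippets: list[str], top_k: int = 1) -> list[str]:
    """Stage 3: keep the snippets most relevant to the query.
    A word-overlap heuristic stands in for the trained filtering model."""
    q_words = set(query.lower().split())
    scored = sorted(
        snippets,
        key=lambda s: len(q_words & set(s.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def answer_with_search(question: str, image_caption: str) -> str:
    """Run the full pipeline and return the top piece of retrieved evidence."""
    query = generate_query(question, image_caption)
    evidence = hierarchical_filter(query, call_search_engine(query))
    # An LVLM would now answer using the question, image, and this evidence.
    return evidence[0]
```

In a real deployment, the stub in stage 2 would be replaced by actual search engine calls, and the retrieved evidence would be appended to the LVLM's prompt at inference time.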

Experimental results show that the SearchLVLMs framework significantly improves the performance of LVLMs on questions that require the latest knowledge, achieving accuracy roughly 25% higher than GPT-4V. SearchLVLMs thus provides multimodal large models with a plug-and-play way to seamlessly integrate up-to-date internet knowledge and respond to real-time information.