Retrieval-Augmented Perception
The Retrieval-Augmented Perception (RAP) plug-in was proposed by a team from Nanyang Technological University and Wuhan University in March 2025. The work was presented in the paper "Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG", which was accepted to ICML 2025 as a Spotlight paper.
RAP is a training-free plug-in for high-resolution image perception built on retrieval-augmented generation (RAG). It aims to improve the performance of multimodal large language models (MLLMs) on high-resolution image perception tasks while reducing computational cost, giving the model stronger understanding, contextual awareness, and reasoning in complex visual scenes. Experimental results show that RAP yields significant gains on multiple high-resolution image benchmarks; for example, LLaVA-v1.5-13B improves by 43% on V* Bench and 19% on HR-Bench.
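To make the general idea concrete, below is a minimal sketch of retrieval-style high-resolution perception: the image is tiled into fixed-size crops, each crop is scored against the text query with an off-the-shelf CLIP encoder, and only the top-scoring crops are forwarded to the MLLM instead of the full-resolution image. The crop size, the openai/clip-vit-base-patch32 checkpoint, the row-major re-ordering, and the top_k value are illustrative assumptions for this sketch, not the paper's exact procedure.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Off-the-shelf CLIP encoder used here purely for illustration.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def tile_image(image: Image.Image, crop_size: int = 336):
    """Split a high-resolution image into a grid of fixed-size crops."""
    crops, boxes = [], []
    w, h = image.size
    for top in range(0, h, crop_size):
        for left in range(0, w, crop_size):
            box = (left, top, min(left + crop_size, w), min(top + crop_size, h))
            crops.append(image.crop(box))
            boxes.append(box)
    return crops, boxes

def retrieve_crops(image: Image.Image, query: str, top_k: int = 4):
    """Return the top-k crops most similar to the text query by CLIP score."""
    crops, boxes = tile_image(image)
    inputs = processor(text=[query], images=crops, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_text has shape (1, num_crops): query-to-crop similarity scores.
    scores = out.logits_per_text.squeeze(0)
    top = scores.topk(min(top_k, len(crops))).indices.tolist()
    # Keep row-major reading order so the spatial layout is roughly preserved
    # (an assumption of this sketch, not the paper's layout mechanism).
    top_sorted = sorted(top)
    return [crops[i] for i in top_sorted], [boxes[i] for i in top_sorted]

# Usage: pass only the retrieved crops (plus the query) to the MLLM
# instead of the full-resolution image, e.g.
# crops, boxes = retrieve_crops(Image.open("scene.jpg"), "Where is the red car?")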