Proxy-Pointer RAG delivers multimodal answers without multimodal embeddings

Enterprise chatbots often struggle to reliably display images grounded in source documents, typically offering only links to external files. This limitation persists despite the clear value of visual evidence for users in fields like real estate and technical support. A new open-source solution, Multimodal Proxy-Pointer RAG, addresses this by treating documents as hierarchical trees of semantic blocks rather than fragmented text chunks.

Traditional multimodal RAG approaches often fail because retrieval units are misaligned with semantic meaning. Methods that rely on image captioning can split visual context across unrelated chunks, while those that rely on multimodal embeddings struggle with grounding: visual similarity does not guarantee relevance, so these systems either return incorrect visuals or none at all.

The Proxy-Pointer architecture solves this by retrieving complete document sections based on structural boundaries. Instead of using multimodal embeddings to search for images, the system stores image paths within the text of each section. When a user query is processed, the retrieval mechanism selects entire sections containing relevant text, and the language model then uses the context of that specific section to decide which images to display. The process mirrors how people read: surrounding context determines which visuals matter.

In a prototype covering five AI research papers with more than 270 figures and tables, the system achieved 95% accuracy on a 20-question benchmark. The prototype used a text-only embedding model and the Adobe PDF Extract API to convert documents into markdown files with nested image paths.

The pipeline includes structure-guided chunking, breadcrumb injection, and semantic re-ranking to handle vague section headings. The synthesis stage reviews the top retrieved sections and selects up to six relevant images, generating accurate labels even for uncaptioned figures. An optional vision filter can further refine these selections by having the LLM visually inspect the images, though this adds latency.

The system also produced zero instances of unrelated images appearing in results, significantly boosting trust in the bot's output. Key technical details include a hierarchical tree in which each node tracks its own figure paths, and re-ranking of candidates with semantic snippets to compensate for non-descriptive academic headings; illustrative sketches of the section tree, breadcrumb injection, and the image-selection step follow below.

The system still faces challenges such as LLM non-determinism and occasionally detached image paths, but these are manageable through careful path naming and context window management.

The solution demonstrates that high-accuracy multimodal retrieval is achievable without expensive multimodal embeddings. By aligning retrieval with semantic structure, it lets chatbots provide precise visual evidence alongside text. The code is open-source under the MIT License, so developers can clone the repository, apply the pipeline to their own documents, and deploy custom multimodal RAG agents. The approach marks a meaningful step toward practical, grounded visual responses in enterprise AI applications.
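To make the "proxy pointer" idea concrete, here is a minimal Python sketch of the section tree described above. The names (`SectionNode`, `parse_markdown_tree`) are illustrative, not taken from the project: each node keeps the image paths found in its own markdown text, so retrieval never has to embed the images themselves.

```python
import re
from dataclasses import dataclass, field

IMAGE_RE = re.compile(r"!\[[^\]]*\]\(([^)]+)\)")   # markdown image syntax
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")       # markdown headings

@dataclass
class SectionNode:
    heading: str
    level: int
    text: str = ""
    figure_paths: list[str] = field(default_factory=list)  # proxy pointers
    children: list["SectionNode"] = field(default_factory=list)

def parse_markdown_tree(markdown: str) -> SectionNode:
    """Build a hierarchical tree of sections from exported markdown,
    collecting the image paths that appear inside each section's text."""
    root = SectionNode(heading="ROOT", level=0)
    stack = [root]
    for line in markdown.splitlines():
        m = HEADING_RE.match(line)
        if m:
            node = SectionNode(heading=m.group(2).strip(), level=len(m.group(1)))
            while stack[-1].level >= node.level:
                stack.pop()
            stack[-1].children.append(node)
            stack.append(node)
        else:
            current = stack[-1]
            current.text += line + "\n"
            current.figure_paths.extend(IMAGE_RE.findall(line))
    return root
```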
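Breadcrumb injection can be sketched the same way, reusing the `SectionNode` tree from above: before embedding, each section's text is prefixed with the trail of headings above it, so a vague heading such as "Results" still carries document context, and the same breadcrumb-plus-text string can double as the semantic snippet used for re-ranking. The function names and the idea of truncating to a fixed snippet length are assumptions for illustration, not the repository's actual API.

```python
def flatten_with_breadcrumbs(node, trail=()):
    """Yield (breadcrumb_text, section) pairs for every section in the tree."""
    trail = trail + (node.heading,)
    if node.text.strip():
        breadcrumb = " > ".join(t for t in trail if t != "ROOT")
        yield f"{breadcrumb}\n\n{node.text.strip()}", node
    for child in node.children:
        yield from flatten_with_breadcrumbs(child, trail)

def semantic_snippet(breadcrumb_text, max_chars=300):
    """Short snippet used to re-rank candidates whose headings are vague."""
    return breadcrumb_text[:max_chars]

# Usage idea: embed the breadcrumbed texts with any text-only embedding model,
# retrieve the top sections for a query, then re-rank them by comparing the
# query against semantic_snippet(...) rather than the bare heading.
```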
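Finally, the synthesis step can be approximated as follows: the LLM only ever sees image paths that already live inside the retrieved sections, and whatever it returns is validated against that set, which is what keeps unrelated images out of answers. Here `call_llm` is a stand-in for any chat-completion client, and the prompt wording is a guess rather than the project's actual prompt.

```python
def select_images(question, sections, call_llm, max_images=6):
    """Ask the LLM to pick figures, then keep only paths that actually
    exist in the retrieved sections."""
    allowed = {p for s in sections for p in s.figure_paths}
    context = "\n\n".join(f"## {s.heading}\n{s.text}" for s in sections)
    prompt = (
        f"Question: {question}\n\n{context}\n\n"
        f"List up to {max_images} image paths from the text above that help "
        "answer the question, one per line, each followed by a short label."
    )
    selections = []
    for line in call_llm(prompt).splitlines():
        parts = line.split(maxsplit=1)
        if parts and parts[0] in allowed:        # reject hallucinated paths
            label = parts[1] if len(parts) > 1 else ""
            selections.append((parts[0], label))
    return selections[:max_images]
```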
