Small but Powerful: Boost Multimodal Search Accuracy with Llama Nemotron RAG Models for Visual Document Retrieval
In real-world applications, data goes beyond plain text. Documents often include charts, tables, scanned contracts, screenshots, and slide decks that text-only retrieval systems fail to capture. Multimodal retrieval-augmented generation (RAG) pipelines solve this by enabling systems to retrieve and reason over text, images, and layout structure together, delivering more accurate and actionable insights.

This post explores two compact yet powerful Llama Nemotron models designed for multimodal visual document retrieval: llama-nemotron-embed-vl-1b-v2 and llama-nemotron-rerank-vl-1b-v2. Both models are optimized for developers building enterprise-grade AI systems that search large collections of PDFs and images with high precision and low latency.

Why Multimodal RAG Needs High-Quality Retrieval

Multimodal RAG combines a retriever with a vision-language model (VLM) so that responses are grounded in both textual content and visual elements. Retrieval quality hinges on two components: embeddings and reranking. Embeddings determine which documents are retrieved, while the reranker refines the order of the top candidates. Inaccurate embeddings or reranking increase the risk of hallucination, even when outputs sound confident. Pairing multimodal embeddings with a multimodal reranker keeps the VLM anchored to the correct content.

State-of-the-Art in Commercial Multimodal Search

The llama-nemotron-embed-vl-1b-v2 model is a dense, single-vector embedding model that efficiently encodes both visual and textual information into a unified 2048-dimensional representation. It is fully compatible with standard vector databases and supports millisecond-scale search at enterprise scale.
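To make the retrieval step concrete, here is a minimal sketch of how a single-vector embedding model plugs into similarity search: every query and document page becomes one 2048-dimensional vector, and retrieval over normalized vectors reduces to a dot product. The `embed` function below is a hypothetical stand-in that returns deterministic random unit vectors, not the real model API; in practice you would call llama-nemotron-embed-vl-1b-v2 and store the resulting vectors in a vector database.

```python
import numpy as np

DIM = 2048  # output dimension of the embedding model

def embed(item_id: int) -> np.ndarray:
    """Hypothetical stand-in for the embedding model: returns a
    deterministic unit vector per item instead of a real embedding."""
    rng = np.random.default_rng(item_id)
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

# Index one vector per document page, as a vector database would.
doc_ids = [101, 102, 103]
index = np.stack([embed(d) for d in doc_ids])  # shape (3, 2048)

# With unit vectors, cosine similarity is just a dot product.
query_vec = embed(101)  # query embedding; identical to page 101 here
scores = index @ query_vec
ranked = [doc_ids[i] for i in np.argsort(-scores)]
print(ranked)  # page 101 ranks first because its vector matches exactly
```

Because the model emits a single dense vector per item, this dot-product search is exactly what off-the-shelf vector databases accelerate, which is what makes millisecond-scale search possible.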
The llama-nemotron-rerank-vl-1b-v2 model is a cross-encoder reranker that reorders retrieved results to boost relevance and improve downstream answer quality, without requiring changes to your existing index or storage setup.

Evaluation on Five Visual Document Retrieval Benchmarks

The models were tested across five datasets: ViDoRe V1, V2, and V3 (a composite benchmark built from eight public datasets), plus two internal enterprise datasets. Results show significant gains:

- On average, llama-nemotron-embed-vl-1b-v2 outperforms both its predecessor, llama-3.2-nemoretriever-1b-vlm-embed-v1, and the text-only llama-nemotron-embed-1b-v2 across all modalities.
- When combined with the reranker, retrieval accuracy improves by 7.2%, 6.9%, and 6% for text, image, and image+text inputs, respectively.
- Image+text input leverages both the visual content and extracted text (via tools such as NV-Ingest), enabling richer representations and higher retrieval precision.

Comparison with Other Rerankers

When benchmarked against jina-reranker-m0 and MonoQwen2-VL-v0.1, llama-nemotron-rerank-vl-1b-v2 delivers superior performance on text and image+text tasks. Unlike jina-reranker-m0, which is restricted to non-commercial use, the new model ships with a permissive commercial license, making it well suited for enterprise adoption.

Architectural Design and Training

The embedding model is a 1.7B-parameter transformer-based encoder fine-tuned from the NVIDIA Eagle family. It combines the Llama 3.2 1B language model with the SigLIP 2 400M vision encoder, applies mean pooling to produce a single embedding vector, and is trained with contrastive learning to maximize similarity between relevant query-document pairs.

The reranker is also a 1.7B-parameter cross-encoder, fine-tuned on a mix of public and synthetic data. It uses mean-pooled hidden states and a binary classification head trained with cross-entropy loss to score and rank documents.
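The retrieve-then-rerank pattern described above can be sketched as a two-stage pipeline: a fast bi-encoder (the embedding model) narrows the corpus to a candidate list, then a cross-encoder (the reranker) rescores each (query, page) pair jointly. Both scoring functions below are toy word-overlap placeholders standing in for the real models; the pipeline shape, not the scores, is the point.

```python
def retrieve(query: str, pages: list[str], top_k: int = 50) -> list[str]:
    """Stage 1 placeholder: in production, embed the query once and run a
    fast vector search; here Jaccard word overlap stands in as the score."""
    q = set(query.split())
    def sim(page: str) -> float:
        p = set(page.split())
        return len(q & p) / max(len(q | p), 1)
    return sorted(pages, key=sim, reverse=True)[:top_k]

def rerank_score(query: str, page: str) -> float:
    """Stage 2 placeholder: a cross-encoder scores the (query, page) pair
    jointly, which is slower but more accurate than vector similarity."""
    return sum(page.split().count(w) for w in query.split())

def search(query: str, pages: list[str],
           top_k: int = 50, final_k: int = 5) -> list[str]:
    # Cheap retriever narrows the corpus; the costlier reranker orders
    # only the short list, keeping end-to-end latency low.
    candidates = retrieve(query, pages, top_k)
    candidates.sort(key=lambda p: rerank_score(query, p), reverse=True)
    return candidates[:final_k]

pages = [
    "quarterly revenue chart with regional breakdown",
    "interrupt controller block diagram and low power design notes",
    "storage array configuration table for infrastructure docs",
]
best = search("interrupt controller low power", pages, final_k=1)
print(best[0])
```

Running the cross-encoder only over the retriever's short list is the standard trade-off: pairwise scoring is too expensive to apply to the whole corpus, but applied to the top candidates it meaningfully improves the final ordering without touching the index.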
Real-World Use Cases

Cadence uses the models to index complex design documents, enabling engineers to ask questions like “How do I extend the interrupt controller for low power?” and receive precise, context-aware responses with suggested updates.

IBM Storage treats each page of technical PDFs as a multimodal document, using the reranker to surface pages where domain-specific terms and product names appear in context, improving AI understanding of infrastructure documentation.

ServiceNow powers its “Chat with PDF” feature by embedding pages and applying the reranker to maintain relevance across conversation turns, enabling coherent, context-aware interactions with large document sets.

Get Started

Developers can integrate these models directly into existing RAG pipelines or combine them with other open models on Hugging Face. Try them today and enhance your AI systems to truly understand documents, not just their text.

Stay updated on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, YouTube, and the Nemotron channel on Discord.
