
NVIDIA Unveils Nemotron ColEmbed V2: State-of-the-Art Multimodal Retrieval with ViDoRe V3 Leadership

NVIDIA has unveiled Nemotron ColEmbed V2, a new family of late-interaction multimodal embedding models designed to set a new standard in visual document retrieval. Built to address the growing complexity of enterprise documents that combine text, tables, charts, and figures, these models deliver state-of-the-art performance on the ViDoRe V3 benchmark, the latest and most comprehensive evaluation of multimodal retrieval for real-world business use cases.

The Nemotron ColEmbed V2 family includes three variants, with 3B, 4B, and 8B parameters, all optimized for high accuracy in cross-modal search. The nemotron-colembed-vl-8b-v2 model leads the pack, ranking #1 on ViDoRe V3 with an NDCG@10 score of 63.42, while the 4B and 3B versions rank 3rd and 6th in their parameter categories as of February 3, 2026. These results underscore the effectiveness of late-interaction architectures in capturing fine-grained semantic relationships between query and document components.

Unlike single-vector models that encode entire queries and documents into one embedding, Nemotron ColEmbed V2 uses a multi-vector approach. Each query token interacts independently with every document token, textual or visual, via the MaxSim operator, which takes the maximum similarity for each query token across all document tokens and sums these maxima to produce a final relevance score. This method requires storing individual token embeddings for documents, increasing storage needs but significantly improving retrieval precision.

The models are built on top of advanced vision-language foundations: the 3B model is based on Google's SigLIP-2 Giant Opt-Patch16-384 and Meta's Llama-3.2-3B, while the 4B and 8B variants leverage Qwen3-VL-Instruct models. All models were trained using a bi-encoder architecture with contrastive learning, maximizing similarity between relevant query-document pairs and minimizing it for negative examples.
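The MaxSim scoring described above can be sketched in a few lines of NumPy. This is a minimal illustration of the general late-interaction technique, with illustrative function names and toy dimensions; it is not NVIDIA's implementation:

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction relevance score via the MaxSim operator.

    query_emb: (num_query_tokens, dim) L2-normalized token embeddings
    doc_emb:   (num_doc_tokens, dim)   L2-normalized token embeddings

    For each query token, take the maximum similarity over all document
    tokens, then sum those maxima into a single relevance score.
    """
    sim = query_emb @ doc_emb.T          # (q_tokens, d_tokens) similarity matrix
    return float(sim.max(axis=1).sum())  # max over doc tokens, sum over query tokens

# Toy example: two query tokens, each perfectly matching one document token.
query = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
doc = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
score = maxsim_score(query, doc)  # each query token contributes its best match
```

Because the score is a sum of per-query-token maxima, a document only needs to contain a strong match for each query token somewhere among its text or image patch embeddings, which is what lets late interaction capture fine-grained matches that a single pooled vector would average away.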
The 3B model underwent a two-stage training process, first on text-only QA pairs and then on text-image pairs, while the larger models were trained exclusively on text-image data. Hard negative mining techniques from the NV-Retriever paper were applied to enhance performance.

Key improvements over the previous version include advanced model merging, which combines multiple fine-tuned checkpoints to achieve ensemble-like accuracy without added inference cost, and enhanced synthetic training data spanning multilingual and diverse document types to improve cross-lingual and cross-format understanding.

Nemotron ColEmbed V2 is ideal for researchers and developers building high-precision multimodal retrieval systems, especially in applications such as enterprise search, conversational AI with rich input understanding, and RAG pipelines that need to retrieve specific content from complex visual documents. The models are now available on Hugging Face for immediate use. Developers can also access them via NVIDIA NGC as a microservice container or explore the NVIDIA Enterprise RAG Blueprint, which leverages the same underlying technology behind the ViDoRe V3 top performer. This release marks a significant leap forward in accurate, scalable, and enterprise-ready multimodal retrieval.
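The contrastive objective behind this kind of training, pulling a query toward its relevant document while pushing it away from negatives (including mined hard negatives), can be sketched as a standard InfoNCE-style loss. This is a generic illustration with assumed names and an assumed temperature value, not the actual Nemotron training code:

```python
import math

def contrastive_loss(pos_score: float,
                     neg_scores: list[float],
                     temperature: float = 0.05) -> float:
    """InfoNCE-style contrastive loss for one query.

    pos_score:  similarity between the query and its relevant document
    neg_scores: similarities with negative documents (in-batch negatives
                plus mined hard negatives)

    The loss is low when the positive clearly outscores every negative.
    """
    scores = [pos_score] + list(neg_scores)
    scaled = [s / temperature for s in scores]
    m = max(scaled)                                  # subtract max for stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -(scaled[0] - log_z)                      # -log softmax of the positive

# A well-separated positive yields a near-zero loss; a positive that ties
# with a negative yields log(2).
easy = contrastive_loss(0.9, [0.1, 0.2])
tied = contrastive_loss(0.5, [0.5], temperature=1.0)
```

Hard negative mining matters here because in-batch negatives are often trivially dissimilar; deliberately selecting near-miss documents (as in the NV-Retriever techniques the article cites) keeps the denominator challenging and sharpens the learned embeddings.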
