Boost Vector Search Performance with NVIDIA cuVS and Faiss for Faster AI Workloads
NVIDIA cuVS enhances Faiss, a widely used library for vector search, by bringing powerful GPU acceleration to both index creation and real-time search. As organizations deal with ever-growing volumes of unstructured data and increasingly rely on large language models, the need for fast, scalable, and cost-effective similarity search has become critical. Traditional CPU-based systems struggle to keep up, often requiring thousands of processors to meet real-time demands, which drives up infrastructure costs.

The integration of cuVS with Faiss enables significant performance gains in both latency and throughput, particularly for large-scale applications like retrieval-augmented generation (RAG) and real-time recommendation systems. cuVS leverages NVIDIA's GPU technology to accelerate key operations, including clustering, quantization, and graph construction, while maintaining seamless compatibility between CPU and GPU environments.

One of the key benefits of this integration is the ability to build indexes on the GPU and then deploy them for search on the CPU. This hybrid approach lets users take advantage of faster GPU index creation while preserving existing CPU-based search infrastructure. For example, building Hierarchical Navigable Small-World (HNSW) indexes on the CPU can take hours or even days at scale, but the new CAGRA index, built on the GPU, can be created up to 12 times faster. Once built, the CAGRA graph can be converted to an HNSW format and used for efficient CPU-based search, offering a best-of-both-worlds solution.

Performance benchmarks on datasets like Deep100M (100 million 96-dimensional vectors) and OpenAI text embeddings (5 million 1,536-dimensional vectors) show consistent improvements. For IVF-PQ and IVF-Flat indexes, cuVS reduces index build time and search latency, while also increasing batch throughput, enabling millions of queries per second.
In graph-based search, CAGRA outperforms CPU-based HNSW by up to 4.7x in search speed, with comparable or better accuracy.

The cuVS integration is available in Faiss starting from version 1.10.0 and supports several index types, including IVF-PQ, IVF-Flat, Flat, and CAGRA. Users can install the faiss-gpu-cuvs package via Conda or use nightly builds. The library automatically uses cuVS for supported index types, requiring no code changes to benefit from acceleration. To maximize performance, it is recommended to use RMM (RAPIDS Memory Manager) for GPU memory pooling. Code examples demonstrate how to create and search an IVF-PQ index using the cuVS backend, as well as how to build a CAGRA index and convert it to HNSW format for CPU-based search.

By combining the flexibility of Faiss with the speed of GPU acceleration through cuVS, organizations can now process massive vector datasets more efficiently, reduce time-to-insight, and scale their AI systems with lower cost and higher performance. This advancement is especially valuable for teams working with multi-modal embeddings, real-time search, or large-scale RAG pipelines. For those ready to get started, the Faiss cuVS notebook provides full code examples and guidance.
