NVIDIA Dynamo Enables Efficient KV Cache Offloading to Slash LLM Inference Costs and Boost Scalability
As large language models grow in size and complexity, inference, the process of generating responses, has become increasingly resource-intensive. A major bottleneck is the Key-Value (KV) Cache, a core component of the attention mechanism used by models such as GPT-OSS and DeepSeek-R1. The KV Cache stores intermediate data from previously processed tokens so the model can maintain context while generating a response. As context windows expand to hundreds of thousands or even millions of tokens, however, the KV Cache can consume vast amounts of GPU memory, limiting scalability and driving up costs.

GPU memory is both limited and expensive. When the KV Cache grows too large, systems must either shorten prompts, limit concurrency, or deploy more expensive hardware. This trade-off hampers performance, especially in applications involving long conversations, deep research, or code generation, where context retention is essential.

NVIDIA Dynamo addresses this challenge with KV Cache offloading: moving the cache from GPU memory to more cost-effective storage such as CPU RAM, local SSDs, or remote network storage. This is made possible by NVIDIA NIXL, a low-latency data transfer library that moves KV Cache blocks between GPU memory and storage with minimal overhead and without interrupting inference.

The Dynamo KV Block Manager (KVBM) is the core system that orchestrates this process. It operates across three layers: a storage interface, a memory coordination layer, and a policy engine that decides when and where to offload or retrieve cache data. By decoupling memory management from specific inference engines, KVBM simplifies integration and lets storage and compute resources scale independently. This architecture supports a wide range of storage solutions and enables cached data to be reused across sessions.

Dynamo integrates with LMCache, an open-source caching system designed to manage and reuse KV Cache across CPU memory, local disk, and remote storage. Together, they let inference engines such as vLLM offload frequently reused data, such as conversation history, instead of recomputing it. This reduces Time to First Token (TTFT), improves throughput, and lowers the cost per token.

Real-world testing with storage providers confirms Dynamo’s effectiveness. VAST Data achieved 35 GB/s transfer speeds to a single H100 GPU using GPUDirect Storage (GDS), demonstrating full GPU utilization. WEKA’s lab tests showed up to 270 GB/s across eight H100 GPUs using a zero-copy, RDMA-based data path, proving that high-performance storage can keep pace with inference demands.

To implement KV offloading in vLLM, users can enable Dynamo KVBM via environment variables in a containerized deployment. Setting DYN_KVBM_CPU_CACHE_GB or DYN_KVBM_DISK_CACHE_GB allocates CPU memory or disk space for the offloaded cache. The system manages data movement automatically, and metrics can be monitored in real time through a Grafana dashboard; a configuration sketch follows below.

For teams using LMCache, enabling it through environment variables allows flexible storage backends and advanced caching policies. The integration supports persistent cache reuse, which significantly reduces recomputation time, a benefit that is especially pronounced for long prompts and repeated queries; a second sketch below shows one possible LMCache setup.
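As a rough illustration of the KVBM workflow described above, the sketch below sets the two cache-size variables named in this article and then launches a vLLM server from Python. The specific model name, port, and the idea of launching the server via subprocess are assumptions for illustration only; check the Dynamo documentation for the supported deployment path and container images.

```python
import os
import subprocess

# The two environment variables below are the ones named in the article;
# the sizes are example values, not recommendations.
os.environ["DYN_KVBM_CPU_CACHE_GB"] = "64"    # offload up to 64 GB of KV Cache to CPU RAM
os.environ["DYN_KVBM_DISK_CACHE_GB"] = "256"  # spill up to a further 256 GB to local disk

# Launch a vLLM server with the environment prepared above. The model and
# port are placeholders; KVBM is expected to read its settings from the
# environment of the serving process.
subprocess.run(
    [
        "vllm", "serve", "meta-llama/Llama-3.1-8B-Instruct",
        "--port", "8000",
    ],
    env=os.environ.copy(),
    check=True,
)
```

Once the server is up, offloading activity (hit rates, bytes moved between tiers) is the kind of data surfaced on the Grafana dashboard mentioned above.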
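In the same spirit, here is a minimal sketch of pairing vLLM with LMCache for CPU offloading. The environment variable names and the LMCacheConnectorV1 connector reflect LMCache’s published vLLM integration as best understood here, but APIs change quickly, so treat the exact names, values, and the placeholder model as assumptions to verify against the LMCache documentation.

```python
import os

# Assumed LMCache settings: keep KV blocks in CPU RAM, capped at 40 GB,
# split into 256-token chunks so prefixes can be reused across requests.
os.environ["LMCACHE_LOCAL_CPU"] = "True"
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "40"
os.environ["LMCACHE_CHUNK_SIZE"] = "256"

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Route KV Cache traffic through LMCache via vLLM's KV-connector interface.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",   # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",                      # both store and retrieve KV blocks
    ),
)

# A long, repeated prefix (for example, accumulated conversation history)
# benefits most: later calls can reuse cached KV blocks instead of
# recomputing the prefill.
prompt = "You are a helpful assistant. Summarize the conversation so far."
out = llm.generate([prompt], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

The design point this illustrates is the one made in the article: reuse of previously computed KV blocks trades cheap storage bandwidth for expensive GPU prefill compute, which is where the TTFT and cost-per-token gains come from.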
In summary, NVIDIA Dynamo’s KV Cache offloading capability offers a scalable, cost-effective solution to one of the most pressing challenges in large-scale LLM inference. By leveraging high-speed storage and intelligent memory management, it enables longer context windows, higher concurrency, faster response times, and lower infrastructure costs. As generative AI continues to evolve, Dynamo provides a powerful foundation for building efficient, high-performance inference systems.