
Leveraging CPU-GPU Unified Memory for Scalable LLM Inference and KV Cache Offload on Grace Hopper Systems

Large Language Models (LLMs) such as Llama 3 70B and Llama 4 Scout 109B demand substantial memory, often more than a single GPU provides. Loading Llama 3 70B in FP16 requires about 140 GB for the weights alone, and the KV cache for a 128k-token context can consume an additional 40 GB per user (a rough calculation is sketched at the end of this article). With GPU memory on the order of 96 GB, such models cannot fit entirely on the GPU, and naive loading fails with out-of-memory (OOM) errors during inference.

The NVIDIA GH200 Grace Hopper Superchip and Grace Blackwell systems address this challenge with a unified memory architecture built on NVLink-C2C. This high-bandwidth interconnect delivers 900 GB/s and provides memory coherency between the CPU and GPU, creating a single, shared address space: both processors can access the same data without explicit transfers or redundant copies. On the GH200, the 96 GB of high-bandwidth GPU memory can be combined with up to 480 GB of LPDDR memory attached to the Grace CPU, greatly expanding the usable memory pool and allowing models and datasets to be processed even when they exceed GPU capacity.

To demonstrate the problem, attempting to load Llama 3 70B directly into GPU memory results in an OOM error. The GPU simply runs out of space, as nvidia-smi confirms by reporting nearly full memory usage. This failure illustrates the limits of conventional, GPU-only memory management.

The solution is managed memory allocation through the RAPIDS Memory Manager (RMM). With managed memory enabled and PyTorch configured to use RMM's allocator, the system can transparently back allocations with either CPU or GPU memory, so the model can be loaded and executed without manual data movement. The key steps, sketched in the code at the end of this article, are:

1. Initialize RMM with managed memory enabled.
2. Set PyTorch's CUDA memory allocator to use RMM.
3. Load the model using standard Hugging Face pipelines.

With these changes, the model loads successfully even though its total memory footprint exceeds the GPU's physical capacity. The system automatically spills data to CPU memory when needed, avoiding OOM errors while keeping performance acceptable. Once loaded, the model serves prompts normally; for example, a request such as "Which is the tallest mountain in the world?" returns a response without interruption.

This approach is not limited to inference. It also benefits fine-tuning, KV cache offloading, and other memory-intensive AI workloads.

In short, a unified memory architecture with NVLink-C2C, combined with managed memory allocation, enables scalable, efficient LLM inference on modern hardware. By seamlessly pooling CPU and GPU memory, developers can run larger models and longer context windows without writing complex memory-management code. As model sizes continue to grow, this kind of memory scalability becomes a key enabler for real-world AI deployment.
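The 40 GB KV cache figure referenced above can be checked with a back-of-the-envelope calculation. The sketch below assumes Llama 3 70B's publicly documented architecture (80 transformer layers, 8 key/value heads with grouped-query attention, a head dimension of 128) and FP16 storage; these architectural values are assumptions drawn from the public model configuration, not stated in the article.

```python
# Rough KV cache sizing for one user at a 128k-token context.
# Architecture values are assumptions based on the public Llama 3 70B config.
num_layers = 80        # transformer layers
num_kv_heads = 8       # key/value heads (grouped-query attention)
head_dim = 128         # dimension per attention head
seq_len = 128 * 1024   # 128k-token context window
bytes_per_elem = 2     # FP16

# Two tensors (K and V) per layer, each of shape [seq_len, num_kv_heads, head_dim].
kv_cache_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem
print(f"KV cache per user: {kv_cache_bytes / 2**30:.1f} GiB")  # ~40 GiB
```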
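The three configuration steps above can be sketched in a few lines of Python. The RMM and PyTorch calls (`rmm.reinitialize`, `rmm_torch_allocator`, `torch.cuda.memory.change_current_allocator`) follow RMM's documented PyTorch integration path; the specific model identifier and pipeline arguments are illustrative assumptions rather than code quoted from the article.

```python
import rmm
import torch
from rmm.allocators.torch import rmm_torch_allocator
from transformers import pipeline

# Step 1: initialize RMM with CUDA managed (unified) memory, so allocations
# can spill transparently from GPU HBM into the Grace CPU's LPDDR memory.
rmm.reinitialize(pool_allocator=False, managed_memory=True)

# Step 2: route all of PyTorch's CUDA allocations through RMM's allocator.
# This must happen before any CUDA memory is allocated by PyTorch.
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)

# Step 3: load the model with a standard Hugging Face pipeline. Even though
# the FP16 weights (~140 GB) exceed the 96 GB of GPU memory, the managed
# allocations let the load complete without an out-of-memory error.
generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # illustrative model id
    torch_dtype=torch.float16,
    device=0,
)
```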
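Once the pipeline is up, the example prompt from the article can be sent as usual; the generation parameters below are arbitrary placeholders.

```python
# Run the example prompt; generation settings are illustrative only.
output = generator(
    "Which is the tallest mountain in the world?",
    max_new_tokens=64,
    do_sample=False,
)
print(output[0]["generated_text"])
```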
