
NVIDIA Run:ai Model Streamer Reduces LLM Cold Start Latency with Concurrent Streaming Across SSD and Cloud Storage


Deploying large language models (LLMs) for inference is often hampered by cold start latency: the time it takes to load model weights into GPU memory. As models grow to tens or hundreds of gigabytes, loading them sequentially can severely degrade user experience and limit system scalability. To address this, NVIDIA has introduced the Run:ai Model Streamer, an open-source Python SDK designed to cut cold start latency by streaming model weights from storage into GPU memory in parallel.

The Model Streamer is backed by a high-performance C++ engine that uses multiple threads to read tensors concurrently from storage, whether local SSDs, network file systems, or cloud object stores such as Amazon S3, into a dedicated CPU buffer. While some tensors are still being read from storage, others are simultaneously transferred from the CPU buffer to the GPU. Overlapping I/O with memory transfers exploits the independent data paths between storage, CPU, and GPU, keeping both stages busy and minimizing idle time. (A simplified Python sketch of this producer-consumer pattern appears below.)

In benchmarks on an AWS g5.12xlarge instance with an NVIDIA A10G GPU and a 2nd Gen AMD EPYC CPU, the Model Streamer was compared against two widely used loaders, the Hugging Face Safetensors loader and the CoreWeave Tensorizer, across three storage types: GP3 SSD, IO2 SSD, and Amazon S3. On GP3 SSD, the Model Streamer loaded the model in 14.34 seconds with 16 concurrent threads, versus 47.99 seconds for the Safetensors loader. On the higher-throughput IO2 SSD, it reached 7.53 seconds with 8 concurrent threads, a nearly 6x improvement over Safetensors, and even with 16 workers it stayed close to the storage's theoretical throughput limit, demonstrating strong scalability.

In cloud deployments that load directly from Amazon S3, where sequential loading is especially slow, the Model Streamer again came out ahead. At 32 concurrent streams it loaded the model in 4.88 seconds, compared with Tensorizer's best time of 37.36 seconds at 16 workers, a gap that reflects how effectively concurrency can saturate available cloud storage bandwidth.

When integrated with vLLM, a popular inference engine, the Model Streamer sharply reduced total time to readiness. On GP3 SSD, vLLM's combined load and readiness time dropped to 35.08 seconds, compared with 66.13 seconds using Safetensors; on IO2 SSD it fell to 28.28 seconds; and on S3 it reached 23.18 seconds, far ahead of Tensorizer's 65.18 seconds. (A sample vLLM configuration also appears below.)

The Model Streamer reads the Safetensors format natively, so no weight conversion is required, preserving compatibility while delivering higher performance. It is designed to work across diverse storage environments and scales with the available bandwidth. These results underscore how much concurrent streaming and storage choice matter when optimizing LLM inference. For production deployments, especially in cloud or dynamic environments, adopting the NVIDIA Run:ai Model Streamer can significantly reduce cold start latency, accelerate time to inference, and improve scalability, and its integration with frameworks such as vLLM makes it a practical, high-impact option for modern AI systems.
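To make the overlap concrete, here is a minimal, illustrative Python sketch of the general pattern described above. It is not the Run:ai Model Streamer API: the thread count, queue size, and function names are invented for illustration, and it relies on the standard safetensors and PyTorch libraries. Reader threads fill a bounded CPU-side queue while the main thread drains it into GPU memory, so storage reads and host-to-device copies overlap instead of running back to back.

```python
# Illustrative only: not the Run:ai Model Streamer API. The real streamer
# uses a C++ backend, so it is not throttled by Python's GIL; this sketch
# only mirrors the producer-consumer overlap of reads and GPU copies.
import queue
from concurrent.futures import ThreadPoolExecutor

import torch
from safetensors import safe_open

NUM_READERS = 8                       # hypothetical concurrency level
cpu_buffer = queue.Queue(maxsize=32)  # bounded staging area in host RAM


def read_shard(path: str, names: list[str]) -> None:
    """Producer: read a subset of tensors from storage into host memory."""
    with safe_open(path, framework="pt", device="cpu") as f:
        for name in names:
            cpu_buffer.put((name, f.get_tensor(name)))


def stream_to_gpu(path: str, device: str = "cuda") -> dict[str, torch.Tensor]:
    """Consumer: copy tensors to the GPU as soon as readers produce them."""
    with safe_open(path, framework="pt", device="cpu") as f:
        names = list(f.keys())
    shards = [names[i::NUM_READERS] for i in range(NUM_READERS)]

    weights: dict[str, torch.Tensor] = {}
    with ThreadPoolExecutor(max_workers=NUM_READERS) as pool:
        futures = [pool.submit(read_shard, path, shard) for shard in shards]
        for _ in range(len(names)):
            name, tensor = cpu_buffer.get()           # blocks until a reader delivers
            weights[name] = tensor.to(device, non_blocking=True)
        for fut in futures:
            fut.result()                              # surface reader exceptions
    torch.cuda.synchronize()                          # wait for async copies to finish
    return weights
```

In the actual library, the staging buffer and the reads live in C++ worker threads and tensors move from that buffer straight to the GPU; the sketch above only reproduces the control flow that makes the overlap possible.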
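For the vLLM integration, the streamer is selected through vLLM's model-loading options rather than through changes to model code. The snippet below is a hedged sketch: the `runai_streamer` load format and the `concurrency` key in `model_loader_extra_config` follow the vLLM and Run:ai documentation at the time of writing, and the model name and concurrency value are placeholders, so verify them against your installed versions.

```python
from vllm import LLM

# Hypothetical configuration: model name and concurrency are placeholders.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",    # placeholder model
    load_format="runai_streamer",                   # load weights via the Run:ai Model Streamer
    model_loader_extra_config={"concurrency": 16},  # number of concurrent read threads
)

outputs = llm.generate("Cold starts are")
print(outputs[0].outputs[0].text)
```

Because the streamer reads Safetensors files in place, pointing it at an existing Hugging Face checkpoint (on local disk or S3) requires no weight conversion step.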
