3 months ago
NVIDIA
LLM
vLLM

NVIDIA enhances distributed inference with new Transfer Library

NVIDIA has launched the Inference Transfer Library (NIXL), an open-source, vendor-agnostic data movement library designed to optimize performance in complex distributed AI inference workloads. As large language models grow in scale, inference increasingly relies on storing and retrieving large key-value (KV) caches across diverse storage tiers to avoid recomputing prefill tokens. Wide expert parallelism, in turn, requires low-latency communication of intermediate activations across hundreds of GPUs. Traditional infrastructure struggles with the dynamic scaling, resiliency demands, and hardware heterogeneity of these 24/7 services, where compute resources and storage configurations shift frequently in response to demand or failures.

NIXL addresses these challenges by unifying communication and storage technologies behind a single, high-performance abstraction layer. It supports a wide array of backends, including RDMA, GPU-initiated networking, GPUDirect Storage, NVMe, and cloud object stores such as AWS S3 and Azure Blob Storage. This flexibility lets developers build inference frameworks that run efficiently across diverse environments, from on-premises clusters to multi-cloud setups. The library is already integrated into major AI frameworks such as NVIDIA TensorRT LLM, vLLM, SGLang, and Anyscale Ray.

The core architecture of NIXL relies on a conductor process and NIXL transfer agents that operate in an object-oriented manner. Agents perform the actual data movement through put and get operations, which abstract one-sided network communication and storage transfers. Descriptors define the memory or storage locations involved, spanning host memory, GPU memory, and various storage tiers. A critical feature is dynamic metadata exchange, which allows the system to adapt to changing network topologies.
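
To make the roles of the conductor, agents, and descriptors more concrete, here is a minimal in-process sketch in Python. It is a toy model and not the NIXL API: the class names `Conductor`, `Agent`, and `Descriptor`, their methods, and the tier labels are all hypothetical, and plain byte buffers stand in for host memory, GPU memory, and storage. It only illustrates the idea of describing regions with descriptors, resolving a peer through exchanged metadata, and moving data with one-sided put and get calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    """Hypothetical descriptor: names a region inside one memory or storage tier."""
    tier: str     # illustrative label such as "host", "gpu", or "nvme"
    offset: int
    length: int


class Conductor:
    """Toy registry through which agents publish and look up peer metadata."""

    def __init__(self):
        self._agents = {}

    def publish(self, agent):
        self._agents[agent.name] = agent

    def withdraw(self, name):
        self._agents.pop(name, None)

    def lookup(self, name):
        return self._agents[name]  # KeyError if the peer has left


class Agent:
    """Toy transfer agent; bytearrays stand in for real memory and storage."""

    def __init__(self, name, conductor):
        self.name = name
        self.conductor = conductor
        self.tiers = {}
        conductor.publish(self)

    def register(self, tier, size):
        self.tiers[tier] = bytearray(size)

    def _view(self, desc):
        buf = memoryview(self.tiers[desc.tier])
        return buf[desc.offset:desc.offset + desc.length]

    def put(self, local, peer_name, remote):
        """One-sided write: copy a local region into the peer's region."""
        peer = self.conductor.lookup(peer_name)
        peer._view(remote)[:] = self._view(local)

    def get(self, local, peer_name, remote):
        """One-sided read: copy the peer's region into a local region."""
        peer = self.conductor.lookup(peer_name)
        self._view(local)[:] = peer._view(remote)


conductor = Conductor()
prefill = Agent("prefill-0", conductor)
decode = Agent("decode-0", conductor)
prefill.register("gpu", 16)
decode.register("host", 16)

# Write a small payload on the prefill side and push it to the decode side.
prefill.tiers["gpu"][0:4] = b"kv00"
prefill.put(Descriptor("gpu", 0, 4), "decode-0", Descriptor("host", 8, 4))

# A node is recycled: the old agent withdraws, a replacement publishes itself,
# and the sender reaches it by name without restarting.
conductor.withdraw("decode-0")
decode_new = Agent("decode-1", conductor)
decode_new.register("host", 16)
prefill.put(Descriptor("gpu", 0, 4), "decode-1", Descriptor("host", 0, 4))
```

In this sketch the sender never holds a permanent reference to its peer; it resolves the peer through the registry on every transfer, which is what lets a replacement node take over. A real deployment would exchange connection metadata over the network rather than sharing Python objects.
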
The conductor process enables agents to discover each other and exchange the necessary connection information, which allows scaling, node recycling, and failure recovery without halting the service.

NIXL offers a fully non-blocking API with minimal overhead, so computation and communication can overlap efficiently through zero-copy transfers. For scenarios that require direct GPU kernel involvement, a Device API mode is available. Unless the user specifies a backend explicitly, the library automatically selects the optimal one for each transfer request, preserving performance while remaining hardware-agnostic.

To assist developers, NVIDIA provides benchmarking tools. NIXLBench offers a model-agnostic view of system performance by executing real data transfers and reporting bandwidth and latency metrics. For LLM engineers, KVBench calculates exact KV cache input/output sizes for specific models and generates ready-to-run benchmark commands. KVBench also includes a profiling module for analyzing KV cache transfers specifically.

NIXL is written in C++ for efficiency and is available on GitHub, with bindings for C, Python, and Rust. It currently supports Linux environments, including Ubuntu and RHEL. NVIDIA invites the community to test the library and contribute to its evolution ahead of the upcoming v1.0.0 release. By unifying complex data movement requirements, NIXL aims to simplify the deployment of robust, high-performance AI inference systems.
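
For a sense of the numbers behind KV cache sizing, the following Python sketch applies the standard estimate for a transformer with multi-head or grouped-query attention: keys and values are each stored per layer, per KV head, per token. This is not KVBench's code, and KVBench may account for details this formula ignores (for example, other attention variants or parallelism layouts). The function names, the model configuration, and the 50 GB/s link speed are illustrative assumptions.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """Estimate KV cache size; the factor 2 covers keys plus values."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


def transfer_seconds(num_bytes, bandwidth_gb_per_s):
    """Ideal transfer time at a sustained bandwidth, ignoring latency and overhead."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


# Illustrative configuration (assumed): 32 layers, 8 KV heads,
# head dimension 128, 16-bit elements.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(**cfg, num_tokens=1)
prompt = kv_cache_bytes(**cfg, num_tokens=8192)

print(f"per token: {per_token / 1024:.0f} KiB")
print(f"8192-token prompt: {prompt / 2**30:.2f} GiB")
print(f"ideal time at 50 GB/s: {transfer_seconds(prompt, 50.0) * 1e3:.1f} ms")
```

Under these assumptions a single token occupies 128 KiB and an 8192-token prompt occupies 1 GiB, which is why moving KV caches between nodes or storage tiers is often cheaper than recomputing the prefill, provided the transfer path is fast enough.
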
