
NVLink Fusion Unlocks Next-Gen AI Inference with Scalable, Flexible GPU Connectivity and High-Speed Fabric Integration


The rapid growth in AI model complexity, with parameter counts scaling from millions to trillions, has created an urgent need for advanced computational infrastructure. Modern AI workloads, especially those using mixture-of-experts architectures and test-time scaling for reasoning, demand massive parallelization across large GPU clusters. To meet these demands, AI systems now rely on scale-up strategies such as tensor, pipeline, and expert parallelism, all of which require a unified compute and memory fabric connecting dozens of GPUs.

NVIDIA's NVLink technology has evolved over the past several years to support this scale-up vision. Introduced in 2016, NVLink overcame the bandwidth and latency limitations of PCIe by enabling high-speed, low-latency GPU-to-GPU communication and a shared memory space. The NVLink Switch, introduced in 2018, delivered 300 GB/s of all-to-all bandwidth across an 8-GPU system, enabling efficient multi-GPU compute fabrics. The third-generation NVLink Switch added SHARP (Scalable Hierarchical Aggregation and Reduction Protocol), which reduced collective-operation latency and optimized bandwidth usage. With the fifth-generation NVLink released in 2024, NVIDIA now supports 72-GPU all-to-all communication at 1,800 GB/s per GPU, delivering 130 TB/s of aggregate bandwidth, roughly 800 times the capacity of first-generation NVLink. This performance leap, paired with a yearly cadence of new NVLink generations, keeps the technology apace with the exponential growth of AI models.

At the software layer, the NVIDIA Collective Communications Library (NCCL) plays a critical role. An open-source library integrated into all major deep learning frameworks, NCCL delivers near-theoretical bandwidth for GPU communication across single- and multi-node systems, with automatic topology awareness and continuous optimization that make it essential for large-scale training and inference.

The 72-GPU rack architecture, powered by NVLink, is central to maximizing AI factory efficiency. By balancing throughput per watt against latency, this configuration helps maximize the area under the performance curve, which directly impacts revenue and productivity. Different scale-up configurations show significant performance differences even at identical NVLink speeds, underscoring the importance of fabric design.

To extend these capabilities beyond its standard platforms, NVIDIA introduced NVLink Fusion. The solution gives hyperscalers direct access to production-proven NVLink scale-up technologies, including NVLink SERDES, chiplets, switches, and the rack-scale architecture. The high-density design includes a spine network, copper cabling, mechanical innovations, and advanced power-delivery and liquid-cooling systems.

NVLink Fusion supports flexible deployment options: custom CPUs, custom XPUs, or hybrid configurations. It is available as a modular Open Compute Project (OCP) MGX rack solution, allowing integration with any NIC, DPU, or scale-out switch. For custom XPUs, the interface leverages the open UCIe standard, with NVIDIA providing a bridge chiplet that connects UCIe to NVLink, enabling high-performance, interoperable integration while preserving design flexibility. For custom CPUs, NVIDIA recommends its NVLink-C2C IP to connect directly to GPUs, unlocking access to hundreds of CUDA-X libraries and the full CUDA platform for accelerated computing.

The NVLink Fusion ecosystem includes a broad network of silicon partners, IP providers, and system integrators. The short sketches below make some of the numbers and mechanisms above concrete.
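
First, the direct GPU-to-GPU communication that NVLink enables is visible from within frameworks. A minimal sketch, assuming a host with at least two CUDA GPUs and PyTorch installed, checks whether peer-to-peer access is available and performs a direct device-to-device copy:

```python
import torch

# Peer-to-peer (P2P) access lets one GPU read or write another GPU's memory
# directly; over NVLink this avoids staging the transfer through host RAM.
assert torch.cuda.device_count() >= 2, "sketch assumes at least two GPUs"

print("GPU0 -> GPU1 peer access:", torch.cuda.can_device_access_peer(0, 1))

src = torch.randn(1 << 20, device="cuda:0")  # ~4 MB tensor on GPU 0
dst = src.to("cuda:1")                       # direct copy; uses NVLink when present
torch.cuda.synchronize()
print("copy ok:", bool(torch.equal(src.cpu(), dst.cpu())))
```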
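
The payoff of SHARP-style in-network reduction can be seen with first-order traffic math. In a classic ring all-reduce, each GPU injects roughly 2(N-1)/N times the payload onto the fabric; when the switch performs the reduction, each GPU injects its contribution exactly once. The model below is a textbook approximation, not a measurement of NVLink hardware:

```python
# First-order traffic model: ring all-reduce vs. in-switch (SHARP-style)
# reduction. Textbook approximation only; not measured NVLink figures.

def ring_bytes_sent_per_gpu(payload_bytes: float, n_gpus: int) -> float:
    # Ring all-reduce = reduce-scatter + all-gather,
    # each phase sending (N-1)/N of the payload per GPU.
    return 2 * (n_gpus - 1) / n_gpus * payload_bytes

def in_network_bytes_sent_per_gpu(payload_bytes: float) -> float:
    # With in-switch reduction, each GPU sends its full contribution once;
    # the fabric aggregates and returns the result.
    return payload_bytes

payload = 1e9  # 1 GB of gradients or activations
for n in (8, 72):
    ring = ring_bytes_sent_per_gpu(payload, n)
    sharp = in_network_bytes_sent_per_gpu(payload)
    print(f"N={n:2d}: ring ~{ring / 1e9:.2f} GB sent/GPU, "
          f"in-network ~{sharp / 1e9:.2f} GB sent/GPU")
```

Roughly halving the traffic each GPU must inject, and collapsing the serialized ring steps into a single exchange with the switch, is what the article's "reduced collective-operation latency and optimized bandwidth usage" refers to.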
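
The fifth-generation headline numbers are internally consistent and easy to check: 72 GPUs at 1,800 GB/s each gives roughly 130 TB/s in aggregate. The first-generation per-GPU figure used below for the 800x comparison is an assumption (160 GB/s, the commonly cited P100 total), not stated in the article:

```python
# Sanity-check of the scale-up figures quoted in the article.
GPUS = 72                 # NVL72 domain size
PER_GPU_GBS = 1_800       # fifth-gen NVLink bandwidth per GPU (GB/s)
FIRST_GEN_GBS = 160       # assumed first-gen per-GPU figure (not from the article)

aggregate_tbs = GPUS * PER_GPU_GBS / 1_000
print(f"aggregate: {aggregate_tbs:.1f} TB/s")  # ~129.6, i.e. ~130 TB/s
print(f"vs first gen: ~{aggregate_tbs * 1_000 / FIRST_GEN_GBS:.0f}x")  # ~800x
```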
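
NCCL itself is rarely called directly; frameworks invoke it under the hood. A minimal sketch of that path, assuming PyTorch with CUDA on a single multi-GPU host, runs one process per GPU and performs an NCCL-backed all-reduce:

```python
# Minimal NCCL all-reduce sketch via PyTorch's distributed module.
# Assumes a single host with >= 2 CUDA GPUs; address/port are illustrative.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # The "nccl" backend routes this collective over NVLink when available.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    x = torch.ones(1 << 20, device=f"cuda:{rank}") * (rank + 1)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)  # sum across all GPUs
    torch.cuda.synchronize()
    print(f"rank {rank}: element = {x[0].item()}")  # world_size*(world_size+1)/2

    dist.destroy_process_group()

if __name__ == "__main__":
    n = torch.cuda.device_count()
    mp.spawn(worker, args=(n,), nprocs=n)
```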
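
Finally, the "area under the performance curve" framing can be made concrete with a toy calculation. The operating points below are hypothetical placeholders, not NVL72 measurements; the point is only that integrating throughput across latency targets rewards fabrics that hold throughput as latency budgets tighten:

```python
# Toy AUC calculation over a hypothetical latency/throughput frontier.
# The numbers are illustrative placeholders, NOT measured results.
points = [  # (per-user latency budget in ms, tokens/s per GPU at that budget)
    (20, 400), (50, 900), (100, 1400), (200, 1800), (400, 2000),
]

# Trapezoidal integral of throughput over the latency range.
auc = sum((x1 - x0) * (y0 + y1) / 2
          for (x0, y0), (x1, y1) in zip(points, points[1:]))
print(f"area under the curve: {auc:.0f} (arbitrary units)")
```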
Production-ready systems like the GB200 NVL72 and GB300 NVL72 are already being deployed at scale, enabling rapid time to market and reducing system bring-up time. By combining a decade of NVLink innovation with open standards and a mature ecosystem, NVLink Fusion delivers unmatched performance and customization for AI reasoning workloads, empowering hyperscalers to build tailored, high-performance infrastructure that meets the demands of next-generation AI.
