Hybrid-EP: Optimizing MoE Training Communication with NVIDIA’s Hybrid Network Architecture for Scalable, High-Performance LLM Training

In large language model training, Expert Parallel (EP) communication for hyperscale mixture-of-experts (MoE) models is challenging because of its dynamic and sparse nature: only the top-k experts are activated per token, which makes traditional all-to-all communication inefficient. To address this, NVIDIA has introduced Hybrid-EP, an optimized communication solution for MoE training in the Megatron Core framework that leverages the NVIDIA Quantum InfiniBand and Spectrum-X Ethernet platforms.

DeepSeek-V3 exemplifies the new generation of fine-grained MoE models that balance performance and computational cost through sparse activation, but this comes at the expense of severe communication bottlenecks. Without optimization, EP communication can consume over 50% of total training time. Dynamic routing also leads to load imbalance, where certain experts become overloaded ("hot") while others remain underutilized, reducing hardware efficiency. These challenges are exacerbated at scale, demanding advanced parallel strategies and hardware-aware optimizations.

Hybrid-EP is a software-hardware co-designed solution that achieves near-peak communication bandwidth on NVIDIA's latest architectures. It supports the two core EP operations: dispatch, which routes tokens to their assigned experts, and combine, which aggregates expert outputs back to the attention layer. The design combines TMA (Tensor Memory Accelerator) copies for intra-node NVLink traffic with low-level IBGDA (InfiniBand GPUDirect Async) for RDMA traffic, enabling hybrid communication both within and across nodes.

Internally, Hybrid-EP uses a fine-grained data pipeline that breaks token data into small chunks so that communication and computation can overlap. Each CUDA block acts as an independent data channel, with different warp groups handling distinct pipeline stages: RDMA, G2S (global to shared memory), and S2G (shared to global memory). This pipelined approach masks communication latency and sustains high throughput. For the combine operation, reduction is performed hierarchically: intra-node partial accumulation is followed by inter-node aggregation, preserving accuracy while maintaining efficiency.

Performance benchmarks demonstrate Hybrid-EP's effectiveness. On an 8-GPU DGX Hopper system, only eight SMs are needed to saturate NVLink bandwidth. In a 32-GPU cluster spanning four DGX Hopper systems, Hybrid-EP reaches near-maximum NIC bandwidth with just four SMs. On the NVIDIA Grace Blackwell platform with 36 GPUs, only 16 SMs are required to fill NVLink bandwidth, showing the design's scalability and efficiency.

Integration into Megatron Core is straightforward: Hybrid-EP is available in the DeepEP/Hybrid-EP branch as callable PyTorch operators. Buffer management relies on two buffer types, registered buffers for cross-node communication and normal buffers managed by PyTorch. To handle dynamic token loads, a worst-case preallocation strategy ensures sufficient buffer space without excessive memory overhead.

In end-to-end training, Hybrid-EP delivers substantial speedups. On DeepSeek-V3 with MXFP8 precision, throughput increases from 829 TFLOPS/GPU with DeepEP to 943 TFLOPS/GPU with Hybrid-EP, a 1.14x improvement. Similar gains are observed for other models such as Qwen 3 235B and DeepSeek-V3-FSDP, with speedups ranging from 1.05x to 1.14x depending on the model and configuration.
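To make the dispatch and combine semantics described above concrete, the following minimal PyTorch sketch shows what the two operations compute on a single GPU: dispatch gathers each token's activations for its top-k experts, and combine scatter-adds the weighted expert outputs back per token. The function and parameter names (dispatch_and_combine, num_experts, top_k) are illustrative rather than Hybrid-EP's actual operator API, and the expert itself is a placeholder.

```python
import torch

def dispatch_and_combine(tokens, router_logits, num_experts, top_k):
    """Conceptual single-GPU illustration of MoE dispatch/combine semantics.

    tokens:        [num_tokens, hidden] activations leaving the attention layer
    router_logits: [num_tokens, num_experts] gating scores
    """
    # The router picks the top-k experts per token and normalizes their weights.
    weights, expert_ids = torch.topk(router_logits, top_k, dim=-1)
    weights = torch.softmax(weights, dim=-1)                     # [T, k]

    combined = torch.zeros_like(tokens)
    for e in range(num_experts):
        # Dispatch: gather the tokens routed to expert e.
        token_idx, slot = torch.nonzero(expert_ids == e, as_tuple=True)
        if token_idx.numel() == 0:
            continue                                             # idle ("cold") expert
        expert_in = tokens[token_idx]                            # variable-length batch
        expert_out = expert_in * 1.0                             # placeholder expert MLP
        # Combine: scatter-add the weighted expert output back to each source token.
        combined.index_add_(0, token_idx,
                            expert_out * weights[token_idx, slot].unsqueeze(-1))
    return combined

tokens = torch.randn(8, 16)
logits = torch.randn(8, 4)
out = dispatch_and_combine(tokens, logits, num_experts=4, top_k=2)
```

In EP training these gathers and scatters become network transfers, because the experts live on different GPUs; that is exactly the traffic Hybrid-EP optimizes.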
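The fine-grained pipeline can be approximated at a much coarser grain with two CUDA streams: the transfer of chunk i+1 overlaps the expert computation on chunk i. This is only a stream-level analogy of the warp-specialized design (the real kernels specialize warp groups for the RDMA, G2S, and S2G stages inside a single CUDA block); it assumes a CUDA device, and chunked_overlap and expert_fn are illustrative names.

```python
import torch

def chunked_overlap(host_tokens, expert_fn, num_chunks=4):
    # Two streams stand in for the pipeline stages: one moves chunks of token
    # data (the dispatch traffic), the other runs the expert computation.
    comm = torch.cuda.Stream()
    compute = torch.cuda.Stream()
    outputs = []
    for chunk in host_tokens.pin_memory().chunk(num_chunks, dim=0):
        with torch.cuda.stream(comm):
            # Stand-in for the dispatch transfer (NVLink / RDMA in the real pipeline).
            dev_chunk = chunk.to("cuda", non_blocking=True)
        dev_chunk.record_stream(compute)   # keep the buffer alive for the compute stream
        compute.wait_stream(comm)          # the dependency is per chunk, not per batch
        with torch.cuda.stream(compute):
            outputs.append(expert_fn(dev_chunk))   # overlaps the next chunk's copy
    torch.cuda.synchronize()
    return torch.cat(outputs, dim=0)

# Example usage with a placeholder "expert" that doubles each activation.
tokens = torch.randn(4096, 1024)
result = chunked_overlap(tokens, lambda x: 2.0 * x)
```

The per-chunk dependency is what makes the overlap possible: the compute stream only waits for the chunk it is about to consume, so later transfers stay in flight behind the current expert GEMM.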
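The hierarchical reduction used by combine can be sketched with torch.distributed subgroups: a first all-reduce over the GPUs of one node (the NVLink domain) produces per-node partial sums, and a second all-reduce across nodes aggregates them. The real combine reduces per-token contributions back to each token's source GPU rather than all-reducing a fixed-shape tensor; the sketch below only shows the two-level pattern and assumes an initialized NCCL process group with a node-major rank layout.

```python
import torch
import torch.distributed as dist

def hierarchical_combine(partial, gpus_per_node):
    """Two-level reduction: NVLink-local accumulation first, then cross-node
    aggregation over the NIC. Groups would be created once and cached in practice."""
    world = dist.get_world_size()
    rank = dist.get_rank()
    node = rank // gpus_per_node
    local = rank % gpus_per_node

    # Intra-node groups: the GPUs sharing one node's NVLink domain.
    intra_groups = [dist.new_group(list(range(n * gpus_per_node, (n + 1) * gpus_per_node)))
                    for n in range(world // gpus_per_node)]
    # Inter-node groups: one representative (same local rank) per node.
    inter_groups = [dist.new_group(list(range(l, world, gpus_per_node)))
                    for l in range(gpus_per_node)]

    dist.all_reduce(partial, group=intra_groups[node])    # step 1: intra-node partial sum
    dist.all_reduce(partial, group=inter_groups[local])   # step 2: inter-node aggregation
    return partial
```

Performing the cheap NVLink-local sum first shrinks the data that has to cross the slower inter-node fabric, which is the point of the hierarchy.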
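Finally, the worst-case preallocation strategy can be illustrated with a simple sizing rule: in the worst case, every token held by every EP rank routes to this rank, and each such token contributes at most min(top_k, local_experts) copies. The formula and names below are assumptions for illustration only; Hybrid-EP's actual buffer sizing and RDMA registration logic live in the DeepEP/Hybrid-EP branch.

```python
import torch

def preallocate_dispatch_buffer(tokens_per_rank, hidden, ep_size, top_k,
                                local_experts, dtype=torch.bfloat16):
    # Worst case: every token in the EP group routes to this rank, landing on at
    # most min(top_k, local_experts) of the experts hosted here.
    # (Illustrative sizing rule, not Hybrid-EP's actual formula.)
    copies_per_token = min(top_k, local_experts)
    max_recv_tokens = tokens_per_rank * ep_size * copies_per_token
    # A persistent buffer sized once at startup; for cross-node traffic the real
    # implementation additionally registers such memory with the RDMA transport.
    return torch.empty(max_recv_tokens, hidden, dtype=dtype, device="cuda")

# Example: 2048 tokens/rank, hidden size 4096, EP=8, top-2 routing, 8 local experts
# -> at most 2048 * 8 * 2 = 32,768 token slots (about 256 MB in bf16).
recv_buffer = preallocate_dispatch_buffer(2048, 4096, 8, 2, 8)
```

Sizing for the worst case removes any need to resize or reallocate as routing decisions shift from step to step, at the cost of a bounded, predictable memory reservation.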
Taken together, these results highlight Hybrid-EP's ability to unlock the full potential of next-generation hardware, enabling efficient, scalable, and high-performance MoE training. By minimizing communication overhead and maximizing hardware utilization, Hybrid-EP plays a key role in advancing the deployment of large-scale AI models toward 10x performance gains at one-tenth the cost.
