Fused Kernels Accelerate MoE Training Throughput Up to 2x

NVIDIA has introduced a set of fused multi-layer perceptron (MLP) kernels designed to accelerate the training of Mixture-of-Experts (MoE) and dense large language models. Built with the CuTe Domain-Specific Language (DSL), the kernels target the memory and synchronization bottlenecks that have historically limited MoE throughput. The release includes three synchronization-free kernel variants that combine GroupGEMM operations with quantization, activation functions, and transposes.

Modern architectures increasingly rely on Gated Linear Unit (GLU) variants such as SwiGLU and GeGLU, which traditionally need multiple memory passes because the activation operates on two chunks of the projection output, an input half and a gate half. NVIDIA’s solution repacks weights during checkpoint initialization so that each thread block computes the input and gate tensors together and applies the activation within the GEMM epilogue. This eliminates intermediate global memory transfers, natively supports feature scaling, tensor clamping, and bias addition, and improves utilization by overlapping the remaining memory operations with compute.

To remove host-device synchronization overhead, the kernels track the number of tokens per group directly in GPU memory. This removes the CPU dependency between launches, enables full-iteration CUDA Graphs, and avoids the usual host-side launch bottlenecks.

The implementation also fuses MXFP8 and NVFP4 quantization into the main computation pipeline. By computing absolute-maximum (amax) values and performing transposes within a single kernel pass, the design reduces exposed memory overhead while preserving model accuracy.

Benchmarks on NVIDIA GB200 systems show forward-pass speedups of up to 1.3x and backward-pass speedups of up to 2.1x relative to unfused execution paths.
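To make the weight-repacking idea described above concrete, here is a minimal pure-Python sketch. It assumes one plausible packing, in which gate and input columns are interleaved so that every output tile holds matching pairs; the article does not specify NVIDIA's actual layout. The function names (`repack_interleaved`, `swiglu_fused`), the tile width, and the choice to apply SiLU to the gate half are all assumptions made for illustration, and the inner loop only stands in for a GEMM epilogue working on accumulator registers.

```python
import math


def matmul(x, w):
    """Plain matrix multiply for lists of rows: (m x k) @ (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def silu(v):
    """SiLU (swish) activation used by SwiGLU."""
    return v / (1.0 + math.exp(-v))


def swiglu_unfused(x, w_gate, w_in):
    """Reference path: two GEMMs, then a separate elementwise pass."""
    gate = matmul(x, w_gate)
    inp = matmul(x, w_in)
    return [[silu(g) * u for g, u in zip(g_row, u_row)]
            for g_row, u_row in zip(gate, inp)]


def repack_interleaved(w_gate, w_in):
    """One-time repack (hypothetical layout): columns become g0, i0, g1, i1, ..."""
    packed = []
    for g_row, u_row in zip(w_gate, w_in):
        row = []
        for g, u in zip(g_row, u_row):
            row.extend((g, u))
        packed.append(row)
    return packed


def swiglu_fused(x, w_packed, tile=2):
    """Single GEMM over packed weights, with the activation applied per output tile."""
    if tile % 2:
        raise ValueError("tile width must be even so gate/input pairs stay together")
    out = [[] for _ in x]
    for start in range(0, len(w_packed[0]), tile):
        w_tile = [row[start:start + tile] for row in w_packed]
        acc = matmul(x, w_tile)  # stands in for this tile's accumulators
        for i, acc_row in enumerate(acc):
            # "Epilogue": consume (gate, input) pairs before anything is stored.
            for j in range(0, len(acc_row), 2):
                out[i].append(silu(acc_row[j]) * acc_row[j + 1])
    return out


# Small usage example with made-up values.
x = [[1.0, 2.0], [0.5, -1.0]]
w_gate = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
w_in = [[1.0, 0.0, -1.0], [0.5, 2.0, 1.0]]
reference = swiglu_unfused(x, w_gate, w_in)
fused = swiglu_fused(x, repack_interleaved(w_gate, w_in), tile=2)
```

In this sketch the repack happens once, so each step runs a single GEMM whose output is already the activated tensor, and no intermediate gate or input matrix is materialized.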
In full pre-training runs, the optimizations deliver an 8 percent end-to-end throughput increase for DeepSeek-V3 and a 93 percent increase for GPT-OSS.

The kernels are available now through the cuDNN Frontend, Transformer Engine, and Megatron-Core, giving developers several integration points across different software stacks. NVIDIA is expanding the library to support additional fusion patterns, JAX, activation recomputation, and ahead-of-time compilation to reduce initialization latency. Ongoing development focuses on reducing CPU overhead and refining the heuristics used for kernel selection.
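The fused quantization step mentioned earlier, in which maximum values and the transpose are produced in one pass, can be sketched in plain Python as follows. This is an illustration under stated assumptions, not NVIDIA's kernel: the maximum is taken as the absolute maximum (amax) of each block, the block size of 4 is arbitrary, the scale is rounded up to a power of two, the actual rounding to FP8 is omitted, and the transposed copy simply reuses the row-wise scales. The function name `quantize_and_transpose` is hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def quantize_and_transpose(x, block=4):
    """One sweep over x: per-block amax, power-of-two scale, and two output layouts.

    Returns (scaled, scaled_t, scales). Values are scaled into the FP8 E4M3
    range but NOT rounded to FP8; that cast is left out of this sketch.
    """
    rows, cols = len(x), len(x[0])
    scaled = [[0.0] * cols for _ in range(rows)]
    scaled_t = [[0.0] * rows for _ in range(cols)]
    scales = [[1.0] * math.ceil(cols / block) for _ in range(rows)]
    for i in range(rows):
        for b, start in enumerate(range(0, cols, block)):
            chunk = x[i][start:start + block]
            amax = max(abs(v) for v in chunk)
            if amax > 0.0:
                # frexp gives amax / MAX = m * 2**exp with 0.5 <= m < 1,
                # so dividing by 2**exp keeps every value below FP8_E4M3_MAX.
                _, exp = math.frexp(amax / FP8_E4M3_MAX)
                scales[i][b] = math.ldexp(1.0, exp)
            s = scales[i][b]
            for j, v in enumerate(chunk, start):
                q = v / s
                scaled[i][j] = q      # row-major copy
                scaled_t[j][i] = q    # transposed copy, written in the same pass
    return scaled, scaled_t, scales


# Small usage example with made-up values.
data = [
    [1.0, -2.0, 0.5, 0.25, 900.0, 3.0, -7.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, -0.001, 0.002, 0.0005, 0.0],
]
scaled, scaled_t, scales = quantize_and_transpose(data, block=4)
```

Each element is read once and written twice, which is the point of the fusion: a separate amax reduction and a separate transpose would each need their own trip through memory.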
