
FP8 Training Speedup Showdown: Comparing Scaling Recipes in NVIDIA NeMo for LLMs

NVIDIA NeMo Framework enables faster training throughput using FP8 precision, delivering substantial performance gains across large language models. This post evaluates the most effective FP8 scaling techniques (delayed scaling, current scaling, sub-channel scaling, MXFP8, and generic block scaling) on NVIDIA H100 and DGX B200 GPUs, focusing on real-world efficiency, numerical stability, and scalability.

FP8 reduces computational and memory demands by using 8-bit precision instead of 16- or 32-bit formats. This leads to faster matrix multiplications, reduced communication overhead in distributed training, and improved utilization of modern GPU hardware. The key challenge lies in balancing speed with numerical accuracy through effective scaling strategies.

Hardware-native FP8 scaling methods, such as tensor-wise, channel-wise, and sub-channel-wise approaches, show significant speedups over BF16. On H100 GPUs, tensor-wise scaling achieves up to 2x faster GEMM performance thanks to the minimal overhead of managing a single scaling factor per tensor (a simplified sketch of this computation appears after the article body). Finer-grained methods like 128×128 block scaling offer better numerical stability but incur higher overhead, resulting in slightly lower throughput.

As GEMM size increases, the performance advantage of FP8 grows. Larger operations make the computational savings from reduced precision more dominant, allowing FP8 to outperform BF16 even with the added scaling complexity. This trend is especially pronounced in large models, where FP8 efficiency scales with model size.

Training loss curves for Llama 3.1 models reveal critical trade-offs. While per-tensor scaling delivers high throughput, it shows greater loss fluctuations. In contrast, block-wise scaling (e.g., FP8-blockwise) closely follows the BF16 baseline, indicating superior convergence and stability. This highlights a key insight: finer granularity improves numerical fidelity at the cost of some raw speed, but the resulting model quality often justifies the trade-off.

Experiments using NeMo Framework 25.04 on Llama 3 8B, 70B, and 405B and Nemotron 15B and 340B models show strong model-size-dependent speedups. On H100 GPUs, the current scaling recipe delivers a 1.30x speedup for the 8B model, rising to 1.53x for the 405B model. On DGX B200 systems, MXFP8 achieves a consistent 1.28x to 1.37x speedup across models, with peak gains at larger scales due to optimized memory and compute on the Blackwell architecture.

MXFP8, which applies a shared scaling factor to each 32-value block, is designed specifically for Blackwell's Tensor Cores (see the block-scaling sketch below). It balances dynamic range and efficiency, enabling stable convergence and high throughput even at hundreds of billions of parameters. The architecture's unified memory and high bandwidth further enhance performance, especially in large-scale training.

The NVIDIA GB200 Grace Blackwell Superchip, which combines two B200 GPUs with a Grace CPU via NVLink, offers a unified memory space and higher bandwidth. Benchmarks show it outperforms standalone B200 systems, particularly for large models, due to reduced data movement and improved memory access patterns.

In summary, FP8 training delivers measurable speedups, especially for large models. Per-tensor scaling maximizes raw throughput, while block-based methods like MXFP8 provide better convergence and stability. The choice depends on workload priorities: speed versus accuracy. With NeMo Framework, developers can deploy these techniques efficiently, unlocking faster, more scalable AI training on modern NVIDIA hardware.
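
To make the per-tensor "current scaling" idea concrete, the following is a minimal PyTorch sketch, not NeMo's internal implementation: the scale is derived from the amax of the tensor being quantized at this step, so a single FP32 factor covers the whole tensor. The function names and the 448 constant (the largest finite E4M3 value) are illustrative assumptions.

```python
# Illustrative sketch of per-tensor (current) FP8 scaling; not NeMo internals.
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3


def quantize_per_tensor(x: torch.Tensor):
    """Quantize a BF16/FP32 tensor to FP8 E4M3 with one current-scale factor."""
    amax = x.abs().amax().float().clamp(min=1e-12)  # amax of the tensor right now
    scale = FP8_E4M3_MAX / amax                     # map amax onto the FP8 range
    x_fp8 = (x.float() * scale).to(torch.float8_e4m3fn)  # round to FP8 storage
    return x_fp8, scale


def dequantize(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_fp8.float() / scale                    # undo the scale after the GEMM


x = torch.randn(4096, 4096, dtype=torch.bfloat16)
x_fp8, scale = quantize_per_tensor(x)
x_hat = dequantize(x_fp8, scale)
print("max quantization error:", (x.float() - x_hat).abs().max().item())
```

Because only one scaling factor is tracked per tensor, the overhead is minimal, which is why this recipe tends to maximize raw GEMM throughput, at the cost of the loss fluctuations noted above when a few outlier values dominate the shared scale.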
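The MXFP8 recipe described above instead shares one scale per 32-value block. The sketch below mimics that structure in plain PyTorch under stated assumptions: the real MXFP8 format stores each shared scale as an 8-bit power-of-two exponent (E8M0) consumed directly by Blackwell Tensor Cores, while here the scale is simply held as an FP32 power of two for clarity.

```python
# Illustrative sketch of MXFP8-style block scaling: 32 values share one scale.
import torch

FP8_E4M3_MAX = 448.0
BLOCK = 32  # MXFP8 micro-block size


def quantize_mx_blocks(x: torch.Tensor):
    rows, cols = x.shape
    assert cols % BLOCK == 0, "illustration assumes the last dim is a multiple of 32"
    blocks = x.float().reshape(rows, cols // BLOCK, BLOCK)
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    # One power-of-two scale per block, approximating the shared E8M0 exponent.
    scale = torch.exp2(torch.floor(torch.log2(FP8_E4M3_MAX / amax)))
    q = (blocks * scale).to(torch.float8_e4m3fn)
    return q, scale


def dequantize_mx_blocks(q: torch.Tensor, scale: torch.Tensor, shape):
    return (q.float() / scale).reshape(shape)


x = torch.randn(128, 1024, dtype=torch.bfloat16)
q, s = quantize_mx_blocks(x)
x_hat = dequantize_mx_blocks(q, s, x.shape)
print("max quantization error:", (x.float() - x_hat).abs().max().item())
```

The finer granularity means an outlier only distorts its own 32-value block rather than the whole tensor, which is the numerical-stability advantage the loss curves reflect; the price is more scale bookkeeping per GEMM.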
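Finally, for readers who want to try an FP8 recipe directly, here is a minimal sketch using Transformer Engine, the library the NeMo Framework builds on. It shows the delayed-scaling recipe, where scales come from a history of amax values collected over earlier iterations; the specific hyperparameter values and layer sizes are assumptions for illustration, and the NeMo-level configuration keys for each recipe differ from this raw API.

```python
# Minimal Transformer Engine example (assumes a CUDA GPU with FP8 support).
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Delayed scaling: scaling factors are derived from an amax history of past steps.
fp8_recipe = recipe.DelayedScaling(
    margin=0,
    fp8_format=recipe.Format.HYBRID,   # E4M3 for activations/weights, E5M2 for gradients
    amax_history_len=1024,
    amax_compute_algo="max",
)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)    # the GEMM inside runs in FP8 under the chosen recipe
y.sum().backward()
```

Swapping in a different recipe object is how the scaling strategies compared in this post are selected; which one pays off depends, as summarized above, on whether raw throughput or convergence stability matters more for the workload.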
