3 months ago
LLM
GPU
Transformer

NVIDIA Blackwell Ultra Boosts AI Inference Speed by Doubling SFU Throughput for Efficient Softmax Operations

NVIDIA Blackwell Ultra targets a long-standing bottleneck in large language model inference: the softmax function. As LLM context lengths grow and attention variants such as Multi-Head Latent Attention (MLA) and Grouped Query Attention (GQA) become common, the cost of softmax becomes a dominant factor in inference speed. Tensor Cores excel at the multiply-accumulate arithmetic of matrix multiplication, but softmax relies on transcendental functions, particularly the exponential, which are handled by Special Function Units (SFUs). In NVIDIA's SASS assembly the exponential is executed via the MUFU.EX2 instruction (a base-2 exponential; the natural exponential is obtained by scaling the input by log2 e), and the pipeline stalls whenever the SFUs cannot keep pace with the high-throughput matrix engines.

Blackwell Ultra addresses this limitation by doubling SFU throughput relative to the standard Blackwell architecture. This directly shortens softmax normalization, especially for large attention matrices. The attention matrix scales quadratically with sequence length: for a sequence of 8,192 tokens it has up to 8,192 × 8,192 ≈ 67 million entries per attention head, so across all heads and layers a forward pass requires billions of exponential operations. Without sufficient SFU capacity, Tensor Cores sit idle while waiting for normalization, which severely limits overall throughput.

The impact is visible in the execution of the attention loop. On previous Blackwell GPUs (GB200), the pipeline shows a clear dependency: the second batched matrix multiplication (BMM2) cannot start until softmax completes, leaving the Tensor Cores idle in between. With Blackwell Ultra (GB300), MUFU.EX2 processing time is nearly halved, which narrows the gap between BMM1 and BMM2 and keeps the Tensor Cores active for a larger share of the loop.

A synthetic benchmark using the exp2-bg300.cu kernel confirms the performance gain.
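
To make the role of the exponential concrete, here is a minimal Python sketch of a numerically stable softmax written purely in terms of a base-2 exponential, the operation that MUFU.EX2 provides, together with a count of how many exponentials a forward pass needs. It is not the benchmark kernel and not NVIDIA's implementation: the function names `softmax_exp2` and `exp_count`, and the 32-head, 32-layer model shape, are assumptions chosen only for illustration.

```python
import math

LOG2_E = math.log2(math.e)  # exp(x) == 2 ** (x * LOG2_E)


def softmax_exp2(scores):
    """Numerically stable softmax over one row, using only base-2 exponentials."""
    row_max = max(scores)
    # One base-2 exponential per score; subtracting the max avoids overflow.
    weights = [2.0 ** ((s - row_max) * LOG2_E) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def exp_count(seq_len, num_heads, num_layers, causal=True):
    """Exponentials for one full-sequence forward pass (hypothetical model shape)."""
    # A causal mask keeps only the lower triangle of the score matrix.
    per_head = seq_len * (seq_len + 1) // 2 if causal else seq_len * seq_len
    return per_head * num_heads * num_layers


probs = softmax_exp2([1.0, 2.0, 3.0])
print([round(p, 4) for p in probs])

# Assumed shape: 32 heads, 32 layers, no causal mask.
print(exp_count(8192, num_heads=32, num_layers=32, causal=False))  # 68719476736
```

With these assumed dimensions, a single 8,192-token pass needs roughly 69 billion exponentials, which shows how quickly the quadratic term reaches the "billions" scale.
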
When the benchmark kernel is compiled for sm100f (GB200) and sm103a (GB300), the results show an approximately 2x increase in measured FLOPS on GB300 across all data types, consistent with the doubled SFU throughput. This improvement matters most at FP8 precision, where the matrix operations are already fast, so softmax accounts for a larger fraction of total execution time. In forward propagation (FPROP) benchmarks for models like DeepSeek-V3 using GQA, Blackwell Ultra delivers a ~35% increase in throughput for FP8 operations. This gain underscores that in modern, optimized architectures the bottleneck is no longer only raw matrix multiplication but also the speed of non-linear, transcendental math.

Blackwell Ultra reaches this result through hardware-software co-design focused on the attention loop. The higher SFU throughput, combined with other low-level improvements in cuDNN and TensorRT-LLM, keeps the entire inference pipeline running more smoothly. The result is faster AI "thought" with reduced latency, particularly in long-context applications. For developers and researchers, NVIDIA's trtllm-gen repository offers tools to measure and take advantage of this performance boost.

The key takeaway is clear: accelerating AI inference is not only about faster matrix engines. It is equally about ensuring that the specialized units handling transcendental math, such as SFUs, can keep up. Blackwell Ultra delivers that by doubling SFU throughput.
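
As a back-of-the-envelope check on how a 2x faster exponential relates to a ~35% end-to-end gain, the sketch below applies Amdahl's law to a simplified, fully serial attention loop. The serial model and the function names `loop_speedup` and `required_sfu_fraction` are assumptions for illustration. Real kernels overlap softmax with the matrix multiplications, so these figures indicate scale only and do not reproduce NVIDIA's measurements.

```python
def loop_speedup(sfu_fraction, sfu_gain=2.0):
    """Amdahl's law: overall speedup when only the SFU-bound share gets faster."""
    if not 0.0 <= sfu_fraction <= 1.0:
        raise ValueError("sfu_fraction must be between 0 and 1")
    return 1.0 / ((1.0 - sfu_fraction) + sfu_fraction / sfu_gain)


def required_sfu_fraction(target_speedup, sfu_gain=2.0):
    """Invert Amdahl's law: SFU-bound share needed to reach target_speedup."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / sfu_gain)


# Assumed SFU-bound shares of loop time, for illustration only.
for fraction in (0.1, 0.3, 0.5):
    print(f"SFU-bound share {fraction:.0%}: speedup {loop_speedup(fraction):.2f}x")

print(f"Share needed for 1.35x: {required_sfu_fraction(1.35):.0%}")
```

Under this simplified model, a 1.35x gain from a 2x SFU improvement would require about half of the loop time to be exponential-bound, which fits the point that softmax takes a large share of FP8 attention time.
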
