NVIDIA cuBLAS 12.9 Enhances Matrix Multiplication Performance and Precision on Blackwell and Hopper Architectures
The NVIDIA CUDA-X math libraries have been instrumental in advancing AI, scientific computing, and data processing by providing optimized tools for developers. One of the most critical components of these libraries is cuBLAS, which handles fundamental linear algebra operations such as matrix multiplications (matmuls), essential for training and inference in large language models (LLMs).

cuBLAS 12.9 Enhancements
NVIDIA's cuBLAS 12.9, released with the CUDA Toolkit 12.9, introduces significant improvements to matmul performance and flexibility. These enhancements are optimized in particular for the NVIDIA Blackwell and Hopper architectures, which are widely used in high-performance computing and machine learning.

Channel- and Block-Scaled FP8 Matmuls on Hopper
One of the key features in the latest version of cuBLAS is support for channel- and block-scaled FP8 matmuls on Hopper GPUs. Channel-wise (outer-vector) scaling applies a single scaling factor to each row of matrix A or each column of matrix B. Block scaling refines this further by applying a scaling factor to each 128-element 1D block or each 128x128 2D block within the matrices. This flexibility improves accuracy while keeping FP8 throughput: benchmarks show a 1.75x speedup for large matmuls and at least a 1.25x speedup for most other cases compared to the BF16 baseline. A sketch of how these scaled matmuls are set up through cuBLASLt appears in the Getting Started section below.

Block-Scaled FP4 and FP8 Matmuls on Blackwell
NVIDIA Blackwell Tensor Cores now support 1D block-scaled FP4 and FP8 floating-point types, which balance reduced precision against high throughput. Because each small block carries its own scaling factor, values within a block can be represented more precisely, improving overall accuracy. cuBLAS 12.9 exposes these new data types through dedicated cuBLASLt APIs. The new scaling modes also let the library compute the scaling factors for the D tensor during the matmul itself, eliminating the need to pre-estimate them or make an additional pass over the data. When D is an FP4 tensor, a secondary scaling factor is applied to further improve quantization accuracy.

Performance Benchmarks on Blackwell
The performance gains from cuBLAS 12.9 on Blackwell GPUs are substantial. Synthetic benchmarks for large, compute-bound matrices show block-scaled FP4 running 4.6x faster on GB200 than the H200 FP8 baseline, reaching up to 6787 TFLOPS. Across a dataset of random matrix sizes, the speedup is less dramatic but still significant: at least 1.7x and up to 2.2x over H200 baselines using BF16 and FP8 data types. Real-world workloads, such as LLM training and inference, also benefit, with Blackwell delivering consistent speedups.

FP32 Emulation on Blackwell
Another notable feature of cuBLAS 12.9 is FP32 matmul emulation using BF16 tensor cores on Blackwell GPUs. This technique can significantly improve both performance and energy efficiency: for large matrices, emulated FP32 on B200 GPUs achieves 3 to 4x more TFLOPS than native FP32 on either B200 or H200. FP32 emulation has already been demonstrated in applications such as weather forecasting, where it delivered a 1.4x performance boost and a 1.3x improvement in energy efficiency. The library also provides APIs for autotuning and heuristic-based optimization to extract further performance.

Getting Started with cuBLAS 12.9
Developers can start using cuBLAS 12.9 by downloading the CUDA Toolkit 12.9. The cuBLAS documentation provides detailed information on the new scaling schemes, block-scaled data types, and FP32 emulation capabilities.
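As a rough illustration of how the pieces fit together, the sketch below sets up a per-tensor-scaled FP8 matmul through cuBLASLt. The scale-pointer attributes, layout calls, and matmul call are the established cuBLASLt API; the channel/block scale-mode attribute appears only in a comment because its exact identifiers are an assumption based on the 12.9 release notes, so verify the enum names against the current cuBLASLt documentation. Error handling is reduced to a minimal check macro.

```cpp
// Hedged sketch: D (FP16) = scaleA * scaleB * (A^T * B), with A and B stored as FP8 (E4M3).
#include <cublasLt.h>
#include <cuda_runtime.h>
#include <cuda_fp8.h>
#include <cuda_fp16.h>
#include <cstdio>
#include <cstdlib>

#define CHECK_CUBLAS(call)                                                     \
    do {                                                                       \
        cublasStatus_t s_ = (call);                                            \
        if (s_ != CUBLAS_STATUS_SUCCESS) {                                     \
            std::printf("cuBLASLt error %d at line %d\n", (int)s_, __LINE__);  \
            std::exit(1);                                                      \
        }                                                                      \
    } while (0)

void fp8_scaled_matmul(cublasLtHandle_t lt,
                       int m, int n, int k,
                       const __nv_fp8_e4m3* dA, const float* dScaleA,
                       const __nv_fp8_e4m3* dB, const float* dScaleB,
                       __half* dD,
                       void* workspace, size_t workspaceSize,
                       cudaStream_t stream) {
    // FP8 matmuls on Hopper require the TN layout: A transposed, B non-transposed.
    cublasOperation_t transa = CUBLAS_OP_T, transb = CUBLAS_OP_N;

    cublasLtMatmulDesc_t op;
    CHECK_CUBLAS(cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F));
    CHECK_CUBLAS(cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_TRANSA,
                                                &transa, sizeof(transa)));
    CHECK_CUBLAS(cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_TRANSB,
                                                &transb, sizeof(transb)));

    // Device pointers to the scale factors for A and B. With the default tensor-wide
    // scale mode each points to a single float; with the 12.9 channel/block modes they
    // would point to one float per row/column or per 128-element block.
    CHECK_CUBLAS(cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_A_SCALE_POINTER,
                                                &dScaleA, sizeof(dScaleA)));
    CHECK_CUBLAS(cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_B_SCALE_POINTER,
                                                &dScaleB, sizeof(dScaleB)));
    // Assumed 12.9-style opt-in to outer-vector (channel) scaling; the names below are
    // illustrative placeholders, not verified identifiers:
    // cublasLtMatmulMatrixScale_t mode = CUBLASLT_MATMUL_MATRIX_SCALE_OUTER_VEC_32F;
    // cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_A_SCALE_MODE, &mode, sizeof(mode));

    // Matrix layouts: A and B are FP8 (E4M3), D is FP16. A is stored k x m because it is
    // consumed transposed.
    cublasLtMatrixLayout_t la, lb, ld;
    CHECK_CUBLAS(cublasLtMatrixLayoutCreate(&la, CUDA_R_8F_E4M3, k, m, k));
    CHECK_CUBLAS(cublasLtMatrixLayoutCreate(&lb, CUDA_R_8F_E4M3, k, n, k));
    CHECK_CUBLAS(cublasLtMatrixLayoutCreate(&ld, CUDA_R_16F, m, n, m));

    float alpha = 1.0f, beta = 0.0f;
    // A null algo pointer lets cuBLASLt choose a kernel via its internal heuristics.
    CHECK_CUBLAS(cublasLtMatmul(lt, op, &alpha, dA, la, dB, lb, &beta,
                                dD, ld, dD, ld, nullptr,
                                workspace, workspaceSize, stream));

    cublasLtMatrixLayoutDestroy(ld);
    cublasLtMatrixLayoutDestroy(lb);
    cublasLtMatrixLayoutDestroy(la);
    cublasLtMatmulDescDestroy(op);
}
```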
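For the autotuning and heuristic-based optimization mentioned above, the relevant entry point is the long-standing cublasLtMatmulAlgoGetHeuristic call. The sketch below assumes the operation descriptor and layouts from the previous example and simply asks the library for its best candidates; a real autotuning pass would time each returned algorithm and keep the fastest.

```cpp
// Hedged sketch: querying cuBLASLt heuristics for candidate matmul algorithms.
#include <cublasLt.h>
#include <cstdint>
#include <cstdio>

void pick_algo(cublasLtHandle_t lt,
               cublasLtMatmulDesc_t op,
               cublasLtMatrixLayout_t la, cublasLtMatrixLayout_t lb,
               cublasLtMatrixLayout_t ld,
               uint64_t workspaceSize,
               cublasLtMatmulHeuristicResult_t* best) {
    cublasLtMatmulPreference_t pref;
    cublasLtMatmulPreferenceCreate(&pref);
    // Restrict the search to algorithms that fit in the workspace actually allocated.
    cublasLtMatmulPreferenceSetAttribute(pref, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
                                         &workspaceSize, sizeof(workspaceSize));

    // Ask for up to 8 candidates, ordered by the library's own performance estimate.
    const int requested = 8;
    cublasLtMatmulHeuristicResult_t results[requested];
    int returned = 0;
    cublasStatus_t status = cublasLtMatmulAlgoGetHeuristic(
        lt, op, la, lb, ld, ld, pref, requested, results, &returned);

    if (status == CUBLAS_STATUS_SUCCESS && returned > 0) {
        *best = results[0];  // candidates are sorted, so results[0] is the top suggestion
    } else {
        std::printf("no suitable cuBLASLt algorithm found\n");
    }
    cublasLtMatmulPreferenceDestroy(pref);
}
```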
Example usage can be found in the cuBLASLt Library API examples, and further insights are available from NVIDIA GTC talks and sessions.

Industry Insights and Company Profiles
Industry experts have praised the advancements in cuBLAS 12.9, noting that the improved scaling and new data types offer a significant leap forward in balancing accuracy and performance. The ability to emulate FP32 using BF16 tensor cores is seen as particularly innovative, as it not only speeds up computations but also reduces energy consumption, making it well suited to large-scale applications in both research and industry.

NVIDIA, known for its leadership in GPU technology, continues to push the boundaries of computational performance with the release of cuBLAS 12.9, reinforcing its position as a key player in high-performance computing and AI.