3 months ago
NVIDIA
GPU

CUDA 13.2 adds enhanced tile support and Python features

NVIDIA has released CUDA 13.2, a major toolkit update aimed at developer productivity, with expanded Python support, new C++ runtime APIs, and advanced GPU management features. The release introduces CUDA Tile support for Ampere, Ada, and Blackwell architectures, with full support for all Ampere-based devices planned for a future update. Developers can install cuTile Python via pip, without a system-wide CUDA Toolkit installation.

Memory management and driver behavior have also been improved. Two new API functions, cudaMemcpyWithAttributesAsync and cudaMemcpy3DWithAttributesAsync, let developers apply specific attributes to a single memory transfer without going through the more cumbersome batched interfaces. On Windows, GPU local memory usage has been reduced in WDDM driver mode, which benefits virtualized environments. CUDA drivers now also default to MCDM mode instead of TCC to improve compatibility with modern operating systems, though TCC remains available for legacy users.

The toolkit includes updates for AI and high-performance computing. In the math libraries, cuBLAS adds Grouped GEMM support for MXFP8 on Blackwell GPUs, offering up to 4x speedups in specific use cases. cuSOLVER now supports FP64-emulated calculations, delivering up to 2x performance gains for large matrix operations on NVIDIA B200 systems.

On the compiler side, CUDA 13.2 adds support for Visual Studio 2026 and the ARM C Language Extensions, and unifies the toolkit for Tegra and desktop GPUs to streamline container management. For embedded and Arm-based systems, CUDA 13.2 supports NVIDIA Jetson Orin devices and introduces Multi-Instance GPU (MIG) support on Jetson Thor. MIG partitions the GPU into isolated instances, so safety-critical workloads such as motor control can run without interference from non-critical tasks such as perception models, with predictable latency and quality of service.

Tooling has been upgraded with the introduction of NVIDIA Nsight Python, which enables kernel profiling directly from Python code. Numba-CUDA kernels can now be debugged with CUDA-GDB and NVIDIA Nsight Visual Studio Code Edition. Nsight Compute adds report clustering and a Register Dependency correlation window for identifying bottlenecks.

The CCCL library has been updated to version 3.2, providing modern C++ runtime APIs that replace C-like wrappers for better safety and productivity. New CCCL algorithms, such as Top-K selection and fixed-size segmented reduction, offer substantial performance improvements over traditional sorting and reduction methods.

In the Python ecosystem, CuPy now fully supports CUDA 13.0 and 13.1 and enables direct stream sharing with PyTorch and JAX for zero-copy interoperability. The cuda.core 0.6 update brings NVML bindings for GPU monitoring and advances in CUDA Graphs, allowing efficient execution of complex operation sequences with conditional logic.

NVIDIA encourages developers to download the CUDA 13.2 Toolkit to take advantage of these productivity and performance improvements. The release continues the effort to make Python a first-class citizen in GPU programming while providing modern, high-performance interfaces for C++ developers.
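
To make the two CCCL algorithms named above more concrete, here is a minimal CPU-side sketch in plain Python of what Top-K selection and fixed-size segmented reduction compute. It does not use CCCL or any GPU API; the function names `top_k` and `fixed_size_segmented_sum` are hypothetical and chosen only for illustration. The point is that Top-K selection returns the k largest items without ordering the entire input, and that a fixed-size segmented reduction needs no per-segment offset array because every segment has the same length.

```python
import heapq


def top_k(values, k):
    """Return the k largest values in descending order.

    heapq.nlargest keeps only k candidates at a time, so the full
    input is never sorted (hypothetical helper, not a CCCL API).
    """
    return heapq.nlargest(k, values)


def fixed_size_segmented_sum(values, segment_size):
    """Sum each consecutive run of `segment_size` elements.

    Because all segments share one length, segment boundaries are
    implied by the index and no offsets array is required
    (hypothetical helper, not a CCCL API).
    """
    if segment_size <= 0 or len(values) % segment_size != 0:
        raise ValueError("input length must be a positive multiple of segment_size")
    return [
        sum(values[start:start + segment_size])
        for start in range(0, len(values), segment_size)
    ]


scores = [0.3, 0.9, 0.1, 0.7, 0.5, 0.8]
print(top_k(scores, 2))

data = [1, 2, 3, 4, 5, 6]
print(fixed_size_segmented_sum(data, 3))
```

A full sort followed by slicing gives the same Top-K result but does more work than necessary when k is much smaller than the input, which is the intuition behind the performance claim in the article.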
