HyperAI

RAPIDS 25.04: Boosting Python ML with Zero-Code-Change Acceleration and Advanced IO Performance

5 days ago

RAPIDS, NVIDIA's suite of open-source software libraries for data science and machine learning, has introduced several notable updates across its recent releases. These enhancements focus on performance, usability, and scalability, making it easier for data scientists to put GPUs to work in their existing workflows.

Zero-Code-Change Acceleration with NVIDIA cuML

One of the most notable features is zero-code-change acceleration for workflows built on scikit-learn, UMAP, and hdbscan. Introduced as an open beta, this functionality lets data scientists keep using familiar PyData APIs while automatically harnessing NVIDIA GPUs for significant performance gains, with speedups ranging from 5x to 175x depending on the algorithm and dataset size. To enable it, users simply load the cuml.accel IPython extension before importing their standard CPU machine learning libraries. More details are available in the cuML documentation.

Major IO Performance Improvements in cuDF

Cloud Object Storage

RAPIDS has also made substantial improvements to cuDF's IO performance, particularly for cloud data processing workloads. By integrating NVIDIA KvikIO and parallelizing the reading of Parquet file footers, cuDF and Dask can now read Parquet files from Amazon S3 over 3x faster. The benchmark used a g4dn.12xlarge EC2 instance with 50 Gbps of network bandwidth and a dataset of 360 Apache Parquet files totaling about 46 GB. These gains require no changes from the user.

Hardware-Based Decompression

The hardware decompression engine introduced in NVIDIA's Blackwell architecture further improves IO performance. cuDF 25.02 and later can take advantage of this engine, yielding a 35% faster end-to-end runtime on the Polars Decision Support (PDS) benchmark at scale factor 100.
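The zero-code-change workflow described above might look like the following sketch. The dataset and model here are illustrative; the point is that the scikit-learn code itself is unmodified, and on a GPU machine the cuml.accel extension (or the python -m cuml.accel launcher) accelerates it transparently.

```python
# Zero-code-change acceleration (open beta): enable the accelerator first,
# then import the usual CPU libraries unchanged. In a notebook:
#   %load_ext cuml.accel
# Or, for a script, with no source change at all:
#   python -m cuml.accel train.py
# The workload itself stays plain scikit-learn:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic example data: 1,000 points around 4 centers in 2 dimensions.
X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)

model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(model.cluster_centers_.shape)  # (4, 2): four centers, two features
```

Run under cuml.accel, the same script dispatches supported estimators to the GPU and falls back to the CPU implementation for anything unsupported.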
This improvement is attributed to the low latency and high throughput of the Blackwell Decompression Engine.

Enhanced Usability for Polars GPU Engine

The Polars GPU engine, powered by cuDF, has received major usability enhancements. Two highly requested features are available starting with RAPIDS 25.04 and Polars 1.25:

- Global Configuration: Users can set a default GPU engine once at the beginning of a workflow. If the GPU engine does not support a specific query, execution falls back to the Polars CPU engine seamlessly.
- GPU-Aware Profiling: The Polars profiler now supports GPU execution. The .profile() method on LazyFrame can be configured to use the GPU, helping users understand query performance whether it runs on CPUs or GPUs.

Out-of-Core XGBoost for Large Datasets

In collaboration with the DMLC community, NVIDIA released XGBoost 3.0 in March. This release features a redesigned external memory interface that makes it possible to train models on datasets exceeding the memory capacity of a single GPU. Optimized for coherent memory systems such as NVIDIA GH200 Grace Hopper and GB200 Grace Blackwell, a single Grace Hopper system can handle datasets of over 1 TB using the RAPIDS Memory Manager (RMM). The new ExtMemQuantileDMatrix interface and data iterators simplify out-of-core training, and the external memory interface also supports multi-GPU and distributed training for even larger datasets.

Redesigned Forest Inference Library (FIL)

cuML 25.04 includes a stable, higher-performance version of the Forest Inference Library (FIL), which provides fast inference for tree models such as XGBoost, LightGBM, and Random Forest. The redesigned FIL delivers a median speedup of about 40% over the previous version, with larger gains depending on model characteristics such as depth, number of trees, and batch size. Three new features have also been added to improve the deployment experience.
Platform Updates

Blackwell Support

All RAPIDS projects support NVIDIA Blackwell-architecture GPUs as of the 25.02 release. This includes hardware capabilities such as the decompression engine, further boosting performance and efficiency.

Conda Improvements

Installing RAPIDS libraries with Conda is now smoother. With strict channel priority for CUDA 12 on both x86 and ARM SBSA-based systems, creating environments and installing packages is faster and more reliable. This update addresses longstanding community requests for a better installation process.

Google Colab Integration

Google Colab, a popular managed notebook platform, now includes cuML and GPU-accelerated Polars. This expansion of Colab's batteries-included libraries lets users access GPU-accelerated data processing with zero code changes. The Colab Gemini assistant is now "RAPIDS-aware" and can generate GPU-accelerated pandas code powered by cuDF.

Industry Evaluation

These updates from NVIDIA RAPIDS represent a significant step forward for accelerated data science workflows. Industry observers highlight their potential to democratize GPU-accelerated computing by lowering the barrier to entry for data scientists; the zero-code-change capability in particular is praised for its ease of adoption and minimal disruption to existing pipelines. With these features, RAPIDS continues to solidify its position as a leading framework for GPU-accelerated data science, supported by a community of more than 3,500 members on the RAPIDS Slack. For those new to RAPIDS, comprehensive resources are available to help get started quickly. NVIDIA GTC 2025 also showcased a wide array of data science sessions and workshops, underscoring the company's commitment to advancing the field through robust and accessible GPU technologies.
Whether you're working in the cloud, on-premises, or with massive datasets, the latest RAPIDS releases promise substantial performance improvements and enhanced user experience.
