
NVIDIA cuML's Updated Forest Inference Library Boosts Tree-Based Model Performance with Auto-Optimization and New Prediction APIs

3 months ago

Tree-ensemble models, such as random forests and gradient-boosted trees, have long been a staple for tabular data thanks to their accuracy and training efficiency. Deploying these models for inference, however, often hits a roadblock when using Python on CPUs, especially when low latency (sub-10 ms) or high throughput (millions of predictions per second) is required. The Forest Inference Library (FIL) in NVIDIA cuML addresses this challenge by using GPUs to accelerate inference. Originally introduced in cuML 0.9 in 2019, FIL has undergone a significant redesign in RAPIDS 25.04, bringing a host of new features and performance improvements.

What's New in FIL in cuML 25.04

Auto-Optimization

One of the key updates in the latest version of FIL is an auto-optimization method. This feature automatically tunes the performance hyperparameters for a given model and batch size, eliminating the need to determine optimal settings empirically. Once the .optimize method is called, subsequent prediction calls use the best configuration found, which can be verified by checking the .layout and .default_chunk_size attributes. This makes it easier for users to achieve peak performance without hand-tuning parameters.

New Prediction APIs

FIL now offers two additional prediction methods: .predict_per_tree and .apply. The .predict_per_tree method returns the prediction of each individual tree in the ensemble, allowing users to experiment with advanced ensembling techniques or analyze the model's decision-making in detail. For instance, one can weight each tree by its age, out-of-bag AUC, or data-drift score to generate more informed final predictions without retraining. The .apply method, in turn, returns the ID of the leaf node that each tree assigns to each input row.
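As a rough illustration of what these outputs enable, the sketch below uses plain Python with hand-made arrays standing in for .predict_per_tree and .apply results; the per-tree values, weights, and leaf IDs are hypothetical, and this is not cuML code itself.

```python
# Hypothetical per-tree probability outputs for 3 rows x 4 trees,
# shaped like a .predict_per_tree result.
per_tree = [
    [0.9, 0.2, 0.8, 0.6],
    [0.1, 0.3, 0.2, 0.4],
    [0.7, 0.6, 0.9, 0.5],
]

# Hypothetical per-tree weights, e.g. derived from out-of-bag AUC
# or a data-drift score, so no retraining is needed.
weights = [0.4, 0.1, 0.3, 0.2]

def weighted_prediction(tree_preds, w):
    """Weighted average of one row's per-tree outputs."""
    return sum(p * wi for p, wi in zip(tree_preds, w)) / sum(w)

# Hypothetical leaf IDs for two rows across 4 trees, shaped like
# an .apply result.
leaves_a = [3, 7, 2, 5]
leaves_b = [3, 6, 2, 5]

def leaf_similarity(a, b):
    """Fraction of trees that route both rows to the same leaf."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

print(weighted_prediction(per_tree[0], weights))
print(leaf_similarity(leaves_a, leaves_b))  # 0.75
```

The similarity measure follows the description above: the more trees that send two rows to the same leaf, the more alike those rows are from the model's perspective.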
This functionality is particularly useful for measuring similarity between data points, since it counts how many trees send two rows to the same leaf.

GPU and CPU Support

While earlier versions of FIL focused on accelerating inference on GPUs, the latest release also supports CPU-only environments. This flexibility matters for developers who need to test models locally on small datasets before deployment, and for scenarios where scaling down to CPU-only machines during low-traffic periods and back up to GPUs at peak times helps manage cost and performance. FIL can be compiled in CPU-only mode and called from C++ without any CUDA dependencies, using OpenMP to parallelize across CPU cores. For Python users, a new context manager in cuML 25.04 makes it easy to run FIL on CPUs.

Performance Improvements

Memory Optimization

The performance gains in cuML 25.04 come primarily from better memory management. Decision tree nodes are now stored in the minimum required size (8 or 16 bytes) and arranged in smarter layouts. The depth_first layout, used by default, is optimized for deeper trees (depth ≥ 4); the layered layout works best for small batches (1–128 rows); and breadth_first suits larger batches. Additionally, a new hyperparameter, align_bytes, aligns trees to cache-line boundaries, which can improve performance on CPUs and sometimes on GPUs.

Benchmark Results

Extensive benchmarks compared cuML 25.04 against the previous version and against scikit-learn's native inference. The tests covered a wide range of model parameters, including maximum tree depth (2, 4, 8, 16, 32), tree count (16, 128, 1024, 2048), and feature count (8, 32, 128, 512), as well as batch sizes from 1 to 16,777,216 (1, 16, 128, 1,024, 1,048,576, 16,777,216).

Batch size 1 inference: cuML 25.04 outperformed the previous version in 81% of the tested models, with a median speedup of 1.6x.
Slight performance regressions were observed for models with many deep trees, but the overall improvement was significant.

Maximum throughput: the new version outperformed the previous one in 76% of models, with a median speedup of 1.4x. Minor regressions were noted for shallow trees, but overall throughput was notably higher.

Compared to scikit-learn's native execution on an AMD EPYC 9654P 96-core CPU, cuML 25.04 on a single NVIDIA H100 (80 GB HBM3) GPU consistently came out ahead: the median speedup at batch size 1 was 239x, with similarly superior maximum throughput.

Applications and Future Developments

The enhanced capabilities of FIL make it a strong choice for a variety of applications, from high-traffic online services requiring real-time inference to large-scale batch processing jobs. By integrating auto-optimization and additional prediction methods, FIL simplifies model deployment and analysis while maintaining high performance. Future releases will bring FIL to NVIDIA Triton Inference Server, further expanding its utility.

Industry Insights and Company Profile

Industry experts praise the latest release of FIL for its significant performance gains and user-friendly features. The auto-optimization method, in particular, is seen as a game-changer for simplifying the deployment of complex tree-ensemble models. NVIDIA, a leader in GPU technology and accelerated computing, continues to innovate with tools like cuML, which extend the capabilities of data scientists and machine learning engineers working with large datasets. The company's commitment to both GPU and CPU support underscores its dedication to flexible, high-performance solutions across a wide range of computing environments. Users looking to explore the new features and performance benefits of FIL can download cuML 25.04 from the official NVIDIA RAPIDS repository.
Comprehensive documentation and tutorials are available to guide developers through setup and usage, ensuring a smooth transition to this powerful library. Upcoming blog posts will delve deeper into the technical aspects of the new implementation, additional benchmarks, and the integration with Triton Inference Server.
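As a footnote to the memory-layout discussion above, the difference between depth_first and breadth_first node orderings can be illustrated on a toy complete binary tree. This is a conceptual sketch in plain Python, not cuML's actual node serialization, which additionally packs node data into 8- or 16-byte records and supports a third, layered layout.

```python
from collections import deque

# Toy complete binary tree with nodes labeled 0..n-1 in level order:
# root 0, its children 1 and 2, then leaves 3..6 for n = 7.
def children(i, n):
    left, right = 2 * i + 1, 2 * i + 2
    return [c for c in (left, right) if c < n]

def depth_first(n):
    """Serialize node IDs in depth-first (pre-order) order, so a
    root-to-leaf traversal touches mostly adjacent memory."""
    order, stack = [], [0]
    while stack:
        i = stack.pop()
        order.append(i)
        stack.extend(reversed(children(i, n)))
    return order

def breadth_first(n):
    """Serialize node IDs level by level, keeping each depth's
    nodes contiguous."""
    order, queue = [], deque([0])
    while queue:
        i = queue.popleft()
        order.append(i)
        queue.extend(children(i, n))
    return order

print(depth_first(7))    # [0, 1, 3, 4, 2, 5, 6]
print(breadth_first(7))  # [0, 1, 2, 3, 4, 5, 6]
```

Which ordering wins depends on access patterns, which is consistent with the guidance above: deeper trees favor depth_first, while larger batches favor breadth_first.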
