NVIDIA cuML 25.04 Boosts Tree-Based Model Inference with Enhanced Forest Inference Library (FIL)
Tree-ensemble models are widely used in machine learning for their accuracy and efficiency on tabular data. Deploying them for inference on CPUs, however, often becomes a bottleneck, especially when low latency or high throughput is required. The Forest Inference Library (FIL), a component of NVIDIA's cuML, addresses this by delivering high-performance inference for tree-based models such as gradient-boosted trees and random forests, regardless of where the model was originally trained. First introduced in 2019 with cuML 0.9, FIL has been significantly redesigned in RAPIDS 25.04, improving both its functionality and its speed.

New Capabilities and Features

Auto-Optimization

One of the standout features of the updated FIL is auto-optimization. Previously, users had to tune several hyperparameters by hand to reach peak performance, which was challenging and time-consuming. The new .optimize method automates this process: a single call finds the best settings for a given model and batch size, and subsequent prediction calls then use the hyperparameters it selected. The .layout and .default_chunk_size attributes expose the chosen settings. A sketch of this workflow appears after the Performance Improvements section below.

New Prediction APIs

FIL now includes two prediction methods that offer more granular control over the inference process, both illustrated in a sketch below:

- .predict_per_tree returns the prediction of each individual tree in the ensemble. This is particularly useful for experimenting with advanced ensembling techniques or for analyzing how the ensemble reaches its collective prediction; for instance, you can weight each tree by its age, out-of-bag AUC, or data-drift score to make more informed decisions.
- .apply returns the ID of the leaf node each tree routes a given input to. This extends the utility of tree-based models beyond traditional regression and classification tasks; one simple application is measuring the similarity between two data points by counting how many trees send them to the same leaf.

GPU and CPU Support

While FIL initially focused on accelerating inference on GPUs, the new version also supports CPU execution. This flexibility is essential for use cases such as local testing on small datasets and scaling down to CPU-only systems during low-traffic periods. For Python users, a new context manager enables CPU execution (see the sketch below); in the future, Python packages will also be installable on CPU-only systems.

Performance Improvements

The performance gains in cuML 25.04 come from several optimizations:

- Fewer memory fetches: data for decision nodes is now stored at minimal sizes (typically 8 or 16 bytes per node) and arranged in smarter layouts, minimizing the number of memory fetches needed during inference.
- Advanced layouts: the library supports three main layouts (depth_first, layered, and breadth_first), each optimized for different scenarios. depth_first works best for deeper trees (depth ≥ 4), while layered and breadth_first are more effective for shallow trees and large batch sizes, respectively.
- Cache-line alignment: a new performance hyperparameter, align_bytes, aligns tree nodes to cache-line boundaries, which can improve performance on both CPUs and GPUs. On CPUs, 64-byte alignment generally yields the best results, while on GPUs some models benefit from 128-byte alignment. The last sketch below shows how these hyperparameters can be set by hand.
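Here is a minimal sketch of the auto-optimization workflow described above. The model file name is hypothetical, and the exact signatures of .load and .optimize are assumptions based on this description rather than on the 25.04 API reference.

```python
import numpy as np
from cuml.fil import ForestInference

# Representative inference batch (synthetic data for illustration).
X = np.random.rand(1000, 32).astype(np.float32)

# Load a pre-trained model; the file name here is hypothetical.
model = ForestInference.load("xgboost_model.json")

# One call tunes the performance hyperparameters for this batch size;
# subsequent predict() calls reuse the settings it selects.
model.optimize(batch_size=1000)

# Inspect what the optimizer chose.
print(model.layout, model.default_chunk_size)

preds = model.predict(X)
```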
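The next sketch illustrates the two new prediction methods. The (n_rows, n_trees) output shapes and the per-tree weights are assumptions made for illustration; cuML generally mirrors the input array type, so NumPy in gives NumPy out.

```python
import numpy as np
from cuml.fil import ForestInference

X = np.random.rand(500, 32).astype(np.float32)
model = ForestInference.load("xgboost_model.json")  # hypothetical file

# Per-tree predictions, assumed here to have shape (n_rows, n_trees).
per_tree = model.predict_per_tree(X)

# Example: weight trees unevenly (say, by recency) instead of taking
# a uniform average across the ensemble.
weights = np.linspace(0.5, 1.0, per_tree.shape[1])  # illustrative weights
weighted = (per_tree * weights).sum(axis=1) / weights.sum()

# Leaf IDs per tree, also assumed (n_rows, n_trees): two rows are similar
# in proportion to how many trees route them to the same leaf.
leaves = model.apply(X)
similarity = float((leaves[0] == leaves[1]).mean())
```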
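For CPU execution, the sketch below assumes FIL honors using_device_type, cuML's general device-selection context manager; the release notes only say "a new context manager," so treat this exact entry point as an assumption.

```python
import numpy as np
from cuml.common.device_selection import using_device_type
from cuml.fil import ForestInference

X = np.random.rand(100, 32).astype(np.float32)
model = ForestInference.load("model.json")  # hypothetical file

# Everything inside this block runs on the CPU; useful for local testing
# or scaled-down, CPU-only deployments.
with using_device_type("cpu"):
    preds = model.predict(X)
```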
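Finally, a sketch of setting the layout and alignment hyperparameters manually instead of relying on .optimize. Passing layout and align_bytes as load-time keyword arguments is an assumption; consult the cuML FIL documentation for the exact parameter names.

```python
from cuml.fil import ForestInference

model = ForestInference.load(
    "model.json",          # hypothetical file
    layout="depth_first",  # best for deeper trees (depth >= 4)
    align_bytes=64,        # cache-line alignment; 128 can help on some GPUs
)
```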
Benchmark Results

An extensive benchmarking study evaluated the new FIL against the previous version across a wide range of model parameters and batch sizes. The benchmarks trained RandomForestRegressor models with various depths, tree counts, and feature counts on synthetic data, then measured inference times on an NVIDIA H100 (80 GB HBM3) GPU and a 2-socket Intel Xeon Platinum 8480CL CPU.

- Batch size 1: cuML 25.04 outperformed the previous version on 81% of the tested models, with a median speedup of 1.6x. There were slight regressions for models with many deep trees, but the overall improvement was substantial.
- Maximum throughput: the new version outperformed the previous one on 76% of the models, with a median speedup of 1.4x. Regressions were minimal and mostly affected shallow-tree models.

FIL's performance was also compared with scikit-learn's native inference, pitting an AMD EPYC 9654P 96-core CPU against a single H100 (80 GB HBM3) GPU. FIL outperformed scikit-learn in every tested scenario, with a median speedup of 239x at batch size 1 and 156x at large batch sizes.

Industry Insights and Company Profiles

Industry insiders are enthusiastic about the advancements in FIL. The auto-optimization feature significantly lowers the barrier to entry for GPU-accelerated inference, making it accessible and practical for a broader audience, and the new prediction APIs, .predict_per_tree and .apply, expand the versatility of tree-based models, enabling researchers and engineers to explore new applications.

NVIDIA, known for its GPU technology and AI solutions, continues to innovate with the RAPIDS suite. cuML, as part of RAPIDS, offers a comprehensive set of tools for accelerated machine learning, making it easier for data scientists and engineers to harness the power of GPUs without deep expertise in CUDA programming.

Getting Started

To take advantage of the new features and performance improvements in FIL, download the cuML 25.04 release. These capabilities will also be integrated into future releases of NVIDIA Triton Inference Server, further streamlining the deployment of machine learning models. For detailed performance data, API documentation, and benchmarks, see the cuML FIL documentation; NVIDIA's Deep Learning Institute (DLI) also offers hands-on courses to help users get the most out of the tool. A minimal end-to-end example appears at the end of this article.

The redesign of FIL in cuML 25.04 represents a significant step toward making tree-based models more efficient and versatile, covering deployment scenarios from local testing to high-throughput production. Whether you are a seasoned data scientist or a beginner, the new FIL offers a user-friendly, high-performance solution for inference.
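As a quick start, the sketch below mirrors, in miniature, the FIL-versus-scikit-learn comparison from the benchmark section. load_from_sklearn is taken from earlier FIL releases and may differ in 25.04, and the timing here is illustrative rather than a rigorous benchmark (single run, one warm-up call, no statistics).

```python
import time
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from cuml.fil import ForestInference

# Synthetic training data, loosely following the benchmark setup above.
X, y = make_regression(n_samples=10_000, n_features=64, random_state=0)
X = X.astype(np.float32)

skl = RandomForestRegressor(n_estimators=100, max_depth=8, random_state=0)
skl.fit(X, y)

# Convert the trained scikit-learn model for FIL; assumed entry point.
fil = ForestInference.load_from_sklearn(skl)
fil.optimize(batch_size=len(X))

# Rough single-run timing with one warm-up call per backend.
for name, predict in (("scikit-learn", skl.predict), ("FIL", fil.predict)):
    predict(X)
    start = time.perf_counter()
    predict(X)
    print(f"{name}: {time.perf_counter() - start:.4f} s")
```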