GPU-Native Data Analytics: IBM and NVIDIA Accelerate Velox with cuDF for Faster Large-Scale Query Processing
As data workloads grow in size and complexity, the need for faster, more efficient data processing has driven the adoption of GPU-accelerated systems. GPUs offer significantly higher memory bandwidth and thread counts than CPUs, making them well suited to compute-intensive tasks such as multi-way joins, complex aggregations, and string processing. With GPU-enabled infrastructure increasingly available and support for GPU algorithms broadening, using GPUs for data analytics is now more practical than ever. To meet this demand, IBM and NVIDIA are collaborating to integrate NVIDIA cuDF into the Velox execution engine, enabling GPU-native query execution for widely used platforms such as Presto and Apache Spark. The initiative is open source and aims to accelerate large-scale data analytics across the ecosystem.

Velox serves as a middleware layer that translates query plans from systems like Presto and Spark into optimized GPU pipelines powered by cuDF. As shown in Figure 1, this allows SQL queries to be executed directly on GPUs, unlocking substantial performance improvements. The integration supports end-to-end GPU execution, with enhancements to key operators such as TableScan, HashJoin, HashAggregation, and FilterProject (a rough mapping of these operators onto libcudf primitives is sketched below).

Initial results from the Presto TPC-H benchmark using Parquet data sources demonstrate the impact of GPU acceleration. At scale factor 1,000, Presto on CPU (AMD 5965X) took 1,246 seconds to complete 21 of the 22 queries. In contrast, Presto on the NVIDIA RTX PRO 6000 Blackwell Workstation finished in just 133.8 seconds, and Presto on the NVIDIA GH200 Grace Hopper Superchip completed in 99.9 seconds. With CUDA managed memory enabling the full query set to complete on the GH200, the runtime was 148.9 seconds, highlighting the importance of memory management in large-scale GPU workloads (see the memory-resource sketch below).

For distributed execution, the team implemented a UCX-based Exchange operator in Velox that supports high-speed data movement between GPUs. UCX leverages NVLink for high-bandwidth intra-node communication and RoCE or InfiniBand for inter-node transfers. This approach delivers substantial gains: on an eight-GPU NVIDIA DGX A100 node, using NVLink in the Exchange operator yielded more than a 6x speedup over the traditional HTTP-based exchange. All 22 queries in the benchmark completed efficiently using async memory allocation, without requiring managed memory.

In Apache Spark, the integration with Apache Gluten and cuDF focuses on hybrid CPU-GPU execution. Rather than requiring a full GPU migration, this approach offloads only the most compute-intensive stages, such as the second stage of TPC-DS Query 95 at scale factor 100, to the GPU. Even when the initial TableScan runs on the CPU, seamless interoperability between CPU and GPU results in faster total execution time: with a single NVIDIA T4 GPU and eight vCPUs, the GPU-offloaded version outperformed the CPU-only version.

This collaborative effort is part of a broader movement to unify GPU acceleration across the data stack. By building reusable GPU operators in Velox, the project reduces redundancy, simplifies maintenance, and enables rapid innovation across Presto, Spark, and other systems. The open source community is invited to participate, contribute, and provide feedback. This initiative marks a significant step toward democratizing high-performance data analytics at scale.
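To make the operator mapping more concrete, the following is a minimal sketch of how a TableScan, HashJoin, and HashAggregation pipeline could be expressed against the libcudf C++ API that cuDF is built on. It is illustrative only and is not the actual Velox cuDF operator code; the Parquet file names, join keys, and column indices are hypothetical placeholders.

```cpp
// Illustrative sketch only: TableScan -> HashJoin -> HashAggregation expressed
// directly with libcudf primitives. File names, join keys, and column indices
// are hypothetical placeholders, not the Velox cuDF operator implementation.
#include <cudf/aggregation.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/copying.hpp>
#include <cudf/groupby.hpp>
#include <cudf/io/parquet.hpp>
#include <cudf/join.hpp>
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf/utilities/span.hpp>

#include <vector>

int main() {
  // TableScan: read Parquet files directly into GPU memory.
  auto orders = cudf::io::read_parquet(
      cudf::io::parquet_reader_options::builder(
          cudf::io::source_info{"orders.parquet"}).build());
  auto lineitem = cudf::io::read_parquet(
      cudf::io::parquet_reader_options::builder(
          cudf::io::source_info{"lineitem.parquet"}).build());

  auto orders_view   = orders.tbl->view();
  auto lineitem_view = lineitem.tbl->view();

  // HashJoin: match rows on the first column of each table (hypothetical keys),
  // then gather the matching lineitem rows.
  auto [lineitem_rows, orders_rows] =
      cudf::inner_join(lineitem_view.select({0}), orders_view.select({0}));
  auto row_span = cudf::device_span<cudf::size_type const>{*lineitem_rows};
  auto joined   = cudf::gather(lineitem_view, cudf::column_view{row_span});

  // HashAggregation: group by the join key and sum a value column.
  cudf::groupby::groupby gb(joined->view().select({0}));
  std::vector<cudf::groupby::aggregation_request> requests(1);
  requests[0].values = joined->view().column(1);
  requests[0].aggregations.push_back(
      cudf::make_sum_aggregation<cudf::groupby_aggregation>());
  auto [group_keys, group_sums] = gb.aggregate(requests);

  return 0;
}
```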
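The managed-memory versus async-allocation distinction in the benchmark results comes down to which memory resource backs the GPU allocations: managed (unified) memory lets a working set larger than GPU memory spill to host memory at some cost in runtime, which matches the GH200 result above. Below is a minimal sketch, assuming RMM (the RAPIDS Memory Manager used by cuDF), of switching between a stream-ordered async resource and CUDA managed memory; the command-line flag is a hypothetical way to toggle the choice.

```cpp
// Minimal sketch: choosing the RMM memory resource that backs cuDF allocations.
// The CUDA async (cudaMallocAsync-backed) resource is fast when the working set
// fits in GPU memory; the managed resource (cudaMallocManaged) allows
// oversubscription by paging between host and device memory.
#include <rmm/mr/device/cuda_async_memory_resource.hpp>
#include <rmm/mr/device/managed_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

#include <string>

int main(int argc, char** argv) {
  // Hypothetical flag: pass --managed to enable unified memory.
  bool const use_managed = (argc > 1 && std::string{argv[1]} == "--managed");

  rmm::mr::cuda_async_memory_resource async_mr;  // stream-ordered allocations
  rmm::mr::managed_memory_resource managed_mr;   // CUDA unified memory

  if (use_managed) {
    rmm::mr::set_current_device_resource(&managed_mr);
  } else {
    rmm::mr::set_current_device_resource(&async_mr);
  }

  // ... run cuDF/Velox GPU operators here; libcudf allocations now go
  // through the selected resource ...
  return 0;
}
```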
Contributors from IBM include Zoltán Arnold Nagy, Deepak Majeti, Daniel Bauer, Chengcheng Jin, Luis Garcés-Erice, Sean Rooney, and Ali LeClerc. NVIDIA contributors include Greg Kimball, Karthikeyan Natarajan, Devavret Makkar, Shruti Shivakumar, and Todd Mostak.