NVIDIA and UW-Madison Launch Sirius, a GPU-Accelerated DuckDB Engine That Sets New ClickBench Records with 7.2x Cost-Efficiency Gains
NVIDIA has partnered with the University of Wisconsin-Madison to introduce Sirius, a GPU-accelerated execution engine for DuckDB that has achieved record-breaking performance on ClickBench, a widely used analytics benchmark. Sirius is designed to bring the power of GPU computing to SQL-based analytics without requiring a complete rebuild of existing database systems.

DuckDB has gained popularity among major organizations like DeepSeek, Microsoft, and Databricks due to its speed, simplicity, and flexibility. Because analytics workloads are highly parallelizable, GPUs offer superior performance, throughput, and cost efficiency compared to CPUs. However, building a GPU-native database from scratch is complex and resource-intensive. Sirius solves this challenge by acting as a composable, GPU-native backend for DuckDB, leveraging existing components while accelerating query execution with NVIDIA's CUDA-X libraries.

Sirius is implemented as a DuckDB extension, requiring no changes to DuckDB's core codebase and only minimal adjustments to the user interface. It accepts query plans in the Substrait format, ensuring compatibility with other data systems. The architecture reuses DuckDB's mature subsystems—such as the query parser, optimizer, and scan operators—while offloading computation to the GPU. Data is transferred from CPU to GPU memory in a format aligned with Apache Arrow, and subsequent operations like aggregations, projections, and joins are executed at GPU speed using cuDF primitives. Results are then returned to the CPU and converted back to DuckDB's expected output format, delivering both high performance and a seamless user experience.

In ClickBench tests, Sirius running on an NVIDIA GH200 Grace Hopper Superchip instance from Lambda Labs—priced at $1.50 per hour—outperformed the top five systems on the benchmark, all of which ran on more expensive CPU-only instances costing between $7.30 and $9.80 per hour.
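The offload pattern described above—scan on the CPU, transfer a contiguous Arrow-style column to the device, run the operator there, and copy the result back—can be sketched in plain Python. This is a hypothetical illustration, not Sirius code: the function names are invented, and the "device" side is simulated with a CPU buffer copy where Sirius would use GPU memory and cuDF kernels.

```python
# Hypothetical sketch of the CPU->GPU offload pipeline (not Sirius's
# actual code). Contiguous columnar buffers stand in for Arrow arrays;
# the "device" functions stand in for GPU transfer and cuDF kernels.
from array import array

def to_device(column):
    # Stand-in for a CPU->GPU transfer: a copy of a contiguous buffer.
    return array(column.typecode, column)

def device_filter_sum(values, predicate):
    # Stand-in for a fused GPU filter + aggregate kernel.
    return sum(v for v in values if predicate(v))

# "DuckDB side": a column produced by DuckDB's existing scan operator.
host_col = array("d", [1.0, 5.0, 3.0, 8.0, 2.0])

# Offload: transfer, execute on the device, copy the scalar result back.
device_col = to_device(host_col)
result = device_filter_sum(device_col, lambda v: v > 2.0)
print(result)  # -> 16.0 (5 + 3 + 8)
```

The columnar layout is what makes this composition cheap: both DuckDB and cuDF understand Arrow-aligned buffers, so the handoff is a memory transfer rather than a row-by-row conversion.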
Sirius achieved the lowest relative runtime across all queries, demonstrating at least 7.2x better cost-efficiency. The results highlight the advantage of GPU acceleration, especially when paired with high-performance hardware.

Performance analysis of individual queries shows Sirius excelling in common operations like filtering, projection, and aggregation: in queries q4, q5, and q18, for example, it delivered significant speedups. Other queries, such as q23, q24, q26, and q27, revealed areas for improvement—particularly string operations, top-N queries, and large-scale aggregations. Future updates will focus on optimizing these workloads.

A key innovation appears in query q28, which involves complex regular expression matching. Sirius uses cuDF's JIT-compiled string transformation framework to break regex operations down into efficient, low-level string functions. This approach achieved a 13x speedup over a precompiled API, with warp occupancy rising from 32% to 85%—evidence of better GPU utilization and reduced register pressure.

Looking ahead, NVIDIA and the University of Wisconsin-Madison are developing foundational, open, and interoperable building blocks for GPU data processing, guided by the MICE principles: modular, interoperable, composable, and extensible. These components aim to lower the barrier to entry for building GPU-native analytics systems, benefiting not just Sirius but the broader open-source ecosystem.

Sirius is open source under the permissive Apache-2.0 license and welcomes contributions from researchers and developers. The project is driven by a shared vision to advance data analytics in the GPU era.
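The idea behind the q28 optimization—replacing a general regex engine with a handful of cheap, fusable string primitives—can be illustrated on the CPU with a toy example. This is a hedged sketch, not cuDF's implementation: the URL pattern and both helper functions are invented for illustration, but they show how one regex extraction decomposes into find-and-slice operations that a JIT framework could compile into a single lean kernel.

```python
# Toy illustration of decomposing a regex into low-level string ops.
# Neither function is from Sirius or cuDF; both extract the host from
# a URL, one via the general regex engine, one via cheap primitives.
import re

def host_regex(s):
    # General-purpose route: one precompiled-style regex call.
    m = re.match(r"https?://([^/]+)/", s)
    return m.group(1) if m else None

def host_decomposed(s):
    # Decomposed route: the same extraction as find + slice, the kind
    # of primitive a JIT string framework can fuse into one kernel
    # with far less register pressure than a full regex automaton.
    start = s.find("://")
    if start == -1:
        return None
    start += 3
    end = s.find("/", start)
    return s[start:end] if end != -1 else None

url = "http://example.com/path?q=1"
print(host_regex(url), host_decomposed(url))  # both -> example.com
```

On a GPU, the decomposed form wins not because `find` is algorithmically smarter, but because each primitive is a small, predictable kernel that keeps more warps resident—the same effect behind the reported jump in warp occupancy.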
