Nvidia Unveils Rubin CPX to Disaggregate Long-Context AI Inference, Boosting Efficiency and Reducing Costs
Nvidia is introducing a new approach to AI inference with the Rubin CPX GPU accelerator, designed to tackle the rising costs and bottlenecks of long-context AI workloads by disaggregating compute-intensive and memory-intensive tasks. As demand grows for applications such as code generation, code analysis, and video processing, which require context windows of a million tokens or more, Nvidia is addressing the imbalance between compute and memory bandwidth by separating the prefill (context processing) and decode (token generation) phases of inference.

The Rubin CPX is optimized for the prefill phase, which is compute-heavy but far less dependent on high memory bandwidth. By using GDDR7 memory instead of expensive HBM, the CPX reduces cost and power consumption while maintaining high throughput. The chip is rated at 30 petaflops of FP4 performance, achieved by running a single Rubin chiplet at higher clocks. This contrasts with the full Rubin R100/R200 GPUs, which use HBM and are designed for broader workloads, including the bandwidth-bound decode phase.

Nvidia's strategy leverages architectural specialization. The Rubin CPX includes dedicated attention acceleration cores, which handle the most compute-intensive part of transformer models, identifying the relevant parts of the context, without requiring high memory bandwidth. This makes it well suited to long-context inference, where the entire input must be processed before the first output token can be generated. With the workflow disaggregated, Nvidia says two Rubin CPX GPUs can deliver up to six times the throughput of a single high-end GPU at just 2.25 times the compute cost.

The approach is particularly effective when paired with the Vera Rubin rack-scale system. Adding 144 Rubin CPX accelerators to a Vera Rubin rack contributes a further 4.4 exaflops of FP4 compute and 25 TB of fast memory. Nvidia claims that for every $100 million invested in such a setup, the system could generate $5 billion in revenue over four years through API and application usage, underscoring the return on investment for long-context workloads.

The architecture also allows flexibility. Rubin CPX nodes can operate independently of standard Rubin GPU nodes, enabling scalable deployment without complex NVLink interconnects. This could support not only large-scale code and video models but also smaller models that benefit from dedicated context-phase acceleration.

The move reflects Nvidia's response to the growing imbalance between demand for HBM and its limited supply, driven by increasingly dense, high-bandwidth stacks that suffer from low manufacturing yields. By offloading tasks that are not memory-bandwidth-bound to cost-effective, GDDR-based accelerators, Nvidia reduces its dependence on scarce HBM while improving efficiency across the AI stack.

With the Rubin CPX, Nvidia is not just introducing a cheaper chip; it is redefining how inference workloads are structured. The focus on disaggregation, specialized acceleration, and cost-effective memory positions the CPX as a key enabler of scalable, high-throughput AI systems in the 2026–2027 window, when AI spending is expected to peak.
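To make the prefill/decode handoff concrete, the sketch below models it in plain Python. The class and function names (ContextWorker, GenerationWorker, KVCache, serve) are hypothetical stand-ins for whatever orchestration layer actually routes requests; the sketch illustrates the division of labor described above, not Nvidia's software stack.

```python
# Illustrative sketch of disaggregated inference routing. All names here are
# hypothetical; this is not Nvidia's serving software, only the concept.

from dataclasses import dataclass


@dataclass
class KVCache:
    """Opaque handle to the key/value cache produced during prefill."""
    request_id: str
    num_tokens: int


class ContextWorker:
    """Models a compute-bound prefill worker (the role Rubin CPX targets)."""

    def prefill(self, request_id: str, prompt_tokens: list[int]) -> KVCache:
        # The whole prompt is processed in one compute-heavy pass; the output
        # that matters for the next phase is the KV cache, not tokens.
        return KVCache(request_id=request_id, num_tokens=len(prompt_tokens))


class GenerationWorker:
    """Models a bandwidth-bound decode worker (HBM-equipped Rubin GPUs)."""

    def decode(self, cache: KVCache, max_new_tokens: int) -> list[int]:
        # Each decode step rereads the growing KV cache, so memory bandwidth,
        # not raw FLOPS, dominates. Dummy token ids stand in for real output.
        return list(range(max_new_tokens))


def serve(prompt_tokens: list[int], max_new_tokens: int) -> list[int]:
    """Route each phase to its specialized pool and hand off the KV cache."""
    ctx_pool = ContextWorker()
    gen_pool = GenerationWorker()
    cache = ctx_pool.prefill("req-1", prompt_tokens)   # long-context, compute-bound
    return gen_pool.decode(cache, max_new_tokens)      # token-by-token, bandwidth-bound


if __name__ == "__main__":
    out = serve(prompt_tokens=list(range(1_000_000)), max_new_tokens=8)
    print(f"generated {len(out)} tokens")
```

The essential design point is that the only artifact crossing the boundary between the two pools is the KV cache, which is why the context pool can run on cheaper GDDR7-based hardware while the generation pool keeps its HBM.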
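The headline figures above also imply a few simple derived numbers. The short calculation below only restates what the article quotes; the per-chip rating multiplied across the rack lands slightly below the quoted 4.4 exaflops because the rack-level figures are rounded.

```python
# Back-of-the-envelope checks using only the figures quoted in the article.

cpx_fp4_pflops = 30            # per Rubin CPX accelerator, FP4
cpx_per_rack = 144             # CPX accelerators added to a Vera Rubin rack
added_exaflops = cpx_fp4_pflops * cpx_per_rack / 1000
print(f"FP4 compute added by CPX: {added_exaflops:.2f} exaflops (vs. ~4.4 EF quoted)")

# Throughput-per-cost claim: up to 6x throughput at 2.25x the compute cost.
speedup, cost_factor = 6.0, 2.25
print(f"Throughput per unit of compute cost: {speedup / cost_factor:.2f}x")

# Claimed monetization: $5 billion revenue over four years per $100 million invested.
capex, revenue = 100e6, 5e9
print(f"Claimed revenue multiple: {revenue / capex:.0f}x over four years")
```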