
NVIDIA Blackwell Ultra Boosts Agentic AI Performance by 50x, Cuts Costs by 35x with Advanced Hardware and Software Optimization

NVIDIA has unveiled new performance data demonstrating that its Blackwell Ultra platform, specifically the GB300 NVL72 system, delivers up to 50x higher throughput per megawatt and reduces cost per token by as much as 35x compared with the previous Hopper platform. The advance is particularly impactful for agentic AI and coding assistants, which are driving rapid growth in AI-driven software development.

Agentic AI workloads, including interactive coding assistants, have seen query volumes rise from 11% to nearly 50% of total inference traffic in just one year, according to OpenRouter's State of Inference report. These applications demand both low latency for real-time responsiveness across multi-step processes and the ability to handle long context when reasoning over entire codebases.

The GB300 NVL72 system, powered by the new Blackwell Ultra GPU, achieves these goals through a combination of hardware innovation, system architecture, and deep software optimization. The platform leverages NVIDIA's TensorRT-LLM, Dynamo, Mooncake, and SGLang software stacks to deliver significant gains in mixture-of-experts (MoE) inference performance. Recent improvements in TensorRT-LLM alone have boosted low-latency performance on the GB200 NVL72 by up to 5x in just four months. Combined with the enhanced capabilities of the Blackwell Ultra GPU, the GB300 NVL72 delivers up to 50x better performance per watt than Hopper. This translates into dramatically lower costs: up to 35x less per million tokens at the low latencies where agentic AI operates. The performance leap enables AI platforms to scale real-time, interactive experiences to far more users without sacrificing speed or quality.

For long-context workloads, such as AI assistants analyzing large codebases with inputs of up to 128,000 tokens and outputs of 8,000 tokens, the GB300 NVL72 shows a 1.5x improvement in cost efficiency over the earlier GB200 NVL72. This is due to 1.5x higher NVFP4 compute performance and 2x faster attention processing, allowing agents to efficiently process complex code structures.

Major cloud providers and AI infrastructure leaders are already adopting these systems. Microsoft, CoreWeave, and Oracle Cloud Infrastructure (OCI) are deploying GB300 NVL72 in production for agentic coding and interactive AI applications. CoreWeave's Chen Goldberg highlighted that the Grace Blackwell NVL72 systems, integrated into the company's AI cloud infrastructure, deliver predictable performance and superior cost efficiency, enabling scalable, real-world deployment of advanced AI workloads.

Looking ahead, NVIDIA's upcoming Vera Rubin platform, a new architecture combining six chips into a single AI supercomputer, promises another leap in performance. For MoE inference, Rubin can deliver up to 10x higher throughput per watt than Blackwell, cutting cost per million tokens by a factor of ten. It also enables training large MoE models with just one-fourth the number of GPUs required on Blackwell, accelerating the development of next-generation AI models.

These advancements underscore NVIDIA's continued leadership in AI infrastructure, driving both performance and cost efficiency across the entire AI stack, from agentic reasoning to large-scale model training.
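As a rough back-of-envelope illustration (not NVIDIA's methodology), the relationship between throughput per megawatt and cost per token can be sketched as below. Every number in the sketch is an invented placeholder, and the `cost_per_million_tokens` helper is a hypothetical model, not a published formula:

```python
# Hypothetical sketch of the cost-per-token arithmetic behind claims like
# "50x throughput per megawatt" and "35x lower cost per million tokens".
# All figures (power price, amortized hardware cost, throughput) are
# illustrative assumptions, not NVIDIA-published data.

def cost_per_million_tokens(tokens_per_sec_per_mw: float,
                            power_price_per_mwh: float,
                            amortized_hw_per_mwh: float) -> float:
    """Dollars per 1M tokens for a deployment normalized to 1 MW of draw."""
    tokens_per_hour = tokens_per_sec_per_mw * 3600.0
    hourly_cost = power_price_per_mwh + amortized_hw_per_mwh
    return hourly_cost / tokens_per_hour * 1_000_000

# Assumed baseline system vs. a system with 50x tokens per second per MW
# but a higher amortized hardware cost per MWh of operation:
baseline = cost_per_million_tokens(2_000, 100.0, 400.0)
improved = cost_per_million_tokens(100_000, 100.0, 1_000.0)
print(f"baseline: ${baseline:.2f}/M tokens, improved: ${improved:.2f}/M tokens")
print(f"cost reduction: {baseline / improved:.1f}x")
```

One thing the sketch makes visible: when the newer hardware carries a higher amortized cost per unit of power, the cost-per-token reduction comes out smaller than the raw throughput-per-megawatt gain, which is one plausible reason a platform can quote 50x throughput per megawatt alongside a 35x (rather than 50x) cost reduction.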
