
Software Drives AI Performance Gains More Than Hardware, Shifting Pareto Frontier Rapidly in Weeks

In the rapidly evolving world of AI, software is proving to be the dominant force reshaping performance, outpacing hardware advances in pushing the Pareto frontier outward. The Pareto frontier, a curve of optimal tradeoffs between competing objectives, has become a central concept in AI performance discussions, popularized by Nvidia CEO Jensen Huang during his GTC 2025 keynote. It illustrates how improving one metric, such as inference throughput, often comes at the cost of another, such as per-user response time, with the best achievable balances lying along the curve's edge.

Huang's presentation featured a striking comparison between Nvidia's Hopper and Blackwell GPU architectures, using a dense, monolithic model like GPT-4. The data showed that moving from H200 to B200 GPUs, combined with reduced precision (FP4 instead of FP8), a rack-scale system, and software optimizations such as Dynamo and TensorRT, delivered a 25X gain in tokens per second per megawatt. Some estimates even suggest a 31X gain, underscoring how software and system design can dramatically amplify hardware potential.

The real story, however, lies in the shift from dense models to reasoning models: chain-of-thought systems such as GPT-OSS or DeepSeek R1. These models generate and process many intermediate tokens across layers of reasoning before producing a final answer. Because this multiplies the computational load, throughput per megawatt drops by roughly 11X. Even so, Blackwell's advantage over Hopper in this regime remains massive, about 40X, thanks to a combination of better parallelism, memory access, and software enhancements.

The InferenceMax v1 benchmark, which evaluates models including GPT-OSS 120B, DeepSeek R1-0528, and Llama 3.3 70B Instruct, reveals how quickly software can shift the Pareto frontier. Between August and September, performance across the board nearly doubled on the GB200 NVL72 rack-scale system.
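The frontier idea itself is easy to make concrete: an operating point sits on the Pareto frontier when no other point beats it on both axes at once. A minimal sketch, using entirely hypothetical configurations measured as (tokens/s per user, tokens/s per megawatt):

```python
def pareto_frontier(points):
    """Return the subset of (interactivity, throughput) points that are
    not dominated by any other point (higher is better on both axes)."""
    frontier = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and q != p
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier)

# Hypothetical operating points: (tokens/s per user, tokens/s per MW).
configs = [
    (50, 1000),   # batch-heavy: high fleet throughput, sluggish per user
    (200, 600),
    (400, 350),
    (100, 500),   # dominated by (200, 600) on both axes
    (800, 100),   # latency-optimized: fast per user, low fleet throughput
]

print(pareto_frontier(configs))  # (100, 500) is filtered out
```

Every surviving point represents a different best-possible tradeoff; "shifting the frontier" means new software makes points appear that dominate the old curve.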
Then, in just weeks, on October 3 and October 9, further software updates delivered even more dramatic gains: a 5X increase in throughput at mid-range interactivity levels, and the ability to sustain 1,000 tokens per second per user at high interactivity.

This pace defies the traditional two-year software optimization cycle. What used to take years now happens in weeks, thanks to innovations such as multi-token prediction (a form of speculative execution), advanced data parallelism, and optimized memory access across NVSwitch interconnects. The result is a performance shockwave that moves the entire frontier outward at an unprecedented pace.

The numbers are telling: while 80% of Nvidia's revenue comes from hardware, 80% of its employees work on software. That investment pays off. Software accounts for roughly 60% of the performance gains within any given GPU generation, far outpacing the 2X to 3X hardware improvements seen over a cycle.

In short, the AI revolution isn't just about faster chips. It's about smarter software that unlocks hidden potential in existing hardware. As models grow more complex and demands for speed and efficiency rise, the ability to push the Pareto frontier through software isn't just an advantage; it's the key to staying competitive.
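To see why techniques like multi-token prediction pay off, here is a toy draft-and-verify loop in the speculative-decoding family it belongs to. This is a minimal sketch, not Nvidia's implementation: `draft_next` and `target_next` are invented stand-ins for a small draft model and the large target model, and a real system verifies the whole proposal in one batched forward pass rather than token by token as done here for clarity.

```python
def speculative_step(draft_next, target_next, context, k=4):
    """Propose k tokens cheaply, keep the prefix the target model agrees
    with, and always emit one extra token from the target itself."""
    # Draft phase: the cheap model runs ahead k tokens.
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # Verify phase: accept the longest prefix the target agrees with.
    accepted, ctx = [], list(context)
    for tok in proposal:
        if target_next(ctx) != tok:
            break
        accepted.append(tok)
        ctx.append(tok)

    # The target always contributes one token of its own, so each step
    # makes progress even when every draft token is rejected.
    accepted.append(target_next(ctx))
    return accepted

# Deterministic toy "models": the target picks len(ctx) % 5; the draft
# happens to agree only when the context length is even.
target = lambda ctx: len(ctx) % 5
draft = lambda ctx: len(ctx) % 5 if len(ctx) % 2 == 0 else -1

print(speculative_step(draft, target, context=[7, 8]))  # -> [2, 3]
```

Accepted tokens match what the target would have produced on its own, so the output is unchanged; the win is that agreeing tokens cost only the cheap draft pass plus a shared verification pass.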