
NVIDIA Blackwell Sets New Standard in Inference Performance and Efficiency

A new benchmark, InferenceMAX v1 from SemiAnalysis, has declared NVIDIA’s Blackwell platform the leader in AI inference performance and efficiency, marking a pivotal moment in the evolution of AI deployment. The results show Blackwell delivering up to a 15x performance gain over the previous Hopper generation, which translates into a 15x return on investment: a $5 million investment in an NVIDIA GB200 NVL72 system can generate $75 million in token revenue. This underscores a fundamental shift: inference is now the core economic engine of AI, not just a technical step.

InferenceMAX v1 is the first open-source benchmark to measure total cost of compute across real-world scenarios, evaluating performance across diverse workloads including chat, summarization, and deep reasoning. It tests models such as gpt-oss-120b, Llama 3.3 70B, and DeepSeek-R1 across multiple precisions, sequence lengths, and configurations, both single-node and multi-node with Expert Parallelism (EP). The benchmark uses continuous integration to publish daily results, ensuring transparency and reproducibility.

NVIDIA Blackwell excels through a full-stack approach that combines hardware and software innovation. The B200 GPU features fifth-generation Tensor Cores, native FP4 support, and 1,800 GB/s of NVLink bandwidth via the NVLink Switch, enabling massive parallelism and low-latency communication. These advances allow Blackwell to deliver more than 10,000 tokens per second per GPU on Llama 3.3 70B at 50 tokens per second (TPS) per user, four times the throughput of the Hopper H200.

Software optimizations have driven even greater gains. NVIDIA TensorRT-LLM delivered a 5x performance boost in just two months, with per-GPU throughput on gpt-oss-120b rising from 6,000 to 30,000 tokens per second. The introduction of speculative decoding in the gpt-oss-120b-Eagle3-v2 model triples throughput at 100 TPS per user and cuts the cost per million tokens by roughly 5x, from $0.11 to $0.02 (the draft-and-verify idea behind speculative decoding is sketched at the end of this article). Even at ultra-high interactivity (400 TPS per user), the cost remains low at $0.12, making complex multi-agent systems economically viable.

For large-scale deployments, the GB200 NVL72 system delivers a 15x reduction in cost per million tokens compared to the H200. At 75 TPS per user, the cost drops from $1.56 to roughly $0.10, with a flatter cost curve that sustains efficiency at high user loads. This makes Blackwell ideal for AI factories: infrastructure designed to generate intelligence at scale. A back-of-the-envelope version of this cost arithmetic appears below.

Disaggregated inference via NVIDIA Dynamo and optimized execution through TensorRT-LLM further unlock performance, especially for mixture-of-experts (MoE) models. By separating the prefill and decode phases across nodes and intelligently balancing expert workloads, these tools prevent GPU underutilization and maximize throughput (a toy illustration of the prefill/decode split also follows below).

NVIDIA has also partnered with open-source communities such as SGLang and vLLM, contributing kernel-level optimizations for attention, GEMM, MoE, and communication. These enhancements are integrated into FlashInfer and runtime frameworks, ensuring that Blackwell’s full potential is realized across the ecosystem.

In summary, InferenceMAX v1 confirms that NVIDIA Blackwell sets a new standard in inference performance, efficiency, and economic value. Its combination of hardware innovation, software optimization, and open collaboration is driving the next phase of AI, in which speed, scale, and cost efficiency converge to deliver real-world ROI. With open benchmarking and continuous improvement, the future of AI inference is now defined by performance at scale.
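To make the per-token cost figures concrete, here is a minimal back-of-the-envelope sketch. Only the throughput numbers (6,000 and 30,000 tokens per second per GPU on gpt-oss-120b) come from the article; the hourly system price is a hypothetical assumption chosen so the output lands near the article’s quoted $0.11 and $0.02 figures, not a published NVIDIA or SemiAnalysis number.

```python
# Back-of-the-envelope inference economics for a GB200 NVL72 rack.
# ASSUMPTION: the hourly rental price below is hypothetical; only the
# per-GPU throughput figures are taken from the article.

GPUS_PER_SYSTEM = 72           # GB200 NVL72 links 72 Blackwell GPUs
SYSTEM_PRICE_PER_HOUR = 170.0  # USD/hour, assumed rental rate (not official)

def cost_per_million_tokens(tokens_per_sec_per_gpu: float) -> float:
    """Dollars per one million generated tokens at a given per-GPU rate."""
    system_tokens_per_hour = tokens_per_sec_per_gpu * GPUS_PER_SYSTEM * 3600
    return SYSTEM_PRICE_PER_HOUR / system_tokens_per_hour * 1_000_000

# gpt-oss-120b throughput before and after the TensorRT-LLM optimizations
# described in the article (6,000 -> 30,000 tokens/s per GPU).
for label, tps in [("baseline", 6_000), ("optimized", 30_000)]:
    print(f"{label}: ${cost_per_million_tokens(tps):.3f} per 1M tokens")
    # baseline:  ~$0.109 per 1M tokens
    # optimized: ~$0.022 per 1M tokens
```

A 5x jump in throughput cuts the per-token cost 5x at a fixed hardware price; a different rental rate scales both figures linearly without changing the ratio.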
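The article attributes part of the gpt-oss-120b gain to speculative decoding in the Eagle3-v2 variant. That implementation is not described here, so the sketch below shows only the generic draft-and-verify idea: a cheap draft model proposes a few tokens, the large target model checks them in one batched pass, and each accepted token saves a full decode step. Every model and function in the sketch is a toy stand-in, not NVIDIA’s code.

```python
import random

random.seed(0)
VOCAB = list(range(8))  # toy vocabulary of 8 token ids

def draft_model(prefix, k):
    """Toy draft model: proposes k tokens, each with uniform probability."""
    tokens = [random.choice(VOCAB) for _ in range(k)]
    probs = [1.0 / len(VOCAB)] * k
    return tokens, probs

def target_prob(prefix, token):
    """Toy target model: prefers even token ids twice as much as odd ones."""
    weights = [2.0 if t % 2 == 0 else 1.0 for t in VOCAB]
    return weights[token] / sum(weights)

def speculative_step(prefix, k=4):
    """One draft-and-verify step using the standard accept rule min(1, p/q).

    In a real system the target model scores all k proposals in a single
    batched forward pass, so every accepted token amortizes that pass.
    The residual resampling normally done on rejection is omitted here
    for brevity.
    """
    proposals, draft_probs = draft_model(prefix, k)
    accepted = []
    for token, q in zip(proposals, draft_probs):
        p = target_prob(prefix + accepted, token)
        if random.random() < min(1.0, p / q):
            accepted.append(token)   # token verified; keep checking ahead
        else:
            break                    # first rejection ends the step
    return accepted

print(speculative_step([], k=4))     # prints the tokens accepted this step
```

When the draft model agrees with the target most of the time, several tokens clear verification per target pass, which is how a tripled throughput at fixed interactivity becomes plausible.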
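Finally, the disaggregated serving the article credits to NVIDIA Dynamo separates the compute-bound prefill phase from the memory-bandwidth-bound decode phase so the two stop competing for the same GPUs. Dynamo’s actual scheduler is far more sophisticated; the sketch below only illustrates the routing idea, and every class and queue name in it is hypothetical.

```python
from dataclasses import dataclass, field
from queue import Queue
from typing import Optional

@dataclass
class Request:
    prompt: str
    kv_cache: Optional[str] = None          # produced by the prefill pool
    generated: list = field(default_factory=list)

# Separate queues feed separate worker pools: prefill batches long prompts
# for compute throughput, decode serves one token per step at low latency.
prefill_queue: Queue = Queue()
decode_queue: Queue = Queue()

def prefill_worker():
    """Runs the whole prompt once, builds the KV cache, hands off to decode."""
    req = prefill_queue.get()
    req.kv_cache = f"kv({req.prompt!r})"    # stand-in for real KV tensors
    decode_queue.put(req)                   # cache transfer to the decode pool

def decode_worker():
    """Generates tokens step by step against the transferred KV cache."""
    req = decode_queue.get()
    req.generated.append("<tok>")           # stand-in for one decode step
    print(req)

prefill_queue.put(Request(prompt="Summarize this report"))
prefill_worker()
decode_worker()
```

In a real deployment the two pools sit on different nodes and the KV cache moves over the interconnect between them; keeping that transfer fast is one reason interconnect bandwidth matters so much for disaggregated serving.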
