
Google Unveils Massive Inference Scale and AI Infrastructure Breakthroughs at AI Infra Summit


Google has unveiled significant advancements in its AI inference infrastructure, highlighting its ability to scale massive workloads while driving down costs, a hallmark of hyperscale computing. At the AI Infra Summit in Santa Clara, Mark Lohmeyer, general manager of AI and computing infrastructure at Google, presented data showing an exponential rise in inference demand across Google’s products. Inference token rates have surged from 9.7 trillion tokens per month in April 2024 to over 1.46 quadrillion tokens per month by August 2025, roughly a 150-fold increase in just 16 months.

This explosive growth is powered by Google’s custom Tensor Processing Units (TPUs), particularly the latest Ironwood TPU v7p, which delivers 5X the peak performance and 6X the HBM memory capacity of its predecessor, the Trillium TPU v6e. A single Ironwood cluster, connected via Google’s proprietary optical circuit switch (OCS), can scale to 9,216 TPUs with 1.77 PB of HBM memory, far surpassing even the most powerful GPU-based systems. The OCS enables dynamic reconfiguration and fault tolerance, allowing the system to heal around TPU failures without restarting jobs, a critical capability for maintaining uptime in large-scale AI operations.

Google has also made strides in cooling infrastructure and now operates a fifth-generation liquid cooling system with around 1 gigawatt of capacity, roughly 70 times more than any other fleet at the time. The company plans to open-source its cooling distribution unit later this year, signaling a broader push to share infrastructure innovations.

The Ironwood hardware is deployed in large-scale pods, each containing 144 racks and 9,216 TPUs. While initial visualizations suggested a 256-TPU pod, closer analysis reveals a more complex layout: seven racks per row, with 16 systems per rack and four TPUs per system, for 448 TPUs per row. This points to a design that includes redundant, hot-spare TPUs, potentially 1,536 spares across a full system (9,216 active TPUs plus 1,536 spares would total 10,752 devices, which divides evenly into 24 rows of 448), suggesting a high level of fault tolerance and resilience.

Beyond TPUs, Google is building a hybrid AI infrastructure on Google Cloud, dubbed the “AI hypercomputer.” It supports Nvidia’s Blackwell GPUs, including 8-way B200 nodes and 72-way B200 rackscale systems. While the GB300 NVL72, which is aimed at cost-efficient inference, has not yet been added to Google Cloud, Nvidia’s Dynamo inference stack is available as an option.

Google also emphasizes its own inference stack, built around GKE (Google Kubernetes Engine), vLLM, Anywhere Cache, and the GKE Inference Gateway. The GKE Inference Gateway uses AI-driven load balancing to route requests to compute engines that already hold the required context in memory, reducing latency and queueing. It also separates the “prefill” and “decode” phases of inference, so that each stage can run on hardware suited to it, similar in spirit to Nvidia’s Rubin CPX accelerator; a rough sketch of this routing idea appears below. Anywhere Cache, a flash-based caching layer, reduces read latencies by up to 96% across regions and lowers network costs by minimizing data transfers. Google’s GKE Inference Quickstart tool helps customers get complex configuration decisions right early, avoiding costly inefficiencies. Together, these technologies are said to reduce inference latency by up to 96%, increase throughput by 40%, and cut token costs by as much as 30%.

One of the most striking software-level gains comes from speculative decoding, which Google says has boosted Gemini’s performance and reduced energy consumption by as much as 33X. At scale, such efficiency gains translate into massive cost and environmental benefits.
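
The headline numbers are easy to sanity-check. The short Python calculation below is a back-of-the-envelope reconstruction, not anything published by Google; it assumes 192 GB of HBM per Ironwood TPU (inferred from the 9,216-TPU, 1.77 PB pod figures) and decimal petabytes.

```python
# Back-of-the-envelope checks on the figures quoted above.
# Assumption: 192 GB of HBM per Ironwood TPU, inferred from the
# 9,216-TPU / 1.77 PB pod figures; "PB" taken as 10**15 bytes.

tokens_apr_2024 = 9.7e12            # tokens per month, April 2024
tokens_aug_2025 = 1.46e15           # tokens per month, August 2025
print(f"Token growth: ~{tokens_aug_2025 / tokens_apr_2024:.0f}x in 16 months")

tpus_per_pod = 9216
hbm_per_tpu_gb = 192                # assumed per-chip HBM capacity
print(f"Pod HBM: {tpus_per_pod * hbm_per_tpu_gb * 1e9 / 1e15:.2f} PB")

# Row and hot-spare arithmetic from the pod-layout discussion:
tpus_per_row = 7 * 16 * 4           # racks/row * systems/rack * TPUs/system
spares = 1536
rows = (tpus_per_pod + spares) // tpus_per_row
print(f"{tpus_per_row} TPUs per row -> {rows} rows for active + spare devices")
```

Run as-is, this prints a roughly 150X token-growth factor, 1.77 PB of HBM per pod, and 24 rows of 448 TPUs once the presumed spares are counted, consistent with the figures reported above.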
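To make the gateway’s described behaviour concrete, here is a minimal, hypothetical sketch of context-aware routing across disaggregated prefill and decode pools. The Replica class, the route function, and the prefix-hash bookkeeping are invented for illustration; they are not the GKE Inference Gateway’s actual API, which the article does not detail.

```python
"""Minimal sketch of context-aware request routing, in the spirit of what the
GKE Inference Gateway is described as doing. All names here are hypothetical."""

from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    role: str                        # "prefill" or "decode"
    queue_depth: int = 0
    cached_prefixes: set = field(default_factory=set)


def route(replicas, prompt_prefix_hash, phase):
    """Pick a replica for one inference phase.

    Prefer replicas that already hold the prompt's KV cache (so decode can
    skip recomputation), then break ties by shortest queue.
    """
    candidates = [r for r in replicas if r.role == phase]

    def score(r):
        has_cache = prompt_prefix_hash in r.cached_prefixes
        return (not has_cache, r.queue_depth)   # cached-first, then least loaded

    best = min(candidates, key=score)
    best.queue_depth += 1
    return best


# Separate pools for the compute-heavy prefill phase and the memory-bound
# decode phase, mirroring the disaggregated-serving idea described above.
pool = [
    Replica("prefill-0", "prefill"),
    Replica("decode-0", "decode", cached_prefixes={"abc123"}),
    Replica("decode-1", "decode"),
]

print(route(pool, "abc123", "prefill").name)   # prefill-0
print(route(pool, "abc123", "decode").name)    # decode-0 (already has the context)
```

The point of the split pools is that prefill is compute-bound while decode is memory-bandwidth-bound, so running them on different hardware, and steering decode requests to replicas that already hold the relevant context, avoids both recomputation and head-of-line queueing.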
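Speculative decoding itself is a published technique, and the toy Python below only gestures at the idea: a cheap draft model proposes several tokens and a more expensive target model verifies them, so the large model takes fewer sequential steps. Both “models” here are trivial stand-in functions (nothing to do with Gemini), and a real implementation would verify the whole draft in one batched forward pass with probabilistic acceptance rather than the per-token greedy check used here.

```python
def draft_model(prefix, k=4):
    """Cheap proposer: guesses the next k tokens with a simplistic rule
    (increment modulo 10), so it is sometimes wrong."""
    out, last = [], prefix[-1]
    for _ in range(k):
        last = (last + 1) % 10
        out.append(last)
    return out


def target_greedy(prefix):
    """Expensive model's greedy next token (stand-in rule: increment mod 50)."""
    return (prefix[-1] + 1) % 50


def speculative_step(prefix, k=4):
    """Accept draft tokens while they match the target's greedy choice;
    on the first mismatch, keep the target's token instead and stop."""
    accepted = []
    for tok in draft_model(prefix, k):
        expected = target_greedy(prefix + accepted)
        if tok == expected:
            accepted.append(tok)        # verified draft token, no extra target step
        else:
            accepted.append(expected)   # correction from the target model
            break
    return accepted


sequence = [7]
for _ in range(3):                      # three verification passes
    sequence += speculative_step(sequence)
print(sequence)                         # [7, 8, 9, 10, 11, 12]
```

When the draft model guesses well, several tokens are committed per verification pass, which is where the speed and energy savings come from.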
With these advances, Google continues to cement its leadership in AI infrastructure, combining custom silicon, intelligent software, and cutting-edge cooling to power the next generation of AI at unprecedented scale.
