Nvidia Employee Calls Microsoft’s Blackwell GPU Cooling Wasteful, Highlighting Trade-Off Between Energy and Water Use
An internal Nvidia email reveals that a company employee described Microsoft’s cooling system for its Blackwell GPU deployments as “wasteful,” highlighting ongoing tensions around efficiency in AI infrastructure. The observation, made in early fall during the installation of Nvidia’s GB200 Blackwell architecture at a Microsoft data center, focused on the cooling setup for two NVL72 server racks, each housing 72 GPUs.
The Nvidia Infrastructure Specialists (NVIS) team member noted that while the liquid cooling system used for the servers themselves was effective, the cooling approach at the facility level appeared inefficient. The staffer pointed to the large physical footprint of the cooling infrastructure and its reliance on air cooling rather than water-based systems, which are typically more efficient at heat removal.
According to Shaolei Ren, an associate professor of electrical and computer engineering at the University of California, Riverside, this distinction is critical. Building-level cooling systems that rely on air instead of water consume more energy but avoid water usage, a growing concern in regions facing water scarcity. “There’s a trade-off,” Ren explained. “Air cooling uses more electricity but reduces water consumption, which matters for public perception and regulatory scrutiny.”
Microsoft responded by clarifying its cooling strategy. The company described its setup as a closed-loop liquid cooling system integrated into existing air-cooled data centers, a hybrid approach that lets Microsoft add cooling capacity without overhauling entire facilities. “These systems ensure we maximize our existing global data center footprint for scale while promoting efficient heat dissipation and optimizing power delivery to meet the demands of AI and hyperscale systems,” a Microsoft spokesperson said.
The email also detailed logistical challenges during the deployment, including the need for extensive documentation and validation work due to unfamiliarity with cluster and system testing protocols. Handover procedures between Nvidia and Microsoft also required more coordination than on previous projects. Despite these hurdles, the hardware performed well: both GB200 NVL72 racks achieved a 100% pass rate on key compute performance tests, a marked improvement over earlier qualification samples.
Nvidia emphasized that its Blackwell systems deliver high performance, reliability, and energy efficiency. The company confirmed that hundreds of thousands of GB200 and newer GB300 NVL72 systems have been deployed by customers, including Microsoft, to support the growing demand for AI workloads.
As data centers expand to meet AI’s insatiable need for compute, the balance between energy, water, and infrastructure efficiency remains a central challenge. Microsoft has pledged to be carbon negative, water positive, and zero waste by 2030, with plans for a zero-water cooling design in its next-generation facilities and innovations in on-chip cooling.
