HyperAI

Is It Difficult to Dissipate Heat in Data Centers? See How Google and DeepMind Use AI to Solve It


By Super Neuro

Scenario description: Google and DeepMind have collaborated to optimize data center energy consumption with machine learning, effectively automating data center management.

Keywords: machine learning, data center, thermal control

With the development of Internet technology, demand for computing power has grown and large data centers have multiplied. This growth, however, poses a threat to the environment and to energy supplies.

Data centers account for a large share of the energy consumed by large-scale commercial and industrial systems. From an environmental perspective, 2017 data show that data centers consume about 3% of global energy and emit 2% of global greenhouse gases.

Another report estimated that data centers use about 200 terawatt-hours (TWh) of electricity per year, roughly equivalent to Iran's total national energy consumption.

A Google data center

If the energy usage of data centers can be optimized, even slight improvements can greatly reduce greenhouse gas emissions and effectively alleviate energy and environmental problems.

Google has been using AI technology to do this. 

If you don’t cool down, you’ll burn money

A large part of the extra energy consumption in data centers goes to cooling, just as a running laptop generates heat that must be carried away.

Google's data centers house the servers behind popular applications such as Google Search, Gmail, and YouTube. These servers generate enormous amounts of heat during operation, which must be dissipated effectively to keep them running normally.

Data center cooling system

However, conventional cooling equipment, such as pumps, chillers, and cooling towers, is difficult to operate well in a dynamic environment like a data center. The main obstacles are the following:

1. The equipment, the way engineers operate it, and the environment interact in complex, nonlinear ways. Traditional approaches and human intuition often fail to capture these interactions.

2. The system cannot adapt quickly to internal or external changes (such as weather), because engineers cannot devise rules and heuristics for every operating scenario.

3. Each data center has a unique architecture and environment. A custom-tuned model for one system may not be applicable to another. Therefore, a general intelligent framework is needed to understand the interactions of data centers. 

Hundreds of lines of code save hundreds of millions of dollars

To solve the above problems, Google and DeepMind are trying to use machine learning (ML) methods to improve the energy efficiency of Google data centers. 

In 2016, Google and DeepMind launched an ML-based recommendation system. They trained a neural network on different operating scenarios and parameters within the data center, creating an efficient and adaptive framework.

The training data were historical records collected by thousands of sensors in the data center: temperature, power, pump speed, set points, and more.

PUE (Power Usage Effectiveness) is defined as the ratio of total building energy consumption to IT energy consumption. The closer the ratio is to 1, the more efficient the energy use is. 

Since the goal is to improve the energy efficiency of data centers, the neural network is trained with the average PUE as its objective.
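The PUE metric itself is a simple ratio. The sketch below, with illustrative numbers of my own choosing (not figures from the article), shows how it is computed:

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy.

    A value of 1.0 would mean every watt goes to computing; real data
    centers are always above 1 because of cooling, lighting, and
    electrical distribution losses.
    """
    if it_equipment_kwh <= 0:
        raise ValueError("IT energy must be positive")
    return total_facility_kwh / it_equipment_kwh

# Example: a facility drawing 1,320 kWh while its IT equipment uses 1,200 kWh
print(round(pue(1320, 1200), 2))  # → 1.1
```

Lowering the cooling share of `total_facility_kwh` is exactly what moves this ratio toward 1.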

Google Data Center PUE Measurement Range

Additionally, they trained two ensembles of deep neural networks to predict the data center's temperature and pressure one hour into the future. These predictions are used to simulate the actions recommended by the PUE model, ensuring that no operating constraints are exceeded.

The models were tested by deploying them live in a data center. The figure below shows one such test, comparing periods with machine learning control turned on and turned off.

Using ML, the system consistently reduced the energy used for cooling by 40%; after accounting for electrical losses and other non-cooling inefficiencies, overall PUE overhead fell by 15%. At the time, this was equivalent to saving hundreds of millions of dollars in capital expenditure, and it produced the site's lowest PUE ever.

PUE data for all of Google's large-scale data centers

Cloud-based AI is about to replace human labor

In 2018, they took the system a step further: the AI gained more autonomy and directly controlled the data center's cooling, while remaining under the professional supervision of data center operators. The upgraded system is already saving energy in multiple Google data centers.

This technology provides analytics and policies as a cloud-based service. 

Every five minutes, the cloud-based AI takes a snapshot of the data center's cooling system from thousands of sensors and feeds it into deep neural networks, which predict how different combinations of potential actions will affect future energy consumption.

The AI system then identifies which actions will result in the least energy consumption while satisfying the constraints that guarantee safety. These actions are then sent back to the data center, where they are verified and then implemented by the local control system. 
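The snapshot-predict-select loop described above can be sketched as follows. Everything here is hypothetical: the sensor fields, the candidate setpoints, and the stand-in energy and safety functions are placeholders for the deep-network forecasts and constraint checks the real system used:

```python
SAFE_SETPOINTS = [14.0, 15.0, 16.0, 17.0, 18.0]  # candidate chiller setpoints, °C

def read_sensors() -> dict:
    # Placeholder for the snapshot of thousands of sensor readings.
    return {"load_kw": 900.0, "outside_temp": 22.0}

def predicted_energy(snapshot: dict, setpoint: float) -> float:
    # Stand-in for the neural-network forecast: a warmer setpoint
    # means less chiller work, hence less energy.
    return snapshot["load_kw"] * (1.25 - 0.01 * setpoint)

def is_safe(snapshot: dict, setpoint: float) -> bool:
    # Stand-in for the constraint checks (temperature, pressure limits).
    return setpoint >= 15.0 or snapshot["outside_temp"] < 20.0

def control_step() -> float:
    """One iteration of the loop: snapshot the plant, keep only safe
    candidate actions, pick the one with the lowest predicted energy."""
    snapshot = read_sensors()
    candidates = [s for s in SAFE_SETPOINTS if is_safe(snapshot, s)]
    return min(candidates, key=lambda s: predicted_energy(snapshot, s))

print(control_step())  # → 18.0 (warmest safe setpoint uses least energy)
```

The chosen action would then be handed back to the local control system for verification before being applied, as the article describes.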

Four steps of specific operation

The idea stemmed from feedback from a data center operator using the AI recommendation system. The operator said that while the system had taught the team some new best practices, such as spreading the cooling load across more equipment, they were curious whether similar energy savings could be achieved without manual implementation.

Then, the AI takes over completely. Little to no operator assistance is required.

In the new system, they redesigned the AI agent and underlying infrastructure, while also focusing on security and reliability, using a variety of mechanisms to ensure that the system always runs as expected.

Other security control modes

Moreover, ultimate control rests with the operators, not the AI. Staff can exit AI control mode at any time, and by limiting the system's optimization boundaries, the use of AI is kept within a safe and reliable range.

Google officials said, "We hope to achieve energy conservation with less manpower. Automated systems can perform more detailed operations at a higher frequency while avoiding errors." 

AI says: There is no strongest, only stronger

In the months of trialing the new system, they achieved an average sustained energy saving of 30%, and the figure is still improving. These systems get better over time as more data accumulates, as shown in the figure below.

This graph depicts how AI has changed over time, with blue representing the amount of data and green representing changes in performance.

Over a period of several months, AI control system performance increased from a 12% improvement (at the initial launch of autonomous control) to approximately a 30% improvement.

As the technology matures, the optimization scope of the system will also be expanded, thereby achieving greater energy consumption reductions. 

Google officials said data centers are just the beginning. In the long run, this technology has the potential to be applied to other industrial fields and help improve environmental problems on a larger scale.
