How data centers are making the giant leap to 1 megawatt per rack

The electrical appetite of data centers is almost insatiable. A single server rack will require up to 1,000 kilowatts, or 1 megawatt, in the near future. Why are such racks necessary, and what will they be capable of?

Data centers already consume a great deal of electricity, and part of that demand stems from inefficiency. The major players in global IT infrastructure have therefore set their sights on streamlining the power supply to server racks, with significantly fewer conversions between AC and DC and higher voltages inside the data center itself. For a long time this was impractical: IT equipment runs on low voltages and has traditionally been designed with energy efficiency in mind. AI is changing that, because the scale of each server rack is increasing and its power consumption is rising with it.

The need for this increase in scale lies in the nature of AI workloads. Both training and running AI models are highly parallel, meaning they depend on countless small calculations that can be performed simultaneously. GPUs excel at this where CPUs do not, but at a much higher power draw per rack. Conventional CPU racks have tended to stay below 100 kW, a tenth of the 1 megawatt discussed below.
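To make that parallelism concrete, here is a minimal Python sketch (our own illustration, not from any vendor) of why this kind of work maps so well onto a GPU's many small cores: every output element of a matrix multiplication is an independent dot product, so huge numbers of them can be computed at the same time.

```python
# Minimal illustration: each output element of a matrix multiplication is an
# independent dot product, so all of them could in principle run simultaneously.
# This is the kind of massively parallel work GPUs handle far better than CPUs.
import numpy as np

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)

C = np.empty((256, 256))
for i in range(256):          # every (i, j) below is independent of all the others:
    for j in range(256):      # 256 * 256 = 65,536 small calculations, in theory all at once
        C[i, j] = A[i, :] @ B[:, j]

assert np.allclose(C, A @ B)  # matches the vectorized form a GPU would execute
```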

Nvidia is the leader, but by no means alone

The upward trend in power density is most clearly visible on the roadmap of AI chip manufacturer Nvidia. Whereas the A100 GPUs from 2022 reached up to 25 kilowatts per rack, the latest Blackwell generation of AI chips has already pushed this to 132 kilowatts per rack. A single rack-scale Nvidia system integrates 72 GPUs, and customers typically deploy many such systems side by side, so an entire data center architecture ends up built around 132 kilowatts per rack repeated across a large floor space. Such wattages often force IT architects to adopt liquid cooling to manage the heat efficiently.
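As a rough, hedged illustration of how a 72-GPU rack reaches that kind of figure, the sketch below adds up an assumed per-GPU power draw, host CPUs, networking and overhead. The individual numbers are our own placeholder assumptions, not an official Nvidia specification, but they land in the same order of magnitude.

```python
# Back-of-the-envelope estimate of a 72-GPU rack's power draw.
# All component figures below are illustrative assumptions, not official specs.
GPUS = 72
CPUS = 36                        # assumed: one host CPU per two GPUs
W_PER_GPU = 1_200                # assumed GPU board power in watts
W_PER_CPU = 300                  # assumed CPU power in watts
W_SWITCHES_AND_NICS = 10_000     # assumed interconnect switches and network cards
OVERHEAD = 0.10                  # assumed share for fans, pumps and conversion losses

it_load_w = GPUS * W_PER_GPU + CPUS * W_PER_CPU + W_SWITCHES_AND_NICS
rack_w = it_load_w * (1 + OVERHEAD)
print(f"Estimated rack power: {rack_w / 1_000:.0f} kW")
# -> roughly 118 kW, the same order of magnitude as the 132 kW cited above
```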

However, this is by no means the end of the story: Blackwell Ultra will require up to 150 kilowatts per rack later this year, after which the Rubin and Feynman generations of Nvidia chips will gradually push the wattage per rack toward 1 megawatt. That milestone is supposedly not due until 2028, but lest we forget: 2028 is only 26 months away. Note that these are maximum wattages; many configurations stick to much lower, more manageable figures. Nevertheless, AI hyperscalers are keen to maximize their capacity and, with it, their computing power per rack.
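Put side by side, the per-rack maximums quoted in this article show how large the jump really is; a quick calculation:

```python
# Growth factors implied by the per-rack maximums quoted in this article.
roadmap_kw = {
    "A100 era": 25,
    "Blackwell": 132,
    "Blackwell Ultra": 150,
    "Rubin/Feynman target": 1_000,
}
baseline = roadmap_kw["Blackwell"]
for generation, kw in roadmap_kw.items():
    print(f"{generation:<22} {kw:>5} kW  ({kw / baseline:.1f}x a Blackwell rack)")
# The 1 MW target is roughly 7.6x today's densest Blackwell racks and 40x the A100 era.
```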

Nvidia is the most outspoken proponent of increasing computing power per rack, but it is by no means the only party talking about 1 megawatt per rack. “AI infrastructure is hot,” Google stated back in April of this year. Proposals to accommodate 1 MW per rack and to raise the voltages used inside data centers have come from several vendors, Google among them. The Open Compute Project (OCP) develops standards for data centers, including such basics as the classic rack and the 1U unit of measurement, and Google presented its new standardization proposals under that banner. Alongside Google, Meta and Microsoft are among the tech giants pushing for standardized electrical and mechanical interfaces. They are not doing this for altruistic reasons, but out of practical considerations: if every AI player devises its own standard for what are essentially identical electrical requirements, effort is duplicated and money that could have gone to the actual AI hardware goes up in smoke. Hence the call for standards in this area, one we think has a high probability of success.

400 VDC, 800 VDC

In April, Google introduced 400 VDC (volts of direct current), a voltage that can theoretically support 1 MW per rack. The advantage of 400 VDC is that electric vehicles already use it, so adoption is well underway. The move to 400 VDC also simplifies the path from high-voltage grid electricity to the data center: today there are more conversions between AC and DC than strictly necessary. As a first step, Google wants to improve the efficiency of these conversions by 3 percentage points; eventually, the higher DC voltages need to reach the racks with even fewer conversions along the way. That initial improvement should be possible thanks to so-called “sidecars”: pieces of electrical infrastructure that sit physically next to the server racks and supply the power and cooling the AI compute requires. As the name suggests, they are peripheral to the AI infrastructure on which performance depends.
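To illustrate why the conversion chain matters at these power levels, the sketch below multiplies a few assumed per-stage efficiencies through a legacy chain and a shorter, sidecar-style chain. The stage counts and percentages are our own placeholders, not Google's figures, but they show how an end-to-end gain of around three percentage points translates into tens of kilowatts per 1 MW rack.

```python
# Illustrative only: conversion losses multiply through the chain, so removing a
# stage (or improving each one slightly) pays off heavily at 1 MW per rack.
# The stage efficiencies below are assumptions, not measured values.
RACK_POWER_W = 1_000_000

legacy_chain = [0.97, 0.96, 0.97]      # assumed: several AC/DC and DC/DC hops
sidecar_chain = [0.975, 0.97, 0.985]   # assumed: shorter, more efficient chain

def delivered(chain, power=RACK_POWER_W):
    """Power left after passing through each conversion stage."""
    for efficiency in chain:
        power *= efficiency
    return power

for name, chain in (("legacy", legacy_chain), ("sidecar", sidecar_chain)):
    out = delivered(chain)
    print(f"{name:>7}: {out / 1_000:.0f} kW delivered, {(RACK_POWER_W - out) / 1_000:.0f} kW lost as heat")
# -> about 903 kW vs 932 kW delivered: roughly a 3-percentage-point difference end to end
```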

As usual, Nvidia is leading the way here. Last week it was already talking about 800 VDC, which incidentally is also a standard for newer EV platforms. “More than 150 percent of additional power is transferred via the same copper with 800VDC [compared to traditional systems, ed.],” eliminating the need for 200-kilogram copper voltage rails to power a single rack, Nvidia states; the sketch below illustrates the arithmetic behind that copper argument. In addition, the company wants to relax cooling requirements with Vera Rubin, the successor to its Blackwell chips. An inlet temperature of up to 45 degrees Celsius should be acceptable for liquid cooling, significantly higher than the 32 degrees Celsius that is commonplace for GPUs today.
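The copper argument comes down to simple electrical arithmetic. The sketch below (a simplification of our own, not Nvidia's full comparison) shows how much current a 1 MW rack would need at different DC voltage levels; the 54 V figure is included as an assumed stand-in for today's in-rack busbars.

```python
# For direct current, power = voltage * current, so at a fixed power level the
# required current (and with it the copper cross-section) shrinks as voltage rises.
# The voltage levels below are chosen for illustration; 54 V stands in for a
# typical in-rack busbar today (assumption).
POWER_W = 1_000_000  # the 1 MW per rack discussed in this article

for volts in (54, 400, 800):
    amps = POWER_W / volts
    print(f"{volts:>4} VDC -> {amps:>8,.0f} A to deliver 1 MW")

# Halving the current also cuts resistive (I^2 * R) losses in the same copper by
# a factor of four, which is why higher DC voltages move more power per kilogram.
```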

Conclusion: more power, less space

It is clear that a single data center may require more power than ever before. GPUs are the AI powerhouses demanding a total overhaul of IT infrastructure, from power delivery to cooling techniques. But this also presents opportunities, as it turns out. The scale-up to 1 megawatt per rack makes higher voltages closer to the compute a logical step. That brings a certain simplification of the IT infrastructure, but also the growing pains that come with departing from old standards. The advantage for all AI players is that they are joining forces on new, large-scale standardization, and they are doing so even before the envisioned 1 megawatt per rack actually becomes a reality.

Read also: Nvidia and TSMC start production of Blackwell chips