Impact
Artificial intelligence (AI) and high-performance computing (HPC) are fundamentally reshaping data center design. Unlike traditional facilities with predictable, lower-density workloads, AI “factories” concentrate extreme and rapidly changing power and thermal demands. Rack densities have escalated from approximately 120 kW to several hundred kilowatts, with megawatt-class racks anticipated in the near term. These conditions invalidate many legacy assumptions around power delivery, cooling, and physical infrastructure.
The result is a decisive shift toward integrated design, where architectural, electrical, and mechanical systems are optimized as a single, interdependent system. Power and cooling can no longer be treated as separable domains; decisions in one directly affect the other. When it is applied effectively, integrated design enables significant gains in efficiency, water stewardship, reliability, and speed to market. When it is not, facilities risk stranded capacity, operational instability, and costly retrofits.
At the same time, the pace of technological change introduces uncertainty. Chip roadmaps, cooling technologies, and power architectures are evolving faster than traditional design cycles. Successful facilities, therefore, balance immediate performance requirements with long‑term adaptability, ensuring infrastructure can scale and pivot as AI workloads mature.
Highlights
- AI rack densities are rapidly increasing, with megawatt‑scale racks on the horizon.
- Power and cooling should be designed as a unified system from the outset.
- Direct‑to‑chip liquid cooling has emerged as the industry standard for AI and HPC.
- Warm‑water, chiller‑less cooling enables near‑zero water use in many climates.
- Higher‑voltage rack distribution, particularly 800 VDC, is gaining adoption to reduce current, copper, and conversion losses.
- Integrated liquid‑cooled facilities can achieve PUE values near 1.10, compared to ~1.4 to 1.6 for traditional designs.
- Modular and off‑site construction are compressing delivery schedules and enabling pay‑as‑you‑grow expansion.
Discussion
The data center industry is changing rapidly, driven above all by the exponential growth of computational demand from AI, and data center designers are advised to engineer for agility. The industry now encompasses two distinct paradigms. Traditional, central processing unit (CPU)‑centric facilities continue to support the majority of workloads and will remain essential. In contrast, AI and HPC facilities concentrate a smaller number of racks that dominate power consumption, thermal output, and capital intensity. It is this latter category that drives the need for new integrated design principles. This section discusses integrated design at various scales, from the rack level through the whole facility level.
Power and Cooling Integration
High‑density AI workloads generate large, synchronized power swings that challenge conventional electrical infrastructure. Traditional low‑voltage distribution becomes impractical at these scales, driving interest in higher‑voltage architectures such as 800 VDC. While sustainability benefits exist, the primary drivers are performance, reliability, and physical feasibility: higher voltage reduces current, eases copper constraints, and supports tighter graphics processing unit (GPU) clustering, which improves application performance.
Thermally, air cooling has reached practical limits for high‑density deployments. Liquid cooling, particularly direct‑to‑chip cold plate systems, has become the dominant approach for AI and HPC due to its maturity, scalability, and reliability. Warm‑water operation further enables chiller‑less designs, shifting heat rejection to dry coolers and dramatically reducing water consumption.
Facility and Delivery Implications
Integrated design affects the entire facility. AI racks are significantly heavier, driving a shift toward reinforced slab‑on‑grade construction, seismic anchoring where required, and higher ceiling heights. Extensive liquid distribution networks necessitate enhanced leak detection, zoning, and containment strategies.
To address rapid demand growth and uncertainty, modular and off‑site construction methods are increasingly used. Standardized, factory‑built power, cooling, and information technology (IT) modules reduce construction timelines, improve quality control, and allow capacity to be added incrementally without extensive redesign.
Designing for an Industry in Flux
Our industry is a tale of two data centers. On the one hand, there are traditional data centers that support a diverse set of CPU-based workloads, such as payment card information (PCI) processing. These workloads are predictable and consume much less energy. Traditional data centers do not require specialized designs, as their demands are more easily anticipated and, therefore, more manageable. Air cooling systems alone can handle the thermal loads of these data centers. Though traditional facilities receive less attention, they still make up the majority of data centers. They will continue to play an important role in society, handling the workloads that pertain to most people’s daily lives. These are not the data centers covered in this section.
AI factories, GPU farms, and HPC are new data centers that pose unique challenges, with power density and thermal loads so great and so hard to predict that they render much of legacy hardware obsolete. Managing power distribution and accompanying thermal density is the greatest challenge data centers face. These are the challenges covered in this section.
Designers of high-density data centers need to balance the tension between meeting current needs and factoring in anticipated growth in computational complexity, energy consumption, and thermal loads. As such, integrated design is more necessary than ever. The levels of heat that high-density data centers produce mean that every choice and calculation has a multiplied impact on related systems, components, and materials. Successful designers take a holistic view of architectural, electrical, and mechanical strategies, with an eye toward scaling infrastructure efficiently.
Data center designers, therefore, would do well to consider short- and long-term viability of power distribution and cooling solutions. What are their advantages and disadvantages today? What is their potential for future application? How might systems, both inside and outside of the data center, have to evolve in order to accommodate mass-scale deployments of certain systems and solutions? Are they practical or even feasible?
Key Roles and Functions of Rack Integrators
As rack densities rise above 400 kW per rack, the role of rack integrators is evolving in response to demand for more integrated design approaches. IT rack integrators are specialized partners or service providers that design, assemble, configure, and test high-density computing infrastructure (servers, storage, networking) as pre-assembled racks before they are delivered to the site. In the context of AI, these integrators may address challenges associated with the extreme complexity of liquid cooling, high-power requirements (often 30–100+ kW per rack), and complex networking topologies (such as InfiniBand or specialized fiber) required for training AI models. This approach can shift portions of the traditional on-site “rack-and-stack” process into a “plug-and-play” deployment, saving time and reducing errors. Activities under a rack integrator’s purview include:
- Engineering and design: Co-designing rack layouts, ensuring proper airflow, power distribution (intelligent PDUs), and thermal management.
- Physical “rack and stack”: Mounting heavy AI GPU servers, networking gear, and storage, which can weigh significantly over 2,000 lbs.
- Liquid cooling integration: Installing cooling distribution units (CDUs), installing manifolds, and connecting cold plates to GPUs.
- Complex cabling: Implementing structured cabling (e.g., MPO fiber) and high-speed networking, which can reduce congestion by up to 75%.
- Software provisioning and testing: Loading BIOS, firmware, and OS images, and conducting rigorous performance tests in controlled, off-site environments.
- Logistics and deployment: Shipping fully integrated, tested, and validated racks directly to the data center for immediate, rapid installation.
Rack integrators are actively involved in deploying holistic data center designs. Successful integrators control the power and cooling sidecars for voltage conversion and heat rejection, for the most part, and control rack integration. They are determining the infrastructure: the busbars, the piping for cooling, etc. As such, server original equipment manufacturers (OEMs) and integrators are becoming more important. Chip manufacturers would rather not get involved with integration and deployment, preferring to focus on the server level. Part of it has to do with the fact that the majority of liability lies in integration and deployment, which relates to the warranty.
In this respect, infrastructure providers are evolving to become more mechanical and more power-related, instead of focusing exclusively on servers.
Increasingly, power and cooling at the rack level cannot be separated. High-current busbars bringing power to the rack will need a cooling system. Though separation between rack manufacturers and server manufacturers may remain, they will have to be integrated, such as plug-and-play servers that can easily interface with electrical interconnections and liquid interconnections. Thus, as the industry evolves, it is likely that norms will shift such that servers will be shipped ready to be plugged into the system. The question then emerges of pipeline ownership and whether one manufacturer will handle this whole process. The industry is already seeing rack integrators heading in this direction as they demonstrate design and assembly ownership over the entire rack as a part of a larger system. Integrated designs take a holistic view of the data center, treating
Benefits of Off-Site Rack Integration for AI Factories
AI workloads require specialized and dense hardware that traditional data center teams may struggle to deploy efficiently. Integrators are necessary for multiple reasons:
- High power/thermal loads: AI racks often exceed 60°C/140°F and require specialized, advanced liquid cooling.
- Faster time to market: Pre-integrated racks can be deployed in days rather than weeks, as the assembly is done before arrival.
- Risk reduction: Manufacturing best practices and off-site testing reduce on-site downtime and human error.
- Hardware optimization: Implementation of reference architectures provided by GPU, CPU, and tensor processing unit (TPU) providers to ensure optimized ITE performance.
For the implementation of high-density racks at scale, key considerations include adopting liquid cooling, ensuring massive power availability in gigawatt campuses, high-speed networking, and structural reinforcement to support heavier, denser equipment. The importance of time-to-market and data center reliability cannot be overstated given the growth forecasts.
The latest generation of GPUs is designed to operate with facility fluid inlet temperatures of 45°C (113°F), which is a key enabler of dry cooler–based heat rejection. Dry cooling uses ambient air to cool loops of fluid, such as water or glycol, that circulate from the server rooms. The elimination of mechanical cooling allows a higher percentage of available site power to be allocated to revenue-generating IT equipment. In extreme heat, adiabatic assistance (evaporative cooling support) will be required to maintain the 45°C/113°F setpoint.
Integrated design is about context. Therefore, it helps to begin with the data center industry’s current trajectory.
Context Influencing Design Principles
AI projections. Though everyone agrees that AI will continue to grow, it’s impossible to predict at what pace. No one knows what percentage of data centers will be built for AI applications in the next few years. Different studies arrive at different projections about AI growth, acknowledging those projections may change. This creates a level of uncertainty for data centers.
Rate of change. Just three years ago, most industry professionals did not anticipate that server racks would exceed 120 kW any time soon. Now, one-megawatt (MW) racks are expected to ship next year. In other words, the rate of change is so great that the industry is advised to keep this state of flux in mind when designing data centers, accounting for immediate needs as well as emerging demands.
If a data center waits a year for a coolant distribution unit (CDU), for example, the investment may be squandered, as the technology may have already evolved. A similar approach should be taken for handbooks, standards, and design principles, revisiting them every year or two. A willingness to pivot is key to data center design today.
The influence of chipmakers. Chipmakers can have a major impact on the direction of data center design, power infrastructure, and cooling timelines. As an example, Dell used to incorporate throttling software in its laptops so that air cooling would suffice, avoiding the cost of more sophisticated cooling systems. As a result, air cooling technology lasted for many more years than would otherwise have been possible. A similar dynamic is likely to emerge with liquid cooling. When water use for cooling becomes too expensive, the industry will be motivated to protect and extend the lifecycles of its investments, and chipmakers may reengineer chipsets to support water efficiency.
Environmental Considerations
Data centers don’t operate in a vacuum. The environment in which they are housed informs the conditions and parameters within which they operate.
Local environment. The cost and availability of water in a region influence data center cooling efforts. The ramp-up of data center construction, and its demand for water, also coincides with the growing global water crisis. Availability of water in a region will become a more valuable consideration, and the water-energy trade-off may have significant sustainability and resource implications. Additionally, as referenced when determining data center siting, the local weather plays an appreciable role in cooling efforts. In the last few years, record-high temperatures across the world have presented data centers with cooling challenges, even causing widespread outages because of cooling system failures.
The Inseverable Relationship Between Power and Cooling
Power and cooling are inextricably bound. This dynamic is increasingly manifesting in how data center systems are designed.
Integrated, again. Early IT infrastructure was heavily consolidated, integrating all operations into an on-premises footprint. For example, IBM had a reputation of delivering high-quality systems because they maintained end-to-end control over their entire technology stack. The industry deviated from that approach for economic reasons and may need to revisit this approach. Power and thermal infrastructure for data center design are intrinsically linked; electrical and cooling mechanisms form an interdependent ecosystem and cannot be efficiently retrofitted as an afterthought. They need to be considered during the design phase, and it is recommended thermal control plans be highly integrated. As mentioned previously, industry roles are evolving to be less siloed and more holistic, such as the emerging role of rack integrators.
Workloads and power. Different workloads affect power demands. For example, AI training typically leads to power spikes, whereas inferencing tends to level off. As such, there needs to be a baseload (turbines, nuclear) but also a fast response, hence the importance of energy storage systems. AI factories need a power mix.
The time for high-voltage DC. Higher-voltage direct current (DC) is nothing new, but over the years it has been met with resistance, primarily because of the infrastructure already in place.
Regardless of that hesitation, there are a lot of advantages to moving to higher voltages. It reduces current demand; it reduces the size of busbars needed to bring power into the rack; it requires fewer transitions; and it’s more efficient. At the rack level, 800 VDC will be easier to distribute. Moreover, 800 VDC is so critical to the integrated design of HPC data centers that an entire section is dedicated to it.
The Arrival of 800 VDC
As technology presses forward into the era of AI factories, it is recommended data center designers rethink power distribution at megawatt scale and, consequently, power infrastructure. At the center of this reckoning is 800 VDC. Whenever the topic of 800 VDC is raised, two questions invariably come up: Why 800 VDC? And why now?
Is It About Sustainability?
Leading companies leverage 800 VDC in their sustainability and product marketing, paying lip service to its sustainability benefits, such as significant efficiency, engineering, and cost advantages over traditional 48-volt DC architecture or 415–480 VAC architecture. The benefits are realized in different ways, including fewer voltage conversion steps, reduced power losses, reduced copper usage, and end-to-end efficiency gains.
Fewer conductors and smaller connectors amount to a decrease of material utilization, such as copper and aluminum, reduced conductor size, and lower installation costs. Native DC architecture at the rack power distribution level obviates the need for AC-to-DC conversion steps, thereby boosting efficiency and reducing waste heat. All in all, this technology boasts a larger capacity to carry power and a reduction in complexity, which reduces the points of failure. As such, 800 VDC is arguably more reliable and more efficient. But none of these benefits can answer one of the two questions posed in the introduction of this section: Why now?
If the primary drivers were the benefits listed above, the industry would have made a concerted push for 800 VDC years ago. In other words, as much as players in the industry like to tout the efficiency of 800 VDC, sustainability is an ancillary benefit, a byproduct, a welcome one but a byproduct nonetheless, not a primary driver.
The Pressures Leading to the Primary Driver
To identify the primary driver for the adoption of 800 VDC, it is recommended to first understand the practical limitations of traditional architecture, such as the 50-volt busbar. These busbars are thick and heavy, upward of 200 or 300 pounds, and are fast approaching thermal limitations for distributing that power within GPU compute.
Many of the connections within a data center depend on copper. It is recommended that data center operators consider copper’s effective reach limitations, which create a performance bottleneck. It is essential for high-performance data centers to pack more GPUs into a smaller physical space. When racks are jumping from tens of kilowatts to beyond 100 kW, with one-megawatt racks just around the corner, delivering these increases of power at traditional low voltages quickly becomes impractical. The outsize current needed would demand an unsustainable amount of copper cabling, which would be physically and financially untenable.
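The scale of the current problem follows directly from I = P / V. The sketch below is a minimal illustration, assuming an ideal DC feed and ignoring conversion losses; the function names are hypothetical and chosen only for this example. It compares the current needed to deliver one megawatt at 50 V and at 800 V, and the relative resistive (I²R) loss in a conductor of fixed resistance.

```python
def bus_current_amps(power_w: float, voltage_v: float) -> float:
    """Current required to deliver power_w at voltage_v (I = P / V)."""
    if voltage_v <= 0:
        raise ValueError("voltage must be positive")
    return power_w / voltage_v


def relative_conduction_loss(v_low: float, v_high: float) -> float:
    """Ratio of I^2 * R loss at v_low to the loss at v_high.

    Assumes the same delivered power and the same conductor resistance,
    so the ratio reduces to (v_high / v_low) squared.
    """
    return (v_high / v_low) ** 2


RACK_POWER_W = 1_000_000  # one-megawatt rack, as discussed in the text

for volts in (50, 800):
    amps = bus_current_amps(RACK_POWER_W, volts)
    print(f"{volts} V feed: {amps:,.0f} A")

print("Loss ratio, 50 V vs 800 V:", relative_conduction_loss(50, 800))
```

Under these assumptions a one-megawatt rack draws 20,000 A at 50 V but 1,250 A at 800 V, a 16-fold reduction in current and a 256-fold reduction in resistive loss for the same conductor. In practice the conductor would be downsized rather than kept constant, which is where the copper savings come from.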
Apart from power density and thermal loads, the other major challenge posed by AI workloads is their unpredictability. A traditional data center processes a range of unrelated tasks, while an AI factory recruits legion GPUs synchronously. When an AI training model spools up, thousands of GPUs work in concert, sharing data and communicating with one another in intense computational coordination. These power load swings present major power management challenges and can threaten the stability of a utility grid.
There are ways to tackle these challenges. Stored energy is one approach to the problem of volatility, while 800 VDC provides the high voltage needed to support HPC workloads without the need for ballooning infrastructure.
The Primary Driver
The primary driver for 800 VDC use in AI data centers is the same driver that prompted the industry to move toward liquid cooling: mitigating packet loss. Packet loss or data degradation is a critical issue that leads to disruption in digital communication. Power instability and overheating cause packet loss and can disrupt synchronization. This can lead to a number of problems, including slow model training and reduced scalability. Packet loss also compels systems to use more resources to compensate for the lost data. A cascade of inefficiencies follows.
Bringing GPUs closer together allows them to sit within a one-meter copper domain. This proximity allows the GPU cluster to run faster. While the carbon and resource utilization benefits are significant and quantifiable, it is paramount to optimize packet density and allow the application to move faster.
Why 800 VDC?
The limitations of traditional architecture answer the question of “Why now?” but do not answer the question of “Why 800 VDC?” Or better put, they only partly answer the question. Hardware limitations answer the question insofar as they identify the necessity of higher voltage in the era of HPC and AI workloads. But 800 VDC isn’t the only higher-voltage game in town.
690 VAC 3-Phase
Though a potential alternative, 690 volts alternating current (VAC) is less likely to be adopted, since 690 VAC systems tend to be found in regions with specific voltage requirements, as in certain European countries. The absence of a neutral conductor means that loads need to be balanced through the three phases. Failing to do so can cause voltage imbalances, resulting in deteriorating performance of connected equipment. At the end of the day, DC-to-DC is more efficient than AC-to-DC. AC/DC will be needed to power the servers.
±400 VDC
To raise 50 V to a higher value, the two most viable values are 800 VDC and ±400 VDC. One major manufacturer is making a push toward 800 VDC, which is having a tremendous sway over the wider industry. This could shift if market leadership or TPU usage changes. Even in the absence of a seismic industry shift, ±400 VDC may play an important transitional role.
While 800 VDC is considered by industry leaders as the efficiency and scale standard, ±400 VDC could emerge as an interim measure as hyperscalers push for its adoption. Since ±400 VDC complies with existing standards and taps more common components, simplifying sourcing, design, compliance, and training, it stands as a practical option. In other words, it’s a safe short- to medium-term bet for efficiency considerations.
Still, whether now or in the medium term, 800 VDC is emerging as the standard voltage in computing infrastructure. As a best practice, data center designers will plan for 800 VDC.
Electrical infrastructure from grid to busway is also expected to experience a major transition towards higher voltages. Examples include:
- Medium-voltage distribution: Power is brought in at 34.5 kV or 13.8 kV and stepped down as close to the white space, or physical operational area, as possible. This improves the electrical distribution design strategy, maximizing space and efficiency. Using 415/240 V distribution (instead of 208 V) is beneficial to reduce copper weight and improve performance.
- Busway systems: Traditional cables cannot handle the current, so hyperscale facilities use overhead busways rated at 800 A, 1,200 A, or higher to deliver power to the racks.
- Continuous power (UPS & BESS): For AI training, “ride-through” is critical. Large-scale battery energy storage systems (BESS) are supplementing traditional lead-acid UPS to handle the massive, instantaneous load swings (from idle to 50 MW) that occur when a training job starts.
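To give a sense of scale for the ride-through requirement, the sketch below estimates the energy a storage system would need to supply to cover a load step for a short period. It is a rough sizing illustration only: the 30-second ride-through window and the function name are assumptions made for this example, not figures from the framework, and real sizing would also account for ramp rates, depth of discharge, and converter limits.

```python
def ride_through_energy_kwh(load_step_mw: float, duration_s: float) -> float:
    """Energy (kWh) needed to carry a load step of load_step_mw for duration_s."""
    if load_step_mw < 0 or duration_s < 0:
        raise ValueError("inputs must be non-negative")
    kilowatts = load_step_mw * 1000.0
    hours = duration_s / 3600.0
    return kilowatts * hours


LOAD_STEP_MW = 50.0       # idle-to-full swing cited in the text
ASSUMED_WINDOW_S = 30.0   # hypothetical ride-through window

energy = ride_through_energy_kwh(LOAD_STEP_MW, ASSUMED_WINDOW_S)
print(f"{LOAD_STEP_MW:.0f} MW for {ASSUMED_WINDOW_S:.0f} s needs about {energy:.0f} kWh")
```

The energy involved is modest (a few hundred kilowatt-hours under these assumptions); the harder requirement is the power rating, since the storage system has to source or absorb tens of megawatts almost instantly.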
Cooling Solutions: Timeline of Readiness and Effectiveness
Air cooling extended. The limits of air cooling are higher than originally thought (from 25-35 kW to ~40 kW), which extended its utility. While the majority of new data center construction will be geared toward AI workloads, at least for the next several years, most legacy data centers, as mentioned in the author’s note, will likely continue to use air cooling. Some legacy data centers may incorporate some liquid cooling but not as a major overhaul.
By volume, AI compute hardware comprises a small percentage of servers shipped globally compared to traditional servers. Traditional workloads remain the backbone of the market, but the immense power and cooling demands of high-end dynamic AI rack workloads are driving the sector’s biggest shifts in revenue and posing the greatest infrastructure challenges.
Direct-to-chip – industry standard. Direct-to-chip (DTC) cooling is emerging as the de facto thermal management industry standard for cooling HPC infrastructure. Many agree DTC will be the most utilized liquid cooling solution for the foreseeable future. DTC cooling is the most mature liquid cooling solution and the most reliable liquid option. It will increasingly be the preferred solution for large-scale high-density deployments because of its clean and modular approach. Its ability to cool racks upward of 3 MW hints at this technology’s potential longevity.
Warm water cooling – sustainable but fleeting. There is a growing trend of warm water cooling as a sustainable approach because it eliminates the need for chillers, which minimizes thermal resistance. The solution boasts significant energy efficiency benefits, and it can simplify the cooling infrastructure design.
Despite these perceived benefits, not everyone is as convinced of the practical applications of warm water cooling. While many agree the trend toward 45°C/113°F is interesting and that, in theory, it works, they caution that bullishness on warm water is predicated on the assumption of sustainability. Not all chipmakers agree. The feasibility of this technology hinges on chip failure rates. It may be an attractive option for some, but others see it as more of a transitional solution that ultimately won’t be able to meet increasing power demands.
The reference architecture for GPUs allows the facility to reject heat directly to the ambient air using dry coolers, even in relatively warm climates, with high-temperature secondary cooling loops, which are the key enabler for dry cooling.
- High inlet temperatures: The system supports facility water inlet temperatures up to 45°C/113°F.
- Elevated return temperatures: Liquid returns from the racks at temperatures up to 65°C (149°F).
- Approach temperature: Because the return water is so much hotter than the outdoor air (e.g., 65°C/149°F water vs. 35°C/95°F ambient air), dry coolers can efficiently transfer heat without needing to refrigerate the water.
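The temperatures above translate into coolant flow through the heat balance Q = ṁ · c_p · ΔT. The sketch below is a rough illustration, assuming plain water (c_p of about 4,186 J/kg·K; glycol mixtures have a lower value and need more flow) and using the 45°C inlet and 65°C return quoted in the list. The function names are hypothetical.

```python
WATER_CP_J_PER_KG_K = 4186.0  # assumption: plain water, not a glycol mix


def coolant_mass_flow_kg_s(heat_w: float, inlet_c: float, return_c: float,
                           cp_j_per_kg_k: float = WATER_CP_J_PER_KG_K) -> float:
    """Mass flow needed to carry heat_w with the given inlet/return temperatures."""
    delta_t = return_c - inlet_c
    if delta_t <= 0:
        raise ValueError("return temperature must exceed inlet temperature")
    return heat_w / (cp_j_per_kg_k * delta_t)


def driving_delta_t_k(return_c: float, ambient_c: float) -> float:
    """Temperature difference between returning coolant and outdoor air."""
    return return_c - ambient_c


flow = coolant_mass_flow_kg_s(1_000_000, inlet_c=45.0, return_c=65.0)
print(f"Flow per MW of heat at a 20 K rise: {flow:.1f} kg/s")
print(f"Driving difference at 35 C ambient: {driving_delta_t_k(65.0, 35.0):.0f} K")
```

With a 20 K rise, each megawatt of captured heat needs roughly 12 kg/s of water, and the 30 K gap between 65°C return water and 35°C air is what lets a dry cooler reject that heat without refrigeration.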
Integration with Dry Coolers
Dry coolers serve as the primary heat rejection method, functioning as large radiator-like units.
- Closed-loop system: Coolant circulates in a sealed loop from the coolant distribution units (CDUs) to the outdoor dry coolers, preventing water loss through evaporation.
- Chiller-less operation: By raising the inlet temperature to 45°C/113°F, facilities can achieve “free cooling” nearly year-round in most climates, as the ambient air is typically cooler than the required inlet water.
- Space requirements: For a 50 MW cluster, dry coolers require a larger physical footprint than traditional cooling towers to achieve the same cooling capacity because of the lower heat-carrying capacity of air versus evaporating water.
Efficiency and Sustainability Gains
The combination of efficiency and sustainability gains drastically improves the facility’s environmental and financial metrics.
- Water usage effectiveness (WUE): Since dry coolers are closed loop, they use virtually zero water for cooling, representing a 300x improvement in water efficiency over traditional evaporative towers.
- Power usage effectiveness (PUE): Eliminating chillers and reducing server fan speeds (as 85% of heat is captured by liquid) can reduce total data center power consumption by approximately 10%.
- Operational cost savings: For a 50 MW facility, deploying these liquid-cooled systems can lead to over $4 million in annual savings compared to air-cooled infrastructure.
System Safeguards
- Thermal throttling: If ambient temperatures exceed the “dry cooler limit” (where the air is too warm to cool the water to 45°C/113°F), the system may trigger CPU/GPU throttling to prevent damage.
- Adiabatic assistance: In extreme climates, “hybrid” dry coolers may use a small amount of misting (adiabatic cooling) only during the hottest hours of the year to maintain the 45°C/113°F setpoint without a full chiller plant.
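The two safeguards above can be read as a simple escalation: run dry while the outdoor air is cool enough, add adiabatic assistance when it is not, and throttle only as a last resort. The sketch below illustrates that decision order. The 8 K approach temperature and the 6 K adiabatic benefit are purely hypothetical placeholders, as are all names; actual values depend on the dry cooler selection and local humidity.

```python
from enum import Enum


class CoolingMode(Enum):
    DRY = "dry cooling only"
    ADIABATIC = "dry cooling with adiabatic assist"
    THROTTLE = "throttle IT load"


def select_cooling_mode(ambient_c: float,
                        setpoint_c: float = 45.0,
                        approach_k: float = 8.0,          # hypothetical
                        adiabatic_gain_k: float = 6.0     # hypothetical
                        ) -> CoolingMode:
    """Pick the least aggressive mode that can still hold the inlet setpoint."""
    # Coldest supply temperature the dry cooler can deliver at this ambient.
    dry_supply_c = ambient_c + approach_k
    if dry_supply_c <= setpoint_c:
        return CoolingMode.DRY
    # Misting lowers the effective air temperature by adiabatic_gain_k.
    if dry_supply_c - adiabatic_gain_k <= setpoint_c:
        return CoolingMode.ADIABATIC
    return CoolingMode.THROTTLE


for ambient in (30.0, 40.0, 46.0):
    print(f"{ambient:.0f} C ambient -> {select_cooling_mode(ambient).value}")
```

A production control system would add hysteresis and rate limits so the plant does not chatter between modes near a threshold; those are omitted here for brevity.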
Dry Cooler PUE Benefits
In a 50 MW AI factory, the transition from a traditional chilled-water plant to a dry-cooled architecture shifts the PUE from an industry average of 1.40–1.60 to a high-efficiency range of 1.05–1.15.
1. PUE Comparison: Chilled-Water vs. Dry-Cooled
A 50 MW IT load requires different total facility power depending on the cooling method.
- Traditional Chilled-Water Plant (PUE ~1.40)
  - Total Facility Power: 70 MW
  - Non-IT Overhead: 20 MW
  - Major Draw: Mechanical chillers account for 30–50% of total non-IT power
- GB200 Dry-Cooled Architecture (PUE ~1.10)
  - Total Facility Power: 55 MW
  - Non-IT Overhead: 5 MW
  - Efficiency Driver: High-temperature liquid cooling (45°C [113°F] inlet) allows for “free cooling” via dry coolers, effectively eliminating mechanical refrigeration
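The figures in this comparison follow from the definition of PUE as total facility power divided by IT power. The short sketch below reproduces them; the function names are illustrative only, and the results are rounded for display because floating-point multiplication does not return exact decimals.

```python
def facility_power_mw(it_load_mw: float, pue: float) -> float:
    """Total facility power implied by an IT load and a PUE."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_load_mw * pue


def overhead_mw(it_load_mw: float, pue: float) -> float:
    """Non-IT overhead (cooling, power conversion, lighting, and so on)."""
    return facility_power_mw(it_load_mw, pue) - it_load_mw


IT_LOAD_MW = 50.0
for label, pue in (("Chilled water", 1.40), ("Dry cooled", 1.10)):
    total = facility_power_mw(IT_LOAD_MW, pue)
    extra = overhead_mw(IT_LOAD_MW, pue)
    print(f"{label}: total {total:.0f} MW, overhead {extra:.0f} MW")
```

For the same 50 MW of IT load, moving from a PUE of 1.40 to 1.10 cuts overhead from 20 MW to 5 MW, which is 15 MW of site power that no longer has to be procured or can be redirected to compute.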
2. Breakdown of Power Savings
The reduction in PUE is driven by three primary infrastructure shifts:
- Chiller elimination: Traditional plants use energy-intensive chillers to produce 7°C–12°C (44.6°F–53.6°F) water. Dry-cooled systems use warm water (up to 45°C [113°F]), requiring only low-power fans in dry coolers to reject heat to the ambient air.
- Reduced server fan power: Since 85% of the heat in a GB200 rack is captured by liquid, internal server fans run at significantly lower speeds. This reduces “IT Power” consumption by up to 7–10%, which further lowers the total facility PUE.
- Pumping efficiency: Water is roughly 3,200 times more efficient than air at absorbing heat by volume. Moving heat via liquid loops requires significantly less electrical work than moving equivalent heat via high-velocity computer room air conditioner (CRAC) fans.
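The water-versus-air comparison in the last bullet comes from volumetric heat capacity, the product of density and specific heat. The sketch below uses textbook property values at roughly room temperature; these values are assumptions for illustration, and the exact ratio shifts with the temperature and pressure chosen for each fluid.

```python
def volumetric_heat_capacity(density_kg_m3: float, cp_j_per_kg_k: float) -> float:
    """Heat absorbed per cubic meter per kelvin (J/m^3.K)."""
    return density_kg_m3 * cp_j_per_kg_k


def water_to_air_ratio(air_density_kg_m3: float = 1.2) -> float:
    """Ratio of water's volumetric heat capacity to air's.

    Assumed properties: water 997 kg/m^3 and 4180 J/kg.K; air 1005 J/kg.K.
    """
    water = volumetric_heat_capacity(997.0, 4180.0)
    air = volumetric_heat_capacity(air_density_kg_m3, 1005.0)
    return water / air


print(f"Ratio with air at 1.20 kg/m^3: {water_to_air_ratio(1.2):.0f}")
print(f"Ratio with air at 1.29 kg/m^3: {water_to_air_ratio(1.29):.0f}")
```

Depending on the air density assumed, the ratio lands between roughly 3,200 and 3,500, which is consistent with the approximate figure quoted above and explains why liquid loops move the same heat with far smaller volumes of fluid.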
3. Estimated Annual Operational Impact (50 MW)
Based on an average electricity cost of $0.10/kWh:
| Metric | Traditional Chilled Water | Dry-Cooled |
| --- | --- | --- |
| Annual Energy Use | ~613 GWh | ~481 GWh |
| Annual Power Cost | ~$61.3 Million | ~$48.1 Million |
| Annual Savings | N/A | ~$13.2 Million |
| Water Consumption | Millions of Gallons (Evaporation) | Near Zero (Closed-Loop) |
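The annual figures can be reproduced from the facility power values (70 MW and 55 MW) and the $0.10/kWh rate. The sketch below assumes the facility runs at constant full load for all 8,760 hours of a year, which is a simplification; the function names are illustrative.

```python
HOURS_PER_YEAR = 8760


def annual_energy_gwh(facility_mw: float) -> float:
    """Annual energy at constant load, in GWh."""
    return facility_mw * HOURS_PER_YEAR / 1000.0


def annual_cost_musd(energy_gwh: float, usd_per_kwh: float) -> float:
    """Annual electricity cost in millions of US dollars.

    1 GWh is 1e6 kWh, so GWh x $/kWh gives millions of dollars directly.
    """
    return energy_gwh * usd_per_kwh


PRICE_USD_PER_KWH = 0.10
chilled = annual_energy_gwh(70.0)
dry = annual_energy_gwh(55.0)
savings = annual_cost_musd(chilled - dry, PRICE_USD_PER_KWH)

print(f"Chilled water: {chilled:.1f} GWh, ${annual_cost_musd(chilled, PRICE_USD_PER_KWH):.1f}M")
print(f"Dry cooled:    {dry:.1f} GWh, ${annual_cost_musd(dry, PRICE_USD_PER_KWH):.1f}M")
print(f"Savings:       ${savings:.1f}M per year")
```

Unrounded, the arithmetic gives 613.2 GWh and 481.8 GWh, a difference worth about $13.1 million per year, which agrees with the approximate values in the table to within rounding.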
4. Key Technical Constraints
- Weather sensitivity: Dry cooling efficiency drops if ambient temperatures exceed 35°C–40°C (95°F–104°F). In extreme heat, adiabatic assistance (misting) may be required to maintain the 45°C/113°F inlet setpoint.
- CapEx trade-off: Dry coolers can cost three to four times more to install than traditional wet cooling towers, though this is offset by lower annual operating costs and the elimination of the chiller plant.
Immersion – long development, niche applications. Immersion cooling is technologically viable, but it faces resistance because of how messy and logistically challenging it is. It represents a paradigm shift in infrastructure and maintenance. Some speculate that immersion could be a good option past the 200 kW/rack threshold, but most are not convinced that the technology will develop quickly enough to be adopted.
Others are even less optimistic, believing immersion to be inferior compared to direct-to-chip. It may develop into a more viable option, but it needs a lot of work. Others see no path forward for immersion as a solution for HPC hardware cooling, apart from some niche cases.
Two-phase has potential but presents a paradigm shift. Two-phase cold plates are not yet necessary for current racks, and the coolants available today are limited by their electrical inertness and poor thermal properties. As such, two-phase cooling is not practical in the near term. Some believe it will eventually be necessary. Even so, most agree that large-scale investments in direct-to-chip (DTC) cooling, coupled with chip advancements, would likely delay a migration to two-phase. Like immersion, it represents a significant paradigm shift with major infrastructure implications.
Cooling optimization. Most agree that digital twins can deliver considerable value for liquid cooling, but plenty of development remains. While digital twins are well-established for air cooling, the liquid cooling infrastructure is still developing. Predictive maintenance would be a major benefit, increasing efficiencies, anticipating breakdowns, and improving uptime, but real-time telemetry is not as developed as it needs to be.
The effectiveness of digital twins in liquid cooling hinges on the quality of the inputs, such as accurate physical dimensions and the correct spec sheets. Without the right raw data, the outputs will be meaningless. Digital twins also require extensive instrumentation, such as Internet of Things (IoT) sensors, and because the technology is evolving quickly, they should be updated frequently. Despite that challenge, the industry is heading toward proactive, self-healing operation. The potential benefits are significant, but the associated risks and drawbacks deserve equal consideration.
Structural: Weight and Seismic Bracing
Beyond high power density, AI data centers also pose challenges with physical density caused by the weight of equipment. Upward of 400 racks weighing around 3,300 lb each create a static load of over 1.3 million lb concentrated in a relatively small footprint.
- Slab-on-grade vs. raised floor: Most 50 MW AI factories abandon raised floors entirely in favor of reinforced concrete slabs to support the weight and simplify the massive liquid piping/manifold runs.
- Seismic anchoring: In many regions, it is recommended that these high-center-of-gravity racks be bolted directly onto the structural slab with seismic isolators.
- Ceiling height and layout: Higher ceilings and larger, flexible, or modular data halls are needed for cabling and cooling infrastructure, such as rear door heat exchangers.
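As a rough illustration of the load figures above, the sketch below totals the static load and estimates the pressure under a single rack. The rack count and weight come from the text. The 24 in × 48 in footprint is a hypothetical value used only to show the calculation.

```python
LB_TO_KG = 0.45359237


def total_static_load_lb(rack_count: int, rack_weight_lb: float) -> float:
    """Combined static weight of all racks, in pounds."""
    return rack_count * rack_weight_lb


def floor_pressure_psf(rack_weight_lb: float, footprint_sqft: float) -> float:
    """Average pressure under one rack, in pounds per square foot."""
    return rack_weight_lb / footprint_sqft


RACK_COUNT = 400
RACK_WEIGHT_LB = 3300.0
FOOTPRINT_SQFT = (24 / 12) * (48 / 12)  # hypothetical 24 in x 48 in rack

total_lb = total_static_load_lb(RACK_COUNT, RACK_WEIGHT_LB)
pressure = floor_pressure_psf(RACK_WEIGHT_LB, FOOTPRINT_SQFT)

print(f"Total static load: {total_lb:,.0f} lb ({total_lb * LB_TO_KG:,.0f} kg)")
print(f"Pressure under one rack: {pressure:.1f} psf")
```

The per-rack pressure is an average over the assumed footprint; point loads at casters or leveling feet would be higher.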
Safety and Leak Detection
With miles of piping carrying liquid over millions of dollars of hardware, risk mitigation is paramount.
- Secondary containment: The facility needs trenching or sloped floors with drainage to manage potential coolant spills.
- Zone-based leak detection: A 50 MW facility needs to be compartmentalized with automated shut-off valves and moisture-sensing cables under every rack and manifold.
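The zone-based approach can be sketched as control logic: each zone has its own moisture sensors and shut-off valve, and a wet sensor closes only that zone's valve. The `Zone` class, field names, and zone labels below are hypothetical and stand in for whatever the building management system actually exposes.

```python
from dataclasses import dataclass, field


@dataclass
class Zone:
    """One leak-detection compartment (hypothetical model)."""
    name: str
    sensor_wet: dict[str, bool] = field(default_factory=dict)
    valve_open: bool = True


def isolate_leaking_zones(zones: list[Zone]) -> list[str]:
    """Close the shut-off valve of every zone with a wet sensor.

    Returns the names of the zones isolated by this call.
    """
    isolated = []
    for zone in zones:
        if zone.valve_open and any(zone.sensor_wet.values()):
            zone.valve_open = False
            isolated.append(zone.name)
    return isolated


zones = [
    Zone("hall-A-row-1", {"rack-01": False, "manifold": False}),
    Zone("hall-A-row-2", {"rack-07": True, "manifold": False}),
]
print(isolate_leaking_zones(zones))
```

Only the affected zone is isolated, so coolant keeps flowing to the rest of the hall while the leak is addressed.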
High-Bandwidth, Low-Latency Networking
- 400–800 GbE and InfiniBand: To minimize latency between accelerators (GPUs), high-speed networks are essential.
- Lateral traffic optimization: The network architecture should be optimized for the extreme east-west traffic (GPU-to-GPU, server-to-server) generated during training, rather than only for the north-south (user-to-server) pipeline that feeds the data.
- Designers can choose between InfiniBand (the traditional benchmark for performance and lossless transport) and AI-optimized Ethernet (which offers better scalability and vendor neutrality through standards from bodies such as the Ultra Ethernet Consortium). Interconnects are moving toward 400 G and 800 G fabrics to prevent networking from becoming a bottleneck for compute resources.
Modular and Off-Site Construction
In AI data center construction, modular methods provide scalability and flexibility by shifting from traditional “stick-built” construction to a factory-first product model. This approach uses prefabricated, consistent building blocks – such as power skids, cooling modules, and IT pods – to enable rapid growth without the need for total facility redesigns.
- “Pay-as-you-grow” expansion: Operators can deploy only the capacity needed immediately and add repeatable “blocks” as demand increases. This aligns capital expenditure (CapEx) with actual revenue and prevents “stranded capacity” from overbuilding.
- Rapid deployment timeline: Prefabricated modules are manufactured off-site in controlled environments simultaneously with on-site civil works. This parallel workflow can compress construction schedules from 24–36 months down to 16–20 months.
- Consistent interfaces: Uniform connection points for power, coolant loops, and networking ensure new modules “mate” cleanly with existing ones. This “plug-and-play” capability allows for scaling without extensive reengineering.
- Non-disruptive upgrades: Because modules are self-contained, individual units (like power pods or cooling skids) can be added, replaced, or upgraded to support higher AI densities (e.g., transitioning from air to liquid cooling) without shutting down the entire facility.
- Factory-level precision: Building components in a factory allows for rigorous factory acceptance testing (FAT) before delivery. This reduces on-site commissioning surprises and the need for complex, disruptive field modifications.
- Adaptable configurations: Modular designs can be tailored for specific AI use cases, such as dense training clusters (high power/cooling) or nimble inference nodes at the edge, using the same underlying architecture.
Topics to Consider Moving Forward
This section earmarks space for topics that do not yet warrant priority in the consideration of data center design today but which should be monitored in the coming years. Some of these issues may well be explored more thoroughly in future editions.
Heat Reuse
The viability of heat reuse hinges on location. If a data center is close to the buildings and structures it aims to heat, reuse is more likely to be viable. However, the farther the data center is from those structures, the more of an obstacle transportation becomes, adding layers of logistical challenges. While there is a growing appetite for heat reuse in the U.S., it remains, for the time being, more viable in Europe and Asia because of district heating and data centers’ proximity to urban areas.
On-Site Generation for AI Data Centers
While obstacles for on-site generation currently exist, power demands are such that localized energy production may one day become a necessity. Many on-site generation options come with significant drawbacks. For example, gas-powered turbines are loud and draw negative attention from local communities, while fuel cells are faulted for retaining extreme amounts of heat. Small modular nuclear reactors have a lot of potential but are impeded by regulatory hurdles. Ultimately, the broader challenge will be to figure out how to get more out of the U.S. energy grid.
In an era where power availability is the primary bottleneck for AI and cloud expansion, on-site generation is transitioning from a niche option to a strategic necessity. The industry is rapidly shifting toward primary on-site generation as a core infrastructure strategy. By 2030, 38% of U.S. facilities expect to incorporate primary on-site generation, while 27% of facilities expect to be fully powered by on-site solutions. Globally, the total announced capacity for on-site power has already reached 8.7 GW. This transition is driven by three critical factors: grid bottlenecks, an unprecedented surge in demand, and reliability mandates.
Core Benefits of CCHP for Data Centers
Combined cooling, heating, and power (CCHP) systems are suited to energy-intensive, mission-critical data centers because of their ability to synchronize power and thermal loads.
- Load synchronization: Data centers can use 30% or more of their total electricity for cooling. CCHP addresses this by capturing high-quality waste heat (up to 600°C [1112°F] from turbines) to drive absorption chillers.
- Superior fuel utilization: Traditional power generation wastes over 60% of fuel energy as heat, whereas CCHP systems achieve overall fuel utilization of 75% to 85%.
- Scalability and flexibility: CCHP systems can be deployed in modular “power islands” that expand as the data center grows. This “grow as needed” approach ensures sufficient power supply without overstretching local grid capacity.
- Community thermal integration: Excess heat can be recovered for local district heating networks, providing residential space heating and domestic hot water. This integration can potentially generate annual revenue and significantly reduce payback periods.
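The fuel utilization comparison can be illustrated with a simple energy balance. The sketch below assumes an electrical efficiency of 38% (consistent with "over 60%" of fuel energy lost as heat) and that 68% of the waste heat is recovered; both values are hypothetical inputs chosen for illustration.

```python
def cchp_outputs_mw(
    fuel_input_mw: float, electrical_eff: float, heat_recovery_fraction: float
) -> tuple[float, float]:
    """Return (electricity, recovered heat) in MW for a given fuel input."""
    electricity = fuel_input_mw * electrical_eff
    waste_heat = fuel_input_mw - electricity
    recovered_heat = waste_heat * heat_recovery_fraction
    return electricity, recovered_heat


def fuel_utilization(
    fuel_input_mw: float, electrical_eff: float, heat_recovery_fraction: float
) -> float:
    """Fraction of fuel energy delivered as useful electricity or heat."""
    electricity, recovered = cchp_outputs_mw(
        fuel_input_mw, electrical_eff, heat_recovery_fraction
    )
    return (electricity + recovered) / fuel_input_mw


FUEL_MW = 100.0
ELECTRICAL_EFF = 0.38   # hypothetical
HEAT_RECOVERY = 0.68    # hypothetical share of waste heat captured

power_only = fuel_utilization(FUEL_MW, ELECTRICAL_EFF, 0.0)
with_cchp = fuel_utilization(FUEL_MW, ELECTRICAL_EFF, HEAT_RECOVERY)
print(f"Power only: {power_only:.0%}, with heat recovery: {with_cchp:.0%}")
```

Under these assumptions, heat recovery lifts utilization from 38% to about 80%, which falls within the 75% to 85% range cited above.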
Data centers have multiple on-site generation options for ensuring an uninterrupted electricity supply, including liquid-fuel turbines, engines, and fuel cells. Prior planning and training are required to overcome implementation challenges arising from permitting, higher capital investments, and increased operational complexity.
Conclusion: The Difference Between Necessary and Useful
The industry is sure to change, but that reality doesn’t nullify the pressures that exist today. It is essential that data center designers account for change while meeting today’s demands. To accomplish this, designers should be clear about their objectives, and that clarity hinges on their ability to distinguish between what is necessary and what is useful.
The topic of voltage in a previous section illustrates this dynamic well. As noted earlier, the carbon and resource utilization benefits of shifting toward 800 VDC cannot be denied, but that is not the primary driver. Mitigating packet loss is. That’s not to say the sustainability benefits are not welcome. Often, what is useful goes hand in hand with what is necessary, but understanding the difference is key to setting one’s priorities.
Power usage effectiveness (PUE) is another good example of this phenomenon. Although PUE has become a standard metric for assessing energy efficiency, it can give the wrong impression. For example, if a data center consolidates workloads onto fewer servers but retains its existing infrastructure, the PUE may rise, suggesting a drop in energy efficiency, even though total power consumption has fallen and IT efficiency has improved. That’s not to say PUE is not a helpful metric. It just doesn’t paint a complete picture, which is why energy efficiency should not be the first criterion when designing a data center, and why PUE is not the sole determinant of data center efficiency.
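The consolidation effect is easy to see with numbers. In this sketch, PUE is total facility power divided by IT power; the load figures are hypothetical and assume that cooling and power-delivery overhead is largely fixed when servers are removed.

```python
def pue(it_power_kw: float, overhead_kw: float) -> float:
    """PUE = total facility power / IT power."""
    return (it_power_kw + overhead_kw) / it_power_kw


# Hypothetical facility before and after workload consolidation.
it_before, overhead_before = 1000.0, 400.0
it_after, overhead_after = 700.0, 380.0   # overhead barely changes

pue_before = pue(it_before, overhead_before)
pue_after = pue(it_after, overhead_after)
total_before = it_before + overhead_before
total_after = it_after + overhead_after

print(f"Before: {total_before:.0f} kW total, PUE {pue_before:.2f}")
print(f"After:  {total_after:.0f} kW total, PUE {pue_after:.2f}")
```

Total power falls from 1,400 kW to 1,080 kW, yet PUE rises from 1.40 to about 1.54, because the fixed overhead is now divided by a smaller IT load.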
What is necessary today may not be what is necessary tomorrow, just as what is useful today may not be what is useful tomorrow. Understanding the difference allows us to set priorities today. Keeping track of how they evolve enables us to adapt. That the industry will change is a guarantee. How it will change and at what rate remains to be seen. But our ability to prioritize and re-prioritize will keep us effective and agile – both qualities that will serve us for years to come.
Recommended Practices