In this article, MCIM CEO Mike Parks explores how AI workloads are pressuring legacy data center operating models and why structural operational gaps become harder to ignore as infrastructure density increases.
AI isn’t actually creating new operational risk in data centers. It’s just exposing the weaknesses that were already there.
For years, operators benefited from redundancy models that allowed imperfection: N+1, 2N, and slack across systems. Those buffers made it possible to tolerate siloed tools, reactive maintenance, fragmented teams, and inconsistent governance. The operating model wasn’t efficient, but the infrastructure absorbed the consequences.
AI changes the math.
When revenue depends on GPUs running continuously at full load, and racks exceed 100 kW with liquid cooling architectures, there is no operational cushion. Voltages are higher. Cooling is more complex. Customer expectations are real-time. The margin for delay disappears.
The workloads didn’t create the risk. They removed the buffer that used to contain it.
The Fragility Was Already There
Before AI accelerated demand, two structural issues were already present.
- First, the industry faces a deep talent shortage. The workforce that built today’s infrastructure is aging out just as global capacity demand is accelerating.
- Second, most operators still run fragmented operating models: IT in one system, OT in another, building telemetry somewhere else entirely. Different teams, different workflows, different data languages, and no common framework connecting them.
That structure was survivable when redundancy covered mistakes. In high-density AI environments, it isn’t.
IT and OT convergence isn’t a theory anymore. When you’re running 480V, and in some environments even higher voltages, directly to GPU racks, you don’t get to pretend those teams operate independently. This is electrical infrastructure with real arc flash risk, real lockout/tagout requirements, and real life-safety implications. When power, cooling, and compute are tightly coupled, procedural mistakes don’t stay contained within one department. A missed step in change control or a poorly coordinated maintenance activity can cascade across systems in seconds.
The market is forcing IT/OT convergence whether organizations are ready or not, because physics, and safety, don’t respect org charts.
Where Leadership Is Underestimating Risk
There is massive capital flowing into AI infrastructure. CEOs and CFOs are focused on land acquisition, capital deployment, and return on invested capital. That's their job. But expansion without operating model modernization creates risk that doesn't show up until it's too late.
When outages hit the news, no one asks about transformer lead times. They ask why the operating model didn’t prevent it. Durable advantage in this era will come from how work is governed, executed, validated, and audited across the portfolio. Upgrading hardware without upgrading how you run it is incomplete.
What Will Separate the Long-Term Winners
We’re in a period of abundant capital and historically low vacancy. That won’t last forever.
When pricing pressure returns, the operators who survive will be the ones who tightened their operating models before they were forced to. That means:
- Treating change control as a risk discipline, not a paperwork exercise. In high-density environments, an uncoordinated maintenance window or an incomplete lockout/tagout is a cascading failure waiting to happen.
- Closing the loop between telemetry and execution. If a technician closes a work order, the system should confirm that temperatures, pressures, and voltages have actually returned to expected parameters. If they haven’t, the work isn’t done.
- Eliminating fragmented workflows across IT, OT, and IoT to produce a unified operational picture. When alarms, work orders, and performance data live in separate systems owned by separate teams, leadership has no reliable way to validate whether risk is increasing or decreasing across the portfolio.
- Knowing maintenance cost per kW and EBITDA margin at the asset level — not as an annual finance exercise, but as an operational benchmark.
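The "close the loop" discipline above can be sketched as a simple gate in the work-order workflow: a ticket may only close once post-work telemetry is back inside its expected bands. This is a minimal illustration, not any particular DCIM product's API; the parameter names and tolerance bands are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Reading:
    """One post-work telemetry reading with its expected operating band."""
    name: str           # e.g. "supply_temp_c", "loop_pressure_kpa", "bus_voltage_v"
    value: float
    expected_min: float
    expected_max: float

def can_close_work_order(readings: list[Reading]) -> tuple[bool, list[str]]:
    """Return (ok, out_of_range) for a work-order close attempt.

    The order may close only when every reading has returned to its
    expected band; otherwise the out-of-range parameters are reported
    so the work is sent back rather than marked done.
    """
    out_of_range = [
        r.name for r in readings
        if not (r.expected_min <= r.value <= r.expected_max)
    ]
    return (len(out_of_range) == 0, out_of_range)

# Example: verifying a liquid-cooling maintenance window before close-out.
post_work = [
    Reading("supply_temp_c", 24.5, 20.0, 27.0),
    Reading("loop_pressure_kpa", 310.0, 250.0, 350.0),
    Reading("bus_voltage_v", 478.0, 456.0, 504.0),  # 480 V nominal, +/-5%
]
ok, failures = can_close_work_order(post_work)
print(ok, failures)  # True []
```

The point of the sketch is the design choice: closure is validated by the system against live telemetry, not attested by the technician, so an incomplete repair surfaces immediately instead of in the next incident review.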
This isn’t a hardware arms race. Capital can buy denser racks, higher voltage systems, and more GPUs, but it can’t compensate for fragmented workflows or reactive maintenance. As infrastructure scales, those weaknesses scale with it.
And the operators who modernize their operating models first, unifying how work is governed, executed, and validated across IT and OT, will be the ones who define the next generation of resilient infrastructure.