The data center operating model that worked for decades was designed around a boundary that no longer exists. This post, the first in a three-part series, explores why IT and OT are converging and how operators are adapting to interconnected systems.
The traditional data center operating model was designed around a clear division of responsibility. Infrastructure teams were accountable for delivering stable power, temperature, and humidity, while IT teams focused on the performance of the equipment running inside that environment. That separation aligned with how facilities were built and how SLAs were structured, so it held up for years.
AI and HPC environments are changing the physical reality that model depends on:
- Compute is no longer isolated from infrastructure in the way it once was.
- The systems that support GPUs now extend directly into the rack, bringing cooling, power distribution, and monitoring much closer to the compute layer.
- Liquid cooling systems, pumps, and chemistry controls are now embedded alongside high-density compute.
- Electrical architecture is shifting from centralized systems to distributed components at the rack level.
That shift pulls operational complexity into a much tighter footprint. The boundary between IT and infrastructure used to sit at the edge of the white space. In AI environments, that boundary has moved inside the rack itself. Each rack now contains elements of both compute and critical infrastructure, which means operational responsibility can no longer be divided along traditional lines.
AI Workloads Are Tightening Operational Tolerances
The infrastructure changes are only part of the story. The way these environments generate value also affects how they need to be operated. AI workloads are designed to run continuously at high utilization, and the economics depend on keeping that utilization as close to maximum as possible. In earlier models, redundancy absorbed a significant amount of operational variability. Operators could rely on spare capacity and layered failover strategies to protect uptime. In AI environments, those buffers are smaller, and performance expectations are higher.
Small inefficiencies become visible much faster in this context. A cooling imbalance, a power quality issue, or a misconfigured system may not create an immediate outage, but it can reduce compute performance in a way that directly impacts revenue.
In these environments, failure doesn’t always present as downtime. Systems can remain online while operating below their intended capacity due to power or cooling constraints. These conditions, often described as brownouts, reduce output without triggering a full outage. Identifying and resolving them requires visibility across both IT and infrastructure systems, since the root cause often sits outside the compute layer.
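As a rough illustration of why brownout detection needs both views, the sketch below classifies a single rack sample using one compute metric and two infrastructure signals. The `RackSample` fields, the 0.9 throughput ratio, and the 45 °C inlet limit are hypothetical values chosen for the example, not figures from any vendor or standard.

```python
from dataclasses import dataclass


@dataclass
class RackSample:
    """One point-in-time reading for a rack (all fields are illustrative)."""
    rack_id: str
    online: bool
    throughput_ratio: float  # delivered / expected compute throughput, 0.0 to 1.0
    inlet_temp_c: float      # cooling inlet temperature in Celsius
    power_capped: bool       # True if a power limit is throttling the rack


def classify(sample: RackSample, min_ratio: float = 0.9,
             max_inlet_c: float = 45.0) -> str:
    """Label a sample as an outage, healthy, a brownout, or unexplained degradation.

    Thresholds are assumptions for this sketch and would be site-specific.
    """
    if not sample.online:
        return "outage"
    if sample.throughput_ratio >= min_ratio:
        return "healthy"

    # The rack is online but underperforming: look for an infrastructure cause.
    causes = []
    if sample.power_capped:
        causes.append("power")
    if sample.inlet_temp_c > max_inlet_c:
        causes.append("cooling")

    if causes:
        return "brownout:" + "+".join(causes)
    return "degraded:unknown"
```

A rack that is online at 72% of expected throughput with a 48 °C inlet would be labeled `brownout:cooling` here, a condition that an uptime check alone would report as healthy.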
Siloed Systems Limit How Quickly Operators Can Respond
Most organizations are still structured around separate systems and teams: IT systems manage tickets and network performance, infrastructure systems manage power and cooling, and building management and telemetry systems generate large volumes of sensor data. Each of these systems captures part of the operational picture, but they’re often managed independently.
When an issue crosses those boundaries, the response becomes slower and less precise. A temperature anomaly may appear in one system while degraded performance appears in another, and neither provides enough context on its own. Without a way to connect those signals, operators spend more time diagnosing the problem and less time resolving it.
In tightly coupled environments, that diagnostic delay translates directly into lost performance.
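One simple way to connect those signals is to pair events from separate systems when they affect the same rack within a short time window. The sketch below assumes both systems can export events with a rack identifier and a timestamp. The `Event` shape, the source names, and the five-minute window are hypothetical, and real deployments would need to reconcile differing asset naming and clock drift first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A normalized event from any monitoring system (illustrative schema)."""
    source: str       # e.g. "bms" or "it_monitoring" (hypothetical names)
    rack_id: str
    timestamp: float  # seconds since epoch
    message: str


def correlate(infra_events, it_events, window_s: float = 300.0):
    """Pair infrastructure and IT events on the same rack within window_s seconds."""
    pairs = []
    for infra in infra_events:
        for it in it_events:
            same_rack = infra.rack_id == it.rack_id
            close_in_time = abs(infra.timestamp - it.timestamp) <= window_s
            if same_rack and close_in_time:
                pairs.append((infra, it))
    return pairs


bms = [
    Event("bms", "rack-12", 1000.0, "supply temperature high"),
    Event("bms", "rack-07", 1000.0, "pump speed deviation"),
]
it = [
    Event("it_monitoring", "rack-12", 1120.0, "GPU clocks throttled"),
    Event("it_monitoring", "rack-07", 5000.0, "job throughput low"),
]
matches = correlate(bms, it)  # only the rack-12 pair falls inside the window
```

The nested loop is fine for a sketch. At production telemetry volumes, indexing events by rack and sorting by time would be the more likely design.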
The Risk Profile Is Changing Along With the Architecture
The move toward higher density and liquid cooling introduces additional operational considerations. Cooling systems now involve fluid moving in close proximity to high-voltage equipment, which increases the importance of monitoring, maintenance, and rapid response to anomalies.
This changes the nature of risk inside the facility. A leak is no longer just a maintenance issue. It introduces a safety concern with immediate operational consequences, particularly in environments where liquid cooling systems and high-voltage electrical components operate within the same rack. Distributed power systems bring additional components closer to compute, which changes how failures propagate and how they need to be managed.
These systems function as interconnected parts of a single operating environment. Issues in one area can influence performance in another, which means operators need to understand those relationships and respond accordingly. Managing that environment requires coordination across domains that were previously handled separately.
Convergence Is an Operational Shift, Not Just a Technical One
The practical impact of IT/OT convergence shows up in how work is executed:
- Operators need to move beyond completing tasks within a single system and focus on understanding outcomes across systems.
- A maintenance activity on a cooling component needs to be evaluated based on its effect on compute performance.
- A resolved alarm needs to be validated against real-time operating conditions, not just closed in a ticketing system.
This requires workflows that connect data, actions, and results in a consistent way. Organizations that build those connections are better positioned to manage performance at scale, particularly as portfolios grow and complexity increases.
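The idea of validating a resolved alarm against live conditions can be sketched as a check that only treats an alarm as resolved when the ticket is closed and the most recent readings are back within the limit. The `Alarm` fields, the metric name, and the three-sample minimum are assumptions for illustration, and the check covers only high-limit alarms.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    """A high-limit alarm tied to one metric (illustrative schema)."""
    alarm_id: str
    metric: str
    threshold: float      # readings above this value are out of range
    ticket_closed: bool   # state reported by the ticketing system


def is_verified_resolved(alarm: Alarm, recent_readings, min_samples: int = 3) -> bool:
    """Return True only if the ticket is closed AND live telemetry agrees.

    recent_readings is ordered oldest to newest. The last min_samples
    readings must all be at or below the alarm threshold.
    """
    if not alarm.ticket_closed:
        return False
    if len(recent_readings) < min_samples:
        return False  # not enough evidence to confirm the fix
    return all(r <= alarm.threshold for r in recent_readings[-min_samples:])


alarm = Alarm("A-1", "coolant_supply_temp_c", 40.0, ticket_closed=True)
confirmed = is_verified_resolved(alarm, [44.0, 39.5, 38.9, 38.7])
still_hot = is_verified_resolved(alarm, [38.0, 39.0, 41.2])
```

In this example `confirmed` is `True` and `still_hot` is `False`: the second case has a closed ticket, but the latest reading is still above the limit, so the alarm would be reopened rather than counted as resolved.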
Why This Shift Is Becoming Standard Practice
The pace of investment in AI infrastructure is accelerating, and operators are under pressure to deliver consistent performance while managing cost and risk. That pressure exposes limitations in fragmented operating models.
A common industry view is that the challenge isn’t the technology itself but the way operations are structured around it. Convergence provides a path to address that challenge by aligning teams, systems, and workflows around a shared operational view. The ability to adapt to this shift has become a defining factor for operators: operating models that don’t evolve introduce both operational and financial risk.
What Comes Next
Understanding why convergence is happening is only the starting point. The next step is translating that understanding into changes that work in live environments across teams, systems, and processes. The next blog in our IT/OT convergence series will focus on how operators can do exactly that.