SRE正在进入范式转变

来源: Thoughtworks Engineering

原文

SRE is entering a paradigm shift

Published: March 03, 2026 

Since the mid-2010s, SRE has been the foundation of modern software operations. It made large-scale distributed systems controllable.

Today, systems are no longer only distributed; they’re becoming cognitive and partially autonomous. Decision-making is moving into the system, while AI is simultaneously augmenting human cognition and code generation.

This raises a first-principles question: can our existing SRE model still keep systems under control, or, just as distributed systems once required the invention of SRE, are we entering a new paradigm?

 

The first principle of SRE: Control 

As a control system, SRE requires software systems to be observable, modelable, predictable and intervenable. Without these properties, no system can be governed at scale.

The three pillars of logs, metrics and traces are not debugging tools but modeling primitives:

  • Logs model the event timeline.

  • Metrics model the state space.

  • Traces model the causal topology.

SRE as a control system
Diagram 1: SRE as a control system

Together, they form the minimal system model required for control (Diagram 1.). Without this model, large-scale distributed systems are not operable. With it, they become engineering objects.

 

When SRE was born: A cognition crisis

SRE emerged in response to engineering failure, with humans losing the ability to control their systems. In the mid-2010s, software architecture crossed a structural complexity boundary. At the same time, the cost of failure increased dramatically as digital business exploded.

Back then, SRE wasn’t born as practive for optimization but instead one built to recover control. By modeling events, states and causality, SRE made distributed systems controllable again. That’s why SRE became an institutional foundation of system control, not a tool.

The hidden assumptions behind the three pillars

The three pillars of SRE — logs, metrics, and traces — are often treated as universal observability primitives. In reality, they rest on five control-theory premises:

Under these assumptions, distributed systems, though complex, remain controllable. Logs reconstruct event sequences; metrics approximate system state; traces expose causal topology. Together, they form a sufficient system model.

This is why SRE worked as a reliability discipline for deterministic execution systems, assuming behavior is the result of code and topology. However, fifteen years later, these assumptions may not be guaranteed.

 

The five forces shaping the current context

Fifteen years after the birth of SRE, we’re entering a new regime of complexity. This is one defined not only by scale, but by a fundamental change in what systems essentially are. There are five forces that define this new context:

This context mirrors the moment when distributed systems first broke human control capacity fifteen years ago. Just as SRE emerged then, a new control paradigm is now required.

 

An inevitable SRE paradigm shift

Control theory is simple: when system complexity exceeds an organization’s ability to observe, model, predict and intervene, the existing control system collapses.

At that point, failures become hard to replay, behavior becomes hard to predict, interventions become risky and responsibility becomes blurred. These are all signals that controllability has been lost.

This is exactly what enterprises experienced back in 2010s before the rise of SRE, and it’s what they’re beginning to experience again as intelligence integrates into enterprise IT systems at scale.

Before failures hit enterprise IT operations, IT leaders must confront the emergence of a structurally inevitable paradigm shift in SRE.

 

The new SRE paradigm: A new control architecture

Enterprises are entering a new operating context where system complexity continues to grow, non-deterministic behavior is becoming native, AI-generated code is entering production at scale, and autonomous intelligence is moving directly into business workflows.

SRE as a control must evolve. That’s why we want to introduce a new control architecture:

Diagram 2. SRE as a control system - the New Paradigm.
Diagram 2. SRE as a control system - the New Paradigm.

In the new model, we use LLMs to compensate for human cognitive limits by introducing AI agents into the controller, sensors, and actuators.

  • As controllers, AI agents work with engineers to explore problems, reason about incidents and propose actions. 

  • As sensors, they continuously retrieve and interpret telemetry to surface patterns and hypotheses. 

  • As actuators, they execute changes and remediation steps under human guidance.

SRE is no longer a human-only activity. It is AI-augmented (with controls).

A new context component is introduced as a first-class part of the system. This layer captures intent, policies, system understanding, historical knowledge and decision rationale, and is continuously enriched by AI agents. It turns raw telemetry into meaning and provides the substrate on which decisions are formed. Without this layer, systems can be observed but not properly understood.

Governance is extended beyond system reliability to include AI trust controls. Traditional SLOs and error budgets remain for execution, but they’re complemented by controls over autonomy, decision quality, drift, policy compliance and auditability. The goal is no longer only to keep services running, but to ensure the intelligent agents augmenting SRE behave correctly and can be trusted.

Together, these changes transform SRE from a discipline that controls execution into one that is intelligent by itself and governs the new state of complexity.

 

Conclusion

SRE was created in the 2010s to make distributed systems controllable. It succeeded by turning complexity into an engineering problem through modeling, observability, and disciplined control.

Today, systems are no longer only distributed. They are becoming cognitive, autonomous and increasingly self-directed. Decision-making is moving into software, while AI is reshaping how systems are built, changed and operated. The original control model is no longer sufficient.

We believe building intelligence into the new control system is the only way to govern intelligence itself. It’s no longer possible to rely solely on human cognition and the classical SRE construct.