Why MCP Is Key to AI-Driven SRE

Source: Thoughtworks Engineering

Published: May 23, 2025 

The Model Context Protocol (MCP) introduces a semantic context layer that allows an AI agent to seamlessly access meaningful context (tools, memory and state), driving more specific and responsible AI output.


Unlike traditional APIs focused on function execution, MCP emphasizes context-sharing, enabling more accurate, grounded and cooperative AI behavior across tools — with significantly reduced friction in interoperability.

MCP also makes it easier to integrate and implement retrieval-augmented generation (RAG), helping AI agents to fetch details or resources dynamically as needed. This mitigates the limitations of LLMs’ context windows and allows agents to remain focused on the task at hand while retrieving relevant data on demand.
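
To make on-demand retrieval concrete, here is a minimal sketch of the requests an agent might send when it needs more context. MCP messages are JSON-RPC 2.0, and `resources/read` and `tools/call` are method names from the MCP specification. The resource URI, the tool name `query_logs` and its arguments are hypothetical, chosen only for illustration, and no transport or server is modelled here.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids should be unique within a session


def jsonrpc_request(method: str, params: dict) -> dict:
    """Build a JSON-RPC 2.0 request envelope, the wire format MCP uses."""
    return {"jsonrpc": "2.0", "id": next(_ids), "method": method, "params": params}


def read_resource(uri: str) -> dict:
    """Request the contents of one resource, identified by its URI."""
    return jsonrpc_request("resources/read", {"uri": uri})


def call_tool(name: str, arguments: dict) -> dict:
    """Request that a named tool be run with structured arguments."""
    return jsonrpc_request("tools/call", {"name": name, "arguments": arguments})


# Hypothetical URI and tool name, for illustration only.
slo_request = read_resource("slo://checkout-service/availability")
logs_request = call_tool("query_logs", {"service": "checkout-service", "minutes": 15})

print(json.dumps(slo_request, indent=2))
print(json.dumps(logs_request, indent=2))
```

The agent fetches only what the current step needs (one SLO, fifteen minutes of logs) instead of loading everything into the prompt up front, which is how this pattern eases pressure on the context window.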
 

MCP vs APIs: Context vs. function

Deciding whether to implement MCP starts with a fundamental architectural question:

“Does context matter here?”

Traditional APIs are great for executing repeatable functionalities. MCP, however, is for systems that learn, reason and collaborate — systems where understanding what just happened matters as much as what to do next.

Take AI-assisted software development, for example. Coding agents need to understand business logic, architectural constraints, tech debt and user feedback. This has created significant demand for standardized context-as-a-service; MCP is emerging as the preferred approach to supplying that context across tools like IDEs, AI assistants and coding models.

The table below offers a comparison of APIs and MCP:

SRE: A textbook use case for MCP

Site Reliability Engineering (SRE) is one of the clearest use cases for MCP. SRE workflows require:

  • Deep situational awareness.

  • Multi-agent collaboration.

  • Real-time decision-making across disparate systems.
     

As AI becomes more embedded in reliability practices, leaders like Rootly and Chronosphere are already building MCP-compatible capabilities into their incident and observability stacks.

The comparison below contrasts SRE workflows with and without MCP:

Key MCP-powered use cases in SRE

MCP is enabling a new level of intelligence and coordination across the SRE ecosystem. Here are three core use cases where it's driving real impact:
 

1. Context-aware observability engineering

AI agents can help correlate alerts, detect anomaly clusters and align issues with service topologies.

  • The context here includes SLOs (service-level objectives), historical trends, detailed telemetry and log data, ownership, past alerts and incidents.

  • The MCP host will be an observability platform, such as Chronosphere.

  • MCP clients include things like RCA agents, alert deduplication bots and FinOps optimizers.
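
As a rough illustration of aligning alerts with a service topology, the sketch below clusters alerts that fire close together in time on services that are the same or directly dependent on one another. It assumes the topology and alerts have already been retrieved as context. The `Alert` type, the service names, the dependency map and the five-minute window are all hypothetical, and this is one simple heuristic rather than how any particular observability platform does it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    id: str
    service: str
    timestamp: int  # seconds since epoch


def correlate(alerts, depends_on, window_s=300):
    """Cluster alerts that are linked in the topology and close in time."""
    parent = {a.id: a.id for a in alerts}  # union-find forest keyed by alert id

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def linked(a, b):
        # Same service, or a direct dependency in either direction.
        return (
            a.service == b.service
            or b.service in depends_on.get(a.service, ())
            or a.service in depends_on.get(b.service, ())
        )

    for i, a in enumerate(alerts):
        for b in alerts[i + 1:]:
            if abs(a.timestamp - b.timestamp) <= window_s and linked(a, b):
                parent[find(a.id)] = find(b.id)

    clusters = defaultdict(list)
    for a in alerts:
        clusters[find(a.id)].append(a.id)
    return sorted(sorted(ids) for ids in clusters.values())


# Hypothetical topology: each service maps to the services it depends on.
topology = {"checkout": {"payments", "cart"}, "payments": {"postgres"}}
alerts = [
    Alert("a1", "checkout", 1000),
    Alert("a2", "payments", 1060),
    Alert("a3", "postgres", 1100),
    Alert("a4", "search", 1050),   # unrelated service
    Alert("a5", "cart", 5000),     # related service, but outside the window
]

clusters = correlate(alerts, topology)
print(clusters)  # [['a1', 'a2', 'a3'], ['a4'], ['a5']]
```

The three alerts along the checkout, payments and postgres chain collapse into one cluster, which is the kind of grouping an alert deduplication bot or RCA agent could start from.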
     

2. AI-assisted incident investigation and triage

Agents can carry forward rich semantic context as they investigate causes, propose fixes or escalate incidents.
 

  • The context here includes alert state, system health, past incidents, prior remediations, who resolved the incident and real-time log context.

  • The MCP host could be an incident management platform (like Rootly MCP).

  • MCP clients here might be Slack bots, summarization agents, ticket generators, RCA agents and incident investigation and resolution agents.
     

(If you want to see how this works in practice, check out the Rootly MCP Server Demo.) 
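
One way an investigation agent might put past incidents and prior remediations to use is to rank earlier incidents by how much their signals overlap with the current one. The sketch below uses Jaccard similarity over sets of signal tags. The `PastIncident` type, the incident ids, the tags, the names and the 0.3 threshold are all hypothetical, and this is not a description of Rootly's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PastIncident:
    id: str
    signals: frozenset
    remediation: str
    resolved_by: str


def jaccard(a, b):
    """Share of signals the two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_remediations(current_signals, history, min_score=0.3):
    """Return past incidents similar enough to the current one, best first."""
    scored = [(jaccard(current_signals, inc.signals), inc) for inc in history]
    ranked = sorted(
        (pair for pair in scored if pair[0] >= min_score),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [
        (inc.id, round(score, 2), inc.remediation, inc.resolved_by)
        for score, inc in ranked
    ]


# Hypothetical incident history, as it might be fetched from an incident platform.
history = [
    PastIncident("INC-101", frozenset({"checkout", "5xx", "db-timeout"}),
                 "Fail over the primary database", "alice"),
    PastIncident("INC-087", frozenset({"search", "latency"}),
                 "Scale out the search cluster", "bob"),
    PastIncident("INC-120", frozenset({"checkout", "5xx", "bad-deploy"}),
                 "Roll back the latest release", "carol"),
]
current = frozenset({"checkout", "5xx", "db-timeout", "latency"})

suggestions = suggest_remediations(current, history)
for incident_id, score, remediation, resolved_by in suggestions:
    print(f"{incident_id} ({score}): {remediation} [resolved by {resolved_by}]")
```

Because each suggestion carries the remediation and the person who resolved it, the agent can propose a fix and name someone to escalate to in the same step.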
 

3. Semantic handoffs across systems

MCP ensures AI agents working across Slack, Jira, Confluence or runbooks can hand off not just statuses, but shared understanding too.

  • The context here includes things like the incident timeline, related tickets and issues, standard operating procedure (SOP), recent updates, feature details and past response actions.

  • The MCP host could be a knowledge management platform (e.g. MCP Atlassian).

  • The MCP clients include workflow orchestrators, playbook automation runners, content creators and human-in-the-loop assist tools.
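
To show the difference between handing off a status and handing off shared understanding, here is a minimal sketch of a handoff payload that carries the timeline, related tickets, SOP and past actions together. The `HandoffContext` shape, its field names and the sample values are hypothetical. MCP does not define this structure; it is simply one form such context could take when exposed to the next agent.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class HandoffContext:
    incident_id: str
    summary: str
    timeline: list = field(default_factory=list)
    related_tickets: list = field(default_factory=list)
    sop: str = ""
    actions_taken: list = field(default_factory=list)

    def record(self, actor: str, event: str) -> None:
        """Append an event so the next agent sees who did what."""
        self.timeline.append({"actor": actor, "event": event})

    def to_json(self) -> str:
        """Serialize the whole context for transfer to another agent."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "HandoffContext":
        """Rebuild the context on the receiving side."""
        return cls(**json.loads(payload))


# Hypothetical incident, ticket key and runbook path.
ctx = HandoffContext(
    incident_id="INC-120",
    summary="Checkout 5xx spike after deploy",
    related_tickets=["OPS-42"],
    sop="runbooks/checkout-rollback",
)
ctx.record("triage-agent", "Correlated 5xx spike with latest release")
ctx.actions_taken.append("Rolled back checkout release")

payload = ctx.to_json()                      # what the sending agent hands over
received = HandoffContext.from_json(payload)  # what the receiving agent rebuilds
received.record("ticket-agent", "Opened follow-up ticket")

print(len(received.timeline))
```

The receiving agent starts with the full history and appends to it, so a human joining later can read one continuous timeline instead of piecing it together from several tools.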
     

Why MCP matters for SRE and platform leaders

If you lead SRE, platform engineering or intelligent operations, it's vital to understand that MCP isn't about replacing APIs: it’s about augmenting them. MCP introduces a dynamic context layer that lets AI agents act with awareness, use the right tools, collaborate with memory, generate more accurate outputs and evolve their behavior over time.

It’s a shift from “API calls” to “contextual reasoning.” And that shift is foundational to building autonomous, intelligent reliability systems.