HolmesGPT: An Agentic Troubleshooting Tool for the Cloud-Native Era

Source: CNCF

Original article

If you’ve ever debugged a production incident, you know that the hardest part often isn’t the fix; it’s knowing where to begin. Most on-call engineers end up spending hours piecing together clues, fighting time pressure, and trying to make sense of scattered data. You’ve probably run into one or more of these challenges:

  • Unwritten knowledge and missing context:
    You’re pulled into an outage for a service you barely know. The original owners have changed teams, the documentation is half-written, and the “runbook” is either stale or missing altogether. You spend the first 30 minutes trying to find someone who’s seen this issue before — and if you’re unlucky, this incident is a new one. 
  • Tool overload and context switching:
    Your screen looks like an air traffic control dashboard. You’re running monitoring queries, flipping between Grafana and Application Insights, checking container logs, and scrolling through traces — all while someone’s asking for an ETA in the incident channel. Correlating data across tools is manual, slow, and mentally exhausting. 
  • Overwhelming complexity and knowledge gaps:
    Modern cloud-native systems like Kubernetes are powerful, but they’ve made troubleshooting far more complex. Every layer — nodes, pods, controllers, APIs, networking, autoscalers – introduces its own failure modes. To diagnose effectively, you need deep expertise across multiple domains, something even seasoned engineers can’t always keep up with. 

These challenges call for a solution that can look across signals, recall patterns from past incidents, and guide you toward the most likely cause.

This is where HolmesGPT, a CNCF Sandbox project, could help. 

 
HolmesGPT was accepted as a CNCF Sandbox project in October 2025. It’s built to simplify the chaos of production debugging – bringing together logs, metrics, and traces from different sources, reasoning over them, and surfacing clear, data-backed insights in plain language. 

What is HolmesGPT?

HolmesGPT is an open-source AI troubleshooting agent built for Kubernetes and cloud-native environments. It combines observability telemetry, LLM reasoning, and structured runbooks to accelerate root cause analysis and suggest next actions. 

Unlike static dashboards or chatbots, HolmesGPT is agentic: it actively decides what data to fetch, runs targeted queries, and iteratively refines its hypotheses – all while staying within your environment. 

Key benefits:

  • AI-native control loop: HolmesGPT plans an agentic task list and works through it step by step, refining its hypotheses as evidence comes in
  • Open architecture: Every integration and toolset is open and extensible, and it works with existing runbooks and MCP servers
  • Data privacy: Models can run locally, inside your cluster, or in the cloud
  • Community-driven: Designed around CNCF principles of openness, interoperability, and transparency

How it works 

When you run:

holmes ask "Why is my pod in CrashLoopBackOff state?"

HolmesGPT: 

  1. Understands intent → it recognizes you want to diagnose a pod restart issue 
  2. Creates a task list → breaks down the problem into smaller chunks and executes each of them separately  
  3. Queries data sources → runs Prometheus queries, collects Kubernetes events and logs, inspects pod specs
  4. Correlates context → detects that a recent deployment updated the image   
  5. Explains and suggests fixes → returns a natural language diagnosis and remediation steps. 
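The loop above can be sketched in a few lines of Python. This is a minimal illustration of the agentic pattern only, not HolmesGPT's actual implementation: the tool functions return canned data, and the correlation rule is a hard-coded assumption standing in for LLM reasoning.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    source: str
    detail: str

def get_pod_events(pod: str) -> Finding:
    # Stand-in for collecting Kubernetes events; returns canned data.
    return Finding("events", f"Back-off restarting failed container in {pod}")

def get_recent_deploys(pod: str) -> Finding:
    # Stand-in for querying deployment history.
    return Finding("deploys", f"image for {pod} updated 10 minutes ago")

TOOLS = {"events": get_pod_events, "deploys": get_recent_deploys}

def diagnose(pod: str) -> str:
    """Plan a task list, run each tool, then correlate the findings."""
    tasks = ["events", "deploys"]                 # 2. create a task list
    findings = [TOOLS[t](pod) for t in tasks]     # 3. query data sources
    # 4. correlate context: a restart loop right after an image change
    if (any("Back-off" in f.detail for f in findings)
            and any("updated" in f.detail for f in findings)):
        # 5. explain and suggest a fix
        return (f"{pod}: restart loop began after a recent image update; "
                "consider rolling back the deployment.")
    return f"{pod}: no obvious correlation found."

print(diagnose("user-profile-import"))
```

In the real agent, an LLM decides which tools to call and interprets their output; the fixed task list here just makes the control flow visible.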

Here’s a simplified overview of the architecture:

HolmesGPT architecture

Extensible by design 

HolmesGPT’s architecture allows contributors to add new components: 

  • Toolsets: Build custom commands for internal observability pipelines or expose existing tools through a Model Context Protocol (MCP) server.
  • Evals: Add custom evals to benchmark model performance, cost, and latency.
  • Runbooks: Codify best practices (e.g., “diagnose DNS failures” or “debug PVC provisioning”). 

Example of a simple custom tool: 

holmes:
  toolsets:
    kubernetes/pod_status:
      description: "Check the status of a Kubernetes pod."
      tools:
        - name: "get_pod"
          description: "Fetch pod details from a namespace."
          command: "kubectl get pod {{ pod }} -n {{ namespace }}"
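For illustration, here is one way the `{{ pod }}` and `{{ namespace }}` placeholders in a tool's command could be rendered before execution. The `render` function below is a hypothetical sketch, not HolmesGPT's own templating code:

```python
import re

def render(command: str, params: dict) -> str:
    # Replace each "{{ name }}" placeholder with its value from params.
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: params[m.group(1)], command)

cmd = render("kubectl get pod {{ pod }} -n {{ namespace }}",
             {"pod": "user-profile-import", "namespace": "default"})
print(cmd)  # kubectl get pod user-profile-import -n default
```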

Getting started

  1. Install HolmesGPT

There are several ways to install HolmesGPT; one of the easiest is through Homebrew:

brew tap robusta-dev/homebrew-holmesgpt
brew install holmesgpt

The detailed installation guide has instructions for Helm, the CLI, and the UI.

  2. Set up the LLM (any OpenAI-compatible LLM) by setting the API key

In most cases, this means setting the appropriate environment variable based on the LLM provider.
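For example, for two common providers this could look like the following (the key values are placeholders; check your provider's documentation for the exact variable name):

```shell
# OpenAI-compatible endpoint
export OPENAI_API_KEY="sk-..."

# Anthropic
export ANTHROPIC_API_KEY="sk-ant-..."
```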

  3. Run it locally

holmes ask "what is wrong with the user-profile-import pod?" --model="anthropic/claude-sonnet-4-5"

  4. Explore other features

How to get involved 

HolmesGPT is entirely community-driven and welcomes all forms of contribution: 

  • Integrations: Add new toolsets for your observability tools or CI/CD pipelines.
  • Runbooks: Encode operational expertise for others to reuse.
  • Evaluation: Help build benchmarks for AI reasoning accuracy and observability insights.
  • Docs and tutorials: Improve onboarding, create demos, or contribute walkthroughs.
  • Community: Join discussions around governance and CNCF Sandbox progression.

All contributions follow the CNCF Code of Conduct.

Further Resources