HolmesGPT: An agentic troubleshooting tool built for the cloud-native era
Source: CNCF
If you've ever debugged a production incident, you know that the hardest part often isn't the fix, it's finding where to begin. Most on-call engineers end up spending hours piecing together clues, fighting time pressure, and trying to make sense of scattered data. You've probably run into one or more of these challenges:
- Unwritten knowledge and missing context:
You're pulled into an outage for a service you barely know. The original owners have changed teams, the documentation is half-written, and the "runbook" is either stale or missing altogether. You spend the first 30 minutes trying to find someone who's seen this issue before, and if you're unlucky, this incident is a new one.
- Tool overload and context switching:
Your screen looks like an air traffic control dashboard. You're running monitoring queries, flipping between Grafana and Application Insights, checking container logs, and scrolling through traces, all while someone's asking for an ETA in the incident channel. Correlating data across tools is manual, slow, and mentally exhausting.
- Overwhelming complexity and knowledge gaps:
Modern cloud-native systems like Kubernetes are powerful, but they've made troubleshooting far more complex. Every layer (nodes, pods, controllers, APIs, networking, autoscalers) introduces its own failure modes. To diagnose effectively, you need deep expertise across multiple domains, something even seasoned engineers can't always keep up with.
These challenges call for a solution that can look across signals, recall patterns from past incidents, and guide you toward the most likely cause.
This is where HolmesGPT, a CNCF Sandbox project, could help.
HolmesGPT was accepted as a CNCF Sandbox project in October 2025. It's built to simplify the chaos of production debugging by bringing together logs, metrics, and traces from different sources, reasoning over them, and surfacing clear, data-backed insights in plain language.
What is HolmesGPT?
HolmesGPT is an open-source AI troubleshooting agent built for Kubernetes and cloud-native environments. It combines observability telemetry, LLM reasoning, and structured runbooks to accelerate root cause analysis and suggest next actions.
Unlike static dashboards or chatbots, HolmesGPT is agentic: it actively decides what data to fetch, runs targeted queries, and iteratively refines its hypotheses, all while staying within your environment.
Key benefits:
- AI-native control loop: HolmesGPT uses an agentic task-list approach, breaking a problem into smaller tasks and executing each one separately.
- Open architecture: Every integration and toolset is open and extensible, and works with existing runbooks and MCP servers.
- Data privacy: Models can run locally, inside your cluster, or in the cloud.
- Community-driven: Designed around CNCF principles of openness, interoperability, and transparency.
How it works
When you run:
holmes ask "Why is my pod in crash loop back off state?"
HolmesGPT:
- Understands intent: it recognizes that you want to diagnose a pod restart issue.
- Creates a task list: it breaks the problem down into smaller chunks and executes each of them separately.
- Queries data sources: it runs Prometheus queries, collects Kubernetes events and logs, and inspects pod specs to identify which pod is affected.
- Correlates context: for example, it detects that a recent deployment updated the image.
- Explains and suggests fixes: it returns a natural-language diagnosis and remediation steps.
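The steps above can be sketched as a small loop. This is a minimal Python illustration, not HolmesGPT's actual implementation: the tool functions, the fixed task plan, and the canned outputs are all hypothetical stand-ins for what an LLM and real data sources would provide.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical stand-ins for real data sources; each returns canned text.
def fetch_pod_events(pod: str) -> str:
    return f"{pod}: Back-off restarting failed container"


def fetch_pod_logs(pod: str) -> str:
    return f"{pod}: error: missing env var DB_URL"


def fetch_rollout_history(pod: str) -> str:
    return f"{pod}: image updated from v1 to v2 in latest deployment"


# Registry of tools the agent is allowed to call, keyed by task name.
TOOLS: Dict[str, Callable[[str], str]] = {
    "events": fetch_pod_events,
    "logs": fetch_pod_logs,
    "rollout": fetch_rollout_history,
}


@dataclass
class Investigation:
    question: str
    pod: str
    findings: List[str] = field(default_factory=list)


def plan_tasks(question: str) -> List[str]:
    """Return the task list. Fixed here; a real agent would ask the LLM."""
    return ["events", "logs", "rollout"]


def investigate(question: str, pod: str) -> Investigation:
    """Run each planned task separately and collect its finding."""
    inv = Investigation(question=question, pod=pod)
    for task in plan_tasks(question):
        inv.findings.append(TOOLS[task](pod))
    return inv


def summarize(inv: Investigation) -> str:
    """Join findings into a report (stand-in for the LLM's explanation)."""
    lines = [f"Question: {inv.question}"]
    lines += [f"- {finding}" for finding in inv.findings]
    return "\n".join(lines)


if __name__ == "__main__":
    result = investigate(
        "Why is my pod in crash loop back off state?", "user-profile-import"
    )
    print(summarize(result))
```

In a real agent the plan would not be fixed: the model would choose the next tool based on what earlier findings revealed, which is the iterative refinement described earlier.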
Here's a simplified overview of the architecture:

[Figure: simplified overview of the HolmesGPT architecture]
Extensible by design
HolmesGPT's architecture allows contributors to add new components:
- Toolsets: Build custom commands for internal observability pipelines or expose existing tools through a Model Context Protocol (MCP) server.
- Evals: Add custom evals to benchmark the performance, cost, and latency of models.
- Runbooks: Codify best practices (e.g., "diagnose DNS failures" or "debug PVC provisioning").
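To make the idea of an eval concrete, here is a minimal Python sketch of a benchmark harness. It does not use HolmesGPT's eval framework: `EvalCase`, `EvalResult`, `run_evals`, and the keyword-match pass criterion are hypothetical names and choices made for illustration. It tracks only pass rate and average latency; cost would need provider-specific token accounting.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    question: str
    expected_keyword: str  # the answer passes if it mentions this keyword


@dataclass
class EvalResult:
    passed: int
    total: int
    avg_latency_s: float


def run_evals(answer_fn: Callable[[str], str], cases: List[EvalCase]) -> EvalResult:
    """Call answer_fn on every case, timing each call and counting passes."""
    passed = 0
    elapsed = 0.0
    for case in cases:
        start = time.perf_counter()
        answer = answer_fn(case.question)
        elapsed += time.perf_counter() - start
        if case.expected_keyword.lower() in answer.lower():
            passed += 1
    total = len(cases)
    return EvalResult(passed, total, elapsed / total if total else 0.0)


if __name__ == "__main__":
    # A fake model that always gives the same answer, for demonstration only.
    def fake_model(question: str) -> str:
        return "The container was OOMKilled."

    cases = [
        EvalCase("Why did the pod restart?", "oomkilled"),
        EvalCase("Why does DNS resolution fail?", "coredns"),
    ]
    print(run_evals(fake_model, cases))
```

Swapping `fake_model` for a function that calls a real model would let the same cases compare models side by side.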
Example of a simple custom tool:
holmes:
  toolsets:
    kubernetes/pod_status:
      description: "Check the status of a Kubernetes pod."
      tools:
        - name: "get_pod"
          description: "Fetch pod details from a namespace."
          command: "kubectl get pod {{ pod }} -n {{ namespace }}"
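The `{{ pod }}` and `{{ namespace }}` placeholders in the command above are filled in with values chosen at run time. The following Python sketch shows one way such substitution could work; it is an assumption for illustration, not HolmesGPT's own templating code, and `render_command` is a hypothetical helper. It shell-quotes each value so that an unexpected parameter cannot inject extra commands.

```python
import re
import shlex

# The command template from the toolset example above.
TEMPLATE = "kubectl get pod {{ pod }} -n {{ namespace }}"

# Matches {{ name }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_command(template: str, params: dict) -> str:
    """Replace each {{ name }} placeholder with a shell-quoted value."""

    def substitute(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"missing parameter: {name}")
        return shlex.quote(str(params[name]))

    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    print(render_command(TEMPLATE, {"pod": "user-profile-import", "namespace": "default"}))
```

Quoting matters because the parameter values come from a model's output, which should be treated as untrusted input before it reaches a shell.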
Getting started
- Install HolmesGPT
There are several ways to install HolmesGPT; one of the easiest ways to get started is through Homebrew.
brew tap robusta-dev/homebrew-holmesgpt
brew install holmesgpt
The detailed installation guide has instructions for Helm, the CLI, and the UI.
- Set up the LLM (any OpenAI-compatible LLM) by setting the API key
In most cases, this means setting the appropriate environment variable based on the LLM provider.
- Run it locally
holmes ask "what is wrong with the user-profile-import pod?" --model="anthropic/claude-sonnet-4-5"
- Explore other features
- GitHub: https://github.com/robusta-dev/holmesgpt
- Docs: holmesgpt.dev
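For the LLM setup step above, a quick pre-flight check can confirm that the expected API key is present before running `holmes ask`. The Python sketch below is illustrative only: the variable names `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` follow common provider conventions and are an assumption here, as is treating a model string without a provider prefix as OpenAI. Check the HolmesGPT documentation for the variables your provider actually needs.

```python
import os

# Conventional variable names for two providers. This mapping is an assumption;
# confirm the exact names in your provider's and HolmesGPT's documentation.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def required_env_var(model: str) -> str:
    """Map a model string like 'anthropic/claude-sonnet-4-5' to an env var name."""
    # Assumption: a model string without a 'provider/' prefix means OpenAI.
    provider = model.split("/", 1)[0] if "/" in model else "openai"
    if provider not in PROVIDER_ENV_VARS:
        raise ValueError(f"unknown provider: {provider}")
    return PROVIDER_ENV_VARS[provider]


def api_key_is_set(model: str, env=None) -> bool:
    """Return True if the variable for this model's provider is non-empty."""
    env = os.environ if env is None else env
    return bool(env.get(required_env_var(model)))


if __name__ == "__main__":
    model = "anthropic/claude-sonnet-4-5"
    status = "set" if api_key_is_set(model) else "missing"
    print(f"{required_env_var(model)} is {status}")
```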
How to get involved
HolmesGPT is entirely community-driven and welcomes all forms of contribution:
| Area | How you can help |
| --- | --- |
| Integrations | Add new toolsets for your observability tools or CI/CD pipelines. |
| Runbooks | Encode operational expertise for others to reuse. |
| Evaluation | Help build benchmarks for AI reasoning accuracy and observability insights. |
| Docs and tutorials | Improve onboarding, create demos, or contribute walkthroughs. |
| Community | Join discussions around governance and CNCF Sandbox progression. |
All contributions follow the CNCF Code of Conduct.
Further Resources
- GitHub Repository
- Join CNCF Slack: #holmesgpt
- Contributing Guide