From Monitoring to Observability: Uber's Cloud-Native Platform Ultramarathon
Source: Uber Tech
Introduction
Managing a global corporate network at Uber's scale can feel a bit like running an ultra-marathon. There are long stretches of smooth sailing, but you're always preparing for the unexpected mountain pass or sudden change in weather. For years, our engineering teams have navigated this terrain with a traditional, monolithic monitoring system. Frankly, it felt like running in heavy hiking boots: sturdy, but slow, inflexible, and exhausting to scale up any hill.
We knew we needed to switch to a modern pair of carbon-fiber running shoes. This meant a complete overhaul: a journey to replace our legacy system with a cloud-native observability platform built for speed, flexibility, and endurance on an open-source stack.
Training Plan: Scope and Vision
Before diving deeper, it's important to clarify where this system operates and what we wanted to achieve.
The Scope: Uber's Corporate Network
The CorpNet Observability Platform focuses exclusively on Uber's corporate network: the infrastructure that connects offices, data centers, cloud environments, and internal services.
It's not a production telemetry platform; instead, it monitors and analyzes:
- Network and infrastructure devices like switches, routers, PDUs, and IoT sensors
- Connectivity, latency, and device health across Uber's internal regions
- Operational data flows supporting enterprise networking and internal applications
The mission is simple: make Uber's internal network as observable, reliable, and automated as the systems it supports.
The Vision: Upgrading Our Gear with Open Source
Our vision was to build a new system on the pillars of data quality, scalability, and actionable data. We chose a foundation of best-in-class open-source tools.
| Function | Technology | Why It Matters |
| --- | --- | --- |
| Metrics Collection | Telegraf™ | A lightweight, agile runner, able to handle diverse terrains (SNMP, API, and MQTT) without breaking stride. |
| Metrics Storage | Prometheus® and Thanos™ | Prometheus handles the quick climbs (real-time metrics), while Thanos ensures endurance with long-term storage and global reach. |
| Visualization | Grafana® and Kibana® | The panoramic viewpoints along the route, giving clarity and context when the trail gets steep. |
| Metadata and Search | Elasticsearch® | The trail map and compass, indexing every step and making it easy to retrace paths or spot patterns. |
A Cloud-Native Architecture Built for the Long Run
To achieve our vision, we adopted a cloud-native, microservices-based architecture deployed on Kubernetes®. Each component is a modular, containerized service that communicates via APIs, making the system open to new integrations.
The system is deployed globally across regions (USC, EMEA, and APAC). This is like having well-stocked aid stations along the race course, ensuring our monitoring probes provide accurate, low-latency measurements by being geographically close to our devices.
Running on Kubernetes also gives the system an incredible ability to bounce back. If a component stumbles, it's automatically restarted, ensuring high availability.
Looking at other components of the architecture, we use a scalable, open-source stack. We use Prometheus for core time-series metrics, extended by Thanos for long-term storage and a global query view. Telegraf is our versatile data collection agent, while Grafana and Kibana provide unified dashboards for metrics and alerts.
Finding the Rhythm on the Mountain: The Architecture Behind the Run
Our observability platform is built like a trail race: every component works in sync to keep momentum on unpredictable terrain.

Keeping Up the Pace: Dynamically Tracking Network Changes
In trail running, staying still means falling behind. The terrain changes under your feet, along with the weather, the altitude, and the fatigue, and success depends on how fast you adapt.
Our observability system faces a similar challenge. Without adaptability, it'd be like running the same loop forever: predictable, static, and quickly outdated.
That's why we introduced the Dynamic Config App, a service that keeps the entire monitoring system moving in rhythm with Uber's corporate network.
Before this, Telegraf pollers relied on static configuration files, which are great for a short sprint but not for a long, evolving route. Every time a new device appeared in the inventory, a router changed region, or a device was put into maintenance, the configuration had to be edited by hand and the agents redeployed. It was functional, but it couldn't scale or keep pace with a live, global environment.
The Dynamic Config App changed that completely.
This service dynamically tracks updates in the inventory stored in Elasticsearch, which is continuously refreshed from ServiceNow® (the source of truth). Whenever a device is added, removed, or reassigned, the Config App immediately reflects that change in Telegraf's configuration: no manual edits, no redeploys.
Each Telegraf agent is launched with parameters that make it self-adjusting:

Using parameters like /v1/snmp/cisco?region=EMEA, the app filters which devices each Telegraf agent should poll based on region, site, or device type.
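To make the filtering step concrete, here is a minimal Python sketch of how a request path such as /v1/snmp/cisco?region=EMEA could be turned into a device list and a Telegraf SNMP input section. It is an illustration under stated assumptions, not the Dynamic Config App itself: the `Device` fields, the sample inventory, the helper names, and the choice to skip devices in maintenance are all hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit, parse_qs


@dataclass(frozen=True)
class Device:
    hostname: str
    vendor: str
    region: str
    site: str
    in_maintenance: bool = False


# Hypothetical inventory; in the article this data lives in Elasticsearch.
INVENTORY = [
    Device("sw-ams-01", "cisco", "EMEA", "AMS"),
    Device("sw-ams-02", "cisco", "EMEA", "AMS", in_maintenance=True),
    Device("rt-sfo-01", "cisco", "USC", "SFO"),
    Device("ap-lon-01", "meraki", "EMEA", "LON"),
]


def select_devices(url: str, inventory: list[Device]) -> list[Device]:
    """Return the devices matching a /v1/snmp/<vendor>?key=value request."""
    parts = urlsplit(url)
    vendor = parts.path.rstrip("/").split("/")[-1]
    # parse_qs returns lists; keep the first value for each filter key.
    filters = {k: v[0] for k, v in parse_qs(parts.query).items()}
    selected = []
    for dev in inventory:
        # Assumption: devices in maintenance are left out of polling.
        if dev.vendor != vendor or dev.in_maintenance:
            continue
        if all(getattr(dev, key, None) == value for key, value in filters.items()):
            selected.append(dev)
    return selected


def render_snmp_input(devices: list[Device]) -> str:
    """Render a minimal Telegraf [[inputs.snmp]] section for the devices."""
    agents = ", ".join(f'"udp://{d.hostname}:161"' for d in devices)
    return f"[[inputs.snmp]]\n  agents = [{agents}]\n"
```

Calling `render_snmp_input(select_devices("/v1/snmp/cisco?region=EMEA", INVENTORY))` yields a section that lists only the one active Cisco device in EMEA from the sample inventory.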
With this setup, Telegraf periodically checks the configuration URL to see if anything has changed. If the Last-Modified timestamp differs, it automatically reloads its configuration.
This mechanism allows us to:
- Instantly redistribute polling workloads across regions (USC, EMEA, and APAC)
- Filter configurations by tags such as region, site, or device type
- Roll out new collectors or plugin changes globally, without redeploying agents
- Maintain full alignment with the live network inventory at all times
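The reload-on-change idea can be sketched in a few lines of Python. This is not Telegraf's internal code; it only illustrates comparing the Last-Modified value between polls. The `ConfigWatcher` class and its injected callables are hypothetical, and the header fetch is passed in as a function so the sketch runs without a network.

```python
from typing import Callable, Optional


class ConfigWatcher:
    """Reload a config only when the server's Last-Modified value changes."""

    def __init__(self, fetch_headers: Callable[[], dict], reload: Callable[[], None]):
        self._fetch_headers = fetch_headers  # e.g. wraps an HTTP HEAD request
        self._reload = reload
        self._last_modified: Optional[str] = None

    def check_once(self) -> bool:
        """Poll once; return True if a reload was triggered."""
        current = self._fetch_headers().get("Last-Modified")
        if current is None or current == self._last_modified:
            return False
        first_poll = self._last_modified is None
        self._last_modified = current
        if first_poll:
            return False  # the first poll only records the baseline
        self._reload()
        return True
```

In this sketch the first poll records a baseline instead of reloading, on the assumption that the agent has just started with a fresh configuration. A real agent would call `check_once` on a timer.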
The result? A system that never stands still.
Just as a good ultrarunner adapts to every climb and descent, our observability platform adapts to every change in the corporate network: automatically, continuously, and effortlessly.
Finding the Trail Beyond the Map: From Monitoring to Observability
Monitoring answers "Is it up?"
Observability answers "Why does it feel slow uphill?"
We transitioned from static metrics to context-rich observability, powered by automation and smart data flow.
In trail running, checkpoints tell you where you've been, but not how you're doing. They confirm you're still on the course, but they don't explain why your pace has dropped, where the next climb begins, or what's happening behind the next ridge.
Traditional monitoring systems are like those checkpoints. They tell you if something's up or down, but not why or what's next. Observability, on the other hand, is about understanding the entire race: the terrain, the conditions, and how every part of the system moves together.
That shift, from monitoring to observability, is where the Alert Ingestion App comes into play.
If Telegraf and the Dynamic Config App represent the legs and lungs of our system, the Alert Ingestion App is its coach and race strategist.
Built on FastAPI™, Celery™, and Redis®, this service listens to every signal along the route, not just from Prometheus alerts, but also from third-party systems like Meraki®, HPE MIST™, and other webhook integrations.
It ingests alerts asynchronously, processes them in parallel, and ensures that critical events reach the right teams without flooding Slack® or PagerDuty® with noise.
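The accept-now, process-later pattern can be illustrated with Python's standard library alone. The sketch below swaps in an `asyncio.Queue` and worker tasks for the Celery and Redis pieces named above, so it shows the shape of the flow and says nothing about the real service. The function names and the `processed` flag are hypothetical.

```python
import asyncio


async def worker(queue: asyncio.Queue, results: list) -> None:
    """Drain alerts from the queue; stands in for a task-queue worker."""
    while True:
        alert = await queue.get()
        try:
            # Placeholder for enrichment and routing work.
            results.append({**alert, "processed": True})
        finally:
            queue.task_done()


async def ingest(alerts: list[dict], concurrency: int = 3) -> list[dict]:
    """Enqueue alerts immediately, then let several workers handle them."""
    queue: asyncio.Queue = asyncio.Queue()
    results: list[dict] = []
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(concurrency)]
    for alert in alerts:
        queue.put_nowait(alert)  # the receiving side never waits on processing
    await queue.join()           # wait until every alert has been handled
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

Note that asyncio tasks run concurrently on one thread; a distributed task queue would spread the same work across separate worker processes.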
Think of it as an intelligent checkpoint system:
- It filters and prioritizes alerts, focusing only on what matters most
- It groups similar events, like multiple runners crossing the same timing mat together, reducing alert storms
- It adds context and correlation, so engineers can see why an issue occurred, not just that it did
- And it updates in real time, posting alerts once and editing them as their state changes: no duplicate noise, no confusion
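The grouping and post-once-then-edit behavior in the list above can be sketched as follows. This is a simplified illustration, not the Alert Ingestion App's logic: the fingerprint fields (`name`, `site`), the `Notifier` class, and the in-memory action log are hypothetical stand-ins for whatever keys and chat or paging APIs a real deployment would use.

```python
import hashlib


def fingerprint(alert: dict) -> str:
    """Stable key for alerts that describe the same problem."""
    key = f'{alert["name"]}|{alert["site"]}'
    return hashlib.sha256(key.encode()).hexdigest()[:12]


class Notifier:
    """Post one message per fingerprint and edit it on state changes."""

    def __init__(self) -> None:
        self.messages: dict[str, dict] = {}        # fingerprint -> message state
        self.actions: list[tuple[str, str]] = []   # log of (action, fingerprint)

    def handle(self, alert: dict) -> None:
        fp = fingerprint(alert)
        msg = self.messages.get(fp)
        if msg is None:
            # First alert of its kind: post a new message.
            self.messages[fp] = {"status": alert["status"], "devices": {alert["device"]}}
            self.actions.append(("post", fp))
            return
        changed = alert["status"] != msg["status"] or alert["device"] not in msg["devices"]
        msg["status"] = alert["status"]
        msg["devices"].add(alert["device"])
        if changed:
            # Same problem, new information: edit the existing message.
            self.actions.append(("edit", fp))
```

An exact repeat of an alert produces no action at all, which is what keeps a storm of identical events from turning into a storm of messages.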
By routing every alert, whether from Prometheus, MQTT (Message Queuing Telemetry Transport) sensors, or network APIs, into a single Elasticsearch index, the Alert Ingestion App creates one continuous trail of operational awareness. Engineers can retrace the full journey of an incident, from its first ping to its full resolution, across all systems and regions.
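A single index only works if alerts from different sources share one document shape. The sketch below shows one possible normalization step in Python. The first function reads the `alerts`, `labels`, `status`, and `startsAt` fields of an Alertmanager webhook payload; the output field names, the `region` and `site` labels, the MQTT topic layout, and the threshold rule are assumptions made for illustration.

```python
def from_alertmanager(payload: dict) -> list[dict]:
    """Flatten an Alertmanager webhook payload into common documents."""
    docs = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        docs.append({
            "source": "prometheus",
            "name": labels.get("alertname", "unknown"),
            "status": alert.get("status", "firing"),
            "region": labels.get("region"),  # assumed label
            "site": labels.get("site"),      # assumed label
            "@timestamp": alert.get("startsAt"),
        })
    return docs


def from_sensor(topic: str, reading: float, limit: float, timestamp: str) -> dict:
    """Turn a hypothetical MQTT sensor reading into the same shape."""
    # Assumed topic layout: sensors/<region>/<site>/<metric>
    _, region, site, metric = topic.split("/")
    return {
        "source": "mqtt",
        "name": f"{metric}_above_limit",
        "status": "firing" if reading > limit else "resolved",
        "region": region,
        "site": site,
        "@timestamp": timestamp,
    }
```

Because both functions return the same keys, documents from either source can be indexed together and queried with the same filters.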

Together with the Dynamic Config App, it transforms a static monitoring setup into a living, breathing observability ecosystem, one that adapts to the terrain, listens to the data, and keeps perfect pace with the ever-changing rhythm of Uber's corporate network.
Paving the Way for a Personal Best: AIOps and Faster Resolutions
Centralizing all our alert data, both firing and resolved, creates a rich, historical dataset. This is like a detailed training log of every run we've ever done, and it's the perfect fuel for AIOps (AI for IT Operations). This historical context allows us to bring in an elite performance coach.
An AI engine processes all alert sources to provide advanced insights into network behavior patterns. By analyzing past performance, it helps engineers pinpoint root causes faster, significantly lowering our MTTR (Mean Time To Repair) and helping us set a new personal best for incident resolution.
This architecture also supports interactive AI agents, like a Slack bot, that act as a pacer during an incident. Engineers can ask questions in natural language ("How have alerts for this site code trended over the last 24 hours?") and get immediate, context-rich answers, helping them push through the toughest parts of an incident.
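A question like the 24-hour trend example could plausibly be answered with an hourly histogram over the alert index. The Python sketch below builds an Elasticsearch query body using a `date_histogram` aggregation and picks the busiest hour from the returned buckets. The field names `site_code` and `@timestamp`, the aggregation name, and both helper functions are assumptions; the article does not describe how the bot queries the data.

```python
from typing import Optional


def alert_trend_query(site_code: str, hours: int = 24) -> dict:
    """Build a query body counting alerts per hour for one site."""
    return {
        "size": 0,  # only the aggregation is needed, not the documents
        "query": {"bool": {"filter": [
            {"term": {"site_code": site_code}},
            {"range": {"@timestamp": {"gte": f"now-{hours}h"}}},
        ]}},
        "aggs": {"alerts_per_hour": {
            "date_histogram": {"field": "@timestamp", "fixed_interval": "1h"}
        }},
    }


def busiest_bucket(buckets: list[dict]) -> Optional[dict]:
    """Return the histogram bucket with the most alerts, if any."""
    return max(buckets, key=lambda b: b["doc_count"], default=None)
```

The resulting dictionary can be sent as the request body of a search call, and the `alerts_per_hour` buckets in the response give the bot the numbers it needs to describe the trend.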
Conclusion: The Finish Line
By migrating to this modern, cloud-native observability platform, we've changed our shoes and our entire approach to running the network marathon. We now have:
- Faster finish times. With enriched data and intelligent, de-duplicated alerts, engineers can diagnose and resolve issues much faster.
- Readiness for any terrain. The modular, API-driven design makes the system highly extensible and always open to new integrations, ensuring we're ready for whatever the course throws at us next.
- Smarter pacing. We've eliminated hundreds of thousands of dollars of recurring licensing fees, allowing us to invest those resources in more innovative parts of our strategy.
This project has been a rewarding endurance event. The result is a platform that empowers our teams with the deep insights needed to manage a network at Uber's scale and paves the way for a future of even smarter, AI-driven network operations.
Cover Photo Attribution: "HOKA UTMB Mont-Blanc 2024 – PTL, CCC, UTMB" by Sportograf, is from the author's personal gallery.
Celery is an open-source distributed task queue licensed under the BSD License.
FastAPI™ is a trademark of @tiangolo and is registered in the US and across other regions.
Kibana® and Elasticsearch® are registered trademarks of Elasticsearch BV, registered in the U.S. and in other countries. No endorsement by Elasticsearch is implied by the use of these marks.
Kubernetes®, Thanos™, and Prometheus® are trademarks or registered trademarks of The Linux Foundation® in the United States and other countries. No endorsement by The Linux Foundation is implied by the use of these marks.
Meraki® is a registered trademark of Cisco Technology, Inc.
MIST™ is a trademark of HPE.
PagerDuty® is a registered trademark of PagerDuty, Inc.
Redis is a registered trademark of Redis Ltd. Any rights therein are reserved to Redis Ltd. Any use by Uber is for referential purposes only and does not indicate any sponsorship, endorsement or affiliation between Redis and Uber.
ServiceNow® is a registered trademark of ServiceNow, Inc.
Slack® is a registered trademark and service mark of Slack Technologies, Inc.
Telegraf™ is a trademark of InfluxData, which is not affiliated with, and does not endorse, this blog.