Code-level telemetry instrumentation: from "hell no" to "worth it"
Source: CNCF
A platform engineer's guide to developer buy-in
Originally published on the author's personal blog, whitneylee.com
As platform engineers, we want the holistic system insights that instrumented code can give us. Yes, please. With code-level insights plus infrastructure observability, we can connect infrastructure signals to business outcomes. We can prove our value with Service Level Objectives (SLOs) like "99% of checkouts complete within 2 seconds." Not to mention the ability to auto-detect dependencies between services, automatically create dependency graphs, and power intelligent scaling. But instrumenting code is often a pain in the ass for developers. How do we motivate them to participate?
Well, first off, we build application-adjacent observability collection into the platform as much as possible. What can we observe about our applications without the developer having to lift a finger?
- Framework-level telemetry: OpenTelemetry provides auto-instrumentation for popular programming languages, enabling your platform to collect trace data without modifying the code. You simply attach an agent or run your app with a startup flag, and it automatically instruments common frameworks and libraries such as HTTP servers, database clients, messaging libraries, Spring Boot, Express, etc.
- Kernel-level telemetry: The platform can also collect kernel-level insights using an eBPF-powered observability tool, such as Pixie or Cilium.
- Network-level telemetry: These can come from your service mesh technologies, like Istio, Linkerd, or Cilium.
All of this you can do without the developer writing a single line of observability-related code.
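To make "without modifying the code" concrete, here is a toy sketch of the general idea behind agent-style instrumentation: wrap a library function once at startup so every call gets timed. It uses only the Python standard library. `FakeHttpClient`, `instrument`, and `SPANS` are hypothetical names invented for illustration, and this is not how any particular OpenTelemetry agent is implemented.

```python
import functools
import time

# Stand-in for an exporter; a real agent would ship these to a backend.
SPANS = []


def instrument(owner, func_name):
    """Replace owner.func_name with a wrapper that records a timing 'span'."""
    original = getattr(owner, func_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            # Recorded even if the wrapped call raises.
            SPANS.append({
                "name": func_name,
                "duration_s": time.perf_counter() - start,
            })

    setattr(owner, func_name, wrapper)


class FakeHttpClient:
    """Hypothetical library class standing in for a real HTTP client."""

    def get(self, url):
        return f"200 OK from {url}"


# The "agent" patches the library once at startup...
instrument(FakeHttpClient, "get")

# ...and the application code stays exactly as the developer wrote it.
response = FakeHttpClient().get("https://example.com/checkout")
```

The application line at the bottom never mentions telemetry, yet `SPANS` now holds one timing record for the call.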
But none of this gives us insight into the business logic, which is an important piece of the puzzle. It makes sense. The discipline of observability is maturing, and everything that can be automated for us has been automated for us. But (and this is a story we'll hear over and over again, across many disciplines) the parts of our business that are unique to our business alone need to be handled manually.
This is where we need developer buy-in. Observability shouldn't be a favor developers do for platform engineers; it should be a feature of the platform that enables developers.
As such, make sure telemetry provides value for your devs. Do your developers have access to all the system knowledge you're collecting? Have you given them access to helpful, out-of-the-box dashboards, and a way to customize those dashboards? Can they see how their application connects to everything else the platform is already observing, that framework-level, kernel-level, and network-level telemetry from earlier? If you're asking them to do the work of instrumenting their code without being able to understand or use the value it provides, you're never going to have long-term success with it.
A forward-looking case where developers might especially enjoy telemetry is AI-assisted coding. You can let the AI coding assistant query the observability backend by giving the developer an MCP server, which the assistant can use to bring the right knowledge into the developers' coding environment at the moment it's needed. That's a tremendous motivator to instrument code.
Second of all, just like everything in platform engineering, we need to make the right thing to do the easy thing for the developer to do. Everything that can be automated should be automated. Everything that can't be automated should be standardized.
Semantic conventions are one of the most important parts of OpenTelemetry, yet many platform teams still underestimate their impact. Because, let's be honest, naming things consistently doesn't seem very exciting. But it is! Semantic conventions are standardized naming rules for common technical stuff. Like, so. Much. Technical. Stuff. HTTP requests, database queries, messaging systems, cloud resources: all of these things get a standardized name.
Without conventions, something as simple as user ID ends up scattered across your telemetry as user_id, user.id, user-id, userId, UserID, or uid, making it nearly impossible to correlate a single user's journey across different services. These standards even help you correlate with your third-party integrations, giving you the full picture of how your code interacts with everything else!
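As a rough illustration of the cleanup work that inconsistent naming forces on you, here is a minimal sketch that folds those six spellings into one key. It assumes `user.id` is the canonical name you standardize on. `normalize_attributes` and the alias set are hypothetical helpers, not part of any OpenTelemetry library.

```python
import re

# Assumption: "user.id" is the canonical key your conventions settle on.
CANONICAL_USER_ID = "user.id"

# Spellings compared after lowercasing and stripping separators.
USER_ID_ALIASES = {"userid", "uid"}


def _squash(key):
    """Lowercase a key and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", key.lower())


def normalize_attributes(attributes):
    """Return a copy of attributes with user ID spellings unified."""
    normalized = {}
    for key, value in attributes.items():
        if _squash(key) in USER_ID_ALIASES:
            key = CANONICAL_USER_ID
        normalized[key] = value
    return normalized


variants = ["user_id", "user.id", "user-id", "userId", "UserID", "uid"]
cleaned = [normalize_attributes({name: "42"}) for name in variants]
```

Every entry in `cleaned` ends up keyed by `user.id`. Agreeing on the name up front means nobody has to write or maintain a shim like this at all.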
This matters for developer experience, so that developers can reason about telemetry the same way they reason about code. But more importantly, from a platform engineer's point of view, semantic conventions enable correlation. They ensure that telemetry from different signals can be connected and reasoned about together, whether through shared context or consistent naming. This gives you a holistic view of your system across teams, across services, and across signals.
This is everything.
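Here is a small sketch of what correlation through shared context can look like once every signal carries the same identifier. The flat dictionaries and the `trace_id` field name are simplifying assumptions for illustration, and `correlate_by_trace` is a hypothetical helper rather than a real backend API.

```python
from collections import defaultdict


def correlate_by_trace(*signal_streams):
    """Group records from any number of signals by their shared trace_id."""
    by_trace = defaultdict(list)
    for stream in signal_streams:
        for record in stream:
            # Assumes every record carries a "trace_id" key.
            by_trace[record["trace_id"]].append(record)
    return dict(by_trace)


spans = [
    {"signal": "span", "trace_id": "abc123", "service": "checkout", "name": "POST /checkout"},
    {"signal": "span", "trace_id": "abc123", "service": "payments", "name": "charge_card"},
    {"signal": "span", "trace_id": "def456", "service": "search", "name": "GET /search"},
]
logs = [
    {"signal": "log", "trace_id": "abc123", "service": "payments", "body": "card declined"},
]

timeline = correlate_by_trace(spans, logs)
```

`timeline["abc123"]` now holds two spans and one log line from two different services, which is the cross-team, cross-signal view described above.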
Back to developer experience and making the right thing to do the easy thing to do for the developer. As a platform team, you want to define an internal telemetry specification that's built on top of OpenTelemetry semantic conventions, using them wherever possible. Then, when you want to record telemetry about something that's specific to your business, you have your own standardized naming conventions for those things. Finally, you encode the rules into libraries that can be shared across teams, making them easy to use. You can treat your internal telemetry library like any other code library and provide mechanisms to validate, version, discover, extend, and share.
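A hand-written sketch of such a library might look like the following. Everything here is hypothetical: the `acme.` namespace, the attribute names under it, the version string, and the helper functions are invented for illustration, while `http.request.method` and `user.id` stand in for names taken from shared conventions.

```python
# Hypothetical internal telemetry spec, versioned like any other library.
SPEC_VERSION = "0.1.0"

# Allowed attribute names and their expected types.
ATTRIBUTE_TYPES = {
    # Names borrowed from shared conventions wherever possible.
    "http.request.method": str,
    "user.id": str,
    # Business-specific names under a made-up company namespace.
    "acme.checkout.cart_value": float,
    "acme.checkout.payment_provider": str,
}


def validate(attributes):
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    for key, value in attributes.items():
        expected = ATTRIBUTE_TYPES.get(key)
        if expected is None:
            errors.append(f"unknown attribute: {key}")
        elif not isinstance(value, expected):
            errors.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors


def checkout_attributes(*, user_id, cart_value, payment_provider):
    """Build the attribute set for a checkout, so callers never type raw keys."""
    attributes = {
        "user.id": user_id,
        "acme.checkout.cart_value": float(cart_value),
        "acme.checkout.payment_provider": payment_provider,
    }
    errors = validate(attributes)
    if errors:
        raise ValueError("; ".join(errors))
    return attributes


attrs = checkout_attributes(user_id="42", cart_value=19, payment_provider="stripe")
```

Developers call `checkout_attributes(...)` and attach the result to whatever span or log they are recording, while `validate` can run in CI to catch a stray `userId` before it reaches production.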
A tool like OpenTelemetry Weaver can help here. It can generate these libraries from both your organization's telemetry conventions and the official OpenTelemetry ones, providing developers with telemetry autocomplete functionality in their IDEs and compile-time validation for telemetry attributes. OpenTelemetry Weaver can help to keep your instrumentation consistent as your platform evolves.
I've also been personally experimenting with building an AI agent that auto-instruments my code with the OpenTelemetry standards I've provided in my project, and I've been having some success with it. I suspect that I'm not the only one, and we will see the industry trend in this direction. But I'll save that for another post.
(◠‿・)☆
Now I'm not saying that developers will ever love instrumenting their code, but platform engineers can make that effort feel worthwhile. We can maximize the value and minimize the pain, and everyone can benefit from the system insights we gain. ♫
Whitney will be speaking at KubeCon + CloudNativeCon North America, November 11-13 in Atlanta! Attend the panel discussion: Beyond the Dashboard: Modern Observability for Platform Engineering at Scale