Code-Level Telemetry Instrumentation: From 'Go to Hell' to 'True Love'

A platform engineer’s guide to developer buy-in

Originally published on the author’s personal blog, whitneylee.com

As platform engineers, we want the holistic system insights that instrumented code can give us – yes, please. With code-level insights plus infrastructure observability, we can connect infrastructure signals to business outcomes. We can prove our value with Service Level Objectives (SLOs) like “99% of checkouts complete within 2 seconds.” Not to mention the ability to auto-detect dependencies between services, automatically create dependency graphs, and power intelligent scaling. But instrumenting code is often a pain in the ass for developers. How do we motivate them to participate?

Well, first off, we build application-adjacent observability collections into the platform as much as possible. What can we observe about our applications without the developer having to lift a finger?

Framework-level telemetry: OpenTelemetry provides auto-instrumentation for popular programming languages, enabling your platform to collect trace data without modifying the code. You simply attach an agent or run your app with a startup flag, and it automatically instruments common libraries and frameworks like HTTP servers, database clients, messaging libraries, Spring Boot, Express, etc.

Kernel-level telemetry: The platform can also collect kernel-level insights using an eBPF-powered observability tool, such as Pixie or Cilium.

Network-level telemetry: This can come from your service mesh technologies, like Istio, Linkerd, or Cilium.

You can do all of this without the developer writing a single line of observability-related code.

But none of this gives us insight into the business logic, which is an important piece of the puzzle. It makes sense. The discipline of observability is maturing, and everything that can be automated for us has been automated for us. But (and this is a story we’ll hear over and over again, across many disciplines) the parts of our business that are unique to our business alone need to be handled manually.

This is where we need developer buy-in. Observability shouldn’t be a favor developers do for platform engineers; it should be a feature of the platform that enables developers.

As such, make sure telemetry provides value for your devs. Do your developers have access to all the system knowledge you’re collecting? Have you given them access to helpful, out-of-the-box dashboards, and a way to customize those dashboards? Can they see how their application connects to everything else the platform is already observing: that framework-level, kernel-level, and network-level telemetry from earlier? If you’re asking developers to instrument their code without giving them a way to understand or use the value it provides, you’re never going to have long-term success with it.

One forward-looking place where developers might especially enjoy telemetry is AI-assisted coding. You can let the AI coding assistant query the observability backend by providing the developer with an MCP server, which the assistant can use to bring the right knowledge into the developer’s coding environment at the moment it’s needed. That’s a tremendous motivator to instrument code.

Second of all, just like everything in platform engineering, we need to make the right thing to do the easy thing for the developer to do. Everything that can be automated should be automated. Everything that can’t be automated should be standardized. Semantic conventions are one of the most important parts of OpenTelemetry, yet many platform teams still underestimate their impact. Because, let’s be honest, naming things consistently doesn’t seem very exciting. But it is! Semantic conventions are standardized naming rules for common technical stuff. Like, so. Much. Technical. Stuff. HTTP requests, database queries, messaging systems, cloud resources—all of these things get a standardized name.

Without conventions, something as simple as user ID ends up scattered across your telemetry as user_id, user.id, user-id, userId, UserID, or uid—making it nearly impossible to correlate a single user’s journey across different services. These standards even help you correlate with your third-party integrations, giving you the full picture of how your code interacts with everything else!

This matters for developer experience, so that developers can reason about telemetry the same way they reason about code. But more importantly, from a platform engineer’s point of view, semantic conventions enable correlation. They ensure that telemetry from different signals can be connected and reasoned about together, whether through shared context or consistent naming. This gives you a holistic view of your system across teams, across services, and across signals.

This is everything.
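To make the user ID example from earlier concrete, here’s a minimal sketch in plain Python (no OpenTelemetry SDK involved). The event shape, the alias table, and the helper names are all made up for illustration, and I’m assuming `user.id` is the key your teams have agreed on. It shows why a lookup for one user finds nothing when every service spells the attribute differently, and finds everything once the names agree.

```python
# Hypothetical alias table: every spelling seen in the wild maps to one agreed key.
CANONICAL_KEYS = {
    "user_id": "user.id",
    "user-id": "user.id",
    "userId": "user.id",
    "UserID": "user.id",
    "uid": "user.id",
}


def normalize(attributes):
    """Return a copy of the attributes with known aliases renamed to the agreed key."""
    return {CANONICAL_KEYS.get(key, key): value for key, value in attributes.items()}


def services_for_user(events, user):
    """List the services that recorded an event carrying `user.id` for this user."""
    return [e["service"] for e in events if e["attributes"].get("user.id") == user]


# Three services record the same user under three different attribute names.
raw_events = [
    {"service": "cart", "attributes": {"user_id": "u-42"}},
    {"service": "checkout", "attributes": {"userId": "u-42"}},
    {"service": "email", "attributes": {"uid": "u-42"}},
]

normalized_events = [
    {"service": e["service"], "attributes": normalize(e["attributes"])}
    for e in raw_events
]

print(services_for_user(raw_events, "u-42"))         # nothing correlates
print(services_for_user(normalized_events, "u-42"))  # the full journey shows up
```

In real life you’d rather have every service emit the agreed name in the first place than patch it up afterwards, which is where the shared library idea comes in.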

Back to developer experience and making the right thing to do the easy thing to do for the developer. As a platform team, you want to define an internal telemetry specification that’s built on top of OpenTelemetry semantic conventions, so it uses them anywhere possible. Then, when you want to record telemetry about something that’s specific to your business, you have your own standardized naming conventions for those types of things. You encode the rules into libraries that can be shared across teams, making them easy to use. You can treat your internal telemetry library like any other code library and provide mechanisms to validate, version, discover, extend, and share it.
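Here’s a rough idea of what the smallest possible version of such a library could look like, hand-rolled in plain Python. Treat it as a sketch under assumptions: the `acme.*` attribute names, the `AttributeSpec` class, the `validate` helper, and the version string are all hypothetical. The two `http.*` names are the ones I’d expect to reuse from the OpenTelemetry semantic conventions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeSpec:
    name: str
    value_type: type
    description: str


# Version the spec like any other library (hypothetical version number).
SPEC_VERSION = "1.0.0"

REGISTRY = {
    spec.name: spec
    for spec in [
        # Reused from OpenTelemetry semantic conventions.
        AttributeSpec("http.request.method", str, "HTTP request method"),
        AttributeSpec("http.response.status_code", int, "HTTP response status code"),
        # Business-specific names (hypothetical, invented for this sketch).
        AttributeSpec("acme.checkout.cart_value", float, "Cart total at checkout"),
        AttributeSpec("acme.checkout.payment_provider", str, "Payment provider used"),
    ]
}


def validate(attributes):
    """Return a list of problems; an empty list means the attributes conform."""
    problems = []
    for name, value in attributes.items():
        spec = REGISTRY.get(name)
        if spec is None:
            problems.append(f"unknown attribute: {name}")
        elif not isinstance(value, spec.value_type):
            problems.append(
                f"{name}: expected {spec.value_type.__name__}, got {type(value).__name__}"
            )
    return problems


good = {"http.request.method": "POST", "acme.checkout.cart_value": 59.99}
bad = {"cart_value": 59.99, "http.response.status_code": "200"}

print(validate(good))
print(validate(bad))
```

A check like `validate` could run in unit tests or CI, so a misspelled attribute gets caught before it ever reaches the observability backend.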

A tool like OpenTelemetry Weaver can help here. It can generate these libraries from both your organization’s telemetry conventions and the official OpenTelemetry ones, providing developers with telemetry autocomplete functionality in their IDEs and compile-time validation for telemetry attributes. It can also help keep your instrumentation consistent as your platform evolves.

I’ve also been personally experimenting with building an AI agent that auto-instruments my code with the OpenTelemetry standards I’ve provided in my project, and I’ve been having some success with it. I suspect that I’m not the only one, and we will see the industry trend in this direction. But I’ll save that for another post.

(◠‿・)—☆

Now I’m not saying that developers will ever love instrumenting their code, but platform engineers can make that effort feel worthwhile. We can maximize the value and minimize the pain, and everyone can benefit from the system insights we gain. ♫

Whitney will be speaking at KubeCon + CloudNativeCon North America, November 11-13 in Atlanta! Attend the panel discussion: Beyond the Dashboard: Modern Observability for Platform Engineering at Scale