The Missing Layer in the AI Stack: Context, Not Just State

Source: Data Eng Weekly



Why Data Engineers Must Think in Graphs, Not Just Tables

If you have been following the ā€œSystems of Recordā€ debate on tech Twitter, you likely saw the clash between the ā€œAgents kill SaaSā€ camp and the ā€œLong live the Databaseā€ camp. But for data engineers, the reality is more nuanced—and far more interesting.

As we move from dashboards to autonomous agents, we are hitting a wall. It turns out that knowing the state (what happened) is not the same as knowing the reasoning (why it happened).

Drawing on recent insights from Foundation Capital, Jamin Ball (Altimeter), OpenAI’s internal engineering team, and the TrustGraph manifesto, this post explores the emergence of the Context Graph. This missing architectural layer will likely redefine how we build data platforms in the agentic era.


The Problem: State Machines vs. Decision Traces

For the past decade, our role as data engineers has been to centralize data in the warehouse (or Lakehouse). We built ETL pipelines to move data from Salesforce, NetSuite, and Zendesk into a ā€œSingle Source of Truth.ā€

However, traditional Systems of Record (SoR) effectively act as ā€œstate machines.ā€ They record the final state: a deal closed, a discount applied, a ticket escalated. But they fail to capture the decision traces.

As Foundation Capital notes, the reasoning behind a decision—the Slack threads, the cross-system synthesis, the VP’s verbal override of a policy—is rarely captured in the database. A CRM might show a ā€œ20% discount,ā€ but it won’t tell an AI agent why that exception was granted (e.g., ā€œCustomer represents a strategic entry into the APAC marketā€).

Without these traces, agents fly blind. They have the rules (ā€œDo not give discounts >10%ā€), but they lack historical context on when and why those rules were overridden.
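A decision trace can be as simple as a structured record capturing the rule in scope, the action taken, and the rationale behind it. A minimal sketch of what such a record might look like (the field names and `DecisionTrace` type are illustrative, not from any of the cited sources):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DecisionTrace:
    """One record of a decision: which rule applied, what was done, and why."""
    actor: str       # agent or human who decided
    rule: str        # the policy that was in scope
    action: str      # what actually happened (the part the CRM already stores)
    rationale: str   # the "why" that never reaches the database
    inputs: dict     # the facts considered at decision time
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# The 20% discount exception from the text, recorded as a trace:
trace = DecisionTrace(
    actor="vp_sales",
    rule="max_discount_10pct",
    action="applied_20pct_discount",
    rationale="Customer represents a strategic entry into the APAC market",
    inputs={"deal_id": "D-123", "requested_discount": 0.20},
)
print(trace.rationale)
```

An agent replaying past exceptions can then retrieve the rationale alongside the outcome, instead of seeing only the final 20% figure.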


The Solution: The Truth Registry and the Context Graph

To address this, we observe a bifurcation in the modern data stack, as illustrated by the Hybrid Agentic Architecture (see Figure 1 below).

This architecture consists of two distinct but integrated planes:

1. The Warehouse as the ā€œTruth Registry.ā€

Jamin Ball argues that systems of record aren’t dying; they are becoming ā€œboring, rock-solid sources of truthā€. In an agentic world, the warehouse must evolve into a ā€œTruth Registryā€ that encodes semantic contracts.

Agents are fragile. If an agent hallucinates the definition of ā€œChurn,ā€ it can automate disastrous decisions. Therefore, we must clean and canonize data before the agent sees it. In the architecture above, the flow is from Raw (Variant) to Silver (Extracted) to Gold (Canonical Model).

  • Engineering Takeaway: You cannot feed agents raw JSON blobs. Extracting variant columns into typed, named columns in the Silver layer is critical. It transforms ā€œavailable dataā€ into ā€œgoverned data,ā€ preventing agents from guessing schemas at runtime.
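The Raw-to-Silver step above can be sketched in a few lines: parse the variant blob once, coerce each field to an explicit type and name, and fail loudly on missing required fields rather than letting an agent guess the schema at runtime. The field names and units here are hypothetical:

```python
import json
from datetime import date

def to_silver(raw_row: str) -> dict:
    """Extract a raw variant JSON blob into typed, named Silver columns.

    Unknown keys are dropped; required fields raise instead of silently
    passing untyped data downstream.
    """
    blob = json.loads(raw_row)
    return {
        "account_id": str(blob["account_id"]),             # required, typed
        "mrr_usd": float(blob.get("mrr", 0.0)),            # named + unit-suffixed
        "signup_date": date.fromisoformat(blob["signup_date"]),
        "is_churned": bool(blob.get("churned", False)),
    }

raw = '{"account_id": 42, "mrr": "199.0", "signup_date": "2024-03-01", "extra": "x"}'
row = to_silver(raw)
print(row["mrr_usd"], row["is_churned"])
```

The same contract is usually enforced in SQL or dbt in practice; the point is that the types and names are fixed before any agent queries the table.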

2. The Context Graph as the ā€œReasoning Layer.ā€

While the warehouse handles facts, the Context Graph handles relationships. TrustGraph defines a context graph as a ā€œtriples-representation of data (Subject → Predicate → Object) optimized for AIā€.

Why a graph? Because structure is information. When you feed an LLM structured data (like RDF or Cypher), the structure itself encodes meaning. This allows agents to traverse relationships that SQL joins struggle to represent—stitching together a user’s support ticket, their billing status, and their web activity into a single, queryable context.
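The stitching described above can be illustrated with a toy in-memory triple store. Real context graphs use RDF stores or property-graph databases; the subjects and predicates below are invented for the example:

```python
# A toy triple store: (subject, predicate, object) facts spanning systems
# that SQL joins would keep siloed in separate warehouses and tools.
triples = [
    ("user:alice", "filed", "ticket:981"),
    ("ticket:981", "concerns", "feature:exports"),
    ("user:alice", "has_billing_status", "past_due"),
    ("user:alice", "visited", "page:/pricing"),
]

def about(subject: str) -> list[tuple[str, str]]:
    """All (predicate, object) facts for a subject: one hop of context."""
    return [(p, o) for s, p, o in triples if s == subject]

# Stitch the support ticket, billing status, and web activity from the
# text's example into a single queryable context for the agent:
context = about("user:alice")
print(context)
```

Because each edge is named, the structure itself tells the agent that `past_due` is a billing status and `ticket:981` is something the user filed, without a schema lookup.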


Case Study: Inside OpenAI’s Data Agent

OpenAI recently reported that standard metadata was insufficient for their internal data agent. They had to build a custom ā€œContext Layerā€ that closely resembles the architecture above.

Their agent failed when it relied solely on table schemas. To fix this, they added:

  1. Human Annotations: Curated descriptions of what tables actually mean (e.g., ā€œThis table excludes logged-out usersā€).

  2. Code Enrichment: They used ā€œCodexā€ to crawl their own codebase, understanding data lineage not just by metadata, but by reading the pipelines that produced the data.
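OpenAI used Codex over their full codebase for this; a heavily simplified sketch of the same idea is to scan pipeline code for the tables each statement reads and writes. The SQL snippet and regexes below are illustrative only and would not survive real-world SQL:

```python
import re

# A hypothetical pipeline file: the lineage lives in the code, not the catalog.
pipeline_sql = """
CREATE TABLE gold.active_users AS
SELECT user_id, last_seen
FROM silver.events
JOIN silver.accounts USING (user_id)
WHERE logged_out = FALSE;
"""

def extract_lineage(sql: str) -> dict:
    """Crude lineage extraction: which tables a statement writes and reads."""
    writes = re.findall(r"CREATE TABLE\s+([\w.]+)", sql, re.IGNORECASE)
    reads = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)
    return {"writes": writes, "reads": reads}

print(extract_lineage(pipeline_sql))
```

Even this crude pass recovers facts the schema alone cannot: that `gold.active_users` is derived from two Silver tables, and that the `WHERE` clause filters out logged-out users.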

This confirms a major trend: The metadata is the model. Providing agents with a semantic ontology (machine-readable definitions of terms) is just as important as the data itself.
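One way to make "the metadata is the model" concrete is to merge the schema with human annotations before the agent ever sees the table. The table name, columns, and notes below are hypothetical, echoing the logged-out-users example from the text:

```python
# Machine-readable annotations joined with the schema, rendered into a
# prompt-ready context block for the agent. All names are illustrative.
SCHEMA = {"dau": ["user_id", "event_date"]}
ANNOTATIONS = {
    "dau": {
        "description": "Daily active users. Excludes logged-out users.",
        "definitions": {"active": "at least one authenticated event that day"},
    },
}

def table_context(table: str) -> str:
    """Render schema + human annotations into one context block."""
    cols = ", ".join(SCHEMA[table])
    note = ANNOTATIONS[table]
    defs = "; ".join(f"{k} = {v}" for k, v in note["definitions"].items())
    return f"table {table}({cols})\nnote: {note['description']}\nterms: {defs}"

print(table_context("dau"))
```

Handing the agent this block instead of a bare column list removes the single biggest failure mode OpenAI reported: guessing what a table means from its name.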


The ā€œFront Doorā€ is Moving

The implications for the industry are massive. Historically, if you owned the System of Record (like Salesforce), you owned the ā€œFront Doorā€ (the UI).

But as agents take over workflows, the UI is unbundling from the data. Jamin Ball compares this to the travel industry: GDS systems (Sabre, Amadeus) remained the backend source of truth, but Online Travel Agencies (Expedia, Booking) captured the front door—and the value.

In our new stack, the Agents become the OTAs. They are the new interface. The Warehouse/Lakehouse becomes the GDS—the invisible, essential infrastructure layer.


What This Means for Data Engineers

  1. Stop Hoarding State, Start Capturing Traces: We need to instrument our systems to emit ā€œdecision tracesā€ on every run. If an agent (or human) makes a decision, record the inputs and the logic used, not just the result.

  2. The Rise of the ā€œGoldā€ Layer: Your dbt models are no longer just for dashboards. They are the safety rails for autonomous agents. Strict typing, ā€œGoldā€ tables, and canonical definitions are non-negotiable.

  3. Graph Literacy: You don’t need to be a Neo4j expert, but understanding the basics of triples (Subject-Predicate-Object) and ontologies is becoming a core DE skill.

  4. Extract Your Semi-Structured/Unstructured Data: As shown in the architecture diagram, leaving data in unstructured blobs is a liability. Agents need explicit structure to reason safely.

As agents grow more capable, the infrastructure beneath them must evolve. The Context Graph offers a powerful new foundation—not just for smarter agents, but for more transparent, explainable, and aligned systems. It’s time for data teams to build not just pipelines, but reasoning engines.


References

https://x.com/KirkMarple/status/2003944353342149021

https://x.com/KirkMarple/status/2005443843848856047

https://foundationcapital.com/context-graphs-ais-trillion-dollar-opportunity/

https://trustgraph.ai/news/context-graph-manifesto/

https://openai.com/index/inside-our-in-house-data-agent/