2025 Data Engineering Review
Source: Data Eng Weekly
If 2023 was the year of "Shock" and 2024 was the year of "Hype," 2025 will be remembered as the year of Engineering.
For the past decade, our industry has been obsessed with the mechanics of movement. We argued about "ETL vs. ELT." We fought "Format Wars" over table specifications. We optimized commit protocols and debated the merits of various orchestrators. We were, fundamentally, digital plumbers ensuring the water reached the tap.
But in 2025, the mandate changed. The business no longer wants "data"; it demands "intelligence." It demands systems that reason, agents that act, and infrastructure that guarantees truth in a non-deterministic world. The "Big Data" era of managing volume formally ended, replaced by the "Context Era" of managing meaning.
We are no longer just Data Engineers. We are the architects of the cognitive layer.
Here are the seven patterns that defined Data & AI Engineering in 2025.
1. Agent Engineering: The Inevitable Evolution of the Pipeline
The most significant shift of 2025 was the industry's realization that "Agents" are not just fancy chatbots: they are the new compute engine. In 2024, we treated LLMs as text generators. In 2025, we started treating them as reasoning engines that execute logic we previously wrote in Python or SQL.
This birthed a new discipline: Agent Engineering.
We moved beyond the chaotic "vibes-based" coding of early experiments into structured, rigorous engineering. We stopped asking "Can AI write code?" and started asking "How do we architect a system where AI reliably executes complex workflows?"
The Rise of Context Engineering
The bottleneck for intelligent systems shifted from model capacity to context management. We realized that an agent is only as smart as the context you feed it.
Anthropic defined the year with their masterclass on Effective Context Engineering, framing it as a discipline focused on managing the "attention budget" of models. It wasn't enough to dump documents into a prompt. In their piece on Context Engineering for AI Agents, engineers at Manus demonstrated that we must curate, compress, and dynamically retrieve tokens during inference to sustain coherent behavior over long horizons.
We learned that "Context" is an information management problem. We saw teams optimizing "KV-cache hit rates" and treating context windows like precious RAM. The winning architecture wasn't the one with the biggest model; it was the one that engineered the most relevant context.
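To make the "attention budget" idea concrete, here is a minimal sketch of packing retrieved chunks into a fixed token budget. It is an illustration only, not the approach of any team named above: the Chunk type, the relevance scores, and the one-token-per-word estimate are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    relevance: float  # hypothetical retriever score, higher is better


def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def pack_context(chunks: list[Chunk], budget: int) -> list[Chunk]:
    """Greedily keep the most relevant chunks that still fit the token budget."""
    selected: list[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.relevance, reverse=True):
        cost = estimate_tokens(chunk.text)
        if used + cost <= budget:
            selected.append(chunk)
            used += cost
    return selected


chunks = [
    Chunk("refund policy allows returns within thirty days", 0.9),
    Chunk("company picnic is scheduled for next friday", 0.1),
    Chunk("refunds are issued to the original payment method", 0.8),
]
# With a 15-token budget, the two refund chunks fit and the picnic chunk is dropped.
context = pack_context(chunks, budget=15)
```

A real system would likely swap in the model's own tokenizer and add compression or summarization for chunks that do not fit, but the budgeting logic stays the same shape.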
The USB-C of Intelligence: Model Context Protocol (MCP)
History will likely view the introduction of the Model Context Protocol (MCP) as the moment agents became viable enterprise software. Before MCP, connecting an LLM to a database or API was a bespoke, brittle integration task.
In 2025, MCP standardized this connection. It became the "USB-C for Agents," allowing developers to build a connector once and have it work across any MCP-compliant model or application, as detailed in Alibaba's comprehensive analysis of MCP features. However, the rollout wasn't without caution; TigerData engineers noted that while MCP solved interoperability, it introduced new attack vectors, arguing that security is its Achilles heel.
From Chatbots to Colleagues
The proof was in the production deployments. Uber revealed Genie, an internal agent that didn't just answer questions but also acted as a "near-human" subject-matter expert. LinkedIn unveiled its Hiring Assistant, an agent that handled complex recruiting workflows using speculative decoding to accelerate inference.
These weren't toys. They were engineered systems with rigorous orchestration, state management, and error handling. The industry formalized patterns like "Prompt Chaining," "Routing," and "Parallelization." We stopped treating agents as magic boxes and started treating them as software components that required a new kind of engineering.
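The three patterns named above can be sketched in a few lines. This is a simplified illustration under stated assumptions: fake_llm is a hypothetical stand-in for a real model call, and the keyword-based router is far cruder than what a production system would use.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Step = Callable[[str], str]


def fake_llm(instruction: str) -> Step:
    # Hypothetical stand-in for a model call: wraps the input so the flow is visible.
    return lambda text: f"{instruction}({text})"


def chain(steps: list[Step], text: str) -> str:
    """Prompt chaining: each step consumes the previous step's output."""
    for step in steps:
        text = step(text)
    return text


def route(text: str, routes: dict[str, Step], default: Step) -> str:
    """Routing: send the input to the first handler whose keyword appears in it."""
    for keyword, handler in routes.items():
        if keyword in text.lower():
            return handler(text)
    return default(text)


def parallel(steps: list[Step], text: str) -> list[str]:
    """Parallelization: run independent steps concurrently, preserving step order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda step: step(text), steps))


summary = chain([fake_llm("extract"), fake_llm("summarize")], "doc")
ticket = route("Refund please", {"refund": fake_llm("billing")}, fake_llm("general"))
reviews = parallel([fake_llm("check_tone"), fake_llm("check_facts")], "draft")
```

Treating each step as a plain function from string to string is the design choice that matters here: it lets orchestration, retries, and error handling live outside the model call.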
2. "Evals" Are The New Unit Tests
If Agent Engineering was the engine of 2025, Evaluation (Evals) was the brakes and the steering wheel.
The "Vibe Coding" era, where we judged models by looking at a few outputs and saying "looks good to me," died a hard death. In 2025, organizations realized they could not ship non-deterministic software without rigorous, deterministic measurement.
The "Judge-LLM" Pattern
How do you test a system that gives a different answer every time? You build a machine to grade the machine.
The industry standardized around the Judge-LLM framework. Booking.com offers practical tips for LLM evaluation, using a "stronger" model (trained on a "Golden Dataset" of human-verified answers) to grade the outputs of "weaker," cheaper production models. Pinterest followed suit with its LLM-powered relevance assessment, replacing costly manual labeling with fine-tuned LLMs that achieved high agreement with human experts.
This wasn't just about checking for "correctness." We developed specific metrics for Hallucination Rate, Instruction Following, and Tone Consistency. Uber built Requirement Adherence systems that extracted rules from standard operating procedures (SOPs) and enforced them in real-time, reducing post-labeling audits by 80%.
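Before a judge model can grade production outputs, it helps to measure how often it agrees with humans on the golden dataset. The sketch below shows that check in its simplest form. The keyword_judge function is a placeholder for a call to a stronger model, and the golden rows are invented for illustration; none of this reflects the internals of the systems named above.

```python
from typing import Callable

Judge = Callable[[str, str], bool]  # (question, answer) -> acceptable?
GoldenRow = tuple[str, str, bool]   # (question, answer, human verdict)


def keyword_judge(question: str, answer: str) -> bool:
    # Placeholder for a judge-model call: rejects empty or evasive answers.
    return bool(answer) and "i don't know" not in answer.lower()


def agreement_rate(judge: Judge, golden: list[GoldenRow]) -> float:
    """Fraction of golden rows where the judge matches the human verdict."""
    matches = sum(judge(q, a) == human for q, a, human in golden)
    return matches / len(golden)


golden = [
    ("What is the check-in time?", "Check-in starts at 3 pm.", True),
    ("Is breakfast included?", "I don't know.", False),
    ("Is parking free?", "Yes, parking is free.", False),  # fluent but wrong per humans
    ("Can I cancel?", "Free cancellation until 24 hours before arrival.", True),
]
# The judge misses the fluent-but-wrong answer, so it agrees on 3 of 4 rows.
rate = agreement_rate(keyword_judge, golden)
```

A low agreement rate is a signal to improve the judge (or the golden dataset) before trusting its grades at scale.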
Evaluation-Driven Development (EDD)
"Test-Driven Development" (TDD) evolved into Evaluation-Driven Development (EDD). Engineers learned that you cannot optimize what you cannot measure.
Infrastructure teams integrated these evals directly into CI/CD pipelines. Databricks shared how they "shifted left" on reliability, scaling database reliability by embedding schema scorers into their build processes to catch data quality issues before they hit production.
The takeaway for every data engineer in 2025 was clear: If you don't have an eval for it, it doesn't exist. You aren't "prompt engineering" until you have a metric that tells you if your changes made things better or worse.
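One way to wire an eval into CI is a regression gate that compares a candidate prompt or model against the current baseline. The following is a minimal sketch, assuming every metric is "higher is better"; the metric names, scores, and tolerance are hypothetical.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> list[str]:
    """Return metrics where the candidate scores below baseline minus tolerance.

    A metric missing from the candidate is scored 0.0, so it counts as a regression.
    """
    return [
        name
        for name, base_score in baseline.items()
        if candidate.get(name, 0.0) < base_score - tolerance
    ]


baseline = {"instruction_following": 0.92, "tone_consistency": 0.88}
candidate = {"instruction_following": 0.95, "tone_consistency": 0.80}

# Instruction following improved, but tone consistency dropped past the tolerance.
regressions = find_regressions(baseline, candidate)
```

A CI step could fail the build whenever the returned list is non-empty, which is what turns an eval from a dashboard into a gate.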
3. The Streaming-Lakehouse Merger: The End of Lambda Architecture
For fifteen years, we lived with the "Lambda Architecture": maintain a fast streaming path (Kafka/Flink) and a slow batch path (Hadoop/Spark), and pray they match. In 2025, we finally merged the lanes.
The barrier between "Stream" and "Table" dissolved. We entered the era of the Streaming Lakehouse.
Stream-Table Duality
The concept of "Stream-Table Duality," long preached by Kafka's creators, became a reality in storage. New engines like Apache Paimon and Apache Fluss emerged to bridge the gap.
Alibaba championed Apache Paimon as a lake format designed specifically for real-time updates, offering the high-throughput ingestion of a stream with the query capabilities of a lakehouse table. Jack Vanlightly's deep dive into Understanding Apache Fluss revealed a system that combines log tablets with KV tablets, effectively creating a database that exposes its own changelog as a first-class citizen.
We stopped debating "Stream vs. Batch" and started designing architectures that ingest data once and make it immediately available for both real-time operational lookups and historical analytical queries.
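Stream-table duality is easiest to see in code: a table is what you get by folding a changelog, and the changelog is the table's history. The sketch below shows only that general idea in plain Python; it does not model how Paimon or Fluss store data, and the event shape is an assumption.

```python
from typing import Optional

# (key, new value); a value of None marks a delete (a "tombstone").
ChangeEvent = tuple[str, Optional[dict]]


def materialize(changelog: list[ChangeEvent]) -> dict[str, dict]:
    """Fold a changelog into the latest state per key."""
    table: dict[str, dict] = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)  # delete, tolerating keys never seen
        else:
            table[key] = value    # upsert
    return table


changelog: list[ChangeEvent] = [
    ("order-1", {"status": "created"}),
    ("order-2", {"status": "created"}),
    ("order-1", {"status": "shipped"}),
    ("order-2", None),
]
# The stream view keeps all four events; the table view keeps only the latest state.
table = materialize(changelog)
```

Replaying a prefix of the same changelog yields the table as of an earlier point in time, which is why one ingested log can serve both operational lookups and historical queries.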
The Zero-Copy Debate
However, this merger wasn't without conflict. The buzzword of the year was "Zero-Copy": the promise that you could point your data warehouse at your Kafka topic and query it without moving bytes.
But seasoned engineers pushed back. WarpStream argued the case for an Iceberg-native database, claiming that coupling your operational message bus directly to your analytical engine violates separation of concerns.
The consensus that emerged? "Zero-Copy" is great for ad-hoc exploration, but for production, Materialization (making a copy) is still the price of performance and isolation.
Diskless Kafka and the Cloud-Native Log
Even Kafka itself couldn't escape the modernization wave. The community rallied around KIP-1150 (Diskless Topics), a proposal to re-architect Kafka for the cloud era.
We realized that in a world of S3 Express and high-speed networking, storing data on local broker disks was an expensive relic. The future of streaming is "Tiered Storage by Default," where the broker is just a caching layer on top of infinite object storage. This shift promises to slash costs and make "infinite retention" streams a standard architectural pattern.
4. The Efficiency Counter-Revolution: Small Data and Rust
While the AI teams were burning cash on GPUs, the Data Infrastructure teams were leading a quiet counter-revolution. After years of defaulting to massive, expensive distributed clusters (Spark/Hadoop) for every problem, 2025 was the year we right-sized our compute.
We realized that "Big Data" tools are overkill for "Medium Data" problems.
The Single-Node Renaissance
"Does this really need a 50-node cluster, or just a bigger laptop?"
That question dismantled pipelines across the industry. Tools like DuckDB and Polars graduated from "analyst favorites" to "production workhorses." Decathlon shared a viral case study about being Ready to Play with Polars, replacing massive Spark clusters with Polars scripts for datasets under 50GB and slashing infrastructure costs to near zero.
Benchmarks from Daniel Beach confirmed this, showing that for 650GB of Data (Delta Lake on S3), single-node engines often beat distributed clusters simply by avoiding network overhead. We stopped being ashamed of "vertical scaling" and started embracing it as a FinOps victory.
The Rust Rewrite
When we did need performance, we turned to Rust.
Agoda explained why they bet on Rust to supercharge their Feature Store, achieving a 5x increase in traffic capacity.
The lesson was clear: The "Java Tax" is real. For critical, low-latency infrastructure, Rust's safety and performance are worth the learning curve. We are entering a new era where the foundational tools of data engineering are being rebuilt, brick by brick, in Rust.
FinOps as Architecture
Efficiency wasn't just about code; it was about architecture. Wix documented how they slashed Spark costs by 60% not by rewriting code, but by migrating workloads from managed services (EMR) to Kubernetes (EKS).
We learned that "Serverless" often means "Wallet-less" if you aren't careful. The smartest teams in 2025 were those who aggressively optimized their compute substrates, moving workloads to Spot instances, ARM processors, and single-node containers whenever possible.
5. Lakehouse 2.0: The Catalog is the New Database
The "Format Wars" (Iceberg vs. Delta vs. Hudi) that dominated the early 2020s largely settled into a peace treaty in 2025. With the rise of interoperability layers, we stopped caring about data folder structure.
The battleground shifted "up the stack" to the Catalog. We realized that the "Lakehouse" is just a database turned inside out, and the Catalog is its operating system.
The Catalog Wars
In 2025, the Catalog stopped being just a "list of tables" and became the active control plane for the enterprise.
New concepts like DuckLake challenged the status quo, proposing that we use DuckDB itself as a catalog and metadata layer, replacing the heavy, complex Hive Metastore with a lightweight, transactional database.
Hyperscalers (AWS, Snowflake, Databricks) all converged on Managed Iceberg services. As Simon Späti noted in The Open Table Format Revolution, the major players stopped fighting the open format and started trying to own the metadata layer that manages it. The value proposition shifted from "we store your data" to "we govern your transactions."
6. The "Context" Supply Chain: Unstructured Data & Knowledge Graphs
For decades, Data Engineering was about rows and columns. In 2025, we had to get good at text, images, and relationships. To support the GenAI revolution, we had to build the Context Supply Chain.
We aren't just moving data anymore; we are moving meaning.
Knowledge Graphs Return
The most surprising comeback of 2025 was the Knowledge Graph. As we struggled with LLM hallucinations, we realized that probabilistic models need deterministic facts to ground them.
Netflix details its Unified Data Architecture and how it is Unlocking Entertainment Intelligence with Knowledge Graph to provide a "source of truth" for its AI models.
We learned that "RAG" (Retrieval-Augmented Generation) isn't just about vector search. It's about "GraphRAG": using the relationships between data points to provide richer, more accurate context to the model.
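The core move behind graph-based retrieval can be sketched as neighborhood expansion: find the entities a question mentions, then walk their relationships to collect grounded facts. This is a toy illustration, not a specific GraphRAG implementation or Netflix's system; the edges, entity names, and substring matching are all hypothetical.

```python
Edge = tuple[str, str, str]  # (subject, relation, object)

EDGES: list[Edge] = [
    ("Show A", "directed_by", "Director X"),
    ("Director X", "also_directed", "Show B"),
    ("Show B", "genre", "Thriller"),
    ("Show C", "genre", "Comedy"),
]


def graph_context(query: str, edges: list[Edge], hops: int = 1) -> list[str]:
    """Collect facts reachable within `hops` edges of entities named in the query."""
    entities = {s for s, _, _ in edges} | {o for _, _, o in edges}
    # Naive entity linking: case-insensitive substring match against the query.
    frontier = {e for e in entities if e.lower() in query.lower()}
    facts: list[str] = []
    seen: set[Edge] = set()
    for _ in range(hops):
        next_frontier: set[str] = set()
        for edge in edges:
            subject, relation, obj = edge
            if edge not in seen and (subject in frontier or obj in frontier):
                seen.add(edge)
                facts.append(f"{subject} {relation} {obj}")
                next_frontier.update((subject, obj))
        frontier = next_frontier
    return facts


# Two hops link the show to its director and to the director's other work.
facts = graph_context("Who directed Show A?", EDGES, hops=2)
```

The returned facts would be placed in the prompt alongside (or instead of) vector-search results, giving the model explicit relationships rather than loosely similar text.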
The Embedding Pipeline
Data Engineers had to master a new type of transformation: the Embedding.
We moved beyond simple "Word2Vec" tutorials. Milvus released guides on how to choose the right embedding model for RAG, helping teams select between LLM2Vec, BGE-M3, and others for specific domains. We built pipelines that chunked documents, generated embeddings, and stored them in vector databases, treating "semantic distance" as a first-class data type.
Unstructured Data Management
We formally recognized Unstructured Data Management at Scale as a core competency. Piethein Strengholt argued that it wasn't enough to dump PDFs into an S3 bucket; we needed to parse, clean, chunk, and govern that data with the same rigor we apply to our financial tables.
The "Medallion Architecture" (Bronze/Silver/Gold) was adapted for unstructured data. We started talking about "Raw Documents" (Bronze), "Parsed & Chunked Text" (Silver), and "Curated Embeddings" (Gold).
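The Bronze/Silver/Gold flow for documents can be sketched end to end in a few functions. This is a minimal illustration under loud assumptions: toy_embed is a hash-based placeholder that carries no semantic meaning and stands in for a real embedding model, and fixed-size word chunking is the simplest possible strategy.

```python
import hashlib
import re


def parse(raw: str) -> str:
    """Bronze to Silver, step 1: collapse whitespace left over from extraction."""
    return re.sub(r"\s+", " ", raw).strip()


def chunk(text: str, max_words: int = 5) -> list[str]:
    """Bronze to Silver, step 2: split into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text: str, dims: int = 4) -> list[float]:
    """Silver to Gold: placeholder for a real embedding model (not semantic)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dims]]


bronze = "Refunds  are issued\nwithin thirty days.\n\nContact support for help."
silver = chunk(parse(bronze))
gold = [{"chunk": c, "embedding": toy_embed(c)} for c in silver]
```

Keeping each layer as a separate, replayable step means a better parser or a new embedding model can be rolled out by reprocessing from the layer below, the same way tabular medallion pipelines are rebuilt.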
7. Governance 2.0: The Safety Brake for Autonomous Agents
As AI agents began taking actions (booking interviews, executing code, modifying databases), Governance stopped being a "compliance checkbox" and became a "safety brake."
In 2024, if a dashboard was wrong, a manager made a bad decision. In 2025, if an agent is wrong, it might delete a production database or leak PII to a competitor.
Privacy-Aware Infrastructure
Meta writes about discovering data flows via lineage at scale. They didn't just write policies; they wrote code that tracked data lineage and enforced purpose limitations at the exabyte scale. Meta introduced Policy Zones to ensure that data collected for one purpose (e.g., safety) couldn't be used for another (e.g., ad targeting) without explicit permission.
This level of granularity is the future. We are moving toward systems where every piece of data carries its own "passport" of permissions, and every agent must present a "visa" to access it.
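The "passport and visa" idea reduces to a purpose check at access time. The sketch below is a hypothetical illustration of purpose limitation in general, not Meta's Policy Zones implementation; the DataAsset and AccessRequest types and all names in the example are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataAsset:
    name: str
    allowed_purposes: frozenset[str]  # the data's "passport"


@dataclass(frozen=True)
class AccessRequest:
    agent: str
    purpose: str  # the agent's declared "visa"


def authorize(asset: DataAsset, request: AccessRequest) -> bool:
    """Allow access only when the declared purpose is one the data was collected for."""
    return request.purpose in asset.allowed_purposes


login_events = DataAsset("login_events", frozenset({"safety"}))

safety_ok = authorize(login_events, AccessRequest("fraud-agent", "safety"))
ads_ok = authorize(login_events, AccessRequest("ads-agent", "ad_targeting"))
```

In practice the hard part is propagating these purpose labels through lineage so that derived tables inherit the restrictions of their sources; the check itself stays this simple.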
Shadow AI and the New Perimeter
The rise of "Shadow AI" (engineers spinning up local LLMs or using unapproved APIs) forced data teams to harden the perimeter.
We saw the emergence of Data Contracts as a defense mechanism. Grab implemented real-time data quality monitoring with strict contracts for its Kafka streams, ensuring that bad data was rejected before it could poison the downstream AI models.
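A data contract can be as simple as a declared schema that every record must satisfy before it moves downstream. The following sketch shows that rejection step under stated assumptions: the contract fields and sample records are hypothetical, and it does not reflect how Grab's monitoring is built.

```python
# Hypothetical contract: field name -> required Python type.
CONTRACT: dict[str, type] = {"ride_id": str, "fare": float, "city": str}


def violations(record: dict, contract: dict[str, type]) -> list[str]:
    """List every way a record breaks the contract (empty list means valid)."""
    problems: list[str] = []
    for field, expected in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


def split_stream(records: list[dict], contract: dict[str, type]) -> tuple[list[dict], list[dict]]:
    """Route valid records onward and invalid ones to a reject (dead-letter) list."""
    accepted: list[dict] = []
    rejected: list[dict] = []
    for record in records:
        (rejected if violations(record, contract) else accepted).append(record)
    return accepted, rejected


good = {"ride_id": "r1", "fare": 12.5, "city": "Singapore"}
bad = {"ride_id": "r2", "fare": "12.5"}  # fare is a string and city is missing
accepted, rejected = split_stream([good, bad], CONTRACT)
```

Keeping rejected records rather than dropping them gives producers something concrete to debug, which is much of what makes a contract enforceable in practice.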
Governance in 2025 is about Observability. It's about knowing exactly which agent accessed which document, why, and what it did with it.
Conclusion: The Era of the Context Engineer
Looking back at 2025, it is clear that the role of the Data Engineer has fundamentally changed.
We are no longer just building pipelines to move data from Point A to Point B. We are building the enterprise's Cognitive Nervous System.
We are Agent Engineers, designing the workflows that allow AI to reason and act.
We are Eval Architects, building the metric systems that keep AI honest.
We are Context Curators, ensuring that the "meaning" of our data is preserved and accessible.
We are Efficiency Experts, maximizing the ROI of every compute cycle in a world of expensive GPUs.
The tools will continue to change. Spark might yield to Polars; Kafka might yield to object storage. But the discipline of engineering, of rigor, measurement, and architecture, is stronger than ever.
All rights reserved, Dewpeche Pvt Ltd, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.
