Data Engineering Weekly #240

Source: Data Engineering Weekly


The data platform playbook everyone’s using

We wrote an eBook on Data Platform Fundamentals to help you join the happy data teams operating under a single platform.


In this book, you’ll learn:
- How composable architectures allow teams to ship faster
- Why data quality matters and how you can catch issues before they reach users
- What observability means, and how it will help you solve problems more quickly

Get the guide now


Kyle Weller: Apache Iceberg™ vs Delta Lake vs Apache Hudi™ - Feature Comparison Deep Dive

The industry has moved on from the Great Lakehouse Debate, and Lakehouse adoption is growing rapidly. This blog from Onehouse revisits the current state of Lakehouse systems as features like real-time mutation capabilities and multi-table/multi-query transactions become more mainstream.

https://www.onehouse.ai/blog/apache-hudi-vs-delta-lake-vs-apache-iceberg-lakehouse-feature-comparison


Anthropic: Effective context engineering for AI agents

As AI agents evolve beyond simple prompt-based systems, managing the limited “attention budget” of large language models has become a central engineering challenge. Anthropic writes about context engineering—a discipline focused on curating, compressing, and dynamically retrieving only the most relevant tokens during inference to sustain coherent, efficient, and long-horizon agent behavior. By combining strategies such as compaction, structured note-taking, and multi-agent architectures, the context engineering approach enables agents to act more autonomously while maintaining focus and reliability across extended tasks.
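To make the compaction idea concrete, here is a minimal sketch of folding older conversation turns into a short note so the recent turns fit a fixed attention budget. The token counter and the note format are stand-ins (a real system would use a tokenizer and an LLM summarizer); this is not Anthropic's implementation.

```python
def rough_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words."""
    return len(text.split())

def compact_context(turns: list[str], budget: int) -> list[str]:
    """Keep the newest turns whole; fold everything older into one note."""
    kept, used = [], 0
    for turn in reversed(turns):          # walk newest-first
        cost = rough_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    older = turns[: len(turns) - len(kept)]
    prefix = [f"[note: {len(older)} earlier turns compacted]"] if older else []
    return prefix + list(reversed(kept))

history = [
    "user: load the sales table",
    "agent: loaded 1.2M rows",
    "user: aggregate revenue by region",
    "agent: done, 8 regions",
]
print(compact_context(history, budget=10))
```

The same skeleton extends naturally to the post's other tactics: the dropped turns could feed a structured note store instead of a one-line placeholder.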

https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents


Uber: How Uber Standardized Mobile Analytics for Cross-Platform Insights

Inconsistent event definitions and ad-hoc instrumentation led to fragmented analytics data and high developer overhead. Uber’s Mobile Analytics team writes about rebuilding its analytics platform with standardized event types, centralized metadata collection, and a unified instrumentation framework that automatically emits consistent tap, impression, and scroll events across platforms. This standardization reduced redundant code, improved cross-platform data quality and reliability, and laid the foundation for a component-based analytics model that simplifies event naming and lifecycle management at scale.

https://www.uber.com/en-IN/blog/how-uber-standardized-mobile-analytics/


Sponsored: From Chaos to Real-Time Insights for Manufacturing

In this Deep Dive, Supplyco’s CTO Claudia Richoux will reveal how they built a pipeline in Dagster that processes 100,000+ data streams in real time — while ensuring 99.99% uptime. You’ll see their DAG architecture in action, learn how they built observability into every layer, and how they treat “data as code” to ship fast and smart.

Save your spot now


Netflix: 100X Faster: How We Supercharged Netflix Maestro’s Workflow Engine

Large-scale workflow orchestrators often hit scalability goals but still impose multi-second engine overheads that slow iteration and block emerging low-latency use cases. Netflix writes about rebuilding Maestro’s core with a lightweight, stateful actor model using Java virtual threads, in-memory state, internal exactly-once queues, and group-based partitioning. The redesign delivers ~100× faster execution (step launch 5s→50ms; workflow start 200ms→50ms), smoothly migrated ~60k workflows (>1M tasks/day), and cut operational load (≈90% fewer DB queries, ~40 TB of legacy tables removed).
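A toy sketch of the stateful-actor idea behind the redesign: each workflow gets one actor that owns its state in memory and processes messages serially from its own queue, so no per-step database round trip or lock is needed. Java virtual threads are approximated here with a plain Python thread; the message names are invented.

```python
import queue
import threading

class WorkflowActor:
    def __init__(self, workflow_id: str):
        self.workflow_id = workflow_id
        self.state = {"steps_done": 0}     # in-memory, actor-owned
        self.inbox: queue.Queue = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            msg = self.inbox.get()
            if msg is None:                # shutdown sentinel
                break
            # Messages are handled one at a time: no locks on state needed.
            if msg == "step_finished":
                self.state["steps_done"] += 1

    def tell(self, msg):
        self.inbox.put(msg)

    def stop(self):
        self.inbox.put(None)
        self.thread.join()

actor = WorkflowActor("wf-1")
for _ in range(3):
    actor.tell("step_finished")
actor.stop()
print(actor.state["steps_done"])   # 3
```

Keeping the hot state inside the actor is what removes the multi-second engine overhead; durability then becomes a background concern rather than a per-step one.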

https://netflixtechblog.com/100x-faster-how-we-supercharged-netflix-maestros-workflow-engine-028e9637f041


Alibaba: Building a Unified Lakehouse for Large-Scale Recommendation Systems with Apache Paimon at TikTok

TikTok writes about the challenges of fragmented feature pipelines, complex Lambda architectures, and multi-day latency in processing user behavior data. The blog discusses rebuilding its recommendation data foundation as a unified Lakehouse on Apache Paimon, integrating Flink and Spark for exactly-once, stream-batch unified processing and point-in-time joins.
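A stdlib sketch of the point-in-time join mentioned above: for each label event, fetch the latest feature value observed at or before the label's timestamp, so no feature information leaks from the future. The data shapes are illustrative, not Paimon's API.

```python
import bisect

def point_in_time_join(labels, feature_log):
    """labels: [(ts, user)]; feature_log: {user: sorted [(ts, value)]}."""
    out = []
    for ts, user in labels:
        history = feature_log.get(user, [])
        times = [t for t, _ in history]
        i = bisect.bisect_right(times, ts) - 1   # last update at or before ts
        value = history[i][1] if i >= 0 else None
        out.append((ts, user, value))
    return out

features = {"u1": [(1, "a"), (5, "b"), (9, "c")]}
print(point_in_time_join([(4, "u1"), (9, "u1"), (0, "u1")], features))
# [(4, 'u1', 'a'), (9, 'u1', 'c'), (0, 'u1', None)]
```

In a Lakehouse table with snapshot history, the same "as-of" lookup is what lets training pipelines reconstruct features exactly as they were at serving time.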

https://www.alibabacloud.com/blog/building-a-unified-lakehouse-for-large-scale-recommendation-systems-with-apache-paimon-at-tiktok_602568


Sponsored: Join the industry’s first survey on data platform migrations.

Most data teams have faced code or logic migrations—rewriting SQL, translating stored procedures, or replatforming ETL workloads—and they’re rarely straightforward. But there’s no clear benchmark for how long these projects take, what they cost, or how teams actually validate success.

Datafold is running a 5-minute survey to collect ground truth from practitioners on timelines, validation, budgets, and how AI is introducing new ways of tackling code translation and validation. Qualified Data Engineering Weekly subscribers (with a valid work email) get a $25 gift card, and all participants receive early access to the results.

Take the survey.


Netflix: Building a Resilient Data Platform with Write-Ahead Log at Netflix

Ensuring data consistency and durability across Netflix’s massive, multi-datastore ecosystem proved challenging as teams battled data corruption, replication gaps, and unreliable retries. Netflix describes building a generic Write-Ahead Log (WAL) abstraction. This pluggable, distributed system captures and replays mutations with strong durability guarantees across storage engines, including Cassandra, EVCache, and Kafka.
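A minimal sketch of the write-ahead-log pattern the post describes: mutations are durably appended first, then replayed into the target store, and replay is idempotent so retries after a failure are safe. The in-memory list stands in for durable append-only storage; this is not Netflix's actual abstraction.

```python
class WriteAheadLog:
    def __init__(self):
        self.entries = []      # stand-in for durable append-only storage

    def append(self, key, value) -> int:
        seq = len(self.entries)
        self.entries.append({"seq": seq, "key": key, "value": value})
        return seq

    def replay(self, store: dict, applied_through: int = -1) -> int:
        """Apply entries after `applied_through`; return last applied seq."""
        last = applied_through
        for e in self.entries[applied_through + 1:]:
            store[e["key"]] = e["value"]   # idempotent upsert
            last = e["seq"]
        return last

wal = WriteAheadLog()
wal.append("user:1", "alice")
wal.append("user:1", "alicia")   # later mutation wins on replay
store = {}
last = wal.replay(store)
print(store, last)   # {'user:1': 'alicia'} 1
```

Because each consumer only tracks its own `applied_through` sequence, the same log can be replayed independently into different targets (a cache, a primary store, a stream), which is the crux of the multi-datastore consistency story.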

https://netflixtechblog.com/building-a-resilient-data-platform-with-write-ahead-log-at-netflix-127b6712359a


Dropbox: A practical blueprint for evaluating conversational AI at scale

Building reliable LLM products requires as much rigor in evaluation as in model design. Dropbox shares how it built a structured, automated evaluation pipeline for Dropbox Dash, treating prompt and model changes like production code—tested, versioned, and gated by LLM-based judges and real-world datasets. By combining curated benchmarks, automated regression gates, live-traffic scoring, and continuous feedback loops, Dropbox turned evaluation into an engineering discipline that catches regressions early, enforces factual accuracy, and drives faster, safer iteration on AI features.
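In the spirit of that pipeline, here is a sketch of an automated regression gate: a candidate prompt or model is scored on a fixed benchmark by a judge, and the change is blocked if it falls below the shipped baseline minus a margin. The judge here is a trivial substring stub; in practice it would be an LLM-based grader, and the benchmark and margin are invented.

```python
def judge(answer: str, reference: str) -> float:
    """Stub grader: 1.0 if the reference fact appears in the answer."""
    return 1.0 if reference.lower() in answer.lower() else 0.0

def score(model, benchmark) -> float:
    scores = [judge(model(q), ref) for q, ref in benchmark]
    return sum(scores) / len(scores)

def regression_gate(candidate, baseline_score: float, benchmark,
                    margin: float = 0.02) -> bool:
    """Pass only if the candidate is no worse than baseline - margin."""
    return score(candidate, benchmark) >= baseline_score - margin

benchmark = [("capital of France?", "Paris"), ("2 + 2?", "4")]
good_model = lambda q: "Paris" if "France" in q else "4"
bad_model = lambda q: "I am not sure."
print(regression_gate(good_model, 1.0, benchmark))   # True
print(regression_gate(bad_model, 1.0, benchmark))    # False
```

Wiring a gate like this into CI is what turns prompt changes into "production code": a regression fails the build instead of reaching users.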

https://dropbox.tech/machine-learning/practical-blueprint-evaluating-conversational-ai-at-scale-dash


Hugging Face: Introducing RTEB: A New Standard for Retrieval Evaluation

Accurately measuring how well embedding models retrieve unseen information has become a critical gap in evaluating modern AI systems. Hugging Face writes about the Retrieval Embedding Benchmark (RTEB), a new standard that combines open and private datasets to assess true generalization and prevent "teaching to the test." RTEB is designed for real-world, multilingual, and domain-specific text-retrieval use cases.
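For context, here is a small sketch of the kind of metric such a benchmark aggregates per query: NDCG@k over a ranked result list against graded relevance judgments. The example qrels are illustrative, not RTEB's actual datasets or scoring harness.

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """ranked_ids: doc ids in retrieved order; relevance: {id: graded rel}."""
    dcg = sum(relevance.get(doc, 0) / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

qrels = {"d1": 3, "d2": 1}
print(ndcg_at_k(["d1", "d2", "d9"], qrels, k=3))   # perfect ordering -> 1.0
```

The benchmark's key design point is orthogonal to the formula: scoring part of the evaluation on private held-out datasets is what keeps a high NDCG from being memorized rather than generalized.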

https://huggingface.co/blog/rteb


DuckDB: Redesigning DuckDB’s Sort, Again

Handling large, out-of-memory sorts efficiently has long been a performance bottleneck for analytical databases such as DuckDB. In version 1.4.0, DuckDB introduces a fully redesigned sort engine that integrates its new spillable page layout, normalized sort keys, adaptive hybrid algorithms, and a parallel k-way merge for fine-grained scalability. The result is up to 10x faster performance on pre-sorted data, 3x faster wide-table sorts, and significantly improved parallel scaling, setting a new foundation for future sort-based joins and window functions.
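The k-way merge at the heart of external sorting can be sketched in a few lines: sorted runs (in DuckDB's case, spilled pages) are merged with a min-heap so only the head of each run needs to be in memory at once. This is a conceptual illustration, not DuckDB's parallel implementation.

```python
import heapq

def k_way_merge(runs):
    """Lazily merge k sorted runs into one sorted stream."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    while heap:
        value, run_idx, pos = heapq.heappop(heap)
        yield value
        nxt = pos + 1
        if nxt < len(runs[run_idx]):
            heapq.heappush(heap, (runs[run_idx][nxt], run_idx, nxt))

runs = [[1, 4, 7], [2, 5], [0, 9]]
print(list(k_way_merge(runs)))   # [0, 1, 2, 4, 5, 7, 9]
```

DuckDB's redesign parallelizes this merge by partitioning the key range so multiple threads can merge disjoint slices at once, which is where the fine-grained scalability comes from.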

https://duckdb.org/2025/09/24/sorting-again.html


All rights reserved, Dewpeche Pvt Ltd, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.