How to Build a Data Platform

We wrote an eBook on Data Platform Fundamentals to help you be like the happy data teams, operating under a single platform.

In this book, you’ll learn:

- How composable architectures allow teams to ship faster
- Why data quality matters and how you can catch issues before they reach users
- What observability means, and how it will help you solve problems more quickly

Download your free copy now

Netflix: The Evolution of Cassandra Data Movement at Netflix

Change Data Capture (CDC) from an operational store is often expensive: it involves multiple staging hops and a costly merge operation in Iceberg. Netflix shares one such case study on Cassandra, covering the challenges of capturing operational data into Iceberg tables and its layered approach to avoiding partition skew.

https://netflixtechblog.medium.com/the-evolution-of-cassandra-data-movement-at-netflix-6e13329c80a1

Grab: The Hugo evolution: Engineering Grab’s unified, one-click data ingestion platform with Apache Flink

Grab describes a similar CDC challenge: data ingestion fragmented across multiple operational data stores, with schema management and ingestion issues. The unified, Flink-based pipeline auto-detects schema changes and ingests the data into Hive tables.

https://engineering.grab.com/one-click-data-ingestion-platform-with-apache-flink

Meta: A Blueprint for Valuing Content When A/B Tests Are Not an Option

Content is a primary driver of the Quest ecosystem. With the recent Google I/O announcement of seamless shopping integration with content, it is evident that content-driven commerce has reached the mainstream. How do you value content when A/B testing is not an option? Meta writes about applying the Double Machine Learning (DoubleML) method to tackle the challenge.

https://medium.com/@AnalyticsAtMeta/meta-a-blueprint-for-valuing-content-when-a-b-tests-are-not-an-option-7880bac721f1
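
For readers new to DoubleML, here is a minimal sketch of the partialling-out idea with cross-fitting. It is not Meta's implementation: the simulated data, the single confounder, and the plain linear nuisance models are all assumptions made to keep the example short (in practice the nuisance models are usually flexible ML learners).

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def double_ml_effect(x, d, y, folds=2):
    """Estimate the effect of treatment d on outcome y, adjusting for confounder x.

    Cross-fitting: nuisance models are fit on the other folds, then used to
    residualize the held-out fold. The effect is the slope of the outcome
    residuals on the treatment residuals.
    """
    n = len(x)
    res_d, res_y = [0.0] * n, [0.0] * n
    for k in range(folds):
        train = [i for i in range(n) if i % folds != k]
        held_out = [i for i in range(n) if i % folds == k]
        a_d, b_d = fit_line([x[i] for i in train], [d[i] for i in train])
        a_y, b_y = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in held_out:
            res_d[i] = d[i] - (a_d + b_d * x[i])
            res_y[i] = y[i] - (a_y + b_y * x[i])
    return sum(rd * ry for rd, ry in zip(res_d, res_y)) / sum(rd * rd for rd in res_d)

def simulate(n=2000, true_effect=2.0, seed=7):
    """Hypothetical data: x drives both exposure d and outcome y (confounding)."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    d = [0.8 * xi + rng.gauss(0, 1) for xi in x]
    y = [true_effect * di + 1.5 * xi + rng.gauss(0, 1) for di, xi in zip(d, x)]
    return x, d, y

x, d, y = simulate()
naive_slope = fit_line(d, y)[1]          # biased upward by the confounder
adjusted = double_ml_effect(x, d, y)     # close to the simulated effect of 2.0
```

The naive regression of outcome on exposure absorbs the confounder's influence, while the residual-on-residual regression recovers something close to the simulated effect.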

Uber: Scaling Real-Time Traffic Forecasting with a Graph-Aware Transformer

Uber writes about rebuilding its traffic forecasting stack as DeepETT, a real-time traffic forecasting system. DeepETT approaches forecasting with a fixed-input, graph-aware transformer that combines pre-aggregated segment, road-graph, regional, historical, real-time, and event features with continuous Flink-based calibration.

https://www.uber.com/us/en/blog/scaling-real-time-traffic/

Airbnb: Scaling Airbnb’s identity graph with a unified knowledge graph infrastructure

Counting and finding unique users are the two hard problems in data engineering.

One of the long-standing questions in data engineering is: since many real-world systems are fundamentally about connections, why can’t we model them using the graph data model? Airbnb highlights the reasons behind the scalability issues with graph systems and describes its adoption of JanusGraph with DynamoDB as the backend.

https://medium.com/airbnb-engineering/scaling-airbnbs-identity-graph-with-a-unified-knowledge-graph-infrastructure-ebac467b7836

Pinterest: Making User-Sequence Data More Cost-Efficient, Faster, and Easier to Use

The user journey, i.e., the sequence of a user's actions, is one of the most important signals for analyzing user behavior. Pinterest publishes a comprehensive case study on how to treat user-sequence data as a product, along with its architectural patterns.

https://medium.com/pinterest-engineering/making-user-sequence-data-more-cost-efficient-faster-and-easier-to-use-2a56a928cae1

Yelp: How Partition Access Visualizations Reduced our Data Lake S3 Cost by 33%

Usage-driven data retention and storage-class optimization are must-have tools for lakehouse management, given the growing need to ingest more data. Yelp applies the art and science of table management by collecting usage metrics at the table-partition level to optimize storage.

https://engineeringblog.yelp.com/2026/05/partition-access-visualizations.html
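
To make the partition-level idea concrete, here is a small sketch that maps each partition's last access date to a suggested S3 storage class. It is not Yelp's tooling: the function name, the 30-day and 180-day thresholds, and the sample partitions are hypothetical, and real policies would also weigh retrieval cost and minimum storage durations.

```python
from datetime import date

def recommend_storage_class(last_access, today, warm_after_days=30, cold_after_days=180):
    """Suggest an S3 storage class from how long a partition has been idle.

    The thresholds are illustrative assumptions, not values from the Yelp post.
    """
    idle_days = (today - last_access).days
    if idle_days >= cold_after_days:
        return "GLACIER"
    if idle_days >= warm_after_days:
        return "STANDARD_IA"
    return "STANDARD"

# Hypothetical usage metrics: partition -> date it was last read.
last_access_by_partition = {
    "events/dt=2026-05-01": date(2026, 5, 20),
    "events/dt=2026-03-01": date(2026, 4, 1),
    "events/dt=2025-06-01": date(2025, 7, 15),
}

today = date(2026, 5, 25)
plan = {
    partition: recommend_storage_class(last_access, today)
    for partition, last_access in last_access_by_partition.items()
}
```

With these sample dates the plan keeps the recently read partition in STANDARD, moves the one idle for about two months to STANDARD_IA, and sends the one untouched for most of a year to GLACIER.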

LinkedIn: Crosscheck: Benchmarking AI Models in the Real World

Static AI benchmarks lose signal as models optimize toward them, collapsing role-, industry-, and task-specific performance into one number that answers no professional’s actual question. LinkedIn writes about Crosscheck, which extends the Bradley-Terry comparison model with time-decay weighting, low-data regularization, and confidence-aware ordinal tiering — surfacing only differences supported by 95% statistical evidence.

https://www.linkedin.com/blog/engineering/ai/crosscheck-benchmarking-ai-models-in-the-real-world
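
As a rough illustration of the first two ingredients above, the sketch below fits Bradley-Terry strengths from pairwise wins, down-weighting older comparisons with an exponential half-life and adding a small virtual tie between every pair as low-data regularization. It is not Crosscheck's implementation: the half-life, the prior, the model names, and the choice of the standard iterative (MM) update are assumptions, and the confidence-aware tiering step is left out.

```python
from itertools import combinations

def fit_bradley_terry(comparisons, now, half_life_days=30.0, prior=0.1, iters=200):
    """comparisons: list of (winner, loser, day). Returns model -> strength (sums to 1).

    Needs at least two models and prior > 0 so that every pair has some weight.
    """
    models = sorted({m for w, l, _ in comparisons for m in (w, l)})
    wins = {m: 0.0 for m in models}
    pair_weight = {}
    for winner, loser, day in comparisons:
        # Time decay: a comparison loses half its weight every half_life_days.
        weight = 0.5 ** ((now - day) / half_life_days)
        wins[winner] += weight
        key = frozenset((winner, loser))
        pair_weight[key] = pair_weight.get(key, 0.0) + weight
    # Regularization: each pair gets `prior` virtual wins in both directions.
    for a, b in combinations(models, 2):
        wins[a] += prior
        wins[b] += prior
        key = frozenset((a, b))
        pair_weight[key] = pair_weight.get(key, 0.0) + 2 * prior
    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(iters):
        updated = {}
        for m in models:
            denom = sum(
                pair_weight[frozenset((m, o))] / (strength[m] + strength[o])
                for o in models
                if o != m
            )
            updated[m] = wins[m] / denom
        total = sum(updated.values())
        strength = {m: v / total for m, v in updated.items()}
    return strength

# Hypothetical history: model_b won often long ago, model_a wins recently.
history = [("model_b", "model_a", 0)] * 10 + [("model_a", "model_b", 100)] * 3
recent_view = fit_bradley_terry(history, now=100, half_life_days=10.0)
flat_view = fit_bradley_terry(history, now=100, half_life_days=1e9)
```

With a short half-life the recent wins dominate and model_a ranks first; with an effectively infinite half-life the older, larger batch of wins keeps model_b ahead.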

Jack Vanlightly: Introducing Dimster, a performance benchmarking tool for Apache Kafka

Kafka performance benchmarks rarely travel — results lack the configuration, hardware, and version metadata that another engineer needs to reproduce or trust them. The author builds Dimster, a Kafka benchmarking tool centered on dimensional testing — sweeping config axes like batch.size or consumer type while emitting self-contained result bundles. Dimster runs explore, drain-backlog, and correctness modes on Kubernetes as a portable runtime, making benchmark campaigns reproducible across any cloud or laptop, anchored to traceable result artifacts.

https://jack-vanlightly.com/blog/2026/5/20/introducing-dimster-a-performance-benchmarking-tool-for-apache-kafka
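
The idea of sweeping config axes can be sketched as a Cartesian product over the values of each dimension. This is not Dimster's API or file format: the function name, the base config, and the chosen axis values are hypothetical, and only the Kafka property names (batch.size, linger.ms, acks) are real client settings.

```python
from itertools import product

def expand_sweep(base, axes):
    """Return one config dict per combination of axis values.

    base: settings shared by every run; axes: name -> list of values to sweep.
    """
    names = list(axes)
    runs = []
    for values in product(*(axes[name] for name in names)):
        config = dict(base)              # copy so runs stay independent
        config.update(zip(names, values))
        runs.append(config)
    return runs

base_config = {"acks": "all"}
sweep_axes = {
    "batch.size": [16384, 65536, 262144],
    "linger.ms": [0, 10],
}
runs = expand_sweep(base_config, sweep_axes)
```

Each resulting dict fully describes one run, which is the property that makes a result traceable back to the exact configuration that produced it.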

All rights reserved, Dewpeche Private Limited. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.

Data Engineering Weekly #271