Data Engineering Weekly #259

Source: Data Engineering Weekly

AI is moving fast. Is your data platform ready?

AI is reshaping how data teams operate. But legacy pipelines, brittle workflows, and fragmented tooling weren’t designed for this shift.

Learn how leading teams are future-proofing their infrastructure before AI demands overwhelm it.

Download the AI Modernization Guide


underCurrent: A one-day conference for data engineers and architects

Confluent is hosting a free one-day conference with a catch: there’s no catch. It’s a single-track event with no sponsors and no product pitches—just technical talks for data engineers and architects.

🎙️ Speakers include Joe Reis, Holden Karau, and Max Beauchemin
🚫 No vendors. No sales pitches
✨ 100% free to attend
📍 San Francisco 📅 March 26
🎟️ Limited to 100 seats — register for free here


Netflix: DataJunction as Netflix’s answer to the missing piece of the modern data stack

Metric inconsistency and definition sprawl across distributed teams create onboarding bottlenecks and fragment analytics workflows. Netflix built DataJunction, an open-source semantic layer that decouples metric definitions from compute through a graph-based metadata model and SQL generation engine. This standardizes metrics across the experimentation platform, reducing onboarding from weeks to hours, while enabling expansion across all business verticals and LLM integration for auditable metric lineage.
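The core idea of decoupling metric definitions from compute can be sketched as a small node graph with SQL generation. This is an illustrative toy, not DataJunction's actual API: the `Node`, `SemanticGraph`, and the example table/metric names are all invented for the sketch.

```python
# Toy sketch of a semantic-layer node graph: metric definitions reference
# upstream nodes by name, and SQL is generated from the graph instead of
# being hand-written (and re-derived) by each team.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    expression: str                               # SQL fragment defining this node
    parents: list = field(default_factory=list)   # upstream node names

class SemanticGraph:
    def __init__(self):
        self.nodes = {}

    def add(self, node):
        self.nodes[node.name] = node

    def generate_sql(self, metric_name):
        """Expand a metric into a CTE chain over its upstream nodes."""
        order = []
        def visit(name):
            for parent in self.nodes[name].parents:
                visit(parent)
            if name not in order:
                order.append(name)
        visit(metric_name)
        ctes = ",\n".join(f"{n} AS ({self.nodes[n].expression})" for n in order[:-1])
        final = self.nodes[metric_name].expression
        return f"WITH {ctes}\n{final}" if ctes else final

# Hypothetical example: one source node, one metric built on top of it.
g = SemanticGraph()
g.add(Node("plays", "SELECT user_id, title_id FROM fact_playback"))
g.add(Node("daily_plays", "SELECT COUNT(*) AS n FROM plays", parents=["plays"]))
print(g.generate_sql("daily_plays"))
```

Because every consumer expands the same definition, the metric cannot drift between teams, and the graph doubles as lineage metadata.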

https://netflixtechblog.medium.com/datajunction-as-netflixs-answer-to-the-missing-piece-of-the-modern-data-stack-92af926b40a5


Benoit Pimpaud: Specs Should Be Equations, Not Essays

As AI automates code generation, the engineering bottleneck shifts from writing implementations to defining precise specifications. The author argues that natural-language specifications create compounding ambiguity when parsed by LLMs and proposes layered specifications that combine text, diagrams, and mathematical notation as constraint definitions for AI iteration. Mathematical specs eliminate interpretation drift, enabling AI agents to generate correct programs by satisfying invariants rather than reconstructing intent from prose.
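One way to make the "satisfying invariants" idea concrete is to write the spec as executable predicates. The example below is my own illustration, not from the article; the discount function and its invariants are invented:

```python
# A "spec as equations" for a hypothetical discount function: the spec is
# a set of checkable invariants, not prose. An AI agent (or a human) can
# iterate on the implementation until every invariant holds.

def discounted(price: float, rate: float) -> float:
    return round(price * (1 - rate), 2)

# Invariants playing the role of the mathematical spec:
#   0 <= discounted(p, r) <= p   for all p >= 0, 0 <= r <= 1
#   discounted(p, 0) == p        (identity: no discount changes nothing)
#   discounted(p, 1) == 0        (annihilation: full discount yields zero)
for p in (0.0, 9.99, 120.0):
    for r in (0.0, 0.25, 1.0):
        d = discounted(p, r)
        assert 0 <= d <= p
assert discounted(50.0, 0.0) == 50.0
assert discounted(50.0, 1.0) == 0.0
```

Unlike an essay-style spec, there is nothing here for an LLM to interpret loosely: an implementation either satisfies the constraints or it does not.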

https://fromanengineersight.substack.com/p/specs-should-be-equations-not-essays


Notion: Balancing cost and reliability for Spark on Kubernetes

Spark clusters on Kubernetes face a fundamental tension between aggressive cost optimization through spot instances and job reliability during capacity interruptions. Notion reduced compute costs by 60–90% through EKS migration with Karpenter bin-packing, then open-sourced Spot Balancer—a Kubernetes webhook that enforces stable spot-to-on-demand ratios per job, preventing cascade failures during AWS termination windows. Spot Balancer abstracts infrastructure trade-offs into developer-friendly stability tiers, enabling teams to optimize costs without sacrificing job completion rates.
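The ratio-enforcement idea can be sketched in a few lines. The tier names and spot ceilings below are illustrative assumptions, not Notion's actual configuration, and a real admission webhook would of course operate on pod specs rather than plain integers:

```python
# Sketch of per-job spot/on-demand ratio enforcement: given a job's
# stability tier, cap how many of its executors may land on spot capacity
# so a spot interruption can never take out the whole job.
import math

# Hypothetical tiers: maximum fraction of executors allowed on spot.
TIER_MAX_SPOT = {"economy": 0.9, "balanced": 0.6, "critical": 0.2}

def split_executors(total: int, tier: str):
    """Return (spot, on_demand) executor counts honoring the tier's ceiling."""
    spot = math.floor(total * TIER_MAX_SPOT[tier])
    return spot, total - spot

print(split_executors(10, "balanced"))
```

Exposing tiers instead of raw ratios is the developer-experience point: a job owner picks "critical" or "economy" and the platform translates that into capacity placement.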

https://www.notion.com/blog/balancing-cost-and-reliability-for-spark-on-kubernetes


Sponsored: Building a Cross-Workspace Control Plane for Databricks

As Databricks deployments scale, a familiar pattern emerges: multiple workspaces, multiple teams, and no reliable way to manage the dependencies between them.
In this hands-on deep dive, we'll show you how to build a cross-workspace control plane using Dagster on top of your existing Databricks environment. This demo-heavy, practitioner-focused session will leave you with working patterns you can apply to your own platform the same day.

Register now


Apache Iceberg: Introducing the Apache Iceberg File Format API

A pluggable file format API spec is an exciting development for Iceberg. As we increasingly handle unstructured data, this will significantly enhance data management practices through unified governance and compliance. Interestingly, Apache Hudi’s RFC-100 is the corresponding feature request to support the Lance file format.

https://iceberg.apache.org/blog/apache-iceberg-file-format-api/


Delta Lake: The next evolution of Delta - Catalog-Managed Tables

We have come full circle: from exposing files directly through Hadoop, to Snowflake-style cloud data warehouses, to Iceberg-style direct file access, and now back to catalog-managed tables.

It will be interesting to watch how DuckLake-style catalog-managed tables compare with object-store-managed tables.

https://delta.io/blog/2026-02-02-delta-catalog-managed-tables/


Microsoft Fabric: Under the hood: an introduction to the Native Execution Engine for Microsoft Fabric

The Apache Gluten project continues to make an impact on the Spark ecosystem, bringing unique optimizations and efficiency gains. Microsoft Fabric shares an under-the-hood story of adopting Apache Gluten in its platform.

https://blog.fabric.microsoft.com/en-us/blog/under-the-hood-an-introduction-to-the-native-execution-engine-for-microsoft-fabric/


Pinterest: Piqama - Pinterest Quota Management Ecosystem

As companies scale, manual and static quota systems become bottlenecks, forcing engineers to choose between over-provisioning resources and managing brittle enforcement logic. Pinterest developed Piqama, a unified quota platform that dynamically right-sizes limits using historical data stored in Apache Iceberg, then applies custom enforcement strategies across batch schedulers and online services. Piqama centralizes resource governance across hardware and service metrics, enabling teams to optimize capacity allocation while linking consumption directly to financial costs.
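The "right-sizing from historical data" step can be illustrated with a toy formula. The percentile choice and headroom factor below are my assumptions for the sketch, not Pinterest's actual policy:

```python
# Toy sketch of dynamic quota right-sizing: derive a team's quota from a
# high percentile of its historical daily usage plus headroom, instead of
# a static, hand-maintained limit.
def right_size_quota(daily_usage, pct=0.95, headroom=1.2):
    """Quota = 95th-percentile of historical usage, with 20% headroom."""
    ordered = sorted(daily_usage)
    idx = min(int(pct * len(ordered)), len(ordered) - 1)
    return ordered[idx] * headroom

# Hypothetical 10 days of usage for one team (e.g., vCPU-hours).
usage = [40, 42, 45, 50, 48, 60, 55, 47, 52, 44]
print(right_size_quota(usage))
```

Recomputing this periodically from a table of historical consumption (Iceberg, in Pinterest's case) is what turns quotas from a manual negotiation into a feedback loop, and the same usage records can be joined to cost data.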

https://medium.com/pinterest-engineering/piqama-pinterest-quota-management-ecosystem-dc7881433bf5


LinkedIn: Engineering LinkedIn’s job ingestion system at scale

Ingestion systems struggle to scale source onboarding—hard-coded extraction logic creates engineering bottlenecks that slow integration of new data partners. LinkedIn shifted extraction logic from code to configuration files called Sitemaps, enabling AI tools and browser plugins to onboard sources without engineering deployments. At the same time, a transactional state machine enforces precise failure boundaries across parallel mining tasks. The configuration-driven approach reduces onboarding time from weeks to hours, allowing LinkedIn to ingest 20TB daily across thousands of global sources.
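The shift from code to configuration can be sketched as follows. The field names and selectors here are invented for illustration, and LinkedIn's actual Sitemap format is certainly richer than this:

```python
# Sketch of configuration-driven extraction: field selectors live in a
# "Sitemap"-style config rather than in code, so onboarding a new source
# means shipping config, not an engineering deployment.
SITEMAP = {
    "source": "example-jobs",
    "fields": {
        "title":    "h1.job-title",
        "location": "span.job-location",
        "company":  "div.company-name",
    },
}

def extract(document: dict, sitemap: dict) -> dict:
    """Pull configured fields from a pre-parsed selector -> text mapping."""
    return {name: document.get(selector, "")
            for name, selector in sitemap["fields"].items()}

# A document from a hypothetical source, keyed by selector.
doc = {"h1.job-title": "Data Engineer", "span.job-location": "Remote"}
print(extract(doc, SITEMAP))
```

Because the extraction contract is data, it can be authored by AI tools or a browser plugin and validated before rollout, which is what collapses onboarding from weeks to hours.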

https://www.linkedin.com/blog/engineering/infrastructure/engineering-linkedins-job-ingestion-system-at-scale


Shopify: The generative recommender behind Shopify’s commerce engine

Recommendation systems traditionally treat purchases as isolated events, missing the temporal and causal structure that shapes buyer journeys across millions of products. Shopify transitioned to an autoregressive sequence model that treats commerce journeys as token sequences, implementing RoPE-inspired rotary encoding combined with relative attention bias to capture temporal gaps and seasonality across its catalog. The time-aware attention mechanism drove +0.94% order growth and +0.71% conversion lift while achieving 7.3x training speedup through optimized CUDA kernels, enabling Shopify to integrate richer context into a unified generative framework.
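For readers unfamiliar with rotary encoding, here is a minimal sketch of the general RoPE idea the post builds on (not Shopify's exact variant, which also adds relative attention bias): pairs of query/key dimensions are rotated by an angle proportional to position, so attention scores end up depending on relative rather than absolute position.

```python
# Minimal rotary position encoding sketch: rotate consecutive dimension
# pairs of a vector by position-dependent angles. Rotation preserves the
# vector's norm, so only the angle between query and key changes.
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of vec by angles theta_i = pos / base^(2i/d)."""
    out = list(vec)
    half = len(vec) // 2
    for i in range(half):
        theta = position / (base ** (2 * i / len(vec)))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out

# Position 0 leaves the vector unchanged.
print(rope([1.0, 0.0, 0.5, 0.5], 0))
```

Making the angle a function of elapsed time between purchases, rather than token index, is one way to read the post's "time-aware" adaptation of this scheme.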

https://shopify.engineering/generative-recommendations


Alibaba: PostgreSQL Blink-tree Implementation

As we increasingly use AI to code, understanding database internals is more critical than ever. Alibaba Cloud engineers break down how PostgreSQL utilizes the Blink-tree architecture to achieve massive concurrency. By adding link pointers to sibling nodes and high keys to mark boundaries, PostgreSQL allows searches to proceed without lock-coupling. This enables the system to gracefully handle concurrent page splits—following links when data exceeds old boundaries—and significantly outperforms the more rigid lock-subtree approach used in MySQL’s InnoDB.

https://www.alibabacloud.com/blog/postgresql-blink-tree-implementation_602913


All rights reserved, Dewpeche Private Limited. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.