How Apache Iceberg solves data lake coordination challenges

Source: Thoughtworks

Original article

By Jayanth Penumarthi

Imagine you have a large storage system where your company keeps all its data — multiple teams are writing to it at the same time. But suddenly one team’s job crashes halfway through. There’s now a big problem: nobody knows what data is actually there and what’s accurate. Some reports show one number while other reports show a different one. Finance can’t close the books and the CEO is frustrated at conflicting information.

This is, unfortunately, not that uncommon. That’s because data lake storage systems often lack any mechanism for coordinating concurrent writes. However, there’s a solution: Apache Iceberg. In this post, I’ll explain some of the issues with data lakes, how Iceberg can fix them, and whether your company actually needs it.

A practical example of data lake challenges

Before we go further, let’s consider a practical example. Imagine an online grocery store: it’s not unlikely that the company will have multiple teams writing to the same data lake:

  • The order team writes every order placed.
  • The inventory team updates stock levels.
  • The delivery team records deliveries.

This system might work fine for a while, but then something breaks. Perhaps the order team’s job crashed overnight; orders were placed but never recorded in the data lake. Discrepancies appear: everyone has different numbers and it becomes incredibly difficult to establish the truth.

Iceberg can solve this problem by acting as a coordinating layer; it helps keep careful track of:

  • What files are actually in the data lake at any given moment.
  • What changed and when.
  • Who’s allowed to write where.
  • What the old state was (for looking back in time).

In short, Iceberg sits between your data storage and the programs that read the data.

How Apache Iceberg works

How does it do this, though? Primarily, there are three core parts to Iceberg that allow it to act as a data lake coordinator.

  • Metadata files. These are instruction files that say “this is what the table looks like right now.” Every time something is written, a new metadata file is created. Think of it like a snapshot of your data lake at that moment. Old snapshots are kept so you can look back.
  • Manifest files. When you have millions of files, you can’t check each one every time someone reads data. Manifest files are like the index in a book: they say “files 1–1000 have customer names starting with A–M, files 1001–2000 have names starting with N–Z.” This makes searches much faster.
  • Delete files. Instead of erasing a file completely (which is slow), Iceberg keeps a list of “rows to ignore.” It’s faster than rewriting everything.
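To make these three parts concrete, here’s a minimal toy model in Python. It is purely illustrative: the class and field names are invented for this sketch, and real Iceberg persists metadata, manifests, and delete lists as files in object storage rather than in-memory objects.

```python
from dataclasses import dataclass, field

# Toy model of Iceberg's three core file types (hypothetical names;
# real Iceberg writes these as JSON/Avro files in object storage).

@dataclass
class ManifestEntry:
    path: str      # data file location
    min_key: str   # column lower bound, used to prune files at read time
    max_key: str   # column upper bound

@dataclass
class Snapshot:
    snapshot_id: int
    manifest: list                              # the "index" over data files
    deletes: set = field(default_factory=set)   # rows to ignore

@dataclass
class TableMetadata:
    snapshots: list = field(default_factory=list)  # old snapshots are kept

    @property
    def current(self):
        return self.snapshots[-1]

# Each write appends a brand-new snapshot instead of mutating the old one.
meta = TableMetadata()
meta.snapshots.append(Snapshot(1, [ManifestEntry("f1.parquet", "A", "M")]))
meta.snapshots.append(Snapshot(2, [ManifestEntry("f1.parquet", "A", "M"),
                                   ManifestEntry("f2.parquet", "N", "Z")]))

# Manifest pruning: only open files whose key range can match the query.
def files_for(snapshot, key):
    return [e.path for e in snapshot.manifest if e.min_key <= key <= e.max_key]

print(files_for(meta.current, "Q"))  # only f2.parquet needs to be read
print(meta.snapshots[0].manifest)    # snapshot 1 is still there to look back at
```

The key design point the sketch captures is immutability: a write never edits an existing snapshot, it only adds a new one, which is what makes looking back in time cheap.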

Together, these elements provide a layer of organization across your data lake. This makes it easier to understand how things may have changed and, ultimately, to determine what’s true and accurate. Because of the way it’s built, it’s also relatively resource-efficient (that’s not to say there aren’t performance trade-offs, but we’ll come to that later).

What problems does Iceberg solve?

At a high level, then, Iceberg solves coordination challenges. But it’s worth exploring these challenges in more detail: it gives a clearer picture of how Iceberg actually works and where it can help.

Atomic writes

With Iceberg, multiple systems can write at the same time without corrupting data.

Imagine system A writes 50 files while system B writes 50 files to the same table, but system A’s job crashes before all 50 have been recorded; the table no longer knows what data belongs there. With Iceberg, when system A retries after the crash, it first checks whether anything changed. If it did, it automatically starts over on top of the new state; if system B already finished, system A doesn’t overwrite it.
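This check-then-retry behaviour is a form of optimistic concurrency control. The sketch below is a simplified stand-in, with invented names: real Iceberg does an atomic swap of the table’s metadata pointer in a catalog, but the compare-and-swap logic is the same idea.

```python
# Sketch of optimistic concurrency, roughly how Iceberg commits work
# (hypothetical classes; the real mechanism swaps a metadata-file pointer).

class Catalog:
    """Holds the single pointer to the table's current version."""
    def __init__(self):
        self.version = 0
        self.files = []

    def commit(self, expected_version, new_files):
        # Atomic compare-and-swap: succeed only if nobody committed meanwhile.
        if self.version != expected_version:
            return False  # conflict: caller must retry on the new state
        self.files = self.files + new_files
        self.version += 1
        return True

def write_with_retry(catalog, my_files, max_retries=3):
    for _ in range(max_retries):
        seen = catalog.version            # re-read the current state
        if catalog.commit(seen, my_files):
            return catalog.version
    raise RuntimeError("too many conflicts")

cat = Catalog()
seen = cat.version                    # system A reads version 0
cat.commit(cat.version, ["b1"])       # system B commits first (now version 1)
assert not cat.commit(seen, ["a1"])   # A's stale commit is rejected, not merged
write_with_retry(cat, ["a1"])         # A retries on top of B's work
print(cat.files)                      # ['b1', 'a1'] — both writes preserved
```

Nothing is lost on conflict: the losing writer simply re-reads the table and reapplies its change, so the table only ever moves between complete, consistent states.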

Looking up old data

Without Iceberg, to look up old data you’d need to dig through backup files. This can take hours — it also opens up opportunities for errors, like restoring the wrong version. With Iceberg, though, every change creates a snapshot, which can be requested whenever required.

It’s important to note that keeping old snapshots uses additional storage; however, this is usually worth it for compliance and debugging.
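A toy version of snapshot-based time travel looks like this; the `History` class and integer timestamps are invented for the sketch, while real Iceberg resolves “as of” queries against the snapshot log in its metadata files.

```python
import bisect

# Toy time travel: every commit records (timestamp, file list), so reads
# can ask for the table "as of" any past moment (illustrative only).

class History:
    def __init__(self):
        self.times = []
        self.states = []

    def commit(self, ts, files):
        self.times.append(ts)
        self.states.append(list(files))

    def as_of(self, ts):
        # Latest snapshot taken at or before ts.
        i = bisect.bisect_right(self.times, ts) - 1
        if i < 0:
            raise ValueError("no snapshot that old")
        return self.states[i]

h = History()
h.commit(100, ["orders_1.parquet"])
h.commit(200, ["orders_1.parquet", "orders_2.parquet"])

print(h.as_of(150))  # the table as it looked at t=150: one file
print(h.as_of(250))  # the current state: both files
```

Because old states are kept rather than reconstructed from backups, answering “what did this table look like last Tuesday?” is a lookup, not a restore job.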

Adding columns without breaking things

Sometimes you add a column, but your old files still have the old structure. Without coordination, some programs crash reading the old files and others crash reading the new ones. With Iceberg, old files keep their old structure and new files carry the new column; when you read the old files, the system fills the new column with a “null” (empty) value.
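The null-filling behaviour can be sketched in a few lines. The column names here are made up for illustration; the point is that the reader projects every row onto the table’s current schema instead of trusting each file’s own layout.

```python
# Sketch of reading across a schema change: old rows lack the new column,
# so the reader fills it with None instead of crashing (illustrative only).

table_schema = ["order_id", "amount", "coupon_code"]  # coupon_code added later

old_file = [{"order_id": 1, "amount": 20}]            # written before the change
new_file = [{"order_id": 2, "amount": 35, "coupon_code": "SAVE5"}]

def read(files, schema):
    # Project every row onto the current schema; missing columns become None.
    return [{col: row.get(col) for col in schema} for f in files for row in f]

rows = read([old_file, new_file], table_schema)
print(rows[0]["coupon_code"])  # None — old data is still readable
print(rows[1]["coupon_code"])  # 'SAVE5'
```

No files are rewritten when the column is added; only the schema in the metadata changes, and readers adapt at query time.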

Fast deletions

With the introduction of data privacy regulations around the world, it’s not uncommon for deletion requests to be made. Aside from compliance, though, being able to delete data easily is an important facet of good data stewardship.

Without Iceberg this process can be tricky: you may have to read all files, filter out some specific data and rewrite all files, which can take hours. With it, you create a “delete file” that says “ignore all rows where customer_id = 123”, and it’s done instantly. The actual cleanup happens later, when the system has time.
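This “merge-on-read” delete can be sketched as follows; the data layout and field names are invented, but the shape of the idea matches what the section describes: record what to ignore now, physically rewrite later.

```python
# Sketch of merge-on-read deletes: instead of rewriting data files, record
# the rows to ignore and filter them out at read time (illustrative only).

data_files = [
    [{"customer_id": 123, "order": "apples"},
     {"customer_id": 456, "order": "bread"}],
    [{"customer_id": 123, "order": "milk"}],
]

# The "delete file": a cheap record of which rows no longer count.
delete_file = {"customer_id": {123}}

def read_with_deletes(files, deletes):
    ignored = deletes.get("customer_id", set())
    return [row for f in files for row in f
            if row["customer_id"] not in ignored]

rows = read_with_deletes(data_files, delete_file)
print(rows)  # only customer 456's order survives

# The slow physical rewrite happens later, as background compaction.
```

The delete itself is a small metadata write, which is why it completes instantly even over terabytes of data; the expensive file rewriting is deferred to compaction.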

Hidden partitioning

Normally, you have to manually organize your data into folder structures like /year=2024/month=01/day=15/. But with Iceberg’s hidden partitioning, you don’t need to create these folders yourself. You tell Iceberg once: “organize data by year, month, and day,” and it does it automatically behind the scenes. This means teams can query data without knowing the folder structure, and you can change how data is organized without breaking queries or moving files around. It’s simpler for users and more flexible for your organization.
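A minimal sketch of a partition transform makes the “hidden” part visible; the functions and dictionary-as-storage here are invented stand-ins for Iceberg’s partition specs, but they show writers and readers never touching folder paths themselves.

```python
from collections import defaultdict
from datetime import date

# Sketch of hidden partitioning: the table declares a transform once
# (here, "day of the order timestamp") and writes are routed to partitions
# automatically; nobody builds /year=/month=/day= paths by hand.

def day_transform(ts: date) -> str:
    return ts.isoformat()  # e.g. '2024-01-15'

partitions = defaultdict(list)

def write(row):
    # The partition key is derived from the data, not chosen by the writer.
    partitions[day_transform(row["order_ts"])].append(row)

write({"order_id": 1, "order_ts": date(2024, 1, 15)})
write({"order_id": 2, "order_ts": date(2024, 1, 16)})

def read(day: date):
    # The engine prunes to one partition; the user never sees the layout.
    return partitions[day_transform(day)]

print(read(date(2024, 1, 15)))  # only order 1 is scanned
```

Because the transform lives in table metadata rather than in folder names, it can be changed later without rewriting files or breaking existing queries.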

When do you need Iceberg? And when do you not?

Apache Iceberg is an incredibly effective solution to data lake coordination issues. However, it’s not always the right solution.

Use it when/if:

  • Multiple teams or systems write to the same tables at the same time.
  • You need to look back at historical data, for audits, compliance, or debugging.
  • Your schemas change regularly and downstream readers can’t afford breakage.
  • You handle frequent row-level deletion requests, such as privacy erasure.

You probably don’t need Iceberg if:

  • A single pipeline writes to each dataset, so conflicts can’t occur.
  • Your data volumes are small enough that full rewrites are cheap.
  • Your tables are effectively append-only and their structure rarely changes.

Essentially, as useful as Iceberg is, there are a number of costs and trade-offs that teams need to consider before using it.

In many instances these trade-offs will be more than worthwhile. However, if Iceberg is unnecessary for the way your organization manages data the additional complexity and overheads aren’t helpful.

Solving real problems

Apache Iceberg solves a range of real and common data lake problems. It helps ensure:

  • Safe concurrent writes without corrupted tables.
  • Access to historical snapshots for audits and debugging.
  • Schema changes that don’t break existing readers.
  • Fast, compliant deletions.
  • Simpler querying through hidden partitioning.

While it isn’t a silver bullet and isn’t necessarily relevant to every organization, in many contexts and use cases it can go a long way to helping organizations better leverage data lakes.

Originally published at https://www.thoughtworks.com.