How Apache Iceberg solves data lake coordination challenges

Source: Thoughtworks

By Jayanth Penumarthi

Imagine you have a large storage system where your company keeps all its data, and multiple teams are writing to it at the same time. But suddenly one team’s job crashes halfway through. There’s now a big problem: nobody knows what data is actually there and what’s accurate. Some reports show one number while other reports show a different one. Finance can’t close the books and the CEO is frustrated at conflicting information.

This is, unfortunately, not that uncommon. That’s because data lake storage systems often have no way to coordinate multiple writers working on them at the same time. However, there’s a solution: Apache Iceberg. In this post, I’ll explain some of the issues with data lakes, how Iceberg can fix them, and whether your company actually needs it.

A practical example of data lake challenges

Before we go further, let’s consider a practical example. Imagine an online grocery store: it’s quite likely that the company will have multiple teams writing to the same data lake:

  • The order team writes every order placed.
  • The inventory team updates stock levels.
  • The delivery team records deliveries.

This system might work fine for a while, but then something breaks. Perhaps the order job crashed overnight; orders were written but never recorded in the data lake. Discrepancies appear: everyone has different numbers and it becomes incredibly difficult to establish the truth.

Iceberg can solve this problem by acting as a coordinating layer; it helps keep careful track of:

  • What files are actually in the data lake right now.
  • What changed and when.
  • Who’s allowed to write where.
  • What the old state was (for looking back in time).

In short, Iceberg sits between your data storage and the programs that read the data.

How Apache Iceberg works

How does it do this, though? Primarily, there are three core parts to Iceberg that allow it to act as a data lake coordinator.

  • Metadata files. These are instruction files that say “this is what the table looks like right now.” Every time something is written, a new metadata file is created. Think of it like a snapshot of your data lake at that moment. Old snapshots are kept so you can look back.
  • Manifest files. When you have millions of files, you can’t check each one every time someone reads data. Manifest files are like an index in a book. They say “files 1–1000 have customer names starting with A-M, files 1001–2000 have names starting with N-Z.” This makes searches much faster.
  • Delete files. Instead of rewriting a data file to remove rows (which is slow), Iceberg keeps a list of “rows to ignore.” It’s faster than rewriting everything.
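
To make the “index in a book” idea concrete, here is a minimal sketch of how a reader might use manifest-style statistics to skip files. It is a toy model, not Iceberg’s actual API: `ManifestEntry`, `files_to_scan`, and the file paths are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestEntry:
    """One row of a toy manifest: a data file plus min/max stats for one column."""
    path: str
    min_name: str
    max_name: str


def files_to_scan(manifest: list[ManifestEntry], name: str) -> list[str]:
    """Return only the files whose [min, max] range could contain `name`."""
    return [e.path for e in manifest if e.min_name <= name <= e.max_name]


# Hypothetical manifest: two data files covering different name ranges.
manifest = [
    ManifestEntry("data/file-0001.parquet", "Aaron", "Myra"),
    ManifestEntry("data/file-0002.parquet", "Nadia", "Zoe"),
]

print(files_to_scan(manifest, "Priya"))  # ['data/file-0002.parquet']
```

The reader never opens the first file, because its recorded range rules it out. With millions of files, that skipping is where most of the speed-up comes from.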

Together, these elements provide a layer of organization across your data lake. It makes it easier to understand how things may have changed and, ultimately, to determine what’s true and accurate. Because of the way it’s built, it’s also relatively resource-efficient (that’s not to say there aren’t performance trade-offs, but we’ll come to that later).

What problems does Iceberg solve?

At a high level, then, Iceberg solves coordination challenges. But it’s worth exploring these challenges in more detail, because doing so gives a clearer picture of how Iceberg actually works and where it can help.

Atomic writes

With Iceberg, multiple systems can write at the same time without corrupting data.

Imagine system A writes 50 files while system B writes 50 files to the same table, but system A’s job crashes before all 50 have been recorded. Without coordination, the table is left in a confused state about what data belongs there. With Iceberg, when system A tries to finish once the crash has been resolved, it checks whether anything changed. If it did, it automatically starts over with the new information; if system B already finished, system A doesn’t overwrite it.
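
The “check whether anything changed, then start over” behaviour is a form of optimistic concurrency. The sketch below illustrates the idea with an in-memory stand-in; `ToyCatalog`, `CommitConflict`, and `commit_append` are hypothetical names, and a real deployment would rely on a catalog service rather than a Python lock.

```python
import threading


class CommitConflict(Exception):
    """Raised when the table changed since the writer last looked at it."""


class ToyCatalog:
    """Holds the table's current file list and a version number."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.version = 0
        self.files: tuple[str, ...] = ()

    def load(self) -> tuple[int, tuple[str, ...]]:
        with self._lock:
            return self.version, self.files

    def swap(self, expected_version: int, new_files: tuple[str, ...]) -> None:
        """Atomically publish new_files, but only if nobody committed first."""
        with self._lock:
            if self.version != expected_version:
                raise CommitConflict("table changed since it was read")
            self.files = new_files
            self.version += 1


def commit_append(catalog: ToyCatalog, added: list[str], max_retries: int = 5) -> int:
    """Append files, retrying on top of the latest state after a conflict."""
    for _ in range(max_retries):
        version, current = catalog.load()
        try:
            catalog.swap(version, current + tuple(added))
            return version + 1
        except CommitConflict:
            continue  # someone else committed; reload and try again
    raise CommitConflict("gave up after repeated conflicts")


catalog = ToyCatalog()
stale_version, stale_files = catalog.load()      # system A reads the table
commit_append(catalog, ["b-1.parquet"])          # system B finishes first
try:
    catalog.swap(stale_version, stale_files + ("a-1.parquet",))
except CommitConflict:
    commit_append(catalog, ["a-1.parquet"])      # A retries on the new state
print(catalog.files)  # ('b-1.parquet', 'a-1.parquet')
```

System A’s stale commit is rejected rather than silently overwriting system B’s work, and the retry builds on top of what B already published.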

Looking up old data

Without Iceberg, to look up old data you’d need to dig through backup files. This can take hours, and it also opens up opportunities for errors, like restoring the wrong version. With Iceberg, though, every change creates a snapshot, which can be requested whenever required.

It’s important to note that keeping old snapshots uses additional storage; however, this is usually worth it for compliance and debugging.
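
As a rough illustration of what “requesting a snapshot” involves, the sketch below picks the table state that was current at a given time. The `Snapshot` fields and the `snapshot_as_of` helper are assumptions made for this example, not Iceberg’s real metadata layout.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: int            # hypothetical: seconds since some epoch
    files: tuple[str, ...]


def snapshot_as_of(history: list[Snapshot], timestamp: int) -> Snapshot:
    """Return the latest snapshot committed at or before `timestamp`.

    Assumes `history` is sorted by commit time, oldest first.
    """
    times = [s.committed_at for s in history]
    index = bisect_right(times, timestamp)
    if index == 0:
        raise LookupError("no snapshot exists that early")
    return history[index - 1]


history = [
    Snapshot(1, 100, ("orders-1.parquet",)),
    Snapshot(2, 200, ("orders-1.parquet", "orders-2.parquet")),
    Snapshot(3, 300, ("orders-2.parquet",)),
]

print(snapshot_as_of(history, 250).snapshot_id)  # 2
```

Because each snapshot lists its own files, reading old data is a lookup rather than a restore. It also shows why storage grows: `orders-1.parquet` has to be kept for as long as snapshot 1 or 2 is retained.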

Adding columns without breaking things

Sometimes, you might add a column, but your old files still have the old structure. Without coordination, some programs read the old files and crash while others read the new files and crash. With Iceberg, old files keep their old structure. New files have the new column. When you read the old files, the system puts a “null” (empty) value in the new column.
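
The null-filling behaviour can be sketched in a few lines. This toy version matches columns by name for simplicity (Iceberg itself tracks columns by ID), and `read_with_schema` and the column names are hypothetical.

```python
def read_with_schema(row: dict, table_columns: list[str]) -> dict:
    """Project a stored row onto the current schema, filling gaps with None."""
    return {column: row.get(column) for column in table_columns}


# coupon_code was added to the table after old_row was written.
current_schema = ["order_id", "total", "coupon_code"]
old_row = {"order_id": 1, "total": 19.99}
new_row = {"order_id": 2, "total": 5.00, "coupon_code": "SPRING"}

print(read_with_schema(old_row, current_schema))
# {'order_id': 1, 'total': 19.99, 'coupon_code': None}
print(read_with_schema(new_row, current_schema))
# {'order_id': 2, 'total': 5.0, 'coupon_code': 'SPRING'}
```

The old file is never rewritten; the schema is applied at read time, so both generations of files look the same to the program consuming them.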

Fast deletions

With the introduction of data privacy regulations around the world, it’s not uncommon for deletion requests to be made. Aside from compliance, though, being able to delete data easily is an important facet of good data stewardship.

Without Iceberg this process can be tricky: you may have to read all files, filter out the specific data and rewrite all files. This can take hours. With Iceberg, however, you create a “delete file” that says “ignore all rows where customer_id = 123”, and the deletion takes effect for readers right away. The actual cleanup happens later when the system has time.
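
Here is a minimal sketch of the “rows to ignore” idea, assuming deletes are keyed on `customer_id` as in the example above. The `read_live_rows` and `compact` helpers are hypothetical; they only illustrate why the delete is cheap and the cleanup can wait.

```python
def read_live_rows(data_rows: list[dict], deleted_ids: set[int]) -> list[dict]:
    """Apply the delete file at read time: skip rows marked as deleted."""
    return [r for r in data_rows if r["customer_id"] not in deleted_ids]


def compact(data_rows: list[dict], deleted_ids: set[int]) -> tuple[list[dict], set[int]]:
    """Later cleanup: rewrite the data without deleted rows, then clear the delete file."""
    return read_live_rows(data_rows, deleted_ids), set()


data_file = [
    {"customer_id": 122, "item": "milk"},
    {"customer_id": 123, "item": "eggs"},
    {"customer_id": 124, "item": "bread"},
]
delete_file: set[int] = set()

delete_file.add(123)  # the "delete" itself is one small write
visible = read_live_rows(data_file, delete_file)
print([r["customer_id"] for r in visible])  # [122, 124]
print(len(data_file))                       # 3 (data file untouched so far)

data_file, delete_file = compact(data_file, delete_file)
print(len(data_file))                       # 2 (rows physically removed)
```

The trade-off is that every read pays a small filtering cost until compaction runs, which is one reason the cleanup step matters.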

Hidden partitioning

Normally, you have to manually organize your data into folder structures like /year=2024/month=01/day=15/. But with Iceberg’s hidden partitioning, you don’t need to create these folders yourself. You tell Iceberg once: “organize data by year, month, and day,” and it does it automatically behind the scenes. This means teams can query data without knowing the folder structure, and you can change how data is organized without breaking queries or moving files around. It’s simpler for users and more flexible for your organization.
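
The sketch below shows the general shape of the idea: the partition is derived from a column in the row, and queries filter on that column rather than on folder names. It is a simplified in-memory model; `day_transform`, `write`, `query`, and the `order_time` column are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import date, datetime


def day_transform(ts: datetime) -> date:
    """Derive the partition value from the row itself (a hypothetical day transform)."""
    return ts.date()


def write(table: dict, row: dict) -> None:
    """Writers never name a folder; the partition comes from order_time."""
    table[day_transform(row["order_time"])].append(row)


def query(table: dict, start: datetime, end: datetime) -> list[dict]:
    """Filter on the real column; partitions outside the range are skipped."""
    hits = []
    for day, rows in table.items():
        if start.date() <= day <= end.date():  # partition pruning
            hits.extend(r for r in rows if start <= r["order_time"] <= end)
    return hits


table = defaultdict(list)
write(table, {"order_id": 1, "order_time": datetime(2024, 1, 15, 9, 30)})
write(table, {"order_id": 2, "order_time": datetime(2024, 1, 16, 18, 0)})

result = query(table, datetime(2024, 1, 15), datetime(2024, 1, 15, 23, 59))
print([r["order_id"] for r in result])  # [1]
```

The caller only ever mentions `order_time`. Swapping `day_transform` for an hourly or monthly version would change the layout without changing how `query` is called.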

When do you need Iceberg? And when do you not?

Apache Iceberg is an incredibly effective solution to data lake coordination issues. However, it’s not always the right solution.

Use it when/if:

You probably don’t need Iceberg if:

Essentially, as useful as Iceberg is, there are a number of costs and trade-offs that teams need to consider before using it.

In many instances these trade-offs will be more than worthwhile. However, if Iceberg is unnecessary for the way your organization manages data, the additional complexity and overhead aren’t helpful.

Solving real problems

Apache Iceberg solves a range of real and common data lake problems. It helps ensure:

While it isn’t a silver bullet and isn’t necessarily relevant to every organization, in many contexts and use cases it can go a long way toward helping organizations get more out of their data lakes.

Originally published at https://www.thoughtworks.com.