How Apache Iceberg solves data lake coordination challenges
Source: Thoughtworks
Imagine you have a large storage system where your company keeps all its data, and multiple teams are writing to it at the same time. Suddenly, one team's job crashes halfway through. There's now a big problem: nobody knows what data is actually there and what's accurate. Some reports show one number while other reports show a different one. Finance can't close the books and the CEO is frustrated at conflicting information.
This is, unfortunately, not that uncommon. That's because data lake storage systems often have no built-in way to coordinate multiple writers working at the same time. However, there's a solution: Apache Iceberg. In this post, I'll explain some of the issues with data lakes, how Iceberg can fix them, and whether your company actually needs it.
A practical example of data lake challenges
Before we go further, let's consider a practical example. Imagine an online grocery store: it's likely that the company will have multiple teams writing to the same data lake:
- The order team writes every order placed.
- The inventory team updates stock levels.
- The delivery team records deliveries.
This system might work fine for a while, but then something breaks. Perhaps the order job crashed overnight, so some orders were placed but never fully recorded in the data lake. Discrepancies appear: everyone has different numbers and it becomes incredibly difficult to establish the truth.
Iceberg can solve this problem by acting as a coordinating layer; it keeps careful track of:
- What files are actually in the data lake right now.
- What changed and when.
- Who's allowed to write where.
- What the old state was (for looking back in time).
In short, Iceberg sits between your data storage and the programs that read and write the data.
How Apache Iceberg works
How does it do this, though? Primarily, there are three core parts to Iceberg that allow it to act as a data lake coordinator.
- Metadata files. These are instruction files that say "this is what the table looks like right now." Every time something is written, a new metadata file is created. Think of it like a snapshot of your data lake at that moment. Old snapshots are kept so you can look back.
- Manifest files. When you have millions of files, you can't check each one every time someone reads data. Manifest files are like an index in a book. They say "files 1 to 1000 have customer names starting with A-M, files 1001 to 2000 have names starting with N-Z." This makes searches much faster.
- Delete files. Instead of rewriting a data file to remove rows (which is slow), Iceberg keeps a list of "rows to ignore." It's faster than rewriting everything.
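The relationship between metadata, snapshots, and manifest entries can be sketched in a few lines of plain Python. This is a toy model for illustration only, not Iceberg's on-disk format or API: the class names, file names, and name ranges are all assumptions made up for the example. Delete files are left out here to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ManifestEntry:
    """One data file plus the value range it covers (hypothetical layout)."""
    path: str
    min_name: str
    max_name: str


@dataclass(frozen=True)
class Snapshot:
    """The complete list of files that made up the table at one moment."""
    snapshot_id: int
    entries: tuple


@dataclass
class TableMetadata:
    """Keeps every snapshot; the last one is the current table state."""
    snapshots: list = field(default_factory=list)

    def commit(self, entries):
        # Each write produces a new snapshot; older ones are never modified.
        snap = Snapshot(len(self.snapshots) + 1, tuple(entries))
        self.snapshots.append(snap)
        return snap

    def current(self):
        return self.snapshots[-1]


def files_to_scan(snapshot, name):
    """Use the per-file ranges to pick only files that could hold `name`."""
    return [e.path for e in snapshot.entries if e.min_name <= name <= e.max_name]


table = TableMetadata()
a_to_m = ManifestEntry("customers-0001.parquet", "A", "Mzzz")
n_to_z = ManifestEntry("customers-0002.parquet", "N", "Zzzz")
table.commit([a_to_m])           # first write: one file
table.commit([a_to_m, n_to_z])   # second write: a new snapshot with two files

print(files_to_scan(table.current(), "Nadia"))  # ['customers-0002.parquet']
```

The lookup never opens the A-M file, which is the "index in a book" idea: the ranges stored alongside each file are enough to rule it out.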
Together, these elements provide a layer of organization across your data lake. This makes it easier to understand how things may have changed and, ultimately, to determine what's true and accurate. Because of the way it's built, it's also relatively resource-efficient (that's not to say there aren't performance trade-offs, but we'll come to that later).
What problems does Iceberg solve?
At a high level, then, Iceberg solves coordination challenges. But it's worth exploring these challenges in more detail, because it gives a clearer picture of how Iceberg actually works and where it can help.
Atomic writes
With Iceberg, multiple systems can write at the same time without corrupting data.
Imagine system A writes 50 files while system B writes 50 files to the same table, but system A's job crashes before all 50 have been recorded. Without coordination, it's unclear what data belongs in the table. With Iceberg, when system A tries to finish once the crash has been resolved, it checks whether anything changed. If it did, it automatically starts over with the new information; if system B already finished, system A doesn't overwrite it.
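The "check whether anything changed, then retry" behaviour is a form of optimistic concurrency. Below is a minimal sketch of that idea, assuming a single in-memory pointer to the current table state; the `Catalog` class, `swap`, and `commit_append` are hypothetical names for this example and are not Iceberg's API.

```python
import threading


class Catalog:
    """Holds the single pointer to the table's current state (toy model)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.files = ()

    def load(self):
        """Read the current version and file list together."""
        with self._lock:
            return self.version, self.files

    def swap(self, expected_version, new_files):
        """Publish new_files only if nobody committed since expected_version."""
        with self._lock:
            if self.version != expected_version:
                return False  # someone else committed first; caller must retry
            self.version += 1
            self.files = tuple(new_files)
            return True


def commit_append(catalog, new_files, max_retries=5):
    """Append files, re-reading the table state and retrying on conflict."""
    for _ in range(max_retries):
        base_version, base_files = catalog.load()
        if catalog.swap(base_version, base_files + tuple(new_files)):
            return True
    return False


catalog = Catalog()
stale_version, stale_files = catalog.load()   # system A reads the table
commit_append(catalog, ["b-1.parquet"])       # system B commits first

# System A's commit against its stale view is rejected...
first_try = catalog.swap(stale_version, stale_files + ("a-1.parquet",))
# ...so it starts over from the latest state and keeps B's file.
retried = commit_append(catalog, ["a-1.parquet"])

print(first_try, retried, catalog.files)
# False True ('b-1.parquet', 'a-1.parquet')
```

In this model, files written by a job that crashes before its swap succeeds are simply never referenced, so readers do not see a half-finished write.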
Looking up old data
Without Iceberg, to look up old data you'd need to dig through backup files. This can take hours, and it also opens up opportunities for errors, like restoring the wrong version. With Iceberg, though, every change creates a snapshot, which can be requested whenever required.
It's important to note that keeping old snapshots uses additional storage; however, this is usually worth it for compliance and debugging.
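Requesting an old snapshot amounts to finding the last commit at or before a given time. Here is a small sketch of that lookup, assuming a snapshot log of `(commit_time, files)` pairs sorted by time; the integer timestamps and file names are invented for the example and the function is not part of any Iceberg library.

```python
from bisect import bisect_right

# Hypothetical snapshot log: (commit_time, files visible in that snapshot).
snapshot_log = [
    (100, ("orders-1.parquet",)),
    (200, ("orders-1.parquet", "orders-2.parquet")),
    (300, ("orders-2.parquet", "orders-3.parquet")),
]


def files_as_of(log, timestamp):
    """Return the file list of the latest snapshot committed at or before timestamp."""
    times = [commit_time for commit_time, _ in log]
    index = bisect_right(times, timestamp)
    if index == 0:
        raise LookupError("no snapshot exists at or before that time")
    return log[index - 1][1]


print(files_as_of(snapshot_log, 250))
# ('orders-1.parquet', 'orders-2.parquet')
```

The storage cost mentioned above shows up here too: `orders-1.parquet` has to stay on disk for as long as any retained snapshot still lists it.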
Adding columns without breaking things
Sometimes, you might add a column, but your old files still have the old structure. Without Iceberg, some programs read the old files and crash while others read the new files and crash. With Iceberg, old files keep their old structure and new files have the new column. When you read the old files, the system puts a "null" (empty) value in the new column.
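The null-filling behaviour can be illustrated with a short sketch. It assumes rows are plain dictionaries and matches columns by name for brevity (Iceberg itself tracks columns by ID); the column names and the `read_with_schema` helper are hypothetical.

```python
def read_with_schema(rows, current_columns):
    """Project each stored row onto the current schema; absent columns become None."""
    return [{col: row.get(col) for col in current_columns} for row in rows]


# Written before the "coupon" column existed.
old_file = [{"order_id": 1, "total": 9.5}]
# Written after the column was added.
new_file = [{"order_id": 2, "total": 4.0, "coupon": "SPRING"}]

rows = read_with_schema(old_file + new_file, ["order_id", "total", "coupon"])
for row in rows:
    print(row)
# {'order_id': 1, 'total': 9.5, 'coupon': None}
# {'order_id': 2, 'total': 4.0, 'coupon': 'SPRING'}
```

Neither file is rewritten; the schema is applied when the data is read.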
Fast deletions
With the introduction of data privacy regulations around the world, it's not uncommon for deletion requests to be made. Aside from compliance, though, being able to delete data easily is an important facet of good data stewardship.
Without Iceberg this process can be tricky: you may have to read all files, filter out some specific data and rewrite all files. This can take hours. With Iceberg, however, you create a "delete file" that says "ignore all rows where customer_id = 123", and the deletion takes effect right away. The actual cleanup happens later when the system has time.
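The "mark now, clean up later" approach can be sketched as follows. This is a toy model, assuming an in-memory list stands in for the data files and a set of customer IDs stands in for the delete file; `ToyTable` and its methods are illustrative names, not Iceberg's API.

```python
class ToyTable:
    """Data rows plus a separate record of which customers to ignore."""

    def __init__(self, rows):
        self.data_rows = list(rows)   # stands in for the data files
        self.deleted_ids = set()      # stands in for the delete file

    def delete_customer(self, customer_id):
        # Cheap: records the deletion without touching any data rows.
        self.deleted_ids.add(customer_id)

    def read(self):
        # Readers filter out deleted rows on the fly.
        return [r for r in self.data_rows if r["customer_id"] not in self.deleted_ids]

    def compact(self):
        # Later cleanup: physically drop the rows and discard the delete record.
        self.data_rows = self.read()
        self.deleted_ids.clear()


table = ToyTable([
    {"customer_id": 122, "item": "milk"},
    {"customer_id": 123, "item": "eggs"},
    {"customer_id": 123, "item": "bread"},
    {"customer_id": 124, "item": "rice"},
])

table.delete_customer(123)
print([r["item"] for r in table.read()])  # ['milk', 'rice']
print(len(table.data_rows))               # 4 (nothing rewritten yet)

table.compact()
print(len(table.data_rows))               # 2
```

The trade-off is that every read pays a small filtering cost until compaction runs, which is why the cleanup step still matters.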
Hidden partitioning
Normally, you have to manually organize your data into folder structures like /year=2024/month=01/day=15/. But with Iceberg's hidden partitioning, you don't need to create these folders yourself. You tell Iceberg once: "organize data by year, month, and day," and it does it automatically behind the scenes. This means teams can query data without knowing the folder structure, and you can change how data is organized without breaking queries or moving files around. It's simpler for users and more flexible for your organization.
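To make the idea concrete, here is a rough sketch in which the table derives the partition from a timestamp column on write and uses the same rule to skip partitions on read. It is an assumption-laden toy: `PartitionedTable`, `day_transform`, and the `ordered_at` column are invented for the example, and real Iceberg tables declare partitioning in the table definition rather than in application code.

```python
from collections import defaultdict
from datetime import datetime


def day_transform(timestamp):
    """Derive the partition value from the row's own timestamp."""
    return timestamp.date()


class PartitionedTable:
    def __init__(self):
        self.partitions = defaultdict(list)

    def append(self, row):
        # Writers never name a folder; the table applies the transform itself.
        self.partitions[day_transform(row["ordered_at"])].append(row)

    def scan(self, start, end):
        """Return rows with start <= ordered_at <= end and the number of partitions read."""
        wanted_days = sorted(
            day for day in self.partitions if start.date() <= day <= end.date()
        )
        rows = []
        for day in wanted_days:
            rows.extend(
                r for r in self.partitions[day] if start <= r["ordered_at"] <= end
            )
        return rows, len(wanted_days)


table = PartitionedTable()
table.append({"order_id": 1, "ordered_at": datetime(2024, 1, 14, 18, 30)})
table.append({"order_id": 2, "ordered_at": datetime(2024, 1, 15, 9, 0)})
table.append({"order_id": 3, "ordered_at": datetime(2024, 1, 15, 21, 45)})
table.append({"order_id": 4, "ordered_at": datetime(2024, 1, 16, 7, 15)})

# The query filters on the timestamp only; it never mentions year/month/day folders.
rows, partitions_read = table.scan(
    datetime(2024, 1, 15, 0, 0), datetime(2024, 1, 15, 23, 59, 59)
)
print([r["order_id"] for r in rows], partitions_read)  # [2, 3] 1
```

Because queries are written against `ordered_at` rather than a folder path, swapping `day_transform` for a different rule would not require changing the query.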
When do you need Iceberg? And when do you not?
Apache Iceberg is an incredibly effective solution to data lake coordination issues. However, it's not always the right solution.
Use it when/if:
You probably don't need Iceberg if:
Essentially, as useful as Iceberg is, there are a number of costs and trade-offs that teams need to consider before using it.
In many instances these trade-offs will be more than worthwhile. However, if Iceberg is unnecessary for the way your organization manages data, the additional complexity and overhead aren't helpful.
Solving real problems
Apache Iceberg solves a range of real and common data lake problems. It helps ensure:
While it isn't a silver bullet and isn't necessarily relevant to every organization, in many contexts and use cases it can go a long way to helping organizations better leverage data lakes.
Originally published at https://www.thoughtworks.com.