Start Data Engineering

How to detect and fix the small files problem with Spark and Iceberg


Hello Reader,

Having too many small files is one of the most common performance killers in large-scale Spark pipelines.

When a dataset is split into hundreds or thousands of tiny files, Spark can spend more time opening files than processing data.

In this week's deep dive, we look at how to detect and fix the issue with working code.

Identify & fix small files with Spark and Iceberg →

You’ll come away knowing how to:

  • Detect the problem using the Spark UI.
  • Compact small files using Iceberg’s built-in rewrite API.
  • Use table properties to compact files on insert.

Found this useful? Forward it to a teammate or share the link 🔗.

Questions or feedback? Reply here — I read every email.

Best,

Joseph


Over the last decade, I've built highly scalable distributed data platforms and helped companies scale to processing multiple exabytes of data. My mission is to bring the software practices followed by top tech companies to data engineering and help data engineers level up. I help data engineers land high-paying tech jobs and significantly upskill.
