Over the last decade, I've built highly scalable distributed data platforms and helped companies scale to processing multiple exabytes of data. My mission is to bring the software practices followed by top tech companies to data engineering and help data engineers level up. I help data engineers land high-paying tech jobs and significantly upskill themselves.
Hello Reader,

Having too many small files is one of the most common performance killers in large-scale Spark pipelines. When a dataset is split into hundreds or thousands of tiny files, Spark spends more time opening files than processing data.

In this week's deep dive, we see how to detect and fix the issue with working code.

Identify & fix small files with Spark and Iceberg →

You’ll come away knowing how to:
Found this useful? Forward it to a teammate or share the link 🔗.

Questions or feedback? Reply here. I read every email.

Best,
Joseph