
Start Data Engineering

Over the last decade, I've built highly scalable distributed data platforms and helped companies scale to processing multiple exabytes of data. My mission is to bring the software practices followed by top tech companies to data engineering and help data engineers level up. I help data engineers land high-paying tech jobs and significantly upskill themselves.

How to manage data infrastructure with Terraform (not notebook scripts)

Hello Reader, AI can write Terraform code, but without the fundamentals, you are building a house of cards. Terraform (IaC) is basically git for infrastructure. In this post, we will go over what IaC is, Terraform fundamentals, and setting up AWS infrastructure with Terraform. Setting up AWS infrastructure with Terraform → In order to truly understand a concept, you need to be able to explain it plainly. In your own words, share your main takeaway from this post. Regards, Joseph Machado...

How to detect and fix the small files problem with Spark and Iceberg

Hello Reader, Too many small files are one of the most common performance killers in large-scale Spark pipelines. When a dataset is split into hundreds or thousands of tiny files, Spark spends more time opening files than processing data: N small files = 2*N Spark API calls. In this week's deep dive, we see how to detect & fix the issue with working code. Identify & fix small files with Spark and Iceberg → You’ll come away knowing how to: Detect the problem using the Spark UI. Compact small...

How Companies Ingest Data: 2 Key Patterns

Hello Reader, Most companies ingest data in one of two ways: (1) stream data into a cloud store via an event log like Kafka, or (2) extract data from source systems in batch. In this article 👇, we go over a high-level design of how each is built and its tradeoffs. > Data Ingestion Patterns. Please reply to this email if you have any questions/thoughts/ideas and I will get back to you. Regards, Joseph Machado startdataengineering.com

Free Sample Chapter From The Data Engineering Course

Hello Reader, Over the past few weeks, we covered: 4 concepts to land a DE job; 3 data storage techniques for your warehouse; scheduling and orchestration in Airflow; and being in demand as a data engineer by learning fundamentals and best practices. The Data Engineering Course is where it all comes together. Learn how to build production-grade pipelines using industry-standard tools and best practices. Enrollment opens tomorrow, April 26th. Only 64 spots left. Newsletter subscribers get an early...

4 Steps To Becoming A Valuable Engineer

Hello Reader, Data engineering job requirements keep getting longer. And with LLMs generating code, it's getting harder to know which skills are actually worth your time. The engineers who stay in demand are the ones who know how to build pipelines that just chug along, no matter what's thrown at them. I made a cheatsheet that breaks down the fundamentals and best practices that set them apart. 👉 [Download it free here] In my Data Engineering Course launching April 26th, I cover the...

Free Live Airflow Webinar (April 11th, 1PM EST)

Hello Reader, Most high-paying data engineering jobs require a good understanding of pipeline orchestration. Understanding & practicing the problems that Airflow was built to solve will help you understand orchestrators in depth. I will guide you through these concepts live on Saturday, April 11th, at 1PM EST. → Join the live session here. Code for the workshop: https://github.com/josephmachado/airflow-tutorial We will cover: Why does Airflow exist, and why not just use cron? Creating pipelines with...

3 Data Storage Techniques For Faster & Cheaper Queries

Hello Reader, Are your data costs skyrocketing? Are your queries timing out? You know techniques like partitioning and bucketing exist — but knowing when to use which is where most engineers get stuck. What if you had an easy way to determine what technique to use to make your queries fast and cheap? This post gives you exactly that, a rule-of-thumb pattern for analytical tables that is useful in real life and in interviews. > 3 Data Storage Techniques Every Data Engineer Should Know If you...

4 Concepts To Land A High-Paying Data Engineering Job

Hello Reader, Data Engineering job requirements are insane. Every role now requires multiple tools and years of experience. It can feel overwhelming just thinking about it. But what if you can make potential employers excited to hire you? Companies are hiring for one thing: problem solvers. Here are the 4 concepts that map directly to the problems data teams face every day — and that interviewers actually test for. > 4 Concepts To Land A High-Paying Data Engineering Job Want to get proficient...

Free Data Engineering 101 e-book

Hello Reader, Are you interested in pursuing a career in data engineering? And are you overwhelmed by all the concepts/jargon/frameworks of data engineering, or feeling lost because there is no clear roadmap for someone to quickly get up to speed with the essentials of data engineering? Learning to be a data engineer can be a long and rough road, but it doesn't have to be! Imagine knowing the fundamentals of data engineering that are crucial to any data team. You will be able to quickly pick...

Build resilient & optimized pipelines to fast-track your DE career!

Hello Reader, It's a tough market for data engineers right now. Companies expect a lot from their data engineers, and the hiring bar is exceptionally high. Take a look at some job postings, and you will find a variation of "advanced distributed data systems knowledge" as a requirement. The catch is that there is no single definition of what this means. On top of that, LLMs are perceived to be affecting the job market. However, it's not all lost; let's take a step back and consider what...
