João Blasques (Jonas)

AI-Enabled Data Engineer

Building a Customer Analytics Pipeline with Airflow, dbt and Spark
Project Overview

This repository implements a production-grade ELT pipeline that automates the daily identification of high-value customers. Built as the capstone project for the DE101 course, it combines Apache Airflow for orchestration, dbt-spark for transformation and data quality, and Apache Iceberg as the open table format, all running locally via Docker Compose.

Key Concepts
- Medallion Architecture: Data flows through Bronze (raw), Silver (cleaned), and Gold (business-ready) layers, each serving a distinct purpose in the transformation chain.
- Airflow Orchestration: A single DAG chains data generation, dbt runs, quality tests, and dashboard generation, and runs them on a daily schedule.
- dbt Data Quality: 38 automated tests gate pipeline output. If any test fails, downstream tasks are blocked, so the sales mart is never written with bad data.
- Apache Iceberg Table Format: Iceberg provides schema evolution, time-travel queries, and efficient partition pruning on top of the local Spark engine.
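
The layer flow and the quality gate described above can be illustrated with a minimal plain-Python sketch. It is not the repository's code: the real pipeline uses dbt models on Spark and Iceberg, scheduled by an Airflow DAG. The sample records, the function names (`to_silver`, `run_quality_tests`, `to_gold`, `run_pipeline`), the two checks, and the 400.0 spend threshold are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical raw order records, standing in for the Bronze layer.
BRONZE_ORDERS = [
    {"customer_id": "c1", "amount": "120.50"},
    {"customer_id": "c1", "amount": "310.00"},
    {"customer_id": "c2", "amount": "45.00"},
    {"customer_id": None, "amount": "99.00"},  # missing key, dropped in Silver
    {"customer_id": "c3", "amount": "oops"},   # unparseable amount, dropped in Silver
]


def to_silver(bronze_rows):
    """Clean raw rows: drop records without a customer and cast amounts to float."""
    silver = []
    for row in bronze_rows:
        if row["customer_id"] is None:
            continue
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        silver.append({"customer_id": row["customer_id"], "amount": amount})
    return silver


def run_quality_tests(silver_rows):
    """Return the names of failed checks; an empty list means all checks passed."""
    failures = []
    if any(r["customer_id"] is None for r in silver_rows):
        failures.append("not_null_customer_id")
    if any(r["amount"] < 0 for r in silver_rows):
        failures.append("non_negative_amount")
    return failures


def to_gold(silver_rows, threshold=400.0):
    """Sum spend per customer and keep customers at or above the threshold."""
    totals = defaultdict(float)
    for r in silver_rows:
        totals[r["customer_id"]] += r["amount"]
    return {cid: total for cid, total in totals.items() if total >= threshold}


def run_pipeline(bronze_rows):
    """Run Bronze -> Silver -> tests -> Gold, stopping before Gold if a test fails."""
    silver = to_silver(bronze_rows)
    failures = run_quality_tests(silver)
    if failures:
        # The Gold step is never reached, mirroring blocked downstream tasks.
        raise RuntimeError(f"quality gate failed: {failures}")
    return to_gold(silver)
```

Calling `run_pipeline(BRONZE_ORDERS)` returns only the customers whose cleaned total meets the threshold. Passing a record with a negative amount raises `RuntimeError` before any Gold output is produced, which is the same blocking behaviour the dbt tests provide in the actual DAG.
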
Tags: data engineering airflow dbt spark docker
Created Thu, 26 Mar 2026 10:00:00 +0100