Project Overview
This repository implements a production-grade ELT pipeline that automates the daily identification of high-value customers. Built as the capstone project for the DE101 course, it brings together Apache Airflow for orchestration, dbt-spark for transformation and data quality, and Apache Iceberg as the open table format — all running locally via Docker Compose.
Key Concepts
- Medallion Architecture: Data flows through Bronze (raw), Silver (cleaned), and Gold (business-ready) layers, each serving a distinct purpose in the transformation chain.
- Airflow Orchestration: A single DAG wires together data generation, dbt runs, quality tests, and dashboard generation into a reliable daily schedule.
- dbt Data Quality: 38 automated tests gate pipeline output — if any test fails, downstream tasks are blocked and the sales mart is never written with bad data.
- Apache Iceberg Table Format: Iceberg provides schema evolution, time-travel queries, and efficient partition pruning on top of the local Spark engine.