  • Project Overview

    This repository implements a production-grade ELT pipeline that automates the daily identification of high-value customers. Built as the capstone project for the DE101 course, it brings together Apache Airflow for orchestration, dbt-spark for transformation and data quality, and Apache Iceberg as the open table format — all running locally via Docker Compose.


    Key Concepts

    • Medallion Architecture: Data flows through Bronze (raw), Silver (cleaned), and Gold (business-ready) layers, each serving a distinct purpose in the transformation chain.
    • Airflow Orchestration: A single DAG wires together data generation, dbt runs, quality tests, and dashboard generation into a reliable daily schedule.
    • dbt Data Quality: 38 automated tests gate pipeline output — if any test fails, downstream tasks are blocked and the sales mart is never written with bad data.
    • Apache Iceberg Table Format: Iceberg provides schema evolution, time-travel queries, and efficient partition pruning on top of the local Spark engine.
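    The Bronze → Silver → Gold flow above can be sketched in a few lines of plain Python (a toy illustration with hypothetical column names; the actual pipeline does this with dbt models running on Spark over Iceberg tables):

    ```python
    # Bronze: raw rows exactly as ingested — types unvalidated.
    bronze = [
        {"customer_id": "c1", "amount": "120.50", "ts": "2026-03-25"},
        {"customer_id": "c1", "amount": "80.00", "ts": "2026-03-25"},
        {"customer_id": "c2", "amount": "not_a_number", "ts": "2026-03-25"},
    ]

    def to_silver(rows):
        """Silver: cleaned rows — drop records whose amount fails to parse."""
        out = []
        for r in rows:
            try:
                out.append({**r, "amount": float(r["amount"])})
            except ValueError:
                pass  # a dbt test would surface these instead of silently dropping
        return out

    def to_gold(rows, threshold=100.0):
        """Gold: business-ready mart — total spend per customer, high-value only."""
        totals = {}
        for r in rows:
            totals[r["customer_id"]] = totals.get(r["customer_id"], 0.0) + r["amount"]
        return {cid: total for cid, total in totals.items() if total >= threshold}

    gold = to_gold(to_silver(bronze))
    print(gold)  # {'c1': 200.5}
    ```

    Each layer consumes only the layer before it, which is what lets the dbt tests act as a gate: if Silver fails validation, the Gold mart is simply never built.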
    Tags: data engineering, airflow, dbt, spark, docker · Created Thu, 26 Mar 2026 10:00:00 +0100
  • Project Overview

    This project implements a comprehensive data pipeline using Google BigQuery as the primary data warehouse. It demonstrates modern data engineering practices including external data integration, table optimization strategies, and performance tuning techniques.

    Repository: Data Pipeline with BigQuery

    The project focuses on building a scalable, cost-effective data warehouse solution that can handle large volumes of NYC taxi trip data while maintaining optimal query performance and cost efficiency.

    Key Concepts

    • OLAP vs OLTP: Understanding the fundamental differences between Online Analytical Processing and Online Transaction Processing systems
    • Data Warehousing: Implementing centralized storage for analytical workloads with optimized query performance
    • Table Partitioning: Dividing large tables into manageable chunks based on time or range values
    • Clustering: Organizing data within partitions to improve query performance and reduce costs
    • External Tables: Querying data stored outside BigQuery without incurring storage costs
    • Performance Optimization: Implementing best practices for cost reduction and query efficiency
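    Why partitioning cuts both cost and latency can be shown with a toy, pure-Python sketch (hypothetical row layout; in BigQuery the pruning happens inside the engine for tables declared with `PARTITION BY DATE(pickup_datetime)`):

    ```python
    from datetime import date

    # Trips bucketed by pickup day, mimicking one partition per date.
    partitions = {
        day: [{"day": day, "vendor_id": i % 2, "fare": 10.0} for i in range(1000)]
        for day in (date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3))
    }

    def query(rows, day, vendor):
        """The logical query: trips for one day and one vendor."""
        return [r for r in rows if r["day"] == day and r["vendor_id"] == vendor]

    # Full scan: every row in every partition is read.
    all_rows = [r for part in partitions.values() for r in part]
    full_result = query(all_rows, date(2024, 1, 1), vendor=1)

    # Pruned scan: only the partition matching the date filter is read.
    pruned_rows = partitions[date(2024, 1, 1)]
    pruned_result = query(pruned_rows, date(2024, 1, 1), vendor=1)

    assert full_result == pruned_result        # identical answer
    print(len(all_rows), len(pruned_rows))     # 3000 vs 1000 rows scanned
    ```

    Since BigQuery bills by bytes scanned, reading one partition instead of three translates directly into a smaller bill as well as a faster query; clustering applies the same idea again within each partition.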