  • Project Overview

    This repository implements a production-grade ELT pipeline that automates the daily identification of high-value customers. Built as the capstone project for the DE101 course, it brings together Apache Airflow for orchestration, dbt-spark for transformation and data quality, and Apache Iceberg as the open table format — all running locally via Docker Compose.


    Key Concepts

    • Medallion Architecture: Data flows through Bronze (raw), Silver (cleaned), and Gold (business-ready) layers, each serving a distinct purpose in the transformation chain.
    • Airflow Orchestration: A single DAG wires together data generation, dbt runs, quality tests, and dashboard generation into a reliable daily schedule.
    • dbt Data Quality: 38 automated tests gate pipeline output — if any test fails, downstream tasks are blocked and the sales mart is never written with bad data.
    • Apache Iceberg Table Format: Iceberg provides schema evolution, time-travel queries, and efficient partition pruning on top of the local Spark engine.
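    The Bronze → Silver → Gold flow above can be sketched in a few lines of plain Python (a toy illustration with hypothetical column names; the actual pipeline does this with dbt models running on Spark over Iceberg tables):

    ```python
    # Bronze: raw rows exactly as ingested — types unvalidated.
    bronze = [
        {"customer_id": "c1", "amount": "120.50", "ts": "2026-03-25"},
        {"customer_id": "c1", "amount": "80.00", "ts": "2026-03-25"},
        {"customer_id": "c2", "amount": "not_a_number", "ts": "2026-03-25"},
    ]

    def to_silver(rows):
        """Silver: cleaned rows — drop records whose amount fails to parse."""
        out = []
        for r in rows:
            try:
                out.append({**r, "amount": float(r["amount"])})
            except ValueError:
                pass  # a dbt test would surface these instead of silently dropping
        return out

    def to_gold(rows, threshold=100.0):
        """Gold: business-ready mart — total spend per customer, high-value only."""
        totals = {}
        for r in rows:
            totals[r["customer_id"]] = totals.get(r["customer_id"], 0.0) + r["amount"]
        return {cid: total for cid, total in totals.items() if total >= threshold}

    gold = to_gold(to_silver(bronze))
    print(gold)  # {'c1': 200.5}
    ```

    Each layer consumes only the layer before it, which is what lets the dbt tests act as a gate: if Silver fails validation, the Gold mart is simply never built.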
    Tags: data engineering, airflow, dbt, spark, docker · Created Thu, 26 Mar 2026 10:00:00 +0100
  • Project Overview

    This project implements a comprehensive data pipeline using Google BigQuery as the primary data warehouse. It demonstrates modern data engineering practices including external data integration, table optimization strategies, and performance tuning techniques.

    Repository: Data Pipeline with BigQuery

    The project focuses on building a scalable, cost-effective data warehouse solution that can handle large volumes of NYC taxi trip data while maintaining optimal query performance and cost efficiency.

    Key Concepts

    • OLAP vs OLTP: Understanding the fundamental differences between Online Analytical Processing and Online Transaction Processing systems
    • Data Warehousing: Implementing centralized storage for analytical workloads with optimized query performance
    • Table Partitioning: Dividing large tables into manageable chunks based on time or range values
    • Clustering: Organizing data within partitions to improve query performance and reduce costs
    • External Tables: Querying data stored outside BigQuery without incurring storage costs
    • Performance Optimization: Implementing best practices for cost reduction and query efficiency
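    Why partitioning cuts both cost and latency can be shown with a toy, pure-Python sketch (hypothetical row layout; in BigQuery the pruning happens inside the engine for tables declared with `PARTITION BY DATE(pickup_datetime)`):

    ```python
    from datetime import date

    # Trips bucketed by pickup day, mimicking one partition per date.
    partitions = {
        day: [{"day": day, "vendor_id": i % 2, "fare": 10.0} for i in range(1000)]
        for day in (date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3))
    }

    def query(rows, day, vendor):
        """The logical query: trips for one day and one vendor."""
        return [r for r in rows if r["day"] == day and r["vendor_id"] == vendor]

    # Full scan: every row in every partition is read.
    all_rows = [r for part in partitions.values() for r in part]
    full_result = query(all_rows, date(2024, 1, 1), vendor=1)

    # Pruned scan: only the partition matching the date filter is read.
    pruned_rows = partitions[date(2024, 1, 1)]
    pruned_result = query(pruned_rows, date(2024, 1, 1), vendor=1)

    assert full_result == pruned_result        # identical answer
    print(len(all_rows), len(pruned_rows))     # 3000 vs 1000 rows scanned
    ```

    Since BigQuery bills by bytes scanned, reading one partition instead of three translates directly into a smaller bill as well as a faster query; clustering applies the same idea again within each partition.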