João Blasques (Jonas) joaoblasques

AI-Enabled Data Engineer

Building a Customer Analytics Pipeline with Airflow, dbt and Spark
Project Overview

This repository implements a production-grade ELT pipeline that automates the daily identification of high-value customers. Built as the capstone project for the DE101 course, it brings together Apache Airflow for orchestration, dbt-spark for transformation and data quality, and Apache Iceberg as the open table format — all running locally via Docker Compose.

Key Concepts
- Medallion Architecture: Data flows through Bronze (raw), Silver (cleaned), and Gold (business-ready) layers, each serving a distinct purpose in the transformation chain.
- Airflow Orchestration: A single DAG wires together data generation, dbt runs, quality tests, and dashboard generation into a reliable daily schedule.
- dbt Data Quality: 38 automated tests gate pipeline output — if any test fails, downstream tasks are blocked and the sales mart is never written with bad data.
- Apache Iceberg Table Format: Iceberg provides schema evolution, time-travel queries, and efficient partition pruning on top of the local Spark engine.
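The test-gating idea described above can be illustrated with a minimal sketch in plain Python. It does not use Airflow or dbt and is not the repository's code; the step names and the `run_gated` helper are hypothetical, chosen only to show how one failing quality step keeps downstream steps from running.

```python
from typing import Callable, List, Tuple

# A step is a name plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_gated(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first step that reports failure."""
    completed = []
    for name, step in steps:
        if not step():
            # A failing step blocks everything downstream of it.
            break
        completed.append(name)
    return completed


# Hypothetical step names; the lambdas stand in for real work.
steps = [
    ("generate_data", lambda: True),
    ("dbt_run", lambda: True),
    ("dbt_test", lambda: False),  # simulate one failing quality test
    ("write_sales_mart", lambda: True),
    ("build_dashboard", lambda: True),
]

completed = run_gated(steps)
print(completed)  # ['generate_data', 'dbt_run']
```

In a real orchestrator the same effect usually comes from task dependencies rather than a loop: a downstream task only starts once its upstream tasks have succeeded.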
data engineering airflow dbt spark docker Created Thu, 26 Mar 2026 10:00:00 +0100
Analytics Engineering with dbt: From Raw Data to Business Intelligence

Project Overview

This project implements an analytics engineering pipeline that uses dbt (data build tool) as the primary transformation layer. It covers ELT methodology, dimensional modeling, automated testing, and business intelligence visualization.

Repository: Analytics Engineering with dbt

The project focuses on transforming raw NYC taxi trip data into business-ready analytics tables using dbt’s modular approach, implementing both dbt Cloud and dbt Core workflows, and creating interactive dashboards with Looker Studio.

Key Concepts

- Analytics Engineering: Bridging the gap between data engineering and data analysis with software engineering best practices
- ELT vs ETL: Leveraging cloud data warehouses for in-database transformations
- Dimensional Modeling: Implementing Kimball’s star schema methodology for analytical workloads
- dbt Fundamentals: Models, macros, packages, variables, and testing frameworks
- Data Governance: Testing, documentation, and deployment strategies
- Business Intelligence: Creating interactive dashboards and visualizations

analytics engineering dbt bigquery data transformation dimensional modeling Created Mon, 14 Jul 2025 00:00:00 +0100
Building a Data Pipeline with BigQuery: From Storage to Analytics

Project Overview

This project demonstrates the implementation of a comprehensive data pipeline using Google BigQuery as the primary data warehouse solution. The pipeline showcases modern data engineering practices including external data integration, table optimization strategies, and performance tuning techniques.

Repository: Data Pipeline with BigQuery

The project focuses on building a scalable data warehouse that handles large volumes of NYC taxi trip data while keeping queries fast and costs low.

Key Concepts

- OLAP vs OLTP: Understanding the fundamental differences between Online Analytical Processing and Online Transaction Processing systems
- Data Warehousing: Implementing centralized storage for analytical workloads with optimized query performance
- Table Partitioning: Dividing large tables into manageable chunks based on time or range values
- Clustering: Organizing data within partitions to improve query performance and reduce costs
- External Tables: Querying data stored outside BigQuery without incurring storage costs
- Performance Optimization: Implementing best practices for cost reduction and query efficiency
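Why partitioning and clustering reduce the amount of data a query reads can be shown with a small in-memory sketch. This is plain Python, not BigQuery, and the rows, column meanings, and the `total_fare` helper are made up for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical trip rows: (pickup_date, vendor_id, fare)
trips = [
    (date(2024, 1, 1), 1, 12.5),
    (date(2024, 1, 1), 2, 8.0),
    (date(2024, 1, 2), 1, 20.0),
    (date(2024, 1, 3), 2, 5.5),
]

# Partitioning: group rows by day so a date filter touches one bucket only.
partitions = defaultdict(list)
for row in trips:
    partitions[row[0]].append(row)

# Clustering: keep rows inside each partition sorted by vendor_id.
for rows in partitions.values():
    rows.sort(key=lambda r: r[1])


def total_fare(day: date) -> tuple:
    """Return (sum of fares, number of rows scanned) for one day."""
    scanned = partitions.get(day, [])
    return sum(r[2] for r in scanned), len(scanned)


fare, rows_scanned = total_fare(date(2024, 1, 1))
print(fare, rows_scanned)  # 20.5 2
```

The query for one day scans two rows instead of all four. A warehouse applies the same principle at much larger scale, which is where the cost and speed benefits come from.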

data engineering bigquery data warehouse cloud analytics Created Mon, 14 Jul 2025 00:00:00 +0100
Data Pipeline Orchestration using Kestra
Project Overview

This repository demonstrates workflow orchestration for data engineering pipelines using Kestra. It guides users through building, running, and scheduling data pipelines that extract, transform, and load (ETL) data both locally (with PostgreSQL) and in the cloud (with Google Cloud Platform). The project is hands-on and includes conceptual explanations, infrastructure setup, and several example pipeline flows.

Key Concepts
- Workflow Orchestration: Automating and managing complex workflows with dependencies, retries, logging, and monitoring.
- Kestra: An orchestration platform with a user-friendly UI and YAML-based workflow definitions (called “flows”).
- Data Lake & Data Warehouse: Demonstrates moving data from raw storage (GCS) to structured analytics (BigQuery).
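Retries are one of the orchestration features listed above. The sketch below shows the general idea in plain Python; it is not Kestra's implementation or configuration syntax, and `run_with_retries` and `flaky_extract` are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(task: Callable[[], T], max_attempts: int = 3, delay_s: float = 0.0) -> T:
    """Call task until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            # Log the failure, then either give up or wait and try again.
            print(f"attempt {attempt} failed: {exc}")
            if attempt == max_attempts:
                raise
            time.sleep(delay_s)
    raise ValueError("max_attempts must be at least 1")


# A hypothetical task that fails twice before succeeding.
calls = {"count": 0}


def flaky_extract() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("source not ready")
    return "extracted"


result = run_with_retries(flaky_extract)
print(result, calls["count"])  # extracted 3
```

An orchestration platform typically lets you declare this behavior per task instead of writing the loop yourself, and records each attempt in its logs.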
data engineering beginners tutorial docker kestra Created Sat, 21 Jun 2025 09:30:00 +0100
Orchestrating Data Pipelines with Apache Airflow: A Comprehensive Guide
Project Overview

This repository serves as a practical guide to building and orchestrating robust data pipelines using Apache Airflow. It covers essential concepts from basic workflow management to advanced deployments with Google Cloud Platform (GCP) and Kubernetes.

Key Concepts
- Workflow Orchestration: Automating and managing complex data workflows with dependencies, scheduling, retries, and monitoring using Apache Airflow.
- DAGs (Directed Acyclic Graphs): The core abstraction in Airflow for defining task dependencies, execution order, and workflow logic.
- Extensible Operators & Integrations: Leveraging Airflow’s wide range of built-in operators and custom plugins to interact with databases, cloud services (GCP, Kubernetes), and external systems.
- Scalable Deployments: Running Airflow locally for prototyping, or deploying on cloud and Kubernetes for production-scale, resilient, and distributed data pipeline execution.
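To make the DAG concept concrete, here is a minimal sketch that uses `graphlib` from the Python standard library (Python 3.9+) rather than Airflow itself. The task names are hypothetical. It derives a valid execution order from declared dependencies and shows that a cycle is rejected.

```python
from graphlib import CycleError, TopologicalSorter

# Map each task to the set of tasks it depends on (hypothetical names).
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "quality_check": {"transform"},
    "load": {"transform"},
    "report": {"quality_check", "load"},
}

# static_order() yields tasks so that every task follows its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)

# A graph with a cycle has no valid order, which is why a DAG must be acyclic.
cycle_detected = False
try:
    list(TopologicalSorter({"a": {"b"}, "b": {"a"}}).static_order())
except CycleError:
    cycle_detected = True
print(cycle_detected)  # True
```

`quality_check` and `load` do not depend on each other, so either may come first. A scheduler can run such independent tasks in parallel.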
data engineering airflow orchestration tutorial docker Created Sat, 21 Jun 2025 09:30:00 +0100
Simple Data Pipeline

Project Overview

This repository provides a comprehensive, step-by-step guide to building a simple data engineering pipeline using containerization (Docker), orchestration (Docker Compose), and Infrastructure as Code (Terraform), with a focus on ingesting and processing NYC taxi data. The project is hands-on and includes conceptual explanations, infrastructure setup, and several example pipeline flows.

This project is a practical template for data engineers to learn and implement containerized data pipelines, local and cloud database management, and automated cloud infrastructure provisioning using modern tools like Docker, Docker Compose, and Terraform. It is especially useful for those looking to understand the end-to-end workflow from local prototyping to cloud deployment in a reproducible, automated way.

data engineering beginners tutorial docker terraform Created Sat, 21 Jun 2025 09:30:00 +0100
The Role of AI in Modern Data Architectures
AI-Driven Data Architecture

Artificial intelligence isn’t just a consumer of data—it’s increasingly becoming an integral part of how we design and operate our data systems. This post explores the evolving relationship between AI and data architecture.

AI-Enhanced Data Processing

Modern data architectures are incorporating AI at various levels:
- Intelligent Data Cataloging - Automatically discovering, classifying, and tagging data assets
- Adaptive Data Integration - Using ML to identify optimal integration patterns and transformations
- Automated Quality Management - Detecting anomalies and quality issues without manual rules
- Self-Tuning Systems - Databases and data platforms that optimize themselves based on workloads
Real-World Applications

Recommendation Systems

AI algorithms help determine which data is most relevant to different users and use cases, optimizing data discovery and access.
AI data architecture innovation Created Sun, 11 May 2025 12:15:00 +0100
Machine Learning Pipeline Design
Building Effective Machine Learning Pipelines

Creating robust machine learning pipelines is essential for deploying AI solutions at scale. This post covers key considerations and best practices.

The Anatomy of an ML Pipeline

A well-designed ML pipeline includes these key stages:
1. Data Ingestion - Collecting and importing data from various sources
2. Data Preparation - Cleaning, transforming, and feature engineering
3. Model Training - Developing and tuning ML models
4. Model Evaluation - Assessing performance and validity
5. Model Deployment - Serving models in production environments
6. Monitoring - Tracking performance and detecting drift
Common Challenges and Solutions

Challenge: Data Quality Issues

Solution: Implement robust data validation and cleaning processes early in the pipeline.
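As one possible illustration of early validation, the sketch below checks each record against two simple rules and separates valid records from rejected ones before any later stage sees them. The field names (`id`, `amount`) and the rules are assumptions made for the example, not taken from the post.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def validate(record: Record) -> List[str]:
    """Return a list of problems found in one record (empty if valid)."""
    problems = []
    if record.get("id") is None:
        problems.append("missing id")
    amount = record.get("amount")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        problems.append("amount is not numeric")
    elif amount < 0:
        problems.append("amount is negative")
    return problems


def split_valid(records: List[Record]) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Separate clean records from rejected ones, keeping the reasons."""
    valid, rejected = [], []
    for record in records:
        problems = validate(record)
        if problems:
            rejected.append((record, problems))
        else:
            valid.append(record)
    return valid, rejected


records = [
    {"id": 1, "amount": 10.0},
    {"id": None, "amount": 5.0},
    {"id": 3, "amount": -2.0},
    {"id": 4, "amount": "n/a"},
]
valid, rejected = split_valid(records)
print(len(valid), len(rejected))  # 1 3
```

Keeping the rejection reasons alongside each record makes it easier to trace quality issues back to their source.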
machine learning pipelines MLOps Created Fri, 09 May 2025 10:45:00 +0100
Getting Started with Data Engineering

Data Engineering Fundamentals

Data engineering is the backbone of any data-driven organization. In this post, we will explore the fundamental concepts that every aspiring data engineer should understand.

What is Data Engineering?

Data engineering focuses on designing, building, and maintaining the infrastructure and architecture for data generation, storage, and analysis. Data engineers develop the systems that collect, manage, and convert raw data into usable information for data scientists and business analysts.

data engineering beginners tutorial Created Mon, 05 May 2025 09:30:00 +0100
Welcome to My Professional Website

Hello, I’m João Blasques

Welcome to my professional website! I’m an AI-Enabled Data Engineer passionate about leveraging artificial intelligence and data solutions to solve complex business problems.

My Background

With expertise in data engineering, machine learning, and AI integration, I help organizations transform their data into actionable insights. I specialize in designing and implementing data pipelines, creating machine learning models, and developing AI-powered applications that drive business value.

What You’ll Find Here

On this website, you can explore:

introduction about Created Wed, 23 Apr 2025 15:55:33 +0100
