Project Overview
This project demonstrates the implementation of a comprehensive analytics engineering pipeline using dbt (data build tool) as the primary transformation layer. The pipeline showcases modern data engineering practices including ELT methodology, dimensional modeling, automated testing, and business intelligence visualization.
Repository: Analytics Engineering with dbt
The project focuses on transforming raw NYC taxi trip data into business-ready analytics tables using dbt’s modular approach, implementing both dbt Cloud and dbt Core workflows, and creating interactive dashboards with Looker Studio.
Key Concepts
- Analytics Engineering: Bridging the gap between data engineering and data analysis with software engineering best practices
- ELT vs ETL: Leveraging cloud data warehouses for in-database transformations
- Dimensional Modeling: Implementing Kimball’s star schema methodology for analytical workloads
- dbt Fundamentals: Models, macros, packages, variables, and testing frameworks
- Data Governance: Testing, documentation, and deployment strategies
- Business Intelligence: Creating interactive dashboards and visualizations
Understanding Analytics Engineering
The Analytics Engineer Role
Traditional data teams typically include data engineers, data analysts, and data scientists. However, a gap exists between these roles:
Data Engineers excel at building infrastructure and maintaining data pipelines but may lack deep understanding of business requirements and data usage patterns.
Data Analysts understand business needs and can derive insights from data but often lack software engineering skills for building scalable, maintainable solutions.
Analytics Engineers bridge this gap by combining:
- Software engineering practices from data engineering
- Business understanding from data analysis
- Modern cloud-based transformation tools
Modern Data Stack Architecture
The analytics engineer typically works with:
- Data Loading: Stitch, Apache NiFi, Airbyte
- Data Storing: Google BigQuery, Snowflake, Amazon Redshift
- Data Modeling: dbt, Dataform, Apache Spark SQL
- Data Presentation: Looker, Tableau, Looker Studio (formerly Google Data Studio), Power BI
ELT vs ETL Methodology
ETL (Extract, Transform, Load) transforms data before loading into the warehouse, requiring upfront processing but ensuring data quality.
ELT (Extract, Load, Transform) loads raw data first, then transforms within the warehouse, offering:
- Faster implementation
- Greater flexibility
- Leverages cloud warehouse compute power
- Lower storage costs
This approach capitalizes on cloud data warehousing advances that have dramatically reduced storage and compute costs.
Introduction to dbt
What is dbt?
dbt (data build tool) is a transformation workflow that enables analysts and engineers to transform data using SQL and Python. It sits on top of your data warehouse and helps convert raw data into analysis-ready datasets.
Key Features:
- SQL-First: Write transformations in familiar SQL
- Version Control: Full Git integration for collaboration
- Testing: Built-in data quality testing framework
- Documentation: Auto-generated documentation and lineage
- Modularity: Reusable models and macros
How dbt Works
- Model Definition: Write SELECT statements in .sql files
- Compilation: dbt compiles SQL with Jinja templating
- Execution: Generated SQL runs against your data warehouse
- Materialization: Results stored as tables, views, or incremental builds
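The four steps above can be seen in a minimal model; the model below is illustrative and not part of the project:

-- models/example/trips_per_vendor.sql (illustrative, not part of the project)
{{ config(materialized='view') }}    -- materialization: stored as a view (step 4)

select
    vendorid,
    count(*) as trip_count
from {{ ref('stg_green_tripdata') }} -- Jinja ref() resolved to a real table name at compile time (step 2)
group by vendorid

When the project runs, dbt compiles the Jinja into plain SQL (step 2), wraps the SELECT in a CREATE VIEW statement, and executes it against the warehouse (step 3).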
dbt Architecture Options
dbt Core:
- Open-source command-line tool
- Local development environment
- Manual infrastructure management
- Free to use
dbt Cloud:
- SaaS platform with web IDE
- Managed infrastructure
- Built-in scheduling and monitoring
- Free tier available
Dimensional Modeling with dbt
Kimball’s Dimensional Modeling
This project implements Kimball’s methodology focusing on:
- Business Understanding: Data that’s intuitive for stakeholders
- Query Performance: Optimized for analytical workloads
- Star Schema Design: Central fact tables surrounded by dimension tables
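As a preview of the models built later in this project, an analytical query against a star schema joins one wide fact table to small dimension tables and aggregates. A schematic sketch (the locationid join key is assumed here for illustration):

-- Schematic star-schema query (sketch): fact table joined to a dimension
select
    dim_zones.borough,
    sum(fact_trips.fare_amount) as total_fare
from fact_trips
inner join dim_zones
    on fact_trips.pickup_locationid = dim_zones.locationid
group by dim_zones.borough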
Project Structure
The dbt project follows a modular architecture:
├── models/
│   ├── staging/                      # Raw data cleaning and standardization
│   │   ├── stg_green_tripdata.sql
│   │   ├── stg_yellow_tripdata.sql
│   │   └── schema.yml
│   └── core/                         # Business logic and dimensional models
│       ├── dim_zones.sql
│       ├── fact_trips.sql
│       ├── dm_monthly_zone_revenue.sql
│       └── schema.yml
├── macros/                           # Reusable SQL functions
│   └── get_payment_type_description.sql
├── seeds/                            # Static reference data
│   └── taxi_zone_lookup.csv
└── dbt_project.yml                   # Project configuration
Implementation: NYC Taxi Data Pipeline
Staging Layer: Data Standardization
The staging layer cleanses and standardizes raw data using consistent patterns:
-- stg_green_tripdata.sql
{{ config(materialized='view') }}

with tripdata as (
    select *,
        row_number() over(partition by vendorid, lpep_pickup_datetime) as rn
    from {{ source('staging', 'green_tripdata') }}
    where vendorid is not null
)

select
    -- identifiers
    {{ dbt_utils.generate_surrogate_key(['vendorid', 'lpep_pickup_datetime']) }} as tripid,
    {{ dbt.safe_cast("vendorid", api.Column.translate_type("integer")) }} as vendorid,

    -- timestamps
    cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
    cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,

    -- payment info
    cast(fare_amount as numeric) as fare_amount,
    {{ get_payment_type_description("payment_type") }} as payment_type_description
from tripdata
where rn = 1
Key Features:
- Deduplication: Using ROW_NUMBER() for data quality
- Type Casting: Standardizing data types across sources
- Surrogate Keys: Generating unique identifiers
- Macros: Reusable transformation logic
Core Layer: Dimensional Models
Dimension Tables
-- dim_zones.sql
{{ config(materialized='table') }}

select
    locationid,
    borough,
    zone,
    replace(service_zone, 'Boro', 'Green') as service_zone
from {{ ref('taxi_zone_lookup') }}
Fact Tables
-- fact_trips.sql
{{ config(materialized='table') }}

with green_tripdata as (
    select *, 'Green' as service_type
    from {{ ref('stg_green_tripdata') }}
),

yellow_tripdata as (
    select *, 'Yellow' as service_type
    from {{ ref('stg_yellow_tripdata') }}
),

trips_unioned as (
    select * from green_tripdata
    union all
    select * from yellow_tripdata
)

select
    trips_unioned.tripid,
    trips_unioned.service_type,
    pickup_zone.borough as pickup_borough,
    pickup_zone.zone as pickup_zone,
    dropoff_zone.borough as dropoff_borough,
    dropoff_zone.zone as dropoff_zone,
    trips_unioned.pickup_datetime,
    trips_unioned.fare_amount,
    trips_unioned.total_amount
from trips_unioned
inner join {{ ref('dim_zones') }} as pickup_zone
    on trips_unioned.pickup_locationid = pickup_zone.locationid
inner join {{ ref('dim_zones') }} as dropoff_zone
    on trips_unioned.dropoff_locationid = dropoff_zone.locationid
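The project structure also lists a data mart model, dm_monthly_zone_revenue.sql, built on top of fact_trips. A minimal sketch of what it could look like, using only columns shown in the fact_trips excerpt above (the exact aggregations are assumptions):

-- dm_monthly_zone_revenue.sql (sketch)
{{ config(materialized='table') }}

select
    pickup_zone as revenue_zone,
    date_trunc(date(pickup_datetime), month) as revenue_month,
    service_type,
    sum(fare_amount) as revenue_monthly_fare,
    sum(total_amount) as revenue_monthly_total_amount,
    count(tripid) as total_monthly_trips
from {{ ref('fact_trips') }}
group by 1, 2, 3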
Advanced dbt Features
Macros for Code Reusability
-- macros/get_payment_type_description.sql
{% macro get_payment_type_description(payment_type) -%}
    case {{ dbt.safe_cast("payment_type", api.Column.translate_type("integer")) }}
        when 1 then 'Credit card'
        when 2 then 'Cash'
        when 3 then 'No charge'
        when 4 then 'Dispute'
        when 5 then 'Unknown'
        when 6 then 'Voided trip'
        else 'EMPTY'
    end
{%- endmacro %}
Variables for Environment Configuration
-- Development vs Production data limits
{% if var('is_test_run', default=true) %}
limit 100
{% endif %}
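A hedged sketch of the pattern in a complete, simplified model, together with the standard command-line override; the real project appends the same if-block to the staging models shown earlier:

-- Simplified example of a dev-limited model (not the project's actual staging model)
-- Run against the full dataset with: dbt run --vars '{"is_test_run": false}'
{{ config(materialized='view') }}

select *
from {{ source('staging', 'green_tripdata') }}

-- the limit is applied only when is_test_run is true (the default)
{% if var('is_test_run', default=true) %}
limit 100
{% endif %}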
Package Management
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: dbt-labs/codegen
    version: 0.12.1
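Once installed with dbt deps, package macros can be called from any model. For illustration (this is not how fact_trips is written above), dbt_utils.union_relations could replace the manual UNION ALL of the two staging models:

-- Sketch: unioning the staging models with a dbt_utils macro
-- union_relations also adds a _dbt_source_relation column identifying the source model
with trips_unioned as (
    {{ dbt_utils.union_relations(
        relations=[ref('stg_green_tripdata'), ref('stg_yellow_tripdata')]
    ) }}
)

select * from trips_unioned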
Data Quality and Testing
Built-in Tests
dbt provides four essential tests out of the box:
# models/staging/schema.yml
models:
  - name: stg_green_tripdata
    columns:
      - name: tripid
        tests:
          - unique
          - not_null
      - name: payment_type
        tests:
          - accepted_values:
              values: [1, 2, 3, 4, 5, 6]
      - name: pickup_locationid
        tests:
          - relationships:
              to: ref('dim_zones')
              field: locationid
Testing Strategy
- Unique: Ensures primary key integrity
- Not Null: Validates required fields
- Accepted Values: Enforces valid enum values
- Relationships: Maintains referential integrity
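Beyond these four generic tests, dbt also supports singular tests: a SQL file under tests/ that must return zero rows to pass. A minimal sketch (the file name and rule are illustrative, not tests from the project):

-- tests/assert_non_negative_fare_amount.sql (illustrative)
-- The test fails if any rows are returned
select
    tripid,
    fare_amount
from {{ ref('stg_green_tripdata') }}
where fare_amount < 0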
Deployment and Production
Development Workflow
- Branch-based Development: Each developer works in isolated branches
- Testing: All tests must pass before merging
- Code Review: Pull request process for quality control
- Deployment: Automated production deployment
Production Jobs
dbt Cloud enables automated production workflows:
# Production job configuration
- dbt deps # Install packages
- dbt seed # Load reference data
- dbt run # Execute transformations
- dbt test # Validate data quality
- dbt docs generate # Create documentation
Continuous Integration
CI/CD pipelines ensure code quality:
- Automated Testing: Run tests on every pull request
- Isolated Environments: Temporary schemas for testing
- Documentation Updates: Auto-generated docs for each deployment
Business Intelligence with Looker Studio
Dashboard Creation Process
- Data Source Connection: Connect Looker Studio to BigQuery
- Data Modeling: Configure aggregations and dimensions
- Visualization Design: Create charts, tables, and controls
- Interactive Features: Add filters and drill-down capabilities
Key Visualizations
- Time Series Analysis: Trip volume trends over time
- Geographic Analysis: Revenue by pickup zones
- Service Comparison: Green vs Yellow taxi performance (illustrated by the query after this list)
- Payment Analysis: Payment method distributions
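The service-comparison and time-series charts boil down to an aggregation along these lines, which could live under the project's analyses/ folder or be issued by Looker Studio directly (an illustrative query, not a model from the repository):

-- Monthly trips and revenue per service type (illustrative)
select
    date_trunc(date(pickup_datetime), month) as trip_month,
    service_type,
    count(tripid) as total_trips,
    sum(total_amount) as total_revenue
from {{ ref('fact_trips') }}
group by 1, 2
order by 1, 2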
Performance Insights
The dashboard reveals business insights such as:
- COVID-19 impact on ridership (March 2020 drop)
- Seasonal usage patterns
- Geographic revenue concentrations
- Payment method preferences
dbt Core Implementation
Local Development Setup
# Installation
pip install dbt-core dbt-bigquery
# Project initialization
dbt init dbtcore_bigquery
# Dependencies and execution
dbt deps
dbt seed
dbt run
dbt test
Configuration Management
# dbt_project.yml
name: 'dbtcore_bigquery'
version: '1.0.0'
profile: 'dbtcore_bigquery'

models:
  dbtcore_bigquery:
    staging:
      materialized: view
    core:
      materialized: table

vars:
  payment_type_values: [1, 2, 3, 4, 5, 6]
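The payment_type_values variable can then be referenced anywhere in the project through Jinja. A sketch of one possible use (not a model from the repository):

-- Sketch: filtering to the payment types declared in dbt_project.yml
select *
from {{ ref('stg_green_tripdata') }}
where payment_type in ({{ var('payment_type_values') | join(', ') }})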
Best Practices and Scalability
Code Organization
- Layered Architecture: Staging → Core → Marts
- Naming Conventions: Clear, consistent model names
- Documentation: Comprehensive field descriptions
- Version Control: Git-based collaboration
Performance Optimization
- Materialization Strategy: Views for staging, tables for core models
- Incremental Models: For large, append-only datasets
- BigQuery Optimization: Partitioning and clustering (see the sketch after this list)
- Resource Management: Appropriate warehouse sizing
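A hedged sketch of what an incremental, partitioned fact model could look like on BigQuery; the partition and cluster columns are illustrative choices, not the project's actual configuration:

-- Sketch: incremental model with BigQuery partitioning and clustering
{{ config(
    materialized='incremental',
    partition_by={'field': 'pickup_datetime', 'data_type': 'timestamp', 'granularity': 'day'},
    cluster_by=['service_type']
) }}

select *
from {{ ref('stg_green_tripdata') }}

{% if is_incremental() %}
-- on incremental runs, only process trips newer than what is already loaded
where pickup_datetime > (select max(pickup_datetime) from {{ this }})
{% endif %}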
Data Governance
- Testing Coverage: Comprehensive data quality tests
- Documentation: Auto-generated model documentation
- Access Control: Role-based permissions
- Monitoring: Job failure alerts and data freshness checks
Conclusion
This project demonstrates how to build a robust, scalable analytics engineering pipeline using modern tools and methodologies. The implementation showcases:
- Modern ELT Architecture: Leveraging cloud data warehouse capabilities
- Dimensional Modeling: Business-friendly star schema design
- Software Engineering Practices: Version control, testing, and CI/CD
- Business Intelligence: Interactive dashboards for stakeholder consumption
The pipeline processes millions of taxi trip records, transforming raw data into actionable business insights while maintaining data quality and operational excellence.
Key Achievements:
- Automated data quality testing preventing bad data propagation
- Modular, maintainable code architecture
- Comprehensive documentation and lineage tracking
- Production-ready deployment with monitoring and alerting
Next Steps: Consider implementing real-time streaming transformations, advanced ML features, or extending the pipeline to additional data sources for a more comprehensive analytics platform.