Project Overview

This project demonstrates the implementation of a comprehensive analytics engineering pipeline using dbt (data build tool) as the primary transformation layer. The pipeline showcases modern data engineering practices including ELT methodology, dimensional modeling, automated testing, and business intelligence visualization.

Repository: Analytics Engineering with dbt

The project focuses on transforming raw NYC taxi trip data into business-ready analytics tables using dbt’s modular approach, implementing both dbt Cloud and dbt Core workflows, and creating interactive dashboards with Looker Studio.

Key Concepts

  • Analytics Engineering: Bridging the gap between data engineering and data analysis with software engineering best practices
  • ELT vs ETL: Leveraging cloud data warehouses for in-database transformations
  • Dimensional Modeling: Implementing Kimball’s star schema methodology for analytical workloads
  • dbt Fundamentals: Models, macros, packages, variables, and testing frameworks
  • Data Governance: Testing, documentation, and deployment strategies
  • Business Intelligence: Creating interactive dashboards and visualizations

Understanding Analytics Engineering

The Analytics Engineer Role

Traditional data teams typically include data engineers, data analysts, and data scientists. However, a gap exists between these roles:

Data Engineers excel at building infrastructure and maintaining data pipelines but may lack deep understanding of business requirements and data usage patterns.

Data Analysts understand business needs and can derive insights from data but often lack software engineering skills for building scalable, maintainable solutions.

Analytics Engineers bridge this gap by combining:

  • Software engineering practices from data engineering
  • Business understanding from data analysis
  • Modern cloud-based transformation tools

Modern Data Stack Architecture

The analytics engineer typically works with:

  • Data Loading: Stitch, Apache NiFi, Airbyte
  • Data Storing: Google BigQuery, Snowflake, Amazon Redshift
  • Data Modeling: dbt, Dataform, Apache Spark SQL
  • Data Presentation: Looker, Tableau, Looker Studio (formerly Google Data Studio), Power BI

ELT vs ETL Methodology

ETL (Extract, Transform, Load) transforms data before loading into the warehouse, requiring upfront processing but ensuring data quality.

ELT (Extract, Load, Transform) loads raw data first, then transforms within the warehouse, offering:

  • Faster implementation
  • Greater flexibility
  • Full use of cloud warehouse compute power
  • Lower storage costs

This approach capitalizes on cloud data warehousing advances that have dramatically reduced storage and compute costs.

Introduction to dbt

What is dbt?

dbt (data build tool) is a transformation workflow that enables analysts and engineers to transform data using SQL and Python. It sits on top of your data warehouse and helps convert raw data into analysis-ready datasets.

Key Features:

  • SQL-First: Write transformations in familiar SQL
  • Version Control: Full Git integration for collaboration
  • Testing: Built-in data quality testing framework
  • Documentation: Auto-generated documentation and lineage
  • Modularity: Reusable models and macros

How dbt Works

  1. Model Definition: Write SELECT statements in .sql files
  2. Compilation: dbt compiles SQL with Jinja templating
  3. Execution: Generated SQL runs against your data warehouse
  4. Materialization: Results stored as tables, views, or incremental builds
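
For example, a minimal model is nothing more than a SELECT statement with Jinja templating; at compile time dbt resolves {{ ref(...) }} to the fully qualified relation in the warehouse. The model below is a hypothetical illustration (not part of this project) that aggregates daily trips from the fact table defined later:

-- models/example/trips_per_day.sql (hypothetical illustration)
{{ config(materialized='view') }}

select
    date(pickup_datetime) as trip_date,
    service_type,
    count(*) as trips
from {{ ref('fact_trips') }}  -- compiles to the fully qualified table, e.g. project.dataset.fact_trips
group by 1, 2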

dbt Architecture Options

dbt Core:

  • Open-source command-line tool
  • Local development environment
  • Manual infrastructure management
  • Free to use

dbt Cloud:

  • SaaS platform with web IDE
  • Managed infrastructure
  • Built-in scheduling and monitoring
  • Free tier available

Dimensional Modeling with dbt

Kimball’s Dimensional Modeling

This project implements Kimball’s methodology focusing on:

  • Business Understanding: Data that’s intuitive for stakeholders
  • Query Performance: Optimized for analytical workloads
  • Star Schema Design: Central fact tables surrounded by dimension tables

Project Structure

The dbt project follows a modular architecture:

├── models/
│   ├── staging/          # Raw data cleaning and standardization
│   │   ├── stg_green_tripdata.sql
│   │   ├── stg_yellow_tripdata.sql
│   │   └── schema.yml
│   └── core/             # Business logic and dimensional models
│       ├── dim_zones.sql
│       ├── fact_trips.sql
│       ├── dm_monthly_zone_revenue.sql
│       └── schema.yml
├── macros/               # Reusable SQL functions
│   └── get_payment_type_description.sql
├── seeds/                # Static reference data
│   └── taxi_zone_lookup.csv
└── dbt_project.yml       # Project configuration

Implementation: NYC Taxi Data Pipeline

Staging Layer: Data Standardization

The staging layer cleanses and standardizes raw data using consistent patterns:

-- stg_green_tripdata.sql
{{ config(materialized='view') }}

with tripdata as (
  select *,
    row_number() over(partition by vendorid, lpep_pickup_datetime) as rn
  from {{ source('staging','green_tripdata') }}
  where vendorid is not null 
)
select
    -- identifiers
    {{ dbt_utils.generate_surrogate_key(['vendorid', 'lpep_pickup_datetime']) }} as tripid,
    {{ dbt.safe_cast("vendorid", api.Column.translate_type("integer")) }} as vendorid,
    
    -- timestamps
    cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
    cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,
    
    -- payment info
    cast(fare_amount as numeric) as fare_amount,
    {{ get_payment_type_description("payment_type") }} as payment_type_description
from tripdata
where rn = 1

Key Features:

  • Deduplication: Using ROW_NUMBER() for data quality
  • Type Casting: Standardizing data types across sources
  • Surrogate Keys: Generating unique identifiers
  • Macros: Reusable transformation logic

Core Layer: Dimensional Models

Dimension Tables

-- dim_zones.sql
{{ config(materialized='table') }}

select 
    locationid, 
    borough, 
    zone, 
    replace(service_zone,'Boro','Green') as service_zone 
from {{ ref('taxi_zone_lookup') }}

Fact Tables

-- fact_trips.sql
{{ config(materialized='table') }}

with green_tripdata as (
    select *, 'Green' as service_type
    from {{ ref('stg_green_tripdata') }}
), 
yellow_tripdata as (
    select *, 'Yellow' as service_type
    from {{ ref('stg_yellow_tripdata') }}
), 
trips_unioned as (
    select * from green_tripdata
    union all 
    select * from yellow_tripdata
)
select 
    trips_unioned.tripid,
    trips_unioned.service_type,
    pickup_zone.borough as pickup_borough,
    pickup_zone.zone as pickup_zone,
    dropoff_zone.borough as dropoff_borough,
    dropoff_zone.zone as dropoff_zone,
    trips_unioned.pickup_datetime,
    trips_unioned.fare_amount,
    trips_unioned.total_amount
from trips_unioned
inner join {{ ref('dim_zones') }} as pickup_zone
    on trips_unioned.pickup_locationid = pickup_zone.locationid
inner join {{ ref('dim_zones') }} as dropoff_zone
    on trips_unioned.dropoff_locationid = dropoff_zone.locationid
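
Data Marts

The project tree also lists dm_monthly_zone_revenue.sql in the core layer. A minimal sketch of such a mart, assuming the column names exposed by fact_trips above, aggregates revenue per pickup zone, month, and service type:

-- dm_monthly_zone_revenue.sql (illustrative sketch)
{{ config(materialized='table') }}

select
    pickup_zone as revenue_zone,
    date_trunc(date(pickup_datetime), month) as revenue_month,
    service_type,
    sum(fare_amount) as revenue_monthly_fare,
    sum(total_amount) as revenue_monthly_total_amount,
    count(tripid) as total_monthly_trips
from {{ ref('fact_trips') }}
group by 1, 2, 3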

Advanced dbt Features

Macros for Code Reusability

-- macros/get_payment_type_description.sql
{% macro get_payment_type_description(payment_type) -%}
    case {{ dbt.safe_cast(payment_type, api.Column.translate_type("integer")) }}
        when 1 then 'Credit card'
        when 2 then 'Cash'
        when 3 then 'No charge'
        when 4 then 'Dispute'
        when 5 then 'Unknown'
        when 6 then 'Voided trip'
        else 'EMPTY'
    end
{%- endmacro %}

Variables for Environment Configuration

-- Development vs Production data limits
{% if var('is_test_run', default=true) %}
  limit 100
{% endif %}
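
In context, this snippet sits at the end of a model so that development runs stay small, and the limit is switched off for production by overriding the variable on the command line. A sketch of the pattern in a complete (hypothetical) model:

-- Hypothetical illustration of the is_test_run pattern
{{ config(materialized='view') }}

select *
from {{ source('staging', 'green_tripdata') }}

-- Disable the limit in production with:
--   dbt run --vars '{is_test_run: false}'
{% if var('is_test_run', default=true) %}
  limit 100
{% endif %}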

Package Management

# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: dbt-labs/codegen
    version: 0.12.1

Data Quality and Testing

Built-in Tests

dbt provides four generic tests out of the box:

# models/staging/schema.yml
version: 2

models:
  - name: stg_green_tripdata
    columns:
      - name: tripid
        tests:
          - unique
          - not_null
      - name: payment_type
        tests:
          - accepted_values:
              values: [1,2,3,4,5,6]
      - name: pickup_locationid
        tests:
          - relationships:
              to: ref('dim_zones')
              field: locationid

Testing Strategy

  • Unique: Ensures primary key integrity
  • Not Null: Validates required fields
  • Accepted Values: Enforces valid enum values
  • Relationships: Maintains referential integrity
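
Beyond these generic tests, dbt also supports singular tests: plain SELECT statements saved under tests/ that fail whenever they return rows. A sketch (hypothetical file name, not part of this project) guarding against negative fares:

-- tests/assert_no_negative_fares.sql (illustrative singular test)
-- The test fails if this query returns any rows
select tripid, fare_amount
from {{ ref('fact_trips') }}
where fare_amount < 0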

Deployment and Production

Development Workflow

  1. Branch-based Development: Each developer works in isolated branches
  2. Testing: All tests must pass before merging
  3. Code Review: Pull request process for quality control
  4. Deployment: Automated production deployment

Production Jobs

dbt Cloud enables automated production workflows:

# Production job configuration
- dbt deps          # Install packages
- dbt seed          # Load reference data
- dbt run           # Execute transformations
- dbt test          # Validate data quality
- dbt docs generate # Create documentation

Continuous Integration

CI/CD pipelines ensure code quality:

  • Automated Testing: Run tests on every pull request
  • Isolated Environments: Temporary schemas for testing
  • Documentation Updates: Auto-generated docs for each deployment

Business Intelligence with Looker Studio

Dashboard Creation Process

  1. Data Source Connection: Connect Looker Studio to BigQuery
  2. Data Modeling: Configure aggregations and dimensions
  3. Visualization Design: Create charts, tables, and controls
  4. Interactive Features: Add filters and drill-down capabilities

Key Visualizations

  • Time Series Analysis: Trip volume trends over time
  • Geographic Analysis: Revenue by pickup zones
  • Service Comparison: Green vs Yellow taxi performance
  • Payment Analysis: Payment method distributions

Performance Insights

The dashboard reveals business insights such as:

  • COVID-19 impact on ridership (March 2020 drop)
  • Seasonal usage patterns
  • Geographic revenue concentrations
  • Payment method preferences

dbt Core Implementation

Local Development Setup

# Installation
pip install dbt-core dbt-bigquery

# Project initialization
dbt init dbtcore_bigquery

# Dependencies and execution
dbt deps
dbt seed
dbt run
dbt test

Configuration Management

# dbt_project.yml
name: 'dbtcore_bigquery'
version: '1.0.0'
profile: 'dbtcore_bigquery'

models:
  dbtcore_bigquery:
    staging:
      +materialized: view
    core:
      +materialized: table

vars:
  payment_type_values: [1, 2, 3, 4, 5, 6]

Best Practices and Scalability

Code Organization

  1. Layered Architecture: Staging → Core → Marts
  2. Naming Conventions: Clear, consistent model names
  3. Documentation: Comprehensive field descriptions
  4. Version Control: Git-based collaboration

Performance Optimization

  1. Materialization Strategy: Views for staging, tables for core models
  2. Incremental Models: For large, append-only datasets
  3. BigQuery Optimization: Partitioning and clustering
  4. Resource Management: Appropriate warehouse sizing
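
As an illustration of points 2 and 3, an incremental model with BigQuery partitioning and clustering (a sketch, not part of this project’s codebase) could look like:

-- Illustrative incremental model for BigQuery
{{ config(
    materialized='incremental',
    partition_by={'field': 'pickup_datetime', 'data_type': 'timestamp', 'granularity': 'day'},
    cluster_by=['vendorid']
) }}

select *
from {{ ref('stg_green_tripdata') }}
{% if is_incremental() %}
  -- only process rows newer than what already exists in the target table
  where pickup_datetime > (select max(pickup_datetime) from {{ this }})
{% endif %}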

Data Governance

  1. Testing Coverage: Comprehensive data quality tests
  2. Documentation: Auto-generated model documentation
  3. Access Control: Role-based permissions
  4. Monitoring: Job failure alerts and data freshness checks

Conclusion

This project demonstrates how to build a robust, scalable analytics engineering pipeline using modern tools and methodologies. The implementation showcases:

  • Modern ELT Architecture: Leveraging cloud data warehouse capabilities
  • Dimensional Modeling: Business-friendly star schema design
  • Software Engineering Practices: Version control, testing, and CI/CD
  • Business Intelligence: Interactive dashboards for stakeholder consumption

The pipeline processes millions of taxi trip records, transforming raw data into actionable business insights while maintaining data quality and operational excellence.

Key Achievements:

  • Automated data quality testing preventing bad data propagation
  • Modular, maintainable code architecture
  • Comprehensive documentation and lineage tracking
  • Production-ready deployment with monitoring and alerting

Next Steps: Consider implementing real-time streaming transformations, advanced ML features, or extending the pipeline to additional data sources for a more comprehensive analytics platform.