NYC Taxi Data Pipeline

GCP-based data pipeline for NYC TLC taxi trip data: ingestion, transformation, ML (fleet recommender, anomaly detection), and a static dashboard.

Project Structure

saithi thesis code/
├── README.md
├── requirements.txt
├── config/                   # Configuration
│   ├── gcs_cors.json
│   └── gcp/                  # GCP deployment (Composer DAG, deploy.sh)
├── dashboard/                # Static web dashboard
├── docs/
│   ├── GCP_OPTIMIZATION_PLAN.md
│   ├── LEGACY.md
│   ├── gcp/                  # GCP implementation guide
│   ├── aws/                  # AWS replication guide
│   └── azure/                # Azure replication guide
├── legacy/                   # Original notebooks (00–04, Gradio)
├── pipeline_utils/           # Shared utilities
│   ├── config.py
│   ├── spark_utils.py
│   ├── bq_utils.py
│   ├── gcs_utils.py
│   ├── schemas.py
│   └── logging_utils.py
└── pipeline/                 # Optimized pipeline scripts
    ├── ingest_tlc.py         # Stage 00
    ├── 01_gcs_to_bronze.py   # Stage 01 (PySpark)
    ├── 02_bronze_to_silver.py # Stage 02 (PySpark)
    ├── 03_silver_to_preml.py # Stage 03 (PySpark)
    ├── 04a_fleet_recommender.py
    ├── 04b_anomalies.py
    ├── export_dashboard_impl.py
    ├── 05_ExportDashboardData.py
    └── run_all.py
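The contents of run_all.py are not shown in this README; the following is a minimal sketch of how such an orchestrator might chain the local Python stages. The stage order follows the Quick Start section below, and the subprocess-based invocation is an assumption, not the script's actual implementation.

```python
# Hypothetical sketch of a run_all.py-style orchestrator. The PySpark
# stages (01-03) run on Dataproc and are therefore excluded here.
import subprocess
import sys

# Local (non-Spark) stages, in dependency order.
PYTHON_STAGES = [
    "pipeline/ingest_tlc.py",             # Stage 00: ingest TLC data
    "pipeline/04a_fleet_recommender.py",  # Stage 04a: fleet recommender
    "pipeline/04b_anomalies.py",          # Stage 04b: anomaly detection
    "pipeline/05_ExportDashboardData.py", # Stage 05: dashboard export
]

def run_stages(stages):
    """Run each stage script in order, stopping on the first failure."""
    for stage in stages:
        print(f"Running {stage} ...")
        result = subprocess.run([sys.executable, stage])
        if result.returncode != 0:
            raise RuntimeError(f"Stage failed: {stage}")

if __name__ == "__main__":
    run_stages(PYTHON_STAGES)
```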

Quick Start

Option A: Optimized Pipeline (recommended)
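Before running any stage, install the Python dependencies and authenticate against GCP. The project ID comes from the GCP Resources section below; the gcloud CLI is assumed to be installed already.

```shell
# One-time setup (assumes the gcloud CLI is installed)
pip install -r requirements.txt
gcloud auth application-default login
gcloud config set project nyctaxi-467111
```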

# Python stages (00, 04a, 04b, 05)
python pipeline/ingest_tlc.py
python pipeline/04a_fleet_recommender.py
python pipeline/04b_anomalies.py
python pipeline/05_ExportDashboardData.py

# PySpark stages (01, 02, 03) — run on Dataproc
gcloud dataproc jobs submit pyspark pipeline/01_gcs_to_bronze.py --cluster=CLUSTER --region=us-central1
gcloud dataproc jobs submit pyspark pipeline/02_bronze_to_silver.py --cluster=CLUSTER --region=us-central1
gcloud dataproc jobs submit pyspark pipeline/03_silver_to_preml.py --cluster=CLUSTER --region=us-central1
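The submit commands above assume a Dataproc cluster already exists. A minimal creation command is sketched below; the cluster name is a placeholder (as in the commands above) and the machine sizing is an illustrative assumption, not a project requirement.

```shell
# Create a small Dataproc cluster for the PySpark stages (sizing is illustrative)
gcloud dataproc clusters create CLUSTER \
  --region=us-central1 \
  --num-workers=2 \
  --master-machine-type=n2-standard-4 \
  --worker-machine-type=n2-standard-4

# Delete it when the run completes to avoid idle cost
gcloud dataproc clusters delete CLUSTER --region=us-central1
```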

Option B: Legacy Notebooks

Use the notebooks in legacy/ in order: 00 → 01 → 02a–d → 03 → 04a → 04b.

Static Dashboard

Open dashboard/index.html in a browser. Check "Use demo data" to explore the dashboard without a GCS connection, or configure GCS access and uncheck it to load real pipeline output.
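Some browsers restrict fetches from file:// pages, so serving the dashboard over HTTP is more reliable. If the dashboard reads live data directly from GCS, the bucket also needs a CORS policy; the repo ships one at config/gcs_cors.json, and the bucket name below comes from the GCP Resources section.

```shell
# Serve the dashboard locally (Python 3.7+), then open http://localhost:8000/
python -m http.server 8000 --directory dashboard

# Apply the repo's CORS policy to the dashboard bucket (for live-data mode)
gsutil cors set config/gcs_cors.json gs://nyc_dashboard_bucket
```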

GCP Resources

Resource           Name
Project            nyctaxi-467111
Raw bucket         nyc_raw_data_bucket
Dashboard bucket   nyc_dashboard_bucket
BigQuery datasets  RawBronze, CleanSilver, PreMlGold, PostMlGold
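To replicate this environment from scratch, the buckets and BigQuery datasets can be created with standard gsutil/bq commands. The resource names come from the table above; the region and dataset location choices here are assumptions.

```shell
# Buckets (us-central1 is an assumption, matching the Dataproc region)
gsutil mb -l us-central1 gs://nyc_raw_data_bucket
gsutil mb -l us-central1 gs://nyc_dashboard_bucket

# BigQuery datasets
for ds in RawBronze CleanSilver PreMlGold PostMlGold; do
  bq mk --dataset --location=US "nyctaxi-467111:$ds"
done
```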

Documentation

Doc                    Description
Docs Index             Documentation overview
Pipeline README        Optimized pipeline stages, run commands, config
GCP Implementation     Full GCP replication steps, scheduling, config
GCP Optimization Plan  Architecture, optimization rationale, implementation status
Legacy Code            Original notebooks and migration path
AWS Replication        AWS setup (S3, EMR, MWAA)
Azure Replication      Azure setup (Blob Storage, Databricks, Data Factory)

License

Academic / thesis use.