NYC Taxi Data Pipeline

GCP-based data pipeline for NYC TLC taxi trip data: ingestion, transformation, ML (fleet recommender, anomaly detection), and a static dashboard.

Project Structure

saithi thesis code/
├── README.md
├── requirements.txt
├── config/                   # Configuration
│   ├── gcs_cors.json
│   └── gcp/                  # GCP deployment (Composer DAG, deploy.sh)
├── dashboard/                # Static web dashboard
├── docs/
│   ├── GCP_OPTIMIZATION_PLAN.md
│   ├── LEGACY.md
│   ├── gcp/                  # GCP implementation guide
│   ├── aws/                  # AWS replication guide
│   └── azure/                # Azure replication guide
├── legacy/                   # Original notebooks (00–04, Gradio)
├── pipeline_utils/           # Shared utilities
│   ├── config.py
│   ├── spark_utils.py
│   ├── bq_utils.py
│   ├── gcs_utils.py
│   ├── schemas.py
│   └── logging_utils.py
└── pipeline/                 # Optimized pipeline scripts
    ├── ingest_tlc.py         # Stage 00
    ├── 01_gcs_to_bronze.py   # Stage 01 (PySpark)
    ├── 02_bronze_to_silver.py # Stage 02 (PySpark)
    ├── 03_silver_to_preml.py # Stage 03 (PySpark)
    ├── 04a_fleet_recommender.py
    ├── 04b_anomalies.py
    ├── export_dashboard_impl.py
    ├── 05_ExportDashboardData.py
    └── run_all.py
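The contents of run_all.py are not shown in this README; the following is a minimal sketch of how such an orchestrator might chain the local Python stages. The stage order follows the Quick Start section below, and the subprocess-based invocation is an assumption, not the script's actual implementation.

```python
# Hypothetical sketch of a run_all.py-style orchestrator. The PySpark
# stages (01-03) run on Dataproc and are therefore excluded here.
import subprocess
import sys

# Local (non-Spark) stages, in dependency order.
PYTHON_STAGES = [
    "pipeline/ingest_tlc.py",             # Stage 00: ingest TLC data
    "pipeline/04a_fleet_recommender.py",  # Stage 04a: fleet recommender
    "pipeline/04b_anomalies.py",          # Stage 04b: anomaly detection
    "pipeline/05_ExportDashboardData.py", # Stage 05: dashboard export
]

def run_stages(stages):
    """Run each stage script in order, stopping on the first failure."""
    for stage in stages:
        print(f"Running {stage} ...")
        result = subprocess.run([sys.executable, stage])
        if result.returncode != 0:
            raise RuntimeError(f"Stage failed: {stage}")

if __name__ == "__main__":
    run_stages(PYTHON_STAGES)
```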

Quick Start

Option A: Optimized Pipeline (recommended)
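Before running any stage, install the Python dependencies and authenticate against GCP. The project ID comes from the GCP Resources section below; the gcloud CLI is assumed to be installed already.

```shell
# One-time setup (assumes the gcloud CLI is installed)
pip install -r requirements.txt
gcloud auth application-default login
gcloud config set project nyctaxi-467111
```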

# Python stages (00, 04a, 04b, 05)
python pipeline/ingest_tlc.py
python pipeline/04a_fleet_recommender.py
python pipeline/04b_anomalies.py
python pipeline/05_ExportDashboardData.py

# PySpark stages (01, 02, 03) — run on Dataproc
gcloud dataproc jobs submit pyspark pipeline/01_gcs_to_bronze.py --cluster=CLUSTER --region=us-central1
gcloud dataproc jobs submit pyspark pipeline/02_bronze_to_silver.py --cluster=CLUSTER --region=us-central1
gcloud dataproc jobs submit pyspark pipeline/03_silver_to_preml.py --cluster=CLUSTER --region=us-central1
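The submit commands above assume a Dataproc cluster already exists. A minimal creation command is sketched below; the cluster name is a placeholder (as in the commands above) and the machine sizing is an illustrative assumption, not a project requirement.

```shell
# Create a small Dataproc cluster for the PySpark stages (sizing is illustrative)
gcloud dataproc clusters create CLUSTER \
  --region=us-central1 \
  --num-workers=2 \
  --master-machine-type=n2-standard-4 \
  --worker-machine-type=n2-standard-4

# Delete it when the run completes to avoid idle cost
gcloud dataproc clusters delete CLUSTER --region=us-central1
```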

Option B: Legacy Notebooks

Use the notebooks in legacy/ in order: 00 → 01 → 02a–d → 03 → 04a → 04b.

Static Dashboard

Open dashboard/index.html in a browser. Check "Use demo data" to explore the dashboard without a GCS connection, or configure GCS access and uncheck it to load real pipeline output.
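Some browsers restrict fetches from file:// pages, so serving the dashboard over HTTP is more reliable. If the dashboard reads live data directly from GCS, the bucket also needs a CORS policy; the repo ships one at config/gcs_cors.json, and the bucket name below comes from the GCP Resources section.

```shell
# Serve the dashboard locally (Python 3.7+), then open http://localhost:8000/
python -m http.server 8000 --directory dashboard

# Apply the repo's CORS policy to the dashboard bucket (for live-data mode)
gsutil cors set config/gcs_cors.json gs://nyc_dashboard_bucket
```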

GCP Resources

Resource           Name
Project            nyctaxi-467111
Raw bucket         nyc_raw_data_bucket
Dashboard bucket   nyc_dashboard_bucket
BigQuery datasets  RawBronze, CleanSilver, PreMlGold, PostMlGold
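To replicate this environment from scratch, the buckets and BigQuery datasets can be created with standard gsutil/bq commands. The resource names come from the table above; the region and dataset location choices here are assumptions.

```shell
# Buckets (us-central1 is an assumption, matching the Dataproc region)
gsutil mb -l us-central1 gs://nyc_raw_data_bucket
gsutil mb -l us-central1 gs://nyc_dashboard_bucket

# BigQuery datasets
for ds in RawBronze CleanSilver PreMlGold PostMlGold; do
  bq mk --dataset --location=US "nyctaxi-467111:$ds"
done
```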

Documentation

Doc                    Description
Docs Index             Documentation overview
Pipeline README        Optimized pipeline stages, run commands, config
GCP Implementation     Full GCP replication steps, scheduling, config
GCP Optimization Plan  Architecture, optimization rationale, implementation status
Legacy Code            Original notebooks and migration path
AWS Replication        AWS setup (S3, EMR, MWAA)
Azure Replication      Azure setup (Blob Storage, Databricks, Data Factory)

License

Academic / thesis use.