LuqmaanSahar/D100_400_Coursework
NYC Trees (D100-D400 Project)


nyc_trees is a Python package for working with the 2015 NYC Tree Census dataset. It provides tools to load, clean, explore, and model the dataset.

The main analysis can be found in report.ipynb, where multiclass classifiers are trained to predict tree health in the borough of Queens. The package supports sophisticated models; those presented in the report strike a sensible trade-off between complexity and computational expense.


Features

  • Load any Socrata-hosted dataset using the SODA API

  • Quality-of-life functions for quickly summarising datasets and computing statistics

  • General preprocessing functions that can be applied to any dataset

  • Preprocessing functions specific to the Tree Census dataset

  • Custom scikit-learn transformers, including:

    • cluster-similarity features
    • aggregation of categorical features into broader bins

  • A full feature-engineering pipeline for the NYC trees dataset

  • Utilities for running a full model pipeline, including:

    • feature engineering
    • resampling using the imbalanced-learn library
    • model training and cross-validation

  • Utilities for evaluating multiclass classifiers using DALEX

  • A full test suite for every module
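The cluster-similarity transformer listed above follows a common scikit-learn pattern: fit a KMeans model on the coordinate columns, then emit each row's RBF similarity to every cluster centre. The package's actual implementation lives in feature_engineering.py; the sketch below is a hypothetical minimal version of that pattern, not the package's own code.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel


class ClusterSimilarity(BaseEstimator, TransformerMixin):
    """Replace coordinates with RBF similarity to KMeans cluster centres."""

    def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state

    def fit(self, X, y=None):
        # Learn cluster centres from the coordinate columns (e.g. lat/long)
        self.kmeans_ = KMeans(self.n_clusters, n_init=10,
                              random_state=self.random_state).fit(X)
        return self

    def transform(self, X):
        # One similarity feature per cluster, each in [0, 1]
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)


# 200 random points -> a 200 x 5 matrix of cluster similarities
rng = np.random.default_rng(0)
coords = rng.uniform(size=(200, 2))
sims = ClusterSimilarity(n_clusters=5, random_state=42).fit_transform(coords)
```

Exposing `n_clusters` and `gamma` as constructor arguments is what makes them tunable through a grid search, as in the hyperparameter grid shown under Model Training below.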


Installation

1. Clone the repository

git clone # anonymised repository
cd D100_400_Coursework

2. Set up the Conda environment

conda env create -f environment.yml
conda activate nyc_trees_env

3. Install the package locally

pip install -e .

Usage

Loading a dataset

from nyc_trees.data import load_data

# Load NYC Tree Census dataset for borough of Queens, without saving
df = load_data()
# Load and save to CSV
df = load_data(save=True, filename="nyc_trees.csv")
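Under the hood, `load_data` uses the SODA API, with the dataset metadata and query settings kept in config.py. The dataset id and filters below are illustrative assumptions (not copied from the package's configuration); they show how a SoQL request URL for a Socrata-hosted dataset is typically built.

```python
# Build a SODA (Socrata Open Data API) request URL by hand.
# The dataset id and filters here are assumptions for illustration only;
# the package reads its own settings from nyc_trees.config.
from urllib.parse import urlencode

base = "https://data.cityofnewyork.us/resource/uvpi-gqnh.json"  # assumed dataset id
params = {
    "$where": "boroname = 'Queens'",  # restrict to the borough of Queens
    "$limit": 50000,                  # SODA returns 1000 rows unless told otherwise
}
url = f"{base}?{urlencode(params)}"
```

The resulting URL can be fetched with any HTTP client and read straight into a DataFrame.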

Preprocessing

from nyc_trees.preprocessing import preprocess_master

# Drop missing values and outliers, then save as parquet
df_preprocessed = preprocess_master(
    df=df,
    max_val=70,
    save=True,
    filename="nyc_trees_preprocessed.parquet"
)
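The exact behaviour of `preprocess_master` is defined in preprocessing.py; as a rough illustration, and assuming the `max_val` cutoff applies to the `tree_dbh` (trunk diameter) column, the drop-missing-and-outliers step might look like this hypothetical miniature:

```python
import pandas as pd


def preprocess_mini(df: pd.DataFrame, max_val: float) -> pd.DataFrame:
    """Toy stand-in for preprocess_master: drop NAs, then drop outliers."""
    out = df.dropna()                       # remove rows with missing values
    return out[out["tree_dbh"] <= max_val]  # drop implausibly large diameters


toy = pd.DataFrame({"tree_dbh": [12.0, None, 95.0, 30.0],
                    "health": ["Good", "Fair", "Good", "Poor"]})
cleaned = preprocess_mini(toy, max_val=70)  # keeps the 12.0 and 30.0 rows
```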

Model Training

# import the training utility and a model of choice
from nyc_trees.model_training import train_full_model
from sklearn.linear_model import LogisticRegression

# initialise the model and hyperparameters to search
model = LogisticRegression(solver="lbfgs", max_iter=500, random_state=42)
logistic_grid = {
    "preprocess__spatial__clustersimilarity__n_clusters": [5, 10, 20, 30],
    "preprocess__spatial__clustersimilarity__gamma": [500, 1000, 2000],
    "model__C": [0.01, 0.1, 1.0],
}

# call train_full_model()
# pickle the trained object using the name "example_model"
example_model = train_full_model(
    X_train,
    y_train,
    model=model,
    resampler="under",
    hyperparameters=logistic_grid,
    save=True,
    modelname="example_model"
)
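The `.best_estimator_` access used in the evaluation step suggests `train_full_model` returns a fitted search object such as scikit-learn's GridSearchCV (an assumption; see model_training.py for the actual return type). This self-contained toy shows the same access pattern on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Small synthetic multiclass problem
X, y = make_classification(n_samples=200, n_classes=3, n_informative=4,
                           random_state=42)

search = GridSearchCV(
    Pipeline([("model", LogisticRegression(max_iter=500))]),
    param_grid={"model__C": [0.01, 0.1, 1.0]},  # same "model__" key style as above
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_    # winning value of C
best = search.best_estimator_        # pipeline refit on the full training data
```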

Model Evaluation

from nyc_trees.model_eval import (
    model_metrics_summary,
    model_init_dalex,
    model_dalex_feat_imp,
    model_dalex_pdp,
)

# retrieve the best model and summarise its performance on the test set
example_model_best = example_model.best_estimator_
model_metrics_summary(example_model_best, X_test, y_test)

# initialise explainers for each class (for a multiclass classifier)
example_explainers = model_init_dalex(example_model_best, X_test, y_test, "example")

# compute feature importance for each class
example_feat_imp = model_dalex_feat_imp(example_explainers, random_state=42)

# compute and plot PDPs for each class for the variable "tree_dbh"
example_pdp = model_dalex_pdp(example_explainers, ["tree_dbh"], random_state=42)
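`model_metrics_summary` and the DALEX helpers are specific to this package; as a rough illustration of the kind of summary a multiclass evaluation typically reports, scikit-learn alone can produce per-class precision, recall, and F1, plus balanced accuracy for imbalanced labels:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Imbalanced three-class toy problem, standing in for the tree-health labels
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           weights=[0.6, 0.3, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=500).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Per-class precision/recall/F1, and an accuracy measure robust to imbalance
report = classification_report(y_te, y_pred)
bal_acc = balanced_accuracy_score(y_te, y_pred)
```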

report.ipynb provides further examples for end-to-end workflows using this package.

Package Structure

The nyc_trees package is organised to separate functionality into logical modules:

  • config.py – Global constants and configuration variables for the project, including project paths, dataset metadata, and SODA query settings.
  • data.py – Utilities for loading datasets from the SODA API, and computing key variables that were used during the exploratory data analysis stage. This includes the labels for the dataset.
  • preprocessing_utils.py – Helper functions used in both preprocessing.py and feature_engineering.py.
  • preprocessing.py – Functions for cleaning and preparing the dataset for analysis.
  • feature_engineering.py – Custom scikit-learn transformers for the NYC trees dataset.
  • model_training.py – Functions for training a model end-to-end, with support for resampling using imbalanced-learn and cross-validation via grid search.
  • model_eval.py – Functions for evaluating multiclass classifiers using DALEX; improves DALEX outputs for interpretability when labels have multiple classes.
  • plotting.py – Functions for creating key visuals used in eda_cleaning.ipynb and report.ipynb.
  • tests/ – Unit tests for the package modules to ensure functionality and reproducibility.
  • eda_cleaning.ipynb – Notebook demonstrating the exploratory data analysis and data-cleaning workflow.
  • first_look.ipynb – Notebook containing a first look at the datasets and early analysis of the Forestry Work Orders dataset.
  • report.ipynb – Notebook providing the main analysis, which predicts tree health to help the NYC Parks Authority locate trees at risk of poor health.

Contributing

If you wish to contribute, feel free to fork the repository and open pull requests.

Please make sure to install and initialise the pre-commit hooks for this repository before making commits.

pip install pre-commit
pre-commit install
pre-commit run --all-files

Testing

Unit tests are stored in the tests/ folder, and can be run with:

pytest

All key modules are covered by unit tests. Please check that these tests pass before opening new pull requests.

License

This project is licensed under the MIT license. See LICENSE for details.

About

Coursework for D100/D400 of the 2025/26 MPhil in Economics and Data Science.
