LuqmaanSahar/D100_400_Coursework
NYC Trees (D100-D400 Project)


nyc_trees is a Python package for working with the 2015 NYC Tree Census dataset. It provides tools to load, clean, explore, and model the dataset.

The main analysis can be found in report.ipynb, where multiclass classifiers are trained to predict tree health in the borough of Queens. The package supports sophisticated models; those presented in the report strike a sensible trade-off between complexity and computational expense.


Features

  • Load any Socrata-hosted dataset using the SODA API

  • Quality-of-life functions for quickly summarising datasets and computing statistics

  • General preprocessing functions that can be applied to any dataset

  • Preprocessing functions specific to the Tree Census dataset

  • Custom scikit-learn transformers, including:

    • cluster-similarity features
    • aggregation of categorical features into broader bins

  • A full feature-engineering pipeline for the NYC trees dataset

  • Utilities for running a full model pipeline, including:

    • feature engineering
    • resampling using the imbalanced-learn library
    • model training and cross-validation

  • Utilities for evaluating multiclass classifiers using DALEX

  • A full test suite for every module
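The cluster-similarity transformer listed above follows a common scikit-learn pattern: fit a KMeans model on the coordinate columns, then emit each row's RBF similarity to every cluster centre. The package's actual implementation lives in feature_engineering.py; the sketch below is a hypothetical minimal version of that pattern, not the package's own code.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel


class ClusterSimilarity(BaseEstimator, TransformerMixin):
    """Replace coordinates with RBF similarity to KMeans cluster centres."""

    def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state

    def fit(self, X, y=None):
        # Learn cluster centres from the coordinate columns (e.g. lat/long)
        self.kmeans_ = KMeans(self.n_clusters, n_init=10,
                              random_state=self.random_state).fit(X)
        return self

    def transform(self, X):
        # One similarity feature per cluster, each in [0, 1]
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)


# 200 random points -> a 200 x 5 matrix of cluster similarities
rng = np.random.default_rng(0)
coords = rng.uniform(size=(200, 2))
sims = ClusterSimilarity(n_clusters=5, random_state=42).fit_transform(coords)
```

Exposing `n_clusters` and `gamma` as constructor arguments is what makes them tunable through a grid search, as in the hyperparameter grid shown under Model Training below.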


Installation

1. Clone the repository

git clone # anonymised repository
cd D100_400_Coursework

2. Set up the Conda environment

conda env create -f environment.yml
conda activate nyc_trees_env

3. Install the package locally

pip install -e .

Usage

Loading a dataset

from nyc_trees.data import load_data

# Load NYC Tree Census dataset for borough of Queens, without saving
df = load_data()
# Load and save to CSV
df = load_data(save=True, filename="nyc_trees.csv")
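Under the hood, `load_data` uses the SODA API, with the dataset metadata and query settings kept in config.py. The dataset id and filters below are illustrative assumptions (not copied from the package's configuration); they show how a SoQL request URL for a Socrata-hosted dataset is typically built.

```python
# Build a SODA (Socrata Open Data API) request URL by hand.
# The dataset id and filters here are assumptions for illustration only;
# the package reads its own settings from nyc_trees.config.
from urllib.parse import urlencode

base = "https://data.cityofnewyork.us/resource/uvpi-gqnh.json"  # assumed dataset id
params = {
    "$where": "boroname = 'Queens'",  # restrict to the borough of Queens
    "$limit": 50000,                  # SODA returns 1000 rows unless told otherwise
}
url = f"{base}?{urlencode(params)}"
```

The resulting URL can be fetched with any HTTP client and read straight into a DataFrame.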

Preprocessing

from nyc_trees.preprocessing import preprocess_master

# Drop missing values and outliers, then save as parquet
df_preprocessed = preprocess_master(
    df=df,
    max_val=70,
    save=True,
    filename="nyc_trees_preprocessed.parquet"
)
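The exact behaviour of `preprocess_master` is defined in preprocessing.py; as a rough illustration, and assuming the `max_val` cutoff applies to the `tree_dbh` (trunk diameter) column, the drop-missing-and-outliers step might look like this hypothetical miniature:

```python
import pandas as pd


def preprocess_mini(df: pd.DataFrame, max_val: float) -> pd.DataFrame:
    """Toy stand-in for preprocess_master: drop NAs, then drop outliers."""
    out = df.dropna()                       # remove rows with missing values
    return out[out["tree_dbh"] <= max_val]  # drop implausibly large diameters


toy = pd.DataFrame({"tree_dbh": [12.0, None, 95.0, 30.0],
                    "health": ["Good", "Fair", "Good", "Poor"]})
cleaned = preprocess_mini(toy, max_val=70)  # keeps the 12.0 and 30.0 rows
```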

Model Training

# import the training utility and a model of choice
from nyc_trees.model_training import train_full_model
from sklearn.linear_model import LogisticRegression

# initialise the model and hyperparameters to search
model = LogisticRegression(solver="lbfgs", max_iter=500, random_state=42)
logistic_grid = {
    "preprocess__spatial__clustersimilarity__n_clusters": [5, 10, 20, 30],
    "preprocess__spatial__clustersimilarity__gamma": [500, 1000, 2000],
    "model__C": [0.01, 0.1, 1.0],
}

# call train_full_model()
# pickle the trained object using the name "example_model"
example_model = train_full_model(
    X_train,
    y_train,
    model=model,
    resampler="under",
    hyperparameters=logistic_grid,
    save=True,
    modelname="example_model"
)
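The `.best_estimator_` access used in the evaluation step suggests `train_full_model` returns a fitted search object such as scikit-learn's GridSearchCV (an assumption; see model_training.py for the actual return type). This self-contained toy shows the same access pattern on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Small synthetic multiclass problem
X, y = make_classification(n_samples=200, n_classes=3, n_informative=4,
                           random_state=42)

search = GridSearchCV(
    Pipeline([("model", LogisticRegression(max_iter=500))]),
    param_grid={"model__C": [0.01, 0.1, 1.0]},  # same "model__" key style as above
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_    # winning value of C
best = search.best_estimator_        # pipeline refit on the full training data
```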

Model Evaluation

from nyc_trees.model_eval import (
    model_metrics_summary,
    model_init_dalex,
    model_dalex_feat_imp,
    model_dalex_pdp,
)

# retrieve the best model and summarise its performance on the test set
example_model_best = example_model.best_estimator_
model_metrics_summary(example_model_best, X_test, y_test)

# initialise explainers for each class (for a multiclass classifier)
example_explainers = model_init_dalex(example_model_best, X_test, y_test, "example")

# compute feature importance for each class
example_feat_imp = model_dalex_feat_imp(example_explainers, random_state=42)

# compute and plot PDPs for each class for the variable "tree_dbh"
example_pdp = model_dalex_pdp(example_explainers, ["tree_dbh"], random_state=42)
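`model_metrics_summary` and the DALEX helpers are specific to this package; as a rough illustration of the kind of summary a multiclass evaluation typically reports, scikit-learn alone can produce per-class precision, recall, and F1, plus balanced accuracy for imbalanced labels:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Imbalanced three-class toy problem, standing in for the tree-health labels
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           weights=[0.6, 0.3, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=500).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Per-class precision/recall/F1, and an accuracy measure robust to imbalance
report = classification_report(y_te, y_pred)
bal_acc = balanced_accuracy_score(y_te, y_pred)
```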

report.ipynb provides further examples for end-to-end workflows using this package.

Package Structure

The nyc_trees package is organised to separate functionality into logical modules:

  • config.py – Global constants and configuration variables for the project, including project paths, dataset metadata, and SODA query settings.
  • data.py – Utilities for loading datasets from the SODA API, and computing key variables that were used during the exploratory data analysis stage. This includes the labels for the dataset.
  • preprocessing_utils.py – Helper functions used in both preprocessing.py and feature_engineering.py.
  • preprocessing.py – Functions for cleaning and preparing the dataset for analysis.
  • feature_engineering.py – Custom scikit-learn transformers for the NYC trees dataset.
  • model_training.py – Functions for training a model end-to-end, with support for resampling using imbalanced-learn and cross-validation via grid search.
  • model_eval.py – Functions for evaluating multiclass classifiers using DALEX; improves DALEX outputs for interpretability when labels have multiple classes.
  • plotting.py – Functions for creating key visuals used in eda_cleaning.ipynb and report.ipynb.
  • tests/ – Unit tests for the package modules to ensure functionality and reproducibility.
  • eda_cleaning.ipynb – Notebook demonstrating the exploratory data analysis and data-cleaning workflow.
  • first_look.ipynb – Notebook containing a first look at the datasets and early analysis of the Forestry Work Orders dataset.
  • report.ipynb – Notebook providing the main analysis, which predicts tree health to help the NYC Parks Authority locate trees at risk of poor health.

Contributing

If you wish to contribute, feel free to fork the repository and open pull requests.

Please make sure to install and initialise the pre-commit hooks for this repository before making commits.

pip install pre-commit
pre-commit install
pre-commit run --all-files

Testing

Unit tests are stored in the tests/ folder, and can be run with:

pytest

All key modules are covered by unit tests. Please check that these tests pass before opening new pull requests.

License

This project is licensed under the MIT license. See LICENSE for details.

About

Coursework for D100/D400 of the 2025/26 MPhil in Economics and Data Science.
