nyc_trees is a Python package for working with the 2015 NYC Tree Census dataset.
It provides tools to load, clean, explore, and model the dataset.
The main analysis can be found in report.ipynb, where multiclass classifiers are trained to
predict tree health in the borough of Queens. The package supports training sophisticated models,
and those presented in the report strike a sensible trade-off between complexity and
computational expense.
## Features

- Load any Socrata-hosted dataset using the SODA API
- Quality-of-life functions for quickly summarising datasets and computing statistics
- General preprocessing functions that can be applied to any dataset
- Preprocessing functions specific to the Tree Census dataset
- Custom scikit-learn transformers, including:
  - cluster similarity features
  - aggregation of categorical features into broader bins
- A full feature engineering pipeline for the NYC trees dataset
- Utilities for running a full model pipeline, including:
  - feature engineering
  - resampling using the imbalanced-learn library
  - model training and cross-validation
- Utilities for evaluating multiclass classifiers using DALEX
- A full test suite for every module
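One of the custom transformers listed above aggregates categorical features into broader bins. A minimal sketch of that idea, not the package's actual implementation (the class name, `min_freq` parameter, and example data below are illustrative):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class CategoryAggregator(BaseEstimator, TransformerMixin):
    """Illustrative sketch: collapse rare categories into a single bin."""

    def __init__(self, min_freq=0.05, other_label="other"):
        self.min_freq = min_freq
        self.other_label = other_label

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        # Record, per column, the categories frequent enough to keep
        self.keep_ = {
            col: set(
                X[col].value_counts(normalize=True)
                .loc[lambda s: s >= self.min_freq].index
            )
            for col in X.columns
        }
        return self

    def transform(self, X):
        X = pd.DataFrame(X).copy()
        for col, keep in self.keep_.items():
            # Replace any category outside the kept set with the broad bin
            X[col] = X[col].where(X[col].isin(keep), self.other_label)
        return X

df = pd.DataFrame({"species": ["oak"] * 6 + ["pine"] * 3 + ["yew"]})
agg = CategoryAggregator(min_freq=0.2).fit(df)
binned = agg.transform(df)["species"]
print(binned.unique())  # rare "yew" is folded into "other"
```

Because it follows the scikit-learn estimator contract (`fit`/`transform`, parameters stored as attributes), a transformer like this can be dropped into a `Pipeline` or `ColumnTransformer`.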
## Installation

```shell
git clone # anonymised repository
cd D100_400_Coursework
conda env create -f environment.yml
conda activate nyc_trees_env
pip install -e .
```

## Usage

```python
from nyc_trees.data import load_data
from nyc_trees.preprocessing import preprocess_master

# Load the NYC Tree Census dataset for the borough of Queens, without saving
df = load_data()

# Load and save to CSV
df = load_data(save=True, filename="nyc_trees.csv")

# Drop missing values and outliers, then save as parquet
df_preprocessed = preprocess_master(
    df=df,
    max_val=70,
    save=True,
    filename="nyc_trees_preprocessed.parquet"
)
```
```python
# Import the model of choice
from sklearn.linear_model import LogisticRegression

# Initialise the model and the hyperparameters to search
model = LogisticRegression(solver="lbfgs", max_iter=500, random_state=42)
logistic_grid = {
    "preprocess__spatial__clustersimilarity__n_clusters": [5, 10, 20, 30],
    "preprocess__spatial__clustersimilarity__gamma": [500, 1000, 2000],
    "model__C": [0.01, 0.1, 1.0],
}
```
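The `clustersimilarity` hyperparameters above correspond to a common pattern for cluster similarity features: fit k-means to spatial coordinates, then emit an RBF similarity to each centroid, with `gamma` setting the kernel width. A minimal sketch of that pattern (illustrative only, not necessarily the package's exact implementation):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

class ClusterSimilarity(BaseEstimator, TransformerMixin):
    """Illustrative sketch: RBF similarity to k-means centroids."""

    def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state

    def fit(self, X, y=None):
        # Learn cluster centres from the (e.g. latitude/longitude) coordinates
        self.kmeans_ = KMeans(
            n_clusters=self.n_clusters, n_init=10, random_state=self.random_state
        ).fit(X)
        return self

    def transform(self, X):
        # One feature per cluster: RBF similarity of each row to each centroid
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)

coords = np.random.default_rng(0).uniform(size=(100, 2))
sims = ClusterSimilarity(n_clusters=5, gamma=2.0, random_state=42).fit_transform(coords)
print(sims.shape)  # one similarity column per cluster
```

Larger `gamma` values make the similarity decay faster with distance, which is why `gamma` is worth searching over alongside `n_clusters`.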
```python
from nyc_trees.model_training import train_full_model
from nyc_trees.model_eval import model_metrics_summary

# Train the full pipeline and pickle the fitted object under the name "example_model"
example_model = train_full_model(
    X_train,
    y_train,
    model=model,
    resampler="under",
    hyperparameters=logistic_grid,
    save=True,
    modelname="example_model"
)

# Retrieve the best model and summarise its performance on the test set
example_model_best = example_model.best_estimator_
model_metrics_summary(example_model_best, X_test, y_test)
```
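Other scikit-learn classifiers can be slotted into the same workflow by swapping the `model` and the grid. A hedged sketch using a random forest (the pipeline prefixes are assumed to match those in the logistic regression example; the specific values are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

# Any scikit-learn estimator with fit/predict can take the model slot
model = RandomForestClassifier(random_state=42)
forest_grid = {
    "preprocess__spatial__clustersimilarity__n_clusters": [10, 20],
    "model__n_estimators": [100, 300],
    "model__max_depth": [None, 10],
}
print(sorted(forest_grid))
```

Keys prefixed with `model__` are routed to the estimator, while `preprocess__` keys tune the feature engineering steps of the pipeline.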
```python
from nyc_trees.model_eval import (
    model_init_dalex,
    model_dalex_feat_imp,
    model_dalex_pdp,
)

# Initialise explainers for each class (for a multiclass classifier)
example_explainers = model_init_dalex(example_model_best, X_test, y_test, "example")

# Compute feature importance for each class
example_feat_imp = model_dalex_feat_imp(example_explainers, random_state=42)

# Compute and plot PDPs for each class for the variable "tree_dbh"
example_pdp = model_dalex_pdp(example_explainers, ["tree_dbh"], random_state=42)
```

report.ipynb provides further examples of end-to-end workflows using this package.
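Per-class explainers for a multiclass model are typically built by wrapping the classifier's `predict_proba` so that each explainer sees only one class's probability. A minimal sketch of that wrapping idea (illustrative only; not the package's `model_init_dalex` implementation, and using iris data purely as a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=500).fit(X, y)

def per_class_predict(class_index):
    # Build a predict function exposing only one class's probability;
    # this is the kind of function a per-class explainer would wrap.
    def predict(model, data):
        return model.predict_proba(data)[:, class_index]
    return predict

predict_class0 = per_class_predict(0)
probs = predict_class0(clf, X[:5])
print(probs.shape)  # one probability per row, for class 0 only
```

Repeating this for each class index yields one explainer per class, which is what makes feature importance and PDP outputs readable in the multiclass setting.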
## Package structure

The nyc_trees package is organised to separate functionality into logical modules:
- config.py – Global constants and configuration variables for the project, including project paths, dataset metadata, and SODA query settings.
- data.py – Utilities for loading datasets from the SODA API, and computing key variables that were used during the exploratory data analysis stage. This includes the labels for the dataset.
- preprocessing_utils.py – Helper functions used in both preprocessing.py and feature_engineering.py.
- preprocessing.py – Functions for cleaning and preparing the dataset for analysis.
- feature_engineering.py – Custom scikit-learn transformers for the NYC trees dataset.
- model_training.py – Functions for training a model end-to-end. Offers support for resampling using
imbalanced-learnand cross validation using a grid search method. - model_eval.py - Functions for evaluating multiclass classifiers using DALEX. Improves DALEX outputs for interpretability when labels have multiple classes.
- plotting.py – Functions for creating key visuals used in eda_cleaning.ipynb and report.ipynb.
- tests/ – Unit tests for the package modules to ensure functionality and reproducibility.
- eda_cleaning.ipynb – Notebook demonstrating exploratory data analysis and data cleaning workflow.
- first_look.ipynb - Notebook containing a first look at the datasets and early analysis of the Forestry Work Orders dataset.
- report.ipynb – Notebook providing the main analysis, which aims to predict tree health to help the NYC Parks Authority locate trees at risk of poor health.
## Contributing

If you wish to contribute, feel free to fork the repository and make pull requests.
Please make sure to install and initialise the pre-commit hooks for this repository before making commits.
```shell
pip install pre-commit
pre-commit install
pre-commit run --all-files
```

## Testing

Unit tests are stored in the tests/ folder and can be run with:

```shell
pytest
```

All key modules are covered by unit tests. Please check that these tests pass before opening new pull requests.
## License

This project is licensed under the MIT license. See LICENSE for details.