Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
mlflow.db
train.csv
test.csv
sample_submission.csv
data_description.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
# Use Python 3.12 slim
FROM python:3.12-slim

# Avoid interactive prompts
ENV DEBIAN_FRONTEND=noninteractive

# Install basic system tools
RUN apt-get update && apt-get install -y \
ca-certificates \
curl \
git \
sudo \
&& rm -rf /var/lib/apt/lists/*

# Setup directory structure
RUN mkdir -p /install /project
WORKDIR /project

# Install Python packages
COPY requirements.txt /install/requirements.txt
RUN pip install --upgrade pip && \
pip install --no-cache-dir jupyterlab jupyterlab_vim jupytext mlflow -r /install/requirements.txt

# Config files
# COPY etc_sudoers /etc/sudoers
# COPY bashrc /root/.bashrc

# Reporting
COPY scripts/version.sh /install/
RUN /install/version.sh 2>&1 | tee /install/version.log

# Jupyter and MLflow UI ports
EXPOSE 8888 5000

CMD ["/bin/bash"]
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
# MLflow Housing Price Prediction Tutorial

This project teaches how to manage the machine learning lifecycle using MLflow. It demonstrates how to track experiments, log metrics, and package models using a housing price prediction dataset.

## Reader Experience
The 60-minute tutorial follows this breakdown:
* **Setup (5 min):** Build the Docker environment and verify the MLflow tracking server.
* **Introduction (10 min):** Understand the core components (Tracking, Models, Projects).
* **API Exploration (20 min):** Work through `mlflow.API.ipynb` to learn the core logging calls for parameters and metrics.
* **Complete Example (25 min):** Run `mlflow.example.ipynb` to predict housing prices using Ridge Regression.

## Quick Start
* `cd class_project/DATA605/Spring2026/projects/UmdTask463_DATA605_Spring2026_MLflow`
* `./docker_build.sh`
* `./docker_jupyter.sh`

## Notebooks
1. **mlflow.eda.ipynb**
* Prepares the raw housing data.
* Handles outliers, missing values, and saves the cleaned dataset to `artifacts/`.
2. **mlflow.API.ipynb**
* A walkthrough of the core MLflow API.
* Covers experiment creation and basic logging of parameters and metrics.
3. **mlflow.example.ipynb**
* An end-to-end application predicting housing prices using the cleaned data.
* Demonstrates hyperparameter tuning (Ridge Alpha) and model comparison.

## Implementation Details
* `mlflow_utils.py`: Contains helper functions for lifecycle management and logging metrics.
* `requirements.txt`: Environment dependencies with pinned versions.
* Uses Docker to ensure a reproducible development environment.
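
The helper module itself is not shown here, so the following is only a minimal sketch of what a context-manager helper for run lifecycle management could look like. The name `start_run_sketch` and its parameters are hypothetical and are not the project's actual API; the sketch relies only on `mlflow.set_experiment` and `mlflow.start_run`, and the import is guarded so the file can be loaded where MLflow is not installed.

```python
from contextlib import contextmanager

try:
    import mlflow
except ImportError:  # Lets the sketch be imported where MLflow is not installed.
    mlflow = None


@contextmanager
def start_run_sketch(experiment_name, run_name=None):
    """Hypothetical helper: select an experiment, open a run, close it on exit."""
    if mlflow is None:
        raise RuntimeError("MLflow is not installed in this environment")
    # Creates the experiment if it does not exist yet, then makes it active.
    mlflow.set_experiment(experiment_name)
    # mlflow.start_run is itself a context manager that ends the run on exit.
    with mlflow.start_run(run_name=run_name) as run:
        yield run


# Possible usage (assumes MLflow is installed and a tracking URI is set):
# with start_run_sketch("Verification Test", run_name="smoke-test"):
#     mlflow.log_param("test_mode", "manual_verification")
#     mlflow.log_metric("fake_rmse", 0.5)
```

Wrapping the run in a context manager means the run is ended even if the body raises, which is presumably why the project moved to this pattern.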
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
## [2026-04-29]
- Restored project from backup directory to primary repository.
- Updated `requirements.txt` to include `mlflow`, `scikit-learn`, and `pyyaml`.
- Modified `Dockerfile` to align with `docker.use_standard_style`.
- Initialized `helpers_root` submodule and verified `hdbg` integration.
- Confirmed MLflow functionality by running `mlflow_utils.py` inside the container and logging test parameters.

## [2026-04-30]
- Added the Kaggle House Prices dataset (`train.csv`, `test.csv`).
- Developed `data_loader.py` utilizing `helpers.hdbg` for file validation.
- Verified data integrity (1460 rows, 81 columns).
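
The shape check above can be sketched with the standard library alone. This is an illustration, not the contents of `data_loader.py`: the function name `check_csv_shape` is hypothetical, plain exceptions stand in for the `helpers.hdbg` assertions, and the row count is assumed to exclude the header line.

```python
import csv
import os
import tempfile


def check_csv_shape(path, expected_rows, expected_cols):
    """Return (rows, cols) of a CSV file, raising if the shape is unexpected."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Missing data file: {path}")
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        # Count data records only; the header was consumed above.
        n_rows = sum(1 for _ in reader)
    shape = (n_rows, len(header))
    if shape != (expected_rows, expected_cols):
        raise ValueError(f"Unexpected shape {shape} for {path}")
    return shape


# Demo on a tiny stand-in file. For the real data the call would be
# check_csv_shape("train.csv", 1460, 81).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", newline="") as f:
        csv.writer(f).writerows(
            [["Id", "GrLivArea", "SalePrice"], [1, 1710, 208500], [2, 1262, 181500]]
        )
    demo_shape = check_csv_shape(demo_path, expected_rows=2, expected_cols=3)
print(demo_shape)
```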

## [2026-05-01]
- Fixed docker working directory.
- Removed outliers where GrLivArea > 4000.
- Applied log transformation to SalePrice to address skewness.
- Created df_no_outliers and saved progress to `train_clean.csv`.
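
A minimal sketch of the two cleaning steps above, written in plain Python so it runs without pandas. The sample rows and the `LogSalePrice` key are hypothetical, and the natural logarithm is an assumption, since the entry does not say which log variant the notebook uses.

```python
import math

# Hypothetical sample rows; the real data comes from train.csv.
rows = [
    {"GrLivArea": 1710, "SalePrice": 208500},
    {"GrLivArea": 4676, "SalePrice": 184750},
    {"GrLivArea": 1262, "SalePrice": 181500},
]

AREA_THRESHOLD = 4000  # Rows with GrLivArea above this are treated as outliers.


def clean_rows(rows, threshold=AREA_THRESHOLD):
    """Drop large-area outliers and add a log-transformed price to each row."""
    cleaned = []
    for row in rows:
        if row["GrLivArea"] > threshold:
            continue
        new_row = dict(row)
        # Assumption: natural log; log1p would be an equally plausible choice.
        new_row["LogSalePrice"] = math.log(row["SalePrice"])
        cleaned.append(new_row)
    return cleaned


cleaned = clean_rows(rows)
print(len(cleaned), "rows kept")
```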

## [2026-05-02]
- Finalized EDA by imputing missing values and one-hot encoding categorical values.
- Saved final model-ready data to `train_clean.csv`.
- Updated `mlflow_utils.py` to handle experiment lifecycle, metrics logging, and model artifact serialization.
- Implemented `mlflow.API.ipynb` and tested by logging test parameters and metrics.
- Configured Jupytext pairing for all notebooks.
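
The imputation and one-hot encoding mentioned in this entry can be illustrated as follows. This is a sketch under assumptions: median imputation is one common choice and may not be what the EDA notebook does, the helper names are hypothetical, and `LotFrontage` and `MSZoning` are used only as example columns.

```python
from statistics import median

# Hypothetical sample rows with one missing numeric value.
rows = [
    {"LotFrontage": 65.0, "MSZoning": "RL"},
    {"LotFrontage": None, "MSZoning": "RM"},
    {"LotFrontage": 80.0, "MSZoning": "RL"},
]


def impute_median(rows, column):
    """Replace missing values in a numeric column with the column median."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if r[column] == cat else 0
        encoded.append(new_row)
    return encoded


prepared = one_hot(impute_median(rows, "LotFrontage"), "MSZoning")
print(prepared[1])
```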

## [2026-05-04]
- Created `run_mlflow.sh` to standardize the MLflow UI launch with explicit host and port settings.
- Updated `mlflow_utils.py` to use a Python context manager.
- Performed linear and Ridge regression runs on an 80/20 split of `train_clean.csv`.
- Performed a sweep over the Ridge `alpha` hyperparameter to find the best-performing regularization strength.
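
The split-and-sweep procedure in this entry can be sketched without any third-party library. This is a stand-in, not the notebook's code: it uses a closed-form single-feature ridge fit with an unpenalized intercept instead of a library estimator, synthetic data instead of `train_clean.csv`, and a hypothetical grid of alpha values.

```python
import math
import random


def fit_ridge_1d(xs, ys, alpha):
    """Closed-form ridge fit for one feature with an unpenalized intercept."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = s_xy / (s_xx + alpha)  # alpha shrinks the slope toward zero
    return slope, y_mean - slope * x_mean


def rmse(xs, ys, slope, intercept):
    """Root mean squared error of the linear predictions."""
    squared = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic data standing in for the cleaned housing features.
rng = random.Random(0)
xs = [rng.uniform(0.0, 10.0) for _ in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]

# Shuffled 80/20 train/validation split.
indices = list(range(len(xs)))
rng.shuffle(indices)
cut = int(0.8 * len(indices))
train_idx, val_idx = indices[:cut], indices[cut:]
x_train, y_train = [xs[i] for i in train_idx], [ys[i] for i in train_idx]
x_val, y_val = [xs[i] for i in val_idx], [ys[i] for i in val_idx]

# Sweep alpha and keep the validation RMSE for each value.
results = {}
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    slope, intercept = fit_ridge_1d(x_train, y_train, alpha)
    results[alpha] = rmse(x_val, y_val, slope, intercept)

best_alpha = min(results, key=results.get)
print(f"best alpha: {best_alpha}, validation RMSE: {results[best_alpha]:.3f}")
```

In a tracked workflow, each iteration of the sweep would typically be logged as its own run, with `alpha` as a parameter and the validation RMSE as a metric, so the runs can be compared side by side in the MLflow UI.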

## [2026-05-05]
- Changed `run_mlflow.sh` to start from a clean state by deleting old files on every run, and to bind the UI to 0.0.0.0.
- Changed `Dockerfile` to install MLflow and Jupytext.
- Changed `mlflow.example.ipynb` to write runs to a single folder.

## [2026-05-06]
- Modified `run_jupyter.sh` to resolve `grep` on the current working directory.

## [2026-05-07]
- Adapted the `README.md` template to describe this project.
- Added markdown cells to `mlflow.API.ipynb` so that it reads as a tutorial.
- Modified `requirements.txt` to include version numbers for required software.

## [2026-05-08]
- Modified `mlflow.API.ipynb` to store its runs in the same temp folder as `mlflow.example.ipynb`.
- Reorganized various files to better match the project folder template.

Large diffs are not rendered by default.

Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
#!/bin/bash
# """
# This script launches a Docker container with an interactive bash shell for
# development.
# """

# Exit immediately if any command exits with a non-zero status.
set -e

# Import the utility functions from the project template.
GIT_ROOT=$(git rev-parse --show-toplevel)
source $GIT_ROOT/class_project/project_template/utils.sh

# Parse default args (-h, -v) and enable set -x if -v is passed.
parse_default_args "$@"

# Load Docker configuration variables for this script.
get_docker_vars_script ${BASH_SOURCE[0]}
source $DOCKER_NAME
print_docker_vars

# List the available Docker images matching the expected image name.
run "docker image ls $FULL_IMAGE_NAME"

# Configure and run the Docker container with interactive bash shell.
# - Container is removed automatically on exit (--rm)
# - Interactive mode with TTY allocation (-ti)
# - Port forwarding for Jupyter or other services
# - Git root mounted to /git_root inside container
CONTAINER_NAME=${IMAGE_NAME}_bash
PORT=
DOCKER_CMD=$(get_docker_bash_command)
DOCKER_CMD_OPTS=$(get_docker_bash_options $CONTAINER_NAME $PORT)
DOCKER_CMD_OPTS="$DOCKER_CMD_OPTS -w /git_root/class_project/data605/Spring2026/projects/UmdTask463_DATA605_Spring2026_MLflow"
run "$DOCKER_CMD $DOCKER_CMD_OPTS $FULL_IMAGE_NAME"
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
#!/bin/bash
# """
# Build a Docker container image for the project.
#
# This script sets up the build environment with error handling and command
# tracing, loads Docker configuration from docker_name.sh, and builds the
# Docker image using the build_container_image utility function. It supports
# both single-architecture and multi-architecture builds via the
# DOCKER_BUILD_MULTI_ARCH environment variable.
# """

# Exit immediately if any command exits with a non-zero status.
set -e

# Import the utility functions.
GIT_ROOT=$(git rev-parse --show-toplevel)
source $GIT_ROOT/class_project/project_template/utils.sh

# Parse default args (-h, -v) and enable set -x if -v is passed.
# Shift processed option flags so remaining args are passed to the build.
parse_default_args "$@"
shift $((OPTIND-1))

# Load Docker configuration variables (REPO_NAME, IMAGE_NAME, FULL_IMAGE_NAME).
get_docker_vars_script ${BASH_SOURCE[0]}
source $DOCKER_NAME
print_docker_vars

# Configure Docker build settings.
# Enable BuildKit for improved build performance and features.
export DOCKER_BUILDKIT=1
#export DOCKER_BUILDKIT=0

# Configure single-architecture build (set to 1 for multi-arch build).
#export DOCKER_BUILD_MULTI_ARCH=1
export DOCKER_BUILD_MULTI_ARCH=0

# Build the container image.
# Pass extra arguments (e.g., --no-cache) via command line after -v.
build_container_image "$@"
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
bash: line 1: /data/version.sh: No such file or directory
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
#!/bin/bash
# """
# Remove Docker container image for the project.
#
# This script cleans up Docker images by removing the container image
# matching the project configuration. Useful for freeing disk space or
# ensuring a fresh build.
# """

# Exit immediately if any command exits with a non-zero status.
set -e

# Import the utility functions.
GIT_ROOT=$(git rev-parse --show-toplevel)
source $GIT_ROOT/class_project/project_template/utils.sh

# Parse default args (-h, -v) and enable set -x if -v is passed.
parse_default_args "$@"

# Load Docker configuration variables for this script.
get_docker_vars_script ${BASH_SOURCE[0]}
source $DOCKER_NAME
print_docker_vars

# Remove the container image.
remove_container_image
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
#!/bin/bash
# """
# Execute Jupyter Lab in a Docker container.
#
# This script launches a Docker container running Jupyter Lab with
# configurable port, directory mounting, and vim bindings. It passes
# command-line options to the run_jupyter.sh script inside the container.
#
# Usage:
# > docker_jupyter.sh [options]
# """

# Exit immediately if any command exits with a non-zero status.
set -e

# Import the utility functions.
GIT_ROOT=$(git rev-parse --show-toplevel)
source $GIT_ROOT/class_project/project_template/utils.sh

# Parse command-line options and set Jupyter configuration variables.
parse_docker_jupyter_args "$@"

# Load Docker configuration variables for this script.
get_docker_vars_script ${BASH_SOURCE[0]}
source $DOCKER_NAME
print_docker_vars

# List available Docker images and inspect architecture.
list_and_inspect_docker_image

# Run the Docker container with Jupyter Lab.
CMD="bash ./scripts/run_jupyter.sh $OLD_CMD_OPTS"
CONTAINER_NAME=$IMAGE_NAME
# Kill existing container if -f flag is set.
kill_existing_container_if_forced

DOCKER_CMD=$(get_docker_jupyter_command)
DOCKER_CMD_OPTS=$(get_docker_jupyter_options $CONTAINER_NAME $JUPYTER_HOST_PORT $JUPYTER_USE_VIM)
DOCKER_CMD_OPTS="$DOCKER_CMD_OPTS -w /git_root/class_project/data605/Spring2026/projects/UmdTask463_DATA605_Spring2026_MLflow"
run "$DOCKER_CMD $DOCKER_CMD_OPTS $FULL_IMAGE_NAME $CMD"
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
#!/bin/bash
# """
# Docker image naming configuration.
#
# This file defines the repository name, image name, and full image name
# variables used by all docker_*.sh scripts in the project template.
# """

REPO_NAME=gpsaggese
# The image name should be all lower case.
IMAGE_NAME=umd_project_template
FULL_IMAGE_NAME=$REPO_NAME/$IMAGE_NAME
Original file line number Diff line number Diff line change
@@ -0,0 +1,109 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "6f35d697-f17a-4771-82f2-0903e8357711",
"metadata": {},
"source": [
"# MLflow API Overview\n",
"\n",
"This notebook provides a hands-on walkthrough of the core MLflow API components: **Experiments**, **Runs**, and **Logging**. MLflow is a platform for managing the machine learning lifecycle, primarily used for tracking how different versions of models perform."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "73901c91-d1b5-4c71-a403-28c538131282",
"metadata": {},
"outputs": [],
"source": [
"%load_ext autoreload\n",
"%autoreload 2\n",
"\n",
"import os\n",
"import logging\n",
"import mlflow\n",
"import mlflow_utils as mltuti\n",
"\n",
"# Ensure custom tracking directory exists\n",
"tracking_path = \"/tmp/mlflow_data\"\n",
"if not os.path.exists(tracking_path):\n",
" os.makedirs(tracking_path)\n",
"\n",
"# Set the tracking URI\n",
"mlflow.set_tracking_uri(f\"file://{tracking_path}\")\n",
"\n",
"# Configure logging\n",
"logging.basicConfig(level=logging.INFO)\n",
"_LOG = logging.getLogger(__name__)\n",
"\n",
"print(f\"Tracking URI: {mlflow.get_tracking_uri()}\")"
]
},
{
"cell_type": "markdown",
"id": "921b6a8b-6a57-4b44-a2d4-48aa8b93383a",
"metadata": {},
"source": [
"## Tracking Experiments\n",
"\n",
"Experiments are the highest level of organization in MLflow. Every time you try a new idea, you record it as a **Run** inside an **Experiment**."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "508a512e-ec29-4a4a-bbd8-7ff4b4abe91a",
"metadata": {},
"outputs": [],
"source": [
"# Start a test experiment\n",
"with mltuti.start_mlflow_run(\"Verification Test\"):\n",
" \n",
" # Log a dummy parameter (input)\n",
" mlflow.log_param(\"test_mode\", \"manual_verification\")\n",
" \n",
" # Log a dummy metric (output)\n",
" mlflow.log_metric(\"fake_rmse\", 0.5)\n",
" \n",
" print(\"Verification run completed.\")"
]
}
],
"metadata": {
"jupytext": {
"formats": "ipynb,py:percent"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}