|
| 1 | +# CLAUDE.md |
| 2 | + |
| 3 | +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. |
| 4 | + |
| 5 | +## Project Overview |
| 6 | + |
| 7 | +Housing Regression MLE is an end-to-end machine learning pipeline for predicting housing prices using XGBoost. The project follows ML engineering best practices with modular pipelines, experiment tracking via MLflow, containerization, AWS cloud deployment, and comprehensive testing. The system includes both a REST API and a Streamlit dashboard for interactive predictions. |
| 8 | + |
| 9 | +## Architecture |
| 10 | + |
| 11 | +The codebase is organized into distinct pipelines following the flow: |
| 12 | +`Load → Preprocess → Feature Engineering → Train → Tune → Evaluate → Inference → Batch → Serve` |
| 13 | + |
| 14 | +### Core Modules |
| 15 | + |
| 16 | +- **`src/feature_pipeline/`**: Data loading, preprocessing, and feature engineering |
| 17 | + - `load.py`: Time-aware data splitting (train <2020, eval 2020-21, holdout ≥2022) |
| 18 | + - `preprocess.py`: City normalization, deduplication, outlier removal |
| 19 | + - `feature_engineering.py`: Date features, frequency encoding (zipcode), target encoding (city_full) |
| 20 | + |
| 21 | +- **`src/training_pipeline/`**: Model training and hyperparameter optimization |
| 22 | + - `train.py`: Baseline XGBoost training with configurable parameters |
| 23 | + - `tune.py`: Optuna-based hyperparameter tuning with MLflow integration |
| 24 | + - `eval.py`: Model evaluation and metrics calculation |
| 25 | + |
| 26 | +- **`src/inference_pipeline/`**: Production inference |
| 27 | + - `inference.py`: Applies the same preprocessing and encoding transformations as training, using the saved encoders |
| 28 | + |
| 29 | +- **`src/batch/`**: Batch prediction processing |
| 30 | + - `run_monthly.py`: Generates monthly predictions on holdout data |
| 31 | + |
| 32 | +- **`src/api/`**: FastAPI web service |
| 33 | + - `main.py`: REST API with S3 integration, health checks, prediction endpoints, and batch processing |
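As an illustration of the target encoding mentioned for `city_full`, the following is a minimal sketch in plain Python, not the repository's actual implementation. The `price` column name, the function names, and the fallback to the global training mean for unseen cities are all assumptions made for this example.

```python
from collections import defaultdict


def fit_target_encoder(train_rows, column, target):
    """Compute per-category target means from training rows only."""
    if not train_rows:
        raise ValueError("train_rows must not be empty")
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in train_rows:
        sums[row[column]] += row[target]
        counts[row[column]] += 1
    means = {key: sums[key] / counts[key] for key in sums}
    global_mean = sum(sums.values()) / sum(counts.values())
    return means, global_mean


def apply_target_encoder(rows, column, means, global_mean):
    """Add an encoded column; unseen categories fall back to the global mean (assumed policy)."""
    return [{**row, f"{column}_te": means.get(row[column], global_mean)} for row in rows]


# Hypothetical training rows; "price" is an assumed target column name.
train_rows = [
    {"city_full": "Austin", "price": 300_000},
    {"city_full": "Austin", "price": 340_000},
    {"city_full": "Boston", "price": 600_000},
]
means, global_mean = fit_target_encoder(train_rows, "city_full", "price")

new_rows = [{"city_full": "Austin"}, {"city_full": "Denver"}]
encoded = apply_target_encoder(new_rows, "city_full", means, global_mean)
```

Because the means come only from `train_rows`, applying them to evaluation or inference data does not expose those rows' targets to the model.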
| 34 | + |
| 35 | +### Web Applications |
| 36 | + |
| 37 | +- **`app.py`**: Streamlit dashboard for interactive housing price predictions |
| 38 | + - Real-time predictions via FastAPI integration |
| 39 | + - Interactive filtering by year, month, and region |
| 40 | + - Visualization of predictions vs actuals with metrics (MAE, RMSE, % Error) |
| 41 | + - Yearly trend analysis with highlighted selected periods |
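The dashboard metrics could be computed along these lines. This is a sketch under stated assumptions: the function name is hypothetical, and "% Error" is interpreted here as MAE divided by the mean actual price, which may not match the definition used in `app.py`.

```python
import math


def regression_metrics(actuals, predictions):
    """Return MAE, RMSE, and MAE as a percentage of the mean actual value."""
    if not actuals or len(actuals) != len(predictions):
        raise ValueError("actuals and predictions must be non-empty and the same length")
    errors = [pred - actual for actual, pred in zip(actuals, predictions)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mean_actual = sum(actuals) / len(actuals)
    # Assumed definition of "% Error": MAE relative to the mean actual value.
    pct_error = 100.0 * mae / mean_actual
    return {"mae": mae, "rmse": rmse, "pct_error": pct_error}


metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 330.0])
```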
| 42 | + |
| 43 | +### Cloud Infrastructure & Deployment |
| 44 | + |
| 45 | +- **AWS S3 Integration**: Data and model storage in `housing-regression-data` bucket |
| 46 | +- **Amazon ECR**: Container registry for Docker images |
| 47 | +- **Amazon ECS**: Container orchestration with Fargate |
| 48 | +- **Application Load Balancer**: Traffic distribution and routing |
| 49 | +- **CI/CD Pipeline**: Automated deployment via GitHub Actions |
| 50 | + |
| 51 | +#### ECS Services: |
| 52 | +- **housing-api-service**: FastAPI backend (port 8000, 1024 CPU units = 1 vCPU, 3072 MB memory) |
| 53 | +- **housing-streamlit-service**: Streamlit dashboard (port 8501, 512 CPU units = 0.5 vCPU, 1024 MB memory) |
| 54 | + |
| 55 | +### Data Leakage Prevention |
| 56 | + |
| 57 | +The project implements strict data leakage prevention: |
| 58 | +- Time-based splits (not random) |
| 59 | +- Encoders fitted only on training data |
| 60 | +- Leakage-prone columns dropped before training |
| 61 | +- Schema alignment enforced between train/eval/inference |
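The first two points can be sketched in plain Python as follows. The year boundaries come from the `load.py` description above; the row layout, the function names, and the choice to map unseen zip codes to 0 are assumptions for illustration, and the real pipeline most likely works on DataFrames instead.

```python
from collections import Counter

# Hypothetical rows; the real dataset's columns may differ.
rows = [
    {"year": 2018, "zipcode": "10001", "price": 300_000},
    {"year": 2019, "zipcode": "10001", "price": 320_000},
    {"year": 2019, "zipcode": "94105", "price": 900_000},
    {"year": 2020, "zipcode": "94105", "price": 950_000},
    {"year": 2021, "zipcode": "60601", "price": 400_000},
    {"year": 2022, "zipcode": "10001", "price": 380_000},
]


def time_split(rows):
    """Split by year: train < 2020, eval 2020-2021, holdout >= 2022."""
    train = [r for r in rows if r["year"] < 2020]
    eval_ = [r for r in rows if 2020 <= r["year"] <= 2021]
    holdout = [r for r in rows if r["year"] >= 2022]
    return train, eval_, holdout


def fit_frequency_encoder(train_rows, column):
    """Count category occurrences using training rows only."""
    return dict(Counter(r[column] for r in train_rows))


def apply_frequency_encoder(rows, column, mapping):
    """Encode with the fitted mapping; unseen categories map to 0 (assumed policy)."""
    return [{**r, f"{column}_freq": mapping.get(r[column], 0)} for r in rows]


train, eval_, holdout = time_split(rows)
zip_freq = fit_frequency_encoder(train, "zipcode")  # fitted on train only
eval_encoded = apply_frequency_encoder(eval_, "zipcode", zip_freq)
```

The zip code `60601` appears only in the evaluation period, so it receives the fallback value and never influences the fitted mapping.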
| 62 | + |
| 63 | +## Common Commands |
| 64 | + |
| 65 | +### Environment Setup |
| 66 | +```bash |
| 67 | +# Install dependencies using uv |
| 68 | +uv sync |
| 69 | +``` |
| 70 | + |
| 71 | +### Testing |
| 72 | +```bash |
| 73 | +# Run all tests |
| 74 | +pytest |
| 75 | + |
| 76 | +# Run specific test modules |
| 77 | +pytest tests/test_features.py |
| 78 | +pytest tests/test_training.py |
| 79 | +pytest tests/test_inference.py |
| 80 | + |
| 81 | +# Run with verbose output |
| 82 | +pytest -v |
| 83 | +``` |
| 84 | + |
| 85 | +### Data Pipeline |
| 86 | +```bash |
| 87 | +# 1. Load and split raw data |
| 88 | +python src/feature_pipeline/load.py |
| 89 | + |
| 90 | +# 2. Preprocess splits |
| 91 | +python -m src.feature_pipeline.preprocess |
| 92 | + |
| 93 | +# 3. Feature engineering |
| 94 | +python -m src.feature_pipeline.feature_engineering |
| 95 | +``` |
| 96 | + |
| 97 | +### Training Pipeline |
| 98 | +```bash |
| 99 | +# Train baseline model |
| 100 | +python src/training_pipeline/train.py |
| 101 | + |
| 102 | +# Hyperparameter tuning with MLflow |
| 103 | +python src/training_pipeline/tune.py |
| 104 | + |
| 105 | +# Model evaluation |
| 106 | +python src/training_pipeline/eval.py |
| 107 | +``` |
| 108 | + |
| 109 | +### Inference |
| 110 | +```bash |
| 111 | +# Single inference |
| 112 | +python src/inference_pipeline/inference.py --input data/raw/holdout.csv --output predictions.csv |
| 113 | + |
| 114 | +# Batch monthly predictions |
| 115 | +python src/batch/run_monthly.py |
| 116 | +``` |
| 117 | + |
| 118 | +### API Service |
| 119 | +```bash |
| 120 | +# Start FastAPI server locally |
| 121 | +uv run uvicorn src.api.main:app --host 0.0.0.0 --port 8000 |
| 122 | +``` |
| 123 | + |
| 124 | +### Streamlit Dashboard |
| 125 | +```bash |
| 126 | +# Start Streamlit dashboard locally |
| 127 | +streamlit run app.py --server.port 8501 --server.address 0.0.0.0 |
| 128 | +``` |
| 129 | + |
| 130 | +### Docker |
| 131 | +```bash |
| 132 | +# Build API container |
| 133 | +docker build -t housing-regression . |
| 134 | + |
| 135 | +# Build Streamlit container |
| 136 | +docker build -t housing-streamlit -f Dockerfile.streamlit . |
| 137 | + |
| 138 | +# Run API container |
| 139 | +docker run -p 8000:8000 housing-regression |
| 140 | + |
| 141 | +# Run Streamlit container |
| 142 | +docker run -p 8501:8501 housing-streamlit |
| 143 | +``` |
| 144 | + |
| 145 | +### MLflow Tracking |
| 146 | +```bash |
| 147 | +# Start MLflow UI (view experiments) |
| 148 | +mlflow ui |
| 149 | +``` |
| 150 | + |
| 151 | +## Key Design Patterns |
| 152 | + |
| 153 | +### Pipeline Modularity |
| 154 | +Each pipeline component can be run independently with consistent interfaces. All modules accept configurable input/output paths for testing isolation. |
| 155 | + |
| 156 | +### Cloud-Native Architecture |
| 157 | +- **S3-First Storage**: Models and data automatically sync from S3 buckets |
| 158 | +- **Containerized Services**: Both API and dashboard run in Docker containers |
| 159 | +- **Auto-scaling Infrastructure**: ECS on Fargate runs containers without managing servers; scaling the task count depends on ECS Service Auto Scaling being configured |
| 160 | +- **Environment-based Configuration**: Separate configs for local development and production |
| 161 | + |
| 162 | +### Encoder Persistence |
| 163 | +Frequency and target encoders are saved as pickle files during training and loaded during inference to ensure consistent transformations. |
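A minimal sketch of this save-and-load round trip, assuming the encoder is a plain dictionary; the helper names and the `freq_encoder.pkl` file name are hypothetical and not taken from the repository.

```python
import pickle
import tempfile
from pathlib import Path


def save_encoder(mapping, path):
    """Persist a fitted encoder mapping as a pickle file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(mapping, f)


def load_encoder(path):
    """Load a previously saved encoder mapping."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


# A temporary directory stands in for the models/ folder in this example.
with tempfile.TemporaryDirectory() as tmp:
    encoder_path = Path(tmp) / "models" / "freq_encoder.pkl"  # hypothetical file name
    save_encoder({"10001": 2, "94105": 1}, encoder_path)
    restored = load_encoder(encoder_path)
```

Pickle files should only be loaded from trusted locations, since unpickling can execute arbitrary code.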
| 164 | + |
| 165 | +### Configuration Management |
| 166 | +Model parameters, file paths, and pipeline settings use sensible defaults but can be overridden through function parameters or environment variables. Production deployments use AWS environment variables. |
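One way such an override chain could look is sketched below. The precedence order (function parameter, then environment variable, then default), the `resolve_setting` helper, and the `MODEL_PATH` and `S3_BUCKET` variable names are assumptions; only the bucket name is taken from this document.

```python
import os

# Defaults for illustration; the variable names are hypothetical.
DEFAULTS = {
    "MODEL_PATH": "models/model.pkl",
    "S3_BUCKET": "housing-regression-data",
}


def resolve_setting(name, override=None, env=None):
    """Resolve a setting: explicit override, then environment variable, then default."""
    env = os.environ if env is None else env
    if override is not None:
        return override
    return env.get(name, DEFAULTS[name])


# An explicit env mapping is passed here so the example does not depend on the real environment.
from_default = resolve_setting("S3_BUCKET", env={})
from_env = resolve_setting("S3_BUCKET", env={"S3_BUCKET": "staging-bucket"})
from_param = resolve_setting("S3_BUCKET", override="test-bucket", env={"S3_BUCKET": "staging-bucket"})
```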
| 167 | + |
| 168 | +### Testing Strategy |
| 169 | +- Unit tests for individual pipeline components |
| 170 | +- Integration tests for end-to-end pipeline flows |
| 171 | +- Smoke tests for inference pipeline |
| 172 | +- All tests use temporary directories to avoid touching production data |
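A test in this style might look like the sketch below. `write_predictions` is a hypothetical stand-in for a pipeline step that accepts an output path; `tmp_path` is pytest's built-in fixture that provides a per-test temporary directory.

```python
import csv
from pathlib import Path


def write_predictions(rows, output_path):
    """Hypothetical pipeline step: write prediction rows to a CSV file."""
    output_path = Path(output_path)
    with output_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "prediction"])
        writer.writeheader()
        writer.writerows(rows)


def test_write_predictions_uses_tmp_dir(tmp_path):
    """The step writes only inside the temporary directory supplied by pytest."""
    out = tmp_path / "predictions.csv"
    write_predictions([{"id": 1, "prediction": 250000.0}], out)
    with out.open(newline="") as f:
        loaded = list(csv.DictReader(f))
    # csv.DictReader returns every value as a string.
    assert loaded == [{"id": "1", "prediction": "250000.0"}]
```

Passing the output path as a parameter is what keeps the test away from the real `data/` and `models/` directories.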
| 173 | + |
| 174 | +## Dependencies |
| 175 | + |
| 176 | +Key production dependencies (see `pyproject.toml`): |
| 177 | +- **ML/Data**: `xgboost==3.0.4`, `scikit-learn`, `pandas==2.1.1`, `numpy==1.26.4` |
| 178 | +- **API**: `fastapi`, `uvicorn` |
| 179 | +- **Dashboard**: `streamlit`, `plotly` |
| 180 | +- **Cloud**: `boto3` (AWS integration) |
| 181 | +- **Experimentation**: `mlflow`, `optuna` |
| 182 | +- **Quality**: `great-expectations`, `evidently` |
| 183 | + |
| 184 | +## File Structure Notes |
| 185 | + |
| 186 | +- **`data/`**: Raw, processed, and prediction data (time-structured, S3-synced) |
| 187 | +- **`models/`**: Trained models and encoders (pkl files, S3-synced) |
| 188 | +- **`mlruns/`**: MLflow experiment tracking data |
| 189 | +- **`configs/`**: YAML configuration files |
| 190 | +- **`notebooks/`**: Jupyter notebooks for EDA and experimentation |
| 191 | +- **`tests/`**: Comprehensive test suite with sample data |
| 192 | +- **AWS Task Definitions**: `housing-api-task-def.json`, `streamlit-task-def.json` |
| 193 | +- **CI/CD**: `.github/workflows/ci.yml` for automated deployment |
| 194 | + |
| 195 | +## Production URLs |
| 196 | + |
| 197 | +- **API**: `http://housing-api-alb-945997111.eu-west-2.elb.amazonaws.com` |
| 198 | +- **Dashboard**: `http://housing-api-alb-945997111.eu-west-2.elb.amazonaws.com/dashboard` |