
Commit 10930c4 — "updating" (initial commit, 0 parents)

45 files changed: 2,324,206 additions & 0 deletions


.github/workflows/ci.yml

Lines changed: 64 additions & 0 deletions
```yaml
name: CI/CD Pipeline

on:
  push:
    branches: [ main ]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Login to Amazon ECR
        uses: aws-actions/amazon-ecr-login@v2

      # Build and push housing-api image
      - name: Build, Tag, and Push housing-api
        run: |
          IMAGE_TAG=${GITHUB_SHA}
          ECR_REGISTRY=261899902410.dkr.ecr.eu-west-2.amazonaws.com
          ECR_REPOSITORY=housing-api

          docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG -f Dockerfile .
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG

          docker tag $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG $ECR_REGISTRY/$ECR_REPOSITORY:latest
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:latest

      # Build and push housing-streamlit image
      - name: Build, Tag, and Push housing-streamlit
        run: |
          IMAGE_TAG=${GITHUB_SHA}
          ECR_REGISTRY=261899902410.dkr.ecr.eu-west-2.amazonaws.com
          ECR_REPOSITORY=housing-streamlit

          docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG -f Dockerfile.streamlit .
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG

          docker tag $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG $ECR_REGISTRY/$ECR_REPOSITORY:latest
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:latest

      # Deploy to ECS - API
      - name: Deploy housing-api-service
        run: |
          aws ecs update-service \
            --cluster housing-api-cluster-ecs \
            --service housing-api-service \
            --force-new-deployment

      # Deploy to ECS - Streamlit
      - name: Deploy housing-streamlit-service
        run: |
          aws ecs update-service \
            --cluster housing-api-cluster-ecs \
            --service housing-streamlit-service \
            --force-new-deployment
```
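The two build steps in this workflow are identical apart from the repository name and Dockerfile, and each pushes the same image under two tags: the commit SHA and `latest`. A small Python sketch (illustrative only, not part of the commit) makes the tagging scheme explicit:

```python
# Illustrative sketch of the workflow's tagging scheme; REGISTRY is the ECR
# registry hard-coded in ci.yml above.
REGISTRY = "261899902410.dkr.ecr.eu-west-2.amazonaws.com"

def image_refs(repository: str, sha: str) -> list[str]:
    """Each workflow step pushes two tags: the commit SHA and 'latest'."""
    return [f"{REGISTRY}/{repository}:{sha}", f"{REGISTRY}/{repository}:latest"]

# The two near-duplicate steps could be driven from one table:
for repo, dockerfile in [("housing-api", "Dockerfile"),
                         ("housing-streamlit", "Dockerfile.streamlit")]:
    sha_tag, latest_tag = image_refs(repo, "10930c4")  # CI uses ${GITHUB_SHA}
    print(f"docker build -t {sha_tag} -f {dockerfile} .")
    print(f"docker push {sha_tag}")
    print(f"docker tag {sha_tag} {latest_tag}")
    print(f"docker push {latest_tag}")
```

A GitHub Actions `strategy: matrix` block could achieve the same de-duplication directly in YAML.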

.gitignore

Lines changed: 54 additions & 0 deletions
```
# ==========================================
# Python-generated files
# ==========================================
__pycache__/
*.py[oc]
*.pyo
*.pyd
*.pdb
*.egg-info/
*.egg
build/
dist/
wheels/

# Virtual environments
.venv/
.env/
*.env

# ==========================================
# Project-specific ignores
# ==========================================

# Datasets (raw + processed)
data/

# MLflow experiment tracking
mlruns/

# Notebook checkpoints
.ipynb_checkpoints/

# Jupyter runtime files
.jupyter/
*.nbconvert.ipynb

# Logs
logs/
*.log

# Cache & temp
*.tmp
*.swp
.DS_Store
Thumbs.db

# VSCode / IDE
.vscode/
.idea/

# Docker
*.pid
*.tar
*.lock.json
```

.python-version

Lines changed: 1 addition & 0 deletions
```
3.11
```

CLAUDE.md

Lines changed: 198 additions & 0 deletions
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Housing Regression MLE is an end-to-end machine learning pipeline for predicting housing prices using XGBoost. The project follows ML engineering best practices with modular pipelines, experiment tracking via MLflow, containerization, AWS cloud deployment, and comprehensive testing. The system includes both a REST API and a Streamlit dashboard for interactive predictions.

## Architecture

The codebase is organized into distinct pipelines following the flow:
`Load → Preprocess → Feature Engineering → Train → Tune → Evaluate → Inference → Batch → Serve`

### Core Modules

- **`src/feature_pipeline/`**: Data loading, preprocessing, and feature engineering
  - `load.py`: Time-aware data splitting (train <2020, eval 2020–21, holdout ≥2022)
  - `preprocess.py`: City normalization, deduplication, outlier removal
  - `feature_engineering.py`: Date features, frequency encoding (`zipcode`), target encoding (`city_full`)

- **`src/training_pipeline/`**: Model training and hyperparameter optimization
  - `train.py`: Baseline XGBoost training with configurable parameters
  - `tune.py`: Optuna-based hyperparameter tuning with MLflow integration
  - `eval.py`: Model evaluation and metrics calculation

- **`src/inference_pipeline/`**: Production inference
  - `inference.py`: Applies the same preprocessing/encoding transformations using saved encoders

- **`src/batch/`**: Batch prediction processing
  - `run_monthly.py`: Generates monthly predictions on holdout data

- **`src/api/`**: FastAPI web service
  - `main.py`: REST API with S3 integration, health checks, prediction endpoints, and batch processing
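The two encodings named above can be sketched with toy data (illustrative only; the real `feature_engineering.py` may use different column names and fallback defaults):

```python
import pandas as pd

# Toy training data; real columns come from the housing dataset.
train = pd.DataFrame({
    "zipcode": ["98101", "98101", "98102"],
    "city_full": ["Seattle", "Seattle", "Bellevue"],
    "price": [500_000, 550_000, 700_000],
})

# Frequency encoding: map each zipcode to its relative frequency in TRAIN only.
freq_encoder = train["zipcode"].value_counts(normalize=True).to_dict()
train["zipcode_freq"] = train["zipcode"].map(freq_encoder)

# Target encoding: map each city to its mean TRAIN price.
target_encoder = train.groupby("city_full")["price"].mean().to_dict()
train["city_te"] = train["city_full"].map(target_encoder)

# At inference time the SAME fitted mappings are applied; unseen keys fall
# back to a default rather than being re-fitted (which would leak data).
new_rows = pd.DataFrame({"zipcode": ["98101", "98199"], "city_full": ["Seattle", "Tacoma"]})
new_rows["zipcode_freq"] = new_rows["zipcode"].map(freq_encoder).fillna(0.0)
new_rows["city_te"] = new_rows["city_full"].map(target_encoder).fillna(train["price"].mean())
```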
### Web Applications

- **`app.py`**: Streamlit dashboard for interactive housing price predictions
  - Real-time predictions via FastAPI integration
  - Interactive filtering by year, month, and region
  - Visualization of predictions vs. actuals with metrics (MAE, RMSE, % error)
  - Yearly trend analysis with highlighted selected periods

### Cloud Infrastructure & Deployment

- **AWS S3 Integration**: Data and model storage in the `housing-regression-data` bucket
- **Amazon ECR**: Container registry for Docker images
- **Amazon ECS**: Container orchestration with Fargate
- **Application Load Balancer**: Traffic distribution and routing
- **CI/CD Pipeline**: Automated deployment via GitHub Actions

#### ECS Services

- **housing-api-service**: FastAPI backend (port 8000, 1024 CPU units, 3072 MB memory)
- **housing-streamlit-service**: Streamlit dashboard (port 8501, 512 CPU units, 1024 MB memory)

### Data Leakage Prevention

The project implements strict data leakage prevention:

- Time-based splits (not random)
- Encoders fitted only on training data
- Leakage-prone columns dropped before training
- Schema alignment enforced between train/eval/inference
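The time-based split (train <2020, eval 2020–21, holdout ≥2022) can be sketched as follows; `load.py`'s actual column names and I/O are assumptions here:

```python
import pandas as pd

def time_split(df: pd.DataFrame, date_col: str = "date"):
    """Split by year, never by shuffling: train < 2020, eval 2020-21, holdout >= 2022."""
    year = pd.to_datetime(df[date_col]).dt.year
    train = df[year < 2020]
    eval_df = df[(year >= 2020) & (year <= 2021)]
    holdout = df[year >= 2022]
    return train, eval_df, holdout

# Tiny demonstration frame with one row per split boundary.
df = pd.DataFrame({"date": ["2018-05-01", "2020-03-15", "2021-07-09", "2022-01-02"],
                   "price": [1, 2, 3, 4]})
tr, ev, ho = time_split(df)
```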
## Common Commands

### Environment Setup
```bash
# Install dependencies using uv
uv sync
```

### Testing
```bash
# Run all tests
pytest

# Run specific test modules
pytest tests/test_features.py
pytest tests/test_training.py
pytest tests/test_inference.py

# Run with verbose output
pytest -v
```

### Data Pipeline
```bash
# 1. Load and split raw data
python src/feature_pipeline/load.py

# 2. Preprocess splits
python -m src.feature_pipeline.preprocess

# 3. Feature engineering
python -m src.feature_pipeline.feature_engineering
```

### Training Pipeline
```bash
# Train baseline model
python src/training_pipeline/train.py

# Hyperparameter tuning with MLflow
python src/training_pipeline/tune.py

# Model evaluation
python src/training_pipeline/eval.py
```

### Inference
```bash
# Single inference
python src/inference_pipeline/inference.py --input data/raw/holdout.csv --output predictions.csv

# Batch monthly predictions
python src/batch/run_monthly.py
```

### API Service
```bash
# Start FastAPI server locally
uv run uvicorn src.api.main:app --host 0.0.0.0 --port 8000
```

### Streamlit Dashboard
```bash
# Start Streamlit dashboard locally
streamlit run app.py --server.port 8501 --server.address 0.0.0.0
```

### Docker
```bash
# Build API container
docker build -t housing-regression .

# Build Streamlit container
docker build -t housing-streamlit -f Dockerfile.streamlit .

# Run API container
docker run -p 8000:8000 housing-regression

# Run Streamlit container
docker run -p 8501:8501 housing-streamlit
```

### MLflow Tracking
```bash
# Start MLflow UI (view experiments)
mlflow ui
```
## Key Design Patterns

### Pipeline Modularity
Each pipeline component can be run independently with consistent interfaces. All modules accept configurable input/output paths for testing isolation.

### Cloud-Native Architecture
- **S3-First Storage**: Models and data automatically sync from S3 buckets
- **Containerized Services**: Both the API and the dashboard run in Docker containers
- **Auto-scaling Infrastructure**: ECS Fargate provides serverless container scaling
- **Environment-based Configuration**: Separate configs for local development and production

### Encoder Persistence
Frequency and target encoders are saved as pickle files during training and loaded during inference to ensure consistent transformations.
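For example, a fitted frequency encoder can round-trip through pickle like this (a minimal sketch; the file name and `models/` layout are assumptions):

```python
import pickle
import tempfile
from pathlib import Path

# A fitted frequency encoder is just a mapping learned from training data.
freq_encoder = {"98101": 0.42, "98102": 0.13}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "zipcode_freq_encoder.pkl"  # hypothetical file name

    # Training side: persist the fitted encoder next to the model artifacts.
    with open(path, "wb") as f:
        pickle.dump(freq_encoder, f)

    # Inference side: load and apply the identical mapping.
    with open(path, "rb") as f:
        loaded = pickle.load(f)
```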
### Configuration Management
Model parameters, file paths, and pipeline settings use sensible defaults but can be overridden through function parameters or environment variables. Production deployments use AWS environment variables.

### Testing Strategy
- Unit tests for individual pipeline components
- Integration tests for end-to-end pipeline flows
- Smoke tests for the inference pipeline
- All tests use temporary directories to avoid touching production data

## Dependencies

Key production dependencies (see `pyproject.toml`):
- **ML/Data**: `xgboost==3.0.4`, `scikit-learn`, `pandas==2.1.1`, `numpy==1.26.4`
- **API**: `fastapi`, `uvicorn`
- **Dashboard**: `streamlit`, `plotly`
- **Cloud**: `boto3` (AWS integration)
- **Experimentation**: `mlflow`, `optuna`
- **Quality**: `great-expectations`, `evidently`

## File Structure Notes

- **`data/`**: Raw, processed, and prediction data (time-structured, S3-synced)
- **`models/`**: Trained models and encoders (pkl files, S3-synced)
- **`mlruns/`**: MLflow experiment tracking data
- **`configs/`**: YAML configuration files
- **`notebooks/`**: Jupyter notebooks for EDA and experimentation
- **`tests/`**: Comprehensive test suite with sample data
- **AWS Task Definitions**: `housing-api-task-def.json`, `streamlit-task-def.json`
- **CI/CD**: `.github/workflows/ci.yml` for automated deployment

## Production URLs

- **API**: `http://housing-api-alb-945997111.eu-west-2.elb.amazonaws.com`
- **Dashboard**: `http://housing-api-alb-945997111.eu-west-2.elb.amazonaws.com/dashboard`

Dockerfile

Lines changed: 21 additions & 0 deletions
```dockerfile
# Use slim Python base image
FROM python:3.11-slim

# Set working directory inside container
WORKDIR /app

# Copy dependency files first (better caching)
COPY pyproject.toml uv.lock* ./

# Install uv (dependency manager)
RUN pip install uv
RUN uv sync --frozen --no-dev

# Copy project files
COPY . .

# Expose FastAPI default port
EXPOSE 8000

# Command to run API with Uvicorn
CMD ["uv", "run", "uvicorn", "src.api.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Dockerfile.streamlit

Lines changed: 36 additions & 0 deletions
```dockerfile
# Dockerfile.streamlit
# Separate Dockerfiles for the UI and FastAPI avoid mixing UI logic with backend logic.

# $BUILDPLATFORM makes builds work across different architectures
FROM --platform=$BUILDPLATFORM python:3.11-slim

ENV PYTHONUNBUFFERED=1 PIP_NO_CACHE_DIR=1
WORKDIR /app

# Install uv and project deps (from pyproject.toml)
COPY pyproject.toml uv.lock* ./
RUN pip install --no-cache-dir uv \
    && uv pip install --system .

# Copy the app (and data needed by the app)
COPY . .

# Streamlit config
ENV STREAMLIT_SERVER_PORT=8501 \
    STREAMLIT_SERVER_ADDRESS=0.0.0.0 \
    STREAMLIT_SERVER_BASEURLPATH=/dashboard \
    STREAMLIT_BROWSER_GATHERUSAGESTATS=false

# Default API_URL (override in docker run / ECS)
ENV API_URL=http://localhost:8000/predict

EXPOSE 8501

# Make absolutely sure Streamlit is the process that starts
ENTRYPOINT ["streamlit", "run", "app.py"]
CMD ["--server.port=8501", "--server.address=0.0.0.0", "--server.baseUrlPath=/dashboard"]

# FastAPI container: lightweight, only needs Python + Uvicorn + the app code.
# Streamlit container: interactive web app, needs Streamlit configuration and a link to the API.
# Streamlit has more ENV config because it's UI-focused; FastAPI just runs a server.
```
