zakiscoding
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
Lines changed: 64 additions & 0 deletions
diff --git a/.gitignore b/.gitignore
Lines changed: 54 additions & 0 deletions
diff --git a/.python-version b/.python-version
Lines changed: 1 addition & 0 deletions
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 198 additions & 0 deletions
diff --git a/Dockerfile b/Dockerfile
Lines changed: 21 additions & 0 deletions
diff --git a/Dockerfile.streamlit b/Dockerfile.streamlit
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,64 @@
+name: CI/CD Pipeline
+
+on:
+  push:
+    branches: [ main ]
+
+jobs:
+  build-and-deploy:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Configure AWS credentials
+        uses: aws-actions/configure-aws-credentials@v4
+        with:
+          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+          aws-region: ${{ secrets.AWS_REGION }}
+
+      - name: Login to Amazon ECR
+        uses: aws-actions/amazon-ecr-login@v2
+
+      # Build and push housing-api image
+      - name: Build, Tag, and Push housing-api
+        run: |
+          IMAGE_TAG=${GITHUB_SHA}
+          ECR_REGISTRY=261899902410.dkr.ecr.eu-west-2.amazonaws.com
+          ECR_REPOSITORY=housing-api
+
+          docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG -f Dockerfile .
+          docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
+
+          docker tag $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG $ECR_REGISTRY/$ECR_REPOSITORY:latest
+          docker push $ECR_REGISTRY/$ECR_REPOSITORY:latest
+
+      # Build and push housing-streamlit image
+      - name: Build, Tag, and Push housing-streamlit
+        run: |
+          IMAGE_TAG=${GITHUB_SHA}
+          ECR_REGISTRY=261899902410.dkr.ecr.eu-west-2.amazonaws.com
+          ECR_REPOSITORY=housing-streamlit
+
+          docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG -f Dockerfile.streamlit .
+          docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
+
+          docker tag $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG $ECR_REGISTRY/$ECR_REPOSITORY:latest
+          docker push $ECR_REGISTRY/$ECR_REPOSITORY:latest
+
+      # Deploy to ECS - API
+      - name: Deploy housing-api-service
+        run: |
+          aws ecs update-service \
+            --cluster housing-api-cluster-ecs \
+            --service housing-api-service \
+            --force-new-deployment
+
+      # Deploy to ECS - Streamlit
+      - name: Deploy housing-streamlit-service
+        run: |
+          aws ecs update-service \
+            --cluster housing-api-cluster-ecs \
+            --service housing-streamlit-service \
+            --force-new-deployment
@@ -0,0 +1,54 @@
+# ==========================================
+# Python-generated files
+# ==========================================
+__pycache__/
+*.py[oc]
+*.pyo
+*.pyd
+*.pdb
+*.egg-info/
+*.egg
+build/
+dist/
+wheels/
+
+# Virtual environments
+.venv/
+.env/
+*.env
+
+# ==========================================
+# Project-specific ignores
+# ==========================================
+
+# Datasets (raw + processed)
+data/
+
+# MLflow experiment tracking
+mlruns/
+
+# Notebook checkpoints
+.ipynb_checkpoints/
+
+# Jupyter runtime files
+.jupyter/
+*.nbconvert.ipynb
+
+# Logs
+logs/
+*.log
+
+# Cache & temp
+*.tmp
+*.swp
+.DS_Store
+Thumbs.db
+
+# VSCode / IDE
+.vscode/
+.idea/
+
+# Docker
+*.pid
+*.tar
+*.lock.json
@@ -0,0 +1 @@
+3.11
@@ -0,0 +1,198 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+Housing Regression MLE is an end-to-end machine learning pipeline for predicting housing prices using XGBoost. The project follows ML engineering best practices with modular pipelines, experiment tracking via MLflow, containerization, AWS cloud deployment, and comprehensive testing. The system includes both a REST API and a Streamlit dashboard for interactive predictions.
+
+## Architecture
+
+The codebase is organized into distinct pipelines following the flow:
+`Load → Preprocess → Feature Engineering → Train → Tune → Evaluate → Inference → Batch → Serve`
+
+### Core Modules
+
+- **`src/feature_pipeline/`**: Data loading, preprocessing, and feature engineering
+  - `load.py`: Time-aware data splitting (train <2020, eval 2020-21, holdout ≥2022)
+  - `preprocess.py`: City normalization, deduplication, outlier removal  
+  - `feature_engineering.py`: Date features, frequency encoding (zipcode), target encoding (city_full)
+
+- **`src/training_pipeline/`**: Model training and hyperparameter optimization
+  - `train.py`: Baseline XGBoost training with configurable parameters
+  - `tune.py`: Optuna-based hyperparameter tuning with MLflow integration
+  - `eval.py`: Model evaluation and metrics calculation
+
+- **`src/inference_pipeline/`**: Production inference
+  - `inference.py`: Applies the same preprocessing and encoding transformations as training, using the saved encoders
+
+- **`src/batch/`**: Batch prediction processing
+  - `run_monthly.py`: Generates monthly predictions on holdout data
+
+- **`src/api/`**: FastAPI web service
+  - `main.py`: REST API with S3 integration, health checks, prediction endpoints, and batch processing
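
Review note: the time-aware split that `load.py` is described as performing (train before 2020, eval 2020-21, holdout from 2022 on) could look roughly like the sketch below. It uses plain Python so it runs without extra packages; the `date` field name, the `split_by_year` helper, and the sample rows are assumptions for illustration, and the actual module may work quite differently.

```python
from datetime import date


def split_by_year(rows, date_key="date"):
    """Assign each row to train/eval/holdout by calendar year (hypothetical helper)."""
    splits = {"train": [], "eval": [], "holdout": []}
    for row in rows:
        year = row[date_key].year
        if year < 2020:
            splits["train"].append(row)
        elif year <= 2021:
            splits["eval"].append(row)
        else:
            splits["holdout"].append(row)
    return splits


# Made-up sample rows; the real dataset's columns are not shown in this diff.
sample = [
    {"date": date(2018, 5, 1), "price": 300_000},
    {"date": date(2020, 1, 1), "price": 320_000},
    {"date": date(2021, 12, 1), "price": 350_000},
    {"date": date(2022, 3, 1), "price": 360_000},
]
splits = split_by_year(sample)
print({name: len(part) for name, part in splits.items()})
```

Splitting on the calendar rather than at random keeps every evaluation row strictly later than the training rows, which is the property the documented boundaries rely on.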
+
+### Web Applications
+
+- **`app.py`**: Streamlit dashboard for interactive housing price predictions
+  - Real-time predictions via FastAPI integration
+  - Interactive filtering by year, month, and region
+  - Visualization of predictions vs actuals with metrics (MAE, RMSE, % Error)
+  - Yearly trend analysis with highlighted selected periods
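
Review note: the dashboard metrics listed above (MAE, RMSE, % Error) can be computed as in the following sketch. The diff does not show how `app.py` defines "% Error"; it is interpreted here as mean absolute percentage error, which is an assumption, and `regression_metrics` is a hypothetical helper name.

```python
import math


def regression_metrics(actuals, predictions):
    """Return MAE, RMSE and mean absolute percentage error for paired values."""
    if len(actuals) != len(predictions) or not actuals:
        raise ValueError("actuals and predictions must be non-empty and equal length")
    errors = [p - a for a, p in zip(actuals, predictions)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    # Assumes no actual value is zero; "% Error" is interpreted here as MAPE.
    pct_error = 100 * sum(abs(e) / abs(a) for a, e in zip(actuals, errors)) / len(errors)
    return {"mae": mae, "rmse": rmse, "pct_error": pct_error}


# Made-up values purely for illustration.
metrics = regression_metrics([100.0, 200.0], [110.0, 180.0])
print(metrics)
```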
+
+### Cloud Infrastructure & Deployment
+
+- **AWS S3 Integration**: Data and model storage in the `housing-regression-data` bucket
+- **Amazon ECR**: Container registry for Docker images
+- **Amazon ECS**: Container orchestration with Fargate
+- **Application Load Balancer**: Traffic distribution and routing
+- **CI/CD Pipeline**: Automated deployment via GitHub Actions
+
+#### ECS Services:
+- **housing-api-service**: FastAPI backend (port 8000, 1024 CPU, 3072 MB memory)
+- **housing-streamlit-service**: Streamlit dashboard (port 8501, 512 CPU, 1024 MB memory)
+
+### Data Leakage Prevention
+
+The project implements strict data leakage prevention:
+- Time-based splits (not random)
+- Encoders fitted only on training data
+- Leakage-prone columns dropped before training
+- Schema alignment enforced between train/eval/inference
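
Review note: "encoders fitted only on training data" can be illustrated with the sketch below, which fits a frequency encoder and a target encoder on training rows and then applies them to later rows. The helper names, the use of raw counts for frequency encoding, the global-mean fallback for unseen categories, and the sample values are all assumptions; the project's `feature_engineering.py` may make different choices.

```python
from collections import Counter, defaultdict


def fit_frequency_encoder(values):
    """Map each category to its count in the training data."""
    return dict(Counter(values))


def fit_target_encoder(categories, targets):
    """Map each category to its mean training target, plus a global fallback."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    mapping = {cat: sums[cat] / counts[cat] for cat in sums}
    global_mean = sum(targets) / len(targets)
    return mapping, global_mean


# Fit on training rows only (made-up values).
train_zip = ["10001", "10001", "94105"]
train_city = ["NYC", "NYC", "SF"]
train_price = [300.0, 500.0, 900.0]

freq_map = fit_frequency_encoder(train_zip)
target_map, fallback = fit_target_encoder(train_city, train_price)

# Apply to eval rows; categories unseen in training get a neutral default.
eval_zip_encoded = [freq_map.get(z, 0) for z in ["10001", "60601"]]
eval_city_encoded = [target_map.get(c, fallback) for c in ["SF", "Chicago"]]
print(eval_zip_encoded, eval_city_encoded)
```

Because the mappings are built before any eval or holdout row is seen, no target information from later periods can leak into the features.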
+
+## Common Commands
+
+### Environment Setup
+```bash
+# Install dependencies using uv
+uv sync
+```
+
+### Testing
+```bash
+# Run all tests
+pytest
+
+# Run specific test modules  
+pytest tests/test_features.py
+pytest tests/test_training.py
+pytest tests/test_inference.py
+
+# Run with verbose output
+pytest -v
+```
+
+### Data Pipeline
+```bash
+# 1. Load and split raw data
+python src/feature_pipeline/load.py
+
+# 2. Preprocess splits
+python -m src.feature_pipeline.preprocess
+
+# 3. Feature engineering
+python -m src.feature_pipeline.feature_engineering
+```
+
+### Training Pipeline
+```bash
+# Train baseline model
+python src/training_pipeline/train.py
+
+# Hyperparameter tuning with MLflow
+python src/training_pipeline/tune.py
+
+# Model evaluation
+python src/training_pipeline/eval.py
+```
+
+### Inference
+```bash
+# Single inference
+python src/inference_pipeline/inference.py --input data/raw/holdout.csv --output predictions.csv
+
+# Batch monthly predictions
+python src/batch/run_monthly.py
+```
+
+### API Service
+```bash
+# Start FastAPI server locally
+uv run uvicorn src.api.main:app --host 0.0.0.0 --port 8000
+```
+
+### Streamlit Dashboard
+```bash
+# Start Streamlit dashboard locally
+streamlit run app.py --server.port 8501 --server.address 0.0.0.0
+```
+
+### Docker
+```bash
+# Build API container
+docker build -t housing-regression .
+
+# Build Streamlit container  
+docker build -t housing-streamlit -f Dockerfile.streamlit .
+
+# Run API container
+docker run -p 8000:8000 housing-regression
+
+# Run Streamlit container
+docker run -p 8501:8501 housing-streamlit
+```
+
+### MLflow Tracking
+```bash
+# Start MLflow UI (view experiments)
+mlflow ui
+```
+
+## Key Design Patterns
+
+### Pipeline Modularity
+Each pipeline component can be run independently with consistent interfaces. All modules accept configurable input/output paths for testing isolation.
+
+### Cloud-Native Architecture
+- **S3-First Storage**: Models and data automatically sync from S3 buckets
+- **Containerized Services**: Both API and dashboard run in Docker containers  
+- **Auto-scaling Infrastructure**: ECS on Fargate runs containers without server management; task counts scale automatically only where ECS Service Auto Scaling is configured
+- **Environment-based Configuration**: Separate configs for local development and production
+
+### Encoder Persistence  
+Frequency and target encoders are saved as pickle files during training and loaded during inference to ensure consistent transformations.
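
Review note: a minimal sketch of the pickle round trip described above, assuming the encoders are plain mappings. The `save_encoder` and `load_encoder` helpers and the `freq_encoder.pkl` file name are hypothetical and not taken from the repository.

```python
import pickle
import tempfile
from pathlib import Path


def save_encoder(encoder, path):
    """Serialize an encoder mapping to disk with pickle."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(encoder, fh)


def load_encoder(path):
    """Load a pickled encoder; only do this for files you trust."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    encoder_path = Path(tmp) / "models" / "freq_encoder.pkl"  # hypothetical file name
    save_encoder({"10001": 2, "94105": 1}, encoder_path)
    restored = load_encoder(encoder_path)

print(restored)
```

Loading the exact object that was fitted at training time is what keeps inference-time features identical to training-time features. Since unpickling can execute arbitrary code, encoder files pulled from S3 should come from a bucket the service trusts.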
+
+### Configuration Management
+Model parameters, file paths, and pipeline settings use sensible defaults but can be overridden through function parameters or environment variables. Production deployments use AWS environment variables.
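
Review note: the override order described above (function parameter, then environment variable, then default) might be expressed as in this sketch. `API_URL` and its default value appear in `Dockerfile.streamlit`; the `resolve_api_url` helper and the other URLs are hypothetical.

```python
import os


def resolve_api_url(api_url=None, env=None):
    """Pick the API URL: explicit argument, then API_URL env var, then a local default."""
    env = os.environ if env is None else env
    if api_url:
        return api_url
    return env.get("API_URL", "http://localhost:8000/predict")


# Passing a dict instead of os.environ keeps the example deterministic.
print(resolve_api_url(env={}))
print(resolve_api_url(env={"API_URL": "http://api.internal:8000/predict"}))
print(resolve_api_url("http://override/predict", env={"API_URL": "ignored"}))
```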
+
+### Testing Strategy
+- Unit tests for individual pipeline components
+- Integration tests for end-to-end pipeline flows  
+- Smoke tests for inference pipeline
+- All tests use temporary directories to avoid touching production data
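
Review note: the "temporary directories" point could be exercised with a test shaped like the one below. `drop_duplicate_rows` is a hypothetical stand-in for a pipeline step with configurable input and output paths, and the sketch uses the standard library's `tempfile` rather than any pytest fixture the real suite may rely on.

```python
import csv
import tempfile
from pathlib import Path


def drop_duplicate_rows(input_path, output_path):
    """Hypothetical pipeline step: copy a CSV while dropping exact duplicate rows."""
    seen = set()
    with open(input_path, newline="") as src, open(output_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        for row in reader:
            key = tuple(row)
            if key not in seen:
                seen.add(key)
                writer.writerow(row)


def test_drop_duplicate_rows_in_temp_dir():
    # Everything is written under a throwaway directory, never under data/.
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw.csv"
        clean = Path(tmp) / "clean.csv"
        raw.write_text("city,price\nNYC,300\nNYC,300\nSF,900\n")
        drop_duplicate_rows(raw, clean)
        assert clean.read_text().splitlines() == ["city,price", "NYC,300", "SF,900"]


test_drop_duplicate_rows_in_temp_dir()
```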
+
+## Dependencies
+
+Key production dependencies (see `pyproject.toml`):
+- **ML/Data**: `xgboost==3.0.4`, `scikit-learn`, `pandas==2.1.1`, `numpy==1.26.4`
+- **API**: `fastapi`, `uvicorn`
+- **Dashboard**: `streamlit`, `plotly`
+- **Cloud**: `boto3` (AWS integration)
+- **Experimentation**: `mlflow`, `optuna`
+- **Quality**: `great-expectations`, `evidently`
+
+## File Structure Notes
+
+- **`data/`**: Raw, processed, and prediction data (time-structured, S3-synced)
+- **`models/`**: Trained models and encoders (pkl files, S3-synced)
+- **`mlruns/`**: MLflow experiment tracking data
+- **`configs/`**: YAML configuration files
+- **`notebooks/`**: Jupyter notebooks for EDA and experimentation
+- **`tests/`**: Comprehensive test suite with sample data
+- **AWS Task Definitions**: `housing-api-task-def.json`, `streamlit-task-def.json`
+- **CI/CD**: `.github/workflows/ci.yml` for automated deployment
+
+## Production URLs
+
+- **API**: `http://housing-api-alb-945997111.eu-west-2.elb.amazonaws.com`
+- **Dashboard**: `http://housing-api-alb-945997111.eu-west-2.elb.amazonaws.com/dashboard`
@@ -0,0 +1,21 @@
+# Use slim Python base image
+FROM python:3.11-slim
+
+# Set working directory inside container
+WORKDIR /app
+
+# Copy dependency files first (better caching)
+COPY pyproject.toml uv.lock* ./
+
+# Install uv (dependency manager)
+RUN pip install uv
+RUN uv sync --frozen --no-dev
+
+# Copy project files
+COPY . .
+
+# Expose FastAPI default port
+EXPOSE 8000
+
+# Command to run API with Uvicorn
+CMD ["uv", "run", "uvicorn", "src.api.main:app", "--host", "0.0.0.0", "--port", "8000"]
@@ -0,0 +1,36 @@
+# Dockerfile.streamlit
+# Separate Dockerfiles for the UI and FastAPI keep UI logic apart from backend logic.
+
+# Pull the base image for the build host's platform (BUILDPLATFORM is set by BuildKit)
+FROM --platform=$BUILDPLATFORM python:3.11-slim
+
+ENV PYTHONUNBUFFERED=1 PIP_NO_CACHE_DIR=1
+WORKDIR /app
+
+# Install uv and project deps (from pyproject.toml)
+COPY pyproject.toml uv.lock* ./
+RUN pip install --no-cache-dir uv \
+ && uv pip install --system .
+
+# Copy the app (and data needed by the app)
+COPY . .
+
+# Streamlit config
+ENV STREAMLIT_SERVER_PORT=8501 \
+    STREAMLIT_SERVER_ADDRESS=0.0.0.0 \
+    STREAMLIT_SERVER_BASEURLPATH=/dashboard \
+    STREAMLIT_BROWSER_GATHERUSAGESTATS=false
+
+# Default API_URL (override in docker run / ECS)
+ENV API_URL=http://localhost:8000/predict
+
+EXPOSE 8501
+
+# Make absolutely sure Streamlit is the thing that starts
+ENTRYPOINT ["streamlit", "run", "app.py"]
+CMD ["--server.port=8501", "--server.address=0.0.0.0", "--server.baseUrlPath=/dashboard"]
+
+
+# FastAPI container: lightweight, only needs Python + Uvicorn + your code.
+# Streamlit container: interactive web app, needs Streamlit configuration and a link to the API.
+# Streamlit has more ENV config because it’s UI-focused. FastAPI just runs a server.