A comprehensive machine learning platform for predicting student retention and lead scoring, built with production-ready code and best practices.
This project addresses two critical business problems in higher education:
- Student Retention Prediction: Predicts which students are at risk of dropping out, with models for both early-semester (beginning of term) and mid-semester (midway through) predictions. This enables proactive intervention and resource allocation.
- Lead Scoring: Predicts which prospective students (leads) are most likely to enroll, integrating data from multiple sources (GA4 web analytics, CRM, and SIS) despite incomplete join coverage.
Student Prediction/
├── api/
│   └── main.py                      # FastAPI REST API (data pipeline, models, score bands)
├── src/
│   ├── data_generation.py           # Synthetic data generation with realistic patterns
│   ├── feature_engineering.py       # Feature engineering pipelines
│   ├── models.py                    # XGBoost, LightGBM, ensemble models
│   └── train.py                     # Training script with validation
├── web/                             # Next.js 14 dashboard (primary UI)
│   ├── src/
│   │   ├── app/                     # Pages and layout
│   │   ├── components/              # Stakeholder dashboard, data pipeline, model viz
│   │   └── lib/                     # API helpers
│   └── package.json
├── scripts/
│   ├── superset_provision.py        # Apache Superset dashboard provisioning via API
│   └── load_data_for_superset.py    # ETL for Superset SQLite
├── data/                            # Generated CSVs (created by train.py)
├── models/                          # Trained .pkl files (created by train.py)
├── config.yaml                      # Configuration
├── run.ps1                          # One-click start (Windows)
└── requirements.txt
python -m venv venv
# Windows: venv\Scripts\activate
# Mac/Linux: source venv/bin/activate
pip install -r requirements.txt

# From project root, with PYTHONPATH set
# Windows PowerShell:
$env:PYTHONPATH = (Get-Location).Path
python src/train.py
# Or run.ps1 will auto-train if data is missing
.\run.ps1

This will:
- Generate synthetic datasets for retention and lead scoring
- Train early-semester retention model (15 features)
- Train mid-semester retention model (24 features)
- Train lead scoring ensemble model (XGB+LGB, 29 features)
- Save models to the models/ directory
# Terminal 1: FastAPI backend
$env:PYTHONPATH = (Get-Location).Path
python -m uvicorn api.main:app --host 0.0.0.0 --port 8000
# Terminal 2: Next.js dashboard
cd web && npm install && npm run dev

Open http://localhost:3000 for the stakeholder dashboard. For Apache Superset BI dashboards, see Docker setup below.
Early Semester Model
- Uses data available at the beginning of the semester
- 15 features including demographics, academic preparation, and early engagement
- Enables proactive intervention before problems escalate
Mid-Semester Model
- Enhanced predictions using mid-semester performance data
- 24 features including GPA trends, engagement changes, and support utilization
- More accurate predictions with additional context
Key Features:
- Handles missing exit dates (realistic data quality issue)
- Addresses class imbalance with SMOTE
- Cross-validation for robust performance estimates
- SHAP values for model interpretability
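To illustrate why stratification matters for cross-validation on an imbalanced target, here is a minimal, dependency-free sketch of a stratified k-fold splitter. The function name `stratified_kfold` and the toy labels are assumptions for illustration; the project itself presumably relies on scikit-learn's equivalent.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs whose test folds keep roughly the class ratio of labels."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold receives its share.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)

    for i in range(n_splits):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train_idx, test_idx


# Hypothetical labels: 4 withdrawals (1) among 20 students.
labels = [1] * 4 + [0] * 16
splits = list(stratified_kfold(labels, n_splits=4))
```

With these toy labels each test fold contains exactly one positive case, so no fold is evaluated on a single class.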
Multi-Source Integration
- GA4 web analytics (100% coverage)
- CRM marketing data (70% coverage - realistic join issue)
- SIS academic data (15% coverage - only enrolled students)
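The coverage figures above imply a left join that starts from GA4 and tolerates missing CRM and SIS rows. The following is a rough sketch in plain Python; the field names (`lead_id`, `sessions`, `email_opens`, `hs_gpa`) and the helper `left_join_with_flags` are hypothetical, and the project most likely does this with pandas instead.

```python
# Hypothetical records keyed by a shared lead_id; field names are illustrative only.
ga4 = [
    {"lead_id": 1, "sessions": 5},
    {"lead_id": 2, "sessions": 1},
    {"lead_id": 3, "sessions": 9},
]
crm = {1: {"email_opens": 4}, 3: {"email_opens": 7}}  # partial coverage
sis = {3: {"hs_gpa": 3.6}}                            # enrolled leads only


def left_join_with_flags(base_rows, sources):
    """Left-join each source onto the base rows and record a has_<name> flag."""
    merged = []
    for row in base_rows:
        out = dict(row)
        for name, (table, columns) in sources.items():
            match = table.get(row["lead_id"])
            # The flag lets a model distinguish "no record" from a real value.
            out[f"has_{name}"] = int(match is not None)
            for col in columns:
                # None stands in for a missing value (NaN in a DataFrame).
                out[col] = match[col] if match else None
        merged.append(out)
    return merged


leads = left_join_with_flags(
    ga4,
    {"crm": (crm, ["email_opens"]), "sis": (sis, ["hs_gpa"])},
)
```

Every GA4 lead survives the join, and the `has_crm` / `has_sis` flags become features in their own right.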
Ensemble Approach
- XGBoost + LightGBM weighted ensemble
- Handles missing data from incomplete joins
- Feature engineering across all sources
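A weighted ensemble of this kind can be as simple as a convex combination of the two models' predicted probabilities. The sketch below assumes the probabilities are already available as lists; `weighted_ensemble` and the default weight of 0.5 are illustrative choices, not the weights used in this project.

```python
def weighted_ensemble(xgb_probs, lgb_probs, xgb_weight=0.5):
    """Blend two lists of predicted probabilities with a convex weight."""
    if not 0.0 <= xgb_weight <= 1.0:
        raise ValueError("xgb_weight must be in [0, 1]")
    if len(xgb_probs) != len(lgb_probs):
        raise ValueError("probability lists must have the same length")
    lgb_weight = 1.0 - xgb_weight
    return [xgb_weight * p + lgb_weight * q for p, q in zip(xgb_probs, lgb_probs)]


# Hypothetical per-lead probabilities from each model.
blended = weighted_ensemble([0.8, 0.2], [0.6, 0.4], xgb_weight=0.5)
```

Because the weights sum to one, the blended score stays in [0, 1] and can be used directly as an enrollment probability.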
Key Features:
- Engagement scoring from web behavior
- Marketing touchpoint analysis
- Academic quality indicators
- Cross-source feature alignment
- Early Semester: AUC ~0.75-0.80
- Mid-Semester: AUC ~0.82-0.87 (improved with additional data)
- Enrollment Prediction: AUC ~0.78-0.85
- Handles class imbalance (15% enrollment rate)
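For readers less familiar with the metric, AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force over all pairs; the labels and scores are made-up values, and a real evaluation would normally use a library routine.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: 2 positives, 3 negatives.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.3, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)  # 5 of the 6 pairs are ordered correctly
```

The pairwise loop is quadratic, which is fine for illustration but too slow for large validation sets.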
A sleek, dark-themed Next.js dashboard with comprehensive data pipeline visualizations:
# Terminal 1: Start FastAPI backend
cd "Student Prediction"
$env:PYTHONPATH = (Get-Location).Path
python -m uvicorn api.main:app --host 0.0.0.0 --port 8000
# Terminal 2: Start Next.js frontend
cd web
npm install
npm run dev

Open http://localhost:3000 for the web interface.
- Retention Data Pipeline: Stats, missing data rates, feature distributions, correlation matrix
- Lead Scoring Data Pipeline: Join coverage (GA4→CRM→SIS), traffic sources, distributions
- Retention Feature Engineering: Pipeline steps, transformations, early vs mid features
- Lead Feature Engineering: Merge strategy, missing data handling
- Retention Models: Early/mid AUC, feature importance, A/B/C/D risk bands with interventions
- Lead Scoring Model: AUC, top features, A/B/C/D lead bands with interventions
Both retention and lead scoring use ABCD bands with specific actions:
- Retention: A=Critical → phone+meeting, B=High → phone+advisor, C=Medium → email, D=Low → monitor
- Leads: A=Hot → priority call, B=Warm → phone+email, C=Cool → nurture, D=Cold → low touch
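As a sketch of how a model score could be turned into a band and an action: the action tables below mirror the lists above, while the cut-offs (0.75 / 0.50 / 0.25) and the helper `assign_band` are assumptions made for illustration, not the thresholds used by the API.

```python
# Actions per band, taken from the band descriptions above.
RETENTION_ACTIONS = {"A": "phone+meeting", "B": "phone+advisor", "C": "email", "D": "monitor"}
LEAD_ACTIONS = {"A": "priority call", "B": "phone+email", "C": "nurture", "D": "low touch"}

# Hypothetical cut-offs: (minimum score, band), checked from highest to lowest.
BAND_CUTOFFS = [(0.75, "A"), (0.50, "B"), (0.25, "C"), (0.0, "D")]


def assign_band(score, cutoffs=BAND_CUTOFFS):
    """Map a probability in [0, 1] to the first band whose minimum it meets."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability in [0, 1]")
    for minimum, band in cutoffs:
        if score >= minimum:
            return band
    return cutoffs[-1][1]


# A dropout risk of 0.82 lands in band A; an enrollment probability of 0.30 lands in band C.
retention_action = RETENTION_ACTIONS[assign_band(0.82)]
lead_action = LEAD_ACTIONS[assign_band(0.30)]
```

Keeping the cut-offs in one table makes it easy to tune them per model, for example to match advisor capacity.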
Optional Docker-based Superset for executive dashboards:
docker-compose -f docker-compose.superset.yml up -d
python scripts/superset_provision.py # loads data + provisions (deletes old datasets)

Login: admin / admin → Executive Dashboard
- BigQuery: Set DATA_SOURCE=bigquery and GCP_PROJECT to load from BigQuery instead of CSV.
- Cloud Run: gcloud run deploy with the included Dockerfile.
- See CLOUD.md for BigQuery setup and Cloud Functions/Run deploy.
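One way the DATA_SOURCE and GCP_PROJECT switch could be read at startup is sketched below. The helper `resolve_data_source`, the CSV default, and the returned dictionary shape are all assumptions; the actual loading logic in this project may differ.

```python
import os


def resolve_data_source(env=None):
    """Decide between CSV and BigQuery from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumed default: fall back to local CSVs when DATA_SOURCE is unset.
    source = env.get("DATA_SOURCE", "csv").strip().lower()
    if source == "bigquery":
        project = env.get("GCP_PROJECT", "").strip()
        if not project:
            raise ValueError("GCP_PROJECT must be set when DATA_SOURCE=bigquery")
        return {"source": "bigquery", "project": project}
    return {"source": "csv", "project": None}


# Passing a dict instead of os.environ keeps the function easy to test.
local = resolve_data_source({})
cloud = resolve_data_source({"DATA_SOURCE": "bigquery", "GCP_PROJECT": "demo-project"})
```

Failing fast on a missing project ID surfaces misconfiguration at startup rather than on the first query.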
The Streamlit dashboard (streamlit run dashboard.py) includes:
- Student Retention Section
  - Early and mid-semester risk assessments
  - Risk score distributions
  - Feature importance visualizations
  - Individual student risk calculator
  - Actionable recommendations
- Lead Scoring Section
  - Enrollment probability scores
  - Score distributions by enrollment status
  - Data source coverage analysis
  - Feature importance rankings
- Model Performance
  - Cross-validation metrics
  - ROC curves
  - Classification reports
- Data Overview
  - Dataset summaries
  - Missing data analysis
  - Statistical summaries
Edit config.yaml to adjust:
- Dataset sizes
- Missing data rates
- Model parameters
- File paths
This project addresses real-world data challenges:
- Missing Exit Dates: 35% of withdrawn students lack exit dates
- Incomplete Joins: CRM data covers only 70% of leads
- Sparse SIS Data: Only enrolled students have SIS records
- Class Imbalance: Low enrollment rates (15%) and withdrawal rates
All models handle these issues through:
- Missing value imputation
- Feature engineering for missing data indicators
- SMOTE for class imbalance
- Robust validation strategies
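To make the SMOTE step more concrete, here is a simplified, dependency-free sketch of the core idea: each synthetic minority sample is an interpolation between a minority point and one of its nearest minority neighbours. The function `smote_oversample` and the toy points are hypothetical; the project presumably uses a library implementation such as the one in imbalanced-learn.

```python
import math
import random


def smote_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two numeric features.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_oversample(minority, n_new=5)
```

Oversampling is normally applied to the training folds only, so that synthetic points never leak into validation or test data.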
- ML Frameworks: XGBoost, LightGBM, scikit-learn
- API: FastAPI, REST endpoints
- Frontend: Next.js 14, React, TypeScript
- BI Dashboards: Apache Superset (Docker), API-provisioned charts
- Visualization: Plotly, Recharts
- Interpretability: SHAP values
- Data Processing: pandas, numpy
- Validation: Cross-validation, stratified splits
- Proper train/validation/test splits
- Cross-validation for robust metrics
- Handling class imbalance
- Feature engineering pipelines
- Model interpretability (SHAP)
- Production-ready code structure
- Configuration management
- Comprehensive documentation
- Interactive visualization dashboard
- Early Intervention: Identify at-risk students at semester start
- Resource Allocation: Prioritize coaching and support services
- Marketing Optimization: Focus on high-quality leads
- Enrollment Planning: Forecast enrollment from lead pipeline
This is a portfolio project demonstrating data science and ML engineering capabilities.