🔬 Data Science Engineer Project
A production-quality, end-to-end data science project that demonstrates the complete ML lifecycle: raw data ingestion, EDA, preprocessing, model training, evaluation, and interactive serving through a Streamlit dashboard.
ds/
├── data/
│   ├── raw/                        # Raw source data
│   └── processed/                  # Cleaned & feature-engineered data
├── models/                         # Saved model artifacts (.pkl)
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_preprocessing.ipynb
│   ├── 03_modeling.ipynb
│   └── 04_prediction.ipynb
├── reports/
│   └── figures/                    # Auto-generated plots
├── src/
│   ├── config.py                   # Central configuration
│   ├── data/
│   │   ├── ingest.py               # Data loaders
│   │   ├── preprocess.py           # Cleaning, encoding, scaling
│   │   └── split.py                # Train/val/test splits
│   ├── eda/
│   │   ├── analysis.py             # Descriptive stats, correlations
│   │   └── visualize.py            # Reusable plot helpers
│   ├── models/
│   │   ├── base_model.py           # Abstract base class
│   │   ├── classification.py       # LR, RF, XGBoost
│   │   ├── regression.py           # Linear, GradientBoosting
│   │   └── clustering.py           # KMeans, DBSCAN
│   ├── evaluation/
│   │   ├── metrics.py              # All metric computations
│   │   └── report.py               # Report generators
│   ├── pipelines/
│   │   ├── train_pipeline.py       # End-to-end training
│   │   └── predict_pipeline.py     # Inference pipeline
│   └── utils/
│       ├── logger.py               # Loguru logger factory
│       └── helpers.py              # save/load model, dir utils
├── dashboard/
│   └── app.py                      # Streamlit interactive dashboard
├── tests/
│   ├── test_preprocess.py
│   ├── test_metrics.py
│   └── test_pipeline.py
├── .gitignore
├── Makefile
├── README.md
├── requirements.txt
└── setup.py
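
The layout lists src/models/base_model.py as an abstract base class shared by the classification, regression, and clustering modules. A minimal sketch of what such a class might look like is shown below. The names `BaseModel`, `fit`, `predict`, and the toy `MeanRegressor` subclass are illustrative assumptions, not the project's actual API.

```python
from abc import ABC, abstractmethod
from statistics import fmean
from typing import Sequence


class BaseModel(ABC):
    """Hypothetical common interface for every model (assumed, not taken from the repo)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.is_fitted = False

    @abstractmethod
    def fit(self, X: Sequence[Sequence[float]], y: Sequence[float]) -> "BaseModel":
        """Learn parameters from the training data and return self."""

    @abstractmethod
    def predict(self, X: Sequence[Sequence[float]]) -> list[float]:
        """Return one prediction per row of X."""


class MeanRegressor(BaseModel):
    """Toy subclass that always predicts the mean of the training targets."""

    def __init__(self) -> None:
        super().__init__(name="mean_regressor")
        self.mean_ = 0.0

    def fit(self, X, y):
        self.mean_ = fmean(y)
        self.is_fitted = True
        return self

    def predict(self, X):
        if not self.is_fitted:
            raise RuntimeError("Call fit() before predict().")
        return [self.mean_ for _ in X]


# Fit on three rows, then predict for two unseen rows.
model = MeanRegressor().fit([[1.0], [2.0], [3.0]], [10.0, 20.0, 30.0])
predictions = model.predict([[4.0], [5.0]])
print(predictions)
```

Keeping `fit` and `predict` abstract means a pipeline can treat every concrete model the same way, and instantiating `BaseModel` directly raises a `TypeError`.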
1. Create & activate a virtual environment
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate
2. Install dependencies
make install
# or manually:
pip install -r requirements.txt && pip install -e .
3. Run the training pipeline
make train
# or:
python -m src.pipelines.train_pipeline
4. Launch the interactive dashboard
make dashboard
# or:
streamlit run dashboard/app.py
5. Run the tests
make test
# or:
pytest tests/ -v

| Feature | Details |
| --- | --- |
| Datasets | Iris (classification) and Boston Housing (regression); both built in, no downloads needed |
| EDA | Descriptive stats, correlation matrices, distribution plots |
| Preprocessing | Null handling, encoding, standard/min-max scaling |
| Models | Logistic Regression, Random Forest, XGBoost, Linear Regression, Gradient Boosting, KMeans, DBSCAN |
| Evaluation | Accuracy, F1, ROC-AUC, MAE, RMSE, R², Silhouette Score |
| Dashboard | EDA, training, evaluation, and live prediction, all in Streamlit |
| Testing | pytest with coverage reporting |
| Logging | Structured logging via Loguru |

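
To make the regression metrics in the evaluation list concrete, here is a dependency-free sketch of MAE, RMSE, and R². It is only meant to show what the numbers measure. The function names are hypothetical, and the project's own src/evaluation/metrics.py presumably relies on scikit-learn instead.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error: penalizes large errors more than MAE."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        # This sketch treats constant targets as an error case.
        raise ValueError("R² is undefined when all targets are identical.")
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

mae_value = mae(y_true, y_pred)    # 0.5
rmse_value = rmse(y_true, y_pred)  # sqrt(0.5), about 0.707
r2_value = r2(y_true, y_pred)      # 0.9
print(mae_value, rmse_value, r2_value)
```

MAE and RMSE are in the units of the target, while R² is unitless, so it is the one to use when comparing fit quality across datasets with different target scales.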
Python 3.10+
pandas, numpy, scipy: data manipulation
scikit-learn, xgboost: machine learning
matplotlib, seaborn, plotly: visualization
streamlit: interactive dashboard
loguru: logging
pytest: testing

| Notebook | Description |
| --- | --- |
| 01_eda.ipynb | Exploratory data analysis on the Iris dataset |
| 02_preprocessing.ipynb | Data cleaning & feature engineering on Titanic |
| 03_modeling.ipynb | Train and compare multiple classifiers |
| 04_prediction.ipynb | Inference demo with a saved model |

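
The inference notebook works from a saved model, and the project layout stores artifacts as .pkl files. The round trip might be sketched as follows using the standard-library `pickle` module. The helper names `save_model` and `load_model`, the file name, and the toy "model" are all assumptions made for illustration; the real helpers in src/utils/helpers.py may differ.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any


def save_model(model: Any, path: Path) -> Path:
    """Serialize a model to disk, creating parent directories if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(model, f)
    return path


def load_model(path: Path) -> Any:
    """Load a previously saved model artifact."""
    with path.open("rb") as f:
        return pickle.load(f)


def predict(model: dict, value: float) -> int:
    """Toy inference step: classify by comparing against a stored threshold."""
    return int(value >= model["threshold"])


# A plain dict stands in for a trained estimator (hypothetical).
toy_model = {"kind": "threshold_classifier", "threshold": 0.5}

with tempfile.TemporaryDirectory() as tmp:
    artifact = save_model(toy_model, Path(tmp) / "models" / "toy_model.pkl")
    restored = load_model(artifact)

prediction = predict(restored, 0.8)
print(prediction)
```

Because unpickling can execute arbitrary code, only load .pkl artifacts that come from a source you trust.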