# 🔬 Data Science Engineer Project
A production-quality, end-to-end Data Science project demonstrating the complete ML lifecycle — from raw data ingestion through EDA, preprocessing, model training, evaluation, and interactive serving via a Streamlit dashboard.
```
ds/
├── data/
│   ├── raw/                    # Raw source data
│   └── processed/              # Cleaned & feature-engineered data
├── models/                     # Saved model artifacts (.pkl)
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_preprocessing.ipynb
│   ├── 03_modeling.ipynb
│   └── 04_prediction.ipynb
├── reports/
│   └── figures/                # Auto-generated plots
├── src/
│   ├── config.py               # Central configuration
│   ├── data/
│   │   ├── ingest.py           # Data loaders
│   │   ├── preprocess.py       # Cleaning, encoding, scaling
│   │   └── split.py            # Train/val/test splits
│   ├── eda/
│   │   ├── analysis.py         # Descriptive stats, correlations
│   │   └── visualize.py        # Reusable plot helpers
│   ├── models/
│   │   ├── base_model.py       # Abstract base class
│   │   ├── classification.py   # LR, RF, XGBoost
│   │   ├── regression.py       # Linear, GradientBoosting
│   │   └── clustering.py       # KMeans, DBSCAN
│   ├── evaluation/
│   │   ├── metrics.py          # All metric computations
│   │   └── report.py           # Report generators
│   ├── pipelines/
│   │   ├── train_pipeline.py   # End-to-end training
│   │   └── predict_pipeline.py # Inference pipeline
│   └── utils/
│       ├── logger.py           # Loguru logger factory
│       └── helpers.py          # save/load model, dir utils
├── dashboard/
│   └── app.py                  # Streamlit interactive dashboard
├── tests/
│   ├── test_preprocess.py
│   ├── test_metrics.py
│   └── test_pipeline.py
├── .gitignore
├── Makefile
├── README.md
├── requirements.txt
└── setup.py
```
### 1. Create & activate a virtual environment

```bash
python -m venv .venv

# Windows
.venv\Scripts\activate

# macOS/Linux
source .venv/bin/activate
```
### 2. Install dependencies

```bash
make install
# or manually:
pip install -r requirements.txt && pip install -e .
```
### 3. Run the training pipeline

```bash
make train
# or:
python -m src.pipelines.train_pipeline
```
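At its core, the training pipeline loads data, splits it, fits a model, evaluates it, and saves the artifact. A minimal sketch of those steps on the built-in Iris dataset (the actual `train_pipeline.py` is driven by `config.py` and may differ in detail):

```python
from pathlib import Path

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load and split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit and evaluate
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")

# Save the artifact for the prediction pipeline and the dashboard
Path("models").mkdir(exist_ok=True)
joblib.dump(model, "models/random_forest.pkl")
```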
### 4. Launch the interactive dashboard

```bash
make dashboard
# or:
streamlit run dashboard/app.py
```
### 5. Run the tests

```bash
make test
# or:
pytest tests/ -v
```
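The tests follow standard `pytest` conventions. A representative, hypothetical example of what a case in `tests/test_metrics.py` could look like (the project's actual tests may differ):

```python
from sklearn.metrics import accuracy_score, f1_score


def test_perfect_predictions_score_one():
    y_true = [0, 1, 1, 0]
    y_pred = [0, 1, 1, 0]
    assert accuracy_score(y_true, y_pred) == 1.0
    assert f1_score(y_true, y_pred) == 1.0


def test_all_wrong_predictions_score_zero():
    assert accuracy_score([0, 1], [1, 0]) == 0.0
```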
| Feature | Details |
| --- | --- |
| Datasets | Iris (classification), Boston Housing (regression) — built-in, no downloads needed |
| EDA | Descriptive stats, correlation matrices, distribution plots |
| Preprocessing | Null handling, encoding, standard/min-max scaling |
| Models | Logistic Regression, Random Forest, XGBoost, Linear Regression, Gradient Boosting, KMeans, DBSCAN |
| Evaluation | Accuracy, F1, ROC-AUC, MAE, RMSE, R², Silhouette Score |
| Dashboard | EDA, training, evaluation, live prediction — all in Streamlit |
| Testing | `pytest` with coverage reporting |
| Logging | Structured logging via Loguru |
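The preprocessing features listed above (null handling, encoding, scaling) boil down to something like this sketch; the column names are illustrative, not tied to the project's actual datasets:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age": [22.0, None, 35.0, 28.0],
    "sex": ["male", "female", "female", "male"],
})

# Null handling: impute missing ages with the median
df["age"] = df["age"].fillna(df["age"].median())

# Encoding: one-hot encode the categorical column
df = pd.get_dummies(df, columns=["sex"], drop_first=True)

# Scaling: standardize the numeric feature to zero mean, unit variance
df["age"] = StandardScaler().fit_transform(df[["age"]]).ravel()
print(df)
```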
Requires Python 3.10+.

- `pandas`, `numpy`, `scipy` – data manipulation
- `scikit-learn`, `xgboost` – machine learning
- `matplotlib`, `seaborn`, `plotly` – visualization
- `streamlit` – interactive dashboard
- `loguru` – logging
- `pytest` – testing
| Notebook | Description |
| --- | --- |
| `01_eda.ipynb` | Exploratory Data Analysis on the Iris dataset |
| `02_preprocessing.ipynb` | Data cleaning & feature engineering on Titanic |
| `03_modeling.ipynb` | Train and compare multiple classifiers |
| `04_prediction.ipynb` | Inference demo with a saved model |
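The inference flow in `04_prediction.ipynb` amounts to loading a saved `.pkl` artifact and calling `predict`. A self-contained sketch (the dump step below stands in for a previously trained artifact; the path and model choice are illustrative):

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Stand-in for a previously saved artifact
X, y = load_iris(return_X_y=True)
joblib.dump(LogisticRegression(max_iter=1000).fit(X, y), "model.pkl")

# The inference step itself: load the pickle and predict on a new sample
model = joblib.load("model.pkl")
sample = [[5.1, 3.5, 1.4, 0.2]]  # sepal/petal measurements of one flower
pred = model.predict(sample)[0]
print(load_iris().target_names[pred])  # prints "setosa"
```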