
# 🔬 Data Science Engineer Project

A production-quality, end-to-end Data Science project demonstrating the complete ML lifecycle: raw data ingestion, EDA, preprocessing, model training, evaluation, and interactive serving via a Streamlit dashboard.


πŸ“ Project Structure

```
ds/
├── data/
│   ├── raw/              # Raw source data
│   └── processed/        # Cleaned & feature-engineered data
├── models/               # Saved model artifacts (.pkl)
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_preprocessing.ipynb
│   ├── 03_modeling.ipynb
│   └── 04_prediction.ipynb
├── reports/
│   └── figures/          # Auto-generated plots
├── src/
│   ├── config.py         # Central configuration
│   ├── data/
│   │   ├── ingest.py     # Data loaders
│   │   ├── preprocess.py # Cleaning, encoding, scaling
│   │   └── split.py      # Train/val/test splits
│   ├── eda/
│   │   ├── analysis.py   # Descriptive stats, correlations
│   │   └── visualize.py  # Reusable plot helpers
│   ├── models/
│   │   ├── base_model.py        # Abstract base class
│   │   ├── classification.py    # LR, RF, XGBoost
│   │   ├── regression.py        # Linear, GradientBoosting
│   │   └── clustering.py        # KMeans, DBSCAN
│   ├── evaluation/
│   │   ├── metrics.py    # All metric computations
│   │   └── report.py     # Report generators
│   ├── pipelines/
│   │   ├── train_pipeline.py   # End-to-end training
│   │   └── predict_pipeline.py # Inference pipeline
│   └── utils/
│       ├── logger.py     # Loguru logger factory
│       └── helpers.py    # save/load model, dir utils
├── dashboard/
│   └── app.py            # Streamlit interactive dashboard
├── tests/
│   ├── test_preprocess.py
│   ├── test_metrics.py
│   └── test_pipeline.py
├── .gitignore
├── Makefile
├── README.md
├── requirements.txt
└── setup.py
```
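A minimal sketch of what `src/config.py` might centralize; the constants and directory names below are illustrative assumptions, not taken from the actual file:

```python
# Hypothetical sketch of src/config.py: one place for paths and tunables,
# so notebooks, pipelines, and the dashboard never hard-code locations.
from pathlib import Path

PROJECT_ROOT = Path(__file__).resolve().parents[1]  # the ds/ directory
DATA_DIR = PROJECT_ROOT / "data"
RAW_DATA_DIR = DATA_DIR / "raw"
PROCESSED_DATA_DIR = DATA_DIR / "processed"
MODELS_DIR = PROJECT_ROOT / "models"
FIGURES_DIR = PROJECT_ROOT / "reports" / "figures"

RANDOM_STATE = 42   # seed shared by splits and models for reproducibility
TEST_SIZE = 0.2     # hold out 20% of the data for testing
```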

## 🚀 Quickstart

### 1. Create & activate a virtual environment

```bash
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate
```

### 2. Install dependencies

```bash
make install
# or manually:
pip install -r requirements.txt && pip install -e .
```

### 3. Run the training pipeline

```bash
make train
# or:
python -m src.pipelines.train_pipeline
```
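In spirit, the training pipeline is load → split → fit → evaluate → persist. A hedged sketch of such a pipeline on the Iris dataset; the function name, model choice, and default path are assumptions, not the project's actual code:

```python
# Hypothetical sketch of src/pipelines/train_pipeline.py.
import pickle
from pathlib import Path

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def run(model_path="models/model.pkl"):
    """Train a classifier on Iris, report test accuracy, persist the model."""
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))

    # Persist the fitted estimator for the prediction pipeline / dashboard.
    Path(model_path).parent.mkdir(parents=True, exist_ok=True)
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    return acc
```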

### 4. Launch the interactive dashboard

```bash
make dashboard
# or:
streamlit run dashboard/app.py
```

### 5. Run tests

```bash
make test
# or:
pytest tests/ -v
```
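pytest collects any function named `test_*`, so plain asserts suffice. A hedged sketch of the kind of check `tests/test_preprocess.py` might contain; the helper shown is a stand-in, not the project's real `src.data.preprocess` API:

```python
# Hypothetical sketch of tests/test_preprocess.py.

def fill_nulls_with_mean(values):
    """Stand-in for a preprocessing helper: replace None with the column mean."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def test_fill_nulls_with_mean():
    # The single missing value is imputed with the mean of 1.0 and 3.0.
    assert fill_nulls_with_mean([1.0, None, 3.0]) == [1.0, 2.0, 3.0]


def test_no_nulls_is_identity():
    # Columns without missing values pass through unchanged.
    assert fill_nulls_with_mean([4.0, 6.0]) == [4.0, 6.0]
```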

## 🧩 Key Features

| Feature | Details |
|---|---|
| Datasets | Iris (classification), Boston Housing (regression); built in, no downloads needed |
| EDA | Descriptive stats, correlation matrices, distribution plots |
| Preprocessing | Null handling, encoding, standard/min-max scaling |
| Models | Logistic Regression, Random Forest, XGBoost, Linear Regression, Gradient Boosting, KMeans, DBSCAN |
| Evaluation | Accuracy, F1, ROC-AUC, MAE, RMSE, R², Silhouette Score |
| Dashboard | EDA, training, evaluation, and live prediction, all in Streamlit |
| Testing | pytest with coverage reporting |
| Logging | Structured logging via Loguru |
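For intuition, the regression metrics in the table reduce to a few lines each. A pure-Python sketch of their definitions (the project itself presumably relies on scikit-learn's implementations):

```python
# Reference definitions of MAE, RMSE, and R^2, written out by hand.
import math


def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large residuals more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 minus residual/total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true, y_pred = [3.0, 5.0, 7.0], [2.5, 5.0, 8.0]
print(mae(y_true, y_pred))  # 0.5
print(r2(y_true, y_pred))   # 0.84375
```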

## 📊 Tech Stack

- Python 3.10+
- pandas, numpy, scipy – data manipulation
- scikit-learn, xgboost – machine learning
- matplotlib, seaborn, plotly – visualization
- streamlit – interactive dashboard
- loguru – logging
- pytest – testing

## 📖 Notebooks

| Notebook | Description |
|---|---|
| `01_eda.ipynb` | Exploratory Data Analysis on the Iris dataset |
| `02_preprocessing.ipynb` | Data cleaning & feature engineering on Titanic |
| `03_modeling.ipynb` | Train and compare multiple classifiers |
| `04_prediction.ipynb` | Inference demo with a saved model |
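The prediction notebook presumably loads a persisted model and calls `predict`. A minimal save/load round trip in the spirit of `src/utils/helpers.py`; the function names are illustrative assumptions:

```python
# Hypothetical sketch of the model-persistence helpers used by 04_prediction.ipynb.
import pickle
from pathlib import Path


def save_model(model, path):
    """Persist any picklable estimator, creating parent directories as needed."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a previously saved estimator for inference."""
    with open(path, "rb") as f:
        return pickle.load(f)
```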