
# 🔬 Data Science Engineer Project

A production-quality, end-to-end Data Science project demonstrating the complete ML lifecycle: raw data ingestion, EDA, preprocessing, model training, evaluation, and interactive serving via a Streamlit dashboard.


πŸ“ Project Structure

```
ds/
├── data/
│   ├── raw/              # Raw source data
│   └── processed/        # Cleaned & feature-engineered data
├── models/               # Saved model artifacts (.pkl)
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_preprocessing.ipynb
│   ├── 03_modeling.ipynb
│   └── 04_prediction.ipynb
├── reports/
│   └── figures/          # Auto-generated plots
├── src/
│   ├── config.py         # Central configuration
│   ├── data/
│   │   ├── ingest.py     # Data loaders
│   │   ├── preprocess.py # Cleaning, encoding, scaling
│   │   └── split.py      # Train/val/test splits
│   ├── eda/
│   │   ├── analysis.py   # Descriptive stats, correlations
│   │   └── visualize.py  # Reusable plot helpers
│   ├── models/
│   │   ├── base_model.py        # Abstract base class
│   │   ├── classification.py    # LR, RF, XGBoost
│   │   ├── regression.py        # Linear, GradientBoosting
│   │   └── clustering.py        # KMeans, DBSCAN
│   ├── evaluation/
│   │   ├── metrics.py    # All metric computations
│   │   └── report.py     # Report generators
│   ├── pipelines/
│   │   ├── train_pipeline.py   # End-to-end training
│   │   └── predict_pipeline.py # Inference pipeline
│   └── utils/
│       ├── logger.py     # Loguru logger factory
│       └── helpers.py    # save/load model, dir utils
├── dashboard/
│   └── app.py            # Streamlit interactive dashboard
├── tests/
│   ├── test_preprocess.py
│   ├── test_metrics.py
│   └── test_pipeline.py
├── .gitignore
├── Makefile
├── README.md
├── requirements.txt
└── setup.py
```
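A minimal sketch of what `src/config.py` might centralize; the constants and directory names below are illustrative assumptions, not taken from the actual file:

```python
# Hypothetical sketch of src/config.py: one place for paths and tunables,
# so notebooks, pipelines, and the dashboard never hard-code locations.
from pathlib import Path

PROJECT_ROOT = Path(__file__).resolve().parents[1]  # the ds/ directory
DATA_DIR = PROJECT_ROOT / "data"
RAW_DATA_DIR = DATA_DIR / "raw"
PROCESSED_DATA_DIR = DATA_DIR / "processed"
MODELS_DIR = PROJECT_ROOT / "models"
FIGURES_DIR = PROJECT_ROOT / "reports" / "figures"

RANDOM_STATE = 42   # seed shared by splits and models for reproducibility
TEST_SIZE = 0.2     # hold out 20% of the data for testing
```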

## 🚀 Quickstart

### 1. Create & activate a virtual environment

```bash
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate
```

### 2. Install dependencies

```bash
make install
# or manually:
pip install -r requirements.txt && pip install -e .
```

### 3. Run the training pipeline

```bash
make train
# or:
python -m src.pipelines.train_pipeline
```
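In spirit, the training pipeline is load → split → fit → evaluate → persist. A hedged sketch of such a pipeline on the Iris dataset; the function name, model choice, and default path are assumptions, not the project's actual code:

```python
# Hypothetical sketch of src/pipelines/train_pipeline.py.
import pickle
from pathlib import Path

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def run(model_path="models/model.pkl"):
    """Train a classifier on Iris, report test accuracy, persist the model."""
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))

    # Persist the fitted estimator for the prediction pipeline / dashboard.
    Path(model_path).parent.mkdir(parents=True, exist_ok=True)
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    return acc
```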

### 4. Launch the interactive dashboard

```bash
make dashboard
# or:
streamlit run dashboard/app.py
```

### 5. Run tests

```bash
make test
# or:
pytest tests/ -v
```
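pytest collects any function named `test_*`, so plain asserts suffice. A hedged sketch of the kind of check `tests/test_preprocess.py` might contain; the helper shown is a stand-in, not the project's real `src.data.preprocess` API:

```python
# Hypothetical sketch of tests/test_preprocess.py.

def fill_nulls_with_mean(values):
    """Stand-in for a preprocessing helper: replace None with the column mean."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def test_fill_nulls_with_mean():
    # The single missing value is imputed with the mean of 1.0 and 3.0.
    assert fill_nulls_with_mean([1.0, None, 3.0]) == [1.0, 2.0, 3.0]


def test_no_nulls_is_identity():
    # Columns without missing values pass through unchanged.
    assert fill_nulls_with_mean([4.0, 6.0]) == [4.0, 6.0]
```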

## 🧩 Key Features

| Feature | Details |
|---|---|
| Datasets | Iris (classification), Boston Housing (regression); built in, no downloads needed |
| EDA | Descriptive stats, correlation matrices, distribution plots |
| Preprocessing | Null handling, encoding, standard/min-max scaling |
| Models | Logistic Regression, Random Forest, XGBoost, Linear Regression, Gradient Boosting, KMeans, DBSCAN |
| Evaluation | Accuracy, F1, ROC-AUC, MAE, RMSE, R², Silhouette Score |
| Dashboard | EDA, training, evaluation, and live prediction, all in Streamlit |
| Testing | pytest with coverage reporting |
| Logging | Structured logging via Loguru |
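For intuition, the regression metrics in the table reduce to a few lines each. A pure-Python sketch of their definitions (the project itself presumably relies on scikit-learn's implementations):

```python
# Reference definitions of MAE, RMSE, and R^2, written out by hand.
import math


def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large residuals more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 minus residual/total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true, y_pred = [3.0, 5.0, 7.0], [2.5, 5.0, 8.0]
print(mae(y_true, y_pred))  # 0.5
print(r2(y_true, y_pred))   # 0.84375
```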

## 📊 Tech Stack

- Python 3.10+
- pandas, numpy, scipy – data manipulation
- scikit-learn, xgboost – machine learning
- matplotlib, seaborn, plotly – visualization
- streamlit – interactive dashboard
- loguru – logging
- pytest – testing

## 📖 Notebooks

| Notebook | Description |
|---|---|
| `01_eda.ipynb` | Exploratory Data Analysis on the Iris dataset |
| `02_preprocessing.ipynb` | Data cleaning & feature engineering on Titanic |
| `03_modeling.ipynb` | Train and compare multiple classifiers |
| `04_prediction.ipynb` | Inference demo with a saved model |
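The prediction notebook presumably loads a persisted model and calls `predict`. A minimal save/load round trip in the spirit of `src/utils/helpers.py`; the function names are illustrative assumptions:

```python
# Hypothetical sketch of the model-persistence helpers used by 04_prediction.ipynb.
import pickle
from pathlib import Path


def save_model(model, path):
    """Persist any picklable estimator, creating parent directories as needed."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a previously saved estimator for inference."""
    with open(path, "rb") as f:
        return pickle.load(f)
```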