adityaTechProjects/DataScience
# 🔬 Data Science Engineer Project

A production-quality, end-to-end data science project demonstrating the complete ML lifecycle: raw data ingestion, EDA, preprocessing, model training, evaluation, and interactive serving via a Streamlit dashboard.


## 📁 Project Structure

```text
ds/
├── data/
│   ├── raw/              # Raw source data
│   └── processed/        # Cleaned & feature-engineered data
├── models/               # Saved model artifacts (.pkl)
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_preprocessing.ipynb
│   ├── 03_modeling.ipynb
│   └── 04_prediction.ipynb
├── reports/
│   └── figures/          # Auto-generated plots
├── src/
│   ├── config.py         # Central configuration
│   ├── data/
│   │   ├── ingest.py     # Data loaders
│   │   ├── preprocess.py # Cleaning, encoding, scaling
│   │   └── split.py      # Train/val/test splits
│   ├── eda/
│   │   ├── analysis.py   # Descriptive stats, correlations
│   │   └── visualize.py  # Reusable plot helpers
│   ├── models/
│   │   ├── base_model.py        # Abstract base class
│   │   ├── classification.py    # LR, RF, XGBoost
│   │   ├── regression.py        # Linear, GradientBoosting
│   │   └── clustering.py        # KMeans, DBSCAN
│   ├── evaluation/
│   │   ├── metrics.py    # All metric computations
│   │   └── report.py     # Report generators
│   ├── pipelines/
│   │   ├── train_pipeline.py   # End-to-end training
│   │   └── predict_pipeline.py # Inference pipeline
│   └── utils/
│       ├── logger.py     # Loguru logger factory
│       └── helpers.py    # save/load model, dir utils
├── dashboard/
│   └── app.py            # Streamlit interactive dashboard
├── tests/
│   ├── test_preprocess.py
│   ├── test_metrics.py
│   └── test_pipeline.py
├── .gitignore
├── Makefile
├── README.md
├── requirements.txt
└── setup.py
```
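All model wrappers inherit from the abstract base class in `src/models/base_model.py`. A minimal sketch of that pattern, where the method names and the toy `MeanRegressor` subclass are illustrative assumptions, not the project's actual code:

```python
from abc import ABC, abstractmethod


class BaseModel(ABC):
    """Shared interface every model wrapper implements."""

    @abstractmethod
    def fit(self, X, y=None):
        """Train on features X (and targets y, when supervised)."""

    @abstractmethod
    def predict(self, X):
        """Return predictions for features X."""


class MeanRegressor(BaseModel):
    """Toy concrete subclass: always predicts the training-target mean."""

    def fit(self, X, y=None):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)
```

A shared base class like this lets the training and prediction pipelines treat classifiers, regressors, and clusterers uniformly.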

## 🚀 Quickstart

### 1. Create & activate a virtual environment

```bash
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate
```

### 2. Install dependencies

```bash
make install
# or manually:
pip install -r requirements.txt && pip install -e .
```

### 3. Run the training pipeline

```bash
make train
# or:
python -m src.pipelines.train_pipeline
```

### 4. Launch the interactive dashboard

```bash
make dashboard
# or:
streamlit run dashboard/app.py
```

### 5. Run tests

```bash
make test
# or:
pytest tests/ -v
```
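At its core, `make train` fits a scikit-learn estimator on one of the bundled datasets and scores it on a held-out split. A minimal sketch of that loop, assuming the Iris dataset and logistic regression (the real pipeline lives in `src/pipelines/train_pipeline.py` and does more):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the bundled Iris dataset (no download required)
X, y = load_iris(return_X_y=True)

# Hold out 20% for evaluation, stratified by class
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit and score a simple baseline classifier
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```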

## 🧩 Key Features

| Feature | Details |
| --- | --- |
| Datasets | Iris (classification), Boston Housing (regression), built-in (no downloads needed) |
| EDA | Descriptive stats, correlation matrices, distribution plots |
| Preprocessing | Null handling, encoding, standard/min-max scaling |
| Models | Logistic Regression, Random Forest, XGBoost, Linear Regression, Gradient Boosting, KMeans, DBSCAN |
| Evaluation | Accuracy, F1, ROC-AUC, MAE, RMSE, R², Silhouette Score |
| Dashboard | EDA, training, evaluation, and live prediction, all in Streamlit |
| Testing | pytest with coverage reporting |
| Logging | Structured logging via Loguru |
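The evaluation metrics listed above are the standard scikit-learn ones; for example (the values here are a toy hand-check, not project output):

```python
from sklearn.metrics import accuracy_score, f1_score, mean_absolute_error

# Toy classification example: 4 of 5 predictions correct
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
acc = accuracy_score(y_true, y_pred)   # fraction of exact matches
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision & recall

# Toy regression example: mean absolute error
mae = mean_absolute_error([3.0, 5.0], [2.5, 5.5])
```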

## 📊 Tech Stack

- Python 3.10+
- pandas, numpy, scipy – data manipulation
- scikit-learn, xgboost – machine learning
- matplotlib, seaborn, plotly – visualization
- streamlit – interactive dashboard
- loguru – logging
- pytest – testing

## 📖 Notebooks

| Notebook | Description |
| --- | --- |
| 01_eda.ipynb | Exploratory data analysis on the Iris dataset |
| 02_preprocessing.ipynb | Data cleaning & feature engineering on Titanic |
| 03_modeling.ipynb | Train and compare multiple classifiers |
| 04_prediction.ipynb | Inference demo with a saved model |
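The saved-model round trip behind `models/*.pkl` and the prediction notebook can be sketched with joblib, the persistence tool scikit-learn recommends (the file name and estimator here are illustrative; the project's actual helpers live in `src/utils/helpers.py`):

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train a small model on the bundled Iris dataset
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Persist the artifact to disk, then reload it for inference
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(model, path)
restored = joblib.load(path)

# The restored model reproduces the original's predictions
same = (restored.predict(X) == model.predict(X)).all()
```

Note that pickled models should only be loaded from trusted sources and, ideally, with the same library versions used to save them.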
