r-siddiq/machine-learning-data-science

Machine Learning & Data Science Monorepo

A structured monorepo of Jupyter notebooks covering data science foundations through classic machine learning. This repo favors the mainstream PyData stack - NumPy, pandas, Matplotlib/Seaborn, and scikit‑learn - with minimal external dependencies for a smooth, reproducible setup.


At a Glance

  • Audience: learners and practitioners who want a pragmatic path from data wrangling to ML modeling.
  • Format: self-contained Jupyter notebooks organized by topic and difficulty.
  • Compute: CPU‑friendly; no GPU/CUDA required.
  • Dependencies: pinned to the PyData stack (see requirements.txt).

Quickstart

1) Prerequisites

  • Python 3.9+
  • Graphviz (system package) for decision‑tree visuals:
      - macOS: brew install graphviz
      - Ubuntu/Debian: sudo apt-get update && sudo apt-get install -y graphviz
      - Windows (Chocolatey): choco install graphviz
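
After installing, a quick way to confirm that Graphviz is visible to Python is to look for its `dot` executable on the PATH. The snippet below is a minimal sketch using only the standard library; the helper name `graphviz_status` is made up for illustration and is not part of this repo.

```python
import shutil
import subprocess


def graphviz_status(executable="dot"):
    """Return (found, detail) for the Graphviz command-line tool."""
    path = shutil.which(executable)
    if path is None:
        return False, f"'{executable}' not found on PATH"
    # `dot -V` prints a version banner; Graphviz writes it to stderr.
    result = subprocess.run([path, "-V"], capture_output=True, text=True)
    banner = (result.stderr or result.stdout).strip()
    return True, banner or path


found, detail = graphviz_status()
print("Graphviz available:" if found else "Graphviz missing:", detail)
```

If this reports Graphviz as missing right after an install, the shell or Jupyter session probably started before the PATH changed; restarting it usually helps.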

2) Create an isolated environment

Option A  -  venv (macOS/Linux/Windows)

python -m venv .venv  
# macOS/Linux  
source .venv/bin/activate  
# Windows PowerShell  
# .venv\Scripts\Activate.ps1  

Option B  -  Conda

conda create -n ml-ds python=3.9 -y  
conda activate ml-ds  

If using Conda, installing Graphviz via conda-forge is recommended:
conda install -c conda-forge graphviz

3) Install requirements (and JupyterLab)

python -m pip install --upgrade pip  
pip install -r requirements.txt jupyterlab  

4) Launch notebooks

jupyter lab  
# or: jupyter notebook  

Project Structure

machine-learning-data-science/  
├── 1_data-science-foundation/        # NumPy & pandas fundamentals  
├── 2_data-visualization/            # Matplotlib & pandas plotting  
├── 3_exploratory-data-analysis/      # EDA patterns & diagnostics  
├── 4_data-cleaning-preprocessing/    # Data quality, preprocessing, Bayes basics  
├── 5_foundation-machine-learning/    # KNN and modeling‑ready EDA  
├── 6_model-building-evaluation/      # Linear regression & evaluation  
├── 7_advanced-modeling-system-design/ # Trees & ML system design  
└── requirements.txt  

Topic map & representative notebooks

  • 1_data-science-foundation/  -  NumPy & pandas foundations 
      intro_to_pandas.ipynb, numpy-basics.ipynb, numpy-arrays.ipynb, numpy-multidim-arrays.ipynb, pandas-series.ipynb, pandas-dataframes.ipynb

  • 2_data-visualization/  -  Plotting with Matplotlib & pandas 
      matplotlib-basics.ipynb, Vizualization_With_Matplotlib.ipynb, pandas-aggregation.ipynb, cars.ipynb, cars-python-graphics.ipynb

  • 3_exploratory-data-analysis/  -  Exploratory data analysis patterns 
      MissingData.ipynb, college_EDA.ipynb

  • 4_data-cleaning-preprocessing/  -  Data quality, preprocessing & probability 
      BadData_EDA.ipynb, DataPreprocessing.ipynb, SingleVariable_EDA.ipynb, TwoVariables_EDA.ipynb, BayesTheorem_MeaslesSim.ipynb

  • 5_foundation-machine-learning/  -  Intro ML tasks (KNN, modeling‑oriented EDA) 
      KNN_Classification.ipynb, KNN_Regression.ipynb, KNN_AnomalyDetector.ipynb, TwoVariablesP2_EDA.ipynb, campaign_EDA.ipynb

  • 6_model-building-evaluation/  -  Regression & evaluation 
      LinearRegression1.ipynb, LinearRegression2.ipynb, LinearRegression3.ipynb, LinearRegression4.ipynb, KNN_Hyperparameters.ipynb

  • 7_advanced-modeling-system-design/  -  Trees & ML system design 
      ClassificationTrees1.ipynb, ClassificationTrees2.ipynb, DescisionTrees.ipynb, linear-regression.ipynb, SystemDesign.ipynb

Tip: the numbering is a suggested progression. Each notebook is self‑contained; feel free to jump to topics as needed.
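
To see the suggested progression on disk, the sketch below walks the numbered topic folders and lists the notebooks in each. It assumes it runs from the repository root and that topic folders follow the `<number>_<name>` pattern shown above; `notebooks_by_topic` is an illustrative helper, not something shipped with the repo.

```python
from pathlib import Path


def notebooks_by_topic(repo_root="."):
    """Map each numbered topic folder to its sorted notebook file names."""
    topics = []
    for folder in Path(repo_root).iterdir():
        # Topic folders look like "3_exploratory-data-analysis".
        prefix = folder.name.split("_", 1)[0]
        if folder.is_dir() and prefix.isdecimal():
            topics.append((int(prefix), folder))
    # Sort numerically so that a folder "10_..." would come after "9_...".
    topics.sort(key=lambda item: (item[0], item[1].name))
    return {
        folder.name: sorted(p.name for p in folder.glob("*.ipynb"))
        for _, folder in topics
    }


for topic, notebooks in notebooks_by_topic().items():
    print(f"{topic}: {len(notebooks)} notebook(s)")
    for name in notebooks:
        print("   ", name)
```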


Development Workflow

The repo is notebook‑first. The following conventions keep notebooks clean and reproducible.

Environment management

  • Use a local virtual environment (.venv) pinned by requirements.txt.
  • If multiple Python versions are installed, register the environment as a named Jupyter kernel so notebooks run on the intended interpreter:
      python -m ipykernel install --user --name ml-ds --display-name "Python (ml-ds)"  
      Then select Python (ml-ds) as the kernel in JupyterLab.
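
To check which interpreter a running kernel actually uses, a cell along these lines can help. It is a minimal standard-library sketch, and `interpreter_report` is a hypothetical helper name. Note that the `in_venv` flag only detects venv-style environments; for Conda, compare the printed prefix against the path of the ml-ds environment instead.

```python
import sys


def interpreter_report():
    """Describe which Python interpreter the current kernel is running."""
    return {
        "executable": sys.executable,
        "version": sys.version.split()[0],
        # In a venv, sys.prefix points at the environment while
        # sys.base_prefix points at the base installation.
        "in_venv": sys.prefix != sys.base_prefix,
        "prefix": sys.prefix,
    }


for key, value in interpreter_report().items():
    print(f"{key:>10}: {value}")
```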

Notebook hygiene

  • Restart & Run All before committing changes to verify a clean state.
  • Prefer pure‑Python + standard PyData idioms; avoid hidden state in global variables.
  • Keep plots lightweight (Matplotlib/Seaborn); large figures should save to disk via plt.savefig(...) when needed.
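
As an illustration of the hidden-state point, the sketch below contrasts a function that reads a module-level variable with one that takes its inputs and random seed as explicit parameters. The names and the bootstrap example are invented for illustration and do not come from the notebooks.

```python
import random
import statistics

# Fragile: the result depends on whatever earlier cells left in `samples`.
samples = [1.0, 2.0, 3.0]


def mean_from_global():
    return statistics.mean(samples)


# More robust: data and seed are explicit, so the cell gives the same
# answer after "Restart & Run All" regardless of execution order.
def bootstrap_mean(values, n_resamples=200, seed=0):
    rng = random.Random(seed)  # local generator, no shared global state
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(values) for _ in values]
        means.append(statistics.mean(resample))
    return statistics.mean(means)


print(bootstrap_mean([1.0, 2.0, 3.0]))
```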

Data Notes

  • Notebooks primarily use toy/synthetic data or built‑in datasets (e.g., from Seaborn or scikit‑learn). No external data downloads are required for core lessons.
  • When experimenting with personal datasets, prefer CSV/Parquet under a local data/ folder ignored by version control (e.g., add /data to .gitignore).
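
One way to make sure a local data/ folder stays out of version control is to add the ignore pattern only when it is missing. This is a minimal sketch; `ensure_ignored` is a hypothetical helper, and the default `/data` pattern and `.gitignore` path are assumptions taken from the note above.

```python
from pathlib import Path


def ensure_ignored(pattern="/data", gitignore=".gitignore"):
    """Append `pattern` to the .gitignore file unless it is already listed."""
    path = Path(gitignore)
    lines = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    if pattern in (line.strip() for line in lines):
        return False  # already ignored, nothing written
    lines.append(pattern)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return True


# Example (run from the repository root):
# ensure_ignored()  # returns True if "/data" was added, False if present
```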

Troubleshooting

Graphviz errors (tree visualizations fail): Ensure the Graphviz system package is installed (see Quickstart). After installing, fully restart JupyterLab.

Kernel not listed / wrong interpreter: Re‑create the ipykernel with the environment you installed packages into (see ipykernel install command above), then pick it from the kernel selector.

Package mismatch: Run the following in a notebook cell to verify versions:

import sys, numpy, pandas, matplotlib, seaborn, sklearn  
print(sys.version)  
print("numpy=", numpy.__version__)  
print("pandas=", pandas.__version__)  
print("matplotlib=", matplotlib.__version__)  
print("seaborn=", seaborn.__version__)  
print("scikit-learn=", sklearn.__version__)  
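
To compare installed versions against requirements.txt rather than reading them by eye, something like the following may help. It is a rough sketch that assumes the file uses exact `name==version` pins; other specifiers (`>=`, extras, URLs) are skipped or may be misreported. `check_pins` is an illustrative helper, not part of the repo.

```python
from importlib.metadata import PackageNotFoundError, version
from pathlib import Path


def check_pins(requirement_lines):
    """Compare `name==version` pins against installed distributions."""
    report = {}
    for raw in requirement_lines:
        # Drop comments, environment markers, and surrounding whitespace.
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" not in line:
            continue  # blank lines, unpinned names, other specifiers
        name, wanted = (part.strip() for part in line.split("==", 1))
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = ("missing", wanted, None)
            continue
        status = "ok" if installed == wanted else "mismatch"
        report[name] = (status, wanted, installed)
    return report


requirements = Path("requirements.txt")
if requirements.exists():
    lines = requirements.read_text(encoding="utf-8").splitlines()
    for name, (status, wanted, installed) in check_pins(lines).items():
        print(f"{name}: {status} (pinned {wanted}, installed {installed})")
```

Any entry reported as "mismatch" or "missing" suggests re-running the install step in the active environment.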
