A structured monorepo of Jupyter notebooks covering data science foundations through classic machine learning. This repo favors the mainstream PyData stack - NumPy, pandas, Matplotlib/Seaborn, and scikit‑learn - with minimal external dependencies for a smooth, reproducible setup.
- Audience: learners and practitioners who want a pragmatic path from data wrangling to ML modeling.
- Format: self-contained Jupyter notebooks organized by topic and difficulty.
- Compute: CPU‑friendly; no GPU/CUDA required.
- Dependencies: pinned to the PyData stack (see requirements.txt).
- Python 3.9+
- Graphviz (system package) for decision‑tree visuals:
  - macOS: brew install graphviz
  - Ubuntu/Debian: sudo apt-get update && sudo apt-get install -y graphviz
  - Windows (Chocolatey): choco install graphviz
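
The prerequisites above can be checked from Python before setting up an environment. This is a minimal sketch, not part of the repo: the `check_prerequisites` helper is a hypothetical name, and it assumes Graphviz exposes its usual `dot` executable on the PATH.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)  # minimum version listed in the prerequisites


def check_prerequisites(min_python=MIN_PYTHON):
    """Return a dict describing whether each prerequisite appears to be met."""
    return {
        "python_ok": sys.version_info >= min_python,
        # `dot` is the Graphviz layout executable; None means it is not on PATH.
        "graphviz_dot": shutil.which("dot"),
    }


status = check_prerequisites()
print("Python", sys.version.split()[0], "OK" if status["python_ok"] else "too old")
print("Graphviz dot:", status["graphviz_dot"] or "not found on PATH")
```

If `dot` is reported as missing right after installation, opening a new terminal usually refreshes the PATH.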
Option A - venv (macOS/Linux/Windows)
python -m venv .venv
# macOS/Linux
source .venv/bin/activate
# Windows PowerShell
# .venv\Scripts\Activate.ps1
Option B - Conda
conda create -n ml-ds python=3.9 -y
conda activate ml-ds
If using Conda, installing Graphviz via conda-forge is recommended:
conda install -c conda-forge graphviz
python -m pip install --upgrade pip
pip install -r requirements.txt jupyterlab
jupyter lab
# or: jupyter notebook
machine-learning-data-science/
├── 1_data-science-foundation/ # NumPy & pandas fundamentals
├── 2_data-visualization/ # Matplotlib & pandas plotting
├── 3_exploratory-data-analysis/ # EDA patterns & diagnostics
├── 4_data-cleaning-preprocessing/ # Data quality, preprocessing, Bayes basics
├── 5_foundation-machine-learning/ # KNN and modeling‑ready EDA
├── 6_model-building-evaluation/ # Linear regression & evaluation
├── 7_advanced-modeling-system-design/ # Trees & ML system design
└── requirements.txt
- 1_data-science-foundation/ - NumPy & pandas foundations
  intro_to_pandas.ipynb, numpy-basics.ipynb, numpy-arrays.ipynb, numpy-multidim-arrays.ipynb, pandas-series.ipynb, pandas-dataframes.ipynb
- 2_data-visualization/ - Plotting with Matplotlib & pandas
  matplotlib-basics.ipynb, Vizualization_With_Matplotlib.ipynb, pandas-aggregation.ipynb, cars.ipynb, cars-python-graphics.ipynb
- 3_exploratory-data-analysis/ - Exploratory data analysis patterns
  MissingData.ipynb, college_EDA.ipynb
- 4_data-cleaning-preprocessing/ - Data quality, preprocessing & probability
  BadData_EDA.ipynb, DataPreprocessing.ipynb, SingleVariable_EDA.ipynb, TwoVariables_EDA.ipynb, BayesTheorem_MeaslesSim.ipynb
- 5_foundation-machine-learning/ - Intro ML tasks (KNN, modeling‑oriented EDA)
  KNN_Classification.ipynb, KNN_Regression.ipynb, KNN_AnomalyDetector.ipynb, TwoVariablesP2_EDA.ipynb, campaign_EDA.ipynb
- 6_model-building-evaluation/ - Regression & evaluation
  LinearRegression1.ipynb, LinearRegression2.ipynb, LinearRegression3.ipynb, LinearRegression4.ipynb, KNN_Hyperparameters.ipynb
- 7_advanced-modeling-system-design/ - Trees & ML system design
  ClassificationTrees1.ipynb, ClassificationTrees2.ipynb, DescisionTrees.ipynb, linear-regression.ipynb, SystemDesign.ipynb
Tip: the numbering is a suggested progression. Each notebook is self‑contained; feel free to jump to topics as needed.
The repo is notebook‑first. The following conventions keep notebooks clean and reproducible.
- Use a local virtual environment (.venv) pinned by requirements.txt.
- If multiple Python versions are installed, register the environment as a named Jupyter kernel so notebooks run on the intended interpreter:
  python -m ipykernel install --user --name ml-ds --display-name "Python (ml-ds)"
  Then select Python (ml-ds) as the kernel in JupyterLab.
- Restart & Run All before committing changes to verify a clean state.
- Prefer pure‑Python + standard PyData idioms; avoid hidden state in global variables.
- Keep plots lightweight (Matplotlib/Seaborn); large figures should save to disk via plt.savefig(...) when needed.
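
As an illustration of the save-to-disk convention, here is a minimal sketch. The `save_line_plot` helper, the temporary output folder, and the file name are assumptions made for the example, not names used by the notebooks. The `matplotlib.use("Agg")` line is only there so the snippet also runs as a plain script without a display; inside a notebook it can be dropped.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; safe without a display
import matplotlib.pyplot as plt


def save_line_plot(xs, ys, out_path, dpi=150):
    """Draw a simple line plot, write it to out_path, and release the figure."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(xs, ys)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    fig.savefig(out_path, dpi=dpi, bbox_inches="tight")
    plt.close(fig)  # free memory so figures do not accumulate in the kernel
    return Path(out_path)


out_dir = Path(tempfile.mkdtemp())  # stand-in for a project folder such as figures/
saved = save_line_plot([0, 1, 2, 3], [0, 1, 4, 9], out_dir / "squares.png")
print(saved, saved.stat().st_size, "bytes")
```

Closing the figure after saving keeps long notebooks from holding many open figures in memory.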
- Notebooks primarily use toy/synthetic data or built‑in datasets (e.g., from Seaborn or scikit‑learn). No external data downloads are required for core lessons.
- When experimenting with personal datasets, prefer CSV/Parquet under a local data/ folder ignored by version control (e.g., add /data to .gitignore).
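
The two data sources described above might be loaded as follows. This is a sketch under assumptions: the iris dataset is just one example of a built-in scikit-learn dataset, and `load_personal_csv` and `my_dataset.csv` are hypothetical names used only for illustration.

```python
from pathlib import Path

import pandas as pd
from sklearn.datasets import load_iris

# Built-in dataset: bundled with scikit-learn, so nothing is downloaded.
iris = load_iris(as_frame=True)
iris_df = iris.frame  # feature columns plus a "target" column
print(iris_df.shape)


def load_personal_csv(name, data_dir="data"):
    """Read data/<name> if present; return None when the file is missing."""
    path = Path(data_dir) / name
    if not path.exists():
        return None
    return pd.read_csv(path)


# "my_dataset.csv" is a hypothetical file name used only for illustration.
personal_df = load_personal_csv("my_dataset.csv")
print("personal data loaded:", personal_df is not None)
```

Returning None for a missing file keeps the notebook runnable on a fresh clone, where the ignored data/ folder does not exist yet.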
Graphviz errors (tree visualizations fail): Ensure the Graphviz system package is installed (see Quickstart). After installing, fully restart JupyterLab.
Kernel not listed / wrong interpreter: Re‑create the ipykernel with the environment you installed packages into (see ipykernel install command above), then pick it from the kernel selector.
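
To see which interpreter a notebook is actually running on, a cell along these lines can help. It is a minimal sketch using only the standard library; `interpreter_info` is a hypothetical helper name.

```python
import sys


def interpreter_info():
    """Collect the paths that identify the running interpreter."""
    return {
        "executable": sys.executable,  # path to the Python binary in use
        "prefix": sys.prefix,  # root of the active environment
        # In a venv, sys.prefix differs from sys.base_prefix.
        "in_venv": sys.prefix != sys.base_prefix,
    }


info = interpreter_info()
for key, value in info.items():
    print(f"{key}: {value}")
```

If the executable path does not point inside .venv (or the ml-ds Conda environment), the notebook is using a different kernel than the one the packages were installed into. Note that the in_venv flag only detects venv-style environments; a Conda environment typically reports False, so check the prefix path there instead.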
Package mismatch: Run the following in a notebook cell to verify versions:
import sys, numpy, pandas, matplotlib, seaborn, sklearn
print(sys.version)
print("numpy=", numpy.__version__)
print("pandas=", pandas.__version__)
print("matplotlib=", matplotlib.__version__)
print("seaborn=", seaborn.__version__)
print("scikit-learn=", sklearn.__version__)
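
The printed versions can also be compared against requirements.txt automatically. The sketch below assumes the file uses simple `name==version` pins; lines in any other form (ranges, extras, includes) are skipped. `parse_pins` and `find_mismatches` are hypothetical helper names.

```python
import re
from importlib import metadata
from pathlib import Path

PIN_PATTERN = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*==\s*([^\s;#]+)")


def parse_pins(text):
    """Return {package: version} for lines of the form `name==version`."""
    pins = {}
    for line in text.splitlines():
        match = PIN_PATTERN.match(line)
        if match:
            pins[match.group(1).lower()] = match.group(2)
    return pins


def find_mismatches(pins):
    """Compare pins with installed versions; return {name: (pinned, installed)}."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


req_file = Path("requirements.txt")
if req_file.exists():
    pins = parse_pins(req_file.read_text())
    for name, (pinned, installed) in find_mismatches(pins).items():
        print(f"{name}: pinned {pinned}, installed {installed}")
```

No output means every pinned package matches; any line printed points at a package to reinstall with pip install -r requirements.txt.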