A comprehensive collection of data science and machine learning projects, tutorials, and real-world applications. This repository contains 29 Jupyter notebooks covering fundamental concepts, advanced algorithms, model evaluation techniques, and practical data analysis projects.
- Overview
- Repository Structure
- Getting Started
- Project Categories
- Technologies Used
- Prerequisites
- Installation
- Usage
- Key Highlights
- Contributing
- License
This repository serves as both a learning resource and a portfolio showcase, demonstrating proficiency in:
- Machine Learning Algorithms: Classification, clustering, and ensemble methods
- Deep Learning: Neural network implementations
- Data Analysis: Exploratory data analysis and visualization
- Model Evaluation: Cross-validation, hyperparameter tuning, and metrics
- Real-World Applications: Environmental studies, real estate analysis, healthcare predictions
The repository contains standalone Jupyter notebooks organized by topic. Each notebook runs on its own, so you can explore a specific area of interest without opening or running any other notebook.
- Python 3.7+
- Jupyter Notebook or JupyterLab
- Required Python libraries (see Installation)
- Clone the repository:

      git clone https://github.com/DataDarling/DATA-SCIENCE-projects.git
      cd DATA-SCIENCE-projects

- Create a virtual environment (recommended):

      python -m venv venv
      source venv/bin/activate  # On Windows: venv\Scripts\activate

- Install required dependencies:

      pip install jupyter numpy pandas matplotlib seaborn scikit-learn scipy

Launch Jupyter Notebook:

    jupyter notebook

Navigate to any notebook and run the cells to see the demonstrations and analyses.
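Before launching a notebook, it can help to confirm that the libraries from the install step are importable in the active environment. The snippet below is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of this repository. Note that scikit-learn is imported as `sklearn`.

```python
import importlib.util

# Import names for the libraries installed above.
# scikit-learn is installed as "scikit-learn" but imported as "sklearn".
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn", "sklearn", "scipy"]


def missing_packages(names=REQUIRED):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("All packages found." if not missing else f"Missing: {missing}")
```

If the list is not empty, re-run the `pip install` command inside the activated virtual environment.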
| Notebook | Description | Dataset | Key Concepts |
|---|---|---|---|
| linear_regression.ipynb | Simple and multiple linear regression from first principles | Synthetic data (1D & 2D) | Matrix-based regression, coefficient calculation, prediction visualization |
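As an illustration of the matrix-based approach named in the table, the sketch below fits a two-feature linear model with NumPy's least-squares solver. The data, coefficients, and variable names are assumptions chosen for the example and may differ from what `linear_regression.ipynb` actually does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2D data: y = 1.5 + 2.0*x1 - 3.0*x2 + small noise (illustrative values).
X = rng.uniform(-1, 1, size=(200, 2))
y = 1.5 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 0.01, size=200)

# Prepend a column of ones so the intercept is learned as the first coefficient.
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X_design @ beta ~ y. lstsq is numerically safer
# than explicitly inverting X^T X as in the textbook normal equations.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Predictions from the fitted coefficients.
y_pred = X_design @ beta
print("coefficients (intercept, x1, x2):", np.round(beta, 2))
```

The recovered coefficients should be close to the values used to generate the data, which is a quick sanity check for any from-scratch regression.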
Fundamental Algorithms:
| Notebook | Description | Dataset | Key Concepts |
|---|---|---|---|
| logistic_regression.ipynb | Binary and multiclass logistic regression implementations | Synthetic data | Classification, decision boundaries, sigmoid function |
| naive_Bayes.ipynb | Implementation of Multinomial, Bernoulli, and Gaussian Naive Bayes | Synthetic text & binary data | Probabilistic classification, Bayes theorem |
| decision_tree.ipynb | Decision tree classifier with visualization | Iris Dataset | Tree-based learning, feature importance, tree visualization |
| knn_knearest neighbors.ipynb | k-Nearest Neighbors algorithm | Synthetic data | Distance metrics, k-value selection, instance-based learning |
| suport_vector_machine_svm.ipynb | SVM with multiple kernel types (linear, polynomial, RBF) | Iris Dataset | Kernel methods, hyperplane optimization, C and gamma parameters |
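To give a feel for the kernel comparison described in the SVM row, here is a minimal sketch that trains scikit-learn's `SVC` on Iris with the three kernels listed. The train/test split, scaling step, and parameter values are assumptions for illustration, not necessarily the notebook's settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

scores = {}
for kernel in ("linear", "poly", "rbf"):
    # C controls regularization strength; gamma="scale" is scikit-learn's default.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0, gamma="scale"))
    model.fit(X_train, y_train)
    scores[kernel] = model.score(X_test, y_test)

for kernel, acc in scores.items():
    print(f"{kernel:>6}: test accuracy = {acc:.3f}")
```

Varying `C` and `gamma` in the loop is a simple way to see how each parameter changes the fit.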
Advanced Classification Applications:
| Notebook | Description | Dataset | Key Achievements |
|---|---|---|---|
| Heart Disease Prediction Model using Logistic Regression.ipynb | Medical diagnosis prediction model | Heart Disease Dataset | Healthcare ML application, logistic regression |
| Breast Cancer Dataset - Model Evaluation and Hyperparameter Tuning.ipynb | Comprehensive model evaluation and optimization | Breast Cancer Dataset | Hyperparameter tuning, model selection, medical ML |
| Multi-Class Classification and Model Tuning.ipynb | Multi-class problem solving with model comparison | Synthetic data | Model comparison, hyperparameter optimization |
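The hyperparameter tuning mentioned for the breast cancer notebook could look roughly like the sketch below, which grid-searches the regularization strength of a logistic regression with 5-fold cross-validation. The choice of model, the grid values, and the split are assumptions; the notebook may tune different estimators or parameters.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaling lives inside the pipeline so it is refit on each training fold,
# which avoids leaking validation-fold statistics into the model.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("best params:", search.best_params_)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

Keeping a held-out test set separate from the cross-validated search gives a less optimistic estimate than the best cross-validation score alone.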
Unsupervised Learning:
| Notebook | Description | Dataset | Key Concepts |
|---|---|---|---|
| k_means_clustering.ipynb | K-Means clustering fundamentals | Synthetic blobs | Centroid-based clustering, elbow method |
| Iris Dataset - k_means clustering.ipynb | K-Means applied to Iris dataset | Iris Dataset | Cluster visualization, centroid analysis |
| clustering.ipynb | Advanced clustering techniques | Synthetic blobs (300 samples) | GMM, hierarchical clustering (Ward, Complete, Average, Single linkage), dendrograms |
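The elbow method listed for K-Means can be sketched as follows: fit the model for a range of `k` values and record the inertia (within-cluster sum of squares) for each. The blob parameters and the range of `k` are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic blobs; the number of centers and spread are illustrative choices.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # inertia_ is the sum of squared distances from each point to its centroid.
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(f"k={k}: inertia={value:.1f}")
```

Plotting inertia against `k` and looking for the point where the curve flattens gives a rough, visual guide to a reasonable number of clusters.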
| Notebook | Description | Algorithms | Performance |
|---|---|---|---|
| ensemble_models.ipynb | Boosting methods for classification and regression | AdaBoost, Gradient Boosting | 100% accuracy on Iris, comprehensive regression metrics |
| neural_nets.ipynb | Multi-Layer Perceptron implementation | Neural Network (2-10-1 architecture) | Decision boundaries, ReLU activation, Sigmoid output |
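For readers unfamiliar with the 2-10-1 architecture in the table, the following is a minimal NumPy sketch of its forward pass: two inputs, ten ReLU hidden units, and one sigmoid output. The weight initialization and function names are assumptions, and training is omitted; the notebook's own implementation may differ.

```python
import numpy as np


def relu(z):
    """Element-wise max(0, z)."""
    return np.maximum(0.0, z)


def sigmoid(z):
    """Squash values into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def init_params(rng, n_in=2, n_hidden=10, n_out=1):
    """Small random weights and zero biases for a 2-10-1 network."""
    return {
        "W1": rng.normal(0, 0.5, size=(n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0, 0.5, size=(n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }


def forward(params, X):
    """Return one probability per input row."""
    hidden = relu(X @ params["W1"] + params["b1"])
    return sigmoid(hidden @ params["W2"] + params["b2"])


rng = np.random.default_rng(0)
params = init_params(rng)
X = rng.uniform(-1, 1, size=(5, 2))
probs = forward(params, X)
print(probs.shape)  # (5, 1)
```

Evaluating `forward` on a grid of points and thresholding at 0.5 is one way to draw the decision boundary of such a network.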
Comprehensive Evaluation Techniques:
| Notebook | Description | Focus | Key Techniques |
|---|---|---|---|
| evaluation_metrics.ipynb | Complete classification metrics guide | Performance measurement | Confusion matrix, Accuracy, Precision, Recall, F1-score, ROC curves, AUC |
| cross_validation.ipynb | Cross-validation strategies | Model validation | KFold, LeaveOneOut, LeavePOut, ShuffleSplit, StratifiedKFold |
| feature_selection.ipynb | Feature importance and selection | Wine dataset analysis | Univariate selection, feature ranking |
| feature_selection_advanced.ipynb | Advanced feature engineering | Synthetic data | Missing value analysis, tree-based feature importance, Random Forest |
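The cross-validation strategies listed above differ mainly in how many train/test splits they produce and how the splits are drawn. The sketch below counts the splits each scikit-learn splitter generates on a toy dataset of 10 samples; the dataset and parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import (
    KFold,
    LeaveOneOut,
    LeavePOut,
    ShuffleSplit,
    StratifiedKFold,
)

# Toy data: 10 samples, 2 features, balanced binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

splitters = {
    "KFold": KFold(n_splits=5),
    "LeaveOneOut": LeaveOneOut(),
    "LeavePOut": LeavePOut(p=2),
    "ShuffleSplit": ShuffleSplit(n_splits=3, test_size=0.2, random_state=0),
    "StratifiedKFold": StratifiedKFold(n_splits=5),
}

# Count how many (train, test) index pairs each strategy yields.
n_splits = {name: sum(1 for _ in cv.split(X, y)) for name, cv in splitters.items()}

for name, count in n_splits.items():
    print(f"{name:>16}: {count} splits")
```

`LeavePOut` grows combinatorially with the dataset size (here, 10 choose 2 = 45 splits), which is why it is rarely practical beyond small datasets.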
Core Skills & Tools:
| Notebook | Description | Focus |
|---|---|---|
| numpy_demo.ipynb | NumPy fundamentals | Array operations, formatting, mathematical operations |
| pandas_demo.ipynb | Pandas library essentials | DataFrame manipulation, data operations |
| data_visualization.ipynb | Data visualization techniques | Matplotlib, Seaborn plotting |
| data_preparation.ipynb | Data preprocessing workflow | Data cleaning, transformation |
| data_quality_report.ipynb | Data quality assessment | Missing values, quality metrics |
| iris_data_exploration.ipynb | Comprehensive EDA | Statistical summaries, distributions |
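A data quality assessment of the kind described for `data_quality_report.ipynb` often starts with a per-column summary of types, missing values, and distinct values. The sketch below builds such a summary with pandas on a small made-up DataFrame; the column names and values are hypothetical and not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a few missing entries.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, np.nan, 4.7, 4.6],
        "species": ["setosa", "setosa", None, "virginica"],
    }
)

# One row per column of df: type, missing count and percentage, distinct values.
report = pd.DataFrame(
    {
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
        "unique": df.nunique(),  # nunique ignores missing values by default
    }
)

print(report)
```

The same pattern scales to larger datasets, where sorting the report by `missing_pct` quickly surfaces the columns that need attention.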
Applied Data Science:
| Project | Description | Domain | Key Insights |
|---|---|---|---|
| weather data analysis.ipynb | Weather pattern analysis | Meteorology | Trend analysis, seasonal patterns |
| New York Homes and Hotel Listings Data Exploration.ipynb | NYC real estate and hospitality analysis | Real Estate | Price trends, market insights, feature analysis |
| 2020-2025 Southern States Fungi Report.ipynb | Multi-year fungi observation study | Ecology | Species distribution, temporal patterns, regional analysis |
| georgia fungi observation report sept-oct 24.ipynb | Georgia fungi seasonal report | Environmental Science | Monthly patterns, species identification |
| V2 Hikes Of Georgia Monthly Fungi Reports.ipynb | Hiking and fungi tracking in Georgia | Outdoor Recreation/Ecology | Geographic analysis, species by location |
- Programming Language: Python 3.x
- Data Manipulation: NumPy, Pandas
- Visualization: Matplotlib, Seaborn
- Machine Learning: Scikit-learn
- Deep Learning: Neural network implementations
- Development Environment: Jupyter Notebook
- Statistical Analysis: SciPy
- Regression: Linear regression with matrix-based solutions
- 9 Classification algorithms including SVM, Naive Bayes, Logistic Regression, Decision Trees, KNN, and ensemble methods
- 3 Clustering approaches with various linkage methods and algorithms
- Neural Networks with custom architecture
- Comprehensive evaluation with 4 notebooks dedicated to model validation
- Healthcare applications: Heart disease and breast cancer prediction models
- Environmental research: Multi-year fungi observation studies
- Market analysis: Real estate and hospitality data exploration
- Beginner-friendly: Starts with NumPy and Pandas basics
- Progressive complexity: Moves from simple algorithms to advanced ensemble methods
- Best practices: Demonstrates proper cross-validation, feature selection, and model evaluation
- Well-documented: Each notebook contains explanations and visualizations
Contributions, issues, and feature requests are welcome! Feel free to check the issues page.
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is available for educational and reference purposes. Please check individual datasets for their respective licenses.
For questions or collaborations, please open an issue in the repository.
Note: Some notebooks may require specific datasets. Check individual notebooks for dataset sources and requirements.
Last Updated: February 2026