Mutheu04/README.md




🧠 About Me

"I don't just analyse data — I translate it into decisions that move the needle."

I'm a commercially-minded Data Analyst & Data Scientist with an MSc in Data Science (University of Salford) and 5+ years of experience turning complex datasets into clear, actionable business insights across non-profit, FMCG, regulatory, and e-commerce domains.

My flagship project is an end-to-end ML pipeline on ~500,000 donor records for Marie Curie. It achieved a ROC-AUC of 0.957, a PR-AUC of 0.775, and a 5× campaign targeting lift, showing that targeting just the top 10% of donors could capture a 78% response rate while cutting mailing volume by 70–80% and dramatically improving campaign ROI.

I'm equally comfortable writing complex SQL, automating workflows with Python, building Power BI dashboards that slash reporting time by 60%, and deploying scalable recommendation engines on Databricks.


📊 Impact at a Glance

| Metric | Result |
|---|---|
| 🏆 Best ML Model (ROC-AUC) | 0.957 (Marie Curie Donor Prediction) |
| 📈 Campaign Targeting Lift | 5× over baseline response rate |
| 💹 Market Share Growth | +30% within 12 months |
| 🎯 Demand Forecast Accuracy | ~90% across 50+ product lines |
| ⚡ Reporting Time Reduction | ~60% via automated Power BI dashboards |
| 👥 Stakeholders Trained | 50 sales reps, 98% tool adoption rate |
| 🔬 Regulatory Sources Automated | 6+ international sources (ECHA, REACH, Stockholm Convention) |

🚀 Featured Projects

🏥 Donor Response Prediction — Marie Curie Christmas Cash Appeal

Python · Scikit-learn · SHAP · SMOTE · CRISP-DM · Gradient Boosting

End-to-end ML pipeline on ~500K donor records predicting Christmas appeal response. Engineered 15+ behavioural features including RFM scores, donor fatigue indicators, mailing exposure counts, and engagement trend deltas. Compared Logistic Regression, Random Forest, and Gradient Boosting using hyperparameter tuning via RandomizedSearchCV, 5-fold stratified cross-validation, threshold sensitivity analysis, and SHAP-based model interpretability.

Results: ROC-AUC 0.957 · PR-AUC 0.775 · 5× targeting lift · 70–80% reduction in unnecessary outreach · top-10% scoring threshold → 78% response rate
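
The targeting lift quoted above can be read as the response rate among the top-scored donors divided by the overall response rate. The following is a minimal sketch of that arithmetic, not code from the project: `top_fraction_lift` is a hypothetical helper, and the `scores` and `labels` values are invented toy data rather than Marie Curie records.

```python
def top_fraction_lift(scores, labels, fraction=0.10):
    """Return (top_rate, baseline_rate, lift) for the top-scoring fraction."""
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and the same length")
    # Rank donors from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    # Keep at least one donor so the rate is always defined.
    k = max(1, int(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:k]) / k
    baseline = sum(labels) / len(labels)
    lift = top_rate / baseline if baseline > 0 else float("nan")
    return top_rate, baseline, lift


# Toy data (invented for illustration): 20 donors, 4 responders overall.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.50, 0.45, 0.40, 0.35,
          0.30, 0.28, 0.25, 0.22, 0.20, 0.15, 0.12, 0.10, 0.05, 0.01]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

top_rate, baseline, lift = top_fraction_lift(scores, labels, fraction=0.10)
print(top_rate, baseline, lift)
```

On this toy data the top 10% is two donors, both of whom responded, against a 20% baseline, so the lift works out to 5. Real figures depend on the model's scores and the true response labels.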

Python · Scikit-learn · Jupyter


Wine Quality Prediction

R · Random Forest · XGBoost · Statistical Testing · EDA

ML project predicting wine quality from physicochemical attributes using R. Applied statistical hypothesis testing, feature importance analysis, and comparative model evaluation across Random Forest and XGBoost, with full visualisation pipeline.

R · Random Forest · XGBoost


Steam Recommender System

PySpark · ALS · Databricks · MLflow · SparkSQL · Big Data

Scalable collaborative filtering recommendation engine built on Databricks using PySpark ALS, analysing user play-history and rating patterns to serve personalised game recommendations. Experiment tracking with MLflow.

Apache Spark · Databricks · MLflow


Global Population Dashboard (Power BI)

Power BI · DAX · Power Query · Business Intelligence

Interactive Power BI dashboard analysing global population trends from 1960–2050 with demographic projections, regional breakdowns, and trend forecasting. Demonstrates production-level DAX and Power Query skills.

Power BI · DAX


Obesity Level Classification

Python · Scikit-learn · Multi-class Classification · Hyperparameter Tuning

Multi-class classification pipeline predicting obesity levels from lifestyle and dietary data. Compared Logistic Regression, Random Forest, and Gradient Boosting with full hyperparameter tuning and cross-validation.

Python · Scikit-learn


Online Shopping Intention Clustering

Python · K-Means · Hierarchical Clustering · PCA · Silhouette Analysis

Customer segmentation project using unsupervised learning to profile online shoppers by browsing behaviour. Dimensionality reduction via PCA, cluster optimisation with silhouette analysis, and comparative evaluation of K-Means vs. hierarchical approaches.
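
For readers unfamiliar with silhouette analysis, the idea is to score each point by how much closer it sits to its own cluster than to the nearest other cluster, then average those scores to compare candidate clusterings. The sketch below is a dependency-free illustration under that definition, using Euclidean distance; it is not the project's code (which lists scikit-learn-style tooling), and the four points and both labelings are invented.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # Convention: a point alone in its cluster contributes 0.
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so it does not affect the sum).
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy data (invented): two tight groups far apart on the x-axis.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # groups match the geometry
bad = silhouette_score(points, [0, 1, 0, 1])   # groups cut across it
print(good, bad)
```

A labeling that matches the geometry scores close to 1, while one that cuts across it scores below 0. Cluster counts are typically compared the same way, by keeping the clustering with the highest mean silhouette.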

Python · PCA


Sentiment Analysis of Reviews

Python · NLP · TF-IDF · Logistic Regression · Linear SVM

NLP pipeline classifying sentiment across Amazon, IMDB, and Yelp review corpora. Full text pre-processing, TF-IDF vectorisation, and comparative model evaluation (Logistic Regression vs. Linear SVM) with cross-domain generalisation testing.
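
TF-IDF vectorisation weights a term by how often it appears in a document, scaled down when the term is common across the corpus. The sketch below illustrates that weighting with the standard library only, assuming whitespace tokenisation, length-normalised term frequency, and a smoothed IDF of ln((1 + N) / (1 + df)) + 1. The `tfidf` helper and the three reviews are hypothetical and do not come from the project's notebooks.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document using a smoothed IDF."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency (share of the document) times smoothed IDF.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


# Toy reviews (invented for illustration).
reviews = ["great product great price", "terrible product", "great film"]
vectors = tfidf(reviews)
for review, vector in zip(reviews, vectors):
    print(review, "->", sorted(vector, key=vector.get, reverse=True))
```

In the second review, "terrible" outweighs "product" because "product" also appears in another review. A classifier such as Logistic Regression or a Linear SVM would then be trained on vectors of this kind.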

Python · NLP · NLTK


🛠️ Tech Stack

Languages

Python · R · SQL

Machine Learning & Data Science

Scikit-learn · Pandas · NumPy · Apache Spark · NLTK · imbalanced-learn

Visualisation & Business Intelligence

Power BI · Matplotlib · Seaborn · Excel

Platforms & Tools

Databricks · Jupyter · Google Colab · SQL Server · Git · GitHub · SAP · Power Automate · Miro

Statistical Tools

SPSS · STATA


📚 Education

| Degree | Institution | Year |
|---|---|---|
| 🎓 MSc Data Science | University of Salford, Manchester | 2025 – 2026 |
| 🎓 BSc Economics & Statistics | Kenyatta University, Nairobi | 2015 – 2019 |

🏅 Certifications

Google · ALX · SPSS · STATA


📈 GitHub Stats

[Images: Ruth's GitHub Stats, Top Languages, GitHub Trophies]

💼 Professional Timeline

Jan 2026 – Present  │  Data Science Intern    @ Marie Curie              │  ML · SHAP · CRISP-DM · 500K records
Jul 2025 – Present  │  Consultant Analyst     @ Yordas Limited            │  Python · Web Scraping · Regulatory ETL
Sep 2023 – Dec 2024 │  Data Analyst           @ Westside Distillers       │  SQL · Power BI · +30% market share
Jul 2019 – Aug 2023 │  Data Analyst           @ Hasbah Kenya Ltd          │  Big Data · Forecasting · 90% accuracy

⭐ Found a project useful? A star means the world!
📫 Open to Data Analyst & Data Scientist roles — let's connect on LinkedIn!

Popular repositories

  1. Data-Science-Portfolio (Public)

    Jupyter Notebook

  2. Obesity-Level-Classification-ML (Public)

    This project aims to predict the obesity level of individuals based on their eating habits, lifestyle and physical conditions.

    Jupyter Notebook

  3. Online-Shopping-Intention-Clustering (Public)

    Customer segmentation project using clustering techniques to group online shoppers based on behavior and purchasing patterns.

    Jupyter Notebook

  4. sentiment-analysis-reviews (Public)

    Built a cross-domain sentiment analysis model using Amazon, IMDB, and Yelp reviews, analysing model performance and generalisation across different text domains.

    Jupyter Notebook

  5. Steam-Recommender-System (Public)

    Built a scalable recommendation engine using PySpark ALS, analysing user play behaviour to generate personalised game recommendations with optimised model performance.

    Jupyter Notebook

  6. global-population-dashboard-powerbi (Public)

    Interactive Power BI dashboard analysing global population trends (1960–2050) with projections, regional comparisons and demographic insights.