arpit2006/GitHub-Developer-Popularity-Tier-Predictor

GitHub Developer Impact Tier Predictor

A Machine Learning-powered web application that predicts a GitHub developer's Impact Tier (Beginner, Advanced, or Elite) using repository activity, stars, forks, language diversity, and account statistics.

The application fetches real-time GitHub data using the GitHub REST API and uses an XGBoost model to classify developers into impact tiers.
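As a rough illustration of the data-fetching step, the sketch below reads a user's profile and public repositories from the GitHub REST API using only the standard library. The helper names (`user_url`, `repos_url`, `get_json`, `fetch_profile`) are hypothetical and are not taken from `app.py`, which may well use a different HTTP client.

```python
import json
import urllib.request
from typing import Dict, List, Optional, Tuple
from urllib.parse import quote

API_ROOT = "https://api.github.com"
PER_PAGE = 100  # largest page size the repositories endpoint accepts


def user_url(username: str) -> str:
    """URL of the public profile endpoint for one user."""
    return f"{API_ROOT}/users/{quote(username, safe='')}"


def repos_url(username: str, page: int = 1) -> str:
    """URL of one page of the user's public repositories."""
    return f"{user_url(username)}/repos?per_page={PER_PAGE}&page={page}"


def get_json(url: str, headers: Optional[Dict[str, str]] = None):
    """GET a URL and decode the JSON response body."""
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def fetch_profile(username: str,
                  headers: Optional[Dict[str, str]] = None) -> Tuple[dict, List[dict]]:
    """Return (user, repos), requesting pages until a short page arrives."""
    user = get_json(user_url(username), headers)
    repos: List[dict] = []
    page = 1
    while True:
        batch = get_json(repos_url(username, page), headers)
        repos.extend(batch)
        if len(batch) < PER_PAGE:
            break
        page += 1
    return user, repos
```

Calling `fetch_profile("octocat")` performs live network requests; unauthenticated calls are rate-limited, so passing an `Authorization` header is usually worthwhile.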


📌 Overview

GitHub profiles contain valuable signals about a developer's open-source impact. This project analyzes a user's repositories, calculates impact-related metrics, and predicts their GitHub Impact Tier using Machine Learning.

The model evaluates repository popularity, developer engagement, language diversity, and account activity to estimate a developer's GitHub impact level.


✨ Features

  • 🔍 Analyze any public GitHub profile
  • 🤖 Machine Learning-based tier prediction
  • 📊 Real-time GitHub API integration
  • 📈 Confidence score visualization
  • 🌐 Interactive Streamlit dashboard
  • ⚡ XGBoost-powered predictions
  • 📋 Detailed GitHub profile insights

🛠️ Tech Stack

Machine Learning

  • Python
  • Scikit-Learn
  • XGBoost
  • Pandas
  • NumPy

Visualization

  • Plotly
  • Matplotlib

Deployment

  • Streamlit

Data Source

  • GitHub REST API

📊 Feature Engineering

The model uses the following GitHub metrics:

| Feature | Description |
|---|---|
| Following | Number of users followed |
| Public Repositories | Total public repositories |
| Total Stars | Sum of stars across repositories |
| Total Forks | Sum of forks across repositories |
| Language Diversity | Number of unique programming languages used |
| Account Age (Days) | GitHub account age |
| Stars per Repository | Average stars per repository |
| Forks per Repository | Average forks per repository |
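One possible way to derive these eight features from raw API responses is sketched below. It assumes the `user` and `repos` objects have the shape returned by the GitHub REST API (fields `following`, `public_repos`, `created_at`, `stargazers_count`, `forks_count`, `language`); the function name and the output key names are illustrative, not the column names used in `github_users.csv`.

```python
from datetime import datetime, timezone
from typing import List, Optional


def build_features(user: dict, repos: List[dict],
                   now: Optional[datetime] = None) -> dict:
    """Compute the eight model features from one profile (hypothetical keys)."""
    now = now or datetime.now(timezone.utc)
    # GitHub timestamps look like "2011-01-25T18:44:36Z" (UTC).
    created = datetime.strptime(
        user["created_at"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)

    total_stars = sum(r.get("stargazers_count", 0) for r in repos)
    total_forks = sum(r.get("forks_count", 0) for r in repos)
    # Repositories without a detected language report None and are skipped.
    languages = {r["language"] for r in repos if r.get("language")}

    public_repos = user.get("public_repos", len(repos))
    denominator = max(public_repos, 1)  # avoid division by zero

    return {
        "following": user.get("following", 0),
        "public_repos": public_repos,
        "total_stars": total_stars,
        "total_forks": total_forks,
        "language_diversity": len(languages),
        "account_age_days": (now - created).days,
        "stars_per_repo": total_stars / denominator,
        "forks_per_repo": total_forks / denominator,
    }
```

The `now` parameter exists only so the account age can be computed reproducibly; in the running app it would default to the current time.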

🎯 Target Variable

A custom GitHub Impact Score is calculated as:

impact_score = total_stars + 2 * total_forks

Forks are weighted twice as heavily as stars because a fork signals deeper developer engagement and repository adoption than a star does.

The impact score is split into three roughly equal-sized tiers (terciles) using Pandas qcut():

Beginner
Advanced
Elite
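A minimal sketch of the labelling step, assuming a DataFrame with `total_stars` and `total_forks` columns (the column names and the sample values are made up for illustration):

```python
import pandas as pd

TIERS = ["Beginner", "Advanced", "Elite"]  # lowest to highest tercile


def add_impact_tier(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with impact_score and impact_tier columns added."""
    out = df.copy()
    out["impact_score"] = out["total_stars"] + 2 * out["total_forks"]
    # q=3 cuts at the 1/3 and 2/3 quantiles, giving near-equal group sizes.
    out["impact_tier"] = pd.qcut(out["impact_score"], q=3, labels=TIERS)
    return out


users = pd.DataFrame({
    "total_stars": [0, 3, 10, 40, 200, 900],
    "total_forks": [0, 1, 2, 10, 50, 300],
})
labelled = add_impact_tier(users)
print(labelled[["impact_score", "impact_tier"]])
```

When many users share the same score (for example, a large group with zero stars and forks), the quantile edges can coincide and `qcut` raises an error, so real data may need tie handling before this step.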

🤖 Models Evaluated

Decision Tree Classifier

  • Accuracy: 97.5%

Random Forest Classifier

  • Accuracy: 94.5%

XGBoost Classifier ⭐

  • Accuracy: 99.0%

Cross-Validation

5-fold cross-validation accuracy: 96.88%

📈 Why Is The Accuracy So High?

The target variable (Impact Tier) is generated using repository impact metrics:

impact_score = total_stars + 2 * total_forks

The model is trained using features such as:

  • Total Stars
  • Total Forks
  • Stars per Repository
  • Forks per Repository
  • Public Repositories
  • Language Diversity

Because the impact score is computed directly from Total Stars and Total Forks, and both are also model inputs, the tier label is a deterministic function of the features. The classification problem is therefore highly learnable for tree-based models such as XGBoost, which only need to approximate the tier boundaries.

The reported accuracy thus measures how well the model recovers the engineered Impact Tier from GitHub activity metrics. It does not measure the ability to predict subjective qualities such as developer skill or experience.
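The effect can be reproduced on synthetic data. The sketch below uses randomly generated star and fork counts (not the project's dataset) and a plain decision tree rather than XGBoost; it compares 5-fold accuracy when the features determine the label against a control where the rows are shuffled so that they no longer do.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1500

# Made-up, heavy-tailed star and fork counts standing in for real profiles.
X = pd.DataFrame({
    "total_stars": rng.lognormal(mean=3.0, sigma=1.5, size=n).round(),
    "total_forks": rng.lognormal(mean=2.0, sigma=1.5, size=n).round(),
})

# Label built exactly as in the README: score, then terciles (codes 0, 1, 2).
score = X["total_stars"] + 2 * X["total_forks"]
y = pd.qcut(score, q=3, labels=False)

tree = DecisionTreeClassifier(random_state=0)
aligned_acc = cross_val_score(tree, X, y, cv=5).mean()

# Control: permute the feature rows so they no longer match their labels.
X_shuffled = X.sample(frac=1.0, random_state=0).reset_index(drop=True)
shuffled_acc = cross_val_score(tree, X_shuffled, y, cv=5).mean()

print(f"features determine label: {aligned_acc:.3f}")
print(f"features shuffled:        {shuffled_acc:.3f}")
```

The first score should be high and the second close to chance for three balanced classes, which is the pattern expected when the label is derived from the inputs.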


🔄 Project Workflow

  1. Collect GitHub user data using the GitHub REST API
  2. Fetch repository statistics
  3. Perform feature engineering
  4. Calculate GitHub Impact Score
  5. Generate tier labels using qcut()
  6. Preprocess features using Scikit-Learn Pipeline
  7. Train multiple classification models
  8. Evaluate model performance
  9. Save trained models using Joblib
  10. Deploy using Streamlit
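Steps 6 to 9 might look roughly like the sketch below. It is an assumption-laden outline, not the contents of `model.ipynb`: the preprocessing pipeline is assumed to be a single `StandardScaler`, the training data is random, and scikit-learn's `GradientBoostingClassifier` is swapped in when `xgboost` is not installed. Only the three artifact file names come from the project structure.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

try:
    from xgboost import XGBClassifier as Booster
except ImportError:  # stand-in so the sketch runs without xgboost
    from sklearn.ensemble import GradientBoostingClassifier as Booster

# Made-up training data: columns are total_stars and total_forks.
rng = np.random.default_rng(0)
X = rng.integers(0, 500, size=(90, 2)).astype(float)
score = X[:, 0] + 2 * X[:, 1]
tiers = np.where(score < 500, "Beginner",
                 np.where(score < 1000, "Advanced", "Elite"))

# Step 6: preprocessing (assumed here to be scaling only).
preprocess = Pipeline([("scale", StandardScaler())]).fit(X)
# Tier names become integer codes; LabelEncoder orders them alphabetically,
# which is not the Beginner < Advanced < Elite order.
encoder = LabelEncoder().fit(tiers)

# Step 7: train the classifier on the transformed features.
model = Booster(n_estimators=50, random_state=0)
model.fit(preprocess.transform(X), encoder.transform(tiers))
original = encoder.inverse_transform(model.predict(preprocess.transform(X[:5])))

# Step 9: save the three artifacts with Joblib, then reload and predict.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    joblib.dump(preprocess, out / "pipeline.pkl")
    joblib.dump(encoder, out / "label_encoder.pkl")
    joblib.dump(model, out / "xgb_model.pkl")

    loaded_pre = joblib.load(out / "pipeline.pkl")
    loaded_enc = joblib.load(out / "label_encoder.pkl")
    loaded_model = joblib.load(out / "xgb_model.pkl")
    restored = loaded_enc.inverse_transform(
        loaded_model.predict(loaded_pre.transform(X[:5]))
    )

print(list(restored))
```

Saving the preprocessing step and the label encoder alongside the model lets the Streamlit app apply exactly the same transformations at prediction time that were used in training.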

📁 Project Structure

GitHub-Developer-Impact-Tier-Predictor/
│
├── app.py
├── model.ipynb
├── github_users.csv
├── xgb_model.pkl
├── pipeline.pkl
├── label_encoder.pkl
├── requirements.txt
├── .env
├── README.md
│
└── assets/

⚙️ Installation

Clone Repository

git clone https://github.com/arpit2006/GitHub-Developer-Popularity-Tier-Predictor.git

cd GitHub-Developer-Popularity-Tier-Predictor

Install Dependencies

pip install -r requirements.txt

Create Environment Variables

Create a .env file:

GITHUB_TOKEN=your_github_personal_access_token
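How `app.py` reads this file is not shown here. A minimal sketch, assuming the variable has already been loaded into the process environment (for instance by a dotenv loader or by exporting it in the shell), could build the request headers like this; `github_headers` is a hypothetical helper name:

```python
import os
from typing import Dict, Mapping, Optional


def github_headers(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build GitHub API headers, adding auth only when a token is set."""
    env = os.environ if env is None else env
    headers = {"Accept": "application/vnd.github+json"}
    token = env.get("GITHUB_TOKEN", "").strip()
    if token:
        # Personal access tokens are sent as a bearer credential.
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

Without a token the function still returns usable headers, so the app can fall back to unauthenticated requests at a much lower rate limit.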

▶️ Run Locally

streamlit run app.py

🎓 Learning Outcomes

  • Data Collection using APIs
  • Feature Engineering
  • Classification Problems
  • XGBoost Modeling
  • Cross Validation
  • Model Evaluation
  • Streamlit Deployment
  • Real-Time Data Processing
  • Model Serialization using Joblib

🚀 Future Improvements

  • GitHub contribution analysis
  • Developer comparison dashboard
  • Organization-level analysis
  • Cloud deployment
  • Automated retraining pipeline
  • Advanced feature selection techniques

👨‍💻 Author

Arpit Shirbhate

Machine Learning • Data Science • Open Source Contributor


⭐ If you found this project useful, consider giving it a star.
