A Machine Learning-powered web application that predicts a GitHub developer's Impact Tier (Beginner, Advanced, or Elite) using repository activity, stars, forks, language diversity, and account statistics.
The application fetches real-time GitHub data using the GitHub REST API and uses an XGBoost model to classify developers into impact tiers.
GitHub profiles contain valuable signals about a developer's open-source impact. This project analyzes a user's repositories, calculates impact-related metrics, and predicts their GitHub Impact Tier using Machine Learning.
The model evaluates repository popularity, developer engagement, language diversity, and account activity to estimate a developer's GitHub impact level.
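As a rough illustration of the data fetching described above, the sketch below builds (but does not send) requests for a user's profile and repository list. The helper names `build_request`, `profile_request`, and `repos_request` are hypothetical and not taken from `app.py`. The use of `urllib` and of 100 repositories per page is an assumption made for the example.

```python
import os
import urllib.request

API_ROOT = "https://api.github.com"


def build_request(path, token=None):
    """Build an unsent GitHub REST API request (hypothetical helper)."""
    headers = {"Accept": "application/vnd.github+json"}
    # Fall back to the GITHUB_TOKEN environment variable when no token is passed.
    token = token or os.environ.get("GITHUB_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(f"{API_ROOT}{path}", headers=headers)


def profile_request(username, token=None):
    """Request for the user's profile (followers, public_repos, created_at, ...)."""
    return build_request(f"/users/{username}", token)


def repos_request(username, page=1, token=None):
    """Request for one page of the user's public repositories."""
    return build_request(f"/users/{username}/repos?per_page=100&page={page}", token)
```

Each request can then be passed to `urllib.request.urlopen` and the response decoded with `json.load`. Authenticated requests get a higher rate limit than anonymous ones, which is why the token is worth setting.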
- 🔍 Analyze any public GitHub profile
- 🤖 Machine Learning-based tier prediction
- 📊 Real-time GitHub API integration
- 📈 Confidence score visualization
- 🌐 Interactive Streamlit dashboard
- ⚡ XGBoost-powered predictions
- 📋 Detailed GitHub profile insights
- Python
- Scikit-Learn
- XGBoost
- Pandas
- NumPy
- Plotly
- Matplotlib
- Streamlit
- GitHub REST API
The model uses the following GitHub metrics:
| Feature | Description |
|---|---|
| Following | Number of users followed |
| Public Repositories | Total public repositories |
| Total Stars | Sum of stars across repositories |
| Total Forks | Sum of forks across repositories |
| Language Diversity | Number of unique programming languages used |
| Account Age (Days) | GitHub account age |
| Stars per Repository | Average stars per repository |
| Forks per Repository | Average forks per repository |
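A minimal sketch of how the metrics in the table could be derived from GitHub REST API payloads. The function name `build_features` and the output keys are hypothetical. The input fields (`following`, `public_repos`, `created_at`, `stargazers_count`, `forks_count`, `language`) are those returned by the API. Dividing by `public_repos` (with a floor of 1) for the per-repository averages is an assumption; the project may compute them differently.

```python
from datetime import datetime, timezone


def build_features(user, repos, now=None):
    """Turn a user payload and a list of repo payloads into model features."""
    now = now or datetime.now(timezone.utc)
    # GitHub timestamps look like "2020-01-01T00:00:00Z".
    created = datetime.fromisoformat(user["created_at"].replace("Z", "+00:00"))

    n_repos = user["public_repos"]
    total_stars = sum(r["stargazers_count"] for r in repos)
    total_forks = sum(r["forks_count"] for r in repos)
    # Repositories without a detected language report None and are skipped.
    languages = {r["language"] for r in repos if r.get("language")}

    denom = max(n_repos, 1)  # assumption: avoid division by zero for empty accounts
    return {
        "following": user["following"],
        "public_repos": n_repos,
        "total_stars": total_stars,
        "total_forks": total_forks,
        "language_diversity": len(languages),
        "account_age_days": (now - created).days,
        "stars_per_repo": total_stars / denom,
        "forks_per_repo": total_forks / denom,
    }
```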
A custom GitHub Impact Score is calculated as:
impact_score = total_stars + 2 * total_forks

Forks are weighted twice as heavily as stars because a fork signals deeper engagement and adoption than a star.

The impact score is split into three roughly equal-sized tiers with Pandas qcut(), which bins by quantile:
Beginner
Advanced
Elite

- Accuracy: 97.5%
- Accuracy: 94.5%
- Accuracy: 99.0%
5-Fold Cross Validation Accuracy: 96.88%
The target variable (Impact Tier) is generated using repository impact metrics:
impact_score = total_stars + 2 * total_forks

The model is trained using features such as:
- Total Stars
- Total Forks
- Stars per Repository
- Forks per Repository
- Public Repositories
- Language Diversity
Because the target variable is computed directly from total stars and total forks, both of which are also input features, the classification problem is highly learnable for tree-based models such as XGBoost.
The reported accuracy therefore reflects how closely the GitHub activity metrics track the engineered Impact Tier. It does not measure subjective qualities such as developer skill or experience.
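To make this concrete, the sketch below generates tier labels from the impact score with `pandas.qcut`. The function name `label_tiers` and the sample numbers are illustrative, not taken from `model.ipynb` or `github_users.csv`. Since the label is a deterministic function of two feature columns, a classifier only has to recover the two quantile thresholds.

```python
import pandas as pd

TIERS = ["Beginner", "Advanced", "Elite"]


def label_tiers(df):
    """Add impact_score and a three-way quantile tier to a copy of df."""
    out = df.copy()
    out["impact_score"] = out["total_stars"] + 2 * out["total_forks"]
    # q=3 cuts at the 1/3 and 2/3 quantiles, so tiers are roughly equal in size.
    out["tier"] = pd.qcut(out["impact_score"], q=3, labels=TIERS)
    return out


# Illustrative data only.
sample = pd.DataFrame({
    "total_stars": [0, 3, 10, 25, 80, 400],
    "total_forks": [0, 1, 2, 10, 30, 150],
})
labelled = label_tiers(sample)
```

If many users share the same score (for example, zero), quantile edges can coincide and `qcut` raises an error about non-unique bin edges, so real data may need extra handling.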
- Collect GitHub user data using GitHub API
- Fetch repository statistics
- Perform feature engineering
- Calculate GitHub Impact Score
- Generate tier labels using qcut()
- Preprocess features using Scikit-Learn Pipeline
- Train multiple classification models
- Evaluate model performance
- Save trained models using Joblib
- Deploy using Streamlit
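The training half of the workflow above might look roughly like the following sketch. Several things here are assumptions: the data is synthetic, the function name `train_sketch` and the output file names are hypothetical, and scikit-learn's `GradientBoostingClassifier` is used as a stand-in for the project's XGBoost model so that the example depends only on scikit-learn.

```python
import os

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler


def train_sketch(out_dir, n=300, seed=0):
    """Train on synthetic data, run 5-fold CV, and save the fitted artifacts."""
    rng = np.random.default_rng(seed)
    X = pd.DataFrame({  # synthetic stand-in for the collected user data
        "total_stars": rng.integers(0, 500, n),
        "total_forks": rng.integers(0, 200, n),
        "public_repos": rng.integers(1, 80, n),
    })
    score = X["total_stars"] + 2 * X["total_forks"]
    tiers = pd.qcut(score, q=3, labels=["Beginner", "Advanced", "Elite"]).astype(str)

    encoder = LabelEncoder()  # maps tier names to integer class ids
    y = encoder.fit_transform(tiers)

    model = Pipeline([
        ("scale", StandardScaler()),
        ("clf", GradientBoostingClassifier(random_state=seed)),  # stand-in for XGBoost
    ])
    cv_scores = cross_val_score(model, X, y, cv=5)  # stratified 5-fold accuracy
    model.fit(X, y)

    joblib.dump(model, os.path.join(out_dir, "pipeline_sketch.pkl"))
    joblib.dump(encoder, os.path.join(out_dir, "label_encoder_sketch.pkl"))
    return model, encoder, cv_scores
```

At prediction time, `encoder.inverse_transform(model.predict(features))` turns class ids back into tier names. Replacing the stand-in classifier with `xgboost.XGBClassifier` should require changing only the `"clf"` step.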
GitHub-Developer-Impact-Tier-Predictor/
│
├── app.py
├── model.ipynb
├── github_users.csv
├── xgb_model.pkl
├── pipeline.pkl
├── label_encoder.pkl
├── requirements.txt
├── .env
├── README.md
│
└── assets/
git clone https://github.com/yourusername/github-developer-impact-tier-predictor.git
cd github-developer-impact-tier-predictor

pip install -r requirements.txt

Create a .env file:

GITHUB_TOKEN=your_github_personal_access_token

streamlit run app.py

- Data Collection using APIs
- Feature Engineering
- Classification Problems
- XGBoost Modeling
- Cross Validation
- Model Evaluation
- Streamlit Deployment
- Real-Time Data Processing
- Model Serialization using Joblib
- GitHub contribution analysis
- Developer comparison dashboard
- Organization-level analysis
- Cloud deployment
- Automated retraining pipeline
- Advanced feature selection techniques
Arpit Shirbhate
Machine Learning • Data Science • Open Source Contributor
⭐ If you found this project useful, consider giving it a star.