This project implements and compares three different machine learning models for multi-class classification of student academic performance. The goal is to predict student performance categories (Poor, Average, Good, Excellent) based on various demographic, academic, and personal factors.
The project demonstrates:
- Building and training multiple classification models (Logistic Regression, SVM, MLP)
- Hyperparameter tuning using GridSearchCV
- Model evaluation with cross-validation
- Performance comparison and analysis
- Handling imbalanced datasets
Key objectives:
- Train and evaluate three different multi-class classifiers
- Perform hyperparameter tuning to optimize model performance
- Use 5-fold cross-validation for robust model evaluation
- Visualize model performance using confusion matrices
- Analyze classification results and identify model strengths/weaknesses
- Provide recommendations for future improvements
- File: performance.csv
- Total Records: 1,009 student records
- Features: 33 features (mix of categorical and numerical)
- Target Variable: Student performance (4 classes)
| Class | Count | Percentage |
|---|---|---|
| Poor | 503 | 49.9% |
| Average | 272 | 27.0% |
| Good | 178 | 17.6% |
| Excellent | 56 | 5.5% |
Note: The dataset exhibits significant class imbalance, with "Excellent" being highly underrepresented.
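Once the CSV is loaded, a distribution like the table above comes from `value_counts`; in this sketch the counts are hard-coded for illustration, since the real label column name is not reproduced here:

```python
import pandas as pd

# Hard-coded counts from the class-distribution table; with the real data
# this would be df["<target column>"].value_counts()
labels = pd.Series(["Poor"] * 503 + ["Average"] * 272
                   + ["Good"] * 178 + ["Excellent"] * 56)

print(labels.value_counts())                  # absolute counts per class
print(labels.value_counts(normalize=True))    # class proportions
```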
The dataset includes:
- Demographic: Gender, Age, Program, Admission Year
- Academic Metrics: SGPA, CGPA, Study Hours, Attendance, Credits Earned
- Support Factors: Scholarship Status, Smartphone Access, PC Access, Probation Status
- Personal Factors: Health Issues, Physical Disabilities, Relationship Status, Part-time Work
- Academic Background: Skills and Interest Areas
Logistic Regression:
- Algorithm: Multi-class logistic regression with the SAGA solver
- Tuning Parameter: Regularization strength (C)
- Tuning Range: [0.01, 0.1, 1, 10, 100]
Support Vector Machine (SVM):
- Algorithm: SVC with RBF kernel
- Tuning Parameter: Regularization parameter (C)
- Tuning Range: [0.01, 0.1, 1, 10, 100]
Multi-Layer Perceptron (MLP):
- Algorithm: Neural network classifier (MLPClassifier)
- Tuning Parameter: Hidden layer architecture
- Tuning Options: [(128,), (64,64), (128,64), (128,64,32)]
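Taken together, the three grids above map directly onto GridSearchCV. A minimal sketch on synthetic stand-in data (the `make_classification` toy set, iteration caps, and random seeds are assumptions, not values from the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed student-performance features
X, y = make_classification(n_samples=200, n_classes=4, n_informative=6,
                           random_state=42)

searches = {
    "Logistic Regression": GridSearchCV(
        LogisticRegression(solver="saga", max_iter=2000),
        {"C": [0.01, 0.1, 1, 10, 100]}, cv=5),
    "SVM": GridSearchCV(
        SVC(kernel="rbf"),
        {"C": [0.01, 0.1, 1, 10, 100]}, cv=5),
    "MLP": GridSearchCV(
        MLPClassifier(max_iter=150, random_state=42),
        {"hidden_layer_sizes": [(128,), (64, 64), (128, 64), (128, 64, 32)]},
        cv=5),
}

# 5-fold CV over each grid; best_params_ holds the winning setting
for name, search in searches.items():
    search.fit(X, y)
    print(name, search.best_params_, round(search.best_score_, 3))
```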
- Missing Value Handling: Addressed missing values in Skills1 (1 missing) and Interest_Area1 (7 missing)
- Categorical Encoding: One-hot encoding for categorical variables
- Feature Scaling: StandardScaler normalization for numerical features
- Train-Test Split: 70/30 split (700 training samples, 301 testing samples)
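The preprocessing steps above can be sketched with a ColumnTransformer; the column names and tiny frame below are illustrative, not taken from performance.csv:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative frame; the real dataset has 33 features
df = pd.DataFrame({
    "Gender": ["M", "F", "F", "M", "F", "M"],
    "CGPA": [2.1, 3.4, 3.9, 2.8, 3.1, 2.5],
    "Study_Hours": [5, 12, 20, 8, 10, 6],
    "Performance": ["Poor", "Good", "Excellent", "Average", "Good", "Poor"],
})

X = df.drop(columns="Performance")
y = df["Performance"]

# One-hot encode categoricals, standardize numericals
pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
    ("num", StandardScaler(), ["CGPA", "Study_Hours"]),
])

# 70/30 split; the transformer is fit on training data only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
X_train_t = pre.fit_transform(X_train)
X_test_t = pre.transform(X_test)
print(X_train_t.shape, X_test_t.shape)
```

Fitting the scaler and encoder on the training split alone avoids leaking test-set statistics into training.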
- Baseline Training: Initial model training with default parameters
- Hyperparameter Tuning: GridSearchCV with 5-fold cross-validation
- Performance Metrics: Accuracy, Precision, Recall, F1-score
- Visualization: Confusion matrices for detailed performance analysis
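A sketch of this evaluation step on synthetic data (the toy dataset and single model are illustrative, and the confusion matrix is printed here rather than plotted):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_classes=4, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Per-class precision/recall/F1 plus overall accuracy in one report
print(classification_report(y_te, y_pred))

# Rows = true classes, columns = predicted classes
cm = confusion_matrix(y_te, y_pred)
print(cm)
```

With matplotlib available, `ConfusionMatrixDisplay.from_predictions(y_te, y_pred)` renders the same matrix as a heatmap.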
Baseline results (default parameters):

| Model | Accuracy | Key Observations |
|---|---|---|
| Logistic Regression | 53% | Strong bias toward "Poor" class |
| SVM | 53% | Severe class bias, zero recall for minority classes |
| MLP | 68% | Best performance, but struggles with "Excellent" class |
Results after hyperparameter tuning:

| Model | Accuracy | Best Parameters | Change |
|---|---|---|---|
| Logistic Regression | 53% | C=0.01 | No improvement |
| SVM | 53% | C=0.01 | No change |
| MLP | 63% | hidden_layer_sizes=(128, 64) | Slight decrease |
- MLP emerged as the best-performing model with 68% accuracy before tuning
- The "Excellent" class had the worst performance across all models (often zero recall/precision)
- Class imbalance significantly affected minority-class prediction
- Overlapping feature distributions between classes caused confusion
- Hyperparameter tuning did not significantly improve performance, suggesting fundamental data challenges
Requires Python 3.12.6 or higher.

Install dependencies:

```
pip install pandas numpy scikit-learn matplotlib jupyter
```

Or install from the requirements file:

```
pip install -r requirements.txt
```

1. Clone this repository:

   ```
   git clone https://github.com/DataDarling/Multi-Class-Classification-and-Model-Tuning.git
   cd Multi-Class-Classification-and-Model-Tuning
   ```

2. Launch Jupyter Notebook:

   ```
   jupyter notebook
   ```

3. Open `Multi-Class Classification and Model Tuning.ipynb`
4. Run all cells sequentially to reproduce the analysis
```
Multi-Class-Classification-and-Model-Tuning/
│
├── Multi-Class Classification and Model Tuning.ipynb   # Main analysis notebook
├── performance.csv                                     # Dataset (not included in repo)
└── README.md                                           # Project documentation
```
- pandas: Data manipulation and analysis
- numpy: Numerical computations
- scikit-learn: Machine learning algorithms and tools
  - `LogisticRegression`: Logistic regression classifier
  - `SVC`: Support vector classifier
  - `MLPClassifier`: Multi-layer perceptron classifier
  - `GridSearchCV`: Hyperparameter tuning
  - `StandardScaler`: Feature scaling
  - `train_test_split`: Data splitting
- matplotlib: Data visualization
Based on the analysis, the following improvements are recommended:
1. Address Class Imbalance:
   - Implement SMOTE (Synthetic Minority Over-sampling Technique)
   - Use class-weight adjustments
   - Try undersampling majority classes
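Of these options, class-weight adjustment needs no extra packages (SMOTE lives in the separate imbalanced-learn library). A sketch on a toy set skewed roughly like the real class distribution:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Imbalanced 4-class toy set mimicking the Poor/Average/Good/Excellent skew
X, y = make_classification(n_samples=600, n_classes=4, n_informative=6,
                           weights=[0.50, 0.27, 0.18, 0.05], random_state=1)

plain = LogisticRegression(max_iter=1000).fit(X, y)
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X, y)

# Macro recall treats each class equally, so minority-class gains show up
print("plain   :", round(recall_score(y, plain.predict(X), average="macro"), 3))
print("balanced:", round(recall_score(y, weighted.predict(X), average="macro"), 3))
```

`class_weight="balanced"` reweights each class inversely to its frequency, which typically trades a little majority-class accuracy for better minority-class recall.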
2. Feature Engineering:
   - Create interaction features
   - Perform feature selection to reduce noise
   - Engineer domain-specific features
3. Try Advanced Models:
   - Ensemble methods (Random Forest, XGBoost, LightGBM)
   - Deep learning architectures with dropout and regularization
   - Voting classifiers combining multiple models
4. Alternative Evaluation Strategies:
   - Use stratified sampling for better class representation
   - Focus on macro-averaged metrics for imbalanced data
   - Implement cost-sensitive learning
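A sketch combining stratified folds with a macro-averaged score on a skewed toy set (all values below are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=600, n_classes=4, n_informative=6,
                           weights=[0.50, 0.27, 0.18, 0.05], random_state=1)

# Stratified folds keep the rare class represented in every split;
# macro F1 weights all four classes equally regardless of size
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="f1_macro")
print(scores.mean().round(3))
```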
This project is open source and available for educational purposes.
DataDarling
- GitHub: @DataDarling
Contributions, issues, and feature requests are welcome! Feel free to check the issues page.
Give a βοΈ if this project helped you!