A comparative study of SVM, Kernel Methods, and Logistic Regression for wine quality prediction.
Three classification algorithms implemented from scratch, using only NumPy:
- Linear Support Vector Machine (SVM)
- Kernelized SVM with RBF kernel
- Logistic Regression
The models classify wine quality (good/poor) based on physicochemical properties.
- Source: UCI Machine Learning Repository - Wine Quality Dataset
- Combined red and white wine samples (6,497 instances)
- 11 physicochemical features
- Binary classification: quality ≥ 6 (+1), quality ≤ 5 (-1)
- From-scratch implementations (no sklearn for core algorithms)
- Validation: stratified 5-fold CV with strict train/test separation; normalization statistics come from the training folds only, so no test data leaks into training
- Automatic hyperparameter tuning: Random search with CV
- Class balancing: sample weights to handle class imbalance (4,113 good-quality wines vs. 2,384 poor-quality wines)
- Evaluation metrics: accuracy, precision, recall, and F1-score; F1-score was the main criterion for tuning
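The labelling rule and the stratified split described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's actual code: the function names `binarize_quality` and `stratified_folds`, the `seed` parameter, and the round-robin way of dealing each class across folds are all assumptions.

```python
import numpy as np

def binarize_quality(quality, threshold=6):
    """Map quality scores to +1 (quality >= threshold) or -1 (otherwise)."""
    quality = np.asarray(quality)
    return np.where(quality >= threshold, 1, -1)

def stratified_folds(y, n_folds=5, seed=0):
    """Return one index array per fold, keeping class ratios roughly equal.

    Hypothetical helper: each class is shuffled separately and then
    split as evenly as possible across the folds.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    folds = [[] for _ in range(n_folds)]
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        # Deal this class's shuffled indices out across the folds.
        for k, chunk in enumerate(np.array_split(idx, n_folds)):
            folds[k].extend(chunk.tolist())
    return [np.sort(np.array(f, dtype=int)) for f in folds]
```

With the class counts listed above (4,113 positive, 2,384 negative), this split puts 822 or 823 positives and 476 or 477 negatives in each fold, so every fold keeps roughly the same class ratio as the full dataset.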
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Linear SVM | 72.52% | 74.59% | 84.98% | 79.45% |
| RBF SVM | 75.60% | 81.94% | 78.20% | 80.03% |
| Logistic Regression | 71.75% | 74.92% | 82.39% | 78.48% |
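For reference, the four metrics in the table can be computed from ±1 labels as in the sketch below. The function names are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than something the project states.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels in {-1, +1}."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == -1))
    fn = np.sum((y_pred == -1) & (y_true == 1))
    tn = np.sum((y_pred == -1) & (y_true == -1))
    accuracy = float((tp + tn) / len(y_true))
    # Assumed convention: a metric with an empty denominator is reported as 0.
    precision = float(tp / (tp + fp)) if tp + fp > 0 else 0.0
    recall = float(tp / (tp + fn)) if tp + fn > 0 else 0.0
    f1 = f1_from_precision_recall(precision, recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

As a consistency check, `f1_from_precision_recall(0.7459, 0.8498)` gives about 0.7945, which matches the Linear SVM row.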
- RBF SVM: Best overall performance (F1: 80.03%, Accuracy: 75.60%)
  - Highest precision (81.94%)
  - Slowest training and prediction
  - Initially underfit, classifying every sample as +1; resolved with class balancing
- Linear SVM: Best speed-accuracy trade-off (F1: 79.45%)
  - Highest recall (84.98%)
  - Recommended for production
  - Most stable performance, with no sign of underfitting or overfitting
- Logistic Regression: Comparable performance (F1: 78.48%)
  - Provides probabilistic outputs
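The class balancing that stopped the RBF SVM from predicting +1 for every sample can be illustrated with per-sample weights inside a hinge loss. The sketch below is one plausible form, not the project's code: the inverse-frequency weighting `n / (n_classes * n_c)`, the regularization parameter `lam`, and both function names are assumptions, and a linear model is used only to keep the example short.

```python
import numpy as np

def balanced_sample_weights(y):
    """One weight per sample, inversely proportional to its class frequency.

    Assumed scheme: weight_c = n / (n_classes * n_c), so that each class
    contributes the same total weight to the loss.
    """
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    class_weight = len(y) / (len(classes) * counts)
    weight_of = dict(zip(classes.tolist(), class_weight.tolist()))
    return np.array([weight_of[label] for label in y.tolist()])

def weighted_hinge_loss(w, b, X, y, sample_weight, lam=0.01):
    """L2-regularized hinge loss with per-sample weights (linear model)."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    data_term = np.sum(sample_weight * hinge) / np.sum(sample_weight)
    return 0.5 * lam * np.dot(w, w) + data_term
```

Under this scheme, with 4,113 positive and 2,384 negative samples, each positive sample gets a weight of about 0.79 and each negative sample about 1.36, so mistakes on the minority class cost more and predicting +1 everywhere is no longer a cheap solution.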
| Issue | Solution |
|---|---|
| RBF underfitting (classified all as +1) | Added class weights |
| Class imbalance (4113 +1 / 2384 -1) | Sample weights in loss functions |
| Hyperparameter sensitivity | Random search + 5-fold CV (30 trials) |
| Slow cross-validation due to repeated normalization | Pre-calculated fold statistics (mean/std) once, reused for all train/test splits |
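A random search of the kind listed in the table can be sketched as follows. The `evaluate` callback stands in for "train with these hyperparameters and return the mean 5-fold CV F1-score"; that callback, the function name `random_search`, and the log-uniform sampling of each range are assumptions made for illustration.

```python
import numpy as np

def random_search(evaluate, param_space, n_trials=30, seed=0):
    """Sample hyperparameters at random and keep the best-scoring set.

    evaluate:    callable taking a dict of hyperparameters and returning a
                 score to maximize (for example, mean CV F1-score).
    param_space: dict mapping a name to a (low, high) range; values are
                 drawn log-uniformly, which is an assumed choice.
    """
    rng = np.random.default_rng(seed)
    best_params, best_score = None, -np.inf
    for _ in range(n_trials):
        params = {
            name: float(10 ** rng.uniform(np.log10(low), np.log10(high)))
            for name, (low, high) in param_space.items()
        }
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Log-uniform sampling is a common choice for scale-type hyperparameters such as a regularization strength or an RBF kernel width, because useful values often span several orders of magnitude.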
Performance issue: During hyperparameter tuning with 5-fold CV, re-normalizing every fold from scratch was extremely time-consuming for both SVM models.
Solution: Statistics for all 5 folds were pre-computed once:
- Divided the data into folds only once, before hyperparameter tuning
- Calculated the mean and standard deviation of each fold
- For each CV iteration, combined the statistics of the 4 training folds using weighted averages
- Applied the same combined training statistics to normalize the held-out test fold
- Eliminated redundant calculations, significantly reducing runtime
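The steps above can be sketched as follows. To make the combination exact, this version stores each fold's mean and mean of squares, not its standard deviation, and recovers the pooled standard deviation from the size-weighted averages of both. Whether the project combines statistics in exactly this way is an assumption, as are all function names.

```python
import numpy as np

def fold_statistics(X, folds):
    """Per-fold (sample count, feature means, feature means of squares)."""
    return [(len(idx), X[idx].mean(axis=0), (X[idx] ** 2).mean(axis=0))
            for idx in folds]

def pooled_train_stats(stats, test_fold):
    """Combine the cached statistics of every fold except `test_fold`."""
    train = [s for k, s in enumerate(stats) if k != test_fold]
    n = np.array([s[0] for s in train], dtype=float)
    weights = n / n.sum()  # weight each fold by its share of the samples
    mean = sum(w * s[1] for w, s in zip(weights, train))
    mean_sq = sum(w * s[2] for w, s in zip(weights, train))
    # Population variance from pooled moments: E[x^2] - E[x]^2.
    std = np.sqrt(np.maximum(mean_sq - mean ** 2, 0.0))
    std[std == 0] = 1.0  # guard against constant features
    return mean, std

def normalize_split(X, folds, stats, test_fold):
    """Normalize train and test parts with training-fold statistics only."""
    mean, std = pooled_train_stats(stats, test_fold)
    train_idx = np.concatenate(
        [idx for k, idx in enumerate(folds) if k != test_fold])
    X_train = (X[train_idx] - mean) / std
    X_test = (X[folds[test_fold]] - mean) / std
    return X_train, X_test, train_idx
```

`fold_statistics` runs once before tuning. Every trial and CV iteration then calls `normalize_split`, which combines five small cached vectors and never recomputes means and standard deviations over the full training data.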
For detailed methodology, mathematical formulations, pseudocode, and complete analysis, see the project report.