aminafarabi/wine-quality-classification

Wine Quality Classification

A comparative study of SVM, Kernel Methods, and Logistic Regression for wine quality prediction.

Project Overview

Three classification algorithms implemented from scratch, using only NumPy:

  • Linear Support Vector Machine (SVM)
  • Kernelized SVM with RBF kernel
  • Logistic Regression

The models classify wine quality (good/poor) based on physicochemical properties.
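
As an illustration of what the kernelized model relies on, here is a minimal NumPy sketch of an RBF (Gaussian) kernel matrix. The function name `rbf_kernel` and the default `gamma` are assumptions made for this example, not names or values taken from the project code.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=0.1):
    """Return K with K[i, j] = exp(-gamma * ||X[i] - Z[j]||^2).

    `gamma` is a hypothetical default; in practice it would be tuned.
    """
    # Expand ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x.z to avoid Python loops.
    sq_x = np.sum(X ** 2, axis=1)[:, None]
    sq_z = np.sum(Z ** 2, axis=1)[None, :]
    sq_dist = sq_x + sq_z - 2.0 * (X @ Z.T)
    # Rounding can produce tiny negative distances; clip them to zero.
    sq_dist = np.maximum(sq_dist, 0.0)
    return np.exp(-gamma * sq_dist)

# Toy data standing in for normalized physicochemical features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 4))
K_demo = rbf_kernel(X_demo, X_demo, gamma=0.5)
```

The resulting matrix is symmetric with ones on the diagonal, since every sample is at distance zero from itself.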

Dataset

Key Features

  • From-scratch implementations (no sklearn for core algorithms)
  • Proper validation: stratified 5-fold CV with precomputed normalization statistics, strict train/test separation, and no data leakage
  • Automatic hyperparameter tuning: Random search with CV
  • Class balancing: sample weights to handle imbalance (4,113 good-quality wines vs. 2,384 poor-quality wines)
  • Evaluation metrics: Accuracy, Precision, Recall, and F1-score; F1-score was the main criterion for tuning
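
The class-balancing bullet above can be sketched as follows. The inverse-frequency formula `n_samples / (n_classes * class_count)` is an assumption (a common "balanced" heuristic); the README does not state which weighting scheme the project uses, and the function name is hypothetical.

```python
import numpy as np

def balanced_sample_weights(y):
    """Give each sample a weight inversely proportional to its class frequency."""
    classes, counts = np.unique(y, return_counts=True)
    # Assumed heuristic: n_samples / (n_classes * count of that class).
    class_weight = len(y) / (len(classes) * counts)
    # Map every label to the weight of its class.
    return class_weight[np.searchsorted(classes, y)]

# Label counts taken from the README: 4113 good (+1) and 2384 poor (-1).
y = np.concatenate([np.ones(4113), -np.ones(2384)])
w = balanced_sample_weights(y)
```

With these weights, both classes contribute the same total weight to a weighted loss, so a model can no longer minimize the loss by predicting only the majority class.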

Results Summary

| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Linear SVM | 72.52% | 74.59% | 84.98% | 79.45% |
| RBF SVM | 75.60% | 81.94% | 78.20% | 80.03% |
| Logistic Regression | 71.75% | 74.92% | 82.39% | 78.48% |

  • RBF SVM: Best overall performance (F1: 80.03%, Accuracy: 75.60%)

    • Highest precision (81.94%)
    • Slowest training and prediction
    • Initially underfit (classified every sample as +1); resolved with class balancing
  • Linear SVM: Best speed-accuracy trade-off (F1: 79.45%)

    • Highest recall (84.98%)
    • Recommended for production
    • Most stable performance, with no sign of under- or overfitting
  • Logistic Regression: Comparable performance (F1: 78.48%)

    • Provides probabilistic outputs
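
For reference, the four reported metrics can be computed from predictions with a few lines of NumPy. This is a sketch with hypothetical function names, treating +1 (good quality) as the positive class; the toy labels are illustrative only. The helper `f1_from_pr` also shows that the reported F1 scores follow from the reported precision and recall.

```python
import numpy as np

def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for a binary problem."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    accuracy = np.mean(y_pred == y_true)
    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall, f1_from_pr(precision, recall)

def f1_from_pr(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Toy labels: 2 true positives, 1 false positive, 1 false negative, 1 true negative.
acc, prec, rec, f1 = binary_metrics([1, 1, 1, -1, -1], [1, 1, -1, 1, -1])

# Consistency check against the results table (values as fractions).
f1_linear = f1_from_pr(0.7459, 0.8498)    # Linear SVM
f1_rbf = f1_from_pr(0.8194, 0.7820)       # RBF SVM
f1_logreg = f1_from_pr(0.7492, 0.8239)    # Logistic Regression
```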

Critical Issues Resolved

| Issue | Solution |
|---|---|
| RBF underfitting (classified all as +1) | Added class weights |
| Class imbalance (4113 +1 / 2384 -1) | Sample weights in loss functions |
| Hyperparameter sensitivity | Random search + 5-fold CV (30 trials) |
| Slow cross-validation due to repeated normalization | Pre-calculated fold statistics (mean/std) once, reused for all train/test splits |
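
A random search of this kind might look like the sketch below. The search ranges, the log-uniform sampling, the parameter names `C` and `gamma`, and the `cv_score` callback are all assumptions for illustration; only the 30-trial budget comes from the table above. In the project, `cv_score` would train a model with the sampled parameters and return its mean F1 over the 5 folds.

```python
import numpy as np

def random_search(cv_score, n_trials=30, seed=0):
    """Sample hyperparameters at random and keep the best-scoring set."""
    rng = np.random.default_rng(seed)
    best_params, best_score = None, -np.inf
    for _ in range(n_trials):
        # Log-uniform sampling; both ranges are hypothetical.
        params = {
            "C": 10 ** rng.uniform(-2, 2),
            "gamma": 10 ** rng.uniform(-3, 1),
        }
        score = cv_score(params)  # e.g. mean F1 across the 5 CV folds
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Stand-in objective that peaks at C = 1, gamma = 0.1 (not a real model).
def toy_cv_score(params):
    return -(np.log10(params["C"]) ** 2) - (np.log10(params["gamma"]) + 1) ** 2

best_params, best_score = random_search(toy_cv_score, n_trials=30, seed=1)
```

Sampling on a log scale spends the trial budget evenly across orders of magnitude, which tends to matter for parameters such as a regularization strength or a kernel width.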

Optimization Note: Efficient Cross-Validation

Performance issue: During hyperparameter tuning with 5-fold CV, normalizing each fold from scratch on every iteration was extremely time-consuming for both SVM models.

Solution: Statistics for all 5 folds are computed once and then reused:

  • Divided the data into folds only once, before hyperparameter tuning
  • Calculated the mean and standard deviation of each fold
  • For each CV iteration, combined the statistics of the 4 training folds using weighted averages
  • Applied the same statistics to normalize the test fold
  • Eliminated redundant calculations, significantly reducing runtime
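
The steps above can be sketched as follows. This version is an assumption about how the combination could be done exactly, not the project's code: each fold stores its count, per-feature sum, and per-feature sum of squares, which is equivalent to count-weighted averaging of the per-fold means and mean squares. A weighted average of the per-fold standard deviations alone would ignore differences between the fold means. All function names and the toy data are hypothetical.

```python
import numpy as np

def fold_statistics(X_fold):
    """Per-fold summary: row count, feature sums, feature sums of squares."""
    return len(X_fold), X_fold.sum(axis=0), (X_fold ** 2).sum(axis=0)

def pooled_mean_std(stats):
    """Combine fold summaries into the exact mean/std of their union."""
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    total_sq = sum(s[2] for s in stats)
    mean = total / n
    # Var = E[x^2] - E[x]^2; clip tiny negatives caused by rounding.
    var = np.maximum(total_sq / n - mean ** 2, 0.0)
    return mean, np.sqrt(var)

# Toy data; the folds are split once, before any tuning loop.
rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(100, 3))
folds = np.array_split(rng.permutation(len(X)), 5)
stats = [fold_statistics(X[idx]) for idx in folds]  # computed once, reused

scaled_test_folds = []
for k, test_idx in enumerate(folds):
    # Pool the 4 training folds; the test fold never contributes statistics.
    train_stats = [s for j, s in enumerate(stats) if j != k]
    mean, std = pooled_mean_std(train_stats)
    std = np.where(std == 0, 1.0, std)  # avoid dividing by zero
    scaled_test_folds.append((X[test_idx] - mean) / std)
```

Each tuning trial then only pools five small summaries instead of rescanning the training rows, while the test fold is still scaled with training-only statistics.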

📄 Full Report

For detailed methodology, mathematical formulations, pseudocode, and complete analysis, see the project report.

About

From-scratch implementation of Linear SVM, RBF SVM (with SMO algorithm), and Logistic Regression for wine quality classification
