A from-scratch implementation of Gaussian and Multinomial Naive Bayes classifiers in Python, evaluated on two real-world datasets.
.
├── classifiers.py # GaussianNaiveBayes and MultinomialNaiveBayes implementations
├── utils.py # Helper functions: accuracy, mean, variance, Bag-of-Words builder
└── main.ipynb # Experiments, evaluation, and visualizations
| Dataset | Task | Model |
|---|---|---|
| Abalone (UCI) | Age classification (Young / Adult / Old) | Gaussian NB |
| IMDB Movie Reviews | Sentiment analysis (Positive / Negative) | Multinomial NB |
| Model | Mode | Accuracy |
|---|---|---|
| Gaussian NB | With log probabilities | 56.58% |
| Gaussian NB | Without log probabilities | 56.58% |
| Multinomial NB | With log probabilities | 78.50% |
| Multinomial NB | Without log probabilities | 53.00% |
Key insight: Log probabilities matter significantly for Multinomial NB on text data (a 25.5-percentage-point accuracy gap), because multiplying many small word probabilities causes numerical underflow without the log transformation.
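To see why, consider a review a few thousand words long: the product of per-word likelihoods quickly drops below the smallest representable float, while the sum of log probabilities stays finite and comparable across classes. A minimal illustration (not part of the repository code):

```python
import numpy as np

# Per-word likelihoods for a 2000-word review (typical values are small).
probs = np.full(2000, 1e-3)

naive_product = np.prod(probs)    # 1e-6000 underflows to 0.0 in float64
log_sum = np.sum(np.log(probs))   # stays finite: 2000 * ln(1e-3) ~ -13815.5

print(naive_product)  # 0.0 -> every class score collapses to zero
print(log_sum)        # -13815.51... -> class scores remain comparable
```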
- Install dependencies: `pip install numpy pandas matplotlib scikit-learn`
- Place the IMDB dataset at `data/IMDB Dataset.csv`.
- Open and run `main.ipynb` (a scripted alternative is sketched below).
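As a quick sanity check outside the notebook, the classifiers can be driven directly from a script. The sketch below is hypothetical: it assumes a scikit-learn-style `fit`/`predict` interface, a `use_log` constructor flag, and an `accuracy(y_true, y_pred)` helper; adjust the names to match the actual signatures in `classifiers.py` and `utils.py`.

```python
# Hypothetical usage sketch -- method names assume a fit/predict interface.
import numpy as np
from classifiers import GaussianNaiveBayes
from utils import accuracy

# Toy continuous features standing in for the Abalone measurements.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = rng.integers(0, 3, size=300)          # Young / Adult / Old encoded as 0 / 1 / 2

model = GaussianNaiveBayes(use_log=True)  # use_log toggles log-probability mode
model.fit(X[:250], y[:250])
preds = model.predict(X[250:])
print(accuracy(y[250:], preds))
```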
- Both classifiers support toggling log-probability mode via `use_log=True/False`.
- Gaussian NB adds a small constant (`1e-9`) to each feature variance to avoid division by zero.
- Multinomial NB uses Laplace (additive) smoothing on word counts.
- The `buildBOW` utility converts raw text reviews into fixed-length Bag-of-Words vectors (illustrated below).
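For reference, the last two notes boil down to a few lines of NumPy. The helper below is an illustrative re-creation, not the actual `buildBOW` implementation; tokenization and vocabulary handling in `utils.py` may differ.

```python
import numpy as np

def toy_bow(reviews, vocab):
    """Illustrative Bag-of-Words: count occurrences of vocabulary words per review."""
    index = {word: i for i, word in enumerate(vocab)}
    X = np.zeros((len(reviews), len(vocab)), dtype=np.int64)
    for row, text in enumerate(reviews):
        for token in text.lower().split():
            if token in index:
                X[row, index[token]] += 1
    return X

# Laplace (add-one) smoothing on word counts for a single class:
# P(word | class) = (count(word, class) + 1) / (total words in class + vocab size)
counts = np.array([3, 0, 7])                        # counts over a 3-word vocabulary
probs = (counts + 1) / (counts.sum() + len(counts))
print(probs, probs.sum())                           # no zero probabilities; sums to 1
```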