afftab/fuzzy-gate-cal

Does Calibration Survive Quantization?

Post-hoc probability calibration transfer from FP32 to INT4 transformer models.

Can a calibrator trained on full-precision model outputs be safely deployed on a quantized model without retraining? This project provides a systematic study on FinBERT, along with a novel fuzzy-gated log-space affine calibration method.


Key Findings

| Finding | Detail |
|---|---|
| Calibration transfer is robust | All 4 methods transfer FP32 to INT4 with at most a 5% increase in mean ECE |
| Quantization improves uncalibrated ECE | INT4 reduces ECE by 9.4% (0.254 to 0.230) |
| Best calibration (our method) | ECE 0.036 +/- 0.002 (FP32), 0.037 +/- 0.002 (INT4) |
| Accuracy improvement | 57.6% to 66.6% via class-prior bias correction |

Results

| Method | ECE (FP32) | ECE (INT4) | Acc (INT4) | ECE change, FP32 to INT4 (%) |
|---|---|---|---|---|
| Uncalibrated | 0.254 | 0.230 | 57.2% | -9.4% |
| Temperature Scaling | 0.113 | 0.118 | 57.2% | +4.5% |
| Dirichlet | 0.068 | 0.048 | 64.5% | -28.9% |
| Plain Affine | 0.193 | 0.161 | 59.6% | -16.8% |
| Fuzzy-Gated (Ours) | 0.036 +/- 0.002 | 0.037 +/- 0.002 | 66.5% | +5.0% |

All accuracy improvements are statistically significant (McNemar's test, p < 0.001), evaluated across 3 seeds.
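
For readers unfamiliar with McNemar's test, the sketch below shows one way to compare two classifiers on the same test samples. It is an illustration, not the repository's code: the function name `mcnemar_exact` is hypothetical, and it uses the exact binomial variant, whereas the README does not say whether the exact or the chi-square variant was used.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-sample correctness flags.

    Hypothetical helper: the README does not state which variant it uses.
    """
    # Only discordant pairs carry information about which model is better.
    b = sum(1 for a, bb in zip(correct_a, correct_b) if a and not bb)  # A right, B wrong
    c = sum(1 for a, bb in zip(correct_a, correct_b) if not a and bb)  # A wrong, B right
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Under H0 each discordant pair is a fair coin flip: Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)

# Toy example: 1 = prediction correct, 0 = prediction wrong.
baseline   = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
calibrated = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
b, c, p_value = mcnemar_exact(baseline, calibrated)
print(b, c, p_value)
```

The exact variant sums binomial coefficients directly, so it is best suited to small discordant counts; for large test sets a library implementation would be the more practical choice.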

Method

Fuzzy-Gated Log-Space Affine Calibration

Standard post-hoc calibration applies a single global transform. This method applies confidence-conditioned calibration: different corrections for different confidence regions, gated by learnable Gaussian membership functions.

Raw Probability --> Fuzzy Memberships --> Per-Region Calibration --> Weighted Sum
                     (5 regions)          (log-space affine)         (by membership)

Each of the 5 confidence regions learns its own scale and bias in log-space (60 total parameters for 3 classes), enabling strong corrections where the model is overconfident and gentle corrections where it is already well-calibrated.
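
The forward pass described above might look roughly like the following pure-Python sketch. Several details are assumptions not confirmed by this README: that memberships are computed per class from the raw class probability, that they are normalised to sum to 1, that the weighted sum is taken over per-region log-space outputs before a final softmax, and the parameter layout itself (a center, width, scale and bias for each of 5 regions and 3 classes, which happens to total 60). The names `init_params`, `memberships` and `calibrate` are hypothetical and are not taken from `src/fuzzy_calibrator.py`.

```python
import math

N_REGIONS, N_CLASSES = 5, 3

def init_params():
    # Assumed layout: every table is indexed [class][region].
    # Centers are evenly spaced on (0, 1); affine maps start as the identity.
    return {
        "centers": [[(r + 0.5) / N_REGIONS for r in range(N_REGIONS)] for _ in range(N_CLASSES)],
        "widths": [[0.15] * N_REGIONS for _ in range(N_CLASSES)],
        "scales": [[1.0] * N_REGIONS for _ in range(N_CLASSES)],
        "biases": [[0.0] * N_REGIONS for _ in range(N_CLASSES)],
    }

def memberships(p, centers, widths):
    """Normalised Gaussian membership of probability p in each confidence region."""
    raw = [math.exp(-0.5 * ((p - c) / w) ** 2) for c, w in zip(centers, widths)]
    total = sum(raw)
    return [m / total for m in raw]

def calibrate(probs, params, eps=1e-12):
    """Map one vector of raw class probabilities to calibrated probabilities."""
    logits = []
    for k, p in enumerate(probs):
        log_p = math.log(max(p, eps))
        m = memberships(p, params["centers"][k], params["widths"][k])
        # Membership-weighted sum of the per-region maps scale * log(p) + bias.
        z = sum(m_r * (a * log_p + b)
                for m_r, a, b in zip(m, params["scales"][k], params["biases"][k]))
        logits.append(z)
    # Softmax turns the adjusted log-probabilities back into a distribution.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

params = init_params()
# With identity initialisation the input comes back essentially unchanged.
print(calibrate([0.90, 0.05, 0.05], params))
```

In a real training setup these tables would be learnable tensors in an autodiff framework; plain lists are used here only to keep the sketch dependency-free.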

Training objective: NLL + Soft ECE + Class-wise Soft ECE + Brier Score (all differentiable).
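
As a rough illustration of how the four terms could be combined, the sketch below evaluates each one in plain Python. The README does not define its Soft ECE, so the soft-binning scheme here (a softmax over squared distances to bin centers, with a hypothetical `temp` parameter) is only one common formulation, and the equal weighting of the four terms is likewise an assumption. The code computes loss values only; the gradients implied by "all differentiable" would come from writing the same formulas in an autodiff framework.

```python
import math

def nll(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of the true class."""
    return -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(labels)

def brier(probs, labels):
    """Mean squared error between probability vectors and one-hot labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2 for k, p_k in enumerate(p))
    return total / len(labels)

def soft_ece(conf, hit, n_bins=15, temp=0.01):
    """Soft-binned ECE (assumed form): samples spread over bins via a softmax."""
    centers = [(b + 0.5) / n_bins for b in range(n_bins)]
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    for c, h in zip(conf, hit):
        scores = [-(c - ctr) ** 2 / temp for ctr in centers]
        top = max(scores)
        w = [math.exp(s - top) for s in scores]
        z = sum(w)
        for b in range(n_bins):
            conf_sum[b] += w[b] / z * c
            hit_sum[b] += w[b] / z * h
    # Sum over bins of (bin mass / N) * |mean conf - mean hit| simplifies to this.
    return sum(abs(cs - hs) for cs, hs in zip(conf_sum, hit_sum)) / len(conf)

def total_loss(probs, labels):
    """Unweighted sum of the four terms (the weighting is an assumption)."""
    conf = [max(p) for p in probs]
    pred = [max(range(len(p)), key=p.__getitem__) for p in probs]
    hit = [1.0 if k == y else 0.0 for k, y in zip(pred, labels)]
    n_classes = len(probs[0])
    # Class-wise variant: treat each class probability as its own binary problem.
    classwise = sum(
        soft_ece([p[k] for p in probs], [1.0 if y == k else 0.0 for y in labels])
        for k in range(n_classes)
    ) / n_classes
    return nll(probs, labels) + soft_ece(conf, hit) + classwise + brier(probs, labels)

print(total_loss([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]], [0, 1]))
```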

Quick Start

# Setup
conda env create -f environment.yml
conda activate fuzzy-calibration

# Extract FP32 + INT4 probabilities
python scripts/extract_probs.py --model ProsusAI/finbert --dataset lwrf42/financial-sentiment-dataset --seed 42 --quant int4

# Train the fuzzy-gated calibrator
python scripts/train_calibrator.py --model ProsusAI/finbert --dataset lwrf42/financial-sentiment-dataset --seed 42 --max-epochs 200

# Evaluate FP32 -> INT4 calibration transfer
python scripts/run_calibration.py --model ProsusAI/finbert --int4 --seed 42 \
    --load-fp32-probs ./cache/probs/seed42.npz --load-quant-probs ./cache/probs/seed42_int4.npz

Project Structure

.
├── src/
│   ├── config.py                  # Configuration
│   ├── data_loader.py             # Dataset loading + val/test split
│   ├── transformer_base.py        # Model wrapper (MPS/CUDA auto-detect)
│   ├── fuzzy_calibrator.py        # Fuzzy-gated calibrator
│   ├── fuzzy_membership.py        # Learnable Gaussian membership functions
│   ├── label_wise_calibrator.py   # Label-wise calibration
│   ├── evaluator.py               # ECE, Brier, accuracy metrics
│   └── requirements.txt           # Python dependencies
├── scripts/
│   ├── extract_probs.py           # FP32 + INT4 probability extraction
│   ├── train_calibrator.py        # Train fuzzy-gated calibrator
│   ├── run_calibration.py         # Calibration transfer evaluation
│   └── generate_figures.py        # Figure generation
├── environment.yml                # Conda environment
└── .gitignore

Experimental Setup

  • Model: FinBERT (110M params), INT4 via BitsAndBytes (418 MB -> 132 MB, 68% compression)
  • Dataset: LWRF Financial Sentiment (95,220 samples, 3 classes)
  • Metrics: ECE (15 bins), Brier score, accuracy
  • Seeds: 42, 123, 456
  • Significance: McNemar's test, paired bootstrap
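
For reference, ECE with 15 bins can be computed along the following lines. This is a minimal sketch assuming equal-width bins over top-label confidence; the README does not state the binning scheme, and `expected_calibration_error` is a hypothetical name rather than the function in `src/evaluator.py`.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE over equal-width confidence bins (assumed binning scheme)."""
    bin_conf = [0.0] * n_bins
    bin_hits = [0.0] * n_bins
    for c, h in zip(confidences, correct):
        # Bin b covers [b/n_bins, (b+1)/n_bins); min() keeps c == 1.0 in the last bin.
        b = min(int(c * n_bins), n_bins - 1)
        bin_conf[b] += c
        bin_hits[b] += h
    # Weighted average of |mean confidence - accuracy| per bin, where the
    # weights (bin size / N) cancel the per-bin means down to raw sums.
    return sum(abs(bc - bh) for bc, bh in zip(bin_conf, bin_hits)) / len(confidences)

# Four predictions at 90% confidence, three of them correct: gap of 0.15.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```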

License

MIT
