Post-hoc probability calibration transfer from FP32 to INT4 transformer models.
Can a calibrator trained on full-precision model outputs be safely deployed on a quantized model without retraining? This project provides a systematic study on FinBERT, along with a novel fuzzy-gated log-space affine calibration method.
| Finding | Detail |
|---|---|
| Calibration transfer is robust | All 4 methods transfer from FP32 to INT4; mean ECE rises by at most 5% (Temperature Scaling, Fuzzy-Gated) and falls for Dirichlet and Plain Affine |
| Quantization improves uncalibrated ECE | INT4 reduces ECE by 9.4% (0.254 to 0.230) |
| Best calibration (our method) | ECE 0.036 +/- 0.002 (FP32), 0.037 +/- 0.002 (INT4) |
| Accuracy improvement | 57.6% to 66.6% via class-prior bias correction |

| Method | ECE (FP32) | ECE (INT4) | Accuracy (INT4) | ECE change, FP32 to INT4 (%) |
|---|---|---|---|---|
| Uncalibrated | 0.254 | 0.230 | 57.2% | -9.4% |
| Temperature Scaling | 0.113 | 0.118 | 57.2% | +4.5% |
| Dirichlet | 0.068 | 0.048 | 64.5% | -28.9% |
| Plain Affine | 0.193 | 0.161 | 59.6% | -16.8% |
| Fuzzy-Gated (Ours) | 0.036 +/- 0.002 | 0.037 +/- 0.002 | 66.5% | +5.0% |

All accuracy improvements are statistically significant (McNemar's test, p < 0.001). Results are evaluated across 3 seeds.
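
As a rough illustration of the significance check, the sketch below runs an exact two-sided McNemar test on paired predictions using only the standard library. The function name `mcnemar_exact` and its inputs are hypothetical and are not taken from this repository; the project's own test may use a different variant, such as the chi-squared approximation.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts samples only model A
    classifies correctly and c counts samples only model B gets right.
    """
    b = c = 0
    for truth, a, other in zip(y_true, pred_a, pred_b):
        a_ok, b_ok = (a == truth), (other == truth)
        if a_ok and not b_ok:
            b += 1
        elif b_ok and not a_ok:
            c += 1

    n = b + c
    if n == 0:
        # No discordant pairs: the two models are indistinguishable here.
        return b, c, 1.0

    # Under H0 the discordant pairs split 50/50, i.e. Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)
```

Here `pred_a` could be the uncalibrated predictions and `pred_b` the predictions after calibration, both on the same test samples. Only the samples where the two disagree in correctness influence the p-value.
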
Standard post-hoc calibration applies a single global transform. This method applies confidence-conditioned calibration: different corrections for different confidence regions, gated by learnable Gaussian membership functions.

```
Raw Probability --> Fuzzy Memberships --> Per-Region Calibration --> Weighted Sum
                    (5 regions)           (log-space affine)         (by membership)
```

Each of the 5 confidence regions learns its own scale and bias in log-space (60 total parameters for 3 classes), enabling strong corrections where the model is overconfident and gentle corrections where it is already well-calibrated.
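
The forward pass described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in `src/fuzzy_calibrator.py`: it assumes the gate is driven by the top-class probability, that memberships are normalized to sum to 1, that each region applies a per-class scale and bias to the log-probabilities followed by a softmax, and that the region outputs are mixed as probabilities. All function and parameter names are hypothetical, and the parameterization is not meant to reproduce the 60-parameter count.

```python
import math


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def gaussian_memberships(conf, centers, widths):
    """Normalized Gaussian membership of a confidence value in each region."""
    raw = [math.exp(-0.5 * ((conf - mu) / sd) ** 2)
           for mu, sd in zip(centers, widths)]
    total = sum(raw)
    return [r / total for r in raw]


def fuzzy_gated_calibrate(probs, centers, widths, scales, biases, eps=1e-12):
    """Mix per-region log-space affine calibrations by fuzzy membership.

    scales[k][c] and biases[k][c] are the scale and bias of region k
    for class c (assumed layout).
    """
    conf = max(probs)  # assumption: gate on top-class confidence
    weights = gaussian_memberships(conf, centers, widths)
    log_p = [math.log(max(p, eps)) for p in probs]

    out = [0.0] * len(probs)
    for w_k, a_k, b_k in zip(weights, scales, biases):
        # Region k: affine transform in log-space, then renormalize.
        region = softmax([a * lp + b for a, lp, b in zip(a_k, log_p, b_k)])
        for c, q in enumerate(region):
            out[c] += w_k * q
    return out


# Example: 5 regions, 3 classes, identity parameters (scale 1, bias 0).
centers = [0.2, 0.4, 0.6, 0.8, 1.0]
widths = [0.1] * 5
scales = [[1.0] * 3 for _ in range(5)]
biases = [[0.0] * 3 for _ in range(5)]
print(fuzzy_gated_calibrate([0.7, 0.2, 0.1], centers, widths, scales, biases))
```

With identity parameters every region returns the input distribution, so the output equals the input up to floating-point error. Scales below 1 in the high-confidence regions soften overconfident predictions there while leaving other regions untouched.
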
Training objective: NLL + Soft ECE + Class-wise Soft ECE + Brier Score (all differentiable).
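
To make the four terms concrete, here is a small standard-library sketch of such an objective. It is an illustration only: the Gaussian soft-binning used for Soft ECE, the `bandwidth` value, the unweighted sum, and all function names are assumptions, since the exact definitions and term weights are not given here. In an autodiff framework the same expressions would be written on tensors so that gradients can flow through them.

```python
import math


def nll(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of the true class."""
    return -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(labels)


def brier(probs, labels):
    """Mean squared error between probabilities and one-hot labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((pc - (1.0 if c == y else 0.0)) ** 2 for c, pc in enumerate(p))
    return total / len(labels)


def soft_ece(confs, hits, n_bins=15, bandwidth=0.05):
    """ECE with Gaussian soft bin assignment instead of hard bin edges."""
    centers = [(i + 0.5) / n_bins for i in range(n_bins)]
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    for conf, hit in zip(confs, hits):
        w = [math.exp(-0.5 * ((conf - c) / bandwidth) ** 2) for c in centers]
        s = sum(w)
        for i, wi in enumerate(w):
            conf_sum[i] += (wi / s) * conf
            hit_sum[i] += (wi / s) * hit
    # sum_b (mass_b / n) * |acc_b - conf_b| simplifies to the line below.
    return sum(abs(h - c) for h, c in zip(hit_sum, conf_sum)) / len(confs)


def classwise_soft_ece(probs, labels, n_bins=15, bandwidth=0.05):
    """Average of per-class soft ECE, treating each class one-vs-rest."""
    n_classes = len(probs[0])
    total = 0.0
    for c in range(n_classes):
        confs = [p[c] for p in probs]
        hits = [1.0 if y == c else 0.0 for y in labels]
        total += soft_ece(confs, hits, n_bins, bandwidth)
    return total / n_classes


def calibration_loss(probs, labels):
    """Unweighted sum of the four terms (term weights are an assumption)."""
    top_conf = [max(p) for p in probs]
    top_hit = [1.0 if p.index(max(p)) == y else 0.0 for p, y in zip(probs, labels)]
    return (nll(probs, labels) + soft_ece(top_conf, top_hit)
            + classwise_soft_ece(probs, labels) + brier(probs, labels))
```

NLL and Brier score reward sharp, correct probabilities, while the two ECE terms penalize the gap between confidence and accuracy directly, overall and per class.
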

```bash
# Setup
conda env create -f environment.yml
conda activate fuzzy-calibration

# Extract FP32 + INT4 probabilities
python scripts/extract_probs.py --model ProsusAI/finbert --dataset lwrf42/financial-sentiment-dataset --seed 42 --quant int4

# Train the fuzzy-gated calibrator
python scripts/train_calibrator.py --model ProsusAI/finbert --dataset lwrf42/financial-sentiment-dataset --seed 42 --max-epochs 200

# Evaluate FP32 -> INT4 calibration transfer
python scripts/run_calibration.py --model ProsusAI/finbert --int4 --seed 42 \
    --load-fp32-probs ./cache/probs/seed42.npz --load-quant-probs ./cache/probs/seed42_int4.npz
```

```
├── src/
│   ├── config.py                 # Configuration
│   ├── data_loader.py            # Dataset loading + val/test split
│   ├── transformer_base.py       # Model wrapper (MPS/CUDA auto-detect)
│   ├── fuzzy_calibrator.py       # Fuzzy-gated calibrator
│   ├── fuzzy_membership.py       # Learnable Gaussian membership functions
│   ├── label_wise_calibrator.py  # Label-wise calibration
│   ├── evaluator.py              # ECE, Brier, accuracy metrics
│   └── requirements.txt          # Python dependencies
├── scripts/
│   ├── extract_probs.py          # FP32 + INT4 probability extraction
│   ├── train_calibrator.py       # Train fuzzy-gated calibrator
│   ├── run_calibration.py        # Calibration transfer evaluation
│   └── generate_figures.py       # Figure generation
├── environment.yml               # Conda environment
└── .gitignore
```

- Model: FinBERT (110M params), INT4 via BitsAndBytes (418 MB -> 132 MB, 68% compression)
- Dataset: LWRF Financial Sentiment (95,220 samples, 3 classes)
- Metrics: ECE (15 bins), Brier score, accuracy
- Seeds: 42, 123, 456
- Significance: McNemar's test, paired bootstrap
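
For reference, a 15-bin ECE of the kind listed above can be computed as in the following sketch. It assumes equal-width bins over [0, 1] and top-label confidence, which is the common definition. The binning used in `src/evaluator.py` may differ in detail, and the function names here are hypothetical.

```python
def top_label_outcomes(probs, labels):
    """Return (confidences, hits) for the predicted (argmax) class."""
    confs, hits = [], []
    for p, y in zip(probs, labels):
        pred = max(range(len(p)), key=p.__getitem__)
        confs.append(p[pred])
        hits.append(1.0 if pred == y else 0.0)
    return confs, hits


def expected_calibration_error(confs, hits, n_bins=15):
    """ECE with equal-width bins: sum_b (n_b / n) * |acc_b - conf_b|."""
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    for conf, hit in zip(confs, hits):
        # Left-closed bins [i/n_bins, (i+1)/n_bins); conf == 1.0 goes
        # into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        hit_sum[idx] += hit
    # (n_b / n) * |acc_b - conf_b| equals |hit_sum_b - conf_sum_b| / n.
    return sum(abs(h - c) for h, c in zip(hit_sum, conf_sum)) / len(confs)
```

For example, four predictions at confidence 0.9 of which three are correct fall into one bin with accuracy 0.75, which gives an ECE of about 0.15.
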
License: MIT