CalibPL: Dual Isotonic Calibration Reveals the Hidden Failure Mode of Semi-Supervised Detection in Dense Scenes
[WACV 2027 Submission · Algorithms Track · Double-Blind Review]
We identify and formally prove NMS Tail Amplification — a fundamental failure mode that causes semi-supervised object detectors to hallucinate confident, spatially incorrect pseudo-labels in dense scenes. Our fix, CalibPL, dynamically re-calibrates confidence at every self-training iteration, achieving +1.20 AP₅₀ on dense retail shelves, +8.3 AP cross-domain on CrowdHuman, and a ~70% false-positive reduction — while being theoretically grounded with a falsifiable prediction (no gain on sparse COCO, confirmed at p=0.41).
- The Core Problem
- Key Contributions
- Results at a Glance
- Theoretical Foundation
- Method Overview
- Installation
- Quick Start
- Reproducing Paper Results
- Project Structure
- Datasets
- Citation
- Research Lineage
- License
Every state-of-the-art semi-supervised object detector (SSOD) makes the same silent assumption:
Classification confidence reliably proxies pseudo-label spatial quality.
This assumption is mathematically wrong in dense scenes — and we prove it.
| Scene Type | Objects/Image | NMS Competition | ECE_loc Drift (5 iters) | AP Impact |
|---|---|---|---|---|
| COCO (sparse) | ~7 | Low | ≈0 | No degradation |
| CrowdHuman (dense) | ~23 | High | 0.033 → 0.407 | −5.7 AP |
| SKU-110K (very dense) | ~147 | Very High | 0.113 → 0.226 | Stagflation |
What goes wrong: Non-Maximum Suppression selects the highest-classification-score survivor from each dense cluster. Localization quality is a passenger — not selected for. In crowds of 147+ objects per image, this creates a systematic decoupling where the most confident boxes are also the most spatially unreliable. The Consistent-Teacher GMM baseline doubles localization ECE (0.113 → 0.226), leaving 39% of accepted pseudo-boxes with IoU < 0.5.
Why static calibration can't fix it: Post-hoc methods like Platt Scaling and Isotonic Regression (Kuzucu et al., ECCV 2024) work beautifully on fixed supervised models. But in iterative SSOD, the score distribution shifts every iteration. A calibrator fitted at iteration 0 accrues 0.024 ECE drift by iteration 1 — the map is stale, the territory has moved.
1. The NMS Tail Amplification Theorem (Proposition 1)
We formally prove that the expected gap between classification confidence and localization accuracy is non-decreasing in scene density
2. Dynamic Dual Isotonic Calibration
Two isotonic regression calibrators — one for classification (g_cls), one for localization (g_loc) — re-fitted at every SSOD iteration to track the non-stationary score distribution. Achieves ECE < 0.003 throughout all 5 self-training iterations.
3. Class-Geometry Joint Stability (CGJS) Gate
A multi-augmentation consistency filter that accepts a pseudo-label only when it is both spatially stable (box position consistent under geometric augmentation) and semantically stable (class prediction consistent under photometric augmentation). Requires 3.6× training overhead at the lightweight setting (|A|=2) while recovering 69% of the full gain.
4. Falsifiable Theoretical Prediction
Our theory predicts zero benefit on sparse datasets (
| Method | Precision | Recall | #PL | AP₅₀ |
|---|---|---|---|---|
| Fixed τ=0.5 (Baseline) | 0.610 | 0.875 | 14,194 | 87.28 |
| GMM-only (CT [4]) | 0.680 | 0.820 | 11,983 | 86.08 ± 0.26 |
| Temperature Scaling | 0.732 | 0.836 | 12,108 | 87.89 ± 0.08 |
| Static Isotonic (Kuzucu et al.) | 0.759 | 0.811 | 12,803 | 87.96 ± 0.10 |
| CalibPL (Ours) | 0.840 | 0.797 | 10,494 | 88.48 ± 0.11 |
| Improvement | +23.0 pp | −7.8 pp | −26% | +1.20 AP |
Statistical significance:
| Method | AP | AP₅₀ | Δ AP |
|---|---|---|---|
| Fixed τ=0.5 | 33.66 ± 0.03 | 53.17 ± 0.09 | — |
| CalibPL | 33.59 ± 0.03 | 52.99 ± 0.08 | −0.07 (p=0.41 ✓) |
CalibPL correctly predicts no benefit on sparse data — a key scientific validation.
| Method | ECE_loc (It.1→3) | AP (It.1→3) | ΔAP by It.3 |
|---|---|---|---|
| Baseline | 0.033 → 0.407 | 81.5 → 75.8 | — |
| CalibPL | 0.033 → 0.052 | 82.3 → 84.1 | +8.3 AP |
Consider a spatial clique of
-
$S^* = \max_{i} S_i$ — the NMS-selected score (order statistic) -
$g(s) = \Pr(\text{IoU} \geq \tau \mid S = s)$ — true localization-conditional precision - Tail overconfidence assumption (A):
$g(s) \leq s - \delta$ for$s \geq s_0$
Theorem: The expected overconfidence gap satisfies:
Empirical validation: On SKU-110K validation (16,393 post-NMS boxes, iteration 3), the tail overconfidence assumption holds with
A static calibrator
┌─────────────────────────────────────────────────────────────────────┐
│ CalibPL Self-Training Loop │
│ │
│ ┌──────────┐ ┌───────────────────────┐ ┌────────────────┐ │
│ │ f_{t-1} │───▶│ Density-Adaptive │───▶│ Dual Isotonic │ │
│ │ (Teacher) │ │ Scene Analysis │ │ Calibrators │ │
│ └──────────┘ │ ρ ≥ κ ? → loc-gate │ │ g_cls + g_loc │ │
│ └───────────────────────┘ └───────┬────────┘ │
│ │ re-fit/iter│
│ ┌──────────────────────────────────────────────────── ▼ ──────┐ │
│ │ Pseudo-Label Acceptance Gate (Eq. 3) │ │
│ │ g_cls(p̂_cls) ≥ r_cls ∧ g_loc(p̂_loc) ≥ r_loc ∧ σ_b ≥ β │ │
│ │ ↑ calibrated conf ↑ calibrated IoU │ │
│ │ ↑ CGJS stability score │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ accepted PL → D̃_U → Joint Training → f_t │
└─────────────────────────────────────────────────────────────────────┘
Algorithm 1 (one SSOD iteration):
- Compute scene density
$\rho$ from accepted pseudo-labels of$f_{t-1}$ - If
$\rho \geq \kappa$ : activate loc-gate$r_\text{loc}^\text{eff} = r_\text{loc}$ ; else$r_\text{loc}^\text{eff} = 0$ - Fit
$g_t^\text{cls}$ ,$g_t^\text{loc}$ via isotonic regression with stratified bootstrap ($B=5$ ) on$D_L^\text{val}$ - For each unlabeled box
$b$ : accept iff $g_t^\text{cls}(\hat{p}\text{cls}) \geq r\text{cls}$ ∧ $g_t^\text{loc}(\hat{p}\text{loc}) \geq r\text{loc}^\text{eff}$ ∧$\sigma_b \geq \beta$ - Train
$f_t$ on$D_L \cup \tilde{D}_U$
- Python 3.10+
- CUDA 11.8+ (tested on Tesla K80 / RTX 3090)
- ~8 GB GPU RAM minimum (11 GB recommended for full batch sizes)
# Clone the repository
git clone https://github.com/mohanganesh3/CalibPL.git
cd CalibPL
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # Linux/Mac
# .venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txtultralytics>=8.2.0 # YOLOv12 backbone and training
torch>=2.1.0
torchvision>=0.16.0
scikit-learn>=1.3.0 # IsotonicRegression calibrators
numpy>=1.24.0
scipy>=1.11.0
matplotlib>=3.7.0
Pillow>=10.0.0
pyyaml>=6.0
tqdm>=4.65.0
opencv-python>=4.8.0
python scripts/calibpl_selftrain.py \
--data-yaml data/sku110k_10pct.yaml \
--unlabeled-dir data/SKU-110K/unlabeled/images \
--method calibpl \
--iterations 5 \
--seed 42 \
--target-reliability 0.6 \
--cgjs-threshold 0.5 \
--tag sku10_seed42python scripts/calibpl_selftrain.py \
--data-yaml data/sku110k_10pct.yaml \
--unlabeled-dir data/SKU-110K/unlabeled/images \
--method calibpl_gmm \
--use-gmm \
--tag sku10_gmm_seed42python scripts/calibpl_selftrain.py \
--data-yaml data/sku110k_10pct.yaml \
--unlabeled-dir data/SKU-110K/unlabeled/images \
--method calibpl \
--cgjs-threshold 0.0 \ # disable CGJS gate
--tag sku10_dual_onlypython scripts/evaluate_sku_pseudo_labels.py \
--pseudo-dir results/calibpl_v3/calibpl_seed42_sku10_seed42/iter_3/pseudo_labels \
--gt-dir data/SKU-110K/val/labels \
--output results/pl_quality_report.jsonpython scripts/validate_proposition1_nms_competition.py \
--model-path models/yolo12n.pt \
--data-yaml data/sku110k_10pct.yaml \
--output results/proposition1_validation.json# Seeds 42, 43, 44 in parallel (requires 3 GPUs)
bash scripts/launch_sku_calibpl_seeds43_44.sh
# Or sequentially
for seed in 42 43 44; do
python scripts/calibpl_selftrain.py \
--data-yaml data/sku110k_10pct.yaml \
--unlabeled-dir data/SKU-110K/unlabeled/images \
--method calibpl \
--iterations 5 --seed $seed \
--tag sku10_seed${seed}
donebash scripts/run_ablations.sh
# Runs: fixed_threshold, g_cls_only, g_loc_only, dual, cgjs_only, full_calibplpython scripts/benchmark_gpu_cgjs.py \
--model-path models/yolo12n.pt \
--data-yaml data/sku110k_10pct.yaml \
--aug-sizes 0 2 5python scripts/experiment_ece_drift_crowdhuman.py \
--method calibpl \
--iterations 3python scripts/compile_results.py --output results/paper_tables.jsonCalibPL/
├── 📄 README.md
├── 📄 LICENSE
├── 📄 requirements.txt
│
├── 🔬 core/ # Core library modules
│ ├── calibration/
│ │ ├── detection_calibration.py # D-ECE, LaECE measurement utilities
│ │ └── mc_dropout.py # MC Dropout uncertainty estimation
│ ├── training/
│ │ ├── cotraining_exact.py # Co-training framework
│ │ └── optimizer_exact.py # Custom SGD + cosine decay
│ ├── dataset/ # Dataset loaders and splits
│ ├── models/ # Model wrappers
│ └── ensemble/ # Ensemble utilities
│
├── 📜 scripts/ # All experiment scripts
│ ├── calibpl_selftrain.py # ★ MAIN: CalibPL SSOD loop
│ ├── prediction_stability.py # CGJS gate implementation
│ ├── validate_proposition1_nms_competition.py # Theory validation
│ ├── measure_tail_misalignment.py # Tail overconfidence measurement
│ ├── experiment_ece_drift.py # ECE drift (Fig. 2 in paper)
│ ├── experiment_ece_drift_crowdhuman.py
│ ├── ablation_cgjs_precision.py # Table 6 ablations
│ ├── ablation_gmm_vs_isotonic.py # GMM vs isotonic comparison
│ ├── density_analysis.py # κ=12 density threshold analysis
│ ├── sensitivity_sweep.py # r_cls/r_loc threshold sweep
│ ├── independence_test.py # Pearson correlation cls/loc
│ ├── benchmark_gpu_cgjs.py # CGJS overhead benchmarking
│ └── [launch scripts, evaluators, figure generators...]
│
├── 📊 results/ # Experiment outputs
│ ├── figures/ # Generated figures (PDF/PNG)
│ ├── ablations/ # Ablation result JSONs
│ ├── table6_multiseed_ablation.json
│ ├── crowdhuman_results.json
│ └── [training logs per method/seed/iteration]
│
├── 📝 paper/
│ └── wacv2027/
│ ├── wacv_paper.tex # Main paper (WACV 2027)
│ ├── wacv_supplementary.tex # Full supplementary
│ ├── wacv2027_refs.bib # Bibliography
│ └── wacv.sty # WACV style file
│
├── 📐 theory/ # Mathematical analysis notebooks
├── 🎨 figures/ # Matplotlib figure scripts
├── 📚 docs/ # Additional documentation
└── 🗂️ data/ # Data configs (not raw data)
├── sku110k_10pct.yaml
├── sku110k_1pct.yaml
├── sku110k_5pct.yaml
└── coco_1pct.yaml
bash scripts/download_sku110k.sh
# Or manually: https://github.com/eg4000/SKU110K_CVPR19After downloading, set up the split:
python scripts/subset_sku_json.py \
--source data/SKU-110K/annotations/instances_train2017.json \
--pct 10 \
--output data/sku110k_splits/Dataset statistics:
- Train: 8,233 images | Val: 588 | Test: 2,941
- Objects/image: 147.4 ± 32.4
- Total annotated instances: 1.73 million
- Our working subset: 1,400 images (10% = 140 labeled)
# Download from https://cocodataset.org
python scripts/prepare_coco_ssod.py --split 1pct --output data/coco_splits/bash scripts/download_crowdhuman.shKey hyperparameters (all set to paper defaults):
| Parameter | Default | Description |
|---|---|---|
--iterations |
5 | Number of SSOD self-training iterations |
--epochs |
10 | Training epochs per iteration |
--target-reliability |
0.6 | Calibrated threshold r_cls = r_loc |
--cgjs-threshold |
0.5 | CGJS stability gate β |
--cgjs-alpha |
0.5 | Weight in score fusion mode |
--seed |
42 | Random seed (paper uses 42, 43, 44) |
--batch-size |
16 | Training batch size |
CGJS augmentation budget:
| |A| | Overhead | AP₅₀ | Training Time |
|:------:|:--------:|:----:|:-------------:|
| 0 (off) | 1.0× | 88.06 | ~12h |
| 2 (fast) | 3.6× | 88.35 | ~15h ← recommended |
| 5 (full) | 7.2× | 88.48 | ~18h |
| Script | Purpose |
|---|---|
scripts/calibpl_selftrain.py |
Main experiment entry point |
scripts/prediction_stability.py |
CGJS gate (multi-augmentation stability) |
scripts/validate_proposition1_nms_competition.py |
Empirical proof of Proposition 1 |
scripts/measure_tail_misalignment.py |
Tail overconfidence measurement (Table A.1) |
scripts/experiment_ece_drift.py |
ECE drift across iterations (Fig. 2) |
scripts/ablation_cgjs_precision.py |
Component ablation table (Table 6) |
scripts/independence_test.py |
Pearson correlation cls/loc (Sec. 5) |
scripts/density_analysis.py |
κ=12 density threshold derivation |
scripts/sensitivity_sweep.py |
r_cls/r_loc sensitivity (Table 7 right) |
scripts/compile_results.py |
Aggregate all results → paper tables |
This work sits at the intersection of three independent research threads that had never been combined:
Thread 1: Semi-Supervised Object Detection
STAC (2020) → Unbiased Teacher (ICLR 2021) → Soft Teacher (ICCV 2021)
→ Consistent Teacher (CVPR 2023) → [this work: dense calibration layer]
Thread 2: Object Detection Calibration
Neumann et al. (NeurIPS-W 2018) → Küppers et al. (CVPR-W 2022)
→ Kuzucu et al. LaECE (ECCV 2024 Oral) → [this work: dynamic SSOD version]
→ IoU-Net (ECCV 2018) ─┘
Thread 3: Dense Scene Understanding
SKU-110K (CVPR 2019) → Dense Teacher (ECCV 2022)
→ [this work: first calibration study on dense SSOD]
What makes this novel: No prior paper has combined (a) isotonic regression calibration, (b) iterative SSOD, and (c) dense-scene focus. Kuzucu et al. (ECCV 2024) is the nearest calibration precedent but operates on fixed supervised models only. Our formal Proposition 1 provides the first mathematical characterization of why dense SSOD is structurally different.
| Method | Calibration | Iterative SSOD | Dense Scene | Localization Calib. | Formal Theory |
|---|---|---|---|---|---|
| Unbiased Teacher | ❌ | ✅ | ❌ COCO | ❌ | ❌ |
| Consistent Teacher | Partial (GMM) | ✅ | ❌ COCO | ❌ | ❌ |
| Kuzucu et al. | ✅ Isotonic | ❌ Static | ❌ COCO/LVIS | ✅ | Empirical |
| IoU-Net | ❌ | ❌ Supervised | ❌ | ✅ | ❌ |
| CalibPL (Ours) | ✅ Dynamic | ✅ | ✅ SKU-110K | ✅ | ✅ Prop. 1 |
Why this matters beyond retail shelves:
The NMS Tail Amplification effect is a structural property of any NMS-based SSOD pipeline on dense data. It affects:
- 🏥 Medical imaging: Dense cell detection in pathology slides
- 🚗 Autonomous driving: Pedestrian crowd detection (CrowdHuman validation: +8.3 AP)
- 🛰️ Satellite imagery: Infrastructure damage assessment after disasters
- 🏭 Manufacturing: Semiconductor defect detection on dense component grids
Limitation acknowledgment: CalibPL requires ~500 labeled boxes minimum for reliable calibration fitting, is validated on YOLOv12n only (anchor-free adaptation is future work), and κ=12 was derived on SKU-110K.
If you find this work useful, please cite:
@inproceedings{calibpl2027wacv,
title = {{CalibPL}: Dual Isotonic Calibration Reveals the Hidden
Failure Mode of Semi-Supervised Detection in Dense Scenes},
author = {Anonymous Author(s)},
booktitle = {Proceedings of the IEEE/CVF Winter Conference on
Applications of Computer Vision (WACV)},
year = {2027},
note = {Under review — anonymized submission}
}Related work we build on:
@inproceedings{kuzucu2024calibration,
title = {How Calibrated are {L}ocalization-Dependent Object Detectors?},
author = {Kuzucu, Selim and Oksuz, Kemal and Sherwood, Jonathan and Dokania, Puneet K.},
booktitle = {ECCV},
year = {2024}
}
@inproceedings{iounet2018,
title = {Acquisition of Localization Confidence for Accurate Object Detection},
author = {Jiang, Borui and Luo, Ruixuan and Mao, Jiayuan and Xiao, Tete and Jiang, Yuning},
booktitle = {ECCV},
year = {2018}
}
@inproceedings{sku110k2019,
title = {Accurate Detection of Objects in Dense Shelf Images},
author = {Goldman, Eran and Herzig, Roei and Eisenschtat, Aviv and Goldberger, Jacob and Hassner, Tal},
booktitle = {CVPR},
year = {2019}
}This is an active research project under anonymous review. After the review period, we welcome:
- Bug reports via GitHub Issues
- Dataset compatibility PRs (adapting to new dense detection datasets)
- Benchmark extensions (anchor-free detectors, video detection)
Please read CONTRIBUTING.md for guidelines.
This project is licensed under the MIT License — see LICENSE for details.
All datasets used (SKU-110K, COCO, CrowdHuman) are subject to their respective licenses. Download and use them per their original terms.
If this research helps your work, please ⭐ star the repository.
Built with rigorous science, six months of experiments, and a lot of GPU-hours on Tesla K80.