CalibPL: Dual Isotonic Calibration Reveals the Hidden Failure Mode of Semi-Supervised Detection in Dense Scenes

[WACV 2027 Submission · Algorithms Track · Double-Blind Review]

We identify and formally prove NMS Tail Amplification — a fundamental failure mode that causes semi-supervised object detectors to hallucinate confident, spatially incorrect pseudo-labels in dense scenes. Our fix, CalibPL, dynamically re-calibrates confidence at every self-training iteration, achieving +1.20 AP₅₀ on dense retail shelves, +8.3 AP cross-domain on CrowdHuman, and a ~70% false-positive reduction — while being theoretically grounded with a falsifiable prediction (no gain on sparse COCO, confirmed at p=0.41).

🎯 The Core Problem

Every state-of-the-art semi-supervised object detector (SSOD) makes the same silent assumption:

Classification confidence reliably proxies pseudo-label spatial quality.

This assumption is mathematically wrong in dense scenes — and we prove it.

Scene Type	Objects/Image	NMS Competition	ECE_loc Drift (5 iters)	AP Impact
COCO (sparse)	~7	Low	≈0	No degradation
CrowdHuman (dense)	~23	High	0.033 → 0.407	−5.7 AP
SKU-110K (very dense)	~147	Very High	0.113 → 0.226	Stagflation

What goes wrong: Non-Maximum Suppression selects the highest-classification-score survivor from each dense cluster. Localization quality is a passenger — not selected for. In crowds of 147+ objects per image, this creates a systematic decoupling where the most confident boxes are also the most spatially unreliable. The Consistent-Teacher GMM baseline doubles localization ECE (0.113 → 0.226), leaving 39% of accepted pseudo-boxes with IoU < 0.5.

Why static calibration can't fix it: Post-hoc methods like Platt Scaling and Isotonic Regression (Kuzucu et al., ECCV 2024) work beautifully on fixed supervised models. But in iterative SSOD, the score distribution shifts every iteration. A calibrator fitted at iteration 0 accrues 0.024 ECE drift by iteration 1 — the map is stale, the territory has moved.

🏆 Key Contributions

1. The NMS Tail Amplification Theorem (Proposition 1) We formally prove that the expected gap between classification confidence and localization accuracy is non-decreasing in scene density $n$, providing the first mathematical explanation for why dense SSOD is structurally harder than sparse SSOD.

2. Dynamic Dual Isotonic Calibration Two isotonic regression calibrators — one for classification (g_cls), one for localization (g_loc) — re-fitted at every SSOD iteration to track the non-stationary score distribution. Achieves ECE < 0.003 throughout all 5 self-training iterations.

3. Class-Geometry Joint Stability (CGJS) Gate A multi-augmentation consistency filter that accepts a pseudo-label only when it is both spatially stable (box position consistent under geometric augmentation) and semantically stable (class prediction consistent under photometric augmentation). Requires 3.6× training overhead at the lightweight setting (|A|=2) while recovering 69% of the full gain.

4. Falsifiable Theoretical Prediction Our theory predicts zero benefit on sparse datasets ($\bar{\rho} < \kappa \approx 12$). This is confirmed empirically: CalibPL is statistically indistinguishable from the baseline on COCO 1% ($p=0.41$), providing a rare falsifiable validation loop for an ML claim.

📊 Results at a Glance

Dense Retail: SKU-110K 10% Labels

Method	Precision	Recall	#PL	AP₅₀
Fixed τ=0.5 (Baseline)	0.610	0.875	14,194	87.28
GMM-only (CT [4])	0.680	0.820	11,983	86.08 ± 0.26
Temperature Scaling	0.732	0.836	12,108	87.89 ± 0.08
Static Isotonic (Kuzucu et al.)	0.759	0.811	12,803	87.96 ± 0.10
CalibPL (Ours)	0.840	0.797	10,494	88.48 ± 0.11
Improvement	+23.0 pp	−7.8 pp	−26%	+1.20 AP

Statistical significance: $t = 10.9$, $p < 0.001$ (two-tailed $t$-test, $n=3$ seeds).

Sparse Regime: COCO 1% (Theory Validation)

Method	AP	AP₅₀	Δ AP
Fixed τ=0.5	33.66 ± 0.03	53.17 ± 0.09	—
CalibPL	33.59 ± 0.03	52.99 ± 0.08	−0.07 (p=0.41 ✓)

CalibPL correctly predicts no benefit on sparse data — a key scientific validation.

Cross-Domain: CrowdHuman (Zero-Shot Transfer, ρ̄≈22)

Method	ECE_loc (It.1→3)	AP (It.1→3)	ΔAP by It.3
Baseline	0.033 → 0.407	81.5 → 75.8	—
CalibPL	0.033 → 0.052	82.3 → 84.1	+8.3 AP

🧮 Theoretical Foundation

Proposition 1: NMS Tail Amplification

Consider a spatial clique of $n$ mutually-suppressing detections. Define:

$S^* = \max_{i} S_i$ — the NMS-selected score (order statistic)
$g(s) = \Pr(\text{IoU} \geq \tau \mid S = s)$ — true localization-conditional precision
Tail overconfidence assumption (A): $g(s) \leq s - \delta$ for $s \geq s_0$

Theorem: The expected overconfidence gap satisfies: $$\Delta_n := \mathbb{E}[S^* - g(S^*)] \geq \delta\left(1 - F_S(s_0)^n\right)$$ This bound is non-decreasing in $n$ and strictly increasing whenever $F_S(s_0) \in (0,1)$.

Empirical validation: On SKU-110K validation (16,393 post-NMS boxes, iteration 3), the tail overconfidence assumption holds with $\delta \approx 0.13$, $s_0 = 0.85$ — confirming the theoretical conditions on real detection data.

Corollary (Inevitable Drift Under Iterative SSOD)

A static calibrator $g_0$ fitted at iteration 0 misrepresents $F_S^{(t)}$ for all $t \geq 1$, accumulating ECE error proportional to the distributional shift. Dynamic per-iteration re-fitting (CalibPL) eliminates this drift entirely.

🔬 Method Overview

┌─────────────────────────────────────────────────────────────────────┐
│                    CalibPL Self-Training Loop                        │
│                                                                      │
│  ┌──────────┐    ┌───────────────────────┐    ┌────────────────┐   │
│  │  f_{t-1}  │───▶│  Density-Adaptive     │───▶│  Dual Isotonic │   │
│  │ (Teacher) │    │  Scene Analysis       │    │  Calibrators   │   │
│  └──────────┘    │  ρ ≥ κ ? → loc-gate  │    │  g_cls + g_loc │   │
│                  └───────────────────────┘    └───────┬────────┘   │
│                                                        │ re-fit/iter│
│  ┌──────────────────────────────────────────────────── ▼ ──────┐   │
│  │              Pseudo-Label Acceptance Gate (Eq. 3)            │   │
│  │   g_cls(p̂_cls) ≥ r_cls  ∧  g_loc(p̂_loc) ≥ r_loc  ∧  σ_b ≥ β │   │
│  │                    ↑ calibrated conf   ↑ calibrated IoU       │   │
│  │                                        ↑ CGJS stability score │   │
│  └──────────────────────────────────────────────────────────────┘   │
│         accepted PL → D̃_U → Joint Training → f_t                   │
└─────────────────────────────────────────────────────────────────────┘

Algorithm 1 (one SSOD iteration):

Compute scene density $\rho$ from accepted pseudo-labels of $f_{t-1}$
If $\rho \geq \kappa$: activate loc-gate $r_\text{loc}^\text{eff} = r_\text{loc}$; else $r_\text{loc}^\text{eff} = 0$
Fit $g_t^\text{cls}$, $g_t^\text{loc}$ via isotonic regression with stratified bootstrap ($B=5$) on $D_L^\text{val}$
For each unlabeled box $b$: accept iff $g_t^\text{cls}(\hat{p}\text{cls}) \geq r\text{cls}$ ∧ $g_t^\text{loc}(\hat{p}\text{loc}) \geq r\text{loc}^\text{eff}$ ∧ $\sigma_b \geq \beta$
Train $f_t$ on $D_L \cup \tilde{D}_U$

⚙️ Installation

Requirements

Python 3.10+
CUDA 11.8+ (tested on Tesla K80 / RTX 3090)
~8 GB GPU RAM minimum (11 GB recommended for full batch sizes)

Setup

# Clone the repository
git clone https://github.com/mohanganesh3/CalibPL.git
cd CalibPL

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # Linux/Mac
# .venv\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt

Dependencies

ultralytics>=8.2.0       # YOLOv12 backbone and training
torch>=2.1.0
torchvision>=0.16.0
scikit-learn>=1.3.0      # IsotonicRegression calibrators
numpy>=1.24.0
scipy>=1.11.0
matplotlib>=3.7.0
Pillow>=10.0.0
pyyaml>=6.0
tqdm>=4.65.0
opencv-python>=4.8.0

🚀 Quick Start

1. Run CalibPL on SKU-110K (main experiment)

python scripts/calibpl_selftrain.py \
  --data-yaml data/sku110k_10pct.yaml \
  --unlabeled-dir data/SKU-110K/unlabeled/images \
  --method calibpl \
  --iterations 5 \
  --seed 42 \
  --target-reliability 0.6 \
  --cgjs-threshold 0.5 \
  --tag sku10_seed42

2. Run with GMM baseline (Consistent-Teacher)

python scripts/calibpl_selftrain.py \
  --data-yaml data/sku110k_10pct.yaml \
  --unlabeled-dir data/SKU-110K/unlabeled/images \
  --method calibpl_gmm \
  --use-gmm \
  --tag sku10_gmm_seed42

3. Ablation: Dual calibration only (no CGJS)

python scripts/calibpl_selftrain.py \
  --data-yaml data/sku110k_10pct.yaml \
  --unlabeled-dir data/SKU-110K/unlabeled/images \
  --method calibpl \
  --cgjs-threshold 0.0 \   # disable CGJS gate
  --tag sku10_dual_only

4. Evaluate pseudo-label quality

python scripts/evaluate_sku_pseudo_labels.py \
  --pseudo-dir results/calibpl_v3/calibpl_seed42_sku10_seed42/iter_3/pseudo_labels \
  --gt-dir data/SKU-110K/val/labels \
  --output results/pl_quality_report.json

5. Validate Proposition 1 empirically

python scripts/validate_proposition1_nms_competition.py \
  --model-path models/yolo12n.pt \
  --data-yaml data/sku110k_10pct.yaml \
  --output results/proposition1_validation.json

📈 Reproducing Paper Results

Multi-seed main experiment (Table 3 in paper)

# Seeds 42, 43, 44 in parallel (requires 3 GPUs)
bash scripts/launch_sku_calibpl_seeds43_44.sh

# Or sequentially
for seed in 42 43 44; do
  python scripts/calibpl_selftrain.py \
    --data-yaml data/sku110k_10pct.yaml \
    --unlabeled-dir data/SKU-110K/unlabeled/images \
    --method calibpl \
    --iterations 5 --seed $seed \
    --tag sku10_seed${seed}
done

Component ablation (Table 6 in paper)

bash scripts/run_ablations.sh
# Runs: fixed_threshold, g_cls_only, g_loc_only, dual, cgjs_only, full_calibpl

CGJS overhead sweep (Table 7 in paper)

python scripts/benchmark_gpu_cgjs.py \
  --model-path models/yolo12n.pt \
  --data-yaml data/sku110k_10pct.yaml \
  --aug-sizes 0 2 5

CrowdHuman cross-domain (Table 8 in paper)

python scripts/experiment_ece_drift_crowdhuman.py \
  --method calibpl \
  --iterations 3

Compile all results

python scripts/compile_results.py --output results/paper_tables.json

📁 Project Structure

CalibPL/
├── 📄 README.md
├── 📄 LICENSE
├── 📄 requirements.txt
│
├── 🔬 core/                          # Core library modules
│   ├── calibration/
│   │   ├── detection_calibration.py  # D-ECE, LaECE measurement utilities
│   │   └── mc_dropout.py            # MC Dropout uncertainty estimation
│   ├── training/
│   │   ├── cotraining_exact.py      # Co-training framework
│   │   └── optimizer_exact.py       # Custom SGD + cosine decay
│   ├── dataset/                     # Dataset loaders and splits
│   ├── models/                      # Model wrappers
│   └── ensemble/                    # Ensemble utilities
│
├── 📜 scripts/                       # All experiment scripts
│   ├── calibpl_selftrain.py         # ★ MAIN: CalibPL SSOD loop
│   ├── prediction_stability.py      # CGJS gate implementation
│   ├── validate_proposition1_nms_competition.py  # Theory validation
│   ├── measure_tail_misalignment.py # Tail overconfidence measurement
│   ├── experiment_ece_drift.py      # ECE drift (Fig. 2 in paper)
│   ├── experiment_ece_drift_crowdhuman.py
│   ├── ablation_cgjs_precision.py   # Table 6 ablations
│   ├── ablation_gmm_vs_isotonic.py  # GMM vs isotonic comparison
│   ├── density_analysis.py          # κ=12 density threshold analysis
│   ├── sensitivity_sweep.py         # r_cls/r_loc threshold sweep
│   ├── independence_test.py         # Pearson correlation cls/loc
│   ├── benchmark_gpu_cgjs.py        # CGJS overhead benchmarking
│   └── [launch scripts, evaluators, figure generators...]
│
├── 📊 results/                       # Experiment outputs
│   ├── figures/                     # Generated figures (PDF/PNG)
│   ├── ablations/                   # Ablation result JSONs
│   ├── table6_multiseed_ablation.json
│   ├── crowdhuman_results.json
│   └── [training logs per method/seed/iteration]
│
├── 📝 paper/
│   └── wacv2027/
│       ├── wacv_paper.tex           # Main paper (WACV 2027)
│       ├── wacv_supplementary.tex   # Full supplementary
│       ├── wacv2027_refs.bib        # Bibliography
│       └── wacv.sty                 # WACV style file
│
├── 📐 theory/                        # Mathematical analysis notebooks
├── 🎨 figures/                       # Matplotlib figure scripts
├── 📚 docs/                          # Additional documentation
└── 🗂️ data/                          # Data configs (not raw data)
    ├── sku110k_10pct.yaml
    ├── sku110k_1pct.yaml
    ├── sku110k_5pct.yaml
    └── coco_1pct.yaml

📦 Datasets

SKU-110K (Primary Benchmark)

bash scripts/download_sku110k.sh
# Or manually: https://github.com/eg4000/SKU110K_CVPR19

After downloading, set up the split:

python scripts/subset_sku_json.py \
  --source data/SKU-110K/annotations/instances_train2017.json \
  --pct 10 \
  --output data/sku110k_splits/

Dataset statistics:

Train: 8,233 images | Val: 588 | Test: 2,941
Objects/image: 147.4 ± 32.4
Total annotated instances: 1.73 million
Our working subset: 1,400 images (10% = 140 labeled)

COCO 2017 (Sparse Control)

# Download from https://cocodataset.org
python scripts/prepare_coco_ssod.py --split 1pct --output data/coco_splits/

CrowdHuman (Cross-Domain)

bash scripts/download_crowdhuman.sh

📐 Configuration

Key hyperparameters (all set to paper defaults):

Parameter	Default	Description
`--iterations`	5	Number of SSOD self-training iterations
`--epochs`	10	Training epochs per iteration
`--target-reliability`	0.6	Calibrated threshold r_cls = r_loc
`--cgjs-threshold`	0.5	CGJS stability gate β
`--cgjs-alpha`	0.5	Weight in score fusion mode
`--seed`	42	Random seed (paper uses 42, 43, 44)
`--batch-size`	16	Training batch size

CGJS augmentation budget:

| |A| | Overhead | AP₅₀ | Training Time | |:------:|:--------:|:----:|:-------------:| | 0 (off) | 1.0× | 88.06 | ~12h | | 2 (fast) | 3.6× | 88.35 | ~15h ← recommended | | 5 (full) | 7.2× | 88.48 | ~18h |

🔍 Key Scripts Reference

Script	Purpose
`scripts/calibpl_selftrain.py`	Main experiment entry point
`scripts/prediction_stability.py`	CGJS gate (multi-augmentation stability)
`scripts/validate_proposition1_nms_competition.py`	Empirical proof of Proposition 1
`scripts/measure_tail_misalignment.py`	Tail overconfidence measurement (Table A.1)
`scripts/experiment_ece_drift.py`	ECE drift across iterations (Fig. 2)
`scripts/ablation_cgjs_precision.py`	Component ablation table (Table 6)
`scripts/independence_test.py`	Pearson correlation cls/loc (Sec. 5)
`scripts/density_analysis.py`	κ=12 density threshold derivation
`scripts/sensitivity_sweep.py`	r_cls/r_loc sensitivity (Table 7 right)
`scripts/compile_results.py`	Aggregate all results → paper tables

📖 Research Lineage

This work sits at the intersection of three independent research threads that had never been combined:

Thread 1: Semi-Supervised Object Detection
  STAC (2020) → Unbiased Teacher (ICLR 2021) → Soft Teacher (ICCV 2021)
    → Consistent Teacher (CVPR 2023) → [this work: dense calibration layer]

Thread 2: Object Detection Calibration  
  Neumann et al. (NeurIPS-W 2018) → Küppers et al. (CVPR-W 2022)
    → Kuzucu et al. LaECE (ECCV 2024 Oral) → [this work: dynamic SSOD version]
                                           → IoU-Net (ECCV 2018) ─┘

Thread 3: Dense Scene Understanding
  SKU-110K (CVPR 2019) → Dense Teacher (ECCV 2022)
    → [this work: first calibration study on dense SSOD]

What makes this novel: No prior paper has combined (a) isotonic regression calibration, (b) iterative SSOD, and (c) dense-scene focus. Kuzucu et al. (ECCV 2024) is the nearest calibration precedent but operates on fixed supervised models only. Our formal Proposition 1 provides the first mathematical characterization of why dense SSOD is structurally different.

📊 Comparison with Prior Work

Method	Calibration	Iterative SSOD	Dense Scene	Localization Calib.	Formal Theory
Unbiased Teacher	❌	✅	❌ COCO	❌	❌
Consistent Teacher	Partial (GMM)	✅	❌ COCO	❌	❌
Kuzucu et al.	✅ Isotonic	❌ Static	❌ COCO/LVIS	✅	Empirical
IoU-Net	❌	❌ Supervised	❌	✅	❌
CalibPL (Ours)	✅ Dynamic	✅	✅ SKU-110K	✅	✅ Prop. 1

🌍 Broader Impact

Why this matters beyond retail shelves:

The NMS Tail Amplification effect is a structural property of any NMS-based SSOD pipeline on dense data. It affects:

🏥 Medical imaging: Dense cell detection in pathology slides
🚗 Autonomous driving: Pedestrian crowd detection (CrowdHuman validation: +8.3 AP)
🛰️ Satellite imagery: Infrastructure damage assessment after disasters
🏭 Manufacturing: Semiconductor defect detection on dense component grids

Limitation acknowledgment: CalibPL requires ~500 labeled boxes minimum for reliable calibration fitting, is validated on YOLOv12n only (anchor-free adaptation is future work), and κ=12 was derived on SKU-110K.

📝 Citation

If you find this work useful, please cite:

@inproceedings{calibpl2027wacv,
  title     = {{CalibPL}: Dual Isotonic Calibration Reveals the Hidden 
               Failure Mode of Semi-Supervised Detection in Dense Scenes},
  author    = {Anonymous Author(s)},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on 
               Applications of Computer Vision (WACV)},
  year      = {2027},
  note      = {Under review — anonymized submission}
}

Related work we build on:

@inproceedings{kuzucu2024calibration,
  title   = {How Calibrated are {L}ocalization-Dependent Object Detectors?},
  author  = {Kuzucu, Selim and Oksuz, Kemal and Sherwood, Jonathan and Dokania, Puneet K.},
  booktitle = {ECCV},
  year    = {2024}
}

@inproceedings{iounet2018,
  title     = {Acquisition of Localization Confidence for Accurate Object Detection},
  author    = {Jiang, Borui and Luo, Ruixuan and Mao, Jiayuan and Xiao, Tete and Jiang, Yuning},
  booktitle = {ECCV},
  year      = {2018}
}

@inproceedings{sku110k2019,
  title     = {Accurate Detection of Objects in Dense Shelf Images},
  author    = {Goldman, Eran and Herzig, Roei and Eisenschtat, Aviv and Goldberger, Jacob and Hassner, Tal},
  booktitle = {CVPR},
  year      = {2019}
}

🤝 Contributing

This is an active research project under anonymous review. After the review period, we welcome:

Bug reports via GitHub Issues
Dataset compatibility PRs (adapting to new dense detection datasets)
Benchmark extensions (anchor-free detectors, video detection)

Please read CONTRIBUTING.md for guidelines.

📜 License

This project is licensed under the MIT License — see LICENSE for details.

All datasets used (SKU-110K, COCO, CrowdHuman) are subject to their respective licenses. Download and use them per their original terms.

If this research helps your work, please ⭐ star the repository.

Built with rigorous science, six months of experiments, and a lot of GPU-hours on Tesla K80.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
core		core
data		data
paper/wacv2027		paper/wacv2027
results		results
scripts		scripts
theory		theory
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
compile_wacv2027.sh		compile_wacv2027.sh
deep_motivation_research.md		deep_motivation_research.md
novelty_review_report.md		novelty_review_report.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

CalibPL: Dual Isotonic Calibration Reveals the Hidden Failure Mode of Semi-Supervised Detection in Dense Scenes

📋 Table of Contents

🎯 The Core Problem

🏆 Key Contributions

📊 Results at a Glance

Dense Retail: SKU-110K 10% Labels

Sparse Regime: COCO 1% (Theory Validation)

Cross-Domain: CrowdHuman (Zero-Shot Transfer, ρ̄≈22)

🧮 Theoretical Foundation

Proposition 1: NMS Tail Amplification

Corollary (Inevitable Drift Under Iterative SSOD)

🔬 Method Overview

⚙️ Installation

Requirements

Setup

Dependencies

🚀 Quick Start

1. Run CalibPL on SKU-110K (main experiment)

2. Run with GMM baseline (Consistent-Teacher)

3. Ablation: Dual calibration only (no CGJS)

4. Evaluate pseudo-label quality

5. Validate Proposition 1 empirically

📈 Reproducing Paper Results

Multi-seed main experiment (Table 3 in paper)

Component ablation (Table 6 in paper)

CGJS overhead sweep (Table 7 in paper)

CrowdHuman cross-domain (Table 8 in paper)

Compile all results

📁 Project Structure

📦 Datasets

SKU-110K (Primary Benchmark)

COCO 2017 (Sparse Control)

CrowdHuman (Cross-Domain)

📐 Configuration

🔍 Key Scripts Reference

📖 Research Lineage

📊 Comparison with Prior Work

🌍 Broader Impact

📝 Citation

🤝 Contributing

📜 License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages