Skip to content

mohanganesh3/CalibPL

Repository files navigation


CalibPL: Dual Isotonic Calibration Reveals the Hidden Failure Mode of Semi-Supervised Detection in Dense Scenes

[WACV 2027 Submission · Algorithms Track · Double-Blind Review]

We identify and formally prove NMS Tail Amplification — a fundamental failure mode that causes semi-supervised object detectors to hallucinate confident, spatially incorrect pseudo-labels in dense scenes. Our fix, CalibPL, dynamically re-calibrates confidence at every self-training iteration, achieving +1.20 AP₅₀ on dense retail shelves, +8.3 AP cross-domain on CrowdHuman, and a ~70% false-positive reduction — while being theoretically grounded with a falsifiable prediction (no gain on sparse COCO, confirmed at p=0.41).


📋 Table of Contents


🎯 The Core Problem

Every state-of-the-art semi-supervised object detector (SSOD) makes the same silent assumption:

Classification confidence reliably proxies pseudo-label spatial quality.

This assumption is mathematically wrong in dense scenes — and we prove it.

Scene Type Objects/Image NMS Competition ECE_loc Drift (5 iters) AP Impact
COCO (sparse) ~7 Low ≈0 No degradation
CrowdHuman (dense) ~23 High 0.033 → 0.407 −5.7 AP
SKU-110K (very dense) ~147 Very High 0.113 → 0.226 Stagflation

What goes wrong: Non-Maximum Suppression selects the highest-classification-score survivor from each dense cluster. Localization quality is a passenger — not selected for. In crowds of 147+ objects per image, this creates a systematic decoupling where the most confident boxes are also the most spatially unreliable. The Consistent-Teacher GMM baseline doubles localization ECE (0.113 → 0.226), leaving 39% of accepted pseudo-boxes with IoU < 0.5.

Why static calibration can't fix it: Post-hoc methods like Platt Scaling and Isotonic Regression (Kuzucu et al., ECCV 2024) work beautifully on fixed supervised models. But in iterative SSOD, the score distribution shifts every iteration. A calibrator fitted at iteration 0 accrues 0.024 ECE drift by iteration 1 — the map is stale, the territory has moved.


🏆 Key Contributions

1. The NMS Tail Amplification Theorem (Proposition 1) We formally prove that the expected gap between classification confidence and localization accuracy is non-decreasing in scene density $n$, providing the first mathematical explanation for why dense SSOD is structurally harder than sparse SSOD.

2. Dynamic Dual Isotonic Calibration Two isotonic regression calibrators — one for classification (g_cls), one for localization (g_loc) — re-fitted at every SSOD iteration to track the non-stationary score distribution. Achieves ECE < 0.003 throughout all 5 self-training iterations.

3. Class-Geometry Joint Stability (CGJS) Gate A multi-augmentation consistency filter that accepts a pseudo-label only when it is both spatially stable (box position consistent under geometric augmentation) and semantically stable (class prediction consistent under photometric augmentation). Requires 3.6× training overhead at the lightweight setting (|A|=2) while recovering 69% of the full gain.

4. Falsifiable Theoretical Prediction Our theory predicts zero benefit on sparse datasets ($\bar{\rho} &lt; \kappa \approx 12$). This is confirmed empirically: CalibPL is statistically indistinguishable from the baseline on COCO 1% ($p=0.41$), providing a rare falsifiable validation loop for an ML claim.


📊 Results at a Glance

Dense Retail: SKU-110K 10% Labels

Method Precision Recall #PL AP₅₀
Fixed τ=0.5 (Baseline) 0.610 0.875 14,194 87.28
GMM-only (CT [4]) 0.680 0.820 11,983 86.08 ± 0.26
Temperature Scaling 0.732 0.836 12,108 87.89 ± 0.08
Static Isotonic (Kuzucu et al.) 0.759 0.811 12,803 87.96 ± 0.10
CalibPL (Ours) 0.840 0.797 10,494 88.48 ± 0.11
Improvement +23.0 pp −7.8 pp −26% +1.20 AP

Statistical significance: $t = 10.9$, $p &lt; 0.001$ (two-tailed $t$-test, $n=3$ seeds).

Sparse Regime: COCO 1% (Theory Validation)

Method AP AP₅₀ Δ AP
Fixed τ=0.5 33.66 ± 0.03 53.17 ± 0.09
CalibPL 33.59 ± 0.03 52.99 ± 0.08 −0.07 (p=0.41 ✓)

CalibPL correctly predicts no benefit on sparse data — a key scientific validation.

Cross-Domain: CrowdHuman (Zero-Shot Transfer, ρ̄≈22)

Method ECE_loc (It.1→3) AP (It.1→3) ΔAP by It.3
Baseline 0.033 → 0.407 81.5 → 75.8
CalibPL 0.033 → 0.052 82.3 → 84.1 +8.3 AP

🧮 Theoretical Foundation

Proposition 1: NMS Tail Amplification

Consider a spatial clique of $n$ mutually-suppressing detections. Define:

  • $S^* = \max_{i} S_i$ — the NMS-selected score (order statistic)
  • $g(s) = \Pr(\text{IoU} \geq \tau \mid S = s)$ — true localization-conditional precision
  • Tail overconfidence assumption (A): $g(s) \leq s - \delta$ for $s \geq s_0$

Theorem: The expected overconfidence gap satisfies: $$\Delta_n := \mathbb{E}[S^* - g(S^*)] \geq \delta\left(1 - F_S(s_0)^n\right)$$ This bound is non-decreasing in $n$ and strictly increasing whenever $F_S(s_0) \in (0,1)$.

Empirical validation: On SKU-110K validation (16,393 post-NMS boxes, iteration 3), the tail overconfidence assumption holds with $\delta \approx 0.13$, $s_0 = 0.85$ — confirming the theoretical conditions on real detection data.

Corollary (Inevitable Drift Under Iterative SSOD)

A static calibrator $g_0$ fitted at iteration 0 misrepresents $F_S^{(t)}$ for all $t \geq 1$, accumulating ECE error proportional to the distributional shift. Dynamic per-iteration re-fitting (CalibPL) eliminates this drift entirely.


🔬 Method Overview

┌─────────────────────────────────────────────────────────────────────┐
│                    CalibPL Self-Training Loop                        │
│                                                                      │
│  ┌──────────┐    ┌───────────────────────┐    ┌────────────────┐   │
│  │  f_{t-1}  │───▶│  Density-Adaptive     │───▶│  Dual Isotonic │   │
│  │ (Teacher) │    │  Scene Analysis       │    │  Calibrators   │   │
│  └──────────┘    │  ρ ≥ κ ? → loc-gate  │    │  g_cls + g_loc │   │
│                  └───────────────────────┘    └───────┬────────┘   │
│                                                        │ re-fit/iter│
│  ┌──────────────────────────────────────────────────── ▼ ──────┐   │
│  │              Pseudo-Label Acceptance Gate (Eq. 3)            │   │
│  │   g_cls(p̂_cls) ≥ r_cls  ∧  g_loc(p̂_loc) ≥ r_loc  ∧  σ_b ≥ β │   │
│  │                    ↑ calibrated conf   ↑ calibrated IoU       │   │
│  │                                        ↑ CGJS stability score │   │
│  └──────────────────────────────────────────────────────────────┘   │
│         accepted PL → D̃_U → Joint Training → f_t                   │
└─────────────────────────────────────────────────────────────────────┘

Algorithm 1 (one SSOD iteration):

  1. Compute scene density $\rho$ from accepted pseudo-labels of $f_{t-1}$
  2. If $\rho \geq \kappa$: activate loc-gate $r_\text{loc}^\text{eff} = r_\text{loc}$; else $r_\text{loc}^\text{eff} = 0$
  3. Fit $g_t^\text{cls}$, $g_t^\text{loc}$ via isotonic regression with stratified bootstrap ($B=5$) on $D_L^\text{val}$
  4. For each unlabeled box $b$: accept iff $g_t^\text{cls}(\hat{p}\text{cls}) \geq r\text{cls}$ ∧ $g_t^\text{loc}(\hat{p}\text{loc}) \geq r\text{loc}^\text{eff}$ ∧ $\sigma_b \geq \beta$
  5. Train $f_t$ on $D_L \cup \tilde{D}_U$

⚙️ Installation

Requirements

  • Python 3.10+
  • CUDA 11.8+ (tested on Tesla K80 / RTX 3090)
  • ~8 GB GPU RAM minimum (11 GB recommended for full batch sizes)

Setup

# Clone the repository
git clone https://github.com/mohanganesh3/CalibPL.git
cd CalibPL

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # Linux/Mac
# .venv\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt

Dependencies

ultralytics>=8.2.0       # YOLOv12 backbone and training
torch>=2.1.0
torchvision>=0.16.0
scikit-learn>=1.3.0      # IsotonicRegression calibrators
numpy>=1.24.0
scipy>=1.11.0
matplotlib>=3.7.0
Pillow>=10.0.0
pyyaml>=6.0
tqdm>=4.65.0
opencv-python>=4.8.0

🚀 Quick Start

1. Run CalibPL on SKU-110K (main experiment)

python scripts/calibpl_selftrain.py \
  --data-yaml data/sku110k_10pct.yaml \
  --unlabeled-dir data/SKU-110K/unlabeled/images \
  --method calibpl \
  --iterations 5 \
  --seed 42 \
  --target-reliability 0.6 \
  --cgjs-threshold 0.5 \
  --tag sku10_seed42

2. Run with GMM baseline (Consistent-Teacher)

python scripts/calibpl_selftrain.py \
  --data-yaml data/sku110k_10pct.yaml \
  --unlabeled-dir data/SKU-110K/unlabeled/images \
  --method calibpl_gmm \
  --use-gmm \
  --tag sku10_gmm_seed42

3. Ablation: Dual calibration only (no CGJS)

python scripts/calibpl_selftrain.py \
  --data-yaml data/sku110k_10pct.yaml \
  --unlabeled-dir data/SKU-110K/unlabeled/images \
  --method calibpl \
  --cgjs-threshold 0.0 \   # disable CGJS gate
  --tag sku10_dual_only

4. Evaluate pseudo-label quality

python scripts/evaluate_sku_pseudo_labels.py \
  --pseudo-dir results/calibpl_v3/calibpl_seed42_sku10_seed42/iter_3/pseudo_labels \
  --gt-dir data/SKU-110K/val/labels \
  --output results/pl_quality_report.json

5. Validate Proposition 1 empirically

python scripts/validate_proposition1_nms_competition.py \
  --model-path models/yolo12n.pt \
  --data-yaml data/sku110k_10pct.yaml \
  --output results/proposition1_validation.json

📈 Reproducing Paper Results

Multi-seed main experiment (Table 3 in paper)

# Seeds 42, 43, 44 in parallel (requires 3 GPUs)
bash scripts/launch_sku_calibpl_seeds43_44.sh

# Or sequentially
for seed in 42 43 44; do
  python scripts/calibpl_selftrain.py \
    --data-yaml data/sku110k_10pct.yaml \
    --unlabeled-dir data/SKU-110K/unlabeled/images \
    --method calibpl \
    --iterations 5 --seed $seed \
    --tag sku10_seed${seed}
done

Component ablation (Table 6 in paper)

bash scripts/run_ablations.sh
# Runs: fixed_threshold, g_cls_only, g_loc_only, dual, cgjs_only, full_calibpl

CGJS overhead sweep (Table 7 in paper)

python scripts/benchmark_gpu_cgjs.py \
  --model-path models/yolo12n.pt \
  --data-yaml data/sku110k_10pct.yaml \
  --aug-sizes 0 2 5

CrowdHuman cross-domain (Table 8 in paper)

python scripts/experiment_ece_drift_crowdhuman.py \
  --method calibpl \
  --iterations 3

Compile all results

python scripts/compile_results.py --output results/paper_tables.json

📁 Project Structure

CalibPL/
├── 📄 README.md
├── 📄 LICENSE
├── 📄 requirements.txt
│
├── 🔬 core/                          # Core library modules
│   ├── calibration/
│   │   ├── detection_calibration.py  # D-ECE, LaECE measurement utilities
│   │   └── mc_dropout.py            # MC Dropout uncertainty estimation
│   ├── training/
│   │   ├── cotraining_exact.py      # Co-training framework
│   │   └── optimizer_exact.py       # Custom SGD + cosine decay
│   ├── dataset/                     # Dataset loaders and splits
│   ├── models/                      # Model wrappers
│   └── ensemble/                    # Ensemble utilities
│
├── 📜 scripts/                       # All experiment scripts
│   ├── calibpl_selftrain.py         # ★ MAIN: CalibPL SSOD loop
│   ├── prediction_stability.py      # CGJS gate implementation
│   ├── validate_proposition1_nms_competition.py  # Theory validation
│   ├── measure_tail_misalignment.py # Tail overconfidence measurement
│   ├── experiment_ece_drift.py      # ECE drift (Fig. 2 in paper)
│   ├── experiment_ece_drift_crowdhuman.py
│   ├── ablation_cgjs_precision.py   # Table 6 ablations
│   ├── ablation_gmm_vs_isotonic.py  # GMM vs isotonic comparison
│   ├── density_analysis.py          # κ=12 density threshold analysis
│   ├── sensitivity_sweep.py         # r_cls/r_loc threshold sweep
│   ├── independence_test.py         # Pearson correlation cls/loc
│   ├── benchmark_gpu_cgjs.py        # CGJS overhead benchmarking
│   └── [launch scripts, evaluators, figure generators...]
│
├── 📊 results/                       # Experiment outputs
│   ├── figures/                     # Generated figures (PDF/PNG)
│   ├── ablations/                   # Ablation result JSONs
│   ├── table6_multiseed_ablation.json
│   ├── crowdhuman_results.json
│   └── [training logs per method/seed/iteration]
│
├── 📝 paper/
│   └── wacv2027/
│       ├── wacv_paper.tex           # Main paper (WACV 2027)
│       ├── wacv_supplementary.tex   # Full supplementary
│       ├── wacv2027_refs.bib        # Bibliography
│       └── wacv.sty                 # WACV style file
│
├── 📐 theory/                        # Mathematical analysis notebooks
├── 🎨 figures/                       # Matplotlib figure scripts
├── 📚 docs/                          # Additional documentation
└── 🗂️ data/                          # Data configs (not raw data)
    ├── sku110k_10pct.yaml
    ├── sku110k_1pct.yaml
    ├── sku110k_5pct.yaml
    └── coco_1pct.yaml

📦 Datasets

SKU-110K (Primary Benchmark)

bash scripts/download_sku110k.sh
# Or manually: https://github.com/eg4000/SKU110K_CVPR19

After downloading, set up the split:

python scripts/subset_sku_json.py \
  --source data/SKU-110K/annotations/instances_train2017.json \
  --pct 10 \
  --output data/sku110k_splits/

Dataset statistics:

  • Train: 8,233 images | Val: 588 | Test: 2,941
  • Objects/image: 147.4 ± 32.4
  • Total annotated instances: 1.73 million
  • Our working subset: 1,400 images (10% = 140 labeled)

COCO 2017 (Sparse Control)

# Download from https://cocodataset.org
python scripts/prepare_coco_ssod.py --split 1pct --output data/coco_splits/

CrowdHuman (Cross-Domain)

bash scripts/download_crowdhuman.sh

📐 Configuration

Key hyperparameters (all set to paper defaults):

Parameter Default Description
--iterations 5 Number of SSOD self-training iterations
--epochs 10 Training epochs per iteration
--target-reliability 0.6 Calibrated threshold r_cls = r_loc
--cgjs-threshold 0.5 CGJS stability gate β
--cgjs-alpha 0.5 Weight in score fusion mode
--seed 42 Random seed (paper uses 42, 43, 44)
--batch-size 16 Training batch size

CGJS augmentation budget:

| |A| | Overhead | AP₅₀ | Training Time | |:------:|:--------:|:----:|:-------------:| | 0 (off) | 1.0× | 88.06 | ~12h | | 2 (fast) | 3.6× | 88.35 | ~15h ← recommended | | 5 (full) | 7.2× | 88.48 | ~18h |


🔍 Key Scripts Reference

Script Purpose
scripts/calibpl_selftrain.py Main experiment entry point
scripts/prediction_stability.py CGJS gate (multi-augmentation stability)
scripts/validate_proposition1_nms_competition.py Empirical proof of Proposition 1
scripts/measure_tail_misalignment.py Tail overconfidence measurement (Table A.1)
scripts/experiment_ece_drift.py ECE drift across iterations (Fig. 2)
scripts/ablation_cgjs_precision.py Component ablation table (Table 6)
scripts/independence_test.py Pearson correlation cls/loc (Sec. 5)
scripts/density_analysis.py κ=12 density threshold derivation
scripts/sensitivity_sweep.py r_cls/r_loc sensitivity (Table 7 right)
scripts/compile_results.py Aggregate all results → paper tables

📖 Research Lineage

This work sits at the intersection of three independent research threads that had never been combined:

Thread 1: Semi-Supervised Object Detection
  STAC (2020) → Unbiased Teacher (ICLR 2021) → Soft Teacher (ICCV 2021)
    → Consistent Teacher (CVPR 2023) → [this work: dense calibration layer]

Thread 2: Object Detection Calibration  
  Neumann et al. (NeurIPS-W 2018) → Küppers et al. (CVPR-W 2022)
    → Kuzucu et al. LaECE (ECCV 2024 Oral) → [this work: dynamic SSOD version]
                                           → IoU-Net (ECCV 2018) ─┘

Thread 3: Dense Scene Understanding
  SKU-110K (CVPR 2019) → Dense Teacher (ECCV 2022)
    → [this work: first calibration study on dense SSOD]

What makes this novel: No prior paper has combined (a) isotonic regression calibration, (b) iterative SSOD, and (c) dense-scene focus. Kuzucu et al. (ECCV 2024) is the nearest calibration precedent but operates on fixed supervised models only. Our formal Proposition 1 provides the first mathematical characterization of why dense SSOD is structurally different.


📊 Comparison with Prior Work

Method Calibration Iterative SSOD Dense Scene Localization Calib. Formal Theory
Unbiased Teacher ❌ COCO
Consistent Teacher Partial (GMM) ❌ COCO
Kuzucu et al. ✅ Isotonic ❌ Static ❌ COCO/LVIS Empirical
IoU-Net ❌ Supervised
CalibPL (Ours) ✅ Dynamic ✅ SKU-110K ✅ Prop. 1

🌍 Broader Impact

Why this matters beyond retail shelves:

The NMS Tail Amplification effect is a structural property of any NMS-based SSOD pipeline on dense data. It affects:

  • 🏥 Medical imaging: Dense cell detection in pathology slides
  • 🚗 Autonomous driving: Pedestrian crowd detection (CrowdHuman validation: +8.3 AP)
  • 🛰️ Satellite imagery: Infrastructure damage assessment after disasters
  • 🏭 Manufacturing: Semiconductor defect detection on dense component grids

Limitation acknowledgment: CalibPL requires ~500 labeled boxes minimum for reliable calibration fitting, is validated on YOLOv12n only (anchor-free adaptation is future work), and κ=12 was derived on SKU-110K.


📝 Citation

If you find this work useful, please cite:

@inproceedings{calibpl2027wacv,
  title     = {{CalibPL}: Dual Isotonic Calibration Reveals the Hidden 
               Failure Mode of Semi-Supervised Detection in Dense Scenes},
  author    = {Anonymous Author(s)},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on 
               Applications of Computer Vision (WACV)},
  year      = {2027},
  note      = {Under review — anonymized submission}
}

Related work we build on:

@inproceedings{kuzucu2024calibration,
  title   = {How Calibrated are {L}ocalization-Dependent Object Detectors?},
  author  = {Kuzucu, Selim and Oksuz, Kemal and Sherwood, Jonathan and Dokania, Puneet K.},
  booktitle = {ECCV},
  year    = {2024}
}

@inproceedings{iounet2018,
  title     = {Acquisition of Localization Confidence for Accurate Object Detection},
  author    = {Jiang, Borui and Luo, Ruixuan and Mao, Jiayuan and Xiao, Tete and Jiang, Yuning},
  booktitle = {ECCV},
  year      = {2018}
}

@inproceedings{sku110k2019,
  title     = {Accurate Detection of Objects in Dense Shelf Images},
  author    = {Goldman, Eran and Herzig, Roei and Eisenschtat, Aviv and Goldberger, Jacob and Hassner, Tal},
  booktitle = {CVPR},
  year      = {2019}
}

🤝 Contributing

This is an active research project under anonymous review. After the review period, we welcome:

  • Bug reports via GitHub Issues
  • Dataset compatibility PRs (adapting to new dense detection datasets)
  • Benchmark extensions (anchor-free detectors, video detection)

Please read CONTRIBUTING.md for guidelines.


📜 License

This project is licensed under the MIT License — see LICENSE for details.

All datasets used (SKU-110K, COCO, CrowdHuman) are subject to their respective licenses. Download and use them per their original terms.


If this research helps your work, please ⭐ star the repository.

Built with rigorous science, six months of experiments, and a lot of GPU-hours on Tesla K80.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors