Small Objects Detection Benchmark

Benchmark of CNN and DETR-family detectors for small-object detection in aerial imagery.

TL;DR

Scope: Benchmark of CNN (YOLO, Faster R-CNN) and DETR (RT-DETR, RF-DETR) families for small-object detection in aerial imagery.
Dataset: SkyFusion - 2,992 images, 63,530 objects, 3 classes (aircraft, ship, vehicle).
Best baseline mAP@0.5: RT-DETR-L 0.599.
Best tiny-object behavior: YOLOv11m-P2 (highest tiny-object AP and recall among all models).
After threshold tuning: YOLOv11m-P2 reaches 0.693 mAP@0.5 (+23.1% relative to its 0.563 baseline, tuned with Optuna).

Problem, Dataset, and Constraints

Detecting small objects (often < 32×32 px) in aerial and satellite imagery is challenging due to low pixel counts, class imbalance, and high object density. This project benchmarks six detector configurations on the SkyFusion dataset:

Statistic	Value
Total images	2,992
Total objects	63,530
Classes	aircraft, ship, vehicle
Image resolution	640×640 px
Avg. object density	21.25 objects/image
Median object density	7 objects/image

Vehicles dominate (76.7% of objects), ships are rare (3.4%), and aircraft sit in between (19.9%). Most objects fall into the COCO "small" or "tiny" bins.
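The size bins can be sketched as a small helper. The 32×32 and 96×96 px boundaries are the standard COCO small/medium/large cut-offs; the 16×16 px "tiny" boundary and the function name are assumptions for illustration, since the exact cut-off used in this project is not stated here.

```python
# Assign a bounding box to a size bin by pixel area.
COCO_SMALL_MAX_AREA = 32 * 32    # COCO: "small" means area < 32^2 px
COCO_MEDIUM_MAX_AREA = 96 * 96   # COCO: "medium" means 32^2 <= area < 96^2 px
TINY_MAX_AREA = 16 * 16          # ASSUMPTION: hypothetical "tiny" cut-off


def size_bin(width_px: float, height_px: float) -> str:
    """Return 'tiny', 'small', 'medium', or 'large' for a box of the given size."""
    area = width_px * height_px
    if area < TINY_MAX_AREA:
        return "tiny"
    if area < COCO_SMALL_MAX_AREA:
        return "small"
    if area < COCO_MEDIUM_MAX_AREA:
        return "medium"
    return "large"


# Example: a 20x20 px vehicle lands in the "small" bin.
example_bin = size_bin(20, 20)
```

With these assumed boundaries, "tiny" is a subset of what COCO would report as "small".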

What I Built (System + Experiment Pipeline)

Unified benchmark framework (src/odc/benchmark/) — model-agnostic evaluation with pluggable adapters for Ultralytics (YOLO, RT-DETR), RF-DETR, and Faster R-CNN (PyTorch Lightning).
Iterative YOLO training pipeline — progressive improvements from baseline YOLOv8m through augmentation, cosine LR, optimized augmentation, YOLOv11 upgrade, and P2 head addition.
Threshold optimization — Optuna-based per-model tuning of conf_thr and nms_iou on the validation split.
Extended analysis scripts — size-bin evaluation, spatial-density evaluation, TIDE error decomposition, WBF ensembling, and inference-time tiling.

Benchmark Setup (Fair-Comparison Protocol)

Parameter	Value
Test split	420 images, 6,939 annotations
Baseline confidence threshold	0.25
Baseline NMS IoU	0.45 (where applicable)
Metrics	mAP@0.5, mAP@0.75, mAP@[0.5:0.95], per-class AP, latency/FPS/params
Training hardware	Kaggle P100 GPU
Inference hardware	NVIDIA GTX 1050 Ti (local)

All models were evaluated on the identical test split with identical metric computation (COCO-style pycocotools).
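As a rough illustration of what the baseline thresholds do, the sketch below applies a confidence cut at 0.25 followed by greedy NMS at IoU 0.45. It is a single-class, pure-Python simplification with hypothetical function names, not the code path used by the benchmark adapters; NMS-free detectors such as the DETR family would skip the suppression step.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_detections(boxes: Sequence[Box], scores: Sequence[float],
                      conf_thr: float = 0.25, nms_iou: float = 0.45) -> List[int]:
    """Return indices kept after confidence thresholding and greedy NMS."""
    # Drop low-confidence detections, then visit the rest from high to low score.
    order = sorted((i for i, s in enumerate(scores) if s >= conf_thr),
                   key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too strongly.
        if all(iou(boxes[i], boxes[j]) <= nms_iou for j in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (100, 100, 110, 110)]
scores = [0.9, 0.8, 0.7, 0.1]
kept = filter_detections(boxes, scores)
```

Here the second box overlaps the first with IoU 81/119 ≈ 0.68 and is suppressed, and the last box falls below the confidence threshold.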

Results at a Glance

Model	mAP@0.5	mAP@0.75	mAP@[0.5:0.95]	Inference (ms)	FPS	Params (M)
RT-DETR-L	0.599	0.346	0.341	86.6 ± 15.5	11.5	32.0
RF-DETR	0.577	0.342	0.334	95.0 ± 5.8	10.5	93.1
YOLOv11m (P2, Optimized Aug)	0.563	0.352	0.341	61.0 ± 1.1	16.4	20.5
YOLOv11m (Optimized Aug)	0.515	0.343	0.325	45.5 ± 1.0	22.0	20.0
YOLOv8m (Optimized Aug, CosLR)	0.514	0.344	0.325	43.8 ± 1.4	22.8	25.8
Faster R-CNN	0.458	0.292	0.264	200.3 ± 5.9	5.0	41.3

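RT-DETR-L leads on aggregate mAP@0.5. YOLOv11m-P2 is best on mAP@0.75, ties RT-DETR-L on mAP@[0.5:0.95] (0.341), and has the best tiny-object AP and recall, with about 4.5× fewer parameters than RF-DETR (20.5M vs 93.1M) and lower latency than either DETR model (61.0 ms vs 86.6–95.0 ms).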
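The FPS column is consistent with single-image throughput, i.e. 1000 divided by the mean latency in milliseconds. The short check below reproduces it from the table values, assuming that is how the column was derived, and also computes the parameter ratio between RF-DETR and YOLOv11m-P2.

```python
# (mean latency in ms, parameters in millions), copied from the results table.
results = {
    "RT-DETR-L": (86.6, 32.0),
    "RF-DETR": (95.0, 93.1),
    "YOLOv11m (P2, Optimized Aug)": (61.0, 20.5),
    "YOLOv11m (Optimized Aug)": (45.5, 20.0),
    "YOLOv8m (Optimized Aug, CosLR)": (43.8, 25.8),
    "Faster R-CNN": (200.3, 41.3),
}


def fps_from_latency(latency_ms: float) -> float:
    """Single-image throughput implied by a mean latency."""
    return 1000.0 / latency_ms


fps = {name: round(fps_from_latency(ms), 1) for name, (ms, _) in results.items()}
param_ratio = results["RF-DETR"][1] / results["YOLOv11m (P2, Optimized Aug)"][1]
```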

Key Improvements and Ablations

YOLO Training Progression

Step	Change	mAP@0.5	Δ vs previous step (relative)	Vehicle AP
1	Baseline YOLOv8m	0.397	—	0.224
2	+ Augmentation	0.420	+5.8%	0.291
3	+ Cosine LR	0.481	+14.5%	0.329
4	Optimized Augmentation	0.514	+6.9%	0.412
5	YOLOv11m	0.515	+0.2%	0.413
6	YOLOv11m + P2 head	0.563	+9.3%	0.560 (+35.6% vs step 5)
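The percentages in this table are relative gains over the previous step, not absolute mAP points. A minimal check of that arithmetic, using the mAP@0.5 values from the table (the helper name is illustrative):

```python
# (step description, mAP@0.5) in training order, copied from the table.
steps = [
    ("Baseline YOLOv8m", 0.397),
    ("+ Augmentation", 0.420),
    ("+ Cosine LR", 0.481),
    ("Optimized Augmentation", 0.514),
    ("YOLOv11m", 0.515),
    ("YOLOv11m + P2 head", 0.563),
]


def relative_gain_pct(previous: float, current: float) -> float:
    """Relative improvement of `current` over `previous`, in percent."""
    return (current - previous) / previous * 100.0


# Gain of each step over the one before it.
step_gains = [round(relative_gain_pct(prev[1], curr[1]), 1)
              for prev, curr in zip(steps, steps[1:])]

# Cumulative gain from the baseline to the final configuration.
total_gain = round(relative_gain_pct(steps[0][1], steps[-1][1]), 1)
```

The same formula gives a cumulative gain of about 41.8% from step 1 to step 6.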

Optuna Threshold Tuning Impact

Model	Baseline mAP@0.5	Optimized mAP@0.5	Improvement
YOLOv11m (P2, Optimized Aug)	0.563	0.693	+23.1%
YOLOv8m (Optimized Aug, CosLR)	0.514	0.662	+28.8%
RT-DETR-L	0.599	0.652	+8.8%
RF-DETR	0.577	0.637	+10.4%
Faster R-CNN	0.458	0.493	+7.6%
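The tuning loop can be sketched without any dependencies. The example below uses plain random search as a stand-in for Optuna's sampler, and `toy_validation_map` is a hypothetical placeholder for a real validation-set evaluation; the search ranges are assumptions as well.

```python
import random
from typing import Callable, Dict, Tuple


def random_search(evaluate: Callable[..., float],
                  n_trials: int = 200,
                  seed: int = 0) -> Tuple[Dict[str, float], float]:
    """Sample (conf_thr, nms_iou) pairs and keep the best validation score."""
    rng = random.Random(seed)
    # Start from the baseline thresholds so the result is never worse than them.
    best_params = {"conf_thr": 0.25, "nms_iou": 0.45}
    best_score = evaluate(**best_params)
    for _ in range(n_trials):
        params = {"conf_thr": rng.uniform(0.001, 0.9),   # ASSUMED search range
                  "nms_iou": rng.uniform(0.1, 0.9)}      # ASSUMED search range
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_map(conf_thr: float, nms_iou: float) -> float:
    """HYPOTHETICAL objective; a real one would run inference on the validation split."""
    return 0.7 - (conf_thr - 0.05) ** 2 - 0.5 * (nms_iou - 0.6) ** 2


best_params, best_score = random_search(toy_validation_map)
baseline_score = toy_validation_map(conf_thr=0.25, nms_iou=0.45)
```

With Optuna, the body of the loop would instead live in an objective function that draws both thresholds with `trial.suggest_float` and is passed to `study.optimize` on a study created with `direction="maximize"`. Either way, the thresholds are chosen on the validation split and only then applied to the test split.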

What Did Not Work

Class-specific training (dropping aircraft): Reduced ship and vehicle AP — inter-class context matters.
Inference-time tiling: Regressed performance for both top models (YOLOv11m-P2 −7.0% mAP@0.5; RT-DETR-L −28.2%) due to train-infer scale mismatch.
WBF ensemble: Modest overall gain (+0.6% mAP@[0.5:0.95]) and a larger small-object gain (+2.8% AP_S), but not enough to justify the 2× inference cost without further tuning.
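For readers unfamiliar with WBF (Weighted Boxes Fusion), the core idea can be sketched for a single class as follows. This is a simplified illustration, not the ensembling script used in the benchmark: the IoU threshold of 0.55, the function names, and the example boxes are assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Det = Tuple[Box, float]                  # (box, confidence)


def iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse_cluster(cluster: List[Det]) -> Det:
    """Confidence-weighted average of coordinates; mean confidence as the score."""
    total = sum(score for _, score in cluster)
    box = tuple(sum(b[k] * s for b, s in cluster) / total for k in range(4))
    return box, total / len(cluster)


def weighted_box_fusion(detections: List[Det], n_models: int,
                        iou_thr: float = 0.55) -> List[Det]:
    """Fuse single-class detections pooled from `n_models` detectors."""
    clusters: List[List[Det]] = []
    fused: List[Det] = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        # Find the fused box that overlaps this detection the most.
        best_idx, best_iou = -1, iou_thr
        for idx, (fused_box, _) in enumerate(fused):
            overlap = iou(box, fused_box)
            if overlap > best_iou:
                best_idx, best_iou = idx, overlap
        if best_idx < 0:
            clusters.append([(box, score)])
            fused.append((box, score))
        else:
            clusters[best_idx].append((box, score))
            fused[best_idx] = fuse_cluster(clusters[best_idx])
    # Down-weight boxes that only some of the models agree on.
    return [(b, s * min(len(c), n_models) / n_models)
            for (b, s), c in zip(fused, clusters)]


pooled = [((10, 10, 20, 20), 0.9), ((12, 10, 22, 20), 0.6), ((100, 100, 110, 110), 0.5)]
fused = weighted_box_fusion(pooled, n_models=2)
```

Unlike NMS, overlapping boxes are averaged rather than discarded, which is one plausible reason the ensemble helps small objects, where a one-pixel shift changes IoU noticeably.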

Reproducibility / How to Run

Caveat: Some research scripts still contain hard-coded paths from the local environment used during the thesis. The entry points below take configurable CLI arguments instead.

1. Environment Setup

# Requires Python 3.10–3.11
uv sync

2. Exploratory Data Analysis

uv run python src/scripts/dataset_eda.py \
  --dataset datasets/SkyFusion_yolo \
  --output materials/dataset_eda

3. Run Benchmark

# Complete benchmark (all models, full test set)
uv run python src/scripts/benchmark.py --mode complete

# Quick sanity check (20 samples, single model)
uv run python src/scripts/benchmark.py --mode simple --samples 20

4. Size-Bin Analysis

uv run python src/scripts/size_bin_benchmark.py \
  --dataset_path datasets/SkyFusion_yolo \
  --output_dir output/size-bin

5. Spatial Density Analysis

uv run python src/scripts/spatial_density_benchmark.py \
  --dataset_path datasets/SkyFusion_yolo \
  --output_dir output/spatial-density

Prerequisites: Model weights must be placed in models/. See models/README.md for exact filenames, optimal thresholds, and download links.

Repository Structure

small-objects-detection-benchmark/
├── docs/assets/                  # Figures for README
├── models/                       # Trained weights + model zoo README
│   └── README.md                 # Filenames, thresholds, usage examples
├── notebooks_and_scripts/        # Kaggle notebooks, local experiment scripts
│   ├── kaggle/
│   ├── local/
│   └── future_work/
├── src/
│   ├── odc/                      # Core library
│   │   ├── benchmark/            # Pipeline, adapters, metrics, visualizers
│   │   └── dataset_eda/          # EDA pipeline
│   └── scripts/                  # CLI entry points
│       ├── benchmark.py          # Main benchmark (simple/complete/enhanced)
│       ├── dataset_eda.py        # Dataset EDA
│       ├── size_bin_benchmark.py # Size-bin performance analysis
│       ├── spatial_density_benchmark.py
│       ├── generate_density_model_grid.py
│       ├── generate_tide_error_plots.py
│       ├── plot_size_bin_comparison.py
│       ├── count_density_bins.py
│       └── augment_dataset.py
├── pyproject.toml                # Dependencies (uv)
└── uv.lock

Artifacts and Links

Artifact	Link
Trained models	Kaggle Models
SkyFusion dataset	Kaggle Dataset
Code repository	GitHub (v1.0.0)

Discussion Points

Why DETR wins rare classes but YOLO wins tiny vehicles: RF-DETR achieves highest ship AP (0.471) via global attention over the full image, while YOLOv11m-P2's extra high-resolution feature map (P2) preserves fine-grained spatial detail critical for the dominant tiny-vehicle class.
Latency/accuracy/parameter trade-offs: YOLOv11m-P2 delivers near-DETR accuracy at 61 ms (vs 87–95 ms) with 20.5M params (vs 32–93M). That gap matters for edge deployment, where latency and memory budgets are tight.
Why threshold optimization materially changed ranking: Default conf=0.25 penalizes models with different confidence distributions. Optuna tuning on validation shifted YOLOv11m-P2 from 3rd to 1st on mAP@0.5 — showing that "out-of-the-box" rankings can be misleading.
Why failed tiling matters: Tiling magnifies content by ~1.25×, creating a train-infer scale mismatch. Models trained on 640×640 don't generalize to upscaled 512→640 tiles. This is a practical lesson for production aerial detection pipelines.
Productionization next steps: Integrate tiling with tile-aware training, apply Optuna jointly with WBF ensemble weights, deploy with TensorRT/ONNX quantization, and add active-learning feedback for rare-class samples.
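The arithmetic behind the ~1.25× scale mismatch in the tiling point can be made concrete. In the sketch below the tile size of 512 and the model input of 640 come from the text above; the stride of 128 and the helper name are assumptions, since the overlap used in the experiment is not stated here.

```python
from typing import List


def tile_origins(image_size: int, tile_size: int, stride: int) -> List[int]:
    """Top-left offsets of tiles along one axis, with the last tile flush to the edge."""
    if tile_size >= image_size:
        return [0]
    origins = list(range(0, image_size - tile_size + 1, stride))
    if origins[-1] != image_size - tile_size:
        origins.append(image_size - tile_size)
    return origins


IMAGE_SIZE = 640   # dataset resolution and model input size
TILE_SIZE = 512    # tile size implied by the 512 -> 640 upscaling
STRIDE = 128       # ASSUMPTION: hypothetical stride between tiles

origins = tile_origins(IMAGE_SIZE, TILE_SIZE, STRIDE)

# Each 512 px tile is resized to the 640 px model input, so content is magnified.
scale = IMAGE_SIZE / TILE_SIZE
apparent_side_px = 20 * scale   # a 20 px object side as the model sees it in a tile
```

An object the model learned to recognize at 20 px therefore appears at 25 px inside a tile, a size distribution it did not see during training.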

Limitations and Future Work

Single dataset: Results are specific to SkyFusion; generalization to other aerial datasets (DOTA, VisDrone) is untested.
Fixed input resolution: All models used 640×640; higher-resolution training could shift rankings.
No tile-aware training: Tiling was only applied at inference; training on tiles may recover the expected gains.
Ensemble not fully tuned: WBF used default thresholds; per-class weighting and expanded model diversity are open.
Hardware-specific latency: Inference times measured on GTX 1050 Ti; relative rankings may differ on other hardware.

Figures

Object Density Performance Comparison

Images were categorized into three buckets based on the number of ground-truth objects they contain:

Sparse: 0–9 objects
Medium: 10–29 objects
Dense: 30+ objects
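The bucketing rule above amounts to a simple function of the per-image ground-truth count. The function name and the example counts are illustrative.

```python
from collections import Counter


def density_bucket(n_objects: int) -> str:
    """Map a per-image ground-truth object count to a density bucket."""
    if n_objects < 0:
        raise ValueError("object count cannot be negative")
    if n_objects <= 9:
        return "Sparse"
    if n_objects <= 29:
        return "Medium"
    return "Dense"


# Example: counts for five hypothetical images.
counts = [0, 7, 10, 29, 30]
bucket_sizes = Counter(density_bucket(n) for n in counts)
```

An image with the dataset's median of 7 objects is "Sparse" under this rule.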

TIDE Error Decomposition

TIDE decomposes object detection errors into six categories to show how much AP is lost to each error type:

Cls (Classification): The detector finds the object with sufficient overlap, but predicts the wrong class.
Loc (Localization): The detector predicts the correct class, but the bounding box is not well aligned with the ground-truth object.
Both: The detection has both an incorrect class and poor localization.
Dupe (Duplicate): The detector produces multiple detections for the same ground-truth object; only one can be matched correctly, and the others count as duplicate errors.
Bkg (Background): The detector predicts an object where there is no matching ground-truth object, i.e. a background false positive.
Miss: The detector fails to produce a valid detection for a ground-truth object, i.e. a false negative.
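The per-detection categories can be sketched as a decision rule over IoU values. The thresholds follow the defaults described for TIDE (foreground 0.5, background 0.1); the function name, its inputs, and the order of the checks are simplifying assumptions rather than the library's actual implementation.

```python
T_FOREGROUND = 0.5  # IoU needed to count as a match
T_BACKGROUND = 0.1  # IoU below which a detection is treated as background


def classify_detection(iou_same_class: float,
                       iou_other_class: float,
                       gt_already_matched: bool) -> str:
    """Label one detection given its best IoU with same-class and other-class GT.

    `gt_already_matched` says whether the best same-class ground truth was
    already claimed by a higher-scoring detection.
    """
    if iou_same_class >= T_FOREGROUND:
        return "Dupe" if gt_already_matched else "TP"
    if iou_same_class >= T_BACKGROUND:
        return "Loc"    # right class, box not aligned well enough
    if iou_other_class >= T_FOREGROUND:
        return "Cls"    # well localized, wrong class
    if iou_other_class >= T_BACKGROUND:
        return "Both"   # wrong class and poorly localized
    return "Bkg"        # no meaningful overlap with any ground truth


# "Miss" is assigned to ground-truth objects rather than detections: any ground
# truth left without a valid match counts as a missed object.
labels = [
    classify_detection(0.7, 0.0, False),
    classify_detection(0.7, 0.0, True),
    classify_detection(0.3, 0.0, False),
    classify_detection(0.05, 0.7, False),
    classify_detection(0.05, 0.3, False),
    classify_detection(0.05, 0.05, False),
]
```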

FP vs FN

In TIDE, FP and FN are not raw counts. They are dAP values showing how much Average Precision (AP) would improve under two separate oracle analyses.

FP (False Positive): measures how much AP is lost because the detector produces invalid detections, such as background predictions, duplicate detections, or some wrong-class / badly localized predictions.
FN (False Negative): measures how much AP is lost because real ground-truth objects do not receive a valid matching detection.

How TIDE computes FP and FN contributions

TIDE computes each contribution as:

$$\Delta AP_o = AP_o - AP$$

where $AP$ is the original AP and $AP_o$ is the AP after applying oracle $o$.

For FP, TIDE removes all false positives and recomputes AP:

$$\Delta AP_{FP} = AP(\text{all false positives removed}) - AP$$

For FN, TIDE adjusts the recall denominator so recall becomes perfect without changing precision:

$$\Delta AP_{FN} = AP(N^{\prime}_{GT} = TP_{final}) - AP$$

So:

FP dAP = AP loss caused by bad extra detections
FN dAP = AP loss caused by missing valid detections for real objects

Because TIDE runs the FP and FN oracles separately, the FP and FN dAP values are not simply sums of the six main error categories.
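A toy example makes the two oracles concrete. The sketch below uses a simplified, uninterpolated AP over a ranked list of detections (COCO-style evaluation interpolates the precision-recall curve, so real numbers will differ), and the ranked list itself is made up for illustration.

```python
from typing import List


def average_precision(is_tp: List[bool], n_gt: int) -> float:
    """Uninterpolated AP for detections already sorted by descending score."""
    if n_gt == 0:
        return 0.0
    tp_so_far, precision_sum = 0, 0.0
    for rank, hit in enumerate(is_tp, start=1):
        if hit:
            tp_so_far += 1
            precision_sum += tp_so_far / rank  # precision at this recall step
    return precision_sum / n_gt


# Five detections ranked by score (True = matched a ground truth), four GT objects.
ranked = [True, False, True, False, True]
n_gt = 4

ap = average_precision(ranked, n_gt)

# FP oracle: drop every false positive, keep the ground-truth count.
ap_fp_oracle = average_precision([hit for hit in ranked if hit], n_gt)

# FN oracle: keep the detections, set the ground-truth count to the number of TPs.
ap_fn_oracle = average_precision(ranked, sum(ranked))

delta_ap_fp = ap_fp_oracle - ap
delta_ap_fn = ap_fn_oracle - ap
```

Here AP is about 0.567, the FP oracle raises it to 0.750, and the FN oracle raises it to about 0.756, so the two dAP values are roughly 0.183 and 0.189. They do not add up to the remaining gap to 1.0, because each oracle is applied to the original detections on its own.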
