This project benchmarks the inference performance of an Ultralytics YOLO model in three formats: native PyTorch (eager), TorchScript, and ONNX.
The goal is to compare inference speeds and determine which format is best suited for real-time deployment, edge devices, or production-grade applications.
- Measure and compare inference performance across different model formats
- Identify the most efficient format for low-latency applications
- Provide a reproducible benchmark for YOLO-based models on CUDA-enabled systems
- `export_model.py`: Exports the trained YOLO model to ONNX and TorchScript formats (see the sketch after this list).
- `compare.py`: Runs timed inference on each model format and calculates average FPS (a timing sketch appears after the benchmark setup below).
- `test_img/`: Folder containing the test image used for inference.
- `weight/`: Folder containing the model weights: `best.pt`, `best.torchscript`, `best.onnx`.
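A minimal sketch of what the export step might look like, using the Ultralytics `YOLO.export()` API. The weight path comes from the layout above; the `imgsz` value is an assumption chosen to match the benchmark input size, not necessarily the project's exact setting:

```python
from ultralytics import YOLO

# Load the trained weights (path taken from the repository layout above).
model = YOLO("weight/best.pt")

# Export to TorchScript and ONNX. imgsz=(448, 640) is an assumption
# matching the benchmark input size below; adjust as needed.
model.export(format="torchscript", imgsz=(448, 640))
model.export(format="onnx", imgsz=(448, 640))
```

By default, Ultralytics writes the exported files next to the source weights, which matches the `weight/` folder layout above.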
Install the following dependencies:

```bash
pip install ultralytics torch onnx onnxruntime
```
Test Device Specs:

- GPU: NVIDIA GeForce RTX 4060 Laptop GPU
- CUDA: 12.4
- cuDNN: 90100 (i.e. 9.1.0, as reported by `torch.backends.cudnn.version()`)
- cuDNN enabled: Yes
- CUDA available: Yes
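The specs above can be reproduced with PyTorch's built-in queries; a minimal sketch:

```python
import torch

print("GPU:", torch.cuda.get_device_name(0))        # device name
print("CUDA:", torch.version.cuda)                  # CUDA version torch was built with
print("cuDNN:", torch.backends.cudnn.version())     # e.g. 90100 -> 9.1.0
print("cuDNN enabled:", torch.backends.cudnn.enabled)
print("CUDA available:", torch.cuda.is_available())
```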
Benchmark setup:

- Input size: (1, 3, 448, 640)
- Runs: 10 (with 2 warm-up runs)
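`compare.py` itself is not reproduced here; the following is a minimal sketch of the kind of timing loop it likely runs, shown for the TorchScript model. The function and variable names are illustrative assumptions. Note the explicit CUDA synchronization before and after each timed call, without which GPU timings are misleading:

```python
import time
import torch

def benchmark(run_inference, runs=10, warmup=2):
    """Time a callable over `runs` iterations after `warmup` untimed runs."""
    for _ in range(warmup):
        run_inference()
    times = []
    for _ in range(runs):
        torch.cuda.synchronize()                 # finish any pending GPU work
        start = time.perf_counter()
        run_inference()
        torch.cuda.synchronize()                 # wait for this inference to complete
        times.append(time.perf_counter() - start)
    avg = sum(times) / len(times)
    return avg, min(times), max(times), 1.0 / avg  # avg, fastest, slowest, FPS

# Illustrative: random tensor at the benchmark input size instead of the test image.
x = torch.randn(1, 3, 448, 640, device="cuda")
model = torch.jit.load("weight/best.torchscript").to("cuda").eval()
with torch.no_grad():
    avg, fastest, slowest, fps = benchmark(lambda: model(x))
print(f"avg {avg:.4f}s  fastest {fastest:.4f}s  slowest {slowest:.4f}s  ~{fps:.2f} FPS")
```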
| Format | Avg. Time (s) | Fastest (s) | Slowest (s) | FPS (approx.) |
|---|---|---|---|---|
| Eager (PyTorch) | 0.0417 | 0.0361 | 0.0451 | 24.00 |
| TorchScript | 0.0348 | 0.0325 | 0.0385 | 28.70 |
| ONNX Runtime | 0.0374 | 0.0345 | 0.0405 | 26.72 |
TorchScript was the fastest format, likely due to reduced Python overhead and operator fusion. ONNX Runtime also performed well with consistent latency, while eager mode (standard PyTorch) was the slowest but remains the most flexible for debugging and experimentation.
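For reference, a minimal sketch of how the ONNX model can be run with ONNX Runtime. The input name is read from the model rather than assumed, and a random NumPy array stands in for the test image; note that the CUDA provider requires the `onnxruntime-gpu` package rather than plain `onnxruntime`:

```python
import numpy as np
import onnxruntime as ort

# Create a session on GPU, falling back to CPU if CUDA is unavailable.
session = ort.InferenceSession(
    "weight/best.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# ONNX Runtime takes NumPy inputs keyed by the model's input name.
input_name = session.get_inputs()[0].name
x = np.random.rand(1, 3, 448, 640).astype(np.float32)
outputs = session.run(None, {input_name: x})
print(outputs[0].shape)
```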