Problem
StereoDepthEstimator runs at ~1400-1600 ms per frame on MPS (Apple Silicon), which is too slow for real-time use. The FFS library hardcodes torch.amp.autocast('cuda', ...), which we now patch to MPS, but inference time is unchanged: the model is compute-bound, not precision-bound.
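For reference, a minimal sketch of how a redirect like the autocast patch could be written. This is not the actual patch on the feat/local branch; the helper name redirect_device and the monkeypatch shown in the trailing comment are assumptions for illustration.

```python
def redirect_device(factory, src="cuda", dst="mps"):
    """Wrap an autocast-like factory so requests for `src` go to `dst`.

    `factory` is any callable whose first positional argument (or the
    `device_type` keyword) names the device, as torch.amp.autocast does.
    """
    def wrapper(*args, **kwargs):
        if args and args[0] == src:
            # Device passed positionally: autocast('cuda', ...)
            args = (dst,) + args[1:]
        elif kwargs.get("device_type") == src:
            # Device passed by keyword: autocast(device_type='cuda', ...)
            kwargs = {**kwargs, "device_type": dst}
        return factory(*args, **kwargs)
    return wrapper


# Hypothetical usage (assumption, not taken from the FFS code base):
#   import torch
#   torch.amp.autocast = redirect_device(torch.amp.autocast)
```

The redirect has to be installed before FFS resolves the name. If the library binds autocast at import time (for example via from torch.amp import autocast), patching the torch.amp attribute afterwards has no effect on it.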
Options to investigate
- CoreML export — convert FFS to CoreML for Apple Neural Engine acceleration
- ONNX Runtime with CoreML execution provider
- Reduce resolution — currently 512x512, try 384x384
- TensorRT on NVIDIA for production deployments
- Alternative model — lighter stereo matching network (e.g. RAFT-Stereo small, CREStereo-lite)
- RemoteStereoDepthEstimator — offload to GPU server (already implemented)
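As a rough sanity check on the "Reduce resolution" option, the sketch below estimates the best-case gain from going 512x512 to 384x384. It assumes latency scales linearly with pixel count, which is only an approximation for a compute-bound model; fixed per-frame overhead would make the real gain smaller. The function name is hypothetical.

```python
def scaled_latency_ms(latency_ms, old_side, new_side):
    """Estimate latency after resizing a square input.

    Assumption: cost is proportional to the number of pixels.
    """
    return latency_ms * (new_side ** 2) / (old_side ** 2)


# Measured range from this issue: ~1400-1600 ms at 512x512.
low = scaled_latency_ms(1400, 512, 384)
high = scaled_latency_ms(1600, 512, 384)
print(f"estimated {low:.0f}-{high:.0f} ms per frame at 384x384")
```

Under that assumption the pixel count drops to about 56% of the original, so 384x384 alone would still be well short of real-time and would need to be combined with one of the other options.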
Notes
- bitsandbytes quantization is CUDA-only, not viable for MPS
- Reducing valid_iters from 4 to 2 did not noticeably improve performance
- The autocast patch (feat/local branch) confirms MPS is being used but the model architecture itself is the bottleneck
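To compare the options above on equal terms, a small timing harness along these lines may help. It is a sketch, not the script used for the numbers in this issue; time_per_call_ms and its parameters are hypothetical. GPU backends run asynchronously, so a sync callback is needed for the wall-clock time to reflect finished work.

```python
import statistics
import time


def time_per_call_ms(fn, warmup=3, runs=10, sync=None):
    """Return (median, min, max) latency of fn() in milliseconds.

    `sync` is an optional zero-argument callable that blocks until
    pending device work has finished (e.g. torch.mps.synchronize).
    """
    def run_once():
        fn()
        if sync is not None:
            sync()

    # Warm-up runs absorb one-time costs such as kernel compilation.
    for _ in range(warmup):
        run_once()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples), min(samples), max(samples)
```

On MPS this could be called as time_per_call_ms(lambda: estimator(left, right), sync=torch.mps.synchronize), where estimator, left, and right stand in for the StereoDepthEstimator instance and its input frames.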