GPU soft-NMS is slower because you use NumPy in the middle of torch ops.

torch executes GPU ops asynchronously, so when you call NumPy, torch has to wait for all pending ops to finish and copy the data to the CPU, then copy it back to the GPU when you call torch ops again, which is extremely slow.
https://github.com/DocF/Soft-NMS/blob/95dab79eac5c786f61fef2f6d5cd633eec7ecfd6/softnms_pytorch.py#L51
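One way to avoid the round trip is to do the overlap computation with torch ops only, so the tensors never leave the device. Below is a rough sketch, not the repo's actual code: the function name `iou_one_vs_rest` and the `(x1, y1, x2, y2)` box layout are assumptions made for illustration, so adjust them to whatever layout your detections use.

```python
import torch


def iou_one_vs_rest(box: torch.Tensor, rest: torch.Tensor) -> torch.Tensor:
    """IoU of one box against N boxes, using torch ops only.

    Hypothetical helper: assumes boxes are laid out as (x1, y1, x2, y2).
    box has shape (4,), rest has shape (N, 4), both on the same device.
    """
    # Corners of the intersection rectangle (broadcasts the 0-dim box values).
    x1 = torch.maximum(box[0], rest[:, 0])
    y1 = torch.maximum(box[1], rest[:, 1])
    x2 = torch.minimum(box[2], rest[:, 2])
    y2 = torch.minimum(box[3], rest[:, 3])

    # Clamp at zero so non-overlapping boxes get zero intersection.
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    area_box = (box[2] - box[0]) * (box[3] - box[1])
    area_rest = (rest[:, 2] - rest[:, 0]) * (rest[:, 3] - rest[:, 1])
    return inter / (area_box + area_rest - inter)


# Falls back to CPU when no GPU is available, so the sketch runs anywhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
box = torch.tensor([0.0, 0.0, 2.0, 2.0], device=device)
rest = torch.tensor(
    [[0.0, 0.0, 2.0, 2.0], [1.0, 1.0, 3.0, 3.0], [5.0, 5.0, 6.0, 6.0]],
    device=device,
)
iou = iou_one_vs_rest(box, rest)  # stays on `device`, no .cpu()/.numpy() calls
```

Since nothing here calls `.cpu()` or `.numpy()`, the result stays on the same device as the inputs and torch does not need to sync in the middle of the loop.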