I noticed that you don't cancel the gradient for large values when using the straight-through estimator here.
The QNN paper claims that "Not cancelling the gradient when r is too large significantly worsens performance".
Does this only matter for low-precision quantization (e.g. binary)?
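For reference, by "cancelling" I mean the clipped STE from the QNN paper, which zeroes the gradient wherever |r| falls outside the quantization range, rather than passing it through unchanged. A minimal NumPy sketch of the two backward variants (function names and the threshold of 1.0 are my own, for illustration):

```python
import numpy as np

def ste_backward_plain(grad_out, r):
    # Plain STE: the gradient passes through the quantizer unchanged.
    return grad_out

def ste_backward_cancel(grad_out, r, threshold=1.0):
    # Clipped STE: zero the gradient where |r| exceeds the threshold,
    # i.e. where the pre-quantization value is outside the clipping range.
    return grad_out * (np.abs(r) <= threshold)

r = np.array([-2.0, -0.5, 0.3, 1.5])
g = np.ones_like(r)
print(ste_backward_plain(g, r))   # gradient kept everywhere
print(ste_backward_cancel(g, r))  # gradient zeroed where |r| > 1
```

My question is whether skipping the `ste_backward_cancel` behaviour is harmless at higher bit widths, or whether it hurts here too.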