Hi --
I was wondering where you got the idea for the specific construction of the L-softmax. It seems like maybe you could achieve a similar goal by enforcing a margin like
norm(W) * norm(x) * (m * cos(theta) - m + 1)
instead of
norm(W) * norm(x) * cos(m * theta)
as you do in the paper.
The former seems simpler: there is no need to construct a psi function that behaves well for all values of theta, and m doesn't have to be integer valued. Also, in the paper, the gradient of psi is 0 at pi/2 (at least for even m, e.g. m = 2), which AFAICT is an undesirable side effect of the choice of psi. Is that right, or is there some reason that grad psi(pi/2) should be 0?
The proposed alternative above would have the same shape as cos on [0, pi] but with a range of [1 - 2m, 1] (it equals 1 at theta = 0 and 1 - 2m at theta = pi), which seems maybe more natural.
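To make the comparison concrete, here is a minimal sketch of the two angular functions, with the norm(W) * norm(x) factor left out. The piecewise form of psi is written as I understand it from the paper, and the names psi_paper, psi_linear and num_grad are just illustrative helpers, not anything from your code.

```python
import math

def psi_paper(theta, m):
    # Piecewise psi as I understand it from the paper (integer m only):
    # (-1)^k * cos(m * theta) - 2k  for theta in [k*pi/m, (k+1)*pi/m]
    k = min(int(theta * m / math.pi), m - 1)
    return (-1) ** k * math.cos(m * theta) - 2 * k

def psi_linear(theta, m):
    # Proposed alternative: an affine rescaling of cos(theta).
    # m can be any real number >= 1 here.
    return m * math.cos(theta) - m + 1

def num_grad(f, theta, m, eps=1e-6):
    # Central finite difference, only meant for eyeballing the slope.
    return (f(theta + eps, m) - f(theta - eps, m)) / (2 * eps)

if __name__ == "__main__":
    m = 2
    for theta in (0.0, math.pi / 4, math.pi / 2, 3 * math.pi / 4, math.pi):
        print(f"theta={theta:.3f}  paper={psi_paper(theta, m):+.3f}  "
              f"linear={psi_linear(theta, m):+.3f}")
    print("slope at pi/2, paper :", num_grad(psi_paper, math.pi / 2, m))
    print("slope at pi/2, linear:", num_grad(psi_linear, math.pi / 2, m))
```

For m = 2 both functions agree at the endpoints theta = 0 and theta = pi, but the paper's psi is flat at pi/2 while the alternative has slope -m * sin(theta) there, which is the difference I am asking about.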
Thoughts? Am I missing something? Did you try this and it stunk in practice?
Thanks