Why this particular construction?

Hi -- 

I was wondering where you got the idea for the specific construction of the L-softmax. It seems like you could achieve a similar goal by enforcing a margin like

`norm(W) * norm(x) * (m * cos(theta) - m + 1)` 

instead of 

`norm(W) * norm(x) * cos(m * theta)`

as you do in the paper.

The former seems simpler because you don't have to worry about constructing a `psi` function that behaves well for all values of `theta`, `m` doesn't have to be integer-valued, etc. Also, in the paper, the gradient of `psi` is 0 at `pi/2`, which AFAICT is an undesirable side effect of the choice of `psi`. Is that right, or is there some reason that `grad psi(pi/2)` should be 0?

The proposed alternative above would have the same shape as `cos` in `[0, pi]` but with a range of `[1 - 2m, 1]` (it equals `1` at `theta = 0` and `-m - m + 1` at `theta = pi`), which seems maybe more natural.
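To make the comparison concrete, here is a small numerical sketch of the two margin functions. It assumes the paper's `psi` is the piecewise form `(-1)^k * cos(m * theta) - 2k` on `[k*pi/m, (k+1)*pi/m]`, which is my reading of the paper and not something taken from this repo's code. The names `psi_paper`, `psi_linear`, and `num_grad` are made up for illustration.

```python
import math

def psi_paper(theta, m):
    # Assumed piecewise form from the paper (my reading, not the repo's code):
    # psi(theta) = (-1)^k * cos(m * theta) - 2k for theta in [k*pi/m, (k+1)*pi/m]
    k = min(int(theta * m / math.pi), m - 1)  # clamp so theta = pi falls in the last piece
    return (-1) ** k * math.cos(m * theta) - 2 * k

def psi_linear(theta, m):
    # Proposed alternative: m * cos(theta) - m + 1 (m need not be an integer here)
    return m * math.cos(theta) - m + 1

def num_grad(f, theta, m, eps=1e-6):
    # Central finite difference, only meant as a rough check of the slope
    return (f(theta + eps, m) - f(theta - eps, m)) / (2 * eps)

if __name__ == "__main__":
    m = 2
    for name, f in [("paper", psi_paper), ("linear", psi_linear)]:
        print(name,
              "psi(0) =", f(0.0, m),
              "psi(pi) =", f(math.pi, m),
              "grad at pi/2 =", num_grad(f, math.pi / 2, m))
```

For `m = 2`, both functions run from `1` at `theta = 0` down to `-3` at `theta = pi`. The difference is in the slope at `pi/2`: it is roughly `0` for the piecewise `psi` and roughly `-2` for the alternative.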

Thoughts?  Am I missing something?  Did you try this and it stunk in practice?  

Thanks
