Hey all!
While looking through the VPT code I noticed the use of "relative attention logits" in the Self-Attention layers, as seen here:
https://github.com/openai/Video-Pre-Training/blob/main/lib/xf.py#L342
My current hypothesis is that this R-stream serves as a learnable, data-dependent bias on the attention logits, which also seems consistent with the attention function and the b_nd matrix.
I was also wondering about the choice of nbasis = 10 as its per-head dimensionality; I read it as a form of bottleneck, but I am not sure how different values of nbasis would affect the network. (I've tried to make my reading concrete in the sketch below.)
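Here is a rough sketch of what I think the R-stream is computing. To be clear, the class name, the shapes, and the `q_to_basis` / `rel_emb` pieces are my own guesses for illustration, not the actual VPT code, and I'm skipping any relative-to-absolute shift the real implementation may do:

```python
import torch
import torch.nn as nn

class RelAttnBiasSketch(nn.Module):
    """My reading of the R-stream: a low-rank, learnable relative-position
    bias added to the usual QK attention logits. Names/shapes are guesses."""

    def __init__(self, n_heads: int, d_head: int, nbasis: int = 10, maxlen: int = 128):
        super().__init__()
        # One learnable embedding per relative offset, with only `nbasis`
        # dims per head -- this is the bottleneck I was asking about.
        self.rel_emb = nn.Parameter(torch.randn(n_heads, maxlen, nbasis) * 0.02)
        # Projecting the query into the same small basis is what would make
        # the bias data-dependent rather than a fixed positional bias.
        self.q_to_basis = nn.Linear(d_head, nbasis, bias=False)

    def forward(self, q: torch.Tensor, qk_logits: torch.Tensor) -> torch.Tensor:
        # q:         (batch, n_heads, T, d_head)
        # qk_logits: (batch, n_heads, T, T) content-content logits
        B, H, T, _ = q.shape
        q_basis = self.q_to_basis(q)           # (B, H, T, nbasis)
        rel = self.rel_emb[:, :T, :]           # (H, T, nbasis)
        # Relative logits: each query scores against per-offset embeddings.
        rel_logits = torch.einsum("bhqn,hkn->bhqk", q_basis, rel)
        # Added as a bias on top of the content-content logits.
        return qk_logits + rel_logits
```

If this reading is right, nbasis would cap the rank of the relative bias per head, which is why I was thinking of it as a bottleneck; but I may well be misreading the code.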
I would really appreciate any further insights, corrections and references to other resources regarding this.