Hey all!
While looking through the VPT code I noticed the use of "relative attention logits" in the Self-Attention layers, as seen here:
https://github.com/openai/Video-Pre-Training/blob/main/lib/xf.py#L342
My current hypothesis is that this R-stream serves as a learnable, data-dependent bias on the attention logits, which also seems consistent with the attention function and the b_nd matrix.
I was also wondering about the choice of nbasis = 10 as its per-head dimensionality; I read it as a form of bottleneck, but I am not sure how different values of nbasis would affect the network. (I've tried to make my reading concrete in the sketch below.)
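Here is a rough sketch of what I think the R-stream is computing. To be clear, the class name, the shapes, and the `q_to_basis` / `rel_emb` pieces are my own guesses for illustration, not the actual VPT code, and I'm skipping any relative-to-absolute shift the real implementation may do:

```python
import torch
import torch.nn as nn

class RelAttnBiasSketch(nn.Module):
    """My reading of the R-stream: a low-rank, learnable relative-position
    bias added to the usual QK attention logits. Names/shapes are guesses."""

    def __init__(self, n_heads: int, d_head: int, nbasis: int = 10, maxlen: int = 128):
        super().__init__()
        # One learnable embedding per relative offset, with only `nbasis`
        # dims per head -- this is the bottleneck I was asking about.
        self.rel_emb = nn.Parameter(torch.randn(n_heads, maxlen, nbasis) * 0.02)
        # Projecting the query into the same small basis is what would make
        # the bias data-dependent rather than a fixed positional bias.
        self.q_to_basis = nn.Linear(d_head, nbasis, bias=False)

    def forward(self, q: torch.Tensor, qk_logits: torch.Tensor) -> torch.Tensor:
        # q:         (batch, n_heads, T, d_head)
        # qk_logits: (batch, n_heads, T, T) content-content logits
        B, H, T, _ = q.shape
        q_basis = self.q_to_basis(q)           # (B, H, T, nbasis)
        rel = self.rel_emb[:, :T, :]           # (H, T, nbasis)
        # Relative logits: each query scores against per-offset embeddings.
        rel_logits = torch.einsum("bhqn,hkn->bhqk", q_basis, rel)
        # Added as a bias on top of the content-content logits.
        return qk_logits + rel_logits
```

If this reading is right, nbasis would cap the rank of the relative bias per head, which is why I was thinking of it as a bottleneck; but I may well be misreading the code.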
I would really appreciate any further insights, corrections and references to other resources regarding this.