Use of relative attention #27

@florianleopold

Description

Hey all!

While looking through the VPT code, I noticed the use of "relative attention logits" in the Self-Attention layers, as seen here:
https://github.com/openai/Video-Pre-Training/blob/main/lib/xf.py#L342

My hypothesis is that this R-stream acts as a learnable, data-dependent bias on the attention logits, which also seems consistent with the attention function and the b_nd matrix.
I was also wondering about the choice of nbasis = 10 as the per-head dimensionality for it; it looks like a form of bottleneck, but I am not sure how different values of nbasis would affect the network.
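
To make my reading concrete, here is a minimal sketch of how a data-dependent relative-position bias with an nbasis bottleneck could work. This is my own reconstruction, not the actual xf.py implementation; all names besides b_nd and nbasis are hypothetical. Each query position projects down to nbasis coefficients, a learned table over relative offsets maps each offset i - j to an nbasis-dimensional basis vector, and the bias for a (query, key) pair is the dot product of the two:

```python
import torch
import torch.nn as nn

class RelPosBias(nn.Module):
    """Hypothetical sketch: data-dependent relative-position bias.

    Each query's content is projected down to `nbasis` coefficients
    (the bottleneck), and a learned table `b_nd` assigns every relative
    offset d = i - j an `nbasis`-dimensional basis vector; the bias for
    a (query i, key j) pair is the dot product of the two.
    """

    def __init__(self, d_model, n_head, nbasis=10, maxlen=512):
        super().__init__()
        self.n_head, self.nbasis, self.maxlen = n_head, nbasis, maxlen
        # "R-stream" (my naming): per-head projection of query content
        # to nbasis coefficients.
        self.r_proj = nn.Linear(d_model, n_head * nbasis)
        # Learned basis over all relative offsets in [-(maxlen-1), maxlen-1].
        self.b_nd = nn.Parameter(torch.zeros(n_head, nbasis, 2 * maxlen - 1))

    def forward(self, x):                      # x: (B, T, d_model), T <= maxlen
        B, T, _ = x.shape
        # Coefficients per query position: (B, n_head, T, nbasis).
        r = self.r_proj(x).view(B, T, self.n_head, self.nbasis).transpose(1, 2)
        # Offsets i - j for every query/key pair, shifted to be >= 0.
        idx = torch.arange(T, device=x.device)
        rel = idx[:, None] - idx[None, :] + self.maxlen - 1   # (T, T)
        basis = self.b_nd[:, :, rel]                          # (n_head, nbasis, T, T)
        # bias[b, h, i, j] = sum_n r[b, h, i, n] * basis[h, n, i, j]
        return torch.einsum("bhin,hnij->bhij", r, basis)      # add to logits pre-softmax
```

Under this reading, the returned bias would simply be added to the raw QK^T logits before the softmax, and nbasis would trade expressiveness of the positional pattern against parameters and compute: setting nbasis equal to the head dimension would roughly recover a full Transformer-XL-style relative term, while small values force the positional patterns onto a few shared basis functions. Does this match the intent of the actual code?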

I would really appreciate any further insights, corrections, or references to other resources on this.
