Implement Relative Positional Multi-Head Attention in Transformer Variants #2

@rajveer43

Description

Hello! I’ve been following the development of this repository and appreciate the efforts to benchmark various efficient Transformer variants. I’d like to propose the implementation of Relative Positional Multi-Head Attention as an enhancement to the current models.

What is Relative Positional Multi-Head Attention?

Relative Positional Multi-Head Attention is a modification to the standard self-attention mechanism in Transformers. Traditional Transformers use absolute positional encodings to provide information about the position of tokens in a sequence. However, relative positional encodings allow the model to focus on the relative distance between tokens, which is often more relevant in tasks where the relationship between tokens matters more than their absolute position.

This method enhances the model's ability to capture local dependencies and handle sequences where the relative position of tokens plays a significant role. It is particularly beneficial for tasks like language modeling, where understanding the proximity of words to each other can be crucial.
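To make the proposal concrete, here is a minimal sketch of one head of relative-position attention in the style of Shaw et al. (2018), plus a multi-head wrapper. This is an illustration only, not code from this repository: all names (`relative_attention`, `relative_mha`, `rel_emb`, the weight matrices) are hypothetical, and the relative embeddings are added on the query–key logit side.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def relative_attention(q, k, v, rel_emb):
    """One attention head with relative position embeddings added to
    the attention logits (Shaw et al., 2018 style).

    q, k, v:  (seq_len, d_head)
    rel_emb:  (2*seq_len - 1, d_head), one vector per relative offset
              in [-(seq_len - 1), seq_len - 1].
    """
    seq_len, d_head = q.shape
    content = q @ k.T                     # (L, L) content-based scores
    # offsets[i, j] = j - i, shifted so it indexes into rel_emb
    offsets = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]
    position = np.einsum('id,ijd->ij', q, rel_emb[offsets + seq_len - 1])
    return softmax((content + position) / np.sqrt(d_head)) @ v

def relative_mha(x, w_q, w_k, w_v, w_o, rel_emb, n_heads):
    """Multi-head wrapper: project, run each head, concatenate, project out."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    heads = [
        relative_attention(
            q[:, h * d_head:(h + 1) * d_head],
            k[:, h * d_head:(h + 1) * d_head],
            v[:, h * d_head:(h + 1) * d_head],
            rel_emb[:, h * d_head:(h + 1) * d_head],
        )
        for h in range(n_heads)
    ]
    return np.concatenate(heads, axis=-1) @ w_o
```

Because the position term depends only on the offset `j - i`, the same `rel_emb` table generalizes across absolute positions, which is what gives the mechanism its ability to capture local dependencies.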

Reference papers:

- Attention Is All You Need (the original Transformer, which uses absolute positional encodings)
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (introduces a relative positional multi-head attention mechanism)
- Implementation example: Relative Positional Encodings in Transformer Models
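For the Transformer-XL variant cited above, the attention logit between positions i and j is decomposed into four terms. A minimal sketch of that score (the bias names `u` and `v` follow the paper; the function itself is hypothetical):

```python
import numpy as np

def transformer_xl_score(q_i, k_j, r_rel, u, v):
    """Attention logit for one query/key pair, per Transformer-XL.

    q_i, k_j: query and key vectors for positions i and j
    r_rel:    embedding of the relative offset i - j (sinusoidal in the paper)
    u, v:     learned per-head global bias vectors
    """
    return (q_i @ k_j        # (a) content-based addressing
            + q_i @ r_rel    # (b) content-dependent positional bias
            + u @ k_j        # (c) global content bias
            + v @ r_rel)     # (d) global positional bias
```

Terms (b)–(d) replace the absolute-position terms of the vanilla Transformer, which is what lets Transformer-XL attend beyond a fixed-length context.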
