Typically, the input sequence of shape $B \times T \times d$ (batch, sequence length, embedding dimension) is transformed before self-attention by adding a positional embedding, i.e. $x_m \leftarrow x_m + p_m$, where $p \in \mathbb{R}^{T \times d}$ can be learnable or a fixed function of position, e.g. sinusoidal.
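As a concrete illustration, here is a minimal NumPy sketch of the sinusoidal variant (the function name, the base of 10000, and the sin/cos interleaving follow the usual Transformer convention and are not details taken from this text):

```python
import numpy as np

def sinusoidal_positional_embedding(T, d, base=10000.0):
    """Fixed sinusoidal positional embedding of shape (T, d); assumes d is even."""
    positions = np.arange(T)[:, None]            # (T, 1)
    freqs = base ** (-np.arange(0, d, 2) / d)    # (d/2,) frequencies
    angles = positions * freqs                   # (T, d/2)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Add the positional embedding to a batch of token embeddings.
B, T, d = 2, 8, 16
x = np.random.randn(B, T, d)
x = x + sinusoidal_positional_embedding(T, d)    # broadcasts over the batch
```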
$\langle q_m, k_n \rangle = q_m^{\top} k_n$ is the inner product of the $m$-th query vector and the $n$-th key vector; it is the $(m, n)$ entry of the self-attention score matrix. This is the operation that lets tokens at different positions exchange information.

Ideally, one would like the relation $\langle f_q(x_m, m), f_k(x_n, n) \rangle = g(x_m, x_n, m - n)$ to hold, where $q_m = f_q(x_m, m)$ and $k_n = f_k(x_n, n)$ are the position-dependent query and key maps, i.e. the inner product should encode position information only in relative form, through the offset $m - n$.
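As a quick sanity check (my own illustration with random weights, not something from the original text), the additive scheme above does not satisfy this relation: keeping the offset fixed while shifting both positions changes the score.

```python
import numpy as np

# With additive positional embeddings, the query-key inner product is
# generally NOT a function of the relative offset n - m alone.
rng = np.random.default_rng(0)
d, T = 16, 64
x_a, x_b = rng.standard_normal(d), rng.standard_normal(d)       # two token embeddings
W_q, W_k = rng.standard_normal((d, d)), rng.standard_normal((d, d))
pe = rng.standard_normal((T, d))                                 # "learned" positional embeddings

def score(m, n):
    """Attention score <q_m, k_n> with additive positional embeddings."""
    q_m = W_q @ (x_a + pe[m])
    k_n = W_k @ (x_b + pe[n])
    return q_m @ k_n

# Same relative offset (n - m = 3), different absolute positions:
print(score(2, 5), score(20, 23))    # the two scores generally differ
```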
RoPE
Incorporating relative position information is straightforward: rotate the query and key embedding vectors by an angle that is a multiple of the token's position index.
The rotation matrix for position $m$ and block $i$ is
$$R_{m,i} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}, \qquad \text{where } \theta_i = 10000^{-2(i-1)/d}, \quad i = 1, \dots, d/2.$$
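This choice is what makes the relation from the previous section hold. The short derivation below is not spelled out in the text, but follows directly from the fact that rotation matrices are orthogonal and compose additively in the angle:
$$(R_{m,i}\, q)^{\top} (R_{n,i}\, k) = q^{\top} R_{m,i}^{\top} R_{n,i}\, k = q^{\top} R_{(n-m),\,i}\, k,$$
so the attention score depends on the positions $m$ and $n$ only through the offset $n - m$.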
For an embedding of dimension $d$, RoPE divides the embedding into $d/2$ two-dimensional blocks and rotates the $i$-th block at position $m$ by the rotation matrix $R_{m,i}$.
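Below is a minimal NumPy sketch of this block-wise rotation, assuming adjacent dimensions are paired into blocks and using the base-10000 frequencies from above; the function name is hypothetical.

```python
import numpy as np

def rope_rotate(v, m, base=10000.0):
    """Rotate a vector v of even dimension d as done at position m:
    split into d/2 two-dimensional blocks and rotate block i by angle m * theta_i."""
    d = v.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)   # theta_i for each block
    angles = m * theta                             # rotation angle per block
    cos, sin = np.cos(angles), np.sin(angles)
    v1, v2 = v[0::2], v[1::2]                      # components of each 2-D block
    out = np.empty_like(v)
    out[0::2] = v1 * cos - v2 * sin
    out[1::2] = v1 * sin + v2 * cos
    return out

# Numeric check: the query-key inner product depends only on the offset n - m.
rng = np.random.default_rng(0)
q, k = rng.standard_normal(16), rng.standard_normal(16)
s1 = rope_rotate(q, 2) @ rope_rotate(k, 5)      # positions (2, 5), offset 3
s2 = rope_rotate(q, 20) @ rope_rotate(k, 23)    # positions (20, 23), offset 3
print(np.isclose(s1, s2))                       # True
```

In practice the rotation is applied to the queries and keys after the linear projections and is vectorized over the whole sequence; implementations also differ in how they pair dimensions (adjacent pairs versus the first half with the second half).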