Typically, the input sequence $X \in \mathbb{R}^{B \times T \times d}$ (batch size $B$, sequence length $T$, embedding dimension $d$) is transformed before self-attention by adding a positional embedding, i.e. $\tilde{x}_m = x_m + p_m$, where the positional embeddings $p \in \mathbb{R}^{T \times d}$ can be learnable or fixed functions of the position, e.g. sinusoidal.
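
As a concrete illustration, here is a minimal NumPy sketch of adding a fixed sinusoidal positional embedding to a batch of token embeddings before attention. The function name and the sinusoidal schedule follow the original Transformer convention; none of the names below come from the text itself.

```python
import numpy as np

def sinusoidal_positions(T, d, base=10000.0):
    """Fixed sinusoidal positional embeddings of shape (T, d)."""
    positions = np.arange(T)[:, None]              # (T, 1)
    freqs = base ** (-np.arange(0, d, 2) / d)      # (d/2,) frequencies
    angles = positions * freqs                     # (T, d/2)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions: cos
    return pe

# x: token embeddings of shape (B, T, d); broadcasting adds the same
# positional table to every sequence in the batch.
B, T, d = 2, 16, 64
x = np.random.randn(B, T, d)
x_with_pos = x + sinusoidal_positions(T, d)[None, :, :]
```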

$\langle q_m, k_n \rangle$ is the inner product of the $m$-th query vector and the $n$-th key vector; it is the $(m, n)$ entry of the self-attention matrix. This is the operation that typically enables information to pass between tokens at different positions.
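
Expanding this inner product under additive absolute positional embeddings makes explicit how position enters the score (a standard decomposition; the projection matrices $W_q$ and $W_k$ are not defined in the text above):

$$\langle q_m, k_n \rangle = \big(W_q(x_m + p_m)\big)^\top W_k (x_n + p_n) = x_m^\top W_q^\top W_k x_n + x_m^\top W_q^\top W_k p_n + p_m^\top W_q^\top W_k x_n + p_m^\top W_q^\top W_k p_n.$$

The position-dependent terms involve $m$ and $n$ individually, not only through their difference $m - n$.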

In theory, one could hope that the relation

$$\langle f_q(x_m, m),\, f_k(x_n, n) \rangle = g(x_m, x_n, m - n)$$

is satisfied, i.e. that the inner product encodes position information only in relative form: $f_q$ and $f_k$ build the query and key from the embeddings and their positions, and $g$ depends on the positions only through the difference $m - n$.

RoPE

Incorporating the relative position embedding is straightforward: simply rotate the word embedding vector by angles proportional to its position index.
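
The reason a rotation of this form yields purely relative position dependence is a one-line check not spelled out above (here $R(\alpha)$ denotes the $2 \times 2$ rotation by angle $\alpha$): rotations compose by adding angles, so

$$\langle R(m\theta)\, q,\; R(n\theta)\, k \rangle = q^\top R(m\theta)^\top R(n\theta)\, k = q^\top R\big((n - m)\theta\big)\, k,$$

which depends on the positions $m$ and $n$ only through the difference $n - m$, exactly the relative form hoped for above.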

The rotation matrix is

$$R_{m,\theta_i} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}, \quad \text{where } \theta_i = 10000^{-2(i-1)/d}, \quad i = 1, \dots, d/2.$$

For an embedding of dimension $d$, RoPE divides the embedding into $d/2$ two-dimensional blocks and rotates the $i$-th block separately by the rotation matrix $R_{m,\theta_i}$, where $m$ is the token's position index.
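
A minimal NumPy sketch of this block-wise rotation follows. The function name is illustrative, and pairing dimensions $(2i-1, 2i)$ into blocks is one common implementation convention; the code is a sketch under these assumptions, not taken from the text.

```python
import numpy as np

def rope_rotate(x, base=10000.0):
    """Apply RoPE to x of shape (T, d): rotate each of the d/2
    two-dimensional blocks of token m by the angle m * theta_i."""
    T, d = x.shape
    assert d % 2 == 0, "embedding dimension must be even"
    theta = base ** (-np.arange(0, d, 2) / d)      # theta_i = base^(-2(i-1)/d), shape (d/2,)
    m = np.arange(T)[:, None]                      # position indices, shape (T, 1)
    angles = m * theta                             # (T, d/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                # the two coordinates of each block
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin             # first coordinate after rotation
    out[:, 1::2] = x1 * sin + x2 * cos             # second coordinate after rotation
    return out

# Rotating queries and keys this way makes the score q_m . k_n depend
# on the positions only through m - n.
q = np.random.randn(8, 64)
k = np.random.randn(8, 64)
q_rot, k_rot = rope_rotate(q), rope_rotate(k)
```

One can check numerically that the score `q_rot[m] @ k_rot[n]` is unchanged when both positions are shifted by the same offset, which is the relative-position property derived above.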