Positional Encoding
Absolute Positional Encoding
Relative Positional Encoding
ALiBi
RoPE
NoPE
[!tip] Ready for some math? A self-attention operator is permutation equivariant, while an attention operator with a learned query is permutation invariant.
Consider an image or feature map $X \in \mathbb{R}^{d \times n}$, where $n$ denotes the spatial dimension (the number of positions) and $d$ denotes the number of features. Let $\pi$ denote a permutation of $n$ elements. A transformation $T_\pi$ is called a spatial permutation if $T_\pi(X) = X P_\pi$, where $P_\pi$ denotes the permutation matrix associated with $\pi$, defined as $P_\pi = [e_{\pi(1)}, \dots, e_{\pi(n)}]$ with $e_i$ being a one-hot vector of length $n$ whose $i$-th element is 1.
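As a concrete illustration, here is a minimal NumPy sketch that builds the permutation matrix $P_\pi$ from a permutation $\pi$ and checks that $T_\pi(X) = X P_\pi$ simply reorders the columns of $X$. The sizes and variable names are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 5, 3                      # n spatial positions, d features (illustrative)
X = rng.normal(size=(d, n))      # feature map X in R^{d x n}

pi = rng.permutation(n)          # a permutation of {0, ..., n-1}

# Permutation matrix P_pi whose j-th column is the one-hot vector e_{pi(j)}.
P = np.zeros((n, n))
P[pi, np.arange(n)] = 1.0

# The spatial permutation T_pi(X) = X P_pi is just a column reordering of X.
assert np.allclose(X @ P, X[:, pi])
```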
[!algorithm] Definition
An operator $A$ is spatial permutation equivariant if $A(T_\pi(X)) = T_\pi(A(X))$ for any $X$ and any spatial permutation $T_\pi$. In addition, an operator $A$ is spatial permutation invariant if $A(T_\pi(X)) = A(X)$ for any $X$ and any spatial permutation $T_\pi$.
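The two definitions translate directly into numerical checks. The helpers below are a sketch (`perm_matrix`, `is_equivariant`, and `is_invariant` are hypothetical names); each tests the defining identity for one concrete pair $(X, \pi)$, which can falsify but not prove the universally quantified statement.

```python
import numpy as np

def perm_matrix(pi: np.ndarray) -> np.ndarray:
    """Permutation matrix P_pi whose j-th column is e_{pi(j)}."""
    n = len(pi)
    P = np.zeros((n, n))
    P[pi, np.arange(n)] = 1.0
    return P

def is_equivariant(A, X: np.ndarray, pi: np.ndarray) -> bool:
    """Check A(T_pi(X)) == T_pi(A(X)) for this particular X and pi."""
    P = perm_matrix(pi)
    return np.allclose(A(X @ P), A(X) @ P)

def is_invariant(A, X: np.ndarray, pi: np.ndarray) -> bool:
    """Check A(T_pi(X)) == A(X) for this particular X and pi."""
    P = perm_matrix(pi)
    return np.allclose(A(X @ P), A(X))
```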
[!algorithm] Proposition
A self-attention operator $A_s$ is permutation equivariant, while an attention operator $A_Q$ with a learned query is permutation invariant. In particular, denoting by $X$ the input matrix and by $T_\pi$ any spatial permutation, we have
$$A_s(T_\pi(X)) = T_\pi(A_s(X))$$
and
$$A_Q(T_\pi(X)) = A_Q(X).$$
When applying a spatial permutation $T_\pi$ to the input of a self-attention operator $A_s$, we have
$$
\begin{aligned}
A_s(T_\pi(X)) &= W_v T_\pi(X) \cdot \operatorname{softmax}\big((W_k T_\pi(X))^T \cdot W_q T_\pi(X)\big) \\
&= W_v X P_\pi \cdot \operatorname{softmax}\big((W_k X P_\pi)^T \cdot W_q X P_\pi\big) \\
&= W_v X P_\pi \cdot \operatorname{softmax}\big(P_\pi^T (W_k X)^T \cdot W_q X P_\pi\big) \\
&= W_v X P_\pi P_\pi^T \cdot \operatorname{softmax}\big((W_k X)^T \cdot W_q X\big) P_\pi \\
&= W_v X \cdot \operatorname{softmax}\big((W_k X)^T \cdot W_q X\big) P_\pi \\
&= T_\pi(A_s(X)).
\end{aligned}
$$
Note that $P_\pi P_\pi^T = I$ since $P_\pi$ is an orthogonal matrix. It is also easy to verify that
$$\operatorname{softmax}(P_\pi^T M P_\pi) = P_\pi^T \operatorname{softmax}(M)\, P_\pi$$
for any matrix $M$, since the softmax normalizes each column and a permutation only reorders its entries. Hence $A_s$ is spatial permutation equivariant. Similarly, when applying $T_\pi$ to the input of an attention operator $A_Q$ with a learned query $Q$, which is independent of the input $X$, we have
$$
\begin{aligned}
A_Q(T_\pi(X)) &= W_v T_\pi(X) \cdot \operatorname{softmax}\big((W_k T_\pi(X))^T \cdot Q\big) \\
&= W_v X P_\pi \cdot \operatorname{softmax}\big(P_\pi^T (W_k X)^T \cdot Q\big) \\
&= W_v X (P_\pi P_\pi^T) \cdot \operatorname{softmax}\big((W_k X)^T \cdot Q\big) \\
&= W_v X \cdot \operatorname{softmax}\big((W_k X)^T \cdot Q\big) \\
&= A_Q(X).
\end{aligned}
$$
Hence $A_Q$ is spatial permutation invariant.
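A quick numerical sanity check of the proposition, written as a minimal NumPy sketch: the weight shapes, the number of learned queries, and taking the softmax over the key axis are illustrative assumptions consistent with the formulas above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_k, d_v, m = 6, 4, 8, 5, 3   # illustrative sizes; m learned queries

def softmax(S, axis=0):
    S = S - S.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    E = np.exp(S)
    return E / E.sum(axis=axis, keepdims=True)

W_q = rng.normal(size=(d_k, d))
W_k = rng.normal(size=(d_k, d))
W_v = rng.normal(size=(d_v, d))
Q   = rng.normal(size=(d_k, m))     # learned query, independent of the input X

def A_s(X):
    """Self-attention: A_s(X) = W_v X softmax((W_k X)^T W_q X), softmax over keys."""
    return W_v @ X @ softmax((W_k @ X).T @ (W_q @ X), axis=0)

def A_Q(X):
    """Attention with a learned query: A_Q(X) = W_v X softmax((W_k X)^T Q)."""
    return W_v @ X @ softmax((W_k @ X).T @ Q, axis=0)

X = rng.normal(size=(d, n))
pi = rng.permutation(n)
P = np.zeros((n, n))
P[pi, np.arange(n)] = 1.0           # permutation matrix P_pi

# Equivariance of self-attention: A_s(X P_pi) == A_s(X) P_pi.
assert np.allclose(A_s(X @ P), A_s(X) @ P)

# Invariance of attention with a learned query: A_Q(X P_pi) == A_Q(X).
assert np.allclose(A_Q(X @ P), A_Q(X))
```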