
Positional Encoding

Absolute Positional Encoding

Relative Positional Encoding

ALiBi

RoPE

NoPE

[!tip] Ready for some math? A self-attention operator $A_s$ is permutation equivariant, while an attention operator with a learned query $A_Q$ is permutation invariant.


Consider an image or feature map $X \in \mathbb{R}^{d \times n}$, where $n$ denotes the spatial dimension and $d$ denotes the number of features. Let $\pi$ denote a permutation of $n$ elements. A transformation $T: \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$ is called a spatial permutation if $T(X) = X P_\pi$, where $P_\pi \in \mathbb{R}^{n \times n}$ denotes the permutation matrix associated with $\pi$, defined as $P_\pi = [e_{\pi(1)}, e_{\pi(2)}, \dots, e_{\pi(n)}]$ with $e_i$ being the one-hot vector of length $n$ whose $i$-th element is 1.
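For concreteness, here is a minimal NumPy sketch of a spatial permutation; the sizes and the particular permutation are arbitrary choices for illustration. Right-multiplying by $P_\pi$ reorders the columns (spatial positions) of $X$.

```python
import numpy as np

d, n = 3, 5
X = np.arange(d * n).reshape(d, n)      # toy feature map: columns are spatial positions

perm = np.array([2, 0, 4, 1, 3])        # an example permutation pi (0-indexed)
P = np.eye(n)[:, perm]                  # P_pi = [e_pi(1), ..., e_pi(n)] as columns

T_X = X @ P                             # spatial permutation T_pi(X) = X P_pi
assert np.array_equal(T_X, X[:, perm])  # column j of T_pi(X) is column pi(j) of X
```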

[!algorithm] definition

An operator $A: \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$ is spatial permutation equivariant if $T_\pi(A(X)) = A(T_\pi(X))$ for any $X$ and any spatial permutation $T_\pi$. In addition, an operator $A: \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$ is spatial permutation invariant if $A(T_\pi(X)) = A(X)$ for any $X$ and any spatial permutation $T_\pi$.
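As a concrete reading of these definitions, here is a minimal sketch of how one might check them numerically for a given operator on a single input and permutation; the helper names `is_equivariant` and `is_invariant`, and the two example operators, are illustrative rather than from the text.

```python
import numpy as np

def is_equivariant(A, X, P):
    """Check T_pi(A(X)) == A(T_pi(X)) for one input X and one permutation matrix P."""
    return np.allclose(A(X) @ P, A(X @ P))

def is_invariant(A, X, P):
    """Check A(T_pi(X)) == A(X) for one input X and one permutation matrix P."""
    return np.allclose(A(X @ P), A(X))

# Illustration: an element-wise map is equivariant; mean-pooling over positions
# (broadcast back to shape d x n) is invariant.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))
P = np.eye(6)[:, rng.permutation(6)]
assert is_equivariant(np.tanh, X, P)
assert is_invariant(lambda Z: np.broadcast_to(Z.mean(axis=1, keepdims=True), Z.shape), X, P)
```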

[!algorithm] proposition

A self-attention operator $A_s$ is permutation equivariant, while an attention operator with a learned query $A_Q$ is permutation invariant. In particular, denoting by $X$ the input matrix and by $T_\pi$ any spatial permutation, we have

$$A_s(T_\pi(X)) = T_\pi(A_s(X)),$$

and

$$A_Q(T_\pi(X)) = A_Q(X).$$

When applying a spatial permutation $T_\pi$ to the input $X$ of a self-attention operator $A_s$, we have

$$
\begin{aligned}
A_s(T_\pi(X)) &= W_v T_\pi(X) \cdot \operatorname{softmax}\big((W_k T_\pi(X))^T \cdot W_q T_\pi(X)\big) \\
&= W_v X P_\pi \cdot \operatorname{softmax}\big((W_k X P_\pi)^T \cdot W_q X P_\pi\big) \\
&= W_v X P_\pi \cdot \operatorname{softmax}\big(P_\pi^T (W_k X)^T \cdot W_q X P_\pi\big) \\
&= W_v X P_\pi P_\pi^T \cdot \operatorname{softmax}\big((W_k X)^T \cdot W_q X\big) P_\pi \\
&= W_v X \cdot \operatorname{softmax}\big((W_k X)^T \cdot W_q X\big) P_\pi \\
&= T_\pi(A_s(X)).
\end{aligned}
$$

Note that $P_\pi^T P_\pi = I$ since $P_\pi$ is an orthogonal matrix. It is also easy to verify that

"softmax"(PpiTMPpi)=PpiT"softmax"(M)Ppi"softmax"(P_pi^T M P_pi)=P_pi^T "softmax"(M)P_pi

for any matrix $M$, since the column-wise softmax commutes with permuting the rows and columns of its argument. Hence $A_s$ is spatial permutation equivariant. Similarly, when applying $T_\pi$ to the input of an attention operator $A_Q$ with a learned query $Q$, which is independent of the input $X$, we have

$$
\begin{aligned}
A_Q(T_\pi(X)) &= W_v T_\pi(X) \cdot \operatorname{softmax}\big((W_k T_\pi(X))^T \cdot Q\big) \\
&= W_v X P_\pi \cdot \operatorname{softmax}\big(P_\pi^T (W_k X)^T \cdot Q\big) \\
&= W_v X (P_\pi P_\pi^T) \cdot \operatorname{softmax}\big((W_k X)^T \cdot Q\big) \\
&= W_v X \cdot \operatorname{softmax}\big((W_k X)^T \cdot Q\big) \\
&= A_Q(X).
\end{aligned}
$$

Hence $A_Q$ is spatial permutation invariant.
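To complement the proof, here is a minimal NumPy sketch that checks the softmax identity and both claims numerically on random matrices. The shapes, the random weight matrices, and the column-wise softmax convention are assumptions made only for this illustration, chosen to match the formulas above.

```python
import numpy as np

def softmax(M, axis=0):
    # column-wise softmax, matching the convention used in the derivation
    E = np.exp(M - M.max(axis=axis, keepdims=True))
    return E / E.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n = 4, 6
X = rng.normal(size=(d, n))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q = rng.normal(size=(d, n))                  # learned query, independent of X

P = np.eye(n)[:, rng.permutation(n)]         # a random permutation matrix P_pi
M = rng.normal(size=(n, n))

# softmax(P^T M P) = P^T softmax(M) P
assert np.allclose(softmax(P.T @ M @ P), P.T @ softmax(M) @ P)

def A_s(X):  # self-attention
    return W_v @ X @ softmax((W_k @ X).T @ (W_q @ X))

def A_Q(X):  # attention with a learned query Q
    return W_v @ X @ softmax((W_k @ X).T @ Q)

assert np.allclose(A_s(X @ P), A_s(X) @ P)   # equivariance: A_s(T_pi(X)) = T_pi(A_s(X))
assert np.allclose(A_Q(X @ P), A_Q(X))       # invariance:   A_Q(T_pi(X)) = A_Q(X)
```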

