
Positional Encoding

Absolute Positional Encoding

Relative Positional Encoding

ALiBi

RoPE

NoPE

[!tip] Ready for some math? A self-attention operator $A_s$ is permutation equivariant, while an attention operator with a learned query $A_Q$ is permutation invariant.


Consider an image or feature map $X \in \mathbb{R}^{d \times n}$, where $n$ denotes the spatial dimension and $d$ denotes the number of features. Let $\pi$ denote a permutation of $n$ elements. A transformation $T: \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$ is called a spatial permutation if $T(X) = X P_\pi$, where $P_\pi \in \mathbb{R}^{n \times n}$ denotes the permutation matrix associated with $\pi$, defined as $P_\pi = [e_{\pi(1)}, e_{\pi(2)}, \dots, e_{\pi(n)}]$ with $e_i$ being the one-hot vector of length $n$ whose $i$-th element is 1.
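For concreteness, here is a minimal NumPy sketch of a spatial permutation; the sizes and the particular permutation are arbitrary choices for illustration. Right-multiplying by $P_\pi$ reorders the columns (spatial positions) of $X$.

```python
import numpy as np

d, n = 3, 5
X = np.arange(d * n).reshape(d, n)      # toy feature map: columns are spatial positions

perm = np.array([2, 0, 4, 1, 3])        # an example permutation pi (0-indexed)
P = np.eye(n)[:, perm]                  # P_pi = [e_pi(1), ..., e_pi(n)] as columns

T_X = X @ P                             # spatial permutation T_pi(X) = X P_pi
assert np.array_equal(T_X, X[:, perm])  # column j of T_pi(X) is column pi(j) of X
```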

[!algorithm] definition

An operator $A: \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$ is spatial permutation equivariant if $T_\pi(A(X)) = A(T_\pi(X))$ for any $X$ and any spatial permutation $T_\pi$. In addition, an operator $A: \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$ is spatial permutation invariant if $A(T_\pi(X)) = A(X)$ for any $X$ and any spatial permutation $T_\pi$.
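As a concrete reading of these definitions, here is a minimal sketch of how one might check them numerically for a given operator on a single input and permutation; the helper names `is_equivariant` and `is_invariant`, and the two example operators, are illustrative rather than from the text.

```python
import numpy as np

def is_equivariant(A, X, P):
    """Check T_pi(A(X)) == A(T_pi(X)) for one input X and one permutation matrix P."""
    return np.allclose(A(X) @ P, A(X @ P))

def is_invariant(A, X, P):
    """Check A(T_pi(X)) == A(X) for one input X and one permutation matrix P."""
    return np.allclose(A(X @ P), A(X))

# Illustration: an element-wise map is equivariant; mean-pooling over positions
# (broadcast back to shape d x n) is invariant.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))
P = np.eye(6)[:, rng.permutation(6)]
assert is_equivariant(np.tanh, X, P)
assert is_invariant(lambda Z: np.broadcast_to(Z.mean(axis=1, keepdims=True), Z.shape), X, P)
```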

[!algorithm] proposition

A self-attention operator $A_s$ is permutation equivariant, while an attention operator with a learned query $A_Q$ is permutation invariant. In particular, denoting by $X$ the input matrix and by $T_\pi$ any spatial permutation, we have

$$A_s(T_\pi(X)) = T_\pi(A_s(X)),$$

and

$$A_Q(T_\pi(X)) = A_Q(X).$$

When applying a spatial permutation $T_\pi$ to the input $X$ of a self-attention operator $A_s$, we have

$$
\begin{aligned}
A_s(T_\pi(X)) &= W_v T_\pi(X) \cdot \operatorname{softmax}\big((W_k T_\pi(X))^T \cdot W_q T_\pi(X)\big) \\
&= W_v X P_\pi \cdot \operatorname{softmax}\big((W_k X P_\pi)^T \cdot W_q X P_\pi\big) \\
&= W_v X P_\pi \cdot \operatorname{softmax}\big(P_\pi^T (W_k X)^T \cdot W_q X P_\pi\big) \\
&= W_v X P_\pi P_\pi^T \cdot \operatorname{softmax}\big((W_k X)^T \cdot W_q X\big) P_\pi \\
&= W_v X \cdot \operatorname{softmax}\big((W_k X)^T \cdot W_q X\big) P_\pi \\
&= T_\pi(A_s(X)).
\end{aligned}
$$

Note that $P_\pi^T P_\pi = I$ since $P_\pi$ is an orthogonal matrix. It is also easy to verify that

"softmax"(PpiTMPpi)=PpiT"softmax"(M)Ppi"softmax"(P_pi^T M P_pi)=P_pi^T "softmax"(M)P_pi

for any matrix $M$, since the column-wise softmax commutes with permuting the rows and columns of its argument. Hence $A_s$ is spatial permutation equivariant. Similarly, when applying $T_\pi$ to the input of an attention operator $A_Q$ with a learned query $Q$, which is independent of the input $X$, we have

$$
\begin{aligned}
A_Q(T_\pi(X)) &= W_v T_\pi(X) \cdot \operatorname{softmax}\big((W_k T_\pi(X))^T \cdot Q\big) \\
&= W_v X P_\pi \cdot \operatorname{softmax}\big(P_\pi^T (W_k X)^T \cdot Q\big) \\
&= W_v X (P_\pi P_\pi^T) \cdot \operatorname{softmax}\big((W_k X)^T \cdot Q\big) \\
&= W_v X \cdot \operatorname{softmax}\big((W_k X)^T \cdot Q\big) \\
&= A_Q(X).
\end{aligned}
$$

Hence $A_Q$ is spatial permutation invariant.
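To complement the proof, here is a minimal NumPy sketch that checks the softmax identity and both claims numerically on random matrices. The shapes, the random weight matrices, and the column-wise softmax convention are assumptions made only for this illustration, chosen to match the formulas above.

```python
import numpy as np

def softmax(M, axis=0):
    # column-wise softmax, matching the convention used in the derivation
    E = np.exp(M - M.max(axis=axis, keepdims=True))
    return E / E.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n = 4, 6
X = rng.normal(size=(d, n))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q = rng.normal(size=(d, n))                  # learned query, independent of X

P = np.eye(n)[:, rng.permutation(n)]         # a random permutation matrix P_pi
M = rng.normal(size=(n, n))

# softmax(P^T M P) = P^T softmax(M) P
assert np.allclose(softmax(P.T @ M @ P), P.T @ softmax(M) @ P)

def A_s(X):  # self-attention
    return W_v @ X @ softmax((W_k @ X).T @ (W_q @ X))

def A_Q(X):  # attention with a learned query Q
    return W_v @ X @ softmax((W_k @ X).T @ Q)

assert np.allclose(A_s(X @ P), A_s(X) @ P)   # equivariance: A_s(T_pi(X)) = T_pi(A_s(X))
assert np.allclose(A_Q(X @ P), A_Q(X))       # invariance:   A_Q(T_pi(X)) = A_Q(X)
```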

