Speculative Decoding

Algorithm of Speculative Decoding Step

Inputs: $M_p, M_q, "prefix"$.

Sample $gamma$ guesses $x_(1,dots.h.c,gamma)$ from $M_q$ autoregressively.

for $i = 1$ to $gamma$ do

$q_i(x) <- M_q("prefix" + [x_1,dots.h.c, x_(i−1)])$

$x_i tilde.op q_i(x)$

end for

Run $M_p$ in parallel.

$p_1(x),dots.h.c,p_(gamma+1)(x) <- M_p("prefix"),dots.h.c, M_p("prefix" + [x_1,dots.h.c, x_gamma])$

Determine the number of accepted guesses $n$.

$r_1 tilde.op U(0, 1),dots.h.c, r_gamma tilde.op U(0, 1)$

$n <- min({i−1|1 lt.eq i lt.eq gamma, r_i>p_i(x)/q_i(x)} union {gamma})$

Adjust the distribution from $M_p$ if needed.

$tilde(p)(x) <- p_(n+1)(x)$

if $n<gamma$ then

$tilde(p)(x) <- "norm"(max {0,p_(n+1)(x) − q_(n+1)(x)})$

end if

Return one token from $M_p$, and $n$ tokens from $M_q$.

$t tilde.op tilde(p)(x)$

return $"prefix" + [x_1, dots.h.c, x_n, t]$
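The step above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: `target_probs` and `draft_probs` are hypothetical callables standing in for $M_p$ and $M_q$, each assumed to map a token list to a probability vector over the vocabulary. Indices are 0-based, so `p[n]` plays the role of $p_(n+1)(x)$.

```python
import numpy as np

def speculative_step(target_probs, draft_probs, prefix, gamma, rng):
    """One speculative decoding step (sketch).

    target_probs / draft_probs are hypothetical stand-ins for M_p and M_q:
    callables mapping a list of tokens to a 1-D probability vector.
    """
    # Sample gamma guesses from the draft model autoregressively.
    guesses, q = [], []
    for _ in range(gamma):
        q_i = draft_probs(prefix + guesses)
        q.append(q_i)
        guesses.append(int(rng.choice(len(q_i), p=q_i)))

    # Score all gamma + 1 prefixes with the target model.
    # (A real system would batch these calls so they run in parallel.)
    p = [target_probs(prefix + guesses[:i]) for i in range(gamma + 1)]

    # Determine n: guess i is rejected when r_i > p_i(x_i) / q_i(x_i).
    r = rng.uniform(size=gamma)
    n = gamma
    for i, x in enumerate(guesses):
        if r[i] > p[i][x] / q[i][x]:
            n = i
            break

    # Adjust the target distribution if a guess was rejected.
    p_adj = p[n]
    if n < gamma:
        p_adj = np.maximum(0.0, p[n] - q[n])
        p_adj = p_adj / p_adj.sum()

    # Return n draft tokens plus one token drawn from the (adjusted) target.
    t = int(rng.choice(len(p_adj), p=p_adj))
    return prefix + guesses[:n] + [t]
```

Each call extends the prefix by between 1 and $gamma + 1$ tokens, depending on how many guesses are accepted.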

Correctness of Speculative Sampling

We will now show that for any distributions $p(x)$ and $q(x)$, the tokens sampled via speculative sampling from $p(x)$ and $q(x)$ are distributed identically to those sampled from $p(x)$ alone. Let $beta$ be the acceptance probability (Definition).

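Note that

$ tilde(p)(x) &= "norm"(max {0, p(x)-q(x)}) \
&= (p(x)-min{q(x), p(x)}) / (sum_(x prime) (p(x prime)-min{q(x prime), p(x prime)})) \
&= (p(x)-min{q(x), p(x)}) / (1-beta), $

so the normalizing constant for the adjusted distribution $tilde(p)(x)$ is $1−beta$. The second equality uses $max {0, p(x)-q(x)} = p(x)-min{q(x), p(x)}$, and the last uses $sum_(x prime) p(x prime) = 1$ together with $beta = sum_(x prime) min{q(x prime), p(x prime)}$, which follows immediately from Lemma 3.3 and Theorem 3.5.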

Now:

$ P(x=x prime) = P("guess accepted", x=x prime) + P("guess rejected", x=x prime) $

Where:

$ P("guess accepted", x=x prime) = q(x prime) min {1,p(x prime)/q(x prime)} = min{q(x prime), p(x prime)} $

And:

$ P("guess rejected", x=x prime) = (1−beta)tilde(p)(x prime) = p(x prime) − min{q(x prime), p(x prime)} $

Overall:

$ P(x=x prime) = min{p(x prime), q(x prime)} + p(x prime) − min {p(x prime), q(x prime)} = p(x prime). $

As desired.
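The argument can also be checked numerically for a single token. The sketch below assumes `p` and `q` are NumPy probability vectors over the same vocabulary; the function name and example values are illustrative only. It adds the accepted and rejected contributions exactly as in the derivation and compares the result with `p`.

```python
import numpy as np

def speculative_output_distribution(p, q):
    """Exact distribution of one token produced by speculative sampling."""
    accepted = np.minimum(p, q)        # P(guess accepted, x = x')
    beta = accepted.sum()              # acceptance probability
    residual = np.maximum(0.0, p - q)  # unnormalized adjusted distribution
    total = residual.sum()
    if total > 0:
        rejected = (1.0 - beta) * residual / total  # P(guess rejected, x = x')
    else:
        rejected = np.zeros_like(p)    # p == q: guesses are never rejected
    return accepted + rejected

# Illustrative distributions over a three-token vocabulary.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
print(np.allclose(speculative_output_distribution(p, q), p))
```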

Definition

The acceptance rate $beta_(x_(lt t))$, given a prefix $x_(lt t)$, is the probability of accepting $x_t tilde.op q(x_t |x_(lt t))$ by speculative sampling.
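As a rough illustration of this definition, the acceptance rate for fixed next-token distributions can be computed in closed form by summing the accepted-case probability $min{q(x), p(x)}$ over the vocabulary, and estimated by simulating the accept/reject test. The function names below are hypothetical, and `p` and `q` are assumed to be NumPy probability vectors for one fixed prefix.

```python
import numpy as np

def acceptance_rate_exact(p, q):
    """Closed form: sum over x of min{p(x), q(x)}."""
    return float(np.minimum(p, q).sum())

def acceptance_rate_simulated(p, q, num_samples, rng):
    """Monte Carlo estimate: draw x ~ q and accept when r <= p(x) / q(x)."""
    x = rng.choice(len(q), size=num_samples, p=q)
    r = rng.uniform(size=num_samples)
    accepted = r <= p[x] / q[x]
    return float(accepted.mean())
```

With enough samples the simulated value should approach the closed-form one, up to sampling noise.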

Lemma

Define

$ D_(L K)(p, q)=sum_x |p(x) − M(x)|=sum_x |q(x) − M(x)| $

where $M(x)=(p(x)+q(x))/2$. Then

$ D_(L K)(p, q) = 1−sum_x min{p(x), q(x)} $

Proof.

$ D_(L K)(p, q) = sum_x |p(x) − M(x)| = sum_x (|p(x)−q(x)|) / 2 = 1−sum_x (p(x)+q(x)−|p(x)−q(x)|)/2 = 1-sum_x min{p(x), q(x)} $
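A quick numerical sanity check of the lemma, assuming `p` and `q` are NumPy probability vectors (drawn at random here purely for illustration; `d_lk` is a hypothetical helper name):

```python
import numpy as np

def d_lk(p, q):
    """D_LK(p, q) = sum over x of |p(x) - M(x)|, with M = (p + q) / 2."""
    m = (p + q) / 2
    return float(np.abs(p - m).sum())

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(8))  # random distribution over 8 tokens
q = rng.dirichlet(np.ones(8))

m = (p + q) / 2
# Both forms of the definition agree.
assert np.isclose(d_lk(p, q), np.abs(q - m).sum())
# The lemma: D_LK(p, q) = 1 - sum over x of min{p(x), q(x)}.
assert np.isclose(d_lk(p, q), 1 - np.minimum(p, q).sum())
```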
