Divergent

Deep Learning

Backpropagation

[!tip] Dive deep (pun intended) into neural nets! Try to become a backprop ninja with Karpathy’s courses.

Derive Backprop Gradients step by step

dlogprobs = torch.zeros_like(logprobs)
dlogprobs[range(n), Yb] = -1.0 / n
dprobs = dlogprobs / probs
dcounts_sum_inv = (dprobs * counts).sum(1, keepdim=True)
dcounts = dprobs * counts_sum_inv
dcounts_sum = -dcounts_sum_inv * counts_sum ** -2
dcounts += dcounts_sum.broadcast_to(counts.shape)
dnorm_logits = dcounts * norm_logits.exp() # norm_logits.exp() is actually counts
dlogit_maxes = (-dnorm_logits).sum(1, keepdim=True)
dlogits = dnorm_logits.clone()
tmp = torch.zeros_like(logits)
tmp[range(n), logits.max(1, keepdim=True).indices.view(-1)] = 1 # try F.one_hot
dlogits += dlogit_maxes * tmp
dh = dlogits @ W2.T
dW2 = h.T @ dlogits
db2 = dlogits.sum(0, keepdim=False)
# dhpreact = dh * (1 - torch.tanh(hpreact) ** 2)
# dhpreact = (1.0 - h ** 2) * dh # figure out later
dhpreact = hpreact.grad.clone()
dbngain = (dhpreact * bnraw).sum(0, keepdim=True)
dbnbias = dhpreact.sum(0, keepdim=True)
dbnraw = dhpreact * bngain
dbnvar_inv = (dbnraw * bndiff).sum(0, keepdim=True)
dbndiff = dbnraw * bnvar_inv
# dbnvar = dbnvar_inv * (-0.5) * bnvar_inv ** 3
dbnvar = dbnvar_inv * (-0.5) * (bnvar + 1e-5) ** -1.5
dbndiff2 = 1.0 / (n-1) * torch.ones_like(bndiff2) * dbnvar
dbndiff += 2 * bndiff * dbndiff2
dbnmeani = -dbndiff.sum(0, keepdim=True)
dhprebn = dbndiff.clone()
dhprebn += dbnmeani * 1.0 / n * torch.ones_like(hprebn)
dembcat = dhprebn @ W1.T
dW1 = embcat.T @ dhprebn
db1 = dhprebn.sum(0, keepdim=False)
demb = dembcat.view(emb.shape)
# scatter-add the embedding gradients back into the lookup table C
dC = torch.zeros_like(C)
for k in range(Xb.shape[0]):
    for j in range(Xb.shape[1]):
        ix = Xb[k,j]
        dC[ix] += demb[k,j]
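
Each of these manual gradients can be checked against PyTorch's autograd. A minimal comparison helper along these lines (a sketch in the spirit of Karpathy's notebook; it assumes the intermediate tensors above had retain_grad() called before loss.backward()):

def cmp(s, dt, t):
    # compare a manually derived gradient `dt` against autograd's `t.grad`
    ex = torch.all(dt == t.grad).item()
    app = torch.allclose(dt, t.grad)
    maxdiff = (dt - t.grad).abs().max().item()
    print(f'{s:15s} | exact: {str(ex):5s} | approximate: {str(app):5s} | maxdiff: {maxdiff}')

cmp('logprobs', dlogprobs, logprobs)
cmp('probs', dprobs, probs)
cmp('logits', dlogits, logits)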

dlogprobs

Notation-wise, dlogprobs stands for the gradient of the loss with respect to logprobs. To start, we have the following

"loss"=1/nsum(i=1)sum(jinYb)"logprobs"(i,j)"loss" = -1/n sum_(i=1)sum_(j in Y_b)"logprobs"_(i, j)

which easily yields the gradient as follows.

("dloss" / "dlogprobs")_(i,j)= cases( -1/n"," & quad j in Y_b, 0"," & quad "otherwise")
dlogprobs = torch.zeros_like(logprobs)
dlogprobs[range(n), Yb] = -1. / n

dprobs

Continuing the process, we have

"logprobs"(i,j)=log("probs"(i,j))"logprobs"_(i,j)=log("probs"_(i,j))

Hence

("dloss"/"dprobs")(i,j)=("dloss"/"dlogprobs")(i,j)dot.c("dlogprobs"/"dprobs")(i,j)=("dloss"/"dlogprobs")(i,j)dot.c1/"probs"(i,j)("dloss"/"dprobs")_(i,j)=("dloss"/"dlogprobs")_(i,j)dot.c("dlogprobs"/"dprobs")_(i,j)=("dloss"/"dlogprobs")_(i,j)dot.c 1/"probs"_(i,j)
dprobs = dlogprobs / probs

dcounts_sum_inv

For simplicity of notation, denote counts_sum_inv by csi. Note that probs = counts * counts_sum_inv is in fact

"probs"_(i,j)="counts"_(i,j) dot.c "csi"_i

>>> counts.shape, counts_sum_inv.shape
(torch.Size([32, 27]), torch.Size([32, 1]))

Hence

("dprobs"/"dcsi")i=sumj"counts"(i,j)("dprobs"/"dcsi")_i=sum_j "counts"_(i,j)

and

("dprobs"/"dcsi")i=sumj("dloss"/"dprobs")(i,j)dot.c("dprobs"/"dcsi")i("dprobs"/"dcsi")_i=sum_j ("dloss"/"dprobs")_(i,j) dot.c ("dprobs"/"dcsi")_i
dcounts_sum_inv = (dprobs * counts).sum(1, keepdim=True)

dcounts

Note that from the forward pass one also has

probs = counts * counts_sum_inv

which leads to

("dprobs"/"dcounts")(i,j)="csi"i("dprobs"/"dcounts")_(i,j)="csi"_i

However, counts contributes to the loss not only through probs but also through counts_sum and then counts_sum_inv, so the complete gradient dcounts remains to be determined.

dcounts_sum

counts_sum_inv = counts_sum**-1

which leads to

("dcsi"/"dcs")i="cs"i(2)("dcsi"/"dcs")_i=-"cs"_i^(-2)

and then

("dloss"/"dcountssum")i=("dloss"/"dcsi")idot.c"countssum"i(2)("dloss"/"dcounts_sum")_i=-("dloss"/"dcsi")_i dot.c "counts_sum"_i^(-2)

dnorm_logits

Since counts = norm_logits.exp(), the local derivative of counts with respect to norm_logits is counts itself, hence

dnorm_logits = dcounts * norm_logits.exp() # norm_logits.exp() is actually counts

dlogit_maxes

From norm_logits = logits - logit_maxes, each entry of logit_maxes receives the negated gradients of its entire row, summed:

dlogit_maxes = (-dnorm_logits).sum(1, keepdim=True)

dlogits

From the same expression, logits receives dnorm_logits directly, plus the gradient flowing through logit_maxes, which only touches the row-wise argmax entries:

dlogits = dnorm_logits.clone()
tmp = torch.zeros_like(logits)
tmp[range(n), logits.max(1, keepdim=True).indices.view(-1)] = 1 # try F.one_hot
dlogits += dlogit_maxes * tmp

Backprop through cross_entropy but All in One Go

Basically, what happens in the forward pass can be described by the following pseudocode:

logprobs = log(softmax(logits, 1))
loss = -mean(logprobs[range(n), Yb])

To discuss the derivative of each single element, for every i denote (Y_b)_i by y. For simplicity of notation, denote "logits" by "lg".

"loss" &= -1/n sum_(i=1)^n "logprobs"_(i,y) \ &=-1/n sum_(i=1)^n log("probs"_(i,y)) \ &=-1/n sum_(i=1)^n log("exp"{"lg"_(i,y)} / (sum_k exp{"lg"_(i,k)}))

[!note] Keep in mind that the row-wise maximum of logits is supposed to be subtracted inside the exponentials (for numerical stability). It is omitted here because it does not affect the gradient of the loss.

Now apply the chain rule to derive the derivatives. If j eq.not y,

("dloss"/"dlg")(i,j)=1/nsumkexp"lg"(i,k)/(exp"lg"(i,y))dot.c(1)dot.cexp"lg"(i,y)/(sumkexp"lg"(i,k))2dot.cexp"lg"(i,j)("dloss"/"dlg")_(i,j)=-1/n sum_k exp{"lg"_(i,k)} / (exp{"lg"_(i,y)}) dot.c (-1) dot.c exp{"lg"_(i,y)} / (sum_k exp{"lg"_(i,k)})^2 dot.c exp{"lg"_(i,j)}

which yields

("dloss"/"dlg")(i,j)=1/ndot.cexp"lg"(i,j)/(sumkexp"lg"(i,k))=1/ndot.c"softmax"("lg"(i,dot.c))j("dloss"/"dlg")_(i,j)=1/n dot.c exp{"lg"_(i,j)} / (sum_k exp{"lg"_(i,k)})=1/n dot.c "softmax"("lg"_(i,dot.c))_j

If j=y,

("dloss"/"dlg")(i,j)=1/n(sumkexp"lg"(i,k))/(exp"lg"(i,y))dot.c(exp"lg"(i,y)dot.csumkexp"lg"(i,k)exp2"lg"(i,y))/((sumkexplg(i,k))2)("dloss"/"dlg")_(i,j)=-1/n (sum_k exp{"lg"_(i,k)}) / (exp{"lg"_(i,y)}) dot.c (exp{"lg"_(i,y)} dot.c sum_k exp{"lg"_(i,k)}-exp^2{"lg"_(i,y)}) / ((sum_k exp{lg_(i,k)})^2)

which yields

("dloss"/"dlg")(i,j)=1/n+1/ndot.c(exp"lg"(i,y))/(sumkexp"lg"(i,k))=1/ndot.c("softmax"(lg(i,dot.c))y)1("dloss"/"dlg")_(i,j)=-1/n+1/n dot.c (exp{"lg"_(i,y)}) / (sum_k exp{"lg"_(i,k)})=1/n dot.c ("softmax"(lg_(i,dot.c))_y)-1

While the two cases are discussed separately, they share a common softmax term.

dlogits = F.softmax(logits, 1)
dlogits[range(n), Yb] -= 1
dlogits /= n
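
As a quick sanity check (a sketch; it assumes loss and logits.grad are still available from the earlier forward/backward pass and that F is torch.nn.functional), the fused result can be compared against autograd:

loss_fast = F.cross_entropy(logits, Yb)
print(loss_fast.item(), loss.item())               # the two losses should agree
print((dlogits - logits.grad).abs().max().item())  # analytic vs. autograd gradient, ~0 up to float error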

The rest of the derivatives can be calculated in the same fashion. Read more

A typical training loop looks like this. One thing worth mentioning is the line optimizer.zero_grad().

Gradients in PyTorch are accumulated by default. Without zeroing them, gradients from multiple backpropagation steps would add up, leading to incorrect model updates.

In other words, zeroing gradients ensures that each backward pass calculates gradients based only on the current mini-batch.
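
A tiny standalone example of this accumulation behaviour (a minimal sketch, not part of the original notes):

import torch

w = torch.ones(3, requires_grad=True)
(w * 2.0).sum().backward()
print(w.grad)        # tensor([2., 2., 2.])
(w * 2.0).sum().backward()
print(w.grad)        # tensor([4., 4., 4.]) -- the gradients of the two backward passes add up
w.grad.zero_()       # this is what optimizer.zero_grad() does for every parameter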

# Training loop
losses = []
for epoch in range(num_epochs):
    epoch_loss = 0
    for batch_X, batch_y in dataloader:
        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)

        # Backward pass and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    # Record average loss for the epoch
    avg_loss = epoch_loss / len(dataloader)
    losses.append(avg_loss)

These steps are part of each training loop iteration and help in adjusting the model's parameters to improve its predictions.

  1. optimizer.zero_grad():

    • Purpose: Clears old gradients.
    • What Happens: Resets the gradients of all model parameters to zero. This prevents the accumulation of gradients from multiple backpropagations, ensuring each mini-batch is computed independently.
  2. loss.backward():

    • Purpose: Computes gradients.
    • What Happens: Performs backpropagation to compute the gradient of the loss with respect to each parameter (weight and bias). This is where the network learns, by calculating how much each parameter needs to change to minimize the loss.
  3. optimizer.step():

    • Purpose: Updates parameters.
    • What Happens: Updates the model parameters based on the calculated gradients. This step uses the gradients computed during loss.backward() to adjust the weights in an attempt to minimize the loss.

Loss Functions

hinge loss

cross-entropy loss

The cross-entropy between a "true" distribution p and an estimated distribution q is defined as:

H(p, q)=-sum_x p(x) log q(x)

[!note] Possibly confusing naming conventions To be precise, the SVM classifier uses the hinge loss, or also sometimes called the max-margin loss. The Softmax classifier uses the cross-entropy loss. The Softmax classifier gets its name from the softmax function, which is used to squash the raw class scores into normalized positive values that sum to one, so that the cross-entropy loss can be applied. In particular, note that technically it doesn’t make sense to talk about the "softmax loss", since softmax is just the squashing function, but it is a relatively commonly used shorthand.
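
As a concrete illustration (a minimal sketch with made-up scores, not tied to any particular model), the cross-entropy loss can be computed from raw class scores as follows:

import torch
import torch.nn.functional as F

scores = torch.randn(4, 3)                       # raw class scores for 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 0])             # "true" labels, i.e. p is a one-hot distribution
log_q = F.log_softmax(scores, dim=1)             # softmax squashes scores into a distribution q, then log
loss_manual = -log_q[range(4), targets].mean()   # H(p, q) averaged over the batch
loss_builtin = F.cross_entropy(scores, targets)  # PyTorch's fused version of the same computation
print(loss_manual.item(), loss_builtin.item())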

focal loss

readme

ROC-AUC

Basic Concepts

Consider a two-class prediction problem (binary classification), in which the outcomes are labeled either as positive or negative. There are four possible outcomes from a binary classifier:

  • True Positive (TP): the prediction is positive and the actual value is positive.
  • True Negative (TN): the prediction is negative and the actual value is negative.
  • False Positive (FP): the prediction is positive and the actual value is negative.
  • False Negative (FN): the prediction is negative and the actual value is positive.

Precision measures the accuracy of positive predictions. It is the ratio of true positive predictions to the total positive predictions (true positives + false positives). High precision indicates that fewer false positives are present.

"Precision"="TP"/"TP+FP""Precision" = "TP"/"TP+FP"

Recall (also known as sensitivity or true positive rate) measures the ability of a model to identify all relevant instances. It is the ratio of true positive predictions to the actual number of positive cases (true positives + false negatives). High recall means the model successfully captured most of the actual positives.

"Recall"="TP"/"TP+FN""Recall" = "TP"/"TP+FN"

F1 score is

"F1"=2/(1/"Precision"+1/"Recall")"F1" =2/(1/"Precision"+1/"Recall")

Calculation and Interpretation

[!note] The area under the curve (often referred to as simply the AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming positive ranks higher than negative). 1

The ROC curve (x, y) can be parameterized by y="TPR"(T) and x="FPR"(T).

Given a threshold parameter T, the instance is identified as positive if X>T and negative otherwise. X follows a probability density function f_1(x) if the instance actually belongs to the class positive, and f_0(x) otherwise. Therefore, the true positive rate is given by "TPR"(T)=integral_T^infinity f_1(x)"dx" and the false positive rate is given by "FPR"(T)=integral_T^infinity f_0(x)"dx". According to the maps "TPR"(T): T -> y(x) and "FPR"(T): T -> x, the area under the curve is given by:

"AUC" & = integral_(x=0)^1 y(x)"dx" \ & = integral_(x=0)^1 "TPR"(T)"dx" \ &= integral_(x=0)^1 "TPR"("FPR"^(-1)(x))"dx" \ & = integral_infinity^(-infinity)"TPR"(T)"FPR"^ prime(T)"dT" \ & = integral_(-infinity)^infinity "TPR"(T)f_0(T)"dT" \ & = integral_(-infinity)^infinity integral_T^infinity f_1(x)"dx" f_0(T)"dT" \ & = integral_(-infinity)^infinity integral_(-infinity)^infinity bb(1){x gt.eq T}f_1(x)"dx" f_0(T)"dT" \ & = integral_(-infinity)^infinity integral_{-infinity}^{infinity} bb(1){T^prime gt.eq T}f_1(T^prime)f_0(T)"dT"^prime "dT" \ & = P(X_1 gt.eq X_0)

Read more on ROC's Wiki page and some CSDN blog post

Suppose the numbers of positive and negative samples in a batch are M and N, respectively.

"AUC"=(sumII(P"pos",P"neg"))/(Mdot.cN)"AUC" = (sum II(P_"pos", P_"neg")) / (M dot.c N)

where the sum runs over all M · N pairs of a positive-sample score P_"pos" and a negative-sample score P_"neg", and II is a characteristic function:

II(p, q)=cases( 1\, & quad p > q , 0.5\, & quad p = q , 0\, & quad p < q )

The time complexity is O(M N). Denote by r_i the rank score of the i-th sample, i.e., for x_j with the smallest prediction score, r_j=0, and for x_k with the largest prediction score, r_k=M+N-1.

For each positive sample, its rank is precisely its contribution towards the count of pairs in which a positive sample is ranked above a negative one (because of its higher score), except that pairs of two positive samples are counted as well.

Therefore, one can first sum up all ranks of the positive samples, then subtract the cases where two positive samples are paired. Note that for the highest-ranked positive sample, there are M-1 positive samples below it, corresponding to a subtraction of M-1. A simple deduction shows that the total subtraction is

(M-1)+dots.c+1+0=((M-1)M)/2.

Finally we have

"AUC"=sum(iin"Positive")ri((M1)M)/2/(Mdot.cN)"AUC" = sum_(i in "Positive")r_i-((M-1)M)/2 / (M dot.c N)

The time complexity is O((M+N)log(M+N)), which is noticeably better than O(M N) when M and N are large.

Now we calculate with an example:

| real label | pred score | rank |
| --- | --- | --- |
| 1 | 0.9 | **5** |
| 1 | 0.7 | **4** |
| 0 | 0.6 | 3 |
| 1 | 0.55 | **2** |
| 0 | 0.2 | 1 |
| 0 | 0.1 | 0 |

  • Calculate AUC with the original formula:
"AUC"=(bold(1)(0.9)dot.c3+bold(1)(0.7)dot.c3+bold(1)(0.55)dot.c2)/(3dot.c3)=8/9."AUC" = (bold(1)_(0.9) dot.c 3+bold(1)_(0.7) dot.c 3+bold(1)_(0.55) dot.c 2) / (3 dot.c 3)=8/9.
  • Calculate AUC with the quicker formula:
"AUC"=(5+4+2)(3dot.c3)(31)(2dot.c3)=8/9."AUC" = (5+4+2)(3 dot.c 3)-(3-1)(2 dot.c 3)=8/9.

CUDA

[!warning] FileNotFoundError: [Errno 2] No such file or directory: ':/usr/local/cuda-11.8:/usr/local/cuda-11.8/bin/nvcc'

solution

export CUDA_HOME=/usr/local/cuda

or change export CUDA_HOME=/usr/local/cuda-11.8 in ~/.bashrc to export CUDA_HOME=/usr/local/cuda.

For more detail, refer to the GitHub issue.

Footnotes

  1. Fawcett, Tom (2006); An introduction to ROC analysis, Pattern Recognition Letters, 27, 861–874.
