
Math for VGG

1 Intro

I am writing this to help you understand what the code is doing. It's still a work in progress, toward making it more reader-friendly. At this point, it's just a bunch of math formulas you do not want to follow.

2 Notational Conventions

Since there are lots of variables that take multiple indices, they would be difficult to parse if we used subscripts for indices. We therefore put indices in parens, like x(i, j, k, l), instead of subscripts x_{i,j,k,l}. This is much easier to read.

3 Symbols

Constant Parameters and Indexes

  • B (Batch size) : the number of samples in a mini-batch
    – 0 ≤ b < B (batch) : an index of a sample in a mini-batch
  • C (Classes) : the number of classes
    – 0 ≤ c < C (class) : an index of a class
  • IC (Input Channels) : the number of channels in an input image of a layer (e.g., three if an image has red, green, and blue components)
    – 0 ≤ ic < IC (input channel index) : an index of a channel in an input image
  • OC (Output Channels) : the number of channels in an output image of a layer
    – 0 ≤ oc < OC (output channel index) : an index of a channel in an output image
  • H (Height) : the number of pixels in a single column of an image
    – 0 ≤ i < H (image row index)
  • W (Width) : the number of pixels in a single row of an image
    – 0 ≤ j < W (image column index)
  • K (Kernel size) : half the width of a kernel, so a kernel spans (2K + 1) × (2K + 1) pixels. Throughout VGG, K is always 1 and the kernel is 3 × 3 pixels.
    – −K ≤ i′ ≤ K (kernel row index)
    – −K ≤ j′ ≤ K (kernel column index)


Multidimensional Data

  • x(b, ic, i, j) : a batch of images input to a layer
  • y(b, oc, i, j) : a batch of images output from a layer
  • w(oc, ic, i′, j′) : filters (kernels) applied to each image
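To make these shapes concrete, here is a minimal sketch in NumPy (the use of NumPy and the example sizes are my own illustration, not taken from the actual code):

```python
import numpy as np

# Illustrative sizes only; the actual code uses whatever B, IC, OC, H, W, K it needs.
B, IC, OC, H, W, K = 2, 3, 8, 32, 32, 1

x = np.zeros((B, IC, H, W))                # input batch:  x(b, ic, i, j)
y = np.zeros((B, OC, H, W))                # output batch: y(b, oc, i, j)
w = np.zeros((OC, IC, 2*K + 1, 2*K + 1))   # filters:      w(oc, ic, i', j')
```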

4 Convolution2D

Description: Convolution takes a batch of images (x) and a filter (w) and outputs another batch of images (y). An input batch x consists of B images, each of which consists of IC channels, each of which consists of (H × W) pixels; x(b, ic, i, j) is the pixel value at row i and column j of the b-th image's ic-th channel. A filter is essentially a small image. It consists of OC output channels, each of which consists of IC input channels, each of which consists of (2K + 1) × (2K + 1) pixels. An output batch consists of B images, each of which consists of OC channels, each of which consists of (H × W) pixels. Each pixel in the output is obtained by taking the inner product of the filter and the (2K + 1) × (2K + 1) window of the input centered at that pixel.

Forward:

$$y(b, oc, i, j) = \sum_{\substack{0 \le ic < IC \\ -K \le i' \le K \\ -K \le j' \le K}} w(oc, ic, i', j')\, x(b, ic, i + i', j + j') \tag{1}$$

The actual code must take care of array index underflow and overflow. In the expression above, we assume all elements whose indices underflow or overflow are zero.

Backward:

$$\begin{aligned}
\frac{\partial L}{\partial x(b, ic, i + i', j + j')}
&= \sum_{b', oc, i, j} \frac{\partial L}{\partial y(b', oc, i, j)} \frac{\partial y(b', oc, i, j)}{\partial x(b, ic, i + i', j + j')} && (2) \\
&= \sum_{oc, i, j} \frac{\partial L}{\partial y(b, oc, i, j)} \frac{\partial y(b, oc, i, j)}{\partial x(b, ic, i + i', j + j')} && (3) \\
&= \sum_{oc, i, j} \frac{\partial L}{\partial y(b, oc, i, j)}\, w(oc, ic, i', j') && (4)
\end{aligned}$$

Equivalently, let $i'' = i + i'$ and $j'' = j + j'$. Then

$$0 \le i = i'' - i' < H \tag{5}$$

$$0 \le j = j'' - j' < W \tag{6}$$

$$\frac{\partial L}{\partial x(b, ic, i'', j'')} = \sum_{\substack{oc \\ i'' - H < i' \le i'' \\ j'' - W < j' \le j''}} \frac{\partial L}{\partial y(b, oc, i'' - i', j'' - j')}\, w(oc, ic, i', j') \tag{7}$$

Replacing $i''$ with $i$ and $j''$ with $j$ for readability, we get

$$\frac{\partial L}{\partial x(b, ic, i, j)} = \sum_{\substack{oc \\ i - H < i' \le i \\ j - W < j' \le j}} \frac{\partial L}{\partial y(b, oc, i - i', j - j')}\, w(oc, ic, i', j') \tag{8}$$
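As a sanity check on Eqs. (1) and (4), here is a direct (unoptimized) NumPy sketch of the forward pass and the input gradient; the function names, the zero-padding strategy, and the use of einsum are my own illustration, not the actual code:

```python
import numpy as np

def conv2d_forward(x, w):
    """Eq. (1). x: (B, IC, H, W), w: (OC, IC, 2K+1, 2K+1) -> y: (B, OC, H, W).

    Out-of-range pixels are treated as zero, as assumed in the text,
    by zero-padding x with a K-pixel border.
    """
    B, IC, H, W = x.shape
    OC, _, kh, _ = w.shape
    K = (kh - 1) // 2
    xp = np.pad(x, ((0, 0), (0, 0), (K, K), (K, K)))
    y = np.zeros((B, OC, H, W))
    for di in range(-K, K + 1):
        for dj in range(-K, K + 1):
            # x(b, ic, i + di, j + dj) for all output pixels (i, j) at once
            xs = xp[:, :, K + di : K + di + H, K + dj : K + dj + W]
            # sum over ic: w(oc, ic, di, dj) x(b, ic, i + di, j + dj)
            y += np.einsum('bchw,oc->bohw', xs, w[:, :, K + di, K + dj])
    return y

def conv2d_backward_x(gy, w):
    """Eq. (4): dL/dx(b, ic, i+i', j+j') += dL/dy(b, oc, i, j) w(oc, ic, i', j')."""
    B, OC, H, W = gy.shape
    _, IC, kh, _ = w.shape
    K = (kh - 1) // 2
    gxp = np.zeros((B, IC, H + 2*K, W + 2*K))    # padded accumulator
    for di in range(-K, K + 1):
        for dj in range(-K, K + 1):
            gxp[:, :, K + di : K + di + H, K + dj : K + dj + W] += \
                np.einsum('bohw,oc->bchw', gy, w[:, :, K + di, K + dj])
    return gxp[:, :, K : K + H, K : K + W]       # drop gradients of the zero padding
```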


$$\begin{aligned}
\frac{\partial L}{\partial w(oc, ic, i', j')}
&= \sum_{b, oc', i, j} \frac{\partial L}{\partial y(b, oc', i, j)} \frac{\partial y(b, oc', i, j)}{\partial w(oc, ic, i', j')} && (9) \\
&= \sum_{b, i, j} \frac{\partial L}{\partial y(b, oc, i, j)} \frac{\partial y(b, oc, i, j)}{\partial w(oc, ic, i', j')} && (10) \\
&= \sum_{b, i, j} \frac{\partial L}{\partial y(b, oc, i, j)}\, x(b, ic, i + i', j + j') && (11)
\end{aligned}$$
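Eq. (11) translates similarly; a sketch under the same assumptions as above:

```python
import numpy as np

def conv2d_backward_w(gy, x, K=1):
    """Eq. (11): dL/dw(oc, ic, i', j') = sum_{b,i,j} dL/dy(b, oc, i, j) x(b, ic, i+i', j+j')."""
    B, OC, H, W = gy.shape
    _, IC, _, _ = x.shape
    xp = np.pad(x, ((0, 0), (0, 0), (K, K), (K, K)))
    gw = np.zeros((OC, IC, 2*K + 1, 2*K + 1))
    for di in range(-K, K + 1):
        for dj in range(-K, K + 1):
            xs = xp[:, :, K + di : K + di + H, K + dj : K + dj + W]
            # sum over b, i, j for each (oc, ic) pair
            gw[:, :, K + di, K + dj] = np.einsum('bohw,bchw->oc', gy, xs)
    return gw
```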

5 Linear4D

Forward:

$$y(b, c, 0, 0) = \sum_{ic} x(b, ic, 0, 0)\, w(ic, c) \tag{12}$$

Backward:

$$\begin{aligned}
\frac{\partial L}{\partial x(b, ic, 0, 0)}
&= \sum_{b', c} \frac{\partial L}{\partial y(b', c, 0, 0)} \frac{\partial y(b', c, 0, 0)}{\partial x(b, ic, 0, 0)} && (13) \\
&= \sum_{c} \frac{\partial L}{\partial y(b, c, 0, 0)}\, w(ic, c) && (14)
\end{aligned}$$

$$\begin{aligned}
\frac{\partial L}{\partial w(ic, c)}
&= \sum_{b, c'} \frac{\partial L}{\partial y(b, c', 0, 0)} \frac{\partial y(b, c', 0, 0)}{\partial w(ic, c)} && (15) \\
&= \sum_{b} \frac{\partial L}{\partial y(b, c, 0, 0)}\, x(b, ic, 0, 0) && (16)
\end{aligned}$$
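Linear4D is in effect a fully connected layer on tensors whose spatial size is 1 × 1, so Eqs. (12), (14), and (16) are each a single contraction in NumPy (a sketch; names are illustrative):

```python
import numpy as np

def linear4d_forward(x, w):
    # Eq. (12): y(b, c, 0, 0) = sum_ic x(b, ic, 0, 0) w(ic, c)
    # x: (B, IC, 1, 1), w: (IC, C) -> y: (B, C, 1, 1)
    return np.einsum('bihw,ic->bchw', x, w)

def linear4d_backward(gy, x, w):
    gx = np.einsum('bchw,ic->bihw', gy, w)    # Eq. (14): dL/dx(b, ic, 0, 0)
    gw = np.einsum('bchw,bihw->ic', gy, x)    # Eq. (16): dL/dw(ic, c)
    return gx, gw
```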

6 Dropout4

Forward:

$$y(b, c, i, j) = R(b, c, i, j)\, x(b, c, i, j) \tag{17}$$

where $R(b, c, i, j)$ is a random matrix whose elements are $0$ with probability $p$ and $1/(1 - p)$ with probability $(1 - p)$.

Backward:

$$\frac{\partial L}{\partial x(b, c, i, j)} = \frac{\partial L}{\partial y(b, c, i, j)}\, R(b, c, i, j) \tag{18}$$
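A minimal sketch of Eqs. (17) and (18) in NumPy (the RNG choice is an assumption; note that the mask R produced by the forward pass must be kept for the backward pass):

```python
import numpy as np

def dropout_forward(x, p, rng):
    # Eq. (17): R(b, c, i, j) is 0 with probability p, 1/(1-p) otherwise
    R = (rng.random(x.shape) >= p) / (1.0 - p)
    return R * x, R

def dropout_backward(gy, R):
    return gy * R                              # Eq. (18)

# usage: y, R = dropout_forward(x, 0.5, np.random.default_rng(0))
```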

7 BatchNormalization4

Forward:

$$\mu(ic) = \frac{1}{BHW} \sum_{b, i, j} x(b, ic, i, j) \tag{19}$$

$$\sigma^2(ic) = \frac{1}{BHW} \sum_{b, i, j} \left( x(b, ic, i, j) - \mu(ic) \right)^2 \tag{20}$$

$$\hat{x}(b, ic, i, j) = \frac{x(b, ic, i, j) - \mu(ic)}{\sqrt{\sigma^2(ic) + \epsilon}} \tag{21}$$

$$y(b, ic, i, j) = \gamma(ic)\, \hat{x}(b, ic, i, j) + \beta(ic) \tag{22}$$
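Eqs. (19)-(22) in NumPy; a sketch assuming gamma and beta are 1-D arrays of length IC (x̂ and σ² are returned because the backward pass needs them):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # Eqs. (19)-(20): per-channel statistics over (b, i, j)
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = ((x - mu) ** 2).mean(axis=(0, 2, 3), keepdims=True)
    xhat = (x - mu) / np.sqrt(var + eps)                               # Eq. (21)
    y = gamma.reshape(1, -1, 1, 1) * xhat + beta.reshape(1, -1, 1, 1)  # Eq. (22)
    return y, xhat, var
```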

Backward:

$$\begin{aligned}
\frac{\partial L}{\partial \gamma(ic)}
&= \sum_{b, i, j} \frac{\partial L}{\partial y(b, ic, i, j)} \frac{\partial y(b, ic, i, j)}{\partial \gamma(ic)} && (23) \\
&= \sum_{b, i, j} \frac{\partial L}{\partial y(b, ic, i, j)} \frac{x(b, ic, i, j) - \mu(ic)}{\sqrt{\sigma^2(ic) + \epsilon}} && (24)
\end{aligned}$$

$$\frac{\partial L}{\partial \beta(ic)} = \sum_{b, i, j} \frac{\partial L}{\partial y(b, ic, i, j)} \tag{25}$$

$$\begin{aligned}
\frac{\partial L}{\partial \hat{x}(b, ic, i, j)}
&= \frac{\partial L}{\partial y(b, ic, i, j)} \frac{\partial y}{\partial \hat{x}(b, ic, i, j)} && (26) \\
&= \frac{\partial L}{\partial y(b, ic, i, j)}\, \gamma(ic) && (27)
\end{aligned}$$

$$\begin{aligned}
\frac{\partial L}{\partial \sigma^2(ic)}
&= \sum_{b, i, j} \frac{\partial L}{\partial \hat{x}(b, ic, i, j)} \frac{\partial \hat{x}(b, ic, i, j)}{\partial \sigma^2(ic)} && (28) \\
&= -\frac{1}{2} \sum_{b, i, j} \frac{\partial L}{\partial \hat{x}(b, ic, i, j)} \frac{x(b, ic, i, j) - \mu(ic)}{(\sigma^2(ic) + \epsilon)^{3/2}} && (29)
\end{aligned}$$

$$\begin{aligned}
\frac{\partial L}{\partial \mu(ic)}
&= \sum_{b, i, j} \frac{\partial L}{\partial \hat{x}(b, ic, i, j)} \frac{\partial \hat{x}(b, ic, i, j)}{\partial \mu(ic)} + \frac{\partial L}{\partial \sigma^2(ic)} \frac{\partial \sigma^2(ic)}{\partial \mu(ic)} && (30) \\
&= -\sum_{b, i, j} \frac{\partial L}{\partial \hat{x}(b, ic, i, j)} \frac{1}{\sqrt{\sigma^2(ic) + \epsilon}} + \frac{\partial L}{\partial \sigma^2(ic)} \frac{2}{BHW} \sum_{b, i, j} \left( \mu(ic) - x(b, ic, i, j) \right) && (31) \\
&= -\sum_{b, i, j} \frac{\partial L}{\partial \hat{x}(b, ic, i, j)} \frac{1}{\sqrt{\sigma^2(ic) + \epsilon}} && (32)
\end{aligned}$$

The second term of (31) vanishes because $\sum_{b, i, j} (\mu(ic) - x(b, ic, i, j)) = 0$ by the definition of $\mu(ic)$.

$$\begin{aligned}
\frac{\partial L}{\partial x(b, ic, i, j)}
&= \frac{\partial L}{\partial \hat{x}(b, ic, i, j)} \frac{\partial \hat{x}(b, ic, i, j)}{\partial x(b, ic, i, j)} + \frac{\partial L}{\partial \sigma^2(ic)} \frac{\partial \sigma^2(ic)}{\partial x(b, ic, i, j)} + \frac{\partial L}{\partial \mu(ic)} \frac{\partial \mu(ic)}{\partial x(b, ic, i, j)} && (33) \\
&= \frac{\partial L}{\partial \hat{x}(b, ic, i, j)} \frac{1}{\sqrt{\sigma^2(ic) + \epsilon}} + \frac{\partial L}{\partial \sigma^2(ic)} \frac{2}{BHW} (x(b, ic, i, j) - \mu(ic)) + \frac{\partial L}{\partial \mu(ic)} \frac{1}{BHW} && (34) \\
&= \frac{\partial L}{\partial y(b, ic, i, j)} \frac{\gamma(ic)}{\sqrt{\sigma^2(ic) + \epsilon}} && (35) \\
&\quad + \left( -\frac{1}{2} \sum_{b, i, j} \frac{\partial L}{\partial \hat{x}(b, ic, i, j)} \frac{x(b, ic, i, j) - \mu(ic)}{(\sigma^2(ic) + \epsilon)^{3/2}} \right) \frac{2}{BHW} (x(b, ic, i, j) - \mu(ic)) && (36) \\
&\quad + \left( -\sum_{b, i, j} \frac{\partial L}{\partial \hat{x}(b, ic, i, j)} \frac{1}{\sqrt{\sigma^2(ic) + \epsilon}} \right) \frac{1}{BHW} && (37) \\
&= \frac{\partial L}{\partial y(b, ic, i, j)} \frac{\gamma(ic)}{\sqrt{\sigma^2(ic) + \epsilon}} && (38) \\
&\quad - \frac{\gamma(ic)}{BHW} \left( \sum_{b, i, j} \frac{\partial L}{\partial y(b, ic, i, j)} \frac{x(b, ic, i, j) - \mu(ic)}{(\sigma^2(ic) + \epsilon)^{3/2}} \right) (x(b, ic, i, j) - \mu(ic)) && (39) \\
&\quad - \frac{\gamma(ic)}{BHW} \sum_{b, i, j} \frac{\partial L}{\partial y(b, ic, i, j)} \frac{1}{\sqrt{\sigma^2(ic) + \epsilon}} && (40) \\
&= \frac{\partial L}{\partial y(b, ic, i, j)} \frac{\gamma(ic)}{\sqrt{\sigma^2(ic) + \epsilon}} && (41) \\
&\quad - \frac{1}{BHW} \frac{\gamma(ic)}{\sqrt{\sigma^2(ic) + \epsilon}} \left( \sum_{b, i, j} \frac{\partial L}{\partial y(b, ic, i, j)} \frac{x(b, ic, i, j) - \mu(ic)}{\sigma^2(ic) + \epsilon} \right) (x(b, ic, i, j) - \mu(ic)) && (42) \\
&\quad - \frac{1}{BHW} \frac{\gamma(ic)}{\sqrt{\sigma^2(ic) + \epsilon}} \sum_{b, i, j} \frac{\partial L}{\partial y(b, ic, i, j)} && (43) \\
&= \frac{\partial L}{\partial y(b, ic, i, j)} \frac{\gamma(ic)}{\sqrt{\sigma^2(ic) + \epsilon}} && (44) \\
&\quad - \frac{1}{BHW} \frac{\gamma(ic)}{\sqrt{\sigma^2(ic) + \epsilon}} \left( \sum_{b, i, j} \frac{\partial L}{\partial y(b, ic, i, j)} \frac{x(b, ic, i, j) - \mu(ic)}{\sqrt{\sigma^2(ic) + \epsilon}} \right) \frac{x(b, ic, i, j) - \mu(ic)}{\sqrt{\sigma^2(ic) + \epsilon}} && (45) \\
&\quad - \frac{1}{BHW} \frac{\gamma(ic)}{\sqrt{\sigma^2(ic) + \epsilon}} \sum_{b, i, j} \frac{\partial L}{\partial y(b, ic, i, j)} && (46) \\
&= \frac{\partial L}{\partial y(b, ic, i, j)} \frac{\gamma(ic)}{\sqrt{\sigma^2(ic) + \epsilon}} && (47) \\
&\quad - \frac{1}{BHW} \frac{\gamma(ic)}{\sqrt{\sigma^2(ic) + \epsilon}} \frac{\partial L}{\partial \gamma(ic)}\, \hat{x}(b, ic, i, j) && (48) \\
&\quad - \frac{1}{BHW} \frac{\gamma(ic)}{\sqrt{\sigma^2(ic) + \epsilon}} \frac{\partial L}{\partial \beta(ic)} && (49) \\
&= \frac{\gamma(ic)}{\sqrt{\sigma^2(ic) + \epsilon}} \left( \frac{\partial L}{\partial y(b, ic, i, j)} - \frac{1}{BHW} \left( \frac{\partial L}{\partial \gamma(ic)}\, \hat{x}(b, ic, i, j) + \frac{\partial L}{\partial \beta(ic)} \right) \right) && (50)
\end{aligned}$$

8 Relu4

Forward:

$$y(b, c, i, j) = \max(0, x(b, c, i, j)) \tag{51}$$

$$= \begin{cases} x(b, c, i, j) & (x(b, c, i, j) \ge 0) \\ 0 & (x(b, c, i, j) < 0) \end{cases} \tag{52}$$

Backward:

$$\frac{\partial L}{\partial x(b, c, i, j)} = \begin{cases} \dfrac{\partial L}{\partial y(b, c, i, j)} & (x(b, c, i, j) \ge 0) \\[2ex] 0 & (\text{otherwise}) \end{cases} \tag{53}$$
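Eqs. (51) and (53) in NumPy (a sketch; note that x from the forward pass must be kept for the backward pass):

```python
import numpy as np

def relu_forward(x):
    return np.maximum(0.0, x)                 # Eq. (51)

def relu_backward(gy, x):
    return gy * (x >= 0)                      # Eq. (53): pass the gradient where x >= 0
```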

9 MaxPooling2d

Forward:

$$y(b, c, i, j) = \max_{\substack{Si \le i' < S(i+1) \\ Sj \le j' < S(j+1)}} x(b, c, i', j') \tag{54}$$

where $S$ is the pooling window size and stride ($S = 2$ throughout VGG).

Backward:

$$\frac{\partial L}{\partial x(b, c, i, j)} = \sum_{b', c', i', j'} \frac{\partial L}{\partial y(b', c', i', j')} \frac{\partial y(b', c', i', j')}{\partial x(b, c, i, j)} \tag{55}$$

$$= \begin{cases} \dfrac{\partial L}{\partial y(b, c, I, J)} & (i, j) = \operatorname*{argmax}_{SI \le i'' < S(I+1),\; SJ \le j'' < S(J+1)} x(b, c, i'', j'') \\[2ex] 0 & (\text{otherwise}) \end{cases} \tag{56}$$

where $I = \lfloor i/S \rfloor$ and $J = \lfloor j/S \rfloor$ index the pooling window that contains pixel $(i, j)$.
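A reshape-based NumPy sketch of Eqs. (54) and (56), assuming H and W are divisible by S (when a window has ties, this version splits the gradient among all maxima, a minor deviation from the single-argmax form of (56)):

```python
import numpy as np

def maxpool_forward(x, S=2):
    # Eq. (54): maximum over each non-overlapping S x S window
    B, C, H, W = x.shape
    return x.reshape(B, C, H // S, S, W // S, S).max(axis=(3, 5))

def maxpool_backward(gy, x, S=2):
    # Eq. (56): route each output gradient back to the maximal pixel of its window
    B, C, H, W = x.shape
    xr = x.reshape(B, C, H // S, S, W // S, S)
    mask = (xr == xr.max(axis=(3, 5), keepdims=True))
    gx = mask * gy.reshape(B, C, H // S, 1, W // S, 1)
    return gx.reshape(B, C, H, W)
```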

10 SoftmaxCrossEntropy

Definition (softmax): for an $n$-vector $x = (x_0, \cdots, x_{n-1})$,

$$\mathrm{softmax}(x) \equiv \frac{1}{\sum_i \exp x_i} \begin{pmatrix} \exp x_0 \\ \vdots \\ \exp x_{n-1} \end{pmatrix} \tag{57}$$

Definition (logsoftmax): it is convenient to take a logarithm of it.

$$\mathrm{logsoftmax}(x) \equiv \log \mathrm{softmax}(x) \tag{58}$$

$$= \begin{pmatrix} x_0 - \log Z \\ \vdots \\ x_{n-1} - \log Z \end{pmatrix} \tag{59}$$

where $Z = \sum_i \exp x_i$.

Definition (Cross Entropy $H$): for $n$-vectors $x = (x_0, \cdots, x_{n-1})$ and $t = (t_0, \cdots, t_{n-1})$,

$$H(t, x) \equiv -\sum_i t_i \log x_i \tag{60}$$

Definition (Softmax Cross Entropy): the composition of softmax and cross entropy:

$$\begin{aligned}
\mathrm{SoftmaxCrossEntropy}(t, x) &\equiv H(t, \mathrm{softmax}(x)) && (62) \\
&= -\sum_i t_i (\log \mathrm{softmax}(x))_i && (63) \\
&= -t \cdot \mathrm{logsoftmax}(x) && (64)
\end{aligned}$$

In particular, we consider a special case where $t$ is a one-hot vector. The one-hot vector $\mathbf{i}$ is a vector whose $i$-th element is one and all other elements are zero:

$$\mathbf{i} \equiv (\,\underbrace{0, \cdots, 0}_{\text{positions } 0, \ldots, i-1},\ 1,\ \underbrace{0, \cdots, 0}_{\text{positions } i+1, \ldots, n-1}\,)$$

When $t$ is a one-hot vector, SoftmaxCrossEntropy becomes as simple as

$$\begin{aligned}
\mathrm{SoftmaxCrossEntropy}(\mathbf{i}, x) &= -\mathrm{logsoftmax}(x)_i && (65) \\
&= -x_i + \log \sum_j \exp x_j && (66)
\end{aligned}$$

Forward:

  • x(b, i) : the score for the b-th sample to belong to class i
  • t(b) : the true class of the b-th sample

$$\begin{aligned}
y(b) &= \mathrm{SoftmaxCrossEntropy}(\mathbf{t(b)}, x(b, \cdot)) && (67) \\
&= -x(b, t(b)) + \log \sum_j \exp x(b, j) && (68)
\end{aligned}$$
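Eq. (68) in NumPy; subtracting the row maximum before exponentiating is a standard guard against overflow and does not change the value (a sketch; names are illustrative):

```python
import numpy as np

def softmax_xent_forward(x, t):
    # x: (B, C) scores, t: (B,) integer true classes -> y: (B,) losses, Eq. (68)
    m = x.max(axis=1, keepdims=True)
    logZ = (m + np.log(np.exp(x - m).sum(axis=1, keepdims=True)))[:, 0]
    return logZ - x[np.arange(x.shape[0]), t]
```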

Backward:

$$\begin{aligned}
\frac{\partial L}{\partial x(b, c)}
&= \sum_{b'} \frac{\partial L}{\partial y(b')} \frac{\partial y(b')}{\partial x(b, c)} && (69) \\
&= \frac{\partial L}{\partial y(b)} \frac{\partial y(b)}{\partial x(b, c)} && (70) \\
&= \begin{cases}
\dfrac{\partial L}{\partial y(b)} \left( -1 + \dfrac{\exp x(b, c)}{\sum_j \exp x(b, j)} \right) & (c = t(b)) \\[3ex]
\dfrac{\partial L}{\partial y(b)} \left( \dfrac{\exp x(b, c)}{\sum_j \exp x(b, j)} \right) & (c \ne t(b))
\end{cases} && (71)
\end{aligned}$$
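Both cases of Eq. (71) collapse to "softmax minus one-hot"; a sketch under the same assumptions as the forward sketch above:

```python
import numpy as np

def softmax_xent_backward(gy, x, t):
    # Eq. (71): dL/dx(b, c) = dL/dy(b) * (softmax(x)(b, c) - [c = t(b)])
    p = np.exp(x - x.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)          # softmax(x)(b, c)
    p[np.arange(x.shape[0]), t] -= 1.0         # the (c = t(b)) case
    return gy[:, None] * p
```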