SLIDE 1

Training Neural Nets

COMPSCI 527 — Computer Vision

SLIDE 2

Outline

1. The Softmax Simplex
2. Loss and Risk
3. Back-Propagation
4. Stochastic Gradient Descent
5. Regularization

SLIDE 3

The Softmax Simplex

  • Neural-net classifier: ŷ = h(x) : Rᵈ → Y
  • The last layer of a neural net used for classification is a soft-max layer p = σ(z) = exp(z) / (1ᵀ exp(z))
  • The net is p = f(x, w) : X × Rᵐ → P
  • The classifier is ŷ = h(x) = arg max p = arg max f(x, w)
  • P is the set of all nonnegative real-valued vectors p ∈ Rᴷ whose entries add up to 1 (with K = |Y|):
    P ≝ {p ∈ Rᴷ : p ≥ 0 and Σi pi = 1}

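As a concrete illustration of the soft-max layer and the simplex P, here is a minimal NumPy sketch (the function name and example values are mine, not from the slides); the usual max-shift keeps exp from overflowing without changing p:

```python
import numpy as np

def softmax(z):
    """p = sigma(z) = exp(z) / (1^T exp(z)); shifting z by max(z) is a
    standard stability trick that leaves p unchanged."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])            # hypothetical activations, K = 3
p = softmax(z)
assert np.all(p >= 0) and np.isclose(p.sum(), 1.0)  # p lies in the simplex P
print(p.argmax())                          # the classifier h(x) = arg max p
```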

SLIDE 4

The Softmax Simplex

P ≝ {p ∈ Rᴷ : p ≥ 0 and Σi pi = 1}

[Figure: the simplex P for K = 2 (a segment in the (p1, p2) plane, with midpoint (1/2, 1/2)) and for K = 3 (a triangle in (p1, p2, p3) space, with centroid (1/3, 1/3, 1/3))]

  • Decision regions are polyhedral and convex:
    Pc = {p ∈ P : pc ≥ pj for j ≠ c}, for c = 1, …, K

  • A network transforms images into points in P

SLIDE 5

Loss and Risk

Training is Empirical Risk Minimization

  • Define a loss ℓ(y, ŷ): How much do we pay when the true label is y and the network says ŷ?
  • Network: p = f(x, w), then ŷ = arg max p
  • Risk is the average loss over the training set T = {(x1, y1), …, (xN, yN)}:
    LT(w) = (1/N) Σn=1,…,N ℓn(w), where ℓn(w) = ℓ(yn, f(xn, w))
  • Determine the network weights w by minimizing LT(w)
  • Use some variant of steepest descent
  • We need ∇LT(w) and therefore ∇ℓn(w)

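A sketch of the empirical risk as an average of per-sample losses; `net` (standing in for f) and `loss_fn` (standing in for ℓ) are hypothetical, not code from the course:

```python
import numpy as np

def empirical_risk(w, T, net, loss_fn):
    """L_T(w) = (1/N) * sum of loss(y_n, f(x_n, w)) over the training set T,
    given as a list of (x_n, y_n) pairs."""
    return np.mean([loss_fn(y, net(x, w)) for x, y in T])
```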

SLIDE 6

Loss and Risk

The Cross-Entropy Loss

  • The ideal loss would be the 0-1 loss ℓ0-1(y, ŷ) on the classifier output ŷ
  • The 0-1 loss is constant wherever it is differentiable!
  • Not useful for computing a gradient for risk minimization
  • Use the cross-entropy loss on the softmax output p as a proxy loss: ℓ(y, p) = − log py
  • Differentiable!
  • Unbounded loss for total misclassification

[Figure: the cross-entropy loss − log py as a function of py, diverging as py → 0 and equal to 0 at py = 1]

SLIDE 7

Loss and Risk

Example: K = 5 Classes

  • The last layer before the soft-max has activations z ∈ Rᴷ
  • The soft-max output is p = σ(z) = exp(z) / (1ᵀ exp(z)) ∈ R⁵
  • Ideally, if the correct class is y = 4, we would like the output p to equal q = [0, 0, 0, 1, 0], the one-hot encoding of y
  • That is, qy = q4 = 1 and all other qj are zero
  • ℓ(y, p) = − log py = − log p4
  • py → 1 and ℓ(y, p) → 0 when zy/zy′ → ∞ for all y′ ≠ y
  • That is, when p approaches the correct simplex corner
  • py → 0 and ℓ(y, p) → ∞ when zy/zy′ → −∞ for some y′ ≠ y
  • That is, when p is far from the correct simplex corner

SLIDE 8

Loss and Risk

Example, Continued

[Figure: plot of ℓ(y, p) as a function of the activations]

  • ℓ(y, p) = − log py = − log ( exp(zy) / (1ᵀ exp(z)) ) = log(1ᵀ exp(z)) − zy
  • py → 1 and ℓ(y, p) → 0 when zy/zy′ → ∞ for y′ ≠ y
  • py → 0 and ℓ(y, p) → ∞ when zy/zy′ → −∞ for y′ ≠ y
  • The actual plot depends on all the values in z
  • This is a “soft hinge loss” in z (not in p)

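The identity ℓ(y, p) = log(1ᵀ exp(z)) − zy suggests computing the loss directly from z with a shifted log-sum-exp; a minimal sketch (0-based class index, example values mine):

```python
import numpy as np

def cross_entropy(z, y):
    """l(y, p) = -log p_y = log(1^T exp(z)) - z_y, computed stably:
    the log-sum-exp is shifted by max(z) so exp never overflows."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m))) - z[y]

z = np.array([1.0, 0.5, -0.2, 6.0, -1.0])  # K = 5 activations (hypothetical)
print(cross_entropy(z, 3))                  # the slide's class y = 4 is index 3 here
```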

SLIDE 9

Back-Propagation

Back-Propagation

[Figure: a chain of three layers f(1), f(2), f(3) with weights w(1), w(2), w(3); the input is x(0) = xn, the layer outputs are x(1), x(2), x(3) = p, and p together with yn produces the loss ℓn]

  • We need ∇LT(w) and therefore ∇ℓn(w) = ∂ℓn/∂w
  • The computations from xn to ℓn form a chain
  • Apply the chain rule
  • Every derivative of ℓn w.r.t. layers before k goes through x(k):
    ∂ℓn/∂w(k) = ∂ℓn/∂x(k) · ∂x(k)/∂w(k)
    ∂ℓn/∂x(k−1) = ∂ℓn/∂x(k) · ∂x(k)/∂x(k−1)   (recursion!)
  • Start: ∂ℓn/∂x(K) = ∂ℓ/∂p

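The recursion maps directly to code. A sketch, under the assumption that each layer object has stored its local Jacobians `jac_w` = ∂x(k)/∂w(k) and `jac_x` = ∂x(k)/∂x(k−1) during the forward pass (these attribute names are mine):

```python
import numpy as np

def backprop(layers, dl_dp):
    """Back-propagation as the recursion on the slide. `dl_dp` is the
    starting row vector dl/dx(K) = dl/dp."""
    grads = []
    dl_dx = dl_dp
    for layer in reversed(layers):          # from output to input
        grads.append(dl_dx @ layer.jac_w)   # dl/dw(k) = dl/dx(k) dx(k)/dw(k)
        dl_dx = dl_dx @ layer.jac_x         # dl/dx(k-1) = dl/dx(k) dx(k)/dx(k-1)
    return np.concatenate(grads[::-1])      # stack into one vector dl/dw
```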

SLIDE 10

Back-Propagation

Local Jacobians

[Figure: the same chain of layers f(1), f(2), f(3) with weights w(1), w(2), w(3), input x(0) = xn, output x(3) = p, and loss ℓn]

  • Local computations at layer k: ∂x(k)/∂w(k) and ∂x(k)/∂x(k−1)
  • These are the partial derivatives of f(k) with respect to the layer weights and the input to the layer
  • They are local Jacobian matrices, and can be computed by knowing what the layer does
  • The start of the process can be computed from knowing the loss function: ∂ℓn/∂x(K) = ∂ℓ/∂p
  • Another local Jacobian
  • The rest is going recursively from output to input, one layer at a time, accumulating ∂ℓn/∂w(k) into a vector ∂ℓn/∂w

SLIDE 11

Back-Propagation

Back-Propagation Spelled Out for K = 3

[Figure: the chain of layers f(1), f(2), f(3) once more, annotated with the quantities below]

  • ∂ℓn/∂x(3) = ∂ℓ/∂p
  • ∂ℓn/∂w(3) = ∂ℓn/∂x(3) · ∂x(3)/∂w(3)
  • ∂ℓn/∂x(2) = ∂ℓn/∂x(3) · ∂x(3)/∂x(2)
  • ∂ℓn/∂w(2) = ∂ℓn/∂x(2) · ∂x(2)/∂w(2)
  • ∂ℓn/∂x(1) = ∂ℓn/∂x(2) · ∂x(2)/∂x(1)
  • ∂ℓn/∂w(1) = ∂ℓn/∂x(1) · ∂x(1)/∂w(1)
  • ∂ℓn/∂x(0) = ∂ℓn/∂x(1) · ∂x(1)/∂x(0)
  • Stack the pieces: ∂ℓn/∂w = [∂ℓn/∂w(1); ∂ℓn/∂w(2); ∂ℓn/∂w(3)]

(The local Jacobians ∂x(k)/∂w(k) and ∂x(k)/∂x(k−1) are shown in blue on the slide)

SLIDE 12

Back-Propagation

Computing Local Jacobians

∂x(k)/∂w(k) and ∂x(k)/∂x(k−1)

  • It is easier to make a “layer” as simple as possible
  • z = Vx + b is one layer (the affine part of a Fully Connected (FC) layer)
  • z = ρ(x) (ReLU) is another layer
  • Softmax, max-pooling, convolutional, ...

SLIDE 13

Back-Propagation

Local Jacobians for a FC Layer

z = Vx + b

  • ∂z/∂x = V (easy!)
  • ∂z/∂w: What is ∂z/∂V? Three subscripts, ∂zi/∂vjk. A 3D tensor?
  • For a general package, tensors are the way to go
  • Conceptually, it may be easier to vectorize everything:
    V = [v11 v12 v13; v21 v22 v23],   b = [b1; b2]
    w = [v11, v12, v13, v21, v22, v23, b1, b2]ᵀ
  • ∂z/∂w is then a 2 × 8 matrix
  • With e outputs and d inputs, an e × e(d + 1) matrix

SLIDE 14

Back-Propagation

The Jacobian ∂z/∂w for a FC Layer

  • [z1; z2] = [w1 w2 w3; w4 w5 w6] · [x1; x2; x3] + [w7; w8]
  • Don’t be afraid to spell things out:
    z1 = w1 x1 + w2 x2 + w3 x3 + w7
    z2 = w4 x1 + w5 x2 + w6 x3 + w8
  • ∂z/∂w = [ ∂z1/∂w1 ⋯ ∂z1/∂w8 ; ∂z2/∂w1 ⋯ ∂z2/∂w8 ]
  • ∂z/∂w = | x1 x2 x3 0  0  0  1 0 |
            | 0  0  0  x1 x2 x3 0 1 |
  • Obvious pattern: Repeat xᵀ, staggered, e times
  • Then append the e × e identity at the end

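The staggered-xᵀ-plus-identity pattern can be built in one line with a Kronecker product; a sketch (function name mine) that reproduces the 2 × 8 example above:

```python
import numpy as np

def fc_jacobian_w(x, e):
    """dz/dw for z = V x + b, with w = [rows of V, then b]:
    x^T repeated e times, staggered, then the e-by-e identity appended.
    The result is e x e(d+1) for d inputs."""
    return np.hstack([np.kron(np.eye(e), x), np.eye(e)])

x = np.array([2.0, -1.0, 3.0])   # d = 3 inputs (example values)
print(fc_jacobian_w(x, e=2))     # the 2 x 8 matrix from the slide
```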

SLIDE 15

Stochastic Gradient Descent

Training

  • Local gradients are used in back-propagation
  • So we now have ∇LT(w)
  • ŵ = arg min LT(w)

  • LT(w) is (very) non-convex, so we look for local minima
  • w ∈ Rm with m very large: No Hessians
  • Gradient descent
  • Even so, every step calls back-propagation N = |T| times
  • Back-propagation computes m derivatives ∇ℓn(w)
  • Computational complexity is Ω(mN) per step
  • Even gradient descent is way too expensive!

SLIDE 16

Stochastic Gradient Descent

No Line Search

  • Line search is out of the question
  • Fix some step multiplier α, called the learning rate

wt+1 = wt − α∇LT(wt)

  • How to pick α? Validation is too expensive
  • Tradeoffs:
  • α too small: Slow progress
  • α too big: Jump over minima
  • Frequent practice:
  • Start with α relatively large, and monitor LT(w)
  • When LT(w) levels off, decrease α
  • Alternative: Fixed decay schedule for α
  • Better (recent) option: Change α adaptively (Adam, 2015)

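A hedged sketch of the frequent practice above: fixed-step descent where α is cut whenever the monitored risk levels off. Here `risk_and_grad` is a hypothetical oracle returning LT(w) and ∇LT(w), and the thresholds are illustrative, not prescribed by the slides:

```python
def descend(w, risk_and_grad, alpha=1e-2, steps=1000, decay=0.1, window=50, tol=1e-4):
    """w <- w - alpha * grad, with alpha multiplied by `decay` whenever the
    risk has improved by less than `tol` over the last `window` steps."""
    history = []
    for _ in range(steps):
        risk, grad = risk_and_grad(w)
        history.append(risk)
        if len(history) >= window and history[-window] - risk < tol:
            alpha *= decay       # risk leveled off: decrease the learning rate
            history = []         # restart monitoring at the new rate
        w = w - alpha * grad
    return w
```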

SLIDE 17

Stochastic Gradient Descent

Manual Adjustment of α

  • Start with α relatively large, and monitor LT(wt)
  • When LT(wt) levels off, decrease α
  • Typical plots of LT(wt) versus iteration index t:

[Figure: typical plots of the risk LT(wt) versus the iteration index t]

SLIDE 18

Stochastic Gradient Descent

Batch Gradient Descent

  • ∇LT(w) = (1/N) Σn=1,…,N ∇ℓn(w)
  • Taking a macro-step −α∇LT(wt) is the same as taking the N micro-steps −(α/N)∇ℓ1(wt), …, −(α/N)∇ℓN(wt)
  • First compute all the N steps at wt, then take all the steps
  • Thus, standard gradient descent is a batch method: Compute the gradient at wt using the entire batch of data, then move
  • Even with no line search, computing N micro-steps is still expensive

SLIDE 19

Stochastic Gradient Descent

Stochastic Descent

  • Taking a macro-step −α∇LT(wt) is the same as taking the N micro-steps −(α/N)∇ℓ1(wt), …, −(α/N)∇ℓN(wt)
  • First compute all the N steps at wt, then take all the steps
  • Can we use this effort more effectively?
  • Key observation: −α∇ℓn(w) is a poor estimate of −α∇LT(w), but an estimate all the same: Micro-steps are in the correct direction on average!
  • After each micro-step, we are on average in a better place
  • How about computing a new micro-gradient after every micro-step?
  • Now each micro-step gradient is evaluated at a point that is on average better (lower risk) than in the batch method

SLIDE 20

Stochastic Gradient Descent

Batch versus Stochastic Gradient Descent

  • Define sn(w) = −(α/N) ∇ℓn(w)
  • Batch:
    • Compute s1(wt), …, sN(wt)
    • Move by s1(wt), then s2(wt), …, then sN(wt)
      (or equivalently move once by s1(wt) + … + sN(wt))
  • Stochastic (SGD):
    • Compute s1(wt), then move by s1(wt) from wt to wt(1)
    • Compute s2(wt(1)), then move by s2(wt(1)) from wt(1) to wt(2)
    • …
    • Compute sN(wt(N−1)), then move by sN(wt(N−1)) from wt(N−1) to wt(N) = wt+1
  • In SGD, each micro-step is taken from a better (lower risk) place on average

SLIDE 21

Stochastic Gradient Descent

Why “Stochastic?”

  • Progress occurs only on average
  • Many micro-steps are bad, but they are good on average
  • Progress is a random walk

[Figure: a random-walk illustration of SGD progress, from https://towardsdatascience.com/]

SLIDE 22

Stochastic Gradient Descent

Reducing Variance: Mini-Batches

  • Each data sample is a poor estimate of T: High-variance micro-steps
  • Each micro-step takes full advantage of the estimate, by moving right away: Low-bias micro-steps
  • High variance may hurt more than low bias helps
  • Can we lower variance at the expense of bias?
  • Average B samples at a time: Take mini-steps
  • The B samples are a mini-batch
  • With bigger B,
    • Higher bias
    • Lower variance

SLIDE 23

Stochastic Gradient Descent

Mini-Batches

  • Scramble T at random
  • Divide T into J mini-batches Tj of size B
  • w(0) = w
  • For j = 1, …, J:
    • Batch gradient: gj = ∇LTj(w(j−1)) = (1/B) Σn=(j−1)B+1,…,jB ∇ℓn(w(j−1))
    • Move: w(j) = w(j−1) − α gj
  • This for loop amounts to one macro-step
  • Each execution of the entire loop uses the training data once
  • Each execution of the entire loop is an epoch
  • Repeat over several epochs until a stopping criterion is met

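One epoch of this loop in NumPy; `grad_loss(w, Xb, Yb)` is a hypothetical oracle returning the mini-batch gradient gj:

```python
import numpy as np

def sgd_epoch(w, X, Y, grad_loss, alpha=1e-2, B=32, rng=None):
    """One epoch of mini-batch SGD: scramble T, split it into mini-batches
    of size B, and take one mini-step per batch."""
    rng = rng or np.random.default_rng()
    perm = rng.permutation(len(X))        # scramble T at random
    for j in range(0, len(X), B):         # J mini-batches of size B
        idx = perm[j:j + B]
        w = w - alpha * grad_loss(w, X[idx], Y[idx])   # one mini-step
    return w
```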

SLIDE 24

Stochastic Gradient Descent

Momentum

  • Sometimes w(j) meanders around in shallow valleys

[Figure: risk versus iteration (up to about 1200 iterations), meandering between roughly 0.1 and 0.9; no α adjustment here]

  • α is too small, but the direction is still promising
  • Add momentum:
    v(0) = 0
    v(j+1) = µ(j) v(j) − α ∇LTj(w(j))   (with 0 ≤ µ(j) < 1)
    w(j+1) = w(j) + v(j+1)

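The momentum update in code; a minimal sketch with a constant µ (the slides allow µ(j) to vary across iterations):

```python
def momentum_step(w, v, grad, alpha, mu=0.9):
    """v <- mu * v - alpha * grad,  then  w <- w + v   (0 <= mu < 1).
    mu = 0 recovers the plain mini-batch step."""
    v = mu * v - alpha * grad
    return w + v, v
```

Starting from v = 0 and calling this once per mini-batch reproduces the update above.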

SLIDE 25

Regularization

Regularization

  • The capacity of deep networks is very high: It is often possible to achieve near-zero training risk
  • “Memorize the training set”
  • Overfitting
  • All training methods use some type of regularization
  • Regularization can be seen as inductive bias: Bias the training algorithm to find weights with certain properties
  • Simplest method: weight decay. Add a term λ‖w‖² to the risk function: Keep the weights small (Tikhonov)
  • Many proposals have been made
  • It is not yet clear which method works best; a few proposals follow

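Weight decay amounts to one extra gradient term; a sketch (the function name and λ value are illustrative):

```python
def decayed_grad(w, grad, lam=1e-4):
    """Adding lam * ||w||^2 to the risk adds 2 * lam * w to its gradient,
    which pulls the weights toward zero (Tikhonov regularization)."""
    return grad + 2.0 * lam * w
```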

SLIDE 26

Regularization

Early Termination

  • Terminating training well before LT is minimized is somewhat similar to “implicit” weight decay
  • Progress at each iteration is limited, so stopping early keeps us close to w0, which is a set of small random weights
  • Therefore, the norm of wt is restrained, albeit in terms of how long the learner takes to get there rather than in absolute terms
  • A more informed approach to early termination stops when a validation risk (or, even better, error rate) stops declining
  • This (with validation check) is arguably the most widely used regularization method

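A sketch of the validation-based variant: run epochs, track the validation risk, and stop when it has not improved for a while. `step` (one training epoch) and `val_risk` are hypothetical stand-ins:

```python
def train_early_stop(w, step, val_risk, max_epochs=100, patience=5):
    """Stop when the validation risk stops declining for `patience` epochs;
    return the weights that achieved the best validation risk."""
    best_w, best_risk, bad = w, float("inf"), 0
    for epoch in range(max_epochs):
        w = step(w)                    # e.g., one epoch of mini-batch SGD
        r = val_risk(w)
        if r < best_risk:
            best_w, best_risk, bad = w, r, 0
        else:
            bad += 1
            if bad >= patience:
                break                  # validation risk stopped declining
    return best_w
```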

SLIDE 27

Regularization

Dropout

  • Dropout is inspired by ensemble methods: Regularize by averaging multiple predictors
  • Key difficulty: It is too expensive to train an ensemble of deep neural networks
  • Efficient (crude!) approximation:
    • Before processing a new mini-batch, flip a coin with P[heads] = p (typically p = 1/2) for each neuron
    • Turn off the neurons for which the coin comes up tails
    • Restore all neurons at the end of the mini-batch
    • When training is done, multiply all weights by p
  • This is very loosely akin to training a different network for every mini-batch
  • Multiplication by p takes the “average” of all networks
  • There are flaws in the reasoning, but the method works

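A sketch of this scheme applied to a layer's activations (equivalent to turning neurons off); the test-time multiplication by p is written on the activations rather than the weights, which has the same effect:

```python
import numpy as np

def dropout(x, p=0.5, train=True, rng=np.random.default_rng()):
    """Training: keep each neuron with probability p (the coin comes up
    heads), zero the rest. Testing: keep everything and scale by p to
    take the 'average' of all the networks."""
    if train:
        return x * (rng.random(x.shape) < p)   # one coin flip per neuron
    return x * p
```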

SLIDE 28

Regularization

[Figure-only slide]

SLIDE 29

Regularization

Data Augmentation

  • Data augmentation is not a regularization method, but it combats overfitting
  • Make new training data out of thin air
  • Given a data sample (x, y), create perturbed copies x1, …, xk of x (these have the same label!)
  • Add the samples (x1, y), …, (xk, y) to the training set T
  • With images this is easy: The xi are cropped, rotated, stretched, re-colored, … versions of x
  • One training sample generates k new ones
  • T grows by a factor of k + 1
  • Very effective, used almost universally
  • Need to use realistic perturbations

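A sketch of the idea with two simple perturbations, assuming a 2-D grayscale image larger than the crop margin (real pipelines also rotate, stretch, and re-color, and resize crops back to the original shape):

```python
import numpy as np

def augment(x, y, k=4, rng=np.random.default_rng()):
    """Create k perturbed copies of image x, all with the same label y;
    adding them to T grows it by a factor of k + 1."""
    copies = []
    for _ in range(k):
        xi = np.flip(x, axis=1) if rng.random() < 0.5 else x  # horizontal flip
        r, c = rng.integers(0, 4, size=2)                     # random crop offset
        copies.append((xi[r:r + x.shape[0] - 4, c:c + x.shape[1] - 4], y))
    return copies
```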