SLIDE 1

Training Neural Nets

COMPSCI 371D — Machine Learning


SLIDE 2

Outline

1. The Softmax Simplex
2. Loss and Risk
3. Back-Propagation
4. Stochastic Gradient Descent
5. Regularization
6. Network Depth and Batch Normalization
7. Experiments with SGD


SLIDE 3

The Softmax Simplex

  • Neural-net classifier: $\hat{y} = h(x) : \mathbb{R}^d \to Y$
  • The last layer of a neural net used for classification is a soft-max layer: $p = \sigma(z) = \frac{\exp(z)}{\mathbf{1}^T \exp(z)}$
  • The net is $p = f(x, w) : X \to P$
  • The classifier is $\hat{y} = h(x) = \arg\max p = \arg\max f(x, w)$
  • $P$ is the set of all nonnegative real-valued vectors $p \in \mathbb{R}^e$ whose entries add up to 1 (with $e = |Y|$): $P \stackrel{\text{def}}{=} \{p \in \mathbb{R}^e : p \geq 0 \text{ and } \sum_{i=1}^e p_i = 1\}$.
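To make the map concrete, here is a small NumPy sketch (added, not from the slides) of the softmax layer and the resulting classifier. Subtracting max(z) before exponentiating is an added numerical-stability convention, not something the slide requires.

```python
import numpy as np

def softmax(z):
    # Subtract max(z) for numerical stability; sigma(z) is unchanged because
    # softmax is invariant to adding a constant to all entries of z.
    e = np.exp(z - np.max(z))
    return e / e.sum()          # p = exp(z) / (1^T exp(z)), a point in P

z = np.array([2.0, 1.0, -1.0])  # scores from the last linear layer
p = softmax(z)
print(p, p.sum())               # entries are nonnegative and sum to 1
y_hat = np.argmax(p)            # classifier h(x) = arg max p
```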


SLIDE 4

The Softmax Simplex

$P \stackrel{\text{def}}{=} \{p \in \mathbb{R}^e : p \geq 0 \text{ and } \sum_{i=1}^e p_i = 1\}$

[Figure: the simplex $P$ for $e = 2$ (segment between the unit points, midpoint $(1/2, 1/2)$) and $e = 3$ (triangle with vertices at the unit points, center $(1/3, 1/3, 1/3)$), with axes $p_1, p_2, p_3$]

  • Decision regions are polyhedral: $P_c = \{p \in P : p_c \geq p_j \text{ for } j \neq c\}$ for $c = 1, \ldots, e$
  • A network transforms images into points in $P$


SLIDE 5

Loss and Risk

Loss and Risk (Déjà Vu)

  • The ideal loss would be the 0-1 loss on the network output $\hat{y}$
  • The 0-1 loss is constant wherever it is differentiable!
  • Not useful for computing a gradient
  • Use the cross-entropy loss on the softmax output $p$ as a proxy loss: $\ell(y, p) = -\log p_y$

  • Non-differentiability of ReLU or max-pooling is minor (pointwise), and typically ignored
  • Risk, as usual: $L_T(w) = \frac{1}{N} \sum_{n=1}^N \ell_n(w)$ where $\ell_n(w) = \ell(y_n, f(x_n, w))$
  • We need $\nabla L_T(w)$ and therefore $\nabla \ell_n(w)$
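A matching sketch (added) of the cross-entropy loss and the empirical risk; the softmax outputs ps and labels ys below are hypothetical stand-ins for $f(x_n, w)$ and $y_n$.

```python
import numpy as np

def cross_entropy(y, p):
    # loss(y, p) = -log p_y: small when the net puts mass on the true class
    return -np.log(p[y])

def risk(ys, ps):
    # L_T(w) = (1/N) * sum_n loss(y_n, f(x_n, w))
    return np.mean([cross_entropy(y, p) for y, p in zip(ys, ps)])

ps = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]  # softmax outputs
ys = [0, 1]                                                   # true labels
print(risk(ys, ps))  # average cross-entropy over this (tiny) training set
```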


SLIDE 6

Back-Propagation

[Figure: the computation chain $x^{(0)} = x_n \to f^{(1)} \to x^{(1)} \to f^{(2)} \to x^{(2)} \to f^{(3)} \to x^{(3)} = p$, where layer $f^{(k)}$ has weights $w^{(k)}$, and the loss $\ell_n$ is computed from $p$ and $y_n$]

  • We need $\nabla L_T(w)$ and therefore $\nabla \ell_n(w) = \frac{\partial \ell_n}{\partial w}$
  • Computations from $x$ to $\ell_n$ form a chain
  • Apply the chain rule
  • Every derivative of $\ell_n$ w.r.t. layers before $k$ goes through $x^{(k)}$:
    $\frac{\partial \ell_n}{\partial w^{(k)}} = \frac{\partial \ell_n}{\partial x^{(k)}} \frac{\partial x^{(k)}}{\partial w^{(k)}} \qquad \frac{\partial \ell_n}{\partial x^{(k-1)}} = \frac{\partial \ell_n}{\partial x^{(k)}} \frac{\partial x^{(k)}}{\partial x^{(k-1)}}$ (recursion!)
  • Start: $\frac{\partial \ell_n}{\partial x^{(K)}} = \frac{\partial \ell}{\partial p}$


SLIDE 7

Back-Propagation

Local Jacobians

[Figure: the same three-layer chain as on the previous slide]

  • Local computations at layer $k$: $\frac{\partial x^{(k)}}{\partial w^{(k)}}$ and $\frac{\partial x^{(k)}}{\partial x^{(k-1)}}$
  • Partial derivatives of $f^{(k)}$ with respect to the layer weights and the input to the layer
  • Local Jacobian matrices; we can compute them by knowing what the layer does
  • The start of the process can be computed from knowing the loss function: $\frac{\partial \ell_n}{\partial x^{(K)}} = \frac{\partial \ell}{\partial p}$
  • Another local Jacobian
  • The rest is going recursively from output to input, one layer at a time, accumulating $\frac{\partial \ell_n}{\partial w^{(k)}}$ into a vector $\frac{\partial \ell_n}{\partial w}$


SLIDE 8

Back-Propagation

The Forward Pass

[Figure: the same three-layer chain as on the previous slides]

  • All local Jacobians, $\frac{\partial x^{(k)}}{\partial w^{(k)}}$ and $\frac{\partial x^{(k)}}{\partial x^{(k-1)}}$, are computed numerically for the current values of the weights $w^{(k)}$ and layer inputs $x^{(k-1)}$
  • Therefore, we need to know $x^{(k-1)}$ for training sample $n$ and for all $k$
  • This is achieved by a forward pass through the network: run the network on input $x_n$ and store $x^{(0)} = x_n, x^{(1)}, \ldots$


SLIDE 9

Back-Propagation

Back-Propagation Spelled Out for K = 3

[Figure: the same three-layer chain as on the previous slides]

$\frac{\partial \ell_n}{\partial w^{(k)}} = \frac{\partial \ell_n}{\partial x^{(k)}} \frac{\partial x^{(k)}}{\partial w^{(k)}} \qquad \frac{\partial \ell_n}{\partial x^{(k-1)}} = \frac{\partial \ell_n}{\partial x^{(k)}} \frac{\partial x^{(k)}}{\partial x^{(k-1)}}$ (after the forward pass)

$\frac{\partial \ell_n}{\partial x^{(3)}} = \frac{\partial \ell}{\partial p}$
$\frac{\partial \ell_n}{\partial w^{(3)}} = \frac{\partial \ell_n}{\partial x^{(3)}} \frac{\partial x^{(3)}}{\partial w^{(3)}} \qquad \frac{\partial \ell_n}{\partial x^{(2)}} = \frac{\partial \ell_n}{\partial x^{(3)}} \frac{\partial x^{(3)}}{\partial x^{(2)}}$
$\frac{\partial \ell_n}{\partial w^{(2)}} = \frac{\partial \ell_n}{\partial x^{(2)}} \frac{\partial x^{(2)}}{\partial w^{(2)}} \qquad \frac{\partial \ell_n}{\partial x^{(1)}} = \frac{\partial \ell_n}{\partial x^{(2)}} \frac{\partial x^{(2)}}{\partial x^{(1)}}$
$\frac{\partial \ell_n}{\partial w^{(1)}} = \frac{\partial \ell_n}{\partial x^{(1)}} \frac{\partial x^{(1)}}{\partial w^{(1)}}$

  • $\frac{\partial \ell_n}{\partial x^{(0)}} = \frac{\partial \ell_n}{\partial x^{(1)}} \frac{\partial x^{(1)}}{\partial x^{(0)}}$
  • $\frac{\partial \ell_n}{\partial w} = \left[ \frac{\partial \ell_n}{\partial w^{(1)}}, \frac{\partial \ell_n}{\partial w^{(2)}}, \frac{\partial \ell_n}{\partial w^{(3)}} \right]$

(Jacobians in blue are local, those in red are what we want eventually)
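Here is a minimal numeric sketch (added, not from the slides) of this recursion on a hypothetical two-layer chain; the layer choices (one linear map, one ReLU, loss = sum of outputs) are illustrative stand-ins, and each backward line mirrors one line of the recursion above.

```python
import numpy as np

# Hypothetical chain: x1 = V x0, x2 = relu(x1), loss = sum(x2).
V = np.array([[1.0, 2.0], [0.5, -1.0]])
x0 = np.array([1.0, 3.0])

# Forward pass: store all activations.
x1 = V @ x0
x2 = np.maximum(x1, 0.0)
loss = x2.sum()

# Backward pass: start the recursion with dloss/dx(2).
g = np.ones_like(x2)          # dloss/dx(2): d(sum)/dx2 = 1 for each entry
g = g * (x1 > 0)              # dloss/dx(1) = dloss/dx(2) times ReLU Jacobian
dloss_dV = np.outer(g, x0)    # dloss/dw(1) = dloss/dx(1) dx(1)/dV
dloss_dx0 = V.T @ g           # dloss/dx(0) = dloss/dx(1) dx(1)/dx(0)
```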


SLIDE 10

Back-Propagation

Computing Local Jacobians

$\frac{\partial x^{(k)}}{\partial w^{(k)}}$ and $\frac{\partial x^{(k)}}{\partial x^{(k-1)}}$

  • Easier to make a “layer” as simple as possible
  • $z = Vx + b$ is one layer (Fully Connected (FC), affine part)
  • $z = \rho(x)$ (ReLU) is another layer
  • Softmax, max-pooling, convolutional, ...


SLIDE 11

Back-Propagation

Local Jacobians for a FC Layer

$z = Vx + b$

  • $\frac{\partial z}{\partial x} = V$ (easy!)
  • $\frac{\partial z}{\partial w}$: What is $\frac{\partial z}{\partial V}$? Three subscripts: $\frac{\partial z_i}{\partial v_{jk}}$. A 3D tensor?
  • For a general package, tensors are the way to go
  • Conceptually, it may be easier to vectorize everything:
    $V = \begin{bmatrix} v_{11} & v_{12} & v_{13} \\ v_{21} & v_{22} & v_{23} \end{bmatrix}$, $b = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}$, $w = [v_{11}, v_{12}, v_{13}, v_{21}, v_{22}, v_{23}, b_1, b_2]^T$
  • $\frac{\partial z}{\partial w}$ is a $2 \times 8$ matrix
  • With $e$ outputs and $d$ inputs, an $e \times e(d + 1)$ matrix


SLIDE 12

Back-Propagation

The Jacobian $\frac{\partial z}{\partial w}$ for a FC Layer

$\begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} w_1 & w_2 & w_3 \\ w_4 & w_5 & w_6 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} + \begin{bmatrix} w_7 \\ w_8 \end{bmatrix}$

  • Don’t be afraid to spell things out:
    $z_1 = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_7$
    $z_2 = w_4 x_1 + w_5 x_2 + w_6 x_3 + w_8$
  • $\frac{\partial z}{\partial w} = \begin{bmatrix} \frac{\partial z_1}{\partial w_1} & \cdots & \frac{\partial z_1}{\partial w_8} \\ \frac{\partial z_2}{\partial w_1} & \cdots & \frac{\partial z_2}{\partial w_8} \end{bmatrix} = \begin{bmatrix} x_1 & x_2 & x_3 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & x_1 & x_2 & x_3 & 0 & 1 \end{bmatrix}$
  • Obvious pattern: repeat $x^T$, staggered, $e$ times
  • Then append the $e \times e$ identity at the end
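A quick NumPy check (added) that the staggered pattern really is the Jacobian: the staggered repetition of $x^T$ is $\mathrm{kron}(I_e, x^T)$, with the identity appended, and a finite-difference Jacobian agrees. The helper z_of_w and the random test weights are illustrative assumptions.

```python
import numpy as np

e, d = 2, 3                        # e outputs, d inputs of the FC layer
x = np.array([1.0, 2.0, 3.0])      # layer input

# Repeat x^T, staggered, e times; then append the e x e identity.
J = np.hstack([np.kron(np.eye(e), x), np.eye(e)])   # shape e x e(d+1) = 2 x 8

# Finite-difference check against z = V x + b, w = [v11,...,v23, b1, b2]^T.
def z_of_w(w):
    V, b = w[:e * d].reshape(e, d), w[e * d:]
    return V @ x + b

w0, eps = np.random.randn(e * (d + 1)), 1e-6
J_num = np.column_stack([(z_of_w(w0 + eps * np.eye(8)[j]) - z_of_w(w0)) / eps
                         for j in range(8)])
assert np.allclose(J, J_num, atol=1e-4)
```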


SLIDE 13

Stochastic Gradient Descent

Training

  • Local gradients are used in back-propagation
  • So we now have $\nabla L_T(w)$
  • $\hat{w} = \arg\min L_T(w)$
  • $L_T(w)$ is (very) non-convex, so we look for local minima
  • $w \in \mathbb{R}^m$ with $m$ very large: no Hessians
  • Gradient descent
  • Even so, every step calls back-propagation $N = |T|$ times
  • Back-propagation computes $m$ derivatives $\nabla \ell_n(w)$
  • Computational complexity is $\Omega(mN)$ per step
  • Even gradient descent is way too expensive!


SLIDE 14

Stochastic Gradient Descent

No Line Search

  • Line search is out of the question
  • Fix some step multiplier $\alpha$, called the learning rate:
    $w_{t+1} = w_t - \alpha \nabla L_T(w_t)$
  • How to pick $\alpha$? Cross-validation is too expensive
  • Tradeoffs:
    • $\alpha$ too small: slow progress
    • $\alpha$ too big: jump over minima
  • Frequent practice:
    • Start with $\alpha$ relatively large, and monitor $L_T(w)$
    • When $L_T(w)$ levels off, decrease $\alpha$
  • Alternative: fixed decay schedule for $\alpha$
  • Better (recent) option: change $\alpha$ adaptively (Adam, 2015)


SLIDE 15

Stochastic Gradient Descent

Manual Adjustment of α

  • Start with $\alpha$ relatively large, and monitor $L_T(w_t)$
  • When $L_T(w_t)$ levels off, decrease $\alpha$
  • Typical plots of $L_T(w_t)$ versus iteration index $t$:

[Figure: training risk $L_T(w_t)$ versus iteration index $t$]


SLIDE 16

Stochastic Gradient Descent

Batch Gradient Descent

  • $\nabla L_T(w) = \frac{1}{N} \sum_{n=1}^N \nabla \ell_n(w)$
  • Taking a macro-step $-\alpha \nabla L_T(w_t)$ is the same as taking the $N$ micro-steps $-\frac{\alpha}{N} \nabla \ell_1(w_t), \ldots, -\frac{\alpha}{N} \nabla \ell_N(w_t)$
  • First compute all the $N$ steps at $w_t$, then take all the steps
  • Thus, standard gradient descent is a batch method: compute the gradient at $w_t$ using the entire batch of data, then move
  • Even with no line search, computing $N$ micro-steps is still expensive


SLIDE 17

Stochastic Gradient Descent

Stochastic Descent

  • Taking a macro-step $-\alpha \nabla L_T(w_t)$ is the same as taking the $N$ micro-steps $-\frac{\alpha}{N} \nabla \ell_1(w_t), \ldots, -\frac{\alpha}{N} \nabla \ell_N(w_t)$
  • First compute all the $N$ steps at $w_t$, then take all the steps
  • Can we use this effort more effectively?
  • Key observation: $-\nabla \ell_n(w)$ is a poor estimate of $-\nabla L_T(w)$, but an estimate all the same: micro-steps are correct on average!
  • After each micro-step, we are on average in a better place
  • How about computing a new micro-gradient after every micro-step?
  • Now each micro-step gradient is evaluated at a point that is on average better (lower risk) than in the batch method


SLIDE 18

Stochastic Gradient Descent

Batch versus Stochastic Gradient Descent

  • $s_n(w) = -\frac{\alpha}{N} \nabla \ell_n(w)$
  • Batch:
    • Compute $s_1(w_t), \ldots, s_N(w_t)$
    • Move by $s_1(w_t)$, then $s_2(w_t)$, ... then $s_N(w_t)$ (or equivalently move once by $s_1(w_t) + \ldots + s_N(w_t)$)
  • Stochastic (SGD):
    • Compute $s_1(w_t)$, then move by $s_1(w_t)$ from $w_t$ to $w_t^{(1)}$
    • Compute $s_2(w_t^{(1)})$, then move by $s_2(w_t^{(1)})$ from $w_t^{(1)}$ to $w_t^{(2)}$
    • ...
    • Compute $s_N(w_t^{(N-1)})$, then move by $s_N(w_t^{(N-1)})$ from $w_t^{(N-1)}$ to $w_t^{(N)} = w_{t+1}$
  • In SGD, each micro-step is taken from a better (lower risk) place on average


SLIDE 19

Stochastic Gradient Descent

Why “Stochastic?”

  • Progress occurs only on average
  • Many micro-steps are bad, but they are good on average
  • Progress is a random walk

[Figure: random-walk illustration; image from https://towardsdatascience.com/]

SLIDE 20

Stochastic Gradient Descent

Reducing Variance: Mini-Batches

  • Each single data sample is a poor stand-in for all of $T$: high-variance micro-steps
  • Each micro-step takes full advantage of the estimate by moving right away: low-bias micro-steps
  • High variance may hurt more than low bias helps
  • Can we lower variance at the expense of bias?
  • Average $B$ samples at a time: take mini-steps
  • With bigger $B$:
    • Higher bias
    • Lower variance
  • The $B$ samples are a mini-batch


SLIDE 21

Stochastic Gradient Descent

Mini-Batches

  • Scramble $T$ at random
  • Divide $T$ into $J$ mini-batches $T_j$ of size $B$
  • $w^{(0)} = w$
  • For $j = 1, \ldots, J$:
    • Batch gradient: $g_j = \nabla L_{T_j}(w^{(j-1)}) = \frac{1}{B} \sum_{n=(j-1)B+1}^{jB} \nabla \ell_n(w^{(j-1)})$
    • Move: $w^{(j)} = w^{(j-1)} - \alpha g_j$
  • This for loop amounts to one macro-step
  • Each execution of the entire loop uses the training data once
  • Each execution of the entire loop is an epoch
  • Repeat over several epochs until a stopping criterion is met (a code sketch follows below)
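A minimal sketch (added) of one epoch of mini-batch SGD as described above; grad_loss(w, x, y), the data arrays X and Y, and the hyperparameter values are hypothetical placeholders.

```python
import numpy as np

def sgd_epoch(w, X, Y, grad_loss, alpha=0.1, B=32, rng=None):
    """One epoch: scramble T, split into mini-batches, one step per batch."""
    if rng is None:
        rng = np.random.default_rng()
    perm = rng.permutation(len(X))              # scramble T at random
    for j in range(0, len(X), B):               # J = ceil(N / B) mini-batches
        idx = perm[j:j + B]
        # g_j: average of per-sample gradients over mini-batch T_j
        g = np.mean([grad_loss(w, X[n], Y[n]) for n in idx], axis=0)
        w = w - alpha * g                       # w(j) = w(j-1) - alpha * g_j
    return w
```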


SLIDE 22

Stochastic Gradient Descent

Momentum

  • Sometimes $w^{(j)}$ meanders around in shallow valleys

[Figure: risk versus iteration (0 to about 1200), risk between roughly 0.1 and 0.9; no $\alpha$ adjustment here]

  • $\alpha$ is too small, the direction is still promising
  • Add momentum (a sketch follows below):
    $v^{(0)} = 0$
    $v^{(j+1)} = \mu^{(j)} v^{(j)} - \alpha \nabla L_T(w^{(j)}) \qquad (0 \leq \mu^{(j)} < 1)$
    $w^{(j+1)} = w^{(j)} + v^{(j+1)}$
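In code, the momentum recursion above looks like this (added sketch); the constant $\mu$ and the grad_risk callable are hypothetical choices.

```python
import numpy as np

def momentum_steps(w, grad_risk, alpha=0.01, mu=0.9, steps=100):
    # v accumulates a decaying sum of past gradients, so steps keep pointing
    # in a promising direction even when each single gradient is small.
    v = np.zeros_like(w)
    for _ in range(steps):
        v = mu * v - alpha * grad_risk(w)   # v(j+1) = mu v(j) - alpha grad
        w = w + v                           # w(j+1) = w(j) + v(j+1)
    return w
```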


SLIDE 23

Regularization

  • The capacity of deep networks is very high: it is often possible to achieve near-zero training loss
  • “Memorize the training set”
  • Overfitting
  • All training methods use some type of regularization
  • Regularization can be seen as inductive bias: bias the training algorithm to find weights with certain properties
  • Simplest method: weight decay; add a term $\lambda \|w\|^2$ to the loss function to keep the weights small (Tikhonov); see the sketch after this list
  • Many proposals have been made
  • Not yet clear which method works best; a few proposals follow
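A one-line sketch (added) of how weight decay changes the gradient step; grad_risk, alpha, and lam are hypothetical placeholders.

```python
def decayed_step(w, grad_risk, alpha=0.1, lam=1e-4):
    # Adding lambda * ||w||^2 to the loss adds 2 * lambda * w to its gradient,
    # so each step also shrinks the weights toward zero ("weight decay").
    return w - alpha * (grad_risk(w) + 2 * lam * w)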


SLIDE 24

Regularization

Early Termination

  • Terminating training well before $L_T$ is minimized is somewhat similar to “implicit” weight decay
  • Progress at each iteration is limited, so stopping early keeps us close to $w_0$, which is a set of small random weights
  • Therefore, the norm of $w_t$ is restrained, albeit in terms of how long the learner takes to get there rather than in absolute terms
  • A more informed approach to early termination stops when a validation risk (or, even better, error rate) stops declining; see the sketch after this list
  • This is arguably the most widely used regularization method
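A sketch (added) of validation-based early termination; train_epoch and val_error are hypothetical helpers, and patience is an assumed tolerance on how many epochs without improvement to allow.

```python
def train_with_early_stopping(w, train_epoch, val_error, patience=5):
    # Stop when the validation error has not improved for `patience` epochs.
    best_w, best_err, bad_epochs = w, float("inf"), 0
    while bad_epochs < patience:
        w = train_epoch(w)
        err = val_error(w)
        if err < best_err:
            best_w, best_err, bad_epochs = w, err, 0
        else:
            bad_epochs += 1
    return best_w        # weights at the best validation error seen
```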


SLIDE 25

Regularization

Dropout

  • Dropout is inspired by ensemble methods (random forests): regularize by averaging multiple predictors
  • Key difficulty: it is too expensive to train an ensemble of deep neural networks
  • Efficient (crude!) approximation:
    • Before processing a new mini-batch, flip a coin with $P[\text{heads}] = p$ (typically $p = 1/2$) for each neuron
    • Turn off the neurons for which the coin comes up tails
    • Restore all neurons at the end of the mini-batch
    • When training is done, multiply all weights by $p$
  • This is very loosely akin to training a different network for every mini-batch
  • Multiplication by $p$ takes the “average” of all networks
  • There are flaws in the reasoning, but the method works (a sketch follows below)
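A sketch (added) of the coin-flipping step for one mini-batch; x stands for a layer's activations and p for the keep probability above, and the function name is a hypothetical stand-in.

```python
import numpy as np

def dropout_forward(x, p=0.5, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # One coin per neuron: heads (probability p) keeps the neuron on.
    mask = rng.random(x.shape) < p
    return x * mask   # tails: neuron turned off for this mini-batch

# After training, multiply the weights by p so that inference sees the
# "average" of all the randomly thinned networks.
```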



SLIDE 27

Regularization

Data Augmentation

  • Data augmentation is not a regularization method, but it combats overfitting
  • Make new training data out of thin air
  • Given data sample $(x, y)$, create perturbed copies $x_1, \ldots, x_k$ of $x$ (these have the same label!)
  • Add samples $(x_1, y), \ldots, (x_k, y)$ to the training set $T$
  • With images this is easy: the $x_i$ are cropped, rotated, stretched, re-colored, ... versions of $x$
  • One training sample generates $k$ new ones
  • $T$ grows by a factor of $k + 1$
  • Very effective, used almost universally
  • Need to use realistic perturbations (a sketch follows below)
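A sketch (added) of simple image perturbations in plain NumPy; the choice of flips and small shifts is illustrative, not the slides' recipe.

```python
import numpy as np

def augment(x, k=3, rng=None):
    """Make k perturbed copies of image x (an H x W array); labels unchanged."""
    if rng is None:
        rng = np.random.default_rng()
    copies = []
    for _ in range(k):
        xi = x.copy()
        if rng.random() < 0.5:
            xi = xi[:, ::-1]                      # horizontal flip
        dy, dx = rng.integers(-2, 3, size=2)      # small random shift
        xi = np.roll(xi, (dy, dx), axis=(0, 1))
        copies.append(xi)
    return copies   # add (x_i, y) for each copy to the training set T
```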


SLIDE 28

Network Depth and Batch Normalization

Current Trend: Go Deeper

  • If the output of the last layer comes from a ReLU, it is nonnegative
  • Therefore, an additional layer, even with ReLU, can implement the identity by setting $V = I$ and $b = 0$
  • Therefore, more layers give more capacity (expressive power)
  • So, why not go deeper?
  • Two problems with greater capacity:
    • Overfitting
    • Vanishing or exploding gradients
  • Overfitting can be controlled by regularization


SLIDE 29

Network Depth and Batch Normalization

Vanishing or Exploding Gradients

[Figure: the three-layer chain from the back-propagation slides]

  • The recursion $\frac{\partial \ell_n}{\partial x^{(k-1)}} = \frac{\partial \ell_n}{\partial x^{(k)}} \frac{\partial x^{(k)}}{\partial x^{(k-1)}}$ yields
    $\frac{\partial \ell_n}{\partial x^{(i)}} = \frac{\partial \ell_n}{\partial x^{(K)}} \frac{\partial x^{(K)}}{\partial x^{(K-1)}} \cdots \frac{\partial x^{(i+1)}}{\partial x^{(i)}} = \frac{\partial \ell_n}{\partial x^{(K)}} J_K \cdots J_{i+1}$
  • The feedback signal (gradient) from the loss $\ell_n$ to layer $i$, and therefore also $\frac{\partial \ell_n}{\partial w^{(i)}} = \frac{\partial \ell_n}{\partial x^{(i)}} \frac{\partial x^{(i)}}{\partial w^{(i)}}$, depends on the product $J^{(i)} = J_K \cdots J_{i+1}$ of layer Jacobians
  • $\det(J^{(i)}) = \det(J_K) \cdots \det(J_{i+1})$ determines (pun intended) the magnitude of the gradient
  • Vanishing gradients choke information flow: no progress in early layers
  • Exploding gradients cause instability during training
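A quick numerical illustration (added): multiplying many random layer Jacobians shrinks or blows up the backward signal depending on their scale. The dimensions, depth, and scales are arbitrary assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)
g = np.ones(10)                          # stand-in for dloss/dx(K)
for scale in (0.5, 1.5):
    h, K = g.copy(), 50                  # a 50-layer chain of Jacobians
    for _ in range(K):
        # Random Jacobian with entries of standard deviation scale/sqrt(10),
        # so each multiplication changes the norm by about `scale` on average.
        J = scale * rng.standard_normal((10, 10)) / np.sqrt(10)
        h = h @ J                        # one step of the backward recursion
    print(scale, np.linalg.norm(h))      # ~0 (vanishing) or huge (exploding)
```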


SLIDE 30

Network Depth and Batch Normalization

Batch Normalization

  • Ideally, we would like the norms of all activations $x^{(0)}, \ldots, x^{(K)}$ to be equal ($\det(J_i) \approx 1$)
  • Suppose that we could interpose a layer $\beta_k$ between layers $k$ and $k+1$ that subtracts the mean of all possible outputs $x^{(k)}$ from layer $k$ and divides by their standard deviation:
    $\hat{x}_k^{(c)} = \frac{x_k^{(c)} - \mu_k^{(c)}}{\sigma_k^{(c)}}$ for component $c$ of $x_k$
  • Then, layer $k$ together with $\beta_k$ has normalized outputs
  • If we do this for all layers, all layers transform normalized inputs to normalized outputs
  • Problem 1: we don’t know “all possible outputs $x^{(k)}$ from layer $k$” because the network changes during training
  • Problem 2: we limit the expressive power of the network


SLIDE 31

Network Depth and Batch Normalization

Batch Normalization

  • Normalize each activation by an estimate of its mean and standard deviation
  • During training, compute the estimate over each mini-batch
  • During inference, use the mean estimate over all mini-batches
  • Let $x$ be a scalar activation just before a non-linearity
  • Let $\mu, \sigma$ be the sample mean and standard deviation of $x$ over the current mini-batch
  • Pass $x$ through a Batch Normalization (BN) module that
    • Normalizes each component of $x$: $\hat{x} = \frac{x - \mu}{\sigma}$
    • Computes $z = \gamma \hat{x} + \beta$
  • The learnable parameters $\gamma$ and $\beta$ restore the layer’s expressive power (a sketch follows below)
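A minimal sketch (added) of the BN computation during training, for one unit's activations over a mini-batch; eps, a small constant guarding against division by zero, is a standard added assumption not mentioned on the slide.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: activations of one unit over a mini-batch (1-D array)."""
    mu, sigma = x.mean(), x.std()
    x_hat = (x - mu) / (sigma + eps)   # normalize over the mini-batch
    return gamma * x_hat + beta        # learnable de-normalization

x = np.array([1.0, 2.0, 4.0, 5.0])
z = batch_norm_train(x, gamma=1.0, beta=0.0)
print(z.mean(), z.std())               # ~0 and ~1 with gamma=1, beta=0
```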


SLIDE 32

Network Depth and Batch Normalization

Normalization and De-Normalization

  • Wait, what? What is the point of normalizing $x$ to $\hat{x} = \frac{x - \mu}{\sigma}$ and then letting the network undo the normalization by $z = \gamma \hat{x} + \beta$?
  • Why we must do this: if we don’t, we restrict the expressive power of the layer
  • Why we can do this: the de-normalization is local
  • If, say, $\gamma = 2$ in layer $k$, then mini-batch inputs to layer $k+1$ are twice as large, and will be normalized again by a $\sigma$ that is also twice as big
  • BN in layer $k$ accounts for all the $\gamma$s in previous layers
  • The $\gamma$s in different layers do not multiply


SLIDE 33

Network Depth and Batch Normalization

Example

  • Only look at standard deviations for simplicity (similar considerations hold for means)
  • Start with $\gamma_1 = 1$ in layer 1
  • Outputs of layer 2 have standard deviation $\sigma_2$ before BN
  • Now change $\gamma_1$ to $\gamma_1' = 2$
  • Outputs from layer 2 now have $\sigma_2' = 2\sigma_2$ before BN
  • They are twice as big, but BN divides them by a standard deviation that is twice as big as well
  • The $\mu, \sigma$ statistics of the outputs from layer 2 are unchanged after BN
  • Key point: $\gamma_1, \beta_1$ affect $\mu_2, \sigma_2$


SLIDE 34

Network Depth and Batch Normalization

Going Deeper

  • With batch normalization, gradients are tame
  • Need to compute BN Jacobians for back-propagation
  • Need to store estimates of $\mu, \sigma$ for inference
  • Everything else remains the same
  • Network depth is no longer a problem for training
  • Regularization reduces overfitting for deep networks
  • Networks with BN often have tens or hundreds of layers
  • A network with 1000 layers was shown to be trainable (Deep Residual Learning for Image Recognition, He et al., arXiv, 2015)
  • Of course, regularization and data augmentation are now even more crucial


SLIDE 35

Experiments with SGD

  • SGD is difficult to analyze because the risk depends on many samples and in many dimensions: $L_T(w) = \frac{1}{N} \sum_{n=1}^N \ell_n(w)$
  • Xing et al., A Walk with SGD, arXiv, 2018, gives some insights (optional reading)
  • Main (empirical) results:
    • SGD spends quite a bit of time bouncing between the walls of descending canyons with debris at the bottom
    • A greater learning rate helps stay away from the bottom
    • Stochasticity from mini-batches helps move along, rather than oscillating in place


SLIDE 36

Experiments with SGD

Is the Risk Landscape a Wadi?

  • Full gradient, no momentum: $w_{t+1} = w_t - \alpha \nabla L_T(w_t)$
  • Plotted: ten points of $L_T(w)$ between $w_t$ and $w_{t+1}$
  • Parameter Distance is $\|w_t - w_0\|$


SLIDE 37

Experiments with SGD

Reducing α Brings Us to the Bottom

  • Height over floor: $h_t = \frac{L_T(w_t) + L_T(w_{t+1})}{2} - \min_{w \in [w_t, w_{t+1}]} L_T(w)$

[Figure: $L_T$ along the segment from $w_t$ to $w_{t+1}$, with $h_t$ the height of the average of the endpoint risks over the lowest point between them]

  • Typical averages (see paper for variances):

      α      Epoch 1   Epoch 10   Epoch 25
      0.1    0.0625    0.0199     0.0104
      0.05   0.0102    0.0050     0.0035

  • Smaller $\alpha$ ⇒ more likely to get stuck in the debris


SLIDE 38

Experiments with SGD

Stochasticity Reduces In-Place Oscillations

[Figure: two panels of SGD trajectories, “Minibatch Size N = 10,000” versus “Minibatch Size 100”]

  • CIFAR-10, $\alpha = 0.1$
  • Many more variants in the paper, similar results


SLIDE 39

Experiments with SGD

Flatter Minima Seem to Generalize Better

  • The spectral norm of a matrix $A$ is the square root of the largest eigenvalue of $A^T A$; for a square positive semidefinite matrix, it is the largest eigenvalue itself
  • A Hessian with large spectral norm indicates a sharp minimum (high curvature)
  • Spectral norm correlates negatively with validation accuracy
  • Flatter minima seem to generalize better
  • Caveat: some literature shows that sharp minima can generalize well, too


SLIDE 40

Experiments with SGD

Thoughts about Xing et al.

  • Much more analysis in the paper
  • Main points:
    • A large learning rate helps avoid local traps
    • Stochasticity from mini-batches prevents in-place oscillations
    • Flatter minima seem to generalize better
  • Experiments are done well, across datasets and with error bars
  • The analysis combines landscape properties and SGD behavior
  • Separating these factors would be nice but hard, as there is “too much space out there”
  • Empirical data are best used as observations to spark theoretical conjectures
