Lecture 6: Optimization (CS109B Data Science 2, Pavlos Protopapas and Mark Glickman)

SLIDE 1

CS109B Data Science 2

Pavlos Protopapas and Mark Glickman

Lecture 6: Optimization

SLIDE 2

Outline

Optimization

  • Challenges in Optimization
  • Momentum
  • Adaptive Learning Rate
  • Parameter Initialization
  • Batch Normalization


SLIDE 3

Learning vs. Optimization

Goal of learning: minimize the generalization error. In practice, we perform empirical risk minimization:

The quantity optimized differs from the quantity we actually care about.

J(θ) = E_(x,y)~p_data [ L(f(x; θ), y) ]

Ĵ(θ) = (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i))

SLIDE 4

Batch vs. Stochastic Algorithms

Batch algorithms

  • Optimize empirical risk using exact gradients

Stochastic algorithms

  • Estimates gradient from a small random sample


Large mini-batch: the gradient computation is expensive. Small mini-batch: greater variance in the estimate, so more steps are needed for convergence.

∇J(θ) = E_(x,y)~p_data [ ∇L(f(x; θ), y) ]

SLIDE 5

Critical Points

Points with zero gradient. The 2nd derivative (Hessian) determines the curvature around them.


Goodfellow et al. (2016)

SLIDE 6

Stochastic Gradient Descent

Take small steps in the direction of the negative gradient. Sample m examples from the training set and compute:

g = (1/m) Σ_i ∇_θ L(f(x^(i); θ), y^(i))

Update parameters (ε_k is the learning rate at step k):

θ = θ − ε_k g

In practice: shuffle the training set once and pass through it multiple times.
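As a concrete illustration, a minimal numpy sketch of this loop on a toy linear-regression problem (the data, batch size, and constant learning rate are illustrative assumptions, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))               # toy design matrix
    true_w = np.array([1.0, -2.0, 0.5])
    y = X @ true_w + 0.1 * rng.normal(size=1000)

    theta = np.zeros(3)                          # parameters
    eps = 0.1                                    # learning rate (epsilon_k, kept constant here)
    m = 32                                       # mini-batch size

    for step in range(500):
        idx = rng.choice(len(X), size=m, replace=False)  # sample m examples
        err = X[idx] @ theta - y[idx]
        g = X[idx].T @ err / m                   # g = (1/m) sum_i grad L_i (squared-error loss)
        theta = theta - eps * g                  # theta <- theta - eps * g

    print(theta)                                 # close to true_w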

SLIDE 7

Outline

Optimization

  • Challenges in Optimization
  • Momentum
  • Adaptive Learning Rate
  • Parameter Initialization
  • Batch Normalization


SLIDE 8

Local Minima


Goodfellow et al. (2016)

SLIDE 9

Local Minima

Old view: local minima are a major problem in neural network training. Recent view:

  • For sufficiently large neural networks, most local minima incur low cost
  • Not important to find true global minimum


SLIDE 10

Saddle Points

Recent studies indicate that in high dimensions, saddle points are more likely than local minima. The gradient can be very small near saddle points.


A saddle point is a local minimum along some directions and a local maximum along others.

Goodfellow et al. (2016)

SLIDE 11

No Critical Points

Gradient norm increases, but validation error decreases


Convolutional nets for object detection (Goodfellow et al., 2016)

SLIDE 12

Saddle Points

SGD is often seen to escape saddle points: it moves downhill using noisy gradients. Second-order methods can get stuck: they solve for a point with zero gradient.


Goodfellow et al. (2016)

SLIDE 13

Poor Conditioning

Poorly conditioned Hessian matrix. High curvature: even small steps lead to a large increase in cost. Learning is slow despite strong gradients.


Oscillations slow down progress

SLIDE 14

No Critical Points

Some cost functions do not have critical points, in particular many of those used in classification.


SLIDE 15

Exploding and Vanishing Gradients


Linear activation

deeplearning.ai

ℎ" = 𝑋𝑦 ℎ" = 𝑋ℎ"&', 𝑗 = 2, … , 𝑜

SLIDE 16

Exploding and Vanishing Gradients


Suppose W = [ a  0 ; 0  b ]. Then

[ h_1^(1) ; h_2^(1) ] = [ a  0 ; 0  b ] [ x_1 ; x_2 ],   …,   [ h_1^(n) ; h_2^(n) ] = [ a^n  0 ; 0  b^n ] [ x_1 ; x_2 ]

SLIDE 17

Exploding and Vanishing Gradients


Suppose x = [ 1 ; 1 ].

Case 1: a = 1, b = 2:   y → [ 1 ; 2^n ],   ∇y → [ n ; n·2^(n−1) ]   (explodes!)
Case 2: a = 0.5, b = 0.9:   y → [ 0 ; 0 ],   ∇y → [ n·0.5^(n−1) ; n·0.9^(n−1) ] → [ 0 ; 0 ]   (vanishes!)
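A small numeric sketch of this example (the depth of 50 layers and the use of numpy are illustrative choices, not from the slides):

    import numpy as np

    x = np.array([1.0, 1.0])
    for a, b, label in [(1.0, 2.0, "explodes"), (0.5, 0.9, "vanishes")]:
        W = np.diag([a, b])          # shared linear-layer weights
        h = x.copy()
        for _ in range(50):          # n = 50 linear layers, no nonlinearity
            h = W @ h
        print(label, h)              # [1, 2**50] vs. values shrinking toward 0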

SLIDE 18

Exploding and Vanishing Gradients

Exploding gradients lead to cliffs in the cost surface. This can be mitigated using gradient clipping:


Goodfellow et al. (2016)

if ‖g‖ > v:   g ← g · v / ‖g‖    (v is the clipping threshold)
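A minimal sketch of norm-based clipping, assuming the rule above (the function name and threshold value are illustrative):

    import numpy as np

    def clip_by_norm(g, v=1.0):
        """Rescale gradient g so its L2 norm never exceeds v."""
        norm = np.linalg.norm(g)
        if norm > v:
            g = g * (v / norm)
        return g

    print(clip_by_norm(np.array([3.0, 4.0]), v=1.0))   # -> [0.6, 0.8], norm 1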

SLIDE 19

Outline

Optimization

  • Challenges in Optimization
  • Momentum
  • Adaptive Learning Rate
  • Parameter Initialization
  • Batch Normalization


SLIDE 20

Stochastic Gradient Descent


Oscillations because updates do not exploit curvature information

Goodfellow et al. (2016)


SLIDE 21

Momentum

SGD is slow when there is high curvature. The averaged gradient gives a faster path to the optimum: the vertical components cancel out.



SLIDE 22

Momentum

Uses past gradients for the update. Maintains a new quantity, the 'velocity': an exponentially decaying average of gradients.


v = αv − εg,   where α ∈ [0, 1) controls how quickly the effect of past gradients decays and −εg is the current gradient update

g = (1/m) Σ_i ∇_θ L(f(x^(i); θ), y^(i))

SLIDE 23

Momentum

Compute gradient estimate:   g = (1/m) Σ_i ∇_θ L(f(x^(i); θ), y^(i))
Update velocity:   v = αv − εg
Update parameters:   θ = θ + v
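A minimal numpy sketch of these three steps on a toy poorly conditioned quadratic (the objective and hyper-parameter values are illustrative assumptions):

    import numpy as np

    def grad(theta):
        # gradient of a toy poorly conditioned quadratic: J = 0.5*(theta1^2 + 10*theta2^2)
        return np.array([theta[0], 10.0 * theta[1]])

    theta = np.array([5.0, 5.0])
    v = np.zeros_like(theta)        # velocity
    alpha, eps = 0.9, 0.05          # momentum coefficient and learning rate

    for _ in range(100):
        g = grad(theta)             # gradient estimate
        v = alpha * v - eps * g     # update velocity
        theta = theta + v           # update parameters

    print(theta)                    # approaches the minimum at the origin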

SLIDE 24

Momentum


Damped oscillations: gradients in opposite directions get cancelled out

Goodfellow et al. (2016)


SLIDE 25

Nesterov Momentum

Apply an interim update:   θ̃ = θ + αv
Perform a correction based on the gradient at the interim point:   g = (1/m) Σ_i ∇_θ L(f(x^(i); θ̃), y^(i))
Update velocity and parameters:   v = αv − εg,   θ = θ + v

Momentum based on the look-ahead slope.
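The same toy loop with the Nesterov look-ahead (again an illustrative sketch, reusing the quadratic from the momentum sketch above):

    import numpy as np

    def grad(theta):
        return np.array([theta[0], 10.0 * theta[1]])   # same toy quadratic gradient

    theta = np.array([5.0, 5.0])
    v = np.zeros_like(theta)
    alpha, eps = 0.9, 0.05

    for _ in range(100):
        theta_ahead = theta + alpha * v     # interim (look-ahead) update
        g = grad(theta_ahead)               # gradient at the interim point
        v = alpha * v - eps * g
        theta = theta + v

    print(theta)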

SLIDE 26

SLIDE 27

Outline

Optimization

  • Challenges in Optimization
  • Momentum
  • Adaptive Learning Rate
  • Parameter Initialization
  • Batch Normalization


SLIDE 28

Adaptive Learning Rates

Oscillations along vertical direction

– Learning must be slower along parameter 2

Use a different learning rate for each parameter?


[Figure: contours of J(θ) over parameters θ_1 and θ_2]

SLIDE 29

AdaGrad

  • Accumulate squared gradients:   r_i = r_i + g_i^2
  • Update each parameter:   θ_i = θ_i − (ε / (δ + √r_i)) g_i
  • Greater progress along gently sloped directions

The step size is inversely proportional to the cumulative squared gradient.
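A per-parameter numpy sketch of the AdaGrad accumulation and update (the toy gradient and the constants are illustrative assumptions):

    import numpy as np

    def grad(theta):
        return np.array([theta[0], 10.0 * theta[1]])    # toy quadratic gradient

    theta = np.array([5.0, 5.0])
    r = np.zeros_like(theta)            # accumulated squared gradients
    eps, delta = 0.5, 1e-7

    for _ in range(200):
        g = grad(theta)
        r = r + g ** 2                                  # accumulate squared gradients
        theta = theta - eps * g / (delta + np.sqrt(r))  # per-parameter step

    print(theta)                        # moves toward the minimum at the origin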

SLIDE 30

RMSProp

  • For non-convex problems, AdaGrad can prematurely decrease the learning rate
  • Use an exponentially weighted average for gradient accumulation:

r_i = ρ r_i + (1 − ρ) g_i^2
θ_i = θ_i − (ε / (δ + √r_i)) g_i
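The same sketch with the exponentially weighted accumulator (ρ = 0.9 is an illustrative choice):

    import numpy as np

    def grad(theta):
        return np.array([theta[0], 10.0 * theta[1]])

    theta = np.array([5.0, 5.0])
    r = np.zeros_like(theta)
    eps, delta, rho = 0.05, 1e-7, 0.9

    for _ in range(200):
        g = grad(theta)
        r = rho * r + (1 - rho) * g ** 2                # exponentially weighted average
        theta = theta - eps * g / (delta + np.sqrt(r))

    print(theta)                        # moves close to the minimum at the origin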

SLIDE 31

Adam

  • RMSProp + Momentum
  • Estimate first moment:   v_i = ρ_1 v_i + (1 − ρ_1) g_i
  • Estimate second moment:   r_i = ρ_2 r_i + (1 − ρ_2) g_i^2
  • Update parameters:   θ_i = θ_i − (ε / (δ + √r_i)) v_i

Also applies bias correction to v and r. Works well in practice and is fairly robust to hyper-parameters.
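A compact numpy sketch of Adam including the bias correction mentioned above (the hyper-parameter values are common defaults, assumed here for illustration):

    import numpy as np

    def grad(theta):
        return np.array([theta[0], 10.0 * theta[1]])   # toy quadratic gradient

    theta = np.array([5.0, 5.0])
    v = np.zeros_like(theta)            # first-moment estimate
    r = np.zeros_like(theta)            # second-moment estimate
    eps, delta = 0.1, 1e-8
    rho1, rho2 = 0.9, 0.999

    for t in range(1, 501):
        g = grad(theta)
        v = rho1 * v + (1 - rho1) * g                   # first moment
        r = rho2 * r + (1 - rho2) * g ** 2              # second moment
        v_hat = v / (1 - rho1 ** t)                     # bias correction
        r_hat = r / (1 - rho2 ** t)
        theta = theta - eps * v_hat / (delta + np.sqrt(r_hat))

    print(theta)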

SLIDE 32

Outline

Optimization

  • Challenges in Optimization
  • Momentum
  • Adaptive Learning Rate
  • Parameter Initialization
  • Batch Normalization


SLIDE 33

Parameter Initialization

  • Goal: break symmetry between units
  • so that each unit computes a different function
  • Initialize all weights (not biases) randomly
  • Gaussian or uniform distribution
  • Scale of initialization?
  • Large → gradient explosion; small → gradient vanishing


SLIDE 34

Xavier Initialization

  • Heuristic for all outputs to have unit variance
  • For a fully-connected layer with m inputs:   W_ij ~ N(0, 1/m)
  • For ReLU units, it is recommended:   W_ij ~ N(0, 2/m)
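A short numpy sketch of the two rules (the layer sizes are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n_units = 256, 128                                             # fan-in and layer width

    W_xavier = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, n_units))   # W_ij ~ N(0, 1/m)
    W_relu   = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n_units))   # W_ij ~ N(0, 2/m) for ReLU

    print(W_xavier.var(), W_relu.var())   # approximately 1/m and 2/m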

SLIDE 35

Normalized Initialization

  • Fully-connected layer with m inputs, n outputs:   W_ij ~ U(−√(6/(m+n)), +√(6/(m+n)))
  • Heuristic trades off between initializing all layers to have the same activation variance and the same gradient variance
  • Sparse variant when m is large: initialize k nonzero weights in each unit
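The corresponding uniform rule, again as an illustrative numpy sketch (the layer shape is an assumption):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 256, 128                                 # inputs and outputs of the layer
    limit = np.sqrt(6.0 / (m + n))
    W = rng.uniform(-limit, limit, size=(m, n))     # W_ij ~ U(-sqrt(6/(m+n)), +sqrt(6/(m+n)))

    print(W.min(), W.max())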

SLIDE 36

Bias Initialization

  • Output unit bias
  • Marginal statistics of the output in the training set
  • Hidden unit bias
  • Avoid saturation at initialization
  • E.g. in ReLU, initialize bias to 0.1 instead of 0
  • Units controlling participation of other units
  • Set bias to allow participation at initialization


SLIDE 37

SLIDE 38

Outline

  • Challenges in Optimization
  • Momentum
  • Adaptive Learning Rate
  • Parameter Initialization
  • Batch Normalization


SLIDE 39

Feature Normalization

It is good practice to normalize features before applying a learning algorithm, so that all features are on the same scale: mean 0 and variance 1.

– Speeds up learning


x̃ = (x − μ) / σ

where x is the feature vector, μ is the vector of mean feature values, and σ is the vector of feature standard deviations.
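In numpy this is a one-liner per feature column (the toy data here are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))   # toy feature matrix

    mu = X.mean(axis=0)            # vector of mean feature values
    sigma = X.std(axis=0)          # vector of feature standard deviations
    X_norm = (X - mu) / sigma      # x_tilde = (x - mu) / sigma

    print(X_norm.mean(axis=0).round(3), X_norm.std(axis=0).round(3))   # ~0 and ~1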

SLIDE 40

Feature Normalization

[Figure: contours of J(θ) before and after normalization]

SLIDE 41

Internal Covariate Shift

Each hidden layer changes the distribution of inputs to the next layer, which slows down learning.


[Figure: normalize the inputs to each hidden layer, from layer 2 through layer n]

SLIDE 42

Batch Normalization

Training time:

– Mini-batch of activations for the layer to normalize


H = [ H_11 … H_1K ; ⋮ ; H_N1 … H_NK ]

with N rows (data points in the mini-batch) and K columns (hidden-layer activations).

SLIDE 43

Batch Normalization

Training time:

– Mini-batch of activations for the layer to normalize, where


H′ = (H − μ) / σ

μ = (1/m) Σ_i H_{i,:}   (vector of mean activations across the mini-batch)
σ = √( (1/m) Σ_i (H − μ)_i^2 + δ )   (vector of per-unit standard deviations across the mini-batch)

SLIDE 44

Batch Normalization

Training time:

– Normalization can reduce the expressive power of the network
– Instead use:   γ H′ + β, where γ and β are learnable parameters
– This allows the network to control the range of the normalization
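A training-time numpy sketch of these operations (the mini-batch shape, δ, and the initial values of γ and β are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    H = rng.normal(loc=2.0, scale=4.0, size=(32, 16))   # N=32 examples, K=16 activations

    delta = 1e-5
    gamma = np.ones(16)        # learnable scale, initialized to 1
    beta = np.zeros(16)        # learnable shift, initialized to 0

    mu = H.mean(axis=0)                                     # mean activation per unit
    sigma = np.sqrt(((H - mu) ** 2).mean(axis=0) + delta)   # per-unit std across the mini-batch
    H_norm = (H - mu) / sigma                               # H' = (H - mu) / sigma
    out = gamma * H_norm + beta                             # gamma * H' + beta

    print(out.mean(axis=0).round(3), out.std(axis=0).round(3))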

SLIDE 45

Batch Normalization


Add normalization operations for layer 1, computed on each mini-batch (Batch 1, …, Batch N):

μ^1 = (1/m) Σ_i H_{i,:}
σ^1 = √( (1/m) Σ_i (H − μ)_i^2 + δ )

SLIDE 46

Batch Normalization

Add normalization operations for layer 2, computed on each mini-batch (Batch 1, …, Batch N), and so on:

μ^2 = (1/m) Σ_i H_{i,:}
σ^2 = √( (1/m) Σ_i (H − μ)_i^2 + δ )

SLIDE 47

Batch Normalization

Differentiate the joint loss for the N mini-batches. Back-propagate through the normalization operations.

Test time:

– The model needs to be evaluated on a single example
– Replace μ and σ with running averages collected during training
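An illustrative sketch of the test-time behavior, assuming the running averages are maintained with an exponential moving average (the 0.9 decay is an assumption, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    gamma, beta, delta = np.ones(16), np.zeros(16), 1e-5
    running_mu, running_sigma = np.zeros(16), np.ones(16)

    # Training: update running statistics from each mini-batch
    for _ in range(100):
        H = rng.normal(2.0, 4.0, size=(32, 16))
        mu, sigma = H.mean(axis=0), np.sqrt(H.var(axis=0) + delta)
        running_mu = 0.9 * running_mu + 0.1 * mu
        running_sigma = 0.9 * running_sigma + 0.1 * sigma

    # Test time: normalize a single example with the running averages
    x = rng.normal(2.0, 4.0, size=16)
    y = gamma * (x - running_mu) / running_sigma + beta
    print(y.round(3))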
