Lecture 6: Optimization
CS109B Data Science 2
Pavlos Protopapas and Mark Glickman

Outline:
Optimization
Challenges in Optimization
Momentum
Adaptive Learning Rate
Parameter Initialization
Batch Normalization
Goal of learning: minimize generalization error. In practice, we perform empirical risk minimization:
$$\frac{1}{m}\sum_{i=1}^{m} L\big(f(x^{(i)};\theta),\, y^{(i)}\big)$$
The quantity optimized (empirical risk on the training set) is different from the quantity we actually care about (generalization error).
Batch algorithms vs. stochastic algorithms:
– Large mini-batch: gradient computation is expensive
– Small mini-batch: greater variance in the gradient estimate, so more steps are needed for convergence
Critical points: points with zero gradient. The 2nd derivative (Hessian) determines the curvature, distinguishing local minima, local maxima, and saddle points.
(Figure: Goodfellow et al., 2016)
Stochastic gradient descent (SGD): take small steps in the direction of the negative gradient.
Sample a mini-batch of m examples from the training set and compute the gradient estimate:
$$\hat{g} \leftarrow \frac{1}{m}\,\nabla_{\theta} \sum_{i} L\big(f(x^{(i)};\theta),\, y^{(i)}\big)$$
Update parameters:
$$\theta \leftarrow \theta - \epsilon\,\hat{g}$$
In practice: shuffle the training set once and pass through it multiple times.
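A minimal NumPy sketch of mini-batch SGD following the update above; `grad_loss` is a hypothetical function returning the average gradient of the loss over a mini-batch.

```python
import numpy as np

def sgd(theta, X, y, grad_loss, lr=0.01, batch_size=32, epochs=10):
    """Mini-batch SGD: shuffle the data once, then make several passes over it."""
    idx = np.random.permutation(X.shape[0])    # shuffle the training set once
    X, y = X[idx], y[idx]
    for _ in range(epochs):                    # pass through the data multiple times
        for start in range(0, X.shape[0], batch_size):
            Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
            g = grad_loss(theta, Xb, yb)       # gradient estimate on the mini-batch
            theta = theta - lr * g             # theta <- theta - eps * g
    return theta
```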
(Figure: Goodfellow et al., 2016)
Old view: local minima are a major problem in neural network training.
Recent view: for high-dimensional networks, local minima are rarely the main obstacle; saddle points are (next slide).
Recent studies indicate that in high dimensions, saddle points are far more likely than local minima, and the gradient can be very small near saddle points.
A saddle point is both a local minimum (along some directions) and a local maximum (along others).
(Figure: Goodfellow et al., 2016)
(Figure: convolutional nets for object detection, Goodfellow et al., 2016)
SGD is observed to escape saddle points: it moves downhill and uses noisy gradients.
Second-order methods can get stuck: they solve for a point with zero gradient, which may be a saddle point.
(Figure: Goodfellow et al., 2016)
Ill-conditioning: a poorly conditioned Hessian matrix.
– High curvature: even small steps can lead to a large increase in the cost.
– Learning is slow despite strong gradients; oscillations slow down progress.
Some cost functions do not have critical points; this is common in classification, where the loss can keep decreasing as the weights grow without ever reaching a minimum.
Vanishing/exploding gradients (example from deeplearning.ai). Consider a deep network with linear activations and the same weight matrix $W$ at every layer:
$$h^{(1)} = W x, \qquad h^{(j)} = W\, h^{(j-1)}, \quad j = 2, \dots, n$$
With a $2 \times 2$ diagonal weight matrix, after $n$ layers:
$$\begin{pmatrix} h^{1}_{1} \\ h^{1}_{2} \end{pmatrix} = \begin{pmatrix} a & 0 \\ 0 & b \end{pmatrix} \begin{pmatrix} x_{1} \\ x_{2} \end{pmatrix}, \qquad \begin{pmatrix} h^{n}_{1} \\ h^{n}_{2} \end{pmatrix} = \begin{pmatrix} a^{n} & 0 \\ 0 & b^{n} \end{pmatrix} \begin{pmatrix} x_{1} \\ x_{2} \end{pmatrix}$$
Activations (and gradients) scale like $a^{n}$ and $b^{n}$: they explode when the factor exceeds 1 and vanish when it is below 1.
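A quick numerical check of this behaviour (a sketch using the diagonal example, with illustrative values a = 1.1 and b = 0.9):

```python
import numpy as np

W = np.diag([1.1, 0.9])          # a = 1.1 (> 1), b = 0.9 (< 1)
x = np.array([1.0, 1.0])

for n in (10, 50, 100):
    h_n = np.linalg.matrix_power(W, n) @ x
    print(n, h_n)                # first component grows like 1.1**n, second shrinks like 0.9**n
```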
Exploding gradients lead to cliffs in the loss surface. They can be mitigated using gradient clipping.
(Figure: Goodfellow et al., 2016)
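A minimal sketch of one common variant, clipping by the gradient's global norm (the threshold value is illustrative):

```python
import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    """Rescale the gradient when its norm exceeds max_norm, limiting the step size near cliffs."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```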
Oscillations occur because the updates do not exploit curvature information.
(Figure: Goodfellow et al., 2016)
SGD is slow when there is high curvature. The averaged gradient provides a faster path to the optimum: the oscillating (vertical) components cancel out.
Momentum: uses past gradients in each update. Maintains a new quantity, the 'velocity', an exponentially decaying average of gradients:
$$v \leftarrow \alpha v - \epsilon\,\hat{g}$$
where $-\epsilon\,\hat{g}$ is the current gradient update and $\alpha \in [0, 1)$ controls how quickly the effect of past gradients decays.
SGD with momentum:
– Compute gradient estimate: $\hat{g} \leftarrow \frac{1}{m}\,\nabla_{\theta} \sum_{i} L\big(f(x^{(i)};\theta),\, y^{(i)}\big)$
– Update velocity: $v \leftarrow \alpha v - \epsilon\,\hat{g}$
– Update parameters: $\theta \leftarrow \theta + v$
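A minimal sketch of this update loop, assuming the same hypothetical `grad_loss` mini-batch gradient function as before:

```python
import numpy as np

def sgd_momentum(theta, batches, grad_loss, lr=0.01, alpha=0.9):
    """SGD with momentum: velocity accumulates an exponentially decaying average of gradients."""
    v = np.zeros_like(theta)
    for Xb, yb in batches:                 # batches: iterable of (X_batch, y_batch) pairs
        g = grad_loss(theta, Xb, yb)       # gradient estimate
        v = alpha * v - lr * g             # v <- alpha*v - eps*g
        theta = theta + v                  # theta <- theta + v
    return theta
```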
Damped oscillations: gradient components pointing in opposite directions cancel out.
(Figure: Goodfellow et al., 2016)
Nesterov momentum: momentum based on the look-ahead slope.
– Apply an interim update: $\tilde{\theta} = \theta + \alpha v$
– Perform a correction based on the gradient at the interim point:
$$v \leftarrow \alpha v - \epsilon\,\frac{1}{m}\,\nabla_{\tilde{\theta}} \sum_{i} L\big(f(x^{(i)};\tilde{\theta}),\, y^{(i)}\big), \qquad \theta \leftarrow \theta + v$$
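A sketch of a single Nesterov step under the same assumptions (hypothetical `grad_loss`); the only change from standard momentum is where the gradient is evaluated:

```python
import numpy as np

def nesterov_step(theta, v, Xb, yb, grad_loss, lr=0.01, alpha=0.9):
    """One Nesterov momentum step: evaluate the gradient at the interim (look-ahead) point."""
    theta_interim = theta + alpha * v          # interim update
    g = grad_loss(theta_interim, Xb, yb)       # gradient at the look-ahead point
    v = alpha * v - lr * g
    return theta + v, v
```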
Motivation for adaptive learning rates: different parameters can require very different step sizes. In the example, learning must be slower along parameter 2.
AdaGrad: scale each parameter's learning rate inversely proportional to the square root of its cumulative squared gradient:
$$r_{i} \leftarrow r_{i} + g_{i}^{2}, \qquad \theta_{i} \leftarrow \theta_{i} - \frac{\epsilon}{\delta + \sqrt{r_{i}}}\, g_{i}$$
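A minimal per-parameter sketch of this rule (hyper-parameter values are illustrative):

```python
import numpy as np

def adagrad_step(theta, r, g, lr=0.01, delta=1e-7):
    """One AdaGrad step: accumulate squared gradients, shrink the step per parameter."""
    r = r + g ** 2                                  # r_i <- r_i + g_i^2
    theta = theta - lr * g / (delta + np.sqrt(r))
    return theta, r
```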
RMSProp: replace the cumulative sum with an exponentially weighted moving average, so the effective learning rate adapts to the recent gradient magnitudes:
$$r_{i} \leftarrow \rho\, r_{i} + (1 - \rho)\, g_{i}^{2}, \qquad \theta_{i} \leftarrow \theta_{i} - \frac{\epsilon}{\delta + \sqrt{r_{i}}}\, g_{i}$$
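The corresponding sketch, differing from AdaGrad only in how r is accumulated:

```python
import numpy as np

def rmsprop_step(theta, r, g, lr=0.001, rho=0.9, delta=1e-6):
    """One RMSProp step: exponentially weighted average of squared gradients."""
    r = rho * r + (1 - rho) * g ** 2                # r_i <- rho*r_i + (1-rho)*g_i^2
    theta = theta - lr * g / (delta + np.sqrt(r))
    return theta, r
```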
Adam: combines a momentum-like first moment $v$ with an RMSProp-like second moment $r$:
$$v_{i} \leftarrow \rho_{1} v_{i} + (1 - \rho_{1})\, g_{i}, \qquad r_{i} \leftarrow \rho_{2} r_{i} + (1 - \rho_{2})\, g_{i}^{2}$$
Also applies bias correction to $v$ and $r$. Works well in practice and is fairly robust to hyper-parameters.
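A sketch of one Adam step including the bias correction (t is the step counter, starting at 1; the default hyper-parameters follow common practice):

```python
import numpy as np

def adam_step(theta, v, r, g, t, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam step: first and second moment estimates with bias correction."""
    v = rho1 * v + (1 - rho1) * g                   # first moment (momentum-like)
    r = rho2 * r + (1 - rho2) * g ** 2              # second moment (RMSProp-like)
    v_hat = v / (1 - rho1 ** t)                     # bias-corrected estimates
    r_hat = r / (1 - rho2 ** t)
    theta = theta - lr * v_hat / (delta + np.sqrt(r_hat))
    return theta, v, r
```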
Parameter initialization:
– Choose initial weights so that activation and gradient variance stay stable across layers.
– Sparse initialization: initialize k nonzero weights in each unit.
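One standard way to realize the variance-preserving idea is Glorot/Xavier initialization; a sketch (this specific formula is an assumption, not necessarily the scheme used in the slides):

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=None):
    """Glorot/Xavier uniform init: keeps activation and gradient variance roughly constant."""
    rng = rng or np.random.default_rng()
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))
```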
Outline:
Challenges in Optimization
Momentum
Adaptive Learning Rate
Parameter Initialization
Batch Normalization
It is good practice to normalize features before applying a learning algorithm, so that all features are on the same scale (mean 0, variance 1); this speeds up learning:
$$\tilde{x} = \frac{x - \mu}{\sigma}$$
where $x$ is the feature vector, $\mu$ the vector of mean feature values, and $\sigma$ the vector of feature standard deviations.
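A one-function sketch of this standardization (the statistics should be computed on the training data and reused on new data):

```python
import numpy as np

def standardize(X):
    """Rescale each feature column to mean 0 and variance 1."""
    mu = X.mean(axis=0)          # vector of mean feature values
    sigma = X.std(axis=0)        # vector of feature standard deviations
    return (X - mu) / sigma, mu, sigma
```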
(Figure: feature distributions before vs. after normalization)
Batch normalization extends this idea to the hidden layers: normalize the inputs to layer 2, ..., the inputs to layer n.
Let $H$ be the mini-batch of activations for the layer to normalize: an $m \times K$ matrix with one row per data point in the mini-batch ($m$ points) and one column per hidden-layer activation ($K$ units).
Normalize the mini-batch of activations:
$$H' = \frac{H - \mu}{\sigma}, \quad \text{where} \quad \mu = \frac{1}{m}\sum_{i} H_{i,:}, \qquad \sigma = \sqrt{\frac{1}{m}\sum_{i} (H - \mu)_{i}^{2} + \delta}$$
Here $\mu$ is the vector of mean activations across the mini-batch, $\sigma$ is the vector of standard deviations of each unit across the mini-batch, and $\delta$ is a small constant for numerical stability.
– Normalization alone can reduce the expressive power of the network.
– Instead use: $H'' = \gamma H' + \beta$, with learnable parameters $\gamma$ and $\beta$.
– This allows the network to control the range of the normalized activations.
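A minimal NumPy sketch of the training-time forward pass just described; gamma and beta are the learnable vectors (one entry per unit):

```python
import numpy as np

def batchnorm_forward(H, gamma, beta, delta=1e-5):
    """Batch norm on an (m, K) activation matrix H during training."""
    mu = H.mean(axis=0)                         # mean activation per unit across the batch
    sigma = np.sqrt(H.var(axis=0) + delta)      # SD per unit (delta for numerical stability)
    H_norm = (H - mu) / sigma                   # H'
    return gamma * H_norm + beta, mu, sigma     # H'' = gamma * H' + beta
```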
During training, each mini-batch $b$ (Batch 1, ..., Batch N) produces its own statistics from its activation matrix $H^{(b)}$:
$$\mu_{b} = \frac{1}{m}\sum_{i} H^{(b)}_{i,:}, \qquad \sigma_{b} = \sqrt{\frac{1}{m}\sum_{i} \big(H^{(b)} - \mu_{b}\big)_{i}^{2} + \delta}$$
At test time the model may need to be evaluated on a single example, so mini-batch statistics are unavailable. Replace $\mu$ and $\sigma$ with running averages collected during training.
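A sketch of inference-time batch norm using such running averages (the exponential-average update shown in the comment is one common choice, not necessarily the one used in the course):

```python
import numpy as np

def batchnorm_inference(H, gamma, beta, running_mu, running_sigma):
    """At test time, normalize with running averages collected during training."""
    return gamma * (H - running_mu) / running_sigma + beta

# During training, the running averages are typically updated after each mini-batch, e.g.:
#   running_mu    = momentum * running_mu    + (1 - momentum) * mu_batch
#   running_sigma = momentum * running_sigma + (1 - momentum) * sigma_batch
```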