Lecture 21: Optimization and Regularization

CS109A Introduction to Data Science
Pavlos Protopapas, Kevin Rader and Chris Tanner
ANNOUNCEMENTS

- Homework 7 OH:
  - For conceptual questions: Kevin and Chris will continue their office hours.
  - If you have problems with TensorFlow, please let us know on Ed. We will arrange special OH to help if necessary.
- Project:
  - Milestone 3 (EDA and base model) due on Wednesday.
Outline

Optimization
- Challenges in Optimization
- Momentum
- Adaptive Learning Rate
- Parameter Initialization
- Batch Normalization

Regularization of NN
- Norm Penalties
- Early Stopping
- Data Augmentation
- Sparse Representation
- Dropout
Learning vs. Optimization
Goal of learning: minimize the generalization error

$$J^*(\theta) = \mathbb{E}_{(x,y) \sim p_{\text{data}}}\, L(f(x; \theta), y)$$

In practice, we perform empirical risk minimization on the training set:

$$J(\theta) = \frac{1}{n} \sum_i L(f(x_i; \theta), y_i)$$

The quantity we optimize is therefore different from the quantity we actually care about. Here $f$ is the neural network and $L$ is the loss function.
Local Minima
[Figure: loss surface with multiple local minima; Goodfellow et al. (2016)]
Critical Points

Critical points are points with zero gradient. The second derivative (the Hessian) determines the curvature at a critical point.

[Figure: curvature at critical points; Goodfellow et al. (2016)]
Local Minima
Old view: local minima are a major problem in neural network training.

Recent view:
- For sufficiently large neural networks, most local minima incur low cost
- It is not important to find the true global minimum
Saddle Points
Recent studies indicate that in high dimensions, saddle points are more likely than local minima. The gradient can be very small near saddle points.

[Figure: a saddle point, which is both a local minimum (along one direction) and a local maximum (along another); Goodfellow et al. (2016)]
Poor Conditioning
A poorly conditioned Hessian matrix means high curvature: small steps can lead to a huge increase in the loss. Learning is slow despite strong gradients, because oscillations slow down progress.
No Critical Points
Some cost functions do not have critical points; this is the case in particular for classification. WHY?
Exploding and Vanishing Gradients
Consider a deep network with linear activations:

$$h_1 = W x, \qquad h_j = W h_{j-1}, \quad j = 2, \dots, n$$

(deeplearning.ai)
Exploding and Vanishing Gradients
Suppose $W = \begin{pmatrix} a & 0 \\ 0 & b \end{pmatrix}$. Then

$$\begin{pmatrix} h^{(1)}_1 \\ h^{(1)}_2 \end{pmatrix} = \begin{pmatrix} a & 0 \\ 0 & b \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad \begin{pmatrix} h^{(n)}_1 \\ h^{(n)}_2 \end{pmatrix} = \begin{pmatrix} a^n & 0 \\ 0 & b^n \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$
Exploding and Vanishing Gradients
Suppose $x = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$:

- Case 1: $a = 1,\ b = 2$: $y \to \begin{pmatrix} 1 \\ 2^n \end{pmatrix}$ and $\nabla y \to \begin{pmatrix} n \\ n\,2^{n-1} \end{pmatrix}$. Explodes!
- Case 2: $a = 0.5,\ b = 0.9$: $y \to 0$ and the gradient shrinks with it. Vanishes!
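To see this numerically, here is a minimal NumPy sketch (not from the slides) that applies $W = \mathrm{diag}(a, b)$ repeatedly to $x = (1, 1)^\top$:

```python
import numpy as np

def forward(a, b, n, x=np.array([1.0, 1.0])):
    """Apply h <- W h for n linear layers with W = diag(a, b)."""
    W = np.diag([a, b])
    h = x
    for _ in range(n):
        h = W @ h
    return h

# Case 1: one eigenvalue > 1, so that component explodes
print(forward(1.0, 2.0, n=30))   # [1.0e+00  1.1e+09]
# Case 2: both eigenvalues < 1, so everything vanishes
print(forward(0.5, 0.9, n=30))   # [9.3e-10  4.2e-02]
```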
Exploding and Vanishing Gradients
Exploding gradients lead to cliffs in the loss surface. This can be mitigated using gradient clipping: if the gradient norm exceeds a threshold $v$, rescale the gradient,

$$\text{if } \lVert g \rVert > v: \quad g \leftarrow \frac{g\, v}{\lVert g \rVert}$$

[Figure: gradient descent overshooting at a cliff; Goodfellow et al. (2016)]
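A minimal NumPy sketch of this norm-based clipping rule (deep learning frameworks expose equivalent clip-by-norm options on their optimizers):

```python
import numpy as np

def clip_gradient(g, v):
    """Rescale g so its L2 norm never exceeds the threshold v."""
    norm = np.linalg.norm(g)
    if norm > v:
        g = g * (v / norm)
    return g

g = np.array([30.0, -40.0])     # norm = 50
print(clip_gradient(g, v=5.0))  # [ 3. -4.], norm clipped to 5
```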
Momentum
Oscillations occur because the updates do not exploit curvature information. The average gradient presents a faster path to the optimum: the vertical components cancel out.

[Figure: loss contours $L(\theta)$ over parameters $\theta_1$ and $\theta_2$; gradient descent zig-zags across the valley.]
Momentum

Question: Why not this?

Let us figure out an algorithm that will lead us to the minimum faster, by looking at each component of the gradient at a time.

[Figure sequence: loss contours $L(\theta)$ over $\theta_1$ and $\theta_2$, showing successive update steps; the horizontal components reinforce while the vertical components cancel.]
Momentum
Old gradient descent:

$$g = \frac{1}{n} \sum_i \nabla_\theta L(f(x_i; \theta), y_i), \qquad \theta^* = \theta - \eta\, g$$

New gradient descent with momentum (where $v$ is the running average from before):

$$v = \beta v + (1 - \beta)\, g, \qquad \theta^* = \theta - \eta\, v$$

Here $f$ is the neural network, and $\beta \in [0, 1)$ controls how quickly the effect of past gradients decays.
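A minimal NumPy sketch of this update; `grad_fn` is a hypothetical function that returns the batch gradient at $\theta$:

```python
import numpy as np

def sgd_momentum(grad_fn, theta, eta=0.1, beta=0.9, n_steps=100):
    """Gradient descent with momentum, in the exponentially weighted
    average form from the slide: v = beta*v + (1-beta)*g."""
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        g = grad_fn(theta)               # average gradient over the batch
        v = beta * v + (1 - beta) * g    # running average of past gradients
        theta = theta - eta * v
    return theta

# Example: a poorly conditioned quadratic, L = 0.5*(theta_1^2 + 10*theta_2^2)
grad = lambda th: np.array([th[0], 10.0 * th[1]])
print(sgd_momentum(grad, theta=np.array([2.0, 2.0])))  # approaches [0, 0]
```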
Nesterov Momentum
Apply an interim update:

$$\tilde{\theta} = \theta + v$$

Then perform a correction based on the gradient at the interim point:

$$g = \frac{1}{n} \sum_i \nabla_\theta L(f(x_i; \tilde{\theta}), y_i), \qquad v = \alpha v - \varepsilon\, g, \qquad \theta \leftarrow \theta + v$$

The momentum is based on the look-ahead slope.
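A corresponding sketch, following the slide's equations (gradient evaluated at the look-ahead point $\theta + v$; the final $\theta \leftarrow \theta + v$ step is the standard completion):

```python
import numpy as np

def nesterov(grad_fn, theta, eps=0.05, alpha=0.9, n_steps=100):
    """Nesterov momentum: evaluate the gradient at the interim
    (look-ahead) point before updating the velocity."""
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        g = grad_fn(theta + v)     # gradient at the interim point
        v = alpha * v - eps * g    # velocity update with the correction
        theta = theta + v
    return theta
```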
Adaptive Learning Rates

Oscillations along the vertical direction mean that learning must be slower along parameter $\theta_2$. Why not use a different learning rate for each parameter?

[Figure sequence: loss contours $L(\theta)$ over $\theta_1$ and $\theta_2$; updates oscillate along $\theta_2$ while progress along $\theta_1$ is slow.]
AdaGrad

- Accumulate the squared gradients: $r_i = r_i + g_i^2$ (here $g_i$ is the gradient with respect to parameter $i$)
- Update each parameter with a step size inversely proportional to its cumulative squared gradient
- This makes greater progress along gently sloped directions
AdaGrad

Old gradient descent:

$$g = \frac{1}{n} \sum_i \nabla_\theta L(f(x_i; \theta), y_i), \qquad \theta^* = \theta - \eta\, g$$

We would like the learning rate $\eta_i$ not to be the same for every parameter, but inversely proportional to the gradient magnitude $|g_i|$:

$$\theta_i^* = \theta_i - \eta_i\, g_i, \qquad \eta_i \propto \frac{1}{|g_i|}$$

New gradient descent with an adaptive learning rate:

$$r_i^* = r_i + g_i^2, \qquad \theta_i^* = \theta_i - \frac{\eta}{\varepsilon + \sqrt{r_i}}\, g_i$$

$\varepsilon$ is a small number, making sure the step does not become too large.
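A minimal sketch of the AdaGrad update, reusing a hypothetical `grad_fn` as in the momentum example:

```python
import numpy as np

def adagrad(grad_fn, theta, eta=0.5, eps=1e-8, n_steps=100):
    """AdaGrad: the per-parameter step size shrinks as the
    squared gradients accumulate in r."""
    r = np.zeros_like(theta)
    for _ in range(n_steps):
        g = grad_fn(theta)
        r = r + g**2                                  # accumulate squared gradients
        theta = theta - eta / (eps + np.sqrt(r)) * g  # per-parameter step size
    return theta
```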
RMSProp

- For non-convex problems, AdaGrad can prematurely decrease the learning rate
- Instead, use an exponentially weighted average for the gradient accumulation:

$$r_i = \rho\, r_i + (1 - \rho)\, g_i^2$$
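The sketch differs from AdaGrad by one line: the accumulator now forgets old gradients at a rate set by $\rho$:

```python
import numpy as np

def rmsprop(grad_fn, theta, eta=0.05, rho=0.9, eps=1e-8, n_steps=200):
    """RMSProp: like AdaGrad, but with an exponentially weighted
    average of squared gradients, so old gradients decay."""
    r = np.zeros_like(theta)
    for _ in range(n_steps):
        g = grad_fn(theta)
        r = rho * r + (1 - rho) * g**2
        theta = theta - eta / (eps + np.sqrt(r)) * g
    return theta
```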
Adam

- RMSProp + Momentum
- Estimate the first moment: $v_i = \rho_1 v_i + (1 - \rho_1)\, g_i$
- Estimate the second moment: $r_i = \rho_2 r_i + (1 - \rho_2)\, g_i^2$
- Update the parameters: $\theta_i \leftarrow \theta_i - \dfrac{\eta}{\varepsilon + \sqrt{r_i}}\, v_i$
- Also applies bias correction to $v$ and $r$
- Works well in practice and is fairly robust to hyper-parameters
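A minimal sketch including the bias correction for the zero-initialized moments (the combined update shown is the standard Adam form):

```python
import numpy as np

def adam(grad_fn, theta, eta=0.1, rho1=0.9, rho2=0.999, eps=1e-8, n_steps=200):
    """Adam: momentum on the first moment + RMSProp on the second,
    with bias correction for the zero-initialized accumulators."""
    v = np.zeros_like(theta)   # first moment
    r = np.zeros_like(theta)   # second moment
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)
        v = rho1 * v + (1 - rho1) * g
        r = rho2 * r + (1 - rho2) * g**2
        v_hat = v / (1 - rho1**t)   # bias correction
        r_hat = r / (1 - rho2**t)
        theta = theta - eta * v_hat / (eps + np.sqrt(r_hat))
    return theta
```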
Parameter Initialization
- Goal: break the symmetry between units
  - so that each unit computes a different function
- Initialize all weights (not the biases) randomly
  - Gaussian or uniform distribution
- What scale of initialization?
  - Too large: gradient explosion. Too small: gradient vanishing.
Xavier Initialization
- Heuristic for all outputs to have unit variance
- For a fully-connected layer with $m$ inputs: $W_{ij} \sim N\!\left(0, \frac{1}{m}\right)$
- For ReLU units, it is recommended: $W_{ij} \sim N\!\left(0, \frac{2}{m}\right)$
Normalized Initialization
- For a fully-connected layer with $m$ inputs and $n$ outputs:

$$W_{ij} \sim U\!\left(-\sqrt{\frac{6}{m+n}},\ \sqrt{\frac{6}{m+n}}\right)$$

- This heuristic trades off between initializing all layers to have the same activation variance and the same gradient variance
- Sparse variant when $m$ is large: initialize only $k$ nonzero weights in each unit
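A minimal NumPy sketch of the three heuristics above for a layer of shape $(m, n)$ (the ReLU variant is commonly known as He initialization):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(m, n):
    """Xavier initialization: W_ij ~ N(0, 1/m)."""
    return rng.normal(0.0, np.sqrt(1.0 / m), size=(m, n))

def relu_init(m, n):
    """Variance-doubled variant for ReLU units: W_ij ~ N(0, 2/m)."""
    return rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n))

def normalized_init(m, n):
    """Normalized (Glorot uniform) initialization."""
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(m, n))

W = xavier_init(784, 128)
print(W.std())   # close to sqrt(1/784), about 0.036
```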
Feature Normalization
It is good practice to normalize the features before applying a learning algorithm, so that all features are on the same scale, with mean 0 and variance 1. This speeds up learning.

$$\tilde{x} = \frac{x - \mu}{\sigma}$$

where $x$ is the feature vector, $\mu$ the vector of mean feature values, and $\sigma$ the vector of feature standard deviations.
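A minimal sketch; note that $\mu$ and $\sigma$ should be computed on the training set and reused for validation and test data:

```python
import numpy as np

def normalize(X, mu=None, sigma=None):
    """Standardize each feature (column) to mean 0, variance 1."""
    if mu is None:                          # fit on the training set
        mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_norm, mu, sigma = normalize(X_train)
print(X_norm.mean(axis=0), X_norm.std(axis=0))  # ~[0, 0] and [1, 1]
```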
Feature Normalization

[Figure: loss contours $L(\theta)$ before normalization and after normalization.]
Internal Covariate Shift

Each hidden layer changes the distribution of the inputs to the next layer, which slows down learning.

[Figure: network diagram; normalize the inputs to layer 2, ..., normalize the inputs to layer $n$.]
Batch Normalization
Training time:
- Take a mini-batch of activations for the layer to normalize, with $N$ data points in the mini-batch and $K$ hidden-layer activations:

$$H = \begin{pmatrix} H_{11} & \cdots & H_{1K} \\ \vdots & & \vdots \\ H_{N1} & \cdots & H_{NK} \end{pmatrix}$$
Batch Normalization

Training time:
- Normalize the mini-batch of activations:

$$H' = \frac{H - \mu}{\sigma}, \qquad \mu = \frac{1}{m} \sum_i H_{i,:}, \qquad \sigma = \sqrt{\frac{1}{m} \sum_i (H - \mu)_i^2 + \delta}$$

where $\mu$ is the vector of mean activations across the mini-batch, $\sigma$ is the vector of standard deviations of each unit across the mini-batch, and $\delta$ is a small constant for numerical stability.
Batch Normalization
Training time:
- Normalization can reduce the expressive power of the network
- Instead use $\gamma H' + \beta$, where $\gamma$ and $\beta$ are learnable parameters
- This allows the network to control the range of the normalization
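A minimal NumPy sketch of the training-time forward pass, combining the normalization with the learnable scale and shift:

```python
import numpy as np

def batch_norm_train(H, gamma, beta, delta=1e-5):
    """Batch-norm forward pass at training time on a mini-batch of
    activations H of shape (N, K): normalize per unit, then scale and shift."""
    mu = H.mean(axis=0)                     # mean of each unit over the batch
    sigma = np.sqrt(H.var(axis=0) + delta)  # SD of each unit over the batch
    H_norm = (H - mu) / sigma
    return gamma * H_norm + beta, mu, sigma  # mu, sigma feed the running averages

H = np.random.randn(32, 4) * 5 + 3           # mini-batch with N=32, K=4
out, mu, sigma = batch_norm_train(H, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1
```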
Batch Normalization

[Figure: mini-batches (Batch 1 ... Batch N) flowing through the network.]

Add the normalization operations for layer 1,

$$\mu^{(1)} = \frac{1}{m} \sum_i H_{i,:}, \qquad \sigma^{(1)} = \sqrt{\frac{1}{m} \sum_i (H - \mu)_i^2 + \delta}$$

then the normalization operations for layer 2,

$$\mu^{(2)} = \frac{1}{m} \sum_i H_{i,:}, \qquad \sigma^{(2)} = \sqrt{\frac{1}{m} \sum_i (H - \mu)_i^2 + \delta}$$

and so on ...
Batch Normalization
Differentiate the joint loss over the mini-batch and back-propagate through the normalization operations.

Test time:
- The model needs to be evaluated on a single example
- Replace $\mu$ and $\sigma$ with running averages collected during training
Regularization
"Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error." (Goodfellow et al., 2016)
Overfitting
Fitting a deep neural network with 5 layers and 100 neurons per layer can lead to very good predictions on the training set but poor predictions on the validation set.
Norm Penalties
We used to optimize $J(\theta; X, y)$. Change to:

$$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\, \Omega(\theta)$$

Note that the biases are not penalized.

L2 regularization: $\Omega(\theta) = \frac{1}{2} \lVert \theta \rVert_2^2$
- Weight decay
- MAP estimation with a Gaussian prior

L1 regularization: $\Omega(\theta) = \lVert \theta \rVert_1$
- Encourages sparsity
- MAP estimation with a Laplacian prior
Norm Penalties

Effect on the gradient descent update. Without the penalty:

$$\theta^{(i+1)} = \theta^{(i)} - \eta\, \frac{\partial J}{\partial \theta}$$

With the L2 penalty, $\tilde{J}(\theta; X, y) = J(\theta; X, y) + \frac{1}{2} \alpha\, \theta^2$, the update becomes:

$$\theta^{(i+1)} = \theta^{(i)} - \eta\, \frac{\partial J}{\partial \theta} - \eta \alpha\, \theta^{(i)}$$

so at each step the weights decay in proportion to their size.
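A minimal sketch of this update; `grad_fn` is a hypothetical function returning $\partial J / \partial \theta$ for the unpenalized loss:

```python
import numpy as np

def gd_weight_decay(grad_fn, theta, eta=0.1, alpha=0.01, n_steps=100):
    """Gradient descent with an L2 penalty: the extra -eta*alpha*theta
    term shrinks each weight in proportion to its size (weight decay)."""
    for _ in range(n_steps):
        g = grad_fn(theta)                        # gradient of J only
        theta = theta - eta * g - eta * alpha * theta
    return theta
```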
Norm Penalties

[Figure: geometry of the L2 penalty $\Omega(\theta) = \frac{1}{2} \lVert \theta \rVert_2^2$ versus the L1 penalty $\Omega(\theta) = \lVert \theta \rVert_1$.]
Norm Penalties as Constraints

$$\min_{\theta\,:\ \Omega(\theta) \le K} J(\theta; X, y)$$

Useful if $K$ is known in advance. Optimization:
- Construct the Lagrangian and apply gradient descent
- Projected gradient descent
Early Stopping
Training time can be treated as a hyperparameter. Early stopping: terminate training when the validation set performance stops improving.

[Figure: training and validation loss versus epochs; the validation loss starts rising while the training loss keeps decreasing.]
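A minimal sketch of the early-stopping loop; `step_fn` (one epoch of training) and `val_loss_fn` (validation evaluation) are hypothetical hooks:

```python
import numpy as np

def train_with_early_stopping(step_fn, val_loss_fn, patience=5, max_epochs=500):
    """Keep the parameters from the epoch with the best validation loss;
    stop after `patience` epochs without improvement."""
    best_loss, best_theta, since_best = np.inf, None, 0
    for epoch in range(max_epochs):
        theta = step_fn()                 # run one epoch, get current parameters
        loss = val_loss_fn(theta)
        if loss < best_loss:
            best_loss, best_theta, since_best = loss, np.copy(theta), 0
        else:
            since_best += 1
            if since_best >= patience:    # validation loss stopped improving
                break
    return best_theta
```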
Data Augmentation

[Figures: examples of data augmentation applied to training data.]
Sparse Representation

With the unregularized loss $J(\theta; X, y)$, the output is computed from the last hidden layer with dense weights $\theta_w$:

$$4.34 = \begin{pmatrix} 3.2 & 2.0 & 1.8 \end{pmatrix} \begin{pmatrix} 2 \\ -2.2 \\ 1.3 \end{pmatrix}$$

[Figure: network diagram with weight matrices $\theta_1, \dots, \theta_6$; $\theta_w$ are the weights in the output layer and $y(\theta_w)$ is the output.]
Adding a norm penalty on the weights,

$$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\, \Omega(\theta)$$

shrinks the weights in the output layer:

$$0.69 = \begin{pmatrix} 0.5 & 0.2 & 0.1 \end{pmatrix} \begin{pmatrix} 2 \\ -2.2 \\ 1.3 \end{pmatrix}$$
Alternatively, look at the outputs $h_{31}, h_{32}, h_{33}$ of the last hidden layer with the unregularized loss $J(\theta; X, y)$:

$$4.34 = \begin{pmatrix} 3.2 & 2.0 & 1.8 \end{pmatrix} \begin{pmatrix} 2 \\ -2.2 \\ 1.3 \end{pmatrix}$$
Penalizing the output of the hidden layer instead,

$$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\, \Omega(h)$$

makes the hidden representation $(h_{31}, h_{32}, h_{33})$ sparse:

$$1.3 = \begin{pmatrix} 3.2 & 2 & 1 \end{pmatrix} \begin{pmatrix} 0 \\ -0.2 \\ 0.9 \end{pmatrix}$$
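A minimal sketch contrasting the two penalties on one layer; the data-loss term is left as a placeholder:

```python
import numpy as np

w = np.array([3.2, 2.0, 1.8])   # output-layer weights
h = np.array([2.0, -2.2, 1.3])  # hidden-layer activations
alpha = 0.1
J = 0.0                          # placeholder for the data loss

J_weight_penalty = J + alpha * np.abs(w).sum()  # Omega(theta): sparse weights
J_hidden_penalty = J + alpha * np.abs(h).sum()  # Omega(h): sparse representation
print(J_weight_penalty, J_hidden_penalty)
```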
Noise Robustness
Random perturbation of the network weights:
- Gaussian noise on the weights is equivalent to minimizing the loss with an extra regularization term
- It encourages a smooth function: a small perturbation of the weights leads to small changes in the output

Injecting noise in the output labels:
- Better convergence: prevents the pursuit of hard probabilities
Dropout
- Randomly set some neurons and their connections to zero (i.e. "dropped")
- Prevents overfitting by reducing the co-adaptation of neurons
- Like training many random sub-networks
Dropout
- Widely used and highly effective
- Proposed as an alternative to ensembling, which is too expensive for neural nets

[Figure: test error for different architectures with and without dropout; the networks have 2 to 4 hidden layers, each with 1024 to 2048 units. Source: http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf]
Dropout: Stochastic GD
For each new example/mini-batch:
- Randomly sample a binary mask $\mu$ independently, where $\mu_i$ indicates whether input/hidden node $i$ is included
- Multiply the output of node $i$ by $\mu_i$, and perform the gradient update

Typically, an input node is included with probability 0.8 and a hidden node with probability 0.5.
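A minimal NumPy sketch of the training-time mask together with the test-time scaling; scaling the activations by the keep probability is equivalent to scaling the outgoing weights, as described on the next slide:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p_keep=0.5, training=True):
    """Dropout on a layer's activations h: at training time, zero out
    dropped units; at test time, scale by p_keep to approximate
    averaging over the random sub-networks."""
    if training:
        mask = rng.random(h.shape) < p_keep   # mu_i ~ Bernoulli(p_keep)
        return h * mask
    return h * p_keep                          # expected value of the mask

h = np.array([1.0, -2.0, 0.5, 3.0])
print(dropout_forward(h))                   # some units zeroed at random
print(dropout_forward(h, training=False))   # all units scaled by 0.5
```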
Dropout: Weight Scaling
- We can think of dropout as training many sub-networks
- At test time, we "aggregate" over these sub-networks by reducing the connection weights in proportion to the dropout probability $p$
[Course map: Regression, Statistical Learning, Uncertainty in model and prediction, Cross validation, Overfitting: Variance & Bias, Methods of regularization: Lasso and Ridge, Classification, Logistic, KNN, Linear, PCA & dimensionality reduction, Trees, Neural Networks, Experimental Design & Causal Inference; Computing tools: Pandas, Matplotlib, Scikit-Learn, NumPy.]