Neural Network Part 2: Regularization (PowerPoint PPT Presentation)


SLIDE 1

Neural Network Part 2: Regularization

Yingyu Liang, Computer Sciences 760, Fall 2017

http://pages.cs.wisc.edu/~yliang/cs760/

Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Matt Gormley, Elad Hazan, Tom Dietterich, and Pedro Domingos.

SLIDE 2

Goals for the lecture

You should understand the following concepts:

  • regularization
  • different views of regularization
  • norm constraint
  • data augmentation
  • early stopping
  • dropout
  • batch normalization


SLIDE 3

What is regularization?

  • In general: any method to prevent overfitting or help the optimization
  • Specifically: additional terms in the training optimization objective to prevent overfitting or help the optimization

SLIDE 4

Overfitting example: regression using polynomials

$$t = \sin(2\pi x) + \epsilon$$

Figure from Pattern Recognition and Machine Learning, Bishop

SLIDE 5

Overfitting example: regression using polynomials

Figure from Pattern Recognition and Machine Learning, Bishop

SLIDE 6

Overfitting

  • Key: empirical loss and expected loss are different
  • The smaller the data set, the larger the difference between the two
  • The larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between the two
  • Such a hypothesis has small training error but large test error (overfitting)
  • A larger data set helps
  • Throwing away useless hypotheses also helps (regularization)
SLIDE 7

Different views of regularization

SLIDE 8

Regularization as hard constraint

  • Training objective

$$\min_f \; \hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f, x_i, y_i), \quad \text{subject to: } f \in \mathcal{H}$$

  • When parametrized

$$\min_\theta \; \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, x_i, y_i), \quad \text{subject to: } \theta \in \Omega$$

SLIDE 9

Regularization as hard constraint

  • When $\Omega$ is measured by some quantity $R$

$$\min_\theta \; \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, x_i, y_i), \quad \text{subject to: } R(\theta) \le r$$

  • Example: $\ell_2$ regularization

$$\min_\theta \; \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, x_i, y_i), \quad \text{subject to: } \|\theta\|_2^2 \le r^2$$

SLIDE 10

Regularization as soft constraint

  • The hard-constraint optimization is equivalent to soft-constraint optimization

$$\min_\theta \; \hat{L}_R(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, x_i, y_i) + \lambda^* R(\theta)$$

for some regularization parameter $\lambda^* > 0$

  • Example: $\ell_2$ regularization

$$\min_\theta \; \hat{L}_R(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, x_i, y_i) + \lambda^* \|\theta\|_2^2$$

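As a concrete illustration of the soft-constraint objective, here is a minimal NumPy sketch for linear regression with squared loss and an $\ell_2$ penalty; the function and variable names are hypothetical, not from the slides:

```python
import numpy as np

def l2_regularized_loss(theta, X, y, lam):
    """Soft-constraint objective: (1/n) sum_i (x_i^T theta - y_i)^2 + lam * ||theta||_2^2."""
    residuals = X @ theta - y
    empirical_loss = np.mean(residuals ** 2)   # (1/n) sum of squared errors
    penalty = lam * np.sum(theta ** 2)         # lam * ||theta||_2^2
    return empirical_loss + penalty
```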
SLIDE 11

Regularization as soft constraint

  • Can be shown by the Lagrange multiplier method

$$\mathcal{L}(\theta, \lambda) := \hat{L}(\theta) + \lambda \left[ R(\theta) - r \right]$$

  • Suppose $\theta^*$ is the optimum of the hard-constraint optimization

$$\theta^* = \arg\min_\theta \max_{\lambda \ge 0} \mathcal{L}(\theta, \lambda) = \arg\min_\theta \max_{\lambda \ge 0} \; \hat{L}(\theta) + \lambda \left[ R(\theta) - r \right]$$

  • Suppose $\lambda^*$ is the corresponding optimal value for the max; then

$$\theta^* = \arg\min_\theta \mathcal{L}(\theta, \lambda^*) = \arg\min_\theta \; \hat{L}(\theta) + \lambda^* \left[ R(\theta) - r \right]$$

SLIDE 12

Regularization as Bayesian prior

  • Bayesian view: everything is a distribution
  • Prior over the hypotheses: $p(\theta)$
  • Posterior over the hypotheses: $p(\theta \mid \{x_i, y_i\})$
  • Likelihood: $p(\{x_i, y_i\} \mid \theta)$
  • Bayes' rule:

$$p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta) \, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$$

SLIDE 13

Regularization as Bayesian prior

  • Bayes' rule:

$$p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta) \, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$$

  • Maximum A Posteriori (MAP):

$$\max_\theta \; \log p(\theta \mid \{x_i, y_i\}) = \max_\theta \; \log p(\theta) + \log p(\{x_i, y_i\} \mid \theta)$$

Here $\log p(\theta)$ gives the regularization term and $\log p(\{x_i, y_i\} \mid \theta)$ gives the MLE loss.

SLIDE 14

Regularization as Bayesian prior

  • Example: π‘š2 loss with π‘š2 regularization

min

πœ„

ΰ·  𝑀𝑆 πœ„ = 1 π‘œ ෍

𝑗=1 π‘œ

𝑔

πœ„ 𝑦𝑗 βˆ’ 𝑧𝑗 2 + πœ‡βˆ—| πœ„| 2 2

  • Correspond to a normal likelihood π‘ž 𝑦, 𝑧 | πœ„ and a normal prior π‘ž(πœ„)
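To make the correspondence concrete, here is a short derivation sketch; the variance parameters $\sigma$ and $\tau$ are assumptions introduced for illustration, not from the slides:

```latex
% Assume likelihood y_i ~ N(f_theta(x_i), sigma^2) and prior theta ~ N(0, tau^2 I)
\begin{align*}
-\log p(\theta \mid \{x_i, y_i\})
  &= -\log p(\theta) - \sum_{i=1}^{n} \log p(y_i \mid x_i, \theta) + \text{const} \\
  &= \frac{\|\theta\|_2^2}{2\tau^2}
   + \sum_{i=1}^{n} \frac{\big(f_\theta(x_i) - y_i\big)^2}{2\sigma^2} + \text{const}
\end{align*}
% Multiplying through by 2*sigma^2/n recovers the regularized objective above
% with lambda* = sigma^2 / (n * tau^2).
```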
SLIDE 15

Three views

  • Typical choice for optimization: soft-constraint

$$\min_\theta \; \hat{L}_R(\theta) = \hat{L}(\theta) + \lambda R(\theta)$$

  • Hard constraint and Bayesian view: conceptual; or used for derivation
SLIDE 16

Three views

  • Hard-constraint preferred if:
  • the explicit bound $R(\theta) \le r$ is known
  • the soft-constraint version gets trapped in local minima, while projecting back to the feasible set leads to stability (see the sketch after this list)
  • Bayesian view preferred if:
  • domain knowledge is easy to represent as a prior
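A minimal sketch of the projection idea mentioned above, assuming an $\ell_2$-ball feasible set $\{\theta : \|\theta\|_2 \le r\}$; the projection formula for this ball is standard, everything else here is illustrative:

```python
import numpy as np

def project_l2_ball(theta, r):
    """Project theta onto the feasible set {theta : ||theta||_2 <= r}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= r else theta * (r / norm)

def projected_gradient_step(theta, grad, lr, r):
    """One hard-constraint update: take a gradient step, then project back."""
    return project_l2_ball(theta - lr * grad, r)
```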
SLIDE 17

Examples of Regularization

SLIDE 18

Classical regularization

  • Norm penalty
  • $\ell_2$ regularization
  • $\ell_1$ regularization
  • Robustness to noise
  • Noise added to the input
  • Noise added to the weights
SLIDE 19

π‘š2 regularization

min

πœ„

ΰ·  𝑀𝑆 πœ„ = ΰ·  𝑀(πœ„) + 𝛽 2 | πœ„| 2

2

  • Effect on (stochastic) gradient descent
  • Effect on the optimal solution
SLIDE 20

Effect on gradient descent

  • Gradient of the regularized objective

$$\nabla \hat{L}_R(\theta) = \nabla \hat{L}(\theta) + \alpha \theta$$

  • Gradient descent update (with learning rate $\eta$)

$$\theta \leftarrow \theta - \eta \nabla \hat{L}_R(\theta) = \theta - \eta \nabla \hat{L}(\theta) - \eta \alpha \theta = (1 - \eta\alpha)\,\theta - \eta \nabla \hat{L}(\theta)$$

  • Terminology: weight decay (see the sketch below)
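A minimal NumPy-style sketch of the update above; `lr` and `alpha` stand for $\eta$ and $\alpha$, and `grad_loss` is a hypothetical function returning $\nabla \hat{L}(\theta)$:

```python
def weight_decay_step(theta, grad_loss, lr, alpha):
    """SGD step on L(theta) + (alpha/2)*||theta||^2:
    shrink the weights by (1 - lr*alpha), then take the usual gradient step."""
    return (1.0 - lr * alpha) * theta - lr * grad_loss(theta)
```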
SLIDE 21

Effect on the optimal solution

  • Consider a quadratic approximation around the unregularized optimum $\theta^*$, with Hessian $H$

$$\hat{L}(\theta) \approx \hat{L}(\theta^*) + (\theta - \theta^*)^T \nabla \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$$

  • Since $\theta^*$ is optimal, $\nabla \hat{L}(\theta^*) = 0$, so

$$\hat{L}(\theta) \approx \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*), \qquad \nabla \hat{L}(\theta) \approx H (\theta - \theta^*)$$

SLIDE 22

Effect on the optimal solution

  • Gradient of the regularized objective

$$\nabla \hat{L}_R(\theta) \approx H(\theta - \theta^*) + \alpha \theta$$

  • At the regularized optimum $\theta_R^*$

$$0 = \nabla \hat{L}_R(\theta_R^*) \approx H(\theta_R^* - \theta^*) + \alpha \theta_R^*$$

$$\theta_R^* \approx (H + \alpha I)^{-1} H \theta^*$$

SLIDE 23

Effect on the optimal solution

  • The regularized optimum

$$\theta_R^* \approx (H + \alpha I)^{-1} H \theta^*$$

  • Suppose $H$ has eigendecomposition $H = Q \Lambda Q^T$

$$\theta_R^* \approx (H + \alpha I)^{-1} H \theta^* = Q (\Lambda + \alpha I)^{-1} \Lambda Q^T \theta^*$$

  • Effect: rescale along the eigenvectors of $H$; the component along eigenvector $i$ is scaled by $\lambda_i / (\lambda_i + \alpha)$
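A quick NumPy sanity check of this identity on a random symmetric positive definite $H$; the matrix and values are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = A @ A.T + 4 * np.eye(4)       # random symmetric positive definite "Hessian"
theta_star = rng.normal(size=4)
alpha = 0.5

# Direct formula: (H + alpha*I)^{-1} H theta*
direct = np.linalg.solve(H + alpha * np.eye(4), H @ theta_star)

# Eigendecomposition formula: Q (Lambda + alpha*I)^{-1} Lambda Q^T theta*
lam, Q = np.linalg.eigh(H)
eig_form = Q @ np.diag(lam / (lam + alpha)) @ Q.T @ theta_star

assert np.allclose(direct, eig_form)   # the two expressions agree
```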
SLIDE 24

Effect on the optimal solution

Figure from Deep Learning, Goodfellow, Bengio and Courville

Notations: πœ„βˆ— = π‘₯βˆ—, πœ„π‘†

βˆ— = ΰ·₯

π‘₯

SLIDE 25

π‘š1 regularization

min

πœ„

ΰ·  𝑀𝑆 πœ„ = ΰ·  𝑀(πœ„) + 𝛽| πœ„ |1

  • Effect on (stochastic) gradient descent
  • Effect on the optimal solution
SLIDE 26

Effect on gradient descent

  • Gradient of the regularized objective

$$\nabla \hat{L}_R(\theta) = \nabla \hat{L}(\theta) + \alpha \,\mathrm{sign}(\theta)$$

where sign applies to each element of $\theta$

  • Gradient descent update

$$\theta \leftarrow \theta - \eta \nabla \hat{L}_R(\theta) = \theta - \eta \nabla \hat{L}(\theta) - \eta \alpha \,\mathrm{sign}(\theta)$$

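A matching NumPy sketch of the $\ell_1$ (sub)gradient step; as before, `grad_loss` is a hypothetical gradient function:

```python
import numpy as np

def l1_subgradient_step(theta, grad_loss, lr, alpha):
    """SGD step on L(theta) + alpha*||theta||_1 using the elementwise sign subgradient."""
    return theta - lr * grad_loss(theta) - lr * alpha * np.sign(theta)
```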
SLIDE 27

Effect on the optimal solution

  • Consider a quadratic approximation around $\theta^*$

$$\hat{L}(\theta) \approx \hat{L}(\theta^*) + (\theta - \theta^*)^T \nabla \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$$

  • Since $\theta^*$ is optimal, $\nabla \hat{L}(\theta^*) = 0$, so

$$\hat{L}(\theta) \approx \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$$

SLIDE 28

Effect on the optimal solution

  • Further assume that $H$ is diagonal and positive ($H_{ii} > 0, \forall i$)
  • not true in general, but assumed to build intuition
  • The regularized objective is (ignoring constants)

$$\hat{L}_R(\theta) \approx \sum_i \frac{1}{2} H_{ii} \left( \theta_i - \theta_i^* \right)^2 + \alpha \, |\theta_i|$$

  • The optimal $\theta_R^*$

$$(\theta_R^*)_i \approx \begin{cases} \max\left\{ \theta_i^* - \dfrac{\alpha}{H_{ii}},\ 0 \right\} & \text{if } \theta_i^* \ge 0 \\[2ex] \min\left\{ \theta_i^* + \dfrac{\alpha}{H_{ii}},\ 0 \right\} & \text{if } \theta_i^* < 0 \end{cases}$$

SLIDE 29

Effect on the optimal solution

  • Effect: induce sparsity

[Figure: the map from $(\theta^*)_i$ to $(\theta_R^*)_i$: zero on the interval $[-\alpha/H_{ii},\ \alpha/H_{ii}]$ and shifted toward zero outside it]

SLIDE 30

Effect on the optimal solution

  • Further assume that $H$ is diagonal
  • Compact expression for the optimal $\theta_R^*$ (soft thresholding; see the sketch below)

$$(\theta_R^*)_i \approx \mathrm{sign}(\theta_i^*) \max\left\{ |\theta_i^*| - \frac{\alpha}{H_{ii}},\ 0 \right\}$$

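A minimal NumPy sketch of this soft-thresholding expression; the array names are illustrative:

```python
import numpy as np

def soft_threshold(theta_star, alpha, H_diag):
    """Componentwise l1-regularized optimum under a diagonal Hessian:
    sign(theta*) * max(|theta*| - alpha / H_ii, 0)."""
    return np.sign(theta_star) * np.maximum(np.abs(theta_star) - alpha / H_diag, 0.0)
```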
SLIDE 31

Bayesian view

  • π‘š1 regularization corresponds to Laplacian prior

π‘ž πœ„ ∝ exp(𝛽 ෍

𝑗

|πœ„π‘—|) log π‘ž πœ„ = 𝛽 ෍

𝑗

|πœ„π‘—| + constant = 𝛽| πœ„ |1 + constant

SLIDE 32

Multiple optimal solutions?

[Figure: linearly separable data (Class +1 vs. Class -1) with three separating classifiers $w_1, w_2, w_3$; prefer $w_2$ (higher confidence)]

SLIDE 33

Add noise to the input

[Figure: the same data with noise added to the inputs; still prefer $w_2$ (higher confidence)]

SLIDE 34

Caution: not too much noise

[Figure: same setting, prefer $w_2$ (higher confidence); too much noise causes data points to cross the boundary]

SLIDE 35

Equivalence to weight decay

  • Suppose the hypothesis is $f(x) = w^T x$ and the noise is $\epsilon \sim N(0, \lambda I)$
  • After adding noise to the input, the loss is

$$L(f) = \mathbb{E}_{x,y,\epsilon} \left[ f(x + \epsilon) - y \right]^2 = \mathbb{E}_{x,y,\epsilon} \left[ f(x) + w^T \epsilon - y \right]^2$$

$$L(f) = \mathbb{E}_{x,y} \left[ f(x) - y \right]^2 + 2\, \mathbb{E}_{x,y,\epsilon} \left[ w^T \epsilon \left( f(x) - y \right) \right] + \mathbb{E}_{\epsilon} \left[ w^T \epsilon \right]^2$$

$$L(f) = \mathbb{E}_{x,y} \left[ f(x) - y \right]^2 + \lambda \|w\|^2$$

The cross term vanishes because $\epsilon$ has mean zero and is independent of $(x, y)$, and $\mathbb{E}[(w^T \epsilon)^2] = \lambda \|w\|^2$.

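A small Monte Carlo sketch that checks this equivalence numerically; all names and the toy data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)
X = rng.normal(size=(100_000, 5))            # toy inputs
y = X @ w + 0.1 * rng.normal(size=100_000)   # toy targets
lam = 0.01

# Loss with Gaussian input noise eps ~ N(0, lam * I)
eps = np.sqrt(lam) * rng.normal(size=X.shape)
noisy_loss = np.mean(((X + eps) @ w - y) ** 2)

# Clean loss plus the weight-decay term lam * ||w||^2
decayed_loss = np.mean((X @ w - y) ** 2) + lam * np.sum(w ** 2)

print(noisy_loss, decayed_loss)   # approximately equal for large samples
```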
SLIDE 36

Add noise to the weights

  • For the loss on each data point, add a noise term to the weights before computing the prediction

$$\epsilon \sim N(0, \eta I), \qquad w' = w + \epsilon$$

  • Prediction: $f_{w'}(x)$ instead of $f_w(x)$
  • The loss becomes

$$L(f) = \mathbb{E}_{x,y,\epsilon} \left[ f_{w+\epsilon}(x) - y \right]^2$$

SLIDE 37

Add noise to the weights

  • The loss becomes

$$L(f) = \mathbb{E}_{x,y,\epsilon} \left[ f_{w+\epsilon}(x) - y \right]^2$$

  • To simplify, use a Taylor expansion (gradients taken with respect to the weights $w$)

$$f_{w+\epsilon}(x) \approx f_w(x) + \epsilon^T \nabla f(x) + \frac{1}{2} \epsilon^T \nabla^2 f(x)\, \epsilon$$

  • Plugging in:

$$L(f) \approx \mathbb{E} \left[ f_w(x) - y \right]^2 + \eta\, \mathbb{E}\!\left[ \left( f_w(x) - y \right) \nabla^2 f_w(x) \right] + \eta\, \mathbb{E} \|\nabla f_w(x)\|^2$$

The middle term is small, so it can be ignored; the last term acts as the regularization term.

SLIDE 38

Other types of regularizations

  • Data augmentation
  • Early stopping
  • Dropout
  • Batch Normalization
SLIDE 39

Data augmentation

Figure from Image Classification with Pyramid Representation and Rotated Data Augmentation on Torch 7, by Keven Wang

SLIDE 40

Data augmentation

  • Adding noise to the input: a special kind of augmentation
  • Be careful about the transformations applied (see the sketch after this list):
  • Example: classifying 'b' and 'd' (a horizontal flip changes the label)
  • Example: classifying '6' and '9' (a 180Β° rotation changes the label)
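A minimal sketch of label-safe augmentation for images stored as NumPy arrays; the chosen transformation set (small shifts plus pixel noise, deliberately excluding flips and rotations) is an assumption for illustration:

```python
import numpy as np

def augment(image, rng, noise_std=0.05, max_shift=2):
    """Apply a random small shift plus Gaussian pixel noise.
    Flips/rotations are excluded here: they could change the label
    ('b' vs 'd', '6' vs '9')."""
    dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))
    return shifted + rng.normal(scale=noise_std, size=image.shape)
```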
SLIDE 41

Early stopping

  • Idea: don't train the network to too small a training error
  • Recall overfitting: the larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between empirical and expected loss
  • Prevent overfitting: do not push the hypothesis too far; use the validation error to decide when to stop

SLIDE 42

Early stopping

Figure from Deep Learning, Goodfellow, Bengio and Courville

SLIDE 43

Early stopping

  • While training, also track the validation error
  • Every time the validation error improves, store a copy of the weights
  • When the validation error has not improved for some time, stop
  • Return the copy of the weights stored (see the sketch below)
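A minimal sketch of this loop; `train_one_epoch` and `validation_error` are hypothetical callables, and `patience` counts epochs without improvement:

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              patience=10, max_epochs=1000):
    """Stop when validation error hasn't improved for `patience` epochs;
    return the weights from the best epoch seen."""
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        train_one_epoch(model)
        err = validation_error(model)
        if err < best_error:
            best_error = err
            best_model = copy.deepcopy(model)   # store a copy of the weights
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_model
```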
SLIDE 44

Early stopping

  • Hyperparameter selection: the number of training steps is the hyperparameter
  • Advantages:
  • Efficient: runs along with training; only needs to store an extra copy of the weights
  • Simple: no change to the model or algorithm
  • Disadvantage: needs validation data
SLIDE 45

Early stopping as a regularizer

Figure from Deep Learning, Goodfellow, Bengio and Courville

SLIDE 46

Dropout

  • Randomly select weights to update
  • More precisely, in each update step:
  • Randomly sample a different binary mask for all the input and hidden units
  • Multiply the mask bits with the units and do the update as usual (see the sketch below)
  • Typical dropout probability: 0.2 for input units and 0.5 for hidden units
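A minimal sketch of the mask-and-multiply step the slides describe; the layer size and example values are illustrative (common implementations also rescale the kept units, which the slides do not cover):

```python
import numpy as np

def dropout_mask(units, drop_prob, rng):
    """Sample a binary mask (1 = keep) and multiply it with the unit activations."""
    mask = rng.random(units.shape) >= drop_prob   # keep each unit w.p. 1 - drop_prob
    return units * mask

# Example: drop_prob 0.2 for input units, 0.5 for hidden units
rng = np.random.default_rng(0)
hidden = rng.normal(size=128)
hidden = dropout_mask(hidden, drop_prob=0.5, rng=rng)
```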
SLIDE 47

Dropout

Figure from Deep Learning, Goodfellow, Bengio and Courville

SLIDE 48

Dropout

Figure from Deep Learning, Goodfellow, Bengio and Courville

SLIDE 49

Dropout

Figure from Deep Learning, Goodfellow, Bengio and Courville

SLIDE 50
Batch Normalization

  • If outputs of earlier layers are uniform or change greatly on one round for one mini-batch, then neurons at the next level can't keep up: they output all high (or all low) values
  • The next layer doesn't have the ability to change its outputs with learning-rate-sized changes to its input weights
  • We say the layer has "saturated"

SLIDE 51

Another View of Problem

  • In ML, we assume future data will be drawn from the same probability distribution as the training data
  • For a hidden unit, after a training update, the earlier layers have new weights and hence generate input data for this hidden unit from a new distribution
  • We want to reduce this internal covariate shift for the benefit of later layers

SLIDE 52

[Slide image: the batch normalization algorithm: compute the mini-batch mean and variance, normalize each activation, then scale and shift with learned parameters]

SLIDE 53

Comments on Batch Normalization

  • The first three steps are just like standardization of the input data, but with respect to only the data in the mini-batch. We can take derivatives and incorporate the learning of the last step's parameters into backpropagation (see the sketch below).
  • Note that the last step can completely undo the previous three steps
  • But if so, this undoing is driven by the later layers, not the earlier layers; later layers get to "choose" whether they want standard normal inputs or not
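A minimal NumPy sketch of the four steps for one mini-batch; the epsilon constant and the `gamma`/`beta` names follow the standard batch-normalization formulation, which is an assumption here since the algorithm slide itself is an image:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: mini-batch of activations, shape (batch_size, num_units).
    Steps 1-3: standardize using mini-batch statistics.
    Step 4: scale and shift with learned parameters (trained by backprop)."""
    mu = x.mean(axis=0)                     # 1. mini-batch mean
    var = x.var(axis=0)                     # 2. mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # 3. normalize
    return gamma * x_hat + beta             # 4. scale and shift
```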

SLIDE 54

What regularizations are frequently used?

  • π‘š2 regularization
  • Early stopping
  • Dropout/Batch Normalization
  • Data augmentation if the transformations known/easy to implement