Lecture 4: Regularization II (Princeton University COS 495)
SLIDE 1

Deep Learning Basics Lecture 4: Regularization II

Princeton University COS 495 Instructor: Yingyu Liang

SLIDE 2

Review

SLIDE 3

Regularization as hard constraint

  • Constrained optimization

$$\min_{\theta} \; \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i) \quad \text{subject to: } R(\theta) \le r$$

SLIDE 4

Regularization as soft constraint

  • Unconstrained optimization

$$\min_{\theta} \; \hat{L}_R(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i) + \lambda R(\theta)$$

for some regularization parameter $\lambda > 0$

SLIDE 5

Regularization as Bayesian prior

  • Bayesian rule:

π‘ž πœ„ | {𝑦𝑗, 𝑧𝑗} = π‘ž πœ„ π‘ž 𝑦𝑗, 𝑧𝑗 πœ„) π‘ž({𝑦𝑗, 𝑧𝑗})

  • Maximum A Posteriori (MAP):

$$\max_{\theta} \; \log p(\theta \mid \{x_i, y_i\}) = \max_{\theta} \; \underbrace{\log p(\theta)}_{\text{regularization}} + \underbrace{\log p(\{x_i, y_i\} \mid \theta)}_{\text{MLE loss}}$$

SLIDE 6

Classical regularizations

  • Norm penalty
  • π‘š2 regularization
  • π‘š1 regularization
SLIDE 7

More examples

SLIDE 8

Other types of regularizations

  • Robustness to noise
  • Noise to the input
  • Noise to the weights
  • Noise to the output
  • Data augmentation
  • Early stopping
  • Dropout
SLIDE 9

Multiple optimal solutions?

[Figure: three separating hyperplanes w1, w2, w3 between Class +1 and Class -1; prefer w2 (higher confidence)]

SLIDE 10

Add noise to the input

[Figure: noisy copies of the data points; prefer w2 (higher confidence), which still separates Class +1 and Class -1]
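A minimal sketch of the idea (assuming Gaussian noise, matching the variance used in the weight-decay equivalence two slides later): every time an example is used, a fresh noise vector is added to its input.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_input_noise(X, noise_var=0.01):
    """Return the inputs with fresh Gaussian noise eps ~ N(0, noise_var * I) added.
    noise_var should stay small so points do not cross the decision boundary."""
    eps = rng.normal(scale=np.sqrt(noise_var), size=X.shape)
    return X + eps
```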

SLIDE 11

Caution: not too much noise

[Figure: the same setting with much larger input noise] Too much noise leads to data points crossing the boundary.

SLIDE 12

Equivalence to weight decay

  • Suppose the hypothesis is 𝑔 𝑦 = π‘₯π‘ˆπ‘¦, noise is πœ—~𝑂(0, πœ‡π½)
  • After adding noise, the loss is

$$L(f) = \mathbb{E}_{x,y,\epsilon}\left[f(x+\epsilon) - y\right]^2 = \mathbb{E}_{x,y,\epsilon}\left[f(x) + w^T \epsilon - y\right]^2$$

$$L(f) = \mathbb{E}_{x,y,\epsilon}\left[f(x) - y\right]^2 + 2\,\mathbb{E}_{x,y,\epsilon}\left[w^T \epsilon \left(f(x) - y\right)\right] + \mathbb{E}_{x,y,\epsilon}\left[w^T \epsilon\right]^2$$

$$L(f) = \mathbb{E}_{x,y,\epsilon}\left[f(x) - y\right]^2 + \lambda \|w\|^2$$

The cross term vanishes because $\epsilon$ is independent of $(x, y)$ and has zero mean, and $\mathbb{E}\left[w^T \epsilon\right]^2 = \lambda \|w\|^2$.
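A quick Monte Carlo check of this identity (a sketch under the slide's assumptions: linear hypothesis $f(x) = w^T x$, squared loss, Gaussian input noise): averaging the noisy loss over many noise draws approaches the clean loss plus $\lambda \|w\|^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = rng.normal(size=d)                              # an arbitrary hypothesis w

clean = np.mean((X @ w - y) ** 2)                   # E[(f(x) - y)^2] over the sample

noisy, trials = 0.0, 5000
for _ in range(trials):                             # estimate E[(f(x + eps) - y)^2]
    eps = rng.normal(scale=np.sqrt(lam), size=(n, d))
    noisy += np.mean(((X + eps) @ w - y) ** 2)
noisy /= trials

print(clean + lam * np.sum(w ** 2))                 # clean loss + weight decay term
print(noisy)                                        # should be approximately equal
```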

SLIDE 13

Add noise to the weights

  • For the loss on each data point, add a noise term to the weights

before computing the prediction πœ—~𝑂(0, πœƒπ½), π‘₯β€² = π‘₯ + πœ—

  • Prediction: 𝑔π‘₯β€² 𝑦 instead of 𝑔

π‘₯ 𝑦

  • Loss becomes

$$L(f) = \mathbb{E}_{x,y,\epsilon}\left[f_{w+\epsilon}(x) - y\right]^2$$

SLIDE 14

Add noise to the weights

  • Loss becomes

$$L(f) = \mathbb{E}_{x,y,\epsilon}\left[f_{w+\epsilon}(x) - y\right]^2$$

  • To simplify, use Taylor expansion
  • 𝑔

π‘₯+πœ— 𝑦 β‰ˆ 𝑔 π‘₯ 𝑦 + πœ—π‘ˆπ›Όπ‘” 𝑦 + πœ—π‘ˆπ›Ό2𝑔 𝑦 πœ— 2

  • Plug in
  • 𝑀 𝑔 β‰ˆ 𝔽 𝑔

π‘₯ 𝑦 βˆ’ 𝑧 2 + πœƒπ”½[ 𝑔 π‘₯ 𝑦 βˆ’ 𝑧 𝛼2𝑔 π‘₯ 𝑦 ] + πœƒπ”½||𝛼𝑔 π‘₯(𝑦)||2

Small so can be ignored Regularization term
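A minimal sketch of the weight-noise version for the same linear model (my illustration): the noise is added to the weights, not the input, before computing the prediction for each data point.

```python
import numpy as np

rng = np.random.default_rng(0)

def weight_noise_loss_grad(w, x, y, eta=0.01):
    """Loss (f_{w+eps}(x) - y)^2 and its gradient w.r.t. the clean weights w,
    with eps ~ N(0, eta*I) sampled fresh for this data point."""
    eps = rng.normal(scale=np.sqrt(eta), size=w.shape)
    residual = (w + eps) @ x - y          # the prediction uses the perturbed weights
    return residual ** 2, 2.0 * residual * x
```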

SLIDE 15

Data augmentation

Figure from Image Classification with Pyramid Representation and Rotated Data Augmentation on Torch 7, by Keven Wang

SLIDE 16

Data augmentation

  • Adding noise to the input: a special kind of augmentation
  • Be careful about the transformation applied:
  • Example: classifying β€˜b’ and β€˜d’
  • Example: classifying β€˜6’ and β€˜9’
SLIDE 17

Early stopping

  • Idea: don’t train the network to too small training error
  • Recall overfitting: Larger the hypothesis class, easier to find a

hypothesis that fits the difference between the two

  • Prevent overfitting: do not push the hypothesis too much; use

validation error to decide when to stop

SLIDE 18

Early stopping

Figure from Deep Learning, Goodfellow, Bengio and Courville

SLIDE 19

Early stopping

  • When training, also output validation error
  • Every time validation error improved, store a copy of the weights
  • When validation error not improved for some time, stop
  • Return the copy of the weights stored
SLIDE 20

Early stopping

  • hyperparameter selection: training step is the hyperparameter
  • Advantage
  • Efficient: along with training; only store an extra copy of weights
  • Simple: no change to the model/algo
  • Disadvantage: need validation data
SLIDE 21

Early stopping

  • Strategy to get rid of the disadvantage
  • After early stopping of the first run, train a second run and reuse validation

data

  • How to reuse validation data
  • 1. Start fresh, train with both training data and validation data up to the

previous number of epochs

  • 2. Start from the weights in the first run, train with both training data and

validation data util the validation loss < the training loss at the early stopping point
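The two reuse strategies as a sketch (hypothetical make_model, train_epoch, and loss_on helpers; best_epoch and stop_loss come from the first, early-stopped run):

```python
def retrain_strategy_1(make_model, train_epoch, train_data, val_data, best_epoch):
    """Strategy 1: start fresh and train on training + validation data
    for the number of epochs chosen by early stopping in the first run."""
    model = make_model()
    for _ in range(best_epoch):
        train_epoch(model, train_data + val_data)
    return model

def retrain_strategy_2(model, train_epoch, loss_on, train_data, val_data, stop_loss):
    """Strategy 2: continue from the first run's weights and train on
    training + validation data until the validation loss falls below the
    training loss recorded at the early stopping point."""
    while loss_on(model, val_data) >= stop_loss:
        train_epoch(model, train_data + val_data)
    return model
```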

SLIDE 22

Early stopping as a regularizer

Figure from Deep Learning, Goodfellow, Bengio and Courville

SLIDE 23

Dropout

  • Randomly select weights to update
  • More precisely, in each update step
  • Randomly sample a different binary mask to all the input and hidden units
  • Multiple the mask bits with the units and do the update as usual
  • Typical dropout probability: 0.2 for input and 0.5 for hidden units
SLIDE 24

Dropout

Figure from Deep Learning, Goodfellow, Bengio and Courville

SLIDE 25

Dropout

Figure from Deep Learning, Goodfellow, Bengio and Courville

SLIDE 26

Dropout

Figure from Deep Learning, Goodfellow, Bengio and Courville

SLIDE 27

What regularizations are frequently used?

  • π‘š2 regularization
  • Early stopping
  • Dropout
  • Data augmentation if the transformations known/easy to implement