SLIDE 1

Deep Learning Basics Lecture 3: Regularization I

Princeton University COS 495
Instructor: Yingyu Liang

SLIDE 2

What is regularization?

  • In general: any method to prevent overfitting or help the optimization
  • Specifically: additional terms in the training optimization objective to

prevent overfitting or help the optimization

slide-3
SLIDE 3

Review: overfitting

SLIDE 4

𝑒 = sin 2πœŒπ‘¦ + πœ—

Figure from Pattern Recognition and Machine Learning, Bishop

Overfitting example: regression using polynomials

SLIDE 5

Overfitting example: regression using polynomials

Figure from Pattern Recognition and Machine Learning, Bishop

SLIDE 6

Overfitting

  • Empirical loss and expected loss are different
  • Smaller the data set, larger the difference between the two
  • Larger the hypothesis class, easier to find a hypothesis that fits the

difference between the two

  • Thus has small training error but large test error (overfitting)
slide-7
SLIDE 7

Prevent overfitting

  • Larger data set helps
  • Throwing away useless hypotheses also helps
  • Classical regularization: some principal ways to constrain hypotheses
  • Other types of regularization: data augmentation, early stopping, etc.
slide-8
SLIDE 8

Different views of regularization

SLIDE 9

Regularization as hard constraint

  • Training objective

min

𝑔

ΰ·  𝑀 𝑔 = 1 π‘œ ෍

𝑗=1 π‘œ

π‘š(𝑔, 𝑦𝑗, 𝑧𝑗) subject to: 𝑔 ∈ π“˜

  • When parametrized

min

πœ„

ΰ·  𝑀 πœ„ = 1 π‘œ ෍

𝑗=1 π‘œ

π‘š(πœ„, 𝑦𝑗, 𝑧𝑗) subject to: πœ„ ∈ 𝛻

slide-10
SLIDE 10

Regularization as hard constraint

  • When 𝛻 measured by some quantity 𝑆

min

πœ„

ΰ·  𝑀 πœ„ = 1 π‘œ ෍

𝑗=1 π‘œ

π‘š(πœ„, 𝑦𝑗, 𝑧𝑗) subject to: 𝑆 πœ„ ≀ 𝑠

  • Example: π‘š2 regularization

min

πœ„

ΰ·  𝑀 πœ„ = 1 π‘œ ෍

𝑗=1 π‘œ

π‘š(πœ„, 𝑦𝑗, 𝑧𝑗) subject to: | πœ„| 2

2 ≀ 𝑠2
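
A minimal sketch of how the hard-constraint view can be used in practice is projected gradient descent: take an unconstrained gradient step on the empirical loss, then project back onto the feasible set {ΞΈ : β€–ΞΈβ€–β ≀ r}. The loss, radius, and function names below are illustrative assumptions, not from the slides.

```python
import numpy as np

def project_l2_ball(theta, r):
    """Project theta onto the l2 ball {theta : ||theta||_2 <= r}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= r else theta * (r / norm)

def projected_gradient_descent(grad_loss, theta0, r, lr=0.1, steps=100):
    """Minimize the empirical loss subject to the hard constraint ||theta||_2 <= r."""
    theta = project_l2_ball(theta0, r)
    for _ in range(steps):
        theta = theta - lr * grad_loss(theta)   # unconstrained gradient step
        theta = project_l2_ball(theta, r)       # project back onto the feasible set
    return theta

# Toy example: quadratic loss L(theta) = ||theta - c||^2 whose optimum lies outside the ball.
c = np.array([3.0, 4.0])
theta_hat = projected_gradient_descent(lambda th: 2 * (th - c), np.zeros(2), r=1.0)
print(theta_hat, np.linalg.norm(theta_hat))  # ends up on the boundary, norm ~= 1
```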

SLIDE 11

Regularization as soft constraint

  • The hard-constraint optimization is equivalent to soft-constraint

min

πœ„

ΰ·  𝑀𝑆 πœ„ = 1 π‘œ ෍

𝑗=1 π‘œ

π‘š(πœ„, 𝑦𝑗, 𝑧𝑗) + πœ‡βˆ—π‘†(πœ„) for some regularization parameter πœ‡βˆ— > 0

  • Example: π‘š2 regularization

min

πœ„

ΰ·  𝑀𝑆 πœ„ = 1 π‘œ ෍

𝑗=1 π‘œ

π‘š(πœ„, 𝑦𝑗, 𝑧𝑗) + πœ‡βˆ—| πœ„| 2

2

slide-12
SLIDE 12

Regularization as soft constraint

  • Showed by Lagrangian multiplier method

β„’ πœ„, πœ‡ ≔ ΰ·  𝑀 πœ„ + πœ‡[𝑆 πœ„ βˆ’ 𝑠]

  • Suppose πœ„βˆ— is the optimal for hard-constraint optimization

πœ„βˆ— = argmin

πœ„

max

πœ‡β‰₯0 β„’ πœ„, πœ‡ ≔ ΰ· 

𝑀 πœ„ + πœ‡[𝑆 πœ„ βˆ’ 𝑠]

  • Suppose πœ‡βˆ— is the corresponding optimal for max

πœ„βˆ— = argmin

πœ„

β„’ πœ„, πœ‡βˆ— ≔ ΰ·  𝑀 πœ„ + πœ‡βˆ—[𝑆 πœ„ βˆ’ 𝑠]

SLIDE 13

Regularization as Bayesian prior

  • Bayesian view: everything is a distribution
  • Prior over the hypotheses: π‘ž πœ„
  • Posterior over the hypotheses: π‘ž πœ„ | {𝑦𝑗, 𝑧𝑗}
  • Likelihood: π‘ž

𝑦𝑗, 𝑧𝑗 πœ„)

  • Bayesian rule:

π‘ž πœ„ | {𝑦𝑗, 𝑧𝑗} = π‘ž πœ„ π‘ž 𝑦𝑗, 𝑧𝑗 πœ„) π‘ž({𝑦𝑗, 𝑧𝑗})

slide-14
SLIDE 14

Regularization as Bayesian prior

  • Bayesian rule:

π‘ž πœ„ | {𝑦𝑗, 𝑧𝑗} = π‘ž πœ„ π‘ž 𝑦𝑗, 𝑧𝑗 πœ„) π‘ž({𝑦𝑗, 𝑧𝑗})

  • Maximum A Posteriori (MAP):

max

πœ„

log π‘ž πœ„ | {𝑦𝑗, 𝑧𝑗} = max

πœ„

log π‘ž πœ„ + log π‘ž 𝑦𝑗, 𝑧𝑗 | πœ„ Regularization MLE loss

slide-15
SLIDE 15

Regularization as Bayesian prior

  • Example: π‘š2 loss with π‘š2 regularization

min

πœ„

ΰ·  𝑀𝑆 πœ„ = 1 π‘œ ෍

𝑗=1 π‘œ

𝑔

πœ„ 𝑦𝑗 βˆ’ 𝑧𝑗 2 + πœ‡βˆ—| πœ„| 2 2

  • Correspond to a normal likelihood π‘ž 𝑦, 𝑧 | πœ„ and a normal prior π‘ž(πœ„)
SLIDE 16

Three views

  • Typical choice for optimization: soft-constraint

min

πœ„

ΰ·  𝑀𝑆 πœ„ = ΰ·  𝑀 πœ„ + πœ‡π‘†(πœ„)

  • Hard constraint and Bayesian view: conceptual; or used for derivation
slide-17
SLIDE 17

Three views

  • Hard-constraint preferred if
  • Know the explicit bound 𝑆 πœ„ ≀ 𝑠
  • Soft-constraint causes trapped in a local minima with small πœ„
  • Projection back to feasible set leads to stability
  • Bayesian view preferred if
  • Know the prior distribution
slide-18
SLIDE 18

Some examples

SLIDE 19

Classical regularization

  • Norm penalty
  • π‘š2 regularization
  • π‘š1 regularization
  • Robustness to noise
slide-20
SLIDE 20

π‘š2 regularization

min

πœ„

ΰ·  𝑀𝑆 πœ„ = ΰ·  𝑀(πœ„) + 𝛽 2 | πœ„| 2

2

  • Effect on (stochastic) gradient descent
  • Effect on the optimal solution
SLIDE 21

Effect on gradient descent

  • Gradient of regularized objective

𝛼෠ 𝑀𝑆 πœ„ = 𝛼෠ 𝑀(πœ„) + π›½πœ„

  • Gradient descent update

πœ„ ← πœ„ βˆ’ πœƒπ›Όΰ·  𝑀𝑆 πœ„ = πœ„ βˆ’ πœƒ 𝛼෠ 𝑀 πœ„ βˆ’ πœƒπ›½πœ„ = 1 βˆ’ πœƒπ›½ πœ„ βˆ’ πœƒ 𝛼෠ 𝑀 πœ„

  • Terminology: weight decay
SLIDE 22

Effect on the optimal solution

  • Consider a quadratic approximation around πœ„βˆ—

ΰ·  𝑀 πœ„ β‰ˆ ΰ·  𝑀 πœ„βˆ— + πœ„ βˆ’ πœ„βˆ— π‘ˆπ›Όΰ·  𝑀 πœ„βˆ— + 1 2 πœ„ βˆ’ πœ„βˆ— π‘ˆπΌ πœ„ βˆ’ πœ„βˆ—

  • Since πœ„βˆ— is optimal, 𝛼෠

𝑀 πœ„βˆ— = 0 ΰ·  𝑀 πœ„ β‰ˆ ΰ·  𝑀 πœ„βˆ— + 1 2 πœ„ βˆ’ πœ„βˆ— π‘ˆπΌ πœ„ βˆ’ πœ„βˆ— 𝛼෠ 𝑀 πœ„ β‰ˆ 𝐼 πœ„ βˆ’ πœ„βˆ—

slide-23
SLIDE 23

Effect on the optimal solution

  • Gradient of regularized objective

𝛼෠ 𝑀𝑆 πœ„ β‰ˆ 𝐼 πœ„ βˆ’ πœ„βˆ— + π›½πœ„

  • On the optimal πœ„π‘†

βˆ—

0 = 𝛼෠ 𝑀𝑆 πœ„π‘†

βˆ— β‰ˆ 𝐼 πœ„π‘† βˆ— βˆ’ πœ„βˆ— + π›½πœ„π‘† βˆ—

πœ„π‘†

βˆ— β‰ˆ 𝐼 + 𝛽𝐽 βˆ’1πΌπœ„βˆ—

slide-24
SLIDE 24

Effect on the optimal solution

  • The optimal

πœ„π‘†

βˆ— β‰ˆ 𝐼 + 𝛽𝐽 βˆ’1πΌπœ„βˆ—

  • Suppose 𝐼 has eigen-decomposition 𝐼 = π‘…Ξ›π‘…π‘ˆ

πœ„π‘†

βˆ— β‰ˆ 𝐼 + 𝛽𝐽 βˆ’1πΌπœ„βˆ— = 𝑅 Ξ› + 𝛽𝐽 βˆ’1Ξ›π‘…π‘ˆπœ„βˆ—

  • Effect: rescale along eigenvectors of 𝐼
SLIDE 25

Effect on the optimal solution

Figure from Deep Learning, Goodfellow, Bengio and Courville

Notations: πœ„βˆ— = π‘₯βˆ—, πœ„π‘†

βˆ— = ΰ·₯

π‘₯

slide-26
SLIDE 26

π‘š1 regularization

min

πœ„

ΰ·  𝑀𝑆 πœ„ = ΰ·  𝑀(πœ„) + 𝛽| πœ„ |1

  • Effect on (stochastic) gradient descent
  • Effect on the optimal solution
SLIDE 27

Effect on gradient descent

  • Gradient of regularized objective

𝛼෠ 𝑀𝑆 πœ„ = 𝛼෠ 𝑀 πœ„ + 𝛽 sign(πœ„) where sign applies to each element in πœ„

  • Gradient descent update

πœ„ ← πœ„ βˆ’ πœƒπ›Όΰ·  𝑀𝑆 πœ„ = πœ„ βˆ’ πœƒ 𝛼෠ 𝑀 πœ„ βˆ’ πœƒπ›½ sign(πœ„)

SLIDE 28

Effect on the optimal solution

  • Consider a quadratic approximation around πœ„βˆ—

ΰ·  𝑀 πœ„ β‰ˆ ΰ·  𝑀 πœ„βˆ— + πœ„ βˆ’ πœ„βˆ— π‘ˆπ›Όΰ·  𝑀 πœ„βˆ— + 1 2 πœ„ βˆ’ πœ„βˆ— π‘ˆπΌ πœ„ βˆ’ πœ„βˆ—

  • Since πœ„βˆ— is optimal, 𝛼෠

𝑀 πœ„βˆ— = 0 ΰ·  𝑀 πœ„ β‰ˆ ΰ·  𝑀 πœ„βˆ— + 1 2 πœ„ βˆ’ πœ„βˆ— π‘ˆπΌ πœ„ βˆ’ πœ„βˆ—

slide-29
SLIDE 29

Effect on the optimal solution

  • Further assume that 𝐼 is diagonal and positive (𝐼𝑗𝑗> 0, βˆ€π‘—)
  • not true in general but assume for getting some intuition
  • The regularized objective is (ignoring constants)

ΰ·  𝑀𝑆 πœ„ β‰ˆ ෍

𝑗

1 2 𝐼𝑗𝑗 πœ„π‘— βˆ’ πœ„π‘—

βˆ— 2 + 𝛽 |πœ„π‘—|

  • The optimal πœ„π‘†

βˆ—

(πœ„π‘†

βˆ—)𝑗 β‰ˆ

max πœ„π‘—

βˆ— βˆ’ 𝛽

𝐼𝑗𝑗 , 0 if πœ„π‘—

βˆ— β‰₯ 0

min πœ„π‘—

βˆ— + 𝛽

𝐼𝑗𝑗 , 0 if πœ„π‘—

βˆ— < 0

slide-30
SLIDE 30

Effect on the optimal solution

  • Effect: induce sparsity

    [Figure: plot of (ΞΈ*_R)_i against (ΞΈ*)_i, which is zero on the interval (βˆ’Ξ±/H_ii, Ξ±/H_ii)]

SLIDE 31

Effect on the optimal solution

  • Further assume that 𝐼 is diagonal
  • Compact expression for the optimal πœ„π‘†

βˆ—

(πœ„π‘†

βˆ—)𝑗 β‰ˆ sign πœ„π‘— βˆ— max{ πœ„π‘— βˆ— βˆ’ 𝛽

𝐼𝑗𝑗 , 0}
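
This is the familiar soft-thresholding operator. A small numpy sketch under the same diagonal-Hessian assumption; the values of ΞΈ*, H_ii, and Ξ± are illustrative assumptions.

```python
import numpy as np

def soft_threshold(theta_star, alpha, h_diag):
    """Approximate l1-regularized optimum, assuming a diagonal Hessian with entries h_diag."""
    shrink = alpha / h_diag                        # per-coordinate threshold alpha / H_ii
    return np.sign(theta_star) * np.maximum(np.abs(theta_star) - shrink, 0.0)

theta_star = np.array([ 2.0, 0.3, -0.05, -1.5])    # unregularized optimum (illustrative)
h_diag     = np.array([ 1.0, 1.0,  1.0,   2.0])    # diagonal entries H_ii (illustrative)
print(soft_threshold(theta_star, alpha=0.5, h_diag=h_diag))
# [ 1.5  0.  -0.  -1.25]  -- small coordinates are set exactly to zero (sparsity)
```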

SLIDE 32

Bayesian view

  • π‘š1 regularization corresponds to Laplacian prior

π‘ž πœ„ ∝ exp(𝛽 ෍

𝑗

|πœ„π‘—|) log π‘ž πœ„ = 𝛽 ෍

𝑗

|πœ„π‘—| + constant = 𝛽| πœ„ |1 + constant