Regularization for Deep Learning: lecture slides for Chapter 7 of Deep Learning (PowerPoint presentation)



SLIDE 1

Regularization for Deep Learning

Lecture slides for Chapter 7 of Deep Learning (www.deeplearningbook.org), Ian Goodfellow, 2016-09-27

SLIDE 2

(Goodfellow 2016)

Definition

  • “Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.”

SLIDE 3

Weight Decay as Constrained Optimization

Figure 7.1: contours of the unregularized objective and of the norm constraint in (w1, w2) space; the regularized solution w̃ lies between the origin and the unregularized optimum w∗.
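A minimal numerical sketch of the equilibrium in Figure 7.1 (the quadratic objective, learning rate, and decay coefficient here are my own illustration, not from the slides): gradient descent on a loss plus an L2 penalty converges to a point shrunk toward the origin from the unregularized optimum.

```python
import numpy as np

def step(w, grad, lr=0.1, alpha=0.5):
    # one gradient-descent step on loss(w) + (alpha/2) * ||w||^2
    return w - lr * (grad(w) + alpha * w)

w_star = np.array([2.0, 1.0])      # unregularized optimum w*
grad = lambda w: w - w_star        # gradient of 0.5 * ||w - w*||^2
w = np.zeros(2)
for _ in range(1000):
    w = step(w, grad)

# for this isotropic quadratic the fixed point is w~ = w* / (1 + alpha),
# i.e. the solution sits between the origin and w*
print(w)  # ≈ [1.3333, 0.6667]
```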

SLIDE 4

Norm Penalties

  • L1: encourages sparsity; equivalent to MAP Bayesian estimation with a Laplace prior

  • Squared L2: encourages small weights; equivalent to MAP Bayesian estimation with a Gaussian prior
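The two penalties above can be sketched as additive terms on a task loss (the helper name and coefficient are my own, not from the slides):

```python
import numpy as np

def penalized_loss(w, data_loss, alpha=0.01, norm="l2"):
    """Hypothetical helper adding a norm penalty to a task loss.

    norm="l2" adds (alpha/2) * ||w||_2^2  (weight decay; Gaussian MAP prior).
    norm="l1" adds  alpha    * ||w||_1    (sparsity;     Laplace  MAP prior).
    """
    if norm == "l2":
        return data_loss + 0.5 * alpha * np.sum(w ** 2)
    elif norm == "l1":
        return data_loss + alpha * np.sum(np.abs(w))
    raise ValueError(norm)

w = np.array([1.0, -2.0, 0.0])
# ||w||_2^2 = 5, ||w||_1 = 3
print(penalized_loss(w, 1.0, alpha=0.1, norm="l2"))  # 1.25
print(penalized_loss(w, 1.0, alpha=0.1, norm="l1"))  # 1.3
```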

SLIDE 5

Dataset Augmentation

Affine distortion, noise, elastic deformation, horizontal flip, random translation, hue shift
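A toy sketch of two of the augmentations listed above, horizontal flip and random translation, plus additive noise (function name, shift range, and noise scale are my own assumptions; the label is left unchanged):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, rng):
    """Illustrative augmentation: flip, translate, add pixel noise."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                        # horizontal flip
    shift = rng.integers(-2, 3)                   # shift in [-2, 2] pixels
    img = np.roll(img, shift, axis=1)             # random translation
    img = img + rng.normal(0.0, 0.01, img.shape)  # small additive noise
    return img

img = np.arange(64, dtype=float).reshape(8, 8)
out = augment(img, rng)
print(out.shape)  # shape (and label) are preserved
```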

SLIDE 6

Multi-Task Learning

Figure 7.2: a shared representation h(shared), computed from input x, feeds task-specific layers h(1), h(2), h(3) and task outputs y(1), y(2).
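A forward-pass sketch of the architecture in Figure 7.2 (all layer sizes and weight matrices below are my own illustration): one shared hidden layer feeds two task-specific heads, so the shared weights receive gradient signal from both tasks.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                    # input x

# illustrative shapes only: shared trunk + two task heads
W_shared = rng.normal(size=(8, 4))
W_task1 = rng.normal(size=(3, 8))
W_task2 = rng.normal(size=(2, 8))

h_shared = np.maximum(0.0, W_shared @ x)  # h(shared): shared representation
y1 = W_task1 @ h_shared                   # task-1 output y(1)
y2 = W_task2 @ h_shared                   # task-2 output y(2)
print(y1.shape, y2.shape)  # (3,) (2,)
```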

SLIDE 7

Learning Curves

Figure 7.3: loss (negative log-likelihood) against time (epochs); the training set loss keeps decreasing, while the validation set loss falls and then rises again.

Early stopping: terminate training at the point where validation set performance is best.
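The early-stopping rule can be sketched as a patience loop (the patience value and function name are my own assumptions, not from the slides): remember the epoch with the lowest validation loss and stop once it has not improved for `patience` consecutive epochs.

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, best_loss); stop after `patience` bad epochs."""
    best, best_epoch, wait = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0  # new best: reset patience
        else:
            wait += 1
            if wait >= patience:
                break                                # validation loss stopped improving
    return best_epoch, best

# validation loss falls, then rises again (as in Figure 7.3)
losses = [0.20, 0.12, 0.08, 0.05, 0.06, 0.07, 0.09, 0.12]
print(early_stopping(losses))  # (3, 0.05): restore the epoch-3 weights
```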

SLIDE 8

Early Stopping and Weight Decay

Figure 7.4: in (w1, w2) space, early stopping (left) and L2 weight decay (right) both yield a solution w̃ between the origin and the unregularized optimum w∗.

SLIDE 9

Sparse Representations

y = Bh, with y ∈ ℝ^m, B ∈ ℝ^(m×n), h ∈ ℝ^n (Eq. 7.47); h is sparse, so y is expressed as a combination of only a few columns of B.
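A sketch of Eq. 7.47's structure (the dimensions and the two active entries of h below are my own example, not the book's numbers): because most entries of h are zero, y depends on only the corresponding columns of B.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 6
B = rng.normal(size=(m, n))   # B in R^(m x n)
h = np.zeros(n)               # h in R^n, sparse: most entries are zero
h[1], h[4] = 2.0, -3.0        # only two active entries

y = B @ h                     # y in R^m: a combination of two columns of B
print(np.count_nonzero(h))    # 2
```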

SLIDE 10

Bagging

Figure 7.5: the original dataset (images of the digit 8) is resampled with replacement into a first and a second resampled dataset, each used to train one ensemble member.
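A bagging sketch on a toy one-feature least-squares learner (the learner and data are my own illustration; only the resample-with-replacement-then-average scheme comes from the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(0.0, 0.5, 100)  # true slope 3, noisy labels

def fit_slope(X, y):
    # least-squares slope through the origin (toy "learner")
    return (X[:, 0] @ y) / (X[:, 0] @ X[:, 0])

slopes = []
for _ in range(10):
    idx = rng.integers(0, len(X), len(X))      # bootstrap: sample with replacement
    slopes.append(fit_slope(X[idx], y[idx]))   # train one ensemble member

ensemble_slope = np.mean(slopes)               # average the members' predictions
print(round(ensemble_slope, 1))
```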

SLIDE 11

Dropout

Figure 7.6: the base network (inputs x1, x2; hidden units h1, h2; output y) and the ensemble of subnetworks obtained by deleting non-output units; some subnetworks have no path from input to output.

SLIDE 12

Adversarial Examples

x + ε · sign(∇x J(θ, x, y)): an image x classified as “panda” (57.7% confidence), plus .007 times the perturbation sign(∇x J(θ, x, y)) (itself classified “nematode”, 8.2% confidence), is classified as “gibbon” (99.3% confidence).

Figure 7.8 Training on adversarial examples is mostly intended to improve security, but can sometimes provide generic regularization.
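The perturbation rule can be sketched on a toy logistic model (the model and its gradient are my own illustration; only the update x + ε · sign(∇x J) comes from the slide):

```python
import numpy as np

def fgsm(x, grad_J, eps=0.007):
    # fast-gradient-sign perturbation: move each input by exactly eps
    # in the direction that increases the loss J
    return x + eps * np.sign(grad_J)

w = np.array([1.0, -2.0, 0.5])          # toy linear classifier weights
x = np.array([0.2, 0.1, -0.3])
y = 1.0
p = 1.0 / (1.0 + np.exp(-(w @ x)))      # predicted P(y=1 | x)
grad_J = (p - y) * w                    # gradient of cross-entropy J w.r.t. x

x_adv = fgsm(x, grad_J)
print(np.max(np.abs(x_adv - x)))        # each coordinate moved by 0.007
```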

SLIDE 13

Tangent Propagation

Figure 7.9: in (x1, x2) space, the tangent and normal directions to the data manifold at a point; tangent propagation penalizes sensitivity of the model along the tangent direction.
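A sketch of the tangent propagation penalty (the finite-difference directional derivative is my own simplification of the analytic gradient): penalize the squared directional derivative of the model output along a known invariance direction v, so the model changes little along v.

```python
import numpy as np

def tangent_penalty(f, x, v, h=1e-5):
    """Squared directional derivative (grad_x f) . v, by finite differences."""
    dir_deriv = (f(x + h * v) - f(x)) / h
    return dir_deriv ** 2

f = lambda x: x[0] ** 2 + 0.0 * x[1]  # toy model: ignores x[1] entirely
x = np.array([1.0, 2.0])
print(tangent_penalty(f, x, np.array([0.0, 1.0])))        # ~0: invariant along v
print(tangent_penalty(f, x, np.array([1.0, 0.0])) > 1.0)  # True: sensitive along x1
```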