Lecture 10: Training Neural Networks (Part 1)

SLIDE 1

Justin Johnson October 5, 2020

Lecture 10: Training Neural Networks (Part 1)


SLIDE 2

Reminder: A3

  • Due Friday, October 9

SLIDE 3

Midterm Exam

We are still working out details! Will share more on Wednesday

  • Will (most likely) be online via https://crabster.org/
  • Material up to Lecture 13 is fair game
  • Mostly conceptual questions, no coding
  • Some combination of:
      • True / False
      • Multiple choice
      • Short answer (math on paper)
  • If you need accommodations, send your SSD letter to me
SLIDE 4

Last Time: Hardware and Software

CPU vs GPU vs TPU

Static Graphs vs Dynamic Graphs
PyTorch vs TensorFlow

SLIDE 5

Overview

1. One time setup: activation functions, data preprocessing, weight initialization, regularization
2. Training dynamics: learning rate schedules; large-batch training; hyperparameter optimization
3. After training: model ensembles, transfer learning

SLIDE 6

Overview (same outline as Slide 5): item 1, one time setup, is covered today; items 2 and 3 are covered next time.

SLIDE 7

Activation Functions


SLIDE 9

Activation Functions

Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU

SLIDE 10

Activation Functions: Sigmoid

σ(x) = 1 / (1 + e^(-x))

  • Squashes numbers to range [0,1]
  • Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

SLIDE 11

Activation Functions: Sigmoid

σ(x) = 1 / (1 + e^(-x))

  • Squashes numbers to range [0,1]
  • Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

3 problems:

1. Saturated neurons “kill” the gradients

SLIDE 12

Activation Functions: Sigmoid

sigmoid gate: σ(x)

What happens when x = -10? When x = 0? When x = 10?
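The three cases above can be checked numerically. A minimal sketch in plain Python (the function names are mine, not from the slides): the sigmoid's local gradient is σ(x)(1 − σ(x)), which vanishes at both tails.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # local gradient of the sigmoid gate: dσ/dx = σ(x) * (1 - σ(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-10.0, 0.0, 10.0]:
    print(f"x={x:+.0f}  σ(x)={sigmoid(x):.5f}  dσ/dx={sigmoid_grad(x):.5f}")
```

At x = ±10 the local gradient is about 4.5e-5, so almost no signal flows backward; only near x = 0 (gradient 0.25) does the gate pass gradient through.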


SLIDE 14

Activation Functions: Sigmoid

σ(x) = 1 / (1 + e^(-x))

  • Squashes numbers to range [0,1]
  • Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

3 problems:

1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered

SLIDE 15

Consider what happens when the nonlinearity is always positive. What can we say about the gradients on W^(ℓ)?

h_j^(ℓ) = Σ_k w_{j,k}^(ℓ) σ(h_k^(ℓ-1)) + b_j^(ℓ)

h_j^(ℓ) is the jth element of the hidden layer at layer ℓ (before activation); W^(ℓ), b^(ℓ) are the weights and bias of layer ℓ

SLIDE 17

Consider what happens when the nonlinearity is always positive. What can we say about the gradients on W^(ℓ)? Always all positive or all negative :(

[Figure: hypothetical optimal w vector in weight space; the allowed gradient update directions cover only two quadrants, so updates must zig-zag toward the optimum]

h_j^(ℓ) = Σ_k w_{j,k}^(ℓ) σ(h_k^(ℓ-1)) + b_j^(ℓ)

h_j^(ℓ) is the jth element of the hidden layer at layer ℓ (before activation); W^(ℓ), b^(ℓ) are the weights and bias of layer ℓ

SLIDE 18

Same as Slide 17, with one caveat added: this holds for a single element! Minibatches help.


SLIDE 20

Activation Functions: Sigmoid

σ(x) = 1 / (1 + e^(-x))

  • Squashes numbers to range [0,1]
  • Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

3 problems:

1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered
3. exp() is a bit compute expensive

SLIDE 21

Activation Functions: Tanh

tanh(x)

  • Squashes numbers to range [-1,1]
  • Zero-centered (nice)
  • Still kills gradients when saturated :(
SLIDE 22

Activation Functions: ReLU

ReLU (Rectified Linear Unit): f(x) = max(0, x)

  • Does not saturate (in + region)
  • Very computationally efficient
  • Converges much faster than sigmoid/tanh in practice (e.g. 6x)

SLIDE 23

Activation Functions: ReLU

ReLU (Rectified Linear Unit): f(x) = max(0, x)

  • Does not saturate (in + region)
  • Very computationally efficient
  • Converges much faster than sigmoid/tanh in practice (e.g. 6x)
  • Not zero-centered output
SLIDE 24

Activation Functions: ReLU

ReLU (Rectified Linear Unit): f(x) = max(0, x)

  • Does not saturate (in + region)
  • Very computationally efficient
  • Converges much faster than sigmoid/tanh in practice (e.g. 6x)
  • Not zero-centered output
  • An annoyance: hint: what is the gradient when x < 0?

SLIDE 25

Activation Functions: ReLU

ReLU gate: f(x) = max(0, x)

What happens when x = -10? When x = 0? When x = 10?

SLIDE 26

[Figure: data cloud in input space; an active ReLU’s hyperplane passes through the data cloud, while a dead ReLU’s hyperplane lies entirely outside it]

A dead ReLU will never activate => never update

SLIDE 27

[Figure: same data cloud, active ReLU, and dead ReLU as Slide 26]

A dead ReLU will never activate => never update. => Sometimes initialize ReLU neurons with slightly positive biases (e.g. 0.01)
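The dead-ReLU condition can be demonstrated on a toy 1-D data cloud; in this sketch the data and the bias values are made up for illustration, and a neuron a = max(0, w·x + b) is counted as firing when its pre-activation is positive.

```python
import random

random.seed(0)
# hypothetical 1-D "data cloud" in [-1, 1]
data = [random.uniform(-1.0, 1.0) for _ in range(1000)]

w = 1.0
b_dead = -5.0  # hyperplane far outside the data cloud
b_ok = 0.01    # slightly positive bias, as suggested on the slide

active_dead = sum(1 for x in data if w * x + b_dead > 0)
active_ok = sum(1 for x in data if w * x + b_ok > 0)
print(active_dead, active_ok)
```

Since the dead neuron never fires on any training point, its local gradient is zero on every example, so no update can ever revive it.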

SLIDE 28

Activation Functions: Leaky ReLU

Leaky ReLU: f(x) = max(αx, x), where α is a hyperparameter, often α = 0.1

  • Does not saturate
  • Computationally efficient
  • Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
  • Will not “die”

Maas et al, “Rectifier Nonlinearities Improve Neural Network Acoustic Models”, ICML 2013
SLIDE 29

Activation Functions: Leaky ReLU

Leaky ReLU: f(x) = max(αx, x), where α is a hyperparameter, often α = 0.1

Parametric ReLU (PReLU): f(x) = max(αx, x), where α is learned via backprop

  • Does not saturate
  • Computationally efficient
  • Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
  • Will not “die”

Maas et al, “Rectifier Nonlinearities Improve Neural Network Acoustic Models”, ICML 2013
He et al, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, ICCV 2015

SLIDE 30

Activation Functions: Exponential Linear Unit (ELU)

f(x) = x            if x > 0
f(x) = α(e^x − 1)   if x ≤ 0     (default α = 1)

  • All benefits of ReLU
  • Closer to zero-mean outputs
  • Negative saturation regime adds some robustness to noise, compared with Leaky ReLU
  • Computation requires exp()

SLIDE 31

Activation Functions: Scaled Exponential Linear Unit (SELU)

selu(x) = λx            if x > 0
selu(x) = λα(e^x − 1)   if x ≤ 0

α = 1.6732632423543772848170429916717
λ = 1.0507009873554804934193349852946

  • Scaled version of ELU that works better for deep networks
  • “Self-Normalizing” property; can train deep SELU networks without BatchNorm

Klambauer et al, Self-Normalizing Neural Networks, ICLR 2017

SLIDE 32

Same as Slide 31, with one note added: the derivation takes 91 pages of math in the appendix…

SLIDE 33

Activation Functions: Gaussian Error Linear Unit (GELU)

X ~ N(0, 1)
gelu(x) = x · P(X ≤ x) = (x/2)(1 + erf(x/√2)) ≈ x · σ(1.702x)

  • Idea: multiply the input by 0 or 1 at random; large values are more likely to be multiplied by 1, small values more likely to be multiplied by 0 (data-dependent dropout)
  • Take the expectation over the randomness
  • Very common in Transformers (BERT, GPT, GPT-2, GPT-3)

Hendrycks and Gimpel, Gaussian Error Linear Units (GELUs), 2016

SLIDE 34

[Figure: bar chart of accuracy on CIFAR10 (y-axis 90–96%) for ResNet, Wide ResNet, and DenseNet, each trained with ReLU, Leaky ReLU, Parametric ReLU, Softplus, ELU, SELU, GELU, and Swish; all activations land in roughly the 93–95.6% range]

Ramachandran et al, “Searching for activation functions”, ICLR Workshop 2018

SLIDE 35

Activation Functions: Summary

  • Don’t think too hard. Just use ReLU
  • Try out Leaky ReLU / ELU / SELU / GELU if you need to squeeze that last 0.1%
  • Don’t use sigmoid or tanh
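For reference, the activations compared in this section can be written in a few lines of plain Python (scalar sketches; the GELU uses the sigmoid approximation from the GELU slide):

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.1):
    return max(alpha * x, x)

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def gelu(x):
    # approximation x * sigmoid(1.702 x) from the GELU slide
    return x / (1.0 + math.exp(-1.702 * x))
```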
SLIDE 36

Data Preprocessing

SLIDE 37

Data Preprocessing

(Assume X [NxD] is data matrix, each example in a row)
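Zero-centering and normalizing such an N x D matrix can be sketched in plain Python (no NumPy, so it stays self-contained; the random data is only for illustration):

```python
import random

random.seed(0)
N, D = 100, 3
# fake data with nonzero mean and non-unit scale
X = [[random.gauss(5.0, 2.0) for _ in range(D)] for _ in range(N)]

# zero-center: subtract the per-feature (per-column) mean
mean = [sum(row[d] for row in X) / N for d in range(D)]
Xc = [[row[d] - mean[d] for d in range(D)] for row in X]

# normalize: divide each column by its standard deviation
std = [(sum(row[d] ** 2 for row in Xc) / N) ** 0.5 for d in range(D)]
Xn = [[row[d] / std[d] for d in range(D)] for row in Xc]
```

After these two steps, every column of Xn has mean 0 and standard deviation 1.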

SLIDE 38

Remember: consider what happens when the input to a neuron is always positive...

What can we say about the gradients on w? Always all positive or all negative :( (This is also why you want zero-mean data!)

[Figure: hypothetical optimal w vector; the allowed gradient update directions cover only two quadrants]

h_j^(ℓ) = Σ_k w_{j,k}^(ℓ) σ(h_k^(ℓ-1)) + b_j^(ℓ)

SLIDE 39

Data Preprocessing

(Assume X [NxD] is data matrix, each example in a row)

SLIDE 40

Data Preprocessing

In practice, you may also see PCA (data has diagonal covariance matrix) and Whitening (covariance matrix is the identity matrix) of the data

SLIDE 41

Data Preprocessing

Before normalization: classification loss is very sensitive to changes in the weight matrix; hard to optimize

After normalization: less sensitive to small changes in weights; easier to optimize
SLIDE 42

Data Preprocessing for Images

e.g. consider CIFAR-10 with [32,32,3] images:

  • Subtract the mean image (e.g. AlexNet) (mean image = [32,32,3] array)
  • Subtract per-channel mean (e.g. VGGNet) (mean along each channel = 3 numbers)
  • Subtract per-channel mean and divide by per-channel std (e.g. ResNet) (mean and std along each channel = 3 numbers each)

Not common to do PCA or whitening

SLIDE 43

Weight Initialization

SLIDE 44

Weight Initialization

Q: What happens if we initialize all W=0, b=0?

SLIDE 45

Weight Initialization

Q: What happens if we initialize all W=0, b=0?
A: All outputs are 0, and all gradients are the same! No “symmetry breaking”

SLIDE 46

Weight Initialization

Next idea: small random numbers (Gaussian with zero mean, std=0.01)

SLIDE 47

Weight Initialization

Next idea: small random numbers (Gaussian with zero mean, std=0.01). Works ~okay for small networks, but problems with deeper networks.

SLIDE 48

Weight Initialization: Activation Statistics

Forward pass for a 6-layer net with hidden size 4096

SLIDE 49

Weight Initialization: Activation Statistics

Forward pass for a 6-layer net with hidden size 4096: all activations tend to zero for deeper network layers. Q: What do the gradients dL/dW look like?

SLIDE 50

Weight Initialization: Activation Statistics

Forward pass for a 6-layer net with hidden size 4096: all activations tend to zero for deeper network layers. Q: What do the gradients dL/dW look like? A: All zero, no learning =(

SLIDE 51

Weight Initialization: Activation Statistics

Increase std of initial weights from 0.01 to 0.05

SLIDE 52

Weight Initialization: Activation Statistics

Increase std of initial weights from 0.01 to 0.05: all activations saturate. Q: What do the gradients look like?

SLIDE 53

Weight Initialization: Activation Statistics

Increase std of initial weights from 0.01 to 0.05: all activations saturate. Q: What do the gradients look like? A: Local gradients all zero, no learning =(

SLIDE 54

Weight Initialization: Xavier Initialization

“Xavier” initialization: std = 1/sqrt(Din)

Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTATS 2010

SLIDE 55

Weight Initialization: Xavier Initialization

“Xavier” initialization: std = 1/sqrt(Din). “Just right”: activations are nicely scaled for all layers!

Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTATS 2010

SLIDE 56

Weight Initialization: Xavier Initialization

“Xavier” initialization: std = 1/sqrt(Din). “Just right”: activations are nicely scaled for all layers! For conv layers, Din is kernel_size^2 * input_channels.

Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTATS 2010

SLIDE 57

Weight Initialization: Xavier Initialization

“Xavier” initialization: std = 1/sqrt(Din)

Derivation (variance of output = variance of input): y = Wx, so y_i = Σ_{j=1}^{Din} x_j w_j

Var(y_i) = Din * Var(x_i w_i)                              [Assume x, w are iid]
         = Din * (E[x_i^2] E[w_i^2] − E[x_i]^2 E[w_i]^2)   [Assume x, w independent]
         = Din * Var(x_i) * Var(w_i)                       [Assume x, w are zero-mean]

If Var(w_i) = 1/Din then Var(y_i) = Var(x_i)
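The derivation can be checked empirically with a small simulation (pure Python, so the sizes are scaled down from the slide's 6-layer / 4096-unit experiment; layer_stats is my name): with std 0.01 the tanh activations collapse toward zero, while the Xavier std 1/sqrt(Din) keeps their scale roughly constant.

```python
import math
import random

random.seed(0)

def layer_stats(std_w, num_layers=6, dim=128, n=20):
    # forward a batch of n random inputs through num_layers tanh layers
    # with weights drawn from N(0, std_w^2); return the std of the
    # final layer's activations
    xs = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    for _ in range(num_layers):
        W = [[random.gauss(0.0, std_w) for _ in range(dim)] for _ in range(dim)]
        xs = [[math.tanh(sum(wij * xj for wij, xj in zip(row, x)))
               for row in W] for x in xs]
    flat = [v for x in xs for v in x]
    m = sum(flat) / len(flat)
    return (sum((v - m) ** 2 for v in flat) / len(flat)) ** 0.5

s_small = layer_stats(0.01)                 # activations collapse toward 0
s_xavier = layer_stats(1.0 / math.sqrt(128))  # scale roughly preserved
print(s_small, s_xavier)
```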


SLIDE 62

Weight Initialization: What about ReLU?

Change from tanh to ReLU

SLIDE 63

Weight Initialization: What about ReLU?

Change from tanh to ReLU: Xavier assumes a zero-centered activation function. Activations collapse to zero again, no learning =(

SLIDE 64

Weight Initialization: Kaiming / MSRA Initialization

ReLU correction: std = sqrt(2 / Din). “Just right”: activations nicely scaled for all layers.

He et al, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, ICCV 2015
SLIDE 65

Weight Initialization: Residual Networks

[Figure: residual block; x goes through conv, relu, conv to produce F(x), and the block outputs relu(F(x) + x)]

If we initialize with MSRA, then Var(F(x)) = Var(x). But then Var(F(x) + x) > Var(x): variance grows with each block!

SLIDE 66

Weight Initialization: Residual Networks

[Figure: residual block; x goes through conv, relu, conv to produce F(x), and the block outputs relu(F(x) + x)]

If we initialize with MSRA, then Var(F(x)) = Var(x). But then Var(F(x) + x) > Var(x): variance grows with each block! Solution: initialize the first conv with MSRA, initialize the second conv to zero. Then Var(x + F(x)) = Var(x)

Zhang et al, “Fixup Initialization: Residual Learning Without Normalization”, ICLR 2019

SLIDE 67

Proper initialization is an active area of research

  • Understanding the difficulty of training deep feedforward neural networks, Glorot and Bengio, 2010
  • Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, Saxe et al, 2013
  • Random walk initialization for training very deep feedforward networks, Sussillo and Abbott, 2014
  • Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, He et al, 2015
  • Data-dependent Initializations of Convolutional Neural Networks, Krähenbühl et al, 2015
  • All you need is a good init, Mishkin and Matas, 2015
  • Fixup Initialization: Residual Learning Without Normalization, Zhang et al, 2019
  • The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, Frankle and Carbin, 2019

SLIDE 68

Now your model is training … but it overfits!

Regularization

SLIDE 69

Regularization: Add term to the loss

In common use:
  • L2 regularization (weight decay)
  • L1 regularization
  • Elastic net (L1 + L2)
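As a sketch, the L2 term just adds λ·Σw² to the data loss (the function name, λ, and the flattened weight list are placeholders of mine, not from the slides):

```python
def l2_regularized_loss(data_loss, weights, lam):
    # total loss = data loss + lambda * sum of squared weights (weight decay)
    return data_loss + lam * sum(w * w for w in weights)

print(l2_regularized_loss(1.0, [1.0, -2.0], 0.1))
```

Swapping `w * w` for `abs(w)` gives L1 regularization, and a weighted sum of both terms gives elastic net.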

SLIDE 70

Regularization: Dropout

In each forward pass, randomly set some neurons to zero. The probability of dropping is a hyperparameter; 0.5 is common.

Srivastava et al, “Dropout: A simple way to prevent neural networks from overfitting”, JMLR 2014

SLIDE 71

Regularization: Dropout

Example forward pass with a 3-layer network using dropout
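The slide's code isn't reproduced in this transcript, so here is a hedged sketch of what such a forward pass looks like in plain Python (affine_relu, dropout, forward_train, and the layer shapes are mine; p = 0.5 is the keep probability):

```python
import random

random.seed(0)
p = 0.5  # probability of KEEPING a unit (drop probability is 1 - p)

def affine_relu(x, W, b):
    return [max(0.0, sum(wij * xj for wij, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def dropout(h):
    # sample a binary mask and zero out the dropped activations
    mask = [1.0 if random.random() < p else 0.0 for _ in h]
    return [hi * mi for hi, mi in zip(h, mask)]

def forward_train(x, params):
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = dropout(affine_relu(x, W1, b1))
    h2 = dropout(affine_relu(h1, W2, b2))
    # no ReLU or dropout on the output scores
    return [sum(wij * hj for wij, hj in zip(row, h2)) + bi
            for row, bi in zip(W3, b3)]
```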

SLIDE 72

Regularization: Dropout

Forces the network to have a redundant representation; prevents co-adaptation of features

[Figure: a cat score computed from features such as “has an ear”, “has a tail”, “is furry”, “has claws”, “mischievous look”; dropout crosses out a random subset of these features]

SLIDE 73

Regularization: Dropout

Another interpretation: dropout is training a large ensemble of models (that share parameters). Each binary mask is one model. An FC layer with 4096 units has 2^4096 ~ 10^1233 possible masks! There are only ~10^82 atoms in the universe...

SLIDE 74

Dropout: Test Time

Dropout makes our output random! y = f_W(x, z), where x is the input (image), y is the output (label), and z is a random mask.

Want to “average out” the randomness at test time:

y = f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz

But this integral seems hard …

SLIDE 75

Dropout: Test Time

Want to approximate the integral y = f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz

Consider a single neuron a = w1·x + w2·y, with inputs x, y and weights w1, w2.

At test time we have: E[a] = w1·x + w2·y

SLIDE 76

Dropout: Test Time

Want to approximate the integral y = f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz

Consider a single neuron a = w1·x + w2·y, with inputs x, y and weights w1, w2.

At test time we have: E[a] = w1·x + w2·y

During training (each input dropped with probability 1/2) we have:

E[a] = ¼(w1·x + w2·y) + ¼(w1·x + 0·y) + ¼(0·x + 0·y) + ¼(0·x + w2·y)
     = ½(w1·x + w2·y)

SLIDE 77

Dropout: Test Time

Want to approximate the integral y = f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz

Consider a single neuron a = w1·x + w2·y, with inputs x, y and weights w1, w2.

At test time we have: E[a] = w1·x + w2·y

During training (each input dropped with probability 1/2) we have:

E[a] = ¼(w1·x + w2·y) + ¼(w1·x + 0·y) + ¼(0·x + 0·y) + ¼(0·x + w2·y)
     = ½(w1·x + w2·y)

At test time, drop nothing and multiply by the dropout probability

SLIDE 78

Dropout: Test Time

At test time all neurons are always active => we must scale the activations so that, for each neuron, output at test time = expected output at training time
SLIDE 79

Dropout Summary

Drop in the forward pass; scale at test time

SLIDE 80

More common: “Inverted dropout”

Drop and scale during training; test time is unchanged!
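A hedged sketch of inverted dropout (function names mine): rescaling by 1/p during training makes the expected activation match the undropped forward pass, so test time needs no change.

```python
import random

random.seed(0)
p = 0.5  # probability of keeping a unit

def dropout_train(h):
    # drop AND rescale by 1/p, so E[output] equals the undropped activation
    mask = [(1.0 / p) if random.random() < p else 0.0 for _ in h]
    return [hi * mi for hi, mi in zip(h, mask)]

def dropout_test(h):
    return h  # test time is unchanged
```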

SLIDE 81

Dropout architectures

[Figure: per-layer parameter counts for AlexNet vs VGG-16 across conv1–conv5 and fc6–fc8; the fully-connected layers dominate, and that is where Dropout goes]

Recall AlexNet and VGG have most of their parameters in fully-connected layers; usually Dropout is applied there

SLIDE 82

Dropout architectures

[Figure: same parameter-count chart as Slide 81]

Recall AlexNet and VGG have most of their parameters in fully-connected layers; usually Dropout is applied there

Later architectures (GoogLeNet, ResNet, etc) use global average pooling instead of fully-connected layers: they don’t use dropout at all!

SLIDE 83

Regularization: A common pattern

Training: add some kind of randomness: y = f_W(x, z)
Testing: average out the randomness (sometimes approximate): y = f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz

SLIDE 84

Regularization: A common pattern

Training: add some kind of randomness: y = f_W(x, z)
Testing: average out the randomness (sometimes approximate): y = f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz

Example: Batch Normalization
Training: normalize using stats from random minibatches
Testing: use fixed stats to normalize

SLIDE 85

Regularization: A common pattern

Training: add some kind of randomness: y = f_W(x, z)
Testing: average out the randomness (sometimes approximate): y = f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz

Example: Batch Normalization
Training: normalize using stats from random minibatches
Testing: use fixed stats to normalize

For ResNet and later, often L2 and Batch Normalization are the only regularizers!

SLIDE 86

Data Augmentation

Load image and label “cat” → CNN → compute loss

This image by Nikita is licensed under CC-BY 2.0

SLIDE 87

Data Augmentation

Load image and label “cat” → transform image → CNN → compute loss

SLIDE 88

Data Augmentation: Horizontal Flips

SLIDE 89

Data Augmentation: Random Crops and Scales

Training: sample random crops / scales

ResNet:
1. Pick random L in range [256, 480]
2. Resize training image, short side = L
3. Sample random 224 x 224 patch
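The three ResNet training steps can be sketched with sizes only (no pixel work; resnet_train_crop is my name, and the rounding details are assumptions):

```python
import random

random.seed(0)

def resnet_train_crop(img_w, img_h):
    L = random.randint(256, 480)        # 1. pick random L in [256, 480]
    scale = L / min(img_w, img_h)       # 2. resize so the short side equals L
    w, h = round(img_w * scale), round(img_h * scale)
    x = random.randint(0, w - 224)      # 3. sample a random 224 x 224 patch
    y = random.randint(0, h - 224)
    return (w, h), (x, y, 224, 224)
```

Because L is at least 256, the resized image is always large enough to contain a 224 x 224 patch.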

SLIDE 90

Data Augmentation: Random Crops and Scales

Training: sample random crops / scales

ResNet:
1. Pick random L in range [256, 480]
2. Resize training image, short side = L
3. Sample random 224 x 224 patch

Testing: average a fixed set of crops

ResNet:
1. Resize image at 5 scales: {224, 256, 384, 480, 640}
2. For each size, use 10 224 x 224 crops: 4 corners + center, + flips

SLIDE 91

Data Augmentation: Color Jitter

Simple: randomize contrast and brightness

More complex (used in AlexNet, ResNet, etc):
1. Apply PCA to all [R, G, B] pixels in the training set
2. Sample a “color offset” along the principal component directions
3. Add the offset to all pixels of a training image

SLIDE 92

Data Augmentation: Get creative for your problem!

Random mix/combinations of:
  • translation
  • rotation
  • stretching
  • shearing
  • lens distortions, … (go crazy)
SLIDE 93

Regularization: A common pattern

Training: add some randomness
Testing: marginalize over the randomness

Examples: Dropout, Batch Normalization, Data Augmentation

SLIDE 94

Regularization: DropConnect

Training: drop random connections between neurons (set weight = 0)
Testing: use all the connections

Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect

Wan et al, “Regularization of Neural Networks using DropConnect”, ICML 2013

SLIDE 95

Regularization: Fractional Pooling

Training: use randomized pooling regions
Testing: average predictions over different samples

Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling

Graham, “Fractional Max Pooling”, arXiv 2014

SLIDE 96

Regularization: Stochastic Depth

Training: skip some residual blocks in ResNet
Testing: use the whole network

Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth

Huang et al, “Deep Networks with Stochastic Depth”, ECCV 2016

SLIDE 97

Regularization: Cutout

Training: set random image regions to 0
Testing: use the whole image

Works very well for small datasets like CIFAR, less common for large datasets like ImageNet

Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth, Cutout

DeVries and Taylor, “Improved Regularization of Convolutional Neural Networks with Cutout”, arXiv 2017

SLIDE 98

Regularization: Mixup

Training: train on random blends of images
Testing: use original images

Randomly blend the pixels of pairs of training images, e.g. 40% cat + 60% dog → CNN → target label: cat: 0.4, dog: 0.6

Sample the blend weight from a beta distribution Beta(a, b) with a = b ≈ 0, so blend weights are close to 0/1

Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth, Cutout, Mixup

Zhang et al, “mixup: Beyond Empirical Risk Minimization”, ICLR 2018
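A minimal sketch of the blending step (mixup here is my own function; labels are dicts of class → probability, and Beta(0.2, 0.2) stands in for the near-zero a = b on the slide):

```python
import random

random.seed(0)

def mixup(x1, y1, x2, y2, a=0.2):
    # blend weight from Beta(a, a); small a pushes lam toward 0 or 1
    lam = random.betavariate(a, a)
    x = [lam * p1 + (1.0 - lam) * p2 for p1, p2 in zip(x1, x2)]
    y = {k: lam * y1.get(k, 0.0) + (1.0 - lam) * y2.get(k, 0.0)
         for k in set(y1) | set(y2)}
    return x, y
```

The blended target keeps total probability 1, matching the "cat: 0.4, dog: 0.6" example above.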


SLIDE 100

Regularization: Mixup

Training: train on random blends of images
Testing: use original images

  • Consider dropout for large fully-connected layers
  • Batch normalization and data augmentation are almost always a good idea
  • Try cutout and mixup, especially for small classification datasets

Zhang et al, “mixup: Beyond Empirical Risk Minimization”, ICLR 2018

SLIDE 101

Summary

1. One time setup: activation functions, data preprocessing, weight initialization, regularization (today)
2. Training dynamics: learning rate schedules; large-batch training; hyperparameter optimization (next time)
3. After training: model ensembles, transfer learning (next time)

SLIDE 102

Next time: Training Neural Networks (part 2)