Lecture 5: Training Neural Networks, Part I
Fei-Fei Li & Andrej Karpathy & Justin Johnson
20 Jan 2016



Administrative

A1 is due today (midnight). I'm holding makeup office hours today: 5pm @ Gates 259. A2 will be released ~tomorrow. It's meaty, but educational! Also:

  • We are shuffling the course schedule around a bit
  • The grading scheme is subject to a few % changes

Things you should know for your Project Proposal

“ConvNets need a lot of data to train”

Fine-tuning! We rarely ever train ConvNets from scratch.


1. Train on ImageNet data
2. Finetune the network on your own data


Transfer Learning with CNNs

  • 1. Train on ImageNet
  • 2. If you have a small dataset: fix all the weights (treat the CNN as a fixed feature extractor) and retrain only the classifier, i.e. swap the Softmax layer at the end
  • 3. If you have a medium-sized dataset, “finetune” instead: use the old weights as initialization, and retrain a bigger portion of the network (the higher layers), or even all of it

(A minimal sketch of the fixed-feature-extractor case follows below.)
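A minimal numpy sketch of case 2, the ConvNet as a fixed feature extractor. The "pretrained" weights and the dataset here are random placeholders, assumed only for illustration; in practice the features would come from a model actually trained on ImageNet, e.g. one from the Caffe Model Zoo mentioned below.

```python
import numpy as np

rng = np.random.default_rng(0)
W_frozen = 0.01 * rng.standard_normal((3072, 512))   # placeholder for pretrained (frozen) weights

def extract_features(X):
    # frozen forward pass: these weights are never updated
    return np.maximum(0, X.dot(W_frozen))

X_mine = rng.standard_normal((100, 3072))            # placeholder for your own (small) dataset
feats = extract_features(X_mine)                      # fixed features, shape [100, 512]

# Now train only a new linear classifier (the swapped-in Softmax layer) on `feats`.
# For a medium-sized dataset you would instead also update some or all of the
# pretrained weights, using their old values as the initialization ("finetuning").
```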


E.g. Caffe Model Zoo: Lots of pretrained ConvNets

https://github.com/BVLC/caffe/wiki/Model-Zoo

...


Things you should know for your Project Proposal

“We have infinite compute available because Terminal.”


You have finite compute. Don’t be overly ambitious.


Mini-batch SGD

Loop (a minimal runnable sketch follows the list):

  • 1. Sample a batch of data
  • 2. Forward prop it through the graph, get loss
  • 3. Backprop to calculate the gradients
  • 4. Update the parameters using the gradient
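As a concrete, if toy, instance of this loop: a self-contained sketch running the four steps with vanilla SGD on a linear softmax classifier. The random data, batch size, and learning rate are placeholders, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.standard_normal((1000, 50)), rng.integers(0, 10, size=1000)
W = 0.01 * rng.standard_normal((50, 10))
learning_rate = 1e-2

for step in range(200):
    idx = rng.choice(len(X), size=64, replace=False)           # 1. sample a batch of data
    xb, yb = X[idx], y[idx]
    scores = xb.dot(W)                                          # 2. forward prop it through the graph
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(len(yb)), yb]).mean()        #    ... get the loss
    dscores = (probs - np.eye(10)[yb]) / len(yb)                # 3. backprop to calculate the gradient
    dW = xb.T.dot(dscores)
    W -= learning_rate * dW                                     # 4. update the parameters
```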

Where we are now...


(Animation of different parameter update behaviors; image credits to Alec Radford.)


(Diagram: the computational graph of a Neural Turing Machine, from the input tape to the loss.)


(Diagram: a gate f in the graph, with its input activations and output; during backprop, the upstream gradient is chained with the gate's “local gradient”.)


Implementation: forward/backward API

Graph (or Net) object. (Rough pseudocode)
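The slide's own pseudocode is not reproduced here, but a minimal runnable sketch in the same spirit might look as follows; the class and method names are illustrative, not the lecture's exact code.

```python
class MultiplyGate(object):
    """One gate/node in the graph: z = x * y (x, y, z scalars)."""
    def forward(self, x, y):
        self.x, self.y = x, y          # cache the inputs for the backward pass
        return x * y

    def backward(self, dz):
        dx = self.y * dz               # local gradient dz/dx = y, chained with the upstream dz
        dy = self.x * dz               # local gradient dz/dy = x
        return dx, dy

# A Graph/Net object would simply call forward() on every gate in topological order,
# then backward() on every gate in reverse order, passing gradients along the edges.
gate = MultiplyGate()
z = gate.forward(3.0, -4.0)            # forward pass: z = -12.0
dx, dy = gate.backward(1.0)            # backward pass with upstream gradient 1.0
print(z, dx, dy)                       # -12.0 -4.0 3.0
```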


Implementation: forward/backward API

Example: a single “*” gate computing z = x * y (x, y, z are scalars).


Example: Torch layers (each layer/module implements its own forward and backward).


Neural Network: without the brain stuff

(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x), or 3-layer Neural Network: f = W3 max(0, W2 max(0, W1 x))


Neural Networks: Architectures

“Fully-connected” layers. A “2-layer Neural Net” is also called a “1-hidden-layer Neural Net”; a “3-layer Neural Net” is also called a “2-hidden-layer Neural Net”.


Training Neural Networks

A bit of history...


A bit of history

Frank Rosenblatt, ~1957: Perceptron

The Mark I Perceptron machine was the first implementation of the perceptron algorithm. The machine was connected to a camera that used 20×20 cadmium sulfide photocells to produce a 400-pixel image. It recognized letters of the alphabet.

update rule: w_i(t+1) = w_i(t) + α (d_j − y_j(t)) x_{j,i}


A bit of history

Widrow and Hoff, ~1960: Adaline/Madaline


A bit of history

Rumelhart et al. 1986: First time back-propagation became popular

recognizable maths


A bit of history

[Hinton and Salakhutdinov 2006]

Reinvigorated research in Deep Learning


First strong results

  • Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition, by George Dahl, Dong Yu, Li Deng, Alex Acero, 2010
  • ImageNet Classification with Deep Convolutional Neural Networks, by Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, 2012


Overview

  • 1. One time setup

activation functions, preprocessing, weight initialization, regularization, gradient checking

  • 2. Training dynamics

babysitting the learning process, parameter updates, hyperparameter optimization

  • 3. Evaluation

model ensembles


Activation Functions


  • Sigmoid: σ(x) = 1 / (1 + e^(−x))
  • tanh: tanh(x)
  • ReLU: max(0, x)
  • Leaky ReLU: max(0.1x, x)
  • Maxout
  • ELU
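For reference, a quick numpy sketch of these functions, assuming their standard definitions (the 0.1 slope for Leaky ReLU and alpha = 1.0 for ELU are just the conventional choices, not anything prescribed by the slide):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x):
    return np.maximum(0.1 * x, x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def maxout(x, w1, b1, w2, b2):
    # max over two (or more) linear functions of the input
    return np.maximum(x.dot(w1) + b1, x.dot(w2) + b2)
```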


Activation Functions

Sigmoid

  • Squashes numbers to range [0,1]
  • Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron


Activation Functions

Sigmoid

  • Squashes numbers to range [0,1]
  • Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

3 problems:
1. Saturated neurons “kill” the gradients


sigmoid gate: what happens to the gradient when x = -10? When x = 0? When x = 10?
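One way to see the answer is to evaluate the sigmoid's local gradient numerically; this small check just assumes the standard sigmoid definition.

```python
import numpy as np

# The sigmoid's local gradient is dsigma/dx = sigma(x) * (1 - sigma(x)),
# which vanishes at both saturated ends.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in (-10.0, 0.0, 10.0):
    s = sigmoid(x)
    print(x, s * (1 - s))    # ~4.5e-05 at x = -10, 0.25 at x = 0, ~4.5e-05 at x = 10
```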


Activation Functions

Sigmoid

  • Squashes numbers to range [0,1]
  • Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

3 problems:
1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered


Consider what happens when the input to a neuron (x) is always positive: What can we say about the gradients on w?


Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? They are always all positive or all negative :( (this is also why you want zero-mean data!)

(Diagram: the hypothetical optimal w vector can only be reached by a zig-zag path, because the allowed gradient update directions lie in only two quadrants.)


Activation Functions

Sigmoid

  • Squashes numbers to range [0,1]
  • Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

3 problems:
1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered
3. exp() is a bit compute expensive


Activation Functions

tanh(x)

  • Squashes numbers to range [-1,1]
  • zero centered (nice)
  • still kills gradients when saturated :(

[LeCun et al., 1991]


Activation Functions

ReLU (Rectified Linear Unit)

  • Computes f(x) = max(0,x)
  • Does not saturate (in + region)
  • Very computationally efficient
  • Converges much faster than sigmoid/tanh in practice (e.g. 6x)

[Krizhevsky et al., 2012]


Activation Functions

ReLU (Rectified Linear Unit)

  • Computes f(x) = max(0,x)
  • Does not saturate (in + region)
  • Very computationally efficient
  • Converges much faster than sigmoid/tanh in practice (e.g. 6x)

Problems:
  • Not zero-centered output
  • An annoyance (hint: what is the gradient when x < 0?)


ReLU gate: what happens to the gradient when x = -10? When x = 0? When x = 10?


(Diagram: an “active ReLU” overlaps the data cloud, while a “dead ReLU” lies outside it, will never activate, and therefore never updates.) => people like to initialize ReLU neurons with slightly positive biases (e.g. 0.01)


Activation Functions

Leaky ReLU

  • Does not saturate
  • Computationally efficient
  • Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
  • Will not “die”

[Maas et al., 2013] [He et al., 2015]


Activation Functions

Leaky ReLU

  • Does not saturate
  • Computationally efficient
  • Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
  • Will not “die”

Parametric Rectifier (PReLU): f(x) = max(αx, x), backprop into α (a learned parameter)
[Maas et al., 2013] [He et al., 2015]


Activation Functions

Exponential Linear Units (ELU)

  • All benefits of ReLU
  • Does not die
  • Closer to zero mean outputs
  • Computation requires exp()

[Clevert et al., 2015]


Maxout “Neuron”

  • Does not have the basic form of dot product -> nonlinearity; computes max(w1·x + b1, w2·x + b2)
  • Generalizes ReLU and Leaky ReLU
  • Linear regime! Does not saturate! Does not die!

Problem: doubles the number of parameters/neuron :(

[Goodfellow et al., 2013]


TLDR: In practice:

  • Use ReLU. Be careful with your learning rates
  • Try out Leaky ReLU / Maxout / ELU
  • Try out tanh but don’t expect much
  • Don’t use sigmoid

Data Preprocessing


Step 1: Preprocess the data

(Assume X [NxD] is data matrix, each example in a row)


Step 1: Preprocess the data

In practice, you may also see PCA (so the data has a diagonal covariance matrix) and Whitening (so the covariance matrix is the identity matrix) of the data.
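A short numpy sketch of these steps on placeholder data; the epsilon and the SVD-based recipe are conventional choices for this kind of preprocessing, not something prescribed by the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))          # placeholder data matrix, [N x D]

X -= X.mean(axis=0)                        # zero-center the data
X_normalized = X / X.std(axis=0)           # (optionally) normalize each dimension

cov = X.T.dot(X) / X.shape[0]              # data covariance matrix
U, S, _ = np.linalg.svd(cov)
X_decorrelated = X.dot(U)                  # PCA: diagonal covariance
X_whitened = X_decorrelated / np.sqrt(S + 1e-5)   # whitening: covariance ~ identity
```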


TLDR: In practice for images: center only. E.g. consider CIFAR-10, with [32,32,3] images:

  • Subtract the mean image (e.g. AlexNet) (mean image = [32,32,3] array)
  • Subtract the per-channel mean (e.g. VGGNet) (mean along each channel = 3 numbers)

It is not common to normalize the variance, or to do PCA or whitening.
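A sketch of the two centering options on a fake CIFAR-10-sized batch; the data and batch size are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.random((5000, 32, 32, 3))              # placeholder for the real training set

mean_image = X_train.mean(axis=0)                    # [32, 32, 3] array (AlexNet-style)
X_centered = X_train - mean_image

per_channel_mean = X_train.mean(axis=(0, 1, 2))      # 3 numbers (VGGNet-style)
X_centered_v2 = X_train - per_channel_mean
```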


Weight Initialization


  • Q: what happens when W=0 init is used?

First idea: small random numbers (Gaussian with zero mean and 1e-2 standard deviation)
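In code this is a one-liner; D and H below are just example layer sizes.

```python
import numpy as np

D, H = 3072, 50
W = 0.01 * np.random.randn(D, H)   # zero-mean Gaussian, standard deviation 1e-2
```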


First idea: small random numbers (Gaussian with zero mean and 1e-2 standard deviation). Works ~okay for small networks, but can lead to non-homogeneous distributions of activations across the layers of a network.


Let's look at some activation statistics.

E.g. a 10-layer net with 500 neurons on each layer, using tanh non-linearities, and initialized as described on the last slide.
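A compact, runnable version of that experiment; the batch size and the printing format are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal((1000, 500))                  # a batch of random input data

for layer in range(10):
    W = 0.01 * rng.standard_normal((500, 500))        # the init from the previous slide
    h = np.tanh(h.dot(W))
    print(f"layer {layer + 1}: mean {h.mean():+.6f}, std {h.std():.6f}")
# The std shrinks towards zero layer by layer, i.e. all activations collapse to ~0.
```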



All activations become zero!

Q: think about the backward pass. What do the gradients look like?

Hint: think about backward pass for a W*X gate.


With *1.0 instead of *0.01: almost all neurons are completely saturated at either -1 or 1. The gradients will be all zero.


“Xavier initialization” [Glorot et al., 2010] Reasonable initialization. (Mathematical derivation assumes linear activations)
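As a one-line sketch of the idea (fan_in/fan_out sizes are placeholders): scale the weights by 1/sqrt(fan_in) so each unit's input variance is roughly preserved across layers.

```python
import numpy as np

fan_in, fan_out = 500, 500
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
```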


But when using the ReLU nonlinearity it breaks.


He et al., 2015 (note additional /2)
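The corresponding sketch with the extra factor of 2 in the variance (accounting for ReLU zeroing out half of its inputs); sizes are again placeholders.

```python
import numpy as np

fan_in, fan_out = 500, 500
W = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)
```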


Proper initialization is an active area of research…

  • Understanding the difficulty of training deep feedforward neural networks, by Glorot and Bengio, 2010
  • Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, by Saxe et al., 2013
  • Random walk initialization for training very deep feedforward networks, by Sussillo and Abbott, 2014
  • Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, by He et al., 2015
  • Data-dependent Initializations of Convolutional Neural Networks, by Krähenbühl et al., 2015
  • All you need is a good init, by Mishkin and Matas, 2015
  • …


Batch Normalization

“you want unit gaussian activations? just make them so.”

[Ioffe and Szegedy, 2015]

Consider a batch of activations at some layer. To make each dimension unit Gaussian, apply x̂(k) = (x(k) − E[x(k)]) / sqrt(Var[x(k)]). This is a vanilla differentiable function...


Batch Normalization

“you want unit gaussian activations? just make them so.”

[Ioffe and Szegedy, 2015]

Input: a batch of activations X, of shape N x D (N examples, D dimensions):

  • 1. Compute the empirical mean and variance independently for each dimension
  • 2. Normalize: x̂ = (x − E[x]) / sqrt(Var[x])

Batch Normalization

[Ioffe and Szegedy, 2015]

FC -> BN -> tanh -> FC -> BN -> tanh -> ...

Usually inserted after Fully Connected / (or Convolutional, as we’ll see soon) layers, and before nonlinearity.

Problem: do we necessarily want a unit gaussian input to a tanh layer?


Batch Normalization

[Ioffe and Szegedy, 2015]

Normalize: x̂ = (x − E[x]) / sqrt(Var[x])
And then allow the network to squash the range if it wants to: y = γ x̂ + β
Note, the network can learn γ = sqrt(Var[x]) and β = E[x] to recover the identity mapping.
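A minimal sketch of the training-time forward pass, assuming the formulas above; eps is the usual small constant for numerical stability, and the data here is a random placeholder.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                      # per-dimension empirical mean
    var = x.var(axis=0)                        # per-dimension empirical variance
    x_hat = (x - mean) / np.sqrt(var + eps)    # normalize to ~unit Gaussian
    return gamma * x_hat + beta                # let the network squash/shift the range

x = np.random.randn(64, 100) * 3.0 + 5.0
out = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(out.mean(), out.std())                   # ~0 and ~1
```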


Batch Normalization

[Ioffe and Szegedy, 2015]

  • Improves gradient flow through the network
  • Allows higher learning rates
  • Reduces the strong dependence on initialization
  • Acts as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe


Batch Normalization

[Ioffe and Szegedy, 2015] Note: at test time the BatchNorm layer functions differently: the mean/std are not computed from the batch. Instead, a single fixed empirical mean/std of activations, computed during training, is used (e.g. it can be estimated during training with running averages).


Babysitting the Learning Process


Step 1: Preprocess the data

(Assume X [NxD] is data matrix, each example in a row)


Step 2: Choose the architecture: say we start with one hidden layer of 50 neurons:

(Diagram: input layer -> hidden layer -> output layer. CIFAR-10 images are 3072 numbers in, there are 50 hidden neurons, and 10 output neurons, one per class.)


Double check that the loss is reasonable:

The model returns the loss and the gradient for all parameters. With regularization disabled, the loss is ~2.3, which is “correct” for 10 classes (a uniform softmax gives -ln(1/10) ≈ 2.3).
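The ~2.3 number itself is easy to verify:

```python
import numpy as np

# With tiny weights the class scores are roughly equal, so a 10-class softmax
# should give an initial loss close to -ln(1/10).
print(-np.log(1.0 / 10))   # 2.302585...
```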


Double check that the loss is reasonable:

Crank up the regularization: the loss went up, good. (sanity check)


Let's try to train now... Tip: make sure that you can overfit a very small portion of the training data. The code here:

  • takes the first 20 examples from CIFAR-10
  • turns off regularization (reg = 0.0)
  • uses simple vanilla ‘sgd’

Let's try to train now... Overfitting a very small portion of the training data: very small loss, train accuracy 1.00, nice!


Let's try to train now... I like to start with small regularization and find a learning rate that makes the loss go down.


Let's try to train now... I like to start with small regularization and find a learning rate that makes the loss go down.

The loss is barely changing.


Let's try to train now... I like to start with small regularization and find a learning rate that makes the loss go down. Loss not going down: learning rate too low.

The loss is barely changing: the learning rate is probably too low.


Let's try to train now... I like to start with small regularization and find a learning rate that makes the loss go down. Loss not going down: learning rate too low.

The loss is barely changing: the learning rate is probably too low. Notice that the train/val accuracy goes to 20% though; what's up with that? (Remember this is a softmax.)


Let's try to train now... I like to start with small regularization and find a learning rate that makes the loss go down. Loss not going down: learning rate too low.

Okay, now let's try learning rate 1e6. What could possibly go wrong?


cost: NaN almost always means a high learning rate... Loss not going down: learning rate too low. Loss exploding: learning rate too high.


Loss not going down: learning rate too low. Loss exploding: learning rate too high.

3e-3 is still too high. The cost explodes... => The rough range for the learning rate we should be cross-validating is somewhere in [1e-3 … 1e-5].


Hyperparameter Optimization


Cross-validation strategy

I like to do coarse -> fine cross-validation in stages

First stage: only a few epochs to get a rough idea of which params work. Second stage: longer running time, finer search. ... (repeat as necessary)

Tip for detecting explosions in the solver: if the cost is ever > 3 * the original cost, break out early.


For example: run a coarse search for 5 epochs. A few of the sampled settings already give nice validation accuracies. Note it's best to optimize in log space!
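A sketch of such a coarse search; `train_and_eval` is a hypothetical stand-in for a short (e.g. 5-epoch) training run, and the sampling ranges are just examples.

```python
import numpy as np

def train_and_eval(lr, reg):
    return np.random.rand()                    # placeholder "validation accuracy" so the sketch runs

results = []
for _ in range(100):
    lr = 10 ** np.random.uniform(-5, -3)       # log space: sample the exponent, not the raw value
    reg = 10 ** np.random.uniform(-4, 0)
    results.append((train_and_eval(lr, reg), lr, reg))

for val_acc, lr, reg in sorted(results, reverse=True)[:5]:
    print(f"val_acc {val_acc:.3f}  lr {lr:.2e}  reg {reg:.2e}")
```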


Now run the finer search...

Adjust the range. 53% accuracy: relatively good for a 2-layer neural net with 50 hidden neurons.


Now run the finer search...

Adjust the range. 53% accuracy: relatively good for a 2-layer neural net with 50 hidden neurons. But this best cross-validation result is worrying. Why?
slide-90
SLIDE 90

Lecture 5 - 20 Jan 2016

Fei-Fei Li & Andrej Karpathy & Justin Johnson Fei-Fei Li & Andrej Karpathy & Justin Johnson

Lecture 5 - 20 Jan 2016 90

Random Search vs. Grid Search

Random Search for Hyper-Parameter Optimization Bergstra and Bengio, 2012


Hyperparameters to play with:

  • network architecture
  • learning rate, its decay schedule, update type
  • regularization (L2/Dropout strength)

(Cartoon: the neural networks practitioner as a DJ turning hyperparameter knobs; the music = the loss function.)


My cross-validation “command center”


Monitor and visualize the loss curve


(Plot: loss vs. time.)


(Plot: loss vs. time, flat for a while and then dropping sharply.)

Bad initialization is a prime suspect.


lossfunctions.tumblr.com

Loss function specimen


Monitor and visualize the accuracy:

  • big gap = overfitting => increase regularization strength?
  • no gap => increase model capacity?


Track the ratio of weight updates / weight magnitudes:

The ratio between the update magnitudes and the value magnitudes: e.g. ~0.0002 / 0.02 = 0.01 (about okay). You want this to be somewhere around 0.001 or so.
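A sketch of how this ratio can be tracked for one parameter tensor; W, dW, and the learning rate here are fake stand-ins for a real layer's weights and gradient during training.

```python
import numpy as np

rng = np.random.default_rng(0)
W = 0.02 * rng.standard_normal((500, 10))
dW = 0.01 * rng.standard_normal((500, 10))
learning_rate = 1e-2

update = -learning_rate * dW                           # the actual update applied this step
ratio = np.linalg.norm(update) / np.linalg.norm(W)
print(ratio)                                            # want this around ~1e-3 or so
```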


Summary

We looked in detail at:

  • Activation Functions (use ReLU)
  • Data Preprocessing (images: subtract mean)
  • Weight Initialization (use Xavier init)
  • Batch Normalization (use)
  • Babysitting the Learning process
  • Hyperparameter Optimization

(randomly sample hyperparams, in log space when appropriate)

TLDRs


TODO

Look at:

  • Parameter update schemes
  • Learning rate schedules
  • Gradient Checking
  • Regularization (Dropout etc)
  • Evaluation (Ensembles etc)