Lecture 6 - Training Neural Networks
Fei-Fei Li & Andrej Karpathy - 21 Jan 2015

SLIDE 1

Administrative

  • A2 is out. It was released 2 days late, so the due date will be shifted by ~2 days.
  • We updated the project page with many pointers to datasets.

SLIDE 2

SLIDE 3

SLIDE 4

SLIDE 5

Backpropagation

(recursive chain rule)

SLIDE 6

Mini-batch Gradient Descent

Loop (a minimal sketch follows after this list):

  1. Sample a batch of data
  2. Backprop to calculate the analytic gradient
  3. Perform a parameter update
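A minimal NumPy sketch of one iteration of this loop. The loss_and_grads helper and the params/grads dictionaries are illustrative assumptions, not the assignment's actual API:

    import numpy as np

    def sgd_iteration(X_train, y_train, params, loss_and_grads,
                      batch_size=256, learning_rate=1e-3):
        # 1. sample a batch of data
        idx = np.random.choice(X_train.shape[0], batch_size, replace=False)
        X_batch, y_batch = X_train[idx], y_train[idx]
        # 2. backprop to calculate the analytic gradient
        loss, grads = loss_and_grads(X_batch, y_batch, params)
        # 3. perform a (vanilla SGD) parameter update
        for name in params:
            params[name] -= learning_rate * grads[name]
        return loss
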
SLIDE 7

A bit of history

Widrow and Hoff, ~1960: Adaline

SLIDE 8

A bit of history

Rumelhart et al., 1986: the first time back-propagation became popular (with recognizably modern maths)

SLIDE 9

A bit of history

[Hinton and Salakhutdinov, 2006]

Reinvigorated research in Deep Learning

SLIDE 10

Training Neural Networks

SLIDE 11

Step 1: Preprocess the data

(Assume X [NxD] is data matrix, each example in a row)
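A common zero-centering / normalization sketch in NumPy; the exact operations shown in the slide's figure are an assumption here:

    import numpy as np

    X = np.random.randn(100, 3072)      # stand-in for the [N x D] data matrix

    X -= np.mean(X, axis=0)             # zero-center: subtract the per-dimension mean
    X /= np.std(X, axis=0)              # normalize: divide by the per-dimension std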

SLIDE 12

Step 1: Preprocess the data

In practice, you may also see PCA and whitening of the data:

  • PCA: decorrelate the data (the covariance matrix becomes diagonal)
  • Whitening: additionally scale each dimension (the covariance matrix becomes the identity)
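A sketch of PCA and whitening in NumPy, as described in the course notes (a reconstruction, since the slide's figure is not in this transcript):

    import numpy as np

    X = np.random.randn(500, 100)                # stand-in for an [N x D] data matrix
    X -= np.mean(X, axis=0)                      # data must be zero-centered first

    cov = X.T.dot(X) / X.shape[0]                # covariance matrix
    U, S, V = np.linalg.svd(cov)                 # its eigenbasis

    Xrot = X.dot(U)                              # PCA: decorrelated data (diagonal covariance)
    Xwhite = Xrot / np.sqrt(S + 1e-5)            # whitening: ~identity covariance (1e-5 for stability)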

SLIDE 13

Step 2: Choose the architecture. Say we start with one hidden layer of 50 neurons:

input layer -> hidden layer -> output layer

CIFAR-10 images: 3072 input numbers; 50 hidden neurons; 10 output neurons, one per class.

SLIDE 14

Before we try training, let’s initialize well:

  • set the weights to small random numbers
  • set the biases to zero

(weight matrices of small numbers drawn randomly from a Gaussian)

Warning: this is not optimal, but it is the simplest option! (More on this later.)
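A minimal sketch of this initialization for the 3072-50-10 network above (the variable names are assumptions):

    import numpy as np

    D, H, C = 3072, 50, 10                   # input dim, hidden neurons, output classes

    W1 = 0.01 * np.random.randn(D, H)        # small random numbers from a Gaussian
    b1 = np.zeros(H)                         # biases set to zero
    W2 = 0.01 * np.random.randn(H, C)
    b2 = np.zeros(C)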

SLIDE 15

Double check that the loss is reasonable:

(the loss function returns the loss and the gradient for all parameters)

Disable regularization first: the loss comes out at ~2.3, the “correct” value for 10 classes.
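Why ~2.3 is the “correct” value: with small random weights the softmax spreads probability roughly uniformly over the 10 classes, so the expected loss is -log(1/10). A one-line check:

    import numpy as np
    print(-np.log(1.0 / 10))     # ~2.302, the expected initial softmax loss for 10 classes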

SLIDE 16

Double check that the loss is reasonable:

Now crank up the regularization: the loss went up, good. (sanity check)

SLIDE 17

Let’s try to train now… Tip: make sure that you can overfit a very small portion of the data.

The code on the slide:

  • takes the first 20 examples from CIFAR-10
  • turns off regularization (reg = 0.0)
  • uses simple vanilla ‘sgd’

Details:

  • learning_rate_decay = 1 means no decay; the learning rate will stay constant
  • sample_batches = False means we’re doing full gradient descent, not mini-batch SGD
  • we’ll perform 200 updates (epochs = 200)

“epoch”: the number of times we see the training set

SLIDE 18

Let’s try to train now… Tip: make sure that you can overfit a very small portion of the data. Result: very small loss and train accuracy 1.00, nice!

SLIDE 19

Let’s try to train now… I like to start with small regularization and find a learning rate that makes the loss go down.

  • loss not going down: learning rate too low
  • loss exploding: learning rate too high

SLIDE 20

Let’s try to train now… I like to start with small regularization and find a learning rate that makes the loss go down.

  • loss not going down: learning rate too low
  • loss exploding: learning rate too high

Loss barely changing: the learning rate is probably too low (could also be that reg is too high). Notice the train/val accuracy still goes to 20%, though. What’s up with that? (Remember this is a softmax: the probabilities stay diffuse so the loss barely moves, but the correct class scores can still become the largest, which is enough for the accuracy to improve.)

SLIDE 21

Let’s try to train now… I like to start with small regularization and find a learning rate that makes the loss go down.

  • loss not going down: learning rate too low
  • loss exploding: learning rate too high

Okay, now let’s try a learning rate of 1e6. What could possibly go wrong?

SLIDE 22

Let’s try to train now… I like to start with small regularization and find a learning rate that makes the loss go down.

  • loss not going down: learning rate too low
  • loss exploding: learning rate too high

cost: NaN almost always means the learning rate is too high...

SLIDE 23

Let’s try to train now… I like to start with small regularization and find a learning rate that makes the loss go down.

  • loss not going down: learning rate too low
  • loss exploding: learning rate too high

3e-3 is still too high: the cost explodes… => A rough range for the learning rate we should be cross-validating is somewhere in [1e-3 … 1e-5].

SLIDE 24

Cross-validation strategy

I like to do coarse -> fine cross-validation in stages:

  • First stage: only a few epochs, to get a rough idea of which params work
  • Second stage: longer running time, finer search
  • … (repeat as necessary)

Tip for detecting explosions in the solver: if the cost is ever > 3 * the original cost, break out early.
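A tiny sketch of that early-abort check; run_one_epoch is a hypothetical stand-in for one epoch of training:

    def run_one_epoch(epoch):
        # stand-in: pretend the cost explodes over time
        return 2.3 * (1.5 ** epoch)

    initial_cost = None
    for epoch in range(200):
        cost = run_one_epoch(epoch)
        if initial_cost is None:
            initial_cost = cost
        if cost > 3 * initial_cost:      # solver is exploding: abandon this setting early
            print('cost exploded at epoch %d, breaking out early' % epoch)
            break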

SLIDE 25

For example: run a coarse search for 5 epochs.

Note: it’s best to optimize the hyperparameters in log space.
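A sketch of sampling hyperparameters in log space for the coarse stage; the exact ranges below are assumptions:

    import numpy as np

    for trial in range(100):
        lr  = 10 ** np.random.uniform(-5, -3)    # learning rate sampled in log space
        reg = 10 ** np.random.uniform(-3, 3)     # regularization strength sampled in log space
        # ... train for ~5 epochs with (lr, reg) and record the validation accuracy ...
        print('trial %3d: lr=%.2e, reg=%.2e' % (trial, lr, reg))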

SLIDE 26

Now run a finer search... (adjust the range)

Best result: ~53%, relatively good for a 2-layer neural net with 50 hidden neurons.

SLIDE 27

Now run a finer search... (adjust the range)

Best result: ~53%, relatively good for a 2-layer neural net with 50 hidden neurons. But this best cross-validation result is worrying. Why? (Likely because the best values sit right at the edge of the range we searched, which suggests the search range should be widened.)
SLIDE 28

Normally you can’t afford a huge computational budget for expensive cross-validations. You need to rely more on intuitions and visualizations… Visualizations to play with:

  • loss function
  • validation and training accuracy
  • min, max, std of the values and of the updates (and monitor their ratio)
  • first-layer visualization of the weights (if working with images)

SLIDE 29

Monitor and visualize the loss curve

  • If the curve looks too linear: the learning rate is low.
  • If it doesn’t decrease much: the learning rate might be too high.

SLIDE 30

Monitor and visualize the loss curve

  • If the curve looks too linear: the learning rate is low.
  • If it doesn’t decrease much: the learning rate might be too high.

The “width” of the curve is related to the batch size. This one looks too wide (noisy) => might want to increase the batch size.

SLIDE 31

Monitor and visualize the accuracy:

  • big gap between train and validation accuracy = overfitting => increase regularization strength
  • no gap => increase model capacity

SLIDE 32

Track the ratio of weight update magnitudes to weight magnitudes:

  • ratio of the update scale to the value scale: ~0.0002 / 0.02 = 0.01 (about okay)
  • you want this to be somewhere around 0.01 - 0.001 or so

(the slide prints the max / min / mean of the values and of the updates)
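A minimal sketch of that check; the weights and gradient here are random stand-ins:

    import numpy as np

    W  = 0.01 * np.random.randn(3072, 50)       # current weights
    dW = np.random.randn(*W.shape)              # stand-in for the gradient from backprop
    learning_rate = 1e-3
    update = -learning_rate * dW                # the vanilla SGD update

    param_scale  = np.linalg.norm(W.ravel())
    update_scale = np.linalg.norm(update.ravel())
    print(update_scale / param_scale)           # want this roughly in the 0.001 - 0.01 range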

SLIDE 33

Visualizing first-layer weights: Noisy weights => Regularization maybe not strong enough

SLIDE 34

(Regarding your Assignment #1)

=> Regularization not strong enough

SLIDE 35

So far:

  • We’ve seen the process for performing cross-validations
  • There are several things we can track to get intuitions about how to do it more efficiently

SLIDE 36

Hyperparameters to play with:

  • network architecture
  • learning rate, its decay schedule, update type
  • regularization (L2 / L1 / Max norm / Dropout)
  • loss function to use (e.g. SVM / Softmax)
  • initialization

(image on the slide: the neural networks practitioner at the controls; the music = the loss function)

SLIDE 37

Initialization

  • becomes more tricky and important in deeper networks. Usually approx. W ~ N(0, 0.01) works. If not:

Consider what happens to the output distribution of a neuron with a different number of inputs (low or high).

(histograms on the slide: 10 inputs vs. 100 inputs)

SLIDE 38

Initialization

  • becomes more tricky and important in deeper networks. Usually approx. W ~ N(0, 0.01) works. If not:

Normalize by the square root of the number of incoming connections (the fan-in) => this ensures roughly equal output variance for each neuron in the network. (A tricky, subtle, but very important topic; see the course notes for details.)
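A minimal sketch of the fan-in-scaled initialization described above (layer sizes reused from the earlier 3072-50-10 example):

    import numpy as np

    fan_in, fan_out = 3072, 50
    W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)   # scale by 1/sqrt(fan-in)
    b = np.zeros(fan_out)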

SLIDE 39

Regularization knobs

  • L2 regularization
  • L1 regularization
  • L1 + L2 can also be combined
  • Max norm constraint

L1 is “sparsity inducing” (many weights become almost exactly zero).

Max norm: enforce a maximum L2 norm on the incoming weights of each neuron.
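A hedged sketch of what these knobs look like in code; the regularization strength and the max-norm constant c are assumed values:

    import numpy as np

    W = 0.01 * np.random.randn(3072, 50)           # incoming weights, one column per neuron
    reg = 1e-3
    c = 3.0                                         # max-norm constant (assumed)

    l2_penalty = 0.5 * reg * np.sum(W * W)          # L2: added to the data loss
    l1_penalty = reg * np.sum(np.abs(W))            # L1: pushes many weights to ~0 (sparsity)

    # max norm: after a parameter update, clip each neuron's incoming weight vector to norm <= c
    norms = np.linalg.norm(W, axis=0)               # one L2 norm per column (per neuron)
    W *= np.minimum(1.0, c / (norms + 1e-8))        # rescale only the columns that exceed c
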
SLIDE 40

Seemingly unrelated: Model Ensembles

  • One way to always improve final accuracy: take several trained models and average their predictions (a small sketch follows below).
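A small sketch of prediction averaging; the predict_proba interface on each model is a hypothetical assumption:

    import numpy as np

    def ensemble_predict(models, X):
        # average the class probabilities across the trained models, then take the argmax
        probs = np.mean([m.predict_proba(X) for m in models], axis=0)
        return np.argmax(probs, axis=1)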

SLIDE 41

Regularization: Dropout “randomly set some neurons to zero”

[Srivastava et al.]

SLIDE 42

Example forward pass with a 3-layer network using dropout:
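The code from the original slide is not included in this transcript; the CS231n course notes give essentially the following sketch of (plain) dropout in a 3-layer network, reproduced here as a reconstruction rather than the slide's verbatim code:

    import numpy as np

    p = 0.5  # probability of keeping a unit active; higher = less dropout

    def train_step(X, W1, b1, W2, b2, W3, b3):
        # forward pass with dropout after each hidden layer
        H1 = np.maximum(0, np.dot(W1, X) + b1)
        U1 = np.random.rand(*H1.shape) < p      # first dropout mask
        H1 *= U1                                # drop!
        H2 = np.maximum(0, np.dot(W2, H1) + b2)
        U2 = np.random.rand(*H2.shape) < p      # second dropout mask
        H2 *= U2                                # drop!
        out = np.dot(W3, H2) + b3
        return out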

SLIDE 43

At test time we don’t drop, but we have to be careful:

At test time all neurons are always active => we must scale the activations so that, for each neuron:

output at test time = expected output at training time

If the output of a neuron is x but the probability of keeping it is only p, then the output of the neuron in expectation is:

px + (1-p)·0 = px

(This has the interpretation of an ensemble of all sub-networks.)

SLIDE 44

More common: “Inverted dropout”

(scale by 1/p at training time instead, so the test-time forward pass is unchanged)
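The slide's code is likewise not in the transcript; a reconstruction of inverted dropout for a single hidden layer (function names are assumptions):

    import numpy as np

    p = 0.5  # probability of keeping a unit active

    def hidden_train(x, W, b):
        # inverted dropout: drop units and scale by 1/p already at training time
        h = np.maximum(0, np.dot(W, x) + b)
        mask = (np.random.rand(*h.shape) < p) / p
        return h * mask

    def hidden_test(x, W, b):
        # ...so the test-time forward pass is completely unchanged
        return np.maximum(0, np.dot(W, x) + b)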

SLIDE 45

Learning rates and updates:

  • SGD + Momentum > plain SGD
  • Momentum of 0.9 usually works well
  • Decrease the learning rate over time (people use 1/t, exp(-t), or step decays); simplest: learning_rate *= 0.97 every epoch or so (see the sketch below)
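A minimal sketch of the SGD+momentum update combined with the simple per-epoch decay mentioned above; the gradient here is a random stand-in:

    import numpy as np

    learning_rate = 1e-3
    mu = 0.9                                    # momentum coefficient, as suggested above

    W = 0.01 * np.random.randn(3072, 50)
    v = np.zeros_like(W)                        # momentum "velocity"

    for epoch in range(10):
        dW = np.random.randn(*W.shape)          # stand-in for the gradient from backprop
        v = mu * v - learning_rate * dW         # momentum update
        W += v
        learning_rate *= 0.97                   # simplest decay: shrink the rate each epoch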

SLIDE 46

Summary

  • Properly preprocess the data
  • Run cross-validations across many tips/tricks
  • Use visualizations to guide the ranges and cross-val
  • Ensemble multiple models and report test error
SLIDE 47

Next Lecture: Convolutional Neural Networks