Lecture 6 - Training Neural Networks
Fei-Fei Li & Andrej Karpathy - 21 Jan 2015

SLIDE 1

Administrative

  • A2 is out. It was released 2 days late, so the due date will be shifted by ~2 days.
  • We updated the project page with many pointers to datasets.

SLIDE 2

SLIDE 3

SLIDE 4

SLIDE 5

Backpropagation

(recursive chain rule)

SLIDE 6

Mini-batch Gradient Descent

Loop (a minimal sketch follows after this list):

  1. Sample a batch of data
  2. Backprop to calculate the analytic gradient
  3. Perform a parameter update
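A minimal NumPy sketch of one iteration of this loop. The loss_and_grads helper and the params/grads dictionaries are illustrative assumptions, not the assignment's actual API:

    import numpy as np

    def sgd_iteration(X_train, y_train, params, loss_and_grads,
                      batch_size=256, learning_rate=1e-3):
        # 1. sample a batch of data
        idx = np.random.choice(X_train.shape[0], batch_size, replace=False)
        X_batch, y_batch = X_train[idx], y_train[idx]
        # 2. backprop to calculate the analytic gradient
        loss, grads = loss_and_grads(X_batch, y_batch, params)
        # 3. perform a (vanilla SGD) parameter update
        for name in params:
            params[name] -= learning_rate * grads[name]
        return loss
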
SLIDE 7

A bit of history

Widrow and Hoff, ~1960: Adaline

SLIDE 8

A bit of history

Rumelhart et al., 1986: the first time back-propagation became popular (with recognizably modern maths)

SLIDE 9

A bit of history

[Hinton and Salakhutdinov, 2006]

Reinvigorated research in Deep Learning

SLIDE 10

Training Neural Networks

SLIDE 11

Step 1: Preprocess the data

(Assume X [NxD] is data matrix, each example in a row)
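A common zero-centering / normalization sketch in NumPy; the exact operations shown in the slide's figure are an assumption here:

    import numpy as np

    X = np.random.randn(100, 3072)      # stand-in for the [N x D] data matrix

    X -= np.mean(X, axis=0)             # zero-center: subtract the per-dimension mean
    X /= np.std(X, axis=0)              # normalize: divide by the per-dimension std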

SLIDE 12

Step 1: Preprocess the data

In practice, you may also see PCA and whitening of the data:

  • PCA: decorrelate the data (the covariance matrix becomes diagonal)
  • Whitening: additionally scale each dimension (the covariance matrix becomes the identity)
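A sketch of PCA and whitening in NumPy, as described in the course notes (a reconstruction, since the slide's figure is not in this transcript):

    import numpy as np

    X = np.random.randn(500, 100)                # stand-in for an [N x D] data matrix
    X -= np.mean(X, axis=0)                      # data must be zero-centered first

    cov = X.T.dot(X) / X.shape[0]                # covariance matrix
    U, S, V = np.linalg.svd(cov)                 # its eigenbasis

    Xrot = X.dot(U)                              # PCA: decorrelated data (diagonal covariance)
    Xwhite = Xrot / np.sqrt(S + 1e-5)            # whitening: ~identity covariance (1e-5 for stability)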

SLIDE 13

Step 2: Choose the architecture. Say we start with one hidden layer of 50 neurons:

input layer -> hidden layer -> output layer

CIFAR-10 images: 3072 input numbers; 50 hidden neurons; 10 output neurons, one per class.

SLIDE 14

Before we try training, let’s initialize well:

  • set the weights to small random numbers
  • set the biases to zero

(weight matrices of small numbers drawn randomly from a Gaussian)

Warning: this is not optimal, but it is the simplest option! (More on this later.)
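A minimal sketch of this initialization for the 3072-50-10 network above (the variable names are assumptions):

    import numpy as np

    D, H, C = 3072, 50, 10                   # input dim, hidden neurons, output classes

    W1 = 0.01 * np.random.randn(D, H)        # small random numbers from a Gaussian
    b1 = np.zeros(H)                         # biases set to zero
    W2 = 0.01 * np.random.randn(H, C)
    b2 = np.zeros(C)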

SLIDE 15

Double check that the loss is reasonable:

(the loss function returns the loss and the gradient for all parameters)

Disable regularization first: the loss comes out at ~2.3, the “correct” value for 10 classes.
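Why ~2.3 is the “correct” value: with small random weights the softmax spreads probability roughly uniformly over the 10 classes, so the expected loss is -log(1/10). A one-line check:

    import numpy as np
    print(-np.log(1.0 / 10))     # ~2.302, the expected initial softmax loss for 10 classes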

SLIDE 16

Double check that the loss is reasonable:

Now crank up the regularization: the loss went up, good. (sanity check)

SLIDE 17

Let’s try to train now… Tip: make sure that you can overfit a very small portion of the data.

The code on the slide:

  • takes the first 20 examples from CIFAR-10
  • turns off regularization (reg = 0.0)
  • uses simple vanilla ‘sgd’

Details:

  • learning_rate_decay = 1 means no decay; the learning rate will stay constant
  • sample_batches = False means we’re doing full gradient descent, not mini-batch SGD
  • we’ll perform 200 updates (epochs = 200)

“epoch”: the number of times we see the training set

SLIDE 18

Let’s try to train now… Tip: make sure that you can overfit a very small portion of the data. Result: very small loss and train accuracy 1.00, nice!

SLIDE 19

Let’s try to train now… I like to start with small regularization and find a learning rate that makes the loss go down.

  • loss not going down: learning rate too low
  • loss exploding: learning rate too high

SLIDE 20

Let’s try to train now… I like to start with small regularization and find a learning rate that makes the loss go down.

  • loss not going down: learning rate too low
  • loss exploding: learning rate too high

Loss barely changing: the learning rate is probably too low (could also be that reg is too high). Notice the train/val accuracy still goes to 20%, though. What’s up with that? (Remember this is a softmax: the probabilities stay diffuse so the loss barely moves, but the correct class scores can still become the largest, which is enough for the accuracy to improve.)

SLIDE 21

Let’s try to train now… I like to start with small regularization and find a learning rate that makes the loss go down.

  • loss not going down: learning rate too low
  • loss exploding: learning rate too high

Okay, now let’s try a learning rate of 1e6. What could possibly go wrong?

SLIDE 22

Let’s try to train now… I like to start with small regularization and find a learning rate that makes the loss go down.

  • loss not going down: learning rate too low
  • loss exploding: learning rate too high

cost: NaN almost always means the learning rate is too high...

SLIDE 23

Let’s try to train now… I like to start with small regularization and find a learning rate that makes the loss go down.

  • loss not going down: learning rate too low
  • loss exploding: learning rate too high

3e-3 is still too high: the cost explodes… => A rough range for the learning rate we should be cross-validating is somewhere in [1e-3 … 1e-5].

SLIDE 24

Cross-validation strategy

I like to do coarse -> fine cross-validation in stages:

  • First stage: only a few epochs, to get a rough idea of which params work
  • Second stage: longer running time, finer search
  • … (repeat as necessary)

Tip for detecting explosions in the solver: if the cost is ever > 3 * the original cost, break out early.
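A tiny sketch of that early-abort check; run_one_epoch is a hypothetical stand-in for one epoch of training:

    def run_one_epoch(epoch):
        # stand-in: pretend the cost explodes over time
        return 2.3 * (1.5 ** epoch)

    initial_cost = None
    for epoch in range(200):
        cost = run_one_epoch(epoch)
        if initial_cost is None:
            initial_cost = cost
        if cost > 3 * initial_cost:      # solver is exploding: abandon this setting early
            print('cost exploded at epoch %d, breaking out early' % epoch)
            break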

SLIDE 25

For example: run a coarse search for 5 epochs.

Note: it’s best to optimize the hyperparameters in log space.
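A sketch of sampling hyperparameters in log space for the coarse stage; the exact ranges below are assumptions:

    import numpy as np

    for trial in range(100):
        lr  = 10 ** np.random.uniform(-5, -3)    # learning rate sampled in log space
        reg = 10 ** np.random.uniform(-3, 3)     # regularization strength sampled in log space
        # ... train for ~5 epochs with (lr, reg) and record the validation accuracy ...
        print('trial %3d: lr=%.2e, reg=%.2e' % (trial, lr, reg))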

SLIDE 26

Now run a finer search... (adjust the range)

Best result: ~53%, relatively good for a 2-layer neural net with 50 hidden neurons.

SLIDE 27

Now run a finer search... (adjust the range)

Best result: ~53%, relatively good for a 2-layer neural net with 50 hidden neurons. But this best cross-validation result is worrying. Why? (Likely because the best values sit right at the edge of the range we searched, which suggests the search range should be widened.)
SLIDE 28

Normally you can’t afford a huge computational budget for expensive cross-validations. You need to rely more on intuitions and visualizations… Visualizations to play with:

  • loss function
  • validation and training accuracy
  • min, max, std of the values and of the updates (and monitor their ratio)
  • first-layer visualization of the weights (if working with images)

SLIDE 29

Monitor and visualize the loss curve

  • If the curve looks too linear: the learning rate is low.
  • If it doesn’t decrease much: the learning rate might be too high.

SLIDE 30

Monitor and visualize the loss curve

  • If the curve looks too linear: the learning rate is low.
  • If it doesn’t decrease much: the learning rate might be too high.

The “width” of the curve is related to the batch size. This one looks too wide (noisy) => might want to increase the batch size.

SLIDE 31

Monitor and visualize the accuracy:

  • big gap between train and validation accuracy = overfitting => increase regularization strength
  • no gap => increase model capacity

SLIDE 32

Track the ratio of weight update magnitudes to weight magnitudes:

  • ratio of the update scale to the value scale: ~0.0002 / 0.02 = 0.01 (about okay)
  • you want this to be somewhere around 0.01 - 0.001 or so

(the slide prints the max / min / mean of the values and of the updates)
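A minimal sketch of that check; the weights and gradient here are random stand-ins:

    import numpy as np

    W  = 0.01 * np.random.randn(3072, 50)       # current weights
    dW = np.random.randn(*W.shape)              # stand-in for the gradient from backprop
    learning_rate = 1e-3
    update = -learning_rate * dW                # the vanilla SGD update

    param_scale  = np.linalg.norm(W.ravel())
    update_scale = np.linalg.norm(update.ravel())
    print(update_scale / param_scale)           # want this roughly in the 0.001 - 0.01 range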

SLIDE 33

Visualizing first-layer weights: Noisy weights => Regularization maybe not strong enough

SLIDE 34

(Regarding your Assignment #1)

=> Regularization not strong enough

SLIDE 35

So far:

  • We’ve seen the process for performing cross-validations
  • There are several things we can track to get intuitions about how to do it more efficiently

SLIDE 36

Hyperparameters to play with:

  • network architecture
  • learning rate, its decay schedule, update type
  • regularization (L2 / L1 / Max norm / Dropout)
  • loss function to use (e.g. SVM / Softmax)
  • initialization

(image on the slide: the neural networks practitioner at the controls; the music = the loss function)

SLIDE 37

Initialization

  • becomes more tricky and important in deeper networks. Usually approx. W ~ N(0, 0.01) works. If not:

Consider what happens to the output distribution of a neuron with a different number of inputs (low or high).

(histograms on the slide: 10 inputs vs. 100 inputs)

SLIDE 38

Initialization

  • becomes more tricky and important in deeper networks. Usually approx. W ~ N(0, 0.01) works. If not:

Normalize by the square root of the number of incoming connections (the fan-in) => this ensures roughly equal output variance for each neuron in the network. (A tricky, subtle, but very important topic; see the course notes for details.)
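A minimal sketch of the fan-in-scaled initialization described above (layer sizes reused from the earlier 3072-50-10 example):

    import numpy as np

    fan_in, fan_out = 3072, 50
    W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)   # scale by 1/sqrt(fan-in)
    b = np.zeros(fan_out)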

SLIDE 39

Regularization knobs

  • L2 regularization
  • L1 regularization
  • L1 + L2 can also be combined
  • Max norm constraint

L1 is “sparsity inducing” (many weights become almost exactly zero).

Max norm: enforce a maximum L2 norm on the incoming weights of each neuron.
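A hedged sketch of what these knobs look like in code; the regularization strength and the max-norm constant c are assumed values:

    import numpy as np

    W = 0.01 * np.random.randn(3072, 50)           # incoming weights, one column per neuron
    reg = 1e-3
    c = 3.0                                         # max-norm constant (assumed)

    l2_penalty = 0.5 * reg * np.sum(W * W)          # L2: added to the data loss
    l1_penalty = reg * np.sum(np.abs(W))            # L1: pushes many weights to ~0 (sparsity)

    # max norm: after a parameter update, clip each neuron's incoming weight vector to norm <= c
    norms = np.linalg.norm(W, axis=0)               # one L2 norm per column (per neuron)
    W *= np.minimum(1.0, c / (norms + 1e-8))        # rescale only the columns that exceed c
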
SLIDE 40

Seemingly unrelated: Model Ensembles

  • One way to always improve final accuracy: take several trained models and average their predictions (a small sketch follows below).
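A small sketch of prediction averaging; the predict_proba interface on each model is a hypothetical assumption:

    import numpy as np

    def ensemble_predict(models, X):
        # average the class probabilities across the trained models, then take the argmax
        probs = np.mean([m.predict_proba(X) for m in models], axis=0)
        return np.argmax(probs, axis=1)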

SLIDE 41

Regularization: Dropout “randomly set some neurons to zero”

[Srivastava et al.]

SLIDE 42

Example forward pass with a 3-layer network using dropout:
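The code from the original slide is not included in this transcript; the CS231n course notes give essentially the following sketch of (plain) dropout in a 3-layer network, reproduced here as a reconstruction rather than the slide's verbatim code:

    import numpy as np

    p = 0.5  # probability of keeping a unit active; higher = less dropout

    def train_step(X, W1, b1, W2, b2, W3, b3):
        # forward pass with dropout after each hidden layer
        H1 = np.maximum(0, np.dot(W1, X) + b1)
        U1 = np.random.rand(*H1.shape) < p      # first dropout mask
        H1 *= U1                                # drop!
        H2 = np.maximum(0, np.dot(W2, H1) + b2)
        U2 = np.random.rand(*H2.shape) < p      # second dropout mask
        H2 *= U2                                # drop!
        out = np.dot(W3, H2) + b3
        return out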

SLIDE 43

At test time we don’t drop, but we have to be careful:

At test time all neurons are always active => we must scale the activations so that, for each neuron:

output at test time = expected output at training time

If the output of a neuron is x but the probability of keeping it is only p, then the output of the neuron in expectation is:

px + (1-p)·0 = px

(This has the interpretation of an ensemble of all sub-networks.)

SLIDE 44

More common: “Inverted dropout”

(scale by 1/p at training time instead, so the test-time forward pass is unchanged)
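The slide's code is likewise not in the transcript; a reconstruction of inverted dropout for a single hidden layer (function names are assumptions):

    import numpy as np

    p = 0.5  # probability of keeping a unit active

    def hidden_train(x, W, b):
        # inverted dropout: drop units and scale by 1/p already at training time
        h = np.maximum(0, np.dot(W, x) + b)
        mask = (np.random.rand(*h.shape) < p) / p
        return h * mask

    def hidden_test(x, W, b):
        # ...so the test-time forward pass is completely unchanged
        return np.maximum(0, np.dot(W, x) + b)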

SLIDE 45

Learning rates and updates:

  • SGD + Momentum > plain SGD
  • Momentum of 0.9 usually works well
  • Decrease the learning rate over time (people use 1/t, exp(-t), or step decays); simplest: learning_rate *= 0.97 every epoch or so (see the sketch below)
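A minimal sketch of the SGD+momentum update combined with the simple per-epoch decay mentioned above; the gradient here is a random stand-in:

    import numpy as np

    learning_rate = 1e-3
    mu = 0.9                                    # momentum coefficient, as suggested above

    W = 0.01 * np.random.randn(3072, 50)
    v = np.zeros_like(W)                        # momentum "velocity"

    for epoch in range(10):
        dW = np.random.randn(*W.shape)          # stand-in for the gradient from backprop
        v = mu * v - learning_rate * dW         # momentum update
        W += v
        learning_rate *= 0.97                   # simplest decay: shrink the rate each epoch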

SLIDE 46

Summary

  • Properly preprocess the data
  • Run cross-validations across many tips/tricks
  • Use visualizations to guide the ranges and cross-val
  • Ensemble multiple models and report test error
SLIDE 47

Next Lecture: Convolutional Neural Networks