Lecture 4: Optimization
Fei-Fei Li & Andrej Karpathy, 7 Jan 2015


SLIDE 1

Administrative

  • how is the assignment going?
  • btw, the notes get updated all the time based on your feedback
  • no lecture on Monday
SLIDE 2

Lecture 4: Optimization

SLIDE 3

Image Classification

(example image: a cat)

assume a given set of discrete labels {dog, cat, truck, plane, ...}

SLIDE 4

Data-driven approach

SLIDE 5

  1. Score function
SLIDE 6

SLIDE 7

  1. Score function
  2. Two loss functions
SLIDE 8

SLIDE 9

Three key components to training Neural Nets:

  1. Score function
  2. Loss function
  3. Optimization
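To make the three components concrete, here is a minimal numpy sketch for a linear classifier on CIFAR-10-shaped inputs; the shapes and names are illustrative assumptions of this sketch, not code from the slides:

    import numpy as np

    # 1. Score function: f(x; W) = Wx  (10 classes, 3072 pixels + 1 bias dimension)
    W = np.random.randn(10, 3073) * 0.0001

    def scores(x, W):
        return W.dot(x)  # 10 class scores for one image x (a 3073-d vector)

    # 2. Loss function: multiclass SVM loss for one example (x, y)
    def svm_loss(x, y, W, delta=1.0):
        s = scores(x, W)
        margins = np.maximum(0, s - s[y] + delta)  # hinge on every wrong class
        margins[y] = 0                             # the correct class contributes 0
        return margins.sum()

    # 3. Optimization: find the W that minimizes the loss (the topic of this lecture)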
SLIDE 10

Brief aside: Image Features

  • In practice, very rare to see Computer Vision applications that train linear classifiers on pixel values

SLIDE 11

Brief aside: Image Features

  • In practice, very rare to see Computer Vision applications that train linear classifiers on pixel values

SLIDE 12

Example: Color (Hue) Histogram

(figure: pixels are binned into a histogram over hue bins; each pixel contributes +1 to its bin)
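A minimal sketch of such a feature, assuming an RGB uint8 image; the function name and the bin count are this sketch's own choices, not from the slide:

    import numpy as np
    import matplotlib.colors as mcolors

    def hue_histogram(rgb_image, nbins=12):
        # convert to HSV; hue lands in [0, 1)
        hsv = mcolors.rgb_to_hsv(rgb_image / 255.0)
        hue = hsv[..., 0].ravel()
        # each pixel contributes +1 to the bin its hue falls into
        hist, _ = np.histogram(hue, bins=nbins, range=(0.0, 1.0))
        return hist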

SLIDE 13

Example: HOG features

In each 8x8 pixel region, quantize the edge orientation into 9 bins

(images from vlfeat.org)
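Real HOG adds more machinery (block normalization, interpolation); below is a toy sketch of just the quantization step described above, with hypothetical names:

    import numpy as np

    def cell_orientation_histogram(region, nbins=9):
        # gradients of one 8x8 grayscale region (rows, cols)
        gy, gx = np.gradient(region.astype(float))
        mag = np.hypot(gx, gy)                    # edge strength
        ang = np.arctan2(gy, gx) % np.pi          # unsigned orientation in [0, pi)
        bins = np.minimum((ang / np.pi * nbins).astype(int), nbins - 1)
        hist = np.zeros(nbins)
        # each pixel votes for its orientation bin, weighted by gradient magnitude
        np.add.at(hist, bins.ravel(), mag.ravel())
        return hist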

SLIDE 14

Example: Bag of Words

1. Resize patch to a fixed size (e.g. 32x32 pixels)
2. Extract HOG on the patch (get 144 numbers)

Repeat for each detected feature; this gives a matrix of size [number_of_features x 144]

Problem: different images will have different numbers of features. We need fixed-sized vectors for linear classification.
SLIDE 15

Example: Bag of Words

1. Resize patch to a fixed size (e.g. 32x32 pixels)
2. Extract HOG on the patch (get 144 numbers)

Repeat for each detected feature; this gives a matrix of size [number_of_features x 144]

SLIDE 16

Example: Bag of Words

(figure: cluster the 144-d descriptors with k-means to learn a “vocabulary” of visual words, e.g. 1000 centroids; each image then becomes a fixed 1000-d histogram of visual words)
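A sketch of the pipeline under the stated numbers (144-d HOG descriptors, 1000 centroids); the helper names and the use of scipy's kmeans2 are assumptions of this sketch:

    import numpy as np
    from scipy.cluster.vq import kmeans2

    # all_descriptors: [number_of_features x 144] HOG rows pooled over the training set
    all_descriptors = np.random.randn(5000, 144)     # stand-in data

    # learn the "vocabulary" of visual words, e.g. 1000 k-means centroids
    centroids, _ = kmeans2(all_descriptors, 1000, minit='++')

    def bow_histogram(descriptors, centroids):
        # squared distance from each descriptor to each centroid
        d2 = ((descriptors ** 2).sum(1)[:, None]
              - 2 * descriptors @ centroids.T
              + (centroids ** 2).sum(1)[None, :])
        words = d2.argmin(axis=1)                    # nearest visual word per descriptor
        # fixed-size 1000-d histogram, regardless of the number of features
        return np.bincount(words, minlength=len(centroids))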

SLIDE 17

Brief aside: Image Features

SLIDE 18

Most recognition systems are built on the same architecture

(slide from Yann LeCun)

SLIDE 19

Most recognition systems are built on the same architecture

(slide from Yann LeCun)

CNNs: end-to-end models

SLIDE 20

Visualizing the loss function

SLIDE 21

Visualizing the (SVM) loss function

SLIDE 22

Visualizing the (SVM) loss function

SLIDE 23

Visualizing the (SVM) loss function

SLIDE 24

Visualizing the (SVM) loss function

the full data loss:
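The loss formula on this slide is an image; for reference, reconstructed from the previous lecture's definition (multiclass SVM with margin 1):

$$L(W) = \frac{1}{N}\sum_{i=1}^{N}\sum_{j \neq y_i}\max\left(0,\; w_j^T x_i - w_{y_i}^T x_i + 1\right)$$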

SLIDE 25

Visualizing the (SVM) loss function

Suppose there are 3 examples with 3 classes (class 0, 1, 2 in sequence), then this becomes:
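The expanded sum is also an image on the slide; under the stated setup (examples $x_0, x_1, x_2$ with labels $y = 0, 1, 2$ in sequence), it should read:

$$\begin{aligned}
L = \tfrac{1}{3}\big[\; &\max(0,\, w_1^T x_0 - w_0^T x_0 + 1) + \max(0,\, w_2^T x_0 - w_0^T x_0 + 1) \\
+\; &\max(0,\, w_0^T x_1 - w_1^T x_1 + 1) + \max(0,\, w_2^T x_1 - w_1^T x_1 + 1) \\
+\; &\max(0,\, w_0^T x_2 - w_2^T x_2 + 1) + \max(0,\, w_1^T x_2 - w_2^T x_2 + 1) \;\big]
\end{aligned}$$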

SLIDE 26

Visualizing the (SVM) loss function

Suppose there are 3 examples with 3 classes (class 0, 1, 2 in sequence), then this becomes:

SLIDE 27

Visualizing the (SVM) loss function

Suppose there are 3 examples with 3 classes (class 0, 1, 2 in sequence), then this becomes:

SLIDE 28

Visualizing the (SVM) loss function

Suppose there are 3 examples with 3 classes (class 0, 1, 2 in sequence), then this becomes:

Question: CIFAR-10 has 50,000 training images (5,000 per class) and 10 labels. How many occurrences of one classifier row appear in the full data loss?

SLIDE 29

Optimization

SLIDE 30

Strategy #1: A first very bad idea solution: Random search

SLIDE 31

Strategy #1: A first very bad idea solution: Random search

SLIDE 32

Strategy #1: A first very bad idea solution: Random search

what’s up with 0.0001?
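The code on this slide is an image; below is a sketch in the spirit of the accompanying course notes, assuming `X_train`, `Y_train`, and a full-data loss function `L` exist. The 0.0001 is the scale of the random weights:

    import numpy as np

    bestloss = float('inf')
    for num in range(1000):
        W = np.random.randn(10, 3073) * 0.0001   # generate random parameters
        loss = L(X_train, Y_train, W)            # loss over the whole training set
        if loss < bestloss:                      # keep track of the best W so far
            bestloss = loss
            bestW = W
        print('in attempt %d the loss was %f, best %f' % (num, loss, bestloss))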

SLIDE 33

Let's see how well this works on the test set...

SLIDE 34

Fun aside: When W = 0, what is the CIFAR-10 loss for SVM and Softmax? (With W = 0 all scores are zero, so each SVM margin term is max(0, 0 − 0 + 1) = 1, giving a loss of 9 per example; the Softmax loss is −log(1/10) ≈ 2.3.)

SLIDE 35

Strategy #2: A better but still very bad idea solution: Random local search

SLIDE 36

Strategy #2: A better but still very bad idea solution: Random local search gives 21.4%!
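Again the slide's code is an image; a sketch matching the course notes' version, with the same assumed `L`, `X_train`, `Y_train`:

    import numpy as np

    W = np.random.randn(10, 3073) * 0.001             # random start
    bestloss = float('inf')
    for i in range(1000):
        step_size = 0.0001
        Wtry = W + np.random.randn(10, 3073) * step_size  # random local perturbation
        loss = L(X_train, Y_train, Wtry)
        if loss < bestloss:                           # keep the perturbation only if it helps
            W = Wtry
            bestloss = loss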

SLIDE 37

SLIDE 38

SLIDE 39

Strategy #3: Following the gradient

In 1 dimension, the derivative of a function:

df(x)/dx = lim_{h → 0} [f(x + h) − f(x)] / h

In multiple dimensions, the gradient is the vector of partial derivatives.

SLIDE 40

Evaluating the gradient numerically

SLIDE 41

Evaluating the gradient numerically

“finite difference approximation”: evaluate [f(x + h) − f(x)] / h for a small h (e.g. 1e-5)

SLIDE 42

Evaluating the gradient numerically

“centered difference formula”, used in practice: [f(x + h) − f(x − h)] / (2h)
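A sketch of a centered-difference numerical gradient in the style of the course notes; the function name is this sketch's own, and x is assumed to be a float array:

    import numpy as np

    def eval_numerical_gradient(f, x, h=1e-5):
        """Numerical gradient of f at x via centered differences."""
        grad = np.zeros_like(x)
        it = np.nditer(x, flags=['multi_index'])
        while not it.finished:
            ix = it.multi_index
            old = x[ix]
            x[ix] = old + h
            fxph = f(x)                         # f(x + h)
            x[ix] = old - h
            fxmh = f(x)                         # f(x - h)
            x[ix] = old                         # restore original value
            grad[ix] = (fxph - fxmh) / (2 * h)  # centered difference formula
            it.iternext()
        return grad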

SLIDE 43

Evaluating the gradient numerically

SLIDE 44

performing a parameter update

SLIDE 45

performing a parameter update

SLIDE 46

(figure: the original W, with a step taken in the negative gradient direction)

SLIDE 47

The problems with numerical gradient:

SLIDE 48

The problems with numerical gradient:

  • approximate
  • very slow to evaluate
SLIDE 49

We need something better...

SLIDE 50

We need something better...

Calculus

SLIDE 51

SLIDE 52

SLIDE 53

SLIDE 54

In summary:

  • Numerical gradient: approximate, slow, easy to write
  • Analytic gradient: exact, fast, error-prone

=> In practice: Always use analytic gradient, but check your implementation with the numerical gradient. This is called a gradient check.
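A minimal sketch of such a gradient check, assuming an analytic gradient function and the `eval_numerical_gradient` helper from earlier (both names are this sketch's, not the slides'):

    import numpy as np

    def rel_error(a, b):
        # max relative error, guarded against division by zero
        return np.max(np.abs(a - b) / np.maximum(1e-8, np.abs(a) + np.abs(b)))

    num_grad = eval_numerical_gradient(lambda W_: L(X_train, Y_train, W_), W)
    ana_grad = analytic_gradient(X_train, Y_train, W)   # assumed to be implemented
    print(rel_error(num_grad, ana_grad))                # roughly 1e-7 or less usually passes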

SLIDE 55

Gradient check: Words of caution

SLIDE 56

Gradient Descent
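The loop on this slide is an image; the vanilla version, as in the course notes (with assumed `loss_fun`, `data`, `weights`, `step_size`, and `evaluate_gradient`):

    # Vanilla Gradient Descent
    while True:
        weights_grad = evaluate_gradient(loss_fun, data, weights)
        weights += -step_size * weights_grad   # perform parameter update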

SLIDE 57

Mini-batch Gradient Descent

  • only use a small portion of the training set to compute the gradient

Common mini-batch sizes are ~100 examples; e.g. the Krizhevsky ILSVRC ConvNet used 256 examples
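The corresponding loop only swaps the full dataset for a sampled batch (a sketch; `sample_training_data` is an assumed helper):

    # Mini-batch Gradient Descent
    while True:
        data_batch = sample_training_data(data, 256)   # sample e.g. 256 examples
        weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
        weights += -step_size * weights_grad           # parameter update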

SLIDE 58

Stochastic Gradient Descent (SGD)

  • use a single example at a time (mini-batch size of 1)

(also sometimes called on-line Gradient Descent)

SLIDE 59

Summary

  • Always use mini-batch gradient descent
  • Incorrectly refer to it as “doing SGD” as everyone else does (or call it batch gradient descent)
  • The mini-batch size is a hyperparameter, but it is not very common to cross-validate over it (it is usually chosen based on practical concerns, e.g. space/time efficiency)

SLIDE 60

Fun question: Suppose you were training with mini-batch size of 100, and now you switch to mini-batch of size 1. Your learning rate (step size) should:

  • increase
  • decrease
  • stay the same
  • become zero
SLIDE 61

The dynamics of Gradient Descent

SLIDE 62

The dynamics of Gradient Descent

(figure labels: “always pull the weights down” vs. “pull some weights up and some down”)

SLIDE 63

Momentum Update

(figure labels: gradient, momentum, update)
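The update equations are an image on the slide; a common formulation consistent with the course notes (`mu` is the momentum coefficient, e.g. 0.9, and the velocity `v` is initialized to zero):

    # Momentum update
    v = mu * v - learning_rate * dx   # integrate velocity: decay it, add the gradient step
    x += v                            # integrate position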

SLIDE 64

Many other ways to perform optimization…

  • Second order methods that use the Hessian (or its approximation): BFGS, L-BFGS, etc.
  • Currently, the lesson from the trenches is that well-tuned SGD+Momentum is very hard to beat for CNNs.

SLIDE 65

Summary

  • We looked at image features, and saw that CNNs can be thought of as learning the features in an end-to-end manner
  • We explored intuition about what the loss surfaces of linear classifiers look like
  • We introduced gradient descent as a way of optimizing loss functions, as well as batch gradient descent and SGD
  • Numerical gradient: slow :(, approximate :(, easy to write :)
  • Analytic gradient: fast :), exact :), error-prone :(
  • In practice: Gradient check (but be careful)
SLIDE 66

Next class: Becoming a backprop ninja