Lecture 4: Optimization
Fei-Fei Li & Andrej Karpathy, 7 Jan 2015


SLIDE 1

Administrative

  • how is the assignment going?
  • btw, the notes get updated all the time based on your feedback
  • no lecture on Monday
SLIDE 2

Lecture 4: Optimization

SLIDE 3

Image Classification

(example image: a cat)

assume a given set of discrete labels {dog, cat, truck, plane, ...}

SLIDE 4

Data-driven approach

SLIDE 5

  1. Score function
SLIDE 6

SLIDE 7

  1. Score function
  2. Two loss functions
SLIDE 8

SLIDE 9

Three key components to training Neural Nets:

  1. Score function
  2. Loss function
  3. Optimization
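To make the three components concrete, here is a minimal numpy sketch for a linear classifier on CIFAR-10-shaped inputs; the shapes and names are illustrative assumptions of this sketch, not code from the slides:

    import numpy as np

    # 1. Score function: f(x; W) = Wx  (10 classes, 3072 pixels + 1 bias dimension)
    W = np.random.randn(10, 3073) * 0.0001

    def scores(x, W):
        return W.dot(x)  # 10 class scores for one image x (a 3073-d vector)

    # 2. Loss function: multiclass SVM loss for one example (x, y)
    def svm_loss(x, y, W, delta=1.0):
        s = scores(x, W)
        margins = np.maximum(0, s - s[y] + delta)  # hinge on every wrong class
        margins[y] = 0                             # the correct class contributes 0
        return margins.sum()

    # 3. Optimization: find the W that minimizes the loss (the topic of this lecture)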
SLIDE 10

Brief aside: Image Features

  • In practice, very rare to see Computer Vision applications that train linear classifiers on pixel values

SLIDE 11

Brief aside: Image Features

  • In practice, very rare to see Computer Vision applications that train linear classifiers on pixel values

SLIDE 12

Example: Color (Hue) Histogram

(figure: pixels are binned into a histogram over hue bins; each pixel contributes +1 to its bin)
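A minimal sketch of such a feature, assuming an RGB uint8 image; the function name and the bin count are this sketch's own choices, not from the slide:

    import numpy as np
    import matplotlib.colors as mcolors

    def hue_histogram(rgb_image, nbins=12):
        # convert to HSV; hue lands in [0, 1)
        hsv = mcolors.rgb_to_hsv(rgb_image / 255.0)
        hue = hsv[..., 0].ravel()
        # each pixel contributes +1 to the bin its hue falls into
        hist, _ = np.histogram(hue, bins=nbins, range=(0.0, 1.0))
        return hist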

SLIDE 13

Example: HOG features

In each 8x8 pixel region, quantize the edge orientation into 9 bins

(images from vlfeat.org)
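Real HOG adds more machinery (block normalization, interpolation); below is a toy sketch of just the quantization step described above, with hypothetical names:

    import numpy as np

    def cell_orientation_histogram(region, nbins=9):
        # gradients of one 8x8 grayscale region (rows, cols)
        gy, gx = np.gradient(region.astype(float))
        mag = np.hypot(gx, gy)                    # edge strength
        ang = np.arctan2(gy, gx) % np.pi          # unsigned orientation in [0, pi)
        bins = np.minimum((ang / np.pi * nbins).astype(int), nbins - 1)
        hist = np.zeros(nbins)
        # each pixel votes for its orientation bin, weighted by gradient magnitude
        np.add.at(hist, bins.ravel(), mag.ravel())
        return hist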

SLIDE 14

Example: Bag of Words

1. Resize patch to a fixed size (e.g. 32x32 pixels)
2. Extract HOG on the patch (get 144 numbers)

Repeat for each detected feature; this gives a matrix of size [number_of_features x 144]

Problem: different images will have different numbers of features. We need fixed-sized vectors for linear classification.
SLIDE 15

Example: Bag of Words

1. Resize patch to a fixed size (e.g. 32x32 pixels)
2. Extract HOG on the patch (get 144 numbers)

Repeat for each detected feature; this gives a matrix of size [number_of_features x 144]

SLIDE 16

Example: Bag of Words

(figure: cluster the 144-d descriptors with k-means to learn a “vocabulary” of visual words, e.g. 1000 centroids; each image then becomes a fixed 1000-d histogram of visual words)
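A sketch of the pipeline under the stated numbers (144-d HOG descriptors, 1000 centroids); the helper names and the use of scipy's kmeans2 are assumptions of this sketch:

    import numpy as np
    from scipy.cluster.vq import kmeans2

    # all_descriptors: [number_of_features x 144] HOG rows pooled over the training set
    all_descriptors = np.random.randn(5000, 144)     # stand-in data

    # learn the "vocabulary" of visual words, e.g. 1000 k-means centroids
    centroids, _ = kmeans2(all_descriptors, 1000, minit='++')

    def bow_histogram(descriptors, centroids):
        # squared distance from each descriptor to each centroid
        d2 = ((descriptors ** 2).sum(1)[:, None]
              - 2 * descriptors @ centroids.T
              + (centroids ** 2).sum(1)[None, :])
        words = d2.argmin(axis=1)                    # nearest visual word per descriptor
        # fixed-size 1000-d histogram, regardless of the number of features
        return np.bincount(words, minlength=len(centroids))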

SLIDE 17

Brief aside: Image Features

SLIDE 18

Most recognition systems are built on the same architecture

(slide from Yann LeCun)

SLIDE 19

Most recognition systems are built on the same architecture

(slide from Yann LeCun)

CNNs: end-to-end models

SLIDE 20

Visualizing the loss function

SLIDE 21

Visualizing the (SVM) loss function

SLIDE 22

Visualizing the (SVM) loss function

SLIDE 23

Visualizing the (SVM) loss function

SLIDE 24

Visualizing the (SVM) loss function

the full data loss:
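The loss formula on this slide is an image; for reference, reconstructed from the previous lecture's definition (multiclass SVM with margin 1):

$$L(W) = \frac{1}{N}\sum_{i=1}^{N}\sum_{j \neq y_i}\max\left(0,\; w_j^T x_i - w_{y_i}^T x_i + 1\right)$$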

SLIDE 25

Visualizing the (SVM) loss function

Suppose there are 3 examples with 3 classes (class 0, 1, 2 in sequence), then this becomes:
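The expanded sum is also an image on the slide; under the stated setup (examples $x_0, x_1, x_2$ with labels $y = 0, 1, 2$ in sequence), it should read:

$$\begin{aligned}
L = \tfrac{1}{3}\big[\; &\max(0,\, w_1^T x_0 - w_0^T x_0 + 1) + \max(0,\, w_2^T x_0 - w_0^T x_0 + 1) \\
+\; &\max(0,\, w_0^T x_1 - w_1^T x_1 + 1) + \max(0,\, w_2^T x_1 - w_1^T x_1 + 1) \\
+\; &\max(0,\, w_0^T x_2 - w_2^T x_2 + 1) + \max(0,\, w_1^T x_2 - w_2^T x_2 + 1) \;\big]
\end{aligned}$$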

SLIDE 26

Visualizing the (SVM) loss function

Suppose there are 3 examples with 3 classes (class 0, 1, 2 in sequence), then this becomes:

SLIDE 27

Visualizing the (SVM) loss function

Suppose there are 3 examples with 3 classes (class 0, 1, 2 in sequence), then this becomes:

SLIDE 28

Visualizing the (SVM) loss function

Suppose there are 3 examples with 3 classes (class 0, 1, 2 in sequence), then this becomes:

Question: CIFAR-10 has 50,000 training images (5,000 per class) and 10 labels. How many occurrences of one classifier row appear in the full data loss?

SLIDE 29

Optimization

SLIDE 30

Strategy #1: A first very bad idea solution: Random search

SLIDE 31

Strategy #1: A first very bad idea solution: Random search

SLIDE 32

Strategy #1: A first very bad idea solution: Random search

what’s up with 0.0001?
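The code on this slide is an image; below is a sketch in the spirit of the accompanying course notes, assuming `X_train`, `Y_train`, and a full-data loss function `L` exist. The 0.0001 is the scale of the random weights:

    import numpy as np

    bestloss = float('inf')
    for num in range(1000):
        W = np.random.randn(10, 3073) * 0.0001   # generate random parameters
        loss = L(X_train, Y_train, W)            # loss over the whole training set
        if loss < bestloss:                      # keep track of the best W so far
            bestloss = loss
            bestW = W
        print('in attempt %d the loss was %f, best %f' % (num, loss, bestloss))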

SLIDE 33

Let's see how well this works on the test set...

SLIDE 34

Fun aside: When W = 0, what is the CIFAR-10 loss for SVM and Softmax? (With W = 0 all scores are zero, so each SVM margin term is max(0, 0 − 0 + 1) = 1, giving a loss of 9 per example; the Softmax loss is −log(1/10) ≈ 2.3.)

SLIDE 35

Strategy #2: A better but still very bad idea solution: Random local search

SLIDE 36

Strategy #2: A better but still very bad idea solution: Random local search gives 21.4%!
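Again the slide's code is an image; a sketch matching the course notes' version, with the same assumed `L`, `X_train`, `Y_train`:

    import numpy as np

    W = np.random.randn(10, 3073) * 0.001             # random start
    bestloss = float('inf')
    for i in range(1000):
        step_size = 0.0001
        Wtry = W + np.random.randn(10, 3073) * step_size  # random local perturbation
        loss = L(X_train, Y_train, Wtry)
        if loss < bestloss:                           # keep the perturbation only if it helps
            W = Wtry
            bestloss = loss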

SLIDE 37

SLIDE 38

SLIDE 39

Strategy #3: Following the gradient

In 1 dimension, the derivative of a function:

df(x)/dx = lim_{h → 0} [f(x + h) − f(x)] / h

In multiple dimensions, the gradient is the vector of partial derivatives.

SLIDE 40

Evaluating the gradient numerically

SLIDE 41

Evaluating the gradient numerically

“finite difference approximation”: evaluate [f(x + h) − f(x)] / h for a small h (e.g. 1e-5)

SLIDE 42

Evaluating the gradient numerically

“centered difference formula”, used in practice: [f(x + h) − f(x − h)] / (2h)
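A sketch of a centered-difference numerical gradient in the style of the course notes; the function name is this sketch's own, and x is assumed to be a float array:

    import numpy as np

    def eval_numerical_gradient(f, x, h=1e-5):
        """Numerical gradient of f at x via centered differences."""
        grad = np.zeros_like(x)
        it = np.nditer(x, flags=['multi_index'])
        while not it.finished:
            ix = it.multi_index
            old = x[ix]
            x[ix] = old + h
            fxph = f(x)                         # f(x + h)
            x[ix] = old - h
            fxmh = f(x)                         # f(x - h)
            x[ix] = old                         # restore original value
            grad[ix] = (fxph - fxmh) / (2 * h)  # centered difference formula
            it.iternext()
        return grad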

SLIDE 43

Evaluating the gradient numerically

SLIDE 44

performing a parameter update

SLIDE 45

performing a parameter update

SLIDE 46

(figure: the original W, with a step taken in the negative gradient direction)

SLIDE 47

The problems with numerical gradient:

SLIDE 48

The problems with numerical gradient:

  • approximate
  • very slow to evaluate
SLIDE 49

We need something better...

SLIDE 50

We need something better...

Calculus

SLIDE 51

SLIDE 52

SLIDE 53

SLIDE 54

In summary:

  • Numerical gradient: approximate, slow, easy to write
  • Analytic gradient: exact, fast, error-prone

=> In practice: Always use analytic gradient, but check your implementation with the numerical gradient. This is called a gradient check.
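A minimal sketch of such a gradient check, assuming an analytic gradient function and the `eval_numerical_gradient` helper from earlier (both names are this sketch's, not the slides'):

    import numpy as np

    def rel_error(a, b):
        # max relative error, guarded against division by zero
        return np.max(np.abs(a - b) / np.maximum(1e-8, np.abs(a) + np.abs(b)))

    num_grad = eval_numerical_gradient(lambda W_: L(X_train, Y_train, W_), W)
    ana_grad = analytic_gradient(X_train, Y_train, W)   # assumed to be implemented
    print(rel_error(num_grad, ana_grad))                # roughly 1e-7 or less usually passes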

SLIDE 55

Gradient check: Words of caution

SLIDE 56

Gradient Descent
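The loop on this slide is an image; the vanilla version, as in the course notes (with assumed `loss_fun`, `data`, `weights`, `step_size`, and `evaluate_gradient`):

    # Vanilla Gradient Descent
    while True:
        weights_grad = evaluate_gradient(loss_fun, data, weights)
        weights += -step_size * weights_grad   # perform parameter update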

SLIDE 57

Mini-batch Gradient Descent

  • only use a small portion of the training set to compute the gradient

Common mini-batch sizes are ~100 examples; e.g. the Krizhevsky ILSVRC ConvNet used 256 examples
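The corresponding loop only swaps the full dataset for a sampled batch (a sketch; `sample_training_data` is an assumed helper):

    # Mini-batch Gradient Descent
    while True:
        data_batch = sample_training_data(data, 256)   # sample e.g. 256 examples
        weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
        weights += -step_size * weights_grad           # parameter update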

SLIDE 58

Stochastic Gradient Descent (SGD)

  • use a single example at a time (mini-batch size of 1)

(also sometimes called on-line Gradient Descent)

SLIDE 59

Summary

  • Always use mini-batch gradient descent
  • Incorrectly refer to it as “doing SGD” as everyone else does (or call it batch gradient descent)
  • The mini-batch size is a hyperparameter, but it is not very common to cross-validate over it (it is usually chosen based on practical concerns, e.g. space/time efficiency)

SLIDE 60

Fun question: Suppose you were training with mini-batch size of 100, and now you switch to mini-batch of size 1. Your learning rate (step size) should:

  • increase
  • decrease
  • stay the same
  • become zero
SLIDE 61

The dynamics of Gradient Descent

SLIDE 62

The dynamics of Gradient Descent

(figure labels: “always pull the weights down” vs. “pull some weights up and some down”)

SLIDE 63

Momentum Update

(figure labels: gradient, momentum, update)
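The update equations are an image on the slide; a common formulation consistent with the course notes (`mu` is the momentum coefficient, e.g. 0.9, and the velocity `v` is initialized to zero):

    # Momentum update
    v = mu * v - learning_rate * dx   # integrate velocity: decay it, add the gradient step
    x += v                            # integrate position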

SLIDE 64

Many other ways to perform optimization…

  • Second order methods that use the Hessian (or its approximation): BFGS, L-BFGS, etc.
  • Currently, the lesson from the trenches is that well-tuned SGD+Momentum is very hard to beat for CNNs.

SLIDE 65

Summary

  • We looked at image features, and saw that CNNs can be thought of as learning the features in an end-to-end manner
  • We explored intuition about what the loss surfaces of linear classifiers look like
  • We introduced gradient descent as a way of optimizing loss functions, as well as batch gradient descent and SGD
  • Numerical gradient: slow :(, approximate :(, easy to write :)
  • Analytic gradient: fast :), exact :), error-prone :(
  • In practice: Gradient check (but be careful)
SLIDE 66

Next class: Becoming a backprop ninja