Lecture 3: Loss functions and Optimization
Fei-Fei Li & Andrej Karpathy & Justin Johnson
11 Jan 2016


SLIDE 1

Lecture 3: Loss functions and Optimization

SLIDE 2

Administrative

A1 is due Jan 20 (Wednesday), ~9 days left. Warning: Jan 18 (Monday) is a holiday (no class / office hours).

SLIDE 3

Recall from last time… Challenges in Visual Recognition

Camera pose, illumination, deformation, occlusion, background clutter, intraclass variation.

SLIDE 4

Recall from last time… data-driven approach, kNN

[Figure: the training data, and the decision regions of the NN classifier vs. the 5-NN classifier]

SLIDE 5

Recall from last time… Linear classifier

image x: a [32x32x3] array of numbers 0...1 (3072 numbers total) → f(x,W), with parameters W → 10 numbers, indicating class scores

SLIDE 6

Recall from last time… Going forward: Loss function/Optimization

[Figure: example class scores for three training images under some W]

TODO:
1. Define a loss function that quantifies our unhappiness with the scores across the training data.
2. Come up with a way of efficiently finding the parameters that minimize the loss function. (optimization)

SLIDE 7

Suppose: 3 training examples, 3 classes. With some W the scores are:

            cat image   car image   frog image
cat score      3.2         1.3         2.2
car score      5.1         4.9         2.5
frog score    -1.7         2.0        -3.1

SLIDE 8

Suppose: 3 training examples, 3 classes. With some W the scores are:

            cat image   car image   frog image
cat score      3.2         1.3         2.2
car score      5.1         4.9         2.5
frog score    -1.7         2.0        -3.1

Multiclass SVM loss:

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the vector of scores, the SVM loss has the form:

L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)

SLIDE 9

Applying this to the first (cat) example, whose correct-class score is 3.2:

L_1 = max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1)
    = max(0, 2.9) + max(0, -3.9)
    = 2.9 + 0
    = 2.9

Losses: 2.9

SLIDE 10

For the second (car) example, whose correct-class score is 4.9:

L_2 = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1)
    = max(0, -2.6) + max(0, -1.9)
    = 0 + 0
    = 0

Losses: 2.9, 0

SLIDE 11

For the third (frog) example, whose correct-class score is -3.1:

L_3 = max(0, 2.2 - (-3.1) + 1) + max(0, 2.5 - (-3.1) + 1)
    = max(0, 5.3) + max(0, 5.6)
    = 5.3 + 5.6
    = 10.9

Losses: 2.9, 0, 10.9

SLIDE 12

Losses: 2.9, 0, 10.9

The full training loss is the mean of L_i over all examples in the training data:

L = (1/N) \sum_{i=1}^{N} L_i

For our three examples: L = (2.9 + 0 + 10.9)/3 = 4.6

SLIDE 13

(Scores for the three examples, their losses 2.9 / 0 / 10.9, and the multiclass SVM loss definition are as above.)

Q: what if the sum was instead over all classes? (including j = y_i)

SLIDE 14

(Scores for the three examples, their losses 2.9 / 0 / 10.9, and the multiclass SVM loss definition are as above.)

Q2: what if we used a mean instead of a sum here?

SLIDE 15

(Scores for the three examples, their losses 2.9 / 0 / 10.9, and the multiclass SVM loss definition are as above.)

Q3: what if we used the squared hinge instead, L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)^2 ?

SLIDE 16

(Scores for the three examples, their losses 2.9 / 0 / 10.9, and the multiclass SVM loss definition are as above.)

Q4: what is the min/max possible loss?

SLIDE 17

(Scores for the three examples, their losses 2.9 / 0 / 10.9, and the multiclass SVM loss definition are as above.)

Q5: usually at initialization W contains small numbers, so all s ≈ 0. What is the loss?

SLIDE 18

Example numpy code:
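The code on this slide did not survive extraction. Below is a minimal numpy sketch of the per-example multiclass SVM loss that matches the formula above (a reconstruction in the slide's spirit, not necessarily the slide's exact code):

import numpy as np

def L_i_vectorized(x, y, W):
    # Multiclass SVM loss for one example: x is the (D,) pixel vector, y the integer
    # label, W the (C, D) weight matrix. Mirrors L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + 1).
    scores = W.dot(x)                                 # class scores s = f(x, W)
    margins = np.maximum(0, scores - scores[y] + 1)   # hinge term for every class
    margins[y] = 0                                    # the correct class contributes nothing
    return np.sum(margins)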

SLIDE 19

SLIDE 20

There is a bug with the loss:

SLIDE 21

There is a bug with the loss:

E.g. Suppose that we found a W such that L = 0. Is this W unique?

SLIDE 22

Suppose: 3 training examples, 3 classes. With the scores from before, the car example has loss 0:

= max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1) = max(0, -2.6) + max(0, -1.9) = 0 + 0 = 0

With W twice as large (all scores doubled), the loss is still 0:

= max(0, 2.6 - 9.8 + 1) + max(0, 4.0 - 9.8 + 1) = max(0, -6.2) + max(0, -4.8) = 0 + 0 = 0

SLIDE 23

Weight Regularization

In addition to the data loss, add a regularization term:

L = (1/N) \sum_i L_i + \lambda R(W),   where \lambda = regularization strength (hyperparameter)

In common use:
  • L2 regularization: R(W) = \sum_k \sum_l W_{k,l}^2 (see the sketch below)
  • L1 regularization: R(W) = \sum_k \sum_l |W_{k,l}|
  • Elastic net (L1 + L2)
  • Max norm regularization (might see later)
  • Dropout (will see later)
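As a concrete numpy sketch (an assumed illustration, not the course's reference code), the full loss is the mean SVM data loss plus the L2 penalty:

import numpy as np

def full_loss(W, X, y, lam):
    # Mean multiclass SVM data loss plus L2 regularization.
    # X: (N, D) data, y: (N,) integer labels, W: (C, D) weights, lam: the lambda above.
    scores = X.dot(W.T)                                  # (N, C) class scores
    correct = scores[np.arange(len(y)), y][:, None]      # (N, 1) correct-class scores
    margins = np.maximum(0, scores - correct + 1)        # hinge with margin 1
    margins[np.arange(len(y)), y] = 0                    # drop the j == y_i terms
    data_loss = margins.sum(axis=1).mean()               # (1/N) * sum_i L_i
    reg_loss = lam * np.sum(W * W)                       # lambda * R(W) with R = L2
    return data_loss + reg_loss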

SLIDE 24

L2 regularization: motivation
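The worked example on this slide was lost in extraction; the following numbers are a standard illustration of the idea and are assumptions rather than recovered slide content: two weight vectors can produce the same score, yet L2 regularization prefers the one that is spread across all inputs.

import numpy as np

x  = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])
w2 = np.array([0.25, 0.25, 0.25, 0.25])

print(w1 @ x, w2 @ x)                   # same score: 1.0 and 1.0
print(np.sum(w1**2), np.sum(w2**2))     # L2 penalty: 1.0 vs 0.25, so L2 prefers the diffuse w2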

SLIDE 25

Softmax Classifier (Multinomial Logistic Regression)

Scores for the cat image: cat 3.2, car 5.1, frog -1.7
SLIDE 26

Softmax Classifier (Multinomial Logistic Regression)

scores = unnormalized log probabilities of the classes.

Scores for the cat image: cat 3.2, car 5.1, frog -1.7
SLIDE 27

Softmax Classifier (Multinomial Logistic Regression)

scores = unnormalized log probabilities of the classes.

Scores for the cat image: cat 3.2, car 5.1, frog -1.7

P(Y = k | X = x_i) = e^{s_k} / \sum_j e^{s_j},   where s = f(x_i; W)

SLIDE 28

Softmax Classifier (Multinomial Logistic Regression)

scores = unnormalized log probabilities of the classes.

Scores for the cat image: cat 3.2, car 5.1, frog -1.7

P(Y = k | X = x_i) = e^{s_k} / \sum_j e^{s_j},   where s = f(x_i; W)

This mapping from scores to probabilities is the softmax function.

SLIDE 29

Softmax Classifier (Multinomial Logistic Regression)

scores = unnormalized log probabilities of the classes. Want to maximize the log likelihood, or (for a loss function) to minimize the negative log likelihood of the correct class:

Scores for the cat image: cat 3.2, car 5.1, frog -1.7

L_i = -\log P(Y = y_i | X = x_i),   where P(Y = k | X = x_i) = e^{s_k} / \sum_j e^{s_j}

SLIDE 30

Softmax Classifier (Multinomial Logistic Regression)

scores = unnormalized log probabilities of the classes. Want to maximize the log likelihood, or (for a loss function) to minimize the negative log likelihood of the correct class:

Scores for the cat image: cat 3.2, car 5.1, frog -1.7

In summary:

L_i = -\log( e^{s_{y_i}} / \sum_j e^{s_j} )

SLIDE 31

Softmax Classifier (Multinomial Logistic Regression)

unnormalized log probabilities (scores for the cat image): cat 3.2, car 5.1, frog -1.7

SLIDE 32

Softmax Classifier (Multinomial Logistic Regression)

unnormalized log probabilities: [3.2, 5.1, -1.7] → (exp) → unnormalized probabilities: [24.5, 164.0, 0.18]

SLIDE 33

Softmax Classifier (Multinomial Logistic Regression)

unnormalized log probabilities: [3.2, 5.1, -1.7] → (exp) → unnormalized probabilities: [24.5, 164.0, 0.18] → (normalize) → probabilities: [0.13, 0.87, 0.00]

SLIDE 34

Softmax Classifier (Multinomial Logistic Regression)

unnormalized log probabilities: [3.2, 5.1, -1.7] → (exp) → unnormalized probabilities: [24.5, 164.0, 0.18] → (normalize) → probabilities: [0.13, 0.87, 0.00]

L_i = -log(0.13) = 0.89   (loss for the correct class, cat)
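A minimal numpy sketch of this pipeline (an illustration, not the lecture's code). Note that -log(0.13) equals 0.89 only for the base-10 logarithm; with the natural log, which is how the softmax loss is usually written, the same probability gives a loss of about 2.04.

import numpy as np

scores = np.array([3.2, 5.1, -1.7])          # unnormalized log probabilities (cat, car, frog)
unnormalized = np.exp(scores)                # unnormalized probabilities: ~[24.5, 164.0, 0.18]
probs = unnormalized / unnormalized.sum()    # softmax probabilities: ~[0.13, 0.87, 0.00]

y = 0                                        # correct class: cat
loss_natural = -np.log(probs[y])             # ~2.04 with the natural log
loss_base10 = -np.log10(probs[y])            # ~0.89, the number shown on the slide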

SLIDE 35

(Softmax pipeline for the cat image as above: probabilities [0.13, 0.87, 0.00], L_i = -log(0.13) = 0.89.)

Q: What is the min/max possible loss L_i?

SLIDE 36

(Softmax pipeline for the cat image as above.)

Q5: usually at initialization W contains small numbers, so all s ≈ 0. What is the loss?

SLIDE 37

SLIDE 38

Softmax vs. SVM

SLIDE 39

Softmax vs. SVM. Assume three datapoints with scores [10, -2, 3], [10, 9, 9], [10, -100, -100], where the correct class is the first one in each case.

Q: Suppose I take a datapoint and jiggle it a bit (changing its scores slightly). What happens to the loss in both cases?
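A small numpy sketch (an illustration, not the slide's code) that evaluates both losses on the three score vectors above, taking the correct class to be the first one: the SVM loss is 0 for all three because every margin is satisfied, while the softmax loss is never exactly 0 and keeps changing as the other scores move.

import numpy as np

def svm_loss(s, y):
    margins = np.maximum(0, s - s[y] + 1)
    margins[y] = 0
    return margins.sum()

def softmax_loss(s, y):
    p = np.exp(s - s.max())              # shift scores for numerical stability
    p = p / p.sum()
    return -np.log(p[y])

for s in ([10, -2, 3], [10, 9, 9], [10, -100, -100]):
    s = np.array(s, dtype=float)
    print(svm_loss(s, 0), softmax_loss(s, 0))   # SVM: 0.0 every time; softmax: small but nonzero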

SLIDE 40

Interactive Web Demo time....

http://vision.stanford.edu/teaching/cs231n/linear-classify-demo/

SLIDE 41

Optimization

SLIDE 42

Recap

  • We have some dataset of (x, y)
  • We have a score function: s = f(x; W) = Wx
  • We have a loss function, e.g.:

Softmax:   L_i = -\log( e^{s_{y_i}} / \sum_j e^{s_j} )
SVM:       L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)
Full loss: L = (1/N) \sum_i L_i + \lambda R(W)

SLIDE 43

Strategy #1: A first very bad idea solution: Random search
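The code for this strategy did not survive extraction; here is a minimal numpy sketch in its spirit. The CIFAR-10 weight shape and the helper loss_for(X, Y, W) are assumptions, not the original code.

import numpy as np

def random_search(X_train, Y_train, loss_for, num_tries=1000):
    # Try random weight matrices and keep whichever gives the lowest training loss.
    best_loss, best_W = float("inf"), None
    for _ in range(num_tries):
        W = np.random.randn(10, 3073) * 0.0001   # random CIFAR-10-shaped parameters
        loss = loss_for(X_train, Y_train, W)     # full training loss for this W
        if loss < best_loss:
            best_loss, best_W = loss, W
    return best_W, best_loss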

SLIDE 44

Let's see how well this works on the test set... 15.5% accuracy! Not bad! (SOTA is ~95%)

SLIDE 45

SLIDE 46

SLIDE 47

Strategy #2: Follow the slope

In 1-dimension, the derivative of a function:

df(x)/dx = lim_{h -> 0} [f(x + h) - f(x)] / h

In multiple dimensions, the gradient is the vector of partial derivatives.

SLIDE 48

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]    loss 1.25347

gradient dW: [?, ?, ?, ?, ?, ?, ?, ?, ?, ...]

SLIDE 49

current W:          [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]    loss 1.25347
W + h (first dim):  [0.34 + 0.0001, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]    loss 1.25322

gradient dW: [?, ?, ?, ?, ?, ?, ?, ?, ?, ...]

SLIDE 50

current W:          [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]    loss 1.25347
W + h (first dim):  [0.34 + 0.0001, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]    loss 1.25322

(1.25322 - 1.25347)/0.0001 = -2.5

gradient dW: [-2.5, ?, ?, ?, ?, ?, ?, ?, ?, ...]

SLIDE 51

current W:           [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]    loss 1.25347
W + h (second dim):  [0.34, -1.11 + 0.0001, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]    loss 1.25353

gradient dW: [-2.5, ?, ?, ?, ?, ?, ?, ?, ?, ...]

SLIDE 52

current W:           [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]    loss 1.25347
W + h (second dim):  [0.34, -1.11 + 0.0001, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]    loss 1.25353

(1.25353 - 1.25347)/0.0001 = 0.6

gradient dW: [-2.5, 0.6, ?, ?, ?, ?, ?, ?, ?, ...]

SLIDE 53

current W:          [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]    loss 1.25347
W + h (third dim):  [0.34, -1.11, 0.78 + 0.0001, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]    loss 1.25347

gradient dW: [-2.5, 0.6, ?, ?, ?, ?, ?, ?, ?, ...]

SLIDE 54

current W:          [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]    loss 1.25347
W + h (third dim):  [0.34, -1.11, 0.78 + 0.0001, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]    loss 1.25347

(1.25347 - 1.25347)/0.0001 = 0

gradient dW: [-2.5, 0.6, 0, ?, ?, ?, ?, ?, ?, ...]

SLIDE 55

Evaluating the gradient numerically

SLIDE 56

Evaluating the gradient numerically

  • approximate
  • very slow to evaluate (see the sketch below)
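A minimal numpy sketch of the forward-difference procedure worked through above (an assumed implementation, not the lecture's code); f is any function of W, and h = 0.0001 matches the example.

import numpy as np

def eval_numerical_gradient(f, W, h=1e-4):
    # Forward-difference estimate of the gradient of f at W: approximate and very slow,
    # since it re-evaluates the loss once per dimension of W.
    grad = np.zeros(W.shape)
    fW = f(W)                                  # loss at the current W
    it = np.nditer(W, flags=["multi_index"])
    while not it.finished:
        idx = it.multi_index
        old_value = W[idx]
        W[idx] = old_value + h                 # bump this dimension by h
        grad[idx] = (f(W) - fW) / h            # slope along this dimension
        W[idx] = old_value                     # restore
        it.iternext()
    return grad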
SLIDE 57

This is silly. The loss is just a function of W.

We want \nabla_W L.

SLIDE 58

This is silly. The loss is just a function of W:

L(W) = (1/N) \sum_i \sum_{j \neq y_i} \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + 1) + \lambda R(W)

We want \nabla_W L.

SLIDE 59

This is silly. The loss is just a function of W: use calculus to write down an expression for the gradient.

We want \nabla_W L.

SLIDE 60

This is silly. The loss is just a function of W, so with calculus we can derive an analytic expression for the gradient:

\nabla_W L = ...

SLIDE 61

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]    loss 1.25347

dW = ... (some function of the data and W)

gradient dW: [-2.5, 0.6, 0, 0.2, 0.7, -0.5, 1.1, 1.3, -2.1, ...]

SLIDE 62

In summary:

  • Numerical gradient: approximate, slow, easy to write
  • Analytic gradient: exact, fast, error-prone

=>

In practice: Always use analytic gradient, but check implementation with numerical gradient. This is called a gradient check.

SLIDE 63

Gradient Descent
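The gradient descent loop itself was lost in extraction; here is a minimal sketch, with the loss and gradient functions passed in as assumed parameters rather than taken from the slides.

def gradient_descent(loss_fun, grad_fun, data, weights, step_size=1e-3, num_steps=100):
    # Vanilla gradient descent (sketch): repeatedly step along the negative gradient.
    for _ in range(num_steps):
        weights_grad = grad_fun(loss_fun, data, weights)   # dL/dW on the full training set
        weights = weights - step_size * weights_grad       # parameter update
    return weights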

SLIDE 64

[Figure: bowl-shaped loss surface over weights (W_1, W_2); starting from the original W, repeatedly step in the negative gradient direction]

SLIDE 65

Mini-batch Gradient Descent

Only use a small portion of the training set to compute the gradient.

Common mini-batch sizes are 32/64/128 examples; e.g., the Krizhevsky ILSVRC ConvNet used 256 examples. A minimal sketch follows below.
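A minimal sketch of the idea (the helper grad_fun is an assumption, not the slide's exact code): sample a mini-batch, compute the gradient on it, and update.

import numpy as np

def minibatch_gradient_descent(loss_fun, grad_fun, X, y, weights,
                               step_size=1e-3, batch_size=256, num_steps=1000):
    # Mini-batch gradient descent (sketch): estimate the gradient from a small random batch.
    for _ in range(num_steps):
        idx = np.random.choice(len(y), batch_size, replace=False)    # sample a mini-batch
        weights_grad = grad_fun(loss_fun, X[idx], y[idx], weights)   # gradient on the batch only
        weights = weights - step_size * weights_grad                 # parameter update
    return weights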

SLIDE 66

Example of optimization progress while training a neural network. (Loss over mini-batches goes down over time.)
SLIDE 67

The effects of step size (or “learning rate”)

SLIDE 68

Mini-batch Gradient Descent

Only use a small portion of the training set to compute the gradient. Common mini-batch sizes are 32/64/128 examples; e.g., the Krizhevsky ILSVRC ConvNet used 256 examples.

We will look at fancier update formulas later (momentum, Adagrad, RMSProp, Adam, ...); a sketch of the momentum update appears below.
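As one example of a fancier update, here is a hedged sketch of the common SGD-with-momentum rule; the hyperparameter mu ≈ 0.9 is a typical choice, not something stated on this slide.

def momentum_step(weights, v, weights_grad, step_size=1e-3, mu=0.9):
    # One SGD-with-momentum update (sketch): the velocity v accumulates gradients,
    # and the weights follow the velocity.
    v = mu * v - step_size * weights_grad   # decay the old velocity, add the new gradient step
    weights = weights + v                   # move the parameters along the velocity
    return weights, v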

SLIDE 69

The effects of different update formulas (image credits to Alec Radford)

SLIDE 70

Aside: Image Features

SLIDE 71

Example: Color (Hue) Histogram

[Figure: each pixel votes +1 into its hue bin, building a histogram over hue bins]
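A minimal sketch of the hue-histogram feature (the bin count and helper choices are illustrative, not the lecture's code), using matplotlib's RGB-to-HSV conversion:

import numpy as np
from matplotlib.colors import rgb_to_hsv

def hue_histogram(img, num_bins=10):
    # Count how many pixels of an (H, W, 3) uint8 RGB image fall into each hue bin.
    hsv = rgb_to_hsv(img.astype(np.float64) / 255.0)
    hue = hsv[..., 0]                                             # hue channel in [0, 1]
    hist, _ = np.histogram(hue, bins=num_bins, range=(0.0, 1.0))
    return hist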

SLIDE 72

Example: HOG/SIFT features

In each 8x8 pixel region, quantize the edge orientation into 9 bins. (image from vlfeat.org)

SLIDE 73

Example: HOG/SIFT features

In each 8x8 pixel region, quantize the edge orientation into 9 bins. (image from vlfeat.org)

Many more:

GIST, LBP, Texton, SSIM, ...

SLIDE 74

Example: Bag of Words

Start from many small image patches (144 visual word vectors), learn k-means centroids to form a "vocabulary" of visual words (e.g. 1000 centroids), and then encode each image as a 1000-d histogram of visual words.
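A minimal bag-of-visual-words sketch using scikit-learn's KMeans; the patch size, number of patches per image, and vocabulary size are illustrative assumptions rather than the lecture's settings.

import numpy as np
from sklearn.cluster import KMeans

def extract_patches(img, patch_size=8, num_patches=144, seed=0):
    # Sample random square patches from an image and flatten each into a descriptor.
    rng = np.random.default_rng(seed)
    H, W = img.shape[:2]
    ys = rng.integers(0, H - patch_size, num_patches)
    xs = rng.integers(0, W - patch_size, num_patches)
    return np.stack([img[r:r + patch_size, c:c + patch_size].ravel() for r, c in zip(ys, xs)])

def bag_of_words_features(train_imgs, imgs, vocab_size=1000):
    # Fit a visual-word "vocabulary" with k-means, then encode images as word histograms.
    descriptors = np.concatenate([extract_patches(im) for im in train_imgs])
    kmeans = KMeans(n_clusters=vocab_size, n_init=10).fit(descriptors)
    feats = []
    for im in imgs:
        words = kmeans.predict(extract_patches(im))              # nearest centroid per patch
        feats.append(np.bincount(words, minlength=vocab_size))   # histogram of visual words
    return np.stack(feats)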

SLIDE 75

Two pipelines:
  • Raw pixels: [32x32x3] image → f → 10 numbers indicating class scores (f is trained directly on pixels).
  • Features: [32x32x3] image → Feature Extraction → vector describing various image statistics → f → 10 numbers indicating class scores (f is trained on the feature vector).

SLIDE 76

Next class:

Becoming a backprop ninja and Neural Networks (part 1)