Lecture 3: Loss Functions and Optimization
Fei-Fei Li & Justin Johnson & Serena Yeung, CS231n, April 10, 2018


SLIDE 1

Lecture 3: Loss Functions and Optimization

SLIDE 2

Administrative: Live Questions

We’ll use Zoom to take questions from remote students live-streaming the lecture. Check Piazza for instructions and the meeting ID: https://piazza.com/class/jdmurnqexkt47x?cid=108

SLIDE 3

Administrative: Office Hours

Office hours started this week; the schedule is on the course website: http://cs231n.stanford.edu/office_hours.html
Areas of expertise for all TAs are posted on Piazza: https://piazza.com/class/jdmurnqexkt47x?cid=155

SLIDE 4

Administrative: Assignment 1

Assignment 1 is released: http://cs231n.github.io/assignments2018/assignment1/
Due Wednesday April 18, 11:59pm


SLIDE 5

Administrative: Google Cloud

You should have received an email yesterday about claiming a coupon for Google Cloud; make a private post on Piazza if you didn’t get it.
There was a problem with @cs.stanford.edu emails; this is resolved.
If you have problems with coupons: post on Piazza. DO NOT email me, DO NOT email Prof. Phil Levis.

SLIDE 6

Administrative: SCPD Tutors

This year the SCPD office has hired tutors specifically for SCPD students taking CS231N; you should have received an email about this yesterday (4/9/2018).

SLIDE 7

Administrative: Poster Session

Poster session will be Tuesday June 12 (our final exam slot). Attendance is mandatory for non-SCPD students; if you don’t have a legitimate reason for skipping it then you forfeit the points for the poster presentation.

SLIDE 8

Recall from last time: Challenges of recognition

[Figure: example images illustrating the challenges of recognition: viewpoint, illumination, deformation, occlusion, clutter, intraclass variation. Images CC0 1.0 public domain or CC-BY 2.0 (Umberto Salvagnin, jonsson).]

SLIDE 9

Recall from last time: data-driven approach, kNN

[Figures: 1-NN vs. 5-NN classifier decision boundaries; train / validation / test splits of the data.]

SLIDE 10

Recall from last time: Linear Classifier


f(x,W) = Wx + b

SLIDE 11

Recall from last time: Linear Classifier

TODO:
1. Define a loss function that quantifies our unhappiness with the scores across the training data.
2. Come up with a way of efficiently finding the parameters that minimize the loss function. (optimization)

Cat image by Nikita is licensed under CC-BY 2.0; Car image is CC0 1.0 public domain; Frog image is in the public domain

SLIDE 12

Suppose: 3 training examples, 3 classes. With some W the scores f(x, W) = Wx + b are:

        [cat]   [car]   [frog]
cat      3.2     1.3     2.2
car      5.1     4.9     2.5
frog    -1.7     2.0    -3.1

(Rows are classes; columns are the three training images: a cat, a car, a frog.)

SLIDE 13

Scores as on Slide 12.

A loss function tells how good our current classifier is. Given a dataset of examples {(x_i, y_i)}_{i=1}^N, where x_i is an image and y_i is an (integer) label, the loss over the dataset is a sum of the per-example losses:

L = (1/N) Σ_i L_i(f(x_i, W), y_i)

SLIDE 14

Scores as on Slide 12. Multiclass SVM loss:

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the vector of scores, the SVM loss has the form:

L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)

SLIDE 15

Scores and SVM-loss setup as on Slide 14.

“Hinge loss”: each term max(0, s_j - s_{y_i} + 1) is zero once the correct-class score s_{y_i} exceeds s_j by the margin of 1, and grows linearly as s_{y_i} decreases.

SLIDE 16

Scores and SVM-loss setup as on Slide 14.

SLIDE 17

Scores and SVM-loss setup as on Slide 14. For the first (cat) image:

L_1 = max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1)
    = max(0, 2.9) + max(0, -3.9)
    = 2.9 + 0 = 2.9

Losses: 2.9

SLIDE 18

Scores and SVM-loss setup as on Slide 14. For the second (car) image:

L_2 = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1)
    = max(0, -2.6) + max(0, -1.9)
    = 0 + 0 = 0

Losses: 2.9, 0

SLIDE 19

Scores and SVM-loss setup as on Slide 14. For the third (frog) image:

L_3 = max(0, 2.2 - (-3.1) + 1) + max(0, 2.5 - (-3.1) + 1)
    = max(0, 6.3) + max(0, 6.6)
    = 6.3 + 6.6 = 12.9

Losses: 2.9, 0, 12.9

SLIDE 20

Scores and SVM-loss setup as on Slide 14; losses: 2.9, 0, 12.9. The loss over the full dataset is the average:

L = (2.9 + 0 + 12.9) / 3 = 5.27

SLIDE 21

Scores and SVM-loss setup as on Slide 14; losses: 2.9, 0, 12.9.

Q: What happens to the loss if the car scores change a bit?

SLIDE 22

Scores and SVM-loss setup as on Slide 14; losses: 2.9, 0, 12.9.

Q2: What is the min/max possible loss?

SLIDE 23

Scores and SVM-loss setup as on Slide 14; losses: 2.9, 0, 12.9.

Q3: At initialization W is small, so all s ≈ 0. What is the loss?

SLIDE 24

Scores and SVM-loss setup as on Slide 14; losses: 2.9, 0, 12.9.

Q4: What if the sum was over all classes (including j = y_i)?

SLIDE 25

Scores and SVM-loss setup as on Slide 14; losses: 2.9, 0, 12.9.

Q5: What if we used a mean instead of a sum?

SLIDE 26

Scores and SVM-loss setup as on Slide 14; losses: 2.9, 0, 12.9.

Q6: What if we used a squared hinge instead, L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)^2 ?

SLIDE 27

Multiclass SVM Loss: Example code

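The example code from this slide did not survive the transcription. Below is a numpy sketch consistent with the half-vectorized version shown in the lecture, computing the multiclass SVM loss for a single example (x is a column of pixel data, y an integer label, W one row of weights per class):

import numpy as np

def L_i_vectorized(x, y, W):
    scores = W.dot(x)                                # class scores s = f(x, W)
    margins = np.maximum(0, scores - scores[y] + 1)  # hinge for every class
    margins[y] = 0                                   # the j == y term is excluded
    return np.sum(margins)

Called on the cat image from Slide 12 (scores 3.2, 5.1, -1.7) with y = 0, the margins work out to max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1) = 2.9, matching the hand computation on Slide 17.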

SLIDE 28

E.g. Suppose that we found a W such that L = 0. Is this W unique?

SLIDE 29

E.g. Suppose that we found a W such that L = 0. Is this W unique? No! 2W also has L = 0!

SLIDE 30

Suppose: 3 training examples, 3 classes. With some W the scores are as on Slide 12.

For the car image, before:

L_2 = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1)
    = max(0, -2.6) + max(0, -1.9)
    = 0 + 0 = 0

With W twice as large:

L_2 = max(0, 2.6 - 9.8 + 1) + max(0, 4.0 - 9.8 + 1)
    = max(0, -6.2) + max(0, -4.8)
    = 0 + 0 = 0

SLIDE 31

E.g. Suppose that we found a W such that L = 0. Is this W unique? No! 2W also has L = 0! How do we choose between W and 2W?

SLIDE 32

Regularization

Data loss: Model predictions should match training data.

SLIDE 33

Regularization

Data loss: Model predictions should match training data.
Regularization: Prevent the model from doing too well on training data.

SLIDE 34

Regularization

Data loss: Model predictions should match training data. Regularization: Prevent the model from doing too well on training data.

L(W) = (1/N) Σ_i L_i(f(x_i, W), y_i) + λ R(W)

λ = regularization strength (hyperparameter)

SLIDE 35

Regularization

L(W) = (1/N) Σ_i L_i(f(x_i, W), y_i) + λ R(W), with λ = regularization strength (hyperparameter).

Simple examples:
L2 regularization: R(W) = Σ_k Σ_l W_{k,l}^2
L1 regularization: R(W) = Σ_k Σ_l |W_{k,l}|
Elastic net (L1 + L2): R(W) = Σ_k Σ_l (β W_{k,l}^2 + |W_{k,l}|)
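Not from the slides: a minimal numpy sketch of the full L2-regularized SVM objective, assuming X stores one example per column and y holds integer labels:

import numpy as np

def svm_loss_regularized(W, X, y, lam):
    # W: (C, D) weights; X: (D, N) data; y: (N,) integer labels
    n = X.shape[1]
    scores = W.dot(X)                              # (C, N) class scores
    correct = scores[y, np.arange(n)]              # correct-class score per example
    margins = np.maximum(0, scores - correct + 1)  # hinge on every class
    margins[y, np.arange(n)] = 0                   # skip j == y_i
    data_loss = margins.sum() / n                  # average data loss
    reg_loss = lam * np.sum(W * W)                 # L2 regularization: lambda * R(W)
    return data_loss + reg_loss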

SLIDE 36

Regularization

Objective and simple examples (L2, L1, elastic net) as on Slide 35.

More complex regularizers: dropout, batch normalization, stochastic depth, fractional pooling, etc.

SLIDE 37

Regularization

Data loss and regularization as on Slide 34. Why regularize?

  • Express preferences over weights
  • Make the model simple so it works on test data
  • Improve optimization by adding curvature
SLIDE 38

Regularization: Expressing Preferences

L2 Regularization

x = [1, 1, 1, 1]
w1 = [1, 0, 0, 0]
w2 = [0.25, 0.25, 0.25, 0.25]

Both give the same score: w1ᵀx = w2ᵀx = 1.

SLIDE 39

Regularization: Expressing Preferences

L2 Regularization (same example as Slide 38): w1 and w2 give the same score, but L2 regularization prefers w2; it likes to “spread out” the weights.

SLIDE 40

Regularization: Prefer Simpler Models

[Figure: training points in the (x, y) plane.]

SLIDE 41

Regularization: Prefer Simpler Models

[Figure: two fits to the training points, f1 (a curve through every point) and f2 (a simple line).]

SLIDE 42

Regularization: Prefer Simpler Models

[Figure: f1 fits every training point; f2 is a simple line.]

Regularization pushes against fitting the data too well, so we don't fit noise in the data.

SLIDE 43

Softmax Classifier (Multinomial Logistic Regression)

Scores for the cat image: cat 3.2, car 5.1, frog -1.7.

Want to interpret raw classifier scores as probabilities.

SLIDE 44

Softmax Classifier (Multinomial Logistic Regression). Scores for the cat image: cat 3.2, car 5.1, frog -1.7. Want to interpret raw classifier scores as probabilities.

Softmax function: P(Y = k | X = x_i) = e^{s_k} / Σ_j e^{s_j}, with s = f(x_i; W).

SLIDE 45

Softmax setup as on Slide 44.

exp: scores (3.2, 5.1, -1.7) → unnormalized probabilities (24.5, 164.0, 0.18). Probabilities must be >= 0.

SLIDE 46

Softmax setup as on Slide 44.

exp: scores (3.2, 5.1, -1.7) → unnormalized probabilities (24.5, 164.0, 0.18); probabilities must be >= 0.
normalize: → probabilities (0.13, 0.87, 0.00); probabilities must sum to 1.

SLIDE 47

Softmax pipeline as on Slide 46. The raw scores (3.2, 5.1, -1.7) are unnormalized log-probabilities / logits.

SLIDE 48

Softmax pipeline as on Slide 46: logits (3.2, 5.1, -1.7) → exp → (24.5, 164.0, 0.18) → normalize → (0.13, 0.87, 0.00).

L_i = -log P(Y = y_i | X = x_i) = -log(0.13) = 2.04
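Not from the slides: a quick numpy check of these numbers (subtracting the max score before exponentiating is a standard stability trick and does not change the result):

import numpy as np

scores = np.array([3.2, 5.1, -1.7])    # logits for cat, car, frog
exps = np.exp(scores - scores.max())   # shift by max for numerical stability
probs = exps / exps.sum()              # softmax probabilities
loss = -np.log(probs[0])               # cross-entropy loss, true class = cat
print(probs.round(2), round(loss, 2))  # [0.13 0.87 0.  ] 2.04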

SLIDE 49

Softmax pipeline as on Slide 48; L_i = -log(0.13) = 2.04.

Maximum Likelihood Estimation: choose probabilities to maximize the likelihood of the observed data. (See CS 229 for details.)

SLIDE 50

Softmax pipeline as on Slide 48. Compare the computed probabilities (0.13, 0.87, 0.00) against the correct probabilities (1.00, 0.00, 0.00).

SLIDE 51

Comparison as on Slide 50, measured with the Kullback–Leibler divergence:

D_KL(P ‖ Q) = Σ_y P(y) log( P(y) / Q(y) )

SLIDE 52

Comparison as on Slide 50, measured with the cross-entropy:

H(P, Q) = H(P) + D_KL(P ‖ Q)

SLIDE 53

Softmax Classifier (Multinomial Logistic Regression). Scores for the cat image: cat 3.2, car 5.1, frog -1.7. Want to interpret raw classifier scores as probabilities: apply the softmax function and maximize the probability of the correct class. Putting it all together:

L_i = -log( e^{s_{y_i}} / Σ_j e^{s_j} )

SLIDE 54

Softmax setup as on Slide 53.

Q: What is the min/max possible loss L_i?

SLIDE 55

Softmax setup as on Slide 53.

Q: What is the min/max possible loss L_i? A: min 0, max infinity.

SLIDE 56

Softmax setup as on Slide 53.

Q2: At initialization all s will be approximately equal; what is the loss?

SLIDE 57

Softmax setup as on Slide 53.

Q2: At initialization all s will be approximately equal; what is the loss? A: log(C), e.g. log(10) ≈ 2.3.

SLIDE 58

Softmax vs. SVM:

Softmax: L_i = -log( e^{s_{y_i}} / Σ_j e^{s_j} )
SVM: L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)

SLIDE 59

Softmax vs. SVM

SLIDE 60

Softmax vs. SVM. Assume scores [10, -2, 3], [10, 9, 9], [10, -100, -100], with the correct class y_i = 0 (the first score) in each case.

Q: Suppose I take a datapoint and I jiggle it a bit (changing its score slightly). What happens to the loss in both cases?

SLIDE 61

Recap

  • We have some dataset of (x, y)
  • We have a score function: s = f(x; W) = Wx
  • We have a loss function, e.g.:

Softmax: L_i = -log( e^{s_{y_i}} / Σ_j e^{s_j} )
SVM: L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)
Full loss: L = (1/N) Σ_i L_i + R(W)

SLIDE 62

Recap (as on Slide 61).

How do we find the best W?

SLIDE 63

Optimization

SLIDE 64

[Figure: a hilly landscape standing in for the loss surface; image is CC0 1.0 public domain.]

SLIDE 65

[Figure: walking downhill on the loss landscape; walking man image is CC0 1.0 public domain.]

SLIDE 66

Strategy #1: A first very bad idea solution: Random search
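The code from this slide is not preserved in the transcription. A sketch in the spirit of the course notes, assuming CIFAR-10 training data X_train (3073 x 50,000, bias folded in), labels Y_train, and a loss function L over the whole training set are already defined:

import numpy as np

bestloss = float("inf")
for num in range(1000):
    W = np.random.randn(10, 3073) * 0.0001  # try a random parameter setting
    loss = L(X_train, Y_train, W)           # loss over the full training set
    if loss < bestloss:                     # keep the best W seen so far
        bestloss = loss
        bestW = W
    print('in attempt %d the loss was %f, best %f' % (num, loss, bestloss))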

SLIDE 67

Let's see how well this works on the test set... 15.5% accuracy! Not bad! (SOTA is ~95%)

SLIDE 68

Strategy #2: Follow the slope

SLIDE 69

Strategy #2: Follow the slope

In 1 dimension, the derivative of a function:

df(x)/dx = lim_{h→0} (f(x + h) - f(x)) / h

In multiple dimensions, the gradient is the vector of partial derivatives along each dimension. The slope in any direction is the dot product of the direction with the gradient. The direction of steepest descent is the negative gradient.

SLIDE 70

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]  →  loss 1.25347

gradient dW: [?, ?, ?, ?, ?, ?, ?, ?, ?, …]

SLIDE 71

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]  →  loss 1.25347
W + h (first dim): [0.34 + 0.0001, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]  →  loss 1.25322

gradient dW: [?, ?, ?, ?, ?, ?, ?, ?, ?, …]

SLIDE 72

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]  →  loss 1.25347
W + h (first dim): [0.34 + 0.0001, -1.11, …]  →  loss 1.25322

gradient dW: [-2.5, ?, ?, ?, ?, ?, ?, ?, ?, …]
(1.25322 - 1.25347) / 0.0001 = -2.5

SLIDE 73

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]  →  loss 1.25347
W + h (second dim): [0.34, -1.11 + 0.0001, 0.78, …]  →  loss 1.25353

gradient dW: [-2.5, ?, ?, ?, ?, ?, ?, ?, ?, …]

SLIDE 74

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]  →  loss 1.25347
W + h (second dim): [0.34, -1.11 + 0.0001, 0.78, …]  →  loss 1.25353

gradient dW: [-2.5, 0.6, ?, ?, ?, ?, ?, ?, ?, …]
(1.25353 - 1.25347) / 0.0001 = 0.6

SLIDE 75

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]  →  loss 1.25347
W + h (third dim): [0.34, -1.11, 0.78 + 0.0001, 0.12, …]  →  loss 1.25347

gradient dW: [-2.5, 0.6, ?, ?, ?, ?, ?, ?, ?, …]

SLIDE 76

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]  →  loss 1.25347
W + h (third dim): [0.34, -1.11, 0.78 + 0.0001, 0.12, …]  →  loss 1.25347

gradient dW: [-2.5, 0.6, 0, ?, ?, ?, ?, ?, ?, …]
(1.25347 - 1.25347) / 0.0001 = 0

SLIDE 77

gradient dW: [-2.5, 0.6, 0, ?, ?, ?, ?, ?, ?, …], filled in one dimension at a time as on Slides 71-76.

Numeric Gradient:

  • Slow! Need to loop over all dimensions
  • Approximate

(A finite-difference sketch follows below.)
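Not from the slides: a minimal numpy sketch of the forward-difference procedure walked through above, assuming f maps a weight array to a scalar loss:

import numpy as np

def numerical_gradient(f, W, h=1e-4):
    # approximate dL/dW one dimension at a time: (f(W + h) - f(W)) / h
    grad = np.zeros_like(W)
    base = f(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = W[idx]
        W[idx] = old + h               # bump a single dimension
        grad[idx] = (f(W) - base) / h
        W[idx] = old                   # restore
        it.iternext()
    return grad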
SLIDE 78

This is silly. The loss is just a function of W:

L = (1/N) Σ_i L_i + Σ_k W_k^2
L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)
s = f(x; W) = Wx

want ∇_W L

SLIDE 79

This is silly. The loss is just a function of W: we want ∇_W L.

Use calculus to compute an analytic gradient.

SLIDE 80

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]  →  loss 1.25347

dW = ... (some function of data and W)

gradient dW: [-2.5, 0.6, 0, 0.2, 0.7, -0.5, 1.1, 1.3, -2.1, …]

SLIDE 81

In summary:

  • Numerical gradient: approximate, slow, easy to write
  • Analytic gradient: exact, fast, error-prone

In practice: Always use analytic gradient, but check implementation with numerical gradient. This is called a gradient check (see the sketch below).
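Not from the slides: a common-practice sketch of a gradient check, comparing a precomputed analytic gradient against centered finite differences at a few random coordinates:

import numpy as np

def grad_check(f, analytic_grad, W, n_checks=10, h=1e-5):
    for _ in range(n_checks):
        idx = tuple(np.random.randint(d) for d in W.shape)
        old = W[idx]
        W[idx] = old + h; fp = f(W)   # loss at W + h in one coordinate
        W[idx] = old - h; fm = f(W)   # loss at W - h
        W[idx] = old                  # restore
        num = (fp - fm) / (2 * h)     # centered finite difference
        ana = analytic_grad[idx]
        rel = abs(num - ana) / max(abs(num) + abs(ana), 1e-12)
        print('numeric %f analytic %f relative error %.2e' % (num, ana, rel))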

SLIDE 82

Gradient Descent
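The update-loop code from this slide is not preserved. A runnable toy sketch of vanilla gradient descent, with a hypothetical quadratic loss standing in for the classifier loss:

import numpy as np

def loss_fun(W):                     # hypothetical stand-in loss
    return float(np.sum((W - 1.0) ** 2))

def evaluate_gradient(W):            # its analytic gradient
    return 2.0 * (W - 1.0)

W = np.random.randn(5)               # initialize parameters
step_size = 0.1                      # learning rate (hyperparameter)
for _ in range(100):
    weights_grad = evaluate_gradient(W)
    W += -step_size * weights_grad   # step along the negative gradient
print(W.round(3))                    # approaches [1. 1. 1. 1. 1.]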

SLIDE 83

[Figure: contour plot of the loss over (W_1, W_2); starting from the original W, repeated steps follow the negative gradient direction toward the minimum.]

SLIDE 84

SLIDE 85

Stochastic Gradient Descent (SGD)

The full sum L(W) = (1/N) Σ_{i=1}^N L_i(x_i, y_i, W) + λ R(W), and its gradient, are expensive to compute when N is large! Approximate the sum using a minibatch of examples; 32 / 64 / 128 are common minibatch sizes.
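Not from the slides: a runnable toy sketch of minibatch SGD, with a hypothetical least-squares problem standing in for the classifier loss:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50000, 10))                # toy data, N = 50,000 examples
true_w = np.arange(10.0)
y = X @ true_w + 0.1 * rng.normal(size=50000)   # noisy targets

W = np.zeros(10)
step_size = 0.05
for _ in range(1000):
    idx = rng.integers(0, X.shape[0], size=128)  # sample a minibatch of 128
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ W - yb) / len(idx)   # gradient on the minibatch only
    W += -step_size * grad                       # parameter update
print(W.round(2))                                # close to [0. 1. 2. ... 9.]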

SLIDE 86

Interactive Web Demo time....

http://vision.stanford.edu/teaching/cs231n-demos/linear-classify/

SLIDE 87

Interactive Web Demo time....

SLIDE 88

Aside: Image Features

Image → f(x) = Wx → Class scores

SLIDE 89

Aside: Image Features

Image → Feature Representation → f(x) = Wx → Class scores

SLIDE 90

Image Features: Motivation

[Figure: red and blue points in the (x, y) plane.] Cannot separate the red and blue points with a linear classifier.

SLIDE 91

Image Features: Motivation

f(x, y) = (r(x, y), θ(x, y))

Cannot separate the red and blue points with a linear classifier in the original (x, y) space; after applying the polar feature transform, the points can be separated by a linear classifier.
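Not from the slides: the same transform as a short numpy sketch:

import numpy as np

def polar_features(x, y):
    # Cartesian (x, y) -> polar (r, theta)
    return np.hypot(x, y), np.arctan2(y, x)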

SLIDE 92

Example: Color Histogram

[Figure: each pixel's hue is quantized into a bin, adding +1 to that bin; the resulting color histogram is the feature vector.]
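Not from the slides: a hypothetical sketch of such a color-histogram feature, assuming a per-pixel hue array with values in [0, 1):

import numpy as np

def color_histogram(hue, n_bins=8):
    # quantize each pixel's hue into a bin, then count pixels per bin
    bins = np.minimum((hue * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), minlength=n_bins)
    return hist / hist.sum()   # normalized histogram = feature vector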

SLIDE 93

Example: Histogram of Oriented Gradients (HoG)

Divide the image into 8x8 pixel regions. Within each region, quantize the edge direction into 9 bins. Example: a 320x240 image gets divided into 40x30 bins; each bin has 9 numbers, so the feature vector has 40*30*9 = 10,800 numbers.

Lowe, “Object recognition from local scale-invariant features”, ICCV 1999. Dalal and Triggs, “Histograms of oriented gradients for human detection”, CVPR 2005.

SLIDE 94

Example: Bag of Words

Step 1 (build codebook): extract random patches, then cluster the patches to form a “codebook” of “visual words”.
Step 2 (encode images): represent each image using the codebook.

Fei-Fei and Perona, “A bayesian hierarchical model for learning natural scene categories”, CVPR 2005

SLIDE 95

Aside: Image Features


SLIDE 96

Image features vs ConvNets

[Figure: top, hand-designed feature extraction feeding a trained classifier f that outputs 10 numbers giving scores for classes; bottom, a ConvNet trained end-to-end, also outputting 10 numbers giving scores for classes.]

SLIDE 97

Next time:

Introduction to neural networks
Backpropagation