Lecture 3: Loss Functions and Optimization
11 Jan 2016
Fei-Fei Li & Andrej Karpathy & Justin Johnson
Administrative: A1 is due Jan 20 (Wednesday); ~9 days left. Warning: Jan 18 (Monday) is a holiday (no class or office hours).
Recall the challenges of recognition: camera pose, illumination, deformation, occlusion, background clutter, intraclass variation.
[Figure: toy 2D data with the decision boundaries of the NN classifier vs. the 5-NN classifier]
Recall the parametric approach: a [32x32x3] array of numbers in 0...1 (3072 numbers total) goes through f(x, W) and comes out as 10 numbers indicating class scores.
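A minimal numpy sketch of this parametric score function (the weights W and bias b below are random placeholders; in practice they come from training):

import numpy as np

def f(x, W, b):
    # Linear classifier: map a flattened image to per-class scores.
    # x: (3072,) image values in 0...1, W: (10, 3072), b: (10,)
    return W.dot(x) + b

x = np.random.rand(32 * 32 * 3)          # [32x32x3] stretched into 3072 numbers
W = 0.0001 * np.random.randn(10, 3072)   # small random weights
b = np.zeros(10)
print(f(x, W, b))                        # 10 numbers, indicating class scores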
[Figure: three example CIFAR-10 images with their 10 class scores under some W; the correct class does not always receive the highest score]
TODO:
1. Define a loss function that quantifies our unhappiness with the scores across the training data.
2. Come up with a way of efficiently finding the parameters that minimize the loss function (optimization).
Multiclass SVM loss

Suppose: 3 training examples, 3 classes. With some W the scores s = f(x, W) are (columns are the cat, car, and frog images; the correct class of each image sits on the diagonal):

image:        cat    car    frog
cat score:    3.2    1.3    2.2
car score:    5.1    4.9    2.5
frog score:  -1.7    2.0   -3.1

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the scores vector, the SVM loss has the form:

L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)
For the cat image (correct class score 3.2):
L_1 = max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1) = max(0, 2.9) + max(0, -3.9) = 2.9 + 0 = 2.9
For the car image (correct class score 4.9):
L_2 = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1) = max(0, -2.6) + max(0, -1.9) = 0 + 0 = 0
For the frog image (correct class score -3.1):
L_3 = max(0, 2.2 - (-3.1) + 1) + max(0, 2.5 - (-3.1) + 1) = max(0, 5.3) + max(0, 5.6) = 5.3 + 5.6 = 10.9
The full training loss is the mean over all training examples:

L = \frac{1}{N} \sum_{i=1}^{N} L_i = (2.9 + 0 + 10.9)/3 = 4.6
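In numpy, L_i can be written in a half-vectorized form; this is a sketch mirroring the formula above, and the identity-matrix call at the end simply reuses the cat scores as a check:

import numpy as np

def L_i(x, y, W):
    # Multiclass SVM loss for one example: sum_{j != y} max(0, s_j - s_y + 1)
    s = W.dot(x)                           # class scores
    margins = np.maximum(0, s - s[y] + 1)  # hinge on every class
    margins[y] = 0                         # the correct class contributes nothing
    return np.sum(margins)

# Check against the cat column worked out above (W = I makes scores == x):
print(L_i(np.array([3.2, 5.1, -1.7]), 0, np.eye(3)))  # -> 2.9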
Q: usually at initialization W consists of small numbers, so all s ≈ 0. What is the loss? (Each of the C - 1 incorrect classes contributes max(0, 0 - 0 + 1) = 1, so L_i = C - 1; a useful sanity check at the start of training.)
E.g. suppose that we found a W such that L = 0. Is this W unique? No. For the car image, before: L_2 = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1) = max(0, -2.6) + max(0, -1.9) = 0 + 0 = 0. With W twice as large: L_2 = max(0, 2.6 - 9.8 + 1) + max(0, 4.0 - 9.8 + 1) = max(0, -6.2) + max(0, -4.8) = 0 + 0 = 0. Doubling W only widens margins that are already satisfied.
Weight regularization expresses a preference among such equivalent weights:

L = \frac{1}{N} \sum_{i=1}^{N} L_i + \lambda R(W), where \lambda = regularization strength (hyperparameter).

In common use: L2 regularization, R(W) = \sum_k \sum_l W_{k,l}^2 (also L1 regularization, elastic net, max norm, dropout).
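A fully-vectorized numpy sketch of the regularized loss, assuming the L2 penalty; the array shapes here are one common convention, not the only one:

import numpy as np

def full_loss(X, y, W, lam):
    # X: (D, N) images as columns, y: (N,) integer labels, W: (C, D)
    N = X.shape[1]
    S = W.dot(X)                              # (C, N) class scores
    correct = S[y, np.arange(N)]              # (N,) correct-class scores
    margins = np.maximum(0, S - correct + 1)  # hinge, broadcast over classes
    margins[y, np.arange(N)] = 0              # zero out the correct classes
    data_loss = np.sum(margins) / N           # mean SVM loss over the data
    reg_loss = lam * np.sum(W * W)            # L2: sum of squared weights
    return data_loss + reg_loss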
Softmax classifier (multinomial logistic regression)

Interpret the scores as unnormalized log probabilities of the classes:

P(Y = k \mid X = x_i) = \frac{e^{s_k}}{\sum_j e^{s_j}}, where s = f(x_i; W)

We want to maximize the log likelihood or, for a loss function, to minimize the negative log likelihood of the correct class:

L_i = -\log P(Y = y_i \mid X = x_i)

In summary:

L_i = -\log\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)
Worked example for the cat image, scores s = [3.2, 5.1, -1.7]:

unnormalized log probabilities: [3.2, 5.1, -1.7]
exp -> unnormalized probabilities: [24.5, 164.0, 0.18]
normalize -> probabilities: [0.13, 0.87, 0.00]

L_i = -log(0.13) = 0.89 (base-10 log; with the natural log used in practice, L_i ≈ 2.04)
Q: What is the min/max possible loss L_i? (min: 0, approached as the correct class takes all the probability mass; max: unbounded, as the correct-class probability goes to 0.)
Q: usually at initialization W consists of small numbers, so all s ≈ 0. What is the loss? (All classes get probability 1/C, so L_i = -log(1/C) = log(C); another handy sanity check for the first iteration.)
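A numpy sketch of the softmax loss for one example, using the natural log and the standard max-subtraction trick for numerical stability (shifting all scores by a constant does not change the probabilities):

import numpy as np

def softmax_loss_i(s, y):
    # s: (C,) class scores, y: integer label of the correct class
    s = s - np.max(s)                  # stability: avoids overflow in exp
    p = np.exp(s) / np.sum(np.exp(s))  # normalized class probabilities
    return -np.log(p[y])               # negative log likelihood of correct class

print(softmax_loss_i(np.array([3.2, 5.1, -1.7]), 0))  # -> ~2.04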
Interactive web demo of linear classification: http://vision.stanford.edu/teaching/cs231n/linear-classify-demo/
Recap: given a dataset of (x_i, y_i) pairs, compute scores s = f(x_i; W), apply a data loss (e.g. Softmax or SVM), and add regularization to obtain the full loss L = \frac{1}{N} \sum_i L_i + \lambda R(W).
Optimization: how do we find a W that minimizes the loss? Strategy: follow the slope. In one dimension, the derivative of a function is

\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}

In multiple dimensions, the gradient is the vector of partial derivatives.
Evaluating the gradient numerically: start from the current W with loss, say, 1.25347, nudge one dimension of W by h = 0.0001, and recompute the loss:

first dimension: loss becomes 1.25322, so dW[0] ≈ (1.25322 - 1.25347)/0.0001 = -2.5
second dimension: loss becomes 1.25353, so dW[1] ≈ (1.25353 - 1.25347)/0.0001 = 0.6
third dimension: loss stays 1.25347, so dW[2] ≈ (1.25347 - 1.25347)/0.0001 = 0

...and so on for every dimension of W.
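The dimension-by-dimension procedure above looks roughly like this in numpy (a sketch; f stands for any scalar function of W, such as the training loss):

import numpy as np

def eval_numerical_gradient(f, W, h=1e-4):
    # One-sided differences, mirroring the formula used above:
    # grad[ix] = (f(W + h at ix) - f(W)) / h for each dimension ix.
    grad = np.zeros_like(W)
    fW = f(W)                                # loss at the current W
    it = np.nditer(W, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old = W[ix]
        W[ix] = old + h                      # nudge one dimension
        grad[ix] = (f(W) - fW) / h           # slope along that dimension
        W[ix] = old                          # restore
        it.iternext()
    return grad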
The loss is just a function of W, so instead of looping over dimensions we can use calculus to write down the gradient directly. In summary: the numerical gradient is approximate, slow, and easy to write; the analytic gradient is exact and fast but error-prone => in practice, always derive and use the analytic gradient, but check the implementation against the numerical gradient (a gradient check).
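With a gradient in hand, vanilla gradient descent is a tiny loop. Here is a self-contained toy run on f(w) = ||w||^2, a stand-in for the real training loss:

import numpy as np

w = np.array([3.0, -2.0])      # toy "weights"
step_size = 0.1
for _ in range(100):
    grad = 2 * w               # analytic gradient of f(w) = ||w||^2
    w += -step_size * grad     # step in the direction of steepest descent
print(w)                       # -> essentially [0, 0], the minimum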
Mini-batch gradient descent: only use a small portion of the training set to compute the gradient at each step. Common mini-batch sizes are 32/64/128 examples; e.g. Krizhevsky's ILSVRC ConvNet used 256 examples.
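In code the loop looks roughly like this (a sketch: sample_training_data, evaluate_gradient, loss_fun, data, weights, and step_size are placeholder names standing in for your own data pipeline and analytic-gradient routine):

while True:
    data_batch = sample_training_data(data, 256)  # sample 256 examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += -step_size * weights_grad          # parameter update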
[Figure: example of optimization progress while training a neural network; the loss over mini-batches goes down over time, with noise from the mini-batch sampling.]
[Figure: the effects of step size (or "learning rate") on the loss curve.]
The simple gradient update is only one choice; later in the class we will look at fancier update formulas (momentum, Adagrad, RMSProp, Adam, ...).
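As a preview of one such formula, the momentum update keeps a running velocity and only changes the update line of the loop above (v and mu are names introduced here: a velocity buffer initialized to zeros and a friction-like hyperparameter, typically around 0.9):

v = mu * v - step_size * weights_grad   # accumulate a velocity
weights += v                            # move by the velocity instead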
[Animation comparing the behavior of different update formulas (image credits to Alec Radford)]
Aside: image features. In practice, linear classifiers were rarely applied to raw pixels; one first computed feature representations of the image. Example: a color (hue) histogram: for every pixel, compute its hue and add +1 to the corresponding hue bin.
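A small sketch of this feature using numpy and matplotlib's color conversion (16 bins and the random test image are arbitrary choices):

import numpy as np
from matplotlib.colors import rgb_to_hsv

def hue_histogram(img, n_bins=16):
    # img: (H, W, 3) uint8 RGB image; returns pixel counts per hue bin
    hsv = rgb_to_hsv(img.astype(np.float64) / 255.0)  # hue lands in [0, 1]
    hist, _ = np.histogram(hsv[..., 0], bins=n_bins, range=(0.0, 1.0))
    return hist  # each pixel contributed +1 to exactly one hue bin

print(hue_histogram(np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)))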
Example: HOG/SIFT features: in each 8x8 pixel region, quantize the edge orientation into bins and build a histogram of edge orientations (image from vlfeat.org).
Example: bag of words. Extract random patches from images and describe each as a vector (e.g. 144 visual word vectors), learn k-means centroids over them to build a "vocabulary" of visual words (e.g. 1000 centroids), then represent each image as a 1000-d vector: a histogram of visual words.
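A rough sketch of that pipeline using scipy's k-means; the random arrays stand in for real image patches, with 1000 centroids as in the example:

import numpy as np
from scipy.cluster.vq import kmeans2, vq

# Stand-ins: 10000 training patches and one image's 144 patches, 48-d each
train_patches = np.random.randn(10000, 48)
image_patches = np.random.randn(144, 48)

# Learn the "vocabulary" of visual words
centroids, _ = kmeans2(train_patches, k=1000, minit='points')

# Assign each patch of the image to its nearest visual word, then histogram
words, _ = vq(image_patches, centroids)
feature = np.bincount(words, minlength=1000)   # 1000-d histogram of visual words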
Putting it together: the classical pipeline takes a [32x32x3] image, computes a vector describing various image statistics, and trains only the linear classifier on top to produce the 10 numbers indicating class scores. Next time: train the whole pipeline end to end, starting from the raw [32x32x3] pixels, with every stage trained.