Lecture 14: Linear Classifiers
Justin Johnson, EECS 442 WI 2020, February 25, 2020
Administrative:
HW3 due Wednesday, March 4, 11:59pm.
TAs will not be checking Piazza over Spring Break. You are strongly encouraged to finish the assignment by Friday, February 28.
Example training set
Training (x_i, y_i): fit $w^* = \arg\min_w \sum_{i=1}^{N} (w^T x_i - y_i)^2 = \arg\min_w \|y - Xw\|_2^2$
Testing/Inference (x): given a new input, what's the prediction?
Objective: $\arg\min_w \|y - Xw\|_2^2 + \lambda \|w\|_2^2$ (Loss + Regularization, with $\lambda$ controlling the trade-off)
Loss alone: least-squares. Regularization alone: $w = 0$. Together: something sensible?
Objective: $\arg\min_w \|y - Xw\|_2^2 + \lambda \|w\|_2^2$ (Loss + Regularization trade-off)
$w$ is a parameter, since we optimize for it on the training set; $\lambda$ is a hyperparameter, since we choose it before fitting the training set.
Idea #1: Choose hyperparameters that work best on the data. BAD: $\lambda = 0$ always works best on training data.
Idea #2: Split data into train and test; choose hyperparameters that work best on test data. BAD: No idea how we will perform on new data.
[Diagram: Your Dataset split into train | test]
Idea #3: Split data into train, val, and test; choose hyperparameters on val and evaluate on test. Better!
[Diagram: Your Dataset split into train | validation | test]
Idea #4: Cross-Validation: Split data into folds, try each fold as validation and average the results.
[Diagram: Your Dataset split into fold 1 | fold 2 | fold 3 | fold 4 | fold 5 | test, with each fold taking a turn as the validation set]
Useful for small datasets, but (unfortunately) not used too frequently in deep learning
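As a concrete sketch (not from the slides): a minimal k-fold cross-validation loop for choosing $\lambda$, assuming the closed-form ridge fit $w^* = (X^T X + \lambda I)^{-1} X^T y$ used on the next slide. The fold count and candidate values are illustrative.

import numpy as np

def cross_validate_lambda(X, y, candidates=(0.01, 0.1, 1.0, 10.0), k=5):
    # Try each fold as the validation set and average the validation errors.
    N, d = X.shape
    folds = np.array_split(np.random.permutation(N), k)
    best_lam, best_err = None, np.inf
    for lam in candidates:
        errs = []
        for i in range(k):
            val = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            w = np.linalg.solve(X[train].T @ X[train] + lam * np.eye(d),
                                X[train].T @ y[train])
            errs.append(np.mean((X[val] @ w - y[val]) ** 2))
        if np.mean(errs) < best_err:
            best_lam, best_err = lam, np.mean(errs)
    return best_lam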
Training / Validation / Test: Fit model parameters on the training set; find hyperparameters by testing on the validation set; evaluate on the entirely unseen test set.
Training set: use these data points to fit $w^* = (X^T X + \lambda I)^{-1} X^T y$.
Validation set: evaluate on these points for different $\lambda$, pick the best.
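A minimal numpy sketch of this recipe (the variable names are my own, and a linear solve is used instead of an explicit matrix inverse):

import numpy as np

def fit_ridge(X_train, y_train, lam):
    # Closed form: w* = (X^T X + lam * I)^{-1} X^T y, fit on the training set only.
    d = X_train.shape[1]
    return np.linalg.solve(X_train.T @ X_train + lam * np.eye(d), X_train.T @ y_train)

def choose_lambda(X_train, y_train, X_val, y_val, candidates=(0.01, 0.1, 1.0, 10.0)):
    # Evaluate each candidate lambda on the held-out validation points, pick the best.
    errors = {lam: np.mean((X_val @ fit_ridge(X_train, y_train, lam) - y_val) ** 2)
              for lam in candidates}
    return min(errors, key=errors.get)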
Start with simplest example: binary classification
Cat or not cat?
Actually: a feature vector representing the image
$x = [x_1, x_2, \dots, x_N]$
Rifkin, Yeo, Poggio. Regularized Least Squares Classification (http://cbcl.mit.edu/publications/ps/rlsc.pdf). 2003.
Redmon, Divvala, Girshick, Farhadi. You Only Look Once: Unified, Real-Time Object Detection. CVPR 2016.
Treat as regression: $x_i$ is the image feature; $y_i$ is 1 if it's a cat, 0 if it's not a cat. Minimize the least-squares loss.
Training (x_i, y_i): $\arg\min_w \sum_{i=1}^{N} (w^T x_i - y_i)^2$
Inference (x): predict cat if $w^T x$ exceeds a threshold (e.g., 0.5 for 0/1 labels).
Unprincipled in theory, but often effective in practice. The reverse (regression via discrete bins) is also common.
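A minimal sketch of this classification-via-regression recipe, assuming 0/1 labels as above; the 0.5 threshold and the small ridge term are my own choices, not from the slide.

import numpy as np

def train_cat_classifier(X, y, lam=0.1):
    # X: (N, d) image features; y: (N,) labels, 1 = cat, 0 = not cat.
    # Ordinary least squares with a small ridge term for numerical stability.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def is_cat(w, x, threshold=0.5):
    # Inference: threshold the regression output.
    return float(w @ x) > threshold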
If this: cat. If this: dog. If this: hippo.
Rule: if this, then cat. But a new test image is never exactly the same.
[Diagram: known images with labels, and a test image]
(1) Compute distance between feature vectors (2) find nearest (3) use label.
Training (xi,yi):
Inference (x):
bestDist, prediction = float('inf'), None
for i in range(N):
    d = dist(X[i], x)        # distance from test point x to training example i
    if d < bestDist:
        bestDist, prediction = d, y[i]   # keep the label of the closest point so far
Nearest neighbors in two dimensions. Points are training examples; colors give training labels. Background colors give the category a test point would be assigned.
[Figure: 2D scatter over axes x0 and x1, with a test point x]
The decision boundary is the boundary between two classification regions. Decision boundaries can be noisy and affected by outliers. How to smooth out decision boundaries? Use more neighbors!
Instead of copying label from nearest neighbor, take majority vote from K closest points
Using more neighbors helps smooth out decision boundaries.
Using more neighbors helps reduce the effect of outliers
When K > 1 there can be ties! Need to break them somehow
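A minimal k-nearest-neighbor sketch with majority voting; the tie-breaking rule here (prefer the tied class whose neighbor is closest) is just one reasonable choice.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # L2 distances from the test point x to every training example.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    best_count = max(votes.values())
    tied = {label for label, c in votes.items() if c == best_count}
    # Break ties by taking the tied class that appears first in distance order.
    for i in nearest:
        if y_train[i] in tied:
            return y_train[i]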
L1 (Manhattan) distance: $d(x, y) = \sum_i |x_i - y_i|$
L2 (Euclidean) distance: $d(x, y) = \left( \sum_i (x_i - y_i)^2 \right)^{1/2}$
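The two distances in numpy, for vectors x and y (a minimal sketch):

import numpy as np

def l1_distance(x, y):
    # Manhattan distance: sum of absolute coordinate differences.
    return np.sum(np.abs(x - y))

def l2_distance(x, y):
    # Euclidean distance: square root of the sum of squared differences.
    return np.sqrt(np.sum((x - y) ** 2))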
[Figure: nearest-neighbor (K = 1) decision regions under the L1 (Manhattan) distance vs. the L2 (Euclidean) distance]
What distance? What value for K? These are hyperparameters.
Training set: use these data points for lookup. Validation set: evaluate on these points for different K and distance metrics.
With enough training data, the nearest-neighbor classifier is guaranteed to be at most 2x worse than the best possible classifier. But the number of training samples needed grows exponentially with the dimension of the data: the opposite of what we want!
$w_0^T x$: big if cat
$w_1^T x$: big if dog
$w_2^T x$: big if hippo
[Diagram: cat, dog, and hippo weight vectors (rows of W) multiplied by the unrolled image pixel vector to produce a cat score, dog score, and hippo score]
Diagram by: Karpathy, Fei-Fei
The weight matrix is a collection of scoring functions, one per class. The prediction is a vector whose jth component is the "score" for the jth class.
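A minimal sketch of the scoring step for the cat/dog/hippo example; the numbers are illustrative stand-ins, not the values from the diagram, and no bias term is included.

import numpy as np

# One weight vector (one row of W) per class: cat, dog, hippo.
W = np.array([[ 0.2, -0.5,  0.1,  2.0 ],   # cat weight vector
              [ 1.5,  1.3,  2.1,  0.0 ],   # dog weight vector
              [ 0.0,  0.25, 0.2, -0.3 ]])  # hippo weight vector
x = np.array([56.0, 231.0, 24.0, 2.0])     # unrolled image pixels (feature vector)

scores = W @ x                # scores[j] is the "score" for the j-th class
predicted = scores.argmax()   # class with the highest score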
Diagram credit: Karpathy & Fei-Fei
Be aware: intuition from 2D doesn't always carry over to high dimensions.
CIFAR-10: 32x32x3 images, 10 classes. The feature vector is obtained by unrolling all pixels, and the classifier has one scoring function per class to recognize the 10 classes.
Decision rule is $w^T x$. If $w_i$ is big, then big values of $x_i$ are indicative of the class.
A loss function tells how good our current classifier is. Low loss = good classifier; high loss = bad classifier. (Also called: objective function, cost function.) The negative of a loss function is sometimes called a reward function, profit function, utility function, fitness function, etc.
Given a dataset $\{(x_i, y_i)\}_{i=1}^{N}$, the loss for a single example is $L_i(f(x_i, W), y_i)$ and the loss for the dataset is $L = \frac{1}{N} \sum_{i=1}^{N} L_i(f(x_i, W), y_i)$.
โThe score of the correct class should be higher than all the other scoresโ
[Plot of the hinge loss: loss vs. score for the correct class, with the highest score among the other classes and the margin marked]
Given an example $(x_i, y_i)$ ($x_i$ is the image, $y_i$ is the label), let $s = f(x_i, W)$ be the scores. Then the SVM loss has the form:
$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$
("Hinge Loss")
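The multiclass SVM loss above for a single example, as a minimal numpy sketch (margin fixed at 1, matching the formula); the sample scores reproduce the first worked example that follows.

import numpy as np

def svm_loss_single(s, y_i):
    # L_i = sum over j != y_i of max(0, s_j - s_{y_i} + 1)
    margins = np.maximum(0.0, s - s[y_i] + 1.0)   # hinge term for every class
    margins[y_i] = 0.0                            # the correct class does not contribute
    return margins.sum()

# Correct-class score 3.2, other scores 5.1 and -1.7 -> loss 2.9 (see below).
print(svm_loss_single(np.array([3.2, 5.1, -1.7]), 0))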
First example (correct-class score 3.2; other scores 5.1, -1.7): L = max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1) = max(0, 2.9) + max(0, -3.9) = 2.9 + 0 = 2.9
Second example (correct-class score 4.9; other scores 1.3, 2.0): L = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1) = max(0, -2.6) + max(0, -1.9) = 0 + 0 = 0
Third example (correct-class score -3.1; other scores 2.2, 2.5): L = max(0, 2.2 - (-3.1) + 1) + max(0, 2.5 - (-3.1) + 1) = max(0, 6.3) + max(0, 6.6) = 6.3 + 6.6 = 12.9
Loss over the dataset is: L = (2.9 + 0.0 + 12.9) / 3 = 5.27
Q: What happens to the loss if the scores for the car image change a bit?
Q: What are the min and max possible loss?
Q: If all scores were random, what loss would we expect?
Q: What would happen if the sum were over all classes (including $j = y_i$)?
Q: What if the loss used mean instead of sum?
Q: What if we used this loss instead?
$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)^2$
Want to interpret raw classifier scores as probabilities.
Classifier scores: $s = f(x_i, W)$ (unnormalized log-probabilities / logits).
Softmax function: $p_k = \frac{\exp(s_k)}{\sum_j \exp(s_j)}$
Probabilities must be >= 0: exponentiating ensures this (unnormalized probabilities). Probabilities must sum to 1: normalizing ensures this (probabilities).
Loss: $L_i = -\log(p_{y_i})$
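A minimal numpy sketch of this pipeline; subtracting the max score before exponentiating is a standard numerical-stability trick, not something stated on the slide.

import numpy as np

def softmax(s):
    # exp makes everything >= 0; dividing by the sum makes it sum to 1.
    e = np.exp(s - np.max(s))      # shift by max(s) to avoid overflow
    return e / e.sum()

def cross_entropy_loss(s, y_i):
    # L_i = -log(p_{y_i}) with p = softmax(s)
    return -np.log(softmax(s)[y_i])

scores = np.array([3.2, 5.1, -1.7])    # illustrative scores
probs = softmax(scores)                # interpretable as class probabilities
loss = cross_entropy_loss(scores, 0)   # loss if class 0 is the correct label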
Maximum Likelihood Estimation: choose weights to maximize the likelihood of the observed data (see EECS 445 or EECS 545).
Compare the predicted probabilities to the correct probabilities (probability 1 on the correct class, 0 on all others).
Kullback-Leibler Divergence: $D_{KL}(P \,\|\, Q) = \sum_y P(y) \log \frac{P(y)}{Q(y)}$
Cross-Entropy: $H(P, Q) = H(P) + D_{KL}(P \,\|\, Q)$
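A small numeric check of this decomposition (the distributions are illustrative; terms with P(y) = 0 are treated as 0):

import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / q[m]))

def cross_entropy(p, q):
    m = p > 0
    return -np.sum(p[m] * np.log(q[m]))

P = np.array([0.0, 1.0, 0.0])      # "correct probs": all mass on the true class
Q = np.array([0.10, 0.85, 0.05])   # predicted probabilities (illustrative)

# H(P, Q) = H(P) + D_KL(P || Q); with a one-hot P this reduces to -log Q[true class].
assert np.isclose(cross_entropy(P, Q), entropy(P) + kl_divergence(P, Q))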
Putting it all together: $L_i = -\log\left(\frac{\exp(s_{y_i})}{\sum_j \exp(s_j)}\right)$
Cross-entropy loss: $L_i = -\log\left(\frac{\exp(s_{y_i})}{\sum_j \exp(s_j)}\right)$ vs. SVM loss: $L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$
Q: What is the cross-entropy loss? What is the SVM loss? (Consider an example where the correct class scores much higher than all other classes.)
A: Cross-entropy loss > 0; SVM loss = 0
Q: What happens to each loss if I slightly change the scores of the last datapoint?
A: Cross-entropy loss will change; SVM loss will stay the same.
Q: What happens to each loss if I double the score of the correct class from 10 to 20?
A: Cross-entropy loss will decrease; SVM loss will still be 0.
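A small numeric illustration of these last two answers; the three-class scores are illustrative, with the correct class at index 0 scoring 10 and then 20.

import numpy as np

def cross_entropy_loss(s, y_i):
    e = np.exp(s - np.max(s))
    return -np.log(e[y_i] / e.sum())

def svm_loss(s, y_i):
    m = np.maximum(0.0, s - s[y_i] + 1.0)
    m[y_i] = 0.0
    return m.sum()

s_before = np.array([10.0, -2.0, 3.0])   # correct class (index 0) scores 10
s_after  = np.array([20.0, -2.0, 3.0])   # double the correct-class score to 20

print(svm_loss(s_before, 0), svm_loss(s_after, 0))                      # 0.0 -> 0.0 (unchanged)
print(cross_entropy_loss(s_before, 0), cross_entropy_loss(s_after, 0))  # small -> even smaller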