SLIDE 1

Lecture 14: Linear Classifiers
Justin Johnson
EECS 442 WI 2020
February 25, 2020

SLIDE 2

Administrative

  • HW3 due Wednesday, March 4, 11:59pm
  • TAs will not be checking Piazza over Spring Break. You are strongly encouraged to finish the assignment by Friday, February 28.

SLIDE 3

Last Time: Supervised Learning

  • 1. Collect a dataset of images and labels
  • 2. Use Machine Learning to train a classifier
  • 3. Evaluate the classifier on new images

Example training set

SLIDE 4

Last Time: Least Squares

Training (xi, yi): fit w by least squares:

  argmin_w Σ_i (wᵀxi − yi)²  =  argmin_w ‖y − Xw‖₂²

Testing/Inference (x): given a new input, what's the prediction?

  wᵀx = w1·x1 + ⋯ + wN·xN

SLIDE 5

Last Time: Regularization

Objective:

  argmin_w ‖y − Xw‖₂² + λ‖w‖₂²
     (loss)      (regularization, weighted by the trade-off λ)

What happens (and why) if:
  • λ = 0: back to plain least squares
  • λ = ∞: forces w = 0
  • In between: something sensible?

SLIDE 6

Hyperparameters

Objective:

  argmin_w ‖y − Xw‖₂² + λ‖w‖₂²
     (loss + λ · regularization)

w is a parameter, since we optimize for it on the training set.
λ is a hyperparameter, since we choose it before fitting the training set.

SLIDE 7

Choosing Hyperparameters

Idea #1: Choose hyperparameters that work best on the data.
  BAD: λ = 0 always works best on the training data.

Idea #2: Split data into train and test; choose hyperparameters that work best on test data.
  BAD: No idea how we will perform on new data.

Idea #3: Split data into train, val, and test; choose hyperparameters on val and evaluate on test.
  Better!

(Diagram: your dataset split into train / validation / test.)

SLIDE 8

Choosing Hyperparameters

Idea #4: Cross-Validation: split data into folds, try each fold as validation and average the results.

(Diagram: the dataset is split into five folds plus a held-out test set; each fold takes a turn as the validation set.)

Useful for small datasets, but (unfortunately) not used too frequently in deep learning
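A rough sketch of that fold rotation; the ridge fit inside the loop and the synthetic data shapes are illustrative assumptions, not from the slides:

```python
import numpy as np

def cross_validate_ridge(X, y, lam, num_folds=5, seed=0):
    """Average validation MSE of ridge regression with strength lam, rotating folds."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, num_folds)
    errors = []
    for k in range(num_folds):
        val = folds[k]
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        # Closed-form ridge fit on the training folds (see the next slide).
        w = np.linalg.solve(X[train].T @ X[train] + lam * np.eye(X.shape[1]),
                            X[train].T @ y[train])
        errors.append(np.mean((X[val] @ w - y[val]) ** 2))
    return float(np.mean(errors))
```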

SLIDE 9

Training and Testing

Fit model parameters on the training set; find hyperparameters by testing on the validation set; evaluate on an entirely unseen test set.

Training set: use these data points to fit w* = (XᵀX + λI)⁻¹Xᵀy.
Validation set: evaluate on these points for different λ, pick the best.
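A short sketch of that recipe; the data, the split, and the candidate λ values below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
train, val = slice(0, 70), slice(70, 100)   # illustrative train/validation split

def fit_ridge(X, y, lam):
    # Closed form from the slide: w* = (X^T X + lam*I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

val_mse, best_lam = min(
    (float(np.mean((X[val] @ fit_ridge(X[train], y[train], lam) - y[val]) ** 2)), lam)
    for lam in [0.0, 0.01, 0.1, 1.0, 10.0]
)
print(best_lam, val_mse)   # hyperparameter chosen on the validation set
```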

SLIDE 10

Image Classification

Start with the simplest example: binary classification.

Cat or not cat?

Actually: a feature vector x = (x1, x2, …, xN) representing the image.

SLIDE 11

Classification with Least Squares

Rifkin, Yeo, Poggio. Regularized Least Squares Classification (http://cbcl.mit.edu/publications/ps/rlsc.pdf). 2003.
Redmon, Divvala, Girshick, Farhadi. You Only Look Once: Unified, Real-Time Object Detection. CVPR 2016.

Treat as regression: xi is the image feature; yi is 1 if it's a cat, 0 if it's not a cat. Minimize the least-squares loss.

Training (xi, yi):  argmin_w Σ_i (wᵀxi − yi)²

Inference (x):  predict "cat" if wᵀx > t for some threshold t.

Unprincipled in theory, but often effective in practice. The reverse (regression via discrete bins) is also common.
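A minimal sketch of this regression-then-threshold classifier; the features, labels, and the 0.5 threshold are illustrative assumptions rather than the slide's:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                                # image features (made up)
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(float)  # 1 = cat, 0 = not cat

# Fit by ordinary least squares, then classify by thresholding the regression output.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
t = 0.5                                    # natural threshold for 0/1 targets
pred = (X @ w > t).astype(float)
print("training accuracy:", float((pred == y).mean()))
```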

SLIDE 12

Classification via Memorization

Consider cat/dog/hippo classification. Just memorize (as in a Python dictionary):

If this: cat. If this: dog. If this: hippo.

SLIDE 13

Classification via Memorization

Rule: if this, then cat.

Hmmm. Not quite the same.

Where does this go wrong?

SLIDE 14

Classification via Memorization

(Diagram: known images x1, …, xN with labels; a test image x; distances D(x1, x), …, D(xN, x).)

(1) Compute the distance between feature vectors, (2) find the nearest known image, (3) use its label: Cat!

SLIDE 15

Nearest Neighbor

"Algorithm"

Training (xi, yi): memorize the training set.

Inference (x):

bestDist, prediction = float('inf'), None
for i in range(N):                      # N memorized training examples
    d = dist(X_train[i], x)            # distance between stored feature xi and query x
    if d < bestDist:
        bestDist, prediction = d, y_train[i]

SLIDE 16

Nearest Neighbor

Nearest neighbors in two dimensions. Points are training examples; colors give training labels. Background colors give the category a test point would be assigned.

The decision boundary is the boundary between two classification regions. Decision boundaries can be noisy and affected by outliers.

How to smooth out decision boundaries? Use more neighbors!

SLIDE 17

K-Nearest Neighbors

(Figure: decision regions for K = 1 vs. K = 3.)

Instead of copying label from nearest neighbor, take majority vote from K closest points

SLIDE 18

K-Nearest Neighbors

(Figure: K = 1 vs. K = 3.)

Using more neighbors helps smooth out rough decision boundaries.

SLIDE 19

K-Nearest Neighbors

(Figure: K = 1 vs. K = 3.)

Using more neighbors helps reduce the effect of outliers

SLIDE 20

K-Nearest Neighbors

(Figure: K = 1 vs. K = 3.)

When K > 1 there can be ties! Need to break them somehow
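A small sketch of the K-nearest-neighbor vote with one possible tie-breaking rule (falling back to the single nearest neighbor); both that rule and the L2 metric here are illustrative choices, not prescribed by the slide:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, K=3):
    """Majority vote among the K closest training points (L2 distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:K]
    votes = Counter(y_train[i] for i in nearest).most_common()
    # Tie-break: if the top two labels are equally common, fall back to the 1-NN label.
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        return y_train[nearest[0]]
    return votes[0][0]
```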

SLIDE 21

K-Nearest Neighbors: Distance Metric

L1 (Manhattan) distance: d(x, y) = Σ_i |xi − yi|

L2 (Euclidean) distance: d(x, y) = ( Σ_i (xi − yi)² )^(1/2)
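The same two metrics written out in numpy, as a straightforward sketch:

```python
import numpy as np

def l1_distance(x, y):
    # Manhattan distance: sum of absolute coordinate differences
    return float(np.sum(np.abs(x - y)))

def l2_distance(x, y):
    # Euclidean distance: square root of the summed squared differences
    return float(np.sqrt(np.sum((x - y) ** 2)))
```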

SLIDE 22

K-Nearest Neighbors: Distance Metric

L1 (Manhattan) distance: d(x, y) = Σ_i |xi − yi|
L2 (Euclidean) distance: d(x, y) = ( Σ_i (xi − yi)² )^(1/2)

(Figure: K = 1 decision regions under the L1 vs. the L2 distance.)

SLIDE 23

K-Nearest Neighbors

What distance? What value for K?

Training set: use these data points for lookup.
Validation set: evaluate on these points for different K and distance metrics.

SLIDE 24

K-Nearest Neighbors

  • No learning going on, but usually effective
  • Same algorithm for every task
  • As the number of datapoints → ∞, the error rate is guaranteed to be at most 2× worse than the optimal you could do on the data
  • Training is fast, but inference is slow. Opposite of what we want!

SLIDE 25

Linear Classifiers

Example setup: 3 classes. Model – one weight vector per class: w0, w1, w2.

  w0ᵀx  big if cat
  w1ᵀx  big if dog
  w2ᵀx  big if hippo

Stack together: Wx, where x is in R^F and W stacks the three per-class weight vectors.

SLIDE 26

Linear Classifiers

(Diagram by Karpathy & Fei-Fei: a weight matrix W times an image feature vector x.)

W (one row per class — cat, dog, hippo weight vectors — reading the last column as the bias):

   0.2  -0.5   0.1   2.0   1.1
   1.5   1.3   2.1   0.0   3.2
   0.0   0.3   0.2  -0.3  -1.2

x (image features, with a 1 appended for the bias trick):

   56   231   24   2   1

Wx (cat, dog, hippo scores):

  -96.8   437.9   61.95

The weight matrix is a collection of scoring functions, one per class. The prediction is a vector whose jth component is the "score" for the jth class.
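A tiny numpy sketch of this "matrix of scoring functions" view, using the W and x values as they appear on the slide (read with the last column as the bias); with this reading the cat and dog scores come out to the slide's −96.8 and 437.9:

```python
import numpy as np

classes = ["cat", "dog", "hippo"]
W = np.array([[0.2, -0.5, 0.1,  2.0,  1.1],   # cat scoring function (last entry = bias)
              [1.5,  1.3, 2.1,  0.0,  3.2],   # dog scoring function
              [0.0,  0.3, 0.2, -0.3, -1.2]])  # hippo scoring function
x = np.array([56.0, 231.0, 24.0, 2.0, 1.0])   # image features with a 1 appended for the bias

scores = W @ x                                # one score per class
print(dict(zip(classes, scores.round(2))))
print("prediction:", classes[int(np.argmax(scores))])
```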

SLIDE 27

Linear Classifiers: Geometric Intuition

What does a linear classifier look like in 2D?

Diagram credit: Karpathy & Fei-Fei

Be aware: intuition from 2D doesn't always carry over into high-dimensional spaces. See: On the Surprising Behavior of Distance Metrics in High Dimensional Space. Aggarwal, Hinneburg, Keim. ICDT 2001.

SLIDE 28

Linear Classifiers: Visual Intuition

  • Turn each image into a feature by unrolling all pixels
  • Train a linear model to recognize 10 classes

CIFAR 10: 32x32x3 Images, 10 Classes

SLIDE 29

Linear Classifiers: Visual Intuition

Decision rule is wTx. If wi is big, then big values of xi are indicative of the class.

Deer or Plane?

SLIDE 30

Linear Classifiers: Visual Intuition

Decision rule is wTx. If wi is big, then big values of xi are indicative of the class.

Ship or Dog?

SLIDE 31

Linear Classifiers: Visual Intuition

Decision rule is wTx. If wi is big, then big values of xi are indicative of the class.

SLIDE 32

So Far: Linear Score Function

3 classes; one weight vector per class: w0, w1, w2.

  w0ᵀx  big if cat
  w1ᵀx  big if dog
  w2ᵀx  big if hippo

Stack together: Wx, where x is in R^F.

How do we know which W is best?

SLIDE 33

Choosing W: Loss Function

A loss function tells how good our current classifier is.
  Low loss = good classifier. High loss = bad classifier.
  (Also called: objective function; cost function.)
  The negative of a loss function is sometimes called a reward function, profit function, utility function, fitness function, etc.

Given a dataset {(xi, yi)}, i = 1, …, N, of images xi and labels yi:

  Loss for a single example:  Li( f(xi, W), yi )

  Loss for the dataset:  L = (1/N) Σ_{i=1}^{N} Li( f(xi, W), yi )

SLIDE 34

Multiclass SVM Loss

"The score of the correct class should be higher than all the other scores"

Given an example (xi, yi), where xi is the image and yi is the label, let s = f(xi, W) be the scores. Then the SVM loss has the form:

  Li = Σ_{j ≠ yi} max(0, sj − s_yi + 1)

(Li is the loss, s_yi the score for the correct class, sj the scores of the other classes, and the +1 the "margin".)

"Hinge Loss"

SLIDE 35

Multiclass SVM Loss

Scores (columns: cat image, car image, frog image):

  cat:    3.2    1.3    2.2
  car:    5.1    4.9    2.5
  frog:  -1.7    2.0   -3.1

Given an example (xi, yi), let s = f(xi, W) be the scores; the SVM loss is Li = Σ_{j ≠ yi} max(0, sj − s_yi + 1).

SLIDE 36

Multiclass SVM Loss

(Same scores and loss definition as the previous slide.)

SLIDE 37

Multiclass SVM Loss

Loss for the cat image (correct class score 3.2; other scores 5.1 and −1.7):

  = max(0, 5.1 − 3.2 + 1) + max(0, −1.7 − 3.2 + 1)
  = max(0, 2.9) + max(0, −3.9)
  = 2.9 + 0 = 2.9

SLIDE 38

Multiclass SVM Loss

Loss for the car image (correct class score 4.9; other scores 1.3 and 2.0):

  = max(0, 1.3 − 4.9 + 1) + max(0, 2.0 − 4.9 + 1)
  = max(0, −2.6) + max(0, −1.9)
  = 0 + 0 = 0

Losses so far: 2.9, 0.0

SLIDE 39

Multiclass SVM Loss

Loss for the frog image (correct class score −3.1; other scores 2.2 and 2.5):

  = max(0, 2.2 − (−3.1) + 1) + max(0, 2.5 − (−3.1) + 1)
  = max(0, 6.3) + max(0, 6.6)
  = 6.3 + 6.6 = 12.9

Losses so far: 2.9, 0.0, 12.9

SLIDE 40

Multiclass SVM Loss

Per-image losses: 2.9, 0.0, 12.9

Loss over the dataset: L = (2.9 + 0.0 + 12.9) / 3 = 5.27
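A short numpy check of this computation; the score matrix and labels are taken from the running example, and the loop reproduces the per-image losses 2.9, 0.0, 12.9 and their mean of about 5.27:

```python
import numpy as np

scores = np.array([[3.2, 5.1, -1.7],   # cat image: scores for (cat, car, frog)
                   [1.3, 4.9,  2.0],   # car image
                   [2.2, 2.5, -3.1]])  # frog image
labels = np.array([0, 1, 2])           # correct class index for each image

def svm_loss(s, y):
    # L_i = sum over j != y of max(0, s_j - s_y + 1)
    margins = np.maximum(0.0, s - s[y] + 1.0)
    margins[y] = 0.0                    # skip the correct class
    return float(margins.sum())

losses = [svm_loss(s, y) for s, y in zip(scores, labels)]
print(losses, np.mean(losses))          # [2.9, 0.0, 12.9] -> mean ~5.27
```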

SLIDE 41

Multiclass SVM Loss


Q: What happens to the loss if the scores for the car image change a bit?

SLIDE 42

Multiclass SVM Loss


Q: What are the min and max possible loss?

SLIDE 43

Multiclass SVM Loss


Q: If all scores were random, what loss would we expect?

SLIDE 44

Multiclass SVM Loss

Q: What would happen if the sum were over all classes (including j = yi)?

SLIDE 45

Multiclass SVM Loss


Q: What if the loss used mean instead of sum?

SLIDE 46

Multiclass SVM Loss

Q: What if we used this loss instead?

  Li = Σ_{j ≠ yi} max(0, sj − s_yi + 1)²

SLIDE 47

Cross-Entropy Loss

Want to interpret raw classifier scores as probabilities.

Classifier scores (cat, car, frog): s = f(xi, W) = [3.2, 5.1, −1.7]

SLIDE 48

Cross-Entropy Loss

Classifier scores: s = f(xi, W) = [3.2, 5.1, −1.7]

Softmax function: pk = exp(sk) / Σ_j exp(sj)

SLIDE 49

Cross-Entropy Loss

Classifier scores: s = f(xi, W) = [3.2, 5.1, −1.7]   (unnormalized log-probabilities / logits)

Softmax function: pk = exp(sk) / Σ_j exp(sj)

SLIDE 50

Cross-Entropy Loss

Scores (logits): [3.2, 5.1, −1.7]

exp → unnormalized probabilities: [24.5, 164, 0.18]   (probabilities must be ≥ 0)

Softmax function: pk = exp(sk) / Σ_j exp(sj)

SLIDE 51

Cross-Entropy Loss

Scores (logits): [3.2, 5.1, −1.7]

exp → unnormalized probabilities: [24.5, 164, 0.18]   (probabilities must be ≥ 0)

normalize → probabilities: [0.13, 0.87, 0.00]   (probabilities must sum to 1)

Softmax function: pk = exp(sk) / Σ_j exp(sj)

SLIDE 52

Cross-Entropy Loss

Scores (logits): [3.2, 5.1, −1.7]
exp → unnormalized probabilities: [24.5, 164, 0.18]
normalize → probabilities: [0.13, 0.87, 0.00]

Softmax function: pk = exp(sk) / Σ_j exp(sj)

Loss: Li = −log(p_yi)

For the cat image: Li = −log(0.13) = 2.04
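The same pipeline in numpy; this small sketch reproduces the numbers above (24.5, 164, 0.18 → 0.13, 0.87, 0.00 → loss ≈ 2.04):

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])        # cat, car, frog logits for the cat image
correct = 0                                # index of the correct class (cat)

unnormalized = np.exp(scores)              # [24.5, 164.0, 0.18]  (all >= 0)
probs = unnormalized / unnormalized.sum()  # [0.13, 0.87, 0.00]   (sums to 1)
loss = -np.log(probs[correct])             # cross-entropy loss, about 2.04
print(probs.round(2), round(float(loss), 2))
```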

SLIDE 53

Cross-Entropy Loss

(Same softmax pipeline as above: scores → exp → normalize → probabilities; Li = −log(p_yi) = −log(0.13) = 2.04.)

Maximum Likelihood Estimation: choose weights to maximize the likelihood of the observed data. (See EECS 445 or EECS 545.)

SLIDE 54

Cross-Entropy Loss

Predicted probabilities: [0.13, 0.87, 0.00]
Correct probabilities:   [1.00, 0.00, 0.00]

Compare the two distributions.

SLIDE 55

Cross-Entropy Loss

Predicted probabilities: [0.13, 0.87, 0.00]
Correct probabilities:   [1.00, 0.00, 0.00]

Compare with the Kullback-Leibler divergence:

  D_KL(P ‖ Q) = Σ_y P(y) log( P(y) / Q(y) )

SLIDE 56

Cross-Entropy Loss

Predicted probabilities: [0.13, 0.87, 0.00]
Correct probabilities:   [1.00, 0.00, 0.00]

Cross-Entropy: H(P, Q) = H(P) + D_KL(P ‖ Q)

SLIDE 57

Cross-Entropy Loss

Classifier scores: s = f(xi, W)
Softmax: pk = exp(sk) / Σ_j exp(sj)
Loss: Li = −log(p_yi)

Putting it all together:

  Li = −log( exp(s_yi) / Σ_j exp(sj) )

SLIDE 58

Cross-Entropy Loss

  Li = −log( exp(s_yi) / Σ_j exp(sj) )

Q: What is the min / max possible loss Li?

SLIDE 59

Cross-Entropy Loss

  Li = −log( exp(s_yi) / Σ_j exp(sj) )

Q: If all scores are small random values, what is the loss?

SLIDE 60

Cross-Entropy vs SVM Loss

๐‘€2 = โˆ’ log exp ๐‘กST โˆ‘Q exp(๐‘ก

Q)

๐‘€2 = 1

QRST

max 0, ๐‘ก

Q โˆ’ ๐‘กST + 1

Assume scores: [10, -2, 3] [10, 9, 9] [10, -100, -100] and ๐‘ง2 = 0

A: Cross-entropy loss > 0 SVM loss = 0 Q: What is cross-entropy loss? What is SVM loss?

SLIDE 61

Cross-Entropy vs SVM Loss

๐‘€2 = โˆ’ log exp ๐‘กST โˆ‘Q exp(๐‘ก

Q)

๐‘€2 = 1

QRST

max 0, ๐‘ก

Q โˆ’ ๐‘กST + 1

Assume scores: [10, -2, 3] [10, 9, 9] [10, -100, -100] and ๐‘ง2 = 0

A: Cross-entropy loss will change; SVM loss will stay the same Q: What happens to each loss if I slightly change the scores of the last datapoint?

SLIDE 62

Cross-Entropy vs SVM Loss

๐‘€2 = โˆ’ log exp ๐‘กST โˆ‘Q exp(๐‘ก

Q)

๐‘€2 = 1

QRST

max 0, ๐‘ก

Q โˆ’ ๐‘กST + 1

Assume scores: [10, -2, 3] [10, 9, 9] [10, -100, -100] and ๐‘ง2 = 0

A: Cross-entropy loss will decrease, SVM loss still 0 Q: What happens to each loss if I double the score of the correct class from 10 to 20?
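A quick numpy check of the three score vectors, computing both losses for each; the numbers back up the answers above (every SVM loss is 0, while the cross-entropy losses differ across the three cases):

```python
import numpy as np

def svm_loss(s, y):
    # sum over j != y of max(0, s_j - s_y + 1)
    m = np.maximum(0.0, s - s[y] + 1.0)
    m[y] = 0.0
    return float(m.sum())

def cross_entropy_loss(s, y):
    # -log of the softmax probability of the correct class
    p = np.exp(s) / np.exp(s).sum()
    return float(-np.log(p[y]))

for s in [np.array([10., -2., 3.]), np.array([10., 9., 9.]), np.array([10., -100., -100.])]:
    print(s, "SVM:", svm_loss(s, 0), "CE:", round(cross_entropy_loss(s, 0), 4))
```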

SLIDE 63

Next Time: How to choose W? Optimization!