Evaluating Binary Classifiers: TPR, FPR

Tufts COMP 135: Introduction to Machine Learning
https://www.cs.tufts.edu/comp/135/2020f/

Many slides attributable to: Prof. Mike Hughes, Erik Sudderth (UCI), Finale Doshi-Velez (Harvard), James, Witten, Hastie, Tibshirani (ISL/ESL books)


SLIDE 1

Evaluating Binary Classifiers: TPR, FPR

SLIDE 2

Today’s objectives (day 08): Evaluating Binary Classifiers

Mike Hughes - Tufts COMP 135 - Fall 2020

1) Evaluate binary decisions at a specific threshold
   accuracy, TPR, TNR, PPV, NPV, …
2) Evaluate across a range of thresholds
   ROC curve, Precision-Recall curve
3) Evaluate probabilities / scores directly
   cross entropy loss (aka log loss)

SLIDE 3

What will we learn?

[Course overview diagram: Supervised Learning (vs. Unsupervised Learning and Reinforcement Learning); data-label pairs {x_n, y_n}, n = 1…N; the training / prediction / evaluation pipeline with its performance measure]

SLIDE 4

Task: Binary Classification

[Scatter plot over features x1, x2: the label y is a binary variable (red or blue)]

SLIDE 5

Example: Hotdog or Not

https://www.theverge.com/tldr/2017/5/14/15639784/hbo- silicon-valley-not-hotdog-app-download

SLIDE 6

From Features to Predictions

Goal: Predict label (0 or 1) given features x

Input features: x_i ≜ [x_i1, x_i2, …, x_if, …, x_iF]
Binary label (0 or 1): y_i ∈ {0, 1}
Score (a real number): s_i = h(x_i, θ)
Decision rule: predict 1 if the score s_i exceeds a chosen threshold

SLIDE 7

From Features to Predictions via Probabilities

Goal: Predict label (0 or 1) given features x

Input features: x_i ≜ [x_i1, x_i2, …, x_if, …, x_iF]
Binary label (0 or 1): y_i ∈ {0, 1}
Score (a real number): s_i = h(x_i, θ)
Probability of positive class (between 0.0 and 1.0): p_i = sigmoid(s_i), where sigmoid(z) = 1 / (1 + e^(−z))
Decision rule: predict 1 if p_i exceeds a chosen threshold between 0.0 and 1.0
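The score-to-decision pipeline above can be sketched in a few lines of Python (function names are illustrative, not from any course code):

```python
import math

def sigmoid(z):
    # squash a real-valued score into a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(score, threshold=0.5):
    # score -> probability -> binary decision at the chosen threshold
    return 1 if sigmoid(score) >= threshold else 0

print(sigmoid(0.0))         # 0.5: a score of 0 is maximally uncertain
print(predict_label(2.0))   # 1
print(predict_label(-2.0))  # 0
```

Raising the threshold makes the classifier more conservative about predicting the positive class; the rest of the lecture is about evaluating that trade-off.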

SLIDE 8

Classifier: Evaluation Step

Goal: Assess quality of predictions

Many ways in practice:
1) Evaluate binary decisions at a specific threshold
   accuracy, TPR, TNR, PPV, NPV, …
2) Evaluate across a range of thresholds
   ROC curve, Precision-Recall curve
3) Evaluate probabilities / scores directly
   cross entropy loss (aka log loss), hinge loss, …

SLIDE 9

Types of binary predictions

TP : true positive
FP : false positive
TN : true negative
FN : false negative

SLIDE 10

Example: Which outcome is this?

SLIDE 11

Example: Which outcome is this?

Answer: True Positive (TP)

SLIDE 12

Example: Which outcome is this?

SLIDE 13

Example: Which outcome is this?

Answer: True Negative (TN)

SLIDE 14

Example: Which outcome is this?

SLIDE 15

Example: Which outcome is this?

Answer: False Negative (FN)

SLIDE 16

Example: Which outcome is this?

SLIDE 17

Example: Which outcome is this?

Answer: False Positive (FP)

SLIDE 18

Metric: Confusion Matrix

Counting mistakes in binary predictions

#TP : num. true positive
#FP : num. false positive
#TN : num. true negative
#FN : num. false negative

                Predicted 0   Predicted 1
    True 0         #TN           #FP
    True 1         #FN           #TP
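A minimal, hand-rolled way to tally these four counts (sklearn's `confusion_matrix` does the same job; this sketch just makes the definitions explicit):

```python
def confusion_counts(y_true, y_pred):
    # tally the four outcome types for binary predictions
    tp = sum(1 for y, yhat in zip(y_true, y_pred) if y == 1 and yhat == 1)
    tn = sum(1 for y, yhat in zip(y_true, y_pred) if y == 0 and yhat == 0)
    fp = sum(1 for y, yhat in zip(y_true, y_pred) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(y_true, y_pred) if y == 1 and yhat == 0)
    return tp, tn, fp, fn

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (2, 2, 1, 1)
```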

SLIDE 19

Metric: Accuracy

accuracy = fraction of correct predictions
         = (TP + TN) / (TP + TN + FN + FP)

Potential problem: Suppose your dataset has 1 positive example and 99 negative examples. What is the accuracy of the classifier that always predicts “negative”?

SLIDE 20

Metric: Accuracy

accuracy = fraction of correct predictions
         = (TP + TN) / (TP + TN + FN + FP)

Potential problem: Suppose your dataset has 1 positive example and 99 negative examples. What is the accuracy of the classifier that always predicts “negative”? 99%!
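The imbalance problem is easy to verify numerically (a small sketch, not course code):

```python
def accuracy(y_true, y_pred):
    # fraction of examples where the prediction matches the label
    return sum(y == yhat for y, yhat in zip(y_true, y_pred)) / len(y_true)

# 1 positive, 99 negatives: always predicting "negative" still scores 99%
y_true = [1] + [0] * 99
y_pred = [0] * 100
print(accuracy(y_true, y_pred))  # 0.99
```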

SLIDE 21

Metrics for Binary Decisions

In practice, you need to emphasize the metrics appropriate for your application.

TPR: “sensitivity”, “recall”
TNR: “specificity”, 1 − FPR
PPV: “precision”
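These ratios all come directly from the confusion counts; the sketch below uses hypothetical counts purely for illustration:

```python
def binary_metrics(tp, tn, fp, fn):
    # standard ratios derived from the confusion matrix
    return {
        "TPR": tp / (tp + fn),  # sensitivity / recall
        "TNR": tn / (tn + fp),  # specificity (= 1 - FPR)
        "PPV": tp / (tp + fp),  # precision
        "NPV": tn / (tn + fn),
    }

# hypothetical counts: 10 true positives out of 100 examples
m = binary_metrics(tp=8, tn=80, fp=10, fn=2)
print(m["TPR"], m["PPV"])
```

Note how TPR and PPV can diverge: here the classifier recovers 80% of the positives but fewer than half of its positive calls are correct.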

SLIDE 22

Goal: Classifier to find relevant tweets to list on Tufts website

  • If in top 10 by predicted probability, put on website
  • If not, discard that tweet

Which metric might be most important? Could we just use accuracy?

SLIDE 23

Goal: Detector for cancer based on medical image

  • If called positive, patient gets further screening
  • If called negative, no further attention until 5+ years later

Which metric might be most important? Could we just use accuracy?

SLIDE 24

Classifier: Evaluation Step

Goal: Assess quality of predictions

Many ways in practice:
1) Evaluate binary decisions at a specific threshold
   accuracy, TPR, TNR, PPV, NPV, …
2) Evaluate across a range of thresholds
   ROC curve, Precision-Recall curve
3) Evaluate probabilities / scores directly
   cross entropy loss (aka log loss), hinge loss, …

SLIDE 25

ROC curve

[Plot: TPR vs. FPR (= 1 − TNR), with the diagonal marking a random guess and the top-left corner a perfect classifier]

Each point represents the TPR and FPR of one specific threshold. Connecting all points (all thresholds) produces the curve.
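A pure-Python sketch of sweeping thresholds to trace out the curve's points (sklearn's `roc_curve` is the practical choice; this just shows the mechanics):

```python
def roc_points(y_true, scores, thresholds):
    # one (FPR, TPR) point per threshold; connecting them traces the ROC curve
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    pts = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        pts.append((fp / n_neg, tp / n_pos))
    return pts

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
# threshold 0 predicts everything positive -> (1, 1); threshold 1 -> (0, 0)
print(roc_points(y_true, scores, [0.0, 0.3, 0.5, 1.0]))
```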

SLIDE 26

Area under the ROC curve (aka AUROC or AUC or “C statistic”)

Graphical view: the area under the curve in the TPR-vs-FPR plot.

Probabilistic interpretation:

AUROC ≜ Pr( ŷ(x_i) > ŷ(x_j) | y_i = 1, y_j = 0 )

For a random pair of examples, one positive and one negative: what is the probability the classifier will rank the positive one higher?

Area varies from 0.0 to 1.0. 0.5 is a random guess; 1.0 is perfect.
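The probabilistic interpretation can be checked directly by enumerating positive-negative pairs (a brute-force sketch, fine for small datasets; ties conventionally count as half):

```python
from itertools import product

def auroc_by_pairs(y_true, scores):
    # Pr(score of a random positive > score of a random negative)
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
# 3 of 4 positive-negative pairs are ranked correctly
print(auroc_by_pairs(y_true, scores))  # 0.75
```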

SLIDE 27

Precision-Recall Curve

[Plot: precision (PPV) on the vertical axis vs. recall (aka TPR) on the horizontal axis, one point per threshold]
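The same threshold sweep yields the precision-recall curve (a sketch; sklearn's `precision_recall_curve` handles the edge cases more carefully):

```python
def pr_points(y_true, scores, thresholds):
    # one (recall, precision) point per threshold
    n_pos = sum(y_true)
    pts = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        if tp + fp == 0:
            continue  # precision undefined when nothing is predicted positive
        pts.append((tp / n_pos, tp / (tp + fp)))
    return pts

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pr_points(y_true, scores, [0.3, 0.5]))
```

Raising the threshold trades recall for precision: at 0.3 this example catches every positive with some false alarms; at 0.5 it is perfectly precise but misses half the positives.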

SLIDE 28

AUROC not always best choice

[Two ROC plots, TPR vs. FPR: AUROC says the red curve is better, but the blue curve is much better for avoiding false alarms in the low-FPR region]

SLIDE 29

Classifier: Evaluation Step

Goal: Assess quality of predictions

Many ways in practice:
1) Evaluate binary decisions at a specific threshold
   accuracy, TPR, TNR, PPV, NPV, …
2) Evaluate across a range of thresholds
   ROC curve, Precision-Recall curve
3) Evaluate probabilities / scores directly
   cross entropy loss (aka log loss)
   Not covered yet: hinge loss, many others

SLIDE 30

Measuring quality of predicted probabilities

Use the log loss (aka “binary cross entropy”):

log_loss(y, p̂) = −y log p̂ − (1 − y) log(1 − p̂)

from sklearn.metrics import log_loss

Advantages:
  • smooth
  • easy to take derivatives!
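A per-example sketch of the formula (sklearn's `log_loss` averages this over the dataset, using the natural log; the clipping constant here is a common implementation detail, not from the slides):

```python
import math

def log_loss_one(y, p_hat, eps=1e-15):
    # binary cross entropy for a single example; clip to avoid log(0)
    p_hat = min(max(p_hat, eps), 1.0 - eps)
    return -y * math.log(p_hat) - (1 - y) * math.log(1.0 - p_hat)

# confident & correct -> small loss; confident & wrong -> large loss
print(log_loss_one(1, 0.9))  # ~0.105
print(log_loss_one(1, 0.1))  # ~2.303
```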

SLIDE 31

Why minimize log loss? The upper-bound justification

Log loss (if implemented in the correct base) is a smooth upper bound on the error rate.

Why smooth matters: easy to do gradient descent.
Why upper bound matters: achieving a log loss of 0.1 (averaged over the dataset) guarantees us that the error rate is no worse than 0.1 (10%).

Mike Hughes - Tufts COMP 135 - Spring 2019

SLIDE 32

Log loss upper bounds 0-1 error

log_loss(y, p̂) = −y log p̂ − (1 − y) log(1 − p̂)

error(y, ŷ) = 1 if y ≠ ŷ, 0 if y = ŷ

Plot assumes:
  • True label is 1
  • Threshold is 0.5
  • Log base 2
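Under these same assumptions (true label 1, threshold 0.5, log base 2), the bound can be spot-checked numerically:

```python
import math

def log2_loss(y, p_hat):
    # log loss in base 2, as assumed in the slide's plot
    return -y * math.log2(p_hat) - (1 - y) * math.log2(1 - p_hat)

def zero_one_error(y, p_hat, threshold=0.5):
    yhat = 1 if p_hat >= threshold else 0
    return 0 if y == yhat else 1

# base-2 log loss upper-bounds the 0-1 error at every checked probability
for p in [0.05, 0.25, 0.49, 0.51, 0.75, 0.95]:
    assert log2_loss(1, p) >= zero_one_error(1, p)
print("bound holds at all checked points")
```

Intuitively: whenever the decision is wrong (p̂ < 0.5 for a true 1), we have −log2(p̂) > 1, so the base-2 loss exceeds the 0-1 error of 1.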

SLIDE 33

Why minimize log loss? An information-theory justification

SLIDE 34

Entropy of a Binary Random Variable

Goal: The entropy of a distribution captures the amount of uncertainty.

entropy(X) = −p(X = 1) log2 p(X = 1) − p(X = 0) log2 p(X = 0)

Log base 2: units are “bits”. Log base e: units are “nats”.

1 bit of information is always needed to represent a binary variable X. Entropy tells us how much of this one bit is uncertain.

SLIDE 35

Entropy of a Binary Random Variable

Goal: The entropy of a distribution captures the amount of uncertainty.

entropy(X) = −p(X = 1) log2 p(X = 1) − p(X = 0) log2 p(X = 0)

H[X] = − Σ_{x ∈ {0,1}} p(X = x) log2 p(X = x) = −E_{x∼p(X)} [log2 p(X = x)]

Entropy is the average number of bits needed to encode an outcome. Want: low entropy (low-cost storage and transmission!).
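A small sketch of the formula (log base 2, so units are bits; the 0·log 0 = 0 convention is handled by skipping zero-probability terms):

```python
import math

def binary_entropy(p1):
    # entropy in bits of a binary variable with Pr(X=1) = p1
    total = 0.0
    for p in (p1, 1.0 - p1):
        if p > 0:
            total -= p * math.log2(p)
    return total

print(binary_entropy(0.5))  # 1.0: maximally uncertain, the full bit
print(binary_entropy(1.0))  # 0.0: no uncertainty at all
```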

SLIDE 36

Cross Entropy

Goal: Measure the cost of using an estimated distribution q to capture the true distribution p.

Entropy[p(X)] = − Σ_{x ∈ {0,1}} p(X = x) log2 p(X = x)

Cross-Entropy[p(X), q(X)] = − Σ_{x ∈ {0,1}} p(X = x) log2 q(X = x)

Info-theory interpretation: the average number of bits needed to encode samples from a true distribution p(X) with codes defined by a model q(X).

Goal: Want a model that uses fewer bits! Lower cross entropy = more information captured about the outcome labels!
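A sketch of the two quantities for a binary variable, with distributions given as plain dicts (an illustrative representation, not a library API):

```python
import math

def cross_entropy_bits(p, q, eps=1e-15):
    # average bits to encode samples from p using codes built from model q
    # p, q: dicts mapping outcome -> probability
    return -sum(p[x] * math.log2(max(q[x], eps)) for x in p)

p = {0: 0.5, 1: 0.5}       # true distribution
q_bad = {0: 0.9, 1: 0.1}   # mismatched model
print(cross_entropy_bits(p, p))      # 1.0: matches the entropy of p
print(cross_entropy_bits(p, q_bad))  # > 1.0: the mismatch costs extra bits
```

When q equals p the cross entropy reduces to the entropy; any mismatch only adds bits, which is why minimizing cross entropy pushes the model toward the true distribution.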

SLIDE 37

Log loss is cross entropy!

Let our “true” distribution p(Y) be the empirical distribution of labels in our observed dataset with N examples. Let our “model” distribution q(Y) be our estimated probabilities.

Cross-Entropy[p(Y), q(Y)] = E_{y∼p(Y)} [−log q(Y = y)]
                          = (1/N) Σ_{n=1}^{N} [ −y_n log p̂_n − (1 − y_n) log(1 − p̂_n) ]

Same as the average “log loss”!

Info-theory justification for log loss: we want to set model parameters to provide the best probabilistic encoding of the training data’s label distribution.
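The equivalence is easy to see in code: averaging the per-example log loss is exactly the cross-entropy expectation under the empirical label distribution (a sketch using the natural log, as sklearn's `log_loss` does):

```python
import math

def avg_log_loss(y_true, p_hat):
    # mean binary cross entropy over the dataset
    n = len(y_true)
    return sum(-y * math.log(p) - (1 - y) * math.log(1 - p)
               for y, p in zip(y_true, p_hat)) / n

y_true = [1, 0, 1, 1]
p_hat = [0.9, 0.2, 0.7, 0.6]
print(avg_log_loss(y_true, p_hat))
```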

SLIDE 38

The log loss metric

Log loss (aka “binary cross entropy”):

log_loss(y, p̂) = −y log p̂ − (1 − y) log(1 − p̂)

from sklearn.metrics import log_loss

Advantages:
  • smooth and not flat
  • easy to take derivatives!
  • convex function

Lower is better!

SLIDE 39

Code for Evaluation Metrics

https://scikit-learn.org/stable/modules/model_evaluation.html

1) To evaluate predicted scores / probabilities
2) To evaluate specific binary decisions
3) To make ROC or PR curves

SLIDE 40

Today’s objectives (day 08): Evaluating Binary Classifiers

1) Evaluate binary decisions at a specific threshold
   accuracy, TPR, TNR, PPV, NPV, …
2) Evaluate across a range of thresholds
   ROC curve, Precision-Recall curve
3) Evaluate probabilities / scores directly
   cross entropy loss (aka log loss)