Experimental Setup, Multi-class vs. Multi-label Classification, and Evaluation
CMSC 678, UMBC
Central Question: How Well Are We Doing?

The task: what kind of problem are you solving?
- Classification: Precision, Recall, F1; Accuracy; Log-loss; ROC-AUC; …
- Regression: (Root) Mean Square Error; Mean Absolute Error; …
- Clustering: Mutual Information; V-score; …

This does not have to be the same thing as the loss function you optimize.
Outline
- Experimental Design: Rule 1
- Multi-class vs. Multi-label Classification
- Evaluation: Regression Metrics, Classification Metrics
Experimenting with Machine Learning Models

What is “correct”? What is working “well”?

Split all your data into Training Data, Dev Data, and Test Data.
- Learn model parameters from the training set.
- Set hyperparameters; evaluate the learned model on dev with each hyperparameter setting.
- Perform the final evaluation on test, using the hyperparameters that optimized dev performance, retraining the model.

Rule #1: DO NOT ITERATE ON THE TEST DATA
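As a concrete sketch of this protocol (not from the slides): the code below assumes scikit-learn, with random stand-in data, a logistic-regression model, and an illustrative hyperparameter grid; one common choice, assumed here, is to retrain on train+dev before the single final test evaluation.

```python
# Hypothetical sketch of the train/dev/test protocol with scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = np.random.rand(1000, 5), np.random.randint(0, 2, 1000)  # stand-in data

# Split all your data into train / dev / test (e.g., 70/15/15).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_C, best_dev_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:                             # hyperparameter candidates
    model = LogisticRegression(C=C).fit(X_train, y_train)    # learn parameters on train
    dev_acc = accuracy_score(y_dev, model.predict(X_dev))    # evaluate on dev
    if dev_acc > best_dev_acc:
        best_C, best_dev_acc = C, dev_acc

# Final evaluation on test with the dev-optimal hyperparameter,
# retraining the model (here: on train+dev). Touch test exactly once.
final = LogisticRegression(C=best_C).fit(np.vstack([X_train, X_dev]),
                                         np.concatenate([y_train, y_dev]))
test_acc = accuracy_score(y_test, final.predict(X_test))
```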
On-board Exercise
Produce dev and test tables for a linear regression model with learned weights and set/fixed (non-learned) bias
Outline
- Experimental Design: Rule 1
- Multi-class vs. Multi-label Classification
- Evaluation: Regression Metrics, Classification Metrics
Multi-class Classification (single output)

Given input $x$, predict a discrete label $y$.
- If $y \in \{0, 1\}$ (or $y \in \{\text{True}, \text{False}\}$), then it is a binary classification task.
- If $y \in \{0, 1, \dots, K-1\}$ (for finite $K$), then it is a multi-class classification task.

Q: What are some examples of multi-class classification?
A: Many possibilities. See A2, Q{1,2,4-7}.

Multi-label Classification (multi-output)

Given input $x$, predict multiple discrete labels $y = (y_1, \dots, y_M)$.
- If multiple $y_m$ are predicted, then it is a multi-label classification task.
- Each $y_m$ could be binary or multi-class.
Multi-Label Classification…

- Will not be a primary focus of this course.
- Many of the single-output classification methods apply to multi-label classification.
- Predicting “in the wild” can be trickier.
- Evaluation can be trickier.
We’ve only developed binary classifiers so far…

- Option 1: Develop a multi-class version
- Option 2: Build a one-vs-all (OvA) classifier
- Option 3: Build an all-vs-all (AvA) classifier
(there can be others)

Option 1: the loss function may (or may not) need to be extended, and the model structure may need to change (big or small). Common change: instead of a single weight vector $w$, keep a weight vector $w^{(c)}$ for each class $c$, and compute class-specific scores, e.g., $y^{(c)} = w^{(c)\top} x + b^{(c)}$.
Multi-class Option 1: Linear Regression/Perceptron

$y = \mathbf{w}^\top x + b$

Output: if $y > 0$: class 1; else: class 2.
Multi-class Option 1: Linear Regression/Perceptron: A Per-Class View

Binary view: $y = \mathbf{w}^\top x + b$; output class 1 if $y > 0$, else class 2.

Per-class view: $y_1 = \mathbf{w}_1^\top x + b_1$ and $y_2 = \mathbf{w}_2^\top x + b_2$.
Output: $i = \arg\max \{y_1, y_2\}$; predict class $i$.

The binary version is a special case.
Multi-class Option 1: Linear Regression/Perceptron: A Per-Class View (alternative)

$y_1 = [\mathbf{w}_1; \mathbf{w}_2]^\top [x; \mathbf{0}] + b_1$ and $y_2 = [\mathbf{w}_1; \mathbf{w}_2]^\top [\mathbf{0}; x] + b_2$, where $[\cdot\,;\cdot]$ denotes concatenation.
Output: $i = \arg\max \{y_1, y_2\}$; predict class $i$.

Q: (For discussion) Why does this work?
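A minimal numpy sketch of the per-class view, assuming a weight matrix W whose row c holds the class-c weight vector (all names here are illustrative):

```python
# Per-class linear scores: y_c = w_c^T x + b_c, predict argmax over classes.
import numpy as np

num_classes, dim = 3, 4
W = np.random.randn(num_classes, dim)   # row c is the weight vector w_c
b = np.random.randn(num_classes)        # per-class biases b_c

def predict(x):
    scores = W @ x + b                  # one score per class
    return int(np.argmax(scores))       # output the highest-scoring class

print(predict(np.random.randn(dim)))
```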
Option 2: One-vs-All (OvA)

With $C$ classes, train $C$ different binary classifiers $\delta_c(x)$, where $\delta_c(x)$ predicts 1 if $x$ is likely class $c$, and 0 otherwise.

To test/predict a new instance $\hat{x}$: get scores $s_c = \delta_c(\hat{x})$ and output the max of these scores, $\hat{y} = \arg\max_c s_c$.
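A sketch of OvA under the assumption that each binary classifier exposes a real-valued score (here scikit-learn's LogisticRegression and its decision_function; the helper names are hypothetical):

```python
# One-vs-all: C binary classifiers, each "class c" vs. "everything else".
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ova(X, y, num_classes):
    return [LogisticRegression().fit(X, (y == c).astype(int))
            for c in range(num_classes)]

def predict_ova(classifiers, X):
    # Stack one score column per class, then take the argmax per instance.
    scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
    return scores.argmax(axis=1)
```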
Option 3: All-vs-All (AvA)

With $C$ classes, train $\binom{C}{2}$ different binary classifiers $\delta_{c_1,c_2}(x)$, where $\delta_{c_1,c_2}(x)$ predicts 1 if $x$ is likely class $c_1$, and 0 otherwise (likely class $c_2$).

To test/predict a new instance $\hat{x}$: get scores or predictions $s_{c_1,c_2} = \delta_{c_1,c_2}(\hat{x})$. Multiple options for the final prediction:
(1) count the number of times a class $c$ was predicted (see the sketch below)
(2) a margin-based approach
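A sketch of AvA with vote counting (option 1 above), again assuming scikit-learn binary classifiers; each pairwise classifier is trained only on that pair's examples:

```python
# All-vs-all: one binary classifier per class pair, predict by majority vote.
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def train_ava(X, y, num_classes):
    classifiers = {}
    for c1, c2 in combinations(range(num_classes), 2):   # C-choose-2 pairs
        mask = (y == c1) | (y == c2)                     # only this pair's data
        classifiers[(c1, c2)] = LogisticRegression().fit(
            X[mask], (y[mask] == c1).astype(int))        # 1 means "c1 wins"
    return classifiers

def predict_ava(classifiers, X, num_classes):
    votes = np.zeros((X.shape[0], num_classes), dtype=int)
    for (c1, c2), clf in classifiers.items():
        winners = np.where(clf.predict(X) == 1, c1, c2)  # winning class per row
        votes[np.arange(X.shape[0]), winners] += 1
    return votes.argmax(axis=1)                          # most-voted class
```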
Q: (to discuss) Why might you want to use Option 1, or the OvA/AvA options? What are the benefits of OvA vs. AvA?

What if you start with a balanced dataset, e.g., 100 instances per class?
Outline
- Experimental Design: Rule 1
- Multi-class vs. Multi-label Classification
- Evaluation: Regression Metrics, Classification Metrics
Regression Metrics

(Root) Mean Square Error: $\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$, with $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$.
Mean Absolute Error: $\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$

Q: How can these reward/punish predictions differently?
A: RMSE punishes outlier predictions more harshly.
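A tiny numeric check of that answer, with made-up predictions where a single outlier dominates the squared error but not the absolute error:

```python
# How one outlier affects MSE/RMSE vs. MAE.
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 14.0])   # one prediction is off by 10

mse = np.mean((y_true - y_pred) ** 2)      # (1/N) sum (y_i - yhat_i)^2 -> 25.0
mae = np.mean(np.abs(y_true - y_pred))     # (1/N) sum |y_i - yhat_i|   -> 2.5
rmse = np.sqrt(mse)                        # -> 5.0, double the MAE
```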
Outline
- Experimental Design: Rule 1
- Multi-class vs. Multi-label Classification
- Evaluation: Regression Metrics, Classification Metrics
Training Loss vs. Evaluation Score

In training, compute a loss to update parameters. Sometimes that loss is a computational compromise: a surrogate loss. The loss you use might not be as informative as you’d like.

Binary classification example: 90 of 100 training examples are +1, and 10 of 100 are -1; a classifier that always predicts +1 is already 90% accurate.
Some Classification Metrics

Accuracy, Precision, Recall, AUC (Area Under the Curve), F1, Confusion Matrix
Classification Evaluation: the 2-by-2 Contingency Table

                             Actually Correct      Actually Incorrect
Selected/guessed             True Positive (TP)    False Positive (FP)
Not selected/not guessed     False Negative (FN)   True Negative (TN)
Classification Evaluation: Accuracy, Precision, and Recall

Accuracy: % of items correct: $\frac{TP + TN}{TP + FP + FN + TN}$
Precision: % of selected items that are correct: $\frac{TP}{TP + FP}$
Recall: % of correct items that are selected: $\frac{TP}{TP + FN}$

Min: 0 ☹  Max: 1 😁
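The table's formulas as code, a minimal sketch; the counts plugged in below are the class-1 numbers from the micro/macro example at the end of the deck:

```python
# Accuracy, precision, and recall from contingency-table counts.
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# TP=10, FP=10, FN=10, TN=970: high accuracy, mediocre precision/recall.
print(accuracy(10, 10, 10, 970))   # 0.98
print(precision(10, 10))           # 0.5
print(recall(10, 10))              # 0.5
```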
Precision and Recall Present a Tradeoff

[Plot: precision vs. recall, each axis from 0 to 1, with individual models plotted as points]

Q: Where do you want your ideal model?
Q: You have a model that always identifies correct instances. Where on this graph is it?
Q: You have a model that only makes correct predictions. Where on this graph is it?

Idea: measure the tradeoff between precision and recall. Remember those hyperparameters: each point is a differently trained/tuned model. To improve the overall model, push the curve toward the upper right.
Measure this Tradeoff: Area Under the Curve (AUC)

AUC measures the area under this precision-recall tradeoff curve.

1. Computing the curve: you need true labels and predicted labels with some score/confidence estimate; threshold the scores, and for each threshold compute precision and recall.
2. Finding the area: implement via the trapezoidal rule (among others); in practice, use an external library like the sklearn.metrics module.

Min AUC: 0 ☹  Max AUC: 1 😁
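Both steps in code, using the sklearn.metrics module the slide mentions; the labels and scores are made-up stand-ins:

```python
# Precision-recall curve and its area with scikit-learn.
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                   # true labels
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5])  # model confidences

# Step 1: sweep thresholds over the scores to get (precision, recall) points.
precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Step 2: area under the curve via the trapezoidal rule.
pr_auc = auc(recall, precision)
print(pr_auc)
```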
Measure a Slightly Different Tradeoff: ROC-AUC

The main variant is ROC-AUC: the same idea as before, but with some flipped metrics. You still need true labels and predicted labels with some score/confidence estimate; threshold the scores, and for each threshold compute the true positive rate and false positive rate. Then find the area the same way (the trapezoidal rule, or a library like sklearn.metrics).

[Plot: true positive rate vs. false positive rate, each axis from 0 to 1]

Min ROC-AUC: 0.5 ☹  Max ROC-AUC: 1 😁
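The same recipe for ROC-AUC, continuing the previous snippet's y_true and scores; scikit-learn also provides the one-call shortcut roc_auc_score:

```python
# ROC curve (TPR vs. FPR) and its area with scikit-learn.
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_true, scores)  # y_true, scores as above
print(roc_auc_score(y_true, scores))              # 0.5 = chance, 1.0 = perfect
```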
A Combined Measure: F

F is a weighted (harmonic) average of Precision and Recall:
$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(1 + \beta^2) \, P \, R}{\beta^2 P + R}$$
(the second form follows by algebra; not important)

Balanced F1 measure ($\beta = 1$): $F_1 = \frac{2 P R}{P + R}$
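The second form of the slide's formula as a one-function sketch; beta=1 recovers the balanced F1:

```python
# F-beta from precision p and recall r.
def f_beta(p, r, beta=1.0):
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(f_beta(0.5, 0.5))   # F1 for P = R = 0.5 -> 0.5
```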
P/R/F in a Multi-class Setting: Micro- vs. Macro-Averaging (Sec. 15.2.4)

If we have more than one class, how do we combine multiple performance measures into one quantity?
- Macroaveraging: compute performance for each class, then average.
- Microaveraging: collect decisions for all classes into one contingency table, then evaluate.

$$\text{microprecision} = \frac{\sum_c TP_c}{\sum_c TP_c + \sum_c FP_c}$$
$$\text{macroprecision} = \frac{1}{C} \sum_c \frac{TP_c}{TP_c + FP_c} = \frac{1}{C} \sum_c \text{precision}_c$$

When to prefer the macroaverage? When to prefer the microaverage?
Micro- vs. Macro-Averaging: Example (Sec. 15.2.4)

Class 1:
                   Truth: yes   Truth: no
Classifier: yes        10           10
Classifier: no         10          970

Class 2:
                   Truth: yes   Truth: no
Classifier: yes        90           10
Classifier: no         10          890

Micro Ave. Table:
                   Truth: yes   Truth: no
Classifier: yes       100           20
Classifier: no         20         1860
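Working these tables through the two formulas, as a quick check of how the averages can disagree:

```python
# Micro vs. macro precision on the example tables above.
tp = [10, 90]   # class 1, class 2
fp = [10, 10]

per_class = [t / (t + f) for t, f in zip(tp, fp)]   # [0.5, 0.9]
macro_p = sum(per_class) / len(per_class)           # 0.7
micro_p = sum(tp) / (sum(tp) + sum(fp))             # 100 / 120 ~ 0.833
print(per_class, macro_p, micro_p)
```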