SLIDE 1

Evaluation Measures

Sebastian Pölsterl

Computer Aided Medical Procedures | Technische Universität München

April 28, 2015

SLIDE 2

Outline

1 Classification

  • 1. Confusion Matrix
  • 2. Receiver operating characteristics
  • 3. Precision-Recall Curve

2 Regression

3 Unsupervised Methods

4 Validation

  • 1. Cross-Validation
  • 2. Leave-one-out Cross-Validation
  • 3. Bootstrap Validation

5 How to Do Cross-Validation

SLIDE 3

Performance Measures: Classification

  • Deterministic Classifiers → Confusion Matrix
    • Multi-class
      • No Chance Correction: Accuracy, Error Rate, Micro/Macro Average
      • Chance Correction: Cohen’s Kappa, Fleiss’ Kappa
    • Single-class: TP/FP Rate, Precision, Recall, Sensitivity, Specificity, F1-Measure, Dice, Geometric Mean
  • Scoring Classifiers
    • Graphical Measures: ROC Curves, PR Curves, Lift Charts, Cost Curves
    • Summary Statistics: Area under the curve, H Measure

SLIDE 4

Test Outcomes

Let us consider a binary classification problem:

  • True Positive (TP) = positive sample correctly classified as belonging to the positive class
  • False Positive (FP) = negative sample misclassified as belonging to the positive class
  • True Negative (TN) = negative sample correctly classified as belonging to the negative class
  • False Negative (FN) = positive sample misclassified as belonging to the negative class
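
The four outcomes translate directly into code. A minimal NumPy sketch (not part of the original slides; the function name and toy labels are illustrative):

```python
import numpy as np

def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary problem (illustrative sketch)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    return tp, fp, tn, fn

# Example: 1 = positive class, 0 = negative class
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (2, 1, 2, 1)
```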

SLIDE 5

Confusion Matrix I

                   Ground Truth
  Prediction       Class A                             Class B
  Class A          True positive                       False positive (Type I error, α)
  Class B          False negative (Type II error, β)   True negative

  • Let class A indicate the positive class and class B the negative class.
  • Accuracy = (TP + TN) / (TP + FP + TN + FN)
  • Error rate = 1 − Accuracy

SLIDE 6

Confusion Matrix II

                   Ground Truth
  Pred.            Class A                         Class B
  Class A          TP (→ Sensitivity)              FP (→ False positive rate)
  Class B          FN (→ False negative rate)      TN (→ Specificity)

  • Sensitivity/True positive rate/Recall = TP / (TP + FN)
  • Specificity/True negative rate = TN / (TN + FP)
  • False negative rate = FN / (FN + TP) = 1 − Sensitivity
  • False positive rate = FP / (FP + TN) = 1 − Specificity

SLIDE 7

Confusion Matrix III

                   Ground Truth
  Pred.            Class A    Class B
  Class A          TP         FP         → Positive predictive value
  Class B          FN         TN         → Negative predictive value

  • Positive predictive value (PPV)/Precision = TP / (TP + FP)
  • Negative predictive value (NPV) = TN / (TN + FN)
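
As a rough illustration (not from the slides), all rates from the last three slides can be derived from the four counts:

```python
def rates(tp, fp, tn, fn):
    """Derive the measures from slides 5-7 from the four confusion counts."""
    return {
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # recall, true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "fpr":         fp / (fp + tn),   # 1 - specificity
        "fnr":         fn / (fn + tp),   # 1 - sensitivity
        "ppv":         tp / (tp + fp),   # precision
        "npv":         tn / (tn + fn),
    }

print(rates(tp=2, fp=1, tn=2, fn=1))
```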

SLIDE 8

Multiple Classes – One vs. One

                   Ground Truth
  Prediction       Class A    Class B    Class C    Class D
  Class A          Correct    Wrong      Wrong      Wrong
  Class B          Wrong      Correct    Wrong      Wrong
  Class C          Wrong      Wrong      Correct    Wrong
  Class D          Wrong      Wrong      Wrong      Correct

  • With k classes, the confusion matrix becomes a k × k matrix.
  • No clear notion of positives and negatives.

SLIDE 9

Multiple Classes – One vs. All

                   Ground Truth
  Pred.            Class A           Other
  Class A          True positive     False positive
  Other            False negative    True negative

  • Choose one of k classes as positive (here: class A).
  • Collapse all other classes into negative to obtain k different 2 × 2 matrices.
  • In each of these matrices the number of true positives is the same as in the corresponding cell of the original confusion matrix.

SLIDE 10

Micro and Macro Average

  • Micro Average:
  • 1. Construct a single 2 × 2 confusion matrix by summing up TP, FP, TN and FN from all k one-vs-all matrices.
  • 2. Calculate the performance measure based on this aggregated matrix.
  • Macro Average:
  • 1. Obtain the performance measure from each of the k one-vs-all matrices separately.
  • 2. Calculate the average of all these measures.
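
A minimal sketch of both averaging schemes (not from the slides; the confusion matrix and the use of precision as the measure are illustrative, with rows = prediction and columns = ground truth as on the previous slides):

```python
import numpy as np

def one_vs_all_counts(cm, k):
    """Collapse class k of a confusion matrix into a 2x2 (TP, FP, FN, TN)."""
    tp = cm[k, k]
    fp = cm[k, :].sum() - tp          # predicted k, ground truth other
    fn = cm[:, k].sum() - tp          # ground truth k, predicted other
    tn = cm.sum() - tp - fp - fn
    return tp, fp, fn, tn

cm = np.array([[5, 1, 0],
               [2, 6, 1],
               [0, 1, 4]])           # toy 3-class confusion matrix

per_class = [one_vs_all_counts(cm, k) for k in range(cm.shape[0])]

# Micro average: sum the counts first, then compute the measure once.
tp, fp, fn, tn = np.sum(per_class, axis=0)
micro_precision = tp / (tp + fp)

# Macro average: compute the measure per class, then average.
macro_precision = np.mean([t / (t + f) for t, f, _, _ in per_class])
print(micro_precision, macro_precision)
```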

SLIDE 11

F1-Measure

F1-measure is the harmonic mean of positive predictive value and sensitivity:

    F1 = 2 · PPV · sensitivity / (PPV + sensitivity)    (1)

  • Micro Average F1-Measure:
  • 1. Calculate the sums of TP, FP, and FN across all classes.
  • 2. Calculate F1 based on these values.
  • Macro Average F1-Measure:
  • 1. Calculate PPV and sensitivity for each class separately.
  • 2. Calculate mean PPV and sensitivity.
  • 3. Calculate F1 based on these mean values.

[Figure: F1 as a function of PPV and sensitivity]
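
A quick way to get both variants, as a sketch assuming scikit-learn is available (labels are toy data; note that scikit-learn's macro variant averages the per-class F1 scores, which differs slightly from averaging PPV and sensitivity first as described above):

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

# Micro: pool TP/FP/FN over all classes, then compute F1 once.
print(f1_score(y_true, y_pred, average="micro"))
# Macro: compute F1 per class, then take the unweighted mean.
print(f1_score(y_true, y_pred, average="macro"))
```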

SLIDE 12

1 Classification

  • 1. Confusion Matrix
  • 2. Receiver operating characteristics
  • 3. Precision-Recall Curve

2 Regression

3 Unsupervised Methods

4 Validation

  • 1. Cross-Validation
  • 2. Leave-one-out Cross-Validation
  • 3. Bootstrap Validation

5 How to Do Cross-Validation

SLIDE 13

Receiver operating characteristics (ROC)

  • A binary classifier returns a probability or score that represents the degree to which an instance belongs to a class.
  • The ROC plot compares sensitivity (y-axis) with false positive rate (x-axis) for all possible thresholds of the classifier’s score.
  • It visualizes the trade-off between benefits (sensitivity) and costs (FPR).
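
An illustrative sketch with scikit-learn (not from the slides; scores and labels are made up):

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]                      # ground truth labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.55, 0.7]    # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)        # one point per threshold
print(list(zip(fpr, tpr)))
print("AUC =", roc_auc_score(y_true, y_score))
```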

[Figure: ROC curve, true positive rate vs. false positive rate]

SLIDE 14

ROC Curve

  • The line from the lower left to the upper right corner indicates a random classifier.
  • The curve of a perfect classifier goes through the upper left corner at (0, 1).
  • A single confusion matrix corresponds to one point in ROC space.
  • It is insensitive to changes in class distribution or changes in error costs.

[Figure: ROC plot, true positive rate vs. false positive rate]

SLIDE 15

Area under the ROC curve (AUC)

  • The AUC is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance (Mann-Whitney U test).
  • The Gini coefficient is twice the area that lies between the diagonal and the ROC curve: Gini coefficient + 1 = 2 · AUC
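
A sketch of this ranking interpretation (toy scores, not from the slides): counting correctly ordered positive/negative pairs reproduces the AUC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.55, 0.7])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
# Fraction of (positive, negative) pairs ranked correctly; ties count 1/2.
pairs = pos[:, None] - neg[None, :]
auc_rank = (np.sum(pairs > 0) + 0.5 * np.sum(pairs == 0)) / pairs.size
print(auc_rank, roc_auc_score(y_true, y_score))  # both give the same value
```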

[Figure: ROC curve with shaded area under the curve, AUC = 0.89]

SLIDE 16

Averaging ROC curves I

  • Merging: Merge the instances of n tests and their respective scores and sort the complete set.
  • Vertical averaging:
  • 1. Take vertical samples of the ROC curves for fixed false positive rates.
  • 2. Construct confidence intervals for the mean of the true positive rates.
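
A minimal sketch of vertical averaging, assuming each ROC curve is given as a pair of (FPR, TPR) arrays sorted by FPR (function name and toy curves are illustrative):

```python
import numpy as np

def vertical_average(curves, grid=np.linspace(0.0, 1.0, 11)):
    """Average several ROC curves at fixed false positive rates.

    curves: list of (fpr, tpr) pairs, each sorted by increasing fpr.
    Returns the grid and the mean TPR at each grid point.
    """
    tprs = [np.interp(grid, fpr, tpr) for fpr, tpr in curves]
    return grid, np.mean(tprs, axis=0)

# Two toy curves
c1 = (np.array([0.0, 0.2, 0.5, 1.0]), np.array([0.0, 0.6, 0.8, 1.0]))
c2 = (np.array([0.0, 0.1, 0.4, 1.0]), np.array([0.0, 0.4, 0.9, 1.0]))
grid, mean_tpr = vertical_average([c1, c2])
print(mean_tpr)
```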

[Figure: vertically averaged ROC curve, average true positive rate vs. false positive rate]

SLIDE 17

Averaging ROC curves II

  • Threshold averaging:
  • 1. Do merging as described above.
  • 2. Sample based on thresholds instead of points in ROC space.
  • 3. Create confidence intervals for FPR and TPR at each point.

[Figure: threshold-averaged ROC curve, average true positive rate vs. average false positive rate]

SLIDE 18

Disadvantages of ROC curves

  • ROC curves can present an overly optimistic view of an algorithm’s performance if there is a large skew in the class distribution, i.e. the data set contains many more samples of one class.
  • A large change in the number of false positives can then lead to only a small change in the false positive rate: FPR = FP / (FP + TN)
  • Comparing false positives to true positives (precision) rather than to true negatives (FPR) captures the effect of the large number of negative examples: Precision = TP / (FP + TP)

SLIDE 19

1 Classification

  • 1. Confusion Matrix
  • 2. Receiver operating characteristics
  • 3. Precision-Recall Curve

2 Regression

3 Unsupervised Methods

4 Validation

  • 1. Cross-Validation
  • 2. Leave-one-out Cross-Validation
  • 3. Bootstrap Validation

5 How to Do Cross-Validation

SLIDE 20

Precision-Recall Curve

  • Compares precision (y-axis) to recall (x-axis) at different thresholds.
  • The PR curve of an optimal classifier lies in the upper-right corner.
  • One point in PR space corresponds to a single confusion matrix.
  • Average precision is the area under the PR curve.
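
An illustrative sketch with scikit-learn (the labels and scores are made up):

```python
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.55, 0.7]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(list(zip(recall, precision)))
print("Average precision =", average_precision_score(y_true, y_score))
```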

[Figure: precision-recall curve]

SLIDE 21

Relationship to Precision-Recall Curve

  • Algorithms that optimize the area under the ROC curve are not guaranteed to optimize the area under the PR curve.
  • Example: A dataset has 20 positive examples and 2000 negative examples.

[Figure: ROC curve (true positive rate vs. false positive rate) and precision-recall curve for this example dataset]

SLIDE 22

1 Classification

  • 1. Confusion Matrix
  • 2. Receiver operating characteristics
  • 3. Precision-Recall Curve

2 Regression

3 Unsupervised Methods

4 Validation

  • 1. Cross-Validation
  • 2. Leave-one-out Cross-Validation
  • 3. Bootstrap Validation

5 How to Do Cross-Validation

SLIDE 23

Evaluating Regression Results

  • Remember that the predicted value is continuous.
  • Measuring the performance is based on comparing the actual value yᵢ with the predicted value ŷᵢ for each sample.
  • Measures are either the sum of squared or of absolute differences.

SLIDE 24

Regression – Performance Measures

  • Sum of absolute error (SAE): Σ_{i=1}^{n} |yᵢ − ŷᵢ|
  • Sum of squared errors (SSE): Σ_{i=1}^{n} (yᵢ − ŷᵢ)²
  • Mean squared error (MSE): SSE / n
  • Root mean squared error (RMSE): √MSE
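
A minimal NumPy sketch of these four measures (the values are made up):

```python
import numpy as np

y_true = np.array([3.0, 2.5, 4.1, 1.2])   # actual values
y_pred = np.array([2.8, 2.9, 3.6, 1.0])   # predictions

sae  = np.sum(np.abs(y_true - y_pred))
sse  = np.sum((y_true - y_pred) ** 2)
mse  = sse / len(y_true)
rmse = np.sqrt(mse)
print(sae, sse, mse, rmse)
```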

SLIDE 25

1 Classification

  • 1. Confusion Matrix
  • 2. Receiver operating characteristics
  • 3. Precision-Recall Curve

2 Regression

3 Unsupervised Methods

4 Validation

  • 1. Cross-Validation
  • 2. Leave-one-out Cross-Validation
  • 3. Bootstrap Validation

5 How to Do Cross-Validation

SLIDE 26

Unsupervised Methods

  • Problem: Ground truth is usually not available or requires manual assignment.

  • Without ground truth (internal validation):
  • Cohesion
  • Separation
  • Silhouette Coefficient
  • With ground truth (external validation):
  • Jaccard index
  • Dice’s coefficient
  • (Normalized) mutual information
  • (Adjusted) Rand index

SLIDE 27

Cohesion and Separation

  • Requires the definition of a proximity measure, such as a distance or similarity:

    cohesion(Cᵢ) = Σ_{x,y ∈ Cᵢ} proximity(x, y)

    separation(Cᵢ, Cⱼ) = Σ_{x ∈ Cᵢ, y ∈ Cⱼ} proximity(x, y)
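
A rough sketch of both quantities, using Euclidean distance as the proximity measure (assumes NumPy and SciPy are available; with a similarity instead of a distance, larger cohesion would be better and larger separation worse):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def cohesion(cluster):
    """Sum of pairwise distances within one cluster."""
    return pdist(cluster).sum()

def separation(cluster_i, cluster_j):
    """Sum of distances between all pairs of points from two clusters."""
    return cdist(cluster_i, cluster_j).sum()

a = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1]])
b = np.array([[2.0, 2.0], [2.1, 1.9]])
print(cohesion(a), separation(a, b))
```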

SLIDE 28

Silhouette Coefficient

  • a(i) is the mean distance between the i-th sample and all other points in the same cluster.
  • b(i) is the mean distance to all points in the next nearest cluster.
  • The silhouette coefficient s(i) ∈ [−1, 1] is defined as

    s(i) = (b(i) − a(i)) / max(a(i), b(i))

  • s(i) = 1 if the clustering is dense and well separated
  • s(i) = −1 if the i-th sample was assigned incorrectly
  • s(i) = 0 if clusters overlap
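
A sketch using scikit-learn's implementation (the two-cluster toy data and the choice of k-means are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [2.0, 2.0], [2.1, 1.9], [1.9, 2.2]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_samples(X, labels))  # s(i) for each sample
print(silhouette_score(X, labels))    # mean over all samples
```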

SLIDE 29

Jaccard Index and Dice’s Coefficient

  • Consider two sets S1, S2, where one set is used as ground truth and the other was predicted.
  • Example: Pixels in image classification or segmentation.
  • Jaccard index:

    Jaccard(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2| ∈ [0, 1]

  • Dice’s coefficient:

    Dice(S1, S2) = 2 |S1 ∩ S2| / (|S1| + |S2|) ∈ [0, 1]
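
A sketch for binary segmentation masks (the masks are made up; for label vectors, scikit-learn's jaccard_score would be an alternative):

```python
import numpy as np

# Toy binary segmentation masks: True = foreground pixel.
truth = np.array([[1, 1, 0], [0, 1, 0], [0, 0, 0]], dtype=bool)
pred  = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]], dtype=bool)

intersection = np.logical_and(truth, pred).sum()
union        = np.logical_or(truth, pred).sum()

jaccard = intersection / union
dice    = 2 * intersection / (truth.sum() + pred.sum())
print(jaccard, dice)  # 0.5 and 2/3 for these masks
```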

SLIDE 30

1 Classification

  • 1. Confusion Matrix
  • 2. Receiver operating characteristics
  • 3. Precision-Recall Curve

2 Regression

3 Unsupervised Methods

4 Validation

  • 1. Cross-Validation
  • 2. Leave-one-out Cross-Validation
  • 3. Bootstrap Validation

5 How to Do Cross-Validation

SLIDE 31

Validation Regimes

  • No Re-sampling
    • Hold-Out
  • Re-sampling
    • Simple Re-sampling
      • Cross-Validation (Stratified, Non-Stratified, Leave-One-Out)
      • Random Sub-Sampling
    • Multiple Re-sampling
      • Bootstrapping
      • Randomization (Permutation Test)
      • Repeated Cross-Validation (5×2 CV, 10×10 CV)

SLIDE 32

Validation

  • Test error: Prediction error over an independent sample.
  • Training error: Average loss over the training samples,

    (1/n) Σ_{i=1}^{n} L(yᵢ, f̂(xᵢ))

  • As the model gets more complex, it infers more information from the training data to represent more complicated underlying structures.

SLIDE 33

Validation – Training Error

  • Training error consistently decreases with increasing model complexity, whereas test error starts to increase again.
  • Training error is not a good measure of performance.

SLIDE 34

Validation – Over- and Underfitting

[Figure: prediction error vs. model complexity, with underfitting and overfitting regions and the optimum in between]

  • Overfitting: A model with zero or very low training error is likely to perform well on the training data but generalize badly (model too complex).
  • Underfitting: The model does not capture the underlying structure and hence performs poorly (model too simple).

SLIDE 35

Validation – Ideal Situation

  • Assume we have access to large amount of data.
  • Construct three different sets
  • 1. Training set: Used to fit the model.
  • 2. Validation set: Used to estimate the prediction error in order to choose the best model (e.g. different costs C for SVMs).
  • 3. Test set: Used to assess how well the final model generalizes.

[Figure: data set split into training, validation, and test portions]

SLIDE 36

Cross-Validation

[Figure: 5-fold cross-validation; the data set is split into folds 1-5, each fold serves once as validation set while the remaining folds are used for training, and the five performance values are averaged]

  • Cross-validation: Split the data set into k equally large parts; each part serves once as validation set while the remaining parts are used for training.
  • Stratified cross-validation: Ensures that the ratio between classes is the same in each fold as in the complete dataset.
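
An illustrative sketch of stratified 5-fold cross-validation with scikit-learn (the data set and classifier are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=10, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(SVC(C=1.0), X, y, cv=cv)  # accuracy per fold
print(scores, scores.mean())
```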

SLIDE 37

Leave-one-out Cross-Validation

  • Use all but one sample for training and assess performance on the excluded sample.
  • For a data set with n samples, leave-one-out cross-validation is equivalent to n-fold cross-validation.
  • Not suitable if the data set is very large and/or training the classifier takes a long time.
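
The corresponding sketch for leave-one-out cross-validation (again, the data set and classifier are arbitrary choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# One fold per sample: n fits, each evaluated on the single held-out sample.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=LeaveOneOut())
print(scores.mean())
```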

SLIDE 38

Bootstrap Sampling

  • The bootstrap is a general tool for assessing statistical accuracy.
  • Assumption: Our data set is a representative portion of the overall population.
  • Bootstrap sampling: Randomly draw samples with replacement from the original data set to generate new data sets of the same size.

SLIDE 39

Bootstrap Validation

  • Bootstrap sampling is repeated B times and the samples not included in each bootstrap sample are recorded.
  • Train a model on each of the B bootstrap samples.
  • For each sample of the original data set, assess performance only on bootstrap samples not containing this sample:

    (1/n) Σ_{i=1}^{n} (1/|C^(−i)|) Σ_{b ∈ C^(−i)} L(yᵢ, f̂_b(xᵢ)),

    where C^(−i) is the set of bootstrap samples that do not contain sample i.
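
A rough sketch of this estimate (all concrete choices, i.e. B, the 0/1 loss, the k-NN model, and the synthetic data, are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=60, n_features=5, random_state=0)
n, B = len(y), 100
rng = np.random.default_rng(0)

# For every sample i, collect losses from models whose bootstrap sample
# did not contain i (the "out-of-bag" models for i).
losses = [[] for _ in range(n)]
for _ in range(B):
    idx = rng.integers(0, n, size=n)           # draw with replacement
    oob = np.setdiff1d(np.arange(n), idx)      # samples not drawn
    model = KNeighborsClassifier(n_neighbors=3).fit(X[idx], y[idx])
    pred = model.predict(X[oob])
    for i, p in zip(oob, pred):
        losses[i].append(p != y[i])            # 0/1 loss

err = np.mean([np.mean(l) for l in losses if l])
print("Leave-one-out bootstrap error:", err)
```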

SLIDE 40

1 Classification

  • 1. Confusion Matrix
  • 2. Receiver operating characteristics
  • 3. Precision-Recall Curve

2 Regression

3 Unsupervised Methods

4 Validation

  • 1. Cross-Validation
  • 2. Leave-one-out Cross-Validation
  • 3. Bootstrap Validation

5 How to Do Cross-Validation

SLIDE 41

A Typical Strategy

  • 1. Find a “good” subset of features that show fairly strong (univariate) correlation with the class labels.
  • 2. Using just this subset of features, build a multivariate classifier.
  • 3. Use cross-validation to estimate the unknown hyper-parameters and to estimate the prediction error of the final model.

SLIDE 42

A Typical Strategy

  • 1. Find a “good” subset of features that show fairly strong (univariate) correlation with the class labels.
  • 2. Using just this subset of features, build a multivariate classifier.
  • 3. Use cross-validation to estimate the unknown hyper-parameters and to estimate the prediction error of the final model.

Is this the correct way to do cross-validation?

SLIDE 43

Scenario

  • Consider a data set with 50 samples in two equal-sized classes and 5000 features that are independent of the class labels.
  • The true test error rate of any classifier is 50%.
  • Example:
  • 1. Choose the 100 predictors with the highest correlation with the class labels.
  • 2. Use a 1-Nearest Neighbor classifier based on these 100 features.
  • 3. Result: Doing 50 simulations in this setting yielded an average CV error rate of 1.4%.

SLIDE 44

What Happened?

  • The classifier had an unfair advantage because the features were selected based on all samples.
  • This violates the requirement that the test set be completely independent of the training set, because the classifier has already “seen” the samples in the test set.

SLIDE 45

What Happened?

[Figure: histograms of the correlations of the selected features with the class label, for the wrong (left) and the correct (right) way of doing cross-validation]

SLIDE 46

How to Do It Right?

  • 1. Divide the data set into K folds at random.
  • 2. For each fold k:
    2.1 Find a subset of “good” features, using all samples except those in fold k.
    2.2 Using this subset, build a multivariate classifier, using all samples except those in fold k.
    2.3 Use the classifier to predict the class labels of the samples in fold k.
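
One way to implement this with scikit-learn is to put feature selection and the classifier into a single pipeline that is refit within every fold; the sketch below mimics the scenario from slide 43 (all concrete choices are illustrative):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5000))              # features independent of the labels
y = np.repeat([0, 1], 25)                    # two equal-sized classes

pipe = make_pipeline(SelectKBest(f_classif, k=100),
                     KNeighborsClassifier(n_neighbors=1))
# The selector is refit on the training part of every fold, so the held-out
# fold never influences which features are chosen.
scores = cross_val_score(pipe, X, y, cv=5)
print("Estimated error rate:", 1 - scores.mean())   # close to 0.5, as expected
```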

SLIDE 47

How to Do It Right?

  • 1. Divide the data set into K folds at random.
  • 2. For each fold k:
    2.1 Find a subset of “good” features, using all samples except those in fold k.
    2.2 Using this subset, build a multivariate classifier, using all samples except those in fold k.
    2.3 Use the classifier to predict the class labels of the samples in fold k.

Result

The estimated mean error rate is 51.2%, which is much closer to the true test error rate.

SLIDE 48

How to Do It Right?

  • Cross-validation must be applied to the entire sequence of modeling steps.

  • Examples:
  • Selection of features
  • Tuning of hyper-parameters

SLIDE 49

Conclusion

  • Many different performance measures for classification exist.
  • ROC and Precision-Recall curves can be applied to binary classifiers which return probabilities or scores.
  • Cross-Validation is the most commonly used validation scheme.
  • The bootstrap is not only useful for validation; it can be used in many other applications as well (e.g. bagging).

Important

Every performance measure has its advantages and its disadvantages. There is no best measure. Therefore, you have to consider multiple measures to evaluate your model.

SLIDE 50

References (1)

Davis, J. and Goadrich, M. (2006). The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, pages 233–240, New York, NY, USA. ACM.

Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer, second edition. http://www-stat.stanford.edu/~tibs/ElemStatLearn/.

SLIDE 51

References (2)

Parker, C. (2011). An Analysis of Performance Measures for Binary Classifiers. In 2011 IEEE 11th International Conference on Data Mining, pages 517–526. IEEE.
