SLIDE 1

Jian Pei: CMPT 741/459 Classification (2) 1

The Evaluation Issues

  • The accuracy of a classifier can be evaluated using a test data set
    – The test set is a part of the available labeled data set
  • But how can we evaluate the accuracy of a classification method?
    – A classification method can generate many classifiers
  • What if the available labeled data set is too small?

SLIDE 2

Holdout Method

  • Partition the available labeled data set into two disjoint subsets: the training set and the test set
    – 50-50, or 2/3 for training and 1/3 for testing
  • Build a classifier using the training set
  • Evaluate the accuracy using the test set
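The partition step above can be sketched in a few lines of Python. This is an illustration, not part of the original slides; the function name `holdout_split` and the 30-record toy data are hypothetical.

```python
import random

def holdout_split(data, train_frac=2/3, seed=42):
    """Randomly partition labeled data into disjoint training and test sets.
    train_frac=2/3 gives the 2/3-1/3 split mentioned on the slide."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = data[:]                 # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Toy labeled data: (feature, label) pairs
records = [(i, i % 2) for i in range(30)]
train, test = holdout_split(records)   # 20 training records, 10 test records
```

The classifier would then be fit on `train` and its accuracy measured on `test` only.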
SLIDE 3

Limitations of Holdout Method

  • Fewer labeled examples for training
  • The classifier highly depends on the composition of the training and test sets
    – The smaller the training set, the larger the variance
  • If the test set is too small, the evaluation is not reliable
  • The training and test sets are not independent

SLIDE 4

Cross-Validation

  • Each record is used the same number of times for training and exactly once for testing
  • K-fold cross-validation
    – Partition the data into k equal-sized subsets
    – In each round, use one subset as the test set and the remaining subsets together as the training set
    – Repeat k times
    – The total error is the sum of the errors in the k rounds
  • Leave-one-out: k = n
    – Utilizes as much data as possible for training
    – Computationally expensive
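A minimal sketch of the k-fold partitioning (illustration only; the generator name `kfold_indices` is hypothetical). Each record index lands in exactly one test fold, matching the slide's requirement:

```python
def kfold_indices(n, k):
    """Yield (test_indices, train_indices) pairs for k-fold cross-validation
    over n records. Every record appears in exactly one test fold."""
    folds = [list(range(i, n, k)) for i in range(k)]   # k near-equal subsets
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield test, train

# Leave-one-out is simply kfold_indices(n, n): each fold holds one record.
```

The total cross-validation error is obtained by training on each `train`, testing on the matching `test`, and summing the k per-round errors.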

SLIDE 5

Confidence Interval for Accuracy

  • Suppose a classifier C is tested on a test set of n cases, and the accuracy is acc
  • How much confidence can we have in acc?
  • We need to estimate the confidence interval of a given model accuracy
    – The interval within which one is sufficiently sure that the true population value lies or, equivalently, a bound on the probable error of the estimate
  • A confidence interval procedure uses the data to determine an interval with the property that – viewed before the sample is selected – the interval has a given high probability of containing the true population value

SLIDE 6

Binomial Experiments

  • When a coin is flipped, it has a probability p of turning up heads
  • If the coin is flipped N times, what is the probability that we see heads v times?
    – Expectation (mean): Np
    – Variance: Np(1 − p)

    P(X = v) = C(N, v) · p^v · (1 − p)^(N − v)

SLIDE 7

Confidence Level and Approximation

  • Approximating the binomial with a normal distribution, Z_{α/2} is the bound at confidence level (1 − α):

    P( Z_{α/2} < (acc − p) / sqrt(p(1 − p)/N) < Z_{1−α/2} ) = 1 − α

  • Solving for p gives the interval bounds:

    p = ( 2N·acc + Z_{α/2}² ± Z_{α/2}·sqrt(Z_{α/2}² + 4N·acc − 4N·acc²) ) / ( 2(N + Z_{α/2}²) )

  [Figure: standard normal density, area 1 − α between Z_{α/2} and Z_{1−α/2}]
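The solved-for interval can be computed directly. A sketch (the function name `accuracy_interval` is hypothetical; z = 1.96 corresponds to the common 95% confidence level):

```python
from math import sqrt

def accuracy_interval(acc, n, z=1.96):
    """Confidence interval for the true accuracy p, given observed accuracy
    acc on n test cases, using the normal approximation from the slide:
    p = (2N*acc + Z^2 +/- Z*sqrt(Z^2 + 4N*acc - 4N*acc^2)) / (2(N + Z^2))"""
    center = 2 * n * acc + z * z
    spread = z * sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z * z)
    return (center - spread) / denom, (center + spread) / denom

lo, hi = accuracy_interval(0.8, 100)   # e.g., acc = 0.8 observed on n = 100
```

Note the interval is not symmetric around acc, and it narrows as n grows, which matches the intuition that a larger test set gives a more reliable estimate.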

SLIDE 8

Accuracy Can Be Misleading …

  • Consider a data set with 99% of the negative class and 1% of the positive class
  • A classifier that predicts everything as negative has an accuracy of 99%, though it does not work for the positive class at all!
  • Imbalanced class distributions are common in many applications
    – Medical applications, fraud detection, …

SLIDE 9

Performance Evaluation Matrix

  • Confusion matrix (contingency table, error matrix): used for imbalanced class distributions

                        PREDICTED CLASS
                        Class=Yes   Class=No
    ACTUAL  Class=Yes   a (TP)      b (FN)
    CLASS   Class=No    c (FP)      d (TN)

    Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

SLIDE 10

Performance Evaluation Matrix

                        PREDICTED CLASS
                        Class=Yes   Class=No
    ACTUAL  Class=Yes   a (TP)      b (FN)
    CLASS   Class=No    c (FP)      d (TN)

    True positive rate (TPR, sensitivity) = TP / (TP + FN)
    True negative rate (TNR, specificity) = TN / (TN + FP)
    False positive rate (FPR) = FP / (TN + FP)
    False negative rate (FNR) = FN / (TP + FN)
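The four rates follow mechanically from the confusion-matrix counts. A sketch (the function name `rates` is hypothetical):

```python
def rates(tp, fn, fp, tn):
    """Sensitivity/specificity and the two error rates, computed from the
    confusion-matrix cells a (TP), b (FN), c (FP), d (TN)."""
    return {
        "TPR": tp / (tp + fn),   # sensitivity: positives caught
        "TNR": tn / (tn + fp),   # specificity: negatives caught
        "FPR": fp / (tn + fp),   # negatives wrongly flagged positive
        "FNR": fn / (tp + fn),   # positives missed
    }
```

By construction TPR + FNR = 1 and TNR + FPR = 1, so each class contributes one complementary pair.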

SLIDE 11

Recall and Precision

  • Useful when the target class is more important than the other classes

                        PREDICTED CLASS
                        Class=Yes   Class=No
    ACTUAL  Class=Yes   a (TP)      b (FN)
    CLASS   Class=No    c (FP)      d (TN)

    Precision p = TP / (TP + FP)
    Recall r = TP / (TP + FN)

SLIDE 12

Fallout

  • Type I errors – false positives: a negative object is classified as positive
    – Fallout: the type I error rate, FP / (TN + FP)
  • Type II errors – false negatives: a positive object is classified as negative
    – Captured by recall

SLIDE 13

Fβ Measure

  • How can we summarize precision and recall into one metric?
    – Using the harmonic mean between the two
  • F (F1) measure:

    F = 2rp / (r + p) = 2TP / (2TP + FP + FN)

  • Fβ measure:

    Fβ = (β² + 1)rp / (r + β²p) = (β² + 1)TP / ((β² + 1)TP + β²FN + FP)

    – β = 0: Fβ is the precision
    – β = ∞: Fβ is the recall
    – 0 < β < ∞: Fβ is a tradeoff between the precision and the recall
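A sketch of Fβ computed straight from the definitions above (the function name `f_beta` is hypothetical):

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F_beta = (beta^2 + 1) * r * p / (r + beta^2 * p),
    with precision p = TP/(TP+FP) and recall r = TP/(TP+FN)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    b2 = beta * beta
    return (b2 + 1) * p * r / (r + b2 * p)
```

Setting beta = 0 recovers the precision, a very large beta approaches the recall, and beta = 1 gives the familiar F1 = 2TP / (2TP + FP + FN), consistent with the two equivalent forms on the slide.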

SLIDE 14

Weighted Accuracy

  • A more general metric:

    Weighted Accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d)

    Measure     w1       w2   w3   w4
    Recall      1        1    0    0
    Precision   1        0    1    0
    Fβ          β² + 1   β²   1    0
    Accuracy    1        1    1    1

SLIDE 15

ROC Curve

  • Receiver Operating Characteristic (ROC)
  • Consider a 1-dimensional data set containing 2 classes; any point located at x > t is classified as positive

SLIDE 16

ROC Curve

  • Points (TPR, FPR):
    – (0, 0): declare everything to be the negative class
    – (1, 1): declare everything to be the positive class
    – (1, 0): ideal
  • Diagonal line:
    – Random guessing
    – Below the diagonal line: the prediction is the opposite of the true class

Figure from [Tan, Steinbach, Kumar]
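Sweeping the threshold t over the scores traces out the curve's points. A sketch (illustrative only; `roc_points` and the 4-record toy data are hypothetical, and labels use 1 = positive, 0 = negative):

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping the decision threshold t over
    the 1-d scores; a record is predicted positive when its score >= t."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = []
    for t in sorted(set(scores)) + [float("inf")]:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        pts.append((fp / neg, tp / pos))
    return pts

pts = roc_points([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

The lowest threshold predicts everything positive, giving (1, 1); the infinite threshold predicts everything negative, giving (0, 0), matching the extreme points on the slide.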

SLIDE 17

Comparing Two Classifiers

Figure from [Tan, Steinbach, Kumar]

SLIDE 18

Cost-Sensitive Learning

  • In some applications, misclassifying some classes may be disastrous
    – Tumor detection, fraud detection
  • Use a cost matrix, e.g.:

                        PREDICTED CLASS
                        Class=Yes   Class=No
    ACTUAL  Class=Yes   -1          100
    CLASS   Class=No    1           0

SLIDE 19

Sampling for Imbalanced Classes

  • Consider a data set containing 100 positive examples and 1,000 negative examples
  • Undersampling: use a random sample of 100 negative examples and all positive examples
    – Some useful negative examples may be lost
    – Run undersampling multiple times and use the ensemble of the multiple base classifiers
    – Focused undersampling: remove negative samples that are not useful for classification, e.g., those far away from the decision boundary
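A minimal sketch of the random undersampling step (the function name `undersample` is hypothetical):

```python
import random

def undersample(positives, negatives, seed=0):
    """Keep all positive examples and draw an equal-size random sample of
    the negatives. Note: informative negatives may be discarded, which is
    why the slide suggests repeating this and ensembling the classifiers."""
    rng = random.Random(seed)
    return positives + rng.sample(negatives, len(positives))

pos = [("p", i) for i in range(100)]     # 100 positive examples
neg = [("n", i) for i in range(1000)]    # 1,000 negative examples
balanced = undersample(pos, neg)         # 200 examples, 50/50 split
```

Running this with several different seeds and voting over the resulting base classifiers is the ensemble variant described in the second sub-bullet.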

SLIDE 20

Oversampling

  • Replicate the positive examples until the training set has an equal number of positive and negative examples
  • For noisy data, this may cause overfitting
SLIDE 21

Significance Tests

  • Are two algorithms different in effectiveness?
    – The null hypothesis: there is NO difference
    – The alternative hypothesis: there is a difference, e.g., B is better than A (the baseline method)
  • Matched pair experiments: the rankings being compared are based on the same set of queries for both algorithms
  • Possible errors of significance tests
    – Type I: the null hypothesis is rejected when it is true
    – Type II: the null hypothesis is accepted when it is false
  • The power of a hypothesis test: the probability that the test will correctly reject the null hypothesis
    – Higher power means fewer type II errors

SLIDE 22

Procedure of Comparison

  • Using a set of data sets
  • Procedure
    – Compute the effectiveness measure for every data set
    – Compute a test statistic based on a comparison of the effectiveness measures for each data set
      • E.g., the t-test, the Wilcoxon signed-rank test, and the sign test
    – Compute a P-value: the probability that a test statistic value at least that extreme could be observed if the null hypothesis were true
    – The null hypothesis is rejected if the P-value ≤ α, where α is the significance level, which is used to minimize the type I errors
  • One-sided (one-tailed) tests: whether B is better than A (the baseline method)
    – Two-sided tests: whether A and B are different – the P-value is doubled

SLIDE 23

Distribution of Test Statistics

SLIDE 24

T-test

  • Assumes data values are sampled from normal distributions
    – In a matched pair experiment, assume the difference between the effectiveness values is a sample from a normal distribution
  • The null hypothesis: the mean of the distribution of differences is 0
    – B̄ − Ā is the mean of the differences, σ_{B−A} is the standard deviation of the differences:

    t = (B̄ − Ā) / σ_{B−A} · √N,   where σ² = (1/N) Σ_{i=1}^{N} (x_i − x̄)²

SLIDE 25

Example

  • B̄ − Ā = 21.4, σ_{B−A} = 29.1, t = 2.33
  • P-value = 0.02: significant at a level of α = 0.05 – the null hypothesis can be rejected
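The t statistic for this example follows directly from the formula on the previous slide. A sketch (the function name `paired_t` is hypothetical; N = 10 queries is assumed, which reproduces the slide's value):

```python
from math import sqrt

def paired_t(mean_diff, std_diff, n):
    """Matched-pair t statistic: t = (mean of differences) / (std of
    differences) * sqrt(N), per the formula on the T-test slide."""
    return mean_diff / std_diff * sqrt(n)

t = paired_t(21.4, 29.1, 10)   # the example's summary statistics
```

The resulting t of about 2.33 is then compared against the t distribution with N − 1 degrees of freedom to obtain the P-value.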

SLIDE 26

Issues in T-test

  • Data is assumed to be sampled from normal distributions
    – Generally inappropriate for effectiveness measures
    – However, experiments showed that the t-test produces very similar results to the randomization test, which does not assume any distribution (the most powerful nonparametric test)
  • The t-test assumes that the evaluation data is measured on an interval scale
    – Effectiveness measures are ordinal – the magnitudes of the differences are not significant
    – Use the Wilcoxon signed-rank test or the sign test, which make fewer assumptions about the effectiveness measure but are less powerful

SLIDE 27

Wilcoxon Signed-Rank Test

  • Assumption: the differences between the effectiveness values can be ranked, but the magnitude is not important
  • Procedure
    – Sort the differences by their absolute values in increasing order
    – Assign rank values (ties are assigned the average rank)
    – Give each rank value the sign of the original difference
  • Test statistic: w = Σ_{i=1}^{N} R_i, where R_i is a signed rank and N is the number of non-zero differences
  • The null hypothesis: the sum of the positive ranks is the same as the sum of the negative ranks

SLIDE 28

Example

  • The non-zero differences in increasing order of absolute value: 2, 9, 10, 24, 25, 25, 41, 60, 70
  • The signed ranks: -1, +2, +3, -4, +5.5, +5.5, +7, +8, +9
  • w = 35
  • P-value = 0.025: significant at a level of α = 0.05 – the null hypothesis can be rejected
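The ranking procedure can be sketched as follows (illustration only; `signed_rank_sum` is a hypothetical name, and the signs of the input differences are taken from the example's signed ranks):

```python
def signed_rank_sum(diffs):
    """Wilcoxon w: sort non-zero differences by absolute value, assign ranks
    (ties get the average rank), give each rank the sign of its difference,
    and sum."""
    d = sorted((x for x in diffs if x != 0), key=abs)
    signed = []
    i = 0
    while i < len(d):
        j = i
        while j < len(d) and abs(d[j]) == abs(d[i]):
            j += 1                      # group of tied absolute values d[i:j]
        avg = (i + 1 + j) / 2           # average of ranks i+1 .. j
        for k in range(i, j):
            signed.append(avg if d[k] > 0 else -avg)
        i = j
    return sum(signed)

# The example's differences: the two 25s tie and share rank (5 + 6) / 2 = 5.5
w = signed_rank_sum([-2, 9, 10, -24, 25, 25, 41, 60, 70])
```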

SLIDE 29

Sign Test

  • Completely ignores the magnitude of the differences
    – In practice, we may require a 5-10% difference for a pair to be considered different
  • The null hypothesis: P(B > A) = P(A > B) = ½
  • Test statistic: the number of pairs with B > A
SLIDE 30

Example

  • B > A on 7 pairs out of 10
  • P-value = 0.17: the probability of observing 7 or more successes out of 10 trials where the probability of success is 0.5
  • Cannot reject the null hypothesis
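This P-value is just a binomial tail probability and can be checked directly (a sketch; the function name `sign_test_pvalue` is hypothetical):

```python
from math import comb

def sign_test_pvalue(successes, n):
    """One-sided P-value under the null P(B > A) = 1/2:
    P(X >= successes) where X ~ Binomial(n, 1/2)."""
    return sum(comb(n, k) for k in range(successes, n + 1)) / 2**n

p = sign_test_pvalue(7, 10)   # 7 of 10 pairs with B > A, as in the example
```

Since p ≈ 0.17 > α = 0.05, the null hypothesis cannot be rejected, matching the slide's conclusion.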

SLIDE 31

To-Do List

  • Read Chapter 8.5
  • Understand how to generate classifier evaluation output in Weka
