AUC: a Better Measure than Accuracy in Comparing Learning Algorithms


SLIDE 1

AUC: a Better Measure than Accuracy in Comparing Learning Algorithms

Authors: Charles X. Ling, Department of Computer Science, University of Western Ontario, Canada
& Jin Huang, Department of Computer Science, University of Western Ontario, Canada
& Harry Zhang, Faculty of Computer Science, University of New Brunswick, Canada
Presented by: William Elazmeh, Ottawa-Carleton Institute for Computer Science, Canada

SLIDE 2

Introduction

  • The focus is the visualization and measurement of classifier performance
  • Traditionally, performance = predictive accuracy
  • Accuracy ignores the probability estimates of the classification in favor of class labels
  • ROC curves show the trade-off between the false positive rate and the true positive rate
  • The AUC of the ROC curve is a better measure than accuracy
  • AUC can serve as a criterion for comparing learning algorithms
  • AUC can replace accuracy when comparing classifiers
  • Experimental results show that AUC reveals a difference in performance between decision trees and Naive Bayes (Naive Bayes is significantly better)

SLIDE 3

Confusion Matrix

               Actual +   Actual −
Predicted Y      T+         F+
Predicted N      F−         T−

F+ Rate = F+ / (−)
T+ Rate (Recall) = T+ / (+)
Precision = T+ / (T+ + F+)
Accuracy = ((T+) + (T−)) / ((+) + (−))
F-Score = (2 × Precision × Recall) / (Precision + Recall)
Error Rate = 1 − Accuracy
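
As a minimal sketch (not from the slides), the measures above can be computed directly from the four confusion-matrix counts. The function name `measures` and the example counts are illustrative:

```python
# Compute the slide's evaluation measures from confusion-matrix counts.
# Variable names follow the slide's notation: tp = T+, fp = F+, fn = F−, tn = T−.

def measures(tp, fp, fn, tn):
    pos = tp + fn            # (+): number of actual positives
    neg = fp + tn            # (−): number of actual negatives
    fp_rate = fp / neg       # F+ Rate
    recall = tp / pos        # T+ Rate (Recall)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (pos + neg)
    f_score = 2 * precision * recall / (precision + recall)
    return fp_rate, recall, precision, accuracy, f_score, 1 - accuracy

# Illustrative example: 40 true positives, 10 false positives,
# 5 false negatives, 45 true negatives.
print(measures(40, 10, 5, 45))
```
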

SLIDE 4

ROC Space

[Figure: ROC space; x-axis is False Positive Rate, y-axis is True Positive Rate, both from 0 to 1. Classifiers A, B, C, D, E are plotted as points, along with the trivial "All Positive" and "All Negative" classifiers.]

SLIDE 5

ROC Curves

[Figure: ROC curve traced over the 20 scored examples below; x-axis is False Positive Rate, y-axis is True Positive Rate, both from 0 to 1, with the scores marking threshold points along the curve.]

 #  Class  Score      #  Class  Score
 1    +    0.9       11    +    0.4
 2    +    0.8       12    −    0.39
 3    −    0.7       13    +    0.38
 4    +    0.6       14    −    0.37
 5    +    0.55      15    −    0.36
 6    +    0.54      16    −    0.35
 7    −    0.53      17    +    0.34
 8    −    0.52      18    −    0.33
 9    +    0.51      19    +    0.30
10    −    0.505     20    −    0.1
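
As a minimal sketch (not from the slides), the curve can be traced by lowering a decision threshold through the sorted scores; each example visited moves the curve up (a positive) or right (a negative). The function name `roc_points` is illustrative:

```python
# Build ROC points from the 20 (class, score) examples on this slide.

examples = [
    ('+', 0.9), ('+', 0.8), ('-', 0.7), ('+', 0.6), ('+', 0.55),
    ('+', 0.54), ('-', 0.53), ('-', 0.52), ('+', 0.51), ('-', 0.505),
    ('+', 0.4), ('-', 0.39), ('+', 0.38), ('-', 0.37), ('-', 0.36),
    ('-', 0.35), ('+', 0.34), ('-', 0.33), ('+', 0.30), ('-', 0.1),
]

def roc_points(examples):
    pos = sum(1 for cls, _ in examples if cls == '+')
    neg = len(examples) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    # Visit examples from highest to lowest score; each step lowers the threshold.
    for cls, _ in sorted(examples, key=lambda e: -e[1]):
        if cls == '+':
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))  # (FP rate, TP rate)
    return points

print(roc_points(examples))
```
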
SLIDE 6

ROC Curves

[Figure: the resulting ROC curve plotted in the unit square; x-axis is False Positive Rate, y-axis is True Positive Rate.]

SLIDE 7

Comparing Classifier Performance with ROC

[Figure: ROC curves of two classifiers overlaid in the unit square; x-axis is False Positive Rate, y-axis is True Positive Rate.]

SLIDE 8

Choosing Between Classifiers with ROC

[Figure: ROC curves of candidate classifiers in the unit square; x-axis is False Positive Rate, y-axis is True Positive Rate.]

SLIDE 9

Area Under the Curve (AUC)

AUC = (Σ Rank(+) − |+| × (|+| + 1) / 2) / (|+| × |−|)

where:

Σ Rank(+) is the sum of the ranks of all positive examples in the ranked list
|+| is the number of positive examples in the dataset
|−| is the number of negative examples in the dataset

True class label at each rank (rank 10 = highest score) under classifiers C1, C2, and C3:

Rank   C1   C2   C3
 10     +    −    +
  9     +    +    +
  8     +    +    +
  7     +    +    −
  6     −    +    −
  5     +    −    +
  4     −    −    +
  3     −    −    −
  2     −    −    −
  1     −    +    −

Classifier   AUC                                          Error Rate
C1           ((5+7+8+9+10) − (5×6)/2) / (5×5) = 24/25     20%
C2           ((1+6+7+8+9) − (5×6)/2) / (5×5) = 16/25      20%
C3           ((4+5+8+9+10) − (5×6)/2) / (5×5) = 21/25     40%
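
As a minimal sketch (not from the slides), the rank-based formula above is easy to check against the three classifiers; `auc_from_ranks` is an illustrative name:

```python
# The rank-based AUC formula from this slide, verified on C1, C2, C3.

def auc_from_ranks(positive_ranks, n_pos, n_neg):
    """AUC = (sum of positive ranks − n_pos(n_pos+1)/2) / (n_pos × n_neg)."""
    return (sum(positive_ranks) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(auc_from_ranks([5, 7, 8, 9, 10], 5, 5))   # C1: 0.96 = 24/25
print(auc_from_ranks([1, 6, 7, 8, 9], 5, 5))    # C2: 0.64 = 16/25
print(auc_from_ranks([4, 5, 8, 9, 10], 5, 5))   # C3: 0.84 = 21/25
```
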

SLIDE 10

Comparing Evaluation Measures for Learning Algorithms

  • Let Ψ represent the domain, and let f and g be two evaluation measures used to compare learning algorithms A and B
  • Consistency: f and g are strictly consistent if there exist no a, b ∈ Ψ such that f(a) > f(b) and g(a) < g(b)
  • Discriminancy: f is strictly more discriminating than g if there exist a, b ∈ Ψ such that f(a) > f(b) and g(a) = g(b), and there exist no a, b ∈ Ψ such that g(a) > g(b) and f(a) = f(b)

SLIDE 11

Consistency and Discriminancy

[Figure: schematic of the domain Ψ under measures f and g; point X marks a counterexample to consistency, and point Y marks a counterexample to discriminancy.]

SLIDE 12

Statistical Consistency and Discriminancy of Two Measures

  • Let Ψ represent the domain, and let f and g be two evaluation measures used to compare learning algorithms A and B
  • Degree of Consistency: let R = {(a, b) | a, b ∈ Ψ, f(a) > f(b), g(a) > g(b)} and S = {(a, b) | a, b ∈ Ψ, f(a) > f(b), g(a) < g(b)}. The degree of consistency of f and g is C = |R| / (|R| + |S|), where 0 ≤ C ≤ 1
  • Degree of Discriminancy: let P = {(a, b) | a, b ∈ Ψ, f(a) > f(b), g(a) = g(b)} and Q = {(a, b) | a, b ∈ Ψ, g(a) > g(b), f(a) = f(b)}. The degree of discriminancy of f over g is D = |P| / |Q|
  • The measure f is statistically consistent with and more discriminating than g if and only if C > 0.5 and D > 1. Intuitively, f is then better than g (see the sketch after this list)
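
As a minimal sketch (not from the paper), C and D can be estimated empirically by counting pairs of classifiers over paired measure values; the function name `degrees` and the AUC/accuracy numbers below are made up for illustration:

```python
# Estimate the degree of consistency C and degree of discriminancy D
# from paired measure values f_vals[i], g_vals[i] for classifier i.

from itertools import combinations

def degrees(f_vals, g_vals):
    """C = |R| / (|R| + |S|), D = |P| / |Q| over all pairs."""
    r = s = p = q = 0
    for i, j in combinations(range(len(f_vals)), 2):
        # Each unordered pair lands in at most one of R, S, P, Q;
        # checking both orientations finds the one that applies.
        for a, b in ((i, j), (j, i)):
            if f_vals[a] > f_vals[b] and g_vals[a] > g_vals[b]:
                r += 1      # R: f and g agree on the order
            elif f_vals[a] > f_vals[b] and g_vals[a] < g_vals[b]:
                s += 1      # S: f and g disagree on the order
            elif f_vals[a] > f_vals[b] and g_vals[a] == g_vals[b]:
                p += 1      # P: f discriminates where g ties
            elif g_vals[a] > g_vals[b] and f_vals[a] == f_vals[b]:
                q += 1      # Q: g discriminates where f ties
    c = r / (r + s) if (r + s) else float('nan')
    d = p / q if q else float('inf')
    return c, d

# Illustrative (made-up) AUC and accuracy values for four classifiers:
auc = [0.96, 0.64, 0.84, 0.84]
acc = [0.80, 0.80, 0.60, 0.70]
print(degrees(auc, acc))   # -> (0.5, 1.0) for these made-up numbers
```
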

SLIDE 13

For AUC and Accuracy Formally

  • In domain Ψ, let R = {(a, b) | a, b ∈ Ψ, AUC(a) > AUC(b), acc(a) > acc(b)} and S = {(a, b) | a, b ∈ Ψ, AUC(a) < AUC(b), acc(a) > acc(b)}. Then |R| / (|R| + |S|) > 0.5, i.e. |R| > |S|
  • In domain Ψ, let P = {(a, b) | a, b ∈ Ψ, AUC(a) > AUC(b), acc(a) = acc(b)} and Q = {(a, b) | a, b ∈ Ψ, acc(a) > acc(b), AUC(a) = AUC(b)}. Then |P| > |Q|
  • Experimental results verify these formal results on both balanced and unbalanced datasets
  • Experimental results show that, measured by AUC, the Naive Bayes classifier is significantly better than decision trees

SLIDE 14

AUC and Accuracy Experimental Results (balanced)

Statistical Consistency
  #   AUC(a) > AUC(b) & acc(a) > acc(b)   AUC(a) > AUC(b) & acc(a) < acc(b)       C
  4                  9                                    0                      1.0
  6                113                                    1                      0.991
  8               1459                                   34                      0.977
 10              19742                                  766                      0.963
 12             273600                                13997                      0.951
 14            3864673                               237303                      0.942
 16           55370122                              3868959                      0.935

Statistical Discriminancy
  #   AUC(a) > AUC(b) & acc(a) = acc(b)   acc(a) > acc(b) & AUC(a) = AUC(b)       D
  4                  5                                    0                      NA
  6                 62                                    4                      15.5
  8                762                                   52                      14.7
 10               9416                                  618                      15.2
 12             120374                                 7369                      16.3
 14            1578566                                89828                      17.6
 16           21161143                              1121120                      18.9

SLIDE 15

AUC and Accuracy Experimental Results (unbalanced)

Statistical Consistency
  #   AUC(a) > AUC(b) & acc(a) > acc(b)   AUC(a) > AUC(b) & acc(a) < acc(b)       C
  4                  3                                    0                      1.0
  8                187                                   10                      0.949
 12              12716                                 1225                      0.912
 16             926884                               114074                      0.890

Statistical Discriminancy
  #   AUC(a) > AUC(b) & acc(a) = acc(b)   acc(a) > acc(b) & AUC(a) = AUC(b)       D
  4                  3                                    0                      NA
  8                159                                   10                      15.9
 12               8986                                  489                      18.4
 16             559751                                25969                      21.6

SLIDE 16

Conclusions

  • AUC is a better measure than accuracy, based on the formal definitions of consistency and discriminancy
  • This conclusion calls for re-evaluating results in machine learning that were established using accuracy. For example, under AUC the Naive Bayes classifier predicts significantly better than decision trees, contrary to the well-established conclusion, based on accuracy, that the two are equivalent
  • The paper recommends using AUC as a "single number" measure over accuracy when evaluating and comparing classifiers