Optimizing Abstaining Classifiers using ROC Analysis - PowerPoint PPT Presentation


SLIDE 1

IBM Zurich Research Laboratory, GSAL

Optimizing Abstaining Classifiers using ROC Analysis

Tadek Pietraszek /ˈtʌ·dek pɪe·ˈtrʌ·ʃek/

pie@zurich.ibm.com

ICML 2005 August 9, 2005

SLIDE 2

ICML2005 2 August 9, 2005

“To classify, or not to classify: that is the question.”

SLIDE 3

Motivation

  • Abstaining classifiers are classifiers that can, in certain cases, refrain from classification; they resemble human experts who can say "I don't know".
  • In many domains such experts are preferred to ones that always make a decision and are sometimes wrong (think "doctor").
  • Machine learning has frequently used abstaining classifiers ([FH04], [GL00], [PMAS94], [Tort00]), also implicitly (e.g., active learning, delegating classifiers, triskels (ICML05)).
  • Q1: How do we optimally select abstaining classifiers?
  • Q2: How do we compare normal and abstaining classifiers?

SLIDE 4

Outline

  • 1. ROC Background
  • 2. Tri-State Classifier

    1. Cost-Based Model
    2. Bounded-Abstention Model
    3. Bounded-Improvement Model

  • 3. Experiments, Results
  • 4. Summary
SLIDE 5

1. ROC Background 2. Abstaining Classifier Cost-Based Bounded-Abstention Bounded-Improvement 3. Experiments, Results 4. Summary

Notation

  • Binary classifier C is a function C : I → {+, −}, where i ∈ I is an instance.
  • Ranker R (a.k.a. scoring classifier) is a function attaching a rank to an instance, R : I → ℝ; it can be converted to a binary classifier Cτ using ∀i : Cτ(i) = + ⇔ R(i) ≥ τ.
  • Abstaining binary classifier A is a classifier that can, in certain cases, refrain from classification. We denote this as attaching a third class "?".
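The notation above can be made concrete with a small sketch (the function names are mine, not from the talk): a ranker plus one threshold gives the binary classifier Cτ, and the abstaining classifier of the later slides simply uses two thresholds, emitting "?" between them.

```python
def binary_classifier(rank, tau):
    """C_tau(i) = '+'  iff  R(i) >= tau."""
    return '+' if rank >= tau else '-'

def abstaining_classifier(rank, tau_low, tau_high):
    """Abstains ('?') when the rank falls between the two thresholds."""
    assert tau_low <= tau_high
    if rank >= tau_high:
        return '+'
    if rank < tau_low:
        return '-'
    return '?'
```

With tau_low = tau_high this degenerates to the plain binary classifier.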

SLIDE 6

ROC Background

  • Evaluates model performance under all class and cost distributions – a 2D plot (X – false positive rate, Y – true positive rate).
  • A classifier C corresponds to a single point (fp, tp) on the ROC curve.
  • A classifier Cτ (or a machine learning method Lτ) has a parameter τ; varying it produces multiple points.
  • Therefore we consider a ROC curve a function f : τ → (fpτ, tpτ).
  • We can find an inverse function f⁻¹ : (fpτ, tpτ) → τ.
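A minimal sketch of the mapping f : τ → (fpτ, tpτ) (helper name assumed, not from the slides): enumerate the distinct scores as thresholds and count the rates. Because the mapping is stored explicitly, the inverse f⁻¹ is just a reverse lookup from a point back to its τ.

```python
def roc_as_function_of_tau(scores, labels):
    """Map each threshold tau to its ROC point (fp_tau, tp_tau).

    labels are '+'/'-'; C_tau predicts '+' iff score >= tau.
    """
    P = sum(1 for l in labels if l == '+')
    N = len(labels) - P
    f = {}
    for tau in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= tau and l == '+') / P
        fp = sum(1 for s, l in zip(scores, labels) if s >= tau and l == '-') / N
        f[tau] = (fp, tp)
    return f
```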

SLIDE 7

ROC Background

  • ROC Convex Hull (ROCCH) – a piecewise-linear, convex-down curve fR with the following properties:
    – fR(0) = 0, fR(1) = 1.
    – The slope of fR is monotonically non-increasing.
  • Assume that for any slope value m there exists a point where fR has slope m [PF98]:
    – Vertices have "slopes" assuming all values between the slopes of the adjacent edges.
    – Assume sentinel edges: a 0th edge with slope ∞ and an (n+1)th edge with slope 0.
  • We will use the ROCCH instead of the ROC curve.
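The ROCCH itself can be computed with a standard upper-hull sweep. This sketch (my own code, not from the talk) adds the (0, 0) and (1, 1) endpoints described above and drops any point lying on or below the hull.

```python
def _cross(o, a, b):
    """> 0 for a counter-clockwise (left) turn o -> a -> b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def rocch(points):
    """Upper convex hull of ROC points, including (0, 0) and (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Pop vertices that would make the chain turn left (or go straight):
        # the upper hull keeps only right turns, i.e. non-increasing slopes.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull
```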

SLIDE 8

Some Definitions

  • Confusion matrix (A = actual, C = classified as):

    A\C    +     −
    +      TP    FN      P = TP + FN
    −      FP    TN      N = FP + TN

    tp = TP / (TP + FN),  fp = FP / (FP + TN),  fn = FN / (TP + FN)

  • Cost matrix (correct classifications cost 0):

    A\C    +     −
    +      0     c12
    −      c21   0

    CR = c21 / c12

SLIDE 9

Cost Minimizing Criteria for One Classifier

  • Known iso-performance lines [PF98]:

    f′ROC(fp) = (N/P) · CR
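Sweeping an iso-performance line of slope (N/P)·CR over the ROCCH picks the cost-optimal point; equivalently (and simpler to code) one can evaluate the expected cost at every hull vertex. A sketch with names of my own choosing:

```python
def best_binary_vertex(hull, N, P, c12, c21):
    """Vertex of the ROCCH minimizing the expected cost per instance.

    rc = (c21 * FP + c12 * FN) / (N + P), with FP = fp * N, FN = (1 - tp) * P.
    Equivalent to sweeping an iso-performance line of slope (N/P) * (c21/c12).
    """
    def rc(fp, tp):
        return (c21 * fp * N + c12 * (1.0 - tp) * P) / (N + P)
    return min(hull, key=lambda v: rc(*v))
```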

SLIDE 10

Outline

  • 1. ROC Background
  • 2. Tri-State Classifier

    1. Cost-Based Model
    2. Bounded-Abstention Model
    3. Bounded-Improvement Model

  • 3. Experiments, Results
  • 4. Summary
SLIDE 11

Metaclassifier Aα,β

  • IDEA: Construct the classifier from two binary classifiers Cα, Cβ as follows:

    Aα,β(x) = +  if Cα(x) = +
    Aα,β(x) = ?  if Cα(x) = − ∧ Cβ(x) = +
    Aα,β(x) = −  if Cβ(x) = −

    where Cα, Cβ are such that (the combination Cα(x) = + ∧ Cβ(x) = − is impossible):

    ∀x : (Cα(x) = + ⇒ Cβ(x) = +) ∧ (Cβ(x) = − ⇒ Cα(x) = −)

  • Can we optimally select Cα, Cβ?
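Read as code (a sketch; Cα and Cβ are any binary classifiers satisfying the compatibility condition):

```python
def metaclassifier(c_alpha, c_beta):
    """Build A_{alpha,beta} from two compatible binary classifiers."""
    def a(x):
        ca, cb = c_alpha(x), c_beta(x)
        # The fourth combination is ruled out by the requirement
        # C_alpha(x) = '+'  =>  C_beta(x) = '+'.
        assert not (ca == '+' and cb == '-'), "incompatible classifiers"
        if ca == '+':
            return '+'
        return '?' if cb == '+' else '-'
    return a
```

Two thresholds τα ≥ τβ on one ranker R always satisfy the condition, since R(x) ≥ τα implies R(x) ≥ τβ.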

SLIDE 12

Requirements on the ROC Curve

Requirement: for a ROC curve and any two classifiers Cα and Cβ corresponding to points (fpα, tpα) and (fpβ, tpβ) with fpα ≤ fpβ:

∀x : (Cα(x) = + ⇒ Cβ(x) = +) ∧ (Cβ(x) = − ⇒ Cα(x) = −)

  • These conditions are the same as those used by [FlachWu03] and are met, in particular, if Cα and Cβ are constructed from a single ranker R.

SLIDE 13

“Optimal” Metaclassifier Aα,β

  • How do we compare binary classifiers and abstaining classifiers? How do we select an optimal classifier?
  • No clear answer – either:
    – use a cost-based model (Cost-Based Model), or
    – use boundary conditions:
      • a maximum number of instances classified as "?" (Bounded-Abstention Model), or
      • a maximum misclassification cost (Bounded-Improvement Model).
SLIDE 14

Cost-Based Model

  • Cost matrix – 2x3, with an abstention column (A = actual, C = classified as):

    A\C    +     −     ?
    +      0     c12   c13
    −      c21   0     c23

  • Confusion matrices of Cα (TPα, FNα, FPα, TNα) and Cβ (TPβ, FNβ, FPβ, TNβ).
  • Important properties:

    fpα ≤ fpβ ⇒ FPα ≤ FPβ   and   fnα ≥ fnβ ⇒ FNα ≥ FNβ

SLIDE 15

Selecting the Optimal Classifier

  • Similar criteria – minimize the cost:

    rc = 1/(N+P) · [ c12·FNβ + c21·FPα + c13·(FNα − FNβ) + c23·(FPβ − FPα) ]

    (the first two terms count misclassified instances, the last two the abstained instances)

    ∂rc/∂fpα = 0 ∧ ∂rc/∂fpβ = 0 ⇒
    f′ROC(fpα) = (c21 − c23)/c13 · N/P,   f′ROC(fpβ) = c23/(c12 − c13) · N/P
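The selection can also be checked by brute force: evaluate rc for every ordered pair of ROCCH vertices and keep the minimum. This sketch (my own helper names) implements the rc formula above; note the exact optimum may also lie inside hull edges rather than at vertices.

```python
def rc_cost_based(a, b, N, P, c12, c21, c13, c23):
    """Per-instance cost of A_{alpha,beta} for ROC points a = (fp_a, tp_a),
    b = (fp_b, tp_b) with fp_a <= fp_b, under the 2x3 cost matrix."""
    (fpa, tpa), (fpb, tpb) = a, b
    FPa, FPb = fpa * N, fpb * N
    FNa, FNb = (1 - tpa) * P, (1 - tpb) * P
    # misclassified: FN_beta and FP_alpha; abstained: the two differences
    return (c12 * FNb + c21 * FPa + c13 * (FNa - FNb) + c23 * (FPb - FPa)) / (N + P)

def best_pair(hull, N, P, c12, c21, c13, c23):
    """Exhaustive search over ordered vertex pairs of the ROCCH."""
    pairs = [(a, b) for a in hull for b in hull if a[0] <= b[0]]
    return min(pairs, key=lambda ab: rc_cost_based(ab[0], ab[1], N, P, c12, c21, c13, c23))
```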

SLIDE 16

Cost-Based Model – a Simulated Example

[Figure: ROC curve with two optimal classifiers (X: FP, Y: TP), marking Classifier A and Classifier B]

[Figure: Misclassification cost for different combinations of A and B, as a surface over FP(a) and FP(b)]

    f′ROC(fpα) = (c21 − c23)/c13 · N/P,   f′ROC(fpβ) = c23/(c12 − c13) · N/P

SLIDE 17

Understanding Cost Matrices

  • The 2x2 cost matrix is well known. 2x3 cost matrices have some interesting properties: e.g., under which conditions is the optimal classifier an abstaining classifier?
  • Our derivation is valid for

    (c21 ≥ c23) ∧ (c12 > c13) ∧ (c12·c21 ≥ c13·c21 + c23·c12)   (*)

    and we can prove that if this condition is not met, the classifier is a trivial binary classifier.
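Dividing the last inequality of (*) as reconstructed above by c12·c21 gives c13/c12 + c23/c21 ≤ 1: abstention has to be cheap enough relative to the errors it avoids. A tiny predicate (my naming):

```python
def abstention_can_be_optimal(c12, c21, c13, c23):
    """Condition (*): only when it holds can the cost-optimal
    classifier be a non-trivial abstaining classifier."""
    return (c21 >= c23) and (c12 > c13) and (c12 * c21 >= c13 * c21 + c23 * c12)
```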

SLIDE 18

Cost Matrices – Interesting Cases

  • How do we set c13, c23 so that the classifier is a non-trivial abstaining classifier?
  • Two interesting cases:
    – Symmetric case (c13 = c23):

      c13 = c23 ≤ (c12·c21) / (c12 + c21)

    – Proportional case (c13 / c23 = c12 / c21):

      c13 ≤ c12/2 ⇔ c23 ≤ c21/2

SLIDE 19

Bounded Models

  • Problem: a 2x3 cost matrix is not always given and would have to be estimated. However, the classifier is very sensitive to c13, c23.
  • Goal: find other optimization criteria for an abstaining classifier using a standard 2x2 cost matrix.
    – Calculate the misclassification cost per classified instance.
  • Follow the same reasoning to find the optimal classifier.

SLIDE 20

Bounded Models Equation

  • We obtain the following equations, determining the relationship between the abstention fraction k and the cost rc as a function of the classifiers Cα, Cβ:

    rc = (c21·FPα + c12·FNβ) / ((N + P)·(1 − k))
    k = [ N·(fpβ − fpα) + P·(fnα − fnβ) ] / (N + P)

    – Constrain k, minimize rc → bounded-abstention model
    – Constrain rc, minimize k → bounded-improvement model
  • There is no algebraic solution; we need to optimize numerically.

SLIDE 21

Bounded-Abstention Model

  • Among classifiers abstaining on no more than a fraction kMAX of instances, find the one that minimizes rc.
  • Useful, e.g., in real-time processing, where the non-classified instances will be processed by another classifier with a limited processing speed.
  • We can prove that the solution is not limited to the vertices of the ROCCH.

SLIDE 22

Bounded-Abstention Model – a Simulated Example

[Figure: ROC curve with two optimal classifiers (X: FP, Y: TP), marking Classifier A and Classifier B]

[Figure: Misclassification cost for different combinations of A and B, bounded case |?| ≤ 0.2]

SLIDE 23

Bounded-Improvement Model

  • Among classifiers having a misclassification cost not higher than rcMAX, find the one that abstains on the smallest number of instances.
  • Useful in, e.g., the medical domain, where, given a test, we want to achieve a certain lower misclassification cost while allowing for non-classified instances.
  • For the evaluation we use f such that rcMAX = (1 − f)·rc, where rc is the cost of the optimal binary classifier.
  • We can prove that the solution is not limited to the vertices of the ROCCH.
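The dual search can be sketched the same way (assumed names; self-contained on purpose, again over vertex pairs only): keep the pairs whose cost meets rc_max and return the one that abstains least.

```python
def bounded_improvement(hull, rc_max, N, P, c12, c21):
    """Among vertex pairs with cost rc <= rc_max, abstain on the fewest instances."""
    best = None
    for (fpa, tpa) in hull:
        for (fpb, tpb) in hull:
            if fpa > fpb:
                continue
            # abstention fraction and cost per classified instance
            k = (N * (fpb - fpa) + P * (tpb - tpa)) / (N + P)
            if k >= 1.0:
                rc = 0.0  # abstains on everything, trivially zero cost
            else:
                rc = (c21 * fpa * N + c12 * (1 - tpb) * P) / ((N + P) * (1 - k))
            if rc <= rc_max and (best is None or k < best[0]):
                best = (k, rc, (fpa, tpa), (fpb, tpb))
    return best
```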

SLIDE 24

Bounded-Improvement Model – a Simulated Example

[Figure: ROC curve with two optimal classifiers (X: FP, Y: TP), marking Classifier A and Classifier B]

[Figure: Fraction of skipped instances for different combinations of A and B, as a surface over FP(a) and FP(b)]

SLIDE 25

Experiments

  • Tested with 15 UCI KDD datasets, using averaged cross-validation.
  • In each model we used one independent parameter: c13 = c23, k, or f.
  • Classifier – the Bayesian classifier from Weka [WF00].
  • Numerical calculations and optimization in R.
  • Showing results for one representative dataset.

SLIDE 26

Building an Abstaining Classifier

[Flow diagram: from a binary classifier to an abstaining classifier. Build Classifier → Build ROC (n-fold cross-validation on the training instances, training set / testing set per fold; repeat m times and average) → Find Thresholds (inputs: (1) a 2x3 cost matrix, or (2) a 2x2 cost matrix plus a fraction k or f) → Construct Tri-State Classifier → Classify → Collect Statistics]

SLIDE 27

Results – Cost-Based Model

[Figure: ionosphere.arff – cost improvement vs. cost value c13 = c23]

[Figure: ionosphere.arff – fraction of instances skipped vs. cost value c13 = c23]

[Figure: ionosphere.arff – cost improvement vs. fraction of instances skipped]

SLIDE 28

Results – Bounded-Abstention Model

[Figure: ionosphere.arff – relative cost improvement vs. fraction skipped (k)]

[Figure: ionosphere.arff – misclassification cost (rc) vs. fraction skipped (k)]

SLIDE 29

Results – Bounded-Improvement Model

[Figure: ionosphere.arff – fraction skipped (k) vs. relative cost improvement (f)]

[Figure: ionosphere.arff – fraction skipped (k) vs. misclassification cost (rc)]

SLIDE 30

Summary

  • Abstaining classifier as a metaclassifier:
    – Cost-based model
    – Bounded-improvement model
    – Bounded-abstention model
  • Methodically tested and showed that it works (in all three models):
    – Multiple data sets (UCI KDD)
    – Cross-validation
  • The idea fits our alert-classification system (see: Pietraszek 2004, "Using Adaptive Alert Classification to Reduce False Positives in Intrusion Detection").

SLIDE 31

IBM Zurich Research Laboratory, GSAL

END

pie@zurich.ibm.com http://tadek.pietraszek.org/

SLIDE 32

Bibliography (1)

  • [Chow70] Chow, C. (1970). On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16, 41--46.
  • [Dietterich98] Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10, 1895--1923.
  • [Fawcett03] Fawcett, T. (2003). ROC graphs: Notes and practical considerations for researchers (HPL-2003-4). Technical report, HP Laboratories.
  • [FFH04] Ferri, C., Flach, P., Hernandez-Orallo, J. (2004). Delegating classifiers. Proceedings of the 21st International Conference on Machine Learning (ICML'04) (pp. 106--110). Alberta, Canada: Omnipress.
  • [FerriHernandez04] Ferri, C., Hernandez-Orallo, J. (2004). Cautious classifiers. Proceedings of ROC Analysis in Artificial Intelligence, 1st International Workshop (ROCAI-2004) (pp. 27--36). Valencia, Spain.
  • [FlachWu03] Flach, P. A., Wu, S. (2003). Repairing concavities in ROC curves. Proc. 2003 UK Workshop on Computational Intelligence (pp. 38--44). Bristol, UK.
  • [GambergerLavrac00] Gamberger, D., Lavrac, N. (2000). Reducing misclassification costs. Principles of Data Mining and Knowledge Discovery, 4th European Conference (PKDD 2000) (pp. 34--43). Lyon, France: Springer-Verlag.
  • [HettichBay99] Hettich, S., Bay, S. D. (1999). The UCI KDD Archive. Web page at http://kdd.ics.uci.edu.
  • [LewisCatlett94] Lewis, D. D., Catlett, J. (1994). Heterogeneous uncertainty sampling for supervised learning. Proceedings of ICML-94, 11th International Conference on Machine Learning (pp. 148--156). San Francisco, US: Morgan Kaufmann.

SLIDE 33

Bibliography (2)

  • [NelderMead65] Nelder, J., Mead, R. (1965). A simplex method for function minimization. Computer Journal, 7, 308--313.
  • [PMAS94] Pazzani, M. J., Murphy, P., Ali, K., Schulenburg, D. (1994). Trading off coverage for accuracy in forecasts: Applications to clinical data analysis. Proceedings of the AAAI Symposium on AI in Medicine (pp. 106--110). Stanford, CA.
  • [ProvostFawcett98] Provost, F., Fawcett, T. (1998). Robust classification systems for imprecise environments. Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98) (pp. 706--713). AAAI Press.
  • [Tortorella00] Tortorella, F. (2000). An optimal reject rule for binary classifiers. Advances in Pattern Recognition, Joint IAPR International Workshops SSPR 2000 and SPR 2000 (pp. 611--620). Alicante, Spain: Springer-Verlag.
  • [WittenFrank00] Witten, I. H., Frank, E. (2000). Data Mining: Practical Machine Learning Tools with Java Implementations. San Francisco: Morgan Kaufmann.

SLIDE 34

Further Improvements in the Bounded-Abstention and Bounded-Improvement Models

  • In previous work, we used general numerical methods to find the solution.
  • But:
    – The ROCCH is not an arbitrary function; it has special properties.
    – Thus we can do much better, and understand tri-state classifiers better.
  • We propose an algorithm and a proof (see paper).

SLIDE 35

Optimal Classifier Path

[Figure: Optimal classifier path – bounded-abstention; cost surface over FP(a) and FP(b)]

SLIDE 36

Algorithm – Bounded-Abstention Model

[Figure: Smallest relative gradient path – bounded-abstention; cost surface over FP(a) and FP(b)]

SLIDE 37

Algorithm – Bounded-Improvement Model

[Figure: Optimal classifier path – bounded-improvement; abstention fraction k over FP(a) and FP(b)]

SLIDE 38

Selecting the Optimal Classifier

  • Criteria – minimize the misclassification cost:

    rc = 1/(N+P) · (c21·FP + c12·FN),   with tp = fROC(fp)
       = 1/(N+P) · (c21·FP + c12·(P − TP))
       = 1/(N+P) · (c21·FP + c12·P·(1 − fROC(FP/N)))

    d rc / d FP = 1/(N+P) · (c21 − c12·(P/N)·f′ROC(FP/N)) = 0
    ⇒ f′ROC(fp) = (N/P)·(c21/c12) = (N/P)·CR

SLIDE 39

Cost Matrices

  • Theorem. If (*) is not met, the classifier is a trivial binary classifier.

    (c21 ≥ c23) ∧ (c12 > c13) ∧ (c12·c21 ≥ c13·c21 + c23·c12)   (*)

  • Proof (sketch):
    – Show that for an optimal classifier fR′(fp*α) ≥ fR′(fp*) ≥ fR′(fp*β), where fp* corresponds to an optimal binary classifier.
    – Show that if (*) is not met, ∂rc/∂fpα is positive for fp*α < fp*, and ∂rc/∂fpβ is positive for fp*β > fp*.
    – Therefore fp*α = fp* = fp*β.