Data Warehousing and Machine Learning: Probabilistic Classifiers


slide-1
SLIDE 1

Data Warehousing and Machine Learning

Probabilistic Classifiers Thomas D. Nielsen

Aalborg University Department of Computer Science

Spring 2008

DWML Spring 2008 1 / 34

slide-2
SLIDE 2

Probabilistic Classifiers

Conditional class probabilities

Id.  Savings  Assets  Income  Credit risk
 1   Medium   High      75    Good
 2   Low      Low       50    Bad
 3   High     Medium    25    Bad
 4   Medium   High      75    Good
 5   Low      Medium   100    Good
 6   High     High      25    Good
 7   Medium   High      75    Bad
 8   Medium   Medium    75    Good
 ...

Probabilistic Classifiers DWML Spring 2008 2 / 34

slide-3
SLIDE 3

Probabilistic Classifiers

Conditional class probabilities (same table as on the previous slide). The three rows matching the instance (Savings = Medium, Assets = High, Income = 75) are rows 1, 4 and 7, with class labels Good, Good and Bad, so

P(Risk = Good | Savings = Medium, Assets = High, Income = 75) = 2/3
P(Risk = Bad  | Savings = Medium, Assets = High, Income = 75) = 1/3

Probabilistic Classifiers DWML Spring 2008 2 / 34
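The same counting can be done mechanically. Below is a small illustrative sketch (not part of the original slides) that builds the credit-risk table with pandas and reads off the conditional class probabilities; the column names are taken from the slide.

```python
# Illustrative sketch (not from the slides): estimate the conditional class
# probabilities on the credit-risk table above by counting matching rows.
import pandas as pd

data = pd.DataFrame(
    [("Medium", "High", 75, "Good"), ("Low", "Low", 50, "Bad"),
     ("High", "Medium", 25, "Bad"), ("Medium", "High", 75, "Good"),
     ("Low", "Medium", 100, "Good"), ("High", "High", 25, "Good"),
     ("Medium", "High", 75, "Bad"), ("Medium", "Medium", 75, "Good")],
    columns=["Savings", "Assets", "Income", "Risk"],
)

# Rows matching the evidence Savings = Medium, Assets = High, Income = 75 ...
match = data[(data.Savings == "Medium") & (data.Assets == "High") & (data.Income == 75)]
# ... and the relative frequency of each class label among them.
print(match.Risk.value_counts(normalize=True))   # Good 2/3, Bad 1/3
```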

slide-4
SLIDE 4

Probabilistic Classifiers

Empirical Distribution

The training data defines the empirical distribution, which can be represented in a table. Empirical distribution obtained from 1000 data instances:

Gender  Blood pressure  Weight  Smoker  Stroke  P
m       low             under   no      no      32/1000
m       low             under   no      yes      1/1000
m       low             under   yes     no      27/1000
...
f       normal          normal  no      yes      0/1000
...
f       high            over    yes     yes     54/1000

Such a table is not a suitable probabilistic model, because

  • the size of the representation grows exponentially in the number of attributes
  • it overfits the data

Probabilistic Classifiers DWML Spring 2008 3 / 34

slide-5
SLIDE 5

Probabilistic Classifiers

Model

View data as being produced by a random process that is described by a joint probability distribution P on States(A1, . . . , An, C), i.e. P assigns a probability P(a1, . . . , an, c) ∈ [0, 1] to every tuple (a1, . . . , an, c) of values for the attribute and class variables, such that

Σ_{(a1, . . . , an, c) ∈ States(A1, . . . , An, C)} P(a1, . . . , an, c) = 1

(for discrete attributes; integration instead of summation for continuous attributes).

Conditional Probability

The joint distribution P also defines the conditional probability distribution of C given A1, . . . , An, i.e. the values

P(c | a1, . . . , an) := P(a1, . . . , an, c) / P(a1, . . . , an) = P(a1, . . . , an, c) / Σ_{c′} P(a1, . . . , an, c′)

that represent the probability that C = c given that it is known that A1 = a1, . . . , An = an.

Probabilistic Classifiers DWML Spring 2008 4 / 34

slide-6
SLIDE 6

Probabilistic Classifiers

Classification Rule

For a loss function L(c, c′) an instance is classified according to

C(a1, . . . , an) := arg min_{c′ ∈ States(C)} Σ_{c ∈ States(C)} L(c, c′) P(c | a1, . . . , an)

Examples of loss matrices L(c, c′):

Cancer test (misclassifying a true Cancer case as Normal is far more expensive than the reverse):

              predicted Cancer   predicted Normal
true Cancer                            1000
true Normal          1

0/1 loss:

              predicted c   predicted c′
true c                           1
true c′            1

Probabilistic Classifiers DWML Spring 2008 5 / 34

slide-7
SLIDE 7

Probabilistic Classifiers

Classification Rule

For a loss function L(c, c′) an instance is classified according to

C(a1, . . . , an) := arg min_{c′ ∈ States(C)} Σ_{c ∈ States(C)} L(c, c′) P(c | a1, . . . , an)

Under 0/1 loss we get

C(a1, . . . , an) := arg max_{c ∈ States(C)} P(c | a1, . . . , an)

In the binary case, e.g. States(C) = {notinfected, infected}, one can also classify with a variable threshold t:

C(a1, . . . , an) = notinfected  :⇔  P(notinfected | a1, . . . , an) ≥ t

(this can also be generalized to the non-binary case).

Probabilistic Classifiers DWML Spring 2008 5 / 34
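As an illustration (not from the slides), the following sketch evaluates the rule for a hypothetical two-class problem; the loss values and the posterior are made up, chosen so that the cost-sensitive decision differs from the 0/1-loss decision.

```python
# Illustrative sketch (not from the slides): the loss-minimizing rule for a
# hypothetical binary problem. Loss values and the posterior are made up.
import numpy as np

states = ["notinfected", "infected"]
# loss[c, c'] = L(c, c'): cost of predicting c' when the true class is c
loss = np.array([[0.0, 1.0],      # true notinfected
                 [50.0, 0.0]])    # true infected: missing an infection is costly
posterior = np.array([0.9, 0.1])  # P(c | a1, ..., an)

expected = posterior @ loss                            # expected loss of each prediction c'
print(states[int(np.argmin(expected))])                # 'infected' under the asymmetric loss

zero_one = 1.0 - np.eye(2)                             # 0/1 loss
print(states[int(np.argmin(posterior @ zero_one))])    # 'notinfected' = arg max posterior
```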

slide-8
SLIDE 8

Naive Bayes

The Naive Bayes Model

Structural assumption:

P(a1, . . . , an, c) = P(a1 | c) · P(a2 | c) · · · P(an | c) · P(c)

Graphical representation as a Bayesian network: the class node C is the only parent of each attribute node A1, . . . , A7, and there are no edges between the attributes.

Interpretation: given the true class label, the different attributes take their values independently.

Probabilistic Classifiers DWML Spring 2008 6 / 34

slide-9
SLIDE 9

Naive Bayes

The naive Bayes assumption I

[Figure: a symbol drawn on a 3 × 3 grid of pixel cells Cell-1, . . . , Cell-9]

For example:

P(Cell-2 = b | Cell-5 = b, Symbol = 1) > P(Cell-2 = b | Symbol = 1)

The attributes are not independent given Symbol = 1!

Probabilistic Classifiers DWML Spring 2008 7 / 34

slide-10
SLIDE 10

Naive Bayes

The naive Bayes assumption II

For the spam example, e.g.:

P(Body'nigeria' = y | Body'confidential' = y, Spam = y) ≫ P(Body'nigeria' = y | Spam = y)

The attributes are not independent given Spam = yes! The naive Bayes assumption is therefore often not realistic. Nevertheless, Naive Bayes is often successful.

Probabilistic Classifiers DWML Spring 2008 8 / 34

slide-11
SLIDE 11

Naive Bayes

Learning a Naive Bayes Classifier

  • Determine the parameters P(ai | c) (ai ∈ States(Ai), c ∈ States(C)) from empirical counts in the data.

  • Missing values are easily handled: instances for which Ai is missing are ignored for P(ai | c).
  • Discrete and continuous attributes can be mixed.

Probabilistic Classifiers DWML Spring 2008 9 / 34
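A minimal sketch of this count-based learning and the resulting classifier is given below (illustrative only; the toy rows, attribute values and class labels are made up, and no smoothing of zero counts is applied).

```python
# Illustrative sketch (not from the slides) of count-based Naive Bayes learning.
# Toy attribute values and class labels are made up; no smoothing is applied.
from collections import Counter, defaultdict

def learn_nb(rows, labels):
    """Estimate P(c) and P(a_i | c) by relative frequencies."""
    prior = Counter(labels)
    cond = defaultdict(Counter)                  # cond[(i, c)][a_i] = count
    for row, c in zip(rows, labels):
        for i, a in enumerate(row):
            if a is not None:                    # instances with A_i missing are ignored
                cond[(i, c)][a] += 1
    return prior, cond

def predict_nb(prior, cond, row):
    """Return arg max_c P(c) * prod_i P(a_i | c)."""
    def score(c):
        s = prior[c] / sum(prior.values())
        for i, a in enumerate(row):
            counts = cond[(i, c)]
            total = sum(counts.values())
            s *= counts[a] / total if total else 0.0
        return s
    return max(prior, key=score)

rows = [("Medium", "High"), ("Low", "Low"), ("High", "Medium"), ("Medium", "High")]
labels = ["Good", "Bad", "Bad", "Good"]
print(predict_nb(*learn_nb(rows, labels), ("Medium", "High")))   # Good
```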

slide-12
SLIDE 12

Naive Bayes

The paradoxical success of Naive Bayes One explanation for the surprisingly good performance of Naive Bayes in many domains: do not require exact distribution for classification, only the right decision boundaries [Domingos, Pazzani 97]

[Figure: P(C = ⊕ | a1, . . . , an) for the real distribution, plotted over States(A1, . . . , An) with the decision threshold 0.5 marked]

Probabilistic Classifiers DWML Spring 2008 10 / 34

slide-13
SLIDE 13

Naive Bayes

The paradoxical success of Naive Bayes One explanation for the surprisingly good performance of Naive Bayes in many domains: do not require exact distribution for classification, only the right decision boundaries [Domingos, Pazzani 97]

[Figure: P(C = ⊕ | a1, . . . , an) for the real distribution and for Naive Bayes, plotted over States(A1, . . . , An) with the decision threshold 0.5 marked; both curves fall on the same side of 0.5 for every instance, so the decision boundaries agree]

Probabilistic Classifiers DWML Spring 2008 10 / 34

slide-14
SLIDE 14

Naive Bayes

When Naive Bayes must fail

No Naive Bayes classifier can produce the following classification:

A    B    Class
yes  yes   ⊕
yes  no    ⊖
no   yes   ⊖
no   no    ⊕

because assume it did; then:

1. P(A = y | ⊕) P(B = y | ⊕) P(⊕) > P(A = y | ⊖) P(B = y | ⊖) P(⊖)
2. P(A = y | ⊖) P(B = n | ⊖) P(⊖) > P(A = y | ⊕) P(B = n | ⊕) P(⊕)
3. P(A = n | ⊖) P(B = y | ⊖) P(⊖) > P(A = n | ⊕) P(B = y | ⊕) P(⊕)
4. P(A = n | ⊕) P(B = n | ⊕) P(⊕) > P(A = n | ⊖) P(B = n | ⊖) P(⊖)

Probabilistic Classifiers DWML Spring 2008 11 / 34

slide-15
SLIDE 15

Naive Bayes

When Naive Bayes must fail (cont.)

1. P(A = y | ⊕) P(B = y | ⊕) P(⊕) > P(A = y | ⊖) P(B = y | ⊖) P(⊖)
2. P(A = y | ⊖) P(B = n | ⊖) P(⊖) > P(A = y | ⊕) P(B = n | ⊕) P(⊕)
3. P(A = n | ⊖) P(B = y | ⊖) P(⊖) > P(A = n | ⊕) P(B = y | ⊕) P(⊕)
4. P(A = n | ⊕) P(B = n | ⊕) P(⊕) > P(A = n | ⊖) P(B = n | ⊖) P(⊖)

Multiplying the four left sides and the four right sides of these inequalities gives

Π_{i=1..4} (left side of i.) > Π_{i=1..4} (right side of i.)

But this is false, because both products are actually equal: every factor appears exactly once on each side.

Probabilistic Classifiers DWML Spring 2008 12 / 34
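This can also be checked numerically. The sketch below (not from the slides) draws random Naive Bayes parameters for the two attributes and the class and confirms that the two products always coincide, so the four strict inequalities can never hold simultaneously.

```python
# Illustrative numeric check (not from the slides): for any Naive Bayes
# parameters, the product of the four left-hand sides equals the product of
# the four right-hand sides, so the four strict inequalities cannot all hold.
import math
import random

def products(pA_pos, pA_neg, pB_pos, pB_neg, p_pos):
    """pX_pos = P(X = y | plus), pX_neg = P(X = y | minus), p_pos = P(plus)."""
    p_neg = 1 - p_pos
    left = ((pA_pos * pB_pos * p_pos) *                 # inequality 1, left side
            (pA_neg * (1 - pB_neg) * p_neg) *           # inequality 2, left side
            ((1 - pA_neg) * pB_neg * p_neg) *           # inequality 3, left side
            ((1 - pA_pos) * (1 - pB_pos) * p_pos))      # inequality 4, left side
    right = ((pA_neg * pB_neg * p_neg) *
             (pA_pos * (1 - pB_pos) * p_pos) *
             ((1 - pA_pos) * pB_pos * p_pos) *
             ((1 - pA_neg) * (1 - pB_neg) * p_neg))
    return left, right

for _ in range(5):
    l, r = products(*(random.random() for _ in range(5)))
    print(math.isclose(l, r))    # True every time: the two products coincide
```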

slide-16
SLIDE 16

Naive Bayes

Tree Augmented Naive Bayes

Model: all Bayesian network structures where
  • the class node is a parent of each attribute node
  • the substructure on the attribute nodes is a tree

[Figure: Bayesian network with class node C as parent of A1, . . . , A7, and a tree over the attribute nodes]

Learning a TAN classifier means learning the tree structure and the parameters. The optimal tree structure can be found efficiently (Chow & Liu 1968; Friedman et al. 1997).

Probabilistic Classifiers DWML Spring 2008 13 / 34

slide-17
SLIDE 17

Naive Bayes

A TAN classifier for

A    B    Class
yes  yes   ⊕
yes  no    ⊖
no   yes   ⊖
no   no    ⊕

Network structure: C is a parent of both A and B, and A is additionally a parent of B. Parameters:

P(C):
⊕     ⊖
0.5   0.5

P(A | C):
C    yes   no
⊕    0.5   0.5
⊖    0.5   0.5

P(B | C, A):
C    A     yes   no
⊕    yes   1.0   0.0
⊕    no    0.0   1.0
⊖    yes   0.0   1.0
⊖    no    1.0   0.0

Probabilistic Classifiers DWML Spring 2008 14 / 34

slide-18
SLIDE 18

Tree Augmented Naive Bayes

Learning a TAN Classifier: a rough overview

  • Learn a (class conditional) maximum likelihood tree structure of the attributes.
  • Insert the class variable as a parent of all the attributes.

Probabilistic Classifiers DWML Spring 2008 15 / 34

slide-19
SLIDE 19

Tree Augmented Naive Bayes

Learning a TAN Classifier: a rough overview

  • Learn a (class conditional) maximum likelihood tree structure of the attributes.
  • Insert the class variable as a parent of all the attributes.

Learning a Chow-Liu tree

A Chow-Liu tree of maximal likelihood can be constructed as follows:

1. Calculate MI(Ai, Aj) for each pair (Ai, Aj).
2. Build a maximum-weight spanning tree over the attributes.
3. Direct the resulting tree.
4. Learn the parameters.

where the mutual information is computed from the empirical distribution P#:

MI(Ai, Aj) = Σ_{Ai, Aj} P#(Ai, Aj) log2 ( P#(Ai, Aj) / (P#(Ai) P#(Aj)) )

Probabilistic Classifiers DWML Spring 2008 15 / 34
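A compact sketch of steps 1 and 2 is shown below (illustrative only; the toy data and attribute names are made up, and networkx is assumed to be available for the maximum-weight spanning tree).

```python
# Illustrative sketch (not from the slides) of steps 1-2: empirical mutual
# information for each attribute pair, then a maximum-weight spanning tree.
import itertools
import math
from collections import Counter
import networkx as nx

names = ["Cold", "SoreThroat", "Fever"]
data = [("yes", "yes", "no"), ("yes", "yes", "yes"), ("no", "no", "no"),
        ("no", "yes", "no"), ("yes", "no", "yes"), ("no", "no", "no")]

def mutual_information(xs, ys):
    """MI(X, Y) under the empirical distribution P# of two columns."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

g = nx.Graph()
for i, j in itertools.combinations(range(len(names)), 2):
    xs, ys = [r[i] for r in data], [r[j] for r in data]
    g.add_edge(names[i], names[j], weight=mutual_information(xs, ys))

tree = nx.maximum_spanning_tree(g)      # step 2; steps 3-4 direct it and fit the CPTs
print(sorted(tree.edges(data="weight")))
```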

slide-20
SLIDE 20

Tree Augmented Naive Bayes

Example: learning a maximum likelihood tree structure (Chow-Liu tree)

[Figure: the five attribute nodes Cold, Sore Throat?, See Spots?, Fever?, Angina]

1. Calculate MI(Ai, Aj) for each pair (Ai, Aj).
2. Build a maximum-weight spanning tree over the attributes.
3. Direct the resulting tree.
4. Learn the parameters.

Probabilistic Classifiers DWML Spring 2008 16 / 34

slide-21
SLIDE 21

Tree Augmented Naive Bayes

Example: learning a maximum likelihood tree structure (Chow-Liu tree)

[Figure: the five attribute nodes Cold, Sore Throat?, See Spots?, Fever?, Angina]

Step 1: calculate MI(Ai, Aj) for each pair (Ai, Aj):

MI(Cold, Angina) = 0
MI(Fever?, Angina) = 0.015076
MI(SoreThroat?, Angina) = 0.018016
MI(SeeSpots?, Angina) = 0.0180588
MI(Cold, Fever?) = 0.014392
MI(Cold, SoreThroat?) = 0.0210122
MI(Cold, SeeSpots?) = 0
MI(SoreThroat?, Fever?) = 0.0015214
MI(Fever?, SeeSpots?) = 0.0017066
MI(SeeSpots?, SoreThroat?) = 0.0070697

For example,

MI(Cold, Sore) = Σ_{Cold, Sore} P(Cold, Sore) log2 ( P(Cold, Sore) / (P(Cold) P(Sore)) ) = 0.02101216

(Remaining steps: 2. build a maximum-weight spanning tree over the attributes, 3. direct the resulting tree, 4. learn the parameters.)

Probabilistic Classifiers DWML Spring 2008 16 / 34

slide-22
SLIDE 22

Tree Augmented Naive Bayes

Example: learning a maximum likelihood tree structure (Chow-Liu tree)

1. Calculate MI(Ai, Aj) for each pair (Ai, Aj).
2. Build a maximum-weight spanning tree over the attributes.
3. Direct the resulting tree.
4. Learn the parameters.

[Figure: graph over Cold, Sore Throat?, See Spots?, Fever?, Angina with the pairwise MI values (0.021, 0.018, 0.018, 0.015, 0.014, 0.007, 0.002, 0.002) as edge weights]

Probabilistic Classifiers DWML Spring 2008 16 / 34


slide-24
SLIDE 24

Tree Augmented Naive Bayes

Example: learning a maximum likelihood tree structure (Chow-Liu tree)

1. Calculate MI(Ai, Aj) for each pair (Ai, Aj).
2. Build a maximum-weight spanning tree over the attributes.
3. Direct the resulting tree.
4. Learn the parameters.

[Figure: the attribute nodes Cold, Sore Throat?, See Spots?, Fever?, Angina connected by the selected spanning-tree edges]

Probabilistic Classifiers DWML Spring 2008 16 / 34


slide-26
SLIDE 26

Tree Augmented Naive Bayes

Example: learning a maximum likelihood tree structure (Chow-Liu tree)

1. Calculate MI(Ai, Aj) for each pair (Ai, Aj).
2. Build a maximum-weight spanning tree over the attributes.
3. Direct the resulting tree.
4. Learn the parameters.

[Figure: the directed tree over the attribute nodes Cold, Sore Throat?, See Spots?, Fever?, Angina]

Step 4 is standard parameter learning.

Probabilistic Classifiers DWML Spring 2008 16 / 34

slide-27
SLIDE 27

Tree Augmented Naive Bayes

Learning a TAN Classifier

A TAN of maximal likelihood can be constructed as follows:

1. Calculate CMI(Ai, Aj | C) for each pair (Ai, Aj).
2. Build a maximum-weight spanning tree over the attributes.
3. Direct the resulting tree.
4. Insert C as a parent of all the attributes.
5. Learn the parameters.

where the conditional mutual information is computed from the empirical distribution P#:

CMI(Ai, Aj | C) = Σ_C P#(C) Σ_{Ai, Aj} P#(Ai, Aj | C) log2 ( P#(Ai, Aj | C) / (P#(Ai | C) P#(Aj | C)) )

Probabilistic Classifiers DWML Spring 2008 17 / 34
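The sketch below (not from the slides) spells out this CMI computation on three toy columns; the attribute values and class labels are made up.

```python
# Illustrative sketch (not from the slides) of the CMI(Ai, Aj | C) computation
# in step 1, from empirical counts over three toy columns.
import math
from collections import Counter

def cmi(xs, ys, cs):
    """Conditional mutual information CMI(X, Y | C) under the empirical P#."""
    n = len(cs)
    pc = Counter(cs)
    pxyc, pxc, pyc = Counter(zip(xs, ys, cs)), Counter(zip(xs, cs)), Counter(zip(ys, cs))
    total = 0.0
    for (x, y, c), k in pxyc.items():
        p_xy = k / pc[c]                                   # P#(x, y | c)
        p_x, p_y = pxc[(x, c)] / pc[c], pyc[(y, c)] / pc[c]
        total += (pc[c] / n) * p_xy * math.log2(p_xy / (p_x * p_y))
    return total

A1 = ["y", "y", "n", "n", "y", "n"]
A2 = ["y", "n", "n", "y", "y", "n"]
C  = ["+", "+", "+", "-", "-", "-"]
print(cmi(A1, A2, C))
```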

slide-28
SLIDE 28

Tree Augmented Naive Bayes

Other ways of handling attribute dependence Introduce hidden variables to model the dependence.

[Figure: Bayesian network with class node C, attributes A1, . . . , A5, and hidden variables L1, L2, L3 modelling dependencies among groups of attributes]

  • Use e.g. CMI to decide on where to insert hidden variables.
  • Selecting the number of states is a (structural) learning problem (use e.g. BIC).
  • If |sp(Li)| = |sp(ch(Li))|, then Li can represent any configuration over its children.

Probabilistic Classifiers DWML Spring 2008 18 / 34

slide-29
SLIDE 29

Evaluating Classifiers

Probabilistic Classifiers Thomas D. Nielsen

Aalborg University Department of Computer Science

Spring 2008

Evaluating Classifiers Evaluating Classifiers Spring 2008 19 / 34

slide-30
SLIDE 30

Evaluating Classifiers

Classification Error

A classifier C (e.g. a decision tree) is used to classify instances a1, . . . , aN with true class labels c1, . . . , cN. The class labels assigned by C are c′1, . . . , c′N. Classification error:

|{i ∈ 1, . . . , N | ci ≠ c′i}| / N

  • Evaluation: estimate of the performance of a classifier on future data.
  • Estimate obtained by:
      • Hold-out set or test set
      • Random sub-sampling
      • Cross-validation

Evaluating Classifiers Evaluating Classifiers Spring 2008 20 / 34

slide-31
SLIDE 31

Evaluating Classifiers

Hold-out data

1. Divide the data into training data and test data (50–50 or 2/3–1/3).
2. Learn a classifier on the training data.
3. Accuracy can be estimated as the accuracy over the test set.

Pitfalls
  • Assuming that accuracy increases with the size of the training data, the hold-out method is pessimistic.

Evaluating Classifiers Evaluating Classifiers Spring 2008 21 / 34

slide-32
SLIDE 32

Evaluating Classifiers

Hold-out data

1. Divide the data into training data and test data (50–50 or 2/3–1/3).
2. Learn a classifier on the training data.
3. Accuracy can be estimated as the accuracy over the test set.

Pitfalls
  • Assuming that accuracy increases with the size of the training data, the hold-out method is pessimistic.
  • Increasing the training set reduces the test set, which gives a higher variance (a larger confidence interval) for the accuracy estimate.
  • Reducing the training set introduces a bias in our estimate.

Evaluating Classifiers Evaluating Classifiers Spring 2008 21 / 34
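A minimal hold-out evaluation in code might look as follows (a sketch, not from the slides; it assumes scikit-learn and uses its bundled iris data as a stand-in for a real dataset, with the 2/3-1/3 split mentioned above).

```python
# Illustrative hold-out evaluation (not from the slides). Assumes scikit-learn;
# the bundled iris data stands in for a real dataset, split 2/3-1/3.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

clf = GaussianNB().fit(X_train, y_train)             # learn on the training data
print(accuracy_score(y_test, clf.predict(X_test)))   # accuracy estimated on the test set
```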

slide-33
SLIDE 33

Evaluating Classifiers

Hold-out data: confidence intervals for accuracy

Imagine tossing a coin 100 times, resulting in 70 heads and 30 tails. In this experiment we have that
  • each toss can have two outcomes,
  • the probability p of heads is constant.

This can be considered a binomial experiment: if X is the number of heads in N tosses, then

P(X = x) = (N choose x) · p^x · (1 − p)^(N−x),

with mean N·p and variance N·p·(1 − p).

[Figure: binomial probability mass function]

Evaluating Classifiers Evaluating Classifiers Spring 2008 22 / 34

slide-34
SLIDE 34

Evaluating Classifiers

Hold-out data: confidence intervals for accuracy

Imagine tossing a coin 100 times, resulting in 70 heads and 30 tails. In this experiment we have that
  • each toss can have two outcomes,
  • the probability p of heads is constant.

This can be considered a binomial experiment: if X is the number of heads in N tosses, then

P(X = x) = (N choose x) · p^x · (1 − p)^(N−x),

with mean N·p and variance N·p·(1 − p).

Example

With p = 0.6 we get P(X = 70) = (100 choose 70) · 0.6^70 · (1 − 0.6)^(100−70) ≈ 0.01. The expectation is 0.6 · 100 = 60 and the variance is 100 · 0.6 · (1 − 0.6) = 24.

Evaluating Classifiers Evaluating Classifiers Spring 2008 22 / 34

slide-35
SLIDE 35

Evaluating Classifiers

Hold-out data: confidence intervals for accuracy

Imagine tossing a coin 100 times, resulting in 70 heads and 30 tails. In this experiment we have that
  • each toss can have two outcomes,
  • the probability p of heads is constant.

This can be considered a binomial experiment: if X is the number of heads in N tosses, then

P(X = x) = (N choose x) · p^x · (1 − p)^(N−x),

with mean N·p and variance N·p·(1 − p).

Example

With p = 0.6 we get P(X = 70) = (100 choose 70) · 0.6^70 · (1 − 0.6)^(100−70) ≈ 0.01. The expectation is 0.6 · 100 = 60 and the variance is 100 · 0.6 · (1 − 0.6) = 24.

The task of predicting class labels can also be seen as a binomial experiment.

Evaluating Classifiers Evaluating Classifiers Spring 2008 22 / 34
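These numbers are easy to check, e.g. with scipy (an illustrative sketch, not part of the slides):

```python
# Quick check of the coin example (a sketch, assuming scipy is available).
from scipy.stats import binom

N, p = 100, 0.6
print(binom.pmf(70, N, p))                 # about 0.01
print(binom.mean(N, p), binom.var(N, p))   # 60.0 and 24.0
```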

slide-36
SLIDE 36

Evaluating Classifiers

Hold-out data: confidence intervals for accuracy Given a test set with N cases, let X be the number of cases correctly predicted by the classifier. The empirical accuracy of the classifier is then a = X/N. However, how confident can we be in the empirical accuracy estimated for a given test set?

Evaluating Classifiers Evaluating Classifiers Spring 2008 23 / 34

slide-37
SLIDE 37

Evaluating Classifiers

Hold-out data: confidence intervals for accuracy

Given a test set with N cases, let X be the number of cases correctly predicted by the classifier. The empirical accuracy of the classifier is then a = X/N. However, how confident can we be in the empirical accuracy estimated for a given test set?

Confidence interval

First note:
  • The empirical accuracy a = X/N of the classifier follows a binomial distribution with mean p and variance p(1 − p)/N.
  • For sufficiently large N, the binomial distribution is close to a normal distribution with mean p and variance p(1 − p)/N.

[Figure: binomial distribution for N = 30 and p = 0.8]

Evaluating Classifiers Evaluating Classifiers Spring 2008 23 / 34

slide-38
SLIDE 38

Evaluating Classifiers

Confidence interval

First note:
  • The empirical accuracy a = X/N of the classifier follows a binomial distribution with mean p and variance p(1 − p)/N.
  • For sufficiently large N, the binomial distribution is close to a normal distribution with mean p and variance p(1 − p)/N.

By standardizing the normal distribution, the following confidence interval for a can be found:

P( −Zα/2 ≤ (a − p) / √(p(1 − p)/N) ≤ Z1−α/2 ) = 1 − α

(for a given α, the value for Zα/2 can be found by table lookup; note Zα/2 = Z1−α/2). By rearranging we get

( 2·N·a + Z²α/2 ± Zα/2 · √( Z²α/2 + 4·N·a − 4·N·a² ) ) / ( 2·(N + Z²α/2) )

This should be read as: at confidence level 1 − α, the true accuracy will be in the interval defined by the expression above.

Evaluating Classifiers Evaluating Classifiers Spring 2008 23 / 34

slide-39
SLIDE 39

Evaluating Classifiers

Hold-out data: confidence intervals for accuracy Given a test set with N cases, let X be the number of cases correctly predicted by the classifier. The empirical accuracy of the classifier is then a = X/N. However, how confident can we be in the empirical accuracy estimated for a given test set? Example Consider a model with accuracy 70% when evaluated on 100 test cases. What is the confidence interval for its true accuracy at a 95% confidence level?

  • Find Zα/2 by table lookup ⇒ Zα/2 = 1.96

Evaluating Classifiers Evaluating Classifiers Spring 2008 23 / 34

slide-40
SLIDE 40

Evaluating Classifiers

Hold-out data: confidence intervals for accuracy Given a test set with N cases, let X be the number of cases correctly predicted by the classifier. The empirical accuracy of the classifier is then a = X/N. However, how confident can we be in the empirical accuracy estimated for a given test set? Example Consider a model with accuracy 70% when evaluated on 100 test cases. What is the confidence interval for its true accuracy at a 95% confidence level?

  • Find Zα/2 by table lookup ⇒ Zα/2 = 1.96
  • Inserting into the previous expression gives the interval [0.60; 0.78].

Evaluating Classifiers Evaluating Classifiers Spring 2008 23 / 34
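The sketch below (not from the slides) evaluates the rearranged expression for this example:

```python
# Sketch (not from the slides) evaluating the rearranged expression for the
# example above: a = 0.7, N = 100, Z = 1.96.
import math

def accuracy_interval(a, N, Z=1.96):
    """Confidence interval for the true accuracy, as derived two slides back."""
    center = 2 * N * a + Z ** 2
    spread = Z * math.sqrt(Z ** 2 + 4 * N * a - 4 * N * a ** 2)
    denom = 2 * (N + Z ** 2)
    return (center - spread) / denom, (center + spread) / denom

print(accuracy_interval(0.7, 100))   # roughly (0.60, 0.78)
```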

slide-41
SLIDE 41

Evaluating Classifiers

Expected Loss

A more detailed picture is provided by the confusion matrix and a cost function (e.g. for States(C) = {a, b, c} and n = 150):

Confusion matrix: fractions of cases with each true/predicted combination

                 true a    true b    true c
predicted a      45/150     4/150     3/150
predicted b       2/150    39/150     1/150
predicted c       3/150     7/150    46/150

Loss matrix: Loss(x, y) gives the cost of each true/predicted combination (x, y).

Expected Loss:

Σ_{x, y ∈ {a, b, c}} Confusion(x, y) · Loss(x, y)

When a cost function is given, try to minimize the expected loss (minimizing classification error is the special case of 0/1 loss: Loss(x, x) = 0 and Loss(x, y) = 1 for x ≠ y)!

Evaluating Classifiers Evaluating Classifiers Spring 2008 24 / 34
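The computation itself is a single element-wise product and sum. In the sketch below (not from the slides) the confusion matrix is the one above, while the loss-matrix entries are made-up placeholders:

```python
# Sketch (not from the slides): expected loss as an element-wise product and sum.
# The confusion matrix is the one above; the loss values are hypothetical.
import numpy as np

# rows: predicted a, b, c; columns: true a, b, c
confusion = np.array([[45, 4, 3],
                      [2, 39, 1],
                      [3, 7, 46]]) / 150.0

loss = np.array([[0.0, 3.0, 3.0],    # hypothetical costs, 0 on the diagonal
                 [2.0, 0.0, 1.0],
                 [4.0, 3.0, 0.0]])

print((confusion * loss).sum())      # expected loss under this cost function
print(1.0 - np.trace(confusion))     # expected 0/1 loss = classification error
```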

slide-42
SLIDE 42

Evaluating Classifiers

Classifiers with Confidence Most classifiers (implicitly) provide a numeric measurement for the likelihood of class label c for instance a:

  • Probabilistic classifier: Probability of c given a.
  • Decision Tree: Frequency of label c (among training cases) in leaf reached by a.
  • k-Nearest-Neighbor: Frequency of label c among k nearest neighbors of a.
  • Neural Network: Output value of c output neuron given input a.

Evaluating Classifiers Evaluating Classifiers Spring 2008 25 / 34

slide-43
SLIDE 43

Evaluating Classifiers

Quantiles

For a given class label c, sort the instances according to decreasing confidence in c:

Instance:  a3    a5    a1    a7    a8    a4    a2    a10   a6    a9
P(c):      0.96  0.91  0.86  0.83  0.74  0.55  0.51  0.42  0.11  0.06

Evaluating Classifiers Evaluating Classifiers Spring 2008 26 / 34

slide-44
SLIDE 44

Evaluating Classifiers

Quantiles

For a given class label c, sort the instances according to decreasing confidence in c:

Instance:  a3    a5    a1    a7    a8    a4    a2    a10   a6    a9
P(c):      0.96  0.91  0.86  0.83  0.74  0.55  0.51  0.42  0.11  0.06

The 40% quantile consists of the 40% of cases with the highest confidence in c.

Evaluating Classifiers Evaluating Classifiers Spring 2008 26 / 34

slide-45
SLIDE 45

Evaluating Classifiers

Quantiles

For a given class label c, sort the instances according to decreasing confidence in c:

Instance:  a3    a5    a1    a7    a8    a4    a2    a10   a6    a9
P(c):      0.96  0.91  0.86  0.83  0.74  0.55  0.51  0.42  0.11  0.06
ci = c:    yes   yes   no    yes   yes   no    yes   no    no    no

The 40% quantile consists of the 40% of cases with the highest confidence in c. Given the correct class labels, we can compute the accuracy in the 40% quantile (3/4) and the ratio of this accuracy to the base rate of the label c:

Lift(40%, C, c) = (3/4) / (5/10) = 1.5

Evaluating Classifiers Evaluating Classifiers Spring 2008 26 / 34
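A short sketch of this computation (not from the slides), using the confidences and labels from the table above:

```python
# Sketch (not from the slides) of the 40%-quantile lift computation.
confidence = [0.96, 0.91, 0.86, 0.83, 0.74, 0.55, 0.51, 0.42, 0.11, 0.06]
is_c       = [1,    1,    0,    1,    1,    0,    1,    0,    0,    0]

ranked = [label for _, label in sorted(zip(confidence, is_c), reverse=True)]
k = int(0.4 * len(ranked))                                  # size of the 40% quantile
lift = (sum(ranked[:k]) / k) / (sum(ranked) / len(ranked))  # accuracy in quantile / base rate
print(lift)                                                 # (3/4) / (5/10) = 1.5
```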

slide-46
SLIDE 46

Evaluating Classifiers

Lift Charts Lift plotted for different quantiles:

[Figure: lift chart with Lift(C, c) plotted for the quantiles 10%, 20%, . . . , 100%]

Evaluating Classifiers Evaluating Classifiers Spring 2008 27 / 34

slide-47
SLIDE 47

Evaluating Classifiers

Lift Charts Lift plotted for different quantiles:

[Figure: lift chart with Lift(C, c) and Lift(C′, c) plotted for the quantiles 10%, 20%, . . . , 100%]

Lift for a classifier C′ generating a perfect ordering:

Instance:  a7    a5    a2    a3    a8    a9    a1    a10   a6    a4
P(c):      0.98  0.97  0.97  0.87  0.74  0.34  0.29  0.12  0.11  0.02
ci = c:    yes   yes   yes   yes   yes   no    no    no    no    no

Evaluating Classifiers Evaluating Classifiers Spring 2008 27 / 34

slide-48
SLIDE 48

Evaluating Classifiers

Lift and Costs

What is better:

  • predicting C = c for all instances in the 40% quantile (say lift = 1.5), and C ≠ c for all others, or
  • predicting C = c for all instances in the 60% quantile (say lift = 1.333), and C ≠ c for all others?

That depends on the cost function! The first option will be better when wrong predictions of C = c are very expensive; the second option will be better when wrong predictions of C ≠ c are very expensive.

Evaluating Classifiers Evaluating Classifiers Spring 2008 28 / 34

slide-49
SLIDE 49

Evaluating Classifiers

ROC Space

Confusion matrix for binary classification problems:

                 true pos                true neg
predicted pos    true positives (tp)     false positives (fp)
predicted neg    false negatives (fn)    true negatives (tn)

True positive rate (tpr): tp / (tp + fn)
False positive rate (fpr): fp / (fp + tn)

Each classifier (applied to some dataset) defines a point in ROC space:

[Figure: ROC space, fpr on the x-axis and tpr on the y-axis, both from 0 to 1]

Evaluating Classifiers Evaluating Classifiers Spring 2008 29 / 34

slide-50
SLIDE 50

Evaluating Classifiers

ROC space (as on the previous slide).

[Figure: ROC space with the point (fpr, tpr) = (1, 1) marked: "always classify positive"]

Evaluating Classifiers Evaluating Classifiers Spring 2008 29 / 34

slide-51
SLIDE 51

Evaluating Classifiers

ROC space (as on the previous slides).

[Figure: ROC space with "always classify negative" at (0, 0) and "always classify positive" at (1, 1) marked]

Evaluating Classifiers Evaluating Classifiers Spring 2008 29 / 34

slide-52
SLIDE 52

Evaluating Classifiers

ROC space (as on the previous slides).

[Figure: ROC space with "always classify negative" at (0, 0), "always classify positive" at (1, 1), and "classify positive with probability q" at (q, q) marked]

Evaluating Classifiers Evaluating Classifiers Spring 2008 29 / 34

slide-53
SLIDE 53

Evaluating Classifiers

ROC space (as on the previous slides).

[Figure: ROC space with "always classify negative" at (0, 0), "always classify positive" at (1, 1), "classify positive with probability q" at (q, q), and "perfect classification" at (0, 1) marked]

Evaluating Classifiers Evaluating Classifiers Spring 2008 29 / 34

slide-54
SLIDE 54

Evaluating Classifiers

Comparison

One classifier is strictly better than another if its tpr/fpr point is to the left and above in ROC space:

[Figure: ROC space with the points of three classifiers C1, C2, C3]

C1 is better than C2. C3 is incomparable with C1 and C2.

Evaluating Classifiers Evaluating Classifiers Spring 2008 30 / 34

slide-55
SLIDE 55

Evaluating Classifiers

ROC curves Probabilistic classifiers (and many others) are parameterized by an acceptance threshold. Plotting the tpr/fpr values for all parameters (and a given dataset) gives a ROC curve:

[Figure: a ROC curve in the unit square, fpr on the x-axis and tpr on the y-axis]

Evaluating Classifiers Evaluating Classifiers Spring 2008 31 / 34

slide-56
SLIDE 56

Evaluating Classifiers

ROC curves Probabilistic classifiers (and many others) are parameterized by an acceptance threshold. Plotting the tpr/fpr values for all parameters (and a given dataset) gives a ROC curve:

[Figure: a ROC curve in the unit square, fpr on the x-axis and tpr on the y-axis]

Performance measure for a parameterized family of classifiers: the area under the (ROC) curve (AUC).

Evaluating Classifiers Evaluating Classifiers Spring 2008 31 / 34
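As an illustration (not from the slides), the sketch below sweeps the threshold over the confidence scores from the quantile example and reports the resulting ROC points and AUC; it assumes scikit-learn is available.

```python
# Sketch (not from the slides): ROC points from sweeping the acceptance
# threshold over the confidence scores of the earlier quantile example,
# plus the area under the resulting curve. Assumes scikit-learn.
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
y_score = [0.96, 0.91, 0.86, 0.83, 0.74, 0.55, 0.51, 0.42, 0.11, 0.06]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))             # one (fpr, tpr) point per threshold
print(roc_auc_score(y_true, y_score))  # area under the ROC curve (AUC)
```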

slide-57
SLIDE 57

Optimizing Predictive Performance

Overfitting again

[Figure: performance measure plotted against a model parameter, for training data and for future data]

Possible performance measures:

  • Misclassification rate
  • Expected loss
  • AUC
  • . . .

Model parameter:

  • Pruning parameter for decision trees
  • k in k-nearest neighbor
  • Complexity of probabilistic model (e.g. Naive Bayes, TAN, . . . )

  • . . .

How do we determine the model performing best on future data?

Evaluating Classifiers Evaluating Classifiers Spring 2008 32 / 34

slide-58
SLIDE 58

Optimizing Predictive Performance

Test Set

  • Set aside part (e.g. one third) of the available data as a test set
  • Learn models with different parameters using the remaining data as the training data
  • Measure the performance of each learned model on the test set
  • Choose parameter setting with best performance
  • Learn final model with chosen parameter setting using the whole available data

Problem: for small datasets we cannot afford to set aside a test set.

Evaluating Classifiers Evaluating Classifiers Spring 2008 33 / 34

slide-59
SLIDE 59

Optimizing Predictive Performance

Cross Validation

  • Partition the data into n subsets or folds (typically n = 10).
  • For each model parameter setting:
      for i = 1 to n:
          learn a model using folds 1, . . . , i − 1, i + 1, . . . , n as training data
          measure performance on fold i
      model performance = average performance on the n test sets
  • Choose the parameter setting with the best performance.
  • Learn the final model with the chosen parameter setting using the whole available data.

Evaluating Classifiers Evaluating Classifiers Spring 2008 34 / 34
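A sketch of this loop in code (not from the slides; it assumes scikit-learn, and the k-nearest-neighbor model with its candidate values of k is just a placeholder for whatever model parameter is being tuned):

```python
# Sketch (not from the slides) of the cross-validation loop above. Assumes
# scikit-learn; k-nearest neighbor and its candidate k values are placeholders.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
folds = KFold(n_splits=10, shuffle=True, random_state=0)

best_k, best_score = None, -np.inf
for k in [1, 3, 5, 7]:                               # candidate parameter settings
    fold_scores = []
    for train_idx, test_idx in folds.split(X):
        model = KNeighborsClassifier(n_neighbors=k).fit(X[train_idx], y[train_idx])
        fold_scores.append(model.score(X[test_idx], y[test_idx]))  # performance on fold i
    if np.mean(fold_scores) > best_score:            # model performance = average over folds
        best_k, best_score = k, float(np.mean(fold_scores))

final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X, y)   # whole available data
print(best_k, round(best_score, 3))
```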