Chapter X: Classification*
Information Retrieval & Data Mining
Universität des Saarlandes, Saarbrücken
Winter Semester 2011/12

1. Basic idea
2. Decision trees
3. Naïve Bayes classifier
4. …
* Zaki & Meira: Ch. 24, 26, 28 & 29; Tan, Steinbach & Kumar: Ch. 4, 5.3–5.6
X.1 Basic idea
1.1. Data
1.2. Classification function
1.3. Predictive vs. descriptive
1.4. Supervised vs. unsupervised
Definitions
– Data for classification comes in tuples (x, y)
  – Vector x is the attribute (feature) set
  – Value y is the class label
  – The class label must be discrete; if it is continuous, the problem is regression!
– Classification is the task of learning a target function f that maps attribute sets to class labels, f(x) = y
Classification function as a black box
[Figure: the classification function f takes an attribute set x as input and produces a class label y as output]
Descriptive vs. predictive
– Descriptive data mining gives a description of the existing data
  – Those who have bought diapers have also bought beer
  – These are the clusters of documents from this corpus
– Predictive data mining predicts the future
  – Those who will buy diapers will also buy beer
  – If new documents arrive, they will be similar to one of the cluster centroids
– The exact border between (descriptive) data mining and (predictive) machine learning is hard to define
Descriptive vs. predictive classification
– Descriptive: explain the class structure of the existing data
– Predictive: predict the class label of new, previously unseen records
  – What we will concentrate on
General classification framework
Classification model evaluation
– Evaluating classification models shares much with IR methods
  – Focus on accuracy and error rate
  – But also precision, recall, F-scores, …
– Confusion matrix for a binary problem (f_ij = number of records of actual class i predicted as class j):

                        Predicted class = 1   Predicted class = 0
    Actual class = 1          f11                   f10
    Actual class = 0          f01                   f00

  Accuracy = (f11 + f00) / (f11 + f00 + f10 + f01)
  Error rate = (f10 + f01) / (f11 + f00 + f10 + f01)
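As a small illustration of the two formulas, here is a minimal Python sketch; the counts are hypothetical, not from the lecture:

```python
# Hypothetical confusion-matrix counts, following the table above:
# f11 = actual 1 predicted 1, f10 = actual 1 predicted 0, etc.
f11, f10 = 50, 10
f01, f00 = 5, 35

total = f11 + f00 + f10 + f01
accuracy = (f11 + f00) / total     # fraction of correctly classified records
error_rate = (f10 + f01) / total   # fraction of misclassified records
print(accuracy, error_rate)        # 0.85 0.15
```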
Supervised vs. unsupervised learning
– Supervised learning (classification)
  – Training data is accompanied by class labels
  – New data is classified based on the training set
– Unsupervised learning (clustering)
  – The class labels are unknown
  – The aim is to establish the existence of classes in the data based on measurements, observations, etc.
X.2 Decision trees
Zaki & Meira: Ch. 24; Tan, Steinbach & Kumar: Ch. 4
Basic idea
– We determine the class label of a record by asking a series of questions about its attributes
  – Each question depends on the answer to the previous one
  – Ultimately, all samples with satisfying attribute values have the same label and we're done
– The questions can be organized into a tree: to classify a record, we follow the proper edges of the tree until we meet a leaf
  – Decision tree leaves are always class labels
Example: training data

  age    income  student  credit_rating  buys_computer
  <=30   high    no       fair           no
  <=30   high    no       excellent      no
  31…40  high    no       fair           yes
  >40    medium  no       fair           yes
  >40    low     yes      fair           yes
  >40    low     yes      excellent      no
  31…40  low     yes      excellent      yes
  <=30   medium  no       fair           no
  <=30   low     yes      fair           yes
  >40    medium  yes      fair           yes
  <=30   medium  yes      excellent      yes
  31…40  medium  no       excellent      yes
  31…40  high    yes      fair           yes
  >40    medium  no       excellent      no
Example: decision tree

  age?
    <=30   → student?        (no → no, yes → yes)
    31…40  → yes
    >40    → credit rating?  (excellent → no, fair → yes)
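The tree above can be read as a sequence of nested tests. A minimal Python sketch of applying it to one record (attribute names follow the training data; the record itself is hypothetical):

```python
def buys_computer(age, student, credit_rating):
    """Classify one record by following the example tree from the root to a leaf."""
    if age <= 30:
        return "yes" if student == "yes" else "no"
    elif age <= 40:                      # the 31…40 branch
        return "yes"
    else:                                # the >40 branch
        return "yes" if credit_rating == "fair" else "no"

# Hypothetical record: 35 years old, not a student, fair credit rating
print(buys_computer(35, "no", "fair"))   # -> yes
```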
Hunt's algorithm
– The number of possible decision trees for a given set of attributes is exponential
  – Finding the optimal tree is computationally infeasible
  – Instead, the decision tree is grown by making a series of locally optimal decisions on which attribute to use for splitting
– Hunt's algorithm: let Dt be the set of training records associated with node t
  – If all records in Dt belong to the same class yt, then t is a leaf node labeled as yt
  – Otherwise:
    – Select an attribute test condition to partition the records into smaller subsets
    – Create a child node for each outcome of the test condition
    – Apply the algorithm recursively to each child
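A minimal recursive sketch of this procedure in Python; the choice of the test condition is deferred to a helper best_split, which is assumed rather than shown (it would use the impurity measures discussed below):

```python
from collections import Counter

def hunt(records, labels, best_split):
    """Grow a decision tree following Hunt's algorithm.

    records: list of dicts mapping attribute name -> value
    labels:  list of class labels, one per record
    best_split: assumed helper that picks the attribute to test (or None)
    Returns a class label (leaf) or a pair (attribute, {value: subtree}).
    """
    if len(set(labels)) == 1:                 # all records share one class -> leaf
        return labels[0]
    attr = best_split(records, labels)
    if attr is None:                          # no useful split left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    children = {}
    for value in set(r[attr] for r in records):   # one child per test outcome
        idx = [i for i, r in enumerate(records) if r[attr] == value]
        children[value] = hunt([records[i] for i in idx],
                               [labels[i] for i in idx], best_split)
    return (attr, children)
```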
Example decision tree construction
[Figure: the example tree is grown step by step; nodes whose records have multiple labels are split further, nodes whose records have only one label become leaves]
Selecting the split
– Designing the split requires answering two questions:
  – How should the attribute test condition be specified (which splits are possible)?
  – How should the best split be determined?
Splitting methods
– Binary attributes: the test condition has exactly two outcomes
– Nominal attributes: either a multiway split (one outcome per value) or a binary split (grouping the values into two sets)
– Ordinal attributes: like nominal attributes, but a grouping must not violate the order of the values
– Continuous attributes: a binary comparison (A < v) or a multiway split into ranges
Selecting the best split
– Let p(i | t) denote the fraction of records belonging to class i at node t
  – In a two-class problem:
    – p(0 | t) = 0 and p(1 | t) = 1 has high purity
    – p(0 | t) = 1/2 and p(1 | t) = 1/2 has the smallest purity (highest impurity)
– Splits are compared using impurity measures of the child nodes: higher purity by the measures ⇒ better split
Example of purity
[Figure: two example class distributions, one with high impurity and one with high purity]
Impurity measures
  Entropy(t) = − Σ_{i=0}^{c−1} p(i | t) · log2 p(i | t)
  Gini(t) = 1 − Σ_{i=0}^{c−1} [p(i | t)]²
  Classification error(t) = 1 − max_i { p(i | t) }
– Here c is the number of classes, and by convention 0 × log2(0) = 0
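A minimal Python sketch of the three measures, taking the class distribution p(i | t) of a node as a list of fractions:

```python
from math import log2

def entropy(p):
    # Classes with p(i|t) = 0 are skipped, using the 0 * log2(0) = 0 convention.
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def gini(p):
    return 1 - sum(pi ** 2 for pi in p)

def classification_error(p):
    return 1 - max(p)

dist = [0.5, 0.5]    # maximally impure two-class node
print(entropy(dist), gini(dist), classification_error(dist))   # 1.0 0.5 0.5
```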
Comparing impurity measures

Comparing conditions
– To compare test conditions, we compare the impurity of the parent node with the weighted impurity measure of the child nodes
  – Called the gain of the test condition:
    Δ = I(parent) − Σ_{j=1}^{k} (N(v_j) / N) · I(v_j)
  – I(·) is the impurity measure, N is the number of records at the parent, k is the number of children, and N(v_j) is the number of records at child v_j
Computing the gain: example
– One child node: 7 records, Gini = 0.4898
– Other child node: 5 records, Gini = 0.480
– Weighted impurity of the children: (7 × 0.4898 + 5 × 0.480) / 12 ≈ 0.486
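A minimal sketch of the same computation in Python; the child sizes and Gini values are those above, while the parent Gini of 0.5 (12 records, 6 per class) is an assumption for illustrating the gain:

```python
def weighted_child_impurity(children):
    """children: list of (record count, impurity) pairs, one per child node."""
    n = sum(cnt for cnt, _ in children)
    return sum(cnt * imp for cnt, imp in children) / n

children = [(7, 0.4898), (5, 0.480)]             # the two child nodes above
weighted = weighted_child_impurity(children)     # (7*0.4898 + 5*0.480) / 12
delta = 0.5 - weighted                           # assuming parent Gini I(parent) = 0.5
print(round(weighted, 3), round(delta, 3))       # 0.486 0.014
```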
Problems of maximizing Δ
– Impurity measures favor test conditions with many outcomes: many small partitions give higher purity
– But such a split may not be desirable
  – Number of records in each partition is too small to make predictions
– Solution: use the gain ratio instead of the plain gain
  – Gain ratio = Δ / SplitInfo, where
    SplitInfo = − Σ_{i=1}^{k} P(v_i) · log2 P(v_i)
    and P(v_i) is the fraction of records assigned to child v_i
  – Used e.g. in C4.5
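A minimal sketch of this correction, assuming the gain Δ for the split has already been computed; the child sizes are illustrative:

```python
from math import log2

def split_info(child_sizes):
    n = sum(child_sizes)
    return -sum((c / n) * log2(c / n) for c in child_sizes if c > 0)

def gain_ratio(delta, child_sizes):
    return delta / split_info(child_sizes)

# A balanced 2-way split vs. a split into 12 singleton partitions:
print(split_info([6, 6]), split_info([1] * 12))   # 1.0 vs. ~3.58 -> larger penalty
```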
Stopping the splitting
– Stop expanding a node when all of its records belong to the same class
– Stop expanding a node when all of its records have the same attribute values
– Early termination is also possible
  – E.g. the gain ratio drops below a certain threshold
  – Keeps trees simple
  – Helps with overfitting
Geometry of single-attribute splits
[Figure: a two-dimensional data set partitioned by the splits x < 0.43, y < 0.47, and y < 0.33; single-attribute splits always give decision boundaries parallel to the coordinate axes]
Combatting overfitting
– We would like to stop building the tree before overfitting happens
  – Overfitting makes decision trees overly complex
  – Generalization error will be big

Estimating the generalization error
– Training error: e(T) = Σ_t e(t) / N, where e(t) is the number of errors at leaf t and N is the number of training records
– Generalization error estimate: e'(T) = Σ_t e'(t) / N
  – Optimistic approach: e'(T) = e(T)
  – Pessimistic approach: e'(T) = Σ_t (e(t) + Ω) / N, where Ω is a penalty term for each leaf node
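A minimal sketch of the optimistic and pessimistic estimates; the leaf error counts are illustrative, and Ω = 0.5 is a common choice in the literature rather than a value fixed by the lecture:

```python
def optimistic_error(leaf_errors, n_records):
    # e'(T) = e(T): the training error itself
    return sum(leaf_errors) / n_records

def pessimistic_error(leaf_errors, n_records, omega=0.5):
    # e'(T) = sum_t (e(t) + Omega) / N: every leaf pays an extra penalty Omega
    return sum(e + omega for e in leaf_errors) / n_records

leaf_errors = [0, 2, 1, 0]   # training errors e(t) at each of four leaves (illustrative)
print(optimistic_error(leaf_errors, 24), pessimistic_error(leaf_errors, 24))
# 0.125 vs. ~0.208: the pessimistic estimate grows with the number of leaves
```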
Handling overfitting
– Pre-pruning: stop growing the tree when some early stopping criterion is satisfied
– Post-pruning: grow the full tree first, then prune it (see the sketch below)
  – From bottom to top, try replacing a decision node with a leaf
  – If the generalization error estimate improves, replace the sub-tree with a leaf
– We can also use the minimum description length principle
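A minimal sketch of bottom-up post-pruning over the tuple-based trees built by the hunt sketch above; the error estimator est_error and the routing of records to children are assumptions, not the lecture's concrete procedure:

```python
from collections import Counter

def prune(tree, records, labels, est_error):
    """Bottom-up post-pruning of a (attribute, {value: subtree}) tree.
    est_error(tree, records, labels) is an assumed generalization-error
    estimate, e.g. the pessimistic one from the previous sketch."""
    if not isinstance(tree, tuple):              # a leaf: nothing to prune
        return tree
    attr, children = tree
    pruned = {}
    for value, sub in children.items():          # prune the children first (bottom-up)
        idx = [i for i, r in enumerate(records) if r[attr] == value]
        pruned[value] = prune(sub, [records[i] for i in idx],
                              [labels[i] for i in idx], est_error)
    subtree = (attr, pruned)
    leaf = Counter(labels).most_common(1)[0][0]  # majority-class leaf candidate
    # Replace the sub-tree with the leaf if the error estimate does not get worse.
    if est_error(leaf, records, labels) <= est_error(subtree, records, labels):
        return leaf
    return subtree
```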
Minimum description length principle (MDL)
– The total description length of the data has two parts:
  – The complexity of explaining the model for the data, L(M)
  – The complexity of explaining the data given the model, L(D | M)
  – L = L(M) + L(D | M)
– The best model is the one that needs the fewest bits to describe the data
  – This is the minimum description length principle
  – Computing the least number of bits needed to produce a data set is its Kolmogorov complexity
  – MDL approximates Kolmogorov complexity
MDL and classification
– For decision trees, L(M) encodes the tree itself and L(D | M) encodes the records the tree misclassifies, i.e. the classification error
  – Per the MDL principle, the better the encoder, the better the results
  – The art of creating good encoders is at the heart of using MDL
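As a sketch only, here is one simple encoding sometimes used in textbook exercises, not the lecture's own definition: each internal node costs log2(m) bits for m attributes, each leaf costs log2(c) bits for c classes, and each misclassified record costs log2(n) bits for n training records:

```python
from math import log2

def mdl_cost(n_internal, n_leaves, n_errors, n_attributes, n_classes, n_records):
    """Total description length L = L(M) + L(D | M) under the assumed encoding."""
    model_bits = n_internal * log2(n_attributes) + n_leaves * log2(n_classes)
    data_bits = n_errors * log2(n_records)    # point out each misclassified record
    return model_bits + data_bits

# A bigger tree with fewer errors vs. a smaller tree with more errors:
print(mdl_cost(7, 8, 2, n_attributes=4, n_classes=2, n_records=100))   # ~35.3
print(mdl_cost(3, 4, 5, n_attributes=4, n_classes=2, n_records=100))   # ~43.2
```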
Summary of decision trees
– Decision trees are inexpensive to construct and fast to apply
  – Small ones are easy to interpret
– Single-attribute splits give axis-parallel (rectilinear) decision boundaries