CISC 4631 Data Mining
Lecture 04:
- Decision Trees
These slides are based on the slides by
- Tan, Steinbach and Kumar (textbook authors)
- Eamonn Keogh (UC Riverside)
- Raymond Mooney (UT Austin)
Classification: Definition
Illustrating the classification task: a learning algorithm performs induction on the Training Set to learn a Model; the Model is then applied (deduction) to the Test Set, whose class labels are unknown.

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (the splitting attributes are the internal nodes):
Refund?
  Yes -> NO
  No  -> MarSt?
           Married -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
Another tree for the same training data:
MarSt?
  Married -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
There could be more than one tree that fits the same data!
Decision Tree Classification Task: the same framework as above, with a Tree Induction algorithm learning a Decision Tree model from the Training Set and then applying it to the Test Set.
Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree:
1. Root test Refund? The record has Refund = No, so follow the "No" branch to the MarSt node.
2. Test MarSt? The record has Marital Status = Married, so follow the "Married" branch.
3. That branch ends in a leaf labeled NO, so assign Cheat to "No".
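This traversal can be written directly as nested conditionals. Below is a minimal Python sketch of the tree above; the function name and record encoding are my own choices, not from the slides.

```python
def classify_cheat(record):
    """Walk the Refund / MarSt / TaxInc tree from the slides for one record."""
    if record["Refund"] == "Yes":
        return "No"                    # Refund = Yes -> leaf NO
    if record["MarSt"] == "Married":
        return "No"                    # Married -> leaf NO
    # Single or Divorced: fall through to the Taxable Income test
    return "No" if record["TaxInc"] < 80 else "Yes"

# The test record from the slides: Refund = No, Married, Taxable Income = 80K
print(classify_cheat({"Refund": "No", "MarSt": "Married", "TaxInc": 80}))  # -> "No"
```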
Ross Quinlan is a computer scientist who works in data mining and decision theory. He has contributed extensively to the development of decision tree algorithms, including inventing the canonical C4.5 and ID3 algorithms.
[Figure: scatter plot of Antenna Length vs. Abdomen Length for Katydids and Grasshoppers, axes from 1 to 10]

A decision tree for the insect data:
Abdomen Length > 7.1?
  yes -> Katydid
  no  -> Antenna Length > 6.0?
           yes -> Katydid
           no  -> Grasshopper
[Figure: an insect identification key using the questions "Antennae shorter than body?", "3 Tarsi?", and "Foretiba has ears?" to separate Grasshopper, Cricket, Katydids, and Camel Cricket]
Decision trees predate computers.
A decision tree is a classifier in the form of a tree structure:
– Decision node: specifies a test on a single attribute
– Leaf node: indicates the value of the target attribute
– Arc/edge: one outcome of the split on an attribute
– Path: a conjunction of tests that leads to the final decision
Decision trees classify instances or examples by starting at the root of the tree and moving through it until a leaf node is reached.
Decision tree induction consists of two phases:
– Tree construction
– Tree pruning
To classify an unknown sample:
– Test the attribute values of the sample against the decision tree
[Figure: decision tree on the weather data]
Outlook?
  sunny -> humidity? (high -> no, normal -> yes)
  overcast -> yes
  rain -> wind? (strong -> no, weak -> yes)
Basic algorithm (top-down induction):
– The tree is constructed in a top-down, recursive, divide-and-conquer manner
– At the start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they can be discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping the partitioning:
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
– There are no samples left
Main loop of top-down induction:
1. A <- the "best" decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant of the node
4. Sort the training examples to the leaf nodes
5. If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes
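A compact Python sketch of this loop, assuming categorical attributes; the helper names and data layout are my own choices, and choose_best stands in for whatever attribute-selection measure is used (e.g. information gain, introduced below).

```python
from collections import Counter

def grow_tree(examples, attributes, choose_best):
    """Recursively grow a decision tree.
    examples   : list of (attribute_dict, label) pairs
    attributes : attribute names still available for splitting
    choose_best: function(examples, attributes) -> the "best" attribute
    """
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                 # all examples in one class -> leaf
        return labels[0]
    if not attributes:                        # no attributes left -> majority-vote leaf
        return Counter(labels).most_common(1)[0][0]
    best = choose_best(examples, attributes)  # steps 1-2: pick and assign A
    tree = {best: {}}
    for v in {x[best] for x, _ in examples}:  # steps 3-4: one branch per value of A
        subset = [(x, y) for x, y in examples if x[best] == v]
        remaining = [a for a in attributes if a != best]
        tree[best][v] = grow_tree(subset, remaining, choose_best)  # step 5: recurse
    return tree
```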
Problems with letting the tree grow unchecked:
– The tree can grow huge
– Such trees are hard to understand
– Larger trees are typically less accurate than smaller trees
The key step is the selection of an attribute to test at each node: choosing the most useful attribute for classifying examples.
– How? Information gain.
– Information gain measures how well a given attribute separates the training examples according to their target classification.
– This measure is used to select among the candidate attributes at each step while growing the tree.
Before splitting: 10 records of class C0 and 10 records of class C1. Three candidate test conditions:
– Own Car? Yes: C0 = 6, C1 = 4; No: C0 = 4, C1 = 6
– Car Type? Family: C0 = 1, C1 = 3; Sports: C0 = 8, C1 = 0; Luxury: C0 = 1, C1 = 7
– Student ID? c1: C0 = 1, C1 = 0; ... c10: C0 = 1, C1 = 0; c11: C0 = 0, C1 = 1; ... c20: C0 = 0, C1 = 1
Which test condition is the best? Why is Student ID a bad feature to use?
We prefer splits that produce homogeneous (pure) child nodes. For example, for a split such as Gender:
– A node with C0: 5, C1: 5 is non-homogeneous, with a high degree of impurity
– A node with C0: 9, C1: 1 is nearly homogeneous, with a low degree of impurity
– Smaller trees are preferred (Occam's razor).
– Greedy top-down search constructs a reasonably good tree but does not guarantee to find the smallest.
– General lesson in Machine Learning and Data Mining: "Greed is good."
– We want to pick an attribute that creates subsets of examples that are relatively "pure" in a single class, so they are "closer" to being leaf nodes.
– A popular heuristic for picking the test attribute is based on information gain, which originated with the ID3 system of Quinlan (1979).
An analogy: I am thinking of an integer between 1 and 1,000; what is it? What is the first question you would ask?
– You want questions whose answers most quickly identify the integer.
– The best questions are those whose possible answers are roughly equally likely.
Entropy of a set S relative to a binary classification is:
    Entropy(S) = – p1 log2(p1) – p0 log2(p0)
where p1 is the fraction of positive examples in S and p0 is the fraction of negatives.
Entropy can be viewed as the number of bits required on average to encode the class of an example in S when data compression (e.g. Huffman coding) is used to give shorter codes to more likely cases.
For c classes the definition generalizes to:
    Entropy(S) = – Σ_{i=1..c} p_i log2(p_i)
Entropy is at its maximum when the classes are equally represented (or any outcome is equally possible).
[Figure: entropy of a 2-class problem as a function of the proportion of one of the two groups]
Information gain is the expected reduction in entropy obtained by partitioning the examples according to this attribute. Equivalently, Gain(S, A) is the number of bits saved, on average, when encoding the class of an arbitrary member of S, by knowing the value of attribute A.
The gain compares the entropy of the parent set with the weighted average entropy of the child sets (one term per child set, weighted by size).
Note: entropy is at its minimum (zero) if the collection of objects is completely uniform, i.e. all objects belong to the same class.

Entropy at a node t: Entropy(t) = – Σ_j p(j | t) log2 p(j | t)
NOTE: p(j | t) is computed as the relative frequency of class j at node t.

Examples for computing entropy:
  C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0
  C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65
  C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
  C1 = 3, C2 = 3: P(C1) = 3/6 = 1/2, P(C2) = 3/6 = 1/2; Entropy = – (1/2) log2 (1/2) – (1/2) log2 (1/2) = 1
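A few lines of Python reproduce the example entropies above (a minimal sketch; the function name is my own, not from the slides):

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a node with the given class counts."""
    total = sum(counts)
    # -p log2(p) written as p log2(1/p); zero counts contribute nothing
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

for counts in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(counts, round(entropy(counts), 2))   # 0.0, 0.65, 0.92, 1.0
```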
Information gain of a split:
    GAIN_split = Entropy(p) – Σ_{i=1..k} (n_i / n) Entropy(i)
where the parent node p is split into k partitions and n_i is the number of records in partition i.
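The same formula as a Python sketch (names are mine); as a check, it is applied to the "Own Car?" split from the earlier slide (parent 10/10, children 6/4 and 4/6):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

def gain_split(parent_counts, child_counts_list):
    """GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child i)."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted

# "Own Car?" split: parent has 10/10 records, children 6/4 and 4/6
print(round(gain_split((10, 10), [(6, 4), (4, 6)]), 3))   # -> 0.029
```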
Splitting on a discrete attribute simply partitions the examples into subsets (easy for a discrete attribute). For a continuous-valued attribute A:
– Partition the continuous values of attribute A into a discrete set of intervals, or
– Create a new Boolean attribute A_c by looking for a threshold c:
    A_c = true if A < c, false otherwise
How to choose c?
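One standard way to choose c, sketched below with my own helper names: sort the observed values of A, consider the midpoint between each pair of adjacent distinct values as a candidate threshold, and keep the candidate with the highest information gain. The toy data are six Taxable Income values and Cheat labels taken from the training table above.

```python
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

def best_threshold(values, labels):
    """Try the midpoint between each pair of adjacent distinct values."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_c, best_gain = None, -1.0
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        c = (v1 + v2) / 2
        left = [y for x, y in pairs if x < c]
        right = [y for x, y in pairs if x >= c]
        gain = base - (len(left) / len(pairs)) * entropy(left) \
                    - (len(right) / len(pairs)) * entropy(right)
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c, best_gain

# Taxable Income (in K) and Cheat labels for records 6, 3, 9, 8, 10, 5
print(best_threshold([60, 70, 75, 85, 90, 95], ["No", "No", "No", "Yes", "Yes", "Yes"]))
# -> (80.0, 1.0): the familiar 80K threshold
```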
Let us try splitting on Hair Length.
Hair Length <= 5?
Entropy of the whole set: Entropy(4F, 5M) = –(4/9) log2(4/9) – (5/9) log2(5/9) = 0.9911
(Here Entropy(S) = –(p/(p+n)) log2(p/(p+n)) – (n/(p+n)) log2(n/(p+n)), with p positives and n negatives.)
Gain(Hair Length <= 5) = 0.9911 – (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911
Let us try splitting on Weight.
Weight <= 160?
Entropy(4F, 5M) = 0.9911, as before.
Gain(Weight <= 160) = 0.9911 – (5/9 * 0.7219 + 4/9 * 0) = 0.5900
Let us try splitting on Age.
Age <= 40?
Entropy(4F, 5M) = 0.9911, as before.
Gain(Age <= 40) = 0.9911 – (6/9 * 1 + 3/9 * 0.9183) = 0.0183
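These three gains can be reproduced in a few lines from the class counts in each branch (the counts are inferred from the branch sizes and entropies quoted on the slides; helper names are mine):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

def gain(parent, children):
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

parent = (4, 5)                                   # 4 females, 5 males
print(f"{gain(parent, [(1, 3), (3, 2)]):.4f}")    # Hair Length <= 5 -> 0.0911
print(f"{gain(parent, [(4, 1), (0, 4)]):.4f}")    # Weight <= 160    -> 0.5900
print(f"{gain(parent, [(3, 3), (1, 2)]):.4f}")    # Age <= 40        -> 0.0183
```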
Of the 3 features we had, Weight was best. But while people who weigh over 160 are perfectly classified (as males), the under-160 people are not perfectly classified... so we simply recurse on that subset! This time we find that we can split on Hair Length, and we are done. The resulting tree:
Weight <= 160?
  no  -> Male
  yes -> Hair Length <= 2?
           yes -> Male
           no  -> Female
Once the tree is built we don't need to keep the data around, just the test conditions (Weight <= 160?, then Hair Length <= 2?).
How would these people be classified?
It is trivial to convert Decision Trees to rules…
Rules to Classify Males/Females:
  If Weight greater than 160, classify as Male
  Elseif Hair Length less than or equal to 2, classify as Male
  Else classify as Female
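The rule list maps directly onto an if/elif chain; a small illustrative Python version (function name and example inputs are mine):

```python
def classify_person(weight, hair_length):
    """Rule form of the tree: Weight test first, then Hair Length."""
    if weight > 160:
        return "Male"
    elif hair_length <= 2:
        return "Male"
    else:
        return "Female"

print(classify_person(weight=180, hair_length=10))  # Male (weight rule fires)
print(classify_person(weight=130, hair_length=1))   # Male (hair length rule fires)
print(classify_person(weight=130, hair_length=8))   # Female
```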
[Figure: decision tree for a typical shared-care setting, applying the system for the diagnosis of prostatic obstructions]
This decision tree is attached to a medical machine, and is designed to help nurses make decisions about what type of doctor to call.
The worked examples we have seen were performed on small datasets. However, with small datasets there is a great danger of overfitting.
When you have few datapoints, there are many possible splitting rules that perfectly classify the data but will not generalize to future datasets.
For example, the rule "Wears green?" perfectly classifies the data, but so does "Mother's name is Jacqueline?", and so does "Has blue shoes"…
Choosing between candidate splits in general: before splitting, the node holds N00 records of class C0 and N01 of class C1, with impurity M0.
– Split on A? Yes -> Node N1 (C0: N10, C1: N11), No -> Node N2 (C0: N20, C1: N21); the children's impurities M1 and M2 combine into the weighted impurity M12.
– Split on B? Yes -> Node N3 (C0: N30, C1: N31), No -> Node N4 (C0: N40, C1: N41); weighted impurity M34.
Compare Gain = M0 – M12 vs. M0 – M34 and take the split with the larger gain.
An alternative impurity measure is the Gini index:
    GINI(t) = 1 – Σ_j [p(j | t)]^2
NOTE: p(j | t) is computed as the relative frequency of class j at node t.
Gini values for some class distributions:
  C1 = 0, C2 = 6: Gini = 0.000
  C1 = 1, C2 = 5: Gini = 0.278
  C1 = 2, C2 = 4: Gini = 0.444
  C1 = 3, C2 = 3: Gini = 0.500
Examples for computing Gini:
  C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Gini = 1 – P(C1)^2 – P(C2)^2 = 1 – 0 – 1 = 0
  C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Gini = 1 – (1/6)^2 – (5/6)^2 = 0.278
  C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Gini = 1 – (2/6)^2 – (4/6)^2 = 0.444
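A quick Python check of these Gini values (the function name is mine):

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, from the class counts at a node."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

for counts in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(counts, round(gini(counts), 3))   # 0.0, 0.278, 0.444, 0.5
```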
A third impurity measure is the classification error at a node t:
    Error(t) = 1 – max_i P(i | t)
– It is maximal when the records are equally distributed among all classes, implying the least interesting information
– It is minimal (0) when all records belong to one class, implying the most interesting information
Examples for computing classification error:
  C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Error = 1 – max(0, 1) = 1 – 1 = 0
  C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Error = 1 – max(1/6, 5/6) = 1 – 5/6 = 1/6
  C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Error = 1 – max(2/6, 4/6) = 1 – 4/6 = 1/3
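For comparison, the three impurity measures (entropy, Gini, classification error) can be computed side by side for the same class counts; a small sketch with my own function names:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def classification_error(counts):
    return 1 - max(counts) / sum(counts)

for counts in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(counts, round(entropy(counts), 3), round(gini(counts), 3),
          round(classification_error(counts), 3))
# All three are 0 for a pure node and largest at the 50/50 split,
# but they penalize intermediate mixtures differently.
```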
Tree building is a greedy strategy, so we need to use a splitting metric that leads to globally better results; measures such as entropy and Gini appear to do this in practice, but there is no proof for this.
[Figures (two slides): decision regions learned by a tree on two attributes, x1 = petal length and x2 = sepal width]
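Assuming the figures use the classic iris data, boundaries like these can be reproduced with scikit-learn's DecisionTreeClassifier; a sketch (the max_depth value is an arbitrary choice, not from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
# Use only petal length (column 2) and sepal width (column 1), as in the figures
X = iris.data[:, [2, 1]]
y = iris.target

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree.fit(X, y)
print(tree.score(X, y))   # training accuracy of the axis-parallel splits
```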
Two ways to avoid overfitting the tree:
– Terminate growth early
– Grow to purity, then prune back
[Figure: the petal length / sepal width regions again, with a leaf that is not statistically supportable; the remedy is to remove the split and merge the leaves]
An induced tree may overfit the training data:
– Too many branches, some of which may reflect anomalies due to noise or outliers
– The result is poor accuracy for unseen samples
Two approaches to avoid overfitting:
– Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
– Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees, then use held-out data to pick the "best pruned tree"
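As one concrete postpruning mechanism, scikit-learn offers minimal cost-complexity pruning through the ccp_alpha parameter; a sketch, with an illustrative train/validation split and alpha-selection strategy that are not from the slides:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Grow a full tree, then obtain the sequence of progressively pruned trees
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# Pick the "best pruned tree" using data held out from training
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_val, y_val),
)
print(best.get_n_leaves(), best.score(X_val, y_val))
```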
63
CarType
Family Sports Luxury
CarType
{Family, Luxury} {Sports}
CarType
{Sports, Luxury} {Family}
64
Splitting on an ordinal attribute such as Size (Small < Medium < Large):
– Multi-way split: {Small}, {Medium}, {Large}
– Binary splits that respect the order: {Small} vs. {Medium, Large}, or {Small, Medium} vs. {Large}
– A grouping such as {Small, Large} vs. {Medium} violates the order property
Splitting on a continuous attribute:
– Discretize to form an ordinal categorical attribute; the ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
Example on Taxable Income:
(i) Binary split: Taxable Income > 80K? (Yes / No)
(ii) Multi-way split: < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K
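Equal-interval and equal-frequency bucketing are one-liners in pandas; a sketch using the Taxable Income values from the earlier training data (the bin counts are arbitrary choices):

```python
import pandas as pd

income = pd.Series([125, 100, 70, 120, 95, 60, 220, 85, 75, 90])  # Taxable Income in K

equal_width = pd.cut(income, bins=4)    # equal-interval bucketing
equal_freq = pd.qcut(income, q=4)       # equal-frequency bucketing (quartiles)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```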
Decision trees do not generalize well to certain Boolean functions, for example the parity function:
– Class = 1 if there is an even number of Boolean attributes with truth value = True
– Class = 0 if there is an odd number of Boolean attributes with truth value = True
Modeling parity accurately requires a complete tree over all the attributes.
Example tree and its decision boundary (class counts shown at each leaf):
x < 0.43?
  Yes -> y < 0.47? (Yes leaf: 4 : 0, No leaf: 0 : 4)
  No  -> y < 0.33? (Yes leaf: 0 : 3, No leaf: 4 : 0)
[Figure: the corresponding axis-parallel regions in the unit square of x and y]
The border between two neighboring regions of different classes is known as the decision boundary. The decision boundary is parallel to the axes because each test condition involves a single attribute at a time.
A test condition can also involve multiple attributes, e.g. x + y < 1, giving an oblique decision boundary that separates Class = + from the other class.
Example data set: 500 circular and 500 triangular data points.
Circular points: 0.5 <= sqrt(x1^2 + x2^2) <= 1
Triangular points: sqrt(x1^2 + x2^2) > 1 or sqrt(x1^2 + x2^2) < 0.5
[Figure: the subtree replication problem; a tree over attributes P, Q, R, S in which the same subtree (a test on Q followed by a test on S) appears in more than one branch]
[Figures (two slides): scatter plots of the example "Problems"]
Which of the "Problems" can be solved by a Decision Tree?
The Decision Tree has a hard time with correlated attributes.