Introduction to Machine Learning CMU-10701
- 23. Decision Trees
Barnabás Póczos

Contents:
Decision Trees: Definition + Motivation
Algorithm for Learning Decision Trees
Entropy, Mutual Information, Information Gain
Generalizations
Many of these slides are taken from
Learn decision rules from a dataset: Do we want to play tennis?
4 discrete-valued attributes (Outlook, Temperature, Humidity, Wind)
Play tennis?: "Yes/No" classification problem
We want to learn a “good” decision tree from the data. For example, this tree:
Formal Problem Setting:
Input: a set of labeled training examples
Output: a hypothesis h from H (H = the set of possible decision trees)
In decision tree learning we are doing function approximation, where the set of hypotheses H = the set of decision trees.
Each internal node is labeled with some feature xj
Arcs (from xj) are labeled with the results of the test on xj
Leaf nodes specify the class h(x)

One instance:
Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong
is classified as "No" (Temperature and Wind are irrelevant for this instance)

Easy to use in classification; interpretable rules
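To make the traversal concrete, here is a minimal Python sketch, assuming the PlayTennis tree shown earlier (Outlook at the root, Humidity under Sunny, Wind under Rain, Overcast a "Yes" leaf); the data structure and function names are illustrative only.

```python
# Minimal sketch of the PlayTennis tree: internal node = (feature, {value: subtree}),
# leaf = class label.  Structure assumed from the slide's example tree.
tree = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain":     ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

def classify(node, instance):
    """Follow the arc matching the instance's feature value until a leaf is reached."""
    while isinstance(node, tuple):          # still at an internal node
        feature, branches = node
        node = branches[instance[feature]]  # take the arc for this value
    return node                             # leaf = predicted class

x = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Strong"}
print(classify(tree, x))  # -> "No"; Temperature and Wind are never tested
```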
Features can be continuous.
The output can be continuous too (regression trees).
Instead of single features, we can use sets of features in the nodes.
Later we will discuss these generalizations in more detail.
If a feature is continuous:
internal nodes may test its value against a threshold.
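A minimal sketch of such a threshold test, using made-up names; the 80K cutoff mirrors the tax-fraud tree that follows.

```python
# Hypothetical sketch: an internal node testing a continuous feature against a threshold.
def threshold_node(feature, threshold, below, at_or_above):
    """Route an instance by comparing instance[feature] with the threshold."""
    def node(instance):
        branch = below if instance[feature] < threshold else at_or_above
        return branch(instance) if callable(branch) else branch
    return node

income_node = threshold_node("TaxableIncome", 80_000, below="No", at_or_above="Yes")
print(income_node({"TaxableIncome": 95_000}))  # -> "Yes"
```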
Tax Fraud Detection: the goal is to predict who is cheating on their taxes,
using the 'Refund', 'Marital Status', and 'Taxable Income' features.
Build a tree that matches the data.
Decision tree built from the data:
Refund?
  Yes → NO
  No  → MarSt?
          Married → NO
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
(Data: table of training records shown on the slide.)
Query data (one record): Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and apply its tests to the query record:
Refund = No → follow the "No" branch to the MarSt node.
Marital Status = Married → follow the "Married" branch.
That branch is a leaf → assign Cheat to "No".
Decision trees divide the feature space into axis-parallel rectangles, labeling each rectangle with one class.
(Figure: an example with two features only, x1 and x2.)
Some functions cannot be represented with binary (axis-parallel) splits: if we want to learn the function shown in the figure too, it cannot be represented exactly with binary splits.
How would you represent Y = X2 and X5?  Y = X2 or X5?
How would you represent Y = X2 X5 ∨ X3 X4 (¬X1)?
(Figure: a decision tree for this formula, with X1 tested at the root and X2 tested in both subtrees.)
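A minimal sketch of the AND / OR cases, assuming we write each tree as nested single-variable tests; the longer formula X2 X5 ∨ X3 X4 (¬X1) can be built the same way, e.g. with X1 at the root.

```python
from itertools import product

# Boolean AND and OR written as decision trees: one variable per internal node,
# a class label at each leaf.
def tree_and(x2, x5):        # Y = X2 and X5
    if x2:
        return x5            # X2 = True: the X5 test decides
    return False             # X2 = False: leaf "False"

def tree_or(x2, x5):         # Y = X2 or X5
    if x2:
        return True          # X2 = True: leaf "True"
    return x5                # X2 = False: the X5 test decides

# The trees agree with the formulas on every assignment.
assert all(tree_and(a, b) == (a and b) and tree_or(a, b) == (a or b)
           for a, b in product([False, True], repeat=2))
```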
Intuition: we want SMALL trees
... to capture the "regularities" in the data, to be easier to understand, and faster to execute.

Trees can represent any boolean (and discrete-valued) function, e.g. (A v B) & (C v not D v E):
just produce a "path" for each example (i.e., store the training data) ... but this may require exponentially many nodes, and what generalization capability does it have for instances not in the training data?

It is NP-hard to find the smallest tree that fits the data.
1000 patients: 25% have butterfly-itis (250), 75% are healthy (750).
Use 10 silly features, not related to the class label.
Standard decision tree learner: Error Rate:
Train data: 0%
New data: 37%
Optimal decision tree: Error Rate:
Train data: 25%
New data: 25%
Regularization is important…
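A minimal sketch of this effect, assuming scikit-learn and NumPy are available; the dataset, feature count, and seed are made up, so the exact error rates will vary, but the pattern (near-0% training error, roughly 37% error on new data, versus the 25% base rate) matches the slide.

```python
# Overfitting with irrelevant features: a full tree memorizes the training set
# but does much worse than the 25% majority-class error on fresh data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
def make_data(n):
    X = rng.random((n, 10))          # 10 "silly" features, unrelated to the label
    y = rng.random(n) < 0.25         # 25% of patients have butterfly-itis
    return X, y

X_train, y_train = make_data(1000)
X_new,   y_new   = make_data(1000)

full_tree = DecisionTreeClassifier().fit(X_train, y_train)
print("train error:   ", 1 - full_tree.score(X_train, y_train))  # ~0.00
print("new-data error:", 1 - full_tree.score(X_new, y_new))      # ~0.37

# The "optimal" tree here is a single leaf that predicts the majority class.
stump = DecisionTreeClassifier(min_samples_split=len(X_train) + 1).fit(X_train, y_train)
print("majority-class error on new data:", 1 - stump.score(X_new, y_new))  # ~0.25
```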
Which attribute should we split on?  80 training people (50 Genuine, 30 Cheats).

Split on Refund:
  Yes: 40 Genuine, 0 Cheats  (absolutely sure)
  No:  10 Genuine, 30 Cheats (kind of sure)

Split on Marital Status:
  Single, Divorced: 30 Genuine, 10 Cheats (kind of sure)
  Married:          20 Genuine, 20 Cheats (absolutely unsure)
H(Y) – entropy of Y
H(Y|Xi) – conditional entropy of Y given Xi
Entropy of a random variable Y: H(Y) = -Σ_y P(Y = y) log2 P(Y = y)

Information Theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code).

Larger uncertainty, larger entropy!  For Y ~ Bernoulli(p):
uniform (p = 1/2) gives maximum entropy; deterministic (p = 0 or p = 1) gives zero entropy.
(Figure: the entropy H(Y) plotted as a function of p.)
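A small sketch of this curve, computing the Bernoulli entropy directly:

```python
import math

# Entropy of Y ~ Bernoulli(p): maximal (1 bit) at p = 0.5, zero at p = 0 or 1.
def bernoulli_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f}   H = {bernoulli_entropy(p):.4f} bits")
# p = 0.5 gives 1.0000; p = 0.1 and 0.9 give 0.4690; p = 0 and 1 give 0.0000
```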
Advantage of an attribute = decrease in uncertainty.
Information gain is the difference: IG(Y, Xi) = H(Y) - H(Y|Xi).
Max information gain = min conditional entropy H(Y|Xi); we want the conditional entropy to be small.
Which feature best splits the data into + and - instances?
Outlook feature looks great, because the Overcast branch is perfectly separated.
If we split on xi, we produce 2 children:
(1) the #(xi = t) instances follow the TRUE branch, with label counts [#(xi = t, Y = +), #(xi = t, Y = -)]
(2) the #(xi = f) instances follow the FALSE branch, with label counts [#(xi = f, Y = +), #(xi = f, Y = -)]
Calculate the mutual information between xi and Y!
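A minimal sketch of that computation from the four counts; the function and argument names are made up, and the example numbers are the Refund split from the tax data above.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a label distribution given by a list of counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def info_gain(pos_t, neg_t, pos_f, neg_f):
    """I(Y; xi) = H(Y) - sum over branches of P(branch) * H(Y | branch)."""
    n_t, n_f = pos_t + neg_t, pos_f + neg_f
    n = n_t + n_f
    h_y = entropy([pos_t + pos_f, neg_t + neg_f])
    h_y_given_x = (n_t / n) * entropy([pos_t, neg_t]) + (n_f / n) * entropy([pos_f, neg_f])
    return h_y - h_y_given_x

# Refund split: Yes branch (40 Genuine, 0 Cheats), No branch (10 Genuine, 30 Cheats)
print(round(info_gain(40, 0, 10, 30), 4))   # -> 0.5488 bits
```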
Outlook, 14 instances: (9+, 5-)
H = -(9/14 log2(9/14) + 5/14 log2(5/14)) = 0.9403
Sunny [2+, 3-]:    H1 = -(2/5 log2(2/5) + 3/5 log2(3/5)) = 0.9710
Overcast [4+, 0-]: H2 = -(4/4 log2(4/4) + 0/4 log2(0/4)) = 0
Rain [3+, 2-]:     H3 = -(3/5 log2(3/5) + 2/5 log2(2/5)) = 0.9710
I(Y, Outlook) = 0.940 - (5/14 * H1 + 4/14 * H2 + 5/14 * H3) = 0.2465
Humidity, 14 instances: (9+, 5-)
H = -(9/14 log2(9/14) + 5/14 log2(5/14)) = 0.9403
High [3+, 4-]:   H = -(3/7 log2(3/7) + 4/7 log2(4/7)) = 0.9852
Normal [6+, 1-]: H = -(6/7 log2(6/7) + 1/7 log2(1/7)) = 0.5917
I(Y, Humidity) = 0.940 - 7/14 * 0.9852 - 7/14 * 0.5917 = 0.151
Wind, 14 instances: (9+, 5-)
H = -(9/14 log2(9/14) + 5/14 log2(5/14)) = 0.9403
Weak [6+, 2-]:   H = -(6/8 log2(6/8) + 2/8 log2(2/8)) = 0.811
Strong [3+, 3-]: H = -(3/6 log2(3/6) + 3/6 log2(3/6)) = 1
I(Y, Wind) = 0.940 - 8/14 * 0.811 - 6/14 * 1 = 0.048
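The three numbers can be reproduced with a few lines (a self-contained sketch; small differences from the slide are just rounding):

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gain(parent, children):
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

root = [9, 5]                                          # 14 instances: (9+, 5-)
print(round(gain(root, [[2, 3], [4, 0], [3, 2]]), 3))  # Outlook  -> 0.247
print(round(gain(root, [[3, 4], [6, 1]]), 3))          # Humidity -> 0.152
print(round(gain(root, [[6, 2], [3, 3]]), 3))          # Wind     -> 0.048
```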
Similar calculations for the Temperature feature show that Outlook is the best root node among all features.
Then recurse on each branch; in the Sunny branch, for example, Humidity is the best feature to split on next.
http://www.cs.ualberta.ca/%7Eaixplore/learning/DecisionTrees/Applet/DecisionTreeApplet.html
(Figure: training data points with their class labels.)
Regression trees: average (fit a constant) using the training data at the leaves.
(Figure: a regression tree splitting on "Num Children? ≥ 2 / < 2", over features X1 ... Xp.)
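A minimal sketch of that idea; the split on the number of children comes from the figure, while the records and income values are made up.

```python
# Each leaf of a regression tree predicts a constant: the mean of the training
# targets that reach it.
people = [{"num_children": 0, "income": 60}, {"num_children": 1, "income": 75},
          {"num_children": 2, "income": 40}, {"num_children": 3, "income": 35}]

left  = [p["income"] for p in people if p["num_children"] < 2]
right = [p["income"] for p in people if p["num_children"] >= 2]
leaf_mean = {"< 2": sum(left) / len(left), ">= 2": sum(right) / len(right)}

def predict(person):
    return leaf_mean["< 2" if person["num_children"] < 2 else ">= 2"]

print(predict({"num_children": 1}))  # -> 67.5, the mean income in the "< 2" leaf
```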
(Figure: a tree with Refund at the root: Yes → NO; No → MarSt: Married → NO; Single, Divorced → ...)
(Equations on the slide: the node objectives, for regression and for classification.)
Recursive solution: given n attributes, let Hk = the number of decision trees of depth k.

Hk = (# choices of root attribute) * (# possible left subtrees) * (# possible right subtrees)
   = n * Hk-1 * Hk-1

Write Lk = log2 Hk, with L0 = 1 (a depth-0 tree is a single Yes/No leaf).
Lk = log2 n + 2 Lk-1
   = log2 n + 2 (log2 n + 2 Lk-2)
   = log2 n + 2 log2 n + 2^2 log2 n + ... + 2^(k-1) (log2 n + 2 L0)
So Lk = (2^k - 1) log2 n + 2^k  (sum of the first k terms of a geometric series).
Lk = (2^k - 1) log2 n + 2^k
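A quick numerical check of this closed form against the recursion from the previous slide (the choice n = 4 is arbitrary):

```python
import math

n = 4                      # number of attributes, arbitrary for the check
H = 2                      # H_0: a depth-0 tree is just a Yes or a No leaf
for k in range(1, 6):
    H = n * H * H          # recursion H_k = n * H_{k-1}^2
    closed_form = (2**k - 1) * math.log2(n) + 2**k
    assert math.isclose(math.log2(H), closed_form)
print("L_k = (2^k - 1) log2 n + 2^k matches the recursion")
```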
Hk = the number of decision trees with k leaves; H1 = 2 (a Yes leaf or a No leaf).

Hk = (# choices of root attribute) * [(# left subtrees with 1 leaf) * (# right subtrees with k-1 leaves)
   + (# left subtrees with 2 leaves) * (# right subtrees with k-2 leaves) + ...
   + (# left subtrees with k-1 leaves) * (# right subtrees with 1 leaf)]

Solving the recursion gives Hk = n^(k-1) 2^k Ck-1, where Ck-1 is a Catalan number (the number of binary tree shapes with k leaves).
The number of points m is linear in the number of leaves k.
The number of points m is exponential in the depth k (n is the number of features).
m: number of training points
k: number of leaves
OBSERVED DATA                 Voting Preferences
                 Republican   Democrat   Independent   Row total
Male                 200          150          50          400
Female               250          300          50          600
Column total         450          450         100         1000
H0: Gender and voting preferences are independent.
Ha: Gender and voting preferences are not independent.

Expected numbers under H0 (independence): Er,c = (nr * nc) / n
E1,1 = (400 * 450) / 1000 = 180
E1,2 = (400 * 450) / 1000 = 180
E1,3 = (400 * 100) / 1000 = 40
E2,1 = (600 * 450) / 1000 = 270
E2,2 = (600 * 450) / 1000 = 270
E2,3 = (600 * 100) / 1000 = 60
Χ² = Σ [ (Or,c - Er,c)² / Er,c ]
Χ² = (200 - 180)²/180 + (150 - 180)²/180 + (50 - 40)²/40
   + (250 - 270)²/270 + (300 - 270)²/270 + (50 - 60)²/60 = 16.2
Degrees of freedom: DF = (r - 1) * (c - 1) = (2 - 1) * (3 - 1) = 2,
where r = # rows, c = # columns.
P(Χ² > 16.2) = 0.0003 < 0.05 (p-value)
⇒ we cannot accept the null hypothesis.
Evidence shows that there is a relationship between gender and voting preference.
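A quick check of these numbers, assuming SciPy is available:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[200, 150, 50],     # Male:   Republican, Democrat, Independent
                     [250, 300, 50]])    # Female: Republican, Democrat, Independent
chi2_stat, p_value, dof, expected = chi2_contingency(observed)
print(chi2_stat, dof, p_value)   # ~16.2, 2, ~0.0003 -> reject independence at the 0.05 level
print(expected)                  # [[180., 180., 40.], [270., 270., 60.]]
```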
1. Build a complete tree.
2. Consider each "leaf" split X, and perform a chi-square independence test.

s = p + n instances enter the node: p positive (+), n negative (-).
False branch: sf = pf + nf instances here (pf positive, nf negative).
True branch:  st = pt + nt instances here (pt positive, nt negative).

Expected numbers under independence:
False branch: sf * p/s positive, sf * n/s negative.
True branch:  st * p/s positive, st * n/s negative.

If after splitting the expected numbers are the same as the measured ones, then there is no point in splitting the node: delete the leaves!
X1  X2  Y   Count
T   T   T   2
T   F   T   2
F   T   F   5
F   F   T   1

(Figure: tree with X1 at the root; X1 = T is a Y = T leaf, and the X1 = F branch tests X2.
At the X2 node: s = 6, p = 1, n = 5.
X2 = F leaf (Y = T): sf = 1, pf = 1, nf = 0.  X2 = T leaf (Y = F): st = 5, pt = 0, nt = 5.)

Variable Assignment   Real Counts of Y=T   Expected Counts of Y=T
X2 = F                        1               1/6   (sf * p/s)
X2 = T                        0               5/6   (st * p/s)

Variable Assignment   Real Counts of Y=F   Expected Counts of Y=F
X2 = F                        0               5/6   (sf * n/s)
X2 = T                        5              25/6   (st * n/s)
If label Y and feature X2 are independent, then the expected counts should be close to the real counts.
Degrees of freedom: DF = (# Y labels - 1) * (# X2 labels - 1) = (2 - 1) * (2 - 1) = 1
Z = Σ [ (Or,c - Er,c)² / Er,c ]
  = (1 - 1/6)²/(1/6) + (0 - 5/6)²/(5/6) + (0 - 5/6)²/(5/6) + (5 - 25/6)²/(25/6)
  = 25/6 + 5/6 + 5/6 + 1/6 = 6
P(Z > c) is the probability that we see this large a deviation by chance under the H0 independence assumption.
P(Z > 3.8415) = 0.05, P(Z ≤ 3.8415) = 0.95.
The smaller Z is, the more likely it is that the feature is independent of the label (there is no evidence of their dependence).
In our case Z = 6 > 3.8415 ⇒ we reject independence and keep this split.
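A quick check of this pruning test, assuming SciPy for the critical value:

```python
from scipy.stats import chi2

# Counts ordered as (X2=F,Y=T), (X2=T,Y=T), (X2=F,Y=F), (X2=T,Y=F)
real     = [1, 0, 0, 5]
expected = [1/6, 5/6, 5/6, 25/6]
Z = sum((o - e) ** 2 / e for o, e in zip(real, expected))
print(Z)                        # -> 6.0
print(chi2.ppf(0.95, df=1))     # -> 3.8415, the critical value at the 0.05 level
print(1 - chi2.cdf(Z, df=1))    # p-value ~ 0.014 < 0.05, so the split is kept
```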
Information gain to select attributes (ID3, C4.5, ...).
Decision trees can be used for classification, regression, and density estimation too.
Decision trees will overfit!!!