

SLIDE 1

Introduction to Machine Learning CMU-10701

  • 23. Decision Trees

Barnabás Póczos

SLIDE 2

Contents

  • Decision Trees: Definition + Motivation
  • Algorithm for Learning Decision Trees
      • Entropy, Mutual Information, Information Gain
  • Generalizations
      • Regression Trees
  • Overfitting
      • Pruning
      • Regularization

Many of these slides are taken from

  • Aarti Singh
  • Eric Xing
  • Carlos Guestrin
  • Russ Greiner
  • Andrew Moore

SLIDE 3

Decision Trees

SLIDE 4

Decision Tree: Motivation

Learn decision rules from a dataset: do we want to play tennis?
  • 4 discrete-valued attributes (Outlook, Temperature, Humidity, Wind)
  • “Play tennis?”: a Yes/No classification problem

SLIDE 5

Decision Tree: Motivation

We want to learn a “good” decision tree from the data. For example, this tree:

[Figure: an example decision tree for the tennis data]

SLIDE 6

Function Approximation

Formal problem setting:

  • Set of possible instances X (the set of all possible feature vectors)
  • Unknown target function f : X → Y
  • Set of function hypotheses H = { h | h : X → Y }
    (here H = the set of possible decision trees)

Input:

  • Training examples { ⟨x(i), y(i)⟩ } of the unknown target function f

Output:

  • Hypothesis h ∈ H that best approximates the target function f

In decision tree learning, we are doing function approximation, where the set of hypotheses H = the set of decision trees.

SLIDE 7

Decision Tree: The Hypothesis Space

  • Each internal node is labeled with some feature xj
  • Each arc (from xj) is labeled with a result of the test on xj
  • Leaf nodes specify the class h(x)

One instance: Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong is classified as “No” (Temperature and Wind are irrelevant on this path).

  • Easy to use in classification
  • Interpretable rules
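
To make this hypothesis space concrete, here is a minimal sketch of such a tree in Python; the dict-based node encoding and the tennis tree below are my own illustrative assumptions, not code from the lecture:

# Minimal decision tree representation: an internal node maps one feature
# to a child per feature value; a leaf is just a class label (a string).
def classify(node, x):
    """Route instance x (a dict of feature -> value) down the tree."""
    while isinstance(node, dict):          # internal node
        feature, children = node["feature"], node["children"]
        node = children[x[feature]]        # follow the arc for x's value
    return node                            # leaf: the predicted class

# The tennis tree sketched in the lecture (Outlook at the root).
tennis_tree = {
    "feature": "Outlook",
    "children": {
        "Sunny": {"feature": "Humidity",
                  "children": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"feature": "Wind",
                 "children": {"Strong": "No", "Weak": "Yes"}},
    },
}

x = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Strong"}
print(classify(tennis_tree, x))  # -> "No"; Temperature and Wind are never tested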

SLIDE 8

Generalizations

  • Features can be continuous
  • The output can be continuous too (regression trees)
  • Instead of single features, we can use sets of features in the nodes

Later we will discuss these in more detail.

SLIDE 9

Continuous Features

If a feature is continuous, internal nodes may test its value against a threshold.

SLIDE 10

Example: Mixed Discrete and Continuous Features

Tax fraud detection: the goal is to predict who is cheating on tax, using the ‘refund’, ‘marital status’, and ‘income’ features. Build a tree that matches the data.

Refund   Marital Status   Taxable Income   Cheat
Yes      Married          50K              No
No       Married          90K              No
No       Single           60K              No
No       Divorced         100K             Yes
Yes      Married          110K             No

SLIDE 11

Decision Tree for Tax Fraud Detection

Tree learned from the data:

Refund?
├─ Yes → NO
└─ No → MarSt?
        ├─ Married → NO
        └─ Single, Divorced → TaxInc?
                ├─ < 80K → NO
                └─ > 80K → YES

  • Each internal node: test one feature Xi
  • Continuous features are tested against a threshold
  • Each branch from a node: selects one value (or set of values) for Xi
  • Each leaf node: predict Y

SLIDE 12

Given a decision tree, how do we assign a label to a test point?

SLIDE 13

Decision Tree for Tax Fraud Detection

[Tree figure as on Slide 11]

Query data:
Refund   Marital Status   Taxable Income   Cheat
No       Married          80K              ?

SLIDE 14

Decision Tree for Tax Fraud Detection

[Tree figure as on Slide 11]

Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start at the root: test Refund.

SLIDE 15

Decision Tree for Tax Fraud Detection

[Tree figure as on Slide 11]

Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Refund = No: follow the “No” branch.

SLIDE 16

Decision Tree for Tax Fraud Detection

[Tree figure as on Slide 11]

Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Refund = No: follow the “No” branch to the MarSt test.

SLIDE 17

Decision Tree for Tax Fraud Detection

[Tree figure as on Slide 11]

Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Refund = No, then Marital Status = Married: follow the “Married” branch.

SLIDE 18

Decision Tree for Tax Fraud Detection

[Tree figure as on Slide 11]

Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Refund = No, Marital Status = Married: we reach the NO leaf. Assign Cheat to “No”.

SLIDE 19

What do decision trees do in the feature space?

SLIDE 20

Decision Tree Decision Boundaries

Decision trees divide the feature space into axis-parallel rectangles and label each rectangle with one class.

[Figure: a tree and its induced rectangular partition, with two features only: x1 and x2]

SLIDE 21

Some functions cannot be represented with binary splits

If we want to learn this kind of function too:
  • we need more complex functions in the nodes than binary splits, or
  • we need to “break” the function into smaller parts that can each be represented with binary splits.

[Figure: a +/− labeled region over axis values 1–5 whose boundary is not axis-parallel]

SLIDE 22

How do we learn a decision tree from training data?

SLIDE 23

What Boolean functions can be represented with decision trees?

How would you represent Y = X2 ∧ X5? Y = X2 ∨ X5?
How would you represent Y = X2X5 ∨ X3X4(¬X1)?
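
For intuition, a sketch (my own, not from the slides) of the first two functions as trees written out as nested tests:

# Y = X2 AND X5 as a depth-2 decision tree: test X2, then X5.
def tree_and(x2, x5):
    if not x2:       # X2 = False branch -> leaf 0
        return 0
    if not x5:       # X5 = False branch -> leaf 0
        return 0
    return 1         # both tests passed -> leaf 1

# Y = X2 OR X5: each True arc exits immediately to a 1 leaf.
def tree_or(x2, x5):
    if x2:
        return 1
    return 1 if x5 else 0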

SLIDE 24

Decision trees can represent any Boolean/discrete function

n Boolean features (x1, …, xn) ⇒
  • 2^n possible different instances
  • 2^(2^n) possible different functions if the class label Y is Boolean too

[Figure: a tree over X1 and X2 with +/− leaves]

SLIDE 25

Intuition: Want SMALL trees

... to capture “regularities” in the data, to be easier to understand, and faster to execute.

Option 1: Just store the training data
  • Trees can represent any Boolean (and discrete) function, e.g. (A ∨ B) ∧ (C ∨ ¬D ∨ E)
  • Just produce a “path” for each example (store the training data)
  • ... but this may require exponentially many nodes ...
  • Any generalization capability? (What about instances that are not in the training data?)

It is NP-hard to find the smallest tree that fits the data.

SLIDE 26

Expressiveness of General Decision Trees

Example: learn A XOR B (Boolean features and labels).

  • There is a decision tree which perfectly classifies a training set, with one path to a leaf for each example.

SLIDE 27

Example of Overfitting

  • 1000 patients
  • 25% have butterfly-itis (250)
  • 75% are healthy (750)
  • Use 10 silly features, not related to the class label:
      • ½ of patients have F1 = 1 (“odd birthday”)
      • ½ of patients have F2 = 1 (“even SSN”)
      • etc.

SLIDE 28

Typical Results

Error rates:
                                   Train data   New data
Standard decision tree learner:    0%           37%
Optimal decision tree:             25%          25%

Regularization is important…

SLIDE 29

How to learn a decision tree

Top-down induction [many algorithms: ID3, C4.5, CART, …]: grow the tree from the root to the leaves.

Repeat:
  1. Select the “best feature” (X1, X2, or X3) to split on
  2. For each value that feature takes, sort the training examples into the leaf nodes
  3. Stop if a leaf contains all training examples with the same label, or if all features are used up
  4. Assign each leaf the majority vote of the labels of its training examples

We will focus on the ID3 algorithm; a sketch follows below.
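
Here is a compact, runnable sketch of this top-down induction for discrete features, using the entropy and information-gain quantities defined on the next slides. The helper names, the dict-based tree encoding, and the dataset format are my own assumptions, not the lecture's reference code:

import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_y p(y) log2 p(y), estimated from a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feature):
    """IG(feature) = H(Y) - sum_v P(feature=v) H(Y | feature=v)."""
    n = len(labels)
    cond = 0.0
    for v in set(r[feature] for r in rows):
        sub = [y for r, y in zip(rows, labels) if r[feature] == v]
        cond += len(sub) / n * entropy(sub)
    return entropy(labels) - cond

def id3(rows, labels, features):
    """rows: list of dicts feature -> value; returns a nested dict tree or a label."""
    if len(set(labels)) == 1:               # all examples share one label: pure leaf
        return labels[0]
    if not features:                        # features used up: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(rows, labels, f))
    tree = {"feature": best, "children": {}}
    for v in set(r[best] for r in rows):    # one branch per observed value
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        tree["children"][v] = id3([rows[i] for i in idx],
                                  [labels[i] for i in idx],
                                  [f for f in features if f != best])
    return tree

Calling id3 on the play-tennis data with features ["Outlook", "Temperature", "Humidity", "Wind"] should reproduce the Outlook-rooted tree built on the following slides.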

SLIDE 30

First Split?

SLIDE 31

Which feature is best to split?

80 training people (50 Genuine, 30 Cheats).

Refund?
├─ Yes: 40 Genuine, 0 Cheats   (absolutely sure)
└─ No:  10 Genuine, 30 Cheats  (kind of sure)

Marital Status?
├─ Single, Divorced: 30 Genuine, 10 Cheats  (kind of sure)
└─ Married:          20 Genuine, 20 Cheats  (absolutely unsure)

Good split: we are less uncertain about the classification after the split.
Refund gives more information about the labels than Marital Status.

SLIDE 32

Which feature is best to split?

Pick the attribute/feature which yields the maximum information gain:

IG(Xi) = H(Y) − H(Y | Xi)

where H(Y) is the entropy of Y and H(Y | Xi) is the conditional entropy of Y given Xi. The feature which yields the maximum reduction in entropy provides the maximum information about Y.

SLIDE 33

Entropy

Entropy of a random variable Y:

H(Y) = − Σy P(Y = y) log2 P(Y = y)

Information Theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code).

[Figure: H(Y) versus p for Y ~ Bernoulli(p). Uniform (p = ½): max entropy. Deterministic (p = 0 or 1): zero entropy.]

Larger uncertainty, larger entropy!
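
A quick numeric check of the Bernoulli case (a sketch; the helper below is mine, not from the slides):

import math

def bernoulli_entropy(p):
    """H(Y) for Y ~ Bernoulli(p), with the convention 0 * log2(0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(bernoulli_entropy(0.5))  # 1.0 bit: uniform, maximum entropy
print(bernoulli_entropy(1.0))  # 0.0 bits: deterministic, zero entropy
print(bernoulli_entropy(0.9))  # ~0.469 bits: low uncertainty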

SLIDE 34

Information Gain

Advantage of an attribute = decrease in uncertainty:
  • Entropy of Y before the split: H(Y)
  • Entropy of Y after splitting based on Xi, weighting each branch by the probability of following it:
    H(Y | Xi) = Σv P(Xi = v) H(Y | Xi = v)

Information gain is the difference:

IG(Xi) = H(Y) − H(Y | Xi)

Max information gain = min conditional entropy: we want H(Y | Xi) to be small.

SLIDE 35

First Split?

Which feature splits the data best into + and − instances?

SLIDE 36

First Split?

The Outlook feature looks great, because the Overcast branch is perfectly separated.

SLIDE 37

Statistics

If we split on xi, we produce 2 children:
  (1) the #(xi = t) examples follow the TRUE branch, with data [#(xi = t, Y = +), #(xi = t, Y = −)]
  (2) the #(xi = f) examples follow the FALSE branch, with data [#(xi = f, Y = +), #(xi = f, Y = −)]

Calculate the mutual information between xi and Y!

SLIDE 38

Information gain of the Outlook feature

Root: 14 examples (9+, 5−),
H = −(9/14 · log2(9/14) + 5/14 · log2(5/14)) = 0.9403

Sunny [2+, 3−]:    H1 = −(2/5 · log2(2/5) + 3/5 · log2(3/5)) = 0.9710
Overcast [4+, 0−]: H2 = −(4/4 · log2(4/4) + 0/4 · log2(0/4)) = 0
Rain [3+, 2−]:     H3 = −(3/5 · log2(3/5) + 2/5 · log2(2/5)) = 0.9710

I(Y, Outlook) = 0.9403 − (5/14 · H1 + 4/14 · H2 + 5/14 · H3) = 0.2465
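
These numbers are easy to reproduce; a self-contained check (a sketch of mine, with the (+, −) counts read off the slides):

import math

def H(pos, neg):
    """Entropy of a node with pos positive and neg negative examples; 0 * log2(0) = 0."""
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c)

root = H(9, 5)                                              # 0.9403
outlook = {"Sunny": (2, 3), "Overcast": (4, 0), "Rain": (3, 2)}
cond = sum((p + n) / 14 * H(p, n) for p, n in outlook.values())
print(round(root - cond, 4))                                # 0.2467 (slide: 0.2465, rounding)
print(round(root - (7/14*H(3, 4) + 7/14*H(6, 1)), 3))       # 0.152  (Humidity; slide: 0.151, rounding)
print(round(root - (8/14*H(6, 2) + 6/14*H(3, 3)), 3))       # 0.048  (Wind, matches the slide)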

SLIDE 39

Information gain of the Humidity feature

Root: 14 examples (9+, 5−), H = 0.9403

High [3+, 4−]:   H = −(3/7 · log2(3/7) + 4/7 · log2(4/7)) = 0.9852
Normal [6+, 1−]: H = −(6/7 · log2(6/7) + 1/7 · log2(1/7)) = 0.5917

I(Y, Humidity) = 0.9403 − 7/14 · 0.9852 − 7/14 · 0.5917 = 0.151

SLIDE 40

Information gain of the Wind feature

Root: 14 examples (9+, 5−), H = 0.9403

Weak [6+, 2−]:   H = −(6/8 · log2(6/8) + 2/8 · log2(2/8)) = 0.811
Strong [3+, 3−]: H = −(3/6 · log2(3/6) + 3/6 · log2(3/6)) = 1

I(Y, Wind) = 0.9403 − 8/14 · 0.811 − 6/14 · 1 = 0.048

SLIDE 41

Repeat and build the tree

  • Similar calculations for the Temperature feature.
  • The Outlook feature is the best root node among all features.
  • Recurse on each branch: for example, on the Sunny branch Humidity is the best next split.

SLIDE 42

Tree Learning App

http://www.cs.ualberta.ca/%7Eaixplore/learning/DecisionTrees/Applet/DecisionTreeApplet.html

SLIDE 43

More general trees

SLIDE 44

Decision/Classification Tree more generally…

  • Features can be discrete or continuous
  • Each internal node: test some set of features {Xi}
  • Each branch from a node: selects a set of values for the set {Xi}
  • Each leaf node: predict Y
      • Majority vote (classification)
      • Average or polynomial fit (regression)

[Figure: a partition of the feature space with the class labels marked in each cell]

SLIDE 45

Regression trees

Average (fit a constant) using the training data at the leaves.

[Figure: a tree over features X1 … Xp whose root tests “Num Children? (≥ 2 vs < 2)”, with a constant fit at each leaf]
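
A minimal sketch of the leaf model (my own illustration): each leaf predicts the mean of the training targets falling into it, which is the constant minimizing squared error:

# One split on a threshold; each side predicts the mean of its targets.
def fit_stump(xs, ys, threshold):
    left  = [y for x, y in zip(xs, ys) if x <  threshold]
    right = [y for x, y in zip(xs, ys) if x >= threshold]
    mean = lambda v: sum(v) / len(v)
    return {"threshold": threshold, "left": mean(left), "right": mean(right)}

def predict(stump, x):
    return stump["left"] if x < stump["threshold"] else stump["right"]

stump = fit_stump([1, 2, 3, 4], [1.0, 1.2, 3.9, 4.1], threshold=2.5)
print(predict(stump, 1.5))  # 1.1, the mean of the left leaf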

SLIDE 46

Regression (Constant) trees

SLIDE 47

Overfitting

SLIDE 48

When to Stop?

Many strategies for picking simpler trees:

  • Pre-pruning
      • Fixed depth
      • Fixed number of leaves
  • Post-pruning
      • Chi-square test
  • Model selection by complexity penalization

[Figure: a pruned tree keeping only the Refund and MarSt splits]

SLIDE 49

Model Selection

Penalize complex models by introducing a cost: a log-likelihood term measuring the fit to the training examples (indexed by j), plus a term that penalizes trees with more leaves. There is a version of this cost for regression and one for classification.
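
As a sketch, one standard form of such a penalized objective (the regularization weight λ and the likelihood notation are my assumptions, not the slide's exact formula):

\mathrm{cost}(T) \;=\; -\sum_{j=1}^{m} \log p\!\left(y^{(j)} \mid x^{(j)},\, T\right) \;+\; \lambda \cdot \#\mathrm{leaves}(T)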

SLIDE 50

Pre-Pruning

SLIDE 51
PAC bound and Bias-Variance tradeoff

Equivalently, for a fixed sample size m, with probability ≥ 1 − δ every h ∈ H satisfies

errtrue(h) ≤ errtrain(h) + sqrt( (ln |H| + ln(1/δ)) / (2m) )

H hypothesis space:  complex → small bias, large variance
                     simple  → large bias, small variance

SLIDE 52
Sample complexity: to guarantee errtrue(h) ≤ errtrain(h) + ε with probability ≥ 1 − δ, we need

m ≥ (ln |H| + ln(1/δ)) / (2ε²)

training examples. What about the size of the hypothesis space? How large is the hypothesis space of decision trees?

SLIDE 53

Number of decision trees of depth k

Recursive solution, given n attributes. Let Hk = the number of decision trees of depth k, with H0 = 2 (the constant “Yes” and “No” trees):

Hk = (# choices of root attribute) × (# possible left subtrees) × (# possible right subtrees)
   = n · Hk−1 · Hk−1

Write Lk = log2 Hk, so L0 = 1 and

Lk = log2 n + 2 Lk−1
   = log2 n + 2 (log2 n + 2 Lk−2)
   = log2 n + 2 log2 n + 2² log2 n + … + 2^(k−1) (log2 n + 2 L0)

Summing the geometric series: Lk = (2^k − 1) log2 n + 2^k.
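
A quick sanity check of the closed form against the recursion (a sketch of mine; n is the number of attributes):

import math

def L_recursive(k, n):
    """Lk via the recursion L0 = 1, Lk = log2(n) + 2 * L(k-1)."""
    return 1 if k == 0 else math.log2(n) + 2 * L_recursive(k - 1, n)

def L_closed(k, n):
    """Closed form Lk = (2^k - 1) log2(n) + 2^k."""
    return (2**k - 1) * math.log2(n) + 2**k

for k in range(5):
    assert abs(L_recursive(k, 4) - L_closed(k, 4)) < 1e-9
print("closed form matches the recursion")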

SLIDE 54

PAC bound for decision trees of depth k

log2 |H| = Lk = (2^k − 1) log2 n + 2^k

Bad!!! Plugging this into the PAC bound, the number of training points m needed is exponential in the depth k!

By contrast, the number of leaves is never more than the number of data points, so let us regularize with the number of leaves instead of the depth!

SLIDE 55

Number of decision trees with k leaves

Let Hk = the number of decision trees with k leaves, with H1 = 2 (the “Yes” tree or the “No” tree).

Hk = (# choices of root attribute) × [ (# left subtrees with 1 leaf) × (# right subtrees with k−1 leaves)
     + (# left subtrees with 2 leaves) × (# right subtrees with k−2 leaves) + …
     + (# left subtrees with k−1 leaves) × (# right subtrees with 1 leaf) ]

This gives Hk = n^(k−1) · Ck−1, where Ck−1 is a Catalan number.

Loose bound (using Stirling’s approximation): log2 Hk is linear in k.

SLIDE 56

Number of decision trees

  • With k leaves: log2 Hk is linear in k
    ⇒ the number of points m needed is linear in the number of leaves k.

  • With depth k: log2 Hk = (2^k − 1) log2 n + 2^k, exponential in k
    ⇒ the number of points m needed is exponential in the depth k.

(n is the number of features)

SLIDE 57

PAC bound for decision trees with k leaves – Bias-Variance revisited

With prob ≥ 1 − δ,

errtrue(h) ≤ errtrain(h) + sqrt( (ln |Hk| + ln(1/δ)) / (2m) )

m: number of training points, k: number of leaves. With k ≈ m the second (variance) term is large (roughly > ½); with k < m it is small (roughly < ½).

SLIDE 58

What did we learn from decision trees?

  • The Bias-Variance tradeoff, formalized:
      • Complexity k ≈ m – no bias, lots of variance
      • k < m – some bias, less variance

SLIDE 59

Post-Pruning (Bottom-Up pruning)

SLIDE 60

Chi-Squared independence test

OBSERVED DATA                    Voting Preferences
               Republican   Democrat   Independent   Row total
Male           200          150        50            400
Female         250          300        50            600
Column total   450          450        100           1000

H0: Gender and voting preferences are independent.
Ha: Gender and voting preferences are not independent.

Expected numbers under H0 (independence): Er,c = (nr × nc) / n

SLIDE 61

Chi-Squared independence test

[Observed data table as on Slide 60.]

Expected numbers under H0 (independence): Er,c = (nr × nc) / n

E1,1 = (400 × 450) / 1000 = 180
E1,2 = (400 × 450) / 1000 = 180
E1,3 = (400 × 100) / 1000 = 40
E2,1 = (600 × 450) / 1000 = 270
E2,2 = (600 × 450) / 1000 = 270
E2,3 = (600 × 100) / 1000 = 60

Χ² = Σ [ (Or,c − Er,c)² / Er,c ]
   = (200 − 180)²/180 + (150 − 180)²/180 + (50 − 40)²/40
   + (250 − 270)²/270 + (300 − 270)²/270 + (50 − 60)²/60 = 16.2
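
This matches what scipy computes directly from the observed table (a sketch; requires scipy):

from scipy.stats import chi2_contingency

observed = [[200, 150, 50],    # Male
            [250, 300, 50]]    # Female
chi2, p, dof, expected = chi2_contingency(observed)
print(round(chi2, 1), dof, round(p, 4))  # 16.2, dof = 2, p ~ 0.0003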

SLIDE 62

Chi-Squared independence test

Degrees of freedom: DF = (r − 1) × (c − 1) = (2 − 1) × (3 − 1) = 2, where r = # rows and c = # columns.

P(Χ² > 16.2) = 0.0003 < 0.05 (p-value)
⇒ we reject the null hypothesis.

The evidence shows that there is a relationship between gender and voting preference.

SLIDE 63

Chi-Square Pruning

1. Build a complete tree.
2. Consider each node X whose children are leaves, and perform a chi-square independence test.

Node X: s = p + n instances enter it (p positives, n negatives). The split sends
  • sf = pf + nf instances down the false branch (pf positives, nf negatives)
  • st = pt + nt instances down the true branch (pt positives, nt negatives)

Expected numbers under independence:
  false branch: sf · p/s positives, sf · n/s negatives
  true branch:  st · p/s positives, st · n/s negatives

If after splitting the expected numbers are the same as the measured ones, then there is no point in splitting the node. Delete the leaves!

SLIDE 64

Training data:

X1   X2   Y   Count
T    T    T   2
T    F    T   2
F    T    F   5
F    F    T   1

Tree:

X1?
├─ T → Y = T
└─ F → X2?  (s = 6, p = 1, n = 5)
        ├─ F → Y = T  (sf = 1, pf = 1, nf = 0)
        └─ T → Y = F  (st = 5, pt = 0, nt = 5)

Variable assignment   Real counts of Y = T   Expected counts of Y = T
X2 = F                1                      1/6  (sf · p/s)
X2 = T                0                      5/6  (st · p/s)

Variable assignment   Real counts of Y = F   Expected counts of Y = F
X2 = F                0                      5/6  (sf · n/s)
X2 = T                5                      25/6 (st · n/s)

SLIDE 65

If label Y and feature X2 were independent, then the expected counts should be close to the real counts (tables above).

Degrees of freedom: DF = (# Y labels − 1) × (# X2 labels − 1) = (2 − 1) × (2 − 1) = 1

Z = Σ [ (Or,c − Er,c)² / Er,c ]
  = (1 − 1/6)²/(1/6) + (0 − 5/6)²/(5/6) + (0 − 5/6)²/(5/6) + (5 − 25/6)²/(25/6)
  = 25/6 + 5/6 + 5/6 + 1/6 = 6
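
The same statistic via scipy's contingency test on the real counts (a sketch; correction=False disables Yates' continuity correction, which would otherwise change the value for this 2×2 table):

from scipy.stats import chi2_contingency

observed = [[1, 0],   # X2 = F: (Y = T, Y = F)
            [0, 5]]   # X2 = T: (Y = T, Y = F)
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, dof)          # 6.0, dof = 1
print(expected.tolist())  # [[1/6, 5/6], [5/6, 25/6]]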

SLIDE 66

Chi-Squared independence test

P(Z > c) is the probability of seeing this large a deviation by chance under the H0 independence assumption:
P(Z > 3.8415) = 0.05,  P(Z ≤ 3.8415) = 0.95

The smaller Z is, the more likely the feature is independent of the label (there is no evidence of their dependence).

In our case Z = 6 > 3.8415
  ⇒ we reject the independence hypothesis
  ⇒ and keep the node X2.

SLIDE 67

What you should know

  • Decision trees are one of the most popular data mining tools
      • Simplicity of design
      • Interpretability
      • Ease of implementation
      • Good performance in practice (for small dimensions)
  • Information gain to select attributes (ID3, C4.5, …)
  • Can be used for classification, regression, and density estimation too
  • Decision trees will overfit!!!
      • Must use tricks to find “simple trees”, e.g.,
          • Pre-pruning: fixed depth / fixed number of leaves
          • Post-pruning: chi-square test of independence
          • Complexity-penalized model selection

SLIDE 68

Thanks for the Attention! 