Introduction to Machine Learning CMU-10701 23. Decision Trees - PowerPoint PPT Presentation


1. Introduction to Machine Learning CMU-10701 23. Decision Trees Barnabás Póczos

2. Contents • Decision Trees: Definition + Motivation • Algorithm for Learning Decision Trees • Entropy, Mutual Information, Information Gain • Generalizations • Regression Trees • Overfitting • Pruning • Regularization Many of these slides are taken from • Aarti Singh • Eric Xing • Carlos Guestrin • Russ Greiner • Andrew Moore

3. Decision Trees

4. Decision Tree: Motivation Learn decision rules from a dataset: Do we want to play tennis? • 4 discrete-valued attributes (Outlook, Temperature, Humidity, Wind) • Play tennis?: “Yes/No” classification problem

5. Decision Tree: Motivation • We want to learn a “good” decision tree from the data. • For example, this tree:

6. Function Approximation Formal Problem Setting: • Set of possible instances X (set of all possible feature vectors) • Unknown target function f : X → Y • Set of function hypotheses H = { h | h : X → Y } (H = possible decision trees) Input: • Training examples { ⟨x^(i), y^(i)⟩ } of the unknown target function f Output: • Hypothesis h ∈ H that best approximates the target function f In decision tree learning, we are doing function approximation, where the set of hypotheses H = set of decision trees

7. Decision Tree: The Hypothesis Space • Each internal node is labeled with some feature x_j • Arcs (from x_j) are labeled with the outcomes of the test on x_j • Leaf nodes specify the class h(x) • One instance: Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong is classified as “No” (Temperature, Wind: irrelevant) • Easy to use in classification • Interpretable rules

8. Generalizations • Features can be continuous • Output can be continuous too (regression trees) • Instead of testing a single feature at each node, we can also test sets of features Later we will discuss these in more detail.

9. Continuous Features If a feature is continuous: internal nodes may test its value against a threshold
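Such a node can be sketched in a few lines (an illustrative sketch, not part of the original slides; the 80K income threshold is borrowed from the tax-fraud example that follows):

```python
def continuous_node(value, threshold):
    """Internal node for a continuous feature: route the instance
    to the left child if value < threshold, else to the right child."""
    return "left" if value < threshold else "right"

# e.g. test Taxable Income against an 80K threshold
print(continuous_node(60_000, 80_000))  # left
print(continuous_node(95_000, 80_000))  # right
```

In a full tree each child would itself be another node or a leaf; this only shows the threshold test itself.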

10. Example: Mixed Discrete and Continuous Features Tax Fraud Detection: Goal is to predict who is cheating on tax using the ‘refund’, ‘marital status’, and ‘income’ features

Refund | Marital status | Taxable income | Cheat
yes    | Married        | 50K            | no
no     | Married        | 90K            | no
no     | Single         | 60K            | no
no     | Divorced       | 100K           | yes
yes    | Married        | 110K           | no

Build a tree that matches the data

11. Decision Tree for Tax Fraud Detection Data [Tree: root tests Refund (Yes → NO; No → MarSt); MarSt (Married → NO; Single, Divorced → TaxInc); TaxInc (< 80K → NO; > 80K → YES)] • Each internal node: test one feature X_i • Continuous features: test the value against a threshold • Each branch from a node: selects one value (or set of values) for X_i • Each leaf node: predict Y

12. Given a decision tree, how do we assign a label to a test point?

13.–17. Decision Tree for Tax Fraud Detection Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ? [Animation over the same tree: the query is classified by walking from the root — Refund = No → MarSt; MarSt = Married → leaf NO]

18. Decision Tree for Tax Fraud Detection Assign Cheat to “No”

19. What do decision trees do in the feature space?

20. Decision Tree Decision Boundaries Decision trees divide the feature space into axis-parallel rectangles, labeling each rectangle with one class (two features only: x_1 and x_2)
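To make this concrete, here is a hypothetical depth-2 tree over two continuous features; each leaf corresponds to one axis-parallel rectangle of the plane (the thresholds and labels are invented for illustration):

```python
def tree_predict(x1, x2):
    """Each root-to-leaf path intersects one axis-parallel rectangle."""
    if x1 < 0.5:
        return "+"        # rectangle: x1 < 0.5 (all x2)
    elif x2 < 0.3:
        return "-"        # rectangle: x1 >= 0.5, x2 < 0.3
    else:
        return "+"        # rectangle: x1 >= 0.5, x2 >= 0.3

print(tree_predict(0.2, 0.9))  # +
print(tree_predict(0.8, 0.1))  # -
```

Every split is a line parallel to one axis, which is why tree decision boundaries are unions of such rectangles.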

21. Some functions cannot be represented with binary splits If we want to learn such a function too, • we need more complex functions in the nodes than binary splits, or • we need to “break” the function into smaller parts that can each be represented with binary splits [Figure: a labeled partition of the plane with + and − regions that requires several axis-parallel rectangles]

22. How do we learn a decision tree from training data?

23. What Boolean functions can be represented with decision trees? How would you represent Y = X_2 and X_5? Y = X_2 or X_5? How would you represent X_2X_5 ∨ X_3X_4(¬X_1)?

24. Decision trees can represent any boolean/discrete functions n boolean features (x_1, …, x_n) ⇒ 2^n possible different instances ⇒ 2^(2^n) possible different functions if the class label Y is boolean too. [Figure: a depth-2 tree testing X_1 at the root and X_2 at both children, with leaves +, −, −, +]
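These counts are easy to sanity-check for small n (an illustrative sketch): each of the 2^n instances can independently receive either label, giving 2^(2^n) boolean functions.

```python
from itertools import product

def count_instances_and_functions(n):
    """Count instances and boolean functions over n boolean features."""
    instances = list(product([0, 1], repeat=n))  # 2**n feature vectors
    # A boolean function assigns a label (0/1) to each instance.
    num_functions = 2 ** len(instances)          # 2**(2**n)
    return len(instances), num_functions

print(count_instances_and_functions(2))  # (4, 16)
print(count_instances_and_functions(3))  # (8, 256)
```

The doubly-exponential growth is one reason searching over all trees is hopeless and greedy learning is used instead.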

25. Option 1: Just store the training data • Trees can represent any boolean (and discrete) functions, e.g. (A ∨ B) ∧ (C ∨ ¬D ∨ E) • Just produce a “path” for each example (store the training data) • …may require exponentially many nodes… • Any generalization capability? (Other instances that are not in the training data?) • NP-hard to find the smallest tree that fits the data Intuition: Want SMALL trees …to capture “regularities” in the data… …easier to understand, faster to execute

26. Expressiveness of General Decision Trees Example: Learn A xor B (Boolean features and labels) • There is a decision tree which perfectly classifies a training set, with one path to a leaf for each example.

27. Example of Overfitting • 1000 patients • 25% have butterfly-itis (250) • 75% are healthy (750) • Use 10 silly features, not related to the class label • ½ of patients have F1 = 1 (“odd birthday”) • ½ of patients have F2 = 1 (“even SSN”) • etc.

28. Typical Results Standard decision tree learner — error rate: • Train data: 0% • New data: 37% Optimal decision tree — error rate: • Train data: 25% • New data: 25% Regularization is important…

29. How to learn a decision tree • Top-down induction [many algorithms: ID3, C4.5, CART, …] (grow the tree from the root to the leaves). We will focus on the ID3 algorithm. Repeat: 1. Select the “best feature” (X_1, X_2, or X_3) to split on 2. For each value that feature takes, sort the training examples to the leaf nodes 3. Stop if a leaf contains only training examples with the same label, or if all features are used up 4. Label each leaf with the majority vote of the labels of its training examples
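The loop above can be sketched as a recursive procedure. This is an illustrative ID3-style sketch, not the exact classroom algorithm: examples are hypothetical (feature-dict, label) pairs, and `best_feature` is a caller-supplied function that would pick the split by information gain (defined on the following slides).

```python
from collections import Counter

def id3(examples, features, best_feature):
    """examples: list of (feature_dict, label) pairs.
    features: set of feature names still available for splitting.
    best_feature(examples, features): chooses the feature to split on."""
    labels = [y for _, y in examples]
    # Stop: all labels agree, or no features left -> majority-vote leaf.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    f = best_feature(examples, features)          # 1. select "best" feature
    children = {}
    for v in {x[f] for x, _ in examples}:         # 2. sort examples to children
        subset = [(x, y) for x, y in examples if x[f] == v]
        children[v] = id3(subset, features - {f}, best_feature)
    return (f, children)

# Tiny usage example: learn AND of two boolean features,
# with a placeholder "best feature" that just picks alphabetically.
data = [({'a': 0, 'b': 0}, 0), ({'a': 0, 'b': 1}, 0),
        ({'a': 1, 'b': 0}, 0), ({'a': 1, 'b': 1}, 1)]
pick_first = lambda ex, fs: sorted(fs)[0]
print(id3(data, {'a', 'b'}, pick_first))
```

A leaf is represented by a bare label and an internal node by a `(feature, children)` pair; real implementations add the pruning and regularization discussed later.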

30. First Split?

31. Which feature is best to split? Good split: we are less uncertain about the classification after the split. 80 training people (50 Genuine, 30 Cheats) Refund split: • Yes → 40 Genuine, 0 Cheats (absolutely sure) • No → 10 Genuine, 30 Cheats (kind of sure) Marital Status split: • Married → 30 Genuine, 10 Cheats (kind of sure) • Single, Divorced → 20 Genuine, 20 Cheats (absolutely unsure) Refund gives more information about the labels than Marital Status

32. Which feature is best to split? Pick the attribute/feature which yields the maximum information gain: IG(X_i) = H(Y) − H(Y | X_i), where H(Y) is the entropy of Y and H(Y | X_i) is the conditional entropy of Y given X_i. The feature which yields the maximum reduction in entropy provides the maximum information about Y.

33. Entropy Entropy of a random variable Y: H(Y) = − Σ_y P(Y = y) log₂ P(Y = y) Larger uncertainty, larger entropy! [Figure: H(Y) for Y ~ Bernoulli(p) as a function of p — maximum entropy at the uniform p = 1/2, zero entropy when Y is deterministic (p = 0 or 1)] Information Theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code)
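The definition above translates directly into code (a minimal sketch of the same formula, in bits):

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum p*log2(p), in bits.
    Zero-probability outcomes contribute nothing and are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0  -- uniform Bernoulli: maximum entropy
print(entropy([0.9, 0.1]))  # ~0.47 -- less uncertainty, smaller entropy
```

A fair coin needs one bit per outcome on average; a heavily biased coin needs far less, matching the Bernoulli curve on the slide.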

34. Information Gain Advantage of an attribute = decrease in uncertainty • H(Y): entropy of Y before the split • H(Y | X_i): entropy of Y after splitting based on X_i — we want this to be small • Weight each branch by the probability of following it: H(Y | X_i) = Σ_v P(X_i = v) H(Y | X_i = v) Information gain is the difference: IG(X_i) = H(Y) − H(Y | X_i) Max information gain = min conditional entropy
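As a sketch, the formulas above can be applied to the tax-fraud counts from the earlier split-comparison slide (80 people: 50 Genuine, 30 Cheats; the helper names are mine, not from the slides):

```python
import math

def entropy_from_counts(counts):
    """Entropy (bits) of the label distribution given raw class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, branches):
    """IG = H(parent) - sum over branches of P(branch) * H(branch)."""
    total = sum(parent_counts)
    cond = sum(sum(b) / total * entropy_from_counts(b) for b in branches)
    return entropy_from_counts(parent_counts) - cond

parent = [50, 30]             # 50 Genuine, 30 Cheats
refund = [[40, 0], [10, 30]]  # Refund = Yes branch, Refund = No branch
marst = [[30, 10], [20, 20]]  # Married branch, Single/Divorced branch

print(information_gain(parent, refund))  # ~0.55
print(information_gain(parent, marst))   # ~0.05
```

Refund wins by a wide margin, matching the intuitive "Refund gives more information about the labels" argument; the pure [40, 0] branch contributes zero conditional entropy.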

35. First Split? Which feature splits the data best into + and − instances?

36. First Split? The Outlook feature looks great, because the Overcast branch is perfectly separated.

37. Statistics If we split on x_i, we produce 2 children: (1) #(x_i = t) follow the TRUE branch; data: [#(x_i = t, Y = +), #(x_i = t, Y = −)] (2) #(x_i = f) follow the FALSE branch; data: [#(x_i = f, Y = +), #(x_i = f, Y = −)] Calculate the mutual information between x_i and Y!
