  1. Decision Trees – Aarti Singh, Machine Learning 10-701/15-781, Oct 6, 2010

  2. Learning a good prediction rule
  • Learn a mapping
  • Best prediction rule
  • Hypothesis space/Function class
    – Parametric classes (Gaussian, binomial, etc.)
    – Conditionally independent class densities (Naïve Bayes)
    – Linear decision boundary (Logistic regression)
    – Nonparametric classes (Histograms, nearest neighbor, kernel estimators, Decision Trees – Today)
  • Given training data, find a hypothesis/function in the hypothesis space that is close to the best prediction rule.

  3. First …
  • What does a decision tree represent?
  • Given a decision tree, how do we assign a label to a test point?

  4. Decision Tree for Tax Fraud Detection
  Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
  [Tree: Refund? Yes → NO; No → MarSt: Married → NO, Single/Divorced → TaxInc: < 80K → NO, > 80K → YES]
  • Each internal node: tests one feature X_i
  • Each branch from a node: selects one value for X_i
  • Each leaf node: predicts Y

  5.–10. Decision Tree for Tax Fraud Detection (walking the query down the tree)
  Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
  Refund = No → follow the No branch to MarSt; Marital Status = Married → follow the Married branch, which ends at a leaf → Assign Cheat to “No”
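To make the traversal on slides 4–10 concrete, here is a minimal Python sketch of how a learned tree assigns a label to a query. The nested-dict encoding of the tree, the `predict` helper, and the coarse "<80K"/">80K" encoding of Taxable Income are illustrative assumptions, not anything prescribed by the slides.

```python
# Minimal sketch: walking the query from slides 5-10 down the tax-fraud tree.
# The nested-dict tree encoding is an assumption made for illustration.
tree = {
    "feature": "Refund",
    "branches": {
        "Yes": "NO",
        "No": {
            "feature": "MarSt",
            "branches": {
                "Married": "NO",
                "Single": {"feature": "TaxInc",
                           "branches": {"<80K": "NO", ">80K": "YES"}},
                "Divorced": {"feature": "TaxInc",
                             "branches": {"<80K": "NO", ">80K": "YES"}},
            },
        },
    },
}

def predict(node, query):
    """Follow one branch per internal node until a leaf label is reached."""
    while isinstance(node, dict):          # internal node: test one feature
        node = node["branches"][query[node["feature"]]]
    return node                            # leaf node: the predicted label

# The query from the slides; the income test is never reached for it.
query = {"Refund": "No", "MarSt": "Married", "TaxInc": "<80K"}
print(predict(tree, query))                # -> "NO", i.e. Cheat = No (slide 10)
```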

  11. Decision Tree more generally…
  • Features can be discrete, continuous or categorical
  • Each internal node: tests some set of features {X_i}
  • Each branch from a node: selects a set of values for {X_i}
  • Each leaf node: predicts Y
  [Figure: axis-aligned partition of the feature space, with a 0/1 label in each cell]

  12. So far…
  • What does a decision tree represent?
  • Given a decision tree, how do we assign a label to a test point?
  Now …
  • How do we learn a decision tree from training data?
  • What is the decision on each leaf?


  14. How to learn a decision tree
  • Top-down induction [ID3, C4.5, CART, …]
  [Tree: Refund? Yes → NO; No → MarSt: Married → NO, Single/Divorced → TaxInc: < 80K → NO, > 80K → YES]

  15. Which feature is best to split?
  [Table: 8 training examples with binary features X1, X2 and label Y]
  • Split on X1: X1 = T → Y: 4 Ts, 0 Fs (absolutely sure); X1 = F → Y: 1 T, 3 Fs (kind of sure)
  • Split on X2: X2 = T → Y: 3 Ts, 1 F (kind of sure); X2 = F → Y: 2 Ts, 2 Fs (absolutely unsure)
  • Good split if we are more certain about classification after the split – a uniform distribution of labels is bad

  16. Which feature is best to split?
  Pick the attribute/feature which yields maximum information gain: IG(X_i) = H(Y) – H(Y|X_i), where
  H(Y) – entropy of Y
  H(Y|X_i) – conditional entropy of Y given X_i

  17. Entropy
  • Entropy of a random variable Y: H(Y) = – Σ_y P(Y = y) log2 P(Y = y)
  • More uncertainty, more entropy!
  [Figure: entropy H(Y) of Y ~ Bernoulli(p) as a function of p; uniform (p = 1/2) gives maximum entropy, deterministic (p = 0 or 1) gives zero entropy]
  • Information Theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code)
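As a quick check of the formula above, here is a small Python sketch that computes H(Y) from empirical label counts (the helper name and the use of log base 2, so that entropy is measured in bits, are my choices):

```python
import math

def entropy(counts):
    """H(Y) = -sum_y p(y) * log2 p(y), computed from a list of label counts (bits)."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]   # 0 * log 0 is treated as 0
    return -sum(p * math.log2(p) for p in probs)

print(entropy([1, 1]))   # uniform Bernoulli(1/2): maximum entropy, 1.0 bit
print(entropy([4, 0]))   # deterministic label: zero entropy, 0.0 bits
print(entropy([5, 3]))   # 5 Ts and 3 Fs (the split example): ~0.954 bits
```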

  18. Andrew Moore’s Entropy in a Nutshell
  • High Entropy: the values (locations of soup) are unpredictable… almost uniformly sampled throughout our dining room
  • Low Entropy: the values (locations of soup) are sampled almost entirely from within the soup bowl

  19. Information Gain
  • Advantage of attribute = decrease in uncertainty
    – Entropy of Y before the split: H(Y)
    – Entropy of Y after splitting based on X_i: H(Y|X_i) = Σ_x P(X_i = x) H(Y | X_i = x), weighting each branch by the probability of following it
  • Information gain is the difference: IG(X_i) = H(Y) – H(Y|X_i)
  • Max information gain = min conditional entropy

  20. Information Gain
  [Same 8-example table as slide 15]
  • Split on X1: X1 = T → Y: 4 Ts, 0 Fs; X1 = F → Y: 1 T, 3 Fs
  • Split on X2: X2 = T → Y: 3 Ts, 1 F; X2 = F → Y: 2 Ts, 2 Fs
  • Information gain of the split > 0
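The table on slides 15 and 20 is only partly legible in this transcript, so the rows below are a reconstruction that matches the per-branch counts shown (4/0 and 1/3 for X1; 3/1 and 2/2 for X2); treat the exact rows, and the helper functions, as assumptions. With that caveat, the information gain of each split can be computed directly:

```python
from collections import Counter
import math

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, feature_index):
    """IG(X_i) = H(Y) - sum_x P(X_i = x) * H(Y | X_i = x)."""
    labels = [row[-1] for row in rows]
    conditional = 0.0
    for value in set(row[feature_index] for row in rows):
        subset = [row[-1] for row in rows if row[feature_index] == value]
        conditional += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - conditional

# Reconstructed (X1, X2, Y) rows, consistent with the split counts on the slide.
data = [("T", "T", "T"), ("T", "F", "T"), ("T", "T", "T"), ("T", "F", "T"),
        ("F", "T", "T"), ("F", "T", "F"), ("F", "F", "F"), ("F", "F", "F")]

print(information_gain(data, 0))   # split on X1: ~0.55 bits
print(information_gain(data, 1))   # split on X2: ~0.05 bits, so X1 is the better split
```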

  21. Which feature is best to split?
  Pick the attribute/feature which yields maximum information gain: IG(X_i) = H(Y) – H(Y|X_i), where
  H(Y) – entropy of Y
  H(Y|X_i) – conditional entropy of Y given X_i
  The feature which yields the maximum reduction in entropy provides the maximum information about Y
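Putting the pieces together, here is a compact sketch of greedy top-down induction in the ID3 style named on slide 14. The slides do not give code, so the function, its stopping rules (pure labels, exhausted features, or a fixed depth as a simple pre-pruning rule), and its reuse of the `information_gain` helper and `data` table from the previous sketch are all assumptions.

```python
from collections import Counter

def build_tree(rows, features, depth=0, max_depth=3):
    """Greedy top-down induction: split on the feature with maximum information
    gain, recurse on each branch, and return a majority-vote label at the leaves.
    Relies on information_gain() and data from the previous sketch."""
    labels = [row[-1] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features or depth == max_depth:
        return majority                                   # leaf: majority vote
    best = max(features, key=lambda i: information_gain(rows, i))
    node = {"feature": best, "branches": {}}
    for value in set(row[best] for row in rows):
        subset = [row for row in rows if row[best] == value]
        remaining = [i for i in features if i != best]
        node["branches"][value] = build_tree(subset, remaining, depth + 1, max_depth)
    return node

print(build_tree(data, features=[0, 1]))
# Splits on X1 first (the larger information gain), as argued on slides 15-21.
```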

  22. Expressiveness of Decision Trees
  • Decision trees can express any function of the input features.
  • E.g., for Boolean functions, truth table row → path to leaf (see the sketch below)
  • There is a decision tree which perfectly classifies a training set, with one path to a leaf for each example
  • But it won't generalize well to new examples – prefer to find more compact decision trees
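As a tiny illustration of the truth-table-row-to-path idea, here is a hand-built tree for one Boolean function (XOR is my choice of example; the dict encoding matches the earlier prediction sketch):

```python
# Each truth-table row corresponds to one root-to-leaf path.
xor_tree = {"feature": "X1", "branches": {
    0: {"feature": "X2", "branches": {0: 0, 1: 1}},   # rows (0,0)->0 and (0,1)->1
    1: {"feature": "X2", "branches": {0: 1, 1: 0}},   # rows (1,0)->1 and (1,1)->0
}}

def evaluate(node, x):
    """Walk from the root to a leaf by reading one feature per internal node."""
    while isinstance(node, dict):
        node = node["branches"][x[node["feature"]]]
    return node

for x1 in (0, 1):
    for x2 in (0, 1):
        assert evaluate(xor_tree, {"X1": x1, "X2": x2}) == (x1 ^ x2)
```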

  23. Decision Trees – Overfitting
  • One training example per leaf – overfits; need a compact/pruned decision tree

  24. Bias-Variance Tradeoff
  [Figure: ideal classifier vs. the average classifier vs. classifiers based on different training data]
  • Coarse partition: bias large, variance small
  • Fine partition: bias small, variance large

  25. When to Stop?
  • Many strategies for picking simpler trees:
    – Pre-pruning
      • Fixed depth
      • Fixed number of leaves
    – Post-pruning
      • Chi-square test
        – Convert the decision tree to a set of rules
        – Eliminate variable values in rules which are independent of the label (using a chi-square test for independence)
        – Simplify the rule set by eliminating unnecessary rules
    – Information Criteria: MDL (Minimum Description Length)
  [Figure: a pruned version of the tax-fraud tree]
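The slides only name the chi-square test for post-pruning; one concrete reading is to keep a split only when the label counts differ significantly across its branches, and otherwise collapse it to a leaf. The sketch below uses `scipy.stats.chi2_contingency` and a 0.05 significance level, both of which are assumptions on my part.

```python
from scipy.stats import chi2_contingency

def split_is_significant(branch_label_counts, alpha=0.05):
    """Chi-square test of independence between branch membership and label.
    branch_label_counts: one row of label counts per branch, e.g. [[4, 0], [1, 3]].
    Returns True if the split looks significant; otherwise it is a pruning candidate."""
    _, p_value, _, _ = chi2_contingency(branch_label_counts)
    return p_value < alpha

# The X1 split from the earlier example: per-branch [count of T, count of F].
print(split_is_significant([[4, 0], [1, 3]]))     # with only 8 examples: not significant
print(split_is_significant([[40, 0], [10, 30]]))  # same proportions, 10x the data: significant
```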

  26. Information Criteria
  • Penalize complex models by introducing a cost: overall cost = log likelihood term (fit to data) + cost term (complexity), with regression and classification versions
  • Penalize trees with more leaves

  27. Information Criteria – MDL
  • MDL (Minimum Description Length): penalize complex models based on their information content – the # of bits needed to describe f (the description length)
  • Example: Binary decision trees
    – k leaves => 2k – 1 nodes
    – 2k – 1 bits to encode the tree structure + k bits to encode the label of each leaf (0/1)
    – 5 leaves => 9 bits to encode the structure (plus 5 bits for the leaf labels)
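The bit counts on this slide follow directly from the stated encoding and can be checked in a couple of lines (the helper name is mine):

```python
def description_length_bits(num_leaves):
    """Binary tree with k leaves: 2k - 1 nodes, so 2k - 1 bits for the structure,
    plus k bits for the 0/1 label at each leaf, as described on slide 27."""
    structure_bits = 2 * num_leaves - 1
    label_bits = num_leaves
    return structure_bits + label_bits

print(2 * 5 - 1)                     # 9 bits of structure for 5 leaves, as on the slide
print(description_length_bits(5))    # 14 bits in total once the leaf labels are included
```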

  28. So far…
  • What does a decision tree represent?
  • Given a decision tree, how do we assign a label to a test point?
  • How do we learn a decision tree from training data?
  Now …
  • What is the decision on each leaf?

  29. How to assign a label to each leaf
  • Classification – Majority vote
  • Regression – ?

  30. How to assign a label to each leaf
  • Classification – Majority vote
  • Regression – Constant / Linear / Polynomial fit

  31. Regression trees
  [Figure: regression tree splitting on Num Children? ≥ 2 vs. < 2]
  • Average (fit a constant) using the training data at the leaves
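A minimal sketch of the leaf rules on slides 29–31: majority vote for classification and a constant (average) fit for regression over the training examples that reach the leaf. The helper names are mine; linear or polynomial fits per leaf, mentioned on slide 30, would replace the simple average.

```python
from collections import Counter

def classification_leaf(labels):
    """Majority vote over the training labels that reach this leaf."""
    return Counter(labels).most_common(1)[0][0]

def regression_leaf(targets):
    """Constant fit: predict the average of the training targets at this leaf."""
    return sum(targets) / len(targets)

print(classification_leaf(["No", "No", "Yes", "No"]))   # -> "No"
print(regression_leaf([2.0, 3.0, 4.0]))                 # -> 3.0
```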

  32. Connection between nearest neighbor/histogram classifiers and decision trees

  33. Local prediction
  • Histogram, kernel density estimation, k-nearest neighbor classifier, kernel regression
  [Figure: training data D on a uniform grid partition – Histogram Classifier]

  34. Local Adaptive prediction
  • Let the neighborhood size adapt to the data – small neighborhoods near the decision boundary (small bias), large neighborhoods elsewhere (small variance)
  [Figure: training data D with an adaptive partition around a query point x – Decision Tree Classifier, majority vote at each leaf]

  35. Histogram Classifier vs Decision Trees
  [Figure: ideal classifier vs. histogram classifier vs. decision tree, 256 cells in each partition]

  36. Application to Image Coding
  [Figure: partitions with 1024 cells in each partition]

  37. Application to Image Coding
  [Figure: JPEG at 0.125 bpp (non-adaptive partitioning) vs. JPEG 2000 at 0.125 bpp (adaptive partitioning)]

  38. What you should know
  • Decision trees are one of the most popular data mining tools
    – Simplicity of design
    – Interpretability
    – Ease of implementation
    – Good performance in practice (for small dimensions)
  • Information gain to select attributes (ID3, C4.5, …)
  • Can be used for classification, regression and density estimation too
  • Decision trees will overfit!!!
    – Must use tricks to find “simple trees”, e.g.,
      • Pre-Pruning: Fixed depth / Fixed number of leaves
      • Post-Pruning: Chi-square test of independence
      • Complexity Penalized / MDL model selection
