15-388/688 - Practical Data Science: Decision trees and interpretable models


  1. 15-388/688 - Practical Data Science: Decision trees and interpretable models. J. Zico Kolter, Carnegie Mellon University, Spring 2018

  2. Outline: Decision trees; Training (classification) decision trees; Interpreting predictions; Boosting; Examples

  4. Overview: Decision trees and boosted decision trees are some of the most ubiquitous algorithms in data science. Boosted decision trees typically perform very well without much tuning (the majority of Kaggle contests, for instance, are won with boosting methods). Decision trees, while not as powerful from a pure ML standpoint, are still one of the canonical examples of an "understandable" ML algorithm.

  5. Decision trees: Decision trees were one of the first machine learning algorithms. Basic idea: make classification/regression predictions by tracing through rules in a tree, with a constant prediction at each leaf node. [Figure: example decision tree with internal splits x_1 ≥ 2, x_2 ≥ 2, x_2 ≥ 3, x_1 ≥ 3 and leaf predictions h_1 = 0.1, h_2 = 0.7, h_3 = 0.8, h_4 = 0.9, h_5 = 0.2]
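
A minimal Python sketch of the "trace through rules" idea (my own illustration, not the course's code): a node tests one feature against a threshold, and a leaf stores a constant prediction. The `Node`/`predict` names and the exact arrangement of the slide's splits are assumptions; later sketches below reuse these definitions.

```python
# Sketch of decision tree prediction: walk from the root, testing one feature
# against a threshold at each internal node, until reaching a leaf that holds
# a constant prediction.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    feature: Optional[int] = None   # index of feature to test (None at a leaf)
    threshold: float = 0.0          # go right if x[feature] >= threshold
    left: Optional["Node"] = None   # branch taken when the test is False
    right: Optional["Node"] = None  # branch taken when the test is True
    value: float = 0.0              # constant prediction stored at a leaf


def predict(node: Node, x) -> float:
    """Trace a single example x through the tree and return its leaf's value."""
    while node.feature is not None:
        node = node.right if x[node.feature] >= node.threshold else node.left
    return node.value


# One plausible reconstruction of the slide's tree (0-based feature indices).
tree = Node(feature=0, threshold=2.0,
            left=Node(feature=1, threshold=2.0,
                      left=Node(value=0.1),                 # h_1
                      right=Node(value=0.7)),               # h_2
            right=Node(feature=1, threshold=3.0,
                       left=Node(value=0.8),                # h_3
                       right=Node(feature=0, threshold=3.0,
                                  left=Node(value=0.9),     # h_4
                                  right=Node(value=0.2))))  # h_5

print(predict(tree, [1.0, 3.0]))   # 0.7: x_1 < 2 and x_2 >= 2, so leaf h_2
```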

  6. Partitioning the input space: You can think of the hypothesis function of decision trees as partitioning the input space with axis-aligned boundaries. In each partition, predict a constant value. [Figure: the same tree drawn next to its partition of the (x_1, x_2) plane into the regions h_1 through h_5]
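
Continuing the sketch above (it reuses `tree` and `predict` from the previous block), each input falls in exactly one axis-aligned cell of the partition:

```python
# Each point maps to exactly one axis-aligned region, so points on opposite
# sides of a split boundary can receive different constant predictions.
for x in ([1.0, 1.0], [1.0, 3.0], [2.5, 2.0], [3.5, 3.5]):
    print(x, "->", predict(tree, x))
# [1.0, 1.0] -> 0.1   (h_1)
# [1.0, 3.0] -> 0.7   (h_2)
# [2.5, 2.0] -> 0.8   (h_3)
# [3.5, 3.5] -> 0.2   (h_5)
```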

  10. Outline: Decision trees; Training (classification) decision trees; Interpreting predictions; Boosting; Examples

  11. Decision trees as ML algorithms: To specify decision trees from a machine learning standpoint, we need to specify: 1. What is the hypothesis function h_θ(x)? 2. What is the loss function ℓ(h_θ(x), y)? 3. How do we minimize the loss function, minimize_θ ∑_{i=1}^{m} ℓ(h_θ(x^(i)), y^(i))?

  12. Decision trees as ML algorithms: To specify decision trees from a machine learning standpoint, we need to specify: 1. What is the hypothesis function h_θ(x)? …a decision tree (θ is shorthand for all the parameters that define the tree: tree structure, values to split on, leaf predictions, etc.) 2. What is the loss function ℓ(h_θ(x), y)? 3. How do we minimize the loss function, minimize_θ ∑_{i=1}^{m} ℓ(h_θ(x^(i)), y^(i))?

  14. Loss functions in decision trees: Let's assume the output is binary for now (a classification task; we will deal with regression shortly), and assume y ∈ {0, 1}. The typical decision tree algorithm uses a probabilistic loss function that considers y to be a Bernoulli random variable with probability h_θ(x), i.e., p(y | h_θ(x)) = h_θ(x)^y (1 − h_θ(x))^(1−y). The loss function is just the negative log probability of the output (as in maximum likelihood estimation): ℓ(h_θ(x), y) = −log p(y | h_θ(x)) = −y log h_θ(x) − (1 − y) log(1 − h_θ(x)).
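
A quick numerical check of this loss (the `bernoulli_nll` name is mine): a confident correct prediction incurs a small loss, a confident wrong one a large loss.

```python
import math

def bernoulli_nll(h: float, y: int) -> float:
    """Negative log probability of binary label y under predicted probability h."""
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

print(bernoulli_nll(0.9, 1))   # ~0.105: confident and correct -> small loss
print(bernoulli_nll(0.9, 0))   # ~2.303: confident and wrong   -> large loss
```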

  16. Optimizing decision trees: Key challenge: unlike models we have considered previously, the discrete tree structure means there are no gradients. Additionally, even if we assume binary inputs, i.e., x ∈ {0,1}^n, there are 2^(2^n) possible decision trees: n = 7 already means 2^128 ≈ 3.4 × 10^38 possible trees. Instead, we're going to use greedy methods to incrementally build the tree (i.e., minimize the loss function) one node at a time.
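
A one-line check of the count quoted above (purely illustrative): with n binary features there are 2^n possible inputs, and a tree can realize any of the 2^(2^n) Boolean functions over them.

```python
n = 7
print(2 ** (2 ** n))             # 340282366920938463463374607431768211456
print(f"{2 ** (2 ** n):.1e}")    # 3.4e+38, matching the slide
```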

  17. Optimizing a single leaf: Consider a single leaf in a decision tree (it could be the root of the initial tree). Let 𝒳 denote the examples at this leaf (i.e., in this partition), where 𝒳^+ denotes the positive examples and 𝒳^− denotes the negative (zero) examples. What should we choose as the (constant) prediction h at this leaf? minimize_h (1/|𝒳|) ∑_{(x,y)∈𝒳} ℓ(h, y) = −(|𝒳^+|/|𝒳|) log h − (|𝒳^−|/|𝒳|) log(1 − h) ⟹ h = |𝒳^+|/|𝒳|, which achieves loss ℓ = −h log h − (1 − h) log(1 − h).
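
A small sketch of the two quantities above, with my own helper names (`leaf_prediction`, `leaf_loss`) and an arbitrary toy label set; the sketches for the next slides reuse these helpers.

```python
import math

def leaf_prediction(labels):
    """Best constant probability at a leaf: the fraction of positive examples."""
    return sum(labels) / len(labels)

def leaf_loss(labels):
    """Average negative log likelihood at the leaf; for the optimal constant h
    this equals the binary entropy -h log h - (1 - h) log(1 - h)."""
    h = leaf_prediction(labels)
    if h in (0.0, 1.0):              # pure leaf: treat 0 * log 0 as 0
        return 0.0
    return -h * math.log(h) - (1 - h) * math.log(1 - h)

labels = [1, 1, 1, 0]                # hypothetical examples at one leaf
print(leaf_prediction(labels))       # 0.75
print(leaf_loss(labels))             # ~0.562
```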

  18. Optimizing splits: Now suppose we want to split this leaf into two leaves, assuming for the time being that x ∈ {0,1}^n is binary. If we split on a given feature j, this will separate 𝒳 into two sets 𝒳_0 and 𝒳_1 (with 𝒳_0^{+/−} and 𝒳_1^{+/−} defined similarly to before), and we would choose the optimal predictions h_0 = |𝒳_0^+|/|𝒳_0|, h_1 = |𝒳_1^+|/|𝒳_1|. [Figure: 𝒳 split by x_j = 0 / x_j = 1 into 𝒳_0 and 𝒳_1, with h_0 = |𝒳_0^+|/|𝒳_0| and h_1 = |𝒳_1^+|/|𝒳_1|]
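
Continuing the sketch (it reuses `leaf_prediction` from the previous block), a hypothetical split of a tiny binary dataset on one feature:

```python
def split_on_feature(X, y, j):
    """Partition the labels according to the value of binary feature j."""
    y0 = [yi for xi, yi in zip(X, y) if xi[j] == 0]
    y1 = [yi for xi, yi in zip(X, y) if xi[j] == 1]
    return y0, y1

X = [[0, 1], [0, 0], [1, 1], [1, 0], [1, 1]]     # hypothetical binary features
y = [0, 0, 1, 0, 1]
y0, y1 = split_on_feature(X, y, j=0)
print(leaf_prediction(y0), leaf_prediction(y1))  # h_0 = 0.0, h_1 ~= 0.667
```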

  19. Loss of split: The new leaves will each now suffer loss ℓ_0 = −h_0 log h_0 − (1 − h_0) log(1 − h_0) and ℓ_1 = −h_1 log h_1 − (1 − h_1) log(1 − h_1). Thus, if we split the original leaf on feature j, we no longer suffer our original loss ℓ, but we do suffer losses ℓ_0 + ℓ_1; i.e., we have decreased the overall loss function by ℓ − ℓ_0 − ℓ_1 (this quantity is called the information gain). Greedy decision tree learning – repeat: • For all leaf nodes, evaluate the information gain (i.e., the decrease in loss) when splitting on each feature j • Split the node/feature that decreases the loss the most • (Run cross-validation to determine when to stop, or stop after N nodes)
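
A sketch of the greedy step described above, reusing `leaf_loss`, `split_on_feature`, and the toy `X`, `y` from the previous blocks. It follows the slide's unweighted gain ℓ − ℓ_0 − ℓ_1; many implementations additionally weight each child's loss by its share of the parent's examples.

```python
def information_gain(y_parent, y_left, y_right):
    """Decrease in loss from a split (unweighted form, as on the slide)."""
    return leaf_loss(y_parent) - leaf_loss(y_left) - leaf_loss(y_right)

def best_split(X, y):
    """Greedy step: try every binary feature, return the one with highest gain."""
    best_j, best_gain = None, 0.0
    for j in range(len(X[0])):
        y0, y1 = split_on_feature(X, y, j)
        if not y0 or not y1:         # split leaves one side empty; skip it
            continue
        gain = information_gain(y, y0, y1)
        if gain > best_gain:
            best_j, best_gain = j, gain
    return best_j, best_gain

print(best_split(X, y))              # (0, ~0.036) on the toy data above
```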

  20. Poll: Decision tree training. Which of the following are true about training decision trees? 1. Once a feature has been selected as a split, it will never be selected again as a split in the tree 2. If a feature has been selected as a split, it will never be selected as a split at the next level of the tree 3. Assuming no training points are identical, decision trees can always obtain zero error if they are trained deep enough 4. The loss will never increase after a split

  21. Continuous features: What if the x_j's are continuous? Solution: sort the examples by their x_j values and compute the information gain at each possible split point. [Figure: examples x_j^(i_1), …, x_j^(i_7) sorted along the x_j axis, with each candidate threshold between consecutive values partitioning them into 𝒳_0 and 𝒳_1]
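
A sketch of the sorted-split-point idea (helper names are mine; `information_gain` comes from the block above): only the midpoints between consecutive distinct values need to be checked.

```python
def candidate_thresholds(values):
    """Midpoints between consecutive sorted feature values."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

def best_threshold(xj, y):
    """Pick the threshold on one continuous feature with the highest gain."""
    best_t, best_gain = None, 0.0
    for t in candidate_thresholds(xj):
        y0 = [yi for v, yi in zip(xj, y) if v < t]
        y1 = [yi for v, yi in zip(xj, y) if v >= t]
        gain = information_gain(y, y0, y1)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

xj = [0.3, 1.2, 2.5, 2.9, 4.1]       # hypothetical continuous feature values
yc = [0,   0,   1,   1,   1]
print(best_threshold(xj, yc))        # (1.85, ~0.67): separates the classes perfectly
```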

  22. Regression trees: Regression trees are the same, except that the hypotheses h are real-valued instead of probabilities, and we use the squared loss ℓ(h, y) = (h − y)^2. This means that the prediction at a node is given by minimize_h (1/|𝒳|) ∑_{(x,y)∈𝒳} (h − y)^2 ⟹ h = (1/|𝒳|) ∑_{(x,y)∈𝒳} y (i.e., the mean), and the node suffers loss ℓ = (1/|𝒳|) ∑_{(x,y)∈𝒳} (y − h)^2 (i.e., the variance).
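
The regression case in the same style (the `regression_leaf` helper is hypothetical): the best constant prediction is the mean of the targets, and the loss it suffers is their variance.

```python
def regression_leaf(y):
    """Optimal constant under squared loss (the mean) and its average loss
    at the leaf (the variance of the targets)."""
    h = sum(y) / len(y)
    loss = sum((yi - h) ** 2 for yi in y) / len(y)
    return h, loss

print(regression_leaf([1.0, 2.0, 2.0, 3.0]))   # (2.0, 0.5)
```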

  23. Outline: Decision trees; Training (classification) decision trees; Interpreting predictions; Boosting; Examples

  24. Interpretable models: Decision trees are the canonical example of an interpretable model. Why did we predict +1? [Figure: the earlier tree with leaf values h_1 = +1, h_2 = −1, h_3 = −1, h_4 = −1, h_5 = +1, with the path to the predicted leaf highlighted] …because x_1 ≥ 2, x_2 ≥ 3, x_1 ≥ 3 (or, more compactly, …because x_1 ≥ 3, x_2 ≥ 3)
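
Reusing the `Node`/`tree` sketch from slide 5 (its leaf values are probabilities rather than ±1, but the path logic is identical), the explanation is simply the list of rules tested on the path from the root to the example's leaf; the `explain` helper is my own.

```python
def explain(node: Node, x) -> list[str]:
    """Collect the rules tested on the path from the root to x's leaf."""
    rules = []
    while node.feature is not None:
        if x[node.feature] >= node.threshold:
            rules.append(f"x_{node.feature + 1} >= {node.threshold}")
            node = node.right
        else:
            rules.append(f"x_{node.feature + 1} < {node.threshold}")
            node = node.left
    return rules

print(explain(tree, [3.5, 3.5]))
# ['x_1 >= 2.0', 'x_2 >= 3.0', 'x_1 >= 3.0'] -- the "because ..." explanation
```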

  25. Decision tree surface for cancer prediction: [Figure: decision surface of a decision tree over the features mean concave points and mean area] …because mean concave points > 0.05, mean area > 791

  26. Explanations in higher dimensions: Explanatory power works "well" even for high-dimensional data. Example from the full breast cancer dataset with 30 features: "example classified as positive because max_perimeter > 117.41, max_concave_points > 0.087". Compare to a linear classifier: "example classified as positive because 2.142 ∗ mean_radius + 0.119 ∗ mean_texture + … > 0".
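
As a hedged sketch of how this style of explanation falls out of a standard library on the full 30-feature dataset (my illustration, not the course's code; the features and thresholds a fitted tree picks will not necessarily match the numbers on the slide):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

def explain_sklearn(clf, feature_names, x):
    """Walk a fitted scikit-learn tree for one example and report its path rules."""
    t, node, rules = clf.tree_, 0, []
    while t.children_left[node] != -1:               # -1 marks a leaf node
        name, thr = feature_names[t.feature[node]], t.threshold[node]
        if x[t.feature[node]] <= thr:
            rules.append(f"{name} <= {thr:.2f}")
            node = t.children_left[node]
        else:
            rules.append(f"{name} > {thr:.2f}")
            node = t.children_right[node]
    return rules

print(explain_sklearn(clf, data.feature_names, data.data[0]))
```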
