15-388/688 - Practical Data Science: Decision trees and interpretable models


  1. 15-388/688 - Practical Data Science: Decision trees and interpretable models. J. Zico Kolter, Carnegie Mellon University, Spring 2018

  2. Outline: Decision trees; Training (classification) decision trees; Interpreting predictions; Boosting; Examples

  4. Overview: Decision trees and boosted decision trees are some of the most ubiquitous algorithms in data science. Boosted decision trees typically perform very well without much tuning (the majority of Kaggle contests, for instance, are won with boosting methods). Decision trees, while not as powerful from a pure ML standpoint, are still one of the canonical examples of an "understandable" ML algorithm.

  5. Decision trees: Decision trees were one of the first machine learning algorithms. Basic idea: make classification/regression predictions by tracing through rules in a tree, with a constant prediction at each leaf node. [Figure: example decision tree with internal splits x_1 ≥ 2, x_2 ≥ 2, x_2 ≥ 3, x_1 ≥ 3 and leaf predictions h_1 = 0.1, h_2 = 0.7, h_3 = 0.8, h_4 = 0.9, h_5 = 0.2]
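
A minimal Python sketch of the "trace through rules" idea (my own illustration, not the course's code): a node tests one feature against a threshold, and a leaf stores a constant prediction. The `Node`/`predict` names and the exact arrangement of the slide's splits are assumptions; later sketches below reuse these definitions.

```python
# Sketch of decision tree prediction: walk from the root, testing one feature
# against a threshold at each internal node, until reaching a leaf that holds
# a constant prediction.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    feature: Optional[int] = None   # index of feature to test (None at a leaf)
    threshold: float = 0.0          # go right if x[feature] >= threshold
    left: Optional["Node"] = None   # branch taken when the test is False
    right: Optional["Node"] = None  # branch taken when the test is True
    value: float = 0.0              # constant prediction stored at a leaf


def predict(node: Node, x) -> float:
    """Trace a single example x through the tree and return its leaf's value."""
    while node.feature is not None:
        node = node.right if x[node.feature] >= node.threshold else node.left
    return node.value


# One plausible reconstruction of the slide's tree (0-based feature indices).
tree = Node(feature=0, threshold=2.0,
            left=Node(feature=1, threshold=2.0,
                      left=Node(value=0.1),                 # h_1
                      right=Node(value=0.7)),               # h_2
            right=Node(feature=1, threshold=3.0,
                       left=Node(value=0.8),                # h_3
                       right=Node(feature=0, threshold=3.0,
                                  left=Node(value=0.9),     # h_4
                                  right=Node(value=0.2))))  # h_5

print(predict(tree, [1.0, 3.0]))   # 0.7: x_1 < 2 and x_2 >= 2, so leaf h_2
```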

  6. Partitioning the input space: You can think of the hypothesis function of decision trees as partitioning the input space with axis-aligned boundaries. In each partition, predict a constant value. [Figure: the same tree drawn next to its partition of the (x_1, x_2) plane into the regions h_1 through h_5]
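
Continuing the sketch above (it reuses `tree` and `predict` from the previous block), each input falls in exactly one axis-aligned cell of the partition:

```python
# Each point maps to exactly one axis-aligned region, so points on opposite
# sides of a split boundary can receive different constant predictions.
for x in ([1.0, 1.0], [1.0, 3.0], [2.5, 2.0], [3.5, 3.5]):
    print(x, "->", predict(tree, x))
# [1.0, 1.0] -> 0.1   (h_1)
# [1.0, 3.0] -> 0.7   (h_2)
# [2.5, 2.0] -> 0.8   (h_3)
# [3.5, 3.5] -> 0.2   (h_5)
```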

  10. Outline: Decision trees; Training (classification) decision trees; Interpreting predictions; Boosting; Examples

  11. Decision trees as ML algorithms: To specify decision trees from a machine learning standpoint, we need to specify: 1. What is the hypothesis function h_θ(x)? 2. What is the loss function ℓ(h_θ(x), y)? 3. How do we minimize the loss function, minimize_θ ∑_{i=1}^{m} ℓ(h_θ(x^(i)), y^(i))?

  12. Decision trees as ML algorithms: To specify decision trees from a machine learning standpoint, we need to specify: 1. What is the hypothesis function h_θ(x)? …a decision tree (θ is shorthand for all the parameters that define the tree: tree structure, values to split on, leaf predictions, etc.) 2. What is the loss function ℓ(h_θ(x), y)? 3. How do we minimize the loss function, minimize_θ ∑_{i=1}^{m} ℓ(h_θ(x^(i)), y^(i))?

  14. Loss functions in decision trees: Let's assume the output is binary for now (a classification task; we will deal with regression shortly), and assume y ∈ {0, 1}. The typical decision tree algorithm uses a probabilistic loss function that considers y to be a Bernoulli random variable with probability h_θ(x), i.e., p(y | h_θ(x)) = h_θ(x)^y (1 − h_θ(x))^(1−y). The loss function is just the negative log probability of the output (as in maximum likelihood estimation): ℓ(h_θ(x), y) = −log p(y | h_θ(x)) = −y log h_θ(x) − (1 − y) log(1 − h_θ(x)).
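
A quick numerical check of this loss (the `bernoulli_nll` name is mine): a confident correct prediction incurs a small loss, a confident wrong one a large loss.

```python
import math

def bernoulli_nll(h: float, y: int) -> float:
    """Negative log probability of binary label y under predicted probability h."""
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

print(bernoulli_nll(0.9, 1))   # ~0.105: confident and correct -> small loss
print(bernoulli_nll(0.9, 0))   # ~2.303: confident and wrong   -> large loss
```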

  16. Optimizing decision trees: Key challenge: unlike models we have considered previously, the discrete tree structure means there are no gradients. Additionally, even if we assume binary inputs, i.e., x ∈ {0,1}^n, there are 2^(2^n) possible decision trees: n = 7 already means 2^128 ≈ 3.4 × 10^38 possible trees. Instead, we're going to use greedy methods to incrementally build the tree (i.e., minimize the loss function) one node at a time.
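
A one-line check of the count quoted above (purely illustrative): with n binary features there are 2^n possible inputs, and a tree can realize any of the 2^(2^n) Boolean functions over them.

```python
n = 7
print(2 ** (2 ** n))             # 340282366920938463463374607431768211456
print(f"{2 ** (2 ** n):.1e}")    # 3.4e+38, matching the slide
```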

  17. Optimizing a single leaf: Consider a single leaf in a decision tree (it could be the root of the initial tree). Let 𝒳 denote the examples at this leaf (i.e., in this partition), where 𝒳^+ denotes the positive examples and 𝒳^− denotes the negative (zero) examples. What should we choose as the (constant) prediction h at this leaf? minimize_h (1/|𝒳|) ∑_{(x,y)∈𝒳} ℓ(h, y) = −(|𝒳^+|/|𝒳|) log h − (|𝒳^−|/|𝒳|) log(1 − h) ⟹ h = |𝒳^+|/|𝒳|, which achieves loss ℓ = −h log h − (1 − h) log(1 − h).
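
A small sketch of the two quantities above, with my own helper names (`leaf_prediction`, `leaf_loss`) and an arbitrary toy label set; the sketches for the next slides reuse these helpers.

```python
import math

def leaf_prediction(labels):
    """Best constant probability at a leaf: the fraction of positive examples."""
    return sum(labels) / len(labels)

def leaf_loss(labels):
    """Average negative log likelihood at the leaf; for the optimal constant h
    this equals the binary entropy -h log h - (1 - h) log(1 - h)."""
    h = leaf_prediction(labels)
    if h in (0.0, 1.0):              # pure leaf: treat 0 * log 0 as 0
        return 0.0
    return -h * math.log(h) - (1 - h) * math.log(1 - h)

labels = [1, 1, 1, 0]                # hypothetical examples at one leaf
print(leaf_prediction(labels))       # 0.75
print(leaf_loss(labels))             # ~0.562
```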

  18. Optimizing splits: Now suppose we want to split this leaf into two leaves, assuming for the time being that x ∈ {0,1}^n is binary. If we split on a given feature j, this will separate 𝒳 into two sets 𝒳_0 and 𝒳_1 (with 𝒳_0^{+/−} and 𝒳_1^{+/−} defined similarly to before), and we would choose the optimal predictions h_0 = |𝒳_0^+|/|𝒳_0|, h_1 = |𝒳_1^+|/|𝒳_1|. [Figure: 𝒳 split by x_j = 0 / x_j = 1 into 𝒳_0 and 𝒳_1, with h_0 = |𝒳_0^+|/|𝒳_0| and h_1 = |𝒳_1^+|/|𝒳_1|]
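
Continuing the sketch (it reuses `leaf_prediction` from the previous block), a hypothetical split of a tiny binary dataset on one feature:

```python
def split_on_feature(X, y, j):
    """Partition the labels according to the value of binary feature j."""
    y0 = [yi for xi, yi in zip(X, y) if xi[j] == 0]
    y1 = [yi for xi, yi in zip(X, y) if xi[j] == 1]
    return y0, y1

X = [[0, 1], [0, 0], [1, 1], [1, 0], [1, 1]]     # hypothetical binary features
y = [0, 0, 1, 0, 1]
y0, y1 = split_on_feature(X, y, j=0)
print(leaf_prediction(y0), leaf_prediction(y1))  # h_0 = 0.0, h_1 ~= 0.667
```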

  19. Loss of split: The new leaves will each now suffer loss ℓ_0 = −h_0 log h_0 − (1 − h_0) log(1 − h_0) and ℓ_1 = −h_1 log h_1 − (1 − h_1) log(1 − h_1). Thus, if we split the original leaf on feature j, we no longer suffer our original loss ℓ, but we do suffer losses ℓ_0 + ℓ_1; i.e., we have decreased the overall loss function by ℓ − ℓ_0 − ℓ_1 (this quantity is called the information gain). Greedy decision tree learning – repeat: • For all leaf nodes, evaluate the information gain (i.e., the decrease in loss) when splitting on each feature j • Split the node/feature that decreases the loss the most • (Run cross-validation to determine when to stop, or stop after N nodes)
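
A sketch of the greedy step described above, reusing `leaf_loss`, `split_on_feature`, and the toy `X`, `y` from the previous blocks. It follows the slide's unweighted gain ℓ − ℓ_0 − ℓ_1; many implementations additionally weight each child's loss by its share of the parent's examples.

```python
def information_gain(y_parent, y_left, y_right):
    """Decrease in loss from a split (unweighted form, as on the slide)."""
    return leaf_loss(y_parent) - leaf_loss(y_left) - leaf_loss(y_right)

def best_split(X, y):
    """Greedy step: try every binary feature, return the one with highest gain."""
    best_j, best_gain = None, 0.0
    for j in range(len(X[0])):
        y0, y1 = split_on_feature(X, y, j)
        if not y0 or not y1:         # split leaves one side empty; skip it
            continue
        gain = information_gain(y, y0, y1)
        if gain > best_gain:
            best_j, best_gain = j, gain
    return best_j, best_gain

print(best_split(X, y))              # (0, ~0.036) on the toy data above
```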

  20. Poll: Decision tree training. Which of the following are true about training decision trees? 1. Once a feature has been selected as a split, it will never be selected again as a split in the tree 2. If a feature has been selected as a split, it will never be selected as a split at the next level of the tree 3. Assuming no training points are identical, decision trees can always obtain zero error if they are trained deep enough 4. The loss will never increase after a split

  21. Continuous features: What if the x_j's are continuous? Solution: sort the examples by their x_j values and compute the information gain at each possible split point. [Figure: examples x_j^(i_1), …, x_j^(i_7) sorted along the x_j axis, with each candidate threshold between consecutive values partitioning them into 𝒳_0 and 𝒳_1]
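
A sketch of the sorted-split-point idea (helper names are mine; `information_gain` comes from the block above): only the midpoints between consecutive distinct values need to be checked.

```python
def candidate_thresholds(values):
    """Midpoints between consecutive sorted feature values."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

def best_threshold(xj, y):
    """Pick the threshold on one continuous feature with the highest gain."""
    best_t, best_gain = None, 0.0
    for t in candidate_thresholds(xj):
        y0 = [yi for v, yi in zip(xj, y) if v < t]
        y1 = [yi for v, yi in zip(xj, y) if v >= t]
        gain = information_gain(y, y0, y1)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

xj = [0.3, 1.2, 2.5, 2.9, 4.1]       # hypothetical continuous feature values
yc = [0,   0,   1,   1,   1]
print(best_threshold(xj, yc))        # (1.85, ~0.67): separates the classes perfectly
```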

  22. Regression trees: Regression trees are the same, except that the hypotheses h are real-valued instead of probabilities, and we use the squared loss ℓ(h, y) = (h − y)^2. This means that the prediction at a node is given by minimize_h (1/|𝒳|) ∑_{(x,y)∈𝒳} (h − y)^2 ⟹ h = (1/|𝒳|) ∑_{(x,y)∈𝒳} y (i.e., the mean), and the node suffers loss ℓ = (1/|𝒳|) ∑_{(x,y)∈𝒳} (y − h)^2 (i.e., the variance).
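
The regression case in the same style (the `regression_leaf` helper is hypothetical): the best constant prediction is the mean of the targets, and the loss it suffers is their variance.

```python
def regression_leaf(y):
    """Optimal constant under squared loss (the mean) and its average loss
    at the leaf (the variance of the targets)."""
    h = sum(y) / len(y)
    loss = sum((yi - h) ** 2 for yi in y) / len(y)
    return h, loss

print(regression_leaf([1.0, 2.0, 2.0, 3.0]))   # (2.0, 0.5)
```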

  23. Outline: Decision trees; Training (classification) decision trees; Interpreting predictions; Boosting; Examples

  24. Interpretable models: Decision trees are the canonical example of an interpretable model. Why did we predict +1? [Figure: the earlier tree with leaf values h_1 = +1, h_2 = −1, h_3 = −1, h_4 = −1, h_5 = +1, with the path to the predicted leaf highlighted] …because x_1 ≥ 2, x_2 ≥ 3, x_1 ≥ 3 (or, more compactly, …because x_1 ≥ 3, x_2 ≥ 3)
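
Reusing the `Node`/`tree` sketch from slide 5 (its leaf values are probabilities rather than ±1, but the path logic is identical), the explanation is simply the list of rules tested on the path from the root to the example's leaf; the `explain` helper is my own.

```python
def explain(node: Node, x) -> list[str]:
    """Collect the rules tested on the path from the root to x's leaf."""
    rules = []
    while node.feature is not None:
        if x[node.feature] >= node.threshold:
            rules.append(f"x_{node.feature + 1} >= {node.threshold}")
            node = node.right
        else:
            rules.append(f"x_{node.feature + 1} < {node.threshold}")
            node = node.left
    return rules

print(explain(tree, [3.5, 3.5]))
# ['x_1 >= 2.0', 'x_2 >= 3.0', 'x_1 >= 3.0'] -- the "because ..." explanation
```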

  25. Decision tree surface for cancer prediction: [Figure: decision surface of a decision tree over the features mean concave points and mean area] …because mean concave points > 0.05, mean area > 791

  26. Explanations in higher dimensions: Explanatory power works "well" even for high-dimensional data. Example from the full breast cancer dataset with 30 features: "example classified as positive because max_perimeter > 117.41, max_concave_points > 0.087". Compare to a linear classifier: "example classified as positive because 2.142 ∗ mean_radius + 0.119 ∗ mean_texture + … > 0".
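
As a hedged sketch of how this style of explanation falls out of a standard library on the full 30-feature dataset (my illustration, not the course's code; the features and thresholds a fitted tree picks will not necessarily match the numbers on the slide):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

def explain_sklearn(clf, feature_names, x):
    """Walk a fitted scikit-learn tree for one example and report its path rules."""
    t, node, rules = clf.tree_, 0, []
    while t.children_left[node] != -1:               # -1 marks a leaf node
        name, thr = feature_names[t.feature[node]], t.threshold[node]
        if x[t.feature[node]] <= thr:
            rules.append(f"{name} <= {thr:.2f}")
            node = t.children_left[node]
        else:
            rules.append(f"{name} > {thr:.2f}")
            node = t.children_right[node]
    return rules

print(explain_sklearn(clf, data.feature_names, data.data[0]))
```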
