SLIDE 1

Decision Trees

COMPSCI 371D — Machine Learning


SLIDE 2

Outline

1. Motivation
2. Recursive Splits and Trees
3. Prediction
4. Purity
5. How to Split
6. When to Stop Splitting


SLIDE 3

Motivation

Linear Predictors → Trees → Forests

  • Linear predictors:
    + Few parameters → Good generalization, efficient training
    + Convex risk → Unique minimum risk, easy optimization
    + Score-based → Measure of confidence
  • Few parameters → Limited expressiveness:
    • Regressor is an affine function
    • Classifier is a set of convex regions in X
  • Decision trees:
  • Score-based (in a sophisticated way)
  • Arbitrarily expressive: Flexible, but generalizes poorly
  • Interpretable: We can audit a decision
  • Random decision forests:
  • Ensembles of trees that vote on an answer
  • Expressive (somewhat less than trees), generalize well


SLIDE 4

Recursive Splits and Trees

Splitting X Recursively

[Figure: a training set in the unit square X = [0, 1]², partitioned by recursive axis-aligned splits]

SLIDE 5

Recursive Splits and Trees

A Decision Tree

Choose splits to maximize purity

[Figure: a decision tree with internal nodes a–e, each labeled with a split dimension and threshold (a: d = 2, t = 0.265; b: d = 1, t = 0.41; c: d = 2, t = 0.34; d: d = 1, t = 0.16; e: d = 2, t = 0.55), and leaves labeled with class distributions p such as [1, 0, 0], [0, 1, 0], [0, 0, 1]]


SLIDE 6

Recursive Splits and Trees

What’s in a Node

  • Internal:
  • Split parameters: Dimension j ∈ {1, . . . , d}, threshold t ∈ R
  • Pointers to children, corresponding to subsets of T:

    L := {(x, y) ∈ S | xj ≤ t}        R := {(x, y) ∈ S | xj > t}

  • Leaf: Distribution p of the training values y in this subset of X: discrete for classification, a histogram for regression
  • At inference time, return a summary of p as the value for the leaf:
    • Mode (majority) for a classifier
    • Mean or median for a regressor
    • (Remember k-NN?)
  • A sketch of this node structure follows below.

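To make the node contents concrete, here is a minimal Python sketch of this structure. The Node class, its field names (j, t, L, R, p), and the is_leaf helper mirror the slide’s notation but are illustrative assumptions, not the course’s reference code:

    # Hypothetical node structure for the trees described in these slides.
    from dataclasses import dataclass
    from typing import Optional
    import numpy as np

    @dataclass
    class Node:
        j: Optional[int] = None         # split dimension (internal nodes only)
        t: Optional[float] = None       # split threshold (internal nodes only)
        L: Optional["Node"] = None      # child receiving samples with x[j] <= t
        R: Optional["Node"] = None      # child receiving samples with x[j] > t
        p: Optional[np.ndarray] = None  # distribution of training labels reaching this node

        def is_leaf(self) -> bool:
            # A node with no children is a leaf.
            return self.L is None and self.R is None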

SLIDE 7

Recursive Splits and Trees

Why Store p?

  • Can’t we just store summary(p) at the leaves?
  • With p, we can compute a confidence value
  • (More important) We need p at every node during training to evaluate purity


SLIDE 8

Prediction

Prediction

    function y ← predict(x, τ, summary)
        if leaf?(τ) then
            return summary(τ.p)
        else
            return predict(x, split(x, τ), summary)
        end if
    end function

    function τ ← split(x, τ)
        if x[τ.j] ≤ τ.t then
            return τ.L
        else
            return τ.R
        end if
    end function

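The pseudocode translates almost line for line into Python. The sketch below assumes the hypothetical Node class from the previous snippet; majority is one possible summary for a classifier:

    import numpy as np

    def predict(x, tau, summary):
        # Walk down from the root tau until a leaf, then summarize its p.
        if tau.is_leaf():
            return summary(tau.p)
        return predict(x, split(x, tau), summary)

    def split(x, tau):
        # The split rule: x goes left iff its tau.j-th coordinate is at most tau.t.
        return tau.L if x[tau.j] <= tau.t else tau.R

    def majority(p):
        # Mode of the leaf distribution: the prediction of a classifier.
        return int(np.argmax(p))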

SLIDE 9

Purity

Design Decisions for Training

  • How to define (im)purity
  • How to find optimal split parameters j and t
  • When to stop splitting


SLIDE 10

Purity

Impurity Measure 1: The Error Rate

  • Simplest option:

i(S) = err(S) = 1 − max_y p(y|S)

  • S: subset of T that reaches the given node
  • Interpretation:
    • Put yourself at node τ
    • The distribution of training-set labels that are routed to τ is that of the labels in S
    • The best the classifier can do is to pick the label with the highest fraction, max_y p(y|S)
    • If the distribution is representative, err(S) is the probability that the classifier is wrong at τ (the empirical risk)


SLIDE 11

Purity

Impurity Measure 2: The Gini Index

  • A classifier that always picks the most likely label does best at inference time
  • However, it ignores all other labels at training time:
    p = [0.5, 0.49, 0.01] has the same error rate as q = [0.5, 0.25, 0.25]
  • In p, we have almost eliminated the third label
  • q is closer to uniform, perhaps less desirable
  • For evaluating splits (only), consider a stochastic predictor:
    ŷ = hGini(x), which returns label y with probability p(y|S(x))
  • The Gini index is the empirical risk of this stochastic predictor (it looks at all of p, not just its maximum)
  • It says that p is a bit better than q: p is less impure than q
  • i(S_p) ≈ 0.51 and i(S_q) ≈ 0.62


SLIDE 12

Purity

The Gini Index

  • Stochastic predictor: ŷ = hGini(x), which returns label y with probability p(y|S(x))
  • What is the empirical risk for hGini?
  • The true answer is y with probability p(y|S(x))
  • If the true answer is y, then ŷ is wrong with probability 1 − p(y|S), because hGini picks y with probability p(y|S(x))
  • Therefore, impurity defined as the empirical risk of hGini is

    i(S) = L_S(hGini) = Σ_{y∈Y} p(y|S) (1 − p(y|S)) = 1 − Σ_{y∈Y} p²(y|S)

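As a quick numerical check of the two impurity measures, the small script below (a sketch; err, gini, and the variable names are ours, not the course’s) reproduces the comparison of p and q from the previous slide:

    import numpy as np

    def err(p):
        # Error-rate impurity: i(S) = 1 - max_y p(y|S)
        return 1.0 - np.max(p)

    def gini(p):
        # Gini impurity: i(S) = 1 - sum_y p(y|S)^2
        return 1.0 - np.sum(np.asarray(p) ** 2)

    p = [0.5, 0.49, 0.01]
    q = [0.5, 0.25, 0.25]
    print(err(p), err(q))    # 0.5 0.5: the error rate cannot tell p and q apart
    print(gini(p), gini(q))  # 0.5098 0.625: Gini says p is less impure than q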

SLIDE 13

How to Split

How to Split

  • Split at training time: if the training subset S made it to the current node, put each sample in S into either L or R by the split rule
  • Split at inference time: send x either to τ.L or to τ.R
  • Either way:
    • Choose a dimension j in {1, . . . , d}
    • Choose a threshold t
    • Any data point for which xj ≤ t goes to τ.L
    • All other points go to τ.R
  • How to pick j and t?


SLIDE 14

How to Split

How to Pick j and t at Each Node?

  • Try all possibilities and pick the best
  • “Best:” Maximizes the decrease in impurity:

    ∆i(S, L, R) = i(S) − (|L|/|S|) i(L) − (|R|/|S|) i(R)

  • “All possibilities:” Choices are finite in number
  • Sorted unique values of xj across T: x_j^(0), . . . , x_j^(u_j)
  • Possible thresholds: t = t_j^(1), . . . , t_j^(u_j), where
    t_j^(ℓ) = (x_j^(ℓ−1) + x_j^(ℓ)) / 2 for ℓ = 1, . . . , u_j
  • Nested loop: for j = 1, . . . , d and, within it, for t = t_j^(1), . . . , t_j^(u_j)
  • Efficiency hacks are possible; a brute-force version is sketched below

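One possible brute-force implementation of this search is sketched below, under the slide’s definitions; gini_from_labels and best_split are hypothetical helper names, and this is the plain doubly nested loop without the efficiency hacks:

    import numpy as np

    def gini_from_labels(y):
        # Gini impurity of a label array: 1 - sum over classes of frequency^2.
        _, counts = np.unique(y, return_counts=True)
        f = counts / counts.sum()
        return 1.0 - np.sum(f ** 2)

    def best_split(X, y):
        # Try every dimension j and every midpoint threshold t, and keep the
        # pair that maximizes delta_i = i(S) - |L|/|S| i(L) - |R|/|S| i(R).
        i_S = gini_from_labels(y)
        best_j, best_t, best_delta = None, None, 0.0
        for j in range(X.shape[1]):
            v = np.unique(X[:, j])            # sorted unique values x_j^(0), ..., x_j^(u_j)
            for t in (v[:-1] + v[1:]) / 2.0:  # midpoint thresholds t_j^(1), ..., t_j^(u_j)
                left = X[:, j] <= t           # boolean mask for L; ~left is R
                delta = (i_S
                         - left.mean() * gini_from_labels(y[left])
                         - (~left).mean() * gini_from_labels(y[~left]))
                if delta > best_delta:
                    best_j, best_t, best_delta = j, t, delta
        return best_j, best_t, best_delta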

SLIDE 15

When to Stop Splitting

Stopping too Soon is Dangerous

  • Temptation: Stop when impurity does not decrease

[Figure: a two-class training set of + and o samples for which no single split decreases the impurity, even though deeper splits would separate the classes; a numerical illustration follows below]
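A tiny XOR-style dataset (made up for illustration; it is not the figure’s actual data) shows the danger numerically: at the root, the best possible split yields zero impurity decrease, so a “stop when impurity does not decrease” rule would never grow the tree, even though a depth-2 tree would be perfectly pure:

    import numpy as np

    def gini_from_labels(y):
        _, counts = np.unique(y, return_counts=True)
        f = counts / counts.sum()
        return 1.0 - np.sum(f ** 2)

    X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
    y = np.array([0, 1, 1, 0])   # XOR labels

    for j in range(2):           # the only informative threshold is t = 0.5
        left = X[:, j] <= 0.5
        delta = (gini_from_labels(y)
                 - left.mean() * gini_from_labels(y[left])
                 - (~left).mean() * gini_from_labels(y[~left]))
        print(j, delta)          # prints 0.0 for both dimensions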

SLIDE 16

When to Stop Splitting

When to Stop Splitting

  • Possible stopping criteria:
    • Impurity is zero
    • Either L or R would end up with too few samples
    • Maximum depth is reached
  • Overgrow the tree, then prune it
  • There is no optimal pruning method (finding the optimal tree is NP-hard, by reduction from the set cover problem; Hyafil and Rivest)

  • Better option: Random Decision Forests


SLIDE 17

When to Stop Splitting

Summary: Training a Decision Tree

  • Use exhaustive search at the root of the tree to find the dimension j and threshold t that split T with the biggest decrease in impurity
  • Store j and t at the root of the tree
  • Make new children with L and R
  • Repeat on the two subtrees until some stopping criterion is met (a recursive sketch follows below)

[Figure: the recursively split unit square from Slide 4]

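Putting the pieces together, a recursive training loop might look like the sketch below. It assumes the hypothetical Node class and best_split helper from the earlier snippets and uses a simplified subset of the stopping criteria listed two slides back:

    import numpy as np

    def label_distribution(y, n_classes):
        # p for this node: the fraction of each label among the samples that reached it.
        return np.bincount(y, minlength=n_classes) / len(y)

    def build_tree(X, y, n_classes, depth=0, max_depth=10, min_samples=2):
        node = Node(p=label_distribution(y, n_classes))  # every node stores p
        # Stop if the node is pure, has too few samples, or is too deep.
        if node.p.max() == 1.0 or len(y) < min_samples or depth >= max_depth:
            return node
        j, t, _ = best_split(X, y)
        if j is None:  # no split decreases impurity (beware: the "stop too soon" trap)
            return node
        node.j, node.t = j, t
        left = X[:, j] <= t
        node.L = build_tree(X[left], y[left], n_classes, depth + 1, max_depth, min_samples)
        node.R = build_tree(X[~left], y[~left], n_classes, depth + 1, max_depth, min_samples)
        return node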

SLIDE 18

When to Stop Splitting

Summary: Predicting with a Decision Tree

  • Use j and t at the root τ to see whether x belongs in τ.L or τ.R
  • Go to the appropriate child
  • Repeat until a leaf is reached
  • Return summary(p)
  • summary is the majority for a classifier, the mean or median for a regressor


SLIDE 19

When to Stop Splitting

From Trees to Forests

  • Trees are flexible → good expressiveness
  • Trees are flexible → poor generalization
  • Pruning is an option, but messy
  • Random Decision Forests let several trees vote
  • Use the bootstrap to give different trees different views of the data
  • Randomize split rules to make trees even more independent
