  1. Decision Trees – Aarti Singh, Machine Learning 10-701/15-781, Oct 6, 2010

  2. Learning a good prediction rule
  • Learn a mapping
  • Best prediction rule
  • Hypothesis space/Function class
    – Parametric classes (Gaussian, binomial, etc.)
    – Conditionally independent class densities (Naïve Bayes)
    – Linear decision boundary (Logistic regression)
    – Nonparametric classes (Histograms, nearest neighbor, kernel estimators, Decision Trees – Today)
  • Given training data, find a hypothesis/function in the hypothesis space that is close to the best prediction rule.

  3. First …
  • What does a decision tree represent?
  • Given a decision tree, how do we assign a label to a test point?

  4. Decision Tree for Tax Fraud Detection
  Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
  [Tree: Refund? Yes → NO; No → MarSt: Married → NO, Single/Divorced → TaxInc: < 80K → NO, > 80K → YES]
  • Each internal node: tests one feature X_i
  • Each branch from a node: selects one value for X_i
  • Each leaf node: predicts Y

  5.–10. Decision Tree for Tax Fraud Detection (walking the query down the tree)
  Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
  Refund = No → follow the No branch to MarSt; Marital Status = Married → follow the Married branch, which ends at a leaf → Assign Cheat to “No”
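To make the traversal on slides 4–10 concrete, here is a minimal Python sketch of how a learned tree assigns a label to a query. The nested-dict encoding of the tree, the `predict` helper, and the coarse "<80K"/">80K" encoding of Taxable Income are illustrative assumptions, not anything prescribed by the slides.

```python
# Minimal sketch: walking the query from slides 5-10 down the tax-fraud tree.
# The nested-dict tree encoding is an assumption made for illustration.
tree = {
    "feature": "Refund",
    "branches": {
        "Yes": "NO",
        "No": {
            "feature": "MarSt",
            "branches": {
                "Married": "NO",
                "Single": {"feature": "TaxInc",
                           "branches": {"<80K": "NO", ">80K": "YES"}},
                "Divorced": {"feature": "TaxInc",
                             "branches": {"<80K": "NO", ">80K": "YES"}},
            },
        },
    },
}

def predict(node, query):
    """Follow one branch per internal node until a leaf label is reached."""
    while isinstance(node, dict):          # internal node: test one feature
        node = node["branches"][query[node["feature"]]]
    return node                            # leaf node: the predicted label

# The query from the slides; the income test is never reached for it.
query = {"Refund": "No", "MarSt": "Married", "TaxInc": "<80K"}
print(predict(tree, query))                # -> "NO", i.e. Cheat = No (slide 10)
```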

  11. Decision Tree more generally…
  • Features can be discrete, continuous or categorical
  • Each internal node: tests some set of features {X_i}
  • Each branch from a node: selects a set of values for {X_i}
  • Each leaf node: predicts Y
  [Figure: axis-aligned partition of the feature space, with a 0/1 label in each cell]

  12. So far…
  • What does a decision tree represent?
  • Given a decision tree, how do we assign a label to a test point?
  Now …
  • How do we learn a decision tree from training data?
  • What is the decision on each leaf?


  14. How to learn a decision tree
  • Top-down induction [ID3, C4.5, CART, …]
  [Tree: Refund? Yes → NO; No → MarSt: Married → NO, Single/Divorced → TaxInc: < 80K → NO, > 80K → YES]

  15. Which feature is best to split?
  [Table: 8 training examples with binary features X1, X2 and label Y]
  • Split on X1: X1 = T → Y: 4 Ts, 0 Fs (absolutely sure); X1 = F → Y: 1 T, 3 Fs (kind of sure)
  • Split on X2: X2 = T → Y: 3 Ts, 1 F (kind of sure); X2 = F → Y: 2 Ts, 2 Fs (absolutely unsure)
  • Good split if we are more certain about classification after the split – a uniform distribution of labels is bad

  16. Which feature is best to split?
  Pick the attribute/feature which yields maximum information gain: IG(X_i) = H(Y) – H(Y|X_i), where
  H(Y) – entropy of Y
  H(Y|X_i) – conditional entropy of Y given X_i

  17. Entropy
  • Entropy of a random variable Y: H(Y) = – Σ_y P(Y = y) log2 P(Y = y)
  • More uncertainty, more entropy!
  [Figure: entropy H(Y) of Y ~ Bernoulli(p) as a function of p; uniform (p = 1/2) gives maximum entropy, deterministic (p = 0 or 1) gives zero entropy]
  • Information Theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code)
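As a quick check of the formula above, here is a small Python sketch that computes H(Y) from empirical label counts (the helper name and the use of log base 2, so that entropy is measured in bits, are my choices):

```python
import math

def entropy(counts):
    """H(Y) = -sum_y p(y) * log2 p(y), computed from a list of label counts (bits)."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]   # 0 * log 0 is treated as 0
    return -sum(p * math.log2(p) for p in probs)

print(entropy([1, 1]))   # uniform Bernoulli(1/2): maximum entropy, 1.0 bit
print(entropy([4, 0]))   # deterministic label: zero entropy, 0.0 bits
print(entropy([5, 3]))   # 5 Ts and 3 Fs (the split example): ~0.954 bits
```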

  18. Andrew Moore’s Entropy in a Nutshell
  • High Entropy: the values (locations of soup) are unpredictable… almost uniformly sampled throughout our dining room
  • Low Entropy: the values (locations of soup) are sampled almost entirely from within the soup bowl

  19. Information Gain
  • Advantage of attribute = decrease in uncertainty
    – Entropy of Y before the split: H(Y)
    – Entropy of Y after splitting based on X_i: H(Y|X_i) = Σ_x P(X_i = x) H(Y | X_i = x), weighting each branch by the probability of following it
  • Information gain is the difference: IG(X_i) = H(Y) – H(Y|X_i)
  • Max information gain = min conditional entropy

  20. Information Gain
  [Same 8-example table as slide 15]
  • Split on X1: X1 = T → Y: 4 Ts, 0 Fs; X1 = F → Y: 1 T, 3 Fs
  • Split on X2: X2 = T → Y: 3 Ts, 1 F; X2 = F → Y: 2 Ts, 2 Fs
  • Information gain of the split > 0
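The table on slides 15 and 20 is only partly legible in this transcript, so the rows below are a reconstruction that matches the per-branch counts shown (4/0 and 1/3 for X1; 3/1 and 2/2 for X2); treat the exact rows, and the helper functions, as assumptions. With that caveat, the information gain of each split can be computed directly:

```python
from collections import Counter
import math

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, feature_index):
    """IG(X_i) = H(Y) - sum_x P(X_i = x) * H(Y | X_i = x)."""
    labels = [row[-1] for row in rows]
    conditional = 0.0
    for value in set(row[feature_index] for row in rows):
        subset = [row[-1] for row in rows if row[feature_index] == value]
        conditional += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - conditional

# Reconstructed (X1, X2, Y) rows, consistent with the split counts on the slide.
data = [("T", "T", "T"), ("T", "F", "T"), ("T", "T", "T"), ("T", "F", "T"),
        ("F", "T", "T"), ("F", "T", "F"), ("F", "F", "F"), ("F", "F", "F")]

print(information_gain(data, 0))   # split on X1: ~0.55 bits
print(information_gain(data, 1))   # split on X2: ~0.05 bits, so X1 is the better split
```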

  21. Which feature is best to split?
  Pick the attribute/feature which yields maximum information gain: IG(X_i) = H(Y) – H(Y|X_i), where
  H(Y) – entropy of Y
  H(Y|X_i) – conditional entropy of Y given X_i
  The feature which yields the maximum reduction in entropy provides the maximum information about Y
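Putting the pieces together, here is a compact sketch of greedy top-down induction in the ID3 style named on slide 14. The slides do not give code, so the function, its stopping rules (pure labels, exhausted features, or a fixed depth as a simple pre-pruning rule), and its reuse of the `information_gain` helper and `data` table from the previous sketch are all assumptions.

```python
from collections import Counter

def build_tree(rows, features, depth=0, max_depth=3):
    """Greedy top-down induction: split on the feature with maximum information
    gain, recurse on each branch, and return a majority-vote label at the leaves.
    Relies on information_gain() and data from the previous sketch."""
    labels = [row[-1] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features or depth == max_depth:
        return majority                                   # leaf: majority vote
    best = max(features, key=lambda i: information_gain(rows, i))
    node = {"feature": best, "branches": {}}
    for value in set(row[best] for row in rows):
        subset = [row for row in rows if row[best] == value]
        remaining = [i for i in features if i != best]
        node["branches"][value] = build_tree(subset, remaining, depth + 1, max_depth)
    return node

print(build_tree(data, features=[0, 1]))
# Splits on X1 first (the larger information gain), as argued on slides 15-21.
```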

  22. Expressiveness of Decision Trees
  • Decision trees can express any function of the input features.
  • E.g., for Boolean functions, truth table row → path to leaf (see the sketch below)
  • There is a decision tree which perfectly classifies a training set, with one path to a leaf for each example
  • But it won't generalize well to new examples – prefer to find more compact decision trees
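As a tiny illustration of the truth-table-row-to-path idea, here is a hand-built tree for one Boolean function (XOR is my choice of example; the dict encoding matches the earlier prediction sketch):

```python
# Each truth-table row corresponds to one root-to-leaf path.
xor_tree = {"feature": "X1", "branches": {
    0: {"feature": "X2", "branches": {0: 0, 1: 1}},   # rows (0,0)->0 and (0,1)->1
    1: {"feature": "X2", "branches": {0: 1, 1: 0}},   # rows (1,0)->1 and (1,1)->0
}}

def evaluate(node, x):
    """Walk from the root to a leaf by reading one feature per internal node."""
    while isinstance(node, dict):
        node = node["branches"][x[node["feature"]]]
    return node

for x1 in (0, 1):
    for x2 in (0, 1):
        assert evaluate(xor_tree, {"X1": x1, "X2": x2}) == (x1 ^ x2)
```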

  23. Decision Trees – Overfitting
  • One training example per leaf – overfits; need a compact/pruned decision tree

  24. Bias-Variance Tradeoff
  [Figure: ideal classifier vs. the average classifier vs. classifiers based on different training data]
  • Coarse partition: bias large, variance small
  • Fine partition: bias small, variance large

  25. When to Stop?
  • Many strategies for picking simpler trees:
    – Pre-pruning
      • Fixed depth
      • Fixed number of leaves
    – Post-pruning
      • Chi-square test
        – Convert the decision tree to a set of rules
        – Eliminate variable values in rules which are independent of the label (using a chi-square test for independence)
        – Simplify the rule set by eliminating unnecessary rules
    – Information Criteria: MDL (Minimum Description Length)
  [Figure: a pruned version of the tax-fraud tree]
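The slides only name the chi-square test for post-pruning; one concrete reading is to keep a split only when the label counts differ significantly across its branches, and otherwise collapse it to a leaf. The sketch below uses `scipy.stats.chi2_contingency` and a 0.05 significance level, both of which are assumptions on my part.

```python
from scipy.stats import chi2_contingency

def split_is_significant(branch_label_counts, alpha=0.05):
    """Chi-square test of independence between branch membership and label.
    branch_label_counts: one row of label counts per branch, e.g. [[4, 0], [1, 3]].
    Returns True if the split looks significant; otherwise it is a pruning candidate."""
    _, p_value, _, _ = chi2_contingency(branch_label_counts)
    return p_value < alpha

# The X1 split from the earlier example: per-branch [count of T, count of F].
print(split_is_significant([[4, 0], [1, 3]]))     # with only 8 examples: not significant
print(split_is_significant([[40, 0], [10, 30]]))  # same proportions, 10x the data: significant
```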

  26. Information Criteria
  • Penalize complex models by introducing a cost: overall cost = log likelihood term (fit to data) + cost term (complexity), with regression and classification versions
  • Penalize trees with more leaves

  27. Information Criteria – MDL
  • MDL (Minimum Description Length): penalize complex models based on their information content – the # of bits needed to describe f (the description length)
  • Example: Binary decision trees
    – k leaves => 2k – 1 nodes
    – 2k – 1 bits to encode the tree structure + k bits to encode the label of each leaf (0/1)
    – 5 leaves => 9 bits to encode the structure (plus 5 bits for the leaf labels)
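The bit counts on this slide follow directly from the stated encoding and can be checked in a couple of lines (the helper name is mine):

```python
def description_length_bits(num_leaves):
    """Binary tree with k leaves: 2k - 1 nodes, so 2k - 1 bits for the structure,
    plus k bits for the 0/1 label at each leaf, as described on slide 27."""
    structure_bits = 2 * num_leaves - 1
    label_bits = num_leaves
    return structure_bits + label_bits

print(2 * 5 - 1)                     # 9 bits of structure for 5 leaves, as on the slide
print(description_length_bits(5))    # 14 bits in total once the leaf labels are included
```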

  28. So far…
  • What does a decision tree represent?
  • Given a decision tree, how do we assign a label to a test point?
  • How do we learn a decision tree from training data?
  Now …
  • What is the decision on each leaf?

  29. How to assign a label to each leaf
  • Classification – Majority vote
  • Regression – ?

  30. How to assign a label to each leaf
  • Classification – Majority vote
  • Regression – Constant / Linear / Polynomial fit

  31. Regression trees
  [Figure: regression tree splitting on Num Children? ≥ 2 vs. < 2]
  • Average (fit a constant) using the training data at the leaves
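A minimal sketch of the leaf rules on slides 29–31: majority vote for classification and a constant (average) fit for regression over the training examples that reach the leaf. The helper names are mine; linear or polynomial fits per leaf, mentioned on slide 30, would replace the simple average.

```python
from collections import Counter

def classification_leaf(labels):
    """Majority vote over the training labels that reach this leaf."""
    return Counter(labels).most_common(1)[0][0]

def regression_leaf(targets):
    """Constant fit: predict the average of the training targets at this leaf."""
    return sum(targets) / len(targets)

print(classification_leaf(["No", "No", "Yes", "No"]))   # -> "No"
print(regression_leaf([2.0, 3.0, 4.0]))                 # -> 3.0
```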

  32. Connection between nearest neighbor/histogram classifiers and decision trees

  33. Local prediction
  • Histogram, kernel density estimation, k-nearest neighbor classifier, kernel regression
  [Figure: training data D on a uniform grid partition – Histogram Classifier]

  34. Local Adaptive prediction
  • Let the neighborhood size adapt to the data – small neighborhoods near the decision boundary (small bias), large neighborhoods elsewhere (small variance)
  [Figure: training data D with an adaptive partition around a query point x – Decision Tree Classifier, majority vote at each leaf]

  35. Histogram Classifier vs Decision Trees
  [Figure: ideal classifier vs. histogram classifier vs. decision tree, 256 cells in each partition]

  36. Application to Image Coding
  [Figure: partitions with 1024 cells in each partition]

  37. Application to Image Coding
  [Figure: JPEG at 0.125 bpp (non-adaptive partitioning) vs. JPEG 2000 at 0.125 bpp (adaptive partitioning)]

  38. What you should know
  • Decision trees are one of the most popular data mining tools
    – Simplicity of design
    – Interpretability
    – Ease of implementation
    – Good performance in practice (for small dimensions)
  • Information gain to select attributes (ID3, C4.5, …)
  • Can be used for classification, regression and density estimation too
  • Decision trees will overfit!!!
    – Must use tricks to find “simple trees”, e.g.,
      • Pre-Pruning: Fixed depth / Fixed number of leaves
      • Post-Pruning: Chi-square test of independence
      • Complexity Penalized / MDL model selection
