

SLIDE 1

Introduction to Machine Learning CMU-10701

  • 23. Decision Trees

Barnabás Póczos

SLIDE 2

Contents

  • Decision Trees: Definition + Motivation
  • Algorithm for Learning Decision Trees
      • Entropy, Mutual Information, Information Gain
  • Generalizations
      • Regression Trees
  • Overfitting
      • Pruning
      • Regularization

Many of these slides are taken from

  • Aarti Singh
  • Eric Xing
  • Carlos Guestrin
  • Russ Greiner
  • Andrew Moore

SLIDE 3

Decision Trees

SLIDE 4

Decision Tree: Motivation

Learn decision rules from a dataset: do we want to play tennis?
  • 4 discrete-valued attributes (Outlook, Temperature, Humidity, Wind)
  • “Play tennis?”: a Yes/No classification problem

SLIDE 5

Decision Tree: Motivation

We want to learn a “good” decision tree from the data. For example, this tree:

[Figure: an example decision tree for the tennis data]

SLIDE 6

Function Approximation

Formal problem setting:

  • Set of possible instances X (the set of all possible feature vectors)
  • Unknown target function f : X → Y
  • Set of function hypotheses H = { h | h : X → Y }
    (here H = the set of possible decision trees)

Input:

  • Training examples { ⟨x(i), y(i)⟩ } of the unknown target function f

Output:

  • Hypothesis h ∈ H that best approximates the target function f

In decision tree learning, we are doing function approximation, where the set of hypotheses H = the set of decision trees.

SLIDE 7

Decision Tree: The Hypothesis Space

  • Each internal node is labeled with some feature xj
  • Each arc (from xj) is labeled with a result of the test on xj
  • Leaf nodes specify the class h(x)

One instance: Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong is classified as “No” (Temperature and Wind are irrelevant on this path).

  • Easy to use in classification
  • Interpretable rules
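
To make this hypothesis space concrete, here is a minimal sketch of such a tree in Python; the dict-based node encoding and the tennis tree below are my own illustrative assumptions, not code from the lecture:

# Minimal decision tree representation: an internal node maps one feature
# to a child per feature value; a leaf is just a class label (a string).
def classify(node, x):
    """Route instance x (a dict of feature -> value) down the tree."""
    while isinstance(node, dict):          # internal node
        feature, children = node["feature"], node["children"]
        node = children[x[feature]]        # follow the arc for x's value
    return node                            # leaf: the predicted class

# The tennis tree sketched in the lecture (Outlook at the root).
tennis_tree = {
    "feature": "Outlook",
    "children": {
        "Sunny": {"feature": "Humidity",
                  "children": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"feature": "Wind",
                 "children": {"Strong": "No", "Weak": "Yes"}},
    },
}

x = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Strong"}
print(classify(tennis_tree, x))  # -> "No"; Temperature and Wind are never tested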

SLIDE 8

Generalizations

  • Features can be continuous
  • The output can be continuous too (regression trees)
  • Instead of single features, we can use sets of features in the nodes

Later we will discuss these in more detail.

SLIDE 9

Continuous Features

If a feature is continuous, internal nodes may test its value against a threshold.

SLIDE 10

Example: Mixed Discrete and Continuous Features

Tax fraud detection: the goal is to predict who is cheating on tax, using the ‘refund’, ‘marital status’, and ‘income’ features. Build a tree that matches the data.

Refund   Marital Status   Taxable Income   Cheat
Yes      Married          50K              No
No       Married          90K              No
No       Single           60K              No
No       Divorced         100K             Yes
Yes      Married          110K             No

SLIDE 11

Decision Tree for Tax Fraud Detection

Tree learned from the data:

Refund?
├─ Yes → NO
└─ No → MarSt?
        ├─ Married → NO
        └─ Single, Divorced → TaxInc?
                ├─ < 80K → NO
                └─ > 80K → YES

  • Each internal node: test one feature Xi
  • Continuous features are tested against a threshold
  • Each branch from a node: selects one value (or set of values) for Xi
  • Each leaf node: predict Y

SLIDE 12

Given a decision tree, how do we assign a label to a test point?

SLIDE 13

Decision Tree for Tax Fraud Detection

[Tree figure as on Slide 11]

Query data:
Refund   Marital Status   Taxable Income   Cheat
No       Married          80K              ?

SLIDE 14

Decision Tree for Tax Fraud Detection

[Tree figure as on Slide 11]

Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start at the root: test Refund.

SLIDE 15

Decision Tree for Tax Fraud Detection

[Tree figure as on Slide 11]

Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Refund = No: follow the “No” branch.

SLIDE 16

Decision Tree for Tax Fraud Detection

[Tree figure as on Slide 11]

Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Refund = No: follow the “No” branch to the MarSt test.

SLIDE 17

Decision Tree for Tax Fraud Detection

[Tree figure as on Slide 11]

Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Refund = No, then Marital Status = Married: follow the “Married” branch.

SLIDE 18

Decision Tree for Tax Fraud Detection

[Tree figure as on Slide 11]

Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Refund = No, Marital Status = Married: we reach the NO leaf. Assign Cheat to “No”.

SLIDE 19

What do decision trees do in the feature space?

SLIDE 20

Decision Tree Decision Boundaries

Decision trees divide the feature space into axis-parallel rectangles and label each rectangle with one class.

[Figure: a tree and its induced rectangular partition, with two features only: x1 and x2]

SLIDE 21

Some functions cannot be represented with binary splits

If we want to learn this kind of function too:
  • we need more complex functions in the nodes than binary splits, or
  • we need to “break” the function into smaller parts that can each be represented with binary splits.

[Figure: a +/− labeled region over axis values 1–5 whose boundary is not axis-parallel]

SLIDE 22

How do we learn a decision tree from training data?

SLIDE 23

What Boolean functions can be represented with decision trees?

How would you represent Y = X2 ∧ X5? Y = X2 ∨ X5?
How would you represent Y = X2X5 ∨ X3X4(¬X1)?
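
For intuition, a sketch (my own, not from the slides) of the first two functions as trees written out as nested tests:

# Y = X2 AND X5 as a depth-2 decision tree: test X2, then X5.
def tree_and(x2, x5):
    if not x2:       # X2 = False branch -> leaf 0
        return 0
    if not x5:       # X5 = False branch -> leaf 0
        return 0
    return 1         # both tests passed -> leaf 1

# Y = X2 OR X5: each True arc exits immediately to a 1 leaf.
def tree_or(x2, x5):
    if x2:
        return 1
    return 1 if x5 else 0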

SLIDE 24

Decision trees can represent any Boolean/discrete function

n Boolean features (x1, …, xn) ⇒
  • 2^n possible different instances
  • 2^(2^n) possible different functions if the class label Y is Boolean too

[Figure: a tree over X1 and X2 with +/− leaves]

SLIDE 25

Intuition: Want SMALL trees

... to capture “regularities” in the data, to be easier to understand, and faster to execute.

Option 1: Just store the training data
  • Trees can represent any Boolean (and discrete) function, e.g. (A ∨ B) ∧ (C ∨ ¬D ∨ E)
  • Just produce a “path” for each example (store the training data)
  • ... but this may require exponentially many nodes ...
  • Any generalization capability? (What about instances that are not in the training data?)

It is NP-hard to find the smallest tree that fits the data.

SLIDE 26

Expressiveness of General Decision Trees

Example: learn A XOR B (Boolean features and labels).

  • There is a decision tree which perfectly classifies a training set, with one path to a leaf for each example.

SLIDE 27

Example of Overfitting

  • 1000 patients
  • 25% have butterfly-itis (250)
  • 75% are healthy (750)
  • Use 10 silly features, not related to the class label:
      • ½ of patients have F1 = 1 (“odd birthday”)
      • ½ of patients have F2 = 1 (“even SSN”)
      • etc.

SLIDE 28

Typical Results

Error rates:
                                   Train data   New data
Standard decision tree learner:    0%           37%
Optimal decision tree:             25%          25%

Regularization is important…

SLIDE 29

How to learn a decision tree

Top-down induction [many algorithms: ID3, C4.5, CART, …]: grow the tree from the root to the leaves.

Repeat:
  1. Select the “best feature” (X1, X2, or X3) to split on
  2. For each value that feature takes, sort the training examples into the leaf nodes
  3. Stop if a leaf contains all training examples with the same label, or if all features are used up
  4. Assign each leaf the majority vote of the labels of its training examples

We will focus on the ID3 algorithm; a sketch follows below.
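
Here is a compact, runnable sketch of this top-down induction for discrete features, using the entropy and information-gain quantities defined on the next slides. The helper names, the dict-based tree encoding, and the dataset format are my own assumptions, not the lecture's reference code:

import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_y p(y) log2 p(y), estimated from a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feature):
    """IG(feature) = H(Y) - sum_v P(feature=v) H(Y | feature=v)."""
    n = len(labels)
    cond = 0.0
    for v in set(r[feature] for r in rows):
        sub = [y for r, y in zip(rows, labels) if r[feature] == v]
        cond += len(sub) / n * entropy(sub)
    return entropy(labels) - cond

def id3(rows, labels, features):
    """rows: list of dicts feature -> value; returns a nested dict tree or a label."""
    if len(set(labels)) == 1:               # all examples share one label: pure leaf
        return labels[0]
    if not features:                        # features used up: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(rows, labels, f))
    tree = {"feature": best, "children": {}}
    for v in set(r[best] for r in rows):    # one branch per observed value
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        tree["children"][v] = id3([rows[i] for i in idx],
                                  [labels[i] for i in idx],
                                  [f for f in features if f != best])
    return tree

Calling id3 on the play-tennis data with features ["Outlook", "Temperature", "Humidity", "Wind"] should reproduce the Outlook-rooted tree built on the following slides.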

SLIDE 30

First Split?

SLIDE 31

Which feature is best to split?

80 training people (50 Genuine, 30 Cheats).

Refund?
├─ Yes: 40 Genuine, 0 Cheats   (absolutely sure)
└─ No:  10 Genuine, 30 Cheats  (kind of sure)

Marital Status?
├─ Single, Divorced: 30 Genuine, 10 Cheats  (kind of sure)
└─ Married:          20 Genuine, 20 Cheats  (absolutely unsure)

Good split: we are less uncertain about the classification after the split.
Refund gives more information about the labels than Marital Status.

SLIDE 32

Which feature is best to split?

Pick the attribute/feature which yields the maximum information gain:

IG(Xi) = H(Y) − H(Y | Xi)

where H(Y) is the entropy of Y and H(Y | Xi) is the conditional entropy of Y given Xi. The feature which yields the maximum reduction in entropy provides the maximum information about Y.

SLIDE 33

Entropy

Entropy of a random variable Y:

H(Y) = − Σy P(Y = y) log2 P(Y = y)

Information Theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code).

[Figure: H(Y) versus p for Y ~ Bernoulli(p). Uniform (p = ½): max entropy. Deterministic (p = 0 or 1): zero entropy.]

Larger uncertainty, larger entropy!
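
A quick numeric check of the Bernoulli case (a sketch; the helper below is mine, not from the slides):

import math

def bernoulli_entropy(p):
    """H(Y) for Y ~ Bernoulli(p), with the convention 0 * log2(0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(bernoulli_entropy(0.5))  # 1.0 bit: uniform, maximum entropy
print(bernoulli_entropy(1.0))  # 0.0 bits: deterministic, zero entropy
print(bernoulli_entropy(0.9))  # ~0.469 bits: low uncertainty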

SLIDE 34

Information Gain

Advantage of an attribute = decrease in uncertainty:
  • Entropy of Y before the split: H(Y)
  • Entropy of Y after splitting based on Xi, weighting each branch by the probability of following it:
    H(Y | Xi) = Σv P(Xi = v) H(Y | Xi = v)

Information gain is the difference:

IG(Xi) = H(Y) − H(Y | Xi)

Max information gain = min conditional entropy: we want H(Y | Xi) to be small.

SLIDE 35

First Split?

Which feature splits the data best into + and − instances?

SLIDE 36

First Split?

The Outlook feature looks great, because the Overcast branch is perfectly separated.

SLIDE 37

Statistics

If we split on xi, we produce 2 children:
  (1) the #(xi = t) examples follow the TRUE branch, with data [#(xi = t, Y = +), #(xi = t, Y = −)]
  (2) the #(xi = f) examples follow the FALSE branch, with data [#(xi = f, Y = +), #(xi = f, Y = −)]

Calculate the mutual information between xi and Y!

SLIDE 38

Information gain of the Outlook feature

Root: 14 examples (9+, 5−),
H = −(9/14 · log2(9/14) + 5/14 · log2(5/14)) = 0.9403

Sunny [2+, 3−]:    H1 = −(2/5 · log2(2/5) + 3/5 · log2(3/5)) = 0.9710
Overcast [4+, 0−]: H2 = −(4/4 · log2(4/4) + 0/4 · log2(0/4)) = 0
Rain [3+, 2−]:     H3 = −(3/5 · log2(3/5) + 2/5 · log2(2/5)) = 0.9710

I(Y, Outlook) = 0.9403 − (5/14 · H1 + 4/14 · H2 + 5/14 · H3) = 0.2465
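
These numbers are easy to reproduce; a self-contained check (a sketch of mine, with the (+, −) counts read off the slides):

import math

def H(pos, neg):
    """Entropy of a node with pos positive and neg negative examples; 0 * log2(0) = 0."""
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c)

root = H(9, 5)                                              # 0.9403
outlook = {"Sunny": (2, 3), "Overcast": (4, 0), "Rain": (3, 2)}
cond = sum((p + n) / 14 * H(p, n) for p, n in outlook.values())
print(round(root - cond, 4))                                # 0.2467 (slide: 0.2465, rounding)
print(round(root - (7/14*H(3, 4) + 7/14*H(6, 1)), 3))       # 0.152  (Humidity; slide: 0.151, rounding)
print(round(root - (8/14*H(6, 2) + 6/14*H(3, 3)), 3))       # 0.048  (Wind, matches the slide)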

SLIDE 39

Information gain of the Humidity feature

Root: 14 examples (9+, 5−), H = 0.9403

High [3+, 4−]:   H = −(3/7 · log2(3/7) + 4/7 · log2(4/7)) = 0.9852
Normal [6+, 1−]: H = −(6/7 · log2(6/7) + 1/7 · log2(1/7)) = 0.5917

I(Y, Humidity) = 0.9403 − 7/14 · 0.9852 − 7/14 · 0.5917 = 0.151

SLIDE 40

Information gain of the Wind feature

Root: 14 examples (9+, 5−), H = 0.9403

Weak [6+, 2−]:   H = −(6/8 · log2(6/8) + 2/8 · log2(2/8)) = 0.811
Strong [3+, 3−]: H = −(3/6 · log2(3/6) + 3/6 · log2(3/6)) = 1

I(Y, Wind) = 0.9403 − 8/14 · 0.811 − 6/14 · 1 = 0.048

SLIDE 41

Repeat and build the tree

  • Similar calculations for the Temperature feature.
  • The Outlook feature is the best root node among all features.
  • Recurse on each branch: for example, on the Sunny branch Humidity is the best next split.

SLIDE 42

Tree Learning App

http://www.cs.ualberta.ca/%7Eaixplore/learning/DecisionTrees/Applet/DecisionTreeApplet.html

SLIDE 43

More general trees

SLIDE 44

Decision/Classification Tree more generally…

  • Features can be discrete or continuous
  • Each internal node: test some set of features {Xi}
  • Each branch from a node: selects a set of values for the set {Xi}
  • Each leaf node: predict Y
      • Majority vote (classification)
      • Average or polynomial fit (regression)

[Figure: a partition of the feature space with the class labels marked in each cell]

SLIDE 45

Regression trees

Average (fit a constant) using the training data at the leaves.

[Figure: a tree over features X1 … Xp whose root tests “Num Children? (≥ 2 vs < 2)”, with a constant fit at each leaf]
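
A minimal sketch of the leaf model (my own illustration): each leaf predicts the mean of the training targets falling into it, which is the constant minimizing squared error:

# One split on a threshold; each side predicts the mean of its targets.
def fit_stump(xs, ys, threshold):
    left  = [y for x, y in zip(xs, ys) if x <  threshold]
    right = [y for x, y in zip(xs, ys) if x >= threshold]
    mean = lambda v: sum(v) / len(v)
    return {"threshold": threshold, "left": mean(left), "right": mean(right)}

def predict(stump, x):
    return stump["left"] if x < stump["threshold"] else stump["right"]

stump = fit_stump([1, 2, 3, 4], [1.0, 1.2, 3.9, 4.1], threshold=2.5)
print(predict(stump, 1.5))  # 1.1, the mean of the left leaf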

SLIDE 46

Regression (Constant) trees

SLIDE 47

Overfitting

SLIDE 48

When to Stop?

Many strategies for picking simpler trees:

  • Pre-pruning
      • Fixed depth
      • Fixed number of leaves
  • Post-pruning
      • Chi-square test
  • Model selection by complexity penalization

[Figure: a pruned tree keeping only the Refund and MarSt splits]

SLIDE 49

Model Selection

Penalize complex models by introducing a cost: a log-likelihood term measuring the fit to the training examples (indexed by j), plus a term that penalizes trees with more leaves. There is a version of this cost for regression and one for classification.
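
As a sketch, one standard form of such a penalized objective (the regularization weight λ and the likelihood notation are my assumptions, not the slide's exact formula):

\mathrm{cost}(T) \;=\; -\sum_{j=1}^{m} \log p\!\left(y^{(j)} \mid x^{(j)},\, T\right) \;+\; \lambda \cdot \#\mathrm{leaves}(T)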

SLIDE 50

Pre-Pruning

SLIDE 51
PAC bound and Bias-Variance tradeoff

Equivalently, for a fixed sample size m, with probability ≥ 1 − δ every h ∈ H satisfies

errtrue(h) ≤ errtrain(h) + sqrt( (ln |H| + ln(1/δ)) / (2m) )

H hypothesis space:  complex → small bias, large variance
                     simple  → large bias, small variance

SLIDE 52
Sample complexity: to guarantee errtrue(h) ≤ errtrain(h) + ε with probability ≥ 1 − δ, we need

m ≥ (ln |H| + ln(1/δ)) / (2ε²)

training examples. What about the size of the hypothesis space? How large is the hypothesis space of decision trees?

SLIDE 53

Number of decision trees of depth k

Recursive solution, given n attributes. Let Hk = the number of decision trees of depth k, with H0 = 2 (the constant “Yes” and “No” trees):

Hk = (# choices of root attribute) × (# possible left subtrees) × (# possible right subtrees)
   = n · Hk−1 · Hk−1

Write Lk = log2 Hk, so L0 = 1 and

Lk = log2 n + 2 Lk−1
   = log2 n + 2 (log2 n + 2 Lk−2)
   = log2 n + 2 log2 n + 2² log2 n + … + 2^(k−1) (log2 n + 2 L0)

Summing the geometric series: Lk = (2^k − 1) log2 n + 2^k.
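
A quick sanity check of the closed form against the recursion (a sketch of mine; n is the number of attributes):

import math

def L_recursive(k, n):
    """Lk via the recursion L0 = 1, Lk = log2(n) + 2 * L(k-1)."""
    return 1 if k == 0 else math.log2(n) + 2 * L_recursive(k - 1, n)

def L_closed(k, n):
    """Closed form Lk = (2^k - 1) log2(n) + 2^k."""
    return (2**k - 1) * math.log2(n) + 2**k

for k in range(5):
    assert abs(L_recursive(k, 4) - L_closed(k, 4)) < 1e-9
print("closed form matches the recursion")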

SLIDE 54

PAC bound for decision trees of depth k

log2 |H| = Lk = (2^k − 1) log2 n + 2^k

Bad!!! Plugging this into the PAC bound, the number of training points m needed is exponential in the depth k!

By contrast, the number of leaves is never more than the number of data points, so let us regularize with the number of leaves instead of the depth!

SLIDE 55

Number of decision trees with k leaves

Let Hk = the number of decision trees with k leaves, with H1 = 2 (the “Yes” tree or the “No” tree).

Hk = (# choices of root attribute) × [ (# left subtrees with 1 leaf) × (# right subtrees with k−1 leaves)
     + (# left subtrees with 2 leaves) × (# right subtrees with k−2 leaves) + …
     + (# left subtrees with k−1 leaves) × (# right subtrees with 1 leaf) ]

This gives Hk = n^(k−1) · Ck−1, where Ck−1 is a Catalan number.

Loose bound (using Stirling’s approximation): log2 Hk is linear in k.

SLIDE 56

Number of decision trees

  • With k leaves: log2 Hk is linear in k
    ⇒ the number of points m needed is linear in the number of leaves k.

  • With depth k: log2 Hk = (2^k − 1) log2 n + 2^k, exponential in k
    ⇒ the number of points m needed is exponential in the depth k.

(n is the number of features)

SLIDE 57

PAC bound for decision trees with k leaves – Bias-Variance revisited

With prob ≥ 1 − δ,

errtrue(h) ≤ errtrain(h) + sqrt( (ln |Hk| + ln(1/δ)) / (2m) )

m: number of training points, k: number of leaves. With k ≈ m the second (variance) term is large (roughly > ½); with k < m it is small (roughly < ½).

SLIDE 58

What did we learn from decision trees?

  • The Bias-Variance tradeoff, formalized:
      • Complexity k ≈ m – no bias, lots of variance
      • k < m – some bias, less variance

SLIDE 59

Post-Pruning (Bottom-Up pruning)

SLIDE 60

Chi-Squared independence test

OBSERVED DATA                    Voting Preferences
               Republican   Democrat   Independent   Row total
Male           200          150        50            400
Female         250          300        50            600
Column total   450          450        100           1000

H0: Gender and voting preferences are independent.
Ha: Gender and voting preferences are not independent.

Expected numbers under H0 (independence): Er,c = (nr × nc) / n

SLIDE 61

Chi-Squared independence test

[Observed data table as on Slide 60.]

Expected numbers under H0 (independence): Er,c = (nr × nc) / n

E1,1 = (400 × 450) / 1000 = 180
E1,2 = (400 × 450) / 1000 = 180
E1,3 = (400 × 100) / 1000 = 40
E2,1 = (600 × 450) / 1000 = 270
E2,2 = (600 × 450) / 1000 = 270
E2,3 = (600 × 100) / 1000 = 60

Χ² = Σ [ (Or,c − Er,c)² / Er,c ]
   = (200 − 180)²/180 + (150 − 180)²/180 + (50 − 40)²/40
   + (250 − 270)²/270 + (300 − 270)²/270 + (50 − 60)²/60 = 16.2
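
This matches what scipy computes directly from the observed table (a sketch; requires scipy):

from scipy.stats import chi2_contingency

observed = [[200, 150, 50],    # Male
            [250, 300, 50]]    # Female
chi2, p, dof, expected = chi2_contingency(observed)
print(round(chi2, 1), dof, round(p, 4))  # 16.2, dof = 2, p ~ 0.0003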

SLIDE 62

Chi-Squared independence test

Degrees of freedom: DF = (r − 1) × (c − 1) = (2 − 1) × (3 − 1) = 2, where r = # rows and c = # columns.

P(Χ² > 16.2) = 0.0003 < 0.05 (p-value)
⇒ we reject the null hypothesis.

The evidence shows that there is a relationship between gender and voting preference.

SLIDE 63

Chi-Square Pruning

1. Build a complete tree.
2. Consider each node X whose children are leaves, and perform a chi-square independence test.

Node X: s = p + n instances enter it (p positives, n negatives). The split sends
  • sf = pf + nf instances down the false branch (pf positives, nf negatives)
  • st = pt + nt instances down the true branch (pt positives, nt negatives)

Expected numbers under independence:
  false branch: sf · p/s positives, sf · n/s negatives
  true branch:  st · p/s positives, st · n/s negatives

If after splitting the expected numbers are the same as the measured ones, then there is no point in splitting the node. Delete the leaves!

SLIDE 64

Training data:

X1   X2   Y   Count
T    T    T   2
T    F    T   2
F    T    F   5
F    F    T   1

Tree:

X1?
├─ T → Y = T
└─ F → X2?  (s = 6, p = 1, n = 5)
        ├─ F → Y = T  (sf = 1, pf = 1, nf = 0)
        └─ T → Y = F  (st = 5, pt = 0, nt = 5)

Variable assignment   Real counts of Y = T   Expected counts of Y = T
X2 = F                1                      1/6  (sf · p/s)
X2 = T                0                      5/6  (st · p/s)

Variable assignment   Real counts of Y = F   Expected counts of Y = F
X2 = F                0                      5/6  (sf · n/s)
X2 = T                5                      25/6 (st · n/s)

SLIDE 65

If label Y and feature X2 were independent, then the expected counts should be close to the real counts (tables above).

Degrees of freedom: DF = (# Y labels − 1) × (# X2 labels − 1) = (2 − 1) × (2 − 1) = 1

Z = Σ [ (Or,c − Er,c)² / Er,c ]
  = (1 − 1/6)²/(1/6) + (0 − 5/6)²/(5/6) + (0 − 5/6)²/(5/6) + (5 − 25/6)²/(25/6)
  = 25/6 + 5/6 + 5/6 + 1/6 = 6
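
The same statistic via scipy's contingency test on the real counts (a sketch; correction=False disables Yates' continuity correction, which would otherwise change the value for this 2×2 table):

from scipy.stats import chi2_contingency

observed = [[1, 0],   # X2 = F: (Y = T, Y = F)
            [0, 5]]   # X2 = T: (Y = T, Y = F)
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, dof)          # 6.0, dof = 1
print(expected.tolist())  # [[1/6, 5/6], [5/6, 25/6]]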

SLIDE 66

Chi-Squared independence test

P(Z > c) is the probability of seeing this large a deviation by chance under the H0 independence assumption:
P(Z > 3.8415) = 0.05,  P(Z ≤ 3.8415) = 0.95

The smaller Z is, the more likely the feature is independent of the label (there is no evidence of their dependence).

In our case Z = 6 > 3.8415
  ⇒ we reject the independence hypothesis
  ⇒ and keep the node X2.

SLIDE 67

What you should know

  • Decision trees are one of the most popular data mining tools
      • Simplicity of design
      • Interpretability
      • Ease of implementation
      • Good performance in practice (for small dimensions)
  • Information gain to select attributes (ID3, C4.5, …)
  • Can be used for classification, regression, and density estimation too
  • Decision trees will overfit!!!
      • Must use tricks to find “simple trees”, e.g.,
          • Pre-pruning: fixed depth / fixed number of leaves
          • Post-pruning: chi-square test of independence
          • Complexity-penalized model selection

SLIDE 68

Thanks for the Attention! 