  1. Decision Trees: Discussion (Machine Learning). Some slides from Tom Mitchell, Dan Roth, and others.

  2. This lecture: Learning Decision Trees. 1. Representation: What are decision trees? 2. Algorithm: Learning decision trees (the ID3 algorithm, a greedy heuristic). 3. Some extensions.

  4. Tips and Tricks: 1. Decision tree variants. 2. Handling examples with missing feature values. 3. Non-Boolean features. 4. Avoiding overfitting.

  5. 1. Variants of information gain. Information gain is defined using entropy to measure the disorder/impurity of the labels. There are other ways to measure disorder, e.g. MajorityError and the Gini Index. Example: MajorityError computes: “Suppose the tree was not grown below this node and the most frequent label were chosen; what would be the error?” Suppose at some node there are 15 + and 5 − examples. What is the MajorityError? Answer: ¼. Works like entropy.
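A minimal sketch of that MajorityError computation; the helper name majority_error is mine, not from the slides:

```python
def majority_error(num_pos, num_neg):
    """Error if the tree stopped at this node and predicted the most frequent label."""
    return min(num_pos, num_neg) / (num_pos + num_neg)

print(majority_error(15, 5))  # 0.25, i.e. the 1/4 from the example above
```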

  7. 1. Variants of information gain. Let 𝑞 denote the fraction of positive examples; then 1 − 𝑞 is the fraction of negative examples. Entropy: −𝑞 log₂ 𝑞 − (1 − 𝑞) log₂(1 − 𝑞). Gini Index: 1 − (𝑞² + (1 − 𝑞)²). MajorityError: min(𝑞, 1 − 𝑞). [Plot: all three measures as a function of the fraction of positive examples]

  8. 1. Variants of information gain. Let 𝑞 denote the fraction of positive examples; then 1 − 𝑞 is the fraction of negative examples. Entropy: −𝑞 log₂ 𝑞 − (1 − 𝑞) log₂(1 − 𝑞). Gini Index: 1 − (𝑞² + (1 − 𝑞)²). MajorityError: min(𝑞, 1 − 𝑞). Each measure peaks when uncertainty is highest (i.e. 𝑞 = 0.5). [Plot: all three measures as a function of the fraction of positive examples]

  9. 1. Variants of information gain. Let 𝑞 denote the fraction of positive examples; then 1 − 𝑞 is the fraction of negative examples. Entropy: −𝑞 log₂ 𝑞 − (1 − 𝑞) log₂(1 − 𝑞). Gini Index: 1 − (𝑞² + (1 − 𝑞)²). MajorityError: min(𝑞, 1 − 𝑞). Each measure is lowest (zero) when uncertainty is lowest (i.e. 𝑞 = 0 or 𝑞 = 1). [Plot: all three measures as a function of the fraction of positive examples]

  10. 1. Variants of information gain. Let 𝑞 denote the fraction of positive examples; then 1 − 𝑞 is the fraction of negative examples. Entropy: −𝑞 log₂ 𝑞 − (1 − 𝑞) log₂(1 − 𝑞). Gini Index: 1 − (𝑞² + (1 − 𝑞)²). MajorityError: min(𝑞, 1 − 𝑞). Each of these works like entropy; they can replace entropy in the definition of information gain. [Plot: all three measures as a function of the fraction of positive examples]
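A sketch of how any of these measures can be dropped into the information-gain computation. The function names and the example split are illustrative choices, not from the lecture:

```python
import math

def entropy(q):
    """Entropy: -q log2 q - (1 - q) log2 (1 - q), with 0 log 0 taken as 0."""
    if q in (0.0, 1.0):
        return 0.0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def gini(q):
    """Gini index: 1 - (q^2 + (1 - q)^2)."""
    return 1 - (q ** 2 + (1 - q) ** 2)

def majority_error(q):
    """MajorityError: min(q, 1 - q)."""
    return min(q, 1 - q)

def gain(labels, children, impurity):
    """Information-gain-style score: impurity of the parent node minus the
    weighted impurity of the child nodes produced by a split.
    labels: Boolean labels at the node; children: lists of labels, one per
    branch of the split; impurity: any of the three functions above."""
    def node_impurity(ys):
        return impurity(sum(ys) / len(ys)) if ys else 0.0
    weighted = sum(len(c) / len(labels) * node_impurity(c) for c in children)
    return node_impurity(labels) - weighted

# Hypothetical split of a node with 15 positive and 5 negative examples.
parent = [True] * 15 + [False] * 5
children = [[True] * 12 + [False] * 1, [True] * 3 + [False] * 4]
for measure in (entropy, gini, majority_error):
    print(measure.__name__, gain(parent, children, measure))
```

Swapping entropy for gini or majority_error changes only how candidate splits are scored; the rest of ID3 is untouched.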

  11. 2. Missing feature values. Suppose an example is missing the value of an attribute. What can we do at training time?
      Day  Outlook  Temperature  Humidity  Wind    PlayTennis
      1    Sunny    Hot          High      Weak    No
      2    Sunny    Hot          High      Strong  No
      8    Sunny    Mild         ???       Weak    No
      9    Sunny    Cool         High      Weak    Yes
      11   Sunny    Mild         Normal    Strong  Yes

  12. 2. Missing feature values. Suppose an example is missing the value of an attribute. What can we do at training time? Different methods to “complete the example”:
      – Using the most common value of the attribute in the data
      – Using the most common value of the attribute among all examples with the same output
      – Using fractional counts of the attribute values
        • E.g. Outlook = {5/14 Sunny, 4/14 Overcast, 5/14 Rain}
        • Exercise: Will this change probability computations?

  13. 2. Missing feature values. Suppose an example is missing the value of an attribute. What can we do at training time? Different methods to “complete the example”:
      – Using the most common value of the attribute in the data
      – Using the most common value of the attribute among all examples with the same output
      – Using fractional counts of the attribute values
        • E.g. Outlook = {5/14 Sunny, 4/14 Overcast, 5/14 Rain}
        • Exercise: Will this change probability computations?
      At test time?

  14. 2. Missing feature values. Suppose an example is missing the value of an attribute. What can we do at training time? Different methods to “complete the example”:
      – Using the most common value of the attribute in the data
      – Using the most common value of the attribute among all examples with the same output
      – Using fractional counts of the attribute values
        • E.g. Outlook = {5/14 Sunny, 4/14 Overcast, 5/14 Rain}
        • Exercise: Will this change probability computations?
      At test time? Use the same method.
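A rough sketch of the three completion strategies. The data layout and helper names are mine, not the lecture's, and the toy data marks a missing Outlook rather than the missing Humidity from the table above:

```python
from collections import Counter

# Training examples as (attribute dict, label); None marks a missing value.
data = [
    ({"Outlook": "Sunny", "Wind": "Weak"},   "No"),
    ({"Outlook": "Sunny", "Wind": "Strong"}, "No"),
    ({"Outlook": None,    "Wind": "Weak"},   "No"),   # missing Outlook
    ({"Outlook": "Rain",  "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",  "Wind": "Strong"}, "Yes"),
]

def most_common_value(data, attr):
    """Strategy 1: most common value of the attribute over all examples."""
    counts = Counter(x[attr] for x, _ in data if x[attr] is not None)
    return counts.most_common(1)[0][0]

def most_common_value_for_label(data, attr, label):
    """Strategy 2: most common value among examples with the same output."""
    counts = Counter(x[attr] for x, y in data
                     if x[attr] is not None and y == label)
    return counts.most_common(1)[0][0]

def fractional_counts(data, attr):
    """Strategy 3: spread the missing value over all values, in proportion
    to how often each value occurs (e.g. Outlook = {5/14 Sunny, ...})."""
    counts = Counter(x[attr] for x, _ in data if x[attr] is not None)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

print(most_common_value(data, "Outlook"))                  # 'Sunny' or 'Rain' (tie on this toy data)
print(most_common_value_for_label(data, "Outlook", "No"))  # 'Sunny'
print(fractional_counts(data, "Outlook"))                  # {'Sunny': 0.5, 'Rain': 0.5}
```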

  15. 3. Non-Boolean features. If the features can take multiple values:
      – We have seen one edge per value (i.e. a multi-way split)
      [Tree: an Outlook node with branches Sunny, Overcast, and Rain]

  16. 3. Non-Boolean features. If the features can take multiple values:
      – We have seen one edge per value (i.e. a multi-way split)
      – Another option: make the attributes Boolean by testing for each value. Convert Outlook = Sunny → {Outlook:Sunny = True, Outlook:Overcast = False, Outlook:Rain = False}
      – Or, perhaps, group values into disjoint sets

  17. 3. Non-Boolean features. If the features can take multiple values:
      – We have seen one edge per value (i.e. a multi-way split)
      – Another option: make the attributes Boolean by testing for each value. Convert Outlook = Sunny → {Outlook:Sunny = True, Outlook:Overcast = False, Outlook:Rain = False}
      – Or, perhaps, group values into disjoint sets
      • For numeric features, use thresholds or ranges to get Boolean/discrete alternatives
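A small sketch of both conversions, one Boolean test per value and a threshold test for a numeric feature; the helper names and the 75-degree threshold are illustrative:

```python
def one_hot_tests(example, attr, values):
    """Turn a multi-valued attribute into one Boolean feature per value,
    e.g. Outlook=Sunny -> {Outlook:Sunny=True, Outlook:Overcast=False, Outlook:Rain=False}."""
    return {f"{attr}:{v}": example[attr] == v for v in values}

def threshold_test(example, attr, threshold):
    """Turn a numeric attribute into a Boolean test against a threshold."""
    return {f"{attr}>{threshold}": example[attr] > threshold}

example = {"Outlook": "Sunny", "Temperature": 85}
print(one_hot_tests(example, "Outlook", ["Sunny", "Overcast", "Rain"]))
# {'Outlook:Sunny': True, 'Outlook:Overcast': False, 'Outlook:Rain': False}
print(threshold_test(example, "Temperature", 75))
# {'Temperature>75': True}
```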

  18. 4. Overfitting

  19. The “First Bit” function. A Boolean function with n inputs that simply returns the value of the first input; all other inputs are irrelevant. What is the decision tree for this function?
      X₀  X₁  Y
      F   F   F
      F   T   F
      T   F   T
      T   T   T
      Y = X₀; X₁ is irrelevant

  20. The “First Bit” function. A Boolean function with n inputs that simply returns the value of the first input; all other inputs are irrelevant. What is the decision tree for this function?
      X₀  X₁  Y
      F   F   F
      F   T   F
      T   F   T
      T   T   T
      [Tree: a single node testing X₀, with the T branch predicting T and the F branch predicting F]

  21. The “First Bit” function. A Boolean function with n inputs that simply returns the value of the first input; all other inputs are irrelevant. What is the decision tree for this function?
      X₀  X₁  Y
      F   F   F
      F   T   F
      T   F   T
      T   T   T
      [Tree: a single node testing X₀, with the T branch predicting T and the F branch predicting F]
      Exercise: Convince yourself that ID3 will generate this tree
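To see why ID3 puts X₀ at the root, here is a small check (illustrative names, not lecture code) of the information gain of each attribute on the four-row truth table above; after splitting on X₀ both children are pure, so the tree above is exactly what ID3 builds:

```python
import math

def entropy(q):
    return 0.0 if q in (0.0, 1.0) else -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

# The four examples of the "first bit" function, as (x0, x1, y).
examples = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]

def info_gain(examples, attr_index):
    """Entropy of the labels minus the weighted entropy after splitting
    on the attribute at position attr_index (0 for X0, 1 for X1)."""
    labels = [y for *_, y in examples]
    def h(ys):
        return entropy(sum(ys) / len(ys)) if ys else 0.0
    total = h(labels)
    for value in (0, 1):
        subset = [y for *xs, y in examples if xs[attr_index] == value]
        total -= len(subset) / len(examples) * h(subset)
    return total

print(info_gain(examples, 0))  # 1.0 -> ID3 splits on X0 first
print(info_gain(examples, 1))  # 0.0 -> X1 gives no information
```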

  22. The best case scenario: perfect data. Suppose we have all 2ⁿ examples for training. What will the error be on any future examples? Zero! Because we have seen every possible input, the decision tree can represent the function, and ID3 will build a consistent tree.
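A quick sanity check of that claim on the "first bit" function with n = 4, using scikit-learn's DecisionTreeClassifier as a stand-in for ID3 (an assumption; the lecture builds the tree with ID3 itself):

```python
from itertools import product
from sklearn.tree import DecisionTreeClassifier  # stand-in for ID3

n = 4
X = [list(bits) for bits in product([0, 1], repeat=n)]  # all 2^n inputs
y = [bits[0] for bits in X]                              # "first bit" labels

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# Any "future" example is one of the 2^n inputs we already trained on,
# so a consistent tree gets every one of them right.
print(tree.score(X, y))  # 1.0, i.e. zero error
```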

  24. Noisy data. What if the data is noisy, and we have all 2ⁿ examples? Suppose the outputs of both training and test sets are randomly corrupted. Train and test sets are no longer identical; both have noise, possibly different.
      X₀  X₁  X₂  Y
      F   F   F   F
      F   F   T   F
      F   T   F   F
      F   T   T   F
      T   F   F   T
      T   F   T   T
      T   T   F   T
      T   T   T   T

  25. Noisy data. What if the data is noisy, and we have all 2ⁿ examples? Suppose the outputs of both training and test sets are randomly corrupted. Train and test sets are no longer identical; both have noise, possibly different.
      X₀  X₁  X₂  Y
      F   F   F   F
      F   F   T   F → T (corrupted)
      F   T   F   F
      F   T   T   F
      T   F   F   T
      T   F   T   T → F (corrupted)
      T   T   F   T
      T   T   T   T

  26. E.g.: Output corrupted with probability 0.25. The data is noisy, and we have all 2ⁿ examples. [Plot: test accuracy for different input sizes (2 to 15 features); the error bars are generated by running the same experiment multiple times for the same setting]

  27. E.g.: Output corrupted with probability 0.25. The data is noisy, and we have all 2ⁿ examples. [Plot: test accuracy for different input sizes; error ≈ 0.375] We can analytically compute the test error in this case. Correct prediction: P(training example uncorrupted AND test example uncorrupted) = 0.75 × 0.75, and P(training example corrupted AND test example corrupted) = 0.25 × 0.25, so P(correct prediction) = 0.625. Incorrect prediction: P(training example uncorrupted AND test example corrupted) = 0.75 × 0.25, and P(training example corrupted AND test example uncorrupted) = 0.25 × 0.75, so P(incorrect prediction) = 0.375.
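The same number can be checked empirically. A short simulation sketch of my own (n = 12 and the seed are arbitrary choices): since a tree consistent with all 2ⁿ corrupted training examples just reproduces each input's training label, the test error is the probability that the training and test labels disagree.

```python
import random

random.seed(0)
p, n = 0.25, 12
num_inputs = 2 ** n

# True labels of the "first bit" function over all 2^n inputs:
# the label of input i is its most significant bit.
true = [(i >> (n - 1)) & 1 for i in range(num_inputs)]

def corrupt(labels, p):
    """Independently flip each label with probability p."""
    return [1 - y if random.random() < p else y for y in labels]

train = corrupt(true, p)  # what the tree memorizes
test = corrupt(true, p)   # what it is evaluated against

# A tree consistent with all 2^n training examples predicts the training
# label, so an error occurs exactly when the train and test labels disagree.
error = sum(t != s for t, s in zip(train, test)) / num_inputs
print(error)  # roughly 2 * 0.75 * 0.25 = 0.375
```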

  28. E.g.: Output corrupted with probability 0.25. The data is noisy, and we have all 2ⁿ examples. What about the training accuracy? [Plot: test accuracy for different input sizes (2 to 15 features)]
