Decision Trees, by Gavin Brown (PowerPoint presentation)



SLIDE 1

Decision Trees

Gavin Brown

SLIDE 2

Every Learning Method has Limitations

Linear model? KNN? SVM?

SLIDE 3

Explain your decisions

Sometimes we need interpretable results from our techniques. How do you explain the above decision?

SLIDE 4

Different types of data

Rugby players - height, weight can be plotted in 2-d. How do you plot hair colour? (Black, Brown, Blonde?) Predicting heart disease - how do you plot blood type? (A, B, O)? In general, how do you deal with categorical data?

SLIDE 5

The Tennis Problem

You are working for the local tennis club. They want a program that will advise inexperienced new members on whether they are likely to enjoy a game today, given the current weather conditions. However, they need the program to output interpretable rules, so they can be sure it is not giving bad advice. They provide you with some historical data...

SLIDE 6

The Tennis Problem

    Outlook   Temperature  Humidity  Wind    Play Tennis?
 1  Sunny     Hot          High      Weak    No
 2  Sunny     Hot          High      Strong  No
 3  Overcast  Hot          High      Weak    Yes
 4  Rain      Mild         High      Weak    Yes
 5  Rain      Cool         Normal    Weak    Yes
 6  Rain      Cool         Normal    Strong  No
 7  Overcast  Cool         Normal    Strong  Yes
 8  Sunny     Mild         High      Weak    No
 9  Sunny     Cool         Normal    Weak    Yes
10  Rain      Mild         Normal    Weak    Yes
11  Sunny     Mild         Normal    Strong  Yes
12  Overcast  Mild         High      Strong  Yes
13  Overcast  Hot          Normal    Weak    Yes
14  Rain      Mild         High      Strong  No

Note: 9 examples say 'yes', 5 examples say 'no'.
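For reference, the table above can be sketched as plain Python records (the column names are taken directly from the table):

```python
# The tennis dataset from the table above, as a list of dicts.
COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind", "Play"]
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
data = [dict(zip(COLUMNS, row)) for row in ROWS]

# Count the class labels: 9 'yes', 5 'no', as noted above.
labels = [d["Play"] for d in data]
print(labels.count("Yes"), labels.count("No"))  # 9 5
```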

SLIDE 7

A Decision Tree for the Tennis Problem

This tree works for any example in the table — try it!

SLIDE 8

Learning a Decision Tree : Basic recursive algorithm

tree ← learntree( data )
  if all examples in data have the same label,
      return a leaf node with that label
  else
      pick the most "important" feature, call it F
      for each possible value v of F
          data(v) ← all examples where F == v
          add branch ← learntree( data(v) )
      endfor
      return tree
  endif

SLIDE 9

Example: partitioning data by “wind” feature

Wind = Strong:

    Outlook   Temp  Humid   Wind    Play?
 2  Sunny     Hot   High    Strong  No
 6  Rain      Cool  Normal  Strong  No
 7  Overcast  Cool  Normal  Strong  Yes
11  Sunny     Mild  Normal  Strong  Yes
12  Overcast  Mild  High    Strong  Yes
14  Rain      Mild  High    Strong  No

3 examples say 'yes', 3 say 'no'.

Wind = Weak:

    Outlook   Temp  Humid   Wind  Play?
 1  Sunny     Hot   High    Weak  No
 3  Overcast  Hot   High    Weak  Yes
 4  Rain      Mild  High    Weak  Yes
 5  Rain      Cool  Normal  Weak  Yes
 8  Sunny     Mild  High    Weak  No
 9  Sunny     Cool  Normal  Weak  Yes
10  Rain      Mild  Normal  Weak  Yes
13  Overcast  Hot   Normal  Weak  Yes

6 examples say 'yes', 2 examples say 'no'.
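This partition can be sketched in a few lines of Python, keeping only the Wind and Play columns of the table (in row order):

```python
from collections import Counter

# (Wind, Play) pairs for rows 1-14 of the tennis table.
wind_play = [
    ("Weak", "No"), ("Strong", "No"), ("Weak", "Yes"), ("Weak", "Yes"),
    ("Weak", "Yes"), ("Strong", "No"), ("Strong", "Yes"), ("Weak", "No"),
    ("Weak", "Yes"), ("Weak", "Yes"), ("Strong", "Yes"), ("Strong", "Yes"),
    ("Weak", "Yes"), ("Strong", "No"),
]

# Partition by the value of the Wind feature and count labels in each branch.
for value in ("Strong", "Weak"):
    counts = Counter(play for wind, play in wind_play if wind == value)
    print(value, counts["Yes"], counts["No"])
# Strong: 3 yes, 3 no.  Weak: 6 yes, 2 no.
```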

SLIDE 10

Learning a Decision Tree : Basic recursive algorithm

tree ← learntree( data )
  if all examples in data have the same label,
      return a leaf node with that label
  else
      pick the most "important" feature, call it F
      for each possible value v of F
          data(v) ← all examples where F == v
          add branch ← learntree( data(v) )
      endfor
      return tree
  endif

Which is the most important feature?

SLIDE 11

Thinking in Probabilities...

Before the split: 9 'yes', 5 'no', so p('yes') = 9/14 ≈ 0.64.

On the left branch: 3 'yes', 3 'no', so p('yes') = 3/6 = 0.5.

On the right branch: 6 'yes', 2 'no', so p('yes') = 6/8 = 0.75.

Remember: p('no') = 1 − p('yes').

SLIDE 12

The ‘Information’ contained in a variable - Entropy

More uncertainty = less information. H(X) = 1.0 (a 50/50 binary split, the maximum possible for a two-valued variable).

SLIDE 13

The ‘Information’ contained in a variable - Entropy

Lower uncertainty = more information. H(X) = 0.72193 (for a binary variable, this corresponds to probabilities of about 0.8 and 0.2).

SLIDE 14

Entropy

The amount of randomness in a variable X is called the 'entropy':

H(X) = − Σ_i p(x_i) log p(x_i)    (1)

The log is base 2, giving us units of measurement of 'bits'.
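Equation (1) is a one-liner in Python. As a sketch, this reproduces the entropy values quoted on the slides above (1.0 for a 50/50 split, and 0.72193 for roughly an 80/20 split):

```python
import math

def entropy(probs):
    """Entropy in bits: H(X) = -sum_i p(x_i) * log2 p(x_i).
    Terms with p = 0 are skipped, since lim p->0 of p*log p is 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bits: maximum uncertainty for a binary variable
print(entropy([0.8, 0.2]))    # ~0.72193 bits: lower uncertainty
print(entropy([9/14, 5/14]))  # ~0.94029 bits: the tennis dataset before splitting
```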

SLIDE 15

Reducing Entropy = Maximise Information Gain

The variable of interest is T (for tennis), taking the values 'yes' or 'no'.

Before the split: 9 'yes', 5 'no', so p('yes') = 9/14 ≈ 0.64.

In the whole dataset, the entropy is:

H(T) = − Σ_i p(x_i) log p(x_i) = − ( (5/14) log(5/14) + (9/14) log(9/14) ) = 0.94029

H(T) is the entropy before we split. See the worked example in the supporting material.

SLIDE 16

Reducing Entropy = Maximise Information Gain

H(T) is the entropy before we split.
H(T|W = strong) is the entropy of the data on the left branch.
H(T|W = weak) is the entropy of the data on the right branch.
H(T|W) is the weighted average of the two.
Choose the feature with the maximum value of H(T) − H(T|W).
See the worked example in the supporting material.
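The steps above can be computed directly for the Wind split from slide 9 (a sketch; counts are taken from the partition tables earlier):

```python
import math

def entropy(counts):
    """Entropy in bits from a list of class counts."""
    total = sum(counts)
    return -sum(c/total * math.log2(c/total) for c in counts if c > 0)

H_T = entropy([9, 5])        # before the split: ~0.94029
H_strong = entropy([3, 3])   # left branch (Wind = Strong): 1.0
H_weak = entropy([6, 2])     # right branch (Wind = Weak): ~0.81128

# H(T|W): average of the branch entropies, weighted by branch size.
H_T_given_W = (6/14) * H_strong + (8/14) * H_weak

# Information gain for the Wind feature: H(T) - H(T|W).
gain = H_T - H_T_given_W
print(round(gain, 5))  # ~0.048: Wind is a weak predictor on its own
```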

SLIDE 17

Learning a Decision Tree : the ID3 algorithm

tree ← learntree( data )
  if all examples in data have the same label,
      return a leaf node with that label
  else
      pick the most "important" feature, call it F
      for each possible value v of F
          data(v) ← all examples where F == v
          add branch ← learntree( data(v) )
      endfor
      return tree
  endif

Or, in very simple terms:
Step 1. Pick the feature that maximises information gain.
Step 2. Recurse on each branch.

SLIDE 18

The ID3 algorithm

function id3( examples ) returns tree T
  if all the items in examples have the same conclusion,
      return a leaf node with value = majority conclusion
  let A be the feature with the largest information gain
  create a blank tree T
  let s(1), s(2), s(3), ... be the data subsets produced by splitting examples on feature A
  for each subset s(n)
      tree t(n) = id3( s(n) )
      add t(n) as a new branch of T
  endfor
  return T
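As a sketch, the whole procedure fits in a short Python program; the stopping rule used here (stop when labels agree or features run out, returning the majority label) is one reasonable reading of the pseudocode above, not the only one:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c/total * math.log2(c/total) for c in Counter(labels).values())

def info_gain(data, feature, target):
    """H(target) minus the weighted entropy after splitting on `feature`."""
    remainder = 0.0
    for value in {row[feature] for row in data}:
        subset = [row[target] for row in data if row[feature] == value]
        remainder += len(subset) / len(data) * entropy(subset)
    return entropy([row[target] for row in data]) - remainder

def id3(data, features, target):
    labels = [row[target] for row in data]
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    best = max(features, key=lambda f: info_gain(data, f, target))
    return {best: {
        value: id3([row for row in data if row[best] == value],
                   [f for f in features if f != best], target)
        for value in {row[best] for row in data}
    }}

# The tennis table from earlier.
rows = [("Sunny","Hot","High","Weak","No"), ("Sunny","Hot","High","Strong","No"),
        ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
        ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
        ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
        ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
        ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
        ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No")]
cols = ["Outlook", "Temperature", "Humidity", "Wind", "Play"]
data = [dict(zip(cols, r)) for r in rows]

tree = id3(data, cols[:-1], "Play")
print(tree)  # the root split is Outlook, the highest-gain feature
```

Running this on the tennis data recovers the tree from the slides: Outlook at the root, with the Sunny branch splitting on Humidity, the Rain branch on Wind, and Overcast an immediate 'Yes' leaf.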

SLIDE 19

A Decision Tree for the Tennis Problem

Following each path down the tree, we can read off a list of rules:

if ( sunny AND high )    → NO
if ( sunny AND normal )  → YES
if ( overcast )          → YES
if ( rain AND strong )   → NO
if ( rain AND weak )     → YES
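Reading off rules is just a walk over every root-to-leaf path. A minimal sketch, with the tree hardcoded as the nested structure implied by the rules on this slide:

```python
# The tennis tree, as implied by the rule list above.
tree = {"Outlook": {
    "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def rules(node, path=()):
    """Yield (condition, outcome) for every root-to-leaf path."""
    if not isinstance(node, dict):  # leaf node: emit one rule
        yield " AND ".join(path), node
        return
    (feature, branches), = node.items()
    for value, child in branches.items():
        yield from rules(child, path + (f"{feature}={value}",))

for condition, outcome in rules(tree):
    print(f"if ({condition}) -> {outcome}")
# Five rules, one per leaf of the tree.
```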

SLIDE 20

’Overfitting’ a tree

◮ The number of possible paths tells you the number of rules.
◮ More rules = more complicated.
◮ We could have N rules, where N is the size of the dataset.

This would mean no generalisation outside the training data: the tree is overfitted. Overfitting = fine-tuning the tree to the training examples rather than the underlying pattern.

SLIDE 21

Overfitting

What if it’s rainy and hot?

    Outlook   Temperature  Humidity  Wind    Play Tennis?
 1  Sunny     Hot          High      Weak    No
 2  Sunny     Hot          High      Strong  No
 3  Overcast  Hot          High      Weak    Yes
 4  Rain      Mild         High      Weak    Yes
 5  Rain      Cool         Normal    Weak    Yes
 6  Rain      Cool         Normal    Strong  No
 7  Overcast  Cool         Normal    Strong  Yes
 8  Sunny     Mild         High      Weak    No
 9  Sunny     Cool         Normal    Weak    Yes
10  Rain      Mild         Normal    Weak    Yes
11  Sunny     Mild         Normal    Strong  Yes
12  Overcast  Mild         High      Strong  Yes
13  Overcast  Hot          Normal    Weak    Yes
14  Rain      Mild         High      Strong  No

SLIDE 22

Overfitting

How do you know if you've overfitted?

◮ Use a "validation" dataset: another dataset that you do not use to train, but just to check whether you've overfitted or not.

How can we avoid it?

◮ Stop growing the tree after a certain depth (i.e. keep the tree short)
◮ Post-prune the final tree
◮ ... both in order to control validation error

SLIDE 23

Overfitting

SLIDE 24

Missing data?

    Outlook   Temperature  Humidity  Wind    Play Tennis?
 1  Sunny     Hot          High      Weak    No
 2  Sunny     Hot          High      Strong  No
 3  Overcast  ?            High      Weak    Yes
 4  Rain      Mild         High      Weak    Yes
 5  Rain      Cool         Normal    Weak    Yes
 6  Rain      ?            Normal    ?       No
 7  Overcast  Cool         Normal    ?       Yes
 8  Sunny     ?            High      ?       No
 9  Sunny     Cool         Normal    Weak    Yes
10  Rain      Mild         Normal    Weak    Yes
11  Sunny     ?            Normal    Strong  Yes
12  Overcast  ?            High      Strong  Yes
13  Overcast  ?            Normal    Weak    Yes
14  Rain      Mild         High      Strong  No

One option: insert the average (mean, median, or mode) of the available values. Or use more complex strategies, such as Bayes' rule... NEXT WEEK... Ultimately, the best strategy is problem dependent.
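For categorical features, the "average" of the available values is the mode. A minimal sketch of mode imputation (the toy column and `impute_mode` helper are illustrative, not from the slides):

```python
from collections import Counter

def impute_mode(column):
    """Replace missing values (None) with the most common observed value."""
    observed = [v for v in column if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in column]

# Toy column standing in for, e.g., the Humidity feature above.
humidity = ["High", "High", None, "High", "Normal", None, "Normal"]
print(impute_mode(humidity))
# The two missing entries are filled with "High", the mode of the column.
```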

SLIDE 25

Conclusion

Decision trees provide a flexible and interpretable model. There are many variations on the simple ID3 algorithm. Further reading: www.decisiontrees.net (a site written by a former student of this course). Why wasn't the Temperature feature used in the tree? Answer in the next session.