CSC 411 Lecture 3: Decision Trees
Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla
University of Toronto
Today: Decision Trees
◮ Simple but powerful learning algorithm
◮ One of the most widely used learning algorithms in Kaggle competitions
[Figure: example decision tree with Yes/No branches]
◮ Discrete output (classification tree): leaf value y^m typically set to the most common value in {t^(m1), . . . , t^(mk)}
◮ Continuous output (regression tree): leaf value y^m typically set to the mean value in {t^(m1), . . . , t^(mk)} (see the sketch below)
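As a concrete illustration of these two leaf rules, here is a minimal sketch; the function name and signature are made up for illustration and are not code from the course:

from collections import Counter

def leaf_value(targets, task="classification"):
    # Prediction stored at a leaf, given the training targets
    # t^(m1), ..., t^(mk) that fall into the leaf's region.
    if task == "classification":                      # discrete output
        return Counter(targets).most_common(1)[0][0]  # most common value
    return sum(targets) / len(targets)                # continuous output: mean value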
◮ Decision trees can express any function of the input attributes
◮ E.g., for Boolean functions, truth table row → path to leaf (see the XOR sketch below)
◮ Can approximate any function arbitrarily closely
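For example, a small Boolean function such as XOR (a hypothetical example, not one from the slides) can be represented exactly: split on x1 at the root, then on x2, so each of the four truth-table rows gets its own root-to-leaf path.

def xor_tree(x1, x2):
    # Root tests x1; each branch then tests x2, so every truth-table
    # row (x1, x2) follows exactly one path to a leaf.
    if x1 == 0:
        return 0 if x2 == 0 else 1
    else:
        return 1 if x2 == 0 else 0

assert [xor_tree(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]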
◮ Start from an empty decision tree
◮ Split on the “best” attribute
◮ Recurse
◮ Choose based on accuracy?
◮ Deterministic: good (all are true or false; just one class in the leaf)
◮ Uniform distribution: bad (all classes in leaf equally probable)
◮ What about distributions in between?
[Plot: entropy vs. probability p of heads for a coin flip; both axes run from 0 to 1]
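The formula behind the plot did not survive extraction; assuming the standard definition, the curve is the entropy of a coin with heads-probability p, H(p) = −p log2 p − (1 − p) log2(1 − p). A minimal numeric check:

from math import log2

def binary_entropy(p):
    # Entropy (in bits) of a coin that lands heads with probability p.
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(binary_entropy(0.5))  # 1.0 bit: a fair coin is maximally unpredictable
print(binary_entropy(0.9))  # ≈ 0.47 bits: a heavily biased coin is easier to predict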
◮ High entropy: the variable has a uniform-like distribution; flat histogram; values sampled from it are less predictable
◮ Low entropy: the distribution of the variable has many peaks and valleys; histogram has many lows and highs; values sampled from it are more predictable
p(x) = Σ_y p(x, y)
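The equations around this point were lost in extraction; the standard definitions of specific and expected conditional entropy, which the properties and example below rely on, are assumed here:

H(Y \mid X = x) = -\sum_y p(y \mid x) \log_2 p(y \mid x)

H(Y \mid X) = \sum_x p(x)\, H(Y \mid X = x) = -\sum_{x}\sum_{y} p(x, y) \log_2 p(y \mid x)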
◮ H is always non-negative
◮ Chain rule: H(X, Y) = H(X|Y) + H(Y) = H(Y|X) + H(X)
◮ If X and Y are independent, then X doesn’t tell us anything about Y: H(Y|X) = H(Y)
◮ But Y tells us everything about Y: H(Y|Y) = 0
◮ By knowing X, we can only decrease uncertainty about Y: H(Y|X) ≤ H(Y)
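The splitting criterion used in the example below is information gain; its standard definition (assumed here, since the original slide content was lost) is:

IG(Y \mid X) = H(Y) - H(Y \mid X)

That is, information gain measures how much knowing the split attribute X reduces the uncertainty in the label Y.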
◮ Entropy of the class labels before the split: −(49/149) log2(49/149) − (100/149) log2(100/149) ≈ 0.91
◮ Information gain of the split: 0.91 − (1/3 · 0 + 2/3 · 1) ≈ 0.24 > 0
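A quick numeric check of the arithmetic above (a sketch; the leaf entropies 0 and 1 and the 1/3 and 2/3 weights are taken from the example as given):

import math

def entropy(counts):
    # Entropy (in bits) of a class distribution given as counts.
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

h_root = entropy([49, 100])     # ≈ 0.91 bits before the split
h_cond = 1/3 * 0 + 2/3 * 1      # expected entropy of the two leaves after the split
print(h_root, h_root - h_cond)  # gain ≈ 0.25 bits (0.24 when starting from the rounded 0.91); > 0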
◮ if no examples – return majority from parent
◮ else if all examples in same class – return class
◮ else loop to step 1
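A minimal sketch of this greedy, recursive construction procedure for discrete attributes, using information gain as the splitting criterion. The data representation (a list of attribute dictionaries plus a parallel label list) is an assumption made for illustration, not the course's code:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, attr):
    # H(labels) minus the expected label entropy after splitting on attr.
    n = len(labels)
    cond = 0.0
    for value in set(x[attr] for x in examples):
        sub = [labels[i] for i, x in enumerate(examples) if x[attr] == value]
        cond += len(sub) / n * entropy(sub)
    return entropy(labels) - cond

def build_tree(examples, labels, attrs, parent_majority=None):
    if not examples:                            # no examples: return majority from parent
        return parent_majority
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:      # all one class, or nothing left to split on
        return majority
    best = max(attrs, key=lambda a: info_gain(examples, labels, a))  # the "best" attribute
    children = {}
    for value in set(x[best] for x in examples):  # recurse on each subgroup
        idx = [i for i, x in enumerate(examples) if x[best] == value]
        children[value] = build_tree([examples[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [a for a in attrs if a != best],
                                     majority)
    return {"split on": best, "children": children}

Real implementations also add stopping criteria (e.g., a maximum depth or a minimum number of examples per leaf) to keep the tree from growing too large, which connects to the overfitting point below.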
[Figure from: Russell & Norvig]
◮ Computational efficiency (avoid redundant, spurious attributes)
◮ Avoid over-fitting the training examples
◮ Human interpretability
◮ Occam’s razor: prefer the simplest hypothesis that fits the data
◮ Useful principle, but hard to formalize (how to define simplicity?)
◮ See Domingos, 1999, “The role of Occam’s razor in knowledge discovery”
◮ You have exponentially less data at lower levels
◮ Too big a tree can overfit the data
◮ Greedy algorithms don’t necessarily yield the global optimum
◮ For real-valued attributes: split based on a threshold, chosen to maximize information gain (see the sketch below)
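A sketch of the threshold search for a single real-valued attribute. Taking the candidate thresholds to be midpoints between consecutive sorted values is a common convention assumed here, not something specified in the slides:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    # Return the threshold t maximizing the information gain of the
    # split x <= t vs. x > t, along with that gain.
    n, h_root = len(labels), entropy(labels)
    xs = sorted(set(values))
    best_gain, best_t = 0.0, None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        gain = h_root - (len(left) / n * entropy(left) + len(right) / n * entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain  # best_t is None if no split improves on the root entropy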
◮ As we’ll see next lecture, ensembles of decision trees are much stronger