CSC 411: Lecture 06: Decision Trees
Class based on Raquel Urtasun & Rich Zemel's lectures
Sanja Fidler
University of Toronto
Jan 26, 2016
Today
◮ Decision trees
◮ entropy
◮ information gain
[Figure: example decision tree; each internal node tests an attribute and its branches are labeled Yes/No]
◮ Discrete output: leaf value y^m typically set to the most common value in {t^(m1), . . . , t^(mk)}, the targets of the training examples reaching leaf m
◮ Continuous output: leaf value y^m typically set to the mean value in {t^(m1), . . . , t^(mk)}
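A minimal sketch of how these leaf values could be computed (function names and data layout are my own, not from the slides):

```python
import numpy as np
from collections import Counter

def leaf_value_classification(targets):
    """Leaf prediction for a discrete output: the most common class
    among the training targets {t^(m1), ..., t^(mk)} that reach leaf m."""
    return Counter(targets).most_common(1)[0][0]

def leaf_value_regression(targets):
    """Leaf prediction for a continuous output: the mean of the
    training targets that reach leaf m."""
    return np.mean(targets)

# Example: targets of the training points falling into one leaf
print(leaf_value_classification(["yes", "no", "yes", "yes"]))  # -> "yes"
print(leaf_value_regression([2.0, 3.0, 4.0]))                  # -> 3.0
```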
◮ Decision trees can express any function of the input attributes.
◮ E.g., for Boolean functions, truth table row → path to leaf (see the XOR sketch below)
◮ Can approximate any function arbitrarily closely
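For instance, here is a hypothetical hand-built tree for Boolean XOR (my own illustration, not from the slides), where each of the four truth-table rows ends at its own leaf:

```python
def xor_tree(a, b):
    """XOR expressed as a decision tree: split on attribute a first,
    then on attribute b; each truth-table row maps to one root-to-leaf path."""
    if a:
        if b:
            return False   # a=1, b=1
        return True        # a=1, b=0
    if b:
        return True        # a=0, b=1
    return False           # a=0, b=0

# Reproduces the XOR truth table
for a in (False, True):
    for b in (False, True):
        print(a, b, xor_tree(a, b))
```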
◮ Start from an empty decision tree
◮ Split on the next best attribute
◮ Recurse
◮ Deterministic: good (all are true or false; just one class in the leaf)
◮ Uniform distribution: bad (all classes in the leaf equally probable)
◮ What about distributions in between?
[Figure: entropy of a binary variable as a function of the probability p of heads; both axes run from 0 to 1, and the curve peaks at p = 0.5]
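The plotted curve is the binary entropy H(p) = −p log2 p − (1 − p) log2(1 − p). A short sketch (my own code, not from the slides) that reproduces it numerically:

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in bits) of a coin with P(heads) = p:
    H(p) = -p log2 p - (1-p) log2 (1-p), using the convention 0 log 0 = 0."""
    terms = [q * np.log2(q) for q in (p, 1 - p) if q > 0]
    return -sum(terms)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f}  H = {binary_entropy(p):.3f} bits")
# H is 0 for a deterministic coin and peaks at 1 bit when p = 0.5
```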
High entropy:
◮ Variable has a uniform-like distribution
◮ Flat histogram
◮ Values sampled from it are less predictable
Low entropy:
◮ Distribution of variable has many peaks and valleys
◮ Histogram has many lows and highs
◮ Values sampled from it are more predictable
Marginalization: p(x) = Σ_y p(x, y)
◮ H is always non-negative
◮ Chain rule: H(X, Y) = H(X|Y) + H(Y) = H(Y|X) + H(X)
◮ If X and Y are independent, then X doesn't tell us anything about Y: H(Y|X) = H(Y)
◮ But Y tells us everything about Y: H(Y|Y) = 0
◮ By knowing X, we can only decrease uncertainty about Y: H(Y|X) ≤ H(Y)
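A small numerical check of the chain rule and the "conditioning never increases entropy" property, on a toy joint distribution invented purely for illustration:

```python
import numpy as np

# Toy joint distribution p(x, y); rows index X, columns index Y
# (the numbers are made up for illustration only)
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

def H(p):
    """Entropy in bits of a probability array, ignoring zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_XY = H(p_xy.ravel())       # joint entropy H(X, Y)
H_X  = H(p_xy.sum(axis=1))   # entropy of the marginal p(x) = sum_y p(x, y)
H_Y  = H(p_xy.sum(axis=0))   # entropy of the marginal p(y)

# Conditional entropy computed directly: H(Y|X) = sum_x p(x) H(Y | X = x)
p_x = p_xy.sum(axis=1)
H_Y_given_X = sum(p_x[i] * H(p_xy[i] / p_x[i]) for i in range(len(p_x)))

print(np.isclose(H_XY, H_Y_given_X + H_X))  # chain rule: H(X,Y) = H(Y|X) + H(X)
print(H_Y_given_X <= H_Y)                   # conditioning can only reduce entropy
```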
◮ If no examples: return majority from parent
◮ Else if all examples are in the same class: return the class
◮ Else loop to step 1 (see the sketch below)
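To make the greedy procedure concrete, here is a minimal Python sketch. The names, data layout (examples as dicts of attribute values), and helpers are my own, not from the slides; splits are chosen by information gain, and the base cases mirror the three rules above.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(examples, labels, attr):
    """IG(attr) = H(labels) - sum_v p(v) * H(labels | attr = v)."""
    cond = 0.0
    for v in set(ex[attr] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attr] == v]
        cond += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - cond

def build_tree(examples, labels, attrs, parent_majority=None):
    # Base case 1: no examples -- return majority class from the parent
    if not examples:
        return parent_majority
    majority = Counter(labels).most_common(1)[0][0]
    # Base case 2: all examples in the same class (or no attributes left)
    if len(set(labels)) == 1 or not attrs:
        return majority
    # Recursive case: split on the attribute with highest information gain
    best = max(attrs, key=lambda a: information_gain(examples, labels, a))
    rest = [a for a in attrs if a != best]
    tree = {best: {}}
    for v in set(ex[best] for ex in examples):
        sub = [(ex, lab) for ex, lab in zip(examples, labels) if ex[best] == v]
        sub_ex, sub_lab = zip(*sub)
        tree[best][v] = build_tree(list(sub_ex), list(sub_lab), rest, majority)
    return tree
```

Called as build_tree(examples, labels, attribute_names), it returns nested dicts for internal nodes and a class label at each leaf.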
[from: Russell & Norvig]
◮ Computational efficiency (avoid redundant, spurious attributes)
◮ Avoid over-fitting training examples
◮ You have exponentially less data at lower levels.
◮ Too big a tree can overfit the data.
◮ Greedy algorithms don't necessarily yield the global optimum.
[J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, A. Blake. Real-Time Human Pose Recognition in Parts from a Single Depth Image. CVPR'11]
◮ Flight simulator: 20 state variables; 90K examples based on expert pilot's actions
◮ Yahoo Ranking Challenge
◮ Random Forests