SLIDE 1

CSC 411: Lecture 06: Decision Trees

Class based on Raquel Urtasun & Rich Zemel’s lectures Sanja Fidler

University of Toronto

Jan 26, 2016

SLIDE 2

Today

Decision Trees

◮ entropy
◮ information gain

SLIDE 3

Another Classification Idea

We tried linear classification (e.g., logistic regression) and nearest neighbors. Any other idea?

◮ Pick an attribute, do a simple test
◮ Conditioned on that choice, pick another attribute and do another test
◮ In the leaves, assign a class with majority vote
◮ Do the other branches as well

SLIDE 4

Another Classification Idea

Gives axis-aligned decision boundaries

SLIDE 5

Decision Tree: Example

[Figure: example decision tree with yes/no branches at each attribute test]

SLIDE 6

Decision Tree: Classification

SLIDE 7

Example with Discrete Inputs

What if the attributes are discrete?

Attributes:

SLIDE 8

Decision Tree: Example with Discrete Inputs

The tree to decide whether to wait (T) or not (F)

SLIDE 9

Decision Trees

[Figure: decision tree with yes/no branches at the internal nodes]

Internal nodes test attributes
Branching is determined by attribute value
Leaf nodes are outputs (class assignments)

SLIDE 10

Decision Tree: Algorithm

Choose an attribute on which to descend at each level
Condition on earlier (higher) choices
Generally, restrict only one dimension at a time
Declare an output value when you get to the bottom
In the orange/lemon example, we only split each dimension once, but that is not required.
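To make the procedure concrete, here is a minimal sketch of a hand-built tree as nested attribute tests. The width/height attributes and the thresholds are hypothetical illustrative values, not the actual splits from the orange/lemon figure:

```python
# A tiny hand-built decision tree as nested if-tests.
# Attributes (width, height) and thresholds are made-up illustrative values.

def classify_fruit(width_cm: float, height_cm: float) -> str:
    """Walk the tree: test one attribute, condition on it, test another."""
    if width_cm > 6.5:            # root: test one attribute
        return "orange"           # leaf: majority class of examples in this region
    else:
        if height_cm > 9.0:       # conditioned on the first test, test another attribute
            return "lemon"
        else:
            return "orange"

print(classify_fruit(7.2, 7.0))   # -> orange
print(classify_fruit(5.0, 10.0))  # -> lemon
```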

SLIDE 11

Decision Tree: Classification and Regression

Each path from root to a leaf defines a region Rm of input space
Let {(x^(m1), t^(m1)), . . . , (x^(mk), t^(mk))} be the training examples that fall into Rm

Classification tree:
◮ discrete output
◮ leaf value y^m typically set to the most common value in {t^(m1), . . . , t^(mk)}

Regression tree:
◮ continuous output
◮ leaf value y^m typically set to the mean value in {t^(m1), . . . , t^(mk)}

Note: We will only talk about classification
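A small sketch of the two leaf-value rules above (most common target for a classification leaf, mean target for a regression leaf); the helper names are just for illustration:

```python
# Leaf value y_m from the targets t^(m1)..t^(mk) falling into region R_m.
from collections import Counter

def leaf_value_classification(targets):
    """Most common target among the training examples in the leaf's region."""
    return Counter(targets).most_common(1)[0][0]

def leaf_value_regression(targets):
    """Mean target among the training examples in the leaf's region."""
    return sum(targets) / len(targets)

print(leaf_value_classification(["lemon", "orange", "orange"]))  # -> orange
print(leaf_value_regression([2.0, 4.0, 9.0]))                    # -> 5.0
```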

[Slide credit: S. Russell]

SLIDE 12

Expressiveness

Discrete-input, discrete-output case:

◮ Decision trees can express any function of the input attributes.
◮ E.g., for Boolean functions, truth table row → path to leaf

Continuous-input, continuous-output case:

◮ Can approximate any function arbitrarily closely

Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples
Need some kind of regularization to ensure more compact decision trees

[Slide credit: S. Russell]

SLIDE 13

How do we Learn a Decision Tree?

How do we construct a useful decision tree?

SLIDE 14

Learning Decision Trees

Learning the simplest (smallest) decision tree is an NP-complete problem [if you are interested, check: Hyafil & Rivest '76]
Resort to a greedy heuristic:

◮ Start from an empty decision tree
◮ Split on the next best attribute
◮ Recurse

What is the best attribute? We use information theory to guide us

[Slide credit: D. Sonntag]

SLIDE 15

Choosing a Good Attribute

Which attribute is better to split on, X1 or X2?
Idea: Use counts at leaves to define probability distributions, so we can measure uncertainty

[Slide credit: D. Sonntag]

SLIDE 16

Choosing a Good Attribute

Which attribute is better to split on, X1 or X2?

◮ Deterministic: good (all are true or false; just one class in the leaf)
◮ Uniform distribution: bad (all classes in the leaf equally probable)
◮ What about distributions in between?

Note: Let’s take a slight detour and remember concepts from information theory

[Slide credit: D. Sonntag]

SLIDE 17

We Flip Two Different Coins

Sequence 1:

0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... ?

Sequence 2:

0 1 0 1 0 1 1 1 0 1 0 0 1 1 0 1 0 1 ... ?

[Counts: Sequence 1 has 16 zeros and 2 ones; Sequence 2 has 8 zeros and 10 ones]

SLIDE 18

Quantifying Uncertainty

Entropy H:

H(X) = −∑_{x∈X} p(x) log2 p(x)

First coin (probabilities 8/9 and 1/9):

−(8/9) log2(8/9) − (1/9) log2(1/9) ≈ 1/2

Second coin (probabilities 4/9 and 5/9):

−(4/9) log2(4/9) − (5/9) log2(5/9) ≈ 0.99

How surprised are we by a new value in the sequence? How much information does it convey?
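A quick sketch reproducing the two values above with the entropy formula, assuming the coin probabilities 8/9 vs 1/9 and 4/9 vs 5/9 estimated from the sequences:

```python
# Entropy H(X) = -sum_x p(x) log2 p(x), applied to the two coins above.
from math import log2

def entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([8/9, 1/9]))  # ~0.503 bits (the nearly deterministic coin)
print(entropy([4/9, 5/9]))  # ~0.991 bits (the close-to-fair coin)
```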

SLIDE 19

Quantifying Uncertainty

H(X) = −∑_{x∈X} p(x) log2 p(x)

[Plot: entropy of a binary variable as a function of the probability p of heads; it is 0 at p = 0 and p = 1, and maximal (1 bit) at p = 0.5]

SLIDE 20

Entropy

“High Entropy”:

◮ Variable has a uniform-like distribution
◮ Flat histogram
◮ Values sampled from it are less predictable

“Low Entropy”:

◮ Distribution of variable has many peaks and valleys
◮ Histogram has many lows and highs
◮ Values sampled from it are more predictable

[Slide credit: Vibhav Gogate]

SLIDE 21

Entropy of a Joint Distribution

Example: X = {Raining, Not raining}, Y = {Cloudy, Not cloudy}

              Cloudy    Not cloudy
Raining       24/100    1/100
Not raining   25/100    50/100

H(X, Y) = −∑_{x∈X} ∑_{y∈Y} p(x, y) log2 p(x, y)
        = −(24/100) log2(24/100) − (1/100) log2(1/100) − (25/100) log2(25/100) − (50/100) log2(50/100)
        ≈ 1.56 bits
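A short sketch computing the joint entropy of this table; the dictionary layout is just one convenient encoding of the four cells:

```python
# Joint entropy H(X, Y) = -sum_{x,y} p(x,y) log2 p(x,y) for the rain/cloud table.
from math import log2

joint = {("rain", "cloudy"): 24/100, ("rain", "clear"): 1/100,
         ("no rain", "cloudy"): 25/100, ("no rain", "clear"): 50/100}

H_XY = -sum(p * log2(p) for p in joint.values() if p > 0)
print(round(H_XY, 2))  # -> 1.56 bits
```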

SLIDE 22

Specific Conditional Entropy

Example: X = {Raining, Not raining}, Y = {Cloudy, Not cloudy}

              Cloudy    Not cloudy
Raining       24/100    1/100
Not raining   25/100    50/100

What is the entropy of cloudiness Y, given that it is raining?

H(Y | X = x) = −∑_{y∈Y} p(y|x) log2 p(y|x)
             = −(24/25) log2(24/25) − (1/25) log2(1/25)
             ≈ 0.24 bits

We used: p(y|x) = p(x, y) / p(x), and p(x) = ∑_y p(x, y) (sum over a row)

SLIDE 23

Conditional Entropy

              Cloudy    Not cloudy
Raining       24/100    1/100
Not raining   25/100    50/100

The expected conditional entropy:

H(Y | X) = ∑_{x∈X} p(x) H(Y | X = x)
         = −∑_{x∈X} ∑_{y∈Y} p(x, y) log2 p(y|x)

SLIDE 24

Conditional Entropy

Example: X = {Raining, Not raining}, Y = {Cloudy, Not cloudy}

              Cloudy    Not cloudy
Raining       24/100    1/100
Not raining   25/100    50/100

What is the entropy of cloudiness, given the knowledge of whether or not it is raining?

H(Y | X) = ∑_{x∈X} p(x) H(Y | X = x)
         = (1/4) H(cloudy | is raining) + (3/4) H(cloudy | not raining)
         ≈ 0.75 bits
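A sketch reproducing both the specific conditional entropy from the previous slide (≈ 0.24 bits) and the expected conditional entropy here (≈ 0.75 bits); the helper names are illustrative only:

```python
# Specific conditional entropy H(Y|X=x) and expected conditional entropy H(Y|X)
# for the rain/cloud table.
from math import log2

joint = {("rain", "cloudy"): 24/100, ("rain", "clear"): 1/100,
         ("no rain", "cloudy"): 25/100, ("no rain", "clear"): 50/100}

def H_Y_given_x(joint, x):
    """H(Y | X = x): entropy of the row for x, using p(y|x) = p(x,y) / p(x)."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return -sum((p / p_x) * log2(p / p_x)
                for (xi, _), p in joint.items() if xi == x and p > 0)

def H_Y_given_X(joint):
    """H(Y | X) = sum_x p(x) H(Y | X = x)."""
    xs = {xi for (xi, _) in joint}
    return sum(sum(p for (xi, _), p in joint.items() if xi == x) * H_Y_given_x(joint, x)
               for x in xs)

print(round(H_Y_given_x(joint, "rain"), 2))  # -> 0.24 bits
print(round(H_Y_given_X(joint), 2))          # -> 0.75 bits
```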

SLIDE 25

Conditional Entropy

Some useful properties:

◮ H is always non-negative
◮ Chain rule: H(X, Y) = H(X|Y) + H(Y) = H(Y|X) + H(X)
◮ If X and Y are independent, then X doesn't tell us anything about Y: H(Y|X) = H(Y)
◮ But Y tells us everything about Y: H(Y|Y) = 0
◮ By knowing X, we can only decrease uncertainty about Y: H(Y|X) ≤ H(Y)
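As a quick numerical sanity check of the chain rule on the rain/cloud table, a minimal sketch with the probabilities computed inline:

```python
# Check H(X, Y) = H(X) + H(Y|X) on the rain/cloud table.
from math import log2

H = lambda ps: -sum(p * log2(p) for p in ps if p > 0)

H_XY = H([24/100, 1/100, 25/100, 50/100])                        # joint entropy
H_X = H([25/100, 75/100])                                        # p(rain), p(no rain)
H_Y_given_X = 25/100 * H([24/25, 1/25]) + 75/100 * H([25/75, 50/75])

print(round(H_XY, 3), round(H_X + H_Y_given_X, 3))               # both ~1.561
```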

SLIDE 26

Information Gain

              Cloudy    Not cloudy
Raining       24/100    1/100
Not raining   25/100    50/100

How much information about cloudiness do we get by discovering whether it is raining?

IG(Y|X) = H(Y) − H(Y|X) ≈ 0.25 bits

Also called the information gain in Y due to X
If X is completely uninformative about Y: IG(Y|X) = 0
If X is completely informative about Y: IG(Y|X) = H(Y)
How can we use this to construct our decision tree?
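A minimal sketch computing this gain from the table, using H(Y) with p(cloudy) = 49/100:

```python
# Information gain IG(Y|X) = H(Y) - H(Y|X) on the rain/cloud example.
from math import log2

H = lambda ps: -sum(p * log2(p) for p in ps if p > 0)

H_Y = H([49/100, 51/100])                                        # p(cloudy), p(not cloudy)
H_Y_given_X = 25/100 * H([24/25, 1/25]) + 75/100 * H([25/75, 50/75])

print(round(H_Y - H_Y_given_X, 2))  # -> 0.25 bits
```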

SLIDE 27

Constructing Decision Trees

[Figure: decision tree for the fruit data with yes/no branches]

I made the fruit data partitioning just by eyeballing it
We can use the information gain to automate the process
At each level, one must choose:

  • 1. Which variable to split.
  • 2. Possibly where to split it.

Choose them based on how much information we would gain from the decision! (choose the attribute that gives the highest gain)

SLIDE 28

Decision Tree Construction Algorithm

Simple, greedy, recursive approach that builds up the tree node-by-node (a code sketch follows the steps below)

  • 1. pick an attribute to split at a non-terminal node
  • 2. split examples into groups based on attribute value
  • 3. for each group:

◮ if no examples – return majority from parent
◮ else if all examples in same class – return class
◮ else loop to step 1
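A minimal runnable sketch of this loop, assuming discrete attributes and entropy-based information gain; the helper layout is hypothetical, not the course's code:

```python
# Greedy, recursive decision tree construction (discrete attributes, IG splitting).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """IG from splitting (rows, labels) on attribute attr."""
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

def build_tree(rows, labels, attrs, parent_majority=None):
    if not labels:                                    # no examples: majority from parent
        return parent_majority
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:            # pure leaf (or nothing left to split)
        return majority
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))   # 1. pick an attribute
    tree = {"attr": best, "children": {}, "majority": majority}
    for value in {row[best] for row in rows}:                     # 2. split on its values
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = [r for r, _ in sub], [y for _, y in sub]
        tree["children"][value] = build_tree(                     # 3. recurse on each group
            sub_rows, sub_labels, [a for a in attrs if a != best], majority)
    return tree
```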

SLIDE 29

Back to Our Example

Attributes:

[from: Russell & Norvig]

SLIDE 30

Attribute Selection

IG(Y|X) = H(Y) − H(Y|X)

IG(type) = 1 − [ (2/12) H(Y|Fr.) + (2/12) H(Y|It.) + (4/12) H(Y|Thai) + (4/12) H(Y|Bur.) ] = 0

IG(Patrons) = 1 − [ (2/12) H(0, 1) + (4/12) H(1, 0) + (6/12) H(2/6, 4/6) ] ≈ 0.541
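A short sketch reproducing these two gains, with the per-value class counts read off the coefficients in the formulas above (2/12, 2/12, 4/12, 4/12 for Type; 2/12, 4/12, 6/12 for Patrons):

```python
# IG(type) and IG(Patrons) for the 12-example restaurant data summarized above.
from math import log2

def H(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def info_gain(branches, total=12):
    """branches: list of (wait_count, not_wait_count) per attribute value; H(Y) = 1."""
    return 1 - sum((a + b) / total * H([a, b]) for a, b in branches)

print(round(info_gain([(1, 1), (1, 1), (2, 2), (2, 2)]), 3))  # type:    0.0
print(round(info_gain([(0, 2), (4, 0), (2, 4)]), 3))          # Patrons: 0.541
```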

SLIDE 31

Which Tree is Better?

SLIDE 32

What Makes a Good Tree?

Not too small: need to handle important but possibly subtle distinctions in data
Not too big:

◮ Computational efficiency (avoid redundant, spurious attributes)
◮ Avoid over-fitting training examples

Occam’s Razor: find the simplest hypothesis (smallest tree) that fits the observations

Inductive bias: small trees with informative nodes near the root

SLIDE 33

Decision Tree Miscellany

Problems:

◮ You have exponentially less data at lower levels
◮ Too big of a tree can overfit the data
◮ Greedy algorithms don’t necessarily yield the global optimum

In practice, one often regularizes the construction process to try to get small but highly-informative trees. Decision trees can also be used for regression on real-valued outputs, but it requires a different formalism.

SLIDE 34

Comparison to k-NN

K-Nearest Neighbors
◮ Decision boundaries: piece-wise linear
◮ Test complexity: non-parametric, few parameters besides (all?) training examples

Decision Trees
◮ Decision boundaries: axis-aligned, tree structured
◮ Test complexity: depends on the attributes and splits

SLIDE 35

Applications of Decision Trees: XBox!

Decision trees are in XBox

[J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, A. Blake. Real-Time Human Pose Recognition in Parts from a Single Depth Image. CVPR’11]

SLIDE 36

Applications of Decision Trees: XBox!

Decision trees are in XBox: Classifying body parts

SLIDE 37

Applications of Decision Trees: XBox!

Trained on million(s) of examples

SLIDE 38

Applications of Decision Trees: XBox!

Trained on million(s) of examples
Results:

SLIDE 39

Applications of Decision Trees

Can express any Boolean function, but most useful when the function depends critically on a few attributes
Bad on: parity, majority functions; also not well-suited to continuous attributes
Practical Applications:

◮ Flight simulator: 20 state variables; 90K examples based on expert pilot’s actions; auto-pilot tree
◮ Yahoo Ranking Challenge
◮ Random Forests