

SLIDE 1

Supervised Learning via Decision Trees

Lecture 8

March 27, 2017

SLIDE 2

Outline

  • 1. Learning via feature splits
  • 2. ID3
    – Information gain
  • 3. Extensions
    – Continuous features
    – Gain ratio
    – Ensemble learning


SLIDE 3

Decision Trees

  • Sequence of decisions at choice nodes, from the root to a leaf node
    – Each choice node splits on a single feature
  • Can be used for classification or regression
  • Explicit, easy for humans to understand
  • Typically very fast at testing/prediction time


https://en.wikipedia.org/wiki/Decision_tree_learning

SLIDE 4

Input Data (Weather)


SLIDE 5

Output Tree (Weather)


SLIDE 6

Training Issues

  • Approximation
    – Optimal tree-building is NP-complete
    – Typically greedy, top-down
  • Under/over-fitting
    – Occam’s Razor vs. memorizing unique identifiers (think CC/SSN)
    – Pruning, ensemble methods
  • Splitting metric
    – Information gain, gain ratio, Gini impurity


SLIDE 7

Iterative Dichotomiser 3

  • Invented by Ross Quinlan in 1986
    – Precursor to C4.5/C5.0
  • Categorical data only (can’t split on numbers)
  • Greedily consumes features
    – Subtrees cannot reconsider previous feature(s) for further splits
    – Typically produces shallow trees


SLIDE 8

ID3: Algorithm Sketch

  • If all examples are the “same”, return f(examples)
  • If no features remain, return f(examples)
  • A = “best” feature
    – For each distinct value of A:
      • branch = ID3( examples with that value of A, features − {A} )


Classification:

  • “same” = same class
  • f(examples) = majority
  • “best” = information gain

Regression:

  • “same” = std. dev. < ε
  • f(examples) = average
  • “best” = std. dev. reduction

http://www.saedsayad.com/decision_tree_reg.htm

SLIDE 9

Shannon Entropy

  • Measure of “impurity” or uncertainty
  • Intuition: the less likely the event, the more information is transmitted


SLIDE 10

Entropy Range


[Figure: example distributions illustrating small vs. large entropy]

SLIDE 11

Quantifying Entropy


H(X) = E[I(X)] = \sum_i P(x_i)\, I(x_i) \quad \text{(discrete)} \qquad H(X) = \int P(x)\, I(x)\, dx \quad \text{(continuous)}

Expected value of information.

SLIDE 12

Intuition for Information

  • Shouldn’t be negative
  • Events that always occur

communicate no information

  • Information from independent

events are additive


Desired properties of I(X):

I(X) \ge 0

I(1) = 0 \quad \text{(an event that always occurs communicates no information)}

I(X_1, X_2) = I(X_1) + I(X_2) \quad \text{(for independent } X_1, X_2\text{)}

SLIDE 13

Quantifying Information


I(X) = \log_b \frac{1}{P(X)} = -\log_b P(X)

Log base = units: 2 = bit (binary digit), 3 = trit, e = nat

H(X) = -\sum_i P(x_i) \log_b P(x_i)

Log base = units: 2 = shannon/bit

SLIDE 14

Example: Fair Coin Toss


I(\text{heads}) = \log_2 \frac{1}{0.5} = \log_2 2 = 1 \text{ bit}

I(\text{tails}) = \log_2 \frac{1}{0.5} = \log_2 2 = 1 \text{ bit}

H(\text{fair toss}) = (0.5)(1) + (0.5)(1) = 1 \text{ shannon}

SLIDE 15

Example: Double Headed Coin


H(\text{double head}) = (1) \cdot I(\text{head}) = (1) \cdot \log_2 \frac{1}{1} = (1)(0) = 0 \text{ shannons}

SLIDE 16

Exercise: Weighted Coin

Compute the entropy of a coin that will land on heads about 25% of the time, and tails the remaining 75%.


SLIDE 17

Answer


H(\text{weighted toss}) = (0.25) \cdot I(\text{head}) + (0.75) \cdot I(\text{tail}) = (0.25) \log_2 \frac{1}{0.25} + (0.75) \log_2 \frac{1}{0.75} \approx 0.81 \text{ shannons}
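These coin entropies are easy to check numerically; a minimal sketch (the helper name and interface are mine, not lecture code):

```python
# Sanity-check the coin-toss entropies above (sketch; not from the slides).
from math import log

def entropy(probs, base=2):
    """H(X) = -sum_i P(x_i) * log_b P(x_i); zero-probability outcomes contribute 0."""
    return -sum(p * log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin      -> 1.0 shannon
print(entropy([1.0]))         # double-headed  -> 0.0 shannons
print(entropy([0.25, 0.75]))  # weighted coin  -> ~0.811 shannons
```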

SLIDE 18

Entropy vs. P


SLIDE 19

Exercise

Calculate the entropy of the following data (a figure of 30 examples: 16 green circles, 14 purple crosses)


SLIDE 20

Answer


H(\text{data}) = \frac{16}{30} \cdot I(\text{green circle}) + \frac{14}{30} \cdot I(\text{purple cross}) = \frac{16}{30} \log_2 \frac{30}{16} + \frac{14}{30} \log_2 \frac{30}{14} \approx 0.99679 \text{ shannons}

SLIDE 21

Bounds on Entropy


H(X) \ge 0

H(X) = 0 \iff \exists x \in X : P(x) = 1

H_b(X) \le \log_b |X|, \quad \text{where } |X| \text{ denotes the number of elements in the range of } X

H_b(X) = \log_b |X| \iff X \text{ has a uniform distribution over its range}
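For example, a uniform distribution over four outcomes attains the upper bound:

H_2(X) = 4 \cdot \frac{1}{4} \log_2 4 = \log_2 4 = 2 \text{ shannons}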

SLIDE 22

Information Gain

To use entropy for a splitting metric, we consider the information gain of an action as the resulting change in entropy


IG(T, a) = H(T) - H(T \mid a) = H(T) - \sum_i \frac{|T_i|}{|T|} H(T_i)

The sum is the weighted average entropy of the children.

SLIDE 23

Example Split


Parent class distribution: \{\frac{16}{30}, \frac{14}{30}\}

Children after the split: \{\frac{4}{17}, \frac{13}{17}\} \text{ and } \{\frac{12}{13}, \frac{1}{13}\}

SLIDE 24

Example Information Gain


H_1 = \frac{4}{17} \log_2 \frac{17}{4} + \frac{13}{17} \log_2 \frac{17}{13} \approx 0.79

H_2 = \frac{12}{13} \log_2 \frac{13}{12} + \frac{1}{13} \log_2 \frac{13}{1} \approx 0.39

IG = H(T) - \left(\frac{17}{30} H_1 + \frac{13}{30} H_2\right) = 0.99679 - 0.62 \approx 0.38 \text{ shannons}
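These numbers are straightforward to verify mechanically; a sketch (the helper names and count-vector interface are mine, not from the lecture):

```python
# Verify the example split numerically (sketch; helpers are my own).
from math import log2

def entropy(counts):
    """Entropy in shannons of a class-count vector, e.g. [16, 14]."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    """IG = H(parent) - sum_i |T_i|/|T| * H(T_i), with per-class counts."""
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

print(entropy([16, 14]))                               # H(T) -> ~0.99679
print(entropy([4, 13]))                                # H1   -> ~0.79
print(entropy([12, 1]))                                # H2   -> ~0.39
print(information_gain([16, 14], [[4, 13], [12, 1]]))  # IG   -> ~0.38
```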

SLIDE 25

Exercise

Consider the following dataset. Compute the information gain for each of the non-target attributes, then decide which attribute is best to split on.


X  Y  Z  Class
1  1  1  A
1  1  0  A
0  0  1  B
1  0  0  B

SLIDE 26

H(C)


H(C) = -(0.5) \log_2 0.5 - (0.5) \log_2 0.5 = 1 \text{ shannon}


SLIDE 27

IG(C,X)



H(C \mid X) = \frac{3}{4}\left[\frac{2}{3} \log_2 \frac{3}{2} + \frac{1}{3} \log_2 3\right] + \frac{1}{4}[0] \approx 0.689 \text{ shannons}

IG(C, X) = 1 - 0.689 = 0.311 \text{ shannons}

SLIDE 28

IG(C,Y)



H(C \mid Y) = \frac{1}{2}[0] + \frac{1}{2}[0] = 0 \text{ shannons}

IG(C, Y) = 1 - 0 = 1 \text{ shannon}

SLIDE 29

IG(C,Z)



H(C \mid Z) = \frac{1}{2}[1] + \frac{1}{2}[1] = 1 \text{ shannon}

IG(C, Z) = 1 - 1 = 0 \text{ shannons}

SLIDE 30

Feature Split Choice


IG: X = 0.311, Y = 1.0, Z = 0.0 → split on Y

Resulting tree: Y (1 → A, 0 → B)
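The same computation in code, on the table reconstructed above (a sketch; the dict-based row representation is an assumption of mine):

```python
# Information gain for each candidate feature on the X/Y/Z exercise (sketch).
from math import log2

rows = [  # exercise table: features -> class
    ({"X": 1, "Y": 1, "Z": 1}, "A"),
    ({"X": 1, "Y": 1, "Z": 0}, "A"),
    ({"X": 0, "Y": 0, "Z": 1}, "B"),
    ({"X": 1, "Y": 0, "Z": 0}, "B"),
]

def entropy(labels):
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n) for c in set(labels))

def information_gain(rows, feature):
    labels = [label for _, label in rows]
    gain = entropy(labels)
    for value in set(r[feature] for r, _ in rows):
        subset = [label for r, label in rows if r[feature] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

for feature in ("X", "Y", "Z"):
    print(feature, round(information_gain(rows, feature), 3))  # X 0.311, Y 1.0, Z 0.0
```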

SLIDE 31

ID3: Algorithm Sketch

  • If all examples are the “same”, return f(examples)
  • If no features remain, return f(examples)
  • A = “best” feature
    – For each distinct value of A:
      • branch = ID3( examples with that value of A, features − {A} )


Classification:

  • “same” = same class
  • f(examples) = majority
  • “best” = information gain

Regression:

  • “same” = std. dev. < ε
  • f(examples) = average
  • “best” = std. dev. reduction

http://www.saedsayad.com/decision_tree_reg.htm
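A runnable rendering of the classification variant of this sketch; the data representation (a list of feature dicts plus a parallel label list) and helper names are assumptions of mine, not the lecture's code:

```python
# A minimal ID3 sketch (classification variant): greedy, top-down feature splits.
# Usage: id3(examples, labels, ["No Surfacing", "Flippers?"])
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, feature):
    groups = {}
    for ex, lab in zip(examples, labels):
        groups.setdefault(ex[feature], []).append(lab)
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

def id3(examples, labels, features):
    if len(set(labels)) == 1:                 # all examples the "same" class
        return labels[0]
    if not features:                          # no features left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(examples, labels, f))
    tree = {best: {}}
    for value in set(ex[best] for ex in examples):
        sub = [(ex, lab) for ex, lab in zip(examples, labels) if ex[best] == value]
        tree[best][value] = id3([e for e, _ in sub], [l for _, l in sub],
                                [f for f in features if f != best])
    return tree
```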

SLIDE 32

Example

No Surfacing  Flippers?  Fish?
Yes           Yes        Yes
Yes           Yes        Yes
Yes           No         No
No            Yes        No
No            Yes        No


SLIDE 33

  • 0. Preliminaries



  • Examples not the same class
  • Features remain
  • H(Fish?) = 0.971

H(\text{Fish?}) = \frac{2}{5} \log_2 \frac{5}{2} + \frac{3}{5} \log_2 \frac{5}{3} \approx 0.971

SLIDE 34

1a: No Surfacing



  • H(Fish? | No Surfacing) = 0.55

H(\text{Fish?} \mid \text{No Surfacing}) = \frac{3}{5}\left[\frac{2}{3} \log_2 \frac{3}{2} + \frac{1}{3} \log_2 3\right] + \frac{2}{5}[0] \approx 0.55

  • IG(Fish?, No Surfacing) = 0.42
SLIDE 35

1b: Flippers?



  • H(Fish? | Flippers?) = 0.8

H(\text{Fish?} \mid \text{Flippers?}) = \frac{4}{5}(1) + \frac{1}{5}(0) = 0.8

  • IG(Fish?, Flippers?) = 0.17
SLIDE 36

2: Split on No Surfacing

No Surfacing = Yes:

Flippers?  Fish?
Yes        Yes
Yes        Yes
No         No


  • Recurse(left)

No Surfacing = No (the left branch):

Flippers?  Fish?
Yes        No
Yes        No

SLIDE 37

  • 2. Left
  • Examples the same class!

– Return class leaf node


Flippers?  Fish?
Yes        No
Yes        No

SLIDE 38

2: Split on No Surfacing



  • Recurse(right)

Tree so far: No Surfacing (No → "No", Yes → ?)

SLIDE 39

  • 2. Right
  • Examples not the same class
  • One feature remaining

– Split!


Flippers?  Fish?
Yes        Yes
Yes        Yes
No         No

SLIDE 40

  • 3. Split on Flippers
  • Recurse(left)


Flippers = No subset: Fish? = { No }

Flippers = Yes subset: Fish? = { Yes, Yes }

SLIDE 41

  • 3. Left
  • Examples the same class!

– Return class leaf node


Fish?
No

SLIDE 42

  • 3. Split on Flippers
  • Recurse(right)


Tree so far: Flippers (No → "No", Yes → ?)

Remaining subset: Fish? = { Yes, Yes }

SLIDE 43

  • 3. Right
  • Examples the same class!

– Return class leaf node


Fish?
Yes
Yes

SLIDE 44

  • 3. Split on Flippers
  • Return!


Flippers (No → "No", Yes → "Yes")

SLIDE 45

2: Split on No Surfacing


  • Done!

Final tree:

No Surfacing
  No  → "No"
  Yes → Flippers
          No  → "No"
          Yes → "Yes"

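As a cross-check, a library tree trained with the entropy criterion learns the same structure on this data. This sketch uses scikit-learn, whose trees are CART (binary splits) rather than ID3; the two coincide here because every feature is binary:

```python
# Cross-check with scikit-learn (CART + entropy criterion, not ID3 itself).
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[1, 1], [1, 1], [1, 0], [0, 1], [0, 1]]  # [No Surfacing, Flippers?], Yes=1/No=0
y = ["Yes", "Yes", "No", "No", "No"]

clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(clf, feature_names=["No Surfacing", "Flippers?"]))
# Expected shape: root splits on No Surfacing; its Yes branch splits on Flippers?
```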

SLIDE 46

Additional Base Case

  • What to do given the following example input to ID3?
    – No additional features upon which to split
  • For classification: majority vote

Fish?
Yes
Yes
No
No
No

Majority vote → "No"
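In code, this base case is a one-liner (sketch):

```python
# Majority-vote base case for classification (sketch).
from collections import Counter

labels = ["Yes", "Yes", "No", "No", "No"]
print(Counter(labels).most_common(1)[0][0])  # -> No
```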

SLIDE 47

Extensions

  • Generalization
  • Continuous features
  • Ensemble learning


SLIDE 48

Generalization

  • Information gain biases towards features with many distinct values
    – Consider splitting on a CC/SSN-style identifier: maximal gain, but no generalization
  • Approaches to mediate
    – Gain ratio divides each IG term by “SplitInfo”, which is large for features with many partitions (used in C4.5; defined below)
    – There are several pruning techniques that replace subtrees
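For reference, the standard C4.5 definitions (not written out on the slide):

\text{SplitInfo}(T, a) = -\sum_i \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|} \qquad \text{GainRatio}(T, a) = \frac{IG(T, a)}{\text{SplitInfo}(T, a)}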


SLIDE 49

Continuous Features

  • You can always discretize/bin the values yourself
    – Runs the risk of a suboptimal binning, depending on location in the tree
  • Simple approach: binary splits, whereby the left branch is ≤ a threshold
    – Consider each distinct value as a candidate threshold and calculate the gain (see the sketch below)
    – Computationally expensive for large numbers of distinct values
  • C4.5 penalizes such splits similarly to features with large distinct value sets
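A sketch of this threshold search for a single continuous feature; the function name, data handling, and toy data are assumptions of mine, not lecture code:

```python
# Pick a binary-split threshold for one continuous feature by information gain (sketch).
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n) for c in set(labels))

def best_threshold(values, labels):
    """Try each distinct value as '<= threshold'; return (best_threshold, best_gain)."""
    n, parent = len(labels), entropy(labels)
    best = (None, -1.0)
    for t in sorted(set(values))[:-1]:  # splitting at the max sends everything left
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        gain = parent - (len(left) / n * entropy(left) + len(right) / n * entropy(right))
        if gain > best[1]:
            best = (t, gain)
    return best

print(best_threshold([64, 65, 68, 70, 75, 80],
                     ["Yes", "No", "Yes", "Yes", "No", "No"]))  # -> (70, ~0.46)
```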


SLIDE 50

Ensemble Learning

  • The Random Forest algorithm is an exemplar of using multiple trees (see the sketch below)
    – Each tree is trained via bootstrapped data (i.e. sampled with replacement)
    – Each choice node’s feature is selected from a random subset of the overall feature set
    – Decisions are bagged (i.e. aggregated over many trees)
    – Can use a validation set to weight trees by expected accuracy
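A minimal usage sketch with scikit-learn (an assumed library choice, not lecture code), reusing the fish data from earlier:

```python
# Random forest: bootstrapped trees, random feature subsets per split, bagged votes.
from sklearn.ensemble import RandomForestClassifier

X = [[1, 1], [1, 1], [1, 0], [0, 1], [0, 1]]  # [No Surfacing, Flippers?], Yes=1/No=0
y = ["Yes", "Yes", "No", "No", "No"]

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees whose votes are aggregated
    max_features="sqrt",  # random feature subset considered at each split
    bootstrap=True,       # each tree trains on a bootstrap sample
    random_state=0,
).fit(X, y)
print(forest.predict([[1, 1]]))  # -> ['Yes']
```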


SLIDE 51

Checkup

  • ML task(s)?

– Classification: binary/multi-class?

  • Feature type(s)?
  • Implicit/explicit?
  • Parametric?
  • Online?


SLIDE 52

Summary: ID3/Decision Trees

  • Practicality
    – Easy, generally applicable
    – Requires no knowledge of the underlying process
    – Very popular, easy to understand
  • Efficiency
    – Training: relatively fast, batch
    – Testing: typically very fast
  • Performance
    – Greedy splitting can get stuck in suboptimal trees
    – Methods exist to help, but the problem is hard in general
