Supervised Learning via Decision Trees (Lecture 4)
Wentworth Institute of Technology, COMP4050 Machine Learning | Fall 2015 | Derbinsky


SLIDE 1

Supervised Learning via Decision Trees

Lecture 4

October 13, 2015

SLIDE 2

Outline

  • 1. Learning via feature splits
  • 2. ID3
– Information gain
  • 3. Extensions
– Continuous features
– Gain ratio
– Ensemble learning

SLIDE 3

Decision Trees

  • Sequence of decisions at choice nodes from root to a leaf node
– Each choice node splits on a single feature
  • Can be used for classification or regression
  • Explicit, easy for humans to understand
  • Typically very fast at testing/prediction time


h"ps://en.wikipedia.org/wiki/Decision_tree_learning

SLIDE 4

Weather Example

SLIDE 5

IRIS Example

SLIDE 6

Training Issues

  • Approximation

– Optimal tree-building is NP-complete
– Typically greedy, top-down

  • Bias vs. Variance

– Occam’s Razor vs. splitting on near-unique identifiers (e.g., CC/SSN)

  • Pruning, ensemble methods
  • Splitting metric

– Information gain, gain ratio, Gini impurity
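
For reference, here is a quick sketch (my own, not from the slides; the helper names are illustrative) contrasting two of these metrics, Shannon entropy and Gini impurity, on a node's class labels:

# Hypothetical helpers contrasting two impurity metrics on a node's class labels.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy: sum of p_i * log2(1/p_i)."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: 1 - sum of p_i^2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(entropy(["A", "A", "B", "B"]), gini(["A", "A", "B", "B"]))  # 1.0 0.5
print(entropy(["A", "A", "A", "A"]), gini(["A", "A", "A", "A"]))  # 0.0 0.0 (pure node)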

SLIDE 7

Iterative Dichotomiser 3

  • Invented by Ross Quinlan in 1986

– Precursor to C4.5/C5.0

  • Categorical data only
  • Greedily consumes features

– Subtrees cannot consider previous feature(s) for further splits
– Typically produces shallow trees

SLIDE 8

ID3: Algorithm Sketch

  • If all examples “same”, return f(examples)
  • If no more features, return f(examples)
  • A = “best” feature

– For each distinct value of A

  • branch = ID3( attributes - {A} )
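
To make this sketch concrete, here is a minimal ID3 implementation in Python (my own illustration, not lecture code). It assumes categorical features, each example encoded as a dict of feature name to value with the class labels in a parallel list, and uses the entropy-based information gain defined on later slides to pick the "best" feature:

# A minimal ID3 sketch (illustrative; names and representation are assumptions).
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in shannons/bits."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def info_gain(examples, labels, feature):
    """IG(T, a) = H(T) - sum_i |T_i|/|T| * H(T_i)."""
    n = len(labels)
    remainder = 0.0
    for value in set(ex[feature] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[feature] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, labels, features):
    # Base case 1: all examples have the same class -> leaf with that class
    if len(set(labels)) == 1:
        return labels[0]
    # Base case 2: no features left to split on -> leaf with the majority class
    if not features:
        return Counter(labels).most_common(1)[0][0]
    # Recursive step: split on the feature with the highest information gain
    best = max(features, key=lambda f: info_gain(examples, labels, f))
    node = {best: {}}
    for value in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        node[best][value] = id3([examples[i] for i in idx],
                                [labels[i] for i in idx],
                                [f for f in features if f != best])
    return node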

SLIDE 9

Details

Classification:
  • “same” = same class
  • f(examples) = majority

Regression:
  • “same” = std. dev. < ε
  • f(examples) = average

SLIDE 10

Recursion

  • A method of programming in which a function refers to itself in order to solve a problem
– Example: ID3 calls itself for subtrees
  • Never necessary
– In some situations, results in simpler and/or easier-to-write code
– Can often be more expensive in terms of memory + time

SLIDE 11

Example

Consider the factorial function


$$n! = \prod_{k=1}^{n} k = 1 \cdot 2 \cdot 3 \cdots n$$

SLIDE 12

Iterative Implementation

def factorial(n):
    result = 1
    for i in range(n):
        result *= (i + 1)
    return result

SLIDE 13

Consider a Recursive Definition


Base case:       0! = 1
Recursive step:  n! = n(n − 1)!  when n ≥ 1

SLIDE 14

Recursive Implementation

def factorial_r(n):
    if n == 0:
        return 1
    else:
        return n * factorial_r(n - 1)

SLIDE 15

How the Code Executes

Function stack (one line per stack frame, growing downward):

main         print factorial_r( 4 )
factorial_r  return 4 * factorial_r( 3 )
factorial_r  return 3 * factorial_r( 2 )
factorial_r  return 2 * factorial_r( 1 )
factorial_r  return 1 * factorial_r( 0 )
factorial_r  return 1

SLIDE 16

How the Code Executes

Function stack (one line per stack frame):

main         print factorial_r( 4 )
factorial_r  return 4 * factorial_r( 3 )
factorial_r  return 3 * factorial_r( 2 )
factorial_r  return 2 * factorial_r( 1 )
factorial_r  return 1 * 1

SLIDE 17

How the Code Executes

Function stack (one line per stack frame):

main         print factorial_r( 4 )
factorial_r  return 4 * factorial_r( 3 )
factorial_r  return 3 * factorial_r( 2 )
factorial_r  return 2 * 1

SLIDE 18

How the Code Executes

Function stack (one line per stack frame):

main         print factorial_r( 4 )
factorial_r  return 4 * factorial_r( 3 )
factorial_r  return 3 * 2

SLIDE 19

How the Code Executes

Function stack (one line per stack frame):

main         print factorial_r( 4 )
factorial_r  return 4 * 6

SLIDE 20

How the Code Executes

Function stack (one line per stack frame):

main         print 24

SLIDE 21

ID3: Algorithm Sketch

  • If all examples “same”, return f(examples)   [base case]
  • If no more features, return f(examples)   [base case]
  • A = “best” feature   [recursive step]

– For each distinct value of A

  • branch = ID3( attributes - {A} )

SLIDE 22

Splitting Metric: The “best” Feature

  • Classification: Information gain
– Goal: choose splits that proceed from much -> little uncertainty
  • Regression: Standard Deviation Reduction

h"p://www.saedsayad.com/ decision_tree_reg.htm

SLIDE 23

Shannon Entropy

  • Measure of “impurity” or uncertainty
  • Intuition: the less likely the event, the more information is transmitted

SLIDE 24

Entropy Range

[Figure: example distributions with small vs. large entropy]

SLIDE 25

Quantifying Entropy

$$H(X) = E[I(X)] = \sum_i P(x_i)\,I(x_i) \;\; \text{(discrete)} \qquad \int P(x)\,I(x)\,dx \;\; \text{(continuous)}$$

Expected value of information

SLIDE 26

Intuition for Information

We want a definition of information I(X) = ... such that:

  • It shouldn’t be negative: I(X) ≥ 0
  • Events that always occur communicate no information: I(1) = 0
  • Information from independent events is additive: I(X1, X2) = I(X1) + I(X2)

SLIDE 27

Quantifying Information

$$I(X) = \log_b \frac{1}{P(X)} = -\log_b P(X)$$

Log base = units: 2 = bit (binary digit), 3 = trit, e = nat

$$H(X) = -\sum_i P(x_i) \log_b P(x_i)$$

Log base = units: 2 = shannon/bit
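
As a quick numeric check of these formulas (my own snippet; the helper names are assumptions), the coin and dataset examples on the following slides can be reproduced directly:

# Self-information and entropy, base 2 (bits/shannons); illustrative only.
from math import log2

def information(p):
    """I(x) = log2(1/p) for an event with probability p > 0."""
    return log2(1 / p)

def entropy(probs):
    """H(X) = sum_i p_i * I(x_i) over a discrete distribution."""
    return sum(p * information(p) for p in probs if p > 0)

print(information(0.5))          # fair coin side: 1.0 bit
print(entropy([0.5, 0.5]))       # fair coin toss: 1.0 shannon
print(entropy([1.0]))            # double-headed coin: 0.0 shannons
print(entropy([0.25, 0.75]))     # weighted coin: ~0.811 shannons
print(entropy([16/30, 14/30]))   # 16 circles vs. 14 crosses: ~0.997 shannons
print(entropy([0.25] * 4))       # uniform over 4 outcomes: 2.0 = log2(4)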

SLIDE 28

Example: Fair Coin Toss

$$I(\text{heads}) = \log_2 \tfrac{1}{0.5} = \log_2 2 = 1 \text{ bit}$$
$$I(\text{tails}) = \log_2 \tfrac{1}{0.5} = \log_2 2 = 1 \text{ bit}$$
$$H(\text{fair toss}) = (0.5)(1) + (0.5)(1) = 1 \text{ shannon}$$

SLIDE 29

Example: Double Headed Coin

$$H(\text{double head}) = (1) \cdot I(\text{head}) = (1) \cdot \log_2 \tfrac{1}{1} = (1) \cdot 0 = 0 \text{ shannons}$$

SLIDE 30

Exercise: Weighted Coin

Compute the entropy of a coin that will land on heads about 25% of the time, and tails the remaining 75%.

SLIDE 31

Answer

$$H(\text{weighted toss}) = (0.25) \cdot I(\text{head}) + (0.75) \cdot I(\text{tails}) = (0.25) \cdot \log_2 \tfrac{1}{0.25} + (0.75) \cdot \log_2 \tfrac{1}{0.75} \approx 0.81 \text{ shannons}$$

SLIDE 32

Entropy vs. P

SLIDE 33

Exercise

Calculate the entropy of the following data: 30 examples, of which 16 are green circles and 14 are purple crosses.

SLIDE 34

Answer

$$H(\text{data}) = \tfrac{16}{30} \cdot I(\text{green circle}) + \tfrac{14}{30} \cdot I(\text{purple cross}) = \tfrac{16}{30} \cdot \log_2 \tfrac{30}{16} + \tfrac{14}{30} \cdot \log_2 \tfrac{30}{14} = 0.99679 \text{ shannons}$$

SLIDE 35

Bounds on Entropy

$$H(X) \geq 0$$
$$H(X) = 0 \iff \exists x \in X : P(x) = 1$$
$$H_b(X) \leq \log_b(|X|)$$

|X| denotes the number of elements in the range of X

$$H_b(X) = \log_b(|X|) \iff X \text{ has a uniform distribution over } |X|$$

SLIDE 36

Information Gain

To use entropy as a splitting metric, we consider the information gain of a split: the resulting reduction in entropy

$$IG(T, a) = H(T) - H(T \mid a) = H(T) - \sum_i \frac{|T_i|}{|T|} H(T_i)$$

(the sum is the weighted average entropy of the children)

SLIDE 37

Example Split

Parent node class distribution: {16/30, 14/30}
Children after the split: {4/17, 13/17} and {12/13, 1/13}

SLIDE 38

Example Information Gain

$$H_1 = \tfrac{4}{17} \log_2 \tfrac{17}{4} + \tfrac{13}{17} \log_2 \tfrac{17}{13} \approx 0.79$$
$$H_2 = \tfrac{12}{13} \log_2 \tfrac{13}{12} + \tfrac{1}{13} \log_2 \tfrac{13}{1} \approx 0.39$$
$$IG = H(T) - \left(\tfrac{17}{30} H_1 + \tfrac{13}{30} H_2\right) = 0.99679 - 0.62 = 0.38 \text{ shannons}$$
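
The same numbers can be verified with a few lines of Python (my own arithmetic check, restating a small entropy helper over probabilities):

# Verify the example split's information gain (illustrative arithmetic only).
from math import log2

def H(probs):
    """Shannon entropy of a discrete distribution, in shannons."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

parent = H([16/30, 14/30])                   # ~0.99679
child1 = H([4/17, 13/17])                    # ~0.79
child2 = H([12/13, 1/13])                    # ~0.39
gain = parent - (17/30 * child1 + 13/30 * child2)
print(round(parent, 5), round(child1, 2), round(child2, 2), round(gain, 2))
# 0.99679 0.79 0.39 0.38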

SLIDE 39

Exercise

Consider the following dataset. Compute the information gain for each of the non-target attributes. Decide which attribute is the best to split on.

X  Y  Z  Class
1  1  1  A
1  1  0  A
0  0  1  B
1  0  0  B

SLIDE 40

H(C)

$$H(C) = -(0.5) \log_2 0.5 - (0.5) \log_2 0.5 = 1 \text{ shannon}$$

SLIDE 41

IG(C,X)

$$H(C \mid X) = \tfrac{3}{4}\left[\tfrac{2}{3} \log_2 \tfrac{3}{2} + \tfrac{1}{3} \log_2 \tfrac{3}{1}\right] + \tfrac{1}{4}[0] = 0.689 \text{ shannons}$$
$$IG(C, X) = 1 - 0.689 = 0.311 \text{ shannons}$$

SLIDE 42

IG(C,Y)

$$H(C \mid Y) = \tfrac{1}{2}[0] + \tfrac{1}{2}[0] = 0 \text{ shannons}$$
$$IG(C, Y) = 1 - 0 = 1 \text{ shannon}$$

SLIDE 43

IG(C,Z)

$$H(C \mid Z) = \tfrac{1}{2}[1] + \tfrac{1}{2}[1] = 1 \text{ shannon}$$
$$IG(C, Z) = 1 - 1 = 0 \text{ shannons}$$

SLIDE 44

Feature Split Choice

IG(C, X) = 0.311   IG(C, Y) = 1.0   IG(C, Z) = 0.0

Split on Y: the Y = 1 branch is class A, the Y = 0 branch is class B.
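
This choice can also be checked against the ID3 sketch given after slide 8 (assuming its info_gain and id3 helpers are in scope; the dict-based encoding of the table is mine):

# Reproduce the exercise with the sketch from slide 8 (illustrative).
examples = [{"X": 1, "Y": 1, "Z": 1},
            {"X": 1, "Y": 1, "Z": 0},
            {"X": 0, "Y": 0, "Z": 1},
            {"X": 1, "Y": 0, "Z": 0}]
labels = ["A", "A", "B", "B"]

for f in ["X", "Y", "Z"]:
    print(f, round(info_gain(examples, labels, f), 3))  # X 0.311, Y 1.0, Z 0.0
print(id3(examples, labels, ["X", "Y", "Z"]))
# {'Y': {...}}: a tree that splits only on Y; Y=1 -> 'A', Y=0 -> 'B'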

SLIDE 45

ID3: Algorithm Sketch

  • If all examples “same”, return f(examples)
  • If no more features, return f(examples)
  • A = “best” feature

– For each distinct value of A

  • branch = ID3( attributes - {A} )

SLIDE 46

Example (MLiA)

No Surfacing  Flippers?  Fish?
Yes           Yes        Yes
Yes           Yes        Yes
Yes           No         No
No            Yes        No
No            Yes        No

SLIDE 47

  • 0. Preliminaries


  • Examples not the same class
  • Features remain
  • H(Fish?) = 0.971
SLIDE 48

1a: No Surfacing


  • H(Fish? | No Surfacing) = 0.55
  • IG(Fish?, No Surfacing) = 0.42
SLIDE 49

1b: Flippers?


  • H(Fish? | Flippers?) = 0.8
  • IG(Fish?, Flippers) = 0.17
SLIDE 50

2: Split on No Surfacing

Tree so far: root No Surfacing, with branches No (left) and Yes (right).

Left subset (No Surfacing = No):
Flippers?  Fish?
Yes        No
Yes        No

Right subset (No Surfacing = Yes):
Flippers?  Fish?
Yes        Yes
Yes        Yes
No         No

  • Recurse(left)

SLIDE 51

  • 2. Left
  • Examples the same class!

– Return class leaf node

Flippers?  Fish?
Yes        No
Yes        No

SLIDE 52

2: Split on No Surfacing

Right subset (No Surfacing = Yes):
Flippers?  Fish?
Yes        Yes
Yes        Yes
No         No

  • Recurse(right)

Tree so far: No Surfacing; the No branch is a leaf (No), the Yes branch is still to be expanded.

SLIDE 53

  • 2. Right
  • Examples not the same class
  • One feature remaining

– Split!

Flippers?  Fish?
Yes        Yes
Yes        Yes
No         No

SLIDE 54

  • 3. Split on Flippers
  • Recurse(left)

Tree: Flippers, with branches No (left) and Yes (right).
Left subset (Flippers = No): Fish? = No (one example)
Right subset (Flippers = Yes): Fish? = Yes, Yes (two examples)

SLIDE 55

  • 3. Left
  • Examples the same class!

– Return class leaf node

Fish?
No

SLIDE 56

  • 3. Split on Flippers
  • Recurse(right)

Tree: Flippers; the No branch is a leaf (No), recursing into the Yes branch.
Right subset (Flippers = Yes): Fish? = Yes, Yes

SLIDE 57

  • 3. Right
  • Examples the same class!

– Return class leaf node

Fish?
Yes
Yes

SLIDE 58

  • 3. Split on Flippers
  • Return!

Returned subtree: Flippers; the No branch -> No, the Yes branch -> Yes.

SLIDE 59

2: Split on No Surfacing


  • Done!

Final tree:
  No Surfacing = No  -> No
  No Surfacing = Yes -> Flippers = No  -> No
                        Flippers = Yes -> Yes
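
Running the id3 sketch from slide 8 on this dataset reproduces the same tree (my own snippet; the dict encoding and the predict helper are assumptions, not lecture code):

# Build and query the fish tree with the slide-8 sketch (illustrative).
examples = [{"No Surfacing": "Yes", "Flippers?": "Yes"},
            {"No Surfacing": "Yes", "Flippers?": "Yes"},
            {"No Surfacing": "Yes", "Flippers?": "No"},
            {"No Surfacing": "No",  "Flippers?": "Yes"},
            {"No Surfacing": "No",  "Flippers?": "Yes"}]
labels = ["Yes", "Yes", "No", "No", "No"]

tree = id3(examples, labels, ["No Surfacing", "Flippers?"])
print(tree)
# {'No Surfacing': {'No': 'No', 'Yes': {'Flippers?': {'No': 'No', 'Yes': 'Yes'}}}}
# (key order may vary)

def predict(node, example):
    """Walk the nested-dict tree down to a class label."""
    while isinstance(node, dict):
        feature = next(iter(node))
        node = node[feature][example[feature]]
    return node

print(predict(tree, {"No Surfacing": "Yes", "Flippers?": "No"}))  # 'No'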

SLIDE 60

Additional Base Case

  • What to do given the following example input to ID3?
– No additional features upon which to split
  • For categorical, majority vote

Fish?
Yes
Yes
No
No
No

Majority vote result: No

SLIDE 61

Extensions

  • Generalization
  • Continuous features
  • Ensemble learning

SLIDE 62

Generalization

  • Information gain biases towards features with many distinct values
– Consider the value of CC/SSN
  • Approaches to mediate
– Gain ratio is a metric that divides each IG term by “SplitInfo”, which is large for features with many partitions (used in C4.5); a small sketch follows below
– There are several pruning techniques that replace subtrees
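
A small sketch of the gain-ratio idea (my own illustration; it assumes the info_gain helper from the slide-8 sketch is in scope):

# Gain ratio = information gain / split information (illustrative sketch).
from collections import Counter
from math import log2

def split_info(examples, feature):
    """Entropy of the partition induced by `feature` (its "SplitInfo")."""
    n = len(examples)
    counts = Counter(ex[feature] for ex in examples)
    return sum((c / n) * log2(n / c) for c in counts.values())

def gain_ratio(examples, labels, feature):
    si = split_info(examples, feature)
    return info_gain(examples, labels, feature) / si if si > 0 else 0.0

# e.g., on the X/Y/Z exercise data: gain_ratio(examples, labels, "Y")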

SLIDE 63

Continuous Features

  • You can always discretize/bin yourself
– Runs the risk of a suboptimal binning depending on tree location
  • Simple approach: binary splits, whereby the left branch is ≤ threshold
  • Consider each distinct value as a candidate threshold, calculate gain (a sketch follows below)
– Computationally expensive for large numbers of values
  • C4.5 penalizes these splits similarly to features with large distinct value sets
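
Here is a minimal sketch of that threshold search for a single continuous feature (my own illustration; helper names are assumptions):

# Pick the best binary threshold for one continuous feature (illustrative).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try the midpoint between each pair of adjacent distinct values;
    return (threshold, information gain) for the best 'value <= t' split."""
    pairs = sorted(zip(values, labels))
    distinct = sorted(set(v for v, _ in pairs))
    best = (None, 0.0)
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        gain = (entropy(labels)
                - (len(left) / len(labels)) * entropy(left)
                - (len(right) / len(labels)) * entropy(right))
        if gain > best[1]:
            best = (t, gain)
    return best

# e.g. a petal-length-style feature:
print(best_threshold([1.4, 1.3, 4.7, 4.5, 5.1], ["A", "A", "B", "B", "B"]))
# roughly (2.95, 0.97): splitting near 2.95 separates the classes perfectly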

SLIDE 64

Ensemble Learning

  • The Random Forest algorithm is an exemplar of using multiple trees (a small bagging sketch follows below)
– Each tree is trained via bootstrapped data (i.e. sampled with replacement)
– Each choice node's feature is selected from a random subset of the overall feature set
– Decisions are bagged (i.e. aggregated over many trees)
– Can use a validation set to weight votes by the expected accuracy of each tree
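
Below is a minimal bagging sketch built on the id3 function from slide 8 (my own illustration; a full Random Forest would additionally restrict each node to a random feature subset, which this sketch omits):

# Bootstrap + majority vote over ID3 trees (illustrative; assumes id3 from slide 8).
import random
from collections import Counter

def predict(node, example):
    """Walk a nested-dict tree to a label; None if a value was unseen in training."""
    while isinstance(node, dict):
        feature = next(iter(node))
        node = node[feature].get(example[feature])
    return node

def bagged_trees(examples, labels, features, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(examples)) for _ in examples]  # sample with replacement
        forest.append(id3([examples[i] for i in idx],
                          [labels[i] for i in idx],
                          list(features)))
    return forest

def vote(forest, example):
    """Aggregate (bag) the individual tree predictions by majority vote."""
    preds = [predict(t, example) for t in forest]
    preds = [p for p in preds if p is not None]
    return Counter(preds).most_common(1)[0][0] if preds else None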

SLIDE 65

Checkup

  • ML task(s)?

– Classification: binary/multi-class?

  • Feature type(s)?
  • Implicit/explicit?
  • Parametric?
  • Online?

SLIDE 66

Summary: ID3/Decision Trees

  • Practicality

– Easy, generally applicable
– Need to know nothing about the underlying process
– Very popular, easy to understand

  • Efficiency

– Training: relatively fast, batch
– Testing: typically very fast

  • Performance

– Possible to get stuck in suboptimal trees

  • Methods to help, hard in general
