

SLIDE 1

Applied Machine Learning

Decision Trees

Siamak Ravanbakhsh

COMP 551 (Winter 2020)

SLIDE 2

Learning objectives

decision trees:
- model
- cost function
- how it is optimized
- how to grow a tree and why you should prune it!

SLIDE 3

Adaptive bases

so far we assume a fixed set of bases in f(x) = ∑_d w_d ϕ_d(x)

several methods can be classified as learning these bases adaptively:
- decision trees
- generalized additive models
- boosting
- neural networks

f(x) = ∑_d w_d ϕ_d(x; v_d)    each basis has its own parameters v_d

SLIDES 4-6

Decision trees: motivation

pros:
- decision trees are interpretable!
- they are not very sensitive to outliers
- they do not need data normalization

cons:
- they can easily overfit and they are unstable
- remedies: pruning, random forests

image credit: https://mymodernmet.com/the-30second-rule-a-decision/

SLIDES 7-9

Decision trees: idea

divide the input space into regions and learn one function per region

f(x) = ∑_k w_k I(x ∈ R_k)

the regions are learned adaptively; more sophisticated prediction per region is also possible (e.g., one linear model per region)

split regions successively based on the value of a single variable, called a test

each region is a set of conditions, e.g. R_2 = {x_1 ≤ t_1, x_2 ≤ t_4}

[figure: an example tree splitting on x_1 and x_2, with leaf predictions w_1, w_3, w_5]

SLIDES 10-13

Prediction per region

what constant w_k should we use for prediction in each region?
suppose we have identified the regions R_k

for regression: use the mean value of the training data-points in that region
  w_k = mean(y^(n) | x^(n) ∈ R_k)

for classification: count the frequency of classes per region and predict the most frequent label (or return the class probabilities)
  w_k = mode(y^(n) | x^(n) ∈ R_k)

example: predicting survival on the Titanic; each region is annotated with the most frequent label, the frequency of survival, and the percentage of training data in that region
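A minimal NumPy sketch of these per-region predictions (helper names are mine, not from the slides):

import numpy as np

def leaf_value_regression(y):
    # w_k = mean(y^(n) | x^(n) in R_k)
    return np.mean(y)

def leaf_value_classification(y):
    # w_k = mode(y^(n) | x^(n) in R_k): the most frequent label in the region
    labels, counts = np.unique(y, return_counts=True)
    return labels[np.argmax(counts)]

def leaf_probabilities(y, classes):
    # optionally return p_k(y = c) for every class instead of a single label
    return np.array([np.mean(y == c) for c in classes])

y_region = np.array([1, 0, 1, 1, 0])            # labels that fell into region R_k
print(leaf_value_regression(y_region))          # 0.6
print(leaf_value_classification(y_region))      # 1
print(leaf_probabilities(y_region, [0, 1]))     # [0.4 0.6]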

SLIDES 14-20

Feature types

given a feature, what are the possible tests?

continuous features - e.g., age, height, GDP
- all the values that appear in the dataset can be used to split: S_d = {s_{d,n} = x_d^(n)}
- one set of possible splits for each feature d
- each split is asking x_d > s_{d,n} ?

ordinal features - e.g., grade, rating: x_d ∈ {1, ..., C}
- we can split at any value, so S_d = {s_{d,1} = 1, ..., s_{d,C} = C}
- each split is asking x_d > s_{d,c} ?

categorical features - e.g., types, classes and categories: x_d ∈ {1, ..., C}
- multi-way split: one branch per value (x_d = 1? x_d = 2? ...)
  problem: it could lead to sparse subsets; data fragmentation: some splits may have few/no datapoints
- binary split instead: assume C binary features (one-hot coding), x_{d,1} ∈ {0,1}, ..., x_{d,C} ∈ {0,1}, and ask e.g. x_{d,2} = 0? vs x_{d,2} = 1?
- alternative: binary splits that produce balanced subsets
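A small sketch of how the candidate split sets S_d could be enumerated for each feature type (the slides only define the sets; these helper names are illustrative):

import numpy as np

def splits_continuous(x_d):
    # S_d = {s_{d,n} = x_d^(n)}: every value of the feature that appears in the data
    return np.unique(x_d)

def splits_ordinal(C):
    # S_d = {1, ..., C}: we can split at any level of the ordinal scale
    return np.arange(1, C + 1)

def one_hot(x_d, C):
    # categorical feature -> C binary features, so each test is x_{d,c} = 1?
    return (x_d[:, None] == np.arange(1, C + 1)[None, :]).astype(int)

age = np.array([31.0, 25.0, 31.0, 47.0])
print(splits_continuous(age))          # [25. 31. 47.]
print(splits_ordinal(5))               # [1 2 3 4 5]
print(one_hot(np.array([2, 1, 3]), 3)) # one-hot rows for categories 2, 1, 3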

SLIDES 21-28

Cost function

objective: find a decision tree minimizing the cost function

regression cost (predicting a constant w_k ∈ ℝ per region):
  cost(R_k, D) = (1/N_k) ∑_{x^(n) ∈ R_k} (y^(n) - w_k)^2
  N_k is the number of instances in region k; the cost per region is the mean squared error (MSE)
  w_k = mean(y^(n) | x^(n) ∈ R_k)

classification cost (predicting a constant class w_k ∈ {1, ..., C} per region):
  cost(R_k, D) = (1/N_k) ∑_{x^(n) ∈ R_k} I(y^(n) ≠ w_k)
  the cost per region is the misclassification rate
  w_k = mode(y^(n) | x^(n) ∈ R_k)

total cost in both cases is the normalized sum ∑_k (N_k / N) cost(R_k, D)

it is sometimes possible to build a tree with zero cost: build a large tree with each instance having its own region (overfitting!)

new objective: find a decision tree with K tests minimizing the cost function
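A minimal sketch of these per-region costs and the normalized total cost (function names are mine, not from the slides):

import numpy as np

def mse_cost(y):
    # regression: (1/N_k) sum (y^(n) - w_k)^2 with w_k = mean(y)
    return float(np.mean((y - np.mean(y)) ** 2)) if len(y) else 0.0

def misclassification_cost(y):
    # classification: (1/N_k) sum I(y^(n) != w_k) with w_k = mode(y)
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    return 1.0 - counts.max() / len(y)

def total_cost(regions, cost_fn):
    # normalized sum over regions: sum_k (N_k / N) cost(R_k, D)
    N = sum(len(y) for y in regions)
    return sum(len(y) / N * cost_fn(y) for y in regions)

regions = [np.array([0, 0, 0, 1]), np.array([1, 1, 0, 1])]
print(total_cost(regions, misclassification_cost))   # 0.25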

SLIDES 29-36

Search space

objective: find a decision tree with K tests (K+1 regions) minimizing the cost function
alternatively, find the smallest tree (K) that classifies all examples correctly

[figure: an example partition of the input space that is not produced by a decision tree]

assuming D features, how many different partitions of size K+1 are there?

the number of full binary trees with K+1 leaves (regions R_k) is the Catalan number (1/(K+1)) (2K choose K), which is exponential in K:
1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786, 208012, 742900, 2674440, 9694845, 35357670, 129644790, 477638700, 1767263190, 6564120420, 24466267020, 91482563640, 343059613650, 1289904147324, 4861946401452

we also have a choice of feature x_d for each of the K internal nodes: D^K choices

moreover, for each feature there are different choices of the splitting value s_{d,n} ∈ S_d

bottom line: finding the optimal decision tree is an NP-hard combinatorial optimization problem
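A two-line check of that count and its growth (not from the slides):

from math import comb

def catalan(K):
    # number of full binary trees with K internal nodes, i.e. K+1 leaves/regions
    return comb(2 * K, K) // (K + 1)

print([catalan(K) for K in range(10)])
# [1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862]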

SLIDES 37-39

Greedy heuristic

recursively split the regions based on a greedy choice of the next test
end the recursion if not worth-splitting

function fit-tree(R_node, D, depth):
    if not worth-splitting(depth, R_node, D):
        return R_node
    else:
        R_left, R_right = greedy-test(R_node, D)
        left-set  = fit-tree(R_left,  D, depth+1)
        right-set = fit-tree(R_right, D, depth+1)
        return {left-set, right-set}

the final decision tree is in the form of a nested list of regions, e.g. {{R_1, R_2}, {R_3, {R_4, R_5}}}

SLIDES 40-44

Choosing tests

the split is greedy because it looks one step ahead; this may not lead to the lowest overall cost

split-cost = (N_left / N_node) cost(R_left, D) + (N_right / N_node) cost(R_right, D)

function greedy-test(R_node, D):
    best-cost = +inf
    for d ∈ {1, ..., D}, s_{d,n} ∈ S_d:
        # creating new regions
        R_left  = R_node ∪ {x_d < s_{d,n}}
        R_right = R_node ∪ {x_d ≥ s_{d,n}}
        # evaluate their cost
        split-cost = (N_left / N_node) cost(R_left, D) + (N_right / N_node) cost(R_right, D)
        if split-cost < best-cost:
            best-cost = split-cost
            R*_left, R*_right = R_left, R_right
    # return the split with the lowest greedy cost
    return R*_left, R*_right
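The pseudocode can be turned into a small working example; this is my own minimal NumPy version for regression with a squared-error cost, not the course's reference implementation:

import numpy as np

def cost(y):
    # per-region regression cost (MSE around the region mean)
    return float(np.mean((y - y.mean()) ** 2)) if y.size else 0.0

def greedy_test(X, y):
    # try every feature d and every observed value s_{d,n} as a threshold,
    # keep the split with the lowest weighted cost
    best = (np.inf, None, None)
    N = y.size
    for d in range(X.shape[1]):
        for s in np.unique(X[:, d]):
            left = X[:, d] < s
            if left.all() or not left.any():
                continue                      # degenerate split, skip
            split_cost = (left.sum() / N) * cost(y[left]) \
                       + ((~left).sum() / N) * cost(y[~left])
            if split_cost < best[0]:
                best = (split_cost, d, s)
    return best                               # (cost, feature index, threshold)

def fit_tree(X, y, depth, max_depth=3, min_size=2):
    worth_splitting = depth < max_depth and y.size >= min_size and cost(y) > 0
    if not worth_splitting:
        return float(y.mean())                 # leaf: w_k = mean(y)
    _, d, s = greedy_test(X, y)
    if d is None:
        return float(y.mean())
    left = X[:, d] < s
    return {"feature": d, "threshold": s,
            "left":  fit_tree(X[left],  y[left],  depth + 1, max_depth, min_size),
            "right": fit_tree(X[~left], y[~left], depth + 1, max_depth, min_size)}

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0.0, 0.1, 0.0, 5.0, 5.1, 5.0])
print(fit_tree(X, y, depth=0))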

SLIDES 45-51

Stopping the recursion

the worth-splitting subroutine

if we stop only when R_node has zero cost, we may overfit

heuristics for stopping the splitting:
- reached a desired depth
- number of examples in R_left or R_right is too small
- w_k is a good approximation, i.e. the cost is small enough
- reduction in cost by splitting is small:
  cost(R_node, D) - ((N_left / N_node) cost(R_left, D) + (N_right / N_node) cost(R_right, D))

image credit: https://alanjeffares.wordpress.com/tutorials/decision-tree/
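These heuristics can be collected into a small worth-splitting predicate; a sketch, where thresholds like max_depth, min_leaf and min_gain are hyperparameters I am assuming rather than values from the slides:

def worth_splitting(depth, y_node, y_left, y_right, cost,
                    max_depth=5, min_leaf=5, min_cost=1e-3, min_gain=1e-3):
    N, N_l, N_r = len(y_node), len(y_left), len(y_right)
    if depth >= max_depth:                    # reached a desired depth
        return False
    if N_l < min_leaf or N_r < min_leaf:      # too few examples in a child region
        return False
    if cost(y_node) <= min_cost:              # w_k is already a good approximation
        return False
    gain = cost(y_node) - (N_l / N * cost(y_left) + N_r / N * cost(y_right))
    return gain > min_gain                    # split only if the cost reduction is large enough

# toy usage with a misclassification cost on label lists
err = lambda y: 1 - max(y.count(0), y.count(1)) / len(y) if y else 0.0
print(worth_splitting(1, [0]*5 + [1]*6, [0]*5, [1]*6, err))   # True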

SLIDES 52-57

Revisiting the classification cost

ideally we want to optimize the 0-1 loss (misclassification rate)
  cost(R_k, D) = (1/N_k) ∑_{x^(n) ∈ R_k} I(y^(n) ≠ w_k)

this may not be the optimal cost for each step of the greedy heuristic

example: two candidate splits of the same node R_node into R_left and R_right; each pair gives (fraction of positive labels, percentage of the training data in the region)
  R_node: (.5, 100%)
  split 1: R_left (.25, 50%), R_right (.75, 50%)
  split 2: R_left (.33, 75%), R_right (1, 25%)

both splits have the same misclassification rate (2/8); however, the second split may be preferable because one region does not need further splitting

use a measure for the homogeneity of labels in the regions

SLIDES 58-62

Entropy

entropy is the expected amount of information in observing a random variable y
  H(y) = - ∑_{c=1}^{C} p(y = c) log p(y = c)

the amount of information in observing c is -log p(y = c):
- zero information if p(c) = 1
- less probable events are more informative: p(c) < p(c') ⇒ -log p(c) > -log p(c')
- information from two independent events is additive: -log(p(c) q(d)) = -log p(c) - log q(d)

a uniform distribution has the highest entropy: H(y) = - ∑_{c=1}^{C} (1/C) log(1/C) = log C

a deterministic random variable has the lowest entropy: H(y) = -1 log(1) = 0

note that it is common to use capital letters for random variables (here for consistency we use lower-case)
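A quick sketch of the entropy function and the two extreme cases above:

import numpy as np

def entropy(p, base=2):
    # H(y) = -sum_c p(y=c) log p(y=c); terms with p = 0 contribute 0
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)) / np.log(base))

print(entropy([0.25, 0.25, 0.25, 0.25]))   # uniform over C=4 classes: log2(4) = 2.0
print(entropy([1.0, 0.0, 0.0, 0.0]))       # deterministic: 0.0
print(entropy([0.5, 0.5]))                 # one fair coin flip: 1.0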

SLIDES 63-69

Mutual information

for two random variables t, y, the mutual information is the amount of information t conveys about y: the change in the entropy of y after observing the value of t

  I(t, y) = H(y) - H(y|t)

conditional entropy: H(y|t) = ∑_{l=1}^{L} p(t = l) H(y | t = l)

  I(t, y) = ∑_l ∑_c p(y = c, t = l) log [ p(y = c, t = l) / (p(y = c) p(t = l)) ]

this is symmetric w.r.t. y and t: I(t, y) = H(t) - H(t|y) = I(y, t)

it is always non-negative, and zero only if y and t are independent (try to prove these properties)
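A sketch computing I(t, y) for two discrete variables from their joint distribution, using the first formula above (helper names are mine):

import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(joint):
    # joint[c, l] = p(y = c, t = l)
    p_y = joint.sum(axis=1)                      # marginal p(y)
    p_t = joint.sum(axis=0)                      # marginal p(t)
    H_y_given_t = sum(p_t[l] * entropy(joint[:, l] / p_t[l])
                      for l in range(joint.shape[1]) if p_t[l] > 0)
    return entropy(p_y) - H_y_given_t            # I(t, y) = H(y) - H(y|t)

joint = np.array([[0.4, 0.1],                    # y and t are dependent here
                  [0.1, 0.4]])
print(mutual_information(joint))                                # ~0.28 > 0
print(mutual_information(np.outer([0.5, 0.5], [0.5, 0.5])))     # independent: 0.0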

SLIDES 70-78

Entropy for classification cost

we care about the distribution of labels in each region
  p_k(y = c) = (1/N_k) ∑_{x^(n) ∈ R_k} I(y^(n) = c)

misclassification cost: cost(R_k, D) = (1/N_k) ∑_{x^(n) ∈ R_k} I(y^(n) ≠ w_k) = 1 - p_k(w_k)
  where w_k = argmax_c p_k(c) is the most probable class

entropy cost: cost(R_k, D) = H(y), the entropy of the label distribution in R_k; choose the split with the lowest entropy

with the entropy cost, the change in the cost becomes the mutual information between the test and the labels:
  cost(R_node, D) - ((N_left / N_node) cost(R_left, D) + (N_right / N_node) cost(R_right, D))
  = H(y) - (p(x_d ≥ s_{d,n}) H(p(y | x_d ≥ s_{d,n})) + p(x_d < s_{d,n}) H(p(y | x_d < s_{d,n})))
  = I(y, x_d > s_{d,n})

so we are choosing the test which is maximally informative about the labels
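A sketch of this criterion: the reduction in entropy cost is the empirical mutual information (information gain) between a test and the labels (my own helper, not course code):

import numpy as np

def entropy_of_labels(y):
    if y.size == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(y, test):
    # test is a boolean array, e.g. test = (x_d > s); the gain is
    # H(y) - (p(test) H(y | test) + p(~test) H(y | ~test)) = I(y, test)
    p = test.mean()
    return entropy_of_labels(y) - (p * entropy_of_labels(y[test])
                                   + (1 - p) * entropy_of_labels(y[~test]))

y   = np.array([0, 0, 0, 1, 1, 1, 1, 1])
x_d = np.array([1., 2., 3., 4., 10., 11., 12., 13.])
print(information_gain(y, x_d > 3.5))    # separates the classes perfectly: gain = H(y)
print(information_gain(y, x_d > 11.5))   # much less informative test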

SLIDES 79-85

Entropy for classification cost: example

same example as before: R_node (.5, 100%); split 1 gives R_left (.25, 50%) and R_right (.75, 50%); split 2 gives R_left (.33, 75%) and R_right (1, 25%)

misclassification cost:
  split 1: (4/8)(1/4) + (4/8)(1/4) = 1/4
  split 2: (6/8)(1/3) + (2/8)(0) = 1/4
  the same costs

entropy cost (using base 2 logarithm):
  split 1: (4/8)(-(1/4) log(1/4) - (3/4) log(3/4)) + (4/8)(-(1/4) log(1/4) - (3/4) log(3/4)) ≈ .81
  split 2: (6/8)(-(1/3) log(1/3) - (2/3) log(2/3)) + (2/8)(0) ≈ .69
  the second split has the lower cost
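Quick numerical check of the two entropy costs (not on the slides):

import numpy as np

def H(p):
    # binary entropy (base 2) of a region whose positive-label fraction is p
    if p in (0.0, 1.0):
        return 0.0
    return float(-(p * np.log2(p) + (1 - p) * np.log2(1 - p)))

split_1 = 4/8 * H(1/4) + 4/8 * H(3/4)
split_2 = 6/8 * H(1/3) + 2/8 * H(1.0)
print(round(split_1, 2), round(split_2, 2))   # 0.81 0.69 -> the second split is cheaper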

SLIDES 86-90

Gini index

another cost for selecting the test in classification

misclassification (error) rate: cost(R_k, D) = (1/N_k) ∑_{x^(n) ∈ R_k} I(y^(n) ≠ w_k) = 1 - p(w_k)

entropy: cost(R_k, D) = H(y)

Gini index: the expected error rate
  cost(R_k, D) = ∑_{c=1}^{C} p(c)(1 - p(c)) = ∑_{c=1}^{C} p(c) - ∑_{c=1}^{C} p(c)^2 = 1 - ∑_{c=1}^{C} p(c)^2
  where p(c) is the probability of class c and 1 - p(c) is the probability of error

[figure: comparison of the three costs of a node when we have 2 classes, plotted against p(y = 1)]
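A short sketch of the three node costs for two classes as a function of p = p(y = 1), i.e. the quantities compared in the figure:

import numpy as np

def misclassification(p):
    return 1 - max(p, 1 - p)              # 1 - p(w_k)

def entropy2(p):
    if p in (0.0, 1.0):
        return 0.0
    return float(-(p * np.log2(p) + (1 - p) * np.log2(1 - p)))

def gini(p):
    return 1 - (p ** 2 + (1 - p) ** 2)    # = 2 p (1 - p) for two classes

for p in [0.0, 0.1, 0.25, 0.5]:
    print(p, round(misclassification(p), 3), round(gini(p), 3), round(entropy2(p), 3))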

SLIDES 91-95

Example

decision tree for the Iris dataset (D=2 features)

[figures: the dataset, the decision tree, and the axis-aligned decision boundaries after 1, 2, and 3 splits]

the decision boundaries suggest overfitting, confirmed using a validation set:
  training accuracy ~ 85%
  (cross-)validation accuracy ~ 70%
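This kind of experiment can be reproduced with scikit-learn's CART implementation; a sketch, assuming scikit-learn is installed and picking the two petal features (the slide does not say which two features it uses, and the exact accuracies will differ):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, 2:]                                          # keep D = 2 features
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy")    # fully grown tree, likely to overfit
tree.fit(X_tr, y_tr)
print("training accuracy:", tree.score(X_tr, y_tr))
print("validation accuracy:", tree.score(X_va, y_va))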

SLIDES 96-102

Overfitting

a decision tree can fit any Boolean function (binary classification with binary features)

example: decision tree representation of a Boolean function (D=3); there are 2^(2^D) such functions, why?

large decision trees have high variance and low bias (low training error, high test error)

idea 1. grow a small tree
  problem: a substantial reduction in cost may only happen after a few more steps; by stopping early we cannot know this
  example: the cost drops only after the second node

image credit: https://www.wikiwand.com/en/Binary_decision_diagram

SLIDES 103-109

Pruning

idea 2. grow a large tree and then prune it:
- greedily turn an internal node into a leaf node
- the choice is based on the lowest increase in the cost
- repeat this until left with the root node
- pick the best among the above models using a validation set

example: the tree before and after pruning; cross-validation is used to pick the best size

idea 3. random forests (later!)
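scikit-learn implements a related idea, CART cost-complexity (weakest-link) pruning, rather than the exact greedy procedure above; a hedged sketch of picking the pruning strength with a validation set:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# candidate pruning strengths along the cost-complexity pruning path of the full tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# fit one pruned tree per alpha and keep the one that validates best
trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
         for a in path.ccp_alphas]
best = max(trees, key=lambda t: t.score(X_va, y_va))
print("leaves:", best.get_n_leaves(), "validation accuracy:", best.score(X_va, y_va))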

SLIDES 110-114

Summary

model: divide the input into axis-aligned regions
cost: MSE for regression, misclassification rate for classification

optimization:
- NP-hard; use a greedy heuristic
- adjust the cost for the heuristic: using entropy (relation to mutual information maximization) or using the Gini index

decision trees are unstable (have high variance); use pruning to avoid overfitting

there are variations on decision tree heuristics; what we discussed is called Classification and Regression Trees (CART)