Applied Machine Learning
Decision Trees
Siamak Ravanbakhsh
COMP 551 (Winter 2020)
so far we have assumed a fixed set of bases in f(x) = \sum_d w_d \phi_d(x)

several methods can be classified as learning these bases adaptively:
decision trees, generalized additive models, boosting, neural networks

f(x) = \sum_d w_d \phi_d(x; v_d)

each basis has its own parameters v_d
pros: decision trees are interpretable! they are not very sensitive to outliers and do not need data normalization
cons: they can easily overfit and they are unstable; remedies include pruning and random forests

image credit: https://mymodernmet.com/the-30second-rule-a-decision/
divide the input space into regions and learn one function per region

f(x) = \sum_k w_k I(x \in R_k)

the regions are learned adaptively
more sophisticated prediction per region is also possible (e.g., one linear model per region)
split regions successively based on the value of a single variable, called a test
each region is a set of conditions, e.g. R_2 = \{x_1 \le t_1, x_2 \le t_4\}
suppose we have identified the regions R_k
what constant w_k should we use for prediction in each region?
for regression
use the mean value of the training data-points in that region: w_k = mean(y^{(n)} \mid x^{(n)} \in R_k)

for classification
count the frequency of classes per region and predict the most frequent label: w_k = mode(y^{(n)} \mid x^{(n)} \in R_k)
example: predicting survival on the Titanic
each leaf shows the most frequent label, the frequency of survival, and the percentage of training data in that region
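as a concrete illustration of these per-region constants, here is a minimal NumPy sketch (the region assignment and the toy arrays are hypothetical, not from the slides):

import numpy as np
from collections import Counter

# toy data: which region R_k each training point x^(n) falls in (hypothetical)
regions = np.array([0, 0, 1, 1, 1, 2])
y_reg = np.array([1.0, 3.0, 2.0, 2.5, 1.5, 7.0])                                # regression targets
y_clf = np.array(['died', 'died', 'survived', 'died', 'survived', 'survived'])  # class labels

# regression: w_k is the mean of the targets in region k
w_regression = {k: y_reg[regions == k].mean() for k in np.unique(regions)}

# classification: w_k is the most frequent label (mode) in region k
w_classification = {k: Counter(y_clf[regions == k]).most_common(1)[0][0]
                    for k in np.unique(regions)}

print(w_regression)      # {0: 2.0, 1: 2.0, 2: 7.0}
print(w_classification)  # {0: 'died', 1: 'survived', 2: 'survived'}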
given a feature, what are the possible tests?

continuous features (e.g., age, height, GDP)
all the values that appear in the dataset can be used to split: S_d = \{s_{d,n} = x_d^{(n)}\}
each split is asking x_d > s_{d,n}?
categorical features: x_d \in \{1, \ldots, C\}
we can split at any value, so S_d = \{s_{d,1} = 1, \ldots, s_{d,C} = C\}
each split is asking x_d > s_{d,c}?
multi-way split: x_d = ? (one branch per value)
problem: it could lead to sparse subsets
data fragmentation: some splits may have few or no datapoints
binary split: instead of x_d \in \{1, \ldots, C\} we have C binary features (one-hot coding)
x_{d,1} \in \{0, 1\}, x_{d,2} \in \{0, 1\}, \ldots, x_{d,C} \in \{0, 1\}
each split is asking a binary question, e.g. x_{d,2} = 0? or x_{d,2} = 1?
alternative: binary splits that produce balanced subsets
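to make the candidate tests concrete, here is a small sketch (toy feature values, purely illustrative) that enumerates the threshold set S_d for a continuous feature and the binary tests obtained from one-hot coding a categorical feature:

import numpy as np

# continuous feature: every value observed in the data is a candidate threshold
age = np.array([22.0, 35.0, 22.0, 61.0, 44.0])
S_age = np.unique(age)                        # candidate split points s_{d,n}
print([f"age > {s}?" for s in S_age])         # ['age > 22.0?', 'age > 35.0?', 'age > 44.0?', 'age > 61.0?']

# categorical feature with C values: one-hot coding gives C binary features,
# each of which yields a single binary test (x_{d,c} = 1?)
color = np.array(['red', 'green', 'red', 'blue', 'green'])
values = np.unique(color)                     # the C category values
one_hot = (color[:, None] == values[None, :]).astype(int)   # shape (N, C)
print(values)     # ['blue' 'green' 'red']
print(one_hot)    # binary columns x_{d,1}, ..., x_{d,C}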
regression cost (for predicting a constant w_k \in \mathbb{R} per region)

cost(R_k, D) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} (y^{(n)} - w_k)^2

N_k is the number of instances in region k; this is the cost per region (mean squared error - MSE)
it is minimized by w_k = mean(y^{(n)} \mid x^{(n)} \in R_k)
classification cost (for predicting a constant class w_k \in \{1, \ldots, C\} per region)

cost(R_k, D) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} I(y^{(n)} \neq w_k)

this is the cost per region (misclassification rate)
it is minimized by w_k = mode(y^{(n)} \mid x^{(n)} \in R_k)
total cost in both cases is the normalized sum over regions: \sum_k \frac{N_k}{N} cost(R_k, D)
it is sometimes possible to build a tree with zero cost: build a large tree with each instance having its own region (overfitting!)
new objective: find a decision tree with K tests minimizing the cost function
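a minimal sketch of the two per-region costs and the normalized total cost (function and variable names are illustrative; `regions` plays the role of the region assignment):

import numpy as np
from collections import Counter

def mse_cost(y):
    """regression cost of one region: mean squared error around the region mean."""
    return float(np.mean((y - np.mean(y)) ** 2)) if len(y) else 0.0

def misclassification_cost(y):
    """classification cost of one region: 1 - frequency of the most common label."""
    if len(y) == 0:
        return 0.0
    return 1.0 - Counter(y.tolist()).most_common(1)[0][1] / len(y)

def total_cost(y, regions, region_cost):
    """normalized sum over regions: sum_k (N_k / N) * cost(R_k, D)."""
    N = len(y)
    return sum((np.sum(regions == k) / N) * region_cost(y[regions == k])
               for k in np.unique(regions))

# toy example (hypothetical): two regions of three points each
y = np.array([0, 0, 1, 1, 1, 0])
regions = np.array([0, 0, 0, 1, 1, 1])
print(total_cost(y, regions, misclassification_cost))   # 1/3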
a tree with K tests partitions the input into K+1 regions
alternatively, find the smallest tree (smallest K) that classifies all examples correctly
(figure: an example of a partition of the input space that cannot be produced by a decision tree)

assuming D features, how many different partitions into K+1 regions are there?
the number of full binary trees with K+1 leaves (regions R_k) is the Catalan number \frac{1}{K+1} \binom{2K}{K}, which is exponential in K:
1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786, 208012, 742900, 2674440, 9694845, 35357670, 129644790, 477638700, 1767263190, 6564120420, 24466267020, 91482563640, 343059613650, 1289904147324, 4861946401452
we also have a choice of feature x_d for each of the K internal nodes: D^K possibilities
moreover, for each feature there are different choices of the split point s_{d,n} \in S_d
bottom line: finding the optimal decision tree is an NP-hard combinatorial optimization problem
recursively split the regions based on a greedy choice of the next test
end the recursion if the node is not worth splitting

function fittree(R_node, D, depth)
    if not worthsplitting(depth, R_node, D)
        return R_node
    else
        R_left, R_right = greedytest(R_node, D)
        leftset  = fittree(R_left, D, depth+1)
        rightset = fittree(R_right, D, depth+1)
        return {leftset, rightset}
the final decision tree is in the form of a nested list of regions, e.g. {{R_1, R_2}, {R_3, {R_4, R_5}}}
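a rough Python transcription of this recursion, under the assumption that a region is represented by the array of indices of the training points it contains; `greedy_test` and `worth_splitting` are hypothetical callables standing in for the subroutines sketched on the next slides:

def fit_tree(region, depth, greedy_test, worth_splitting):
    """recursively split `region` (an array of row indices into the training data);
    `worth_splitting(depth, region)` decides whether to stop and
    `greedy_test(region)` returns the best (R_left, R_right) pair;
    returns either the region itself (a leaf) or a nested list [leftset, rightset]."""
    if not worth_splitting(depth, region):
        return region                                  # leaf: keep this region as-is
    left, right = greedy_test(region)                  # greedy choice of the next test
    left_set = fit_tree(left, depth + 1, greedy_test, worth_splitting)
    right_set = fit_tree(right, depth + 1, greedy_test, worth_splitting)
    return [left_set, right_set]                       # nested list of regions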
the split is greedy because it looks only one step ahead; this may not lead to the lowest overall cost

splitcost = \frac{N_{left}}{N_{node}} cost(R_{left}, D) + \frac{N_{right}}{N_{node}} cost(R_{right}, D)

function greedytest(R_node, D)
    bestcost = inf
    for d \in {1, ..., D}, s_{d,n} \in S_d
        R_left  = R_node \cup {x_d < s_{d,n}}      # creating new regions
        R_right = R_node \cup {x_d >= s_{d,n}}
        splitcost = (N_left/N_node) cost(R_left, D) + (N_right/N_node) cost(R_right, D)   # evaluate their cost
        if splitcost < bestcost:
            bestcost = splitcost
            R*_left  = R_left
            R*_right = R_right
    return R*_left, R*_right                       # return the split with the lowest greedy cost
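a sketch of the same greedy search in Python, assuming a region is an array of row indices and `region_cost` is one of the per-region costs defined earlier (all names are illustrative):

import numpy as np

def greedy_test(region, X, y, region_cost):
    """return the pair (R_left, R_right) for the single split x_d >= s with the lowest
    weighted cost (N_left/N_node) cost(R_left) + (N_right/N_node) cost(R_right)."""
    region = np.asarray(region)
    n_node = len(region)
    best_cost, best_split = np.inf, None
    for d in range(X.shape[1]):                       # choice of feature
        for s in np.unique(X[region, d]):             # candidate thresholds S_d
            left = region[X[region, d] < s]           # creating new regions
            right = region[X[region, d] >= s]
            if len(left) == 0 or len(right) == 0:
                continue                              # skip degenerate splits
            split_cost = (len(left) / n_node) * region_cost(y[left]) \
                       + (len(right) / n_node) * region_cost(y[right])   # evaluate their cost
            if split_cost < best_cost:
                best_cost, best_split = split_cost, (left, right)
    return best_split                                 # split with the lowest greedy cost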
worthsplitting subroutine

if we only stop splitting when R_node has zero cost, we may overfit

heuristics for stopping the splitting (a sketch of these checks is given below):
- reached a desired depth
- the number of examples in R_left or R_right is too small
- w_k is already a good approximation, i.e. the cost of the node is small enough
- the reduction in cost from splitting is small:
  cost(R_node, D) - \left( \frac{N_{left}}{N_{node}} cost(R_{left}, D) + \frac{N_{right}}{N_{node}} cost(R_{right}, D) \right)

image credit: https://alanjeffares.wordpress.com/tutorials/decision-tree/
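a sketch of a stopping rule combining these heuristics (the thresholds `max_depth`, `min_samples`, and `min_cost` are hypothetical hyperparameters, not values from the slides):

def worth_splitting(depth, region, y, region_cost,
                    max_depth=5, min_samples=5, min_cost=1e-3):
    """return True if the node should be split further, False if it becomes a leaf."""
    if depth >= max_depth:                       # reached the desired depth
        return False
    if len(region) < min_samples:                # too few examples to split reliably
        return False
    if region_cost(y[region]) <= min_cost:       # w_k is already a good approximation
        return False
    # the "reduction in cost is small" check would compare region_cost(y[region])
    # against the weighted cost of the best split returned by greedy_test
    return True

to plug these sketches into fit_tree above, the data and the cost are bound first, e.g. (assuming X, y are the training arrays and numpy is imported as np):

root = np.arange(len(y))
tree = fit_tree(root, 0,
                lambda r: greedy_test(r, X, y, misclassification_cost),
                lambda d, r: worth_splitting(d, r, y, misclassification_cost))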
revisiting the classification cost

ideally we want to optimize the 0-1 loss (misclassification rate)

cost(R_k, D) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} I(y^{(n)} \neq w_k)

however, this may not be the best cost to guide each step of the greedy heuristic

example: a node R_node at (.5, 100%) can be split into
R_left, R_right = (.25, 50%), (.75, 50%)   or   R_left, R_right = (.33, 75%), (1, 25%)
(each pair shows the fraction of one class and the percentage of the training data in that region)

both splits have the same misclassification rate (2/8)
however, the second split may be preferable because one region does not need further splitting
use a measure for homogeneity of labels in regions
entropy is the expected amount of information in observing a random variable

H(y) = -\sum_{c=1}^{C} p(y = c) \log p(y = c)

-\log p(y = c) is the amount of information in observing c:
- zero information if p(c) = 1
- less probable events are more informative: p(c) < p(c') \Rightarrow -\log p(c) > -\log p(c')
- information from two independent events is additive: -\log(p(c) q(d)) = -\log p(c) - \log q(d)

a uniform distribution has the highest entropy: H(y) = -\sum_{c=1}^{C} \frac{1}{C} \log \frac{1}{C} = \log C
a deterministic random variable has the lowest entropy: H(y) = -1 \log(1) = 0

note that it is common to use capital letters for random variables (here for consistency we use lower-case)
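a small numerical check of these two extreme cases (a sketch using base-2 logarithms; the helper name `entropy` is ours, not from the slides):

import numpy as np

def entropy(p):
    """H(y) = -sum_c p(c) log2 p(c), ignoring zero-probability classes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

C = 4
print(entropy(np.ones(C) / C))    # uniform distribution: log2(C) = 2.0 (highest)
print(entropy([1.0, 0, 0, 0]))    # deterministic variable: 0.0 (lowest)
print(entropy([0.25, 0.75]))      # ~0.811, reused in the worked example below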
for two random variables t, y, the mutual information is the amount of information t conveys about y:
the change in the entropy of y after observing the value of t

I(t, y) = H(y) - H(y \mid t)

where the conditional entropy is H(y \mid t) = \sum_{l=1}^{L} p(t = l) H(y \mid t = l)

I(y, t) = \sum_l \sum_c p(y = c, t = l) \log \frac{p(y = c, t = l)}{p(y = c) p(t = l)}

this is symmetric w.r.t. y and t: I(y, t) = H(y) - H(y \mid t) = H(t) - H(t \mid y) = I(t, y)
it is always non-negative and zero only if y and t are independent (try to prove these properties)
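a sketch that computes I(y, t) from a (hypothetical) joint probability table and numerically checks the identity I(t, y) = H(y) - H(y|t) and the symmetry:

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(joint):
    """I(y,t) = sum_{c,l} p(y=c,t=l) log2 [ p(y=c,t=l) / (p(y=c) p(t=l)) ]."""
    joint = np.asarray(joint, dtype=float)
    p_y = joint.sum(axis=1, keepdims=True)        # marginal p(y = c)
    p_t = joint.sum(axis=0, keepdims=True)        # marginal p(t = l)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (p_y * p_t)[mask])))

# hypothetical joint distribution: rows are classes y, columns are test outcomes t
joint = np.array([[0.3, 0.1],
                  [0.1, 0.5]])
p_y, p_t = joint.sum(axis=1), joint.sum(axis=0)
H_y_given_t = sum(p_t[l] * entropy(joint[:, l] / p_t[l]) for l in range(joint.shape[1]))

print(mutual_information(joint))         # I(y, t) ~ 0.256
print(entropy(p_y) - H_y_given_t)        # H(y) - H(y|t), same value
print(mutual_information(joint.T))       # I(t, y): symmetric, same value again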
we care about the distribution of labels within each region

p_k(y = c) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} I(y^{(n)} = c)

misclassification cost
cost(R_k, D) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} I(y^{(n)} \neq w_k) = 1 - p_k(w_k)
where w_k = \arg\max_c p_k(c) is the most probable class

entropy cost
cost(R_k, D) = H(y), the entropy of the label distribution p_k in region R_k
choose the split with the lowest entropy

with the entropy cost, the change in the cost becomes the mutual information between the test and the labels:

cost(R_node, D) - \left( \frac{N_{left}}{N_{node}} cost(R_{left}, D) + \frac{N_{right}}{N_{node}} cost(R_{right}, D) \right)
= H(y) - \left( p(x_d \ge s_{d,n}) H(y \mid x_d \ge s_{d,n}) + p(x_d < s_{d,n}) H(y \mid x_d < s_{d,n}) \right)
= I(y, x_d > s_{d,n})

so we are choosing the test that is maximally informative about the labels
example (the same two candidate splits as before)

R_node = (.5, 100%)
split A: R_left, R_right = (.25, 50%), (.75, 50%)
split B: R_left, R_right = (.33, 75%), (1, 25%)

misclassification cost
split A: \frac{4}{8} \cdot \frac{1}{4} + \frac{4}{8} \cdot \frac{1}{4} = \frac{1}{4}
split B: \frac{6}{8} \cdot \frac{1}{3} + \frac{2}{8} \cdot 0 = \frac{1}{4}
the same cost for both splits

entropy cost (using base 2 logarithm)
split A: \frac{4}{8} \left( -\frac{1}{4}\log\frac{1}{4} - \frac{3}{4}\log\frac{3}{4} \right) + \frac{4}{8} \left( -\frac{1}{4}\log\frac{1}{4} - \frac{3}{4}\log\frac{3}{4} \right) \approx .81
split B: \frac{6}{8} \left( -\frac{1}{3}\log\frac{1}{3} - \frac{2}{3}\log\frac{2}{3} \right) + \frac{2}{8} \cdot 0 \approx .68
the second split has the lower cost
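the numbers above can be checked directly; a small sketch (reusing the `entropy` helper defined earlier):

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def misclassification(p):
    return 1.0 - max(p)

# each child region: (class distribution, fraction of the node's data)
split_A = [((0.25, 0.75), 0.5), ((0.75, 0.25), 0.5)]
split_B = [((1/3, 2/3), 0.75), ((1.0, 0.0), 0.25)]

for name, split in [("A", split_A), ("B", split_B)]:
    mis = sum(w * misclassification(p) for p, w in split)
    ent = sum(w * entropy(p) for p, w in split)
    print(name, round(mis, 3), round(ent, 3))
# A 0.25 0.811
# B 0.25 0.689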
another cost for selecting the test in classification

misclassification (error) rate: cost(R_k, D) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} I(y^{(n)} \neq w_k) = 1 - p(w_k)
entropy: cost(R_k, D) = H(y)

Gini index: the expected error rate
cost(R_k, D) = \sum_{c=1}^{C} p(c) (1 - p(c)) = \sum_{c=1}^{C} p(c) - \sum_{c=1}^{C} p(c)^2 = 1 - \sum_{c=1}^{C} p(c)^2
here p(c) is the probability of class c and 1 - p(c) is the probability of error when predicting class c

(figure: comparison of the three costs of a node when we have 2 classes)
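for two classes all three costs are functions of a single number p = p(c = 1), so the comparison can be tabulated directly (a sketch reproducing the shape of that figure, not the figure itself):

import numpy as np

p = np.linspace(0.05, 0.95, 10)              # probability of class 1 in the node
misclass = 1 - np.maximum(p, 1 - p)          # misclassification rate: 1 - max(p, 1-p)
gini = 2 * p * (1 - p)                       # Gini index: 1 - p^2 - (1-p)^2
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

for row in zip(p, misclass, gini, entropy):
    print("p=%.2f  misclassification=%.3f  gini=%.3f  entropy=%.3f" % row)
# all three costs peak at p = 0.5 and vanish as the node becomes pure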
decision tree for the Iris dataset

(figure: the dataset with D = 2 features, the fitted decision tree, and the resulting decision boundaries)

the decision boundaries suggest overfitting, confirmed using a validation set:
training accuracy ~ 85%, (cross-)validation accuracy ~ 70%
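a sketch of this kind of experiment with scikit-learn; the exact accuracies depend on which two features are kept, how deep the tree is grown, and how the data is split, so they will not necessarily reproduce the ~85% / ~70% figures above:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, :2]                                   # keep D = 2 features (sepal length and width)

tree = DecisionTreeClassifier(random_state=0)  # fully grown tree
tree.fit(X, y)
print("training accuracy:", tree.score(X, y))
print("5-fold cross-validation accuracy:", cross_val_score(tree, X, y, cv=5).mean())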
a decision tree can fit any Boolean function (binary classification with binary features)
example: decision tree representation of a Boolean function (D = 3); there are 2^{2^D} such functions, why?
image credit: https://www.wikiwand.com/en/Binary_decision_diagram

large decision trees have high variance and low bias (low training error, high test error)

idea 1. grow a small tree
a substantial reduction in cost may only happen after a few more splits; by stopping early we cannot know this
(example: the cost only drops after the second node)
idea 2. grow a large tree and then prune it
greedily turn an internal node into a leaf node; the choice is based on the lowest increase in the cost
repeat this until only the root node is left
pick the best among the resulting sequence of models using a validation set (cross-validation is used to pick the best size)
(example: the tree before and after pruning)

idea 3. random forests (later!)
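scikit-learn implements a grow-then-prune strategy via cost-complexity pruning; a sketch of picking the pruning strength on a validation set (ccp_alpha controls how aggressively internal nodes are collapsed; this is cost-complexity pruning rather than the exact greedy procedure above, but the idea is the same):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# grow a large tree, then get the sequence of pruning strengths (one pruned subtree per alpha)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_acc = 0.0, -1.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    acc = pruned.score(X_val, y_val)           # pick the best tree size on held-out data
    if acc > best_acc:
        best_alpha, best_acc = alpha, acc

print("best ccp_alpha:", best_alpha, "validation accuracy:", best_acc)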
summary

model: divide the input into axis-aligned regions
cost: mean squared error for regression, misclassification rate for classification
finding the optimal tree is NP-hard, so we use a greedy heuristic
adjust the cost used by the greedy heuristic: using entropy (related to mutual information maximization) or the Gini index
decision trees are unstable (have high variance); use pruning to avoid overfitting
there are variations on decision tree heuristics; what we discussed is called Classification and Regression Trees (CART)