SLIDE 1 Machine Learning 10-601: Decision Trees
Geoff Gordon, Miroslav Dudík
(partly based on slides of Carlos Guestrin and Andrew Moore)
http://www.cs.cmu.edu/~ggordon/10601/
October 21, 2009
SLIDE 2 Non-linear Classifiers
Dealing with a non-linear decision boundary:
- 1. add “non-linear” features to a linear model (e.g., logistic regression)
- 2. use non-linear learners (nearest neighbors, decision trees, artificial neural nets, ...)
k-Nearest Neighbor Classifier
- simple, often a good baseline
- can approximate an arbitrary boundary: non-parametric
- downside: stores all the data
SLIDE 3 A Decision Tree for PlayTennis
Each internal node: test one feature Xj
Each branch from a node: select one value for Xj
Each leaf node: predict Y or P(Y | X ∈ leaf)
SLIDE 4 Decision trees
How would you represent Y = A ∨ B (A or B)?
SLIDE 5 Decision trees
How would you represent Y = (A∧B) ∨ (¬A∧C) ((A and B) or (not A and C))?
SLIDE 6 Optimal Learning of Decision Trees is Hard
- learning the smallest (simplest) decision tree is NP-complete (existing algorithms exponential)
- instead, use a greedy heuristic:
  – start with an empty tree
  – choose the next best attribute (feature)
  – recurse
SLIDE 7 A small dataset: predict miles per gallon (mpg)
SLIDE 8 A Decision Stump
SLIDE 9 Recursion Step
SLIDE 10 Recursion Step
SLIDE 11 Second Level of Tree
SLIDE 12 The final tree
SLIDE 13 Which attribute is the best?
A good split: increases certainty about classification after the split

X1  X2  Y
T   T   T
T   F   T
T   T   T
T   F   T
F   T   T
F   F   F
F   T   F
F   F   F
SLIDE 14 Entropy = measure of uncertainty
Entropy H(Y) of a random variable Y:
H(Y) = – ∑_{i=1}^{m} P(Y=yi) log2 P(Y=yi)
H(Y) is the expected number of bits needed to encode a randomly drawn value of Y
SLIDE 15 Entropy = measure of uncertainty
Entropy H(Y) of a random variable Y:
H(Y) = – ∑_{i=1}^{m} P(Y=yi) log2 P(Y=yi)
H(Y) is the expected number of bits needed to encode a randomly drawn value of Y
Why?
SLIDE 16 Entropy = measure of uncertainty
Entropy H(Y) of a random variable Y:
H(Y) = – ∑_{i=1}^{m} P(Y=yi) log2 P(Y=yi)
H(Y) is the expected number of bits needed to encode a randomly drawn value of Y
Why? Information Theory: the most efficient code assigns – log2 P(Y=yi) bits to the message Y=yi
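To make this concrete, here is a minimal sketch (my own illustration, not code from the lecture) that estimates H(Y) in bits from a list of observed labels; the function name entropy and the use of Python's collections.Counter are arbitrary choices.

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical entropy H(Y) in bits, estimated from a list of labels."""
    n = len(labels)
    h = 0.0
    for count in Counter(labels).values():
        p = count / n
        h -= p * math.log2(p)   # p = 0 never occurs: Counter only holds observed values
    return h

# Example: the 8-row table from the earlier slide has 5 T's and 3 F's.
print(entropy(list("TTTTTFFF")))   # about 0.954 bits
```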
SLIDE 17 Entropy = measure of uncertainty
Y binary: P(Y=t) = θ, P(Y=f) = 1 – θ
H(Y) = – θ log2 θ – (1 – θ) log2 (1 – θ)
[plot: H(Y) as a function of θ]
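As a quick check of this formula (not on the slide): for θ = 0.5, H(Y) = –0.5·log2(0.5) – 0.5·log2(0.5) = 1 bit (maximal uncertainty); for θ = 0 or θ = 1, H(Y) = 0 (no uncertainty); for θ = 5/8, H(Y) ≈ 0.954 bits.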
SLIDE 18 Information Gain = reduction in uncertainty
Entropy of Y before the split: H(Y)
Entropy of Y after the split (weighted by the probability of each branch):
H(Y|X) = – ∑_{j=1}^{k} P(X=xj) ∑_{i=1}^{m} P(Y=yi|X=xj) log2 P(Y=yi|X=xj)
Information gain = difference: IG(X) = H(Y) – H(Y|X)

X1  X2  Y
T   T   T
T   F   T
T   T   T
T   F   T
F   T   T
F   F   F
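As an illustration (a sketch of my own, not code from the lecture), the following estimates H(Y), H(Y|X), and IG(X) from the six-row table above; the column names X1, X2, Y follow the table, and everything else is an arbitrary implementation choice.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Empirical entropy H(Y) in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(xs, ys):
    """H(Y|X) = sum_j P(X=xj) H(Y | X=xj), estimated from paired samples."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    n = len(ys)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def information_gain(xs, ys):
    return entropy(ys) - conditional_entropy(xs, ys)

# The six-row table from this slide.
X1 = list("TTTTFF")
X2 = list("TFTFTF")
Y  = list("TTTTTF")

print(information_gain(X1, Y))   # about 0.32 bits: splitting on X1 removes more uncertainty
print(information_gain(X2, Y))   # about 0.19 bits: than splitting on X2
```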
SLIDE 19 Learning decision trees
- start with an empty tree
- choose the next best attribute (feature)
  – for example, one that maximizes information gain (see the sketch below)
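A minimal ID3-style sketch of this greedy procedure (my own illustration, assuming discrete features; none of these names come from the lecture):

```python
import math
from collections import Counter, defaultdict

def entropy(ys):
    n = len(ys)
    return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())

def info_gain(rows, ys, feature):
    groups = defaultdict(list)
    for row, y in zip(rows, ys):
        groups[row[feature]].append(y)
    n = len(ys)
    return entropy(ys) - sum(len(g) / n * entropy(g) for g in groups.values())

def grow_tree(rows, ys, features):
    """Greedy tree growing: a node is either a predicted label (leaf)
    or a pair (feature, {value: subtree})."""
    # Base cases: pure labels, or no features left to split on.
    if len(set(ys)) == 1 or not features:
        return Counter(ys).most_common(1)[0][0]
    # Choose the attribute with the largest information gain.
    best = max(features, key=lambda f: info_gain(rows, ys, f))
    children = {}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        children[value] = grow_tree([rows[i] for i in idx],
                                    [ys[i] for i in idx],
                                    [f for f in features if f != best])
    return (best, children)

# Toy usage on the table from the information-gain slide:
rows = [{"X1": a, "X2": b} for a, b in zip("TTTTFF", "TFTFTF")]
ys = list("TTTTTF")
print(grow_tree(rows, ys, ["X1", "X2"]))
# splits on X1 first, then on X2 only within the X1 = F branch
```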
SLIDE 20
SLIDE 21 A Decision Stump
SLIDE 22 Base Case One
SLIDE 23 Base Case Two
SLIDE 24 Base Case Two: attributes cannot distinguish classes
SLIDE 25 Base cases
SLIDE 26 Base cases: An idea
SLIDE 27 Base cases: An idea
SLIDE 28 The problem with Base Case 3
SLIDE 29 If we omit Base Case 3:
SLIDE 30 Basic Decision-Tree Building Summarized:
SLIDE 31 MPG test set error
SLIDE 32 MPG test set error
SLIDE 33 Decision trees overfit!
Standard decision trees:
- training error always zero (if no label noise)
- lots of variance
SLIDE 34 Avoiding overfitting
- fixed depth
- fixed number of leaves
- stop when splits are not statistically significant
SLIDE 35 Avoiding overfitting
- fixed depth
- fixed number of leaves
- stop when splits are not statistically significant
OR: grow the full tree, then prune (collapse some subtrees)
SLIDE 36 Reduced Error Pruning
Split the available data into training and pruning sets
- 1. Learn a tree that classifies the training set perfectly
- 2. Do until further pruning is harmful over the pruning set:
  – consider pruning each node
  – collapse the node that best improves pruning-set accuracy
This produces the smallest version of the most accurate tree (over the pruning set); a sketch of the loop follows.
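A sketch of this pruning loop (my own illustration, not code from the slides). It assumes a tree representation where a leaf is a plain label and an internal node is a dict carrying its children and the majority training label it would predict if collapsed; all names here are hypothetical.

```python
import copy

# A leaf is a plain label string. An internal node is a dict:
#   {"feature": name, "children": {value: subtree}, "majority": label, "collapsed": False}
# "majority" is the most common training label at that node (its prediction if collapsed).

def predict(node, row):
    while isinstance(node, dict) and not node["collapsed"]:
        node = node["children"].get(row[node["feature"]], node["majority"])
    return node["majority"] if isinstance(node, dict) else node

def accuracy(node, rows, ys):
    return sum(predict(node, r) == y for r, y in zip(rows, ys)) / len(ys)

def internal_nodes(node):
    """Yield every internal node that has not been collapsed yet."""
    if isinstance(node, dict) and not node["collapsed"]:
        yield node
        for child in node["children"].values():
            yield from internal_nodes(child)

def reduced_error_prune(tree, prune_rows, prune_ys):
    """Greedily collapse the node that most improves pruning-set accuracy;
    stop when every remaining collapse would hurt."""
    tree = copy.deepcopy(tree)
    while True:
        base = accuracy(tree, prune_rows, prune_ys)
        best, best_acc = None, base
        for node in list(internal_nodes(tree)):
            node["collapsed"] = True           # try collapsing this node ...
            acc = accuracy(tree, prune_rows, prune_ys)
            node["collapsed"] = False          # ... and undo the trial
            if acc >= best_acc:                # prune on ties: prefer the smaller tree
                best, best_acc = node, acc
        if best is None:
            return tree
        best["collapsed"] = True
```

Pruning on ties (the `>=` test) is what yields the smallest version of the most accurate tree over the pruning set.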
SLIDE 37 Impact of Pruning
SLIDE 38 A Generic Tree-Learning Algorithm
Need to specify:
- an objective to select splits
- a criterion for pruning (or stopping)
- parameters for pruning/stopping (usually determined by cross-validation; see the example below)
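For instance, with modern tooling (scikit-learn, which is not part of these slides), one common choice is information gain as the split objective and a depth limit chosen by cross-validation; the data and the parameter grid below are purely illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Toy non-linear data: the label depends on the sign of a product of two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy"),         # split objective: information gain
    param_grid={"max_depth": [1, 2, 3, 5, 10, None]},    # stopping parameter; None = grow fully
    cv=5,                                                # 5-fold cross-validation
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```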
SLIDE 39
SLIDE 40
SLIDE 41 “One branch for each numeric value” idea:
Hopeless: with such a high branching factor, we will shatter the dataset and overfit
SLIDE 42 A better idea: thresholded splits
- Binary tree, split on attribute X:
  – one branch: X < t
  – other branch: X ≥ t
- Search through all possible values of t
  – seems hard, but only a finite set is relevant
  – sort the values of X: {x1, …, xm}
  – consider splits at t = (xi + xi+1)/2
- Information gain for each split, computed as if X were a binary variable: “true” for X < t, “false” for X ≥ t (sketch below)
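A minimal sketch of this threshold search for one real-valued attribute (my own illustration, not from the slides): candidate thresholds are midpoints between consecutive sorted values, and each is scored by the information gain of the induced binary split.

```python
import math
from collections import Counter

def entropy(ys):
    n = len(ys)
    return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())

def best_threshold(xs, ys):
    """Return (t, gain) maximizing information gain for the split X < t vs. X >= t."""
    pairs = sorted(zip(xs, ys))
    h_before = entropy(ys)
    best_t, best_gain = None, -1.0
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                                   # equal values: no new threshold here
        t = (pairs[i][0] + pairs[i + 1][0]) / 2        # midpoint between consecutive values
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        h_after = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        if h_before - h_after > best_gain:
            best_t, best_gain = t, h_before - h_after
    return best_t, best_gain

# Toy example in the spirit of the mpg data: label driven by vehicle weight.
weights = [1800, 2100, 2500, 3200, 3900, 4400]
labels  = ["good", "good", "good", "bad", "bad", "bad"]
print(best_threshold(weights, labels))   # (2850.0, 1.0): a perfect split at weight < 2850
```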
SLIDE 43
SLIDE 44 Example tree using reals
SLIDE 45 What you should know about decision trees
- among the most popular data mining tools:
  – easy to understand
  – easy to implement
  – easy to use
  – computationally fast (but only a greedy heuristic!)
- not only classification: also regression, density estimation
- meaning of information gain
- decision trees overfit!
  – many pruning/stopping strategies
SLIDE 46 Acknowledgements
Some material in this presentation is courtesy of Andrew Moore, from his collection of ML tutorials: http://www.autonlab.org/tutorials/
SLIDE 47 LEARNING THEORY
SLIDE 48 Computational Learning Theory
What general laws constrain “learning”?
- how many examples are needed to learn a target concept to a given precision?
  – complexity of the target concept?
  – complexity of our hypothesis space?
  – manner in which the examples are presented?
- random samples (what we mostly consider in this course)
- learner can make queries
- examples come from an “adversary” (worst-case analysis, no statistical assumptions)