

SLIDE 1

Machine Learning - 10601

Decision Trees

Geoff Gordon, Miroslav Dudík
(partly based on slides of Carlos Guestrin and Andrew Moore)

http://www.cs.cmu.edu/~ggordon/10601/
October 21, 2009



SLIDE 2

Non-linear Classifiers

Dealing with a non-linear decision boundary:

  1. add "non-linear" features to a linear model (e.g., logistic regression)
  2. use non-linear learners (nearest neighbors, decision trees, artificial neural nets, ...)

k-Nearest Neighbor Classifier

  • simple, often a good baseline
  • can approximate an arbitrary boundary: non-parametric
  • downside: stores all the data
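A minimal sketch of idea 1 above, assuming scikit-learn (not part of the slides): degree-2 polynomial features let a linear classifier such as logistic regression fit a circular, non-linear boundary.

# Hedged sketch, assuming scikit-learn: degree-2 features make a circular
# boundary linearly separable for logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)   # label: inside/outside a circle

# x1, x2, x1^2, x1*x2, x2^2 become the inputs to the linear model
model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
model.fit(X, y)
print("training accuracy:", model.score(X, y))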

SLIDE 3

A Decision Tree for PlayTennis

Each internal node: test one feature Xj
Each branch from a node: select one value for Xj
Each leaf node: predict Y (or P(Y | X ∈ leaf))

[Figure: the PlayTennis decision tree]
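For concreteness, a tiny sketch (feature values assumed from the classic PlayTennis example, not read off the figure): the tree written as nested tests, one feature per internal node, one branch per value, a prediction of Y at each leaf.

# Sketch of the classic PlayTennis tree (values assumed, not from the slide figure).
def play_tennis(x):
    if x["Outlook"] == "Sunny":
        return "No" if x["Humidity"] == "High" else "Yes"
    if x["Outlook"] == "Overcast":
        return "Yes"
    # Outlook == "Rain"
    return "No" if x["Wind"] == "Strong" else "Yes"

print(play_tennis({"Outlook": "Rain", "Humidity": "High", "Wind": "Weak"}))  # Yes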


SLIDE 4

Decision trees

How would you represent Y = A ∨ B  (A or B)?


SLIDE 5

Decision trees

How would you represent Y = (A ∧ B) ∨ (¬A ∧ C)  ((A and B) or (not A and C))?
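A small sketch (not from the slides) of how the two Boolean targets from Slides 4 and 5 look when written as trees of single-variable tests; splitting on A at the root keeps both trees small.

# Each function is a decision tree: one variable tested per node, leaves return Y.
def y_slide4(a, b):            # Y = A or B
    if a:
        return True            # branch A = true: leaf
    return b                   # branch A = false: one more test on B

def y_slide5(a, b, c):         # Y = (A and B) or (not A and C)
    if a:
        return b               # subtree that tests B
    return c                   # subtree that tests C

assert y_slide4(False, True) and y_slide5(True, True, False) and y_slide5(False, False, True)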


SLIDE 6

Optimal Learning of Decision Trees is Hard

  • learning the smallest (simplest) decision tree is NP-complete (existing algorithms exponential)
  • use "greedy" heuristics:
    – start with an empty tree
    – choose the next best attribute (feature)
    – recurse


SLIDE 7

A small dataset: predict miles per gallon (mpg)


SLIDE 8

A Decision Stump


SLIDE 9

Recursion Step


SLIDE 10

Recursion Step


SLIDE 11

Second Level of Tree


SLIDE 12

The final tree


SLIDE 13

Which attribute is the best?

A good split: increases certainty about classification after the split

  X1  X2  Y
  T   T   T
  T   F   T
  T   T   T
  T   F   T
  F   T   T
  F   F   F
  F   T   F
  F   F   F


SLIDE 14

Entropy = measure of uncertainty

Entropy H(Y) of a random variable Y:

  H(Y) = − Σ_{i=1}^{m} P(Y=yi) log2 P(Y=yi)

H(Y) is the expected number of bits needed to encode a randomly drawn value of Y

SLIDE 15

Entropy = measure of uncertainty

Entropy H(Y) of a random variable Y:

  H(Y) = − Σ_{i=1}^{m} P(Y=yi) log2 P(Y=yi)

H(Y) is the expected number of bits needed to encode a randomly drawn value of Y

Why?

SLIDE 16

Entropy = measure of uncertainty

Entropy H(Y) of a random variable Y:

  H(Y) = − Σ_{i=1}^{m} P(Y=yi) log2 P(Y=yi)

H(Y) is the expected number of bits needed to encode a randomly drawn value of Y

Why? Information Theory: the most efficient code assigns −log2 P(Y=yi) bits to the message Y=yi
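A short sketch of the definition above in code (not from the slides): the entropy of an observed sample of Y, with the binary case from the next slide as a sanity check.

# Sample entropy matching H(Y) = -sum_i P(Y=yi) log2 P(Y=yi).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["T", "T", "T", "F"]))   # theta = 0.75 -> about 0.811 bits
print(entropy(["T", "F"]))             # theta = 0.5  -> 1 bit (maximum uncertainty)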

SLIDE 17

Entropy = measure of uncertainty

Y binary:
  P(Y=t) = θ
  P(Y=f) = 1 − θ

  H(Y) = − θ log2 θ − (1 − θ) log2 (1 − θ)

[Plot: H(Y) as a function of θ]

SLIDE 18

Information Gain = reduction in uncertainty

Entropy of Y before the split: H(Y)

Entropy of Y after the split (weighted by the probability of each branch):

  H(Y|X) = − Σ_{j=1}^{k} P(X=xj) Σ_{i=1}^{m} P(Y=yi|X=xj) log2 P(Y=yi|X=xj)

Information gain = difference:

  IG(X) = H(Y) − H(Y|X)

  X1  X2  Y
  T   T   T
  T   F   T
  T   T   T
  T   F   T
  F   T   T
  F   F   F
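A sketch of IG(X) on the table above (helper code is mine, not from the slides): splitting on X1 isolates the negative example better than splitting on X2, so its information gain is higher.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    n = len(ys)
    h_after = sum((xs.count(v) / n) * entropy([y for x, y in zip(xs, ys) if x == v])
                  for v in set(xs))                      # weighted entropy after the split
    return entropy(ys) - h_after

X1 = ["T", "T", "T", "T", "F", "F"]
X2 = ["T", "F", "T", "F", "T", "F"]
Y  = ["T", "T", "T", "T", "T", "F"]
print(information_gain(X1, Y))   # about 0.32 bits
print(information_gain(X2, Y))   # about 0.19 bits: X1 is the better split here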

SLIDE 19

Learning decision trees

  • start with an empty tree
  • choose the next best attribute (feature)
    – for example, one that maximizes information gain
  • split
  • recurse
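Putting the steps above together, a simplified ID3-style sketch (my code, not the authors'): grow the tree greedily by information gain and stop at the base cases discussed on the following slides.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    n = len(ys)
    h_after = sum((xs.count(v) / n) * entropy([y for x, y in zip(xs, ys) if x == v])
                  for v in set(xs))
    return entropy(ys) - h_after

def grow_tree(rows, attributes, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attributes:            # base cases: pure node / no attributes left
        return Counter(labels).most_common(1)[0][0]        # leaf predicts the majority label
    best = max(attributes,                                  # greedy step: maximize information gain
               key=lambda a: information_gain([r[a] for r in rows], labels))
    branches = {}
    for v in set(r[best] for r in rows):                    # one branch per value of the chosen attribute
        subset = [r for r in rows if r[best] == v]
        branches[v] = grow_tree(subset, [a for a in attributes if a != best], target)
    return (best, branches)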

SLIDE 20


SLIDE 21

A Decision Stump


SLIDE 22

Base Case One


SLIDE 23

Base Case Two


SLIDE 24

Base Case Two: attributes cannot distinguish classes


SLIDE 25

Base cases


SLIDE 26

Base cases: An idea


SLIDE 27

Base cases: An idea


SLIDE 28

The problem with Base Case 3


SLIDE 29

If we omit Base Case 3:


SLIDE 30

Basic Decision-Tree Building Summarized:


SLIDE 31

MPG test set error


SLIDE 32

MPG test set error


SLIDE 33

Decision trees overfit!

Standard decision trees:

  • training error always zero (if no label noise)
  • lots of variance

SLIDE 34

Avoiding overfitting

  • fixed depth
  • fixed number of leaves
  • stop when splits are not statistically significant

SLIDE 35

Avoiding overfitting

  • fixed depth
  • fixed number of leaves
  • stop when splits are not statistically significant

OR:

  • grow the full tree, then prune (collapse some subtrees)
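If you use a library, the stopping rules above are usually constructor arguments; a hedged sketch assuming scikit-learn (the dataset is only a stand-in):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full  = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)       # grown until leaves are pure
small = DecisionTreeClassifier(max_depth=4, max_leaf_nodes=16,       # fixed depth / number of leaves
                               random_state=0).fit(X_tr, y_tr)
print(full.score(X_te, y_te), small.score(X_te, y_te))               # compare held-out accuracy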


SLIDE 36

Reduced Error Pruning

Split the available data into training and pruning sets.

  1. Learn a tree that classifies the training set perfectly
  2. Do until further pruning is harmful over the pruning set:
     – consider pruning each node
     – collapse the node that best improves pruning set accuracy

This produces the smallest version of the most accurate tree (over the pruning set)
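A simplified sketch of the pruning idea, written for the (attribute, branches) trees produced by the grow_tree sketch after Slide 19. It makes one bottom-up pass and collapses a node whenever the majority-label leaf does at least as well on the pruning examples reaching it; this is a simplification of the greedy best-node-first loop described above.

from collections import Counter

def predict(tree, row):
    while isinstance(tree, tuple):              # descend until a leaf label is reached
        attr, branches = tree
        tree = branches.get(row[attr])
    return tree

def prune(tree, train_rows, prune_rows, target):
    if not isinstance(tree, tuple) or not prune_rows:
        return tree
    attr, branches = tree
    for v in list(branches):                    # prune the children first (bottom-up)
        branches[v] = prune(branches[v],
                            [r for r in train_rows if r[attr] == v],
                            [r for r in prune_rows if r[attr] == v], target)
    leaf = Counter(r[target] for r in train_rows).most_common(1)[0][0]
    keep     = sum(predict(tree, r) == r[target] for r in prune_rows)   # subtree accuracy
    collapse = sum(r[target] == leaf for r in prune_rows)               # majority-leaf accuracy
    return leaf if collapse >= keep else tree   # collapse only if it does not hurt the pruning set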



SLIDE 37

Impact of Pruning


SLIDE 38

A Generic Tree-Learning Algorithm

Need to specify:

  • an objective to select splits
  • a criterion for pruning (or stopping)
  • parameters for pruning/stopping (usually determined by cross-validation)
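A sketch of the last point, assuming scikit-learn: pick the pruning/stopping parameter by cross-validation. Here the searched parameter is max_depth; scikit-learn's cost-complexity parameter ccp_alpha could be searched the same way, though that is a different pruning scheme from the reduced error pruning on Slide 36.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 3, 4, 6, 8, None]}, cv=5)
search.fit(X, y)                                  # 5-fold cross-validation over the grid
print(search.best_params_, search.best_score_)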


SLIDE 39

SLIDE 40

SLIDE 41

"One branch for each numeric value" idea:

Hopeless: with such a high branching factor, we will shatter the dataset and overfit


SLIDE 42

A better idea: thresholded splits

  • Binary tree, split on attribute X:
    – one branch: X < t
    – other branch: X ≥ t
  • Search through all possible values of t
    – seems hard, but only a finite set is relevant
    – sort the values of X: {x1, ..., xm}
    – consider splits at t = (xi + xi+1)/2
  • Information gain for each split, as if a binary variable: "true" for X < t, "false" for X ≥ t
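A sketch of the threshold search above (not from the slides): candidate thresholds at midpoints of the sorted distinct values of X, each scored by information gain as a binary "X < t" test.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(xs, ys):
    pairs = sorted(zip(xs, ys))
    sorted_x = [x for x, _ in pairs]
    candidates = [(a + b) / 2 for a, b in zip(sorted_x, sorted_x[1:]) if a != b]
    def gain(t):
        left  = [y for x, y in pairs if x < t]              # "true"  branch: X < t
        right = [y for x, y in pairs if x >= t]             # "false" branch: X >= t
        h_after = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        return entropy(ys) - h_after
    return max(candidates, key=gain)

print(best_threshold([1.0, 2.0, 3.0, 7.0, 8.0, 9.0],
                     ["lo", "lo", "lo", "hi", "hi", "hi"]))   # 5.0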




SLIDE 43

SLIDE 44

Example tree using reals


SLIDE 45

What you should know about decision trees

  • among the most popular data mining tools:
    – easy to understand
    – easy to implement
    – easy to use
    – computationally fast (but only a greedy heuristic!)
  • not only classification, also regression and density estimation
  • the meaning of information gain
  • decision trees overfit!
    – many pruning/stopping strategies


SLIDE 46

Acknowledgements

Some material in this presentation is courtesy of Andrew Moore, from his collection of ML tutorials:
http://www.autonlab.org/tutorials/


SLIDE 47

LEARNING THEORY


SLIDE 48

Computational Learning Theory

What general laws constrain "learning"?

  • how many examples are needed to learn a target concept to a given precision?
  • what is the impact of:
    – the complexity of the target concept?
    – the complexity of our hypothesis space?
    – the manner in which examples are presented?
      • random samples (what we mostly consider in this course)
      • the learner can make queries
      • examples come from an "adversary" (worst-case analysis, no statistical assumptions)