Boosting Neural Networks

Holger Schwenk
LIMSI-CNRS, Orsay cedex, FRANCE

Yoshua Bengio
DIRO, University of Montreal, Succ. Centre-Ville, Montreal, Qc, CANADA

To appear in Neural Computation.

Abstract

Boosting is a general method for improving the performance of learning algorithms. A recently proposed boosting algorithm is AdaBoost. It has been applied with great success to several benchmark machine learning problems, using mainly decision trees as base classifiers. In this paper we investigate whether AdaBoost also works as well with neural networks, and we discuss the advantages and drawbacks of different versions of the AdaBoost algorithm. In particular, we compare training methods based on sampling the training set and on weighting the cost function. The results suggest that random resampling of the training data is not the main explanation of the success of the improvements brought by AdaBoost. This is in contrast to Bagging, which directly aims at reducing variance and for which random resampling is essential to obtain the reduction in generalization error. Our system achieves about ...% error on a data set of on-line handwritten digits from more than ... writers. A boosted multilayer network achieved ...% error on the UCI Letters data set and ...% error on the UCI satellite data set, which is significantly better than boosted decision trees.

Keywords: AdaBoost, boosting, Bagging, ensemble learning, multilayer neural networks, generalization
Introduction

Boosting is a general method for improving the performance of a learning algorithm. It is a method for finding a highly accurate classifier on the training set by combining weak hypotheses (Schapire), each of which needs only to be moderately accurate on the training set. (See Perrone for an earlier overview of different ways to combine neural networks.) A recently proposed boosting algorithm is AdaBoost (Freund), which stands for Adaptive Boosting. During the last two years, many empirical studies have been published that use decision trees as base classifiers for AdaBoost (Breiman; Drucker and Cortes; Freund and Schapire (a); Quinlan; Maclin and Opitz; Bauer and Kohavi; Dietterich (b); Grove and Schuurmans). All these experiments have shown impressive improvements in generalization behavior and suggest that AdaBoost tends to be robust to overfitting. In fact, in many experiments it has been observed that the generalization error continues to decrease towards an apparent asymptote after the training error has reached zero. Schapire et al. suggest a possible explanation for this unusual behavior based on the definition of the margin of classification. Other attempts to understand boosting theoretically can be found in Schapire et al., Breiman (a), Breiman, Friedman et al., and Schapire. AdaBoost has also been linked with game theory (Freund and Schapire (b); Breiman (b); Grove and Schuurmans; Freund and Schapire) in order to understand its behavior and to propose alternative algorithms. Mason and Baxter propose a new variant of boosting based on the direct optimization of margins. Additionally, there is recent evidence that AdaBoost may very well overfit if we combine several hundred thousand classifiers (Grove and Schuurmans). It also seems that the performance of AdaBoost degrades a lot in the presence of significant amounts of noise (Dietterich (b); Rätsch et al.).

Although much useful work has been done, both theoretically and experimentally, there is still a lot that is not well understood about the impressive generalization behavior of AdaBoost. To the best of our knowledge, applications of AdaBoost have all been to decision trees, and no applications to multilayer artificial neural networks have been reported in the literature. This paper extends and provides a deeper experimental analysis of our first experiments with the application of AdaBoost to neural networks (Schwenk and Bengio; Schwenk and Bengio). In this paper we consider the following questions. Does AdaBoost work as well for neural networks as for decision trees? (Short answer: yes, sometimes even better.) Does it behave in a similar way as was observed previously in the literature? (Short answer: yes.) Furthermore, are there particulars in the way neural networks are trained with gradient backpropagation which should be taken into account when choosing a particular version of AdaBoost? (Short answer: yes, because it is possible to directly weight the cost function of neural networks.) Is overfitting of the individual neural networks a concern? (Short answer: not as much as when not using boosting.) Is the random resampling used in previous implementations of AdaBoost critical, or can we get similar performance by weighting the training criterion, which can easily be done with neural networks? (Short answer: it is not critical for generalization, but it helps to obtain faster convergence of the individual networks when coupled with stochastic gradient descent.)
The paper is organized as follows. In the next section we describe the AdaBoost algorithm and discuss several implementation issues when using neural networks as base classifiers. We then present results that we have obtained on three medium-sized tasks: a data set of handwritten on-line digits, and the Letters and Satimage data sets of the UCI repository. The paper finishes with a conclusion and perspectives for future research.

AdaBoost

It is well known that it is often possible to increase the accuracy of a classifier by averaging the decisions of an ensemble of classifiers (Perrone; Krogh and Vedelsby). In general, more improvement can be expected when the individual classifiers are diverse and yet accurate. One can try to obtain this result by taking a base learning algorithm and invoking it several times on different training sets. Two popular techniques exist that differ in the way they construct these training sets: Bagging (Breiman) and boosting (Freund; Freund and Schapire).

In Bagging, each classifier is trained on a bootstrap replicate of the original training set. Given a training set S of N examples, the new training set is created by resampling N examples uniformly with replacement. Note that some examples may occur several times, while others may not occur in the sample at all; on average only about two thirds of the examples occur in each bootstrap replicate. Note also that the individual training sets are independent, so the classifiers could be trained in parallel. Bagging is known to be particularly effective when the classifiers are unstable, i.e., when perturbing the learning set can cause significant changes in the classification behavior of the classifiers. Formulated in the context of the bias/variance decomposition (Geman et al.), Bagging improves generalization performance due to a reduction in variance while maintaining or only slightly increasing bias. Note, however, that there is no unique bias/variance decomposition for classification tasks (Kong and Dietterich; Breiman; Kohavi and Wolpert; Tibshirani).
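To make the bootstrap step concrete, here is a minimal Python sketch (the function name and the toy check are ours, not the paper's); it also illustrates the claim that roughly two thirds of the examples appear in each replicate:

    import numpy as np

    rng = np.random.default_rng(0)

    def bootstrap_replicate(X, y):
        """One Bagging training set: N examples drawn uniformly with replacement."""
        idx = rng.integers(0, len(X), size=len(X))
        return X[idx], y[idx]

    # Quick check of the claim that roughly two thirds of the examples appear
    # in each replicate: fraction of distinct indices in one draw of size N.
    N = 10000
    print(len(np.unique(rng.integers(0, N, size=N))) / N)   # about 0.63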
AdaBoost, on the other hand, constructs a composite classifier by sequentially training classifiers while putting more and more emphasis on certain patterns. For this, AdaBoost maintains a probability distribution D_t(i) over the original training set. In each round t, the classifier is trained with respect to this distribution. Some learning algorithms do not allow training with respect to a weighted cost function. In this case, sampling with replacement using the probability distribution D_t can be used to approximate a weighted cost function: examples with high probability would then occur more often than those with low probability, while some examples may not occur in the sample at all although their probability is not zero.
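For base learners that cannot handle a weighted cost directly, this resampling approximation amounts to a weighted draw; a minimal sketch, assuming D is a vector of per-example probabilities (the names are ours):

    import numpy as np

    def resample_training_set(X, y, D, rng):
        """Approximate training under the distribution D by drawing N examples
        with replacement, example i being picked with probability D[i]."""
        idx = rng.choice(len(X), size=len(X), replace=True, p=D)
        return X[idx], y[idx]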
Input: a sequence of N examples (x_1, y_1), ..., (x_N, y_N) with labels y_i in Y = {1, ..., k}.

Init: let B = {(i, y) : i in {1, ..., N}, y != y_i} and D_1(i, y) = 1/|B| for all (i, y) in B.

Repeat for t = 1, 2, ...:
  1. Train the neural network with respect to the distribution D_t and obtain a hypothesis h_t : X x Y -> [0, 1].
  2. Calculate the pseudo-loss of h_t:
       epsilon_t = (1/2) sum_{(i, y) in B} D_t(i, y) (1 - h_t(x_i, y_i) + h_t(x_i, y)).
  3. Set beta_t = epsilon_t / (1 - epsilon_t).
  4. Update the distribution:
       D_{t+1}(i, y) = (D_t(i, y) / Z_t) beta_t^((1/2)(1 + h_t(x_i, y_i) - h_t(x_i, y))),
     where Z_t is a normalization constant.

Output: the final hypothesis
       f(x) = argmax_{y in Y} sum_t (log 1/beta_t) h_t(x, y).

Table 1: Pseudo-loss AdaBoost (AdaBoost.M2).
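As an illustration of the bookkeeping in Table 1, the following Python sketch implements the pseudo-loss AdaBoost loop; it assumes a user-supplied train_network routine that returns a hypothesis giving per-class scores in [0, 1], and all names are ours rather than the paper's:

    import numpy as np

    def adaboost_m2(X, y, n_classes, n_rounds, train_network):
        """Pseudo-loss AdaBoost (AdaBoost.M2) following Table 1.

        train_network(X, y, D) must return a scoring function h(x) giving
        n_classes confidence scores in [0, 1], trained with respect to the
        mislabel distribution D (by resampling or by weighting the cost).
        Assumes every pseudo-loss stays in (0, 1/2).
        """
        N = len(X)
        # Mislabel distribution D_t(i, y); entries for the correct label stay at 0.
        D = np.full((N, n_classes), 1.0 / (N * (n_classes - 1)))
        D[np.arange(N), y] = 0.0

        hypotheses, betas = [], []
        for t in range(n_rounds):
            h = train_network(X, y, D)                        # step 1
            scores = np.array([h(x) for x in X])              # scores[i, y] = h_t(x_i, y)
            correct = scores[np.arange(N), y]                 # h_t(x_i, y_i)
            eps = 0.5 * np.sum(D * (1.0 - correct[:, None] + scores))   # step 2
            beta = eps / (1.0 - eps)                          # step 3
            D = D * beta ** (0.5 * (1.0 + correct[:, None] - scores))   # step 4
            D /= D.sum()                                      # Z_t: renormalize
            hypotheses.append(h)
            betas.append(beta)

        def final_hypothesis(x):
            votes = sum(np.log(1.0 / b) * h(x) for h, b in zip(hypotheses, betas))
            return int(np.argmax(votes))
        return final_hypothesis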
After each AdaBoost round, the probability of incorrectly labeled examples is increased and the probability of correctly labeled examples is decreased. In the basic algorithm, the result of training the t-th classifier is a hypothesis h_t : X -> Y, where Y = {1, ..., k} is the space of labels and X is the space of input features. After the t-th round, the weighted error epsilon_t = sum_{i : h_t(x_i) != y_i} D_t(i) of the resulting classifier is calculated, and the distribution D_{t+1} is computed from D_t by increasing the probability of incorrectly labeled examples. The probabilities are changed so that the error of the t-th classifier using these new weights D_{t+1} would be exactly 1/2. In this way the classifiers are optimally decoupled. The global decision f is obtained by weighted voting. This basic AdaBoost algorithm converges (learns the training set) if each classifier yields a weighted error that is less than 1/2, i.e., better than chance in the two-class case.

In general, neural network classifiers provide more information than just a class label. It can be shown that the network outputs approximate the a posteriori probabilities of the classes, and it might be useful to use this information rather than performing a hard decision for one recognized class. This issue is addressed by another version of AdaBoost, called AdaBoost.M2 (Freund and Schapire). It can be used when the classifier computes confidence scores for each class (the scores do not need to sum to one). The result of training the t-th classifier is now a hypothesis h_t : X x Y -> [0, 1]. Furthermore, we use a distribution D_t(i, y) over the set of all mislabels:
  B = {(i, y) : i in {1, ..., N}, y != y_i},

where N is the number of training examples; therefore |B| = N(k - 1). AdaBoost.M2 modifies this distribution so that the next learner focuses not only on the examples that are hard to classify, but more specifically on improving the discrimination between the correct class and the incorrect class that competes with it. Note that the mislabel distribution D_t induces a distribution over the examples, P_t(i) = W_t(i) / sum_i W_t(i), where W_t(i) = sum_{y != y_i} D_t(i, y); P_t(i) may be used for resampling the training set. Freund and Schapire define the pseudo-loss of a learning machine as

  epsilon_t = (1/2) sum_{(i, y) in B} D_t(i, y) (1 - h_t(x_i, y_i) + h_t(x_i, y)).

It is minimized if the confidence scores of the correct labels are 1 and the confidence scores of all the wrong labels are 0. The final decision f is obtained by adding together the weighted confidence scores of all the machines (all the hypotheses h_1, ..., h_T). Table 1 summarizes the AdaBoost.M2 algorithm. This multiclass boosting algorithm converges if each classifier yields a pseudo-loss that is less than 1/2, i.e., better than any constant hypothesis.

AdaBoost has very interesting theoretical properties; in particular, it can be shown that the error of the composite classifier on the training data decreases exponentially fast to zero as the number of combined classifiers is increased (Freund and Schapire).

Many empirical evaluations of AdaBoost also provide an analysis of the so-called margin distribution. The margin is defined as the difference between the ensemble score of the correct class and the strongest ensemble score of a wrong class. In the case in which there are just two possible labels {-1, +1}, this is y * f(x), where f is the output of the composite classifier and y the correct label. The classification is correct if the margin is positive. Discussions about the relevance of the margin distribution for the generalization behavior of ensemble techniques can be found in Freund and Schapire (b), Schapire et al., Breiman (a), Breiman (b), Grove and Schuurmans, and Rätsch et al.
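As an illustration, the following sketch computes the per-example margins of a boosted ensemble from the weighted confidence scores defined above; the helper names are ours, and the normalization to [-1, 1] (dividing by the total voting weight) is our assumption about how such margins are usually plotted:

    import numpy as np

    def ensemble_scores(x, hypotheses, betas):
        """Weighted confidence scores: f(x, y) = sum_t log(1/beta_t) h_t(x, y)."""
        return sum(np.log(1.0 / b) * h(x) for h, b in zip(hypotheses, betas))

    def margins(X, y, hypotheses, betas, normalize=True):
        """Margin of each example: ensemble score of the correct class minus the
        strongest ensemble score of a wrong class (positive when the example is
        correctly classified). With normalize=True the scores are divided by the
        total voting weight so that margins fall in [-1, 1]."""
        total = sum(np.log(1.0 / b) for b in betas)
        out = np.empty(len(X))
        for i, (x, yi) in enumerate(zip(X, y)):
            s = ensemble_scores(x, hypotheses, betas)
            if normalize:
                s = s / total
            out[i] = s[yi] - np.delete(s, yi).max()
        return out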
In this paper an important focus is on whether the good generalization performance of AdaBoost is partially explained by the random resampling of the training sets generally used in its implementation. This issue will be addressed by comparing three versions of AdaBoost, described in the next section, in which randomization is used or not used in three different ways.

Applying AdaBoost to neural networks

In this paper we investigate different techniques for using neural networks as base classifiers for AdaBoost. In all cases we have trained the neural networks by minimizing a quadratic criterion that is a weighted sum of the squared differences (z_ij - ẑ_ij)^2, where z_i = (z_i1, ..., z_ik) is the desired output vector, with a low target value everywhere except at the position corresponding to the target class, and ẑ_i is the output vector of the network. A score for class j for pattern i can be directly obtained from the j-th element ẑ_ij of the output vector ẑ_i.
When a class must be chosen, the one with the highest score is selected. Let V_t(i, j) = D_t(i, j) / max_{k != y_i} D_t(i, k) for j != y_i, and V_t(i, y_i) = 1. These weights are used to give more emphasis to certain incorrect labels, according to the pseudo-loss AdaBoost. What we call an epoch is a pass of the training algorithm through all the examples in a training set. In this paper we compare three different versions of AdaBoost:

R: Training the t-th classifier with a fixed training set obtained by resampling with replacement once from the original training set. Before starting to train the t-th network, we sample N patterns from the original training set, each time with probability P_t(i) of picking pattern i. Training is performed for a fixed number of iterations, always using this same resampled training set. This is basically the scheme that has been used in the past when applying AdaBoost to decision trees, except that we used the pseudo-loss AdaBoost. To approximate the pseudo-loss, the training cost that is minimized for a pattern that is the i-th one from the original training set is sum_j V_t(i, j) (z_ij - ẑ_ij)^2.

E: Training the t-th classifier using a different training set at each epoch, by resampling with replacement after each training epoch. After each epoch, a new training set is obtained by sampling from the original training set with probabilities P_t(i). Since we used an on-line stochastic gradient in this case, this is equivalent to sampling a new pattern from the original training set with probability P_t(i) before each forward/backward pass through the neural network. Training continues until a fixed number of pattern presentations has been performed. As for R, the training cost that is minimized for a pattern that is the i-th one from the original training set is sum_j V_t(i, j) (z_ij - ẑ_ij)^2.

W: Training the t-th classifier by directly weighting the cost function (here, the squared error) of the t-th neural network, i.e., all the original training patterns are in the training set, but the cost is weighted by the probability of each example: sum_j D_t(i, j) (z_ij - ẑ_ij)^2. If we used this formula directly, the gradients would be very small, even when all the probabilities D_t(i, j) are identical. To avoid having to scale the learning rates differently depending on the number of examples, the following normalized error function was used (see the sketch after this list):

  (P_t(i) / max_k P_t(k)) sum_j V_t(i, j) (z_ij - ẑ_ij)^2.
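To make the three variants concrete, here is a minimal sketch of the per-pattern weighted squared-error cost and of the two ways of using the induced distribution P_t, resampling (R and E) versus direct weighting (W); the helper names, and the convention that D_t(i, y_i) is stored as zero, are ours:

    import numpy as np

    def example_distribution(D):
        """P_t(i) induced by the mislabel distribution: proportional to
        W_t(i) = sum_{y != y_i} D_t(i, y) (the correct-label entries are zero)."""
        W = D.sum(axis=1)
        return W / W.sum()

    def mislabel_weights(D, y):
        """V_t(i, j) = D_t(i, j) / max_{k != y_i} D_t(i, k), with V_t(i, y_i) = 1.
        The row maximum equals the maximum over wrong labels because D(i, y_i) = 0."""
        V = D / np.maximum(D.max(axis=1, keepdims=True), 1e-12)
        V[np.arange(len(y)), y] = 1.0
        return V

    def weighted_squared_error(z, z_hat, v, scale=1.0):
        """Cost for one pattern: scale * sum_j V_t(i, j) * (z_ij - zhat_ij)^2."""
        return scale * np.sum(v * (z - z_hat) ** 2)

    # R and E: draw the pattern to present from P_t (once per network for R,
    # before every forward/backward pass for E) and use the cost with scale=1.
    # W: loop over all patterns and use scale = P_t(i) / max_k P_t(k).
    def pattern_scales_W(P):
        return P / P.max()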
In E and W, what makes the combined networks essentially different from each other is the fact that they are trained with respect to different weightings D_t of the original training set. In R, an additional element of diversity is built in, because the criterion used for the t-th network is not exactly the errors weighted by P_t(i): more emphasis is put on certain patterns while others are completely ignored, because of the initial random sampling of the training set. The E version can be seen as a stochastic version of the W version, i.e., as the number of iterations through the data increases and the learning rate decreases, E becomes a very good approximation of W. W itself is closest to the recipe mandated by the AdaBoost algorithm, but as we will see below, it suffers from numerical problems.
Note that E is a better approximation of the weighted cost function than R, in particular when many epochs are performed. If random resampling of the training data explained a good part of the generalization performance of AdaBoost, then the weighted training version W should perform worse than the resampling versions, and the fixed-sample version R should perform better than the continuously resampled version E. Note that for Bagging, which directly aims at reducing variance, random resampling is essential to obtain the reduction in generalization error.

Results

Experiments have been performed on three data sets: a data set of on-line handwritten digits, the UCI Letters data set of off-line machine-printed alphabetical characters, and the UCI satellite data set, which is generated from Landsat Multispectral Scanner image data. All data sets have a predefined training and test set. All the p-values given in this section concern a pair (p̂_1, p̂_2) of test performance results on n test points for two classification systems with unknown true error rates p_1 and p_2. The null hypothesis is that the true expected performance of the two systems is not different, i.e., p_1 = p_2. Let p̂ = (p̂_1 + p̂_2)/2 be the estimator of the common error rate under the null hypothesis. The alternative hypothesis is that p_1 > p_2, so the p-value is obtained as the probability of observing such a large difference under the null hypothesis, i.e., P(Z > z) for a standard Normal Z, with

  z = (p̂_1 - p̂_2) / sqrt(2 p̂ (1 - p̂) / n).

This is based on the Normal approximation of the Binomial, which is appropriate for large n (however, see Dietterich (a) for a discussion of this and other tests to compare algorithms).
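A small sketch of this test under the stated normal approximation (the function name and the example numbers are ours):

    import math
    from statistics import NormalDist

    def p_value(err1, err2, n):
        """One-sided p-value for comparing two test error rates measured on n test
        points, using the pooled normal approximation described above
        (err1 and err2 are error rates in [0, 1], with err1 >= err2)."""
        p = (err1 + err2) / 2.0                     # common rate under the null
        z = (err1 - err2) / math.sqrt(2.0 * p * (1.0 - p) / n)
        return 1.0 - NormalDist().cdf(z)            # P(Z > z)

    # Example with arbitrary numbers: 4.1% vs 3.5% error on 5000 test points.
    print(p_value(0.041, 0.035, 5000))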
Results on the on-line data set

The on-line data set was collected at the University of Paris (Schwenk and Milgram). A WACOM tablet with a cordless pen was used in order to allow natural writing. Since we wanted to build a writer-independent recognition system, we tried to use many writers and to impose as few constraints as possible on the writing style. In total, ... students wrote down isolated digits, which have been divided into a learning set (... examples) and a test set (... examples). Note that the writers of the training and test sets are completely distinct. A particular property of this data set is the notable variety of writing styles, which are not all equally frequent: there are, for instance, many zeros written counterclockwise but only a few written clockwise. Figure 1 gives an idea of the great variety of writing styles in this data set. We applied only a simple preprocessing: the characters were resampled to a fixed number of points, centered, and size-normalized to an (x, y)-coordinate sequence. Table 2 summarizes the results on the test set before using AdaBoost.
Figure 1: Some examples of the on-line handwritten digits data set (test set).

Table 2: On-line digits data set: error rates for fully connected MLPs (not boosted). Columns: architecture, train error, test error.

Note that the differences among the test results of the last three networks are not statistically significant (p-value ...), whereas the difference with the first network is significant (p-value ...). Cross-validation within the training set was used to find the optimal number of training epochs (typically about ...). Note that if training is continued until ... epochs, the test error increases by up to ...%.

Table 3 shows the results of bagged and boosted multilayer perceptrons with 10, 30 or 50 hidden units, trained for either ... or ... epochs, and using either the ordinary resampling scheme (R), resampling with different random selections at each epoch (E), or training with weights D_t on the squared error criterion for each pattern (W). (The notation 22-h-10 designates a fully connected neural network with 22 input nodes, one hidden layer with h neurons, and a 10-dimensional output layer.) In all cases ... neural networks were combined. AdaBoost improved the generalization error of the MLPs in all cases, for instance from ...% to about ...% for the ... architecture. Note that the improvement with ... hidden units, from ...% without AdaBoost to ...% with AdaBoost, is significant (p-value of ...) despite the small number of examples. Boosting was also always superior to Bagging, although the differences are not always very significant because of the small number of examples.
Table 3: On-line digits test error rates for bagged and boosted MLPs (versions R, E and W for each of the three architectures; Bagging and AdaBoost trained for various numbers of epochs per network).

Furthermore, it seems that the number of training epochs of each individual classifier has no significant impact on the results of the combined classifier, at least on this data set. AdaBoost with weighted training of the MLPs (the W version), however, does not work as well if the learning of each individual MLP is stopped too early (... epochs): the networks did not learn the weighted examples well enough, and epsilon_t rapidly approached 1/2. When training each MLP for ... epochs, however, the weighted training (W) version achieved the same low test error rate. AdaBoost is less useful for very big networks (... or more hidden units) on this data, since an individual classifier can achieve zero error on the original training set using the E or W method. Such large networks probably have a very low bias but a high variance. This may explain why Bagging, a pure variance reduction method, can do as well as AdaBoost, which is believed to reduce bias and variance. Note, however, that AdaBoost can achieve the same low error rates with the smaller networks.

Figure 2 shows the error rates of some of the boosted classifiers as the number of networks is increased. AdaBoost brings the training error to zero after only a few steps, even with an MLP with only 10 hidden units. The generalization error is also considerably improved, and it continues to decrease towards an apparent asymptote after zero training error has been reached. The surprising effect of continuously decreasing generalization error even after the training error reaches zero has already been observed by others (Breiman; Drucker and Cortes; Freund and Schapire (a); Quinlan). This seems to contradict Occam's razor, but a recent theorem (Schapire et al.) suggests that the margin distribution may be relevant to the generalization error. Although previous empirical results (Schapire et al.) indicate that pushing the margin cumulative distribution to the right may improve generalization, other recent results (Breiman (a); Breiman (b); Grove and Schuurmans) show that improving the whole margin distribution can also yield worse generalization. Figures 3 and 4 show several margin cumulative distributions, i.e., the fraction of examples whose margin is at most x, as a function of x. The networks had been trained for ... epochs (... for the W version).
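For reference, the margin cumulative distributions plotted in these figures can be computed from the per-example margins as sketched below, reusing the margins helper defined earlier (names are ours):

    import numpy as np

    def margin_cdf(margin_values, xs=None):
        """Fraction of examples whose margin is at most x, for each x in xs."""
        m = np.sort(np.asarray(margin_values))
        if xs is None:
            xs = np.linspace(-1.0, 1.0, 201)   # margins normalized to [-1, 1]
        return xs, np.searchsorted(m, xs, side="right") / len(m)

    # Example: compare the curves after 2 and after 100 boosting rounds.
    # xs, frac_2   = margin_cdf(margins(X, y, hypotheses[:2],   betas[:2]))
    # xs, frac_100 = margin_cdf(margins(X, y, hypotheses[:100], betas[:100]))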
[Figure 2 shows three panels, for the MLPs 22-10-10, 22-30-10 and 22-50-10, plotting the error in % against the number of networks (log scale) for the unboosted classifier, Bagging, and AdaBoost (R), (E) and (W), on both the training and the test set.]

Figure 2: Error rates of the boosted classifiers for an increasing number of networks. For clarity, the training error of Bagging is not shown; it overlaps with the test error rates of AdaBoost. The dotted constant horizontal line corresponds to the test error of the unboosted classifier. Small oscillations are not significant since they correspond to few examples.
[Figure 3 shows margin cumulative distributions for AdaBoost (R) and AdaBoost (E) applied to the MLPs, with one curve per number of combined networks.]

Figure 3: Margin distributions using ... and ... networks, respectively.
[Figure 4 shows margin cumulative distributions for AdaBoost (W) and for Bagging applied to the MLPs, with one curve per number of combined networks.]

Figure 4: Margin distributions using ... and ... networks, respectively.
It is clear from Figures 3 and 4 that the number of examples with a high margin increases when more classifiers are combined by boosting. When boosting neural networks with ... hidden units, for instance, there are some examples with a margin smaller than ... when only two networks are combined; however, all examples have a positive margin when ... nets are combined, and all examples have a margin higher than ... for ... networks. Bagging, on the other hand, has no significant influence on the margin distributions. There is almost no difference between the margin distributions of the R, E or W versions of AdaBoost either (one may note that the W and E versions achieve slightly higher margins than R). Note, however, that there is a difference between the margin distributions and the test set errors when the complexity of the neural networks (hidden layer size) is varied. Finally, it seems that sometimes AdaBoost must allow some examples with very high margins in order to improve the minimal margin; this can best be seen for the ... architecture. One should keep in mind that this data set contains only small amounts of noise. In application domains with high amounts of noise, it may be less advantageous to improve the minimal margin at any price (Grove and Schuurmans; Rätsch et al.), since this would mean putting too much weight on noisy or wrongly labeled examples.

Results on the UCI Letters and Satimage data sets

Similar experiments were performed with MLPs on the Letters data set from the UCI Machine Learning repository. It has ... training and ... test patterns, ... input features, and ... classes (A-Z) of distorted machine-printed characters from ... different fonts. A few preliminary experiments on the training set only were used to choose an architecture. Each input feature was normalized according to its mean and variance on the training set.
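A minimal sketch of this normalization (variable names are ours):

    import numpy as np

    def standardize(X_train, X_test):
        """Normalize each input feature by its mean and standard deviation
        computed on the training set only."""
        mean, std = X_train.mean(axis=0), X_train.std(axis=0)
        return (X_train - mean) / std, (X_test - mean) / std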
Two types of experiments were performed: (1) resampling after each epoch (E), using stochastic gradient descent, and (2) no resampling, but reweighting of the squared error (W), using conjugate gradient descent. In both cases a fixed number of training epochs was used. The plain, bagged and boosted networks are compared to decision trees in Table 4.

Table 4: Test error rates on the UCI data sets (rows: letter, satellite) for CART (results from Breiman), C4.5 (results from Freund and Schapire (a)) and MLPs, each reported alone, bagged and boosted.
[Figure 5 plots the error in % against the number of networks (log scale) for the unboosted classifier, Bagging, AdaBoost (SG+E) and AdaBoost (CG+W), on both the training and the test set of the UCI Letters data set.]

Figure 5: Error rates of the bagged and boosted neural networks for the UCI Letters data set (log scale). SG+E denotes stochastic gradient descent with resampling after each epoch; CG+W means conjugate gradient descent with weighting of the squared error. For clarity, the training error of Bagging is not shown; it flattens out at about ...%. The dotted constant horizontal line corresponds to the test error of the unboosted classifier.

In both cases (E and W) the same final generalization error was obtained (...% for E and ...% for W), but the training time using the weighted squared error (W) was about ... times greater. This shows that using random resampling, as in E or R, is not necessary to obtain good generalization, whereas it is clearly necessary for Bagging. However, the experiments show that it is still preferable to use a random sampling method such as R or E for numerical reasons: convergence of each network is faster. For this reason, ... networks were boosted in the E experiments with stochastic gradient descent, whereas we stopped training of the W networks after ... networks, when the generalization error seemed to have flattened out, which took more than a week on a fast processor (SGI Origin). We believe that the main reason for this difference in training time is that the conjugate gradient method is a batch method and is therefore slower than stochastic gradient descent on redundant data sets with many thousands of examples, such as this one (see comparisons between batch and on-line methods in Bourrely, and of conjugate gradients for classification tasks in particular in Moller).

For the W version with stochastic gradient descent, the weighted training error of the individual networks does not decrease as much as when using conjugate gradient descent, so AdaBoost itself did not work as well. We believe this is because it is difficult for stochastic gradient descent to approach a minimum when the output error is weighted with very different weights for different patterns: the patterns with small weights make almost no progress. The conjugate gradient method, on the other hand, can approach a minimum of the weighted cost function more precisely, but inefficiently when there are thousands of training examples. The results obtained with the boosted networks are extremely good (...% error, whether using the W version with conjugate gradients or the E version with stochastic gradient) and are, as far as the authors know, the best ever published to date for this data set.
[Figure 6 shows the margin cumulative distributions for Bagging and for AdaBoost (SG+E) on the UCI Letters data set, with one curve per number of combined networks.]

Figure 6: Margin distributions for the UCI Letters data set.

In a comparison with the boosted trees (...% error), the p-value of the null hypothesis is less than ... . The best performance reported in STATLOG (Feng et al.) is ...%. Note also that we need to combine only a few neural networks to get an immediate important improvement: with the E version, ... neural networks suffice for the error to fall under ...%, whereas boosted decision trees typically converge later. The W version of AdaBoost actually converged faster in terms of the number of networks (Figure 5): after about ... networks the ...% mark was reached, and after ... networks the apparent asymptote of ...% was reached; but it converged much more slowly in terms of training time.

Figure 6 shows the margin distributions for Bagging and AdaBoost applied to this data set. Again, Bagging has no effect on the margin distribution, whereas AdaBoost clearly increases the number of examples with large margins.

Similar conclusions hold for the UCI satellite data set (Table 4), although the improvements are not as dramatic as in the case of the Letters data set. The improvement due to AdaBoost is statistically significant (p-value ...), but the difference in performance between boosted MLPs and boosted decision trees is not (p-value ...). This data set has ... examples, with the first ... used for training and the last ... used for testing generalization. There are ... inputs and ... classes, and a ... network was used. Again, the two best training methods are epoch resampling (E) with stochastic gradient descent and the weighted squared error (W) with conjugate gradient descent.
Conclusion

As demonstrated here in three real-world applications, AdaBoost can significantly improve neural classifiers. In particular, the results obtained on the UCI Letters data set (...% test error) are significantly better than the best published results to date, as far as the authors know. The behavior of AdaBoost for neural networks confirms previous observations on other learning algorithms (e.g., Breiman; Drucker and Cortes; Freund and Schapire (a); Quinlan; Schapire et al.),
such as the continued generalization improvement after zero training error has been reached and the associated improvement in the margin distribution. It also seems that AdaBoost is not very sensitive to overtraining of the individual classifiers, so that the neural networks can be trained for a fixed (preferably high) number of training epochs. A similar observation was recently made with decision trees (Breiman (b)). This apparent insensitivity to overtraining of individual classifiers simplifies the choice of neural network design parameters.

Another interesting finding of this paper is that the weighted training version (W) of AdaBoost gives good generalization results for MLPs, but requires many more training epochs or the use of a second-order (and unfortunately batch) method such as conjugate gradients. We conjecture that this happens because of the weights on the cost function terms, especially when the weights are small, which could worsen the conditioning of the Hessian matrix. So in terms of generalization error, all three methods (R, E, W) gave similar results, but training time was lowest with the E method with stochastic gradient descent, which samples each new training pattern from the original data with the AdaBoost weights. Although our experiments are insufficient to conclude, it is possible that the weighted training method (W) with conjugate gradients might be faster than the others for small training sets (a few hundred examples).

There are various ways to define variance for classifiers (e.g., Kong and Dietterich; Breiman; Kohavi and Wolpert; Tibshirani). It basically represents how the resulting classifier varies when a different training set is sampled from the true generating distribution of the data. Our comparative results on the R, E and W versions add credence to the view that the randomness induced by resampling the training data is not the main reason for AdaBoost's reduction of the generalization error. This is in contrast to Bagging, which is a pure variance reduction method: for Bagging, random resampling is essential to obtain the observed variance reduction.

Another interesting issue is whether the boosted neural networks could be trained with a criterion other than the mean squared error, one that would better approximate the goal of the AdaBoost criterion, i.e., minimizing a weighted classification error. See Schapire and Singer for recent work that addresses this issue.

Acknowledgments

Most of the work was done while the first author was doing a postdoctorate at the University of Montreal. The authors would like to thank the National Science and Engineering Research Council of Canada and the Government of Quebec for financial support.
References

Bauer, E. and Kohavi, R. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. To appear in Machine Learning.
Bourrely, J. Parallelization of a neural learning algorithm on a hypercube. In Hypercube and Distributed Computers. Elsevier Science Publishing, North Holland.
Breiman, L. Bagging predictors. Machine Learning.
Breiman, L. Bias, variance, and arcing classifiers. Technical report, Statistics Department, University of California at Berkeley.
Breiman, L. (a). Arcing the edge. Technical report, Statistics Department, University of California at Berkeley.
Breiman, L. (b). Prediction games and arcing classifiers. Technical report, Statistics Department, University of California at Berkeley.
Breiman, L. Arcing classifiers. Annals of Statistics.
Dietterich, T. (a). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation.
Dietterich, T. G. (b). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Submitted to Machine Learning.
Drucker, H. and Cortes, C. Boosting decision trees. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems. MIT Press.
Feng, C., Sutherland, A., King, R., Muggleton, S., and Henery, R. Comparison of machine learning classifiers to statistics and neural networks. In Proceedings of the Fourth International Workshop on Artificial Intelligence and Statistics.
Freund, Y. Boosting a weak learning algorithm by majority. Information and Computation.
Freund, Y. and Schapire, R. E. (a). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference.
Freund, Y. and Schapire, R. E. (b). Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory.
Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences.
Freund, Y. and Schapire, R. E. Adaptive game playing using multiplicative weights. Games and Economic Behavior, to appear.
Friedman, J., Hastie, T., and Tibshirani, R. Additive logistic regression: a statistical view of boosting. Technical report, Department of Statistics, Stanford University.
Geman, S., Bienenstock, E., and Doursat, R. Neural networks and the bias/variance dilemma. Neural Computation.
Grove, A. J. and Schuurmans, D. Boosting in the limit: Maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, to appear.
Kohavi, R. and Wolpert, D. H. Bias plus variance decomposition for zero-one loss functions. In Machine Learning: Proceedings of the Thirteenth International Conference.
Kong, E. B. and Dietterich, T. G. Error-correcting output coding corrects bias and variance. In Machine Learning: Proceedings of the Twelfth International Conference.
Krogh, A. and Vedelsby, J. Neural network ensembles, cross validation, and active learning. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems. MIT Press.
Maclin, R. and Opitz, D. An empirical evaluation of bagging and boosting. In Proceedings of the Fourteenth National Conference on Artificial Intelligence.
Mason, L. and Baxter, J. Direct optimization of margins improves generalization in combined classifiers. In Advances in Neural Information Processing Systems. MIT Press, in press.
Moller, M. Supervised learning on large redundant training sets. In Neural Networks for Signal Processing. IEEE Press.
Moller, M. Efficient Training of Feed-Forward Neural Networks. PhD thesis, Aarhus University, Aarhus, Denmark.
Perrone, M. P. Improving Regression Estimation: Averaging Methods for Variance Reduction with Extensions to General Convex Measure Optimization. PhD thesis, Brown University, Institute for Brain and Neural Systems.
Perrone, M. P. Putting it all together: Methods for combining neural networks. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems. Morgan Kaufmann Publishers.
Quinlan, J. R. Bagging, boosting, and C4.5. In Machine Learning: Proceedings of the Fourteenth International Conference.
Rätsch, G., Onoda, T., and Müller, K.-R. Soft margins for AdaBoost. Technical Report NC-TR, Royal Holloway College.
Schapire, R. E. The strength of weak learnability. Machine Learning.
Schapire, R. E. Theoretical views of boosting. In Computational Learning Theory: Fourth European Conference (EuroCOLT), to appear.
Schapire, R. E., Freund, Y., Bartlett, P., and Lee, W. S. Boosting the margin: A new explanation for the effectiveness of voting methods. In Machine Learning: Proceedings of the Fourteenth International Conference.
Schapire, R. E. and Singer, Y. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Annual Conference on Computational Learning Theory.
Schwenk, H. and Bengio, Y. AdaBoosting neural networks: Application to on-line character recognition. In International Conference on Artificial Neural Networks. Springer Verlag.
Schwenk, H. and Bengio, Y. Training methods for adaptive boosting of neural networks. In Jordan, M. I., Kearns, M. J., and Solla, S. A., editors, Advances in Neural Information Processing Systems. The MIT Press.
Schwenk, H. and Milgram, M. Constraint tangent distance for on-line character recognition. In International Conference on Pattern Recognition.
Tibshirani, R. Bias, variance and prediction error for classification rules. Technical report, Department of Statistics, University of Toronto.