Santa Cruz Summer School 2012
Theory and Applications of Boosting
Yoav Freund, UCSD
Many slides from Rob Schapire
Monday, July 16, 2012
- boosting
- matrix games
- minimization
- By Majority
- with High Noise
- Pedestrian Detection
- studies
- tracking
[Gorin et al.]
- customer (Collect, CallingCard, PersonToPerson, etc.)
- please (Collect)
- it to my office (ThirdNumber)
- please (CallingCard)
- rang the wrong number because I got the wrong party and I would like to have that taken off of my bill (BillingCredit)
- THEN predict ‘CallingCard’
(those most often misclassified by previous rules of thumb)
boosting = general method of converting rough rules of thumb into a highly accurate prediction rule
- consistently find classifiers (“rules of thumb”) at least slightly better than random, say, accuracy ≥ 55% (in two-class setting) [ “weak learning assumption” ]
- construct single classifier with very high accuracy, say, 99%
[Valiant ’84]
with high probability, given polynomially many examples (and polynomial time), can find classifier with arbitrarily small generalization error
better than random guessing (error ≤ 1/2 − γ)
predictions better than random guessing
algorithms
and the margins theory
training set: (x1, y1), . . . , (xm, ym)
weak classifier ht : X → {−1, +1} with small error εt on Dt: εt = Pr_{i∼Dt}[ht(xi) ≠ yi]
[with Freund]
Dt+1(i) = (Dt(i)/Zt) × ( e^(−αt) if yi = ht(xi),  e^(αt) if yi ≠ ht(xi) )
        = (Dt(i)/Zt) · exp(−αt yi ht(xi))
where Zt = normalization factor, αt = (1/2) ln((1 − εt)/εt)
final classifier: Hfinal(x) = sign(Σt αt ht(x))
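The algorithm above can be sketched in code. The following is a minimal NumPy sketch, not a reference implementation: an exhaustive decision-stump search stands in for the weak learner, and all names (`stump_predict`, `best_stump`, `adaboost`) are made up for illustration.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """Axis-parallel half-plane: predict +1 on one side of the threshold."""
    return polarity * np.where(X[:, feature] > threshold, 1, -1)

def best_stump(X, y, D):
    """Exhaustively pick the stump with smallest weighted error under D."""
    m, n = X.shape
    best, best_err = None, np.inf
    for feature in range(n):
        for threshold in np.unique(X[:, feature]):
            for polarity in (1, -1):
                pred = stump_predict(X, feature, threshold, polarity)
                err = D[pred != y].sum()        # eps_t = Pr_{i~D_t}[h_t(x_i) != y_i]
                if err < best_err:
                    best_err, best = err, (feature, threshold, polarity)
    return best, best_err

def adaboost(X, y, T):
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                     # D_1 = uniform
    ensemble = []                               # list of (alpha_t, stump_t)
    for _ in range(T):
        stump, eps = best_stump(X, y, D)
        eps = max(eps, 1e-10)                   # guard against a perfect stump
        alpha = 0.5 * np.log((1 - eps) / eps)
        pred = stump_predict(X, *stump)
        D *= np.exp(-alpha * y * pred)          # D_{t+1}(i) ∝ D_t(i) exp(-α_t y_i h_t(x_i))
        D /= D.sum()                            # Z_t = normalization factor
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    """H_final(x) = sign(Σ_t α_t h_t(x))."""
    F = sum(alpha * stump_predict(X, *stump) for alpha, stump in ensemble)
    return np.sign(F)
```

On a small 1-D dataset whose labels form an interval, a few rounds of stumps are enough to drive the training error to zero.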
D1 (uniform initial distribution); weak classifiers = vertical or horizontal half-planes
round 1: h1, ε1 = 0.30, α1 = 0.42 → D2
round 2: h2, ε2 = 0.21, α2 = 0.65 → D3
round 3: h3, ε3 = 0.14, α3 = 0.92
Hfinal = sign(0.42 h1 + 0.65 h2 + 0.92 h3)
http://cseweb.ucsd.edu/~yfreund/adaboost/index.html
and the margins theory
[with Freund]
write εt = 1/2 − γt [ γt = “edge” ]
then: training error(Hfinal) ≤ ∏t 2√(εt(1 − εt)) = ∏t √(1 − 4γt²) ≤ exp(−2 Σt γt²)
so: if ∀t: γt ≥ γ > 0, then training error(Hfinal) ≤ e^(−2γ²T)
[Freund & Schapire ’96]
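The chain of inequalities is easy to sanity-check numerically. The sketch below (assuming NumPy) uses the εt values from the toy example and verifies both the identity 2√(εt(1 − εt)) = √(1 − 4γt²) and the exponential bound:

```python
import numpy as np

# Per-round bound factor: Z_t = 2*sqrt(eps_t*(1 - eps_t)) = sqrt(1 - 4*gamma_t^2),
# and sqrt(1 - x) <= exp(-x/2) yields the bound exp(-2 * sum_t gamma_t^2).
eps = np.array([0.30, 0.21, 0.14])      # the eps_t from the toy example
gamma = 0.5 - eps                        # the edges gamma_t
Z = 2 * np.sqrt(eps * (1 - eps))

assert np.allclose(Z, np.sqrt(1 - 4 * gamma**2))
assert np.prod(Z) <= np.exp(-2 * np.sum(gamma**2))
```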
let F(x) = Σt αt ht(x) (the “scoring function”), so Hfinal(x) = sign(F(x))
Step 1: unwrapping the recurrence gives
Dfinal(i) = (1/m) exp(−yi Σt αt ht(xi)) / ∏t Zt = (1/m) exp(−yi F(xi)) / ∏t Zt
Step 2: training error(Hfinal) ≤ ∏t Zt
proof:
training error(Hfinal) = (1/m) Σi [1 if yi ≠ Hfinal(xi), else 0]
= (1/m) Σi [1 if yi F(xi) ≤ 0, else 0]
≤ (1/m) Σi exp(−yi F(xi))
= Σi Dfinal(i) ∏t Zt
= ∏t Zt
Step 3: Zt = Σi Dt(i) exp(−αt yi ht(xi))
= Σ{i : yi ≠ ht(xi)} Dt(i) e^(αt) + Σ{i : yi = ht(xi)} Dt(i) e^(−αt)
= εt e^(αt) + (1 − εt) e^(−αt)
= 2√(εt(1 − εt))
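The last equality uses αt = (1/2) ln((1 − εt)/εt); substituting gives εt·√((1 − εt)/εt) + (1 − εt)·√(εt/(1 − εt)) = 2√(εt(1 − εt)). A tiny script (hypothetical helper name) confirms the identity numerically:

```python
import math

def Z(eps):
    """Z_t = eps*e^alpha + (1 - eps)*e^(-alpha) with alpha = 0.5*ln((1 - eps)/eps)."""
    alpha = 0.5 * math.log((1 - eps) / eps)
    return eps * math.exp(alpha) + (1 - eps) * math.exp(-alpha)

# the minimized Z_t equals 2*sqrt(eps*(1 - eps)) for any eps in (0, 1/2)
for e in (0.1, 0.3, 0.45):
    assert math.isclose(Z(e), 2 * math.sqrt(e * (1 - e)))
```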
and the margins theory
[plot: train and test error vs. # of rounds T]
expect:
generalization error ≤ training error + Õ(√(dT/m))
[plot: error vs. # of rounds (boosting “stumps” on heart-disease dataset)]
[plot: train and test error vs. # of rounds T (boosting C4.5 on “letter” dataset)]

# rounds     5     100    1000
train error  0.0   0.0    0.0
test error   8.4   3.3    3.1
[with Freund, Bartlett & Lee]
margin = (weighted fraction voting correctly) − (weighted fraction voting incorrectly)
[diagram: margin scale from −1 (high-confidence incorrect) through 0 (low confidence) to +1 (high-confidence correct)]
[Schapire, Freund, Bartlett & Lee ’97]
margin distribution = cumulative distribution of margins of training examples
[plots: train/test error vs. # of rounds T; cumulative margin distributions after 5, 100, 1000 rounds]

# rounds         5      100    1000
train error      0.0    0.0    0.0
test error       8.4    3.3    3.1
% margins ≤ 0.5  7.7    0.0    0.0
minimum margin   0.14   0.52   0.55
- large margins ⇒ better bound on generalization error (independent of number of rounds)
- large margins ⇒ can approximate final classifier by a much smaller classifier (just as polls can predict not-too-close election)
- boosting tends to increase margins of training examples (given weak learning assumption)
- so: although final classifier is getting larger, margins are likely to be increasing, so final classifier actually getting close to a simpler classifier, driving down the test error
with high probability, ∀θ > 0:
generalization error ≤ P̂r[margin ≤ θ] + Õ(√(d/m) / θ)
(P̂r[·] = empirical probability)
furthermore: P̂r[margin ≤ θ] → 0 exponentially fast (in T) if εt < 1/2 − θ (∀t)
so: under the weak learning assumption, all training examples will quickly have “large” margins
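The margin of a training example is computed directly from the weak-classifier votes: yi F(xi) normalized by Σt αt, so it lies in [−1, +1]. A small sketch (the α values come from the toy example; the vote matrix and function name are hypothetical):

```python
import numpy as np

def margins(alphas, preds, y):
    """Normalized margin y_i * F(x_i) / sum_t alpha_t for each example;
    preds[t, i] = h_t(x_i) in {-1, +1}."""
    F = alphas @ preds                 # F(x_i) = sum_t alpha_t * h_t(x_i)
    return y * F / np.sum(alphas)

alphas = np.array([0.42, 0.65, 0.92])  # the three rounds of the toy example
preds = np.array([[+1, +1, -1, -1],    # hypothetical votes of h1, h2, h3
                  [+1, -1, +1, -1],
                  [+1, +1, +1, -1]])
y = np.array([+1, +1, +1, -1])
m = margins(alphas, preds, y)          # each value lies in [-1, +1]
```

An example on which all weak classifiers vote correctly has margin exactly +1.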
training set
- maximizing the minimum margin [Breiman]
- (even using same weak learner); or
- but margin distributions that are lower overall
[Reyzin & Schapire]
(using different norms)
AdaBoost uses weak learner to search over space
example: h(x) = +1 if x above line L, “don’t know” otherwise
predictions
[with Singer]
weak classifiers can output real-valued predictions:
sign(ht(x)) = prediction, |ht(x)| = “confidence”
Dt+1(i) = (Dt(i)/Zt) · exp(−αt yi ht(xi)), with identical rule for combining weak classifiers
[Schapire & Singer]
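The confidence-rated update differs from the binary case only in that ht(xi) is real-valued, so a confident mistake is penalized more than a hesitant one. A minimal sketch (the function name is made up):

```python
import numpy as np

def confidence_rated_update(D, y, h, alpha):
    """D_{t+1}(i) = D_t(i)/Z_t * exp(-alpha * y_i * h_t(x_i)),
    where h may be real-valued: sign(h) = prediction, |h| = confidence."""
    w = D * np.exp(-alpha * y * h)
    Z = w.sum()                        # Z_t = normalization factor
    return w / Z, Z

D = np.full(4, 0.25)                              # uniform distribution
y = np.array([1.0, 1.0, -1.0, -1.0])
h = np.array([0.9, -0.2, -0.8, 0.1])              # real-valued votes
D2, Z = confidence_rated_update(D, y, h, alpha=1.0)
```

A mildly wrong example (i = 1) ends up with more weight than a confidently correct one (i = 0), just as in the binary case.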
training error(Hfinal) ≤ ∏t Zt = (1/m) Σi exp(−yi Σt αt ht(xi))
where Zt = Σi Dt(i) exp(−αt yi ht(xi))
weak classifier has simple form that can be found efficiently
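For general real-valued ht there is no closed-form αt, but Zt is convex in αt, so a one-dimensional search suffices. A sketch (hypothetical helper; a fine grid scan stands in for a proper convex solver):

```python
import numpy as np

def best_alpha(D, y, h):
    """Choose alpha minimizing Z(alpha) = sum_i D(i) * exp(-alpha * y_i * h(x_i));
    Z is convex in alpha, so a fine grid scan is enough for a sketch."""
    grid = np.linspace(-5, 5, 10001)
    Zs = np.array([np.sum(D * np.exp(-a * y * h)) for a in grid])
    return grid[np.argmin(Zs)]

# for binary h with weighted error eps, the minimizer is 0.5*ln((1 - eps)/eps)
D = np.full(4, 0.25)
y = np.array([1.0, 1.0, 1.0, 1.0])
h = np.array([1.0, 1.0, 1.0, -1.0])    # one mistake: eps = 0.25
a = best_alpha(D, y, h)
```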
[plot: % error vs. number of rounds, train/test, with and without confidence-rated predictions]

% error   round first reached (conf.)   round first reached (no conf.)   speedup
40        268                           16,938                           63.2
35        598                           65,292                           109.2
30        1,888                         >80,000                          –
[with Singer]
- simple patterns, namely, (sparse) n-grams
- minimize Zt
- categories: Collect, Competitor, DialForMe, Directory, HowToDial, PersonToPerson, Rate, ThirdNumber, Time, TimeCharge, Other
[Schapire & Singer]
[table: per-round terms and their weights over categories AC, AS, BC, CC, CO, CM, DM, DI, HO, PP, RA, 3N, TI, TC, OT]

rnd   term
1     collect
2     card
3     my home
4     person ? person
5     code
6     I
rnd   term
7     time
8     wrong number
9     how
10    call
11    seven
12    trying to
13    and
rnd   term
14    third
15    to
16    for
17    charges
18    dial
19    just
examples with most weight are often outliers (mislabeled and/or ambiguous)
- (Collect)
- (Rate)
- please (CallingCard)
- (Collect)
- (Collect)
- (CallingCard)
- and have the charges billed to another number (CallingCard DialForMe)
- call is so bad (BillingCredit)
- (AttService Rate)
- (PersonToPerson)
- a non dialable point in san miguel philippines (AttService Other)
and the margins theory
→ shift in mindset: goal now is merely to find classifiers barely better than random guessing
binary classification
→ overfitting
→ underfitting → low margins → overfitting
noise
[with Freund]
[diagram: example decision stumps, “height > 5 feet?” and “eye color = brown?”, each predicting +1 or −1 at the leaves]
[Freund & Schapire]
[scatter plots: test error of boosting stumps vs. C4.5, and boosting C4.5 vs. C4.5, across benchmark datasets]