BBM406 Fundamentals of Machine Learning
Lecture 20: AdaBoost
Aykut Erdem // Hacettepe University // Fall 2019
Illustration adapted from Alex Rogozhnikov
Last time… Bias/Variance Tradeoff
Graphical illustration of bias and variance.
http://scott.fortmann-roe.com/docs/BiasVariance.html
slide by David Sontag

Last time… Bagging
• Given a set D of N training examples, create D' by drawing N examples at random with replacement from D.
slide by David Sontag
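A minimal sketch of this bagging step, assuming a generic train_weak_learner routine (the function and variable names here are illustrative, not from the lecture):

import random

def bootstrap_sample(D):
    # Draw len(D) examples from D uniformly at random, with replacement.
    return [random.choice(D) for _ in range(len(D))]

def bagging(D, train_weak_learner, num_models=10):
    # Train one classifier per bootstrap replicate D' and return the ensemble.
    return [train_weak_learner(bootstrap_sample(D)) for _ in range(num_models)]

def bagged_predict(ensemble, x):
    # Unweighted majority vote, assuming each trained classifier returns +1 or -1.
    return 1 if sum(h(x) for h in ensemble) >= 0 else -1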
Last time… Bagged decision trees (Tree t=1, t=2, t=3, …)
[From the book of Hastie, Friedman and Tibshirani]
slide by Nando de Freitas
Boosting
• Goal: combine the classifiers returned by a weak learner into a single strong learner.
• Finding simple, relatively accurate base classifiers is often not hard.
• But how should they be combined?
slide by Mehryar Mohri

Example: "How May I Help You?" [Gorin et al.]
• Goal: automatically categorize the type of call requested by a phone customer (Collect, CallingCard, PersonToPerson, etc.)
• "… please" (Collect)
• "… my office" (ThirdNumber)
• "…" (CallingCard)
• "… the wrong number because I got the wrong party and I would like to have that taken off of my bill" (BillingCredit)
slide by Rob Schapire

Voting (Ensemble Methods)
• Instead of learning a single weak classifier, learn many weak classifiers that are good at different parts of the input space.
• Output class: a (weighted) vote of each classifier; classifiers that are most "sure" will vote with more conviction, and each classifier will be most "sure" about a particular part of the space.
• But how do we force classifiers to learn about different parts of the input space, and how do we weigh their votes?
slide by Aarti Singh & Barnabas Poczos

Boosting
• Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote, weighted by their strength.
slide by Aarti Singh & Barnabas Poczos

Boosting as a greedy algorithm
• At each iteration, a new weak classifier is added to the ensemble.
• Greedy algorithm: for m = 1, ..., M
  - misclassified examples get "heavier" (their weights increase)
  - the update depends on the error of hm
[Source: G. Shakhnarovich]
slide by Raquel Urtasun
… boosting algorithms
slide by Rob Schapire

The AdaBoost Algorithm

Toy example
• weak hypotheses = vertical or horizontal half-planes
• choosing αt: minimize the error; for binary ht, typically use αt = (1/2) ln((1 − εt)/εt)
slide by Rob Schapire
Round 1: h1, ε1 = 0.30, α1 = 0.42; examples misclassified by h1 are reweighted to form D2.
Round 2: h2, ε2 = 0.21, α2 = 0.65; examples misclassified by h2 are reweighted to form D3.
Round 3: h3, ε3 = 0.14, α3 = 0.92.
Final classifier: H_final = sign(0.42 h1 + 0.65 h2 + 0.92 h3)
slide by Rob Schapire
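As a sanity check on the numbers above, each αt follows from the rule αt = (1/2) ln((1 − εt)/εt) used later in the lecture; a quick sketch (the printed values match the slide's 0.42, 0.65, 0.92 up to rounding):

import math

for t, eps in enumerate([0.30, 0.21, 0.14], start=1):
    alpha = 0.5 * math.log((1 - eps) / eps)   # alpha_t = (1/2) ln((1 - eps_t) / eps_t)
    print(f"round {t}: eps = {eps:.2f}, alpha = {alpha:.2f}")
# round 1: eps = 0.30, alpha = 0.42
# round 2: eps = 0.21, alpha = 0.66
# round 3: eps = 0.14, alpha = 0.91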
Voted combination of classifiers
• We want to combine many simple "weak" classifiers into a single "strong" classifier:
  h_m(x) = sign( α1 h(x; θ1) + … + αm h(x; θm) )
• where the (non-negative) votes αi can be used to emphasize component classifiers that are more reliable than others.
slide by Tommi S. Jaakkola

Components: decision stumps
• Consider simple component classifiers generating ±1 labels, e.g.
  h(x; θ) = sign( w1 xk − w0 ),  θ = {k, w1, w0}
• where xk is the k-th component of the input vector. These are called decision stumps.
slide by Tommi S. Jaakkola
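A minimal decision stump in code, following the sign(w1·xk − w0) form above (decision_stump is an illustrative name, not from the lecture):

def decision_stump(k, w1, w0):
    # h(x; theta) = sign(w1 * x[k] - w0), with theta = {k, w1, w0};
    # the stump looks at a single component x[k] of the input vector.
    def h(x):
        return 1 if w1 * x[k] - w0 >= 0 else -1
    return h

# Example: threshold the first feature at 2.5.
h = decision_stump(k=0, w1=1.0, w0=2.5)
print(h([3.0, -1.0]), h([1.0, 5.0]))   # prints: 1 -1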
Loss functions for the combination
• We need a loss function for the combination so we can determine which new component h(x; θ) to add and how many votes it should receive.
• We consider here only a simple exponential loss, exp(−y h(x)).
slide by Tommi S. Jaakkola
Empirical exponential loss
• To choose the new component and its votes, we minimise the empirical exponential loss of the combined classifier over the training examples.
• The new component should optimise a weighted loss, weighted towards the mistakes of the current combination.
• We can check whether adding the component helps by asking whether the loss would decrease as a function of αm, and then set αm to minimise the loss.
slide by Tommi S. Jaakkola
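The formulas on these slides did not survive extraction; the following is a standard reconstruction of the argument, with the weight notation W_i^(m−1) assumed rather than copied from the slides:

J(\alpha_m, \theta_m)
  = \frac{1}{n}\sum_{i=1}^{n} \exp\bigl(-y_i\,h_{m-1}(x_i) - y_i\,\alpha_m\,h(x_i;\theta_m)\bigr)
  = \frac{1}{n}\sum_{i=1}^{n} W_i^{(m-1)} \exp\bigl(-y_i\,\alpha_m\,h(x_i;\theta_m)\bigr),
\qquad W_i^{(m-1)} = \exp\bigl(-y_i\,h_{m-1}(x_i)\bigr).

For fixed \alpha_m > 0, minimising J over \theta_m is equivalent to minimising the weighted error
\epsilon_m = \sum_{i=1}^{n} \tilde{W}_i^{(m-1)}\,[\![\, y_i \neq h(x_i;\theta_m) \,]\!],
\qquad \tilde{W}_i^{(m-1)} = W_i^{(m-1)} \Big/ \textstyle\sum_j W_j^{(m-1)},
and setting \partial J / \partial \alpha_m = 0 then gives
\alpha_m = \tfrac{1}{2}\,\log\frac{1-\epsilon_m}{\epsilon_m}.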
The AdaBoost Algorithm
Given: (x1, y1), . . . , (xm, ym); xi ∈ X, yi ∈ {−1, +1}
Initialise weights D1(i) = 1/m
For t = 1, ..., T:
• Find ht = arg min_{hj ∈ H} εj, where εj = Σ_{i=1}^{m} Dt(i) ⟦yi ≠ hj(xi)⟧
• If εt ≥ 1/2 then stop
• Set αt = (1/2) log((1 − εt) / εt)
• Update Dt+1(i) = Dt(i) exp(−αt yi ht(xi)) / Zt, where Zt is a normalisation factor
Output the final classifier: H(x) = sign( Σ_{t=1}^{T} αt ht(x) )
[Figure: training error vs. step, shown as the slide is repeated for t = 1, 2, 3, ..., 40.]
slide by Jiri Matas and Jan Šochman
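A compact, self-contained sketch of the algorithm above, using decision stumps as the weak hypothesis class H (an illustrative implementation with hypothetical function names, not the code behind the slides):

import math

def train_stump(X, y, D):
    # Exhaustively search (feature k, threshold, polarity) for the decision stump
    # with the smallest weighted training error under the weights D.
    n, d = len(X), len(X[0])
    best_err, best_h = None, None
    for k in range(d):
        for thr in sorted({x[k] for x in X}):
            for polarity in (1, -1):
                preds = [polarity if x[k] >= thr else -polarity for x in X]
                err = sum(D[i] for i in range(n) if preds[i] != y[i])
                if best_err is None or err < best_err:
                    best_err = err
                    best_h = lambda x, k=k, thr=thr, p=polarity: p if x[k] >= thr else -p
    return best_err, best_h

def adaboost(X, y, T=20):
    # X: list of feature vectors, y: list of labels in {-1, +1}.
    n = len(X)
    D = [1.0 / n] * n                            # D1(i) = 1/m
    ensemble = []                                # list of (alpha_t, h_t) pairs
    for t in range(T):
        eps, h = train_stump(X, y, D)            # weak learner minimising weighted error
        if eps >= 0.5:                           # no better than chance: stop
            break
        eps = max(eps, 1e-12)                    # guard against log(0) when eps == 0
        alpha = 0.5 * math.log((1 - eps) / eps)  # alpha_t = (1/2) log((1 - eps_t)/eps_t)
        ensemble.append((alpha, h))
        # Reweight: D_{t+1}(i) = D_t(i) exp(-alpha_t y_i h_t(x_i)) / Z_t
        D = [D[i] * math.exp(-alpha * y[i] * h(X[i])) for i in range(n)]
        Z = sum(D)
        D = [w / Z for w in D]
    return ensemble

def adaboost_predict(ensemble, x):
    # Final classifier: H(x) = sign(sum_t alpha_t h_t(x)).
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

Tracking the training error of the partial ensembles over t gives the kind of error-vs-step curve shown on the slides.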
Reweighting
• The update Dt+1(i) ∝ Dt(i) exp(−αt yi ht(xi)) increases the weights of examples misclassified by ht (yi ht(xi) = −1) and decreases the weights of correctly classified ones (yi ht(xi) = +1).
slide by Jiri Matas and Jan Šochman
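A small numeric illustration of that update, using the α1 = 0.42 from the toy example (the example weight 0.10 is illustrative; both results would still be divided by Zt afterwards):

import math

alpha, w = 0.42, 0.10                   # one classifier's vote and one example's current weight
print(w * math.exp(-alpha * (+1)))      # y*h = +1 (correct):  ~0.066, the example gets lighter
print(w * math.exp(-alpha * (-1)))      # y*h = -1 (mistake):  ~0.152, the example gets heavier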
Boosting results – Digit recognition [Schapire, 1989]
[Figure: error vs. # rounds (10, 100, 1000), with curves for training error and test error.]
slide by Carlos Guestrin
Application: face detection [Viola & Jones]
slide by Rob Schapire
Boosting vs. Logistic Regression
• Logistic regression: the score is a weighted sum of predefined features xj (a linear classifier); the weights w0, w1, w2, … are learned jointly.
• Boosting: the weak classifiers ht(x) are defined dynamically to fit the data (not a linear classifier); their weights are learned incrementally, one per iteration t.
slide by Aarti Singh
Bagging vs. Boosting
• Bagging: resamples the training data; the weight of each classifier in the vote is the same.
• Boosting: reweights the training data (modifies their distribution); each classifier's vote depends on that classifier's accuracy; bias is reduced as the learning rule becomes more complex with iterations.
slide by Aarti Singh

Next lecture: K-Means Clustering