CSC 311: Introduction to Machine Learning
Lecture 6 - Bagging, Boosting
Roger Grosse, Chris Maddison, Juhan Bae, Silviu Pitis
University of Toronto, Fall 2020
Today we will introduce ensembling methods, which combine multiple models and can perform better than the individual members.
◮ We’ve seen many individual models (KNN, linear models, neural networks).
◮ Bagging: train models independently on random “resamples” of the training data.
◮ Boosting: train models sequentially, each time focusing on training examples that the previous ones got wrong.
Recall the bias/variance decomposition of the expected squared error at a query point x, where the prediction y is a random variable (the randomness comes from the choice of training set), t is the target, and y⋆ = E[t | x] is the best possible prediction:

    E[(y − t)²] = (y⋆ − E[y])² + Var(y) + Var(t)
                =     bias     + variance + Bayes error

◮ bias: how wrong the expected prediction is (corresponds to underfitting)
◮ variance: the amount of variability in the predictions (corresponds to overfitting)
◮ Bayes error: the inherent unpredictability of the targets
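The decomposition above can be estimated empirically by training on many independently sampled datasets and looking at the spread of predictions at a fixed query point. Below is a minimal simulation sketch (my own illustration, not from the lecture); the toy data-generating process and the degree-2 polynomial model are arbitrary choices.

```python
# Sketch: estimate bias, variance, and Bayes error by resampling training sets.
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(x)                      # y*(x) = E[t | x]

def sample_dataset(n=20):
    x = rng.uniform(-3, 3, size=n)
    t = true_fn(x) + rng.normal(0, 0.5, size=n)   # noise std 0.5 => Bayes error 0.25
    return x, t

x_query = 1.0                             # fixed query point
preds = []
for _ in range(2000):                     # many independent training sets
    x, t = sample_dataset()
    coefs = np.polyfit(x, t, 2)           # fit a simple (degree-2) model
    preds.append(np.polyval(coefs, x_query))

preds = np.array(preds)
bias_sq  = (true_fn(x_query) - preds.mean()) ** 2   # (y* - E[y])^2
variance = preds.var()                              # Var(y)
bayes    = 0.5 ** 2                                 # Var(t), known by construction
print(bias_sq, variance, bayes)
```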
A model that is too simple (one that underfits) typically has:
◮ high bias (because it cannot capture the structure in the data)
◮ low variance (because there’s enough data to get stable estimates)
A model that is too complex (one that overfits) typically has:
◮ low bias (since it learns all the relevant structure)
◮ high variance (it fits the quirks of the data you happened to sample)
[Figure-based question about the decomposition; answer: Bayes error]
Suppose we could somehow sample m independent training sets, train a model on each one, and average their predictions y = (1/m) Σ_{i=1}^m y_i. How does this affect the three terms of the expected loss?
◮ Bayes error: unchanged, since we have no control over it
◮ Bias: unchanged, since the averaged prediction has the same expectation,
    E[y] = E[(1/m) Σ_{i=1}^m y_i] = E[y_i]
◮ Variance: reduced, since we’re averaging over independent samples,
    Var[y] = Var[(1/m) Σ_{i=1}^m y_i] = (1/m²) Σ_{i=1}^m Var[y_i] = (1/m) Var[y_i]
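A quick numeric check of the 1/m variance reduction (my own sketch; the constants are arbitrary):

```python
# Variance of an average of m independent "predictions" with variance sigma^2.
import numpy as np

rng = np.random.default_rng(0)
sigma, m, trials = 2.0, 10, 100_000

preds = 5.0 + rng.normal(0, sigma, size=(trials, m))   # m independent predictions per trial
avg = preds.mean(axis=1)

print(preds[:, 0].var())   # ~ sigma^2     = 4.0  (single model)
print(avg.var())           # ~ sigma^2 / m = 0.4  (average of m models)
```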
◮ A question worth pausing on: why not train a single model on the union of all sampled datasets?
◮ In practice we have only a single training set, so we cannot sample independent datasets. Solution: bootstrap aggregation, or bagging.
  ◮ Take a single dataset D with n examples.
  ◮ Generate m new datasets (“resamples” or “bootstrap samples”), each by sampling n examples from D with replacement.
  ◮ Average the predictions of models trained on each of these datasets.
◮ Intuition: As |D| → ∞, we have pD → psample, so a bootstrap resample behaves approximately like a fresh sample from the data distribution.
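A minimal bagging sketch (my own illustration, not the course's code): bootstrap-resample the training set, fit one model per resample, and average their predictions. The choice of a scikit-learn decision tree as the base model is an assumption; any regressor would do. X and t are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_fit(X, t, m=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)                  # sample n indices with replacement
        models.append(DecisionTreeRegressor().fit(X[idx], t[idx]))
    return models

def bagged_predict(models, X):
    # Average the predictions of the individually trained models.
    return np.mean([model.predict(X) for model in models], axis=0)
```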
A toy regression example:
◮ x ∼ U(−3, 3), t ∼ N(0, 1)
◮ H = … (a small hypothesis class; its definition was given on the slide)
◮ Ensembled hypotheses (mean over 1000 samples): [figure]
◮ The ensembled hypothesis is not in the original hypothesis space H.
◮ E.g., individual accuracy on “Who Wants to be a Millionaire” is only so-so, but the audience’s majority vote (“ask the audience”) is correct far more often than any individual.
◮ In practice, the predictions are not independent (they come from overlapping bootstrap samples), so the full 1/m variance reduction is not achieved.
◮ Possible to show that if the sampled predictions have variance σ² and pairwise correlation ρ, then
    Var( (1/m) Σ_{i=1}^m y_i ) = (1/m)(1 − ρ) σ² + ρ σ²
◮ Ironically, it can be advantageous to introduce additional variability into the algorithm, as long as it reduces the correlation between ensemble members.
  ◮ Intuition: you want to invest in a diversified portfolio, not just one stock.
  ◮ Can help to use average over multiple algorithms, or multiple configurations of the same algorithm.
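A quick simulation check of the formula above (my own sketch; the values of m, σ, and ρ are arbitrary):

```python
# Empirical vs. analytic variance of the average of m correlated predictions.
import numpy as np

rng = np.random.default_rng(0)
m, sigma, rho, trials = 20, 1.5, 0.3, 100_000

# m x m covariance matrix with variance sigma^2 and pairwise correlation rho.
cov = sigma**2 * (rho * np.ones((m, m)) + (1 - rho) * np.eye(m))
preds = rng.multivariate_normal(np.zeros(m), cov, size=trials)

print(preds.mean(axis=1).var())                      # empirical
print((1 - rho) * sigma**2 / m + rho * sigma**2)     # (1/m)(1-rho) sigma^2 + rho sigma^2
```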
Random forests = bagged decision trees, with one extra trick to decorrelate the members:
◮ When choosing each node of the decision tree, choose a random subset of the input features, and only consider splits on those features.
◮ Random forests often work very well with little or no tuning, which makes them a strong default choice.
  ◮ one of the most widely used algorithms in Kaggle competitions
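For reference, scikit-learn's RandomForestClassifier implements this recipe (bagged trees plus a random feature subset at each split, controlled by max_features). A small usage example with an arbitrary synthetic dataset (my own, not from the lecture):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
forest.fit(X_tr, y_tr)
print(forest.score(X_te, y_te))   # accuracy of the bagged, decorrelated ensemble
```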
Bagging summary:
◮ Bagging reduces overfitting by averaging predictions, and is used by most competition winners.
  ◮ Even if a single model is great, a small ensemble usually helps.
◮ Limitations:
  ◮ Does not reduce bias in case of squared error.
  ◮ There is still correlation between classifiers.
    ◮ Random forest solution: add more randomness.
  ◮ Naive mixture (all members weighted equally).
    ◮ If members are very different (e.g., different algorithms, different data sources), a weighted ensemble can do better.
◮ Boosting, covered next, can be seen as a form of weighted ensembling whose members are strongly decorrelated by construction.
Boosting:
◮ Train classifiers sequentially, each time focusing on training examples that the previous ensemble got wrong.
◮ The shifting focus strongly decorrelates their predictions.
To make this precise, we first need the idea of a weighted training set.
◮ The misclassification rate (1/N) Σ_{n=1}^N I[h(x^(n)) ≠ t^(n)] weights each training example equally.
◮ Key idea: we can learn a classifier using different costs (weights) for different examples.
  ◮ Classifier “tries harder” on examples with higher cost.
◮ Change the cost function to the weighted misclassification rate Σ_{n=1}^N w^(n) I[h(x^(n)) ≠ t^(n)], where w^(n) ≥ 0 and Σ_{n=1}^N w^(n) = 1.
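A tiny sketch of the weighted misclassification rate described above (my own illustration; the function and variable names are not from the slides):

```python
import numpy as np

def weighted_error(preds, targets, w):
    """preds, targets: arrays of class labels; w: non-negative example weights."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                        # enforce sum(w) = 1
    return np.sum(w * (preds != targets))  # sum_n w^(n) I[h(x^(n)) != t^(n)]
```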
Requirements for the base classifier used inside boosting:
◮ Needs to minimize weighted error.
◮ Ensemble may get very large, so the base classifier must be fast. It turns out that so-called weak learners (classifiers only slightly better than chance) are enough.
Popular choices of weak learner:
◮ Decision trees
◮ Even simpler: decision stump — a decision tree with a single split
[The formal definition of weak learnability has quantifiers such as “for any distribution over the data” and the requirement that its guarantee holds only probabilistically.]
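A minimal weighted decision stump (my own sketch of the weak learner described above, using a brute-force search over features, thresholds, and signs; labels are assumed to be in {−1, +1}):

```python
import numpy as np

def fit_stump(X, t, w):
    """X: (N, D) features, t: (N,) labels in {-1,+1}, w: (N,) example weights.
    Returns (feature index, threshold, sign) minimizing the weighted error."""
    best = (0, 0.0, 1, np.inf)                 # (dim, thresh, sign, weighted error)
    N, D = X.shape
    for d in range(D):
        for thresh in np.unique(X[:, d]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, d] > thresh, 1, -1)
                err = np.sum(w * (pred != t))
                if err < best[3]:
                    best = (d, thresh, sign, err)
    return best[:3]

def stump_predict(stump, X):
    d, thresh, sign = stump
    return sign * np.where(X[:, d] > thresh, 1, -1)
```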
[Figure: decision stumps in 2D, shown as vertical half spaces and horizontal half spaces]
◮ Individually, such weak learners (vertical or horizontal half spaces) are not very accurate.
◮ But as long as each one achieves weighted error at most 1/2 − γ for some γ > 0, using it with AdaBoost yields a strong classifier, as we will see.
◮ In what follows, the labels will be t^(n) ∈ {−1, +1}.
◮ This is different from previous lectures, where we had t^(n) ∈ {0, 1}.
◮ It is for notational convenience; otherwise everything is equivalent.
AdaBoost algorithm:
Input: data D_N = {x^(n), t^(n)}_{n=1}^N, weak-learning routine WeakLearn, number of iterations T
Initialize sample weights w^(n) = 1/N for n = 1, . . . , N
For t = 1, . . . , T:
◮ Fit a classifier to weighted data (h_t ← WeakLearn(D_N, w)), e.g.,
    h_t ← argmin_{h∈H} Σ_{n=1}^N w^(n) I{h(x^(n)) ≠ t^(n)}
◮ Compute weighted error
    err_t = Σ_{n=1}^N w^(n) I{h_t(x^(n)) ≠ t^(n)} / Σ_{n=1}^N w^(n)
◮ Compute classifier coefficient α_t = (1/2) log((1 − err_t)/err_t)
◮ Update data weights
    w^(n) ← w^(n) exp(2 α_t I{h_t(x^(n)) ≠ t^(n)})
    (equivalently, up to a constant factor, w^(n) exp(−α_t t^(n) h_t(x^(n))))
Return H(x) = sign(Σ_{t=1}^T α_t h_t(x))
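A compact, runnable sketch of this algorithm (my own code, not the course's). The weak learner is a depth-1 scikit-learn tree (a decision stump) trained with sample weights; labels are assumed to be in {−1, +1}.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, t, T=50):
    N = len(X)
    w = np.full(N, 1.0 / N)                         # w^(n) = 1/N
    stumps, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, t, sample_weight=w)
        pred = h.predict(X)
        err = np.sum(w * (pred != t)) / np.sum(w)   # weighted error err_t
        err = np.clip(err, 1e-10, 1 - 1e-10)        # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)       # classifier coefficient alpha_t
        w = w * np.exp(2 * alpha * (pred != t))     # up-weight misclassified examples
        w = w / np.sum(w)                           # normalize (doesn't change the argmin)
        stumps.append(h); alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    scores = sum(a * h.predict(X) for a, h in zip(alphas, stumps))
    return np.sign(scores)                          # H(x) = sign(sum_t alpha_t h_t(x))
```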
Weighting intuition, from α_t = (1/2) log((1 − err_t)/err_t):
◮ If err_t is small, α_t is high, so the (few) misclassified examples get strongly emphasized.
◮ If err_t ≈ 0.5, α_t is low, so misclassified examples are not emphasized.
Round 1: w = (1/10, . . . , 1/10) ⇒ train a classifier h_1 (using w) ⇒
    err_1 = Σ_{i=1}^{10} w_i I{h_1(x^(i)) ≠ t^(i)} / Σ_{i=1}^{10} w_i = 3/10
  ⇒ α_1 = (1/2) log((1 − err_1)/err_1) = (1/2) log(1/0.3 − 1) ≈ 0.42
  ⇒ H(x) = sign(α_1 h_1(x))
Round 2: w = updated weights ⇒ train a classifier h_2 (using w) ⇒
    err_2 = Σ_{i=1}^{10} w_i I{h_2(x^(i)) ≠ t^(i)} / Σ_{i=1}^{10} w_i = 0.21
  ⇒ α_2 = (1/2) log((1 − err_2)/err_2) = (1/2) log(1/0.21 − 1) ≈ 0.66
  ⇒ H(x) = sign(α_1 h_1(x) + α_2 h_2(x))
Round 3: w = updated weights ⇒ train a classifier h_3 (using w) ⇒
    err_3 = Σ_{i=1}^{10} w_i I{h_3(x^(i)) ≠ t^(i)} / Σ_{i=1}^{10} w_i = 0.14
  ⇒ α_3 = (1/2) log((1 − err_3)/err_3) = (1/2) log(1/0.14 − 1) ≈ 0.91
  ⇒ H(x) = sign(α_1 h_1(x) + α_2 h_2(x) + α_3 h_3(x))
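A two-line check of the coefficients in this worked example (my own sketch):

```python
import numpy as np
for err in (0.3, 0.21, 0.14):
    print(err, round(0.5 * np.log((1 - err) / err), 2))   # 0.42, 0.66, 0.91
```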
Theorem (AdaBoost minimizes the training error): assume that at each iteration the weak learner returns a hypothesis h_t with weighted error err_t ≤ 1/2 − γ for all t = 1, . . . , T, with γ > 0. Then the training error of the output H(x) = sign(Σ_{t=1}^T α_t h_t(x)) satisfies
    L_N(H) = (1/N) Σ_{n=1}^N I{H(x^(n)) ≠ t^(n)} ≤ exp(−2 γ² T),
i.e., the training error decreases exponentially with the number of rounds T. For example, with γ = 0.1 and T = 500 the bound is e^(−10) ≈ 4.5 × 10^(−5).
[Slide credit: Robert Schapire’s slides, http://www.cs.princeton.edu/courses/archive/spring12/cos598A/schedule.html ]
Next, we interpret AdaBoost as fitting an additive model: given a base class H of weak learners h_i : x → {−1, +1}, an additive model with m terms is
    H_m(x) = Σ_{i=1}^m α_i h_i(x),
where the α_i ∈ R are coefficients.
A greedy approach to fitting additive models, known as stagewise training:
Initialize H_0(x) = 0. Then, for m = 1 to T:
◮ Compute the m-th hypothesis H_m = H_{m−1} + α_m h_m, i.e. h_m and α_m, assuming the previous additive model H_{m−1} is fixed:
    (h_m, α_m) ← argmin_{h∈H, α} Σ_{i=1}^N L( H_{m−1}(x^(i)) + α h(x^(i)), t^(i) )
◮ Add it to the additive model:
    H_m = H_{m−1} + α_m h_m
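A sketch of stagewise training with squared-error loss (my own illustration, not the course's code). The inner argmin over h is approximated by fitting a small regression tree to the current residuals; α is then given by a closed-form least-squares step.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def stagewise_fit(X, t, T=100):
    H = np.zeros(len(X))                       # H_0(x) = 0
    terms = []
    for _ in range(T):
        r = t - H                              # residuals under squared error
        h = DecisionTreeRegressor(max_depth=1).fit(X, r)
        hx = h.predict(X)
        alpha = (hx @ r) / (hx @ hx + 1e-12)   # argmin_alpha sum_i (r_i - alpha*h(x_i))^2
        H = H + alpha * hx                     # H_m = H_{m-1} + alpha_m h_m
        terms.append((alpha, h))
    return terms

def stagewise_predict(terms, X):
    return sum(alpha * h.predict(X) for alpha, h in terms)
```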
Now consider stagewise training with the exponential loss L_E(z, t) = exp(−t z):
    (h_m, α_m) ← argmin_{h∈H, α} Σ_{i=1}^N exp( −[H_{m−1}(x^(i)) + α h(x^(i))] t^(i) )
               = argmin_{h∈H, α} Σ_{i=1}^N exp( −H_{m−1}(x^(i)) t^(i) ) exp( −α h(x^(i)) t^(i) )
Defining w_i^(m) ≜ exp( −H_{m−1}(x^(i)) t^(i) ), this becomes
    (h_m, α_m) ← argmin_{h∈H, α} Σ_{i=1}^N w_i^(m) exp( −α h(x^(i)) t^(i) )
We need to solve (h_m, α_m) ← argmin_{h∈H, α} Σ_{i=1}^N w_i^(m) exp(−α h(x^(i)) t^(i)). Since h(x^(i)) t^(i) = +1 when h classifies x^(i) correctly and −1 when it is wrong,
    Σ_{i=1}^N w_i^(m) exp(−α h(x^(i)) t^(i))
      = e^(−α) Σ_{i=1}^N w_i^(m) I{h(x^(i)) = t^(i)} + e^(α) Σ_{i=1}^N w_i^(m) I{h(x^(i)) ≠ t^(i)}
      = (e^(α) − e^(−α)) Σ_{i=1}^N w_i^(m) I{h(x^(i)) ≠ t^(i)} + e^(−α) Σ_{i=1}^N w_i^(m)
For any α > 0, the optimal h therefore does not depend on α:
    h_m ← argmin_{h∈H} Σ_{i=1}^N w_i^(m) I{h(x^(i)) ≠ t^(i)},
i.e., the weak learner that minimizes the weighted 0−1 error, exactly as in AdaBoost.
Having chosen h_m, we minimize over α. Define the weighted error
    err_m = Σ_{i=1}^N w_i^(m) I{h_m(x^(i)) ≠ t^(i)} / Σ_{i=1}^N w_i^(m).
Setting the derivative of the objective with respect to α to zero gives
    α_m = (1/2) log((1 − err_m)/err_m),
again matching the AdaBoost coefficient.
Finally, the weights for the next iteration satisfy
    w_i^(m+1) = exp(−H_m(x^(i)) t^(i)) = w_i^(m) exp(−α_m h_m(x^(i)) t^(i)) = w_i^(m) exp(2 α_m I{h_m(x^(i)) ≠ t^(i)}) exp(−α_m),
and the constant factor exp(−α_m) can be dropped, since it is shared by all examples and does not affect any argmin.
To summarize, stagewise training of an additive model with exponential loss produces H(x) = sign(Σ_{i=1}^T α_i h_i(x)) with
    h_m ← argmin_{h∈H} Σ_{i=1}^N w_i^(m) I{h(x^(i)) ≠ t^(i)}
    α_m = (1/2) log((1 − err_m)/err_m),  where err_m = Σ_{i=1}^N w_i^(m) I{h_m(x^(i)) ≠ t^(i)} / Σ_{i=1}^N w_i^(m)
    w_i^(m+1) = w_i^(m) exp(2 α_m I{h_m(x^(i)) ≠ t^(i)})
This is exactly the AdaBoost algorithm: AdaBoost greedily fits an additive model under the exponential loss.
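A small numeric sanity check of the α_m formula above (my own sketch): for a fixed weak learner, the α minimizing the weighted exponential loss agrees with (1/2) log((1 − err)/err). The weights and margins below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
w = rng.random(N); w /= w.sum()
margin = rng.choice([-1, +1], size=N, p=[0.3, 0.7])    # t_i * h(x_i); -1 means misclassified

err = np.sum(w * (margin == -1))                       # weighted error of this fixed h
alpha_closed = 0.5 * np.log((1 - err) / err)

alphas = np.linspace(0.01, 2, 2000)                    # brute-force minimization over alpha
loss = np.array([np.sum(w * np.exp(-a * margin)) for a in alphas])
alpha_grid = alphas[np.argmin(loss)]

print(alpha_closed, alpha_grid)                        # should agree closely
```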
A famous application of boosting: face detection (Viola and Jones, 2001). It uses AdaBoost with a few twists on the standard algorithm:
◮ Change the loss function for weak learners: false positives are less costly than misses.
◮ Smart way to do inference in real time (on 2001 hardware).
The weak classifiers compare total pixel intensities in rectangular regions of the image:
◮ There is a neat trick for computing the total intensity in a rectangle in just a few operations (see the sketch below).
◮ So it is easy to evaluate a huge number of base classifiers, and they are fast at test time.
◮ The algorithm adds classifiers greedily based on their quality on the weighted training cases.
◮ Each classifier uses just one feature.
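The “neat trick” in the Viola-Jones detector is the integral image (summed-area table): after one linear-time pass over the image, the total intensity of any axis-aligned rectangle costs only four lookups. A minimal sketch (my own code, not from the lecture):

```python
import numpy as np

def integral_image(img):
    # ii[y, x] = sum of img[:y, :x]; pad with a leading row/column of zeros.
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, bottom, right):
    # Total intensity of img[top:bottom, left:right] using 4 lookups.
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

img = np.arange(16.0).reshape(4, 4)
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 3, 3), img[1:3, 1:3].sum())   # both 30.0
```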
Ensemble methods combine classifiers to improve performance.
Boosting:
◮ Reduces bias
◮ Increases variance (large ensemble can cause overfitting)
◮ Sequential training
◮ High dependency between ensemble elements
Bagging:
◮ Reduces variance (large ensemble can’t cause overfitting)
◮ Bias is not changed (much)
◮ Parallel training
◮ Want to minimize correlation between ensemble elements.