MACHINE LEARNING - 2013
Bagging, Boosting and RANSAC
MACHINE LEARNING - 2013
Bootstrap Aggregation
- The Main Idea
- Some Examples
- Why it works
2
Bagging
MACHINE LEARNING - 2013
- Imagine we have m sets of n independent observations
S(1) = {(X1,Y1), ..., (Xn,Yn)}(1) , ... , S(m) = {(X1,Y1), ..., (Xn,Yn)}(m)
all taken i.i.d. from the same underlying distribution P
- Traditional approach: generate a single ϕ(x,S) from all the data samples
- Aggregation: learn ϕ(x,S) by averaging the ϕ(x,S(k)) over many k
The Main Idea
3
Aggregation
[Figure: a single model ϕ(x,S) and the bootstrap models ϕ(x,S(k)), before and after aggregation]
MACHINE LEARNING - 2013
The Main Idea
- Unfortunately, we usually have one single observation set S
- Idea: bootstrap S to form the S(k) observation sets
- (canonical) Sample from S with replacement (duplicating some samples) until you fill a new S(k) of the same size as S
- (practical) Take a sub-set of the samples of S (use a smaller set)
- The samples not used by each set are validation samples
4
Bootstrapping
[Figure: bootstrap sets S(1), S(2), S(3) drawn from S]
MACHINE LEARNING - 2013
The Main Idea
- Generate S(1), ..., S(m) by bootstrapping
- Compute each ϕ(x,S(k)) individually
- Compute ϕ(x,S) = Ek[ϕ(x,S(k))] by aggregation (a short sketch follows below)
5
Bagging
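A minimal sketch of the bagging procedure above, assuming scikit-learn decision-tree regressors as the base model ϕ and a toy noisy-sine dataset (both are illustrative choices, not part of the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, Y, m=10, seed=0):
    """Train m models, each on a bootstrap sample S(k) drawn from S with replacement."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)   # n indices sampled with replacement
        models.append(DecisionTreeRegressor().fit(X[idx], Y[idx]))
    return models

def bagging_predict(models, X):
    """Aggregation: phi(x,S) = E_k[ phi(x,S(k)) ], i.e. average the m predictions."""
    return np.mean([mdl.predict(X) for mdl in models], axis=0)

# Toy usage: noisy sine regression (illustrative data).
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 6, size=(200, 1)), axis=0)
Y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=200)
models = bagging_fit(X, Y, m=25)
Y_hat = bagging_predict(models, X)
```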
MACHINE LEARNING - 2013
6
- We select some input samples
- We learn a regression model
- Very sensitive to the input selection
- m training sets = m different models
A concrete example
(x1(1), y1(1)), ..., (xn(1), yn(1))   →   f̂(1)(x) = Y(1)
(x1(2), y1(2)), ..., (xn(2), yn(2))   →   f̂(2)(x) = Y(2)
The m models f̂(1), ..., f̂(m) give predictions Y(1), ..., Y(m)
MACHINE LEARNING - 2013
7
Aggregation: combine several models
[Figure: aggregated regression with m = 4, m = 10 and m = 60 models]
- Linear combination of simple models
- More examples = Better model
- We can stop when we’re satisfied
Z = (1/m) Σi=1..m Y(i)
MACHINE LEARNING - 2013
Proof of convergence
- Assumptions
- Y(1),...,Y(m) are iid
- E(Y) = y (Y is an unbiased estimator of y)
8
Z = (1/m) Σi=1..m Y(i)
E(Z) = (1/m) Σi=1..m E(Y(i)) = (1/m) Σi=1..m y = y
- Expected Error
E((Y − y)²) = E((Y − E(Y))²) = σ²(Y)
- With Aggregation
E((Z − y)²) = E((Z − E(Z))²) = σ²(Z) = σ²((1/m) Σi=1..m Y(i)) = (1/m²) Σi=1..m σ²(Y(i)) = (1/m)((1/m) Σi=1..m σ²(Y(i))) = (1/m) σ²(Y)
Hypothesis: the average converges to something meaningful
As m → ∞ the variance (and hence the error) goes to zero: we have our underlying estimator!
(A small numerical check follows below.)
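A small numerical check of the σ²(Z) = σ²(Y)/m result above; the Gaussian distribution and the specific numbers are illustrative assumptions, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2_Y = 4.0            # variance of a single estimator Y (illustrative)
y = 1.5                   # the true value; E(Y) = y (unbiased)

for m in (1, 4, 16, 64):
    # Draw many aggregated estimates Z = (1/m) * sum_i Y^(i).
    Y = rng.normal(loc=y, scale=np.sqrt(sigma2_Y), size=(100_000, m))
    Z = Y.mean(axis=1)
    # Empirical var(Z) should match sigma^2(Y) / m.
    print(m, np.var(Z), sigma2_Y / m)
```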
MACHINE LEARNING - 2013
In layman's terms
9
[Figure: distributions of the single estimator Y and the aggregate Z around the true value y]
The expected error (variance) of Y is larger than that of Z
The variance of Z shrinks as m grows
MACHINE LEARNING - 2013
Relaxing the assumptions
- We DROP the second assumption
- Y(1),...,Y(m) are i.i.d. (kept)
- E(Y) = y, i.e. Y is an unbiased estimator of y (dropped)
10
E((Y − y)²) = E((Y − E(Y) + E(Y) − y)²)   (we add these terms, then regroup them)
            = E((Y − E(Y))²) + E((E(Y) − y)²) + 2(E(Y) − y) · E(Y − E(Y))
The last term vanishes because E(Y − E(Y)) = 0, and the first term is σ²(Y) ≥ 0 (the larger it is, the better for us), so
E((Y − y)²) ≥ E((E(Y) − y)²)
Since E(Z) = E(Y) and σ²(Z) = σ²(Y)/m, we also get E((Z − y)²) = σ²(Y)/m + (E(Y) − y)² ≤ E((Y − y)²), i.e.
E((Y − y)²) ≥ E((Z − y)²)
using Z gives us a smaller error (even if we can't prove convergence to zero)
MACHINE LEARNING - 2013
- Instability is good
- The more variable (unstable) the form of ϕ(x,S) is,
the more improvement can potentially be obtained
- Low-variability methods (e.g. PCA, LDA) improve less
than high-variability ones (e.g. LWR, Decision Trees)
- Loads of redundancy
- Most predictors do roughly “the same thing”
Peculiarities
11
MACHINE LEARNING - 2013
From Bagging to Boosting
- Bagging: each model is trained independently
- Boosting: each model is built on top of the
previous ones
12
MACHINE LEARNING - 2013
- The Main Idea
- The Thousand Flavours of Boost
- Weak Learners and Cascades
13
Adaptive Boosting AdaBoost
MACHINE LEARNING - 2013
The Main Idea
- Combine several simple models (weak learners)
- Avoid redundancy
- Each learner complements the previous ones
- Keep track of the errors of the previous learners
14
Iterative Approach
MACHINE LEARNING - 2013
Weak Learners
- A “simple” classifier that can be generated easily
- As long as it is better than random, we can use it
- Better when tailored to the problem at hand
- e.g. features that are very fast to compute (for images)
15
MACHINE LEARNING - 2013
AdaBoost
- We choose a weak learner model ϕ(x)
- Initialization
- Generate ϕ1(x), ... , ϕN(x) weak learners
- N can be in the hundreds of thousands
- Assign a weight wi to each training sample
16
(e.g. f (x,v) = x ∙v > θ )
[Figure: candidate projection directions v1, ..., v7 and threshold θ]
Initialization
MACHINE LEARNING - 2013
- Compute the error ej for each classifier ϕj(x)
- Select the ϕj with the smallest classification
error
- Update the weights wi depending on how they
are classified by ϕj .
AdaBoost
17
[Figure: weak learners v1, ..., v7 evaluated on the re-weighted samples]
Iterations
ej = Σi=1..n wi · 1[ϕj(xi) ≠ yi]
select argminj ( Σi=1..n wi · 1[ϕj(xi) ≠ yi] )
Here comes the important part
MACHINE LEARNING - 2013
Updating the weights
- Evaluate how “well” ϕj(x) is performing
- Update the weights for each sample
18
α = (1/2) ln((1 − ej) / ej)   (how far are we from a perfect classification?)
wi(t+1) = wi(t) · exp(α(t))    if ϕj(xi) ≠ yi   (make it bigger)
wi(t+1) = wi(t) · exp(−α(t))   if ϕj(xi) = yi   (make it smaller)
MACHINE LEARNING - 2013
- Recompute the error ej for each classifier ϕj(x)
using the updated weights
- Select the new ϕj with the smallest
classification error
- Update the weights wi .
AdaBoost
Rinse and Repeat
ej = Σi=1..n wi · 1[ϕj(xi) ≠ yi]
select argminj ( Σi=1..n wi · 1[ϕj(xi) ≠ yi] )
(a compact sketch of the full loop follows below)
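A compact sketch of the Discrete AdaBoost loop described on the previous slides, assuming binary labels y ∈ {−1, +1} and a pre-generated pool of weak learners; the number of rounds T and the numerical clipping are illustrative choices:

```python
import numpy as np

def adaboost(X, y, learners, T=50):
    """Discrete AdaBoost; y in {-1,+1}, `learners` is a list of functions X -> {-1,+1}."""
    n = len(X)
    w = np.full(n, 1.0 / n)                      # uniform initial sample weights
    preds = np.array([h(X) for h in learners])   # precompute weak predictions, shape (L, n)
    chosen, alphas = [], []
    for _ in range(T):
        errs = (w * (preds != y)).sum(axis=1)    # e_j = sum_i w_i * 1[phi_j(x_i) != y_i]
        j = int(np.argmin(errs))                 # weak learner with the smallest weighted error
        e = float(np.clip(errs[j], 1e-12, 1 - 1e-12))
        a = 0.5 * np.log((1 - e) / e)            # alpha = 1/2 * ln((1 - e_j) / e_j)
        w = w * np.exp(-a * y * preds[j])        # misclassified samples get larger weights
        w /= w.sum()
        chosen.append(j)
        alphas.append(a)

    def strong(Xq):
        """Weighted vote of the selected weak learners."""
        votes = sum(a * learners[j](Xq) for a, j in zip(alphas, chosen))
        return np.sign(votes)
    return strong
```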
MACHINE LEARNING - 2013
20
Boosting In Action
The Checkerboard Problem
MACHINE LEARNING - 2013
21
Initialization
Boosting In Action
- We choose a simple weak learner
- We generate a thousand random vectors
v1,...,v1000 and corresponding learners fj(x,vj)
- For each fj(x,vj) we compute a good threshold θj (one possible way is sketched below)
f (x,v) = x ∙v > θ
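One possible way to generate such a pool of projection learners; the slides only say "a good threshold" θj is computed, so taking θj as the median projection is an assumption:

```python
import numpy as np

def make_projection_learners(X, n_learners=1000, seed=0):
    """Weak learners of the form f(x, v) = (x . v > theta), mapped to labels {-1, +1}."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    learners = []
    for _ in range(n_learners):
        v = rng.normal(size=d)
        v /= np.linalg.norm(v)                  # random direction v_j
        theta = np.median(X @ v)                # assumed choice of "a good threshold"
        learners.append(lambda Xq, v=v, t=theta: np.where(Xq @ v > t, 1, -1))
    return learners
```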
MACHINE LEARNING - 2013
22
Boosting In Action
[Plot: classification accuracy (50-100%) vs. boosting round (1-10)]
- We look for the best weak learner
- We adjust the importance (weight) of the errors
- Rinse and repeat
and we keep going...
MACHINE LEARNING - 2013
23
[Figure: decision boundaries after 20, 40, 80 and 120 weak learners; accuracy (50-100%) vs. number of weak learners (1-120)]
Boosting In Action
MACHINE LEARNING - 2013
Drawbacks of Boosting
- Overfitting!
- Boosting will overfit if too many weak learners are added
- Training Time
- Training a face detector can take up to 2 weeks on modern computers
24
MACHINE LEARNING - 2013
A thousand different flavors
25
1989: Boosting
1996: AdaBoost
1999: Real AdaBoost
2000: Margin Boost, Modest AdaBoost, Gentle AdaBoost, AnyBoost, LogitBoost
2001: BrownBoost
2003: KLBoost, Weight Boost
2004: FloatBoost, ActiveBoost
2005: JensenShannonBoost, Infomax Boost
2006: Emphasis Boost
2007: Entropy Boost, Reweight Boost
...
- A couple of new boost variants every year
- Reduce overfitting
- Increase robustness to noise
- Tailored to specific problems
- Mainly change two things
- How the error is represented
- How the weights are updated
MACHINE LEARNING - 2013
An example
- Instead of counting the errors, we compute the
probability of correct classification
26
Real AdaBoost
α = (1/2) ln((1 − pj) / pj)
pj = ∏i=1..n wi P(yi = 1 | xi)
wi(t+1) = wi(t) · exp(−yi α(t))

Discrete AdaBoost
α = (1/2) ln((1 − ej) / ej)
ej = Σi=1..n wi · 1[ϕj(xi) ≠ yi]
wi(t+1) = wi(t) · exp(α(t))    if ϕj(xi) ≠ yi
wi(t+1) = wi(t) · exp(−α(t))   if ϕj(xi) = yi
MACHINE LEARNING - 2013
A celebrated example
27
Viola-Jones: Haar-like wavelets on image pixels
I(x): pixel of image I at position x
Two rectangles of pixels, A (positive) and B (negative):
f(x) = Σx∈A I(x) − Σx∈B I(x)
ϕ(x) = 1 if f(x) > 0, −1 otherwise
- Millions of possible classifiers ϕ1(x), ϕ2(x), ...
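A sketch of how such a two-rectangle feature can be evaluated. The integral image (from the original Viola-Jones paper, not shown on the slide) makes any rectangle sum cost four lookups; the rectangle encoding below is an illustrative choice:

```python
import numpy as np

def integral_image(img):
    """Cumulative sums so that any rectangle sum costs at most 4 lookups."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of pixels in rows [top, bottom) and columns [left, right), from integral image ii."""
    total = ii[bottom - 1, right - 1]
    if top > 0:
        total -= ii[top - 1, right - 1]
    if left > 0:
        total -= ii[bottom - 1, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

def two_rect_classifier(ii, A, B):
    """f(x) = sum over A of I(x) - sum over B of I(x); phi(x) = 1 if f(x) > 0 else -1."""
    f = rect_sum(ii, *A) - rect_sum(ii, *B)
    return 1 if f > 0 else -1
```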
MACHINE LEARNING - 2013
Real-Time on HD video
28
MACHINE LEARNING - 2013
29
[Plot: accuracy (50-100%) vs. number of weak learners (1-120), random circles vs. random projections]
Some simpler examples
f (x,c) = (x -c)T(x -c) > θ
Feature: the (squared) distance from a point c
MACHINE LEARNING - 2013
30
[Plot: accuracy (50-100%) vs. number of weak learners (1-120), random circles vs. random projections vs. random rectangles]
Some simpler examples
f(x,R) = 1[x ∈ R]
Feature: being inside a rectangle R
MACHINE LEARNING - 2013
31
f(x,μ,Σ) = P(x | μ,Σ)
Feature: a full-covariance Gaussian
Some simpler examples
[Plot: accuracy (50-100%) vs. number of weak learners (1-120), random Gaussians vs. circles vs. projections vs. rectangles]
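Illustrative implementations of the three feature types on slides 29-31 (distance from a point, membership of a rectangle, density under a full-covariance Gaussian); wrapping the Gaussian density with a threshold to obtain a binary learner is an assumption, as is the use of scipy for the density:

```python
import numpy as np
from scipy.stats import multivariate_normal

def circle_learner(c, theta):
    """Feature: squared distance from a point c.  f(x, c) = (x - c)^T (x - c) > theta."""
    return lambda x: np.sum((x - c) ** 2, axis=-1) > theta

def rectangle_learner(lo, hi):
    """Feature: being inside an axis-aligned rectangle R = [lo, hi]."""
    return lambda x: np.all((x >= lo) & (x <= hi), axis=-1)

def gaussian_learner(mu, Sigma, theta):
    """Feature: density under a full-covariance Gaussian, thresholded to get a binary output."""
    pdf = multivariate_normal(mean=mu, cov=Sigma).pdf
    return lambda x: pdf(x) > theta
```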
MACHINE LEARNING - 2013
Weak Learners don’t need to be weak!
32
20 boosted SVMs with 5 SVs and the RBF kernel
MACHINE LEARNING - 2013
Cascades of Weak Classifiers
- Split the classification task into stages
- Each stage has an increasing number of weak classifiers
- Each stage only needs to ‘learn’ to classify
what the previous ones let through
33
MACHINE LEARNING - 2013
Cascades of Weak Classifiers
34
Stage 1: 10 weak learners, 90% classification
Stage 2: 100 weak learners, 95% classification
Stage 3: 1000 weak learners, 99.9% classification
A sample only moves to the next stage if the current stage says "yes".
1000 samples enter Stage 1: 1000 × 10 = 10,000 tests
100 samples reach Stage 2: 100 × 100 = 10,000 tests
50 samples reach Stage 3: 50 × 1000 = 50,000 tests
Total: 70,000 tests, about 70 tests per sample, instead of the 1110 tests per sample needed to push every sample through all the weak learners (the arithmetic is reproduced below).
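The counting argument above, reproduced as a tiny script (stage sizes and survivor counts taken directly from the slide):

```python
# Weak-learner evaluations for the 3-stage cascade on the slide.
stage_sizes = [10, 100, 1000]      # weak learners per stage
samples_in  = [1000, 100, 50]      # samples that survive into stage 1, 2, 3

cascade_tests = sum(n * s for n, s in zip(samples_in, stage_sizes))
flat_tests = 1000 * sum(stage_sizes)   # every sample through all 1110 learners

print(cascade_tests)   # 10000 + 10000 + 50000 = 70000  (~70 tests per sample)
print(flat_tests)      # 1110000
```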
MACHINE LEARNING - 2013
Cascades of Weak Classifiers
- Advantages
- Stage splits can be chosen manually
- Trade off performance and accuracy in the first stages
- Disadvantages
- Later stages become difficult to train (few samples survive to reach them)
- Requires a very large number of training samples
- Even slower to train than standard boosting
35
MACHINE LEARNING - 2013
The curious case of RANSAC
- Created to withstand hordes of outliers
- Has not been proven to converge to the optimal solution in any reasonable time
- Extremely effective in practice
- Easy to implement
- Widely used in Computer Vision
36
RANdom SAmple Consensus
MACHINE LEARNING - 2013
RANSAC in practice
- Select a random subset of samples
- Estimate the model on these samples
- Sift through all other samples
- If they are close to the model, add to the
Consensus
- If the consensus is big enough, keep it
- Repeat from top
37
Keep the best consensus (e.g. most samples, least error, etc.); a minimal sketch follows below
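A minimal RANSAC sketch for fitting a 2-D line, following the steps above; the minimal-set size, inlier tolerance, consensus threshold and iteration count are illustrative assumptions:

```python
import numpy as np

def ransac_line(pts, n_iters=200, inlier_tol=0.05, min_consensus=20, seed=0):
    """pts: (n, 2) array. Robustly fit y = a*x + b, keeping the best consensus set."""
    rng = np.random.default_rng(seed)
    best_model, best_inliers = None, np.array([], dtype=int)
    x, y = pts[:, 0], pts[:, 1]
    for _ in range(n_iters):
        i, j = rng.choice(len(pts), size=2, replace=False)   # random minimal subset
        if x[i] == x[j]:
            continue
        a = (y[j] - y[i]) / (x[j] - x[i])                    # model estimated on the subset
        b = y[i] - a * x[i]
        residuals = np.abs(y - (a * x + b))
        inliers = np.flatnonzero(residuals < inlier_tol)     # samples close to the model
        if len(inliers) >= min_consensus and len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers       # keep the best consensus
    return best_model, best_inliers
```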
MACHINE LEARNING - 2013
Some examples of RANSAC
38
- B. Noris and A. Billard. “Aggregation of Asynchronous Eye-Tracking streams from Sets of Uncalibrated Panoramic Images”. (2011)
[Figure: input video frames matched to a reference panoramic image]
MACHINE LEARNING - 2013
Some examples of RANSAC
39
Scaramuzza et al. “Real-Time Monocular Visual Odometry for On-Road Vehicles with 1-Point RANSAC” ICRA 2009.
Pipeline: input image from a 360° camera → compute visual features (KLT) → exclude outliers (bad features, via RANSAC) → compute visual odometry
Scaramuzza et al. (ETHZ)
MACHINE LEARNING - 2013
RANSAC vs Bagging
40
RANSAC                          Bagging
Train many individual models    Select a random subset of samples
Keep the best                   Keep them all
Needs to be lucky               Proven to converge
Heavy to compute                Very light to compute
MACHINE LEARNING - 2013
Summing up
- Bagging: linear combination of multiple learners
  + Very robust to noise
  − A lot of redundant effort
- Boosting: weighted combination of arbitrary learners
  + Builds a very strong learner from very simple ones
  − Sensitive to noise (at least Discrete AdaBoost)
- RANSAC: iterative evaluation of random learners
  + Very robust against outliers, simple to implement
  − Not guaranteed to converge (although in practice it does)
41