
Bagging, Boosting and RANSAC (MACHINE LEARNING - 2013)



  1. Bagging, Boosting and RANSAC (MACHINE LEARNING - 2013)

  2. Bootstrap Aggregation (Bagging)
     • The Main Idea
     • Some Examples
     • Why it works

  3. The Main Idea: Aggregation
     • Imagine we have m sets of n independent observations
       S^(1) = {(X_1, Y_1), ..., (X_n, Y_n)}^(1), ..., S^(m) = {(X_1, Y_1), ..., (X_n, Y_n)}^(m),
       all drawn iid from the same underlying distribution P
     • Traditional approach: generate some ϕ(x, S) from all the data samples
     • Aggregation: learn ϕ(x, S) by averaging ϕ(x, S^(k)) over many k

  4. The Main Idea: Bootstrapping
     • Unfortunately, we usually have one single observation set S
     • Idea: bootstrap S to form the observation sets S^(1), ..., S^(m)
       • (canonical) Sample from S with replacement, duplicating some samples, until you fill a new S^(i) of the same size as S
       • (practical) Take a sub-set of the samples of S (use a smaller set)
     • The samples not used by each set serve as validation samples
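A minimal sketch of the canonical bootstrap in NumPy; the helper name make_bootstrap_sets and the out-of-bag bookkeeping for the validation samples are illustrative choices, not taken from the slides.

```python
import numpy as np

def make_bootstrap_sets(X, y, m, rng=None):
    """Draw m bootstrap replicates of (X, y) by sampling with replacement.

    Returns a list of (X_boot, y_boot, oob_idx) tuples, where oob_idx are the
    indices not drawn for that replicate (usable as validation samples).
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    sets = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)       # sample n indices with replacement
        oob = np.setdiff1d(np.arange(n), idx)  # out-of-bag samples for validation
        sets.append((X[idx], y[idx], oob))
    return sets
```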

  5. The Main Idea: Bagging
     • Generate S^(1), ..., S^(m) by bootstrapping
     • Compute each ϕ(x, S^(k)) individually
     • Compute ϕ(x, S) = E_k[ϕ(x, S^(k))] by aggregation
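A sketch of the full bagging loop on top of the bootstrap helper above; the scikit-learn DecisionTreeRegressor is an assumed base model used only for illustration (any unstable learner fits the argument).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_test, m=60, rng=0):
    """Train one model per bootstrap replicate and average their predictions."""
    preds = []
    for X_b, y_b, _ in make_bootstrap_sets(X_train, y_train, m, rng):
        model = DecisionTreeRegressor().fit(X_b, y_b)  # one base learner per replicate
        preds.append(model.predict(X_test))
    return np.mean(preds, axis=0)                      # Z = (1/m) * sum_k Y^(k)
```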

  6. A concrete example
     • We select some input samples (x_1^(1), y_1^(1)), ..., (x_n^(1), y_n^(1))
     • We learn a regression model f̂^(1)(x) = Y^(1) (and likewise f̂^(2)(x) = Y^(2) from (x_1^(2), y_1^(2)), ..., (x_n^(2), y_n^(2)))
     • The fit is very sensitive to the input selection
     • m training sets = m different models f̂^(1), ..., f̂^(m) with predictions Y^(1), ..., Y^(m)

  7. Aggregation: combine several models
     • Linear combination of simple models: Z = (1/m) Σ_{i=1}^{m} Y^(i)
     • More models = better aggregate (the slide shows fits for m = 4, m = 10, m = 60)
     • We can stop when we're satisfied

  8. Proof of convergence
     Hypothesis: the average will converge to something meaningful.
     • Assumptions:
       • Y^(1), ..., Y^(m) are iid
       • E(Y) = y (Y is an unbiased estimator of y)
     • Expected error of a single model:
       E((Y - y)^2) = E((Y - E(Y))^2) = σ^2(Y)
     • With aggregation, Z = (1/m) Σ_{i=1}^{m} Y^(i):
       E(Z) = (1/m) Σ_{i=1}^{m} E(Y^(i)) = (1/m) Σ_{i=1}^{m} y = y
       E((Z - y)^2) = E((Z - E(Z))^2) = σ^2(Z) = σ^2((1/m) Σ_{i=1}^{m} Y^(i)) = (1/m^2) Σ_{i=1}^{m} σ^2(Y^(i)) = (1/m) σ^2(Y)
     • Infinitely many observation sets = zero error: we have our underlying estimator!
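A quick numerical check, not from the slides, that the aggregate's variance is σ^2(Y)/m: simulate many trials of m iid unbiased estimators and compare the empirical variances (the noise scale, m and trial count are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)
y_true, sigma, m, trials = 2.0, 1.0, 60, 10_000

# Each trial: m iid unbiased estimators Y^(1..m) of y_true, aggregated into Z.
Y = y_true + sigma * rng.standard_normal((trials, m))
Z = Y.mean(axis=1)

print(np.var(Y[:, 0]))  # ~ sigma^2      (single-model error)
print(np.var(Z))        # ~ sigma^2 / m  (aggregated error, m times smaller)
```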

  9. In layman's terms
     • The expected error (variance) of Y is larger than that of Z
     • The variance of Z shrinks with m

  10. Relaxing the assumptions
     • We DROP the second assumption:
       • Y^(1), ..., Y^(m) are iid
       • (dropped) E(Y) = y, i.e. Y is an unbiased estimator of y
     • Add and subtract E(Y), then regroup:
       E((Y - y)^2) = E((Y - E(Y) + E(Y) - y)^2)
                    = E((Y - E(Y))^2) + E((E(Y) - y)^2) + E(2 (Y - E(Y)) (E(Y) - y))
       where E((Y - E(Y))^2) = σ^2(Y) ≥ 0 and the cross term vanishes because E(Y - E(Y)) = 0
     • Therefore E((Y - y)^2) ≥ E((E(Y) - y)^2), and with Z in place of E(Y):
       E((Y - y)^2) ≥ E((Z - y)^2)
     • The larger σ^2(Y) is, the better for us: using Z gives a smaller error (even if we can no longer prove convergence to zero)

  11. Peculiarities
     • Instability is good
       • The more variable (unstable) the form of ϕ(x, S) is, the more improvement can potentially be obtained
       • Low-variability methods (e.g. PCA, LDA) improve less than high-variability ones (e.g. LWR, decision trees)
     • Loads of redundancy
       • Most predictors do roughly "the same thing"

  12. From Bagging to Boosting
     • Bagging: each model is trained independently
     • Boosting: each model is built on top of the previous ones

  13. Adaptive Boosting (AdaBoost)
     • The Main Idea
     • The Thousand Flavours of Boost
     • Weak Learners and Cascades

  14. The Main Idea: Iterative Approach
     • Combine several simple models (weak learners)
     • Avoid redundancy
     • Each learner complements the previous ones
     • Keep track of the errors of the previous learners

  15. Weak Learners
     • A "simple" classifier that can be generated easily
     • As long as it is better than random, we can use it
     • Better when tailored to the problem at hand, e.g. very fast at retrieval (for images)

  16. AdaBoost: Initialization
     • We choose a weak learner model ϕ(x), e.g. f(x, v) = x · v > θ
     • Initialization:
       • Generate N weak learners ϕ_1(x), ..., ϕ_N(x); N can be in the hundreds of thousands
       • Assign a weight w_i to each training sample
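A sketch of how a pool of such projection stumps could be generated; make_stumps, the unit-norm directions and the zero initial thresholds are illustrative assumptions (the slides only specify the form f(x, v) = x · v > θ).

```python
import numpy as np

def make_stumps(n_stumps, dim, rng=None):
    """Generate random projection directions; each (v, theta) pair defines a
    weak learner phi(x) = +1 if x . v > theta else -1."""
    rng = np.random.default_rng(rng)
    V = rng.standard_normal((n_stumps, dim))
    V /= np.linalg.norm(V, axis=1, keepdims=True)  # unit-length directions
    thetas = np.zeros(n_stumps)                    # thresholds, tuned later per stump
    return V, thetas

def stump_predict(X, v, theta):
    """Evaluate one weak learner on all samples, returning labels in {-1, +1}."""
    return np.where(X @ v > theta, 1, -1)
```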

  17. AdaBoost: Iterations
     • Compute the weighted error e_j of each classifier ϕ_j(x):
       e_j = Σ_{i=1}^{n} w_i · 1[ϕ_j(x_i) ≠ y_i]
     • Select the ϕ_j with the smallest classification error:
       argmin_j Σ_{i=1}^{n} w_i · 1[ϕ_j(x_i) ≠ y_i]
     • Update the weights w_i depending on how the samples are classified by ϕ_j; here comes the important part

  18. Updating the weights
     • Evaluate how "well" ϕ_j(x) is performing (how far are we from a perfect classification?):
       α = (1/2) ln((1 - e_j) / e_j)
     • Update the weight of each sample:
       w_i^(t+1) = w_i^(t) exp(α^(t))   if ϕ_j(x_i) ≠ y_i   (make it bigger)
       w_i^(t+1) = w_i^(t) exp(-α^(t))  if ϕ_j(x_i) = y_i   (make it smaller)
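A sketch of one boosting round over such a pool of stumps (directions V and thresholds thetas as in the earlier snippet), following the error, selection and weight-update formulas of slides 17-18; labels are assumed to be in {-1, +1}, and the final weight normalisation is a common practical addition that the slides leave implicit.

```python
import numpy as np

def adaboost_step(X, y, w, V, thetas):
    """One AdaBoost round: pick the stump with the smallest weighted error,
    compute its vote alpha, and reweight the samples."""
    preds = np.where(X @ V.T > thetas, 1, -1)          # (n_samples, n_stumps) in {-1, +1}
    errors = (w[:, None] * (preds != y[:, None])).sum(axis=0)
    j = np.argmin(errors)                              # best weak learner
    e_j = errors[j]
    alpha = 0.5 * np.log((1.0 - e_j) / e_j)            # how well phi_j performs
    w = w * np.exp(-alpha * y * preds[:, j])           # bigger if misclassified, smaller if correct
    w /= w.sum()                                       # keep the weights normalised
    return j, alpha, w
```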

  19. AdaBoost: Rinse and Repeat
     • Recompute the error e_j of each classifier ϕ_j(x) using the updated weights:
       e_j = Σ_{i=1}^{n} w_i · 1[ϕ_j(x_i) ≠ y_i]
     • Select the new ϕ_j with the smallest classification error:
       argmin_j Σ_{i=1}^{n} w_i · 1[ϕ_j(x_i) ≠ y_i]
     • Update the weights w_i

  20. Boosting In Action: The Checkerboard Problem

  21. Boosting In Action: Initialization
     • We choose a simple weak learner f(x, v) = x · v > θ
     • We generate a thousand random vectors v_1, ..., v_1000 and the corresponding learners f_j(x, v_j)
     • For each f_j(x, v_j) we compute a good threshold θ_j
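One plausible way to "compute a good threshold θ_j" for each random direction: project the data onto v_j and keep the threshold with the lowest weighted error. The slides do not say how θ_j is chosen, so this scan is an assumption.

```python
import numpy as np

def best_threshold(X, y, w, v):
    """Pick the threshold minimising the weighted error of the stump
    phi(x) = +1 if x . v > theta else -1, scanning the projected values."""
    z = X @ v
    best_err, best_theta = np.inf, 0.0
    for theta in np.unique(z):
        pred = np.where(z > theta, 1, -1)
        err = w[pred != y].sum()
        if err < best_err:
            best_err, best_theta = err, theta
    return best_theta, best_err
```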

  22. Boosting In Action: and we keep going...
     • We look for the best weak learner
     • We adjust the importance (weight) of the errors
     • Rinse and repeat
     [Plot: classification accuracy (axis 50-100%) over boosting rounds 1 to 10]

  23. Boosting In Action
     [Plots: results with 20, 40, 80 and 120 weak learners; accuracy (axis 50-100%) as the number of weak learners grows from 1 to 120]

  24. Drawbacks of Boosting
     • Overfitting! Boosting will always overfit with many weak learners
     • Training time: training a face detector takes up to 2 weeks on modern computers

  25. A thousand different flavors
     • A couple of new boost variants every year:
       1989 Boosting; 1996 AdaBoost; 1999 Real AdaBoost; 2000 MarginBoost, Modest AdaBoost, Gentle AdaBoost, AnyBoost, LogitBoost; 2001 BrownBoost; 2003 KLBoost, WeightBoost; 2004 FloatBoost, ActiveBoost; 2005 JensenShannonBoost, InfomaxBoost; 2006 EmphasisBoost; 2007 EntropyBoost, ReweightBoost; ...
     • They reduce overfitting, increase robustness to noise, or are tailored to specific problems
     • They mainly change two things:
       • how the error is represented
       • how the weights are updated

  26. An example
     • Instead of counting the errors, we compute the probability of correct classification
     • Discrete AdaBoost:
       e_j = Σ_{i=1}^{n} w_i · 1[ϕ_j(x_i) ≠ y_i]
       α = (1/2) ln((1 - e_j) / e_j)
       w_i^(t+1) = w_i^(t) exp(α^(t)) if ϕ_j(x_i) ≠ y_i,  w_i^(t+1) = w_i^(t) exp(-α^(t)) if ϕ_j(x_i) = y_i
     • Real AdaBoost:
       p_j = Σ_{i=1}^{n} w_i P(y_i = 1 | x_i)
       α = (1/2) ln((1 - p_j) / p_j)
       w_i^(t+1) = w_i^(t) exp(-y_i α^(t))

  27. A celebrated example: Viola-Jones
     • Haar-like wavelets: I(x) is the pixel of image I at position x; given two rectangles of pixels, A (positive) and B (negative),
       f(x) = Σ_{x ∈ A} I(x) - Σ_{x ∈ B} I(x)
       ϕ(x) = 1 if f(x) > 0, -1 otherwise
     • Millions of possible classifiers ϕ_1(x), ϕ_2(x), ...
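A sketch of a two-rectangle Haar-like feature evaluated through an integral image, the trick that makes Viola-Jones fast enough for real-time detection; the rectangle encoding (row/column bounds) and function names are illustrative.

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over rows and columns, so any rectangle sum costs 4 lookups."""
    return np.asarray(img, dtype=np.float64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] from the integral image ii."""
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= ii[r0 - 1, c1 - 1]
    if c0 > 0:
        total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

def haar_two_rect(ii, A, B):
    """f = (sum over positive rectangle A) - (sum over negative rectangle B);
    the weak classifier is phi = +1 if f > 0 else -1."""
    f = rect_sum(ii, *A) - rect_sum(ii, *B)
    return 1 if f > 0 else -1
```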

  28. Real-Time on HD video

  29. Some simpler examples
     • Feature: the distance from a point c, f(x, c) = (x - c)^T (x - c) > θ
     [Plot: accuracy (axis 50-100%) over 1 to 120 weak learners for Random Circles vs. Random Projections]

  30. Some simpler examples
     • Feature: being inside a rectangle R, f(x, R) = 1[x ∈ R]
     [Plot: accuracy (axis 50-100%) over 1 to 120 weak learners for Random Circles, Random Projections and Random Rectangles]

  31. Some simpler examples
     • Feature: full-covariance Gaussian, f(x, μ, Σ) = P(x | μ, Σ)
     [Plot: accuracy (axis 50-100%) over 1 to 120 weak learners for Random Gaussians, Random Circles, Random Projections and Random Rectangles]
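For concreteness, the three simple features from slides 29-31 written as tiny weak learners; the thresholds and the ±1 thresholding of the Gaussian density are assumptions added for illustration (slide 31 only names the density P(x | μ, Σ) as the feature).

```python
import numpy as np
from scipy.stats import multivariate_normal

def circle_feature(x, c, theta):
    """Weak learner from the squared distance to a point c."""
    d2 = (x - c) @ (x - c)            # f(x, c) = (x - c)^T (x - c)
    return 1 if d2 > theta else -1

def rectangle_feature(x, lo, hi):
    """Weak learner that fires when x falls inside the axis-aligned box [lo, hi]."""
    return 1 if np.all((x >= lo) & (x <= hi)) else -1

def gaussian_feature(x, mu, cov, theta):
    """Weak learner from a full-covariance Gaussian density evaluated at x."""
    return 1 if multivariate_normal.pdf(x, mean=mu, cov=cov) > theta else -1
```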

  32. Weak learners don't need to be weak!
     • 20 boosted SVMs, each with 5 support vectors and an RBF kernel
