COMS 4721: Machine Learning for Data Science, Lecture 13, 3/2/2017


  1. COMS 4721: Machine Learning for Data Science Lecture 13, 3/2/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University

  2. BOOSTING. Robert E. Schapire and Yoav Freund, Boosting: Foundations and Algorithms, MIT Press, 2012. See this textbook for many more details. (I borrow some figures from that book.)

  3. BAGGING CLASSIFIERS
Algorithm: Bagging binary classifiers
Given (x_1, y_1), ..., (x_n, y_n), with x ∈ X and y ∈ {−1, +1}:
◮ For b = 1, ..., B:
  ◮ Sample a bootstrap dataset B_b of size n. Each entry of B_b is drawn by selecting (x_i, y_i) with probability 1/n, so some (x_i, y_i) will repeat and some won't appear in B_b at all.
  ◮ Learn a classifier f_b using the data in B_b.
◮ Define the classification rule to be f_bag(x_0) = sign( Σ_{b=1}^{B} f_b(x_0) ).
◮ With bagging, we observe that a committee of classifiers votes on a label.
◮ Each classifier is learned on a bootstrap sample from the data set.
◮ Learning a collection of classifiers is referred to as an ensemble method.
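
A minimal sketch of this bagging procedure in Python, assuming numpy arrays X (features) and y (labels in {−1, +1}) and using a scikit-learn decision tree as the base classifier; the slide does not fix a base classifier, so the tree and the function names below are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, B=25, seed=0):
    """Learn B classifiers, each on a bootstrap sample drawn uniformly with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y)
    classifiers = []
    for b in range(B):
        idx = rng.choice(n, size=n, replace=True)   # each (x_i, y_i) picked with prob 1/n
        classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return classifiers

def bagging_predict(X, classifiers):
    # Each classifier casts one unweighted vote; the ensemble returns the sign of the sum.
    # Using an odd B avoids ties between the +1 and -1 votes.
    return np.sign(sum(f.predict(X) for f in classifiers))
```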

  4. BOOSTING
"How is it that a committee of blockheads can somehow arrive at highly reasoned decisions, despite the weak judgment of the individual members?" - Schapire & Freund, "Boosting: Foundations and Algorithms"
Boosting is another powerful method for ensemble learning. It is similar to bagging in that a set of classifiers is combined to make a better one. It works for any classifier, but a "weak" one that is easy to learn is usually chosen (weak = accuracy a little better than random guessing).
Short history
◮ 1984: Leslie Valiant and Michael Kearns ask if "boosting" is possible.
◮ 1989: Robert Schapire creates the first boosting algorithm.
◮ 1990: Yoav Freund creates an optimal boosting algorithm.
◮ 1995: Freund and Schapire create AdaBoost (Adaptive Boosting), the major boosting algorithm.

  5. BAGGING VS BOOSTING (OVERVIEW)
[Figure: side-by-side diagrams. Bagging: the training sample is used to draw bootstrap samples, and a classifier f_1(x), f_2(x), f_3(x) is learned on each. Boosting: the training sample is reweighted from round to round, and a classifier f_1(x), f_2(x), f_3(x) is learned on each weighted sample.]

  6. THE ADABOOST ALGORITHM (SAMPLING VERSION)
[Figure: starting from the training sample, each round t forms a weighted sample, draws and classifies a bootstrap dataset B_t, and produces a weighted error ε_t, a weight α_t, and a classifier f_t(x).]
The final classification rule is f_boost(x_0) = sign( Σ_{t=1}^{T} α_t f_t(x_0) ).

  7. THE ADABOOST ALGORITHM (SAMPLING VERSION)
Algorithm: Boosting a binary classifier
Given (x_1, y_1), ..., (x_n, y_n), with x ∈ X and y ∈ {−1, +1}, set w_1(i) = 1/n for i = 1, ..., n.
◮ For t = 1, ..., T:
  1. Sample a bootstrap dataset B_t of size n according to the distribution w_t. Notice we pick (x_i, y_i) with probability w_t(i), not 1/n.
  2. Learn a classifier f_t using the data in B_t.
  3. Set ε_t = Σ_{i=1}^{n} w_t(i) 1{y_i ≠ f_t(x_i)} and α_t = (1/2) ln( (1 − ε_t) / ε_t ).
  4. Scale ŵ_{t+1}(i) = w_t(i) e^{−α_t y_i f_t(x_i)} and set w_{t+1}(i) = ŵ_{t+1}(i) / Σ_j ŵ_{t+1}(j).
◮ Set the classification rule to be f_boost(x_0) = sign( Σ_{t=1}^{T} α_t f_t(x_0) ).
Comment: The description is usually simplified to "learn classifier f_t using distribution w_t."
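
A minimal sketch of this sampling version in Python, assuming numpy arrays X and y (labels in {−1, +1}) and a depth-1 scikit-learn tree as the weak classifier; the choice of weak learner and the names below are illustrative, not part of the slide:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier   # max_depth=1 gives a decision stump

def adaboost_train(X, y, T=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.full(n, 1.0 / n)                        # w_1(i) = 1/n
    classifiers, alphas = [], []
    for t in range(T):
        # 1. Bootstrap dataset B_t of size n drawn according to the distribution w_t
        idx = rng.choice(n, size=n, replace=True, p=w)
        # 2. Learn the weak classifier f_t on B_t
        f_t = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        pred = f_t.predict(X)
        # 3. Weighted error eps_t and voting weight alpha_t (clipped to avoid division by zero)
        eps = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - eps) / eps)
        # 4. Reweight the training points and renormalize
        w = w * np.exp(-alpha * y * pred)
        w = w / w.sum()
        classifiers.append(f_t)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(X, classifiers, alphas):
    # f_boost(x_0) = sign( sum_t alpha_t f_t(x_0) )
    return np.sign(sum(a * f.predict(X) for a, f in zip(alphas, classifiers)))
```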

  8. BOOSTING A DECISION STUMP (EXAMPLE 1)
[Figure: the original data, a 2-D set of + and − points under the uniform distribution w_1. Learn a weak classifier; here, use a decision stump, e.g., split on x_1 > 1.7, labeling one side of the threshold ŷ = +1 and the other ŷ = −1.]
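
The weak learner in this example is a decision stump, a threshold test on a single coordinate. A from-scratch sketch of a weighted stump learner; the exhaustive threshold search and the function names are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def learn_stump(X, y, w):
    """Return (dim, threshold, sign) of the stump minimizing the weighted error under w."""
    best_dim, best_thr, best_sign, best_err = 0, 0.0, 1, np.inf
    for d in range(X.shape[1]):
        for thr in np.unique(X[:, d]):
            base = np.where(X[:, d] > thr, 1, -1)
            for s in (1, -1):                      # allow either orientation of the split
                err = np.sum(w * (s * base != y))
                if err < best_err:
                    best_dim, best_thr, best_sign, best_err = d, thr, s, err
    return best_dim, best_thr, best_sign

def stump_predict(X, stump):
    d, thr, s = stump
    return s * np.where(X[:, d] > thr, 1, -1)
```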

  9. BOOSTING A DECISION STUMP (EXAMPLE 1)
[Figure: the round 1 classifier on the same data.]
Weighted error: ε_1 = 0.3. Weight update: α_1 = 0.42.
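
As a quick check of the numbers on this slide, plugging ε_1 = 0.3 into the AdaBoost formula for α gives the stated 0.42 (a one-line sketch):

```python
import numpy as np
# alpha_1 = (1/2) ln((1 - eps_1) / eps_1) with eps_1 = 0.3
print(0.5 * np.log((1 - 0.3) / 0.3))   # 0.4236..., i.e. 0.42 as on the slide
```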

  10. BOOSTING A DECISION STUMP (EXAMPLE 1)
[Figure: the weighted data after round 1; the points misclassified in round 1 carry larger weight.]

  11. BOOSTING A DECISION STUMP (EXAMPLE 1)
[Figure: the round 2 classifier on the reweighted data.]
Weighted error: ε_2 = 0.21. Weight update: α_2 = 0.65.

  12. BOOSTING A DECISION STUMP (EXAMPLE 1)
[Figure: the weighted data after round 2.]

  13. BOOSTING A DECISION STUMP (EXAMPLE 1)
[Figure: the round 3 classifier on the reweighted data.]
Weighted error: ε_3 = 0.14. Weight update: α_3 = 0.92.

  14. BOOSTING A DECISION STUMP (EXAMPLE 1)
[Figure: the classifier after three rounds, formed by combining the three stumps with their weights:]
f_boost(x) = sign( 0.42 · f_1(x) + 0.65 · f_2(x) + 0.92 · f_3(x) ).

  15. BOOSTING A DECISION STUMP (EXAMPLE 2)
Example problem
◮ Random guessing: 50% error
◮ Decision stump: 45.8% error
◮ Full decision tree: 24.7% error
◮ Boosted stump: 5.8% error

  16. BOOSTING
[Figure: scatter plot in which each point is one dataset and its location gives the error rate with and without boosting.]
The boosted version of the same classifier almost always produces better results.

  17. BOOSTING
(left) Boosting a bad classifier is often better than not boosting a good one. (right) Boosting a good classifier is often better, but can take more time.

  18. BOOSTING AND FEATURE MAPS
Q: What makes boosting work so well?
A: This is a well-studied question. We will present one analysis later, but we can also give intuition by tying it in with what we've already learned.
The classification of a new x_0 from boosting is f_boost(x_0) = sign( Σ_{t=1}^{T} α_t f_t(x_0) ).
Define φ(x) = [f_1(x), ..., f_T(x)]^⊤, where each f_t(x) ∈ {−1, +1}.
◮ We can think of φ(x) as a high-dimensional feature map of x.
◮ The vector α = [α_1, ..., α_T]^⊤ corresponds to a hyperplane.
◮ So the classifier can be written f_boost(x_0) = sign( φ(x_0)^⊤ α ).
◮ Boosting learns the feature mapping and the hyperplane simultaneously.
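
This equivalence is easy to see in code. A small sketch, assuming the classifiers and alphas lists returned by the AdaBoost sketch earlier (or any list of ±1-valued classifiers and their weights):

```python
import numpy as np

def phi(X, classifiers):
    # Feature map phi(x) = [f_1(x), ..., f_T(x)], one ±1 column per weak classifier
    return np.column_stack([f.predict(X) for f in classifiers])

def boost_as_linear(X, classifiers, alphas):
    # The boosted vote is a linear classifier in the learned feature space:
    # sign( phi(x)^T alpha ), identical to adaboost_predict above.
    return np.sign(phi(X, classifiers) @ np.asarray(alphas))
```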

  19. APPLICATION: FACE DETECTION

  20. FACE DETECTION (VIOLA & JONES, 2001)
Problem: Locate the faces in an image or video.
Processing: Divide the image into patches of different scales, e.g., 24 × 24, 48 × 48, etc. Extract features from each patch. Classify each patch as face or no face using a boosted decision stump. This can be done in real time, for example by your digital camera (at 15 fps).
◮ Take one patch from a larger image and mask it with many "feature extractors."
◮ Each pattern gives one number: the sum of all pixels in the black region minus the sum of the pixels in the white region (a total of 45,000+ features).
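
A sketch of one such rectangle feature on a grayscale patch. The two-rectangle layout, the window coordinates, and the function name below are illustrative assumptions; the actual Viola & Jones detector uses integral images and many rectangle configurations to obtain the 45,000+ features:

```python
import numpy as np

def two_rectangle_feature(patch, r0, r1, c0, c1):
    """Sum of pixels in the left (black) half of the window minus the sum in the right (white) half."""
    window = patch[r0:r1, c0:c1].astype(float)
    mid = window.shape[1] // 2
    return window[:, :mid].sum() - window[:, mid:].sum()

# Example on a random 24 x 24 "patch" (a stand-in for a real image patch)
patch = np.random.default_rng(0).integers(0, 256, size=(24, 24))
print(two_rectangle_feature(patch, 4, 20, 2, 22))
```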

  21. FACE DETECTION (EXAMPLE RESULTS)

  22. ANALYSIS OF BOOSTING

  23. ANALYSIS OF BOOSTING
Training error theorem
We can use analysis to make a statement about the accuracy of boosting on the training data.
Theorem: Under the AdaBoost framework, if ε_t is the weighted error of classifier f_t, then for the classifier f_boost(x_0) = sign( Σ_{t=1}^{T} α_t f_t(x_0) ),
  training error = (1/n) Σ_{i=1}^{n} 1{y_i ≠ f_boost(x_i)} ≤ exp( −2 Σ_{t=1}^{T} (1/2 − ε_t)² ).
Even if each ε_t is only a little better than random guessing, the sum over T classifiers can lead to a large negative value in the exponent when T is large. For example, if we set ε_t = 0.45 and T = 1000, then training error ≤ 0.0067.
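
The closing number is easy to reproduce; a two-line check of the bound for ε_t = 0.45 and T = 1000:

```python
import numpy as np
# exp(-2 * sum_t (1/2 - eps_t)^2) with eps_t = 0.45 for t = 1, ..., 1000
print(np.exp(-2 * 1000 * (0.5 - 0.45) ** 2))   # 0.0067..., matching the slide
```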

  24. PROOF OF THEOREM
Setup
We break the proof into three steps. It is an application of the fact that if a ≤ b (Step 2) and b ≤ c (Step 3), then a ≤ c (conclusion).
◮ Step 1 calculates the value of b.
◮ Steps 2 and 3 prove the two inequalities.
Also recall the following steps from AdaBoost:
◮ Update: ŵ_{t+1}(i) = w_t(i) e^{−α_t y_i f_t(x_i)}.
◮ Normalize: w_{t+1}(i) = ŵ_{t+1}(i) / Σ_j ŵ_{t+1}(j), and define Z_t = Σ_j ŵ_{t+1}(j).

  25. PROOF OF THEOREM (a ≤ b ≤ c)
Step 1
We first want to expand the equation of the weights to show that
  w_{T+1}(i) = (1/n) e^{−y_i Σ_{t=1}^{T} α_t f_t(x_i)} / Π_{t=1}^{T} Z_t = (1/n) e^{−y_i h_T(x_i)} / Π_{t=1}^{T} Z_t,  where h_T(x) := Σ_{t=1}^{T} α_t f_t(x).
Derivation of Step 1: Notice the update rule w_{t+1}(i) = (1/Z_t) w_t(i) e^{−α_t y_i f_t(x_i)}. Do the same expansion for w_t(i) and continue until reaching w_1(i) = 1/n:
  w_{T+1}(i) = w_1(i) × (e^{−α_1 y_i f_1(x_i)} / Z_1) × · · · × (e^{−α_T y_i f_T(x_i)} / Z_T).
The product Π_{t=1}^{T} Z_t is "b" above. We use this form of w_{T+1}(i) in Step 2.
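
This identity is easy to verify numerically. A toy sketch using arbitrary stumps; the data, thresholds, and variable names are made up purely for the check:

```python
import numpy as np

# Verify Step 1: after T rounds, w_{T+1}(i) == (1/n) exp(-y_i h_T(x_i)) / prod_t Z_t
rng = np.random.default_rng(0)
n, T = 20, 5
X = rng.normal(size=(n, 1))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=n) > 0, 1, -1)
w = np.full(n, 1.0 / n)       # w_1(i) = 1/n
h = np.zeros(n)               # running h_T(x_i) = sum_t alpha_t f_t(x_i)
Z_prod = 1.0                  # running product of the normalizers Z_t
for t in range(T):
    thr = rng.choice(X[:, 0])                          # an arbitrary stump threshold
    pred = np.where(X[:, 0] > thr, 1, -1)              # f_t(x_i)
    eps = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - eps) / eps)
    w_hat = w * np.exp(-alpha * y * pred)              # unnormalized update
    Z = w_hat.sum()
    w, Z_prod, h = w_hat / Z, Z_prod * Z, h + alpha * pred
print(np.allclose(w, np.exp(-y * h) / (n * Z_prod)))   # True
```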

  26. PROOF OF THEOREM (a ≤ b ≤ c)
Step 2
Next, show that the training error of f_boost^(T) (boosting after T steps) is ≤ Π_{t=1}^{T} Z_t.
Currently we know that
  w_{T+1}(i) = (1/n) e^{−y_i h_T(x_i)} / Π_{t=1}^{T} Z_t,  so  (1/n) e^{−y_i h_T(x_i)} = w_{T+1}(i) Π_{t=1}^{T} Z_t,  and f_boost^(T)(x) = sign(h_T(x)).
Derivation of Step 2: Observe that 0 < e^{z_1} < 1 and 1 < e^{z_2} for any z_1 < 0 < z_2, so the indicator of a mistake is bounded by the exponential: 1{y_i ≠ f_boost^(T)(x_i)} ≤ e^{−y_i h_T(x_i)}. Therefore
  (1/n) Σ_{i=1}^{n} 1{y_i ≠ f_boost^(T)(x_i)} ≤ (1/n) Σ_{i=1}^{n} e^{−y_i h_T(x_i)} = Σ_{i=1}^{n} w_{T+1}(i) Π_{t=1}^{T} Z_t = Π_{t=1}^{T} Z_t.
The left-hand side, "a", is the training error, the quantity we care about; the final equality holds because the weights w_{T+1}(i) sum to one, leaving "b", the product of normalizers.

  27. PROOF OF THEOREM (a ≤ b ≤ c)
Step 3
The final step is to calculate an upper bound on Z_t, and by extension Π_{t=1}^{T} Z_t.
Derivation of Step 3: This step is slightly more involved. It also shows why α_t := (1/2) ln( (1 − ε_t) / ε_t ).
  Z_t = Σ_{i=1}^{n} w_t(i) e^{−α_t y_i f_t(x_i)}
      = e^{−α_t} Σ_{i: y_i = f_t(x_i)} w_t(i) + e^{α_t} Σ_{i: y_i ≠ f_t(x_i)} w_t(i)
      = e^{−α_t} (1 − ε_t) + e^{α_t} ε_t.
Remember we defined ε_t = Σ_{i: y_i ≠ f_t(x_i)} w_t(i), the probability of error under w_t.
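
The excerpt stops at this point. A sketch of how the calculation is usually completed; this continuation is not on the slide above, but it follows directly from the definition of α_t:

```latex
\begin{align*}
Z_t &= e^{-\alpha_t}(1-\epsilon_t) + e^{\alpha_t}\epsilon_t
     = \sqrt{\tfrac{\epsilon_t}{1-\epsilon_t}}\,(1-\epsilon_t)
       + \sqrt{\tfrac{1-\epsilon_t}{\epsilon_t}}\,\epsilon_t
     = 2\sqrt{\epsilon_t(1-\epsilon_t)}
     \le e^{-2\left(\frac{1}{2}-\epsilon_t\right)^2}.
\end{align*}
```

Multiplying over t = 1, ..., T then gives Π_{t=1}^{T} Z_t ≤ exp( −2 Σ_{t=1}^{T} (1/2 − ε_t)² ), which combined with Step 2 is exactly the bound in the theorem.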
