

  1. Ensembles. Léon Bottou, COS 424 – 4/8/2010

  2. Readings • T. G. Dietterich (2000), “Ensemble Methods in Machine Learning”. • R. E. Schapire (2003), “The Boosting Approach to Machine Learning”, Sections 1, 2, 3, 4, 6.

  3. Summary 1. Why ensembles? 2. Combining outputs. 3. Constructing ensembles. 4. Boosting.

  4. I. Ensembles

  5. Ensemble of classifiers
     – Consider a set of classifiers h_1, h_2, ..., h_L.
     – Construct a classifier by combining their individual decisions, for example by voting their outputs.
     Accuracy – The ensemble works if the classifiers have low error rates.
     Diversity – No gain if all classifiers make the same mistakes. What if classifiers make different mistakes?

  6. Uncorrelated classifiers
     Assume $\forall r \neq s:\ \mathrm{Cov}\big[\mathbb{1}\{h_r(x) = y\},\ \mathbb{1}\{h_s(x) = y\}\big] = 0$.
     The tally of classifier votes then follows a binomial distribution.
     Example – Twenty-one uncorrelated classifiers, each with a 30% error rate.
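To make the example concrete, here is a minimal Python sketch (not part of the slides) that evaluates the binomial tail for a majority vote of 21 independent classifiers with 30% error rate; under the independence assumption the ensemble errs only when at least 11 voters are wrong.

```python
from math import comb

# Majority vote of L uncorrelated classifiers, each with error rate p.
# The number of wrong votes is Binomial(L, p); the ensemble errs when
# more than half of the classifiers are wrong.
L, p = 21, 0.30
majority_error = sum(comb(L, k) * p**k * (1 - p)**(L - k)
                     for k in range(L // 2 + 1, L + 1))
print(f"majority-vote error: {majority_error:.3f}")   # roughly 0.026
```

The resulting error is far below the 30% individual error rate, which is the point of combining diverse classifiers.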

  7. Statistical motivation
     [Figure: blue region = classifiers that work well on the training set(s); f = best classifier.]

  8. Computational motivation
     [Figure: blue = local optima that the classifier search may reach; f = best classifier.]

  9. Representational motivation
     [Figure: blue = classifier space, which may not contain the best classifier f.]

  10. Practical success
     Recommendation system – Netflix “movies you may like”. Customers sometimes rate movies they rent. Input: (movie, customer). Output: rating.
     Netflix competition – $1M for the first team to do 10% better than their system.
     Winner: BellKor team and friends – an ensemble of more than 800 rating systems.
     Runner-up: everybody else – an ensemble of all the rating systems built by the other teams.

  11. Bayesian ensembles
     Let $D$ represent the training data. Enumerating all the classifiers:
     $$P(y \mid x, D) = \sum_h P(y, h \mid x, D) = \sum_h P(h \mid x, D)\, P(y \mid h, x, D) = \sum_h P(h \mid D)\, P(y \mid x, h)$$
     $P(h \mid D)$: how well $h$ matches the training data. $P(y \mid x, h)$: what $h$ predicts for pattern $x$.
     Note that this is a weighted average.
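As an illustration only, here is a sketch of this weighted average for a finite hypothesis set (an assumption; the slide sums over all classifiers). Each hypothesis is a callable returning P(y=1 | x) for labels in {0, 1}, and its posterior weight is taken proportional to its training-data likelihood under a uniform prior.

```python
import numpy as np

def bayesian_ensemble(hypotheses, X_train, y_train, X_new):
    # hypotheses: callables h(X) -> P(y=1 | x) for each row of X (labels in {0, 1}).
    log_post = []
    for h in hypotheses:
        p = np.clip(h(X_train), 1e-12, 1 - 1e-12)
        log_post.append(np.sum(y_train * np.log(p) + (1 - y_train) * np.log(1 - p)))
    w = np.exp(np.array(log_post) - np.max(log_post))   # P(h | D), uniform prior assumed
    w /= w.sum()
    return sum(wi * h(X_new) for wi, h in zip(w, hypotheses))   # P(y=1 | x, D)
```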

  12. II. Combining Outputs

  13. Simple averaging
     [Diagram: the outputs of the individual classifiers are averaged to form the ensemble output.]

  14. Weighted averaging a priori
     [Diagram: the classifier outputs are combined with fixed weights.]
     Weights derived from the training errors, e.g. $w_t \propto \exp(-\beta\,\mathrm{TrainingError}(h_t))$.
     Approximate Bayesian ensemble.
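A minimal sketch of this weighting scheme; the β value and the prediction format (one row of scores per classifier) are assumptions, since the slide specifies neither. With uniform weights it reduces to the simple averaging of the previous slide.

```python
import numpy as np

def apriori_weights(train_errors, beta=10.0):
    # beta is a hypothetical temperature; the slide only gives exp(-beta * error).
    w = np.exp(-beta * np.asarray(train_errors, dtype=float))
    return w / w.sum()

def weighted_average(predictions, weights):
    # predictions: array of shape (n_classifiers, n_samples) of scores or probabilities.
    return np.average(predictions, axis=0, weights=weights)
```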

  15. Weighted averaging with trained weights
     [Diagram: the classifier outputs are combined with learned weights.]
     Train the weights on the validation set; training the weights on the training set overfits easily.
     You need another validation set to estimate the performance!

  16. Stacked classifiers
     [Diagram: the classifier outputs feed a second-tier classifier.]
     The second-tier classifier is trained on the validation set.
     You need another validation set to estimate the performance!
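A minimal stacking sketch under assumed choices (binary labels, bootstrapped decision trees as the first tier, logistic regression as the second tier, scikit-learn estimators); the essential point from the slide is that the second tier is fit on validation-set outputs, never on training-set outputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def train_stack(X_train, y_train, X_val, y_val, n_base=5, seed=0):
    rng = np.random.default_rng(seed)
    base = []
    for _ in range(n_base):
        idx = rng.integers(0, len(X_train), len(X_train))   # bootstrap for diversity
        base.append(DecisionTreeClassifier(max_depth=3).fit(X_train[idx], y_train[idx]))
    # The second-tier classifier sees only the base outputs on the validation
    # set, not on the training set, to limit overfitting (as the slide warns).
    Z_val = np.column_stack([m.predict_proba(X_val)[:, 1] for m in base])
    stacker = LogisticRegression().fit(Z_val, y_val)
    return base, stacker

def stack_predict(base, stacker, X):
    Z = np.column_stack([m.predict_proba(X)[:, 1] for m in base])
    return stacker.predict(Z)
```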

  17. III. Constructing Ensembles

  18. Diversification
     Cause of the mistake                      Diversification strategy
     Pattern was difficult                     hopeless
     Overfitting (⋆)                           vary the training sets
     Some features were noisy                  vary the set of input features
     Multiclass decisions were inconsistent    vary the class encoding

  19. Manipulating the training examples
     Bootstrap replication simulates training set selection – Given a training set of size n, construct a new training set by sampling n examples with replacement. About 37% (roughly 1/e) of the examples are excluded.
     Bagging – Create bootstrap replicates of the training set. Build a decision tree for each replicate. Estimate tree performance using out-of-bootstrap data. Average the outputs of all decision trees.
     Boosting – See part IV.
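A minimal bagging sketch along these lines, assuming numeric arrays, labels in {0, 1}, and scikit-learn decision trees (the slides do not prescribe an implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging(X, y, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    trees, oob_errors = [], []
    for _ in range(n_trees):
        idx = rng.integers(0, n, n)                # sample n examples with replacement
        oob = np.setdiff1d(np.arange(n), idx)      # the ~37% left out of this replicate
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        trees.append(tree)
        if len(oob) > 0:
            oob_errors.append(np.mean(tree.predict(X[oob]) != y[oob]))
    return trees, float(np.mean(oob_errors))       # trees + out-of-bootstrap error estimate

def bagged_predict(trees, X):
    votes = np.stack([t.predict(X) for t in trees])     # (n_trees, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)      # majority vote
```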

  20. Manipulating the features
     Random forests – Construct decision trees on bootstrap replicas. Restrict the node decisions to a small subset of features picked randomly for each node. Do not prune the trees. Estimate tree performance using out-of-bootstrap data. Average the outputs of all decision trees.
     Multiband speech recognition – Filter speech to eliminate a random subset of the frequencies. Train a speech recognizer on the filtered data. Repeat and combine with a second-tier classifier. The resulting recognizer is more robust to noise.
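For the random-forest recipe, an off-the-shelf sketch using scikit-learn's RandomForestClassifier; the library and hyperparameter values are my choices, not the slides'.

```python
from sklearn.ensemble import RandomForestClassifier

# Bootstrap replicas, a random feature subset considered at every node,
# unpruned trees, and out-of-bag (out-of-bootstrap) performance estimation.
forest = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",     # random subset of features tried at each split
    bootstrap=True,
    oob_score=True,
)
# forest.fit(X, y); print(forest.oob_score_)    # out-of-bootstrap accuracy estimate
```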

  21. Manipulating the output codes
     Reducing multiclass problems to binary classification – We have seen one versus all. We have seen all versus all.
     Error-correcting codes for multiclass problems – Code the class numbers with an error-correcting code. Construct a binary classifier for each bit of the code. Run the error-correction algorithm on the binary classifier outputs.
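A sketch of the error-correcting-output-code idea with nearest-codeword (Hamming) decoding; the codebook, the logistic-regression base learner, and the assumption that every code bit splits the classes into two non-empty groups are mine, not the slides'.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ecoc_train(X, y, codebook):
    # codebook: (n_classes, n_bits) array of 0/1 codewords; y holds class indices.
    return [LogisticRegression(max_iter=1000).fit(X, codebook[y, b])
            for b in range(codebook.shape[1])]

def ecoc_predict(classifiers, X, codebook):
    bits = np.column_stack([clf.predict(X) for clf in classifiers])    # (n, n_bits)
    dist = (bits[:, None, :] != codebook[None, :, :]).sum(axis=2)      # Hamming distances
    return dist.argmin(axis=1)                                         # nearest codeword
```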

  22. IV. Boosting

  23. Motivation
     • It is easy to come up with rough rules of thumb for classifying data: the email contains more than 50% capital letters; the email contains the expression “buy now”.
     • Each rule alone isn't great, but better than random.
     • Boosting converts rough rules of thumb into an accurate classifier.
     Boosting was invented by Prof. Schapire.

  24. Adaboost
     Given examples $(x_1, y_1), \ldots, (x_n, y_n)$ with $y_i = \pm 1$. Let $D_1(i) = 1/n$ for $i = 1 \ldots n$.
     For $t = 1 \ldots T$ do
     • Run the weak learner using the examples with weights $D_t$; get the weak classifier $h_t$.
     • Compute the error: $\varepsilon_t = \sum_i D_t(i)\, \mathbb{1}\{h_t(x_i) \neq y_i\}$.
     • Compute the magic coefficient $\alpha_t = \frac{1}{2} \log \frac{1 - \varepsilon_t}{\varepsilon_t}$.
     • Update the weights: $D_{t+1}(i) = \dfrac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t}$.
     Output the final classifier $f_T(x) = \mathrm{sign}\big(\sum_{t=1}^{T} \alpha_t h_t(x)\big)$.
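A runnable sketch of these steps, using depth-1 decision trees (stumps) as the weak learner; the stump choice, the scikit-learn dependency, and the small epsilon guard are assumptions, not part of the slide.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):                          # y in {-1, +1}
    n = len(X)
    D = np.full(n, 1.0 / n)                        # D_1(i) = 1/n
    stumps, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = np.sum(D * (pred != y))                        # weighted error
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))    # the "magic coefficient"
        D = D * np.exp(-alpha * y * pred)
        D /= D.sum()                                         # divide by Z_t
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    F = sum(a * h.predict(X) for a, h in zip(alphas, stumps))
    return np.sign(F)                              # f_T(x) = sign(sum_t alpha_t h_t(x))
```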

  25. Toy example Weak classifiers: vertical or horizontal half-planes.

  26. Adaboost round 1

  27. Adaboost round 2

  28. Adaboost round 3

  29. Adaboost final classifier

  30. From weak learner to strong classifier (1)
     Preliminary – Unrolling the weight updates:
     $$D_{T+1}(i) = D_1(i)\, \frac{e^{-\alpha_1 y_i h_1(x_i)}}{Z_1} \cdots \frac{e^{-\alpha_T y_i h_T(x_i)}}{Z_T} = \frac{1}{n}\, \frac{e^{-y_i f_T(x_i)}}{\prod_t Z_t}$$
     Bounding the training error:
     $$\frac{1}{n} \sum_i \mathbb{1}\{f_T(x_i) \neq y_i\} \;\le\; \frac{1}{n} \sum_i e^{-y_i f_T(x_i)} \;=\; \sum_i D_{T+1}(i) \prod_t Z_t \;=\; \prod_t Z_t$$
     Idea: make $Z_t$ as small as possible.
     $$Z_t = \sum_{i=1}^{n} D_t(i)\, e^{-\alpha_t y_i h_t(x_i)} = (1 - \varepsilon_t)\, e^{-\alpha_t} + \varepsilon_t\, e^{\alpha_t}$$
     1. Pick $h_t$ to minimize $\varepsilon_t$.  2. Pick $\alpha_t$ to minimize $Z_t$.

  31. From weak learner to strong classifier (2)
     Pick $\alpha_t$ to minimize $Z_t$ (the magic coefficient):
     $$\frac{\partial Z_t}{\partial \alpha_t} = -(1 - \varepsilon_t)\, e^{-\alpha_t} + \varepsilon_t\, e^{\alpha_t} = 0 \;\Longrightarrow\; \alpha_t = \frac{1}{2} \log \frac{1 - \varepsilon_t}{\varepsilon_t}$$
     Weak learner assumption: $\gamma_t = \frac{1}{2} - \varepsilon_t$ is positive and small. Then
     $$Z_t = (1 - \varepsilon_t) \sqrt{\frac{\varepsilon_t}{1 - \varepsilon_t}} + \varepsilon_t \sqrt{\frac{1 - \varepsilon_t}{\varepsilon_t}} = \sqrt{4\, \varepsilon_t (1 - \varepsilon_t)} = \sqrt{1 - 4 \gamma_t^2} \le \exp\!\big(-2 \gamma_t^2\big)$$
     $$\mathrm{TrainingError}(f_T) \le \prod_{t=1}^{T} Z_t \le \exp\!\Big(-2 \sum_{t=1}^{T} \gamma_t^2\Big)$$
     The training error decreases exponentially if $\inf \gamma_t > 0$. But that does not happen beyond a certain point...

  32. Boosting and exponential loss
     Proofs are instructive. We obtained the bound
     $$\mathrm{TrainingError}(f_T) \le \frac{1}{n} \sum_i e^{-y_i f_T(x_i)} = \prod_{t=1}^{T} Z_t$$
     – without saying how $D_t$ relates to $h_t$,
     – without using the value of $\alpha_t$.
     Conclusion – Round $T$ chooses the $h_T$ and $\alpha_T$ that maximize the exponential-loss reduction from $f_{T-1}$ to $f_T$.
     Exercise – Tweak Adaboost to minimize the log loss instead of the exp loss.

  33. Boosting and margins
     $$\mathrm{margin}_H(x, y) = \frac{y\, H(x)}{\sum_t |\alpha_t|} = \frac{\sum_t \alpha_t\, y\, h_t(x)}{\sum_t |\alpha_t|}$$
     Remember support vector machines?
