Boosting: Can we make dumb learners smart?
Aarti Singh Machine Learning 10-701/15-781 Oct 11, 2010
Slides Courtesy: Carlos Guestrin, Freund & Schapire
Project Proposal Due Today!

Why boost weak learners?
Goal: Automatically categorize type of call requested (Collect, Calling card, Person-to-person, etc.)
E.g. If ‘card’ occurs in utterance, then predict ‘calling card’
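To make the rule concrete, here is a minimal sketch of such a keyword-based weak rule in Python (the function name and category strings are illustrative, not from the original call-routing system):

```python
def card_rule(utterance):
    """Toy weak learner: predict 'calling card' whenever the word 'card' appears."""
    return "calling card" if "card" in utterance.lower() else "other"

print(card_rule("I want to charge this to my calling card"))  # -> calling card
print(card_rule("Collect call to New Jersey, please"))         # -> other
```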
Simple (weak) learners, e.g. naïve Bayes, logistic regression, decision stumps (or shallow decision trees):
– Are good: low variance, don't usually overfit
– Are bad: high bias, can't solve hard learning problems
Can we make weak learners always good?
– No!!! But often yes…
Voting (ensemble methods): instead of learning a single (weak) classifier, learn many weak classifiers that are good at different parts of the input space. Output class: (weighted) vote of each classifier.
– Classifiers that are most "sure" will vote with more conviction
– Classifiers will be most "sure" about a particular part of the space
– On average, do better than a single classifier!
Example: two weak classifiers h_1(X), h_2(X), each good on a different part of the space; H: X → Y, with Y = {-1, 1}.
– With equal weights: H(X) = sign(h_1(X) + h_2(X))
– In general, a weighted vote: H(X) = sign(∑_t α_t h_t(X))
But how do you
– force classifiers h_t to learn about different parts of the input space?
– weigh the votes of different classifiers (α_t)?
Boosting: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote.
On each iteration t:
– weight each training example by how incorrectly it was classified
– learn a weak hypothesis h_t
– and a strength for this hypothesis, α_t
Final classifier: H(X) = sign(∑_t α_t h_t(X))
Learning from weighted data:
– D(i): weight of the i-th training example (x_i, y_i)
– Interpretations:
  • the i-th training example counts as D(i) "examples"
  • if we resampled the data, we would draw "heavier" data points more often
– e.g., in MLE, redefine Count(Y=y) to be a weighted count:
  Unweighted data: Count(Y=y) = ∑_{i=1}^m 1(Y_i = y)
  Weighted data (weights D(i)): Count(Y=y) = ∑_{i=1}^m D(i) 1(Y_i = y)
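As a concrete illustration of the weighted count, here is a minimal NumPy sketch (the toy labels and weights are made up for the example):

```python
import numpy as np

y = np.array([1, 1, -1, 1, -1])           # labels Y_i
D = np.array([0.4, 0.1, 0.2, 0.1, 0.2])   # example weights D(i), summing to 1

# Unweighted count: Count(Y=1) = sum_i 1(Y_i = 1)
count_unweighted = np.sum(y == 1)          # -> 3

# Weighted count: Count(Y=1) = sum_i D(i) * 1(Y_i = 1)
count_weighted = np.sum(D * (y == 1))      # -> 0.6

print(count_unweighted, count_weighted)
```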
AdaBoost [Freund & Schapire '95]:
– Initialize weights: D_1(i) = 1/m (initially equal weights)
– For t = 1, …, T:
  • train a weak learner (e.g., naïve Bayes, decision stump) on the data weighted by D_t, obtaining h_t
  • choose a strength α_t (a positive "magic" constant, set below)
  • update the weights:
      D_{t+1}(i) = D_t(i) exp(-α_t y_i h_t(x_i)) / Z_t
    – increase the weight if h_t is wrong on point i, i.e. y_i h_t(x_i) = -1 < 0
    – Z_t is a normalizer chosen so the weights of all points sum to 1: ∑_i D_{t+1}(i) = 1
– Output the final classifier H(X) = sign(∑_t α_t h_t(X))
Weighted training error of h_t:
  ε_t = ∑_i D_t(i) 1(h_t(x_i) ≠ y_i)   ("does h_t get the i-th point wrong?")
Strength of h_t:
  α_t = ½ ln((1 - ε_t)/ε_t)
– ε_t = 0: h_t perfectly classifies all weighted data points ⇒ α_t = ∞
– ε_t = 1: h_t is perfectly wrong, so -h_t is perfectly right ⇒ α_t = -∞
– ε_t = 0.5 ⇒ α_t = 0
Weight Update Rule [Freund & Schapire '95]:
  D_{t+1}(i) = D_t(i) exp(-α_t y_i h_t(x_i)) / Z_t
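Putting the initialization, the weighted error ε_t, the strength α_t, and the weight update together, here is a minimal NumPy sketch of AdaBoost with decision stumps as the weak learner (the helper names and the stump search are my own illustration, not code from the slides):

```python
import numpy as np

def train_stump(X, y, D):
    """Weak learner: choose the (feature, threshold, sign) decision stump
    with the lowest weighted training error under the weights D."""
    best_err, best_stump = np.inf, None
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] > thresh, sign, -sign)
                err = np.sum(D * (pred != y))
                if err < best_err:
                    best_err, best_stump = err, (j, thresh, sign)
    return best_stump

def stump_predict(stump, X):
    j, thresh, sign = stump
    return np.where(X[:, j] > thresh, sign, -sign)

def adaboost(X, y, T):
    """AdaBoost for labels y in {-1, +1}: returns weak hypotheses h_t and strengths alpha_t."""
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                      # initially equal weights D_1(i) = 1/m
    stumps, alphas = [], []
    for t in range(T):
        stump = train_stump(X, y, D)             # weak hypothesis h_t
        pred = stump_predict(stump, X)
        eps = np.clip(np.sum(D * (pred != y)), 1e-10, 1 - 1e-10)  # weighted error eps_t
        alpha = 0.5 * np.log((1 - eps) / eps)    # strength alpha_t
        D = D * np.exp(-alpha * y * pred)        # increase weight if wrong on point i
        D = D / D.sum()                          # normalize by Z_t so weights sum to 1
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    """Final classifier H(X) = sign(sum_t alpha_t h_t(X))."""
    f = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(f)
```

On data where the stumps stay slightly better than random, the training error of predict() falls toward zero as T grows, matching the analysis below.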
Analysis reveals: if each weak learner h_t is slightly better than random guessing on the weighted data (weighted training error ε_t < 0.5), then the training error of AdaBoost decays exponentially fast in the number of rounds T.
Training Error
Training error of the final classifier is bounded by:
  (1/m) ∑_{i=1}^m 1(H(x_i) ≠ y_i) ≤ (1/m) ∑_{i=1}^m exp(-y_i f(x_i))
where f(x) = ∑_t α_t h_t(x) and H(x) = sign(f(x)).
The exp loss is a convex upper bound on the 0/1 loss: 1(y f(x) ≤ 0) ≤ exp(-y f(x)). If boosting can drive this upper bound to 0, then the training error also goes to 0. [Plot: 0/1 loss vs. exp loss as a function of the margin y f(x)]
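A tiny sanity check of this pointwise bound (purely illustrative):

```python
import numpy as np

# 0/1 loss vs. exp loss as a function of the margin y*f(x):
# 1(y*f(x) <= 0) <= exp(-y*f(x)) holds at every margin value.
margins = np.linspace(-3, 3, 61)
zero_one_loss = (margins <= 0).astype(float)
exp_loss = np.exp(-margins)
assert np.all(zero_one_loss <= exp_loss)
```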
Training error of the final classifier is bounded by:
  (1/m) ∑_{i=1}^m 1(H(x_i) ≠ y_i) ≤ (1/m) ∑_{i=1}^m exp(-y_i f(x_i)) = ∏_t Z_t
where Z_t = ∑_i D_t(i) exp(-α_t y_i h_t(x_i)) is the normalizer from the weight update rule.
Proof sketch: applying the weight update rule T times gives
  D_{T+1}(i) = (1/m) exp(-y_i f(x_i)) / ∏_t Z_t,
and since the weights of all points add to 1, ∑_i D_{T+1}(i) = 1, so (1/m) ∑_i exp(-y_i f(x_i)) = ∏_t Z_t.
If Z_t < 1 on every round, the training error decreases exponentially, even though the individual weak learners may not be very good (ε_t ≈ 0.5). [Plot: training error and its upper bound ∏_t Z_t vs. round t]
Training error of the final classifier is bounded by ∏_t Z_t, so if we minimize ∏_t Z_t, we minimize our training error. We can tighten this bound greedily by choosing α_t and h_t on each iteration to minimize Z_t.
We can minimize this bound by choosing α_t on each iteration to minimize Z_t. For a Boolean target function, this is accomplished by [Freund & Schapire '97]:
  α_t = ½ ln((1 - ε_t)/ε_t)
which gives Z_t = 2 √(ε_t (1 - ε_t)).
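The proof on the original slide was a figure; a standard reconstruction of the same calculation (splitting Z_t over correctly and incorrectly classified points) is:

```latex
\begin{align*}
Z_t &= \sum_i D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}
     = (1-\epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t} \\
\frac{dZ_t}{d\alpha_t} &= -(1-\epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t} = 0
 \;\Longrightarrow\; e^{2\alpha_t} = \frac{1-\epsilon_t}{\epsilon_t}
 \;\Longrightarrow\; \alpha_t = \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t} \\
Z_t &= (1-\epsilon_t)\sqrt{\frac{\epsilon_t}{1-\epsilon_t}}
      + \epsilon_t\sqrt{\frac{1-\epsilon_t}{\epsilon_t}}
     = 2\sqrt{\epsilon_t(1-\epsilon_t)}
\end{align*}
```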
Training error of the final classifier is bounded by:
  (1/m) ∑_{i=1}^m 1(H(x_i) ≠ y_i) ≤ ∏_t Z_t = ∏_t 2 √(ε_t (1 - ε_t)) = ∏_t √(1 - 4 γ_t²) ≤ exp(-2 ∑_t γ_t²)
where γ_t = ½ - ε_t.
If each classifier is (at least slightly) better than random, ε_t < 0.5, then γ_t = ½ - ε_t > 0 and AdaBoost will achieve zero training error exponentially fast (in the number of rounds T)!! γ_t grows as ε_t moves away from 1/2, so the bound shrinks even faster.
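For a rough sense of the rate (the numbers here are illustrative, not from the slides): if every weak learner achieves ε_t = 0.4, then γ_t = 0.1 and the bound gives training error ≤ exp(-2 · T · 0.01); for T = 500 rounds this is e^{-10} ≈ 4.5 × 10^{-5}, i.e. essentially zero on any moderately sized training set.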
What about test error? Empirically, boosting is often:
– Robust to overfitting
– Test set error decreases even after training error is zero
[Schapire, 1989]: test error can keep decreasing even after training error reaches zero, but not always. [Plot: training error and test error vs. number of rounds]
Generalization error bounds [Freund & Schapire '95]: with high probability,
  error_true(H) ≤ error_train(H) + Õ(√(T d / m))
where d is a measure of the complexity (e.g., VC dimension) of the weak learner and m is the number of training examples.
– T small: complexity term small (low variance), but training error can be large (high bias)
– T large: training error small (low bias), but complexity term large (high variance)
→ bias-variance tradeoff
The bound above suggests that boosting can overfit if T is large. But boosting often:
– is robust to overfitting
– keeps decreasing test set error even after training error is zero
The bound's prediction of overfitting contradicts these experimental results, so we need better analysis tools: margin based bounds.
Margin based bounds [Schapire, Freund, Bartlett, Lee '98]: with high probability, for any θ > 0,
  error_true(H) ≤ Pr_train[margin_f(x, y) ≤ θ] + Õ(√(d / (m θ²)))
where margin_f(x, y) = y ∑_t α_t h_t(x) / ∑_t |α_t| is the normalized margin.
Boosting increases the margin very aggressively since it concentrates on the hardest examples. If the margin is large, more weak learners agree, so more rounds do not necessarily mean the final classifier is getting more complex. The bound is independent of the number of rounds T! Boosting can still overfit if the margin is too small, if the weak learners are too complex, or if they perform arbitrarily close to random guessing.
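Continuing the AdaBoost sketch from earlier, the normalized margins can be computed directly from the learned stumps and strengths (again an illustration, assuming the stumps, alphas, X, y variables and stump_predict helper from that sketch):

```python
import numpy as np

def normalized_margins(stumps, alphas, X, y):
    """margin_f(x_i, y_i) = y_i * sum_t alpha_t h_t(x_i) / sum_t |alpha_t|, a value in [-1, 1]."""
    f = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return y * f / np.sum(np.abs(alphas))

# Fraction of training points with margin below a threshold theta, e.g. theta = 0.1:
# margins = normalized_margins(stumps, alphas, X, y)
# print(np.mean(margins <= 0.1))
```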
Boosting results [Freund & Schapire, 1996]: comparison of C4.5 (decision trees) vs. boosting decision stumps (depth-1 trees), and C4.5 vs. boosting C4.5, on 27 benchmark datasets. [Scatter plots of the test error of each pair of methods across the datasets]
[Plots of train and test error vs. rounds on several datasets; boosting overfits on some of them]
Logistic regression assumes:
  P(Y = 1 | X) = 1 / (1 + exp(-f(X))),   f(X) = w_0 + ∑_j w_j X_j
and tries to maximize the (iid) data likelihood:
  ∏_{i=1}^m P(y_i | x_i)
which is equivalent to minimizing the log loss:
  ∑_{i=1}^m ln(1 + exp(-y_i f(x_i)))
Logistic regression is equivalent to minimizing the log loss:
  ∑_{i=1}^m ln(1 + exp(-y_i f(x_i)))
Boosting minimizes a similar loss function, the exp loss:
  (1/m) ∑_{i=1}^m exp(-y_i f(x_i)),   with f(x) = ∑_t α_t h_t(x) a weighted average of weak learners.
Both are smooth approximations of the 0/1 loss. [Plot: 0/1 loss, exp loss, and log loss vs. margin y f(x)]
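To see the three losses side by side, a small NumPy sketch (illustrative only) evaluates them at a few margin values:

```python
import numpy as np

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])    # margin = y * f(x)
zero_one = (margins <= 0).astype(float)             # 0/1 loss
exp_loss = np.exp(-margins)                         # exp loss (boosting)
log_loss = np.log(1 + np.exp(-margins))             # log loss (logistic regression)

for m, z, e, l in zip(margins, zero_one, exp_loss, log_loss):
    print(f"margin={m:+.1f}  0/1={z:.0f}  exp={e:.3f}  log={l:.3f}")
```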
Logistic regression:
– minimize log loss: ∑_i ln(1 + exp(-y_i f(x_i)))
– f(x) = w_0 + ∑_j w_j x_j, where the x_j are predefined features (linear classifier)
– jointly optimize over all weights w_0, w_1, w_2, …
Boosting:
– minimize exp loss: ∑_i exp(-y_i f(x_i))
– f(x) = ∑_t α_t h_t(x), where the h_t(x) are defined dynamically to fit the data (not a linear classifier in the original features)
– weights α_t learned incrementally, one per round
The boosted score f(X) = ∑_t α_t h_t(X) is a weighted average of weak learners.
– Hard decision / predicted label: H(X) = sign(f(X))
– Soft decision: P(Y = 1 | X) = 1 / (1 + exp(-2 f(X))) (based on the analogy with logistic regression)
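A minimal sketch of turning the boosted score into hard and soft decisions (the factor of 2 follows the logistic-regression analogy stated above; the function names are mine):

```python
import numpy as np

def hard_decision(f):
    """Predicted label in {-1, +1} from the boosted score f(X)."""
    return np.sign(f)

def soft_decision(f):
    """P(Y=1|X) estimated from the boosted score, by analogy with logistic regression."""
    return 1.0 / (1.0 + np.exp(-2.0 * f))

print(hard_decision(np.array([-1.3, 0.2])))   # -> [-1.  1.]
print(soft_decision(np.array([-1.3, 0.2])))   # -> [~0.07  ~0.60]
```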
Effect of outliers:
– Good: boosting can identify outliers, since it focuses on examples that are hard to categorize
– Bad: too many outliers can degrade classification performance and dramatically increase the time to convergence
Bagging [Breiman, 1996] is a related approach to combining classifiers:
– run independent weak learners on bootstrap replicates (samples drawn with replacement) of the training set
– combine the learned classifiers by (unweighted) voting

Bagging vs. Boosting
– Bagging resamples data points; boosting reweights data points (modifies their distribution)
– Bagging gives each classifier the same weight; boosting weights each classifier by its accuracy
– Bagging gives only variance reduction; boosting reduces both bias and variance – the learning rule becomes more complex with iterations
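For contrast with AdaBoost's reweighting, here is a minimal sketch of bagging by bootstrap resampling (illustrative; make_learner stands in for any weak-learner factory with an sklearn-style fit/predict interface):

```python
import numpy as np

def bag(make_learner, X, y, n_models, seed=0):
    """Train n_models weak learners on bootstrap resamples (with replacement) of (X, y)."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, m, size=m)   # bootstrap replicate: m draws with replacement
        model = make_learner()             # assumed factory for a fresh weak learner
        model.fit(X[idx], y[idx])          # assumed fit/predict interface
        models.append(model)
    return models

def bag_predict(models, X):
    """Unweighted vote: every classifier gets the same weight (unlike boosting)."""
    votes = np.mean([mdl.predict(X) for mdl in models], axis=0)
    return np.sign(votes)
```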
What you need to know:
– Combining weak classifiers can give a very strong classifier
  • Weak classifier: slightly better than random on training data
  • Resulting very strong classifier: can eventually provide zero training error
– Relation between boosting and logistic regression
  • Similar loss functions
  • Single optimization (LR) vs. incrementally improving classification (Boosting)
– Boosted decision stumps
  • Very simple to implement, very effective classifier