SLIDE 1

Theory and Applications of Boosting

Rob Schapire

SLIDE 2

Example: “How May I Help You?”

[Gorin et al.]

  • goal: automatically categorize type of call requested by phone

customer (Collect, CallingCard, PersonToPerson, etc.)

  • yes I’d like to place a collect call long distance

please (Collect)

  • operator I need to make a call but I need to bill

it to my office (ThirdNumber)

  • yes I’d like to place a call on my master card

please (CallingCard)

  • I just called a number in sioux city and I musta

rang the wrong number because I got the wrong party and I would like to have that taken off of my bill (BillingCredit)

  • observation:
  • easy to find “rules of thumb” that are “often” correct
  • e.g.: “IF ‘card’ occurs in utterance

THEN predict ‘CallingCard’ ”

  • hard to find single highly accurate prediction rule
SLIDE 3

The Boosting Approach

  • devise computer program for deriving rough rules of thumb
  • apply procedure to subset of examples
  • obtain rule of thumb
  • apply to 2nd subset of examples
  • obtain 2nd rule of thumb
  • repeat T times
SLIDE 4

Key Details

  • how to choose examples on each round?
  • concentrate on “hardest” examples

(those most often misclassified by previous rules of thumb)

  • how to combine rules of thumb into single prediction rule?
  • take (weighted) majority vote of rules of thumb
SLIDE 5

Boosting

  • boosting = general method of converting rough rules of

thumb into highly accurate prediction rule

  • technically:
  • assume given “weak” learning algorithm that can

consistently find classifiers (“rules of thumb”) at least slightly better than random, say, accuracy ≥ 55% (in two-class setting) [ “weak learning assumption” ]

  • given sufficient data, a boosting algorithm can provably

construct single classifier with very high accuracy, say, 99%

SLIDE 6

Outline of Tutorial

  • basic algorithm and core theory
  • fundamental perspectives
  • practical extensions
  • advanced topics
SLIDE 7

Preamble: Early History

SLIDE 8

Strong and Weak Learnability

  • boosting’s roots are in “PAC” learning model

[Valiant ’84]

  • get random examples from unknown, arbitrary distribution
  • strong PAC learning algorithm:
  • for any distribution

with high probability given polynomially many examples (and polynomial time) can find classifier with arbitrarily small generalization error

  • weak PAC learning algorithm
  • same, but generalization error only needs to be slightly better than random guessing (1/2 − γ)

  • [Kearns & Valiant ’88]:
  • does weak learnability imply strong learnability?
SLIDE 9

If Boosting Possible, Then...

  • can use (fairly) wild guesses to produce highly accurate

predictions

  • if can learn “part way” then can learn “all the way”
  • should be able to improve any learning algorithm
  • for any learning problem:
  • either can always learn with nearly perfect accuracy
  • or there exist cases where cannot learn even slightly

better than random guessing

SLIDE 10

First Boosting Algorithms

  • [Schapire ’89]:
  • first provable boosting algorithm
  • [Freund ’90]:
  • “optimal” algorithm that “boosts by majority”
  • [Drucker, Schapire & Simard ’92]:
  • first experiments using boosting
  • limited by practical drawbacks
  • [Freund & Schapire ’95]:
  • introduced “AdaBoost” algorithm
  • strong practical advantages over previous boosting

algorithms

SLIDE 11

Basic Algorithm and Core Theory

  • introduction to AdaBoost
  • analysis of training error
  • analysis of test error

and the margins theory

  • experiments and applications
SLIDE 12

Basic Algorithm and Core Theory

  • introduction to AdaBoost
  • analysis of training error
  • analysis of test error

and the margins theory

  • experiments and applications
SLIDE 13

A Formal Description of Boosting

  • given training set

(x1, y1), . . . , (xm, ym)

  • yi ∈ {−1, +1} correct label of instance xi ∈ X
  • for t = 1, . . . , T:
  • construct distribution Dt on {1, . . . , m}
  • find weak classifier (“rule of thumb”)

ht : X → {−1, +1} with error εt on Dt: εt = Pr_{i∼Dt}[ ht(xi) ≠ yi ]

  • output final/combined classifier Hfinal
SLIDE 14

AdaBoost

[with Freund]

  • constructing Dt:
  • D1(i) = 1/m
  • given Dt and ht:

      Dt+1(i) = (Dt(i)/Zt) × e^(−αt) if yi = ht(xi),   (Dt(i)/Zt) × e^(αt) if yi ≠ ht(xi)
              = (Dt(i)/Zt) · exp(−αt yi ht(xi))

    where Zt = normalization factor and αt = (1/2) ln( (1 − εt)/εt ) > 0

  • final classifier:

      Hfinal(x) = sign( Σt αt ht(x) )

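The following is a minimal sketch of this procedure in Python, using single-feature decision stumps as the weak learners; the function names (fit_stump, adaboost, predict) and the exhaustive stump search are illustrative choices, not part of the tutorial.

```python
import numpy as np

def fit_stump(X, y, D):
    """Pick the threshold stump with smallest weighted error under distribution D."""
    best = None
    for j in range(X.shape[1]):                  # feature to test
        for thresh in np.unique(X[:, j]):        # candidate threshold
            for sign in (+1, -1):                # direction of the inequality
                pred = np.where(X[:, j] > thresh, sign, -sign)
                err = np.sum(D[pred != y])       # weighted error under D
                if best is None or err < best[0]:
                    best = (err, j, thresh, sign)
    return best                                   # (eps_t, feature, threshold, sign)

def adaboost(X, y, T):
    """y in {-1,+1}; returns list of (alpha_t, stump) pairs."""
    m = len(y)
    D = np.full(m, 1.0 / m)                       # D_1(i) = 1/m
    ensemble = []
    for t in range(T):
        eps, j, thresh, sign = fit_stump(X, y, D)
        eps = max(eps, 1e-12)                     # guard against division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)     # alpha_t = 1/2 ln((1 - eps_t)/eps_t)
        pred = np.where(X[:, j] > thresh, sign, -sign)
        D *= np.exp(-alpha * y * pred)            # D_{t+1}(i) ∝ D_t(i) exp(-alpha_t y_i h_t(x_i))
        D /= D.sum()                              # Z_t = normalization factor
        ensemble.append((alpha, (j, thresh, sign)))
    return ensemble

def predict(ensemble, X):
    """H_final(x) = sign( sum_t alpha_t h_t(x) )."""
    F = np.zeros(len(X))
    for alpha, (j, thresh, sign) in ensemble:
        F += alpha * np.where(X[:, j] > thresh, sign, -sign)
    return np.sign(F)
```

The stump learner here is deliberately crude; any routine returning a classifier slightly better than random could be plugged into the same loop.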
SLIDE 15

Toy Example

D1

weak classifiers = vertical or horizontal half-planes

SLIDE 16

Round 1

  • h1: ε1 = 0.30, α1 = 0.42 → D2

SLIDE 17

Round 2

  • h2: ε2 = 0.21, α2 = 0.65 → D3

SLIDE 18

Round 3

  • h3: ε3 = 0.14, α3 = 0.92

SLIDE 19

Final Classifier

  • Hfinal = sign( 0.42 h1 + 0.65 h2 + 0.92 h3 )

SLIDE 20

Basic Algorithm and Core Theory

  • introduction to AdaBoost
  • analysis of training error
  • analysis of test error

and the margins theory

  • experiments and applications
SLIDE 21

Analyzing the Training Error

[with Freund]

  • Theorem:
  • write εt as 1/2 − γt   [ γt = “edge” ]
  • then

      training error(Hfinal) ≤ Πt [ 2√(εt(1 − εt)) ] = Πt √(1 − 4γt²) ≤ exp( −2 Σt γt² )

  • so: if ∀t : γt ≥ γ > 0, then training error(Hfinal) ≤ e^(−2γ²T)

  • AdaBoost is adaptive:
  • does not need to know γ or T a priori
  • can exploit γt ≫ γ
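To get a feel for the bound, a quick illustrative calculation (the value γ = 0.05 is an assumption, not from the slides): with a uniform edge of 0.05 the bound exp(−2γ²T) drops below 10⁻³ after roughly 1400 rounds, and once it falls below 1/m the training error, which is a multiple of 1/m, must be exactly zero.

```python
import math

def training_error_bound(gamma, T):
    # bound from the theorem: training error(Hfinal) <= exp(-2 * gamma^2 * T)
    return math.exp(-2 * gamma ** 2 * T)

for T in (100, 500, 1000, 1500):
    print(T, training_error_bound(0.05, T))   # gamma = 0.05 is an assumed, illustrative edge
```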
SLIDE 22

Proof

  • let F(x) = Σt αt ht(x)  ⇒  Hfinal(x) = sign(F(x))
  • Step 1: unwrapping recurrence:

      Dfinal(i) = (1/m) · exp( −yi Σt αt ht(xi) ) / Πt Zt
                = (1/m) · exp( −yi F(xi) ) / Πt Zt

SLIDE 23

Proof (cont.)

  • Step 2: training error(Hfinal) ≤ Πt Zt
  • Proof:

      training error(Hfinal) = (1/m) Σi [ 1 if yi ≠ Hfinal(xi), 0 else ]
                             = (1/m) Σi [ 1 if yi F(xi) ≤ 0, 0 else ]
                             ≤ (1/m) Σi exp( −yi F(xi) )
                             = Σi Dfinal(i) · Πt Zt
                             = Πt Zt

SLIDE 24

Proof (cont.)

  • Step 3: Zt = 2√(εt(1 − εt))
  • Proof:

      Zt = Σi Dt(i) exp(−αt yi ht(xi))
         = Σ_{i: yi ≠ ht(xi)} Dt(i) e^(αt)  +  Σ_{i: yi = ht(xi)} Dt(i) e^(−αt)
         = εt e^(αt) + (1 − εt) e^(−αt)
         = 2√(εt(1 − εt))
SLIDE 25

Basic Algorithm and Core Theory

  • introduction to AdaBoost
  • analysis of training error
  • analysis of test error

and the margins theory

  • experiments and applications
SLIDE 26

How Will Test Error Behave? (A First Guess)

[figure: expected train and test error vs. # of rounds T]

expect:

  • training error to continue to drop (or reach zero)
  • test error to increase when Hfinal becomes “too complex”
  • “Occam’s razor”
  • overfitting
  • hard to know when to stop training
SLIDE 27

Technically...

  • with high probability:

      generalization error ≤ training error + Õ( √(dT/m) )

  • bound depends on
  • m = # training examples
  • d = “complexity” of weak classifiers
  • T = # rounds
  • generalization error = E [test error]
  • predicts overfitting
SLIDE 28

Overfitting Can Happen

[figure: train and test error (%) vs. # of rounds]

(boosting “stumps” on heart-disease dataset)

  • but often doesn’t...
SLIDE 29

Actual Typical Run

[figure: train and test error (%) vs. # of rounds T]

(boosting C4.5 on “letter” dataset)

  • test error does not increase, even after 1000 rounds
  • (total size > 2,000,000 nodes)
  • test error continues to drop even after training error is zero!

      # rounds      5     100    1000
      train error   0.0   0.0    0.0
      test error    8.4   3.3    3.1

  • Occam’s razor wrongly predicts “simpler” rule is better
SLIDE 30

A Better Story: The Margins Explanation

[with Freund, Bartlett & Lee]

  • key idea:
  • training error only measures whether classifications are

right or wrong

  • should also consider confidence of classifications
  • recall: Hfinal is weighted majority vote of weak classifiers
  • measure confidence by margin = strength of the vote

= (weighted fraction voting correctly) −(weighted fraction voting incorrectly)

[figure: margin scale from −1 to +1 — incorrect with high confidence, low confidence near 0, correct with high confidence]

SLIDE 31

Empirical Evidence: The Margin Distribution

  • margin distribution

= cumulative distribution of margins of training examples

[figures: train/test error vs. # of rounds T; cumulative distribution of margins after 5, 100, and 1000 rounds]

      # rounds           5      100    1000
      train error        0.0    0.0    0.0
      test error         8.4    3.3    3.1
      % margins ≤ 0.5    7.7    0.0    0.0
      minimum margin     0.14   0.52   0.55

SLIDE 32

Theoretical Evidence: Analyzing Boosting Using Margins

  • Theorem: large margins ⇒ better bound on generalization

error (independent of number of rounds)

  • proof idea: if all margins are large, then can approximate

final classifier by a much smaller classifier (just as polls can predict not-too-close election)

  • Theorem: boosting tends to increase margins of training

examples (given weak learning assumption)

  • moreover, larger edges ⇒ larger margins
  • proof idea: similar to training error proof
  • so:

although final classifier is getting larger, margins are likely to be increasing, so final classifier actually getting close to a simpler classifier, driving down the test error

SLIDE 33

More Technically...

  • with high probability, ∀θ > 0 :

      generalization error ≤ ˆPr[margin ≤ θ] + Õ( (1/θ) · √(d/m) )

    (ˆPr[ ] = empirical probability)

  • bound depends on
  • m = # training examples
  • d = “complexity” of weak classifiers
  • entire distribution of margins of training examples
  • ˆPr[margin ≤ θ] → 0 exponentially fast (in T) if εt < 1/2 − θ (∀t)
  • so: if weak learning assumption holds, then all examples will quickly have “large” margins

SLIDE 34

Consequences of Margins Theory

  • predicts good generalization with no overfitting if:
  • weak classifiers have large edges (implying large margins)
  • weak classifiers not too complex relative to size of

training set

  • e.g., boosting decision trees resistant to overfitting since trees often have large edges and limited complexity
  • overfitting may occur if:
  • small edges (underfitting), or
  • overly complex weak classifiers
  • e.g., heart-disease dataset:
  • stumps yield small edges
  • also, small dataset
SLIDE 35

Improved Boosting with Better Margin-Maximization?

  • can design algorithms more effective than AdaBoost at

maximizing the minimum margin

  • in practice, often perform worse

[Breiman]

  • why??
  • more aggressive margin maximization seems to lead to:
  • more complex weak classifiers

(even using same weak learner); or

  • higher minimum margins,

but margin distributions that are lower overall

[with Reyzin]

SLIDE 36

Comparison to SVM’s

  • both AdaBoost and SVM’s:
  • work by maximizing “margins”
  • find linear threshold function in high-dimensional space
  • differences:
  • margin measured slightly differently

(using different norms)

  • SVM’s handle high-dimensional space using kernel trick;

AdaBoost uses weak learner to search over space

  • SVM’s maximize minimum margin;

AdaBoost maximizes margin distribution in a more diffuse sense

SLIDE 37

Basic Algorithm and Core Theory

  • introduction to AdaBoost
  • analysis of training error
  • analysis of test error

and the margins theory

  • experiments and applications
SLIDE 38

Practical Advantages of AdaBoost

  • fast
  • simple and easy to program
  • no parameters to tune (except T)
  • flexible — can combine with any learning algorithm
  • no prior knowledge needed about weak learner
  • provably effective, provided can consistently find rough rules of thumb

→ shift in mind set — goal now is merely to find classifiers barely better than random guessing

  • versatile
  • can use with data that is textual, numeric, discrete, etc.
  • has been extended to learning problems well beyond

binary classification

SLIDE 39

Caveats

  • performance of AdaBoost depends on data and weak learner
  • consistent with theory, AdaBoost can fail if
  • weak classifiers too complex

→ overfitting

  • weak classifiers too weak (γt → 0 too quickly)

→ underfitting → low margins → overfitting

  • empirically, AdaBoost seems especially susceptible to uniform

noise

SLIDE 40

UCI Experiments

[with Freund]

  • tested AdaBoost on UCI benchmarks
  • used:
  • C4.5 (Quinlan’s decision tree algorithm)
  • “decision stumps”: very simple rules of thumb that test on single attributes

    [figures: two example stumps — “height > 5 feet?” (yes: predict +1, no: predict −1) and “eye color = brown?” (yes: predict +1, no: predict −1)]

SLIDE 41

UCI Results

[figures: scatterplots of test error (%) on UCI benchmarks — boosting stumps vs. C4.5, and boosting C4.5 vs. C4.5]

SLIDE 42

Application: Detecting Faces

[Viola & Jones]

  • problem: find faces in photograph or movie
  • weak classifiers: detect light/dark rectangles in image
  • many clever tricks to make extremely fast and accurate
SLIDE 43

Application: Human-Computer Spoken Dialogue

[with Rahim, Di Fabbrizio, Dutton, Gupta, Hollister & Riccardi]

  • application: automatic “store front” or “help desk” for AT&T

Labs’ Natural Voices business

  • caller can request demo, pricing information, technical

support, sales agent, etc.

  • interactive dialogue
SLIDE 44

How It Works

[diagram: human utterance (raw speech) → speech recognizer → text → natural language understanding → predicted category → dialogue manager → text response → automatic text-to-speech → computer speech]

  • NLU’s job: classify caller utterances into 24 categories

(demo, sales rep, pricing info, yes, no, etc.)

  • weak classifiers: test for presence of word or phrase
SLIDE 45

Problem: Labels are Expensive

  • for spoken-dialogue task
  • getting examples is cheap
  • getting labels is expensive
  • must be annotated by humans
  • how to reduce number of labels needed?
SLIDE 46

Active Learning

[with Tur & Hakkani-Tür]

  • idea:
  • use selective sampling to choose which examples to label
  • focus on least confident examples

[Lewis & Gale]

  • for boosting, use (absolute) margin as natural confidence

measure

[Abe & Mamitsuka]

SLIDE 47

Labeling Scheme

  • start with pool of unlabeled examples
  • choose (say) 500 examples at random for labeling
  • run boosting on all labeled examples
  • get combined classifier F
  • pick (say) 250 additional examples from pool for labeling
  • choose examples with minimum |F(x)|

(proportional to absolute margin)

  • repeat
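A small sketch of the selection step in this scheme, assuming the boosted combination F has already been evaluated on the unlabeled pool; the helper name select_for_labeling and the default batch size are illustrative.

```python
import numpy as np

def select_for_labeling(F_pool, k=250):
    """Return indices of the k pool examples with smallest |F(x)|,
    i.e. the examples the current combined classifier is least confident about."""
    return np.argsort(np.abs(F_pool))[:k]
```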
SLIDE 48

Results: How-May-I-Help-You?

[figure: % error rate vs. # labeled examples, random vs. active sampling]

      % error first reached    random     active     % label savings
      28                       11,000     5,500      50
      26                       22,000     9,500      57
      25                       40,000     13,000     68

SLIDE 49

Results: Letter

[figure: % error rate vs. # labeled examples, random vs. active sampling]

      % error first reached    random     active     % label savings
      10                       3,500      1,500      57
      5                        9,000      2,750      69
      4                        13,000     3,500      73

SLIDE 50

Fundamental Perspectives

  • game theory
  • loss minimization
  • an information-geometric view
SLIDE 51

Fundamental Perspectives

  • game theory
  • loss minimization
  • an information-geometric view
SLIDE 52

Just a Game

[with Freund]

  • can view boosting as a game, a formal interaction between

booster and weak learner

  • on each round t:
  • booster chooses distribution Dt
  • weak learner responds with weak classifier ht
  • game theory: studies interactions between all sorts of

“players”

SLIDE 53

Games

  • game defined by matrix M:

                  Rock    Paper   Scissors
      Rock        1/2     1       0
      Paper       0       1/2     1
      Scissors    1       0       1/2

  • row player (“Mindy”) chooses row i
  • column player (“Max”) chooses column j (simultaneously)
  • Mindy’s goal: minimize her loss M(i, j)
  • assume (wlog) all entries in [0, 1]
SLIDE 54

Randomized Play

  • usually allow randomized play:
  • Mindy chooses distribution P over rows
  • Max chooses distribution Q over columns

(simultaneously)

  • Mindy’s (expected) loss = Σ_{i,j} P(i) M(i, j) Q(j) = P⊤MQ ≡ M(P, Q)

  • i, j = “pure” strategies
  • P, Q = “mixed” strategies
  • m = # rows of M
  • also write M(i, Q) and M(P, j) when one side plays pure and other plays mixed
SLIDE 55

Sequential Play

  • say Mindy plays before Max
  • if Mindy chooses P then Max will pick Q to maximize M(P, Q)
    ⇒ loss will be L(P) ≡ max_Q M(P, Q)
  • so Mindy should pick P to minimize L(P)
    ⇒ loss will be min_P L(P) = min_P max_Q M(P, Q)
  • similarly, if Max plays first, loss will be max_Q min_P M(P, Q)

SLIDE 56

Minmax Theorem

  • playing second (with knowledge of other player’s move) cannot be worse than playing first, so:

      min_P max_Q M(P, Q)   [Mindy plays first]   ≥   max_Q min_P M(P, Q)   [Mindy plays second]

  • von Neumann’s minmax theorem:

      min_P max_Q M(P, Q) = max_Q min_P M(P, Q)

  • in words: no advantage to playing second
SLIDE 57

Optimal Play

  • minmax theorem:

      min_P max_Q M(P, Q) = max_Q min_P M(P, Q) = value v of game

  • optimal strategies:
  • P∗ = arg minP maxQ M(P, Q) = minmax strategy
  • Q∗ = arg maxQ minP M(P, Q) = maxmin strategy
  • in words:
  • Mindy’s minmax strategy P∗ guarantees loss ≤ v

(regardless of Max’s play)

  • optimal because Max has maxmin strategy Q∗ that can

force loss ≥ v (regardless of Mindy’s play)

  • e.g.: in RPS, P∗ = Q∗ = uniform
  • solving game = finding minmax/maxmin strategies
SLIDE 58

Weaknesses of Classical Theory

  • seems to fully answer how to play games — just compute

minmax strategy (e.g., using linear programming)

  • weaknesses:
  • game M may be unknown
  • game M may be extremely large
  • opponent may not be fully adversarial
  • may be possible to do better than value v
  • e.g.:

Lisa (thinks): Poor predictable Bart, always takes Rock. Bart (thinks): Good old Rock, nothing beats that.

SLIDE 59

Repeated Play

  • if only playing once, hopeless to overcome ignorance of game

M or opponent

  • but if game played repeatedly, may be possible to learn to

play well

  • goal: play (almost) as well as if knew game and how opponent would play ahead of time
SLIDE 60

Repeated Play (cont.)

  • M unknown
  • for t = 1, . . . , T:
  • Mindy chooses Pt
  • Max chooses Qt (possibly depending on Pt)
  • Mindy’s loss = M(Pt, Qt)
  • Mindy observes loss M(i, Qt) of each pure strategy i
  • want:

      (1/T) Σ_{t=1}^T M(Pt, Qt)   [actual average loss]
        ≤  min_P (1/T) Σ_{t=1}^T M(P, Qt)   [best loss (in hindsight)]   +   [“small amount”]

SLIDE 61

Multiplicative-Weights Algorithm (MW)

[with Freund]

  • choose η > 0
  • initialize: P1 = uniform
  • on round t:

Pt+1(i) = Pt(i) exp(−η M(i, Qt)) / normalization

  • idea: decrease weight of strategies suffering the most loss
  • directly generalizes [Littlestone & Warmuth]
  • other algorithms:
  • [Hannan’57]
  • [Blackwell’56]
  • [Foster & Vohra]
  • [Fudenberg & Levine]

. . .

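A compact sketch of the MW update, played here against a best-response opponent on the rock-paper-scissors loss matrix from the earlier slide; the step size η, number of rounds, and function name are arbitrary illustrative choices.

```python
import numpy as np

def mw_average_play(M, T=5000, eta=0.1):
    """Row player runs MW against a best-response column player; returns average strategy."""
    m, _ = M.shape
    P = np.full(m, 1.0 / m)                 # P_1 = uniform
    P_sum = np.zeros(m)
    for _ in range(T):
        j = np.argmax(P @ M)                # opponent's best response (a pure column)
        P_sum += P
        P = P * np.exp(-eta * M[:, j])      # P_{t+1}(i) ∝ P_t(i) exp(−η M(i, Q_t))
        P /= P.sum()                        # normalization
    return P_sum / T

# loss matrix for the row player: 0 = win, 1/2 = tie, 1 = loss
M = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])
print(mw_average_play(M))                   # should be close to the uniform minmax strategy
```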
SLIDE 62

Analysis

  • Theorem: can choose η so that, for any game M with m rows, and any opponent,

      (1/T) Σ_{t=1}^T M(Pt, Qt)   [actual average loss]
        ≤  min_P (1/T) Σ_{t=1}^T M(P, Qt)   [best average loss (≤ v)]   +   ∆T

    where ∆T = O( √((ln m)/T) ) → 0
  • regret ∆T is:
  • logarithmic in # rows m
  • independent of # columns
  • therefore, can use when working with very large games
SLIDE 63

Solving a Game

[with Freund]

  • suppose game M played repeatedly
  • Mindy plays using MW
  • on round t, Max chooses best response:

      Qt = arg max_Q M(Pt, Q)

  • let

      P̄ = (1/T) Σ_{t=1}^T Pt,   Q̄ = (1/T) Σ_{t=1}^T Qt

  • can prove that P̄ and Q̄ are ∆T-approximate minmax and maxmin strategies:

      max_Q M(P̄, Q) ≤ v + ∆T   and   min_P M(P, Q̄) ≥ v − ∆T

SLIDE 64

Boosting as a Game

  • Mindy (row player) ↔ booster
  • Max (column player) ↔ weak learner
  • matrix M:
  • row ↔ training example
  • column ↔ weak classifier
  • M(i, j) = 1 if j-th weak classifier correct on i-th training example, 0 else

  • encodes which weak classifiers correct on which examples
  • huge # of columns — one for every possible weak

classifier

SLIDE 65

Boosting and the Minmax Theorem

  • γ-weak learning assumption:
  • for every distribution on examples
  • can find weak classifier with weighted error ≤ 1/2 − γ
  • equivalent to:

      (value of game M) ≥ 1/2 + γ

  • by minmax theorem, implies that:
  • ∃ some weighted majority classifier that correctly

classifies all training examples with margin ≥ 2γ

  • further, weights are given by maxmin strategy of game M
SLIDE 66

Idea for Boosting

  • maxmin strategy of M has perfect (training) accuracy and

large margins

  • find approximately using earlier algorithm for solving a game
  • i.e., apply MW to M
  • yields (variant of) AdaBoost
SLIDE 67

AdaBoost and Game Theory

  • summarizing:
  • weak learning assumption implies maxmin strategy for M

defines large-margin classifier

  • AdaBoost finds maxmin strategy by applying general

algorithm for solving games through repeated play

  • consequences:
  • weights on weak classifiers converge to

(approximately) maxmin strategy for game M

  • (average) of distributions Dt converges to

(approximately) minmax strategy

  • margins and edges connected via minmax theorem
  • explains why AdaBoost maximizes margins
  • different instantiation of game-playing algorithm gives online

learning algorithms (such as weighted majority algorithm)

SLIDE 68

Fundamental Perspectives

  • game theory
  • loss minimization
  • an information-geometric view
SLIDE 69

AdaBoost and Loss Minimization

  • many (most?) learning and statistical methods can be viewed

as minimizing loss (a.k.a. cost or objective) function measuring fit to data:

  • e.g. least squares regression: Σi (F(xi) − yi)²

  • AdaBoost also minimizes a loss function
  • helpful to understand because:
  • clarifies goal of algorithm and useful in proving

convergence properties

  • decoupling of algorithm from its objective means:
  • faster algorithms possible for same objective
  • same algorithm may generalize for new learning

challenges

SLIDE 70

What AdaBoost Minimizes

  • recall proof of training error bound:
  • training error(Hfinal) ≤ Πt Zt
  • Zt = εt e^(αt) + (1 − εt) e^(−αt) = 2√(εt(1 − εt))
  • closer look:
  • αt chosen to minimize Zt
  • ht chosen to minimize εt
  • same as minimizing Zt (since increasing in εt on [0, 1/2])
  • so: both AdaBoost and weak learner minimize Zt on round t
  • equivalent to greedily minimizing Πt Zt

SLIDE 71

AdaBoost and Exponential Loss

  • so AdaBoost is greedy procedure for minimizing exponential loss

      Πt Zt = (1/m) Σi exp(−yi F(xi))   where F(x) = Σt αt ht(x)

  • why exponential loss?
  • intuitively, strongly favors F(xi) to have same sign as yi
  • upper bound on training error
  • smooth and convex (but very loose)
  • how does AdaBoost minimize it?
SLIDE 72

Coordinate Descent

[Breiman]

  • {g1, . . . , gN} = space of all weak classifiers
  • then can write F(x) = Σt αt ht(x) = Σ_{j=1}^N λj gj(x)
  • want to find λ1, . . . , λN to minimize

      L(λ1, . . . , λN) = Σi exp( −yi Σj λj gj(xi) )

  • AdaBoost is actually doing coordinate descent on this optimization problem:
  • initially, all λj = 0
  • each round: choose one coordinate λj (corresponding to

ht) and update (increment by αt)

  • choose update causing biggest decrease in loss
  • powerful technique for minimizing over huge space of

functions

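A minimal sketch of this coordinate-descent view, under the assumption that the weak-classifier dictionary is finite and given as a matrix G with G[i, j] = gj(xi) ∈ {−1, +1}; the greedy coordinate choice and step size follow the AdaBoost formulas above.

```python
import numpy as np

def coordinate_descent_exp_loss(G, y, T):
    """Greedy coordinate descent on L(lambda) = sum_i exp(-y_i * sum_j lambda_j g_j(x_i))."""
    m, N = G.shape
    lam = np.zeros(N)
    for _ in range(T):
        d = np.exp(-y * (G @ lam))                     # unnormalized example weights
        D = d / d.sum()                                # distribution D_t over examples
        errs = np.array([D[G[:, j] != y].sum() for j in range(N)])
        j = np.argmin(np.minimum(errs, 1.0 - errs))    # coordinate with error farthest from 1/2
        eps = np.clip(errs[j], 1e-12, 1 - 1e-12)
        lam[j] += 0.5 * np.log((1 - eps) / eps)        # step giving the biggest decrease in loss
    return lam
```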
SLIDE 73

Functional Gradient Descent

[Mason et al.][Friedman]

  • want to minimize

      L(F) = L(F(x1), . . . , F(xm)) = Σi exp(−yi F(xi))

  • say have current estimate F and want to improve
  • to do gradient descent, would like update

F ← F − α∇FL(F)

  • but update restricted in class of weak classifiers

F ← F + αht

  • so choose ht “closest” to −∇FL(F)
  • equivalent to AdaBoost
SLIDE 74

Estimating Conditional Probabilities

[Friedman, Hastie & Tibshirani]

  • often want to estimate probability that y = +1 given x
  • AdaBoost minimizes (empirical version of):

      E_{x,y}[ e^(−y F(x)) ] = E_x[ Pr[y = +1|x] e^(−F(x)) + Pr[y = −1|x] e^(F(x)) ]

    where x, y random from true distribution

  • over all F, minimized when

      F(x) = (1/2) · ln( Pr[y = +1|x] / Pr[y = −1|x] ),   or   Pr[y = +1|x] = 1 / (1 + e^(−2F(x)))

  • so, to convert F output by AdaBoost to probability estimate,

use same formula

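In code, the conversion described here is a single line (a sketch; F stands for whatever real-valued score Σt αt ht(x) the boosted classifier outputs):

```python
import numpy as np

def prob_positive(F):
    """Estimate Pr[y = +1 | x] from the boosted score F(x)."""
    return 1.0 / (1.0 + np.exp(-2.0 * F))
```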
SLIDE 75

Calibration Curve

[figure: calibration curve — observed probability vs. predicted probability]

  • order examples by F value output by AdaBoost
  • break into bins of fixed size
  • for each bin, plot a point:
  • x-value: average estimated probability of examples in bin
  • y-value: actual fraction of positive examples in bin
SLIDE 76

A Synthetic Example

  • x ∈ [−2, +2] uniform
  • Pr[y = +1|x] = 2^(−x²)
  • m = 500 training examples

[figures: true conditional probability and AdaBoost’s estimated probability on x ∈ [−2, +2]]

  • if run AdaBoost with stumps and convert to probabilities,

result is poor

  • extreme overfitting
SLIDE 77

Regularization

  • AdaBoost minimizes

      L(λ) = Σi exp( −yi Σj λj gj(xi) )

  • to avoid overfitting, want to constrain λ to make solution “smoother”
  • (ℓ1) regularization:

      minimize: L(λ)   subject to: ‖λ‖₁ ≤ B

  • or:

      minimize: L(λ) + β‖λ‖₁

  • other norms possible
  • ℓ1 (“lasso”) currently popular since encourages sparsity

[Tibshirani]

SLIDE 78

Regularization Example

[figures: regularized probability estimates on the synthetic example for β = 10⁻³, 10⁻²·⁵, 10⁻², 10⁻¹·⁵, 10⁻¹, 10⁻⁰·⁵]

SLIDE 79

Regularization and AdaBoost

[Hastie, Tibshirani & Friedman; Rosset, Zhu & Hastie]

[figures: trajectories of individual classifier weights]

  • Experiment 1: regularized solution vectors λ plotted as function of B
  • Experiment 2: AdaBoost run with αt fixed to (small) α; solution vectors λ plotted as function of αT
  • plots are identical!
  • can prove under certain (but not all) conditions that results

will be the same (as α → 0)

[Zhao & Yu]

SLIDE 80

Regularization and AdaBoost

  • suggests stopping AdaBoost early is akin to applying

ℓ1-regularization

  • caveats:
  • does not strictly apply to AdaBoost (only variant)
  • not helpful when boosting run “to convergence”

(would correspond to very weak regularization)

  • in fact, in limit of vanishingly weak regularization (B → ∞),

solution converges to maximum margin solution

[Rosset, Zhu & Hastie]

SLIDE 81

Benefits of Loss-Minimization View

  • immediate generalization to other loss functions and learning

problems

  • e.g. squared error for regression
  • e.g. logistic regression

(by only changing one line of AdaBoost)

  • sensible approach for converting output of boosting into

conditional probability estimates

  • helpful connection to regularization
  • basis for proving AdaBoost is statistically “consistent”
  • i.e., under right assumptions, converges to best possible

classifier

[Bartlett & Traskin]

SLIDE 82

A Note of Caution

  • tempting (but incorrect!) to conclude:
  • AdaBoost is just an algorithm for minimizing exponential

loss

  • AdaBoost works only because of its loss function

∴ more powerful optimization techniques for same loss should work even better

  • incorrect because:
  • other algorithms that minimize exponential loss can give

very poor generalization performance compared to AdaBoost

  • for example...
SLIDE 83

An Experiment

  • data:
  • instances x uniform from {−1, +1}^10,000
  • label y = majority vote of three coordinates
  • weak classifier = single coordinate (or its negation)
  • training set size m = 1000
  • algorithms (all provably minimize exponential loss):
  • standard AdaBoost
  • gradient descent on exponential loss
  • AdaBoost, but in which weak classifiers chosen at random
  • results (% test error [# rounds] at given exponential loss):

      exp. loss    stand. AdaB.    grad. desc.    random AdaB.
      10^−10       0.0 [94]        40.7 [5]       44.0 [24,464]
      10^−20       0.0 [190]       40.8 [9]       41.6 [47,534]
      10^−40       0.0 [382]       40.8 [21]      40.9 [94,479]
      10^−100      0.0 [956]       40.8 [70]      40.3 [234,654]

SLIDE 84

An Experiment (cont.)

  • conclusions:
  • not just what is being minimized that matters,

but how it is being minimized

  • loss-minimization view has benefits and is fundamental to

understanding AdaBoost

  • but is limited in what it says about generalization
  • results are consistent with margins theory

[figure: cumulative margin distributions for standard AdaBoost, gradient descent, and random AdaBoost]
SLIDE 85

Fundamental Perspectives

  • game theory
  • loss minimization
  • an information-geometric view
SLIDE 86

A Dual Information-Geometric Perspective

  • loss minimization focuses on function computed by AdaBoost

(i.e., weights on weak classifiers)

  • dual view: instead focus on distributions Dt

(i.e., weights on examples)

  • dual perspective combines geometry and information theory
  • exposes underlying mathematical structure
  • basis for proving convergence
SLIDE 87

An Iterative-Projection Algorithm

  • say want to find point closest to x0 in set

P = { intersection of N hyperplanes }

  • algorithm:

[Bregman; Censor & Zenios]

  • start at x0
  • repeat: pick a hyperplane and project onto it

  • if P ≠ ∅, under general conditions, will converge correctly
SLIDE 88

AdaBoost is an Iterative-Projection Algorithm

[Kivinen & Warmuth]

  • points = distributions Dt over training examples
  • distance = relative entropy:

      RE(P ‖ Q) = Σi P(i) ln( P(i)/Q(i) )

  • reference point x0 = uniform distribution
  • hyperplanes defined by all possible weak classifiers gj:

      Σi D(i) yi gj(xi) = 0  ⇔  Pr_{i∼D}[ gj(xi) = yi ] = 1/2

  • intuition: looking for “hardest” distribution
SLIDE 89

AdaBoost as Iterative Projection (cont.)

  • algorithm:
  • start at D1 = uniform
  • for t = 1, 2, . . .:
  • pick hyperplane/weak classifier ht ↔ gj
  • Dt+1 = (entropy) projection of Dt onto hyperplane
         = arg min_{D : Σi D(i) yi gj(xi) = 0} RE(D ‖ Dt)

  • claim: equivalent to AdaBoost
  • further: choosing ht with minimum error ≡ choosing farthest

hyperplane
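A small numerical sketch of one projection step, under the assumption that gj takes values in {−1, +1}: the projection has the exponential form Dt(i) exp(−α yi gj(xi)) / Z, and solving the constraint for α recovers the familiar αt = ½ ln((1 − εt)/εt), after which gj has weighted error exactly 1/2.

```python
import numpy as np

def entropy_project(D, margins):
    """Project distribution D onto {D' : sum_i D'(i)*margins[i] = 0}, margins[i] = y_i g_j(x_i) in {-1,+1}."""
    eps = np.clip(D[margins == -1].sum(), 1e-12, 1 - 1e-12)   # weighted error of g_j under D
    alpha = 0.5 * np.log((1 - eps) / eps)                      # closed form for ±1-valued g_j
    D_new = D * np.exp(-alpha * margins)
    return D_new / D_new.sum()

rng = np.random.default_rng(0)
D = rng.random(10); D /= D.sum()
margins = rng.choice([-1, 1], size=10)
D2 = entropy_project(D, margins)
print(D2[margins == -1].sum())          # ≈ 0.5: the constraint (error exactly 1/2) now holds
```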

SLIDE 90

Boosting as Maximum Entropy

  • corresponding optimization problem:

      min_{D∈P} RE(D ‖ uniform)  ↔  max_{D∈P} entropy(D)

  • where

      P = feasible set = { D : Σi D(i) yi gj(xi) = 0 ∀j }

  • P ≠ ∅ ⇔ weak learning assumption does not hold
  • in this case, Dt → (unique) solution
  • if weak learning assumption does hold then
  • P = ∅
  • Dt can never converge
  • dynamics are fascinating but unclear in this case
slide-91
SLIDE 91

Visualizing Dynamics Visualizing Dynamics Visualizing Dynamics Visualizing Dynamics Visualizing Dynamics

[with Rudin & Daubechies]

  • plot one circle for each round t:
  • center at (Dt(1), Dt(2))
  • radius ∝ t (color also varies with t)

[figure: trajectory of (Dt(1), Dt(2)) over rounds t = 1, . . . , 6]

  • in all cases examined, appears to converge eventually to cycle
  • open if always true
SLIDE 92

More Examples

[figure: trajectory of (Dt(1), Dt(2)) over rounds on another example]

SLIDE 93

More Examples

[figure: trajectory of two coordinates of Dt over rounds on another example]

SLIDE 94

More Examples

[figure: trajectory of two coordinates of Dt over rounds on another example]

SLIDE 95

More Examples

[figure: trajectory of two coordinates of Dt over rounds on another example]

SLIDE 96

Unifying the Two Cases

[with Collins & Singer]

  • two distinct cases:
  • weak learning assumption holds
  • P = ∅
  • dynamics unclear
  • weak learning assumption does not hold
  • P ≠ ∅
  • can prove convergence of Dt’s
  • to unify: work instead with unnormalized versions of Dt’s
  • standard AdaBoost: Dt+1(i) = Dt(i) exp(−αt yi ht(xi)) / normalization
  • instead:

      dt+1(i) = dt(i) exp(−αt yi ht(xi)),   Dt+1(i) = dt+1(i) / normalization

  • algorithm is unchanged
SLIDE 97

Reformulating AdaBoost as Iterative Projection

  • points = nonnegative vectors dt
  • distance = unnormalized relative entropy:

      RE(p ‖ q) = Σi [ p(i) ln( p(i)/q(i) ) + q(i) − p(i) ]

  • reference point x0 = 1 (all 1’s vector)
  • hyperplanes defined by weak classifiers gj:

      Σi d(i) yi gj(xi) = 0

  • resulting iterative-projection algorithm is again equivalent to

AdaBoost

SLIDE 98

Reformulated Optimization Problem

  • optimization problem:

      min_{d∈P} RE(d ‖ 1)

  • where

      P = { d : Σi d(i) yi gj(xi) = 0 ∀j }

  • note: feasible set P never empty (since 0 ∈ P)
SLIDE 99

Exponential Loss as Entropy Optimization

  • all vectors dt created by AdaBoost have form:

      d(i) = exp( −yi Σj λj gj(xi) )

  • let Q = { all vectors d of this form }
  • can rewrite exponential loss:

      inf_λ Σi exp( −yi Σj λj gj(xi) ) = inf_{d∈Q} Σi d(i) = min_{d∈Q̄} Σi d(i) = min_{d∈Q̄} RE(0 ‖ d)

  • Q̄ = closure of Q
SLIDE 100

Duality

[Della Pietra, Della Pietra & Lafferty]

  • presented two optimization problems:

      min_{d∈P} RE(d ‖ 1)    and    min_{d∈Q̄} RE(0 ‖ d)

  • which is AdaBoost solving? Both!
  • problems have same solution
  • moreover: solution given by unique point in P ∩ Q̄
  • problems are convex duals of each other
SLIDE 101

Convergence of AdaBoost

  • can use to prove AdaBoost converges to common solution of

both problems:

  • can argue that d∗ = lim dt is in P
  • vectors dt are in Q always ⇒ d∗ ∈ Q

∴ d∗ ∈ P ∩ Q ∴ d∗ solves both optimization problems

  • so:
  • AdaBoost minimizes exponential loss
  • exactly characterizes limit of unnormalized “distributions”
  • likewise for normalized distributions when weak learning

assumption does not hold

  • also, provides additional link to logistic regression
  • only need slight change in optimization problem

[with Collins & Singer; Lebanon & Lafferty]

SLIDE 102

Practical Extensions

  • multiclass classification
  • ranking problems
  • confidence-rated predictions
SLIDE 103

Practical Extensions

  • multiclass classification
  • ranking problems
  • confidence-rated predictions
SLIDE 104

Multiclass Problems

[with Freund]

  • say y ∈ Y where |Y | = k
  • direct approach (AdaBoost.M1):

      ht : X → Y
      Dt+1(i) = (Dt(i)/Zt) · e^(−αt) if yi = ht(xi),   (Dt(i)/Zt) · e^(αt) if yi ≠ ht(xi)
      Hfinal(x) = arg max_{y∈Y} Σ_{t: ht(x)=y} αt

  • can prove same bound on error if ∀t : εt ≤ 1/2
  • in practice, not usually a problem for “strong” weak

learners (e.g., C4.5)

  • significant problem for “weak” weak learners (e.g.,

decision stumps)

  • instead, reduce to binary
SLIDE 105

The One-Against-All Approach

  • break k-class problem into k binary problems and

solve each separately

  • say possible labels are Y = {four classes, shown as shapes}

    [table: each example xi is converted into one binary example per class — labeled + for its own class, − for the others]

  • to classify new example, choose label predicted to be “most”

positive

  • ⇒ “AdaBoost.MH”

[with Singer]

  • problem: not robust to errors in predictions
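A minimal sketch of the one-against-all reduction, assuming some binary boosting routine is available that returns a real-valued scoring function; train_binary_booster is a hypothetical helper, not something defined in the tutorial.

```python
import numpy as np

def one_vs_all_train(X, y, classes, train_binary_booster):
    """One binary booster per class: examples of that class are +1, all others -1."""
    return {c: train_binary_booster(X, np.where(y == c, 1, -1)) for c in classes}

def one_vs_all_predict(models, X):
    """Predict the class whose booster gives the 'most positive' score."""
    labels = list(models.keys())
    scores = np.column_stack([models[c](X) for c in labels])
    return [labels[i] for i in np.argmax(scores, axis=1)]
```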
SLIDE 106

Using Output Codes

[with Allwein & Singer][Dietterich & Bakiri]

  • reduce to binary using

“coding” matrix M

  • rows of M ↔ code words

      M (one row per class; columns 1–5):
        + − + − +
        − − + + +
        + + − − −
        + + + + −

    [table: each example xi is then relabeled by the entries of its class’s row, giving 5 binary problems]

  • to classify new example, choose “closest” row of M
SLIDE 107

Output Codes (continued)

  • if rows of M far from one another,

will be highly robust to errors

  • potentially much faster when k (# of classes) large
  • disadvantage:

binary problems may be unnatural and hard to solve

SLIDE 108

Practical Extensions

  • multiclass classification
  • ranking problems
  • confidence-rated predictions
SLIDE 109

Ranking Problems

[with Freund, Iyer & Singer]

  • goal: learn to rank objects (e.g., movies, webpages, etc.) from

examples

  • can reduce to multiple binary questions of form:

“is or is not object A preferred to object B?”

  • now apply (binary) AdaBoost ⇒ “RankBoost”
SLIDE 110

Application: Finding Cancer Genes

[Agarwal & Sengupta]

  • examples are genes (described by microarray vectors)
  • want to rank genes from most to least relevant to leukemia
  • data sizes:
  • 7129 genes total
  • 10 known relevant
  • 157 known irrelevant
SLIDE 111

Top-Ranked Cancer Genes

      Relevance   Gene Summary
       1.         KIAA0220
       2.         G-gamma globin
       3.         Delta-globin
       4.         Brain-expressed HHCPA78 homolog
       5.         Myeloperoxidase
       6.         Probable protein disulfide isomerase ER-60 precursor
       7.         NPM1 Nucleophosmin
       8.         CD34
       9.         Elongation factor-1-beta
      ×10.        CD24

      legend: known therapeutic target / potential therapeutic target / known marker / ♦ potential marker / × no link found

SLIDE 112

Practical Extensions

  • multiclass classification
  • ranking problems
  • confidence-rated predictions
SLIDE 113

“Hard” Predictions Can Slow Learning

[figure: training examples with a line L]

  • ideally, want weak classifier that says:

      h(x) = +1 if x above L,   “don’t know” else

  • problem: cannot express using “hard” predictions
  • if must predict ±1 below L, will introduce many “bad”

predictions

  • need to “clean up” on later rounds
  • dramatically increases time to convergence
SLIDE 114

Confidence-Rated Predictions

[with Singer]

  • useful to allow weak classifiers to assign confidences to

predictions

  • formally, allow ht : X → R

      sign(ht(x)) = prediction,   |ht(x)| = “confidence”

  • use identical update:

      Dt+1(i) = (Dt(i)/Zt) · exp(−αt yi ht(xi))

    and identical rule for combining weak classifiers

  • question: how to choose αt and ht on each round
SLIDE 115

Confidence-Rated Predictions (cont.)

  • saw earlier:

      training error(Hfinal) ≤ Πt Zt = (1/m) Σi exp( −yi Σt αt ht(xi) )

  • therefore, on each round t, should choose αt ht to minimize:

      Zt = Σi Dt(i) exp(−αt yi ht(xi))

  • in many cases (e.g., decision stumps), best confidence-rated

weak classifier has simple form that can be found efficiently

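Zt is a convex function of αt, so for a real-valued ht the minimizing αt can also be found numerically; the sketch below does a simple bisection on dZt/dαt (the bracket and iteration count are arbitrary choices).

```python
import numpy as np

def best_alpha(D, y, h, lo=-10.0, hi=10.0, iters=60):
    """Minimize Z(alpha) = sum_i D(i) * exp(-alpha * y_i * h(x_i)) over alpha."""
    yh = y * h                                            # real-valued margins y_i h_t(x_i)
    dZ = lambda a: np.sum(D * (-yh) * np.exp(-a * yh))    # derivative; increasing in alpha
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dZ(mid) < 0:
            lo = mid        # minimum lies to the right
        else:
            hi = mid        # minimum lies to the left
    return 0.5 * (lo + hi)
```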
SLIDE 116

Confidence-Rated Predictions Help a Lot

[figure: % error vs. number of rounds — train/test, with and without confidence-rated predictions]

      round first reached
      % error    conf.     no conf.    speedup
      40         268       16,938      63.2
      35         598       65,292      109.2
      30         1,888     >80,000     –

SLIDE 117

Application: Boosting for Text Categorization

[with Singer]

  • weak classifiers: very simple weak classifiers that test on

simple patterns, namely, (sparse) n-grams

  • find parameter αt and rule ht of given form which

minimize Zt

  • use efficiently implemented exhaustive search
  • “How may I help you” data:
  • 7844 training examples
  • 1000 test examples
  • categories: AreaCode, AttService, BillingCredit, CallingCard,

Collect, Competitor, DialForMe, Directory, HowToDial, PersonToPerson, Rate, ThirdNumber, Time, TimeCharge, Other.

SLIDE 118

Weak Classifiers

      rnd   term
      1     collect
      2     card
      3     my home
      4     person ? person
      5     code
      6     I

[figure: sign and magnitude of each term’s vote over the categories AC AS BC CC CO CM DM DI HO PP RA 3N TI TC OT]

SLIDE 119

More Weak Classifiers

      rnd   term
      7     time
      8     wrong number
      9     how
      10    call
      11    seven
      12    trying to
      13    and

[figure: per-category votes for each term, as on the previous slide]

SLIDE 120

More Weak Classifiers

      rnd   term
      14    third
      15    to
      16    for
      17    charges
      18    dial
      19    just

[figure: per-category votes for each term, as on the previous slides]

SLIDE 121

Finding Outliers

examples with most weight are often outliers (mislabeled and/or ambiguous)

  • I’m trying to make a credit card call

(Collect)

  • hello

(Rate)

  • yes I’d like to make a long distance collect call

please (CallingCard)

  • calling card please

(Collect)

  • yeah I’d like to use my calling card number

(Collect)

  • can I get a collect call

(CallingCard)

  • yes I would like to make a long distant telephone call

and have the charges billed to another number (CallingCard DialForMe)

  • yeah I can not stand it this morning I did oversea

call is so bad (BillingCredit)

  • yeah special offers going on for long distance

(AttService Rate)

  • mister allen please william allen

(PersonToPerson)

  • yes ma’am I I’m trying to make a long distance call to

a non dialable point in san miguel philippines (AttService Other)

SLIDE 122

Advanced Topics

  • optimal accuracy
  • optimal efficiency
  • boosting in continuous time
SLIDE 123

Advanced Topics

  • optimal accuracy
  • optimal efficiency
  • boosting in continuous time
SLIDE 124

Optimal Accuracy

[Bartlett & Traskin]

  • usually, impossible to get perfect accuracy due to intrinsic

noise or uncertainty

  • Bayes optimal error = best possible error of any classifier
  • usually > 0
  • can prove AdaBoost’s classifier converges to Bayes optimal if:
  • enough data
  • run for many (but not too many) rounds
  • weak classifiers “sufficiently rich”
  • “universally consistent”
  • related results: [Jiang], [Lugosi & Vayatis], [Zhang & Yu], . . .
  • means:
  • AdaBoost can (theoretically) learn “optimally” even in

noisy settings

  • but: does not explain why works when run for very many

rounds

SLIDE 125

Boosting and Noise

[Long & Servedio]

  • can construct data source on which AdaBoost fails miserably

with even tiny amount of noise (say, 1%)

  • Bayes optimal error = 1%

(obtainable by classifier of same form as AdaBoost)

  • AdaBoost provably has error ≥ 50%
  • holds even if:
  • given unlimited training data
  • use any method for minimizing exponential loss
  • also holds:
  • for most other convex losses
  • even if add regularization
  • e.g. applies to SVM’s, logistic regression, . . .
SLIDE 126

Boosting and Noise (cont.)

  • shows:
  • consistency result can fail badly if weak classifiers

“not rich enough”

  • AdaBoost (and lots of other loss-based methods)

susceptible to noise

  • regularization might not help
  • how to handle noise?
  • on “real-world” datasets, AdaBoost often works anyway
  • various theoretical algorithms based on “branching

programs” (e.g., [Kalai & Servedio], [Long & Servedio])

SLIDE 127

Advanced Topics

  • optimal accuracy
  • optimal efficiency
  • boosting in continuous time
SLIDE 128

Optimal Efficiency

[Freund]

  • for AdaBoost, saw: training error ≤ e^(−2γ²T)
  • is AdaBoost most efficient boosting algorithm? no!
  • given T rounds and γ-weak learning assumption, boost-by-majority (BBM) algorithm is provably exactly best possible:

      training error ≤ Σ_{j=0}^{⌊T/2⌋} (T choose j) (1/2 + γ)^j (1/2 − γ)^(T−j)

    (probability of ≤ T/2 heads in T coin flips if probability of heads = 1/2 + γ)

  • AdaBoost’s training error is like Chernoff approximation of

BBM’s
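For a concrete comparison of the two bounds (the values γ = 0.1 and the choices of T below are illustrative): the exact binomial tail achieved by BBM versus AdaBoost's exponential bound, which is its Chernoff-style approximation.

```python
import math

def bbm_bound(gamma, T):
    """P[at most floor(T/2) heads in T flips with head probability 1/2 + gamma]."""
    p = 0.5 + gamma
    return sum(math.comb(T, j) * p**j * (1 - p)**(T - j) for j in range(T // 2 + 1))

def adaboost_bound(gamma, T):
    return math.exp(-2 * gamma ** 2 * T)

for T in (100, 500, 1000):
    print(T, bbm_bound(0.1, T), adaboost_bound(0.1, T))   # the binomial tail is the smaller of the two
```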

SLIDE 129

Weighting Functions: AdaBoost versus BBM

[figures: weight as a function of unnormalized margin s — AdaBoost vs. BBM]

  • both put more weight on harder examples, but BBM “gives

up” on very hardest examples

  • may make more robust to noise
  • problem: BBM not adaptive
  • need to know γ and T a priori
SLIDE 130

Advanced Topics

  • optimal accuracy
  • optimal efficiency
  • boosting in continuous time
SLIDE 131

Boosting in Continuous Time

[Freund]

  • idea: let γ get very small so that γ-weak learning assumption

eventually satisfied

  • need to make T correspondingly large
  • if scale “time” to begin at τ = 0 and end at τ = 1, then each

boosting round takes time 1/T

  • in limit T → ∞, boosting is happening in continuous time
SLIDE 132

BrownBoost

  • algorithm has sensible limit called “BrownBoost”

(due to connection to Brownian motion)

  • harder to implement, but potentially more resistant to noise and outliers, e.g.:

      dataset     noise    AdaBoost    BrownBoost
      letter      0%       3.7         4.2
                  10%      10.8        7.0
                  20%      15.7        10.5
      satimage    0%       4.9         5.2
                  10%      12.1        6.2
                  20%      21.3        7.4

[Cheamanunkul, Ettinger & Freund]

SLIDE 133

Conclusions

  • from different perspectives, AdaBoost can be interpreted as:
  • a method for boosting the accuracy of a weak learner
  • a procedure for maximizing margins
  • an algorithm for playing repeated games
  • a numerical method for minimizing exponential loss
  • an iterative-projection algorithm based on an

information-theoretic geometry

  • none is entirely satisfactory by itself, but each useful in its own way
  • taken together, create rich theoretical understanding
  • connect boosting to other learning problems and

techniques

  • provide foundation for versatile set of methods with

many extensions, variations and applications

SLIDE 134

References

  • Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.