SLIDE 1

An overview of Boosting

Yoav Freund UCSD

SLIDE 2

Plan of talk

  • Generative vs. non-generative modeling
  • Boosting
  • Alternating decision trees
  • Boosting and over-fitting
  • Applications


SLIDE 3

Toy Example

  • Computer receives telephone call
  • Measures Pitch of voice
  • Decides gender of caller

(Figure: a human voice is classified as Male or Female.)

SLIDE 4

Generative modeling

(Figure: probability density over voice pitch, modeled by two Gaussians with parameters mean1, var1 and mean2, var2.)
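A minimal sketch of the generative approach on this toy problem (the pitch values and helper names are invented for illustration): fit one Gaussian per class and classify a new caller by whichever class density is higher.

    import numpy as np

    # Toy pitch data in Hz; the values are made up for illustration.
    male_pitch   = np.array([110.0, 120.0, 125.0, 130.0, 140.0])
    female_pitch = np.array([190.0, 200.0, 210.0, 215.0, 225.0])

    def fit_gaussian(samples):
        """Estimate (mean, variance) for one class."""
        return samples.mean(), samples.var()

    mean1, var1 = fit_gaussian(male_pitch)     # male model
    mean2, var2 = fit_gaussian(female_pitch)   # female model

    def log_density(x, mean, var):
        return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

    def classify(pitch):
        # Pick the class whose fitted Gaussian assigns the higher likelihood.
        better_male = log_density(pitch, mean1, var1) > log_density(pitch, mean2, var2)
        return "male" if better_male else "female"

    print(classify(150.0))   # closer to the male model, so prints "male"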

SLIDE 5

Discriminative approach

(Figure: a single threshold on voice pitch, chosen to minimize the number of mistakes on the training data.)

[Vapnik 85]
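By contrast, a minimal sketch of the discriminative approach on the same kind of toy data (values again invented): skip the density estimates and search directly for the threshold with the fewest training mistakes.

    import numpy as np

    # Toy pitch data; label male = -1, female = +1 (invented values).
    pitch  = np.array([110., 120., 125., 130., 140., 190., 200., 210., 215., 225.])
    labels = np.array([-1, -1, -1, -1, -1, +1, +1, +1, +1, +1])

    def best_threshold(x, y):
        """Try a threshold between every pair of sorted values and keep the one
        with the fewest misclassifications (predict +1 when x > threshold)."""
        xs = np.sort(x)
        candidates = (xs[:-1] + xs[1:]) / 2
        errors = [np.sum(np.where(x > t, 1, -1) != y) for t in candidates]
        best = int(np.argmin(errors))
        return candidates[best], errors[best]

    theta, mistakes = best_threshold(pitch, labels)
    print(theta, mistakes)   # a threshold around 165 Hz with 0 mistakes on this data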

SLIDE 6

Ill-behaved data

(Figure: an ill-behaved voice-pitch distribution; the two fitted Gaussian means (mean1, mean2) and the number of mistakes of the resulting rule are shown.)
SLIDE 7

Traditional Statistics vs. Machine Learning

(Diagram: Data → Estimated world state → Predictions/Actions, with Statistics, Decision Theory, and Machine Learning labeling different parts of this pipeline.)

SLIDE 8

Comparison of methodologies

Model               | Generative            | Discriminative
Goal                | Probability estimates | Classification rule
Performance measure | Likelihood            | Misclassification rate
Mismatch problems   | Outliers              | Misclassifications

SLIDE 9

Boosting

SLIDE 10

A weighted training set

(x1,y1,w1), (x2,y2,w2), …, (xm,ym,wm)

where the xi are feature vectors, the labels yi ∈ {−1,+1} are binary, and the weights wi are positive.

SLIDE 11

A weak learner

The weak learner receives a weighted training set

(x1,y1,w1), (x2,y2,w2), …, (xm,ym,wm)

and, given the instances x1, x2, …, xm, returns a weak rule h with predictions ŷ1, ŷ2, …, ŷm, where ŷi ∈ {−1,+1}.

The weak requirement (the rule must be better than random guessing by some edge γ):

( Σ_{i=1}^m yi ŷi wi ) / ( Σ_{i=1}^m wi )  >  γ  >  0
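A sketch of one possible weak learner, a decision stump over single features (a common illustrative choice, not necessarily the weak learner used in the talk): it takes the weighted training set and returns the stump with the smallest weighted error, from which the weak requirement can be checked as a weighted correlation.

    import numpy as np

    def stump_weak_learner(X, y, w):
        """Return the decision stump (feature, threshold, sign) with the lowest
        weighted error on examples X (m x d), labels y in {-1,+1}, weights w > 0."""
        m, d = X.shape
        best, best_err = None, np.inf
        for j in range(d):
            for theta in np.unique(X[:, j]):
                for s in (+1, -1):
                    pred = np.where(X[:, j] > theta, s, -s)
                    err = np.sum(w * (pred != y)) / np.sum(w)
                    if err < best_err:
                        best, best_err = (j, theta, s), err
        return best, best_err

    def stump_predict(stump, X):
        j, theta, s = stump
        return np.where(X[:, j] > theta, s, -s)

    # With normalized weights, sum_i wi * yi * yhat_i = 1 - 2 * weighted_error,
    # so the weak requirement holds exactly when weighted_error < 1/2 - gamma/2.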

SLIDE 12

The boosting process

(x1,y1,1), (x2,y2,1), …, (xn,yn,1)                →  weak learner  →  h1
(x1,y1,w1^1), (x2,y2,w2^1), …, (xn,yn,wn^1)       →  weak learner  →  h2
(x1,y1,w1^2), (x2,y2,w2^2), …, (xn,yn,wn^2)       →  weak learner  →  h3
…
(x1,y1,w1^(T−1)), (x2,y2,w2^(T−1)), …, (xn,yn,wn^(T−1))  →  weak learner  →  hT

Prediction(x) = sign(FT(x))

SLIDE 13

A Formal Description of Boosting

  • given training set

(x1, y1), . . . , (xm, ym)

  • yi ∈ {−1, +1} correct label of instance xi ∈ X
  • for t = 1, . . . , T:
  • construct distribution Dt on {1, . . . , m}
  • find weak classifier (“rule of thumb”)

ht : X → {−1,+1} with small error εt on Dt:   εt = Pr_{i∼Dt}[ ht(xi) ≠ yi ]

  • output final classifier Hfinal

[Slides from Rob Schapire]

SLIDE 14

AdaBoost

[with Freund]

  • constructing Dt:
  • D1(i) = 1/m
  • given Dt and ht:

Dt+1(i) = ( Dt(i) / Zt ) × { e^(−αt) if yi = ht(xi),  e^(αt) if yi ≠ ht(xi) }
        = ( Dt(i) / Zt ) · exp( −αt yi ht(xi) )

where Zt = normalization factor and

αt = ½ ln( (1 − εt) / εt ) > 0

  • final classifier:

Hfinal(x) = sign( Σt αt ht(x) )

  • [Slides from Rob Schapire]
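A compact sketch of the AdaBoost loop just described; the toy data and the stump weak learner below are illustrative, and the small clamp on the error is a common safeguard rather than part of the slide.

    import numpy as np

    def adaboost(X, y, weak_learn, weak_predict, T):
        """Run T rounds of AdaBoost and return the combined classifier H(x)."""
        m = len(y)
        D = np.full(m, 1.0 / m)                      # D1(i) = 1/m
        rules, alphas = [], []
        for t in range(T):
            h = weak_learn(X, y, D)                  # weak rule for distribution Dt
            pred = weak_predict(h, X)
            eps = np.clip(np.sum(D * (pred != y)), 1e-10, 1 - 1e-10)
            alpha = 0.5 * np.log((1 - eps) / eps)    # alpha_t = 1/2 ln((1 - eps_t) / eps_t)
            D = D * np.exp(-alpha * y * pred)        # up-weight mistakes, down-weight correct
            D = D / D.sum()                          # divide by the normalization factor Z_t
            rules.append(h)
            alphas.append(alpha)

        def H_final(Xnew):
            F = sum(a * weak_predict(h, Xnew) for h, a in zip(rules, alphas))
            return np.sign(F)
        return H_final

    # Illustrative weak learner: a threshold stump on the first feature.
    def stump_learn(X, y, D):
        best, best_err = None, np.inf
        for theta in np.unique(X[:, 0]):
            for s in (+1, -1):
                err = np.sum(D * (np.where(X[:, 0] > theta, s, -s) != y))
                if err < best_err:
                    best, best_err = (theta, s), err
        return best

    def stump_pred(h, X):
        theta, s = h
        return np.where(X[:, 0] > theta, s, -s)

    X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
    y = np.array([-1, -1, +1, -1, +1, +1])
    H = adaboost(X, y, stump_learn, stump_pred, T=5)
    print(H(X))   # combined predictions on the training points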
SLIDE 15

Toy Example

(Figure: the toy data set under the initial distribution D1.)

weak classifiers = vertical or horizontal half-planes

[Slides from Rob Schapire]

SLIDE 16

Round 1

h1:  ε1 = 0.30,  α1 = 0.42   (updated distribution D2 shown)

[Slides from Rob Schapire]

SLIDE 17

Round 2

h2:  ε2 = 0.21,  α2 = 0.65   (updated distribution D3 shown)

[Slides from Rob Schapire]

SLIDE 18

Round 3

h3:  ε3 = 0.14,  α3 = 0.92

[Slides from Rob Schapire]

SLIDE 19

Final Classifier

Hfinal = sign( 0.42 h1 + 0.65 h2 + 0.92 h3 )

[Slides from Rob Schapire]

SLIDE 20

Analyzing the Training Error

[with Freund]

  • Theorem: write εt as 1/2 − γt    [ γt = “edge” ]
  • then

training error(Hfinal)  ≤  ∏t 2 √( εt (1 − εt) )  =  ∏t √( 1 − 4 γt² )  ≤  exp( −2 Σt γt² )

  • so: if ∀t : γt ≥ γ > 0, then training error(Hfinal) ≤ e^(−2 γ² T)
  • AdaBoost is adaptive:
    – does not need to know γ or T a priori
    – can exploit γt ≫ γ

[Slides from Rob Schapire]
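A quick numerical illustration of these bounds (the per-round errors are chosen arbitrarily): compute the product ∏t 2√(εt(1−εt)) and compare it with exp(−2 Σt γt²).

    import numpy as np

    # Arbitrary per-round errors, each a little better than random guessing.
    eps = np.array([0.30, 0.35, 0.40, 0.38, 0.32])
    gamma = 0.5 - eps                               # per-round edges

    product_bound = np.prod(2 * np.sqrt(eps * (1 - eps)))
    exp_bound = np.exp(-2 * np.sum(gamma ** 2))

    print(product_bound, exp_bound)                 # product_bound <= exp_bound
    assert product_bound <= exp_bound + 1e-12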

SLIDE 21

Boosting block diagram

(Block diagram: the Booster sends example weights to the Weak Learner, which returns a weak rule; iterating this loop yields a Strong Learner that outputs an accurate rule.)

SLIDE 22

Boosting with specialists

  • Specialists predict only when they are confident.
  • In addition to {−1,+1}, specialists use 0 to indicate no prediction.
  • As boosting allows both positive and negative weights, we restrict attention to specialists that output {0,1}.

SLIDE 23

AdaBoost as variational optimization

Weak rule: ht : X → {0,1}
Label: y ∈ {−1,+1}
Training set: {(x1,y1,1), (x2,y2,1), …, (xn,yn,1)}

αt = ½ ln( ( ε + Σ_{i: ht(xi)=1, yi=+1} wi^t ) / ( ε + Σ_{i: ht(xi)=1, yi=−1} wi^t ) )

Ft = Ft−1 + αt ht

SLIDE 24

AdaBoost as variational optimization

(Figure: one-dimensional training data with + and − labels, with the curves Ft−1 and ht plotted against x.)

  • x = input, a scalar
  • y = output, −1 or +1
  • (x1,y1), (x2,y2), …, (xn,yn) = training set
  • Ft−1 = strong rule after t−1 boosting iterations
  • ht = weak rule produced at iteration t

SLIDE 25

AdaBoost as variational optimization

(Figure: the same data with the candidate combination Ft−1(x) + 0·ht(x), i.e. the new weak rule given weight zero.)

  • Ft−1 = strong rule after t−1 boosting iterations
  • ht = weak rule produced at iteration t

SLIDE 26

AdaBoost as variational optimization

(Figure: the same data with the candidate combination Ft−1(x) + 0.4·ht(x), i.e. the new weak rule given weight 0.4.)

  • Ft−1 = strong rule after t−1 boosting iterations
  • ht = weak rule produced at iteration t

SLIDE 27

AdaBoost as variational optimization

(Figure: the chosen combination Ft(x) = Ft−1(x) + 0.8·ht(x).)

  • Ft−1 = strong rule after t−1 boosting iterations
  • ht = weak rule produced at iteration t

SLIDE 28

Margins

Fix a set of weak rules and let h(x) = (h1(x), h2(x), …, hT(x)) denote the vector of their outputs.
Represent the i'th example xi by this vector of weak-rule outputs, h(xi).
Labels: y ∈ {−1,+1}.
Training set: (h(x1),y1), (h(x2),y2), …, (h(xn),yn).
Goal: find a weight vector α = (α1, …, αT) that minimizes the number of training mistakes.

margin of example i:   yi · ( h(xi) · α )

(Figure: the examples projected onto the direction α; the cumulative number of examples is plotted against the margin, with negative margins corresponding to mistakes and positive margins to correct classifications.)

SLIDE 29

Boosting as gradient descent

(Figure: loss as a function of the margin; the 0-1 loss is a step at zero, and the AdaBoost loss e^(−margin) lies above it.)

✦ Our goal is to minimize the number of mistakes (0-1 loss).
✦ Unfortunately, the step function has derivative zero at all points other than zero, where the derivative is undefined.
✦ AdaBoost uses an upper bound on the 0-1 loss.
✦ AdaBoost minimizes the exponential loss, which is such an upper bound.
✦ It finds the vector α through coordinate-wise gradient descent.
✦ Weak rules are added one at a time.


SLIDE 30

AdaBoost as gradient descent

(Figure: loss versus margin, comparing the 0-1 loss with the exponential loss.)

✦ Weak rules are added one at a time.
✦ Fixing the weights α1, …, αt−1, find the weight α for the new rule ht.
✦ The weak rule defines how each example would move, i.e. whether its margin increases or decreases.

new margin of (xi,yi):   yi Ft−1(xi) + α yi ht(xi)

new total loss:   Σ_{i=1}^n exp( −yi ( Ft−1(xi) + α ht(xi) ) )

derivative of the total loss with respect to α, at α = 0:

  − ∂/∂α Σ_{i=1}^n exp( −yi ( Ft−1(xi) + α ht(xi) ) )  =  Σ_{i=1}^n yi ht(xi) · exp( −yi Ft−1(xi) )

where exp( −yi Ft−1(xi) ) is the weight of example (xi,yi).

SLIDE 31

LogitBoost as gradient descent

Write m for the margin. Instead of the AdaBoost loss e^(−m), define the loss to be log(1 + e^(−m)).

When the margin ≫ 0:   log(1 + e^(−m)) ≈ e^(−m)
When the margin ≪ 0:   log(1 + e^(−m)) ≈ −m

(Figure: the AdaBoost loss and weight compared with the LogitBoost loss and weight as functions of the margin.)

new margin of (xi,yi):   yi Ft−1(xi) + α yi ht(xi)

new total loss:   Σ_{i=1}^n log( 1 + exp( −yi ( Ft−1(xi) + α ht(xi) ) ) )

derivative of the total loss with respect to α, at α = 0:

  − ∂/∂α Σ_{i=1}^n log( 1 + exp( −yi ( Ft−1(xi) + α ht(xi) ) ) )  =  Σ_{i=1}^n yi ht(xi) · exp( −yi Ft−1(xi) ) / ( 1 + exp( −yi Ft−1(xi) ) )

where exp( −yi Ft−1(xi) ) / ( 1 + exp( −yi Ft−1(xi) ) ) is the weight of example (xi,yi).

Also called GentleBoost and LogitBoost [Friedman, Hastie & Tibshirani].
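A short sketch contrasting the two weighting schemes over a range of margins (purely illustrative): the AdaBoost weight e^(−m) blows up for large negative margins, while the LogitBoost weight never exceeds 1.

    import numpy as np

    margins = np.linspace(-5, 5, 11)

    ada_weight   = np.exp(-margins)                            # e^(-m), unbounded as m -> -inf
    logit_weight = np.exp(-margins) / (1 + np.exp(-margins))   # always below 1

    for m, wa, wl in zip(margins, ada_weight, logit_weight):
        print(f"margin={m:5.1f}   adaboost weight={wa:9.3f}   logitboost weight={wl:6.3f}")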

SLIDE 32

Noise resistance

  • AdaBoost:
    – performs well when the achievable error rate is close to zero (the almost-consistent case)
    – errors, i.e. examples with negative margins, get very large weights and can cause over-fitting
  • LogitBoost:
    – inferior to AdaBoost when the achievable error rate is close to zero
    – often better than AdaBoost when the achievable error rate is high
    – the weight on any example is never larger than 1

SLIDE 33

Gradient-Boost / AnyBoost

  • A general recipe for learning by incremental optimization.
  • Applies to any (differentiable) loss function.
  • At each iteration:
    – calculate the example weights
    – call the weak learner with the weighted examples
    – add the generated weak rule to create the new rule:  Ft = Ft−1 + αt ht

A labeled example is (x,y), where x can be anything. The prediction is of the form

  Ft(x) = Σ_{i=1}^t αi hi(x)

Loss: L( Ft(x), y ).   Total loss so far:  Σ_{j=1}^n L( Ft−1(xj), yj )

Weight of example (x,y):   ∂/∂z L( Ft−1(x) + z, y )
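A minimal sketch of this recipe (the weak-learner interface is a placeholder, and the logistic loss is just one example of a differentiable loss): weight each example by the derivative of its loss, call the weak learner, and add the new rule with a fixed step size standing in for αt.

    import numpy as np

    def logistic_grad(F, y):
        """d/dz log(1 + exp(-y (F + z))) evaluated at z = 0."""
        return -y * np.exp(-y * F) / (1 + np.exp(-y * F))

    def gradient_boost(X, y, loss_grad, weak_learn, weak_predict, T, step=0.5):
        """Generic boosting loop: any differentiable per-example loss can be
        plugged in through loss_grad; weak_learn/weak_predict are placeholders."""
        F = np.zeros(len(y))
        rules = []
        for t in range(T):
            w = np.abs(loss_grad(F, y))           # example weights from the loss derivative
            h = weak_learn(X, y, w)               # weak learner sees the weighted examples
            F = F + step * weak_predict(h, X)     # F_t = F_{t-1} + alpha_t * h_t (fixed alpha here)
            rules.append((step, h))
        return rules

With the exponential loss the weights reduce to AdaBoost's e^(−yF); in practice the fixed step would be replaced by a line search or the closed-form αt.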
SLIDE 34

Styles of weak learners

  • Simple (stumps)
  • Complex (fully grown trees)
  • Everything in between (neural networks, nearest neighbors, Naive Bayes, …)

(Figure: example stump weak rules built on features such as Season (Summer/Winter) and Location (Beach, Ski Slope, …), each predicting +1 or −1 with weight ±α; the rules are grouped into specialists, which only score part of the input space, and a generalist.)

SLIDE 35

Alternating Decision Trees

Joint work with Llew Mason

SLIDE 36

Decision Trees

(Figure: a decision tree that tests X>3 at the root and Y>5 below it, with leaf predictions +1 and −1, together with the corresponding axis-parallel partition of the (X,Y) plane at X=3 and Y=5.)
SLIDE 37
Decision tree as a sum

(Figure: the same tree redrawn as a sum of prediction nodes, with scores −0.2, +0.1, −0.1, +0.2 and −0.3 attached to the root and to the branches of the X>3 and Y>5 tests; the classifier outputs the sign of the summed scores, which reproduces the +1/−1 regions of the original tree.)
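A minimal sketch of the idea (the assignment of scores to branches is my own illustrative choice, since the figure did not survive extraction): write the tree as a constant plus one score per branch taken, and classify by the sign of the sum.

    # Illustrative only: score-to-branch assignments are guessed so that the
    # signs produce a plausible +1/-1 labeling of the three regions.
    def tree_as_sum(x, y):
        score = -0.2                               # root prediction node
        if x > 3:
            score += 0.1
            score += 0.2 if y <= 5 else -0.3       # the Y>5 test only applies on this branch
        else:
            score += -0.1
        return 1 if score > 0 else -1

    for point in [(2.0, 4.0), (4.0, 4.0), (4.0, 6.0)]:
        print(point, tree_as_sum(*point))          # one prediction per region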

SLIDE 38

An alternating decision tree

(Figure: an alternating decision tree on the same data: a root score plus prediction nodes for the tests X>3, Y>5 and Y<1, with branch scores such as +0.1, −0.1, +0.2, −0.3, 0.0 and +0.7; the classification is the sign of the sum of the scores along all paths the example reaches.)
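A sketch of how an alternating decision tree is evaluated (the node structure and scores are illustrative, loosely following the figure): add up the scores of every prediction node whose precondition the example satisfies, then take the sign.

    # Illustrative alternating decision tree: each rule is
    # (precondition, test, score_if_true, score_if_false).
    RULES = [
        (lambda x, y: True,   lambda x, y: x > 3,  +0.1, -0.1),
        (lambda x, y: x > 3,  lambda x, y: y > 5,  -0.3, +0.2),
        (lambda x, y: True,   lambda x, y: y < 1,  +0.7,  0.0),
    ]
    ROOT_SCORE = -0.2

    def adt_predict(x, y):
        score = ROOT_SCORE
        for precondition, test, s_true, s_false in RULES:
            if precondition(x, y):                 # only rules the example "reaches" contribute
                score += s_true if test(x, y) else s_false
        return 1 if score > 0 else -1

    print(adt_predict(4.0, 0.5))   # this example reaches all three rules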

SLIDE 39

Example: Medical Diagnostics

  • Cleve dataset from UC Irvine database.
  • Heart disease diagnostics (+1=healthy,-1=sick)
  • 13 features from tests (real valued and discrete).
  • 303 instances.
SLIDE 40

ADtree for the Cleveland heart-disease diagnostics problem

SLIDE 41

Cross-validated accuracy

Learning algorithm | Number of splits | Average test error | Test error variance
ADtree             | 6                | 17.0%              | 0.6%
C5.0               | 27               | 27.2%              | 0.5%
C5.0 + boosting    | 446              | 20.2%              | 0.5%
Boost Stumps       | 16               | 16.5%              | 0.8%

SLIDE 42

Applications of Boosting
SLIDE 43

Applications of Boosting

  • Academic research
  • Applied research
  • Commercial deployment
SLIDE 44

Academic research

Database  | Other            | Boosting | Error reduction
Cleveland | 27.2 (DT)        | 16.5     | 39%
Promoters | 22.0 (DT)        | 11.8     | 46%
Letter    | 13.8 (DT)        | 3.5      | 74%
Reuters 4 | 5.8, 6.0, 9.8    | 2.95     | ~60%
Reuters 8 | 11.3, 12.1, 13.4 | 7.4      | ~40%

(% test error rates)

SLIDE 45

Applied research

  • “AT&T, How may I help you?”
  • Classify voice requests
  • Voice → text → category
  • Fourteen categories:
    – area code, AT&T service, billing credit, calling card, collect, competitor, dial assistance, directory, how to dial, person to person, rate, third party, time charge

Schapire, Singer, Gorin 98
SLIDE 46
Example transcribed sentences

  • “Yes I’d like to place a collect call long distance please”
  • “Operator I need to make a call but I need to bill it to my office”
  • “Yes I’d like to place a call on my master card please”
  • “I just called a number in Sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off my bill”

Categories: collect, third party, billing credit, calling card

SLIDE 47

Weak rules generated by “BoosTexter”

(Figure: each weak rule tests whether a word occurs in the request or not and assigns a score to each category, e.g. calling card, collect call, third party.)

SLIDE 48

Results

  • 7844 training examples
    – hand transcribed
  • 1000 test examples
    – hand / machine transcribed
  • Accuracy with 20% rejected:
    – machine transcribed: 75%
    – hand transcribed: 90%

SLIDE 49

Commercial deployment

  • Distinguish business/residence customers
  • Using statistics from call-detail records
  • Alternating decision trees
    – combine very simple rules
    – can over-fit; cross-validation is used to stop

Freund, Mason, Rogers, Pregibon, Cortes 2000

SLIDE 50

Massive datasets (for 1997)

  • 260M calls / day
  • 230M telephone numbers
  • Label unknown for ~30%
  • Hancock: software for computing statistical signatures
  • 100K randomly selected training examples; ~10K is enough
  • Training takes about 2 hours
  • Generated classifier has to be both accurate and efficient

SLIDE 51

Alternating tree for “buizocity”

SLIDE 52

Alternating Tree (Detail)

SLIDE 53

Precision/recall graphs

(Figure: accuracy plotted against score.)

SLIDE 54

Face Detection

Viola and Jones / 2001

SLIDE 55

Face Detection / Viola and Jones

Many Uses

  • Standard feature in cameras
  • User Interfaces
  • Security Systems
  • Video Compression
  • Image Database Analysis

  • Live demo at the 2001 conference
  • Higher accuracy than the manually designed state of the art
  • Real-time using a 2001 laptop
SLIDE 56

Face Detection as a Filtering process

(Figure: the detector scans roughly 50,000 locations/scales per image, from the smallest scale up to larger scales.)

SLIDE 57

Classifier is Learned from Labeled Data

  • 5000 faces, 10^8 non-faces
  • Faces are normalized
  • Scale, translation
  • Rotation remains…
  • Example: 28x28 image patch
  • Label: Face / Non-face

SLIDE 58

Image Features

Unique binary features: “rectangle filters”, similar to Haar wavelets [Papageorgiou et al.]

ht(xi) = 1 if ft(xi) > θt, 0 otherwise

Very fast to compute using the “integral image”. Combined using AdaBoost.
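A minimal sketch of the integral-image trick (a standard technique, not Viola and Jones' code): precompute cumulative sums so that the sum over any rectangle, and hence any rectangle filter, costs only a few array lookups.

    import numpy as np

    def integral_image(img):
        """Cumulative sum over rows and columns, padded with a leading zero
        row and column so rectangle sums need no boundary checks."""
        ii = np.cumsum(np.cumsum(img, axis=0), axis=1)
        return np.pad(ii, ((1, 0), (1, 0)))

    def rect_sum(ii, top, left, height, width):
        """Sum of img[top:top+height, left:left+width] from four lookups."""
        return (ii[top + height, left + width] - ii[top, left + width]
                - ii[top + height, left] + ii[top, left])

    # Example two-rectangle (Haar-like) feature: left half minus right half.
    img = np.arange(16.0).reshape(4, 4)
    ii = integral_image(img)
    feature = rect_sum(ii, 0, 0, 4, 2) - rect_sum(ii, 0, 2, 4, 2)
    print(feature, img[:, :2].sum() - img[:, 2:].sum())   # the two values match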

SLIDE 59

Cascaded boosting

  • Features combined using AdaBoost
    – find the best weak rule by exhaustive search
    – ran for 2 days on a 250-node cluster
  • For detection, features are combined in a cascade:
    – runs in real time, 15 FPS, on a laptop

(Diagram: each image sub-window passes through stages of 1 feature, 5 features, 20 features, …; a stage either rejects the sub-window as NON-FACE or passes it on (roughly 50%, 20%, 2% of windows survive the successive stages), and only windows surviving every stage are labeled FACE.)
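A tiny sketch of the cascade idea (the stage classifiers and thresholds are placeholders, not the trained detector): each stage can reject a sub-window early, so most windows are handled by the cheapest stages.

    def cascade_classify(window, stages):
        """stages: list of (classifier, threshold) pairs, cheapest first;
        classifier(window) returns a score. Reject on the first failing stage."""
        for classifier, threshold in stages:
            if classifier(window) < threshold:
                return "NON-FACE"          # early exit: cheap stages reject most windows
        return "FACE"

    # Illustrative stages with made-up scores and thresholds.
    stages = [
        (lambda w: w["score1"], 0.5),      # 1-feature stage
        (lambda w: w["score5"], 0.6),      # 5-feature stage
        (lambda w: w["score20"], 0.7),     # 20-feature stage
    ]
    print(cascade_classify({"score1": 0.9, "score5": 0.8, "score20": 0.75}, stages))  # FACE
    print(cascade_classify({"score1": 0.2, "score5": 0.8, "score20": 0.9}, stages))   # NON-FACE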

SLIDE 60

Paul Viola and Mike Jones


SLIDE 61

Summary

  • Boosting is a computational method for learning accurate classifiers
  • Resistance to over-fitting is explained by margins
  • Boosting has been applied successfully to a variety of classification problems