SLIDE 1

Boosting: more than an ensemble method for prediction

Peter Bühlmann, ETH Zürich

SLIDE 2

1. Historically: Boosting is about multiple predictions

Data: (X_1, Y_1), . . . , (X_n, Y_n) (i.i.d. or stationary),
predictor variables X_i ∈ R^p, response variables Y_i ∈ R or Y_i ∈ {0, 1, . . . , J − 1}

Aim: estimation of a function f(·): R^p → R, e.g.

f(x) = E[Y | X = x], or f(x) = P[Y = 1 | X = x] with Y ∈ {0, 1},

or: the distribution of the survival time Y given X depends on some function f(X) only

"historical" view (for classification): Boosting is a multiple predictions (estimation) & combination method

SLIDE 3

Base procedure:

data → algorithm A → θ̂(·) (a function estimate)

e.g.: simple linear regression, tree, MARS, "classical" smoothing, neural nets, ...

Generating multiple predictions:

weighted data 1 → algorithm A → θ̂_1(·)
weighted data 2 → algorithm A → θ̂_2(·)
· · ·
weighted data M → algorithm A → θ̂_M(·)

Aggregation: f̂_A(·) = Σ_{m=1}^M a_m θ̂_m(·)

data weights? averaging weights a_m?

SLIDE 4

classification of lymph nodal status in breast cancer using gene expressions from microarray data:

n = 33, p = 7129 (for CART: gene-preselection, reducing to p = 50)

method                         test set error   gain over CART
CART                           22.5%            –
LogitBoost with trees          16.3%            28%
LogitBoost with bagged trees   12.2%            46%

this kind of boosting: mainly prediction, not much interpretation

SLIDE 5

2. Boosting algorithms

around 1990: Schapire constructed some early versions of boosting

AdaBoost: proposed for classification by Freund & Schapire (1996)

data weights (rough original idea): large weights for previously heavily misclassified instances (sequential algorithm)

averaging weights a_m: large if the in-sample performance in the mth round was good

Why should this be good?

SLIDE 6

Why should this be good? some common answers 5 years ago ... because

  • it works so well for prediction (which is quite true)
  • it concentrates on the "hard cases" (so what?)
  • AdaBoost almost never overfits the data, no matter how many iterations it is run (not true)

SLIDE 7

A better explanation

Breiman (1998/99): AdaBoost is a functional gradient descent (FGD) procedure

aim: find f*(·) = argmin_{f(·)} E[ρ(Y, f(X))]

e.g. for ρ(y, f) = |y − f|²: f*(x) = E[Y | X = x]

FGD solution: consider the empirical risk n⁻¹ Σ_{i=1}^n ρ(Y_i, f(X_i)) and do iterative steepest descent in function space

SLIDE 8

2.1. Generic FGD algorithm

Step 1. f̂_0 ≡ 0; set m = 0.

Step 2. Increase m by 1. Compute the negative gradient −∂ρ(Y, f)/∂f and evaluate it at f = f̂_{m−1}(X_i):
U_i (i = 1, . . . , n)

Step 3. Fit the negative gradient vector U_1, . . . , U_n by the base procedure:
(X_i, U_i)_{i=1}^n → algorithm A → θ̂_m(·),
e.g. θ̂_m fitted by (weighted) least squares;
i.e. θ̂_m(·) is an approximation of the negative gradient vector.

Step 4. Update
f̂_m(·) = f̂_{m−1}(·) + ν s_m · θ̂_m(·),
s_m = argmin_s n⁻¹ Σ_{i=1}^n ρ(Y_i, f̂_{m−1}(X_i) + s · θ̂_m(X_i)), 0 < ν ≤ 1,
i.e. proceed along an estimate of the negative gradient vector.

Step 5. Iterate Steps 2–4 until m = m_stop for some stopping iteration m_stop.
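The five steps above can be sketched in a few lines of Python. This is an illustrative toy implementation (not the speaker's code): the base procedure is a least-squares regression stump, and the line search is omitted (s_m = 1).

```python
import numpy as np

def fit_stump(X, U):
    """Least-squares regression stump: best single split on one predictor."""
    n, p = X.shape
    best = (np.inf, 0, 0.0, float(U.mean()), float(U.mean()))
    for j in range(p):
        order = np.argsort(X[:, j])
        xs, us = X[order, j], U[order]
        for i in range(1, n):
            left, right = us[:i], us[i:]
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best[0]:
                best = (rss, j, (xs[i - 1] + xs[i]) / 2, left.mean(), right.mean())
    _, j, c, a, b = best
    return lambda x: np.where(x[:, j] <= c, a, b)

def fgd_boost(X, Y, neg_gradient, m_stop=50, nu=0.2):
    """Generic FGD: repeatedly fit the negative gradient vector by the
    base procedure and move a small step nu along it (s_m = 1)."""
    F = np.zeros(len(Y))
    fits = []
    for _ in range(m_stop):
        U = neg_gradient(Y, F)      # Step 2: negative gradient at f_{m-1}(X_i)
        theta = fit_stump(X, U)     # Step 3: base procedure on (X_i, U_i)
        F = F + nu * theta(X)       # Step 4: update f_m
        fits.append(theta)
    return lambda Xnew: nu * sum(t(Xnew) for t in fits)
```

For squared error loss ρ(y, f) = (y − f)²/2 the negative gradient is simply the residual y − f, so `neg_gradient = lambda y, f: y - f` recovers L2Boosting.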

SLIDE 9

Why "functional gradient"? Alternative formulation in function space:

empirical risk functional: C(f) = n⁻¹ Σ_{i=1}^n ρ(Y_i, f(X_i))

inner product: ⟨f, g⟩ = n⁻¹ Σ_{i=1}^n f(X_i) g(X_i)

negative Gateaux derivative: −dC(f)(x) = −(∂/∂α) C(f + α 1_x)|_{α=0}; −dC(f̂_{m−1})(X_i) = U_i

if U_1, . . . , U_n are fitted by least squares: equivalent to maximizing ⟨−dC(f̂_{m−1}), θ⟩ w.r.t. θ(·) (if ∥θ∥ = 1), over all possible θ(·)'s from the base procedure

i.e.: θ̂_m(·) is the best approximation (most parallel) to the negative gradient −dC(f̂_{m−1})

SLIDE 10

By definition: FGD yields an additive combination of base procedure fits

ν Σ_{m=1}^{m_stop} s_m θ̂_m(·)

Breiman (1998): FGD with ρ(y, f) = exp(−(2y − 1) · f) for binary classification yields the AdaBoost algorithm (great result!)

Remark: FGD cannot be represented as some explicit estimation function(al)

f̂_m(·) = argmin_{f∈F} n⁻¹ Σ_{i=1}^n ρ(Y_i, f(X_i))

for some function class F

FGD is mathematically more difficult to analyze, but generically applicable (as an algorithm!) in very complex models

SLIDE 11

2.2. L2Boosting

(see also Friedman, 2001)

loss function ρ(y, f) = |y − f|², population minimizer f*(x) = E[Y | X = x]

FGD with base procedure θ̂(·): repeated fitting of residuals

m = 1: (X_i, Y_i)_{i=1}^n → θ̂_1(·), f̂_1 = ν θ̂_1; residuals U_i = Y_i − f̂_1(X_i)
m = 2: (X_i, U_i)_{i=1}^n → θ̂_2(·), f̂_2 = f̂_1 + ν θ̂_2; residuals U_i = Y_i − f̂_2(X_i)
. . .
f̂_{m_stop}(·) = ν Σ_{m=1}^{m_stop} θ̂_m(·) (stagewise greedy fitting of residuals)

Tukey (1977): "twicing" for m_stop = 2 and ν = 1
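The residual-fitting loop is only a few lines. A minimal sketch (illustrative only; the k-NN smoother used as the base procedure is a stand-in, not anything from the talk):

```python
import numpy as np

def l2boost(x, Y, base_fit, m_stop=20, nu=0.5):
    """L2Boosting: stagewise greedy fitting of residuals U_i = Y_i - f_{m-1}(x_i).
    Tukey's 'twicing' is the special case m_stop = 2, nu = 1."""
    F = np.zeros(len(Y))
    learners = []
    for _ in range(m_stop):
        theta = base_fit(x, Y - F)     # fit the current residuals
        F = F + nu * theta(x)
        learners.append(theta)
    return lambda xn: nu * sum(t(xn) for t in learners)

def knn_smoother(x, U, k=15):
    """A crude low-variance base procedure: k-nearest-neighbour average
    for a one-dimensional predictor."""
    def predict(xn):
        idx = np.argsort(np.abs(xn[:, None] - x[None, :]), axis=1)[:, :k]
        return U[idx].mean(axis=1)
    return predict
```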

SLIDE 12

Any gain over classical methods? (for additive modeling)

Ozone data: n = 300, p = 8

[Figure: MSE (≈ 18–22) against boosting iterations (20–100)]

  • magenta: L2Boosting with stumps (horiz. line = cross-validated stopping)
  • black: L2Boosting with componentwise smoothing spline (horiz. line = cross-validated stopping), i.e. smoothing spline fitting against the selected predictor which reduces RSS most
  • green: MARS restricted to additive modeling
  • red: additive model using backfitting

L2Boosting with stumps or componentwise smoothing splines also yields an additive model:

Σ_{m=1}^{m_stop} θ̂_m(x^{(Ŝ_m)}) = ĝ_1(x^{(1)}) + . . . + ĝ_p(x^{(p)})

SLIDE 13

Simulated data: non-additive regression function, n = 200, p = 100

[Figure: MSE (≈ 11–16) against boosting iterations (50–300)]

  • magenta: L2Boosting with stumps
  • black: L2Boosting with componentwise smoothing spline
  • green: MARS restricted to additive modeling
  • red: additive model using backfitting and forward variable selection

SLIDE 14

similar for classification

SLIDE 15

3. Structured models and choosing the base procedure

have just seen the componentwise smoothing spline base procedure: it smoothes the response against the one predictor variable which reduces RSS most; we keep the degrees of freedom fixed for all candidate predictors, e.g. d.f. = 2.5

L2Boosting yields an additive model fit, including variable selection

SLIDE 16

Componentwise linear least squares: simple linear OLS against the one predictor variable which reduces RSS most

θ̂(x) = β̂_Ŝ x^{(Ŝ)}, β̂_j = Σ_{i=1}^n Y_i X_i^{(j)} / Σ_{i=1}^n (X_i^{(j)})², Ŝ = argmin_j Σ_{i=1}^n (Y_i − β̂_j X_i^{(j)})²

first round of estimation: selected predictor variable X^{(Ŝ_1)} (e.g. = X^{(3)}), corresponding β̂_{Ŝ_1}, fitted function f̂_1(x)

second round of estimation: selected predictor variable X^{(Ŝ_2)} (e.g. = X^{(21)}), corresponding β̂_{Ŝ_2}, fitted function f̂_2(x)

etc.

L2Boosting: f̂_m(x) = f̂_{m−1}(x) + ν · θ̂(x)

L2Boosting yields a linear model fit, including variable selection, i.e. a structured model fit
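The componentwise selection rule above can be sketched as follows (illustrative toy code, assuming centered data with no intercept):

```python
import numpy as np

def componentwise_ls_boost(X, Y, m_stop=100, nu=0.1):
    """L2Boosting with componentwise linear least squares: each round
    regresses the current residuals on the single predictor that
    reduces RSS most, and adds nu times that fit."""
    n, p = X.shape
    beta = np.zeros(p)                                # accumulated coefficients
    F = np.zeros(n)
    for _ in range(m_stop):
        U = Y - F
        b = X.T @ U / (X ** 2).sum(axis=0)            # per-predictor OLS coefficient
        rss = ((U[:, None] - X * b) ** 2).sum(axis=0)
        S = int(rss.argmin())                         # selected predictor
        beta[S] += nu * b[S]
        F += nu * b[S] * X[:, S]
    return beta
```

With early stopping, only a few coordinates of `beta` ever get updated: the procedure does variable selection and shrinkage at once.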

SLIDE 17

for ν = 1, this is known as:
  • Matching Pursuit (Mallat and Zhang, 1993)
  • weak greedy algorithm (DeVore & Temlyakov, 1997)
  • a version of boosting (Schapire, 1992; Freund & Schapire, 1996)
  • Gauss-Southwell algorithm: C.F. Gauss in 1803, "Princeps Mathematicorum"; R.V. Southwell in 1933, Professor in engineering, Oxford

SLIDE 18

binary lymph node classification in breast cancer using gene expressions: a high noise problem

n = 49 samples, p = 7129 gene expressions

method        CV-misclassif. err.
L2Boosting    17.7%
FPLR          35.25%
Pelora        27.8%
1-NN          43.25%
DLDA          36.12%
SVM           36.88%

gene selection: multivariate for L2Boosting; best 200 genes from the Wilcoxon test for the other methods

L2Boosting selected 42 out of p = 7129 genes

for this data-set: not good prediction with any of the methods, but L2Boosting may be a reasonable(?) multivariate gene selection method

SLIDE 19

42 (out of 7129) selected genes (n = 49)

[Figure: sorted regression coefficients (≈ −0.15 to 0.05) of the selected genes]

identifiability problem: strong correlations among some genes

consider groups of highly correlated genes, biological categories (e.g. GO), ...

linear model: multivariate association between genes and tumor-type; very different from 2-sample tests for individual genes

SLIDE 20

Pairwise smoothing splines: smoothes the response against the pair of predictor variables which reduces RSS most; we keep the degrees of freedom fixed for all candidate pairs, e.g. d.f. = 2.5

L2Boosting yields a nonparametric interaction model, including variable selection

SLIDE 21

Example: degree-2 nonparametric interaction modelling

Friedman #1 model:

Y = 10 sin(π X_1 X_2) + 20 (X_3 − 0.5)² + 10 X_4 + 5 X_5 + N(0, 1), X = (X_1, . . . , X_20) ∼ Unif([0, 1]^20)

[Figure: MSE (≈ 4–7) against boosting iterations (100–500) for MARS and L2Boost with pairwise splines; AIC_c-stopped L2Boost marked]

L2Boosting with pairwise splines: sample size n = 50, p = 20, effective p_eff = 5

SLIDE 22

Regression trees:
  • stumps (2 terminal nodes): L2Boosting fits an additive model
  • trees with d terminal nodes: L2Boosting fits an interaction model of degree d − 2

SLIDE 23

The low variance, high bias "principle"

once we have decided about some structural properties:

choose a base procedure with low variance but potentially large estimation bias; the bias can be reduced by further boosting iterations (which will increase the variance)

example: low degrees of freedom in componentwise smoothing splines for additive modeling

a justification will be given later

SLIDE 24

4. More on L2Boosting

L2Boosting for linear models: use the componentwise linear least squares base procedure

L2Boosting converges to a least squares solution as the number of boosting iterations m → ∞ (the unique LS solution if the design has full rank, p ≤ n)

when stopping early:
  • it does variable selection
  • coefficient estimates are typically shrunken versions of LS

"similar to" the Lasso

SLIDE 25

Connections to the Lasso (for linear models):

Efron, Hastie, Johnstone, Tibshirani (2004): for special design matrices, iterations of L2Boosting with "infinitesimally" small ν yield all Lasso solutions when varying λ

computationally interesting: produces all Lasso solutions in one sweep of boosting

Least Angle Regression (LARS; Efron et al., 2004) is computationally even more clever and efficient than L2Boosting

Zhao and Yu (2005): in "general", when adding some backward steps, the solutions from the Lasso and modified boosting "coincide"

greedy (plus backward steps) and convex optimization are surprisingly similar

SLIDE 26

p = 10, peff = 3, n = 20

[Figure: MSE (≈ 2–6) against boosting iterations (100–500) for uncorrelated and correlated designs; methods: AIC-stopped L2Boost, Lasso, forward variable selection, OLS]

SLIDE 27

binary lymph node classification using gene expressions

n = 49 samples, p = 7129 gene expressions

method        CV-misclassif. err.
L2Boosting    17.7%
FPLR          35.25%
Pelora        27.8%
1-NN          43.25%
DLDA          36.12%
SVM           36.88%
Lasso         21.2%

gene selection: multivariate for L2Boosting; best 200 genes from the Wilcoxon test for the other methods

L2Boosting selected 42 out of p = 7129 genes; the Lasso selected 15 genes

SLIDE 28

how well can we do? statistically consistent for very high-dimensional, sparse linear models

Y_i = β_0 + Σ_{j=1}^p β_j X_i^{(j)} + ε_i (i = 1, . . . , n), p ≫ n

Theorem (PB, 2004)
L2Boosting with componentwise linear LS is consistent (for a suitable number of boosting iterations) if:
  • p_n = O(exp(C n^{1−ξ})) (0 < ξ < 1) (high-dimensional): essentially exponentially many variables relative to n
  • sup_n Σ_{j=1}^{p_n} |β_{j,n}| < ∞: ℓ1-sparseness of the true function

i.e. for a suitable, slowly growing m = m_n:

E_X |f̂_{m_n,n}(X) − f_n(X)|² = o_P(1) (n → ∞)

"no" assumptions about the predictor variables/design matrix

SLIDE 29

analogous results also for

  • multivariate regression
  • vector autoregressive time series

(Lutz & PB, 2005)

SLIDE 30

For linear models: L2Boosting or Lasso?

  • “similar” prediction performance
  • LARS algorithm is computationally more efficient (for all Lasso solutions)

O(np · min(n, p)) for LARS; O(np · m_stop) for L2Boosting

  • notion of degrees of freedom is easier for L2Boosting
  • boosting is more generic (nonparametric models, other loss functions,...)

SLIDE 31

4.1. Degrees of freedom for boosting

(PB, 2004)

the only tuning parameter: the number of boosting iterations

could use cross-validation: works reasonably well

alternatively: use AIC, BIC or gMDL as model selection criteria, which involve the degrees of freedom of boosting

SLIDE 32

hat-matrix of the componentwise linear LS base procedure, H^{(j)}: (Y_1, . . . , Y_n) → (Ŷ_1, . . . , Ŷ_n), when using the jth predictor variable only:

H^{(j)} = X^{(j)} (X^{(j)})^T / ∥X^{(j)}∥²

L2Boosting hat-matrix:

B_m = B_{m−1} + ν · H^{(Ŝ_m)} (I − B_{m−1})
    = I − (I − ν · H^{(Ŝ_m)}) (I − ν · H^{(Ŝ_{m−1})}) · · · (I − ν · H^{(Ŝ_1)})

(Ŝ_m = predictor selected in the mth iteration)

degrees of freedom of boosting in iteration m: d.f.(B_m) = trace(B_m)

d.f. ignores the selection effect, i.e. it is "slightly" too small ("negligible" (?) since we can allow for o(exp(n)) candidate basis functions)

SLIDE 33

d.f. is very different from the number of variables in the model

example: 3 (or more) correlated variables, ν = 1
sequence of selected variables 3, 2, 1, 3, 2, 1: d.f.(B_6) = 1.79 < 3
sequence of selected variables 1, 2, 3, 2, 3, 1: d.f.(B_6) = 1.54 < 3
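The effect can be checked numerically from the product formula for B_m. A toy sketch (the highly correlated design below is made up for illustration, not the slide's exact data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
z = rng.normal(size=n)
# three highly correlated predictor columns
X = np.column_stack([z + 0.1 * rng.normal(size=n) for _ in range(3)])

def df_for_sequence(X, seq, nu=1.0):
    """d.f.(B_m) = trace(B_m) with B_m = I - prod_k (I - nu H^(S_k))."""
    n = X.shape[0]
    P = np.eye(n)
    for j in seq:
        H = np.outer(X[:, j], X[:, j]) / (X[:, j] ** 2).sum()
        P = (np.eye(n) - nu * H) @ P
    return np.trace(np.eye(n) - P)

df6 = df_for_sequence(X, [2, 1, 0, 2, 1, 0])
```

Because the three columns are nearly collinear, `df6` stays well below 3 even though all three variables have been selected.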

SLIDE 34

Stopping the boosting iterations

we often use the corrected AIC criterion:

AICc(B_m) = log(RSS_m / n) + (1 + trace(B_m)/n) / (1 − (trace(B_m) + 2)/n)

estimate the stopping iteration by m̂_stop = argmin_m AICc(B_m)
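A sketch of this stopping rule for L2Boosting with componentwise linear LS, tracking the boosting hat-matrix recursion from the previous slide (illustrative only; assumes centered data and that trace(B_m) + 2 < n throughout):

```python
import numpy as np

def aicc_stopping(X, Y, m_max=200, nu=0.1):
    """Run L2Boosting with componentwise linear LS and return the
    AICc-minimizing iteration, using d.f.(B_m) = trace(B_m)."""
    n, p = X.shape
    B = np.zeros((n, n))
    best_aicc, best_m = np.inf, 0
    for m in range(1, m_max + 1):
        U = Y - B @ Y                                   # current residuals
        b = X.T @ U / (X ** 2).sum(axis=0)
        S = int((((U[:, None] - X * b) ** 2).sum(axis=0)).argmin())
        H_S = np.outer(X[:, S], X[:, S]) / (X[:, S] ** 2).sum()
        B = B + nu * H_S @ (np.eye(n) - B)              # hat-matrix recursion
        rss = ((Y - B @ Y) ** 2).sum()
        tr = np.trace(B)
        aicc = np.log(rss / n) + (1 + tr / n) / (1 - (tr + 2) / n)
        if aicc < best_aicc:
            best_aicc, best_m = aicc, m
    return best_m
```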

SLIDE 35

p = 10, peff = 3, n = 20

[Figure: MSE (≈ 2–6) against boosting iterations (100–500) for uncorrelated and correlated designs; methods: AIC-stopped L2Boost, Lasso, forward variable selection, OLS]

SLIDE 36

Analogously for nonparametric base procedures: hat-matrix H^{(S)} with a selected subset S of predictor variables:

B_m = I − (I − ν · H^{(Ŝ_m)}) (I − ν · H^{(Ŝ_{m−1})}) · · · (I − ν · H^{(Ŝ_1)})

e.g. L2Boosting with pairwise splines for nonparametric interaction modeling

[Figure: MSE (≈ 4–7) against boosting iterations (100–500) for MARS and L2Boost with pairwise splines; AIC_c-stopped L2Boost marked; p = 20, p_eff = 10, n = 50]

SLIDE 37

More on degrees of freedom

example: L2Boosting with componentwise smoothing splines for additive modeling

boosting hat-matrix B_m: since f̂(X_i) = Σ_{j=1}^p f̂_j(X_i), decompose

B_m = Σ_{j=1}^p A_m^{(j)}, A_m^{(j)} = hat-matrix for f̂_j(·)

easy to compute recursively:

A_m^{(j)} = A_{m−1}^{(j)} + δ_{j,Ŝ_m} ν · H^{(Ŝ_m)} (I − B_{m−1})

thus

d.f. = trace(B_m) = Σ_{j=1}^p d.f.^{(j)}, d.f.^{(j)} = trace(A_m^{(j)})

SLIDE 38

Y = Σ_{j=1}^{10} g_j(X^{(j)}) + ε, X ∼ Unif([0, 1]^{100}); n = 200, p = 100, p_eff = 10

[Figure: 12 panels of fitted additive components against their predictors, with assigned degrees of freedom df = 3.5, 2.7, 0, 2.6, 4.9, 5.3, 6.9, 8.2, 6.3, 6.4, 0.9, 2.1]

L2Boosting does a “very reasonable” assignment of degrees of freedom

SLIDE 39

a very interesting way to search and estimate in high dimensions!

with classical methods (backfitting) for large p: "infeasible" to do variable selection and a variable amount of d.f.

L2Boosting runs with one (!) tuning parameter

SLIDE 40

for standard errors in additive modelling:

s.e.(f̂_j(X_i)) = sqrt( σ̂²_ε (A_m^{(j)} (A_m^{(j)})^T)_{ii} ), A_m^{(j)} = hat matrix for the jth component

in our experience: seems quite OK; maybe slightly too small because we ignore the selection effect

for comparing models: use AIC, BIC, gMDL, etc.

SLIDE 41

4.2. The MSE curve and asymptotic optimality

toy example: L2Boosting with smoothing spline for a p = 1-dimensional predictor

[Figure: generalization squared error against boosting iterations m (left) and against varying degrees of freedom (right)]

sub-linear increase of the MSE in boosting

L2Boosting is quite resistant against overfitting; "easy to tune"

SLIDE 42

consider (any) base procedure as an operator:

H: Y = (Y_1, . . . , Y_n)′ → Ŷ = (Ŷ_1, . . . , Ŷ_n)′

L2Boosting operator in iteration m: B_m = I − (I − H)^m

if H is strictly shrinking, i.e. ∥I − H∥ < 1:
L2Boosting converges to the identity I (fully saturated model): need for early stopping

SLIDE 43

in the case where H is a smoothing spline:

L2Boosting does shrinkage in the same eigenspace as the smoothing spline H

eigenvalues of the smoothing spline: λ_1 = λ_2 = 1, 0 < λ_i < 1 (i = 3, . . . , n)

eigenvalues of L2Boosting: ev_1 = ev_2 = 1, 0 < ev_i = 1 − (1 − λ_i)^m < 1 (i = 3, . . . , n)

change these eigenvalues (the spectrum) by varying the iteration number m; tuning via m leads to a sub-linear increase of the MSE w.r.t. m
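The spectral identity is easy to verify numerically. A toy check with a made-up symmetric smoother (illustrative; a real smoothing spline would additionally have its two leading eigenvalues equal to 1):

```python
import numpy as np

# toy symmetric smoother H with spectrum lam in (0, 1]
rng = np.random.default_rng(0)
n = 20
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))   # random orthonormal eigenbasis
lam = np.linspace(0.05, 1.0, n)
H = Q @ np.diag(lam) @ Q.T

# L2Boosting operator after m iterations: B_m = I - (I - H)^m
m = 10
B_m = np.eye(n) - np.linalg.matrix_power(np.eye(n) - H, m)

# its eigenvalues are 1 - (1 - lam_i)^m, in the same eigenbasis as H
ev_boost = np.sort(np.linalg.eigvalsh(B_m))
ev_theory = np.sort(1 - (1 - lam) ** m)
```

Increasing m pushes every eigenvalue 1 − (1 − λ_i)^m towards 1, which is exactly why boosting forever converges to the saturated fit.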

SLIDE 44

Theorem (PB & Yu, 2003)
L2Boosting with smoothing splines having any fixed degrees of freedom ("low variance"):
  • when stopping the iterations suitably, it achieves asymptotically the optimal minimax MSE rate (over a Sobolev space)
  • it adapts to unknown greater smoothness of the underlying function (adaptation to the optimal MSE rate)

e.g. L2Boost with cubic smoothing splines automatically achieves a faster rate than O(n^{−4/5}) if the underlying function is smooth enough

SLIDE 45

Summary about (L2-)Boosting

  • need for early stopping: "obvious", but was still debated in 2000
  • choose the base procedure to obtain the qualitative model fit of your own "choice"; having decided on the structure, use the low variance, high estimation bias "principle"
  • reasonable degrees of freedom and hat-matrices can be easily derived for L2Boosting with base procedures involving linear fitting after selection of variables
  • non-linear boosting algorithms: all this applies also to boosting with other loss functions

SLIDE 46

5. Boosting for binary classification

binary lymph node classification using gene expressions: data (X_i, Y_i), X_i ∈ R^{7129}, Y_i ∈ {−1, 1}

Various loss functions (with p(x) = P[Y = 1 | X = x]):

ρ(y, f) = log₂(1 + exp(−yf)): negative binomial log-likelihood; f*(x) = log(p(x) / (1 − p(x)))

ρ(y, f) = |y − f|² = 1 − 2yf + (yf)²: squared error; f*(x) = E[Y | X = x] = 2p(x) − 1

ρ(y, f) = exp(−yf): exponential loss in AdaBoost; f*(x) = (1/2) log(p(x) / (1 − p(x)))

ρ(y, f) = 1_{[yf<0]}: misclassification loss; f*(x) = sign(p(x) − 1/2)
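All four losses depend on (y, f) only through the margin yf, so they can be written as one-argument functions; a minimal sketch:

```python
import numpy as np

# margin-based losses rho(y, f) = rho(y * f) for y in {-1, +1}
def loss_01(margin):       # misclassification loss
    return (np.asarray(margin) < 0).astype(float)

def loss_exp(margin):      # exponential loss (AdaBoost)
    return np.exp(-np.asarray(margin))

def loss_loglik(margin):   # negative binomial log-likelihood, base 2
    return np.log2(1 + np.exp(-np.asarray(margin)))

def loss_l2(margin):       # squared error: |y - f|^2 = (1 - yf)^2 when y^2 = 1
    return (1 - np.asarray(margin)) ** 2
```

Each surrogate dominates the 0-1 loss pointwise, which is what makes them usable as convex upper bounds on the misclassification error.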

SLIDE 47

all these loss functions: ρ(y, f) = ρ(yf), a function of the margin value yf

[Figure: losses against the margin yf; monotone: exponential, log-likelihood, SVM, 0-1; non-monotone: L2, L1, 0-1]

minimization of the non-convex misclassification loss: computationally infeasible

other loss functions: convex surrogate loss functions, dominating the misclassification error

SLIDE 48

Buja, Stuetzle and Shen (2005): all these surrogate loss functions are "proper": almost no difference from an asymptotic point of view

my favourite: the log-likelihood
  • monotone
  • approximately linear for large negative margin values yf
SLIDE 49

5.1. LogitBoost

(Friedman, Hastie & Tibshirani, 2000)

algorithm: FGD with the negative log-likelihood, using the Hessian instead of a line-search

iterative weighted LS fitting: in iteration m, minimize

n⁻¹ Σ_{i=1}^n w_i ( (Y_i − p̂_{m−1}(X_i)) / (p̂_{m−1}(X_i)(1 − p̂_{m−1}(X_i))) − θ(X_i) )², w_i = p̂_{m−1}(X_i)(1 − p̂_{m−1}(X_i))

since f*(x) = log(p(x) / (1 − p(x))): f̂_m(·) is an estimate of the log-odds ratio

examples:
  • componentwise weighted linear LS: logistic linear model fit
  • weighted componentwise smoothing splines: logistic additive model fit
  • weighted stumps: logistic additive model fit

works quite nicely for high-dimensional logistic linear or additive or low-order interaction models
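A rough sketch of the weighted-LS iteration with componentwise linear least squares (illustrative only: Y is coded {0, 1} here, the ν-shrinkage is added by analogy with L2Boosting, and the exact factor conventions of Friedman, Hastie & Tibshirani's LogitBoost are not reproduced):

```python
import numpy as np

def logitboost(X, Y, m_stop=100, nu=0.1):
    """LogitBoost-style FGD: Newton weights w = p(1-p), working response
    z = (Y - p) / w, componentwise weighted linear LS base procedure.
    F estimates the log-odds; Y must be in {0, 1}."""
    n, p = X.shape
    F = np.zeros(n)
    beta = np.zeros(p)
    for _ in range(m_stop):
        prob = 1.0 / (1.0 + np.exp(-F))
        w = np.clip(prob * (1 - prob), 1e-8, None)    # Newton weights
        Z = (Y - prob) / w                            # working response
        b = (w * Z) @ X / (w @ X ** 2)                # weighted componentwise LS
        wrss = ((Z[:, None] - X * b) ** 2 * w[:, None]).sum(axis=0)
        S = int(wrss.argmin())
        beta[S] += nu * b[S]
        F += nu * b[S] * X[:, S]
    return beta, F
```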

SLIDE 50

6. Boosting in survival analysis

acute myeloid leukemia (AML) study from Bullinger et al., 2004: survival times of n = 116 patients; 68 died during the study period

p = 155 predictors: 8 clinical variables, 147 gene expression levels

full data: survival time T_i ∈ R₊, predictor X_i ∈ R^p; we use here Y_i = log(T_i)

full data loss function: ρ(y, f) = (y − f)²

observed data: O_i = (Ỹ_i, X_i, Δ_i), Ỹ_i = log(T̃_i), T̃_i = min(T_i, C_i), censoring indicator Δ_i = 1_{[T_i ≤ C_i]}

assume: the censoring time C_i is conditionally independent of T_i given X_i, so the coarsening at random assumption holds

SLIDE 51

inverse probability censoring weights and the observed data loss: define the observed data loss

ρ_obs(o, f) = (ỹ − f)² · Δ / G(t̃ | x), with the inverse probability G(c | x) = P[C > c | X = x]

then (van der Laan & Robins, 2003):

E_{Y,X}[(Y − f(X))²] = E_O[ρ_obs(O, f)]

strategy: estimate G(· | x), e.g. by Kaplan-Meier, and do boosting on the weighted squared error loss:

Σ_{i=1}^n w_i (Ỹ_i − f(X_i))², w_i = Δ_i / Ĝ(T̃_i | X_i), Ỹ_i = log(min(C_i, T_i))
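A sketch of the weight construction (illustrative only: an unconditional Kaplan-Meier estimate of the censoring survivor function, ignoring x and ties, stands in for Ĝ(· | x)):

```python
import numpy as np

def censoring_km(t_obs, delta):
    """Kaplan-Meier estimate of the censoring survivor function
    G(c) = P[C > c], treating censorings (delta == 0) as the 'events'."""
    order = np.argsort(t_obs)
    n = len(t_obs)
    G = np.empty(n)
    surv = 1.0
    for rank, i in enumerate(order):
        if delta[i] == 0:                    # a censoring event
            surv *= 1.0 - 1.0 / (n - rank)   # n - rank individuals at risk
        G[i] = surv
    return G

def ipc_weights(t_obs, delta):
    """Inverse probability of censoring weights w_i = Delta_i / G_hat(T~_i);
    censored observations get weight zero."""
    G = censoring_km(t_obs, delta)
    return np.where(delta == 1, 1.0 / np.clip(G, 1e-8, None), 0.0)
```

With these weights, any weighted-least-squares boosting routine can be run on (X_i, Ỹ_i) as if the data were uncensored.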

SLIDE 52

we did componentwise weighted linear least squares: a linear fit of the regression function f(·)

M: location model; RF: random forest for survival data; L2B: L2Boosting; cRF: RF with the 8 clinical variables only; cL2B: L2B with the 8 clinical variables only

SLIDE 53

not possible to do the Henderson et al. (2001) loss:

ρ(T, f) = 1 − 1_{[T/2 ≤ f ≤ 2T]} ⇔ ρ(y, f) = 1_{[|y−f| > log(2)]}

which is non-convex...!

SLIDE 54

in many real applications: main interest is finding the relevant variables (and prediction is of “minor” importance)

  • tumor classification based on gene expression: which genes are important?
  • Bullinger et al. survival study: which genes and variables are important?
  • riboflavin concentration (vitamin B2) produced by Bacillus subtilis

which genes are important? (in collaboration with DSM)

SLIDE 55

7. Variable selection and additional sparsity

is boosting a good variable selection method? the analogy with the Lasso for linear models

consider again the linear model (or a highly overcomplete dictionary):

Y = f(X) + ε, f(x) = Σ_{j=1}^p β_j x^{(j)}, p ≫ n

Lasso or ℓ1-penalized regression (Tibshirani, 1996):

β̂_Lasso = argmin_β n⁻¹ Σ_{i=1}^n (Y_i − Σ_{j=1}^p β_j X_i^{(j)})² + λ Σ_{j=1}^p |β_j|, λ ≥ 0 the penalty parameter
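The penalized criterion above can be minimized by cyclic coordinate descent with soft-thresholding; a minimal solver sketch (illustrative, assuming centered data and a fixed number of sweeps rather than a convergence check):

```python
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, Y, lam, n_sweeps=200):
    """Lasso by cyclic coordinate descent, minimizing
    n^{-1} ||Y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    R = Y.copy()                           # residual Y - X beta
    for _ in range(n_sweeps):
        for j in range(p):
            R += X[:, j] * beta[j]         # partial residual without coord j
            z = X[:, j] @ R / n
            beta[j] = soft_threshold(z, lam / 2) / ((X[:, j] ** 2).sum() / n)
            R -= X[:, j] * beta[j]
    return beta
```

The exact zeros produced by the soft-threshold are the variable selection; the subtraction of lam/2 inside it is the shrinkage.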

SLIDE 56

Lasso:
  • does variable selection: some (many) β̂_j's exactly equal to 0
  • does shrinkage
  • involves a convex optimization only (instead of exhaustively checking 2^p sub-models)

SLIDE 57

Some theory for high dimensions

Theorem (Meinshausen & PB, 2004). For λ_n ∼ C n^{−1/2+δ/2} (0 < δ < 1):

P[estimated sub-model(λ_n) = true model] = 1 − O(exp(−C n^δ)) (n → ∞)

if
  • Gaussian data
  • p = p_n = O(n^r) for any r > 0 (high-dimensional)
  • number of effective variables p_eff = O(n^κ) (0 < κ < 1) (sparseness)
  • plus some other technical conditions

justification for the relaxation with a computationally simple convex problem!

SLIDE 58

Choice of λ

the Theorem doesn't say much about choosing λ...

first (not so good) idea: choose λ to optimize prediction, e.g. via some cross-validation scheme

but: for the prediction oracle solution

λ* = argmin_λ E[(Y − Σ_{j=1}^p β̂_j(λ) X^{(j)})²]:

P[estimated sub-model(λ*) = true model] → 0 (p_n → ∞, n → ∞)

asymptotically: the prediction-optimal model is too large (Meinshausen & PB, 2004; related example by Meng et al., 2004)

SLIDE 59

reason: variable selection needs a large λ, i.e. strong bias/strong shrinkage; for orthogonal design: strong bias in soft-thresholding

[Figure: threshold functions: hard-thresholding, nn-garrote, soft-thresholding]

Better:
  • SCAD (Fan and Li, 2001)
  • Nonnegative Garrote (Breiman, 1995)
  • Bridge estimation (Frank and Friedman, 1993)

they all work for general X; for non-orthogonal X:
  • non-convex optimization for SCAD or Bridge estimation
  • NN-Garrote only for p ≤ n
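The three threshold functions in the figure (for the orthogonal-design case) are easy to write down; a minimal sketch:

```python
import numpy as np

def hard_threshold(z, lam):
    """Keep z unchanged beyond the threshold, zero inside."""
    z = np.asarray(z, dtype=float)
    return z * (np.abs(z) > lam)

def soft_threshold(z, lam):
    """Lasso-type: every surviving coefficient is shrunken by lam
    (constant bias, even for large |z|)."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def nn_garrote(z, lam):
    """Nonnegative garrote: z * (1 - lam^2 / z^2)_+ ; the bias
    lam^2 / |z| vanishes for large |z|."""
    z = np.asarray(z, dtype=float)
    out = np.zeros_like(z)
    nz = z != 0
    out[nz] = z[nz] * np.maximum(1.0 - (lam / z[nz]) ** 2, 0.0)
    return out
```

Comparing the three at a large input shows the point of the slide: soft-thresholding keeps a constant bias, while hard-thresholding and the garrote do not (or only asymptotically).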

SLIDE 60

The good message: the Lasso produces a set of sub-models

M_1 ⊂ . . . ⊂ M_pred-opt ⊂ . . . ⊂ M_N

(M_pred-opt: optimal for prediction with the Lasso), with N = O(min(n, p)); M_true is, with probability 1 − O(exp(−C n^δ)), among these models, but M_true ≠ M_pred-opt

Solutions using this "good message":
  • relaxed Lasso (Meinshausen, 2005): a second round of Lasso on the selected sub-models; but surprisingly: computationally no need to do a second round of Lasso fitting
  • BIC-scoring for the selected sub-models (?)

SLIDE 61

8. SparseL2Boosting

(PB and Yu, 2005)

instead of minimizing the RSS in every iteration, minimize a final prediction error (FPE) criterion; we propose gMDL:

θ̂_m = argmin_{θ(·)} Σ_{i=1}^n (Y_i − f̂_{m−1}(X_i) − θ(X_i))² + gMDL-penalty

(or AIC, BIC, ...): another use of the degrees of freedom
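A sketch of the idea for componentwise linear LS (illustrative only: a BIC-type penalty on trace(B_m) stands in for the gMDL criterion, and the same criterion is reused as a stopping rule):

```python
import numpy as np

def sparse_l2boost(X, Y, m_stop=200, nu=0.1):
    """SparseL2Boosting sketch: in each iteration select the predictor
    minimizing a penalized criterion (BIC-type here) instead of raw RSS,
    and stop once no candidate improves the criterion."""
    n, p = X.shape
    I = np.eye(n)
    B = np.zeros((n, n))
    beta = np.zeros(p)
    H = [np.outer(X[:, j], X[:, j]) / (X[:, j] ** 2).sum() for j in range(p)]

    def crit(Bmat):
        rss = ((Y - Bmat @ Y) ** 2).sum()
        return n * np.log(rss / n) + np.log(n) * np.trace(Bmat)

    for _ in range(m_stop):
        U = Y - B @ Y
        cand = [B + nu * H[j] @ (I - B) for j in range(p)]
        scores = np.array([crit(Bj) for Bj in cand])
        S = int(scores.argmin())
        if scores[S] >= crit(B):        # no penalized improvement: stop
            break
        beta[S] += nu * (X[:, S] @ U) / (X[:, S] ** 2).sum()
        B = cand[S]
    return beta
```

Because every candidate step is charged for its degrees-of-freedom increase, predictors whose RSS reduction is at noise level are never selected, which is what makes the solution sparser than plain L2Boosting.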

SLIDE 62

Theorem (PB & Yu, 2005). For the orthonormal linear model: SparseL2Boosting with componentwise linear least squares yields Breiman's nonnegative garrote estimator.

[Figure: threshold functions: hard-thresholding, nn-garrote, soft-thresholding]

  • SparseL2Boosting yields sparser solutions than L2Boosting
  • SparseL2Boosting is still very generic (although less generic than L2Boosting), e.g. for nonparametric problems and non-quadratic loss functions
62

slide-63
SLIDE 63

✬ ✫ ✩ ✪

Linear modeling: L2Boosting with componentwise linear LS sample size n = 50, dimension p = 50 model SparseL2Boosting

L2Boosting

Y = 1 + 5X(1) + 2X(2) + X(3) + N (0, 1) X = (X(1), . . . , X(49)) ∼ N49(0, I) MSE 0.16 (0.0018) 0.46 (0.0041)

I E[no. of seleccted variables]

5 13.68 Y = P50

j=1 βjX(j) + N (0, 1)

β1, . . . , β50 ∼ Double-Exponential; X as above MSE 3.64 (0.188) 2.19 (0.083)

SLIDE 64

Nonparametric first-order interaction modeling

Friedman #1 model:

Y = 10 sin(π X_1 X_2) + 20 (X_3 − 0.5)² + 10 X_4 + 5 X_5 + N(0, 1), X = (X_1, . . . , X_20) ∼ Unif([0, 1]^20)

sample size n = 50, dimension p = 20, p_eff = 5

[Figure: MSE (≈ 2–7) against boosting iterations (100–500) for L2Boosting, SparseL2Boosting and MARS]

SLIDE 65

Riboflavin concentration in Bacillus subtilis

Y_i ∈ R: log-concentration of riboflavin (vitamin B2); X_i: p = 6939 gene expressions; sample size n = 89

[Figure: 4 scatterplots of log-concentration against log-expression for selected genes]

L2Boosting with componentwise linear least squares: selected 41 genes
SparseL2Boosting with componentwise linear least squares: selected 21 genes
15 genes are in common

note the identifiability problem due to high correlations among genes!

quite a few other measurements are available for this dataset... (in collaboration with DSM)

SLIDE 66

9. Conclusions

statistical view of boosting: a regularization method for estimation and variable selection, mainly useful for high-dimensional data problems

  • boosting is very generic
  • boosting is computationally attractive: complexity O(p) for p ≫ n
  • simple statistical inference is possible, but more needs to be done
