SLIDE 1

Ensemble and Boosting Algorithms

Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net 2019 CS420, Machine Learning, Lecture 6

http://wnzhang.net/teaching/cs420/index.html

SLIDE 2

Content of this lecture

  • Ensemble Methods
  • Bagging
  • Random Forest
  • AdaBoost
  • Gradient Boosting Decision Trees
SLIDE 4

Ensemble Learning

  • Consider a set of predictors f1, …, fL
  • Different predictors have different performance across the data
  • Idea: construct a predictor F(x) that combines the individual decisions of f1, …, fL
    • E.g., have the member predictors vote
    • E.g., use different members for different regions of the data space
  • Works well if each member has a low error rate
  • Successful ensembles require diversity
    • Predictors should make different mistakes
    • Encourage diversity by involving different types of predictors

SLIDE 5

Ensemble Learning

  • Although complex, ensemble learning probably offers the most sophisticated output and the best empirical performance!

[Diagram: data x → single models f1(x), f2(x), …, fL(x) → ensemble → output F(x)]

SLIDE 6

Practical Application in Competitions

  • Netflix Prize Competition
    • Task: predict a user’s rating on a movie, given some users’ ratings on some movies
    • Called ‘collaborative filtering’ (we will have a lecture about it later)
  • Winner solution
    • BellKor’s Pragmatic Chaos – an ensemble of more than 800 predictors

[Yehuda Koren. The BellKor Solution to the Netflix Grand Prize. 2009.]

SLIDE 7

Practical Application in Competitions

  • KDD-Cup 2011 Yahoo! Music Recommendation
    • Task: predict a user’s rating on a piece of music, given some users’ ratings on some music
    • With music information like album, artist and genre IDs
  • Winner solution
    • From a graduate course at National Taiwan University – an ensemble of 221 predictors

SLIDE 8

Practical Application in Competitions

  • KDD-Cup 2011 Yahoo! Music Recommendation
    • Task: predict a user’s rating on a piece of music, given some users’ ratings on some music
    • With music information like album, artist and genre IDs
  • 3rd place solution
    • SJTU-HKUST joint team, an ensemble of 16 predictors

SLIDE 9

Combining Predictor: Averaging

  • Averaging for regression; voting for classification

[Diagram: data x → single models f1(x), f2(x), …, fL(x) → weighted sum with weights 1/L → output F(x)]

F(x) = \frac{1}{L} \sum_{i=1}^{L} f_i(x)

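As a quick illustration (a minimal sketch, not part of the original slides), averaging and voting over a list of already-trained predictors can be written directly; the .predict() interface is an assumption:

```python
import numpy as np

def average_regression(predictors, X):
    # F(x) = (1/L) * sum_i f_i(x)
    preds = np.stack([f.predict(X) for f in predictors], axis=0)  # shape (L, n)
    return preds.mean(axis=0)

def majority_vote(predictors, X):
    # Each f_i predicts a label in {-1, +1}; take the sign of the sum.
    preds = np.stack([f.predict(X) for f in predictors], axis=0)
    return np.sign(preds.sum(axis=0))
```
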
SLIDE 10

Combining Predictor: Weighted Avg

  • Just like linear regression or classification
  • Note: the single models are not updated when training the ensemble model

[Diagram: data x → single models f1(x), f2(x), …, fL(x) → weighted sum with weights w1, w2, …, wL → output F(x)]

F(x) = \sum_{i=1}^{L} w_i f_i(x)

SLIDE 11

Combining Predictor: Gating

  • Just like linear regression or classification
  • Note: the single models are not updated when training the ensemble model

[Diagram: data x → single models f1(x), f2(x), …, fL(x) → gating function g(x) produces weights g1, g2, …, gL → weighted sum → output F(x)]

F(x) = \sum_{i=1}^{L} g_i f_i(x), \qquad \text{e.g. } g_i = \mu_i^\top x

  • Design different learnable gating functions

SLIDE 12

Combining Predictor: Gating

  • Just like linear regression or classification
  • Note: the single models are not updated when training the ensemble model

[Diagram: data x → single models f1(x), f2(x), …, fL(x) → gating function g(x) produces weights g1, g2, …, gL → weighted sum → output F(x)]

F(x) = \sum_{i=1}^{L} g_i f_i(x), \qquad \text{e.g. } g_i = \frac{\exp(w_i^\top x)}{\sum_{j=1}^{L} \exp(w_j^\top x)}

  • Design different learnable gating functions

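A minimal sketch of softmax gating over fixed base predictors (the weight matrix W and the .predict() interface are assumptions, not from the slides):

```python
import numpy as np

def softmax_gating_predict(W, predictors, X):
    scores = X @ W.T                                   # (n, L) gating scores w_i^T x
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    g = np.exp(scores)
    g /= g.sum(axis=1, keepdims=True)                  # g_i(x), rows sum to 1
    F = np.stack([f.predict(X) for f in predictors], axis=1)  # (n, L) base outputs
    return (g * F).sum(axis=1)                         # F(x) = sum_i g_i(x) f_i(x)
```
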
SLIDE 13

Combining Predictor: Stacking

  • This is the general formulation of an ensemble

[Diagram: data x → single models f1(x), f2(x), …, fL(x) → ensemble model g(f1, f2, …, fL) → output F(x)]

F(x) = g(f_1(x), f_2(x), \ldots, f_L(x))

SLIDE 14

Combining Predictor: Multi-Layer

  • Use neural networks as the ensemble model

[Diagram: data x → single models f1(x), f2(x), …, fL(x) → Layer 1 → Layer 2 → output F(x)]

h = \tanh(W_1 f + b_1), \qquad F(x) = \sigma(W_2 h + b_2)

SLIDE 15

Combining Predictor: Multi-Layer

  • Use neural networks as the ensemble model
  • Incorporate x into the first hidden layer (as gating)

[Diagram: data x → single models f1(x), f2(x), …, fL(x) → Layer 1 (takes [f; x]) → Layer 2 → output F(x)]

h = \tanh(W_1 [f; x] + b_1), \qquad F(x) = \sigma(W_2 h + b_2)

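A minimal sketch of the two-layer neural combiner above (shapes and parameter names are assumptions, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neural_combiner_forward(W1, b1, W2, b2, base_preds, X):
    # base_preds: (n, L) outputs of the fixed single models; X: (n, d) raw features
    fx = np.concatenate([base_preds, X], axis=1)   # [f; x]
    h = np.tanh(fx @ W1.T + b1)                    # hidden layer, h = tanh(W1 [f; x] + b1)
    return sigmoid(h @ W2.T + b2).ravel()          # ensemble output F(x) = sigma(W2 h + b2)
```
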
SLIDE 16

Combining Predictor: Tree Models

  • Use decision trees as the ensemble model
  • Splitting according to the values of the f’s and x

[Diagram: data x → single models f1(x), f2(x), …, fL(x) → decision tree ensemble model → output F(x). Example tree: root node splits on f1(x) < a1, intermediate nodes split on f2(x) < a2 and x2 < a3, and leaf nodes predict y = ±1]

SLIDE 17

Diversity for Ensemble Input

  • Successful ensembles require diversity
    • Predictors may make different mistakes
  • Encourage diversity:
    • involve different types of predictors
    • vary the training sets
    • vary the feature sets

  Cause of the Mistake      Diversification Strategy
  Pattern was difficult     Try different models
  Overfitting               Vary the training sets
  Some features are noisy   Vary the set of input features

[Based on slide by Leon Bottou]

SLIDE 18

Content of this lecture

  • Ensemble Methods
  • Bagging
  • Random Forest
  • AdaBoost
  • Gradient Boosting Decision Trees
SLIDE 19

Manipulating the Training Data

  • Bootstrap replication
    • Given n training samples Z, construct a new training set Z* by sampling n instances with replacement
    • Each replicate excludes about 37% of the training instances:

P\{\text{observation } i \in \text{bootstrap sample}\} = 1 - \Big(1 - \frac{1}{N}\Big)^N \simeq 1 - e^{-1} = 0.632

  • Bagging (Bootstrap Aggregating)
    • Create bootstrap replicates of the training set
    • Train a predictor on each replicate
    • Validate each predictor using its out-of-bootstrap data
    • Average the outputs of all predictors
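
A minimal sketch of drawing one bootstrap replicate and its out-of-bootstrap validation set (the numpy setup is an assumption, not from the slides):

```python
import numpy as np

def bootstrap_replicate(n, rng=np.random.default_rng(0)):
    idx = rng.integers(0, n, size=n)          # sample n instances with replacement
    oob = np.setdiff1d(np.arange(n), idx)     # ~37% of instances are left out
    return idx, oob

# Example: for n = 1000, len(oob) is typically around 368, i.e. (1 - 0.632) * n.
```
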
SLIDE 20

Bootstrap

  • Basic idea
    • Randomly draw datasets with replacement from the training data
    • Each replicate has the same size as the training set
    • Evaluate any statistic S(·) over the replicates
  • For example, the variance

\widehat{\mathrm{Var}}[S(Z)] = \frac{1}{B-1} \sum_{b=1}^{B} \big(S(Z^{*b}) - \bar{S}^*\big)^2

SLIDE 21

Bootstrap

  • Basic idea
    • Randomly draw datasets with replacement from the training data
    • Each replicate has the same size as the training set
    • Evaluate any statistic S(·) over the replicates
  • For example, the model error

\widehat{\mathrm{Err}}_{\mathrm{boot}} = \frac{1}{B} \frac{1}{N} \sum_{b=1}^{B} \sum_{i=1}^{N} L\big(y_i, \hat f^{*b}(x_i)\big)

SLIDE 22

Bootstrap for Model Evaluation

  • If we directly evaluate the model using the whole training data

\widehat{\mathrm{Err}}_{\mathrm{boot}} = \frac{1}{B} \frac{1}{N} \sum_{b=1}^{B} \sum_{i=1}^{N} L\big(y_i, \hat f^{*b}(x_i)\big)

  • As the probability of a data instance appearing in the bootstrap sample is

P\{\text{observation } i \in \text{bootstrap sample}\} = 1 - \Big(1 - \frac{1}{N}\Big)^N \simeq 1 - e^{-1} = 0.632

  • If we validate on the training data, the error estimate is very likely to overfit
  • For example, in a binary classification problem where y is indeed independent of x:
    • Correct error rate: 0.5
    • Above bootstrap error rate: 0.632 × 0 + (1 − 0.632) × 0.5 = 0.184

SLIDE 23

Leave-One-Out Bootstrap

  • Build bootstrap replicates that leave instance i out, then evaluate the model on instance i

\widehat{\mathrm{Err}}^{(1)} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|C^{-i}|} \sum_{b \in C^{-i}} L\big(y_i, \hat f^{*b}(x_i)\big)

  • C^{-i} is the set of indices of the bootstrap samples b that do not contain instance i
  • For some instance i, the set C^{-i} could be empty; just ignore such cases
  • We shall come back to model evaluation and selection in later lectures.

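A minimal sketch of the leave-one-out bootstrap error estimate (the `fit` and `loss` callables are assumptions, not from the slides):

```python
import numpy as np

def loo_bootstrap_error(X, y, fit, loss, B=100, rng=np.random.default_rng(0)):
    # fit(X, y) returns a model with .predict(); loss(y_true, y_pred) is a scalar loss
    n = len(y)
    samples = [rng.integers(0, n, size=n) for _ in range(B)]
    models = [fit(X[idx], y[idx]) for idx in samples]
    errs = []
    for i in range(n):
        # C^{-i}: bootstrap samples that do not contain instance i
        c_minus_i = [b for b in range(B) if i not in samples[b]]
        if not c_minus_i:                      # empty set: ignore this instance
            continue
        preds = [models[b].predict(X[i:i + 1])[0] for b in c_minus_i]
        errs.append(np.mean([loss(y[i], p) for p in preds]))
    return float(np.mean(errs))
```
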
SLIDE 24

Bootstrap for Model Parameters

  • Sec. 8.4 of Hastie et al., The Elements of Statistical Learning, 2008
  • The bootstrap mean is approximately a posterior average.

SLIDE 25

Bagging: Bootstrap Aggregating

  • Bootstrap replication
    • Given n training samples Z = {(x1,y1), (x2,y2), …, (xn,yn)}, construct a new training set Z* by sampling n instances with replacement
    • Construct B bootstrap samples Z^{*b}, b = 1, 2, …, B
  • Train a set of predictors \hat f^{*1}(x), \hat f^{*2}(x), \ldots, \hat f^{*B}(x)
  • Bagging averages the predictions

\hat f_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat f^{*b}(x)

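A minimal sketch of bagging with regression trees (the sklearn base learner and the choice of B are assumptions, not from the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagging(X, y, B=50, rng=np.random.default_rng(0)):
    n = len(y)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)              # bootstrap sample Z^{*b}
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def predict_bagging(models, X):
    # f_bag(x) = (1/B) * sum_b f^{*b}(x)
    return np.mean([m.predict(X) for m in models], axis=0)
```
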
SLIDE 26

[Figure: Fig. 8.2 of Hastie et al., The Elements of Statistical Learning. Panels: B-spline smooth of the data; B-spline smooth plus and minus 1.96× standard error bands; ten bootstrap replicates of the B-spline smooth; B-spline smooth with 95% standard error bands computed from the bootstrap distribution.]

SLIDE 27

[Figure: Fig. 8.9 of Hastie et al., The Elements of Statistical Learning. Bagging trees on a simulated dataset: the top-left panel shows the original tree; five trees grown on bootstrap samples are shown, with the top split of each tree annotated.]

SLIDE 28

[Figure: Fig. 8.10 of Hastie et al., The Elements of Statistical Learning. For classification bagging: consensus vote vs. class probability averaging.]

SLIDE 29

Why Bagging Works

  • Bias-Variance Decomposition
    • Assume Y = f(X) + \epsilon, where E[\epsilon] = 0 and \mathrm{Var}[\epsilon] = \sigma_\epsilon^2
    • Then the expected prediction error at an input point x_0 is

\mathrm{Err}(x_0) = E[(Y - \hat f(x_0))^2 \mid X = x_0]
                  = \sigma_\epsilon^2 + \big[E[\hat f(x_0)] - f(x_0)\big]^2 + E\big[(\hat f(x_0) - E[\hat f(x_0)])^2\big]
                  = \sigma_\epsilon^2 + \mathrm{Bias}^2(\hat f(x_0)) + \mathrm{Var}(\hat f(x_0))

  • Bagging works by reducing the variance while keeping the same bias as the original model (trained over the whole data)
  • It works especially well with low-bias, high-variance prediction models

SLIDE 30

Content of this lecture

  • Ensemble Methods
  • Bagging
  • Random Forest
  • AdaBoost
  • Gradient Boosting Decision Trees
SLIDE 31

The Problem of Bagging

  • If B variables (each with variance σ²) are i.d. (identically distributed but not necessarily independent) with positive pairwise correlation ρ, the variance of their average is

\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2

    which reduces to ρσ² as the number of bootstrap samples B goes to infinity
  • Bagging works by reducing the variance while keeping the same bias as the original model (trained over the whole data)
  • It works especially well with low-bias, high-variance prediction models

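For completeness, a standard expansion (not shown on the slide) gives this variance of the average of B identically distributed variables X_1, …, X_B with pairwise correlation ρ:

\mathrm{Var}\Big(\frac{1}{B}\sum_{b=1}^{B} X_b\Big)
  = \frac{1}{B^2}\Big(\sum_{b=1}^{B}\mathrm{Var}(X_b) + \sum_{b \neq b'}\mathrm{Cov}(X_b, X_{b'})\Big)
  = \frac{1}{B^2}\big(B\sigma^2 + B(B-1)\rho\sigma^2\big)
  = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2
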
SLIDE 32

The Problem of Bagging

  • Problem: the models trained from bootstrap samples are probably positively correlated
  • Bagging works by reducing the variance while keeping the same bias as the original model (trained over the whole data)
  • It works especially well with low-bias, high-variance prediction models

\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2

SLIDE 33

Random Forest

  • Breiman, Leo. "Random Forests." Machine Learning 45.1 (2001): 5-32.
  • Random forest is a substantial modification of bagging that builds a large collection of de-correlated trees and then averages them.

[Image credit: https://i.ytimg.com/vi/-bYrLRMT3vY/maxresdefault.jpg]

SLIDE 34

Tree De-correlation in Random Forest

  • Before each tree node split, select m ≤ p variables at random as candidates for splitting
  • Typically m = \sqrt{p}, or even as low as 1

[Diagram: m ranges from p (all p variables considered) down to 1 (a completely random tree)]

SLIDE 35

Random Forest Algorithm

  • For b = 1 to B:
    a) Draw a bootstrap sample Z* of size n from the training data
    b) Grow a random-forest tree Tb on the bootstrap data, by recursively repeating the following steps for each leaf node of the tree, until the minimum node size is reached:
       I. Select m variables at random from the p variables
       II. Pick the best variable and split-point among the m
       III. Split the node into two child nodes
  • Output the ensemble of trees {Tb}, b = 1, …, B
  • To make a prediction at a new point x:
    • Regression: prediction average  \hat f_{\mathrm{rf}}^B(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)
    • Classification: majority voting  \hat C_{\mathrm{rf}}^B(x) = \text{majority vote } \{\hat C_b(x)\}_{b=1}^{B}

[Algorithm 15.1 of Hastie et al., The Elements of Statistical Learning.]

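A minimal sketch of the recipe above (sklearn decision trees are used as the base learner, with max_features performing the per-split variable sampling; B, m and integer class labels are assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(X, y, B=100, m="sqrt", rng=np.random.default_rng(0)):
    n = len(y)
    trees = []
    for b in range(B):
        idx = rng.integers(0, n, size=n)                 # bootstrap sample Z*
        tree = DecisionTreeClassifier(max_features=m,    # m variables per split
                                      random_state=b)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_random_forest(trees, X):
    votes = np.stack([t.predict(X) for t in trees], axis=0)   # (B, n)
    # majority vote over the B trees (assumes class labels 0, 1, ...)
    return np.array([np.bincount(votes[:, i].astype(int)).argmax()
                     for i in range(votes.shape[1])])
```
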
SLIDE 36

Performance Comparison

[Figure: Fig. 15.1 of Hastie et al., The Elements of Statistical Learning; 1536 test data instances.]

SLIDE 37

Performance Comparison

  • RF-m: m is the number of randomly selected variables for each split
  • Nested-spheres data:

Y = \begin{cases} 1 & \text{if } \sum_{j=1}^{10} X_j^2 > 9.34 \\ -1 & \text{otherwise} \end{cases}

[Figure: Fig. 15.2 of Hastie et al., The Elements of Statistical Learning.]

SLIDE 38

Content of this lecture

  • Ensemble Methods
  • Bagging
  • Random Forest
  • AdaBoost
  • Gradient Boosting Decision Trees
SLIDE 39

Bagging vs. Random Forest vs. Boosting

  • Bagging (bootstrap aggregating) simply treats each predictor trained on a bootstrap set with the same weight
  • Random forest tries to de-correlate the bootstrap-trained predictors (decision trees) by sampling features
  • Boosting strategically learns and combines the next predictor based on the previous predictors

SLIDE 40

Additive Models and Boosting

  • Strongly recommend:
    • Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. "Additive logistic regression: a statistical view of boosting." The Annals of Statistics 28.2 (2000): 337-407.

SLIDE 41

Additive Models

  • General form of an additive model

F(x) = \sum_{m=1}^{M} f_m(x), \qquad f_m(x) = \beta_m b(x; \gamma_m)

  • For a regression problem

F_M(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m)

  • Least-square learning of one predictor with the others fixed

\{\beta_m, \gamma_m\} \leftarrow \arg\min_{\beta, \gamma} E\Big[ y - \sum_{k \neq m} \beta_k b(x; \gamma_k) - \beta b(x; \gamma) \Big]^2

  • Stepwise least-square learning of one predictor with the previous ones fixed

\{\beta_m, \gamma_m\} \leftarrow \arg\min_{\beta, \gamma} E\Big[ y - F_{m-1}(x) - \beta b(x; \gamma) \Big]^2

SLIDE 42

Additive Regression Models

  • Least-square learning of one predictor with the others fixed

\{\beta_m, \gamma_m\} \leftarrow \arg\min_{\beta, \gamma} E\Big[ y - \sum_{k \neq m} \beta_k b(x; \gamma_k) - \beta b(x; \gamma) \Big]^2

    • Essentially, this additive learning is equivalent to modifying the original targets as

y_m \leftarrow y - \sum_{k \neq m} f_k(x)

  • Stepwise least-square learning of one predictor with the previous ones fixed

\{\beta_m, \gamma_m\} \leftarrow \arg\min_{\beta, \gamma} E\Big[ y - F_{m-1}(x) - \beta b(x; \gamma) \Big]^2

    • Essentially, this stepwise learning is equivalent to modifying the original targets as

y_m \leftarrow y - F_{m-1}(x) = y_{m-1} - f_{m-1}(x)

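A minimal sketch of stepwise additive regression by repeatedly fitting the residuals (depth-2 sklearn trees stand in for b(x; γ); this is an illustrative choice, not the slide's):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_additive_model(X, y, M=100):
    models, residual = [], y.astype(float)
    for _ in range(M):
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        residual -= tree.predict(X)       # y_m = y_{m-1} - f_{m-1}(x)
        models.append(tree)
    return models

def predict_additive_model(models, X):
    # F_M(x) = sum_m f_m(x)
    return np.sum([m.predict(X) for m in models], axis=0)
```
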
SLIDE 43

Additive Classification Models

  • For binary classification with y \in \{1, -1\} and F(x) = \sum_{m=1}^{M} f_m(x):

P(y = 1 \mid x) = \frac{e^{F(x)}}{1 + e^{F(x)}}, \qquad P(y = -1 \mid x) = \frac{1}{1 + e^{F(x)}}

  • The monotone logit transformation

\log \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = F(x)

SLIDE 44

AdaBoost

  • For binary classification, consider minimizing the criterion

J(F) = E[e^{-yF(x)}]

  • It is (almost) equivalent to the logistic cross-entropy loss for y = +1 and -1 labels:

L(y, x) = -\frac{1+y}{2} \log \frac{e^{F(x)}}{1 + e^{F(x)}} - \frac{1-y}{2} \log \frac{1}{1 + e^{F(x)}}
        = -\frac{1+y}{2} \Big( F(x) - \log(1 + e^{F(x)}) \Big) + \frac{1-y}{2} \log(1 + e^{F(x)})
        = -\frac{1+y}{2} F(x) + \log(1 + e^{F(x)})
        = \log \frac{1 + e^{F(x)}}{e^{\frac{1+y}{2} F(x)}}
        = \begin{cases} \log(1 + e^{F(x)}) & \text{if } y = -1 \\ \log(1 + e^{-F(x)}) & \text{if } y = +1 \end{cases}
        = \log(1 + e^{-yF(x)})

[The exponential criterion was proposed by Schapire and Singer (1998) as an upper bound on the misclassification error.]

SLIDE 45

AdaBoost: an Exponential Criterion

  • For binary classification, consider minimizing the criterion

J(F) = E[e^{-yF(x)}]

  • Solution

E[e^{-yF(x)}] = \int E[e^{-yF(x)} \mid x]\, p(x)\, dx

E[e^{-yF(x)} \mid x] = P(y = 1 \mid x) e^{-F(x)} + P(y = -1 \mid x) e^{F(x)}

\frac{\partial E[e^{-yF(x)} \mid x]}{\partial F(x)} = -P(y = 1 \mid x) e^{-F(x)} + P(y = -1 \mid x) e^{F(x)} = 0
\;\Rightarrow\; F(x) = \frac{1}{2} \log \frac{P(y = 1 \mid x)}{P(y = -1 \mid x)}

SLIDE 46

AdaBoost: an Exponential Criterion

  • Solution

\frac{\partial E[e^{-yF(x)} \mid x]}{\partial F(x)} = 0 \;\Rightarrow\; F(x) = \frac{1}{2} \log \frac{P(y = 1 \mid x)}{P(y = -1 \mid x)}
\;\Rightarrow\; P(y = 1 \mid x) = \frac{e^{2F(x)}}{1 + e^{2F(x)}}

  • Hence, AdaBoost and logistic regression are equivalent up to a factor of 2

SLIDE 47
  • The exponential criterion and the log-likelihood (cross entropy) are equivalent in the first two orders of their Taylor series.

SLIDE 48

Discrete AdaBoost

  • Criterion: J(F) = E[e^{-yF(x)}], with f(x) = \pm 1
  • Current estimate: F(x)
  • Seek an improved estimate F(x) + c f(x)
  • Taylor series

f(a + x) = f(a) + \frac{f'(a)}{1!} x + \frac{f''(a)}{2!} x^2 + \frac{f'''(a)}{3!} x^3 + \cdots

  • With a second-order Taylor series (note y^2 = 1 and f(x)^2 = 1)

J(F + cf) = E[e^{-y(F(x) + c f(x))}] \simeq E[e^{-yF(x)}(1 - y c f(x) + c^2 y^2 f(x)^2 / 2)] = E[e^{-yF(x)}(1 - y c f(x) + c^2/2)]

SLIDE 49

Discrete AdaBoost

  • Criterion: J(F) = E[e^{-yF(x)}], with f(x) = \pm 1
  • Solve f with c fixed

J(F + cf) \simeq E[e^{-yF(x)}(1 - y c f(x) + c^2/2)]

f = \arg\min_f J(F + cf)
  = \arg\min_f E[e^{-yF(x)}(1 - y c f(x) + c^2/2)]
  = \arg\min_f E_w[1 - y c f(x) + c^2/2 \mid x]
  = \arg\max_f E_w[y f(x) \mid x] \quad (\text{for } c > 0)

  • where the weighted conditional expectation is

E_w[y f(x) \mid x] = \frac{E[e^{-yF(x)} y f(x)]}{E[e^{-yF(x)}]}

  • The weight is the normalized error factor e^{-yF(x)} on each data instance

SLIDE 50

Discrete AdaBoost

  • Solve f with c fixed, f(x) = \pm 1

f = \arg\min_f J(F + cf) = \arg\max_f E_w[y f(x) \mid x] \quad (\text{for } c > 0)

  • Solution

f(x) = \begin{cases} 1, & \text{if } E_w(y \mid x) = P_w(y = 1 \mid x) - P_w(y = -1 \mid x) > 0 \\ -1, & \text{otherwise} \end{cases}

  • with the weighted expectation

E_w[y f(x) \mid x] = \frac{E[e^{-yF(x)} y f(x)]}{E[e^{-yF(x)}]}

  • i.e., train an f(·) with each training data instance weighted proportionally to its previous error factor e^{-yF(x)}

SLIDE 51

Discrete AdaBoost

  • Criterion: J(F) = E[e^{-yF(x)}], with f(x) = \pm 1
  • Solve c with f fixed

c = \arg\min_c J(F + cf) = \arg\min_c E_w[e^{-c y f(x)}]

\frac{\partial E_w[e^{-c y f(x)}]}{\partial c} = E_w[-e^{-c y f(x)} y f(x)]
  = \mathrm{err} \cdot e^{c} + (1 - \mathrm{err}) \cdot (-e^{-c}) = 0
\;\Rightarrow\; c = \frac{1}{2} \log \frac{1 - \mathrm{err}}{\mathrm{err}}

  • where err = E_w[1_{[y \neq f(x)]}] is the overall error rate of the weighted instances

SLIDE 52

Discrete AdaBoost

  • Criterion: J(F) = E[e^{-yF(x)}], with f(x) = \pm 1
  • Solve c with f fixed

c = \frac{1}{2} \log \frac{1 - \mathrm{err}}{\mathrm{err}}, \qquad \mathrm{err} = E_w[1_{[y \neq f(x)]}]

[Figure: c as a function of err]

SLIDE 53

Discrete AdaBoost

  • Iteration (with f(x) = \pm 1 and err = E_w[1_{[y \neq f(x)]}])

F(x) \leftarrow F(x) + \frac{1}{2} \log \frac{1 - \mathrm{err}}{\mathrm{err}} f(x)

f(x) = \begin{cases} 1, & \text{if } E_w(y \mid x) = P_w(y = 1 \mid x) - P_w(y = -1 \mid x) > 0 \\ -1, & \text{otherwise} \end{cases}

  • Train f(·) with each training data instance weighted proportionally to its error factor e^{-yF(x)}
  • Weight update

w(x, y) \leftarrow w(x, y)\, e^{-c f(x) y} = w(x, y)\, e^{c (2 \cdot 1_{[y \neq f(x)]} - 1)}
         = w(x, y) \exp\Big( \log \frac{1 - \mathrm{err}}{\mathrm{err}} \Big( 1_{[y \neq f(x)]} - \frac{1}{2} \Big) \Big)

    (the constant factor is removed after normalization)

SLIDE 54

Discrete AdaBoost Algorithm

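The slide presents the algorithm as a figure; below is a minimal sketch of Discrete AdaBoost with decision stumps following the updates derived above (the stump learner and the number of rounds M are assumptions, not from the slides; labels y are in {-1, +1}):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_adaboost(X, y, M=50):
    n = len(y)
    w = np.full(n, 1.0 / n)
    stumps, coefs = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)      # weighted error rate
        err = np.clip(err, 1e-10, 1 - 1e-10)           # avoid log(0)
        c = 0.5 * np.log((1 - err) / err)              # c = 0.5 * log((1-err)/err)
        w *= np.exp(c * (pred != y)) * np.exp(-c * (pred == y))   # w *= e^{-c y f(x)}
        w /= w.sum()                                   # renormalize
        stumps.append(stump)
        coefs.append(c)
    return stumps, coefs

def predict_adaboost(stumps, coefs, X):
    # F(x) = sum_m c_m f_m(x); final label is sign(F(x))
    F = sum(c * s.predict(X) for c, s in zip(coefs, stumps))
    return np.sign(F)
```
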
SLIDE 55

Real AdaBoost Algorithm

  • Real AdaBoost uses class probability estimates p_m(x) to construct real-valued contributions f_m(x).

SLIDE 56

Bagging vs. Boosting

  • Stump: a single-split tree with only two terminal nodes.
SLIDE 57

LogitBoost

  • More advanced than the previous versions of AdaBoost
  • May not be discussed in detail

SLIDE 58

A Brief History of Boosting

  • 1990 - Schapire showed that a weak learner could always improve its performance by training two additional classifiers on filtered versions of the input data stream
  • A weak learner is an algorithm for producing a two-class classifier with performance guaranteed (with high probability) to be significantly better than a coin flip
  • Specifically:
    • Classifier h1 is learned on the original data with N samples
    • Classifier h2 is then learned on a new set of N samples, half of which are misclassified by h1
    • Classifier h3 is then learned on N samples for which h1 and h2 disagree
    • The boosted classifier is hB = Majority Vote(h1, h2, h3)
  • It is proven that hB has improved performance over h1

SLIDE 59

A Brief History of Boosting

  • 1995 – Freund proposed a “boost by majority” variation which combined many weak learners simultaneously and improved the performance of Schapire’s simple boosting algorithm
    • Both algorithms require the weak learner to have a fixed error rate
  • 1996 – Freund and Schapire proposed AdaBoost
    • Dropped the fixed-error-rate requirement
  • 1996~1998 – Freund, Schapire and Singer proposed some theory to support their algorithms, in the form of upper bounds on the generalization error
    • But the bounds are too loose to be of practical importance
    • Boosting achieves far more impressive performance than the bounds suggest

SLIDE 60

Content of this lecture

  • Ensemble Methods
  • Bagging
  • Random Forest
  • AdaBoost
  • Gradient Boosting Decision Trees
SLIDE 61

Gradient Boosting Decision Trees

  • Boosting with decision trees
    • fm(x) is a decision tree model
  • Many aliases, such as GBRT, boosted trees, GBM
  • Strongly recommend Tianqi Chen’s tutorial
    • http://homes.cs.washington.edu/~tqchen/data/pdf/BoostedTree.pdf
    • https://xgboost.readthedocs.io/en/latest/model.html

SLIDE 62

Additive Trees

  • Grow the next tree f_t to minimize the loss function J^{(t)}, including the tree penalty \Omega(f_t)

\hat y_i^{(t)} = \sum_{m=1}^{t} f_m(x_i) = \hat y_i^{(t-1)} + f_t(x_i)

J^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat y_i^{(t)}\big) + \Omega(f_t)
        = \sum_{i=1}^{n} l\big(y_i, \hat y_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t)

  • Objective w.r.t. f_t:  \min_{f_t} J^{(t)}

SLIDE 63

Taylor Series Approximation

  • Objective w.r.t. f_t

J^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat y_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t)

  • Taylor series

f(a + x) = f(a) + \frac{f'(a)}{1!} x + \frac{f''(a)}{2!} x^2 + \frac{f'''(a)}{3!} x^3 + \cdots

  • Let’s define the gradients

g_i = \nabla_{\hat y^{(t-1)}}\, l\big(y_i, \hat y_i^{(t-1)}\big), \qquad h_i = \nabla^2_{\hat y^{(t-1)}}\, l\big(y_i, \hat y_i^{(t-1)}\big)

  • Approximation

J^{(t)} \simeq \sum_{i=1}^{n} \Big[ l\big(y_i, \hat y_i^{(t-1)}\big) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t)

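A minimal sketch of the per-instance gradients g_i and second-order terms h_i for two common losses (the loss choices are illustrative, not from the slide):

```python
import numpy as np

def grad_hess_squared(y, y_hat):
    # l = 0.5 * (y - y_hat)^2  =>  g = y_hat - y, h = 1
    return y_hat - y, np.ones_like(y_hat)

def grad_hess_logistic(y, y_hat):
    # l = log(1 + exp(-y * y_hat)) with labels y in {-1, +1}
    p = 1.0 / (1.0 + np.exp(-y_hat))          # sigmoid of the current score
    y01 = (y + 1) / 2                          # map {-1, +1} -> {0, 1}
    return p - y01, p * (1 - p)
```
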
SLIDE 64

Penalty on Tree Complexity

  • Prediction values on the leaves

f_t(x) = w_{q(x)}, \quad w \in \mathbb{R}^T, \quad q: \mathbb{R}^d \mapsto \{1, 2, \ldots, T\}, \quad T: \text{number of leaves}

[Example tree with three leaves: w_1 = +2, w_2 = +0.1, w_3 = -1]

SLIDE 65

Penalty on Tree Complexity

  • We could define the tree complexity as

\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2

    (the first term penalizes the tree size, the second the leaf weights)

  • For the example tree with w_1 = +2, w_2 = +0.1, w_3 = -1:

\Omega(f_t) = 3\gamma + \frac{1}{2} \lambda (4 + 0.01 + 1)

SLIDE 66

Rewritten Objective

  • With the penalty \Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2, the objective function becomes

J^{(t)} \simeq \sum_{i=1}^{n} \Big[ l\big(y_i, \hat y_i^{(t-1)}\big) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t)
       = \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \Big] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 + \text{const}
       = \sum_{j=1}^{T} \Big[ \Big( \sum_{i \in I_j} g_i \Big) w_j + \frac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) w_j^2 \Big] + \gamma T + \text{const}

  • I_j is the instance set of leaf j: I_j = \{ i \mid q(x_i) = j \}
  • The last step regroups the sum over instances into a sum over leaves

SLIDE 67

Rewritten Objective

  • Objective function

J^{(t)} = \sum_{j=1}^{T} \Big[ \Big( \sum_{i \in I_j} g_i \Big) w_j + \frac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) w_j^2 \Big] + \gamma T

  • Define for simplicity

G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i

J^{(t)} = \sum_{j=1}^{T} \Big[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \Big] + \gamma T

  • With the fixed tree structure q: \mathbb{R}^d \mapsto \{1, 2, \ldots, T\}, the closed-form solution is

w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad J^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T

  • This measures how good a tree structure is

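A minimal sketch of computing the optimal leaf weights and the structure score from the leaf assignment q(x_i) and the per-instance g_i, h_i (the array layout and regularizer values are assumptions):

```python
import numpy as np

def leaf_weights_and_score(leaf_idx, g, h, lam=1.0, gamma=0.0, T=None):
    # leaf_idx[i] = q(x_i), an integer in {0, ..., T-1}
    T = T if T is not None else leaf_idx.max() + 1
    G = np.bincount(leaf_idx, weights=g, minlength=T)   # G_j = sum_{i in I_j} g_i
    H = np.bincount(leaf_idx, weights=h, minlength=T)   # H_j = sum_{i in I_j} h_i
    w_star = -G / (H + lam)                              # w_j* = -G_j / (H_j + lam)
    score = -0.5 * np.sum(G ** 2 / (H + lam)) + gamma * T
    return w_star, score
```
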
SLIDE 68

The Structure Score Calculation

  • For a tree with three leaves:

J^{(t)} = -\frac{1}{2} \sum_{j=1}^{3} \frac{G_j^2}{H_j + \lambda} + 3\gamma

  • The smaller, the better. Reminder: this is quite different from the Gini impurity or information gain criteria

SLIDE 69

Find the Optimal Tree Structure

  • Feature and splitting point
  • Greedily grow the tree
    • Start from a tree with depth 0
    • For each leaf node of the tree, try to add a split. The change of objective after adding the split is

\mathrm{Gain} = \underbrace{\frac{G_L^2}{H_L + \lambda}}_{\text{left child score}} + \underbrace{\frac{G_R^2}{H_R + \lambda}}_{\text{right child score}} - \underbrace{\frac{(G_L + G_R)^2}{H_L + H_R + \lambda}}_{\text{non-split score}} - \underbrace{\gamma}_{\text{penalty of the new leaf}}

  • Introducing a split may not obtain positive gain, because of the last term

SLIDE 70

Efficiently Find the Optimal Split

  • For the selected feature j, sort the data in ascending order of x_j

[Diagram: instances sorted by x_j, with a candidate threshold between two consecutive values]

  • All we need are the sums of g and h on each side of the threshold, from which we calculate

\mathrm{Gain} = \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} - \gamma

  • A left-to-right linear scan over the sorted instances is enough to decide the best split along the feature

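A minimal sketch of the linear-scan split search along one feature (the regularizers lam and gamma are assumptions; G_R and H_R are obtained as G - G_L and H - H_L during the scan):

```python
import numpy as np

def best_split_along_feature(xj, g, h, lam=1.0, gamma=0.0):
    order = np.argsort(xj)
    xj, g, h = xj[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    GL = HL = 0.0
    best_gain, best_thresh = -np.inf, None
    for i in range(len(xj) - 1):
        GL += g[i]; HL += h[i]
        if xj[i] == xj[i + 1]:             # cannot split between equal values
            continue
        GR, HR = G - GL, H - HL
        gain = GL**2 / (HL + lam) + GR**2 / (HR + lam) - G**2 / (H + lam) - gamma
        if gain > best_gain:
            best_gain = gain
            best_thresh = (xj[i] + xj[i + 1]) / 2.0
    return best_gain, best_thresh
```
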
SLIDE 71

An Algorithm for Split Finding

  • For each node, enumerate over all features
    • For each feature, sort the instances by feature value
    • Use a linear scan to decide the best split along that feature
    • Take the best split solution over all the features
  • Time complexity of growing a tree of depth K
    • It is O(n d K log n): for each level, we need O(n log n) time to sort
    • There are d features, and we need to do it for K levels
    • This can be further optimized (e.g. using approximation or caching the sorted features)
    • Can scale to very large datasets

SLIDE 72

XGBoost

  • The most effective and efficient toolkit for GBDT

https://xgboost.readthedocs.io

T. Chen, C. Guestrin. XGBoost: A Scalable Tree Boosting System. KDD 2016.
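
A minimal usage sketch of the xgboost package's scikit-learn style interface (the dataset and hyperparameter values are assumptions, not from the slides):

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=200,      # number of boosted trees
    max_depth=4,           # depth K of each tree
    learning_rate=0.1,     # shrinkage on each tree's contribution
    reg_lambda=1.0,        # lambda: L2 penalty on leaf weights
    gamma=0.0,             # gamma: penalty per additional leaf
)
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
```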