Ensemble method for supervised learning using an explicit loss function
Ricco RAKOTOMALALA, Université Lumière Lyon 2
Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/
Boosting and Gradient Descent
BOOSTING is an ensemble method which aggregates classifiers learned sequentially on a sample whose instance weights are adjusted at each step. The classifiers are weighted according to their performance [RAK, page 28].
A weighted vote (with weights $\alpha_b$) is used for prediction (this is an additive model):

$$f(x) = \mathrm{sign}\left[\sum_{b=1}^{B} \alpha_b \, M_b(x)\right]$$
Input: B the number of models, ALGO the learning algorithm, Ω the training set of size n, y the target attribute, X the matrix of p predictive attributes.
MODELS = {}
All the instances have the same weight: $\omega_i^1 = 1/n$
For b = 1 to B Do
  Fit the model $M_b$ from $\Omega(\omega^b)$ using ALGO ($\omega^b$ is the weighting system at step b)
  Add $M_b$ to MODELS
  Calculate the weighted error rate for $M_b$: $\epsilon_b = \sum_{i=1}^{n} \omega_i^b \, I(y_i \neq \hat{y}_i)$
  If $\epsilon_b > 0.5$ or $\epsilon_b = 0$, STOP the process
  Else
    Calculate $\alpha_b = \ln \frac{1 - \epsilon_b}{\epsilon_b}$
    Update the weights: $\omega_i^{b+1} = \omega_i^b \cdot \exp\left[\alpha_b \, I(y_i \neq \hat{y}_i)\right]$
    Normalize the weights so that their sum is equal to 1
End For
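As an illustration, this scheme is available off the shelf in scikit-learn's AdaBoostClassifier (a minimal sketch on a synthetic dataset; the settings are arbitrary, and scikit-learn actually implements the SAMME variant of the algorithm above):

# minimal AdaBoost illustration with scikit-learn
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
# B = 50 base classifiers (decision stumps by default), reweighted at each step
ada = AdaBoostClassifier(n_estimators=50)
ada.fit(X, y)
print(1.0 - ada.score(X, y))  # resubstitution error rate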
Gradient descent is an iterative technique for approaching the solution of an optimization problem. In supervised learning, building the model often amounts to determining the parameters that optimize (maximize or minimize) an objective function (e.g. the Perceptron with the least squares criterion, pages 11 and 12).
$$J = \sum_{i=1}^{n} j\big(y_i, f(x_i)\big)$$

f(·) is a classifier with some parameters; j(·) is a cost function comparing the observed value of the target and the prediction of the model for an observation; J(·) is an overall loss function, additively computed over all the instances.
The aim is to minimize J(·) with regard to f(·), i.e. with regard to the parameters of f(·).
The gradient descent update rule is:

$$f_b(x_i) = f_{b-1}(x_i) - \eta \, \nabla j\big(y_i, f(x_i)\big)$$

$f_b(\cdot)$ is the version of the classifier at step b; $\eta$ is the learning rate, which drives the process; $\nabla j$ is the gradient, i.e. the first-order partial derivative of the cost function with regard to the classifier:

$$\nabla j\big(y_i, f(x_i)\big) = \frac{\partial j\big(y_i, f(x_i)\big)}{\partial f(x_i)}$$
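As a toy illustration of this update rule (my own example, not from the slides), a few gradient steps on the squared cost for a model reduced to a single constant parameter theta:

# gradient descent on J(theta) = 1/2 * sum_i (y_i - theta)^2
import numpy as np

y = np.array([1.0, 2.0, 4.0])
theta, eta = 0.0, 0.1
for step in range(50):
    grad = -(y - theta).sum()   # dJ/dtheta
    theta = theta - eta * grad  # update rule
print(theta, y.mean())          # theta converges to the mean of y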
We can show that ADABOOST amounts to optimizing an exponential loss function, i.e. each classifier $M_b$ learned from the weighted sample resulting from $M_{b-1}$ reduces an overall loss function [BIS, page 659; HAS, page 343].
GRADIENT BOOSTING generalizes the approach to other loss functions.
$$J = \sum_{i=1}^{n} \exp\big[-y_i \, f(x_i)\big]$$

y ∈ {-1, +1}; J(·) is the overall loss function; f(·) is the aggregate classifier, composed of a linear combination of the base classifiers $M_b$:

$$f_b(x_i) = f_{b-1}(x_i) + \alpha_b \, M_b(x_i)$$
The aggregate classifier at step b is corrected by the individual classifier $M_b$ learned from the reweighted sample. $M_b$ plays the role of the gradient here, i.e. each intermediate model reduces the overall loss. The "gradient" classifier comes from a sample where the weights of the instances depend on the performance of the previous classifier.
Gradient Boosting = Gradient Descent + Boosting
Regression is a supervised learning process which estimates the relationship between a quantitative dependent variable and a set of independent variables.
$$y_i = M_1(x_i) + \varepsilon_i^1$$

$\varepsilon$ is the error term; it represents the inadequacy of the model. $M_1$ may be any kind of model; here we use a regression tree.

$$e_i^1 = y_i - M_1(x_i)$$

e is the residual, the estimated value of the error. A high value (in absolute value) reflects a bad prediction.
The aim is to model this residual with a second model $M_2$ and combine it with the previous one for a better prediction:

$$e_i^1 = M_2(x_i) + \varepsilon_i^2 \quad\Rightarrow\quad e_i^2 = e_i^1 - M_2(x_i)$$

We can proceed in the same way for the residual $e^2$, etc. The role of $M_2$ is to (additively) compensate for the inadequacy of $M_1$; thereafter we can learn $M_3$, and so on.
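A minimal sketch of this residual-fitting idea with scikit-learn regression trees (the synthetic data and the depth setting are my own illustrative assumptions):

# each tree is fitted on the residuals left by the sum of the previous trees
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

pred = np.zeros_like(y)  # f_0 = 0 for simplicity
for b in range(3):       # M1, M2, M3
    residual = y - pred
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += tree.predict(X)
    print(b + 1, np.mean((y - pred) ** 2))  # the squared error decreases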
The sum of the squares of errors is a well-known overall indicator of quality in regression
$$J = \sum_{i=1}^{n} j\big(y_i, f(x_i)\big), \qquad j\big(y_i, f(x_i)\big) = \frac{1}{2}\big(y_i - f(x_i)\big)^2$$

$$\frac{\partial j\big(y_i, f(x_i)\big)}{\partial f(x_i)} = \frac{\partial \, \frac{1}{2}\big(y_i - f(x_i)\big)^2}{\partial f(x_i)} = -\big(y_i - f(x_i)\big)$$
Calculation of the gradient: it is actually equal to the residual, but with the opposite sign, i.e. residual = negative gradient. Thus, we have an iterative process for the construction of the additive model: modeling the residuals at step b (regression tree $M_b$) corresponds to a gradient descent step.

$$f_b(x_i) = f_{b-1}(x_i) + M_b(x_i) = f_{b-1}(x_i) + \big(y_i - f_{b-1}(x_i)\big) = f_{b-1}(x_i) - \nabla j\big(y_i, f_{b-1}(x_i)\big)$$

The learning rate is equal to 1 here.
We have an iterative process where, at each step, we use the negative value of the gradient, $-\nabla j(y, f)$ [WIK].
Fit the trivial tree $f_0(\cdot)$
REPEAT UNTIL CONVERGENCE:
  Calculate the negative gradient $-\nabla j(y, f)$
  Fit a regression tree $M_b$ on $-\nabla j(y, f)$
  $f_b = f_{b-1} + \rho_b \cdot M_b$

The trivial tree corresponds to a tree with a single leaf: the mean of the target attribute Y.
Or, more simply, FOR b = 1, …, B (B is a parameter of the algorithm). The gradient must be calculated for all the instances of the training sample (i = 1, …, n). With j(·) = squared error, the negative gradient is the residual. The depth of the trees is a possible parameter. The models are combined in an additive fashion.
The advantage of this generic formulation is that one can use other loss functions. $\rho_b$ is chosen at each step in order to minimize the overall loss (using a numerical optimization):

$$\rho_b = \arg\min_{\rho} \sum_{i=1}^{n} j\big(y_i, f_{b-1}(x_i) + \rho \cdot M_b(x_i)\big)$$
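A sketch of this one-dimensional step-size search (my own illustration, here with the absolute loss, using scipy's scalar minimizer on the convex objective):

# rho_b = argmin_rho sum_i j(y_i, f_{b-1}(x_i) + rho * M_b(x_i))
import numpy as np
from scipy.optimize import minimize_scalar

def line_search(y, f_prev, m_b):
    obj = lambda rho: np.abs(y - (f_prev + rho * m_b)).sum()
    return minimize_scalar(obj).x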
Other loss functions: other gradient formulation, other behavior and performance of the aggregate model.
Loss function / negative gradient / pros and cons:

- Squared error $\frac{1}{2}(y_i - f(x_i))^2$; negative gradient $y_i - f(x_i)$. Sensitive to small differences, but not robust against outliers.
- Absolute error $|y_i - f(x_i)|$; negative gradient $\mathrm{sign}[y_i - f(x_i)]$. Less sensitive to small differences, but robust against outliers.
- Huber; negative gradient $y_i - f(x_i)$ if $|y_i - f(x_i)| \leq \delta$, and $\delta \cdot \mathrm{sign}[y_i - f(x_i)]$ otherwise, where $\delta$ is a quantile of $\{|y_i - f(x_i)|\}$. Combines the benefits of the squared error (more sensitive to small values of the gradient) and the absolute error (more robust against outliers).
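To make the list concrete, a small sketch (my own illustration) computing the negative gradient for the three losses; the Huber threshold is taken as a quantile of the absolute residuals:

import numpy as np

def negative_gradient(y, f, loss="squared", q=0.9):
    r = y - f                                  # residuals
    if loss == "squared":                      # -grad of 1/2 (y - f)^2
        return r
    if loss == "absolute":                     # -grad of |y - f|
        return np.sign(r)
    if loss == "huber":
        delta = np.quantile(np.abs(r), q)      # threshold = quantile of |r|
        return np.where(np.abs(r) <= delta, r, delta * np.sign(r))
    raise ValueError(loss)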
Working with the indicator variables (dummy variables)
The categorical target variable takes K values {1, ..., K}. The algorithm remains identical, but we must define a loss function adapted to the classification task and calculate the corresponding gradient.
Loss function: MULTINOMIAL DEVIANCE (the binomial deviance is a special case for a binary target attribute).
$$j\big(y_i, f(x_i)\big) = -\sum_{k=1}^{K} y_{ik} \ln \pi_k(x_i)$$

$\pi_k$ corresponds to the class membership probability for the value k of Y:

$$\pi_k(x_i) = \frac{e^{f_k(x_i)}}{\sum_{k'=1}^{K} e^{f_{k'}(x_i)}}$$

$$y_{ik} = \begin{cases} 1 & \text{if } Y_i = k \\ 0 & \text{otherwise} \end{cases}$$

$y_k$ is a dummy variable (K dummy variables are generated).
Gradient:

$$-\nabla j\big(y_i, f_k(x_i)\big) = y_{ik} - \pi_k(x_i)$$

For the class k, the negative gradient is the difference between the associated dummy variable and the class membership probability. We must deal with the K dummy variables ($y_k$) and fit a regression tree on each negative gradient (1 tree for each dummy variable). $f_k$ is the aggregate model for the class k, needed for the calculation of $\pi_k$.
Y (target) is coded with K dummy variables $y_k$
Fit K trivial trees $f_{k,0}(\cdot)$, one for each $y_k$
REPEAT UNTIL CONVERGENCE:
  Calculate the K negative gradients $-\nabla j(y_k, f_k)$
  Fit a regression tree $M_{k,b}$ on each $-\nabla j(y_k, f_k)$
  $f_{k,b} = f_{k,b-1} + \rho_b \cdot M_{k,b}$
The process is unchanged compared with the regression case, except that we use the dummy variables. Even though we are in a classification context, the internal mechanism is based on REGRESSION TREE BOOSTING. We obtain K aggregate models. The class membership probability is calculated with the "softmax" function:
$$\pi_k(x_i) = \frac{e^{f_k(x_i)}}{\sum_{k'=1}^{K} e^{f_{k'}(x_i)}}$$

The assignment rule is:

$$\hat{y}_i = \arg\max_{k} \; \pi_k(x_i)$$
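A small numpy sketch of these two formulas (softmax probabilities and the argmax assignment rule); it assumes the K aggregate scores $f_k(x_i)$ are stored in an (n, K) matrix F:

import numpy as np

def softmax_proba(F):
    # F: (n, K) matrix of aggregate scores f_k(x_i)
    E = np.exp(F - F.max(axis=1, keepdims=True))  # shift for numerical stability
    return E / E.sum(axis=1, keepdims=True)       # pi_k(x_i)

F = np.array([[0.2, 1.5, -0.3],
              [2.0, 0.1, 0.4]])
P = softmax_proba(F)
print(P.argmax(axis=1))  # assignment rule: argmax_k pi_k(x_i)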
Approaches to prevent overfitting (other than limiting the depth of the trees)
Include an additional parameter $\eta$ (learning rate) in the update rule:

$$f_b = f_{b-1} + \eta \cdot \rho_b \cdot M_b$$
The additional parameter $\eta$ ($0 < \eta \leq 1$) is used in order to "smooth" the update rule. Empirically, we observe that a low value of $\eta$ ($\eta < 0.1$) improves the performance, but the convergence is slower (the number of needed iterations B is higher).
Random sampling is introduced: at each step, only a fraction $\theta$ ($0 < \theta \leq 1$) of the learning sample is used for the construction of the trees $M_b$ [HAS, page 365]. With $\theta = 1$, we have the standard algorithm. Typically, $0.5 \leq \theta \leq 0.8$ is suited to a moderately sized dataset [WIK]. Advantages: 1. Reduces the computation time. 2. Prevents overfitting by introducing randomness into the learning process (as in Random Forest and Bagging). 3. Allows an OOB estimation of the error rate (as in Bagging).
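Both devices are directly exposed in, e.g., scikit-learn's GradientBoostingClassifier (a minimal sketch; the values below are illustrative, not recommendations):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=1)
gb = GradientBoostingClassifier(learning_rate=0.05,  # shrinkage (eta)
                                subsample=0.5,       # stochastic gradient boosting (theta)
                                n_estimators=500)    # more trees to offset the small eta
gb.fit(X, y)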
Software and packages
# import the training set
import pandas
dtrain = pandas.read_table("ionosphere-train.txt", sep="\t", header=0, decimal=".")
print(dtrain.shape)
y_app = dtrain.values[:, 32]    # target attribute
X_app = dtrain.values[:, 0:32]  # input attributes

# import the GradientBoostingClassifier class
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier()
# display the parameters
print(gb)
# fit on the training set
gb.fit(X_app, y_app)

# import the test set
dtest = pandas.read_table("ionosphere-test.txt", sep="\t", header=0, decimal=".")
print(dtest.shape)
y_test = dtest.values[:, 32]
X_test = dtest.values[:, 0:32]

# prediction on the test set
y_pred = gb.predict(X_test)

# evaluation: test error rate = 0.085
from sklearn import metrics
err = 1.0 - metrics.accuracy_score(y_test, y_pred)
print(err)
There is also a variable sampling mechanism during the node splitting process, as with Random Forest.
Scikit-learn provides a tool for determining the "optimal" parameters of a machine learning algorithm by cross-validation.
# grid search tool: http://scikit-learn.org/stable/modules/grid_search.html
from sklearn.model_selection import GridSearchCV

# combinations of the parameters to evaluate; the tool performs an exhaustive search
# the cross-validated calculations are intensive
parametres = {"learning_rate": [0.3, 0.2, 0.1, 0.05, 0.01],
              "max_depth": [2, 3, 4, 5, 6],
              "subsample": [1.0, 0.8, 0.5]}

# the supervised learning algorithm to use: gradient boosting
gbc = GradientBoostingClassifier()
# create the object for the search
grille = GridSearchCV(estimator=gbc, param_grid=parametres, scoring="accuracy")
# perform the process on the training set
resultats = grille.fit(X_app, y_app)

# best combination of parameters: {'subsample': 0.5, 'learning_rate': 0.2, 'max_depth': 4}
print(resultats.best_params_)

# prediction with the "model" identified by cross-validation
ypredc = resultats.predict(X_test)
# performance of the "best" model: test error rate = 0.065
err_best = 1.0 - metrics.accuracy_score(y_test, ypredc)
print(err_best)
# import the data files (train and test)
dtrain <- read.table("ionosphere-train.txt", header=T, sep="\t")
dtest <- read.table("ionosphere-test.txt", header=T, sep="\t")

# package "gbm"
library(gbm)
# fit the model on the training set
gb1 <- gbm(class ~ ., data = dtrain, distribution = "multinomial")

# prediction: predict provides a score
# the threshold for class assignment is 0
p1 <- predict(gb1, newdata = dtest, n.trees = gb1$n.trees)
y1 <- factor(ifelse(p1[,1,1] > 0, "b", "g"))

# confusion matrix and error rate
m1 <- table(dtest$class, y1)
err1 <- 1 - sum(diag(m1))/sum(m1)
print(err1)
distribution = "bernoulli" is also possible, but the target attribute must be coded 0/1 in this case.
# package "mboost" library(mboost) # fit with the default settings (see documentation online) gb2 <- blackboost(class ~ ., data = dtrain, family=Multinomial()) # prediction on the test set y2 <- predict(gb2,newdata=dtest,type="class") # confusion matrix and test error rate = 11.5% m2 <- table(dtest$class,y2) err2 <- 1 - sum(diag(m2))/sum(m2) print(err2) # Modifying the settings of the underlying base classifier (deeper regression tree) library(party) parametres <- ctree_control(stump=FALSE,maxdepth=10,minsplit=2,minbucket=1) # fit with the settings gb3 <- blackboost(class ~ ., data = dtrain, family=Multinomial(),tree_controls=parametres) # prediction on the test set y3 <- predict(gb3,newdata=dtest,type="class") # test error rate = 12.5% (clearly, deeper tree is not suitable here) m3 <- table(dtest$class,y3) err3 <- 1 - sum(diag(m3))/sum(m3) print(err3)
The package offers many other functionalities.
The "xgboost" package provides a parallel implementation, making the computation feasible on large datasets (it also supports base classifiers other than trees).
# package "xgboost" library(xgboost) # convert the data in a format tractable by xgboost XTrain <- as.matrix(dtrain[,1:32]) yTrain <- ifelse(dtrain$class=="b",1,0) #codage 1/0 de la cible # fit with the default settings (eta=0.3, max.depth=6) gb4 <- xgboost(data=XTrain,label=yTrain,objective="binary:logistic",nrounds=100) # prediction on the test set XTest <- as.matrix(dtest[,1:32]) p4 <- predict(gb4,newdata=XTest) # we obtain PI("b") – we convert in class prediction y4 <- factor(ifelse(p4 > 0.5,"b","g")) # confusion matrix and test error rate = 9.5% m4 <- table(dtest$class,y4) err4 <- 1 - sum(diag(m4))/sum(m4) print(err4) # fit with other settings gb5 <- xgboost(data=XTrain,label=yTrain,objective="binary:logistic",eta=0.5,max.depth=10,nrounds=100) # prediction p5 <- predict(gb5,newdata=XTest) y5 <- factor(ifelse(p5 > 0.5, "b","g")) # confusion matrix and test error rate = 9% m5 <- table(dtest$class,y5) err5 <- 1 - sum(diag(m5))/sum(m5)) print(err5)
Gradient boosting relies on many parameters that heavily influence its performance. They can interact with each other, which makes their handling difficult. The challenge is to find the right trade-off between fully exploiting the available data and preventing overfitting.

Characteristics of the underlying trees: maximum depth of the trees, number of instances required to split a node, minimum number of instances in a leaf. Small trees may lead to underfitting; conversely, large trees may overfit.
Learning rate: typically around 0.1. If we decrease $\eta$, we must increase the number of trees to compensate.
Number of trees: the risk of overfitting is low if we increase the number of trees, but the computation time obviously increases.
Sampling of the instances: stochastic gradient boosting. With $\theta = 1$, the algorithm uses all the instances; $\theta < 1$ reduces the over-dependence on the training set and prevents overfitting. A possible value is about 0.5.
Sampling of the variables: a mechanism analogous to Random Forest. It allows to "diversify" the trees and therefore reduce the variance. It may be handled jointly with the characteristics of the trees (large trees imply less bias). This parameter is available only in some packages (xgboost, scikit-learn).
The "gradient boosting" is an ensemble method that generalizes the boosting by providing the opportunity of use other loss functions. The global frameworks are identical: underlying algorithm = tree, construction in sequential way of models, "variable importance" measurement allows to assess the relevance of the predictors, similar problems for set the right values
But unlike boosting, even in the classification context, the underlying algorithm is a regression tree. The documentation must be read carefully to understand what lies behind the implementations and how to handle their settings.
… the behavior of the approach (number of iterations, regularization parameters, etc.)
… especially in the classification process, which is the main subject of this course
… trees with the deviance loss function
… the characteristics of the studied problems
References

[BIS] Bishop C.M., "Pattern Recognition and Machine Learning", Springer, 2006.
[HAS] Hastie T., Tibshirani R., Friedman J., "The Elements of Statistical Learning: Data Mining, Inference and Prediction", Springer, 2009.
[LI] Li C., "A Gentle Introduction to Gradient Boosting", 2014.
[NAT] Natekin A., Knoll A., "Gradient boosting machines, a tutorial", Frontiers in Neurorobotics, December 2013.
[RAK] Rakotomalala R., "Bagging – Random Forest – Boosting", 2015.
[WIK] Wikipedia, "Gradient boosting".