

SLIDE 1

MLCC 2019 Variable Selection and Sparsity

Lorenzo Rosasco UNIGE-MIT-IIT

SLIDE 2

Outline

Variable Selection
Subset Selection
Greedy Methods: (Orthogonal) Matching Pursuit
Convex Relaxation: LASSO & Elastic Net

MLCC 2019 2

SLIDE 3

Prediction and Interpretability

◮ In many practical situations, beyond prediction, it is important to obtain interpretable results

MLCC 2019 3

SLIDE 4

Prediction and Interpretability

◮ In many practical situations, beyond prediction, it is important to obtain interpretable results

◮ Interpretability is often determined by detecting which factors allow good prediction

MLCC 2019 4

SLIDE 5

Prediction and Interpretability

◮ In many practical situations, beyond prediction, it is important to obtain interpretable results

◮ Interpretability is often determined by detecting which factors allow good prediction

We look at this question from the perspective of variable selection

MLCC 2019 5

SLIDE 6

Linear Models

Consider a linear model

f_w(x) = w^T x = ∑_{j=1}^{D} w_j x_j

Here
◮ the components x_j of an input can be seen as measurements (pixel values, dictionary word counts, gene expressions, . . . )

MLCC 2019 6

SLIDE 7

Linear Models

Consider a linear model

f_w(x) = w^T x = ∑_{j=1}^{D} w_j x_j

Here
◮ the components x_j of an input can be seen as measurements (pixel values, dictionary word counts, gene expressions, . . . )
◮ Given data, the goal of variable selection is to detect which variables are important for prediction

MLCC 2019 7

SLIDE 8

Linear Models

Consider a linear model

f_w(x) = w^T x = ∑_{j=1}^{D} w_j x_j

Here
◮ the components x_j of an input can be seen as measurements (pixel values, dictionary word counts, gene expressions, . . . )
◮ Given data, the goal of variable selection is to detect which variables are important for prediction

Key assumption: the best possible prediction rule is sparse, that is, only a few of the coefficients are non-zero

MLCC 2019 8

SLIDE 9

Outline

Variable Selection
Subset Selection
Greedy Methods: (Orthogonal) Matching Pursuit
Convex Relaxation: LASSO & Elastic Net

MLCC 2019 9

SLIDE 10

Linear Models

Consider a linear model

f_w(x) = w^T x = ∑_{j=1}^{D} w_j x_j     (1)

Here
◮ the components x_j of an input are specific measurements (pixel values, dictionary word counts, gene expressions, . . . )

MLCC 2019 10

SLIDE 11

Linear Models

Consider a linear model

f_w(x) = w^T x = ∑_{j=1}^{D} w_j x_j     (1)

Here
◮ the components x_j of an input are specific measurements (pixel values, dictionary word counts, gene expressions, . . . )
◮ Given data, the goal of variable selection is to detect which variables are important for prediction

MLCC 2019 11

SLIDE 12

Linear Models

Consider a linear model

f_w(x) = w^T x = ∑_{j=1}^{D} w_j x_j     (1)

Here
◮ the components x_j of an input are specific measurements (pixel values, dictionary word counts, gene expressions, . . . )
◮ Given data, the goal of variable selection is to detect which variables are important for prediction

Key assumption: the best possible prediction rule is sparse, that is, only a few of the coefficients are non-zero

MLCC 2019 12
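As a quick illustration (not from the original slides), here is a minimal NumPy sketch that generates data from a sparse linear model: only s of the D coefficients are non-zero, and variable selection amounts to recovering their indices from (X_n, Y_n). The sizes n, D, s and the noise level are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

n, D, s = 50, 200, 5      # n samples, D variables, only s of them relevant (illustrative)
sigma = 0.1               # noise level (illustrative)

Xn = rng.standard_normal((n, D))           # n-by-D data matrix X_n
w_true = np.zeros(D)
support = rng.choice(D, size=s, replace=False)
w_true[support] = rng.standard_normal(s)   # sparse coefficient vector: few non-zero entries

Yn = Xn @ w_true + sigma * rng.standard_normal(n)   # output vector Y_n

# Variable selection: recover `support` (and w_true) from (Xn, Yn) alone.
print("true support:", np.sort(support))
```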

SLIDE 13

Notation

We need some notation:
◮ X_n, the n by D data matrix

MLCC 2019 13

SLIDE 14

Notation

We need some notation:
◮ X_n, the n by D data matrix
◮ X^j ∈ R^n, j = 1, . . . , D, its columns

MLCC 2019 14

SLIDE 15

Notation

We need some notation:
◮ X_n, the n by D data matrix
◮ X^j ∈ R^n, j = 1, . . . , D, its columns
◮ Y_n ∈ R^n, the output vector

MLCC 2019 15

SLIDE 16

High-dimensional Statistics

Estimating a linear model corresponds to solving the linear system X_n w = Y_n.
◮ Classically n ≫ D: low-dimensional/overdetermined system

MLCC 2019 16

SLIDE 17

High-dimensional Statistics

Estimating a linear model corresponds to solving the linear system X_n w = Y_n.
◮ Classically n ≫ D: low-dimensional/overdetermined system
◮ Lately n ≪ D: high-dimensional/underdetermined system

Buzzwords: compressed sensing, high-dimensional statistics, . . .

MLCC 2019 17

SLIDE 18

High-dimensional Statistics

Estimating a linear model corresponds to solving the linear system X_n w = Y_n.

[Figure: the linear system Y_n = X_n w, with X_n an n by D matrix]

MLCC 2019 18

SLIDE 19

High-dimensional Statistics

Estimating a linear model corresponds to solving the linear system X_n w = Y_n.

[Figure: the linear system Y_n = X_n w, with X_n an n by D matrix]

Sparsity!

MLCC 2019 19

SLIDE 20

Brute Force Approach

Sparsity can be measured by the ℓ0 norm

‖w‖_0 = |{j | w_j ≠ 0}|

which counts the non-zero components of w

MLCC 2019 20

SLIDE 21

Brute Force Approach

Sparsity can be measured by the ℓ0 norm

‖w‖_0 = |{j | w_j ≠ 0}|

which counts the non-zero components of w

If we consider the square loss, it can be shown that a regularization approach is given by

min_{w ∈ R^D}  (1/n) ∑_{i=1}^{n} (y_i − f_w(x_i))^2 + λ‖w‖_0

MLCC 2019 21

SLIDE 22

The Brute Force Approach is Hard

min_{w ∈ R^D}  (1/n) ∑_{i=1}^{n} (y_i − f_w(x_i))^2 + λ‖w‖_0

The above approach is as hard as a brute force approach: it amounts to considering all training sets obtained from all possible subsets of variables (singletons, pairs, triplets, . . . of variables)

MLCC 2019 22

SLIDE 23

The Brute Force Approach is Hard

min_{w ∈ R^D}  (1/n) ∑_{i=1}^{n} (y_i − f_w(x_i))^2 + λ‖w‖_0

The above approach is as hard as a brute force approach: it amounts to considering all training sets obtained from all possible subsets of variables (singletons, pairs, triplets, . . . of variables)

The computational complexity is combinatorial. In the following we consider two possible approximate approaches:
◮ greedy methods
◮ convex relaxation

MLCC 2019 23
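To make the combinatorial cost concrete, the following is a hedged sketch (not from the slides) of exhaustive subset selection with the square loss: it fits least squares on every subset of at most k_max variables and keeps the best one. The function name and parameters are illustrative.

```python
import numpy as np
from itertools import combinations

def best_subset(Xn, Yn, k_max):
    """Exhaustive (brute force) subset selection with the square loss.

    Fits ordinary least squares on every subset of at most k_max variables and
    returns the best one: the cost is sum_k C(D, k) fits, combinatorial in D.
    """
    n, D = Xn.shape
    best_err, best_S, best_wS = np.inf, (), None
    for k in range(1, k_max + 1):
        for S in combinations(range(D), k):
            XS = Xn[:, list(S)]
            wS, *_ = np.linalg.lstsq(XS, Yn, rcond=None)   # least squares on the subset
            err = np.mean((Yn - XS @ wS) ** 2)
            if err < best_err:
                best_err, best_S, best_wS = err, S, wS
    w = np.zeros(D)
    w[list(best_S)] = best_wS
    return w, best_S
```

Already for D = 100 and k_max = 3 this requires more than 160,000 least-squares fits, which is why the approximate approaches below are used instead.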

SLIDE 24

Outline

Variable Selection
Subset Selection
Greedy Methods: (Orthogonal) Matching Pursuit
Convex Relaxation: LASSO & Elastic Net

MLCC 2019 24

SLIDE 25

Greedy Methods Approach

Greedy approaches consist of the following steps:

  • 1. initialize the residual, the coefficient vector, and the index set

MLCC 2019 25

SLIDE 26

Greedy Methods Approach

Greedy approaches consist of the following steps:

  • 1. initialize the residual, the coefficient vector, and the index set
  • 2. find the variable most correlated with the residual

MLCC 2019 26

SLIDE 27

Greedy Methods Approach

Greedy approaches consist of the following steps:

  • 1. initialize the residual, the coefficient vector, and the index set
  • 2. find the variable most correlated with the residual
  • 3. update the index set to include the index of such variable

MLCC 2019 27

SLIDE 28

Greedy Methods Approach

Greedy approaches consist of the following steps:

  • 1. initialize the residual, the coefficient vector, and the index set
  • 2. find the variable most correlated with the residual
  • 3. update the index set to include the index of such variable
  • 4. update/compute coefficient vector

MLCC 2019 28

SLIDE 29

Greedy Methods Approach

Greedy approaches consist of the following steps:

  • 1. initialize the residual, the coefficient vector, and the index set
  • 2. find the variable most correlated with the residual
  • 3. update the index set to include the index of such variable
  • 4. update/compute coefficient vector
  • 5. update residual.

The simplest such procedure is called forward stage-wise regression in statistics and matching pursuit (MP) in signal processing

MLCC 2019 29

SLIDE 30

Initialization

Let r, w, I denote the residual, the coefficient vector, and an index set, respectively.

MLCC 2019 30

SLIDE 31

Initialization

Let r, w, I denote the residual, the coefficient vector, and an index set, respectively. The MP algorithm starts by initializing the residual r ∈ R^n, the coefficient vector w ∈ R^D, and the index set I ⊆ {1, . . . , D}:

r_0 = Y_n,   w_0 = 0,   I_0 = ∅

MLCC 2019 31

SLIDE 32

Selection

The variable most correlated with the residual is given by

k = arg max_{j=1,...,D} a_j,   a_j = (r_{i−1}^T X^j)^2 / ‖X^j‖^2

where we note that

v_j = r_{i−1}^T X^j / ‖X^j‖^2 = arg min_{v ∈ R} ‖r_{i−1} − X^j v‖^2,   ‖r_{i−1} − X^j v_j‖^2 = ‖r_{i−1}‖^2 − a_j

MLCC 2019 32

SLIDE 33

Selection (cont.)

Such a selection rule has two interpretations:
◮ we select the variable with the largest projection on the output, or equivalently
◮ we select the variable whose corresponding column best explains the output vector in a least squares sense

MLCC 2019 33

SLIDE 34

Active Set, Solution and Residual Update

Then, the index set is updated as I_i = I_{i−1} ∪ {k}, and the coefficient vector is given by

w_i = w_{i−1} + w^k,   w^k = v_k e_k

where e_k is the element of the canonical basis in R^D with k-th component different from zero

MLCC 2019 34

SLIDE 35

Active Set, Solution and Residual Update

Then, the index set is updated as I_i = I_{i−1} ∪ {k}, and the coefficient vector is given by

w_i = w_{i−1} + w^k,   w^k = v_k e_k

where e_k is the element of the canonical basis in R^D with k-th component different from zero

Finally, the residual is updated:

r_i = r_{i−1} − X_n w^k

MLCC 2019 35
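Putting the initialization, selection, and update steps of the previous slides together, here is a minimal NumPy sketch of matching pursuit; the fixed number of iterations T is an illustrative stopping rule.

```python
import numpy as np

def matching_pursuit(Xn, Yn, T):
    """Matching pursuit: T greedy iterations of the selection/update rules above."""
    n, D = Xn.shape
    r = Yn.copy()                        # r_0 = Y_n
    w = np.zeros(D)                      # w_0 = 0
    I = set()                            # I_0 = empty set
    col_norms = np.sum(Xn ** 2, axis=0)  # ||X^j||^2 for each column
    for _ in range(T):
        a = (Xn.T @ r) ** 2 / col_norms  # a_j = (r^T X^j)^2 / ||X^j||^2
        k = int(np.argmax(a))            # variable most correlated with the residual
        v_k = (Xn[:, k] @ r) / col_norms[k]   # one-dimensional least squares coefficient
        I.add(k)                         # I_i = I_{i-1} U {k}
        w[k] += v_k                      # w_i = w_{i-1} + v_k e_k
        r = r - v_k * Xn[:, k]           # r_i = r_{i-1} - X_n (v_k e_k)
    return w, I
```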

SLIDE 36

Orthogonal Matching Pursuit

A variant of the above procedure, called orthogonal matching pursuit (OMP), is also often considered, where the coefficient computation is replaced by

w_i = arg min_{w ∈ R^D} ‖Y_n − X_n M_{I_i} w‖^2

where the D by D matrix M_I is such that (M_I w)_j = w_j if j ∈ I and (M_I w)_j = 0 otherwise. Moreover, the residual update is replaced by

r_i = Y_n − X_n w_i

MLCC 2019 36
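For comparison, a hedged sketch of orthogonal matching pursuit: the selection rule is the same, but at each iteration the coefficients are refit by least squares restricted to the current index set (the effect of the masking matrix M_I), and the residual is recomputed from the full current solution.

```python
import numpy as np

def orthogonal_matching_pursuit(Xn, Yn, T):
    """OMP: same selection rule as MP, but coefficients refit on the active set."""
    n, D = Xn.shape
    r = Yn.copy()
    w = np.zeros(D)
    I = []
    col_norms = np.sum(Xn ** 2, axis=0)
    for _ in range(T):
        a = (Xn.T @ r) ** 2 / col_norms
        k = int(np.argmax(a))
        if k not in I:
            I.append(k)
        # w_i = argmin_w ||Y_n - X_n M_{I_i} w||^2: least squares on the columns in I
        wI, *_ = np.linalg.lstsq(Xn[:, I], Yn, rcond=None)
        w = np.zeros(D)
        w[I] = wI
        r = Yn - Xn @ w              # r_i = Y_n - X_n w_i
    return w, I
```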

SLIDE 37

Theoretical Guarantees

If
◮ the solution is sparse, and
◮ the data matrix has columns that are "not too correlated",
then OMP can be shown to recover, with high probability, the right vector of coefficients

MLCC 2019 37

SLIDE 38

Outline

Variable Selection
Subset Selection
Greedy Methods: (Orthogonal) Matching Pursuit
Convex Relaxation: LASSO & Elastic Net

MLCC 2019 38

SLIDE 39

ℓ1 Norm and Regularization

Another popular approach to finding sparse solutions is based on a convex relaxation. Namely, the ℓ0 norm is replaced by the ℓ1 norm

‖w‖_1 = ∑_{j=1}^{D} |w_j|

MLCC 2019 39

SLIDE 40

ℓ1 Norm and Regularization

Another popular approach to finding sparse solutions is based on a convex relaxation. Namely, the ℓ0 norm is replaced by the ℓ1 norm

‖w‖_1 = ∑_{j=1}^{D} |w_j|

In the case of least squares, one can consider

min_{w ∈ R^D}  (1/n) ∑_{i=1}^{n} (y_i − f_w(x_i))^2 + λ‖w‖_1

MLCC 2019 40

SLIDE 41

Convex Relaxation

min_{w ∈ R^D}  (1/n) ∑_{i=1}^{n} (y_i − f_w(x_i))^2 + λ‖w‖_1

◮ The above problem is called LASSO in statistics and Basis Pursuit in signal processing
◮ The objective function defining the corresponding minimization problem is convex but not differentiable
◮ Tools from non-smooth convex optimization are needed to find a solution

MLCC 2019 41

SLIDE 42

Iterative Soft Thresholding

A simple yet powerful procedure to compute a solution is based on the so-called iterative soft thresholding algorithm (ISTA):

MLCC 2019 42

SLIDE 43

Iterative Soft Thresholding

A simple yet powerful procedure to compute a solution is based on the so-called iterative soft thresholding algorithm (ISTA):

w_0 = 0,   w_i = S_{λγ}( w_{i−1} + (2γ/n) X_n^T (Y_n − X_n w_{i−1}) ),   i = 1, . . . , T_max

MLCC 2019 43

SLIDE 44

Iterative Soft Thresholding

A simple yet powerful procedure to compute a solution is based on the so-called iterative soft thresholding algorithm (ISTA):

w_0 = 0,   w_i = S_{λγ}( w_{i−1} + (2γ/n) X_n^T (Y_n − X_n w_{i−1}) ),   i = 1, . . . , T_max

At each iteration a non-linear soft thresholding operator is applied to a gradient step

MLCC 2019 44

SLIDE 45

Iterative Soft Thresholding (cont.)

w_0 = 0,   w_i = S_{λγ}( w_{i−1} + (2γ/n) X_n^T (Y_n − X_n w_{i−1}) ),   i = 1, . . . , T_max

◮ The iteration should be run until a convergence criterion is met, e.g. ‖w_i − w_{i−1}‖ ≤ ε for some precision ε, or until a maximum number of iterations T_max is reached
◮ To ensure convergence we should choose the step-size γ = n / (2‖X_n^T X_n‖)

MLCC 2019 45
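A minimal NumPy sketch of ISTA for the LASSO as written above, with the step size taken from the spectral norm of X_n^T X_n; the tolerance and iteration cap are illustrative.

```python
import numpy as np

def soft_threshold(u, alpha):
    """S_alpha(u): set entries with |u| <= alpha to zero, shrink the rest by alpha."""
    return np.sign(u) * np.maximum(np.abs(u) - alpha, 0.0)

def ista_lasso(Xn, Yn, lam, T_max=1000, eps=1e-6):
    n, D = Xn.shape
    gamma = n / (2 * np.linalg.norm(Xn.T @ Xn, 2))   # step size from the slide
    w = np.zeros(D)
    for _ in range(T_max):
        grad_step = w + (2 * gamma / n) * Xn.T @ (Yn - Xn @ w)   # gradient step on the data term
        w_new = soft_threshold(grad_step, lam * gamma)           # soft thresholding
        if np.linalg.norm(w_new - w) <= eps:                     # stopping rule
            return w_new
        w = w_new
    return w
```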

SLIDE 46

Splitting Methods

In ISTA the contributions of the error term and of the regularization are split:
◮ the argument of the soft thresholding operator corresponds to a step of gradient descent, in the direction (2/n) X_n^T (Y_n − X_n w_{i−1})
◮ the soft thresholding operator depends only on the regularization and acts component-wise on a vector w, so that

S_α(u) = (|u| − α)_+ · u/|u|

MLCC 2019 46

SLIDE 47

Soft Thresholding and Sparsity

S_α(u) = (|u| − α)_+ · u/|u|

The above expression shows that the coefficients of the solution computed by ISTA can be exactly zero

MLCC 2019 47

SLIDE 48

Soft Thresholding and Sparsity

S_α(u) = (|u| − α)_+ · u/|u|

The above expression shows that the coefficients of the solution computed by ISTA can be exactly zero

This can be contrasted with Tikhonov regularization, where this is hardly ever the case

MLCC 2019 48
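A small numerical illustration of this contrast (numbers are illustrative): soft thresholding sets small entries exactly to zero, while the coordinate-wise rescaling that Tikhonov regularization performs (exactly so when the design is orthonormal) only shrinks them.

```python
import numpy as np

u = np.array([3.0, 0.4, -0.2, -2.5])
alpha = 0.5

soft = np.sign(u) * np.maximum(np.abs(u) - alpha, 0.0)  # soft thresholding: exact zeros appear
ridge_like = u / (1.0 + alpha)                          # ridge-style shrinkage: never exactly zero

print(soft)        # [ 2.5  0.  -0.  -2. ]
print(ridge_like)  # [ 2.    0.267 -0.133 -1.667] (approximately)
```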

SLIDE 49

Lasso meets Tikhonov: Elastic Net

Indeed, it is possible to see that:
◮ while Tikhonov regularization allows one to compute a stable solution, in general its solution is not sparse
◮ on the other hand, the solution of the LASSO might not be stable

MLCC 2019 49

SLIDE 50

Lasso meets Tikhonov: Elastic Net

Indeed, it is possible to see that:
◮ while Tikhonov regularization allows one to compute a stable solution, in general its solution is not sparse
◮ on the other hand, the solution of the LASSO might not be stable

The elastic net algorithm, defined as

min_{w ∈ R^D}  (1/n) ∑_{i=1}^{n} (y_i − f_w(x_i))^2 + λ( α‖w‖_1 + (1 − α)‖w‖_2^2 ),   α ∈ [0, 1]     (2)

can be seen as a hybrid algorithm which interpolates between Tikhonov and the LASSO

MLCC 2019 50

SLIDE 51

ISTA for Elastic Net

The ISTA procedure can be adapted to solve the elastic net problem, where the gradient descent step also incorporates the derivative of the ℓ2 penalty term. The resulting algorithm is

w_0 = 0,   w_i = S_{λαγ}( (1 − λγ(1 − α)) w_{i−1} + (2γ/n) X_n^T (Y_n − X_n w_{i−1}) ),   i = 1, . . . , T_max

To ensure convergence we should choose the step-size γ = n / (2(‖X_n^T X_n‖ + λ(1 − α)))

MLCC 2019 51
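Following the same pattern, a hedged sketch of ISTA for the elastic net that mirrors the update rule and step size written above; the constants in front of the penalty terms depend on the exact normalization of the objective, so the expressions below simply follow the slide.

```python
import numpy as np

def ista_elastic_net(Xn, Yn, lam, alpha, T_max=1000, eps=1e-6):
    """ISTA for (1/n)||Yn - Xn w||^2 + lam*(alpha*||w||_1 + (1-alpha)*||w||_2^2).

    The shrink factor and step size follow the expressions on the slide above.
    """
    n, D = Xn.shape
    gamma = n / (2 * (np.linalg.norm(Xn.T @ Xn, 2) + lam * (1 - alpha)))
    w = np.zeros(D)
    for _ in range(T_max):
        # gradient step on the smooth part: data term plus the squared-norm penalty
        u = (1 - lam * gamma * (1 - alpha)) * w + (2 * gamma / n) * Xn.T @ (Yn - Xn @ w)
        # soft thresholding with threshold lam * alpha * gamma
        w_new = np.sign(u) * np.maximum(np.abs(u) - lam * alpha * gamma, 0.0)
        if np.linalg.norm(w_new - w) <= eps:
            return w_new
        w = w_new
    return w
```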

SLIDE 52

Wrapping Up

Sparsity and interpretable models:
◮ greedy methods
◮ convex relaxation

MLCC 2019 52

SLIDE 53

Next Class

unsupervised learning: dimensionality reduction!

MLCC 2019 53