 
SPARSITY-BASED LEARNING
REGULARIZATION METHODS FOR HIGH DIMENSIONAL LEARNING
Francesca Odone and Lorenzo Rosasco
odone@disi.unige.it - lrosasco@mit.edu
Regularization Methods for High Dimensional Learning Sparsity-based Learning

ABOUT THIS CLASS
• The meaning of selecting variables
• Approaches to variable selection
• Sparsity-based regularization

WHY SELECT VARIABLES?
Nowadays many research areas produce data with tens or hundreds of thousands of variables (computational biology, signal processing, combinatorial chemistry, ...).
• Interpretability of the model: in many learning problems a main goal, besides good prediction, is to gain a better understanding of the problem, for example by detecting the most discriminative information.
• Data-driven representation: in place of tailoring an ad hoc representation (for example via a kernel), we can take a large, redundant set of measurements and then try to devise a data-driven selection scheme. More appropriate descriptions lead to better performance (and avoid overfitting).
• Compression: it is often desirable to have parsimonious models, that is, models requiring the computation of a small number of parameters.
More generally, if the target function is sparse, enforcing sparsity of the solution can be a way to avoid overfitting.

A USEFUL EXAMPLE: BIOMARKER IDENTIFICATION
Set up:
• n patients belonging to 2 groups (say, two different diseases)
• p measurements for each patient, quantifying the expression of p genes
Goal:
• learn a classification rule to predict occurrence of the disease for future patients
• detect which genes are responsible for the disease
p ≫ n PARADIGM: typically n is on the order of tens and p on the order of thousands.

SOME NOTATION: MEASUREMENT MATRIX
Let X be the n × p measurement matrix:
\[
X = \begin{pmatrix}
x_1^1 & \cdots & x_1^p \\
\vdots & \ddots & \vdots \\
x_n^1 & \cdots & x_n^p
\end{pmatrix}
\]
• n is the number of training examples
• p is the number of variables
• we denote by $X^i$, $i = 1, \dots, p$, the columns of X
For each patient we have a response (output) $y \in \mathbb{R}$ or $y = \pm 1$. In particular, we are given the responses for the training set, $Y = (y_1, y_2, \dots, y_n)$.

APPROACHES TO VARIABLE SELECTION
The choice of a subset of variables may be guided by either of two goals:
• selecting a set of the most relevant variables (variable ranking)
• selecting a set of features useful to build a good predictor
The two approaches are related but different: the set of most relevant variables may contain many redundant variables.

APPROACHES TO VARIABLE SELECTION
We still have to define what "relevant" variables are. Different approaches are based on different ways of specifying what is relevant (see "An Introduction to Variable and Feature Selection", Guyon and Elisseeff '03):
• Filter methods
• Wrappers
• Embedded methods
We will focus mainly on embedded methods.

FILTER METHODS
• Filter methods rely on variable ranking.
• Filter methods are a pre-processing step: the variable ranking is not necessarily used to build predictors.
• After variable ranking, a thresholding criterion is applied to perform subset selection.
There is a very rich literature; here we just mention methods based on individual variables and focus on supervised learning. In this context, popular methods are the so-called correlation methods.

RANKING CRITERIA (1)
• Correlation criteria: rank variables according to a correlation score
• Single variable classifiers: select variables according to their individual predictive power
Such methods detect linear dependencies between variable and target. In the case of non-linear dependencies other approaches are available, e.g., non-linear preprocessing followed by simple correlation.

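A minimal sketch of a correlation criterion, assuming the score is the absolute Pearson correlation between each column of X and the output (the slides do not fix a particular score). The function name `correlation_ranking`, the toy data, and the cut-off of 10 variables are illustrative assumptions.

```python
import numpy as np

def correlation_ranking(X, y):
    """Rank the columns of X by absolute Pearson correlation with y."""
    Xc = X - X.mean(axis=0)          # center each variable
    yc = y - y.mean()                # center the output
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    # Constant columns have zero norm: give them score 0 instead of dividing by 0.
    scores = np.zeros(X.shape[1])
    nonzero = denom > 0
    scores[nonzero] = np.abs(Xc[:, nonzero].T @ yc) / denom[nonzero]
    order = np.argsort(-scores)      # most correlated variable first
    return order, scores

# Toy data in the p >> n regime (assumed, for illustration only).
rng = np.random.default_rng(0)
n, p = 30, 200
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 5] + 0.1 * rng.standard_normal(n)

order, scores = correlation_ranking(X, y)
top_k = order[:10]                   # thresholding step: keep the 10 best-ranked variables
print(top_k)
```

The ranking itself selects nothing; the last step shows the separate thresholding that a filter method needs.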
FILTER METHODS: PROS AND CONS
CONS
• They are prone to provide redundant sets of features
• No explicit selection is embedded in the method: some thresholding is needed
PROS
• Straightforward and simple
• Scalable and fast
• Good empirical performance

WRAPPER METHODS
An algorithm of choice is used (as a black box) to evaluate the importance of a set of variables for the given task.
ONE NEEDS TO DEFINE
• how to search the space of all possible variable subsets
• how to assess the prediction performance of the learning machine ⇒ usually a validation set or cross-validation is adopted
• which predictor to use ⇒ many machines have been adopted in the literature (decision trees, naïve Bayes, LS, SVM, ...)

WRAPPER METHODS
⇒ A strategy is needed to explore the variable space. A brute force approach is infeasible even for relatively few features (NP-hard problem). In any case, coarse search strategies also have the benefit of alleviating overfitting.
TWO MAIN GREEDY APPROACHES
• forward selection: start with no variables and incrementally add them...
• backward elimination: start with all the variables and iteratively discard them...
...until some stopping criterion is satisfied, for example when a fixed number of variables or a certain validation error is reached. Both methods yield nested subsets.

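Forward selection can be sketched as follows, assuming least squares as the black-box predictor, a single hold-out validation set as the performance measure, and a fixed number of variables as the stopping criterion. The helper names and the toy data are illustrative, not part of the original slides.

```python
import numpy as np

def validation_error(X_tr, y_tr, X_val, y_val, subset):
    """Fit least squares on the chosen columns and return the validation MSE."""
    cols = list(subset)
    beta, *_ = np.linalg.lstsq(X_tr[:, cols], y_tr, rcond=None)
    residual = X_val[:, cols] @ beta - y_val
    return float(np.mean(residual ** 2))

def forward_selection(X_tr, y_tr, X_val, y_val, max_vars):
    """Greedy forward selection: repeatedly add the variable that most lowers validation error."""
    selected = []
    remaining = set(range(X_tr.shape[1]))
    history = []                      # nested subsets, one per step
    while remaining and len(selected) < max_vars:
        best_j = min(
            sorted(remaining),        # sorted for deterministic tie-breaking
            key=lambda j: validation_error(X_tr, y_tr, X_val, y_val, selected + [j]),
        )
        selected.append(best_j)
        remaining.remove(best_j)
        history.append(list(selected))
    return selected, history

# Toy data (assumed): only variables 1 and 4 matter.
rng = np.random.default_rng(0)
X_tr, X_val = rng.standard_normal((20, 8)), rng.standard_normal((20, 8))
w = np.zeros(8)
w[1], w[4] = 2.0, -1.0
y_tr = X_tr @ w + 0.05 * rng.standard_normal(20)
y_val = X_val @ w + 0.05 * rng.standard_normal(20)

selected, history = forward_selection(X_tr, y_tr, X_val, y_val, max_vars=3)
print(history)
```

Each step re-trains the predictor once per candidate variable, which is where the cost of wrapper methods comes from. Backward elimination would follow the same pattern, starting from the full set and removing one variable per step.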
WRAPPER METHODS: PROS AND CONS
PROS
• They are universal and simple since they use the learning machine as a black box
CONS
• They require repeated re-training and parameter tuning
• They require the implementation of efficient search mechanisms
• Selection is not embedded and thresholding is needed

EMBEDDED METHODS
The selection procedure is embedded in the training phase.
AN INTUITION
What happens to the generalization properties of empirical risk minimization as we remove variables?
• if we keep all the variables we probably overfit
• if we take just a few variables we are likely to oversmooth
We are going to discuss this class of methods in detail.

SELECTING VARIABLES AND APPROXIMATING FUNCTIONS
We suppose the output is a linear combination of the input variables:
\[
f(x) = \sum_{i=1}^{p} \beta_i x^i = \langle \beta, x \rangle
\]
where $x^i$ is the i-th component of x. Each coefficient $\beta_i$ can be seen as a weight on the i-th variable.
The intuition is that as we discard variables we make the model simpler and avoid overfitting.
Brute force approach: try ALL possible subsets ⇒ infeasible! Can we use regularization?

SOLVING A BIG LINEAR SYSTEM
THE SYSTEM
For a given training set $(x_i, y_i)$, $i = 1, \dots, n$, we may write the following system:
\[
\sum_{j=1}^{p} \beta_j x_i^j = y_i, \qquad i = 1, \dots, n,
\]
or, in vector notation, $X\beta = Y$.
• Since usually p ≫ n, the system is underdetermined (a solution generally exists but is not unique), and thus subject to ill-conditioning.
• Moreover, we can expect unstable behavior w.r.t. noisy data, thus we may expect ill-posedness.
This situation is often referred to as the "large p, small n" or "curse of dimensionality" problem.

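A small numerical illustration of the non-uniqueness, assuming a random Gaussian X (so that it has full row rank with probability one). It builds two different coefficient vectors that both solve the system exactly; the sizes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 20                          # p >> n: more unknowns than equations
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)

# One particular solution: the minimum-norm one, via the pseudoinverse.
beta_min = np.linalg.pinv(X) @ Y

# The full SVD gives a p x p matrix Vt; its last p - n rows span the null space of X
# (assuming X has rank n), so adding any of them leaves X @ beta unchanged.
_, _, Vt = np.linalg.svd(X)
null_vec = Vt[-1]
beta_other = beta_min + 10.0 * null_vec

print(np.allclose(X @ beta_min, Y), np.allclose(X @ beta_other, Y))
print(np.linalg.norm(beta_min - beta_other))
```

Both vectors fit the training data perfectly while differing substantially, so the data alone cannot tell us which variables matter.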
TIKHONOV REGULARIZED SOLUTION
To restore well-posedness, and since we are dealing with noisy data, we may adopt a Tikhonov regularization approach:
\[
\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{n} \sum_{j=1}^{n} V(y_j, \langle \beta, x_j \rangle) + \lambda \sum_{i=1}^{p} \beta_i^2 \right\}
\]
We have seen that Tikhonov regularization leads to solutions with good generalization properties.
⇒ How about variable selection? In general all the $\beta_i$ will be different from zero. The selection property is not built in and can be obtained only by adding a thresholding step.

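A sketch of the Tikhonov solution, assuming the square loss $V(y, f) = (y - f)^2$ (the slides leave V generic). In that case setting the gradient to zero gives the linear system $(X^\top X + n\lambda I)\beta = X^\top Y$. The function name `ridge`, the toy data, and the value of `lam` are illustrative assumptions.

```python
import numpy as np

def ridge(X, Y, lam):
    """Minimize (1/n) * ||X beta - Y||^2 + lam * ||beta||^2 in closed form."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ Y)

# Toy data (assumed): only the first 3 of 50 variables are relevant.
rng = np.random.default_rng(0)
n, p = 10, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]
Y = X @ beta_true + 0.1 * rng.standard_normal(n)

beta_hat = ridge(X, Y, lam=0.1)
n_nonzero = np.count_nonzero(beta_hat)
print(n_nonzero, "of", p, "coefficients are non-zero")
```

The regularized problem has a unique solution even though p > n, but the solution is dense: picking variables from it requires an extra threshold on the coefficient magnitudes.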
SPARSITY
Define the "zero"-norm (not a real norm) as
\[
\|\beta\|_0 = \#\{ i = 1, \dots, p \mid \beta_i \neq 0 \}
\]
It is a measure of how "complex" f is and of how many variables are important.
Is it a good way to define sparsity?

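As a quick check of the definition, the "zero"-norm simply counts non-zero entries. The example below also shows one reason it is not a real norm: it is unchanged by rescaling, so it fails homogeneity. The function name `zero_norm` and the vector are illustrative.

```python
import numpy as np

def zero_norm(beta):
    """Number of non-zero coefficients of beta."""
    return int(np.count_nonzero(beta))

beta = np.array([0.0, 1.5, 0.0, -2.0, 0.0])
print(zero_norm(beta))          # two variables carry non-zero weight
print(zero_norm(2.0 * beta))    # same count: scaling beta does not scale the "norm"
```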
ℓ0 REGULARIZATION
If we want to select variables we can look for
\[
\min_{\beta \in \mathbb{R}^p} \|\beta\|_0 \quad \text{s.t.} \quad Y = X\beta
\]
or, since the data are noisy, we would like to consider
\[
\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{n} \sum_{j=1}^{n} V(y_j, \langle \beta, x_j \rangle) + \lambda \|\beta\|_0 \right\}
\]
⇒ This is as difficult as trying all possible subsets of variables.
Can we find meaningful approximations?

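To make the combinatorial cost concrete, here is a brute-force sketch of the penalized ℓ0 problem, assuming the square loss. For a fixed support the best coefficients are the least squares fit on those columns, so the search reduces to enumerating all $2^p$ supports. The function name `best_subset` and the toy data are illustrative, and this is only usable for very small p.

```python
import itertools
import numpy as np

def best_subset(X, Y, lam):
    """Exhaustively minimize (1/n) * ||X beta - Y||^2 + lam * ||beta||_0 over all supports."""
    n, p = X.shape
    best_cost, best_S, n_evaluated = np.inf, (), 0
    for k in range(p + 1):
        for S in itertools.combinations(range(p), k):
            n_evaluated += 1
            if k == 0:
                residual = Y                      # empty model predicts zero
            else:
                cols = list(S)
                beta_S, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)
                residual = X[:, cols] @ beta_S - Y
            cost = np.mean(residual ** 2) + lam * k
            if cost < best_cost:
                best_cost, best_S = cost, S
    return best_S, best_cost, n_evaluated

# Toy data (assumed): p is kept tiny because the loop visits 2**p subsets.
rng = np.random.default_rng(0)
n, p = 15, 8
X = rng.standard_normal((n, p))
Y = 2.0 * X[:, 1] - X[:, 3] + 0.05 * rng.standard_normal(n)

S, cost, n_evaluated = best_subset(X, Y, lam=0.05)
print(S, n_evaluated)
```

The number of least squares fits doubles with every additional variable, which is why the slides look for approximations instead.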
APPROXIMATE SOLUTIONS
TWO MAIN APPROACHES
There exist approximations for various choices of loss function. Most of them fall into one of the following two approaches:
1. Convex relaxation (ℓ1 regularization)
2. Greedy schemes (boosting algorithms, projection pursuit, ...)
Here we discuss the first class of methods.

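As a preview of the convex relaxation, the sketch below replaces $\|\beta\|_0$ with $\|\beta\|_1 = \sum_i |\beta_i|$ and minimizes the resulting objective with iterative soft thresholding (a proximal gradient scheme). This particular solver, the square loss, and all names and parameter values are assumptions made for illustration; the slides up to this point do not prescribe an algorithm.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t, setting small entries exactly to zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_objective(X, Y, beta, lam):
    """(1/n) * ||X beta - Y||^2 + lam * ||beta||_1."""
    n = X.shape[0]
    return float(np.sum((X @ beta - Y) ** 2) / n + lam * np.sum(np.abs(beta)))

def ista(X, Y, lam, n_iter=500):
    """Iterative soft thresholding for the l1-regularized least squares problem."""
    n, p = X.shape
    # Lipschitz constant of the gradient of (1/n) * ||X beta - Y||^2.
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ beta - Y) / n
        beta = soft_threshold(beta - grad / L, lam / L)   # gradient step, then shrinkage
    return beta

# Toy data (assumed): 3 relevant variables out of 50, with n = 20.
rng = np.random.default_rng(0)
n, p = 20, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
Y = X @ beta_true + 0.05 * rng.standard_normal(n)

beta_l1 = ista(X, Y, lam=0.2)
print(np.count_nonzero(beta_l1), "non-zero coefficients out of", p)
```

Unlike the Tikhonov penalty, the soft-thresholding step can set coefficients exactly to zero, so selection happens during training rather than in a separate thresholding step.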