SLIDE 1

SPARSITY-BASED LEARNING

REGULARIZATION METHODS FOR HIGH DIMENSIONAL LEARNING
Francesca Odone and Lorenzo Rosasco

odone@disi.unige.it - lrosasco@mit.edu

SLIDE 2

ABOUT THIS CLASS

  • The meaning of selecting variables
  • Approaches to variable selection
  • Sparsity-based regularization

SLIDE 3

WHY SELECTING VARIABLES?

Nowadays many research areas produce data with tens or hundreds of thousands of variables (computational biology, signal processing, combinatorial chemistry, ...).

  • interpretability of the model: in many learning problems a main goal, besides good prediction, is to gain a better understanding of the problem, for example detecting the most discriminative information.
  • data driven representation: in place of tailoring an ad hoc representation (for example via a kernel) we can take a large, redundant set of measurements and then try to devise a data-driven selection scheme. More appropriate descriptions lead to better performance (and avoid overfitting).
  • compression: it is often desirable to have parsimonious models, that is, models requiring the computation of a small number of parameters.

More generally, if the target function is sparse, enforcing sparsity of the solution can be a way to avoid overfitting.

SLIDE 4

A USEFUL EXAMPLE

BIOMARKER IDENTIFICATION
Set up:

  • n patients belonging to 2 groups (say two different diseases)
  • p measurements for each patient quantifying the expression of p genes

Goal:

  • learn a classification rule to predict occurrence of the disease for future patients
  • detect which genes are responsible for the disease

p ≫ n PARADIGM: typically n is on the order of tens and p of thousands.

SLIDE 5

SOME NOTATION

MEASUREMENT MATRIX
Let X be the n × p measurements matrix, whose entry x_j^i is the value of the i-th variable for the j-th example:

    X = ( x_1^1  . . .  x_1^p )
        (  ...          ...   )
        ( x_n^1  . . .  x_n^p )

  • n is the number of training examples
  • p is the number of variables
  • we denote by X^i, i = 1, . . . , p the columns of X

For each patient we have a response (output) y ∈ R or y = ±1. In particular we are given the responses for the training set, Y = (y1, y2, . . . , yn).
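As a concrete, purely illustrative instance of this setup, the NumPy sketch below builds an n × p matrix X with p ≫ n and a response vector Y generated by a sparse coefficient vector; the sizes, noise level and generative model are assumptions made for the example, not data from the course.

```python
import numpy as np

# A small synthetic instance of the notation above (all numbers are illustrative).
rng = np.random.default_rng(0)
n, p = 40, 2000                      # n examples (patients), p variables (genes)
X = rng.standard_normal((n, p))      # n x p measurement matrix, one row per patient
beta_true = np.zeros(p)
beta_true[:5] = [2.0, -1.5, 1.0, 0.8, -0.7]        # only a handful of "relevant" variables
Y = X @ beta_true + 0.1 * rng.standard_normal(n)   # training responses Y = (y_1, ..., y_n)
# The column X[:, i] plays the role of the variable X^i, i = 1, ..., p.
```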

SLIDE 6

APPROACHES TO VARIABLE SELECTION

The choice of a subset of variables may be led by:

  • selection of a set of the most relevant variables — variable ranking
  • selection of a set of features useful to build a good predictor

The two approaches are related but different: the set of most relevant variables may contain many redundant variables.

SLIDE 7

APPROACHES TO VARIABLE SELECTION

We still have to define what "relevant" variables are. Different approaches are based on different ways of specifying what is relevant

(see "Introduction to variable and feature selection", Guyon and Elisseeff '03)

  • Filter methods
  • Wrappers
  • Embedded methods

We will focus mainly on embedded methods.

SLIDE 8

FILTER METHODS

Filter methods rely on variable ranking. They are a pre-processing step — the ranking is not necessarily used to build predictors. After variable ranking, a thresholding criterion is applied to perform subset selection. There is a very rich literature; here we just mention methods based on individual variables and focus on supervised learning. In this context popular methods are the so-called correlation methods.

SLIDE 9

RANKING CRITERIA (1)

  • Correlation criteria: rank variables according to a correlation score
  • Single variable classifiers: select variables according to their individual predictive power

Such methods detect linear dependencies between each variable and the target. In the case of nonlinear dependencies other approaches are available, e.g., a nonlinear preprocessing followed by simple correlation.
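As a hedged sketch of a correlation-based filter, the NumPy function below scores each variable by its absolute Pearson correlation with the output; the function name and the cutoff of 10 variables in the usage comment are choices made for this example, not part of the slides.

```python
import numpy as np

def correlation_ranking(X, Y):
    """Filter-method sketch: score each variable by its absolute Pearson
    correlation with the output, independently of any predictor."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(Yc) + 1e-12
    scores = np.abs(Xc.T @ Yc) / denom
    order = np.argsort(scores)[::-1]        # most correlated variables first
    return order, scores

# order, scores = correlation_ranking(X, Y)
# selected = order[:10]   # the thresholding step (here: keep 10 variables) is left to the user
```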

SLIDE 10

FILTER METHODS: PROS AND CONS

CONS
  • They are prone to provide redundant sets of features
  • No explicit selection is embedded in the method: some thresholding is needed

PROS
  • Straightforward and simple
  • Scalable and fast
  • Good empirical performance

SLIDE 11

WRAPPER METHODS

An algorithm of choice is used (as a black box) to evaluate the importance of a set of variables for the given task.

ONE NEEDS TO DEFINE
  • how to search the space of all possible variable subsets
  • how to assess the prediction performance of the learning machine ⇒ usually a validation set or cross-validation is adopted
  • which predictor to use ⇒ many machines have been adopted in the literature (decision trees, naïve Bayes, LS, SVM, ...)

SLIDE 12

WRAPPER METHODS

⇒ a strategy is needed to explore the variable space. A brute force approach is infeasible even for relatively few features (NP-hard problem). In any case, coarse search strategies also have the benefit of alleviating overfitting.

TWO MAIN GREEDY APPROACHES

  • forward selection: start with no variables and incrementally add them...
  • backward elimination: start with all the variables and proceed iteratively to discard them...

...until some stopping criterion is satisfied, for example when a fixed number of variables, or a certain validation error, is reached.

Both methods yield nested subsets of variables.
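To make the wrapper idea concrete, here is a minimal and deliberately naive forward-selection sketch; the least-squares black box, the single held-out validation split, and the fixed number of selected variables are all assumptions made for this example.

```python
import numpy as np

def forward_selection(X, Y, max_vars=10, val_frac=0.3, seed=0):
    """Wrapper-method sketch: greedy forward selection with an ordinary
    least-squares fit as the black-box predictor and a held-out validation
    split to assess prediction performance."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])
    n_val = int(val_frac * len(idx))
    val, tr = idx[:n_val], idx[n_val:]
    selected = []
    for _ in range(max_vars):                    # stopping criterion: fixed number of variables
        best_j, best_err = None, np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            cols = selected + [j]
            w, *_ = np.linalg.lstsq(X[tr][:, cols], Y[tr], rcond=None)
            err = np.mean((X[val][:, cols] @ w - Y[val]) ** 2)
            if err < best_err:
                best_j, best_err = j, err
        selected.append(best_j)                  # subsets are nested by construction
    return selected
```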

SLIDE 13

WRAPPER METHODS: PROS AND CONS

PROS
  • They are universal and simple, since they use the learning machine as a black box

CONS
  • They require repeated re-training and parameter tuning
  • They require the implementation of efficient search mechanisms
  • Selection is not embedded and thresholding is needed

SLIDE 14

EMBEDDED METHODS

The selection procedure is embedded in the training phase.

AN INTUITION
What happens to the generalization properties of empirical risk minimization as we subtract variables?
  • if we keep all the variables we probably overfit
  • if we take just a few variables we are likely to oversmooth

We are going to discuss this class of methods in detail.

SLIDE 15

SELECTING VARIABLES AND APPROXIMATING FUNCTIONS

We suppose the output is a linear combination of the input variables,

    f(x) = Σ_{i=1}^p βi x^i = ⟨β, x⟩

Each coefficient βi can be seen as a weight on the i-th variable. The intuition is that as we discard variables we make the model simpler and avoid overfitting.

Brute force approach: try ALL possible subsets ⇒ infeasible!
Can we use regularization?

SLIDE 16

SOLVING A BIG LINEAR SYSTEM

THE SYSTEM
For a given training set (xj, yj) we may write the following system:

    Σ_{i=1}^p βi x_j^i = yj,   j = 1, . . . , n,

or, in vector notation,

    Xβ = Y

Since usually p ≫ n, the system is underdetermined (a solution exists but it is not unique), and thus subject to ill-conditioning. Moreover, we can expect unstable behavior w.r.t. noisy data, thus we may expect ill-posedness. This situation is often referred to as the "large p, small n" or "curse of dimensionality" problem.

SLIDE 17

TIKHONOV REGULARIZED SOLUTION

To restore well-posedness, since we are dealing with noisy data we may adopt a Tikhonov regularization approach:

    min_{β ∈ R^p} { (1/n) Σ_{j=1}^n V(yj, ⟨β, xj⟩) + λ Σ_{i=1}^p βi² }

We have seen that Tikhonov regularization leads to solutions with good generalization properties.
⇒ How about variable selection?
In general all the βi will be different from zero. The selection property is not built in and can be obtained only by adding a thresholding step.
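For the square loss this Tikhonov problem has a well-known closed form, sketched below in NumPy with the 1/n scaling written above; it is hedged as an illustration, and the usage comment shows why no coefficient is driven exactly to zero.

```python
import numpy as np

def ridge_solution(X, Y, lam):
    """Closed-form Tikhonov (ridge) solution of
       min_beta (1/n) ||Y - X beta||^2 + lam * ||beta||^2,
    i.e. beta = (X^T X + n * lam * I)^{-1} X^T Y."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ Y)

# beta_ridge = ridge_solution(X, Y, lam=0.1)
# (np.abs(beta_ridge) > 1e-8).sum()   # typically equals p: no coefficient is exactly zero
```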

SLIDE 18

SPARSITY

Define the "zero"-norm (not a real norm) as

    ‖β‖_0 = #{ i = 1, . . . , p | βi ≠ 0 }

It is a measure of how "complex" f is and of how many variables are important. Is it a good way to define sparsity?
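In code the "zero"-norm is just a count of the nonzero coefficients; a small sketch follows, where the tolerance argument is an implementation detail, not part of the definition.

```python
import numpy as np

# "Zero"-norm of a coefficient vector: the number of (numerically) nonzero entries.
def l0_norm(beta, tol=0.0):
    return int(np.count_nonzero(np.abs(beta) > tol))

# l0_norm(np.array([0.0, 3.0, 0.0, -1.2]))   # -> 2
```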

SLIDE 19

ℓ0 REGULARIZATION

If we want to select variables we can look for

    min_{β ∈ R^p} ‖β‖_0   s.t.   Y = Xβ

or, since the data are noisy, we would like to consider

    min_{β ∈ R^p} { (1/n) Σ_{j=1}^n V(yj, ⟨β, xj⟩) + λ ‖β‖_0 }

⇒ This is as difficult as trying all possible subsets of variables. Can we find meaningful approximations?

SLIDE 20

APPROXIMATE SOLUTIONS

TWO MAIN APPROACHES
There exist approximations for various choices of loss function. Most of them fall in either one of the following two approaches:

  1. Convex relaxation (ℓ1 regularization)
  2. Greedy schemes (boosting algorithms, projection pursuit, ...)

Here we discuss the first class of methods.

SLIDE 21

GREEDY APPROACHES

Very similar techniques have been proposed by different communities with different names: statistics (forward stagewise regression), approximation theory (greedy algorithms), learning (boosting methods), signal processing (projection pursuit methods).

The various algorithms are often based on the iteration of the following steps. After some initialization:

  1. select a feature
  2. find/update the solution.

These schemes proceed incrementally and are not based on a global optimization procedure.

SLIDE 22

SOME COMMENTS ON GREEDY ALGORITHMS

  • Under suitable assumptions they can be shown to approximate the ℓ0-norm solution
  • The number of iterations corresponds to the number of features selected
  • The computational cost increases, as we have to solve larger and larger least squares problems
  • We do not have to work with the whole data matrix

SLIDE 23

CONVEX RELAXATION

A natural approximation to ℓ0 regularization is given by the ℓ1 penalty.

ℓ1 PENALTY
The coefficient vector β̂ is estimated as the minimizer of the empirical risk penalized with the ℓ1 norm:

    β̂ = arg min_β { (1/n) Σ_{j=1}^n V(yj, fβ(xj)) + λ ‖β‖_1 }

where fβ(xj) = ⟨β, xj⟩, ‖β‖_1 = Σ_{i=1}^p |βi|, and λ is a suitable regularization parameter.

SLIDE 24

LASSO

In the case of the square loss we obtain a scheme known as the Lasso (Least Absolute Shrinkage and Selection Operator) or Basis Pursuit.

LASSO REGRESSION

    β̂ = arg min_β { (1/n) Σ_{j=1}^n (yj − fβ(xj))² + λ Σ_{i=1}^p |βi| }

SLIDE 25

WHAT IS THE DIFFERENCE BETWEEN ℓ1 AND ℓ2 REGULARIZATION?

We have seen that Tikhonov regularization is a good way to avoid overfitting. Lasso provides sparse solutions; Tikhonov regularization does not. Why?

SLIDE 26

CONSTRAINED MINIMIZATION

Consider

    min_β Σ_{i=1}^p |βi|   subject to   ‖Y − Xβ‖²_n ≤ R

(Figure: illustration of the constraint in the (β1, β2) plane.)

SLIDE 27

HOW TO FIND LASSO SOLUTIONS

    β̂ = arg min_β { (1/n) Σ_{j=1}^n (yj − fβ(xj))² + λ Σ_{i=1}^p |βi| }

The ℓ1 penalty leads to a nonlinear problem. The functional is convex but not strictly convex, so the solution is not unique. One possible approach relies on linear (or quadratic) programming techniques (e.g. the LARS algorithm). Using convex analysis tools we will obtain a simple iterative algorithm.

SLIDE 28

AN ITERATIVE ALGORITHM FOR ℓ1 PENALTIES

It can be proved that the following iterative algorithm converges to the solution β^λ of ℓ1 regularization as the number of iterations increases. Set β^λ_0 = 0 and let

    β^λ_t = S_λ[ β^λ_{t−1} + τ X^T (Y − X β^λ_{t−1}) ]

where τ is a normalization constant ensuring τ ‖X‖² ≤ 1, and the map S_λ is defined component-wise as

    S_λ(βi) = βi + λ/2   if βi < −λ/2
              0          if |βi| ≤ λ/2
              βi − λ/2   if βi > λ/2

(see Daubechies et al. '05 and De Mol, De Vito, Rosasco '08)
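A hedged NumPy sketch of this iterative soft-thresholding scheme follows. It minimizes the square-loss functional written with a 1/(2n) factor, so the step size and threshold constants differ slightly from the slide's λ/2 convention, while the structure (a gradient step on the data term followed by component-wise soft thresholding) is the same; all names are illustrative.

```python
import numpy as np

def soft_threshold(z, a):
    """Component-wise soft-thresholding map: zero on [-a, a], shrink elsewhere."""
    return np.sign(z) * np.maximum(np.abs(z) - a, 0.0)

def ista_lasso(X, Y, lam, n_iter=1000):
    """Iterative soft-thresholding for
       min_beta (1/(2n)) ||Y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    tau = n / (np.linalg.norm(X, 2) ** 2)   # step size: inverse Lipschitz constant of the data term
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - Y) / n     # gradient of the smooth (square-loss) part
        beta = soft_threshold(beta - tau * grad, tau * lam)
    return beta

# beta_lasso = ista_lasso(X, Y, lam=0.1)    # larger lam -> sparser solution
```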

SLIDE 29

SOFT THRESHOLDING FUNCTION

(Figure: plot of the soft-thresholding function S_λ(β).)

SLIDE 30

ALGORITHMIC ASPECTS

Set β^λ_0 = 0
for t = 1 : tmax
    β^λ_t = S_λ[ β^λ_{t−1} + τ X^T (Y − X β^λ_{t−1}) ]

The algorithm we just described is very easy to implement but can be quite heavy from a computational point of view. The iteration can be stopped at the step t where a desired precision is reached. The complexity of the algorithm is O(t p²) for each value of the regularization parameter. The regularization parameter controls the degree of sparsity of the solution.

SLIDE 31

SOME REMARKS

  • About uniqueness: the solution of ℓ1 regularization is not unique. Note that the various solutions have the same prediction properties but different selection properties.
  • Correlated variables: if we have a group of correlated variables the algorithm is going to select just one of them. This can be bad for interpretability but may be good for compression.
  • In the p ≫ n case the Lasso tends to choose at most n variables and then saturate.

SLIDE 32

ELASTIC NET REGULARIZATION

One possible way to cope with the previous problems is to consider the following penalty term:

    p_ε(β) = Σ_{i=1}^p ( |βi| + ε βi² )

ELASTIC NET
The coefficient vector β̂ is estimated as the minimizer of the empirical risk penalized by p_ε(β):

    β̂ = arg min_β { (1/n) Σ_{j=1}^n (yj − fβ(xj))² + λ p_ε(β) }

λ is the regularization parameter; ε controls the amount of smoothness.

(Zou, Hastie '05; De Mol, De Vito, Rosasco '07)
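A hedged proximal-gradient sketch of elastic net regularization, written in the same 1/(2n) convention as the lasso sketch above: the ℓ2 part of the penalty is smooth, so it joins the gradient step, while the ℓ1 part is handled by soft thresholding. The names and scalings are choices made for this example, not the slides' exact constants.

```python
import numpy as np

def elastic_net(X, Y, lam, eps, n_iter=1000):
    """Proximal-gradient sketch for
       min_beta (1/(2n)) ||Y - X beta||^2 + lam * sum_i (|beta_i| + eps * beta_i^2)."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n + 2 * lam * eps   # Lipschitz constant of the smooth part
    tau = 1.0 / L
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - Y) / n + 2 * lam * eps * beta      # data term + smooth l2 penalty
        z = beta - tau * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - tau * lam, 0.0)  # soft-threshold the l1 part
    return beta
```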

SLIDE 33

ELASTIC NET REGULARIZATION (CONT.)

  • The ℓ1 term promotes sparsity, the ℓ2 term smoothness.
  • The functional is strictly convex: the solution is unique.
  • A whole group of correlated variables is selected rather than just one variable in the group.

SLIDE 34

GEOMETRY OF THE PROBLEM

(Figure: geometry of the penalties in the (β1, β2) plane.)

SLIDE 35

ℓq REGULARIZATION?

Consider a more general penalty of the form

    ‖β‖_q = ( Σ_{i=1}^p |βi|^q )^{1/q}

(called bridge regression in statistics). It can be proved that:

  • as q → 0, ‖β‖_q^q → ‖β‖_0,
  • for 0 < q < 1 the penalty is not a convex map,
  • for q = 1 it is convex, and it is strictly convex for q > 1.

(Figure from The Elements of Statistical Learning, Hastie, Tibshirani and Friedman, 2009.)

SLIDE 36

COEFFICIENTS SHRINKAGE

Learning algorithms based on sparsity usually suffer from an excessive shrinkage of the coefficients. For this reason, in practice a two-step procedure is often used:

  • use the Lasso to select the relevant components
  • use ordinary least squares (or RLS with a small λ) on the selected variables.
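A minimal sketch of the second step, assuming some ℓ1 solver (for example the iterative scheme sketched earlier) has already produced a sparse estimate beta_l1; the helper name and the tolerance are illustrative.

```python
import numpy as np

def refit_on_support(X, Y, beta_l1, tol=1e-8):
    """Step 2 of the two-step procedure: keep only the variables selected by
    the l1 solution and refit them by ordinary least squares, which removes
    most of the shrinkage on the selected coefficients."""
    support = np.flatnonzero(np.abs(beta_l1) > tol)
    beta = np.zeros(X.shape[1])
    if support.size:
        w, *_ = np.linalg.lstsq(X[:, support], Y, rcond=None)
        beta[support] = w
    return beta, support
```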

SLIDE 37

REGULARIZATION PATH

Continuation strategy
  • Consider a decreasing sequence of regularization parameters λ1 > λ2 > . . . > λT
  • Start from the largest value λ1, which should be faster to solve (it corresponds to a sparser solution)
  • Use the previous solution as a warm start for the next problem
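A hedged sketch of this continuation strategy, reusing the iterative soft-thresholding step from before: the loop visits the regularization parameters in decreasing order and warm-starts each problem at the previous solution. All names and iteration counts are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, a):
    return np.sign(z) * np.maximum(np.abs(z) - a, 0.0)

def lasso_path(X, Y, lambdas, n_iter=200):
    """Solve l1-regularized least squares for a decreasing sequence of
    regularization parameters, warm-starting each run at the previous solution."""
    n, p = X.shape
    tau = n / (np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(p)                              # cold start only for the largest lambda
    path = []
    for lam in sorted(lambdas, reverse=True):       # lambda_1 > lambda_2 > ... > lambda_T
        for _ in range(n_iter):
            grad = X.T @ (X @ beta - Y) / n
            beta = soft_threshold(beta - tau * grad, tau * lam)
        path.append((lam, beta.copy()))
    return path

# path = lasso_path(X, Y, lambdas=[1.0, 0.3, 0.1, 0.03, 0.01])
```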

SLIDE 38

SUMMARY

  • Sparsity-based regularization gives a way to deal with high dimensional problems
  • It also gives a way to perform principled variable selection
  • Very active research field, with connections to signal processing (compressed sensing), statistics and approximation theory
  • Lasso is sparse but not stable; elastic net is a way to solve this problem

SLIDE 39

APPENDIX: CONNECTION TO SPARSE APPROXIMATION

In the last decades there has been an explosion of alternative representations for signals beyond traditional Fourier representations; the setting is naturally infinite dimensional.

OVERCOMPLETE DICTIONARY
A dictionary is a collection of parametrized waveforms, or atoms, or features,

    D = { φ_γ : X → R ; γ ∈ Γ } ⊂ H,

so that a function f ∈ H may be decomposed as

    f = Σ_{γ ∈ Γ} β_γ φ_γ.

Most dictionaries are overcomplete — then the decomposition is not unique.

SLIDE 40

EXAMPLES OF OVERCOMPLETE DICTIONARY

In recent years a large number of overcomplete dictionaries have been proposed: stationary wavelets, wavelet packets, cosine packets, chirplets, warplets, ... The simplest example we can think of is a union of orthonormal bases (ei), (gi), ... in some space H. It is clear that every function in H can be written in many ways using the elements of D.

SLIDE 41

SPARSE RECOVERY

Non-uniqueness of the decomposition allows us to adapt the representation to our purposes, by choosing the most appropriate set of features for the problem. In the case of infinite dimensional dictionaries one can consider an approximate decomposition

    f = Σ_{i=1}^p β_{γi} φ_{γi} + R(p),

where R(p) is the residual. Alternatively, De Mol, De Vito, Rosasco (2008) discuss an elastic net representation for an infinite-dimensional dictionary, under appropriate conditions on the weights.
