

SLIDE 1

MLCC 2019 Variable Selection and Sparsity

Lorenzo Rosasco UNIGE-MIT-IIT

SLIDE 2

Outline

Variable Selection
Subset Selection
Greedy Methods: (Orthogonal) Matching Pursuit
Convex Relaxation: LASSO & Elastic Net

MLCC 2019 2

SLIDE 3

Prediction and Interpretability

◮ In many practical situations, beyond prediction, it is important to obtain interpretable results

MLCC 2019 3

SLIDE 4

Prediction and Interpretability

◮ In many practical situations, beyond prediction, it is important to obtain interpretable results

◮ Interpretability is often determined by detecting which factors allow good prediction

MLCC 2019 4

SLIDE 5

Prediction and Interpretability

◮ In many practical situations, beyond prediction, it is important to obtain interpretable results

◮ Interpretability is often determined by detecting which factors allow good prediction

We look at this question from the perspective of variable selection

MLCC 2019 5

SLIDE 6

Linear Models

Consider a linear model

f_w(x) = w^T x = ∑_{j=1}^{D} w_j x_j

Here
◮ the components x_j of an input can be seen as measurements (pixel values, dictionary word counts, gene expressions, . . . )

MLCC 2019 6

SLIDE 7

Linear Models

Consider a linear model

f_w(x) = w^T x = ∑_{j=1}^{D} w_j x_j

Here
◮ the components x_j of an input can be seen as measurements (pixel values, dictionary word counts, gene expressions, . . . )
◮ Given data, the goal of variable selection is to detect which variables are important for prediction

MLCC 2019 7

SLIDE 8

Linear Models

Consider a linear model

f_w(x) = w^T x = ∑_{j=1}^{D} w_j x_j

Here
◮ the components x_j of an input can be seen as measurements (pixel values, dictionary word counts, gene expressions, . . . )
◮ Given data, the goal of variable selection is to detect which variables are important for prediction

Key assumption: the best possible prediction rule is sparse, that is, only a few of the coefficients are non-zero

MLCC 2019 8

SLIDE 9

Outline

Variable Selection
Subset Selection
Greedy Methods: (Orthogonal) Matching Pursuit
Convex Relaxation: LASSO & Elastic Net

MLCC 2019 9

SLIDE 10

Linear Models

Consider a linear model

f_w(x) = w^T x = ∑_{j=1}^{D} w_j x_j     (1)

Here
◮ the components x_j of an input are specific measurements (pixel values, dictionary word counts, gene expressions, . . . )

MLCC 2019 10

SLIDE 11

Linear Models

Consider a linear model

f_w(x) = w^T x = ∑_{j=1}^{D} w_j x_j     (1)

Here
◮ the components x_j of an input are specific measurements (pixel values, dictionary word counts, gene expressions, . . . )
◮ Given data, the goal of variable selection is to detect which variables are important for prediction

MLCC 2019 11

SLIDE 12

Linear Models

Consider a linear model

f_w(x) = w^T x = ∑_{j=1}^{D} w_j x_j     (1)

Here
◮ the components x_j of an input are specific measurements (pixel values, dictionary word counts, gene expressions, . . . )
◮ Given data, the goal of variable selection is to detect which variables are important for prediction

Key assumption: the best possible prediction rule is sparse, that is, only a few of the coefficients are non-zero

MLCC 2019 12
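As a quick illustration (not from the original slides), here is a minimal NumPy sketch that generates data from a sparse linear model: only s of the D coefficients are non-zero, and variable selection amounts to recovering their indices from (X_n, Y_n). The sizes n, D, s and the noise level are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

n, D, s = 50, 200, 5      # n samples, D variables, only s of them relevant (illustrative)
sigma = 0.1               # noise level (illustrative)

Xn = rng.standard_normal((n, D))           # n-by-D data matrix X_n
w_true = np.zeros(D)
support = rng.choice(D, size=s, replace=False)
w_true[support] = rng.standard_normal(s)   # sparse coefficient vector: few non-zero entries

Yn = Xn @ w_true + sigma * rng.standard_normal(n)   # output vector Y_n

# Variable selection: recover `support` (and w_true) from (Xn, Yn) alone.
print("true support:", np.sort(support))
```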

SLIDE 13

Notation

We need some notation:
◮ X_n, the n by D data matrix

MLCC 2019 13

SLIDE 14

Notation

We need some notation:
◮ X_n, the n by D data matrix
◮ X^j ∈ R^n, j = 1, . . . , D, its columns

MLCC 2019 14

SLIDE 15

Notation

We need some notation:
◮ X_n, the n by D data matrix
◮ X^j ∈ R^n, j = 1, . . . , D, its columns
◮ Y_n ∈ R^n, the output vector

MLCC 2019 15

SLIDE 16

High-dimensional Statistics

Estimating a linear model corresponds to solving the linear system X_n w = Y_n.
◮ Classically n ≫ D: low-dimensional/overdetermined system

MLCC 2019 16

SLIDE 17

High-dimensional Statistics

Estimating a linear model corresponds to solving the linear system X_n w = Y_n.
◮ Classically n ≫ D: low-dimensional/overdetermined system
◮ Lately n ≪ D: high-dimensional/underdetermined system

Buzzwords: compressed sensing, high-dimensional statistics, . . .

MLCC 2019 17

SLIDE 18

High-dimensional Statistics

Estimating a linear model corresponds to solving the linear system X_n w = Y_n.

[Figure: the linear system Y_n = X_n w, with X_n an n by D matrix]

MLCC 2019 18

SLIDE 19

High-dimensional Statistics

Estimating a linear model corresponds to solving the linear system X_n w = Y_n.

[Figure: the linear system Y_n = X_n w, with X_n an n by D matrix]

Sparsity!

MLCC 2019 19

SLIDE 20

Brute Force Approach

Sparsity can be measured by the ℓ0 norm

‖w‖_0 = |{j | w_j ≠ 0}|

which counts the non-zero components of w

MLCC 2019 20

SLIDE 21

Brute Force Approach

Sparsity can be measured by the ℓ0 norm

‖w‖_0 = |{j | w_j ≠ 0}|

which counts the non-zero components of w

If we consider the square loss, it can be shown that a regularization approach is given by

min_{w ∈ R^D}  (1/n) ∑_{i=1}^{n} (y_i − f_w(x_i))^2 + λ‖w‖_0

MLCC 2019 21

SLIDE 22

The Brute Force Approach is Hard

min_{w ∈ R^D}  (1/n) ∑_{i=1}^{n} (y_i − f_w(x_i))^2 + λ‖w‖_0

The above approach is as hard as a brute force approach: it amounts to considering all training sets obtained from all possible subsets of variables (singletons, pairs, triplets, . . . of variables)

MLCC 2019 22

SLIDE 23

The Brute Force Approach is Hard

min_{w ∈ R^D}  (1/n) ∑_{i=1}^{n} (y_i − f_w(x_i))^2 + λ‖w‖_0

The above approach is as hard as a brute force approach: it amounts to considering all training sets obtained from all possible subsets of variables (singletons, pairs, triplets, . . . of variables)

The computational complexity is combinatorial. In the following we consider two possible approximate approaches:
◮ greedy methods
◮ convex relaxation

MLCC 2019 23
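To make the combinatorial cost concrete, the following is a hedged sketch (not from the slides) of exhaustive subset selection with the square loss: it fits least squares on every subset of at most k_max variables and keeps the best one. The function name and parameters are illustrative.

```python
import numpy as np
from itertools import combinations

def best_subset(Xn, Yn, k_max):
    """Exhaustive (brute force) subset selection with the square loss.

    Fits ordinary least squares on every subset of at most k_max variables and
    returns the best one: the cost is sum_k C(D, k) fits, combinatorial in D.
    """
    n, D = Xn.shape
    best_err, best_S, best_wS = np.inf, (), None
    for k in range(1, k_max + 1):
        for S in combinations(range(D), k):
            XS = Xn[:, list(S)]
            wS, *_ = np.linalg.lstsq(XS, Yn, rcond=None)   # least squares on the subset
            err = np.mean((Yn - XS @ wS) ** 2)
            if err < best_err:
                best_err, best_S, best_wS = err, S, wS
    w = np.zeros(D)
    w[list(best_S)] = best_wS
    return w, best_S
```

Already for D = 100 and k_max = 3 this requires more than 160,000 least-squares fits, which is why the approximate approaches below are used instead.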

SLIDE 24

Outline

Variable Selection
Subset Selection
Greedy Methods: (Orthogonal) Matching Pursuit
Convex Relaxation: LASSO & Elastic Net

MLCC 2019 24

SLIDE 25

Greedy Methods Approach

Greedy approaches consist of the following steps:

  • 1. initialize the residual, the coefficient vector, and the index set

MLCC 2019 25

SLIDE 26

Greedy Methods Approach

Greedy approaches consist of the following steps:

  • 1. initialize the residual, the coefficient vector, and the index set
  • 2. find the variable most correlated with the residual

MLCC 2019 26

SLIDE 27

Greedy Methods Approach

Greedy approaches consist of the following steps:

  • 1. initialize the residual, the coefficient vector, and the index set
  • 2. find the variable most correlated with the residual
  • 3. update the index set to include the index of such variable

MLCC 2019 27

SLIDE 28

Greedy Methods Approach

Greedy approaches consist of the following steps:

  • 1. initialize the residual, the coefficient vector, and the index set
  • 2. find the variable most correlated with the residual
  • 3. update the index set to include the index of such variable
  • 4. update/compute coefficient vector

MLCC 2019 28

SLIDE 29

Greedy Methods Approach

Greedy approaches consist of the following steps:

  • 1. initialize the residual, the coefficient vector, and the index set
  • 2. find the variable most correlated with the residual
  • 3. update the index set to include the index of such variable
  • 4. update/compute coefficient vector
  • 5. update residual.

The simplest such procedure is called forward stage-wise regression in statistics and matching pursuit (MP) in signal processing

MLCC 2019 29

SLIDE 30

Initialization

Let r, w, I denote the residual, the coefficient vector, and an index set, respectively.

MLCC 2019 30

SLIDE 31

Initialization

Let r, w, I denote the residual, the coefficient vector, and an index set, respectively. The MP algorithm starts by initializing the residual r ∈ R^n, the coefficient vector w ∈ R^D, and the index set I ⊆ {1, . . . , D}:

r_0 = Y_n,   w_0 = 0,   I_0 = ∅

MLCC 2019 31

SLIDE 32

Selection

The variable most correlated with the residual is given by

k = arg max_{j=1,...,D} a_j,   a_j = (r_{i−1}^T X^j)^2 / ‖X^j‖^2

where we note that

v_j = r_{i−1}^T X^j / ‖X^j‖^2 = arg min_{v ∈ R} ‖r_{i−1} − X^j v‖^2,   ‖r_{i−1} − X^j v_j‖^2 = ‖r_{i−1}‖^2 − a_j

MLCC 2019 32

SLIDE 33

Selection (cont.)

Such a selection rule has two interpretations:
◮ we select the variable with the largest projection on the output, or equivalently
◮ we select the variable whose corresponding column best explains the output vector in a least squares sense

MLCC 2019 33

SLIDE 34

Active Set, Solution and Residual Update

Then, the index set is updated as I_i = I_{i−1} ∪ {k}, and the coefficient vector is given by

w_i = w_{i−1} + w^k,   w^k = v_k e_k

where e_k is the element of the canonical basis in R^D with k-th component different from zero

MLCC 2019 34

SLIDE 35

Active Set, Solution and Residual Update

Then, the index set is updated as I_i = I_{i−1} ∪ {k}, and the coefficient vector is given by

w_i = w_{i−1} + w^k,   w^k = v_k e_k

where e_k is the element of the canonical basis in R^D with k-th component different from zero

Finally, the residual is updated:

r_i = r_{i−1} − X_n w^k

MLCC 2019 35
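Putting the initialization, selection, and update steps of the previous slides together, here is a minimal NumPy sketch of matching pursuit; the fixed number of iterations T is an illustrative stopping rule.

```python
import numpy as np

def matching_pursuit(Xn, Yn, T):
    """Matching pursuit: T greedy iterations of the selection/update rules above."""
    n, D = Xn.shape
    r = Yn.copy()                        # r_0 = Y_n
    w = np.zeros(D)                      # w_0 = 0
    I = set()                            # I_0 = empty set
    col_norms = np.sum(Xn ** 2, axis=0)  # ||X^j||^2 for each column
    for _ in range(T):
        a = (Xn.T @ r) ** 2 / col_norms  # a_j = (r^T X^j)^2 / ||X^j||^2
        k = int(np.argmax(a))            # variable most correlated with the residual
        v_k = (Xn[:, k] @ r) / col_norms[k]   # one-dimensional least squares coefficient
        I.add(k)                         # I_i = I_{i-1} U {k}
        w[k] += v_k                      # w_i = w_{i-1} + v_k e_k
        r = r - v_k * Xn[:, k]           # r_i = r_{i-1} - X_n (v_k e_k)
    return w, I
```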

SLIDE 36

Orthogonal Matching Pursuit

A variant of the above procedure, called orthogonal matching pursuit (OMP), is also often considered, where the coefficient computation is replaced by

w_i = arg min_{w ∈ R^D} ‖Y_n − X_n M_{I_i} w‖^2

where the D by D matrix M_I is such that (M_I w)_j = w_j if j ∈ I and (M_I w)_j = 0 otherwise. Moreover, the residual update is replaced by

r_i = Y_n − X_n w_i

MLCC 2019 36
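For comparison, a hedged sketch of orthogonal matching pursuit: the selection rule is the same, but at each iteration the coefficients are refit by least squares restricted to the current index set (the effect of the masking matrix M_I), and the residual is recomputed from the full current solution.

```python
import numpy as np

def orthogonal_matching_pursuit(Xn, Yn, T):
    """OMP: same selection rule as MP, but coefficients refit on the active set."""
    n, D = Xn.shape
    r = Yn.copy()
    w = np.zeros(D)
    I = []
    col_norms = np.sum(Xn ** 2, axis=0)
    for _ in range(T):
        a = (Xn.T @ r) ** 2 / col_norms
        k = int(np.argmax(a))
        if k not in I:
            I.append(k)
        # w_i = argmin_w ||Y_n - X_n M_{I_i} w||^2: least squares on the columns in I
        wI, *_ = np.linalg.lstsq(Xn[:, I], Yn, rcond=None)
        w = np.zeros(D)
        w[I] = wI
        r = Yn - Xn @ w              # r_i = Y_n - X_n w_i
    return w, I
```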

SLIDE 37

Theoretical Guarantees

If
◮ the solution is sparse, and
◮ the data matrix has columns that are "not too correlated",
then OMP can be shown to recover, with high probability, the right vector of coefficients

MLCC 2019 37

SLIDE 38

Outline

Variable Selection
Subset Selection
Greedy Methods: (Orthogonal) Matching Pursuit
Convex Relaxation: LASSO & Elastic Net

MLCC 2019 38

SLIDE 39

ℓ1 Norm and Regularization

Another popular approach to finding sparse solutions is based on a convex relaxation. Namely, the ℓ0 norm is replaced by the ℓ1 norm

‖w‖_1 = ∑_{j=1}^{D} |w_j|

MLCC 2019 39

SLIDE 40

ℓ1 Norm and Regularization

Another popular approach to finding sparse solutions is based on a convex relaxation. Namely, the ℓ0 norm is replaced by the ℓ1 norm

‖w‖_1 = ∑_{j=1}^{D} |w_j|

In the case of least squares, one can consider

min_{w ∈ R^D}  (1/n) ∑_{i=1}^{n} (y_i − f_w(x_i))^2 + λ‖w‖_1

MLCC 2019 40

SLIDE 41

Convex Relaxation

min_{w ∈ R^D}  (1/n) ∑_{i=1}^{n} (y_i − f_w(x_i))^2 + λ‖w‖_1

◮ The above problem is called LASSO in statistics and Basis Pursuit in signal processing
◮ The objective function defining the corresponding minimization problem is convex but not differentiable
◮ Tools from non-smooth convex optimization are needed to find a solution

MLCC 2019 41

SLIDE 42

Iterative Soft Thresholding

A simple yet powerful procedure to compute a solution is based on the so-called iterative soft thresholding algorithm (ISTA):

MLCC 2019 42

SLIDE 43

Iterative Soft Thresholding

A simple yet powerful procedure to compute a solution is based on the so-called iterative soft thresholding algorithm (ISTA):

w_0 = 0,   w_i = S_{λγ}( w_{i−1} + (2γ/n) X_n^T (Y_n − X_n w_{i−1}) ),   i = 1, . . . , T_max

MLCC 2019 43

SLIDE 44

Iterative Soft Thresholding

A simple yet powerful procedure to compute a solution is based on the so-called iterative soft thresholding algorithm (ISTA):

w_0 = 0,   w_i = S_{λγ}( w_{i−1} + (2γ/n) X_n^T (Y_n − X_n w_{i−1}) ),   i = 1, . . . , T_max

At each iteration a non-linear soft thresholding operator is applied to a gradient step

MLCC 2019 44

SLIDE 45

Iterative Soft Thresholding (cont.)

w_0 = 0,   w_i = S_{λγ}( w_{i−1} + (2γ/n) X_n^T (Y_n − X_n w_{i−1}) ),   i = 1, . . . , T_max

◮ The iteration should be run until a convergence criterion is met, e.g. ‖w_i − w_{i−1}‖ ≤ ε for some precision ε, or until a maximum number of iterations T_max is reached
◮ To ensure convergence we should choose the step-size γ = n / (2‖X_n^T X_n‖)

MLCC 2019 45
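A minimal NumPy sketch of ISTA for the LASSO as written above, with the step size taken from the spectral norm of X_n^T X_n; the tolerance and iteration cap are illustrative.

```python
import numpy as np

def soft_threshold(u, alpha):
    """S_alpha(u): set entries with |u| <= alpha to zero, shrink the rest by alpha."""
    return np.sign(u) * np.maximum(np.abs(u) - alpha, 0.0)

def ista_lasso(Xn, Yn, lam, T_max=1000, eps=1e-6):
    n, D = Xn.shape
    gamma = n / (2 * np.linalg.norm(Xn.T @ Xn, 2))   # step size from the slide
    w = np.zeros(D)
    for _ in range(T_max):
        grad_step = w + (2 * gamma / n) * Xn.T @ (Yn - Xn @ w)   # gradient step on the data term
        w_new = soft_threshold(grad_step, lam * gamma)           # soft thresholding
        if np.linalg.norm(w_new - w) <= eps:                     # stopping rule
            return w_new
        w = w_new
    return w
```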

SLIDE 46

Splitting Methods

In ISTA the contributions of the error term and of the regularization are split:
◮ the argument of the soft thresholding operator corresponds to a step of gradient descent, in the direction (2/n) X_n^T (Y_n − X_n w_{i−1})
◮ the soft thresholding operator depends only on the regularization and acts component-wise on a vector w, so that

S_α(u) = (|u| − α)_+ · u/|u|

MLCC 2019 46

SLIDE 47

Soft Thresholding and Sparsity

S_α(u) = (|u| − α)_+ · u/|u|

The above expression shows that the coefficients of the solution computed by ISTA can be exactly zero

MLCC 2019 47

SLIDE 48

Soft Thresholding and Sparsity

S_α(u) = (|u| − α)_+ · u/|u|

The above expression shows that the coefficients of the solution computed by ISTA can be exactly zero

This can be contrasted with Tikhonov regularization, where this is hardly ever the case

MLCC 2019 48
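A small numerical illustration of this contrast (numbers are illustrative): soft thresholding sets small entries exactly to zero, while the coordinate-wise rescaling that Tikhonov regularization performs (exactly so when the design is orthonormal) only shrinks them.

```python
import numpy as np

u = np.array([3.0, 0.4, -0.2, -2.5])
alpha = 0.5

soft = np.sign(u) * np.maximum(np.abs(u) - alpha, 0.0)  # soft thresholding: exact zeros appear
ridge_like = u / (1.0 + alpha)                          # ridge-style shrinkage: never exactly zero

print(soft)        # [ 2.5  0.  -0.  -2. ]
print(ridge_like)  # [ 2.    0.267 -0.133 -1.667] (approximately)
```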

SLIDE 49

Lasso meets Tikhonov: Elastic Net

Indeed, it is possible to see that:
◮ while Tikhonov regularization allows one to compute a stable solution, in general its solution is not sparse
◮ on the other hand, the solution of the LASSO might not be stable

MLCC 2019 49

SLIDE 50

Lasso meets Tikhonov: Elastic Net

Indeed, it is possible to see that:
◮ while Tikhonov regularization allows one to compute a stable solution, in general its solution is not sparse
◮ on the other hand, the solution of the LASSO might not be stable

The elastic net algorithm, defined as

min_{w ∈ R^D}  (1/n) ∑_{i=1}^{n} (y_i − f_w(x_i))^2 + λ( α‖w‖_1 + (1 − α)‖w‖_2^2 ),   α ∈ [0, 1]     (2)

can be seen as a hybrid algorithm which interpolates between Tikhonov and the LASSO

MLCC 2019 50

SLIDE 51

ISTA for Elastic Net

The ISTA procedure can be adapted to solve the elastic net problem, where the gradient descent step also incorporates the derivative of the ℓ2 penalty term. The resulting algorithm is

w_0 = 0,   w_i = S_{λαγ}( (1 − λγ(1 − α)) w_{i−1} + (2γ/n) X_n^T (Y_n − X_n w_{i−1}) ),   i = 1, . . . , T_max

To ensure convergence we should choose the step-size γ = n / (2(‖X_n^T X_n‖ + λ(1 − α)))

MLCC 2019 51
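Following the same pattern, a hedged sketch of ISTA for the elastic net that mirrors the update rule and step size written above; the constants in front of the penalty terms depend on the exact normalization of the objective, so the expressions below simply follow the slide.

```python
import numpy as np

def ista_elastic_net(Xn, Yn, lam, alpha, T_max=1000, eps=1e-6):
    """ISTA for (1/n)||Yn - Xn w||^2 + lam*(alpha*||w||_1 + (1-alpha)*||w||_2^2).

    The shrink factor and step size follow the expressions on the slide above.
    """
    n, D = Xn.shape
    gamma = n / (2 * (np.linalg.norm(Xn.T @ Xn, 2) + lam * (1 - alpha)))
    w = np.zeros(D)
    for _ in range(T_max):
        # gradient step on the smooth part: data term plus the squared-norm penalty
        u = (1 - lam * gamma * (1 - alpha)) * w + (2 * gamma / n) * Xn.T @ (Yn - Xn @ w)
        # soft thresholding with threshold lam * alpha * gamma
        w_new = np.sign(u) * np.maximum(np.abs(u) - lam * alpha * gamma, 0.0)
        if np.linalg.norm(w_new - w) <= eps:
            return w_new
        w = w_new
    return w
```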

SLIDE 52

Wrapping Up

Sparsity and interpretable models:
◮ greedy methods
◮ convex relaxation

MLCC 2019 52

SLIDE 53

Next Class

unsupervised learning: dimensionality reduction!

MLCC 2019 53