Big Data - Lecture 2: High dimensional regression with the Lasso

SLIDE 1

Introduction Sparse High Dimensional Regression Lasso estimation Application

Big Data - Lecture 2 High dimensional regression with the Lasso

  • S. Gadat

Toulouse, October 2014

SLIDE 3

Schedule

1. Introduction: Motivation; Trouble with large dimension; Goals; Important balance: bias-variance tradeoff
2. Sparse High Dimensional Regression: Sparsity; Inducing sparsity
3. Lasso estimation: Lasso Estimator; Solving the lasso - MM method; Statistical results
4. Application

SLIDE 4

I Introduction - Linear Model

In a standard linear model, we have at our disposal pairs (X_i, Y_i) assumed to be linked by

Y_i = X_i^t θ_0 + ǫ_i,  1 ≤ i ≤ n.

We aim to recover the unknown θ_0. Generically, (ǫ_i)_{1≤i≤n} is assumed to be an i.i.d. sample of a centered, square-integrable noise:

E[ǫ] = 0,  E[ǫ²] < ∞.

From a statistical point of view, we expect to find, among the p variables that describe X, the important ones. Typical example:
  • Y_i: expression level of one gene on sample i
  • X_i = (X_i,1, . . . , X_i,p): biological signal (DNA micro-arrays) observed on sample i
Goal: discover a link between the DNA signal and the gene expression level.
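As a small sketch (not from the slides; the sizes n, p and the coefficients of θ_0 are illustrative choices), the sparse linear model above can be simulated as follows:

```python
# Simulate Y_i = X_i^t θ0 + ε_i with a sparse θ0 (only s = 3 nonzero coordinates).
import random

random.seed(0)

n, p = 50, 200                                       # many more features than samples
theta0 = [0.0] * p
theta0[0], theta0[3], theta0[10] = 2.0, -1.5, 0.7    # the s meaningful coordinates

X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
eps = [random.gauss(0, 0.5) for _ in range(n)]       # centered, square-integrable noise
Y = [sum(X[i][j] * theta0[j] for j in range(p)) + eps[i] for i in range(n)]

s = sum(1 for t in theta0 if t != 0.0)               # sparsity of theta0
```

The statistical problem of the lecture is to recover the three meaningful coordinates of θ_0 from (X, Y) alone.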

SLIDE 5

I Introduction - Micro-array analysis - Biological datasets

One measures micro-array datasets built from a huge number of gene expression profiles.
  • Number of genes p (of order thousands).
  • Number of samples n (of order hundreds).
Questions:
  • Diagnostic help: healthy or ill?
  • Select the meaningful elements among the genes?
  • Find an algorithm with good prediction of the response?

SLIDE 6

I Introduction - Linear Model

From a matrix point of view, the linear model can be written as follows:

Y = Xθ_0 + ǫ,  Y ∈ R^n,  X ∈ M_{n,p}(R),  θ_0 ∈ R^p.

In this lecture, we will consider situations where p varies (typically increases) with n.

SLIDE 7

I Introduction - Linear Model

Standard approach: n >> p. The M.L.E. in the Gaussian case is the Least Squares Estimator:

θ̂_n := arg min_{β ∈ R^p} ‖Y − Xβ‖²₂,

given by θ̂_n = (XᵗX)⁻¹XᵗY.

Proposition
θ̂_n is an unbiased estimator of θ_0 such that, if ǫ ∼ N(0, σ²):

‖X(θ̂_n − θ_0)‖²₂ / σ² ∼ χ²_p  and  E[ ‖X(θ̂_n − θ_0)‖²₂ / n ] = σ²p / n.

When n >> p, this prediction risk σ²p/n is negligible.

Main requirement: XᵗX must be full rank (invertible)!
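A minimal sketch of the n >> p case (not from the slides; sizes and coefficients are illustrative): for p = 2 the normal equations (XᵗX) θ = XᵗY can be solved by hand, and the invertibility requirement is exactly that the 2×2 Gram determinant is nonzero.

```python
# OLS via the normal equations for p = 2, n = 500: recover θ0 = (1, -2).
import random

random.seed(1)

n = 500
theta0 = [1.0, -2.0]
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n)]
Y = [x[0] * theta0[0] + x[1] * theta0[1] + random.gauss(0, 0.1) for x in X]

# Gram matrix XᵗX and right-hand side XᵗY
g00 = sum(x[0] * x[0] for x in X); g01 = sum(x[0] * x[1] for x in X)
g11 = sum(x[1] * x[1] for x in X)
b0 = sum(x[0] * y for x, y in zip(X, Y)); b1 = sum(x[1] * y for x, y in zip(X, Y))

det = g00 * g11 - g01 * g01            # full rank of XᵗX  <=>  det != 0
theta_hat = [( g11 * b0 - g01 * b1) / det,
             (-g01 * b0 + g00 * b1) / det]
```

With n = 500 observations and a small noise level, θ̂_n lands very close to θ_0, as the proposition's σ²p/n rate suggests.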

SLIDE 8

I Introduction - Trouble with large dimension p >> n

XᵗX is a p × p matrix, but its rank is at most n. If n << p, then rk(XᵗX) ≤ n << p.
Consequence: the Gram matrix XᵗX is not invertible, and is even very ill-conditioned (most of its eigenvalues are equal to 0!). The linear estimator θ̂_n completely fails.
One standard “improvement”: use the ridge regression with an additional penalty:

θ̂_n^{Ridge} = arg min_{β ∈ R^p} ‖Y − Xβ‖²₂ + λ‖β‖²₂.

The ridge regression is a particular case of penalized regression. The penalization is still convex w.r.t. β and the problem can be easily solved. We will attempt to describe a penalized regression better suited to high dimensional problems. Our goal: find a method to compute θ̂_n that
  • selects features among the p variables,
  • can be easily computed with numerical software,
  • possesses some statistical guarantees.
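A tiny sketch of why the ridge penalty rescues invertibility (illustrative numbers, not from the slides): with n = 1 observation and p = 2 features, XᵗX has rank 1, but XᵗX + λI is invertible for any λ > 0.

```python
# Ridge with n = 1, p = 2: solve (XᵗX + λI) β = XᵗY by hand.
lam = 0.5
X = [[1.0, 2.0]]             # a single observation, two features
Y = [3.0]

# XᵗX (rank 1, singular) plus λI
a00 = X[0][0] * X[0][0] + lam; a01 = X[0][0] * X[0][1]
a11 = X[0][1] * X[0][1] + lam
b0 = X[0][0] * Y[0]; b1 = X[0][1] * Y[0]

det = a00 * a11 - a01 * a01           # = λ(‖x‖² + λ) > 0 whenever λ > 0
beta = [( a11 * b0 - a01 * b1) / det,
        (-a01 * b0 + a00 * b1) / det]
```

Here det = 0.5 · (5 + 0.5) = 2.75 > 0 even though XᵗX alone is singular; ridge always yields a unique solution, but (unlike the lasso to come) it sets no coefficient exactly to zero.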

SLIDE 9

I Introduction - Objective of high dimensional regression

Remark: inconsistency of the standard linear model (and even of the ridge regression) when p >> n: the risk E[‖X(θ̂_n − θ_0)‖²₂]/n does not vanish when (n, p) → +∞ with p >> n.
Important questions nowadays:
  • What is a good framework for high dimensional regression? A good model is required.
  • How can we estimate? An efficient algorithm is necessary.
  • How can we measure the performances: prediction of Y? Feature selection in θ?
  • What are we looking for? Statistical guarantees? Some mathematical theorems?

SLIDE 10

I Introduction - bias-variance tradeoff

In high dimension: Optimize the fit to the observed data? Reduce the variability? Standard question: find the best curve... In what sense?

SLIDE 11

I Introduction - bias-variance tradeoff

Several regressions: Left: fit the best line (1-D regression). Middle: fit the best quadratic polynomial. Right: fit the best 10-degree polynomial. Now I am interested in the prediction at the point x = 0.5. Which one is the best?

SLIDE 12

I Introduction - bias-variance tradeoff

If we are looking for the best possible fit, a high dimensional regressor will be convenient. Nevertheless, our goal is generally to predict y for new points x, and the matching criterion is C(f̂) := E_{(X,Y)}[Y − f̂(X)]². It is a quadratic loss here; it could be replaced by other criteria (in classification, for example).

SLIDE 13

I Introduction - bias-variance tradeoff

When the degree increases, the error on the observed data (red curve) always decreases. Over the rest of the population, the generalization error first decreases, and then increases. Too simple sets of functions cannot contain the good function, and optimization over simple sets introduces a bias. Too complex sets of functions contain the good function but are too rich and generate a high variance.

SLIDE 14

I Introduction - bias-variance tradeoff

The former balance is illustrated by a very simple theorem. Assume Y = f(X) + ǫ with E[ǫ] = 0.

Theorem
For any estimator f̂, one has

C(f̂) = E[Y − f̂(X)]² = E[f(X) − E[f̂(X)]]² + E[f̂(X) − E[f̂(X)]]² + E[Y − f(X)]².

The blue term is a bias term. The red term is a variance term. The green term is the Bayes risk and does not depend on the estimator f̂.
Statistical principle: the empirical squared loss ‖Y − f̂(X)‖²_{2,n} mimics the bias; a statistical penalty is introduced to mimic the variance, because there is an important need to control the variance of the estimation.
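As a sketch of where the decomposition comes from (under the extra assumption, implicit at this level, that ǫ is independent of f̂ so that the cross terms vanish):

```latex
% Bias-variance decomposition, assuming Y = f(X) + \epsilon, \mathbb{E}[\epsilon] = 0,
% and \epsilon independent of \hat f:
\begin{aligned}
\mathcal{C}(\hat f)
  &= \mathbb{E}\,[Y - \hat f(X)]^2
   = \mathbb{E}\,[f(X) - \hat f(X)]^2 + \mathbb{E}[\epsilon^2]
   \qquad\text{(cross term } \mathbb{E}[(f - \hat f)\,\epsilon] = 0\text{)}\\
  &= \underbrace{\mathbb{E}\,[f(X) - \mathbb{E}[\hat f(X)]]^2}_{\text{bias}}
   + \underbrace{\mathbb{E}\,[\hat f(X) - \mathbb{E}[\hat f(X)]]^2}_{\text{variance}}
   + \underbrace{\mathbb{E}\,[Y - f(X)]^2}_{\text{Bayes risk}}
\end{aligned}
```

The second line adds and subtracts E[f̂(X)] inside the square and expands; the remaining cross term is again centered.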

SLIDE 15

Schedule

1. Introduction: Motivation; Trouble with large dimension; Goals; Important balance: bias-variance tradeoff
2. Sparse High Dimensional Regression: Sparsity; Inducing sparsity
3. Lasso estimation: Lasso Estimator; Solving the lasso - MM method; Statistical results
4. Application

SLIDE 16

Sparsity

An introductory example: in many applications, p >> n but . . . Important prior: many extracted features in X are irrelevant for the response Y. In an equivalent way: many coefficients in θ_0 are not "almost zero" but "exactly zero". For example, if Y is the size of a tumor, it might be reasonable to suppose that it can be expressed as a linear combination of the genetic information in the genome described by X. BUT most coefficients in θ_0 will be zero, and most genes will be unimportant to predict Y:
  • We are looking for the few meaningful genes.
  • We are looking for the prediction of Y as well.

SLIDE 17

Sparsity

Dogmatic approach. Sparsity: assumption that most coordinates of the unknown θ_0 we are looking for are null. Only s of them are important:

s := Card{1 ≤ i ≤ p | θ_0(i) ≠ 0}.

Sparsity assumption: s << n. It permits to reduce the effective dimension of the problem. Assume the effective support S of θ_0 were known: then X_Sᵗ X_S may be full rank, and the linear model can be applied to the selected columns.
Major issue: how could we find S?

SLIDE 18

Sparsity

Signal processing: in the 1990's, how could we find sparse representations of high resolution 1-, 2- and 3-dimensional signals? Before going further with the data: understand what they represent and try to obtain a naturally sparse representation. How: wavelet decompositions in signal processing.
  • Sparse representation: Y. Meyer (among others)
  • Efficient algorithm: S. Mallat
  • Noise robustness and hard thresholding method: D. Donoho

SLIDE 19

Sparsity

In statistics: in the 2000's, from a redundant representation, how could we find a sparse representation? Statistics cannot make up for a poor representation of the primary features in the data!
  • Statistical estimator, the LASSO: R. Tibshirani, 1996.
  • Efficient algorithm to solve the LASSO with the LARS: Efron, Johnstone, Hastie, and Tibshirani, 2002.
  • Other estimators: Dantzig Selector: Candes & Tao (2007). Boosting: Buhlmann & Yu (2003).
  • Noise robustness and hard thresholding method: A. Tsybakov et al. (among others)
What is the LASSO method? How can we solve it? What about its statistical performances?

SLIDE 20

ℓ0 norm and convex relaxation

Ideally, we would like to find θ̂_n such that

θ̂_n = arg min_{θ: ‖θ‖₀ ≤ s} ‖Y − Xθ‖²₂,

meaning that the minimization is embedded in an ℓ0 ball. In the previous lecture, we have seen that it is a constrained minimization problem of a convex function . . . A dual formulation is

arg min_{θ: ‖Y − Xθ‖₂ ≤ ǫ} ‖θ‖₀.

But: the ℓ0 balls are not convex! The ℓ0 balls are not smooth!
  • First (illusive) idea: explore all ℓ0 subsets and minimize! Hopeless, since there are C_p^s subsets and p is large!
  • Second idea (existing methods): run some heuristic and greedy methods to explore the ℓ0 balls and compute an approximation of θ̂_n. (See next lecture.)
  • Good idea: use a convexification of the ℓ0 norm (also referred to as a convex relaxation method). How?

SLIDE 21

ℓ0 norm and convex relaxation

Idea of the convex relaxation: instead of considering a variable z ∈ {0, 1}, imagine that z ∈ [0, 1].

Definition (Convex Envelope)
The convex envelope f* of a function f is the largest convex function below f.

Theorem (Envelope of θ → ‖θ‖₀)
On [−1, 1]^d, the convex envelope of θ → ‖θ‖₀ is θ → ‖θ‖₁. On [−R, R]^d, it is θ → ‖θ‖₁/R.

Idea: instead of solving the minimization problem

∀s ∈ N   min_{‖θ‖₀ ≤ s} ‖Y − Xθ‖²₂,    (1)

we are looking for

∀C > 0   min_{‖.‖₀*(θ) ≤ C} ‖Y − Xθ‖²₂.    (2)

What's new? The function ‖.‖₀* is convex, and thus the above problem is a convex minimization problem with convex constraints. Since ‖.‖₀*(θ) ≤ ‖θ‖₀, it is rather reasonable to expect sparse solutions. In fact, solutions of (2) with a given C provide a lower bound on the value of (1) with s ≤ C: if good solutions of (1) exist, then there must exist even better solutions of (2).

SLIDE 22

ℓ0 norm and convex relaxation

Geometrical interpretation (in 2-D): Left: level sets of ‖Y − Xβ‖²₂ and their intersection with an ℓ1 ball. Right: the same with an ℓ2 ball.
The left constrained problem is likely to produce a sparse solution; the right one is not! In larger dimensions the balls are even more different.

SLIDE 23

ℓ1 penalty

Analytic point of view: why does the ℓ1 norm induce sparsity? From the KKT conditions (see Lecture 1), the constrained problem leads to a penalized criterion:

min_{θ ∈ R^p: ‖θ‖₁ ≤ C} ‖Y − Xθ‖²₂  ⇐⇒  min_{θ ∈ R^p} ‖Y − Xθ‖²₂ + λ‖θ‖₁,

where the data-fit term mimics the bias and the penalty controls the variance.
In the 1-d case, consider ϕ_λ(x) := (1/2)|x − α|² + λ|x| and arg min_{x ∈ R} ϕ_λ(x):
  • The minimal value of ϕ_λ is reached at a point x* where 0 ∈ ∂ϕ_λ(x*).
  • Either x* ≠ 0 and (x* − α) + λ sgn(x*) = 0,
  • or x* = 0, with dϕ_λ⁺(0) > 0 and dϕ_λ⁻(0) < 0.

Proposition (Analytical minimization of ϕ_λ)
x* = sgn(α)[|α| − λ]₊ = arg min_{x ∈ R} (1/2)|x − α|² + λ|x|.

  • For large values of λ, the minimum of ϕ_λ is reached at the point 0.
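The proposition above is one line of code; a sketch (the function name is an illustrative choice):

```python
# Soft-thresholding: the closed-form minimizer of φ_λ(x) = (1/2)|x − α|² + λ|x|
# is x* = sgn(α) · [|α| − λ]₊.
def soft_threshold(alpha, lam):
    if alpha > lam:
        return alpha - lam
    if alpha < -lam:
        return alpha + lam
    return 0.0   # for |α| ≤ λ the minimum is exactly 0: this is where sparsity comes from
```

Note that a whole interval of inputs, |α| ≤ λ, is mapped to exactly 0, whereas the ℓ2 (ridge) penalty would only rescale α and never annihilate it.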
SLIDE 24

Schedule

1. Introduction: Motivation; Trouble with large dimension; Goals; Important balance: bias-variance tradeoff
2. Sparse High Dimensional Regression: Sparsity; Inducing sparsity
3. Lasso estimation: Lasso Estimator; Solving the lasso - MM method; Statistical results
4. Application

SLIDE 25

Lasso estimator

Taking all of this together, we introduce the Least Absolute Shrinkage and Selection Operator (LASSO):

∀λ > 0   θ̂_n^{Lasso} = arg min_{θ ∈ R^p} ‖Y − Xθ‖²₂ + λ‖θ‖₁.

  • The above criterion is convex w.r.t. θ. Efficient algorithms solve the LASSO, even for very large p.
  • The minimizer may not be unique, since the criterion is not strongly convex. Predictions Xθ̂_n^{Lasso} are always unique.
  • λ is a penalty constant that must be carefully chosen. A large value of λ leads to a very sparse solution, with an important bias. A low value of λ yields overfitting with almost no penalization (too much variance). We will see that a careful balance between s, n and p exists; these parameters, as well as the variance of the noise σ², influence a “good” choice of λ.
Alternative (constrained) formulation:

θ̂_n^{Lasso} = arg min_{θ ∈ R^p: ‖θ‖₁ ≤ C} ‖Y − Xθ‖²₂.

SLIDE 26

Solving the lasso

An algorithm to solve the minimization problem arg min_{θ ∈ R^p} ϕ_λ(θ) := ‖Y − Xθ‖²₂ + λ‖θ‖₁ is needed. An efficient method follows the "Majorize-Minimize" principle and is referred to as the MM method. MM methods are useful for the minimization of a convex function / maximization of a concave one.
Idea: build a sequence (θ_k)_{k≥0} that converges to the minimum of ϕ_λ. A particular case of such a method is encountered with the E.M. algorithm, useful for clustering and mixture models. MM algorithms are powerful; in particular, they can convert non-differentiable problems into smooth ones.

SLIDE 27

MM algorithm

1. A function g(·|θ_k) is said to majorize f at the point θ_k if g(θ_k|θ_k) = f(θ_k) and g(θ|θ_k) ≥ f(θ), ∀θ ∈ R^p.

2. Then, we define θ_{k+1} = arg min_{θ ∈ R^p} g(θ|θ_k).

3. We wish to find each time a function g(·|θ_k) whose minimization is easy.

4. An example: a quadratic majorizer of a non-smooth function.

5. Important remark: the MM is a descent algorithm:

f(θ_{k+1}) = g(θ_{k+1}|θ_k) + f(θ_{k+1}) − g(θ_{k+1}|θ_k) ≤ g(θ_k|θ_k) = f(θ_k).    (3)

SLIDE 28

MM algorithm for the Lasso: Coordinate descent algorithm

1. Defining the sequence (θ_k)_{k≥0} ⇐⇒ finding a suitable majorization.

2. g : θ → ‖Y − Xθ‖² is convex, with Hessian matrix 2XᵗX. A Taylor expansion leads to

∀y ∈ R^p   g(y) ≤ g(x) + ⟨∇g(x), y − x⟩ + ρ(X)‖y − x‖²,

where ρ(X) denotes the spectral radius of XᵗX.

3. We are naturally driven to upper bound ϕ_λ as

ϕ_λ(θ) ≤ ϕ_λ(θ_k) + ⟨∇g(θ_k), θ − θ_k⟩ + ρ(X)‖θ − θ_k‖²₂ + λ‖θ‖₁
       = ψ(θ_k) + ρ(X)‖θ − (θ_k − ∇g(θ_k)/(2ρ(X)))‖²₂ + λ‖θ‖₁.

4. To minimize this majorizer of ϕ_λ, we then use the above soft-thresholding proposition: define

θ̃_k^j := θ_k^j − ∇g(θ_k)^j / (2ρ(X)),   then compute   θ_{k+1}^j = sgn(θ̃_k^j) [ |θ̃_k^j| − λ/(2ρ(X)) ]₊.
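The iteration above can be sketched in a few lines (this is the proximal-gradient / ISTA form of the MM scheme; the toy design, data and step constant are illustrative choices, with L playing the role of 2ρ(X)):

```python
# MM sketch for the lasso: majorize g(θ) = ‖Y − Xθ‖² by its quadratic upper
# bound with constant L ≥ 2·λmax(XᵗX), then minimize the majorizer
# coordinatewise by soft-thresholding.
def soft(x, t):
    return (x - t) if x > t else (x + t) if x < -t else 0.0

def ista(X, Y, lam, L, iters=200):
    n, p = len(X), len(X[0])
    theta = [0.0] * p
    for _ in range(iters):
        resid = [sum(X[i][j] * theta[j] for j in range(p)) - Y[i] for i in range(n)]
        grad = [2.0 * sum(X[i][j] * resid[i] for i in range(n)) for j in range(p)]
        # θ_{k+1}^j = ST_{λ/L}(θ_k^j − ∇g(θ_k)^j / L): a descent step per iteration
        theta = [soft(theta[j] - grad[j] / L, lam / L) for j in range(p)]
    return theta

# Toy orthogonal design: the lasso solution is ST_{λ/2}(Y) coordinatewise,
# so the small coordinate of Y is set exactly to zero.
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [3.0, 0.2]
theta = ista(X, Y, lam=1.0, L=2.0)
```

On this orthogonal toy problem the iteration converges to (2.5, 0.0): the large coordinate is shrunk by λ/2 and the small one is killed, exactly as the soft-thresholding proposition predicts.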
SLIDE 29

Statistical results for the Lasso

Importance of the results: understand the difficulties from a statistical point of view. What could we expect, in expectation or with high probability?
  • Estimation/consistency: θ̂_n ≃ θ_0.
  • Selection/support: Supp(θ̂_n) ≃ Supp(θ_0).
  • Prediction: n⁻¹‖X(θ̂_n − θ_0)‖²₂ ≃ s_0/n.
Statistical framework: we assume that ǫ_i ∼ N(0, σ²) (for the sake of simplicity). High dimensional framework: s is the sparsity of θ_0, and n → +∞ with p = O(e^{n^{1−δ}}). It means that p may be much larger than n. We are looking for a rate of convergence involving s, p and n. Important point: the choice of λ (in terms of s, p, n and σ²).

SLIDE 30

Basic considerations (I)

We won't provide a sharp presentation of the best known results, to keep the level understandable. It is important to have in mind the extreme situation of an almost orthogonal design: XᵗX/n ≃ I_p. Solving the lasso is then equivalent to solving

min_w (1/2n)‖XᵗY − w‖²₂ + λ‖w‖₁.

Solutions are given coordinatewise by ST (Soft-Thresholding):

w_j = ST_λ( (1/n) X_jᵗY ) = ST_λ( θ_0^j + (1/n) X_jᵗǫ ).
SLIDE 31

Basic considerations (II)

We would like to keep the useless coefficients at 0. It requires that

λ ≥ (1/n)|X_jᵗǫ|,  ∀j ∈ J_0^c.

  • The r.v. (1/n)X_jᵗǫ are i.i.d. with variance σ²/n.
  • The expectation of the maximum of p − s standard Gaussian variables is ≃ √(2 log(p − s)).
It leads to λ = Aσ√(log p / n), with A > √2. Precisely:

P( ∀j ∈ J_0^c : |X_jᵗǫ| ≤ nλ ) ≥ 1 − p^{1−A²/2}.

We expect that ST_λ → Id to obtain a consistency result. It means that λ → 0, so that log p / n → 0.
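Plugging illustrative numbers (my choices, not the lecture's) into the calibration above shows how mild the price of high dimension is:

```python
# λ = A·σ·sqrt(log p / n) with A > √2, and coverage probability ≥ 1 − p^(1 − A²/2).
import math

sigma, n, p, A = 1.0, 200, 5000, 2.0
lam = A * sigma * math.sqrt(math.log(p) / n)
coverage = 1.0 - p ** (1.0 - A * A / 2.0)   # = 1 − 1/p for A = 2
```

With p = 5000 features and only n = 200 samples, λ ≈ 0.41 and the useless coefficients stay at zero with probability at least 1 − 1/5000: the dimension p enters only through log p.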

SLIDE 32

Lasso consistency - One result

Theorem
Assume that log p << n, that all columns of X have norm 1, and that ǫ_i ∼ N(0, σ²). Then, under a coherence assumption on the design matrix XᵗX, one has:
i) With high probability, J(θ̂_n) ⊂ J_0.
ii) There exists C such that, with high probability,

‖X(θ̂_n − θ_0)‖²₂ / n ≤ (C/κ²) σ² s_0 log p / n,

where κ² is a positive constant that depends on the correlations in XᵗX.
One can also find results on exact support recovery, as well as some weaker results without any coherence assumption. N.B.: such a coherence is measured through the almost orthogonality of the columns of X. It can be translated in terms of sup_{i≠j} |⟨X_i, X_j⟩| ≤ ǫ.

SLIDE 33

Short example with the R software

CRAN package: http://cran.r-project.org/web/packages/lars/
R code:

library(lars)
data(diabetes)
attach(diabetes)
fit = lars(x, y)
plot(fit)

The Lars algorithm solves the Lasso (less efficiently than the coordinate descent algorithm). Typical output of the lars software: the greater the ℓ1 norm, the lower λ. Sparse solutions correspond to small values of the ‖.‖₁ norm.

SLIDE 34

Removing the bias of the Lasso (I)

Signal processing example: we have n = 60 noisy observations Y(i) = f(i/n) + ǫ_i.
  • f is an unknown periodic function defined on [0, 1], sampled at the points (i/n).
  • ǫ_i are independent realizations of a Gaussian r.v.
We use the first 50 Fourier coefficients,

ϕ_0(x) = 1,  ϕ_{2j}(x) = sin(2jπx),  ϕ_{2j+1}(x) = cos(2jπx),

to approximate f. The OLS estimator is

f̂^{OLS}(x) = Σ_{j=1}^p β̂_j^{OLS} ϕ_j(x)   with   β̂^{OLS} = arg min_β Σ_{i=1}^n ( Y_i − Σ_{j=0}^p β_j ϕ_j(i/n) )².

The OLS does not perform well on this example.

SLIDE 35

Removing the bias of the Lasso (II)

We experiment here with the Lasso estimator with λ = 3σ√(2 log p / n), and obtain: the Lasso estimator reproduces the oscillations of f, but these oscillations are shrunk toward 0. When considering the initial minimization problem, the ℓ1 penalty selects the good features nicely, but it also introduces a bias (a shrinkage of the parameters).
Strategy: select features with the Lasso, then run an OLS estimator using the selected variables.

SLIDE 36

Removing the bias of the Lasso (III)

We define f̂^{Gauss} = π_{Ĵ0}(Y) with Ĵ0 = Supp(θ̂^{Lasso}), where π_{Ĵ0} is the L² projection of the observations on the features selected by the Lasso.
The Adaptive Lasso is almost equivalent:

β̂^{Adaptive Lasso} = arg min_{β ∈ R^p} ‖Y − Xβ‖²₂ + μ Σ_{j=1}^p |β_j| / |β̂_j^{Gauss}|.

This minimization remains convex, and the penalty term aims to mimic the ℓ0 penalty. The Adaptive Lasso is very popular and tends to select the variables more accurately than the Gauss-Lasso estimator.
