Big Data - Lecture 2 High dimensional regression with the Lasso
S. Gadat
Toulouse, October 2014
Outline:
1 Introduction: Motivation / Trouble with large dimension / Goals / Important balance: bias-variance tradeoff
2 Sparse High Dimensional Regression: Sparsity / Inducing sparsity
3 Lasso estimation: Lasso Estimator / Solving the lasso - MM method / Statistical results
4 Application
In a standard linear model, we have at our disposal observations $(X_i, Y_i)$ supposed to be linked through
$Y_i = X_i^t \theta_0 + \epsilon_i, \quad 1 \le i \le n.$
We aim to recover the unknown $\theta_0$. Generically, $(\epsilon_i)_{1 \le i \le n}$ is assumed to be an i.i.d. replication of a centered and square integrable noise: $\mathbb{E}[\epsilon] = 0$, $\mathbb{E}[\epsilon^2] < \infty$. From a statistical point of view, we expect to find the important ones among the p variables that describe X.
Typical example:
$Y_i$: expression level of one gene on sample i.
$X_i = (X_{i,1}, \ldots, X_{i,p})$: biological signal (DNA micro-arrays) observed on sample i.
Discover a link between DNA and the gene expression level.
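As a concrete illustration, here is a minimal R sketch simulating data from such a sparse linear model (the dimensions n, p, the sparsity level and the noise variance below are arbitrary choices for the example, not values taken from the lecture):

# Simulate a sparse high dimensional linear model Y = X theta0 + eps
set.seed(1)
n <- 50; p <- 200; s <- 5                 # n samples, p variables, s non-zero coefficients
X <- matrix(rnorm(n * p), n, p)           # design matrix
theta0 <- c(rep(2, s), rep(0, p - s))     # sparse true parameter
sigma <- 1
eps <- rnorm(n, sd = sigma)               # centered, square integrable noise
Y <- X %*% theta0 + eps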
One measures micro-array datasets built from a huge number of gene expression profiles: the number of measured genes p is much larger than the number of samples n.
Diagnostic help: healthy or ill? Select the meaningful elements among the genes? Find an algorithm with a good prediction of the response?
From a matrix point of view, the linear model can be written as follows: $Y = X\theta_0 + \epsilon$, with $Y \in \mathbb{R}^n$, $X \in \mathcal{M}_{n,p}(\mathbb{R})$, $\theta_0 \in \mathbb{R}^p$. In this lecture, we will consider situations where p varies (typically increases) with n.
Standard approach: n >> p. The M.L.E. in the Gaussian case is the least squares estimator
$\hat\theta_n := \arg\min_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_2^2,$
given by $\hat\theta_n = (X^tX)^{-1}X^tY$.
Proposition. $\hat\theta_n$ is an unbiased estimator of $\theta_0$ such that, if $\epsilon \sim \mathcal{N}(0, \sigma^2)$,
$\frac{\|X(\hat\theta_n - \theta_0)\|_2^2}{\sigma^2} \sim \chi^2_p \qquad \text{and} \qquad \mathbb{E}\Big[\frac{\|X(\hat\theta_n - \theta_0)\|_2^2}{n}\Big] = \frac{\sigma^2 p}{n}.$
Most of the time, $\|X(\hat\theta_n - \theta_0)\|_2^2 / n$, which is of order $\sigma^2 p / n$, is negligible when $p \ll n$.
Main requirement: $X^tX$ must be full rank (invertible)!
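A minimal R sketch of this estimator in the classical regime n >> p (so that $X^tX$ is invertible; the dimensions below are arbitrary):

# Least squares estimator theta_hat = (X'X)^{-1} X'Y
n0 <- 100; p0 <- 10
X0 <- matrix(rnorm(n0 * p0), n0, p0)
theta_star <- rnorm(p0)
Y0 <- X0 %*% theta_star + rnorm(n0)
theta_hat <- solve(t(X0) %*% X0, t(X0) %*% Y0)   # normal equations; equivalently coef(lm(Y0 ~ X0 - 1))
mean((X0 %*% (theta_hat - theta_star))^2)        # prediction error, of order sigma^2 * p0 / n0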
$X^tX$ is a p × p matrix, but its rank is at most n. If n << p, then $\mathrm{rk}(X^tX) \le n \ll p$. Consequence: the Gram matrix $X^tX$ is not invertible and is even very ill-conditioned (most of its eigenvalues are equal to 0!). The linear estimator $\hat\theta_n$ completely fails.
One standard "improvement": use the ridge regression with an additional penalty:
$\hat\theta_n^{Ridge} = \arg\min_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_2^2 + \lambda\|\beta\|_2^2.$
The ridge regression is a particular case of penalized regression. The penalization is still convex w.r.t. β and the problem can be easily solved. We will attempt to describe a penalized regression better suited to high dimensional regression.
Our goal: find a method to build $\hat\theta_n$ that:
selects features among the p variables;
can be easily computed with numerical software;
possesses some statistical guarantees.
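A minimal R sketch of the ridge estimator, using its closed form $(X^tX + \lambda I_p)^{-1}X^tY$ on the sparse simulation above (the value of λ is an arbitrary choice):

# Ridge regression: the penalty makes X'X + lambda * I invertible even when p > n
lambda <- 1
theta_ridge <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% Y)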
Remark: inconsistency of the standard linear model (and even of the ridge regression) when p >> n:
$\mathbb{E}\|\hat\theta_n - \theta_0\|$ does not tend to 0 as $(n, p) \to +\infty$ with $p \gg n$.
Important questions nowadays:
What is a good framework for high dimensional regression? A good model is required.
How can we estimate? An efficient algorithm is necessary.
How can we measure the performances: prediction of Y? Feature selection in θ?
What are we looking for? Statistical guarantees? Some mathematical theorems?
In high dimension: Optimize the fit to the observed data? Reduce the variability? Standard question: find the best curve... In what sense?
Several regressions:
Left: fit the best line (1-D regression).
Middle: fit the best quadratic polynomial.
Right: fit the best degree-10 polynomial.
Now I am interested in the prediction at the point x = 0.5. Which one is the best?
If we are looking for the best possible fit, a high dimensional regressor will be convenient. Nevertheless, our goal is generally to predict y for new points x, and the matching criterion is $C(\hat f) := \mathbb{E}_{(X,Y)}[Y - \hat f(X)]^2$. It is a quadratic loss here, and should be replaced by other criteria (in classification for example).
When the degree increases, the fit to the observed data (red curve) always improves: the training error is always decreasing. Over the rest of the population, the generalization error first decreases and then increases. Too simple sets of functions cannot contain the good function, and optimization over simple sets introduces a bias. Too complex sets of functions contain the good function but are too rich and generate a high variance.
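A minimal R sketch of this phenomenon (the target function, sample sizes and noise level are arbitrary choices, not those of the slides): the training error decreases monotonically with the degree, while the error on fresh data eventually increases.

# Bias-variance tradeoff: training vs. generalization error of polynomial fits
set.seed(2)
f <- function(x) sin(2 * pi * x)
n <- 30
x <- runif(n); y <- f(x) + rnorm(n, sd = 0.3)
x_new <- runif(1000); y_new <- f(x_new) + rnorm(1000, sd = 0.3)
for (d in c(1, 2, 10)) {
  fit <- lm(y ~ poly(x, degree = d, raw = TRUE))
  train_err <- mean((y - fitted(fit))^2)
  test_err  <- mean((y_new - predict(fit, data.frame(x = x_new)))^2)
  cat("degree", d, ": train", round(train_err, 3), "- test", round(test_err, 3), "\n")
}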
The former balance is illustrated by a very simple theorem. Assume Y = f(X) + ε with E[ε] = 0.
Theorem. For any estimator $\hat f$, one has
$C(\hat f) = \mathbb{E}[Y - \hat f(X)]^2 = \mathbb{E}\big[f(X) - \mathbb{E}\hat f(X)\big]^2 + \mathbb{E}\big[\mathbb{E}\hat f(X) - \hat f(X)\big]^2 + \mathbb{E}[Y - f(X)]^2.$
The blue (first) term is a bias term. The red (second) term is a variance term. The green (third) term is the Bayes risk and is independent of the estimator $\hat f$.
Statistical principle: the empirical squared loss $\|Y - \hat f(X)\|_{2,n}^2$ mimics the bias; there is therefore an important need to control the variance of the estimation, through a statistical penalty that mimics the variance.
2 Sparse High Dimensional Regression: Sparsity / Inducing sparsity
An introductory example: in many applications, p >> n but . . . Important prior: many extracted features in X are irrelevant for the response Y. Equivalently: many coefficients in θ0 are not "almost zero" but "exactly zero". For example, if Y is the size of a tumor, it might be reasonable to suppose that it can be expressed as a linear combination of the genetic information in the genome described by X. BUT most coefficients in θ0 will be zero and most genes will be unimportant to predict Y:
We are looking for the few meaningful genes.
We are looking for the prediction of Y as well.
Dogmatic approach. Sparsity: assumption that the unknown θ0 we are looking for has only a few non-zero coordinates,
$s := \mathrm{Card}\,\{1 \le i \le p \,|\, \theta_0(i) \neq 0\}.$
Sparsity assumption: s << n. It permits to reduce the effective dimension of the problem. Assume that the effective support of θ0 were known: if S is the support of θ0, then $X_S^t X_S$ may be full rank, and the linear model can be applied.
Major issue: how could we find S?
Signal processing: in the 1990's, how could we find sparse representations of high resolution 1-, 2- or 3-dimensional signals? Before going further with the data: understand what they represent and try to obtain a naturally sparse representation. How: wavelet decompositions in signal processing.
Sparse representation: Y. Meyer (among others).
Efficient algorithm: S. Mallat.
Noise robustness and hard thresholding method: D. Donoho.
In statistics: in the 2000's, from a redundant representation, how could we find a sparse one? Statisticians do not get to improve the representation of the primary features of the data!
Statistical estimator of the LASSO: R. Tibshirani, 1996.
Efficient algorithm to solve the LASSO with the LARS: Efron, Johnstone, Hastie, and Tibshirani, 2002.
Other estimators: Dantzig Selector: Candes & Tao (2007). Boosting: Buhlmann & Yu (2003). Noise robustness and hard thresholding method: A. Tsybakov et al. (among others).
What is the LASSO method? How can we solve it? What about the statistical performances?
Ideally, we would like to find θ such that
$\hat\theta_n = \arg\min_{\theta : \|\theta\|_0 \le s} \|Y - X\theta\|_2^2,$
meaning that the minimization is embedded in an ℓ0 ball. In the previous lecture, we have seen that it is a constrained minimization problem of a convex function . . . A dual formulation is
$\arg\min_{\theta : \|Y - X\theta\|_2 \le \epsilon} \|\theta\|_0.$
But: the ℓ0 balls are not convex! The ℓ0 balls are not smooth!
First (illusive) idea: explore all ℓ0 subsets and minimize! Hopeless, since there are $\binom{p}{s}$ subsets and p is large!
Second idea (existing methods): run some heuristic and greedy methods to explore the ℓ0 balls and compute an approximation of $\hat\theta_n$ (see next lecture).
Good idea: use a convexification of the ℓ0 norm (also referred to as a convex relaxation method). How?
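To see why this exhaustive search is out of reach, a quick R computation of the number of candidate supports (the values of p and s are arbitrary):

# Number of possible supports of size s among p variables
choose(1000, 10)   # about 2.6e23 subsets
choose(5000, 20)   # astronomically larger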
Idea of the convex relaxation: instead of considering a variable z ∈ {0, 1}, imagine that z ∈ [0, 1].
Definition (Convex Envelope). The convex envelope $f^*$ of a function f is the largest convex function below f.
Theorem (Envelope of $\theta \mapsto \|\theta\|_0$). On $[-1, 1]^d$, the convex envelope of $\theta \mapsto \|\theta\|_0$ is $\theta \mapsto \|\theta\|_1$. On $[-R, R]^d$, the convex envelope of $\theta \mapsto \|\theta\|_0$ is $\theta \mapsto \|\theta\|_1 / R$.
Idea: instead of solving the minimization problem
$\forall s \in \mathbb{N} \quad \min_{\|\theta\|_0 \le s} \|Y - X\theta\|_2^2, \qquad (1)$
we are looking for
$\forall C > 0 \quad \min_{\|\cdot\|_0^*(\theta) \le C} \|Y - X\theta\|_2^2. \qquad (2)$
What's new? The function $\|\cdot\|_0^*$ is convex, and thus the above problem is a convex minimization problem with convex constraints. Since $\|\cdot\|_0^*(\theta) \le \|\theta\|_0$, it is rather reasonable to obtain sparse solutions. In fact, if we are looking for good solutions of (1), then there must exist even better solutions to (2).
Geometrical interpretation (in 2D):
Left: level sets of $\|Y - X\beta\|_2^2$ and their intersection with the ℓ1 ball. Right: the same with the ℓ2 ball.
The left constrained problem is likely to produce a sparse solution; on the contrary, the right one is not! In larger dimensions the balls are even more different.
Analytic point of view: why does the ℓ1 norm induce sparsity? From the KKT conditions (see Lecture 1), the constrained problem leads to a penalized criterion:
$\min_{\theta \in \mathbb{R}^p : \|\theta\|_1 \le C} \|Y - X\theta\|_2^2 \iff \min_{\theta \in \mathbb{R}^p} \|Y - X\theta\|_2^2 + \lambda\|\theta\|_1,$
where the penalty $\lambda\|\theta\|_1$ controls the variance.
In the 1d case, consider $\varphi_\lambda(x) := \frac{1}{2}|x - \alpha|^2 + \lambda|x|$:
The minimal value of $\varphi_\lambda$ is reached at a point $x^*$ when $0 \in \partial\varphi_\lambda(x^*)$.
$x^*$ is a minimizer iff either $x^* \neq 0$ and $(x^* - \alpha) + \lambda\,\mathrm{sgn}(x^*) = 0$, or $x^* = 0$ and $d\varphi_\lambda^+(0) \ge 0$ and $d\varphi_\lambda^-(0) \le 0$.
Proposition (Analytical minimization of $\varphi_\lambda$):
$x^* = \mathrm{sgn}(\alpha)\,[\,|\alpha| - \lambda\,]_+ = \arg\min_{x \in \mathbb{R}} \frac{1}{2}|x - \alpha|^2 + \lambda|x|.$
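A minimal R version of this soft-thresholding operator (it is reused in the MM sketch further below):

# Soft-thresholding: exact minimizer of 0.5 * (x - alpha)^2 + lambda * |x|
soft_threshold <- function(alpha, lambda) {
  sign(alpha) * pmax(abs(alpha) - lambda, 0)
}
soft_threshold(c(-3, -0.5, 0.2, 2), lambda = 1)   # -2  0  0  1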
3 Lasso estimation: Lasso Estimator / Solving the lasso - MM method / Statistical results
Taking it all together, we introduce the Least Absolute Shrinkage and Selection Operator (LASSO): for all λ > 0,
$\hat\theta_n^{Lasso} = \arg\min_{\theta \in \mathbb{R}^p} \|Y - X\theta\|_2^2 + \lambda\|\theta\|_1.$
The above criterion is convex w.r.t. θ. Efficient algorithms exist to solve the LASSO, even for very large p. The minimizer may not be unique since the above criterion is not strongly convex. The predictions $X\hat\theta_n^{Lasso}$ are always unique.
λ is a penalty constant that must be carefully chosen. A large value of λ leads to a very sparse solution, with an important bias. A low value of λ yields overfitting with almost no penalization (too much variance). We will see that a careful balance between s, n and p exists. These parameters, as well as the variance of the noise σ², influence a "good" choice of λ.
Alternative formulation:
$\hat\theta_n^{Lasso} = \arg\min_{\theta \in \mathbb{R}^p : \|\theta\|_1 \le C} \|Y - X\theta\|_2^2.$
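In practice the LASSO can be computed with standard packages. A minimal sketch with the glmnet package (assumed to be installed; note that glmnet scales the squared loss by 1/(2n), so its lambda parameter is not on the same scale as the λ above), reusing X and Y from the sparse simulation:

# Lasso fit by coordinate descent with glmnet
library(glmnet)
fit <- glmnet(X, Y, alpha = 1)   # alpha = 1 corresponds to the pure l1 penalty
coef(fit, s = 0.1)               # sparse coefficient vector at lambda = 0.1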
An algorithm to solve the minimization problem
$\arg\min_{\theta \in \mathbb{R}^p} \|Y - X\theta\|_2^2 + \lambda\|\theta\|_1$
is needed. An efficient method follows the "Majorize-Minimize" principle and is referred to as the MM method. MM methods are useful for the minimization of a convex function / the maximization of a concave one. (Geometric illustration.)
Idea: build a sequence $(\theta_k)_{k \ge 0}$ that converges to the minimum of $\varphi_\lambda$. A particular case of such a method is encountered with the E.M. algorithm, useful for clustering and mixture models. MM algorithms are powerful; in particular, they can convert non-differentiable problems into smooth ones.
1. A function $g(\cdot|\theta_k)$ is said to majorize f at the point $\theta_k$ if $g(\theta_k|\theta_k) = f(\theta_k)$ and $g(\theta|\theta_k) \ge f(\theta)$ for all $\theta \in \mathbb{R}^p$.
2. Then, we define $\theta_{k+1} = \arg\min_{\theta \in \mathbb{R}^p} g(\theta|\theta_k)$.
3. We wish to find each time a function $g(\cdot|\theta_k)$ whose minimization is easy.
4. An example with a quadratic majorizer of a non-smooth function. (Figure.)
5. Important remark: the MM scheme is a descent algorithm:
$f(\theta_{k+1}) = g(\theta_{k+1}|\theta_k) + \big(f(\theta_{k+1}) - g(\theta_{k+1}|\theta_k)\big) \le g(\theta_{k+1}|\theta_k) \le g(\theta_k|\theta_k) = f(\theta_k). \qquad (3)$
1. Defining the sequence $(\theta_k)_{k \ge 0}$ amounts to finding a suitable majorization.
2. $g : \theta \mapsto \|Y - X\theta\|^2$ is convex, with Hessian matrix $2X^tX$. A Taylor expansion leads to
$\forall y \in \mathbb{R}^p \quad g(y) \le g(x) + \langle \nabla g(x), y - x\rangle + \rho(X)\|y - x\|^2,$
where $\rho(X)$ is the spectral radius of $X^tX$.
3. We are naturally driven to upper bound $\varphi_\lambda$ as
$\varphi_\lambda(\theta) \le g(\theta_k) + \langle \nabla g(\theta_k), \theta - \theta_k\rangle + \rho(X)\|\theta - \theta_k\|_2^2 + \lambda\|\theta\|_1 = \psi(\theta_k) + \rho(X)\Big\|\theta - \Big(\theta_k - \tfrac{\nabla g(\theta_k)}{2\rho(X)}\Big)\Big\|_2^2 + \lambda\|\theta\|_1,$
where $\psi(\theta_k)$ gathers the terms that do not depend on θ.
4. To minimize this majorization of $\varphi_\lambda$, we then use the soft-thresholding proposition above, coordinate by coordinate:
Define $\tilde\theta_k^j := \theta_k^j - \nabla g(\theta_k)_j / (2\rho(X))$.
Compute $\theta_{k+1}^j = \mathrm{sgn}(\tilde\theta_k^j)\,\max\Big(|\tilde\theta_k^j| - \tfrac{\lambda}{2\rho(X)},\, 0\Big).$
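A minimal R sketch of this MM scheme (an ISTA-type iteration; the number of iterations is an arbitrary choice, and soft_threshold is the function defined above):

# MM / iterative soft-thresholding for min ||Y - X theta||_2^2 + lambda * ||theta||_1
lasso_mm <- function(X, Y, lambda, n_iter = 500) {
  rho <- max(eigen(t(X) %*% X, only.values = TRUE)$values)   # spectral radius of X'X
  theta <- rep(0, ncol(X))
  for (k in seq_len(n_iter)) {
    grad <- -2 * t(X) %*% (Y - X %*% theta)                   # gradient of ||Y - X theta||^2
    theta_tilde <- theta - grad / (2 * rho)                    # gradient step
    theta <- soft_threshold(theta_tilde, lambda / (2 * rho))   # soft-thresholding step
  }
  drop(theta)
}
# Example: theta_hat <- lasso_mm(X, Y, lambda = 1)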
Importance of the results: understand the difficulties from a statistical point of view. What could we expect? In expectation or with high probability:
Estimation/consistency: $\hat\theta_n \simeq \theta_0$.
Selection/support: $\mathrm{Supp}(\hat\theta_n) \simeq \mathrm{Supp}(\theta_0)$.
Prediction: $n^{-1}\|X(\hat\theta_n - \theta_0)\|_2^2 \simeq s_0/n$.
Statistical framework: we assume that $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ (for the sake of simplicity). High dimensional framework: s is the sparsity of θ0, and $n \to +\infty$ with $p = O(e^{n^{1-\delta}})$. It means that p may be much larger than n. We are looking for a rate of convergence involving s, p and n. Important point: the choice of λ (in terms of s, p, n and σ²).
We won’t provide a sharp presentation of the best known results to keep the level understandable. Important to have in mind the extreme situation of almost orthogonal design: XtX n ≃ Ip . Solving the lasso is equivalent to solving min
w
1 2n Xty − w2
2 + λw1
Solutions are given by ST (Soft-Thresholding): wj = STλ 1 n Xt
jy
j + 1
n Xt
jǫ
We would like to keep the useless coefficients at 0. It requires that
$\lambda \ge \Big|\frac{1}{n}X_j^t\epsilon\Big|, \quad \forall j \in J_0^c.$
The r.v. $\frac{1}{n}X_j^t\epsilon$ are i.i.d. with variance $\sigma^2/n$. The expectation of the maximum of p − s standard Gaussian variables is $\simeq \sqrt{2\log(p-s)}$. It leads to
$\lambda = A\sigma\sqrt{\frac{\log p}{n}}, \quad \text{with } A > \sqrt{2}.$
Precisely, one controls $\mathbb{P}\big(\forall j \in J_0^c : |X_j^t\epsilon| \le n\lambda\big)$, which is close to 1 for this choice of λ.
We expect that $ST_\lambda \to \mathrm{Id}$ to obtain a consistency result. It means that $\lambda \to 0$, so that $\frac{\log p}{n} \to 0$.
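A minimal R illustration of this calibration (A, σ, n and p take arbitrary values for the example):

# Theoretical calibration lambda = A * sigma * sqrt(log(p) / n), with A > sqrt(2)
n <- 100; p <- 5000; sigma <- 1; A <- 2
A * sigma * sqrt(log(p) / n)   # about 0.58; small only when log(p) / n is small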
Theorem. Assume that log p << n, that all columns of X have norm 1 and that $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. Then, under a coherence assumption on the design matrix $X^tX$, one has:
i) With high probability, $J(\hat\theta_n) \subset J_0$.
ii) There exists C such that, with high probability,
$\frac{\|X(\hat\theta_n - \theta_0)\|_2^2}{n} \le \frac{C}{\kappa^2}\,\frac{\sigma^2 s_0 \log p}{n},$
where $\kappa^2$ is a positive constant that depends on the correlations in $X^tX$.
One can also find results on exact support recovery, as well as some weaker results without any coherence assumption.
N.B.: Such a coherence is measured through the almost orthogonality of the columns of X. It can be translated in terms of $\sup_{i \neq j} |\langle X_i, X_j\rangle| \le \epsilon$.
4 Application
CRAN software: http://cran.r-project.org/web/packages/lars/
R code:
library(lars)
data(diabetes)
attach(diabetes)
fit = lars(x, y)
plot(fit)
Lars algorithm: solves the Lasso, though less efficiently than the coordinate descent algorithm. Typical output of the Lars software: the greater the ℓ1 norm, the lower λ; sparse solutions correspond to small values of the ℓ1 norm.
Signal processing example: we have n = 60 noisy observations $Y(i) = f(i/n) + \epsilon_i$. f is an unknown periodic function defined on [0, 1], sampled at the points (i/n). The $\epsilon_i$ are independent realizations of a Gaussian r.v. We use the first 50 Fourier coefficients, $\varphi_0(x) = 1$, $\varphi_{2j}(x) = \sin(2j\pi x)$, $\varphi_{2j+1}(x) = \cos(2j\pi x)$, to approximate f. The OLS estimator is
$\hat f^{OLS}(x) = \sum_{j=0}^{p} \hat\beta_j^{OLS}\varphi_j(x) \quad \text{with} \quad \hat\beta^{OLS} = \arg\min_\beta \sum_{i=1}^n \Big(Y_i - \sum_{j=0}^p \beta_j\varphi_j(i/n)\Big)^2.$
The OLS does not perform well on this example.
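A minimal R sketch of this experiment (the periodic target function and noise level below are arbitrary choices, not those of the slides):

# Fourier design with 1 + 2 * 25 = 51 basis functions, n = 60 noisy observations
n_sp <- 60; J <- 25
t_grid <- (1:n_sp) / n_sp
f_true <- function(t) sin(2 * pi * t) + 0.5 * cos(6 * pi * t)
y_sp <- f_true(t_grid) + rnorm(n_sp, sd = 0.5)
Phi <- cbind(1, do.call(cbind, lapply(1:J, function(j) cbind(sin(2 * j * pi * t_grid), cos(2 * j * pi * t_grid)))))
beta_ols <- solve(t(Phi) %*% Phi, t(Phi) %*% y_sp)   # 51 coefficients from 60 observations: overfits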
We experiment here with the Lasso estimator, with $\lambda = 3\sigma\sqrt{\frac{\log p}{n}}$ (following the calibration above), and obtain: the Lasso estimator reproduces the oscillations of f, but these oscillations are shrunk toward 0. When considering the initial minimization problem, the ℓ1 penalty selects the good features nicely, but also introduces a bias (a shrinkage of the parameters).
Strategy: select the features with the Lasso and run an OLS estimator using the selected variables.
We define $\hat f^{Gauss} = \pi_{\hat J_0}(Y)$ with $\hat J_0 = \mathrm{Supp}(\hat\theta^{Lasso})$, where $\pi_{\hat J_0}$ is the $L^2$ projection of the observations on the features selected by the Lasso.
The Adaptive Lasso is almost equivalent:
$\hat\beta^{Adaptive\ Lasso} = \arg\min_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_2^2 + \mu \sum_{j=1}^p \frac{|\beta_j|}{|\hat\beta_j^{Gauss}|}.$
This minimization remains convex and the penalty term aims to mimic the ℓ0 penalty. The Adaptive Lasso is very popular and tends to select the variables more accurately than the Gauss-Lasso estimator.
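A minimal R sketch of the two-step Gauss-Lasso strategy (glmnet assumed installed; the value of λ is an arbitrary choice), reusing X and Y from the sparse simulation:

# Gauss-Lasso: select the support with the Lasso, then refit by OLS on the selected variables
library(glmnet)
fit <- glmnet(X, Y, alpha = 1)
beta_lasso <- as.vector(coef(fit, s = 0.1))[-1]   # drop the intercept
J0_hat <- which(beta_lasso != 0)                  # estimated support
refit <- lm(drop(Y) ~ X[, J0_hat] - 1)            # OLS restricted to the selected features
coef(refit)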