

SLIDE 1

Adaptive Lasso for correlated predictors

Keith Knight
Department of Statistics, University of Toronto
e-mail: keith@utstat.toronto.edu

This research was supported by NSERC of Canada.

SLIDE 2

OUTLINE

  • 1. Introduction
  • 2. The Lasso under collinearity
  • 3. Projection pursuit with the Lasso
  • 4. Example: Diabetes data
SLIDE 3
  • 1. INTRODUCTION
  • Assume a linear model for $\{(x_i, Y_i) : i = 1, \ldots, n\}$:

$$Y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \varepsilon_i = x_i^T \beta + \varepsilon_i \quad (i = 1, \ldots, n)$$

  • Assume that the predictors are centred and scaled to have mean 0 and variance 1.
    – We can estimate $\beta_0$ by $\bar{Y}$, the least squares estimator.
    – Thus we can assume that the $\{Y_i\}$ are centred to have mean 0.

  • In many applications, p can be much greater than n.
  • In this talk, we will assume implicitly that p < n.
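For concreteness, here is a minimal numpy sketch of this standardization step; the toy data below are illustrative assumptions of mine, not data from the talk.

```python
import numpy as np

# Toy data (illustrative only): any (n, p) design matrix works here.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(size=100)

# Centre and scale predictors to mean 0 / variance 1, and centre the
# response so that beta_0 is absorbed by Y-bar, as on the slide.
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = y - y.mean()
```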
SLIDE 4

Shrinkage estimation

  • Bridge regression: minimize

$$\sum_{i=1}^n (Y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^p |\beta_j|^\gamma \quad \text{for some } \gamma > 0.$$

  • Includes the Lasso (Tibshirani, 1996) and ridge regression as special cases with $\gamma = 1$ and $\gamma = 2$ respectively.
    – For $\gamma \le 1$, it is possible to obtain exact 0 parameter estimates.
    – There are many other variations of the Lasso: elastic nets (Zou & Hastie, 2005) and the fused lasso (Tibshirani et al., 2005), among others.
    – The Dantzig selector of Candès & Tao (2007) is similar in spirit to the Lasso.
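As a sketch of the $\gamma = 1$ case, the Lasso fit can be reproduced with scikit-learn on the toy data above; note that sklearn's `Lasso` minimizes RSS$/(2n) + \alpha \sum_j |\beta_j|$, so its `alpha` corresponds to $\lambda/(2n)$ in the slide's parameterization.

```python
from sklearn.linear_model import Lasso

lam = 10.0  # illustrative value of lambda
fit = Lasso(alpha=lam / (2 * len(y)), fit_intercept=False)  # data already centred
fit.fit(X, y)
print(fit.coef_)  # some coefficients are exactly 0
```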

SLIDE 5
  • Stagewise fitting: given $\hat\beta^{(k)}$, minimize

$$\sum_{i=1}^n (Y_i - x_i^T \hat\beta^{(k)} - x_i^T \phi)^2$$

over $\phi$ with all but 1 (or a small number) of its elements equal to 0. Then define

$$\hat\beta^{(k+1)} = \hat\beta^{(k)} + \epsilon \hat\phi \quad (0 < \epsilon \le 1)$$

and repeat until "convergence".
    – This is a special case of boosting (Schapire, 1990).
    – Also related to LARS (Efron et al., 2004), which in turn is related to the Lasso.
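A minimal sketch of this update, assuming standardized columns (so the best single-coordinate $\phi$ is the one most correlated with the current residual); the step size and iteration count are illustrative choices, not values from the talk.

```python
import numpy as np

def stagewise(X, y, eps=0.1, n_steps=2000):
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.copy()
    for _ in range(n_steps):
        corr = X.T @ resid                     # correlations with residual
        j = np.argmax(np.abs(corr))            # best single nonzero coordinate
        phi_j = corr[j] / (X[:, j] @ X[:, j])  # one-coordinate LS solution
        beta[j] += eps * phi_j                 # beta^(k+1) = beta^(k) + eps * phi-hat
        resid -= eps * phi_j * X[:, j]
    return beta

print(stagewise(X, y))  # approaches the LS solution as steps accumulate
```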

SLIDE 6
  • 2. THE LASSO UNDER COLLINEARITY
  • For given $\lambda$, the Lasso estimator $\hat\beta(\lambda)$ can be defined in a number of equivalent ways:

1. $\hat\beta(\lambda)$ minimizes
$$\sum_{i=1}^n (Y_i - x_i^T \beta)^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \le t(\lambda);$$

2. $\hat\beta(\lambda)$ minimizes
$$\sum_{i=1}^n (x_i^T \beta)^2 \quad \text{subject to} \quad \left| \sum_{i=1}^n (Y_i - x_i^T \beta) x_{ij} \right| \le \lambda \quad \text{for } j = 1, \ldots, p.$$
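Formulation 2's correlation bounds are easy to check numerically on the toy fit above; with sklearn's parameterization the bound works out to $n \cdot \alpha$ rather than the slide's $\lambda$ (the two differ only by constant factors).

```python
import numpy as np

resid = y - X @ fit.coef_
corr = X.T @ resid
print(np.abs(corr))        # each |sum_i (Y_i - x_i'beta) x_ij| is bounded ...
print(len(y) * fit.alpha)  # ... by n * alpha, with equality on the active set
```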

SLIDE 7
  • The advantage of the Lasso is that it produces exact 0 estimates while $\hat\beta(\lambda)$ is a smooth function of $\lambda$.
    – This is very useful when $p \gg n$ to produce "sparse" models.
  • However, when the predictors $\{x_i\}$ are highly correlated, $\hat\beta(\lambda)$ may contain too many zeroes.
  • This is not necessarily undesirable, but some important effects may be missed as a result.
    – How does one interpret a "sparse" model under high collinearity?

SLIDE 8

Question: Why does this happen?

Answer: Redundancy in the constraints

$$\left| \sum_{i=1}^n (Y_i - x_i^T \beta) x_{ij} \right| \le \lambda \quad \text{for } j = 1, \ldots, p$$

due to collinearity; that is, we don't have p independent constraints.

  • The Dantzig selector minimizes $\sum_j |\beta_j|$ subject to similar constraints on the correlations, and thus will tend to behave similarly.
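The redundancy is just rank deficiency of the constraint system, which a two-line check makes concrete (the appended collinear column is my own illustration):

```python
import numpy as np

Xc = np.c_[X, X[:, 0] + X[:, 1]]               # append an exactly collinear column
print(Xc.shape[1], np.linalg.matrix_rank(Xc))  # 6 columns, rank 5: the 6
# correlation constraints Xc^T (y - Xc beta) are linearly dependent
```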

SLIDE 9
  • For LS estimation ($\lambda = 0$), we have

$$\sum_{i=1}^n (Y_i - x_i^T \hat\beta) x_i^T a = 0 \quad \text{for any } a.$$

  • Similarly, we could try to consider estimates $\hat\beta$ such that

$$\left| \sum_{i=1}^n (Y_i - x_i^T \hat\beta) x_i^T a_\ell \right| \le \lambda$$

for some set of vectors (projections) $\{a_\ell : \ell \in L\}$.

  • If the set L is finite, we can incorporate the predictors $\{a_\ell^T x\}$ into the Lasso.

SLIDE 10

Example: Principal components regression ($|L| = p$), where $a_1, \ldots, a_p$ are the eigenvectors of

$$C = \sum_{i=1}^n x_i x_i^T.$$

However ...

  • Projections obtained via PC are based solely on information in the design.
  • Moreover, they need not be particularly easy to interpret.
  • More generally, there's no problem in taking $|L| \gg p$.
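As a sketch, the principal-component projections and the derived predictors $a_\ell^T x_i$ take a few lines of numpy:

```python
import numpy as np

C = X.T @ X                      # C = sum_i x_i x_i^T
eigvals, A = np.linalg.eigh(C)   # columns of A are the eigenvectors a_l
T = X @ A                        # derived predictors t_il = a_l^T x_i
```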
SLIDE 11
  • 3. PROJECTION PURSUIT WITH THE LASSO
  • For collinear predictors, it's often desirable to consider projections of the original predictors.
  • Given predictors $x_1, \ldots, x_p$ and projections $\{a_\ell : \ell \in L\}$, we want to identify "interesting" (data-driven) projections $a_{\ell_1}, \ldots, a_{\ell_p}$ and define new predictors $a_{\ell_1}^T x, \ldots, a_{\ell_p}^T x$.
  • We can take L to be very large, but the projections we consider should be easily interpretable (one such construction is sketched after this list):
    – Coordinate projections (i.e. original predictors).
    – Sums and differences of two or more predictors.
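A sketch of that interpretable projection set, restricted to coordinate projections plus pairwise sums and differences; `projection_set` is a hypothetical helper name, and for $p = 10$ it yields $10 + 45 + 45 = 100$ projections, matching the example later.

```python
import numpy as np
from itertools import combinations

def projection_set(p):
    e = np.eye(p)
    A = [e[j] for j in range(p)]           # coordinate projections
    for j, k in combinations(range(p), 2):
        A.append(e[j] + e[k])              # sums of pairs of predictors
        A.append(e[j] - e[k])              # differences of pairs
    return np.array(A).T                   # columns are the vectors a_l

A = projection_set(X.shape[1])
T = X @ A                                  # derived predictors a_l^T x_i
```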

SLIDE 12

Question: How do we do this?

Answer: Two possibilities:

  • Use the Lasso on the projections.
    – But we need to worry about the choice of λ.
    – The "active" projections will depend on λ.
  • Look at the Lasso solution as λ ↓ 0.
    – This identifies a set of p projections.
    – These projections can be used in the Lasso.

SLIDE 13

Question: What happens to the Lasso solution as λ → 0?

  • Suppose that $\hat\beta(\lambda)$ minimizes

$$\sum_{i=1}^n (Y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^p |\beta_j|$$

and that $C = \sum_{i=1}^n x_i x_i^T$ is singular.

  • Define

$$\mathcal{D} = \left\{ \phi : \sum_{i=1}^n (Y_i - x_i^T \phi)^2 = \min_\beta \sum_{i=1}^n (Y_i - x_i^T \beta)^2 \right\}.$$
SLIDE 14

Proposition: For the Lasso estimate $\hat\beta(\lambda)$, we have

$$\lim_{\lambda \downarrow 0} \hat\beta(\lambda) = \operatorname{argmin} \left\{ \sum_{j=1}^p |\phi_j| : \phi \in \mathcal{D} \right\}.$$

"Proof": Assume (for simplicity) that the minimum RSS is 0. Then $\hat\beta(\lambda)$ minimizes

$$Z_\lambda(\beta) = \frac{1}{\lambda} \sum_{i=1}^n (Y_i - x_i^T \beta)^2 + \sum_{j=1}^p |\beta_j|.$$

As $\lambda \downarrow 0$, the first term of $Z_\lambda$ blows up for $\beta \notin \mathcal{D}$ and is exactly 0 for $\beta \in \mathcal{D}$. The conclusion follows using convexity of $Z_\lambda$.

Corollary: The Dantzig selector estimator has the same limit as $\lambda \downarrow 0$.

SLIDE 15
  • In our problem, define $t_{i\ell}$ to be a scaled version of $a_\ell^T x_i$.
  • The model now becomes

$$Y_i = \sum_{\ell \in L} \phi_\ell t_{i\ell} + \varepsilon_i = t_i^T \phi + \varepsilon_i \quad (i = 1, \ldots, n).$$

  • We estimate $\phi$ by minimizing $\sum_{\ell \in L} |\phi_\ell|$ subject to

$$\sum_{i=1}^n (Y_i - t_i^T \phi) t_i = 0.$$

  • This can be solved using linear programming methods (a sketch is given below).
    – Software for the Lasso tends to be unstable as λ ↓ 0.
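A minimal sketch of that linear program via scipy, using the standard split $\phi = u - v$ with $u, v \ge 0$; `min_l1_solution` is my own hypothetical helper, not code from the talk.

```python
import numpy as np
from scipy.optimize import linprog

def min_l1_solution(T, y):
    """Minimize sum_l |phi_l| subject to T^T (y - T phi) = 0."""
    r = T.shape[1]
    G = T.T @ T                             # singular when |L| exceeds the rank
    res = linprog(c=np.ones(2 * r),         # objective: sum(u) + sum(v)
                  A_eq=np.hstack([G, -G]),  # G (u - v) = T^T y
                  b_eq=T.T @ y,
                  bounds=(0, None))
    u, v = res.x[:r], res.x[r:]
    return u - v

phi = min_l1_solution(T, y)
print(np.round(phi, 3))  # a basic solution: at most rank(T) nonzero entries
```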

SLIDE 16

Asymptotics:

  • Assume p < r = |L| are fixed and n → ∞.
  • Define matrices

$$C = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n x_i x_i^T \qquad D = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n t_i t_i^T$$

where C is non-singular and D is singular with rank p.

  • Then $\hat\phi_n \stackrel{p}{\longrightarrow} \phi_0$ for some $\phi_0$.
  • We also have $\sqrt{n}(\hat\phi_n - \phi_0) \stackrel{d}{\longrightarrow} V$, where the distribution of V is concentrated on the orthogonal complement of the null space of D.
SLIDE 17
  • 4. EXAMPLE

Diabetes data (Efron et al., 2004)

  • Response: measure of disease progression.
  • Predictors: age, sex, BMI, blood pressure, and 6 blood serum measurements (TC, LDL, HDL, TCH, LTG, GLU).
    – Some predictors are quite highly correlated.
  • Analysis indicates that the most important variables are LTG, BMI, BP, TC, and sex.
  • Look at coordinate-wise projections as well as pairwise sums and differences (100 projections in total); a sketch of this setup follows below.
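A sketch of this setup using the diabetes data shipped with scikit-learn (the same Efron et al. (2004) data), together with the hypothetical `projection_set` and `min_l1_solution` helpers sketched earlier; the scaling choices are my assumptions.

```python
import numpy as np
from sklearn.datasets import load_diabetes

data = load_diabetes()
X = (data.data - data.data.mean(axis=0)) / data.data.std(axis=0)
y = data.target - data.target.mean()

A = projection_set(X.shape[1])   # p = 10 -> 10 + 45 + 45 = 100 projections
T = X @ A
T = T / T.std(axis=0)            # "scaled versions" of a_l^T x_i
phi = min_l1_solution(T, y)      # limiting Lasso solution as lambda -> 0
```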

SLIDE 18

[Figure: Lasso coefficient paths for the original predictors (AGE, SEX, BMI, BP, TC, LDL, HDL, TCH, LTG, GLU); x-axis: proportional bound, y-axis: coefficients.]

Lasso plot for original predictors.

SLIDE 19

Results: Estimated projections

  Projection   Estimate
  BMI + LTG       29.86
  LTG − TC        14.79
  LDL − TC        10.32
  BP − SEX         9.61
  BMI + BP         6.64
  BMI + GLU        5.36
  BP + LTG         5.33
  TCH − SEX        4.18
  HDL + TCH        3.48
  BP − AGE         0.55

SLIDE 20

[Figure: Lasso coefficient paths for the 10 identified projections (BMI+LTG, LTG−TC, LDL−TC, BP−SEX, BMI+BP, BMI+GLU, BP+LTG, TCH−SEX, HDL+TCH, BP−AGE); x-axis: proportional bound, y-axis: coefficients.]

Lasso plot for the 10 identified projections.

SLIDE 21

[Figure: Lasso trajectories for the original predictors (AGE, SEX, BMI, BP, TC, LDL, HDL, TCH, LTG, GLU) recovered from the projections; x-axis: proportional bound, y-axis: coefficients.]

Lasso trajectories for original predictors using the projections.