SLIDE 1

STK-IN4300 Statistical Learning Methods in Data Science

Riccardo De Bin

debin@math.uio.no

SLIDE 2

Outline of the lecture

- Shrinkage Methods
- Lasso
- Comparison of Shrinkage Methods
- More on Lasso and Related Path Algorithms

SLIDE 3

Shrinkage Methods: ridge regression and PCR

SLIDE 4

Shrinkage Methods: bias and variance

SLIDE 5

Lasso: Least Absolute Shrinkage and Selection Operator

The lasso is similar to ridge regression, with an L1 penalty instead of the L2 one: minimize

$$\sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2, \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t.$$

Or, in the equivalent Lagrangian form,

$$\hat\beta^{\text{lasso}}(\lambda) = \arg\min_{\beta} \Bigg\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \Bigg\}.$$

- $X$ must be standardized;
- $\beta_0$ is again not considered in the penalty term.
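A minimal sketch of the lasso in practice, assuming scikit-learn is available; its `Lasso` uses the Lagrangian form above, with `alpha` playing the role of $\lambda$ after the squared-error loss is rescaled by $1/(2N)$. The toy data and penalty values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                     # toy design matrix
beta_true = np.array([3.0, 1.5, 0, 0, 2.0, 0, 0, 0, 0, 0])
y = X @ beta_true + rng.normal(size=100)

Xs = StandardScaler().fit_transform(X)             # lasso requires standardized predictors

# larger alpha (i.e. larger lambda) -> stronger shrinkage, more exact zeros
for alpha in [0.01, 0.1, 1.0]:
    fit = Lasso(alpha=alpha).fit(Xs, y)            # the intercept is not penalized
    print(alpha, np.round(fit.coef_, 2))
```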

SLIDE 6

Lasso: remarks

Due to the structure of the L1 norm:
- some estimates are forced to be exactly 0 (variable selection);
- there is no closed form for the estimator.

From a Bayesian perspective:
- $\hat\beta^{\text{lasso}}(\lambda)$ is the posterior mode estimate under the prior $\beta_j \sim \text{Laplace}(0, \tau^2)$;
- for more details, see Park & Casella (2008).

Extreme situations:
- $\lambda \to 0$: $\hat\beta^{\text{lasso}}(\lambda) \to \hat\beta^{\text{OLS}}$;
- $\lambda \to \infty$: $\hat\beta^{\text{lasso}}(\lambda) \to 0$.
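The two extreme situations can be checked numerically; a small sketch assuming scikit-learn, where the penalty values 1e-6 and 1e3 are arbitrary stand-ins for "very small" and "very large":

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + rng.normal(size=200)

print(np.round(LinearRegression().fit(X, y).coef_, 3))   # OLS
print(np.round(Lasso(alpha=1e-6).fit(X, y).coef_, 3))    # lambda -> 0: close to OLS
print(np.round(Lasso(alpha=1e3).fit(X, y).coef_, 3))     # lambda -> infinity: all zeros
```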

SLIDE 7

Lasso: constrained estimation

SLIDE 8

Lasso: shrinkage

SLIDE 9

Lasso: generalized linear models

The lasso (and ridge regression) can be combined with any generalized linear model, e.g., logistic regression. In logistic regression, the lasso solution is the maximizer of

$$\max_{\beta_0, \beta} \Bigg\{ \sum_{i=1}^{N} \Big[ y_i (\beta_0 + \beta^T x_i) - \log\big(1 + e^{\beta_0 + \beta^T x_i}\big) \Big] - \lambda \sum_{j=1}^{p} |\beta_j| \Bigg\}.$$

Note:
- penalized logistic regression can be applied to problems with high-dimensional data (see Section 18.4).
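A sketch of L1-penalized logistic regression, assuming scikit-learn; its `LogisticRegression` is parametrized by `C`, the inverse of the penalty strength (small `C` corresponds to large $\lambda$), and the simulated data below are purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = StandardScaler().fit_transform(rng.normal(size=(300, 20)))
logits = 2.0 * X[:, 0] - X[:, 1] + 0.5
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print(int(np.sum(fit.coef_ != 0)), "non-zero coefficients out of", X.shape[1])
```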

SLIDE 10

Comparison of Shrinkage Methods: coefficient profiles

SLIDE 11

Comparison of Shrinkage Methods: coefficient profiles

SLIDE 12

Comparison of Shrinkage Methods: coefficient profiles

SLIDE 13

More on Lasso and Related Path Algorithms: generalization

A generalization that includes both the lasso and ridge regression is bridge regression:

$$\tilde\beta(\lambda) = \arg\min_{\beta} \Bigg\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \Bigg\}, \quad q \ge 0,$$

where:
- $q = 0$ → best subset selection;
- $q = 1$ → lasso;
- $q = 2$ → ridge regression.
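For illustration only, a sketch of the bridge objective minimized with a generic optimizer, assuming NumPy and SciPy; the intercept is dropped and the data are made up. A smooth optimizer is fine for q > 1 but will not reproduce the exact zeros of the non-differentiable cases q ≤ 1, for which dedicated algorithms are used in practice.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(size=100)

def bridge_objective(beta, X, y, lam, q):
    # residual sum of squares plus the L_q bridge penalty
    return np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta) ** q)

for q in [2.0, 1.5, 1.2]:
    res = minimize(bridge_objective, x0=np.zeros(X.shape[1]), args=(X, y, 5.0, q))
    print(q, np.round(res.x, 3))
```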

SLIDE 14

More on Lasso and Related Path Algorithms: generalization

Note that:
- $0 < q \le 1$ → the penalty is not differentiable at 0;
- $1 < q < 2$ → a compromise between lasso and ridge, but the penalty becomes differentiable at 0 ⇒ the variable selection property is lost;
- $q$ defines the shape of the constraint region;
- $q$ could be estimated from the data (as a further tuning parameter);
- in practice this does not work well (too much variance).

SLIDE 15

More on Lasso and Related Path Algorithms: elastic net

A different compromise between lasso and ridge regression: the elastic net,

$$\tilde\beta(\lambda) = \arg\min_{\beta} \Bigg\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \big( \alpha |\beta_j| + (1 - \alpha) \beta_j^2 \big) \Bigg\}.$$

Idea:
- the L1 penalty takes care of variable selection;
- the L2 penalty helps in correctly handling correlated predictors;
- $\alpha$ defines how much L1 and L2 penalty should be used:
  - it is a tuning parameter, to be chosen in addition to $\lambda$;
  - a grid search is discouraged;
  - in real experiments it is often very close to 0 or 1.
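A sketch with scikit-learn's `ElasticNet`, assuming that library; there `l1_ratio` plays the role of $\alpha$ above and `alpha` that of $\lambda$ (up to the usual $1/(2N)$ rescaling of the loss), and the correlated toy predictors are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(4)
z = rng.normal(size=(150, 1))
X = np.hstack([z + 0.05 * rng.normal(size=(150, 3)),   # three strongly correlated predictors
               rng.normal(size=(150, 7))])
y = X[:, 0] + X[:, 1] + rng.normal(size=150)

Xs = StandardScaler().fit_transform(X)
fit = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(Xs, y)   # l1_ratio=1 -> lasso, 0 -> ridge
print(np.round(fit.coef_, 3))
```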

SLIDE 16

More on Lasso and Related Path Algorithms: elastic net

Comparing bridge regression (with $1 < q < 2$) and the elastic net:
- the two penalties look very similar;
- there is nevertheless a huge practical difference, due to differentiability at 0: the elastic net retains the variable selection property, the bridge penalty does not.

SLIDE 17

More on Lasso and Related Path Algorithms: Least Angle Regression

The Least Angle Regression (LAR):
- can be viewed as a "democratic" version of forward selection;
- sequentially adds a new predictor to the model, but only "as much as it deserves";
- eventually reaches the least squares solution;
- is strongly connected with the lasso:
  - the lasso can be seen as a special case of LAR;
  - LAR is often used to fit lasso models.

SLIDE 18

More on Lasso and Related Path Algorithms: LAR

Least Angle Regression:

1. Standardize the predictors (mean zero, unit norm). Initialize the residual $r = y - \bar y$ and the regression coefficient estimates $\beta_1 = \cdots = \beta_p = 0$.
2. Find the predictor $x_j$ most correlated with $r$.
3. Move $\hat\beta_j$ from 0 towards its least-squares coefficient $\langle x_j, r \rangle$, until some other predictor $x_k$, $k \ne j$, has as much correlation with the current residual, i.e. $\text{corr}(x_k, r) = \text{corr}(x_j, r)$.
4. Add $x_k$ to the active set and move $\hat\beta_j$ and $\hat\beta_k$ together towards their joint least-squares coefficient, until some other predictor $x_l$ has as much correlation with the current residual.
5. Continue until all $p$ predictors have been entered.
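A sketch of the LAR/lasso path with scikit-learn's `lars_path`, assuming that library and a simulated data set; `method="lar"` gives the plain LAR path, while `method="lasso"` gives the lasso modification mentioned above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lars_path

X, y = make_regression(n_samples=100, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

alphas, active, coefs = lars_path(X, y, method="lasso")
print(active)        # order in which predictors enter the active set
print(coefs.shape)   # (n_features, n_steps): the whole coefficient path
```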

SLIDE 19

More on Lasso and Related Path Algorithms: comparison

SLIDE 20

More on Lasso and Related Path Algorithms: overfit

SLIDE 21

More on Lasso and Related Path Algorithms: other shrinkage methods

Group Lasso. Sometimes predictors belong to the same group:
- genes that belong to the same molecular pathway;
- dummy variables derived from the same categorical variable, ...

Suppose the $p$ predictors are partitioned into $L$ groups; the group lasso minimizes

$$\min_{\beta} \Bigg\{ \Big\| y - \beta_0 \mathbf{1} - \sum_{\ell=1}^{L} X_\ell \beta_\ell \Big\|_2^2 + \lambda \sum_{\ell=1}^{L} \sqrt{p_\ell}\, \|\beta_\ell\|_2 \Bigg\},$$

where:
- $\sqrt{p_\ell}$ accounts for the group sizes;
- $\|\cdot\|_2$ denotes the (not squared) Euclidean norm, which is 0 only if all components of $\beta_\ell$ are 0;
- sparsity is encouraged at both the group and the individual levels.
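The structure of the penalty is easy to see in code; a small sketch in plain NumPy (the helper `group_lasso_penalty`, the toy coefficients and the grouping are all invented for illustration; this is not a fitting routine):

```python
import numpy as np

def group_lasso_penalty(beta, groups, lam):
    """Group lasso penalty: lam * sum_l sqrt(p_l) * ||beta_l||_2."""
    return lam * sum(np.sqrt(len(g)) * np.linalg.norm(beta[g]) for g in groups)

beta = np.array([0.0, 0.0, 0.0, 1.5, -0.5, 2.0])
groups = [np.array([0, 1, 2]), np.array([3, 4]), np.array([5])]  # a partition of the predictors
print(group_lasso_penalty(beta, groups, lam=1.0))  # the first group contributes exactly 0
```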

SLIDE 22

More on Lasso and Related Path Algorithms: other shrinkage methods

Non-negative garrote. The idea of the lasso originates from the non-negative garrote, which, starting from the OLS estimates $\hat\beta_j^{\text{OLS}}$, solves

$$\min_{c} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} c_j \hat\beta_j^{\text{OLS}} x_{ij} \Big)^2, \quad \text{subject to } c_j \ge 0 \text{ and } \sum_{j} c_j \le t,$$

and sets $\hat\beta_j^{\text{garrote}} = \hat c_j \hat\beta_j^{\text{OLS}}$. The non-negative garrote therefore starts with the OLS estimates and shrinks them:
- by non-negative factors;
- the sum of the non-negative factors is constrained;
- for more information, see Breiman (1995).

SLIDE 23

More on Lasso and Related Path Algorithms: other shrinkage methods

In the case of orthogonal design ($X^T X = I$),

$$c_j(\lambda) = \Bigg( 1 - \frac{\lambda}{\big(\hat\beta_j^{\text{OLS}}\big)^2} \Bigg)_+,$$

where $\lambda$ is a tuning parameter (related to $t$). Note that the solution depends on $\hat\beta^{\text{OLS}}$:
- it cannot be applied in $p \gg N$ problems;
- it may be a problem when $\hat\beta^{\text{OLS}}$ behaves poorly;
- it has the oracle properties (Yuan & Lin, 2006) ← see below.

SLIDE 24

More on Lasso and Related Path Algorithms: other shrinkage methods

Comparison between lasso (left) and non-negative garrote (right). (picture from Tibshirani, 1996)

SLIDE 25

More on Lasso and Related Path Algorithms: the oracle property

Let:
- $\mathcal{A} := \{ j : \beta_j \ne 0 \}$ be the set of the truly relevant coefficients;
- $\delta$ be a fitting procedure (lasso, non-negative garrote, ...);
- $\hat\beta(\delta)$ the coefficient estimator produced by the procedure $\delta$.

We would like $\delta$ to:
(a) identify the right subset model, $\{ j : \hat\beta_j(\delta) \ne 0 \} = \mathcal{A}$;
(b) have the optimal estimation rate, $\sqrt{n}\,\big(\hat\beta(\delta)_{\mathcal{A}} - \beta_{\mathcal{A}}\big) \xrightarrow{d} N(0, \Sigma)$, where $\Sigma$ is the covariance matrix of the true subset model.

If $\delta$ satisfies both (a) and (b), it is called an oracle procedure.

SLIDE 26

More on Lasso and Related Path Algorithms: lasso and oracle property

Consider the following setup (Knight & Fu, 2000):
- $y_i = x_i \beta + \epsilon_i$, with the $\epsilon_i$ i.i.d. random variables with mean 0 and variance $\sigma^2$;
- $n^{-1} X^T X \to C$, where $C$ is a positive definite matrix;
- suppose w.l.o.g. that $\mathcal{A} = \{1, 2, \ldots, p_0\}$, with $p_0 < p$;
- $C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}$, with $C_{11}$ a $p_0 \times p_0$ matrix;
- $\hat{\mathcal{A}} = \{ j : \hat\beta_j^{\text{lasso}}(\lambda) \ne 0 \}$.

SLIDE 27

More on Lasso and Related Path Algorithms: lasso and oracle property

Knight & Fu (2000) proved the following two lemmas.

Lemma (1). If $\lambda/n \to \lambda_0 \ge 0$, then $\hat\beta^{\text{lasso}}(\lambda) \xrightarrow{p} \arg\min_u V_1(u)$, where

$$V_1(u) = (u - \beta)^T C (u - \beta) + \lambda_0 \sum_{j=1}^{p} |u_j|.$$

Lemma (2). If $\lambda/\sqrt{n} \to \lambda_0 \ge 0$, then $\sqrt{n}\,\big(\hat\beta^{\text{lasso}}(\lambda) - \beta\big) \xrightarrow{d} \arg\min_u V_2(u)$, where

$$V_2(u) = -2 u^T W + u^T C u + \lambda_0 \sum_{j=1}^{p} \big[ u_j\, \mathrm{sgn}(\beta_j)\, 1(\beta_j \ne 0) + |u_j|\, 1(\beta_j = 0) \big]$$

and $W \sim N(0, \sigma^2 C)$.

SLIDE 28

More on Lasso and Related Path Algorithms: lasso and oracle property

From Lemma (1):
- only $\lambda_0 = 0$ guarantees consistency of the estimator.

From Lemma (2):
- the lasso estimate is $\sqrt{n}$-consistent;
- however, when $\lambda = O(\sqrt{n})$, the selected set $\hat{\mathcal{A}}$ cannot equal $\mathcal{A}$ with probability tending to 1.

Proposition (1). If $\lambda/\sqrt{n} \to \lambda_0 \ge 0$, then $\limsup_n P[\hat{\mathcal{A}} = \mathcal{A}] \le c < 1$. For the proof, see Zou (2006).

SLIDE 29

More on Lasso and Related Path Algorithms: lasso and oracle property

It may be interesting to see what happens in the intermediate case, when $\lambda_0 = \infty$, i.e., $\lambda/n \to 0$ and $\lambda/\sqrt{n} \to \infty$.

Lemma (3). If $\lambda/n \to 0$ and $\lambda/\sqrt{n} \to \infty$, then $\frac{n}{\lambda}\,\big(\hat\beta^{\text{lasso}}(\lambda) - \beta\big) \xrightarrow{p} \arg\min_u V_3(u)$, where

$$V_3(u) = u^T C u + \sum_{j=1}^{p} \big[ u_j\, \mathrm{sgn}(\beta_j)\, 1(\beta_j \ne 0) + |u_j|\, 1(\beta_j = 0) \big].$$

Note:
- the convergence rate of $\hat\beta^{\text{lasso}}(\lambda)$ is slower than $\sqrt{n}$;
- the optimal estimation rate is available only when $\lambda = O(\sqrt{n})$, but that choice leads to inconsistent variable selection;
- for the proof, see Zou (2006).

SLIDE 30

More on Lasso and Related Path Algorithms: necessary condition

Can consistency in variable selection be achieved by sacrificing the rate of convergence in estimation? Not necessarily. It is possible to derive a necessary condition for consistency of the lasso variable selection (Zou, 2006):

Theorem (necessary condition). Suppose that $\lim_n P[\hat{\mathcal{A}} = \mathcal{A}] = 1$. Then there exists some sign vector $s = (s_1, \ldots, s_{p_0})^T$, $s_j \in \{-1, 1\}$, such that

$$|C_{21} C_{11}^{-1} s| \le 1. \tag{1}$$

The inequality in (1) is understood componentwise.

SLIDE 31

More on Lasso and Related Path Algorithms: necessary condition

If condition (1) fails, the lasso variable selection is inconsistent.

Corollary (1). Suppose that $p_0 = 2m + 1 \ge 3$ and $p = p_0 + 1$, so there is one irrelevant predictor. Let $C_{11} = (1 - \rho_1) I + \rho_1 J_1$, where $J_1$ is the matrix of 1's, $C_{12} = \rho_2 \mathbf{1}$ and $C_{22} = 1$. If

$$-\frac{1}{p_0 - 1} < \rho_1 < -\frac{1}{p_0} \quad \text{and} \quad 1 + (p_0 - 1)\rho_1 < |\rho_2| < \sqrt{\frac{1 + (p_0 - 1)\rho_1}{p_0}},$$

then condition (1) cannot be satisfied, so the lasso variable selection is inconsistent.

SLIDE 32

More on Lasso and Related Path Algorithms: Corollary (1)

Proof of Corollary (1).

Note that
- $C_{11}^{-1} = \frac{1}{1 - \rho_1} \Big( I - \frac{\rho_1}{1 + (p_0 - 1)\rho_1} J_1 \Big)$;
- $C_{21} C_{11}^{-1} = \frac{\rho_2}{1 + (p_0 - 1)\rho_1} \mathbf{1}^T$.

Therefore $C_{21} C_{11}^{-1} s = \frac{\rho_2}{1 + (p_0 - 1)\rho_1} \big( \sum_{j=1}^{p_0} s_j \big)$, and condition (1) becomes

$$\Bigg| \frac{\rho_2}{1 + (p_0 - 1)\rho_1} \Bigg| \cdot \Bigg| \sum_{j=1}^{p_0} s_j \Bigg| \le 1.$$

Note that when $p_0$ is an odd number, $\big| \sum_{j=1}^{p_0} s_j \big| \ge 1$. Hence, if $\big| \frac{\rho_2}{1 + (p_0 - 1)\rho_1} \big| > 1$, condition (1) cannot be satisfied for any sign vector. The choice of $(\rho_1, \rho_2)$ ensures that $C$ is a positive definite matrix and that $\big| \frac{\rho_2}{1 + (p_0 - 1)\rho_1} \big| > 1$.

SLIDE 33

More on Lasso and Related Path Algorithms: other shrinkage methods

The Smoothly Clipped Absolute Deviation (SCAD) estimator is

$$\hat\beta^{\text{scad}}(\lambda, \alpha) = \arg\min_{\beta} \Bigg\{ \frac{1}{2} \| y - \beta_0 \mathbf{1} - X\beta \|_2^2 + \sum_{j=1}^{p} p_j(\beta_j; \lambda, \alpha) \Bigg\},$$

where the penalty is defined through its derivative,

$$\frac{d\, p_j(\beta_j; \lambda, \alpha)}{d\beta_j} = \lambda \Bigg\{ 1(|\beta_j| \le \lambda) + \frac{(\alpha\lambda - |\beta_j|)_+}{(\alpha - 1)\lambda}\, 1(|\beta_j| > \lambda) \Bigg\}, \quad \text{for } \alpha > 2.$$

Usually:
- $\alpha$ is set equal to 3.7 (based on simulations);
- $\lambda$ is chosen via cross-validation.
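A small NumPy sketch of the SCAD penalty derivative given above (the function name and the example values are just for illustration); it shows the lasso-like penalization for small coefficients and the vanishing penalization for large ones:

```python
import numpy as np

def scad_penalty_derivative(beta, lam, alpha=3.7):
    """Derivative of the SCAD penalty with respect to |beta| (Fan & Li, 2001)."""
    b = np.abs(beta)
    return lam * (np.where(b <= lam, 1.0, 0.0)
                  + np.maximum(alpha * lam - b, 0.0) / ((alpha - 1) * lam)
                  * np.where(b > lam, 1.0, 0.0))

beta = np.linspace(0, 5, 6)
print(np.round(scad_penalty_derivative(beta, lam=1.0), 3))
# constant (lasso-like) near 0, then decreasing, then 0 for |beta| > alpha * lam
```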

SLIDE 34

More on Lasso and Related Path Algorithms: other shrinkage methods

The SCAD penalty function:
- penalizes the largest regression coefficient estimates less;
- makes the solution continuous.

In particular,

$$\hat\beta^{\text{scad}}(\lambda, \alpha) = \begin{cases} \mathrm{sgn}(\beta)\,(|\beta| - \lambda)_+ & \text{when } |\beta| \le 2\lambda, \\ \big\{ (\alpha - 1)\beta - \mathrm{sgn}(\beta)\,\alpha\lambda \big\} / (\alpha - 2) & \text{when } 2\lambda < |\beta| \le \alpha\lambda, \\ \beta & \text{when } |\beta| > \alpha\lambda. \end{cases}$$

Note that:
- there exists a $\sqrt{n}$-consistent SCAD estimator (Fan & Li, 2001, Theorem 1);
- the SCAD estimator $\hat\beta^{\text{scad}}(\lambda, \alpha)$ is an oracle estimator (Fan & Li, 2001, Theorem 2).
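The piecewise rule above translates directly into code; a sketch in NumPy (the helper name `scad_threshold` and the grid of values are illustrative), which also makes the continuity at $2\lambda$ and $\alpha\lambda$ easy to check numerically:

```python
import numpy as np

def scad_threshold(beta, lam, alpha=3.7):
    """Apply the piecewise SCAD rule to an array of (unpenalized) estimates."""
    b = np.abs(beta)
    soft = np.sign(beta) * np.maximum(b - lam, 0.0)             # |beta| <= 2*lam
    middle = ((alpha - 1) * beta - np.sign(beta) * alpha * lam) / (alpha - 2)
    return np.where(b <= 2 * lam, soft,
                    np.where(b <= alpha * lam, middle, beta))    # |beta| > alpha*lam: untouched

grid = np.linspace(-6, 6, 13)
print(np.round(scad_threshold(grid, lam=1.0), 2))
```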

SLIDE 35

More on Lasso and Related Path Algorithms: other shrinkage methods

SLIDE 36

More on Lasso and Related Path Algorithms: other shrinkage methods

Adaptive lasso. The adaptive lasso is a particular case of the weighted lasso,

$$\hat\beta^{\text{weight}}(\lambda) = \arg\min_{\beta} \Bigg\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} w_j |\beta_j| \Bigg\},$$

in which $\hat w_j = 1 / |\hat\beta_j^{\text{OLS}}|^{\gamma}$. Note:
- it enjoys the oracle properties (Zou, 2006, Theorem 2);
- when $\gamma = 1$, it is very closely related to the non-negative garrote (which has an additional sign constraint);
- it relies on $\hat\beta^{\text{OLS}}$ → sometimes the lasso is used in a first step instead;
- the tuning parameter is two-dimensional, $(\lambda, \gamma)$.
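A sketch of the adaptive lasso assuming scikit-learn: the weighted penalty is obtained by rescaling each column of $X$ by $1/w_j$, fitting an ordinary lasso, and transforming the coefficients back. The value of $\gamma$, the penalty level and the simulated data are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 8))
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0, -2.0]) + rng.normal(size=200)

gamma = 1.0
w = 1.0 / np.abs(LinearRegression().fit(X, y).coef_) ** gamma   # adaptive weights from OLS

# lasso on the rescaled predictors X_j / w_j is equivalent to penalizing w_j * |beta_j|
fit = Lasso(alpha=0.1).fit(X / w, y)
beta_adaptive = fit.coef_ / w
print(np.round(beta_adaptive, 3))
```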

SLIDE 37

More on Lasso and Related Path Algorithms: other shrinkage methods

(a) best subset regression; (b) bridge with α = 0.5. (picture from Zou, 2006)

SLIDE 38

More on Lasso and Related Path Algorithms: other shrinkage methods

(c) lasso; (d) SCAD. (picture from Zou, 2006)

SLIDE 39

More on Lasso and Related Path Algorithms: other shrinkage methods

(e) adaptive lasso with γ = 0.5; (f) adaptive lasso with γ = 0.2. (picture from Zou, 2006)

SLIDE 40

References I

Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics 37, 373-384.

Fan, J. & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348-1360.

Knight, K. & Fu, W. (2000). Asymptotics for lasso-type estimators. Annals of Statistics 28, 1356-1378.

Park, T. & Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association 103, 681-686.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological) 58, 267-288.

Yuan, M. & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B (Methodological) 68, 49-67.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418-1429.
