

SLIDE 1

Ridge/Lasso Regression, Model Selection

Xuezhi Wang

Computer Science Department, Carnegie Mellon University

10-701 Recitation, Apr 22

SLIDE 2

Outline

1. Ridge/Lasso Regression
   - Linear Regression
   - Regularization
   - Probabilistic Interpretation

2. Model Selection
   - Variable Selection
   - Model Selection


SLIDE 4

Linear Regression

Data $X$: an $N \times P$ matrix; target $y$: an $N \times 1$ vector. There are $N$ samples, each with $P$ features.

We want to find $\theta$ so that $y$ and $X\theta$ are as close as possible, so we pick the $\theta$ that minimizes the cost function

$$L = \frac{1}{2}\sum_i (y_i - X_i\theta)^2 = \frac{1}{2}\|y - X\theta\|^2$$

Using gradient descent, each coordinate $j$ is updated as

$$\theta_j^{t+1} = \theta_j^t - \text{step} \cdot \frac{\partial L}{\partial \theta_j} = \theta_j^t - \text{step} \cdot \sum_i (y_i - X_i\theta)(-X_{ij})$$
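A minimal NumPy sketch of this gradient-descent update (the function name, step size, and iteration count are illustrative choices, not from the slides):

```python
import numpy as np

def linear_regression_gd(X, y, step=0.01, n_iters=1000):
    """Minimize L = 0.5 * ||y - X theta||^2 by batch gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        residual = y - X @ theta      # (y_i - X_i theta) for every sample i
        grad = -X.T @ residual        # dL/dtheta_j = sum_i (y_i - X_i theta) * (-X_ij)
        theta = theta - step * grad
    return theta
```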

SLIDE 5

Linear Regression

Matrix form:

$$L = \frac{1}{2}\sum_i (y_i - X_i\theta)^2 = \frac{1}{2}\|y - X\theta\|^2 = \frac{1}{2}(y - X\theta)^\top (y - X\theta) = \frac{1}{2}\left(y^\top y - y^\top X\theta - \theta^\top X^\top y + \theta^\top X^\top X\theta\right)$$

Take the derivative with respect to $\theta$ and set it to zero:

$$\frac{\partial L}{\partial \theta} = \frac{1}{2}\left(-2X^\top y + 2X^\top X\theta\right) = 0$$

Hence we get $\theta = (X^\top X)^{-1} X^\top y$.
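A quick NumPy check of the closed-form solution (a sketch with synthetic data; np.linalg.solve is used instead of forming an explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 5
X = rng.normal(size=(N, P))
true_theta = rng.normal(size=P)
y = X @ true_theta + 0.1 * rng.normal(size=N)

# Normal equations theta = (X^T X)^{-1} X^T y, solved as a linear system
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(theta_hat, true_theta, atol=0.1))   # close to the generating theta
```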

SLIDE 6

Linear Regression

Comparison of iterative and matrix methods:
- Matrix methods reach the solution in a single step, but can be infeasible for real-time data or very large datasets.
- Iterative methods can be used on large practical problems, but require choosing a learning rate.

Any problems? The data $X$ is an $N \times P$ matrix. Usually $N > P$, i.e., there are more data points than feature dimensions, and usually $X$ has full column rank. In that case $X^\top X$ has rank $P$, i.e., it is invertible. What if $X$ has less than full column rank?
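A small NumPy illustration of the rank-deficient case (synthetic data; the duplicated column is the deliberate assumption here):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
X = np.column_stack([x1, 2 * x1, rng.normal(size=50)])   # first two columns are collinear

print(np.linalg.matrix_rank(X))        # 2 < P = 3: less than full column rank
print(np.linalg.matrix_rank(X.T @ X))  # also 2, so X^T X is singular and cannot be inverted
```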


SLIDE 8

Regularization: ℓ2 norm

Ridge Regression:

$$\min_\theta \; \frac{1}{2}\sum_i (y_i - X_i\theta)^2 + \lambda \|\theta\|_2^2$$

The solution is given by $\theta = (X^\top X + \lambda I)^{-1} X^\top y$.

- Results in a solution with small $\theta$.
- Solves the problem that $X^\top X$ is not invertible.
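A minimal NumPy sketch of the ridge solution stated above (lam is an arbitrary illustrative value):

```python
import numpy as np

def ridge_regression(X, y, lam=1.0):
    """theta = (X^T X + lambda I)^{-1} X^T y, computed via a linear solve."""
    P = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)
```

Because $X^\top X + \lambda I$ is positive definite for $\lambda > 0$, the solve succeeds even when $X^\top X$ itself is singular.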

SLIDE 9

Regularization: ℓ1 norm

Lasso Regression:

$$\min_\theta \; \frac{1}{2}\sum_i (y_i - X_i\theta)^2 + \lambda \|\theta\|_1$$

The solution is given by taking the subgradient:

$$\sum_i (y_i - X_i\theta)(-X_{ij}) + \lambda t_j$$

where $t_j$ is the subgradient of the $\ell_1$ norm: $t_j = \mathrm{sign}(\theta_j)$ if $\theta_j \neq 0$, and $t_j \in [-1, 1]$ otherwise.

This yields a sparse solution, i.e., $\theta$ will be a vector with more zero coordinates, which is good for high-dimensional problems.
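A direct (if slow) NumPy sketch of subgradient descent on this objective; the step size and iteration count are arbitrary illustrative values:

```python
import numpy as np

def lasso_subgradient_descent(X, y, lam=0.1, step=0.001, n_iters=5000):
    """Subgradient descent on 0.5 * ||y - X theta||^2 + lam * ||theta||_1."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        t = np.sign(theta)                        # subgradient of ||theta||_1 (0 is a valid choice at theta_j = 0)
        grad = -X.T @ (y - X @ theta) + lam * t
        theta = theta - step * grad
    return theta
```

Plain subgradient steps rarely land exactly on zero, so practical solvers use soft-thresholding (coordinate descent / ISTA) or the path algorithm on the next slide to obtain exact sparsity.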

SLIDE 10

Solving Lasso regression

Efron et al. proposed LARS (Least Angle Regression), which computes the LASSO path efficiently.

Forward stagewise algorithm (assume $X$ is standardized and $y$ is centered; choose a small $\epsilon$):

1. Start with the initial residual $r = y$ and $\theta_1 = \dots = \theta_P = 0$.
2. Find the predictor $Z_j$ ($j$th column of $X$) most correlated with $r$.
3. Update $\theta_j \leftarrow \theta_j + \delta_j$, where $\delta_j = \epsilon \cdot \mathrm{sign}(Z_j^\top r)$.
4. Set $r \leftarrow r - \delta_j Z_j$, and repeat steps 2 and 3.
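A NumPy sketch of this forward stagewise loop (eps and n_steps are arbitrary illustrative values):

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=1000):
    """Forward stagewise regression; assumes X is standardized and y is centered."""
    theta = np.zeros(X.shape[1])
    r = y.copy()                          # step 1: initial residual r = y
    for _ in range(n_steps):
        corr = X.T @ r                    # step 2: correlation of each predictor with r
        j = np.argmax(np.abs(corr))       #         pick the most correlated predictor Z_j
        delta = eps * np.sign(corr[j])    # step 3: delta_j = eps * sign(Z_j^T r)
        theta[j] += delta
        r -= delta * X[:, j]              # step 4: update the residual and repeat
    return theta
```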

SLIDE 11

Comparison of Ridge and Lasso regression:

Two-dimensional case:

SLIDE 12

Comparison of Ridge and Lasso regression:

Higher dimensional case:

SLIDE 13

Choosing λ

Standard practice now is to use cross-validation
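A minimal sketch of choosing λ by cross-validation with scikit-learn's LassoCV (synthetic data; 5 folds is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
theta = np.zeros(20)
theta[:3] = [2.0, -1.0, 0.5]            # only the first three features matter
y = X @ theta + 0.1 * rng.normal(size=200)

model = LassoCV(cv=5).fit(X, y)         # cross-validates over a grid of lambda values
print(model.alpha_)                     # the lambda chosen by cross-validation
print(np.nonzero(model.coef_)[0])       # the features kept in the fitted model
```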


SLIDE 15

Probabilistic Interpretation of Linear Regression

Assume $y_i = X_i\theta + \epsilon_i$, where $\epsilon_i$ is random noise, and assume $\epsilon_i \sim N(0, \sigma^2)$. Then

$$p(y_i \mid X_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(y_i - X_i\theta)^2}{2\sigma^2}\right\}$$

Since the data points are i.i.d., the data likelihood is

$$L(\theta) = \prod_{i=1}^N p(y_i \mid X_i; \theta) \propto \exp\left\{-\frac{\sum_{i=1}^N (y_i - X_i\theta)^2}{2\sigma^2}\right\}$$

The log-likelihood is

$$\ell(\theta) = -\frac{\sum_{i=1}^N (y_i - X_i\theta)^2}{2\sigma^2} + \text{const}$$

Maximizing the log-likelihood is equivalent to minimizing $\sum_{i=1}^N (y_i - X_i\theta)^2$, i.e., the loss function in linear regression!

SLIDE 16

Probabilistic Interpretation of Ridge Regression

Assume a Gaussian prior on $\theta$: $\theta \sim N(0, \tau^2 I)$, i.e., $p(\theta) \propto \exp\{-\theta^\top\theta / 2\tau^2\}$.

Now take the MAP estimate of $\theta$:

$$p(\theta \mid X, y) \propto p(y \mid X; \theta)\, p(\theta) = \exp\left\{-\frac{\sum_{i=1}^N (y_i - X_i\theta)^2}{2\sigma^2}\right\} \exp\left\{-\theta^\top\theta / 2\tau^2\right\}$$

The log-posterior is

$$\ell(\theta \mid X, y) = -\frac{\sum_{i=1}^N (y_i - X_i\theta)^2}{2\sigma^2} - \theta^\top\theta / 2\tau^2 + \text{const}$$

which matches $\min_\theta \frac{1}{2}\sum_i (y_i - X_i\theta)^2 + \lambda\|\theta\|_2^2$, where $\lambda$ is a constant determined by $\sigma^2$ and $\tau^2$.
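Making the constant explicit: multiplying the negative log-posterior by $\sigma^2$ (a positive constant, so the minimizer is unchanged) gives

$$\frac{1}{2}\sum_i (y_i - X_i\theta)^2 + \frac{\sigma^2}{2\tau^2}\,\theta^\top\theta,$$

so the correspondence holds with $\lambda = \sigma^2 / (2\tau^2)$.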

SLIDE 17

Probabilistic Interpretation of Lasso Regression

Assume a Laplace prior on each $\theta_i$: $\theta_i \overset{iid}{\sim} \mathrm{Laplace}(0, t)$, i.e., $p(\theta_i) \propto \exp\{-|\theta_i|/t\}$.

Now take the MAP estimate of $\theta$:

$$p(\theta \mid X, y) \propto p(y \mid X; \theta)\, p(\theta) = \exp\left\{-\frac{\sum_{i=1}^N (y_i - X_i\theta)^2}{2\sigma^2}\right\} \exp\left\{-\sum_i |\theta_i|/t\right\}$$

The log-posterior is

$$\ell(\theta \mid X, y) = -\frac{\sum_{i=1}^N (y_i - X_i\theta)^2}{2\sigma^2} - \sum_i |\theta_i|/t + \text{const}$$

which matches $\min_\theta \frac{1}{2}\sum_i (y_i - X_i\theta)^2 + \lambda\|\theta\|_1$, where $\lambda$ is a constant determined by $\sigma^2$ and $t$.
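As before, multiplying by $\sigma^2$ gives $\frac{1}{2}\sum_i (y_i - X_i\theta)^2 + \frac{\sigma^2}{t}\|\theta\|_1$, so here $\lambda = \sigma^2 / t$.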


SLIDE 19

Variable Selection

Considering all "best" subsets is of order $O(2^P)$ (combinatorial explosion).

Stepwise selection:
- A new variable may be added to the model even when it gives only a small improvement in LMS.
- When stepwise selection is applied to a perturbation of the data, a different set of variables will probably enter the model at each stage.

LASSO produces sparse solutions, which takes care of model selection. We can even see when variables jump into the model by looking at the LASSO path (see the sketch below).
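A sketch of inspecting when each variable enters along the LASSO path, using scikit-learn's lasso_path on synthetic data:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
theta = np.zeros(10)
theta[:3] = [3.0, -2.0, 1.0]                     # only three relevant variables
y = X @ theta + 0.1 * rng.normal(size=200)

# alphas is a decreasing grid of lambda values; coefs has shape (n_features, n_alphas)
alphas, coefs, _ = lasso_path(X, y)
for j in range(X.shape[1]):
    nonzero = np.nonzero(coefs[j] != 0)[0]
    if nonzero.size:
        print(f"variable {j} enters the model at lambda = {alphas[nonzero[0]]:.4f}")
```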


SLIDE 21

Example

Suppose you have data $Y_1, \dots, Y_n$ and you want to model the distribution of $Y$. Some popular models are:
- the Exponential distribution: $f(y; \theta) = \theta e^{-\theta y}$
- the Gaussian distribution: $f(y; \mu, \sigma^2) \sim N(\mu, \sigma^2)$
- ...

How do you know which model is better?

SLIDE 22

AIC

Suppose we have models $M_1, \dots, M_k$, where each model is a set of densities:

$$M_j = \{p(y; \theta_j) : \theta_j \in \Theta_j\}$$

We have data $Y_1, \dots, Y_n$ drawn from some density $f$ (not necessarily one of these models). Define

$$\mathrm{AIC}(j) = \ell_j(\hat\theta_j) - 2 d_j$$

where $\ell_j(\theta_j)$ is the log-likelihood, $\hat\theta_j$ is the parameter that maximizes the log-likelihood, and $d_j$ is the dimension of $\Theta_j$. We choose the model with the largest AIC.
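A sketch of scoring the two candidate families from the earlier example with this AIC definition (synthetic data; SciPy is used only for the log-densities):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
Y = rng.exponential(scale=2.0, size=500)          # pretend the true family is unknown

# Exponential model: MLE theta_hat = 1 / mean(Y), dimension d = 1
theta_hat = 1.0 / Y.mean()
ll_exp = stats.expon.logpdf(Y, scale=1.0 / theta_hat).sum()
aic_exp = ll_exp - 2 * 1

# Gaussian model: MLE (mu_hat, sigma_hat), dimension d = 2
mu_hat, sigma_hat = Y.mean(), Y.std()
ll_gauss = stats.norm.logpdf(Y, loc=mu_hat, scale=sigma_hat).sum()
aic_gauss = ll_gauss - 2 * 2

print("prefer", "Exponential" if aic_exp > aic_gauss else "Gaussian")
```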

SLIDE 23

BIC

Bayesian Information Criterion: we choose $j$ to maximize

$$\mathrm{BIC}_j = \ell_j(\hat\theta_j) - \frac{d_j}{2}\log n$$

which is similar to AIC, but the penalty is harsher, hence BIC tends to choose simpler models.

SLIDE 24

Simple example

Let $Y_1, \dots, Y_n \sim N(\mu, 1)$. We want to compare two models: $M_0: N(0, 1)$ and $M_1: N(\mu, 1)$.

SLIDE 25

Simple example: AIC

The log-likelihood (up to a constant) is

$$\ell = \log \prod_i e^{-(Y_i - \mu)^2/2} = -\sum_i (Y_i - \mu)^2/2$$

$$\mathrm{AIC}_0 = -\sum_i Y_i^2/2 - 0$$

$$\mathrm{AIC}_1 = -\sum_i (Y_i - \bar Y)^2/2 - 2 = -\sum_i Y_i^2/2 + \frac{n}{2}\bar Y^2 - 2$$

We choose model 1 if $\mathrm{AIC}_1 > \mathrm{AIC}_0$, i.e.,

$$-\sum_i Y_i^2/2 + \frac{n}{2}\bar Y^2 - 2 > -\sum_i Y_i^2/2 \quad\Longleftrightarrow\quad |\bar Y| > \sqrt{4/n}$$

SLIDE 26

Simple example: BIC

$$\mathrm{BIC}_0 = -\sum_i Y_i^2/2 - \frac{0}{2}\log n = -\sum_i Y_i^2/2$$

$$\mathrm{BIC}_1 = -\sum_i (Y_i - \bar Y)^2/2 - \frac{1}{2}\log n = -\sum_i Y_i^2/2 + \frac{n}{2}\bar Y^2 - \frac{1}{2}\log n$$

We choose model 1 if $\mathrm{BIC}_1 > \mathrm{BIC}_0$, i.e.,

$$-\sum_i Y_i^2/2 + \frac{n}{2}\bar Y^2 - \frac{1}{2}\log n > -\sum_i Y_i^2/2 \quad\Longleftrightarrow\quad |\bar Y| > \sqrt{\frac{\log n}{n}}$$
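A quick numerical check of the two decision rules derived above (synthetic data; n and the true mean are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu = 200, 0.15
Y = rng.normal(loc=mu, scale=1.0, size=n)
Ybar = Y.mean()

# AIC: prefer M1 iff AIC1 > AIC0, which is equivalent to |Ybar| > sqrt(4/n)
aic0 = -np.sum(Y**2) / 2
aic1 = -np.sum((Y - Ybar)**2) / 2 - 2
print((aic1 > aic0) == (abs(Ybar) > np.sqrt(4 / n)))          # the two tests agree

# BIC: prefer M1 iff BIC1 > BIC0, which is equivalent to |Ybar| > sqrt(log(n)/n)
bic0 = -np.sum(Y**2) / 2
bic1 = -np.sum((Y - Ybar)**2) / 2 - 0.5 * np.log(n)
print((bic1 > bic0) == (abs(Ybar) > np.sqrt(np.log(n) / n)))  # the two tests agree
```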

SLIDE 27

Comparison

Generally speaking, AIC (and cross-validation) finds the most predictive model, while BIC finds the true model with high probability; that is, BIC assumes that one of the models is true and tries to find the model most likely to be true in the Bayesian sense.
