SLIDE 1

Greedy selection on the Lasso solution grid

Piotr Pokarowski

Faculty of Mathematics, Informatics and Mechanics, University of Warsaw

1 Dec 2016

SLIDE 2

Penalized Loss Minimization Framework

Data = {(y₁, x₁·ᵀ), . . . , (yₙ, xₙ·ᵀ)} = Train ⊕ Valid ⊕ Test

Fitting: β̂(λ) = argmin_β {loss(β, Train) + penalty(β, λ)}

Selection: λ̂ = argmin_λ err(β̂(λ), Valid)

Assessment: êrr = err(β̂(λ̂), Test)
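As a concrete illustration (not from the slides), here is a minimal Python sketch of this framework, with scikit-learn's Lasso standing in for the generic penalized fitter, mean squared error as err, and hypothetical data and λ grid:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Hypothetical data: 5 true signals among 50 predictors.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 50))
beta_star = np.zeros(50)
beta_star[:5] = 2.0
y = X @ beta_star + rng.standard_normal(300)

# Data = Train (+) Valid (+) Test
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_va, X_te, y_va, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

lambdas = np.logspace(-2, 0, 20)

# Fitting: beta(lambda) = argmin {loss on Train + penalty}
fits = {lam: Lasso(alpha=lam).fit(X_tr, y_tr) for lam in lambdas}

# Selection: lambda-hat minimizes err(beta(lambda), Valid)
mse = lambda model, X_, y_: np.mean((y_ - model.predict(X_)) ** 2)
lam_hat = min(lambdas, key=lambda lam: mse(fits[lam], X_va, y_va))

# Assessment: err(beta(lambda-hat), Test)
print(lam_hat, mse(fits[lam_hat], X_te, y_te))
```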
SLIDE 3

Loss and Penalty

The loss is a relaxation of the prediction error, typically a tempered (partial, scaled, etc.) negative log-likelihood:

loss(β, Train) = Σᵢ₌₁ⁿ L(yᵢ, f(xᵢ·, β))

The penalty acts coordinatewise on a model β = (β₁, . . . , βₚ)ᵀ:

penalty(β, λ) = Σⱼ₌₁ᵖ Pλ(|βⱼ|)

[Figure: examples of penalty functions Pλ(t), from the ℓ₀ penalty λ·1(t > 0) to the ridge penalty λt².]
SLIDE 4

Loss Functions ⊃ linear, logistic models

For i = 1, . . . , n we have xᵢ· ∈ ℝᵖ and y = (y₁, . . . , yₙ)ᵀ, X = [x₁·, . . . , xₙ·]ᵀ = [x·₁, . . . , x·ₚ]. For simplicity of presentation yᵀ1ₙ = 0 and the columns are standardized so that x·ⱼᵀ1ₙ = 0 and x·ⱼᵀx·ⱼ = 1 for j = 1, . . . , p.

We consider a generalized linear model with a canonical link function: g(Eyᵢ) = xᵢ·ᵀβ*. Let εᵢ = (yᵢ − Eyᵢ)/sd(yᵢ). We assume that ε = (ε₁, . . . , εₙ)ᵀ ∈ ℝⁿ is a vector of iid zero-mean errors having a subgaussian distribution with constant σ, that is, E exp(uεᵢ) ≤ exp(σ²u²/2) for u ∈ ℝ.
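The standardization assumed above can be done with a small NumPy helper (a sketch; the function name is mine):

```python
import numpy as np

def standardize(y, X):
    """Center y; center each column of X and rescale so x_j' 1 = 0 and x_j' x_j = 1."""
    y = y - y.mean()
    Xc = X - X.mean(axis=0)
    return y, Xc / np.linalg.norm(Xc, axis=0)
```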

SLIDE 5

Penalty Functions - Classics

• A. Hoerl and R. Kennard, Technometrics 1970:
  Ridge Regression (RR) ≡ ℓ₂ penalty: Pλ(t) = λt²

• R. Nishii, Ann. Stat. 1984:
  Generalized Information Criterion (GIC) ≡ ℓ₀ penalty: Pλ(t) = λ·1(t > 0)

• R. Tibshirani, JRSS-B 1996:
  Lasso ≡ ℓ₁ penalty: Pλ(t) = λt
SLIDE 6

Penalty Functions - New Propositions

• H. Zou and T. Hastie, JRSS-B 2005 (1750 cit.):
  Elastic Net (EN): Pλ₁,λ₂(t) = λ₁t + (λ₂/2)t², equivalently Pλ,α(t) = λ(αt + ((1 − α)/2)t²)

• C.-H. Zhang, Ann. Stat. 2010 (270 cit.):
  Minimax Concave Penalty (MCP): Pλ,γ(t) = λ(t ∧ γλ)(1 − (t ∧ γλ)/(2γλ))

[Figure: comparison of the GIC, MCP, Lasso, EN and RR penalties.]
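For reference, the five penalties written as NumPy functions of t = |βⱼ| ≥ 0 (a sketch for plotting or comparison; the function names are mine):

```python
import numpy as np

def ridge(t, lam):               # RR: lam * t^2
    return lam * t**2

def l0(t, lam):                  # GIC: lam * 1(t > 0)
    return lam * (t > 0)

def lasso(t, lam):               # Lasso: lam * t
    return lam * t

def elastic_net(t, lam, alpha):  # EN: lam * (alpha*t + (1 - alpha)/2 * t^2)
    return lam * (alpha * t + (1 - alpha) / 2 * t**2)

def mcp(t, lam, gamma):          # MCP: lam*(t ^ gl)*(1 - (t ^ gl)/(2*gl)), gl = gamma*lam
    u = np.minimum(t, gamma * lam)
    return lam * u * (1 - u / (2 * gamma * lam))
```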
SLIDE 7

[Figure: the Elastic Net penalty Pλ,α(t) (left) and the corresponding EN thresholding functions (right), for α = 0.1, 0.5, 0.9.]
SLIDE 8

[Figure: the Minimax Concave Penalty Pλ,γ(t) (left) and the corresponding MCP thresholding functions (right), for γ = 25, 2.5, 1.1.]
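The curves in these two figures are the univariate thresholding operators β̂(z) = argmin_b {½(z − b)² + P(b)} for a single standardized predictor. A Python sketch using the standard closed forms (assuming γ > 1 for MCP; the function names are mine):

```python
import numpy as np

def en_threshold(z, lam, alpha):
    """Elastic Net: soft-threshold at lam*alpha, then shrink by 1 + lam*(1 - alpha)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam * alpha, 0) / (1 + lam * (1 - alpha))

def mcp_threshold(z, lam, gamma):
    """MCP 'firm' thresholding: identity (unbiased) for |z| >= gamma*lam."""
    soft = np.sign(z) * np.maximum(np.abs(z) - lam, 0)
    return np.where(np.abs(z) <= gamma * lam, soft / (1 - 1 / gamma), z)

z = np.linspace(-4, 4, 401)
print(en_threshold(z, lam=1.0, alpha=0.5).max())
print(mcp_threshold(z, lam=1.0, gamma=2.5).max())
```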
SLIDE 9

Algorithm 1: GIC-thresholded Lasso (SS)

Input: y, X and λ

Screening (Lasso):
• β̂ = argmin_β {ℓ(β) + λ|β|₁};
• order the nonzero coefficients: |β̂_{j₁}| ≥ . . . ≥ |β̂_{jₛ}|, where s = |supp β̂|;
• set 𝒥 = {{j₁}, {j₁, j₂}, . . . , supp β̂};

Selection (GIC):
• T̂ = argmin_{J∈𝒥} {ℓ(β̂ᴹᴸ_J) + λ²|J|};

Output: T̂, β̂SS = β̂ᴹᴸ_{T̂}
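For the linear model, a sketch of Algorithm 1 in Python, with OLS as the ML refit and ℓ taken as half the residual sum of squares (scaling conventions differ between the slides and scikit-learn's Lasso, so λ is to be read loosely; the helper name is mine):

```python
import numpy as np
from sklearn.linear_model import Lasso

def ss(y, X, lam):
    # Screening: Lasso at penalty lam (assumes a nonempty support).
    beta = Lasso(alpha=lam).fit(X, y).coef_
    supp = np.flatnonzero(beta)
    order = supp[np.argsort(-np.abs(beta[supp]))]   # |b_{j1}| >= |b_{j2}| >= ...
    # Nested family: {j1}, {j1, j2}, ..., supp(beta)
    family = [order[:s] for s in range(1, order.size + 1)]

    # Selection: GIC = loss of the ML (here OLS) refit + lam^2 * |J|
    def gic(J):
        b, *_ = np.linalg.lstsq(X[:, J], y, rcond=None)
        r = y - X[:, J] @ b
        return 0.5 * r @ r + lam**2 * len(J)

    T = min(family, key=gic)
    b_ml, *_ = np.linalg.lstsq(X[:, T], y, rcond=None)
    return T, b_ml
```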

SLIDE 10

Algorithm 2: Greedy Selection on the Lasso Solution Grid (SOSnet)

Input: y, X and (o, λ ≤ λ₁ < . . . < λₘ)

Screening (Lasso):
for k = 1 to m do
• β̂⁽ᵏ⁾ = argmin_β {ℓ(β) + λₖ|β|₁};
• order the nonzero coefficients: |β̂⁽ᵏ⁾_{j₁}| ≥ . . . ≥ |β̂⁽ᵏ⁾_{j_{sₖ}}|, where sₖ = |supp β̂⁽ᵏ⁾|;

Ordering (squared Wald tests):
for l = 1 to o do
• set J = {j₁, j₂, . . . , j_{sₖₗ}}, where sₖₗ = ⌊sₖ·l/o⌋;
• compute β̂ᴹᴸ_J;
• sort the predictors in J according to squared Wald tests: w²_{i₁} ≥ w²_{i₂} ≥ . . . ≥ w²_{i_{sₖₗ}};
• set 𝒥ₖₗ = {{i₁}, {i₁, i₂}, . . . , {i₁, i₂, . . . , i_{sₖₗ}}}
end for; end for;

Selection (GIC):
• 𝒥 = ⋃ₖ₌₁ᵐ ⋃ₗ₌₁ᵒ 𝒥ₖₗ;
• T̂ = argmin_{J∈𝒥} {ℓ(β̂ᴹᴸ_J) + λ²|J|};

Output: T̂, β̂SOSnet = β̂ᴹᴸ_{T̂}
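A corresponding sketch of Algorithm 2 for the linear model, with the squared Wald statistics computed from the OLS refit on each prefix J (again a loose treatment of scaling; the helper is hypothetical):

```python
import numpy as np
from sklearn.linear_model import lasso_path

def sosnet(y, X, lam, lambdas, o=2):
    """Sketch of SOSnet: Lasso screening on a grid, Wald ordering, GIC selection."""
    n, _ = X.shape
    _, coefs, _ = lasso_path(X, y, alphas=np.sort(lambdas)[::-1])
    family = set()
    for k in range(coefs.shape[1]):                 # Screening (Lasso)
        beta = coefs[:, k]
        supp = np.flatnonzero(beta)
        if supp.size == 0:
            continue
        order = supp[np.argsort(-np.abs(beta[supp]))]
        for l in range(1, o + 1):                   # Ordering (squared Wald tests)
            skl = (supp.size * l) // o
            if skl == 0:
                continue
            J = order[:skl]
            XJ = X[:, J]
            b, *_ = np.linalg.lstsq(XJ, y, rcond=None)
            r = y - XJ @ b
            sigma2 = (r @ r) / max(n - skl, 1)
            w2 = b**2 / np.diag(sigma2 * np.linalg.pinv(XJ.T @ XJ))
            wald = J[np.argsort(-w2)]
            family.update(tuple(wald[:s]) for s in range(1, skl + 1))

    def gic(J):                                     # Selection (GIC)
        b, *_ = np.linalg.lstsq(X[:, list(J)], y, rcond=None)
        r = y - X[:, list(J)] @ b
        return 0.5 * r @ r + lam**2 * len(J)

    T = min(family, key=gic)
    b_ml, *_ = np.linalg.lstsq(X[:, list(T)], y, rcond=None)
    return list(T), b_ml
```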

SLIDE 11

When does thresholding separate the true model?

[Figure: true coefficients β and Lasso estimates β̂ plotted against predictor indices 1, . . . , 8.]
SLIDE 12

Lasso separation error (1)

The true model is T = supp(β*) = {j ∈ F : β*ⱼ ≠ 0}. Let β*min = minⱼ∈T |β*ⱼ| and t = |T|.

A Bregman divergence: D(β, β*) = ℓ(β) − ℓ(β*) − ℓ̇(β*)ᵀ(β − β*)

A symmetrized Bregman divergence: ∆(β, β*) = D(β, β*) + D(β*, β) = (β − β*)ᵀ(ℓ̇(β) − ℓ̇(β*))
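For example (not on the slide), for the squared-error loss ℓ(β) = ½‖y − Xβ‖² we have ℓ̇(β) = Xᵀ(Xβ − y), hence D(β, β*) = ½‖X(β − β*)‖² and ∆(β, β*) = ‖X(β − β*)‖²: in the linear model ∆ is simply the squared prediction distance between β and β*.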

SLIDE 13

Lasso separation error (2)

For a ∈ (0, 1) consider the cone

C_{T,a} = {ν ∈ ℝᵖ : |ν_T̄|₁ ≤ ((1 + a)/(1 − a))|ν_T|₁}.   (1)

A generalized invertibility factor, defined in J. Huang and C.-H. Zhang, JMLR 2012:

ζₐ = inf_{ν∈C_{T,a}} ∆(β* + ν, β*) / (|ν_T|₁|ν|_∞).   (2)
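Continuing the squared-error example, ∆(β* + ν, β*) = ‖Xν‖², so ζₐ = inf_{ν∈C_{T,a}} ‖Xν‖²/(|ν_T|₁|ν|_∞) is a restricted-eigenvalue-type constant: invertibility of XᵀX is required only on the cone C_{T,a}, which is what makes such bounds usable when p > n.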

SLIDE 14

Lasso separation error (3)

On the event Aₐ = {|ℓ̇(β*)|_∞ ≤ aλ} we have the so-called oracle inequality

|β̂ − β*|_∞ ≤ (1 + a)λζₐ⁻¹.

If moreover λ < (1 + a)⁻¹ζₐβ*min/2, the right-hand side is below β*min/2, so on Aₐ every j ∈ T satisfies |β̂ⱼ| > β*min/2 while every j ∉ T satisfies |β̂ⱼ| < β*min/2; ordering the nonzero coefficients then lists the true variables first, and it is easy to check that Aₐ ⊆ {T ∈ 𝒥}. Hence, for λ < (1 + a)⁻¹ζₐβ*min/2, a union bound over the p coordinates of ℓ̇(β*) and the subgaussian tail bound give

P(T ∉ 𝒥) ≤ 2p exp(−a²λ²/(2σ²)).

SLIDE 15

GIC error (1)

Let W* = diag(sd(y₁), . . . , sd(yₙ)) and X* = W*^{1/2}X. Let X*_J be the submatrix of X* with columns having indices in J, and let H*_J be the orthogonal projection onto the columns of X*_J. Scaled Kullback–Leibler distances between T and its submodels are defined in X. Shen et al., JASA 2012:

δₖ = min_{J⊂T, |T\J|=k} ‖(I − H*_J)X*β*‖²,

cₖ = min_i min_{β_T : ‖X*_Tβ_T − X*β*‖ ≤ δₖ} ℓ̈(x_{iT}ᵀβ_T) / ℓ̈(xᵢ·ᵀβ*),

δ̃ = minₖ cₖ²δₖ/k.

SLIDE 16

GIC error (2)

If tσ² < λ² < δ̃/(2(1 + a)²), then

P(T ∈ 𝒥, T̂ ⊊ T) ≤ exp(−a²λ²/(2σ²)).

If (σ²/a²)·min(tcₜ⁻¹, log(3p)) < λ², then

P(T ∈ 𝒥, T̂ ⊋ T) ≤ 3p exp(−a²λ²/(4σ²)).