SLIDE 1

Greedy selection on the Lasso solution grid

Piotr Pokarowski

Faculty of Mathematics, Informatics and Mechanics, University of Warsaw

1 Dec 2016

SLIDE 2

Penalized Loss Minimization Framework

Data = {(y₁, x₁·ᵀ), . . . , (yₙ, xₙ·ᵀ)} = Train ⊕ Valid ⊕ Test

Fitting: β̂(λ) = argmin_β {loss(β, Train) + penalty(β, λ)}

Selection: λ̂ = argmin_λ err(β̂(λ), Valid)

Assessment: êrr = err(β̂(λ̂), Test)
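As a concrete illustration (not from the slides), here is a minimal Python sketch of this framework, with scikit-learn's Lasso standing in for the generic penalized fitter, mean squared error as err, and hypothetical data and λ grid:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Hypothetical data: 5 true signals among 50 predictors.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 50))
beta_star = np.zeros(50)
beta_star[:5] = 2.0
y = X @ beta_star + rng.standard_normal(300)

# Data = Train (+) Valid (+) Test
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_va, X_te, y_va, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

lambdas = np.logspace(-2, 0, 20)

# Fitting: beta(lambda) = argmin {loss on Train + penalty}
fits = {lam: Lasso(alpha=lam).fit(X_tr, y_tr) for lam in lambdas}

# Selection: lambda-hat minimizes err(beta(lambda), Valid)
mse = lambda model, X_, y_: np.mean((y_ - model.predict(X_)) ** 2)
lam_hat = min(lambdas, key=lambda lam: mse(fits[lam], X_va, y_va))

# Assessment: err(beta(lambda-hat), Test)
print(lam_hat, mse(fits[lam_hat], X_te, y_te))
```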
SLIDE 3

Loss and Penalty

The loss is a relaxation of the prediction error, typically a tempered (partial, scaled, etc.) negative log-likelihood:

loss(β, Train) = Σᵢ₌₁ⁿ L(yᵢ, f(xᵢ·, β))

The penalty acts coordinatewise on a model β = (β₁, . . . , βₚ)ᵀ:

penalty(β, λ) = Σⱼ₌₁ᵖ Pλ(|βⱼ|)

[Figure: examples of penalty functions Pλ(t), from the ℓ₀ penalty λ·1(t > 0) to the ridge penalty λt².]
SLIDE 4

Loss Functions ⊃ linear, logistic models

For i = 1, . . . , n we have xᵢ· ∈ ℝᵖ and y = (y₁, . . . , yₙ)ᵀ, X = [x₁·, . . . , xₙ·]ᵀ = [x·₁, . . . , x·ₚ]. For simplicity of presentation yᵀ1ₙ = 0 and the columns are standardized so that x·ⱼᵀ1ₙ = 0 and x·ⱼᵀx·ⱼ = 1 for j = 1, . . . , p.

We consider a generalized linear model with a canonical link function: g(Eyᵢ) = xᵢ·ᵀβ*. Let εᵢ = (yᵢ − Eyᵢ)/sd(yᵢ). We assume that ε = (ε₁, . . . , εₙ)ᵀ ∈ ℝⁿ is a vector of iid zero-mean errors having a subgaussian distribution with constant σ, that is, E exp(uεᵢ) ≤ exp(σ²u²/2) for u ∈ ℝ.
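The standardization assumed above can be done with a small NumPy helper (a sketch; the function name is mine):

```python
import numpy as np

def standardize(y, X):
    """Center y; center each column of X and rescale so x_j' 1 = 0 and x_j' x_j = 1."""
    y = y - y.mean()
    Xc = X - X.mean(axis=0)
    return y, Xc / np.linalg.norm(Xc, axis=0)
```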

SLIDE 5

Penalty Functions - Classics

• A. Hoerl and R. Kennard, Technometrics 1970:
  Ridge Regression (RR) ≡ ℓ₂ penalty: Pλ(t) = λt²

• R. Nishii, Ann. Stat. 1984:
  Generalized Information Criterion (GIC) ≡ ℓ₀ penalty: Pλ(t) = λ·1(t > 0)

• R. Tibshirani, JRSS-B 1996:
  Lasso ≡ ℓ₁ penalty: Pλ(t) = λt
SLIDE 6

Penalty Functions - New Propositions

• H. Zou and T. Hastie, JRSS-B 2005 (1750 cit.):
  Elastic Net (EN): Pλ₁,λ₂(t) = λ₁t + (λ₂/2)t², equivalently Pλ,α(t) = λ(αt + ((1 − α)/2)t²)

• C.-H. Zhang, Ann. Stat. 2010 (270 cit.):
  Minimax Concave Penalty (MCP): Pλ,γ(t) = λ(t ∧ γλ)(1 − (t ∧ γλ)/(2γλ))

[Figure: comparison of the GIC, MCP, Lasso, EN and RR penalties.]
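For reference, the five penalties written as NumPy functions of t = |βⱼ| ≥ 0 (a sketch for plotting or comparison; the function names are mine):

```python
import numpy as np

def ridge(t, lam):               # RR: lam * t^2
    return lam * t**2

def l0(t, lam):                  # GIC: lam * 1(t > 0)
    return lam * (t > 0)

def lasso(t, lam):               # Lasso: lam * t
    return lam * t

def elastic_net(t, lam, alpha):  # EN: lam * (alpha*t + (1 - alpha)/2 * t^2)
    return lam * (alpha * t + (1 - alpha) / 2 * t**2)

def mcp(t, lam, gamma):          # MCP: lam*(t ^ gl)*(1 - (t ^ gl)/(2*gl)), gl = gamma*lam
    u = np.minimum(t, gamma * lam)
    return lam * u * (1 - u / (2 * gamma * lam))
```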
SLIDE 7

[Figure: the Elastic Net penalty Pλ,α(t) (left) and the corresponding EN thresholding functions (right), for α = 0.1, 0.5, 0.9.]
SLIDE 8

[Figure: the Minimax Concave Penalty Pλ,γ(t) (left) and the corresponding MCP thresholding functions (right), for γ = 25, 2.5, 1.1.]
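The curves in these two figures are the univariate thresholding operators β̂(z) = argmin_b {½(z − b)² + P(b)} for a single standardized predictor. A Python sketch using the standard closed forms (assuming γ > 1 for MCP; the function names are mine):

```python
import numpy as np

def en_threshold(z, lam, alpha):
    """Elastic Net: soft-threshold at lam*alpha, then shrink by 1 + lam*(1 - alpha)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam * alpha, 0) / (1 + lam * (1 - alpha))

def mcp_threshold(z, lam, gamma):
    """MCP 'firm' thresholding: identity (unbiased) for |z| >= gamma*lam."""
    soft = np.sign(z) * np.maximum(np.abs(z) - lam, 0)
    return np.where(np.abs(z) <= gamma * lam, soft / (1 - 1 / gamma), z)

z = np.linspace(-4, 4, 401)
print(en_threshold(z, lam=1.0, alpha=0.5).max())
print(mcp_threshold(z, lam=1.0, gamma=2.5).max())
```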
SLIDE 9

Algorithm 1: GIC-thresholded Lasso (SS)

Input: y, X and λ

Screening (Lasso):
• β̂ = argmin_β {ℓ(β) + λ|β|₁};
• order the nonzero coefficients: |β̂_{j₁}| ≥ . . . ≥ |β̂_{jₛ}|, where s = |supp β̂|;
• set 𝒥 = {{j₁}, {j₁, j₂}, . . . , supp β̂};

Selection (GIC):
• T̂ = argmin_{J∈𝒥} {ℓ(β̂ᴹᴸ_J) + λ²|J|};

Output: T̂, β̂SS = β̂ᴹᴸ_{T̂}
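For the linear model, a sketch of Algorithm 1 in Python, with OLS as the ML refit and ℓ taken as half the residual sum of squares (scaling conventions differ between the slides and scikit-learn's Lasso, so λ is to be read loosely; the helper name is mine):

```python
import numpy as np
from sklearn.linear_model import Lasso

def ss(y, X, lam):
    # Screening: Lasso at penalty lam (assumes a nonempty support).
    beta = Lasso(alpha=lam).fit(X, y).coef_
    supp = np.flatnonzero(beta)
    order = supp[np.argsort(-np.abs(beta[supp]))]   # |b_{j1}| >= |b_{j2}| >= ...
    # Nested family: {j1}, {j1, j2}, ..., supp(beta)
    family = [order[:s] for s in range(1, order.size + 1)]

    # Selection: GIC = loss of the ML (here OLS) refit + lam^2 * |J|
    def gic(J):
        b, *_ = np.linalg.lstsq(X[:, J], y, rcond=None)
        r = y - X[:, J] @ b
        return 0.5 * r @ r + lam**2 * len(J)

    T = min(family, key=gic)
    b_ml, *_ = np.linalg.lstsq(X[:, T], y, rcond=None)
    return T, b_ml
```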

SLIDE 10

Algorithm 2: Greedy Selection on the Lasso Solution Grid (SOSnet)

Input: y, X and (o, λ ≤ λ₁ < . . . < λₘ)

Screening (Lasso):
for k = 1 to m do
• β̂⁽ᵏ⁾ = argmin_β {ℓ(β) + λₖ|β|₁};
• order the nonzero coefficients: |β̂⁽ᵏ⁾_{j₁}| ≥ . . . ≥ |β̂⁽ᵏ⁾_{j_{sₖ}}|, where sₖ = |supp β̂⁽ᵏ⁾|;

Ordering (squared Wald tests):
for l = 1 to o do
• set J = {j₁, j₂, . . . , j_{sₖₗ}}, where sₖₗ = ⌊sₖ·l/o⌋;
• compute β̂ᴹᴸ_J;
• sort the predictors in J according to squared Wald tests: w²_{i₁} ≥ w²_{i₂} ≥ . . . ≥ w²_{i_{sₖₗ}};
• set 𝒥ₖₗ = {{i₁}, {i₁, i₂}, . . . , {i₁, i₂, . . . , i_{sₖₗ}}}
end for; end for;

Selection (GIC):
• 𝒥 = ⋃ₖ₌₁ᵐ ⋃ₗ₌₁ᵒ 𝒥ₖₗ;
• T̂ = argmin_{J∈𝒥} {ℓ(β̂ᴹᴸ_J) + λ²|J|};

Output: T̂, β̂SOSnet = β̂ᴹᴸ_{T̂}
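A corresponding sketch of Algorithm 2 for the linear model, with the squared Wald statistics computed from the OLS refit on each prefix J (again a loose treatment of scaling; the helper is hypothetical):

```python
import numpy as np
from sklearn.linear_model import lasso_path

def sosnet(y, X, lam, lambdas, o=2):
    """Sketch of SOSnet: Lasso screening on a grid, Wald ordering, GIC selection."""
    n, _ = X.shape
    _, coefs, _ = lasso_path(X, y, alphas=np.sort(lambdas)[::-1])
    family = set()
    for k in range(coefs.shape[1]):                 # Screening (Lasso)
        beta = coefs[:, k]
        supp = np.flatnonzero(beta)
        if supp.size == 0:
            continue
        order = supp[np.argsort(-np.abs(beta[supp]))]
        for l in range(1, o + 1):                   # Ordering (squared Wald tests)
            skl = (supp.size * l) // o
            if skl == 0:
                continue
            J = order[:skl]
            XJ = X[:, J]
            b, *_ = np.linalg.lstsq(XJ, y, rcond=None)
            r = y - XJ @ b
            sigma2 = (r @ r) / max(n - skl, 1)
            w2 = b**2 / np.diag(sigma2 * np.linalg.pinv(XJ.T @ XJ))
            wald = J[np.argsort(-w2)]
            family.update(tuple(wald[:s]) for s in range(1, skl + 1))

    def gic(J):                                     # Selection (GIC)
        b, *_ = np.linalg.lstsq(X[:, list(J)], y, rcond=None)
        r = y - X[:, list(J)] @ b
        return 0.5 * r @ r + lam**2 * len(J)

    T = min(family, key=gic)
    b_ml, *_ = np.linalg.lstsq(X[:, list(T)], y, rcond=None)
    return list(T), b_ml
```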

SLIDE 11

When does thresholding separate the true model?

[Figure: true coefficients β and Lasso estimates β̂ plotted against predictor indices 1, . . . , 8.]
SLIDE 12

Lasso separation error (1)

The true model is T = supp(β*) = {j ∈ F : β*ⱼ ≠ 0}. Let β*min = minⱼ∈T |β*ⱼ| and t = |T|.

A Bregman divergence: D(β, β*) = ℓ(β) − ℓ(β*) − ℓ̇(β*)ᵀ(β − β*)

A symmetrized Bregman divergence: ∆(β, β*) = D(β, β*) + D(β*, β) = (β − β*)ᵀ(ℓ̇(β) − ℓ̇(β*))
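For example (not on the slide), for the squared-error loss ℓ(β) = ½‖y − Xβ‖² we have ℓ̇(β) = Xᵀ(Xβ − y), hence D(β, β*) = ½‖X(β − β*)‖² and ∆(β, β*) = ‖X(β − β*)‖²: in the linear model ∆ is simply the squared prediction distance between β and β*.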

SLIDE 13

Lasso separation error (2)

For a ∈ (0, 1) consider the cone

C_{T,a} = {ν ∈ ℝᵖ : |ν_T̄|₁ ≤ ((1 + a)/(1 − a))|ν_T|₁}.   (1)

A generalized invertibility factor, defined in J. Huang and C.-H. Zhang, JMLR 2012:

ζₐ = inf_{ν∈C_{T,a}} ∆(β* + ν, β*) / (|ν_T|₁|ν|_∞).   (2)
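Continuing the squared-error example, ∆(β* + ν, β*) = ‖Xν‖², so ζₐ = inf_{ν∈C_{T,a}} ‖Xν‖²/(|ν_T|₁|ν|_∞) is a restricted-eigenvalue-type constant: invertibility of XᵀX is required only on the cone C_{T,a}, which is what makes such bounds usable when p > n.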

SLIDE 14

Lasso separation error (3)

On the event Aₐ = {|ℓ̇(β*)|_∞ ≤ aλ} we have the so-called oracle inequality

|β̂ − β*|_∞ ≤ (1 + a)λζₐ⁻¹.

If moreover λ < (1 + a)⁻¹ζₐβ*min/2, the right-hand side is below β*min/2, so on Aₐ every j ∈ T satisfies |β̂ⱼ| > β*min/2 while every j ∉ T satisfies |β̂ⱼ| < β*min/2; ordering the nonzero coefficients then lists the true variables first, and it is easy to check that Aₐ ⊆ {T ∈ 𝒥}. Hence, for λ < (1 + a)⁻¹ζₐβ*min/2, a union bound over the p coordinates of ℓ̇(β*) and the subgaussian tail bound give

P(T ∉ 𝒥) ≤ 2p exp(−a²λ²/(2σ²)).

SLIDE 15

GIC error (1)

Let W* = diag(sd(y₁), . . . , sd(yₙ)) and X* = W*^{1/2}X. Let X*_J be the submatrix of X* with columns having indices in J, and let H*_J be the orthogonal projection onto the columns of X*_J. Scaled Kullback–Leibler distances between T and its submodels are defined in X. Shen et al., JASA 2012:

δₖ = min_{J⊂T, |T\J|=k} ‖(I − H*_J)X*β*‖²,

cₖ = min_i min_{β_T : ‖X*_Tβ_T − X*β*‖ ≤ δₖ} ℓ̈(x_{iT}ᵀβ_T) / ℓ̈(xᵢ·ᵀβ*),

δ̃ = minₖ cₖ²δₖ/k.

SLIDE 16

GIC error (2)

If tσ² < λ² < δ̃/(2(1 + a)²), then

P(T ∈ 𝒥, T̂ ⊊ T) ≤ exp(−a²λ²/(2σ²)).

If (σ²/a²)·min(tcₜ⁻¹, log(3p)) < λ², then

P(T ∈ 𝒥, T̂ ⊋ T) ≤ 3p exp(−a²λ²/(4σ²)).