SLIDE 1

Pathwise Coordinate Optimization for Nonconvex Sparse Learning

Tuo Zhao
http://www.princeton.edu/~tuoz

Department of Computer Science, Johns Hopkins University

Mar. 25, 2015
SLIDE 2

Collaborators

This is joint work with

  • Prof. Han Liu at Princeton University,
  • Prof. Tong Zhang at Rutgers University and Baidu,
  • Xingguo Li at University of Minnesota.

Manuscript: http://arxiv.org/abs/1412.7477
Software Package: http://cran.r-project.org/web/packages/picasso/

SLIDE 3

Outline

  • Background
  • Pathwise Coordinate Optimization
  • Computational and Statistical Theories
  • Numerical Simulations
  • Conclusions

SLIDE 4

Background

SLIDE 5

Regularized M-Estimation

Let β∗ denote the parameter to be estimated. We solve the regularized M-estimation problem

  min_{β ∈ ℝ^d} Fλ(β), where Fλ(β) = L(β) + Rλ(β),

L(β) is a smooth loss function, and Rλ(β) is a regularization function with a tuning parameter λ.

Examples: Lasso and Logistic Lasso (Tibshirani, 1996), Group Lasso (Yuan and Lin, 2006), Graphical Lasso (Yuan and Lin, 2007; Banerjee et al., 2008; Friedman et al., 2008), ...

SLIDE 6

Regularization Functions

Rλ(β) is coordinate separable:

  Rλ(β) = Σ_{j=1}^d rλ(βj).

Rλ(β) is decomposable:

  Rλ(β) = λ‖β‖₁ + Hλ(β) = Σ_{j=1}^d [λ|βj| + hλ(βj)].

Examples: Smoothly Clipped Absolute Deviation (SCAD, Fan and Li, 2001) and Minimax Concave Penalty (MCP, Zhang, 2010).

SLIDE 7

Regularization Functions

For any γ > 2, SCAD is defined as

  rλ(βj) = λ|βj|                                  if |βj| ≤ λ,
           −(|βj|² − 2λγ|βj| + λ²) / (2(γ − 1))   if λ < |βj| ≤ λγ,
           (γ + 1)λ² / 2                          if |βj| > λγ;

  hλ(βj) = 0                                      if |βj| ≤ λ,
           (2λ|βj| − |βj|² − λ²) / (2(γ − 1))     if λ < |βj| ≤ λγ,
           ((γ + 1)λ² − 2λ|βj|) / 2               if |βj| > λγ.

SLIDE 8

Regularization Functions

For any γ > 1, MCP is defined as

  rλ(βj) = λ(|βj| − |βj|² / (2λγ))   if |βj| ≤ λγ,
           λ²γ / 2                   if |βj| > λγ;

  hλ(βj) = −|βj|² / (2γ)             if |βj| ≤ λγ,
           (λ²γ − 2λ|βj|) / 2        if |βj| > λγ.

(A code sketch of both penalties follows below.)
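To make the two penalty definitions concrete, here is a minimal NumPy sketch of rλ and the induced concave part hλ = rλ − λ|·| for SCAD and MCP; the function names and the vectorized layout are my own, the formulas are the ones on slides 7 and 8.

```python
import numpy as np

def scad_penalty(beta, lam, gamma):
    """Elementwise SCAD penalty r_lambda (requires gamma > 2)."""
    b = np.abs(np.asarray(beta, dtype=float))
    mid = (2 * gamma * lam * b - b**2 - lam**2) / (2 * (gamma - 1))
    return np.where(b <= lam, lam * b,
                    np.where(b <= gamma * lam, mid, (gamma + 1) * lam**2 / 2))

def mcp_penalty(beta, lam, gamma):
    """Elementwise MCP penalty r_lambda (requires gamma > 1)."""
    b = np.abs(np.asarray(beta, dtype=float))
    return np.where(b <= gamma * lam,
                    lam * (b - b**2 / (2 * lam * gamma)),
                    gamma * lam**2 / 2)

def concave_part(penalty, beta, lam, gamma):
    """h_lambda from the decomposition r_lambda(b) = lam * |b| + h_lambda(b)."""
    b = np.asarray(beta, dtype=float)
    return penalty(b, lam, gamma) - lam * np.abs(b)
```

Plotting these functions for λ = 1 and γ = 2.01 reproduces the curves in Figure 1 on the next slide.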

SLIDE 9

Regularization Functions

[Figure 1: The penalties rλ(θj) (left) and their concave parts hλ(θj) (right) for SCAD, MCP, and ℓ₁, with λ = 1 and γ = 2.01.]

SLIDE 10

Loss Functions

X ∈ ℝ^{n×d} is the design matrix, y ∈ ℝ^n is the response vector.

Least Squares Loss:

  L(β) = (1 / 2n) ‖y − Xβ‖₂².

Logistic Loss:

  L(β) = (1/n) Σ_{i=1}^n [ log(1 + exp(X_{i∗}ᵀβ)) − y_i X_{i∗}ᵀβ ].

Others: Huber Loss, Multi-category Logistic Loss, ... (a code sketch of the two losses above follows below).
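A short sketch of both losses and their gradients, which the coordinate updates later rely on; the gradients follow by direct differentiation, and the function names are mine.

```python
import numpy as np

def least_squares_loss(beta, X, y):
    """L(beta) = ||y - X beta||_2^2 / (2n)."""
    n = X.shape[0]
    r = y - X @ beta
    return r @ r / (2 * n)

def least_squares_grad(beta, X, y):
    n = X.shape[0]
    return -X.T @ (y - X @ beta) / n

def logistic_loss(beta, X, y):
    """L(beta) = mean_i [ log(1 + exp(x_i' beta)) - y_i x_i' beta ]."""
    z = X @ beta
    return np.mean(np.logaddexp(0.0, z) - y * z)   # stable log(1 + e^z)

def logistic_grad(beta, X, y):
    n = X.shape[0]
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))          # sigmoid(x_i' beta)
    return X.T @ (p - y) / n
```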

SLIDE 11

Reformulation

We rewrite the regularized M-estimation problem as

  min_{β ∈ ℝ^d} Fλ(β), where Fλ(β) = L̃λ(β) + λ‖β‖₁.

  • L̃λ(β) = L(β) + Hλ(β) is smooth but nonconvex.
  • λ‖β‖₁ is nonsmooth but convex.

Remark: This decomposition is amenable to theoretical analysis.

SLIDE 12

Randomized Coordinate Descent Algorithm

At the t-th iteration, we randomly select a coordinate j from the d coordinates. We then take β^{(t+1)}_{\j} ← β^{(t)}_{\j} and update coordinate j by one of the following rules.

Exact Coordinate Minimization (Fu, 1998):

  β_j^{(t+1)} ← argmin_{βj} L̃λ(βj; β^{(t)}_{\j}) + λ|βj|.

Inexact Coordinate Minimization (Shalev-Shwartz, 2011):

  β_j^{(t+1)} ← argmin_{βj} (βj − β_j^{(t)}) ∇_j L̃λ(β^{(t)}) + (L/2)(βj − β_j^{(t)})² + λ|βj|,

where L is the step size parameter. The inexact update has the closed form shown in the sketch below.
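The inexact coordinate minimization above has a closed-form solution: soft thresholding applied to a coordinate gradient step. A minimal sketch, with helper names of my own choosing (grad_j is assumed to be ∇_j L̃λ(β^{(t)}) supplied by the caller):

```python
import numpy as np

def soft_threshold(u, t):
    """S_t(u) = sign(u) * max(|u| - t, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def inexact_coordinate_update(beta, j, grad_j, lam, L):
    """Minimize (b - beta_j) * grad_j + (L/2)(b - beta_j)^2 + lam * |b| over b.

    The minimizer is soft thresholding of a coordinate gradient step:
    b* = S_{lam/L}(beta_j - grad_j / L).
    """
    beta = beta.copy()
    beta[j] = soft_threshold(beta[j] - grad_j / L, lam / L)
    return beta
```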

SLIDE 13

Examples

Sparse Linear Regression + MCP:

  T_{j,λ}(β^{(t)}) = β̃_j^{(t+1)}                   if |β̃_j^{(t+1)}| ≥ γλ,
                     Sλ(β̃_j^{(t+1)}) / (1 − 1/γ)   if |β̃_j^{(t+1)}| < γλ,

where β̃_j^{(t+1)} = X_{∗j}ᵀ(y − X_{∗\j} β^{(t)}_{\j}) / n and Sλ(u) = sign(u) · max{|u| − λ, 0} is the soft thresholding operator (see the sketch below).

Sparse Logistic Regression + MCP:

  T_{j,λ}(β^{(t)}) = S_{λ/L}(β_j^{(t)} − ∇_j L̃λ(β^{(t)}) / L).

Remark: Sublinear convergence to local optima, without statistical guarantees (Shalev-Shwartz, 2011).
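A sketch of the MCP thresholding operator and the resulting exact coordinate update for least squares, following the display above. It assumes standardized design columns (X_{∗j}ᵀX_{∗j}/n = 1); the function names are mine.

```python
import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def mcp_threshold(z, lam, gamma):
    """T_{j,lam} for least squares + MCP: rescaled soft thresholding in the
    concave region, and no shrinkage at all once |z| >= gamma * lam."""
    if abs(z) >= gamma * lam:
        return z
    return soft_threshold(z, lam) / (1.0 - 1.0 / gamma)

def coordinate_update_mcp(beta, j, X, y, lam, gamma):
    """Exact coordinate minimization, assuming X[:, j]' X[:, j] / n = 1."""
    n = X.shape[0]
    partial_resid = y - X @ beta + X[:, j] * beta[j]   # leave coordinate j out
    z = X[:, j] @ partial_resid / n                    # tilde beta_j^{(t+1)}
    beta = beta.copy()
    beta[j] = mcp_threshold(z, lam, gamma)
    return beta
```

The absence of shrinkage for |z| ≥ γλ is what makes MCP nearly unbiased on strong signals, in contrast to the ℓ₁ penalty.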

SLIDE 14

Pathwise Coordinate Optimization

SLIDE 15

Pathwise Coordinate Optimization

  • Much faster than other competing algorithms.
  • Very simple implementation.
  • Easily scales to large problems.
  • NO computational analysis in the existing literature.
  • NO statistical guarantee on the obtained estimator.

Our Contribution:

  • The FIRST pathwise coordinate optimization algorithm with both computational and statistical guarantees.
  • The FIRST two-step estimator with both computational and statistical guarantees.

SLIDE 16

Pathwise Coordinate Optimization

(Friedman et al. 2007; Mazumder et al. 2011)

[Figure 2: The pathwise coordinate optimization framework contains 3 nested loops: (I) warm start initialization (outer loop, over the regularization parameters); (II) active set identification (middle loop); (III) active coordinate minimization (inner loop).]

SLIDE 17

Restricted Strong Convexity and Smoothness

Motivation: For any β, β′ ∈ ℝ^d such that |{j | βj ≠ 0 or β′j ≠ 0}| ≤ s, we have

  L̃λ(β′) − L̃λ(β) − (β′ − β)ᵀ∇L̃λ(β) ≥ (C−(s)/2) ‖β′ − β‖₂²,
  L̃λ(β′) − L̃λ(β) − (β′ − β)ᵀ∇L̃λ(β) ≤ (C+(s)/2) ‖β′ − β‖₂²,

where C−(s), C+(s) > 0 are two constants depending on s.

Remark: An algorithm that maintains SPARSE solutions throughout all iterations behaves as if it were minimizing a STRONGLY CONVEX function. Therefore linear convergence can be expected.

SLIDE 18

Warm Start Initialization (Outer Loop)

We choose a sequence of DECREASING regularization parameters {λK}_{K=0}^N:

  λ0 ≥ λ1 ≥ λ2 ≥ ... ≥ λN−1 ≥ λN.

The algorithm yields a sequence of output solutions {β̂^{K}}_{K=0}^N from sparse to dense,

  β̂^{K} ← argmin_β L̃_{λK}(β) + λK ‖β‖₁.

SLIDE 19

Warm Start Initialization (Outer Loop)

We choose λ0 = ‖∇L(0)‖∞; then

  min_{ξ ∈ ∂‖0‖₁} ‖∇L(0) + ∇H_{λ0}(0) + λ0ξ‖∞ = 0 and β̂^{0} = 0.

The regularization sequence {λK}_{K=0}^N is geometrically decreasing: λK = ηλK−1 with η ∈ (0, 1).

When solving the optimization problem with λK, we use β̂^{K−1} as the INITIALIZATION (a sketch of the outer loop follows below).
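A minimal sketch of the outer loop under these choices. solve_fixed_lambda stands in for the middle and inner loops at a fixed λK; it is a placeholder of mine, not part of the talk.

```python
import numpy as np

def lambda_path(grad_at_zero, N, eta):
    """lam_0 = ||grad L(0)||_inf, then lam_K = eta * lam_{K-1}, K = 1, ..., N."""
    lam0 = np.max(np.abs(grad_at_zero))
    return lam0 * eta ** np.arange(N + 1)

def pathwise_solve(grad_at_zero, N, eta, solve_fixed_lambda):
    """Outer loop: beta_hat^{0} = 0 at lam_0; each later problem is
    warm-started from the previous output solution."""
    lams = lambda_path(grad_at_zero, N, eta)
    beta = np.zeros_like(grad_at_zero, dtype=float)    # beta_hat^{0} = 0
    path = [beta]
    for lam in lams[1:]:
        beta = solve_fixed_lambda(beta, lam)           # warm start initialization
        path.append(beta)
    return lams, path
```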

SLIDE 20

Geometric Interpretation

[Figure 3: The solution path θ̂^{0}, θ̂^{1}, ..., θ̂^{N} and the basins of attraction C_{λ1}, ..., C_{λN} around θ∗: each warm start θ̂^{K−1} falls in the basin of attraction for λK. Large regularization parameters suppress the overselection of irrelevant variables {j | β∗j = 0} and yield highly sparse solutions.]

SLIDE 21

Active Set Strategy (Friedman et al. 2007)

Define A = {j | βj ≠ 0} as the set of indices of nonzero coordinates, and Ā = {j | βj = 0} as the set of indices of zero coordinates. A naive updating scheme is:

(1) Active coordinate minimization: cyclically update the βj's in A until convergence.
(2) Sweeping coordinates: check all βj in A, and if any coordinate becomes zero, move it to Ā.
(3) Adding coordinates: update the βj's over Ā only once, and if any coordinate becomes nonzero, move it to A. Then we go back to (1).

Remark: Heuristic tricks without theoretical guarantees (a sketch of this cycle follows below).
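In Python-flavored pseudocode, the naive cycle reads as follows. update_coordinate is a stand-in for the coordinate thresholding operator, and this is the heuristic scheme above, not PICASSO.

```python
import numpy as np

def naive_active_set(beta, update_coordinate, tol=1e-6, max_cycles=100):
    """(1) cycle over the active set A until convergence; then (2)/(3) one
    pass over the zero set A-bar, moving activated coordinates into A."""
    beta = beta.copy()
    for _ in range(max_cycles):
        # (1) Active coordinate minimization over A = {j : beta_j != 0}.
        while True:
            old = beta.copy()
            for j in np.flatnonzero(beta):
                beta = update_coordinate(beta, j)
            if np.max(np.abs(beta - old)) < tol:
                break
        # (2)+(3) Sweep/add: one pass over A-bar; activated coordinates join A.
        added = False
        for j in np.flatnonzero(beta == 0):
            beta = update_coordinate(beta, j)
            added = added or beta[j] != 0
        if not added:
            return beta        # no coordinate wants to enter: done
    return beta
```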

SLIDE 22

Active Set Identification (Middle Loop)

For notational simplicity, the outer loop index K is omitted.

Greedy Selection: At the m-th iteration, we have β^{[m]} and define

  Am = {j | β^{[m]}_j ≠ 0} and Ām = {j | β^{[m]}_j = 0}.

  • β^{[m+0.5]} ← active coordinate minimization over Am.
  • km ← argmax_{k ∈ Ām} |∇_k L̃λ(β^{[m+0.5]})|.
  • β^{[m+1]}_{km} ← T_{km,λ}(β^{[m+0.5]}) and β^{[m+1]}_{\km} ← β^{[m+0.5]}_{\km}.

Remark: Conservative coordinate selection.

SLIDE 23

Active Set Identification (Middle Loop)

For notational simplicity, the outer loop index K is omitted.

Randomized Selection: At the m-th iteration, we have β^{[m]} and define

  Am = {j | β^{[m]}_j ≠ 0} and Ām = {j | β^{[m]}_j = 0}.

  • β^{[m+0.5]} ← active coordinate minimization over Am.
  • Randomly select km ∈ Ām such that |∇_{km} L̃λ(β^{[m+0.5]})| ≥ δλ.
  • β^{[m+1]}_{km} ← T_{km,λ}(β^{[m+0.5]}) and β^{[m+1]}_{\km} ← β^{[m+0.5]}_{\km}.

Remark: Conservative coordinate selection.

SLIDE 24

Active Set Identification (Middle Loop)

For notational simplicity, the outer loop index K is omitted.

Truncated Cyclic Selection: At the m-th iteration, we have β^{[m]} and define

  Am = {j | β^{[m]}_j ≠ 0} and Ām = {j | β^{[m]}_j = 0}.

  • β^{[m+0.5]} ← active coordinate minimization over Am.
  • For all k ∈ Ām, take

      β^{[m+0.5]}_k ← T_{k,λ}(β^{[m+0.5]})   if |∇_k L̃λ(β^{[m+0.5]})| ≥ δλ,
      β^{[m+0.5]}_k unchanged                if |∇_k L̃λ(β^{[m+0.5]})| < δλ.

  • β^{[m+1]} ← β^{[m+0.5]}.

Remark: Prevents the overselection of irrelevant variables (a sketch of this rule follows below).
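A sketch of one middle loop iteration with truncated cyclic selection. grad_L, threshold_op, and active_minimize abstract ∇L̃λ, T_{k,λ}, and the inner loop respectively; refreshing the gradient after each accepted update is my reading of the sequential cyclic pass.

```python
import numpy as np

def truncated_cyclic_step(beta, grad_L, threshold_op, lam, delta, active_minimize):
    """One middle loop iteration with truncated cyclic selection."""
    beta = active_minimize(beta)                   # inner loop: beta^[m+0.5]
    for k in np.flatnonzero(beta == 0):            # cyclic pass over A_m-bar
        if abs(grad_L(beta)[k]) >= delta * lam:    # truncation rule
            beta = threshold_op(beta, k)           # activate coordinate k
        # otherwise |grad_k| < delta * lam: leave beta_k at zero
    return beta                                    # beta^[m+1]
```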

SLIDE 25

Active Set Identification (Middle Loop)

[Figure 4: Starting from θ^{[m+0.5]}, coordinates added to reach θ^{[m+1]} by cyclic search (failure), greedy selection (success), randomized selection (success), and truncated cyclic selection (success). The cyclic search in Friedman et al. 2007 and Mazumder et al. 2011 may overselect irrelevant variables; the three proposed rules add only relevant blocks.]

SLIDE 26

Active Coordinate Minimization (Inner Loop)

[Figure 5: The inner loop from θ^{[0]} to θ̂, alternating "add" and "sweep" steps. The active coordinate minimization not only decreases the objective value, but also sweeps some variables from the active set.]

SLIDE 27

Computational and Statistical Theories

SLIDE 28

Preliminaries

Definition (Sparse Eigenvalues): Given an integer s ≥ 1,

  ρ+(s) = sup_{‖v‖₀ ≤ s} (vᵀ∇²L(β)v / ‖v‖₂²),  ρ−(s) = inf_{‖v‖₀ ≤ s} (vᵀ∇²L(β)v / ‖v‖₂²),

and ρ̃−(s) = ρ−(s) − α.

Lemma (Restricted Curvature): Given ρ+(s) > ρ−(s) > α, for any β, β′ ∈ ℝ^d such that |{j | βj ≠ 0 or β′j ≠ 0}| ≤ s,

  L̃λ(β′) − L̃λ(β) − (β′ − β)ᵀ∇L̃λ(β) ≤ (ρ+(s)/2) ‖β′ − β‖₂²,
  L̃λ(β′) − L̃λ(β) − (β′ − β)ᵀ∇L̃λ(β) ≥ (ρ̃−(s)/2) ‖β′ − β‖₂²,

where Hλ satisfies Hλ(β′) − Hλ(β) − (β′ − β)ᵀ∇Hλ(β) ≥ −(α/2) ‖β′ − β‖₂².

SLIDE 29

Preliminaries

Assumption A: λN ≥ 4‖∇L(β∗)‖∞ and η ∈ [23/24, 1). The regularization parameters are LARGE enough to eliminate irrelevant variables (Negahban et al. 2012).

Assumption B: Given ‖β∗‖₀ ≤ s∗, there exists an s̃ such that

  (1) s̃ ≥ (484κ² + 100κ)s∗,
  (2) ρ+(s∗ + 2s̃ + 2) < +∞,
  (3) ρ̃−(s∗ + 2s̃ + 2) > 0,

where κ = ρ+(s∗ + 2s̃ + 2) / ρ̃−(s∗ + 2s̃ + 2). The algorithm can tolerate AT MOST s̃ + 1 nonzero irrelevant variables throughout all iterations (Bickel, 2009; Zhang, 2009).

SLIDE 30

Global Convergence (Greedy PICASSO)

Suppose that Assumptions A and B hold. We have the following results:

(Solution Sparsity) Throughout all iterations of PICASSO, any solution β satisfies ‖β_{S̄}‖₀ ≤ s̃ + 1.

(Sparse Optimum) At the K-th iteration of the outer loop, PICASSO converges to a unique sparse local optimum β̄^{λK} satisfying

  ‖β̄^{λK}_{S̄}‖₀ ≤ s̃ and min_{ξ ∈ ∂‖β̄^{λK}‖₁} ‖∇L̃_{λK}(β̄^{λK}) + λKξ‖∞ = 0.

(Logarithmic Iteration Complexity) To attain F_{λN}(β̂^{N}) − F_{λN}(β̄^{λN}) ≤ ε, the number of active set identification iterations is at most O(N · log(1/ε)).

SLIDE 31

Two-step Method

Step 1 (Convex Relaxation): Obtain β̂^{relax} satisfying

  min_{ξ ∈ ∂‖β̂^{relax}‖₁} ‖∇L(β̂^{relax}) + λ0ξ‖∞ ≤ λ0/8

(a sketch follows below).

Step 2 (PICASSO): Solve the optimization problem with PICASSO, and use β̂^{relax} as the initialization for λ0.

Remark: The low precision makes Step 1 very efficient.
Remark: The restricted strong convexity holds for ‖β − β∗‖₂ ≤ R (e.g., logistic loss, Huber loss), where R is a constant and does not scale with (n, d, s∗). All previous theoretical results hold.
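A sketch of Step 1 under the stated stopping rule, using ISTA as one possible low-precision solver; the talk does not commit to a particular solver, and the function names are mine.

```python
import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def step1_convex_relaxation(grad, d, lam0, step, max_iter=1000):
    """Approximately solve the ell_1 problem at lam_0 with ISTA, stopping once
    min_{xi in subdiff ||beta||_1} ||grad L(beta) + lam0 * xi||_inf <= lam0 / 8."""
    beta = np.zeros(d)
    for _ in range(max_iter):
        g = grad(beta)
        # Optimality residual: xi = sign(beta_j) on nonzeros; xi free in [-1, 1] on zeros.
        res = np.where(beta != 0,
                       np.abs(g + lam0 * np.sign(beta)),
                       np.maximum(np.abs(g) - lam0, 0.0))
        if res.max() <= lam0 / 8:
            break                                  # low-precision stopping rule
        beta = soft_threshold(beta - step * g, step * lam0)
    return beta                                    # initialization for PICASSO at lam_0
```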

SLIDE 32

Nearly Unbiased Estimation

Suppose that Assumptions A and B hold. We have

  ‖β̂^{N} − β∗‖₂ = O( ‖∇_{S1}L(β∗)‖₂ / ρ̃−(s∗ + 2s̃) [strong signals] + λN √|S2| / ρ̃−(s∗ + s̃) [weak signals] ),

where S1 = {j | |β∗j| ≥ γλN} and S2 = {j | 0 < |β∗j| < γλN}.

Clarification: To establish the theoretical analysis for each individual problem, we need to assume that the model is CORRECTLY specified. This is a very common assumption in high dimensional statistical theories.

SLIDE 33

Model Specification

Sparse Linear Regression: We consider the linear model y = Xβ∗ + ε, where ε ∼ N(0, σ²In) is the observational noise vector.

Sparse Logistic Regression: We consider the logistic model

  y_i ∼ Bernoulli( exp(X_{i∗}ᵀβ∗) / (1 + exp(X_{i∗}ᵀβ∗)) ) for i = 1, ..., n.

A simulation sketch for both models follows below.
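The sketch below simulates data from the two models; the dimensions and the choice of sparse β∗ are illustrative, not the settings used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s_star, sigma = 200, 1000, 5, 1.0          # illustrative sizes

beta_star = np.zeros(d)
beta_star[:s_star] = rng.choice([-1.0, 1.0], size=s_star)   # sparse truth
X = rng.standard_normal((n, d))                  # design matrix

# Sparse linear regression: y = X beta* + eps, eps ~ N(0, sigma^2 I_n).
y_linear = X @ beta_star + sigma * rng.standard_normal(n)

# Sparse logistic regression: y_i ~ Bernoulli(exp(x_i' beta*) / (1 + exp(x_i' beta*))).
p = 1.0 / (1.0 + np.exp(-(X @ beta_star)))
y_logistic = rng.binomial(1, p)
```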

SLIDE 34

Application to Sparse Linear Regression

Verify Assumption A: Given λN = 8σ√(log d / n), with HIGH PROBABILITY we have λN ≥ 4‖∇L(β∗)‖∞.

Verify Assumption B: Suppose that each row of X is independently sampled from a sub-Gaussian distribution with mean 0 and covariance Σ, where Λmin(Σ) ≥ ψmin and Λmax(Σ) ≤ ψmax. Given α = ψmin/4, there exists an s̃ such that for large enough n, with HIGH PROBABILITY, we have

  (1) s̃ ≥ (484κ² + 100κ) · s∗,
  (2) ρ̃−(s∗ + 2s̃ + 2) ≥ ψmin/4,
  (3) ρ+(s∗ + 2s̃ + 2) ≤ 3ψmax/2.

SLIDE 35

Application to Sparse Linear Regression

Parameter Estimation: Given α = ψmin/4 and λN = 8σ√(log d / n), we have

  ‖β̂^{N} − β∗‖₂ = O_P( σ√(s∗₁ / n) [strong signals] + σ√(s∗₂ log d / n) [weak signals] ),

where s∗₁ = |{j | |β∗j| ≥ γλN}| and s∗₂ = |{j | 0 < |β∗j| < γλN}|.

MCP vs. ℓ₁:

  ‖β̂^{ℓ₁} − β∗‖₂ = O_P( σ√(s∗ log d / n) ).

SLIDE 36

Application to Sparse Linear Regression

Minimum Signal Strength:

  min_{j ∈ S} |β∗j| ≥ (C′σ / ψmin) √(log d / n).

Support Recovery: Given α = ψmin/4 and λN = 8σ√(log d / n), we have

  β̄^{λN} = argmin_β (1 / 2n) ‖y − Xβ‖₂² subject to β_{S̄} = 0

with high probability.

MCP vs. ℓ₁: Restricted Strong Convexity vs. Irrepresentability.

SLIDE 37

Application to Sparse Logistic Regression

Verify Assumption A: Given λN = 8√(log d / n), with high probability we have λN ≥ 4‖∇L(β∗)‖∞.

Verify Assumption B: Suppose that each row of X is independently sampled from a sub-Gaussian distribution with mean 0 and covariance Σ, where Λmin(Σ) ≥ ψmin and Λmax(Σ) ≤ ψmax. Given α = ψmin/4, there exists an s̃ such that for large enough n and any ‖β − β∗‖₂ ≤ R, with high probability, we have

  (1) s̃ ≥ (484κ² + 100κ) · s∗,
  (2) ρ̃−(s∗ + 2s̃ + 2) ≥ ψmin/4,
  (3) ρ+(s∗ + 2s̃ + 2) ≤ 3ψmax/2.

SLIDE 38

Application to Sparse Logistic Regression

Parameter Estimation: Given α = ψmin/4 and λN = 8√(log d / n), we have

  ‖β̂^{N} − β∗‖₂ = O_P( √(s∗₁ / n) [strong signals] + √(s∗₂ log d / n) [weak signals] ),

where s∗₁ = |{j | |β∗j| ≥ γλN}| and s∗₂ = |{j | 0 < |β∗j| < γλN}|.

MCP vs. ℓ₁:

  ‖β̂^{ℓ₁} − β∗‖₂ = O_P( √(s∗ log d / n) ).

SLIDE 39

Numerical Simulations

SLIDE 40

Numerical Simulations

  • PICASSO with greedy selection, denoted by "G-PICASSO".
  • PICASSO with randomized selection, denoted by "R-PICASSO".
  • PICASSO with truncated cyclic selection, denoted by "TC-PICASSO".
  • SPARSENET, proposed in Mazumder et al. 2011.
  • PISTA, proposed in Wang et al. 2014.

SLIDE 41

Numerical Simulations

Table 1: Quantitative comparison on sparse linear regression (N = 100, n = 60, d = 1000, σ = 1, λN = 0.25√(log d / n), γ = 1.05).

Method       ‖β̂ − β∗‖₂       ‖β̂_S‖₀         ‖β̂_{S̄}‖₀      Correct Selection   Timing (sec)
G-PICASSO    0.8003(0.8908)   2.812(0.4997)   0.844(2.066)   666/1000            0.0169(0.0027)
R-PICASSO    0.8102(0.9663)   2.791(0.5355)   0.902(2.353)   653/1000            0.0186(0.0034)
TC-PICASSO   0.8057(0.8374)   2.800(0.4839)   0.888(2.038)   645/1000            0.0167(0.0024)
SPARSENET    1.1260(1.2708)   2.669(0.6942)   1.678(3.191)   514/1000            0.0171(0.0025)
PISTA        0.8135(0.8998)   2.797(0.5115)   0.881(2.112)   664/1000            2.1771(0.3805)

SLIDE 42

Conclusions

SLIDE 43

Conclusions

Multistage convex relaxation (Zhang, 2010; Zhang, 2012):
  • No theoretical guarantee on the iteration complexity;
  • Needs to be combined with an efficient solver for each subproblem.

One-step convex relaxation method (Zou and Li, 2008; Wang and Li, 2013; Fan et al. 2014):
  • Attains suboptimal statistical rates of convergence;
  • Requires a stronger minimum signal strength assumption;
  • Needs to be combined with an efficient solver for each subproblem.

Path-following proximal gradient algorithm (Wang et al. 2014):
  • Worse empirical computational performance;
  • Requires ‖β∗‖₂ ≤ R/2 for sparse generalized linear model estimation.

SLIDE 44

Conclusions

Proximal gradient algorithm (Loh and Wainwright, 2013): solves

  min_{β ∈ ℝ^d} L(β) + Rλ(β) subject to ‖β‖₁ ≤ R/2.    (1)

  • Sophisticated parameter tuning;
  • Inexact convergence;
  • Slower parameter estimation rates of convergence;
  • Requires ‖β∗‖₁ ≤ R/2 for all nonconvex sparse learning problems.

Pathwise Calibrated Sparse Shooting Algorithm (PICASSO):
  • Concrete theoretical guarantees;
  • Empirically very efficient;
  • Weaker assumptions.

SLIDE 45

Thank You! Questions?