SLIDE 1

Graphlet Screening (GS) Achieves Optimal Rate in Variable Selection

Jiashun Jin
Carnegie Mellon University

Collaborated with Cun-Hui Zhang (Rutgers) and Qi Zhang (Univ. of Pittsburgh)

SLIDE 2

Variable selection

Y = Xβ + z, X = Xn,p, z ∼ N(0, In)

◮ p ≫ n ≫ 1
◮ signals are rare and weak
◮ let G = X′X be the Gram matrix
◮ diagonals of G are normalized to 1
◮ G is sparse (few large entries in each row)
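A minimal numerical sketch of this setup (all parameter values below are illustrative, not from the talk):

import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 1000                 # p >> n in the regime of interest
eps, tau = 0.01, 3.0             # signal rarity and (weak) strength

X = rng.normal(size=(n, p)) / np.sqrt(n)   # columns have norm ~ 1, so diag(G) ~ 1
beta = rng.binomial(1, eps, size=p) * tau  # rare and weak signals
z = rng.normal(size=n)
Y = X @ beta + z

G = X.T @ X                      # Gram matrix; off-diagonals are O(1/sqrt(n))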

SLIDE 3

Subset selection

(1/2)‖Y − Xβ‖₂² + (λ²/2)‖β‖₀

◮ L0-penalization method
◮ Variants: Cp, AIC, BIC, RIC
◮ Computationally challenging

Mallows (1973), Akaike (1974), Schwarz (1978), Foster & George (1994)
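To see why L0-penalization is computationally challenging, here is a toy brute-force solver (my own illustration, not the talk's method): it enumerates all 2^p supports and is infeasible beyond very small p.

import itertools
import numpy as np

def l0_subset_selection(Y, X, lam):
    # Minimize 0.5*||Y - X beta||_2^2 + (lam^2/2)*||beta||_0 by enumeration
    n, p = X.shape
    best_cost, best_beta = 0.5 * (Y @ Y), np.zeros(p)   # empty model as baseline
    for k in range(1, p + 1):
        for S in itertools.combinations(range(p), k):
            XS = X[:, S]
            coef, *_ = np.linalg.lstsq(XS, Y, rcond=None)
            resid = Y - XS @ coef
            cost = 0.5 * (resid @ resid) + 0.5 * lam**2 * k
            if cost < best_cost:
                best_cost, best_beta = cost, np.zeros(p)
                best_beta[list(S)] = coef
    return best_beta   # cost of the search grows as 2^p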

SLIDE 4

The lasso

(1/2)‖Y − Xβ‖₂² + λ‖β‖₁

◮ L1-penalization method; Basis Pursuit
◮ Widely used
◮ computationally efficient even when p is large
◮ in the noiseless case, if the signals are sufficiently sparse, equivalent to L0-penalization

Chen et al. (1998); Tibshirani (1996); Donoho (2006)
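A minimal sketch of L1-penalized fitting with scikit-learn (note: sklearn's Lasso scales the quadratic term by 1/(2n), so alpha corresponds to λ only up to that scaling; the values below are illustrative):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 1000
X = rng.normal(size=(n, p)) / np.sqrt(n)
beta = np.zeros(p); beta[:5] = 3.0
Y = X @ beta + rng.normal(size=n)

fit = Lasso(alpha=0.1, fit_intercept=False).fit(X, Y)
support = np.flatnonzero(fit.coef_)        # estimated signal locations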

SLIDE 5

Limitation of L0-Penalization, I

Ex. Y = Xβ + z, z ∼ N(0, In), where the βj take values in {0, τ} and

G = X′X = diag(D, D, . . . , D),   D = ( 1  a ; a  1 )

{1, 2, . . . , p} partitions into 3 types of 2 × 2 blocks:

◮ I. No signal
◮ II. One signal
◮ III. Two signals

SLIDE 6

Limitation of L0-Penalization, II

◮ one-stage method
◮ one tuning parameter
◮ does not exploit ‘local’ graphical structure

Therefore, many penalization methods (e.g. lasso, SCAD, MC+, Dantzig selector) are non-optimal, as L0-penalization is the ‘idol’ these methods mimic

‘local’: nodes that are nearby in the geodesic distance of a graph (to be defined)

SLIDE 7

Where are the signals?

Tukey, J.W. (1965). Which part of the sample contains the information? Proc. Natl. Acad. Sci.

John Wilder Tukey (1915-2000)

SLIDE 8

Graph of Strong Dependence (GOSD)

GOSD is the graph G = (V , E):

◮ V = {1, 2, . . . , p}: each variable is a node
◮ An edge between nodes i and j iff |G(i, j)| ≥ 1/log(p), say
◮ G = X′X sparse  =⇒  the GOSD is sparse
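A sketch of building the GOSD adjacency matrix from the Gram matrix, using the 1/log(p) threshold above (the helper name gosd_adjacency is mine):

import numpy as np

def gosd_adjacency(G):
    p = G.shape[0]
    thresh = 1.0 / np.log(p)
    A = (np.abs(G) >= thresh)
    np.fill_diagonal(A, False)     # no self-loops
    return A                        # boolean adjacency matrix of the GOSD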

SLIDE 9

Signal sparsity and graph sparsity

◮ Despite its sparsity, the GOSD is usually complicated
◮ Denote the support of β by
  S = S(β) = {1 ≤ i ≤ p : βi ≠ 0}
  Restricting the nodes to S forms a subgraph G_S
◮ Key insight: G_S decomposes into many small-size components that are disconnected from each other

Component: a maximal connected subgraph
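A sketch of this insight, assuming the gosd_adjacency helper above: restrict the GOSD to the support S and list its connected components.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def support_components(G, beta):
    S = np.flatnonzero(beta)                   # support of beta
    A_S = gosd_adjacency(G)[np.ix_(S, S)]      # GOSD restricted to S
    n_comp, labels = connected_components(csr_matrix(A_S), directed=False)
    return [S[labels == c] for c in range(n_comp)]   # list of small components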

SLIDE 10

For today

Graphlet Screening (GS):

◮ gs-step: graphlet screening by sequential χ2-tests
◮ gc-step: graphlet cleaning by penalized MLE
◮ Focus: rare and weak signals

SLIDE 11

Graphlet screening (gs-step), Initial stage

Y = Xβ + z, X = Xn,p, z ∼ N(0, In); G : GOSD

◮ Fix m ≥ 1 (small)
◮ Let {Gt : 1 ≤ t ≤ T} be all connected subgraphs of the GOSD with size ≤ m,
◮ arranged by size, with ties broken lexicographically

[Figure: an example GOSD on p = 10 nodes]

Example: p = 10, m = 3, T = 30; {Gt, 1 ≤ t ≤ T}:
{1}, {2}, . . . , {10}
{1, 2}, {1, 7}, . . . , {9, 10}
{1, 2, 4}, {1, 2, 7}, . . . , {8, 9, 10}
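A sketch of the enumeration, assuming a boolean adjacency matrix A such as the one returned by the gosd_adjacency sketch above: grow connected node sets one neighbor at a time, which is feasible because the GOSD is sparse.

import numpy as np

def connected_subgraphs(A, m):
    # A: boolean adjacency matrix; returns sorted tuples of node indices
    p = A.shape[0]
    seen = set()
    frontier = [(j,) for j in range(p)]       # size-1 subgraphs
    while frontier:
        nxt = []
        for sub in frontier:
            if sub in seen:
                continue
            seen.add(sub)
            if len(sub) == m:                 # do not grow past size m
                continue
            nbrs = set(np.flatnonzero(A[list(sub)].any(axis=0))) - set(sub)
            for j in nbrs:
                nxt.append(tuple(sorted(sub + (j,))))
        frontier = nxt
    return sorted(seen, key=lambda s: (len(s), s))   # by size, then lexicographic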

SLIDE 12

gs-step, II. Updating stage

X = [x1, x2, . . . , xp]; {Gt, 1 ≤ t ≤ T}: all connected subgraphs with size ≤ m

For t = 1, 2, . . . , T:

◮ St−1: set of indices retained after the last stage
◮ F = Gt ∩ St−1: nodes accepted previously
◮ D = Gt \ F: nodes currently under investigation
◮ PF: projection from Rn onto the subspace spanned by {xj : j ∈ F}
◮ Define T(Y; D, F) = ‖PGtY‖² − ‖PFY‖²
◮ Add the nodes in D to St−1 iff T(Y; D, F) > t(D, F), where t(D, F) is a threshold (TBD)

Once accepted, a node is kept until the end of the gs-step
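A sketch of the test statistic T(Y; D, F), computing the projections via QR (the helper names are mine):

import numpy as np

def proj_sq_norm(Y, X, idx):
    # Squared norm of the projection of Y onto span{x_j : j in idx}
    if len(idx) == 0:
        return 0.0
    Q, _ = np.linalg.qr(X[:, list(idx)])
    c = Q.T @ Y
    return float(c @ c)

def screening_gain(Y, X, Gt, F):
    # T(Y; D, F) = ||P_{Gt} Y||^2 - ||P_F Y||^2, with D = Gt \ F
    D = sorted(set(Gt) - set(F))
    T = proj_sq_norm(Y, X, list(Gt)) - proj_sq_norm(Y, X, list(F))
    return D, T     # accept the nodes in D when T > t(D, F)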

SLIDE 13

Comparison with marginal regression (computational complexity)

◮ Marginal screening
  ◮ ineffective (neglects ‘local’ graphical structure)
  ◮ ‘brute-force’ m-variate screening is computationally challenging: O(p^m)
◮ gs-step
  ◮ only screens connected subgraphs of the GOSD
  ◮ if the maximum degree of the GOSD is ≤ K, then there are ≤ C(eK)^m p such subgraphs

Fan & Lv (2008), Wasserman & Roeder (2009), Frieze & Molloy (1999)

SLIDE 14

Two important properties of gs-step

S∗ ≡ ST: the set of nodes surviving at the end of the gs-step. If both the signals and the GOSD are sparse:

◮ Sure Screening (SS): S∗ retains all but a small proportion of the signals
◮ Separable After Screening (SAS): S∗ decomposes into many small-size components

SLIDE 15

Reduce to many small-size regressions, I

G = X′X; I0 ⊂ S∗: a component
G^{I0}: row restriction; G^{I0,I0}: row & column restriction

◮ Restrict the regression to I0:
  Y = Xβ + z  =⇒  X′Y = X′Xβ + X′z  =⇒  (X′Y)^{I0} = (Gβ)^{I0} + (X′z)^{I0}
◮ (X′z)^{I0} ∼ N(0, G^{I0,I0}) since z ∼ N(0, In)
◮ Key: (Gβ)^{I0} ≈ G^{I0,I0} β^{I0}
◮ Result: many small-size regressions:
  (X′Y)^{I0} ≈ N(G^{I0,I0} β^{I0}, G^{I0,I0})
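A sketch of extracting the small-size regression data for one component I0 (helper name is mine):

import numpy as np

def component_regression(Y, X, I0):
    I0 = list(I0)
    w = X[:, I0].T @ Y            # (X'Y) restricted to I0
    G_II = X[:, I0].T @ X[:, I0]  # G^{I0,I0}
    return w, G_II                # data for the small regression in the gc-step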

SLIDE 16

Reduce to small-size regressions, II

Why is (Gβ)^{I0} ≡ G^{I0}β ≈ G^{I0,I0}β^{I0}?

G^{I0}β = [ G^{I0,I0}  G^{I0,J0}  · · · ] (β^{I0}; β^{J0}; · · ·)

◮ I0, J0 ⊂ S∗: components
◮ By the SS property, β ≈ 0 outside S∗
◮ By the SAS property, G^{I0,J0} ≈ 0

SLIDE 17

Graphlet cleaning (gc-step)

Y = Xβ + z, z ∼ N(0, In)

◮ I0: a component of S∗; S∗: the set of all surviving nodes
◮ β^{I0}: restriction of β to the rows in I0
◮ X^{∗,I0}: restriction of X to the columns in I0

Fixing (u^gs, v^gs):

◮ j ∉ S∗: set β̂j = 0
◮ j ∈ S∗: estimate β^{I0} by minimizing
  ‖P^{I0}(Y − X^{∗,I0}θ)‖² + (u^gs)²‖θ‖₀,
  where each entry of θ is either 0 or ≥ v^gs in magnitude
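A sketch of the gc-step on one component, simplifying the constraint "0 or ≥ v^gs in magnitude" to the three values {0, ±v^gs} for illustration; the exhaustive search is feasible because |I0| is small.

import itertools
import numpy as np

def gc_step(Y, X, I0, u_gs, v_gs):
    I0 = list(I0)
    XI = X[:, I0]
    Q, _ = np.linalg.qr(XI)
    YI = Q @ (Q.T @ Y)                     # P^{I0} Y: project onto span of I0 columns
    best_cost, best_theta = np.inf, None
    for vals in itertools.product((0.0, v_gs, -v_gs), repeat=len(I0)):
        theta = np.array(vals)
        resid = YI - XI @ theta
        cost = resid @ resid + u_gs**2 * np.count_nonzero(theta)
        if cost < best_cost:
            best_cost, best_theta = cost, theta
    return best_theta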

SLIDE 18

Random design model

Y = Xβ + z,   X = [X1, X2, . . . , Xn]′,   Xi iid ∼ N(0, (1/n)Ω)

◮ Ω: unknown correlation matrix
◮ Ex: Compressive Sensing, Computer Security

Dinur and Nissim (2004), Nowak et al. (2007)
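A sketch of drawing such a random design (the helper is my own illustration):

import numpy as np

def random_design(n, p, Omega, rng):
    # Rows of X are iid N(0, Omega/n)
    return rng.multivariate_normal(np.zeros(p), Omega / n, size=n)   # n-by-p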

SLIDE 19

Rare and Weak signal model

Y = Xβ + z, z ∼ N(0, In)

β = b ◦ µ,   bi iid ∼ Bernoulli(ǫ),   µ ∈ Θ∗p(τ, a)

◮ b ◦ µ ∈ Rp: (b ◦ µ)j = bjµj
◮ Θ∗p(τ, a) = {µ ∈ Rp : τ ≤ |µj| ≤ aτ}, a > 1
◮ Two key parameters:
  ǫ: sparsity; τ: (minimum) signal strength
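A sketch of drawing β from this model (the uniform magnitudes in [τ, aτ] are an illustrative choice; the model only constrains them to that range):

import numpy as np

def rare_weak_beta(p, eps, tau, a, rng):
    b = rng.binomial(1, eps, size=p)               # rare: Bernoulli(eps)
    mag = rng.uniform(tau, a * tau, size=p)        # magnitudes in [tau, a*tau]
    sgn = rng.choice([-1.0, 1.0], size=p)
    return b * sgn * mag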

SLIDE 20

Asymptotic framework

Use p as the driving asymptotic parameter, and tie (ǫ, τ, n) to p via fixed parameters

◮ Signal rarity: ǫ = ǫp = p^{−ϑ}, 0 < ϑ < 1
◮ Signal weakness: τ = τp = √(2r log(p)), r > 0
◮ Sample size: n = np = p^θ, (1 − ϑ) < θ < 1, so that pǫp ≪ np ≪ p
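A small numeric illustration of the calibration, with made-up values of (ϑ, r, θ):

import numpy as np

p, vartheta, r, theta = 10**4, 0.5, 1.5, 0.75
eps_p = p ** (-vartheta)                # 0.01: about p*eps_p = 100 signals
tau_p = np.sqrt(2 * r * np.log(p))      # ~5.26: signal strength
n_p = int(p ** theta)                   # 1000 samples; p*eps_p << n_p << p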

SLIDE 21

Limitation of ‘Oracle Property’

The oracle property, or the probability of exact support recovery, is a widely used criterion for assessing optimality in variable selection.

However, when signals are rare and weak, it is usually impossible to have exact recovery.

SLIDE 22

Minimax Hamming distance

Measuring errors with the Hamming distance:

Hp(β̂, ǫp, µ; Ω) = E[ Σ_{j=1}^p 1{sgn(β̂j) ≠ sgn(βj)} ]

Minimax Hamming distance:

Hamm∗p(ϑ, θ, r, a, Ω) = inf_{β̂} sup_{µ ∈ Θ∗p(τp,a)} Hp(β̂, ǫp, µ; Ω)
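A sketch of the sign-based Hamming error inside the expectation above:

import numpy as np

def hamming_error(beta_hat, beta):
    # Number of coordinates j with sgn(beta_hat_j) != sgn(beta_j)
    return int(np.sum(np.sign(beta_hat) != np.sign(beta)))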

SLIDE 23

Exponent ρ∗j = ρ∗j(ϑ, r, Ω)

Define ω = ω(S0, S1; Ω) = inf_δ { δ′Ωδ }, where δ ≡ u(0) − u(1) with

u(k)i = 0 for i ∉ Sk,   1 ≤ |u(k)i| ≤ a for i ∈ Sk,   k = 0, 1

Define ρ(S0, S1; ϑ, r, a, Ω) = ((|S0| + |S1|)/2) ϑ + ωr/4 + (|S1| − |S0|)² ϑ² / (4ωr)

The minimax rate critically depends on the exponents:

ρ∗j = ρ∗j(ϑ, r; Ω) = min_{(S0,S1): j ∈ S0 ∪ S1} ρ(S0, S1; ϑ, r, a, Ω)

◮ not dependent on (θ, a) (mild regularity cond.)
◮ computable; has an explicit form for some Ω
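A direct transcription of ρ(S0, S1) as a function, taking ω = ω(S0, S1; Ω) as given (computing ω itself requires the constrained quadratic minimization above):

def rho(s0, s1, vartheta, r, omega):
    # s0, s1: |S0|, |S1|; omega: inf_delta delta' Omega delta
    return ((s0 + s1) / 2) * vartheta + omega * r / 4 \
           + ((s1 - s0) ** 2 * vartheta ** 2) / (4 * omega * r)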

SLIDE 24

Graph of Least Favorable (GOLF)

Define the sets of least favorable configurations at site j:

(S∗0j, S∗1j) = argmax_{(S0,S1): j ∈ S0 ∪ S1} ρ(S0, S1; ϑ, r, a, Ω)

Definition. GOLF is the graph G⋄ = (V, E) with V = {1, 2, . . . , p}, where there is an edge between j and k if and only if (S∗0j ∪ S∗1j) ∩ (S∗0k ∪ S∗1k) ≠ ∅

SLIDE 25

Lower bound

β = b ◦ µ,   bj iid ∼ Bernoulli(ǫp),   µ ∈ Θ∗p(τp, a)

ǫp = p^{−ϑ},   τp = √(2r log(p))

Theorem 1. Let dp(G⋄) be the maximum degree of the GOLF. As p → ∞,

Hamm∗p(ϑ, θ, r, a, Ω) ≥ (Lp / dp(G⋄)) Σ_{j=1}^p p^{−ρ∗j},

where Lp is a generic multi-log(p) term.

SLIDE 26

Main result: GS is asymptotically minimax

◮ Assume Σ_{j=1}^p |Ω(i, j)|^γ ≤ C for some γ ∈ (0, 1) and all 1 ≤ i ≤ p
◮ gs-step: set the thresholds at 2q ρ∗j log(p), 0 < q < 1
◮ gc-step: set u^gs = √(2ϑ log(p)) and v^gs = τp

Theorem 2. As p → ∞,

◮ Both the SS and SAS properties hold
◮ The maximum degree of the GOLF is ≤ Lp
◮ GS achieves the optimal rate of convergence:

sup_{µ ∈ Θ∗p(τp,a)} Hp(β̂gs, ǫp, µ; Ω) ≤ Lp Σ_{j=1}^p p^{−ρ∗j} + p^{1−(m+1)ϑ},

where Lp is a generic multi-log(p) term

SLIDE 27

Tuning parameters of Graphlet Screening

GS uses tuning parameters (δ, m, u^gs, v^gs) and Q = {t(D, F) : D and F as in the gs-step}

◮ (δ, m): flexible (e.g. δ = 1/log(p), m = 3)
◮ Q: only needs to lie in a certain range:
  t(D, F) = 2q log(p),   q0 ≤ q ≤ q∗(D, F)
◮ u^gs is relatively easy to estimate
◮ v^gs is relatively hard to estimate

SLIDE 28

Example: ρ∗j(ϑ, r, Ω) has a simple form

If λ∗3(Ω) > 2(5 − 2√6), λ∗4(Ω) > 5 − 2√6, and

19 − 8√6 < Ω(i, j) < (1 + √6 − √2)/(√(3/2) + 1),   ∀ i ≠ j

(numerically: 5 − 2√6 ≈ 0.1, 19 − 8√6 ≈ −0.6, and the upper bound ≈ 0.64)

Corollary 1. As p → ∞,

Hamm∗p(ϑ, θ, r, a, Ω) / (pǫp) =
  1 + o(1),               if r/ϑ < 1,
  Lp p^{−(ϑ−r)²/(4r)},    if 1 < r/ϑ < 5 + 2√6

SLIDE 29

Phase diagram

A three-phase diagram in the phase space {(ϑ, r) : 0 < ϑ < 1, r > 0} to visualize the behavior of a procedure

◮ I. Region of No Recovery
◮ II. Region of Almost Full Recovery
◮ III. Region of Exact Recovery

SLIDE 30

Phase diagram of GS (Corollary 1)

[Figure: two phase diagrams in the (ϑ, r) plane, each showing the regions of Exact Recovery, Almost Full Recovery, and No Recovery]

Left: Ω = Ip; red curve: r = (1 + √(1 − ϑ))². Right: Ω as in Corollary 1; blue line: r/ϑ = 5 + 2√6.

SLIDE 31

Non-optimal regions for L0/L1 penalization

G is 2 × 2 block-wise (diagonal 1, off-diagonal 0.5)

[Figure: three phase diagrams in the (ϑ, r) plane, showing regions of Exact Recovery, Optimal, Non-optimal, and No Recovery]

Left: GS. Middle: subset selection. Right: lasso (y-axis extended).
ǫp = p^{−ϑ}, τp = √(2r log p), each signal ≥ τp

SLIDE 32

Simulation comparison

[Figure: error curves versus τp for LASSO and Graphlet Screening under three designs: 2-by-2 blockwise, penta-diagonal, and random correlation]

p = 5000, n = 4000, pǫp = 250; τp = 6, 7, . . . , 12. Left to right: G is block-wise, penta-diagonal, randomly generated (‘sprandsym’ in matlab).
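A scaled-down sketch of this setup (the talk uses p = 5000, n = 4000, pǫp = 250; the penta-diagonal off-diagonal values below are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
p, n, k = 500, 400, 25            # scaled-down stand-ins for 5000/4000/250

Omega = np.eye(p)                 # penta-diagonal correlation matrix
for d, val in [(1, 0.4), (2, 0.2)]:
    Omega += val * (np.eye(p, k=d) + np.eye(p, k=-d))

X = rng.multivariate_normal(np.zeros(p), Omega / n, size=n)

for tau in range(6, 13):          # tau_p = 6, 7, ..., 12
    beta = np.zeros(p)
    beta[rng.choice(p, size=k, replace=False)] = tau
    Y = X @ beta + rng.normal(size=n)
    # ... run lasso and Graphlet Screening on (Y, X), record Hamming errors ...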

SLIDE 33

Extensions

◮ Main results are not tied to the Rare and Weak model; they hold much more broadly
◮ Extension to non-random designs is mostly straightforward
◮ Successfully extended to cases where G is non-sparse but sparsifiable:
  ◮ change-point problems
  ◮ long-memory time series
  ◮ factor models

Ke, Jin, Fan (2012)

SLIDE 34

Take-home messages

◮ Proposed Graphlet Screening (GS) for variable selection
◮ Proved the optimality of GS
◮ Key insights:
  ◮ the original model is decomposable, due to the interaction between signal sparsity and graph sparsity
  ◮ the minimax rate depends on X ‘locally’, so we have to act ‘locally’
◮ Exposed the intuition for the non-optimality of penalization methods
