

SLIDE 1

Stochastic Iterative Hard Thresholding for Graph-Structured Sparsity Optimization

Baojian Zhou¹, Feng Chen¹, and Yiming Ying²

¹Department of Computer Science, ²Department of Mathematics and Statistics,

University at Albany, NY, USA

06/13/2019, Poster #92


SLIDE 2

Graph-structure information used as a prior often brings:

  • better classification and regression performance
  • stronger interpretability

Current limitations:

  • focus only on a specific loss
  • expensive full-gradient computation
  • cannot handle complex structure

Our goals are to propose/provide:

  • an algorithm for general losses in the stochastic setting
  • a convergence analysis
  • real-world applications

Structured sparse learning

Given M(M) = {w : supp(w) ∈ M}, the structured sparse learning problem can be formulated as

  min_{w ∈ M(M)} F(w) := (1/n) ∑_{i=1}^n f_i(w),

where F(w) is a convex loss such as the least-squares or logistic loss, and M(M) models structured sparsity such as connected subgraphs, dense subgraphs, and subgraphs isomorphic to a query graph.
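As a concrete instance of this finite-sum template, a minimal Python sketch with least-squares components f_i; all names here are illustrative placeholders, not the authors' code:

```python
# Minimal sketch of the finite-sum objective F(w) = (1/n) sum_i f_i(w)
# with least-squares components; names are purely illustrative.
import numpy as np

def f_i(w, X, y, i):
    """Loss of the i-th sample: 0.5 * (x_i^T w - y_i)^2."""
    r = X[i] @ w - y[i]
    return 0.5 * r * r

def F(w, X, y):
    """Average of the per-sample losses."""
    return np.mean([f_i(w, X, y, i) for i in range(len(y))])
```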


[Figure: a sparse vector w over an example graph G with nodes w1, ..., w6]

SLIDE 3

Inspired by two recent works: Hegde et al. (2016) and Nguyen et al. (2017).

Algorithm 1 GraphStoIHT
1: Input: η_t, F(·), M_H, M_T
2: Initialize: w^0 and t = 0
3: for t = 0, 1, 2, ... do
4:   Choose ξ_t from [n] with probability p_{ξ_t}
5:   b^t = P(∇f_{ξ_t}(w^t), M_H)
6:   w^{t+1} = P(w^t − η_t b^t, M_T)
7: end for
8: Return w^{t+1}
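A minimal Python sketch of the loop above; `head_proj` and `tail_proj` stand in for the approximate projections P(·, M_H) and P(·, M_T), and the uniform sampling of ξ_t is an illustrative choice for p_{ξ_t}, not the paper's code:

```python
# Sketch of Algorithm 1 (GraphStoIHT); head_proj/tail_proj are supplied
# callables approximating P(., M_H) and P(., M_T).
import numpy as np

def graph_sto_iht(grad_f, n, w0, eta, head_proj, tail_proj,
                  max_iter=1000, seed=0):
    """grad_f(w, i): stochastic gradient of f_i at w."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(max_iter):
        i = int(rng.integers(n))        # draw xi_t from [n] (uniform here)
        b = head_proj(grad_f(w, i))     # b^t = P(grad f_{xi_t}(w^t), M_H)
        w = tail_proj(w - eta * b)      # w^{t+1} = P(w^t - eta_t b^t, M_T)
    return w
```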


  • Weighted Graph Model (Hegde et al., 2015a), e.g. M = {S : |S| ≤ 3, S is connected}

Orthogonal projection operator P(·, M) : R^p → R^p, defined as

  P(w, M) = arg min_{w′ ∈ M(M)} ‖w − w′‖²,

instantiated here for two models (for the s-sparse set this projection is exact hard thresholding; see the sketch below):

  • the s-sparse set
  • the Weighted Graph Model
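For the plain s-sparse model, the projection is exact and reduces to keeping the s largest-magnitude entries; the weighted graph model instead requires the approximate head/tail projections of Hegde et al. (2015a). A small sketch of the exact case:

```python
# Exact projection onto the s-sparse model: keep the s largest-magnitude
# entries of w and zero out the rest.
import numpy as np

def sparse_proj(w, s):
    out = np.zeros_like(w)
    keep = np.argsort(np.abs(w))[-s:]   # indices of the s largest |w_j|
    out[keep] = w[keep]
    return out
```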

Two differences from StoIHT:

  • it projects the stochastic gradient ∇f_{ξ_t}(·) onto M(M_H)
  • it projects the proxy onto M(M_T)

Why the projection b^t = P(∇f_{ξ_t}(w^t), M_H)?

  • both steps solve the same projection problem
  • intuitively, sparsity is present in both the primal and the dual space
  • it removes some noisy directions at the first stage


SLIDE 4

Two assumptions on M(M):

  1. each f_i(w) satisfies β-Restricted Strong Smoothness (RSS)
  2. F(w) satisfies α-Restricted Strong Convexity (RSC)

Equivalently, with the Bregman divergence B_f(w, w′) = f(w) − f(w′) − ⟨∇f(w′), w − w′⟩, the two conditions read

  (α/2) ‖w − w′‖² ≤ B_f(w, w′) ≤ (β/2) ‖w − w′‖².

Efficient approximate projections:

  • P(·, M_H) with approximation factor c_H
  • P(·, M_T) with approximation factor c_T

Theorem 1 (Linear Convergence)

Let w^0 be the starting point and choose η_t = η. Then the iterate w^{t+1} of Algorithm 1 satisfies

  E_{ξ[t]} ‖w^{t+1} − w*‖ ≤ κ^{t+1} ‖w^0 − w*‖ + σ/(1 − κ),

where

  κ = (1 + c_T) (√(αβη² − 2αη + 1) + √(1 − α₀²)),
  α₀ = c_H ατ − √(αβτ² − 2ατ + 1),  β₀ = (1 + c_H) τ,
  σ = (β₀/α₀ + α₀β₀/√(1 − α₀²)) E_{ξ_t} ‖∇_I f_{ξ_t}(w*)‖ + η E_{ξ_t} ‖∇_I f_{ξ_t}(w*)‖,

and η, τ ∈ (0, 2/β).
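For intuition, the σ/(1 − κ) term is the usual geometric-series error floor: if each step contracts as E‖w^{t+1} − w*‖ ≤ κ E‖w^t − w*‖ + σ (the one-step bound such proofs establish), unrolling over t steps gives

```latex
\mathbb{E}\,\|w^{t+1}-w^{*}\|
  \le \kappa^{t+1}\,\|w^{0}-w^{*}\| + \sigma\sum_{s=0}^{t}\kappa^{s}
  \le \kappa^{t+1}\,\|w^{0}-w^{*}\| + \frac{\sigma}{1-\kappa},
  \qquad 0 < \kappa < 1 .
```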


SLIDE 5

Graph Linear Regression

X ∈ R^{m×p}, ε ∼ N(0, I_m), y = Xw* + ε. Consider the least-squares loss

  arg min_{supp(w) ∈ M(M)} F(w) := (1/n) ∑_{i=1}^n (n/2m) ‖X_{B_i} w − y_{B_i}‖².

Contraction factors:

Algorithm      Contraction factor κ
GraphIHT       (1 + c_T) (√δ + 2√(1 − δ)) √δ
GraphStoIHT    (1 + c_T) (√(2/(1 + δ)) + 2√(2(1 − δ)/(1 + δ))) √δ

  • For GraphIHT, κ < 1 requires δ ≤ 0.0527
  • For GraphStoIHT, κ < 1 requires δ ≤ 0.0142
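To see where these thresholds come from, one can evaluate both factors at the stated δ values; a quick check, assuming c_T = 1 (an exact tail projection, an assumption made here for illustration), puts both right at the κ < 1 boundary:

```python
# Evaluate the two contraction factors at their delta thresholds,
# assuming c_T = 1; an illustrative check, not the paper's code.
import numpy as np

def kappa_graph_iht(delta, c_T=1.0):
    return (1 + c_T) * (np.sqrt(delta) + 2 * np.sqrt(1 - delta)) * np.sqrt(delta)

def kappa_graph_sto_iht(delta, c_T=1.0):
    return (1 + c_T) * (np.sqrt(2 / (1 + delta))
                        + 2 * np.sqrt(2 * (1 - delta) / (1 + delta))) * np.sqrt(delta)

print(kappa_graph_iht(0.0527))      # ~0.999, just below 1
print(kappa_graph_sto_iht(0.0142))  # ~0.999, just below 1
```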

Graph Logistic Regression

x_i ∈ R^p, y_i ∈ {+1, −1}, with P(y_i | x_i, w*) = (1 + e^{−y_i ⟨w*, x_i⟩})^{−1}. Consider the ℓ2-regularized logistic loss

  arg min_{supp(w) ∈ M(M)} F(w) := (1/n) ∑_{i=1}^n (n/m) ∑_{j=1}^{m/n} h(w, i_j) + (λ/2) ‖w‖²,

where h(w, i_j) = log(1 + exp(−y_{i_j} ⟨x_{i_j}, w⟩)). If the x_i are normalized, then F(w) satisfies λ-RSC and each f_i(w) satisfies (α + (1 + ν)θ_max)-RSS. The condition κ < 1 becomes

  λ / (λ + n(1 + ν)θ_max / (4m)) ≥ 243/250,

with probability 1 − p·exp(−θ_max ν / 4), where θ_max = λ_max(∑_{j=1}^{m/n} E[x_{i_j} x_{i_j}^T]) and ν ≥ 1.
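The inner loss h(w, i_j) is the standard logistic term and can be evaluated in a numerically stable way; a minimal sketch, with illustrative names:

```python
# Stable evaluation of the objective: mean of log(1 + exp(-y_i <x_i, w>))
# plus the (lambda/2)||w||^2 ridge term. np.logaddexp(0, m) = log(1 + e^m).
import numpy as np

def logistic_objective(w, X, y, lam):
    margins = -y * (X @ w)                      # -y_i <x_i, w>, shape (n,)
    return np.mean(np.logaddexp(0.0, margins)) + 0.5 * lam * (w @ w)
```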


SLIDE 6

Simulation dataset (a data-generation sketch follows below):

  • each entry √m · X_ij ∼ N(0, 1)
  • supp(w*) is generated by a random walk on the graph
  • entries of w* are drawn from N(0, 1)
  • Weighted Graph Model (Hegde et al., 2015b)
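A sketch of generating one such instance; the adjacency-list argument `adj`, the seed, and the noiseless y are assumptions made for illustration:

```python
# Simulated instance: sqrt(m) * X_ij ~ N(0,1); supp(w*) grown by a
# random walk on the graph; nonzeros of w* drawn from N(0,1).
import numpy as np

def make_instance(adj, p, m, s, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((m, p)) / np.sqrt(m)
    v = int(rng.integers(p))
    supp = {v}
    while len(supp) < s:              # random walk yields a connected support
        v = int(rng.choice(adj[v]))
        supp.add(v)
    w_star = np.zeros(p)
    idx = np.fromiter(supp, dtype=int)
    w_star[idx] = rng.standard_normal(idx.size)
    y = X @ w_star                    # add eps ~ N(0, I_m) for the noisy model
    return X, y, w_star
```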

[Figure: estimation error ‖x − x̂‖ vs. epochs for GraphStoIHT with block sizes b ∈ {1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 180}, and vs. iterations for step sizes η ∈ {0.1, 0.2, ..., 1.6}]

[Figure: probability of recovery vs. oversampling ratio m/s on the Background, Angio, and Text benchmark images, comparing NIHT, IHT, StoIHT, CoSaMP, GraphIHT, GraphCoSaMP, and GraphStoIHT]

Breast Cancer Dataset

295 samples with 78 positives (metastatic) and 217 negatives (non-metastatic), provided in Van De Vijver et al. (2002). The PPI network with 637 pathways is provided in Jacob et al. (2009). We restrict our analysis to 3,243 genes (nodes) with 19,938 edges; these cancer-related genes form a connected subgraph.

Algorithm      Cancer-related genes selected           ‖w^t‖_0   AUC
GraphStoIHT    BRCA2, CCND2, CDKN1A, ATM, AR, TOP2A    51.7      0.715
GraphIHT       ATM, CDKN1A, BRCA2, AR, TOP2A           55.2      0.714
ℓ1-Path        BRCA1, CDKN1A, ATM, DSC2                61.2      0.675
StoIHT         MKI67, NAT1, AR, TOP2A                  59.6      0.708
ℓ1/ℓ2-Edge     CCND3, ATM, CDH3                        51.4      0.705
ℓ1-Edge        CCND3, AR, CDH3                         39.9      0.698
ℓ1/ℓ2-Path     BRCA1, CDKN1A                           147.6     0.705
IHT            NAT1, TOP2A                             67.9      0.707


SLIDE 7

See you at Poster #92

Thank you!
