A Stochastic PCA Algorithm with an Exponential Convergence Rate

SLIDE 1

A Stochastic PCA Algorithm with an Exponential Convergence Rate

Ohad Shamir

Weizmann Institute of Science

NIPS Optimization Workshop, December 2014

Ohad Shamir Stochastic PCA with Exponential Convergence 1/19

SLIDE 2

Principal Component Analysis

PCA

Input: $x_1, \ldots, x_n \in \mathbb{R}^d$

Goal: Find $k$ directions with most variance

$$\max_{U \in \mathbb{R}^{d \times k}:\, U^\top U = I} \; \frac{1}{n} \sum_{i=1}^{n} \left\| U^\top x_i \right\|^2$$

For $k = 1$: Find leading eigenvector of covariance matrix

$$\max_{w \in \mathbb{R}^d:\, \|w\| = 1} \; w^\top \left( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top \right) w$$

SLIDE 4

Existing Approaches

$$\max_{w \in \mathbb{R}^d:\, \|w\| = 1} \; w^\top \left( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top \right) w$$

Regime: $n$, $d$ “large”, non-sparse matrix

Approach 1: Eigendecomposition

Compute leading eigenvector of $\frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top$ exactly (e.g. via QR decomposition)

Runtime: $O(d^3)$

SLIDE 6

Existing Approaches

Approach 2: Power Iterations

Initialize $w_1$ randomly on unit sphere

For $t = 1, 2, \ldots$

  $w'_{t+1} := \left( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top \right) w_t = \frac{1}{n} \sum_{i=1}^{n} \langle w_t, x_i \rangle x_i$

  $w_{t+1} := w'_{t+1} / \|w'_{t+1}\|$

$O\left( \frac{1}{\lambda} \log \frac{1}{\epsilon} \right)$ iterations for $\epsilon$-optimality ($\lambda$: eigengap)

$O(nd)$ runtime per iteration

Overall runtime $O\left( \frac{nd}{\lambda} \log \frac{d}{\epsilon} \right)$

Approach 2.5: Lanczos Iterations

More complex algorithm, but roughly similar iteration runtime and only $O\left( \frac{1}{\sqrt{\lambda}} \log \frac{1}{\epsilon} \right)$ iterations [Kuczyński and Woźniakowski 1989]

Overall runtime $O\left( \frac{nd}{\sqrt{\lambda}} \log \frac{d}{\epsilon} \right)$
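The power iteration above can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function name `power_iteration`, the iteration count, and the toy data are assumptions, and plain Python lists are used only to keep the sketch dependency-free. Each pass uses the $\frac{1}{n} \sum_i \langle w_t, x_i \rangle x_i$ form, so the covariance matrix is never built and one iteration costs $O(nd)$.

```python
import math
import random


def power_iteration(xs, num_iters=200, seed=0):
    """Approximate the leading eigenvector of A = (1/n) * sum_i x_i x_i^T."""
    rng = random.Random(seed)
    n, d = len(xs), len(xs[0])
    # Random start on the unit sphere (normalised Gaussian vector).
    w = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in w))
    w = [c / norm for c in w]
    for _ in range(num_iters):
        # w' = (1/n) * sum_i <w, x_i> x_i, an O(nd) pass over the data.
        w_new = [0.0] * d
        for x in xs:
            dot = sum(wj * xj for wj, xj in zip(w, x))
            for j in range(d):
                w_new[j] += dot * x[j] / n
        # Normalise back onto the unit sphere.
        norm = math.sqrt(sum(c * c for c in w_new))
        w = [c / norm for c in w_new]
    return w


# Made-up toy data: almost all variance lies along the first coordinate.
data = [[3.0, 0.1], [-3.0, 0.2], [2.5, -0.1], [-2.8, -0.2]]
w_power = power_iteration(data)
```

On this toy data the returned vector is aligned (up to sign) with the first coordinate axis.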

SLIDE 8

Existing Approaches

Approach 3: Stochastic/Incremental Algorithms

Example (Oja’s algorithm)

Initialize $w_1$ randomly on unit sphere

For $t = 1, 2, \ldots$

  Pick $i_t \in \{1, \ldots, n\}$ (randomly or otherwise)

  $w'_{t+1} := w_t + \eta_t x_{i_t} x_{i_t}^\top w_t$

  $w_{t+1} := w'_{t+1} / \|w'_{t+1}\|$

Also Krasulina 1969; Arora, Cotter, Livescu, Srebro 2012; Mitliagkas, Caramanis, Jain 2013; De Sa, Olukotun, Ré 2014...

$O(d)$ runtime per iteration

Iteration bounds:

  Balsubramani, Dasgupta, Freund 2013: $\tilde{O}\left( \frac{d}{\lambda^2} \left( \frac{1}{\epsilon} + d \right) \right)$

  De Sa, Olukotun, Ré 2014: For a different SGD method, $\tilde{O}\left( \frac{d}{\lambda^2 \epsilon} \right)$

Runtime: $\tilde{O}\left( \frac{d^2}{\lambda^2 \epsilon} \right)$
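For comparison with power iterations, Oja's update touches a single example per step. The sketch below is illustrative only: the function name `oja`, the step-size schedule $\eta_t = \eta_0 / t$, the iteration count, and the toy data are assumptions chosen for the example, not settings prescribed by the slides.

```python
import math
import random


def oja(xs, num_iters=5000, eta0=1.0, seed=0):
    """Oja's algorithm with an assumed step-size schedule eta_t = eta0 / t."""
    rng = random.Random(seed)
    d = len(xs[0])
    # Random start on the unit sphere.
    w = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in w))
    w = [c / norm for c in w]
    for t in range(1, num_iters + 1):
        x = rng.choice(xs)          # pick i_t uniformly at random
        eta = eta0 / t
        # w' = w + eta * x x^T w = w + eta * <w, x> x, an O(d) update.
        dot = sum(wj * xj for wj, xj in zip(w, x))
        w = [wj + eta * dot * xj for wj, xj in zip(w, x)]
        # Normalise back onto the unit sphere.
        norm = math.sqrt(sum(c * c for c in w))
        w = [c / norm for c in w]
    return w


# Made-up toy data: almost all variance lies along the first coordinate.
data = [[3.0, 0.1], [-3.0, 0.2], [2.5, -0.1], [-2.8, -0.2]]
w_oja = oja(data)
```

Each step is cheap, but the decaying step size is what keeps the sampling noise under control, which is the source of the $1/\epsilon$ dependence in the bounds above.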

SLIDE 9

Existing Approaches

Up to constants/log-factors:

Algorithm        Time per iter.   # iter.                    Runtime
Exact            -                -                          $d^3$
Power/Lanczos    $nd$             $1/\lambda^p$              $nd/\lambda^p$
Incremental      $d$              $d/(\lambda^2 \epsilon)$   $d^2/(\lambda^2 \epsilon)$

Main Question

Can we get the best of both worlds? $O(d)$ time per iteration and fast convergence (logarithmic dependence on $\epsilon$)?

SLIDE 10

Convex Optimization to the Rescue?

Our problem is equivalent to:

$$\min_{w:\, \|w\| = 1} \; \frac{1}{n} \sum_{i=1}^{n} \left( -\langle w, x_i \rangle^2 \right)$$

Much recent progress in solving strongly convex + smooth problems with finite-sum structure

$$\min_{w \in \mathcal{W}} \; \frac{1}{n} \sum_{i=1}^{n} f_i(w)$$

Stochastic algorithms with $O(d)$ runtime per iteration and exponential convergence

[Le Roux, Schmidt, Bach 2012; Shalev-Shwartz and Zhang 2012; Johnson and Zhang 2013; Zhang, Mahdavi, Jin 2013; Konečný and Richtárik 2013; Xiao and Zhang 2014; Zhang and Xiao 2014...]

SLIDE 11

Convex Optimization to the Rescue?

$$\min_{w:\, \|w\| = 1} \; \frac{1}{n} \sum_{i=1}^{n} \left( -\langle w, x_i \rangle^2 \right)$$

Unfortunately:

  Function not strongly convex, or even convex (in fact, concave everywhere)

  Has > 1 global optima, plateaus...

⇒ Existing results don’t work as-is

But: Maybe we can borrow some ideas...
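One way to see the concavity claim, sketched with the shorthand $A = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top$ and ignoring the unit-norm constraint for the moment:

$$f(w) = \frac{1}{n} \sum_{i=1}^{n} \left( -\langle w, x_i \rangle^2 \right) = -w^\top A w, \qquad \nabla^2 f(w) = -2A \preceq 0,$$

because $A$ is positive semidefinite. On the unit sphere, a leading eigenvector $v_1$ and its negation $-v_1$ give the same objective value, which accounts for the "more than one global optimum" remark.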

SLIDE 12

Algorithm

$$\min_{w:\, \|w\| = 1} \; \frac{1}{n} \sum_{i=1}^{n} \left( -\langle w, x_i \rangle^2 \right)$$

Oja Iteration

Choose $i_t \in \{1, \ldots, n\}$ at random

  $w'_{t+1} = w_t + \eta_t \langle w_t, x_{i_t} \rangle x_{i_t}$

  $w_{t+1} := w'_{t+1} / \|w'_{t+1}\|$

Essentially projected stochastic gradient descent
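A short check of the "projected SGD" reading, sketched with $f_i(w) = -\langle w, x_i \rangle^2$ as the per-example loss (a name introduced here for illustration):

$$\nabla f_i(w) = -2 \langle w, x_i \rangle x_i, \qquad w_t - \frac{\eta_t}{2} \nabla f_{i_t}(w_t) = w_t + \eta_t \langle w_t, x_{i_t} \rangle x_{i_t}.$$

So the Oja step is a stochastic gradient step with the factor 2 absorbed into the step size, and dividing by the norm is the Euclidean projection of a nonzero vector back onto the unit sphere.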

SLIDE 14

Algorithm

Letting $A = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top$, update step is

$$w'_{t+1} = w_t + \eta_t x_{i_t} x_{i_t}^\top w_t = w_t + \underbrace{\eta_t A w_t}_{\text{power/gradient step}} + \underbrace{\eta_t \left( x_{i_t} x_{i_t}^\top - A \right) w_t}_{\text{zero-mean noise}}$$

Main idea: Replace by

$$w'_{t+1} = w_t + \underbrace{\eta A w_t}_{\text{power/gradient step}} + \underbrace{\eta \left( x_{i_t} x_{i_t}^\top - A \right) (w_t - \tilde{u})}_{\text{zero-mean noise}}$$

where $\tilde{u}$ “close” to $w_t$ (similar to SVRG of Johnson and Zhang (2013))
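Expanding the modified update shows why it stays cheap per step. A sketch in the notation of this slide, where $\tilde{u}$ is the reference point:

$$\eta A w_t + \eta \left( x_{i_t} x_{i_t}^\top - A \right) (w_t - \tilde{u}) = \eta \left( x_{i_t} x_{i_t}^\top (w_t - \tilde{u}) + A \tilde{u} \right).$$

The only full-data quantity is $A \tilde{u}$, which can be computed once in $O(nd)$ time and reused, so each step still costs $O(d)$. The noise term has zero mean because $\mathbb{E}[x_{i_t} x_{i_t}^\top] = A$ under uniform sampling, and its size now scales with $\|w_t - \tilde{u}\|$ rather than $\|w_t\|$.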

SLIDE 16

Algorithm

VR-PCA

Parameters: Step size $\eta$, epoch length $m$

Input: Data set $\{x_i\}_{i=1}^{n}$, Initial unit vector $\tilde{w}_0$

For $s = 1, 2, \ldots$

  $\tilde{u} = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top \tilde{w}_{s-1}$

  $w_0 = \tilde{w}_{s-1}$

  For $t = 1, 2, \ldots, m$

    Pick $i_t \in \{1, \ldots, n\}$ uniformly at random

    $w'_t = w_{t-1} + \eta \left( x_{i_t} x_{i_t}^\top (w_{t-1} - \tilde{w}_{s-1}) + \tilde{u} \right)$

    $w_t = \frac{1}{\|w'_t\|} w'_t$

  $\tilde{w}_s = w_m$

To get $k > 1$ directions: Either repeat, or perform orthogonal-like iterations:

  Replace all vectors by $k \times d$ matrices

  Replace normalization step by orthogonalization step
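The VR-PCA pseudocode for $k = 1$ translates fairly directly into code. The following is a minimal sketch under stated assumptions: the names `vr_pca` and `_normalize`, the toy data, and the demo values `eta=0.5`, `num_epochs=50` are illustrative choices, not the theoretically prescribed parameters. The epoch length follows the $m = n$ choice of one pass per epoch.

```python
import math
import random


def _normalize(v):
    """Scale a nonzero vector to unit Euclidean norm."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def vr_pca(xs, eta, m, num_epochs, seed=0):
    """Variance-reduced PCA for the leading direction (k = 1)."""
    rng = random.Random(seed)
    n, d = len(xs), len(xs[0])
    w_tilde = _normalize([rng.gauss(0.0, 1.0) for _ in range(d)])
    for _ in range(num_epochs):
        # Full pass: u = (1/n) * sum_i x_i x_i^T w_tilde, O(nd) once per epoch.
        u = [0.0] * d
        for x in xs:
            dot = sum(a * b for a, b in zip(x, w_tilde))
            for j in range(d):
                u[j] += dot * x[j] / n
        w = list(w_tilde)
        for _ in range(m):
            x = rng.choice(xs)  # pick i_t uniformly at random
            # <x, w - w_tilde>: the stochastic part acts on the difference only.
            dot = sum(xj * (wj - wtj) for xj, wj, wtj in zip(x, w, w_tilde))
            # w' = w + eta * (x x^T (w - w_tilde) + u), then normalise.
            w = _normalize([wj + eta * (dot * xj + uj)
                            for wj, xj, uj in zip(w, x, u)])
        w_tilde = w
    return w_tilde


# Made-up toy data with norms below 1; variance is mostly along coordinate 0.
data = [[0.9, 0.03], [-0.9, 0.06], [0.75, -0.03], [-0.84, -0.06]]
w_vr = vr_pca(data, eta=0.5, m=len(data), num_epochs=50)
```

Unlike the Oja sketch, the step size here is constant: as the iterate and the epoch reference point both approach the leading eigenvector, the stochastic term shrinks on its own.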

SLIDE 17

Analysis

Theorem

Suppose $\max_i \|x_i\|^2 \le r$, and $A$ has leading eigenvector $v_1$. Assuming $\langle \tilde{w}_0, v_1 \rangle \ge \frac{1}{\sqrt{2}}$, then for any $\delta, \epsilon \in (0, 1)$, if

$$\eta \le \frac{c_1 \delta^2}{r^2} \lambda, \qquad m \ge \frac{c_2 \log(2/\delta)}{\eta \lambda}, \qquad m \eta^2 r^2 + r \sqrt{m \eta^2 \log(2/\delta)} \le c_3$$

(where $c_1, c_2, c_3$ are constants) and we run $T = \left\lceil \frac{\log(1/\epsilon)}{\log(2/\delta)} \right\rceil$ epochs, then

$$\Pr\left( \langle \tilde{w}_T, v_1 \rangle^2 \ge 1 - \epsilon \right) \ge 1 - 2 \log(1/\epsilon)\, \delta$$

Corollary

Picking $\eta$, $m$ appropriately, $\epsilon$-convergence w.h.p. in $O\left( d \left( n + \frac{1}{\lambda^2} \right) \log \frac{1}{\epsilon} \right)$ runtime

  Exponential convergence with $O(d)$-time iterations

  Proportional to # examples plus eigengap

  Proportional to single data pass if $\lambda \ge 1/\sqrt{n}$
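To get a feel for the epoch count in the theorem, the two closed-form quantities can be evaluated directly. This is only a numerical illustration: the helper names are hypothetical, the values of $\epsilon$ and $\delta$ are arbitrary, and reading $\log$ as the natural logarithm is an assumption (the epoch count $T$ does not depend on the base, but the probability bound does).

```python
import math


def epochs_needed(eps, delta):
    """T = ceil(log(1/eps) / log(2/delta)), the epoch count in the theorem."""
    return math.ceil(math.log(1.0 / eps) / math.log(2.0 / delta))


def success_probability_bound(eps, delta):
    """Lower bound 1 - 2*log(1/eps)*delta (natural log assumed)."""
    return 1.0 - 2.0 * math.log(1.0 / eps) * delta


# Arbitrary illustrative targets.
T = epochs_needed(1e-6, 0.01)
p = success_probability_bound(1e-6, 0.01)
```

For these values the formulas give $T = 3$ epochs and a probability bound of roughly 0.72. The bound becomes vacuous when $\delta$ is not small relative to $1 / \log(1/\epsilon)$.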

SLIDE 18

Proof Idea

Track decay of $F(w_t) = 1 - \langle w_t, v_1 \rangle^2$

Key Lemma

Assuming $\eta = \alpha \lambda$ and $F(w_t) \le 3/4$,

$$\mathbb{E}\left[ F(w_{t+1}) \mid w_t \right] \le \left( 1 - \Theta(\alpha \lambda^2) \right) F(w_t) + O\left( \alpha^2 \lambda^2 F(\tilde{w}_{s-1}) \right).$$

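SLIDE 25

Proof Idea

Assume $\eta = \alpha \lambda$ ($\alpha \ll 1$)

Using martingale arguments: W.h.p., never reach “flat” region in $m \le O\left( \frac{1}{\alpha^2 \lambda^2} \right)$ iterations

⇒ For all $t \le m$

$$\mathbb{E}\left[ F(w_{t+1}) \mid w_t \right] \le \left( 1 - \Theta(\alpha \lambda^2) \right) F(w_t) + O\left( \alpha^2 \lambda^2 F(\tilde{w}_{s-1}) \right).$$

Carefully unwinding recursion, and using $w_0 = \tilde{w}_{s-1}$,

$$\mathbb{E}\left[ F(w_m) \mid w_0 \right] \le \left( \left( 1 - \Theta(\alpha \lambda^2) \right)^m + O(\alpha) \right) F(\tilde{w}_{s-1})$$

⇒ If $m \ge \Omega\left( \frac{1}{\alpha \lambda^2} \right)$, $F(w_m)$ smaller than $F(\tilde{w}_{s-1})$ by some constant factor

Overall, if $\frac{1}{\alpha \lambda^2} \ll m \ll \frac{1}{\alpha^2 \lambda^2}$, every epoch shrinks $F$ by constant factor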
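The unwinding step can be sketched as a geometric sum. Here $c, c' > 0$ are hypothetical constants standing in for the $\Theta(\cdot)$ and $O(\cdot)$ terms, $a = 1 - c \alpha \lambda^2$, $b = c' \alpha^2 \lambda^2 F(\tilde{w}_{s-1})$, and the conditioning on staying out of the flat region is ignored:

$$\mathbb{E}\left[ F(w_m) \mid w_0 \right] \le a^m F(w_0) + b \sum_{j=0}^{m-1} a^j \le a^m F(\tilde{w}_{s-1}) + \frac{b}{1 - a} = \left( a^m + \frac{c'}{c} \alpha \right) F(\tilde{w}_{s-1}).$$

Since $a^m \le e^{-c \alpha \lambda^2 m}$, taking $m \ge \Omega\left( \frac{1}{\alpha \lambda^2} \right)$ pushes the first term below a constant smaller than 1, and $\alpha \ll 1$ keeps the second term small.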

SLIDE 26

Experiments

Ran VR-PCA with

  Random initialization

  $m = n$ (epoch ≈ one pass over data)

  $\eta = 0.05/\sqrt{n}$ (based on theory)

Compared to Oja’s algorithm with hand-tuned step-size

SLIDE 27

Experiments

[Figure: five panels titled λ=0.258, λ=0.163, λ=0.078, λ=0.016, λ=0.004. Legend: VR-PCA; Oja with ηt = 1/t, ηt = 3/t, ηt = 9/t, ηt = 27/t. Axis labels not recoverable.]

SLIDE 28

Preliminary Experiments

[Figure: single panel. Legend: VR-PCA, ηt=3, ηt=9, ηt=27, ηt=81, ηt=243, Hybrid. Axis labels not recoverable.]

SLIDE 30

Work-in-progress and Open Questions

Runtime is $O\left( d \left( n + \frac{1}{\lambda^2} \right) \log \frac{1}{\epsilon} \right)$; can we get $O\left( d \left( n + \frac{1}{\lambda} \right) \log \frac{1}{\epsilon} \right)$ or better, analogous to convex case?

  Experimentally, required parameters do seem somewhat different

Generalizing analysis to PCA with several directions

Theoretically optimal parameters without knowing $\lambda$

Other “fast-stochastic” approaches

Other non-convex problems

SLIDE 31

More details: arXiv:1409.2848

THANKS!