An Efficient Gauss-Newton Algorithm for Symmetric Low-Rank Product Matrix Approximations


SLIDE 1

An Efficient Gauss-Newton Algorithm for Symmetric Low-Rank Product Matrix Approximations

Xin Liu

State Key Laboratory of Scientific and Engineering Computing, Institute of Computational Mathematics and Scientific/Engineering Computing, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, China
(Joint work with Zaiwen Wen¹ and Yin Zhang²)
2014 Workshop on Optimization for Modern Computation, BICMR, Peking University, September 3, 2014

¹Peking University, China   ²Rice University, U.S.

SLIDE 2

Original Motivation

Low-rank approximation: given $B \in \mathbb{R}^{n\times m}$ and $k < \min(n, m)$,

$$\min_{X \in \mathbb{R}^{n\times k},\ Z \in \mathbb{R}^{m\times k}} \|XZ^T - B\|_F.$$

Closely related to SVD:
a k-dominant (k-D) SVD of B yields a solution (see the sketch below);
a solution + QR (n × k) and SVD (k × k) yields the k-D SVD of B.

How about the symmetric case? For $A = A^T \in \mathbb{R}^{n\times n}$ (e.g., $A = BB^T$),

$$\min_{X \in \mathbb{R}^{n\times k}} \|XX^T - A\|_F.$$

SLIDE 3

Symmetric Low-Rank Product Optimization

A nonlinear, nonconvex least squares problem:

$$\min_{X\in\mathbb{R}^{n\times k}} \|XX^T - A\|_F^2.$$

Fundamental in low-rank matrix approximations. Principal subspace of A:

$$\mathrm{span}(X^*) = \mathrm{span}\{q_1, q_2, \ldots, q_k\},$$

where $X^*$ is a global minimizer and $\{q_j\}_{j=1}^{k}$ are the k dominant eigenvectors of A. For $A = BB^T$, the columns of X are "principal components" of B.

SLIDE 4

Most Established Eigensolvers

Why not just call eigs (or svds) in Matlab (ARPACK)? Why not use one of the existing eigensolvers?
Emerging applications demand new capacities:
high efficiency at moderate accuracy;
high eigenspace dimensions;
high parallel scalability;
warm-start capacity.
Established eigensolvers often fall short in one or more of these aspects. Advanced scientific computing and evolving computer architectures call for new algorithms, of either general or special purpose.

SLIDE 5

Block Methods

Block vs. sequential (Lanczos-type) methods:
block SpMV: $AV = [Av_1\ Av_2\ \cdots\ Av_k]$;
sequential SpMVs: $Av \to A^2v \to \cdots \to A^kv$ (+ inner products for orthogonalization).
As k increases, block methods gain the advantage, and they can be warm-started in an iterative setting.
Classic block method, SSI (a block power method): $X_{i+1} = \mathrm{orth}(AX_i)$ (see the sketch below).
Other block algorithms: block Jacobi-Davidson, FEAST; trace minimization: LOBPCG, LMSVD.
Research on block methods still seems largely unsettled.
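A minimal Matlab sketch of the SSI iteration above; A, k, and maxit are assumed inputs, and a thin QR plays the role of orth:

n = size(A, 1);
X = randn(n, k);              % random full-rank start (or a warm start)
for i = 1:maxit
    [X, ~] = qr(A*X, 0);      % one block SpMV, then orthonormalization
end
% columns of X approximate a dominant k-dimensional invariant subspace of A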

SLIDE 6

Trace Minimization

Trace minimization:

$$\max_{X\in\mathbb{R}^{n\times k}} \mathrm{tr}(X^TAX) \quad \text{s.t.}\ X^TX = I.$$

L.-Wen-Zhang, Limited Memory Block Krylov Subspace Optimization for Computing Dominant Singular Value Decompositions, SIAM Journal on Scientific Computing, 35-3 (2013). With $A := BB^T$, the main iteration of LMSVD is

$$X^{(i+1)} = \mathrm{orth}\big(AY^{(i)}\big), \quad Y^{(i)} = \arg\max\big\{\phi(X) \mid X^TX = I,\ X \in \mathcal{S}_i\big\}, \quad \mathcal{S}_i = \mathrm{span}\big\{X^{(i)}, X^{(i-1)}, \ldots, X^{(i-p)}\big\}.$$

LMSVD code is available at MathWorks (Google: LMSVD).

SLIDE 7

Bottleneck

Two main types of operations: AX and RR/orth (Rayleigh-Ritz / orthogonalization).
As k increases, the cost of AX becomes small relative to RR/orth, so RR/orth becomes the bottleneck.
Parallel scalability:
AX splits into $Ax_1 \cup Ax_2 \cup \cdots \cup Ax_k$: higher;
RR/orth inherits sequentiality: lower.
Avoid the bottleneck? Do less RR/orth.
No free lunch? Do more BLAS3 (higher scalability than AX).

SLIDE 8

Orthogonalization-Free Models

Unconstrained model (Dai-Jiang-Cui, 2013):

$$\min_{X\in\mathbb{R}^{n\times k}} \tfrac{1}{4}\|X^TX\|_F^2 + \tfrac{1}{2}\mathrm{tr}(X^TAX).$$

Trace-penalty minimization (EIGPEN, Wen-Yang-L.-Zhang, 2013, available at Optimization Online):

$$\min_{X\in\mathbb{R}^{n\times k}} f_\mu(X) := \tfrac{1}{2}\mathrm{tr}(X^TAX) + \tfrac{\mu}{4}\|X^TX - I\|_F^2.$$

Good properties:
equivalence to trace minimization does NOT require µ → ∞;
no non-global local minimizers, fewer undesired saddle points;
RR/orth mostly replaced by "big" BLAS3;
efficient at moderate accuracy, numerically stable;
parallel scalability appears promising.
Algorithm: gradient method with Barzilai-Borwein step size and Rayleigh-Ritz restarts (one step is sketched below).
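For illustration, a hedged Matlab sketch of one BB gradient step on the trace-penalty function f_mu; A, mu, the current X, and the previous pair Xp, Gp are assumed to be in scope:

G  = A*X + mu*X*(X'*X - eye(size(X, 2)));   % gradient of f_mu at X
dX = X - Xp;  dG = G - Gp;
tau = (dX(:)'*dX(:)) / abs(dX(:)'*dG(:));   % BB1 step size
Xp = X;  Gp = G;                            % save for the next step
X  = X - tau*G;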

SLIDE 9

Why New Approach?

Questions about EIGPEN:
Gradient method + BB. (How about higher-order methods?)
Condition number at a minimizer $\hat X$:
k = 1: $\kappa(\nabla^2 f_\mu(\hat X)) = \dfrac{\lambda_n - \lambda_1}{\lambda_2 - \lambda_1}$;
k > 1: the smallest eigenvalue of $\nabla^2 f_\mu(\hat X)$ is 0 (how about a linear convergence rate?), while restricted to directions orthogonal to $Q_k$,

$$\kappa\Big(\nabla^2 f_\mu(\hat X)\Big|_{Q_k^\perp}\Big) = \frac{\max\big\{\mathrm{tr}(S^T\nabla^2 f_\mu(\hat X)(S)) : \mathrm{tr}(S^TS) = 1,\ S^TQ_k = 0\big\}}{\min\big\{\mathrm{tr}(S^T\nabla^2 f_\mu(\hat X)(S)) : \mathrm{tr}(S^TS) = 1,\ S^TQ_k = 0\big\}} = \frac{\lambda_n - \lambda_1}{\lambda_{k+1} - \lambda_k}.$$

(This explains why the RR restart is useful.)
µ must be tuned properly. (How to avoid µ?)
Solution, SLRP: $\min_{X\in\mathbb{R}^{n\times k}} \|XX^T - A\|_F^2$.

SLIDE 10

Notation

Let the eigenvalues of A be in descending order:

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n.$$

Eigenvalue decomposition: $A = Q_n\Lambda_n Q_n^T$, with $Q_n^TQ_n = I$ and $\Lambda_n$ diagonal.
k-D principal eigenspace: $\mathrm{span}(Q_k) = \mathrm{span}\{q_1, q_2, \ldots, q_k\}$.
k-D principal eigenpair: $(Q_k, \Lambda_k)$.

SLIDE 11

Optimal Solutions of SLRP

Equivalence:
Assume that $A = A^T \in \mathbb{R}^{n\times n}$ with $\lambda_k \ge 0$. Then $X \in \mathbb{R}^{n\times k}$ is a solution to $\min \|XX^T - A\|_F^2$ if and only if it has the SVD

$$X = Q_k\Lambda_k^{1/2}V^T,$$

where $(Q_k, \Lambda_k)$ is a k-D principal eigenpair of A and $V \in \mathbb{R}^{k\times k}$ is orthogonal but otherwise arbitrary.
First-order condition for SLRP: $AX = X(X^TX)$. Stationary points span invariant subspaces (a quick residual check is sketched below).
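A two-line Matlab check of this first-order condition; A and a candidate X are assumed in scope:

% relative residual of AX = X(X'X); ~0 means X spans an invariant subspace
stat_res = norm(A*X - X*(X'*X), 'fro') / max(1, norm(A, 'fro'));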

SLIDE 12

Gauss-Newton Review

Nonlinear least squares model:

$$\min_{x\in\mathbb{R}^n} f(x) := \tfrac{1}{2}r(x)^Tr(x), \qquad r(x): \mathbb{R}^n \to \mathbb{R}^m.$$

Linearize: $r(x + s) \approx r(x) + J(x)s$, where $J(x)$ is the Jacobian.
Normal equations + line search (minimize the linear approximation):

$$J(x)^TJ(x)\,s = -J(x)^Tr(x), \qquad x \leftarrow x + \alpha s.$$

Some properties: fast for small residuals, slow for large residuals; local convergence may require α < 1 all the time. A generic step is sketched below.
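A generic Gauss-Newton step as a Matlab sketch; the function handles r and J and the line-search step size alpha are assumptions:

rx = r(x);
Jx = J(x);
s  = -(Jx'*Jx) \ (Jx'*rx);   % solve the normal equations for the GN step
x  = x + alpha*s;            % alpha from a line search (often alpha = 1)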

SLIDE 13

SLRP

SLRP as a nonlinear least squares model:

$$\min_{X\in\mathbb{R}^{n\times k}} f(X) := \tfrac{1}{2}\|R(X)\|_F^2, \qquad R(X) := XX^T - A.$$

Let $J(X): \mathbb{R}^{n\times k} \to \mathbb{R}^{n\times n}$ be the Jacobian operator of R(X) at X.
Normal equations (size nk × nk):

$$J(X)^TJ(X)(S) = -J(X)^T(R(X)).$$

There are infinitely many solutions since J(X) is rank deficient.

SLIDE 14

GN Directions

The special structure of the normal equations allows a low-cost solution:

$$SX^TX + XS^TX = AX - X(X^TX).$$

GN Direction:
Let $X \in \mathbb{R}^{n\times k}$ be full rank and $P_X = X(X^TX)^{-1}X^T$. Then

$$S_C = \big(I - \tfrac{1}{2}P_X\big)\big(AX(X^TX)^{-1} - X\big) + XC,$$

where $C^T = -C$, satisfies the normal equations. In particular, for C = 0,

$$S_0 = \big(I - \tfrac{1}{2}P_X\big)\big(AX(X^TX)^{-1} - X\big)$$

is a minimum weighted-norm Gauss-Newton direction at X.

SLIDE 15

SLRPGN (Theoretical Version)

Gauss-Newton (GN): while not "converged", do
1. If $\sigma_{\min}(X) < \delta$, set X = X + P. (Correction step)
2. Select $\alpha = \min\big\{1,\ \sigma_{\min}^3(X)/\|\nabla f(X)\|_F\big\}$, set $X = X + \alpha S_0$.

Calculation of the GN step:

$$Y = X(X^TX)^{-1}, \qquad G = AY - X, \qquad S_0 = G - X(Y^TG)/2.$$

Computational cost per iteration:
1 block SpMV: AY;
3 dense matrix multiplications;
1 k × k linear system with n right-hand sides.
(One such step is sketched below.)
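In Matlab, one theoretical SLRPGN step might look as follows; this is a sketch assuming A, a full-rank X, and the gradient formula $\nabla f(X) = 2(X(X^TX) - AX)$ for $f(X) = \tfrac{1}{2}\|XX^T - A\|_F^2$:

W  = X'*X;
Y  = X / W;                       % Y = X*(X'*X)^{-1}: one k-by-k solve, n rhs
G  = A*Y - X;                     % the block SpMV
S0 = G - X*((Y'*G)/2);            % minimum weighted-norm GN direction
gradf = 2*(X*W - A*X);            % costs an extra SpMV if computed exactly
alpha = min(1, min(svd(X))^3 / norm(gradf, 'fro'));
X  = X + alpha*S0;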

SLIDE 16

SLRPGN (Practical Implementation)

So far, in practice:
α = 1 appears always to work well;
the correction step is hardly ever invoked.

[X, Y] = GN(A, X): while not "converged", do
1. $Y = X(X^TX)^{-1}$
2. $X = AY - X(Y^TAY - I)/2$

A simple algorithm:
a two-liner with no parameters;
no orthogonalization;
no Rayleigh-Ritz (unless eigenpairs are required).
(A runnable sketch follows.)
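A self-contained Matlab sketch of this practical iteration; since the slide leaves "converged" unspecified, the stopping rule below (based on the first-order condition) is an assumption:

function [X, Y] = slrpgn_practical(A, X, tol, maxit)
% Sketch of practical SLRPGN: alpha = 1, no correction step.
k = size(X, 2);
for it = 1:maxit
    Y  = X / (X'*X);                  % Y = X*(X'*X)^{-1}
    AY = A*Y;                         % the only block SpMV per iteration
    X  = AY - X*((Y'*AY - eye(k))/2);
    if norm(A*X - X*(X'*X), 'fro') <= tol*norm(A, 'fro')   % assumed test
        break                         % (costs an extra SpMV; cheaper
    end                               %  surrogates are possible)
end
Y = X / (X'*X);                       % return Y matching the final X
end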

SLIDE 17

Step Size and Correction Step

Full rankness: $\sigma_{\min}(X_{i+1}) \ge 0.75\,\sigma_{\min}(X_i)$.
Correction step:

$$\delta \le \frac{\sqrt{\lambda_n/\lambda_1}}{4 + \sqrt{20}}\sqrt{\frac{\lambda_n}{k}}, \qquad X_c = X + P \ \Big(P := \sqrt{\lambda_n}\,U_pV_p^T, \ \text{where } U_p^TX = 0 \text{ and } U_p^TU_p = I\Big).$$

Key properties:
$\sigma_{\min}(X_c) \ge \delta$;
$f(X_c) < f(X) - \tfrac{1}{4}\lambda_n^2$.

SLIDE 18

Convergence Results

Theorem (Global Convergence).
Suppose that A ≻ 0. Let $\{X_i\}$ be generated by SLRPGN(TH) starting from a full-rank initial point. Then after a finite number of iterations, the step size α = 1 is always taken, no more correction steps occur, and $\nabla f(X_i) \to 0$.

f(X) has no local (non-global) minima, and it is unlikely that the iterates get trapped at a saddle point. A better local convergence result holds if we further assume $\lambda_k > \lambda_{k+1}$.

Theorem (Q-Linear Rate).
Suppose A ≻ 0 and $\lambda_k > \lambda_{k+1}$. Then $\{X_i\}$, a sequence generated by SLRPGN(PR) starting from a full-rank initial point $X_0 \in \mathcal{L}(\gamma)$, globally converges to $\mathcal{L}(f^*)$, where $\mathcal{L}(\gamma) := \{X \mid f(X) \le \gamma\}$ denotes the level set, $f^*$ denotes the global minimum of SLRP, and $\gamma > f^*$ is a constant. Moreover, the gradient sequence $\{\nabla f(X_i)\}$ converges to zero at the Q-linear rate $\lambda_{k+1}/\lambda_k$.

SLIDE 19

Low-rank Approximation without SVD

Recall, for $B \in \mathbb{R}^{n\times m}$:

$$\min_{X\in\mathbb{R}^{n\times k},\ Z\in\mathbb{R}^{m\times k}} \|XZ^T - B\|_F.$$

Via SVD: $XZ^T = (U_k\Sigma_k)V_k^T$, the principal part of $B = U\Sigma V^T$.
Via GN:

A = @(X) B*(B'*X);
[X, Y] = SLRPGN(A, X);
Z = B'*Y;              % i.e., Z' = Y'*B
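An optional sanity check of the GN route against Matlab's svds; this is a sketch, with k and the outputs X, Y from above assumed:

[U, S, V] = svds(B, k);
% at convergence X*Y' projects onto the dominant left singular subspace,
% so X*(B'*Y)' = (X*Y')*B should match the k-dominant SVD part of B
err = norm(X*(B'*Y)' - U*S*V', 'fro');   % expected to be small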

SLIDE 20

Experiment Environment

Platform: all experiments were performed on a Linux workstation with two Intel Xeon E5-2697 CPUs (2.70 GHz, 12 cores) and 128 GB of memory, running Ubuntu 12.04 and Matlab 2013b.
Tested methods:
Matlab EIGS — Lanczos-based (ARPACK, Sorensen et al.);
LANSVD — Lanczos-based (PROPACK, R. M. Larsen);
LMSVD — block subspace method (L.-Wen-Zhang, SISC, 2013);
SLRPBB — BB + gradient (EIGPEN) (Wen-Yang-L.-Zhang);
SLRPGN — the proposed GN algorithm.
Required accuracy: moderate.

SLIDE 21

Comparison Between BB and GN

[Figure: SLRPBB vs. SLRPGN-with-fun-grad vs. SLRPGN as the number of computed dominant singular values ranges over 100-600. Panels: (a) optimal residual, $f(X)/\|AA^T\|_F^2$; (b) runtime, CPU time in seconds; (c) gradient norm, $\|\nabla f(X)\|_F/\|AA^T\|_F^2$.]

(SLRPBB and SLRPGN with a varying number of computed singular values on the random example "randcolu", size 10000 × 10000, tol = $10^{-4}$.)

SLIDE 22

Comparison Between BB and GN (Cont’d)

[Figure: gradient norm $\|\nabla f(X)\|_F/\|AA^T\|_F^2$ versus iteration for SLRPBB and SLRPGN-with-fun-grad. Panels: (d) k = 100; (e) k = 400; (f) k = 450; (g) k = 500; (h) k = 550; (i) k = 600.]

SLIDE 23

Matrix Separation

Robust Principal Component Analysis: the data matrix is $M = L_0 + S_0 + \omega \in \mathbb{R}^{m\times n}$, where $L_0$ is low-rank, $S_0$ is sparse, and ω is small noise. Given M, find $L_0$ and $S_0$ approximately by solving

$$\min_{L,S} \|L\|_* + \mu\|S\|_1, \quad \text{s.t.}\ L + S = M.$$

Matlab code: IALM (Lin et al.);
Alternating Direction Method of Multipliers (ADMM);
calls an SVD at every iteration (warm start desired).
Test cases: random instances. (The sparse-part update is sketched below.)
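For reference, the sparse-part subproblem in such ADMM schemes has a closed-form soft-thresholding solution; a one-line Matlab sketch, with the working matrix W and the threshold mu assumed in scope:

S = sign(W) .* max(abs(W) - mu, 0);   % prox of mu*||S||_1, applied entrywise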

SLIDE 24

RPCA Results

CPU time in seconds:

[Figure: CPU time versus size of the square matrix (500-4000) for LMSVD, LANSVD, EIGS, and SLRP. Panels: (a) rank = 0.10n, sample 5% (τr = 0.10, τs = 0.05); (b) rank = 0.05n, sample 10% (τr = 0.05, τs = 0.10).]

(All achieved similar accuracy.)

SLIDE 25

Matrix Completion

Find a low-rank matrix from a sampled set of its entries. Given the entries of M + ω on Ω, find X ≈ M by solving

$$\min_{X} \|X\|_*, \quad \text{s.t.}\ X_{ij} = M_{ij}, \ \forall (i, j) \in \Omega.$$

Matlab codes: SVT and NNLS (singular value thresholding);
both call an SVD at every iteration (warm start desired).
Test cases: random instances. (The thresholding kernel is sketched below.)
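The per-iteration kernel is a singular-value shrinkage; a Matlab sketch using a full SVD for clarity, with W and tau assumed in scope (the cited solvers replace svd by a rank-k routine such as LANSVD, LMSVD, or SLRPGN):

[U, S, V] = svd(W, 'econ');
r = nnz(diag(S) > tau);                                 % rank after shrinkage
L = U(:,1:r) * (S(1:r,1:r) - tau*eye(r)) * V(:,1:r)';   % prox of tau*||.||_*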

SLIDE 26

SVT Results

CPU time in seconds:

[Figure: CPU time versus sample ratio for LMSVD, LANSVD, and SLRPGN within SVT. Panels: (a) rank = 10; (b) rank = 50.]

(All achieved similar accuracy.)

SLIDE 27

NNLS Results

CPU time in seconds³:

[Figure: CPU time versus sample ratio for LMSVD, LANSVD, and SLRPGN within NNLS. Panels: (a) rank = 10; (b) rank = 50.]

(All achieved similar accuracy.)

³The sparse-dense matrix multiplication uses MKL.

SLIDE 28

Summary and Remarks

SLRP: $\min \|XX^T - A\|_F^2$, with output (X, Y).
GN iteration: $Y = X(X^TX)^{-1}$; $X = AY - X(Y^TAY - I)/2$.
SLRPGN: simple and parameter-free;
principal subspace without SVD or Rayleigh-Ritz;
benefit of concurrency already seen in plain Matlab;
global convergence and a local Q-linear convergence rate;
effective for small residuals and low-to-moderate accuracy (so far).
Further work:
strategically placed Rayleigh-Ritz will improve accuracy;
other eigen-techniques (polynomial filtering, deflation, ...) help too;
SLRPGN + (a few RR): potential as an eigensolver, worth investigating;
parallel scalability to be exploited.

SLIDE 29

Thank you for your attention!
