SLIDE 1

A Projected Preconditioned Conjugate Gradient algorithm for computing a large invariant subspace of a Hermitian matrix

Eugene Vecharynski, Lawrence Berkeley National Laboratory; joint work with Chao Yang (LBNL) and John E. Pask (LLNL). Scientific Computing and Matrix Computations Seminar, UC Berkeley, October 22, 2014.

SLIDE 2

Outline

◮ Motivation, the problem, existing approaches
◮ The Projected Preconditioned Conjugate Gradient (PPCG) algorithm for computing LARGE invariant subspaces
◮ Computational examples
◮ Conclusions and future work

Work supported by DOE SciDAC projects.

SLIDE 3

Motivation

Extreme-scale electronic structure calculations

◮ Ground and excited states of a quantum many-body system
◮ Density functional theory based electronic structure calculations require several lowest eigenpairs
◮ Density functional perturbation theory often requires many more lowest eigenpairs

SLIDES 4-5

HUGE eigenvalue problems

Find a subset of lowest eigenpairs (λ, x) of Ax = λx, A = A∗.

◮ The problem size n is well beyond 10^6
◮ Only a small fraction of the lowest eigenpairs is wanted (≈ 1%)
◮ The number of targeted eigenpairs can be on the order of 10^3–10^5
◮ Solutions of the eigenvalue problem are computed repeatedly after slight perturbations of A (SCF loop)

Need powerful eigensolvers that compute LARGE eigenspaces!

SLIDES 6-7

Modern eigensolvers

Desirable features:
◮ Block iterations
◮ Preconditioning
◮ Low memory requirement
◮ Simplicity of implementation

State-of-the-art methods:
◮ Block Davidson (’75)
◮ LOBPCG (Knyazev ’01)
◮ Jacobi–Davidson (Sleijpen–van der Vorst ’96)
◮ TRACEMIN (Sameh–Wisniewski ’82)

SLIDES 8-9

Davidson–Liu (Preconditioned Steepest Descent) algorithm

Given an initial guess X^(0) and a preconditioner M, compute the k lowest eigenpairs of A:

◮ X ← X^(0)
◮ While convergence not reached
  • W ← M(AX − X(X∗AX))
  • S ← [X, W]
  • Find the eigenvectors C associated with the k smallest eigenvalues of (S∗AS)C = (S∗S)CΩ, C∗(S∗S)C = I
  • X ← SC
◮ EndWhile
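For concreteness, here is a minimal dense-matrix sketch of one Davidson–Liu step as written above. It is illustrative only: the names davidson_liu_step and apply_M are placeholders, A is assumed to be a Hermitian numpy array, and the projected eigenproblem is solved with SciPy's dense generalized Hermitian solver rather than a distributed kernel.

```python
import numpy as np
from scipy.linalg import eigh

def davidson_liu_step(A, X, apply_M, k):
    """One Davidson-Liu (preconditioned steepest descent) iteration:
    form the preconditioned block residual, expand the subspace,
    and extract the k lowest Ritz vectors via Rayleigh-Ritz."""
    AX = A @ X
    R = AX - X @ (X.conj().T @ AX)      # block residual AX - X(X*AX)
    W = apply_M(R)                      # apply the preconditioner M
    S = np.hstack([X, W])               # search subspace S = [X, W]
    # Rayleigh-Ritz: (S*AS) C = (S*S) C Omega, keep the k smallest eigenvalues
    theta, C = eigh(S.conj().T @ (A @ S), S.conj().T @ S)
    return S @ C[:, :k]                 # X <- SC
```

A full solver would simply repeat this step, monitoring the norm of the block residual R for convergence.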

SLIDES 10-14

The Rayleigh–Ritz (RR) procedure

At each step, RR leads to the projected dense eigenproblem
    (S∗AS)C = (S∗S)CΩ,  C∗(S∗S)C = I

◮ Size of the projected problem is proportional to k (2k for Davidson–Liu, 3k for LOBPCG)
◮ Complexity of the dense eigensolver scales as k^3
◮ Limited parallel scalability of existing kernels for dense eigenproblems (e.g., ScaLAPACK)
◮ RR becomes a computational bottleneck in practice
◮ Our idea: entirely remove or reduce the number of RR calculations!

SLIDE 15

The RR cost

Computation | Time (sec.)
GEMM        |          11
AX          |           6
RR          |          66

Table: The timing profile for Davidson–Liu computing ≈ 2,000 eigenpairs on 480 cores.

SLIDES 16-17

Existing “RR-avoiding” techniques

◮ Spectrum slicing (Schofield–Chelikowsky–Saad ’12)
  • Break the spectrum into subintervals
  • Use polynomial filters to compute interior eigenvalues in different subintervals simultaneously
  • Orthogonalize whenever necessary
  • Available in the PARSEC software
  Dividing the spectrum is not trivial; preconditioning is not used.

◮ Penalized trace minimization (Wen–Yang–Liu–Zhang ’13)
  • Based on the unconstrained formulation
        min_X  trace(X∗AX) + µ‖X∗X − I‖²_F
  • Steepest descent with Barzilai–Borwein line search
  • RR and orthonormalization only every once in a while
  The penalty parameter µ may be hard to estimate.
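As a rough illustration of the penalty idea (a sketch under simplifying assumptions, not the cited authors' code): for a real symmetric A, minimize trace(X^T A X) + µ||X^T X − I||_F^2 by gradient descent with a Barzilai–Borwein step, so that neither RR nor orthonormalization appears inside the loop. The function name, initial step length, and fixed iteration count are assumptions of the sketch.

```python
import numpy as np

def penalized_trace_min(A, X0, mu, maxiter=500):
    """Gradient descent with a Barzilai-Borwein step for
        min_X  trace(X^T A X) + mu * ||X^T X - I||_F^2
    with a real symmetric A; no RR and no orthonormalization in the loop."""
    def grad(X):
        # d/dX trace(X^T A X) = 2 A X;  d/dX ||X^T X - I||_F^2 = 4 X (X^T X - I)
        return 2.0 * (A @ X) + 4.0 * mu * X @ (X.T @ X - np.eye(X.shape[1]))

    X, G = X0.copy(), grad(X0)
    alpha = 1e-3                                   # initial step length (assumed)
    for _ in range(maxiter):
        X_new = X - alpha * G
        G_new = grad(X_new)
        s, y = X_new - X, G_new - G
        alpha = abs(np.vdot(s, y)) / max(np.vdot(y, y), 1e-300)  # BB step length
        X, G = X_new, G_new
    return X
```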

SLIDES 18-19

Projected gradient methods: the general framework

◮ Elegant approach for solving constrained optimization problems: Rosen ’60–61, Goldstein ’64, Levitin–Polyak ’66

    min_x f(x),  x ∈ C

◮ In a nutshell: skip the constraint, make a step in the gradient direction, then project back onto the constraint set
◮ Convergence theory exists for classes of constraints

SLIDES 20-23

Projected gradient methods: the eigenvalue problem

◮ Block Rayleigh quotient

    f(X) = trace((X∗X)^{-1}(X∗AX)),  X ∈ C^{n×k}

◮ The minimization problem

    min_X f(X),  X ∈ C = {X : X∗X = I}

◮ The preconditioned gradient direction for f(X)

    W = M(AX − X(X∗AX)),  X ∈ C

◮ The Preconditioned Steepest Descent (PSD), or Davidson–Liu, update

    X ← XC_X + WC_W,

  where C_X and C_W are k-by-k iteration coefficient matrices.
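As a reminder of why W is a gradient direction, here is a short, standard derivation sketch (not specific to this talk) of the first-order behavior of the block Rayleigh quotient at an orthonormal X:

```latex
% Sketch: gradient of the block Rayleigh quotient at X with X^*X = I.
\begin{aligned}
  f(X) &= \operatorname{trace}\big((X^*X)^{-1}(X^*AX)\big), \\
  f(X+D) - f(X) &= 2\,\mathrm{Re}\,\operatorname{trace}\big(D^*\,(AX - X(X^*AX))\big) + O(\|D\|^2), \\
  \Rightarrow\quad \nabla f(X) &\propto AX - X(X^*AX), \qquad
  W = M\big(AX - X(X^*AX)\big).
\end{aligned}
```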

SLIDE 24

PSD

    X ← XC_X + WC_W,  X∗X = I

SLIDES 25-26

A projected PSD

    X̂ ← XĈ_X + WĈ_W,  X ← orth(X̂),  X∗X = I

SLIDE 27

Locally Optimal Block PCG (LOBPCG)

    X ← XC_X + WC_W + PC_P,  P ∈ col{X, X_prev},  X∗X = I

SLIDE 28

A projected PCG (PPCG)

    X̂ ← XĈ_X + WĈ_W + PĈ_P,  P ∈ col{X, X_prev},  X ← orth(X̂),  X∗X = I

SLIDES 29-31

A projected PCG (PPCG)

◮ Block iteration decouples into single-vector updates

    x̂_j ← α_j x_j + β_j w_j + γ_j p_j,  j = 1, . . . , k

◮ Choose α_j, β_j, and γ_j to minimize x∗Ax over span{x_j, w_j, p_j}
  ⇒ solve k 3-by-3 eigenproblems

◮ X ← orth(X̂),  X̂ = [x̂_1, . . . , x̂_k]
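A minimal numpy sketch of one such single-column update follows (the helper name ppcg_column_update is hypothetical; the 2-by-2 case before a conjugate direction p exists is handled by passing p=None; in the practical algorithm the loop runs over small subblocks rather than single columns):

```python
import numpy as np
from scipy.linalg import eigh

def ppcg_column_update(A, x, w, p=None):
    """Pick alpha, beta, gamma minimizing the Rayleigh quotient over
    span{x, w, p} by solving one small (3x3, or 2x2 when p is absent)
    generalized eigenproblem, then update the conjugate direction."""
    S = np.column_stack([v for v in (x, w, p) if v is not None])
    theta, C = eigh(S.conj().T @ (A @ S), S.conj().T @ S)  # ascending theta
    c = C[:, 0]                       # coefficients of the smallest Ritz value
    p_new = S[:, 1:] @ c[1:]          # p <- beta*w (+ gamma*p)
    x_new = c[0] * x + p_new          # x_hat <- alpha*x + p
    return x_new, p_new
```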

SLIDES 32-35

The PPCG algorithm

Given an initial guess X^(0) and a preconditioner T, compute the k lowest eigenpairs of A:

◮ X ← X^(0), P ← [ ]
◮ While convergence not reached
  • W ← (I − XX∗)T(I − XX∗)(AX − X(X∗AX))  (the projected operator (I − XX∗)T(I − XX∗) plays the role of the preconditioner M above)
  • For j = 1, . . . , k Do  ⇐ can be done for subblocks!
    ⋄ S ← [x_j, w_j, p_j]
    ⋄ Find the eigenvector c = (α_j, β_j, γ_j)^T of S∗ASc = θ(S∗S)c corresponding to the smallest θ
    ⋄ p_j ← β_j w_j + γ_j p_j
    ⋄ x̂_j ← α_j x_j + p_j
  • EndFor
  • X̂ ← [x̂_1, . . . , x̂_k], P ← [p_1, . . . , p_k]
  • X ← orth(X̂)
  • Once in a while, X ← RR(X)
◮ EndWhile
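Putting the pieces together, below is a compact single-column sketch of the loop above for a dense Hermitian numpy array A, with no subblocking, locking, or buffer vectors. It illustrates the scheme on the slide rather than the Quantum Espresso implementation; the function names, tolerance, and the use of numpy.linalg.qr for the orth step are assumptions (a CholQR variant is sketched after the practical-aspects slide).

```python
import numpy as np
from scipy.linalg import eigh

def ppcg(A, X0, apply_T, tol=1e-6, maxiter=200, rr_period=5):
    """Projected Preconditioned Conjugate Gradient sketch: column-wise
    3x3 Rayleigh-Ritz updates, block orthonormalization, and an
    occasional full Rayleigh-Ritz step."""
    X = X0.copy()
    n, k = X.shape
    P = np.zeros_like(X)
    have_P = False

    for it in range(maxiter):
        AX = A @ X
        R = AX - X @ (X.conj().T @ AX)             # block residual
        if np.linalg.norm(R, 'fro') < tol:
            break

        # W <- (I - XX*) T (I - XX*) R   (R is already orthogonal to X)
        W = apply_T(R)
        W -= X @ (X.conj().T @ W)

        # Single-column updates: k small generalized eigenproblems
        Xhat = np.empty_like(X)
        for j in range(k):
            cols = [X[:, j], W[:, j]] + ([P[:, j]] if have_P else [])
            S = np.column_stack(cols)
            theta, C = eigh(S.conj().T @ (A @ S), S.conj().T @ S)
            c = C[:, 0]                            # smallest Ritz value
            P[:, j] = S[:, 1:] @ c[1:]             # p_j <- beta*w_j (+ gamma*p_j)
            Xhat[:, j] = c[0] * X[:, j] + P[:, j]  # x_j <- alpha*x_j + p_j
        have_P = True

        X, _ = np.linalg.qr(Xhat)                  # X <- orth(Xhat)

        if (it + 1) % rr_period == 0:              # once in a while, X <- RR(X)
            _, C = eigh(X.conj().T @ (A @ X))
            X = X @ C                              # rotate into Ritz vectors
    return X
```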

SLIDES 36-40

The PPCG algorithm: practical aspects

1. Partition X, W, and P into subblocks, as opposed to columns
2. Periodic RR is needed (every 5-10 steps) to re-position the columns of X into approximate eigenvectors
3. Locking of converged eigenpairs
4. The QR factorization of X̂ can be skipped every few iterations (see the orthonormalization sketch below)
5. “Buffer”
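The CholQR rows in the timing tables that follow refer to orthonormalization by Cholesky QR. A minimal sketch of that kernel (the function name cholqr is assumed; in finite precision it may need a second pass, or a Householder QR instead, when X̂ is ill-conditioned):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def cholqr(Xhat):
    """Cholesky QR: one k-by-k Gram matrix, one Cholesky factorization,
    and one triangular solve; communication-friendly for tall-skinny blocks."""
    G = Xhat.conj().T @ Xhat                       # k x k Gram matrix
    R = cholesky(G, lower=False)                   # G = R^H R, R upper triangular
    # Q = Xhat R^{-1}: solve R^H Q^H = Xhat^H for Q^H
    Q = solve_triangular(R, Xhat.conj().T, trans='C', lower=False).conj().T
    return Q
```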
SLIDE 41

Test problems

Problem   | na    | ecut (Ryd) | nG        | k
Li318     |   318 |         25 |   206,691 | 2,062
Graphene  |   512 |         25 |   538,034 | 2,254
bulk Si   | 1,000 |         35 | 1,896,173 | 2,550

Table: Test problems (na = number of atoms, ecut = plane-wave energy cutoff, nG = dimension of the discretized problem, k = number of computed eigenpairs).

SLIDE 42

Numerical experiments: execution time

[Figure: Convergence of the Davidson–Liu and PPCG algorithms (Frobenius norm of the residual vs. time in sec.) for (a) Li318 (2,062 eigenpairs), (b) Graphene512 (2,254 eigenpairs), and (c) bulk Si (2,550 eigenpairs).]

◮ Implementation in Quantum Espresso
◮ Number of cores: 480 (Li318), 576 (Graphene512), and 2,400 (bulk Si)
◮ RR performed once every 5 iterations
◮ Subblock size is 5 (Li318) and 50 (Graphene512 and bulk Si)
◮ “Buffer” size is 50

SLIDE 43

Numerical experiments: timing profiles

Computation | PPCG | Davidson
GEMM        |   16 |       11
AX          |   10 |        6
RR          |   13 |       66
CholQR      |    8 |      n/a

Table: The timing profiles (in sec.) for PPCG and Davidson to compute 2,062 eigenpairs of the Li318 problem on 480 cores.

Computation | PPCG | Davidson
GEMM        |   27 |       41
AX          |   94 |       96
RR          |   40 |      191
CholQR      |   19 |      n/a

Table: The timing profiles (in sec.) for PPCG and Davidson to compute 2,550 eigenpairs of the bulk Si problem on 2,400 cores.

SLIDE 44

Numerical experiments: scaling (bulk Si)

ncpus  |  200 |  400 |  800 | 1,600 | 2,400 | 2,800
GEMM   |  202 |  104 |   57 |    35 |    27 |    26
AX     |  247 |  165 |  129 |   106 |    94 |    92
RR     |  142 |   77 |   48 |    40 |    40 |    41
CholQR |   66 |   91 |   21 |    18 |    19 |    21
Total  |  685 |  399 |  266 |   209 |   189 |   188

Table: Scaling of different computational components of PPCG (time in sec.).

ncpus  |  200 |  400 |  800 | 1,600 | 2,400 | 2,800
GEMM   |  248 |  138 |   76 |    47 |    41 |    38
AX     |  253 |  169 |  133 |   111 |    96 |    96
RR     |  474 |  303 |  214 |   189 |   191 |   189
Total  |  986 |  615 |  425 |   348 |   329 |   323

Table: Scaling of different computational components of Davidson (time in sec.).

SLIDE 45

Numerical experiments: scaling (bulk Si)

[Figure: Scaling of eigensolvers for bulk Si (2,550 eigenpairs): time (sec.) vs. number of CPUs for Davidson and PPCG.]

SLIDE 46

Numerical experiments: the SCF loop

[Figure: Convergence of the SCF iteration (estimated SCF accuracy vs. time in sec.) with Davidson, PPCG, and PPCG without CholQR for (a) Li318, (b) Graphene512, and (c) bulk Si.]

SLIDE 47

Numerical experiments: effects of RR (Li318)

[Figure: PPCG convergence for different RR frequencies (Li318): Frobenius norm of the residual vs. iteration number, for no RR and rr_period = 1, 5, 10, 15.]

SLIDE 48

Numerical experiments: effects of the subblock size (Li318)

[Figure: PPCG convergence for different sbsize values (Li318): Frobenius norm of the residual vs. iteration count (left) and time in sec. (right), for sbsize = 1, 5, 10, 50, 100.]

SLIDE 49

Conclusions

◮ The PPCG iteration reduces the amount of RR computation
◮ Up to 2x speedup demonstrated
◮ Pilot versions implemented in Quantum Espresso and Qbox

Current and future work

◮ Better theoretical understanding (domain decomposition framework?)
◮ Integration into molecular dynamics simulation codes, with application to the chemistry and dynamics of lithium-ion batteries
◮ Extension to non-linear, structured, and interior eigenproblems

SLIDE 50

Questions?

Thank you!

◮ E. Vecharynski, C. Yang, and J. E. Pask, “A Projected Preconditioned Conjugate Gradient Algorithm for Computing a Large Invariant Subspace of a Hermitian Matrix”, available at http://arxiv.org/abs/1407.7506

SLIDE 51

QE process grid for a dense ScaLAPACK eigensolver

ncpus | Processor grid
  200 | 10x10
  400 | 14x14
  800 | 20x20
1,600 | 28x28
2,400 | 34x34
3,000 | 38x38

Table: The default ScaLAPACK processor grid configurations used by QE for different total core counts.

SLIDE 52

Numerical experiments: effects of RR (Li318)

[Figure: PPCG convergence for different RR frequencies (Li318): Frobenius norm of the residual vs. iteration count (left) and time in sec. (right), for no RR and rr_period = 1, 5, 10, 15.]