Parallel QR Algorithm with Aggressive Early Deflation (Meiyue Shao)




SLIDE 1

Parallel QR Algorithm with Aggressive Early Deflation

Meiyue Shao
Department of Computing Science and HPC2N, Umeå University, Sweden

Joint work with R. Granat, B. Kågström, and D. Kressner

Trogir, October 2011

SLIDE 2

Introduction — 1/12 —

  • Dense linear eigenvalue problems

    – Standard eigenvalue problem (SEP): $Ax = \lambda x$
    – Generalized eigenvalue problem (GEP): $Ax = \lambda Bx$

Sometimes ALL eigenvalues are needed. This is achieved via the Schur decomposition: $A = QTQ^H$ or $(A, B) = (QSZ^H, QTZ^H)$.
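As a short worked step (standard linear algebra, not taken from the slides), this is why the Schur form delivers all eigenvalues at once. Assuming the complex Schur form, so that $T$ (and $S$ in the generalized case) is upper triangular with diagonal entries $t_{ii}$ (and $s_{ii}$):

$$
\det(A - \lambda I) = \det\bigl(Q (T - \lambda I) Q^H\bigr) = \det(T - \lambda I) = \prod_{i=1}^{n} (t_{ii} - \lambda),
$$

so the eigenvalues of $A$ are the diagonal entries of $T$. For the generalized problem,

$$
\det(A - \lambda B) = \det\bigl(Q (S - \lambda T) Z^H\bigr) = \det(Q)\,\det(Z^H) \prod_{i=1}^{n} (s_{ii} - \lambda t_{ii}),
$$

so the eigenvalues are the ratios $\lambda_i = s_{ii} / t_{ii}$ whenever $t_{ii} \neq 0$.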

SLIDE 3

Modern QR Algorithm — 2/12 —

  • QR algorithm:
    1. (optional) Balancing (isolating and scaling)
    2. Hessenberg reduction (A → H)
    3. Repeat
         Aggressive early deflation
         Multi-shift QR sweep (bulge-chasing)
       Until convergence (H → T)
    4. (optional) Backward transformation.
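The repeat-until-converged loop in step 3 can be illustrated with a heavily simplified sketch. It is not the algorithm of the talk: it assumes a real symmetric tridiagonal input (a Hessenberg matrix whose Schur form is diagonal), uses the classical subdiagonal deflation test in place of aggressive early deflation, and performs an explicit single-shift QR step with a Wilkinson shift in place of a multishift bulge-chasing sweep. The function name and tolerances are hypothetical.

```python
import numpy as np

def qr_iteration_eigenvalues(H, tol=1e-12, max_iter=1000):
    """Eigenvalues of a real symmetric tridiagonal matrix by shifted QR iteration."""
    H = np.array(H, dtype=float)
    n = H.shape[0]
    eigenvalues = []
    iterations = 0
    while n > 1 and iterations < max_iter:
        a, b, c = H[n - 2, n - 2], H[n - 1, n - 2], H[n - 1, n - 1]
        # Classical deflation test (a stand-in for AED in this sketch).
        if abs(b) <= tol * (abs(a) + abs(c)):
            eigenvalues.append(c)
            n -= 1
            H = H[:n, :n]
            continue
        # Wilkinson shift computed from the trailing 2x2 block.
        d = (a - c) / 2.0
        sign = 1.0 if d >= 0 else -1.0
        mu = c - sign * b * b / (abs(d) + np.hypot(d, b))
        # One single-shift QR step: H - mu*I = QR, then H <- RQ + mu*I.
        Q, R = np.linalg.qr(H - mu * np.eye(n))
        H = R @ Q + mu * np.eye(n)
        iterations += 1
    eigenvalues.extend(np.diag(H))  # entries that were not deflated separately
    return np.sort(np.array(eigenvalues))

# Small test matrix (hypothetical example data).
T = np.diag([2.0, 3.0, 4.0]) + np.diag([1.0, 1.0], 1) + np.diag([1.0, 1.0], -1)
print(qr_iteration_eigenvalues(T))
```

Each QR step is an orthogonal similarity transformation, so the eigenvalues are preserved while the last subdiagonal entry is driven toward zero.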
SLIDE 4

Modern QR Algorithm — 3/12 —

  • Aggressive early deflation (AED)

$$
H = \begin{bmatrix} H_{11} & H_{12} & H_{13} \\ H_{21} & H_{22} & H_{23} \\ 0 & H_{32} & H_{33} \end{bmatrix},
\qquad
U = \begin{bmatrix} I & & \\ & 1 & \\ & & V \end{bmatrix},
$$

where the block rows and columns have sizes $n - n_{\mathrm{win}} - 1$, $1$, and $n_{\mathrm{win}}$, and

$$
U^H H U = \begin{bmatrix} H_{11} & H_{12} & H_{13}V \\ H_{21} & H_{22} & H_{23}V \\ 0 & s & S \end{bmatrix},
\qquad
S = V^H H_{33} V \ \text{(in Schur form)}.
$$

    – If the last entry of the vector $s$ is small enough, we can deflate an eigenvalue.
    – Otherwise, the undeflatable eigenvalue is moved up.
    – Reduce back to Hessenberg form after all eigenvalues are tested.
    – Undeflatable eigenvalues can be used as shifts in the next QR sweep.
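The spike vector can be made concrete with a small NumPy sketch. Multiplying out the blocks gives $s = V^H H_{32}$. The sketch makes several assumptions that are not in the slides: the matrix is real symmetric tridiagonal, so `numpy.linalg.eigh` can stand in for the Schur decomposition of the window (NumPy has no Schur routine) and $S$ is diagonal; the function name `aed_spike` and the relative tolerance are hypothetical. Because $S$ is diagonal here, every spike entry can be tested directly; in the general nonsymmetric case only the last entry is tested and undeflatable eigenvalues must be reordered upward.

```python
import numpy as np

def aed_spike(H, nwin, tol=1e-10):
    """Return the spike s, the window eigenvalues, and a deflation mask."""
    H = np.asarray(H, dtype=float)
    n = H.shape[0]
    k = n - nwin                           # first row/column of the AED window
    evals, V = np.linalg.eigh(H[k:, k:])   # S = V^T H33 V is diagonal here
    U = np.eye(n)
    U[k:, k:] = V                          # U = diag(I, 1, V)
    HU = U.T @ H @ U
    s = HU[k:, k - 1]                      # spike: equals V^T H32
    deflatable = np.abs(s) <= tol * np.abs(evals)
    return s, evals, deflatable

# Hypothetical 6x6 example: no single subdiagonal entry is tiny, but the
# product of the two entries inside the window is.
d = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
e = [1.0, 1.0, 1.0, 1e-6, 1e-6]
H = np.diag(d) + np.diag(e, 1) + np.diag(e, -1)
s, evals, mask = aed_spike(H, nwin=3)
print(mask)
```

In this example the subdiagonal entries inside the window are both 1e-6, far above the tolerance, yet the last spike entry is of the order of their product, which is how AED can detect a converged eigenvalue earlier than the classical test.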

SLIDE 5

Parallel QR Algorithm — 4/12 —

  • Software structure

    – PDHSEQR: Entry routine for new parallel QR algorithm.
    – PDLAQR1: Modified version of ScaLAPACK’s current implementation of the parallel QR algorithm.
    – PDLAQR0: New parallel QR algorithm.
    – PDLAQR3: Aggressive early deflation and shift computation.
    – PDLAQR5: Multishift QR iteration based on chains of tightly coupled bulges.

SLIDE 6

Parallel QR Algorithm — 5/12 —

  • Parallel bulge-chasing algorithms

(on distributed-memory systems)

                 ScaLAPACK 1.8.0    New algorithm
Routine          PDLAQR1()          PDLAQR5()
Kernels          BLAS-1             BLAS-3
Bulges           loosely coupled    tightly coupled

SLIDE 7

Parallel QR Algorithm — 6/12 —

  • Local bulge chasing

Several chains of tightly coupled bulges are chased simultaneously.

SLIDE 8

Parallel QR Algorithm — 7/12 —

  • Cross border chasing

[Figure: two panels, "Odd-numbered windows" and "Even-numbered windows"]

SLIDE 9

Parallel QR Algorithm — 8/12 —

  • Aggressive early deflation

    – Schur decomposition for a smaller matrix. Several possible choices: recursion, (modified) ScaLAPACK solver, or even LAPACK solver. The choice depends on the size of the AED window, as well as the number of processors.
    – We need to take care of eigenvalue reordering (PBDTRORD).
    – Usually the AED phase is slow since the submatrix (AED window) is not large enough. Sometimes we need to redistribute the submatrix to a subset of processors.
    – We prefer the QR sweep since it scales better than AED, so the threshold (NIBBLE) for skipping a QR sweep is larger than that used in LAPACK.
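The role of NIBBLE can be sketched as a one-line decision rule. The comparison below mirrors the test in LAPACK's xLAQR0 routines, where NIBBLE is a percentage of the AED window size with a default of 14. The function name is hypothetical, and the value 50 is purely illustrative, not the setting used by the parallel software.

```python
def skip_qr_sweep(num_deflated, window_size, nibble):
    """Return True if AED deflated enough eigenvalues to skip the next QR sweep.

    nibble is a percentage: the sweep is skipped only when the number of
    deflated eigenvalues exceeds nibble percent of the AED window size.
    """
    return 100 * num_deflated > nibble * window_size

# 10 deflations in a window of size 64 (hypothetical numbers).
print(skip_qr_sweep(10, 64, nibble=14))  # LAPACK default threshold
print(skip_qr_sweep(10, 64, nibble=50))  # larger, illustrative threshold
```

Raising the threshold means AED must be more successful before a sweep is skipped, which shifts work toward the better-scaling QR sweeps.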

SLIDE 10

Numerical Experiments on Akka — 9/12 —

A 16000 × 16000 dense eigenvalue problem with 100 processors, H → T = Q^T H Q

[Bar chart: execution time in seconds on 100 cores. ScaLAPACK: 16755 secs. New software: 252 secs.]

SLIDE 11

Numerical Experiments on Akka — 10/12 —

Profile of the total execution time (A → H → T) (4000 × 4000 per core)

[Stacked bar chart: fraction of total execution time (0 to 1) for n=4000, n=8000, n=16000, n=32000. Components: Balancing; Hessenberg red.; QR: AED Schur red.; QR: AED reordering; QR: return to Hess.; QR: Sweep - updates; QR: Sweep - local chase.]

SLIDE 12

Numerical Experiments on Akka — 11/12 —

A 100,000 × 100,000 Dense Eigenvalue Problem

# Procs          16 × 16     24 × 24     32 × 32
Total Time       5.87 hrs    3.97 hrs    3.07 hrs
Balancing        0.24 hrs    0.24 hrs    0.24 hrs
Hess. red.       2.92 hrs    1.78 hrs    1.08 hrs
QR+AED           2.72 hrs    1.95 hrs    1.75 hrs
AED/(QR+AED)     44%         44%         42%
shifts per eig   0.30        0.22        0.16
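The timings above can be turned into speedup and relative parallel efficiency with a few lines of arithmetic. The numbers are copied from the slide; taking the 16 × 16 grid as the baseline is a choice made here for illustration, and the variable names are hypothetical.

```python
# Timings in hours, copied from the slide.
grids = [(16, 16), (24, 24), (32, 32)]
total = [5.87, 3.97, 3.07]
balancing = [0.24, 0.24, 0.24]
hess_red = [2.92, 1.78, 1.08]
qr_aed = [2.72, 1.95, 1.75]

procs = [p * q for p, q in grids]
# Speedup and efficiency relative to the 16 x 16 process grid.
speedup = [total[0] / t for t in total]
efficiency = [s * procs[0] / p for s, p in zip(speedup, procs)]
# The three phases should add up to the total, up to rounding.
component_sum = [b + h + q for b, h, q in zip(balancing, hess_red, qr_aed)]

for g, p, s, e in zip(grids, procs, speedup, efficiency):
    print(f"{g[0]} x {g[1]} ({p} procs): speedup {s:.2f}, relative efficiency {e:.2f}")
```

Going from 256 to 1024 processes, the Hessenberg reduction speeds up by roughly 2.7 (2.92 / 1.08) while QR+AED speeds up by roughly 1.6 (2.72 / 1.75), so the QR phase accounts for a growing share of the total time.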

SLIDE 13

Concluding Remarks — 12/12 —

  • New issues in the parallel QR algorithm

    – multiple chains of shifts
    – crossover points for different algorithms
    – shifting strategy
    – data redistribution

  • The software will be released soon.
  • Future work

    – faster, faster, and even faster ... (perhaps with less and less energy consumption in the future)

SLIDE 14

Thank you!