Parallel Shift-Invert Spectrum Slicing on Distributed Architectures with GPU Accelerators



SLIDE 1

Parallel Shift-Invert Spectrum Slicing on Distributed Architectures with GPU Accelerators (pap167s1)

The 49th International Conference on Parallel Processing (ICPP20)

David Williams-Young, Chao Yang
Scalable Solvers Group, Computational Research Division, Lawrence Berkeley National Lab

SLIDE 2

Diagonalization is the Bottleneck for Large Scale DFT Calculations

! "# "# = "#%#: ' ()

Guess "# Form F("#) Solve EVP → " Converged? Terminate No Yes

! ∈ ℝ/ × / "# ∈ ℝ/ × 1 %# ∈ ℝ1 × 1

Requires repeated partial diagonalization of 2 (2 < () eigenpairs of the EVP.
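To make the loop above concrete, here is a minimal, hedged Python/NumPy sketch of the SCF fixed-point iteration. `build_fock` is a hypothetical user-supplied routine that forms F(C); dense `eigh` stands in for the partial diagonalization that dominates the cost at scale.

```python
import numpy as np
import scipy.linalg as la

def scf_loop(build_fock, C0, M, tol=1e-8, max_iter=100):
    """Hedged sketch of the SCF fixed-point iteration F(C) C = C E."""
    C = C0                                   # C in R^{N x M}, initial guess
    for _ in range(max_iter):
        F = build_fock(C)                    # form F(C), F in R^{N x N}
        # Partial diagonalization for the M lowest eigenpairs: the bottleneck
        # step that spectrum slicing targets.
        evals, C_new = la.eigh(F, subset_by_index=[0, M - 1])
        # Convergence test on the change of the density-like projector C C^T
        if np.linalg.norm(C_new @ C_new.T - C @ C.T) < tol:
            return evals, C_new
        C = C_new
    raise RuntimeError("SCF did not converge")
```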

SLIDE 3

We Must Exploit the Structure of the SCF Problem and Aspects of Modern Computing Architectures to Obtain Performance Improvements

General Methods are for General Problems

SLIDE 4

FLOPs are Cheap vs. Communication

Reliance on accelerators (GPUs):

  • IBM POWER9: 1 TFLOP/s
  • Intel KNL: 3 TFLOP/s
  • NVIDIA Tesla V100: 7.8 TFLOP/s
  • 46.8 TFLOP/s / Summit node (6x)

Cheap FLOPs → exposing communication bottlenecks. E.g., a 10k x 10k DGEMM on a V100 (PCI-E); see the sketch after this list:

  • DGEMM time = 2 sec
  • Communication (H2D + D2H) = 6 sec
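A hedged micro-benchmark sketch of this point using CuPy (an assumption; the slide's numbers come from cuBLAS directly): it times a 10k × 10k DGEMM on the device against the host-to-device and device-to-host transfers around it. Absolute numbers will vary with the GPU and PCI-E/NVLink generation.

```python
import time
import numpy as np
import cupy as cp

def timed(fn):
    """Run fn, synchronizing the device so the wall time is meaningful."""
    cp.cuda.Device().synchronize()
    t0 = time.perf_counter()
    out = fn()
    cp.cuda.Device().synchronize()
    return out, time.perf_counter() - t0

n = 10_000
A_h = np.random.rand(n, n)
B_h = np.random.rand(n, n)

(A_d, B_d), t_h2d = timed(lambda: (cp.asarray(A_h), cp.asarray(B_h)))  # H2D transfers
C_d, t_gemm      = timed(lambda: A_d @ B_d)                            # DGEMM (cuBLAS)
C_h, t_d2h       = timed(lambda: cp.asnumpy(C_d))                      # D2H transfer

print(f"DGEMM: {t_gemm:.2f} s   transfers (H2D + D2H): {t_h2d + t_d2h:.2f} s")
```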
SLIDE 5

The SISLICE Method

A Parallel Implementation of Shift-Invert Spectrum Slicing for the SCF Eigenvalue Problem

CPU Implementation: arXiv:1908.06043 (to appear in ACM Trans. Math. Softw.)

SLIDE 6

Spectrum Slicing Partitions the Eigenspectrum into Independent Tasks

Diagram: shifts σ1, σ2, ..., σns−1, σns divide the spectrum into Slice 1, Slice 2, ..., Slice ns, Slice ns + 1.

  • Eigenvalues in each slice are to be determined "independently"
  • Trades redundant FLOPs for less communication (a partitioning sketch follows below)
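As a hedged illustration of the partitioning step (not the shift-placement heuristic used by SISLICE), the sketch below splits a set of approximate eigenvalues into ns + 1 slices separated by ns shifts, placing shifts so that each slice holds roughly the same number of eigenvalues.

```python
import numpy as np

def partition_spectrum(approx_evals, n_shifts):
    """Return shifts sigma_1 < ... < sigma_ns and each eigenvalue's slice index (0 .. ns)."""
    e = np.sort(np.asarray(approx_evals))
    # Split into ns + 1 equally populated groups; shifts sit between adjacent groups.
    groups = np.array_split(e, n_shifts + 1)
    shifts = np.array([0.5 * (lo[-1] + hi[0]) for lo, hi in zip(groups[:-1], groups[1:])])
    slice_of = np.searchsorted(shifts, approx_evals)
    return shifts, slice_of

# Example: 100 approximate eigenvalues, 9 shifts -> 10 independent slices.
shifts, slice_of = partition_spectrum(np.random.randn(100), 9)
```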
SLIDE 7

The SISLICE Method Exploits the Convergent Properties of the SCF Procedure

Flowchart: Initial guess for C → first SCF iteration? Yes: Partition spectrum → {σj}; No: Update {σj} → Perform shift-invert subspace iterations in parallel → Extract C. Legend marks which steps are synchronization, computation (FLOPs), and cheap / replicated work.

DBWY, et al. arXiv:1908.06043 Lin, et al. SIAM Rev 58, 34, 2016

SLIDE 8

Shift-Invert Spectrum Slicing Trades Diagonalization for Linear System Solves

Shift-invert transform per slice: (F − σj I)^{−1}, j = 1 : ns

Input:  Symmetric F ∈ R^{N×N}, shift partition {σj}, j = 1 : ns, number of
        desired eigenpairs M, basis dimension K, and max iteration niter.
Output: Eigenvectors C ∈ R^{N×M}, and eigenvalues E ∈ R^{M×M}.

1  Distribute work over σj.
2  for each σj assigned to this execution context do
3      Form initial guess Vj ∈ R^{N×K}
4      Factorize (F − σj I)                    (TRF)
       for i = 1 : niter do
5          Vj ← (F − σj I)^{−1} Vj             (TRS)
6          Vj ← orth(Vj)                       (CholQR)
       end
7      (Vj, Ej, rj) ← RayleighRitz(F, Vj)      (RR)
   end
8  (C, E) ← DistValidate({(Vj, Ej, rj)})

DBWY, et al. arXiv:1908.06043
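A minimal single-process sketch of one slice of the algorithm above in Python/NumPy (dense LU standing in for the distributed or sparse factorization; not the SISLICE implementation), covering the TRF, TRS, CholQR, and RR kernels:

```python
import numpy as np
import scipy.linalg as la

def si_subspace_iteration(F, sigma, K, n_iter, seed=0):
    """Shift-invert subspace iteration for one shift sigma (one slice)."""
    N = F.shape[0]
    # TRF: factorize the shifted matrix once per slice
    lu, piv = la.lu_factor(F - sigma * np.eye(N))
    # Initial guess V_j in R^{N x K}
    V = np.random.default_rng(seed).standard_normal((N, K))
    for _ in range(n_iter):
        # TRS: apply (F - sigma I)^{-1} via back-solves with the stored factors
        V = la.lu_solve((lu, piv), V)
        # CholQR: orthonormalize V using the Cholesky factor of V^T V
        R = la.cholesky(V.T @ V, lower=False)          # V^T V = R^T R
        V = la.solve_triangular(R, V.T, trans='T').T   # V <- V R^{-1}
    # RR: Rayleigh-Ritz projection onto span(V)
    theta, Y = la.eigh(V.T @ F @ V)
    X = V @ Y
    resid = np.linalg.norm(F @ X - X * theta, axis=0)  # per-eigenpair residuals r_j
    return theta, X, resid
```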

SLIDE 9

Shift-Invert Spectrum Slicing Trades Diagonalization for Linear System Solves

Shift-invert transform per slice: (F − σj I)^{−1}, j = 1 : ns

  • Triangular factorization (LU / LDLT) + back solve:
    ✓ Lower prefactor / better strong scaling
    ✓ Able to exploit sparsity (SuperLU, PARDISO, etc.); see the sketch after this list
    ✗ Orders of magnitude more FLOPs
    ✓ Shift independence → massive parallelism
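A hedged single-process sketch of the "factorize once, back-solve many times" pattern, using SciPy's bundled SuperLU as a stand-in for SuperLU_DIST / PARDISO (an assumption; the distributed solvers have different interfaces):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def shift_invert_apply(F, sigma, V, n_iter):
    """Repeatedly apply (F - sigma I)^{-1} to the block V, reusing one sparse LU."""
    N = F.shape[0]
    lu = spla.splu((F - sigma * sp.identity(N, format='csc')).tocsc())  # TRF (sparse LU)
    for _ in range(n_iter):
        V = lu.solve(V)          # TRS: cheap back-solves reuse the factors
        V, _ = np.linalg.qr(V)   # re-orthonormalize (plain QR here; CholQR in SISLICE)
    return V
```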

DBWY, et al. arXiv:1908.06043


SLIDE 10

Synchronization Only Requires Communication of O(K) Data

Strong scaling of SISLICE synchronization (MPI_Allgather) for various values of K. Timings were obtained on the Summit supercomputer.

Diagram (pre- vs. post-synchronization): before the allgather, Rank 0 / Rank 1 / Rank 2 each hold only their own (rj, Λj, Xj); afterwards every rank holds all {(rj, Λj)} while the eigenvector blocks Xj stay local.

DBWY, et al. arXiv:1908.06043 DBWY, et al. ICPP20

Plot: synchronization wall time (ms) vs. number of nodes for K = 100, 500, and 1000.
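A hedged sketch of this synchronization step using mpi4py (an assumption; SISLICE itself is not Python): each rank contributes only its slice's Ritz values and residual norms, O(K) data per shift, never the O(NK) eigenvector blocks.

```python
from mpi4py import MPI
import numpy as np

def synchronize_slices(comm, evals_j, resids_j):
    """Allgather each slice's (eigenvalue, residual) pairs onto every rank."""
    local = np.column_stack([evals_j, resids_j])   # shape (K, 2): O(K) payload
    gathered = comm.allgather(local)               # MPI_Allgather of the small payloads
    all_evals  = np.concatenate([g[:, 0] for g in gathered])
    all_resids = np.concatenate([g[:, 1] for g in gathered])
    return all_evals, all_resids

# Usage after the per-slice Rayleigh-Ritz step:
#   all_evals, all_resids = synchronize_slices(MPI.COMM_WORLD, Ej, rj)
# followed by validation / duplicate detection against the gathered residuals.
```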

SLIDE 11

The SISLICE CPU Implementation Exhibits Linear Strong Scaling

Plot: wall time (s) vs. number of processors (10^2 to 10^4) for ScaLAPACK, ELPA, SISLICE 8x8, and SISLICE 16x16.

Si10H16 (UF Sparse Matrix Collection)

  • N = 17,077
  • M = 8,500
  • NS = 100, K = 100
  • NNZ = 87,592 (99.7% zeros)
  • SuperLU for distributed LU factorization
  • NB = 128 for ScaLAPACK/ELPA

2.7x speedup

DBWY, et al. arXiv:1908.06043

SLIDE 12

The Proxy Application for GPU-SISLICE

SISLICE ≈ SISUBIT + Synchronization

Three limiting cases for the shift-invert subspace iteration:

  • 1. Shared memory, dense matrices (Dense SM)
  • 2. Distributed memory, dense matrices (Dense DM)
  • 3. Sparse matrices


SLIDE 13

Dense SM-SISUBIT

DBWY, et al. ICPP20

GPU: cuSOLVER + cuBLAS; Intel CPU: MKL; IBM CPU: ESSL

Plots: per-kernel wall time (ms) on V100 and XG, broken down into TRF, TRS, CholQR, RR, and H2D + D2H; SISS vs. SYEVD wall time (ms) for SM-V100, SM-POWER9, SM-KNL, SM-XG, ELPA1-GPU, ELPA2-CPU, and ScaLAPACK.

Kernel speedups:

Kernel    Speedup (N ≤ 1,000)   Speedup (N ≥ 10,000)
TRF       6x (XG)               3x (KNL)
TRS       1.5x (POWER9)         4-5x (XG)
CholQR    50x (POWER9)          20x (POWER9)
RR        1.5-2x (XG)           6x (XG)
SISUBIT   1.5-2x (XG)           4x (XG)
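A hedged CuPy sketch (an assumption; the measured kernels call cuBLAS/cuSOLVER directly) of why CholQR maps so well to the GPU: for a tall-skinny V it reduces to one GEMM-like Gram product, a tiny K × K Cholesky, and a triangular solve.

```python
import cupy as cp
from cupyx.scipy.linalg import solve_triangular

def cholqr(V):
    """Orthonormalize the columns of a tall-skinny device array V (N x K, N >> K)."""
    G = V.T @ V                                    # K x K Gram matrix (cuBLAS GEMM/SYRK)
    L = cp.linalg.cholesky(G)                      # G = L L^T (small cuSOLVER potrf)
    Q = solve_triangular(L, V.T, lower=True).T     # Q = V L^{-T} via a TRSM-like solve
    return Q
```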

SLIDE 14

Dense DM-SISUBIT

GPU: SLATE; CPU: ScaLAPACK / ELPA

Plot: SISS vs. SYEVD wall time (ms) for SLATE, ScaLAPACK, ELPA1-GPU, and ELPA2-CPU.

Kernel speedups:

            N = 100,000            N = 300,000
Kernel      4 nodes    64 nodes    32 nodes   64 nodes
TRF         2.1x       0.8x        1.7x       1.8x
TRS         0.4x       0.2x        0.1x       0.1x
CholQR      0.07x      0.04x       0.07x      0.04x
RR          0.07x      0.02x       0.02x      0.01x
SISUBIT     2.3x       0.5x        1.1x       0.9x

DBWY, et al. ICPP20

Plot: wall time (ms) vs. number of nodes (4, 16, 32, 64) for ScaLAPACK and SLATE at N = 50,000; 100,000; 200,000; 300,000.

SLIDE 15

Sparse SISUBIT

DBWY, et al. ICPP20

Plots: wall time (s) vs. number of nodes (1, 4, 16) for SuperLU_DIST CPU TRF, SuperLU_DIST GPU TRF, SuperLU_DIST TRS, PARDISO TRF, and PARDISO TRS; SISS vs. SYEVD wall time (s) for SuperLU_DIST, PARDISO, ScaLAPACK, ELPA1-GPU, and ELPA2-CPU.

SuiteSparse: Ga10As10H30

  • N = 113,081
  • NNZ = 6,115,633 (99.95% zero)
SLIDE 16

Conclusions

  • For matrices that fit in the memory of a single compute node, the GPU implementation of SISLICE exhibits performance gains over CPU implementations of SISLICE as well as SYEVD
  • Further improvements in the distributed-memory GPU linear algebra software stack will yield drastic improvements in the years to come

SLIDE 17

Acknowledgements

The development of SISLICE has been supported by the U.S. Department of Energy:

  • Scientific Discovery Through Advanced Computing (SciDAC-4)
  • Exascale Computing Project (NWChemEx 17-SC-20-SC)

Calculations were performed using DOE computing facilities:

  • Cori (NERSC)
  • Summit (OLCF)