Parallel Shift-Invert Spectrum Slicing on Distributed Architectures with GPU Accelerators



SLIDE 1

Parallel Shift-Invert Spectrum Slicing on Distributed Architectures with GPU Accelerators (pap167s1)

The 49th International Conference on Parallel Processing (ICPP20)

David Williams-Young, Chao Yang
Scalable Solvers Group, Computational Research Division, Lawrence Berkeley National Lab

SLIDE 2

Diagonalization is the Bottleneck for Large Scale DFT Calculations

! "# "# = "#%#: ' ()

Guess "# Form F("#) Solve EVP → " Converged? Terminate No Yes

! ∈ ℝ/ × / "# ∈ ℝ/ × 1 %# ∈ ℝ1 × 1

Requires repeated partial diagonalization of 2 (2 < () eigenpairs of the EVP.
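To make the loop above concrete, here is a minimal, hedged Python/NumPy sketch of the SCF fixed-point iteration. `build_fock` is a hypothetical user-supplied routine that forms F(C); dense `eigh` stands in for the partial diagonalization that dominates the cost at scale.

```python
import numpy as np
import scipy.linalg as la

def scf_loop(build_fock, C0, M, tol=1e-8, max_iter=100):
    """Hedged sketch of the SCF fixed-point iteration F(C) C = C E."""
    C = C0                                   # C in R^{N x M}, initial guess
    for _ in range(max_iter):
        F = build_fock(C)                    # form F(C), F in R^{N x N}
        # Partial diagonalization for the M lowest eigenpairs: the bottleneck
        # step that spectrum slicing targets.
        evals, C_new = la.eigh(F, subset_by_index=[0, M - 1])
        # Convergence test on the change of the density-like projector C C^T
        if np.linalg.norm(C_new @ C_new.T - C @ C.T) < tol:
            return evals, C_new
        C = C_new
    raise RuntimeError("SCF did not converge")
```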

SLIDE 3

We Must Exploit the Structure of the SCF Problem and Aspects of Modern Computing Architectures to Obtain Performance Improvements

General Methods are for General Problems

SLIDE 4

FLOPs are Cheap vs. Communication

Reliance on accelerators (GPUs):

  • IBM POWER9: 1 TFLOP/s
  • Intel KNL: 3 TFLOP/s
  • NVIDIA Tesla V100: 7.8 TFLOP/s
  • 46.8 TFLOP/s / Summit node (6x)

Cheap FLOPs → exposing communication bottlenecks. E.g., a 10k x 10k DGEMM on a V100 (PCI-E); see the sketch after this list:

  • DGEMM time = 2 sec
  • Communication (H2D + D2H) = 6 sec
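A hedged micro-benchmark sketch of this point using CuPy (an assumption; the slide's numbers come from cuBLAS directly): it times a 10k × 10k DGEMM on the device against the host-to-device and device-to-host transfers around it. Absolute numbers will vary with the GPU and PCI-E/NVLink generation.

```python
import time
import numpy as np
import cupy as cp

def timed(fn):
    """Run fn, synchronizing the device so the wall time is meaningful."""
    cp.cuda.Device().synchronize()
    t0 = time.perf_counter()
    out = fn()
    cp.cuda.Device().synchronize()
    return out, time.perf_counter() - t0

n = 10_000
A_h = np.random.rand(n, n)
B_h = np.random.rand(n, n)

(A_d, B_d), t_h2d = timed(lambda: (cp.asarray(A_h), cp.asarray(B_h)))  # H2D transfers
C_d, t_gemm      = timed(lambda: A_d @ B_d)                            # DGEMM (cuBLAS)
C_h, t_d2h       = timed(lambda: cp.asnumpy(C_d))                      # D2H transfer

print(f"DGEMM: {t_gemm:.2f} s   transfers (H2D + D2H): {t_h2d + t_d2h:.2f} s")
```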
SLIDE 5

The SISLICE Method

A Parallel Implementation of Shift-Invert Spectrum Slicing for the SCF Eigenvalue Problem

CPU Implementation: arXiv:1908.06043 (to appear in ACM Trans. Math. Softw.)

SLIDE 6

Spectrum Slicing Partitions the Eigenspectrum into Independent Tasks

Diagram: shifts σ1, σ2, ..., σns−1, σns divide the spectrum into Slice 1, Slice 2, ..., Slice ns, Slice ns + 1.

  • Eigenvalues in each slice are to be determined "independently"
  • Trades redundant FLOPs for less communication (a partitioning sketch follows below)
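As a hedged illustration of the partitioning step (not the shift-placement heuristic used by SISLICE), the sketch below splits a set of approximate eigenvalues into ns + 1 slices separated by ns shifts, placing shifts so that each slice holds roughly the same number of eigenvalues.

```python
import numpy as np

def partition_spectrum(approx_evals, n_shifts):
    """Return shifts sigma_1 < ... < sigma_ns and each eigenvalue's slice index (0 .. ns)."""
    e = np.sort(np.asarray(approx_evals))
    # Split into ns + 1 equally populated groups; shifts sit between adjacent groups.
    groups = np.array_split(e, n_shifts + 1)
    shifts = np.array([0.5 * (lo[-1] + hi[0]) for lo, hi in zip(groups[:-1], groups[1:])])
    slice_of = np.searchsorted(shifts, approx_evals)
    return shifts, slice_of

# Example: 100 approximate eigenvalues, 9 shifts -> 10 independent slices.
shifts, slice_of = partition_spectrum(np.random.randn(100), 9)
```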
SLIDE 7

The SISLICE Method Exploits the Convergent Properties of the SCF Procedure

Flowchart: Initial guess for C → first SCF iteration? Yes: Partition spectrum → {σj}; No: Update {σj} → Perform shift-invert subspace iterations in parallel → Extract C. Legend marks which steps are synchronization, computation (FLOPs), and cheap / replicated work.

DBWY, et al. arXiv:1908.06043 Lin, et al. SIAM Rev 58, 34, 2016

SLIDE 8

Shift-Invert Spectrum Slicing Trades Diagonalization for Linear System Solves

Shift-invert transform per slice: (F − σj I)^{−1}, j = 1 : ns

Input:  Symmetric F ∈ R^{N×N}, shift partition {σj}, j = 1 : ns, number of
        desired eigenpairs M, basis dimension K, and max iteration niter.
Output: Eigenvectors C ∈ R^{N×M}, and eigenvalues E ∈ R^{M×M}.

1  Distribute work over σj.
2  for each σj assigned to this execution context do
3      Form initial guess Vj ∈ R^{N×K}
4      Factorize (F − σj I)                    (TRF)
       for i = 1 : niter do
5          Vj ← (F − σj I)^{−1} Vj             (TRS)
6          Vj ← orth(Vj)                       (CholQR)
       end
7      (Vj, Ej, rj) ← RayleighRitz(F, Vj)      (RR)
   end
8  (C, E) ← DistValidate({(Vj, Ej, rj)})

DBWY, et al. arXiv:1908.06043
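A minimal single-process sketch of one slice of the algorithm above in Python/NumPy (dense LU standing in for the distributed or sparse factorization; not the SISLICE implementation), covering the TRF, TRS, CholQR, and RR kernels:

```python
import numpy as np
import scipy.linalg as la

def si_subspace_iteration(F, sigma, K, n_iter, seed=0):
    """Shift-invert subspace iteration for one shift sigma (one slice)."""
    N = F.shape[0]
    # TRF: factorize the shifted matrix once per slice
    lu, piv = la.lu_factor(F - sigma * np.eye(N))
    # Initial guess V_j in R^{N x K}
    V = np.random.default_rng(seed).standard_normal((N, K))
    for _ in range(n_iter):
        # TRS: apply (F - sigma I)^{-1} via back-solves with the stored factors
        V = la.lu_solve((lu, piv), V)
        # CholQR: orthonormalize V using the Cholesky factor of V^T V
        R = la.cholesky(V.T @ V, lower=False)          # V^T V = R^T R
        V = la.solve_triangular(R, V.T, trans='T').T   # V <- V R^{-1}
    # RR: Rayleigh-Ritz projection onto span(V)
    theta, Y = la.eigh(V.T @ F @ V)
    X = V @ Y
    resid = np.linalg.norm(F @ X - X * theta, axis=0)  # per-eigenpair residuals r_j
    return theta, X, resid
```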

SLIDE 9

Shift-Invert Spectrum Slicing Trades Diagonalization for Linear System Solves

Shift-invert transform per slice: (F − σj I)^{−1}, j = 1 : ns

  • Triangular factorization (LU / LDLT) + back solve:
    ✓ Lower prefactor / better strong scaling
    ✓ Able to exploit sparsity (SuperLU, PARDISO, etc.); see the sketch after this list
    ✗ Orders of magnitude more FLOPs
    ✓ Shift independence → massive parallelism
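A hedged single-process sketch of the "factorize once, back-solve many times" pattern, using SciPy's bundled SuperLU as a stand-in for SuperLU_DIST / PARDISO (an assumption; the distributed solvers have different interfaces):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def shift_invert_apply(F, sigma, V, n_iter):
    """Repeatedly apply (F - sigma I)^{-1} to the block V, reusing one sparse LU."""
    N = F.shape[0]
    lu = spla.splu((F - sigma * sp.identity(N, format='csc')).tocsc())  # TRF (sparse LU)
    for _ in range(n_iter):
        V = lu.solve(V)          # TRS: cheap back-solves reuse the factors
        V, _ = np.linalg.qr(V)   # re-orthonormalize (plain QR here; CholQR in SISLICE)
    return V
```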

DBWY, et al. arXiv:1908.06043


SLIDE 10

Synchronization Only Requires Communication of O(K) Data

Strong scaling of SISLICE synchronization (MPI_Allgather) for various values of K. Timings were obtained on the Summit supercomputer.

Diagram (pre- vs. post-synchronization): before the allgather, Rank 0 / Rank 1 / Rank 2 each hold only their own (rj, Λj, Xj); afterwards every rank holds all {(rj, Λj)} while the eigenvector blocks Xj stay local.

DBWY, et al. arXiv:1908.06043 DBWY, et al. ICPP20

Plot: synchronization wall time (ms) vs. number of nodes for K = 100, 500, and 1000.
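A hedged sketch of this synchronization step using mpi4py (an assumption; SISLICE itself is not Python): each rank contributes only its slice's Ritz values and residual norms, O(K) data per shift, never the O(NK) eigenvector blocks.

```python
from mpi4py import MPI
import numpy as np

def synchronize_slices(comm, evals_j, resids_j):
    """Allgather each slice's (eigenvalue, residual) pairs onto every rank."""
    local = np.column_stack([evals_j, resids_j])   # shape (K, 2): O(K) payload
    gathered = comm.allgather(local)               # MPI_Allgather of the small payloads
    all_evals  = np.concatenate([g[:, 0] for g in gathered])
    all_resids = np.concatenate([g[:, 1] for g in gathered])
    return all_evals, all_resids

# Usage after the per-slice Rayleigh-Ritz step:
#   all_evals, all_resids = synchronize_slices(MPI.COMM_WORLD, Ej, rj)
# followed by validation / duplicate detection against the gathered residuals.
```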

SLIDE 11

The SISLICE CPU Implementation Exhibits Linear Strong Scaling

Plot: wall time (s) vs. number of processors (10^2 to 10^4) for ScaLAPACK, ELPA, SISLICE 8x8, and SISLICE 16x16.

Si10H16 (UF Sparse Matrix Collection)

  • N = 17,077
  • M = 8,500
  • NS = 100, K = 100
  • NNZ = 87,592 (99.7% zeros)
  • SuperLU for distributed LU factorization
  • NB = 128 for ScaLAPACK/ELPA

2.7x speedup

DBWY, et al. arXiv:1908.06043

SLIDE 12

The Proxy Application for GPU-SISLICE

SISLICE ≈ SISUBIT + Synchronization

Three limiting cases for the shift-invert subspace iteration:

  • 1. Shared memory, dense matrices (Dense SM)
  • 2. Distributed memory, dense matrices (Dense DM)
  • 3. Sparse matrices


SLIDE 13

Dense SM-SISUBIT

DBWY, et al. ICPP20

GPU: cuSOLVER + cuBLAS; Intel CPU: MKL; IBM CPU: ESSL

Plots: per-kernel wall time (ms) on V100 and XG, broken down into TRF, TRS, CholQR, RR, and H2D + D2H; SISS vs. SYEVD wall time (ms) for SM-V100, SM-POWER9, SM-KNL, SM-XG, ELPA1-GPU, ELPA2-CPU, and ScaLAPACK.

Kernel speedups:

Kernel    Speedup (N ≤ 1,000)   Speedup (N ≥ 10,000)
TRF       6x (XG)               3x (KNL)
TRS       1.5x (POWER9)         4-5x (XG)
CholQR    50x (POWER9)          20x (POWER9)
RR        1.5-2x (XG)           6x (XG)
SISUBIT   1.5-2x (XG)           4x (XG)
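A hedged CuPy sketch (an assumption; the measured kernels call cuBLAS/cuSOLVER directly) of why CholQR maps so well to the GPU: for a tall-skinny V it reduces to one GEMM-like Gram product, a tiny K × K Cholesky, and a triangular solve.

```python
import cupy as cp
from cupyx.scipy.linalg import solve_triangular

def cholqr(V):
    """Orthonormalize the columns of a tall-skinny device array V (N x K, N >> K)."""
    G = V.T @ V                                    # K x K Gram matrix (cuBLAS GEMM/SYRK)
    L = cp.linalg.cholesky(G)                      # G = L L^T (small cuSOLVER potrf)
    Q = solve_triangular(L, V.T, lower=True).T     # Q = V L^{-T} via a TRSM-like solve
    return Q
```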

SLIDE 14

Dense DM-SISUBIT

GPU: SLATE; CPU: ScaLAPACK / ELPA

Plot: SISS vs. SYEVD wall time (ms) for SLATE, ScaLAPACK, ELPA1-GPU, and ELPA2-CPU.

Kernel speedups:

            N = 100,000            N = 300,000
Kernel      4 nodes    64 nodes    32 nodes   64 nodes
TRF         2.1x       0.8x        1.7x       1.8x
TRS         0.4x       0.2x        0.1x       0.1x
CholQR      0.07x      0.04x       0.07x      0.04x
RR          0.07x      0.02x       0.02x      0.01x
SISUBIT     2.3x       0.5x        1.1x       0.9x

DBWY, et al. ICPP20

Plot: wall time (ms) vs. number of nodes (4, 16, 32, 64) for ScaLAPACK and SLATE at N = 50,000; 100,000; 200,000; 300,000.

SLIDE 15

Sparse SISUBIT

DBWY, et al. ICPP20

Plots: wall time (s) vs. number of nodes (1, 4, 16) for SuperLU_DIST CPU TRF, SuperLU_DIST GPU TRF, SuperLU_DIST TRS, PARDISO TRF, and PARDISO TRS; SISS vs. SYEVD wall time (s) for SuperLU_DIST, PARDISO, ScaLAPACK, ELPA1-GPU, and ELPA2-CPU.

SuiteSparse: Ga10As10H30

  • N = 113,081
  • NNZ = 6,115,633 (99.95% zero)
SLIDE 16

Conclusions

  • For matrices that fit in the memory of a single compute node, the GPU implementation of SISLICE exhibits performance gains over CPU implementations of SISLICE as well as SYEVD
  • Further improvements in the distributed-memory GPU linear algebra software stack will yield drastic improvements in the years to come

SLIDE 17

Acknowledgements

The development of SISLICE has been supported by the U.S. Department of Energy:

  • Scientific Discovery Through Advanced Computing (SciDAC-4)
  • Exascale Computing Project (NWChemEx 17-SC-20-SC)

Calculations were performed using DOE computing facilities:

  • Cori (NERSC)
  • Summit (OLCF)