Accelerated Sparse Matrix Multiplication for Quantum Chemistry with CP2K on Hybrid Supercomputers


  1. Accelerated Sparse Matrix Multiplication for Quantum Chemistry with CP2K on Hybrid Supercomputers
     Ole Schütt (ole.schuett@mat.ethz.ch), Nanoscale Simulations, Department of Materials, ETH Zürich

  2. Application: Emerging Photovoltaics
     Processes at the TiO2 interface: electron transport across the hole transporting material (HTM, spiro-MeOTAD) and nanoparticles; 17k atoms, 80k electrons; Schiffmann et al. (2010).
     Requirements: electronic properties ⇒ Schrödinger equation (HΨ = EΨ); lack of symmetries ⇒ large simulation cells (> 1000 atoms).

  3. Linear Scaling Self-Consistent Field
     Dense linear algebra (conventional SCF iteration): guess initial density ρ; calculate matrix H from ρ (costs O(N), but dominates for small systems); calculate eigenvectors ψ_i of H (costs O(N³)); calculate new density ρ = Σ_i |ψ_i|²; calculate energy from ρ.
     Sparse linear algebra: calculate ρ directly as a matrix function of H, costs O(N). Density P as a matrix function of H:
     P = [1 + exp((H − µI)/kT)]⁻¹ = 1/2 [1 − sign(H − µI)] in the limit of small kT (ground state).
     Evaluate sign() as a polynomial series: X_0 = A · ||A||⁻¹, X_{n+1} = 1/2 X_n (3I − X_n²), sign(A) = X_∞.
     LS-SCF is entirely based on sparse linear algebra.
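
The sign iteration above involves only matrix multiplications and additions, which is what makes the sparse O(N) formulation possible. A minimal host-side C++ sketch of the Newton-Schulz recurrence on a small dense matrix follows; in CP2K the X_n are sparse DBCSR matrices with filtering, so the function names and the Frobenius-norm scaling here are illustrative assumptions, not CP2K's implementation:

    #include <cmath>
    #include <vector>

    using Matrix = std::vector<double>;  // row-major n x n storage

    static Matrix matmul(const Matrix& a, const Matrix& b, int n) {
        Matrix c(n * n, 0.0);
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k)
                for (int j = 0; j < n; ++j)
                    c[i * n + j] += a[i * n + k] * b[k * n + j];
        return c;
    }

    // Newton-Schulz iteration: X_0 = A / ||A||, X_{n+1} = 1/2 * X_n * (3I - X_n^2).
    Matrix matrix_sign(Matrix x, int n, int iterations) {
        double norm = 0.0;                       // Frobenius norm used for the initial scaling
        for (double v : x) norm += v * v;
        norm = std::sqrt(norm);
        for (double& v : x) v /= norm;

        for (int it = 0; it < iterations; ++it) {
            Matrix x2 = matmul(x, x, n);         // X_n^2
            Matrix y(n * n);
            for (int i = 0; i < n; ++i)          // Y = 3I - X_n^2
                for (int j = 0; j < n; ++j)
                    y[i * n + j] = (i == j ? 3.0 : 0.0) - x2[i * n + j];
            x = matmul(x, y, n);                 // X_{n+1} = 1/2 * X_n * Y
            for (double& v : x) v *= 0.5;
        }
        return x;   // ~ sign(A); the density follows as P = 1/2 * (I - sign(H - mu*I))
    }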

  4. Benchmarks of Condensed Phase Systems
     Plot: wall time [min] vs. number of atoms (up to 60,000), comparing diagonalization with linear scaling; DFT on 46,656 cores, DFTB on 9,216 cores.
     Linear scaling O(N) methods are inevitable for large systems.
     VandeVondele et al. (2012): Linear Scaling Self-Consistent Field Calculations with Millions of Atoms in the Condensed Phase.

  5. The DBCSR Library
     DBCSR = Distributed Block Compressed Sparse Row, the workhorse of CP2K's linear scaling DFT code.
     Non-zero elements are small dense blocks, e.g. 13 × 13; each block corresponds to the interaction between two atoms (illustrated with water molecules: neglect distant atom pairs, exploit symmetry).
     Additions are local operations; multiplications are more elaborate.
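
To make the block-CSR idea concrete, here is a hypothetical C++ sketch of such a layout; the struct, field names, and the linear find_block lookup are invented for illustration and do not reflect DBCSR's actual (Fortran) data structures or API:

    #include <vector>

    struct BlockCSR {
        int n_block_rows = 0;
        std::vector<int> row_block_size;          // rows per block row (atomic basis size)
        std::vector<int> col_block_size;          // columns per block column
        std::vector<int> row_ptr;                 // index into col_idx/blocks, size n_block_rows + 1
        std::vector<int> col_idx;                 // block-column index of each stored block
        std::vector<std::vector<double>> blocks;  // dense data of each small block, e.g. 13 x 13

        // Look up the dense block (i, j); returns nullptr if the atom pair was neglected.
        const std::vector<double>* find_block(int i, int j) const {
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                if (col_idx[k] == j) return &blocks[k];
            return nullptr;
        }
    };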

  6. Architecture of DBCSR's Multiplication Code
     Layered from cluster to GPU:
     Cluster: Cannon — MPI parallelization
     Node: Multrec — cache optimization
     CSR — stack generation
     Scheduler — CPU/GPU load balancing
     Host driver (BLAS / Libsmm) and CUDA driver (Libcusmm on the GPU), with CPU fallback.
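
The top layer of this stack is Cannon's algorithm. A minimal MPI skeleton of that layer could look as follows, assuming equally sized panels and omitting the initial alignment skew; local_multiply() is a placeholder for everything below the Cannon layer (Multrec, stack generation, scheduler, BLAS/Libsmm/Libcusmm) and none of this is DBCSR's actual code:

    #include <mpi.h>
    #include <cmath>
    #include <vector>

    // Placeholder for the layers below the Cannon layer.
    void local_multiply(const double* a, const double* b, double* c, int panel_elems);

    // One Cannon multiplication on a p x p process grid: each rank multiplies its
    // local panels, then A shifts one step left and B one step up per tick.
    void cannon_multiply(std::vector<double>& a, std::vector<double>& b,
                         std::vector<double>& c, int panel_elems, MPI_Comm comm) {
        int nranks;
        MPI_Comm_size(comm, &nranks);
        int p = static_cast<int>(std::sqrt(static_cast<double>(nranks)));
        int dims[2] = {p, p}, periods[2] = {1, 1};
        MPI_Comm grid;
        MPI_Cart_create(comm, 2, dims, periods, 0, &grid);

        int left, right, up, down;
        MPI_Cart_shift(grid, 1, 1, &left, &right);   // neighbours within a grid row
        MPI_Cart_shift(grid, 0, 1, &up, &down);      // neighbours within a grid column

        for (int tick = 0; tick < p; ++tick) {       // one multiply + shift per Cannon tick
            local_multiply(a.data(), b.data(), c.data(), panel_elems);
            MPI_Sendrecv_replace(a.data(), panel_elems, MPI_DOUBLE,
                                 left, 0, right, 0, grid, MPI_STATUS_IGNORE);   // shift A left
            MPI_Sendrecv_replace(b.data(), panel_elems, MPI_DOUBLE,
                                 up, 1, down, 1, grid, MPI_STATUS_IGNORE);      // shift B up
        }
        MPI_Comm_free(&grid);
    }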

  7. Hiding Communication with Double Buffering
     Timeline over Cannon ticks with two host/device buffer pairs: while buffer 1 is being worked on (generate stacks on the host, copy host to device, process stacks on the GPU), buffer 2 is already in an MPI send/receive for the next tick; the buffers swap roles every Cannon tick.
     Ideally: network and GPU are always busy.
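
A minimal sketch of such a double-buffered tick loop, assuming pinned host buffers, one CUDA stream per buffer, and placeholder names for the neighbour ranks and the GPU stack processing (not DBCSR's actual code):

    #include <mpi.h>
    #include <cuda_runtime.h>

    // Placeholder for the stack processing on the GPU (Libcusmm kernels).
    void process_stacks_on_gpu(double* dev_panel, cudaStream_t stream);

    struct TickBuffer {
        double*      host_panel;    // pinned host memory (cudaMallocHost)
        double*      dev_panel;     // device copy of the panel
        size_t       panel_bytes;
        cudaStream_t stream;        // one stream per buffer
    };

    // While the GPU works on buf[cur], the next tick's panel is already in flight
    // over MPI into buf[1 - cur]; the buffers swap roles after every tick.
    void run_ticks(TickBuffer buf[2], int n_ticks, int left, int right, MPI_Comm comm) {
        int cur = 0;
        for (int tick = 0; tick < n_ticks; ++tick) {
            const int nxt = 1 - cur;
            const int count = static_cast<int>(buf[cur].panel_bytes / sizeof(double));

            // Start the communication for the next tick into the other buffer.
            MPI_Request reqs[2];
            MPI_Irecv(buf[nxt].host_panel, count, MPI_DOUBLE, right, 0, comm, &reqs[0]);
            MPI_Isend(buf[cur].host_panel, count, MPI_DOUBLE, left,  0, comm, &reqs[1]);

            // Meanwhile the GPU processes the current buffer.
            cudaMemcpyAsync(buf[cur].dev_panel, buf[cur].host_panel, buf[cur].panel_bytes,
                            cudaMemcpyHostToDevice, buf[cur].stream);
            process_stacks_on_gpu(buf[cur].dev_panel, buf[cur].stream);

            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   // next panel has arrived
            cudaStreamSynchronize(buf[cur].stream);      // current tick finished on the GPU
            cur = nxt;
        }
    }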

  8. Managing Dependencies with CUDA Events and Streams
     Separate streams for the a panel (host→device), the b panel (host→device), the c panel (set zero, device→host), and the stack buffers 1 and 2 (host→device, then calc).
     An event is queried before reusing a host stack buffer.
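
A sketch of this scheme with the CUDA runtime API: the calc stream waits on events recorded in the panel-upload streams, and an event recorded after the stack upload is queried, not synchronized, before the host stack buffer is refilled. Stream, buffer, and helper names are placeholders, not DBCSR's interfaces:

    #include <cuda_runtime.h>

    // Hypothetical placeholders for the real async uploads and the Libcusmm kernel launch.
    void upload_panel_a(cudaStream_t s);
    void upload_panel_b(cudaStream_t s);
    void launch_stack_kernel(cudaStream_t s);

    void schedule_stack(cudaStream_t a_stream, cudaStream_t b_stream, cudaStream_t calc_stream,
                        cudaEvent_t a_uploaded, cudaEvent_t b_uploaded, cudaEvent_t stack_uploaded,
                        const int* host_stack, int* dev_stack, size_t stack_bytes) {
        upload_panel_a(a_stream);                    // async H2D copy of the a panel
        cudaEventRecord(a_uploaded, a_stream);
        upload_panel_b(b_stream);                    // async H2D copy of the b panel
        cudaEventRecord(b_uploaded, b_stream);

        // The calc stream must not start before both panel uploads have finished.
        cudaStreamWaitEvent(calc_stream, a_uploaded, 0);
        cudaStreamWaitEvent(calc_stream, b_uploaded, 0);

        // Upload the stack descriptors and record an event right after the copy:
        // once it completes, the *host* stack buffer can be refilled even while
        // the kernel is still running.
        cudaMemcpyAsync(dev_stack, host_stack, stack_bytes, cudaMemcpyHostToDevice, calc_stream);
        cudaEventRecord(stack_uploaded, calc_stream);
        launch_stack_kernel(calc_stream);
    }

    // Non-blocking check used before recycling the host stack buffer.
    bool host_stack_buffer_free(cudaEvent_t stack_uploaded) {
        return cudaEventQuery(stack_uploaded) == cudaSuccess;
    }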

  9. CUDA Kernel Implementation (GPU memory usage)
     Larger matrices are processed in slabs P_A, P_B, P_C.
     Each thread computes a tile T of the result slab P_C; the results T are kept in the thread's registers.
     Outer-product style multiplication reduces accesses to P_A and P_B.
     P_B is stored transposed for coalesced memory access.
     Write-back to global memory uses compare-and-swap.
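
The following CUDA sketch shows the kernel structure described on this slide for a single block product: shared-memory slabs, a per-thread register tile, outer-product accumulation, and a compare-and-swap based atomic write-back (Kepler GPUs such as the K20X have no native double atomicAdd). Tile sizes and indexing are illustrative; the tuned Libcusmm kernels are more involved:

    #include <cuda_runtime.h>

    // Double-precision atomic add built from compare-and-swap.
    __device__ double atomic_add_cas(double* address, double val) {
        unsigned long long* addr = reinterpret_cast<unsigned long long*>(address);
        unsigned long long old = *addr, assumed;
        do {
            assumed = old;
            double updated = __longlong_as_double(assumed) + val;
            old = atomicCAS(addr, assumed, __double_as_longlong(updated));
        } while (assumed != old);
        return __longlong_as_double(old);
    }

    // One M x N += (M x K) * (K x N) block product. Assumes M % TM == 0,
    // N % TN == 0 and blockDim.x == (M / TM) * (N / TN).
    template <int M, int N, int K, int TM, int TN>
    __global__ void block_multiply(const double* A, const double* B, double* C) {
        __shared__ double pa[M][K];     // slab of A
        __shared__ double pb[N][K];     // slab of B, stored transposed for coalescing
        double tile[TM][TN] = {};       // per-thread result tile, kept in registers

        for (int idx = threadIdx.x; idx < M * K; idx += blockDim.x)   // load A slab
            pa[idx / K][idx % K] = A[idx];
        for (int idx = threadIdx.x; idx < K * N; idx += blockDim.x)   // load B slab, transposed
            pb[idx % N][idx / N] = B[idx];
        __syncthreads();

        const int row0 = (threadIdx.x / (N / TN)) * TM;
        const int col0 = (threadIdx.x % (N / TN)) * TN;
        for (int k = 0; k < K; ++k)                 // outer-product style accumulation
            for (int i = 0; i < TM; ++i)
                for (int j = 0; j < TN; ++j)
                    tile[i][j] += pa[row0 + i][k] * pb[col0 + j][k];

        for (int i = 0; i < TM; ++i)                // atomic write-back into global C
            for (int j = 0; j < TN; ++j)
                atomic_add_cas(&C[(row0 + i) * N + col0 + j], tile[i][j]);
    }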

  10. CUDA Kernel Auto-Tuning
      Plot: performance [GFlop/s] vs. parameter set (several thousand per kernel), with the winner marked.
      Six parameters to optimize: v, w, M, N, #threads, #minBlocksPerSM.
      On average > 8500 parameter sets per kernel (heuristically pruned).
      Number of kernels optimized so far: 2349.
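
A sketch of the timing harness such an auto-tuner needs: each candidate parameter set is benchmarked with CUDA events and the fastest one wins. launch_candidate() is a placeholder for one generated kernel; the real tuner generates and compiles its candidates per (m, n, k) block size:

    #include <cuda_runtime.h>
    #include <cfloat>
    #include <vector>

    struct ParamSet { int m, n, k, v, w, threads, min_blocks_per_sm; };

    // Placeholder: launches one generated kernel variant for this parameter set.
    void launch_candidate(const ParamSet& p);

    float time_candidate(const ParamSet& p, int repetitions) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        launch_candidate(p);                    // warm-up launch
        cudaEventRecord(start);
        for (int r = 0; r < repetitions; ++r)
            launch_candidate(p);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms / repetitions;                // average time per launch
    }

    ParamSet pick_winner(const std::vector<ParamSet>& candidates) {
        ParamSet best{};
        float best_ms = FLT_MAX;
        for (const ParamSet& p : candidates) {  // candidates are heuristically pruned beforehand
            float ms = time_candidate(p, 100);
            if (ms < best_ms) { best_ms = ms; best = p; }
        }
        return best;
    }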

  11. CUDA Kernel Performance
      Plot: performance [GFlop/s] and arithmetic intensity vs. block size (n = m = k, up to 160), comparing libcusmm, cuBLAS, the roofline bound, and libcusmm without write-back.
      The K20X GPU has 1.3 TFlop/s peak and 180 GB/s memory bandwidth with ECC.

  12. GPU Model Comparison
      Plot: performance [GFlop/s] vs. block size (n = m = k, up to 35) for Tesla K80, Tesla K40, and Tesla K20X.

  13. Single Node Performance
      Plot: performance [GFlop/s] vs. number of cores, GPU+CPU vs. CPU-only.
      4.5× speedup of GPU+CPU over CPU-only.
      Artificial benchmark with a favorable 23 × 23 block size; dual Sandy Bridge (E5-2620, 2.0 GHz, 6 cores); NVIDIA K20 GPU.

  14. Full Daint System Science Case
      80,000-atom DFT with high accuracy settings: aggregated nanoparticles in explicit solution, relevant for 3rd generation solar cells.
      Matrix dimensions: 772868 × 772868; filter threshold: 10⁻⁶; matrix occupation ≈ 4%; SCF steps ≈ 50; # multiplies needed ≈ 2000.
      Dense flops needed: 1846613343679824128000; actual flops needed: 849928403736295802; sparsity boost: 2172×; GPU flop share: 99.4%.
      Walltime on 5184 nodes: 6264 s.

  15. Bridging from Linear Scaling SCF to Materials Properties
      2D polymers: synthetically tailored 2D materials beyond graphene.
      Based on linear scaling MD simulations of 10,000s of atoms, the morphology and properties of the proposed 2D polymer sheets have been investigated.
      Payamyar et al. (2013), Advanced Materials, DOI: 10.1002/adma.201304705.

  16. Bridging from Linear Scaling SCF to Materials Properties
      Snapshots of a 2D polymer sheet over 2 ps of MD, with areas of 223 Å² and 168 Å².
      Payamyar et al. (2013), Advanced Materials, DOI: 10.1002/adma.201304705.

  17. Outlook: Strong Scaling of Dense Matrix Multiplications
      Matrix functions: diagonalization → Taylor series. Matrix inverse: Cholesky → Hotelling iteration.
      Plot: total performance [TFlop/s] vs. number of nodes (up to 1200), comparing cuBLAS without communication, the 32er kernel without communication, DBCSR (32 × 32 blocks), and Cray's libsci_acc.
      Benchmark of pdgemm with a 32k × 32k double precision matrix.
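
The Hotelling(-Bodewig) iteration mentioned here replaces a Cholesky-based inverse with pure matrix multiplications, X_{n+1} = X_n (2I − A X_n) → A⁻¹. A minimal dense C++ sketch follows; in DBCSR these would be distributed sparse multiplies, and the starting guess shown in the comment is one common choice, not necessarily CP2K's:

    #include <vector>

    using Matrix = std::vector<double>;  // row-major n x n storage

    static Matrix multiply(const Matrix& a, const Matrix& b, int n) {
        Matrix c(n * n, 0.0);
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k)
                for (int j = 0; j < n; ++j)
                    c[i * n + j] += a[i * n + k] * b[k * n + j];
        return c;
    }

    // x must be a rough initial guess for A^{-1}, e.g. A^T / (||A||_1 * ||A||_inf).
    Matrix hotelling_inverse(const Matrix& a, Matrix x, int n, int iterations) {
        for (int it = 0; it < iterations; ++it) {
            Matrix ax = multiply(a, x, n);          // A * X_n
            Matrix y(n * n);
            for (int i = 0; i < n; ++i)             // Y = 2I - A * X_n
                for (int j = 0; j < n; ++j)
                    y[i * n + j] = (i == j ? 2.0 : 0.0) - ax[i * n + j];
            x = multiply(x, y, n);                  // X_{n+1} = X_n * Y
        }
        return x;                                   // ~ A^{-1}
    }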

  18. Conclusion
      Our DBCSR library enables O(N) quantum chemistry methods, which allow for novel science.
      Lessons learned: overlapping communication with computation is key; auto-tuning is the way to go; avoid manual scheduling, use CUDA events.
      Acknowledgements: Joost VandeVondele, Florian Schiffmann, Urban Borstnik, Peter Messmer.
      Contacts: ole.schuett@mat.ethz.ch, http://nanosim.ethz.ch, http://dbcsr.cp2k.org, http://cp2k.org
      Please complete the presenter evaluation sent to you by email or through the GTC Mobile App. Your feedback is important!
