

SLIDE 1

Department of Materials

Accelerated Sparse Matrix Multiplication for Quantum Chemistry with CP2K on Hybrid Supercomputers

Ole Schütt (ole.schuett@mat.ethz.ch)

Nanoscale Simulations

SLIDE 2

Application: Emerging Photovoltaics

Processes at TiO2-Interface

Schiffmann et al. (2010)

Electron Transport across Nanoparticles

17k atoms, 80k electrons

Hole Transporting Material (HTM)

spiro-MeOTAD

Requirements

  • electronic properties ⇒ Schrödinger equation (HΨ = EΨ)
  • lack of symmetries ⇒ large simulation cells (> 1000 atoms)

SLIDE 3

Linear Scaling Self Consistent Field

SCF iteration:
  • Guess initial density ρ
  • Calculate matrix H from ρ (costs O(N), but dominates for small systems)
  • Dense linear algebra path: calculate eigenvectors ψᵢ of H (costs O(N³)), then calculate new density ρ = Σᵢ |ψᵢ|²
  • Sparse linear algebra path: calculate ρ directly as a matrix function of H (costs O(N))
  • Calculate energy from ρ

Density matrix P as a matrix function of H, in the limit of small kT (ground state):

  P = [✶ + exp((H − µ✶)/kT)]⁻¹ = ½ [✶ − sign(H − µ✶)]

Evaluate sign() as a polynomial series:

  X₀ = A · ‖A‖⁻¹
  Xₙ₊₁ = ½ Xₙ (3✶ − Xₙ²)
  sign(A) = X∞

LS-SCF is entirely based on sparse linear algebra.
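The iteration above is easy to try on a toy Hamiltonian. The sketch below uses dense numpy matrices for clarity (the real code operates on sparse DBCSR matrices); the 4×4 diagonal H and µ = 0 are made-up illustration values.

```python
import numpy as np

def matrix_sign(A, tol=1e-12, max_iter=100):
    """Newton-Schulz iteration for the matrix sign function:
    X0 = A / ||A||,  X_{n+1} = 1/2 * X_n (3I - X_n^2)."""
    X = A / np.linalg.norm(A, 2)
    I = np.eye(A.shape[0])
    for _ in range(max_iter):
        X_new = 0.5 * X @ (3 * I - X @ X)
        if np.linalg.norm(X_new - X) < tol:
            return X_new
        X = X_new
    return X

# Toy "Hamiltonian": symmetric, two levels below and two above the gap.
H = np.diag([-2.0, -1.0, 1.0, 3.0])
mu = 0.0  # chemical potential placed inside the gap
I = np.eye(4)

# Ground-state density matrix: P = 1/2 [I - sign(H - mu*I)]
P = 0.5 * (I - matrix_sign(H - mu * I))
```

For this H the result is idempotent (P² = P) and projects onto the two occupied levels, so its trace is 2.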

SLIDE 4

Benchmarks of Condensed Phase Systems

[Plot: wall time (min) vs. number of atoms (10,000 to 60,000), comparing diagonalization and linear scaling]

DFT on 46,656 cores; DFTB on 9,216 cores

O(N) methods are inevitable for large systems

VandeVondele et al. (2012): Linear Scaling Self-Consistent Field Calculations with Millions of Atoms in the Condensed Phase

SLIDE 5

The DBCSR Library

DBCSR = Distributed Block Compressed Sparse Row
  • Workhorse of CP2K's linear-scaling DFT code
  • Non-zero elements are small dense blocks, e.g. 13 × 13
  • Each block corresponds to the interaction between two atoms
  • Additions are local operations
  • Multiplications are more elaborate...

[Figure: block structure for a small water system (H O H blocks): neglect distant atom pairs; exploit symmetry]
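The block-sparse layout can be sketched with a toy model: non-zero blocks keyed by (block row, block column), multiplied block-by-block and filtered. This is an illustrative Python sketch, not DBCSR's actual data structure (which stores blocks contiguously in CSR order and distributes them over MPI ranks).

```python
import numpy as np

def block_multiply(A, B, bs, threshold=1e-6):
    """C = A*B for dicts of dense bs x bs blocks; drop tiny product blocks."""
    C = {}
    for (i, k), a in A.items():
        for (k2, j), b in B.items():
            if k == k2:
                C[(i, j)] = C.get((i, j), np.zeros((bs, bs))) + a @ b
    # filtering keeps the product sparse (cf. a filter threshold of 1e-6)
    return {ij: c for ij, c in C.items() if np.linalg.norm(c) > threshold}

bs = 2
A = {(0, 0): np.eye(bs), (0, 1): 0.5 * np.ones((bs, bs))}
B = {(0, 0): np.eye(bs), (1, 1): np.eye(bs)}
C = block_multiply(A, B, bs)
# C has blocks (0,0) = I and (0,1) = 0.5 * ones
```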

SLIDE 6

Architecture of DBCSR’s Multiplication Code

Layer stack (Cluster → Node → GPU):
  • Cannon: MPI parallelization
  • Multrec: cache optimization
  • CSR: stack generation
  • Scheduler: CPU/GPU load balancing
  • Backends: Host driver (Libsmm, with BLAS fallback) and Cuda driver (Libcusmm)

SLIDE 7

Hiding Communication with Double Buffering

[Timeline over three Cannon ticks: in each tick, the two host/device buffer pairs alternate through MPI send/receive, stack generation, host-to-device transfer, and stack processing, so communication for one buffer overlaps computation on the other]

Ideally: Network and GPU always busy
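The buffer rotation can be sketched schematically: while one buffer is processed, the next panel lands in the other. In this sequential Python sketch, `comm` and `compute` are hypothetical stand-ins for the asynchronous MPI transfers and GPU stack processing; only the ping-pong control flow is shown.

```python
def pipeline(panels, comm, compute):
    """Double-buffered loop: buffer `cur` is consumed while buffer
    `nxt` receives the next panel (asynchronously, in the real code)."""
    buffers = [None, None]
    results = []
    buffers[0] = comm(panels[0])  # prefetch the first panel
    for tick in range(len(panels)):
        cur = tick % 2
        nxt = 1 - cur
        if tick + 1 < len(panels):
            buffers[nxt] = comm(panels[tick + 1])  # overlaps with compute
        results.append(compute(buffers[cur]))
    return results

# Toy usage: "communication" just passes the panel, "compute" squares it.
out = pipeline([1, 2, 3], comm=lambda p: p, compute=lambda b: b * b)
# out == [1, 4, 9]
```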

SLIDE 8

Managing Dependencies with Cuda Events and Streams

[Timeline: separate Cuda streams for the a, b, and c panels and for the stack buffers; host-to-device copies, zeroing of the c panel, and kernel launches are ordered by events, and an event is queried before a host stack buffer is reused]

SLIDE 9

Cuda Kernel Implementation

GPU Memory Usage

  • Larger matrices are processed in slabs PA, PB, PC
  • Each thread computes a tile T of the result slab PC
  • Results T are kept in the thread's registers
  • Outer-product style multiplication reduces accesses to PA and PB
  • PB is stored transposed to enable coalesced memory access
  • Write-back to global memory uses Compare-and-Swap
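The outer-product formulation can be illustrated in a few lines of numpy: C is built as a sum of rank-1 updates, so each step touches only one column of PA and one row of PB while the accumulating tile stays resident (in registers, on the GPU). A sketch of the access pattern, not the Cuda kernel itself:

```python
import numpy as np

def outer_product_multiply(A, B):
    """C = A @ B accumulated as a sum of rank-1 updates (outer products).
    Each step reads one column of A and one row of B; the accumulator C
    is touched every step, which is why it should live in registers."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    for p in range(k):
        C += np.outer(A[:, p], B[p, :])  # rank-1 update
    return C

A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)
assert np.allclose(outer_product_multiply(A, B), A @ B)
```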

SLIDE 10

Cuda Kernel Auto-Tuning

[Scatter plot: performance (GFlop/s) of several thousand parameter sets for one kernel, with the winner marked]

  • Six parameters to optimize: v, w, N, M, #threads, #minBlocksPerSM
  • On average > 8500 parameter sets per kernel (heuristically pruned)
  • Number of kernels optimized so far: 2349
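The tuning loop itself is conceptually simple: enumerate parameter sets, benchmark each generated kernel, keep the fastest. A hypothetical Python sketch, with a synthetic benchmark function standing in for real GPU timings:

```python
import itertools

def autotune(kernel_factory, param_space, benchmark):
    """Exhaustive search over parameter sets, keeping the fastest kernel.
    param_space: dict name -> candidate values; benchmark returns a score
    (GFlop/s in the real tuner)."""
    best_params, best_perf = None, -1.0
    names = list(param_space)
    for combo in itertools.product(*(param_space[n] for n in names)):
        params = dict(zip(names, combo))
        perf = benchmark(kernel_factory(params))
        if perf > best_perf:
            best_params, best_perf = params, perf
    return best_params, best_perf

# Toy objective standing in for measured GFlop/s: peaks at threads=128, w=4.
space = {"threads": [64, 128, 256], "w": [2, 4, 8]}
best_params, best_perf = autotune(
    kernel_factory=lambda p: p,
    param_space=space,
    benchmark=lambda p: -abs(p["threads"] - 128) - 10 * abs(p["w"] - 4),
)
# best_params == {"threads": 128, "w": 4}
```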

SLIDE 11

Cuda Kernel Performance

[Plot: performance (GFlop/s) vs. block size (n = m = k, up to 160) and arithmetic intensity, comparing libcusmm, cuBLAS, and the roofline without write-back]

The K20X GPU has 1.3 TFlop/s peak performance and 180 GB/s memory bandwidth with ECC.
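The roofline bound in the plot follows from min(peak, arithmetic intensity × bandwidth). The sketch below assumes an idealized traffic model in which each of the three n × n double-precision panels crosses the memory bus exactly once, giving n/12 flops per byte; the real kernel's traffic differs (stacking amortizes loads, and the CAS write-back also reads C).

```python
def roofline(ai, peak_flops=1.3e12, bw=180e9):
    """Attainable performance bound: min(peak, arithmetic intensity * BW).
    Defaults are the K20X numbers from the slide."""
    return min(peak_flops, ai * bw)

def block_ai(n, panels=3):
    """Arithmetic intensity of an n x n x n block product under the
    idealized model: 2n^3 flops over `panels` arrays of n*n doubles."""
    flops = 2 * n**3
    bytes_moved = panels * n * n * 8
    return flops / bytes_moved       # = n / 12 for panels=3

for n in (13, 23, 64):
    ai = block_ai(n)
    bound_gflops = roofline(ai) / 1e9  # attainable GFlop/s for this block size
```

Under this model even a 64 × 64 block is still bandwidth-bound (ai ≈ 5.3, bound ≈ 960 GFlop/s), which matches the shape of the measured curve.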

SLIDE 12

GPU Model Comparison

[Plot: performance (GFlop/s) vs. block size (n = m = k, up to 35) for Tesla K80, Tesla K40, and Tesla K20X]

SLIDE 13

Single Node Performance

[Plot: performance (GFlop/s) vs. number of CPU cores (1 to 12), GPU+CPU vs. CPU-only]

4.5x Speedup GPU+CPU vs CPU-only

Artificial benchmark with favorable 23 × 23 block size; dual Sandy Bridge (E5-2620, 2.0 GHz, 6 cores); Nvidia K20 GPU.

SLIDE 14

Full Daint System Science Case

  • 80,000 atoms DFT, high-accuracy settings
  • Aggregated nanoparticles in explicit solution
  • Relevant for 3rd-generation solar cells

  • Matrix dims: 772,868 × 772,868
  • Filter threshold: 10⁻⁶
  • Matrix occupation: ≈ 4%
  • SCF steps: ≈ 50
  • # multiplies needed: ≈ 2000
  • Dense flops needed: 1846613343679824128000 (≈ 1.85 × 10²¹)
  • Actual flops needed: 849928403736295802 (≈ 8.50 × 10¹⁷)
  • Sparsity boost: 2172×
  • GPU flop share: 99.4%
  • Walltime on 5184 nodes: 6264 s
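The dense-flops figure and the sparsity boost can be checked with a quick back-of-the-envelope calculation (2n³ flops per dense multiply, times the number of multiplies):

```python
n = 772868           # matrix dimension
multiplies = 2000    # sparse multiplications over the whole run

dense_flops = multiplies * 2 * n**3   # what dense algebra would have cost
actual_flops = 849928403736295802     # measured, from the slide
boost = dense_flops / actual_flops    # ~2172x fewer flops thanks to sparsity
```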

SLIDE 15

Bridging from Linear Scaling SCF to Materials Properties

2D polymers: synthetically tailored 2D materials beyond graphene. Based on linear-scaling MD simulations of 10,000s of atoms, the morphology and properties of the proposed 2D polymer sheets have been investigated.

Payamyar et al., (2013) ADVANCED MATERIALS, DOI: 10.1002/adma.201304705

SLIDE 16

Bridging from Linear Scaling SCF to Materials Properties

[Snapshots: area 223 Å² → 168 Å² after 2 ps of MD]

Payamyar et al., (2013) ADVANCED MATERIALS, DOI: 10.1002/adma.201304705

SLIDE 17

Outlook: Strong Scaling of Dense Matrix Multiplications

  • Matrix functions: Diagonalization → Taylor series
  • Matrix inverse: Cholesky → Hotelling
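Hotelling's iteration replaces a Cholesky factorization with pure matrix multiplications, which is what makes it attractive here. A dense numpy sketch; the starting guess X₀ = Aᵀ/(‖A‖₁‖A‖∞) is one standard choice that guarantees convergence, not necessarily what CP2K uses:

```python
import numpy as np

def hotelling_inverse(A, tol=1e-12, max_iter=100):
    """Hotelling-Bodewig iteration X_{n+1} = X_n (2I - A X_n) -> A^{-1}.
    Built entirely from matrix multiplications, hence GPU/sparse friendly."""
    n = A.shape[0]
    I = np.eye(n)
    # Safe starting guess: guarantees ||I - A X0|| < 1.
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    for _ in range(max_iter):
        X_new = X @ (2 * I - A @ X)
        if np.linalg.norm(X_new - X) < tol:
            return X_new
        X = X_new
    return X

A = np.array([[4.0, 1.0], [2.0, 3.0]])
X = hotelling_inverse(A)
assert np.allclose(A @ X, np.eye(2))
```

Like the sign iteration on slide 3, convergence is quadratic once the error is small, so only a handful of multiplications are needed.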

[Plot: total performance (TFlop/s) vs. number of nodes (up to 1200), comparing cuBLAS w/o communication, the 32×32 kernel w/o communication, DBCSR with 32×32 blocks, and Cray's libsci_acc]

Benchmark of pdgemm, 32k × 32k double-precision matrix

SLIDE 18

Conclusion

Our DBCSR library enables O(N) quantum chemistry methods, which allow for novel science.

Lessons learned

  • Overlapping communication with computation is key
  • Auto-tuning is the way to go
  • Avoid manual scheduling; use Cuda events

Acknowledgements

Joost VandeVondele, Florian Schiffmann, Urban Borstnik, Peter Messmer

Contacts

ole.schuett@mat.ethz.ch
http://nanosim.ethz.ch
http://dbcsr.cp2k.org
http://cp2k.org

Please complete the Presenter Evaluation sent to you by email or through the GTC Mobile App. Your feedback is important!
