

SLIDE 1


Direct Self-Consistent Field Computations on GPU Clusters

Guochun Shi, Volodymyr Kindratenko
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

Ivan Ufimtsev, Todd Martinez
Department of Chemistry, Stanford University

SLIDE 2

Presentation Outline

  • GPU computing
  • NCSA’s Lincoln GPU cluster
  • SCF theory in Quantum Chemistry
  • Implementation on a GPU cluster
    • Kernels for J and K matrices
    • Parallelization strategy for GPU cluster
  • Performance
  • Conclusions and future work

SLIDE 3

Why GPUs?

[Figure: GPU performance trends across GeForce generations (5800, 5950 Ultra, 6800 Ultra, 7800 GTX).]

SLIDE 4

NVIDIA Tesla T10 GPU Architecture


240 streaming processors arranged as 30 streaming multiprocessors

At 1.3 GHz this provides

  • 1 TFLOPS SP
  • 86.4 GFLOPS DP

512-bit interface to off-chip GDDR3 memory

  • 102 GB/s bandwidth

[Figure: T10 block diagram. 10 thread processing clusters (TPC 1 … TPC 10), each with a geometry controller, SM controller, texture units with L1 texture cache, and three SMs; each SM contains 8 SPs, 2 SFUs, shared memory, constant cache, instruction cache, and MT issue logic. A thread execution manager, input assembler, and PCIe interface feed the TPCs; L2/ROP partitions connect through a 512-bit memory interconnect to the GDDR3 DRAM.]

SLIDE 5

Intel 64 Tesla Linux Cluster Lincoln

Dell PowerEdge 1955 server

Intel 64 (Harpertown) 2.33 GHz dual socket quad core 16 GB DDR2 Infiniband SDR

Tesla S1070 1U GPU Computing Server

1.3 GHz Tesla T10 processors 4x4 GB GDDR3 SDRAM

Cluster

Servers: 192 Accelerator Units: 96

[Figure: two compute nodes. Each Dell PowerEdge 1955 server connects over PCIe x8 to half of a Tesla S1070 (two T10 GPUs with their DRAM behind a PCIe interface) and joins the cluster fabric over SDR InfiniBand.]

SLIDE 6

HPL Benchmark for Lincoln

[Figure: achieved GFLOPS vs. system size on Lincoln.]


We used Massimiliano Fatica's (NVIDIA) GPU-enabled HPL package.

SLIDE 7

Why do we need to deal with… Quantum Chemistry

  • Energy ($\hat{H}\Psi = E\Psi$): quantifies intra- and intermolecular interactions and drives chemistry; little of interest happens on a flat energy surface
  • Geometry optimization ($\nabla_R E = 0$): searches for stable atomic arrangements (molecular shapes)
  • Molecular dynamics ($\partial^2 R/\partial t^2 = -\frac{1}{M}\nabla_R E$): the chemistry itself (at some, sometimes crude, approximation); studies the system at atomistic time and length scales

SLIDE 8

Exact energy is a hard problem

$$\left[ -\frac{1}{2}\sum_i \left( \frac{\partial^2}{\partial x_i^2} + \frac{\partial^2}{\partial y_i^2} + \frac{\partial^2}{\partial z_i^2} \right) - \sum_{i,A}\frac{Z_A}{\left| \mathbf{r}_i - \mathbf{R}_A \right|} + \sum_{i,j}\frac{1}{\left| \mathbf{r}_i - \mathbf{r}_j \right|} \right] \Psi\!\left(\mathbf{r}_i\right) = E\,\Psi\!\left(\mathbf{r}_i\right)$$

$$\Psi\!\left(\mathbf{r}_i\right) = ?\qquad E = ?$$

SLIDE 9

Hartree-Fock approximation is one of the simplest:

$$\Psi = \mathcal{A}\,\psi_1\!\left(\mathbf{r}_1\right)\psi_2\!\left(\mathbf{r}_2\right)\cdots\psi_N\!\left(\mathbf{r}_N\right)$$

Ψ is an antisymmetrized product of N one-electron orbitals ψ_i. Expand each orbital over a predefined basis set:

$$\psi_i\!\left(\mathbf{r}\right) = \sum_{j=1}^{K} C_{ij}\,\varphi_j\!\left(\mathbf{r}\right)$$

$$\Psi \leftrightarrow C_{ij} = ?$$

SLIDE 10

Hartree-Fock Self-Consistent Field (SCF) procedure

$$F\!\left(C\right) C = E S C$$

Iterate:

$$F_{k+1} = F\!\left(C_k\right), \qquad F_{k+1} C_{k+1} = E S C_{k+1}$$

Repeat until $C_{k+1}$ more or less equals $C_k$.
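A minimal sketch of this fixed-point iteration, assuming NumPy/SciPy and a hypothetical build_fock(C) helper; production SCF codes usually monitor the density matrix or the energy rather than C itself, but the loop structure is the same:

```python
import numpy as np
from scipy.linalg import eigh

def scf_loop(build_fock, S, C0, tol=1e-8, max_iter=100):
    """Iterate F(C_k) C_{k+1} = E S C_{k+1} until the coefficients stop changing.

    build_fock : hypothetical helper returning the Fock matrix F(C)
    S          : overlap matrix
    C0         : initial guess for the MO coefficient matrix
    """
    C = C0
    for k in range(max_iter):
        F = build_fock(C)                       # F_{k+1} = F(C_k)
        E, C_new = eigh(F, S)                   # generalized eigenproblem F C = E S C
        if np.max(np.abs(C_new - C)) < tol:     # "C_{k+1} more or less equals C_k"
            return E, C_new
        C = C_new
    raise RuntimeError("SCF did not converge")
```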

SLIDE 11

Hartree-Fock equations

$$F\!\left(C\right) C = E S C$$

  • All matrices are of $N \times N$ size ($N \sim 1{,}000 \dots 10{,}000$)
  • $N^3$ operations to solve the HF equations (need to deal with diagonalization)
  • $N^4$ operations to get $F$

$$F_{ij}\!\left(C\right) = H_{ij}^{\mathrm{core}} + J_{ij}\!\left(C\right) - \frac{1}{2} K_{ij}\!\left(C\right)$$

$$J_{ij} = \sum_{k,l} \left[ij \,|\, kl\right] P_{kl}\!\left(C\right), \qquad K_{ij} = \sum_{k,l} \left[ik \,|\, jl\right] P_{kl}\!\left(C\right)$$

$$\left[ij \,|\, kl\right] = \iint \varphi_i\!\left(\mathbf{r}_1\right) \varphi_j\!\left(\mathbf{r}_1\right) \frac{1}{\left|\mathbf{r}_1 - \mathbf{r}_2\right|} \varphi_k\!\left(\mathbf{r}_2\right) \varphi_l\!\left(\mathbf{r}_2\right) d\mathbf{r}_1\, d\mathbf{r}_2$$
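To make the $N^4$ cost concrete, here is a naive contraction of a precomputed integral tensor with the density matrix, assuming NumPy; the array names are purely illustrative, and no real code stores the full [ij|kl] tensor at these values of N:

```python
import numpy as np

def build_jk(eri, P):
    """Naive contraction of the two-electron integrals with the density matrix.

    eri : (N, N, N, N) array with eri[i, j, k, l] = [ij|kl]  (illustrative only)
    P   : (N, N) density matrix
    """
    J = np.einsum('ijkl,kl->ij', eri, P)   # J_ij = sum_kl [ij|kl] P_kl
    K = np.einsum('ikjl,kl->ij', eri, P)   # K_ij = sum_kl [ik|jl] P_kl
    return J, K

def build_fock(Hcore, J, K):
    # F_ij = Hcore_ij + J_ij - (1/2) K_ij
    return Hcore + J - 0.5 * K
```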

SLIDE 12

2e integral grid

[Figure: the grid of 2-electron integrals mapped to SIMD warps. Without sorting, a warp computes mostly negligibly small integrals; after sorting by magnitude, each warp computes only significant integrals.]

Each integral is bounded by its diagonal (Schwarz) counterparts:

$$\left[ij \,|\, kl\right] \le \sqrt{\left[ij \,|\, ij\right]}\,\sqrt{\left[kl \,|\, kl\right]}$$

Keeping only integrals whose bound is at least $10^{-11}$ leaves only $\sim N^2$ out of $N^4$ integrals.

$$\left[ij \,|\, kl\right] = \iint \varphi_i\!\left(\mathbf{r}_1\right) \varphi_j\!\left(\mathbf{r}_1\right) \frac{1}{\left|\mathbf{r}_1 - \mathbf{r}_2\right|} \varphi_k\!\left(\mathbf{r}_2\right) \varphi_l\!\left(\mathbf{r}_2\right) d\mathbf{r}_1\, d\mathbf{r}_2$$
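A minimal sketch of the screening idea under this bound, assuming NumPy; pair_bounds and the threshold handling are illustrative host-side logic, not the GPU kernel itself:

```python
import numpy as np

def screened_pair_list(pair_bounds, threshold=1e-11):
    """Schwarz-style screening sketch: [ij|kl] is bounded by
    sqrt([ij|ij]) * sqrt([kl|kl]); pairs are sorted by their bound so that
    work lists contain only significant integrals (names are illustrative).
    """
    order = np.argsort(pair_bounds)[::-1]      # sort bra/ket pairs, largest bound first
    bounds = pair_bounds[order]
    work = []
    for a, b_a in enumerate(bounds):
        for b, b_b in enumerate(bounds):
            if b_a * b_b < threshold:          # everything further down is even smaller
                break
            work.append((order[a], order[b]))  # (bra pair, ket pair) to compute
    return work
```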

SLIDE 13

Kernel in GPU: J-matrix implementation

$$J_{ij} = \sum_{k,l} \left[ij \,|\, kl\right] P_{kl}$$

Each contribution is screened with the pre-computed bounds $\sqrt{\left[ij \,|\, ij\right]}$ and $\sqrt{\left[kl \,|\, kl\right]}$, since $\left[ij \,|\, kl\right] \le \sqrt{\left[ij \,|\, ij\right]}\,\sqrt{\left[kl \,|\, kl\right]}$.

SLIDE 14

Kernels in GPU: K-matrix implementation

$$K_{ij} = \sum_{k,l} \left[ik \,|\, jl\right] P_{kl}$$

Screening uses the corresponding bounds $\sqrt{\left[ik \,|\, ik\right]}$ and $\sqrt{\left[jl \,|\, jl\right]}$.

SLIDE 15

Single node execution time breakdown

[Figure: single-node runtime (seconds) breakdown per SCF iteration.]

  • The J and K matrices computation and the Linear Algebra (LA) computation dominate the overall execution time
  • Pair quantity computations can be significant

SLIDE 16

GPU cluster parallelization strategy

  • Each GPU has a global id: nodeid * num_gpu_per_node + local_gpu_index
  • J/K matrices work distribution: the computations for elements in the J and K matrices are not even, so the pre-computed pair quantities are sorted and each GPU takes every N-th element (N = total number of GPUs), as sketched below
  • LA using Intel MKL
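A minimal sketch of the global id and the strided work selection described above (Python, illustrative names only; not the paper's code):

```python
def gpu_global_id(node_id, num_gpu_per_node, local_gpu_index):
    # global id as on the slide: nodeid * num_gpu_per_node + local_gpu_index
    return node_id * num_gpu_per_node + local_gpu_index

def my_work(sorted_pairs, gpu_id, num_gpus):
    """sorted_pairs: pair quantities sorted by estimated cost (illustrative).

    Each GPU takes every num_gpus-th entry, so expensive and cheap elements
    are spread evenly across GPUs.
    """
    return sorted_pairs[gpu_id::num_gpus]
```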

SLIDE 17

[Figure: SCF flowchart. Start: guess initial molecular orbital coefficient matrix C and compute the density matrix P (Eq. 10). Each iteration: pre-compute pair-wise quantities; compute J and K (Eq. 8, 9) and form Fock sub-matrices (Eq. 7); gather the complete Fock matrix F; scatter F; compute matrix C (Eq. 5); gather and broadcast P; if not converged, repeat, otherwise done. The six stages and who executes them: (1) pre-compute: master MPI processes, multiple POSIX threads; (2) compute J and K: master MPI processes, multiple POSIX threads, GPUs; (3) gather Fock matrix: master MPI processes to the rank 0 MPI process; (4) distribute Fock matrix: all MPI processes; (5) solve eigenvalue problem: all MPI processes; (6) final gather: all MPI processes to the rank 0 MPI process.]

Parallelization strategy (II)

  • Start as an MPI program; each node has as many MPI processes as CPU cores
  • One MPI process per node is designated as “master”
  • The master MPI processes create threads for controlling GPUs as well as CPU work threads
  • MPI processes / GPU management threads / CPU work threads are woken up or put to sleep as needed
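A minimal sketch of this process/thread layout, assuming mpi4py and Python threads purely for illustration; the core/GPU counts and worker functions below are hypothetical, not the paper's code:

```python
import threading
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
cores_per_node = 8                       # illustrative value
gpus_per_node = 2                        # illustrative value

node_id = rank // cores_per_node
is_master = (rank % cores_per_node == 0)   # one "master" MPI process per node

def gpu_manager(local_gpu_index):
    # hypothetical: launch J/K kernels on this GPU and collect partial results
    pass

def cpu_worker(worker_index):
    # hypothetical: compute pair quantities on the CPU
    pass

if is_master:
    threads = [threading.Thread(target=gpu_manager, args=(g,))
               for g in range(gpus_per_node)]
    threads += [threading.Thread(target=cpu_worker, args=(w,))
                for w in range(cores_per_node - gpus_per_node)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
# non-master ranks stay idle here and participate in the distributed linear algebra phase
```

Keeping one master process per node avoids oversubscribing the GPUs, while the remaining ranks stay available for the linear algebra that runs on all MPI processes.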

SLIDE 18

[Figure: per-iteration data flow across nodes (node 1, node 2, node 3, …). On each node, CPU work threads compute pair quantities using the density matrix P, and CPU threads managing the GPU kernels compute partial J and K matrices on the GPUs; the partial J and K matrices are reduced to form the Fock matrix; the Fock matrix is distributed for the linear algebra that computes matrices C and P; P is gathered and broadcast. Legend: MPI process, CPU work thread, CPU thread for managing GPU kernels, Fock matrix, distributed Fock matrix, distributed P matrix, P matrix, partial J and K, initial guess matrix C used to compute matrix P.]
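The reduction/broadcast steps in this flow can be expressed with standard MPI collectives. Below is a minimal sketch assuming mpi4py and NumPy; it collapses the distributed linear algebra into a single-rank solve (solve_for_P is a hypothetical helper), so it illustrates only the communication pattern, not the paper's actual distributed implementation:

```python
import numpy as np
from mpi4py import MPI

def scf_iteration_collectives(comm, J_partial, K_partial, Hcore, solve_for_P):
    """solve_for_P: hypothetical helper that diagonalizes F and returns the density P."""
    J = np.zeros_like(J_partial)
    K = np.zeros_like(K_partial)
    comm.Reduce(J_partial, J, op=MPI.SUM, root=0)   # reduce partial J onto rank 0
    comm.Reduce(K_partial, K, op=MPI.SUM, root=0)   # reduce partial K onto rank 0

    P = None
    if comm.Get_rank() == 0:
        F = Hcore + J - 0.5 * K                     # form the Fock matrix
        P = solve_for_P(F)                          # linear algebra: C, then density P
    P = comm.bcast(P, root=0)                       # broadcast P to every process
    return P
```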

SLIDE 19

Performance: load balancing

[Figure: per-node computation time (seconds) vs. node index for an unbalanced K matrix computation, a balanced J matrix computation, and a balanced K matrix computation.]

  • Sorting the pair quantities and the work selection strategy make the computation on the GPUs well balanced, reducing performance degradation

SLIDE 20

Molecule   Atoms   Electrons   Orbitals   S shells   P shells
Olestra      453        1366       2131       1081        350
BPTI         875        3400       4893       2202        897
CspA        1732        6290       8753       4220       1511

Performance

[Figure: runtime (s) vs. number of nodes for Olestra, BPTI, and CspA, using the 3-21G basis set.]

SLIDE 21

Scalability of J, K and LA

[Figure: scaling of the J, K, and LA phases with the number of nodes for Olestra, BPTI, and CspA.]

  • The J and K matrices computation scales well up to 128 nodes
  • Linear Algebra scales only up to 16 nodes, even for the CspA molecule

SLIDE 22

[Figure: time per iteration (secs) vs. number of cluster nodes, broken down by linear algebra operation.]

Performance: Linear Algebra breakdown

  • Diagonalization scales the worst; dgemm is also important
  • A fast, scalable GPU-based ScaLAPACK is needed
    • MAGMA from UTK?
    • CULA?

SLIDE 23

Results: Olestra molecule

The Olestra molecule, consisting of 453 atoms (a small example model used for testing the developed software), can be computed by the state-of-the-art quantum chemistry software package GAMESS, running on an Intel Pentium D 3 GHz processor, in 12,408 seconds, whereas our 8-node GPU cluster implementation performs the same computation in just over 5 seconds, a 2,452× speedup.
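As an arithmetic check, the reported figures are consistent with one another:

$$\frac{12{,}408\ \text{s (GAMESS, Pentium D 3 GHz)}}{\approx 5.06\ \text{s (8-node GPU cluster)}} \approx 2{,}452\times$$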

SLIDE 24

Example: CspA molecule

For larger models, one SCF iteration for the Cold shock protein A (CspA) molecule, consisting of 1,732 atoms, can be done in 88 seconds on a 16-node GPU cluster.

SLIDE 25

Conclusions and future work

  • GPU computing brings Quantum Chemistry computing to a new level
  • Parallelization enables computing of large molecules in a shorter time
  • The J and K matrices show good scalability
  • Linear Algebra scales only up to 16 nodes and becomes a major bottleneck
  • A linear algebra package using GPUs with good scalability is needed
    • Matrix multiplication and eigenvalue solver
  • Only S and P orbitals are supported at this moment

SLIDE 26

Acknowledgement

This work was supported by the National Science Foundation grant CHE-06-26354.
