

SLIDE 1


Direct Self-Consistent Field Computations on GPU Clusters

Guochun Shi, Volodymyr Kindratenko
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

Ivan Ufimtsev, Todd Martinez
Department of Chemistry, Stanford University

SLIDE 2

Presentation Outline

  • GPU computing
  • NCSA’s Lincoln GPU cluster
  • SCF theory in Quantum Chemistry
  • Implementation on a GPU cluster
    • Kernels for J and K matrices
    • Parallelization strategy for GPU cluster
  • Performance
  • Conclusions and future work

SLIDE 3

Why GPUs?

[Figure: GPU performance trends across GeForce generations (5800, 5950 Ultra, 6800 Ultra, 7800 GTX).]

SLIDE 4

NVIDIA Tesla T10 GPU Architecture


240 streaming processors arranged as 30 streaming multiprocessors

At 1.3 GHz this provides

  • 1 TFLOPS SP
  • 86.4 GFLOPS DP

512-bit interface to off-chip GDDR3 memory

  • 102 GB/s bandwidth

[Figure: T10 block diagram. 10 thread processing clusters (TPC 1 … TPC 10), each with a geometry controller, SM controller, texture units with L1 texture cache, and three SMs; each SM contains 8 SPs, 2 SFUs, shared memory, constant cache, instruction cache, and MT issue logic. A thread execution manager, input assembler, and PCIe interface feed the TPCs; L2/ROP partitions connect through a 512-bit memory interconnect to the GDDR3 DRAM.]

SLIDE 5

Intel 64 Tesla Linux Cluster Lincoln

Dell PowerEdge 1955 server

Intel 64 (Harpertown) 2.33 GHz dual socket quad core 16 GB DDR2 Infiniband SDR

Tesla S1070 1U GPU Computing Server

1.3 GHz Tesla T10 processors 4x4 GB GDDR3 SDRAM

Cluster

Servers: 192 Accelerator Units: 96

[Figure: two compute nodes. Each Dell PowerEdge 1955 server connects over PCIe x8 to half of a Tesla S1070 (two T10 GPUs with their DRAM behind a PCIe interface) and joins the cluster fabric over SDR InfiniBand.]

SLIDE 6

HPL Benchmark for Lincoln

[Figure: achieved GFLOPS vs. system size on Lincoln.]


We used Massimiliano Fatica's (NVIDIA) GPU-enabled HPL package.

SLIDE 7

Why do we need to deal with… Quantum Chemistry

  • Energy ($\hat{H}\Psi = E\Psi$): quantifies intra- and intermolecular interactions and drives chemistry; little of interest happens on a flat energy surface
  • Geometry optimization ($\nabla_R E = 0$): searches for stable atomic arrangements (molecular shapes)
  • Molecular dynamics ($\partial^2 R/\partial t^2 = -\frac{1}{M}\nabla_R E$): the chemistry itself (at some, sometimes crude, approximation); studies the system at atomistic time and length scales

SLIDE 8

Exact energy is a hard problem

$$\left[ -\frac{1}{2}\sum_i \left( \frac{\partial^2}{\partial x_i^2} + \frac{\partial^2}{\partial y_i^2} + \frac{\partial^2}{\partial z_i^2} \right) - \sum_{i,A}\frac{Z_A}{\left| \mathbf{r}_i - \mathbf{R}_A \right|} + \sum_{i,j}\frac{1}{\left| \mathbf{r}_i - \mathbf{r}_j \right|} \right] \Psi\!\left(\mathbf{r}_i\right) = E\,\Psi\!\left(\mathbf{r}_i\right)$$

$$\Psi\!\left(\mathbf{r}_i\right) = ?\qquad E = ?$$

SLIDE 9

Hartree-Fock approximation is one of the simplest:

$$\Psi = \mathcal{A}\,\psi_1\!\left(\mathbf{r}_1\right)\psi_2\!\left(\mathbf{r}_2\right)\cdots\psi_N\!\left(\mathbf{r}_N\right)$$

Ψ is an antisymmetrized product of N one-electron orbitals ψ_i. Expand each orbital over a predefined basis set:

$$\psi_i\!\left(\mathbf{r}\right) = \sum_{j=1}^{K} C_{ij}\,\varphi_j\!\left(\mathbf{r}\right)$$

$$\Psi \leftrightarrow C_{ij} = ?$$

SLIDE 10

Hartree-Fock Self-Consistent Field (SCF) procedure

$$F\!\left(C\right) C = E S C$$

Iterate:

$$F_{k+1} = F\!\left(C_k\right), \qquad F_{k+1} C_{k+1} = E S C_{k+1}$$

Repeat until $C_{k+1}$ more or less equals $C_k$.
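A minimal sketch of this fixed-point iteration, assuming NumPy/SciPy and a hypothetical build_fock(C) helper; production SCF codes usually monitor the density matrix or the energy rather than C itself, but the loop structure is the same:

```python
import numpy as np
from scipy.linalg import eigh

def scf_loop(build_fock, S, C0, tol=1e-8, max_iter=100):
    """Iterate F(C_k) C_{k+1} = E S C_{k+1} until the coefficients stop changing.

    build_fock : hypothetical helper returning the Fock matrix F(C)
    S          : overlap matrix
    C0         : initial guess for the MO coefficient matrix
    """
    C = C0
    for k in range(max_iter):
        F = build_fock(C)                       # F_{k+1} = F(C_k)
        E, C_new = eigh(F, S)                   # generalized eigenproblem F C = E S C
        if np.max(np.abs(C_new - C)) < tol:     # "C_{k+1} more or less equals C_k"
            return E, C_new
        C = C_new
    raise RuntimeError("SCF did not converge")
```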

SLIDE 11

Hartree-Fock equations

$$F\!\left(C\right) C = E S C$$

  • All matrices are of $N \times N$ size ($N \sim 1{,}000 \dots 10{,}000$)
  • $N^3$ operations to solve the HF equations (need to deal with diagonalization)
  • $N^4$ operations to get $F$

$$F_{ij}\!\left(C\right) = H_{ij}^{\mathrm{core}} + J_{ij}\!\left(C\right) - \frac{1}{2} K_{ij}\!\left(C\right)$$

$$J_{ij} = \sum_{k,l} \left[ij \,|\, kl\right] P_{kl}\!\left(C\right), \qquad K_{ij} = \sum_{k,l} \left[ik \,|\, jl\right] P_{kl}\!\left(C\right)$$

$$\left[ij \,|\, kl\right] = \iint \varphi_i\!\left(\mathbf{r}_1\right) \varphi_j\!\left(\mathbf{r}_1\right) \frac{1}{\left|\mathbf{r}_1 - \mathbf{r}_2\right|} \varphi_k\!\left(\mathbf{r}_2\right) \varphi_l\!\left(\mathbf{r}_2\right) d\mathbf{r}_1\, d\mathbf{r}_2$$
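To make the $N^4$ cost concrete, here is a naive contraction of a precomputed integral tensor with the density matrix, assuming NumPy; the array names are purely illustrative, and no real code stores the full [ij|kl] tensor at these values of N:

```python
import numpy as np

def build_jk(eri, P):
    """Naive contraction of the two-electron integrals with the density matrix.

    eri : (N, N, N, N) array with eri[i, j, k, l] = [ij|kl]  (illustrative only)
    P   : (N, N) density matrix
    """
    J = np.einsum('ijkl,kl->ij', eri, P)   # J_ij = sum_kl [ij|kl] P_kl
    K = np.einsum('ikjl,kl->ij', eri, P)   # K_ij = sum_kl [ik|jl] P_kl
    return J, K

def build_fock(Hcore, J, K):
    # F_ij = Hcore_ij + J_ij - (1/2) K_ij
    return Hcore + J - 0.5 * K
```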

SLIDE 12

2e integral grid

[Figure: the grid of 2-electron integrals mapped to SIMD warps. Without sorting, a warp computes mostly negligibly small integrals; after sorting by magnitude, each warp computes only significant integrals.]

Each integral is bounded by its diagonal (Schwarz) counterparts:

$$\left[ij \,|\, kl\right] \le \sqrt{\left[ij \,|\, ij\right]}\,\sqrt{\left[kl \,|\, kl\right]}$$

Keeping only integrals whose bound is at least $10^{-11}$ leaves only $\sim N^2$ out of $N^4$ integrals.

$$\left[ij \,|\, kl\right] = \iint \varphi_i\!\left(\mathbf{r}_1\right) \varphi_j\!\left(\mathbf{r}_1\right) \frac{1}{\left|\mathbf{r}_1 - \mathbf{r}_2\right|} \varphi_k\!\left(\mathbf{r}_2\right) \varphi_l\!\left(\mathbf{r}_2\right) d\mathbf{r}_1\, d\mathbf{r}_2$$
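A minimal sketch of the screening idea under this bound, assuming NumPy; pair_bounds and the threshold handling are illustrative host-side logic, not the GPU kernel itself:

```python
import numpy as np

def screened_pair_list(pair_bounds, threshold=1e-11):
    """Schwarz-style screening sketch: [ij|kl] is bounded by
    sqrt([ij|ij]) * sqrt([kl|kl]); pairs are sorted by their bound so that
    work lists contain only significant integrals (names are illustrative).
    """
    order = np.argsort(pair_bounds)[::-1]      # sort bra/ket pairs, largest bound first
    bounds = pair_bounds[order]
    work = []
    for a, b_a in enumerate(bounds):
        for b, b_b in enumerate(bounds):
            if b_a * b_b < threshold:          # everything further down is even smaller
                break
            work.append((order[a], order[b]))  # (bra pair, ket pair) to compute
    return work
```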

SLIDE 13

Kernel in GPU: J-matrix implementation

$$J_{ij} = \sum_{k,l} \left[ij \,|\, kl\right] P_{kl}$$

Each contribution is screened with the pre-computed bounds $\sqrt{\left[ij \,|\, ij\right]}$ and $\sqrt{\left[kl \,|\, kl\right]}$, since $\left[ij \,|\, kl\right] \le \sqrt{\left[ij \,|\, ij\right]}\,\sqrt{\left[kl \,|\, kl\right]}$.

SLIDE 14

Kernels in GPU: K-matrix implementation

$$K_{ij} = \sum_{k,l} \left[ik \,|\, jl\right] P_{kl}$$

Screening uses the corresponding bounds $\sqrt{\left[ik \,|\, ik\right]}$ and $\sqrt{\left[jl \,|\, jl\right]}$.

SLIDE 15

Single node execution time breakdown

[Figure: single-node runtime (seconds) breakdown per SCF iteration.]

  • The J and K matrices computation and the Linear Algebra (LA) computation dominate the overall execution time
  • Pair quantity computations can be significant

SLIDE 16

GPU cluster parallelization strategy

  • Each GPU has a global id: nodeid * num_gpu_per_node + local_gpu_index
  • J/K matrices work distribution: the computations for elements in the J and K matrices are not even, so the pre-computed pair quantities are sorted and each GPU takes every N-th element (N = total number of GPUs), as sketched below
  • LA using Intel MKL
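A minimal sketch of the global id and the strided work selection described above (Python, illustrative names only; not the paper's code):

```python
def gpu_global_id(node_id, num_gpu_per_node, local_gpu_index):
    # global id as on the slide: nodeid * num_gpu_per_node + local_gpu_index
    return node_id * num_gpu_per_node + local_gpu_index

def my_work(sorted_pairs, gpu_id, num_gpus):
    """sorted_pairs: pair quantities sorted by estimated cost (illustrative).

    Each GPU takes every num_gpus-th entry, so expensive and cheap elements
    are spread evenly across GPUs.
    """
    return sorted_pairs[gpu_id::num_gpus]
```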

SLIDE 17

[Figure: SCF flowchart. Start: guess initial molecular orbital coefficient matrix C and compute the density matrix P (Eq. 10). Each iteration: pre-compute pair-wise quantities; compute J and K (Eq. 8, 9) and form Fock sub-matrices (Eq. 7); gather the complete Fock matrix F; scatter F; compute matrix C (Eq. 5); gather and broadcast P; if not converged, repeat, otherwise done. The six stages and who executes them: (1) pre-compute: master MPI processes, multiple POSIX threads; (2) compute J and K: master MPI processes, multiple POSIX threads, GPUs; (3) gather Fock matrix: master MPI processes to the rank 0 MPI process; (4) distribute Fock matrix: all MPI processes; (5) solve eigenvalue problem: all MPI processes; (6) final gather: all MPI processes to the rank 0 MPI process.]

Parallelization strategy (II)

  • Start as an MPI program; each node has as many MPI processes as CPU cores
  • One MPI process per node is designated as “master”
  • The master MPI processes create threads for controlling GPUs as well as CPU work threads
  • MPI processes / GPU management threads / CPU work threads are woken up or put to sleep as needed
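A minimal sketch of this process/thread layout, assuming mpi4py and Python threads purely for illustration; the core/GPU counts and worker functions below are hypothetical, not the paper's code:

```python
import threading
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
cores_per_node = 8                       # illustrative value
gpus_per_node = 2                        # illustrative value

node_id = rank // cores_per_node
is_master = (rank % cores_per_node == 0)   # one "master" MPI process per node

def gpu_manager(local_gpu_index):
    # hypothetical: launch J/K kernels on this GPU and collect partial results
    pass

def cpu_worker(worker_index):
    # hypothetical: compute pair quantities on the CPU
    pass

if is_master:
    threads = [threading.Thread(target=gpu_manager, args=(g,))
               for g in range(gpus_per_node)]
    threads += [threading.Thread(target=cpu_worker, args=(w,))
                for w in range(cores_per_node - gpus_per_node)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
# non-master ranks stay idle here and participate in the distributed linear algebra phase
```

Keeping one master process per node avoids oversubscribing the GPUs, while the remaining ranks stay available for the linear algebra that runs on all MPI processes.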

SLIDE 18

[Figure: per-iteration data flow across nodes (node 1, node 2, node 3, …). On each node, CPU work threads compute pair quantities using the density matrix P, and CPU threads managing the GPU kernels compute partial J and K matrices on the GPUs; the partial J and K matrices are reduced to form the Fock matrix; the Fock matrix is distributed for the linear algebra that computes matrices C and P; P is gathered and broadcast. Legend: MPI process, CPU work thread, CPU thread for managing GPU kernels, Fock matrix, distributed Fock matrix, distributed P matrix, P matrix, partial J and K, initial guess matrix C used to compute matrix P.]
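The reduction/broadcast steps in this flow can be expressed with standard MPI collectives. Below is a minimal sketch assuming mpi4py and NumPy; it collapses the distributed linear algebra into a single-rank solve (solve_for_P is a hypothetical helper), so it illustrates only the communication pattern, not the paper's actual distributed implementation:

```python
import numpy as np
from mpi4py import MPI

def scf_iteration_collectives(comm, J_partial, K_partial, Hcore, solve_for_P):
    """solve_for_P: hypothetical helper that diagonalizes F and returns the density P."""
    J = np.zeros_like(J_partial)
    K = np.zeros_like(K_partial)
    comm.Reduce(J_partial, J, op=MPI.SUM, root=0)   # reduce partial J onto rank 0
    comm.Reduce(K_partial, K, op=MPI.SUM, root=0)   # reduce partial K onto rank 0

    P = None
    if comm.Get_rank() == 0:
        F = Hcore + J - 0.5 * K                     # form the Fock matrix
        P = solve_for_P(F)                          # linear algebra: C, then density P
    P = comm.bcast(P, root=0)                       # broadcast P to every process
    return P
```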

SLIDE 19

Performance: load balancing

[Figure: per-node computation time (seconds) vs. node index for an unbalanced K matrix computation, a balanced J matrix computation, and a balanced K matrix computation.]

  • Sorting the pair quantities and the work selection strategy make the computation on the GPUs well balanced, reducing performance degradation

SLIDE 20

Molecule   Atoms   Electrons   Orbitals   S shells   P shells
Olestra      453        1366       2131       1081        350
BPTI         875        3400       4893       2202        897
CspA        1732        6290       8753       4220       1511

Performance

[Figure: runtime (s) vs. number of nodes for Olestra, BPTI, and CspA, using the 3-21G basis set.]

SLIDE 21

Scalability of J, K and LA

[Figure: scaling of the J, K, and LA phases with the number of nodes for Olestra, BPTI, and CspA.]

  • The J and K matrices computation scales well up to 128 nodes
  • Linear Algebra scales only up to 16 nodes, even for the CspA molecule

SLIDE 22

[Figure: time per iteration (secs) vs. number of cluster nodes, broken down by linear algebra operation.]

Performance: Linear Algebra breakdown

  • Diagonalization scales the worst; dgemm is also important
  • A fast, scalable GPU-based ScaLAPACK is needed
    • MAGMA from UTK?
    • CULA?

SLIDE 23

Results: Olestra molecule

The Olestra molecule, consisting of 453 atoms (a small example model used for testing the developed software), can be computed by the state-of-the-art quantum chemistry software package GAMESS, running on an Intel Pentium D 3 GHz processor, in 12,408 seconds, whereas our 8-node GPU cluster implementation performs the same computation in just over 5 seconds, a 2,452× speedup.
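As an arithmetic check, the reported figures are consistent with one another:

$$\frac{12{,}408\ \text{s (GAMESS, Pentium D 3 GHz)}}{\approx 5.06\ \text{s (8-node GPU cluster)}} \approx 2{,}452\times$$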

SLIDE 24

Example: CspA molecule

For larger models, one SCF iteration for the Cold shock protein A (CspA) molecule, consisting of 1,732 atoms, can be done in 88 seconds on a 16-node GPU cluster.

SLIDE 25

Conclusions and future work

  • GPU computing brings Quantum Chemistry computing to a new level
  • Parallelization enables computing of large molecules in a shorter time
  • The J and K matrices show good scalability
  • Linear Algebra scales only up to 16 nodes and becomes a major bottleneck
  • A linear algebra package using GPUs with good scalability is needed
    • Matrix multiplication and eigenvalue solver
  • Only S and P orbitals are supported at this moment

SLIDE 26

Acknowledgement

This work was supported by the National Science Foundation grant CHE-06-26354.
