

SLIDE 1

ORNL is managed by UT-Battelle for the US Department of Energy

S6623: Advances in NAMD GPU Performance

Antti-Pekka Hynninen Oak Ridge Leadership Computing Facility (OLCF) hynninena@ornl.gov

SLIDE 2

Motivation

  • Make NAMD run fast on the upcoming Summit supercomputer
  • Center for Accelerated Application Readiness (CAAR) project
    – “Molecular Machinery of the Brain”
    – PI Prof. Klaus Schulten, University of Illinois at Urbana-Champaign
    – Co-PI James C. Phillips

SLIDE 3

Introduction to NAMD

Related talk: James C. Phillips, S6361 – Attacking HIV with Petascale Molecular Dynamics Simulations on Titan and Blue Waters (Thu 10am, Room 211A)

  • Popular classical molecular dynamics software
    – Free to download
    – Available at most supercomputer centers
  • Runs across many hardware platforms
    – CPU, Intel Xeon Phi, NVIDIA GPU

[Figure: example simulation systems ranging from 30K to 10M to 64M atoms]

SLIDE 4

Introduction to Summit

  • ~3,400 nodes, each with
    – Multiple IBM POWER9 CPUs
    – Multiple NVIDIA Volta GPUs
    – NVLink (5x-12x faster than PCIe3)
  • 5x-10x faster than Titan (135-270 petaflops)
  • Arrives in 2018
  • About 90% of the FLOPS in the GPUs
SLIDE 5

GPU accelerated MD in NAMD

  • GPU used only for non-bonded and PME reciprocal forces
  • Bonded forces and time-step integration performed on the CPU (see the sketch below)

[Diagram: work split per time step – bonded forces and time-step integration on the CPU; non-bonded and PME forces on the GPU]

  • No need to re-write the entire MD engine
  • Enables us to use the same communication, thermostat, and sampling methods on all hardware platforms
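As a rough illustration of this split, here is a minimal sketch of one time step. The helper functions are hypothetical placeholders standing in for NAMD's actual machinery, not NAMD APIs:

```cpp
// Sketch of one MD time step with the CPU/GPU split described above.
// All helper functions are hypothetical placeholders, not NAMD APIs.
#include <cuda_runtime.h>

static void launchNonbondedAndPmeKernels(cudaStream_t s) { (void)s; /* enqueue GPU kernels */ }
static void computeBondedForcesOnCPU()                   { /* bonds, angles, dihedrals */ }
static void copyGpuForcesToHost(cudaStream_t s)          { (void)s; /* cudaMemcpyAsync D2H */ }
static void integrateTimeStepOnCPU()                     { /* thermostat + position update */ }

void timeStep(cudaStream_t stream) {
  launchNonbondedAndPmeKernels(stream);  // GPU: non-bonded + PME reciprocal forces
  computeBondedForcesOnCPU();            // CPU work overlaps with the GPU kernels
  copyGpuForcesToHost(stream);           // bring the GPU forces back to the host
  cudaStreamSynchronize(stream);         // wait until the GPU forces have arrived
  integrateTimeStepOnCPU();              // combine all forces, advance the time step
}
```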

SLIDE 6

Non-bonded force computation

  • Computes forces between all pairs of atoms that are within a cutoff radius R
  • Takes up approximately 80%-90% of the simulation time
  • The trivial algorithm scales as O(N²)
  • Use of neighbor lists brings the algorithm down to O(N)
    – Spatial sorting (see the sketch below)

[Figure: atoms within the cutoff radius R around a central atom]
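NAMD's actual two-level scheme is described on the next slide; purely as an illustration of how spatial sorting turns the O(N²) pair loop into O(N), here is a minimal cell-list sketch (non-periodic box, hypothetical names, not NAMD's data structures):

```cpp
// Generic cell-list illustration: bin atoms into cells of side >= R, then test
// only pairs from neighboring cells. Each atom sees O(1) candidates, so the
// whole pass is O(N).
#include <algorithm>
#include <utility>
#include <vector>

struct Atom { float x, y, z; };

// Index pairs (i, j), i < j, closer than R, in a non-periodic cubic box of side L.
std::vector<std::pair<int,int>> neighborPairs(const std::vector<Atom>& atoms,
                                              float L, float R) {
  const int nc = std::max(1, int(L / R));      // cells per dimension, cell side >= R
  const float cell = L / nc;
  auto clampc = [&](int c) { return std::min(std::max(c, 0), nc - 1); };
  auto coord  = [&](float v) { return clampc(int(v / cell)); };

  std::vector<std::vector<int>> bins(nc * nc * nc);
  for (int i = 0; i < (int)atoms.size(); i++)
    bins[(coord(atoms[i].z) * nc + coord(atoms[i].y)) * nc + coord(atoms[i].x)].push_back(i);

  std::vector<std::pair<int,int>> pairs;
  for (int i = 0; i < (int)atoms.size(); i++) {
    const Atom& a = atoms[i];
    const int ix = coord(a.x), iy = coord(a.y), iz = coord(a.z);
    for (int dz = -1; dz <= 1; dz++)           // scan only the 27 surrounding cells
      for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
          int jx = ix + dx, jy = iy + dy, jz = iz + dz;
          if (jx < 0 || jx >= nc || jy < 0 || jy >= nc || jz < 0 || jz >= nc) continue;
          for (int j : bins[(jz * nc + jy) * nc + jx]) {
            if (j <= i) continue;              // count each pair only once
            float rx = atoms[j].x - a.x, ry = atoms[j].y - a.y, rz = atoms[j].z - a.z;
            if (rx*rx + ry*ry + rz*rz < R*R) pairs.emplace_back(i, j);
          }
        }
  }
  return pairs;
}
```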

SLIDE 7

Non-bonded force computation in NAMD

  • Two levels of spatial sorting
    – Simulation box is divided into patches
    – Within each patch, atoms are sorted spatially into groups of 32 using an orthogonal recursive bisection method (see the sketch below)

[Figure: a patch recursively bisected into groups of 32 atoms]
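The slides do not show the splitting code, but orthogonal recursive bisection is a standard procedure: split the atom set along its longest axis at the median, and repeat until every group holds at most 32 atoms. A minimal sketch under that reading (names hypothetical, not NAMD internals):

```cpp
// Orthogonal recursive bisection into groups of at most 32 atoms.
#include <algorithm>
#include <vector>

struct Atom { float x, y, z; };

static float axisValue(const Atom& a, int axis) {
  return axis == 0 ? a.x : (axis == 1 ? a.y : a.z);
}

// 'group' holds atom indices; groups of <= 32 atoms are appended to 'out'.
void bisect(const std::vector<Atom>& atoms, std::vector<int> group,
            std::vector<std::vector<int>>& out) {
  if (group.size() <= 32) { out.push_back(std::move(group)); return; }

  // Pick the axis with the largest spatial extent.
  int axis = 0; float best = -1.0f;
  for (int d = 0; d < 3; d++) {
    float lo = 1e30f, hi = -1e30f;
    for (int i : group) { float v = axisValue(atoms[i], d); lo = std::min(lo, v); hi = std::max(hi, v); }
    if (hi - lo > best) { best = hi - lo; axis = d; }
  }

  // Split at the median coordinate along that axis.
  size_t mid = group.size() / 2;
  std::nth_element(group.begin(), group.begin() + mid, group.end(),
                   [&](int i, int j) { return axisValue(atoms[i], axis) < axisValue(atoms[j], axis); });

  std::vector<int> lower(group.begin(), group.begin() + mid);
  std::vector<int> upper(group.begin() + mid, group.end());
  bisect(atoms, std::move(lower), out);
  bisect(atoms, std::move(upper), out);
}
```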

SLIDE 8

Non-bonded force compute

  • Compute = all pairwise interactions between two patches
  • For the GPU, each compute is split into tiles of 32x32 atoms

[Figure: computes between patch pairs (Patch 1–Patch 2, Patch 2–Patch 3) split into 32x32 tiles]

SLIDE 9

Non-bonded force computation

[Figure: a 32x32 tile of atoms from patch i versus patch j with force accumulators Fi and Fj; one warp per tile, four warps per thread block]

  • One warp per tile
  • Loop through the 32x32 tile diagonally
    – Avoids race conditions when storing the forces Fi and Fj (see the sketch below)
  • Bitmask used for exclusion lookup
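A minimal sketch of the diagonal traversal and the exclusion bitmask lookup, assuming the Fi/Fj accumulators live in per-warp buffers and using a placeholder pair force (NAMD's real kernel keeps everything in registers, as the next slide describes):

```cuda
// Diagonal 32x32 tile traversal: at step s, lane L handles the pair
// (i = L, j = (L + s) mod 32), so no two lanes ever write the same Fj[j]
// in the same step. Placeholder force model; names are hypothetical.
__device__ float pairForce(float3 r) {               // stand-in, not the real LJ/PME force
  float r2 = r.x*r.x + r.y*r.y + r.z*r.z + 1e-6f;
  return 1.0f / (r2 * r2);
}

__device__ void tileForcesDiagonal(const float3* xi, const float3* xj,  // 32 atoms each
                                   const unsigned int* exclMask,        // 32-bit mask per i atom
                                   float3* Fi, float3* Fj) {            // per-warp accumulators
  const int lane = threadIdx.x & 31;
  float3 fi = make_float3(0.f, 0.f, 0.f);            // force on this lane's i atom, in a register
  for (int step = 0; step < 32; step++) {
    const int j = (lane + step) & 31;                // the diagonal pattern
    if (!((exclMask[lane] >> j) & 1u)) {             // bit j set: pair (lane, j) is excluded
      float3 r = make_float3(xj[j].x - xi[lane].x, xj[j].y - xi[lane].y, xj[j].z - xi[lane].z);
      float f = pairForce(r);
      fi.x += f * r.x;  fi.y += f * r.y;  fi.z += f * r.z;
      Fj[j].x -= f * r.x;  Fj[j].y -= f * r.y;  Fj[j].z -= f * r.z;   // distinct j per lane
    }
    __syncwarp();                                    // keep lanes in step on Volta and newer
  }
  Fi[lane].x += fi.x;  Fi[lane].y += fi.y;  Fi[lane].z += fi.z;
}
```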

SLIDE 10

Non-bonded force computation kernel

  • Warp-level programming (see the sketch below)
    – __shfl() commands
    – No shared memory
    – No __syncthreads()
  • Requires SM 3.0 (Kepler) or newer
  • Builds atom-based neighbor lists and exclusion bitmasks on the fly
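The same tile loop without any shared memory: each lane also keeps one j atom and its partial force in registers, and the warp rotates them with shuffles. The slide's __shfl() is the pre-CUDA-9 spelling; the sketch uses __shfl_sync(), which current toolkits require. Everything apart from the shuffle intrinsic is a hypothetical placeholder:

```cuda
// Register-only 32x32 tile loop: lane L owns i atom L and, at any moment, one
// j atom plus its partial force fj. Shuffling (xj, fj, j) around the warp gives
// the same diagonal pattern with no shared memory and no __syncthreads().
__device__ void tileForcesShfl(float3 xi, float3& fi,     // this lane's i atom and its force
                               float3 xj, float3& fj,     // the j atom this lane currently holds
                               unsigned int exclMask) {   // 32 exclusion bits for the i atom
  const unsigned int FULL = 0xffffffffu;
  const int lane = threadIdx.x & 31;
  const int src  = (lane + 1) & 31;                       // rotate data by one lane per step
  int j = lane;                                           // index of the currently held j atom
  for (int step = 0; step < 32; step++) {
    if (!((exclMask >> j) & 1u)) {
      float3 r  = make_float3(xj.x - xi.x, xj.y - xi.y, xj.z - xi.z);
      float  r2 = r.x*r.x + r.y*r.y + r.z*r.z + 1e-6f;
      float  f  = 1.0f / (r2 * r2);                       // stand-in pair force
      fi.x += f * r.x;  fi.y += f * r.y;  fi.z += f * r.z;
      fj.x -= f * r.x;  fj.y -= f * r.y;  fj.z -= f * r.z;
    }
    // Pass the j atom, its partial force, and its index on to the next lane.
    xj.x = __shfl_sync(FULL, xj.x, src);  xj.y = __shfl_sync(FULL, xj.y, src);  xj.z = __shfl_sync(FULL, xj.z, src);
    fj.x = __shfl_sync(FULL, fj.x, src);  fj.y = __shfl_sync(FULL, fj.y, src);  fj.z = __shfl_sync(FULL, fj.z, src);
    j    = __shfl_sync(FULL, j,    src);
  }
  // After 32 rotations each lane again holds the j atom it started with,
  // together with that atom's accumulated force fj.
}
```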

SLIDE 11

Neighbor list construction on GPU

[Diagram: pipeline – bounding-box neighbor list → compute forces (atom-based neighbor list built on the fly) → sort neighbor list; bounding boxes of the 32-atom groups are compared against the cutoff R (test sketched below)]
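A minimal sketch of the kind of bounding-box test such a first pass can use: two 32-atom groups go on the tile list only when their axis-aligned bounding boxes are within the cutoff of each other (names hypothetical, not NAMD's code):

```cpp
// Bounding-box cutoff test between two 32-atom groups. If the closest points
// of the two axis-aligned boxes are farther apart than the cutoff R, no atom
// pair between the groups can be within R, so the tile can be skipped.
#include <algorithm>

struct BBox { float lo[3], hi[3]; };   // axis-aligned bounding box of a 32-atom group

bool tileWithinCutoff(const BBox& a, const BBox& b, float R) {
  float d2 = 0.0f;
  for (int d = 0; d < 3; d++) {
    // Gap between the boxes along this axis (0 if they overlap).
    float gap = std::max({0.0f, a.lo[d] - b.hi[d], b.lo[d] - a.hi[d]});
    d2 += gap * gap;
  }
  return d2 <= R * R;   // keep the tile only if the boxes can contain a pair within R
}
```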

SLIDE 12

Neighbor list sorting

  • Tile lists executed in the same thread block should have approximately the same work load
  • Simple solution is to sort according to tile list length (see the sketch below)
  • Also minimizes tail effects at the end of kernel execution

[Figure: unsorted tile lists cause load imbalance between the warps of a thread block; sorting by length gives the warps similarly sized lists]
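One way to express this sort on the GPU is with Thrust, sorting tile-list indices by their lengths. A minimal sketch; the buffer names and the descending order are assumptions, not NAMD's actual implementation:

```cpp
// Sort tile lists by length so that lists scheduled together have similar work.
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

void sortTileListsByLength(thrust::device_vector<int>& tileListLen,   // length of each tile list
                           thrust::device_vector<int>& tileListOrder) // output: execution order
{
  tileListOrder.resize(tileListLen.size());
  thrust::sequence(tileListOrder.begin(), tileListOrder.end());       // 0, 1, 2, ...
  // Longest lists first: warps in the same thread block then get similar lengths,
  // and the short lists run last, reducing the tail at the end of the kernel.
  thrust::sort_by_key(tileListLen.begin(), tileListLen.end(),
                      tileListOrder.begin(), thrust::greater<int>());
}
```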

SLIDE 13

Non-bonded kernel performance*

[Charts: speedup vs. NAMD 2.11 of the non-bonded and neighbor-list kernels – explicit solvent (DHFR 24K atoms, ApoA1 92K atoms, STMV 1.06M atoms) and GBIS implicit solvent (13K and 5.7M atoms)]

* Titan supercomputer, K20 GPU

  • Explicit solvent
    – 30% faster
    – Neighbor list build up to 2.7x faster
  • GB implicit solvent
    – 13K atoms: 3.5x faster
    – 5.7M atoms: 38% faster

SLIDE 14

Simulation performance

[Timeline: the GPU runs the non-bonded force kernel over patches 1-3, followed by a GPU-to-CPU memcopy; the CPU then performs communication & time-step integration]

  • Simulation performance is influenced by communication and time-step integration

SLIDE 15

Streaming force computation

[Timeline: without streaming, the CPU waits for the whole non-bonded kernel; with streaming, the CPU polls for per-patch results and overlaps communication & time-step integration with kernel execution, yielding a net time saving]

  • Streaming: communication & integration done during kernel execution (see the polling sketch below)
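One common way to implement this kind of "CPU polling for results" is to have the kernel write per-patch results to pinned, host-mapped memory and raise a flag that the host spins on. A minimal sketch of that general pattern under those assumptions; this is not NAMD's actual code and all names are hypothetical:

```cuda
// GPU->CPU result streaming via pinned (zero-copy) memory: after a patch is
// finished, the kernel raises a flag that the host polls, so integration can
// start before the whole kernel has completed.
#include <cuda_runtime.h>

__global__ void nonbondedKernel(volatile int* patchDone, int numPatches) {
  for (int p = blockIdx.x; p < numPatches; p += gridDim.x) {
    // ... all threads of this block compute patch p's forces and store them
    //     into host-mapped force buffers (not shown) ...
    __syncthreads();                        // every thread's stores are issued
    if (threadIdx.x == 0) {
      __threadfence_system();               // make the stores visible to the host
      patchDone[p] = 1;                     // raise the "patch p is ready" flag
    }
  }
}

void runStreamingStep(int numPatches) {
  cudaSetDeviceFlags(cudaDeviceMapHost);    // allow host-mapped (zero-copy) memory
  volatile int* doneHost;  int* doneDev;
  cudaHostAlloc((void**)&doneHost, numPatches * sizeof(int), cudaHostAllocMapped);
  cudaHostGetDevicePointer((void**)&doneDev, (void*)doneHost, 0);
  for (int p = 0; p < numPatches; p++) doneHost[p] = 0;

  nonbondedKernel<<<64, 128>>>(doneDev, numPatches);
  for (int p = 0; p < numPatches; p++) {
    while (doneHost[p] == 0) { }            // CPU polls for patch p's results
    // ... communicate and integrate patch p while the GPU works on later patches ...
  }
  cudaDeviceSynchronize();
  cudaFreeHost((void*)doneHost);
}
```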

SLIDE 16

Streaming force computation

  • Streaming: sort computes “globally” and preserve patch order
  • Kernel performance: sort neighbor list “locally”

[Timeline: the non-bonded kernel processes the patches in patch order while the CPU polls for per-patch results]

SLIDE 17

Neighbor list sorting - global

[Figure: global sort procedure – sort the computes into reverse order, run the force kernel while recording the actual output order and the tile list lengths, then reverse that output order to build the sort key for the next step]

SLIDE 18

Neighbor list sorting - local

[Figure: neighbor list sorted by tile-list length only within a fixed sort window]
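The slides do not spell out the procedure, but a straightforward reading is to sort by tile-list length only inside fixed-size windows, which balances the warps of a thread block while keeping the global (patch) order roughly intact. A minimal host-side sketch under that assumption (not NAMD's actual code):

```cpp
// Local sort: sort tile-list indices by length only within windows of
// 'windowSize' entries (e.g. 32), so the coarse patch order is preserved.
#include <algorithm>
#include <vector>

void localSortByLength(std::vector<int>& order,            // tile-list execution order
                       const std::vector<int>& length,     // length of each tile list
                       int windowSize = 32) {
  for (size_t start = 0; start < order.size(); start += windowSize) {
    size_t end = std::min(order.size(), start + windowSize);
    std::sort(order.begin() + start, order.begin() + end,
              [&](int a, int b) { return length[a] > length[b]; });  // longest first in each window
  }
}
```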

SLIDE 19

Non-bonded kernel performance

[Charts: compute and tile-list length vs. neighbor-list index for the global sort and for the local sort (window 32), plus normalized kernel runtime for: no streaming (incl. memory copy), streaming with global sort, and streaming with global & local sort (32)]

SLIDE 20

Streaming time-step profile

[Profile: the non-bonded kernel on the GPU overlapping with time-step integration on the CPU]

SLIDE 21

Streaming simulation performance

[Chart: speedup with streaming vs. number of Titan nodes (1-8) for DHFR (24K atoms) and ApoA1 (92K atoms)]

  • 10% - 30% faster simulations using streaming
SLIDE 22

Particle Mesh Ewald (PME) – NAMD 2.11

[Diagram: PME pipeline – charge spreading → 3D FFT (real to complex) → Poisson solver → 3D FFT (complex to real) → force gather]

  • Charge spreading and force gathering on the GPU
  • 3D FFT and Poisson solver on the CPU

SLIDE 23

Particle Mesh Ewald (PME) – New

[Diagram: the same PME pipeline – charge spreading → 3D FFT (real to complex) → Poisson solver → 3D FFT (complex to real) → force gather – now entirely on the GPU]

  • Everything on GPU
  • Uses cuFFT
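A minimal sketch of the 3D real-to-complex / complex-to-real transforms with cuFFT, as the pipeline above implies. The grid size and buffer handling are placeholders and the Poisson solve between the transforms is only indicated; this is not NAMD's actual code:

```cuda
// cuFFT sketch for the PME reciprocal-space step: forward R2C FFT of the
// charge grid, a reciprocal-space solve, then the inverse C2R FFT.
#include <cufft.h>

void pmeReciprocal(cufftReal* chargeGrid,          // nx*ny*nz real charge grid (device)
                   cufftComplex* recipGrid,        // nx*ny*(nz/2+1) complex grid (device)
                   int nx, int ny, int nz) {
  cufftHandle planR2C, planC2R;
  cufftPlan3d(&planR2C, nx, ny, nz, CUFFT_R2C);
  cufftPlan3d(&planC2R, nx, ny, nz, CUFFT_C2R);

  cufftExecR2C(planR2C, chargeGrid, recipGrid);    // charge grid -> reciprocal space

  // ... scale recipGrid by the Ewald/B-spline influence function here
  //     (the "Poisson solver" box in the diagram) ...

  cufftExecC2R(planC2R, recipGrid, chargeGrid);    // back to real space (potential grid)

  cufftDestroy(planR2C);
  cufftDestroy(planC2R);
}
```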
SLIDE 24

PME Performance on single GPU*

[Chart: speedup vs. NAMD 2.11 for DHFR with PME interpolation order 4, 6, and 8]

* Titan supercomputer: K20 GPU + AMD Opteron CPU

  • DHFR 24K atoms
  • 64x64x64 grid
SLIDE 25

Finally, simulation performance

[Charts: speedup vs. NAMD 2.11 against the number of Titan nodes – explicit solvent (DHFR 24K atoms, ApoA1 92K atoms; 1-8 nodes) and GB implicit solvent (13K and 5.7M atoms; 1-4 nodes)]

  • Explicit solvent
    – 30%-57% faster simulations
  • GB implicit solvent
    – Up to 3.5x faster simulations

SLIDE 26

Challenges: Simulation performance

  • Single-GPU performance for DHFR of 47 ns/day on a K20 is still only about half the performance of GPU-only codes such as Amber* (95 ns/day)
  • A major part of the runtime is now taken by memory copying, setup time, and time-step integration
    – This is where the effort has to go now

*http://ambermd.org/gpus/benchmarks.htm

SLIDE 27

Challenges: Simulation performance

[Profile: remaining time-step cost breakdown – non-bonded kernel, time-step integration, CPU-CPU memory copy, CPU-GPU memory copy & force clear, bonded forces]

SLIDE 28

Conclusions

  • Explicit solvent non-bonded force kernels
    – 30% speedup
    – 2x speedup in the neighbor list builder
  • Implicit solvent non-bonded force kernels
    – 38% speedup (up to 3.5x for the smaller system)
    – 3.5x speedup in the neighbor list builder
  • Improved simulation performance
    – 30%-57% faster simulations than NAMD 2.11
    – Up to 3.5x faster for GB implicit solvent
  • Challenge
    – The time-step integrator and the rest of the critical-path code need work
  • Planned for release in NAMD 2.12