S6623: Advances in NAMD GPU Performance
Antti-Pekka Hynninen
Oak Ridge Leadership Computing Facility (OLCF)
hynninena@ornl.gov
ORNL is managed by UT-Battelle for the US Department of Energy
Motivation
- Make NAMD run fast on the upcoming Summit supercomputer
- Center for Accelerated Application Readiness (CAAR) project
  – “Molecular Machinery of the Brain”
  – PI Prof. Klaus Schulten, University of Illinois at Urbana-Champaign
  – Co-PI James C. Phillips
Introduction to NAMD
See also: James C. Phillips, S6361 – Attacking HIV with Petascale Molecular Dynamics Simulations on Titan and Blue Waters (Thu 10am, Room 211A)
- Popular classical molecular dynamics software
  – Free to download
  – Available at most supercomputer centers
- Runs across many hardware platforms
  – CPU, Intel Xeon Phi, NVIDIA GPU
[Images: example simulation systems of 30K, 10M, and 64M atoms]
Introduction to Summit
- ~3,400 nodes, each with
  – Multiple IBM POWER9 CPUs
  – Multiple NVIDIA Volta GPUs
  – NVLink (5x-12x faster than PCIe 3.0)
- 5x-10x faster than Titan (135-270 petaflops)
- Arrives in 2018
- About 90% of FLOPS in the GPUs
GPU-accelerated MD in NAMD
- GPU used only for non-bonded and PME reciprocal forces
- Bonded forces and time-step integration performed on the CPU
- No need to rewrite the entire MD engine
- Enables us to use the same communication, thermostat, and sampling methods on all hardware platforms (see the sketch of one time step below)
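A minimal sketch of how this split could be overlapped within one time step, assuming illustrative names only (SimState, nonbondedKernel, computeBondedForcesCPU, addForces, integrateCPU are placeholders, not NAMD's actual code): the non-bonded kernel is launched asynchronously while the CPU computes bonded forces, and the CPU integrates once the GPU forces arrive.

```cuda
#include <cuda_runtime.h>

// Minimal state for illustration only; NAMD's real data structures differ.
struct SimState {
  int     natoms, numTiles;
  float4 *h_coords, *h_forces, *h_gpuForces;   // pinned host buffers
  float4 *d_coords, *d_forces;                 // device buffers
  int    *d_tileLists;
};

__global__ void nonbondedKernel(const float4*, float4*, const int*); // defined elsewhere (hypothetical)
void computeBondedForcesCPU(SimState&);                               // hypothetical CPU bonded-force routine
void addForces(float4*, const float4*, int);                          // hypothetical force accumulation
void integrateCPU(SimState&);                                         // hypothetical time-step integrator

void timeStep(SimState& s, cudaStream_t stream) {
  // Upload current coordinates asynchronously on the stream.
  cudaMemcpyAsync(s.d_coords, s.h_coords, s.natoms * sizeof(float4),
                  cudaMemcpyHostToDevice, stream);

  // Launch the non-bonded (and PME reciprocal) work on the GPU.
  nonbondedKernel<<<s.numTiles, 32, 0, stream>>>(s.d_coords, s.d_forces, s.d_tileLists);

  // Meanwhile, compute bonded forces on the CPU.
  computeBondedForcesCPU(s);

  // Bring GPU forces back and wait for them.
  cudaMemcpyAsync(s.h_gpuForces, s.d_forces, s.natoms * sizeof(float4),
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);

  // Combine forces and advance the time step on the CPU.
  addForces(s.h_forces, s.h_gpuForces, s.natoms);
  integrateCPU(s);
}
```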
Non-bonded force computation
- Computes forces between all pairs of atoms that are within a cutoff radius R
- Takes up approximately 80%-90% of simulation time
- The brute-force algorithm scales as O(N²)
- Use of neighbor lists brings the algorithm down to O(N)
  – Spatial sorting (see the sketch below)
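A minimal host-side sketch of the spatial-sorting idea (a generic cell list under assumed names like CellList and buildCellList, not NAMD's implementation): binning atoms into cells of side at least R means each atom only has to be tested against atoms in the 27 surrounding cells, which gives O(N) total work instead of O(N²). Coordinates are assumed to lie in [0, boxL).

```cuda
#include <algorithm>
#include <cmath>
#include <vector>

// Illustrative cell list: atom indices grouped by spatial cell.
struct CellList {
  int nx, ny, nz;                       // number of cells per dimension
  std::vector<std::vector<int>> cells;  // atom indices per cell
};

CellList buildCellList(const std::vector<float>& x, const std::vector<float>& y,
                       const std::vector<float>& z, float boxL, float R) {
  CellList cl;
  cl.nx = cl.ny = cl.nz = std::max(1, (int)std::floor(boxL / R));  // cell side >= R
  cl.cells.resize((size_t)cl.nx * cl.ny * cl.nz);
  float cellL = boxL / cl.nx;
  for (size_t i = 0; i < x.size(); i++) {
    // Clamp to the last cell to guard against rounding at the box edge.
    int ix = std::min(cl.nx - 1, (int)(x[i] / cellL));
    int iy = std::min(cl.ny - 1, (int)(y[i] / cellL));
    int iz = std::min(cl.nz - 1, (int)(z[i] / cellL));
    cl.cells[((size_t)iz * cl.ny + iy) * cl.nx + ix].push_back((int)i);
  }
  return cl;
}
```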
Non-bonded force computation in NAMD
- Two levels of spatial sorting
  – Simulation box is divided into patches
  – Within a patch, atoms are sorted spatially into groups of 32 using an orthogonal recursive bisection method (see the sketch below)
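A rough sketch of the bisection idea, with illustrative names only (Atom, orbSplit; not NAMD's code): atoms in a patch are split recursively along their longest spatial extent until every group holds at most 32 atoms, so that each group maps onto one warp.

```cuda
#include <algorithm>
#include <utility>
#include <vector>

struct Atom { float x, y, z; int index; };

static float coordOf(const Atom& a, int d) { return d == 0 ? a.x : (d == 1 ? a.y : a.z); }

// Recursively split atoms[begin, end) into groups of at most 32 atoms.
void orbSplit(std::vector<Atom>& atoms, int begin, int end,
              std::vector<std::pair<int, int>>& groups) {
  if (end - begin <= 32) { groups.push_back({begin, end}); return; }
  // Find the dimension with the largest spatial extent.
  float lo[3] = {1e30f, 1e30f, 1e30f}, hi[3] = {-1e30f, -1e30f, -1e30f};
  for (int i = begin; i < end; i++)
    for (int d = 0; d < 3; d++) {
      lo[d] = std::min(lo[d], coordOf(atoms[i], d));
      hi[d] = std::max(hi[d], coordOf(atoms[i], d));
    }
  int dim = 0;
  for (int d = 1; d < 3; d++)
    if (hi[d] - lo[d] > hi[dim] - lo[dim]) dim = d;
  // Split at the median along that dimension and recurse on both halves.
  int mid = (begin + end) / 2;
  std::nth_element(atoms.begin() + begin, atoms.begin() + mid, atoms.begin() + end,
                   [dim](const Atom& a, const Atom& b) {
                     return coordOf(a, dim) < coordOf(b, dim);
                   });
  orbSplit(atoms, begin, mid, groups);
  orbSplit(atoms, mid, end, groups);
}
```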
Non-bonded force compute
- Compute = all pairwise interactions between two patches
- For the GPU, each compute is split into tiles of 32x32 atoms
[Diagram: computes formed from pairs of patches, each split into 32-atom blocks]
Non-bonded force computation
[Diagram: a 32x32 tile of atoms from patch i and patch j, traversed diagonally by one warp while accumulating forces Fi and Fj]
- One warp per tile
- Loop through the 32x32 tile diagonally
  – Avoids race conditions when storing forces Fi and Fj
- Bitmask used for exclusion lookup
Non-bonded force computation kernel
- Warp-level programming
  – __shfl() commands
  – No shared memory
  – No __syncthreads()
- Requires SM 3.0 (Kepler) or newer
- Builds atom-based neighbor lists and exclusion bitmasks on the fly (a sketch of the tile loop follows below)
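A hedged sketch of the diagonal tile traversal: a generic tile kernel with a toy pairwise force, not NAMD's actual kernel (tileForceKernel and its arguments are illustrative; __shfl_sync() is the CUDA 9+ form of the __shfl() mentioned on the slide). Each lane keeps one i-atom fixed while the j-atom data and its force accumulator rotate around the warp, so no two lanes ever write the same Fj in the same step, and a 32-bit mask per i-atom handles exclusions.

```cuda
#include <cuda_runtime.h>

// One warp per 32x32 tile, no shared memory, no __syncthreads().
// This sketch assumes one tile per block and an exclusion bitmask stored in
// diagonal order (bit t of lane l's mask covers j-atom (l + t) % 32).
__global__ void tileForceKernel(const float4* __restrict__ xyzqI,   // positions+charges, patch i side
                                const float4* __restrict__ xyzqJ,   // positions+charges, patch j side
                                float3* forceI, float3* forceJ,
                                const unsigned int* __restrict__ exclMask,
                                float cutoff2) {
  const int lane = threadIdx.x & 31;
  const int base = blockIdx.x * 32;
  float4 xi = xyzqI[base + lane];            // i-atom owned by this lane (stays put)
  float4 xj = xyzqJ[base + lane];            // j-atom data (rotates around the warp)
  float3 fi = make_float3(0.f, 0.f, 0.f);
  float3 fj = make_float3(0.f, 0.f, 0.f);
  unsigned int excl = exclMask[base + lane];

  // Diagonal traversal: at step t, lane l holds j-atom (l + t) % 32, so every
  // lane updates a different fj and there is no race condition within the warp.
  for (int t = 0; t < 32; t++) {
    float dx = xi.x - xj.x, dy = xi.y - xj.y, dz = xi.z - xj.z;
    float r2 = dx * dx + dy * dy + dz * dz;
    if (r2 < cutoff2 && (excl & 1u) == 0u) {
      float invr2 = 1.0f / r2;
      float f = xi.w * xj.w * invr2 * sqrtf(invr2);  // toy Coulomb-like force
      fi.x += f * dx; fi.y += f * dy; fi.z += f * dz;
      fj.x -= f * dx; fj.y -= f * dy; fj.z -= f * dz;
    }
    excl >>= 1;
    // Rotate the j-atom and its force accumulator one lane around the warp.
    const int src = (lane + 1) & 31;
    xj.x = __shfl_sync(0xffffffff, xj.x, src);
    xj.y = __shfl_sync(0xffffffff, xj.y, src);
    xj.z = __shfl_sync(0xffffffff, xj.z, src);
    xj.w = __shfl_sync(0xffffffff, xj.w, src);
    fj.x = __shfl_sync(0xffffffff, fj.x, src);
    fj.y = __shfl_sync(0xffffffff, fj.y, src);
    fj.z = __shfl_sync(0xffffffff, fj.z, src);
  }
  // After 32 rotations fj is back at the lane that owns that j-atom.
  // A real kernel accumulates across many tiles; atomics stand in for that here.
  atomicAdd(&forceI[base + lane].x, fi.x);
  atomicAdd(&forceI[base + lane].y, fi.y);
  atomicAdd(&forceI[base + lane].z, fi.z);
  atomicAdd(&forceJ[base + lane].x, fj.x);
  atomicAdd(&forceJ[base + lane].y, fj.y);
  atomicAdd(&forceJ[base + lane].z, fj.z);
}
```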
Neighbor list construction on GPU
[Flow: build bounding-box neighbor list → sort neighbor list → compute forces, building the atom-based neighbor list]
Neighbor list sorting
- Tile lists executed on the same thread block should have approximately the same workload
- Simple solution is to sort according to tile list length (see the sketch below)
- Also minimizes tail effects at the end of kernel execution
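A minimal sketch of the length sort, using Thrust here for brevity (sortTileListsByLength is an illustrative helper; NAMD's actual implementation may differ): tile lists are sorted longest-first so that warps scheduled together see similar amounts of work and the long lists do not pile up at the end of the kernel.

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/sort.h>

// tileListLen[i]   = number of tiles in list i (the sort key)
// tileListOrder[i] = index of tile list i (the payload being permuted)
void sortTileListsByLength(thrust::device_vector<int>& tileListLen,
                           thrust::device_vector<int>& tileListOrder) {
  // Sort descending so the longest tile lists are launched first.
  thrust::sort_by_key(tileListLen.begin(), tileListLen.end(),
                      tileListOrder.begin(), thrust::greater<int>());
}
```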
[Diagram: unsorted tile lists cause load imbalance across warps within a thread block; sorting removes it]
Non-bonded kernel performance*
[Charts: speedup vs. NAMD 2.11 of the non-bonded and neighbor-list kernels for DHFR (24K atoms), ApoA1 (92K atoms), and STMV (1.06M atoms); and of the GBIS and neighbor-list kernels for 13K-atom and 5.7M-atom systems]
* Titan supercomputer, K20 GPU
- Explicit solvent
  – 30% faster
  – Neighbor list build up to 2.7x faster
- GB implicit solvent
  – 13K atoms: 3.5x faster
  – 5.7M atoms: 38% faster
Simulation performance
[Timeline: the GPU runs the non-bonded force kernel over patches 1-3, forces are copied back to the CPU, then the CPU does communication & time-step integration]
- Simulation performance is influenced by communication and time-step integration
Streaming force computation
[Timeline: while the GPU non-bonded force kernel works through the patches, the CPU polls for results and overlaps communication & time-step integration]
- Streaming: communication & integration are done during kernel execution, yielding the savings shown in the timeline (a polling sketch follows below)
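One way the polling scheme could be wired up, as a hedged sketch (launchStreamingForceKernel and communicateAndIntegratePatch are hypothetical placeholders, not NAMD's API): the force kernel raises a per-patch flag in mapped pinned host memory as each patch's forces are written out, and the CPU spins on those flags so communication and integration start before the kernel finishes.

```cuda
#include <cuda_runtime.h>

void launchStreamingForceKernel(unsigned int* d_patchDone, cudaStream_t stream); // hypothetical launcher
void communicateAndIntegratePatch(int patchId);                                  // hypothetical CPU work

void runStreamingStep(int numPatches, cudaStream_t stream) {
  // Per-patch completion flags in pinned, mapped host memory.
  volatile unsigned int* h_patchDone;
  unsigned int* d_patchDone;
  cudaHostAlloc((void**)&h_patchDone, numPatches * sizeof(unsigned int), cudaHostAllocMapped);
  cudaHostGetDevicePointer((void**)&d_patchDone, (void*)h_patchDone, 0);
  for (int p = 0; p < numPatches; p++) h_patchDone[p] = 0;

  // The kernel (not shown) computes forces patch by patch; after writing the
  // forces of patch p to host-visible memory it issues __threadfence_system()
  // and then stores d_patchDone[p] = 1.
  launchStreamingForceKernel(d_patchDone, stream);

  // The CPU polls and overlaps per-patch work with the running kernel.
  for (int p = 0; p < numPatches; p++) {
    while (h_patchDone[p] == 0) { /* spin until patch p's forces arrive */ }
    communicateAndIntegratePatch(p);
  }
  cudaStreamSynchronize(stream);
  cudaFreeHost((void*)h_patchDone);
}
```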
Streaming force computation
- Streaming: sort computes “globally” and preserve patch order
- Kernel performance: sort the neighbor list “locally”
[Timeline: GPU non-bonded force kernel proceeds patch by patch while the CPU polls for results]
Neighbor list sorting - global
[Diagram: computes are sorted into reverse order, forces are computed while the output order is recorded, and the recorded order is used to build the sort key for the next step's neighbor-list sort]
Neighbor list sorting - local
[Diagram: local sorting of tile lists by length within a sort window]
Non-bonded kernel performance
[Charts: tile-list length vs. neighbor-list index and normalized kernel runtime for no streaming (incl. memory copy), streaming with global sort, and streaming with global & local sort (window 32)]
Streaming time-step profile
[Profiler timeline: non-bonded kernel overlapped with time-step integration]
Streaming simulation performance
[Chart: speedup with streaming vs. number of Titan nodes (1-8) for DHFR (24K atoms) and ApoA1 (92K atoms)]
- 10% - 30% faster simulations using streaming
Particle Mesh Ewald (PME) – NAMD 2.11
[Pipeline: charge spreading → 3D FFT real-to-complex → Poisson solver → 3D FFT complex-to-real → force gather]
- Charge spreading and force gathering on GPU
- 3D FFT and Poisson solver on CPU
Particle Mesh Ewald (PME) – New
[Pipeline: charge spreading → 3D FFT real-to-complex → Poisson solver → 3D FFT complex-to-real → force gather, all on the GPU]
- Everything on GPU
- Uses cuFFT (see the sketch below)
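A generic outline of a GPU-resident PME reciprocal path built on cuFFT, as a sketch under assumptions (pmeReciprocal and poissonSolveKernel are illustrative names, the charge-spreading and force-gather kernels are only referenced in comments, and the influence-function details are omitted; this is not NAMD's implementation):

```cuda
#include <cuda_runtime.h>
#include <cufft.h>

// Placeholder reciprocal-space multiply; a real Poisson solver applies the
// k-dependent PME influence function (B-spline moduli, box vectors, Ewald beta).
__global__ void poissonSolveKernel(cufftComplex* grid, int nComplex) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= nComplex) return;
  float scale = 1.0f;  // placeholder for the actual k-dependent factor
  grid[i].x *= scale;
  grid[i].y *= scale;
}

void pmeReciprocal(float* d_realGrid, cufftComplex* d_complexGrid,
                   int nx, int ny, int nz) {
  cufftHandle planR2C, planC2R;
  cufftPlan3d(&planR2C, nx, ny, nz, CUFFT_R2C);
  cufftPlan3d(&planC2R, nx, ny, nz, CUFFT_C2R);

  // (1) chargeSpreadKernel<<<...>>>(...) fills d_realGrid          (not shown)
  // (2) Forward 3D FFT, real to complex.
  cufftExecR2C(planR2C, d_realGrid, d_complexGrid);
  // (3) Poisson solve in reciprocal space.
  int nComplex = nx * ny * (nz / 2 + 1);
  poissonSolveKernel<<<(nComplex + 255) / 256, 256>>>(d_complexGrid, nComplex);
  // (4) Inverse 3D FFT, complex to real.
  cufftExecC2R(planC2R, d_complexGrid, d_realGrid);
  // (5) forceGatherKernel<<<...>>>(...) interpolates forces         (not shown)

  cufftDestroy(planR2C);
  cufftDestroy(planC2R);
}
```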
PME Performance on single GPU*
[Chart: speedup vs. NAMD 2.11 for DHFR with PME interpolation order 4, 6, and 8]
* Titan supercomputer: K20 GPU + AMD Opteron CPU
- DHFR, 24K atoms
- 64x64x64 grid
Finally, simulation performance
[Chart: speedup vs. NAMD 2.11 on 1-8 Titan nodes for DHFR (24K atoms) and ApoA1 (92K atoms)]
- Explicit solvent
  – 30%-57% faster simulations
- GB implicit solvent
  – Up to 3.5x faster simulations
[Charts: speedup vs. NAMD 2.11 on 1-4 Titan nodes for the 13K-atom and 5.7M-atom systems]
Challenges: Simulation performance
- Single-GPU performance for DHFR, 47 ns/day on a K20, is still only about half that of GPU-only codes such as Amber* (95 ns/day)
- A major part of the runtime is now taken by memory copying, setup time, and time-step integration
  – This is where the effort has to go now
*http://ambermd.org/gpus/benchmarks.htm
Challenges: Simulation performance
[Profiler timeline: non-bonded kernel, time-step integration, CPU-CPU memory copy, CPU-GPU memory copy & force clear, and bonded forces]
Conclusions
- Explicit solvent non-bonded force kernels
  – 30% speedup
  – 2x speedup in the neighbor list builder
- Implicit solvent non-bonded force kernels
  – 38% speedup (but up to 3.5x speedup)
  – 3.5x speedup in the neighbor list builder
- Improved simulation performance
  – 30%-57% faster simulations than NAMD 2.11
  – Up to 3.5x faster for GB implicit solvent
- Challenge
  – Time-step integrator and the rest of the critical-path code need work
- Planned for release in NAMD 2.12