S6623: Advances in NAMD GPU Performance

  1. S6623: Advances in NAMD GPU Performance. Antti-Pekka Hynninen, Oak Ridge Leadership Computing Facility (OLCF), hynninena@ornl.gov. ORNL is managed by UT-Battelle for the US Department of Energy.

  2. Motivation • Make NAMD run fast on the upcoming Summit supercomputer • Center for Accelerated Application Readiness (CAAR) project – "Molecular Machinery of the Brain" – PI: Prof. Klaus Schulten, University of Illinois at Urbana-Champaign – Co-PI: James C. Phillips

  3. Introduction to NAMD • Popular classical molecular dynamics software – Free to download – Available at most supercomputer centers • Runs across many hardware platforms – CPU, Intel Xeon Phi, NVIDIA GPU [Images: example systems of 30K, 10M, and 64M atoms] • See also: James C. Phillips, S6361 – Attacking HIV with Petascale Molecular Dynamics Simulations on Titan and Blue Waters, Thu 10am, Room 211A

  4. Introduction to Summit • ~3,400 nodes, each with – Multiple IBM POWER9 CPUs – Multiple NVIDIA Volta GPUs – NVLink (5x-12x faster than PCIe3) • 5x-10x faster than Titan (135-270 petaflops) • Arrives in 2018 • About 90% of FLOPs in the GPUs

  5. GPU-accelerated MD in NAMD • GPU used only for non-bonded and PME reciprocal forces • Bonded forces and time-step integration performed on the CPU [Diagram: GPU handles non-bonded and PME forces; CPU handles bonded forces and time-step integration] • No need to re-write the entire MD engine • Enables us to use the same communication, thermostat, and sampling methods on all hardware platforms

  6. Non-bonded force computation • Computes forces between all pairs of atoms that are within a cutoff radius R • Takes up approximately 80%-90% of simulation time • The trivial algorithm scales as O(N^2) • Using neighbor lists built by spatial sorting brings this down to O(N), as pictured in the sketch below
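
The O(N) neighbor-list approach can be pictured with a minimal CUDA sketch (not NAMD code): each thread owns one atom, walks a precomputed neighbor list, and applies the cutoff test. The array names, data layout, and toy Coulomb-only force are illustrative assumptions.

```cuda
// Minimal sketch (not NAMD code) of a cutoff + neighbor-list force kernel.
__global__ void pairForces(const float4* atomPos,    // x, y, z, charge
                           const int*    neighStart, // first neighbor index per atom
                           const int*    neighCount, // number of neighbors per atom
                           const int*    neighList,  // flattened neighbor atom indices
                           float cutoff2, float3* force, int numAtoms)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numAtoms) return;
    float4 pi = atomPos[i];
    float3 fi = make_float3(0.0f, 0.0f, 0.0f);
    int start = neighStart[i];
    for (int k = 0; k < neighCount[i]; k++) {          // O(neighbors), not O(N)
        float4 pj = atomPos[neighList[start + k]];
        float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
        float r2 = dx * dx + dy * dy + dz * dz;
        if (r2 < cutoff2) {
            float rinv = rsqrtf(r2);
            float f = pi.w * pj.w * rinv * rinv * rinv; // toy Coulomb-only term
            fi.x += f * dx; fi.y += f * dy; fi.z += f * dz;
        }
    }
    force[i] = fi;
}
```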

  7. Non-bonded force computation in NAMD • Two levels of spatial sorting – The simulation box is divided into patches – Within each patch, atoms are sorted spatially into groups of 32 using an orthogonal recursive bisection method (sketched below)
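
A host-side sketch of orthogonal recursive bisection into groups of at most 32 atoms: split the group's bounding box along its longest axis at the median and recurse. The Atom struct and the nth_element-based split are illustrative assumptions, not the actual NAMD implementation.

```cuda
#include <algorithm>
#include <vector>

struct Atom { float x, y, z; };

static float coord(const Atom& a, int d) { return d == 0 ? a.x : (d == 1 ? a.y : a.z); }

static void orbSplit(std::vector<Atom>& atoms, int begin, int end)
{
    if (end - begin <= 32) return;   // group already fits one 32-atom tile
    // Find the longest axis of this group's bounding box.
    float lo[3] = { 1e30f,  1e30f,  1e30f};
    float hi[3] = {-1e30f, -1e30f, -1e30f};
    for (int i = begin; i < end; i++)
        for (int d = 0; d < 3; d++) {
            lo[d] = std::min(lo[d], coord(atoms[i], d));
            hi[d] = std::max(hi[d], coord(atoms[i], d));
        }
    int axis = 0;
    for (int d = 1; d < 3; d++)
        if (hi[d] - lo[d] > hi[axis] - lo[axis]) axis = d;
    // Split the group at its median along that axis, then recurse on both halves.
    int mid = begin + (end - begin) / 2;
    std::nth_element(atoms.begin() + begin, atoms.begin() + mid, atoms.begin() + end,
                     [axis](const Atom& a, const Atom& b) { return coord(a, axis) < coord(b, axis); });
    orbSplit(atoms, begin, mid);
    orbSplit(atoms, mid, end);
}
```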

  8. Non-bonded force computes [Diagram: patches 1-3 paired into Compute 1 and Compute 2, each compute split into 32x32 tiles] • A compute = all pairwise interactions between two patches • For the GPU, each compute is split into tiles of 32x32 atoms

  9. Non-bonded force computation [Diagram: atoms in patch i vs. atoms in patch j, one warp per 32x32 tile, accumulating forces F_i and F_j] • One warp per tile • Loop through the 32x32 tile diagonally – Avoids race conditions when storing the forces F_i and F_j (see the sketch below) • A bitmask is used for the exclusion lookup
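
A hedged sketch of the diagonal tile sweep: one warp handles a 32x32 tile, each lane keeps one i atom and one j atom in registers, and the j atom plus its force accumulator rotate around the warp with shuffles, so no two lanes ever update the same F_j in the same step. The names, the exclusion-bit convention, and the toy force are assumptions; modern __shfl_sync is shown where the talk-era kernel used __shfl().

```cuda
__device__ void tileForces(float4 pi,             // this lane's i atom (x, y, z, charge)
                           float4 pj,             // this lane's current j atom
                           unsigned int exclMask, // exclusion bits for this i atom
                           float cutoff2, float3& fi, float3& fj)
{
    const unsigned FULL = 0xffffffffu;
    const int lane = threadIdx.x & 31;
    for (int t = 0; t < 32; t++) {
        // Original index of the j atom this lane currently holds. The diagonal
        // start (j == lane at t == 0) means every lane touches a different j
        // atom each iteration, so accumulating fj in registers is race-free.
        int j = (lane + t) & 31;
        if (exclMask & (1u << j)) {               // assumed convention: set bit = pair interacts
            float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
            float r2 = dx * dx + dy * dy + dz * dz;
            if (r2 < cutoff2) {
                float rinv = rsqrtf(r2);
                float f = pi.w * pj.w * rinv * rinv * rinv;   // toy Coulomb-only term
                fi.x += f * dx; fi.y += f * dy; fi.z += f * dz;
                fj.x -= f * dx; fj.y -= f * dy; fj.z -= f * dz;
            }
        }
        // Rotate the j atom and its partial force to the next lane.
        int src = (lane + 1) & 31;
        pj.x = __shfl_sync(FULL, pj.x, src); pj.y = __shfl_sync(FULL, pj.y, src);
        pj.z = __shfl_sync(FULL, pj.z, src); pj.w = __shfl_sync(FULL, pj.w, src);
        fj.x = __shfl_sync(FULL, fj.x, src); fj.y = __shfl_sync(FULL, fj.y, src);
        fj.z = __shfl_sync(FULL, fj.z, src);
    }
    // After 32 rotations, fj has returned to the lane that owns that j atom.
}
```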

  10. Non-bonded force computation kernel • Warp-level programming – __shfl() instructions – No shared memory – No __syncthreads() (see the small example below) • Requires SM 3.0 (Kepler) or newer • Builds atom-based neighbor lists and exclusion bitmasks on the fly
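
The "no shared memory, no __syncthreads()" style boils down to keeping data in registers and combining partial results with warp shuffles. A generic example of the pattern (not taken from NAMD) is a warp-wide sum:

```cuda
// Sum a per-lane value across a warp using shuffles only.
// Modern __shfl_down_sync shown; the talk-era code used __shfl variants.
__device__ float warpSum(float v)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;   // lane 0 holds the warp-wide sum
}
```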

  11. Neighbor list construction on the GPU [Diagram: build a bounding-box neighbor list within cutoff R, sort the neighbor list, compute forces while building the atom-based neighbor list, then sort that list as well] (the bounding-box test is sketched below)
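
The bounding-box stage can be pictured as a cheap reject test between two 32-atom tiles: if their axis-aligned bounding boxes are farther apart than the cutoff, no atom pair between them can interact. A hedged sketch (struct and function names are assumptions):

```cuda
struct BBox { float3 lo, hi; };   // axis-aligned bounding box of a 32-atom tile

__device__ bool bboxWithinCutoff(const BBox& a, const BBox& b, float cutoff2)
{
    // Per-axis gap between the boxes (zero if they overlap on that axis).
    float dx = fmaxf(0.0f, fmaxf(b.lo.x - a.hi.x, a.lo.x - b.hi.x));
    float dy = fmaxf(0.0f, fmaxf(b.lo.y - a.hi.y, a.lo.y - b.hi.y));
    float dz = fmaxf(0.0f, fmaxf(b.lo.z - a.hi.z, a.lo.z - b.hi.z));
    return dx * dx + dy * dy + dz * dz < cutoff2;
}
```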

  12. Neighbor list sorting [Diagram: unsorted tile lists give warps unequal work (load imbalance); after a thread-block sort, warps 1-4 get roughly equal work] • Tile lists executed in the same thread block should have approximately the same workload • A simple solution is to sort according to tile list length (see the sketch below) • This also minimizes tail effects at the end of kernel execution
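
One straightforward way to realize the length sort (a sketch, not necessarily how NAMD implements it) is to sort the tile-list indices by descending length with Thrust, so warps scheduled together receive similar amounts of work:

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/sort.h>

// Reorder tileListOrder so the longest tile lists come first.
// Note: the key vector (tileListLen) is sorted in place as well.
void sortTileListsByLength(thrust::device_vector<int>& tileListLen,
                           thrust::device_vector<int>& tileListOrder)
{
    thrust::sort_by_key(tileListLen.begin(), tileListLen.end(),
                        tileListOrder.begin(),
                        thrust::greater<int>());   // descending by length
}
```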

  13. Non-bonded kernel performance* [Charts: speedup vs. NAMD 2.11 for DHFR (24K atoms), ApoA1 (92K atoms), and STMV (1.06M atoms), and for GBIS systems of 13K and 5.7M atoms] • Explicit solvent – Non-bonded kernel 30% faster – Neighbor list build up to 2.7x faster • GB implicit solvent – 13K atoms: 3.5x faster – 5.7M atoms: 38% faster * Measured on the Titan supercomputer, K20 GPU

  14. Simulation performance • Simulation performance is also influenced by communication and time-step integration [Timeline diagram: the GPU runs the non-bonded force kernel for patches 1-3, followed by a GPU-to-CPU memcopy, then CPU communication & time-step integration]

  15. Streaming force computation • Streaming: communication & time-step integration are done during kernel execution [Timeline diagram: without streaming, the CPU waits for the whole kernel before processing patches 1-3; with streaming, the CPU polls for per-patch results while the kernel is still running, saving time] (a minimal sketch of the polling idea follows)
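
A minimal sketch of the polling idea, under the assumption that per-patch "done" flags live in mapped, pinned host memory which the kernel sets after storing a patch's forces. The flag array and the per-patch integration call are hypothetical; this is not the actual NAMD mechanism.

```cuda
#include <cuda_runtime.h>

// Kernel side (sketch): after the last warp of a patch stores its forces,
//   __threadfence_system();          // make the stored forces visible to the host
//   patchDoneFlag[patchIndex] = 1;   // flag lives in mapped pinned host memory

// Host side: poll the flags and process patches while the kernel is still running.
void pollAndIntegrate(volatile int* patchDoneFlag, int numPatches)
{
    for (int p = 0; p < numPatches; p++) {
        while (patchDoneFlag[p] == 0) { /* spin: kernel still working on patch p */ }
        // integratePatch(p);   // hypothetical per-patch communication/integration step
    }
}
```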

  16. Streaming force computation [Timeline diagram: with patch order preserved on the GPU, the CPU can start processing patch 1 results while patches 2 and 3 are still being computed] • Streaming: sort computes "globally" and preserve patch order • Kernel performance: sort neighbor lists "locally"

  17. Neighbor list sorting – global [Worked example with tile lists of lengths 10, 9, 8, 7, 6, 5, 4, 3, 2, 1] • Sort the computes into reverse order • Compute forces and record the output order (here: 8, 10, 7, 6, 9, 4, 5, 3, 2, 1) • Reverse the output order and build the sort key from the tile list lengths (reversed: 1, 2, 3, 5, 4, 9, 6, 7, 10, 8)

  18. Neighbor list sorting – local [Diagram: tile lists are sorted by length only within a local sort window]

  19. Non-bonded kernel performance [Charts: normalized kernel runtime for no streaming (incl. mem copy), streaming, streaming with global sort, and streaming with global & local sort (32), plus tile list length vs. neighbor list index for each sorting variant]

  20. Streaming time-step profile [Profile: the non-bonded kernel overlaps with time-step integration]

  21. Streaming simulation performance [Chart: speedup with streaming for DHFR (24K atoms) and ApoA1 (92K atoms) on 1, 2, 4, and 8 Titan nodes] • 10% - 30% faster simulations using streaming

  22. Particle Mesh Ewald (PME) – NAMD 2.11 [Diagram: charge spreading → 3D FFT (real to complex) → Poisson solver → 3D FFT (complex to real) → force gathering] • Charge spreading and force gathering on the GPU • 3D FFTs and the Poisson solver on the CPU

  23. Particle Mesh Ewald (PME) – New [Diagram: the same pipeline – charge spreading, 3D FFT (real to complex), Poisson solver, 3D FFT (complex to real), force gathering] • Everything on the GPU • Uses cuFFT (see the sketch below)
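
The GPU FFT part reduces to two cuFFT 3D transforms around a reciprocal-space scaling step. A hedged sketch (grid names and the omitted Poisson kernel are illustrative, not NAMD's actual code path):

```cuda
#include <cufft.h>

// Forward R2C transform of the charge grid, (omitted) reciprocal-space
// Poisson solve, and inverse C2R transform back for force gathering.
void pmeFFT(cufftReal* d_grid, cufftComplex* d_gridHat, int nx, int ny, int nz)
{
    cufftHandle fwd, inv;
    cufftPlan3d(&fwd, nx, ny, nz, CUFFT_R2C);
    cufftPlan3d(&inv, nx, ny, nz, CUFFT_C2R);

    cufftExecR2C(fwd, d_grid, d_gridHat);        // charge grid -> reciprocal space
    // solvePoisson<<<...>>>(d_gridHat, ...);    // hypothetical kernel scaling each k-vector
    cufftExecC2R(inv, d_gridHat, d_grid);        // back to real space for force gathering

    cufftDestroy(fwd);
    cufftDestroy(inv);
}
```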

  24. PME performance on a single GPU* [Chart: speedup vs. NAMD 2.11 for DHFR at interpolation orders 4, 6, and 8] • DHFR, 24K atoms • 64x64x64 PME grid * Titan supercomputer: K20 GPU + AMD Opteron CPU

  25. Finally, simulation performance [Charts: speedup vs. NAMD 2.11 for DHFR (24K atoms) and ApoA1 (92K atoms) on 1-8 Titan nodes, and for GBIS systems of 13K and 5.7M atoms on 1-4 Titan nodes] • Explicit solvent – 30% - 57% faster simulations • GB implicit solvent – Up to 3.5x faster simulations

  26. Challenges: simulation performance • Single-GPU performance for DHFR (47 ns/day on a K20) is still only about half that of GPU-only codes such as Amber* (95 ns/day) • A major part of the runtime is now taken by memory copying, setup time, and time-step integration – This is where the effort has to go now *http://ambermd.org/gpus/benchmarks.htm

  27. Challenges: simulation performance [Profile: remaining runtime is split among bonded forces, the non-bonded kernel, CPU-GPU memory copy & force clear, time-step integration, and CPU-CPU memory copy]

  28. Conclusions • Explicit solvent non-bonded force kernels – 30% speedup – 2x speedup in the neighbor list builder • Implicit solvent non-bonded force kernels – 38% speedup for large systems (up to 3.5x for small ones) – 3.5x speedup in the neighbor list builder • Improved simulation performance – 30% - 57% faster simulations than NAMD 2.11 – Up to 3.5x faster for GB implicit solvent • Challenge – The time-step integrator and the rest of the critical-path code need work • Planned for release in NAMD 2.12
