S6623: Advances in NAMD GPU Performance
Antti-Pekka Hynninen
Oak Ridge Leadership Computing Facility (OLCF)
hynninena@ornl.gov
ORNL is managed by UT-Battelle for the US Department of Energy
Motivation
- Make NAMD run fast on the upcoming Summit supercomputer
- Center for Accelerated Application Readiness (CAAR) project
  – “Molecular Machinery of the Brain”
  – PI Prof. Klaus Schulten, University of Illinois at Urbana-Champaign
  – Co-PI James C. Phillips
Introduction to NAMD
See also: James C. Phillips, S6361 – Attacking HIV with Petascale Molecular Dynamics Simulations on Titan and Blue Waters (Thu 10am, Room 211A)
- Popular classical molecular dynamics software
  – Free to download
  – Available at most supercomputer centers
- Runs across many hardware platforms
  – CPU, Intel Xeon Phi, NVIDIA GPU
[Images: example simulation systems of 30K, 10M, and 64M atoms]
Introduction to Summit
- ~3,400 nodes, each with
  – Multiple IBM POWER9 CPUs
  – Multiple NVIDIA Volta GPUs
  – NVLink (5x-12x faster than PCIe 3.0)
- 5x-10x faster than Titan (135-270 petaflops)
- Arrives in 2018
- About 90% of FLOPS in the GPUs
GPU-accelerated MD in NAMD
- GPU used only for non-bonded and PME reciprocal forces
- Bonded forces and time-step integration performed on the CPU
- No need to rewrite the entire MD engine
- Enables us to use the same communication, thermostat, and sampling methods on all hardware platforms (see the sketch of one time step below)
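A minimal sketch of how this split could be overlapped within one time step, assuming illustrative names only (SimState, nonbondedKernel, computeBondedForcesCPU, addForces, integrateCPU are placeholders, not NAMD's actual code): the non-bonded kernel is launched asynchronously while the CPU computes bonded forces, and the CPU integrates once the GPU forces arrive.

```cuda
#include <cuda_runtime.h>

// Minimal state for illustration only; NAMD's real data structures differ.
struct SimState {
  int     natoms, numTiles;
  float4 *h_coords, *h_forces, *h_gpuForces;   // pinned host buffers
  float4 *d_coords, *d_forces;                 // device buffers
  int    *d_tileLists;
};

__global__ void nonbondedKernel(const float4*, float4*, const int*); // defined elsewhere (hypothetical)
void computeBondedForcesCPU(SimState&);                               // hypothetical CPU bonded-force routine
void addForces(float4*, const float4*, int);                          // hypothetical force accumulation
void integrateCPU(SimState&);                                         // hypothetical time-step integrator

void timeStep(SimState& s, cudaStream_t stream) {
  // Upload current coordinates asynchronously on the stream.
  cudaMemcpyAsync(s.d_coords, s.h_coords, s.natoms * sizeof(float4),
                  cudaMemcpyHostToDevice, stream);

  // Launch the non-bonded (and PME reciprocal) work on the GPU.
  nonbondedKernel<<<s.numTiles, 32, 0, stream>>>(s.d_coords, s.d_forces, s.d_tileLists);

  // Meanwhile, compute bonded forces on the CPU.
  computeBondedForcesCPU(s);

  // Bring GPU forces back and wait for them.
  cudaMemcpyAsync(s.h_gpuForces, s.d_forces, s.natoms * sizeof(float4),
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);

  // Combine forces and advance the time step on the CPU.
  addForces(s.h_forces, s.h_gpuForces, s.natoms);
  integrateCPU(s);
}
```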
Non-bonded force computation
- Computes forces between all pairs of atoms that are within a cutoff radius R
- Takes up approximately 80%-90% of simulation time
- The brute-force algorithm scales as O(N²)
- Use of neighbor lists brings the algorithm down to O(N)
  – Spatial sorting (see the sketch below)
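A minimal host-side sketch of the spatial-sorting idea (a generic cell list under assumed names like CellList and buildCellList, not NAMD's implementation): binning atoms into cells of side at least R means each atom only has to be tested against atoms in the 27 surrounding cells, which gives O(N) total work instead of O(N²). Coordinates are assumed to lie in [0, boxL).

```cuda
#include <algorithm>
#include <cmath>
#include <vector>

// Illustrative cell list: atom indices grouped by spatial cell.
struct CellList {
  int nx, ny, nz;                       // number of cells per dimension
  std::vector<std::vector<int>> cells;  // atom indices per cell
};

CellList buildCellList(const std::vector<float>& x, const std::vector<float>& y,
                       const std::vector<float>& z, float boxL, float R) {
  CellList cl;
  cl.nx = cl.ny = cl.nz = std::max(1, (int)std::floor(boxL / R));  // cell side >= R
  cl.cells.resize((size_t)cl.nx * cl.ny * cl.nz);
  float cellL = boxL / cl.nx;
  for (size_t i = 0; i < x.size(); i++) {
    // Clamp to the last cell to guard against rounding at the box edge.
    int ix = std::min(cl.nx - 1, (int)(x[i] / cellL));
    int iy = std::min(cl.ny - 1, (int)(y[i] / cellL));
    int iz = std::min(cl.nz - 1, (int)(z[i] / cellL));
    cl.cells[((size_t)iz * cl.ny + iy) * cl.nx + ix].push_back((int)i);
  }
  return cl;
}
```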
Non-bonded force computation in NAMD
- Two levels of spatial sorting
  – Simulation box is divided into patches
  – Within a patch, atoms are sorted spatially into groups of 32 using an orthogonal recursive bisection method (see the sketch below)
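A rough sketch of the bisection idea, with illustrative names only (Atom, orbSplit; not NAMD's code): atoms in a patch are split recursively along their longest spatial extent until every group holds at most 32 atoms, so that each group maps onto one warp.

```cuda
#include <algorithm>
#include <utility>
#include <vector>

struct Atom { float x, y, z; int index; };

static float coordOf(const Atom& a, int d) { return d == 0 ? a.x : (d == 1 ? a.y : a.z); }

// Recursively split atoms[begin, end) into groups of at most 32 atoms.
void orbSplit(std::vector<Atom>& atoms, int begin, int end,
              std::vector<std::pair<int, int>>& groups) {
  if (end - begin <= 32) { groups.push_back({begin, end}); return; }
  // Find the dimension with the largest spatial extent.
  float lo[3] = {1e30f, 1e30f, 1e30f}, hi[3] = {-1e30f, -1e30f, -1e30f};
  for (int i = begin; i < end; i++)
    for (int d = 0; d < 3; d++) {
      lo[d] = std::min(lo[d], coordOf(atoms[i], d));
      hi[d] = std::max(hi[d], coordOf(atoms[i], d));
    }
  int dim = 0;
  for (int d = 1; d < 3; d++)
    if (hi[d] - lo[d] > hi[dim] - lo[dim]) dim = d;
  // Split at the median along that dimension and recurse on both halves.
  int mid = (begin + end) / 2;
  std::nth_element(atoms.begin() + begin, atoms.begin() + mid, atoms.begin() + end,
                   [dim](const Atom& a, const Atom& b) {
                     return coordOf(a, dim) < coordOf(b, dim);
                   });
  orbSplit(atoms, begin, mid, groups);
  orbSplit(atoms, mid, end, groups);
}
```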
Non-bonded force compute
- Compute = all pairwise interactions between two patches
- For the GPU, each compute is split into tiles of 32x32 atoms
[Diagram: computes formed from pairs of patches, each split into 32-atom blocks]
Non-bonded force computation
[Diagram: a 32x32 tile of atoms from patch i and patch j, traversed diagonally by one warp while accumulating forces Fi and Fj]
- One warp per tile
- Loop through the 32x32 tile diagonally
  – Avoids race conditions when storing forces Fi and Fj
- Bitmask used for exclusion lookup
Non-bonded force computation kernel
- Warp-level programming
  – __shfl() commands
  – No shared memory
  – No __syncthreads()
- Requires SM 3.0 (Kepler) or newer
- Builds atom-based neighbor lists and exclusion bitmasks on the fly (a sketch of the tile loop follows below)
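A hedged sketch of the diagonal tile traversal: a generic tile kernel with a toy pairwise force, not NAMD's actual kernel (tileForceKernel and its arguments are illustrative; __shfl_sync() is the CUDA 9+ form of the __shfl() mentioned on the slide). Each lane keeps one i-atom fixed while the j-atom data and its force accumulator rotate around the warp, so no two lanes ever write the same Fj in the same step, and a 32-bit mask per i-atom handles exclusions.

```cuda
#include <cuda_runtime.h>

// One warp per 32x32 tile, no shared memory, no __syncthreads().
// This sketch assumes one tile per block and an exclusion bitmask stored in
// diagonal order (bit t of lane l's mask covers j-atom (l + t) % 32).
__global__ void tileForceKernel(const float4* __restrict__ xyzqI,   // positions+charges, patch i side
                                const float4* __restrict__ xyzqJ,   // positions+charges, patch j side
                                float3* forceI, float3* forceJ,
                                const unsigned int* __restrict__ exclMask,
                                float cutoff2) {
  const int lane = threadIdx.x & 31;
  const int base = blockIdx.x * 32;
  float4 xi = xyzqI[base + lane];            // i-atom owned by this lane (stays put)
  float4 xj = xyzqJ[base + lane];            // j-atom data (rotates around the warp)
  float3 fi = make_float3(0.f, 0.f, 0.f);
  float3 fj = make_float3(0.f, 0.f, 0.f);
  unsigned int excl = exclMask[base + lane];

  // Diagonal traversal: at step t, lane l holds j-atom (l + t) % 32, so every
  // lane updates a different fj and there is no race condition within the warp.
  for (int t = 0; t < 32; t++) {
    float dx = xi.x - xj.x, dy = xi.y - xj.y, dz = xi.z - xj.z;
    float r2 = dx * dx + dy * dy + dz * dz;
    if (r2 < cutoff2 && (excl & 1u) == 0u) {
      float invr2 = 1.0f / r2;
      float f = xi.w * xj.w * invr2 * sqrtf(invr2);  // toy Coulomb-like force
      fi.x += f * dx; fi.y += f * dy; fi.z += f * dz;
      fj.x -= f * dx; fj.y -= f * dy; fj.z -= f * dz;
    }
    excl >>= 1;
    // Rotate the j-atom and its force accumulator one lane around the warp.
    const int src = (lane + 1) & 31;
    xj.x = __shfl_sync(0xffffffff, xj.x, src);
    xj.y = __shfl_sync(0xffffffff, xj.y, src);
    xj.z = __shfl_sync(0xffffffff, xj.z, src);
    xj.w = __shfl_sync(0xffffffff, xj.w, src);
    fj.x = __shfl_sync(0xffffffff, fj.x, src);
    fj.y = __shfl_sync(0xffffffff, fj.y, src);
    fj.z = __shfl_sync(0xffffffff, fj.z, src);
  }
  // After 32 rotations fj is back at the lane that owns that j-atom.
  // A real kernel accumulates across many tiles; atomics stand in for that here.
  atomicAdd(&forceI[base + lane].x, fi.x);
  atomicAdd(&forceI[base + lane].y, fi.y);
  atomicAdd(&forceI[base + lane].z, fi.z);
  atomicAdd(&forceJ[base + lane].x, fj.x);
  atomicAdd(&forceJ[base + lane].y, fj.y);
  atomicAdd(&forceJ[base + lane].z, fj.z);
}
```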
Neighbor list construction on GPU
[Flow: build bounding-box neighbor list → sort neighbor list → compute forces, building the atom-based neighbor list]
Neighbor list sorting
- Tile lists executed on the same thread block should have approximately the same workload
- Simple solution is to sort according to tile list length (see the sketch below)
- Also minimizes tail effects at the end of kernel execution
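A minimal sketch of the length sort, using Thrust here for brevity (sortTileListsByLength is an illustrative helper; NAMD's actual implementation may differ): tile lists are sorted longest-first so that warps scheduled together see similar amounts of work and the long lists do not pile up at the end of the kernel.

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/sort.h>

// tileListLen[i]   = number of tiles in list i (the sort key)
// tileListOrder[i] = index of tile list i (the payload being permuted)
void sortTileListsByLength(thrust::device_vector<int>& tileListLen,
                           thrust::device_vector<int>& tileListOrder) {
  // Sort descending so the longest tile lists are launched first.
  thrust::sort_by_key(tileListLen.begin(), tileListLen.end(),
                      tileListOrder.begin(), thrust::greater<int>());
}
```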
[Diagram: unsorted tile lists cause load imbalance across warps within a thread block; sorting removes it]
Non-bonded kernel performance*
[Charts: speedup vs. NAMD 2.11 of the non-bonded and neighbor-list kernels for DHFR (24K atoms), ApoA1 (92K atoms), and STMV (1.06M atoms); and of the GBIS and neighbor-list kernels for 13K-atom and 5.7M-atom systems]
* Titan supercomputer, K20 GPU
- Explicit solvent
  – 30% faster
  – Neighbor list build up to 2.7x faster
- GB implicit solvent
  – 13K atoms: 3.5x faster
  – 5.7M atoms: 38% faster
Simulation performance
[Timeline: the GPU runs the non-bonded force kernel over patches 1-3, forces are copied back to the CPU, then the CPU does communication & time-step integration]
- Simulation performance is influenced by communication and time-step integration
Streaming force computation
[Timeline: while the GPU non-bonded force kernel works through the patches, the CPU polls for results and overlaps communication & time-step integration]
- Streaming: communication & integration are done during kernel execution, yielding the savings shown in the timeline (a polling sketch follows below)
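One way the polling scheme could be wired up, as a hedged sketch (launchStreamingForceKernel and communicateAndIntegratePatch are hypothetical placeholders, not NAMD's API): the force kernel raises a per-patch flag in mapped pinned host memory as each patch's forces are written out, and the CPU spins on those flags so communication and integration start before the kernel finishes.

```cuda
#include <cuda_runtime.h>

void launchStreamingForceKernel(unsigned int* d_patchDone, cudaStream_t stream); // hypothetical launcher
void communicateAndIntegratePatch(int patchId);                                  // hypothetical CPU work

void runStreamingStep(int numPatches, cudaStream_t stream) {
  // Per-patch completion flags in pinned, mapped host memory.
  volatile unsigned int* h_patchDone;
  unsigned int* d_patchDone;
  cudaHostAlloc((void**)&h_patchDone, numPatches * sizeof(unsigned int), cudaHostAllocMapped);
  cudaHostGetDevicePointer((void**)&d_patchDone, (void*)h_patchDone, 0);
  for (int p = 0; p < numPatches; p++) h_patchDone[p] = 0;

  // The kernel (not shown) computes forces patch by patch; after writing the
  // forces of patch p to host-visible memory it issues __threadfence_system()
  // and then stores d_patchDone[p] = 1.
  launchStreamingForceKernel(d_patchDone, stream);

  // The CPU polls and overlaps per-patch work with the running kernel.
  for (int p = 0; p < numPatches; p++) {
    while (h_patchDone[p] == 0) { /* spin until patch p's forces arrive */ }
    communicateAndIntegratePatch(p);
  }
  cudaStreamSynchronize(stream);
  cudaFreeHost((void*)h_patchDone);
}
```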
Streaming force computation
- Streaming: sort computes “globally” and preserve patch order
- Kernel performance: sort the neighbor list “locally”
[Timeline: GPU non-bonded force kernel proceeds patch by patch while the CPU polls for results]
Neighbor list sorting - global
[Diagram: computes are sorted into reverse order, forces are computed while the output order is recorded, and the recorded order is used to build the sort key for the next step's neighbor-list sort]
Neighbor list sorting - local
[Diagram: local sorting of tile lists by length within a sort window]
Non-bonded kernel performance
[Charts: tile-list length vs. neighbor-list index and normalized kernel runtime for no streaming (incl. memory copy), streaming with global sort, and streaming with global & local sort (window 32)]
Streaming time-step profile
[Profiler timeline: non-bonded kernel overlapped with time-step integration]
Streaming simulation performance
[Chart: speedup with streaming vs. number of Titan nodes (1-8) for DHFR (24K atoms) and ApoA1 (92K atoms)]
- 10% - 30% faster simulations using streaming
Particle Mesh Ewald (PME) – NAMD 2.11
[Pipeline: charge spreading → 3D FFT real-to-complex → Poisson solver → 3D FFT complex-to-real → force gather]
- Charge spreading and force gathering on GPU
- 3D FFT and Poisson solver on CPU
Particle Mesh Ewald (PME) – New
[Pipeline: charge spreading → 3D FFT real-to-complex → Poisson solver → 3D FFT complex-to-real → force gather, all on the GPU]
- Everything on GPU
- Uses cuFFT (see the sketch below)
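A generic outline of a GPU-resident PME reciprocal path built on cuFFT, as a sketch under assumptions (pmeReciprocal and poissonSolveKernel are illustrative names, the charge-spreading and force-gather kernels are only referenced in comments, and the influence-function details are omitted; this is not NAMD's implementation):

```cuda
#include <cuda_runtime.h>
#include <cufft.h>

// Placeholder reciprocal-space multiply; a real Poisson solver applies the
// k-dependent PME influence function (B-spline moduli, box vectors, Ewald beta).
__global__ void poissonSolveKernel(cufftComplex* grid, int nComplex) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= nComplex) return;
  float scale = 1.0f;  // placeholder for the actual k-dependent factor
  grid[i].x *= scale;
  grid[i].y *= scale;
}

void pmeReciprocal(float* d_realGrid, cufftComplex* d_complexGrid,
                   int nx, int ny, int nz) {
  cufftHandle planR2C, planC2R;
  cufftPlan3d(&planR2C, nx, ny, nz, CUFFT_R2C);
  cufftPlan3d(&planC2R, nx, ny, nz, CUFFT_C2R);

  // (1) chargeSpreadKernel<<<...>>>(...) fills d_realGrid          (not shown)
  // (2) Forward 3D FFT, real to complex.
  cufftExecR2C(planR2C, d_realGrid, d_complexGrid);
  // (3) Poisson solve in reciprocal space.
  int nComplex = nx * ny * (nz / 2 + 1);
  poissonSolveKernel<<<(nComplex + 255) / 256, 256>>>(d_complexGrid, nComplex);
  // (4) Inverse 3D FFT, complex to real.
  cufftExecC2R(planC2R, d_complexGrid, d_realGrid);
  // (5) forceGatherKernel<<<...>>>(...) interpolates forces         (not shown)

  cufftDestroy(planR2C);
  cufftDestroy(planC2R);
}
```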
PME Performance on single GPU*
[Chart: speedup vs. NAMD 2.11 for DHFR with PME interpolation order 4, 6, and 8]
* Titan supercomputer: K20 GPU + AMD Opteron CPU
- DHFR, 24K atoms
- 64x64x64 grid
Finally, simulation performance
[Chart: speedup vs. NAMD 2.11 on 1-8 Titan nodes for DHFR (24K atoms) and ApoA1 (92K atoms)]
- Explicit solvent
  – 30%-57% faster simulations
- GB implicit solvent
  – Up to 3.5x faster simulations
[Charts: speedup vs. NAMD 2.11 on 1-4 Titan nodes for the 13K-atom and 5.7M-atom systems]
Challenges: Simulation performance
- Single-GPU performance for DHFR, 47 ns/day on a K20, is still only about half that of GPU-only codes such as Amber* (95 ns/day)
- A major part of the runtime is now taken by memory copying, setup time, and time-step integration
  – This is where the effort has to go now
*http://ambermd.org/gpus/benchmarks.htm
Challenges: Simulation performance
[Profiler timeline: non-bonded kernel, time-step integration, CPU-CPU memory copy, CPU-GPU memory copy & force clear, and bonded forces]
Conclusions
- Explicit solvent non-bonded force kernels
  – 30% speedup
  – 2x speedup in the neighbor list builder
- Implicit solvent non-bonded force kernels
  – 38% speedup (but up to 3.5x speedup)
  – 3.5x speedup in the neighbor list builder
- Improved simulation performance
  – 30%-57% faster simulations than NAMD 2.11
  – Up to 3.5x faster for GB implicit solvent
- Challenge
  – Time-step integrator and the rest of the critical-path code need work
- Planned for release in NAMD 2.12