Improving NAMD Performance on Volta GPUs


  1. Improving NAMD Performance on Volta GPUs David Hardy - Research Programmer, University of Illinois at Urbana-Champaign Ke Li - HPC Developer Technology Engineer, NVIDIA John Stone - Senior Research Programmer, University of Illinois at Urbana-Champaign

  2. Journey of a Legacy HPC Application
  • NAMD has been developed for more than 20 years
  • Parallel scaling of large systems on supercomputers - Blue Waters (NCSA), Titan (ORNL), Stampede (TACC), Summit (ORNL)
  • First full-featured molecular dynamics code to adopt CUDA - Stone, et al., J Comput Chem, 28:2618-2640, 2007
  • Why were certain design choices made?
  • What lessons have we learned?
  • Where does development need to go in the Age of Volta?

  3. NAMD & VMD: Computational Microscope
  Enable researchers to investigate systems described at the atomic scale
  • NAMD - molecular dynamics simulation
  • VMD - visualization, system preparation, and analysis
  [Example systems pictured: neuron, ribosome, virus capsid]

  4. Simulated System Sizes Have Increased Exponentially
  [Chart: number of atoms (log scale, 10^4 to 10^8) vs. year (1986-2014), with milestone systems lysozyme, ApoA1, ATP synthase, STMV, ribosome, and the HIV capsid]

  5. Parallelism in Molecular Dynamics Limited to Each Timestep
  Computational workflow of MD:
  • Initialize coordinates
  • Force calculation - about 99% of computational work
  • Update coordinates - about 1% of computational work
  • Occasional output of reduced quantities (energy, temperature, pressure)
  • Occasional output of coordinates (trajectory snapshot)
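To make the structure of this workflow concrete, here is a minimal, hypothetical sketch of a timestep loop (plain C++ with placeholder names, not NAMD code); computeForces() stands in for the roughly 99% of the work that the rest of this talk moves to the GPU.

#include <cstdio>
#include <vector>

struct Vec3 { double x, y, z; };

// Placeholder for the expensive part (nonbonded, bonded, PME): ~99% of the work.
// A real force routine would go here; this stub just zeroes the forces.
void computeForces(const std::vector<Vec3>& pos, std::vector<Vec3>& f) {
    for (size_t i = 0; i < f.size(); ++i) f[i] = {0.0, 0.0, 0.0};
}

void runMD(std::vector<Vec3>& pos, std::vector<Vec3>& vel,
           double dt, double invMass, int numSteps) {
    std::vector<Vec3> f(pos.size());
    for (int step = 0; step < numSteps; ++step) {
        computeForces(pos, f);                      // ~99% of computational work
        for (size_t i = 0; i < pos.size(); ++i) {   // update coordinates: ~1%
            vel[i].x += dt * invMass * f[i].x;
            vel[i].y += dt * invMass * f[i].y;
            vel[i].z += dt * invMass * f[i].z;
            pos[i].x += dt * vel[i].x;
            pos[i].y += dt * vel[i].y;
            pos[i].z += dt * vel[i].z;
        }
        if (step % 1000 == 0)   // occasional output of energies / trajectory frame
            std::printf("step %d\n", step);
    }
}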

  6. Work Dominated by Nonbonded Forces
  Force calculation:
  • 90% — Non-bonded forces, short-range cutoff
  • 5% — Long-range electrostatics, gridded (e.g. PME)
  • 2% — Bonded forces (bonds, angles, etc.)
  • 2% — Correction for excluded interactions
  Update coordinates:
  • 1% — Integration, constraints, thermostat, barostat
  Apply GPU acceleration first to the most expensive part

  7. Parallelize by Spatial Decomposition of Atoms Data parallelism is common to many codes
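As a rough illustration of the idea (hypothetical code, not NAMD's patch machinery), atoms can be binned into cubic patches whose edge is at least the cutoff distance, so every interaction partner of an atom lies in its own patch or one of the 26 neighboring patches.

#include <algorithm>
#include <cmath>
#include <vector>

struct Atom { float x, y, z; };

// Bin atoms of a cubic box into "patches" whose edge is >= the cutoff distance,
// so each atom only interacts with atoms in its own or neighboring patches.
std::vector<std::vector<int>> decomposeIntoPatches(
        const std::vector<Atom>& atoms, float boxLen, float cutoff) {
    int patchesPerDim = std::max(1, (int)std::floor(boxLen / cutoff));
    float patchLen = boxLen / patchesPerDim;
    std::vector<std::vector<int>> patches(patchesPerDim * patchesPerDim * patchesPerDim);
    for (int i = 0; i < (int)atoms.size(); ++i) {
        int px = std::min(patchesPerDim - 1, (int)(atoms[i].x / patchLen));
        int py = std::min(patchesPerDim - 1, (int)(atoms[i].y / patchLen));
        int pz = std::min(patchesPerDim - 1, (int)(atoms[i].z / patchLen));
        patches[(pz * patchesPerDim + py) * patchesPerDim + px].push_back(i);
    }
    return patches;
}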

  8. NAMD Also Decomposes Compute Objects
  Kale et al., J. Comp. Phys., 151:283-312, 1999
  • Spatially decompose data and communication
  • Separate but related work decomposition
  • "Compute objects" create a much greater degree of parallelism, facilitating the iterative, measurement-based load balancing system, all through the use of Charm++

  9. Overlap Calculations, Offload Nonbonded Forces
  Phillips et al., SC2002
  • Offload nonbonded force calculation to the GPU
  • Objects are assigned to processors and queued as data arrives

  10. Early Nonbonded Forces Kernel Used All Memory Systems
  Start with the most expensive calculation: direct nonbonded interactions.
  • Decompose work into pairs of patches, identical to NAMD structure.
  • GPU hardware assigns patch pairs to multiprocessors dynamically.
  • Force computation runs on a single multiprocessor (GeForce 8800 GTX has 16).
  [Diagram: 32-way SIMD multiprocessor running 32-256 multiplexed threads; 16 kB shared memory holds Patch A coordinates and parameters; 32 kB of registers hold Patch B coordinates, parameters, and forces; constants hold exclusions; the texture unit (8 kB caches) interpolates the force table; 768 MB main memory, no cache, 300+ cycle latency]
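A stripped-down sketch of this patch-pair kernel structure is shown below (hypothetical code: Coulomb-only forces, no texture force table, no exclusions). Patch A is staged in shared memory while each thread keeps one Patch B atom and its force accumulator in registers, mirroring the memory assignment in the diagram.

#include <cuda_runtime.h>

#define PATCH_MAX 128

struct Atom4 { float x, y, z, q; };

// Launch with enough threads to cover patch B, e.g. <<<(nB + 127) / 128, 128>>>.
__global__ void patchPairForces(const Atom4* patchA, int nA,
                                const Atom4* patchB, int nB,
                                float cutoff2, float3* forceB) {
    __shared__ Atom4 shA[PATCH_MAX];          // Patch A staged in shared memory
    for (int i = threadIdx.x; i < nA; i += blockDim.x)
        shA[i] = patchA[i];
    __syncthreads();

    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= nB) return;
    Atom4 aj = patchB[j];                     // Patch B atom kept in registers
    float3 fj = make_float3(0.f, 0.f, 0.f);

    for (int i = 0; i < nA; ++i) {            // loop over all Patch A atoms
        float dx = aj.x - shA[i].x;
        float dy = aj.y - shA[i].y;
        float dz = aj.z - shA[i].z;
        float r2 = dx * dx + dy * dy + dz * dz;
        if (r2 < cutoff2 && r2 > 0.f) {
            float rinv = rsqrtf(r2);
            float f = aj.q * shA[i].q * rinv * rinv * rinv;  // Coulomb only
            fj.x += f * dx; fj.y += f * dy; fj.z += f * dz;
        }
    }
    forceB[j] = fj;
}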

  11. NAMD Performance Improved Using Early GPUs
  [Chart: ApoA1 performance, seconds per step broken into Nonbond, PME, and Other, for CPU vs. GPU on a 2.67 GHz Core 2 Quad Extreme + GeForce 8800 GTX]
  • Full NAMD, not a test harness
  • Useful performance boost
  - 8x speedup for nonbonded
  - 5x speedup overall w/o PME
  - 3.5x speedup overall w/ PME
  - GPU = quad-core CPU
  • Plans for better performance
  - Overlap GPU and CPU work
  - Tune or port remaining work: PME, bonded, integration, etc.

  12. Reduce Communication Latency by Separating Work Units
  Phillips et al., SC2008
  [Timeline diagram of one timestep: the GPU computes remote forces before local forces; the CPU overlaps integration updates and communication with other nodes/processes so remote results can be sent as early as possible]
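On current CUDA, this remote/local split can be sketched with two streams (hypothetical code; the original implementation predates this API style, and the real scheduling is driven by Charm++). The remote kernel and its copy-back are issued first so results can be sent to other nodes while local work is still running on the GPU.

#include <cuda_runtime.h>

__global__ void remoteForces(float3* f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f[i] = make_float3(0.f, 0.f, 0.f);   // placeholder force kernel
}
__global__ void localForces(float3* f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f[i] = make_float3(0.f, 0.f, 0.f);   // placeholder force kernel
}

void oneTimestep(float3* d_remoteF, float3* h_remoteF, int nRemote,
                 float3* d_localF, int nLocal,
                 cudaStream_t remoteStream, cudaStream_t localStream) {
    // Remote forces first, in their own stream...
    remoteForces<<<(nRemote + 127) / 128, 128, 0, remoteStream>>>(d_remoteF, nRemote);
    // ...and copy them back as soon as they finish, so the CPU can start
    // sending results to other nodes while the local kernel is still running.
    cudaMemcpyAsync(h_remoteF, d_remoteF, nRemote * sizeof(float3),
                    cudaMemcpyDeviceToHost, remoteStream);
    localForces<<<(nLocal + 127) / 128, 128, 0, localStream>>>(d_localF, nLocal);

    cudaStreamSynchronize(remoteStream);  // remote results ready: send them now
    // ... MPI/Charm++ sends of remote forces would go here ...
    cudaStreamSynchronize(localStream);   // then finish local work and integrate
}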

  13. Early GPU Fits Into Parallel NAMD as Coprocessor • Offload most expensive calculation: non-bonded forces • Fits into existing parallelization • Extends existing code without modifying core data structures • Requires work aggregation and kernel scheduling considerations to optimize remote communication • GPU is treated as a coprocessor

  14. NAMD Scales Well on Kepler-Based Computers
  [Chart: performance in ns per day (2 fs timestep) vs. number of nodes (256-16384) for a 21M-atom and a 224M-atom system on the Kepler-based Blue Waters XK7 (GTC16) and Titan XK7 (GTC16), plus Edison XC30 (SC14) and Blue Waters XE6 (SC14)]

  15. Large Rate Difference Between Pascal and CPU
  20x FLOP rate difference between GPU and CPU requires full use of CPU cores and vectorization!
  • Balance between GPU and CPU capability keeps shifting towards the GPU
  • NVIDIA plots show only through Pascal — Volta widens the performance gap!
  • Difference made worse by multiple GPUs per CPU (e.g. DGX, Summit)
  • Past efforts to balance work between GPU and CPU are now CPU bound

  16. Balancing Work Between GPU and CPU
  Case Study: Multilevel Summation CUDA Kernels in VMD
  Hardy, et al., J. Parallel Computing, 35:164-177, 2009
  • Effort to balance computational work between GPUs and CPU
  • GPU gets only straightforward data-parallel algorithms, regularized work units
  • CPU gets algorithms less well suited to the GPU, plus "overflow" work
  [Figure: electrostatic field of a partial chromatophore model (~10M atoms) computed with the multilevel summation method on 3 GPUs (G80) in ~90 seconds, 46x faster than a single CPU socket; Le, et al., PLoS Comput Biol, 6:e1000939, 2010]

  17. Time-Averaged Electrostatics Analysis on NCSA Blue Waters
  Preliminary performance for VMD time-averaged electrostatics with the Multilevel Summation Method on the NCSA Blue Waters Early Science System.
  Seconds per trajectory frame for one compute node:
  - Cray XE6 compute node (32 CPU cores, 2x AMD 6200 CPUs): 9.33
  - Cray XK6 GPU-accelerated compute node (16 CPU cores + NVIDIA X2090 Fermi GPU): 2.25
  Speedup for GPU XK6 nodes vs. CPU XE6 nodes: XK6 nodes are 4.15x faster overall.
  In progress: tests on XK7 nodes with the Kepler K20X GPU indicate MSM is CPU bound. XK7 nodes are 4.3x faster overall, but not much faster (yet) than the Fermi X2090. Need to move spatial hashing, prolongation, and interpolation onto the GPU.

  18. Multilevel Summation on the GPU
  Accelerate the short-range cutoff and lattice cutoff parts.
  Performance profile for a 0.5 Å map of the potential for 1.5 M atoms. Hardware platform: Intel QX6700 CPU and NVIDIA GTX 280.
  Computational steps            CPU (s)    w/ GPU (s)   Speedup
  Short-range cutoff              480.07       14.87       32.3
  Long-range: anterpolation         0.18
              restriction           0.16
              lattice cutoff       49.47        1.36       36.4
              prolongation          0.17
              interpolation         3.47
  Total                           533.52       20.21       26.4
  Multilevel summation of electrostatic potentials using graphics processing units. D. Hardy, J. Stone, K. Schulten. J. Parallel Computing, 35:164-177, 2009.

  19. Balancing Work Between GPU and CPU
  Case Study: Multilevel Summation CUDA Kernels in VMD
  Conclusions:
  • Successful (at the time) at exploiting task-level and data-level parallelism
  • Work was reasonably well balanced between CPUs and GPUs for Tesla and Fermi
  • For Kepler and later, the approach was CPU bound

  20. Reduce Latency, Offload All Force Computation
  Emphasis on improving communication latency:
  • Overlapped GPU communication and computation (2012)
  • Offload atom-based work for PME (2013)
  - Use higher-order interpolation with a coarser grid
  - Reduce parallel FFT communication
  Emphasis on using GPUs more effectively:
  • Faster nonbonded force kernels (2016)
  • Offload entire PME using cuFFT, for single-node use (2016) - see the cuFFT sketch after this slide
  • Offload remaining force terms (2017)
  - Includes bonds, angles, dihedrals, impropers, crossterms, exclusions
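The reciprocal-space piece of a single-node PME offload can be sketched with cuFFT roughly as follows (hypothetical function; charge spreading, the k-space convolution, and force interpolation back to atoms are omitted).

#include <cufft.h>
#include <cuda_runtime.h>

// Minimal single-node sketch of PME's reciprocal-space step with cuFFT.
void pmeReciprocal(cufftReal* d_grid, cufftComplex* d_gridHat,
                   int nx, int ny, int nz) {
    cufftHandle planFwd, planInv;
    cufftPlan3d(&planFwd, nx, ny, nz, CUFFT_R2C);
    cufftPlan3d(&planInv, nx, ny, nz, CUFFT_C2R);

    // forward FFT of the spread charge grid
    cufftExecR2C(planFwd, d_grid, d_gridHat);

    // ... multiply d_gridHat (nx * ny * (nz/2 + 1) elements) by the
    //     reciprocal-space influence function here ...

    // inverse FFT back to the potential grid
    cufftExecC2R(planInv, d_gridHat, d_grid);

    cufftDestroy(planFwd);
    cufftDestroy(planInv);
}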

  21. Overlapped GPU Communication and Computation
  • Allows incremental results from a single grid to be processed on the CPU before the grid finishes on the GPU
  • Allows merging and prioritizing of remote and local work
  • GPU side: write results to host-mapped memory (also works without streaming)
  - __threadfence_system() and __syncthreads()
  - Atomic increment for next output queue location
  - Write result index to output queue
  • CPU side: poll end of output queue (int array) in host memory
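Below is a minimal, self-contained sketch of this publish-and-poll pattern (hypothetical kernel and names, not NAMD source): each block writes its results into host-mapped memory, issues a system fence, then appends its block index to an output queue that the CPU polls while the kernel is still running.

#include <cuda_runtime.h>
#include <cstdio>

__device__ unsigned int queueSlot = 0;   // next free slot in the output queue

__global__ void produceResults(float* hostResults, int* hostQueue, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) hostResults[i] = 2.0f * i;              // write result to host-mapped memory
    __threadfence_system();                            // make results visible to the host
    __syncthreads();                                   // whole block done writing
    if (threadIdx.x == 0) {
        unsigned int slot = atomicAdd(&queueSlot, 1);  // next output queue location
        hostQueue[slot] = blockIdx.x;                  // publish: this block's results are ready
    }
}

int main() {
    const int n = 1 << 16, block = 256, grid = n / block;
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float* hResults; int* hQueue;
    cudaHostAlloc((void**)&hResults, n * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void**)&hQueue, grid * sizeof(int), cudaHostAllocMapped);
    for (int b = 0; b < grid; ++b) hQueue[b] = -1;     // -1 means "not ready yet"

    float* dResults; int* dQueue;                      // device views of the mapped buffers
    cudaHostGetDevicePointer((void**)&dResults, hResults, 0);
    cudaHostGetDevicePointer((void**)&dQueue, hQueue, 0);

    produceResults<<<grid, block>>>(dResults, dQueue, n);

    // CPU side: poll the end of the output queue and process each block's
    // results as they arrive, while the kernel is still running.
    for (int seen = 0; seen < grid; ++seen) {
        while (((volatile int*)hQueue)[seen] == -1) { /* spin */ }
        int readyBlock = hQueue[seen];
        // ... process the block-sized chunk of hResults for readyBlock here ...
        (void)readyBlock;
    }
    cudaDeviceSynchronize();
    std::printf("all %d blocks processed\n", grid);
    return 0;
}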

  22. Non-overlapped Kernel Communication Integration unable to start until GPU kernel finishes

  23. Overlapped Kernel Communication GPU kernel communicates results while running; patches begin integration as soon as data arrives

  24. Non-bonded Force Computation in NAMD
  • Two levels of spatial sorting
  - Simulation box is divided into patches
  - Within each patch, atoms are sorted spatially into groups of 32 using the orthogonal recursive bisection method (sketched after this slide)
  (from S6623: Advances in NAMD GPU Performance)
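A simplified host-side sketch of the orthogonal recursive bisection step (hypothetical code; NAMD's production implementation differs in detail): the atom index list is recursively split at the spatial median along the longest axis until each group is no larger than 32 atoms, i.e. one warp's worth.

#include <algorithm>
#include <vector>

struct Atom { float r[3]; };   // x, y, z

// Recursively bisect the atom index range [lo, hi) along its longest axis
// until each group holds at most 32 atoms.
void orbSort(std::vector<int>& idx, int lo, int hi, const std::vector<Atom>& atoms) {
    if (hi - lo <= 32) return;
    float mins[3] = {1e30f, 1e30f, 1e30f};
    float maxs[3] = {-1e30f, -1e30f, -1e30f};
    for (int k = lo; k < hi; ++k)
        for (int d = 0; d < 3; ++d) {
            mins[d] = std::min(mins[d], atoms[idx[k]].r[d]);
            maxs[d] = std::max(maxs[d], atoms[idx[k]].r[d]);
        }
    int axis = 0;                                  // pick the longest bounding-box edge
    for (int d = 1; d < 3; ++d)
        if (maxs[d] - mins[d] > maxs[axis] - mins[axis]) axis = d;
    int mid = lo + (hi - lo) / 2;                  // split at the spatial median
    std::nth_element(idx.begin() + lo, idx.begin() + mid, idx.begin() + hi,
                     [&](int a, int b) { return atoms[a].r[axis] < atoms[b].r[axis]; });
    orbSort(idx, lo, mid, atoms);
    orbSort(idx, mid, hi, atoms);
}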

  25. Non-bonded Force Compute
  • A compute covers all pairwise interactions between two patches
  • For the GPU, each compute is split into tiles of 32x32 atoms
  [Diagram: Patches 1-3 paired into Compute 1 and Compute 2, each divided into 32x32 tiles]

  26. Non-bonded Force Computation
  • One warp per tile
  • Loop through the 32x32 tile diagonally (see the sketch after this slide)
  - Avoids race conditions when storing forces F_i and F_j
  • Bitmask used for exclusion lookup
  [Diagram: warps 1-4 sweeping a 32x32 tile of atoms in patch i vs. atoms in patch j, accumulating F_i and F_j]
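A hypothetical sketch of the diagonal tile traversal (Coulomb-only forces, one warp per tile, exclusions as a 32-bit mask per i atom; not NAMD's actual kernel): at step k, lane l handles j = (l + k) mod 32, so every lane touches a different j atom and both F_i and F_j can be accumulated without atomics or races inside the warp.

#include <cuda_runtime.h>

#define WARP 32
#define FULL_MASK 0xffffffffu

// Launch with one 32-thread block per tile, e.g. tileForces<<<numTiles, WARP>>>(...).
// Atom arrays are laid out with 32 atoms per tile; charges are stored in .w.
__global__ void tileForces(const float4* xi, const float4* xj,
                           const unsigned int* exclMask,   // one 32-bit mask per i atom
                           float cutoff2, float3* fi, float3* fj) {
    int lane = threadIdx.x & (WARP - 1);
    int tile = blockIdx.x;

    float4 ai = xi[tile * WARP + lane];        // this lane's i atom
    unsigned int excl = exclMask[tile * WARP + lane];

    __shared__ float4 sxj[WARP];
    __shared__ float3 sfj[WARP];
    sxj[lane] = xj[tile * WARP + lane];
    sfj[lane] = make_float3(0.f, 0.f, 0.f);
    float3 myFi = make_float3(0.f, 0.f, 0.f);
    __syncwarp(FULL_MASK);

    for (int k = 0; k < WARP; ++k) {
        int j = (lane + k) & (WARP - 1);       // diagonal sweep through the tile
        if (!(excl & (1u << j))) {             // skip excluded pairs via bitmask
            float dx = ai.x - sxj[j].x, dy = ai.y - sxj[j].y, dz = ai.z - sxj[j].z;
            float r2 = dx * dx + dy * dy + dz * dz;
            if (r2 < cutoff2 && r2 > 0.f) {
                float rinv = rsqrtf(r2);
                float f = ai.w * sxj[j].w * rinv * rinv * rinv;
                myFi.x += f * dx;  myFi.y += f * dy;  myFi.z += f * dz;
                sfj[j].x -= f * dx; sfj[j].y -= f * dy; sfj[j].z -= f * dz;
            }
        }
        __syncwarp(FULL_MASK);                 // keep lanes in lockstep on Volta
    }
    fi[tile * WARP + lane] = myFi;
    fj[tile * WARP + lane] = sfj[lane];
}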
