

SLIDE 1

Improving NAMD Performance and Scaling on Heterogeneous Architectures

David J. Hardy and Julio D. C. Maia
NIH Center for Macromolecular Modeling and Bioinformatics, Theoretical and Computational Biophysics Group, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign

SLIDE 2

NAMD Scalable Molecular Dynamics

  • Code written in C++/Charm++/CUDA
  • Performance scales to hundreds of thousands of CPU cores and tens of thousands of GPUs
  • Large systems (single copy scaling)
  • Enhanced sampling (multiple copy scaling)
  • Runs on laptops up to supercomputers
  • Runs on AWS cloud, MS Azure
  • TCL/Python script as input file
  • Workflow control
  • Method development at higher level
  • Structure preparation and analysis with VMD
  • QwikMD


NAMD: http://www.ks.uiuc.edu/Research/namd/

Images: Zika virus; E. coli chemosensory array

J. Phillips, D. Hardy, J. Maia, et al. J. Chem. Phys. 153, 044130 (2020). https://doi.org/10.1063/5.0014475

SLIDE 3

NAMD Highlights

  • User defined forces
  • Grid forces
  • Interactive molecular dynamics
  • Steered molecular dynamics
  • Accelerated sampling methods
  • Replica exchange
  • Collective variables (Colvars)
  • Biased simulation
  • Enhanced sampling
  • Alchemical transformations
  • Free energy perturbation (FEP)
  • Thermodynamic integration (TI)
  • Constant-pH molecular dynamics
  • Hybrid QM/MM simulation
  • Multiple QM regions


Complete List of NAMD Features: https://www.ks.uiuc.edu/Research/namd/2.14/ug/

Images: Proteasome (MDFF + IMD); membrane vesicle fusion and formation (grid forces); ABC transporter mechanism (Colvars); DNA QM/MM simulation

SLIDE 4

Molecular Dynamics Simulation

  • Most fundamentally, integrate Newton’s equations of motion:

[Equation figure: Newton's equations of motion and the force-field potential, with the nonbonded terms labeled Lennard-Jones and electrostatics; annotations note that the equations are integrated for up to billions of time steps and that the force calculation is most of the computational work.]
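A minimal sketch of what the equation figure conveys, assuming the standard classical MD formulation (the exact symbols and grouping of the bonded terms are an assumption, since the original figure is not recoverable from the text):

```latex
% Newton's equations of motion, integrated for each atom i at every time step:
\[
  m_i \frac{d^2 \mathbf{r}_i}{dt^2} \;=\; \mathbf{F}_i
  \;=\; -\nabla_{\mathbf{r}_i} U(\mathbf{r}_1,\dots,\mathbf{r}_N),
\]
% with the potential split into bonded terms plus the nonbonded Lennard-Jones
% and electrostatic terms that dominate the computational work:
\[
  U \;=\; U_{\text{bonded}}
  \;+\; \sum_{i<j} 4\varepsilon_{ij}\!\left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{\!12}
        - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{\!6}\right]
  \;+\; \sum_{i<j} \frac{q_i q_j}{4\pi\epsilon_0\, r_{ij}}.
\]
```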

SLIDE 5

Parallelism for MD Simulation Limited to Each Timestep

Computational workflow of MD: initialize particle positions, then loop over millions of timesteps. Each timestep performs the force calculation (about 99% of the computational work) and then updates positions (about 1% of the computational work), with occasional output of reduced quantities (energy, temperature, pressure) and position coordinates (trajectory snapshots).
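To make that workflow concrete, here is a toy, self-contained C++ sketch of the loop (illustrative only: a naive O(N^2) Lennard-Jones force and a simple Euler update stand in for NAMD's real force and integration kernels):

```cpp
// Toy MD timestep loop: force calculation dominates, integration is cheap.
#include <cmath>
#include <cstdio>
#include <vector>

struct Vec3 { double x, y, z; };

// Force calculation: ~99% of the computational work (naive O(N^2) toy version).
void compute_forces(const std::vector<Vec3>& pos, std::vector<Vec3>& force) {
    const double eps = 1.0, sigma = 1.0;                    // toy Lennard-Jones parameters
    for (auto& f : force) f = {};
    for (size_t i = 0; i < pos.size(); ++i)
        for (size_t j = i + 1; j < pos.size(); ++j) {
            Vec3 d{pos[i].x - pos[j].x, pos[i].y - pos[j].y, pos[i].z - pos[j].z};
            double r2 = d.x*d.x + d.y*d.y + d.z*d.z;
            double sr6 = std::pow(sigma*sigma / r2, 3);
            double fmag = 24.0 * eps * (2.0*sr6*sr6 - sr6) / r2;   // (-dU/dr)/r for the LJ pair
            force[i].x += fmag*d.x; force[i].y += fmag*d.y; force[i].z += fmag*d.z;
            force[j].x -= fmag*d.x; force[j].y -= fmag*d.y; force[j].z -= fmag*d.z;
        }
}

// Position/velocity update: ~1% of the computational work (unit mass, Euler step).
void integrate(std::vector<Vec3>& pos, std::vector<Vec3>& vel,
               const std::vector<Vec3>& force, double dt) {
    for (size_t i = 0; i < pos.size(); ++i) {
        vel[i].x += force[i].x*dt; vel[i].y += force[i].y*dt; vel[i].z += force[i].z*dt;
        pos[i].x += vel[i].x*dt;   pos[i].y += vel[i].y*dt;   pos[i].z += vel[i].z*dt;
    }
}

int main() {
    std::vector<Vec3> pos(64), vel(64), force(64);
    for (size_t i = 0; i < pos.size(); ++i)                 // simple cubic lattice start
        pos[i] = {double(i % 4), double((i / 4) % 4), double(i / 16)};
    const long nsteps = 1000, output_freq = 100;
    for (long step = 0; step < nsteps; ++step) {            // loop over timesteps
        compute_forces(pos, force);                         // dominates the runtime
        integrate(pos, vel, force, 0.001);
        if (step % output_freq == 0)                        // occasional output
            std::printf("step %ld  x0=%.3f\n", step, pos[0].x);
    }
}
```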
SLIDE 6

NAMD 2.14 Decomposes Force Terms into Fine-Grained Objects for Scalability


Offload forces to GPU

SLIDE 7

NAMD 2.14 Excels at Scalable Parallelism on CPUs and GPUs

[Plot: performance (ns/day) vs. number of nodes on Summit and Frontera for replications of the Satellite Tobacco Mosaic Virus (STMV): a 5x2x2 grid = 21M atoms and a 7x6x5 grid = 224M atoms.]
SLIDE 8


NAMD 2.14 Simulating SARS-CoV-2 on Summit

(A) Virion, (B) Spike, (C) Glycan shield conformations

[Panels A-D with scale bars in nm. Plot: performance (ns/day) vs. number of nodes for the Spike-ACE2 and virion systems on Summit CPU+GPU, Summit CPU-only, and Frontera CPU-only.]

Scaling performance:

  • ~305M atom virion
  • ~8.5M atom spike

Collaboration with Amaro Lab at UCSD, images rendered by VMD

Strong scaling: 51% efficiency

SLIDE 9

Benchmarks on Single Nodes and Newer GPUs Reveal Problems

NAMD 2.13 (2018) shows only a ~20% performance improvement going from P100 to V100, while the hardware peak improves by ~70%.

Peak performance in TFLOPS: Pascal (P100) 9.3, Volta (V100) 15.7.

SLIDE 10

Profiling on Modern GPUs

Profiling ApoA1 (92k atoms) with NAMD 2.13 on 16 CPU cores and one Volta GPU. Gaps in the blue strip = GPU is idle!

SLIDE 11

NAMD 2.13 and 2.14 Have Limited GPU Performance

  • Offloading the force calculation is not enough!
  • Overall utilization of modern GPUs is limited
  • We want better single-GPU performance
  • The majority of MD users run systems of fewer than 1M atoms on a single GPU
  • We must transition from the GPU-offload approach to GPU-resident!

The DGX-2 has 16 V100 GPUs but only 48 CPU cores: we need to do more GPU work with less CPU power.

SLIDE 12

NAMD 3.0: GPU-Resident NAMD

  • Fetch GPU force buffers directly from the force module
  • Bypass any CPU-GPU memory transfers - only call GPU kernels!
  • Convert forces to a structure-of-arrays (SOA) data structure using the GPU
  • Invoke GPU integration tasks once

[Per-step GPU cycle: Calculate Forces → Fetch GPU Force Buffers → Convert buffers to SOA → Integrate all the atoms; see the CUDA sketch below]

https://www.ks.uiuc.edu/Research/namd/3.0alpha/
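A minimal CUDA sketch of the GPU-resident step described above, using hypothetical kernel names and buffer layouts (this is not NAMD's actual code); it only illustrates the idea that every per-step stage runs as a kernel on device-resident buffers, with no host-device force transfers:

```cuda
// Illustrative GPU-resident timestep: buffers stay on the device, the host only
// launches kernels. Names and layouts are hypothetical, not NAMD's.
#include <cuda_runtime.h>

struct float3soa { float *x, *y, *z; };   // structure-of-arrays device buffers

// Convert a per-module force buffer (array-of-structures) into SOA on the GPU.
__global__ void convert_forces_to_soa(const float3 *forceAos, float3soa forceSoa, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        forceSoa.x[i] = forceAos[i].x;
        forceSoa.y[i] = forceAos[i].y;
        forceSoa.z[i] = forceAos[i].z;
    }
}

// Integrate all atoms on the GPU (toy Euler update with unit mass).
__global__ void integrate_atoms(float3soa pos, float3soa vel, float3soa force,
                                float dt, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        vel.x[i] += force.x[i] * dt;  pos.x[i] += vel.x[i] * dt;
        vel.y[i] += force.y[i] * dt;  pos.y[i] += vel.y[i] * dt;
        vel.z[i] += force.z[i] * dt;  pos.z[i] += vel.z[i] * dt;
    }
}

// One timestep: every stage is a kernel launch; no cudaMemcpy of forces to the host.
void gpu_resident_step(const float3 *d_forceAos, float3soa d_force,
                       float3soa d_pos, float3soa d_vel, float dt, int n,
                       cudaStream_t stream) {
    int block = 256, grid = (n + block - 1) / block;
    // (bonded/nonbonded/PME force kernels would be launched here)
    convert_forces_to_soa<<<grid, block, 0, stream>>>(d_forceAos, d_force, n);
    integrate_atoms<<<grid, block, 0, stream>>>(d_pos, d_vel, d_force, dt, n);
}

int main() {
    const int n = 1024; const float dt = 0.002f;
    float3 *d_forceAos = nullptr;
    cudaMalloc(&d_forceAos, n * sizeof(float3));
    cudaMemset(d_forceAos, 0, n * sizeof(float3));
    float *d_buf = nullptr;                       // 9 contiguous SOA arrays: force, pos, vel
    cudaMalloc(&d_buf, 9 * n * sizeof(float));
    cudaMemset(d_buf, 0, 9 * n * sizeof(float));
    float3soa d_force{d_buf,       d_buf + n,     d_buf + 2*n};
    float3soa d_pos  {d_buf + 3*n, d_buf + 4*n,   d_buf + 5*n};
    float3soa d_vel  {d_buf + 6*n, d_buf + 7*n,   d_buf + 8*n};
    gpu_resident_step(d_forceAos, d_force, d_pos, d_vel, dt, n, nullptr);
    cudaDeviceSynchronize();
}
```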


SLIDE 13

NAMD 3.0 Has Better GPU Utilization

[GPU timeline: alternating Forces and Integration tasks. NAMD 2.14 shows gaps between GPU tasks; NAMD 3.0 has no CPU bottlenecks.]

SLIDE 14

NAMD 3.0: Performance on Different Systems

System                    NAMD 2.14   NAMD 3.0   (ns/day)
JAC (23k atoms)             124.2       254.4
ApoA1 (92k atoms)            41.4       102.5
F1-ATPase (327k atoms)       12.2        25.7
STMV (1.06M atoms)            3.8         7.8

Hardware: Intel Xeon E5-2650 v2 with 16 physical cores, NVIDIA Titan V. Simulation: NVE, 12 Å cutoff, 2 fs timestep.

SLIDE 15

NAMD 3.0: Multi-Copy Performance - Aggregate Throughput With DGX-2

ApoA1, 92k atoms; 16 replicas, one per NVIDIA V100.

Aggregate throughput (ns/day):
Cutoff   NAMD 2.14   NAMD 3.0
12 Å       283.7      1,924.36
8 Å        283.84     3,005.65

SLIDE 16

NAMD 3.0: Single trajectory - Multiple GPU Performance

STMV, 1.06M atoms; 2 fs timestep; no PME yet.

# GPUs        1     2     3     4     5     6     7     8
NAMD 2.14    8.3   9.6   9.2  10.4  10.9  11.5  10.8  13.0   (ns/day)
NAMD 3.0    11.5  20.5  28.3  34.8  39.9  44.9  47.5  50.8   (ns/day)

SLIDE 17

PME Impedes Scalability

  • For multi-node scaling, 3D FFT communication cost grows faster than computation cost
  • For single-node multi-GPU scaling:
  • 3D FFTs are too small to parallelize effectively with cuFFT
  • Too much latency is introduced with pencil decomposition and cuFFT 1D FFTs
  • Is task-based parallelism best, delegating one GPU to the 3D FFTs and reciprocal-space calculation?
  • Requires gathering all grid data to that one GPU and being careful not to overload it with other work
  • Why not use a better-scaling algorithm, such as MSM?

SLIDE 18

Multilevel Summation Method (MSM)

  • Split the 1/r potential into a short-range cutoff part plus smoothed parts that are successively more slowly varying. All but the top-level potential are cut off.
  • Smoothed potentials are interpolated from successively coarser grids.
  • Finest grid spacing h and smallest cutoff distance a are doubled at each successive level.

[Figure: the 1/r potential split into a short-range part (cutoff a) plus smoothed parts (cutoff 2a, ...) interpolated from the atoms, the h-grid, the 2h-grid, ...; a two-level form of this splitting is sketched below.]

  • D. Hardy, et al. J. Chem. Theory Comput. 11(2), 766-779 (2015) https://doi.org/10.1021/ct5009075
  • D. Hardy, et al. J. Chem. Phys. 144, 114112 (2016) https://doi.org/10.1063/1.4943868
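As a sketch of the splitting those bullets describe (following the form used in the cited MSM papers; the softening function γ, which equals 1/ρ for ρ ≥ 1, is left unspecified here), the two-level case is:

```latex
% Two-level MSM splitting of the 1/r kernel; each term except the last
% vanishes beyond its cutoff because \gamma(\rho) = 1/\rho for \rho \ge 1.
\[
  \frac{1}{r} \;=\;
  \underbrace{\left[\frac{1}{r} - \frac{1}{a}\,\gamma\!\Big(\frac{r}{a}\Big)\right]}_{\text{short range, cutoff } a}
  \;+\;
  \underbrace{\left[\frac{1}{a}\,\gamma\!\Big(\frac{r}{a}\Big)
      - \frac{1}{2a}\,\gamma\!\Big(\frac{r}{2a}\Big)\right]}_{\text{smoothed, cutoff } 2a,\ h\text{-grid}}
  \;+\;
  \underbrace{\frac{1}{2a}\,\gamma\!\Big(\frac{r}{2a}\Big)}_{\text{top level, } 2h\text{-grid}}
\]
```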
SLIDE 19

MSM Calculation is O(N)

force = exact short-range part + interpolated long-range part

Computational steps (from positions and charges to potentials and forces): the short-range cutoff part is evaluated directly; for the long-range parts, anterpolation spreads atomic charges to the h-grid, restriction moves charge up to the 2h-grid and 4h-grid, a grid cutoff is applied at each level, prolongation moves potentials back down the hierarchy, and interpolation returns potentials and forces to the atoms.

Correspondences: grid cutoff ⟺ 3D convolution; anterpolation ⟺ PME charge spreading; interpolation ⟺ PME force interpolation.
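A toy C++ sketch of that control flow, using hypothetical function names and stub bodies (not NAMD's MSM code); it is meant only to show how each level touches a roughly 8x smaller grid, which keeps the total work O(N):

```cpp
// Structural sketch of the MSM long-range evaluation (toy stubs, hypothetical names).
#include <cstdio>
#include <vector>

using Grid = std::vector<double>;                       // flattened 3D grid (toy)

Grid anterpolate(const std::vector<double>& q, int n) { // atoms -> finest (h) grid
    return Grid(n, 0.0);                                //   (like PME charge spreading)
}
Grid restrict_grid(const Grid& fine) {                  // h-grid -> 2h-grid, ...
    return Grid(fine.size() / 8 + 1, 0.0);              //   each level is ~8x smaller
}
Grid grid_cutoff(const Grid& q, int level) {            // per-level 3D convolution (stub)
    return Grid(q.size(), 0.0);
}
void prolongate_add(const Grid& coarse, Grid& fine) {}  // add coarse potential down (stub)
std::vector<double> interpolate(const Grid& e, int natoms) {  // grid -> atom values
    return std::vector<double>(natoms, 0.0);            //   (like PME force interpolation)
}

std::vector<double> msm_long_range(const std::vector<double>& charges,
                                   int finest_grid_size, int nlevels) {
    std::vector<Grid> q(nlevels), e(nlevels);
    q[0] = anterpolate(charges, finest_grid_size);      // anterpolation
    for (int k = 1; k < nlevels; ++k)
        q[k] = restrict_grid(q[k - 1]);                 // restriction up the hierarchy
    for (int k = 0; k < nlevels; ++k)
        e[k] = grid_cutoff(q[k], k);                    // grid cutoff at every level
    for (int k = nlevels - 2; k >= 0; --k)
        prolongate_add(e[k + 1], e[k]);                 // prolongation back down
    return interpolate(e[0], (int)charges.size());      // interpolation to atoms
}

int main() {
    std::vector<double> charges(1000, 1.0);
    auto pot = msm_long_range(charges, 64 * 64 * 64, 4);
    std::printf("computed %zu long-range values\n", pot.size());
}
```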

SLIDE 20

Periodic MSM: Replaces PME

  • The previous implementation was fine for non-periodic boundaries but insufficient for periodic boundary conditions
  • Lower accuracy than PME; requires the system to be neutrally charged
  • New development for MSM:
  • Interpolation with periodic B-spline basis functions gives the same accuracy as PME
  • Handle the infinite 1/r tail as a reciprocal-space calculation on the top-level grid
  • The number of grid levels can be terminated long before reaching a single point; use this to bound the size of the FFT
  • Communication is nearest-neighbor up the tree to the top grid level

SLIDE 21

Extending NAMD 3.0 to Multiple Nodes

  • Reintroducing Charm++ communication
  • Fast GPU integration calls the force kernels directly
  • Unused Sequencer user-level threads are put to sleep
  • Awaken threads for atom migration between patches and coordinate output
  • Will GPU direct messaging be the best alternative?
  • Charm++ support is being developed


SLIDE 22

Additional Challenges for NAMD

  • Feature-complete GPU-resident version
  • NAMD 3.0 for now supports just a subset of features
  • Incorporating Colvars (collective variables) force biasing
  • Poses a significant performance penalty without reimplementing parts of Colvars on GPU
  • Introducing support for other GPU vendors
  • AMD HIP port of NAMD 2.14, still working on 3.0
  • Intel DPC++ port of non-bonded CUDA kernels



SLIDE 23

Acknowledgments

  • NAMD development is funded by NIH P41-GM104601
  • NAMD team:


David Hardy, Julio Maia, John Stone, Jim Phillips, Mohammad Soroush Barhaghi, Mariano Spivak, Wei Jiang, Rafael Bernardi, Ronak Buch, Jaemin Choi