SLIDE 1

Efficient Parallelization of Molecular Dynamics Simulations on Hybrid CPU/GPU Supercomputers

Jaewoon Jung (RIKEN, RIKEN AICS) Yuji Sugita (RIKEN, RIKEN AICS, RIKEN QBiC, RIKEN iTHES)

SLIDE 2

Molecular Dynamics (MD)

  • 1. Energy/forces are described by a classical molecular mechanics force field.
  • 2. The state is updated according to the equations of motion.

Long MD trajectories are needed to obtain thermodynamic quantities of the target systems.

Equation of motion:

$$\frac{d\mathbf{r}_i}{dt} = \frac{\mathbf{p}_i}{m_i}, \qquad \frac{d\mathbf{p}_i}{dt} = \mathbf{F}_i$$

A long MD trajectory => ensemble generation.

Integration (one step of size $\Delta t$):

$$\mathbf{p}_i(t+\Delta t) = \mathbf{p}_i(t) + \mathbf{F}_i(t)\,\Delta t, \qquad \mathbf{r}_i(t+\Delta t) = \mathbf{r}_i(t) + \frac{\mathbf{p}_i(t+\Delta t)}{m_i}\,\Delta t$$
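As a concrete illustration, here is a minimal C++ sketch of one such integration step. The data layout and function name are hypothetical, and the simple update shown follows the two equations above rather than the actual GENESIS integrator.

```cpp
#include <vector>

struct Vec3 { double x, y, z; };

// One integration step: update momenta from forces, then positions from momenta.
void integrate_step(std::vector<Vec3>& r, std::vector<Vec3>& p,
                    const std::vector<Vec3>& f, const std::vector<double>& m,
                    double dt) {
  for (size_t i = 0; i < r.size(); ++i) {
    // p_i(t + dt) = p_i(t) + F_i(t) * dt
    p[i].x += f[i].x * dt;  p[i].y += f[i].y * dt;  p[i].z += f[i].z * dt;
    // r_i(t + dt) = r_i(t) + p_i(t + dt) / m_i * dt
    r[i].x += p[i].x / m[i] * dt;
    r[i].y += p[i].y / m[i] * dt;
    r[i].z += p[i].z / m[i] * dt;
  }
}
```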

SLIDE 3

Potential energy in MD using PME

The total potential energy with a classical force field:

$$E_{\mathrm{total}} = \sum_{\mathrm{bonds}} k_b (b-b_0)^2 + \sum_{\mathrm{angles}} k_\theta (\theta-\theta_0)^2 + \sum_{\mathrm{dihedrals}} \frac{V_n}{2}\,[1+\cos(n\phi-\delta)] + \sum_{j>i}^{N} \left[ \varepsilon_{ij}\!\left(\left(\frac{R_{ij}}{r_{ij}}\right)^{12} - 2\left(\frac{R_{ij}}{r_{ij}}\right)^{6}\right) + \frac{q_i q_j}{r_{ij}} \right]$$

The terms scale as O(N), O(N), O(N), and O(N^2), respectively; the last (non-bonded) term is the main bottleneck in MD.

With PME, the non-bonded term is split into real-space and reciprocal-space parts (constant prefactors omitted):

$$\sum_{j>i} \left[ \varepsilon_{ij}\!\left(\left(\frac{R_{ij}}{r_{ij}}\right)^{12} - 2\left(\frac{R_{ij}}{r_{ij}}\right)^{6}\right) + q_i q_j\,\frac{\operatorname{erfc}(\beta r_{ij})}{r_{ij}} \right] + \sum_{\mathbf{k}\neq 0} \frac{\exp(-|\mathbf{k}|^2/4\beta^2)}{|\mathbf{k}|^2}\,\big|\mathrm{FFT}(Q)(\mathbf{k})\big|^2$$

Real space: O(CN). Reciprocal space: O(N log N). (N = total number of particles.)
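For the real-space part, each pair within the cutoff contributes an erfc-damped Coulomb term that decays quickly and can therefore be truncated. A minimal sketch of that single-pair contribution (the function name is hypothetical; beta is the Ewald splitting parameter from the equation above):

```cpp
#include <cmath>

// Real-space PME Coulomb contribution of one particle pair within the cutoff.
// Summed over all such pairs this is the O(CN) part of the interaction.
double pme_real_pair(double qi, double qj, double rij, double beta) {
  return qi * qj * std::erfc(beta * rij) / rij;
}
```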

SLIDE 4

GENESIS MD software (Generalized Ensemble Simulation Systems)

  • 1. Aims at developing efficient and accurate methodologies for free energy calculations in biological systems.
  • 2. Efficient parallelization: suitable for massively parallel supercomputers, in particular K.
  • 3. Applicability to large-scale simulations.
  • 4. Algorithms coupled with different molecular models such as coarse-grained, all-atom, and hybrid QM/MM.
  • 5. Generalized ensembles with Replica Exchange Molecular Dynamics.

Ref: Jaewoon Jung et al., WIREs Comput. Mol. Sci. 5, 310-323 (2015)
Website: http://www.riken.jp/TMS2012/cbp/en/research/software/genesis/index.html

SLIDE 5

Parallelization of the real space interaction: Midpoint cell method (1)

Midpoint method: the interaction between two particles is assigned to the process that owns the midpoint position of the two particles.
Midpoint cell method: the interaction between two particles is assigned to the process that owns the midpoint cell of the two cells in which the particles reside.

Small communication, efficient energy/force evaluations.

Ref: J. Jung, T. Mori and Y. Sugita, JCC 35, 1064 (2014)
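A minimal sketch of the midpoint-cell idea (hypothetical types and names; the real scheme also needs tie-breaking rules when the midpoint falls exactly between cells):

```cpp
// Cell indices of a particle in the 3D cell grid.
struct Cell { int cx, cy, cz; };

// The pair (i, j) is computed by whichever process owns the cell lying halfway
// between the cell of i and the cell of j.
Cell midpoint_cell(const Cell& ci, const Cell& cj) {
  Cell m;
  m.cx = (ci.cx + cj.cx) / 2;   // integer average of the two cell indices;
  m.cy = (ci.cy + cj.cy) / 2;   // half-integer cases need an extra tie-breaking
  m.cz = (ci.cz + cj.cz) / 2;   // rule, omitted in this sketch
  return m;
}
```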

SLIDE 6

Basic domain decomposition using the midpoint cell method (2)

  • 1. Space is partitioned into fixed-size boxes (cells) whose dimensions are larger than the cutoff distance (see the sketch after this list).
  • 2. Only information from the neighboring space (domain) is needed to compute the energies.
  • 3. Communication per process is reduced as the number of processes increases.
  • 4. This gives good parallel efficiency and is suitable for large systems on massively parallel supercomputers.
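A minimal sketch of the cell partitioning under the stated constraint that each cell edge is at least the cutoff distance, so that all interaction partners of a particle lie in its own cell or the 26 neighboring cells (helper functions are hypothetical; periodic wrapping is ignored):

```cpp
#include <cmath>

// Largest number of cells per dimension such that each cell edge >= cutoff.
int cells_per_dim(double box_length, double cutoff) {
  int n = static_cast<int>(std::floor(box_length / cutoff));
  return n > 0 ? n : 1;
}

// Cell index of a coordinate along one dimension (assumes 0 <= x < box_length).
int cell_index_1d(double x, double box_length, int ncell) {
  int c = static_cast<int>(x / box_length * ncell);
  return c < ncell ? c : ncell - 1;   // guard against rounding at the boundary
}
```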

SLIDE 7

Parallelization of FFT in GENESIS : Volumetric decomposition scheme in 3D FFT

  • 1. More communication steps than existing parallel FFTs.
  • 2. MPI Alltoall communications only within one-dimensional process groups (existing schemes: communications across two-/three-dimensional groups); a conceptual sketch follows this list.
  • 3. Reduced communication cost for large numbers of processors.
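The following is a conceptual sketch, not the GENESIS implementation, of why the volumetric decomposition needs Alltoall only within one-dimensional process groups: the ranks that share the same (y, z) block of the PME grid form a small communicator and exchange data among themselves before local 1D FFTs along x. Names and the data layout are assumptions.

```cpp
#include <mpi.h>
#include <vector>

// Exchange grid data along x among the ranks that hold the same (y, z) block.
void transpose_along_x(std::vector<double>& send, std::vector<double>& recv,
                       int my_y_block, int my_z_block, int nblocks_x) {
  // All ranks with the same (y, z) block indices form one 1D communicator.
  MPI_Comm line_comm;
  int color = my_z_block * 1024 + my_y_block;   // unique per (y, z) line; assumes < 1024 blocks/dim
  MPI_Comm_split(MPI_COMM_WORLD, color, 0, &line_comm);

  // Each rank sends an equal share to every other rank on the same line.
  int count = static_cast<int>(send.size()) / nblocks_x;
  MPI_Alltoall(send.data(), count, MPI_DOUBLE,
               recv.data(), count, MPI_DOUBLE, line_comm);

  // ...local 1D FFTs along x would follow here...
  MPI_Comm_free(&line_comm);
}
```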

SLIDE 8

FFT in GENESIS (2 dimensional view)

GENESIS: identical domain decomposition between the two spaces (real and reciprocal). NAMD, Gromacs: different domain decompositions.

SLIDE 9

GENESIS performance on K

[Charts: (left) performance of GENESIS r1.0/r1.1/r1.2, NAMD 2.9, and Gromacs 5.0.7/5.1.2 for ApoA1 (92,224 atoms) on 128 cores; (right) ns/day vs. number of cores (128 to 65,536) for STMV (1,066,628 atoms) with GENESIS r1.0, GENESIS r1.1, and NAMD 2.9.]

SLIDE 10

Why midpoint cell method for GPU+CPU cluster?

  • 1. The main bottleneck of MD is the real-space non-bonded interactions for a small number of processors.
  • 2. The main bottleneck moves to the reciprocal-space non-bonded interactions as the number of processors increases.
  • 3. When the real-space non-bonded interactions are assigned to the GPU, the reciprocal-space interactions become even more critical.
  • 4. The midpoint cell method with volumetric-decomposition FFT could be a good solution for optimizing the reciprocal-space interactions, because it avoids communications before/after the FFT.
  • 5. In particular, the midpoint cell method with volumetric-decomposition FFT will be very useful for massively parallel supercomputers with GPUs.

SLIDE 11

Overview of CPU+GPU calculations

1. Computation-intensive work: GPU

  • Pairlist generation
  • Real-space non-bonded interactions

2. Communication-intensive or less computation-intensive work: CPU

  • Reciprocal-space non-bonded interactions with FFT
  • Bonded interactions
  • Exclusion list

3. Integration is performed on the CPU because of file I/O. (A sketch of the CPU/GPU overlap follows this list.)
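A minimal CUDA sketch of this CPU/GPU overlap, with hypothetical kernel and function names: the real-space kernel is launched asynchronously in a stream while the CPU evaluates the bonded and reciprocal-space (FFT) terms, and the host synchronizes before the force contributions are combined.

```cpp
#include <cuda_runtime.h>

// Hypothetical real-space kernel: one thread per atom, pairlist omitted.
__global__ void real_space_kernel(const float4* coords, float4* forces, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    // ...pairlist-based non-bonded force evaluation would go here...
    forces[i] = make_float4(0.f, 0.f, 0.f, 0.f);
  }
}

// Hypothetical CPU work done while the GPU kernel is running.
void cpu_bonded_and_reciprocal() { /* bonded terms + PME reciprocal (FFT) */ }

void compute_forces(const float4* d_coords, float4* d_forces, int n) {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  int threads = 128, blocks = (n + threads - 1) / threads;
  real_space_kernel<<<blocks, threads, 0, stream>>>(d_coords, d_forces, n);  // async on GPU

  cpu_bonded_and_reciprocal();      // CPU works concurrently with the GPU kernel

  cudaStreamSynchronize(stream);    // wait for the GPU before forces are combined
  cudaStreamDestroy(stream);
}
```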

SLIDE 12

Real space non-bonded interaction on GPU (1)

  • Non-excluded particle list scheme

The non-excluded particle list scheme is suitable for the GPU because the pairlist requires only a small amount of memory; a sketch of building such a list follows.
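A minimal CPU-side sketch under assumed data layouts (a real implementation would loop over neighboring cells rather than all pairs): bonded/excluded pairs are filtered out while the list is built, so the GPU kernel needs no per-pair exclusion table.

```cpp
#include <vector>
#include <utility>

struct Atom { double x, y, z; };

// True if j is in the exclusion list of i (bonded, 1-2/1-3 pairs, etc.).
bool is_excluded(int i, int j, const std::vector<std::vector<int>>& excl) {
  for (int k : excl[i]) if (k == j) return true;
  return false;
}

// Build the pairlist with excluded pairs already removed.
std::vector<std::pair<int,int>> build_nonexcluded_pairs(
    const std::vector<Atom>& atoms, const std::vector<std::vector<int>>& excl,
    double cutoff) {
  std::vector<std::pair<int,int>> pairs;
  double c2 = cutoff * cutoff;
  for (int i = 0; i < (int)atoms.size(); ++i)
    for (int j = i + 1; j < (int)atoms.size(); ++j) {
      double dx = atoms[i].x - atoms[j].x, dy = atoms[i].y - atoms[j].y,
             dz = atoms[i].z - atoms[j].z;
      if (dx*dx + dy*dy + dz*dz < c2 && !is_excluded(i, j, excl))
        pairs.push_back({i, j});   // only non-excluded pairs are stored
    }
  return pairs;
}
```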

SLIDE 13

Real space non-bonded interaction on GPU (2)

  • How to form blocks/threads for each cell pair

We form 32-thread groups for efficient GPU calculation by pairing 8 atoms of cell i with 4 atoms of cell j (8 × 4 = 32 pairs); see the CUDA sketch below.
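A minimal CUDA sketch of the lane-to-pair mapping under that assumption (names are hypothetical): each lane 0–31 of a warp selects one of 8 atoms from cell i and one of 4 atoms from cell j.

```cpp
// Map a warp lane (0..31) onto an (i-atom, j-atom) pair of the 8x4 tile.
__device__ void warp_tile_pairs(int lane, int first_i, int first_j,
                                int* atom_i, int* atom_j) {
  *atom_i = first_i + lane / 4;   // 8 consecutive atoms of cell i
  *atom_j = first_j + lane % 4;   // 4 consecutive atoms of cell j
}
```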

SLIDE 14

Overview of GPU+CPU calculations with multiple time step integrator

  • 1. In the multiple time step integrator, the reciprocal-space interaction is not evaluated at every step.
  • 2. On steps where the reciprocal-space interaction is not needed, a subset of the real-space interactions is assigned to the CPU to maximize performance (see the sketch after this list).
  • 3. Integration is performed on the CPU only.
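A minimal sketch of the multiple time step loop described above, with hypothetical stand-in routines: the slow reciprocal-space force is computed only every k-th step, and on the other steps the CPU takes over part of the real-space work that would otherwise leave it idle.

```cpp
// Hypothetical stand-ins for the actual routines:
static void launch_real_space_on_gpu() {}             // fast non-bonded forces (GPU)
static void compute_reciprocal_on_cpu() {}            // slow PME reciprocal forces (CPU)
static void compute_partial_real_space_on_cpu() {}    // CPU share of real-space work
static void compute_bonded_on_cpu() {}
static void wait_for_gpu() {}
static void integrate_on_cpu(bool /*slow_step*/) {}

void md_loop(int nsteps, int k_slow) {
  for (int step = 0; step < nsteps; ++step) {
    bool slow_step = (step % k_slow == 0);
    launch_real_space_on_gpu();              // every step, asynchronously on the GPU
    if (slow_step)
      compute_reciprocal_on_cpu();           // only every k_slow-th step
    else
      compute_partial_real_space_on_cpu();   // otherwise the CPU helps with real space
    compute_bonded_on_cpu();
    wait_for_gpu();
    integrate_on_cpu(slow_step);             // slow-force impulse applied on slow steps
  }
}
```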

SLIDE 15

Validation Tests (Energy drift)

Machine  | Precision | Integrator             | Energy drift
CPU      | Double    | Velocity Verlet        | 3.37×10⁻⁶
CPU      | Single    | Velocity Verlet        | 1.03×10⁻⁵
CPU      | Double    | RESPA (4 fs)           | 1.01×10⁻⁶
CPU      | Single    | RESPA (4 fs)           | 8.92×10⁻⁵
CPU+GPU  | Double    | Velocity Verlet        | 7.03×10⁻⁶
CPU+GPU  | Single    | Velocity Verlet        | 4.56×10⁻⁵
CPU+GPU  | Double    | RESPA (4 fs)           | 3.21×10⁻⁶
CPU+GPU  | Single    | RESPA (4 fs)           | 3.68×10⁻⁵
CPU+GPU  | Single    | Langevin RESPA (8 fs)  | 5.48×10⁻⁵
CPU+GPU  | Single    | Langevin RESPA (10 fs) | 1.63×10⁻⁶

  • Unit: kT/ns/degree of freedom (a sketch of this calculation follows the notes)
  • 2 fs time step with SHAKE/RATTLE/SETTLE constraints
  • For RESPA, the slow-force time step is given in parentheses
  • Our energy drift is similar to that of AMBER double- and hybrid-precision calculations.
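A minimal sketch, under the assumption that the drift is obtained as the least-squares slope of total energy versus time, normalized by kT and by the number of degrees of freedom (the function and exact protocol are illustrative, not necessarily the procedure used here):

```cpp
#include <vector>
#include <cstddef>

// Energy drift in units of kT per ns per degree of freedom:
// least-squares slope of E_total(t) divided by kT and by n_dof.
double energy_drift(const std::vector<double>& t_ns, const std::vector<double>& e_total,
                    double kT, std::size_t n_dof) {
  double n = static_cast<double>(t_ns.size());
  double st = 0, se = 0, stt = 0, ste = 0;
  for (std::size_t i = 0; i < t_ns.size(); ++i) {
    st += t_ns[i];  se += e_total[i];
    stt += t_ns[i] * t_ns[i];  ste += t_ns[i] * e_total[i];
  }
  double slope = (n * ste - st * se) / (n * stt - st * st);  // dE/dt by least squares
  return slope / (kT * static_cast<double>(n_dof));
}
```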
SLIDE 16

Benchmark condition

  • 1. MD program: GENESIS
  • 2. Systems: TIP3P water (22,000), STMV (1 million), Crowding system 1 (11.7 million), and Crowding system 2 (100 million)
  • 3. Cutoff: 12.0 Å
  • 4. PME grid sizes: 192³ (STMV), 384³ (Crowding system 1), and 768³ (Crowding system 2)
  • 5. Integrators: Velocity Verlet (VVER), RESPA (PME reciprocal every second step), and Langevin RESPA (PME reciprocal every fourth step)

SLIDE 17

Acceleration of real space interactions (1)

  • System: 9,250 TIP3P water molecules
  • Cutoff distance: 12 Å
  • Box size: 64 Å × 64 Å × 64 Å
  • 1. One GPU increases the speed 3-fold, and two GPUs 6-fold.
  • 2. By assigning work to the CPU as well as the GPU on steps where the FFT on the CPU is skipped, the speed-up reaches 7.7-fold.

SLIDE 18

Acceleration of real space interactions (2)

Benchmark system (11.7 million atoms, Cutoff = 12.0 Å)

SLIDE 19

Comparison between real space and reciprocal space interactions

  • 1. In both systems, the main bottleneck is the reciprocal-space interactions, irrespective of the number of processors.
  • 2. Therefore, it is important to optimize the reciprocal-space interactions when CPU+GPU clusters are used (the midpoint cell method could be the best choice).

[Charts: time spent in real-space vs. reciprocal-space interactions for STMV (1 million atoms) and the 11.7 million atom system.]

SLIDE 20

Comparison between TSUBAME and K

[Charts: STMV (1 million atoms) and the 11.7 million atom system.]

  • 1. K has better parallel efficiency for the reciprocal-space interactions than TSUBAME.
  • 2. Nevertheless, TSUBAME shows better overall performance than K because of the efficient evaluation of the real-space interactions on the GPU.

SLIDE 21

Benchmark on TSUBAME

[Charts: benchmark performance with VVER and RESPA for the 1 million and 11.7 million atom systems.]

SLIDE 22

Performance on TSUBAME for 100 million atom systems

Integrator      | Number of nodes | Time per step (ms) | Simulation time (ns/day)
VVER            | 512             | 126.09             | 1.37
VVER            | 1024            | 97.87              | 1.77
RESPA           | 512             | 109.80             | 1.57
RESPA           | 1024            | 70.77              | 2.44
Langevin RESPA  | 512             | 78.92              | 2.19
Langevin RESPA  | 1024            | 44.13              | 3.92

SLIDE 23

Summary

1. We implemented MD for GPU+CPU clusters.
2. We assign GPUs to the real-space non-bonded interactions, and CPUs to the reciprocal-space interactions, bonded interactions, and integration.
3. We introduced a non-excluded particle list scheme for efficient usage of GPU memory.
4. We also optimized the usage of GPUs and CPUs for multiple time step integrators.
5. Benchmark results on TSUBAME show very good strong/weak scalability for 1 million, 11.7 million, and 100 million atom systems.