Efficient Parallelization of Molecular Dynamics Simulations on Hybrid CPU/GPU Supercomputers
Jaewoon Jung (RIKEN, RIKEN AICS)
Yuji Sugita (RIKEN, RIKEN AICS, RIKEN QBiC, RIKEN iTHES)
Molecular Dynamics (MD)
- 1. Energy and forces are described by a classical molecular mechanics force field.
- 2. The state is updated according to the equations of motion.
Long MD trajectories are needed to obtain thermodynamic quantities of the target system.
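Equation of motion
$$\frac{d\mathbf{r}_i}{dt} = \frac{\mathbf{p}_i}{m_i}, \qquad \frac{d\mathbf{p}_i}{dt} = \mathbf{F}_i$$
Integration
$$\mathbf{r}_i(t+\Delta t) = \mathbf{r}_i(t) + \frac{\mathbf{p}_i(t)}{m_i}\,\Delta t, \qquad \mathbf{p}_i(t+\Delta t) = \mathbf{p}_i(t) + \mathbf{F}_i(t)\,\Delta t$$
Long time MD trajectory => Ensemble generation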
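The per-step update of positions and momenta can be sketched in Python as follows. This is a minimal illustration, not GENESIS code: the one-dimensional harmonic force and the names `harmonic_force`, `integrate_step`, and `run` are assumptions made for the example.

```python
def harmonic_force(r, k=1.0):
    """Hypothetical 1D force F = -k * r, used only for illustration."""
    return -k * r


def integrate_step(r, p, m, dt, force=harmonic_force):
    """Advance position r and momentum p by one time step dt.

    r(t + dt) = r(t) + p(t) / m * dt
    p(t + dt) = p(t) + F(r(t)) * dt
    """
    f = force(r)               # force evaluated at the current position
    r_new = r + p / m * dt     # position update from the current momentum
    p_new = p + f * dt         # momentum update from the current force
    return r_new, p_new


def run(r, p, m, dt, n_steps):
    """Generate a short trajectory as a list of (r, p) states."""
    trajectory = [(r, p)]
    for _ in range(n_steps):
        r, p = integrate_step(r, p, m, dt)
        trajectory.append((r, p))
    return trajectory
```

Repeating the step many times yields the trajectory from which ensemble averages are taken. Production runs use time-reversible schemes such as velocity Verlet rather than this plain explicit update.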
Potential energy in MD using PME
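$$E_{\text{total}} = \sum_{\text{bonds}} k_b (b - b_0)^2 + \sum_{\text{angles}} k_a (\theta - \theta_0)^2 + \sum_{\text{dihedrals}} V_n \left[1 + \cos(n\phi - \delta)\right] + \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \left\{ \epsilon_{ij} \left[ \left(\frac{r_{0,ij}}{r_{ij}}\right)^{12} - 2 \left(\frac{r_{0,ij}}{r_{ij}}\right)^{6} \right] + \frac{q_i q_j}{r_{ij}} \right\}$$
Bonds : O(N), angles : O(N), dihedrals : O(N), non-bonded : O(N^2)
The non-bonded term is the main bottleneck in MD.
With PME, the non-bonded term is split into a real space and a reciprocal space part (constant prefactors omitted):
$$\sum_{i<j,\; r_{ij}<R} \left\{ \epsilon_{ij} \left[ \left(\frac{r_{0,ij}}{r_{ij}}\right)^{12} - 2 \left(\frac{r_{0,ij}}{r_{ij}}\right)^{6} \right] + \frac{q_i q_j\, \mathrm{erfc}(\alpha r_{ij})}{r_{ij}} \right\} + \sum_{\mathbf{k} \neq 0} \frac{\exp(-k^2 / 4\alpha^2)}{k^2} \left| \mathrm{FFT}(Q)(\mathbf{k}) \right|^2$$
Real space : O(CN), reciprocal space : O(N log N)
N : total number of particles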
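A minimal Python sketch of the real space part of the PME energy is given below. It assumes the Lennard-Jones term is written in the R_min form and that the Coulomb term is screened by erfc(alpha * r). Physical constants are omitted, and the function names and parameter names (`eps`, `r0`, `alpha`, `cutoff`) are chosen for this example only.

```python
import math


def real_space_pair_energy(r, qi, qj, eps, r0, alpha):
    """Real space energy of one pair at distance r (illustrative units).

    Lennard-Jones term in the R_min form plus the erfc-screened Coulomb
    term; constants such as 1 / (4 * pi * eps0) are omitted.
    """
    ratio6 = (r0 / r) ** 6
    lj = eps * (ratio6 * ratio6 - 2.0 * ratio6)
    coulomb = qi * qj * math.erfc(alpha * r) / r
    return lj + coulomb


def real_space_energy(positions, charges, eps, r0, alpha, cutoff):
    """Sum pair energies over all pairs closer than the cutoff.

    This naive double loop is O(N^2); a cell list or pair list is what
    brings the real space part down to O(CN) in practice.
    """
    total = 0.0
    n = len(positions)
    for i in range(n - 1):
        for j in range(i + 1, n):
            r = math.dist(positions[i], positions[j])
            if r < cutoff:
                total += real_space_pair_energy(
                    r, charges[i], charges[j], eps, r0, alpha)
    return total
```

Because erfc decays quickly, pairs beyond the cutoff contribute little to the real space sum, and the smooth long-range remainder is what the FFT-based reciprocal space term handles.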
GENESIS MD software (Generalized Ensemble Simulation Systems)
- 1. Aims at developing efficient and accurate methodologies for free
energy calculations in biological systems.
- 2. Efficient parallelization : suitable for massively parallel supercomputers, in particular the K computer.
- 3. Applicability to large-scale simulations.
- 4. Algorithms coupled with different molecular models such as coarse-grained, all-atom, and hybrid QM/MM.
- 5. Generalized ensemble with Replica Exchange Molecular
Dynamics.
Ref : Jaewoon Jung et al. WIREs CMS, 5, 310-323 (2015)
Website : http://www.riken.jp/TMS2012/cbp/en/research/software/genesis/index.html
Parallelization of the real space interaction: Midpoint cell method (1)
Midpoint method : the interaction between two particles is decided from their midpoint position.
Midpoint cell method : the interaction between two particles is decided from the midpoint cell of the cells in which the two particles reside.
Small communication, efficient energy/force evaluations
Ref : J. Jung, T. Mori and Y. Sugita, JCC 35, 1064 (2014)
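The idea of deciding an interaction from the midpoint cell can be sketched in Python as below. This is an illustration under stated assumptions, not the GENESIS implementation: the box is treated as non-periodic, ties for odd index sums are broken by rounding down, and `cell_owner` is a hypothetical mapping from cell indices to MPI ranks.

```python
def cell_index(position, cell_size):
    """Map a 3D position to integer cell indices (non-periodic box)."""
    return tuple(int(x // cell_size) for x in position)


def midpoint_cell(cell_a, cell_b):
    """Cell halfway between two cells.

    When the sum of two indices is odd, the midpoint falls between two
    cells; rounding down is an arbitrary tie-break chosen for this sketch.
    """
    return tuple((a + b) // 2 for a, b in zip(cell_a, cell_b))


def owner_of_pair(pos_a, pos_b, cell_size, cell_owner):
    """Return the rank that would evaluate the a-b interaction.

    cell_owner is a hypothetical dict mapping cell indices to MPI ranks.
    """
    mid = midpoint_cell(cell_index(pos_a, cell_size),
                        cell_index(pos_b, cell_size))
    return cell_owner[mid]
```

Since the decision depends only on cell indices, every particle pair drawn from the same two cells goes to the same process, which avoids computing a midpoint position for each pair.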
Basic domain decomposition using the midpoint cell method (2)
- 1. Space is partitioned into fixed-size boxes with dimensions larger than the cutoff distance.
- 2. Only information from neighboring domains is needed to compute energies.
- 3. Communication is reduced by increasing the number of processes.
- 4. Parallelization is efficient and well suited to large systems on massively parallel supercomputers.
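The partitioning rule and the neighbor-only communication pattern can be illustrated with a short Python sketch. It assumes a cubic periodic box divided into equal domains, and the function names are made up for this example.

```python
from itertools import product


def domains_per_axis(box_length, cutoff):
    """Largest number of equal domains per axis whose edge is >= cutoff."""
    return max(1, int(box_length // cutoff))


def neighbor_domains(domain, n_per_axis):
    """Distinct neighbor domains (periodic) that a domain exchanges data with."""
    neighbors = set()
    for shift in product((-1, 0, 1), repeat=3):
        if shift == (0, 0, 0):
            continue
        nb = tuple((d + s) % n_per_axis for d, s in zip(domain, shift))
        if nb != domain:
            neighbors.add(nb)
    return sorted(neighbors)
```

For a 64 Å box and a 12 Å cutoff, this gives 5 domains per axis, and each domain has at most 26 neighbors no matter how many domains exist in total.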
Parallelization of FFT in GENESIS : Volumetric decomposition scheme in 3D FFT
- 1. More communications than existing FFT
- 2. MPI Alltoall communications only in one-dimensional space (existing : communications in two/three-dimensional space)
- 3. Reduced communication cost for a large number of processors
FFT in GENESIS (2 dimensional view)
GENESIS : identical domain decomposition in real and reciprocal space
NAMD, Gromacs : different domain decompositions in the two spaces
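The reason a distributed 3D FFT can be organized around one-dimensional communication is that the 3D transform separates into passes of 1D transforms along each axis. The Python sketch below demonstrates only this separability with a naive DFT on a small grid. It does not implement the volumetric decomposition scheme or any MPI communication, and the dict-based grid layout is an assumption of the example.

```python
import cmath
from itertools import product


def dft_1d(values):
    """Naive 1D discrete Fourier transform (O(n^2), for illustration)."""
    n = len(values)
    return [sum(values[x] * cmath.exp(-2j * cmath.pi * k * x / n)
                for x in range(n))
            for k in range(n)]


def dft_along_axis(grid, shape, axis):
    """Apply dft_1d to every line of a 3D grid along one axis.

    grid is a dict {(i, j, k): value}. In a distributed FFT, this is the
    step that needs all points of a line to be on one process.
    """
    out = {}
    other_axes = [a for a in range(3) if a != axis]
    for fixed in product(*(range(shape[a]) for a in other_axes)):
        def key(t):
            idx = [0, 0, 0]
            idx[axis] = t
            idx[other_axes[0]], idx[other_axes[1]] = fixed
            return tuple(idx)
        line = dft_1d([grid[key(t)] for t in range(shape[axis])])
        for t, value in enumerate(line):
            out[key(t)] = value
    return out


def dft_3d_by_axes(grid, shape):
    """3D DFT computed as three passes of 1D transforms."""
    for axis in range(3):
        grid = dft_along_axis(grid, shape, axis)
    return grid


def dft_3d_direct(grid, shape):
    """Reference 3D DFT from the definition, used to check the result."""
    out = {}
    for k in product(*(range(n) for n in shape)):
        out[k] = sum(
            v * cmath.exp(-2j * cmath.pi * sum(k[a] * x[a] / shape[a]
                                               for a in range(3)))
            for x, v in grid.items())
    return out
```

In a parallel setting, each pass along an axis needs data only from the processes that share a line along that axis, which is where a one-dimensional Alltoall exchange fits.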
GENESIS performance on K
[Figure, left panel: ApoA1 (92,224 atoms) on 128 cores, comparing GENESIS r1.0, GENESIS r1.1, GENESIS r1.2, NAMD 2.9, Gromacs 5.0.7, and Gromacs 5.1.2.]
[Figure, right panel: STMV (1,066,628 atoms), ns/day versus number of cores (128 to 65,536) for GENESIS r1.1, GENESIS r1.0, and NAMD 2.9.]
Why midpoint cell method for GPU+CPU cluster?
- 1. With a small number of processors, the main bottleneck of MD is the real space non-bonded interactions.
- 2. As the number of processors increases, the main bottleneck moves to the reciprocal space non-bonded interactions.
- 3. When GPUs are assigned to the real space non-bonded interactions, the reciprocal space interactions become more critical.
- 4. The midpoint cell method with volumetric decomposition FFT could be one of the good solutions for optimizing the reciprocal space interactions, because it avoids communication before/after the FFT.
- 5. In particular, the midpoint cell method with volumetric decomposition FFT will be very useful for massively parallel supercomputers with GPUs.
Overview of CPU+GPU calculations
1. Computation intensive work : GPU
- Pairlist
- Real space non-bonded interaction
2. Communication-intensive or less computation-intensive work : CPU
- Reciprocal space non-bonded interaction with FFT
- Bonded interaction
- Exclusion list
3. Integration is performed on CPU due to file I/O.
Real space non-bonded interaction on GPU (1)
- non-excluded particle list scheme
Non-excluded particle list scheme is suitable for GPU due to small amount of memory for pairlist.
Real space non-bonded interaction on GPU (2)
- How to make blocks/threads in each cell pair
We make blocks of 32 threads for efficient calculation on the GPU, with each block covering 8 atoms in cell i and 4 atoms in cell j.
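The 8 x 4 arrangement can be illustrated with a small Python sketch of the index arithmetic. The row-major mapping from thread index to atom pair and the function names are assumptions of this example. The source states only that 8 atoms in cell i and 4 atoms in cell j are grouped into 32 threads.

```python
TILE_I = 8   # atoms of cell i covered by one block
TILE_J = 4   # atoms of cell j covered by one block


def thread_to_pair(thread_id, tile_i_start=0, tile_j_start=0):
    """Map a thread index 0..31 to one (atom in cell i, atom in cell j) pair.

    The row-major layout is an assumption of this sketch.
    """
    if not 0 <= thread_id < TILE_I * TILE_J:
        raise ValueError("thread_id must be in 0..31")
    i_local, j_local = divmod(thread_id, TILE_J)
    return tile_i_start + i_local, tile_j_start + j_local


def tiles_for_cell_pair(n_atoms_i, n_atoms_j):
    """Starting offsets of the 8 x 4 tiles needed to cover a cell pair."""
    return [(i0, j0)
            for i0 in range(0, n_atoms_i, TILE_I)
            for j0 in range(0, n_atoms_j, TILE_J)]
```

With 32 threads per tile, one tile matches the warp size of NVIDIA GPUs, so all threads of a warp work on the same cell pair.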
Overview of GPU+CPU calculations with multiple time step integrator
- 1. In the multiple time step integrator, the reciprocal space interaction is not evaluated at every step.
- 2. At steps where the reciprocal space interaction is not needed, a subset of the real space interaction is assigned to the CPU to maximize performance.
- 3. Integration is performed on the CPU only.
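The division of work per step can be sketched as a simple schedule in Python. The function name, the task labels, and the `reciprocal_interval` parameter are illustrative assumptions, and the sketch describes only which device does what, not how the forces are computed.

```python
def assign_work(step, reciprocal_interval):
    """Return which device handles which task at a given MD step.

    reciprocal_interval: the reciprocal space (PME) force is evaluated
    every reciprocal_interval steps. Task names are illustrative labels.
    """
    work = {"GPU": ["pairlist/real-space non-bonded"],
            "CPU": ["bonded", "integration"]}
    if step % reciprocal_interval == 0:
        work["CPU"].insert(0, "reciprocal-space non-bonded (FFT)")
    else:
        # No FFT at this step, so the CPU takes a share of the
        # real space work instead of waiting for the GPU.
        work["CPU"].insert(0, "subset of real-space non-bonded")
    return work
```

For example, with `reciprocal_interval=2` the CPU alternates between the FFT-based reciprocal space force and a share of the real space force.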
Validation Tests (Energy drift)
| Machine | Precision | Integrator | Energy drift |
|---|---|---|---|
| CPU | Double | Velocity Verlet | 3.37×10^-6 |
| CPU | Single | Velocity Verlet | 1.03×10^-5 |
| CPU | Double | RESPA (4fs) | 1.01×10^-6 |
| CPU | Single | RESPA (4fs) | 8.92×10^-5 |
| CPU+GPU | Double | Velocity Verlet | 7.03×10^-6 |
| CPU+GPU | Single | Velocity Verlet | -4.56×10^-5 |
| CPU+GPU | Double | RESPA (4fs) | -3.21×10^-6 |
| CPU+GPU | Single | RESPA (4fs) | -3.68×10^-5 |
| CPU+GPU | Single | Langevin RESPA (8fs) | 5.48×10^-5 |
| CPU+GPU | Single | Langevin RESPA (10fs) | 1.63×10^-6 |
- Unit : kT/ns/degree of freedom
- 2fs time step with SHAKE/RATTLE/SETTLE constraints
- In the case of RESPA, the slow force time step is written in parentheses
- Our energy drift is similar to that of AMBER double and hybrid precision calculations.
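An energy drift in units of kT/ns/degree of freedom can be estimated from a total-energy time series as in the Python sketch below. The source does not state how the drift was obtained, so the least-squares linear fit used here is an assumption, as are the function and parameter names.

```python
def energy_drift(times_ns, energies, kT, n_dof):
    """Least-squares slope of total energy vs. time, in kT/ns/degree of freedom.

    times_ns: sample times in ns; energies and kT in the same energy unit.
    """
    n = len(times_ns)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(times_ns) / n
    mean_e = sum(energies) / n
    cov = sum((t - mean_t) * (e - mean_e)
              for t, e in zip(times_ns, energies))
    var = sum((t - mean_t) ** 2 for t in times_ns)
    slope = cov / var                  # energy unit per ns
    return slope / kT / n_dof
```

Dividing by kT and by the number of degrees of freedom makes the value comparable across systems of different size and temperature.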
Benchmark condition
- 1. MD program : GENESIS
- 2. Systems : TIP3P water (22,000), STMV (1 million), Crowding system 1 (11.7 million), and Crowding system 2 (100 million)
- 3. Cutoff : 12.0 Å
- 4. PME grid sizes : 192^3 (STMV), 384^3 (Crowding system 1), and 768^3 (Crowding system 2)
- 5. Integrators : Velocity Verlet (VVER), RESPA (PME reciprocal every second step), and Langevin RESPA (PME reciprocal every fourth step)
Acceleration of real space interactions (1)
- System : 9,250 TIP water molecules
- Cutoff distance : 12 Å
- Box size : 64 Å × 64 Å × 64 Å
- 1. One GPU increases the speed 3 times, and two GPUs 6 times.
- 2. By assigning real space work to the CPU as well as the GPU when the FFT on the CPU is skipped, we can increase the speed up to 7.7 times.
Acceleration of real space interactions (2)
Benchmark system (11.7 million atoms, Cutoff = 12.0 Å)
Comparison between real space and reciprocal space interactions
- 1. In both systems, the main bottleneck is the reciprocal space interactions, irrespective of the number of processors.
- 2. Therefore, it is important to optimize the reciprocal space interactions when CPU+GPU clusters are used (=> the midpoint cell method could be the best choice).
STMV (1 million atoms) 11.7 million atoms
Comparison between TSUBAME and K
STMV (1 million atoms) 11.7 million atoms
- 1. K has better parallel efficiency for the reciprocal space interactions than TSUBAME.
- 2. Irrespective of the performance of the reciprocal space interactions, TSUBAME shows better overall performance than K because the real space interactions are evaluated efficiently on GPUs.