The Need for Parallel I/O in Classical Molecular Dynamics
I.T. Todorov, I.J. Bush, A.R. Porter
Advanced Research Computing, CSE Department, STFC Daresbury Laboratory, Warrington WA4 4AD, UK

The DL_POLY_3 MD Package
- General purpose MD simulation package
- Written in modularised, free-format FORTRAN90 (+ MPI2) with rigorous code syntax (FORCHECK and NAGWare verified) and no external library dependencies
- Generic parallelisation (for short-ranged interactions) based on spatial domain decomposition (DD) and linked cells (LC)
- Long-ranged Coulomb interactions are handled by SPME (smooth particle mesh Ewald), employing 3D FFTs for k-space evaluation, which limits use to about 2k CPUs
- Maximum particle load ≈ 2.1×10⁹ atoms
- Full force field and molecular description, but no rigid-body description yet (as in DL_POLY_2)
- Free-format, semantically driven input reading with some fail-safe features (though not fully fool-proofed)
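The linked-cell (LC) scheme mentioned above can be illustrated with a minimal sketch. DL_POLY_3 implements this in Fortran90 over MPI domains; the Python below is only an illustration of the classic head/next linked-list binning, where atoms are sorted into cells of side at least the interaction cutoff so a neighbour search visits only the 27 surrounding cells rather than all N atoms. All names here are illustrative, not DL_POLY identifiers.

```python
def build_cell_list(positions, box, cutoff):
    """Bin atoms of a cubic box into linked cells (head/next lists)."""
    n_dim = max(1, int(box // cutoff))   # cells per dimension, side >= cutoff
    cell_side = box / n_dim
    head = {}                            # cell (ix,iy,iz) -> first atom index
    nxt = [-1] * len(positions)          # atom index -> next atom in its cell
    for i, (x, y, z) in enumerate(positions):
        c = (int(x / cell_side) % n_dim,
             int(y / cell_side) % n_dim,
             int(z / cell_side) % n_dim)
        nxt[i] = head.get(c, -1)         # push atom onto the cell's list
        head[c] = i
    return head, nxt, n_dim

def atoms_in_cell(head, nxt, cell):
    """Walk one cell's linked list, yielding the atom indices it holds."""
    i = head.get(cell, -1)
    while i != -1:
        yield i
        i = nxt[i]
```

For example, with a box of side 5.0 and cutoff 1.0, atoms at (0.1, 0.1, 0.1) and (0.2, 0.2, 0.2) land in cell (0, 0, 0) and are retrieved together without scanning the rest of the system.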
Domain Decomposition Parallelisation

[Diagram: the simulation cell split into domains A, B, C and D, one per MPI task]
Development

[Chart: published lines of code (in thousands) versus development time (years), for releases v01-Mar-03, v02-Mar-04, v03-Sep-04, v04-Mar-05, v05-Oct-05, v06-Mar-06, v07-Dec-06, v08-Sep-07 and v09-Jan-08; data labels 196, 885, 1420]
DL_POLY Licence Statistics

[Chart: DL_POLY licence counts over time]
Performance Weak Scaling on IBM p575

[Chart: speed gain versus processor count (up to 1024), with good and perfect parallelisation reference lines]
- Solid Ar: 32,000 atoms per CPU; max load 700,000 atoms per 1 GB/CPU
- NaCl: 27,000 ions per CPU; max load 220,000 ions per 1 GB/CPU
- SPC Water: 20,736 ions per CPU; max load 210,000 ions per 1 GB/CPU
System sizes reach 21, 28 and 33 million particles at the largest processor counts.
I/O Weak Scaling on IBM p575

[Chart: time (s) versus processor count for Solid Ar, NaCl and SPC Water; solid lines show start-up times, dashed lines show shut-down times]
Proof of Concept on IBM p575

300,763,000-ion NaCl with full SPME electrostatics evaluation on 1024 CPU cores: start-up time ≈ 1 hour, timestep time ≈ 68 seconds, FFT evaluation ≈ 55 seconds. In theory, the system could be seen by the eye, although you would need a very good microscope: the MD cell for this system is 2 μm along each side and, as the wavelength of visible light is about 0.5 μm, resolving it should be theoretically possible.
Importance of I/O - I
Types of MD studies most dependent on I/O
- Large length-scales (10⁹ particles), short time-scale, such as screw deformations
- Medium-to-large length-scales (10⁶–10⁸ particles), medium time-scale (ps–ns), such as radiation damage cascades
- Medium length-scale (10⁵–10⁶ particles), long time-scale (ns–µs), such as membrane and protein processes

Types of I/O (+: advantage, –: disadvantage):

              portable   human readable   loss of precision   size
  ASCII          +             +                  –             –
  Binary         –             –                  +             +
  XDR Binary     +             –                  +             +
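The trade-offs in the table above can be made concrete with a small sketch: writing a double as a fixed-width ASCII field costs more bytes and drops digits, while a native binary record is compact and exact, though tied to the machine's endianness (the portability gap that XDR closes). The 12-character, 6-decimal field width here is an illustrative choice, not DL_POLY's actual record format.

```python
import struct

def ascii_record(x):
    # Fixed-width text field: 12 chars, 6 decimals (illustrative width).
    return ("%12.6f" % x).encode("ascii")

def binary_record(x):
    # Native-endian IEEE-754 double: 8 bytes, lossless, but not portable
    # between big- and little-endian machines without a convention like XDR.
    return struct.pack("d", x)

x = 3.14159265358979
a, b = ascii_record(x), binary_record(x)
ascii_roundtrip = float(a)                     # precision lost past 6 decimals
binary_roundtrip = struct.unpack("d", b)[0]    # bit-exact round trip
```

The ASCII record is half again as large as the binary one and no longer equals the original value, which is exactly the size/precision penalty the table assigns it.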
Importance of I/O - II
Example: a 15-million-particle system simulated with 2048 MPI tasks
- MD time per timestep ~0.7 (2.7) seconds on Cray XT4 (BG/L)
- Configuration read ~100 seconds (once during the simulation)
- Configuration write ~600 seconds for 1.1 GB with the fastest I/O method: MPI-I/O on Cray XT4 (parallel direct access on BG/L)
- I/O in native binary is only 3 times faster and 3 times smaller

Some unpopular solutions:
- Saving only the important fragments of the configuration
- Saving only fragments that have moved more than a given distance between two consecutive dumps
- Distributed dump: separate configuration in separate files for each MPI task (as in CFD)
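The second of these "unpopular" ideas, dumping only atoms that have moved beyond a threshold since the last dump, amounts to a simple displacement filter. A minimal sketch (illustrative Python; the function name and interface are assumptions, not DL_POLY code):

```python
import math

def moved_fragments(prev, curr, threshold):
    """Indices of atoms displaced more than `threshold` since the last dump.

    Only these atoms would be written out, shrinking the dump at the cost
    of producing files that are no longer self-contained snapshots, which
    is why the approach is unpopular for restart/analysis purposes.
    """
    moved = []
    for i, (p, c) in enumerate(zip(prev, curr)):
        if math.dist(p, c) > threshold:   # Euclidean displacement
            moved.append(i)
    return moved
```

For instance, with one atom jittering by 0.05 and another translating by 1.0, a threshold of 0.5 selects only the second atom for the dump.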
ASCII I/O Solutions in DL_POLY_3
1. Serial direct access write (SDAW): only a single node, the master, writes everything; all other nodes communicate their information to the master in turn while the master completes writing a configuration of the time evolution.
2. Parallel direct access write (PDAW): all nodes write to the same file in an orderly, non-overlapping manner using Fortran direct-access files. However, the behaviour of this method is not defined by the Fortran standard and, in particular, we have experienced problems when the disk cache is not coherent with memory.
3. MPI-I/O write (MPIW): the same concept as PDAW, but performed using MPI-I/O rather than direct access.
4. Serial NetCDF write (SNCW): using NetCDF libraries for machine-independent data formats of array-based, scientific data (widely used by various scientific communities).
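Both PDAW and MPIW depend on fixed-length records: when every record has the same byte length, each MPI task can compute its own byte offset and write without coordinating with the others. A serial sketch of that offset arithmetic, using the 73-byte ASCII record size mentioned later in this talk (a plain Python file object stands in for Fortran direct access or MPI_File_write_at; this is not DL_POLY's actual writer):

```python
import io

REC_LEN = 73  # fixed-length ASCII record, as quoted for DL_POLY_3 dumps

def write_record(f, global_index, text):
    """Write one fixed-length record at its computed byte offset.

    With fixed-length records, a task owning atom `global_index` seeks to
    global_index * REC_LEN and writes independently; no overlap can occur
    because every record occupies a disjoint byte range.
    """
    rec = text.encode("ascii")[:REC_LEN - 1].ljust(REC_LEN - 1) + b"\n"
    f.seek(global_index * REC_LEN)
    f.write(rec)

buf = io.BytesIO()
# Simulate two "tasks" writing out of order; offsets keep records sorted.
write_record(buf, 1, "Cl-  0.500  0.500  0.500")
write_record(buf, 0, "Na+  0.000  0.000  0.000")
```

Even though record 1 is written first, the file ends up with record 0 in the first 73 bytes, which is why the resulting configuration file is globally ordered without any inter-task synchronisation.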
Overall I/O Performance

[Chart: write speed (MB/s) versus processor count (512–2048)]
- BG/L: SDAW, PDAW, MPIW
- BG/P: SDAW, PDAW
- P5-575: SDAW, PDAW, MPIW, SNCW
- XT3 SC: PDAW, MPIW (to 512)
- XT3 DC: PDAW, MPIW (to 1024)
- XT4 SC: SDAW, MPIW, SNCW
- XT4 DC: SDAW, MPIW, SNCW
DL_POLY_3 I/O Conclusions
- PDAW performs markedly better than SDAW or MPIW, where supported by the platform, for this particular message size (73 bytes of ASCII per message). Improvements of an order of magnitude can be obtained, even though the I/O itself does not scale especially well.
- MPIW, when optimised, performed consistently well on Cray XT3/4 architectures and much better than SDAW, but as seen on the Cray XT3 it was still not as fast as PDAW. MPIW on the Cray XT4 can achieve a further improvement by a factor of two, matching PDAW's performance on the Cray XT3, once the storage methodology (OST) is optimised for the dedicated I/O processing units.
- MPIW performs badly on IBM platforms; PDAW is not accessible on the Cray XT4; SDAW is extremely slow on the Cray XT3.
- While on the IBM P5-575 SNCW was only 1.5 times faster than SDAW on average, on the Cray XT4 it was 10 times faster, regardless of whether runs were in single- or dual-core mode. Despite these differences, SNCW performed 2.5 times faster on the IBM P5-575 than on the Cray XT4.
Benchmarking BG/L Jülich

[Chart: speed gain versus processor count (up to 16,000) for a 14.6 million particle Gd2Zr2O7 system; curves: perfect, MD step total, link cells, van der Waals, Ewald real, Ewald k-space]
Benchmarking XT4 UK

[Chart: speed gain versus processor count (up to 8,000) for a 14.6 million particle Gd2Zr2O7 system; curves: perfect, MD step total, link cells, van der Waals, Ewald real, Ewald k-space]
Benchmarking Main Platforms - I

[Chart: evaluations [s⁻¹] versus processor count (up to 2500) for a 3.8 million particle Gd2Zr2O7 system; platforms: Cray XT4 SC, Cray XT4 DC, Cray XT3 SC, Cray XT3 DC, 3 GHz Woodcrest DC, IBM p575, BG/L, BG/P]
Benchmarking Main Platforms - II

[Charts: four panels versus processor count (up to 2500): Link Cells, Ewald k-space, Van der Waals, Ewald real]
Acknowledgements

Thanks to:
- Ian Bush (DL/NAG) for optimisation support
- Andy Porter (DL) for NetCDF work and support
- Martyn Foster (NAG) and David Quigley (UoW) for Cray XT4 optimisation
- Lucian Anton (NAG) for the first draft of the MPI-I/O writing