

SLIDE 1

The Need for Parallel I/O in Classical Molecular Dynamics

I.T. Todorov, I.J. Bush, A.R. Porter

Advanced Research Computing, CSE Department, STFC Daresbury Laboratory, Warrington WA4 4AD, UK

SLIDE 2

The DL_POLY_3 MD Package

  • General purpose MD simulation package
  • Written in modularised, free-format FORTRAN90 (+MPI2) with rigorous code syntax (FORCHECK and NAGWare verified) and no external library dependencies
  • Generic parallelisation (for short-ranged interactions) based on spatial domain decomposition (DD) and linked cells (LC)
  • Long-ranged Coulomb interactions are handled by Smooth Particle Mesh Ewald (SPME), employing 3D FFTs for the k-space evaluation, which limits use to ~2k CPUs
  • Maximum particle load ≈ 2.1×10⁹ atoms
  • Full force field and molecular description, but no rigid body description yet (as in DL_POLY_2)
  • Free-format, semantically approached input reading with some fail-safe features (though not fully fool-proofed)

SLIDE 3

Domain Decomposition Parallelisation

[Figure: simulation cell split into spatial domains A, B, C and D, one per processor]
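In the DD/LC scheme each particle belongs to the MPI rank owning the spatial domain it falls in and, within that domain, to a link cell at least as wide as the short-range cutoff, so force evaluation only needs to search neighbouring cells. Below is a minimal, illustrative Fortran90 sketch of that assignment; the 2×2×2 processor grid, box size, cutoff and variable names are assumptions for the example, not DL_POLY_3's actual data structures.

```fortran
! Minimal sketch of spatial domain decomposition (DD) + linked cells (LC):
! which rank owns a particle, and which link cell it sits in.
! NOT the DL_POLY_3 routine; grid, box and cutoff below are illustrative.
program dd_lc_sketch
  implicit none
  integer, parameter :: npx = 2, npy = 2, npz = 2   ! assumed processor grid
  real, parameter    :: box = 50.0, rcut = 8.0      ! cubic box edge, cutoff
  integer :: ncell, idom, icell
  integer :: idx, idy, idz, icx, icy, icz
  real    :: sx, sy, sz                             ! fractional coordinates in [0,1)

  ! Link cells must be at least rcut wide, so each domain holds
  ! int(domain_width/rcut) cells along each direction.
  ncell = max(1, int((box/real(npx))/rcut))

  sx = 0.63; sy = 0.12; sz = 0.77                   ! example particle

  ! Owning domain = flattened index on the processor grid
  idx = min(int(sx*npx), npx-1)
  idy = min(int(sy*npy), npy-1)
  idz = min(int(sz*npz), npz-1)
  idom = idx + npx*(idy + npy*idz)

  ! Link cell within that domain, from the domain-local fractional coordinate
  icx = min(int((sx*npx - real(idx))*ncell), ncell-1)
  icy = min(int((sy*npy - real(idy))*ncell), ncell-1)
  icz = min(int((sz*npz - real(idz))*ncell), ncell-1)
  icell = icx + ncell*(icy + ncell*icz)

  print '(a,i0,a,i0)', 'owning rank = ', idom, ', local link cell = ', icell
end program dd_lc_sketch
```

With this layout the short-range force loop over a particle only has to visit its own link cell and the 26 surrounding ones, which is what makes the parallelisation generic for short-ranged interactions.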

SLIDE 4

Development

[Plot: published lines of code (×1000) versus development time (years) for releases v01-Mar-03, v02-Mar-04, v03-Sep-04, v04-Mar-05, v05-Oct-05, v06-Mar-06, v07-Dec-06, v08-Sep-07 and v09-Jan-08]

SLIDE 5

DL_POLY Licence Statistics

[Chart: licence statistics; values shown: 196, 885, 1420]

SLIDE 6

Performance Weak Scaling on IBM p575

[Plot: speed gain versus processor count (up to 1024), showing good to perfect parallelisation]

  • Solid Ar: 32,000 atoms per CPU, max load 700,000 atoms per 1 GB/CPU, 21 million atoms in total
  • NaCl: 27,000 ions per CPU, max load 220,000 ions per 1 GB/CPU, 28 million atoms in total
  • SPC Water: 20,736 ions per CPU, max load 210,000 ions per 1 GB/CPU, 33 million atoms in total

SLIDE 7

I/O Weak Scaling on IBM p575

[Plot: start-up (solid lines) and shut-down (dashed lines) I/O times in seconds versus processor count for Solid Ar, NaCl and SPC Water]

SLIDE 8

Proof of Concept on IBM p575

300,763,000-ion NaCl system with full SPME electrostatics evaluation on 1024 CPU cores
  • Start-up time ≈ 1 hour
  • Time per timestep ≈ 68 seconds
  • FFT evaluation ≈ 55 seconds
In theory the system could be seen by eye, although you would need a very good microscope: the MD cell for this system is about 2 μm along each side, while the wavelength of visible light is about 0.5 μm, so resolving it should be theoretically possible.

SLIDE 9

Importance of I/O - I

Types of MD studies most dependent on I/O:

  • Large length-scales (10⁹ particles), short time-scales, such as screw deformations
  • Medium-to-large length-scales (10⁶–10⁸ particles), medium time-scales (ps–ns), such as radiation damage cascades
  • Medium length-scales (10⁵–10⁶ particles), long time-scales (ns–µs), such as membrane and protein processes

Types of I/O, rated (+ advantage, – disadvantage) on portability, human readability, precision and size:

  • ASCII: + + – –
  • Binary: – – + +
  • XDR Binary: + – + +
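These trade-offs show up directly in Fortran I/O: a formatted (ASCII) record is portable and human readable but truncates values to the printed precision and is several times larger than the equivalent native binary record. A minimal sketch, assuming an illustrative three-coordinate record rather than the actual DL_POLY_3 file layout:

```fortran
! Sketch contrasting ASCII and native binary I/O for one atom record.
! The record layout below is illustrative, not the DL_POLY_3 format.
program ascii_vs_binary
  implicit none
  double precision :: xyz(3)
  xyz = (/ 1.23456789012345d0, -2.3456789d1, 3.4d-5 /)

  ! ASCII: portable and human readable, but precision is limited by the
  ! edit descriptor and the record is ~73 bytes (3 x 24-char fields + newline).
  open(10, file='atom.txt', form='formatted', status='replace')
  write(10, '(3e24.16)') xyz
  close(10)

  ! Native binary: not portable across architectures, not human readable,
  ! but bit-exact and only 3 x 8 = 24 bytes of payload.
  open(11, file='atom.bin', form='unformatted', access='stream', status='replace')
  write(11) xyz
  close(11)
end program ascii_vs_binary
```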

SLIDE 10

Importance of I/O - II

Example: a 15-million-particle system simulated with 2048 MPI tasks
  • MD time per timestep ~0.7 seconds on Cray XT4 (~2.7 seconds on BG/L)
  • Configuration read ~100 seconds (done once during the simulation)
  • Configuration write ~600 seconds for 1.1 GB with the fastest I/O method: MPI-I/O on Cray XT4 (parallel direct access on BG/L)
  • I/O in native binary is only 3 times faster and 3 times smaller

Some unpopular solutions:

  • Saving only the important fragments of the configuration
  • Saving only fragments that have moved more than a given distance between two consecutive dumps
  • Distributed dump: separate parts of the configuration written to separate files by each MPI task, as in CFD (see the sketch below)
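The distributed-dump idea is straightforward to sketch: each MPI task writes its own portion of the configuration to its own file, avoiding contention entirely, at the price of having to stitch the files back together afterwards and of tying the dump to the task count. A minimal illustration, with hypothetical file names and record format (not DL_POLY_3's):

```fortran
! Minimal sketch of a distributed dump: one file per MPI task.
! File names and record format are illustrative only.
program distributed_dump
  use mpi
  implicit none
  integer, parameter :: natms = 4            ! atoms owned by this task (assumed)
  integer :: ierr, rank, i
  double precision :: xyz(3, natms)
  character(len=32) :: fname

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

  xyz = dble(rank)                           ! dummy coordinates

  write(fname, '(a,i6.6)') 'CONFIG.', rank   ! e.g. CONFIG.000003
  open(20, file=trim(fname), form='formatted', status='replace')
  do i = 1, natms
     write(20, '(3e24.16)') xyz(:, i)
  end do
  close(20)

  call MPI_FINALIZE(ierr)
end program distributed_dump
```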

SLIDE 11

ASCII I/O Solutions in DL_POLY_3

  • 1. Serial direct access write (SDAW): only a single node, the master, writes everything; all other nodes communicate their information to the master in turn while it writes out the configuration of the time evolution.
  • 2. Parallel direct access write (PDAW): all nodes write to the same file in an orderly, non-overlapping manner using Fortran direct access files. However, the behaviour of this method is not defined by the Fortran standard, and in particular we have experienced problems when the disk cache is not coherent with memory.
  • 3. MPI-I/O write (MPIW): the same concept as PDAW, but performed using MPI-I/O rather than direct access (see the sketch after this list).
  • 4. Serial NetCDF write (SNCW): uses the NetCDF libraries for machine-independent formats of array-based scientific data (widely used by various scientific communities).
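Both PDAW and MPIW rest on the same idea: every atom record has a fixed length, so each task can compute the exact byte offset of its own records and write into the shared file without overlapping anyone else. A minimal MPI-I/O sketch of that idea follows; the 73-byte record, the equal per-task atom count and the variable names are illustrative assumptions, not the DL_POLY_3 routine.

```fortran
! Minimal sketch of the MPIW idea: every task writes fixed-length ASCII
! records into one shared file at offsets computed from its first global
! atom index.  Record layout and names are illustrative only.
program mpiw_sketch
  use mpi
  implicit none
  integer, parameter :: reclen = 73                 ! bytes per ASCII record (assumed)
  integer, parameter :: natms  = 1000               ! atoms owned by this task (assumed)
  integer :: ierr, rank, fh, i
  integer(kind=MPI_OFFSET_KIND) :: offset
  character(len=reclen) :: record
  character(len=reclen*natms) :: buffer
  double precision :: xyz(3)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

  ! Pack this task's atoms into one contiguous buffer of fixed-length records.
  do i = 1, natms
     xyz = dble(rank*natms + i)                     ! dummy coordinates
     write(record, '(3e24.16,a1)') xyz, new_line('a')
     buffer((i-1)*reclen+1 : i*reclen) = record
  end do

  ! Every record lands at a known byte offset, so writes never overlap.
  offset = int(rank, MPI_OFFSET_KIND) * natms * reclen

  call MPI_FILE_OPEN(MPI_COMM_WORLD, 'CONFIG', &
       MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr)
  call MPI_FILE_WRITE_AT_ALL(fh, offset, buffer, reclen*natms, &
       MPI_CHARACTER, MPI_STATUS_IGNORE, ierr)
  call MPI_FILE_CLOSE(fh, ierr)

  call MPI_FINALIZE(ierr)
end program mpiw_sketch
```

In a real code the offset would be derived from each atom's global index (plus any file header), since domains generally own different numbers of atoms; the collective MPI_FILE_WRITE_AT_ALL call lets the MPI library aggregate the requests.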

SLIDE 12

Overall I/O Performance

[Plot: write speed (MB/s) versus processor count (up to 2048) for the I/O methods available on each platform: BG/L (SDAW, PDAW, MPIW); BG/P (SDAW, PDAW); P5-575 (SDAW, PDAW, MPIW, SNCW); XT3 SC (PDAW, MPIW, to 512); XT3 DC (PDAW, MPIW, to 1024); XT4 SC (SDAW, MPIW, SNCW); XT4 DC (SDAW, MPIW, SNCW)]

SLIDE 13

DL_POLY_3 I/O Conclusions

  • PDAW performs markedly better than SDAW or MPIW, where supported by the platform, for this particular message size (73 bytes of ASCII per message). Improvements of an order of magnitude can be obtained, even though the I/O itself does not scale especially well.
  • MPIW was optimised and performed consistently well on Cray XT3/4 architectures, and much better than SDAW, but as seen on the Cray XT3 it was still not as fast as PDAW. On the Cray XT4, MPIW can achieve a further factor-of-two improvement, reaching performance similar to PDAW on the Cray XT3, once the storage methodology (OST) is optimised for the dedicated I/O processing units.
  • MPIW performs badly on IBM platforms. PDAW is not accessible on the Cray XT4. SDAW is extremely slow on the Cray XT3.
  • While on the IBM P5-575 SNCW was only 1.5 times faster than SDAW on average, on the Cray XT4 it was 10 times faster, regardless of whether runs were in single- or dual-core mode. Despite these differences, SNCW performed 2.5 times faster on the IBM P5-575 than on the Cray XT4.

SLIDE 14

Benchmarking BG/L Jülich

[Plot: speed gain versus processor count (up to ~16,000) for a 14.6 million particle Gd2Zr2O7 system; curves for MD step total, link cells, van der Waals, Ewald real and Ewald k-space, against perfect scaling]

SLIDE 15

Benchmarking XT4 UK

[Plot: speed gain versus processor count (up to ~8,000) for the 14.6 million particle Gd2Zr2O7 system; curves for MD step total, link cells, van der Waals, Ewald real and Ewald k-space, against perfect scaling]

SLIDE 16

Benchmarking Main Platforms - I

[Plot: evaluations per second (s⁻¹) versus processor count (up to 2500) for a 3.8 million particle Gd2Zr2O7 system on CRAY XT4 SC, CRAY XT4 DC, CRAY XT3 SC, CRAY XT3 DC, 3 GHz Woodcrest DC, IBM p575, BG/L and BG/P]

SLIDE 17

Benchmarking Main Platforms - II

[Four-panel plot versus processor count (up to 2500): LINK CELLS, EWALD K-SPACE, VAN DER WAALS and EWALD REAL]

SLIDE 18

Acknowledgements

Thanks to

  • Ian Bush (DL/NAG) for optimisation support
  • Andy Porter (DL) for NetCDF work and support
  • Martyn Foster (NAG) and David Quigley (UoW) for Cray XT4 optimisation
  • Lucian Anton (NAG) for the first draft of the MPI-I/O writing routine

http://www.ccp5.ac.uk/DL_POLY/