SLIDE 1

Parallel IO concepts (MPI-IO and pHDF5)

Saclay, April 2018

Parallel filesystems and parallel IO libraries PATC@MdS

Matthieu Haefele

SLIDE 2

Outline: Day 1

Morning:
- HDF5 in the context of Input/Output (IO)
- HDF5 Application Programming Interface (API)
- Playing with Dataspace
- Hands-on session

Afternoon:
- Basics on HPC, MPI and parallel file systems
- Parallel IO with POSIX, MPI-IO and Parallel HDF5
- Hands-on session (pHDF5)

SLIDE 3

HPC machine architecture

An HPC machine is composed of processing elements, or cores, which:
- Can access a central memory
- Can communicate through a high performance network
- Are connected to a high performance storage system

Until now, two major families of HPC machines existed:
- Shared memory machines
- Distributed memory machines

New architectures like GPGPUs, MIC, FPGAs, ... are not covered here.

SLIDE 4

Distributed memory machines

[Figure: a distributed memory machine: nodes, each with cores, memory and its own operating system, connected through a high performance network to I/O nodes and hard drives]

SLIDE 5

MPI: Message Passing Interface

MPI is an Application Programming Interface:
- It defines a standard for developing parallel applications
- Several implementations exist (Open MPI, MPICH, IBM, ParTec, ...)

It is composed of:
- A parallel execution environment
- A library to link the application with
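A minimal sketch of what this looks like in practice (Fortran, names illustrative): the library side is the use mpi module and the MPI_* calls, the execution environment side is the launcher used to start the processes.

program hello_mpi
  use mpi
  implicit none
  integer :: ierr, rank, nprocs

  call MPI_INIT(ierr)                               ! join the parallel execution environment
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)    ! id of this process
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)  ! total number of processes
  print '(A,I0,A,I0)', 'Hello from rank ', rank, ' of ', nprocs
  call MPI_FINALIZE(ierr)
end program hello_mpi

The same binary is started on every process, for instance with mpirun -n 4 ./hello_mpi.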

SLIDE 6

MPI communications

Four classes of communications:
- Collective: all processes belonging to a same MPI communicator communicate together according to a defined pattern (scatter, gather, reduce, ...)
- Point-to-Point: one process sends a message to another one (send, receive)

For both collective and point-to-point communications, blocking and non-blocking functions are available (see the sketch below).
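A hedged sketch of both classes in Fortran (illustrative values, run with at least two processes): a blocking point-to-point exchange between ranks 0 and 1, then a collective reduction over MPI_COMM_WORLD.

program mpi_comms
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, status(MPI_STATUS_SIZE)
  integer :: token, local_val, global_sum

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  ! Point-to-point (blocking): rank 0 sends one integer to rank 1
  if (rank == 0) then
     token = 42
     call MPI_SEND(token, 1, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     call MPI_RECV(token, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, status, ierr)
  end if

  ! Collective: every rank of the communicator contributes to a sum on rank 0
  local_val = rank
  call MPI_REDUCE(local_val, global_sum, 1, MPI_INTEGER, MPI_SUM, 0, MPI_COMM_WORLD, ierr)

  call MPI_FINALIZE(ierr)
end program mpi_comms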

SLIDE 7

inode pointer structure (ext3)

[Figure: an inode holding info fields and pointers to direct blocks, indirect blocks and double indirect blocks]

SLIDE 8

“Serial” file system

Meta-data, block addresses and file blocks are stored on a single logical drive with a “serial” file system.

[Figure: meta-data and file blocks on a single logical drive]

SLIDE 9

Parallel file system architecture

[Figure: parallel file system architecture: I/O nodes / meta-data server holding the meta-data and direct/indirect blocks, object storage targets, connected by a dedicated network]

- Meta-data and file blocks are stored on separate devices
- Several devices are used
- Bandwidth is aggregated
- A file is striped across different object storage targets
SLIDE 10

Parallel file system usage

[Figure: the application accesses the parallel file system through an FS client, which talks to the I/O nodes / meta-data server and the object storage targets]

The file system client gives the application the view of a “serial” file system.

SLIDE 11

The software stack

[Figure: the software stack: application data structures, I/O library, standard library and MPI-IO on top of the operating system, exposing an object interface and a streaming interface]

SLIDE 12

Let us put everything together

[Figure: the complete picture: several compute nodes, each running the software stack (data structures, I/O library, standard library, MPI-IO) on top of an FS client within the MPI execution environment, all connected to the I/O node holding the meta-data and the direct/indirect blocks]
SLIDE 13

Test case to illustrate strategies

[Figure: the S × S array split into blocks of size S/px × S/py]

Let us consider:
- A 2D structured array of size S × S
- A block-block distribution
- P = px × py cores
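As a sketch of this decomposition (the names local_nx, local_ny, proc_x, proc_y are assumptions chosen to match the pHDF5 example later in the deck; S is assumed divisible by px and py), each rank can derive its block size and its offset in the global array from its rank alone:

program block_decomposition
  use mpi
  implicit none
  integer, parameter :: S = 1024, px = 2, py = 2     ! illustrative values
  integer :: ierr, rank, proc_x, proc_y
  integer :: local_nx, local_ny, start_x, start_y

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

  ! Cartesian coordinates of this rank in the px x py process grid
  proc_x = mod(rank, px)
  proc_y = rank / px

  ! Local block size (S assumed divisible by px and py) and its offset
  ! in the global S x S array
  local_nx = S / px
  local_ny = S / py
  start_x  = proc_x * local_nx
  start_y  = proc_y * local_ny

  print '(5(A,I0),A)', 'rank ', rank, ': block ', local_nx, ' x ', local_ny, &
        ' at (', start_x, ', ', start_y, ')'
  call MPI_FINALIZE(ierr)
end program block_decomposition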

SLIDE 14

Multiple files

- Each MPI process writes its own file
- A single distributed data set is spread out over different files
- The way it is spread out depends on the number of MPI processes
⇒ More work at post-processing level
⇒ May lead to a huge number of files (forbidden)
⇒ Very easy to implement (see the sketch below)
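A minimal sketch of this strategy (file and variable names are illustrative): every rank dumps its local block to its own binary file with plain Fortran IO.

program multiple_files
  use mpi
  implicit none
  integer, parameter :: local_nx = 4, local_ny = 4    ! illustrative local block size
  real :: tab(local_nx, local_ny)
  character(len=32) :: fname
  integer :: ierr, rank, unit

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

  tab = real(rank)                                    ! fill the local block

  write(fname, '(A,I4.4,A)') 'data_', rank, '.bin'    ! one file name per rank
  open(newunit=unit, file=fname, form='unformatted', access='stream', status='replace')
  write(unit) tab                                     ! plain binary dump of the local block
  close(unit)

  call MPI_FINALIZE(ierr)
end program multiple_files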

SLIDE 15

Multiple files

[Figure: each MPI process writes its own file with POSIX IO operations]

SLIDE 16

MPI gather + single file

A collective MPI call is first performed to gather the data on one MPI process. Then, this process writes a single file (see the sketch below).

⇒ The memory of a single node can be a limitation
⇒ Single resulting file
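A minimal sketch of this strategy, simplified to a 1D array so that the gathered buffer is already in file order (names are illustrative): rank 0 collects all local chunks with MPI_GATHER and is the only writer.

program gather_single_file
  use mpi
  implicit none
  integer, parameter :: local_n = 8                   ! illustrative chunk size
  real :: local(local_n)
  real, allocatable :: global(:)
  integer :: ierr, rank, nprocs, unit

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  local = real(rank)
  if (rank == 0) then
     allocate(global(local_n * nprocs))               ! the whole array exists on rank 0 only
  else
     allocate(global(1))                              ! dummy buffer, not used by MPI_GATHER
  end if

  call MPI_GATHER(local, local_n, MPI_REAL, global, local_n, MPI_REAL, &
                  0, MPI_COMM_WORLD, ierr)

  if (rank == 0) then                                 ! single writer
     open(newunit=unit, file='res.bin', form='unformatted', access='stream', status='replace')
     write(unit) global
     close(unit)
  end if

  call MPI_FINALIZE(ierr)
end program gather_single_file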

SLIDE 17

MPI Gather + single file

[Figure: a gather operation collects the data on one process, followed by a single POSIX IO operation]

SLIDE 18

MPI-IO concept

- IO part of the MPI specification
- Provides a set of read/write methods
- Allows one to describe how the data is distributed among the processes (thanks to MPI derived types)
- The MPI implementation takes care of actually writing a single contiguous file on disk from the distributed data
- The result is identical to the gather + POSIX file: MPI-IO performs the gather operation within the MPI implementation (sketched below)
⇒ No more memory limitation
⇒ Single resulting file
⇒ Definition of MPI derived types
⇒ Performance linked to the MPI library
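A minimal sketch of the idea on a 1D array (names illustrative, 4-byte default reals assumed): all ranks open the same file and each writes its chunk at an explicit offset with a collective call; no gather buffer is needed.

program mpiio_single_file
  use mpi
  implicit none
  integer, parameter :: local_n = 8                   ! illustrative chunk size
  real :: local(local_n)
  integer :: ierr, rank, fh
  integer(kind=MPI_OFFSET_KIND) :: offset

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  local = real(rank)

  call MPI_FILE_OPEN(MPI_COMM_WORLD, 'res.bin', &
                     MPI_MODE_CREATE + MPI_MODE_WRONLY, MPI_INFO_NULL, fh, ierr)

  ! Byte offset of this rank's chunk in the shared file (4-byte default reals assumed)
  offset = int(rank, MPI_OFFSET_KIND) * local_n * 4

  ! Collective write: every rank participates, the MPI library optimises the access
  call MPI_FILE_WRITE_AT_ALL(fh, offset, local, local_n, MPI_REAL, MPI_STATUS_IGNORE, ierr)

  call MPI_FILE_CLOSE(fh, ierr)
  call MPI_FINALIZE(ierr)
end program mpiio_single_file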

SLIDE 19

MPI-IO API

The MPI-IO routines are organised along three axes: positioning (explicit offsets, individual file pointers, shared file pointers), coordination (non-collective, collective) and synchronism (blocking, non-blocking & split call).

Explicit offsets:
- Non-collective, blocking: MPI_FILE_READ_AT, MPI_FILE_WRITE_AT
- Non-collective, non-blocking: MPI_FILE_IREAD_AT, MPI_FILE_IWRITE_AT
- Collective, blocking: MPI_FILE_READ_AT_ALL, MPI_FILE_WRITE_AT_ALL
- Collective, split call: MPI_FILE_READ_AT_ALL_BEGIN/END, MPI_FILE_WRITE_AT_ALL_BEGIN/END

Individual file pointers:
- Non-collective, blocking: MPI_FILE_READ, MPI_FILE_WRITE
- Non-collective, non-blocking: MPI_FILE_IREAD, MPI_FILE_IWRITE
- Collective, blocking: MPI_FILE_READ_ALL, MPI_FILE_WRITE_ALL
- Collective, split call: MPI_FILE_READ_ALL_BEGIN/END, MPI_FILE_WRITE_ALL_BEGIN/END

Shared file pointers:
- Non-collective, blocking: MPI_FILE_READ_SHARED, MPI_FILE_WRITE_SHARED
- Non-collective, non-blocking: MPI_FILE_IREAD_SHARED, MPI_FILE_IWRITE_SHARED
- Collective, blocking: MPI_FILE_READ_ORDERED, MPI_FILE_WRITE_ORDERED
- Collective, split call: MPI_FILE_READ_ORDERED_BEGIN/END, MPI_FILE_WRITE_ORDERED_BEGIN/END

Access levels: Level 0, Level 1, Level 2, Level 3 (illustrated on a later slide)

SLIDE 20

MPI-IO

SLIDE 21

MPI-IO level illustration

[Figure: four MPI processes (p0 to p3) and their accesses to the file space for MPI-IO levels 0 to 3]
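As a hedged illustration of the level 3 style of access for the 2D block-block test case (collective calls, with the non-contiguous pattern described once as a file view; the variable names follow the earlier slides and the px × py process grid is an assumption, run with exactly px*py processes):

program mpiio_level3
  use mpi
  implicit none
  integer, parameter :: S = 16, px = 2, py = 2        ! illustrative sizes
  integer, parameter :: local_nx = S/px, local_ny = S/py
  real :: tab(local_nx, local_ny)
  integer :: ierr, rank, fh, filetype
  integer :: sizes(2), subsizes(2), starts(2)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  tab = real(rank)

  sizes    = (/ S, S /)                               ! global array
  subsizes = (/ local_nx, local_ny /)                 ! local block
  starts   = (/ mod(rank, px)*local_nx, (rank/px)*local_ny /)   ! block origin

  ! Describe the distribution once, as a derived datatype used as the file view
  call MPI_TYPE_CREATE_SUBARRAY(2, sizes, subsizes, starts, &
                                MPI_ORDER_FORTRAN, MPI_REAL, filetype, ierr)
  call MPI_TYPE_COMMIT(filetype, ierr)

  call MPI_FILE_OPEN(MPI_COMM_WORLD, 'res.bin', &
                     MPI_MODE_CREATE + MPI_MODE_WRONLY, MPI_INFO_NULL, fh, ierr)
  call MPI_FILE_SET_VIEW(fh, 0_MPI_OFFSET_KIND, MPI_REAL, filetype, 'native', MPI_INFO_NULL, ierr)

  ! Single collective call: the library builds one contiguous file from the distributed blocks
  call MPI_FILE_WRITE_ALL(fh, tab, local_nx*local_ny, MPI_REAL, MPI_STATUS_IGNORE, ierr)

  call MPI_FILE_CLOSE(fh, ierr)
  call MPI_TYPE_FREE(filetype, ierr)
  call MPI_FINALIZE(ierr)
end program mpiio_level3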

SLIDE 22

Parallel HDF5

- Built on top of MPI-IO
- Must follow some restrictions to enable the underlying collective calls of MPI-IO
- From the programming point of view, only a few parameters have to be given to the HDF5 library
- The data distribution is described thanks to HDF5 hyperslabs
- The result is a single portable HDF5 file
⇒ Easy to develop
⇒ Single portable file
⇒ Maybe some performance issues

SLIDE 23

Parallel HDF5

HDF5 file

SLIDE 24

Parallel HDF5 implementation

INTEGER(HSIZE_T) :: array_size(2), array_subsize(2), array_start(2)
INTEGER(HID_T)   :: plist_id1, plist_id2, file_id, filespace, dset_id, memspace

array_size(1)    = S
array_size(2)    = S
array_subsize(1) = local_nx
array_subsize(2) = local_ny
array_start(1)   = proc_x * array_subsize(1)
array_start(2)   = proc_y * array_subsize(2)

! Allocate and fill the tab array

CALL h5open_f(ierr)
CALL h5pcreate_f(H5P_FILE_ACCESS_F, plist_id1, ierr)
CALL h5pset_fapl_mpio_f(plist_id1, MPI_COMM_WORLD, MPI_INFO_NULL, ierr)
CALL h5fcreate_f('res.h5', H5F_ACC_TRUNC_F, file_id, ierr, access_prp = plist_id1)

! Set collective call
CALL h5pcreate_f(H5P_DATASET_XFER_F, plist_id2, ierr)
CALL h5pset_dxpl_mpio_f(plist_id2, H5FD_MPIO_COLLECTIVE_F, ierr)

CALL h5screate_simple_f(2, array_size, filespace, ierr)
CALL h5screate_simple_f(2, array_subsize, memspace, ierr)
CALL h5dcreate_f(file_id, 'pi_array', H5T_NATIVE_REAL, filespace, dset_id, ierr)
CALL h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, array_start, array_subsize, ierr)
CALL h5dwrite_f(dset_id, H5T_NATIVE_REAL, tab, array_subsize, ierr, memspace, filespace, plist_id2)

! Close HDF5 objects

SLIDE 25

IO technology comparison

Scientific results / diagnostics: multiple POSIX files in ASCII or binary, MPI-IO, pHDF5, XIOS
Restart files: SIONlib, ADIOS

SLIDE 26

IO technology comparison

                         POSIX       MPI-IO      pHDF5         SIONlib       ADIOS        XIOS          FTI
Abstraction              Stream      Stream      Stream        Object        Object       Object        Object
Purpose                  General     General     General       General       General      General       Specific
Hardware                 No          No          No            No            No           Yes           Yes
API                      Imperative  Imperative  Imperative    Imperative    Declarative  Decl./Imp.    Declarative
Format                   Binary      Binary      HDF5          NetCDF/HDF5   Binary       NetCDF/HDF5   Binary
Single/multi file        Multi       Single      Single/Multi  Single        Multi++      Single/Multi  N.A.
Online post-processing   No          No          No            Yes           No           Yes           No

SLIDE 27

PDI: the Approach

PDI: the Parallel Data Interface

PDI only provides a declarative API (no behavior):
- PDI_expose(name, data): makes data available for output
- PDI_import(name, data): imports data into the application

Behavior is provided by existing IO libraries:
- a plug-in system, event-based
- HDF5, FTI (available), SION, XIOS, IME (planned), ...

Behavior is selected through a configuration file:
- which plug-ins are used, how, for which data and when
- a simple YAML file format


SLIDE 28

PDI: the Architecture

[Figure: application codes call the PDI API (PDI_expose / PDI_import); PDI, driven by the YAML configuration file, dispatches to plug-ins such as HDF5, FTI and others]


SLIDE 29

Hands-on parallel HDF5 objective 1/2

[Figure: the distributed data held by MPI ranks 0 to 3]

SLIDE 30

Hands-on parallel HDF5 1/2

1. git clone https://github.com/mathaefele/parallel_HDF5_hands-on.git
2. Parallel multi-files: all MPI ranks write their whole memory in separate files (provided in phdf5-1)
3. Serialized: each rank opens the file and writes its data one after the other
   3.1 Data written as separate datasets
   3.2 Data written in the same dataset
4. Parallel single file: specific HDF5 parameters given at open and write time to let MPI-IO manage the concurrent file access

SLIDE 31

Hands-on parallel HDF5 objective 2/2

[Figure: MPI ranks 0 to 3 writing their data into a single file]

SLIDE 32

Hands-on parallel HDF5 2/2

Same exercise as the previous one, but now each rank has ghost cells that should not be written.

1. Parallel multi-files: all MPI ranks write their whole memory in separate files (provided in phdf5-4)
2. Parallel single file: specific HDF5 parameters given at open and write time to let MPI-IO manage the concurrent file access and write only the relevant portion of memory (see the sketch below)
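A hedged sketch of the memory-side selection for this second exercise (a fragment meant to replace the memspace creation in the pHDF5 example above; the ghost layer width g and all names are assumptions): a hyperslab is selected in the memory dataspace so that only the interior of the local array, without ghost cells, is written.

INTEGER, PARAMETER :: g = 1                          ! assumed ghost layer width
INTEGER(HSIZE_T)   :: mem_size(2), mem_start(2), mem_count(2)

mem_size(1)  = local_nx + 2*g                        ! allocated size of tab, halo included
mem_size(2)  = local_ny + 2*g
mem_start(1) = g                                     ! skip the ghost layer
mem_start(2) = g
mem_count(1) = local_nx                              ! interior actually written
mem_count(2) = local_ny

! The memory dataspace matches the allocated array, but only its interior is selected
CALL h5screate_simple_f(2, mem_size, memspace, ierr)
CALL h5sselect_hyperslab_f(memspace, H5S_SELECT_SET_F, mem_start, mem_count, ierr)
! The file-side hyperslab selection and the h5dwrite_f call stay as in the
! earlier example: the interior of tab lands in this rank's block of the dataset.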