Parallel IO concepts (MPI-IO and pHDF5)
Matthieu Haefele, Saclay, April 2018
Parallel filesystems and parallel IO libraries, PATC@MdS
Outline Day 1
Morning:
- HDF5 in the context of Input/Output (IO)
- HDF5 Application Programming Interface (API)
- Playing with Dataspaces
- Hands-on session
Afternoon:
- Basics of HPC, MPI and parallel file systems
- Parallel IO with POSIX, MPI-IO and Parallel HDF5
- Hands-on session (pHDF5)
HPC machine architecture
An HPC machine is composed of processing elements, or cores, which:
- can access a central memory
- can communicate through a high-performance network
- are connected to a high-performance storage system
Until now, two major families of HPC machines existed:
- Shared memory machines
- Distributed memory machines
New architectures like GPGPUs, MICs, FPGAs, ... are not covered here.
Distributed memory machines
[Diagram: compute nodes, each with cores, memory and its own operating system, connected through a high-performance network to I/O nodes and hard drives]
MPI: Message Passing Interface
MPI is an Application Programming Interface:
- It defines a standard for developing parallel applications
- Several implementations exist (Open MPI, MPICH, IBM, ParTec, ...)
It is composed of:
- A parallel execution environment
- A library to link the application with
MPI communications
Four classes of communications (collective or point-to-point, each blocking or non-blocking):
- Collective: all processes belonging to the same MPI communicator communicate together according to a defined pattern (scatter, gather, reduce, ...)
- Point-to-point: one process sends a message to another one (send, receive)
For both collective and point-to-point, blocking and non-blocking functions are available.
inode pointer structure (ext3)
[Diagram: an ext3 inode holding file information plus pointers to direct blocks, indirect blocks and double indirect blocks]
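The reach of this pointer scheme can be made concrete with a little arithmetic. The sketch below assumes 4 KiB blocks and 4-byte block addresses; these are illustrative values, the actual figures depend on how the filesystem was formatted.

```python
# Illustrative capacity arithmetic for an ext3-style inode
# (assumed parameters: 4 KiB blocks, 4-byte block addresses,
#  12 direct pointers; real values vary with the filesystem).
BLOCK_SIZE = 4096
ADDR_SIZE = 4
DIRECT_PTRS = 12

ptrs_per_block = BLOCK_SIZE // ADDR_SIZE  # 1024 addresses fit in one block

direct_bytes = DIRECT_PTRS * BLOCK_SIZE            # reachable via direct blocks
single_indirect_bytes = ptrs_per_block * BLOCK_SIZE       # via the indirect block
double_indirect_bytes = ptrs_per_block**2 * BLOCK_SIZE    # via the double indirect block

max_file_bytes = direct_bytes + single_indirect_bytes + double_indirect_bytes

print(direct_bytes)           # 49152 (48 KiB)
print(single_indirect_bytes)  # 4194304 (4 MiB)
print(double_indirect_bytes)  # 4294967296 (4 GiB)
```

Each extra level of indirection multiplies the addressable size by the number of addresses per block, which is why a handful of pointers in the inode suffices for multi-gigabyte files.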
“Serial” file system
Meta-data, block addresses and file blocks are all stored on a single logical drive with a “serial” file system.
[Diagram: a single logical drive holding the meta-data and the file blocks]
Parallel file system architecture
[Diagram: I/O nodes / meta-data server holding the meta-data and direct/indirect blocks, plus Object Storage Targets, all reachable over a dedicated network]
- Meta-data and file blocks are stored on separate devices
- Several devices are used
- Bandwidth is aggregated
- A file is striped across different object storage targets
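The striping idea can be sketched as the arithmetic that maps a byte offset to a storage target. The stripe size, stripe count and round-robin placement below are illustrative assumptions, not the parameters of any particular filesystem:

```python
# Sketch of how a striped file maps byte offsets to Object Storage
# Targets (OSTs). Illustrative parameters: 1 MiB stripes placed
# round-robin over 4 OSTs.
STRIPE_SIZE = 1 << 20   # 1 MiB
STRIPE_COUNT = 4        # number of OSTs the file is striped over

def ost_for_offset(offset):
    """Return the index of the OST holding the byte at `offset`."""
    stripe_index = offset // STRIPE_SIZE   # which stripe the byte falls in
    return stripe_index % STRIPE_COUNT     # round-robin over the OSTs

# Consecutive 1 MiB chunks land on consecutive OSTs, so four clients
# reading four consecutive chunks hit four different devices at once:
print([ost_for_offset(i * STRIPE_SIZE) for i in range(6)])  # [0, 1, 2, 3, 0, 1]
```

This round-robin placement is what aggregates the bandwidth of the individual devices.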
Parallel file system usage
[Diagram: application and file system client on the compute nodes, talking to the I/O nodes / meta-data server and the Object Storage Targets]
The file system client gives the application the view of a “serial” file system.
The software stack
[Diagram: application data structures on top of an I/O library, the standard library and MPI-IO, all sitting on the operating system; object interface vs. streaming interface]
Let us put everything together
[Diagram: on every compute node, the application stack (data structures, I/O library, standard library, MPI-IO) sits on top of an FS client inside the MPI execution environment; the FS clients talk over the network to the I/O node, which manages the meta-data and direct/indirect blocks]
Test case to illustrate strategies
[Figure: an S × S array split into px × py blocks, each of size S/px × S/py]
Let us consider:
- A 2D structured array of size S × S
- A block-block distribution
- P = px × py cores
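For reference, the index bookkeeping behind this block-block distribution can be sketched as follows (the helper name is illustrative, and S is assumed divisible by px and py for simplicity):

```python
# Block-block decomposition of an S x S array over a px x py process
# grid: each rank owns one contiguous rectangular block.
def block_decomposition(S, px, py, rank):
    """Return (start_x, start_y, nx, ny) of `rank`'s block,
    assuming S is divisible by both px and py."""
    proc_x = rank % px            # rank's column in the process grid
    proc_y = rank // px           # rank's row in the process grid
    nx, ny = S // px, S // py     # local block size
    return proc_x * nx, proc_y * ny, nx, ny

# With S=8 on a 2x2 grid, rank 3 owns the (1,1) corner block:
print(block_decomposition(8, 2, 2, 3))  # (4, 4, 4, 4)
```

These (start, size) pairs are exactly what the IO strategies below need, whether as POSIX offsets, MPI derived types, or HDF5 hyperslab parameters.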
Multiple files
Each MPI process writes its own file:
- A single distributed dataset is spread out over different files
- The way it is spread out depends on the number of MPI processes
⇒ More work at the post-processing level
⇒ May lead to a huge number of files (often forbidden)
⇒ Very easy to implement
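The strategy can be sketched in a few lines, with a plain loop standing in for the MPI processes and the file naming being purely illustrative:

```python
# Sketch of the "multiple files" strategy: every (simulated) rank
# dumps its local block to its own file with a plain POSIX write.
import os
import tempfile

def write_own_file(outdir, rank, local_data):
    """Write `rank`'s local block to its own file; no coordination needed."""
    path = os.path.join(outdir, f"output_rank{rank}.bin")
    with open(path, "wb") as f:
        f.write(bytes(local_data))
    return path

outdir = tempfile.mkdtemp()
paths = [write_own_file(outdir, r, [r] * 4) for r in range(4)]
print(len(paths))  # 4 files: the on-disk layout now depends on the process count
```

Note that rebuilding the global array later requires knowing the decomposition that produced the files, which is the post-processing cost mentioned above.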
Multiple files
[Figure: each MPI process performs its own POSIX IO operations to its own file]
MPI gather + single file
A collective MPI call is first performed to gather the data on one MPI process. Then, this process writes a single file.
⇒ The memory of a single node can be a limitation
⇒ Single resulting file
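A minimal sketch of this strategy, with the gather step simulated in plain Python instead of an actual MPI_Gather:

```python
# Sketch of the gather + single file strategy, simulated without MPI:
# "rank 0" concatenates everybody's block, then does one POSIX write.
import os
import tempfile

local_blocks = [bytes([r] * 4) for r in range(4)]  # one local block per "rank"

gathered = b"".join(local_blocks)  # the MPI_Gather step: rank 0 now holds everything

path = os.path.join(tempfile.mkdtemp(), "single_file.bin")
with open(path, "wb") as f:        # only rank 0 touches the file
    f.write(gathered)

print(os.path.getsize(path))  # 16: one file, but rank 0 had to hold all 16 bytes
```

The single `gathered` buffer is exactly the limitation flagged above: the whole distributed array must fit in one node's memory.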
MPI Gather + single file
[Figure: all ranks gather their blocks on one process, which then performs a single POSIX IO operation]
MPI-IO concept
- I/O part of the MPI specification
- Provides a set of read/write methods
- Allows one to describe how data is distributed among the processes (thanks to MPI derived types)
- The MPI implementation takes care of actually writing a single contiguous file on disk from the distributed data
- The result is identical to the gather + POSIX file: MPI-IO performs the gather operation within the MPI implementation
⇒ No more memory limitation
⇒ Single resulting file
⇒ Definition of MPI derived types required
⇒ Performance linked to the MPI library
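In the simplest contiguous case, where each rank writes one block with an explicit offset (e.g. via MPI_FILE_WRITE_AT), the coordination boils down to offset arithmetic, sketched here without MPI:

```python
# Sketch of the bookkeeping behind an explicit-offset MPI-IO write:
# each rank writes one contiguous chunk at its own byte offset, so
# the blocks tile a single file with no gaps and no overlap.
def write_offset(rank, local_count, elem_size):
    """Byte offset at which `rank` writes its contiguous block."""
    return rank * local_count * elem_size

# 4 ranks, 1000 doubles (8 bytes) each: rank r writes at r * 8000.
offsets = [write_offset(r, 1000, 8) for r in range(4)]
print(offsets)  # [0, 8000, 16000, 24000]
```

For non-contiguous distributions such as the block-block test case, the same role is played by an MPI derived type describing each rank's view of the file.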
MPI-IO API
The API combines three axes: positioning (explicit offsets, individual file pointers, shared file pointers), coordination (non-collective, collective) and synchronism (blocking, non-blocking & split call):

Explicit offsets
  Non-collective, blocking:     MPI_FILE_READ_AT, MPI_FILE_WRITE_AT
  Non-collective, non-blocking: MPI_FILE_IREAD_AT, MPI_FILE_IWRITE_AT
  Collective, blocking:         MPI_FILE_READ_AT_ALL, MPI_FILE_WRITE_AT_ALL
  Collective, split call:       MPI_FILE_READ_AT_ALL_BEGIN/END, MPI_FILE_WRITE_AT_ALL_BEGIN/END

Individual file pointers
  Non-collective, blocking:     MPI_FILE_READ, MPI_FILE_WRITE
  Non-collective, non-blocking: MPI_FILE_IREAD, MPI_FILE_IWRITE
  Collective, blocking:         MPI_FILE_READ_ALL, MPI_FILE_WRITE_ALL
  Collective, split call:       MPI_FILE_READ_ALL_BEGIN/END, MPI_FILE_WRITE_ALL_BEGIN/END

Shared file pointers
  Non-collective, blocking:     MPI_FILE_READ_SHARED, MPI_FILE_WRITE_SHARED
  Non-collective, non-blocking: MPI_FILE_IREAD_SHARED, MPI_FILE_IWRITE_SHARED
  Collective, blocking:         MPI_FILE_READ_ORDERED, MPI_FILE_WRITE_ORDERED
  Collective, split call:       MPI_FILE_READ_ORDERED_BEGIN/END, MPI_FILE_WRITE_ORDERED_BEGIN/END
MPI-IO level illustration
[Figure: four MPI processes p0-p3 mapping their data to the file space at access levels 0 to 3]
Parallel HDF5
- Built on top of MPI-IO
- Must follow some restrictions to enable the underlying collective calls of MPI-IO
- From the programming point of view, only a few parameters have to be given to the HDF5 library
- Data distribution is described thanks to HDF5 hyperslabs
- The result is a single portable HDF5 file
⇒ Easy to develop
⇒ Single portable file
⇒ Possibly some performance issues
Parallel HDF5
[Figure: all MPI processes write their blocks into a single HDF5 file]
Parallel HDF5 implementation
INTEGER(HSIZE_T) :: array_size(2), array_subsize(2), array_start(2)
INTEGER(HID_T)   :: plist_id1, plist_id2, file_id, filespace, dset_id, memspace

! Global size, local size and local offset of this process's block
array_size(1)    = S
array_size(2)    = S
array_subsize(1) = local_nx
array_subsize(2) = local_ny
array_start(1)   = proc_x * array_subsize(1)
array_start(2)   = proc_y * array_subsize(2)

! Allocate and fill the tab array

CALL h5open_f(ierr)

! File access property list: open the file through MPI-IO
CALL h5pcreate_f(H5P_FILE_ACCESS_F, plist_id1, ierr)
CALL h5pset_fapl_mpio_f(plist_id1, MPI_COMM_WORLD, MPI_INFO_NULL, ierr)
CALL h5fcreate_f('res.h5', H5F_ACC_TRUNC_F, file_id, ierr, access_prp = plist_id1)

! Dataset transfer property list: set collective call
CALL h5pcreate_f(H5P_DATASET_XFER_F, plist_id2, ierr)
CALL h5pset_dxpl_mpio_f(plist_id2, H5FD_MPIO_COLLECTIVE_F, ierr)

! File space (global array) and memory space (local block)
CALL h5screate_simple_f(2, array_size, filespace, ierr)
CALL h5screate_simple_f(2, array_subsize, memspace, ierr)

CALL h5dcreate_f(file_id, 'pi_array', H5T_NATIVE_REAL, filespace, dset_id, ierr)

! Each process selects its block in the file space, then writes collectively
CALL h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, array_start, array_subsize, ierr)
CALL h5dwrite_f(dset_id, H5T_NATIVE_REAL, tab, array_subsize, ierr, memspace, filespace, plist_id2)

! Close HDF5 objects
IO technology comparison
Scientific results / diagnostics: multiple POSIX files in ASCII or binary, MPI-IO, pHDF5, XIOS
Restart files: SIONlib, ADIOS
IO technology comparison
                     POSIX       MPI-IO      pHDF5         SIONlib       ADIOS        XIOS          FTI
Abstraction          Stream      Stream      Stream        Object        Object       Object        Object
Purpose              General     General     General       General       General      General       Specific
Hardware             No          No          No            No            No           Yes           Yes
API                  Imperative  Imperative  Imperative    Imperative    Declarative  Decl./Imp.    Declarative
Format               Binary      Binary      HDF5          NetCDF/HDF5   Binary       NetCDF/HDF5   Binary
Single/multi file    Multi       Single      Single/Multi  Single        Multi++      Single/Multi  N.A.
Online post-proc.    No          No          No            Yes           No           Yes           No
PDI: the Approach
PDI, the Parallel Data Interface, only provides a declarative API (no behavior):
- PDI_expose(name, data): makes data available for output
- PDI_import(name, data): imports data into the application
Behavior is provided by existing IO libraries:
- A plug-in system, event-based
- HDF5, FTI (available), SIONlib, XIOS, IME (planned), ...
Behavior is selected through a configuration file:
- which plug-ins are used, and how, for which data and when
- a simple YAML file format
PDI: the Architecture
[Diagram: application codes call the PDI API (PDI_expose / PDI_import); PDI, driven by a YAML configuration file, dispatches to plug-ins such as HDF5, FTI and others]
Hands-on parallel HDF5 objective 1/2
[Figure: each of MPI ranks 0-3 writes its own block of the distributed array]
Hands-on parallel HDF5 1/2
1. git clone https://github.com/mathaefele/parallel_HDF5_hands-on.git
2. Parallel multiple files: all MPI ranks write their whole memory in separate files (provided in phdf5-1)
3. Serialized: each rank opens the file and writes its data one after the other
   3.1 Data written as separate datasets
   3.2 Data written in the same dataset
4. Parallel single file: specific HDF5 parameters given at open and write time to let MPI-IO manage the concurrent file access
Hands-on parallel HDF5 objective 2/2
[Figure: MPI ranks 0-3 write their blocks into a single file]
Hands-on parallel HDF5 2/2
Same exercise as the previous one, but now each rank has ghost cells that should not be written.
1. Parallel multiple files: all MPI ranks write their whole memory in separate files (provided in phdf5-4)
2. Parallel single file: specific HDF5 parameters given at open and write time to let MPI-IO manage the concurrent file access
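The hyperslab bookkeeping this exercise calls for can be sketched as pure arithmetic; the helper name and the ghost width below are illustrative:

```python
# Memory-space selection when each rank's local array carries a layer
# of ghost cells: only the interior block must be selected for writing.
def interior_selection(local_nx, local_ny, ghost):
    """Return (start, count) of the interior of a local array of
    shape (local_nx + 2*ghost, local_ny + 2*ghost)."""
    start = (ghost, ghost)           # skip the ghost layer on each side
    count = (local_nx, local_ny)     # interior cells only
    return start, count

# A 4x4 interior block with 1 ghost cell on every side lives in a
# 6x6 allocation; the memory hyperslab starts at (1, 1) and spans 4x4.
print(interior_selection(4, 4, 1))  # ((1, 1), (4, 4))
```

In the HDF5 version, this (start, count) pair goes into the hyperslab selection on the memory space, while the file-space selection stays the same as in the ghost-free exercise.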