Isolated MPI-I/O Solution on top of MPI-1, Emin Gabrielyan, Roger D. Hersch (PDF document)

  1. SFIO: Isolated MPI-I/O Solution on top of MPI-1. Emin Gabrielyan, Roger D. Hersch, École Polytechnique Fédérale de Lausanne, Switzerland, {Emin.Gabrielyan,RD.Hersch}@epfl.ch. 5th Workshop on Distributed Supercomputing: Scalable Cluster Software, May 23-24, 2001, Sheraton Hyannis, Cape Cod, Hyannis MA.

  2. MPI-I/O Access Operations
[Figure: the data access operations arranged along three orthogonal axes. Positioning: explicit offsets, individual file pointers, shared file pointers. Synchronism: blocking, non-blocking / split collective (BEGIN/END). Coordination: non-collective, collective. The cells hold the READ/WRITE variants: READ/WRITE, READ_AT/WRITE_AT, READ_SHARED/WRITE_SHARED, their non-blocking IREAD/IWRITE counterparts, and the collective READ_ALL, READ_AT_ALL and READ_ORDERED families.]
The basic set of MPI-I/O interface functions consists of File Manipulation Operations, File View Operations and Data Access Operations. There are three orthogonal aspects to data access: positioning, synchronism and coordination, yielding 12 types of read and 12 types of write operations.
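As an illustration of the positioning axis, the sketch below (not taken from the slides; the file handle fh and the buffer size are assumed) expresses the same blocking, non-collective read with each of the three positioning methods.

    #include <mpi.h>

    /* Illustrative only: the same 100-int read along the positioning axis,
     * assuming fh is a file handle previously opened with MPI_File_open. */
    void read_three_ways(MPI_File fh)
    {
        int buf[100];
        MPI_Status st;

        MPI_File_read_at(fh, 0, buf, 100, MPI_INT, &st);    /* explicit offset         */
        MPI_File_read(fh, buf, 100, MPI_INT, &st);          /* individual file pointer */
        MPI_File_read_shared(fh, buf, 100, MPI_INT, &st);   /* shared file pointer     */

        /* Non-blocking counterparts: MPI_File_iread_at / _iread / _iread_shared.
         * Collective counterparts: MPI_File_read_at_all / _read_all / _read_ordered. */
    }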

  3. File View
[Figure: four combinations of memory and file layout. The data may be contiguous or non-contiguous (fragmented) in memory, and the file view may be contiguous or non-contiguous (fragmented) in the file.]
The file view is a global concept that affects all data access operations. For each process it specifies that process's own view of the shared data file: a sequence of pieces of the common data file that are visible to that particular process. To specify the file view, the user creates a derived datatype that defines the fragmented structure of the visible part of the file. Since each access operation can in addition use another derived datatype specifying the fragmentation in memory, there are two further orthogonal aspects to data access: the fragmentation in memory and the fragmentation of the file view.
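The sketch below (an illustration, not code from the slides; the function name, buffer and all block sizes are assumptions) shows the two orthogonal fragmentations: the filetype passed to MPI_File_set_view fragments the file view, while the datatype passed to the access call fragments the data in memory.

    #include <mpi.h>

    /* Illustrative: fragmented file view and fragmented memory layout.
     * fh is an already opened file; buf must cover the extent of memtype
     * (here (32-1)*128 + 32 = 4000 bytes). */
    void fragmented_read(MPI_File fh, char *buf)
    {
        MPI_Datatype filetype, memtype;
        MPI_Status st;

        /* File view: this process sees 16 pieces of 64 bytes, spaced 256 bytes apart. */
        MPI_Type_vector(16, 64, 256, MPI_BYTE, &filetype);
        MPI_Type_commit(&filetype);
        MPI_File_set_view(fh, 0, MPI_BYTE, filetype, "native", MPI_INFO_NULL);

        /* Memory layout: the 1024 visible bytes land in 32 pieces of 32 bytes,
         * spaced 128 bytes apart in buf. */
        MPI_Type_vector(32, 32, 128, MPI_BYTE, &memtype);
        MPI_Type_commit(&memtype);

        MPI_File_read(fh, buf, 1, memtype, &st);

        MPI_Type_free(&filetype);
        MPI_Type_free(&memtype);
    }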

  4. Derived Datatypes
[Figure: a derived datatype T4 built by nesting constructors: MPI_Type_vector(3,1,2,MPI_BYTE,&T1); MPI_Type_contiguous(2,T1,&T2); MPI_Type_struct(2,...,&T3); MPI_Type_contiguous(2,T3,&T4).]
MPI-1 provides techniques for creating datatype objects describing an arbitrary data layout in memory. The opaque datatype object can be used in various MPI operations, but the layout information, once put into a derived datatype, cannot be decoded from the datatype.
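A minimal sketch of the nested construction shown in the figure; the block lengths and displacements passed to MPI_Type_struct are assumptions, since the slide elides them with "...".

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Datatype T1, T2, T3, T4;

        MPI_Init(&argc, &argv);

        /* T1: 3 blocks of 1 byte with a stride of 2 bytes */
        MPI_Type_vector(3, 1, 2, MPI_BYTE, &T1);

        /* T2: two consecutive copies of T1 */
        MPI_Type_contiguous(2, T1, &T2);

        /* T3: a struct of two T2 blocks; the displacements are illustrative only,
         * the slide writes MPI_Type_struct(2,...,&T3) */
        int          blocklens[2] = {1, 1};
        MPI_Aint     displs[2]    = {0, 16};
        MPI_Datatype oldtypes[2]  = {T2, T2};
        MPI_Type_struct(2, blocklens, displs, oldtypes, &T3);

        /* T4: two consecutive copies of T3 */
        MPI_Type_contiguous(2, T3, &T4);
        MPI_Type_commit(&T4);

        /* T4 can now be used in any MPI operation, but its layout cannot be
         * read back from the opaque object. */

        MPI_Type_free(&T4);
        MPI_Finalize();
        return 0;
    }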

  5. MPI-I/O Implementation
[Figure: layered stack of MPI-I/O Interface, MPI-I/O Implementation, MPI-1 Interface and MPI-1 Implementation; the MPI-I/O layer needs access to the internal operations and data structures of the MPI-1 implementation in order to decode the layout information of the file view's derived datatype.]
MPI-2 operations, and the MPI-I/O subset in particular, form an extension to MPI-1. However, a developer of MPI-I/O needs access to the source code of the MPI-1 implementation on top of which MPI-I/O is to be implemented. A specific development of MPI-I/O is therefore required for each MPI-1 implementation.

  6. Reverse Engineering or Memory Painting
[Figure: a contiguous buffer of the size of datatype T4 is sent as raw bytes with MPI_Send(source,size,MPI_BYTE,...) and received into a buffer of the size of T4's extent with MPI_Recv(destination-LB,1,T4,...).]
The layout information cannot be decoded from the datatype, but the behaviour of the datatype depends on the layout. We therefore define a special test for a derived datatype, analyse the behaviour of the datatype and, based on that behaviour, decode the layout information of the datatype. For example, the MPI_Recv operation receives a contiguous network stream and distributes it in memory according to the data layout of the datatype. If the memory is previously initialised with a "green colour" and the network stream carries a "red colour", then analysing the memory after data reception gives us the necessary information on the data layout hidden in the opaque datatype. In our solution we do not use the MPI_Send and MPI_Recv operations; instead we use the standard MPI-1 MPI_Unpack operation, avoiding network transfers and the use of multiple processes.
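The following is a minimal sketch of the memory painting idea using MPI_Unpack, assuming a committed derived datatype T4; the colour constants and variable names are illustrative, and the bookkeeping of the real implementation is omitted.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Paint the destination buffer "green", the packed source "red", then let
     * MPI_Unpack scatter the red bytes according to T4's hidden layout. */
    void decode_layout(MPI_Datatype T4)
    {
        MPI_Aint extent, lb;
        int packed_size, position = 0;
        long i;

        MPI_Type_extent(T4, &extent);              /* span of the datatype         */
        MPI_Type_lb(T4, &lb);                      /* lower bound (may be nonzero) */
        MPI_Pack_size(1, T4, MPI_COMM_SELF, &packed_size);

        unsigned char *packed = malloc(packed_size);
        unsigned char *window = malloc(extent);
        memset(packed, 0xFF, packed_size);         /* "red"   */
        memset(window, 0x00, extent);              /* "green" */

        /* The destination pointer is shifted by -LB, as in the slide's
         * MPI_Recv(destination-LB, 1, T4, ...). */
        MPI_Unpack(packed, packed_size, &position,
                   window - lb, 1, T4, MPI_COMM_SELF);

        /* Every byte that turned red is covered by the datatype's layout. */
        for (i = 0; i < (long)extent; i++)
            if (window[i] == 0xFF)
                printf("relative offset %ld belongs to the layout\n", i);

        free(packed);
        free(window);
    }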

  7. Portable MPI-I/O Solution
[Figure: layered stack of MPI-I/O Interface and MPI-I/O Implementation resting, via Memory Painting, on the MPI-1 Interface and MPI-1 Implementation.]
Once we have a tool for decoding derived datatypes, it becomes possible to create an isolated MPI-I/O solution on top of any standard MPI-1. The Argonne National Laboratory's MPICH implementation of MPI-I/O is used intensively together with our datatype decoding technique, and an isolated solution for a limited subset of the MPI-I/O operations has been implemented.

  8. MPI-I/O Isolation
[Figure: the same positioning/synchronism/coordination cube of data access operations as on slide 2, with the implemented blocking non-collective subset highlighted.]
The basic File Manipulation operations MPI_File_open and MPI_File_close, the File View operation MPI_File_set_view, and the blocking non-collective Data Access Operations MPI_File_write, MPI_File_write_at, MPI_File_read and MPI_File_read_at have already been successfully implemented in the form of an isolated, independent library. We are currently working on the collective counterparts of the blocking operations and are trying to make use of the extended two-phase method for accessing sections of out-of-core arrays, on which the ANL implementation is based.
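A usage sketch of this implemented subset (the file name, block size and block-cyclic layout are assumptions chosen for illustration): the MPI_UB marker in the struct sets the filetype's extent so that tiling the view gives each process a block-cyclic slice of the file.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        MPI_Datatype filetype;
        MPI_Status status;
        int rank, nprocs, i, buf[256];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        for (i = 0; i < 256; i++) buf[i] = rank;

        MPI_File_open(MPI_COMM_WORLD, "/tmp/demo.dat",
                      MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

        /* Filetype: 256 ints of data followed by a hole up to 256*nprocs ints,
         * so each tiling of the view skips the other processes' blocks. */
        {
            int blocklens[2] = {256, 1};
            MPI_Aint displs[2] = {0, (MPI_Aint)256 * nprocs * sizeof(int)};
            MPI_Datatype types[2] = {MPI_INT, MPI_UB};
            MPI_Type_struct(2, blocklens, displs, types, &filetype);
            MPI_Type_commit(&filetype);
        }
        MPI_File_set_view(fh, (MPI_Offset)rank * 256 * sizeof(int),
                          MPI_INT, filetype, "native", MPI_INFO_NULL);

        /* Blocking non-collective accesses from the implemented subset. */
        MPI_File_write(fh, buf, 256, MPI_INT, &status);       /* first visible block */
        MPI_File_read_at(fh, 0, buf, 256, MPI_INT, &status);  /* read it back        */

        MPI_File_close(&fh);
        MPI_Type_free(&filetype);
        MPI_Finalize();
        return 0;
    }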

  9. Testing Isolated MPI-I/O
[Figure: the isolated stack (MPI-I/O Interface, MPI-I/O Implementation, Memory Painting) running on top of MPI-FCI on the Swiss-Tx.]
Test results with MPI-FCI:
• Contiguous memory and file: MPI_File_write, MPI_File_read, MPI_File_write_at, MPI_File_read_at: Ok
• Fragmented memory, contiguous file: MPI_File_write, MPI_File_read, MPI_File_write_at, MPI_File_read_at: Ok
• Contiguous memory, fragmented file: MPI_File_write, MPI_File_read, MPI_File_write_at, MPI_File_read_at: Ok
• Fragmented memory and file: MPI_File_write, MPI_File_read, MPI_File_write_at, MPI_File_read_at: Ok
The implemented operations of the isolated MPI-I/O solution have been successfully tested with the MPI-FCI implementation of MPI-1 on the Swiss-Tx supercomputer.

  10. Gateway to the Parallel I/O of the Swiss-T1
[Figure: the Swiss-T1 interconnect topology; processors PR00 to PR63, serving as compute and I/O processors, are connected by TNET links of about 86 MB/s through switches with routing.]
Beneath the isolated MPI-I/O we intend to provide, as a high-performance I/O back end, a switch to the Striped File I/O system (SFIO). The SFIO communication layer is implemented on top of MPI-1, and SFIO is therefore also portable. We measured scalable performance of SFIO on the architecture of the Swiss-Tx supercomputer.

  11. SFIO on the Swiss-Tx machine
[Figure: SFIO throughput in MB/s (up to about 400 MB/s) versus the number of compute or I/O nodes; the curves show read average, read maximum, write average and write maximum.]
The performance of SFIO is measured for concurrent access from all compute nodes to all I/O nodes. In order to limit operating-system caching effects, the total size of the striped file increases linearly with the number of I/O nodes, up to 32 GB. The stripe unit size is 200 bytes. The application's I/O performance is measured as a function of the number of compute and I/O nodes.

  12. Conclusion
The isolated solution automatically gives every MPI-1 owner an MPI-I/O, without any requirement to change, modify or otherwise interfere with the current MPI-1 implementation.
Future work:
• Implementation of blocking collective file access operations.
• Implementation of non-blocking file access operations.
• The remaining File Manipulation Operations.
• Switching to SFIO.
Thank you!
