On Overlapping Communication and File I/O in Collective Write Operation
Raafat Feki and Edgar Gabriel
Parallel Software Technologies Lab, Department of Computer Science, University of Houston, Houston, USA
Email: {rfeki, egabriel}@uh.edu
Motivation
Many scientific applications operate on data sets that span hundreds of gigabytes or even terabytes. A significant amount of their runtime is spent reading and writing data.
[Figure: many small non-contiguous I/O requests versus one large I/O request]
The Message Passing Interface (MPI) is the most widely used parallel programming paradigm for large-scale parallel applications.
■ MPI I/O: parallel file I/O interface. Multiple processes can simultaneously access the same file to perform read and write operations, using a shared-file access pattern:
- Individually: each process writes/reads its local data independently of the others.
➢ Poor performance for complex data layouts.
- Collectively: access information is shared among all processes, and the individual requests are merged into larger ones.
➢ Significantly reduced I/O time.
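As a minimal illustration of the two access modes (a hedged sketch, not taken from the poster: it assumes each process owns a contiguous block of count doubles stored at offset rank × count × sizeof(double); the file name and sizes are hypothetical), the only difference on the application side is the choice of the write call:

```c
/* Hypothetical sketch contrasting individual and collective MPI I/O.
 * Assumes each process owns `count` doubles at offset rank*count in the file. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Status status;
    int rank;
    const int count = 1024;              /* elements per process (assumed) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double *buf = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++) buf[i] = rank;

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);

    /* Individual access: each process writes its block on its own. */
    MPI_File_write_at(fh, offset, buf, count, MPI_DOUBLE, &status);

    /* Collective access: all processes call together, so the library can
     * merge the requests (e.g., via two-phase I/O). */
    MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE, &status);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```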
Two-Phase I/O
- Collective I/O algorithm used at the client level.
- Only a subset of the MPI processes, called aggregators, actually write/read data.
- It consists of two phases:
○ Shuffle/aggregation phase:
■ The data is shuffled, assigned to aggregators, and sent to them.
○ Access/I/O phase:
■ Aggregators write/read the data to/from disk.
[Figure: example of a 2D data decomposition across Processes 1-4 on compute nodes 1-4; in the shuffle phase the data is reorganized and sent to Aggregator 1 (compute node 1) and Aggregator 2 (compute node 3), which then write it to the file in the I/O phase.]
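To make the two phases concrete, here is a deliberately simplified, hypothetical sketch for a single aggregator (rank 0 of the communicator) collecting contiguous byte blocks with a gather; real collective-write implementations use several aggregators, partition the file domain, and handle arbitrary datatypes:

```c
/* Simplified, hypothetical illustration of the two phases for one aggregator.
 * Real two-phase implementations are far more general than this sketch. */
#include <mpi.h>
#include <stdlib.h>

void collective_write_sketch(MPI_Comm comm, MPI_File fh,
                             const char *myblock, int mycount,
                             MPI_Offset file_offset_of_aggregate)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    const int aggregator = 0;            /* assume rank 0 is the aggregator */

    /* Shuffle/aggregation phase: every process sends its block to the
     * aggregator, which collects them into one contiguous buffer. */
    int *counts = NULL, *displs = NULL;
    char *aggbuf = NULL;
    if (rank == aggregator) {
        counts = malloc(size * sizeof(int));
        displs = malloc(size * sizeof(int));
    }
    MPI_Gather(&mycount, 1, MPI_INT, counts, 1, MPI_INT, aggregator, comm);
    int total = 0;
    if (rank == aggregator) {
        for (int i = 0; i < size; i++) { displs[i] = total; total += counts[i]; }
        aggbuf = malloc(total);
    }
    MPI_Gatherv(myblock, mycount, MPI_BYTE,
                aggbuf, counts, displs, MPI_BYTE, aggregator, comm);

    /* Access/I/O phase: only the aggregator touches the file. */
    if (rank == aggregator) {
        MPI_Status status;
        MPI_File_write_at(fh, file_offset_of_aggregate, aggbuf, total,
                          MPI_BYTE, &status);
        free(aggbuf); free(counts); free(displs);
    }
}
```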
Overlapped Two-Phase (I)
[Figure: Aggregator X holds two sub-buffers (Sub-Buf 1 and Sub-Buf 2); while block 0 is written from one sub-buffer to disk, block 1 is shuffled from compute nodes 1 and 2 into the other sub-buffer.]
- Divide the aggregator's buffer into two sub-buffers.
- Overlap the two phases:
○ While the aggregator writes the data from sub-buffer 1 to disk, the compute nodes shuffle the next portion of data and send it into sub-buffer 2.
■ Implemented using asynchronous operations (a sketch follows below).
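A hedged sketch of how the aggregator side of this double-buffering loop might look with MPI-3 non-blocking operations. The function and parameter names are ours, not from the poster, and it assumes the contributing processes are ranks 0..nsenders-1 of comm (not including the aggregator) and that each cycle's data is split evenly among them:

```c
/* Hypothetical aggregator-side double-buffering loop: the non-blocking write
 * of one sub-buffer overlaps with receiving the next cycle's shuffled data
 * into the other sub-buffer. */
#include <mpi.h>
#include <stdlib.h>

void overlapped_cycles(MPI_File fh, MPI_Comm comm, int ncycles,
                       char *subbuf[2], int bufsize,
                       int nsenders, MPI_Offset stride)
{
    MPI_Request wreq = MPI_REQUEST_NULL;
    MPI_Request *rreqs = malloc(nsenders * sizeof(MPI_Request));
    int chunk = bufsize / nsenders;      /* assumed even split among senders */

    for (int c = 0; c < ncycles; c++) {
        char *fill = subbuf[c % 2];      /* sub-buffer filled by this cycle's shuffle */

        /* Post non-blocking receives for this cycle's shuffled data. */
        for (int s = 0; s < nsenders; s++)
            MPI_Irecv(fill + s * chunk, chunk, MPI_BYTE, s, c, comm, &rreqs[s]);

        /* Complete the previous cycle's write (so only one write is in
         * flight) and this cycle's shuffle before issuing the next write. */
        MPI_Wait(&wreq, MPI_STATUS_IGNORE);
        MPI_Waitall(nsenders, rreqs, MPI_STATUSES_IGNORE);

        /* Start the non-blocking write of the freshly filled sub-buffer;
         * it overlaps with the next cycle's shuffle. */
        MPI_File_iwrite_at(fh, c * stride, fill, bufsize, MPI_BYTE, &wreq);
    }
    MPI_Wait(&wreq, MPI_STATUS_IGNORE);  /* drain the last write */
    free(rreqs);
}
```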
[Figure: pipeline timeline over cycles 1 to n, alternating between Buff 1 and Buff 2: cycle 1 waits for S1; cycle 2 waits for S2 and W1; ...; cycle n waits for Sn and Wn-1, followed by the final write Wn.]
We define Sn as the n-th shuffle phase and Wn as the n-th access phase.
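As a rough illustrative model of why this helps (our own back-of-the-envelope assumption, not a result reported on the poster): the non-overlapped algorithm pays the sum of all phases, whereas the pipelined variant exposes only the first shuffle, the last write, and the larger of the two phases in each inner cycle:

T_{\text{two-phase}} = \sum_{i=1}^{n} \left( S_i + W_i \right)

T_{\text{overlapped}} \approx S_1 + \sum_{i=1}^{n-1} \max\left( S_{i+1}, W_i \right) + W_n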
Overlapped Two-Phase (II)

- There are multiple ways to implement the overlapping technique, depending on the choice of:
○ The asynchronous functions (communication or I/O).
○ The phases that are overlapped (aggregation or access phase).
- We proposed four different algorithms:

Algorithm                       | Overlapped phases             | Communication function | I/O function
1. Communication Overlap        | 2 shuffle phases              | Asynchronous           | Synchronous
2. Writes Overlap               | 2 access phases               | Synchronous            | Asynchronous
3. Write-Communication Overlap  | 1 shuffle and 1 access phase  | Asynchronous           | Asynchronous

4. Write-Communication-2 Overlap: a revised version of algorithm 3 that follows a data-flow model:
➢ The completion of any non-blocking operation is immediately followed by posting the follow-up operation.
✓ Two shuffle and two write operations are handled in each iteration (two cycles).
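The data-flow idea of algorithm 4 can be sketched with MPI_Waitany: whichever non-blocking operation completes first immediately triggers its follow-up. The sketch below is hypothetical and assumes, for brevity, that each cycle's shuffle is represented by a single already-posted non-blocking receive; buffer names and offsets are illustrative:

```c
/* Hypothetical data-flow sketch in the spirit of algorithm 4: as soon as the
 * shuffle for either of the two in-flight cycles completes, its write is
 * posted immediately, instead of waiting for a whole phase to finish. */
#include <mpi.h>

void dataflow_iteration(MPI_File fh, MPI_Request shuffle_req[2],
                        char *subbuf[2], int bufsize, MPI_Offset offs[2])
{
    /* shuffle_req[k] is assumed to be an already-posted non-blocking receive
     * that fills subbuf[k]; offs[k] is the file offset for that sub-buffer. */
    MPI_Request write_req[2] = { MPI_REQUEST_NULL, MPI_REQUEST_NULL };

    for (int pending = 2; pending > 0; pending--) {
        int k;
        /* Complete whichever shuffle finishes first ... */
        MPI_Waitany(2, shuffle_req, &k, MPI_STATUS_IGNORE);
        /* ... and immediately post the follow-up write for that sub-buffer. */
        MPI_File_iwrite_at(fh, offs[k], subbuf[k], bufsize, MPI_BYTE,
                           &write_req[k]);
    }
    /* Drain both writes before the sub-buffers are reused in the next iteration. */
    MPI_Waitall(2, write_req, MPI_STATUSES_IGNORE);
}
```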
Evaluation (I)
We tested the original non-overlapped two-phase I/O and the four overlapping algorithms using:
- Platforms: Crill cluster (University of Houston) and Ibex cluster (KAUST)
- File system: BeeGFS
- Benchmarks: IOR (1D data), Tile-IO (2D data), and FlashIO (HDF5 output)
- We could not identify a single “winner” among the different algorithms.
- The overlapping technique showed no benefit in 16% of the test cases.
- The algorithms incorporating asynchronous I/O outperformed the other approaches in 71% of the test series.
➢ Better performance with asynchronous file I/O
Evaluation (II)

In order to identify the best algorithm, we ran the four overlapping algorithms for all benchmarks on the Crill and Ibex clusters:
- The average performance improvements on the Crill cluster are very close for all versions, with a slight advantage for the communication-overlap version.
- The results on Ibex are clearer and show a clear win for the communication-overlap version.
Data Transfer Primitives
We investigated two communication models for the shuffle-phase implementation:
- Two-sided communication: currently used in the two-phase algorithm.
○ Data is sent by the MPI processes to the receivers (aggregators) using MPI_Send/MPI_Isend.
○ Aggregators receive the data into local buffers using MPI_Recv/MPI_Irecv.
- Remote memory access (RMA):
○ Each process exposes a part of its memory to the other processes.
○ Only one side (sender or receiver) is responsible for the data transfer (using MPI_Put()/MPI_Get(), respectively).
➢ To alleviate the workload on the aggregators, we chose to make the senders “put” their data into the aggregators' memory.
○ We used two synchronization methods to guarantee data consistency:
➢ Active target synchronization (MPI_Win_fence)
➢ Passive target synchronization (MPI_Win_lock/MPI_Win_unlock)
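A hedged sketch of the active-target RMA variant of the shuffle phase (the window layout of one equal-sized chunk per sender, indexed by rank, is our simplifying assumption, not the poster's implementation):

```c
/* Hypothetical RMA shuffle: every sender puts its chunk directly into a
 * window exposed over the aggregator's collection buffer. */
#include <mpi.h>

void rma_shuffle_fence(MPI_Comm comm, int aggregator,
                       char *aggbuf, int chunk, const char *mychunk)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* The aggregator exposes its collection buffer; other ranks expose nothing. */
    MPI_Win win;
    MPI_Aint winsize = (rank == aggregator) ? (MPI_Aint)chunk * size : 0;
    MPI_Win_create(rank == aggregator ? aggbuf : NULL, winsize, 1,
                   MPI_INFO_NULL, comm, &win);

    /* Active target synchronization: fences bracket the access epoch. */
    MPI_Win_fence(0, win);
    if (rank != aggregator)
        MPI_Put(mychunk, chunk, MPI_BYTE, aggregator,
                (MPI_Aint)rank * chunk, chunk, MPI_BYTE, win);
    MPI_Win_fence(0, win);   /* the data is now visible at the aggregator */

    MPI_Win_free(&win);
}
```

In the passive-target variant, each sender would instead surround its MPI_Put with MPI_Win_lock(MPI_LOCK_SHARED, aggregator, 0, win) and MPI_Win_unlock(aggregator, win), plus an additional synchronization (e.g., a barrier) so that the aggregator knows when all data has arrived.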
Evaluation (III)
- Two-sided data communication outperformed the one-sided versions in 75% of the test cases.
- When using Tile-IO with a small element size of 256 bytes, the version using MPI_Win_fence achieved the best performance in 37% of the test cases.
○ The performance gain over two-sided communication was around 27% in these cases.
○ The benefits of using one-sided communication increased for larger process counts.
[Figure: number of times each of the three different data transfer primitives resulted in the best performance.]
Conclusion and Future Work
- Conclusion:
○ Proposed various design options for overlapping two-phase I/O.
➢ Overlapping algorithms incorporating asynchronous I/O operations outperform the other approaches and offer significant performance benefits of up to 22% compared to the non-overlapped two-phase I/O algorithm.
○ Explored two communication paradigms for the shuffle phase:
➢ One-sided communication did not lead to performance improvements compared to two-sided communication.
- Future work:
○ Running the same tests on the Lustre file system showed completely different results, since it only supports blocking I/O functions.
➢ Explore Lustre's advanced reservation solution (Lockahead).