Design of Locality-aware MPI-IO for Scalable Shared File Write Performance


  1. Design of Locality-aware MPI-IO for Scalable Shared File Write Performance
     Kohei Sugihara (1), Osamu Tatebe (2)
     (1) Department of Computer Science, University of Tsukuba
     (2) Center for Computational Sciences, University of Tsukuba

  2. Background
     Fig. File Per Process vs. Single Shared File (SSF)
     ● Single Shared File (SSF): multiple processes access a single shared file (as opposed to File Per Process, FPP)
       ○ A typical I/O access pattern in HPC applications [6]
       ○ Used to reduce the number of result files in a large-scale job
     ● Node-local storage
       ○ Installed on the compute nodes of recent supercomputers [7, 8, 9]
       ○ Used as a temporary read-file cache for File Per Process (FPP) workloads
       ○ Can minimize communication cost, but how to use it for Single Shared File access is not obvious

  3. Problem
     Fig. Lock contention under file striping
     ● Single Shared File access is slow
       ○ Reason: lock contention
       ○ The block or stripe size of the file system does not match the access size of the application
     ● Node-local storage cannot be used in a locality-aware way
       ○ Reason: most existing parallel file systems employ file striping
       ○ File striping consumes network bandwidth on the compute-node side
     ● Our goal: achieve scalable write bandwidth for shared-file access by using node-local storage without file striping

  4. Approach
     ● Avoid lock contention in Single Shared File access
       ○ Most HPC applications do not access overlapping regions of an SSF
       ○ Locking among parallel I/O requests is therefore not essential
       ○ Approach: we propose a lockless format for SSF representation
     ● Utilize locality
       ○ Keep locking of I/O requests within the compute node
       ○ Place files mostly on node-local storage and minimize remote communication for file access
       ○ Approach: use the Gfarm file system for locality-oriented file placement

  5. Proposal: Sparse Segments
     ● Sparse Segments: an internal file format for a single shared file
       ○ Each process creates its corresponding segment, which is expected to be stored on node-local storage
       ○ Regions written by other processes are left as holes in each segment
     Fig. Segment layouts: (a) N-1 Segmented w/o Resize, (b) N-1 Strided w/ Resize, (c) N-1 Strided w/ Resize
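     To make the layout concrete, the sketch below (not the authors' implementation) shows how a per-rank sparse segment could be produced with POSIX I/O for an N-1 strided pattern: each rank writes its blocks at the offsets they would occupy in the logical shared file, so the ranges owned by other ranks remain holes. The segment file name X.<rank>, the block size, and the block count are illustrative assumptions.

         /* Sketch: write one rank's strided blocks into its own sparse segment file.
          * Offsets match the logical shared file; untouched ranges (other ranks'
          * data) are never written, so they remain holes in the segment. */
         #include <fcntl.h>
         #include <mpi.h>
         #include <stdio.h>
         #include <string.h>
         #include <unistd.h>

         int main(int argc, char **argv) {
             MPI_Init(&argc, &argv);
             int rank, nprocs;
             MPI_Comm_rank(MPI_COMM_WORLD, &rank);
             MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

             const size_t block = 1 << 20;               /* 1 MiB per block (assumed) */
             static char buf[1 << 20];
             memset(buf, 'A' + rank % 26, sizeof(buf));

             char path[64];
             snprintf(path, sizeof(path), "X.%d", rank); /* per-rank segment file */
             int fd = open(path, O_CREAT | O_WRONLY, 0644);

             /* N-1 strided pattern: rank r owns blocks r, r + P, r + 2P, ... */
             for (int i = 0; i < 4; i++) {
                 off_t off = (off_t)(i * nprocs + rank) * block;
                 pwrite(fd, buf, block, off);            /* holes stay between blocks */
             }
             close(fd);
             MPI_Finalize();
             return 0;
         }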

  6. Implementation: Locality-aware MPI-IO
     ● Locality-aware MPI-IO: an MPI-IO optimization for shared-file access
       ○ Implicitly converts the SSF into Sparse Segments
       ○ Generates the Sparse Segments in MPI_File_open()
     Fig. (a) Conventional MPI-IO: MPI_File_open(X) on P0 and P1 opens the shared file X; (b) Locality-aware MPI-IO: MPI_File_open(X) opens the per-process segment files X.0 and X.1
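     The conversion in panel (b) can be pictured as a thin redirection at open time. The following is a minimal user-level sketch under that assumption; the wrapper name is hypothetical, and in the actual design the conversion happens inside the MPI-IO implementation, not in user code.

         /* Sketch: redirect a shared-file open to a per-rank segment file (X -> X.<rank>). */
         #include <mpi.h>
         #include <stdio.h>

         /* Hypothetical wrapper; not part of the MPI standard or the authors' code. */
         int File_open_locality_aware(MPI_Comm comm, const char *filename,
                                      int amode, MPI_Info info, MPI_File *fh) {
             int rank;
             MPI_Comm_rank(comm, &rank);

             char segment[4096];
             snprintf(segment, sizeof(segment), "%s.%d", filename, rank);

             /* Each rank opens only its own segment, so the open stays node-local
              * when the segment is placed on node-local storage. */
             return MPI_File_open(MPI_COMM_SELF, segment, amode, info, fh);
         }

     Calling File_open_locality_aware(MPI_COMM_WORLD, "X", MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh) on two ranks would then create X.0 and X.1, as in panel (b).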

  7. Locality-oriented File Placement
     ● Store each Sparse Segment in Gfarm [25]
       ○ Gfarm is a locality-oriented parallel file system
       ○ Gfarm automatically stores a full copy of each file on the storage nearest to the process
       ○ Each Sparse Segment is therefore stored on the corresponding node's local storage
     Fig. MPI_File_open(X) on Node #0 and Node #1 opens X.0 and X.1, which the Gfarm file system places on each node's local storage

  8. Experiment
     ● Method
       ○ Issue write accesses against a single shared file (weak scaling); a sketch of this access pattern follows below
     ● Applications
       ○ Microbenchmark: IOR
       ○ Application benchmarks: S3D-IO, LES-IO, VPIC-IO
     ● Environment
       ○ System: TSUBAME 3.0 supercomputer [8] at Tokyo Tech
         ■ Proposal: node-local storage on the compute nodes
         ■ Lustre: file-system nodes, 68 OSTs (peak 50 GB/s; not an apples-to-apples comparison)
         ■ BeeOND: node-local storage on the compute nodes
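     For reference, this is a minimal sketch of the kind of N-1 shared-file write the benchmarks issue, assuming a segmented layout, a 1 MiB transfer per rank, and the file name shared.dat; the actual measurements use IOR and the application benchmarks listed above, not this code.

         /* Sketch: non-collective N-1 write, each rank writing its own region of one
          * shared file (comparable in spirit to IOR's non-collective shared-file mode). */
         #include <mpi.h>
         #include <stdlib.h>
         #include <string.h>

         int main(int argc, char **argv) {
             MPI_Init(&argc, &argv);
             int rank;
             MPI_Comm_rank(MPI_COMM_WORLD, &rank);

             const MPI_Offset block = 1 << 20;       /* 1 MiB per rank (assumed) */
             char *buf = malloc(block);
             memset(buf, 0, block);

             MPI_File fh;
             MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                           MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

             /* Segmented N-1 pattern: rank r writes [r * block, (r + 1) * block). */
             MPI_File_write_at(fh, (MPI_Offset)rank * block, buf, (int)block,
                               MPI_BYTE, MPI_STATUS_IGNORE);

             MPI_File_close(&fh);
             free(buf);
             MPI_Finalize();
             return 0;
         }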

  9. Experiment
     ● The proposed method scales in all benchmarks (aggregate bandwidth)
       ○ Lustre bandwidth saturates when the number of processes exceeds the number of OSTs
       ○ BeeOND is not scalable even though it uses the same node-local storage
       ○ The proposal is scalable
     Fig. Aggregate write bandwidth for IOR (non-collective), S3D-IO, LES-IO, and VPIC-IO

  10. Discussion
     ● Lustre and BeeOND show low bandwidth in the application benchmarks
       ○ Accesses to the SSF in small pieces cause heavy lock contention among processes
     ● Our method demonstrated linearly scalable bandwidth
       ○ Conversion to Sparse Segments
         ■ All application benchmarks scale
         ■ Result: successfully avoids lock contention
       ○ Locality-aware file placement
         ■ All benchmarks scale linearly, even beyond the number of OST nodes
         ■ Result: scales successfully and effectively using node-local storage

  11. Conclusion
     ● We proposed Sparse Segments and Locality-aware MPI-IO
     ● Our method demonstrates scalable parallel write performance
       ○ In both the microbenchmark and the application benchmarks

  12. References
     [6] P. Carns, K. Harms, W. Allcock, C. Bacon, S. Lang, R. Latham, and R. Ross, “Understanding and Improving Computational Science Storage Access Through Continuous Characterization,” ACM Trans. Storage, vol. 7, no. 3, pp. 8:1–8:26, Oct. 2011.
     [7] Summit. [Online]. Available: https://www.olcf.ornl.gov/olcf-resources/compute-systems/summit/
     [8] TSUBAME 3.0. [Online]. Available: https://www.t3.gsic.titech.ac.jp/en/hardware
     [9] ABCI. [Online]. Available: https://abci.ai/en/about_abci/computing_resource.html
     [25] O. Tatebe, K. Hiraga, and N. Soda, “Gfarm grid file system,” New Generation Computing, vol. 28, no. 3, pp. 257–275, Jul. 2010.

  13. Contact Information
     ● sugihara@hpcs.cs.tsukuba.ac.jp
     ● tatebe@cs.tsukuba.ac.jp
