SLIDE 1

Design of Locality-aware MPI-IO for Scalable Shared File Write Performance (HPS 2020)

Kohei Sugihara1, Osamu Tatebe2

1 Department of Computer Science, University of Tsukuba
2 Center for Computational Sciences, University of Tsukuba

SLIDE 2

Background

  • Single Shared File (SSF)

○ Multiple processes access a single shared file (⇔ File Per Process; FPP)
○ A typical I/O access pattern in HPC applications [6]
○ Used to reduce the number of result files in a large-scale job (a minimal MPI-IO sketch of the pattern appears at the end of this slide)

  • Node-local Storage

○ Installed in the compute nodes of recent supercomputers [7, 8, 9]
○ Used as a temporary read cache for File Per Process (FPP)
○ Can minimize communication cost
○ How to use it for a Single Shared File is not obvious

  • Fig. File Per Process (P0 and P1 each write their own file)
  • Fig. Single Shared File (P0 and P1 write one shared file)
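
The following is a minimal, illustrative sketch of the SSF write pattern, not code from the slides; the file name ssf.out and the 1 MiB per-process block size are assumptions. Every rank opens the same file and writes one non-overlapping block at offset rank × block size.

    /* SSF sketch: all ranks open one shared file and each writes a
     * disjoint block at offset rank * BLOCK_SIZE.
     * File name and block size are illustrative assumptions. */
    #include <mpi.h>
    #include <string.h>

    #define BLOCK_SIZE (1 << 20)   /* 1 MiB per process (illustrative) */

    int main(int argc, char **argv)
    {
        int rank;
        static char buf[BLOCK_SIZE];
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, 'a' + (rank % 26), BLOCK_SIZE);

        /* Single Shared File: every rank opens the same file. */
        MPI_File_open(MPI_COMM_WORLD, "ssf.out",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);

        /* Non-overlapping writes: one block per rank. */
        MPI_File_write_at_all(fh, (MPI_Offset)rank * BLOCK_SIZE,
                              buf, BLOCK_SIZE, MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }

With FPP, each rank would instead open its own file (e.g. on MPI_COMM_SELF); SSF keeps the output in one file at the cost of coordinating accesses to shared file-system resources.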
SLIDE 3

Problem

  • Single Shared File access is slow

○ Reason: lock contention
○ The block or stripe size of the file system does not match the access size of the application (see the worked example at the end of this slide)

  • Node-local storage cannot be used in a locality-aware way

○ Reason: most existing file systems employ file striping
○ File striping consumes network bandwidth on the compute-node side

  • Our goal: achieve scalable write bandwidth for shared-file access using node-local storage, without file striping

  • Fig. Lock contention in file striping (P0–P3 accessing a striped shared file)
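
As an illustrative worked example of this mismatch (the numbers are ours, not from the slides): suppose the file-system stripe size is 1 MiB and four processes each write 256 KiB at offsets 0, 256 KiB, 512 KiB, and 768 KiB. Each write falls in stripe ⌊offset / 1 MiB⌋ = 0, so all four writes target the same stripe on the same OST and must serialize on that stripe's lock; the smaller the access size relative to the stripe size, the more processes contend for each lock.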
SLIDE 4

Approach

  • Avoid lock contention in Single Shared File access

○ Most HPC applications do not access overlapping regions of an SSF
○ Locking among parallel I/O requests is therefore not essential
○ Approach: we propose a lockless format for SSF representation

  • Utilize locality

○ Confine locking of I/O requests to within the compute node
○ Place files mostly in node-local storage and minimize remote communication for file access
○ Approach: utilize the Gfarm file system for locality-oriented file placement


SLIDE 5

Proposal: Sparse Segments

  • Sparse Segments: An internal file format for a single shared file

○ Each process creates its corresponding segment, which is expected to be stored in node-local storage (see the sketch at the end of this slide)

  • Fig. Sparse Segments for P0 and P1: (a) N-1 Segmented w/o Resize, (b) N-1 Strided w/ Resize, (c) N-1 Strided w/ Resize; unwritten regions remain as holes in each per-process segment file
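
A hedged sketch of the Sparse Segments idea under our own assumptions (the X.<rank> naming, the strided pattern, and the 64 KiB block size are illustrative, not the paper's exact on-disk format): each rank writes its SSF data at the original global offsets, but into its own per-rank segment file, so regions owned by other ranks remain holes in a sparse file and only written blocks consume space.

    /* Sparse Segments sketch (illustrative, not the exact on-disk format):
     * rank r writes its strided SSF blocks at their global offsets into its
     * own segment file X.r, leaving holes where other ranks' data would be. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        int rank = 1, nprocs = 2;       /* stand-ins for the MPI rank/size */
        const size_t block = 1 << 16;   /* 64 KiB per write (illustrative) */
        char seg_name[64], buf[1 << 16];
        memset(buf, 'A' + rank, sizeof(buf));

        /* One segment file per rank: "X.0", "X.1", ... */
        snprintf(seg_name, sizeof(seg_name), "X.%d", rank);
        int fd = open(seg_name, O_CREAT | O_WRONLY, 0644);

        /* Strided N-1 pattern: rank r owns blocks r, r+nprocs, r+2*nprocs, ...
         * Writing at the global offset keeps the SSF layout inside the
         * segment; the unwritten blocks remain holes in the sparse file. */
        for (int i = 0; i < 4; i++) {
            off_t global_off = (off_t)(i * nprocs + rank) * block;
            pwrite(fd, buf, block, global_off);
        }
        close(fd);
        return 0;
    }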

SLIDE 6

Implementation: Locality-aware MPI-IO

  • Locality-aware MPI-IO: An MPI-IO optimization for Shared File Access

○ Implicitly converts an SSF into Sparse Segments
○ Generates the Sparse Segments in MPI_File_open() (a code sketch follows the figure below)

  • Fig. (a) Conventional MPI-IO: P0 and P1 both call MPI_File_open(X) and open the same file X. (b) Ours, Locality-aware MPI-IO: MPI_File_open(X) opens the per-rank segment files X.0 and X.1.
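
A hedged sketch of the name translation illustrated in (b); the helper open_sparse_segment() is our own illustrative name, and the real conversion happens inside the MPI-IO implementation rather than in a user-level wrapper. On open, each rank is redirected to its own segment file X.<rank> on MPI_COMM_SELF, so later writes stay rank-private and, with Gfarm, node-local.

    /* Sketch only: redirect an open of shared file X to the per-rank
     * segment X.<rank>, opened on MPI_COMM_SELF.  This wrapper just shows
     * the name translation performed at MPI_File_open() time. */
    #include <mpi.h>
    #include <stdio.h>

    static int open_sparse_segment(MPI_Comm comm, const char *filename,
                                   int amode, MPI_Info info, MPI_File *fh)
    {
        int rank;
        char seg_name[1024];

        MPI_Comm_rank(comm, &rank);
        /* Shared-file name X becomes the per-rank segment name X.<rank>. */
        snprintf(seg_name, sizeof(seg_name), "%s.%d", filename, rank);

        /* Each segment is private to its rank, so no cross-node locking is
         * needed; with Gfarm it can be placed on that node's local storage. */
        return MPI_File_open(MPI_COMM_SELF, seg_name, amode, info, fh);
    }

    int main(int argc, char **argv)
    {
        MPI_File fh;
        MPI_Init(&argc, &argv);
        open_sparse_segment(MPI_COMM_WORLD, "X",
                            MPI_MODE_CREATE | MPI_MODE_WRONLY,
                            MPI_INFO_NULL, &fh);
        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }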

SLIDE 7

Locality-oriented File Placement

  • Store each Sparse Segment in Gfarm [25]

○ Gfarm is a locality-oriented parallel file system
○ Gfarm automatically stores a full copy of each file on the storage nearest to the process
○ Each Sparse Segment is therefore stored in the corresponding node's local storage

  • Fig. Locality-oriented file placement: MPI_File_open(X) on P0/P1 opens X.0/X.1, and the Gfarm file system places File X.0 on Node #0's local storage and File X.1 on Node #1's local storage

SLIDE 8

Experiment

  • Method

○ Issue write accesses to a single shared file (weak scaling: per-process write size fixed, total size grows with the number of processes)

  • Application

○ Microbenchmark: IOR
○ Application benchmarks: S3D-IO, LES-IO, VPIC-IO

  • Environment

○ System: TSUBAME 3.0 supercomputer [8] at Tokyo Tech
  ■ Proposal: node-local storage on each compute node
  ■ Lustre: file system nodes, 68 OSTs (peak 50 GB/s; not an apples-to-apples comparison)
  ■ BeeOND: node-local storage on each compute node


SLIDE 9

Experiment

  • The proposed method scales in all benchmarks (aggregate write bandwidth)


○ Lustre bandwidth saturates when the # of processes exceeds the # of OSTs
○ The proposal is scalable
○ BeeOND is not scalable even though it uses the same node-local storage

  • Fig. Aggregate write bandwidth: IOR (non-collective), S3D-IO, LES-IO, VPIC-IO

SLIDE 10

Discussion

  • Lustre and BeeOND show low bandwidth in the application benchmarks

○ Small, piecewise accesses to the SSF cause frequent lock contention among processes

  • Our method demonstrated linearly scalable bandwidth

○ Conversion to Sparse Segments
  ■ All application benchmarks scale
  ■ Result: lock contention is successfully avoided
○ Locality-aware file placement
  ■ All benchmarks scale linearly, even beyond the # of OST nodes
  ■ Result: node-local storage is used successfully and effectively for scaling


SLIDE 11

Conclusion

  • We proposed Sparse Segments and Locality-aware MPI-IO
  • Our method demonstrates scalable parallel write

○ In both the microbenchmark and application benchmarks


SLIDE 12

References

[6] P. Carns, K. Harms, W. Allcock, C. Bacon, S. Lang, R. Latham, and R. Ross, "Understanding and Improving Computational Science Storage Access Through Continuous Characterization," ACM Trans. Storage, vol. 7, no. 3, pp. 8:1–8:26, Oct. 2011.
[7] Summit. [Online]. Available: https://www.olcf.ornl.gov/olcf-resources/compute-systems/summit/
[8] TSUBAME 3.0. [Online]. Available: https://www.t3.gsic.titech.ac.jp/en/hardware
[9] ABCI. [Online]. Available: https://abci.ai/en/about_abci/computing_resource.html
[25] O. Tatebe, K. Hiraga, and N. Soda, "Gfarm grid file system," New Generation Computing, vol. 28, no. 3, pp. 257–275, Jul. 2010.


SLIDE 13

Contact Information

  • sugihara@hpcs.cs.tsukuba.ac.jp
  • tatebe@cs.tsukuba.ac.jp
