SLIDE 1

Integration of Burst Buffer in High-level Parallel I/O Library for Exa-scale Computing Era

SC 2018 PDSW workshop

Kaiyuan Hou, Reda Al-Bahrani, Esteban Rangel, Ankit Agrawal, Robert Latham, Robert Ross, Alok Choudhary, and Wei-keng Liao

SLIDE 2
  • Background & Motivation
  • Our idea – aggregation on burst buffer
  • Benefits
  • Challenges
  • Summary of results

Overview


SLIDE 3
  • Huge data sizes
  • >10 PB of system memory
  • Data generated by applications are of similar magnitude
  • I/O speed cannot keep up with the growth in data size
  • Parallel File System (PFS) architecture is not scalable
  • Burst buffers are being introduced into the I/O hierarchy
  • Built from newer hardware such as SSDs, non-volatile RAM, etc.
  • Tries to bridge the performance gap between computing and I/O
  • The role and potential of the burst buffer have not been fully explored
  • How can the burst buffer help improve I/O performance?

I/O in the Exa-scale Era


SLIDE 4
  • PFSs are built from rotating hard disks
  • High capacity, low speed
  • Usually used as the main storage on supercomputers
  • Sequential access is fast while random access is slow
  • Handling large requests is more efficient than handling small ones
  • Burst buffers are built from SSDs or NVM
  • Higher speed, lower capacity
  • I/O aggregation on the burst buffer
  • Gather write requests on the burst buffer
  • Reorder the requests into sequential order
  • Combine all requests into one large request

I/O Aggregation Using the Burst Buffer


SLIDE 5
  • LogFS [1]
  • I/O aggregation library using a low-level offset-and-length data representation
  • Simpler implementation
  • Does not preserve the structure of the data
  • Log-based data structure for recording write operations
  • Data Elevator [2]
  • A user-level library that moves files buffered on the burst buffer to the PFS
  • The file is written to the burst buffer as-is and copied to the PFS later
  • Does not alter the I/O pattern on the burst buffer
  • Works only on a shared burst buffer
  • Faster than moving the file with system utilities at large scale
  • When the number of nodes exceeds the number of burst buffer servers

Related Work


[1] D. Kimpe, R. Ross, S. Vandewalle, and S. Poedts, "Transparent Log-based Data Storage in MPI-IO Applications," in Proceedings of the 14th European Conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface, Paris, 2007.
[2] B. Dong, et al., "Data Elevator: Low-Contention Data Movement in Hierarchical Storage System," in 2016 IEEE 23rd International Conference on High Performance Computing (HiPC), 2016.

SLIDE 6

About PnetCDF

  • High-level I/O library
  • Built on top of MPI-IO
  • Abstract data description
  • Enables parallel access to NetCDF-formatted files
  • Consists of I/O modules, called drivers, that interface with lower-level libraries
  • https://github.com/Parallel-NetCDF/PnetCDF


Picture courtesy of: Li, Jianwei, et al. "Parallel netCDF: A High-Performance Scientific I/O Interface." Proceedings of the ACM/IEEE SC 2003 Conference (SC'03), 2003.
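To make the abstraction concrete, the following is a minimal PnetCDF write sketch (illustrative only, not taken from the slides; the file name, sizes, and omitted error handling are assumptions): each process writes its own sub-array of a shared 2-D variable with one collective call, which is exactly the kind of high-level request the burst buffer driver later records.

```c
#include <mpi.h>
#include <pnetcdf.h>

int main(int argc, char **argv) {
    int rank, nprocs, ncid, dimid[2], varid;
    MPI_Offset start[2], count[2];
    float buf[100];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Collectively create a NetCDF file (CDF-5 format for large variables) */
    ncmpi_create(MPI_COMM_WORLD, "out.nc", NC_CLOBBER | NC_64BIT_DATA,
                 MPI_INFO_NULL, &ncid);

    /* Define a 2-D variable partitioned along its first dimension */
    ncmpi_def_dim(ncid, "y", (MPI_Offset)nprocs * 10, &dimid[0]);
    ncmpi_def_dim(ncid, "x", 10, &dimid[1]);
    ncmpi_def_var(ncid, "var", NC_FLOAT, 2, dimid, &varid);
    ncmpi_enddef(ncid);

    /* Each process writes a 10 x 10 sub-array: one high-level request */
    start[0] = (MPI_Offset)rank * 10;  start[1] = 0;
    count[0] = 10;                     count[1] = 10;
    for (int i = 0; i < 100; i++) buf[i] = (float)rank;
    ncmpi_put_vara_float_all(ncid, varid, start, count, buf);

    ncmpi_close(ncid);
    MPI_Finalize();
    return 0;
}
```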

SLIDE 7

I/O Aggregation in PnetCDF


[Figure: PnetCDF I/O stack. The user application calls PnetCDF; the dispatcher routes requests to I/O drivers. The MPI-IO driver writes to the parallel file system through MPI-IO, while the burst buffer driver logs requests to the burst buffer through POSIX I/O and later flushes them to the parallel file system through MPI-IO.]
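In practice the burst buffer driver is selected through MPI Info hints at file creation. The sketch below is hedged: the hint names follow the burst buffering documentation in the PnetCDF repository (added around v1.11) and should be verified against the installed version, and the directory path is only an example.

```c
#include <mpi.h>
#include <pnetcdf.h>

/* Create a file with burst buffering enabled.  Hint names are assumed from
 * the PnetCDF burst buffer documentation; the directory is an example path. */
int create_with_burst_buffer(const char *path, int *ncid)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "nc_burst_buf", "enable");               /* route writes through the burst buffer driver */
    MPI_Info_set(info, "nc_burst_buf_dirname", "/local/ssd");   /* directory for log files on the burst buffer  */
    MPI_Info_set(info, "nc_burst_buf_del_on_close", "enable");  /* clean up log files after they are flushed    */

    int err = ncmpi_create(MPI_COMM_WORLD, path, NC_CLOBBER | NC_64BIT_DATA,
                           info, ncid);
    MPI_Info_free(&info);
    return err;  /* subsequent puts are logged on the burst buffer and flushed to the PFS at close */
}
```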

SLIDE 8

Recording Write Requests

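The slide's diagram is not reproduced here. As a rough illustration (names and layout invented for this sketch, not the actual PnetCDF data structures), a metadata log entry that preserves the structure of a write request could look like the following, with the raw data appended separately to a data log on the burst buffer:

```c
#include <mpi.h>

#define MAX_NDIMS 8   /* illustrative cap on variable dimensionality */

/* Hypothetical metadata log entry: instead of flattened (offset, length)
 * pairs, the variable ID and the sub-array coordinates of each put call are
 * recorded, so the high-level structure of the request survives. */
typedef struct {
    int        varid;              /* which NetCDF variable was written        */
    int        ndims;              /* dimensionality of that variable          */
    MPI_Offset start[MAX_NDIMS];   /* starting corner of the sub-array         */
    MPI_Offset count[MAX_NDIMS];   /* extent of the sub-array per dimension    */
    MPI_Offset data_off;           /* where the raw data sits in the data log  */
    MPI_Offset data_len;           /* number of bytes appended to the data log */
} log_entry_t;
```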

SLIDE 9
  • Retains the structure of the original data
  • Most scientific data are sub-arrays of high-dimensional arrays
  • Performance optimization
  • Can be used to support other operations such as in-situ analysis
  • Lower memory footprint
  • One high-level request can translate into many offset/length pairs
  • More complex operations to record
  • Not as simple as offsets and lengths
  • Must follow the constraints of the lower-level library
  • Less freedom to manipulate raw data

Compared to Lower-level Approach


SLIDE 10

Generating Aggregated Request

  • Limitation of MPI-IO
  • The flattened offsets of an MPI write call must be monotonically non-decreasing (see the sketch below)
  • Cannot simply stack high-level requests together
  • May violate the requirement
  • Offsets must be sorted into non-decreasing order
  • Performance issue on large data

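A sketch of the constraint and the straightforward fix (function and variable names are hypothetical, not from the PnetCDF source): flattened pieces are sorted into non-decreasing file-offset order before one aggregated collective write is issued, and the memory buffer must hold the data packed in that same order, e.g. by reading the data log back piece by piece after sorting.

```c
#include <mpi.h>
#include <stdlib.h>

typedef struct {
    MPI_Aint disp;   /* file offset of one flattened piece */
    int      len;    /* its length in bytes                */
} piece_t;

static int cmp_piece(const void *a, const void *b) {
    MPI_Aint x = ((const piece_t *)a)->disp, y = ((const piece_t *)b)->disp;
    return (x > y) - (x < y);
}

/* 'buf' must contain the data packed in the same (sorted) order. */
void write_aggregated(MPI_File fh, piece_t *p, int n, const void *buf,
                      int total_bytes)
{
    qsort(p, n, sizeof(piece_t), cmp_piece);   /* enforce non-decreasing offsets */

    int      *lens  = malloc(n * sizeof(int));
    MPI_Aint *disps = malloc(n * sizeof(MPI_Aint));
    for (int i = 0; i < n; i++) { lens[i] = p[i].len; disps[i] = p[i].disp; }

    MPI_Datatype ftype;
    MPI_Type_create_hindexed(n, lens, disps, MPI_BYTE, &ftype);
    MPI_Type_commit(&ftype);

    MPI_File_set_view(fh, 0, MPI_BYTE, ftype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, buf, total_bytes, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_Type_free(&ftype);
    free(lens); free(disps);
}
```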

SLIDE 11

2-stage Reordering Strategy

  • Group the requests
  • Requests from different groups never interleave with each other
  • Requests within a group may interleave with each other
  • Perform sorting on whole groups
  • Without breaking requests up into offsets
  • Perform sorting within each group
  • Break up the requests (a sketch of the two stages follows below)

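A hypothetical sketch of the two-stage idea (names are invented for illustration, not the actual implementation): stage 1 sorts whole requests by their starting offset without flattening them; stage 2 flattens and sorts only the groups whose byte ranges actually interleave.

```c
#include <mpi.h>
#include <stdlib.h>

typedef struct {
    MPI_Offset begin, end;   /* first and last byte this request touches      */
    /* ... handle to the logged high-level request, omitted for brevity ...   */
} request_t;

static int cmp_begin(const void *a, const void *b) {
    MPI_Offset x = ((const request_t *)a)->begin;
    MPI_Offset y = ((const request_t *)b)->begin;
    return (x > y) - (x < y);
}

void two_stage_reorder(request_t *req, int n)
{
    if (n == 0) return;

    /* Stage 1: order requests by starting offset without breaking them up. */
    qsort(req, n, sizeof(request_t), cmp_begin);

    /* Stage 2: scan for runs of consecutive requests whose byte ranges
     * overlap; only those runs are flattened to offset/length pieces. */
    int        g_start = 0;
    MPI_Offset g_end   = req[0].end;
    for (int i = 1; i <= n; i++) {
        if (i == n || req[i].begin > g_end) {       /* group boundary reached */
            if (i - g_start > 1) {
                /* flatten_and_sort(&req[g_start], i - g_start);  hypothetical stage-2 work */
            }
            if (i < n) { g_start = i; g_end = req[i].end; }
        } else if (req[i].end > g_end) {
            g_end = req[i].end;                     /* request extends the current group */
        }
    }
}
```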

SLIDE 12
  • Cori at NERSC
  • Cray DataWarp – a shared burst buffer
  • Theta at ALCF
  • Local burst buffer made of node-local SSDs
  • Comparison with other approaches
  • PnetCDF collective I/O without aggregation
  • Data Elevator
  • Cray DataWarp stage-out functions
  • LogFS
  • Comparing different log-to-process mappings

Experiment


SLIDE 13

Benchmarks


[Figure: Access patterns of the benchmarks, showing how file blocks are assigned to processes P0-P2 across I/O rounds for IOR contiguous, IOR strided, and FLASH.]

Picture courtesy of: Liao, Wei-keng, et al. "Using MPI File Caching to Improve Parallel Write Performance for Large-scale Scientific Applications." Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC'07), 2007.

SLIDE 14

Cori – Shared Burst Buffer


[Figure: I/O bandwidth (GiB/s) on Cori's shared burst buffer vs. transfer size (IOR contiguous, 512 processes) and vs. number of processes (IOR strided with 8 MiB transfers, FLASH I/O checkpoint file, and BTIO strong scaling), comparing the Burst Buffer Driver, LogFS, PnetCDF Raw, DataWarp Stage Out, and LogFS Approximate.]

SLIDE 15

Cori – Shared Burst Buffer


[Figure: Execution time (sec.) on Cori for IOR contiguous (512 processes), IOR strided (8 MiB transfers), FLASH I/O checkpoint file, and BTIO strong scaling, comparing the Burst Buffer Driver with Data Elevator.]

SLIDE 16

Theta – Local Burst Buffer


[Figure: I/O bandwidth (GiB/s) on Theta's local burst buffer vs. transfer size (IOR contiguous, 1 K processes) and vs. number of processes (IOR strided with 8 MiB transfers, FLASH I/O checkpoint file, and BTIO strong scaling), comparing the Burst Buffer Driver, LogFS, PnetCDF Raw, and LogFS Approximate.]

SLIDE 17

Impact of Log-to-Process Mapping

  • Use log per node on a shared burst buffer
  • Metadata server bottleneck when creating a large number of files
  • Use log per process on a local burst buffer
  • Reduces file-sharing overhead
  • Use the local burst buffer if available
  • Configure DataWarp to private mode (see the example directive below)


[Figure: Time (sec.) spent on log file init, write, and read for FLASH I/O at 256 to 4 K processes, on Cori (A: log per node, B: log per process with private-mode DataWarp, C: log per process) and on Theta (A: log per node, B: log per process).]
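For reference, a DataWarp allocation on Cori is requested with a #DW directive in the batch script; access_mode=private gives each compute node its own namespace, which is the private-mode configuration referred to above. The sketch below is an assumption-laden example: the node count, capacity, and executable name are illustrative, and the exact directive options should be checked against the site's DataWarp documentation.

```bash
#!/bin/bash
#SBATCH -N 16                      # illustrative node count
#DW jobdw type=scratch access_mode=private capacity=512GiB
# With access_mode=striped the allocation is shared by all nodes; the mount
# point is exported in $DW_JOB_PRIVATE (or $DW_JOB_STRIPED for shared mode).
srun ./flash_io_benchmark
```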

SLIDE 18
  • The burst buffer opens up new opportunities for I/O aggregation
  • Aggregation in a high-level I/O library is effective at improving performance
  • The concept can be applied to other high-level I/O libraries
  • HDF5, NetCDF-4, etc.
  • Performance improvements
  • Overlap burst buffer and PFS I/O
  • Reading from the burst buffer and writing to the PFS can be pipelined
  • Support reading from the log without flushing
  • Reduces the number of flush operations

Conclusion and Future work


SLIDE 19

Thank You

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.