

  1. Integration of Burst Buffer in High-level Parallel I/O Library for Exa-scale Computing Era
     SC 2018 PDSW-DISCS Workshop
     Kaiyuan Hou, Reda Al-Bahrani, Esteban Rangel, Ankit Agrawal, Robert Latham, Robert Ross, Alok Choudhary, and Wei-keng Liao

  2. Overview
     • Background & motivation
     • Our idea: aggregation on the burst buffer
        - Benefits
        - Challenges
     • Summary of results

  3. I/O in the Exa-scale Era
     • Huge data size
        - >10 PB of system memory
        - Data generated by applications are of similar magnitude
     • I/O speed cannot keep up with the growth of data size
        - Parallel file system (PFS) architecture is not scalable
     • Burst buffers have been introduced into the I/O hierarchy
        - Built from new hardware such as SSDs, non-volatile RAM, etc.
        - Aim to bridge the performance gap between computing and I/O
     • The role and potential of the burst buffer have not been fully explored
        - How can the burst buffer help improve I/O performance?

  4. I/O Aggregation Using the Burst Buffer
     • PFSs are built from rotating hard disks
        - High capacity, low speed
        - Usually serve as the main storage on a supercomputer
        - Sequential access is fast while random access is slow
        - Handling a few large requests is more efficient than handling many small ones
     • Burst buffers are built from SSDs or NVMs
        - Higher speed, lower capacity
     • I/O aggregation on the burst buffer
        - Gather write requests on the burst buffer
        - Reorder the requests into sequential order
        - Combine all requests into one large request (a rough sketch of these steps follows below)
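As a rough illustration of the aggregation step (not code from the paper), the sketch below gathers buffered write requests, sorts them by file offset, and packs them into one contiguous buffer that can then be written sequentially to the PFS. The wreq_t struct and aggregate() function are hypothetical names.

```c
/* Minimal sketch of offset-ordered aggregation; names are illustrative. */
#include <stdlib.h>
#include <string.h>

typedef struct {
    long long offset;   /* file offset of the request            */
    size_t    len;      /* number of bytes                       */
    char     *buf;      /* data currently held on the burst buffer */
} wreq_t;

static int cmp_offset(const void *a, const void *b) {
    long long d = ((const wreq_t *)a)->offset - ((const wreq_t *)b)->offset;
    return (d > 0) - (d < 0);
}

/* Sort requests into sequential order and pack them into one buffer;
 * the caller then issues a single large write to the PFS. */
size_t aggregate(wreq_t *reqs, int n, char *out) {
    size_t used = 0;
    qsort(reqs, n, sizeof(wreq_t), cmp_offset);
    for (int i = 0; i < n; i++) {
        memcpy(out + used, reqs[i].buf, reqs[i].len);
        used += reqs[i].len;
    }
    return used;  /* total bytes to write sequentially */
}
```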

  5. Related Work
     • LogFS [1]
        - I/O aggregation library using a low-level offset-and-length data representation
           - Simpler implementation
           - Does not preserve the structure of the data
        - Log-based data structure for recording write operations
     • Data Elevator [2]
        - A user-level library that moves buffered files from the burst buffer to the PFS
        - The file is written to the burst buffer as is and copied to the PFS later
           - Does not alter the I/O pattern on the burst buffer
        - Works only on shared burst buffers
        - Faster than moving the file using system functions at large scale
           - When the number of compute nodes exceeds the number of burst buffer servers
     [1] D. Kimpe, R. Ross, S. Vandewalle, and S. Poedts, "Transparent Log-based Data Storage in MPI-IO Applications," in Proceedings of the 14th European Conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface, Paris, 2007.
     [2] B. Dong, et al., "Data Elevator: Low-Contention Data Movement in Hierarchical Storage System," in 2016 IEEE 23rd International Conference on High Performance Computing (HiPC), IEEE, 2016.

  6. About PnetCDF
     • High-level I/O library
        - Built on top of MPI-IO
        - Abstract data description
     • Enables parallel access to NetCDF-formatted files (a minimal usage example follows below)
     • Consists of I/O modules, called drivers, that interface with lower-level libraries
     • https://github.com/Parallel-NetCDF/PnetCDF
     Picture courtesy of: Li, Jianwei, et al. "Parallel netCDF: A High-Performance Scientific I/O Interface." Proceedings of the 2003 ACM/IEEE Conference on Supercomputing (SC'03), 2003.
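For readers unfamiliar with the library, here is a minimal PnetCDF write example in the spirit of this slide: each process defines and writes its own row of a shared 2-D variable in parallel. The file name and array sizes are illustrative, and error checking is omitted for brevity.

```c
/* Minimal PnetCDF parallel write sketch. */
#include <mpi.h>
#include <pnetcdf.h>

int main(int argc, char **argv) {
    int rank, nprocs, ncid, dimids[2], varid;
    MPI_Offset start[2], count[2];
    float buf[100];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* create a NetCDF file collectively */
    ncmpi_create(MPI_COMM_WORLD, "out.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "y", nprocs, &dimids[0]);
    ncmpi_def_dim(ncid, "x", 100, &dimids[1]);
    ncmpi_def_var(ncid, "data", NC_FLOAT, 2, dimids, &varid);
    ncmpi_enddef(ncid);

    /* each rank writes one row: a sub-array described by start/count */
    for (int i = 0; i < 100; i++) buf[i] = (float)rank;
    start[0] = rank; start[1] = 0;
    count[0] = 1;    count[1] = 100;
    ncmpi_put_vara_float_all(ncid, varid, start, count, buf);

    ncmpi_close(ncid);
    MPI_Finalize();
    return 0;
}
```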

  7. I/O Aggregation in PnetCDF
     [Architecture diagram: the user application calls PnetCDF, whose dispatcher selects among the I/O drivers; the MPI-IO driver writes through MPI-IO to the parallel file system, while the burst buffer driver writes through POSIX I/O to the burst buffer.]
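PnetCDF selects the burst buffer driver through MPI info hints. The hint names used in the sketch below ("nc_burst_buf" and "nc_burst_buf_dirname") are quoted from the PnetCDF burst-buffering documentation as best recalled and should be verified against the installed release; the mount path is an assumption.

```c
/* Sketch: enabling the burst buffer driver through MPI info hints.
 * Hint names should be verified against the PnetCDF release in use. */
#include <mpi.h>
#include <pnetcdf.h>

int create_with_burst_buffer(const char *path, int *ncidp) {
    MPI_Info info;
    MPI_Info_create(&info);
    /* route writes through the burst buffer driver ... */
    MPI_Info_set(info, "nc_burst_buf", "enable");
    /* ... and point it at the burst buffer mount (illustrative path) */
    MPI_Info_set(info, "nc_burst_buf_dirname", "/local/burst_buffer");

    int err = ncmpi_create(MPI_COMM_WORLD, path, NC_CLOBBER, info, ncidp);
    MPI_Info_free(&info);
    return err;
}
```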

  8. Recording Write Requests
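The figure on this slide is not reproduced here; as a stand-in, the struct below sketches the kind of information a high-level log entry would need to record: the variable, its start/count sub-array selection, and where the raw data landed in a separate data log on the burst buffer. The field names and layout are assumptions, not PnetCDF's actual log format.

```c
/* Hypothetical metadata-log entry for one high-level write request. */
#include <mpi.h>

#define MAX_NDIMS 8

typedef struct {
    int        varid;             /* which NetCDF variable         */
    int        ndims;             /* dimensionality of the access  */
    MPI_Offset start[MAX_NDIMS];  /* starting corner of sub-array  */
    MPI_Offset count[MAX_NDIMS];  /* extent along each dimension   */
    MPI_Offset data_off;          /* offset into the data log      */
    MPI_Offset data_len;          /* bytes appended to the data log */
} log_entry_t;
```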

  9. Compared to the Lower-level Approach
     • Retains the structure of the original data
        - Most scientific data are sub-arrays of high-dimensional arrays
        - Enables performance optimization
        - Can support other operations such as in-situ analysis
     • Lower memory footprint
        - One high-level request can translate into many offsets and lengths (see the illustration below)
     • More complex operations to record
        - Not as simple as offset and length
     • Must follow the constraints of the lower-level library
        - Less freedom to manipulate raw data
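A small worked example of the memory-footprint point, with illustrative sizes: a 1024 x 1024 sub-array written in row-major order is contiguous only along its last dimension, so a low-level representation needs one offset/length pair per row, while a single high-level (start, count) record covers the whole request.

```c
/* Illustration of the footprint argument; sizes are illustrative. */
#include <stdio.h>

int main(void) {
    long long count[2] = {1024, 1024};  /* sub-array written by one process */
    /* a row-major 2-D sub-array is contiguous only along its last dimension,
       so the low-level form needs count[0] offset/length pairs */
    long long pairs = count[0];
    printf("high-level records: 1, offset/length pairs: %lld\n", pairs);
    return 0;
}
```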

  10. Generating the Aggregated Request
     • Limitation of MPI-IO
        - The flattened offsets of an MPI write call must be monotonically non-decreasing
     • Cannot simply stack high-level requests together
        - Doing so may violate this requirement
     • Offsets must be sorted into order (a sketch follows below)
        - Performance issue on large data
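To make the constraint concrete, the sketch below builds an MPI-IO file type from offset/length segments with MPI_Type_create_hindexed after sorting them into non-decreasing order; in a real driver the write buffer would have to be permuted to match. The seg_t type and build_filetype() helper are hypothetical.

```c
/* Sketch: building a legal MPI-IO file type from aggregated segments. */
#include <mpi.h>
#include <stdlib.h>

typedef struct { MPI_Aint disp; int len; } seg_t;  /* one flattened piece */

static int cmp_seg(const void *a, const void *b) {
    MPI_Aint d = ((const seg_t *)a)->disp - ((const seg_t *)b)->disp;
    return (d > 0) - (d < 0);
}

/* Build a single file type from n offset/length segments.  The segments
 * are sorted first so the flattened displacements are monotonically
 * non-decreasing, as MPI-IO requires. */
void build_filetype(int n, seg_t *segs, MPI_Datatype *ftype) {
    MPI_Aint *disps = malloc(n * sizeof(MPI_Aint));
    int      *lens  = malloc(n * sizeof(int));

    qsort(segs, n, sizeof(seg_t), cmp_seg);
    for (int i = 0; i < n; i++) {
        disps[i] = segs[i].disp;
        lens[i]  = segs[i].len;
    }
    /* in a real driver the user data must be reordered to match this permutation */
    MPI_Type_create_hindexed(n, lens, disps, MPI_BYTE, ftype);
    MPI_Type_commit(ftype);

    free(disps);
    free(lens);
}
```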

  11. 2-stage Reordering Strategy
     • Group the requests
        - Requests from different groups never interleave with each other
        - Requests within a group may interleave with each other
     • Sort at the group level
        - Without breaking requests up into offsets
     • Sort within each group
        - Requests are broken up into offsets (a simplified sketch follows below)
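A simplified sketch of the two-stage idea, under the assumption that group boundaries have already been identified: stage 1 orders whole groups by their lowest offset without flattening anything, and stage 2 flattens and sorts offsets only inside each group. All types and the two_stage_sort() helper are hypothetical.

```c
/* Simplified two-stage reordering; the scratch buffer must be large
 * enough to hold all segments of the largest group. */
#include <stdlib.h>

typedef struct { long long off; long long len; } seg_t;               /* flattened piece    */
typedef struct { long long lo, hi; seg_t *segs; int nsegs; } req_t;   /* one write request  */
typedef struct { req_t *reqs; int nreqs; long long lo; } group_t;     /* interleaving group */

static int cmp_group(const void *a, const void *b) {
    long long d = ((const group_t *)a)->lo - ((const group_t *)b)->lo;
    return (d > 0) - (d < 0);
}
static int cmp_seg(const void *a, const void *b) {
    long long d = ((const seg_t *)a)->off - ((const seg_t *)b)->off;
    return (d > 0) - (d < 0);
}

/* Stage 1: order groups as whole units (cheap, no flattening needed).
 * Stage 2: only inside each group, flatten the requests and sort the
 * resulting offset/length segments. */
void two_stage_sort(group_t *groups, int ngroups, seg_t *scratch) {
    qsort(groups, ngroups, sizeof(group_t), cmp_group);
    for (int g = 0; g < ngroups; g++) {
        int n = 0;
        for (int r = 0; r < groups[g].nreqs; r++)
            for (int s = 0; s < groups[g].reqs[r].nsegs; s++)
                scratch[n++] = groups[g].reqs[r].segs[s];
        qsort(scratch, n, sizeof(seg_t), cmp_seg);
        /* segments of this group are now in non-decreasing offset order */
    }
}
```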

  12. Experiments
     • Cori at NERSC
        - Cray DataWarp, a shared burst buffer
     • Theta at ALCF
        - Local burst buffer made of SSDs
     • Comparison with other approaches
        - PnetCDF collective I/O without aggregation
        - Data Elevator
        - Cray DataWarp stage-out functions
        - LogFS
     • Comparison of different log-to-process mappings

  13. Benchmarks
     [Figure: access patterns of the benchmarks (IOR contiguous, IOR strided, and FLASH), showing how file blocks are assigned to processes P0-P2 across write rounds.]
     Picture courtesy of: Liao, Wei-keng, et al. "Using MPI File Caching to Improve Parallel Write Performance for Large-Scale Scientific Applications." Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC'07), IEEE, 2007.

  14. Cori – Shared Burst Buffer
     [Figure: I/O bandwidth (GiB/s) on Cori for four cases: IOR contiguous with 512 processes vs. transfer size (1/4 to 4 MiB), IOR strided with 8 MiB transfers vs. number of processes (256 to 4 K), FLASH-I/O checkpoint file vs. number of processes, and BTIO strong scaling vs. number of processes. Compared methods: Burst Buffer Driver, LogFS, PnetCDF Raw, DataWarp Stage Out, and LogFS Approximate.]

  15. Cori – Shared Burst Buffer
     [Figure: execution time (sec.) on Cori for the same four cases (IOR contiguous with 512 processes, IOR strided with 8 MiB transfers, FLASH-I/O checkpoint file, and BTIO strong scaling), comparing the Burst Buffer Driver with Data Elevator.]

  16. Theta – Local Burst Buffer
     [Figure: I/O bandwidth (GiB/s) on Theta for IOR strided with 8 MiB transfers, IOR contiguous with 1 K processes, FLASH-I/O checkpoint file, and BTIO strong scaling, comparing Burst Buffer Driver, LogFS, PnetCDF Raw, and LogFS Approximate.]

  17. Impact of Log-to-Process Mapping
     • Use one log per node on a shared burst buffer
        - The metadata server becomes a bottleneck when a large number of files are created
     • Use one log per process on a local burst buffer
        - Reduces file-sharing overhead
     • Use the local burst buffer if available
        - Configure DataWarp in private mode
     [Figure: FLASH-I/O time (sec.) on Cori and Theta, broken down into log file init, write, and read, for 256 to 4 K processes; Cori compares A: log per node, B: log per process (private DataWarp), and C: log per process, while Theta compares A: log per node and B: log per process.]
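The sketch below illustrates the two mappings discussed on this slide by constructing a per-node or per-process log file name; it uses MPI_Comm_split_type to group the ranks that share a node. The directory and file-name patterns are illustrative, not the driver's actual naming scheme.

```c
/* Sketch: per-node vs. per-process log file naming. */
#include <mpi.h>
#include <stdio.h>

void log_file_name(int per_node, const char *bb_dir, char *name, size_t len) {
    int rank, node_rank, node_id;
    MPI_Comm node_comm;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* group ranks that share a node (and hence a node-local burst buffer) */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* identify the node by the lowest global rank it contains */
    node_id = (node_rank == 0) ? rank : 0;
    MPI_Bcast(&node_id, 1, MPI_INT, 0, node_comm);

    if (per_node)
        snprintf(name, len, "%s/log_node_%d", bb_dir, node_id);  /* shared per node     */
    else
        snprintf(name, len, "%s/log_rank_%d", bb_dir, rank);     /* private per process */

    MPI_Comm_free(&node_comm);
}
```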

  18. Conclusion and Future Work
     • The burst buffer opens up new opportunities for I/O aggregation
     • Aggregation in a high-level I/O library effectively improves performance
        - The concept can be applied to other high-level I/O libraries, e.g. HDF5, NetCDF-4
     • Future performance improvements
        - Overlap burst buffer and PFS I/O
           - Reading from the burst buffer and writing to the PFS can be pipelined (a sketch follows below)
        - Support reading from the log without flushing
           - Reduces the number of flush operations
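A rough sketch of the pipelining idea listed under future work, not an implementation from the paper: while a chunk just read from the burst buffer is being written to the PFS with a non-blocking MPI-IO call, the next chunk is read. The chunk size, file handles, and double-buffering scheme are assumptions.

```c
/* Sketch: overlapping burst buffer reads with PFS writes. */
#include <mpi.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (4 * 1024 * 1024)   /* illustrative chunk size: 4 MiB */

/* bb_fd:  POSIX descriptor of the log file on the burst buffer
 * pfs_fh: MPI-IO handle of the destination file on the PFS
 * total:  number of bytes to move */
void pipelined_flush(int bb_fd, MPI_File pfs_fh, MPI_Offset total) {
    char *cur = malloc(CHUNK), *prev = malloc(CHUNK);
    MPI_Request req = MPI_REQUEST_NULL;
    MPI_Offset off = 0;

    while (off < total) {
        MPI_Offset n = (total - off < CHUNK) ? total - off : CHUNK;
        ssize_t rd = pread(bb_fd, cur, (size_t)n, (off_t)off);  /* read next chunk from the BB */
        if (rd <= 0) break;                                     /* sketch: no error recovery   */
        MPI_Wait(&req, MPI_STATUS_IGNORE);                      /* finish the previous PFS write */
        MPI_File_iwrite_at(pfs_fh, off, cur, (int)rd, MPI_BYTE, &req);  /* start writing this chunk */
        char *t = cur; cur = prev; prev = t;                    /* double-buffer swap */
        off += rd;
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    free(cur);
    free(prev);
}
```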

  19. Thank You
     This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.
