SCR and Preparing for Burst Buffers

  1. SCR and Preparing for Burst Buffers. DOE COE Performance Portability Meeting, August 23, 2017. Elsa Gonsiorowski. LLNL-PRES-737156. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

  2. Outline: Burst Buffer Technologies; SCR Overview; Burst Buffers and SCR; Additional Software Projects.

  4. Burst Buffer Technologies

     | Type           | Technology    | Location       |
     |----------------|---------------|----------------|
     | Node Local     | IBM BBAPI     | LLNL (Sierra)  |
     | Machine Global | Cray DataWarp | LANL (Trinity) |

     How can an application utilize this layer for I/O workloads?

  5. Burst Buffers Use Case: Relies on integration with resource scheduler. Different for machine-global vs. node-local storage. Does not address inter-job data movement.

  6. Burst Buffers Use Case: Perfect for Checkpoint/Restart.

  11. Checkpoint Restart, a.k.a. Defensive I/O: Related to the size of system memory. Depends on resiliency of machine, which may change over time. Creating a checkpoint may not be as efficient as recomputing.
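
The last point on this slide can be made concrete with a break-even check. The sketch below is a deliberately simple cost model that is not part of the slides: the function name, its parameters, and the idea of a single per-interval failure probability are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical cost model (not from the slides): a checkpoint is worth
 * writing only if it costs less than the work we expect to lose, and
 * therefore recompute, without it. All times are in seconds. */
bool checkpoint_is_worthwhile(double write_cost_s,
                              double work_at_risk_s,
                              double failure_probability)
{
    assert(failure_probability >= 0.0 && failure_probability <= 1.0);

    /* Expected recompute time if we skip the checkpoint. */
    double expected_recompute_s = failure_probability * work_at_risk_s;

    return write_cost_s < expected_recompute_s;
}
```

Under this model, a 60-second checkpoint protecting an hour of work at a 5% failure chance pays off (expected loss of 180 seconds), while the same checkpoint protecting ten minutes of work does not (expected loss of 30 seconds). A faster storage layer lowers the write cost and shifts that balance.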

  13. SCR Goal: Enable checkpointing applications to take advantage of system storage hierarchies. Efficient file movement between storage layers. Data redundancy operations.

  14. SCR Components

  15. SCR Component: Backend Library. Redirect application files; synchronous & asynchronous flush operations; hardware-specific capabilities; data redundancy; support for both checkpoint & output data.
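
The slide does not say how the data redundancy works. One common approach for node-local checkpoints is XOR parity across a group of nodes, sketched generically below. This is an illustration of the technique, not SCR's implementation, and the function names and the flat buffer layout are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* `data` holds `count` equally sized blocks laid out back to back,
 * standing in for the checkpoint pieces held by `count` nodes. */

/* XOR every block together into `parity` (block_len bytes). */
void xor_parity(const unsigned char *data, size_t count,
                size_t block_len, unsigned char *parity)
{
    memset(parity, 0, block_len);
    for (size_t b = 0; b < count; b++)
        for (size_t i = 0; i < block_len; i++)
            parity[i] ^= data[b * block_len + i];
}

/* Rebuild block `lost` into `out` from the parity and the surviving
 * blocks; the bytes stored at the lost block's position are ignored. */
void xor_rebuild(const unsigned char *data, size_t count,
                 size_t block_len, const unsigned char *parity,
                 size_t lost, unsigned char *out)
{
    assert(lost < count);
    memcpy(out, parity, block_len);
    for (size_t b = 0; b < count; b++) {
        if (b == lost)
            continue;
        for (size_t i = 0; i < block_len; i++)
            out[i] ^= data[b * block_len + i];
    }
}
```

With one parity block per group, the group can lose any single member and still restart from node-local storage, at the cost of one extra block of space per group.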

  16. SCR Component: Backend Library. int rc = MyApp_Checkpoint(path);

  17. SCR Component: Backend Library. SCR_Route_file(path, newpath); int rc = MyApp_Checkpoint(newpath);

  18. SCR Component: Backend Library. SCR_Start_output("dataset name", flags); SCR_Route_file(path, newpath); int rc = MyApp_Checkpoint(newpath); SCR_Complete_output(rc);
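
The three calls on this slide can be put together as a single function. So that the sketch compiles without SCR installed, the SCR functions and constants below are stand-in stubs. A real application would include `scr.h` and link the SCR library instead. The constant values, the `/cache` prefix, the dataset name `ckpt.1`, and the body of `MyApp_Checkpoint` are all placeholders, not SCR's actual behavior.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-ins so the sketch compiles without SCR. Values and behavior
 * here are placeholders; use scr.h and the SCR library in a real build. */
#define SCR_SUCCESS 0
#define SCR_MAX_FILENAME 1024
#define SCR_FLAG_CHECKPOINT 1

int SCR_Start_output(const char *name, int flags)
{
    (void)name;
    (void)flags;
    return SCR_SUCCESS;
}

int SCR_Route_file(const char *name, char *file)
{
    /* Placeholder routing: pretend the cache lives under /cache. */
    snprintf(file, SCR_MAX_FILENAME, "/cache/%s", name);
    return SCR_SUCCESS;
}

int SCR_Complete_output(int valid)
{
    return valid ? SCR_SUCCESS : 1;
}

/* Hypothetical application writer: returns nonzero if the file is valid. */
int MyApp_Checkpoint(const char *file)
{
    return file != NULL && file[0] != '\0';
}

/* The sequence from the slide: open a dataset, ask SCR where to write,
 * write there, then report whether this process's file is valid. */
int write_checkpoint(const char *path, char *routed)
{
    assert(path != NULL && routed != NULL);
    if (SCR_Start_output("ckpt.1", SCR_FLAG_CHECKPOINT) != SCR_SUCCESS)
        return 1;
    SCR_Route_file(path, routed);
    int valid = MyApp_Checkpoint(routed);
    return SCR_Complete_output(valid);
}
```

The sketch names the application's return value `valid` because, as far as SCR's documented API goes, the argument to `SCR_Complete_output` indicates whether the calling process wrote its files successfully. The application keeps its own file format and write routine; only the path it writes to changes.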

  22. SCR Component: Frontend Scripts. On Startup: locate most recent checkpoint and fetch for restart. Within Allocation: detect application crash or system failures and trigger restart. During Execution: manage datasets. Resource Scheduler Integration: pre- and post-stage data movement.
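
The "locate most recent checkpoint" step amounts to picking the newest checkpoint that was written completely. A minimal sketch of that selection follows; the `ckpt_entry` structure and `latest_complete` function are hypothetical and do not reflect how SCR's scripts or index files are actually organized.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical index entry; SCR's real bookkeeping differs. */
struct ckpt_entry {
    int id;        /* increasing checkpoint number */
    int complete;  /* nonzero if every process's files were written */
};

/* Return the newest complete checkpoint, or NULL if there is none. */
const struct ckpt_entry *latest_complete(const struct ckpt_entry *entries,
                                         size_t n)
{
    assert(entries != NULL || n == 0);

    const struct ckpt_entry *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (!entries[i].complete)
            continue;  /* skip checkpoints interrupted by a failure */
        if (best == NULL || entries[i].id > best->id)
            best = &entries[i];
    }
    return best;
}
```

Skipping incomplete entries matters because a failure can strike while a checkpoint is being written, leaving the newest dataset unusable.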

  24. SCR Component: Configurations. Define the levels of the hierarchy; define modes/groups of failure; define checkpointing and data residency needs. Machine Portability.

  25. Burst Buffers Use Case: Checkpoint Restart

  26. Burst Buffers & SCR: Prestage. Machine Global: solved; global access from compute nodes (CNs) to storage. Node Local: requires new software; requires deep integration with resource scheduler; most useful for DATs or half+ system jobs.

  27. Burst Buffers & SCR: Poststage. Similar solution for both BB types; takes advantage of the vendor APIs' asynchronous operations; decouples burst buffer usage from compute usage; requires integration with resource scheduler; allows for more fine-grained control of resources.

  28. Unaddressed Concerns: Applications without checkpointing; shared files; arbitrary data movement; machine-learning use case.

  29. VELOC: Combining two codes, FTI and SCR. FTI: variable-based checkpointing scheme. Will support existing FTI and SCR applications.

  30. UnifyCR: User-level file system; shared namespace across distributed burst buffers; I/O interception layer.

  31. MPI File Utils: Use parallel processes to perform file operations; executed within a job allocation. dbcast: broadcast from PFS to node-local storage; dcp: multiple file copy in parallel; drm: delete files in parallel; many more. https://github.com/hpc/mpifileutils

  32. SCR Team: Kathryn Mohror, Greg Becker, Adam Moody, Elsa Gonsiorowski. https://github.com/llnl/scr
