  1. echofs: Enabling Transparent Access to Node-local NVM Burst Buffers for Legacy Applications [… with the scheduler's collaboration]
     Alberto Miranda, PhD, Researcher on HPC I/O, alberto.miranda@bsc.es (www.bsc.es)
     Dagstuhl, May 2017
     The NEXTGenIO project has received funding from the European Union's Horizon 2020 Research and Innovation programme under Grant Agreement no. 671951.

  2. I/O -> a fundamental challenge
     • Petascale already struggles with I/O to the external filesystem
       – Extreme parallelism with millions of threads
       – Job's input data read from the external PFS
       – Checkpoints: periodic writes to the external PFS
       – Job's output data written to the external PFS
     • HPC and data-intensive systems are merging
       – Modelling, simulation and analytics workloads increasing
     [Figure: compute nodes connected to the external filesystem over a high-performance network]

  3. I/O -> a fundamental challenge (same as slide 2, adding:)
     • … and it will only get worse at Exascale

  4. Burst Buffers -> remote
     • Fast storage devices that temporarily store application data before sending it to the PFS
       – Goal: absorb peak I/O to avoid overtaxing the PFS
       – Examples: Cray DataWarp, DDN IME
     • Growing interest in adding them to next-generation HPC architectures
       – NERSC's Cori, LLNL's Sierra, ANL's Aurora, …
       – Typically a separate resource from the PFS
       – Usage, allocation, data movements, etc. become the user's responsibility
     [Figure: compute nodes, a remote burst filesystem and the external filesystem connected over a high-performance network]

  5. Burst Buffers -> on-node
     • Non-volatile storage is coming to the compute node
       – Argonne's Theta has a 128 GB SSD in each compute node
     [Figure: compute nodes with node-local storage, connected to the external filesystem over a high-performance network]

  6. NEXTGenIO EU Project [http://www.nextgenio.eu]
     • Node-local, high-density NVRAM becomes a fundamental component of the I/O stack
       – Intel 3D XPoint™
       – Capacity much larger than DRAM
       – Slightly slower than DRAM, but significantly faster than SSDs
       – DIMM form factor -> standard memory controller
       – No refresh -> no/low energy leakage
     [Figure: the traditional I/O stack: cache, memory, storage]
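
The slides stress that the NVRAM sits behind the standard memory controller, i.e. it is byte-addressable with plain loads and stores. The deck contains no code, but a common way to reach such memory from user space is to mmap() a file that lives on a DAX-capable filesystem backed by the NVDIMMs. The sketch below illustrates only that idea; the mount point /mnt/nvram is an assumption, not something named in the slides.

```c
/* Sketch: byte-addressable access to node-local NVM through mmap().
 * Assumes a DAX-capable filesystem mounted at /mnt/nvram (hypothetical path). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const size_t len = 1 << 20;                       /* 1 MiB region */
    int fd = open("/mnt/nvram/scratch.bin", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, (off_t)len) != 0) { perror("ftruncate"); return 1; }

    /* Map the NVM-backed file; loads and stores now go through the memory controller. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "written with a plain store, no write() syscall");
    msync(p, len, MS_SYNC);                           /* request durability of the mapping */

    munmap(p, len);
    close(fd);
    return 0;
}
```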

  7. NEXTGenIO EU Project [http://www.nextgenio.eu] (same as slide 6)
     [Figure: the I/O stack gains a new layer: cache, memory, nvram, fast storage, slow storage]

  8. NEXTGenIO EU Project [http://www.nextgenio.eu] (same as slide 7, adding:)
     1. How do we manage access to these layers?
     2. How can we bring the benefits of these layers to legacy code?

  9. OUR SOLUTION: MANAGING ACCESS THROUGH A USER-LEVEL FILESYSTEM
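
The slides do not show how a user-level filesystem is wired up, and they do not say which framework echofs itself uses, so the following is illustration only: a minimal skeleton built on the FUSE 2.x API that exposes a single read-only file under its mount point. The point is simply that the filesystem logic runs entirely in user space while applications keep using ordinary POSIX calls.

```c
/* Minimal user-level filesystem skeleton with FUSE 2.x (illustration only;
 * not echofs' actual implementation). Exposes one read-only file, /hello. */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

static const char *msg = "served from a user-level filesystem\n";

static int fs_getattr(const char *path, struct stat *st) {
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) { st->st_mode = S_IFDIR | 0755; st->st_nlink = 2; return 0; }
    if (strcmp(path, "/hello") == 0) {
        st->st_mode = S_IFREG | 0444; st->st_nlink = 1; st->st_size = (off_t)strlen(msg);
        return 0;
    }
    return -ENOENT;
}

static int fs_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                      off_t offset, struct fuse_file_info *fi) {
    (void)offset; (void)fi;
    if (strcmp(path, "/") != 0) return -ENOENT;
    filler(buf, ".", NULL, 0);
    filler(buf, "..", NULL, 0);
    filler(buf, "hello", NULL, 0);
    return 0;
}

static int fs_read(const char *path, char *buf, size_t size, off_t offset,
                   struct fuse_file_info *fi) {
    (void)fi;
    if (strcmp(path, "/hello") != 0) return -ENOENT;
    size_t len = strlen(msg);
    if ((size_t)offset >= len) return 0;
    if ((size_t)offset + size > len) size = len - (size_t)offset;
    memcpy(buf, msg + offset, size);
    return (int)size;
}

int main(int argc, char *argv[]) {
    struct fuse_operations ops;
    memset(&ops, 0, sizeof(ops));
    ops.getattr = fs_getattr;
    ops.readdir = fs_readdir;
    ops.read    = fs_read;
    /* Applications then use plain POSIX calls on the mount point. */
    return fuse_main(argc, argv, &ops, NULL);
}
```

Built against libfuse 2 and mounted on an empty directory, the file /hello becomes readable with an unmodified cat or open()/read(), which is the same property the deck wants for legacy applications.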

  10. echofs -> objectives
      • First goal: allow legacy applications to transparently benefit from the new storage layers
        – All storage layers accessible under a unique mount point
        – New layers readily available to applications
        – I/O stack complexity hidden from applications
        – Allows for automatic management of data location
        – POSIX interface [sorry]
      • The Lustre PFS namespace is "echoed" under the echofs mount point:
        /mnt/PFS/User/App -> /mnt/ECHOFS/User/App
      [Figure: I/O stack with cache, memory, nvram, SSD and Lustre exposed under the /mnt/echofs/ mount point]
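
The namespace "echo" on slide 10 maps every PFS path onto the same relative path under the echofs mount point. A small sketch of that translation follows; the two mount prefixes come from the slide, while the function name is made up for illustration.

```c
/* Sketch of the namespace "echo": a PFS path is remapped to the same
 * relative path under the echofs mount point (prefixes from slide 10). */
#include <stdio.h>
#include <string.h>

#define PFS_ROOT    "/mnt/PFS"
#define ECHOFS_ROOT "/mnt/ECHOFS"

/* Hypothetical helper: writes the echoed path into out (of size outsz).
 * Returns 0 on success, -1 if the path is not below the PFS root. */
static int echo_path(const char *pfs_path, char *out, size_t outsz) {
    size_t plen = strlen(PFS_ROOT);
    if (strncmp(pfs_path, PFS_ROOT, plen) != 0)
        return -1;                        /* not under the PFS namespace */
    snprintf(out, outsz, "%s%s", ECHOFS_ROOT, pfs_path + plen);
    return 0;
}

int main(void) {
    char echoed[4096];
    if (echo_path("/mnt/PFS/User/App/input.dat", echoed, sizeof(echoed)) == 0)
        printf("%s\n", echoed);           /* -> /mnt/ECHOFS/User/App/input.dat */
    return 0;
}
```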

  11. echofs -> objectives
      • Second goal: construct a collaborative burst buffer (CBB) by joining the NVM regions assigned to a batch job by the scheduler [SLURM]
        – The filesystem's lifetime is linked to the batch job's lifetime
        – Input files are staged into NVM before the job starts
        – Allows HPC jobs to perform collaborative NVM I/O
        – Output files are staged out to the PFS when the job ends
      [Figure: parallel processes issue POSIX reads/writes to the echofs collaborative burst buffer spanning the NVM of all compute nodes; data is staged in from and staged out to the external filesystem]
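
To make the lifetime coupling on slide 11 concrete, here is a sketch of the whole sequence as one driver. Every helper name is hypothetical (none of them come from echofs or SLURM); the point is only the ordering: mount, stage in, run, stage out, tear down.

```c
/* Sketch of the CBB lifetime described on slide 11, tied to one batch job.
 * All helpers are hypothetical stand-ins for the real machinery. */
#include <stdio.h>

static void echofs_mount_cbb(int n_nodes)      { printf("CBB mounted across %d nodes\n", n_nodes); }
static void echofs_stage_in(const char *path)  { printf("staged in:  %s\n", path); }
static void echofs_stage_out(const char *path) { printf("staged out: %s\n", path); }
static void echofs_unmount_cbb(void)           { printf("CBB torn down\n"); }
static void run_batch_job(void)                { printf("job runs; POSIX I/O hits node-local NVM\n"); }

int main(void) {
    echofs_mount_cbb(4);                                 /* NVM regions joined into one CBB   */
    echofs_stage_in("/mnt/PFS/User/App/input.dat");      /* inputs copied before the job starts */
    run_batch_job();                                     /* collaborative NVM I/O             */
    echofs_stage_out("/mnt/PFS/User/App/results.out");   /* persistent outputs return to the PFS */
    echofs_unmount_cbb();                                /* lifetime ends with the job        */
    return 0;
}
```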

  12. echofs -> intended workflow
      • The user provides the job's I/O requirements through SLURM
        – Nodes required, files accessed, type of access [in|out|inout], expected lifetime [temporary|persistent], expected "survivability", required POSIX semantics [?], …
      • SLURM allocates the nodes and mounts echofs across them
        – It also forwards the I/O requirements through an API
      • echofs builds the CBB and fills it with the input files
        – When finished, SLURM starts the batch job
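
The slides list the kinds of per-file hints a user would hand over but do not define a concrete format. A hypothetical descriptor, purely for illustration (none of these type or field names come from echofs or SLURM), could look like this:

```c
/* Hypothetical per-file I/O requirement descriptor (illustration only;
 * not an actual echofs or SLURM data structure). */
#include <stdio.h>

enum access_type { ACCESS_IN, ACCESS_OUT, ACCESS_INOUT };
enum lifetime    { LIFETIME_TEMPORARY, LIFETIME_PERSISTENT };

struct io_requirement {
    const char      *path;         /* file the job will access                 */
    enum access_type access;       /* in | out | inout                         */
    enum lifetime    lifetime;     /* temporary | persistent                   */
    int              strict_posix; /* does the job need full POSIX semantics?  */
};

/* Example: what a job might declare at submission time. Input files would be
 * staged into the CBB before the job starts; temporary files never reach the PFS. */
static const struct io_requirement job_reqs[] = {
    { "/mnt/PFS/User/App/input.dat",      ACCESS_IN,    LIFETIME_PERSISTENT, 0 },
    { "/mnt/PFS/User/App/checkpoint.chk", ACCESS_INOUT, LIFETIME_TEMPORARY,  0 },
    { "/mnt/PFS/User/App/results.out",    ACCESS_OUT,   LIFETIME_PERSISTENT, 0 },
};

int main(void) {
    printf("%zu I/O requirements declared\n", sizeof(job_reqs) / sizeof(job_reqs[0]));
    return 0;
}
```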

  13. echofs -> intended workflow (same as slide 12, adding:)
      • We can't expect optimization details from users, but maybe we can expect them to offer us enough hints…

  14. echofs -> intended workflow
      • The job's I/O is absorbed by the collaborative burst buffer
        – Non-CBB open()s are forwarded to the PFS (throttled to limit PFS congestion)
        – Temporary files do not need to make it to the PFS (e.g. checkpoints)
        – Metadata attributes for temporary files are cached -> distributed key-value store
      • When the job completes, the future of its files is managed by echofs
        – Persistent files are eventually synced to the PFS
        – The decision is orchestrated by SLURM and the Data Scheduler component, depending on the requirements of upcoming jobs
        – If some other job reuses these files, we can leave them "as is"
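
Slide 14 boils down to a routing decision on open(): files registered in the CBB are served from node-local NVM, everything else is forwarded to the PFS under a throttle. The rough sketch below shows only that control flow; every helper name and the prefix check are hypothetical stand-ins.

```c
/* Rough sketch of slide 14's routing decision. The helpers and the
 * prefix-based CBB check are hypothetical, used only to show the flow. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

#define CBB_PREFIX "/mnt/ECHOFS/"

static int cbb_has(const char *path) {         /* is the file registered in the CBB? */
    return strncmp(path, CBB_PREFIX, strlen(CBB_PREFIX)) == 0;
}
static int cbb_open(const char *path, int flags) {
    (void)flags; printf("absorbed by CBB: %s\n", path); return 3;  /* dummy fd */
}
static void pfs_throttle_acquire(void) {
    /* e.g. wait on a token bucket shared by this node's echofs instance */
}
static int pfs_open(const char *path, int flags) {
    (void)flags; printf("forwarded to PFS (throttled): %s\n", path); return 4;  /* dummy fd */
}

/* The routing decision itself. */
static int echofs_open(const char *path, int flags) {
    if (cbb_has(path))
        return cbb_open(path, flags);   /* served from node-local NVM */
    pfs_throttle_acquire();             /* limit PFS congestion */
    return pfs_open(path, flags);
}

int main(void) {
    echofs_open("/mnt/ECHOFS/User/App/checkpoint.chk", O_RDWR);
    echofs_open("/etc/hosts", O_RDONLY);
    return 0;
}
```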

  15. echofs -> data distribution
      • Distributed data servers
        – The job's data space is partitioned across the compute nodes
        – Each node acts as the data server for its partition
        – Each node acts as a data client for the other partitions
      • Pseudo-random file segment distribution
        – No replication ⇒ avoids coherence mechanisms
        – Resiliency through erasure codes (eventually)
        – Each node acts as the lock manager for its partition
      [Figure: a shared file is split into segments ([0-8MB), [8-16MB), [16-32MB), …) that a hash function maps onto the NVM partitions of the compute nodes]

  16. echofs -> data distribution
      • Why pseudo-random?
        – Efficient and decentralized segment lookup [no metadata request needed to locate a segment]
        – Balances the workload w.r.t. partition size
        – Allows for collaborative I/O
      [Figure: same hash-based mapping of file segments onto compute-node NVM as on slide 15]

  17. echofs -> data distribution
      • Why pseudo-random? (same as slide 16, adding:)
        – Guarantees minimal movement of data if the node allocation changes [future research on elasticity]
      [Figure: as the job scheduler grows or shrinks the set of allocated nodes across job phases (+1 node, +1 node, -2 nodes), only the affected data is transferred between NVM partitions]

  18. echofs -> data distribution (same as slide 17, adding:)
      • Other strategies would be possible depending on job semantics
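
Slides 15 to 18 ask two things of the placement function: any client must be able to compute a segment's owner locally (no metadata request), and changing the node allocation must move as little data as possible. One well-known scheme with exactly those properties is rendezvous (highest-random-weight) hashing, sketched below. The deck does not say which function echofs actually uses, so treat this purely as an illustration.

```c
/* Sketch: rendezvous (highest-random-weight) hashing to place 8 MB file
 * segments on compute nodes. Any client computes the owner locally, and
 * adding or removing a node only remaps the segments that node wins or loses.
 * Illustration only; the slides do not specify echofs' actual hash scheme. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SEGMENT_SIZE (8u * 1024u * 1024u)   /* 8 MB segments, as in the figure */

/* 64-bit FNV-1a over an arbitrary byte string, mixed with a seed. */
static uint64_t fnv1a(const void *data, size_t len, uint64_t seed) {
    const unsigned char *p = data;
    uint64_t h = seed ^ 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 0x100000001b3ULL; }
    return h;
}

/* Owner of the segment containing `offset` of file `file`, among n_nodes nodes. */
static unsigned segment_owner(const char *file, uint64_t offset,
                              const unsigned *nodes, unsigned n_nodes) {
    uint64_t segment = offset / SEGMENT_SIZE;
    uint64_t best = 0;
    unsigned owner = nodes[0];
    for (unsigned i = 0; i < n_nodes; i++) {
        /* Weight = hash(file, segment, node); the node with the highest weight wins. */
        uint64_t h = fnv1a(file, strlen(file), segment);
        h = fnv1a(&nodes[i], sizeof(nodes[i]), h);
        if (h > best) { best = h; owner = nodes[i]; }
    }
    return owner;
}

int main(void) {
    unsigned four[] = { 0, 1, 2, 3 };
    unsigned five[] = { 0, 1, 2, 3, 4 };      /* the allocation grows by one node */
    const char *file = "/mnt/ECHOFS/User/App/shared.dat";

    for (uint64_t off = 0; off < 4 * (uint64_t)SEGMENT_SIZE; off += SEGMENT_SIZE)
        printf("segment at %llu MB: node %u -> node %u\n",
               (unsigned long long)(off >> 20),
               segment_owner(file, off, four, 4),
               segment_owner(file, off, five, 5));
    /* Only the segments now won by node 4 move; everything else stays put. */
    return 0;
}
```

Compared with a plain hash-modulo placement, rendezvous hashing keeps the decentralized-lookup property while avoiding the large reshuffle that a change in node count would otherwise cause, which matches the elasticity goal on slides 17 and 18.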

  19. echofs -> integration with the batch scheduler
      • A Data Scheduler daemon external to echofs
        – Interfaces SLURM and echofs: allows SLURM to send requests to echofs, and allows echofs to ACK these requests
        – Offers an API to [non-legacy] applications willing to send I/O hints to echofs
        – In the future, it will coordinate with SLURM to decide when different echofs instances should access the PFS [data-aware job scheduling]
      [Figure: applications pass static and dynamic I/O requirements to SLURM and the Data Scheduler, which issue asynchronous stage-in/stage-out requests to echofs]
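
Slide 19 mentions an API through which non-legacy applications can send I/O hints, but the slides do not define it. A hypothetical shape for such a call, for illustration only (none of these names exist in echofs or the NEXTGenIO Data Scheduler), might be:

```c
/* Hypothetical dynamic-hint call from an application to the Data Scheduler
 * daemon (illustration only; these functions and names are not real). */
#include <stdio.h>

enum hint_access { HINT_IN, HINT_OUT, HINT_INOUT };

/* Stub: a real daemon would receive this over some local transport and
 * acknowledge the request asynchronously, as the slide describes. */
static int ds_send_file_hint(const char *path, enum hint_access access, int temporary) {
    printf("hint: %s access=%d temporary=%d\n", path, access, temporary);
    return 0;   /* 0 = accepted; the ACK itself would arrive asynchronously */
}

int main(void) {
    /* A non-legacy application announcing how it will use two files. */
    ds_send_file_hint("/mnt/ECHOFS/User/App/checkpoint.chk", HINT_INOUT, 1);
    ds_send_file_hint("/mnt/ECHOFS/User/App/results.out",    HINT_OUT,   0);
    return 0;
}
```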

  20. Summary
      • Main features:
        – Ephemeral filesystem linked to the job's lifetime
        – Allows legacy applications to benefit from newer storage technologies
        – Provides aggregate I/O for applications
      • Research goals:
        – Improve coordination with the job scheduler and other HPC management infrastructure
        – Investigate ad-hoc data distributions tailored to each job's I/O
        – Scheduler-triggered optimizations specific to jobs/files

  21. Food for thought
      • POSIX compliance is hard…
        – But maybe we don't need FULL COMPLIANCE for ALL jobs…
      • Adding I/O-awareness to the scheduler is important…
        – It avoids wasting I/O work already done…
        – … but it requires user/developer collaboration (tricky…)
      • User-level filesystems/libraries solve very specific I/O problems…
        – Can we reuse/integrate these efforts?
        – Can we learn what works for a specific application, characterize it, and automatically run similar ones in a "best fit" FS?
