SLIDE 1

Exploiting Latent I/O Asynchrony in Petascale Science Applications

Patrick Widener, Mary Payne, Patrick Bridges

University of New Mexico

Matthew Wolf, Hasan Abbasi, Scott McManus, Karsten Schwan

Georgia Institute of Technology

The research described in this presentation was supported by the National Science Foundation’s HECURA program, the Department of Energy’s Office of Science, and the U.S. Defense Threat Reduction Agency.

SLIDE 2

Data intensities increasing everywhere

Storage is challenging, let alone analysis: write-once, read-never

  • Large Hadron Collider: 2 PB/sec
  • NG power grids: 45 TB/day
  • Climate modeling: 8 PB/run
  • ORNL Chimera: 35K cores, 550 KB/core/sec => ~18 GB/sec

The extract -> store -> analyze/visualize data pipeline will not scale

SLIDE 3

ORNL GTC fusion simulation: 60 TB/run

Gyrokinetic Toroidal Code

  • > 10,000 nodes on the ORNL Cray XT4
  • 1024:1 compute / I/O node ratio
  • Limited I/O node disk bandwidth
  • Scarce memory and CPU on compute nodes

Checkpoint / Restart

  • Periodic export of all particles (potentially > 10^9)
  • 10% of node memory (200 MB/core)
  • ~8 TB/write on a 40K-core XT4

Analysis (on the Lustre PFS, after the run has completed)

  • Reorganization, cleaning
  • Filtering, extraction
  • Monitoring, playback

SLIDE 4

I/O demands are limiting scientific applications on these systems

Problem: In-band data filtering, transformation, and analysis slows core scientific computation with ancillary tasks

  • Thin pipe to the I/O subsystem (I/O network, disk spindles)
  • I/O is generally synchronous because the compute-node memory holding the I/O data is scarce
  • Metadata updates are frequently slow and often unnecessary
  • Lack of systems that let application scientists move tasks out of band

SLIDE 5

Decoupled data annotation & processing

Contribution: I/O techniques to decouple filtering, transformation, and analysis from compute nodes

  • IOgraphs decouple data manipulations in space from applications
  • Metabots decouple data manipulations in time and space

Enabling technologies:

  • DataTaps export data and “just enough” metadata using a smart, context-aware RDMA transfer
  • Lightweight File System (LWFS) provides minimum filesystem semantics

Using these tools to decouple ancillary operations can improve application I/O throughput, while giving end-users better abstractions to work with

SLIDE 6

Software architecture for “in-transit” data annotation and processing

[Architecture diagram: DataTap clients on compute nodes send data to DataTap servers on I/O service nodes; IOgraph stones process the data in transit, and metabots operate alongside the storage nodes.]
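To make the diagram's data path concrete, here is a minimal, runnable Python sketch of messages flowing through a chain of stones. Everything in it (the Stone class, submit, and the lambda actions) is an invented illustration, not the authors' DataTap/IOgraph API.

```python
# A minimal sketch of the diagram's data path; all names are invented
# for illustration and are not the authors' DataTap/IOgraph API.

from typing import Callable, Optional

class Stone:
    """One IOgraph processing element: apply an action, forward downstream."""
    def __init__(self, action: Callable[[dict], Optional[dict]],
                 downstream: Optional["Stone"] = None):
        self.action = action
        self.downstream = downstream

    def submit(self, msg: dict) -> None:
        out = self.action(msg)            # filter/transform the message
        if out is not None and self.downstream is not None:
            self.downstream.submit(out)   # pass surviving data along the overlay

# Overlay standing in for the figure: annotate -> filter -> "store".
store    = Stone(lambda m: print("stored:", m))
filt     = Stone(lambda m: m if m["step"] % 10 == 0 else None, store)
annotate = Stone(lambda m: {**m, "source": "datatap"}, filt)

# Compute nodes would push messages through a DataTap; here we call directly.
for step in range(30):
    annotate.submit({"step": step, "particles": [step] * 4})
```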

SLIDE 7

IOgraphs decouple operations in space

[Diagram: data streams from the GTC DataTap through IOgraph stones (bounding-box filter, I/O scheduler, routers, data transformer) to stream visualization, parallel file storage, and other data sinks.]

  • Act on data in transit (a bounding-box filter is sketched below)
  • Dynamic overlay mapped to cluster and non-cluster nodes
  • Streaming model, structured data
  • Dynamically generated code and shared objects implement operations
  • Adjust # of nodes and processes/node for load or bandwidth distribution across IOgraph output nodes
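As one concrete example of acting on data in transit, below is a small Python sketch of the diagram's bounding-box filter stone. The particle representation (coordinate tuples) and the default bounds are assumptions made for illustration; the real stones are dynamically generated code and shared objects.

```python
# A small sketch of the diagram's bounding-box filter stone. The particle
# representation (coordinate tuples) and the default bounds are invented.

def bounding_box_filter(particles, lo=(0.0, 0.0, 0.0), hi=(1.0, 1.0, 1.0)):
    """Keep only particles whose coordinates fall inside [lo, hi]."""
    return [p for p in particles
            if all(lo[i] <= p[i] <= hi[i] for i in range(3))]

# Applied in transit, downstream consumers (visualization, storage) see
# only the region of interest, so less data crosses the thin I/O pipe.
sample = [(0.2, 0.5, 0.9), (1.4, 0.1, 0.3), (0.7, 0.7, 0.7)]
print(bounding_box_filter(sample))   # keeps the 1st and 3rd particles
```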
SLIDE 8

What should IOgraphs look like?

For buffering and distribution of I/O: how many nodes, and how many processes per node?

[Diagram: a transmitter (simulating the DataTap) sends GTC restart messages of 188 MB each to an IOgraph scheduler, which round-robins them across storage nodes storage0 ... storageN.]

  • Models construction of a GTC restart file
  • Transmitter sends 200 messages
  • Scheduler round-robins messages to storage nodes, which write to disk (sketched below)
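A minimal Python sketch of the round-robin distribution in this experiment follows; the message count and size come from the slide, while the function names and the in-memory representation are invented stand-ins for roles that really run on separate nodes.

```python
# A minimal sketch of this experiment's round-robin distribution. The
# message count and size come from the slide; everything else is an
# invented stand-in for the real IOgraph roles on separate nodes.

import itertools

NUM_STORAGE_NODES = 4
NUM_MESSAGES = 200        # transmitter sends 200 messages
MESSAGE_MB = 188          # each GTC restart message is 188 MB

def scheduler(message_ids, num_storage):
    """Round-robin each incoming message to the next storage node."""
    targets = itertools.cycle(range(num_storage))
    per_node = [[] for _ in range(num_storage)]
    for msg_id in message_ids:
        per_node[next(targets)].append(msg_id)
    return per_node

assignment = scheduler(range(NUM_MESSAGES), NUM_STORAGE_NODES)
for node, msgs in enumerate(assignment):
    print(f"storage{node}: {len(msgs)} messages, {len(msgs) * MESSAGE_MB} MB")
```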

SLIDE 9

Adding nodes to IOgraph shortens I/O phase

[Plot: time to completion (sec) vs. number of storage nodes (1, 2, 4, 8), broken down into transmitter, scheduler, storage, and client time.]

  • A second storage node reduces backpressure, speeding up the transmitter
  • Constrained by disk bandwidth

SLIDE 10

Metabots decouple operations in time

Some operations can or must be delayed

  • Data formatting in long-running MPP codes
  • Some data products may not be needed
  • Service nodes may be limited in number or overcommitted

Small, modular, specification-based programs

  • Well-defined input, output, and transformation
  • Data consistency/availability and co-scheduling information

Ideal for just-in-time, on-demand conversions or metadata fixups

Use the same metadata and transport infrastructure as IOgraphs (a sample specification is sketched below)
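The slides list what a metabot specification contains but not its format, so the Python sketch below simply gives those pieces (input, output, transformation, consistency/availability, co-scheduling) concrete shape; every field name here is an assumption.

```python
# A sketch of what a metabot specification might contain, based on the
# slide's list. All field names are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable

@dataclass
class MetabotSpec:
    input_pattern: str                   # which stored objects to consume
    output_pattern: str                  # where results are placed
    transform: Callable[[bytes], bytes]  # the small, modular operation
    requires_consistent_input: bool      # data consistency/availability
    schedule_hint: str                   # co-scheduling information

# Example: a deferred, on-demand format conversion.
convert_spec = MetabotSpec(
    input_pattern="restart-*.raw",
    output_pattern="restart-{n}.converted",
    transform=lambda raw: raw,           # placeholder for a real converter
    requires_consistent_input=True,
    schedule_hint="run-when-service-nodes-idle",
)
print(convert_spec.input_pattern, "->", convert_spec.output_pattern)
```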

SLIDE 11

Deferring directory metadata creation

SLIDE 12

Lazy metadata construction reduces wall-clock time

[Plots: creation time (sec) vs. number of files created for a flat structure, and vs. directory depth for a tree with 5 levels and 2 dirs/level, comparing raw, metabot, and in-band runs.]

  • In-band is 70% slower on the flat structure
  • In-band is > 9X slower on the tree structure
  • Metabot reconstruction time is similar to the in-band time, but decoupled

Approach: create the structure without directory information (LANL FDTREE), then fix it up later (adding entries to the LWFS name service) with a metabot, as sketched below.
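Here is a minimal Python sketch of that lazy scheme under stated assumptions: the dictionary-based "name service" and the journal are invented stand-ins for LWFS name-service calls, not the actual mechanism.

```python
# A minimal sketch of lazy metadata construction. The dictionary "name
# service" and the journal are invented stand-ins; the actual experiment
# used LANL's FDTREE benchmark over LWFS.

journal = []   # deferred directory operations, recorded at create time

def create_file_lazily(path, object_id):
    """Write the object now, but only journal its namespace entry."""
    # ... object data would be written to storage here, with no
    # synchronous directory update on the critical path ...
    journal.append((path, object_id))

def metabot_fixup(name_service):
    """Later, out of band, replay the journal into the name service."""
    for path, object_id in journal:
        name_service[path] = object_id   # stand-in for an LWFS name-service call

create_file_lazily("/run42/ckpt/f0001", object_id=101)
create_file_lazily("/run42/ckpt/f0002", object_id=102)

names = {}
metabot_fixup(names)   # directory metadata appears only after the fixup
print(names)
```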

SLIDE 13

Combining IOgraphs and metabots reduces overall execution time

Configuration                    In-band processing   Metabot processing   Total
Single in-series writer/sorter   2113.16 s            —                    2113.16 s
2 storage nodes + metabot        250.91 s             526.71 s             777.62 s
4 storage nodes + metabot        216.52 s             526.71 s             743.23 s

Question: how do we create a fully-sorted restart file from a collection of messages? Compare a single in-band sorter vs. write-now, merge-later.

In-band with IOgraph (re-orderer)

  • Collects all messages
  • A separate thread produces a totally in-order restart file

Out-of-band with a metabot

  • Storage nodes storage0 ... storageN each write one file per message
  • A metabot later merges them into in-order output (see the sketch below)
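The write-now, merge-later metabot amounts to a k-way merge over per-message files that are already individually sorted. Below is a minimal Python sketch with the files simulated as in-memory lists; the (particle_id, payload) record layout and sort key are assumptions.

```python
# A minimal sketch of the write-now, merge-later metabot: a k-way merge
# over per-message files that are already sorted. Files are simulated as
# in-memory lists; the (particle_id, payload) layout is an assumption.

import heapq

per_message_files = [
    [(0, "a"), (3, "d"), (6, "g")],
    [(1, "b"), (4, "e"), (7, "h")],
    [(2, "c"), (5, "f"), (8, "i")],
]

def merge_restart(files):
    """Merge already-sorted per-message files into one in-order stream."""
    return list(heapq.merge(*files, key=lambda rec: rec[0]))

restart = merge_restart(per_message_files)
print([pid for pid, _ in restart])   # -> [0, 1, 2, 3, 4, 5, 6, 7, 8]
```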

SLIDE 14

Comparison to other work

High-performance parallel file systems

  • Many choices: NASD, Panasas, PVFS, Lustre, GPFS
  • Separation of data from metadata supports our approach
  • Manipulating data en route to/from storage
  • Availability of metadata enables better scheduling, staging, and buffering decisions

DataCutter and related tools

  • Similar goals (e.g. customizing end-user visualizations)
  • Richer descriptions for filtering and transformation; asynchrony

Out-of-band techniques are similar to workflow systems

  • Kepler, Pegasus, Condor/G, IRODS, others
  • Specifications like Data Grid Language
  • We focus on fine-grain scheduling, tightly coupled systems, and in-band / out-of-band data manipulation
  • Can metabots be workflow actors?

SLIDE 15

These techniques provide traction on data-intensive applications

IOgraphs and metabots provide several benefits

  • Shorten application I/O phases
  • Make analysis easier by making customization easier
  • Reduce net storage amounts
  • Generate custom metadata
  • Accommodate anonymous downstream consumers

Using these tools to decouple ancillary operations can improve application I/O throughput, while giving end-users better abstractions to work with

SLIDE 16

Future Work: Dynamic decoupling

Run-time scheduling decisions about whether to implement operations in IOgraphs or metabots

Longer-range goal is to incorporate feedback (a toy policy is sketched below):

  • CPU / node availability
  • Network bandwidth
  • Data consistency / availability
  • Anonymous / on-demand consumers

[Diagram: an application I/O "slider" ranging from completely in-band (IOgraph-based), through a mix of IOgraph & metabot actions, to completely out-of-band with metabots.]
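As a toy illustration of such run-time decisions, the Python sketch below maps a few feedback signals onto positions of the slider; the signal names and thresholds are entirely invented.

```python
# A toy policy for the future-work "slider": choose in-band, mixed, or
# out-of-band placement from feedback signals. Signal names and
# thresholds are entirely invented for illustration.

def choose_placement(cpu_free, net_bw_free, consumer_waiting):
    """Decide where an ancillary operation should run right now."""
    if consumer_waiting and net_bw_free > 0.5:
        return "in-band"        # run in an IOgraph; result needed immediately
    if cpu_free < 0.2:
        return "out-of-band"    # defer entirely to a metabot
    return "mixed"              # split work between IOgraph and metabot

print(choose_placement(cpu_free=0.1, net_bw_free=0.8, consumer_waiting=False))
# -> out-of-band
```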

SLIDE 17

Acknowledgements

  • Greg Eisenhauer, Ada Gavrilovska (Georgia Tech)
  • Barney Maccabe, Scott Klasky (Oak Ridge National Laboratory)
  • Ron Oldfield (Sandia National Laboratories)