SLIDE 1

Exploiting Latent I/O Asynchrony in Petascale Science Applications

Patrick Widener, Mary Payne, Patrick Bridges

University of New Mexico

Matthew Wolf, Hasan Abbasi, Scott McManus, Karsten Schwan

Georgia Institute of Technology

The research described in this presentation was supported by the National Science Foundation’s HECURA program, the Department of Energy’s Office of Science, and the U.S. Defense Threat Reduction Agency.

SLIDE 2

Data intensities increasing everywhere

Storage is challenging, let alone analysis: write-once, read-never

  • Large Hadron Collider: 2 PB/sec
  • NG power grids: 45 TB/day
  • Climate modeling: 8 PB/run
  • ORNL Chimera: 35K cores, 550 KB/core/sec => ~18 GB/sec

The extract -> store -> analyze/visualize data pipeline will not scale

SLIDE 3

ORNL GTC fusion simulation: 60 TB/run

Gyrokinetic Toroidal Code

  • > 10,000 nodes on the ORNL Cray XT4
  • 1024:1 compute / I/O node ratio
  • Limited I/O node disk bandwidth
  • Scarce memory and CPU on compute nodes

Checkpoint / Restart

  • Periodic export of all particles (potentially > 10^9)
  • 10% of node memory (200 MB/core)
  • ~8 TB/write on a 40K-core XT4

Analysis (on the Lustre PFS, after the run has completed)

  • Reorganization, cleaning
  • Filtering, extraction
  • Monitoring, playback

SLIDE 4

I/O demands are limiting scientific applications on these systems

Problem: In-band data filtering, transformation, and analysis slows core scientific computation with ancillary tasks

  • Thin pipe to the I/O subsystem (I/O network, disk spindles)
  • I/O is generally synchronous because the compute-node memory holding the I/O data is scarce
  • Metadata updates are frequently slow and often unnecessary
  • Lack of systems that let application scientists move tasks out of band

SLIDE 5

Decoupled data annotation & processing

Contribution: I/O techniques to decouple filtering, transformation, and analysis from compute nodes

  • IOgraphs decouple data manipulations in space from applications
  • Metabots decouple data manipulations in time and space

Enabling technologies:

  • DataTaps export data and “just enough” metadata using a smart, context-aware RDMA transfer
  • Lightweight File System (LWFS) provides minimum filesystem semantics

Using these tools to decouple ancillary operations can improve application I/O throughput, while giving end-users better abstractions to work with

SLIDE 6

Software architecture for “in-transit” data annotation and processing

[Architecture diagram: DataTap clients on compute nodes send data to DataTap servers on I/O service nodes; IOgraph stones process the data in transit, and metabots operate alongside the storage nodes.]
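To make the diagram's data path concrete, here is a minimal, runnable Python sketch of messages flowing through a chain of stones. Everything in it (the Stone class, submit, and the lambda actions) is an invented illustration, not the authors' DataTap/IOgraph API.

```python
# A minimal sketch of the diagram's data path; all names are invented
# for illustration and are not the authors' DataTap/IOgraph API.

from typing import Callable, Optional

class Stone:
    """One IOgraph processing element: apply an action, forward downstream."""
    def __init__(self, action: Callable[[dict], Optional[dict]],
                 downstream: Optional["Stone"] = None):
        self.action = action
        self.downstream = downstream

    def submit(self, msg: dict) -> None:
        out = self.action(msg)            # filter/transform the message
        if out is not None and self.downstream is not None:
            self.downstream.submit(out)   # pass surviving data along the overlay

# Overlay standing in for the figure: annotate -> filter -> "store".
store    = Stone(lambda m: print("stored:", m))
filt     = Stone(lambda m: m if m["step"] % 10 == 0 else None, store)
annotate = Stone(lambda m: {**m, "source": "datatap"}, filt)

# Compute nodes would push messages through a DataTap; here we call directly.
for step in range(30):
    annotate.submit({"step": step, "particles": [step] * 4})
```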

SLIDE 7

IOgraphs decouple operations in space

[Diagram: data streams from the GTC DataTap through IOgraph stones (bounding-box filter, I/O scheduler, routers, data transformer) to stream visualization, parallel file storage, and other data sinks.]

  • Act on data in transit (a bounding-box filter is sketched below)
  • Dynamic overlay mapped to cluster and non-cluster nodes
  • Streaming model, structured data
  • Dynamically generated code and shared objects implement operations
  • Adjust # of nodes and processes/node for load or bandwidth distribution across IOgraph output nodes
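As one concrete example of acting on data in transit, below is a small Python sketch of the diagram's bounding-box filter stone. The particle representation (coordinate tuples) and the default bounds are assumptions made for illustration; the real stones are dynamically generated code and shared objects.

```python
# A small sketch of the diagram's bounding-box filter stone. The particle
# representation (coordinate tuples) and the default bounds are invented.

def bounding_box_filter(particles, lo=(0.0, 0.0, 0.0), hi=(1.0, 1.0, 1.0)):
    """Keep only particles whose coordinates fall inside [lo, hi]."""
    return [p for p in particles
            if all(lo[i] <= p[i] <= hi[i] for i in range(3))]

# Applied in transit, downstream consumers (visualization, storage) see
# only the region of interest, so less data crosses the thin I/O pipe.
sample = [(0.2, 0.5, 0.9), (1.4, 0.1, 0.3), (0.7, 0.7, 0.7)]
print(bounding_box_filter(sample))   # keeps the 1st and 3rd particles
```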
SLIDE 8

What should IOgraphs look like?

For buffering and distribution of I/O: how many nodes, and how many processes per node?

[Diagram: a transmitter (simulating the DataTap) sends GTC restart messages of 188 MB each to an IOgraph scheduler, which round-robins them across storage nodes storage0 ... storageN.]

  • Models construction of a GTC restart file
  • Transmitter sends 200 messages
  • Scheduler round-robins messages to storage nodes, which write to disk (sketched below)
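A minimal Python sketch of the round-robin distribution in this experiment follows; the message count and size come from the slide, while the function names and the in-memory representation are invented stand-ins for roles that really run on separate nodes.

```python
# A minimal sketch of this experiment's round-robin distribution. The
# message count and size come from the slide; everything else is an
# invented stand-in for the real IOgraph roles on separate nodes.

import itertools

NUM_STORAGE_NODES = 4
NUM_MESSAGES = 200        # transmitter sends 200 messages
MESSAGE_MB = 188          # each GTC restart message is 188 MB

def scheduler(message_ids, num_storage):
    """Round-robin each incoming message to the next storage node."""
    targets = itertools.cycle(range(num_storage))
    per_node = [[] for _ in range(num_storage)]
    for msg_id in message_ids:
        per_node[next(targets)].append(msg_id)
    return per_node

assignment = scheduler(range(NUM_MESSAGES), NUM_STORAGE_NODES)
for node, msgs in enumerate(assignment):
    print(f"storage{node}: {len(msgs)} messages, {len(msgs) * MESSAGE_MB} MB")
```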

SLIDE 9

Adding nodes to IOgraph shortens I/O phase

[Plot: time to completion (sec) vs. number of storage nodes (1, 2, 4, 8), broken down into transmitter, scheduler, storage, and client time.]

  • A second storage node reduces backpressure, speeding up the transmitter
  • Constrained by disk bandwidth

SLIDE 10

Metabots decouple operations in time

Some operations can or must be delayed

  • Data formatting in long-running MPP codes
  • Some data products may not be needed
  • Service nodes may be limited in number or overcommitted

Small, modular, specification-based programs

  • Well-defined input, output, and transformation
  • Data consistency/availability and co-scheduling information

Ideal for just-in-time, on-demand conversions or metadata fixups

Use the same metadata and transport infrastructure as IOgraphs (a sample specification is sketched below)
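The slides list what a metabot specification contains but not its format, so the Python sketch below simply gives those pieces (input, output, transformation, consistency/availability, co-scheduling) concrete shape; every field name here is an assumption.

```python
# A sketch of what a metabot specification might contain, based on the
# slide's list. All field names are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable

@dataclass
class MetabotSpec:
    input_pattern: str                   # which stored objects to consume
    output_pattern: str                  # where results are placed
    transform: Callable[[bytes], bytes]  # the small, modular operation
    requires_consistent_input: bool      # data consistency/availability
    schedule_hint: str                   # co-scheduling information

# Example: a deferred, on-demand format conversion.
convert_spec = MetabotSpec(
    input_pattern="restart-*.raw",
    output_pattern="restart-{n}.converted",
    transform=lambda raw: raw,           # placeholder for a real converter
    requires_consistent_input=True,
    schedule_hint="run-when-service-nodes-idle",
)
print(convert_spec.input_pattern, "->", convert_spec.output_pattern)
```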

SLIDE 11

Deferring directory metadata creation

SLIDE 12

Lazy metadata construction reduces wall-clock time

[Plots: creation time (sec) vs. number of files created for a flat structure, and vs. directory depth for a tree with 5 levels and 2 dirs/level, comparing raw, metabot, and in-band runs.]

  • In-band is 70% slower on the flat structure
  • In-band is > 9X slower on the tree structure
  • Metabot reconstruction time is similar to the in-band time, but decoupled

Approach: create the structure without directory information (LANL FDTREE), then fix it up later (adding entries to the LWFS name service) with a metabot, as sketched below.
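Here is a minimal Python sketch of that lazy scheme under stated assumptions: the dictionary-based "name service" and the journal are invented stand-ins for LWFS name-service calls, not the actual mechanism.

```python
# A minimal sketch of lazy metadata construction. The dictionary "name
# service" and the journal are invented stand-ins; the actual experiment
# used LANL's FDTREE benchmark over LWFS.

journal = []   # deferred directory operations, recorded at create time

def create_file_lazily(path, object_id):
    """Write the object now, but only journal its namespace entry."""
    # ... object data would be written to storage here, with no
    # synchronous directory update on the critical path ...
    journal.append((path, object_id))

def metabot_fixup(name_service):
    """Later, out of band, replay the journal into the name service."""
    for path, object_id in journal:
        name_service[path] = object_id   # stand-in for an LWFS name-service call

create_file_lazily("/run42/ckpt/f0001", object_id=101)
create_file_lazily("/run42/ckpt/f0002", object_id=102)

names = {}
metabot_fixup(names)   # directory metadata appears only after the fixup
print(names)
```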

SLIDE 13

Combining IOgraphs and metabots reduces overall execution time

Configuration                    In-band processing   Metabot processing   Total
Single in-series writer/sorter   2113.16 s            —                    2113.16 s
2 storage nodes + metabot        250.91 s             526.71 s             777.62 s
4 storage nodes + metabot        216.52 s             526.71 s             743.23 s

Question: how do we create a fully-sorted restart file from a collection of messages? Compare a single in-band sorter vs. write-now, merge-later.

In-band with IOgraph (re-orderer)

  • Collects all messages
  • A separate thread produces a totally in-order restart file

Out-of-band with a metabot

  • Storage nodes storage0 ... storageN each write one file per message
  • A metabot later merges them into in-order output (see the sketch below)
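The write-now, merge-later metabot amounts to a k-way merge over per-message files that are already individually sorted. Below is a minimal Python sketch with the files simulated as in-memory lists; the (particle_id, payload) record layout and sort key are assumptions.

```python
# A minimal sketch of the write-now, merge-later metabot: a k-way merge
# over per-message files that are already sorted. Files are simulated as
# in-memory lists; the (particle_id, payload) layout is an assumption.

import heapq

per_message_files = [
    [(0, "a"), (3, "d"), (6, "g")],
    [(1, "b"), (4, "e"), (7, "h")],
    [(2, "c"), (5, "f"), (8, "i")],
]

def merge_restart(files):
    """Merge already-sorted per-message files into one in-order stream."""
    return list(heapq.merge(*files, key=lambda rec: rec[0]))

restart = merge_restart(per_message_files)
print([pid for pid, _ in restart])   # -> [0, 1, 2, 3, 4, 5, 6, 7, 8]
```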

SLIDE 14

Comparison to other work

High-performance parallel file systems

  • Many choices: NASD, Panasas, PVFS, Lustre, GPFS
  • Separation of data from metadata supports our approach
  • Manipulating data en route to/from storage
  • Availability of metadata enables better scheduling, staging, and buffering decisions

DataCutter and related tools

  • Similar goals (e.g. customizing end-user visualizations)
  • Richer descriptions for filtering and transformation; asynchrony

Out-of-band techniques are similar to workflow systems

  • Kepler, Pegasus, Condor/G, IRODS, others
  • Specifications like Data Grid Language
  • We focus on fine-grain scheduling, tightly coupled systems, and in-band / out-of-band data manipulation
  • Can metabots be workflow actors?

SLIDE 15

These techniques provide traction on data-intensive applications

IOgraphs and metabots provide several benefits

  • Shorten application I/O phases
  • Make analysis easier by making customization easier
  • Reduce net storage amounts
  • Generate custom metadata
  • Accommodate anonymous downstream consumers

Using these tools to decouple ancillary operations can improve application I/O throughput, while giving end-users better abstractions to work with

SLIDE 16

Future Work: Dynamic decoupling

Run-time scheduling decisions about whether to implement operations in IOgraphs or metabots

Longer-range goal is to incorporate feedback (a toy policy is sketched below):

  • CPU / node availability
  • Network bandwidth
  • Data consistency / availability
  • Anonymous / on-demand consumers

[Diagram: an application I/O "slider" ranging from completely in-band (IOgraph-based), through a mix of IOgraph & metabot actions, to completely out-of-band with metabots.]
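As a toy illustration of such run-time decisions, the Python sketch below maps a few feedback signals onto positions of the slider; the signal names and thresholds are entirely invented.

```python
# A toy policy for the future-work "slider": choose in-band, mixed, or
# out-of-band placement from feedback signals. Signal names and
# thresholds are entirely invented for illustration.

def choose_placement(cpu_free, net_bw_free, consumer_waiting):
    """Decide where an ancillary operation should run right now."""
    if consumer_waiting and net_bw_free > 0.5:
        return "in-band"        # run in an IOgraph; result needed immediately
    if cpu_free < 0.2:
        return "out-of-band"    # defer entirely to a metabot
    return "mixed"              # split work between IOgraph and metabot

print(choose_placement(cpu_free=0.1, net_bw_free=0.8, consumer_waiting=False))
# -> out-of-band
```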

SLIDE 17

Acknowledgements

  • Greg Eisenhauer, Ada Gavrilovska (Georgia Tech)
  • Barney Maccabe, Scott Klasky (Oak Ridge National Laboratory)
  • Ron Oldfield (Sandia National Laboratories)