SLIDE 1

Parallel I/O

Fang Zheng

CSE6230 Fall 2012


SLIDE 2

Credits

  • Some materials are taken from Rob Latham’s “Parallel I/O in Practice” talk
  • http://www.spscicomp.org/ScicomP14/talks/Latham.pdf

SLIDE 3

Outline

  • I/O Requirements of HPC Applications
  • Parallel I/O Stack
    – From storage hardware to I/O libraries
  • High-level I/O Middleware
    – Case study: ADIOS
  • In Situ I/O Processing
  • Interesting Research Topics

SLIDE 4

Parallel I/O

  • I/O Requirements for HPC Applications
    – Checkpoint/restart: for defending against failures
    – Analysis data: for later analysis and visualization
    – Other data: diagnostics, logs, snapshots, etc.
    – Applications view data with domain-specific semantics:
      • Variables, meshes, particles, attributes, etc.
  • Challenges:
    – High concurrency: e.g., 100,000 processes
    – Huge data volume: e.g., 200MB per process
    – Ease of use: scientists are not I/O experts

SLIDE 5

Supporting Application I/O

  • Provide mapping of app. domain data abstractions
    – API that uses language meaningful to app. programmers
  • Coordinate access by many processes
    – Collective I/O, consistency semantics
  • Organize I/O devices into a single space
    – Convenient utilities and file model
  • And also
    – Insulate applications from I/O system changes
    – Maintain performance!!!

SLIDE 6

What about Parallel I/O?

  • Focus of parallel I/O is on using parallelism to increase bandwidth
  • Use multiple data sources/sinks in concert
    – Both multiple storage devices and multiple/wide paths to them
  • But applications don't want to deal with block devices and network protocols
  • So we add software layers
SLIDE 7

Parallel I/O Stack

  • I/O subsystem in supercomputers

– Oak Ridge National Lab’s “Jaguar” Cray XT4/5


SLIDE 8

Parallel I/O Stack

  • Another Example: IBM BlueGene/P


SLIDE 9

Parallel File Systems (PFSs)

  • Organize I/O devices into a single logical space

– Striping files across devices for performance

  • Export a well-defined API, usually POSIX

    – Access data in contiguous regions of bytes
    – Very general (see the sketch below)
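
To make the byte-level POSIX view concrete, here is a minimal sketch (not from the slides; the file name, block size, and the way the rank is passed in are all made up) of N processes each writing its own block of one shared file through plain POSIX calls:

    /* Minimal sketch: each process writes its own 1 MB block of a shared file
     * at offset rank * BLOCK, using only the POSIX byte-stream interface.
     * The "rank" is taken from the command line purely for illustration. */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK (1 << 20)

    int main(int argc, char **argv) {
        int rank = (argc > 1) ? atoi(argv[1]) : 0;
        char *buf = malloc(BLOCK);
        memset(buf, 'a' + rank % 26, BLOCK);

        int fd = open("shared.dat", O_WRONLY | O_CREAT, 0644);
        if (fd < 0) return 1;
        /* The file system sees only bytes and offsets; any structure
         * (variables, meshes, attributes) must be imposed by layers above. */
        pwrite(fd, buf, BLOCK, (off_t)rank * BLOCK);
        close(fd);
        free(buf);
        return 0;
    }

Striping then spreads those byte ranges across the PFS's storage devices transparently, which is where the performance comes from.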


SLIDE 10

Parallel I/O Stack

  • Idea: Add some additional software components to address remaining issues
    – Coordination of access
    – Mapping from application model to I/O model
  • These components will be increasingly specialized as we add layers
  • Bridge this gap between existing I/O systems and application needs

SLIDE 11

Parallel I/O for HPC

  • Break up support into multiple layers:
    – High level I/O library maps app. abstractions to a structured, portable file format (e.g. HDF5, Parallel netCDF, ADIOS)
    – Middleware layer deals with organizing access by many processes (e.g. MPI-IO, UPC-IO)
    – Parallel file system maintains logical space, provides efficient access to data (e.g. PVFS, GPFS, Lustre)

SLIDE 12

High Level Libraries

  • Provide an appropriate abstraction for domain
    – Multidimensional datasets
    – Typed variables
    – Attributes
  • Self-describing, structured file format
  • Map to middleware interface
    – Encourage collective I/O
  • Provide optimizations that middleware cannot

SLIDE 13

Parallel I/O Stack

  • High Level I/O Libraries:
    – Provide richer semantics than the “file” abstraction
      • Match applications’ data models: variables, attributes, data types, domain decomposition, etc.
    – Optimize I/O performance on top of MPI-IO
      • Can leverage more application-level knowledge
      • File format and layout
      • Orchestrate/coordinate I/O requests
    – Examples: HDF5, NetCDF, ADIOS, SILO, etc. (a parallel HDF5 write is sketched below)
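
As an illustration of what a high-level library call sequence looks like, below is a minimal parallel HDF5 write sketch (illustrative only, not taken from the slides; the file name "output.h5" and dataset name "temperature" are made up, error checking is omitted, and it assumes an MPI-enabled HDF5 build):

    /* Minimal sketch: each rank writes one row of a 2-D dataset collectively
     * through parallel HDF5, which sits on top of MPI-IO. */
    #include <hdf5.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Open the file with the MPI-IO driver. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fcreate("output.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* Global dataset: nprocs rows x 4 columns of doubles. */
        hsize_t dims[2] = { (hsize_t)nprocs, 4 };
        hid_t filespace = H5Screate_simple(2, dims, NULL);
        hid_t dset = H5Dcreate2(file, "temperature", H5T_NATIVE_DOUBLE,
                                filespace, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* Each rank selects its own row in the file and writes collectively. */
        hsize_t start[2] = { (hsize_t)rank, 0 }, count[2] = { 1, 4 };
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
        hid_t memspace = H5Screate_simple(2, count, NULL);
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

        double row[4] = { rank + 0.1, rank + 0.2, rank + 0.3, rank + 0.4 };
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, row);

        H5Pclose(dxpl); H5Sclose(memspace); H5Dclose(dset);
        H5Sclose(filespace); H5Fclose(file); H5Pclose(fapl);
        MPI_Finalize();
        return 0;
    }

The library records the dataset's shape and type in the file itself (self-describing) and issues the actual access through MPI-IO, so collective optimizations still apply underneath.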

SLIDE 14

I/O Middleware

  • Facilitate concurrent access by groups of processes
    – Collective I/O
    – Atomicity rules
  • Expose a generic interface
    – Good building block for high-level libraries
  • Match the underlying programming model (e.g. MPI)
  • Efficiently map middleware operations into PFS ones
    – Leverage any rich PFS access constructs

SLIDE 15

Parallel I/O

  • Parallel I/O supported by MPI-IO
    – Individual files: each MPI process writes/reads a separate file
    – Shared file, individual file pointers: each MPI process writes/reads a single shared file with individual file pointers
    – Shared file, collective I/O: MPI processes write/read a single shared file with collective semantics (see the sketch below)
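
A minimal sketch of the third mode, shared file with collective I/O (illustrative only; the file name and data sizes are made up):

    /* Minimal sketch: all ranks write their block of one shared file
     * with a single collective MPI-IO call. */
    #include <mpi.h>

    #define COUNT 1024

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double buf[COUNT];
        for (int i = 0; i < COUNT; i++) buf[i] = rank;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Shared file, collective I/O: every rank writes COUNT doubles at an
         * offset determined by its rank. */
        MPI_Offset offset = (MPI_Offset)rank * COUNT * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, COUNT, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }

Because every process participates in the one MPI_File_write_at_all call, the MPI-IO layer can merge the per-process requests into fewer, larger, better-aligned accesses.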

SLIDE 16

Parallel File System

  • Manage storage hardware
    – Present single view
    – Focus on concurrent, independent access
    – Knowledge of collective I/O usually very limited
  • Publish an interface that middleware can use effectively
    – Rich I/O language
    – Relaxed but sufficient semantics

SLIDE 17

Parallel I/O Stack

  • Parallel File System:

– Example: Lustre file system


Reference: http://wiki.lustre.org/images/3/38/Shipman_Feb_lustre_scalability.pdf

SLIDE 18

High Level I/O Library Case Study: ADIOS

  • ADIOS (Adaptable I/O System):
    – Developed by Georgia Tech and Oak Ridge National Lab
    – Works on Lustre and IBM’s GPFS
    – In production use by several major DOE applications
    – Features:
      • Simple, high-level API for reading/writing data in parallel
      • Support several popular file formats
      • High I/O performance at large scales
      • Extensible framework

SLIDE 19

High Level I/O Library Case Study: ADIOS

  • ADIOS Architecture:
    – Layered design
    – High-level ADIOS API: adios_open / adios_read / adios_write / adios_close (a write-path sketch follows)
    – Support multiple underlying file formats and I/O methods
    – Built-in optimizations: scheduling, buffering, etc.

[Architecture diagram: scientific codes call the ADIOS API, guided by external metadata (XML file); pluggable transports include DART, LIVE/DataTap, MPI-IO, POSIX IO, HDF-5, pnetCDF, viz engines, and others, with buffering, scheduling, and feedback inside the runtime]
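
For a flavor of the API, here is a write-path sketch in the style of the ADIOS 1.x C interface (illustrative only: the group name "restart", the variable names, and "config.xml" are made up and would normally be declared in the external XML metadata file, and exact function signatures vary across ADIOS versions):

    /* Illustrative ADIOS 1.x-style write path; NOT verbatim production code. */
    #include <mpi.h>
    #include <stdint.h>
    #include "adios.h"

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, NX = 1024;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double temperature[1024];
        for (int i = 0; i < NX; i++) temperature[i] = rank;

        adios_init("config.xml", MPI_COMM_WORLD);   /* parse external XML metadata */

        int64_t fd;
        uint64_t group_size = sizeof(int) + NX * sizeof(double), total_size;
        adios_open(&fd, "restart", "restart.bp", "w", MPI_COMM_WORLD);
        adios_group_size(fd, group_size, &total_size); /* let ADIOS size its buffer */
        adios_write(fd, "NX", &NX);
        adios_write(fd, "temperature", temperature);
        adios_close(fd);             /* actual I/O method chosen by the XML file */

        adios_finalize(rank);
        MPI_Finalize();
        return 0;
    }

Note how the code never names a transport: switching between MPI-IO, POSIX, or a staging method is a change in the XML file, not in the source.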

SLIDE 20

High Level I/O Library Case Study: ADIOS

  • Optimize Write Performance under Contention:
    – Write performance is critical for checkpointing
    – The parallel file system is shared by:
      • Processes within an MPI program
      • Different MPI programs running concurrently
    – How to attain high write performance on a busy, shared supercomputer?

[Diagram: Application 1 and Application 2 both writing to the shared I/O servers]

SLIDE 21

In Situ I/O Processing: An Alternative Approach to Parallel I/O

  • Negative interference due to contention on the shared file system
    – Internal contention between processes within the same MPI program

[Plot: as the ratio of application processes to I/O servers passes a certain point, write bandwidth starts to drop]

SLIDE 22

In Situ I/O Processing: An Alternative Approach to Parallel I/O

  • Negative interference due to contention on the shared file system
    – External contention between different MPI programs

[Plot: huge variations of I/O performance on supercomputers]

Reference: Jay Lofstead, Fang Zheng, Qing Liu, Scott Klasky, Ron Oldfield, Todd Kordenbrock, Karsten Schwan, Matthew Wolf. "Managing Variability in the IO Performance of Petascale Storage Systems". In Proceedings of SC'10, New Orleans, LA, November 2010.
SLIDE 23

High Level I/O Library Case Study: ADIOS

  • How to obtain high write performance on a busy, shared supercomputer?
  • The basic trick is to find slow (overloaded) I/O servers and avoid writing data to them

[Diagram: Application 1 and Application 2 steering writes away from slow I/O servers]

SLIDE 24

High Level I/O Library Case Study: ADIOS

  • ADIOS’ solution: Coordination
    – Divide the writing processes into groups
    – Each group has a sub-coordinator to monitor writing progress
    – Near the end of the collective I/O, the coordinator has a global view of the storage targets’ performance and informs stragglers to write to fast targets (an illustrative sketch follows)
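
The following is a deliberately simplified sketch of the coordination idea, NOT the actual ADIOS implementation: every writer times its write, rank 0 plays coordinator, averages the timings per storage target, and writers on targets much slower than the fastest one are redirected. The target count, the initial rank-to-target mapping, and the 1.5x "straggler" threshold are all made up for illustration.

    /* Illustrative sketch only: detect slow storage targets and redirect. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NTARGETS 4

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int my_target = rank % NTARGETS;   /* hypothetical initial mapping */

        double t0 = MPI_Wtime();
        /* ... write this rank's data chunk to my_target here ... */
        double elapsed = MPI_Wtime() - t0;

        /* Coordinator gathers per-rank timings and averages them per target. */
        double *times = NULL;
        if (rank == 0) times = malloc(size * sizeof(double));
        MPI_Gather(&elapsed, 1, MPI_DOUBLE, times, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        double avg[NTARGETS] = {0};
        int fastest = 0;
        if (rank == 0) {
            int cnt[NTARGETS] = {0};
            for (int r = 0; r < size; r++) { avg[r % NTARGETS] += times[r]; cnt[r % NTARGETS]++; }
            for (int t = 0; t < NTARGETS; t++) {
                if (cnt[t]) avg[t] /= cnt[t];
                if (avg[t] < avg[fastest]) fastest = t;
            }
            free(times);
        }
        MPI_Bcast(avg, NTARGETS, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        MPI_Bcast(&fastest, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Stragglers redirect their remaining writes to the fastest target. */
        if (avg[my_target] > 1.5 * avg[fastest]) my_target = fastest;
        printf("rank %d now writes to target %d\n", rank, my_target);

        MPI_Finalize();
        return 0;
    }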

SLIDE 25

High Level I/O Library Case Study: ADIOS

  • Results with a parallel application: Pixie3D

[Plots: higher I/O bandwidth and less variation with the coordinated approach]

SLIDE 26

In Situ I/O Processing

  • An alternative approach to existing parallel I/O techniques
  • Motivation:
    – I/O is becoming the bottleneck for large scale simulation AND analysis

SLIDE 27

I/O Is a Major Bottleneck Now!

  • Under-provisioned I/O and storage sub-system in supercomputers
    – Huge disparity between I/O and computation capacity
    – I/O resources are shared and contended by concurrent jobs

  Machine (as of Nov. 2011)   Peak Flops       Peak I/O bandwidth   Flop/byte
  Jaguar Cray XT5             2.3 Petaflops    120 GB/sec           191666
  Franklin Cray XT4           352 Teraflops    17 GB/sec            20705
  Hopper Cray XE6             1.28 Petaflops   35 GB/sec            36571
  Intrepid BG/P               557 Teraflops    78 GB/sec            7141
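
The last column is simply peak compute rate divided by peak I/O bandwidth: for Hopper, 1.28 Petaflops / 35 GB/sec is roughly 36,571 floating-point operations per byte that can reach storage, which quantifies how far I/O capacity lags behind computation.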

SLIDE 28

I/O Is a Major Bottleneck Now!

  • Huge output volume for scientific simulations
    – Example: GTS fusion simulation: 200MB per MPI process x 100,000 procs → 20TB per checkpoint
    – Increasing scale → increasing failure frequency → increasing I/O frequency → increasing I/O time

  A prediction by Sandia National Lab shows that checkpoint I/O will account for more than 50% of total simulation runtime at current machines’ failure frequency

Reference: Ron Oldfield, Sarala Arunagiri, Patricia J. Teller, Seetharami R. Seelam, Maria Ruiz Varela, Rolf Riesen, Philip C. Roth: Modeling the Impact of Checkpoints on Next-Generation Systems. MSST 2007: 30-46

SLIDE 29

I/O Is a Major Bottleneck Now!

  • Analysis and visualization need to read data back to gain useful insights from the raw bits
    – File read time can account for 90% of the total runtime of visualization tasks

Reference: Tom Peterka, Hongfeng Yu, Robert B. Ross, Kwan-Liu Ma, Robert Latham: End-to-End Study of Parallel Volume Rendering on the IBM Blue Gene/P. ICPP 2009:566-573

SLIDE 30

In Situ I/O Processing

  • In Situ I/O Processing:
    – Eliminate the I/O bottleneck by tightly coupling simulation and analysis

[Diagram: simulation writing to the PFS and analysis reading back, versus simulation feeding analysis directly to remove the bottleneck]

SLIDE 31

In Situ I/O Processing

  • In Situ I/O Processing:
    – Process simulation output data online, while the data is being generated
    – Many useful analyses can be done this way:
      • Data reduction: filtering, feature extraction
      • Data preparation: layout re-organization
      • Data inspection: validation, monitoring
    – Reduce disk I/O activities → reduce time and power consumption
    – Reduce the time from data to insight

SLIDE 32

Placing In Situ Analytics

  • There are multiple options to place analytics along with simulation:
    – Inline
    – Helper core
    – Staging nodes
    – I/O nodes
    – Offline

SLIDE 33

In Situ I/O Processing

  • PreDatA (Preparatory Data Analytics):
    – Allocate a set of compute nodes (Staging Area) to host analytics
    – Move simulation output data into the Staging Area
    – Process data in the Staging Area using MapReduce

SLIDE 34

In Situ I/O Processing

  • PreDatA data flow:

[Diagram: application processes on compute nodes send data requests, packed partial data chunks, and local/global metadata to staging nodes, which perform metadata calculation and stream processing on the output data stream]

SLIDE 35

In Situ I/O Processing

  • MapReduce in Staging Area

[Diagram: MapReduce phases in the staging area: initialize, map (with fetch), shuffle, reduce, finalize]

SLIDE 36

In Situ I/O Processing

  • FlexIO: In Situ I/O processing middleware
    – Used by simulation and analytics to exchange data
    – Analytics can be arbitrary MPI codes
    – Enable flexible placement options of analytics
      • In simulation nodes, staging nodes, offline nodes
      • No need to change code when changing placement

[Architecture diagram: simulation/analytics codes use the FlexIO API; the FlexIO runtime provides parallel data movement, plug-ins, performance monitoring, and buffer management on top of the EVPath messaging library, over file I/O, shared memory (SysV, mmap, Xpmem), and RDMA (InfiniBand, SeaStar/Portals, Gemini) transports]

SLIDE 37

In Situ I/O Processing

  • FlexIO:
    – High performance data movement between simulation and analytics
    – Automatically re-distribute multi-dimensional arrays between two parallel programs (a 1-D sketch of the redistribution plan follows)

[Diagram: simulation and analytics processes register array blocks with a directory server and exchange the overlapping pieces in several steps]
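
To illustrate the kind of bookkeeping such a redistribution involves, here is a tiny, self-contained sketch (not FlexIO code; the array length and rank counts are made up) that computes which element ranges must flow from each of M sender ranks to each of N receiver ranks for a 1-D block-decomposed array:

    /* Illustrative sketch: build the "M writers to N readers" plan for a
     * 1-D array decomposed into equal blocks on both sides. */
    #include <stdio.h>

    #define LEN 1000   /* global array length (made-up example) */
    #define M   4      /* simulation (sender) ranks */
    #define N   3      /* analytics (receiver) ranks */

    int main(void) {
        for (int s = 0; s < M; s++) {
            /* sender s owns elements [s_lo, s_hi) */
            int s_lo = s * LEN / M, s_hi = (s + 1) * LEN / M;
            for (int r = 0; r < N; r++) {
                int r_lo = r * LEN / N, r_hi = (r + 1) * LEN / N;
                int lo = s_lo > r_lo ? s_lo : r_lo;
                int hi = s_hi < r_hi ? s_hi : r_hi;
                if (lo < hi)   /* overlapping elements travel from s to r */
                    printf("sender %d -> receiver %d : elements [%d, %d)\n",
                           s, r, lo, hi);
            }
        }
        return 0;
    }

Each printed overlap corresponds to one message; the multi-dimensional case repeats the same intersection per dimension.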

SLIDE 38

In Situ I/O Processing

  • Placement Algorithms:
    – Decide where to run analytics
    – Data Aware Mapping:
      • Take inter-program data movement volume as input
      • Use graph partitioning to group processes
      • Map process groups to nodes
    – Holistic Placement:
      • Conservative resource allocation
      • Take intra- and inter-program data movement volume as input
      • Use graph mapping to map processes to cores
    – Node Topology Aware Placement:
      • Model the node based on its cache structure

SLIDE 39

In Situ I/O Processing

  • Placement of Analytics

[Diagram: simulation and analytics process graphs with inter-program, intra-simulation, and intra-analytics data movement edges, comparing data aware mapping and holistic placement]

SLIDE 40

In Situ I/O Processing

  • Improve Application Performance by Smart Placement of Analytics
    – GTS fusion simulation + statistical analysis
    – Placement leads to 20% improvement of total runtime

[Plot: total execution time (sec) vs. GTS cores (512-4096) for Inline, Helper Core (Data Aware Mapping), Helper Core (Holistic), Helper Core (Node Topo. Aware), Staging, and Lower Bound placements]

SLIDE 41

Parallel FS vs. Distributed FS

  • Distributed file systems used in data center environments (like Google FS or HDFS, etc.)
    – Similarities:
      • Client/server (meta, data) architecture
      • Basic file semantics
      • Support parallel and distributed workloads
    – Differences:
      • Deployment model:
        – DFS co-locates computation and data, and aggregates local disks
        – PFS assumes diskless clients
      • Interface:
        – PFS provides collective I/O semantics and (most) POSIX semantics
        – DFS like HDFS supports “key-value store” semantics
        – DFS assumes “write-once” semantics, disallowing concurrent writes to one file
        – DFS exposes data locality information to the job scheduler
      • Implementation: data distribution, failure/consistency handling, etc.

  Reference: Wittawat Tantisiriroj, Swapnil Patil, Garth Gibson, Seung Woo Son, Samuel J. Lang, Robert B. Ross. "On the Duality of Data-intensive File System Design: Reconciling HDFS and PVFS". In Proceedings of SC'11, November 12-18, 2011.

SLIDE 42

Interesting Research Topics

  • Integrate new hardware into the parallel I/O stack (e.g., SSD, NVRAM)
  • Move computation close to data:
    – Use MapReduce/NoSQL systems to process massive scientific data
    – Online real-time stream processing
    – Move analytics into file systems
    – Analytics-driven simulation: re-computing data may be cheaper and faster than storing and loading the data!

SLIDE 43

References

  • MPI-IO: http://www.mcs.anl.gov/research/projects/romio/
  • HDF5: http://www.hdfgroup.org/HDF5/
  • NetCDF: http://www.unidata.ucar.edu/software/netcdf/
  • ADIOS: http://www.olcf.ornl.gov/center-projects/adios/
  • Lustre: http://wiki.lustre.org/index.php/Main_Page
  • PVFS: http://www.pvfs.org/
  • GPFS: http://www-03.ibm.com/systems/software/gpfs/
