Parallel I/O
Fang Zheng
CSE6230 Fall 2012
Credits
Some materials are taken from Rob Latham's "Parallel I/O in Practice" talk: http://www.spscicomp.org/ScicomP14/talks/Latham.pdf

Outline
I/O requirements of HPC applications
– Abstractions: an API that uses language meaningful to application programmers
– Collective I/O, consistency semantics
– Convenient utilities and file model
– Insulate applications from I/O system changes
– Maintain performance!
– Both multiple storage devices and multiple/wide paths to them
– Striping files across devices for performance
– Access data in contiguous regions of bytes
– Very general
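For instance, the simplest way to use such a file system is plain POSIX byte access. Below is a minimal sketch (not from the slides; the file name and block size are made-up values) in which each MPI rank writes its own contiguous byte range of one shared, striped file.

/* Sketch (illustrative): each rank writes a disjoint contiguous byte
 * range of one shared file through the POSIX interface; the parallel
 * file system stripes that file across its storage devices. */
#include <fcntl.h>
#include <mpi.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t block = 1 << 20;              /* 1 MiB per rank (example value) */
    char *buf = malloc(block);
    for (size_t i = 0; i < block; i++) buf[i] = (char)rank;

    /* Every rank opens the same file; rank-based offsets keep the writes disjoint. */
    int fd = open("shared.dat", O_WRONLY | O_CREAT, 0644);
    pwrite(fd, buf, block, (off_t)rank * block);
    close(fd);

    free(buf);
    MPI_Finalize();
    return 0;
}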
– High-level I/O library maps application abstractions to a structured, portable file format (e.g. HDF5, Parallel netCDF, ADIOS)
– Middleware layer deals with organizing access by many processes (e.g. MPI-IO, UPC-IO)
– Parallel file system maintains the logical storage space and provides efficient access to data (e.g. PVFS, GPFS, Lustre)
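To make the layering concrete, here is a hedged sketch (not from the slides) of a collective write through parallel HDF5: the high-level library call goes through the MPI-IO middleware and down to the parallel file system. The file name, dataset name, and sizes are illustrative.

/* Sketch (illustrative): collective write of a 1-D dataset with parallel HDF5. */
#include <hdf5.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* High-level library opens the file through the MPI-IO (middleware) driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("output.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Global 1-D dataset; each rank owns one contiguous slice. */
    enum { LOCAL_N = 1024 };
    hsize_t count = LOCAL_N;
    hsize_t dims = (hsize_t)LOCAL_N * nprocs;
    hsize_t offset = (hsize_t)LOCAL_N * rank;
    hid_t filespace = H5Screate_simple(1, &dims, NULL);
    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hid_t memspace = H5Screate_simple(1, &count, NULL);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, &count, NULL);

    double buf[LOCAL_N];
    for (int i = 0; i < LOCAL_N; i++) buf[i] = rank;

    /* Collective data transfer: the stack coordinates all ranks' requests. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
    MPI_Finalize();
    return 0;
}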
– Collective I/O
– Atomicity rules
– Good building block for high-level libraries
– Leverage any rich PFS access constructs
– Each MPI process writes/reads a separate file
– Each MPI process writes/reads a single shared file with individual file pointers
– Each MPI process writes/reads a single shared file with collective semantics
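A minimal sketch of the third pattern (single shared file, collective semantics), with a made-up file name and sizes:

/* Sketch (illustrative): all ranks write one shared file collectively with
 * MPI-IO; rank-derived offsets keep the regions disjoint. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { N = 1024 };                 /* doubles per rank (example value) */
    double buf[N];
    for (int i = 0; i < N; i++) buf[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: MPI-IO can merge the ranks' requests into large,
     * well-aligned accesses (e.g. via two-phase I/O). */
    MPI_Offset offset = (MPI_Offset)rank * N * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, N, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

For the file-per-process pattern, each rank would instead open its own file (e.g. a name suffixed with the rank) and write at offset zero.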
– Present single view
– Focus on concurrent, independent access
– Knowledge of collective I/O usually very limited

– Rich I/O language
– Relaxed but sufficient semantics
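The "rich I/O language" includes file views built from MPI derived datatypes, which describe noncontiguous access in a single collective call. A sketch with illustrative sizes and file name (not from the slides):

/* Sketch (illustrative): strided, noncontiguous access described with a
 * derived datatype and a file view, then written collectively. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    enum { NBLOCKS = 16, BLOCKLEN = 256 };     /* doubles (example values) */
    double buf[NBLOCKS * BLOCKLEN];
    for (int i = 0; i < NBLOCKS * BLOCKLEN; i++) buf[i] = rank;

    /* In the file, rank r owns every nprocs-th block of BLOCKLEN doubles,
     * starting at block r: a round-robin (interleaved) layout. */
    MPI_Datatype filetype;
    MPI_Type_vector(NBLOCKS, BLOCKLEN, nprocs * BLOCKLEN, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "interleaved.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset disp = (MPI_Offset)rank * BLOCKLEN * sizeof(double);
    MPI_File_set_view(fh, disp, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    /* One collective call moves all NBLOCKS noncontiguous pieces. */
    MPI_File_write_all(fh, buf, NBLOCKS * BLOCKLEN, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}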
Reference: http://wiki.lustre.org/images/3/38/Shipman_Feb_lustre_scalability.pdf
ADIOS architecture (diagram): applications call the ADIOS API, driven by external metadata (an XML file); pluggable transport methods include DART, LIVE/DataTap, MPI-IO, POSIX IO, HDF-5, pnetCDF, viz engines, and other plug-ins; the ADIOS layer handles buffering, scheduling, and feedback.
(Diagram: Application 1 and Application 2 sharing a set of I/O servers.)
As the ratio of application processes to I/O servers grows past a certain point, write bandwidth starts to drop.
There is huge variation in I/O performance on supercomputers.
Reference: Jay Lofstead, Fang Zheng, Qing Liu, Scott Klasky, Ron Oldfield, Todd Kordenbrock, Karsten Schwan, Matthew Wolf. "Managing Variability in the IO Performance of Petascale Storage Systems." SC 2010.
(Diagram: Application 1, Application 2, and the I/O servers.)
– Higher I/O bandwidth
– Less variation
Machine (as of Fall 2012)   Peak Flops       Peak I/O bandwidth   Flops/byte
Jaguar (Cray XT5)           2.3 Petaflops    120 GB/sec           19167
Franklin (Cray XT4)         352 Teraflops    17 GB/sec            20705
Hopper (Cray XE6)           1.28 Petaflops   35 GB/sec            36571
Intrepid (BG/P)             557 Teraflops    78 GB/sec            7141

The flops/byte column is peak flops divided by peak I/O bandwidth; e.g. for Hopper, 1.28 Petaflops / 35 GB/sec ≈ 36,571 floating-point operations per byte of I/O.
A prediction by Sandia National Laboratories shows that checkpoint I/O will account for more than 50% of total simulation runtime, given current machines' failure rates.
Reference: Ron Oldfield, Sarala Arunagiri, Patricia J. Teller, Seetharami R. Seelam, Maria Ruiz Varela, Rolf Riesen, Philip C. Roth: Modeling the Impact of Checkpoints on Next-Generation Systems. MSST 2007: 30-46
Reference: Tom Peterka, Hongfeng Yu, Robert B. Ross, Kwan-Liu Ma, Robert Latham: End-to-End Study of Parallel Volume Rendering on the IBM Blue Gene/P. ICPP 2009:566-573
(Diagram: today the pipeline is Simulation → PFS → Analysis; coupling Simulation to Analysis directly removes the bottleneck!)
(Diagram: data staging. Compute nodes running Application 1 issue data requests and send packed partial data chunks, tracked by local metadata; the staging node maintains global metadata and performs stream processing and metadata calculation to produce the output data stream; numbered steps 1a–4 show the ordering.)
(Diagram: MapReduce-style pipelines in the staging area, with initialize, map, fetch, shuffle, reduce, and finalize stages.)
FlexIO architecture (diagram): simulation/analytics codes call the FlexIO API; the FlexIO runtime provides parallel data movement, DC plug-ins, performance monitoring, and buffer management on top of the EVPath messaging library, and moves data over file I/O, shared memory (SysV, mmap, Xpmem), or RDMA (InfiniBand, SeaStar/Portals, Gemini).
(Diagram: step-by-step data exchange between Simulation and Analytics processes: Steps 1.s, 1.a, 2, and 4.)
(Diagram: placing simulation and analytics must account for inter-program data movement, intra-simulation data movement, and intra-analytics data movement; data-aware mapping and holistic placement.)
(Chart: total execution time (sec) vs. GTS cores (512, 1024, 2048, 4096) for Inline, Helper Core (Data Aware Mapping), Helper Core (Holistic), Helper Core (Node T…), Staging, and Lower Bound configurations.)
– DFS co-locates computation and data and aggregates local disks; PFS assumes diskless clients
– PFS provides collective I/O semantics and (most) POSIX semantics
– A DFS like HDFS supports "key-value store" semantics
– DFS assumes "write-once" semantics and disallows concurrent writes to one file
– DFS exposes data locality information to the job scheduler
Reference: Wittawat Tantisiriroj, Swapnil Patil, Garth Gibson, Seung Woo Son, Samuel J. Lang, Robert B. Ross. "On the Duality of Data-intensive File System Design: Reconciling HDFS and PVFS." SC11, November 12-18, 2011.