Slide 1

Scalable Parallel I/O Alternatives for Massively Parallel Partitioned Solver Systems

Jing Fu, Ning Liu, Onkar Sahni, Ken Jansen, Mark Shephard, Chris Carothers

Computer Science Department
Scientific Computation Research Center (SCOREC)
Rensselaer Polytechnic Institute
chrisc@cs.rpi.edu

Acknowledgments:
Partners: Simmetrix, Acusim, Kitware, IBM
NSF PetaApps, DOE INCITE, ITR, CTS; DOE SciDAC-ITAPS, NERI; AFOSR
Industry: IBM, Northrop Grumman, Boeing, Lockheed Martin, Motorola
Computer Resources: TeraGrid, ANL, NERSC, RPI-CCNI

Slide 2

Outline

  • Motivating application: CFD
  • Blue Gene Platforms
  • I/O Alternatives

– POSIX
– PMPIO
– syncIO
– "reduced blocking" rbIO

  • Blue Gene Results
  • Summary
Slide 3

PHASTA Flow Solver Parallel Paradigm

  • Time-accurate, stabilized FEM flow solver
  • Input partitioned on a per-processor basis
  • Unstructured mesh "parts" mapped to cores
  • Two types of work:
    – Equation formation
      • O(40) peer-to-peer non-blocking comms
      • Overlapping comms with comp
      • Scales well on many machines
    – Implicit, iterative equation solution
      • Matrix assembled on processor ONLY
      • Each Krylov vector is:
        – q = Ap (matrix-vector product)
        – Same peer-to-peer comm of q PLUS
        – Orthogonalize against prior vectors
        – REQUIRES NORMS => MPI_Allreduce
      • This sets up a cycle of global comms separated by a modest amount of work (sketched below)

[Figure: unstructured mesh partitioned across processors P1, P2, P3]
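The comm cycle above can be made concrete with a short sketch. This is a minimal illustration, not PHASTA's actual code: the distributed mat-vec (with its peer-to-peer halo exchanges) is replaced by a trivial diagonal stand-in, and all names are assumptions.

    #include <math.h>
    #include <mpi.h>

    /* One Krylov step: q = A*p (a diagonal stand-in here for the real sparse
     * mat-vec with its O(40) non-blocking peer-to-peer halo exchanges),
     * followed by the global norm that forces an MPI_Allreduce. */
    double krylov_step(const double *p, const double *a_diag, double *q,
                       int n_local, MPI_Comm comm)
    {
        double local_dot = 0.0, global_dot = 0.0;

        for (int i = 0; i < n_local; i++) {
            q[i] = a_diag[i] * p[i];   /* stand-in for q = A*p + halo exchange */
            local_dot += q[i] * q[i];
        }

        /* Orthogonalization needs a global norm => MPI_Allreduce: this is the
         * cycle of global communication separated by modest local work. */
        MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM, comm);
        return sqrt(global_dot);
    }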

Slide 4

Parallel Implicit Flow Solver – Incompressible Abdominal Aorta Aneurysm (AAA)

IBM BG/L @ RPI-CCNI:

Cores (avg. elems./core)   t (secs.)   Scale factor
512 (204800)               2119.7      1 (base)
1024 (102400)              1052.4      1.01
2048 (51200)               529.1       1.00
4096 (25600)               267.0       0.99
8192 (12800)               130.5       1.02
16384 (6400)               64.5        1.03
32768 (3200)               35.6        0.93

The 32K-part case shows modest degradation due to a 15% node imbalance

Slide 5

AAA Adapted to 10⁹ Elements: Scaling on Blue Gene/P

# of cores   Rgn imb   Vtx imb   Time (s)   Scaling
32k          1.72%     8.11%     112.43     0.987
128k         5.49%     17.85%    31.35      0.885

New: @ 294,912 cores → 82% scaling. But getting I/O done is a challenge…

Slide 6

Blue Gene /L Layout

CCNI “fen”

  • 32K cores/ 16 racks
  • 12 TB / 8 TB usable RAM
  • ~1 PB of disk over GPFS
  • Custom OS kernel
Slide 7

Blue Gene /P Layout

ALCF/ANL “Intrepid”

  • 163K cores/ 40 racks
  • ~80TB RAM
  • ~8 PB of disk over GPFS
  • Custom OS kernel
Slide 8

Blue Gene/P (vs. BG/L)

Slide 9

Blue Gene I/O Architectures

  • Blue Gene/L @ CCNI

– One 2-core I/O node per 32 compute nodes
– 32K system has 512 1-Gbit/sec network interfaces
– I/O nodes connected to 48 GPFS file servers
– Servers 0, 2, 4, and 6 are metadata servers
– Server 0 does RAS and other duties
– 800 TB of storage from 26 IBM DS4200 storage arrays
– Split into 240 LUNs; each server has 10 LUNs (7 @ 1 MB and 3 @ 128 KB)
– Peak bandwidth is ~8 GB/sec read and ~4 GB/sec write

  • Blue Gene/P @ ALCF

– Similar I/O node to compute node ratio
– 128 dual-core file servers over Myrinet w/ 4 MB GPFS block size
– Metadata can be handled by any server
– 16x DDN 9900 → 7.5 PB (raw) storage w/ peak bandwidth of 60 GB/sec

Slide 10

Non-Parallel I/O: A Bad Approach…

  • Sequential I/O:

– All processes send data to rank 0, and rank 0 writes it to the file

[Diagram: processes 1..N-1 send Data 1..Data N-1 to rank 0, which writes Block 0..Block N-1 sequentially]

Lacks scaling and results in excessive memory use on rank 0. Must think parallel from the start, but that implies data/file partitioning…
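A minimal sketch of this anti-pattern, assuming a fixed per-rank count of doubles and an illustrative file name:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        const int n = 1 << 20;                  /* doubles per rank (assumed) */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double *data = malloc(n * sizeof *data);
        /* ... fill 'data' with this rank's portion of the solution ... */

        if (rank == 0) {
            /* Every byte funnels through rank 0: no parallelism, and the
             * writes of the N-1 other ranks are serialized one by one. */
            FILE *f = fopen("checkpoint.dat", "wb");
            fwrite(data, sizeof *data, n, f);   /* Block 0 */
            for (int src = 1; src < nprocs; src++) {
                MPI_Recv(data, n, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                fwrite(data, sizeof *data, n, f);  /* Block src */
            }
            fclose(f);
        } else {
            MPI_Send(data, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }

        free(data);
        MPI_Finalize();
        return 0;
    }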

Slide 11

1 POSIX File Per Processor (1PFPP)

  • Pros:

– parallelism, high performance at small core counts

  • Cons:

– lots of small files to manage – LOTS OF METADATA
– stresses the parallel filesystem
– difficult to read back data from a different number of processes
– @ 300K cores yields 600K files

  • @ JSC → kernel panic!!

– PHASTA currently uses this approach…
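As a concrete illustration, a minimal 1PFPP write might look like the following; the file-naming scheme is an assumption, not PHASTA's actual code:

    #include <mpi.h>
    #include <stdio.h>

    /* Each rank writes its own POSIX file, named by rank: full parallelism
     * at small core counts, but core-count files (and metadata) at scale. */
    void write_1pfpp(const double *data, int n)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char fname[64];
        snprintf(fname, sizeof fname, "restart.%d.dat", rank); /* assumed scheme */

        FILE *f = fopen(fname, "wb");
        fwrite(data, sizeof *data, n, f);
        fclose(f);
    }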

Slide 12

New Partitioned Solver Parallel I/O Format

  • Assumes data is accessed in a coordinated manner
  • File: master header + series of data blocks (sketched below)
  • Each data block has a header and data
  • Ex: 4 parts w/ 2 fields per part
  • Allows for different processor configs:
    – (1 core @ 4 parts)
    – (2 cores @ 2 parts)
    – (4 cores @ 1 part)
  • Allows for 1 to many files to control metadata overheads
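A minimal sketch of such a layout follows; field names and sizes are assumptions for illustration, and the actual PHASTA format differs in detail:

    #include <stdint.h>

    /* Master header at the start of each file: describes how many
     * part/field data blocks follow and where each one begins. */
    typedef struct {
        char     magic[8];        /* format tag + endianness marker */
        uint32_t num_parts;       /* e.g., 4 parts */
        uint32_t num_fields;      /* e.g., 2 fields per part */
        uint64_t block_offsets[]; /* file offset of each data block */
    } MasterHeader;

    /* Per-block header preceding each block's raw data. Because blocks are
     * self-describing, 1, 2, or 4 cores can read back the same 4 parts. */
    typedef struct {
        uint32_t part_id;         /* which mesh part this block belongs to */
        uint32_t field_id;        /* which field (e.g., velocity, pressure) */
        uint64_t num_bytes;       /* payload size following this header */
    } BlockHeader;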
Slide 13

MPI_File alternatives: PMPIO

  • PMPIO → "poor man's parallel I/O" from the "silo" mesh and field library
  • Divides app into groups of writers
  • Within a group, only 1 writer at a time to a file
  • Passing of a "token" ensures synchronization within a group
  • Support for HDF5 file format
  • Uses MPI_File_read/write_at routines (token pattern sketched below)
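A hedged sketch of the token-passing idea; this shows the pattern, not silo's actual PMPIO API, and the function name, tag, and double payload are assumptions:

    #include <mpi.h>

    /* Within one group: rank g waits for a baton from rank g-1, does its own
     * serial write into the shared group file, then hands the baton to g+1.
     * The file is opened on MPI_COMM_SELF, so only the token holder has it. */
    void pmpio_style_write(MPI_Comm group, const char *fname,
                           const double *buf, int count, MPI_Offset offset)
    {
        int grank, gsize, baton = 0;
        MPI_Comm_rank(group, &grank);
        MPI_Comm_size(group, &gsize);

        if (grank > 0)  /* wait for the baton from the previous writer */
            MPI_Recv(&baton, 1, MPI_INT, grank - 1, 0, group,
                     MPI_STATUS_IGNORE);

        MPI_File fh;
        MPI_File_open(MPI_COMM_SELF, fname,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write_at(fh, offset, buf, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        if (grank < gsize - 1)  /* pass the baton on */
            MPI_Send(&baton, 1, MPI_INT, grank + 1, 0, group);
    }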

Slide 14

MPI_File alternatives: syncIO

  • Flexible design allows a variable number of files and procs/writers per file
  • Within a file, can be configured to write on "block size boundaries", which are typically 1 to 4 MB
  • Implemented using collective I/O routines, e.g., MPI_File_write_at_all_begin (sketched below)
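A minimal sketch of the collective pattern, assuming one file per group communicator and block-aligned per-rank offsets (not the authors' exact implementation):

    #include <mpi.h>

    /* All ranks sharing 'file_comm' write collectively to one file, each at
     * an offset aligned to an assumed block size (e.g., the 4 MB GPFS block).
     * Assumes count * sizeof(double) <= blksz so regions do not overlap. */
    void syncio_style_write(MPI_Comm file_comm, const char *fname,
                            const double *buf, int count, MPI_Offset blksz)
    {
        int frank;
        MPI_Comm_rank(file_comm, &frank);

        MPI_File fh;
        MPI_File_open(file_comm, fname,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Start each rank's region on a block-size boundary */
        MPI_Offset off = (MPI_Offset)frank * blksz;

        /* Split collective: begin the write, then complete it */
        MPI_File_write_at_all_begin(fh, off, buf, count, MPI_DOUBLE);
        MPI_File_write_at_all_end(fh, buf, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
    }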

Slide 15

MPI_File alternatives: rbIO

  • rb → "reduced blocking"
  • Targets "checkpointing"
  • Divides application into workers and writers, with 1 writer MPI task per group of workers
  • Workers send I/O to writers via MPI_Isend and are free to continue
    – e.g., hides the latency of blocking parallel I/O
  • Writers then perform blocking MPI_File_write_at operations using the MPI_COMM_SELF communicator (sketched below)
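A hedged sketch of the worker/writer split; the group layout, tag, and offsets are illustrative assumptions:

    #include <mpi.h>

    /* Worker side: fire-and-continue checkpoint send. The worker returns
     * immediately and computes on; it must test/wait on *req before
     * reusing 'buf'. */
    void worker_checkpoint(const double *buf, int count, int writer_rank,
                           MPI_Request *req)
    {
        MPI_Isend(buf, count, MPI_DOUBLE, writer_rank, /*tag=*/99,
                  MPI_COMM_WORLD, req);
    }

    /* Writer side: drain its group of workers, doing the blocking file
     * writes on MPI_COMM_SELF so no other rank waits on the I/O. */
    void writer_drain(double *buf, int count, int first_worker, int nworkers,
                      const char *fname)
    {
        MPI_File fh;
        MPI_File_open(MPI_COMM_SELF, fname,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        for (int i = 0; i < nworkers; i++) {
            MPI_Recv(buf, count, MPI_DOUBLE, first_worker + i, 99,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Offset off = (MPI_Offset)i * count * sizeof(double);
            MPI_File_write_at(fh, off, buf, count, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);
        }
        MPI_File_close(&fh);
    }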

Slide 16

BG/L: 1PFPP w/ 7.7 GB data

Slide 17

BG/L: PMPIO w/ 7.7 GB data

HDF5 peak: 600 MB/sec; raw MPI-IO peak: 900 MB/sec

Slide 18

BG/L: syncIO w/ 7.7 GB data

Read performance peak: 6.6 GB/sec; write performance peak: 1.3 GB/sec

Slide 19

BG/P: syncIO w/ ~60 GB data

Slide 20

BG/L: rbIO actual BW w/ 7.7 GB data

Slide 21

BG/L: rbIO perceived BW w/ 7.7 GB data

Peak perceived bandwidth: ~22 TB/sec and ~11 TB/sec

Slide 22

BG/P: rbIO actual BW w/ ~60 GB data

Actual write performance of rbIO with 32,768 procs on Intrepid, bandwidth in GB/s:

               1024 files  512 files  256 files  128 files  64 files  32 files  16 files  1 file
1024 writers     17.91       14.71      13.45      12.85     13.02     12.79     12.79     3.16
512 writers        —         17.49      12.05      10.53     11.39     10.53     10.53     3.20

Peak: ~17.9 GB/sec

Slide 23

BG/P: rbIO perceived BW w/ ~60 GB data

Perceived write performance of rbIO with 32,768 procs on Intrepid, bandwidth in TB/s:

               1024 files  512 files  256 files  128 files  64 files  32 files  16 files  1 file
1024 writers     20.74       19.69      19.69      19.69     19.69     19.69     19.70     19.69
512 writers        —         20.88      19.79      19.78     19.78     19.79     19.79     19.78

Peak: ~21 TB/sec

Slide 24

Related Work

  • A. Nisar, W. Liao, and A. Choudhary, "Scaling Parallel I/O Performance through I/O Delegate and Caching System," in Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008.
    – Performs "rbIO"-style delegation inside MPI via threads, using up to 10% of compute cores as I/O workers
  • Benchmark studies (highlighting just a few…)
    – H. Yu et al. [18] – BG/L: 2 GB/sec @ 1K cores
    – Saini et al. [19] – 512 NEC SX-8 cores; I/O was not scalable when all processors accessed a shared file
    – Larkin et al. [17] – large performance drop at 2K core count for Cray XT3/XT4
    – Lang et al. [30] – large I/O study across many benchmarks on Intrepid/BG-P; found 60 GB/s read and 45 GB/s write, though in practice Intrepid has a peak I/O rate of around 35 GB/sec

Slide 25

Summary and Future Work

  • We examine several parallel I/O approaches:

– 1 POSIX file per proc: < 1 GB/sec on BG/L
– PMPIO: < 1 GB/sec on BG/L
– syncIO – all processors write as groups to different files

  • BG/L: 6.6 GB/sec read, 1.3 GB/sec write
  • BG/P: 11.6 GB/sec read, 25 GB/sec write

– rbIO – gives up 3 to 6% of compute nodes to hide latency of blocking parallel I/O.

  • BG/L: 2.3 GB/sec actual write, 22 TB/sec perceived write
  • BG/P: ~18 GB/sec actual write, ~22 TB/sec perceived write
  • Good trade-off on Blue Gene
  • All procs to 1 file does not yield good performance even if aligned.
  • Performance "sweet spot" for syncIO depends significantly on I/O architecture, so the file format must be tuned accordingly

– BG/L @ CCNI has a metadata bottleneck; the # of files must be adjusted accordingly – e.g., 32 to 128 writers
– BG/P @ ALCF can sustain much higher performance, but requires more files – e.g., 1024 writers
– This suggests collective I/O is sensitive to underlying file system performance

  • For rbIO, we observed that 1024 writers gave the best performance so far on both the BG/L and BG/P platforms.

  • Future Work – impact of different filesystems on performance

– Leverage Darshan logs @ ALCF to better understand Intrepid performance
– More experiments on Blue Gene/P under PVFS and Cray XT5 under Lustre