SLIDE 1

I/O Threads to Reduce Checkpoint Blocking for an EM Solver on Blue Gene/P and Cray XK6

Jing Fu∗†, Misun Min¶, Robert Latham¶, Christopher Carothers†

† Department of Computer Science, Rensselaer Polytechnic Institute
¶ Mathematics and Computer Science Division, Argonne National Laboratory

June 29, 2012

SLIDE 2

Presentation Outline

• Introduction
• Approaches
• Performance and Analysis
• Summary

SLIDE 3

Solver systems and checkpointing

• Parallel partitioned solver systems are being applied to tackle hard problems in science and engineering, e.g., PHASTA (CFD), Nek5000 (CFD), NekCEM (CEM)
• These applications scale well on massively parallel platforms (strong scaling on 100,000s of cores)
• Traditional I/O does not scale as well and may suffer at large scale
• This talk focuses on the use of I/O threads for checkpointing an EM solver (NekCEM) on BG/P and Cray XK6

SLIDE 4

I/O software stack of a typical HPC system

SLIDE 5

Bursty I/O

Figure: I/O workload at ANL (image courtesy of Rob Ross)

• Pattern: X steps of computation → checkpoint → X steps of computation → ... (sketched below)
• Core assumption: writes are synchronized among all processors (well-supported asynchronous I/O is lacking on supercomputers)
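A minimal sketch of this blocking checkpoint loop, assuming MPI and a file-per-process layout; the names (compute_step, write_checkpoint, CP_INTERVAL) are illustrative, not NekCEM's actual code:

#include <mpi.h>
#include <stdio.h>

#define NSTEPS      1000   /* total computation steps (assumed) */
#define CP_INTERVAL 100    /* the "X" steps between checkpoints (assumed) */

static void compute_step(int step) { (void)step; /* solver work goes here */ }

static void write_checkpoint(int step, int rank)
{
    char path[64];
    snprintf(path, sizeof path, "cp_step%06d.rank%05d", step, rank);
    FILE *f = fopen(path, "wb");
    /* ... write solution fields ... */
    if (f) fclose(f);
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int step = 1; step <= NSTEPS; step++) {
        compute_step(step);
        if (step % CP_INTERVAL == 0) {
            /* Every rank blocks here until its data is on disk;
             * computation cannot resume during the write. */
            write_checkpoint(step, rank);
            MPI_Barrier(MPI_COMM_WORLD);   /* synchronized writes */
        }
    }
    MPI_Finalize();
    return 0;
}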

SLIDE 6

Checkpoint File Structure

Figure: (a) typical file structure; (b) NekCEM file structure

SLIDE 7

Related Work and Our Objective

Related Work

• Scalable Checkpoint/Restart (SCR), Lawrence Livermore National Laboratory
• ADaptable IO System (ADIOS), Oak Ridge National Laboratory
• I/O Delegate Cache System, Northwestern University

Design Factors

• design space
• platform dependency
• application transparency

Our Objective

• Goal: performance at scale
• User space, portable, reasonably generic

SLIDE 8

Previous work: from coIO (collective I/O) to naive rbIO (reduced-blocking I/O)

SLIDE 9

From coIO to naive rbIO (continued)

SLIDE 10

Method 1: Completely split rbIO

• Dedicated I/O writers
• Overlap computation and I/O
• Lose a small portion of computing resources
• Other problems? (see the communicator-split sketch below)
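A minimal sketch of the communicator split this method implies, assuming one writer per 64 ranks (mirroring nw = np/64 from the experiments later in the deck); the split rule and comments are illustrative, not NekCEM's actual code:

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, np;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    int is_writer = (rank % 64 == 0);   /* 1 writer per 64 ranks (assumed) */
    MPI_Comm subcomm;                   /* compute-only or writer-only group */
    MPI_Comm_split(MPI_COMM_WORLD, is_writer, rank, &subcomm);

    if (is_writer) {
        /* Loop: receive checkpoint buffers from the 63 compute ranks
         * in this group and write them to disk. */
    } else {
        /* Solver runs here; all collectives (e.g. MPI_Allreduce) now go
         * through subcomm rather than MPI_COMM_WORLD, which is the source
         * of the tree-vs-torus issue on the next slide. */
    }

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}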

SLIDE 11

Potential limitations with completely split rbIO

• Breaks collective-operation optimizations on Blue Gene systems
• Collective operations on the subcommunicator go through the torus network, not the tree network
• Roughly 10× slower on the torus

Table: The time (in µs) MPI_Allreduce spends on BG/P

#nodes   Time on Tree   Time on Torus   Ratio
 4096        7.68           55.65        7.24
 8192        7.72           61.88        8.01
16384        8.19           67.66        8.26

• Performance impact on applications: 1-2% of run time spent on collectives now means 10-20%
• Can be verified by running the application with the tree network off on BG/P (a micro-benchmark sketch follows)
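As a plausibility check of the table, here is a sketch of how one might time MPI_Allreduce on MPI_COMM_WORLD versus on a subcommunicator that excludes some ranks (which, on BG/P, forces the torus path); the iteration count and the 1-in-64 split rule are assumptions mirroring the rbIO writer layout:

#include <mpi.h>
#include <stdio.h>

static double time_allreduce(MPI_Comm comm, int iters)
{
    double x = 1.0, y;
    MPI_Barrier(comm);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(&x, &y, 1, MPI_DOUBLE, MPI_SUM, comm);
    return (MPI_Wtime() - t0) / iters;
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t_world = time_allreduce(MPI_COMM_WORLD, 1000);

    /* Exclude every 64th rank, as a completely split rbIO would. */
    MPI_Comm sub;
    MPI_Comm_split(MPI_COMM_WORLD, rank % 64 == 0, rank, &sub);
    double t_sub = time_allreduce(sub, 1000);

    if (rank == 1)   /* an arbitrary compute rank */
        printf("world: %.2f us   subcomm: %.2f us\n",
               t_world * 1e6, t_sub * 1e6);

    MPI_Comm_free(&sub);
    MPI_Finalize();
    return 0;
}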

SLIDE 12

Method 2: rbIO with I/O daemon threads

• Keeps the global communicator
• Simple control flow
• But how well do supercomputers support threading? (see the daemon-thread sketch below)
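A minimal sketch of the daemon-thread idea under POSIX threads: the compute thread copies its checkpoint buffer, signals the daemon, and returns to the solver loop while the daemon writes to disk in the background. The names and the single-buffer handoff are illustrative assumptions, not NekCEM's actual implementation:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;
static char  *pending = NULL;   /* buffer waiting to be written */
static size_t pending_len;
static int    done = 0;

static void *io_daemon(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (!pending && !done)
            pthread_cond_wait(&cv, &lock);
        char *buf = pending;  size_t len = pending_len;
        pending = NULL;
        pthread_mutex_unlock(&lock);
        if (!buf) break;                      /* done and nothing queued */

        FILE *f = fopen("checkpoint.bin", "wb");   /* illustrative path */
        if (f) { fwrite(buf, 1, len, f); fclose(f); }
        free(buf);            /* compute thread moved on long ago */
    }
    return NULL;
}

/* Called from the solver: copy the state and return immediately. */
static void checkpoint_async(const char *state, size_t len)
{
    char *copy = malloc(len);
    memcpy(copy, state, len);
    pthread_mutex_lock(&lock);
    free(pending);            /* drop an unwritten older snapshot, if any */
    pending = copy;  pending_len = len;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, io_daemon, NULL);

    static char state[1 << 20];               /* stand-in for solver state */
    for (int step = 1; step <= 300; step++) {
        /* ... compute ... */
        if (step % 100 == 0)
            checkpoint_async(state, sizeof state);   /* near-zero blocking */
    }

    pthread_mutex_lock(&lock);
    done = 1;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&lock);
    pthread_join(tid, NULL);
    return 0;
}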

SLIDE 13

Potential limitations of threading rbIO

• BG/P has limited threading capability:
  • defaults to one thread, supports up to three threads per core
  • does not support automatic thread switching
  • have to use a hardware thread in SMP mode
  • experiments there are for demonstration purposes
• Load-balancing issue for systems that fully support threads, e.g., Cray XK6?

SLIDE 14

NekCEM I/O on Blue Gene/P

Blue Gene/P Spec

• 163,840 cores, 80 TB RAM, 557 teraflops ("Intrepid" at ANL)
• GPFS/PVFS; 128 file servers connected to 16 DDN 9900s; 10 PB
• pset: 1 I/O node (ION) to 64 four-core compute nodes; 640 IONs to 128 file servers over a 10 Gb/s Myricom switch
• 4 MB block size; read peak 60 GB/s, write peak 47 GB/s

Experiment Setup

• 3D cylindrical waveguide simulation with different meshes: (grid points, total size) = {(143M, 13 GB), (275M, 26 GB), (550M, 52 GB)}
• Weak-scaling tests

SLIDE 15

Overview of the Blue Gene system

SLIDE 16

NekCEM I/O on BG/P: bandwidth

Figure: Write performance with NekCEM on Intrepid GPFS. Bandwidth (GB/s, 20-100) vs. number of processors (8192-32768). Series: coIO, nf=1; coIO, nf=np/64; rbIO, nf=nw=np/64; threaded rbIO (raw); threaded rbIO (perceived).

SLIDE 17

NekCEM I/O on BG/P: overall time

Figure: Compute and I/O time with NekCEM on Intrepid (GPFS). Time (seconds, 5-25) vs. number of processors (8192-32768). Series: coIO, nf=1; coIO, nf=np/64; rbIO, nf=nw=np/64; threaded rbIO; compute time.

SLIDE 18

NekCEM I/O on Cray XK6

Cray XK6 Spec

• 299,008 cores (AMD Opteron "Interlagos" on the Cray Linux microkernel), 598 TB RAM, 2.63 petaflops ("Jaguar" at ORNL)
• Lustre; 192 OSS servers to 96 DDN 9900s (7 RAID-6 (8+2) arrays per OSS); 10 PB
• 4 MB block size; peak 120 GB/s

SLIDE 19

Overview of the Cray system

Figure: Architecture diagram of Jaguar@ORNL, image courtesy of Rob Ross

SLIDE 20

NekCEM I/O on Cray: bandwidth

Figure: Write performance with NekCEM on Jaguar Lustre. Bandwidth (GB/s, 20-100) vs. number of processors (8192-32768). Series: coIO, nf=1; coIO, nf=np/64; rbIO, nf=nw=np/64; threaded rbIO (raw); threaded rbIO (perceived).

SLIDE 21

NekCEM I/O on Cray: overall time

Figure: Compute and I/O time with NekCEM on Jaguar (Lustre). Time (seconds, 5-25) vs. number of processors (8192-32768). Series: coIO, nf=1; coIO, nf=np/64; rbIO, nf=nw=np/64; threaded rbIO; compute time.

SLIDE 22

NekCEM I/O on Cray: profiling compute time

Figure: Compute-time distribution for NekCEM with 16,384 processors on Jaguar. Time (seconds, 9-11) vs. processor rank (2000-16000). Series: rbIO; threaded rbIO.

SLIDE 23

NekCEM I/O on Cray: Threaded rbIO Timing Analysis

SLIDE 24

NekCEM I/O on Cray: Speedup Analysis

$$\mathrm{Speedup}_{\mathrm{prod}}
  = \frac{T^{\mathrm{coIO}} + T^{\mathrm{coIO}}_{\mathrm{comp}}}
         {T^{\mathrm{trbIO}} + T^{\mathrm{trbIO}}_{\mathrm{comp}}}
  = \frac{X^{\mathrm{coIO}} \cdot t^{\mathrm{coIO}}_{\mathrm{comp}} + f_{cp} \cdot t^{\mathrm{coIO}}_{\mathrm{comp}}}
         {X^{\mathrm{trbIO}} \cdot t^{\mathrm{trbIO}}_{\mathrm{comp}} + f_{cp} \cdot t^{\mathrm{trbIO}}_{\mathrm{comp}}}
  = \frac{X^{\mathrm{coIO}} + f_{cp}}{X^{\mathrm{trbIO}} + f_{cp}} \cdot \frac{1}{1 + \delta},$$

where $X$ is the checkpoint time expressed as an equivalent number of computation steps, $f_{cp}$ is the number of computation steps between two checkpoints, and $\delta$ is the overhead of a single computation step with threaded rbIO relative to non-threaded I/O (i.e., $t^{\mathrm{trbIO}}_{\mathrm{comp}} = (1 + \delta) \cdot t^{\mathrm{coIO}}_{\mathrm{comp}}$). An illustrative plug-in of the formula follows.
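As a purely illustrative plug-in (the numbers below are assumptions for exposition, not measured values): if a coIO checkpoint costs $X^{\mathrm{coIO}} = 10$ steps, a threaded-rbIO checkpoint is perceived as $X^{\mathrm{trbIO}} = 1$ step, checkpoints are $f_{cp} = 20$ steps apart, and $\delta = 0.02$, then

$$\mathrm{Speedup}_{\mathrm{prod}} = \frac{10 + 20}{1 + 20} \cdot \frac{1}{1.02} \approx 1.40,$$

i.e., about a 40% production speedup; a larger $X^{\mathrm{coIO}}$ or a smaller $f_{cp}$ pushes this higher.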

Roughly 50% speedup on 32K processors of Jaguar.

SLIDE 25

Summary

• Application-transparent optimizations (MPI-IO collective), with good tuning practice, can provide decent performance on some platforms
• Application-level optimizations exploit application-specific information, offer tuning options (nf, ng, I/O threads), and give good performance on most platforms
• Data staging (in RAM, RAM disk, SSD) helps ease the pressure of bursty I/O on the file system; a trending technique in storage-system design for the exascale era
• What happens on Mira and Blue Waters?

SLIDE 26

Collaborators

• Ning Liu, Christopher D. Carothers: Department of Computer Science, Rensselaer Polytechnic Institute
• Onkar Sahni, Min Zhou, Mark Shephard: Scientific Computation Research Center (SCOREC), Rensselaer Polytechnic Institute
• Michel Rasquin, Kenneth Jansen: Aerospace Engineering Sciences, University of Colorado Boulder
• Misun Min, Paul Fischer, Rob Latham, Rob Ross: Mathematics and Computer Science Division, Argonne National Laboratory

SLIDE 27

Questions?
