SLIDE 1

I/O Threads to Reduce Checkpoint Blocking for an EM Solver on Blue Gene/P and Cray XK6

Jing Fu∗†, Misun Min¶, Robert Latham¶, Christopher Carothers†

† Department of Computer Science, Rensselaer Polytechnic Institute
¶ Mathematics and Computer Science Division, Argonne National Laboratory

June 29, 2012

SLIDE 2

Presentation Outline

• Introduction
• Approaches
• Performance and Analysis
• Summary

SLIDE 3

Solver systems and checkpointing

• Parallel partitioned solver systems are being applied to tackle hard problems in science and engineering, e.g., PHASTA (CFD), Nek5000 (CFD), NekCEM (CEM)
• These applications scale well on massively parallel platforms (strong scaling on 100,000s of cores)
• Traditional I/O does not scale as well and may suffer at large scale
• This talk focuses on the use of I/O threads for checkpointing an EM solver (NekCEM) on BG/P and Cray XK6

SLIDE 4

I/O software stack of a typical HPC system

SLIDE 5

Bursty I/O

Figure: I/O workload at ANL (image courtesy of Rob Ross)

• Pattern: X steps of computation → checkpoint → X steps of computation → ... (sketched below)
• Core assumption: writes are synchronized among all processors (well-supported asynchronous I/O is lacking on supercomputers)
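A minimal sketch of this blocking checkpoint loop, assuming MPI and a file-per-process layout; the names (compute_step, write_checkpoint, CP_INTERVAL) are illustrative, not NekCEM's actual code:

#include <mpi.h>
#include <stdio.h>

#define NSTEPS      1000   /* total computation steps (assumed) */
#define CP_INTERVAL 100    /* the "X" steps between checkpoints (assumed) */

static void compute_step(int step) { (void)step; /* solver work goes here */ }

static void write_checkpoint(int step, int rank)
{
    char path[64];
    snprintf(path, sizeof path, "cp_step%06d.rank%05d", step, rank);
    FILE *f = fopen(path, "wb");
    /* ... write solution fields ... */
    if (f) fclose(f);
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int step = 1; step <= NSTEPS; step++) {
        compute_step(step);
        if (step % CP_INTERVAL == 0) {
            /* Every rank blocks here until its data is on disk;
             * computation cannot resume during the write. */
            write_checkpoint(step, rank);
            MPI_Barrier(MPI_COMM_WORLD);   /* synchronized writes */
        }
    }
    MPI_Finalize();
    return 0;
}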

SLIDE 6

Checkpoint File Structure

Figure: (a) typical file structure; (b) NekCEM file structure

SLIDE 7

Related Work and Our Objective

Related Work

• Scalable Checkpoint/Restart (SCR), Lawrence Livermore National Laboratory
• ADaptable IO System (ADIOS), Oak Ridge National Laboratory
• I/O Delegate Cache System, Northwestern University

Design Factors

• design space
• platform dependency
• application transparency

Our Objective

• Goal: performance at scale
• User space, portable, reasonably generic

SLIDE 8

Previous work: from coIO (collective I/O) to naive rbIO (reduced-blocking I/O)

SLIDE 9

From coIO to naive rbIO (continued)

SLIDE 10

Method 1: Completely split rbIO

• Dedicated I/O writers
• Overlap computation and I/O
• Lose a small portion of computing resources
• Other problems? (see the communicator-split sketch below)
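A minimal sketch of the communicator split this method implies, assuming one writer per 64 ranks (mirroring nw = np/64 from the experiments later in the deck); the split rule and comments are illustrative, not NekCEM's actual code:

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, np;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    int is_writer = (rank % 64 == 0);   /* 1 writer per 64 ranks (assumed) */
    MPI_Comm subcomm;                   /* compute-only or writer-only group */
    MPI_Comm_split(MPI_COMM_WORLD, is_writer, rank, &subcomm);

    if (is_writer) {
        /* Loop: receive checkpoint buffers from the 63 compute ranks
         * in this group and write them to disk. */
    } else {
        /* Solver runs here; all collectives (e.g. MPI_Allreduce) now go
         * through subcomm rather than MPI_COMM_WORLD, which is the source
         * of the tree-vs-torus issue on the next slide. */
    }

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}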

SLIDE 11

Potential limitations with completely split rbIO

• Breaks collective-operation optimizations on Blue Gene systems
• Collective operations on the subcommunicator go through the torus network, not the tree network
• Roughly 10× slower on the torus

Table: The time (in µs) MPI_Allreduce spends on BG/P

#nodes   Time on Tree   Time on Torus   Ratio
 4096        7.68           55.65        7.24
 8192        7.72           61.88        8.01
16384        8.19           67.66        8.26

• Performance impact on applications: 1-2% of run time spent on collectives now means 10-20%
• Can be verified by running the application with the tree network off on BG/P (a micro-benchmark sketch follows)
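As a plausibility check of the table, here is a sketch of how one might time MPI_Allreduce on MPI_COMM_WORLD versus on a subcommunicator that excludes some ranks (which, on BG/P, forces the torus path); the iteration count and the 1-in-64 split rule are assumptions mirroring the rbIO writer layout:

#include <mpi.h>
#include <stdio.h>

static double time_allreduce(MPI_Comm comm, int iters)
{
    double x = 1.0, y;
    MPI_Barrier(comm);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(&x, &y, 1, MPI_DOUBLE, MPI_SUM, comm);
    return (MPI_Wtime() - t0) / iters;
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t_world = time_allreduce(MPI_COMM_WORLD, 1000);

    /* Exclude every 64th rank, as a completely split rbIO would. */
    MPI_Comm sub;
    MPI_Comm_split(MPI_COMM_WORLD, rank % 64 == 0, rank, &sub);
    double t_sub = time_allreduce(sub, 1000);

    if (rank == 1)   /* an arbitrary compute rank */
        printf("world: %.2f us   subcomm: %.2f us\n",
               t_world * 1e6, t_sub * 1e6);

    MPI_Comm_free(&sub);
    MPI_Finalize();
    return 0;
}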

SLIDE 12

Method 2: rbIO with I/O daemon threads

• Keeps the global communicator
• Simple control flow
• But how well do supercomputers support threading? (see the daemon-thread sketch below)
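A minimal sketch of the daemon-thread idea under POSIX threads: the compute thread copies its checkpoint buffer, signals the daemon, and returns to the solver loop while the daemon writes to disk in the background. The names and the single-buffer handoff are illustrative assumptions, not NekCEM's actual implementation:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;
static char  *pending = NULL;   /* buffer waiting to be written */
static size_t pending_len;
static int    done = 0;

static void *io_daemon(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (!pending && !done)
            pthread_cond_wait(&cv, &lock);
        char *buf = pending;  size_t len = pending_len;
        pending = NULL;
        pthread_mutex_unlock(&lock);
        if (!buf) break;                      /* done and nothing queued */

        FILE *f = fopen("checkpoint.bin", "wb");   /* illustrative path */
        if (f) { fwrite(buf, 1, len, f); fclose(f); }
        free(buf);            /* compute thread moved on long ago */
    }
    return NULL;
}

/* Called from the solver: copy the state and return immediately. */
static void checkpoint_async(const char *state, size_t len)
{
    char *copy = malloc(len);
    memcpy(copy, state, len);
    pthread_mutex_lock(&lock);
    free(pending);            /* drop an unwritten older snapshot, if any */
    pending = copy;  pending_len = len;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, io_daemon, NULL);

    static char state[1 << 20];               /* stand-in for solver state */
    for (int step = 1; step <= 300; step++) {
        /* ... compute ... */
        if (step % 100 == 0)
            checkpoint_async(state, sizeof state);   /* near-zero blocking */
    }

    pthread_mutex_lock(&lock);
    done = 1;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&lock);
    pthread_join(tid, NULL);
    return 0;
}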

SLIDE 13

Potential limitations of threading rbIO

• BG/P has limited threading capability:
  • defaults to one thread, supports up to three threads per core
  • does not support automatic thread switching
  • have to use a hardware thread in SMP mode
  • experiments there are for demonstration purposes
• Load-balancing issue for systems that fully support threads, e.g., Cray XK6?

SLIDE 14

NekCEM I/O on Blue Gene/P

Blue Gene/P Spec

• 163,840 cores, 80 TB RAM, 557 teraflops ("Intrepid" at ANL)
• GPFS/PVFS; 128 file servers connected to 16 DDN 9900s; 10 PB
• pset: 1 I/O node (ION) to 64 four-core compute nodes; 640 IONs to 128 file servers over a 10 Gb/s Myricom switch
• 4 MB block size; read peak 60 GB/s, write peak 47 GB/s

Experiment Setup

• 3D cylindrical waveguide simulation with different meshes: (grid points, total size) = {(143M, 13 GB), (275M, 26 GB), (550M, 52 GB)}
• Weak-scaling tests

SLIDE 15

Overview of the Blue Gene system

SLIDE 16

NekCEM I/O on BG/P: bandwidth

Figure: Write performance with NekCEM on Intrepid GPFS. Bandwidth (GB/s, 20-100) vs. number of processors (8192-32768). Series: coIO, nf=1; coIO, nf=np/64; rbIO, nf=nw=np/64; threaded rbIO (raw); threaded rbIO (perceived).

SLIDE 17

NekCEM I/O on BG/P: overall time

Figure: Compute and I/O time with NekCEM on Intrepid (GPFS). Time (seconds, 5-25) vs. number of processors (8192-32768). Series: coIO, nf=1; coIO, nf=np/64; rbIO, nf=nw=np/64; threaded rbIO; compute time.

SLIDE 18

NekCEM I/O on Cray XK6

Cray XK6 Spec

• 299,008 cores (AMD Opteron "Interlagos" on the Cray Linux microkernel), 598 TB RAM, 2.63 petaflops ("Jaguar" at ORNL)
• Lustre; 192 OSS servers to 96 DDN 9900s (7 RAID-6 (8+2) arrays per OSS); 10 PB
• 4 MB block size; peak 120 GB/s

SLIDE 19

Overview of the Cray system

Figure: Architecture diagram of Jaguar@ORNL, image courtesy of Rob Ross

SLIDE 20

NekCEM I/O on Cray: bandwidth

Figure: Write performance with NekCEM on Jaguar Lustre. Bandwidth (GB/s, 20-100) vs. number of processors (8192-32768). Series: coIO, nf=1; coIO, nf=np/64; rbIO, nf=nw=np/64; threaded rbIO (raw); threaded rbIO (perceived).

SLIDE 21

NekCEM I/O on Cray: overall time

Figure: Compute and I/O time with NekCEM on Jaguar (Lustre). Time (seconds, 5-25) vs. number of processors (8192-32768). Series: coIO, nf=1; coIO, nf=np/64; rbIO, nf=nw=np/64; threaded rbIO; compute time.

SLIDE 22

NekCEM I/O on Cray: profiling compute time

Figure: Compute-time distribution for NekCEM with 16,384 processors on Jaguar. Time (seconds, 9-11) vs. processor rank (2000-16000). Series: rbIO; threaded rbIO.

SLIDE 23

NekCEM I/O on Cray: Threaded rbIO Timing Analysis

SLIDE 24

NekCEM I/O on Cray: Speedup Analysis

$$\mathrm{Speedup}_{\mathrm{prod}}
  = \frac{T^{\mathrm{coIO}} + T^{\mathrm{coIO}}_{\mathrm{comp}}}
         {T^{\mathrm{trbIO}} + T^{\mathrm{trbIO}}_{\mathrm{comp}}}
  = \frac{X^{\mathrm{coIO}} \cdot t^{\mathrm{coIO}}_{\mathrm{comp}} + f_{cp} \cdot t^{\mathrm{coIO}}_{\mathrm{comp}}}
         {X^{\mathrm{trbIO}} \cdot t^{\mathrm{trbIO}}_{\mathrm{comp}} + f_{cp} \cdot t^{\mathrm{trbIO}}_{\mathrm{comp}}}
  = \frac{X^{\mathrm{coIO}} + f_{cp}}{X^{\mathrm{trbIO}} + f_{cp}} \cdot \frac{1}{1 + \delta},$$

where $X$ is the checkpoint time expressed as an equivalent number of computation steps, $f_{cp}$ is the number of computation steps between two checkpoints, and $\delta$ is the overhead of a single computation step with threaded rbIO relative to non-threaded I/O (i.e., $t^{\mathrm{trbIO}}_{\mathrm{comp}} = (1 + \delta) \cdot t^{\mathrm{coIO}}_{\mathrm{comp}}$). An illustrative plug-in of the formula follows.
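As a purely illustrative plug-in (the numbers below are assumptions for exposition, not measured values): if a coIO checkpoint costs $X^{\mathrm{coIO}} = 10$ steps, a threaded-rbIO checkpoint is perceived as $X^{\mathrm{trbIO}} = 1$ step, checkpoints are $f_{cp} = 20$ steps apart, and $\delta = 0.02$, then

$$\mathrm{Speedup}_{\mathrm{prod}} = \frac{10 + 20}{1 + 20} \cdot \frac{1}{1.02} \approx 1.40,$$

i.e., about a 40% production speedup; a larger $X^{\mathrm{coIO}}$ or a smaller $f_{cp}$ pushes this higher.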

Roughly 50% speedup on 32K processors of Jaguar.

SLIDE 25

Summary

• Application-transparent optimizations (MPI-IO collective), with good tuning practice, can provide decent performance on some platforms
• Application-level optimizations exploit application-specific information, offer tuning options (nf, ng, I/O threads), and give good performance on most platforms
• Data staging (in RAM, RAM disk, SSD) helps ease the pressure of bursty I/O on the file system; a trending technique in storage-system design for the exascale era
• What happens on Mira and Blue Waters?

SLIDE 26

Collaborators

• Ning Liu, Christopher D. Carothers: Department of Computer Science, Rensselaer Polytechnic Institute
• Onkar Sahni, Min Zhou, Mark Shephard: Scientific Computation Research Center (SCOREC), Rensselaer Polytechnic Institute
• Michel Rasquin, Kenneth Jansen: Aerospace Engineering Sciences, University of Colorado Boulder
• Misun Min, Paul Fischer, Rob Latham, Rob Ross: Mathematics and Computer Science Division, Argonne National Laboratory

SLIDE 27

Questions?
