Lecture 18: I/O Performance and Checkpoints
EN 600.320/420/620
Instructor: Randal Burns
4 November 2020
Department of Computer Science, Johns Hopkins University

The I/O Crisis in HPC
In a world where FLOPS is the commodity... disk I/O often limits performance
- Any persistent data must make it off the supercomputer
  - To magnetic disk or solid-state storage
- Storage is not as connected to the high-speed network as compute
  - Because it needs to be shared with other computers
  - Because it doesn't add to TOP500 benchmarks
Lecture 18: Checkpoint I/O Performance

Where Does the I/O Come From?
- Checkpointing!
  - And writing output from simulation (which is checkpointing)
- Checkpoint workload
  - Every node writes local state to a shared file system
  - Using POSIX calls (FS parallelized) or MPI I/O
J. Bent et al. PLFS: A Checkpoint Filesystem for Parallel Applications. SC, 2009.
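
The checkpoint workload above can be sketched with plain POSIX-style calls. This is a minimal single-process illustration, not taken from the lecture: a loop stands in for ranks that would run concurrently, and `BLOCK_SIZE`, `write_rank_state`, `read_rank_state`, and `demo` are hypothetical names. Each rank writes its state at its own offset in one shared file, so no two ranks touch the same bytes.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # assumed bytes of local state per rank; real checkpoints are far larger


def write_rank_state(path, rank, state):
    """Write one rank's state at its own offset in the shared checkpoint file."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        # Positional write: no shared file pointer, so ranks do not interfere.
        os.pwrite(fd, state, rank * len(state))
        os.fsync(fd)  # force the data to stable storage before reporting success
    finally:
        os.close(fd)


def read_rank_state(path, rank, size):
    """Read back the block that belongs to one rank (used on restart)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, size, rank * size)
    finally:
        os.close(fd)


def demo(num_ranks=8):
    """Stand-in for num_ranks processes checkpointing to one shared file."""
    path = os.path.join(tempfile.mkdtemp(), "checkpoint.dat")
    for rank in range(num_ranks):
        state = bytes([rank]) * BLOCK_SIZE  # fake local state, tagged by rank
        write_rank_state(path, rank, state)
    return path
```

In a real code each rank would run `write_rank_state` in its own process, and an MPI I/O library could replace the raw `pwrite` calls.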

Why Checkpointing
- At scale, failures occur inevitably
  - MPI synchronous model means that a failure breaks the code
  - Lose all work since start (or restart)
- Each checkpoint provides a restart point
  - Limits exposure: work lost is bounded by the time since the last checkpoint
- By policy, all codes that run at scale on supercomputers MUST checkpoint!
  - HPC centers want codes to do useful work

Checkpoint Approaches
- Automatic: store contents of memory and program counters
  - Brute force, large data, inefficient
  - But easy, no development effort
  - New interest in this approach with the emergence of VMs and containers in HPC
- Application specific: keep data structures and metadata representing current progress; hand coded by developer
  - Smaller, faster, preferred, but tedious
  - Almost all "good" codes have application-specific checkpoints
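
An application-specific checkpoint might look like the sketch below. It is an illustration under assumptions, not the lecture's code: the "simulation" is a toy loop, the state is just a step counter and a short list, and `save_checkpoint`, `load_checkpoint`, and `run` are hypothetical names. The write-to-temp-then-rename step is one common way to avoid leaving a half-written checkpoint behind after a crash.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, values):
    """Write the state to a temp file, then atomically rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "values": values}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # the old checkpoint stays valid until this succeeds


def load_checkpoint(path):
    """Return (step, values) from the last checkpoint, or a fresh start."""
    if not os.path.exists(path):
        return 0, [0.0] * 4
    with open(path) as f:
        state = json.load(f)
    return state["step"], state["values"]


def run(path, total_steps, checkpoint_every=10, fail_at=None):
    """Toy computation that resumes from the last checkpoint if one exists."""
    step, values = load_checkpoint(path)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        values = [v + 1.0 for v in values]  # stand-in for one step of real work
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(path, step, values)
    return step, values
```

If a run fails at step 37 with a checkpoint every 10 steps, the next call to `run` resumes from step 30, so only 7 steps of work are repeated.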

A Checkpoint Workload
- IOR benchmark
  - Each node transfers 512 MB
- How much parallelism?
- What effects?
M. Uselton et al. Parallel I/O Performance: From Events to Ensembles. IPDPS, 2010.

I/O Rates and PDF
- What features do you observe?
  - Lagging processes = not realizing peak I/O performance
  - Harmonics in I/O distribution = unfair resource sharing
M. Uselton et al. Parallel I/O Performance: From Events to Ensembles. IPDPS, 2010.

Statistical Observations
- Order statistics
  - A fancy way of saying that the longest operation dominates overall performance
- Law of large numbers
  - I don't think that they make this analysis cogent
  - It's right, but the Gaussian distribution is not what matters
  - A better, intuitive conclusion follows
- (RB interprets) Smaller files are better
  - The worst-case slowdown on a smaller transfer takes less absolute time than on a large transfer
  - As long as transfers are "big enough" to amortize startup costs
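
The order-statistics point can be illustrated with a small simulation. This is a sketch under assumptions that are not from the lecture: per-node transfer times are drawn independently from a log-normal distribution, and `mean_s` and `jitter` are made-up parameters. A checkpoint finishes only when the slowest node finishes, so its time is the maximum over all nodes.

```python
import random


def checkpoint_time(num_nodes, rng, mean_s=10.0, jitter=0.3):
    """Time for one checkpoint: the slowest node's transfer time (the maximum)."""
    # Assumed model: each node's time is mean_s scaled by a log-normal factor.
    return max(mean_s * rng.lognormvariate(0.0, jitter) for _ in range(num_nodes))


def average_checkpoint_time(num_nodes, trials=200, seed=0):
    """Average the checkpoint time over many simulated checkpoints."""
    rng = random.Random(seed)
    return sum(checkpoint_time(num_nodes, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    for n in (1, 16, 256):
        print(n, "nodes:", round(average_checkpoint_time(n), 2), "s")
```

Every node has the same typical transfer time, yet the average checkpoint time grows with the node count, because more draws make an unusually slow node more likely.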

Smaller Files Improve Performance
- Non-intuitive
  - Smaller operations seem like more overhead
  - But it is a property of the statistical analysis
- Smaller is better as long as fixed costs are amortized
  - Obviously, 1 byte is too small
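
A toy model makes the trade-off concrete. It is only a sketch, and every parameter is an assumption rather than a measurement: each node moves `total_mb` split into `pieces` transfers, each transfer pays a fixed startup cost `fixed_s` plus a data time scaled by an independent random slowdown, and the job time is the slowest node's total.

```python
import random


def job_time(num_nodes, pieces, rng, total_mb=512.0, mb_per_s=50.0,
             fixed_s=0.1, jitter=0.5):
    """Slowest node's total time when each node splits its data into pieces."""
    piece_s = (total_mb / pieces) / mb_per_s  # nominal data time per transfer
    slowest = 0.0
    for _ in range(num_nodes):
        # Assumed model: each transfer gets its own independent slowdown factor.
        node_s = sum(fixed_s + piece_s * rng.lognormvariate(0.0, jitter)
                     for _ in range(pieces))
        slowest = max(slowest, node_s)
    return slowest


def average_job_time(pieces, num_nodes=64, trials=10, seed=1):
    """Average job time over several simulated checkpoints."""
    rng = random.Random(seed)
    return sum(job_time(num_nodes, pieces, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    for pieces in (1, 32, 1024):
        print(pieces, "transfers per node:", round(average_job_time(pieces), 1), "s")
```

With one large transfer per node, a single unlucky slowdown stretches the whole job. With a moderate number of transfers, slowdowns average out within each node. With very many tiny transfers, the fixed cost dominates, which is the "1 byte is too small" case.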

The Checkpoint Crisis
As HPC codes get larger, I/O becomes more critical
- Some observations
  - Checkpoint to protect against failure
  - More components increase failure probability
  - FLOPS grow faster than bandwidth
- Conclusion
  - Must take slower checkpoints more often
  - Eventually you will get no constructive work done between checkpoints
- Mitigation (just delaying the problem)
  - Burst buffers: fast (SSD) storage in the high-speed network
  - Observe that checkpoints need to persist for a shorter time than output/analysis data
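
A back-of-envelope model shows how the useful fraction of machine time can collapse. This sketch is not from the lecture: it uses Young's first-order approximation for the checkpoint interval, tau = sqrt(2 * delta * M), where delta is the time to write one checkpoint and M is the system mean time between failures. The approximation is only accurate when delta is much smaller than M, so the large-delta results are indicative at best. The function name and the sample numbers are hypothetical.

```python
import math


def checkpoint_budget(checkpoint_s, system_mtbf_s):
    """Return (interval_s, useful_fraction) under Young's approximation."""
    # Interval that roughly balances checkpoint cost against expected rework.
    interval_s = math.sqrt(2.0 * checkpoint_s * system_mtbf_s)
    # Time lost to writing checkpoints, as a fraction of the interval.
    write_overhead = checkpoint_s / interval_s
    # Expected rework: on average half an interval is lost per failure.
    rework_overhead = interval_s / (2.0 * system_mtbf_s)
    useful = max(0.0, 1.0 - write_overhead - rework_overhead)
    return interval_s, useful


if __name__ == "__main__":
    # Assumed scenarios: checkpoints get slower while failures get more frequent.
    for checkpoint_s, mtbf_s in ((60, 86400), (120, 21600), (600, 3600), (3600, 3600)):
        interval_s, useful = checkpoint_budget(checkpoint_s, mtbf_s)
        print(checkpoint_s, mtbf_s, round(interval_s), round(useful, 3))
```

As the checkpoint time approaches the time between failures, the estimated useful fraction falls to zero, which matches the slide's conclusion that eventually no constructive work gets done.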

Extra Slides

Fixing I/O Performance
- Compare the same I/O benchmark on two platforms
  - 256 nodes of Franklin and Jaguar

Problem = Long Read Delays

Problem Analysis
- Not all reads are slow
  - Just 4-8
- What special property do they have?
  - None: the reads are the same as earlier and later reads
- So, maybe something about ordering

Problem Analysis
- After the third read, the system detects a strided read pattern and performs read-ahead
  - Requires client-side buffering of data
- Other uses of memory (client writes) consumed buffer space, preventing the read-ahead from working
- The Lustre file system executed a fall-back code path
  - Perform small reads when no buffer space is available
  - Small reads are very inefficient
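
The behavior described above can be mimicked with a toy model. This is emphatically not Lustre's implementation: `StrideDetector`, `plan_read`, and the three plan labels are hypothetical, and the buffer check is reduced to a single comparison. It only shows how a detector that fires on the third read can lead to either read-ahead or a small-read fallback depending on free buffer space.

```python
class StrideDetector:
    """Toy detector: three reads with a constant nonzero stride predict the next offset."""

    def __init__(self):
        self.offsets = []

    def observe(self, offset):
        """Record a read offset; return the predicted next offset or None."""
        self.offsets = (self.offsets + [offset])[-3:]  # keep the last three reads
        if len(self.offsets) < 3:
            return None
        a, b, c = self.offsets
        if b - a == c - b and c != b:
            return c + (c - b)
        return None


def plan_read(detector, offset, size, free_buffer_bytes):
    """Choose how a read is served in this toy model."""
    predicted = detector.observe(offset)
    if predicted is None:
        return "plain"        # no pattern detected yet
    if free_buffer_bytes >= size:
        return "read-ahead"   # enough client buffer to prefetch the predicted block
    return "small-reads"      # assumed fallback when the buffer is exhausted
```

For example, four reads at a 4096-byte stride are planned as plain, plain, read-ahead, read-ahead when buffer space is free, and as plain, plain, small-reads, small-reads when it is not.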

Problem Resolution
- Patch the file system
  - Turn off read-ahead in this case
- Problem solved (4x improvement)

Another Code (Resolution Process)
- Reduce the number of tasks (10K -> 80) and have each task do many small I/Os
  - Variability reduction from more small I/Os
  - Reduce resource use and contention (fewer actors)
- Align the request size to file system parameters
  - Increase transfer rate
- Defer and aggregate metadata writes
  - Avoid lots of small updates
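
Aligning requests to file system parameters might look like the helpers below. This is a sketch with hypothetical names: `stripe_bytes` stands for whatever unit the file system reports (for example a stripe or block size), and how that value is obtained is outside this example.

```python
def align_request(offset, length, stripe_bytes):
    """Expand [offset, offset + length) to the enclosing stripe-aligned range."""
    if stripe_bytes <= 0 or offset < 0 or length < 0:
        raise ValueError("stripe_bytes must be positive; offset and length non-negative")
    start = (offset // stripe_bytes) * stripe_bytes
    end = -(-(offset + length) // stripe_bytes) * stripe_bytes  # round up
    return start, end - start


def split_by_stripe(offset, length, stripe_bytes):
    """Split a request into pieces that never cross a stripe boundary."""
    if stripe_bytes <= 0 or offset < 0 or length < 0:
        raise ValueError("stripe_bytes must be positive; offset and length non-negative")
    pieces = []
    end = offset + length
    while offset < end:
        boundary = (offset // stripe_bytes + 1) * stripe_bytes  # next stripe edge
        piece_end = min(boundary, end)
        pieces.append((offset, piece_end - offset))
        offset = piece_end
    return pieces
```

With an assumed 1024-byte stripe, a 1000-byte request at offset 1500 touches two stripes. `align_request` widens it to the full range starting at 1024, while `split_by_stripe` cuts it at offset 2048 so that each piece stays inside one stripe.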

Thoughts on MPI Performance
- Visualization tools work and matter
  - Examples of 5x to 10x differences
- I/O is a huge component of performance
  - This is only trending up
  - Memory capacity and processor speed make more data
  - Scale requires more frequent checkpoints
- HPC is a complex and fragile ecosystem
  - Many parameters and implementation subtleties
- Order statistics rule
  - Only as fast as the slowest member
  - This gets more problematic as we use more nodes
  - HW errors and SW misconfigurations on one node can ruin a cluster. Must diagnose!
