Scalable in-memory checkpoint for hard and soft error protection

Charm++ Workshop
Scalable in-memory checkpoint for hard and soft error protection with automatic restart on failure
Xiang Ni
Parallel Programming Laboratory, University of Illinois at Urbana-Champaign
May 2013

Outline
1 Fault Tolerance Philosophy in Charm++
2 Asynchronous Checkpoint/Restart
3 Replication enhanced Checkpoint Restart

Our Philosophy
- Keep progress rate despite failures
[Figure: progress (up to 100%) versus time, with and without fault tolerance support, marking checkpoint, failure, recovery, and slowdown.]
- Optimize for the common case
- Minimize performance overhead

Optimize for the common case
- Failures rarely bring down more than one node at a time
- On Jaguar (now Titan, the No. 1 supercomputer), 92.27% of failures are individual node crashes
- So our strategies are geared to handle all single-node failures and most multi-node failures
[Figure: frequency of failures (%, log scale) by number of nodes affected (1, 2, 3, 4, > 4 nodes) for System 12, System 18, System 19, System 20, System 21, MPP2, Tsubame, and Mercury.]

Minimize performance overhead
- Automatic restart:
- Failure detection in runtime system
- Immediate rollback-recovery
- Parallel recovery
- Faster checkpoint
- Double in-memory checkpoint/restart
- Semi-blocking checkpointing: asynchronously store the checkpoint remotely
[Figure: double in-memory checkpoint across Node A, Node B, and Node C, with rows for objects, local checkpoint, and remote checkpoint. B is the buddy of A.]
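The double in-memory scheme above can be sketched as a small simulation. This is a minimal illustration, not the Charm++ implementation: the `Node` class, the ring-shaped buddy assignment in `buddy_of`, and the `checkpoint` and `recover` helpers are all hypothetical names introduced here.

```python
import copy


class Node:
    """Hypothetical stand-in for a compute node holding application objects."""

    def __init__(self, rank, state):
        self.rank = rank
        self.state = state        # live application objects
        self.local_ckpt = None    # copy of this node's own objects
        self.remote_ckpt = None   # copy held on behalf of another node


def buddy_of(rank, num_nodes):
    # Assumption: buddies form a ring, so node i keeps its remote copy on node i+1.
    return (rank + 1) % num_nodes


def checkpoint(nodes):
    """Store every node's state twice: locally and in its buddy's memory."""
    for node in nodes:
        snapshot = copy.deepcopy(node.state)
        node.local_ckpt = snapshot
        nodes[buddy_of(node.rank, len(nodes))].remote_ckpt = copy.deepcopy(snapshot)


def recover(nodes, failed_rank):
    """Rebuild a crashed node from its buddy, then roll all nodes back."""
    buddy = nodes[buddy_of(failed_rank, len(nodes))]
    replacement = Node(failed_rank, copy.deepcopy(buddy.remote_ckpt))
    replacement.local_ckpt = copy.deepcopy(buddy.remote_ckpt)
    # The crashed node also held the remote copy for its predecessor; restore it.
    predecessor = nodes[(failed_rank - 1) % len(nodes)]
    replacement.remote_ckpt = copy.deepcopy(predecessor.local_ckpt)
    nodes[failed_rank] = replacement
    for node in nodes:
        node.state = copy.deepcopy(node.local_ckpt)


nodes = [Node(rank, {"x": rank}) for rank in range(3)]
checkpoint(nodes)
nodes[1].state["x"] = 99   # work done after the checkpoint
recover(nodes, 0)          # node 0 crashes; its memory is lost
```

Because each checkpoint lives in two memories, any single-node failure leaves at least one copy intact. A failure that takes out a node and its buddy together is not covered by this sketch.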

2 Asynchronous Checkpoint/Restart

Blocking Checkpoint
[Figure: timeline for NODE 1 and NODE 2 showing the checkpoint interval τ, the blocking checkpoint overhead δ, and the barrier at which the checkpoint is done.]
- Each node has a buddy node to store the checkpoint.
- Resume computation after all the nodes have successfully saved the checkpoints in their buddy nodes.

Semi-blocking Checkpointing
[Figure: timeline for NODE 1 and NODE 2 showing the checkpoint interval τ, the local checkpoint overhead δ, the overlap period θ, the remote checkpoint interference φ, and the points at which the barrier, the local checkpoint, and the remote checkpoint are done.]
- Resume computation as soon as each node stores its own checkpoint (local checkpoint).
- Interleave the transmission of the checkpoint to buddy with application execution (remote checkpoint).
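A minimal sketch of the semi-blocking idea, assuming a Python thread stands in for the network transfer and a dictionary (`buddy_store`) stands in for the buddy node's memory. The function name `semi_blocking_checkpoint` is hypothetical and not part of Charm++.

```python
import copy
import threading


def semi_blocking_checkpoint(state, buddy_store, key):
    """Take the local checkpoint now; ship it to the buddy in the background."""
    # Blocking part: the application waits only for this local copy.
    local_ckpt = copy.deepcopy(state)

    def send_to_buddy():
        # Stand-in for the remote transfer, overlapped with application work.
        buddy_store[key] = copy.deepcopy(local_ckpt)

    sender = threading.Thread(target=send_to_buddy)
    sender.start()
    return local_ckpt, sender


state = {"iteration": 10, "grid": [1.0, 2.0, 3.0]}
buddy_store = {}

local_ckpt, sender = semi_blocking_checkpoint(state, buddy_store, key="node1")
state["iteration"] += 1   # computation resumes while the transfer is in flight
sender.join()             # the remote checkpoint is now done
```

The transfer reads only the local snapshot, never the live state, so the application can keep mutating its objects during the overlap period without corrupting the remote copy.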

Single Checkpoint Overhead
[Figure: checkpoint overhead (s) versus number of cores (128, 256, 512, 1024) for blocking and semi-blocking checkpoint. Panels: Wave2D weak scaling; ChaNGa strong scaling.]
- Semi-blocking checkpoint reduces checkpoint overhead significantly.

Leveraging Solid State Drives
- Solid State Drive: becoming increasingly available on individual nodes
- Full SSD strategy
- Half SSD strategy
- Only store remote checkpoint in SSD
- Faster checkpoint and restart

Asynchronous Checkpointing to SSD with IO thread
[Figure: four worker threads and one IO thread. A worker thread sends a "request write to SSD" to the IO thread and is told when the checkpoint finishes.]
IO threads:
- Write the checkpoint to, or read the checkpoint from, the SSD on receiving a request from a worker thread.
- Notify the worker thread when the SSD is done with a given request.
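The request and notification pattern above can be sketched with a queue and an event. This is an illustrative sketch only: `io_thread_main` and `request_write` are hypothetical names, and a temporary directory stands in for the SSD mount point.

```python
import os
import queue
import tempfile
import threading


def io_thread_main(requests):
    """Serve write requests from worker threads until a None sentinel arrives."""
    while True:
        request = requests.get()
        if request is None:
            break
        path, payload, done = request
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())   # make sure the data has reached the device
        done.set()                 # notify the requesting worker thread


def request_write(requests, path, payload):
    """Called by a worker thread; returns immediately with a completion event."""
    done = threading.Event()
    requests.put((path, payload, done))
    return done


ssd_dir = tempfile.mkdtemp()       # assumption: stand-in for the SSD
requests = queue.Queue()
io_thread = threading.Thread(target=io_thread_main, args=(requests,))
io_thread.start()

path = os.path.join(ssd_dir, "worker0.ckpt")
done = request_write(requests, path, b"checkpoint bytes")
# ... the worker thread keeps computing here ...
done.wait()                        # block only when the result is needed

requests.put(None)                 # shut the IO thread down
io_thread.join()
```

Keeping the file IO on a dedicated thread means a worker pays only for enqueueing the request, which is the point of the asynchronous (aio) variants.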

Checkpoint/Restart on SSD
[Figure: timing penalty (s) and restart time (s) versus checkpoint size per node (0.45, 1.34, 2.23 GB). Series: half-aio, full-aio, half-sio, full-sio, in-memory. aio: asynchronous IO; sio: synchronous IO.]
- Half SSD strategy with asynchronous IO reduces the timing penalty for checkpointing to SSD
- Restart from SSD does not incur extra overhead

3 Replication enhanced Checkpoint Restart

New challenge: soft error
- Not just from cosmic rays
- The sensitivity of computer electronics to radiation increases as their dimensions and operating voltages decrease, driven by the requirements for high performance and low power.
- What may happen if the soft failure rate keeps increasing?
[Figure: vulnerability (0 to 1) and FIT rate for soft data corruption (1 to 10000, log scale) versus number of sockets (2K to 1024K).]

Partition framework in Charm++
- Ranking
- Local rank
- Global rank
- Inter-partition communication
- CmiInterSyncSend(local rank, partition, size, message)
- CmiInterSyncSendAndFree(local rank, partition, size, message)
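To illustrate how a (local rank, partition) pair can address a processor, here is a small sketch of one possible rank layout. It assumes equal-size partitions laid out contiguently in global rank order, which is an assumption made for this example and may not match the mapping a given Charm++ run uses. All three function names are hypothetical.

```python
def to_global_rank(local_rank, partition, partition_size):
    """Global rank of a processor, assuming contiguous equal-size partitions."""
    return partition * partition_size + local_rank


def to_partition_and_local(global_rank, partition_size):
    """Inverse mapping: returns (partition, local_rank)."""
    return divmod(global_rank, partition_size)


def replica_peer(global_rank, partition_size):
    """With two partitions used as replicas, the peer has the same local rank
    in the other partition. Returns (local_rank, partition), the addressing
    order used by the inter-partition send calls listed above."""
    partition, local_rank = to_partition_and_local(global_rank, partition_size)
    return local_rank, 1 - partition
```

For example, with 8 processors per partition, local rank 3 in partition 1 is global rank 11 under this layout, and its replica peer is local rank 3 in partition 0.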

Replication enhanced Fault Tolerance Overview
- Periodic soft data corruption detection
- Automatically correct soft errors from checkpoint
- Yes, there are benefits for hard failures!
- No need for remote checkpointing

Replication enhanced Fault Tolerance Overview
- Extension from the double in-memory checkpointing
[Figure: Replica 1 and Replica 2, each with Node A and Node B, with rows for objects, local checkpoint, and remote checkpoint. Buddies are paired across the two replicas.]

Replication enhanced Fault Tolerance Overview
[Figure: timeline from job start for replica 1 and replica 2, with time marks T1, T2, and T3. Legend: application execution; transfer checkpoint for soft error detection; checkpoint; recovery.]
Events on the timeline:
- Hard error detected by replica 2
- Replica 2 sends checkpoints to replica 1 for recovery
- Soft error detected, both replicas roll back
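The events on this timeline can be sketched as two small functions. This is a simplified model under stated assumptions: each replica is a plain dictionary with `state` and `checkpoint` entries, the comparison is a direct equality test, and the function names are hypothetical. How the healthy replica proceeds after a hard error is not modeled.

```python
import copy


def compare_and_commit(replica1, replica2):
    """Periodic soft error detection: compare the replicas' candidate checkpoints.

    If they agree, both commit a new checkpoint. If they differ, a soft error
    has corrupted one of them; since it is unknown which, both roll back.
    """
    if replica1["state"] == replica2["state"]:
        replica1["checkpoint"] = copy.deepcopy(replica1["state"])
        replica2["checkpoint"] = copy.deepcopy(replica2["state"])
        return "checkpointed"
    replica1["state"] = copy.deepcopy(replica1["checkpoint"])
    replica2["state"] = copy.deepcopy(replica2["checkpoint"])
    return "rolled back"


def recover_from_hard_error(failed, healthy):
    """Hard error: the healthy replica sends its checkpoint to the failed one."""
    failed["checkpoint"] = copy.deepcopy(healthy["checkpoint"])
    failed["state"] = copy.deepcopy(healthy["checkpoint"])


r1 = {"state": [1, 2], "checkpoint": [0, 0]}
r2 = {"state": [1, 2], "checkpoint": [0, 0]}
first = compare_and_commit(r1, r2)    # replicas agree

r1["state"] = [9, 2]                  # silent corruption in replica 1
r2["state"] = [3, 2]
second = compare_and_commit(r1, r2)   # mismatch detected

r1 = {"state": None, "checkpoint": None}   # replica 1 loses its memory
recover_from_hard_error(r1, r2)
```

Because the other replica always holds a verified checkpoint, it can play the role that the buddy's remote copy plays in the double in-memory scheme.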

Initial Result: soft error detection overhead
[Figure: checkpoint time (s) versus number of cores per replica (1k to 16k). Panels: Jacobi3D AMPI; LeanMD.]

Optimization
- Topology aware mapping
[Figure: placement of replica 1 nodes and replica 2 nodes, annotated with the number of inter-replica messages (0 to 4), for (a) default mapping and (b) optimal mapping.]

Optimization
- Checksum
- Issue with checksum
- How to handle floating point round-off error?
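The round-off issue can be reproduced in a few lines. The sketch below assumes a CRC32 over the raw bytes of the values, which is one possible checksum and not necessarily the one used in this work. It shows two results that are numerically the same up to round-off yet produce different checksums, for example if two replicas were to add the same terms in a different order. The `rounded_checksum` helper is a hypothetical mitigation added for illustration; the slides leave the question open.

```python
import struct
import zlib


def checksum(values):
    """CRC32 over the raw bytes of a list of doubles."""
    return zlib.crc32(struct.pack(f"{len(values)}d", *values))


def rounded_checksum(values, digits=9):
    """Hypothetical mitigation: round each value before checksumming."""
    return checksum([round(v, digits) for v in values])


# The same three terms, added in a different order.
a = [(0.1 + 0.2) + 0.3]
b = [0.1 + (0.2 + 0.3)]

bitwise_match = checksum(a) == checksum(b)
rounded_match = rounded_checksum(a) == rounded_checksum(b)
```

Here the two sums differ only in the last bit of the mantissa, so the byte-level checksums disagree even though no soft error occurred. Rounding first hides that difference in this case, but it is not a complete answer: two values that straddle a rounding boundary would still produce different checksums.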

Result: after optimization
[Figure: time (s) versus number of cores per replica (1k to 16k) for checkpoint, checksum, and optimal mapping. Panels: Jacobi3D AMPI; LeanMD.]

Result: recovery from hard failures
[Figure: time (s) versus number of cores per replica (1k to 16k) for default and optimal mapping. Panels: Jacobi3D AMPI; LeanMD.]

Thanks! Questions?
