Scalable in-memory checkpoint for hard and soft error protection



slide-1
SLIDE 1

Charm++ Workshop

Scalable in-memory checkpoint for hard and soft error protection with automatic restart on failure

Xiang Ni

Parallel Programming Laboratory University of Illinois at Urbana-Champaign

May, 2013

1 / 25

slide-2
SLIDE 2

Charm++ Workshop Fault Tolerance Philosophy in Charm++

Outline

1. Fault Tolerance Philosophy in Charm++
2. Asynchronous Checkpoint/Restart
3. Replication enhanced Checkpoint Restart

2 / 25

slide-3
SLIDE 3

Charm++ Workshop Fault Tolerance Philosophy in Charm++

Our Philosophy

Keep progress rate despite failures

[Figure: progress versus time, with and without fault tolerance support; labels: 100%, slowdown, checkpoint, failure, recovery]

Optimize for the common case
Minimize performance overhead

3 / 25

slide-4
SLIDE 4

Charm++ Workshop Fault Tolerance Philosophy in Charm++

Optimize for the common case

Failures rarely bring down more than one node at a time.
On Jaguar (now Titan, the No. 1 supercomputer), 92.27% of failures are individual node crashes.
So our strategies are geared to handle all single-node failures and most multi-node failures.

[Figure: frequency (%) of failures, on a log scale from 0.1 to 100, by number of nodes affected (1, 2, 3, 4, > 4 nodes) for System 12, System 18, System 19, System 20, System 21, MPP2, Tsubame, and Mercury]

4 / 25

slide-5
SLIDE 5

Charm++ Workshop Fault Tolerance Philosophy in Charm++

Minimize performance overhead

Automatic restart:

  Failure detection in runtime system
  Immediate rollback-recovery

Parallel recovery

Faster checkpoint

  Double in-memory checkpoint/restart

[Figure: Node A, Node B, and Node C; legend: objects, remote checkpoint, local checkpoint; B is the buddy of A]
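The double in-memory scheme in the figure can be sketched as a small simulation. This is a minimal sketch, not the Charm++ implementation: the `Node` class, the `checkpoint` and `recover` functions, and the ring-shaped buddy assignment (A to B, B to C, C to A) are all hypothetical names and assumptions made for illustration.

```python
from typing import Dict, Optional


class Node:
    """One node: live object state plus two in-memory checkpoint slots."""

    def __init__(self, name: str, state: dict):
        self.name = name
        self.state = dict(state)                  # live objects
        self.local_ckpt: Optional[dict] = None    # copy of this node's own state
        self.remote_ckpt: Optional[dict] = None   # copy held on behalf of another node


def checkpoint(nodes: Dict[str, Node], buddy_of: Dict[str, str]) -> None:
    """Each node keeps a local copy and places a second copy on its buddy."""
    for name, node in nodes.items():
        node.local_ckpt = dict(node.state)
        nodes[buddy_of[name]].remote_ckpt = dict(node.state)


def recover(nodes: Dict[str, Node], buddy_of: Dict[str, str], failed: str) -> None:
    """Rebuild the failed node from its buddy and roll every node back."""
    buddy = nodes[buddy_of[failed]]
    replacement = Node(failed, buddy.remote_ckpt)
    replacement.local_ckpt = dict(buddy.remote_ckpt)
    nodes[failed] = replacement
    # The failed node was itself a buddy; refill the copy it was holding.
    for name, b in buddy_of.items():
        if b == failed:
            replacement.remote_ckpt = dict(nodes[name].local_ckpt)
    # Surviving nodes roll back to their own local checkpoints.
    for name, node in nodes.items():
        if name != failed:
            node.state = dict(node.local_ckpt)
```

With one copy local and one copy on the buddy, any single-node failure leaves a complete checkpoint in memory somewhere, which matches the single-node focus stated earlier in the deck.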

5 / 25

slide-6
SLIDE 6

Charm++ Workshop Fault Tolerance Philosophy in Charm++

Minimize performance overhead

Automatic restart:

  Failure detection in runtime system
  Immediate rollback-recovery

Parallel recovery

Faster checkpoint

  Double in-memory checkpoint/restart
  Semi-blocking checkpointing: asynchronously store the checkpoint remotely

5 / 25

slide-7
SLIDE 7

Charm++ Workshop Asynchronous Checkpoint/Restart

Outline

1. Fault Tolerance Philosophy in Charm++
2. Asynchronous Checkpoint/Restart
3. Replication enhanced Checkpoint Restart

6 / 25

slide-8
SLIDE 8

Charm++ Workshop Asynchronous Checkpoint/Restart

Blocking Checkpoint

[Figure: timelines for NODE 1 and NODE 2; labels: barrier, checkpoint done, α, β, τ_blocking, δ_blocking]

τ_blocking: checkpoint interval
δ_blocking: checkpoint overhead

Each node has a buddy node to store the checkpoint. Resume computation after all the nodes have successfully saved the checkpoints in their buddy nodes.

7 / 25

slide-9
SLIDE 9

Charm++ Workshop Asynchronous Checkpoint/Restart

Semi-blocking Checkpointing

[Figure: timelines for NODE 1 and NODE 2; labels: barrier, local checkpoint done, remote checkpoint done, α, β, τ, δ, θ, ϕ]

τ: checkpoint interval
δ: local checkpoint overhead
θ: overlap period
ϕ: remote checkpoint interference

Resume computation as soon as each node stores its own checkpoint (local checkpoint). Interleave the transmission of the checkpoint to buddy with application execution (remote checkpoint).
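The two steps above (block only for the local copy, then ship the copy to the buddy while the application runs) can be sketched with a background thread. This is a minimal sketch under stated assumptions: `SemiBlockingCheckpointer` and the `send_to_buddy` callback are hypothetical names, and a Python thread only stands in for the runtime's asynchronous transfer.

```python
import copy
import threading


class SemiBlockingCheckpointer:
    """Blocks for the local copy only; the remote copy is sent in the background."""

    def __init__(self, send_to_buddy):
        # send_to_buddy is a hypothetical transport callback taking one snapshot.
        self._send_to_buddy = send_to_buddy
        self.local_ckpt = None
        self._transfer = None

    def checkpoint(self, state):
        # Never let two remote transfers overlap.
        self.wait_remote_done()
        # Blocking part: take the local checkpoint.
        self.local_ckpt = copy.deepcopy(state)
        # Non-blocking part: transmit the snapshot while the caller resumes.
        self._transfer = threading.Thread(
            target=self._send_to_buddy, args=(self.local_ckpt,)
        )
        self._transfer.start()

    def wait_remote_done(self):
        """Return once the buddy holds the most recent snapshot."""
        if self._transfer is not None:
            self._transfer.join()
            self._transfer = None
```

Because the snapshot is a deep copy, the application can keep mutating its state during the overlap period without changing what the buddy receives.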

8 / 25

slide-10
SLIDE 10

Charm++ Workshop Asynchronous Checkpoint/Restart

Single Checkpoint Overhead

[Figure: Wave2D weak scaling; checkpoint overhead (s) versus number of cores (128 to 1024); series: blocking checkpoint, semi-blocking checkpoint]

[Figure: ChaNGa strong scaling; checkpoint overhead (s) versus number of cores (128 to 1024); series: blocking checkpoint, semi-blocking checkpoint]

Semi-blocking checkpoint reduces checkpoint overhead significantly.

9 / 25

slide-11
SLIDE 11

Charm++ Workshop Asynchronous Checkpoint/Restart

Leveraging Solid State Drives

Solid State Drive: becoming increasingly available on individual nodes

Full SSD strategy

Half SSD strategy

  Only store remote checkpoint in SSD
  Faster checkpoint and restart

10 / 25

slide-12
SLIDE 12

Charm++ Workshop Asynchronous Checkpoint/Restart

Asynchronous Checkpointing to SSD with IO thread

[Figure: four worker threads send requests to an IO thread, which writes to the SSD; labels: SSD request, checkpoint finishes]

IO threads

  Write the checkpoint to, or read it from, the SSD when a request arrives from a worker thread.
  Notify the worker thread when the SSD has completed that request.
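The request and notify pattern described for the IO thread can be sketched with a queue. This is a minimal sketch, not the Charm++ code: the `IOThread` class, its `submit` and `shutdown` methods, and the dictionary standing in for the SSD are all assumptions made for illustration.

```python
import queue
import threading


class IOThread(threading.Thread):
    """Serves checkpoint read/write requests so worker threads do not block on IO."""

    def __init__(self, storage: dict):
        super().__init__(daemon=True)
        self._requests = queue.Queue()
        self._storage = storage  # stands in for the SSD

    def submit(self, op, key, data=None):
        """Called by a worker thread; returns a ticket immediately."""
        ticket = {"done": threading.Event(), "result": None}
        self._requests.put((op, key, data, ticket))
        return ticket

    def shutdown(self):
        """Ask the IO thread to exit after draining earlier requests."""
        self._requests.put(None)

    def run(self):
        while True:
            request = self._requests.get()
            if request is None:
                break
            op, key, data, ticket = request
            if op == "write":
                self._storage[key] = bytes(data)
            elif op == "read":
                ticket["result"] = self._storage.get(key)
            # Notify the worker that this request is complete.
            ticket["done"].set()
```

A worker calls `submit("write", key, data)`, continues computing, and checks or waits on `ticket["done"]` only when it needs to know that the checkpoint has reached storage.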

11 / 25

slide-13
SLIDE 13

Charm++ Workshop Asynchronous Checkpoint/Restart

Checkpoint/Restart on SSD

[Figure: timing penalty (s) versus checkpoint size per node (0.45, 1.34, 2.23 GB); series: half-aio, full-aio, half-sio, full-sio]

[Figure: restart time (s) versus checkpoint size per node (0.45, 1.34, 2.23 GB); series: in-memory, half-aio, full-aio]

Half SSD strategy with asynchronous IO reduces the timing penalty for checkpointing to SSD
Restart from SSD does not incur extra overhead

aio: asynchronous IO
sio: synchronous IO

12 / 25

slide-14
SLIDE 14

Charm++ Workshop Replication enhanced Checkpoint Restart

Outline

1. Fault Tolerance Philosophy in Charm++
2. Asynchronous Checkpoint/Restart
3. Replication enhanced Checkpoint Restart

13 / 25

slide-15
SLIDE 15

Charm++ Workshop Replication enhanced Checkpoint Restart

New challenge: soft error

Not just from cosmic rays
Computer electronics become more sensitive to radiation as their dimensions and operating voltages decrease to meet the requirements of high performance and low power.
What may happen if the soft failure rate keeps increasing?

[Figure: FIT rate (soft data corruption), on a log scale from 1 to 10000, versus number of sockets (2K to 1024K); vulnerability scale from 0.1 to 1]

14 / 25

slide-18
SLIDE 18

Charm++ Workshop Replication enhanced Checkpoint Restart

Partition framework in Charm++

Ranking

  Local rank
  Global rank

Inter-partition communication

  CmiInterSyncSend(local rank, partition, size, message)
  CmiInterSyncSendAndFree(local rank, partition, size, message)
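The relation between local and global ranks can be illustrated with a small conversion. This is a sketch under an assumption the slide does not state: that partitions are equal in size and occupy contiguous blocks of global ranks. The function names are hypothetical and are not part of the Charm++ API.

```python
def to_global_rank(local_rank: int, partition: int, partition_size: int) -> int:
    """Map (partition, local rank) to a global rank, assuming contiguous blocks."""
    if not 0 <= local_rank < partition_size:
        raise ValueError("local_rank must be in [0, partition_size)")
    return partition * partition_size + local_rank


def to_local_rank(global_rank: int, partition_size: int) -> tuple:
    """Inverse mapping: returns (partition, local_rank)."""
    return divmod(global_rank, partition_size)
```

Under this assumption, a sender addressing a peer in another partition only needs the pair (local rank, partition), which is the pair the two calls listed above take as their first arguments.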

15 / 25

slide-19
SLIDE 19

Charm++ Workshop Replication enhanced Checkpoint Restart

Replication enhanced Fault Tolerance Overview

Periodic soft data corruption detection
Automatically correct soft error from checkpoint
Yes, there are benefits for hard failure!

  No need for remote checkpointing

16 / 25

slide-20
SLIDE 20

Charm++ Workshop Replication enhanced Checkpoint Restart

Replication enhanced Fault Tolerance Overview

Extension from the double in-memory checkpointing

[Figure: Replica 1 and Replica 2, each showing Node A and Node B; legend: objects, remote checkpoint, local checkpoint; buddy link marked]

17 / 25

slide-21
SLIDE 21

Charm++ Workshop Replication enhanced Checkpoint Restart

Replication enhanced Fault Tolerance Overview

[Figure: timeline of replica 1 and replica 2 from job start through T1, T2, and T3; legend: application execution, checkpoint, recovery]

Events marked on the timeline:

  transfer checkpoint for soft error detection
  hard error, detected by replica 2; replica 2 sends checkpoints to replica 1 for recovery
  soft error detected; both replicas roll back
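The events on the timeline can be sketched as two operations on a pair of replicas. This is a minimal sketch, not the author's implementation: the `Replica` class and both function names are hypothetical, states are compared directly rather than transferred over a network, and what the healthy replica does after a hard error is not stated on the slide, so it is left out.

```python
import copy


class Replica:
    """One replica of the application: live state plus its last verified checkpoint."""

    def __init__(self, state):
        self.state = copy.deepcopy(state)
        self.ckpt = copy.deepcopy(state)  # last checkpoint that passed comparison


def checkpoint_and_verify(r1: Replica, r2: Replica) -> bool:
    """Compare the replicas; commit checkpoints on a match, roll both back otherwise."""
    if r1.state == r2.state:
        r1.ckpt = copy.deepcopy(r1.state)
        r2.ckpt = copy.deepcopy(r2.state)
        return True
    # A mismatch signals a soft error, but not which replica is corrupted,
    # so both replicas return to the last verified checkpoint.
    r1.state = copy.deepcopy(r1.ckpt)
    r2.state = copy.deepcopy(r2.ckpt)
    return False


def recover_hard_failure(failed: Replica, healthy: Replica) -> None:
    """Restore the failed replica from the healthy replica's checkpoint."""
    failed.ckpt = copy.deepcopy(healthy.ckpt)
    failed.state = copy.deepcopy(healthy.ckpt)
```

Since the other replica already holds an equivalent checkpoint, it can play the role the buddy node plays in the double in-memory scheme.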

18 / 25

slide-22
SLIDE 22

Charm++ Workshop Replication enhanced Checkpoint Restart

Initial Result: soft error detection overhead

[Figure: Jacobi3D AMPI; time (s) versus number of cores per replica (1k to 16k); series: checkpoint]

[Figure: LeanMD; time (s) versus number of cores per replica (1k to 16k); series: checkpoint]

19 / 25

slide-23
SLIDE 23

Charm++ Workshop Replication enhanced Checkpoint Restart

Optimization

Topology aware mapping

[Figure: number of inter-replica messages (scale 0 to 4) between Replica 1 nodes and Replica 2 nodes under (a) default mapping and (b) optimal mapping]

20 / 25

slide-25
SLIDE 25

Charm++ Workshop Replication enhanced Checkpoint Restart

Optimization

Checksum

Issue with checksum

  How to handle floating-point round-off error?
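The round-off question can be made concrete with a small experiment. This is a sketch of the issue only; the slide does not say which checksum is used or how the problem is resolved. CRC-32 is a stand-in chosen for illustration, and the different summation order is an assumed source of the discrepancy.

```python
import math
import struct
import zlib


def checksum(values):
    """CRC-32 over the raw little-endian IEEE-754 bytes of a list of doubles."""
    return zlib.crc32(struct.pack(f"<{len(values)}d", *values))


# The same three numbers summed in two different orders.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

# The results agree to within round-off but are not bitwise identical,
# so a checksum over the raw bytes reports a mismatch.
same_within_tolerance = math.isclose(a, b, rel_tol=1e-12)
same_checksum = checksum([a]) == checksum([b])
```

Here `same_within_tolerance` is true while `same_checksum` is false: comparing checksums of the raw bytes cannot distinguish benign round-off from real data corruption.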

21 / 25

slide-26
SLIDE 26

Charm++ Workshop Replication enhanced Checkpoint Restart

Result: after optimization

[Figure: Jacobi3D AMPI; time (s) versus number of cores per replica (1k to 16k); series: checkpoint, optimal mapping, checksum]

[Figure: LeanMD; time (s) versus number of cores per replica (1k to 16k); series: checkpoint, optimal mapping, checksum]

22 / 25

slide-27
SLIDE 27

Charm++ Workshop Replication enhanced Checkpoint Restart

Result: recovery from hard failures

[Figure: Jacobi3D AMPI; time (s) versus number of cores per replica (1k to 16k); series: default, optimal]

[Figure: LeanMD; time (s) versus number of cores per replica (1k to 16k); series: default, optimal]

23 / 25

slide-28
SLIDE 28

Charm++ Workshop

Thanks

Thanks! Questions?

24 / 25