


Fault Tolerance in Charm++/AMPI

Sayantan Chakravorty

PPL, UIUC

April 19, 2007



1. Motivation
2. Background
3. Checkpoint-based
◮ Co-ordinated disk-based
◮ In-memory double checkpoint
4. Message Logging
5. Pro-active fault tolerance
6. Summary



Motivation

Larger machines are available, clusters as well as proprietary systems
MTBF decreases as the size of machines increases
Long-running applications have to tolerate faults



Background



Checkpoint

◮ Coordinated: CoCheck, Starfish, Clip
◮ Uncoordinated: suffers from cascading rollbacks
◮ Communication-induced: does not scale well

Message Logging

◮ Pessimistic: MPICH-V1, MPICH-V2, etc.
◮ Optimistic: cascading rollbacks, complicated recovery
◮ Causal logging: causality tracking, Manetho, MPICH-V3

Hybrid: Schultz et al, Bronevetsky et al



Solutions in Charm++


Reactive: react to a fault

◮ Disk based
◮ In-memory
◮ Message logging with fast recovery

Pro-active: act before a fault

◮ Fault prediction
◮ Evacuate processors after a fault is predicted


Disk-based Checkpoint


Blocking Coordinated Checkpoint

◮ The state of chares is checkpointed to the parallel file system
◮ Collective call: MPI_Checkpoint(DIRNAME)

Restart

◮ The whole job is restarted
◮ The same job can be restarted on a different number of processors
◮ Runtime flag: +restart DIRNAME

Simple yet effective for common cases

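To make the interface concrete, here is a minimal sketch of how an AMPI application might use it, assuming the collective call named above, MPI_Checkpoint(DIRNAME), takes the directory name as its argument; the timestep routine, checkpoint interval, and directory name are illustrative assumptions, not from the talk.

#include <mpi.h>

static void do_timestep() { /* application work for one step (placeholder) */ }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    char ckptDir[] = "ckpt_dir";                  // hypothetical directory name
    for (int step = 0; step < 1000; step++) {
        do_timestep();
        if (step % 100 == 99) {
            // Collective: all ranks enter, and the state of every chare is
            // written to the parallel file system under ckptDir.
            MPI_Checkpoint(ckptDir);
        }
    }
    MPI_Finalize();
    return 0;
}

// To restart, possibly on a different number of processors, relaunch with the
// runtime flag from the slide:   ./charmrun +p64 ./app +restart ckpt_dir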


Drawbacks of disk-based checkpoint

Checkpoints to the parallel file system are slow
High recovery time:

◮ Time between the last checkpoint and the crash
◮ Time to resubmit the job and have it run

In-memory Double Checkpoint: Checkpoint

Coordinated checkpoint
Each object maintains 2 checkpoints:

◮ On the local processor
◮ On a remote buddy processor

Checkpoints are stored in memory

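A small illustrative sketch of the buddy scheme (hypothetical names, not the Charm++ API): at a coordinated checkpoint every object's serialized state is stored once in local memory and sent once to a buddy processor.

#include <cstdint>
#include <map>
#include <vector>

using Blob = std::vector<uint8_t>;            // serialized object state

struct CheckpointStore {
    std::map<int, Blob> local;                // checkpoints of my own objects
    std::map<int, Blob> heldForBuddy;         // checkpoints I hold for another PE
};

int buddyOf(int pe, int numPes) { return (pe + 1) % numPes; }   // assumed mapping

// Called on every processor during the coordinated checkpoint phase.
void checkpointObject(int objId, const Blob& state, CheckpointStore& store,
                      void (*sendToPe)(int pe, int objId, const Blob& state),
                      int myPe, int numPes) {
    store.local[objId] = state;                         // copy 1: local memory
    sendToPe(buddyOf(myPe, numPes), objId, state);      // copy 2: remote buddy
}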

In-memory Double Checkpoint: Restart

A dummy process is created to replace the crashed processor
The new process starts recovery on the other processors
Other processors:

◮ Remove all objects
◮ Use the buddy's checkpoint to recreate the objects from the crashed processor
◮ Recreate their own objects from the local copy of the checkpoint
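Continuing the illustrative sketch above (same hypothetical CheckpointStore and buddyOf), the restart steps on a surviving processor might look like this:

// Sketch of what a surviving processor does when processor `crashedPe` fails.
void recoverAfterCrash(int crashedPe, int myPe, int numPes, CheckpointStore& store,
                       void (*destroyAllObjects)(),
                       void (*recreateObject)(int objId, const Blob& state)) {
    destroyAllObjects();                                // everyone rolls back
    if (buddyOf(crashedPe, numPes) == myPe) {
        // I am the buddy: rebuild the crashed processor's objects from the
        // checkpoints it sent me at the last coordinated checkpoint.
        for (const auto& kv : store.heldForBuddy)
            recreateObject(kv.first, kv.second);
    }
    // Every processor rebuilds its own objects from its local copy.
    for (const auto& kv : store.local)
        recreateObject(kv.first, kv.second);
}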

In-memory Double Checkpoint: Pros and Cons

Advantages:

◮ Faster checkpoints than disk-based
◮ Reading checkpoints during recovery is also faster
◮ Only one processor fetches a checkpoint across the network

Drawbacks:

◮ High memory overhead
◮ All processors are rolled back even if only one crashes
◮ All the work since the last checkpoint is redone on all processors
◮ Recovery time: time between the crash and the previous checkpoint

Message logging

Only processed messages affect the state of a processor
After a crash, reprocess old messages to regain the lost state
Messages are stored during execution
After a crash, only the crashed processors are rolled back
Other processors resend their messages
Caveat: the state of a processor is affected by the sequence of messages as well

◮ The message processing sequence needs to be stored
◮ Processors need to ignore messages they have already processed
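An illustrative sketch of that bookkeeping (hypothetical names, not the actual protocol implementation): senders keep their outgoing messages in memory, the receiver records the order in which messages were processed, and a message that arrives again during replay is recognized and ignored.

#include <map>
#include <string>
#include <utility>
#include <vector>

struct LoggedMessage {
    int destVp;               // destination virtual processor
    int seqNum;               // sender-assigned sequence number
    std::string payload;      // message contents
};

struct SenderLog   { std::vector<LoggedMessage> sent; };   // resent after a crash

struct ReceiverLog {
    std::map<std::pair<int, int>, long> ticketOf;  // (srcVp, seqNum) -> processing order
    long nextTicket = 0;
};

// Returns true if the message should be processed now; false if it is a
// replayed duplicate that was already processed before the crash.
bool deliver(ReceiverLog& rlog, int srcVp, const LoggedMessage& m) {
    std::pair<int, int> key(srcVp, m.seqNum);
    if (rlog.ticketOf.count(key)) return false;    // already processed: ignore
    rlog.ticketOf[key] = rlog.nextTicket++;        // remember the processing sequence
    return true;
}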

Message logging: Challenges

All the work of the crashed processor is redone by one processor
Recovery time: same as checkpoint/restart
Most parallel applications are tightly coupled
Other processors have to wait for the crashed processor to recover
Fault-free overhead is often high



Message logging: Objectives

Fast recovery: faster than the time between the crash and the previous checkpoint
Do not assume stable storage
Tolerate all single-processor and most multiple-processor faults
Low performance penalty in the fault-free case


Message logging: Our idea

During restart, distribute the work of the restarted processor among the waiting processors
How can the work on one processor be divided?
Object-based processor virtualization
Combine processor virtualization and message logging
This improves fault-free performance as well



Message Logging and Virtualization

Virtual processors are the communicating entities



Modifying message logging to work with Virtualization

When the sender and receiver are on the same processor, the receiver and the message log are on the same processor
If that processor crashes, not only does the log disappear but, more importantly, its TN (ticket number) disappears
Solved by storing some metadata about such a message on a buddy processor
During restart, redistribute the VPs of the restarted processor among all processors

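The fast-restart idea can be pictured with a simple mapping sketch: the VPs that lived on the crashed processor are dealt out among the surviving processors so that recovery work is shared. The round-robin policy below is an assumption for illustration, not the actual Charm++ strategy.

#include <map>
#include <vector>

// Assign each VP of the crashed processor a surviving processor that will
// replay its logged messages during recovery.
std::map<int, int> redistributeVps(const std::vector<int>& crashedVps,
                                   const std::vector<int>& survivingPes) {
    std::map<int, int> recoveryPeOfVp;
    for (std::size_t i = 0; i < crashedVps.size(); ++i)
        recoveryPeOfVp[crashedVps[i]] = survivingPes[i % survivingPes.size()];
    return recoveryPeOfVp;
}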


Fast Restart Performance

7-point stencil with 3D domain decomposition, written as an MPI program
16-processor run on Opterons with 1 GB RAM and Gigabit Ethernet
Checkpoint every 30 s
Simulate a fault after 27 s
2-16 VPs per processor



Fault-free performance

[Chart comparing AMPI, AMPI-FT, and AMPI-FT with multiple VPs per processor.]
We got good performance for MG, SP, and CG, but poor performance for LU.



Closer look at MG and LU

                       MG on 8 processors        LU on 8 processors
                       AMPI      AMPI-FT         AMPI      AMPI-FT
Computation Time       68.18%    68.29%          86.56%    87.81%
Idle Time              25.56%    22.75%          12.41%    48.28%
Message Send            4.34%     5.01%           0.62%     2.30%
Ticket Request Send        --     4.54%              --     0.63%
Ticket Send                --     1.37%              --     1.01%
Local Message              --     2.10%              --     0.00%
Total                  98.08%   104.06%          99.59%   140.03%

The lower granularity of LU increases idle time.



Optimizations

Synthetic benchmark: high overhead for low granularity
Increasing the number of VPs helps, but the 100 µs case is still pretty high
Combine protocol messages:
◮ Reduces CPU overhead
◮ Alleviates network congestion

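A sketch of the message-combining optimization (names and the flush threshold are assumptions, not the protocol's actual implementation): instead of sending one protocol message per application message, ticket requests headed for the same processor are buffered and flushed as a single network message, cutting per-message CPU overhead and network traffic.

#include <map>
#include <vector>

struct TicketRequest { int srcVp; int destVp; int seqNum; };

class ProtocolCombiner {
public:
    ProtocolCombiner(std::size_t flushThreshold,
                     void (*sendBatch)(int pe, const std::vector<TicketRequest>&))
        : threshold_(flushThreshold), sendBatch_(sendBatch) {}

    // Queue a ticket request; one network message carries many requests.
    void request(int destPe, const TicketRequest& r) {
        std::vector<TicketRequest>& q = pending_[destPe];
        q.push_back(r);
        if (q.size() >= threshold_) { sendBatch_(destPe, q); q.clear(); }
    }

private:
    std::map<int, std::vector<TicketRequest>> pending_;   // buffered per destination
    std::size_t threshold_;
    void (*sendBatch_)(int, const std::vector<TicketRequest>&);
};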


Optimizations: Evaluation

Real application: leanMD
The BUTANE molecular system is very small
16-processor test cluster
Iteration time: 13 ms
A message every 45 µs on each processor: the worst case



Future Work

Load balancing with message logging
Remove the need for extra processors



Pro-active Fault Tolerance

Modern hardware can be used to predict failures
The runtime system responds to a warning:

◮ Low response time
◮ No extra processors required
◮ Efficiency loss should be proportional to the loss in computational power


Processor Evacuation

Migrate Charm++ VPs off the processor
Point-to-point messaging should continue to work correctly
Collective operations should continue to work
Rewire the reduction tree around a warned processor
Can deal with multiple simultaneous failures
Load balance after an evacuation

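One of the points above, rewiring the reduction tree, can be sketched as re-parenting: the warned processor's children are attached directly to its parent so collective operations keep working while it evacuates. The tree representation below is an assumption for illustration, not the Charm++ implementation.

#include <map>
#include <vector>

struct ReductionTree {
    std::map<int, int> parent;                   // pe -> parent pe
    std::map<int, std::vector<int>> children;    // pe -> child pes
};

// Re-parent the warned processor's children to its parent (assumes the
// warned processor is not the root of the tree).
void rewireAround(ReductionTree& tree, int warnedPe) {
    int p = tree.parent[warnedPe];
    for (int child : tree.children[warnedPe]) {
        tree.parent[child] = p;
        tree.children[p].push_back(child);
    }
    tree.children[warnedPe].clear();             // warned PE drops out of the tree
}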

Summary

Charm++/AMPI provides multiple fault tolerance protocols:
◮ Disk-based checkpoint/restart
◮ In-memory checkpoint/restart
◮ Proactive fault tolerance
◮ Message logging with fast recovery (under development)
