Fault Tolerance in Charm++/AMPI
Sayantan Chakravorty
PPL, UIUC
April 19, 2007
Sayantan Chakravorty (PPL, UIUC) Fault Tolerance in Charm++/AMPI April 19, 2007 1 / 25
◮ Motivation
◮ Background
◮ Checkpoint-based
◮ Co-ordinated disk-based
◮ Coordinated: CoCheck, Starfish, CLIP
◮ Uncoordinated: suffers from cascading rollbacks
◮ Communication-induced: does not scale well
◮ Pessimistic: MPICH-V1, MPICH-V2, etc.
◮ Optimistic: cascading rollback, complicated recovery
◮ Causal logging: causality tracking; Manetho, MPICH-V3
◮ Disk based
◮ In-memory
◮ Message logging with fast recovery
◮ Fault prediction
◮ Evacuate processors after a fault is predicted
◮ State of chares is checkpointed to a parallel file system
◮ Collective call: MPI_Checkpoint(DIRNAME)
◮ Whole job is restarted
◮ Same job can be restarted on a different number of processors
◮ Runtime flag: +restart DIRNAME
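Why a restart on a different number of processors is possible can be illustrated with a toy model: if the checkpoint is stored per object rather than per processor, the objects can simply be redistributed at restart. The sketch below is only an illustration under assumptions; the file naming, the JSON format, and the round-robin placement are hypothetical and do not describe how Charm++/AMPI actually lays out its checkpoint directory.

```python
import json
import os
import tempfile


def checkpoint(placement, dirname):
    """Write each object's state to its own file under dirname.

    placement maps processor id -> {object id -> state}.
    The per-object file layout is an assumption made for this sketch.
    """
    os.makedirs(dirname, exist_ok=True)
    for objects in placement.values():
        for obj_id, state in objects.items():
            with open(os.path.join(dirname, f"obj_{obj_id}.json"), "w") as f:
                json.dump(state, f)


def restart(dirname, num_procs):
    """Reload every object and spread them round-robin over num_procs."""
    placement = {p: {} for p in range(num_procs)}
    names = sorted(n for n in os.listdir(dirname) if n.startswith("obj_"))
    for i, name in enumerate(names):
        obj_id = int(name[len("obj_"):-len(".json")])
        with open(os.path.join(dirname, name)) as f:
            placement[i % num_procs][obj_id] = json.load(f)
    return placement


def demo():
    # 8 objects on 4 processors, restarted on 3 processors.
    before = {
        p: {2 * p + k: {"step": 10, "value": 2 * p + k} for k in range(2)}
        for p in range(4)
    }
    with tempfile.TemporaryDirectory() as d:
        checkpoint(before, d)
        return before, restart(d, 3)
```

Calling `demo()` returns the placement before the checkpoint and the placement after restarting on three processors; the set of objects and their states is the same, only the mapping to processors differs.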
◮ Time between the last checkpoint and the crash
◮ Time to resubmit the job and have it run
◮ On the local processor
◮ On a remote buddy processor
◮ Remove all objects
◮ Use the buddy’s checkpoint to recreate objects from the crashed processor
◮ Recreate each processor's own objects from its local copy of the checkpoint
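The recovery steps above can be sketched as a small simulation. This is a toy model under assumptions, not the Charm++ implementation: the `Processor` class, the ring-style buddy assignment `(pid + 1) % n`, and the idea that a fresh replacement takes over the crashed processor's id are all hypothetical choices made for illustration.

```python
import copy


class Processor:
    def __init__(self, pid):
        self.pid = pid
        self.objects = {}      # live object states
        self.local_ckpt = {}   # checkpoint of this processor's own objects
        self.buddy_ckpt = {}   # checkpoint held on behalf of another processor


def buddy_of(pid, n):
    # Assumption: buddies are arranged in a ring.
    return (pid + 1) % n


def checkpoint(procs):
    """Store each processor's state locally and on its buddy."""
    n = len(procs)
    for p in procs:
        snapshot = copy.deepcopy(p.objects)
        p.local_ckpt = snapshot
        procs[buddy_of(p.pid, n)].buddy_ckpt = copy.deepcopy(snapshot)


def recover(procs, crashed):
    """Roll every processor back to the last checkpoint after one crash."""
    n = len(procs)
    procs[crashed] = Processor(crashed)  # replacement starts with empty memory
    for p in procs:
        p.objects = {}  # remove all objects
    for p in procs:
        if p.pid == crashed:
            # Only this processor fetches a checkpoint from another processor.
            p.objects = copy.deepcopy(procs[buddy_of(crashed, n)].buddy_ckpt)
        else:
            p.objects = copy.deepcopy(p.local_ckpt)


def demo():
    procs = [Processor(i) for i in range(4)]
    for p in procs:
        p.objects = {f"obj{p.pid}": {"iter": 5}}
    checkpoint(procs)
    for p in procs:
        p.objects[f"obj{p.pid}"]["iter"] = 8  # work done after the checkpoint
    recover(procs, crashed=2)
    return procs
```

In this sketch the replacement processor holds no checkpoint copies after recovery, so a new checkpoint would be needed before a second failure could be tolerated.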
◮ Faster checkpoints than disk based
◮ Reading checkpoints during recovery is also faster
◮ Only one processor fetches its checkpoint across the network
◮ High memory overhead
◮ All processors are rolled back even if only one crashes
◮ All the work since the last checkpoint is redone on all processors
◮ Recovery time: time between the crash and the previous checkpoint
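The cost of rolling back every processor can be made concrete with a little arithmetic. The helper below is a hypothetical illustration that assumes checkpoints are taken at a fixed interval starting at time 0; the function name and the numbers are not from the talk.

```python
def rollback_cost(crash_time, checkpoint_interval, num_procs):
    """Return (time to redo, total processor-time redone) after one crash.

    Assumes checkpoints at times 0, interval, 2 * interval, ...
    """
    last_checkpoint = (crash_time // checkpoint_interval) * checkpoint_interval
    redo = crash_time - last_checkpoint
    # Every processor rolls back, so the redone work scales with num_procs.
    return redo, redo * num_procs
```

For example, `rollback_cost(250, 100, 64)` models a crash at 250 s with a 100 s checkpoint interval on 64 processors: 50 s of execution is repeated, which amounts to 3200 processor-seconds of redone work even though only one processor failed.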
◮ Message processing sequence needs to be stored
◮ Processors need to ignore messages they have already processed
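Both requirements can be illustrated with a minimal receiver model. This is a sketch under assumptions rather than the protocol used in Charm++: messages are identified by a hypothetical `(sender, seq)` pair, the recorded processing order is assumed to survive the crash (for example because it is kept somewhere other than the receiver), and senders are assumed to keep their sent values in `sender_logs`.

```python
class Receiver:
    def __init__(self):
        self.state = 0
        self.processed = set()  # (sender, seq) ids already handled
        self.order = []         # processing sequence, kept so it can be replayed

    def deliver(self, sender, seq, value):
        msg_id = (sender, seq)
        if msg_id in self.processed:
            return False  # already processed: ignore the duplicate
        self.processed.add(msg_id)
        self.order.append(msg_id)
        # The update depends on arrival order, so replay must preserve it.
        self.state = self.state * 2 + value
        return True


def replay(order, sender_logs):
    """Rebuild a receiver by re-delivering logged messages in the stored order."""
    fresh = Receiver()
    for sender, seq in order:
        fresh.deliver(sender, seq, sender_logs[sender][seq])
    return fresh


def demo():
    sender_logs = {"A": {0: 1, 1: 3}, "B": {0: 2}}
    r = Receiver()
    r.deliver("B", 0, 2)
    r.deliver("A", 0, 1)
    r.deliver("A", 1, 3)
    duplicate_accepted = r.deliver("A", 0, 1)  # resent message
    recovered = replay(list(r.order), sender_logs)
    return r.state, recovered.state, duplicate_accepted
```

Because the state update is order-dependent, replaying the same messages in a different order would produce a different state, which is why the processing sequence has to be stored; the duplicate check is what lets a processor discard messages that are resent during another processor's recovery.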
◮ Low response time
◮ No extra processors required
◮ Efficiency loss should be proportional to loss in computational power
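One way to picture these goals is a toy evacuation step that moves the objects of a processor with a predicted fault onto the remaining processors, without using any spares. The greedy least-loaded placement below is an assumption chosen for illustration; it is not presented as the load-balancing strategy used by the runtime.

```python
import heapq


def evacuate(loads, warned):
    """Move every object off the warned processor onto the remaining ones.

    loads maps processor id -> list of per-object loads.
    Objects are placed greedily on the currently least-loaded processor.
    """
    remaining = {p: list(objs) for p, objs in loads.items() if p != warned}
    heap = [(sum(objs), p) for p, objs in remaining.items()]
    heapq.heapify(heap)
    for obj in sorted(loads[warned], reverse=True):
        total, p = heapq.heappop(heap)
        remaining[p].append(obj)
        heapq.heappush(heap, (total + obj, p))
    return remaining


def demo():
    loads = {p: [1.0, 1.0, 1.0] for p in range(4)}
    return evacuate(loads, warned=3)
```

In `demo()`, four processors each carry a load of 3.0; after evacuating one, the three survivors each carry 4.0. The per-step time grows by a factor of 4/3, which matches losing one quarter of the computational power.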