SLIDE 1

Fault Tolerance Support for Supercomputers with Multicore Nodes

Esteban Meneses, Xiang Ni

Monday, April 18, 2011

SLIDE 2

Exascale supercomputer: 100 million cores. “An exascale system could be expected to have a failure ... every 35–39 minutes”

Exascale Computing Study

“insufficient resilience of the software infrastructure would likely render extreme scale systems effectively unusable”

The International Exascale Software Project

SLIDE 3

Contents

  • Charm++ Fault Tolerance Infrastructure.
  • Fault Tolerance in SMP.
  • Preliminary Results.
  • Multiple Concurrent Failure Model.
  • Future Work.

SLIDE 4

Fault Tolerance in Charm++

  • Object Migration
  • Load Balancing
  • Runtime Support
  • SMP version

[Diagram: Charm++ objects migrating among Node X, Node Y, and Node Z]

SLIDE 5

Strategies

  • Proactive
  • Checkpoint Restart
  • Message Logging
  • SMP Checkpoint Restart
  • SMP Message Logging

SLIDE 6

Proactive Fault Tolerance

[Diagram: a Predictor anticipates an imminent failure of Node W, so its objects can be evacuated to Nodes X, Y, and Z]

SLIDE 7

Checkpoint Restart

[Diagram: Node X fails and is replaced by Node X’, alongside Nodes W, Y, and Z]

[Figure: two plots of time per step (s) vs. timestep (1–601), comparing restart with and without load balancing (LB)]

  • Restart with or without spare nodes.
  • Checkpoint in buddy’s memory.

SLIDE 8

Message Logging

[Diagram: messages m1 and m2 exchanged among Nodes W, X, Y, and Z, grouped into Team Q and Team R]

  • Team-based Message Logging.
  • Parallel Restart.
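A minimal sketch of the team-based logging rule above (function and parameter names are my own, not the Charm++ API): only messages that cross a team boundary are logged, because when one member of a team fails the whole team rolls back together, so intra-team messages need no log.

```python
# Hypothetical sketch of team-based message logging: nodes are grouped
# into teams of fixed size, and a message is logged only when its sender
# and receiver belong to different teams.

def team_of(node: int, team_size: int) -> int:
    """Which team a node belongs to (contiguous grouping, assumed)."""
    return node // team_size

def must_log(sender: int, receiver: int, team_size: int) -> bool:
    """Log only inter-team messages; intra-team traffic is not logged."""
    return team_of(sender, team_size) != team_of(receiver, team_size)

# Teams of 4: a message 0 -> 2 stays within one team, 1 -> 5 crosses teams.
print(must_log(0, 2, 4))  # False: intra-team, not logged
print(must_log(1, 5, 4))  # True: inter-team, logged
```

Larger teams log fewer messages (lower memory overhead) at the cost of rolling back more nodes on a failure, which is the trade-off the team-based load balancer on the next slide addresses.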

SLIDE 9

Team-Based Load Balancer

[Figure: execution time (seconds, 40–360) and memory overhead (MB, 12–108) for NoLB(8), GreedyLB(8), TeamLB(1), and TeamLB(8)]

SLIDE 10

SMP Checkpoint Restart

[Diagram: Node X with PEs A, B, C, D over Shared Memory (SM)]

  • The minimum unit of failure is a node.
  • Single node failure support.

SLIDE 11

SMP Message Logging

[Diagram: Node X (PEs A, B) and Node Y (PEs C, D), each with its own shared memory (SM)]

  • Causal message logging ➙ determinants in shared memory.
  • Load balancing ➙ increase communication inside a node.
  • Lock contention ➙ hybrid scheme.
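The lock-contention point can be illustrated with a small sketch (illustrative Python, not Charm++ code): if all PEs on a node append determinants to one shared log, they contend for a single lock, which motivates the hybrid scheme.

```python
# Illustrative sketch, assuming a single node-wide determinant log:
# every PE on the node appends to the same lock-protected structure,
# so all PEs contend for one lock on every message event.
import threading

class SharedDeterminantLog:
    def __init__(self):
        self._lock = threading.Lock()
        self._log = []

    def append(self, pe: int, det: tuple) -> None:
        # All PEs on the node serialize on this one lock.
        with self._lock:
            self._log.append((pe, det))

    def __len__(self) -> int:
        return len(self._log)

log = SharedDeterminantLog()
threads = [
    threading.Thread(
        target=lambda pe=pe: [log.append(pe, ("msg", i)) for i in range(100)]
    )
    for pe in range(4)  # 4 PEs on the node
]
for t in threads: t.start()
for t in threads: t.join()
print(len(log))  # 400: one entry per message event across 4 PEs
```

A hybrid scheme would keep per-PE logs (no contention) and merge or share them only when needed, trading lock traffic for extra bookkeeping.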

SLIDE 12

Experiments

  • Hardware:
  • Abe@NCSA: 1200 8-way SMP nodes.
  • Ranger@TACC: 3936 16-way SMP nodes.
  • Benchmarks:
  • Ring: Charm++ nearest neighbor exchange.
  • Jacobi: 7-point stencil.

SLIDE 13

Checkpoint Time

!" !# !$" !$# !%" !%# !&" !'( !$%) !%#' !#$% !$"%( *+,-!./-0123/4 56,7-8!19!:18-/ :;-0<=1+2>?@-/>A8>!!!@+2B!.C7-4 )!DE )"!DE (""!DE

SLIDE 14

Restart Time

!" !#" !$"" !$#" !%"" !" !# !$" !$# !%" &'()'*++!,-.*'/.-(01 2-3*!,+*4(05+1 67*489(-0.:;*+./'.!!!</4(=-!,>=*?!@%!4('*+1

SLIDE 15

Message Logging Overhead

[Figure: Jacobi (Ranger): time per iteration (seconds, 0.05–0.25) vs. number of cores (128–1024), comparing Message Logging and Checkpoint/Restart]

SLIDE 16

Single Node Failure

  • All protocols presented tolerate a single node failure.
  • They may recover from some multiple failures.
  • Multiple concurrent failures are rare.
  • The cost to tolerate them is high:
  • Checkpoint/restart: more checkpoint buddies.
  • Causal message logging: determinants must be stored in more locations.

SLIDE 17

Distribution of Multiple Failures

[Figure: frequency (0.1%–100%, log scale) of failures involving 1, 2, 3, 4, and more than 4 nodes on Tsubame, MPP2, and Mercury]

SLIDE 18

Multiple Concurrent Failures

  • Analytical model:
  • Multiple failure distribution: f(x) = (1 − p)^(x−1) · p (heavy-tailed).
  • Checkpoint/Restart: probability of losing both a node and its buddy.
  • Message Logging: probability of losing a node and another node it contacts.
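The two probabilities in the model can be estimated with a quick Monte Carlo sketch. The buddy and contact patterns below are illustrative assumptions (a ring buddy and a fixed set of neighbor contacts), not the actual Charm++ mappings:

```python
# Monte Carlo sketch: draw k simultaneously failed nodes out of n and
# check whether any failed node also lost its recovery data.
import random

def survives_cr(failed: set, n: int) -> bool:
    """Checkpoint/restart: work is lost only if a node AND its buddy fail.
    Assumes a ring buddy mapping: node i checkpoints on node (i + 1) % n."""
    return all((i + 1) % n not in failed for i in failed)

def survives_ml(failed: set, contacts: dict) -> bool:
    """Message logging: work is lost if a node and one of the nodes
    holding its logs/determinants (its 'contacts') fail together."""
    return all(not (failed & contacts[i]) for i in failed)

def estimate(n=1024, k=8, degree=4, trials=20000, seed=1):
    rng = random.Random(seed)
    nodes = list(range(n))
    cr = ml = 0
    for _ in range(trials):
        failed = set(rng.sample(nodes, k))
        # Hypothetical contact pattern: the next `degree` nodes on the ring.
        contacts = {i: {(i + d) % n for d in range(1, degree + 1)}
                    for i in failed}
        cr += survives_cr(failed, n)
        ml += survives_ml(failed, contacts)
    return cr / trials, ml / trials

# Message-logging survivability is at most checkpoint/restart's here,
# since the buddy is always among the contacts.
print(estimate())
```

With these assumptions, raising `degree` (how many nodes a node contacts) lowers message-logging survivability, matching the degree=2..16 curves on the later slides.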

SLIDE 19

Buddy Assignment

[Diagram: nodes A–H with buddies assigned around a ring: A ➙ B ➙ C ➙ D ➙ E ➙ F ➙ G ➙ H ➙ A]

Ring Mapping

[Diagram: nodes A–H with buddies assigned in disjoint pairs: (A, B), (C, D), (E, F), (G, H)]

Pair Mapping
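The two mappings above can be sketched as simple index arithmetic (function names are hypothetical, for illustration):

```python
# Hypothetical sketch of the two buddy-assignment schemes:
# node i checkpoints into the memory of buddy(i).

def ring_buddy(i: int, n: int) -> int:
    """Ring mapping: each node's buddy is its right neighbor."""
    return (i + 1) % n

def pair_buddy(i: int, n: int) -> int:
    """Pair mapping: disjoint pairs (0,1), (2,3), ...; XOR flips the
    last bit, so each pair's members are buddies of each other."""
    return i ^ 1

n = 8
print([ring_buddy(i, n) for i in range(n)])  # [1, 2, 3, 4, 5, 6, 7, 0]
print([pair_buddy(i, n) for i in range(n)])  # [1, 0, 3, 2, 5, 4, 7, 6]
```

The pair mapping is symmetric (your buddy's buddy is you), so a double failure is fatal only if it hits one specific pair; the ring mapping exposes each node to two distinct fatal partners (its left and right neighbors).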

SLIDE 20

Checkpoint/Restart

[Figure: Multiple Failure Survivability (n=1024): probability (0.2–1) vs. number of concurrent failures, shown for 20–120 failures with a 0–16 close-up]

SLIDE 21

Message Logging

[Figure: Multiple Failure Survivability (n=1024) for message logging with degree = 2, 4, 8, 16: probability (0.2–1) vs. number of concurrent failures, shown for 20–120 failures with a 0–16 close-up]

SLIDE 22

Conclusions

  • Fault tolerance for SMP better matches the failure reality of supercomputers.
  • Single node failure support is robust enough for the failure patterns seen on supercomputers.
  • The load balancer is key to enhancing fault tolerance in SMP.

SLIDE 23

Future Work

  • Optimize message logging in SMP.
  • Add a load balancer to reduce communication overhead.
  • Early stages of a supercomputer’s life: correlated failures.

SLIDE 24

Acknowledgments

  • Ana Gainaru (NCSA).
  • Leonardo Bautista Gómez (Tokyo Tech).
  • This research was supported in part by the US Department of Energy under grant DOE DE-SC0001845 and by a machine allocation on the TeraGrid under award ASC050039N.

SLIDE 25

Thanks! Q&A

SLIDE 26

Multiple Failures Model

f(x) = (1 − p)^(x−1) · p

[Figure: Multiple Failure Distribution (n=1024, p=0.7): probability (0.1–1) vs. number of concurrent failures (1–16)]
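The distribution can be checked numerically; a small sketch of the formula above, the geometric probability that a failure event involves exactly x concurrent node failures:

```python
# f(x) = (1 - p)^(x - 1) * p: probability that a failure event
# involves exactly x concurrently failed nodes.

def failure_pmf(x: int, p: float = 0.7) -> float:
    return (1 - p) ** (x - 1) * p

# With p = 0.7, single-node failures dominate, as in the figure.
print(failure_pmf(1))                # 0.7
print(round(failure_pmf(2), 3))      # 0.21
print(round(sum(failure_pmf(x) for x in range(1, 17)), 4))  # 1.0
```

The rapid decay for x > 1 is why the single-node-failure support on the earlier slides covers almost all observed failure events.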

SLIDE 27

Survivability

Scheme                 S
Checkpoint/Restart     0.999402
Message Logging (2)    0.997624
Message Logging (4)    0.995285
Message Logging (8)    0.990716
Message Logging (16)   0.981973
