SLIDE 1

Fault Tolerance Support for Supercomputers with Multicore Nodes

Esteban Meneses, Xiang Ni

Monday, April 18, 2011

SLIDE 2

Exascale supercomputer: 100 million cores. “An exascale system could be expected to have a failure ... every 35–39 minutes”

Exascale Computing Study

“insufficient resilience of the software infrastructure would likely render extreme scale systems effectively unusable”

The International Exascale Software Project

SLIDE 3

Contents

  • Charm++ Fault Tolerance Infrastructure.
  • Fault Tolerance in SMP.
  • Preliminary Results.
  • Multiple Concurrent Failure Model.
  • Future Work.

SLIDE 4

Fault Tolerance in Charm++

  • Object Migration
  • Load Balancing
  • Runtime Support
  • SMP version

[Diagram: Charm++ objects migrating among Node X, Node Y, and Node Z]

SLIDE 5

Strategies

  • Proactive
  • Checkpoint Restart
  • Message Logging
  • SMP Checkpoint Restart
  • SMP Message Logging

SLIDE 6

Proactive Fault Tolerance

[Diagram: a Predictor anticipates an imminent failure of Node W, so its objects can be evacuated to Nodes X, Y, and Z]

SLIDE 7

Checkpoint Restart

[Diagram: Node X fails and is replaced by Node X’, alongside Nodes W, Y, and Z]

[Figure: two plots of time per step (s) vs. timestep (1–601), comparing restart with and without load balancing (LB)]

  • Restart with or without spare nodes.
  • Checkpoint in buddy’s memory.

SLIDE 8

Message Logging

[Diagram: messages m1 and m2 exchanged among Nodes W, X, Y, and Z, grouped into Team Q and Team R]

  • Team-based Message Logging.
  • Parallel Restart.
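A minimal sketch of the team-based logging rule above (function and parameter names are my own, not the Charm++ API): only messages that cross a team boundary are logged, because when one member of a team fails the whole team rolls back together, so intra-team messages need no log.

```python
# Hypothetical sketch of team-based message logging: nodes are grouped
# into teams of fixed size, and a message is logged only when its sender
# and receiver belong to different teams.

def team_of(node: int, team_size: int) -> int:
    """Which team a node belongs to (contiguous grouping, assumed)."""
    return node // team_size

def must_log(sender: int, receiver: int, team_size: int) -> bool:
    """Log only inter-team messages; intra-team traffic is not logged."""
    return team_of(sender, team_size) != team_of(receiver, team_size)

# Teams of 4: a message 0 -> 2 stays within one team, 1 -> 5 crosses teams.
print(must_log(0, 2, 4))  # False: intra-team, not logged
print(must_log(1, 5, 4))  # True: inter-team, logged
```

Larger teams log fewer messages (lower memory overhead) at the cost of rolling back more nodes on a failure, which is the trade-off the team-based load balancer on the next slide addresses.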

SLIDE 9

Team-Based Load Balancer

[Figure: execution time (seconds, 40–360) and memory overhead (MB, 12–108) for NoLB(8), GreedyLB(8), TeamLB(1), and TeamLB(8)]

SLIDE 10

SMP Checkpoint Restart

[Diagram: Node X with PEs A, B, C, D over Shared Memory (SM)]

  • The minimum unit of failure is a node.
  • Single node failure support.

SLIDE 11

SMP Message Logging

[Diagram: Node X (PEs A, B) and Node Y (PEs C, D), each with its own shared memory (SM)]

  • Causal message logging ➙ determinants in shared memory.
  • Load balancing ➙ increase communication inside a node.
  • Lock contention ➙ hybrid scheme.
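The lock-contention point can be illustrated with a small sketch (illustrative Python, not Charm++ code): if all PEs on a node append determinants to one shared log, they contend for a single lock, which motivates the hybrid scheme.

```python
# Illustrative sketch, assuming a single node-wide determinant log:
# every PE on the node appends to the same lock-protected structure,
# so all PEs contend for one lock on every message event.
import threading

class SharedDeterminantLog:
    def __init__(self):
        self._lock = threading.Lock()
        self._log = []

    def append(self, pe: int, det: tuple) -> None:
        # All PEs on the node serialize on this one lock.
        with self._lock:
            self._log.append((pe, det))

    def __len__(self) -> int:
        return len(self._log)

log = SharedDeterminantLog()
threads = [
    threading.Thread(
        target=lambda pe=pe: [log.append(pe, ("msg", i)) for i in range(100)]
    )
    for pe in range(4)  # 4 PEs on the node
]
for t in threads: t.start()
for t in threads: t.join()
print(len(log))  # 400: one entry per message event across 4 PEs
```

A hybrid scheme would keep per-PE logs (no contention) and merge or share them only when needed, trading lock traffic for extra bookkeeping.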

SLIDE 12

Experiments

  • Hardware:
  • Abe@NCSA: 1200 8-way SMP nodes.
  • Ranger@TACC: 3936 16-way SMP nodes.
  • Benchmarks:
  • Ring: Charm++ nearest neighbor exchange.
  • Jacobi: 7-point stencil.

SLIDE 13

Checkpoint Time

!" !# !$" !$# !%" !%# !&" !'( !$%) !%#' !#$% !$"%( *+,-!./-0123/4 56,7-8!19!:18-/ :;-0<=1+2>?@-/>A8>!!!@+2B!.C7-4 )!DE )"!DE (""!DE

SLIDE 14

Restart Time

!" !#" !$"" !$#" !%"" !" !# !$" !$# !%" &'()'*++!,-.*'/.-(01 2-3*!,+*4(05+1 67*489(-0.:;*+./'.!!!</4(=-!,>=*?!@%!4('*+1

SLIDE 15

Message Logging Overhead

[Figure: Jacobi (Ranger): time per iteration (seconds, 0.05–0.25) vs. number of cores (128–1024), comparing Message Logging and Checkpoint/Restart]

SLIDE 16

Single Node Failure

  • All protocols presented tolerate a single node failure.
  • They may recover from some multiple failures.
  • Multiple concurrent failures are rare.
  • The cost to tolerate them is high:
  • Checkpoint/restart: more checkpoint buddies.
  • Causal message logging: determinants must be stored in more locations.

SLIDE 17

Distribution of Multiple Failures

[Figure: frequency (0.1%–100%, log scale) of failures involving 1, 2, 3, 4, and more than 4 nodes on Tsubame, MPP2, and Mercury]

SLIDE 18

Multiple Concurrent Failures

  • Analytical model:
  • Multiple failure distribution: f(x) = (1 − p)^(x−1) · p (heavy-tailed).
  • Checkpoint/Restart: probability of losing both a node and its buddy.
  • Message Logging: probability of losing a node and another node it contacts.
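The two probabilities in the model can be estimated with a quick Monte Carlo sketch. The buddy and contact patterns below are illustrative assumptions (a ring buddy and a fixed set of neighbor contacts), not the actual Charm++ mappings:

```python
# Monte Carlo sketch: draw k simultaneously failed nodes out of n and
# check whether any failed node also lost its recovery data.
import random

def survives_cr(failed: set, n: int) -> bool:
    """Checkpoint/restart: work is lost only if a node AND its buddy fail.
    Assumes a ring buddy mapping: node i checkpoints on node (i + 1) % n."""
    return all((i + 1) % n not in failed for i in failed)

def survives_ml(failed: set, contacts: dict) -> bool:
    """Message logging: work is lost if a node and one of the nodes
    holding its logs/determinants (its 'contacts') fail together."""
    return all(not (failed & contacts[i]) for i in failed)

def estimate(n=1024, k=8, degree=4, trials=20000, seed=1):
    rng = random.Random(seed)
    nodes = list(range(n))
    cr = ml = 0
    for _ in range(trials):
        failed = set(rng.sample(nodes, k))
        # Hypothetical contact pattern: the next `degree` nodes on the ring.
        contacts = {i: {(i + d) % n for d in range(1, degree + 1)}
                    for i in failed}
        cr += survives_cr(failed, n)
        ml += survives_ml(failed, contacts)
    return cr / trials, ml / trials

# Message-logging survivability is at most checkpoint/restart's here,
# since the buddy is always among the contacts.
print(estimate())
```

With these assumptions, raising `degree` (how many nodes a node contacts) lowers message-logging survivability, matching the degree=2..16 curves on the later slides.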

SLIDE 19

Buddy Assignment

[Diagram: nodes A–H with buddies assigned around a ring: A ➙ B ➙ C ➙ D ➙ E ➙ F ➙ G ➙ H ➙ A]

Ring Mapping

[Diagram: nodes A–H with buddies assigned in disjoint pairs: (A, B), (C, D), (E, F), (G, H)]

Pair Mapping
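The two mappings above can be sketched as simple index arithmetic (function names are hypothetical, for illustration):

```python
# Hypothetical sketch of the two buddy-assignment schemes:
# node i checkpoints into the memory of buddy(i).

def ring_buddy(i: int, n: int) -> int:
    """Ring mapping: each node's buddy is its right neighbor."""
    return (i + 1) % n

def pair_buddy(i: int, n: int) -> int:
    """Pair mapping: disjoint pairs (0,1), (2,3), ...; XOR flips the
    last bit, so each pair's members are buddies of each other."""
    return i ^ 1

n = 8
print([ring_buddy(i, n) for i in range(n)])  # [1, 2, 3, 4, 5, 6, 7, 0]
print([pair_buddy(i, n) for i in range(n)])  # [1, 0, 3, 2, 5, 4, 7, 6]
```

The pair mapping is symmetric (your buddy's buddy is you), so a double failure is fatal only if it hits one specific pair; the ring mapping exposes each node to two distinct fatal partners (its left and right neighbors).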

SLIDE 20

Checkpoint/Restart

[Figure: Multiple Failure Survivability (n=1024): probability (0.2–1) vs. number of concurrent failures, shown for 20–120 failures with a 0–16 close-up]

SLIDE 21

Message Logging

[Figure: Multiple Failure Survivability (n=1024) for message logging with degree = 2, 4, 8, 16: probability (0.2–1) vs. number of concurrent failures, shown for 20–120 failures with a 0–16 close-up]

SLIDE 22

Conclusions

  • Fault tolerance for SMP better matches the failure reality of supercomputers.
  • Single node failure support is robust enough for the failure patterns seen on supercomputers.
  • The load balancer is key to enhancing fault tolerance in SMP.

SLIDE 23

Future Work

  • Optimize message logging in SMP.
  • Add a load balancer to reduce communication overhead.
  • Early stages of a supercomputer’s life: correlated failures.

SLIDE 24

Acknowledgments

  • Ana Gainaru (NCSA).
  • Leonardo Bautista Gómez (Tokyo Tech).
  • This research was supported in part by the US Department of Energy under grant DOE DE-SC0001845 and by a machine allocation on the TeraGrid under award ASC050039N.

SLIDE 25

Thanks! Q&A

SLIDE 26

Multiple Failures Model

f(x) = (1 − p)^(x−1) · p

[Figure: Multiple Failure Distribution (n=1024, p=0.7): probability (0.1–1) vs. number of concurrent failures (1–16)]
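The distribution can be checked numerically; a small sketch of the formula above, the geometric probability that a failure event involves exactly x concurrent node failures:

```python
# f(x) = (1 - p)^(x - 1) * p: probability that a failure event
# involves exactly x concurrently failed nodes.

def failure_pmf(x: int, p: float = 0.7) -> float:
    return (1 - p) ** (x - 1) * p

# With p = 0.7, single-node failures dominate, as in the figure.
print(failure_pmf(1))                # 0.7
print(round(failure_pmf(2), 3))      # 0.21
print(round(sum(failure_pmf(x) for x in range(1, 17)), 4))  # 1.0
```

The rapid decay for x > 1 is why the single-node-failure support on the earlier slides covers almost all observed failure events.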

SLIDE 27

Survivability

Scheme                 S
Checkpoint/Restart     0.999402
Message Logging (2)    0.997624
Message Logging (4)    0.995285
Message Logging (8)    0.990716
Message Logging (16)   0.981973
