
SLIDE 1

GLB Fault Tolerance Scheme Experimental Results

Towards an Efficient Fault-Tolerance Scheme for GLB

Claudia Fohry, Marco Bungart, and Jonas Posner
Programming Languages / Methodologies
June 14, 2015

SLIDE 2

Global Load Balancing

1. Global Load Balancing
2. Fault Tolerance Scheme
3. Experimental Results

SLIDE 3

Worker-local Pools

Examples:
• UTS: counting the nodes of an unbalanced tree
• BC: calculating a property for each node of a graph
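As a rough illustration of how such a benchmark decomposes into worker-local tasks, the following is a hypothetical Java sketch of UTS-style node counting: each task expands one node and pushes its children back into the worker's local pool. The fixed branching factor is an invented simplification (real UTS derives child counts from hashes), not the benchmark's actual logic.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class UtsSketch {
    // A task is simply a node, identified here by its remaining depth.
    // The fixed branching factor is purely for illustration.
    public static long countNodes(int depth, int branching) {
        Deque<Integer> pool = new ArrayDeque<>();   // worker-local pool
        pool.push(depth);
        long count = 0;
        while (!pool.isEmpty()) {
            int d = pool.pop();
            count++;                                // "process" the node
            if (d > 0) {
                for (int i = 0; i < branching; i++) {
                    pool.push(d - 1);               // tasks spawn new tasks
                }
            }
        }
        return count;
    }
}
```

For a full binary tree of depth 2 this counts 1 + 2 + 4 = 7 nodes.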

SLIDE 4

GLB

• Task-pool framework for inter-place load balancing
• Utilizes cooperative work stealing
• Tasks are free of side effects and can spawn new tasks at execution time
• Final result is computed by reduction
• Only one worker per place
• Worker-private pool

SLIDE 5

GLB’s main processing loop

do {
    while (process(n)) {
        Runtime.probe();
        distribute();
        reject();
    }
} while (steal());
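The loop above is X10. As a runnable Java analogue of the same control flow, here is a minimal sketch; all helper bodies below are stand-ins invented for illustration, not GLB's actual implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class WorkerLoopSketch {
    private final Deque<Runnable> pool = new ArrayDeque<>();
    public int processed = 0;

    public void push(Runnable task) { pool.push(task); }

    // process(n): run up to n tasks; return true if work remains.
    private boolean process(int n) {
        for (int i = 0; i < n && !pool.isEmpty(); i++) {
            pool.pop().run();
            processed++;
        }
        return !pool.isEmpty();
    }

    // Stand-ins for Runtime.probe(), distribute(), reject(), steal():
    private void probe() { }                   // would poll incoming messages
    private void distribute() { }              // would hand out surplus tasks
    private void reject() { }                  // would decline steal requests
    private boolean steal() { return false; }  // no victims in this sketch

    public void run() {
        do {
            while (process(8)) {   // interleave processing with communication
                probe();
                distribute();
                reject();
            }
        } while (steal());         // terminate once stealing also fails
    }
}
```

Processing in chunks of n tasks between probe() calls is what keeps the single worker per place responsive to steal requests.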

SLIDE 6

Fault Tolerance Scheme

1. Global Load Balancing
2. Fault Tolerance Scheme
3. Experimental Results

SLIDE 7

Conceptual Ideas

• One backup place per place (cyclic)
• Write backups periodically and when necessary (e.g., on stealing)
• Exploit stealing-induced redundancy
• Write incremental backups whenever possible
• Every piece of information is kept at exactly two places
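The cyclic backup assignment can be sketched as a simple modular mapping. This is an illustrative guess at the scheme, not the paper's actual code; `back` and `forth` mirror the Back(P)/Forth(P) naming used later in the talk:

```java
public class BackupMapping {
    // Cyclic assignment: place p writes its backup to the next place
    // in the ring, and in turn acts as the backup of its predecessor.
    public static int back(int p, int numPlaces) {
        return (p + 1) % numPlaces;
    }

    // Forth(p): the predecessor whose backup place p is.
    public static int forth(int p, int numPlaces) {
        return (p - 1 + numPlaces) % numPlaces;
    }
}
```

With this mapping, back(forth(p)) == p for every place, so every piece of state has exactly one designated second location.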

SLIDE 8

Incremental Backup of stable Tasks

[Figure: successive snapshots (snap_{t-1}, snap_t) of the stable task region, with markers min_{t-2}, min_{t-1}; only the delta s_{t-2} is sent to the backup place.]
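One way to read the incremental idea, as a hypothetical sketch (the data structures and names below are invented, not the paper's): between backups the stable part of the pool grows at one end, so a backup need only ship the tasks appended since the last snapshot.

```java
import java.util.ArrayList;
import java.util.List;

public class IncrementalBackupSketch {
    private final List<String> stable = new ArrayList<>(); // stable tasks
    private final List<String> backup = new ArrayList<>(); // copy at Back(p)
    private int snap = 0;  // how many tasks the backup already holds

    public void add(String task) { stable.add(task); }

    // Ship only the delta since the previous snapshot,
    // instead of re-sending the whole stable region.
    public List<String> writeBackup() {
        List<String> delta = stable.subList(snap, stable.size());
        backup.addAll(delta);
        snap = stable.size();
        return backup;
    }
}
```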

SLIDE 9

Actor Scheme

• No blocking constructs (except one outer finish)
• split and merge have to operate on the bottom of the task pool

• The worker is a passive entity (it only processes tasks)
• The worker becomes active when a message is received
• Two kinds of messages: executed directly, or stored and processed later

→ the worker stays responsive
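The two message kinds can be sketched as follows. This is a hypothetical Java illustration of the actor idea, with invented names (`Message`, `receive`, `drainDeferred`), not the scheme's actual interface:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class ActorWorkerSketch {
    public static final class Message {
        final boolean urgent;
        final Runnable body;
        public Message(boolean urgent, Runnable body) {
            this.urgent = urgent;
            this.body = body;
        }
    }

    private final Queue<Message> deferred = new ArrayDeque<>();
    public int handledDirectly = 0;

    // The worker is passive: it only acts when a message arrives.
    // Urgent messages are executed directly; the rest are stored
    // and processed later, so the worker stays responsive.
    public void receive(Message m) {
        if (m.urgent) {
            m.body.run();
            handledDirectly++;
        } else {
            deferred.add(m);
        }
    }

    // Called between tasks to work off the stored messages.
    public int drainDeferred() {
        int n = 0;
        while (!deferred.isEmpty()) {
            deferred.poll().body.run();
            n++;
        }
        return n;
    }
}
```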

SLIDE 10

Stealing Protocol

[Sequence diagram between thief F, victim V, and the backup places Back(F) and Back(V):
• F sends a steal request to V; V continues processing its non-stolen tasks
• the stolen tasks are recorded in Open(F); valid = false; the backups are updated
• V saves a link; F inserts and processes the received tasks; after the acknowledgements, valid = true
• at the next (incremental) backup of F: Open(F) is deleted, the backup is updated, and the link to V is deleted]

SLIDE 11

Asynchronism

SLIDE 12

Asynchronism with Fault-Tolerance

SLIDE 13

Detection of dead Places

• DeadPlaceExceptions cannot be used
• Relevant places, including the own backup place, are checked regularly via isDead()

What if a place P is inactive?
• P does not check its backup place for liveness
• But its predecessor Forth(P) does check P
• If P is active, it checks the liveness of Back(P)
• Recursive process
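The recursion bottoms out at the nearest active predecessor in the ring, which can be sketched as follows. This is an illustrative simplification with invented names, assuming the cyclic Back/Forth mapping from earlier in the talk, not the scheme's actual detection code:

```java
public class LivenessSketch {
    // Walks the ring backwards from place p to find the nearest
    // active predecessor: since each active place checks its
    // successor, that predecessor is the one that notices p's death.
    public static int responsibleFor(int p, boolean[] active) {
        int n = active.length;
        for (int i = 1; i < n; i++) {
            int pred = (p - i + n) % n;
            if (active[pred]) {
                return pred;
            }
        }
        return -1;  // no active place left
    }
}
```

For example, with place 1 dead in a 4-place ring, place 0 is responsible for detecting a failure of place 2, because the direct predecessor 1 no longer checks anyone.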

SLIDE 14

Experimental Results

1. Global Load Balancing
2. Fault Tolerance Scheme
3. Experimental Results

SLIDE 15

Setup

• Experiments were conducted on an InfiniBand-connected cluster
• One place per node
• Up to 128 nodes
• Configuration: small UTS: -d=13; large UTS: -d=17

SLIDE 16

UTS, small

[Plot: time in seconds (10–60) vs. places (10–60) for GLB, FTGLB, and FTGLB-Incremental]

SLIDE 17

UTS, large

[Plot: time in seconds (500–2000) vs. places (10–60) for GLB, FTGLB, and FTGLB-Incremental]

SLIDE 18

Thank you for your attention!

Please feel free to ask questions.
