GLB Fault Tolerance Scheme Experimental Results
Towards an Efficient Fault-Tolerance Scheme for GLB
Claudia Fohry, Marco Bungart and Jonas Posner Programming Languages / Methodologies June 14, 2015
1 Global Load Balancing
2 Fault Tolerance Scheme
3 Experimental Results
Examples:
- UTS (Unbalanced Tree Search): counting the nodes of an unbalanced tree
- BC (Betweenness Centrality): calculating a property of each node of a graph
- Task pool framework for inter-place load balancing
- Utilizes cooperative work stealing
- Tasks are free of side effects and can spawn new tasks at execution time
- Final result is computed by reduction
- Only one worker per place
- Worker-private pool
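The task-pool model can be illustrated as follows. This is a hedged single-worker sketch in Java (GLB itself is an X10 library); the Task interface and the node generator are assumptions, showing side-effect-free tasks that spawn children and a final result obtained by reduction.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch of a worker-private task pool: tasks are free of
// side effects, may push child tasks, and partial results are reduced.
public class TaskPoolSketch {
    // A task yields a partial result and may spawn new tasks into the pool.
    interface Task { long run(Deque<Task> pool); }

    // UTS-like example: count the nodes of a binary tree of given depth.
    static Task node(int depth) {
        return pool -> {
            if (depth > 0) {                 // spawn two children per inner node
                pool.push(node(depth - 1));
                pool.push(node(depth - 1));
            }
            return 1;                        // each task contributes one node
        };
    }

    // Single worker: drain the private pool, reducing partial results by sum.
    static long process(Task root) {
        Deque<Task> pool = new ArrayDeque<>();
        pool.push(root);
        long sum = 0;
        while (!pool.isEmpty()) sum += pool.pop().run(pool);
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(process(node(3))); // depth-3 binary tree: 15 nodes
    }
}
```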
    do {
        while (process(n)) {    // process up to n tasks
            Runtime.probe();    // receive and handle incoming messages
            distribute();
            reject();
        }
    } while (steal());          // local pool empty: try to steal
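The two-level structure of this loop can be mimicked in plain Java. The sketch below is an illustration, not GLB's implementation: the pools, process, and steal are stand-ins, probe is a no-op placeholder for Runtime.probe(), and distribute()/reject() are omitted.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the two-level worker loop: the inner loop processes up to n
// tasks between probes; the outer loop steals once the pool runs dry.
public class WorkerLoopSketch {
    static Deque<Integer> pool = new ArrayDeque<>();
    static Deque<Integer> victim = new ArrayDeque<>();
    static long result = 0;

    // Process up to n tasks; return whether tasks remain in the pool.
    static boolean process(int n) {
        for (int i = 0; i < n && !pool.isEmpty(); i++) result += pool.pop();
        return !pool.isEmpty();
    }

    // Stand-in for Runtime.probe(): incoming messages would be handled here.
    static void probe() { }

    // Take half of the victim's tasks; returning false ends the worker.
    static boolean steal() {
        int half = victim.size() / 2;
        for (int i = 0; i < half; i++) pool.push(victim.pop());
        return !pool.isEmpty();
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 4; i++) pool.push(i);
        for (int i = 1; i <= 4; i++) victim.push(10 * i);
        do {
            while (process(2)) probe();   // mirrors process(n) + Runtime.probe()
        } while (steal());
        System.out.println(result);       // sum of local tasks plus stolen loot
    }
}
```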
2 Fault Tolerance Scheme
- One backup place per place (cyclic)
- Write backups periodically and when necessary (stealing)
- Exploit stealing-induced redundancy
- Write incremental backups whenever possible
- Each piece of information is kept at exactly two places
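The cyclic backup assignment simply maps place p to place (p + 1) mod P, so every place serves as the backup of exactly one other place. A minimal Java sketch (the function name backupOf is an assumption for illustration):

```java
// Cyclic backup-place mapping: place p backs up to place (p + 1) mod P.
public class BackupMapping {
    static int backupOf(int place, int numPlaces) {
        return (place + 1) % numPlaces;
    }

    public static void main(String[] args) {
        int P = 4;
        // Each place has exactly one backup place, and the ring wraps around.
        for (int p = 0; p < P; p++)
            System.out.println(p + " -> " + backupOf(p, P));
    }
}
```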
[Figure: timeline of snapshots (snap_{t-2}, snap_{t-1}) and backups, showing when s_{t-2} is sent to the backup place]
- No blocking constructs (except one outer finish)
- split and merge have to operate on the bottom of the task pool

Actor scheme:
- Worker is a passive entity (only processes tasks)
- Worker becomes active when a message is received
- Two kinds of messages: executed directly, or stored and processed later
→ Worker stays responsive
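The passive-worker idea can be illustrated with a message queue. This is a Java sketch under assumed names (the message kinds, onMessage, and drainDeferred are not GLB's actual API): urgent messages run directly, the rest are stored and processed later, so the worker never blocks.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Actor-style passive worker: it only acts when a message arrives, and a
// message is either executed directly or deferred for later processing.
public class ActorWorkerSketch {
    enum Kind { URGENT, DEFERRED }
    record Msg(Kind kind, Runnable action) {}

    static final Queue<Msg> deferred = new ArrayDeque<>();

    // Called on message receipt; never blocks the worker.
    static void onMessage(Msg m) {
        if (m.kind() == Kind.URGENT) m.action().run(); // execute directly
        else deferred.add(m);                          // store, process later
    }

    // Between task batches, the worker drains the deferred messages.
    static void drainDeferred() {
        Msg m;
        while ((m = deferred.poll()) != null) m.action().run();
    }

    static String demo() {
        StringBuilder log = new StringBuilder();
        onMessage(new Msg(Kind.DEFERRED, () -> log.append("backup ")));
        onMessage(new Msg(Kind.URGENT, () -> log.append("steal ")));
        drainDeferred();                    // deferred message runs second
        return log.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```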
[Figure: message sequence of the stealing protocol — trySteal and steal requests, the victim's give answer, recording of the stolen tasks in Open(F), acknowledgments (ack, valid = true), deletion of the Open(F) entry (delOpen), and incremental backups to Back(F) and Back(V) at the next backup of F]
- Cannot use DeadPlaceExceptions
- Check relevant places regularly via isDead(), including one's own backup place
- What if a place P is inactive?
  - P does not check its backup place for liveness
  - But its predecessor Forth(P) does check P
  - If P is active, it checks the liveness of Back(P)
  - Recursive process
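The recursive check can be sketched as a bounded walk along the ring of backup places. In this Java illustration, a boolean array stands in for X10's Place.isDead(); the names check, back, and forth are assumptions.

```java
// Recursive liveness checking along the backup ring: the predecessor
// Forth(P) checks P, and an active P in turn checks Back(P).
public class LivenessChainSketch {
    static boolean[] dead;  // stand-in for Place.isDead() per place

    static int back(int p)  { return (p + 1) % dead.length; }
    static int forth(int p) { return (p - 1 + dead.length) % dead.length; }

    // Walk the ring starting at p; return the first dead place found,
    // or -1 if every checked place is alive. hops bounds the recursion.
    static int check(int p, int hops) {
        if (hops == 0) return -1;
        if (dead[p]) return p;             // predecessor detected a failure
        return check(back(p), hops - 1);   // active p checks its backup place
    }

    public static void main(String[] args) {
        dead = new boolean[]{false, false, true, false};
        System.out.println(check(0, dead.length)); // detects dead place 2
    }
}
```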
3 Experimental Results
- Experiments were conducted on an InfiniBand-connected cluster
- One place per node
- Up to 128 nodes
- Configuration: small UTS: -d=13; large UTS: -d=17
[Figure: Time (seconds, 10–60) vs. Places (10–60) for GLB, FTGLB and FTGLB-Incremental]
[Figure: Time (seconds, 500–2000) vs. Places (10–60) for GLB, FTGLB and FTGLB-Incremental]