
SLIDE 1

GLB Fault Tolerance Scheme Experimental Results

Towards an Efficient Fault-Tolerance Scheme for GLB

Claudia Fohry, Marco Bungart, and Jonas Posner
Programming Languages / Methodologies
June 14, 2015

SLIDE 2

Global Load Balancing

1. Global Load Balancing
2. Fault Tolerance Scheme
3. Experimental Results

SLIDE 3

Worker-local Pools

Examples:
• UTS: counting the nodes of an unbalanced tree
• BC: calculating a property for each node of a graph
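As a rough illustration of how such a benchmark decomposes into worker-local tasks, the following is a hypothetical Java sketch of UTS-style node counting: each task expands one node and pushes its children back into the worker's local pool. The fixed branching factor is an invented simplification (real UTS derives child counts from hashes), not the benchmark's actual logic.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class UtsSketch {
    // A task is simply a node, identified here by its remaining depth.
    // The fixed branching factor is purely for illustration.
    public static long countNodes(int depth, int branching) {
        Deque<Integer> pool = new ArrayDeque<>();   // worker-local pool
        pool.push(depth);
        long count = 0;
        while (!pool.isEmpty()) {
            int d = pool.pop();
            count++;                                // "process" the node
            if (d > 0) {
                for (int i = 0; i < branching; i++) {
                    pool.push(d - 1);               // tasks spawn new tasks
                }
            }
        }
        return count;
    }
}
```

For a full binary tree of depth 2 this counts 1 + 2 + 4 = 7 nodes.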

SLIDE 4

GLB

• Task-pool framework for inter-place load balancing
• Utilizes cooperative work stealing
• Tasks are free of side effects and can spawn new tasks at execution time
• Final result is computed by reduction
• Only one worker per place
• Worker-private pool

SLIDE 5

GLB’s main processing loop

do {
    while (process(n)) {
        Runtime.probe();
        distribute();
        reject();
    }
} while (steal());
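The loop above is X10. As a runnable Java analogue of the same control flow, here is a minimal sketch; all helper bodies below are stand-ins invented for illustration, not GLB's actual implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class WorkerLoopSketch {
    private final Deque<Runnable> pool = new ArrayDeque<>();
    public int processed = 0;

    public void push(Runnable task) { pool.push(task); }

    // process(n): run up to n tasks; return true if work remains.
    private boolean process(int n) {
        for (int i = 0; i < n && !pool.isEmpty(); i++) {
            pool.pop().run();
            processed++;
        }
        return !pool.isEmpty();
    }

    // Stand-ins for Runtime.probe(), distribute(), reject(), steal():
    private void probe() { }                   // would poll incoming messages
    private void distribute() { }              // would hand out surplus tasks
    private void reject() { }                  // would decline steal requests
    private boolean steal() { return false; }  // no victims in this sketch

    public void run() {
        do {
            while (process(8)) {   // interleave processing with communication
                probe();
                distribute();
                reject();
            }
        } while (steal());         // terminate once stealing also fails
    }
}
```

Processing in chunks of n tasks between probe() calls is what keeps the single worker per place responsive to steal requests.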

SLIDE 6

Fault Tolerance Scheme

1. Global Load Balancing
2. Fault Tolerance Scheme
3. Experimental Results

SLIDE 7

Conceptual Ideas

• One backup place per place (cyclic)
• Write backups periodically and when necessary (e.g., on stealing)
• Exploit stealing-induced redundancy
• Write incremental backups whenever possible
• Every piece of information is kept at exactly two places
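The cyclic backup assignment can be sketched as a simple modular mapping. This is an illustrative guess at the scheme, not the paper's actual code; `back` and `forth` mirror the Back(P)/Forth(P) naming used later in the talk:

```java
public class BackupMapping {
    // Cyclic assignment: place p writes its backup to the next place
    // in the ring, and in turn acts as the backup of its predecessor.
    public static int back(int p, int numPlaces) {
        return (p + 1) % numPlaces;
    }

    // Forth(p): the predecessor whose backup place p is.
    public static int forth(int p, int numPlaces) {
        return (p - 1 + numPlaces) % numPlaces;
    }
}
```

With this mapping, back(forth(p)) == p for every place, so every piece of state has exactly one designated second location.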

SLIDE 8

Incremental Backup of stable Tasks

[Figure: successive snapshots (snap_{t-1}, snap_t) of the stable task region, with markers min_{t-2}, min_{t-1}; only the delta s_{t-2} is sent to the backup place.]
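One way to read the incremental idea, as a hypothetical sketch (the data structures and names below are invented, not the paper's): between backups the stable part of the pool grows at one end, so a backup need only ship the tasks appended since the last snapshot.

```java
import java.util.ArrayList;
import java.util.List;

public class IncrementalBackupSketch {
    private final List<String> stable = new ArrayList<>(); // stable tasks
    private final List<String> backup = new ArrayList<>(); // copy at Back(p)
    private int snap = 0;  // how many tasks the backup already holds

    public void add(String task) { stable.add(task); }

    // Ship only the delta since the previous snapshot,
    // instead of re-sending the whole stable region.
    public List<String> writeBackup() {
        List<String> delta = stable.subList(snap, stable.size());
        backup.addAll(delta);
        snap = stable.size();
        return backup;
    }
}
```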

SLIDE 9

Actor Scheme

• No blocking constructs (except one outer finish)
• split and merge have to operate on the bottom of the task pool

• The worker is a passive entity (it only processes tasks)
• The worker becomes active when a message is received
• Two kinds of messages: executed directly, or stored and processed later

→ the worker stays responsive
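The two message kinds can be sketched as follows. This is a hypothetical Java illustration of the actor idea, with invented names (`Message`, `receive`, `drainDeferred`), not the scheme's actual interface:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class ActorWorkerSketch {
    public static final class Message {
        final boolean urgent;
        final Runnable body;
        public Message(boolean urgent, Runnable body) {
            this.urgent = urgent;
            this.body = body;
        }
    }

    private final Queue<Message> deferred = new ArrayDeque<>();
    public int handledDirectly = 0;

    // The worker is passive: it only acts when a message arrives.
    // Urgent messages are executed directly; the rest are stored
    // and processed later, so the worker stays responsive.
    public void receive(Message m) {
        if (m.urgent) {
            m.body.run();
            handledDirectly++;
        } else {
            deferred.add(m);
        }
    }

    // Called between tasks to work off the stored messages.
    public int drainDeferred() {
        int n = 0;
        while (!deferred.isEmpty()) {
            deferred.poll().body.run();
            n++;
        }
        return n;
    }
}
```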

SLIDE 10

Stealing Protocol

[Sequence diagram between thief F, victim V, and the backup places Back(F) and Back(V):
• F sends a steal request to V; V continues processing its non-stolen tasks
• the stolen tasks are recorded in Open(F); valid = false; the backups are updated
• V saves a link; F inserts and processes the received tasks; after the acknowledgements, valid = true
• at the next (incremental) backup of F: Open(F) is deleted, the backup is updated, and the link to V is deleted]

SLIDE 11

Asynchronism

SLIDE 12

Asynchronism with Fault-Tolerance

SLIDE 13

Detection of dead Places

• DeadPlaceExceptions cannot be used
• Relevant places, including the own backup place, are checked regularly via isDead()

What if a place P is inactive?
• P does not check its backup place for liveness
• But its predecessor Forth(P) does check P
• If P is active, it checks the liveness of Back(P)
• Recursive process
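The recursion bottoms out at the nearest active predecessor in the ring, which can be sketched as follows. This is an illustrative simplification with invented names, assuming the cyclic Back/Forth mapping from earlier in the talk, not the scheme's actual detection code:

```java
public class LivenessSketch {
    // Walks the ring backwards from place p to find the nearest
    // active predecessor: since each active place checks its
    // successor, that predecessor is the one that notices p's death.
    public static int responsibleFor(int p, boolean[] active) {
        int n = active.length;
        for (int i = 1; i < n; i++) {
            int pred = (p - i + n) % n;
            if (active[pred]) {
                return pred;
            }
        }
        return -1;  // no active place left
    }
}
```

For example, with place 1 dead in a 4-place ring, place 0 is responsible for detecting a failure of place 2, because the direct predecessor 1 no longer checks anyone.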

SLIDE 14

Experimental Results

1. Global Load Balancing
2. Fault Tolerance Scheme
3. Experimental Results

SLIDE 15

Setup

• Experiments were conducted on an InfiniBand-connected cluster
• One place per node
• Up to 128 nodes
• Configuration: small UTS: -d=13; large UTS: -d=17

SLIDE 16

UTS, small

[Plot: time in seconds (10–60) vs. places (10–60) for GLB, FTGLB, and FTGLB-Incremental]

SLIDE 17

UTS, large

[Plot: time in seconds (500–2000) vs. places (10–60) for GLB, FTGLB, and FTGLB-Incremental]

SLIDE 18

Thank you for your attention!

Please feel free to ask questions.
