Supporting Fault Tolerance in a Data-Intensive Computing Middleware
Tekin Bicer, Wei Jiang and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University
IPDPS 2010, Atlanta, Georgia
Data-Intensive Computing
Distributed large datasets
Distributed computing resources
Cloud environments
Long execution times
High probability of failures
The reduction object represents the intermediate state of the execution
The reduce function is commutative and associative
Sorting and grouping overheads are eliminated by the reduction function/object
[Figure: three compute nodes each apply Local Reduction (+) to their portion of the data, accumulating a partial sum into their reduction object Robj[1]; a Global Reduction (+) then combines the partial results into Result = 71]
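As a rough sketch of why these properties matter (plain C++, independent of the FREERIDE-G API; the three-way split of the data is assumed from the figure), the example above can be reproduced with independent per-partition local reductions followed by a global combine in arbitrary order:

    #include <iostream>
    #include <numeric>
    #include <vector>

    // Because (+) is commutative and associative, each node can reduce
    // its own partition independently (local reduction), and the partial
    // results can then be combined in any order (global reduction).
    int main() {
        // The three data partitions, as assumed from the figure above.
        std::vector<std::vector<int>> partitions = {
            {3, 5, 8, 4, 1}, {3, 5, 2, 6}, {7, 9, 4, 2, 4, 8}};

        std::vector<int> robj;  // one reduction object per "node"
        for (const auto& part : partitions)
            robj.push_back(std::accumulate(part.begin(), part.end(), 0));

        // Global reduction over the per-node reduction objects.
        int result = std::accumulate(robj.begin(), robj.end(), 0);
        std::cout << "Result = " << result << "\n";  // prints: Result = 71
    }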
Co-locating data and computation gives the best performance, but may not always be possible
Cost, availability, etc.
Data hosts and compute hosts are separated
This setting fits grid/cloud computing
FREERIDE-G is a version of FREERIDE that supports remote data analysis
Checkpoint based
System- or application-level snapshot
Architecture dependent
High overhead
Replication based
Replication of the service or application, with extra resource allocation
Low overhead
Outline: Motivation and Introduction, Fault Tolerance System Design, Implementation of the System, Experimental Evaluation, Related Work, Conclusion
The reduction object…
represents the intermediate state of the computation
is small in size
is independent of the machine architecture
The reduction object and function have associative and commutative properties
This makes them suitable for a checkpoint-based fault tolerance system (a sketch of such an object follows)
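A minimal sketch of what such a reduction object could look like for k-means (hypothetical types, not the actual FREERIDE-G implementation): fixed-width fields keep the serialized snapshot architecture-independent, and its size depends only on the number of clusters, not on the dataset size.

    #include <cstdint>
    #include <cstring>
    #include <iostream>
    #include <vector>

    // Hypothetical reduction object for 1-D k-means: per-cluster
    // coordinate sums and point counts. Fixed-width integers make the
    // serialized form independent of the machine architecture, and the
    // object grows with the number of clusters k, not with the dataset,
    // so checkpointing it is cheap.
    struct ReductionObject {
        std::vector<int64_t> sums;    // per-cluster coordinate sums
        std::vector<int64_t> counts;  // per-cluster point counts
    };

    // Flatten the object into a small byte buffer that a peer node (or
    // a checkpoint store) can hold; assumes sums and counts have equal
    // length.
    std::vector<uint8_t> serialize(const ReductionObject& r) {
        const auto bytes = r.sums.size() * sizeof(int64_t);
        std::vector<uint8_t> buf(2 * bytes);
        std::memcpy(buf.data(), r.sums.data(), bytes);
        std::memcpy(buf.data() + bytes, r.counts.data(), bytes);
        return buf;
    }

    int main() {
        ReductionObject r{{10, 20, 30}, {4, 5, 6}};      // k = 3 clusters
        std::cout << serialize(r).size() << " bytes\n";  // 48 bytes: small,
                                                         // independent of data size
    }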
[Figure: checkpointing example. A node applies Local Reduction (+) to its data, and its reduction object is saved as it grows (Robj = 8, then Robj = 21). After the node fails, the saved Robj = 21 is restored (instead of restarting from Robj = 0) and Local Reduction (+) continues over the remaining data]
{* Initialize FTS *}
While {
    Foreach ( element e ) {
        (i, val) = Process(e);
        RObj(i) = Reduce(RObj(i), val);
        {* Store Red. Obj. *}
    }
    if ( CheckFailure() )
        {* Redistribute Data *}
    {* Global Reduction *}
}
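A compilable single-node sketch of this loop (hypothetical names, not the middleware's real API; the checkpoint here is a plain copy, taken per chunk rather than per element):

    #include <iostream>
    #include <utility>
    #include <vector>

    // Process maps an element to a (bucket, value) pair; Reduce folds
    // the value into the reduction object. Both correspond to the
    // user-supplied functions in the pseudocode above.
    using Element = int;

    std::pair<int, int> Process(Element e) { return {0, e}; }  // one bucket
    int Reduce(int acc, int val) { return acc + val; }         // assoc. + comm.

    int main() {
        // Data arrives in chunks; a real run would also pull
        // redistributed chunks from a failed peer after CheckFailure().
        std::vector<std::vector<Element>> chunks = {{3, 5, 8, 4, 1}, {5, 2, 6}};
        std::vector<int> robj(1, 0);  // the reduction object
        std::vector<int> checkpoint;  // stands in for the copy held by a peer

        for (const auto& chunk : chunks) {
            for (Element e : chunk) {
                auto [i, val] = Process(e);
                robj[i] = Reduce(robj[i], val);
            }
            checkpoint = robj;  // {* Store Red. Obj. *} on another node
            // if (CheckFailure()) { /* {* Redistribute Data *} */ }
        }
        // {* Global Reduction *} would combine robj across nodes here.
        std::cout << "Robj = " << robj[0] << "\n";  // prints: Robj = 34
    }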
Fault Tolerance System Design
The reduction object is stored on another compute node
Pair-wise reduction object exchange
Failure detection is done by the alive peer (see the sketch after the figure below)
[Figure: compute nodes are paired for Reduction Object Exchange (CN0 with CN1, ..., CNn-1 with CNn). Nodes N0 to N3 each perform Local Reduction and exchange their Robj with their pair (N0 with N1, N2 with N3). When a failure is detected, the failed node's remaining data is redistributed among the alive nodes, and Global Reduction combines the reduction objects, including the saved copy, into the Final Result]
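A sketch of the failure-detection side of this design (no real networking; in-memory state stands in for the exchange messages, and all names are illustrative):

    #include <chrono>
    #include <iostream>
    #include <map>
    #include <vector>

    using Clock = std::chrono::steady_clock;

    // What a node keeps about its pair: the last reduction object it
    // received and when it arrived. A pair that misses its exchange
    // deadline is declared failed by the alive peer.
    struct PairState {
        std::vector<int> saved_robj;   // last Robj received from the pair
        Clock::time_point last_seen;   // arrival time of that exchange
    };

    bool pair_failed(const PairState& p, Clock::duration timeout) {
        return Clock::now() - p.last_seen > timeout;
    }

    int main() {
        std::map<int, PairState> pairs;  // node id -> state held by its peer

        // N1 holds N0's checkpointed Robj; pretend the last exchange
        // happened 10 seconds ago and the deadline is 5 seconds.
        pairs[0] = {{21}, Clock::now() - std::chrono::seconds(10)};

        if (pair_failed(pairs[0], std::chrono::seconds(5))) {
            // The alive peer reports the failure; the runtime then
            // redistributes N0's remaining data, and N0's saved Robj
            // (here, 21) enters the global reduction so no completed
            // work is lost.
            std::cout << "N0 failed; saved Robj = "
                      << pairs[0].saved_robj[0] << "\n";
        }
    }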
Experimental Evaluation
Goals:
Observing the reduction object size
Evaluating the overhead of the FTS
Studying the slowdown in case of one node's failure
Comparison with Hadoop (a Map-Reduce implementation)
Setup:
FREERIDE-G: data hosts and compute nodes are separated
Applications: K-means and PCA
Hadoop (Map-Reduce implementation): data is replicated among all nodes
Configurations:
Without failure: without FTS, and with FTS
With failure: one node fails after processing 50% of its data
Execution Times with K-means, 25.6 GB Dataset
Reduction object size: 2 KB
Overhead with FT support, no failure: 0 – 1.74% (max at 8 compute nodes, 25.6 GB)
Relative slowdown with failure: 5.38 – 21.98% (max at 4 compute nodes, 25.6 GB)
Absolute slowdown with failure: 0 – 4.78% (max at 8 compute nodes, 25.6 GB)
Execution Times with PCA, 17 GB Dataset
Reduction object size: 128 KB
Overhead with FT support, no failure: 0 – 15.36% (max at 4 compute nodes, 4 GB)
Relative slowdown with failure: 7.77 – 32.48% (max at 4 compute nodes, 4 GB)
Absolute slowdown with failure: 0.86 – 14.08% (max at 4 compute nodes, 4 GB)
K-means Clustering, 6.4 GB Dataset
Overheads (%) with one compute node failing after processing 50% of its data:

             4 nodes | 8 nodes | 16 nodes
Hadoop         23.06 |   71.78 |    78.11
FREERIDE-G     20.37 |    8.18 |     9.18
K-means Clustering, 6.4 GB Dataset, 8 Compute Nodes
Overheads (%) when one compute node fails after processing 25, 50, and 75% of its data:

             25%   |  50%   |  75%
Hadoop       32.85 |  71.21 | 109.45
FREERIDE-G    9.52 |   8.18 |   8.14
Related Work
Application-level checkpointing
Bronevetsky et al.: C3 (SC06, ASPLOS04, PPoPP03)
Zheng et al.: FTC-Charm++ (Cluster04)
Message logging
Agbaria et al.: Starfish (Cluster03)
Bouteiller et al.: MPICH-V (Int. Journal of High Performance Computing Applications)
Replication-based fault tolerance
Abawajy et al. (IPDPS04)
Conclusion
The reduction object represents the state of the computation
Our FTS has very low overhead and effectively recovers from failures
Different fault tolerance designs can be implemented using the reduction object
Our system outperforms Hadoop both in the absence and in the presence of failures