Fault-Tolerant State Machine Replication Chinasa T. Okolo 1 Slides borrowed from Hakim Weatherspoon and Drew Zagieboylo

Authors Fred Schneider • Samuel B. Eckert Professor of Computer Science • AAAS, ACM, and IEEE Fellow • Concurrent and distributed systems for high-integrity and mission-critical settings 2

Outline ● Motivation ● State Machine Replication Approach ● Implementation ● Fault Tolerance ● Chain Replication ● Conclusions 3

Motivation [Diagram: a single server holds X = 10. One client's get(x) returns 10; after the server fails, another client's get(x) gets no response.] 4

Motivation • Need replication for fault tolerance • What happens in scenarios without replication? • Storage - Disk Failure • Web service - Network failure • Be able to reason about failure tolerance • How badly can things go wrong and have our system continue to function? 5

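Motivation [Diagram: the server is replicated; a client sees four replicas, each holding X = 10.] 6

Motivation [Diagram: four replicas each hold X = 3 when a put(x,10) request arrives.] 7

Motivation [Diagram: three replicas now hold X = 10 but one still holds X = 3. One get(x) returns 10, another returns 3.] Problem! 8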

Problem How can we ensure that all replicas are in the same state all of the time? 9

Outline ● Motivation ● State Machine Replication Approach ● Implementation ● Fault Tolerance ● Chain Replication ● Conclusions 10

State Machines • c is a Command • f is a Transition Function [Diagram: applying f(c) moves the state from X = Y to X = Z.] 11
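A minimal sketch of the command/transition-function picture above, assuming a key-value store as the state. The class name `KVStateMachine` and the tuple encoding of commands are illustrative choices, not something the slides define.

```python
from dataclasses import dataclass, field


@dataclass
class KVStateMachine:
    """Deterministic state machine: the state is a dict, commands are tuples."""
    state: dict = field(default_factory=dict)

    def apply(self, command):
        """Transition function f: run one command c and return its output."""
        op, key, *rest = command
        if op == "put":
            self.state[key] = rest[0]
            return "ok"
        if op == "get":
            return self.state.get(key)
        raise ValueError(f"unknown command: {op}")


sm = KVStateMachine()
sm.apply(("put", "x", 10))      # state moves from {} to {"x": 10}
result = sm.apply(("get", "x"))  # reads do not change the state
```

Because `apply` depends only on the current state and the command, two copies fed the same commands in the same order end in the same state.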

State Machine Coding ● State machines are procedures ● Client calls procedure ● Avoid loops ● Flexible structure 12

State Machine Replication ● Each starts in the same initial state ● Executes the same requests ● Requires consensus to execute in same order ● Deterministic, each will do the exact same thing ● Produce the same output 13

State Machine Replication All non-faulty servers need: ● Agreement ○ Every replica needs to accept the same set of requests ● Order ○ All replicas process requests in the same relative order 14
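To see why both Agreement and Order matter, here is a small sketch, assuming each replica is just a dictionary that applies a log of writes. The helper `run_replica` is hypothetical and only serves the illustration.

```python
def run_replica(log):
    """Apply an ordered log of (key, value) writes to an empty store."""
    store = {}
    for key, value in log:
        store[key] = value
    return store


log = [("x", 3), ("x", 10)]

# Agreement + Order: every replica applies the same requests in the same order.
replicas = [run_replica(log) for _ in range(4)]
same_order_agrees = all(r == replicas[0] for r in replicas)

# Same set of requests, different order: this replica ends in a different state.
reordered = run_replica(list(reversed(log)))
diverged = reordered != replicas[0]
```

The reordered replica ends with x = 3 while the others hold x = 10, which is the inconsistent read shown on the earlier "Problem!" slide.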

Outline ● Motivation ● State Machine Replication Approach ● Implementation ● Fault Tolerance ● Chain Replication ● Conclusions 15

Implementation Agreement • A transmitter proposes a request; if the transmitter is non-faulty, all non-faulty servers accept that request • The transmitter can be a client or a server 16

Implementation Agreement • IC1: All non-faulty processors agree on the same value • IC2: If the transmitter is non-faulty, all non-faulty processors agree on its value 17
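The two conditions can be written as a checker over the outcome of one agreement round. This is only a sketch for testing a protocol's result, not an agreement protocol itself; the function name and argument layout are assumptions made for the example.

```python
def check_agreement(decisions, faulty, transmitter, proposed):
    """Return (ic1, ic2) for one round.

    decisions: processor name -> value it decided on
    faulty: set of processor names considered faulty
    transmitter: name of the proposing processor
    proposed: value the transmitter proposed
    """
    non_faulty = [v for p, v in decisions.items() if p not in faulty]
    # IC1: all non-faulty processors decide the same value.
    ic1 = len(set(non_faulty)) <= 1
    # IC2: if the transmitter is non-faulty, that value is the one it proposed.
    ic2 = transmitter in faulty or all(v == proposed for v in non_faulty)
    return ic1, ic2


decisions = {"p1": "put(x,10)", "p2": "put(x,10)", "p3": "noop"}
ok_round = check_agreement(decisions, {"p3"}, "p1", "put(x,10)")
bad_round = check_agreement(decisions, set(), "p1", "put(x,10)")
```

In the first call p3 is faulty, so its stray decision is ignored and both conditions hold; in the second call p3 counts as non-faulty and both conditions fail.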

Ordering “The Order requirement can be satisfied by assigning unique identifiers to requests and having state machine replicas process requests according to a total ordering relation on these unique identifiers.” 18

Implementation • Order • Assign unique ids to requests and process them in ascending order. • How do we assign unique ids in a distributed system? 19

Implementation Client Generated IDs Ordering via clocks • Logical Clocks • Synchronized Clocks • Ideas from last class! [Lamport 1978] 20
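A sketch of client-generated identifiers using logical clocks, assuming each client has a distinct `process_id` that breaks ties between equal clock values. The class and method names are illustrative.

```python
class LamportClock:
    def __init__(self, process_id):
        self.process_id = process_id
        self.time = 0

    def tick(self):
        """Advance for a local event or a send; return a unique (time, pid) id."""
        self.time += 1
        return (self.time, self.process_id)

    def receive(self, sender_time):
        """Merge a received timestamp so every later id is larger than it."""
        self.time = max(self.time, sender_time) + 1
        return (self.time, self.process_id)


a, b = LamportClock("a"), LamportClock("b")
id1 = a.tick()        # request issued by a
id2 = b.tick()        # concurrent request issued by b
b.receive(id1[0])     # b hears about a's request
id3 = b.tick()        # b's next request is ordered after id1
ordered = sorted([id3, id2, id1])
```

Sorting the `(time, pid)` tuples gives one total order that every replica can compute locally, which is what the Order requirement asks for.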

Can the replicas generate unique identifiers? Of course! 21

Implementation Replica Generated IDs • 2 Phase ID generation • Every replica proposes a candidate • One candidate is chosen and agreed upon by all replicas 22

Implementation Replica Generated IDs • When do we know a candidate is stable? • A candidate is accepted • No other pending requests with smaller candidate ids 23
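The two-phase scheme and its stability test can be sketched as follows. Assumptions for the example: candidates are `(counter, replica_index)` tuples, the agreed id is the maximum candidate, and message passing is replaced by direct method calls. None of these names come from the slides.

```python
class Replica:
    def __init__(self, index):
        self.index = index
        self.seen = 0        # largest counter proposed or accepted so far
        self.pending = {}    # request -> candidate id (proposed, not accepted)
        self.accepted = {}   # request -> agreed final id

    def propose(self, request):
        """Phase 1: propose a candidate larger than anything seen so far."""
        self.seen += 1
        candidate = (self.seen, self.index)
        self.pending[request] = candidate
        return candidate

    def accept(self, request, final_id):
        """Phase 2: record the id that all replicas agreed on."""
        self.pending.pop(request, None)
        self.accepted[request] = final_id
        self.seen = max(self.seen, final_id[0])

    def is_stable(self, request):
        """Stable: accepted, and no pending request has a smaller candidate."""
        if request not in self.accepted:
            return False
        final_id = self.accepted[request]
        return all(c > final_id for c in self.pending.values())


replicas = [Replica(i) for i in range(3)]
early = replicas[0].propose("r2")              # replica 0 sees r2 first
r1_id = max(r.propose("r1") for r in replicas)
for r in replicas:
    r.accept("r1", r1_id)
stable_before = replicas[0].is_stable("r1")    # r2 is still pending with a smaller candidate

r2_id = max([early] + [r.propose("r2") for r in replicas[1:]])
for r in replicas:
    r.accept("r2", r2_id)
stable_after = replicas[0].is_stable("r1")
```

Replica 0 cannot execute r1 while r2 is pending with a smaller candidate, because r2's final id might still turn out to be smaller than r1's.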

Stability Testing • Stability tests for logical and synchronized clocks? • Disadvantages • Stability tests require all nodes to communicate ■ Logical: stabilizing requests ■ Synchronized: clock synchronization 24
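One common formulation of the logical-clock stability test is that a request is stable at a replica once the replica has received a request with a larger timestamp from every client. The sketch below assumes FIFO channels, fail-stop clients that keep sending (possibly null) requests, and `(time, client)` tuples as ids; the function name is hypothetical.

```python
def stable_ids(latest_from, pending):
    """Return the pending ids that are safe to execute, in execution order.

    latest_from: client -> largest id received from that client so far
    pending: iterable of ids received but not yet executed
    """
    # With FIFO channels, no client can later deliver an id below its latest,
    # so nothing smaller than the minimum of the latest ids can still arrive.
    horizon = min(latest_from.values())
    return sorted(i for i in pending if i < horizon)


latest_from = {"a": (5, "a"), "b": (3, "b")}
pending = [(1, "a"), (2, "b"), (3, "b"), (5, "a")]
ready = stable_ids(latest_from, pending)
```

The cost named on the slide shows up directly: a silent client holds `horizon` down, so every client has to keep communicating for requests to stabilize.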

When does behavior become faulty? When it’s no longer consistent with specification! 26

Fault Tolerance • Fail-Stop • A faulty server can be detected as faulty • Crash Failures • Server can stop responding without notification (subset of Byzantine) • Byzantine • Faulty servers can do arbitrary, perhaps malicious things 27

Fault Tolerance ● Fail-Stop Tolerance ○ To tolerate t failures, need t+1 servers. ○ As long as 1 server remains, we’re OK! ○ Only need to participate in protocols with other live servers 28

Fault Tolerance ● Byzantine Failures ○ To tolerate t failures, need 2t + 1 servers ○ Protocols now involve votes ○ Can only trust a server response if the majority of servers say the same thing ○ t + 1 servers need to participate in replication protocols 29
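The replica counts and the majority vote can be sketched in a few lines. The function names are illustrative, and the vote assumes the client has already collected one output per replica.

```python
from collections import Counter


def replicas_needed(t, byzantine):
    """Servers required to tolerate t failures under each failure model."""
    return 2 * t + 1 if byzantine else t + 1


def vote(outputs, t):
    """Return the output reported by at least t + 1 replicas, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= t + 1 else None


fail_stop_count = replicas_needed(1, byzantine=False)
byzantine_count = replicas_needed(1, byzantine=True)
trusted = vote([10, 10, 3], t=1)      # one faulty replica is outvoted
undecided = vote([10, 3, 7], t=1)     # no value has t + 1 supporters
```

With 2t + 1 replicas and at most t of them faulty, any value backed by t + 1 replicas has at least one correct supporter, which is why the majority answer can be trusted.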

Takeaways • Can represent deterministic distributed system as Replicated State Machine • Each replica reaches the same conclusion about the system independently • Formalizes notions of fault-tolerance in SMR 30

Discussion • Why is State Machine Replication so important? • What is the best-case scenario in terms of the number of replicas needed for fault tolerance? • Is the state machine approach still feasible? 31

Chain Replication Authors ● Robert Van Renesse ○ Senior Researcher at Cornell ○ ACM Fellow and Ukulele enthusiast ○ Systems and Networking ● Fred Schneider 33

Chain Replication • Fault Tolerant Storage Service • Requests: • Update(x, y) => set object x to value y • Query(x) => read value of object x 34

Chain Replication [Diagram: four replicas in a chain, each holding X = 3.] 35

Chain Replication [Diagram: the chain runs from Head to Tail, each replica holding X = 3. A client sends get(x) to the Tail and receives 3.] 36

Chain Replication [Diagram: each replica holds X = 3. A client sends put(x,30) to the Head.] 37

Chain Replication 1) Head assigns uid [Diagram: the Head records request r0 with UID 1 and sets X = 30; the other three replicas still hold X = 3.] 38

Chain Replication 2) Head sends message to next node [Diagram: the first two replicas have recorded r0 with UID 1 and hold X = 30; the last two still hold X = 3.] 39

Chain Replication 3) Repeat until tail is reached [Diagram: the first three replicas have recorded r0 with UID 1 and hold X = 30; the Tail still holds X = 3.] 40

Chain Replication 4) Respond to client with success [Diagram: all four replicas have recorded r0 with UID 1 and hold X = 30; the client receives x = 30.] 41
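The four steps above can be sketched as a single-process simulation, assuming no failures and replacing network messages with a walk along `next` pointers. `ChainNode`, `Chain`, and their fields are names invented for this example.

```python
class ChainNode:
    def __init__(self, name):
        self.name = name
        self.store = {}
        self.log = []      # (uid, key, value) in the order applied
        self.next = None


class Chain:
    def __init__(self, names):
        self.nodes = [ChainNode(n) for n in names]
        for node, successor in zip(self.nodes, self.nodes[1:]):
            node.next = successor
        self.next_uid = 1

    def update(self, key, value):
        """Client sends the update to the Head; the reply comes after the Tail applies it."""
        uid = self.next_uid            # 1) Head assigns the uid
        self.next_uid += 1
        node = self.nodes[0]
        while node is not None:        # 2-3) forward down the chain until the Tail
            node.store[key] = value
            node.log.append((uid, key, value))
            node = node.next
        return uid                     # 4) success is reported to the client

    def query(self, key):
        """Queries are answered by the Tail only."""
        return self.nodes[-1].store.get(key)


chain = Chain(["head", "n1", "n2", "tail"])
first_uid = chain.update("x", 3)
second_uid = chain.update("x", 30)
answer = chain.query("x")
```

Because the Tail applies an update last and also serves every query, a client never reads a value that some replica in the chain has not yet stored.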

Chain Replication Assumptions ● No partition tolerance ● High throughput ● Fail-stop processors ● A universally accessible, failure resistant or replicated Master 42

Chain Replication How does Chain Replication implement State Machine Replication? • Agreement • Only Update modifies state, can ignore Query • Client always sends update to Head. Head propagates request down chain to Tail. • Everyone accepts the request! 43

Chain Replication How does Chain Replication implement State Machine Replication? • Order • Unique IDs generated implicitly by Head's ordering • FIFO order preserved down the chain • Tail interleaves Query requests 44

Chain Replication Fault Tolerance ● Trusted Master ○ Fault-tolerant state machine ○ Trusted by all replicas ○ Monitors all replicas & issues commands 45

Chain Replication Fault Tolerance ● Head Fails ○ Master assigns 2nd node as Head ● Intermediate Node Fails ○ Master coordinates chain link-up ● Tail Fails ○ Master assigns 2nd to last node as Tail 46
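A sketch of the three reconfiguration cases, assuming the Master keeps the chain as an ordered list of node names and that failures are detected (fail-stop). The function `remove_failed` is hypothetical, and the sketch leaves out the forwarding of in-flight updates that a real link-up around a failed intermediate node would need.

```python
def remove_failed(chain, failed):
    """Return the new chain order after the Master drops a failed node."""
    if failed not in chain:
        raise ValueError(f"{failed} is not in the chain")
    survivors = [node for node in chain if node != failed]
    if not survivors:
        raise RuntimeError("no replicas left")
    return survivors


chain = ["a", "b", "c", "d"]

after_head = remove_failed(chain, "a")   # second node becomes the Head
after_mid = remove_failed(chain, "b")    # "a" is linked directly to "c"
after_tail = remove_failed(chain, "d")   # second-to-last node becomes the Tail

new_head = after_head[0]
new_tail = after_tail[-1]
```

In every case the surviving nodes keep their relative order, so the FIFO ordering that the Head established is preserved across reconfiguration.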

Conclusions • Implements the “exercise left to the reader” hinted at by Lamport’s paper • Provides some of the concrete details needed to actually implement this idea • But still a fair number of details in real implementations that would need to be considered • Chain replication illustrates a “simple” example with fully concrete details • A key contribution that bridges the gap between academia and practicality for SMR 48

Chain Replication Discussion • Comparison to other primary/backup protocols? • What are the tradeoffs of Chain Replication? • Latency • Consistency • Any thoughts on the Trusted Master system? 49
