DISTRIBUTED SYSTEMS, Department of Computing Science, Umea University



slide-1
SLIDE 1

Distributed Systems - D N Ranasinghe 1

DISTRIBUTED SYSTEMS

Department of Computing Science Umea University

slide-2
SLIDE 2


Fundamental Concepts

slide-3
SLIDE 3


  • About Distributed Computing

– devising algorithms for a set of processes that seek to achieve some form of ‘cooperative goal’

– quoting Leslie Lamport: ‘a distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable’

slide-4
SLIDE 4


Distributed Algorithm

  • has no shared global information: each process decides only on the basis of its local state and the messages it receives

  • has no shared global time frame: progress of the computation is observed through, at best, a partial order of events

  • non-deterministic behaviour: the exact sequence of global states cannot be predicted from a study of the algorithm

slide-5
SLIDE 5


Design challenges from a systems perspective

  • heterogeneity: in hardware, OS, mode of interaction (client-server, peer-to-peer, etc.), middleware provisioning for developers

  • security: involves eavesdropping, deliberate corruption, process compromise, denial of service, etc.

  • scalability: robustness, performance bottlenecks
  • process failures: detecting/suspecting, masking, tolerating, recovery, redundancy in the presence of partial process failures

  • concurrency
slide-6
SLIDE 6


  • transparency:
  • access (local and remote resources accessed through identical operations)
  • location (resource access independent of physical location)
  • concurrency (process concurrency on shared resources)
  • replication (maintaining replicas with consistency)
  • failure (concealment of failures)
  • mobility (movement of resources and clients)
slide-7
SLIDE 7


Role of middleware

  • software layer with services provided to the application designer

  • consisting of processes and objects
  • mechanisms:
  • Remote Method Invocation
  • object brokering
  • Service Oriented Architecture
  • event notification
  • distributed shared memory…
slide-8
SLIDE 8


Motivating application domains

  • information dissemination (publish-subscribe paradigm): by event registration and notification with the time-space decoupling property, based on reliable broadcast and agreement abstractions

  • process control in automation, in industrial systems, etc., where consensus may have to be reached on a multitude of sensor inputs

  • cooperative work: multi-user cooperation in editing etc., based on the shared persistent space paradigm employing ordered broadcast abstractions

  • distributed databases: need for an atomic commitment abstraction on acceptance or rejection of serialized transactions

slide-9
SLIDE 9


Motivating application domains

  • software-based fault tolerance through replication: uses the so-called state machine replication paradigm

– when a centralized server must be made highly available by executing several copies of it, whose consistency is guaranteed by the total order broadcast abstraction

slide-10
SLIDE 10


Modeling of distributed systems

  • abstraction:

– to capture properties that are common to a large range of systems, making it possible to distinguish the fundamental from the accessory

– to avoid reinventing the wheel for every minor variant of the problem

  • a model abstracts the key components and the way they interact

  • purpose:

– to make explicit all relevant assumptions about the system

– to express behaviour through algorithms

– to make impossibility observations etc. through logical analysis, including proofs

slide-11
SLIDE 11


Modeling of distributed systems

  • abstracting the physical model: processes, links and failure detectors (the latter being an indirect measurement of time)

[Figure: processes numbered 1 to 5]

slide-12
SLIDE 12


Modeling of distributed systems

  • component properties:

– channel (a communication resource): subject to message delays and message loss

– process (a computational resource, has only local state): can incur process failure, be infinitely slow or corrupt

  • low-level models of interaction: synchronous message passing, asynchronous message passing

[Figure: a process, with internal computation (modules of the process) between an incoming message (receive) and an outgoing message (send)]

slide-13
SLIDE 13


Modeling of distributed systems

  • failure detector abstraction: a possible way to capture the notion of process and link failures based on their timing behaviour

  • incorporation of a failure detector: a specialized module in each process which emits a heartbeat to the others

  • a failure detector can be considered an indirect abstraction of time; a timeout is simply taken as an indication of a failure, which is mostly unreliable, with an outcome of either suspected or unsuspected

  • a synchronous system => a ‘perfect failure detector’
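
The heartbeat-and-timeout idea can be sketched as follows. This is only an illustration under stated assumptions: the class name `HeartbeatFailureDetector`, its methods and the numeric timeout are hypothetical and do not come from the slides, and time is passed in explicitly so that the example stays deterministic.

```python
class HeartbeatFailureDetector:
    """Timeout-based failure detector (hypothetical helper, not from the slides)."""

    def __init__(self, peers, timeout):
        self.timeout = timeout
        # Time of the last heartbeat seen from each peer; everyone starts at time 0.
        self.last_seen = {p: 0.0 for p in peers}

    def heartbeat(self, peer, now):
        """Record a heartbeat received from `peer` at local time `now`."""
        self.last_seen[peer] = now

    def suspected(self, now):
        """Return the peers whose last heartbeat is older than the timeout."""
        return {p for p, t in self.last_seen.items() if now - t > self.timeout}


fd = HeartbeatFailureDetector(peers=["p1", "p2", "p3"], timeout=2.0)
fd.heartbeat("p1", now=1.0)
fd.heartbeat("p2", now=1.0)
fd.heartbeat("p1", now=3.0)
print(sorted(fd.suspected(now=3.5)))   # ['p2', 'p3']
fd.heartbeat("p2", now=4.0)            # a late heartbeat removes the suspicion
print(sorted(fd.suspected(now=4.0)))   # ['p3']
```

The late heartbeat from p2 shows why such a detector is unreliable in general: a suspicion may be wrong and can be revised. Only when delays have a known upper bound, as in a synchronous system, can the timeout be chosen so that every suspicion is correct.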
slide-14
SLIDE 14


Modeling of distributed systems

  • clock: physical and logical
  • abstracting a process: by the process failure model

[Figure: process failure models: crashes, omissions, crashes and recoveries, arbitrary]

slide-15
SLIDE 15


Modeling of distributed systems

  • crashes: a faulty process, as opposed to a correct process (which executes an infinite number of steps), does no further local computation or message generation and does not respond to messages

– a crash does not preclude a later recovery, but this is considered another category

– the correctness of any algorithm may also depend on a maximum admissible number of faulty processes

slide-16
SLIDE 16


  • arbitrary faults: a process deviates arbitrarily from the algorithm assigned to it

– also known as malicious or Byzantine faults, although they may in fact be due to a bug in the program

– under such conditions some algorithmic abstractions may be ‘impossible’

slide-17
SLIDE 17


Modeling of distributed systems

  • omission failure: due to network congestion or buffer overflow, resulting in a process being unable to send messages

  • crash-recovery: a process either simply crashes (fail-stop), or crashes and recovers infinitely often

– every process that recovers is assumed to have stable storage (also called a log), accessible through some primitives, which stores the most recent local state with time stamps

– alternatively, processes that never crash could also act as virtual stable storage

slide-18
SLIDE 18


Modeling of distributed systems

  • abstracting communication: by loss or corruption of messages, also known as communication omission

  • usually resolved through end-to-end network protocol support, unless of course there is a network partition

  • Desirable properties for ‘reliable’ delivery of messages
  • liveness: any message in the outgoing buffer of the sender is ‘eventually’ delivered to the incoming message buffer of the receiver

  • safety: the message received is identical to the one sent, and no message is delivered twice

slide-19
SLIDE 19


  • Abstracting other higher level interactions
  • e.g., capturing recurring patterns of interaction in the form of

– distributed agreement (on an event, a sequence of events, etc.)

– atomic commitment (whether to take an irrevocable step or not)

– total order broadcast (i.e., agreeing on the order of actions)

  • leads to a wide range of algorithms

slide-20
SLIDE 20


Modeling of distributed systems

  • Predicting impossibility results in higher level interactions
  • due in some cases to the indistinguishability of network failures from process failures, or of a slow process from a network delay

  • e.g., agreement in the presence of message loss, agreement in the presence of process failures in asynchronous situations

  • Impossibility of agreement in the presence of message loss
  • leads to a widely used assumption in almost all models (no message loss)
  • typical two army problem
  • formal model described below
slide-21
SLIDE 21


Formal model of the two army problem

  • processes A and B communicate by sending and receiving messages on a bidirectional channel; A sends a message to B, then B sends a message to A, and so on

  • A and B can each execute two actions, α and β
  • neither process can fail, but the channel can lose messages

  • the desired outcome is that both processes take the same action and neither takes both actions

slide-22
SLIDE 22


  • proof by contradiction: let there be a protocol P that solves the problem using the fewest rounds, with the last message, m, sent by A

  • observe that the action taken by A cannot depend on m, since A could never learn of its receipt

  • the action taken by B cannot depend on m, because B must take the same action as A even if m is lost

  • since the actions of both A and B do not depend on m, m can be discarded

  • so m is not the last message needed
  • hence P is not using the fewest rounds, a contradiction
slide-23
SLIDE 23


Formal models for message passing algorithms

  • processes and channels: channels can be unidirectional or bidirectional

  • topology represented by an undirected graph G(V, E)

[Figure: example topology with processes P0 to P4]

slide-24
SLIDE 24


Formal models for message passing algorithms

  • The system has n processes, p0 to p(n-1), where i is the index of process pi

  • The algorithm run by each pi is modeled as a process automaton, a formal description of a sequential algorithm, and is associated with a node in the topology.

slide-25
SLIDE 25


Formal models for message passing algorithms

  • A process automaton is a description of the process state machine

  • consists of a 5-tuple: {message alphabet, process states, initial states, message generation function, state transition function}

– message_alphabet: content of messages exchanged

– process_states: the finite set of states that a process can be in

– initial_state: the start state of a process

– message_gen_function: given the current process state, how the next message is to be generated

– state_trans_function: on the receipt of a message, and based on the current state, the next state to which the process should transit

slide-26
SLIDE 26


Description of system state

  • A configuration is a vector C = (q0, …, q(n-1)) where qi is a state of pi
  • In message passing systems two kinds of event can take place: a computation event of process pi (application of the so-called state transition function), and a delivery event, the delivery of message m from process pi to process pj, consisting of a message sending event and a corresponding receiving event

  • Each message is uniquely identified by its sender process, its sequence number and possibly a local clock value

  • The behaviour of the system over time is modeled as an execution, which is a sequence of configurations alternating with events.

slide-27
SLIDE 27


Formal models for message passing algorithms

  • All possible executions of a distributed abstraction must satisfy two conditions: safety and liveness.

[Figure: a process, with internal computation (modules of the process) between an incoming message (receive) and an outgoing message (send)]

slide-28
SLIDE 28


Formal models for message passing algorithms

  • Safety: ‘nothing bad has/can happen (yet)’
  • e.g., ‘every step by a process pi immediately follows a step by process p0’, or ‘no process should receive a message unless the message was indeed sent’

  • Safety is a property that, once violated at some time t, can never be satisfied thereafter; doing nothing will also ensure safety!

slide-29
SLIDE 29


Formal models for message passing algorithms

  • Liveness: ‘eventually something good happens’
  • a condition that must hold a number of times (possibly infinite), e.g., ‘eventually p1 terminates’ => p1’s termination happens once; or, liveness for a perfect link requires that if a correct process (one which is alive and well behaved) sends a message to a correct destination process, then the destination process eventually delivers the message

  • Liveness is a property such that, for any time t, there is some hope that the property can be satisfied at some time t’ ≥ t

slide-30
SLIDE 30


Asynchronous systems

  • there is no fixed upper bound on message delivery time or on the time elapsed between consecutive steps of a process

  • the notion of ordering of events (local computation, message send or message receive) is based on logical clocks
  • an execution α of an asynchronous message passing system is a finite or infinite sequence of the form C0, ϕ1, C1, ϕ2, C2, …, where Ck is a configuration of process states, C0 is an initial configuration and ϕk is an event, which may be a message send, a computation or a message receive.

  • A schedule σ is the sequence of events in the execution, e.g., ϕ1, ϕ2, …; if the local processes are deterministic, then the execution is uniquely defined by (C0, σ).
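
The slides refer to logical clocks without defining one. As an illustration only, the sketch below assumes Lamport's scalar clock, which is one common choice; the class and method names are hypothetical.

```python
class LamportClock:
    """Scalar logical clock (assumed here; the slides do not fix a particular clock)."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        """A local computation step advances the clock by one."""
        self.time += 1
        return self.time

    def send(self):
        """Sending is an event; the returned value is the timestamp carried by the message."""
        self.time += 1
        return self.time

    def receive(self, msg_time):
        """On receipt, jump past both the local clock and the message timestamp."""
        self.time = max(self.time, msg_time) + 1
        return self.time


p, q = LamportClock(), LamportClock()
p.local_event()            # p's clock: 1
stamp = p.send()           # p's clock: 2, message carries 2
q.local_event()            # q's clock: 1
received = q.receive(stamp)
print(stamp, received)     # 2 3
```

The send event gets a smaller timestamp than the matching receive event, which is how the clock respects the partial order of events without any shared time frame.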

slide-31
SLIDE 31


Synchronous systems

  • There is a known upper bound on message transmission and processing delays

  • processes execute in lock step; execution is partitioned into ‘rounds’: C0, ϕ1 | C1, ϕ2 | C2, …

  • very convenient for designing algorithms, but not very practical

  • leads to some useful possibilities: e.g., timed failure detection

– every process crash can be detected by all correct processes, and a lease abstraction can be implemented

  • in a synchronous system with no failures, only C0 matters for a given algorithm, but in an asynchronous system there can be many executions for a given algorithm

slide-32
SLIDE 32


  • synchronous message passing

[Figure: time diagram of processes P, Q and R over rounds 1 to 3; within an upper bound on time, each round takes a process from its current state through send(), recv() and a state transition to a new state]

slide-33
SLIDE 33


Properties of algorithms

  • validity and agreement: specific to the objective of the algorithm

  • termination: an algorithm has terminated when all processes have terminated and there are no messages in transit

  • an execution can still be infinite, but once terminated, a process stays in that state taking ‘dummy’ steps

  • complexity: message (maximum number of messages sent over all possible executions) and time (equal to the maximum number of rounds if synchronous; in the asynchronous case this is less straightforward)

slide-34
SLIDE 34


Properties of algorithms

  • Interaction algorithms are possible for each process failure model

  • fail-stop – processes can fail by crashing, but the crashes can be reliably detected by all other processes

  • fail-silent – process crashes can never be reliably detected

  • fail-noisy – processes can fail by crashing, and the crashes can be detected, but not always in a reliable manner

  • fail-recovery – processes can crash and later recover and still participate in the algorithm

  • Byzantine – processes deviate from the intended behaviour in an unpredictable manner

  • solutions do not exist for every interaction abstraction under every model
slide-35
SLIDE 35


Coordination and Agreement

slide-36
SLIDE 36


  • under this broad topic we will discuss

– Leader election
– Consensus
– Distributed mutual exclusion

  • common or uniform decisions by participating processes in response to various internal and external stimuli are often required, in the presence of failures and synchrony considerations

slide-37
SLIDE 37


Leader election (LE)

  • a correct process which acts as the coordinator in some steps of a distributed algorithm is a leader; e.g., commit manager in a distributed database, central server in distributed mutual exclusion

  • the LE abstraction can be straightforwardly implemented using a perfect failure detector (that is, in a synchronous situation)

  • Hierarchical LE: assumes the existence of a ranking order agreed among processes a priori, s.t. a function O associates with every process those that precede it in the ranking, i.e., O(p1) = ∅, p1 is leader by default; O(p2) = {p1}, if p1 dies p2 becomes leader; O(p3) = {p1, p2}, etc.
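
A minimal sketch of hierarchical leader election, assuming the set of crashed processes is supplied by a perfect failure detector. The function names `preceding` and `hierarchical_leader` are illustrative and not taken from the slides; `preceding` plays the role of the function O.

```python
def preceding(ranking, p):
    """O(p): the processes ranked before p (ranking lists the highest rank first)."""
    return set(ranking[:ranking.index(p)])


def hierarchical_leader(ranking, crashed):
    """Return the highest-ranked process not reported as crashed, or None.

    `crashed` is assumed to come from a perfect failure detector, so every
    member really has crashed and no correct process is ever included.
    """
    for p in ranking:
        if preceding(ranking, p) <= crashed and p not in crashed:
            return p
    return None


ranking = ["p1", "p2", "p3"]
print(hierarchical_leader(ranking, set()))          # p1
print(hierarchical_leader(ranking, {"p1"}))         # p2
print(hierarchical_leader(ranking, {"p1", "p2"}))   # p3
```

A process becomes leader exactly when every process in O(p) has crashed and it has not, which matches the p1, p2, p3 example in the slide.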

slide-38
SLIDE 38


Leader election (LE)

LCR algorithm (LeLann-Chang-Roberts): a simple ring based algorithm

  • assumptions: n processes, each with a hard-coded uid, in a logical ring topology; unidirectional message passing from process pi to p((i+1) mod n); processes are not aware of ring size; asynchronous; no process failures; no message loss

  • leader is defined to be the process with the highest uid
slide-39
SLIDE 39


Leader election (LE)

algorithm in prose:

  • each process forwards its uid to its neighbour
  • if received uid < own uid, then discard; else if received uid > own uid, forward received uid to neighbour; else if received uid = own uid, then declare self as leader

[Figure: ring of processes with uids uid1 to uidn]

slide-40
SLIDE 40


Leader election (LE)

  • process automaton:

message_alphabet: set U of uids
for each pi, statei: defined by three state variables
    u ∈ U, initially uidi
    send ∈ U ∪ {null}, initially uidi
    status ∈ {leader, unknown}, initially unknown
msgi: place value of send on output channel;
transi: {
    send = null;
    receive v ∈ U ∪ {null} on input channel;
    if v = null or v < u then exit;
    if v > u then send = v;
    if v = u then status = leader;
}
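
The LCR rules can be exercised with a small simulation. This is a sketch rather than a full implementation of the automaton: it keeps all in-flight messages in one FIFO queue (one admissible asynchronous schedule), assumes unique uids, and stops as soon as one process sees its own uid return. The function name `lcr` is hypothetical.

```python
from collections import deque


def lcr(uids):
    """Simulate LCR on a unidirectional ring where process i sends to (i + 1) % n.

    Returns (leader_uid, messages_sent). Assumes the uids are unique.
    """
    n = len(uids)
    # Each entry is (destination index, uid carried). Initially every process
    # sends its own uid to its ring neighbour.
    in_flight = deque(((i + 1) % n, uids[i]) for i in range(n))
    messages = n
    while in_flight:
        dest, v = in_flight.popleft()
        u = uids[dest]
        if v == u:
            return u, messages                      # own uid came back: leader
        if v > u:
            in_flight.append(((dest + 1) % n, v))   # forward the larger uid
            messages += 1
        # v < u: the message is discarded
    raise ValueError("uids must be unique and non-empty")


print(lcr([3, 7, 1, 9, 4]))
print(lcr([5, 4, 3, 2, 1]))   # (5, 15): uid k travels k hops, n(n+1)/2 messages in total
print(lcr([1, 2, 3, 4, 5]))   # (5, 9): only the largest uid travels far, 2n - 1 messages
```

The second ring, with uids decreasing in the direction of travel, is the arrangement that produces the quadratic message count.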

slide-41
SLIDE 41


Leader election (LE)

  • expected properties: validity – if a process decides, then the decided value is the largest uid of a process

  • termination – every correct process eventually decides
  • agreement – no two correct processes decide differently
  • message complexity: O(n^2)
  • time complexity: if synchronous, then n rounds until the leader is discovered; 2n rounds until the algorithm terminates

  • other possible scenarios: synchronous with processes aware of ring size n (useful if processes fail), bidirectional ring (for a more efficient version of the algorithm)

slide-42
SLIDE 42


Leader election (LE)

  • an O(n log n) message complexity algorithm (Hirschberg-Sinclair)
  • assumptions: bidirectional ring, where for every i, 0 ≤ i < n, pi has a channel to its left to p((i+1) mod n) and a channel to its right to p((i-1) mod n); n processes, each with a hard-coded uid, in a logical ring topology; processes are not aware of ring size; asynchronous; no process failures; no message loss

[Figure: ring of processes with uids uid1 to uidk]

slide-43
SLIDE 43


Leader election (LE)

algorithm in prose:

  • as before, a process sends its identifier around the ring, and the message of the process with the highest identifier traverses the whole ring and returns

  • define the k-neighbourhood of a process pi to be the set of processes at distance at most k from pi in either direction, left and right

  • the algorithm operates in phases, starting from phase 0
  • in the kth phase a process tries to become a winner for that phase, for which it must have the largest uid in its 2^k-neighbourhood

  • only processes that are winners in the kth phase can go on to the (k+1)th phase

slide-44
SLIDE 44


  • to start with, in phase 0 each process attempts to become a phase 0 winner and sends probe messages to its left and right neighbours

  • if the identifier of the process receiving the probe is higher than the one in the probe, it swallows the probe; otherwise it sends back a reply message if it is at the edge of the neighbourhood, or else forwards the probe to the next process in line

  • a process that receives replies from both its neighbours is a winner in phase 0

  • similarly, in a 2^k-neighbourhood the kth phase winner will receive replies from the two farthest processes, one in each direction

  • a process which receives its own probe message declares itself the overall winner, i.e. the leader

slide-45
SLIDE 45


Leader election (LE)

pseudo code for pi:

send <probe, uidi, phase, hop_count> to left and to right; initially phase = 0 and hop_count = 1

upon receiving <probe, j, k, d> from left (or right) {
    if j = uidi then terminate as leader;
    if j > uidi and d < 2^k then
        send <probe, j, k, d+1> to right (or left);    // forward msg and increase hop count
    if j > uidi and d ≥ 2^k then                       // if reached edge, do not forward but
        send <reply, j, k> to left (or right);         // reply; if j < uidi, msg is swallowed
}

upon receiving <reply, j, k> from left (or right) {
    if j ≠ uidi then
        send <reply, j, k> to right (or left)          // forward
    else                                               // reply is for own probe
        if already received <reply, j, k> from right (or left) then
            send <probe, uidi, k+1, 1> to left and to right;   // phase k winner
}
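
The pseudo code can be tried out with the simulation below. It is a sketch under assumptions of its own: all messages share one FIFO queue (one possible asynchronous schedule), uids are unique, the run stops when the first process receives its own probe, and the names `hirschberg_sinclair`, `LEFT` and `RIGHT` are illustrative.

```python
from collections import deque


def hirschberg_sinclair(uids):
    """Simulate the probe/reply algorithm on a bidirectional ring of unique uids.

    Returns (leader_uid, messages_sent).
    """
    n = len(uids)
    LEFT, RIGHT = 1, -1      # index step: left is (i + 1) % n, right is (i - 1) % n
    queue = deque()          # entries: (receiver index, travel direction, message)
    replies = set()          # (index, phase, travel direction) of replies to own probes
    messages = 0

    def send(sender, direction, msg):
        nonlocal messages
        messages += 1
        queue.append(((sender + direction) % n, direction, msg))

    for i in range(n):       # phase 0: every process probes both neighbours
        for direction in (LEFT, RIGHT):
            send(i, direction, ("probe", uids[i], 0, 1))

    while queue:
        i, direction, msg = queue.popleft()
        if msg[0] == "probe":
            _, j, k, d = msg
            if j == uids[i]:
                return j, messages                           # own probe went all the way round
            if j > uids[i] and d < 2 ** k:
                send(i, direction, ("probe", j, k, d + 1))   # forward, one more hop
            elif j > uids[i]:
                send(i, -direction, ("reply", j, k))         # edge reached: reply
            # j < uids[i]: the probe is swallowed
        else:
            _, j, k = msg
            if j != uids[i]:
                send(i, direction, msg)                      # relay the reply
            else:
                replies.add((i, k, direction))
                if (i, k, -direction) in replies:            # both sides replied: phase k winner
                    for out in (LEFT, RIGHT):
                        send(i, out, ("probe", j, k + 1, 1))
    raise ValueError("uids must be unique and non-empty")


leader, sent = hirschberg_sinclair([3, 7, 1, 9, 4, 2, 8, 5])
print(leader, sent)
```

Only the process with the largest uid can win every phase, so its probe is the one that eventually travels all the way round the ring.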

slide-46
SLIDE 46


Leader election (LE)

  • other possible scenarios:

– synchronous with alternative ‘swallowing’ rules (anything higher than the minimum uid seen so far, etc.) and with tweaking of uid usage

– leads to a synchronous leader election algorithm whose message complexity is at most 4n

slide-47
SLIDE 47


Distributed mutual exclusion (DME)

  • shared memory mutual exclusion is a well-known aspect of operating systems, arising when concurrent threads need to access a shared variable or object for read/write purposes

  • the shared resource is made a critical section, with access to it controlled by atomic lock or semaphore operations

  • the lock or the semaphore variable is seen consistently by all threads

  • asynchronous shared memory is an alternative possibility: say, P1, P2 and P3 share M1, and P2 and P3 share M2

slide-48
SLIDE 48


DME

  • in a distributed system there is no shared lock variable to look at

  • processes have to agree, by message passing, on the process eligible to access the shared resource at any given time

  • assumptions: system of n processes, pi, i = 1..n; a process wishing to access an external shared resource must obtain permission to enter the critical section (CS); asynchronous; processes do not fail; messages are reliably delivered

slide-49
SLIDE 49


  • correctness properties
  • ME1 safety: at most one process may execute in the CS at any given time

  • ME2 liveness: requests to enter and exit the CS eventually succeed

  • ME3 ordering: if one request to enter the CS ‘happened-before’ another, then entry to the CS is granted in that order
  • ME2 ensures freedom from both starvation and deadlock
slide-50
SLIDE 50


DME

  • several algorithms exist: Central Server version, Ring, Ricart-Agrawala

Central Server version

[Figure: processes P1 to P4 and a server holding a queue of requests (4, 2); message steps: 1. request token, 2. release token, 3. grant token]

slide-51
SLIDE 51


DME

  • In this scenario, there is a central server S that grants permission to the processes to enter the CS based on a token request

  • ME1 and ME2 are satisfied under the assumptions made (no process failures, reliable message delivery)
  • ME3 is not satisfied, since arbitrary message delays may cause requests to arrive out of order at S
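
The server side of this scheme can be sketched as below. Message passing is replaced by direct method calls and the class name `CentralServer` is hypothetical, so this only illustrates the token and queue logic.

```python
from collections import deque


class CentralServer:
    """Grants a single token; requests that arrive while it is held are queued."""

    def __init__(self):
        self.holder = None        # process currently holding the token, if any
        self.waiting = deque()    # pending requests, in order of arrival at the server

    def request(self, p):
        """Handle a token request; return the process granted the token, or None."""
        if self.holder is None:
            self.holder = p
            return p
        self.waiting.append(p)
        return None

    def release(self, p):
        """Handle a token release; return the next process granted the token, or None."""
        if p != self.holder:
            raise ValueError("only the token holder can release")
        self.holder = self.waiting.popleft() if self.waiting else None
        return self.holder


server = CentralServer()
print(server.request("P1"))   # P1 (token granted at once)
print(server.request("P4"))   # None (queued)
print(server.request("P2"))   # None (queued)
print(server.release("P1"))   # P4
print(server.release("P4"))   # P2
```

The queue is served in arrival order at the server, which need not be the order in which the requests were issued; that gap is why ME3 does not hold.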

slide-52
SLIDE 52


DME

Ring algorithm:

  • assumptions: processes are ordered in a logical ring with unidirectional communication, where each process pi communicates only with p((i+1) mod n); system of n processes, pi, i = 1..n; asynchronous; processes do not fail; messages are reliably delivered

  • mutual exclusion is obtained by sole possession of a token
  • ME1 and ME2 are satisfied
  • correctness may not be guaranteed if the assumptions are violated
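
One circuit of the token can be sketched as follows. The function `token_ring` is a hypothetical illustration that indexes processes from 0 rather than 1 and lets a process enter and leave the CS while it holds the token.

```python
def token_ring(n, wants, start=0):
    """Pass the token once round a ring of n processes, starting at `start`.

    `wants` is the set of process indices that want the CS.
    Returns the order in which processes enter the CS.
    """
    order = []
    holder = start
    for _ in range(n):
        if holder in wants:
            order.append(holder)      # enter and leave the CS while holding the token
        holder = (holder + 1) % n     # pass the token to the ring neighbour
    return order


# Processes 3 and 1 want the CS; the token starts at process 2.
print(token_ring(5, {3, 1}, start=2))   # [3, 1]
```

Entry order follows ring position relative to the token, regardless of when each request was made.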

slide-53
SLIDE 53


DME

Ricart-Agrawala algorithm:

  • assumptions: each process pi has a unique identifier uidi and maintains a logical scalar clock LCi; system of n processes, pi, i = 1..n; asynchronous; processes do not fail; messages are reliably delivered

slide-54
SLIDE 54


algorithm in prose:

  • a process pi wishing to access the CS multicasts a request message containing its (uid, timestamp) pair to the whole group

  • a process receiving such a request responds to pi, unless it is already in the CS, or it is itself determined to enter the CS and its own request has a timestamp less than that of pi’s request

  • if pi receives responses from all other processes, then it can enter the CS
slide-55
SLIDE 55


DME

On initialization
    state := RELEASED;

To enter the section
    state := WANTED;
    Multicast request to all processes;
    Ti := request’s timestamp;
    wait until (number of replies received = (N-1));
    state := HELD;

On receipt of a request <Ti, pi> at pj (i != j)
    if (state = HELD or (state = WANTED and (Tj, pj) < (Ti, pi)))
    then
        queue request from pi without replying;
    else
        reply immediately to pi;
    end if

To exit the critical section
    state := RELEASED;
    reply to any queued requests;
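
The rules above can be run as a small simulation. It is a sketch under several assumptions that are not in the slides: all messages go through one FIFO queue, every requester issues its request before any message is delivered, a process leaves the CS as soon as it enters, and there are at least two processes. The names `Process` and `run` and the starting clock values are illustrative.

```python
from collections import deque

RELEASED, WANTED, HELD = "RELEASED", "WANTED", "HELD"


class Process:
    def __init__(self, pid, clock=0):
        self.pid, self.clock = pid, clock
        self.state = RELEASED
        self.request = None      # (timestamp, pid) of own pending request
        self.replies = 0
        self.deferred = []       # requesters whose reply is postponed


def run(processes, requesters):
    """Let `requesters` request the CS concurrently; return the CS entry order."""
    procs = {p.pid: p for p in processes}
    net = deque()                # entries: (kind, sender pid, receiver pid, payload)
    order = []
    for pid in requesters:
        p = procs[pid]
        p.state = WANTED
        p.clock += 1
        p.request = (p.clock, pid)
        for q in procs:
            if q != pid:
                net.append(("request", pid, q, p.request))
    while net:
        kind, src, dst, payload = net.popleft()
        p = procs[dst]
        if kind == "request":
            p.clock = max(p.clock, payload[0]) + 1        # logical clock update
            if p.state == HELD or (p.state == WANTED and p.request < payload):
                p.deferred.append(src)                    # queue without replying
            else:
                net.append(("reply", dst, src, None))
        else:
            p.replies += 1
            if p.replies == len(procs) - 1:
                p.state = HELD
                order.append(dst)                         # in the CS
                p.state, p.replies = RELEASED, 0          # exit the CS at once
                for q in p.deferred:
                    net.append(("reply", dst, q, None))
                p.deferred = []
    return order


# P1's request gets timestamp 41, P2's gets 34; P3 does not want the CS.
print(run([Process(1, 40), Process(2, 33), Process(3)], [1, 2]))   # [2, 1]
```

P2 enters first because (34, 2) < (41, 1): P2 defers its reply to P1 until it has left the CS, while P1 replies to P2 immediately.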

slide-56
SLIDE 56


DME

  • ME1, ME2 and ME3 are satisfied

[Figure: example with processes P1, P2 and P3, showing requests timestamped 41 and 34 and three reply messages]

slide-57
SLIDE 57


DME

  • Message complexity is easily derivable
  • In all three DME algorithms above (server based, ring based and R-A), process failures might violate termination requirements

  • message losses are not acceptable
  • even a perfect failure detector is not applicable, since two of the three algorithms are asynchronous

slide-58
SLIDE 58


Fault tolerant consensus

  • generally speaking, agreement or consensus by participating processes may be on a common value, on a message delivery order, on abort or commit, on a leader, etc.

  • consensus is specified in terms of two primitives: propose and decide

  • properties to be satisfied:
  • termination – every correct process eventually decides some value

  • validity – if a process decides v, then v was proposed by some process

  • integrity – no process decides twice
  • agreement – no two correct processes decide differently
slide-59
SLIDE 59


Fault tolerant consensus

  • validity + integrity + agreement = safety
  • termination = liveness
  • key features: best effort broadcast with no message loss as a mechanism to convey to the community of processes; synchronous; process failures (fail-stop and Byzantine) with key parameter f, the maximum number of processes that can fail, where the system is known as f-resilient

  • uncertainty in consensus in this failure model arises from the possibility that only a partial set of a process’s messages is delivered in any round

slide-60
SLIDE 60


Flooding consensus – version 1

  • assumptions: n processes in a strongly connected undirected graph; processes aware of group size; synchronous; at most f fail-stop processes (hard coded); no message loss; the set of possible decision values {V} is made of all proposed values; each process has exactly one proposed value; objective is a ‘uniform’ decision

slide-61
SLIDE 61


Flooding consensus – version 1

algorithm in prose:

  • processes execute in rounds
  • each process maintains the set of proposals it has seen, by merging, and this set is augmented when moving from one round to the next

  • in each round every process disseminates its augmented set to all others using best effort broadcast

  • a process decides a specific value in its set when the number of rounds equals (f+1)
slide-62
SLIDE 62


[Figure: time diagram (t) of processes p1 to p4 over rounds 1, 2 and 3, with consensus at round (f+1)]

slide-63
SLIDE 63


Flooding consensus – version 1

process automaton:

message_alphabet: subsets of {V}
for each pi, statei: defined by three state variables
    rounds ∈ N, initially 0
    decision ∈ {V} ∪ {unknown}, initially unknown
    W ⊆ V, initially the singleton set consisting of vi, pi’s proposal
msgi: if rounds ≤ f then broadcast W to all other processes;
transi: {
    rounds = rounds + 1;
    receive value xj on input channel j;
    W = W ∪ ⋃j xj;
    if rounds = f+1 then
        if |W| = 1 then decision = v, where W = {v}
        else decision = default;
}
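
The automaton can be simulated round by round. In this sketch the crash model is an assumption made for illustration: `crashes` maps a process to the round in which it crashes and to the set of processes that still receive its message in that round, which is how a partially delivered broadcast is modelled. The function name and the `default` value are hypothetical.

```python
def flooding_consensus(proposals, f, crashes=None, default=None):
    """Run f + 1 synchronous rounds of flooding.

    proposals: dict pid -> proposed value.
    crashes:   dict pid -> (round, set of pids that still receive its last message).
    Returns the decisions of the processes that never crashed.
    """
    crashes = crashes or {}
    W = {p: {v} for p, v in proposals.items()}
    alive = set(proposals)
    for rnd in range(1, f + 2):                      # rounds 1 .. f+1
        inbox = {p: [] for p in alive}
        for p in sorted(alive):
            crash = crashes.get(p)
            crashing = crash is not None and crash[0] == rnd
            receivers = crash[1] if crashing else alive
            for q in receivers:
                if q in inbox and q != p:
                    inbox[q].append(set(W[p]))       # best effort broadcast of W
        alive -= {p for p, c in crashes.items() if c[0] == rnd}
        for p in alive:
            for x in inbox[p]:
                W[p] |= x                            # W = W ∪ ⋃j xj
    return {p: (next(iter(W[p])) if len(W[p]) == 1 else default) for p in alive}


proposals = {1: "a", 2: "b", 3: "b", 4: "b"}
crash = {1: (1, {2})}        # p1 crashes in round 1 after reaching only p2
print(flooding_consensus(proposals, f=1, crashes=crash))   # all survivors decide the default
print(flooding_consensus(proposals, f=0, crashes=crash))   # one round only: decisions differ
```

With f = 1 the second round lets p2 pass on the value it alone received from p1, so all survivors end with the same W. Running a single round (f = 0) in the presence of the same crash leaves p2 with a different set from p3 and p4.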

slide-64
SLIDE 64


Flooding consensus – version 1

proof sketch:

  • termination – all correct processes decide at the end of round f+1, whatever that decision may be

  • validity – suppose all initial proposals are identical to v; then W has only one element, v, and v is the only possible decision

  • agreement – if no process fails, then by the basic broadcast property the W seen by all processes is identical after a single round

  • in the worst case the f failures can be spread across the rounds, but among f+1 rounds there is always one failure-free round, which makes the decision uniform

slide-65
SLIDE 65


  • performance: time complexity: (f+1) rounds
  • a particular feature of all consensus algorithms
  • message complexity: (f+1)n^2
  • other possible decision functions apart from uniform are majority, minimum, maximum, etc.

slide-66
SLIDE 66


Flooding consensus – version 2

  • assumptions: n processes in a fully connected undirected graph; processes aware of group size; synchronous; fail-stop crashes with a perfect failure detector; no message loss; the set of possible decision values {V} is made of all proposed values; each process has exactly one proposed value; any deterministic decision function can be applied
slide-67
SLIDE 67


Flooding consensus – version 2

algorithm in prose:

  • processes execute in rounds
  • each process maintains the set of proposals it has seen, by merging, and this set is augmented when moving from one round to the next

  • in each round every process disseminates its set to all others using best effort broadcast

  • a process decides a specific value in its set when it knows it has gathered all proposals that will ever be seen by any correct process, or when it has detected no new failures in two successive rounds

  • a process that so decides broadcasts its decision to the rest in the next round; all correct processes that have not yet decided will decide on receipt of a decide message

slide-68
SLIDE 68


Flooding consensus – version 2

  • strictly speaking, agreement is not violated, but correct processes must decide a value consistent with values decided by processes that might have decided before crashing

  • suppose a process that receives messages from all others decides but crashes immediately afterwards, before broadcasting to the others

  • the rest move to the next round, detecting a failure, and then to the next, where there may be no further failures, and then may decide on a different outcome
  • the problem can be mitigated by employing a reliable broadcast mechanism: a process must not decide as soon as it is able to, but only after a reliable form of broadcast

slide-69
SLIDE 69

Distributed Systems - D N Ranasinghe 69

Flooding consensus – version 2

  • performance: worst case n rounds, if (n-1) processes crash in sequence

  • impossibility of consensus under asynchronous fail-stop conditions

  • important result by Fischer, Lynch and Paterson: ‘no algorithm can guarantee to reach consensus in an asynchronous system even with one process crash failure’

  • the outcome is mainly due to the indistinguishability of a crashed process from a slow process in an asynchronous system

slide-70
SLIDE 70

Distributed Systems - D N Ranasinghe 70

Flooding consensus – version 2

  • the proof is complicated, but follows the argument that among the many possible executions α there is at least one in which consensus is never reached

  • any alternative?
  • with ‘unreliable failure detectors’: consensus can be solved in an asynchronous system with an unreliable failure detector if fewer than n/2 processes crash (Chandra and Toueg)

slide-71
SLIDE 71

Distributed Systems - D N Ranasinghe 71

Byzantine fault tolerance

  • consensus in a synchronous system in the presence of malicious and/or arbitrary process failures, known by the metaphor of Byzantine failure

  • generals commanding divisions of the Byzantine army communicate using reliable messengers

  • generals should decide on a common plan of action
  • some generals may be traitors and may prevent loyal generals from agreeing by sending conflicting messages to different generals

slide-72
SLIDE 72

Distributed Systems - D N Ranasinghe 72

Byzantine fault tolerance

[Figure: the four generals scenario: a city and four armies, commanded by General 1, General 2, General 3 and General 4]

slide-73
SLIDE 73

Distributed Systems - D N Ranasinghe 73

Byzantine fault tolerance

  • assumptions: n processes in a fully connected undirected graph; processes are aware of the group size; synchronous; at most f Byzantine-faulty processes (hard coded), where a faulty process may send any message with any value at any time or keep silent; no message loss; a correct process detecting the absence of a message associates it with a ‘null’ value; one designated process initiates messages to the other processes; messages are unsigned (oral); the set of possible decision values {V} is made up of the value proposed by the designated process; the objective is a ‘majority’ decision
slide-74
SLIDE 74

Distributed Systems - D N Ranasinghe 74

Byzantine fault tolerance

  • properties to be satisfied:
  • termination – every correct process eventually decides
  • validity – if the sending process is correct then the message received is identical to the message sent (or, if the commanding general is loyal, then every loyal general obeys the order sent)

  • agreement – correct processes receive the same message (or, all loyal generals receive the same order)

  • impossibility with three processes
slide-75
SLIDE 75

Distributed Systems - D N Ranasinghe 75

Byzantine fault tolerance

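[Figure: two three-process scenarios with commanding general CG and generals G1 and G2. Panel 1 (G2 is a traitor): message labels attack, attack, retreat. Panel 2 (CG is a traitor): message labels attack, retreat, attack, retreat.]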

slide-76
SLIDE 76

Distributed Systems - D N Ranasinghe 76

Byzantine fault tolerance

  • algorithm in prose: processes execute in rounds; the designated process initiates by a best-effort broadcast of its message to the others; each correct process maintains the set of proposals it has seen, built by merging the sets it receives, and this set is augmented when moving from one round to the next; in each round every correct process disseminates its set to all others except the designated process, using best-effort broadcast; a correct process decides on a majority value in its set (or falls back to a default) when the number of rounds equals (f+1)

  • case (a) – three processes with participating general p3 as traitor; case (b) – three processes with commanding general p1 as traitor

slide-77
SLIDE 77

Distributed Systems - D N Ranasinghe 77

Byzantine fault tolerance

  • outcome: termination – satisfied by definition, whatever the decision is; validity – not satisfied for case (a) (p2 does not follow p1) and not applicable for case (b); agreement – satisfied for case (b) (p2 and p3 fall back on the default) and not applicable for case (a)

[Figure: message exchanges among P1, P2 and P3 for the two cases; message labels shown: 1:X, 2:1:V, 3:1:u, 1:V, 1:W, 2:1:W, 3:1:X, 1:V]

slide-78
SLIDE 78

Distributed Systems - D N Ranasinghe 78

Byzantine fault tolerance

  • consensus with four processes: case (a) – four processes with participating general p3 as traitor; case (b) – four processes with commanding general p1 as traitor
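
The two four-process cases can be replayed with a minimal sketch of the unsigned-message exchange for f = 1 (two rounds: the commander's send, then one relay round among the participating generals). The names `om1`, `honest_relay`, `traitor_p3_relay` and the string `"default"` are hypothetical helpers introduced only for this illustration; the traitor is modelled simply as a relay function that lies.

```python
from collections import Counter

DEFAULT = "default"  # fallback value when no strict majority exists (assumed name)


def majority(values):
    """Return the strict-majority value, or DEFAULT if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else DEFAULT


def om1(commander_sends, relay):
    """One relay round among the participating generals (f = 1).

    commander_sends: dict general -> value received from the commander
    relay(sender, receiver, value): value that `sender` forwards to `receiver`
    """
    generals = sorted(commander_sends)
    decisions = {}
    for me in generals:
        votes = [commander_sends[me]]               # round 1: commander's message
        for other in generals:
            if other != me:                         # round 2: relayed copies
                votes.append(relay(other, me, commander_sends[other]))
        decisions[me] = majority(votes)
    return decisions


def honest_relay(sender, receiver, value):
    return value


def traitor_p3_relay(sender, receiver, value):
    # p3 lies about what the commander said; everyone else is honest.
    return "retreat" if sender == "p3" else value


# Case (a): p3 is the traitor, commander p1 is loyal and orders "attack".
case_a = om1({"p2": "attack", "p3": "attack", "p4": "attack"}, traitor_p3_relay)
print(case_a["p2"], case_a["p4"])

# Case (b): commander p1 is the traitor and sends three different values.
case_b = om1({"p2": "x", "p3": "y", "p4": "z"}, honest_relay)
print(case_b)

# Three processes, p3 traitor: p2 sees one vote each way and cannot follow p1.
print(om1({"p2": "attack", "p3": "attack"}, traitor_p3_relay)["p2"])
```

With four processes the loyal generals outvote the single traitor in case (a), and in case (b) all three see the same multiset of values and fall back on the default together; with three processes p2 has no majority and ends up on the default instead of the commander's order.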

slide-79
SLIDE 79

Distributed Systems - D N Ranasinghe 79

Byzantine fault tolerance

  • outcome: case (a) – validity and agreement satisfied; case (b) – validity not applicable, agreement satisfied (p2, p3 and p4 fall back on the default)

  • scenario with signed messages: digitally signing a message uniquely identifies the message and its originator

  • revisit the three-process consensus: case (a) – the traitor cannot alter the commanding general’s message but can stay silent: validity satisfied (p2 discards a bogus message from p3); case (b) – agreement satisfied (p2 and p3 fall back on the default)

  • Byzantine agreement is solvable with three processes and one failure if processes digitally sign their messages
slide-80
SLIDE 80

Distributed Systems - D N Ranasinghe 80

Byzantine fault tolerance

  • complexity: time – (f+1) rounds
  • message – O(n^(f+1)), an exponential message complexity
  • generic result: Byzantine agreement is solvable with at least (3f+1) processes in (f+1) rounds, where f is the maximum number of Byzantine failures

  • a BFT consensus alternative with constant message size exists, provided n > 4f; it runs for 2(f+1) rounds

slide-81
SLIDE 81

Distributed Systems - D N Ranasinghe 81

Time and Global states

slide-82
SLIDE 82

Distributed Systems - D N Ranasinghe 82

  • a distributed system by nature has no single clock, and it is practically difficult to synchronise physical clocks across a system

  • a mechanism to globally order events in an asynchronous system is an important requirement for replica management, consensus, etc.

slide-83
SLIDE 83

Distributed Systems - D N Ranasinghe 83

Logical clocks

  • Leslie Lamport introduced the concept of a causal relationship observable in a message-passing distributed system
slide-84
SLIDE 84

Distributed Systems - D N Ranasinghe 84

Logical clocks

  • a potential causal ordering can be established by looking at ‘happened-before’ relationships (indicated by an arrow →) between local events within a process as well as between sending and receiving events across processes: e.g., p1: a → b, p2: c → d, p3: e → f, p1 and p2: b → c, p2 and p3: d → f, etc.

  • transitivity property: if x → y and y → z then x → z
  • concurrency definition: if ¬(x → y) and ¬(y → x) then we say (x || y)

  • it can easily be established that for p1 and p3: a → f and a || e
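
The claims a → f and a || e can be checked mechanically. The following is a small sketch, assuming the five direct relationships listed above are the only ones; the helper names `transitive_closure`, `happened_before` and `concurrent` are introduced here for illustration.

```python
from itertools import product


def transitive_closure(edges):
    """Close a set of (x, y) pairs under: x -> y and y -> z imply x -> z."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (x, y), (u, z) in product(list(closure), repeat=2):
            if y == u and (x, z) not in closure:
                closure.add((x, z))
                changed = True
    return closure


# Direct happened-before pairs from the example: local order plus send/receive.
DIRECT = {("a", "b"), ("c", "d"), ("e", "f"), ("b", "c"), ("d", "f")}
HB = transitive_closure(DIRECT)


def happened_before(x, y):
    return (x, y) in HB


def concurrent(x, y):
    return not happened_before(x, y) and not happened_before(y, x)


print(happened_before("a", "f"))  # a -> b -> c -> d -> f
print(concurrent("a", "e"))       # no chain links a and e in either direction
```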

slide-85
SLIDE 85

Distributed Systems - D N Ranasinghe 85

Logical clocks

  • it is possible to timestamp the events of a distributed system such that

  • rule 1 – if e1 and e2 are local events in pi and e1 → e2 then Ci(e1) < Ci(e2)

  • rule 2 – if e1 is the sending event of a message by pi and e2 is the corresponding receiving event of the message by pj then Ci(e1) < Cj(e2)

  • generalised notation: e_i^j denotes event number j of process pi

  • local history (possibly an infinite sequence of events) of pi: h_i = e_i^1 e_i^2 e_i^3 ..., and the global history of the system: H = h_1 ∪ h_2 ∪ ... ∪ h_n

slide-86
SLIDE 86

Distributed Systems - D N Ranasinghe 86

Logical clocks

  • Lamport clock timestamp rules:

– given that LC(ei) is the logical timestamp of event ei and LCi is the value of the logical clock of pi, then

LC(ei) = LCi + 1 if ei is an internal event or a send event
LC(ei) = max(LCi, TS(m)) + 1 if ei is a receive event

  • where TS(m) is the timestamp of the received message
  • after the occurrence of event ei on pi, the logical clock of pi is updated as LCi ← LC(ei)
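
A minimal sketch of these timestamp rules follows. The class name `LamportClock` and its methods are hypothetical, and the replayed scenario reuses the events a to f of the earlier happened-before example (b sends to c, d sends to f), which is an assumption made here for illustration.

```python
class LamportClock:
    """Logical clock of one process, following the two timestamp rules."""

    def __init__(self):
        self.time = 0  # LCi, initially 0

    def tick(self):
        """Internal or send event: LC(e) = LCi + 1."""
        self.time += 1
        return self.time

    def receive(self, ts):
        """Receive event: LC(e) = max(LCi, TS(m)) + 1."""
        self.time = max(self.time, ts) + 1
        return self.time


p1, p2, p3 = LamportClock(), LamportClock(), LamportClock()
a = p1.tick()        # internal event on p1
b = p1.tick()        # p1 sends m1, stamped with b
c = p2.receive(b)    # p2 receives m1
d = p2.tick()        # p2 sends m2, stamped with d
e = p3.tick()        # internal event on p3
f = p3.receive(d)    # p3 receives m2
print(a, b, c, d, e, f)
```

Here LC(e) < LC(b) even though e and b are concurrent, which illustrates why a smaller Lamport timestamp alone does not establish a happened-before relationship.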

slide-87
SLIDE 87

Distributed Systems - D N Ranasinghe 87

Logical clocks

  • properties: e → e’ ⇒ LC(e) < LC(e’); but note that LC(e) < LC(e’) ⇏ e → e’

[Figure: space-time diagram of processes Pi and Pj, each with its local clock axis T; events e_i^1, e_i^2, e_i^3 on Pi and e_j^1, e_j^2, e_j^3 on Pj; a message m with its send and recv events]

  • Lamport’s logical clocks enforce only a partial ordering of events
  • how can a causal order of events be enforced? Vector clocks, by Mattern and Fidge

slide-88
SLIDE 88

Distributed Systems - D N Ranasinghe 88

Logical clocks

  • specification: VC(ei), the vector timestamp of event ei on pi, is a vector of size n with elements VC(ei)[j], j = 1..n, where n is the group size
  • for j = i, VC(ei)[j] is the number of events on pi up to and including ei

  • for j ≠ i, VC(ei)[j] is the number of events on pj that happened before ei

[Figure: space-time diagram against physical time for p1, p2 and p3 with events a to f and messages m1 and m2; vector timestamps shown: (1,0,0), (2,0,0), (2,1,0), (2,2,0), (0,0,1), (2,2,2)]

slide-89
SLIDE 89

Distributed Systems - D N Ranasinghe 89

Logical clocks

  • Vector clock timestamp rules:

– VCi = vector clock of pi
– if ei is an internal event or send(m) on pi then ∀ j ≠ i: VC(ei)[j] ← VCi[j], and VC(ei)[i] ← VCi[i] + 1
– else, if ei is a receive event on pi of a message m with vector timestamp VT(m), then VC(ei) ← max(VCi, VT(m)) (element-wise) and VC(ei)[i] ← VC(ei)[i] + 1
– after the occurrence of event ei on pi, its vector clock is updated as VCi ← VC(ei)
– comparing two vector clocks:
– VC(e) < VC(e’) iff (VC(e) ≤ VC(e’)) and (VC(e) ≠ VC(e’)), where

  • VC(e) ≠ VC(e’) iff ∃j s.t. VC(e)[j] ≠ VC(e’)[j] and
  • VC(e) ≤ VC(e’) iff ∀j: VC(e)[j] ≤ VC(e’)[j]
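
The update and comparison rules can be sketched as follows. The class `VectorClock` and the helpers `vc_leq`, `vc_less` and `vc_concurrent` are names assumed for this illustration, and process indices are 0-based in the code (the slides count from 1). The replay uses the three-process example with events a to f, assuming b sends m1 to c and d sends m2 to f.

```python
class VectorClock:
    """Vector clock of process `index` in a group of `n` processes."""

    def __init__(self, n, index):
        self.index = index
        self.v = [0] * n

    def tick(self):
        """Internal or send event: increment own component only."""
        self.v[self.index] += 1
        return tuple(self.v)

    def receive(self, ts):
        """Receive event: element-wise max with VT(m), then increment own component."""
        self.v = [max(own, other) for own, other in zip(self.v, ts)]
        self.v[self.index] += 1
        return tuple(self.v)


def vc_leq(u, v):
    return all(x <= y for x, y in zip(u, v))


def vc_less(u, v):
    return vc_leq(u, v) and u != v


def vc_concurrent(u, v):
    return not vc_less(u, v) and not vc_less(v, u)


p1, p2, p3 = VectorClock(3, 0), VectorClock(3, 1), VectorClock(3, 2)
a = p1.tick()        # (1, 0, 0)
b = p1.tick()        # (2, 0, 0), sent with m1
c = p2.receive(b)    # (2, 1, 0)
d = p2.tick()        # (2, 2, 0), sent with m2
e = p3.tick()        # (0, 0, 1)
f = p3.receive(d)    # (2, 2, 2)
print(vc_less(a, f), vc_concurrent(a, e))
```

The comparison returns a → f and a || e, matching what was derived from the happened-before relation directly.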
slide-90
SLIDE 90

Distributed Systems - D N Ranasinghe 90

Logical clocks

  • Vector clock properties: e→e’⇔ VC(e) < VC(e’)
  • e || e’ ⇔ ¬(VC(e) < VC(e’)) and ¬(VC(e’) < VC(e))
  • Vector clocks impose a causal order of events
slide-91
SLIDE 91

Distributed Systems - D N Ranasinghe 91

Global property of a distributed computation

properties to look for in a distributed system:

  • garbage collection – objects having no references to them within a process can be discarded

  • deadlock detection – cyclic waiting for resources
  • termination detection – not only has each process halted, but there are also no messages in transit

  • debugging – ensuring, for example, that variables across processes remain within defined limits

slide-92
SLIDE 92

Distributed Systems - D N Ranasinghe 92

Global property of a distributed computation

  • among these is the class of stable properties
  • stable ⇒ once true, remains true forever
  • to observe the state, there is no omniscient observer who can record an instantaneous snapshot of the system state

  • a useful concept if the system is asynchronous
slide-93
SLIDE 93

Distributed Systems - D N Ranasinghe 93

Global property of a distributed computation

  • first a few notations and definitions
  • let q_i^k be the state of process pi after the occurrence of event e_i^k, and q_i^0 the initial state of pi

[Figure: space-time diagram of P1, P2 and P3 with events e_1^1 to e_1^4 on P1, e_2^1 and e_2^2 on P2, and e_3^1 to e_3^4 on P3]

slide-94
SLIDE 94

Distributed Systems - D N Ranasinghe 94

Global property of a distributed computation

  • the global state of a distributed computation at any given instant is defined by the tuple (q_1^{k1}, q_2^{k2}, ..., q_n^{kn}); the global state does not include the state of the channels

  • a cut of a distributed computation is defined as a subset of the global history H given by C = h_1^{c1} ∪ h_2^{c2} ∪ ... ∪ h_n^{cn}, where h_i^{ci} = e_i^1 e_i^2 ... e_i^{ci} is the local event history of pi up to event number ci

  • cut C is defined by the tuple (c1, c2, ..., cn)
  • the global state (q_1^{c1}, q_2^{c2}, ..., q_n^{cn}) corresponds to cut C

slide-95
SLIDE 95

Distributed Systems - D N Ranasinghe 95

Global property of a distributed computation

  • usefulness of a cut C:

– it is possible to express a global property of a distributed computation, such as deadlock or termination, as a global state predicate Φ which evaluates an observed state to true or false

[Figure: space-time diagram of P1, P2 and P3 with events e_1^1 to e_1^4, e_2^1, e_2^2 and e_3^1 to e_3^4, and a cut C drawn across the process lines]

slide-96
SLIDE 96

Distributed Systems - D N Ranasinghe 96

Global property of a distributed computation

  • suppose a process p0, outside of the system, asks each process pi for its local state q_i; process p0 builds the global state Q = (q_1^{k1}, q_2^{k2}, ..., q_n^{kn}) and Φ is evaluated on Q to give a value in {true, false}

  • consider some predicate Φ, evaluated on a consistent cut C expressed by the state (q_1^{c1}, q_2^{c2}, ..., q_n^{cn}), such that Φ(C) = value of Φ on C ∈ {T, F}
  • let cut C precede a cut C’ iff C ⊂ C’
  • a predicate is said to be stable iff the following property holds: Φ(C) ⇒ Φ(C’) for all C’ such that C ⊂ C’

slide-97
SLIDE 97

Distributed Systems - D N Ranasinghe 97

Global property of a distributed computation

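[Figure: bank transfer example over processes p1 to p4 with a cut C; account balances in CHF change as transfers of 100 are made (values shown: 300, 750, 650, 400, 150, 500, 50, 200, 350, 100, 600, 400); the cut is marked as a problem]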

slide-98
SLIDE 98

Distributed Systems - D N Ranasinghe 98

  • invariant for the bank transfer example: there should not be more money in the accounts than there was originally

  • global state defined by cut C’: (400, 650, 400, 600); total amount = 2050 > 1550; cut C’ is not consistent

  • definition: a cut C is consistent iff for all events e, e’: (e’ ∈ C and e → e’) ⇒ e ∈ C

  • definition: a consistent global state is a global state defined by a consistent cut

  • vector clocks can be used to determine whether a cut is consistent or not

slide-99
SLIDE 99

Distributed Systems - D N Ranasinghe 99

Consider

  • VC(ei)[j] – number of events on pj that happened before ei (on pi)

  • VC(ej)[j] – number of events on pj up to and including ej
  • therefore if VC(ei)[j] > VC(ej)[j] then ei is aware of more events on pj than ej itself

  • that is, there was a subsequent event after ej on pj which happened before ei

  • this is exactly an inconsistent cut
slide-100
SLIDE 100

Distributed Systems - D N Ranasinghe 100

  • a cut C is consistent if and only if ∀ i, j: VC(e_j^{cj})[j] ≥ VC(e_i^{ci})[j]; that is, the cut event e_i^{ci} cannot be aware of more events on pj than e_j^{cj} itself

[Figure: processes pi and pj with cut events ei and ej]

slide-101
SLIDE 101

Distributed Systems - D N Ranasinghe 101

  • consistency test for cut (c1, c2, c3):
  • for a three-process situation, proceed as

– VC(e_1^{c1})[1] ≥ VC(e_2^{c2})[1] and VC(e_1^{c1})[1] ≥ VC(e_3^{c3})[1] for p1

– VC(e_2^{c2})[2] ≥ VC(e_1^{c1})[2] and VC(e_2^{c2})[2] ≥ VC(e_3^{c3})[2] for p2

– VC(e_3^{c3})[3] ≥ VC(e_1^{c1})[3] and VC(e_3^{c3})[3] ≥ VC(e_2^{c2})[3] for p3

  • a monitor process may establish whether an observed global state is consistent using the collected vector timestamps of the processes
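
A monitor's test can be sketched in a few lines. The function name `is_consistent_cut` and the representation of a cut by its frontier (the vector timestamp of the last event of each process inside the cut, 0-based indices) are assumptions for this illustration; the sample timestamps are those of the three-process vector clock example.

```python
def is_consistent_cut(frontier):
    """frontier[j] is the vector timestamp of the last event of process j in the cut.

    The cut is consistent iff, for every pair (i, j), the cut event of process i
    knows of no more events on process j than process j's own cut event.
    """
    n = len(frontier)
    return all(frontier[j][j] >= frontier[i][j]
               for i in range(n) for j in range(n))


# Cut after b on p1, after c on p2 and after e on p3.
print(is_consistent_cut([(2, 0, 0), (2, 1, 0), (0, 0, 1)]))

# Cut after a on p1 but after c on p2: c already knows two events of p1.
print(is_consistent_cut([(1, 0, 0), (2, 1, 0), (0, 0, 1)]))
```

The second cut includes the receipt of m1 on p2 without its sending on p1, which is the kind of cut that made money appear from nowhere in the bank example.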

slide-102
SLIDE 102

Distributed Systems - D N Ranasinghe 102

Computing a consistent global state (snapshot)

Chandy-Lamport algorithm

  • a snapshot is a consistent record of process states and channel states

  • assumptions: FIFO channels (can be imposed using sequence numbers); recorded states may be collected by a designated process; no process fails; no message loss; the graph is strongly connected; any one of the processes can initiate a global snapshot

slide-103
SLIDE 103

Distributed Systems - D N Ranasinghe 103

algorithm in prose:

  • process p1 (initiator of the snapshot) saves its state q_1^{c1} and broadcasts the message snapshot to P (the set of all processes)

  • let pi receive the snapshot message for the first time from some process pj (pj can be different from p1), at which time pi saves its state q_i^{ci} and broadcasts the snapshot message to P (no application event can take place between the reception of a snapshot message and the rebroadcast)

  • when pi has received snapshot from all processes in P, the computation of the snapshot is terminated
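
The steps above can be tried out on a toy bank with FIFO channels. This is a sketch under assumptions: the class `SnapshotSim`, its method names and the delivery schedule are hypothetical, and the recording of in-flight messages per channel (everything that arrives on a channel after the receiver has saved its state and before the snapshot message arrives on that channel) follows the usual formulation of the Chandy-Lamport algorithm rather than anything spelled out on the slide.

```python
from collections import deque


class SnapshotSim:
    """Toy bank: processes hold balances and exchange money over FIFO channels."""

    def __init__(self, balances):
        self.n = len(balances)
        self.balance = list(balances)
        self.chan = {(s, d): deque()
                     for s in range(self.n) for d in range(self.n) if s != d}
        self.recorded = {}     # pid -> saved local state
        self.chan_state = {}   # (src, dst) -> amounts recorded as in transit
        self.open = set()      # incoming channels still being recorded

    def transfer(self, src, dst, amount):
        self.balance[src] -= amount
        self.chan[(src, dst)].append(("money", amount))

    def _record(self, pid):
        # Save the local state, then send the snapshot message on every outgoing channel.
        self.recorded[pid] = self.balance[pid]
        for other in range(self.n):
            if other != pid:
                self.chan_state[(other, pid)] = []
                self.open.add((other, pid))
                self.chan[(pid, other)].append(("snapshot", None))

    def start_snapshot(self, pid):
        self._record(pid)

    def deliver(self, src, dst):
        kind, amount = self.chan[(src, dst)].popleft()
        if kind == "money":
            self.balance[dst] += amount
            if (src, dst) in self.open:            # arrived after dst saved its state
                self.chan_state[(src, dst)].append(amount)
        else:
            if dst not in self.recorded:           # first snapshot message seen by dst
                self._record(dst)
            self.open.discard((src, dst))          # this channel's recording is complete

    def run_to_completion(self):
        while any(self.chan.values()):
            for key, queue in self.chan.items():
                if queue:
                    self.deliver(*key)

    def done(self):
        return len(self.recorded) == self.n and not self.open

    def snapshot_total(self):
        in_transit = sum(sum(amounts) for amounts in self.chan_state.values())
        return sum(self.recorded.values()) + in_transit


sim = SnapshotSim([100, 200, 300])
sim.transfer(0, 1, 50)      # 50 is in flight from process 0 to process 1
sim.start_snapshot(1)       # process 1 initiates before the money arrives
sim.run_to_completion()
print(sim.recorded, sim.chan_state[(0, 1)], sim.snapshot_total())
```

In this run the transfer of 50 is captured as channel state rather than in either account, so the snapshot total equals the initial 600 and the bank invariant holds for the recorded global state.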

slide-104
SLIDE 104

Distributed Systems - D N Ranasinghe 104

Chandy-Lamport snapshot algorithm

[Figure: space-time diagram of p1 to p4 showing the snapshot points c1, c2, c3 and c4 and two events e and e’]

slide-105
SLIDE 105

Distributed Systems - D N Ranasinghe 105

  • as depicted, p4 receives its first snapshot message from p3 and not from p1

proof sketch:

  • the global state Q = (q_1^{c1}, q_2^{c2}, ..., q_n^{cn}) is consistent

  • consider the cut C = (c1, c2, ..., cn)
  • due to send(snapshot) → receive(snapshot) and FIFO channels, for all i, j: if cj ∈ C and ci → cj, then ci ∈ C