SLIDE 1

Distributed Systems in practice Recitation Class 2 – 3PC/Quorum Systems

René Müller, Systems Group, ETH Zurich muellren@inf.ethz.ch, IFW B49.1 HS 2008

SLIDE 2

Friday, 12 December 2008, René Müller, Systems Group, Department of Computer Science, ETH Zurich

Important Note: Download of the Book

  • Apparently, Microsoft Research updated their website, so the link to Phil Bernstein's book "Concurrency Control and Recovery in Database Systems" is no longer valid.
  • However, the FTP link (still) works.
  • Alternatively, you can find the book on the VS_Wiki used earlier in the lecture.

SLIDE 3

Problems with 2PC

  • In 2PC any process can block during its uncertainty period.
  • If all operational processes are uncertain, they all remain blocked.
  • Example: the coordinator failed after deciding (the coordinator is no longer uncertain, but the remaining processes still are).
  • This issue is addressed in 3PC.
SLIDE 4

Non-blocking Rule

  • NB: If any operational process is uncertain, then no process can have decided to commit.
  • Solution to the previous problem:

→ If all operational processes find out that they are uncertain, they can safely abort, knowing that none of the failed processes could have decided commit.

SLIDE 5

Non-Blocking Rule in 3PC

  • Idea: use an additional round of messages (PRE-COMMIT, ACK) to get everybody out of the uncertainty window.
  • The 3PC coordinator sends PRE-COMMIT before COMMIT.
  • Semantics of PRE-COMMIT: the decision is going to be commit if there are no failures.
  • A node receiving a PRE-COMMIT replies with an ACK.
  • What is the purpose of the ACK message? The coordinator has to expect an ACK from each participant.
  • To signal an event! The ACK signals that the participant is participating in the second phase.

SLIDE 6

Three-Phase Commitment Protocol (3PC)

Roles

  • Coordinator (C): initiates 3PC
  • Participants (P)

Messages

  • VOTE-REQ: (C) → (P)
  • YES, NO: (P) → (C)
  • PRE-COMMIT: (C) → (P)
  • ACK: (P) → (C)
  • COMMIT, ABORT: (C) → (P)

Timeouts on

  • (P) VOTE-REQ → abort
  • (C) YES, NO → abort
  • (P) PRE-COMMIT → termination protocol
  • (C) ACK → ignore failed participants
  • (P) COMMIT → termination protocol

Steps

  • 1. Coordinator sends VOTE-REQ to all participants.
  • 2. On receiving VOTE-REQ, each participant votes and sends its YES/NO vote to the coordinator.
  • 3. Coordinator collects the votes and decides commit/abort.
      • All vote yes → PRE-COMMIT
      • Otherwise → ABORT
  • 4. Participants receive either
      • PRE-COMMIT → reply ACK
      • ABORT → abort
  • 5. Coordinator receives the ACKs, then sends COMMIT to those it received an ACK from.
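The coordinator side of the steps above can be sketched in a few lines of Python. This is only an illustrative skeleton under stated assumptions: the function name `coordinator_round`, the dictionary of votes, and the set of received ACKs are hypothetical, message delivery is not modelled, and a timed-out vote is represented as `None`.

```python
from enum import Enum


class Outcome(Enum):
    COMMIT = "commit"
    ABORT = "abort"


def coordinator_round(votes, acks):
    """Skeleton of the coordinator side of 3PC (hypothetical helper).

    votes: dict mapping participant -> "YES", "NO" or None (None = timeout)
    acks:  set of participants whose ACK arrived before the timeout
    Returns (outcome, sent), where sent lists (message, recipients) in order.
    """
    participants = sorted(votes)
    sent = [("VOTE-REQ", participants)]

    # Step 3: a NO vote or a timeout on a vote leads to abort.
    if any(v != "YES" for v in votes.values()):
        sent.append(("ABORT", participants))
        return Outcome.ABORT, sent

    # All voted YES: announce the intended decision first.
    sent.append(("PRE-COMMIT", participants))

    # Step 5: timeout on ACK means the failed participants are ignored.
    acked = [p for p in participants if p in acks]
    sent.append(("COMMIT", acked))
    return Outcome.COMMIT, sent
```

For example, `coordinator_round({"p1": "YES", "p2": "YES"}, {"p1"})` decides commit and sends COMMIT only to `p1`, the participant whose ACK arrived.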

SLIDE 7

Coordinator

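States: start, wait for votes, wait for ACKs, aborted, committed

  • start → wait for votes: send VOTE-REQ
  • wait for votes → wait for ACKs: all vote yes → send PRE-COMMIT
  • wait for votes → aborted: some vote no → send ABORT
  • wait for votes → aborted: timeout → decide abort and send ABORT
  • wait for ACKs → committed: all ACKs received → send COMMIT to everybody
  • wait for ACKs → committed: timeout on ACKs → send COMMIT to the nodes that sent an ACK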

SLIDE 8

Participant

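States: wait for VOTE-REQ, uncertain, committable, aborted, committed

  • wait for VOTE-REQ → aborted: vote no → send NO and abort
  • wait for VOTE-REQ → uncertain: vote yes → send YES
  • wait for VOTE-REQ → aborted: timeout → decide abort
  • uncertain → committable: PRE-COMMIT received → send ACK
  • uncertain → aborted: ABORT received → abort
  • uncertain, timeout: the participant is uncertain and cannot unilaterally decide → start the termination protocol (same as in 2PC)
  • committable → committed: COMMIT received → commit
  • committable, timeout: even though the decision is commit, the participant cannot commit yet. Committing would violate the NB rule (others may still be uncertain) → start the termination protocol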
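The participant state machine can be written down as a transition table. The following Python sketch is an illustration only; the state and event names (`wait_vote_req`, `vote_yes`, and so on) and the helper `run` are hypothetical labels chosen here, and the termination protocol itself is represented just by an action string.

```python
# (state, event) -> (next state, action); names are illustrative assumptions.
TRANSITIONS = {
    ("wait_vote_req", "vote_no"): ("aborted", "send NO"),
    ("wait_vote_req", "vote_yes"): ("uncertain", "send YES"),
    ("wait_vote_req", "timeout"): ("aborted", "decide abort"),
    ("uncertain", "PRE-COMMIT"): ("committable", "send ACK"),
    ("uncertain", "ABORT"): ("aborted", "abort"),
    # An uncertain participant may not decide on its own.
    ("uncertain", "timeout"): ("uncertain", "start termination protocol"),
    ("committable", "COMMIT"): ("committed", "commit"),
    # Committing here could violate the NB rule, so the state is kept.
    ("committable", "timeout"): ("committable", "start termination protocol"),
}


def run(events, state="wait_vote_req"):
    """Feed a sequence of events to one participant and collect its actions."""
    actions = []
    for event in events:
        state, action = TRANSITIONS[(state, event)]
        actions.append(action)
    return state, actions
```

A failure-free run is `run(["vote_yes", "PRE-COMMIT", "COMMIT"])`, which ends in the state `committed`.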

SLIDE 9

Termination Protocol

  • 1. Elect a new coordinator.
  • 2. The coordinator sends STATE-REQ to all processes in the election.
  • 3. All operating processes report their state.
  • 4. The coordinator applies the termination rules based on the state reports:
      • TR1: If some process is aborted → send ABORT.
      • TR2: If some process is committed → send COMMIT.
      • TR3: If some process is uncertain → decide abort and send ABORT.
      • TR4: If some process is committable but none is committed → resume 3PC as new coordinator by (re-)sending PRE-COMMIT.
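As a rough sketch, the termination rules can be expressed as a single decision function over the reported states. The function name `termination_decision` and the returned strings are hypothetical. The rules are checked in the order TR1 to TR4, which is an assumption of this sketch; when both an aborted and an uncertain process report, TR1 and TR3 both lead to abort, so the order does not change the outcome.

```python
VALID_STATES = {"aborted", "committed", "uncertain", "committable"}


def termination_decision(reported_states):
    """Apply TR1-TR4 to the states reported by the operational processes."""
    states = set(reported_states)
    if not states or not states <= VALID_STATES:
        raise ValueError("need at least one valid state report")
    if "aborted" in states:      # TR1
        return "ABORT"
    if "committed" in states:    # TR2
        return "COMMIT"
    if "uncertain" in states:    # TR3
        return "ABORT"
    # TR4: only committable processes reported, none committed.
    return "RESEND-PRE-COMMIT"
```

For instance, `termination_decision(["committable", "uncertain"])` yields `"ABORT"`, while `termination_decision(["committable", "committable"])` resumes the protocol with PRE-COMMIT.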

SLIDE 10

Coexistence of States

              Committed   Committable   Uncertain   Aborted
Committed     TR2
Committable   TR2         TR4
Uncertain     n/a         TR3           TR3
Aborted       n/a         n/a           TR3         TR1

(The table is symmetric, so only the lower triangle is shown; n/a marks combinations that cannot coexist.)

→ For each feasible combination there is exactly one termination rule.
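The table can also be encoded as a lookup on unordered pairs of states, which makes the symmetry explicit. This Python sketch only restates the table; the dictionary `RULE` and the helper `rule_for` are hypothetical names, and `None` is used here to mark the combinations that cannot coexist.

```python
from itertools import combinations_with_replacement

STATES = ["committed", "committable", "uncertain", "aborted"]

# Unordered pair of states -> applicable termination rule.
RULE = {
    frozenset(["committed"]): "TR2",
    frozenset(["committed", "committable"]): "TR2",
    frozenset(["committable"]): "TR4",
    frozenset(["committable", "uncertain"]): "TR3",
    frozenset(["uncertain"]): "TR3",
    frozenset(["uncertain", "aborted"]): "TR3",
    frozenset(["aborted"]): "TR1",
}


def rule_for(a, b):
    """Return the rule for two coexisting states, or None if infeasible."""
    return RULE.get(frozenset([a, b]))


# Symmetry leaves 10 unordered pairs out of the 16 table cells.
pairs = list(combinations_with_replacement(STATES, 2))
infeasible = [p for p in pairs if rule_for(*p) is None]
```

With four states there are ten unordered pairs, of which three are infeasible: committed with uncertain, committed with aborted, and committable with aborted.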

SLIDE 11

Failures in 3PC

  • Fact: logging PRE-COMMIT and ACKs does not help in recovery.
  • → Logging is identical to 2PC.
  • Recovery from total site failures
      • Wait for the last process that failed (unless independent recovery is possible) → the termination protocol must include the last failing process.
  • Communication failures
      • Partitioning can occur.
      • Partitions may decide differently → inconsistency.
  • The protocol does NOT tolerate communication failures.
  • Solution: use quorums, i.e., decide only when a majority of processes are participating. → This introduces blocking again if no quorum can be obtained.

SLIDE 12

Assignment 7.14

              Committed   Committable   Uncertain   Aborted
Committed     (10)
Committable   (9)         (8)
Uncertain     (7)         (6)           (5)
Aborted       (4)         (3)           (2)         (1)

Prove the correctness of the coexistence table (symmetry → only 10 cases).

SLIDE 13

Coexistence Table: simple cases

  • (1) Aborted-Aborted: no failures, a NO vote → abort.
  • (2) Aborted-Uncertain: p1 votes NO and unilaterally aborts, p2 votes YES and is uncertain.
  • (5) Uncertain-Uncertain: p1 and p2 vote YES but do not yet know the decision made by the coordinator.
  • (6) Uncertain-Committable: after situation (5) the coordinator sends PRE-COMMIT. p1 receives it before p2 → p1 is committable while p2 is still uncertain.
  • (7) Uncertain-Committed: prevented by the NB rule. When a process is committed there are no operational uncertain processes.
  • (8) Committable-Committable: situation (6) after p2 has also received PRE-COMMIT.
  • (9) Committable-Committed: p2 has received COMMIT, p1 not yet.
  • (10) Committed-Committed: situation (9) after p1 has also received COMMIT.

SLIDE 14

Coexistence Table: remaining cases

(3) Aborted-Committable (no communication failures)

  • A process is committable only if everybody voted yes.
  • Hence, all processes are either uncertain or committable.
  • An abort is then only possible in the termination protocol.
  • Consider the first round of the termination protocol that would decide abort.
  • Abort is decided only if some uncertain processes are operational → coexistence with a committable process is impossible (no communication failures).

(4) Aborted-Committed

  • Commit is only reached if the process was committable before. However, (3) says that Aborted-Committable is impossible.

SLIDE 15

Assignment 7.17

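  • Describe a scenario with site failures only where a committable process would still lead to an abort.

[Figure: message sequence chart for P0, P1 and P2. P0 sends VOTE-REQ to P1 and P2, both reply YES and become uncertain. PRE-COMMIT reaches only P1, which becomes committable, while P2 stays uncertain. P2 runs the termination protocol (STATE-REQ) and concludes: "I am the only one alive and uncertain, so I abort".]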

SLIDE 16

Assignment 7.17

  • 1. P0 sends VOTE-REQ to P1 and P2.
  • 2. P1 and P2 both reply with YES.
  • 3. P0 sends PRE-COMMIT to P1 but fails before sending it to P2. Thus, P1 is committable whereas P2 is still uncertain.
  • 4. P1 fails.
  • 5. P2 times out waiting for the PRE-COMMIT and starts the termination protocol.
  • 6. P2 sends out STATE-REQ.
  • 7. P2 times out waiting for replies and, since it is the only one alive and it is uncertain, decides abort.
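The scenario can be replayed in a few lines of Python. This is a toy trace, not a protocol implementation: the dictionaries `states` and `alive` and the helper `decide` (a compact restatement of the termination rules TR1 to TR4) are hypothetical, and failures are modelled simply by removing a process from the set of live processes.

```python
def decide(reported):
    """Termination rules TR1-TR4 applied to the reported states."""
    if "aborted" in reported:
        return "ABORT"              # TR1
    if "committed" in reported:
        return "COMMIT"             # TR2
    if "uncertain" in reported:
        return "ABORT"              # TR3
    return "RESEND-PRE-COMMIT"      # TR4


# Steps 1-2: P0 sent VOTE-REQ, P1 and P2 voted YES and are uncertain.
states = {"P1": "uncertain", "P2": "uncertain"}
alive = {"P0", "P1", "P2"}

# Step 3: PRE-COMMIT reaches P1 only, then the coordinator P0 fails.
states["P1"] = "committable"
alive.discard("P0")

# Step 4: P1 fails while committable.
alive.discard("P1")

# Steps 5-7: P2 times out, runs the termination protocol and only
# receives state reports from live participants, i.e. from itself.
reported = [states[p] for p in sorted(alive)]
decision = decide(reported)
```

Only P2's own state `uncertain` is reported, so the decision is `"ABORT"` even though the failed P1 was committable.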

SLIDE 17

Assignment 3 (a)

  • Read One-Write All (ROWA) systems
  • Advantage: cheap reads (one local read)
  • Disadvantage: expensive writes (N writes)
  • ROWA is suitable for read-dominated loads
  • Apparent trade-off: read costs ↔ write costs
  • Synchronous Update Everywhere (ROWA): cheap reads, expensive writes
  • Asynchronous Update Primary Copy: cheap writes, expensive reads (local read may be out-of-date)
  • Is there something in between, i.e., not write-all and read "a few"?
SLIDE 18

Quorum Systems

  • Improve performance with availability in replication.
  • Balance costs between read and write operations.
  • Reduce number of copies involved in updates
  • Example from politics: "For the United Federal Assembly to deliberate and pass resolutions, more than half (>50%) of the council members must be present." → Then "absolute majority".

Types

  • Voting Quorums
  • Majority Quorum (Quorum Consensus, "weighted voting")
  • Hierarchical Quorum Consensus
  • Grid Quorums
  • Tree Quorums
SLIDE 19

Quorums

Formal Definition:

  • A quorum system $S = \{S_1, S_2, \ldots, S_N\}$ is a collection of quorum sets $S_i \subseteq U$ of a finite universe $U$.
  • $\forall\, i, j \in \{1, \ldots, N\} : S_i \cap S_j \neq \emptyset$.
  • For replication we consider two kinds of quorum sets: read quorums RQ and write quorums WQ.
  • Rules
      • Any read quorum must overlap with any write quorum.
      • Any two write quorums must overlap.
SLIDE 20

Majority Quorum

  • Use votes to define a quorum.
  • Each site has a non-negative voting weight.
  • Majority = the number of votes exceeds half of the total votes.
  • For Assignment 3
      • For simplicity, we assume each site has vote weight 1.
      • N is the number of sites.
  • Let |S| denote the voting weight of a quorum set S.
  • Rules for read quorum (RQ) and write quorum (WQ)
      • |RQ| + |WQ| > N → read and write quorums overlap
      • 2 |WQ| > N → two write quorums overlap

SLIDE 21

Quorum Sizes

  • Rules for read quorum (RQ) and write quorum (WQ)
      • |RQ| + |WQ| > N → read and write quorums overlap
      • 2 |WQ| > N → two write quorums overlap
  • The quorum sizes |RQ| and |WQ| determine the cost of read and write operations. → Minimize!
  • Minimum quorum sizes for the inequalities are:
      • The write quorum requires a majority.
      • The read quorum requires at least half of the system sites.

$$\min |WQ| = \left\lfloor \frac{N}{2} \right\rfloor + 1, \qquad \min |RQ| = \left\lceil \frac{N}{2} \right\rceil$$
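A short Python check can confirm that these sizes satisfy both inequalities and cannot be reduced further. It assumes unit vote weights as in Assignment 3; the function names `min_quorum_sizes` and `is_valid` are hypothetical.

```python
def min_quorum_sizes(n):
    """Smallest (|RQ|, |WQ|) for n sites with vote weight 1 each."""
    wq = n // 2 + 1        # floor(n/2) + 1: a strict majority
    rq = (n + 1) // 2      # ceil(n/2): at least half of the sites
    return rq, wq


def is_valid(n, rq, wq):
    """Check both overlap rules: |RQ| + |WQ| > N and 2|WQ| > N."""
    return rq + wq > n and 2 * wq > n


for n in range(1, 50):
    rq, wq = min_quorum_sizes(n)
    assert is_valid(n, rq, wq)
    # Shrinking either quorum by one site breaks a rule.
    assert not is_valid(n, rq - 1, wq)
    assert not is_valid(n, rq, wq - 1)
```

For N = 4 this gives |RQ| = 2 and |WQ| = 3; for N = 5 both quorums need 3 sites.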

SLIDE 22

Example

  • Consider 4 sites
  • min |WQ| = 3 sites (majority)
  • min |RQ| = 2 sites (half)

[Figure: three panels over the sites P1, P2, P3, P4: (a) write quorums overlap; (b) read quorums do not overlap; (c) read and write quorums overlap.]
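The three observations for four sites can be verified by brute force. The sketch below enumerates every write quorum of size 3 and every read quorum of size 2 using `itertools.combinations`; the variable names are illustrative.

```python
from itertools import combinations

sites = ["P1", "P2", "P3", "P4"]
write_quorums = [set(c) for c in combinations(sites, 3)]
read_quorums = [set(c) for c in combinations(sites, 2)]

# An empty intersection is falsy, so all(...) tests "always overlap".
writes_overlap = all(a & b for a in write_quorums for b in write_quorums)
read_write_overlap = all(r & w for r in read_quorums for w in write_quorums)
reads_overlap = all(a & b for a in read_quorums for b in read_quorums)
```

Here `writes_overlap` and `read_write_overlap` are `True`, while `reads_overlap` is `False`, because for instance {P1, P2} and {P3, P4} are disjoint read quorums.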

SLIDE 23

Comparison with ROWA

  • For ROWA we can think of:
      • |RQ| = 1 and |WQ| = N.
      • Any read overlaps with any write.
      • Any two writes overlap.
      • Reads do not overlap.
  • For quorums:

$$|WQ| = \left\lfloor \frac{N}{2} \right\rfloor + 1, \qquad |RQ| = \left\lceil \frac{N}{2} \right\rceil$$

SLIDE 24

Assignment 3 (b)

  • Load consists of R reads and W writes
  • Normalized: R + W = 1
  • Cost ROWA = R + N · W
  • Cost Quorum = R · |RQ| + W · |WQ|
  • For minimum-sized quorums

$$\text{Cost} = R \cdot \left\lceil \frac{N}{2} \right\rceil + W \cdot \left( \left\lfloor \frac{N}{2} \right\rfloor + 1 \right)$$

SLIDE 25

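ROWA – Quorum System

[Figure: cost versus write load, from W=0 (R=1) to W=1 (R=0), with W=1/2 (R=1/2) marked. The ROWA cost rises from 1 to N, the quorum system cost from N/2 to N/2 + 1. ROWA is better on one side of the crossing, the quorum system on the other.]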
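The two cost functions are easy to evaluate numerically. The following Python sketch uses the normalization R + W = 1 and minimum-sized quorums; the function names are hypothetical. Under these assumptions the two costs coincide at W = 1/2 for N > 2, which can be checked by substituting R = 1 - W into both formulas.

```python
def cost_rowa(n, w):
    """ROWA: one local read, n writes; w is the write fraction."""
    r = 1 - w
    return r + n * w


def cost_quorum(n, w):
    """Minimum-sized quorums: |RQ| = ceil(n/2), |WQ| = floor(n/2) + 1."""
    r = 1 - w
    return r * ((n + 1) // 2) + w * (n // 2 + 1)


def cheaper(n, w):
    """Name the cheaper scheme for n sites and write fraction w."""
    rowa, quorum = cost_rowa(n, w), cost_quorum(n, w)
    if abs(rowa - quorum) < 1e-12:
        return "equal"
    return "ROWA" if rowa < quorum else "quorum"
```

For N = 4, `cheaper(4, 0.1)` returns `"ROWA"`, `cheaper(4, 0.9)` returns `"quorum"`, and `cheaper(4, 0.5)` returns `"equal"`.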

SLIDE 26

Assignment 3 (c)

  • Why does asynchronous replication have lower cost than synchronous replication?
  • Cost for synchronous ROWA is Cost ROWA = R + N · W
  • In terms of read/write operations, asynchronous (primary copy) has cost 1
      • → one direct write (master)
      • → one local read (possibly outdated copy)
      • → load independent

SLIDE 27

Updates

  • However, this is not the full cost.
  • The cost for propagating update sets (and reconciliation) also needs to be considered.
  • Assume updates are load-independent with update frequency (rate) r.
  • Cost = 1 + r · (N-1)
  • Thus, asynchronous update primary copy is cheaper for

$$1 + r\,(N-1) < R + N\,W \quad\Longrightarrow\quad r < \frac{R + N\,W - 1}{N - 1}$$
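The break-even rate can be computed directly. In this Python sketch the function names and the parameter `rate` (the update propagation rate r) are hypothetical labels. With the normalization R + W = 1 used earlier, the bound simplifies to r < W, since R + N·W - 1 = W·(N-1).

```python
def cost_sync_rowa(n, w):
    """Synchronous ROWA cost R + N*W with R = 1 - W."""
    return (1 - w) + n * w


def cost_async(n, rate):
    """Asynchronous primary copy: 1 operation plus update propagation."""
    return 1 + rate * (n - 1)


def break_even_rate(n, w):
    """Largest rate r for which asynchronous replication is not costlier."""
    return (cost_sync_rowa(n, w) - 1) / (n - 1)
```

For N = 5 and W = 0.3, a propagation rate of 0.2 gives an asynchronous cost of 1.8 against 2.2 for synchronous ROWA, whereas a rate of 0.4 gives 2.6 and is therefore more expensive.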

SLIDE 28

References

  • R. Jiménez-Peris, M. Patiño-Martínez, G. Alonso, B. Kemme: Are Quorums an Alternative for Data Replication? ACM Transactions on Database Systems, 2003. http://doi.acm.org/10.1145/937598.937601