two-phase commit / network FSes - PowerPoint PPT Presentation

slide-1
SLIDE 1

two-phase commit / network FSes

1

slide-2
SLIDE 2

last time

remote procedure calls

imitate function/method call interface
extra setup: where is the server?
interface description language to specify interface
extra concerns: portability (language + machine), forward/backward compatibility

network failures: losing messages + reordering
Byzantine failures
fail-stop model

distributed transactions

something across multiple machines happens all at once/not at all

2

slide-3
SLIDE 3

naive distributed transaction? (1)

machine A and B: student records; machine C: course records

any machine can be queried directly for info (e.g. by SIS web interface)

proposed add student to course procedure:
execute code on A or B where student is stored
tell C: add student to course
wait for response from C (if course full, return error)
locally: add student to course
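A minimal sketch of the proposed procedure, to make the failure window concrete. The two dictionaries stand in for the student machine and the course machine, and the `crash_before_local` flag is a hypothetical way to simulate power loss between the remote step and the local step; none of these names come from the slides.

```python
class Crash(Exception):
    """Simulated power loss on the student machine (hypothetical)."""


def naive_add(student_db, course_db, student, course, capacity,
              crash_before_local=False):
    # step 1: tell the course machine (remote) to add the student
    roster = course_db.setdefault(course, set())
    if len(roster) >= capacity:
        return "error: course full"
    roster.add(student)
    # a failure at this point leaves the two machines disagreeing
    if crash_before_local:
        raise Crash()
    # step 2: record the enrollment locally on the student machine
    student_db.setdefault(student, set()).add(course)
    return "ok"


student_db, course_db = {}, {}
try:
    naive_add(student_db, course_db, "alice", "cs101", capacity=2,
              crash_before_local=True)
except Crash:
    pass

# what a third machine would observe after the crash:
# the course record lists the student, the student record does not
inconsistent = ("alice" in course_db["cs101"]
                and "cs101" not in student_db.get("alice", set()))
```

Nothing in this procedure remembers that step 2 is still owed after a restart, which is the gap the logging in two-phase commit is meant to close.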

3

slide-4
SLIDE 4

exercise (1)

separate student (local) + course (remote) records
tell remote: add student to course
then locally: add student to course

if no failures, which are possible to observe from third machine (that asks student/course machines for current records)?

A. student record: in course; course record: not in course; but if double-checking: both agree
B. same as A, but if double-checking: both do not agree
C. student record: not in course; course record: in course; but if double-checking: both agree
D. same as C, but if double-checking: both do not agree

4

slide-5
SLIDE 5

exercise (2)

separate student (local) + course (remote) records
tell remote: add student to course
then locally: add student to course

if machine power loss + restart, which are possible to observe from third machine (that asks student/course machines for current records)?

A. student record: in course; course record: not in course; but if double-checking: both agree
B. same as A, but if double-checking: both do not agree
C. student record: not in course; course record: in course; but if double-checking: both agree
D. same as C, but if double-checking: both do not agree

5

slide-6
SLIDE 6

decentralized solution properties

each machine handles only its own data

no sending everything through one machine (easy solution)

machines involved in transaction if and only if have relevant data

change only to courses? don’t tell student machines
change to course + student A? don’t tell machine with student B

make progress as long as relevant machines don’t fail

losing one of K student machines? still runs for 1 of K students

hope: scales to tens/hundreds of machines

typical transaction: 1 to 3 machines?

6

slide-8
SLIDE 8

two-phase commit

will look at solution that satisfies these properties
known as two-phase commit
name from two steps: figure out what to do, then do it
hint: similar idea to redo logging

record intended actions, then do them

7

slide-9
SLIDE 9

persisting past failures

will still use persistent log on each machine
idea: machine remembers what it was doing on failure
doesn’t store data of other machines
…just some identifier/contact info for the transaction

8

slide-10
SLIDE 10

two-phase commit: roles

  • one machine = coordinator
  • other machines are workers

common implementation: one physical machine runs coordinator+one worker

key rule: abort (don’t change anything) if anyone decides to abort
coordinator collects workers’ votes: will they abort?
coordinator makes final decision using votes

9

slide-13
SLIDE 13

aside: why abort? (1)

why might worker want to abort?
simplest example: operation not possible
course full
course doesn’t exist on worker
worker out of disk space
…

10

slide-14
SLIDE 14

aside: why abort? (2)

why might worker want to abort?
subtle issue: conflict with other transaction; example:
transaction 1: worker agreed to add student X to course A

…but still waiting to confirm that this will happen

transaction 2: worker asked to add student Y to course A

if course would be full after transaction 1, worker can’t say ‘yes’

  • option one: worker aborts, says “not now”
  • option two: worker delays response for transaction 2 until ready

11

slide-17
SLIDE 17

aside: consistency and reads

don’t want to allow reads of values that are “in flux”
typical solution: reads need transaction, too

even though they don’t change anything

assignment: workers have “unavailable” flag

12

slide-18
SLIDE 18

two-phase commit: no take-backs

  • once worker agrees not to abort, it cannot change its mind
  • once coordinator makes decision, it cannot change its mind

both cases: need to remember decision after power loss, crash, etc.
solution: write decision down in log before acting on it
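The "write it down before acting" rule can be sketched as an append-only log that is forced to disk before any message is sent. This is an illustration only: the JSON-lines record format and the function names `log_decision` and `last_state` are assumptions, not the format used in the course assignment.

```python
import json
import os
import tempfile


def log_decision(path, txn_id, state):
    """Append one record and force it to disk before returning."""
    with open(path, "a") as f:
        f.write(json.dumps({"txn": txn_id, "state": state}) + "\n")
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to stable storage


def last_state(path, txn_id):
    """After a restart: most recent logged state for txn_id, or None."""
    state = None
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                record = json.loads(line)
                if record["txn"] == txn_id:
                    state = record["state"]
    return state


log_path = os.path.join(tempfile.mkdtemp(), "worker.log")
log_decision(log_path, 7, "AGREED-TO-COMMIT")  # write first...
# ...and only then send the vote; a crash after this line is recoverable
recovered = last_state(log_path, 7)
```

The ordering is the point: if the vote were sent before the log write, a crash in between would leave a promise that the machine no longer remembers making.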

13

slide-20
SLIDE 20

two-phase commit: voting

worker    worker    worker    …   coordinator chooses:
commit    commit    commit    …   → commit
commit    abort     commit    …   → abort
commit    commit    unknown?  …   → abort, or wait for missing vote

if nothing wrong, make progress
no inconsistency if aborting instead
must abort if any node can’t do it
safe to abort if in doubt
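The voting rule above can be written as a small function. This is a sketch under assumed conventions: votes are the strings `"commit"` and `"abort"`, `None` marks a vote not yet received, and the `"wait-or-abort"` result is a made-up label for the case where the coordinator may choose either.

```python
def decide(votes):
    """votes: list of 'commit', 'abort', or None (no vote received yet)."""
    if any(v == "abort" for v in votes):
        return "abort"            # must abort if any worker can't do it
    if votes and all(v == "commit" for v in votes):
        return "commit"           # only commit when every vote is commit
    return "wait-or-abort"        # missing vote: wait, or give up and abort
```

Aborting is always a safe answer before a commit decision has been made, which is why the unknown case never maps to commit.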

14

slide-25
SLIDE 25

two-phase commit: phases

phase 1: preparing
workers tell coordinator their votes: agree to commit/abort

phase 2: finishing
coordinator gathers votes, decides and tells everyone the outcome

15

slide-26
SLIDE 26

preparing

agree to commit

promise: “I will accept this transaction”
promise recorded in the machine log in case it crashes

agree to abort

promise: “I will not accept this transaction”
promise recorded in the machine log in case it crashes

never ever take back agreement!

to keep promise: can’t allow interfering operations
e.g. agree to add student to class → reserve seat in class
(even though student might not be added b/c of other machines)
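A sketch of how a course worker might keep its promise by reserving a seat when it agrees to commit. The `CourseWorker` class, its fields, and the in-memory list standing in for the persistent log are all assumptions made for illustration.

```python
class CourseWorker:
    def __init__(self, capacity):
        self.capacity = capacity
        self.enrolled = set()
        self.reserved = {}  # txn_id -> student holding a promised seat
        self.votes = {}     # txn_id -> vote already given (never changes)
        self.log = []       # stand-in for a persistent log

    def prepare(self, txn_id, student):
        if txn_id in self.votes:        # duplicate PREPARE: repeat same vote
            return self.votes[txn_id]
        if len(self.enrolled) + len(self.reserved) >= self.capacity:
            vote = "AGREE-TO-ABORT"     # seat may be taken by a pending txn
        else:
            self.reserved[txn_id] = student   # hold the seat
            vote = "AGREE-TO-COMMIT"
        self.log.append((txn_id, vote))  # record before replying
        self.votes[txn_id] = vote
        return vote

    def finish(self, txn_id, outcome):
        student = self.reserved.pop(txn_id, None)  # release the reservation
        if outcome == "COMMIT" and student is not None:
            self.enrolled.add(student)
        self.log.append((txn_id, outcome))
```

This sketch takes the "abort, says not now" option for a conflicting transaction. Delaying the reply until the earlier transaction finishes is the other option and would need a queue of waiting requests.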

16

slide-28
SLIDE 28

coordinator decision

coordinator can’t take back global decision
must record in persistent log to ensure not forgotten
coordinator fails without logged decision? collect votes again

17

slide-30
SLIDE 30

finishing

coordinator says commit → commit transaction

worker applies transaction (e.g. record student is in class)

coordinator (or anyone) says abort → abort transaction

worker never ever applies transaction
still want to do operation? make a new transaction

unsure which? option 1: ask coordinator

e.g. worker policy: keep asking if no outcome

unsure which? option 2: make sure coordinator resends outcome

e.g. coordinator keeps sending outcome until it gets “yes, I got it” reply
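Option 2 above (the coordinator keeps sending the outcome until it is acknowledged) might look like the loop below. The `send` callable is hypothetical: it is assumed to return `True` when the worker acknowledges and to raise `ConnectionError` or return `False` otherwise. A real implementation would also wait between attempts.

```python
def send_until_acked(send, outcome, max_attempts=5):
    """Resend outcome until send() reports an ack; return attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            if send(outcome):
                return attempt
        except ConnectionError:
            pass  # treat like a lost message and try again
    # the decision itself cannot be dropped; someone must retry later
    raise TimeoutError("no acknowledgment for " + outcome)


calls = []


def flaky_send(outcome):
    """Test double: fails twice, then acknowledges."""
    calls.append(outcome)
    if len(calls) < 3:
        raise ConnectionError("worker unreachable")
    return True


attempts = send_until_acked(flaky_send, "COMMIT")
```

Because the same outcome may arrive more than once, the worker side has to treat a repeated COMMIT or ABORT as harmless.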

18

slide-32
SLIDE 32

two-phase commit: roles

typical two-phase commit implementation
several workers

  • one coordinator

might be same machine as a worker

19

slide-33
SLIDE 33

two-phase-commit messages

coordinator → worker: PREPARE

“will you agree to do this action?”

  • on failure: can ask multiple times!

worker → coordinator: AGREE-TO-COMMIT or AGREE-TO-ABORT

worker records decision in log (before sending)

coordinator → worker: COMMIT or ABORT

“I counted the votes and the result is commit/abort”

  • only commit if all votes were commit

20

slide-34
SLIDE 34

TPC: normal operation

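[diagram: message timeline between coordinator, worker 1, worker 2]

coordinator → workers: PREPARE
workers → coordinator: AGREE-TO-COMMIT
coordinator → workers: COMMIT

log entries shown: state=WAIT, state=AGREED-TO-COMMIT, state=COMMIT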
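The failure-free message flow can be sketched in one process, with method calls standing in for messages. The `Worker` class, the `run_transaction` function, and the list-based logs are illustrative assumptions, and the sketch has no timeouts or crash handling.

```python
class Worker:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.log = []      # stand-in for a persistent log
        self.value = None

    def prepare(self, txn_id, new_value):
        vote = "AGREE-TO-COMMIT" if self.can_commit else "AGREE-TO-ABORT"
        self.log.append((txn_id, vote))      # log before replying
        return vote

    def finish(self, txn_id, outcome, new_value):
        self.log.append((txn_id, outcome))   # log before applying
        if outcome == "COMMIT":
            self.value = new_value


def run_transaction(coordinator_log, workers, txn_id, new_value):
    coordinator_log.append((txn_id, "WAIT"))
    # phase 1: send PREPARE, collect votes
    votes = [w.prepare(txn_id, new_value) for w in workers]
    all_agree = all(v == "AGREE-TO-COMMIT" for v in votes)
    outcome = "COMMIT" if all_agree else "ABORT"
    coordinator_log.append((txn_id, outcome))  # decision recorded first
    # phase 2: tell every worker the outcome
    for w in workers:
        w.finish(txn_id, outcome, new_value)
    return outcome
```

With one worker constructed as `Worker(can_commit=False)`, the same function returns `"ABORT"` and no worker changes its value, which matches the key rule that a single abort vote aborts the whole transaction.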

21

slide-36
SLIDE 36

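TPC: normal operation, conflict

[diagram: message timeline between coordinator, worker 1, worker 2]

coordinator → workers: PREPARE
one worker → coordinator: AGREE-TO-ABORT (“class is full!”)
other worker → coordinator: AGREE-TO-COMMIT
coordinator → workers: ABORT

log entries shown: state=ABORT, state=WAIT, state=AGREED-TO-COMMIT, state=ABORT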

22

slide-38
SLIDE 38

exercise (1)

under what circumstances may a worker send vote to abort?

[A] in response to a duplicate PREPARE message after replying to the first with a vote to commit
[B] after rebooting after a crash, if its log indicates it previously decided to vote to abort, but did not receive any decisions from the coordinator
[C] after rebooting after a crash, if its log indicates it previously decided to vote to commit, but did not receive any decisions from the coordinator
[D] after sending a vote to commit, but detecting that the coordinator crashed and has been down for a very long time

23

slide-39
SLIDE 39

exercise (2)

under what circumstances may a coordinator send a decision to abort?

[A] when rebooting after a crash, after having last sent a request to vote to all but one worker and receiving votes to commit from all workers contacted
[B] when rebooting after a crash, when the log indicates that the last thing the coordinator did was deciding to commit but the log doesn’t indicate that any workers were contacted
[C] after successfully sending a request for a vote to a worker, but not receiving the reply due to a network problem

24

slide-40
SLIDE 40

two-phase commit: blocking

agree to commit “add student to class”?
can’t allow conflicting actions…

adding student to conflicting class?
removing student from the class?
not leaving seat in class?

…until know transaction globally committed/aborted

25

slide-42
SLIDE 42

waiting forever?

if machine goes away at wrong time, might never decide what happens
solution in practice: manual intervention

26

slide-43
SLIDE 43

reasoning about protocols: state machines

very hard to reason about dist. protocol correctness
typical tool: state machine
each machine is in some state
know what every message does in this state
avoids common problem: don’t know what message does

27

slide-45
SLIDE 45

coordinator state machine (simplified?)

states: INIT, WAITING, ABORTED, COMMITTED

INIT → WAITING: send PREPARE to all
WAITING → ABORTED: receive any AGREE-TO-ABORT, or no reply from worker; send ABORT
WAITING → COMMITTED: receive AGREE-TO-COMMIT from all; send COMMIT
WAITING (self-loop): accumulate votes; resend PREPARE after timeout/failure
ABORTED (self-loop): resend ABORT if needed
COMMITTED (self-loop): resend COMMIT if needed

28

slide-49
SLIDE 49

coordinator failure recovery

duplicate messages okay: unique transaction ID!
coordinator crashes? log indicating last state
worst case: log written, but message not sent → resend last message

  • or, if allowed, maybe send ABORT

worker doesn’t get COMMIT/ABORT?

in assignment: worker sends acknowledgment; arrange retry if no ack

  • other option: worker asks again after timeout

workers need to handle duplicate messages!
coordinators need to handle duplicate replies!
haven’t sent commit? can abort instead (simpler?)
in assignment, errors detected only at coordinator
using gRPC, so have return value from “COMMIT” RPC
normal strategy: wait for timeout, then resend
assignment: you throw exception; we’ll restart (easier testing)

29

slide-55
SLIDE 55

coordinator state machine (less simplified?)

states: INIT, WAITING, ABORTED, COMMITTED

INIT → WAITING: send PREPARE to all
WAITING → ABORTED: receive any AGREE-TO-ABORT; send ABORT
WAITING → COMMITTED: receive AGREE-TO-COMMIT from all; send COMMIT
WAITING (self-loop): failure/timeout: resend PREPARE (or send ABORT); vote: store + tally
ABORTED (self-loop): vote/failure/timeout: resend ABORT
COMMITTED (self-loop): vote/failure/timeout: resend COMMIT
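The coordinator state machine can be sketched as a single transition function. The event names `"start"` and `"timeout"`, the `worker` argument, and the use of a set to tally votes are assumptions added for illustration. On a timeout in WAITING this sketch always resends PREPARE, although giving up and sending ABORT is equally allowed.

```python
def coordinator_step(state, votes, n_workers, event, worker=None):
    """Return (new_state, message_to_send_or_None); votes is mutated."""
    if state == "INIT" and event == "start":
        return "WAITING", "PREPARE"
    if state == "WAITING":
        if event == "AGREE-TO-ABORT":
            return "ABORTED", "ABORT"
        if event == "AGREE-TO-COMMIT":
            votes.add(worker)              # store + tally; duplicates collapse
            if len(votes) == n_workers:
                return "COMMITTED", "COMMIT"
            return "WAITING", None
        if event == "timeout":
            return "WAITING", "PREPARE"    # alternative: ("ABORTED", "ABORT")
    if state == "ABORTED":
        return "ABORTED", "ABORT"          # any vote/timeout: resend ABORT
    if state == "COMMITTED":
        return "COMMITTED", "COMMIT"       # any vote/timeout: resend COMMIT
    raise ValueError("unspecified: " + event + " in " + state)
```

Storing votes in a set keyed by worker means a duplicate reply from the same worker cannot be counted twice.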

30

slide-57
SLIDE 57

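worker state machine (simplified)

states: INIT, AGREED-TO-COMMIT, COMMITTED, ABORTED

INIT → AGREED-TO-COMMIT: recv PREPARE; send AGREE-TO-COMMIT
INIT → ABORTED: recv PREPARE; send AGREE-TO-ABORT
AGREED-TO-COMMIT → ABORTED: recv ABORT
AGREED-TO-COMMIT → COMMITTED: recv COMMIT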

31

slide-58
SLIDE 58

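worker state machine (less simplified?)

states: INIT, AGREED-TO-COMMIT, COMMITTED, ABORTED

INIT → AGREED-TO-COMMIT: recv PREPARE; send AGREE-TO-COMMIT
INIT → ABORTED: recv PREPARE; send AGREE-TO-ABORT
AGREED-TO-COMMIT → ABORTED: recv ABORT
AGREED-TO-COMMIT → COMMITTED: recv COMMIT
ABORTED (self-loop): recv PREPARE; (re)send AGREE-TO-ABORT
AGREED-TO-COMMIT (self-loop): recv PREPARE; resend AGREE-TO-COMMIT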

32

slide-60
SLIDE 60

worker failure recovery

worker crashes? log indicating last state
log written before acting on that state
if INIT: wait for PREPARE (resent)?
if AGREED-TO-COMMIT or ABORTED: resend AGREE-TO-COMMIT/AGREE-TO-ABORT
if COMMITTED: redo operation (just like redo logging)
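The recovery rules above amount to a lookup on the last logged state. In this sketch the log is a list of `(txn_id, state)` pairs in write order, and the returned strings are placeholder descriptions of the action rather than real messages; both are assumptions for illustration.

```python
def recover(log, txn_id):
    """Return what a rebooted worker should do for txn_id."""
    state = "INIT"                    # nothing logged: as if never contacted
    for logged_txn, logged_state in log:
        if logged_txn == txn_id:
            state = logged_state      # later entries override earlier ones
    if state == "INIT":
        return "wait for PREPARE"
    if state == "AGREED-TO-COMMIT":
        return "resend AGREE-TO-COMMIT"
    if state == "ABORTED":
        return "resend AGREE-TO-ABORT"
    if state == "COMMITTED":
        return "redo operation"
    raise ValueError("unknown state in log: " + state)
```

A worker whose last entry is AGREED-TO-COMMIT still may not abort on its own after the reboot, because its promise was already recorded.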

33

slide-61
SLIDE 61

state machine missing details

really want to specify result of/action for every message!

worker recv ABORT in ABORTED: do nothing
worker recv ABORT in INIT: go to ABORTED
worker recv PREPARE in COMMITTED: ignore?
…

everything specified: machine checkable?
want to discard finished transactions eventually
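One way to make "every message in every state" checkable is to write the worker state machine as a table and list the pairs that are missing. The table below is an assumed, partial encoding: it shows only the agree-to-commit choice for PREPARE in INIT, and it picks "ignore" for PREPARE in COMMITTED, which is posed above as an open question.

```python
STATES = ["INIT", "AGREED-TO-COMMIT", "COMMITTED", "ABORTED"]
MESSAGES = ["PREPARE", "COMMIT", "ABORT"]

# (state, message) -> (next state, reply or None)
WORKER_TABLE = {
    ("INIT", "PREPARE"): ("AGREED-TO-COMMIT", "AGREE-TO-COMMIT"),
    ("INIT", "ABORT"): ("ABORTED", None),
    ("AGREED-TO-COMMIT", "PREPARE"): ("AGREED-TO-COMMIT", "AGREE-TO-COMMIT"),
    ("AGREED-TO-COMMIT", "COMMIT"): ("COMMITTED", None),
    ("AGREED-TO-COMMIT", "ABORT"): ("ABORTED", None),
    ("COMMITTED", "PREPARE"): ("COMMITTED", None),   # ignore (one choice)
    ("COMMITTED", "COMMIT"): ("COMMITTED", None),    # duplicate: do nothing
    ("ABORTED", "PREPARE"): ("ABORTED", "AGREE-TO-ABORT"),
    ("ABORTED", "ABORT"): ("ABORTED", None),         # duplicate: do nothing
}


def unspecified(table):
    """List every (state, message) pair the table says nothing about."""
    return [(s, m) for s in STATES for m in MESSAGES if (s, m) not in table]


missing = unspecified(WORKER_TABLE)
```

Here the three pairs left out are ones that should never occur if the coordinator follows the protocol (for example COMMIT arriving in INIT), so a checker could flag them as errors rather than leave the behavior undefined.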

34

slide-62
SLIDE 62

worker failure during prepare

worker failure after prepare without sending vote?

  • option 1: coordinator retries prepare
  • option 2: coordinator gives up, sends abort
  • option 3: worker resends vote proactively

35

slide-64
SLIDE 64

TPC: worker fails after prepare (1a)

[diagram: message timeline between coordinator, worker 1, worker 2]

messages shown: PREPARE, AGREE-TO-COMMIT, PREPARE, AGREE-TO-COMMIT, COMMIT

coordinator timeout (assignment: coord crash+reboot)

  • on reboot: didn’t record transaction

as if never received
after timeout, coordinator resends (assignment: coordinator crashes, testing code reboots)
guess: message lost or worker broke

37

slide-68
SLIDE 68

TPC: worker fails after prepare (1b)

[diagram: message timeline between coordinator, worker 1, worker 2]

messages shown: PREPARE, AGREE-TO-COMMIT, PREPARE, AGREE-TO-COMMIT, COMMIT

coordinator timeout (assignment: coord crash+reboot)

recorded in log: agree-to-commit

  • on reboot: read log

not sure whether decision received
after timeout, coordinator resends (assignment: coordinator crashes, testing code reboots)
guess: message lost or worker broke

38

slide-72
SLIDE 72

worker failure during prepare

worker failure after prepare without sending vote?

  • option 1: coordinator retries prepare
  • option 2: coordinator gives up, sends abort
  • option 3: worker resends vote proactively

39

slide-73
SLIDE 73

TPC: worker fails after prepare (2)

[diagram: message timeline between coordinator, worker 1, worker 2]

messages shown: PREPARE, AGREE-TO-COMMIT, ABORT

didn’t have time to log response?
coordinator gives up, votes to abort
doesn’t care about worker 2’s vote anymore

40

slide-76
SLIDE 76

worker failure during prepare

worker failure after prepare without sending vote?

  • option 1: coordinator retries prepare
  • option 2: coordinator gives up, sends abort
  • option 3: worker resends vote proactively

41

slide-77
SLIDE 77

TPC: worker fails after prepare (3)

[diagram: message timeline between coordinator, worker 1, worker 2]

messages shown: PREPARE, AGREE-TO-COMMIT, AGREE-TO-COMMIT, COMMIT

record agree-to-commit

  • on reboot: can proactively resend vote

42

slide-79
SLIDE 79

network failure during voting?

network failure during voting ≈ node failure
same options:

coordinator resends PREPARE
coordinator gives up
worker resends vote

43

slide-80
SLIDE 80

TPC: network failure (1)

[diagram: message timeline between coordinator, worker 1, worker 2]

messages shown: PREPARE, AGREE-TO-COMMIT, ABORT

44

slide-81
SLIDE 81

worker failure during commit

worker failure during commit?

  • option 1: coordinator resends outcome somehow?

requires acknowledgements from worker
required for assignment

  • option 2: worker resends vote (coordinator resends outcome)

NB: coordinator cannot give up

45

slide-82
SLIDE 82

aside: worker ACKs

[diagram: message timeline between coordinator, worker 1, worker 2]

messages shown: PREPARE, AGREE-TO-COMMIT, COMMIT, ack-commit

assignment: worker sends response from COMMIT
(no extra work: Commit is RPC call with return value)
if not received, coordinator knows something wrong

46

slide-84
SLIDE 84

worker failure during commit

worker failure during commit?

  • option 1: coordinator resends outcome somehow?

requires acknowledgements from worker
required for assignment

  • option 2: worker resends vote (coordinator resends outcome)

NB: coordinator cannot give up

47

slide-85
SLIDE 85

coordinator resend automatically

[diagram: message timeline between coordinator, worker 1, worker 2]

messages shown: PREPARE, AGREE-TO-COMMIT, COMMIT, COMMIT

could detect missing ACK and resend
but how many times to retry? how long to wait?
would complicate testing

48

slide-87
SLIDE 87

TPC: worker revoting

[diagram: message timeline between coordinator, worker 1, worker 2]

messages shown: PREPARE, AGREE-TO-COMMIT, AGREE-TO-COMMIT, COMMIT, COMMIT

record agree-to-commit

  • on reboot: resend vote

coordinator resends decision

49

slide-89
SLIDE 89

two-phase commit assignment

store single value across workers
single coordinator sends messages to/from workers to change values

workers’ current value can be queried directly

goal: several replicas all have same value or unavailable
…even if failures

50

slide-90
SLIDE 90

assignment: RPC

coordinator talks to worker by making RPC calls
workers only talk to coordinator by replying to RPC

example: make “prepare” call, worker’s “agree-to-X” is return value

RPC system detects worker being down, network errors, etc.

these become Python exceptions in coordinator

coordinator verifies Commit/Abort received instead of worker asking again

automatic: Commit/Abort message is RPC call; RPC call fails if problem
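A minimal sketch of this RPC pattern, assuming in-process objects standing in for real RPC stubs (no actual gRPC API is used here). The names `RpcError`, `Worker`, and `run_transaction` are hypothetical and chosen only for illustration. The vote is the return value of `prepare`, and a normal return from `commit` doubles as the acknowledgment.

```python
class RpcError(Exception):
    """Hypothetical stand-in for the exception an RPC system raises on failure."""


class Worker:
    """In-process stand-in for a remote worker reached through RPC stubs."""

    def __init__(self, reachable=True, will_agree=True):
        self.reachable = reachable
        self.will_agree = will_agree
        self.value = None      # committed value
        self.pending = None    # value proposed by the last prepare

    def _check(self):
        if not self.reachable:
            raise RpcError("worker down or network error")

    def prepare(self, value):
        # The vote is simply the return value of the "prepare" call.
        self._check()
        if not self.will_agree:
            return "AGREE-TO-ABORT"
        self.pending = value
        return "AGREE-TO-COMMIT"

    def commit(self):
        # Returning normally acts as the acknowledgment of COMMIT.
        self._check()
        self.value = self.pending
        return "ack-commit"

    def abort(self):
        self._check()
        self.pending = None
        return "ack-abort"


def run_transaction(workers, value):
    """Coordinator side: any RpcError propagates to the caller as an exception."""
    votes = [w.prepare(value) for w in workers]
    if all(v == "AGREE-TO-COMMIT" for v in votes):
        for w in workers:
            w.commit()  # raises RpcError if the COMMIT was not acknowledged
        return "COMMITTED"
    for w in workers:
        w.abort()
    return "ABORTED"
```

Because a failed call raises instead of returning, the coordinator never has to poll workers to learn whether COMMIT arrived.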

51

slide-91
SLIDE 91

assignment: failure recovery

to simplify assignment: always return error if you detect failure
assume testing code/user will restart the coordinator+workers
coordinator sends messages to workers on reboot to recover

resend prepare or commit, abort, etc.

52

slide-92
SLIDE 92

assignment: failure types

send RPC and

it gets lost
it gets sent, but acknowledgment/reply is lost
it gets sent, but delayed until after another RPC

53

slide-94
SLIDE 94

TPC: reordering

[diagram: coordinator, worker 1, worker 2; messages PREPARE id=0, PREPARE id=0 (resent), AGREE-TO-COMMIT id=0, COMMIT id=0, AGREE-TO-COMMIT id=0, PREPARE id=1]

first prepare message didn’t get to worker 2
solution: resent later (timeout or coordinator recovery)

but maybe prepare wasn’t really lost…
problem: need to know this is an old message

one solution: unique/increasing ID numbers

55

slide-97
SLIDE 97

message reordering and assignment

assignment: you need to worry about reordering

connections prevent reordering, but…
RPC system doesn’t prevent it: can use multiple connections

problem: old request seems to fail, but is actually slow
you repeat old request again
later on, slow old request reaches machine → must be ignored!
solution: sequence numbers or transaction IDs and/or timestamps

some way to tell “this is old”
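One possible way to tell "this is old", sketched under the assumption that the coordinator assigns strictly increasing transaction IDs. The class name `SequencedWorker` and the reply string `"STALE"` are hypothetical. A duplicate of the current transaction gets the same vote again, while anything older than the newest ID seen is dropped.

```python
class SequencedWorker:
    """Drops messages whose transaction ID is older than the newest one seen."""

    def __init__(self):
        self.latest_id = -1   # highest transaction ID seen so far
        self.votes = {}       # txn_id -> vote already sent for that transaction
        self.pending = None   # (txn_id, value) of the transaction in progress

    def prepare(self, txn_id, value):
        if txn_id < self.latest_id:
            # A slow copy of an old request arrived late: ignore it.
            return "STALE"
        if txn_id in self.votes:
            # Duplicate of the current request: repeat the same vote.
            return self.votes[txn_id]
        self.latest_id = txn_id
        self.pending = (txn_id, value)
        self.votes[txn_id] = "AGREE-TO-COMMIT"
        return self.votes[txn_id]
```

In the reordering scenario, a delayed `PREPARE id=0` that arrives after `PREPARE id=1` returns `"STALE"` and leaves the pending value for id=1 untouched.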

56

slide-98
SLIDE 98

extending voting

two-phase commit: unanimous vote to commit
assumption: data split across nodes, every node must cooperate

other model: every node has a copy of data

goal: work (including updates!) despite a few failing nodes
just require “enough” nodes to be working
for now, assume fail-stop

broken nodes either don’t respond or tell you they are broken

57

slide-100
SLIDE 100

backup slides

58

slide-101
SLIDE 101

coordinator state machine (simplified?)

states: INIT, WAITING, ABORTED, COMMITTED

INIT → WAITING: send PREPARE to all
WAITING → ABORTED: receive any AGREE-TO-ABORT or no reply from worker; send ABORT
WAITING → COMMITTED: receive AGREE-TO-COMMIT from all; send COMMIT
while WAITING: accumulate votes; resend PREPARE after timeout/failure
while ABORTED: resend ABORT if needed
while COMMITTED: resend COMMIT if needed
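The coordinator transitions can be sketched as a pure function, assuming votes are collected in a dict keyed by worker name. `CoordState` and `next_state` are hypothetical names, and on a timeout this sketch chooses to resend PREPARE (aborting instead is the other allowed choice).

```python
from enum import Enum


class CoordState(Enum):
    INIT = "INIT"
    WAITING = "WAITING"
    ABORTED = "ABORTED"
    COMMITTED = "COMMITTED"


def next_state(state, votes, num_workers, timed_out=False):
    """Return (new_state, message_to_send) for one coordinator step.

    votes maps worker name -> "AGREE-TO-COMMIT" or "AGREE-TO-ABORT".
    A message of None means nothing needs to be sent yet.
    """
    if state is CoordState.INIT:
        return CoordState.WAITING, "PREPARE"
    if state is CoordState.WAITING:
        if "AGREE-TO-ABORT" in votes.values():
            return CoordState.ABORTED, "ABORT"
        if len(votes) == num_workers:
            # Every worker voted and nobody voted to abort.
            return CoordState.COMMITTED, "COMMIT"
        if timed_out:
            # Assumption of this sketch: retry rather than abort.
            return CoordState.WAITING, "PREPARE"
        return CoordState.WAITING, None  # keep accumulating votes
    if state is CoordState.ABORTED:
        return state, "ABORT"   # resend if needed
    return state, "COMMIT"      # COMMITTED: resend if needed
```

Keeping the function free of I/O makes each transition easy to test in isolation.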

59

slide-105
SLIDE 105

coordinator failure recovery

duplicate messages okay: unique transaction ID!
coordinator crashes? log indicating last state
worst case: log written, but message not sent → resend last message

or, if allowed, maybe send ABORT

worker doesn’t get COMMIT/ABORT?

in assignment: worker sends acknowledgment; arrange retry if no ack

other option: worker asks again after timeout

workers need to handle duplicate messages!
coordinators need to handle duplicate replies!
haven’t sent commit? can abort instead (simpler?)
in assignment, errors detected only at coordinator
using gRPC, so have return value from “COMMIT” RPC
normal strategy: wait for timeout, then resend
assignment: you throw exception; we’ll restart (easier testing)
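A rough sketch of log-based coordinator recovery, assuming the log is a small JSON file holding the last state and transaction ID. The file layout and the names `write_log`, `recover`, and `demo` are assumptions made for illustration. The log is written durably before any message is sent, so after a restart the safe action is to resend the message matching the logged state.

```python
import json
import os
import tempfile


def write_log(path, record):
    """Durably record the coordinator's last state before sending any message."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(record, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename: the log is either old or new


def recover(path):
    """After a restart, decide which message to (re)send to every worker."""
    if not os.path.exists(path):
        return None  # nothing was in progress
    with open(path) as f:
        record = json.load(f)
    # Log written but message possibly not sent: resend the last message.
    # Duplicates are safe because each message carries the transaction ID.
    resend = {"WAITING": "PREPARE", "COMMITTED": "COMMIT", "ABORTED": "ABORT"}
    return resend.get(record["state"]), record["txn_id"]


def demo():
    """Simulate a crash between writing the log and sending COMMIT."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "coordinator.log")
        before = recover(path)  # no log yet
        write_log(path, {"state": "COMMITTED", "txn_id": 7})
        # ... imagine a crash here, before COMMIT reached the workers ...
        return before, recover(path)
```

Here `demo()` returns `(None, ("COMMIT", 7))`: nothing to do before the log exists, and a resend of COMMIT for transaction 7 afterwards.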

60

slide-111
SLIDE 111

coordinator state machine (less simplified?)

states: INIT, WAITING, ABORTED, COMMITTED

INIT → WAITING: send PREPARE to all
WAITING → ABORTED: receive any AGREE-TO-ABORT; send ABORT
WAITING → COMMITTED: receive AGREE-TO-COMMIT from all; send COMMIT
while WAITING, on failure/timeout: resend PREPARE (or send ABORT)
while WAITING, on vote: store + tally
while ABORTED, on vote/failure/timeout: resend ABORT
while COMMITTED, on vote/failure/timeout: resend COMMIT

61

slide-113
SLIDE 113

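worker state machine (simplified)

states: INIT, AGREED-TO-COMMIT, COMMITTED, ABORTED

INIT → AGREED-TO-COMMIT: recv PREPARE; send AGREE-TO-COMMIT
INIT → ABORTED: recv PREPARE; send AGREE-TO-ABORT
AGREED-TO-COMMIT → ABORTED: recv ABORT
AGREED-TO-COMMIT → COMMITTED: recv COMMIT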

62

slide-114
SLIDE 114

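worker state machine (less simplified?)

states: INIT, AGREED-TO-COMMIT, COMMITTED, ABORTED

INIT → AGREED-TO-COMMIT: recv PREPARE; send AGREE-TO-COMMIT
INIT → ABORTED: recv PREPARE; send AGREE-TO-ABORT
AGREED-TO-COMMIT → ABORTED: recv ABORT
AGREED-TO-COMMIT → COMMITTED: recv COMMIT
while ABORTED: recv PREPARE; (re)send AGREE-TO-ABORT
while AGREED-TO-COMMIT: recv PREPARE; resend AGREE-TO-COMMIT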
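The worker side can be sketched as a small class, assuming messages arrive as plain strings. `WorkerFSM` and the `can_commit` flag are hypothetical. A repeated PREPARE gets the same vote again instead of changing state.

```python
class WorkerFSM:
    """Worker state machine that tolerates duplicate PREPARE messages."""

    def __init__(self, can_commit=True):
        self.state = "INIT"
        self.can_commit = can_commit  # whether this worker is able to commit

    def on_message(self, msg):
        """Handle one coordinator message; return the reply to send, if any."""
        if msg == "PREPARE":
            if self.state == "INIT":
                self.state = "AGREED-TO-COMMIT" if self.can_commit else "ABORTED"
            if self.state == "AGREED-TO-COMMIT":
                return "AGREE-TO-COMMIT"  # first send or resend
            if self.state == "ABORTED":
                return "AGREE-TO-ABORT"   # first send or resend
            return None  # COMMITTED: ignored in this sketch
        if msg == "COMMIT" and self.state == "AGREED-TO-COMMIT":
            self.state = "COMMITTED"
        elif msg == "ABORT" and self.state in ("INIT", "AGREED-TO-COMMIT"):
            self.state = "ABORTED"
        return None
```

A worker that has agreed to commit never changes its vote on its own; only a COMMIT or ABORT from the coordinator moves it onward.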

63

slide-116
SLIDE 116

worker failure recovery

worker crashes? log indicating last state
log written before acting on that state
if INIT: wait for PREPARE (resent)?
if AGREED-TO-COMMIT or ABORTED: resend AGREE-TO-COMMIT/AGREE-TO-ABORT
if COMMITTED: redo operation (just like redo logging)
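These recovery rules can be written as a lookup on the last logged state. This is a sketch only: `recover_worker` and its return convention are hypothetical, and it assumes the log also stores the value to apply.

```python
def recover_worker(logged_state, logged_value=None):
    """Map a worker's last logged state to its actions after a restart.

    Returns (reply_to_resend, value_to_reapply); None means "nothing".
    """
    if logged_state == "INIT":
        return None, None                # wait for PREPARE to be resent
    if logged_state == "AGREED-TO-COMMIT":
        return "AGREE-TO-COMMIT", None   # resend vote; outcome still unknown
    if logged_state == "ABORTED":
        return "AGREE-TO-ABORT", None    # resend vote
    if logged_state == "COMMITTED":
        return None, logged_value        # redo the operation, as in redo logging
    raise ValueError(f"unknown logged state: {logged_state!r}")
```

Redoing a committed operation is safe only if applying the value twice gives the same result as applying it once, which holds when the operation is "set the value to X".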

64

slide-117
SLIDE 117

state machine missing details

really want to specify result of/action for every message!

worker recv ABORT in ABORTED: do nothing
worker recv ABORT in INIT: go to ABORTED
worker recv PREPARE in COMMITTED: ignore?
…

everything specified: machine checkable?
want to discard finished transactions eventually
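One way to make the specification machine checkable is to store the worker transitions in a table and list every (state, message) pair the table leaves out. This is a sketch: the table below is an assumption that covers only the agree path out of INIT, and the entry marked "assumed" does not come from the slides.

```python
from itertools import product

STATES = ["INIT", "AGREED-TO-COMMIT", "COMMITTED", "ABORTED"]
MESSAGES = ["PREPARE", "COMMIT", "ABORT"]

# (state, message) -> (next state, reply); a reply of None means "send nothing".
TRANSITIONS = {
    ("INIT", "PREPARE"): ("AGREED-TO-COMMIT", "AGREE-TO-COMMIT"),
    ("INIT", "ABORT"): ("ABORTED", None),
    ("AGREED-TO-COMMIT", "PREPARE"): ("AGREED-TO-COMMIT", "AGREE-TO-COMMIT"),
    ("AGREED-TO-COMMIT", "COMMIT"): ("COMMITTED", None),
    ("AGREED-TO-COMMIT", "ABORT"): ("ABORTED", None),
    ("COMMITTED", "PREPARE"): ("COMMITTED", None),  # "ignore?"
    ("COMMITTED", "COMMIT"): ("COMMITTED", None),   # assumed: duplicate, ignore
    ("ABORTED", "PREPARE"): ("ABORTED", "AGREE-TO-ABORT"),
    ("ABORTED", "ABORT"): ("ABORTED", None),
}


def unspecified_pairs(transitions):
    """List every (state, message) pair the table does not cover."""
    return [pair for pair in product(STATES, MESSAGES) if pair not in transitions]
```

For this table the check reports `("INIT", "COMMIT")`, `("COMMITTED", "ABORT")`, and `("ABORTED", "COMMIT")`, which are exactly the combinations that should never occur if the coordinator follows the protocol.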

65

slide-119
SLIDE 119

assignment: fails during prepare

[diagram: coordinator, worker 1, worker 2; messages PREPARE, AGREE-TO-COMMIT, ABORT]

coordinator crashes from failing to get response
crash happens because RPC call to worker fails
recovers after crash

67

slide-121
SLIDE 121

assignment: failure during commit

[diagram: coordinator, worker 1, worker 2; messages PREPARE, AGREE-TO-COMMIT, COMMIT, COMMIT]

COMMIT not sent successfully → crash
RPC call to get ack of commit fails, coordinator crashes
fix the problem when coordinator restarted

68

slide-123
SLIDE 123

quorums (1)

[diagram: nodes A, B, C, D, E]

perform read/write with vote of any quorum of nodes
any quorum enough: okay if some nodes fail
if A, C, D agree: that’s enough
B, E will figure out what happened when they come back up

69

slide-125
SLIDE 125

quorums (2)

[diagram: nodes A, B, C, D, E]

requirement: quorums overlap

overlap = someone in quorum knows about every update

e.g. every operation requires majority of nodes

part of voting: provide other voting nodes with ‘missing’ updates

make sure updates survive later on

cannot get a quorum to agree on anything conflicting with past updates
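The overlap requirement can be checked by brute force for the five-node example. This is a sketch with hypothetical helper names (`majority_quorums`, `all_overlap`). Any two majorities of the same node set must share at least one node, so some member of every quorum has seen every earlier update.

```python
from itertools import combinations

NODES = ["A", "B", "C", "D", "E"]


def majority_quorums(nodes):
    """All minimum-size majority subsets of the nodes."""
    size = len(nodes) // 2 + 1
    return [set(q) for q in combinations(nodes, size)]


def all_overlap(quorums):
    """True if every pair of quorums shares at least one node."""
    return all(q1 & q2 for q1, q2 in combinations(quorums, 2))
```

With five nodes there are ten 3-node majorities and `all_overlap` holds for them, whereas 2-node groups such as `{"A", "B"}` and `{"C", "D"}` can be disjoint and so could accept conflicting updates.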

70
