  • 1

19: Distributed Coordination

Last Modified: 7/3/2004 1:50:34 PM

  • 2

Last Time

We talked about the potential benefits of distributed systems. We also talked about some of the reasons they can be so difficult to build. Today we are going to tackle some of these problems!

  • 3

Recall

Distributed systems:
- Components can fail (not fail-stop)
- Network partitions can occur, in which each portion of the distributed system thinks they are the only ones alive
- Don't have a shared clock
- Can't rely on hardware primitives like test-and-set for mutual exclusion

  • 4

Distributed Coordination

To tackle this complexity we are going to build distributed algorithms for:
- Event Ordering
- Mutual Exclusion
- Atomicity
- Deadlock Handling
- Election Algorithms
- Reaching Agreement

  • 5

Event Ordering

Problem: distributed systems do not share a clock. Many coordination problems would be simplified if they did ("first one wins").

Distributed systems do have some sense of time:
- Events in a single process happen in order
- Messages between processes must be sent before they can be received

How helpful is this?

  • 6

Happens-before

Define a happens-before relation (denoted by →).
1) If A and B are events in the same process, and A was executed before B, then A → B.
2) If A is the event of sending a message by one process and B is the event of receiving that message by another process, then A → B.
3) If A → B and B → C, then A → C.


  • 7

Total ordering?

Happens-before gives a partial ordering of events. We still do not have a total ordering of events.

  • 8

Partial Ordering

Pi → Pi+1; Qi → Qi+1; Ri → Ri+1
R0 → Q4; Q3 → R4; Q1 → P4; P1 → Q2

  • 9

Total Ordering?

P0, P1, Q0, Q1, Q2, P2, P3, P4, Q3, R0, Q4, R1, R2, R3, R4
P0, Q0, Q1, P1, Q2, P2, P3, P4, Q3, R0, Q4, R1, R2, R3, R4
P0, Q0, P1, Q1, Q2, P2, P3, P4, Q3, R0, Q4, R1, R2, R3, R4

  • 10

Timestamps

Assume each process has a local logical clock that ticks once per event, and that the processes are numbered.
- Clocks tick once per event (including message sends)
- When you send a message, send your clock value
- When you receive a message, set your clock to MAX(your clock, timestamp of message + 1)
  - Thus sending comes before receiving
  - The only visibility into actions at other nodes happens during communication; communication synchronizes the clocks

If the timestamps of two events A and B are the same, then use the process identity numbers to break ties. This gives a total ordering!
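The clock rules above can be sketched in a few lines. This is a minimal sketch; the function names are illustrative, not from any library.

```python
# Lamport logical clocks, following the slide's rules. Events are ordered
# by (timestamp, process_id); the process id breaks timestamp ties.

def tick(clock):
    """Advance the local clock by one for a local event (or a send)."""
    return clock + 1

def on_send(clock):
    """A send is an event: tick, then attach the new clock to the message."""
    clock = tick(clock)
    return clock, clock  # (new local clock, timestamp carried by the message)

def on_receive(clock, msg_ts):
    """Set the clock to MAX(own clock, message timestamp + 1)."""
    return max(clock, msg_ts + 1)

# Process 0 sends at clock 0; process 1 (clock 5) receives the message.
a_clock, ts = on_send(0)
b_clock = on_receive(5, ts)
assert (ts, 0) < (b_clock, 1)  # sending comes before receiving in the total order
```

Tuple comparison does the tie-breaking for free: Python compares timestamps first and falls back to process ids only when they are equal.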

  • 11

Distributed Mutual Exclusion (DME)

Problem: We can no longer rely on just an atomic test-and-set operation on a single machine to build mutual exclusion primitives.

Requirement: If Pi is executing in its critical section, then no other process Pj is executing in its critical section.

  • 12

Solution

We present three algorithms to ensure the mutually exclusive execution of processes in their critical sections:
- Centralized Distributed Mutual Exclusion (CDME)
- Fully Distributed Mutual Exclusion (DDME)
- Token passing


  • 13

CDME: Centralized Approach

One of the processes in the system is chosen to coordinate the entry to the critical section.
- A process that wants to enter its critical section sends a request message to the coordinator.
- The coordinator decides which process can enter the critical section next, and it sends that process a reply message.
- When the process receives a reply message from the coordinator, it enters its critical section.
- After exiting its critical section, the process sends a release message to the coordinator and proceeds with its execution.

3 messages per critical section entry.

  • 14

Problems of CDME

Electing the master process? Hardcoded? Single point of failure? Electing a new master process?

Distributed election algorithms later…

  • 15

DDME: Fully Distributed Approach

- When process Pi wants to enter its critical section, it generates a new timestamp, TS, and sends the message request(Pi, TS) to all other processes in the system.
- When process Pj receives a request message, it may reply immediately or it may defer sending a reply back.
- When process Pi receives a reply message from all other processes in the system, it can enter its critical section.
- After exiting its critical section, the process sends reply messages to all its deferred requests.

  • 16

DDME: Fully Distributed Approach (Cont.)

The decision whether process Pj replies immediately to a request(Pi, TS) message or defers its reply is based on three factors:
- If Pj is in its critical section, then it defers its reply to Pi.
- If Pj does not want to enter its critical section, then it sends a reply immediately to Pi.
- If Pj wants to enter its critical section but has not yet entered it, then it compares its own request timestamp with the timestamp TS.
  - If its own request timestamp is greater than TS, then it sends a reply immediately to Pi (Pi asked first).
  - Otherwise, the reply is deferred.
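The three-factor decision above can be sketched as a single function. This is a minimal sketch with illustrative state names; requests are (timestamp, process_id) pairs so that ids break ties, as in the total ordering built earlier.

```python
# Pj's reply decision in the fully distributed (DDME) approach.
# pj_state is one of "in_critical_section", "idle", or "waiting";
# pj_request and incoming are (timestamp, process_id) pairs.

def should_defer(pj_state, pj_request, incoming):
    """Return True if Pj defers its reply to the incoming request."""
    if pj_state == "in_critical_section":
        return True                   # factor 1: defer until Pj exits
    if pj_state == "idle":
        return False                  # factor 2: reply immediately
    # factor 3: Pj also wants in; the earlier (smaller) timestamp wins,
    # with process ids breaking timestamp ties.
    return pj_request < incoming      # defer iff Pj asked first

assert should_defer("in_critical_section", None, (3, 1)) is True
assert should_defer("idle", None, (3, 1)) is False
assert should_defer("waiting", (2, 2), (3, 1)) is True   # Pj asked first: defer
assert should_defer("waiting", (5, 2), (3, 1)) is False  # Pi asked first: reply
```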
  • 17

Problems of DDME

- Requires complete trust that other processes will play fair
  - Easy to cheat just by delaying the reply!
- The processes need to know the identity of all other processes in the system
  - Makes the dynamic addition and removal of processes more complex.
- If one of the processes fails, then the entire scheme collapses.
  - Dealt with by continuously monitoring the state of all the processes in the system.
- Constantly bothering people who don't care
  - Can I enter my critical section? Can I?

  • 18

Token Passing

- Circulate a token among the processes in the system
- Possession of the token entitles the holder to enter the critical section
- Organize the processes in the system into a logical ring
- Pass the token around the ring
- When you get it, enter the critical section if you need to, then pass it on when you are done (or just pass it on if you don't need it)


  • 19

Problems of Token Passing

- If the machine with the token fails, how do we regenerate a new token?
  - A lot like electing a new coordinator
- If a process fails, we need to repair the break in the logical ring

  • 20

Compare: Number of Messages?

- CDME: 3 messages per critical section entry
- DDME: The number of messages per critical-section entry is 2 x (n - 1)
  - Request/reply for everyone but myself
- Token passing: Between 0 and n messages
  - Might luck out and ask for the token while I have it, or when the person right before me has it
  - Might need to wait for the token to visit everyone else first

  • 21

Compare: Starvation

- CDME: Freedom from starvation is ensured if the coordinator uses FIFO.
- DDME: Freedom from starvation is ensured, since entry to the critical section is scheduled according to the timestamp ordering. The timestamp ordering ensures that processes are served in first-come, first-served order.
- Token Passing: Freedom from starvation if the ring is unidirectional.

Caveats:
- The network is reliable (i.e. machines not "starved" by inability to communicate)
- If machines fail, they are restarted or taken out of consideration (i.e. machines not "starved" by nonresponse of the coordinator or another participant)
- Processes play by the rules

  • 22

Why DDME?

Harder. More messages. Bothers more people. Coordinator just as bothered.

  • 23

Atomicity

Recall: Atomicity = either all the operations associated with a program unit are executed to completion, or none are performed.

In a distributed system we may have multiple copies of the data; replicas are good for reliability/availability.

PROBLEM: How do we atomically update all of the copies?
  • 24

Replica Consistency Problem

Imagine we have multiple bank servers and a client desiring to update their bank account. How can we do this?
- Allow a client to update any server, then have that server propagate the update to the other servers
  - Simple and wrong! Simultaneous and conflicting updates can occur at different servers.
- Have the client send the update to all servers
  - Same problem: a race condition. Which of the conflicting updates will reach each server first?


  • 25

Two-phase commit

An algorithm for providing atomic updates in a distributed system. Give the servers (or replicas) a chance to say no, and if any server says no, the client aborts the operation.

  • 26

Framework

Goal: Update all replicas atomically.
- Either everyone commits or everyone aborts
- No inconsistencies even in the face of failures

Caveat: Assume no byzantine failures (servers stop when they fail; they do not continue and generate bad data).

Definitions:
- Coordinator: software entity that shepherds the process (the client in our example could be one of the servers)
- Ready to commit: side effects of the update are safely stored non-volatilely (recall: write-ahead logging)
  - Even if I crash, once I say I am ready to commit, then when I recover I will find the evidence and continue with the commit protocol

  • 27

Two Phase Commit: Phase 1

- The coordinator sends a PREPARE message to each replica.
- The coordinator waits for all replicas to reply with a vote.
- Each participant sends a vote:
  - Votes PREPARED if ready to commit, and locks the data items being updated
  - Votes NO if unable to get a lock or unable to ensure it is ready to commit

  • 28

Two Phase Commit: Phase 2

- If the coordinator receives a PREPARED vote from all replicas, then it may decide to commit or abort.
- The coordinator sends its decision to all participants.
- If a participant receives a COMMIT decision, it commits the changes resulting from the update; if it receives an ABORT decision, it discards the changes resulting from the update.
- The participant replies DONE.
- When the coordinator has received DONE from all participants, it can delete its record of the outcome.
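The two phases can be sketched from the coordinator's side. This is a minimal sketch under the no-failure assumption: `ask_vote` and `tell` are hypothetical stand-ins for real messaging, and the timeout/recovery handling discussed on the later slides is omitted.

```python
# Two-phase commit, coordinator's view (no failures, no timeouts).

def two_phase_commit(participants, ask_vote, tell):
    """ask_vote(p) -> 'PREPARED' or 'NO'; tell(p, decision) delivers phase 2."""
    # Phase 1: collect one vote from every participant.
    votes = [ask_vote(p) for p in participants]
    decision = "COMMIT" if all(v == "PREPARED" for v in votes) else "ABORT"
    # Phase 2: broadcast the decision; participants apply or discard the update.
    for p in participants:
        tell(p, decision)
    return decision

log = []
assert two_phase_commit(["A", "B"], lambda p: "PREPARED",
                        lambda p, d: log.append((p, d))) == "COMMIT"
assert log == [("A", "COMMIT"), ("B", "COMMIT")]
assert two_phase_commit(["A", "B"], lambda p: "NO" if p == "B" else "PREPARED",
                        lambda p, d: None) == "ABORT"
```

A single NO vote (or, in a real system, a vote that never arrives) flips the decision to ABORT for everyone, which is exactly the "any server can say no" property above.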

  • 29

Performance

In the absence of failure, 2PC makes a total of 2 (1.5?) round trips of messages before the decision is made:
- Prepare
- Vote NO or PREPARED
- Commit/abort
- Done (but DONE is just for bookkeeping; it does not affect response time)

  • 30

Failure Handling in 2PC – Replica Failure

- The log contains a <commit T> record. In this case, the site executes redo(T).
- The log contains an <abort T> record. In this case, the site executes undo(T).
- The log contains a <ready T> record; consult Ci. If Ci is down, the site sends a query-status T message to the other sites.
- The log contains no control records concerning T. In this case, the site executes undo(T).


  • 31

Failure Handling in 2PC – Coordinator Ci Failure

- If an active site contains a <commit T> record in its log, then T must be committed.
- If an active site contains an <abort T> record in its log, then T must be aborted.
- If some active site does not contain the record <ready T> in its log, then the failed coordinator Ci cannot have decided to commit T. Rather than wait for Ci to recover, it is preferable to abort T.
- All active sites have a <ready T> record in their logs, but no additional control records. In this case we must wait for the coordinator to recover.
  - Blocking problem: T is blocked pending the recovery of site Si.

  • 32

Failure Handling

Failures are detected with timeouts.
- If a participant times out before getting a PREPARE, it can abort.
- If the coordinator times out waiting for a vote, it can abort.
- If a participant times out waiting for a decision, it is blocked!
  - Wait for the coordinator to recover? Punt to some other resolution protocol?
- If the coordinator times out waiting for DONE, it keeps a record of the outcome.
  - Other sites may have a replica.
  • 33

Deadlock Handling

Recall our discussion of deadlock in the single-node case. The same problem can occur in a distributed system. Worse? Because it is harder to do manual detection and recovery.
- Can't just note that a single machine is slow/hung and reboot it

How can we deal with deadlock in a distributed system?

  • 34

Global Ordering

Resource-ordering deadlock prevention: define a global ordering among the system resources.
- Assign a unique number to all system resources.
- A process may request a resource with unique number i only if it is not holding a resource with a unique number greater than i.

Simple to implement and requires little overhead, but how easy is it to establish a global ordering? We had this same issue in the single-node case. This is a good approach when you can make it work.
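The request rule above amounts to a one-line check each process can make locally. A minimal sketch, assuming resources are identified by their global numbers:

```python
# Resource-ordering deadlock prevention: a request for resource i is legal
# only if every resource this process currently holds is numbered below i.

def may_request(held, i):
    """held: set of resource numbers the process holds; i: requested number."""
    return all(h < i for h in held)

assert may_request({1, 3}, 5) is True    # acquiring in increasing order: OK
assert may_request({4}, 2) is False      # out-of-order acquire: refused
assert may_request(set(), 7) is True     # holding nothing: any request is fine
```

Because every process acquires resources in strictly increasing order, no cycle of "holding i, waiting for j" can form, so circular wait is impossible.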

  • 35

Extend the Banker's Algorithm

Recall the Banker's algorithm: it avoids deadlock by not committing resources unless there is a guaranteed way to complete all of them. The Banker's algorithm in a distributed system?
- Designate one of the processes in the system as the process that maintains the information necessary to carry out the Banker's algorithm.
- A straightforward extension of the single-node case, but:
  - The banker is a bottleneck
  - Messages on each resource acquire/release
  - Same as in the single-node case: sounds good but pretty expensive!

  • 36

Other choices?

Recall the necessary conditions for deadlock:
- Mutual Exclusion
- Hold-and-Wait
- Circular Wait
- No Preemption

In the single-node case, we showed how to invalidate one of these conditions. How about in the distributed case? What about borrowing from how databases deal with deadlock?


  • 37

Recall: Timestamp-Based Protocols

A method for selecting the order among conflicting transactions.
- Associate with each transaction a number, which is the timestamp or clock value when the transaction begins executing.
- Associate with each data item the largest timestamp of any transaction that wrote the item, and another for the largest timestamp of any transaction that read the item.

  • 38

Timestamp-Ordering

- If the timestamp of a transaction wanting to read data < the write timestamp on the data, then it would have needed to read a value that was already overwritten, so abort the reading transaction.
- If the timestamp of a transaction wanting to write data < the read timestamp on the data, then the last read would be invalid, but it is committed, so abort the writing transaction.

The ability to abort/rollback is crucial!
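The two abort rules above can be sketched directly. A minimal sketch implementing only the two rules on this slide (a full timestamp-ordering protocol also handles a stale write against a newer write); each data item carries the largest read and write timestamps seen so far:

```python
# Timestamp-ordering checks for one data item.

def try_read(txn_ts, item):
    if txn_ts < item["write_ts"]:
        return "abort"                      # value it needed was overwritten
    item["read_ts"] = max(item["read_ts"], txn_ts)
    return "ok"

def try_write(txn_ts, item):
    if txn_ts < item["read_ts"]:
        return "abort"                      # a committed read would be invalid
    item["write_ts"] = max(item["write_ts"], txn_ts)
    return "ok"

item = {"read_ts": 0, "write_ts": 0}
assert try_write(5, item) == "ok"           # write_ts becomes 5
assert try_read(3, item) == "abort"         # txn 3 missed the value txn 5 overwrote
assert try_read(7, item) == "ok"            # read_ts becomes 7
assert try_write(6, item) == "abort"        # txn 7 already read; txn 6 is too late
```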

  • 39

Timestamped Deadlock-Prevention Scheme

- Each process Pi is assigned a unique timestamp (or priority).
- The timestamps are used to decide whether a process Pi should wait for a process Pj; otherwise Pi is rolled back.
- The scheme prevents deadlocks. For every edge Pi → Pj in the wait-for graph, Pi has a higher priority than Pj. Thus a cycle cannot exist.
- The ability to abort/rollback is crucial.

  • 40

Unique Timestamps in a Distributed Environment

Use the site identifier as the least significant part of the timestamp, to ensure that the global timestamps generated at one site are not always bigger than those of another.

  • 41

Variations

- Wait-Die: non-preemptive
- Wound-Wait: preemptive

Both prevent deadlock by avoiding cycles in the wait-for graph.

  • 42

Wait-Die Scheme

Non-preemptive. If Pi requests a resource currently held by Pj, Pi is allowed to wait only if it has a smaller timestamp than Pj (Pi is older than Pj). Otherwise, Pi is rolled back (dies).

Example: Suppose that processes P1, P2, and P3 have timestamps 1, 2, and 3 respectively.
- If P1 requests a resource held by P2, then P1 will wait.
- If P3 requests a resource held by P2, then P3 will be rolled back.


  • 43

Wound-Wait Scheme

Preemptive technique. If Pi requests a resource currently held by Pj, Pi is allowed to wait only if it has a larger timestamp than Pj (Pi is younger than Pj). Otherwise Pj is rolled back (Pj is wounded by Pi).

Example: Suppose that processes P1, P2, and P3 have timestamps 1, 2, and 3 respectively.
- If P1 requests a resource held by P2, then the resource will be preempted from P2 and P2 will be rolled back.
- If P3 requests a resource held by P2, then P3 will wait.

  • 44

Summary

             Holder has higher timestamp        Holder has lower timestamp
Wait-Die     Requester waits                    Requester dies
Wound-Wait   Holder dies (requester wounds)     Requester waits
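The two schemes differ only in what happens when the requester is older. A minimal sketch of both decisions (smaller timestamp = older process):

```python
# Decision when `requester` asks for a resource held by `holder`.

def wait_die(requester_ts, holder_ts):
    # Non-preemptive: an older requester waits; a younger requester dies.
    return "requester waits" if requester_ts < holder_ts else "requester dies"

def wound_wait(requester_ts, holder_ts):
    # Preemptive: an older requester wounds (rolls back) the holder;
    # a younger requester waits.
    return "holder dies" if requester_ts < holder_ts else "requester waits"

# P1, P2, P3 have timestamps 1, 2, 3 (the slides' example):
assert wait_die(1, 2) == "requester waits"    # P1 asks P2: P1 waits
assert wait_die(3, 2) == "requester dies"     # P3 asks P2: P3 rolled back
assert wound_wait(1, 2) == "holder dies"      # P1 asks P2: P2 preempted
assert wound_wait(3, 2) == "requester waits"  # P3 asks P2: P3 waits
```

In both schemes, only the older process ever ends up waiting on the younger one, so all wait-for edges point from higher priority to lower priority and no cycle can form.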

  • 45

Avoiding Starvation

Both are priority-based schemes and so are subject to starvation. Starvation is avoided if, when we roll back a process, we allow it to keep its timestamp. Eventually it should be the highest-priority process and will never be rolled back.

  • 46

Deadlock detection

Instead of deadlock prevention, we could allow deadlocks to occur. Manual detection and recovery is harder:
- Notice that the whole distributed system is slow/hung and reboot it?

But automatic detection would require global knowledge to find cycles in the wait-for graph.

  • 47

Two Local Wait-For Graphs

The local graphs have no cycles.

  • 48

Global Wait-For Graph

The global graph has a cycle!


  • 49

Deadlock Detection – Centralized Approach

- Each site keeps a local wait-for graph. The nodes of the graph correspond to all the processes that are currently either holding or requesting any of the resources local to that site.
- A global wait-for graph is maintained in a single coordination process; this graph is the union of all local wait-for graphs.

  • 50

Centralized Approach (cont.)

There are three different options (points in time) when the wait-for graph may be constructed:
1. Whenever a new edge is inserted or removed in one of the local wait-for graphs (implies communication with the coordinator on every resource acquire/release!)
2. Periodically, when a number of changes have occurred in a wait-for graph (at least this can batch the info sent to the coordinator)
3. Whenever the coordinator needs to invoke the cycle-detection algorithm.

  • 51

False Cycles

Unnecessary rollbacks may occur as a result of false cycles due to communication latency. Local graph snapshots may be taken at different points in time, such that the union suggests a cycle that isn't really there.

  • 52

Fully Distributed Approach

- All controllers share equally the responsibility for detecting deadlock. Every site constructs a wait-for graph that represents a part of the total graph.
- We add one additional node Pex to each local wait-for graph.
- If a local wait-for graph contains a cycle that does not involve node Pex, then the system is in a deadlock state.
- A cycle involving Pex implies the possibility of a deadlock. To ascertain whether a deadlock does exist, info on the cycle is sent to some other site, where they will either detect a deadlock or augment the graph with their information and pass it on to another site, until all sites have contributed.

  • 53

Choosing a Coordinator

In many of the distributed coordination algorithms we've seen, some machine is playing the role of a coordinator.
- Examples: coordinators for centralized deadlock detection or two-phase commit

How do we choose such a coordinator? Or elect a new one if the current one fails?

  • 54

Election Algorithms

GOAL: Determine where a new copy of the coordinator should be started/restarted.
- Assume that a unique priority number is associated with each active process in the system, and assume that the priority number of process Pi is i.
- The coordinator is always the process with the largest priority number. When a coordinator fails, the algorithm must elect the active process with the largest priority number.
- Two variants, bully and ring, based on topology (ring for ring network topologies, bully for everything else).


  • 55

Ring Algorithm

- Applicable to systems organized as a ring (logically or physically). Assumes that the links are unidirectional, and that processes send their messages to their right neighbors.
- Each process maintains an active list, consisting of the priority numbers of all active processes in the system when the algorithm ends.
- If process Pi detects a coordinator failure (timeout waiting for a response), it creates a new active list that is initially empty. It then sends a message elect(i) to its right neighbor, and adds the number i to its active list.

  • 56

Ring Algorithm (Cont.)

If Pi receives a message elect(j) from the process on the left, it must respond in one of three ways:
1. If this is the first (in some time) elect message it has seen or sent, Pi creates a new active list with the numbers i and j. It then sends the message elect(i), followed by the message elect(j).
2. If the message does not contain Pi's number, then Pi adds j to its active list and forwards the message to its right neighbor.
3. If the message does contain Pi's number, then Pi has seen all previous messages and its active list is full; the largest number on the list identifies the new coordinator.
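One full circulation can be sketched as follows. This is a simplified sketch: it models the election as a single message carrying the growing active list (rather than the separate elect(i)/elect(j) messages above), under the assumption that only live processes remain in the ring.

```python
# One circulation of the ring election. The detector starts the message;
# each live process appends its own priority number; the message stops
# when it returns to a process whose number it already contains.

def ring_election(live_ids, start):
    """live_ids: priority numbers in ring order; start: index of the detector."""
    n = len(live_ids)
    active = [live_ids[start]]             # detector adds its own number
    pos = (start + 1) % n
    while live_ids[pos] not in active:
        active.append(live_ids[pos])       # forward right, adding own number
        pos = (pos + 1) % n
    return max(active)                     # largest priority number wins

# Ring 3 -> 1 -> 4 -> 2 (the old coordinator, 5, has failed); process 1 detects:
assert ring_election([3, 1, 4, 2], start=1) == 4
```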

  • 57

Recovery in Ring Algorithm

A recovering process can send a message around the ring requesting to know who the coordinator is. The coordinator will see the message as it goes around the ring and reply with its identity.

  • 58

Bully Algorithm

For network topologies other than a ring. Must know all other processes in the system.
- Process Pi sends a request that is not answered by the coordinator within a specified time => assume that the coordinator has failed.
- Pi tries to elect itself as the new coordinator.

  • 59

Bully Algorithm (Cont.)

- Pi sends an election message to every process with a higher priority number; Pi then waits for any of these processes to answer within time T.
- If there is no response within T, assume that all processes with numbers greater than i have failed; Pi elects itself the new coordinator.
- If an answer is received, Pi begins a time interval T', waiting to receive a message that a process with a higher priority number has been elected.
  - If no such message is received within T', assume the process with a higher number has failed; Pi should restart the algorithm.

  • 60

Bully Algorithm (Cont.)

If Pi is not the coordinator, then, at any time during execution, Pi may receive one of the following two messages from process Pj:
- Pj is the new coordinator (j > i). Pi, in turn, records this information.
- Pj started an election (j > i). Pi sends a response to Pj and begins its own election algorithm, provided that Pi has not already initiated such an election.


  • 61

Recovery in Bully Algorithm

After a failed process recovers, it immediately begins execution of the same algorithm. If there are no active processes with higher numbers, the recovered process forces all processes with lower numbers to let it become the coordinator process, even if there is a currently active coordinator with a lower number.

  • 62

Byzantine Generals Problem

Deals with reaching agreement in the face of both faulty communications and untrustworthy peers.

Problem:
- Divisions of an army, each commanded by a general, surround an enemy camp.
- The generals must reach agreement on whether to attack (a certain number must attack or defeat is certain).
- The divisions are geographically separated, such that they must communicate via messengers.
- Messengers may be caught and never reach the other side (lost messages).
- Generals may be traitors (faulty/compromised processes).

  • 63

Problem 1: Lost Messengers/Messages

How can we deal with the fact that messages may be lost? (We saw this in TCP.) Detect failures using a time-out scheme:
- When sending a message, specify a time interval to wait for an acknowledgment.
- When receiving a message, send an acknowledgment.
- The acknowledgment can be lost too!
- If the sender receives the acknowledgment within the specified time interval, it can conclude that its message was received. If a time-out occurs, it retransmits the message and waits for another acknowledgment.
- Continue until either an acknowledgment is received, or give up after some time?

  • 64

The Last Word?

Suppose the receiver needs to know that the sender has received its acknowledgment message, in order to decide how to proceed. Actually, in the presence of failure, it is not possible to accomplish this task. It is not possible in a distributed environment for processes Pi and Pj to agree completely on their current respective states. There is always some level of uncertainty about the last message.

  • 65

Traitors?

Consider that generals can be traitors (processes can be faulty). What could traitors do?
- Refuse to send any messages
- Delay sending messages
- Send incorrect messages
- Send different messages to different generals

  • 66

Formalize Agreement

Consider a system of n processes, of which no more than m are faulty. Devise an algorithm that allows each non-faulty Pi to construct a vector Xi = (Ai,1, Ai,2, …, Ai,n) such that:
- Each process Pi has some private value Vi.
- If Pj is a nonfaulty process, then Ai,j = Vj.
- If Pi and Pj are both nonfaulty processes, then Xi = Xj.


  • 67

Solutions to the Problem of Reaching Agreement

Solutions share the following properties:
- Assume reliable communication
- Bound the maximum number of traitors to m
- A correct algorithm can be devised only if n ≥ 3m + 1.
- The worst-case delay for reaching agreement is proportionate to m + 1 message-passing delays.

  • 68

Simplest Example

An algorithm for the case where m = 1 and n = 4 (≥ 3m + 1) requires m + 1 = 2 rounds of information exchange:
- Each process sends its private value to the other 3 processes.
- Each process sends the information it obtained in the first round to all other processes.

  • 69

Simplest Example (cont.)

- If a faulty process refuses to send messages, a nonfaulty process can choose an arbitrary value and pretend that that value was sent by that process.
- After the two rounds are completed, a nonfaulty process Pi can construct its vector Xi = (Ai,1, Ai,2, Ai,3, Ai,4) as follows:
  - Ai,i = Vi.
  - For j ≠ i, if at least two of the three values reported for process Pj agree, then the majority value is used to set the value of Ai,j. Otherwise, a default value (nil) is used.
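The second-round majority rule can be sketched in a few lines. A minimal sketch for the m = 1, n = 4 case; `reports` holds the three values the other processes claim Pj sent in round 1, and `None` stands in for the default nil value.

```python
from collections import Counter

def agreed_value(reports):
    """Majority of the three relayed values, else the default nil (None)."""
    value, count = Counter(reports).most_common(1)[0]
    return value if count >= 2 else None

assert agreed_value(["attack", "attack", "retreat"]) == "attack"  # majority wins
assert agreed_value(["a", "b", "c"]) is None                      # no majority: nil
```

With one traitor among four generals, at most one of the three reports about a nonfaulty Pj can be a lie, so the two honest reports always form the majority and Ai,j = Vj as required.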

  • 70

Consider

What if n < 4? If n = 3 and there was one traitor, then it could lie differently to the two non-traitors, and they could not resolve the discrepancy by a majority vote. What if only one round?

  • 71

Outtakes

  • 72

Ensuring atomicity in a distributed system requires a transaction coordinator, which is responsible for the following:
- Starting the execution of the transaction.
- Breaking the transaction into a number of subtransactions, and distributing these subtransactions to the appropriate sites for execution.
- Coordinating the termination of the transaction, which may result in the transaction being committed at all sites or aborted at all sites.


  • 73

Two-Phase Commit Protocol (2PC)

- Assumes a fail-stop model.
- Execution of the protocol is initiated by the coordinator after the last step of the transaction has been reached.
- When the protocol is initiated, the transaction may still be executing at some of the local sites.
- The protocol involves all the local sites at which the transaction executed.
- Example: Let T be a transaction initiated at site Si, and let the transaction coordinator at Si be Ci.

  • 74

Phase 1: Obtaining a Decision

- Ci adds a <prepare T> record to the log.
- Ci sends a <prepare T> message to all sites.
- When a site receives a <prepare T> message, the transaction manager determines if it can commit the transaction.
  - If no: add a <no T> record to the log and respond to Ci with <abort T>.
  - If yes:
    - add a <ready T> record to the log
    - force all log records for T onto stable storage
    - the transaction manager sends a <ready T> message to Ci

  • 75

Phase 1 (Cont.)

The coordinator collects the responses:
- All respond "ready": the decision is commit.
- At least one response is "abort": the decision is abort.
- At least one participant fails to respond within the timeout period: the decision is abort.

  • 76

Phase 2: Recording the Decision in the Database

- The coordinator adds a decision record, <abort T> or <commit T>, to its log and forces the record onto stable storage.
- Once that record reaches stable storage it is irrevocable (even if failures occur).
- The coordinator sends a message to each participant informing it of the decision (commit or abort).
- Participants take the appropriate action locally.

  • 77

Concurrency Control

I cut this altogether (too similar to mutual exclusion; does it deserve a separate discussion?)

  • 78

Concurrency Control

- Modify the centralized concurrency schemes to accommodate the distribution of transactions.
- A transaction manager coordinates the execution of transactions (or subtransactions) that access data at local sites.
- A local transaction executes only at that site; a global transaction executes at several sites.


  • 79

Locking Protocols

Nonreplicated scheme: each site maintains a local lock manager, which administers lock and unlock requests for those data items that are stored at that site.
- A simple implementation involves two message transfers for handling lock requests, and one message transfer for handling unlock requests.
- Deadlock handling is more complex.

  • 80

Single-Coordinator Approach

A single lock manager resides in a single chosen site; all lock and unlock requests are made at that site.
- Simple implementation
- Simple deadlock handling
- Possibility of a bottleneck
- Vulnerable to loss of the concurrency controller if the single site fails

A multiple-coordinator approach distributes the lock-manager function over several sites.

  • 81

Majority Protocol

Avoids the drawbacks of central control by dealing with replicated data in a decentralized manner.
- Must get an OK from at least n/2 + 1 participants.
- Deadlock-handling algorithms must be modified; it is possible for deadlock to occur when locking only one data item.
  - Example: two processes trying to lock each get 2 out of 4 participants to say OK; each needs a third?

  • 82

Biased Protocol (OUTTAKE)

Similar to the majority protocol, but requests for shared locks are prioritized over requests for exclusive locks.
- Less overhead on read operations than in the majority protocol, but additional overhead on writes.
- Like the majority protocol, deadlock handling is complex.

  • 83

Primary Copy

One of the sites at which a replica resides is designated as the primary site. A request to lock a data item is made at the primary site of that data item.
- Concurrency control for replicated data is handled in a manner similar to that of unreplicated data.
- Simple implementation, but if the primary site fails, the data item is unavailable, even though other sites may have a replica.

  • 84

Example Centralized Deadlock Detection Algorithm


  • 85

Detection Algorithm Based on Option 3

- Append unique identifiers (timestamps) to requests from different sites.
- When process Pi, at site A, requests a resource from process Pj, at site B, a request message with timestamp TS is sent.
- The edge Pi → Pj with the label TS is inserted in the local wait-for graph of A. The edge is inserted in the local wait-for graph of B only if B has received the request message and cannot immediately grant the requested resource.

  • 86

The Algorithm

1. The controller sends an initiating message to each site in the system.
2. On receiving this message, a site sends its local wait-for graph to the coordinator.
3. When the controller has received a reply from each site, it constructs a graph as follows:
(a) The constructed graph contains a vertex for every process in the system.
(b) The graph has an edge Pi → Pj if and only if (1) there is an edge Pi → Pj in one of the wait-for graphs, or (2) an edge Pi → Pj with some label TS appears in more than one wait-for graph.

If the constructed graph contains a cycle ⇒ deadlock.
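Step 3 plus the final check can be sketched as a graph union followed by a depth-first cycle search. A simplified sketch: it drops the TS labels and the "appears in more than one graph" refinement, and just unions the edge sets.

```python
# Union the local wait-for graphs, then search for a cycle with DFS.

def merge_and_detect(local_graphs):
    """local_graphs: list of edge sets {(Pi, Pj), ...}; True means deadlock."""
    edges = set().union(*local_graphs)     # step 3: the union of all graphs
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)

    def has_cycle(node, path, done):
        if node in path:                   # back to a node on the current path
            return True
        if node in done:                   # already fully explored, no cycle
            return False
        path.add(node)
        found = any(has_cycle(nxt, path, done) for nxt in adj.get(node, []))
        path.remove(node)
        done.add(node)
        return found

    return any(has_cycle(n, set(), set()) for n in list(adj))

# Site A sees P1 -> P2; site B sees P2 -> P1: only the union has a cycle.
assert merge_and_detect([{("P1", "P2")}, {("P2", "P1")}]) is True
assert merge_and_detect([{("P1", "P2")}, {("P2", "P3")}]) is False
```

The example shows the point of the centralized approach: neither local graph contains a cycle on its own, but the coordinator's union does.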

  • 87

Local and Global Wait-For Graphs

  • 88

Distributed Deadlock Detection

  • 89

Augmented Local Wait-For Graphs

  • 90

Augmented Local Wait-For Graph in Site S2
