External Consistency and Spanner
CS425/ECE428 — SPRING 2020 NIKITA BORISOV, UIUC
Transactions so far
Objects distributed / partitioned among different servers
– For load balancing (sharding)
– For separation of concerns /
Isolation enforced using two-phase locking (2PL)
Atomic commit using 2PC
Node failure
But! Node failure is common
Drive failures => no recovery!
Objects distributed among 1000’s of cluster nodes for load-balancing (sharding)
Objects replicated among a handful of nodes for availability / durability
Two-level operation:
Note: can be expensive!
read A -> acquire read lock on A
read B -> acquire read lock on B
write A -> promote A’s lock to write lock
commit -> perform 2PC
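The lock steps above can be sketched as a toy, single-node lock table. This is only an illustration: the class name `LockTable` and its methods are hypothetical, it returns False instead of blocking, and it ignores replication, consensus, and 2PC entirely.

```python
from collections import defaultdict


class LockTable:
    """Toy lock table (hypothetical): shared read locks, exclusive write locks."""

    def __init__(self):
        self.readers = defaultdict(set)  # object -> txn ids holding a read lock
        self.writer = {}                 # object -> txn id holding the write lock

    def acquire_read(self, txn, obj):
        holder = self.writer.get(obj)
        if holder is not None and holder != txn:
            return False                 # another txn is writing: caller must wait
        self.readers[obj].add(txn)
        return True

    def acquire_write(self, txn, obj):
        holder = self.writer.get(obj)
        if holder is not None and holder != txn:
            return False                 # another txn is writing
        if self.readers[obj] - {txn}:
            return False                 # other readers exist: promotion must wait
        self.writer[obj] = txn           # also covers promotion of txn's own read lock
        return True

    def release_all(self, txn):
        """Under strict 2PL, all locks are released together at commit/abort."""
        for readers in self.readers.values():
            readers.discard(txn)
        for obj in [o for o, t in self.writer.items() if t == txn]:
            del self.writer[obj]


locks = LockTable()
t1_steps = [
    locks.acquire_read("T1", "A"),   # read A
    locks.acquire_read("T1", "B"),   # read B
    locks.acquire_write("T1", "A"),  # write A: promote the read lock
]
t2_blocked = locks.acquire_write("T2", "B")  # T1 still holds a read lock on B
locks.release_all("T1")                      # commit (2PC itself is not modeled)
t2_after = locks.acquire_write("T2", "B")
```

The second transaction illustrates the cost noted on the slide: a reader's lock on B keeps a writer from moving forward until the reader commits.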
Read transactions often access many data items
Read transactions still need (read) locks (Why?)
Acquiring locks requires consensus (Why?)
Locks prevent write transactions from moving forward
Serial equivalence:
Linearizability
E.g., buying a movie
Wilson Hsieh representing a host of authors OSDI 2012
[Figure: user posts and friend lists, sharded x1000 in each of the US, Brazil, Russia, and Spain, with replicas in San Francisco, Seattle, Arizona, Sao Paulo, Santiago, Buenos Aires, Moscow, Berlin, Krakow, London, Paris, Madrid, and Lisbon]
transactions
– First system at global scale
control, replication, and 2PC
– Correctness and performance
– Interval-based global time
– Consistent view of friend list and their posts
Why consistency matters
[Figure: generating my page reads Friend1 post, Friend2 post, … Friend999 post, Friend1000 post from the user posts and friend lists tables; writes are blocked during the read]
[Figure: the same read with user posts and friend lists spread over several machines; writes are still blocked]
[Figure: the same read with user posts and friend lists spread over datacenters in the US, Spain, Russia, and Brazil, x1000 each]
– Each transaction T is assigned a timestamp s
– Data written by T is timestamped with s
[Figure: versions over time. Before 8: My friends = [X], X’s friends = [me]. At 8: My friends = [], X’s friends = []. At 15: My posts = [P]]
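Timestamped versions are what make lock-free snapshot reads possible: a read at time t returns, for each item, the latest version with timestamp <= t. A minimal sketch using the friend-list example follows. The class `VersionedValue` is hypothetical, and the initial timestamp 0 is an assumption standing in for "before 8".

```python
import bisect


class VersionedValue:
    """Hypothetical multi-version cell: every write is kept with its timestamp."""

    def __init__(self):
        self.timestamps = []  # sorted commit timestamps
        self.values = []      # values[i] was written at timestamps[i]

    def write(self, s, value):
        i = bisect.bisect_right(self.timestamps, s)
        self.timestamps.insert(i, s)
        self.values.insert(i, value)

    def read(self, t):
        """Return the latest version with timestamp <= t, or None if there is none."""
        i = bisect.bisect_right(self.timestamps, t)
        return self.values[i - 1] if i > 0 else None


my_friends, x_friends, my_posts = VersionedValue(), VersionedValue(), VersionedValue()
my_friends.write(0, ["X"])   # initial state (timestamp 0 is an assumption)
x_friends.write(0, ["me"])
my_posts.write(0, [])
my_friends.write(8, [])      # unfriend transaction commits at s = 8
x_friends.write(8, [])
my_posts.write(15, ["P"])    # risky post commits at s = 15

snapshot_7 = (my_friends.read(7), x_friends.read(7), my_posts.read(7))
snapshot_15 = (my_friends.read(15), x_friends.read(15), my_posts.read(15))
```

No snapshot shows X as a friend together with post P, because the unfriend (s = 8) is ordered before the post (s = 15).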
== External Consistency: Commit order respects global wall-time order
== Timestamp order respects global wall-time order,
given timestamp order == commit order
Global wall-clock time
[Timeline: T acquires locks, picks s = now() while holding them, then releases locks]
[Figure: transactions T1, T2, T3, T4 on a timeline with clock uncertainty]
[Figure: TT.now() returns an interval [earliest, latest] of width 2*ε around the true time]
[Timeline: T acquires locks, picks s = TT.now().latest, waits until TT.now().earliest > s (commit wait), then releases locks; s falls on average ε after the pick, and commit wait lasts on average another ε]
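The commit-wait rule can be sketched in a few lines. This is a simulation under stated assumptions, not Spanner's implementation: `tt_now` fakes TrueTime by padding the local monotonic clock with a fixed, assumed ε of 5 ms, and `commit_with_wait` and its callbacks are hypothetical names.

```python
import time
from dataclasses import dataclass

EPSILON = 0.005  # assumed uncertainty bound, in seconds


@dataclass
class TTInterval:
    earliest: float
    latest: float


def tt_now(epsilon=EPSILON):
    """Stand-in for TT.now(): local clock padded by +/- epsilon."""
    t = time.monotonic()
    return TTInterval(earliest=t - epsilon, latest=t + epsilon)


def commit_with_wait(apply_writes, release_locks, epsilon=EPSILON):
    """Call with locks already held: pick s, commit wait, then release."""
    s = tt_now(epsilon).latest           # pick s = TT.now().latest
    apply_writes(s)                      # data written by T is timestamped with s
    while tt_now(epsilon).earliest <= s: # wait until TT.now().earliest > s
        time.sleep(epsilon / 10)
    release_locks()                      # s is now guaranteed to be in the past
    return s


events = []
start = time.monotonic()
s = commit_with_wait(lambda ts: events.append("write"),
                     lambda: events.append("release"))
elapsed = time.monotonic() - start
```

With a fixed ε the wait in this sketch is about 2*ε, which matches the two "average ε" spans in the timeline.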
[Timeline: T acquires locks, picks s, starts consensus, achieves consensus, finishes commit wait, notifies slaves, then releases locks]
[Timeline, 2PC: participants TP1 and TP2 acquire locks, compute s for each, start logging, finish logging, and send s once prepared; coordinator TC acquires locks, computes the overall s, finishes commit wait, commits, and notifies participants of s; all then release locks]
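[Example: the unfriend transaction (coordinator TC, participant TP) removes X from my friend list and removes myself from X’s friend list; sC=6, sP=8, overall s=8. T2 (risky post P) commits at s=15. Versions: before 8, My friends = [X] and X’s friends = [me]; at 8, both are []; at 15, My posts = [P]]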
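In the example, each side proposes its own timestamp (sC=6, sP=8) and the transaction commits everywhere at s=8. One way to read this is that the overall s is the maximum of the per-participant values, so it is no smaller than any of them. The sketch below assumes exactly that; `overall_timestamp` is a hypothetical helper and omits the other constraints a real coordinator would apply.

```python
def overall_timestamp(per_participant_s):
    """Pick one commit timestamp no smaller than any participant's choice."""
    return max(per_participant_s.values())


proposed = {"TC": 6, "TP": 8}            # values from the example
s = overall_timestamp(proposed)

# Every participant applies its writes at the same overall timestamp.
applied = {name: s for name in proposed}

# A transaction that starts after this one commits gets a later timestamp.
t2_s = 15
ordered = s < t2_s
```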
– Uncertainty in time can be waited out
– Mostly non-blocking
– Commit in the future
– At any sufficiently up-to-date replica
[Figure: Datacenter 1, Datacenter 2, … Datacenter n, with GPS timemasters and an atomic-clock timemaster, queried by a client]
Compute reference [earliest, latest] = now ± ε
now = reference now + local-clock offset
ε = reference ε + worst-case local-clock drift
[Figure: ε over time (0 to 90 sec): ε starts at the reference uncertainty and grows at a worst-case drift of 200 μs/sec, i.e. +6 ms per 30 sec]
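The two formulas and the drift figure can be checked with a little arithmetic. In this sketch the function names and the 1 ms reference ε are assumptions for illustration; only the 200 μs/sec drift rate and the 30 sec span come from the slide.

```python
DRIFT_RATE = 200e-6  # worst-case local-clock drift: 200 μs per second


def epsilon(reference_epsilon, seconds_since_sync, drift_rate=DRIFT_RATE):
    """ε = reference ε + worst-case local-clock drift since the last sync."""
    return reference_epsilon + drift_rate * seconds_since_sync


def now_interval(reference_now, local_offset, eps):
    """now = reference now + local-clock offset; return [earliest, latest]."""
    now = reference_now + local_offset
    return (now - eps, now + eps)


drift_after_30s = epsilon(0.0, 30)        # drift alone: 200 μs/sec * 30 sec
eps_after_30s = epsilon(0.001, 30)        # assumed 1 ms reference uncertainty
earliest, latest = now_interval(1000.0, 12.0, eps_after_30s)
```

Here `drift_after_30s` comes out to 6 ms, the +6 ms shown in the figure, so ε reaches about 7 ms under the assumed 1 ms reference uncertainty.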
– Bad CPUs 6 times more likely than bad clocks
[Figure: Epsilon (ms) at the 90th, 99th, and 99.9th percentiles; left panel by date (Mar 29 to Apr 1), right panel by time of day on April 13 (6AM to 12PM)]
– Known unknowns are better than unknown unknowns
– Rethink algorithms to make use of uncertainty
– Greater scale != weaker semantics