External Consistency and Spanner (CS425/ECE428, Spring 2020)



slide-1
SLIDE 1

External Consistency and Spanner

CS425/ECE428 — SPRING 2020 NIKITA BORISOV, UIUC

slide-2
SLIDE 2

Transactions so far

Objects distributed / partitioned among different servers

  • For load balancing (sharding)
  • For separation of concerns / administration

Isolation enforced using two-phase locking (2PL)

  • Each server maintains locks on own objects
  • Deadlocks detected using, e.g., edge chasing

Atomic commit using 2PC

  • Prepare to commit ensures durability
  • Recover from coordinator and participant crashes
slide-3
SLIDE 3

Dealing with Failures

Node failure

  • Objects unavailable until recovery
  • 2PC “stuck” after coordinator failure

But! Node failure is common

Drive failures => no recovery!

slide-4
SLIDE 4

Replication

Objects distributed among 1000s of cluster nodes for load balancing (sharding)

Objects replicated among a handful of nodes for availability / durability

  • Replication across data centers, too

Two-level operation:

  • Use transactions, coordinators, 2PC per object
  • Use Paxos / Raft among object replicas

Note: can be expensive!

  • Coordinator sends Prepare message to leaders of each replica group
  • Each leader uses Paxos / Raft to commit the Prepare to the group logs
  • Once commit succeeds, reply to coordinator
  • Coordinator uses Paxos / Raft to commit decision to its group log
slide-5
SLIDE 5

Example transaction

read A -> acquire read lock on A
read B -> acquire read lock on B
write A -> promote A’s lock to write lock
commit -> perform 2PC

  • Coordinator -> A, B: prepare
  • A, B -> OK
  • Coordinator -> A, B: commit
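
The example transaction can be traced with a toy, single-process model. This is only a sketch: the `Participant` class and `two_phase_commit` function are hypothetical names, there is no networking, logging, or deadlock handling, and the abort path is left out.

```python
class Participant:
    """One server holding a single object, its lock state, and a staged write."""

    def __init__(self, name, value):
        self.name = name
        self.value = value
        self.readers = set()   # transaction ids holding a read lock
        self.writer = None     # transaction id holding the write lock, if any
        self.staged = None     # value staged by the writer, applied at commit

    def read(self, txn):
        if self.writer not in (None, txn):
            raise RuntimeError(f"{self.name}: write-locked by {self.writer}")
        self.readers.add(txn)          # acquire read lock
        return self.value

    def write(self, txn, value):
        # Promotion succeeds only if no other transaction holds any lock.
        if self.writer not in (None, txn) or self.readers - {txn}:
            raise RuntimeError(f"{self.name}: lock conflict")
        self.readers.discard(txn)
        self.writer = txn              # promote to write lock
        self.staged = value

    def prepare(self, txn):
        # A real participant would force its staged write to stable storage here.
        return True

    def commit(self, txn):
        if self.writer == txn:
            self.value, self.staged, self.writer = self.staged, None, None
        self.readers.discard(txn)      # release read lock


def two_phase_commit(txn, participants):
    """Phase 1: collect votes. Phase 2: commit only if every vote is OK."""
    if all(p.prepare(txn) for p in participants):
        for p in participants:
            p.commit(txn)
        return "committed"
    return "aborted"                   # undo and lock release omitted


a, b = Participant("A", 10), Participant("B", 5)
total = a.read("T1") + b.read("T1")    # read A, read B
a.write("T1", total)                   # write A
outcome = two_phase_commit("T1", [a, b])
```

In the replicated setting described earlier, each `prepare` and `commit` call would itself be a Paxos / Raft round in the participant's replica group.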
slide-6
SLIDE 6

Read transactions

Read transactions often access many data items

  • E.g., Facebook “news feed”
  • E.g., Amazon front page
  • E.g., balances across all accounts

Read transactions still need (read) locks (Why?)

Acquiring locks requires consensus (Why?)

Locks prevent write transactions from moving forward

slide-7
SLIDE 7

Linearizability

Serial equivalence:

  • Total effect on system is equivalent to a run that is serial and consistent with each client’s order

Linearizability

  • Total effect on system is equivalent to a run that is serial and consistent with actual order of events

E.g., buying a movie

  • Client makes RPC to bank, transfers $3.99 to Amazon account
  • Client requests video from Amazon
  • Amazon makes RPC to bank, does not see transfer, rejects request!
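
One way the movie example can go wrong is sketched below. It assumes, purely for illustration, that the bank keeps a lazily updated replica and that Amazon's RPC happens to be served from it; the slide does not say where the stale read comes from. `LazyBank` and its methods are hypothetical names, and amounts are in cents.

```python
class LazyBank:
    """Two copies of the balances; the replica catches up only on sync()."""

    def __init__(self):
        self.primary = {"client": 1000, "amazon": 0}
        self.replica = dict(self.primary)

    def transfer(self, src, dst, cents):
        self.primary[src] -= cents
        self.primary[dst] += cents

    def sync(self):
        self.replica = dict(self.primary)

    def balance(self, account, from_replica):
        store = self.replica if from_replica else self.primary
        return store[account]


bank = LazyBank()
bank.transfer("client", "amazon", 399)               # client's RPC returns
stale = bank.balance("amazon", from_replica=True)    # Amazon's RPC, lagging replica
fresh = bank.balance("amazon", from_replica=False)   # what a linearizable read returns
```

The stale read can still be explained by some serial order (Amazon's read placed before the transfer), but not by one that matches the actual order of events, which is what linearizability demands.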
slide-8
SLIDE 8

Spanner: Google’s Globally-Distributed Database

Wilson Hsieh, representing a host of authors

OSDI 2012

slide-9
SLIDE 9

What is Spanner?

  • Distributed multiversion database
  • General-purpose transactions (ACID)
  • SQL query language
  • Schematized tables
  • Semi-relational data model
  • Running in production
  • Storage for Google’s ad data
  • Replaced a sharded MySQL database


slide-10
SLIDE 10

Example: Social Network

Figure: user posts and friend lists (x1000 each) in four regions: US, Brazil, Russia, Spain. City labels: San Francisco, Seattle, Arizona, Sao Paulo, Santiago, Buenos Aires, Moscow, Berlin, Krakow, London, Paris, Berlin, Madrid, Lisbon.

slide-11
SLIDE 11

Overview

  • Feature: Lock-free distributed read transactions
  • Property: External consistency of distributed transactions

– First system at global scale

  • Implementation: Integration of concurrency control, replication, and 2PC

– Correctness and performance

  • Enabling technology: TrueTime

– Interval-based global time

slide-12
SLIDE 12

Read Transactions

  • Generate a page of friends’ recent posts

– Consistent view of friend list and their posts

Why consistency matters

  1. Remove untrustworthy person X as friend
  2. Post P: “My government is repressive…”

slide-13
SLIDE 13

Single Machine

Figure: “Generate my page” reads Friend1 post, Friend2 post, …, Friend999 post, Friend1000 post from user posts and friend lists on a single machine. Label: Block writes.

slide-14
SLIDE 14

Multiple Machines

Figure: “Generate my page” reads Friend1 post, Friend2 post, …, Friend999 post, Friend1000 post from user posts and friend lists spread over multiple machines. Label: Block writes.

slide-15
SLIDE 15

Multiple Datacenters

Figure: “Generate my page” reads Friend1 post, Friend2 post, …, Friend999 post, Friend1000 post from user posts and friend lists (x1000 each) spread over datacenters in the US, Spain, Russia, and Brazil.

slide-16
SLIDE 16

Version Management

  • Transactions that write use strict 2PL

– Each transaction T is assigned a timestamp s
– Data written by T is timestamped with s

Time   My friends   My posts   X’s friends
<8     [X]                     [me]
8      []                      []
15                  [P]
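
A minimal sketch of timestamped versions, assuming a read at time t returns the newest version whose timestamp is at most t. `MultiVersionStore` is a hypothetical name, and the timestamp 1 stands in for the slide's "<8" rows.

```python
import bisect


class MultiVersionStore:
    """Keeps every version of every key, ordered by commit timestamp."""

    def __init__(self):
        self._versions = {}   # key -> (sorted timestamps, matching values)

    def write(self, key, value, s):
        ts, vals = self._versions.setdefault(key, ([], []))
        i = bisect.bisect_right(ts, s)
        ts.insert(i, s)
        vals.insert(i, value)

    def read(self, key, t):
        """Return the newest value with timestamp <= t, or None if there is none."""
        ts, vals = self._versions.get(key, ([], []))
        i = bisect.bisect_right(ts, t)
        return vals[i - 1] if i else None


db = MultiVersionStore()
db.write("my_friends", ["X"], 1)      # "<8" on the slide; 1 is an assumed value
db.write("x_friends", ["me"], 1)
db.write("my_friends", [], 8)         # unfriend transaction, s = 8
db.write("x_friends", [], 8)
db.write("my_posts", ["P"], 15)       # risky post, s = 15

snapshot_7 = (db.read("my_friends", 7), db.read("my_posts", 7))
snapshot_15 = (db.read("my_friends", 15), db.read("my_posts", 15))
```

No snapshot shows X as a friend together with post P, and the reads take no locks because old versions are never overwritten.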

slide-17
SLIDE 17

Synchronizing Snapshots

Global wall-clock time

== External Consistency: Commit order respects global wall-time order

== Timestamp order respects global wall-time order, given timestamp order == commit order

slide-18
SLIDE 18

Timestamps, Global Clock

  • Strict two-phase locking for write transactions
  • Assign timestamp while locks are held

Figure: timeline of transaction T. Events: acquired locks, pick s = now(), release locks.

slide-19
SLIDE 19

Timestamp Invariants

  • Timestamp order == commit order
  • Timestamp order respects global wall-time order

Figure: transactions T1, T2, T3, T4 laid out against time.

slide-20
SLIDE 20

TrueTime

  • “Global wall-clock time” with bounded uncertainty

Figure: on the time axis, TT.now() returns an interval from earliest to latest, of width 2*ε.
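
A rough sketch of an interval-returning clock. It assumes a fixed ε and an injectable local clock; `TTInterval` and `TrueTime` are illustrative Python names, not Google's API, and the demo uses integer milliseconds.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class TrueTime:
    """Toy TrueTime: the local clock reading plus or minus a fixed epsilon."""

    def __init__(self, epsilon, clock=time.time):
        self.epsilon = epsilon    # assumed constant here; it varies in practice
        self._clock = clock

    def now(self):
        t = self._clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)


# Demo with a frozen clock reading 100000 ms and epsilon = 4 ms.
tt = TrueTime(epsilon=4, clock=lambda: 100_000)
interval = tt.now()
width = interval.latest - interval.earliest
```

The guarantee the rest of the design relies on is that the true time lies somewhere inside the returned interval.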

slide-21
SLIDE 21

Timestamps and TrueTime

Figure: timeline of transaction T. Events: acquired locks, pick s = TT.now().latest, wait until TT.now().earliest > s (commit wait), release locks. Labels: average ε before s, average ε of commit wait after s.
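
Commit wait can be sketched with a simulated clock. Everything here is an assumption made for illustration: `SimulatedTrueTime` is a hypothetical stand-in with a constant ε, time is in whole milliseconds, and the clock only advances when `sleep` is called.

```python
from collections import namedtuple

TTInterval = namedtuple("TTInterval", ["earliest", "latest"])


class SimulatedTrueTime:
    """Hypothetical TrueTime stand-in driven by a manually advanced clock (ms)."""

    def __init__(self, epsilon_ms, start_ms=0):
        self.epsilon_ms = epsilon_ms
        self.clock_ms = start_ms

    def now(self):
        return TTInterval(self.clock_ms - self.epsilon_ms,
                          self.clock_ms + self.epsilon_ms)

    def sleep(self, ms):
        self.clock_ms += ms


def commit(tt, events):
    """Runs while the transaction still holds its locks."""
    s = tt.now().latest                  # pick s = TT.now().latest
    events.append(("pick", s))
    waited_ms = 0
    while tt.now().earliest <= s:        # commit wait
        tt.sleep(1)
        waited_ms += 1
    events.append(("release_locks", tt.clock_ms))
    return s, waited_ms


tt = SimulatedTrueTime(epsilon_ms=4, start_ms=100)
events = []
s, waited_ms = commit(tt, events)
```

With a constant ε the wait is about 2ε: s sits ε ahead of the local clock, and the earliest bound trails the clock by another ε. Once the loop exits, s is guaranteed to be in the past on every node, so any transaction that starts afterwards picks a larger timestamp.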

slide-22
SLIDE 22

Commit Wait and Replication

Figure: timeline of transaction T. Events: acquired locks, pick s, start consensus, achieve consensus, commit wait done, notify slaves, release locks.

slide-23
SLIDE 23

Commit Wait and 2-Phase Commit

Figure: timelines of coordinator TC and participants TP1 and TP2, each running from acquired locks to release locks. Events: compute s for each, start logging, done logging, prepared, send s, compute overall s, commit wait done, committed, notify participants of s.
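
The slides do not spell out how the overall s is computed. A simple reading, consistent with the values on the Example slide (sC=6, sP=8, s=8), is that the coordinator takes the maximum of its own proposal and the participants' prepare timestamps. The sketch below assumes exactly that; `choose_commit_timestamp` is a hypothetical helper, and a production rule may impose further lower bounds.

```python
def choose_commit_timestamp(coordinator_s, prepare_timestamps):
    """Pick one timestamp no smaller than any proposal (assumed rule: the maximum)."""
    return max([coordinator_s, *prepare_timestamps])


# Values from the Example slide: sC = 6, sP = 8.
overall_s = choose_commit_timestamp(6, [8])
```

Every participant then applies its writes at the same overall s, so a snapshot read at any timestamp sees either all of the transaction's writes or none of them.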

slide-24
SLIDE 24

Example

Figure: timeline for the friend-removal example.

  • TC, TP: remove X from my friend list and remove myself from X’s friend list (sC=6, sP=8, s=8)
  • T2: risky post P (s=15)

Time   My friends   My posts   X’s friends
<8     [X]                     [me]
8      []                      []
15                  [P]

slide-25
SLIDE 25

What Have We Covered?

  • Lock-free read transactions across datacenters
  • External consistency
  • Timestamp assignment
  • TrueTime

– Uncertainty in time can be waited out


slide-26
SLIDE 26

What Haven’t We Covered?

  • How to read at the present time
  • Atomic schema changes

– Mostly non-blocking
– Commit in the future

  • Non-blocking reads in the past

– At any sufficiently up-to-date replica


slide-27
SLIDE 27

TrueTime Architecture

Figure: Datacenter 1, Datacenter 2, …, Datacenter n, with GPS timemasters, an atomic-clock timemaster, and a client.

Compute reference [earliest, latest] = now ± ε

slide-28
SLIDE 28

TrueTime implementation

now = reference now + local-clock offset

ε = reference ε + worst-case local-clock drift

Figure: ε versus time (0 sec, 30 sec, 60 sec, 90 sec). Labels: reference uncertainty, 200 μs/sec, +6ms.
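
The +6ms label follows from the drift rate: 200 μs/sec accumulated over 30 sec is 6000 μs = 6 ms. The sketch below works this through, assuming the clock resynchronizes every 30 seconds (suggested by the axis marks) and a reference ε of 1 ms (an illustrative value that is not given on the slide). `epsilon_ms` is a hypothetical helper.

```python
def epsilon_ms(reference_epsilon_ms, seconds_since_sync, drift_us_per_sec=200):
    """epsilon = reference epsilon + worst-case local-clock drift since the last sync."""
    drift_ms = drift_us_per_sec * seconds_since_sync / 1000
    return reference_epsilon_ms + drift_ms


just_after_sync = epsilon_ms(1, 0)     # only the reference uncertainty remains
just_before_sync = epsilon_ms(1, 30)   # 1 ms + 200 us/sec * 30 sec
```

Under these assumptions ε climbs from 1 ms to 7 ms between synchronizations and then drops back, which gives the plot its sawtooth shape.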

slide-29
SLIDE 29

What If a Clock Goes Rogue?

  • Timestamp assignment would violate external consistency
  • Empirically unlikely based on 1 year of data

– Bad CPUs 6 times more likely than bad clocks


slide-30
SLIDE 30

Network-Induced Uncertainty

Figure: Epsilon (ms) versus date, with curves for the 90th, 99th, and 99.9th percentiles. One panel covers Mar 29 to Apr 1; the other covers 6AM to 12PM on April 13.

slide-31
SLIDE 31

Conclusions

  • Reify clock uncertainty in time APIs

– Known unknowns are better than unknown unknowns
– Rethink algorithms to make use of uncertainty

  • Stronger semantics are achievable

– Greater scale != weaker semantics
