I Can't Believe It's Not Causal! Scalable Causal Consistency with No Slowdown Cascades


SLIDE 1

I Can't Believe It's Not Causal!

Scalable Causal Consistency with No Slowdown Cascades

Syed Akbar Mehdi1, Cody Littley1, Natacha Crooks1, Lorenzo Alvisi1,4, Nathan Bronson2, Wyatt Lloyd3

1UT Austin, 2Facebook, 3USC, 4Cornell University

SLIDE 2

Causal Consistency: Great in Theory

  • Lots of exciting research building scalable causal data-stores, e.g.,

[Spectrum: Eventual Consistency → Causal Consistency → Strong Consistency, trading higher performance for stronger guarantees]

  • COPS [SOSP '11]
  • Bolt-On [SIGMOD '13]
  • ChainReaction [EuroSys '13]
  • Eiger [NSDI '13]
  • Orbe [SoCC '13]
  • GentleRain [SoCC '14]
  • Cure [ICDCS '16]
  • TARDiS [SIGMOD '16]

SLIDE 3

Causal Consistency: But in Practice…

The middle child of consistency models

Reality: The largest web apps use eventual consistency, e.g., Espresso (LinkedIn), TAO (Facebook), and Manhattan (Twitter)

SLIDE 4

Key Hurdle: Slowdown Cascades

[Diagram: the implicit assumption of current causal systems (enforce consistency inside the datastore) vs. reality at scale, where a delayed write forces dependent writes on other shards to wait, producing a slowdown cascade]

SLIDE 5

Replicated and sharded storage for a social network

[Figure: shards replicated across Datacenter A and Datacenter B]

SLIDE 6

[Figure: writes issued to shards in Datacenter A and replicated to Datacenter B]

Writes causally ordered as W1 → W2 → W3

SLIDE 7

Current causal systems enforce consistency as a datastore invariant

[Figure: writes W1, W2, W3 replicate to Datacenter B; before applying W2 and W3, the receiving shards buffer them and check whether their dependency W1 has been applied]

SLIDE 8

Alice's advisor unnecessarily waits for Justin Bieber's update despite not reading it

[Figure: W1's replication is delayed, so W2 and W3 stay buffered in Datacenter B; the delay spreads as a slowdown cascade]

SLIDE 9

Alice's advisor unnecessarily waits for Justin Bieber's update despite not reading it

[Figure: the delayed W1 keeps W2 and W3 buffered, and the slowdown cascades to further dependent writes]

Slowdown cascades affect all previous causal systems because they enforce consistency inside the data store

SLIDE 10

Slowdown Cascades in Eiger (NSDI '13)

[Graph: buffered replicated writes vs. replicated writes received, under normal operation and with a slowed node]

Replicated write buffers grow arbitrarily because Eiger enforces consistency inside the datastore

SLIDE 11

OCCULT

Observable Causal Consistency Using Lossy Timestamps

SLIDE 12

Observable Causal Consistency

Causal Consistency guarantees that each client observes a monotonically non-decreasing set of updates (including its own) in an order that respects potential causality between operations.

Key Idea: Don't implement a causally consistent data store. Let clients observe a causally consistent data store.

SLIDE 13

How do clients observe a causally consistent datastore?

[Figure: Datacenter A and Datacenter B]

SLIDE 14

Writes accepted only by master shards and then replicated asynchronously in-order to slaves

[Figure: three shards, each with a master in one datacenter (A or B) and a slave in the other]

SLIDE 15

Each shard keeps track of a shardstamp which counts the writes it has applied

[Figure: each shard's master and slave copies annotated with their current shardstamps (e.g., 7, 4, 8)]

SLIDE 16

Causal Timestamp: Vector of shardstamps which identifies a global state across all shards

[Figure: clients 1-3 each hold a causal timestamp (e.g., [4 3 2] and [6 2 5]) alongside the shards in Datacenters A and B]
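To make the vector representation concrete, here is a minimal, illustrative sketch of a causal timestamp and the element-wise merge the protocol relies on (plain Python, not code from the paper):

```python
# Illustrative sketch: a causal timestamp is a vector with one shardstamp
# per shard; merging takes the element-wise maximum so the result
# dominates everything either timestamp has observed.
from typing import List

def merge(ts_a: List[int], ts_b: List[int]) -> List[int]:
    return [max(x, y) for x, y in zip(ts_a, ts_b)]

client_ts = [4, 3, 2]   # e.g., client 1's causal timestamp
object_ts = [6, 2, 5]   # e.g., timestamp stored with an object
print(merge(client_ts, object_ts))  # [6, 3, 5]
```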

SLIDE 17

Write Protocol: Causal timestamps stored with objects to propagate dependencies

[Figure: client 1 writes object a to a master shard; a is stored together with the client's causal timestamp [4 3 2]]

SLIDE 18

Write Protocol: Server shardstamp is incremented and merged into causal timestamps

[Figure: the master's shardstamp advances from 7 to 8 and is merged into both a's stored timestamp and client 1's causal timestamp, which becomes [8 3 2]]
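A minimal sketch of that write path, assuming one in-memory store per shard; the MasterShard class and its method names are illustrative, not the paper's API:

```python
# Illustrative master-side write: bump the shardstamp, fold it into the
# client's causal timestamp, and store the object with that timestamp.
class MasterShard:
    def __init__(self, shard_id: int):
        self.shard_id = shard_id
        self.shardstamp = 0
        self.store = {}                      # key -> (value, causal timestamp)

    def write(self, key, value, client_ts):
        self.shardstamp += 1
        ts = list(client_ts)
        ts[self.shard_id] = max(ts[self.shard_id], self.shardstamp)
        self.store[key] = (value, ts)
        return ts                            # becomes the client's new causal timestamp
```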

SLIDE 19

Read Protocol: Always safe to read from master

[Figure: client 2 issues a read of a to its master shard]

SLIDE 20

Read Protocol: Object's causal timestamp merged into client's causal timestamp

[Figure: after reading a, client 2's causal timestamp [6 2 5] is merged with a's timestamp [8 3 2], giving [8 3 5]]

SLIDE 21

Read Protocol: Causal timestamp merging tracks causal ordering for writes following reads

[Figure: client 2 now writes object b; b is stored with causal timestamp [8 5 5], so it carries a dependency on a]
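A matching sketch of a client read from a master, reusing the illustrative MasterShard above; it shows why a write issued after the read causally follows everything the read observed:

```python
# Illustrative read from a master: merge the object's causal timestamp
# into the client's, so subsequent writes causally follow this read.
def client_read(client_ts, master, key):
    value, obj_ts = master.store[key]
    new_ts = [max(x, y) for x, y in zip(client_ts, obj_ts)]
    return value, new_ts
```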

SLIDE 22

Replication: Like eventual consistency; asynchronous, unordered, writes applied immediately

[Figure: a's replication to Datacenter B is delayed, while b replicates and is applied immediately]

SLIDE 23

Replication: Slaves increment their shardstamps using causal timestamp of a replicated write

[Figure: the slave applying b advances its shardstamp to 5, the entry for its own shard in b's causal timestamp, even though a is still delayed]
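A minimal sketch of the slave-side replication path in the same illustrative model:

```python
# Illustrative slave-side replication: apply the write immediately (no
# dependency buffering) and advance the shardstamp to this shard's entry
# in the replicated write's causal timestamp.
class SlaveShard:
    def __init__(self, shard_id: int):
        self.shard_id = shard_id
        self.shardstamp = 0
        self.store = {}

    def apply_replicated(self, key, value, obj_ts):
        self.store[key] = (value, obj_ts)
        self.shardstamp = max(self.shardstamp, obj_ts[self.shard_id])
```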

SLIDE 24

Read Protocol: Clients do consistency check when reading from slaves

[Figure: client 3 reads from a slave in Datacenter B and compares the slave's shardstamp against the matching entry of its own causal timestamp (≥ ?)]

SLIDE 25

Read Protocol: Clients do consistency check when reading from slaves

b's dependencies are delayed, but we can read it anyway!

[Figure: client 3 reads b from the slave; the check passes even though a, one of b's dependencies, has not yet been replicated]

SLIDE 26

Read Protocol: Clients do consistency check when reading from slaves

[Figure: client 3 then reads from the slave holding a's shard; its shardstamp 7 fails the check against the entry 8 in the client's causal timestamp: stale shard!]

SLIDE 27

Read Protocol: Resolving stale reads

Options:

  • 1. Retry locally
  • 2. Read from master

[Figure: the stale slave read can be retried locally or redirected to the master shard in the other datacenter]
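Putting the last few slides together, a hypothetical client-side helper for slave reads might look like the following; the retry bound and the fallback to client_read (from the earlier sketch) are assumptions, not the paper's implementation:

```python
# Illustrative slave read with the client-side consistency check: the
# read is usable only if the slave's shardstamp has caught up with the
# client's causal-timestamp entry for that shard; otherwise retry
# locally a few times, then fall back to the master.
def read_from_slave(client_ts, slave, master, key, max_retries=3):
    for _ in range(max_retries):
        value, obj_ts = slave.store[key]
        if slave.shardstamp >= client_ts[slave.shard_id]:
            return value, [max(x, y) for x, y in zip(client_ts, obj_ts)]
        # Shard is stale; retry locally (possibly after a short delay).
    return client_read(client_ts, master, key)   # resolve at the master
```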

SLIDE 28

Causal Timestamp Compression

  • What happens at scale when the number of shards is (say) 100,000?

[Example: a causal timestamp such as [400 234 23 87 9 102 78 …] needs one entry per shard, so Size(Causal Timestamp) == 100,000?]

SLIDE 29

Causal Timestamp Compression: Strawman

  • To compress down to n entries, conflate shardstamps whose shard ids are equal modulo n

[Example: [1000 89 13 209] compresses to [1000 209]]
  • Problem: False Dependencies
  • Solution:
  • Use system clock as the next value of shardstamp on a write
  • Decouples shardstamp value from number of writes on each shard
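A minimal sketch of the strawman's modulo compression, matching the example above (illustrative code, not from the paper):

```python
# Illustrative modulo compression: conflate shards whose ids are equal
# modulo n, keeping the maximum shardstamp in each bucket.
def compress_modulo(full_ts, n):
    compressed = [0] * n
    for shard_id, stamp in enumerate(full_ts):
        compressed[shard_id % n] = max(compressed[shard_id % n], stamp)
    return compressed

print(compress_modulo([1000, 89, 13, 209], 2))  # [1000, 209]
```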
SLIDE 30

Causal Timestamp Compression: Strawman

  • To compress down to n entries, conflate shardstamps whose shard ids are equal modulo n

[Example: [1000 89 13 209] compresses to [1000 209]]

  • Problem: Modulo arithmetic still conflates unrelated shardstamps
SLIDE 31

Causal Timestamp Compression

  • Insight: Recent shardstamps are more likely to create false dependencies
  • Use high resolution for recent shardstamps and conflate the rest (see the sketch below)

[Example: recent shardstamps 4000, 3989, 3880, 3873, 3723 keep explicit shard IDs (45, 89, 34, 402, 123), while a single catch-all shardstamp 3678 covers all remaining shards (*)]

  • 0.01 % false dependencies with just 4 shardstamps and 16K logical shards
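A small, illustrative sketch of this temporal compression; it assumes shardstamps are loosely synchronized clock values, so larger means more recent, and the function names are not from the paper:

```python
# Illustrative temporal compression: keep the n-1 most recent (largest)
# shardstamps explicitly with their shard ids, and conflate every other
# shard into a single catch-all shardstamp (their maximum).
def compress_temporal(full_ts, n=4):
    ranked = sorted(range(len(full_ts)), key=lambda s: full_ts[s], reverse=True)
    explicit = {s: full_ts[s] for s in ranked[: n - 1]}
    catch_all = max((full_ts[s] for s in ranked[n - 1:]), default=0)
    return explicit, catch_all

def lookup(compressed, shard_id):
    explicit, catch_all = compressed
    return explicit.get(shard_id, catch_all)   # conflated shards share the catch-all
```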
SLIDE 32

Transactions in OCCULT

Scalable causally consistent general purpose transactions

SLIDE 33

Properties of Transactions

  • A. Atomicity
  • B. Read from a causally consistent snapshot
  • C. No concurrent conflicting writes

SLIDE 34

Properties of Transactions

  • A. Observable Atomicity
  • B. Observably read from a causally consistent snapshot
  • C. No concurrent conflicting writes

SLIDE 35

Properties of Transactions

  • A. Observable Atomicity
  • B. Observably read from a causally consistent snapshot
  • C. No concurrent conflicting writes

Properties of Protocol

  • 1. No centralized timestamp authority (e.g. per-datacenter)

§ Transactions ordered using causal timestamps

  • 2. Transaction commit latency is independent of the number of replicas

SLIDE 36

Properties of Transactions

  • A. Observable Atomicity
  • B. Observably read from a causally consistent snapshot
  • C. No concurrent conflicting writes

Three-Phase Protocol

  • 1. Read Phase

§ Buffer writes at client

  • 2. Validation Phase

§ Client validates A, B and C using causal timestamps

  • 3. Commit Phase

§ Buffered writes committed in an observably atomic way
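A minimal client-side sketch of the three phases; the Transaction class and the validate/commit helpers it calls are illustrative assumptions layered on the earlier sketches, not the paper's interface:

```python
# Illustrative three-phase transaction: reads are tracked in a read set,
# writes are buffered, and commit runs validation before making the
# buffered writes visible atomically.
class Transaction:
    def __init__(self, client):
        self.client = client
        self.read_set = {}       # key -> causal timestamp observed at read time
        self.write_buffer = {}   # key -> value (not yet visible to anyone)

    def read(self, key):                       # 1. Read phase
        value, obj_ts = self.client.read(key)
        self.read_set[key] = obj_ts
        return value

    def write(self, key, value):
        self.write_buffer[key] = value

    def commit(self):
        if not self.client.validate(self.read_set, self.write_buffer):   # 2. Validation phase
            raise RuntimeError("transaction aborted: validation failed")
        return self.client.commit_atomically(self.write_buffer)          # 3. Commit phase
```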

SLIDE 37

Alice and her advisor are managing lists of students for three courses

[Figure: three course lists replicated across Datacenters A and B: a = [], b = [Bob], c = [Cal]; all shardstamps are 1]

SLIDE 38

Observable atomicity and causally consistent snapshot reads enforced by single mechanism

[Figure: same setup as before: a = [], b = [Bob], c = [Cal]]

SLIDE 39

Transaction T1: Alice adding Abe to course a

Start T1
r(a) = []
w(a = [Abe])

[Figure: the same three course lists; T1 has not yet committed]

SLIDE 40

Transaction T1: After Commit

Start T1
r(a) = []
w(a = [Abe])
Commit T1

[Figure: a = [Abe] on its master shard, whose shardstamp has advanced; the slave copy in the other datacenter still shows a = []]

SLIDE 41

Transaction T2: Alice moving Bob from course b to course c

Start T2
r(b) = [Bob]
r(c) = [Cal]

[Figure: T2 reads b and c after T1 has committed]

SLIDE 42

Observable Atomicity: Make writes causally dependent on each other

[Figure: T2 continues; its buffered writes to b and c will share causal dependencies]

SLIDE 43

Observable Atomicity: Same commit timestamp makes writes causally dependent on each other

Start T2
r(b) = [Bob]
r(c) = [Cal]
w(b = [])
w(c = [Bob, Cal])
Commit T2

[Figure: after T2 commits in Datacenter A, the masters hold a = [Abe], b = [], c = [Bob, Cal]; b and c are stamped with the same commit timestamp]
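A minimal sketch of such a commit step, built on the earlier illustrative MasterShard model (the function and its structure are assumptions, not the paper's implementation):

```python
# Illustrative observably atomic commit: give every buffered write the
# same commit timestamp, so a client that observes any one of them ends
# up with a causal timestamp that covers all of them.
def commit_atomically(masters, write_buffer, client_ts):
    commit_ts = list(client_ts)
    for key in write_buffer:                       # first pass: build the shared timestamp
        shard = masters[key]
        shard.shardstamp += 1
        commit_ts[shard.shard_id] = max(commit_ts[shard.shard_id], shard.shardstamp)
    for key, value in write_buffer.items():        # second pass: install with that timestamp
        masters[key].store[key] = (value, commit_ts)
    return commit_ts
```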

SLIDE 44

Transaction writes replicate asynchronously

[Figure: c's write reaches Datacenter B (c = [Bob, Cal]), but the replication of the writes to a and b is delayed]

SLIDE 45

Alice's advisor reads the lists in a transaction

Start T3

[Figure: in Datacenter B, c = [Bob, Cal] has been applied while the writes to a and b are still delayed]

SLIDE 46

Transactions maintain a Read Set to validate atomicity and read from causal snapshot

Start T3
r(b) = [Bob]

[Figure: T3 reads the old b = [Bob] from Datacenter B (the write emptying b is delayed) and records it, with its causal timestamp, in the Read Set]

SLIDE 47

Transactions maintain a Read Set to validate atomicity and read from causal snapshot

Start T3
r(b) = [Bob]
r(c) = [Bob, Cal]

[Figure: T3 then reads the new c = [Bob, Cal]; both reads and their causal timestamps are now in the Read Set]

SLIDE 48

Validation failure: c knows of more writes from b's shard than had been applied when b was read

Start T3
r(b) = [Bob]
r(c) = [Bob, Cal]

[Figure: c's causal timestamp has a larger entry for b's shard than the shardstamp observed when T3 read b, so Read Set validation fails: T3 saw the new c but the old b]
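An illustrative version of that Read Set check; the read-set layout is an assumption chosen to make the rule explicit:

```python
# Illustrative Read Set validation: for every object read, no other
# object in the read set may "know of" more writes on that object's
# shard than had been applied there when the object was read.
def validate_read_set(read_set):
    # read_set: key -> (shard_id, shardstamp seen at read time, object's causal timestamp)
    for _, (shard_id, seen_stamp, _) in read_set.items():
        for _, (_, _, other_ts) in read_set.items():
            if other_ts[shard_id] > seen_stamp:
                return False        # e.g., c depends on a newer b than the one we read
    return True
```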

SLIDE 49

Ordering violation: Detected in the usual way; the shard holding a is stale!

Start T3
r(b) = [Bob]
r(c) = [Bob, Cal]
r(a) = []

[Figure: T3 also reads a = [] from Datacenter B; the per-read consistency check against the client's causal timestamp flags a's shard as stale]

SLIDE 50

Properties of Transactions

  • A. Observable Atomicity
  • B. Observably read from a causally consistent snapshot
  • C. No concurrent conflicting writes

Three-Phase Protocol

  • 1. Read Phase

§ Buffer writes at client

  • 2. Validation Phase

§ Client validates A, B and C using causal timestamps

  • 3. Commit Phase

§ Buffered writes committed in an observably atomic way

  • 2. Validation Phase
  • a. Validate Read Set to verify A and B
  • b. Validate Overwrite Set to verify C
SLIDE 51

Evaluation

SLIDE 52

Evaluation Setup

  • Occult implemented by modifying Redis Cluster (baseline)
  • Evaluated on CloudLab
  • Two datacenters in WI and SC
  • 20 server machines (4 server processes per machine)
  • 16K logical shards
  • YCSB used as the benchmark
  • For the graphs shown here: a read-heavy (95% reads) workload with a zipfian distribution
  • We show the cost of providing consistency guarantees
SLIDE 53

Goodput Comparison

[Graph: goodput (million ops/s) vs. number of ops per transaction (Tsize) for Occult Transactions, Occult Single-Key, and the Redis Cluster baseline, using 4 shardstamps per causal timestamp; differences of 8.7%, 31%, and 39.6% are annotated]

SLIDE 54

Effect of slow nodes on Occult Latency

[Graph: log10 of latency (us) at the 50th, 75th, 90th, 95th, and 99th percentiles with 2, 4, and 6 slow nodes; annotated points include 280us, 390us, 800us, 1.6ms, 3.7ms, and 47.1ms]

SLIDE 55

Conclusions

  • Enforcing causal consistency in the data store is vulnerable to slowdown cascades

  • Sufficient to ensure that clients observe causal consistency:
  • Use lossy timestamps to provide the guarantee
  • Avoid slowdown cascades
  • Observable enforcement can be extended to causally consistent transactions
  • Make writes causally dependent on each other to observe atomicity
  • Also avoids slowdown cascades