A Brief History Of Time In Riak
SLIDE 1

A Brief History Of Time

In Riak

SLIDE 2

Time in Riak

✴Logical Time ✴Logical Clocks ✴Implementation details

SLIDE 3

Mind the Gap

How a venerable, established, simple data structure/algorithm was botched multiple times.

SLIDE 4

Order of Events

✴Dynamo And Riak ✴Temporal and Logical Time ✴Logical Clocks of Riak Past ✴Now

SLIDE 5

Why Riak?

SLIDE 6

Scale Up

$$$Big Iron (still fails)

SLIDE 7

Scale Out

Commodity Servers CDNs, App servers DATABASES!!

SLIDE 8

Fundamental Trade Off

  • Lipton/Sandberg ’88
  • Attiya/Welch ’94
  • Gilbert/Lynch ’02

Low Latency/Availability:

  • Increased Revenue
  • User Engagement

Strong Consistency:

  • Easier for Programmers
  • Less user “surprise”
SLIDE 9

Consistency

There must exist a total order on all operations such that each operation looks as if it were completed at a single instant. This is equivalent to requiring requests of the distributed shared memory to act as if they were executing on a single node, responding to operations one at a time.

- Gilbert & Lynch
SLIDE 10

Consistency

One important property of an atomic read/write shared memory is that any read operation that begins after a write operation completes must return that value, or the result of a later write operation. This is the consistency guarantee that generally provides the easiest model for users to understand, and is most convenient for those attempting to design a client application that uses the distributed service.

- Gilbert & Lynch
SLIDE 11

https://aphyr.com/posts/313-strong-consistency-models

SLIDE 12

Replica A Replica B Replica C Client X Client Y

PUT “sue” PUT “bob” NO!!!! :(

Consistent

SLIDE 13

Availability

Any non-failing node can respond to any request.

- Gilbert & Lynch
SLIDE 14

Replica A Replica B Replica C Client X Client Y

PUT “sue” PUT “bob” NO!!!! :(

Consistent

SLIDE 15

Consensus for a total order of events
SLIDE 16

Requires a quorum

SLIDE 17

Coordination waits

SLIDE 18

Replica A Replica B Replica C Client X Client Y

PUT “sue” PUT “bob”

Consistent

SLIDE 19

Client X put “BOB” Client Y put “SUE”

Events put in a TOTAL ORDER

SLIDE 20

https://aphyr.com/posts/313-strong-consistency-models

SLIDE 21

Eventual Consistency

Eventual consistency is a consistency model used in distributed computing that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.

- Wikipedia
SLIDE 22

Replica A Replica B Replica C Client X Client Y

PUT “sue”

C’

PUT “bob”

A’ B’

Available

SLIDE 23

Availability

When serving reads and writes matters more than consistency of data. Deferred consistency.

SLIDE 24

Fault Tolerance

SLIDE 25

Low Latency

SLIDE 26

Low Latency

Amazon found every 100ms of latency cost them 1% in sales.

SLIDE 27

Low Latency

Google found an extra 0.5 seconds in search page generation time dropped traffic by 20%.

SLIDE 28

Replica A Replica B Replica C Client X Client Y

PUT “sue”

C’

PUT “bob”

A’ B’

Available

SLIDE 29

Optimistic replication

SLIDE 30

No coordination - lower latency

SLIDE 31

Replica A Replica B Replica C Client X Client Y

PUT “sue” PUT “bob”

Low Latency

[c1] “sue” [c1] “sue” [a1] “bob”

SLIDE 32

How Do We Order Updates?

SLIDE 33

"'Time,' he said, 'is what keeps everything from happening at once.'"

- "The Giant Anthology of Science Fiction", edited by Leo Margulies and Oscar Jerome Friend, 1954 (Google Book Search, p.148)

SLIDE 34

Temporal Clocks

POSIX time number line:

0 = Thursday, 1 January 1970; 129880800 = My Birthday; 1394382600 = Now-ish

SLIDE 35

Light Cone!

By SVG version: K. Aainsqatsi at en.wikipediaOriginal PNG version: Stib at en.wikipedia - Transferred from en.wikipedia to Commons.(Original text: self-made), CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=2210907

SLIDE 36

Physics Problem

SF to NY: 4,148 km (14 ms at light speed, 21 ms over fibre). PUT "bob" @ 1394382600000, PUT "sue" @ 1394382600020

SLIDE 37

Temporal Clocks

✴CAN

  • A could NOT have caused B
  • A could have caused B

✴CAN’T

  • A caused B
SLIDE 38

Dynamo

The Shopping Cart

SLIDE 39
SLIDE 40
SLIDE 41

1 CLIENT

2 REPLICAS

1 KEY

SLIDE 42

Optimistic replication

SLIDE 43

No coordination - lower latency

SLIDE 44

GET PUT UPDATE REPLICATE

SLIDE 45

PUT PUT

SLIDE 46

Quorum

SLIDE 47

GET

A

PUT

A

GET

B

PUT

B

TEMPORAL TIME

SLIDE 48

155196119890 > 155196118001

Timestamp - total order

SLIDE 49
SLIDE 50

Logical clock - partial order

SLIDE 51
SLIDE 52

Time, Clocks, and the Ordering of Events

  • Logical Time
  • Causality
  • A influenced B
  • A and B happened at the same time
  • Per-process clocks, only tick when something happens

Leslie Lamport http://dl.acm.org/citation.cfm?id=359563

SLIDE 53

Detection of Mutual Inconsistency in Distributed Systems

Version Vectors - updates to a data item

http://zoo.cs.yale.edu/classes/cs426/2013/bib/parker83detection.pdf

SLIDE 54

Version Vectors or Vector Clocks?

http://haslab.wordpress.com/2011/07/08/version-vectors-are-not-vector-clocks/

version vectors - updates to a data item

SLIDE 55

Summary

  • Distributed systems exist (scale out)
  • There is a trade off (Consistency/Availability)
  • To decide on a value we need to “order” updates
  • Temporal time is inadequate
  • Logical time can help
SLIDE 56

Version Vectors

A C B

SLIDE 57

Version Vectors

A C B

{a, 1} {b, 1} {c, 1} {a, 2}

[ ] {a, 2}, {b, 1}, {c, 1}

SLIDE 58

Version Vectors

A C B

{a, 1} {b, 1} {c, 1} {a, 2}

[ ] {a, 2}, {b, 1}, {c, 1}

SLIDE 59

Version Vectors

A C B

{a, 1} {b, 1} {c, 1} {a, 2}

[ ] {a, 2}, {b, 1}, {c, 1}

SLIDE 60

Version Vectors

A C B

{a, 1} {b, 1} {c, 1} {a, 2}

[ ] {a, 2}, {b, 1}, {c, 1}

SLIDE 61

Version Vectors

A C B

{a, 1} {b, 1} {c, 1} {a, 2}

[ ] {a, 2}, {b, 1}, {c, 1}

SLIDE 62

Version Vectors

[{a,2}, {b,1}, {c,1}]

SLIDE 63

Version Vectors Update

[{a,2}, {b,1}, {c,1}]

SLIDE 64

Version Vectors Update

[{a,2}, {b,2}, {c,1}]

SLIDE 65

Version Vectors Update

[{a,2}, {b,3}, {c,1}]

SLIDE 66

Version Vectors Update

[{a,2}, {b,3}, {c,2}]
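The update sequence above can be sketched in Python; dicts of actor -> counter stand in for the `[{actor, count}]` pairs on the slides (an illustration, not Riak's Erlang structures):

```python
def increment(clock, actor):
    """Bump `actor`'s counter in a version vector (dict of actor -> count)."""
    new = dict(clock)
    new[actor] = new.get(actor, 0) + 1
    return new

clock = {"a": 2, "b": 1, "c": 1}
clock = increment(clock, "b")  # [{a,2}, {b,2}, {c,1}]
clock = increment(clock, "b")  # [{a,2}, {b,3}, {c,1}]
clock = increment(clock, "c")  # [{a,2}, {b,3}, {c,2}]
```

Each actor only ever bumps its own entry, so an entry is a compact count of that actor's events.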

SLIDE 67

Version Vectors Descends

✴A descends B : A >= B
✴A has seen all that B has
✴A summarises at least the same history as B

SLIDE 68

Version Vectors Descends

[{a,2}, {b,3}, {c,2}] >= [{a,2}, {b,3}, {c,2}]
[{a,2}, {b,3}, {c,2}] >= []
[{a,4}, {b,3}, {c,2}] >= [{a,2}, {b,3}, {c,2}]
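A minimal Python sketch of the descends check, assuming clocks are dicts of actor -> counter (illustrative, not Riak's implementation):

```python
def descends(a, b):
    """A descends B (A >= B): for every actor in B, A's counter is at least B's."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())

# The examples above:
assert descends({"a": 2, "b": 3, "c": 2}, {"a": 2, "b": 3, "c": 2})  # equal clocks
assert descends({"a": 2, "b": 3, "c": 2}, {})                        # everything descends []
assert descends({"a": 4, "b": 3, "c": 2}, {"a": 2, "b": 3, "c": 2})
```

Note that `descends` is reflexive: every clock descends itself and the empty clock.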

SLIDE 69

Version Vectors Dominates

✴A dominates B : A > B
✴A has seen all that B has, and at least one more event
✴A summarises a greater history than B

SLIDE 70

Version Vectors Dominates

[{a,1}] > []
[{a,4}, {b,3}, {c,2}] > [{a,2}, {b,3}, {c,2}]
[{a,5}, {b,3}, {c,5}, {d,1}] > [{a,2}, {b,3}, {c,2}]

SLIDE 71

Version Vectors Concurrent

✴A concurrent with B : A | B
✴A does not descend B AND B does not descend A
✴A and B summarise disjoint events
✴A contains events unseen by B AND B contains events unseen by A

SLIDE 72

Version Vectors Concurrent

[{a,1}] | [{b,1}]
[{a,4}, {b,3}, {c,2}] | [{a,2}, {b,4}, {c,2}]
[{a,5}, {b,3}, {c,5}, {d,1}] | [{a,2}, {b,4}, {c,2}, {e,1}]
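Dominates and concurrent both reduce to the descends check; a Python sketch with dict clocks (illustrative only):

```python
def descends(a, b):
    """A >= B: A's counter for every actor in B is at least B's."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())

def dominates(a, b):
    """A > B: A descends B and summarises at least one event B has not seen."""
    return descends(a, b) and not descends(b, a)

def concurrent(a, b):
    """A | B: neither clock descends the other."""
    return not descends(a, b) and not descends(b, a)

assert concurrent({"a": 1}, {"b": 1})
assert dominates({"a": 4, "b": 3, "c": 2}, {"a": 2, "b": 3, "c": 2})
assert not concurrent({"a": 1}, {"a": 1})  # equal clocks are not concurrent
```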

SLIDE 73
SLIDE 74

Logical Clocks

happens before / concurrent
divergent / convergent

SLIDE 75

Version Vectors Merge

✴A merge with B : A ⊔ B
✴A ⊔ B = C
✴C >= A and C >= B
✴If A | B, C > A and C > B
✴C summarises all events in A and B
✴Pairwise max of counters in A and B

SLIDE 76

Version Vectors Merge

[{a,1}] [{b,1}]

[{a,4}, {b,3}, {c,2}] [{a,2}, {b,4}, {c,2}] [{a,5}, {b,3}, {c,5}, {d, 1}] [{a,2}, {b,4}, {c,2}, {e,1}]

SLIDE 77

Version Vectors Merge

[{a,1}, {b,1}]
[{a,4}, {b,4}, {c,2}]
[{a,5}, {b,4}, {c,5}, {d,1}, {e,1}]
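Merge is just the pairwise max of counters; a Python sketch with dict clocks (illustrative, not Riak's code):

```python
def merge(a, b):
    """A ⊔ B: pairwise max of counters, the least clock that descends both."""
    return {actor: max(a.get(actor, 0), b.get(actor, 0))
            for actor in a.keys() | b.keys()}

assert merge({"a": 1}, {"b": 1}) == {"a": 1, "b": 1}
assert merge({"a": 4, "b": 3, "c": 2},
             {"a": 2, "b": 4, "c": 2}) == {"a": 4, "b": 4, "c": 2}
```

Merge is commutative, associative and idempotent, which is what lets replicas converge regardless of delivery order.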

SLIDE 78

Syntactic Merging

✴Discarding “seen” information ✴Retaining concurrent values ✴Merging divergent clocks

SLIDE 79

Temporal vs Logical

[{a,4}, {b,3}, {c,2}] [{a,2}, {b,3}, {c,2}] A B “Bob” “Sue”

?

SLIDE 80

Temporal vs Logical

[{a,4}, {b,3}, {c,2}] [{a,2}, {b,3}, {c,2}] A B “Bob” “Sue” Bob

SLIDE 81

Temporal vs Logical

1429533664000 A B “Bob” “Sue” ? 1429533662000

SLIDE 82

Temporal vs Logical

1429533664000 A B “Bob” “Sue”Bob? 1429533662000

SLIDE 83

Temporal vs Logical

[{a,4}, {b,3}, {c,2}] A B “Bob” “Sue”

?

[{a,2}, {b,4}, {c,2}]

SLIDE 84

Temporal vs Logical

[{a,4}, {b,3}, {c,2}] A B “Bob” “Sue”

[Bob, Sue]

[{a,2}, {b,4}, {c,2}]

SLIDE 85

Temporal vs Logical

1429533664000 A B “Bob” “Sue” ? 1429533664001

SLIDE 86

Temporal vs Logical

1429533664000 A B “Bob” “Sue”Sue? 1429533664001

SLIDE 87

Summary

  • Eventually Consistent Systems allow concurrent updates
  • Temporal timestamps can’t capture concurrency
  • Logical clocks (Version Vectors) can
  • Version Vectors are easy
SLIDE 88

History Repeating

“Those who cannot remember the past are condemned to repeat it.”

- George Santayana

SLIDE 89

Terms

  • Local value - stored on disk at some replica
  • Incoming value - sent as part of a PUT or replication
  • Local clock - the Version Vector of the local value
  • Incoming clock - the Version Vector of the incoming value

SLIDE 90

Riak Version Vectors

Who’s the actor?

SLIDE 91

Riak 0.n Client Side IDs

  • Client Code Provides ID
  • Riak increments Clock at API boundary
  • Riak syntactic merge and stores object
  • Read, Resolve, Rinse, Repeat.
SLIDE 92
SLIDE 93

Client Riak API Riak Vnode

SLIDE 94

Riak Vnode

SLIDE 95

Conflict Resolution

  • Client reads merged clock + sibling values
  • sends new value + clock
  • new clock descends old (eventually!)
  • Store single value
SLIDE 96

Client Version Vector

What Level of Consistency Do We Require?

SLIDE 97

https://aphyr.com/posts/313-strong-consistency-models

SLIDE 98

GET

A

PUT

A

GET

B

PUT

B

TEMPORAL TIME

SLIDE 99

RYOW

  • Invariant: strictly increasing events per actor.
  • PW+PR > N
  • Availability cost
  • Bug made it impossible!
SLIDE 100

Client VClock

  • Read not_found []
  • store “bob” [{c, 1}]
  • read “bob” [{c, 1}]
  • store [“bob”, “sue”] [{c, 2}]
SLIDE 101

Client VClock

  • Read not_found []
  • store “bob” [{c, 1}]
  • read not_found []
  • store “sue” [{c, 1}]
SLIDE 102

Riak Vnode

SLIDE 103

Client VClock

  • If local clock ([{c, 1}]) descends incoming clock ([{c, 1}])
  • discard incoming value
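Replayed as a Python sketch (dict clocks, hypothetical client actor `c`), the stale read makes the genuinely new value look already seen:

```python
def descends(a, b):
    """A >= B: A has seen every event that B summarises."""
    return all(a.get(k, 0) >= v for k, v in b.items())

# Client actor c stored "bob" after reading not_found: clock [{c,1}].
local_clock, local_value = {"c": 1}, "bob"

# A stale read also returned not_found, so the client re-issued the
# SAME event {c,1} for a different value: PUT "sue" with clock [{c,1}].
incoming_clock, incoming_value = {"c": 1}, "sue"

# The vnode's rule fires: local descends incoming, so "sue" is
# silently discarded as "seen" data. That is the lost update.
assert descends(local_clock, incoming_clock)
```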
SLIDE 104

Client Side ID RYOW

  • Read a Stale clock
  • Re-issue the same OR lower event again
  • No total order for a single actor
  • Each event is not unique
  • System discards as “seen” data that is new
SLIDE 105

Client Side IDs Bad

  • Unique actor ID: a database invariant enforced by the client!
  • Actor Explosion (Charron-Bost)
  • No. Entries == No. Actors
  • Client Burden
  • RYOW required - Availability Cost
SLIDE 106

Riak Version Vectors

Who’s the actor?

SLIDE 107

Vnode Version Vectors Riak 1.n

  • No more Version Vector, just say Context
  • The Vnode is the Actor
  • Vnodes act serially
  • Store the clock with the Key
  • Coordinating Vnode, increments clock
  • Deliberate false concurrency
SLIDE 108

Vnode VClocks False Concurrency

C1 C2

RIAK

GET Foo GET Foo

SLIDE 109

Vnode VClocks False Concurrency

C1 C2

RIAK

[{a,1},{b,4}]->”bob” [{a,1},{b,4}]->”bob”

SLIDE 110

Vnode VClocks False Concurrency

C1 C2

RIAK

PUT [{a,1},{b,4}]=“Rita” PUT [{a,1},{b,4}]=“Sue”

SLIDE 111

Vnode VClocks False Concurrency

C1 C2 PUTFSM1 PUTFSM2

VNODE Q

RITA SUE

VNODE

SLIDE 112

Vnode VClocks False Concurrency

VNODE Q

RITA

VNODE a

[{a,2},{b,4}]=“SUE” [{a,1},{b,4}]

SLIDE 113

Vnode VClocks False Concurrency

VNODE Q

[{a,3},{b,4}]=[RITA,SUE]

VNODE a

[{a,2},{b,4}]=“SUE”

SLIDE 114

Client Riak API Riak API Coordinator

SLIDE 115

Vnode VV Coordinator

SLIDE 116

Vnode VV - Coordinator

  • If incoming clock descends local
  • Increment clock
  • Write incoming as sole value
  • Replicate
SLIDE 117

Vnode VV - Coordinator

  • If incoming clock does not descend local
  • Merge clocks
  • Increment Clock
  • Add incoming value as sibling
  • Replicate
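Both coordinator rules can be sketched together in Python; dict clocks and a list of sibling values are illustrative stand-ins for Riak's Erlang data:

```python
def descends(a, b):
    """A >= B: A has seen every event that B summarises."""
    return all(a.get(k, 0) >= v for k, v in b.items())

def merge(a, b):
    """Pairwise max of counters."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def increment(clock, actor):
    new = dict(clock)
    new[actor] = new.get(actor, 0) + 1
    return new

def coordinate_put(actor, local_clock, local_siblings, in_clock, in_value):
    """Vnode-as-actor coordination: `actor` is the coordinating vnode's id."""
    if descends(in_clock, local_clock):
        # Incoming has seen everything local: new value supersedes it all.
        return increment(in_clock, actor), [in_value]
    # Otherwise the write is concurrent with local state: keep both.
    return increment(merge(in_clock, local_clock), actor), local_siblings + [in_value]
```

For example, a first write with an empty context yields clock `{"a": 1}` and the sole value; a second empty-context write does not descend that clock, so it lands as a sibling.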
SLIDE 118

Vnode VV - Replica

SLIDE 119

Vnode VClock GOOD

  • Far fewer actors
  • Way simpler
  • Empty context PUTs are siblings
SLIDE 120

Vnode VClock BAD

  • Possible latency cost of forward
  • No more idempotent PUTs
  • Store a SET of siblings, not LIST
  • Sibling Explosion
  • As a result of too much false concurrency
SLIDE 121

Sibling Explosion

  • False concurrency cost
  • Many many siblings
  • Large object
  • Death
SLIDE 122

Sibling Explosion

  • Data structure
  • Clock + Set of Values
  • False Concurrency
SLIDE 123

Sibling Explosion

SLIDE 124

Sibling Explosion

C1 C2

RIAK

GET Foo GET Foo

SLIDE 125

Sibling Explosion

C1 C2

RIAK

not_found not_found

SLIDE 126

Sibling Explosion

C1

RIAK

PUT []=“Rita” [{a,1}]->”Rita”

SLIDE 127

Sibling Explosion

C2

RIAK

PUT []=“Sue” [{a,2}]->[”Rita”, “Sue”]

SLIDE 128

Sibling Explosion

C1

RIAK

PUT [{a, 1}]=“Bob” [{a,3}]->[”Rita”, “Sue”, “Bob”]

SLIDE 129

Sibling Explosion

C2

RIAK

PUT [{a,2}]=“Babs” [{a,4}]->[”Rita”, “Sue”, “Bob”, “Babs”]
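The four PUTs above, replayed against a toy Python model of the coordinating vnode `a` (an illustration, not Riak's code), end with one clock and four siblings:

```python
def descends(a, b):
    return all(a.get(k, 0) >= v for k, v in b.items())

def merge(a, b):
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def vnode_put(clock, siblings, ctx, value, actor="a"):
    """Stale contexts never descend the local clock, so each lands as a sibling."""
    if descends(ctx, clock):
        new = dict(ctx)
        new[actor] = new.get(actor, 0) + 1
        return new, [value]
    new = merge(ctx, clock)
    new[actor] = new.get(actor, 0) + 1
    return new, siblings + [value]

clock, sibs = {}, []
clock, sibs = vnode_put(clock, sibs, {}, "Rita")        # [{a,1}] -> [Rita]
clock, sibs = vnode_put(clock, sibs, {}, "Sue")         # [{a,2}] -> [Rita, Sue]
clock, sibs = vnode_put(clock, sibs, {"a": 1}, "Bob")   # [{a,3}] -> [Rita, Sue, Bob]
clock, sibs = vnode_put(clock, sibs, {"a": 2}, "Babs")  # [{a,4}] -> [Rita, Sue, Bob, Babs]
assert clock == {"a": 4} and sibs == ["Rita", "Sue", "Bob", "Babs"]
```

"Bob" was meant to replace "Rita", and "Babs" to replace "Sue", yet nothing is ever removed: that is the explosion.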

SLIDE 130

Vnode VClock

  • Trick to “dodge” the Charron-Bost result
  • Engineering, not academic
  • Tested (quickchecked in fact!)
  • “Action at a distance”
SLIDE 131

Dotted Version Vectors

Dotted Version Vectors: Logical Clocks for Optimistic Replication http://arxiv.org/abs/1011.5808

SLIDE 132

Vnode VClocks + Dots Riak 2.n

  • What even is a dot?
  • That “event” we saw back at the start

A

{a, 1} {a, 2}

SLIDE 133

Oh Dot all the Clocks

✴Data structure

  • Clock + List of Dotted Values

[{{a, 1}, “bob”}, {{a, 2}, “Sue”}]

SLIDE 134

Vnode VClock

✴If incoming clock descends local

  • Increment clock
  • Get Last Event as dot (eg {a, 3})
  • Write incoming as sole value + Dot
  • Replicate
SLIDE 135

Vnode VClock

✴If incoming clock does not descend local

  • Merge clocks
  • Increment Clock
  • Get Last Event as dot (eg {a, 3})
  • Prune siblings!
  • Add incoming value as sibling
  • Replicate
SLIDE 136

Oh drop all the dots

✴Prune Siblings

  • Remove any sibling whose dot is seen by the incoming clock
  • if Clock >= [Dot], drop the Dotted value
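A Python sketch of the pruning rule, with each sibling carried as a (dot, value) pair (an illustrative representation, not Riak's code):

```python
def seen(clock, dot):
    """True if the clock covers the dot's (actor, count) event."""
    actor, count = dot
    return clock.get(actor, 0) >= count

def prune(dotted_siblings, incoming_clock):
    """Drop every sibling whose dot the incoming clock has already seen."""
    return [(dot, val) for dot, val in dotted_siblings
            if not seen(incoming_clock, dot)]

siblings = [(("a", 1), "Rita"), (("a", 2), "Sue"),
            (("a", 3), "Bob"), (("a", 4), "Babs")]
# An incoming write with clock [{a,3}] has seen Rita, Sue and Bob:
assert prune(siblings, {"a": 3}) == [(("a", 4), "Babs")]
```

This is exactly what keeps the false concurrency of the vnode scheme from piling up siblings.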
SLIDE 137

Vnode VClocks

[{a, 4}] Rita Sue Babs Bob [{a, 3}] Pete

SLIDE 138

Vnode VClocks + Dots

[{a, 4}] Rita Sue Babs Bob [{a, 3}] Pete {a,1} {a,2} {a,3} {a,4}

SLIDE 139

Vnode VClocks + Dots

[{a, 4}] Babs [{a, 3}] Pete {a,4}

SLIDE 140

Vnode VClocks + Dots

[{a, 5}] Babs Pete {a,4} {a,5}

SLIDE 141

Dotted Version Vectors

✴ Action at a distance ✴ Correctly capture concurrency ✴ No sibling explosion ✴ No Actor explosion

SLIDE 142

KV679

SLIDE 143

Riak Overview

Read Repair. Deletes.

SLIDE 144

Replica A Replica B Replica C Client X

PUT “bob”

SLIDE 145

Read Repair

Replica A Replica B Replica C Client

GET “Bob” “Bob” not_found

SLIDE 146

Read Repair

Replica A Replica B Replica C Client

“Bob” “Bob”!!!

SLIDE 147

Replica A Replica B Replica C Client X

DEL ‘k’ [{a, 4}, {b, 3}]

C’

SLIDE 148

Replica A Replica B Replica C C’ Del FSM

GET

SLIDE 149

Replica A Replica B Replica C C’ Del FSM

GET A=Tombstone, B=Tombstone, C=not_found

SLIDE 150

Read Repair

Replica A Replica B Replica C

“Tombstone”!!!

SLIDE 151

Replica A Replica B Replica C C’ Client

GET A=Tombstone, B=Tombstone, C=Tombstone

FSM

not_found

SLIDE 152

Replica A Replica B Replica C C’

REAP

FSM

SLIDE 153

Replica A Replica B Replica C Client X

PUT “sue” [] Sue [{a, 1}]

C’

SLIDE 154

Replica A Replica B Replica C C’

Hinted Hand off tombstone

SLIDE 155

Replica A Replica B Replica C Client

GET A=Sue[{a,1}], B=Sue[{a,1}], C=Tombstone [{a,4}, {b,1}]

FSM

not_found Ooops!

SLIDE 156

KV679 Lingering Tombstone

  • Write Tombstone
  • One goes to fallback
  • Read and reap primaries
  • Add Key again
  • Tombstone is handed off
  • Tombstone clock dominates, data lost
SLIDE 157

KV679 Other flavours

  • Back up restore
  • Failed local read (disk error, operator “error” etc)
SLIDE 158

KV679 RYOW?

  • Familiar
  • History repeating
SLIDE 159

KV679 Per Key Actor Epochs

  • Every time a Vnode reads a local “not_found”
  • Increment a vnode durable counter
  • Make a new actor ID
  • <<VnodeId, Epoch_Counter>>
SLIDE 160

KV679 Per Key Actor Epochs

  • Actor ID for the vnode remains long lived
  • No actor explosion
  • Each key gets a new actor per “epoch”
  • Vnode increments highest “Epoch” for its Id
  • <<VnodeId, Epoch>>
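A sketch of building such an actor id in Python; the 32-bit big-endian epoch suffix is an assumption for illustration (Riak constructs an Erlang binary):

```python
import struct

def epoch_actor_id(vnode_id: bytes, epoch: int) -> bytes:
    """<<VnodeId, Epoch>>: suffix the long-lived vnode id with the key's epoch."""
    return vnode_id + struct.pack(">I", epoch)

first = epoch_actor_id(b"vnode-a", 1)   # actor for the key's first life
second = epoch_actor_id(b"vnode-a", 2)  # fresh actor after a local not_found
assert first != second
assert first.startswith(b"vnode-a") and second.startswith(b"vnode-a")
```

Because the vnode id prefix is shared, the vnode can still find the highest epoch it has issued for its own id.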
SLIDE 161

Replica A Replica B Replica C Client

GET A=Sue[{a:2,1}], B=Sue[{a:2,1}], C=Tombstone [{a:1,4}, {b,1}]

FSM

[Sue, tombstone]

SLIDE 162

Per Key Actor Epochs BAD

  • More Actors (every time you delete and recreate a key, it gets a new actor)
  • More computation (find the highest epoch for an actor in the Version Vector)

SLIDE 163

Per Key Actor Epochs GOOD

  • No silent dataloss
  • No actor explosion
  • Fully backwards/forward compatible
SLIDE 164

Are we there yet?

?

SLIDE 165

Summary

  • Client side Version Vectors
  • Invariants, availability, Charron-Bost
  • Vnode Version Vectors
  • Sibling Explosion
SLIDE 166

Summary

  • Dotted Version Vectors
  • “beat” Charron-Bost
  • Per-Key-Actor-Epochs
  • Vnodes can “forget” safely
SLIDE 167

Summary

  • Temporal Clocks can’t track causality
  • Logical Clocks can
SLIDE 168

Summary

  • Version Vectors are EASY!
  • (systems using) Version Vectors are HARD!
  • Mind the Gap!