SLIDE 1

M T  V C

Matthew Sackman

matthew@goshawkdb.io

https://goshawkdb.io/

SLIDES 2-4

  • 1. Have you played with a NoSQL or NewSQL store?
  • 2. Have you deployed a NoSQL or NewSQL store?
  • 3. Have you studied and know their semantics?
SLIDE 5

SLIDE 6

SLIDES 7-8

T D S

ACID

  • Atomic: an operation (transaction) either succeeds or aborts completely - no partial successes
  • Consistent: constraints like uniqueness, foreign keys, etc. are honoured
  • Isolation: the degree to which operations in one transaction can observe actions of concurrent transactions
  • Durable: flushed to disk before the client can find out the result

SLIDES 9-11

T D S

Default isolation levels

  • PostgreSQL: Read Committed
  • Oracle 11g: Read Committed
  • MS SQL Server: Read Committed
  • MySQL InnoDB: Repeatable Read
SLIDE 12

I L

SLIDES 13-14

S I

Wikipedia:

“Snapshot isolation is a guarantee that all reads made in a transaction will see a consistent snapshot of the database and the transaction itself will successfully commit only if no updates it has made conflict with any concurrent updates made since that snapshot.”

Snapshot isolation is called “serializable” mode in Oracle.

SLIDES 15-28

S I

  

x, y := 0,0

func t1() {

if x == 0 {

y = 1

}

} func t2() { if y == 0 { x = 1 } }

  • Serialized:

t1 then t2: x:0, y:1 t2 then t1: x:1, y:0

  • Snapshot Isolation: Write Skew

t1 || t2: x:1, y:1
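
Both transactions commit under snapshot isolation because each validates only its writes against concurrent writes, and the two write sets ({y} and {x}) are disjoint. Here is a minimal, self-contained Go sketch of that commit rule (illustrative only, not GoshawkDB code):

// Toy demonstration of write skew: each "transaction" reads from a
// private snapshot taken at start, and commits as long as no
// concurrent transaction wrote the keys it wrote (write-write
// conflicts only -- reads are not validated).
package main

import "fmt"

type snapshot map[string]int

func main() {
	db := snapshot{"x": 0, "y": 0}

	// Both transactions start from the same snapshot.
	snap1 := snapshot{"x": db["x"], "y": db["y"]}
	snap2 := snapshot{"x": db["x"], "y": db["y"]}

	writes1 := snapshot{}
	writes2 := snapshot{}

	// t1: if x == 0 { y = 1 }
	if snap1["x"] == 0 {
		writes1["y"] = 1
	}
	// t2: if y == 0 { x = 1 }
	if snap2["y"] == 0 {
		writes2["x"] = 1
	}

	// Commit check: only write-write conflicts abort. The write sets
	// {y} and {x} are disjoint, so both commit.
	for k, v := range writes1 {
		db[k] = v
	}
	for k, v := range writes2 {
		db[k] = v
	}

	fmt.Println(db) // map[x:1 y:1] -- impossible under any serial order
}
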

SLIDES 29-31

D F

  • General purpose transactions
  • Strong serializability
  • Distribution
  • Automatic sharding
  • Horizontal scalability
  • ...
SLIDE 32

SLIDES 33-34

Isolation Levels

SLIDES 35-39

CAP

Possibility of Partitions ⇒ ¬(Consistency ∧ Availability)

SLIDES 40-49

A C

C I C N

SLIDES 50-52

L

  • Strong serializability requires Consistency, so must sacrifice Availability
  • To achieve Consistency, only accept operations if connected to a majority
  • If cluster size is 2F + 1 then we can withstand no more than F failures

SLIDES 53-58

M V

C I E  T W

SLIDE 59

L

  • Strong serializability requires Consistency, so must sacrifice Availability
  • To achieve Consistency, only accept operations if connected to a majority
  • If cluster size is 2F + 1 then we can withstand no more than F failures
  • Writes must go to F + 1 nodes

SLIDES 60-65

R V

C I E  T W

SLIDE 66

L

  • Strong serializability requires Consistency, so must sacrifice Availability
  • To achieve Consistency, only accept operations if connected to a majority
  • If cluster size is 2F + 1 then we can withstand no more than F failures
  • Writes must go to F + 1 nodes
  • Reads must read from F + 1 nodes and be able to order results
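
Why F + 1 works: any two quorums of F + 1 nodes drawn from 2F + 1 must share at least one node, so every read quorum contains at least one replica that saw the latest write. A tiny sketch of the arithmetic (my own illustration, not from the talk):

// Quorum arithmetic for a cluster of 2F+1 nodes: write and read
// quorums of F+1 nodes always overlap, so a read quorum is
// guaranteed to contain at least one node with the latest write.
package main

import "fmt"

func main() {
	for f := 1; f <= 3; f++ {
		cluster := 2*f + 1
		quorum := f + 1
		// Two quorums of size F+1 drawn from 2F+1 nodes overlap in
		// at least quorum+quorum-cluster = 1 node.
		overlap := quorum + quorum - cluster
		fmt.Printf("F=%d cluster=%d quorum=%d min overlap=%d\n",
			f, cluster, quorum, overlap)
	}
}
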
SLIDE 67

SLIDES 68-73

T P  D D

  • 1. Client submits txn
  • 2. Node(s) vote on txn
  • 3. Node(s) reach consensus on txn outcome
  • 4. Client is informed of outcome

The most important thing is that all nodes agree on the order of transactions (the focus for the rest of this talk!)

SLIDES 74-81

L B O

  • Only leader votes on whether txn commits or aborts
  • Therefore leader must know everything
  • If leader fails, a new leader will be elected from the remaining nodes
  • Therefore all nodes must know everything
  • Fine for small clusters, but scaling issues when clusters get big

SLIDES 82-89

C C B O

  • Nodes receive txns and must vote on txn outcome and then consensus must be reached (not shown)
  • Clients are responsible for applying an increasing clock value to txns
  • If a client’s clock races then it can prevent other clients from getting txns submitted
  • So must be very careful to try and keep clocks running at the same rate
  • No possibility to reorder transactions at all to maximise commits
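
For concreteness, a client stamp in this style might be issued as in the sketch below (names are illustrative, not from the talk); note how a fast wall clock makes every stamp from that client sort later, which is exactly the racing-clock problem above:

// A client-side stamp: each txn gets a value strictly greater than
// any the client has used before, derived from its local wall clock.
// If one client's wall clock runs fast, its txns always sort "later",
// and nodes ordering strictly by stamp can starve other clients.
package main

import (
	"fmt"
	"time"
)

type client struct {
	last int64 // last clock value handed out
}

func (c *client) nextStamp() int64 {
	now := time.Now().UnixNano()
	if now <= c.last {
		now = c.last + 1 // never reuse a stamp or go backwards
	}
	c.last = now
	return now
}

func main() {
	c := &client{}
	fmt.Println(c.nextStamp(), c.nextStamp())
}
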

SLIDE 90

S

V1 < V2 ⟺ (∀x ∈ dom(V1 ∪ V2) : V1[x] ≤ V2[x]) ∧ (∃y ∈ dom(V1 ∪ V2) : V1[y] < V2[y])
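
Read V1 < V2 as: V1 is everywhere less than or equal to V2, and strictly less somewhere, with absent elements treated as zero; two distinct clocks where neither is less than the other are concurrent. A sketch of this comparison in Go (the standard vector clock ordering, not GoshawkDB's actual code):

// VClock maps an element id to its clock value; absent ids read as 0.
package vclock

type VClock map[string]uint64

// Less reports whether v1 < v2: every element of v1 is <= the
// corresponding element of v2, and at least one element is strictly
// smaller. If neither Less(v1, v2) nor Less(v2, v1) holds and the
// clocks differ, they are concurrent.
func Less(v1, v2 VClock) bool {
	strictly := false
	for k, a := range v1 {
		b := v2[k] // 0 if absent
		if a > b {
			return false
		}
		if a < b {
			strictly = true
		}
	}
	for k, b := range v2 {
		if _, seen := v1[k]; !seen && b > 0 {
			strictly = true // v1[k] = 0 < v2[k]
		}
	}
	return strictly
}
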

SLIDES 91-98

S T

SLIDES 99-112

T W

SLIDES 113-115

T D A D’ W

  • Changing state when receiving a txn seems to be a very bad idea
  • Maybe only change state when receiving the outcome of a vote
  • And don’t vote on txns until we know it’s safe to do so

SLIDES 116-117

N I

  • Divide time into frames. First half of frame is reads, second half writes.
  • Within a frame, we don’t care about order of reads,
  • but all reads must come after writes of previous frame,
  • all writes must come after reads of this frame,
  • all writes must be totally ordered within the frame - must know which write comes last.
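
A sketch of the frame structure those constraints suggest (field and method names are my own, hypothetical; the talk elides the real representation):

// Frame is a sketch of one frame in the reads-then-writes scheme:
// reads form an unordered set, writes a totally ordered sequence,
// and frames are chained so that reads follow the previous frame's
// writes and writes follow this frame's reads.
package vclock

// TxnId identifies a transaction.
type TxnId string

type Frame struct {
	prev   *Frame         // all of prev's writes happen before our reads
	reads  map[TxnId]bool // unordered within the frame
	writes []TxnId        // totally ordered; we must know which is last
}

// AddRead records a read txn in the current frame.
func (f *Frame) AddRead(id TxnId) { f.reads[id] = true }

// AddWrite appends a write txn, extending the frame's total order.
func (f *Frame) AddWrite(id TxnId) { f.writes = append(f.writes, id) }
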

SLIDES 118-125

F & D

SLIDE 126

C  W C  R

  • Merge all read clocks together
  • Add 1 to result for every object that was written by txns in our frame’s reads
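
Using the VClock type from the earlier sketch, the calculation might look like this (illustrative only; "merge" is the usual element-wise max):

// WriteClock merges the read clocks element-wise (pointwise max),
// then adds 1 to the entry of every object written by txns in this
// frame's reads. The function name is illustrative.
package vclock

func WriteClock(readClocks []VClock, writtenObjs []string) VClock {
	out := VClock{}
	for _, rc := range readClocks {
		for k, v := range rc {
			if v > out[k] {
				out[k] = v // element-wise max = clock merge
			}
		}
	}
	for _, obj := range writtenObjs {
		out[obj]++ // +1 for each object written in the frame's reads
	}
	return out
}
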

SLIDES 127-129

C  F W

& N F’ R C

  • Partition write results by local clock elem, and within that by txn id
  • Each clock inherits missing clock elems from above
  • Then sort each partition first by clock (now all same length), then by txn id
  • Next frame starts with winner’s clock, +1 for all writes
  • Guarantees no concurrent vector clocks (proof in progress!)
  • Many details elided! (deadlock freedom, etc)
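
A sketch of the sort, reusing the VClock and TxnId types from earlier. The exact clock comparison is among the elided details; what matters is that the order is total and deterministic, so every node sorts the partition identically and agrees on the winner:

// writeResult pairs a txn with its vector clock after the
// inheritance step, so every clock in a partition has the same
// domain. Clocks are compared over a canonical (sorted) key order
// purely to obtain a deterministic total order, with ties broken by
// txn id.
package vclock

import "sort"

type writeResult struct {
	txn   TxnId
	clock VClock
}

func sortPartition(part []writeResult, keys []string) {
	sort.Slice(part, func(i, j int) bool {
		a, b := part[i].clock, part[j].clock
		for _, k := range keys { // keys in canonical order
			if a[k] != b[k] {
				return a[k] < b[k]
			}
		}
		return part[i].txn < part[j].txn // tie-break by txn id
	})
}
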
SLIDES 130-139

T V C

SLIDES 140-143

S V C

  • Hardest part of Paxos is garbage collection
  • Need additional messages to determine when Paxos instances can be deleted
  • We can use these to also express: “You will never see any of these vector clock elems again”
  • Therefore we can remove matching elems from vector clocks!
  • Many more details elided!
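
The final removal step is then trivial (a sketch, again reusing the earlier VClock type):

// Shrink deletes every clock element that garbage collection has
// promised will never be seen again, keeping clocks bounded.
package vclock

func Shrink(v VClock, neverAgain map[string]bool) {
	for k := range v {
		if neverAgain[k] {
			delete(v, k)
		}
	}
}
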
SLIDES 144-146

Conclusions

Vector clocks capture the dependencies and causal relationships between transactions. Plus, we always add transactions into the youngest frame. Which gets us Strong Serializability.

SLIDES 147-150

Conclusions

No leader, so no potential bottleneck. No wall clocks, so no issues with clock skew. Can separate F from cluster size, which gets us horizontal scalability.

SLIDE 151

Conclusions

SLIDE 152

References

  • Interval Tree Clocks - Paulo Sérgio Almeida, Carlos Baquero, Victor Fonte
  • Highly available transactions: Virtues and limitations - Bailis et al
  • Coordination avoidance in database systems - Bailis et al
  • k-dependency vectors: A scalable causality-tracking protocol - Baldoni, Melideo
  • Multiversion concurrency control - theory and algorithms - Bernstein, Goodman
  • Serializable isolation for snapshot databases - Cahill, Röhm, Fekete
  • Paxos made live: an engineering perspective - Chandra, Griesemer, Redstone

SLIDE 153

References

  • Consensus on transaction commit - Gray, Lamport
  • Spanner: Google’s globally distributed database - Corbett et al
  • Faster generation of shorthand universal cycles for permutations - Holroyd, Ruskey, Williams
  • s-Overlap Cycles for Permutations - Horan, Hurlbert
  • Universal cycles of k-subsets and k-permutations - Jackson
  • Zab: High-performance broadcast for primary-backup systems - Junqueira, Reed, Serafini
  • Time, clocks, and the ordering of events in a distributed system - Lamport

SLIDE 154

References

  • The part-time parliament - Lamport
  • Paxos made simple - Lamport
  • Consistency, Availability, and Convergence - Mahajan, Alvisi, Dahlin
  • Notes on Theory of Distributed Systems - Aspnes
  • In search of an understandable consensus algorithm - Ongaro, Ousterhout
  • Perfect Consistent Hashing - Sackman
  • The case for determinism in database systems - Thomson, Abadi

SLIDE 155

Distributed databases are FUN! https://goshawkdb.io/