SLIDE 1
Making Time with Vector Clocks
Matthew Sackman
matthew@goshawkdb.io
https://goshawkdb.io/
SLIDES 2-4
1. Have you played with a NoSQL or NewSQL store?
2. Have you deployed a NoSQL or NewSQL store?
3. Have you studied and know their semantics?
SLIDE 5
SLIDE 6
SLIDES 7-8 T D S
ACID
- Atomic: an operation (transaction) either succeeds or aborts completely - no partial successes
- Consistent: constraints like uniqueness, foreign keys, etc. are honoured
- Isolated: the degree to which operations in one transaction can observe the actions of concurrent transactions
- Durable: flushed to disk before the client can find out the result
SLIDES 9-11 T D S
Default isolation levels
- PostgreSQL: Read Committed
- Oracle 11g: Read Committed
- MS SQL Server: Read Committed
- MySQL InnoDB: Repeatable Read
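These defaults are only defaults: a client can ask for a stronger level per transaction. As a minimal sketch, Go's database/sql exposes this via TxOptions (whether the request is honoured is up to the driver and the server):

package main

import (
	"context"
	"database/sql"
)

func run(ctx context.Context, db *sql.DB) error {
	// Ask for Serializable rather than the server's default level;
	// the driver returns an error if it cannot satisfy this.
	tx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelSerializable})
	if err != nil {
		return err
	}
	defer tx.Rollback() // effectively a no-op after a successful Commit

	// ... reads and writes here run at the requested level ...
	return tx.Commit()
}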
SLIDE 12
I L
SLIDES 13-14 Snapshot Isolation
Wikipedia:
“Snapshot isolation is a guarantee that all reads made in a transaction will see a consistent snapshot of the database and the transaction itself will successfully commit only if no updates it has made conflict with any concurrent updates made since that snapshot.” Snapshot isolation is called “serializable” mode in Oracle.
SLIDES 15-28 Snapshot Isolation

x, y := 0, 0

func t1() {
	if x == 0 {
		y = 1
	}
}

func t2() {
	if y == 0 {
		x = 1
	}
}

- t1 then t2: x:0, y:1
- t2 then t1: x:1, y:0
- t1 || t2: x:1, y:1
- Snapshot Isolation: Write Skew
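To make the write-skew outcome concrete, here is a minimal straight-line Go model (not a real database client): both transactions read from the same snapshot, and because their write sets ({y} and {x}) are disjoint, snapshot isolation lets both commit:

package main

import "fmt"

func main() {
	// Both txns start from the same snapshot of x and y.
	snapX, snapY := 0, 0

	// Committed state; each txn reads from its snapshot, writes here.
	x, y := snapX, snapY

	if snapX == 0 { // t1: reads x from its snapshot, writes y
		y = 1
	}
	if snapY == 0 { // t2: reads y from its snapshot, writes x
		x = 1
	}

	// Write sets {y} and {x} don't conflict, so neither aborts: write skew.
	fmt.Printf("x:%d, y:%d\n", x, y) // x:1, y:1 - impossible in any serial order
}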
SLIDES 29-31 D F
- General purpose transactions
- Strong serializability
- Distribution
  - Automatic sharding
  - Horizontal scalability
  - ...
SLIDE 32
SLIDES 33-34
I L
SLIDES 35-39
CAP
Possibility of Partitions ⇒ ¬(Consistency ∧ Availability)
If the network can partition, a node cut off from the rest must either answer anyway (sacrificing Consistency) or refuse to answer (sacrificing Availability).
SLIDES 40-49
A C
C I C N
(diagram)
SLIDES 50-52 L
- Strong serializability requires Consistency, so we must sacrifice Availability
- To achieve Consistency, only accept operations when connected to a majority
- If the cluster size is 2F + 1 then we can withstand no more than F failures
SLIDES 53-58
M V
C I E T W
(diagram)
SLIDE 59 L
- Writes must go to F + 1 nodes
SLIDES 60-65
R V
C I E T W
(diagram)
SLIDE 66 L
- Reads must read from F + 1 nodes and be able to order the results
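A quick sanity check of the arithmetic behind these lessons (the numbers are illustrative): any write quorum of F + 1 and any read quorum of F + 1 drawn from 2F + 1 nodes must overlap in at least one node, which is what lets a reader meet, and order, the latest committed write:

package main

import "fmt"

func main() {
	f := 2          // failures to tolerate (illustrative)
	n := 2*f + 1    // cluster size
	quorum := f + 1 // nodes contacted per write and per read

	// Two quorums of size F+1 out of 2F+1 nodes must share at least
	// one node, so a read always sees the latest committed write.
	overlap := 2*quorum - n
	fmt.Printf("n=%d quorum=%d guaranteed-overlap=%d\n", n, quorum, overlap)
}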
SLIDE 67
SLIDES 68-73 Transaction Processing in Distributed Databases
1. Client submits txn
2. Node(s) vote on txn
3. Node(s) reach consensus on txn outcome
4. Client is informed of outcome
The most important thing is that all nodes agree on the order of transactions (the focus for the rest of this talk!)
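One hedged sketch of the shapes involved in those four steps (the names and fields are illustrative, not GoshawkDB's actual wire format):

package protocol

// Txn is what the client submits: what it read (and at which version)
// and what it wants to write.
type Txn struct {
	ID     uint64
	Reads  map[string]uint64 // object -> version observed
	Writes map[string][]byte // object -> proposed new value
}

// Vote is one node's opinion on a txn.
type Vote struct {
	Node   string
	TxnID  uint64
	Commit bool // false if the node saw conflicting versions
}

// Outcome is what consensus over the votes decides and the client learns.
type Outcome struct {
	TxnID     uint64
	Committed bool
}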
SLIDES 74-76
Leader-Based Ordering
(diagram)
SLIDES 77-81 Leader-Based Ordering
- Only the leader votes on whether a txn commits or aborts
- Therefore the leader must know everything
- If the leader fails, a new leader will be elected from the remaining nodes
- Therefore all nodes must know everything
- Fine for small clusters, but scaling issues when clusters get big
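A toy model of that bottleneck (names are illustrative): every txn funnels through a single sequencing goroutine, and every follower must receive every txn, so the leader's throughput bounds the whole cluster's:

package main

import "fmt"

type txn struct{ id string }

func main() {
	submissions := make(chan txn)
	followers := []chan txn{make(chan txn, 8), make(chan txn, 8)}
	done := make(chan struct{})

	// The leader: the single point through which every txn must pass.
	go func() {
		seq := 0
		for t := range submissions {
			seq++
			fmt.Printf("leader ordered %s as #%d\n", t.id, seq)
			for _, f := range followers { // every node must learn every txn
				f <- t
			}
		}
		close(done)
	}()

	submissions <- txn{id: "t1"}
	submissions <- txn{id: "t2"}
	close(submissions)
	<-done
}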
SLIDES 82-84
Client-Clock-Based Ordering
(diagram)
SLIDES 85-89 Client-Clock-Based Ordering
- Nodes receive txns and must vote on the txn outcome, and then consensus must be reached (not shown)
- Clients are responsible for applying an increasing clock value to txns
- If a client’s clock races then it can prevent other clients from getting txns submitted
- So we must be very careful to keep clocks running at the same rate
- No possibility of reordering transactions at all to maximise commits
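A minimal sketch of the failure mode in the third bullet (the acceptance rule here is an assumption for illustration): if nodes refuse stamps at or below the highest they have seen, one racing client raises the bar for everyone else:

package main

import "fmt"

// node tracks the highest stamp it has accepted and rejects anything
// older, so a racing client starves clients with honest clocks.
type node struct{ highest uint64 }

func (n *node) submit(client string, stamp uint64) {
	if stamp <= n.highest {
		fmt.Printf("%s rejected: stamp %d <= highest seen %d\n", client, stamp, n.highest)
		return
	}
	n.highest = stamp
	fmt.Printf("%s accepted at %d\n", client, stamp)
}

func main() {
	n := &node{}
	n.submit("racer", 1000000) // this client's clock runs fast
	n.submit("steady", 42)     // rejected until its clock catches up
}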
SLIDE 90
S
V1 < V2 ⇔ (∀x ∈ dom(V1 ∪ V2) : V1[x] ≤ V2[x]) ∧ (∃y ∈ dom(V1 ∪ V2) : V1[y] < V2[y]), where an absent elem is read as 0
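This definition transcribes directly into Go (a sketch: elems are keyed by string, and an absent entry reads as 0, matching the map's zero value):

package main

import "fmt"

// VC is a vector clock: elem id -> counter; absent elems count as 0.
type VC map[string]uint64

// Less reports whether a < b as defined above: no elem of a exceeds b,
// and at least one elem of b strictly exceeds a.
func Less(a, b VC) bool {
	strict := false
	for x := range union(a, b) {
		switch {
		case a[x] > b[x]:
			return false
		case a[x] < b[x]:
			strict = true
		}
	}
	return strict
}

func union(a, b VC) map[string]struct{} {
	dom := make(map[string]struct{}, len(a)+len(b))
	for x := range a {
		dom[x] = struct{}{}
	}
	for x := range b {
		dom[x] = struct{}{}
	}
	return dom
}

func main() {
	fmt.Println(Less(VC{"n1": 1}, VC{"n1": 1, "n2": 1})) // true
	fmt.Println(Less(VC{"n1": 1}, VC{"n2": 1}))          // false: concurrent
}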
SLIDES 91-98
S T
(diagram)
SLIDES 99-112
T W
(diagram)
SLIDES 113-115 T D A D’ W
- Changing state when receiving a txn seems to be a very bad idea
- Maybe only change state when receiving the outcome of a vote
- And don’t vote on txns until we know it’s safe to do so
SLIDES 116-117 N I
- Divide time into frames. The first half of a frame is reads, the second half writes.
- Within a frame, we don’t care about the order of reads,
- but all reads must come after the writes of the previous frame,
- all writes must come after the reads of this frame,
- and all writes must be totally ordered within the frame - we must know which write comes last.
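A minimal sketch of that discipline as data (the types are illustrative):

package frames

// txnID identifies a transaction.
type txnID uint64

// frame captures the rules above: reads are unordered within the frame,
// writes are totally ordered, and the next frame's reads must all come
// after this frame's writes.
type frame struct {
	reads  map[txnID]struct{} // any interleaving of these is acceptable
	writes []txnID            // total order; the last write seeds the next frame
	next   *frame
}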
SLIDES 118-125
F & D
(diagram)
SLIDE 126 C W C R
- Merge all the read clocks together
- Add 1 to the result for every object that was written by txns in our frame’s reads
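A sketch of this calculation, assuming "merge" means the usual elementwise-max join of vector clocks (the slide does not say, so that is an assumption):

package frames

// VC is a vector clock: elem id -> counter (as in the earlier sketch).
type VC map[string]uint64

// writeClock merges all the read clocks (elementwise max, assumed to be
// the intended merge) and then adds 1 for every object written by txns
// in the frame's reads.
func writeClock(readClocks []VC, written []string) VC {
	out := VC{}
	for _, rc := range readClocks {
		for elem, v := range rc {
			if v > out[elem] {
				out[elem] = v
			}
		}
	}
	for _, obj := range written {
		out[obj]++
	}
	return out
}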
SLIDES 127-129 Choosing Frame Winner & Next Frame’s Read Clock
- Partition the write results by local clock elem, and within that by txn id
- Each clock inherits missing clock elems from the one above
- Then sort each partition first by clock (now all the same length), then by txn id
- The next frame starts with the winner’s clock, +1 for all writes
- Guarantees no concurrent vector clocks (proof in progress!)
- Many details elided! (deadlock freedom, etc.)
SLIDES 130-139
T V C
(diagram)
SLIDES 140-143 Shrinking Vector Clocks
- The hardest part of Paxos is garbage collection
- Additional messages are needed to determine when Paxos instances can be deleted
- We can use these to also express: “You will never see any of these vector clock elems again”
- Therefore we can remove matching elems from vector clocks!
- Many more details elided!
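The removal itself is tiny once that promise is in hand; a sketch, reusing the vector clock map shape from earlier (the retired set stands for whatever the GC messages guaranteed will never reappear):

package gc

// VC is a vector clock, as in the earlier sketches.
type VC map[string]uint64

// shrink drops every elem that the garbage collection round has
// promised will never be seen again.
func shrink(clock VC, retired map[string]bool) {
	for elem := range clock {
		if retired[elem] {
			delete(clock, elem)
		}
	}
}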
SLIDES 144-151 Conclusions
Vector clocks capture the dependencies and causal relationships between transactions. Plus we always add transactions into the youngest frame. Which gets us Strong Serializability.
No leader, so no potential bottleneck. No wall clocks, so no issues with clock skew. Can separate F from cluster size, which gets us horizontal scalability.
SLIDE 152 References
- Interval Tree Clocks - Almeida, Baquero, Fonte
- Highly available transactions: virtues and limitations - Bailis et al.
- Coordination avoidance in database systems - Bailis et al.
- k-dependency vectors: a scalable causality-tracking protocol - Baldoni, Melideo
- Multiversion concurrency control: theory and algorithms - Bernstein, Goodman
- Serializable isolation for snapshot databases - Cahill, Röhm, Fekete
- Paxos made live: an engineering perspective - Chandra, Griesemer, Redstone
SLIDE 153 References
- Consensus on transaction commit - Gray, Lamport
- Spanner: Google’s globally distributed database - Corbett et al.
- Faster generation of shorthand universal cycles for permutations - Holroyd, Ruskey, Williams
- s-Overlap cycles for permutations - Horan, Hurlbert
- Universal cycles of k-subsets and k-permutations - Jackson
- Zab: high-performance broadcast for primary-backup systems - Junqueira, Reed, Serafini
- Time, clocks, and the ordering of events in a distributed system - Lamport
SLIDE 154 References
- The part-time parliament - Lamport
- Paxos made simple - Lamport
- Consistency, availability, and convergence - Mahajan, Alvisi, Dahlin
- Notes on Theory of Distributed Systems - Aspnes
- In search of an understandable consensus algorithm - Ongaro, Ousterhout
- Perfect consistent hashing - Sackman
- The case for determinism in database systems - Thomson, Abadi
SLIDE 155
Distributed databases are FUN! https://goshawkdb.io/