Architecture of a Geo-Distributed SQL Database
CockroachDB
Peter Mattis (@petermattis), Co-founder & CTO
CockroachDB Architecture of a Geo-Distributed SQL Database Peter - - PowerPoint PPT Presentation
CockroachDB Architecture of a Geo-Distributed SQL Database Peter Mattis (@petermattis), Co-founder & CTO CockroachDB: Geo-distributed SQL Database Make Data Easy Distributed Horizontally scalable to grow with your application
Architecture of a Geo-Distributed SQL Database
Peter Mattis (@petermattis), Co-founder & CTO
○ Horizontally scalable to grow with your application
○ Handle datacenter failures ○ Place data near usage ○ Push computation near data
○ Lingua-franca for rich data storage ○ Schemas, indexes, and transactions make app development easier
○ Lexicographically ordered by key
○ Values are never updated “in place”, newer versions shadow older versions ○ Tombstones are used to delete values ○ Provides snapshot to each transaction
* Not exposed for external usage
DOGS
carl dagne figment jack pinetop sooshi stella zee muddy peetey lula lady
DOGS
carl dagne figment jack pinetop sooshi stella zee muddy peetey lula lady
carl dagne figment jack muddy peetey lula lady pinetop sooshi stella zee
Ranges are small enough to be moved/split quickly Ranges are large enough to amortize indexing overhead
DOGS
carl dagne figment jack pinetop sooshi stella zee muddy peetey lula lady
(very much like a B-tree)
1 2 3 carl - jack lady - peetey pinetop - zee carl dagne figment jack muddy peetey lula lady pinetop sooshi stella zee
DOGS
carl dagne figment jack pinetop sooshi zee peetey lula lady
dogs >= “muddy” AND <= “stella”
1 2 3 carl - jack lady - peetey pinetop - zee carl dagne figment jack peetey lula lady pinetop sooshi zee muddy stella
stella muddy
DOGS
carl dagne figment jack pinetop sooshi zee peetey lula lady
1 2 3 carl - jack lady - peetey pinetop - zee
stella muddy INSERT[sunny]
INSERT[sunny]
Space available in range? - YES
carl dagne figment jack muddy peetey lula lady pinetop sooshi stella zee
pinetop sooshi stella zee muddy peetey lula lady
✓ ?
DOGS
carl dagne figment jack pinetop sooshi zee peetey lula lady
1 2 3 carl - jack lady - peetey pinetop - zee
stella muddy INSERT[sunny]
carl dagne figment jack muddy peetey lula lady pinetop sooshi stella sunny zee
pinetop sooshi stella zee muddy peetey lula lady
✓
INSERT[sunny]
DOGS
carl dagne figment jack pinetop sooshi zee peetey lula lady
1 2 3 carl - jack lady - peetey pinetop - zee
stella muddy INSERT[rudy]
carl dagne figment jack muddy peetey lula lady pinetop sooshi stella sunny zee
pinetop sooshi stella zee muddy peetey lula lady
✓ ? INSERT[rudy]
Space available in range? - NO
DOGS
carl dagne figment jack pinetop sooshi zee peetey lula lady stella muddy INSERT[rudy]
carl dagne figment jack muddy peetey lula lady pinetop rudy sooshi
pinetop sooshi stella zee muddy peetey lula lady
✓ INSERT[rudy]
split range and insert
stella sunny zee 1 2 3 carl - jack lady - peetey pinetop - sooshi 4 stella - zee
Raft group
* Leaseholder == Raft leader
Raft group
LEASEHOLDER
node 1 node 2 node 4 node 3
muddy peetey lula lady pinetop sooshi stella zee carl dagne figment jack
Reads must talk to a quorum of replicas
READ[carl]
node 1 node 2 node 4 node 3
muddy peetey lula lady pinetop sooshi stella zee carl dagne figment jack
One replica is chosen as the leaseholder
READ[carl] leaseholder
node 1 node 2 node 4 node 3
muddy peetey lula lady pinetop sooshi stella zee carl dagne figment jack
One replica is chosen as the leaseholder
READ[carl] leaseholder
node 1 node 2 node 4 node 3
muddy peetey lula lady pinetop sooshi stella zee
carl dagne figment jack
Each Range is a Raft state machine A Range has 1 or more Replicas
node 1 node 2 node 4 node 3
muddy peetey lula lady carl dagne figment jack
replicas across “failure domains”
pinetop sooshi stella zee
node 1 node 2 node 6 node 4 node 5
muddy peetey lula lady pinetop sooshi stella zee carl dagne figment jack
Balances placement using heuristics that considers real-time usage metrics of the data itself
This range is high load as it is accessed more than others
While we show this for ranges within a single table, this is also applicable across all ranges across ALL tables, which is the more typical situation
node 1 node 3
muddy peetey lula lady carl dagne figment jack pinetop sooshi stella zee USE/muddy USE/stella USE/figment USE/dagne USW/jack USW/lady USW/peetey USW/pinetop EU/carl EU/lula EU/sooshi EU/zee
We apply a constraint that indicates regional placement so we can ensure low latency access or jurisdictional control of data
node 1 node 5 node 4 node 2 node 3 NEW
If we add a node to the cluster, CockroachDB automatically redistributed replicas to even load across the cluster Uses the replica placement heuristics from previous slides to decide which node to add to and which to remove from
node 1 node 5 node 4 node 2 node 3 NEW
If we add a node to the cluster, CockroachDB automatically redistributed replicas to even load across the cluster Uses the replica placement heuristics from previous slides Movement is decomposed into adding a replica followed by removing a replica
node 1 node 5 node 4 node 2 node 3 NEW
If we add a node to the cluster, CockroachDB automatically redistributed replicas to even load across the cluster Uses the replica placement heuristics from previous slides Movement is decomposed into adding a replica followed by removing a replica
node 1 node 5 node 4 node 2 node 3
If a node goes down, the Raft group realizes a replica is missing and replaces it with a new replica
Uses the replica placement heuristics from previous slides
node 1 node 5 node 4 node 2 node 3
If a node goes down, the Raft group realizes a replica is missing and replaces it with a new replica
Uses the replica placement heuristics from previous slides The failed replica is removed from the Raft group and a new replica created. The leaseholder sends a snapshot of the Range’s state to bring the new replica up to date.
node 1 node 5 node 4 node 2
If a node goes down for a moment, the leaseholder can “catch up” any replica that is behind The leaseholder can send commands to be replayed OR it can send a snapshot of the current Range data. We apply heuristics to decide which is most efficient for a given failure.
node 3
node 1
carl dagne figment jack carl peetey dagne lady lula muddy peetey lady
node 2 node 3
carl peetey dagne lady lula muddy peetey lady
node 4
carl dagne figment jack pinetop sooshi stella zee pinetop sooshi stella zee pinetop sooshi stella zee carl peetey dagne lady lula muddy peetey lady carl dagne figment jack
INSERT INTO dogs VALUES (sunny, ozzie)
node 1
carl dagne figment jack carl peetey dagne lady lula muddy peetey lady
node 2 node 3
carl peetey dagne lady lula muddy peetey lady
node 4
carl dagne figment jack carl peetey dagne lady lula muddy peetey lady carl dagne figment jack pinetop sooshi stella zee pinetop sooshi stella zee
BEGIN TXN1 WRITE[sunny]
GATEWAY
INSERT INTO dogs VALUES (sunny, ozzie) transactions TXN1: PENDING
pinetop sooshi stella zee
node 1
carl dagne figment jack carl peetey dagne lady lula muddy peetey lady
node 2 node 3
carl peetey dagne lady lula muddy peetey lady
node 4
carl dagne figment jack carl peetey dagne lady lula muddy peetey lady carl dagne figment jack pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee
BEGIN TXN1 WRITE[sunny]
GATEWAY
INSERT INTO dogs VALUES (sunny, ozzie) transactions TXN1: PENDING
pinetop sooshi stella sunny zee
node 1
carl dagne figment jack carl peetey dagne lady lula muddy peetey lady
node 2 node 3
carl peetey dagne lady lula muddy peetey lady
node 4
carl dagne figment jack carl peetey dagne lady lula muddy peetey lady carl dagne figment jack pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee
BEGIN TXN1 WRITE[sunny]
GATEWAY
INSERT INTO dogs VALUES (sunny, ozzie) transactions TXN1: PENDING
pinetop sooshi stella sunny zee
ACK
pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee
node 1
carl dagne figment jack carl peetey dagne lady lula muddy peetey lady
node 2 node 3
carl peetey dagne lady lula muddy peetey lady
node 4
carl dagne figment jack carl peetey dagne lady lula muddy peetey lady carl dagne figment jack pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee
BEGIN TXN1 WRITE[sunny]
GATEWAY
INSERT INTO dogs VALUES (sunny, ozzie) transactions TXN1: PENDING
pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee
node 1
carl dagne figment jack carl peetey dagne lady lula muddy peetey lady
node 2 node 3
carl peetey dagne lady lula muddy peetey lady
node 4
carl dagne figment jack carl dagne figment jack pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee
BEGIN TXN1 WRITE[sunny] WRITE[ozzie]
GATEWAY
INSERT INTO dogs VALUES (sunny, ozzie) transactions TXN1: PENDING
pinetop sooshi stella sunny zee carl peetey dagne lady lula muddy
peetey lady
ACK
pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee
node 1
carl dagne figment jack carl peetey dagne lady lula muddy
peetey lady
node 2 node 3
carl peetey dagne lady lula muddy
peetey lady
node 4
carl dagne figment jack carl dagne figment jack
BEGIN TXN1 WRITE[sunny] WRITE[ozzie]
GATEWAY
INSERT INTO dogs VALUES (sunny, ozzie) transactions TXN1: PENDING
carl peetey dagne lady lula muddy
peetey lady pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee
node 1
carl dagne figment jack
node 2 node 3 node 4
carl dagne figment jack carl dagne figment jack
BEGIN TXN1 WRITE[sunny] WRITE[ozzie]
GATEWAY
INSERT INTO dogs VALUES (sunny, ozzie) transactions TXN1: PENDING
ACK
pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee carl peetey dagne lady lula muddy
peetey lady carl peetey dagne lady lula muddy
peetey lady carl peetey dagne lady lula muddy
peetey lady
node 1
carl dagne figment jack
node 2 node 3 node 4
carl dagne figment jack carl dagne figment jack
BEGIN TXN1 WRITE[sunny] WRITE[ozzie]
GATEWAY
INSERT INTO dogs VALUES (sunny, ozzie) transactions TXN1: PENDING
ACK
pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee carl peetey dagne lady lula muddy
peetey lady carl peetey dagne lady lula muddy
peetey lady carl peetey dagne lady lula muddy
peetey lady
node 1
carl dagne figment jack carl peetey dagne lady lula muddy
peetey lady
node 2 node 3
carl peetey dagne lady lula muddy
peetey lady
node 4
carl dagne figment jack carl dagne figment jack pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee
BEGIN TXN1 WRITE[sunny] WRITE[ozzie] COMMIT
GATEWAY
INSERT INTO dogs VALUES (sunny, ozzie) transactions TXN1: COMMIT
pinetop sooshi stella sunny zee carl peetey dagne lady lula muddy
peetey lady
node 1
carl dagne figment jack carl peetey dagne lady lula muddy
peetey lady
node 2 node 3
carl peetey dagne lady lula muddy
peetey lady
node 4
carl dagne figment jack carl dagne figment jack pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee
BEGIN TXN1 WRITE[sunny] WRITE[ozzie] COMMIT
GATEWAY
INSERT INTO dogs VALUES (sunny, ozzie)
pinetop sooshi stella sunny zee carl peetey dagne lady lula muddy
peetey lady
ACK
Serial Pipelined
Serial Pipelined
s u n n y s u n n y
BEGIN WRITE[sunny]
txn:sunny (pending)
Serial Pipelined
txn:sunny (pending) s u n n y
z i e s u n n y
z i e
BEGIN WRITE[sunny] WRITE[ozzie]
Serial Pipelined
txn:sunny (pending) s u n n y
z i e t x n : s u n n y ( c
m i t )
[ k e y s : s u n n y ,
z i e ]
txn:sunny (staged)
[ k e y s : s u n n y ,
z i e ]
s u n n y
z i e
BEGIN WRITE[sunny] WRITE[ozzie] COMMIT
Serial Pipelined
txn:sunny (pending) s u n n y
z i e t x n : s u n n y ( c
m i t )
[ k e y s : s u n n y ,
z i e ]
BEGIN WRITE[sunny] WRITE[ozzie] COMMIT Committed once all
We replaced the centralized commit marker with a distributed one
t
s u n n y
z i e txn:sunny (staged)
[ k e y s : s u n n y ,
z i e ]
* “Proved” with TLA+
CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT )
ID Name Price 1 Bat 1.11 2 Ball 2.22 3 Glove 3.33 Key Value /1 “Bat”,1.11 /2 “Ball”,2.22 /3 “Glove”,3.33
CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT )
ID Name Price 1 Bat 1.11 2 Ball 2.22 3 Glove 3.33 Key Value /<Table>/<Index>/1 “Bat”,1.11 /<Table>/<Index>/2 “Ball”,2.22 /<Table>/<Index>/3 “Glove”,3.33
CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT )
ID Name Price 1 Bat 1.11 2 Ball 2.22 3 Glove 3.33 Key Value /inventory/primary/1 “Bat”,1.11 /inventory/primary/2 “Ball”,2.22 /inventory/primary/3 “Glove”,3.33
CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT, INDEX name_idx (name) )
ID Name Price 1 Bat 1.11 2 Ball 2.22 3 Glove 3.33 Key Value /inventory/name_idx/”Bat”/1 ∅ /inventory/name_idx/”Ball”/2 ∅ /inventory/name_idx/”Glove”/3 ∅
CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT, INDEX name_idx (name) )
ID Name Price 1 Bat 1.11 2 Ball 2.22 3 Glove 3.33 4 Bat 4.44 Key Value /inventory/name_idx/”Bat”/1 ∅ /inventory/name_idx/”Ball”/2 ∅ /inventory/name_idx/”Glove”/3 ∅
CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT, INDEX name_idx (name) )
ID Name Price 1 Bat 1.11 2 Ball 2.22 3 Glove 3.33 4 Bat 4.44 Key Value /inventory/name_idx/”Bat”/1 ∅ /inventory/name_idx/”Ball”/2 ∅ /inventory/name_idx/”Glove”/3 ∅ /inventory/name_idx/”Bat”/4 ∅
○ Technically, not a relational operator
○ For example, a “filter” expression has 1 input expression and a scalar expression that filters the rows from the child ○ The scan expression has zero inputs
inventory
inventory
name >= “b” AND name < “c”
inventory
name >= “b” AND name < “c”
name
inventory
name >= “b” AND name < “c”
name
Results
inventory@name [“b” - “c”)
The filter gets pushed into the scan
inventory@name [“b” - “c”)
name
Results
○ hash group by, stream group by ○ hash join, merge join, lookup join, zig-zag join
Name Country Bob United States Hans Germany Jacques France Marie France Susan United States
Name Country Bob United States Hans Germany Jacques France Marie France Susan United States
Name Country Bob United States Hans Germany Jacques France Marie France Susan United States United States 1
Name Country Bob United States Hans Germany Jacques France Marie France Susan United States United States 1 Germany 1
Name Country Bob United States Hans Germany Jacques France Marie France Susan United States United States 1 Germany 1 France 1
Name Country Bob United States Hans Germany Jacques France Marie France Susan United States United States 1 Germany 1 France 2
Name Country Bob United States Hans Germany Jacques France Marie France Susan United States United States 2 Germany 1 France 2
Name Country Bob United States Hans Germany Jacques France Marie France Susan United States
Name Country Jacques France Marie France Hans Germany Bob United States Susan United States
Name Country Jacques France Marie France Hans Germany Bob United States Susan United States France 1
Name Country Jacques France Marie France Hans Germany Bob United States Susan United States France 2
Name Country Jacques France Marie France Hans Germany Bob United States Susan United States France 2 Germany 1
Name Country Jacques France Marie France Hans Germany Bob United States Susan United States France 2 Germany 1 United States 1
Name Country Jacques France Marie France Hans Germany Bob United States Susan United States France 2 Germany 1 United States 2
Scan customers Scan customers Scan customers scan scan scan
Scan customers Scan customers Scan customers Group-By “country” Group-By “country” Group-By “country” group-by group-by group-by
Scan customers Scan customers Scan customers Group-By “country” Group-By “country” Group-By “country” Group-By “country” group-by
Memo
AST Plan
Fold Constants Check Types Resolve Names Report Semantic Errors Compute properties Retrieve and attach stats Cost-independent transformations Cost-based transformations Parse SQL
○ Constant folding ○ Filter push-down ○ Decorrelating subqueries* ○ ...
○ If the transformation can be applied to the query, it is applied
○ Compiled down to code which efficiently matches query fragments in the memo ○ ~200 transformations currently defined
* Actually cost-based, but we’re treating it as cost-independent right now
a@primary
x > 10
Results
b@primary
a@primary
x > 10
Results
b@primary
x > 10
○ Index selection ○ Join reordering ○ ...
○ When should the transformation be applied? ○ Need to try both paths and maintain both the original and transformed query ○ State explosion: thousands of possible query plans ■ Memo data structure maintains a forest of query plans ○ Estimate cost of each query, select query with lowest cost
○ Based on table statistics and estimating cardinality of inputs to relational expressions
Sorting is expensive if there are a lot of rows Sorting can be the better option if there are few rows
Sorting is expensive if there are a lot of rows Sorting can be the better option if there are few rows
a@primary
x > 10
y
Sorting is expensive if there are a lot of rows Sorting can be the better option if there are few rows
a@primary
a@x [10 - )
x > 10
y
y
Sorting is expensive if there are a lot of rows Sorting can be the better option if there are few rows
a@primary
a@x [10 - )
x > 10
a@y
y
y
x > 10
Sorting is expensive if there are a lot of rows Sorting can be the better option if there are few rows
a@primary
a@x [10 - )
x > 10
a@y
y
y
x > 10
10 100,000 10 10
Sorting is expensive if there are a lot of rows Sorting can be the better option if there are few rows
a@primary
a@x [10 - )
x > 10
a@y
y
y
x > 10
50,000 100,000 50,000 50,000
CREATE TABLE postal_codes ( id INT PRIMARY KEY, code STRING, INDEX idx_eu (id) STORING (code), INDEX idx_usw (id) STORING (code) )
CREATE TABLE postal_codes ( id INT PRIMARY KEY, code STRING, INDEX idx_eu (id) STORING (code), INDEX idx_usw (id) STORING (code) )
SELECT * FROM postal_codes
www.cockroachlabs.com github.com/cockroachdb/cockroach
node 1
carl dagne figment jack carl peetey dagne lady lula muddy peetey lady
node 1 node 1
carl peetey dagne lady lula muddy peetey lady
node 1
carl dagne figment jack pinetop sooshi stella zee pinetop sooshi stella zee pinetop sooshi stella zee carl peetey dagne lady lula muddy peetey lady carl dagne figment jack
INSERT INTO DOGS (sunny);
node 1
carl dagne figment jack carl peetey dagne lady lula muddy peetey lady
node 1 node 1
carl peetey dagne lady lula muddy peetey lady
node 1
carl dagne figment jack carl peetey dagne lady lula muddy peetey lady carl dagne figment jack pinetop sooshi stella zee pinetop sooshi stella zee pinetop sooshi stella sunny zee
BEGIN WRITE[sunny] COMMIT
GATEWAY
INSERT INTO DOGS (sunny);
NOTE: a gateway can be ANY CockroachDB instance. It can find the leaseholder for any range and execute a transaction
node 1
carl dagne figment jack carl peetey dagne lady lula muddy peetey lady
node 1 node 1
carl peetey dagne lady lula muddy peetey lady
node 1
carl dagne figment jack carl peetey dagne lady lula muddy peetey lady carl dagne figment jack pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee
BEGIN WRITE[sunny] COMMIT
GATEWAY
INSERT INTO DOGS (sunny);
ACK
node 1
carl dagne figment jack carl peetey dagne lady lula muddy peetey lady
node 1 node 1
carl peetey dagne lady lula muddy peetey lady
node 1
carl dagne figment jack carl peetey dagne lady lula muddy peetey lady carl dagne figment jack pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee
BEGIN WRITE[sunny] COMMIT
GATEWAY
INSERT INTO DOGS (sunny);
ACK
* This approach is also used by Bigtable (tablets), HBase (regions) & Spanner (ranges)