CockroachDB Architecture of a Geo-Distributed SQL Database Nathan - - PowerPoint PPT Presentation

cockroachdb
SMART_READER_LITE
LIVE PREVIEW

CockroachDB Architecture of a Geo-Distributed SQL Database Nathan - - PowerPoint PPT Presentation

CockroachDB Architecture of a Geo-Distributed SQL Database Nathan VanBenschoten (@natevanben), Staff Software Engineer CockroachDB: Geo-distributed SQL Database Make Data Easy Distributed Horizontally scalable to grow with your


slide-1
SLIDE 1

Architecture of a Geo-Distributed SQL Database

CockroachDB

Nathan VanBenschoten (@natevanben), Staff Software Engineer

slide-2
SLIDE 2

CockroachDB: Geo-distributed SQL Database

Make Data Easy

  • Distributed

○ Horizontally scalable to grow with your application

  • Geo-distributed

○ Handle datacenter failures ○ Place data near usage ○ Push computation near data

  • SQL

○ Lingua-franca for rich data storage ○ Schemas, indexes, and transactions make app development easier

slide-3
SLIDE 3

AGENDA

  • Introduction
  • Ranges and Replicas
  • Transactions
  • SQL Data in a KV World
  • SQL Execution
  • SQL Optimization
slide-4
SLIDE 4

Distributed, Replicated, Transactional KV*

  • Keys and values are strings

○ Lexicographically ordered by key

  • Multi-version concurrency control (MVCC)

○ Values are never updated “in place” ○ Tombstones are used to delete values ○ Provides snapshot to each transaction

  • Monolithic key-space

* Not exposed for external usage

slide-5
SLIDE 5

Monolithic Key Space

DOGS

carl dagne figment jack pinetop sooshi stella zee muddy peetey lula lady

Monolithic logical key space

  • Ordered lexicographically by key
slide-6
SLIDE 6

Ranges

DOGS

carl dagne figment jack pinetop sooshi stella zee muddy peetey lula lady

carl dagne figment jack muddy peetey lula lady pinetop sooshi stella zee

Key space divided into contiguous ~64MB ranges

Ranges are small enough to be moved/split quickly Ranges are large enough to amortize indexing overhead

slide-7
SLIDE 7

Range Indexing

DOGS

carl dagne figment jack pinetop sooshi stella zee muddy peetey lula lady

Index structure used to locate ranges

(very much like a B-tree)

1 2 3 carl - jack lady - peetey pinetop - zee carl dagne figment jack muddy peetey lula lady pinetop sooshi stella zee

slide-8
SLIDE 8

Ordered Range Scans

DOGS

carl dagne figment jack pinetop sooshi zee peetey lula lady

Ordered keys enable efficient range scans

dogs >= “muddy” AND <= “stella”

1 2 3 carl - jack lady - peetey pinetop - zee carl dagne figment jack peetey lula lady pinetop sooshi zee muddy stella

stella muddy

slide-9
SLIDE 9

Raft and Replication

Ranges (~64MB) are the unit of replication Each range is a Raft group (Raft is a distributed consensus protocol) Default to 3 replicas, though this is configurable

  • Important system ranges default to 5 replicas

Raft group

slide-10
SLIDE 10

Raft and Replication

Raft provides “atomic replication” of commands Each group elects a leader Commands require majority of replicas to vote

Raft group

LEADER

slide-11
SLIDE 11

node 1 node 2 node 4 node 3

Range Leases

muddy peetey lula lady pinetop sooshi stella zee carl dagne figment jack

Reads with consensus

Reads must talk to a quorum of replicas

READ[carl]

slide-12
SLIDE 12

node 1 node 2 node 4 node 3

Range Leases

muddy peetey lula lady pinetop sooshi stella zee carl dagne figment jack

Reads without consensus

One replica is chosen as the leaseholder

READ[carl] leaseholder

slide-13
SLIDE 13

node 1 node 2 node 4 node 3

Range Leases

muddy peetey lula lady pinetop sooshi stella zee carl dagne figment jack

Reads without consensus

One replica is chosen as the leaseholder

  • Coordinates writes (proposal, key locking)
  • Performs reads

READ[carl] leaseholder

slide-14
SLIDE 14

node 1 node 2 node 4 node 3

Replica Placement

muddy peetey lula lady pinetop sooshi stella zee

  • Space
  • Diversity
  • Load
  • Latency

carl dagne figment jack

Each Range is a Raft state machine A Range has 1 or more Replicas

slide-15
SLIDE 15

node 1 node 2 node 4 node 3

Replica Placement: Diversity

muddy peetey lula lady carl dagne figment jack

Diversity

  • ptimizes placement of

replicas across “failure domains”

  • Disk
  • Single machine
  • Rack
  • Datacenter
  • Region

pinetop sooshi stella zee

slide-16
SLIDE 16

node 1 node 2 node 6 node 4 node 5

Replica Placement: Load

muddy peetey lula lady pinetop sooshi stella zee carl dagne figment jack

Load

Balances placement using heuristics that considers real-time usage metrics of the data itself

This range is high load as it is accessed more than others

node 1 node 3

slide-17
SLIDE 17

Replica Placement: Latency & Geo-partitioning

muddy peetey lula lady carl dagne figment jack pinetop sooshi stella zee USE/muddy USE/stella USE/figment USE/dagne USW/jack USW/lady USW/peetey USW/pinetop EU/carl EU/lula EU/sooshi EU/zee

We apply a constraint that indicates regional placement so we can ensure low latency access or jurisdictional control of data

slide-18
SLIDE 18

AGENDA

  • Introduction
  • Ranges and Replicas
  • Transactions
  • SQL Data in a KV World
  • SQL Execution
  • SQL Optimization
slide-19
SLIDE 19

Transactions

Atomicity, Consistency, Isolation, Durability Serializable Isolation

  • As if the transactions are run in a serial order
  • Gold standard isolation level
  • Make Data Easy - weaker isolation levels are too great a burden

Transactions can span arbitrary ranges Conversational

  • The full set of operations is not required up front
slide-20
SLIDE 20

Transactions

Raft provides atomic writes to individual ranges Bootstrap transaction atomicity using Raft atomic writes Transaction record atomically flipped from PENDING to COMMIT

slide-21
SLIDE 21

Transaction Atomicity

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady

node 2 node 3

carl peetey dagne lady lula muddy peetey lady

node 4

carl dagne figment jack pinetop sooshi stella zee pinetop sooshi stella zee pinetop sooshi stella zee carl peetey dagne lady lula muddy peetey lady carl dagne figment jack

INSERT INTO dogs VALUES (sunny, ozzie)

slide-22
SLIDE 22

Transaction Atomicity

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady

node 2 node 3

carl peetey dagne lady lula muddy peetey lady

node 4

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady carl dagne figment jack pinetop sooshi stella zee pinetop sooshi stella zee

BEGIN TXN1 WRITE[sunny]

GATEWAY

INSERT INTO dogs VALUES (sunny, ozzie)

pinetop sooshi stella zee

slide-23
SLIDE 23

Transaction Atomicity

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady

node 2 node 3

carl peetey dagne lady lula muddy peetey lady

node 4

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady carl dagne figment jack pinetop sooshi stella zee pinetop sooshi stella zee

BEGIN TXN1 WRITE[sunny]

GATEWAY

INSERT INTO dogs VALUES (sunny, ozzie) transactions TXN1: PENDING

pinetop sooshi stella sunny* zee

slide-24
SLIDE 24

Transaction Atomicity

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady

node 2 node 3

carl peetey dagne lady lula muddy peetey lady

node 4

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady carl dagne figment jack

transactions TXN1: PENDING

pinetop sooshi stella sunny* zee

transactions TXN1: PENDING

pinetop sooshi stella sunny* zee

BEGIN TXN1 WRITE[sunny]

GATEWAY

INSERT INTO dogs VALUES (sunny, ozzie) transactions TXN1: PENDING

pinetop sooshi stella sunny* zee

slide-25
SLIDE 25

Transaction Atomicity

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady

node 2 node 3

carl peetey dagne lady lula muddy peetey lady

node 4

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady carl dagne figment jack pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee

BEGIN TXN1 WRITE[sunny]

GATEWAY

INSERT INTO dogs VALUES (sunny, ozzie) transactions TXN1: PENDING transactions TXN1: PENDING transactions TXN1: PENDING

pinetop sooshi stella sunny zee

ACK

pinetop sooshi stella sunny* zee pinetop sooshi stella sunny* zee pinetop sooshi stella sunny* zee

slide-26
SLIDE 26

Transaction Atomicity

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady

node 2 node 3

carl peetey dagne lady lula muddy peetey lady

node 4

carl dagne figment jack carl dagne figment jack pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee

BEGIN TXN1 WRITE[sunny] WRITE[ozzie]

GATEWAY

INSERT INTO dogs VALUES (sunny, ozzie) transactions TXN1: PENDING transactions TXN1: PENDING transactions TXN1: PENDING

pinetop sooshi stella sunny zee carl peetey dagne lady lula muddy

  • zzie*

peetey lady pinetop sooshi stella sunny* zee pinetop sooshi stella sunny* zee pinetop sooshi stella sunny* zee

slide-27
SLIDE 27

Transaction Atomicity

node 1

carl dagne figment jack carl peetey dagne lady lula muddy

  • zzie*

peetey lady

node 2 node 3

carl peetey dagne lady lula muddy

  • zzie*

peetey lady

node 4

carl dagne figment jack carl dagne figment jack

BEGIN TXN1 WRITE[sunny] WRITE[ozzie]

GATEWAY

INSERT INTO dogs VALUES (sunny, ozzie) transactions TXN1: PENDING transactions TXN1: PENDING transactions TXN1: PENDING

carl peetey dagne lady lula muddy

  • zzie*

peetey lady pinetop sooshi stella sunny* zee pinetop sooshi stella sunny* zee pinetop sooshi stella sunny* zee

slide-28
SLIDE 28

Transaction Atomicity

node 1

carl dagne figment jack

node 2 node 3 node 4

carl dagne figment jack carl dagne figment jack

BEGIN TXN1 WRITE[sunny] WRITE[ozzie]

GATEWAY

INSERT INTO dogs VALUES (sunny, ozzie) transactions TXN1: PENDING transactions TXN1: PENDING transactions TXN1: PENDING

A C K

pinetop sooshi stella sunny* zee pinetop sooshi stella sunny* zee pinetop sooshi stella sunny* zee carl peetey dagne lady lula muddy

  • zzie*

peetey lady carl peetey dagne lady lula muddy

  • zzie*

peetey lady carl peetey dagne lady lula muddy

  • zzie*

peetey lady

slide-29
SLIDE 29

Transaction Atomicity

node 1

carl dagne figment jack carl peetey dagne lady lula muddy

  • zzie*

peetey lady

node 2 node 3

carl peetey dagne lady lula muddy

  • zzie*

peetey lady

node 4

carl dagne figment jack carl dagne figment jack

transactions TXN1: PENDING transactions TXN1: PENDING transactions TXN1: COMMITTED

pinetop sooshi stella sunny* zee pinetop sooshi stella sunny* zee

BEGIN TXN1 WRITE[sunny] WRITE[ozzie] COMMIT

GATEWAY

INSERT INTO dogs VALUES (sunny, ozzie)

pinetop sooshi stella sunny* zee carl peetey dagne lady lula muddy

  • zzie*

peetey lady

slide-30
SLIDE 30

Transaction Atomicity

node 1

carl dagne figment jack carl peetey dagne lady lula muddy

  • zzie*

peetey lady

node 2 node 3

carl peetey dagne lady lula muddy

  • zzie*

peetey lady

node 4

carl dagne figment jack carl dagne figment jack

transactions TXN1: COMMITTED transactions TXN1: COMMITTED transactions TXN1: COMMITTED

pinetop sooshi stella sunny* zee pinetop sooshi stella sunny* zee

BEGIN TXN1 WRITE[sunny] WRITE[ozzie] COMMIT

GATEWAY

INSERT INTO dogs VALUES (sunny, ozzie)

pinetop sooshi stella sunny* zee carl peetey dagne lady lula muddy

  • zzie*

peetey lady

slide-31
SLIDE 31

Transaction Atomicity

node 1

carl dagne figment jack carl peetey dagne lady lula muddy

  • zzie*

peetey lady

node 2 node 3

carl peetey dagne lady lula muddy

  • zzie*

peetey lady

node 4

carl dagne figment jack carl dagne figment jack

transactions TXN1: COMMITTED transactions TXN1: COMMITTED transactions TXN1: COMMITTED

pinetop sooshi stella sunny* zee pinetop sooshi stella sunny* zee

BEGIN TXN1 WRITE[sunny] WRITE[ozzie] COMMIT

GATEWAY

INSERT INTO dogs VALUES (sunny, ozzie)

pinetop sooshi stella sunny* zee carl peetey dagne lady lula muddy

  • zzie*

peetey lady

A C K

slide-32
SLIDE 32

Transaction Atomicity

node 1

carl dagne figment jack carl peetey dagne lady lula muddy

  • zzie*

peetey lady

node 2 node 3

carl peetey dagne lady lula muddy

  • zzie*

peetey lady

node 4

carl dagne figment jack carl dagne figment jack

transactions TXN1: COMMITTED transactions TXN1: COMMITTED transactions TXN1: COMMITTED

pinetop sooshi stella sunny* zee pinetop sooshi stella sunny* zee

BEGIN TXN1 WRITE[sunny] WRITE[ozzie] COMMIT

GATEWAY

INSERT INTO dogs VALUES (sunny, ozzie)

pinetop sooshi stella sunny* zee carl peetey dagne lady lula muddy

  • zzie*

peetey lady

ACK

slide-33
SLIDE 33

Transaction Atomicity

node 1

carl dagne figment jack carl peetey dagne lady lula muddy

  • zzie

peetey lady

node 2 node 3

carl peetey dagne lady lula muddy

  • zzie

peetey lady

node 4

carl dagne figment jack carl dagne figment jack

transactions TXN1: COMMITTED transactions TXN1: COMMITTED transactions TXN1: COMMITTED

pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee

BEGIN TXN1 WRITE[sunny] WRITE[ozzie] COMMIT

GATEWAY

INSERT INTO dogs VALUES (sunny, ozzie)

pinetop sooshi stella sunny zee carl peetey dagne lady lula muddy

  • zzie

peetey lady

ACK

slide-34
SLIDE 34

AGENDA

  • Introduction
  • Ranges and Replicas
  • Transactions
  • SQL Data in a KV World
  • SQL Execution
  • SQL Optimization
slide-35
SLIDE 35

SQL

Structured Query Language Declarative, not imperative

  • These are the results I want vs perform these operations in this sequence

Relational data model

  • Typed: INT, FLOAT, STRING, ...
  • Schemas: tables, rows, columns, foreign keys
slide-36
SLIDE 36

SQL: Tabular Data in a KV World

SQL data has columns and types?!? How do we store typed and columnar data in a key-value store?

  • The SQL data model needs to be mapped to KV data
slide-37
SLIDE 37

CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT )

SQL Data Mapping: Inventory Table

ID Name Price 1 Bat 1.11 2 Ball 2.22 3 Glove 3.33 Key Value /1 “Bat”,1.11 /2 “Ball”,2.22 /3 “Glove”,3.33

slide-38
SLIDE 38

CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT )

SQL Data Mapping: Inventory Table

ID Name Price 1 Bat 1.11 2 Ball 2.22 3 Glove 3.33 Key Value /<Table>/<Index>/1 “Bat”,1.11 /<Table>/<Index>/2 “Ball”,2.22 /<Table>/<Index>/3 “Glove”,3.33

slide-39
SLIDE 39

CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT )

SQL Data Mapping: Inventory Table

ID Name Price 1 Bat 1.11 2 Ball 2.22 3 Glove 3.33 Key Value /inventory/primary/1 “Bat”,1.11 /inventory/primary/2 “Ball”,2.22 /inventory/primary/3 “Glove”,3.33

slide-40
SLIDE 40

AGENDA

  • Introduction
  • Ranges and Replicas
  • Transactions
  • SQL Data in a KV World
  • SQL Execution
  • SQL Optimization
slide-41
SLIDE 41

SQL Execution

Relational operators

  • Projection (SELECT <columns>)
  • Selection (WHERE <filter>)
  • Aggregation (GROUP BY <columns>)
  • Join (JOIN), Union (UNION), Intersect (INTERSECT)
  • Scan (FROM <table>)
  • Sort (ORDER BY)

○ Technically, not a relational operator

slide-42
SLIDE 42

SQL Execution

  • Relational expressions have input expressions and scalar expressions

○ The filter expression has 1 input expression and a predicate scalar expression ○ The scan expression has zero inputs

  • Query plan is a tree of relational expressions
  • SQL execution takes a query plan and runs the operations to completion
slide-43
SLIDE 43

SQL Execution: Example

SELECT name FROM inventory WHERE name >= “b” AND name < “c”

slide-44
SLIDE 44

SQL Execution: Scan

SELECT name FROM inventory WHERE name >= “b” AND name < “c” Scan

inventory

slide-45
SLIDE 45

SQL Execution: Filter

SELECT name FROM inventory WHERE name >= “b” AND name < “c” Scan

inventory

Filter

name >= “b” AND name < “c”

slide-46
SLIDE 46

SQL Execution: Project

SELECT name FROM inventory WHERE name >= “b” AND name < “c” Scan

inventory

Filter

name >= “b” AND name < “c”

Project

name

slide-47
SLIDE 47

SQL Execution: Project

SELECT name FROM inventory WHERE name >= “b” AND name < “c” Scan

inventory

Filter

name >= “b” AND name < “c”

Project

name

Results

slide-48
SLIDE 48

SQL Execution: Index Scans

SELECT name FROM inventory WHERE name >= “b” AND name < “c” Scan

inventory@name [“b” - “c”)

The filter gets pushed into the scan

slide-49
SLIDE 49

SQL Execution: Index Scans

SELECT name FROM inventory WHERE name >= “b” AND name < “c” Scan

inventory@name [“b” - “c”)

Project

name

Results

slide-50
SLIDE 50

SQL Execution: Correctness

Correct SQL execution involves lots of bookkeeping

  • User defined tables, and indexes
  • Queries refer to table and column names
  • Execution uses table and column IDs
  • NULL handling
slide-51
SLIDE 51

SQL Execution: Performance

Performant SQL execution

  • Tight, well written code
  • Operator specialization

○ hash group by, stream group by ○ hash join, merge join, lookup join, zig-zag join

  • Distributed execution
slide-52
SLIDE 52

Distributed SQL Execution

Network latencies and throughput are important Push fragments of computation as close to the data as possible Leverage aggregate compute resources

slide-53
SLIDE 53

Distributed SQL Execution: Streaming Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Scan customers Scan customers Scan customers scan scan scan

slide-54
SLIDE 54

Distributed SQL Execution: Streaming Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Scan customers Scan customers Scan customers Group-By “country” Group-By “country” Group-By “country” group-by group-by group-by

slide-55
SLIDE 55

Distributed SQL Execution: Streaming Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Scan customers Scan customers Scan customers Group-By “country” Group-By “country” Group-By “country” Group-By “country” group-by

slide-56
SLIDE 56

AGENDA

  • Introduction
  • Ranges and Replicas
  • Transactions
  • SQL Data in a KV World
  • SQL Execution
  • SQL Optimization
slide-57
SLIDE 57

SQL Optimization

An optimizer explores many plans that are logically equivalent to a given query and chooses the best one Parse Execute Search

Memo

Prep

AST Plan

Fold Constants Check Types Resolve Names Report Semantic Errors Compute properties Retrieve and attach stats Cost-independent transformations Cost-based transformations Parse SQL

slide-58
SLIDE 58

SQL Optimization: Cost-Independent Transformations

  • Some transformations always make sense

○ Constant folding ○ Filter push-down ○ ...

  • These transformations are cost-independent

○ If the transformation can be applied to the query, it is applied

  • Domain Specific Language for transformations

○ Compiled down to code which efficiently matches query fragments in the memo ○ ~200 transformations currently defined

slide-59
SLIDE 59

SQL Optimization: Cost-Based Transformations

  • Some transformations are not universally good

○ Index selection ○ Join reordering ○ ...

  • These transformations are cost-based

○ When should the transformation be applied? ○ Need to try both paths and maintain both the original and transformed query ○ State explosion: thousands of possible query plans ■ Memo data structure maintains a forest of query plans ○ Estimate cost of each query, select query with lowest cost

  • Costing

○ Based on table statistics and estimating cardinality of inputs to relational expressions

slide-60
SLIDE 60

SQL Optimization: Cost-based Index Selection

The index to use for a query is affected by multiple factors

  • Filters and join conditions
  • Required ordering (ORDER BY)
  • Implicit ordering (GROUP BY)
  • Covering vs non-covering (i.e. is an index-join required)
  • Locality
slide-61
SLIDE 61

SQL Optimization: Cost-based Index Selection

SELECT * FROM a WHERE x > 10 ORDER BY y Required orderings affect index selection

Sorting is expensive if there are a lot of rows Sorting can be the better option if there are few rows

slide-62
SLIDE 62

SQL Optimization: Cost-based Index Selection

Required orderings affect index selection

Sorting is expensive if there are a lot of rows Sorting can be the better option if there are few rows

Scan

a@primary

Filter

x > 10

Sort

y

SELECT * FROM a WHERE x > 10 ORDER BY y

slide-63
SLIDE 63

SQL Optimization: Cost-based Index Selection

Required orderings affect index selection

Sorting is expensive if there are a lot of rows Sorting can be the better option if there are few rows

Scan

a@primary

Scan

a@x [10 - )

Filter

x > 10

Sort

y

Sort

y

SELECT * FROM a WHERE x > 10 ORDER BY y

slide-64
SLIDE 64

SQL Optimization: Cost-based Index Selection

Required orderings affect index selection

Sorting is expensive if there are a lot of rows Sorting can be the better option if there are few rows

Scan

a@primary

Scan

a@x [10 - )

Filter

x > 10

Scan

a@y

Sort

y

Sort

y

Filter

x > 10

SELECT * FROM a WHERE x > 10 ORDER BY y

slide-65
SLIDE 65

SQL Optimization: Cost-based Index Selection

Required orderings affect index selection

Sorting is expensive if there are a lot of rows Sorting can be the better option if there are few rows

Scan

a@primary

Scan

a@x [10 - )

Filter

x > 10

Scan

a@y

Sort

y

Sort

y

Filter

x > 10

SELECT * FROM a WHERE x > 10 ORDER BY y

10 100,000 10 10

Lowest Cost

slide-66
SLIDE 66

SQL Optimization: Cost-based Index Selection

Required orderings affect index selection

Sorting is expensive if there are a lot of rows Sorting can be the better option if there are few rows

Scan

a@primary

Scan

a@x [10 - )

Filter

x > 10

Scan

a@y

Sort

y

Sort

y

Filter

x > 10

SELECT * FROM a WHERE x > 10 ORDER BY y

50,000 100,000 50,000 50,000

Lowest Cost

slide-67
SLIDE 67

Locality-Aware SQL Optimization

Network latencies and throughput are important Duplicate read-mostly data in each locality Plan queries to use data from the same locality

slide-68
SLIDE 68

Locality-Aware SQL Optimization

Three copies of the postal_codes table data Use replication constraints to pin the copies to different geographic regions (US-East, US-West, EU)

CREATE TABLE postal_codes ( id INT PRIMARY KEY, code STRING, INDEX idx_eu (id) STORING (code), INDEX idx_usw (id) STORING (code) )

slide-69
SLIDE 69

Locality-Aware SQL Optimization

Optimizer includes locality in cost model Automatically selects index from same locality: primary, idx_eu, or idx_usw

CREATE TABLE postal_codes ( id INT PRIMARY KEY, code STRING, INDEX idx_eu (id) STORING (code), INDEX idx_usw (id) STORING (code) )

SELECT * FROM postal_codes

slide-70
SLIDE 70

Conclusion

  • Distributed, replicated, transactional key-value store
  • Monolithic key space
  • Raft replication of ranges (~64MB)
  • Replica placement signals: space, diversity, load, latency
  • Transactions coordinate changes across ranges
  • Mapping SQL data to KV storage
  • Distributed SQL execution
  • Distributed SQL optimization
slide-71
SLIDE 71

www.cockroachlabs.com github.com/cockroachdb/cockroach

Thank You

slide-72
SLIDE 72

Transactional Updates

DOGS

carl dagne figment jack pinetop sooshi zee peetey lula lady

Transactions used to insert records into ranges

1 2 3 carl - jack lady - peetey pinetop - zee

stella muddy INSERT[sunny]

INSERT[sunny]

Space available in range? - YES

carl dagne figment jack muddy peetey lula lady pinetop sooshi stella zee

pinetop sooshi stella zee muddy peetey lula lady

✓ ?

slide-73
SLIDE 73

Transactional Updates

DOGS

carl dagne figment jack pinetop sooshi zee peetey lula lady

1 2 3 carl - jack lady - peetey pinetop - zee

stella muddy INSERT[sunny]

carl dagne figment jack muddy peetey lula lady pinetop sooshi stella sunny zee

pinetop sooshi stella zee muddy peetey lula lady

Transactions used to insert records into ranges

INSERT[sunny]

slide-74
SLIDE 74

Range Splits

DOGS

carl dagne figment jack pinetop sooshi zee peetey lula lady

1 2 3 carl - jack lady - peetey pinetop - zee

stella muddy INSERT[rudy]

carl dagne figment jack muddy peetey lula lady pinetop sooshi stella sunny zee

pinetop sooshi stella zee muddy peetey lula lady

BUT… what happens when a range is full?

✓ ? INSERT[rudy]

Space available in range? - NO

slide-75
SLIDE 75

Range Splits

DOGS

carl dagne figment jack pinetop sooshi zee peetey lula lady stella muddy INSERT[rudy]

carl dagne figment jack muddy peetey lula lady pinetop rudy sooshi

pinetop sooshi stella zee muddy peetey lula lady

Ranges are automatically split, a new range index is created & order maintained

✓ INSERT[rudy]

split range and insert

stella sunny zee 1 2 3 carl - jack lady - peetey pinetop - sooshi 4 stella - zee

slide-76
SLIDE 76

Rebalancing Replicas

node 1 node 5 node 4 node 2 node 3 NEW

Scale: Add a node

If we add a node to the cluster, CockroachDB automatically redistributed replicas to even load across the cluster Uses the replica placement heuristics from previous slides

slide-77
SLIDE 77

Rebalancing Replicas

node 1 node 5 node 4 node 2 node 3 NEW

Scale: Add a node

If we add a node to the cluster, CockroachDB automatically redistributed replicas to even load across the cluster Uses the replica placement heuristics from previous slides Movement is decomposed into adding a replica followed by removing a replica

slide-78
SLIDE 78

Rebalancing Replicas

node 1 node 5 node 4 node 2 node 3 NEW

Scale: Add a node

If we add a node to the cluster, CockroachDB automatically redistributed replicas to even load across the cluster Uses the replica placement heuristics from previous slides Movement is decomposed into adding a replica followed by removing a replica

slide-79
SLIDE 79

Rebalancing Replicas

node 1 node 5 node 4 node 2 node 3

Loss of a node Permanent Failure

If a node goes down, the Raft group realizes a replica is missing and replaces it with a new replica

  • n an active node

Uses the replica placement heuristics from previous slides

slide-80
SLIDE 80

Rebalancing Replicas

node 1 node 5 node 4 node 2 node 3

Loss of a node Permanent Failure

If a node goes down, the Raft group realizes a replica is missing and replaces it with a new replica

  • n an active node

Uses the replica placement heuristics from previous slides

slide-81
SLIDE 81

Rebalancing Replicas

node 1 node 5 node 4 node 2

Loss of a node Temporary Failure

If a node goes down for a moment, the leaseholder can “catch up” any replica that is behind

node 3

slide-82
SLIDE 82

Transactions: Pipelining

Serial Pipelined

slide-83
SLIDE 83

Transactions: Pipelining

Serial Pipelined

s u n n y s u n n y

BEGIN WRITE[sunny]

txn:sunny (pending)

slide-84
SLIDE 84

Transactions: Pipelining

Serial Pipelined

txn:sunny (pending) s u n n y

  • z

z i e s u n n y

  • z

z i e

BEGIN WRITE[sunny] WRITE[ozzie]

slide-85
SLIDE 85

Transactions: Pipelining

Serial Pipelined

txn:sunny (pending) s u n n y

  • z

z i e t x n : s u n n y ( c

  • m

m i t )

[ k e y s : s u n n y ,

  • z

z i e ]

txn:sunny (staged)

[ k e y s : s u n n y ,

  • z

z i e ]

s u n n y

  • z

z i e

BEGIN WRITE[sunny] WRITE[ozzie] COMMIT

slide-86
SLIDE 86

Transactions: Pipelining

Serial Pipelined

txn:sunny (pending) s u n n y

  • z

z i e t x n : s u n n y ( c

  • m

m i t )

[ k e y s : s u n n y ,

  • z

z i e ]

BEGIN WRITE[sunny] WRITE[ozzie] COMMIT Committed once all

  • perations complete

We replaced the centralized commit marker with a distributed one

t

s u n n y

  • z

z i e txn:sunny (staged)

[ k e y s : s u n n y ,

  • z

z i e ]

* “Proved” with TLA+

slide-87
SLIDE 87

A Simple Transaction

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady

node 1 node 1

carl peetey dagne lady lula muddy peetey lady

node 1

carl dagne figment jack pinetop sooshi stella zee pinetop sooshi stella zee pinetop sooshi stella zee carl peetey dagne lady lula muddy peetey lady carl dagne figment jack

INSERT INTO DOGS (sunny);

slide-88
SLIDE 88

A Simple Transaction: One Range

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady

node 1 node 1

carl peetey dagne lady lula muddy peetey lady

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady carl dagne figment jack pinetop sooshi stella zee pinetop sooshi stella zee pinetop sooshi stella sunny zee

BEGIN WRITE[sunny] COMMIT

GATEWAY

INSERT INTO DOGS (sunny);

NOTE: a gateway can be ANY CockroachDB instance. It can find the leaseholder for any range and execute a transaction

slide-89
SLIDE 89

A Simple Transaction: One Range

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady

node 1 node 1

carl peetey dagne lady lula muddy peetey lady

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady carl dagne figment jack pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee

BEGIN WRITE[sunny] COMMIT

GATEWAY

INSERT INTO DOGS (sunny);

ACK

slide-90
SLIDE 90

A Simple Transaction: One Range

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady

node 1 node 1

carl peetey dagne lady lula muddy peetey lady

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady carl dagne figment jack pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee

BEGIN WRITE[sunny] COMMIT

GATEWAY

INSERT INTO DOGS (sunny);

ACK

slide-91
SLIDE 91

Ranges

CockroachDB implements order-preserving data distribution

  • Automates sharding of key/value data into “ranges”
  • Supports efficient range scans
  • Requires an indexing structure

Foundational capability that enables efficient distribution

  • f data across nodes within a CockroachDB cluster

* This approach is also used by Bigtable (tablets), HBase (regions) & Spanner (ranges)

slide-92
SLIDE 92

CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT, INDEX name_idx (name) )

SQL Data Mapping: Inventory Table

ID Name Price 1 Bat 1.11 2 Ball 2.22 3 Glove 3.33 Key Value /inventory/name_idx/”Bat”/1 ∅ /inventory/name_idx/”Ball”/2 ∅ /inventory/name_idx/”Glove”/3 ∅

slide-93
SLIDE 93

CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT, INDEX name_idx (name) )

SQL Data Mapping: Inventory Table

ID Name Price 1 Bat 1.11 2 Ball 2.22 3 Glove 3.33 4 Bat 4.44 Key Value /inventory/name_idx/”Bat”/1 ∅ /inventory/name_idx/”Ball”/2 ∅ /inventory/name_idx/”Glove”/3 ∅

slide-94
SLIDE 94

CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT, INDEX name_idx (name) )

SQL Data Mapping: Inventory Table

ID Name Price 1 Bat 1.11 2 Ball 2.22 3 Glove 3.33 4 Bat 4.44 Key Value /inventory/name_idx/”Bat”/1 ∅ /inventory/name_idx/”Ball”/2 ∅ /inventory/name_idx/”Glove”/3 ∅ /inventory/name_idx/”Bat”/4 ∅

slide-95
SLIDE 95

SQL Execution: Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Bob United States Hans Germany Jacques France Marie France Susan United States

slide-96
SLIDE 96

SQL Execution: Hash Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Bob United States Hans Germany Jacques France Marie France Susan United States

slide-97
SLIDE 97

SQL Execution: Hash Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Bob United States Hans Germany Jacques France Marie France Susan United States United States 1

slide-98
SLIDE 98

SQL Execution: Hash Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Bob United States Hans Germany Jacques France Marie France Susan United States United States 1 Germany 1

slide-99
SLIDE 99

SQL Execution: Hash Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Bob United States Hans Germany Jacques France Marie France Susan United States United States 1 Germany 1 France 1

slide-100
SLIDE 100

SQL Execution: Hash Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Bob United States Hans Germany Jacques France Marie France Susan United States United States 1 Germany 1 France 2

slide-101
SLIDE 101

SQL Execution: Hash Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Bob United States Hans Germany Jacques France Marie France Susan United States United States 2 Germany 1 France 2

slide-102
SLIDE 102

SQL Execution: Group By Revisited

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Bob United States Hans Germany Jacques France Marie France Susan United States

slide-103
SLIDE 103

SQL Execution: Sort on Grouping Column(s)

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Jacques France Marie France Hans Germany Bob United States Susan United States

slide-104
SLIDE 104

SQL Execution: Streaming Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Jacques France Marie France Hans Germany Bob United States Susan United States France 1

slide-105
SLIDE 105

SQL Execution: Streaming Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Jacques France Marie France Hans Germany Bob United States Susan United States France 2

slide-106
SLIDE 106

SQL Execution: Streaming Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Jacques France Marie France Hans Germany Bob United States Susan United States France 2 Germany 1

slide-107
SLIDE 107

SQL Execution: Streaming Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Jacques France Marie France Hans Germany Bob United States Susan United States France 2 Germany 1 United States 1

slide-108
SLIDE 108

SQL Execution: Streaming Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Jacques France Marie France Hans Germany Bob United States Susan United States France 2 Germany 1 United States 2

slide-109
SLIDE 109

SQL Optimization: Filter Push-Down

SELECT * FROM a JOIN b WHERE x > 10 Scan

a@primary

Filter

x > 10

Results

Scan

b@primary

Join Initial plan

slide-110
SLIDE 110

SQL Optimization: Filter Push-Down

SELECT * FROM a JOIN b WHERE x > 10 Scan

a@primary

Filter

x > 10

Results

Scan

b@primary

Join Filter

x > 10

After filter push-down