CockroachDB Architecture of a Geo-Distributed SQL Database Peter - - PowerPoint PPT Presentation

cockroachdb
SMART_READER_LITE
LIVE PREVIEW

CockroachDB Architecture of a Geo-Distributed SQL Database Peter - - PowerPoint PPT Presentation

CockroachDB Architecture of a Geo-Distributed SQL Database Peter Mattis (@petermattis), Co-founder & CTO CockroachDB: Geo-distributed SQL Database Make Data Easy Distributed Horizontally scalable to grow with your application


slide-1
SLIDE 1

Architecture of a Geo-Distributed SQL Database

CockroachDB

Peter Mattis (@petermattis), Co-founder & CTO

slide-2
SLIDE 2

CockroachDB: Geo-distributed SQL Database

Make Data Easy

  • Distributed

○ Horizontally scalable to grow with your application

  • Geo-distributed

○ Handle datacenter failures ○ Place data near usage ○ Push computation near data

  • SQL

○ Lingua-franca for rich data storage ○ Schemas, indexes, and transactions make app development easier

slide-3
SLIDE 3

AGENDA

  • Introduction
  • Ranges and Replicas
  • Transactions
  • SQL Data in a KV World
  • SQL Execution
  • SQL Optimization
slide-4
SLIDE 4

Distributed, Replicated, Transactional KV*

  • Keys and values are strings

○ Lexicographically ordered by key

  • Multi-version concurrency control (MVCC)

○ Values are never updated “in place”, newer versions shadow older versions ○ Tombstones are used to delete values ○ Provides snapshot to each transaction

  • Monolithic key-space

* Not exposed for external usage

slide-5
SLIDE 5

Monolithic Key Space

DOGS

carl dagne figment jack pinetop sooshi stella zee muddy peetey lula lady

Monolithic logical key space

  • Ordered lexicographically by key
slide-6
SLIDE 6

Ranges

DOGS

carl dagne figment jack pinetop sooshi stella zee muddy peetey lula lady

carl dagne figment jack muddy peetey lula lady pinetop sooshi stella zee

Key space divided into contiguous ~64MB ranges

Ranges are small enough to be moved/split quickly Ranges are large enough to amortize indexing overhead

slide-7
SLIDE 7

Range Indexing

DOGS

carl dagne figment jack pinetop sooshi stella zee muddy peetey lula lady

Index structure used to locate ranges

(very much like a B-tree)

1 2 3 carl - jack lady - peetey pinetop - zee carl dagne figment jack muddy peetey lula lady pinetop sooshi stella zee

slide-8
SLIDE 8

Ordered Range Scans

DOGS

carl dagne figment jack pinetop sooshi zee peetey lula lady

Ordered keys enable efficient range scans

dogs >= “muddy” AND <= “stella”

1 2 3 carl - jack lady - peetey pinetop - zee carl dagne figment jack peetey lula lady pinetop sooshi zee muddy stella

stella muddy

slide-9
SLIDE 9

Transactional Updates

DOGS

carl dagne figment jack pinetop sooshi zee peetey lula lady

Transactions used to insert records into ranges

1 2 3 carl - jack lady - peetey pinetop - zee

stella muddy INSERT[sunny]

INSERT[sunny]

Space available in range? - YES

carl dagne figment jack muddy peetey lula lady pinetop sooshi stella zee

pinetop sooshi stella zee muddy peetey lula lady

✓ ?

slide-10
SLIDE 10

Transactional Updates

DOGS

carl dagne figment jack pinetop sooshi zee peetey lula lady

1 2 3 carl - jack lady - peetey pinetop - zee

stella muddy INSERT[sunny]

carl dagne figment jack muddy peetey lula lady pinetop sooshi stella sunny zee

pinetop sooshi stella zee muddy peetey lula lady

Transactions used to insert records into ranges

INSERT[sunny]

slide-11
SLIDE 11

Range Splits

DOGS

carl dagne figment jack pinetop sooshi zee peetey lula lady

1 2 3 carl - jack lady - peetey pinetop - zee

stella muddy INSERT[rudy]

carl dagne figment jack muddy peetey lula lady pinetop sooshi stella sunny zee

pinetop sooshi stella zee muddy peetey lula lady

BUT… what happens when a range is full?

✓ ? INSERT[rudy]

Space available in range? - NO

slide-12
SLIDE 12

Range Splits

DOGS

carl dagne figment jack pinetop sooshi zee peetey lula lady stella muddy INSERT[rudy]

carl dagne figment jack muddy peetey lula lady pinetop rudy sooshi

pinetop sooshi stella zee muddy peetey lula lady

Ranges are automatically split, a new range index is created & order maintained

✓ INSERT[rudy]

split range and insert

stella sunny zee 1 2 3 carl - jack lady - peetey pinetop - sooshi 4 stella - zee

slide-13
SLIDE 13

Raft and Replication

Ranges (~64MB) are the unit of replication Each range is a Raft group (Raft is a consensus replication protocol) Default to 3 replicas, though this is configurable

  • Important system ranges default to 5 replicas
  • Note: 2 replicas doesn’t make sense in consensus replication

Raft group

slide-14
SLIDE 14

Raft and Replication

Raft provides “atomic replication” of commands Commands are proposed by the leaseholder replica and distributed to the follower replicas, but only accepted when a quorum of replicas have acknowledged receipt

* Leaseholder == Raft leader

Raft group

LEASEHOLDER

slide-15
SLIDE 15

node 1 node 2 node 4 node 3

Range Leases

muddy peetey lula lady pinetop sooshi stella zee carl dagne figment jack

Reads with consensus

Reads must talk to a quorum of replicas

READ[carl]

slide-16
SLIDE 16

node 1 node 2 node 4 node 3

Range Leases

muddy peetey lula lady pinetop sooshi stella zee carl dagne figment jack

Reads without consensus

One replica is chosen as the leaseholder

READ[carl] leaseholder

slide-17
SLIDE 17

node 1 node 2 node 4 node 3

Range Leases

muddy peetey lula lady pinetop sooshi stella zee carl dagne figment jack

Reads without consensus

One replica is chosen as the leaseholder

  • Coordinates writes (proposal, key locking)
  • Performs reads

READ[carl] leaseholder

slide-18
SLIDE 18

node 1 node 2 node 4 node 3

Replica Placement

muddy peetey lula lady pinetop sooshi stella zee

  • Space
  • Diversity
  • Load
  • Latency

carl dagne figment jack

Each Range is a Raft state machine A Range has 1 or more Replicas

slide-19
SLIDE 19

node 1 node 2 node 4 node 3

Replica Placement: Diversity

muddy peetey lula lady carl dagne figment jack

Diversity

  • ptimizes placement of

replicas across “failure domains”

  • Disk
  • Single machine
  • Rack
  • Datacenter
  • Region

pinetop sooshi stella zee

slide-20
SLIDE 20

node 1 node 2 node 6 node 4 node 5

Replica Placement: Load

muddy peetey lula lady pinetop sooshi stella zee carl dagne figment jack

Load

Balances placement using heuristics that considers real-time usage metrics of the data itself

This range is high load as it is accessed more than others

While we show this for ranges within a single table, this is also applicable across all ranges across ALL tables, which is the more typical situation

node 1 node 3

slide-21
SLIDE 21

Replica Placement: Latency & Geo-partitioning

muddy peetey lula lady carl dagne figment jack pinetop sooshi stella zee USE/muddy USE/stella USE/figment USE/dagne USW/jack USW/lady USW/peetey USW/pinetop EU/carl EU/lula EU/sooshi EU/zee

We apply a constraint that indicates regional placement so we can ensure low latency access or jurisdictional control of data

slide-22
SLIDE 22

Rebalancing Replicas

node 1 node 5 node 4 node 2 node 3 NEW

Scale: Add a node

If we add a node to the cluster, CockroachDB automatically redistributed replicas to even load across the cluster Uses the replica placement heuristics from previous slides to decide which node to add to and which to remove from

slide-23
SLIDE 23

Rebalancing Replicas

node 1 node 5 node 4 node 2 node 3 NEW

Scale: Add a node

If we add a node to the cluster, CockroachDB automatically redistributed replicas to even load across the cluster Uses the replica placement heuristics from previous slides Movement is decomposed into adding a replica followed by removing a replica

slide-24
SLIDE 24

Rebalancing Replicas

node 1 node 5 node 4 node 2 node 3 NEW

Scale: Add a node

If we add a node to the cluster, CockroachDB automatically redistributed replicas to even load across the cluster Uses the replica placement heuristics from previous slides Movement is decomposed into adding a replica followed by removing a replica

slide-25
SLIDE 25

Rebalancing Replicas

node 1 node 5 node 4 node 2 node 3

Loss of a node Permanent Failure

If a node goes down, the Raft group realizes a replica is missing and replaces it with a new replica

  • n an active node

Uses the replica placement heuristics from previous slides

slide-26
SLIDE 26

Rebalancing Replicas

node 1 node 5 node 4 node 2 node 3

Loss of a node Permanent Failure

If a node goes down, the Raft group realizes a replica is missing and replaces it with a new replica

  • n an active node

Uses the replica placement heuristics from previous slides The failed replica is removed from the Raft group and a new replica created. The leaseholder sends a snapshot of the Range’s state to bring the new replica up to date.

slide-27
SLIDE 27

Rebalancing Replicas

node 1 node 5 node 4 node 2

Loss of a node Temporary Failure

If a node goes down for a moment, the leaseholder can “catch up” any replica that is behind The leaseholder can send commands to be replayed OR it can send a snapshot of the current Range data. We apply heuristics to decide which is most efficient for a given failure.

node 3

slide-28
SLIDE 28

AGENDA

  • Introduction
  • Ranges and Replicas
  • Transactions
  • SQL Data in a KV World
  • SQL Execution
  • SQL Optimization
slide-29
SLIDE 29

Transactions

Atomicity, Consistency, Isolation, Durability Serializable Isolation

  • As if the transactions are run in a serial order
  • Gold standard isolation level
  • Make Data Easy - weaker isolation levels are too great a burden

Transactions can span arbitrary ranges Conversational

  • The full set of operations is not required up front
slide-30
SLIDE 30

Transactions

Raft provides atomic writes to individual ranges Bootstrap transaction atomicity using Raft atomic writes Transaction record atomically flipped from PENDING to COMMIT

slide-31
SLIDE 31

Distributed Transactions

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady

node 2 node 3

carl peetey dagne lady lula muddy peetey lady

node 4

carl dagne figment jack pinetop sooshi stella zee pinetop sooshi stella zee pinetop sooshi stella zee carl peetey dagne lady lula muddy peetey lady carl dagne figment jack

INSERT INTO dogs VALUES (sunny, ozzie)

slide-32
SLIDE 32

Distributed Transactions

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady

node 2 node 3

carl peetey dagne lady lula muddy peetey lady

node 4

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady carl dagne figment jack pinetop sooshi stella zee pinetop sooshi stella zee

BEGIN TXN1 WRITE[sunny]

GATEWAY

INSERT INTO dogs VALUES (sunny, ozzie) transactions TXN1: PENDING

pinetop sooshi stella zee

slide-33
SLIDE 33

Distributed Transactions

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady

node 2 node 3

carl peetey dagne lady lula muddy peetey lady

node 4

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady carl dagne figment jack pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee

BEGIN TXN1 WRITE[sunny]

GATEWAY

INSERT INTO dogs VALUES (sunny, ozzie) transactions TXN1: PENDING

pinetop sooshi stella sunny zee

slide-34
SLIDE 34

Distributed Transactions

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady

node 2 node 3

carl peetey dagne lady lula muddy peetey lady

node 4

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady carl dagne figment jack pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee

BEGIN TXN1 WRITE[sunny]

GATEWAY

INSERT INTO dogs VALUES (sunny, ozzie) transactions TXN1: PENDING

pinetop sooshi stella sunny zee

ACK

pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee

slide-35
SLIDE 35

Distributed Transactions

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady

node 2 node 3

carl peetey dagne lady lula muddy peetey lady

node 4

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady carl dagne figment jack pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee

BEGIN TXN1 WRITE[sunny]

GATEWAY

INSERT INTO dogs VALUES (sunny, ozzie) transactions TXN1: PENDING

pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee

slide-36
SLIDE 36

Distributed Transactions

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady

node 2 node 3

carl peetey dagne lady lula muddy peetey lady

node 4

carl dagne figment jack carl dagne figment jack pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee

BEGIN TXN1 WRITE[sunny] WRITE[ozzie]

GATEWAY

INSERT INTO dogs VALUES (sunny, ozzie) transactions TXN1: PENDING

pinetop sooshi stella sunny zee carl peetey dagne lady lula muddy

  • zzie

peetey lady

ACK

pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee

slide-37
SLIDE 37

Distributed Transactions

node 1

carl dagne figment jack carl peetey dagne lady lula muddy

  • zzie

peetey lady

node 2 node 3

carl peetey dagne lady lula muddy

  • zzie

peetey lady

node 4

carl dagne figment jack carl dagne figment jack

BEGIN TXN1 WRITE[sunny] WRITE[ozzie]

GATEWAY

INSERT INTO dogs VALUES (sunny, ozzie) transactions TXN1: PENDING

carl peetey dagne lady lula muddy

  • zzie

peetey lady pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee

slide-38
SLIDE 38

Distributed Transactions

node 1

carl dagne figment jack

node 2 node 3 node 4

carl dagne figment jack carl dagne figment jack

BEGIN TXN1 WRITE[sunny] WRITE[ozzie]

GATEWAY

INSERT INTO dogs VALUES (sunny, ozzie) transactions TXN1: PENDING

ACK

pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee carl peetey dagne lady lula muddy

  • zzie

peetey lady carl peetey dagne lady lula muddy

  • zzie

peetey lady carl peetey dagne lady lula muddy

  • zzie

peetey lady

slide-39
SLIDE 39

Distributed Transactions

node 1

carl dagne figment jack

node 2 node 3 node 4

carl dagne figment jack carl dagne figment jack

BEGIN TXN1 WRITE[sunny] WRITE[ozzie]

GATEWAY

INSERT INTO dogs VALUES (sunny, ozzie) transactions TXN1: PENDING

ACK

pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee carl peetey dagne lady lula muddy

  • zzie

peetey lady carl peetey dagne lady lula muddy

  • zzie

peetey lady carl peetey dagne lady lula muddy

  • zzie

peetey lady

slide-40
SLIDE 40

Distributed Transactions

node 1

carl dagne figment jack carl peetey dagne lady lula muddy

  • zzie

peetey lady

node 2 node 3

carl peetey dagne lady lula muddy

  • zzie

peetey lady

node 4

carl dagne figment jack carl dagne figment jack pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee

BEGIN TXN1 WRITE[sunny] WRITE[ozzie] COMMIT

GATEWAY

INSERT INTO dogs VALUES (sunny, ozzie) transactions TXN1: COMMIT

pinetop sooshi stella sunny zee carl peetey dagne lady lula muddy

  • zzie

peetey lady

slide-41
SLIDE 41

Distributed Transactions

node 1

carl dagne figment jack carl peetey dagne lady lula muddy

  • zzie

peetey lady

node 2 node 3

carl peetey dagne lady lula muddy

  • zzie

peetey lady

node 4

carl dagne figment jack carl dagne figment jack pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee

BEGIN TXN1 WRITE[sunny] WRITE[ozzie] COMMIT

GATEWAY

INSERT INTO dogs VALUES (sunny, ozzie)

pinetop sooshi stella sunny zee carl peetey dagne lady lula muddy

  • zzie

peetey lady

ACK

slide-42
SLIDE 42

Transactions: Pipelining

Serial Pipelined

slide-43
SLIDE 43

Transactions: Pipelining

Serial Pipelined

s u n n y s u n n y

BEGIN WRITE[sunny]

txn:sunny (pending)

slide-44
SLIDE 44

Transactions: Pipelining

Serial Pipelined

txn:sunny (pending) s u n n y

  • z

z i e s u n n y

  • z

z i e

BEGIN WRITE[sunny] WRITE[ozzie]

slide-45
SLIDE 45

Transactions: Pipelining

Serial Pipelined

txn:sunny (pending) s u n n y

  • z

z i e t x n : s u n n y ( c

  • m

m i t )

[ k e y s : s u n n y ,

  • z

z i e ]

txn:sunny (staged)

[ k e y s : s u n n y ,

  • z

z i e ]

s u n n y

  • z

z i e

BEGIN WRITE[sunny] WRITE[ozzie] COMMIT

slide-46
SLIDE 46

Transactions: Pipelining

Serial Pipelined

txn:sunny (pending) s u n n y

  • z

z i e t x n : s u n n y ( c

  • m

m i t )

[ k e y s : s u n n y ,

  • z

z i e ]

BEGIN WRITE[sunny] WRITE[ozzie] COMMIT Committed once all

  • perations complete

We replaced the centralized commit marker with a distributed one

t

s u n n y

  • z

z i e txn:sunny (staged)

[ k e y s : s u n n y ,

  • z

z i e ]

* “Proved” with TLA+

slide-47
SLIDE 47

AGENDA

  • Introduction
  • Ranges and Replicas
  • Transactions
  • SQL Data in a KV World
  • SQL Execution
  • SQL Optimization
slide-48
SLIDE 48

SQL

Structured Query Language Declarative, not imperative

  • These are the results I want vs perform these operations in this sequence

Relational data model

  • Typed: INT, FLOAT, STRING, ...
  • Schemas: tables, rows, columns, foreign keys
slide-49
SLIDE 49

SQL: Tabular Data in a KV World

SQL data has columns and types?!? How do we store typed and columnar data in a distributed, replicated, transactional key-value store?

  • The SQL data model needs to be mapped to KV data
  • Reminder: keys and values are lexicographically sorted
slide-50
SLIDE 50

CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT )

SQL Data Mapping: Inventory Table

ID Name Price 1 Bat 1.11 2 Ball 2.22 3 Glove 3.33 Key Value /1 “Bat”,1.11 /2 “Ball”,2.22 /3 “Glove”,3.33

slide-51
SLIDE 51

CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT )

SQL Data Mapping: Inventory Table

ID Name Price 1 Bat 1.11 2 Ball 2.22 3 Glove 3.33 Key Value /<Table>/<Index>/1 “Bat”,1.11 /<Table>/<Index>/2 “Ball”,2.22 /<Table>/<Index>/3 “Glove”,3.33

slide-52
SLIDE 52

CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT )

SQL Data Mapping: Inventory Table

ID Name Price 1 Bat 1.11 2 Ball 2.22 3 Glove 3.33 Key Value /inventory/primary/1 “Bat”,1.11 /inventory/primary/2 “Ball”,2.22 /inventory/primary/3 “Glove”,3.33

slide-53
SLIDE 53

CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT, INDEX name_idx (name) )

SQL Data Mapping: Inventory Table

ID Name Price 1 Bat 1.11 2 Ball 2.22 3 Glove 3.33 Key Value /inventory/name_idx/”Bat”/1 ∅ /inventory/name_idx/”Ball”/2 ∅ /inventory/name_idx/”Glove”/3 ∅

slide-54
SLIDE 54

CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT, INDEX name_idx (name) )

SQL Data Mapping: Inventory Table

ID Name Price 1 Bat 1.11 2 Ball 2.22 3 Glove 3.33 4 Bat 4.44 Key Value /inventory/name_idx/”Bat”/1 ∅ /inventory/name_idx/”Ball”/2 ∅ /inventory/name_idx/”Glove”/3 ∅

slide-55
SLIDE 55

CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT, INDEX name_idx (name) )

SQL Data Mapping: Inventory Table

ID Name Price 1 Bat 1.11 2 Ball 2.22 3 Glove 3.33 4 Bat 4.44 Key Value /inventory/name_idx/”Bat”/1 ∅ /inventory/name_idx/”Ball”/2 ∅ /inventory/name_idx/”Glove”/3 ∅ /inventory/name_idx/”Bat”/4 ∅

slide-56
SLIDE 56

AGENDA

  • Introduction
  • Ranges and Replicas
  • Transactions
  • SQL Data in a KV World
  • SQL Execution
  • SQL Optimization
slide-57
SLIDE 57

SQL Execution

Relational operators

  • Projection (SELECT <columns>)
  • Selection (WHERE <filter>)
  • Aggregation (GROUP BY <columns>)
  • Join (JOIN), union (UNION), intersect (INTERSECT)
  • Scan (FROM <table>)
  • Sort (ORDER BY)

○ Technically, not a relational operator

slide-58
SLIDE 58

SQL Execution

  • Relational expressions have input expressions and scalar expressions

○ For example, a “filter” expression has 1 input expression and a scalar expression that filters the rows from the child ○ The scan expression has zero inputs

  • Query plan is a tree of relational expressions
  • SQL execution takes a query plan and runs the operations to completion
slide-59
SLIDE 59

SQL Execution: Example

SELECT name FROM inventory WHERE name >= “b” AND name < “c”

slide-60
SLIDE 60

SQL Execution: Scan

SELECT name FROM inventory WHERE name >= “b” AND name < “c” Scan

inventory

slide-61
SLIDE 61

SQL Execution: Filter

SELECT name FROM inventory WHERE name >= “b” AND name < “c” Scan

inventory

Filter

name >= “b” AND name < “c”

slide-62
SLIDE 62

SQL Execution: Project

SELECT name FROM inventory WHERE name >= “b” AND name < “c” Scan

inventory

Filter

name >= “b” AND name < “c”

Project

name

slide-63
SLIDE 63

SQL Execution: Project

SELECT name FROM inventory WHERE name >= “b” AND name < “c” Scan

inventory

Filter

name >= “b” AND name < “c”

Project

name

Results

slide-64
SLIDE 64

SQL Execution: Index Scans

SELECT name FROM inventory WHERE name >= “b” AND name < “c” Scan

inventory@name [“b” - “c”)

The filter gets pushed into the scan

slide-65
SLIDE 65

SQL Execution: Index Scans

SELECT name FROM inventory WHERE name >= “b” AND name < “c” Scan

inventory@name [“b” - “c”)

Project

name

Results

slide-66
SLIDE 66

SQL Execution: Correctness

Correct SQL execution involves lots of bookkeeping

  • User defined tables, and indexes
  • Queries refer to table and column names
  • Execution uses table and column IDs
  • NULL handling
slide-67
SLIDE 67

SQL Execution: Performance

Performant SQL execution

  • Tight, well written code
  • Operator specialization

○ hash group by, stream group by ○ hash join, merge join, lookup join, zig-zag join

  • Distributed execution
slide-68
SLIDE 68

SQL Execution: Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Bob United States Hans Germany Jacques France Marie France Susan United States

slide-69
SLIDE 69

SQL Execution: Hash Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Bob United States Hans Germany Jacques France Marie France Susan United States

slide-70
SLIDE 70

SQL Execution: Hash Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Bob United States Hans Germany Jacques France Marie France Susan United States United States 1

slide-71
SLIDE 71

SQL Execution: Hash Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Bob United States Hans Germany Jacques France Marie France Susan United States United States 1 Germany 1

slide-72
SLIDE 72

SQL Execution: Hash Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Bob United States Hans Germany Jacques France Marie France Susan United States United States 1 Germany 1 France 1

slide-73
SLIDE 73

SQL Execution: Hash Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Bob United States Hans Germany Jacques France Marie France Susan United States United States 1 Germany 1 France 2

slide-74
SLIDE 74

SQL Execution: Hash Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Bob United States Hans Germany Jacques France Marie France Susan United States United States 2 Germany 1 France 2

slide-75
SLIDE 75

SQL Execution: Group By Revisited

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Bob United States Hans Germany Jacques France Marie France Susan United States

slide-76
SLIDE 76

SQL Execution: Sort on Grouping Column(s)

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Jacques France Marie France Hans Germany Bob United States Susan United States

slide-77
SLIDE 77

SQL Execution: Streaming Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Jacques France Marie France Hans Germany Bob United States Susan United States France 1

slide-78
SLIDE 78

SQL Execution: Streaming Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Jacques France Marie France Hans Germany Bob United States Susan United States France 2

slide-79
SLIDE 79

SQL Execution: Streaming Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Jacques France Marie France Hans Germany Bob United States Susan United States France 2 Germany 1

slide-80
SLIDE 80

SQL Execution: Streaming Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Jacques France Marie France Hans Germany Bob United States Susan United States France 2 Germany 1 United States 1

slide-81
SLIDE 81

SQL Execution: Streaming Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Name Country Jacques France Marie France Hans Germany Bob United States Susan United States France 2 Germany 1 United States 2

slide-82
SLIDE 82

Distributed SQL Execution

Network latencies and throughput are important considerations in geo-distributed setups Push fragments of computation as close to the data as possible

slide-83
SLIDE 83

Distributed SQL Execution: Streaming Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Scan customers Scan customers Scan customers scan scan scan

slide-84
SLIDE 84

Distributed SQL Execution: Streaming Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Scan customers Scan customers Scan customers Group-By “country” Group-By “country” Group-By “country” group-by group-by group-by

slide-85
SLIDE 85

Distributed SQL Execution: Streaming Group By

SELECT COUNT(*), country FROM customers GROUP BY country

Scan customers Scan customers Scan customers Group-By “country” Group-By “country” Group-By “country” Group-By “country” group-by

slide-86
SLIDE 86

AGENDA

  • Introduction
  • Ranges and Replicas
  • Transactions
  • SQL Data in a KV World
  • SQL Execution
  • SQL Optimization
slide-87
SLIDE 87

SQL Optimization

An optimizer explores many plans that are logically equivalent to a given query and chooses the best one Parse Execute Search

Memo

Prep

AST Plan

Fold Constants Check Types Resolve Names Report Semantic Errors Compute properties Retrieve and attach stats Cost-independent transformations Cost-based transformations Parse SQL

slide-88
SLIDE 88

SQL Optimization: Cost-Independent Transformations

  • Some transformations always make sense

○ Constant folding ○ Filter push-down ○ Decorrelating subqueries* ○ ...

  • These transformations are cost-independent

○ If the transformation can be applied to the query, it is applied

  • Domain Specific Language for transformations

○ Compiled down to code which efficiently matches query fragments in the memo ○ ~200 transformations currently defined

* Actually cost-based, but we’re treating it as cost-independent right now

slide-89
SLIDE 89

SQL Optimization: Filter Push-Down

SELECT * FROM a JOIN b WHERE x > 10 Scan

a@primary

Filter

x > 10

Results

Scan

b@primary

Join Initial plan

slide-90
SLIDE 90

SQL Optimization: Filter Push-Down

SELECT * FROM a JOIN b WHERE x > 10 Scan

a@primary

Filter

x > 10

Results

Scan

b@primary

Join Filter

x > 10

After filter push-down

slide-91
SLIDE 91

SQL Optimization: Cost-Based Transformations

  • Some transformations are not universally good

○ Index selection ○ Join reordering ○ ...

  • These transformations are cost-based

○ When should the transformation be applied? ○ Need to try both paths and maintain both the original and transformed query ○ State explosion: thousands of possible query plans ■ Memo data structure maintains a forest of query plans ○ Estimate cost of each query, select query with lowest cost

  • Costing

○ Based on table statistics and estimating cardinality of inputs to relational expressions

slide-92
SLIDE 92

SQL Optimization: Cost-based Index Selection

The index to use for a query is affected by multiple factors

  • Filters and join conditions
  • Required ordering (ORDER BY)
  • Implicit ordering (GROUP BY)
  • Covering vs non-covering (i.e. is an index-join required)
  • Locality
slide-93
SLIDE 93

SQL Optimization: Cost-based Index Selection

SELECT * FROM a WHERE x > 10 ORDER BY y Required orderings affect index selection

Sorting is expensive if there are a lot of rows Sorting can be the better option if there are few rows

slide-94
SLIDE 94

SQL Optimization: Cost-based Index Selection

Required orderings affect index selection

Sorting is expensive if there are a lot of rows Sorting can be the better option if there are few rows

Scan

a@primary

Filter

x > 10

Sort

y

SELECT * FROM a WHERE x > 10 ORDER BY y

slide-95
SLIDE 95

SQL Optimization: Cost-based Index Selection

Required orderings affect index selection

Sorting is expensive if there are a lot of rows Sorting can be the better option if there are few rows

Scan

a@primary

Scan

a@x [10 - )

Filter

x > 10

Sort

y

Sort

y

SELECT * FROM a WHERE x > 10 ORDER BY y

slide-96
SLIDE 96

SQL Optimization: Cost-based Index Selection

Required orderings affect index selection

Sorting is expensive if there are a lot of rows Sorting can be the better option if there are few rows

Scan

a@primary

Scan

a@x [10 - )

Filter

x > 10

Scan

a@y

Sort

y

Sort

y

Filter

x > 10

SELECT * FROM a WHERE x > 10 ORDER BY y

slide-97
SLIDE 97

SQL Optimization: Cost-based Index Selection

Required orderings affect index selection

Sorting is expensive if there are a lot of rows Sorting can be the better option if there are few rows

Scan

a@primary

Scan

a@x [10 - )

Filter

x > 10

Scan

a@y

Sort

y

Sort

y

Filter

x > 10

SELECT * FROM a WHERE x > 10 ORDER BY y

10 100,000 10 10

Lowest Cost

slide-98
SLIDE 98

SQL Optimization: Cost-based Index Selection

Required orderings affect index selection

Sorting is expensive if there are a lot of rows Sorting can be the better option if there are few rows

Scan

a@primary

Scan

a@x [10 - )

Filter

x > 10

Scan

a@y

Sort

y

Sort

y

Filter

x > 10

SELECT * FROM a WHERE x > 10 ORDER BY y

50,000 100,000 50,000 50,000

Lowest Cost

slide-99
SLIDE 99

Locality-Aware SQL Optimization

Network latencies and throughput are important considerations in geo-distributed setups Duplicate read-mostly data in each locality Plan queries to use data from the same locality

slide-100
SLIDE 100

Locality-Aware SQL Optimization

Three copies of the postal_codes table data Use replication constraints to pin the copies to different geographic regions (US-East, US-West, EU)

CREATE TABLE postal_codes ( id INT PRIMARY KEY, code STRING, INDEX idx_eu (id) STORING (code), INDEX idx_usw (id) STORING (code) )

slide-101
SLIDE 101

Locality-Aware SQL Optimization

Optimizer includes locality in cost model Automatically selects index from same locality: primary, idx_eu, or idx_usw

CREATE TABLE postal_codes ( id INT PRIMARY KEY, code STRING, INDEX idx_eu (id) STORING (code), INDEX idx_usw (id) STORING (code) )

SELECT * FROM postal_codes

slide-102
SLIDE 102

Conclusion

  • Distributed, replicated, transactional key-value store
  • Monolithic key space
  • Raft replication of ranges (~64MB)
  • Replica placement signals: space, diversity, load, latency
  • Pipelined transaction operations
  • Mapping SQL data to KV storage
  • Distributed SQL execution
  • Distributed SQL optimization
slide-103
SLIDE 103

www.cockroachlabs.com github.com/cockroachdb/cockroach

Thank You

slide-104
SLIDE 104

A Simple Transaction

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady

node 1 node 1

carl peetey dagne lady lula muddy peetey lady

node 1

carl dagne figment jack pinetop sooshi stella zee pinetop sooshi stella zee pinetop sooshi stella zee carl peetey dagne lady lula muddy peetey lady carl dagne figment jack

INSERT INTO DOGS (sunny);

slide-105
SLIDE 105

A Simple Transaction: One Range

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady

node 1 node 1

carl peetey dagne lady lula muddy peetey lady

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady carl dagne figment jack pinetop sooshi stella zee pinetop sooshi stella zee pinetop sooshi stella sunny zee

BEGIN WRITE[sunny] COMMIT

GATEWAY

INSERT INTO DOGS (sunny);

NOTE: a gateway can be ANY CockroachDB instance. It can find the leaseholder for any range and execute a transaction

slide-106
SLIDE 106

A Simple Transaction: One Range

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady

node 1 node 1

carl peetey dagne lady lula muddy peetey lady

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady carl dagne figment jack pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee

BEGIN WRITE[sunny] COMMIT

GATEWAY

INSERT INTO DOGS (sunny);

ACK

slide-107
SLIDE 107

A Simple Transaction: One Range

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady

node 1 node 1

carl peetey dagne lady lula muddy peetey lady

node 1

carl dagne figment jack carl peetey dagne lady lula muddy peetey lady carl dagne figment jack pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee pinetop sooshi stella sunny zee

BEGIN WRITE[sunny] COMMIT

GATEWAY

INSERT INTO DOGS (sunny);

ACK

slide-108
SLIDE 108

Ranges

CockroachDB implements order-preserving data distribution

  • Automates sharding of key/value data into “ranges”
  • Supports efficient range scans
  • Requires an indexing structure

Foundational capability that enables efficient distribution

  • f data across nodes within a CockroachDB cluster

* This approach is also used by Bigtable (tablets), HBase (regions) & Spanner (ranges)