[PPT] - Spanner A distributed database system Presented by Yue Xia PowerPoint Presentation

SLIDE 1

Spanner

A distributed database system

Presented by Yue Xia

SLIDE 2

Background

Developed by Google initially as a key-value storage system
Developers wanted traditional database features such as a query language
Evolved into a full-featured SQL system
Now used by teams across many parts of Google and Alphabet

SLIDE 3

Overview

Distributed computing
Architecture
Replication
Partitioning
Interleaving
Range Extraction
Distributed Union
Distributed Join
Data storage
PAX
LSM Tree
Conclusion and Discussion

SLIDE 4

Replication

Multiple datacenters in different geographic locations
Data replicated in each datacenter
Run query on nearest datacenter

SLIDE 5

Partitioning

Each datacenter has multiple servers
Data row-range sharded (partitioned)
Shards distributed across servers in each datacenter

Id  Name    Department
3   Alice   ‘A’
2   Eve     ‘A’
1   Carol   ‘B’
4   Bob     ‘C’
    George  ‘D’
5   Fred    ‘D’

Shard 1: Department = ‘A’
Shard 2: Department = ‘B’ to ‘C’
Shard 3: Department = ‘D’
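The shard lookup implied by this example can be sketched as a binary search over shard boundaries. This is only an illustration: `SHARD_UPPER_BOUNDS` and `shard_for` are hypothetical names, and the boundaries are simply the three Department ranges shown on the slide.

```python
from bisect import bisect_left

# Hypothetical boundaries taken from the slide's example: each entry is the
# largest Department value covered by a shard.
# Shard 1: 'A', shard 2: 'B' to 'C', shard 3: 'D'.
SHARD_UPPER_BOUNDS = ["A", "C", "D"]


def shard_for(department: str) -> int:
    """Return the 1-based number of the shard whose range contains `department`."""
    index = bisect_left(SHARD_UPPER_BOUNDS, department)
    if index == len(SHARD_UPPER_BOUNDS):
        raise KeyError(f"no shard covers department {department!r}")
    return index + 1


# Rows from the slide that have a complete (Id, Name, Department) triple.
rows = [(3, "Alice", "A"), (2, "Eve", "A"), (1, "Carol", "B"),
        (4, "Bob", "C"), (5, "Fred", "D")]
placement = {name: shard_for(dept) for _, name, dept in rows}
```

Because the shards cover contiguous key ranges, locating a row needs only the sorted list of boundaries, not a scan of the shards themselves.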

SLIDE 6

Interleaving

Parent & child tables interleaved and co-located
Customer Join Order only needs one scan of the interleaved table

Customer Id   Customer Name   Order #   Price
customer 1    Alice
                              order 1   $2
                              order 2   $4
customer 2    Bob
                              order 1   $6
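To see why one scan is enough, here is a minimal sketch of a join over an interleaved table. The row encoding (`kind`, key, value) and the function name are assumptions made for illustration; the data is the example above.

```python
# Rows of a hypothetical interleaved table in storage order: each customer
# row is followed directly by that customer's order rows.
interleaved = [
    ("customer", 1, "Alice"),
    ("order", 1, 2),       # (kind, order number, price in dollars)
    ("order", 2, 4),
    ("customer", 2, "Bob"),
    ("order", 1, 6),
]


def join_customers_orders(rows):
    """Join customers with their orders in a single pass over the rows."""
    joined = []
    current_customer = None
    for kind, key, value in rows:
        if kind == "customer":
            # Remember the parent row; its children follow immediately.
            current_customer = (key, value)
        else:
            customer_id, customer_name = current_customer
            joined.append((customer_id, customer_name, key, value))
    return joined


result = join_customers_orders(interleaved)
```

The join never looks up a customer by id: the parent of every order is simply the most recent customer row seen in the scan.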

SLIDE 7

Query Execution

1. Go to the nearest datacenter
2. Extract key range
3. Run query only on shards covering the key range

SLIDE 8

Range Extraction - Goal

Given a query, we want to know:

What shards to access
What fragments of shards to access (seek into smaller key ranges instead of scanning the full shard)

SLIDE 9

Range Extraction - Filter Tree

Example: scan Table filter A=1 && ((B=‘a’ && C=1) || (B>‘a’ && C=2))
Construct a tree according to the filter condition.

AND
  A=1
  OR
    AND
      B=‘a’
      C=1
    AND
      B>‘a’
      C=2

SLIDE 10

Range Extraction - Filter Tree

Example: scan Table filter A=1 && ((B=‘a’ && C=1) || (B>‘a’ && C=2))
First find the range for A. Assign each leaf node an initial interval.

AND
  A=1          A:[1,1]
  OR
    AND
      B=‘a’    A:[-∞,∞]
      C=1      A:[-∞,∞]
    AND
      B>‘a’    A:[-∞,∞]
      C=2      A:[-∞,∞]

SLIDE 11

Range Extraction - Filter Tree

Find the interval for each node from bottom to top.
AND is intersection and OR is union.
The range for A is [1,1].

AND            A:[1,1]∩[-∞,∞] = [1,1]
  A=1          A:[1,1]
  OR           A:[-∞,∞]
    AND        A:[-∞,∞]
      B=‘a’    A:[-∞,∞]
      C=1      A:[-∞,∞]
    AND        A:[-∞,∞]
      B>‘a’    A:[-∞,∞]
      C=2      A:[-∞,∞]

SLIDE 12

Range Extraction - Filter Tree

Then find the range for B

AND            A:[1,1]    B:[-∞,∞]∩[‘a’,∞] = [‘a’,∞]
  A=1          A:[1,1]    B:[-∞,∞]
  OR           A:[-∞,∞]   B:[‘a’,‘a’]U(‘a’,∞] = [‘a’,∞]
    AND        A:[-∞,∞]   B:[‘a’,‘a’]∩[-∞,∞] = [‘a’,‘a’]
      B=‘a’    A:[-∞,∞]   B:[‘a’,‘a’]
      C=1      A:[-∞,∞]   B:[-∞,∞]
    AND        A:[-∞,∞]   B:(‘a’,∞]∩[-∞,∞] = (‘a’,∞]
      B>‘a’    A:[-∞,∞]   B:(‘a’,∞]
      C=2      A:[-∞,∞]   B:[-∞,∞]
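The bottom-up pass for A and B can be sketched in a few lines. This is an illustrative sketch, not Spanner's implementation: the tuple encoding of the tree and the names `intersect`, `union` and `interval_for` are assumptions, `None` stands for an infinite bound, and open bounds such as (‘a’,∞] are widened to closed ones, just as the slide does at the OR node.

```python
# Filter from the slides: A=1 && ((B='a' && C=1) || (B>'a' && C=2)).
# Leaves are (column, operator, value); inner nodes are ("AND" | "OR", children).
TREE = ("AND", [
    ("A", "=", 1),
    ("OR", [
        ("AND", [("B", "=", "a"), ("C", "=", 1)]),
        ("AND", [("B", ">", "a"), ("C", "=", 2)]),
    ]),
])

UNBOUNDED = (None, None)  # None stands for -infinity / +infinity


def intersect(x, y):
    """AND: keep the tighter bound on each side."""
    low = y[0] if x[0] is None else x[0] if y[0] is None else max(x[0], y[0])
    high = y[1] if x[1] is None else x[1] if y[1] is None else min(x[1], y[1])
    return (low, high)


def union(x, y):
    """OR: smallest single interval covering both inputs."""
    low = None if x[0] is None or y[0] is None else min(x[0], y[0])
    high = None if x[1] is None or y[1] is None else max(x[1], y[1])
    return (low, high)


def interval_for(node, column):
    """Compute the interval of `column` at `node`, from the leaves upward."""
    if node[0] in ("AND", "OR"):
        combine = intersect if node[0] == "AND" else union
        result = interval_for(node[1][0], column)
        for child in node[1][1:]:
            result = combine(result, interval_for(child, column))
        return result
    leaf_column, operator, value = node
    if leaf_column != column:
        return UNBOUNDED  # a leaf on another column does not restrict this one
    # "=" gives a point interval; ">" is widened to a closed lower bound.
    return (value, value) if operator == "=" else (value, None)


range_a = interval_for(TREE, "A")   # (1, 1)
range_b = interval_for(TREE, "B")   # ('a', None), i.e. ['a', +infinity]
```

Running the same function for C would return the single covering interval (1, 2), which is wider than necessary because it ignores how C depends on B.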

SLIDE 13

Range Extraction - Filter Tree

Then find the range for C. Note that C’s range depends on B.

AND            A:[1,1]    B:[‘a’,∞]    C:[1,1] if B=‘a’, [2,2] if B>‘a’
  A=1          A:[1,1]    B:[-∞,∞]     C:[-∞,∞]
  OR           A:[-∞,∞]   B:[‘a’,∞]    C:[1,1] if B=‘a’, [2,2] if B>‘a’
    AND        A:[-∞,∞]   B:[‘a’,‘a’]  C:[1,1]
      B=‘a’    A:[-∞,∞]   B:[‘a’,‘a’]  C:[-∞,∞]
      C=1      A:[-∞,∞]   B:[-∞,∞]     C:[1,1]
    AND        A:[-∞,∞]   B:(‘a’,∞]    C:[2,2]
      B>‘a’    A:[-∞,∞]   B:(‘a’,∞]    C:[-∞,∞]
      C=2      A:[-∞,∞]   B:[-∞,∞]     C:[2,2]
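One way to picture the dependence of C on B is to evaluate the tree again once a concrete B value is known, dropping branches that B rules out. The sketch below is an assumption-laden illustration (the tree encoding and the name `c_range` are hypothetical), and it only handles the operators used in this example.

```python
import math

# Same filter as on the slides, encoded as nested tuples (hypothetical encoding).
TREE = ("AND", [
    ("A", "=", 1),
    ("OR", [
        ("AND", [("B", "=", "a"), ("C", "=", 1)]),
        ("AND", [("B", ">", "a"), ("C", "=", 2)]),
    ]),
])

ANY = (-math.inf, math.inf)  # unrestricted range for C


def c_range(node, b_value):
    """Range (low, high) of C under `node` given B == b_value, or None if no match."""
    if node[0] == "AND":
        ranges = [c_range(child, b_value) for child in node[1]]
        if any(r is None for r in ranges):
            return None  # one conjunct cannot hold, so the whole AND fails
        return (max(r[0] for r in ranges), min(r[1] for r in ranges))
    if node[0] == "OR":
        ranges = [r for r in (c_range(child, b_value) for child in node[1])
                  if r is not None]
        if not ranges:
            return None
        return (min(r[0] for r in ranges), max(r[1] for r in ranges))
    column, operator, value = node
    if column == "C":
        return (value, value)  # the example only uses C = constant
    if column == "B":
        holds = b_value == value if operator == "=" else b_value > value
        return ANY if holds else None
    return ANY  # leaves on other columns do not restrict C


c_when_b_is_a = c_range(TREE, "a")     # (1, 1)
c_when_b_is_ab = c_range(TREE, "ab")   # (2, 2)
c_when_b_is_9 = c_range(TREE, "9")     # None: B='9' satisfies neither branch
```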

SLIDE 14

Range Extraction - Query Rewrite

Rewrite filtered scan to self-join
scan Table filter A=1 && ((B=‘a’ && C=1) || (B>‘a’ && C=2)) becomes the plan below, where the range conditions (shown in red on the slide) come from the filter tree:

JOIN
  JOIN
    Scan1(Table)  Filter A in [1,1]
                  Output @A
    Scan2(Table)  Filter A=@A AND B in [‘a’,∞]
                  Output @A, @B
  Scan3(Table)    Filter A=@A AND B=@B AND (C=1 if B=‘a’, C=2 if B>‘a’)
                  Output columns of interest

SLIDE 15

Range Extraction - Execution

Scan1(Table)  Filter A in [1,1]
              Output @A

Output: 1

record id  A  B     C
1          1  ‘9’
2          1  ‘a’
3          1  ‘a’   1
4          1  ‘ab’
5          1  ‘b’   1
6          2  ‘c’   2
...

SLIDE 16

Range Extraction - Execution

Scan2(Table)  Filter A=@A AND B in [‘a’,∞]
              Output @A, @B

Output: (1, ‘a’), (1, ‘ab’), (1, ‘b’)

Input from the join with Scan1:
A
1

record id  A  B     C
1          1  ‘9’
2          1  ‘a’
3          1  ‘a’   1
4          1  ‘ab’
5          1  ‘b’   1
6          2  ‘c’   2
...

SLIDE 17

Range Extraction - Execution

Scan3(Table)  Filter A=@A AND B=@B AND (C=1 if B=‘a’, C=2 if B>‘a’)
              Output columns of interest

(output record_id if the table is sharded according to record_id)
Output: 3

Input from the join with Scan2:
A  B
1  ‘a’
1  ‘ab’
1  ‘b’

record id  A  B     C
1          1  ‘9’
2          1  ‘a’
3          1  ‘a’   1
4          1  ‘ab’
5          1  ‘b’   1
6          2  ‘c’   2
...
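The three scans can be replayed on the sample table with a short sketch. The function names are hypothetical, each scan here naively reads the whole table, and the C values that are blank in the slide's table (records 1, 2 and 4) are set to 0 purely as a placeholder assumption so the code can run.

```python
# Sample rows as (record_id, A, B, C). C is blank on the slide for records
# 1, 2 and 4; the 0 used here is a placeholder assumption.
TABLE = [
    (1, 1, "9", 0),
    (2, 1, "a", 0),
    (3, 1, "a", 1),
    (4, 1, "ab", 0),
    (5, 1, "b", 1),
    (6, 2, "c", 2),
]


def scan1():
    """Filter A in [1,1]; output the distinct values of A."""
    return sorted({a for _, a, _, _ in TABLE if a == 1})


def scan2(a_values):
    """Filter A=@A AND B in ['a', +inf]; output distinct (A, B) pairs."""
    return sorted({(a, b) for _, a, b, _ in TABLE if a in a_values and b >= "a"})


def scan3(ab_pairs):
    """Filter A=@A AND B=@B AND (C=1 if B='a', C=2 if B>'a'); output record ids."""
    matches = []
    for record_id, a, b, c in TABLE:
        wanted_c = 1 if b == "a" else 2
        if (a, b) in ab_pairs and c == wanted_c:
            matches.append(record_id)
    return matches


a_out = scan1()             # [1]
ab_out = scan2(a_out)       # [(1, 'a'), (1, 'ab'), (1, 'b')]
final_out = scan3(ab_out)   # [3]
```

The intermediate outputs match the ones shown on the slides; the point of the later slides is that the real execution replaces these full-table loops with seeks.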

SLIDE 18

Range Extraction - Query Rewrite

This plan joins three scans of the same table. Too slow?

JOIN
  JOIN
    Scan1(Table)  Filter A in [1,1]
                  Output @A
    Scan2(Table)  Filter A=@A AND B in [‘a’,∞]
                  Output @A, @B
  Scan3(Table)    Filter A=@A AND B=@B AND (C=1 if B=‘a’, C=2 if B>‘a’)
                  Output columns of interest

SLIDE 19

Range Extraction - Actual Execution

Scan1(Table)  Filter A in [1,1]
              Output @A

Outputs 1 without accessing the data, since A is fixed to 1.

record id  A  B     C
1          1  ‘9’
2          1  ‘a’
3          1  ‘a’   1
4          1  ‘ab’
5          1  ‘b’   1
6          2  ‘c’   2
...

SLIDE 20

Range Extraction - Execution

Scan2(Table)  Filter A=@A AND B in [‘a’,∞]
              Output @B

Seek to the first record with A = 1 and B = ‘a’ instead of scanning the whole table. Record 2 is found. Output 1, ‘a’.
Then seek to the next record with A = 1 and B != ‘a’, which is record 4. Output 1, ‘ab’ (the remaining records with B=‘a’ are skipped).

Input from the join with Scan1:
A
1

record id  A  B     C
1          1  ‘9’
2          1  ‘a’
3          1  ‘a’   1
4          1  ‘ab’
5          1  ‘b’   1
6          2  ‘c’   2
...

SLIDE 21

Range Extraction - Execution

Scan3(Table)  Filter A=@A AND B=@B AND (C=1 if B=‘a’, C=2 if B>‘a’)
              Output columns of interest

Similar to Scan2, seek instead of scan. Can finger

Input from the join with Scan2:
A  B
1  ‘a’
1  ‘ab’
1  ‘b’

record id  A  B     C
1          1  ‘9’
2          1  ‘a’
3          1  ‘a’   1
4          1  ‘ab’
5          1  ‘b’   1
6          2  ‘c’   2
...
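The seek-based execution of Scan2 and Scan3 can be approximated with binary search over keys sorted by (A, B, C). This is a sketch under assumptions: a sorted Python list stands in for the storage layer, the function names are hypothetical, and the C values that are blank in the slide's table are again replaced by a placeholder 0.

```python
from bisect import bisect_left, bisect_right
import math

# Keys sorted by (A, B, C). The C values for B='9', the first B='a' row and
# B='ab' are blank on the slide; 0 is a placeholder assumption.
KEYS = [(1, "9", 0), (1, "a", 0), (1, "a", 1), (1, "ab", 0), (1, "b", 1), (2, "c", 2)]


def distinct_b(a, b_low):
    """Scan2 with seeks: distinct B values with A == a and B >= b_low."""
    values = []
    pos = bisect_left(KEYS, (a, b_low))  # seek to the first key >= (a, b_low)
    while pos < len(KEYS) and KEYS[pos][0] == a:
        b = KEYS[pos][1]
        values.append(b)
        # Seek past every remaining key that shares this (a, b) prefix.
        pos = bisect_right(KEYS, (a, b, math.inf))
    return values


def seek_matches(a, b_values):
    """Scan3 with seeks: look up the one C value the filter allows for each B."""
    found = []
    for b in b_values:
        wanted = (a, b, 1 if b == "a" else 2)
        pos = bisect_left(KEYS, wanted)
        if pos < len(KEYS) and KEYS[pos] == wanted:
            found.append(wanted)
    return found


b_values = distinct_b(1, "a")        # ['a', 'ab', 'b']
matches = seek_matches(1, b_values)  # [(1, 'a', 1)]
```

Each step jumps directly to the next key of interest, so rows such as the second B=‘a’ record are never examined by `distinct_b`.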

SLIDE 22

Range Extraction - Conclusion

Filter tree: find the range of values for each column.
Rewrite the filter into multiple self-joins.
Execute with seeks instead of scans.

SLIDE 23

Distributed Union

A new relational algebra operator
Send subquery to each shard and concatenate results

SLIDE 24

Distributed Union

Replace scan with distributed union of scan

Scan(Table) -> DistributedUnion[shard ⊆ T](Scan(shard))

Pull distributed union above as many operations as possible (push computation to each server).

SLIDE 25

Distributed Union

Some operations can be directly rewritten:
Op(DistributedUnion[shard ⊆ T](Scan(shard))) -> DistributedUnion[shard ⊆ T](Op(Scan(shard)))

Examples:

Basic operations like projection, filtering…
Group by K or order by K, if the table is sharded by K
Join of interleaved tables
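A small sketch can show why such a rewrite is safe for a filter plus projection. Everything here is illustrative: `distributed_union` is a hypothetical local stand-in for the operator (it just concatenates per-shard results), and the shard contents reuse the partitioning example.

```python
from itertools import chain

# Hypothetical shards, reusing the rows from the partitioning example.
SHARDS = [
    [(3, "Alice", "A"), (2, "Eve", "A")],
    [(1, "Carol", "B"), (4, "Bob", "C")],
    [(5, "Fred", "D")],
]


def distributed_union(shards, subquery):
    """Run `subquery` against each shard and concatenate the results."""
    return list(chain.from_iterable(subquery(rows) for rows in shards))


def scan(rows):
    return rows


def names_outside_a(rows):
    """Filter Department != 'A', then project the Name column."""
    return [name for _, name, dept in rows if dept != "A"]


# Op(DistributedUnion(Scan(shard))): ship every row, then filter centrally.
above_union = names_outside_a(distributed_union(SHARDS, scan))
# DistributedUnion(Op(Scan(shard))): filter on each shard, ship only survivors.
below_union = distributed_union(SHARDS, names_outside_a)
```

Both plans return the same rows, but the second one moves less data because each shard discards non-matching rows before anything is sent.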

SLIDE 26

Distributed Union

Some need extra processing:
Op(DistributedUnion[shard ⊆ T](Scan(shard))) -> Op_Final(DistributedUnion[shard ⊆ T](Op_Local(Scan(shard))))

Example:

Top(5) can be done by finding the top 5 in each shard and then finding the top 5 among the results from all shards
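The Top(5) example maps directly onto a local step and a final step. The sketch below uses Python's `heapq.nlargest` for both; the shard contents and the function name are made up for illustration.

```python
import heapq


def top_k_distributed(shards, k):
    """Top-k via a per-shard Op_Local followed by a single Op_Final."""
    local_results = [heapq.nlargest(k, shard) for shard in shards]   # Op_Local
    merged = [value for part in local_results for value in part]     # DistributedUnion
    return heapq.nlargest(k, merged)                                 # Op_Final


# Hypothetical shard contents.
shards = [[9, 1, 7, 3, 8, 2], [6, 15, 4], [12, 11, 10, 5, 14, 13, 0]]
top5 = top_k_distributed(shards, 5)

# Reference answer computed centrally over all values.
all_values = [value for shard in shards for value in shard]
expected = sorted(all_values, reverse=True)[:5]
```

Each shard sends at most k values, so the final step works on at most k times the number of shards values regardless of how large the shards are.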

SLIDE 27

Distributed Union - Optimization

Multiple levels of distributed union
On large shards, further parallelize between subshards
Detect locally hosted shards and avoid remote calls

Distributed Union
  Distributed Union
    Shard 1
    Shard 2
  Distributed Union
    Shard 3
    Shard 4

SLIDE 28

Distributed Union - Optimization

Range extraction:

Extract key range
Map key range to shards
Send to the minimum number of servers that together contain all the required shards
Only run query on required shards
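These steps can be sketched as a range-overlap test followed by a server choice. The shard ranges come from the partitioning example, while the server placement and both function names are hypothetical. Note that the greedy choice below is a simple heuristic and is not guaranteed to find the true minimum number of servers.

```python
# Shard key ranges from the partitioning example (inclusive bounds).
SHARD_RANGES = {1: ("A", "A"), 2: ("B", "C"), 3: ("D", "D")}
# Hypothetical placement of shard replicas on servers.
SERVER_SHARDS = {"s1": {1, 2}, "s2": {2, 3}, "s3": {3}}


def shards_for_range(low, high):
    """Map an inclusive key range to the shards whose ranges overlap it."""
    return {shard_id for shard_id, (s_low, s_high) in SHARD_RANGES.items()
            if s_low <= high and low <= s_high}


def pick_servers(required):
    """Greedy heuristic: repeatedly take the server covering the most
    still-uncovered shards. Not guaranteed to be the true minimum."""
    remaining, chosen = set(required), []
    while remaining:
        best = max(sorted(SERVER_SHARDS),
                   key=lambda server: len(SERVER_SHARDS[server] & remaining))
        if not SERVER_SHARDS[best] & remaining:
            raise LookupError("a required shard is not hosted on any server")
        chosen.append(best)
        remaining -= SERVER_SHARDS[best]
    return chosen


needed = shards_for_range("B", "D")   # {2, 3}
servers = pick_servers(needed)        # ['s2']
```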

SLIDE 29

Batched Distributed Join

Join can also be distributed
Send batches of the left table to each shard
Join each batch with the local shards
Union the results

SLIDE 30

Batched Distributed Join - Optimization

Select left table to fit in a batch
Range extraction for each batch
Construct the minimum batch to be sent to each shard
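A batched join with per-shard routing might look like the sketch below. All data and names are hypothetical: the right table is sharded by customer id with a made-up boundary at 10, and `shard_of` plays the role of range extraction by deciding which shard can contain each join key.

```python
from collections import defaultdict

# Hypothetical right table, sharded by customer id:
# shard 0 holds ids below 10, shard 1 holds the rest.
RIGHT_SHARDS = [
    {1: "Alice", 2: "Bob"},
    {10: "Carol", 11: "Eve"},
]


def shard_of(customer_id):
    """Decide which shard can contain this join key."""
    return 0 if customer_id < 10 else 1


def batched_join(left_rows, batch_size):
    """Join left rows (customer_id, item) against the sharded right table."""
    joined = []
    for start in range(0, len(left_rows), batch_size):
        batch = left_rows[start:start + batch_size]
        # Build the smallest sub-batch per shard: only rows that shard can match.
        per_shard = defaultdict(list)
        for row in batch:
            per_shard[shard_of(row[0])].append(row)
        for shard_id, rows in sorted(per_shard.items()):
            local = RIGHT_SHARDS[shard_id]  # join happens next to the data
            joined.extend((cid, local[cid], item)
                          for cid, item in rows if cid in local)
    return joined  # union of all per-shard join results


left = [(1, "book"), (10, "pen"), (2, "lamp"), (3, "cup"), (11, "ink")]
result = batched_join(left, batch_size=2)
```

The row with customer id 3 has no match on its shard and is dropped, as in an inner join.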

SLIDE 31

Data Storage - PAX

Data stored in Partition Attributes Across (PAX) layout
Records are horizontally partitioned in pages
In each page all values of each attribute are grouped together
Greatly improves cache performance

SLIDE 32

Data Storage - PAX

Table
Id  Name   Age
0   Alice  15
1   Bob    20
2   Carol  25
... ...    ...

PAX Page
0,1,2,...
Alice,Bob,Carol,...
15,20,25,...
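The layout can be sketched as a conversion from rows to pages of per-attribute groups. This is a simplification with hypothetical names: real PAX pages also carry headers and offsets, and the page size of two rows is chosen only to keep the example small.

```python
def to_pax_pages(rows, rows_per_page):
    """Split rows into pages; within a page, group the values of each attribute."""
    pages = []
    for start in range(0, len(rows), rows_per_page):
        chunk = rows[start:start + rows_per_page]
        # zip(*chunk) turns a list of rows into one sequence per attribute.
        pages.append([list(column) for column in zip(*chunk)])
    return pages


rows = [(0, "Alice", 15), (1, "Bob", 20), (2, "Carol", 25)]
pages = to_pax_pages(rows, rows_per_page=2)

AGE = 2  # position of the Age group inside each page
# A query on age touches only the Age group of every page.
ages_over_20 = [age for page in pages for age in page[AGE] if age > 20]
```

Records stay together on the same page (horizontal partitioning), yet a query on one attribute reads a contiguous run of that attribute's values.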

SLIDE 33

Data Storage - PAX

SELECT age WHERE age > 20
A cache miss loads the requested value together with the values stored next to it

Table
Id  Name   Age
0   Alice  15
1   Bob    20
2   Carol  25
... ...    ...

PAX Page
0,1,2,...
Alice,Bob,Carol,...
15,20,25,...
Cache: 15,20,25

N-ary Storage Model (NSM) Page
0,Alice,15|1,Bob,20|2,Carol,25|...
Cache: Bob,20,2 (Bob and 2 are not used)

SLIDE 34

Data Storage - LSM Tree

Insert, update or delete would require rewriting the whole file.

Before          After inserting (3, 10)
Id  Age         Id  Age
0   5           0   5
1   15          3   10
2   20          1   15
                2   20

SLIDE 35

Data Storage - LSM Tree

One B-Tree on disk (fixed page size, 100% filled), and another in memory (smaller, no fixed block size).

Updates are stored in the tree in memory.
Merge two trees and write to disk when the tree in memory is large.
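The write path described here can be sketched with two sorted lists standing in for the two B-trees. This is a deliberately simplified assumption: `TinyLSM` is a hypothetical class, entries are (age, id) pairs kept in age order as in the slides' example, and the merge simply builds a new sorted run instead of writing real pages.

```python
from bisect import insort
import heapq


class TinyLSM:
    """Two sorted runs: a small one in memory and a large one 'on disk'."""

    def __init__(self, memory_limit):
        self.memory_limit = memory_limit
        self.memory = []   # stands in for the in-memory tree
        self.disk = []     # stands in for the on-disk tree

    def insert(self, age, record_id):
        insort(self.memory, (age, record_id))   # updates go to memory only
        if len(self.memory) >= self.memory_limit:
            self.flush()

    def flush(self):
        # Merge both sorted runs into a fresh run ("write to new space").
        self.disk = list(heapq.merge(self.disk, self.memory))
        self.memory = []

    def scan(self):
        # Reads must combine both runs to see recent updates.
        return list(heapq.merge(self.disk, self.memory))


store = TinyLSM(memory_limit=2)
store.disk = [(5, 0), (15, 1), (20, 2), (25, 4)]   # existing data, by age
store.insert(10, 3)                 # stays in memory; the disk run is untouched
after_first = (list(store.disk), list(store.memory))
store.insert(30, 5)                 # memory is now full, so a merge is triggered
```

A single insert never rewrites the on-disk run; the rewrite cost is paid once per merge and shared by all the updates collected in memory.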

SLIDE 36

Data Storage - LSM Tree

Write to disk

Before                After inserting (3, 10)
        Id  Age               Id  Age
Page 1  0   5         Page 1  0   5
        1   15                3   10
Page 2  2   20        Page 2  1   15
        4   25                2   20
                      Page 3  4   25

SLIDE 37

Data Storage - LSM Tree

Write to memory

Before                After inserting (3, 10)
         Id  Age               Id  Age
block 1  0   5        block 1  0   5
         1   15                3   10
block 2  2   20                1   15
         4   25       block 2  2   20
                               4   25

SLIDE 38

Data Storage - LSM Tree

* 21 * 34 * * 21 * 34 * 0,5 1,15 2,20 4,25 ... 3,10

Merge

SLIDE 39

Data Storage - LSM Tree

* 21 * 34 * * 21 * 34 * 4,25 ... 0,5 3,10 1,15 2,20

Write to new space

SLIDE 40

Data Storage - LSM Tree

* 21 * 34 * * 21 * 34 * 0,5 3,10 1,15 2,20 4,25 ...

Repeat

SLIDE 41

Conclusion

Replicated in datacenters
Sharded and distributed among servers.
Interleaved tables.
Range Extraction
Distributed Union and Distributed Join
PAX and LSM Tree

SLIDE 42

Discussion - Range Extraction

Range extraction

Rewrite filter to multiple self joins
Each join outputs the possible values of a column; the last join outputs the range.
Each join requires reading and scanning the table
Techniques such as seeks avoid scanning the entire table

Q: Why not scan the table once and filter the records?

SLIDE 43

Discussion - Range Extraction

Q: Why not scan the table once and filter the records?
A: The expectation is that seeking into the table multiple times is faster than scanning the full table once.
A full scan may be faster, depending on the filter condition and table structure. The system should choose which method to use.

SLIDE 44

Discussion - Range Extraction

Q: Is it worth spending the extra time on range extraction?

SLIDE 45

Discussion - Range Extraction

Q: Is it worth spending the extra time on range extraction?
A: It saves time when the table needs to be scanned multiple times (as in a join) and only a small portion of the table is useful. However, the cost of range extraction may outweigh the benefit in some cases, so the system should decide whether to do range extraction. It can also give an approximate, wider range instead of the exact range.

SLIDE 46

Discussion - Range Extraction

The filter tree doesn’t work on some conditions. How to solve this?
For example: a filter such as “A is odd”, or a regular expression.
Solution: scan and filter instead, or ignore these conditions and give a wider range.

SLIDE 47

Thank You
