Spanner A distributed database system Presented by Yue Xia - - PowerPoint PPT Presentation

spanner
SMART_READER_LITE
LIVE PREVIEW

Spanner A distributed database system Presented by Yue Xia - - PowerPoint PPT Presentation

Spanner A distributed database system Presented by Yue Xia Background - Developed by Google initially as a key-value storage system - Developers want traditional database features like query language - Evolved to a full featured SQL


slide-1
SLIDE 1

Spanner

A distributed database system

Presented by Yue Xia

slide-2
SLIDE 2

Background

  • Developed by Google initially as a key-value storage system
  • Developers want traditional database features like query language
  • Evolved to a full featured SQL system
  • Now used by teams across many parts of Google and Alphabet
slide-3
SLIDE 3

Overview

  • Distributed computing
  • Architecture
  • Replication
  • Partitioning
  • Interleaving
  • Range Extraction
  • Distributed Union
  • Distributed Join
  • Data storage
  • PAX
  • LSM Tree
  • Conclusion and Discussion
slide-4
SLIDE 4

Replication

  • Multiple datacenters in different geographic locations
  • Data replicated in each datacenter
  • Run query on nearest datacenter
slide-5
SLIDE 5

Partitioning

  • Each datacenter has multiple servers
  • Data row-range sharded (partitioned)
  • Shards distributed across servers in each

datacenter

Id Name Department 3 Alice ‘A’ 2 Eve ‘A’ 1 Carol ‘B’ 4 Bob ‘C’ George ‘D’ 5 Fred ‘D’ Shard 1: Department = ‘A’ Shard 2: Department = ‘B’ to ‘C‘ Shard 3: Department = ‘D’

slide-6
SLIDE 6

Interleaving

Parent & child tables interleaved and co-located Customer Join Order only needs

  • ne scan of the interleaved table

Customer Id Customer Name Order # Price customer 1 Alice

  • rder 1

$2

  • rder 2

$4 customer 2 Bob

  • rder 1

$6

slide-7
SLIDE 7

Query Execution

1. Go to the nearest datacenter 2. Extract key range 3. Run query only on shards covering the key range

slide-8
SLIDE 8

Range Extraction - Goal

Given a query, we want to know:

  • What shards to access
  • What fragments of shards to access

(seek into smaller key ranges instead of scanning the full shard)

slide-9
SLIDE 9

Range Extraction - Filter Tree

Example: scan Table filter A=1 && ((B=’a’ && C=1) || (B>’a’ && C=2) Construct a tree according to the filter condition.

AND A=1 OR AND AND B=’a’ C=1 B>’a’ C=2

slide-10
SLIDE 10

Range Extraction - Filter Tree

Example: scan Table filter A=1 && ((B=’a’ && C=1) || (B>’a’ && C=2) First find the range for A. Assign leaf node an initial interval

AND A=1 OR AND AND A:[1,1] A:[-∞,∞] A:[-∞,∞] A:[-∞,∞] A:[-∞,∞] B=’a’ C=1 B>’a’ C=2

slide-11
SLIDE 11

Range Extraction - Filter Tree

Find the interval for each node from bottom to top. AND is intersection and OR is union. The range for A is [1,1]

AND A=1 OR AND AND A:[1,1] A:[-∞,∞] A:[-∞,∞] A:[-∞,∞] A:[-∞,∞] A:[-∞,∞] A:[-∞,∞] A:[-∞,∞] A:[1,1]∩[-∞,∞] = [1,1] B=’a’ C=1 B>’a’ C=2

slide-12
SLIDE 12

Range Extraction - Filter Tree

Then find the range for B

AND A=1 OR AND AND A:[1,1] B:[-∞,∞] A:[-∞,∞] B:[‘a’,’a’] A:[-∞,∞] B:[-∞,∞] A:[-∞,∞] B:(‘a’,∞] A:[-∞,∞] B:[-∞,∞] A:[-∞,∞] B:[‘a’,’a’]∩[-∞,∞] = [‘a’,’a’] A:[-∞,∞] B:(’a’,∞]∩[-∞,∞] = (’a’,∞] A:[-∞,∞] B:[‘a’,’a’]U(‘a’,∞]=[‘a’,∞] A:[1,1] B:[-∞,∞]∩[‘a’,∞] = [‘a’,∞] B=’a’ C=1 B>’a’ C=2

slide-13
SLIDE 13

Range Extraction - Filter Tree

Then C Note that C’s range depends on B

AND A=1 OR AND AND A:[1,1] B:[-∞,∞] C:[-∞,∞] A:[-∞,∞] B:[‘a’,’a’] C:[-∞,∞] A:[-∞,∞] B:[-∞,∞] C:[1,1] A:[-∞,∞] B:(‘a’,∞] C:[-∞,∞] A:[-∞,∞] B:[-∞,∞] C:[2,2] A:[-∞,∞] B:[‘a’,’a’] C:[1,1] A:[-∞,∞] B:(’a’,∞] C:[2,2] A:[-∞,∞] B:[‘a’,∞] C:[1,1] if B=’a’, [2,2] if B>’a’ A:[1,1] B:[‘a’,∞] C:[1,1] if B=’a’, [2,2] if B>’a’ B=’a’ C=1 B>’a’ C=2

slide-14
SLIDE 14

Range Extraction - Query Rewrite

Rewrite filtered scan to self-join scan Table filter A=1 && ((B=’a’ && C=1) || (B>’a’ && C=2) becomes: Where the conditions in red are from the filter tree

JOIN JOIN Scan1(Table) Filter A in [1,1]

  • utput @A

Scan2(Table) Filter A=@A AND B in [‘a’,∞]

  • utput @A,@B

Scan3(Table) Filter A=@A And B=@B And C = 1 if B=‘a’, C=2 if B>’a’

  • utput columns of interest
slide-15
SLIDE 15

Range Extraction - Execution

Scan1(Table) Filter A in [1,1]

  • utput @A

Output: 1

record id A B C 1 1 ‘9’ 2 1 ‘a’ 3 1 ‘a’ 1 4 1 ‘ab’ 5 1 ‘b’ 1 6 2 ‘c’ 2 …….

slide-16
SLIDE 16

Range Extraction - Execution

Scan2(Table) Filter A=@A AND B in [‘a’,∞]

  • utput @A, @B

Output: 1, ‘a’ 1, ’ab’, 1, ‘b’,

record id A B C 1 1 ‘9’ 2 1 ‘a’ 3 1 ‘a’ 1 4 1 ‘ab’ 5 1 ‘b’ 1 6 2 ‘c’ 2 ……. A 1 JOIN

slide-17
SLIDE 17

Range Extraction - Execution

Scan3(Table) Filter A=@A And B=@B And C = 1 if B=‘a’, C=2 if B>’a’

  • utput columns of interest

(output record_id if the table is sharded according to record_id) Output: 3

record id A B C 1 1 ‘9’ 2 1 ‘a’ 3 1 ‘a’ 1 4 1 ‘ab’ 5 1 ‘b’ 1 6 2 ‘c’ 2 ……. JOIN A B 1 ‘a’ 1 ‘ab’ 1 ‘b’

slide-18
SLIDE 18

Range Extraction - Query Rewrite

Join 3 tables. Too slow?

JOIN JOIN Scan1(Table) Filter A in [1,1]

  • utput @A

Scan2(Table) Filter A=@A AND B in [‘a’,∞]

  • utput @A, @B

Scan3(Table) Filter A=@A And B=@B And C = 1 if B=‘a’, C=2 if B>’a’

  • utput columns of interest
slide-19
SLIDE 19

Range Extraction - Actual Execution

Scan1(Table) Filter A in [1,1]

  • utput @A

Output 1 without accessing the data as A is fixed to 1.

record id A B C 1 1 ‘9’ 2 1 ‘a’ 3 1 ‘a’ 1 4 1 ‘ab’ 5 1 ‘b’ 1 6 2 ‘c’ 2 …….

slide-20
SLIDE 20

Range Extraction - Execution

Scan2(Table) Filter A=@A AND B in [‘a’,∞]

  • utput @B

Seek the first record with A = 1 and B = ‘a’ instead of scanning the whole table. Record 2 is found. Output 1, ‘a’ Then seek the next record with A = 1 and B != ‘a’, which is record 4. Output 1, ‘ab’ (skip records with B=’a’)

record id A B C 1 1 ‘9’ 2 1 ‘a’ 3 1 ‘a’ 1 4 1 ‘ab’ 5 1 ‘b’ 1 6 2 ‘c’ 2 ……. A 1 JOIN

slide-21
SLIDE 21

Range Extraction - Execution

Scan3(Table) Filter A=@A And B=@B And C = 1 if B=‘a’, C=2 if B>’a’

  • utput columns of interest

Similar to Scan2, seek instead of scan. Can finger

record id A B C 1 1 ‘9’ 2 1 ‘a’ 3 1 ‘a’ 1 4 1 ‘ab’ 5 1 ‘b’ 1 6 2 ‘c’ 2 ……. JOIN A B 1 ‘a’ 1 ‘ab’ 1 ‘b’

slide-22
SLIDE 22

Range Extraction - Conclusion

  • Filter Tree: find range for each column values.
  • Rewrite filter to multiple self joins.
  • Execute with seek instead of scan.
slide-23
SLIDE 23

Distributed Union

  • A new relational algebra operator
  • Send subquery to each shard and concatenate results
slide-24
SLIDE 24

Distributed Union

  • Replace scan with distributed union of scan

Scan(Table) -> DistributedUnion[shard ⊆ T](Scan(shard))

  • Pull distributed union above as many operations as possible. (push computation

to each server)

slide-25
SLIDE 25

Distributed Union

Some operations can be directly rewritten by: Op(DistributedUnion[shard ⊆ T](Scan(shard)))

  • > DistributedUnion[shard ⊆ T](Op(Scan(shard)))

Example:

  • Basic operations like projection, filtering…
  • Group by K or Ordered by K if sharded according to K
  • Join of interleaved table
slide-26
SLIDE 26

Distributed Union

Some need extra processing: Op(DistributedUnion[shard ⊆ T](Scan(shard)))

  • > Op_Final(DistributedUnion[shard ⊆ T](Op_Local(Scan(shard))))

Example:

  • Top(5) can be done by finding the top 5 in each shard and then finding the top 5

among the results from all shards

slide-27
SLIDE 27

Distributed Union - Optimization

  • Multiple levels of distributed union
  • On large shards, further parallelize between

subshards

  • Detect locally hosted shards and avoid

remote call

Distributed Union Distributed Union Distributed Union Shard 1 Shard 2 Shard 3 Shard 4

slide-28
SLIDE 28

Distributed Union - Optimization

  • Range extraction:

Extract key range Map key range to shards Send to min # of servers such that they contain all the required shards Only run query on required shards

slide-29
SLIDE 29

Batched Distributed Join

  • Join can also be distributed
  • Send batches of left table to each shards
  • Join batch with local shards
  • Union
slide-30
SLIDE 30

Batched Distributed Join - Optimization

  • Select left table to fit in a batch
  • Range extraction for each batch
  • Construct the minimum batch to be sent to each shard
slide-31
SLIDE 31

Data Storage - PAX

  • Data stored in Partition Attributes Across (PAX) layout
  • Records are horizontally partitioned in pages
  • In each page all values of each attribute are grouped together
  • Greatly improves cache performance
slide-32
SLIDE 32

Data Storage - PAX

Id Name Age Alice 15 1 Bob 20 2 Carol 25 ... ... ... Table 0,1,2,... Alice,Bob,Carol,... 15,20,25,... PAX Page

slide-33
SLIDE 33

Data Storage - PAX

  • SELECT age WHERE age> 20
  • Cache miss will cache the asked value and the values next to it

Id Name Age Alice 15 1 Bob 20 2 Carol 25 ... ... ... Table 0,1,2,... Alice,Bob,Carol,... 15,20,25,... PAX Page Cache 15,20,25 0,Alice,15|1,Bob,20| 2,Carol,25|... Cache Bob,20,2 N-ary Storage Model (NSM) Page Not Used

slide-34
SLIDE 34

Data Storage - LSM Tree

  • Insert, update or delete would

require rewriting the whole file.

Id Age 5 1 15 2 20 Id Age 5 3 10 1 15 2 20 3 10 insert

slide-35
SLIDE 35

Data Storage - LSM Tree

  • One B-Tree on disk (fixed page size, 100% filled),

and another in memory (smaller, no fixed block size).

  • Updates are stored in the tree in memory.
  • Merge two trees and write to disk when the tree in memory is large.
slide-36
SLIDE 36

Data Storage - LSM Tree

  • Write to disk

Id Age 5 1 15 2 20 4 25 Id Age 5 3 10 1 15 2 20 4 25 3 10 insert Page 1 Page 2 Page 1 Page 2 Page 3

slide-37
SLIDE 37

Data Storage - LSM Tree

  • Write to memory

Id Age 5 1 15 3 10 insert Id Age 2 20 4 25 block 1 block 2 Id Age 5 3 10 1 15 Id Age 2 20 4 25 block 1 block 2

slide-38
SLIDE 38

Data Storage - LSM Tree

* 21 * 34 * * 21 * 34 * 0,5 1,15 2,20 4,25 ... 3.10

Merge

slide-39
SLIDE 39

Data Storage - LSM Tree

* 21 * 34 * * 21 * 34 * 4,25 ... 0,5 3,10 1,15 2,20

Write to new space

slide-40
SLIDE 40

Data Storage - LSM Tree

* 21 * 34 * * 21 * 34 * 0,5 3,10 1,15 2,20 4,25 ...

Repeat

slide-41
SLIDE 41

Conclusion

  • Replicated in datacenters
  • Sharded and distributed among servers.
  • Interleaved tables.
  • Range Extraction
  • Distributed Union and Distributed Join
  • PAX and LSM Tree
slide-42
SLIDE 42

Discussion - Range Extraction

Range extraction

  • Rewrite filter to multiple self joins
  • Each join outputs possible values of a column, last join outputs the range.
  • Each join requires reading and scanning the table
  • Some techniques like seeks to avoid scanning the entire table

Q: Why don’t scan the table once and filter the records?

slide-43
SLIDE 43

Discussion - Range Extraction

Q: Why don’t scan the table once and filter the records? A: It hopes seeking in the table multiple times to be faster than scanning the full table

  • nce.

A full scan may be faster depending on the filter condition and table structure. Should choose which method to use.

slide-44
SLIDE 44

Discussion - Range Extraction

Q: Is it worth spending the extra time on range extraction?

slide-45
SLIDE 45

Discussion - Range Extraction

Q: Is it worth spending the extra time on range extraction? A: Saves time when table need to be scanned multiple times (like in join) and only a small portion of the table is useful. However the cost of range extraction may outweigh the benefit in some cases. Should decide whether to do range extraction. Can give an approximate wider range instead of exact range.

slide-46
SLIDE 46

Discussion - Range Extraction

Filter tree doesn’t work on some conditions. How to solve this? For example: Filter A is odd, Regular Expression. Solution: Scan and filter instead. Or ignore these conditions and give a wider range.

slide-47
SLIDE 47

Thank You

slide-48
SLIDE 48

References

[1]https://cloud.google.com/spanner/docs/whitepapers [2]Bacon DF, Bales N, Bruno N, Cooper BF, Dickinson A, Fikes A, Fraser C, Gubarev A, Joshi M, Kogan E, Lloyd A. Spanner: Becoming a SQL

  • system. InProceedings of the 2017 ACM International Conference on Management of Data 2017 May 9 (pp. 331-343). ACM.

[3]Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS). 2008 Jun 1;26(2):4. [4]https://www.calebcurry.com/blogs/database-design/parent-child-tables [5]Bacon DF, Bales N, Bruno N, Cooper BF, Dickinson A, Fikes A, Fraser C, Gubarev A, Joshi M, Kogan E, Lloyd A. Spanner: Becoming a SQL

  • system. InProceedings of the 2017 ACM International Conference on Management of Data 2017 May 9 (pp. 331-343). ACM.

[6]https://medium.com/databasss/on-disk-io-part-3-lsm-trees-8b2da218496f [7]Ailamaki A, DeWitt DJ, Hill MD, Skounakis M. Weaving Relations for Cache Performance. InVLDB 2001 Sep 11 (Vol. 1, pp. 169-180). [8]https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/Row-major_order.html [9]https://www.igi-global.com/dictionary/partition-attributes-across-pax/39627 [7]Ailamaki A, DeWitt DJ, Hill MD, Skounakis M. Weaving Relations for Cache Performance. InVLDB 2001 Sep 11 (Vol. 1, pp. 169-180). [9]http://theteacher.info/index.php/architecture-data-comms-and-applications-unit-5/4-organisation-and-structure-of-data/all-topics/3 940-multi-level-indexes [10]https://explainextended.com/2009/07/16/inner-join-vs-cross-apply/