Data Management Systems Query Processing Execution models - - PowerPoint PPT Presentation



SLIDE 1

Data Management Systems

  • Query Processing
  • Execution models
  • Optimization – heuristics & rewriting
  • Optimization – cost models
  • Operators

Gustavo Alonso
Institute of Computing Platforms
Department of Computer Science
ETH Zürich

Operators 1

This chapter (Operators) covers: Access to base tables, Sorting and Aggregation, Joins.

SLIDE 2

Starting point

  • Data and indexes are stored in blocks as part of extents
  • Data from the base tables must be read from the buffer cache into the working space of the query (it will be processed while reading it)
  • Basics:
    • Minimize I/O if blocks are not in memory
    • Minimize accesses to tuples if the data is in memory
    • Prefer sequential access

SLIDE 3

Access to base tables

SLIDE 4

The fastest access for a single tuple: row_id

  • The fastest way to access one tuple is by using its row or tuple id:
    • row_id = (block_id, offset)
    • Essentially, a pointer to the tuple
    • Only need to access the block where the tuple is
  • The row_id is found:
    • In the query itself
    • Through an index

SELECT * FROM T WHERE T.row_id = 123456
SELECT * FROM T WHERE T.key = AB34TF   (the row_id is found through an index; the block is then read via the buffer cache)

SLIDE 5

When to use row_id access

SELECT * FROM T WHERE T.A = 42   (index on A)

  • Row_id access can also be used when there is a predicate over an indexed attribute:
    • Use the index to find the matching tuples and retrieve them using the row_id
  • If there are many matches, it might induce many random accesses to different blocks

SLIDE 6

When to use row_id access

SELECT * FROM T WHERE T.A = 42 AND T.B > 50   (index on A)

  • Row_id access through an index works even with more complex predicates:
    • Use the index to find the matching tuples using the indexed attribute
    • Retrieve the tuples and check the other predicate: T.B > 50? YES → emit tuple; NO → ignore tuple
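This index-then-filter plan can be sketched as follows. The block layout, the dict-based index, and the `fetch` helper are all hypothetical stand-ins for illustration, not a real engine API:

```python
# Sketch: row_id access through an index on A, with a residual predicate on B.
table_blocks = {                      # block_id -> list of tuples (A, B)
    0: [(42, 60), (7, 10)],
    1: [(42, 30), (42, 99)],
}
index_on_a = {42: [(0, 0), (1, 0), (1, 1)], 7: [(0, 1)]}  # A-value -> row_ids

def fetch(row_id):
    block_id, offset = row_id         # a row_id is a (block_id, offset) pair
    return table_blocks[block_id][offset]

def select_a42_b_gt_50():
    # SELECT * FROM T WHERE T.A = 42 AND T.B > 50
    for row_id in index_on_a.get(42, []):
        t = fetch(row_id)             # random access to the tuple's block
        if t[1] > 50:                 # residual predicate T.B > 50
            yield t                   # YES -> emit tuple; NO -> ignore

print(list(select_a42_b_gt_50()))     # -> [(42, 60), (42, 99)]
```

Note that each `fetch` may touch a different block, which is exactly the random-access cost the slide warns about when there are many matches.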

SLIDE 7

Which index to use?

  • Assume there are two indexes, one on A and one on B

SELECT * FROM T WHERE T.A = 42 AND T.B > 50

  • We can use either of them to retrieve the data:
    • Find all tuples with T.A = 42 and then check T.B > 50
    • Find all tuples with T.B > 50 and then check T.A = 42
    • Find all tuples with T.A = 42, find all tuples with T.B > 50, and intersect the two lists of row_ids
  • Which one to use depends on the relative selectivities of each predicate
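The third option, evaluating both predicates on their indexes and matching the row_id lists, can be sketched like this (both index structures are made-up examples):

```python
# Sketch: combining two indexes by intersecting their row_id sets.
index_on_a = {42: {101, 102, 103}}                          # A-value -> row_ids
index_on_b = [(40, 100), (55, 102), (70, 103), (90, 104)]   # sorted on B

def rowids_a_eq(v):
    return index_on_a.get(v, set())

def rowids_b_gt(v):
    # a real B+-tree would scan the range; here we filter the sorted pairs
    return {rid for (b, rid) in index_on_b if b > v}

# SELECT * FROM T WHERE T.A = 42 AND T.B > 50:
# evaluate both predicates on the indexes, intersect the row_id sets,
# and only then fetch the surviving tuples from the base table.
matching = rowids_a_eq(42) & rowids_b_gt(50)
print(sorted(matching))   # -> [102, 103]
```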

SLIDE 8

The “slowest” access: full table scan

  • A full table scan reads all the blocks and all the tuples in each block => it is expensive, especially if the data is not in memory
  • But it is not the slowest; very poorly written plans can be worse:

SELECT * FROM T WHERE T.id = 1 OR T.id = 2 OR T.id = 3 …
SELECT * FROM T WHERE T.age <= 20 OR T.age > 20

  • A full table scan is the upper bound in cost for retrieving data from a base table
    • Worst-case scenario for query planning: if nothing else works, use a full table scan
  • Nevertheless, it reads the data sequentially and can take advantage of prefetching

SLIDE 9

When to use a full table scan?

  • When there is no other option:
    • There are no indexes, or the indexes are not selective enough
    • Predicates involving several columns of the same table (self join):

SELECT * FROM T WHERE T.A > T.B

  • When the amount of data retrieved is large enough that sequential access is better than many random accesses
  • Several ways to minimize the overhead of a full table scan:
    • Shared scans = use the cursor from the scan of another query
    • Sample scans = do not read everything, just a sample
    • Column store = scanning a compressed column using SIMD can be fast once the data is in memory

SLIDE 10

Clustered indexes

  • A clustered index enforces that the data in the extent is organized according to the index:
    • B+ tree = data is sorted
    • Hash index = tuples with the same key are in the same bucket
  • In those cases, we might not traverse the index for each tuple:
    • Find the relevant blocks
    • Scan those blocks sequentially

SLIDE 11

Clustered index example

SELECT * FROM T WHERE T.A = 42 AND T.B > 50

  • With a non-clustered index on A: look up A = 42, fetch each match, and check T.B > 50 (YES → emit tuple; NO → ignore tuple)
  • With a clustered index on B: scan the range T.B > 50 sequentially and check T.A = 42 on each tuple (YES → emit tuple; NO → ignore tuple)

SLIDE 12

Lowering the costs of table scans: Zone Maps

  • A zone map is a combination of a coarse index and statistics
  • For every block of a table:
    • Keep the max and min values for some or all columns
  • Before reading a block, check the zone map:
    • If the range does not match the predicate, do not read the block
  • It can significantly reduce I/O cost
    • In some cases it can replace an index
  • Other statistics can also be kept in a zone map
  • An example of use of the Zone Maps concept is Snowflake (see the chapter on Storage)
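The zone-map check can be sketched as follows; the block layout and column values are made up for illustration:

```python
# Sketch: a zone map as per-block (min, max) statistics used to skip blocks.
blocks = {                       # block_id -> values of one column, say T.age
    0: [12, 19, 15],
    1: [33, 41, 38],
    2: [55, 61, 58],
}
zone_map = {bid: (min(vals), max(vals)) for bid, vals in blocks.items()}

def scan_age_gt(threshold):
    # SELECT ... WHERE T.age > threshold, pruning blocks via the zone map
    blocks_read = 0
    result = []
    for bid, (lo, hi) in zone_map.items():
        if hi <= threshold:      # range cannot match -> skip the block (no I/O)
            continue
        blocks_read += 1         # otherwise read the block, apply the predicate
        result.extend(v for v in blocks[bid] if v > threshold)
    return result, blocks_read

values, blocks_read = scan_age_gt(40)
print(values, blocks_read)       # -> [41, 55, 61, 58] 2  (block 0 was skipped)
```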

SLIDE 13

Other considerations

  • Other factors affecting how to access a base table:
    • A table scan using an index can be expensive, but it will return sorted data:
      • Start at the beginning of the leaves of the index and retrieve the tuples one by one (sequential access to the row_ids)
      • Expensive if the index is not clustered, but might be cheaper than sorting the data
    • I/O is significantly more expensive than accessing data in memory:
      • Random accesses are far worse for I/O than for data in memory
      • Scans in memory can be reasonably fast
      • This changes the decision points between random access and scans
    • Scans on columns are not the same as scans on rows:
      • This changes the decision point on when to do a scan

SLIDE 14

Sorting and Aggregation

SLIDE 15

Why sort the data?

  • Recall that data is not necessarily stored in sorted order
  • The query may require it:

SELECT * FROM T ORDER_BY (T.age)

  • Some operations are easier over sorted data:

SELECT DISTINCT(T.name) FROM T
  → sort the data by T.name and return the first tuple of each group

SELECT AVG(T.age) FROM T GROUP_BY(T.level)
  → sort the data by T.level and then find the average age for each group

  • Sorting also helps with joins, selections, intra-table predicates, …
  • Sorting is expensive:
    • Requires extra space (no sorting in place for base tables)
    • Requires CPU (comparisons)

SLIDE 16

External sort (data does not fit in memory)

  • Why external sort?
    • Obvious: the data does not fit in main memory (data and results!!)
    • Less obvious: many queries running at the same time, sharing memory
  • Two key parameters:
    • N: number of pages of input
    • M: size of the in-memory buffer (in pages)
  • The behavior of the algorithm is determined by many parameters: I/O costs, CPU costs, caches, data types involved, etc.

SLIDE 17

Two-phase External Sort

  • N = size of the input in pages, M = size of the buffer in pages
  • Phase I: Create Runs
    1. Load the allocated buffer space with tuples
    2. Sort the tuples in the buffer pool
    3. Write the sorted tuples (a run) to disk
    4. Go to step 1 (create the next run) until all tuples are processed
  • Phase II: Merge Runs
    • Use a priority heap to merge tuples from the runs
  • Special cases:
    • M >= N: no merge needed
    • M < sqrt(N): multiple merge phases necessary
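The two phases can be sketched in a few lines; disk pages are simulated as plain lists, and M is taken as the number of tuples that fit in the sort buffer:

```python
import heapq

def create_runs(data, M):
    # Phase I: load M tuples, sort them in the buffer, write them out as a run
    return [sorted(data[i:i + M]) for i in range(0, len(data), M)]

def merge_runs(runs):
    # Phase II: merge all runs with a priority heap (smallest head first)
    return list(heapq.merge(*runs))

data = [97, 17, 3, 5, 27, 16, 2, 99, 13]
runs = create_runs(data, M=3)
print(runs)              # -> [[3, 17, 97], [5, 16, 27], [2, 13, 99]]
print(merge_runs(runs))  # -> [2, 3, 5, 13, 16, 17, 27, 97, 99]
```

`heapq.merge` does exactly what Phase II describes: it keeps one head element per run in a heap and repeatedly emits the smallest.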

SLIDE 18

For simplicity we will hide this

  • The size of the buffer needed:
    • Minimal configuration:
      • Number of blocks − 1 are used to read in data blocks
      • 1 block is used to write data out
    • Better:
      • Number of blocks − 1 to read data in
      • 2 or more blocks to write data out, so that we do not have to wait to write a block out

SLIDES 19–32

External Sort (worked example)

Input pages: 97 17 3 | 5 27 16 | 2 99 13 (the buffer holds one page of three tuples)

Phase 1 (create runs):
  • Load 97 17 3, sort, write run 3 17 97
  • Load 5 27 16, sort, write run 5 16 27
  • Load 2 99 13, sort, write run 2 13 99

Phase 2 (merge runs): repeatedly compare the heads of the three runs and emit the smallest: 2, 3, 5, 13, 16, 17, 27, 97, 99

Operators 19–32

SLIDE 33

One pass vs. multi-pass sort

  • The previous algorithm is a one-pass algorithm (every data item is read once and written once)
  • However:
    • If there are many runs, the I/O overhead is too high (we need to bring too many runs to memory)
    • The merge step cannot be parallelized

SLIDE 34

Multi-way Merge (N = 7; M = 2)

SLIDE 35

Ways to speed up sorting

  • Prefetching and double buffering:
    • Use more than one block for writing data out
    • Prefetch blocks as needed while processing
  • Take advantage of indexes:
    • Clustered: read the data in order, as it is already sorted (no CPU cost, sequential access)
    • Non-clustered: use the index to read the data in order (no CPU cost but very expensive in terms of random access); may work for small ranges

SLIDE 36

Sorting vs hashing

  • Sorting is relatively expensive
  • If sorted output is not required by the query, hashing is often the better way to answer queries of the form:

SELECT DISTINCT(T.name) FROM T
SELECT AVG(T.age) FROM T GROUP_BY(T.level)

  • In both cases:
    • Build a hash table on the attribute of interest
    • Operate on the items in each bucket of the hash table:
      • Distinct: just keep one (identical values hash to the same bucket)
      • Group_by: all identical values are in the same bucket
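Both hash-based strategies can be sketched together (Python dicts and sets are hash tables; the sample rows are made up):

```python
from collections import defaultdict

rows = [("John", "Manager", 50), ("Anne", "Manager", 48),
        ("Bob", "Developer", 29), ("Tom", "Manager", 57),
        ("John", "Researcher", 47)]

# SELECT DISTINCT(T.name) FROM T: identical names hash to the same bucket,
# so keeping one entry per bucket keeps exactly one copy of each name.
distinct_names = {name for name, _, _ in rows}

# SELECT AVG(T.age) FROM T GROUP_BY(T.level): all tuples with the same
# level land in the same bucket; aggregate each bucket separately.
buckets = defaultdict(list)
for name, level, age in rows:
    buckets[level].append(age)
averages = {level: sum(ages) / len(ages) for level, ages in buckets.items()}

print(sorted(distinct_names))   # -> ['Anne', 'Bob', 'John', 'Tom']
print(averages["Manager"])      # average of 50, 48, 57
```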

SLIDE 37

Distinct with hashing

SELECT DISTINCT(T.name) FROM T

Hash each name (Hash(name)): duplicates such as John, Anne, and Tom land in the same bucket, so emitting one entry per distinct value in each bucket yields the DISTINCT result.

SLIDE 38

Aggregation with hashing

SELECT AVG(T.age) FROM T GROUP_BY(T.level)

Hash each tuple on level (Hash(level)): all tuples with the same level end up in the same bucket (e.g., Manager: John 50, Anne 48, Tom 57; Developer: Bob 29, Alice 38, Jim 42, Mary 35; Admin: Louis 27, Donald 27; …), and AVG(age) is then computed per bucket.

SLIDE 39

External hashing

  • If the hash table does not fit in memory:
    • Partition the data first using a hash function
    • Then use the partitions to build the hash table
  • If a partition does not fit in memory:
    • Partition the partition again until it does
    • Build the hash table in several runs
  • For sorting, we use M blocks: M − 1 to read data in, 1 to write data out
  • For hashing, we use M buffers: 1 to read data in, M − 1 to write data out
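The "partition until it fits" step can be sketched as follows. The per-level radix-style hash function and the memory budget are illustrative assumptions, not the deck's exact scheme:

```python
# Sketch: recursive hash partitioning for external hashing. A partition that
# is still larger than the memory budget is partitioned again, using a
# different hash function per level (here: the next 2 bits of the key).

FANOUT = 4   # number of output buffers per partitioning pass

def h(x, depth):
    # hypothetical per-level hash: each level looks at the next 2 bits
    return (x >> (2 * depth)) % FANOUT

def partition(items, depth, memory_budget):
    if len(items) <= memory_budget:
        return [items]                       # fits in memory: build the hash table
    buckets = [[] for _ in range(FANOUT)]
    for x in items:
        buckets[h(x, depth)].append(x)
    out = []
    for b in buckets:                        # re-partition buckets that are too big
        if b:
            out.extend(partition(b, depth + 1, memory_budget))
    return out

parts = partition(list(range(100)), depth=0, memory_budget=10)
print(all(len(p) <= 10 for p in parts))      # -> True
print(sum(len(p) for p in parts))            # -> 100
```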

SLIDE 40

External hashing: partition phase

Traverse the extent block by block, hash each tuple (Hash 1) into in-memory buckets, and write the buckets to disk as they fill.

SLIDE 41

External hashing: build phase

Read the partition blocks back from disk one block at a time and build the in-memory hash table using a second hash function (Hash 2).

SLIDE 42

Joins

SLIDE 43

Nested loop join

  • Actually, two nested scans:
    • While there are tuples in R:
      • Get a tuple from R
      • Compare it with all tuples in S (scan S for matches)
      • Output if there is a match
  • Complexity is O(|R| * |S|), i.e., O(N²)
  • Sounds expensive but is still used in practice (it makes sense if, e.g., S is sorted, the join is on an indexed attribute, etc.)
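The two nested scans can be sketched directly (sample relations are made up; the first tuple field is taken as the join key):

```python
# Sketch: tuple-at-a-time nested loop join of R and S on the first attribute.
R = [(1, "a"), (2, "b"), (3, "c")]
S = [(2, "x"), (3, "y"), (3, "z")]

def nested_loop_join(R, S):
    out = []
    for r in R:                    # while there are tuples in R
        for s in S:                # compare with all tuples in S
            if r[0] == s[0]:       # output if the join keys match
                out.append((r[0], r[1], s[1]))
    return out                     # O(|R| * |S|) comparisons

print(nested_loop_join(R, S))      # -> [(2, 'b', 'x'), (3, 'c', 'y'), (3, 'c', 'z')]
```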

SLIDE 44

Nested loop join

  • The outer table is the table used in the outer loop
  • The inner table is the table used in the inner loop
  • The outer table should always be the smaller of the two tables (in number of blocks):
    • Maximizes sequential access in the inner loop
  • Optimization: block nested loop join
    • Get a block from R
    • (Hash it and then) compare it with all blocks of S
    • Avoids bringing in the blocks of S many times for each tuple in R (now only once per block of R)
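The block variant can be sketched as follows; block size and sample relations are illustrative, and hashing each outer block is the optional optimization mentioned in the bullet above:

```python
# Sketch: block nested loop join. Each block of the outer table R is loaded
# once and hashed, so every block of S is scanned only once per block of R
# rather than once per tuple of R.

def blocks(table, block_size):
    for i in range(0, len(table), block_size):
        yield table[i:i + block_size]

def block_nested_loop_join(R, S, block_size=2):
    out = []
    for r_block in blocks(R, block_size):         # outer table: the smaller one
        h = {}                                    # hash the outer block on the key
        for r in r_block:
            h.setdefault(r[0], []).append(r)
        for s_block in blocks(S, block_size):     # S read once per r_block
            for s in s_block:
                for r in h.get(s[0], []):
                    out.append((r[0], r[1], s[1]))
    return out

R = [(1, "a"), (2, "b"), (3, "c")]
S = [(2, "x"), (3, "y"), (3, "z")]
print(sorted(block_nested_loop_join(R, S)))  # -> [(2, 'b', 'x'), (3, 'c', 'y'), (3, 'c', 'z')]
```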

SLIDE 45

Nested loop joins, simple vs block

  • Simple nested loop join: each tuple in the outer table brings in all the blocks from the inner table
  • Block nested loop join: each block in the outer table brings in all the blocks from the inner table

SLIDE 46

Nested loop joins on an indexed table

With an index on the inner table, each tuple in the outer table looks up the index of the inner table instead of scanning it.

SLIDE 47

Nested loop joins with zone maps

With the outer table sorted, use the zone map of the inner table to determine whether we need to look into a block at all: for an outer value such as 8, only blocks whose [min, max] range can contain 8 (e.g., [3, 17], [2, 23], [1, 8]) are read, while a block with range [11, 19] is skipped.

SLIDE 48

Nested loop joins on sorted input

With the outer table sorted, there is no need to scan it fully for every tuple in the inner table: the order tells us where to stop (scan only until matches are still possible). A further optimization is to use binary search.

SLIDE 49

Sort-merge join

Sort both T and R on the join attribute, then MERGE the two sorted tables into the joined table: the merge scans both sorted inputs, using the sort order to avoid ever having to go back in either table.
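The merge step can be sketched as follows (sample relations made up; equal keys on the probe side are handled by re-scanning the small group of equal S-tuples):

```python
# Sketch: sort-merge join — sort both inputs on the join key, then merge.
def sort_merge_join(R, S):
    R, S = sorted(R), sorted(S)        # sort both inputs on the join attribute
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][0] < S[j][0]:
            i += 1
        elif R[i][0] > S[j][0]:
            j += 1
        else:
            # emit all S-tuples with this key for the current R-tuple
            k = j
            while k < len(S) and S[k][0] == R[i][0]:
                out.append((R[i][0], R[i][1], S[k][1]))
                k += 1
            i += 1                     # j stays: the next R-tuple may share the key
    return out

R = [(3, "c"), (1, "a"), (2, "b")]
S = [(3, "y"), (2, "x"), (3, "z")]
print(sort_merge_join(R, S))  # -> [(2, 'b', 'x'), (3, 'c', 'y'), (3, 'c', 'z')]
```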

SLIDE 50

Canonical Hash Join

✔ Complexity: O(|R| + |S|), i.e., O(N)
✔ Easy to parallelize

  1. Build phase: scan R and insert each tuple k into a hash table via hash(key), into buckets 0 … n − 1
  2. Probe phase: scan S, compute hash(key) for each tuple k, and check the corresponding bucket for a match

Operators 50
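The two phases map directly onto a dict (itself a hash table); the sample relations are made up:

```python
# Sketch: canonical hash join — build on the smaller relation R, probe with S.
def hash_join(R, S):
    table = {}                         # 1. Build phase: key -> matching R payloads
    for key, payload in R:
        table.setdefault(key, []).append(payload)
    out = []
    for key, payload in S:             # 2. Probe phase: one lookup per S-tuple
        for r_payload in table.get(key, []):
            out.append((key, r_payload, payload))
    return out                         # O(|R| + |S|) overall

R = [(1, "a"), (2, "b"), (3, "c")]
S = [(2, "x"), (3, "y"), (3, "z")]
print(hash_join(R, S))  # -> [(2, 'b', 'x'), (3, 'c', 'y'), (3, 'c', 'z')]
```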

SLIDE 51

(Grace) Hash Join

SLIDE 52

Grace Hash Join

SLIDE 53

Partitioned Hash Join (Shatdal et al. 1994)

  • Idea: partition the input into disjoint chunks of cache size
  • ① Partition both R and S with h1(key) into partitions 1 … p
  • ② Build a hash table per partition with h2(k); ③ Probe it with the matching partition of S
  • No more cache misses during the join
  • Problem: p can be too large!
    • p > #TLB-entries → TLB misses
    • p > #Cache-entries → cache thrashing

"Cache conscious algorithms for relational query processing", Shatdal et al., VLDB '94

SLIDE 54

Multi-Pass Radix Partitioning

  • Problem: the hardware limits the fan-out, i.e., T = #TLB-entries (typically 64–512)
  • Solution: do the partitioning in multiple passes!
    • 1st pass: partition the input relation with h1(key), the 1st log2 T bits of hash(key), into partitions 1 … T
    • 2nd pass: partition each of those with h2(key), the 2nd log2 T bits of hash(key), into T partitions each
    • … the i-th pass, with i = logT p, yields partitions 1 … T^i
  • TLB & cache efficiency compensates for the multiple read/write passes

«Database Architecture Optimized for the new Bottleneck: Memory Access», Manegold et al., VLDB '99
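A minimal two-pass sketch, assuming a fan-out of T = 4 (so each pass consumes log2 T = 2 bits of the key; real systems would hash first and use larger T):

```python
# Sketch: two-pass radix partitioning. Each pass looks at the next log2(T)
# bits, limiting the fan-out per pass to T partitions.
T = 4                                   # fan-out per pass (log2 T = 2 bits)

def radix_pass(tuples, shift):
    parts = [[] for _ in range(T)]
    for x in tuples:
        parts[(x >> shift) & (T - 1)].append(x)   # next 2 bits of the key
    return parts

keys = list(range(32))
first = radix_pass(keys, shift=0)                 # 1st pass: bits 0-1
final = [p2 for p1 in first for p2 in radix_pass(p1, shift=2)]  # 2nd pass: bits 2-3

print(len(final))     # -> 16 partitions (T^2)
print(final[0])       # -> [0, 16]  (all keys sharing the same low 4 bits)
```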

SLIDE 55

Parallel Radix Join

Parallelizing the partitioning, pass 1:

  • 1st scan: each thread (Thread-1 … Thread-N) scans its chunk of the relation and builds a local histogram of partition sizes
  • The local histograms are combined into a global histogram and a prefix sum
  • 2nd scan: each thread scatters out its tuples to the partitioned output based on the prefix sum

"Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs", VLDB '09

SLIDE 56

Parallel Radix Join

Parallelizing the partitioning, passes 2 … i:

  • Each thread individually partitions sub-relations from pass 1
  • As before: a 1st scan builds the histogram and prefix sum, and a 2nd scan scatters the tuples

SLIDE 57

Summary

  • Database engines implement many operators, with many different variations depending on the input data
  • Implementations are also tailored to the particular data type being processed (integers, strings, etc.)
  • Today, the emphasis is on:
    • Using the caches efficiently
    • Exploiting SIMD
    • Taking advantage of the hardware (e.g., prefetching into the caches)