External Sort Database Systems: The Complete Book Ch. 15.4 1 - - PowerPoint PPT Presentation

external sort
SMART_READER_LITE
LIVE PREVIEW

External Sort Database Systems: The Complete Book Ch. 15.4 1 - - PowerPoint PPT Presentation

External Sort Database Systems: The Complete Book Ch. 15.4 1 Operator Memory Needed Project O(1) Select O(1) Bag Union O(1) Join O(1) or O( |R|+|S| ) Group O( |G| ) Distinct O( |R| ) Sort O( |R| ) 2 You can get away with almost


slide-1
SLIDE 1

External Sort

Database Systems: The Complete Book

  • Ch. 15.4

1

slide-2
SLIDE 2

2

Project Bag Union Select Join Group Distinct Operator Memory Needed O(1) O(1) O(1) O(1) or O( |R|+|S| ) O( |G| ) O( |R| ) Sort O( |R| )

slide-3
SLIDE 3

3

You can get away with almost no disk-based algorithms… … as long as you have external sort.

slide-4
SLIDE 4

2-Way Sort

4

image credit: openclipart.org

Pass 1 Sort the Page Load a Page Flush the Page

slide-5
SLIDE 5

2-Way Sort

5

Pass 2 and beyond Read from 2 (sorted) buffers of size K Merge Sort into 1 buffer of size 2K Repeat (how many times?)

slide-6
SLIDE 6

2-Way Sort

6

3,4 6,2 9,4 8,7 5,6 3,1 2 3,4 2,6 4,9 7,8 5,6 1,3 2 2,3 4,6 4,7 8,9 5,6 1,3 2 2,3 4,4 6,7 8,9 6 3,5 1,2 2,3 4,5 6,6 9 7,8 3,4 1,2

slide-7
SLIDE 7

Generalized External Sort

7

How can we use N buffer frames? For Pass 1? For Pass 2 onwards? How many passes do we make over the full data? Sort Bigger Initial Buffers Merge-sort Multiple Streams For data of size N, a K-way sort requires passes How many IOs do we use?

2 · #pages · #passes

slide-8
SLIDE 8

8

Pass 1 is memory-limited If we have N pages of memory, can we create more than N pages of sorted data?

slide-9
SLIDE 9

Replacement Sort

  • General idea: Create “runs” of sorted data
  • Keep a very large “working set” of data.
  • Keep appending data in ascending order to an output buffer.
  • As you flush sorted data to the output, keep loading new

tuples into the working set.

  • If you get new tuples useful for the current buffer, great!
  • Otherwise, they’ll go into the next run
  • When you run out of valid tuples to append, start a new run!

9

slide-10
SLIDE 10

Replacement Sort

10

Input Buffer Output Buffer Working Set 2 8 10 … 3 5 k=5 Step 1: Find the lowest value in the working set greater than k Step 0: k is the last value that was appended to the

  • utput buffer

Step 2: Append the value to the output buffer and update k k=8 Step 3: Insert a tuple from the input buffer and re-sort the working set 12 Repeat until k is bigger than all values in the working set

Finish the “run” and start a new one

slide-11
SLIDE 11

Replacement Sort

11

If you have N pages of memory, how many pages of sorted data will you make? E[k] = avg(k) On average, half of the tuples you read in will be useful for the current stream.

slide-12
SLIDE 12

How do we use sorted data to implement other memory-bound operators?

12

slide-13
SLIDE 13

13

Joins

slide-14
SLIDE 14

Implementing: Joins

Solution 3 (Sort-Merge Join) A B

3 1 5 2 5 4 1

Keep iterating on the set with the lowest value.

1 5 6

Done! When you hit two that match, emit, then iterate both

14

slide-15
SLIDE 15

15

Distinct

slide-16
SLIDE 16

Sort-By Distinct

16

3 1 5 2 5 4 1 6

slide-17
SLIDE 17

Sort-By Distinct

17

3 1 5 2 5 4 1 6 1 3 2 5 4 6

slide-18
SLIDE 18

18

Group-By

slide-19
SLIDE 19

Sort-By Grouping

19

3,8 1,1 5,4 2,2 5,9 4,3 1,5 6,7

slide-20
SLIDE 20

Sort-By Grouping

20

3,8 1,1 5,9 2,2 5,4 4,3 1,5 6,7

6 8 2 13 3 7