SLIDE 1

DATABASE CRACKING:

Fancy Scan, not Poor Man’s Sort!

Holger Pirk, Eleni Petraki, Strato Idreos, Stefan Manegold, Martin Kersten

[Slide labels group the authors into “Hardware Folks” and “Cracking Folks”]

SLIDE 2

EVALUATING RANGE PREDICATES

SLIDE 3

COMPLEXITY ON PAPER

  • Scanning: O(n)
  • Sorting: O(n×log(n))
  • Cracking: O(n)
  • Essentially a single quicksort partition step (sketched below)
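
Concretely, the "single quicksort step" is one in-place partition of the column around the query's pivot. A minimal textbook sketch in C (branching version; the name crack_once is illustrative, not from the deck):

    #include <stddef.h>

    /* One cracking step: in-place partition around `pivot`.
       Returns the boundary b with data[0..b) < pivot <= data[b..n). */
    size_t crack_once(int *data, size_t n, int pivot) {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            if (data[lo] < pivot) {
                lo++;                    /* already on the low side  */
            } else {
                int tmp = data[--hi];    /* swap offender to the end */
                data[hi] = data[lo];
                data[lo] = tmp;
            }
        }
        return lo;
    }

The resulting boundary is remembered in the cracker index, so later queries scan ever smaller pieces of the column.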
SLIDE 4

COSTS IN REALITY

  • Implemented microbenchmarks (a minimal harness is sketched below)
  • 1 Billion uniform random integer values
  • Pivot in the middle of the range
  • Workstation machine (16 GB RAM, 4 Sandy Bridge Cores)
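
A minimal harness in that spirit (assumptions: POSIX clock_gettime for timing, rand() for "uniform-ish" data, and the crack_once sketch from above; the deck does not show the actual benchmark code):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    size_t crack_once(int *data, size_t n, int pivot);  /* sketch above */

    int main(void) {
        size_t n = 1000000000;                /* 1 billion ints = 4 GB  */
        int *data = malloc(n * sizeof(int));
        if (!data) return 1;
        for (size_t i = 0; i < n; i++)
            data[i] = rand();                 /* uniform-ish random data */
        int pivot = RAND_MAX / 2;             /* middle of the range     */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t b = crack_once(data, n, pivot);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("boundary %zu, %.2f s\n", b, secs);
        free(data);
        return 0;
    }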
SLIDE 5

COSTS IN REALITY

[Chart: wall-clock time in seconds for Parallel Scanning, Cracking, and Parallel Sorting]

SLIDE 6

SO: WHAT’S GOING ON?

SLIDE 7

CACHE MISSES? NOPE!

[Chart: L1I, L1D, L2, and L3 cache misses for Scanning, Cracking, and Sorting]

SLIDE 8

CPU COSTS

[Diagram: top-down pipeline analysis — Micro-ops issued? Allocation stall? Micro-op ever retires? — classifying cycles as Frontend Bound, Backend Bound (cache-miss stalls vs. other stalls), Bad Speculation, or Retiring]

SLIDE 9

CPU COSTS

[Chart: pipeline-slot breakdown — Frontend, Bad Speculation, Retiring, Data Stalls, Backend — for Scanning, Cracking, and Sorting]

SLIDE 10

CPU COSTS

[Chart: same pipeline-slot breakdown; callout: 14 %!]

SLIDE 11

CPU COSTS

[Chart: same pipeline-slot breakdown]

Lots of Potential

SLIDE 12

WHAT CAN WE DO ABOUT IT?

SLIDE 13

INCREASING CPU EFFICIENCY

SLIDE 14

PREDICATION

    /* branching version: mispredicted on random data */
    for (i = 0; i < size; i++)
        if (input[i] < pivot) {
            output[outI] = input[i];
            outI++;
        }

    /* predicated version: store unconditionally, let the
       comparison result advance the output cursor */
    for (i = 0; i < size; i++) {
        output[outI] = input[i];
        outI += (input[i] < pivot);
    }

SLIDE 15

PREDICATION

  • Turns control dependencies into data dependencies
  • Eliminates Branch Mispredictions
  • Causes unconditional (potentially unnecessary) I/O (limited to caches)
  • Works only for out-of-place algorithms
SLIDE 16

PREDICATED CRACKING

SLIDE 17

PREDICATED CRACKING

[Diagram: array 7 2 4 8 2 9 3 8 1 5 7 5 3 with active and backup cursors at its two ends; pivot = 5]

SLIDE 18

PREDICATED CRACKING

[Diagram: array 3 2 4 8 2 9 3 8 1 5 7 5 7 with active and backup registers holding 3 and 7; pivot = 5]

SLIDE 19

PREDICATED CRACKING

[Diagram: state before the iteration — array 3 2 4 8 2 9 3 8 1 5 7 5 7; active = 3, backup = 7; pivot = 5; cmp not yet evaluated]

SLIDE 20

PREDICATED CRACKING

Evaluate Predicate & Write

[Diagram: cmp = (active < pivot) = 1; the active value 3 is written at the cursor positions — array 3 2 4 8 2 9 3 8 1 5 7 5 3; backup = 7]

SLIDE 21

PREDICATED CRACKING

Advance Cursor

[Diagram: exactly one cursor moves, branch-free — lower += cmp, upper −= (1 − cmp); array 3 2 4 8 2 9 3 8 1 5 7 5 3; cmp = 1]

SLIDE 22

PREDICATED CRACKING

Read Next Element

[Diagram: the next value is selected branch-free — active = cmp · *lower + (1 − cmp) · backup; now active = 2, backup = 7; array 3 2 4 8 2 9 3 8 1 5 7 5 3]

SLIDE 23

PREDICATED CRACKING

[Diagram: state after the iteration — array 3 2 4 8 2 9 3 8 1 5 7 5 3; active = 2, backup = 7; pivot = 5]

SLIDE 24

PREDICATED CRACKING

  • Predication for in-place algorithms (reconstruction sketched below)
  • No branching ⇒ no branch mispredictions
  • Somewhat intricate
  • Lots of copying stuff around (integer granularity ⇒ inefficient)
  • Bulk-copying would be more efficient
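
Putting the four animation steps together, here is one way to reconstruct the algorithm in C. This is a sketch inferred from the slides — the double-write trick, the arithmetic selects, and the epilogue are our reading, not necessarily the authors' exact code:

    #include <stddef.h>

    /* Branch-free in-place cracking (assumes n >= 1).
       Returns the boundary b with data[0..b) < pivot <= data[b..n). */
    size_t predicated_crack(int *data, size_t n, int pivot) {
        size_t lo = 0, hi = n - 1;
        int active = data[lo];   /* value currently being classified   */
        int backup = data[hi];   /* displaced value from the other end */
        int cmp = 0;
        while (lo < hi) {
            /* 1. evaluate the predicate, no branch */
            cmp = active < pivot;
            /* 2. write active to both free slots; only the write at
                  the advancing cursor survives, the other slot is
                  overwritten by a later iteration */
            data[lo] = active;
            data[hi] = active;
            /* 3. advance exactly one cursor */
            lo += cmp;
            hi -= 1 - cmp;
            /* 4. select the next value branch-free: read the array on
                  the side that advanced, otherwise take the backup */
            active = cmp * data[lo] + (1 - cmp) * backup;
            backup = cmp * backup + (1 - cmp) * data[hi];
        }
        /* one value is still in flight; it belongs in the last slot */
        int last = cmp ? backup : active;
        data[lo] = last;
        return lo + (last < pivot);
    }

Every iteration executes the same straight-line instructions, so there is nothing for the branch predictor to miss; the price is the extra dead store per element.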
SLIDE 25

VECTORIZED CRACKING

SLIDE 26

VECTORIZED CRACKING

  • Turns in-place cracking into out-of-place cracking
  • Copies vector-sized chunks and cracks them into the array
  • Makes vanilla predication possible
  • Uses SIMD-copying for vector copying
  • Challenge: ensure that values aren't “accidentally” overwritten (simplified sketch below)
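
A deliberately simplified sketch of the chunk-at-a-time idea, writing into a separate output array so the overwrite hazard named above disappears (the in-place version in the talk must manage free space at both ends of the input; VEC and the plain memcpy standing in for SIMD copies are illustrative):

    #include <stddef.h>
    #include <string.h>

    #define VEC 1024   /* elements per chunk; sized to fit in L1 (assumption) */

    /* Crack `in` around `pivot` into `out` (both hold n ints):
       bulk-copy each chunk into a small buffer, then crack the
       buffer with predicated writes to both ends of the output.
       Returns the boundary b with out[0..b) < pivot <= out[b..n). */
    size_t vectorized_crack(const int *in, int *out, size_t n, int pivot) {
        int buf[VEC];
        size_t lo = 0, hi = n;                 /* output write cursors */
        for (size_t base = 0; base < n; base += VEC) {
            size_t len = (n - base < VEC) ? n - base : VEC;
            memcpy(buf, in + base, len * sizeof(int));  /* SIMD-friendly bulk copy */
            for (size_t i = 0; i < len; i++) {
                int cmp = buf[i] < pivot;
                out[lo] = buf[i];              /* double write: exactly */
                out[hi - 1] = buf[i];          /* one of the two sticks */
                lo += cmp;
                hi -= 1 - cmp;
            }
        }
        return lo;                             /* lo == hi here */
    }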
SLIDE 27

VECTORIZED CRACKING

[Diagram: alternating copy and partition steps over vector-sized chunks]

SLIDE 28

RESULTS

SLIDE 29

RESULTS

[Chart: pipeline-slot breakdown — Frontend, Bad Speculation, Retiring, Data Stalls, Backend — for Original, Predicated, and Vectorized cracking]

SLIDE 30

RESULTS: WORKSTATION

[Chart: wall-clock time in seconds for Scan, Vectorized, Predicated (Register), Predicated (Cache), and Original]

SLIDE 31

RESULTS: SERVER

[Chart: wall-clock time in seconds for Scan, Vectorized, Predicated (Register), Predicated (Cache), and Original]

Not there yet!

SLIDE 32

PARALLELIZATION

SLIDE 33

PARALLELIZATION

  • Obvious Solution: Partitioning
SLIDE 34

CRACK & MERGE

[Diagram: partition step — each thread cracks its own chunk, producing per-chunk low and high segments (x1…x4, y1…y4)]

SLIDE 35

CRACK & MERGE

[Diagram: merge step — the per-chunk segments are exchanged so that all low segments precede all high segments]

SLIDE 36

REFINED CRACK & MERGE

[Diagram: refined partition step over the same chunks]

SLIDE 37

REFINED CRACK & MERGE

[Diagram: refined merge step — only a smaller region must be merged]
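
Putting the pictures into code, a minimal two-thread crack & merge sketch (assumptions: C11 <threads.h>, the predicated_crack sketch from slide 24, and a three-reversal rotation for the merge; the refined variant shrinks the rotated region but is omitted here):

    #include <stddef.h>
    #include <threads.h>

    size_t predicated_crack(int *data, size_t n, int pivot);  /* sketch above */

    typedef struct { int *data; size_t n; int pivot; size_t boundary; } Task;

    static int crack_task(void *arg) {
        Task *t = arg;
        t->boundary = predicated_crack(t->data, t->n, t->pivot);
        return 0;
    }

    static void reverse(int *a, size_t n) {
        for (size_t i = 0, j = n; i + 1 < j; i++, j--) {
            int tmp = a[i]; a[i] = a[j - 1]; a[j - 1] = tmp;
        }
    }

    /* Merge two adjacent cracked ranges: the left range's high part
       [b1, mid) and the right range's low part [mid, b2) swap places
       via three reversals (an in-place rotation). */
    static size_t merge_cracked(int *data, size_t b1, size_t mid, size_t b2) {
        reverse(data + b1, mid - b1);
        reverse(data + mid, b2 - mid);
        reverse(data + b1, b2 - b1);
        return b1 + (b2 - mid);        /* combined boundary */
    }

    /* Crack both halves in parallel, then merge (assumes n >= 2). */
    size_t parallel_crack(int *data, size_t n, int pivot) {
        size_t half = n / 2;
        Task left  = { data, half, pivot, 0 };
        Task right = { data + half, n - half, pivot, 0 };
        thrd_t th;
        thrd_create(&th, crack_task, &right);  /* right half: 2nd thread */
        crack_task(&left);                     /* left half: this thread */
        thrd_join(th, NULL);
        return merge_cracked(data, left.boundary, half,
                             half + right.boundary);
    }

With more threads the merges form a reduction tree; the refined scheme above shrinks each rotated region before it is moved.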

SLIDE 38

RESULTS: WORKSTATION

[Chart: wall-clock time in seconds for Scan, RVPCrack, RPCrack, PVCrack, PCrack, and Vectorized]

SLIDE 39

RESULTS: SERVER

[Chart: wall-clock time in seconds for Scan, RVPCrack, RPCrack, PVCrack, PCrack, and Vectorized]

SLIDE 40

IMPACT OF SELECTIVITY: WORKSTATION

[Chart: wall-clock time in seconds vs. qualifying tuples per pivot, for Scanning, Vectorized, Partition & Merge, Vectorized Partition & Merge, Refined Partition & Merge, and Vectorized Refined Partition & Merge]

SLIDE 41

IMPACT OF SELECTIVITY: SERVER

[Chart: wall-clock time in seconds vs. qualifying tuples per pivot, same variants as above]

SLIDE 42

CONCLUSIONS
