DATABASE CRACKING: Fancy Scan, not Poor Man's Sort!
Holger Pirk, Eleni Petraki, Stratos Idreos, Stefan Manegold, Martin Kersten
EVALUATING RANGE PREDICATES
COMPLEXITY ON PAPER
- Scanning: O(n)
- Sorting: O(n×log(n))
- Cracking: O(n)
- Essentially a single Quicksort step
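That single Quicksort step can be sketched in a few lines of C. This is an illustrative sketch, not the benchmarked implementation; the name `crack` and its swap-based inner loop are our own:

```c
#include <stddef.h>

/* One cracking step: a single in-place partition around `pivot`,
   i.e. exactly one Quicksort step, O(n). Illustrative sketch. */
size_t crack(int *a, size_t n, int pivot) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        if (a[lo] < pivot) {
            lo++;                       /* already on the low side */
        } else {
            hi--;                       /* move it to the high side */
            int tmp = a[lo]; a[lo] = a[hi]; a[hi] = tmp;
        }
    }
    return lo;                          /* index of the first value >= pivot */
}
```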
COSTS IN REALITY
- Implement microbenchmarks
- 1 Billion uniform random integer values
- Pivot in the middle of the range
- Workstation machine (16 GB RAM, 4 Sandy Bridge Cores)
(Figure: wallclock time in s for Parallel Scanning, Cracking, and Parallel Sorting)
SO: WHAT’S GOING ON?
(Figure: cache misses, L1I, L1D, L2, and L3, for Scanning, Cracking, and Sorting)
CACHE MISSES? NOPE!
(Diagram: top-down pipeline-slot classification: Micro-ops Issued? / Allocation Stall? / Micro-op Ever Retire? leading to Frontend Bound, Backend Bound, Bad Speculation, and Retiring; backend stalls split into Cache Miss Stalls and Other Stalls)
CPU COSTS
(Figure: pipeline-slot breakdown into Pipeline Frontend, Bad Speculation, Retiring, Data Stalls, and Pipeline Backend for Scanning, Cracking, and Sorting; annotated "14 % !!!")
Lots of Potential
WHAT CAN WE DO ABOUT IT?
INCREASING CPU EFFICIENCY
PREDICATION
for (i = 0; i < size; i++)
    if (input[i] < pivot) {
        output[outI] = input[i];
        outI++;
    }

for (i = 0; i < size; i++) {
    output[outI] = input[i];
    outI += (input[i] < pivot);
}
PREDICATION
- Turns control dependencies into data dependencies
- Eliminates Branch Mispredictions
- Causes unconditional (potentially unnecessary) I/O (limited to caches)
- Works only for out-of-place algorithms
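Putting the two loops above side by side as compilable C (our sketch; the names `select_branching` and `select_predicated` are illustrative): the predicated variant writes every element unconditionally and advances the output cursor by the comparison result, so the branch disappears.

```c
#include <stddef.h>

/* Branching selection: copies qualifying values, one branch per element. */
size_t select_branching(const int *input, size_t size, int pivot, int *output) {
    size_t outI = 0;
    for (size_t i = 0; i < size; i++)
        if (input[i] < pivot)
            output[outI++] = input[i];
    return outI;
}

/* Predicated selection: every element is written unconditionally; the
   cursor only advances when the predicate holds, so the next write
   overwrites non-qualifying values. No branch to mispredict. */
size_t select_predicated(const int *input, size_t size, int pivot, int *output) {
    size_t outI = 0;
    for (size_t i = 0; i < size; i++) {
        output[outI] = input[i];
        outI += (size_t)(input[i] < pivot);
    }
    return outI;
}
```

Both produce the same output; they differ only in whether the predicate steers control flow or data flow.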
PREDICATED CRACKING
(Animation: step-by-step walkthrough on the array 7 2 4 8 2 9 3 8 1 5 7 5 3 with pivot 5, tracking the active and backup cursors and the cmp flag through one iteration: state before the iteration, evaluate predicate & write, advance cursor, read next element)
- Predication for in-place algorithms
- No branching ⇒ no branch mispredictions
- Somewhat intricate
- Lots of copying at integer granularity ⇒ inefficient
- Bulk-copying would be more efficient
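One way to realize the scheme from the walkthrough (our reconstruction, not the paper's exact code): keep two elements in registers (`active` and `backup`) so that both cursor positions are always free, write `active` to both ends unconditionally, and let the wrong-side copy be overwritten by a later iteration. The ternaries are intended to compile to conditional moves, not branches.

```c
#include <stddef.h>

/* Predicated in-place cracking sketch. Two array slots' values live in
   registers (active, backup), so both ends can be written without
   checking the predicate first. Returns the partition boundary. */
size_t crack_predicated(int *a, size_t n, int pivot) {
    if (n == 0) return 0;
    size_t L = 0, R = n - 1;
    int active = a[L];                 /* value currently being placed */
    int backup = a[R];                 /* saved value of the right free slot */
    while (L < R) {
        int cmp = active < pivot;      /* 1: belongs left, 0: belongs right */
        a[L] = active;                 /* write to both free slots; the    */
        a[R] = active;                 /* wrong-side copy is overwritten   */
        L += cmp;
        R -= 1 - cmp;
        /* rotate registers: pull in the value the advanced cursor exposed */
        int exposed = cmp ? a[L] : a[R];
        active = (cmp && L < R) ? exposed : backup;
        backup = cmp ? backup : exposed;
    }
    a[L] = active;                     /* place the last remaining value */
    return L + (size_t)(active < pivot);
}
```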
VECTORIZED CRACKING
- Turns in-place cracking into out-of-place cracking
- Copies Vector-sized chunks and cracks them into the array
- Makes vanilla-predication possible
- Uses SIMD-copying for vector copying
- Challenge: ensure that values aren't "accidentally" overwritten
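A scalar sketch of the idea (our reconstruction, with `memcpy` standing in for the SIMD copies; the names and the chunk width `V` are illustrative): buffering a chunk from each end frees slots inside the array, so the subsequent writes are effectively out-of-place and plain predication applies. Always refilling from the side with less head-room guarantees the writes never overrun unread values.

```c
#include <stddef.h>
#include <string.h>

#define V 4  /* stand-in for the SIMD vector width */

/* Write buffered values back: < pivot to the left cursor, the rest to
   the right cursor (ternary destination instead of a branch). */
static void scatter(int *a, const int *buf, size_t len, int pivot,
                    size_t *wL, size_t *wR) {
    for (size_t i = 0; i < len; i++) {
        int cmp = buf[i] < pivot;
        size_t dst = cmp ? (*wL)++ : --(*wR);
        a[dst] = buf[i];
    }
}

/* Chunked (vectorized-style) cracking: buffering one chunk from each
   end turns the in-place crack into out-of-place writes in the array. */
size_t crack_chunked(int *a, size_t n, int pivot) {
    if (n < 2 * V) {                       /* scalar fallback */
        size_t wL = 0;
        for (size_t i = 0; i < n; i++)
            if (a[i] < pivot) { int t = a[i]; a[i] = a[wL]; a[wL++] = t; }
        return wL;
    }
    int bufL[V], bufR[V], chunk[V];
    memcpy(bufL, a, sizeof bufL);          /* free V slots on each end */
    memcpy(bufR, a + n - V, sizeof bufR);
    size_t rL = V, rR = n - V;             /* unread region [rL, rR)    */
    size_t wL = 0, wR = n;                 /* free: [wL,rL) and [rR,wR) */
    while (rR - rL >= V) {
        /* refill from the side with less free room so stores can't
           catch up with unread values on either side */
        if (rL - wL <= wR - rR) { memcpy(chunk, a + rL, sizeof chunk); rL += V; }
        else                    { rR -= V; memcpy(chunk, a + rR, sizeof chunk); }
        scatter(a, chunk, V, pivot, &wL, &wR);
    }
    size_t rem = rR - rL;                  /* leftover smaller than V */
    memcpy(chunk, a + rL, rem * sizeof(int));
    scatter(a, chunk, rem, pivot, &wL, &wR);
    scatter(a, bufL, V, pivot, &wL, &wR);  /* place the initial chunks */
    scatter(a, bufR, V, pivot, &wL, &wR);
    return wL;                             /* == wR: partition boundary */
}
```

In the real thing the `memcpy` and `scatter` steps would be SIMD loads, compares, and stores; the cursor bookkeeping is the part that prevents "accidental" overwrites.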
(Diagram: alternating copy and partition steps across the array)
RESULTS
(Figure: pipeline-slot breakdown into Pipeline Frontend, Bad Speculation, Retiring, Data Stalls, and Pipeline Backend for Original, Predicated, and Vectorized cracking)
RESULTS: WORKSTATION
(Figure: wallclock time in s for Original, Predicated (Cache), Predicated (Register), Vectorized, and Scan)
RESULTS: SERVER
(Figure: wallclock time in s for Original, Predicated (Cache), Predicated (Register), Vectorized, and Scan)
Not there yet!
PARALLELIZATION
- Obvious Solution: Partitioning
CRACK & MERGE
(Diagram: the array is split into partitions, each partition is cracked into its x/y segments, and the misplaced segments are then merged)
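For two partitions, the merge step can be done with a block rotation. This is our sketch: `crack` is a plain partition pass, and the threads that would crack the two halves concurrently are omitted for brevity.

```c
#include <stddef.h>

/* Plain cracking pass: in-place partition, returns the boundary. */
static size_t crack(int *a, size_t n, int pivot) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        if (a[lo] < pivot) lo++;
        else { hi--; int t = a[lo]; a[lo] = a[hi]; a[hi] = t; }
    }
    return lo;
}

static void reverse(int *a, size_t n) {
    for (size_t i = 0, j = n; i + 1 < j; i++, j--) {
        int t = a[i]; a[i] = a[j - 1]; a[j - 1] = t;
    }
}

/* Crack & merge for two partitions: crack each half (one thread per
   partition in the parallel version), giving the layout x1 y1 x2 y2;
   then rotate the middle block y1 x2 to x2 y1, yielding x1 x2 | y1 y2. */
size_t crack_merge(int *a, size_t n, int pivot) {
    size_t mid = n / 2;
    size_t s1 = crack(a, mid, pivot);           /* [x1 | y1] */
    size_t s2 = crack(a + mid, n - mid, pivot); /* [x2 | y2] */
    reverse(a + s1, mid - s1);                  /* three reversals =   */
    reverse(a + mid, s2);                       /* rotate [y1 x2]      */
    reverse(a + s1, mid - s1 + s2);             /* into [x2 y1]        */
    return s1 + s2;
}
```

The rotation touches only the two misplaced middle segments, which is exactly what the refined variant tries to shrink further.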
REFINED CRACK & MERGE
(Diagram: the partitions are cracked so that the misplaced segments are smaller, leaving a smaller merge)
RESULTS: WORKSTATION
(Figure: seconds for Vectorized, PCrack, PVCrack, RPCrack, RVPCrack, and Scan)
RESULTS: SERVER
(Figure: seconds for Vectorized, PCrack, PVCrack, RPCrack, RVPCrack, and Scan)
IMPACT OF SELECTIVITY: WORKSTATION
(Figure: wallclock time in s vs. qualifying tuples per pivot for Scanning, Vectorized, Partition & Merge, Vectorized Partition & Merge, Refined Partition & Merge, and Vectorized Refined Partition & Merge)
IMPACT OF SELECTIVITY: SERVER
(Figure: wallclock time in s vs. qualifying tuples per pivot for Scanning, Vectorized, Partition & Merge, Vectorized Partition & Merge, Refined Partition & Merge, and Vectorized Refined Partition & Merge)