Efficient Top-K Query Processing on Massively Parallel Hardware - - PowerPoint PPT Presentation



SLIDE 1

Efficient Top-K Query Processing on Massively Parallel Hardware

ANIL SHANBHAG, HOLGER PIRK, SAM MADDEN


SLIDE 2

CPU                          GPU
16-24 cores                  ~5000 cores
Main Mem: 128-256 GB         GPU Mem: 12-32 GB
Bandwidth: 60 GB/s           Bandwidth: 250-900 GB/s

CPU-GPU interconnect: 16-40 GB/s

SLIDE 3

Top-K

SELECT id
FROM tweets
WHERE tweet_time ∈ [X, Y]
ORDER BY retweet_count + 0.5*likes_count DESC
LIMIT K

Typical K is 5-100.
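The query can be expressed in plain Python for reference; this is a sketch over made-up tweet records (the field names mirror the query above), not the paper's implementation:

```python
import heapq

# Hypothetical tweet records; field names mirror the SQL query on this slide.
tweets = [
    {"id": 1, "tweet_time": 5, "retweet_count": 10, "likes_count": 4},
    {"id": 2, "tweet_time": 8, "retweet_count": 3,  "likes_count": 20},
    {"id": 3, "tweet_time": 2, "retweet_count": 7,  "likes_count": 2},  # outside [X, Y]
    {"id": 4, "tweet_time": 6, "retweet_count": 9,  "likes_count": 9},
]

X, Y, K = 4, 9, 2  # time window and result size

# WHERE tweet_time ∈ [X, Y] ... ORDER BY retweet_count + 0.5*likes_count DESC LIMIT K
score = lambda t: t["retweet_count"] + 0.5 * t["likes_count"]
in_window = (t for t in tweets if X <= t["tweet_time"] <= Y)
top_k = heapq.nlargest(K, in_window, key=score)
print([t["id"] for t in top_k])  # → [4, 2]
```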


SLIDE 4

Top-K

Classic sequential algorithm: use a min-heap of size k to maintain the top-k items.

[Figure: min-heap example over the values 7, 15, 20, 21, 50, 40, 32]
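A minimal Python sketch of this classic algorithm, using the values from the slide's figure:

```python
import heapq

def top_k(stream, k):
    """Classic sequential top-k: keep a min-heap of the k largest seen so far.
    The heap root is the smallest of the current top-k, so each new item is
    compared against it in O(1) and replaces it in O(log k) when larger."""
    heap = []
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)

print(top_k([7, 15, 20, 21, 50, 40, 32], 4))  # → [50, 40, 32, 21]
```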


SLIDE 5

Partition and Merge

On a multi-core CPU: partition the data across the cores, then merge the per-core results.
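A sketch of this partition-and-merge scheme using Python threads; the partition count and helper names are illustrative, not from the paper:

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def top_k(xs, k):
    return heapq.nlargest(k, xs)

def parallel_top_k(data, k, n_parts=4):
    """Multi-core scheme from the slide: partition the data, find the local
    top-k of each partition in parallel, then merge the partial results."""
    parts = [data[i::n_parts] for i in range(n_parts)]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = pool.map(lambda p: top_k(p, k), parts)
    # Merge: the global top-k must lie in the union of the local top-k lists.
    return top_k([x for p in partials for x in p], k)

print(parallel_top_k(list(range(100)), 5))  # → [99, 98, 97, 96, 95]
```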

SLIDE 6

On GPU

Partition-and-merge does not work well with the GPU execution model. Problems:

  • Significant thread divergence within a warp of threads
  • Maintaining a heap of size k per thread limits performance

SLIDE 7

Intuition

           Sequential        Parallel
  Top-K    Priority Queue    Heap Per-Thread, Radix-Select, Bucket-Select, ???
  Sort     Heap Sort         Bitonic Sort

The missing parallel Top-K counterpart of Bitonic Sort ("???") is the proposed Bitonic Top-K.

SLIDE 8

Bitonic Top-K


SLIDE 9

Bitonic Sequence

A sequence S = <a0, a1, a2, …, an-1> such that

  • a0 ≤ a1 ≤ … ≤ ak
  • ak+1 ≥ ak+2 ≥ … ≥ an-1

i.e., the concatenation of two monotonic sequences, one ascending and one descending.

SLIDE 10

Bitonic Merge

Given a bitonic sequence <a0, a1, …, an-1>, one round of compare-exchanges produces

S1 = <min(a0, an/2), min(a1, an/2+1), …, min(an/2-1, an-1)>
S2 = <max(a0, an/2), max(a1, an/2+1), …, max(an/2-1, an-1)>

S1 and S2 are both bitonic, and S1 < S2: every element in S1 is smaller than any element of S2.

Applying this split recursively on S1 and S2 sorts the entire sequence in log(n) rounds.
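The split-and-recurse rule can be sketched in Python; this is a CPU model of the merge network (input length assumed a power of two), not the paper's GPU kernel:

```python
def bitonic_merge(seq, ascending=True):
    """Sort a bitonic sequence (length a power of two).
    One round of compare-exchanges splits it into halves S1, S2 with
    every element of S1 <= every element of S2 (both still bitonic);
    recursing on each half sorts the whole sequence in log2(n) rounds."""
    n = len(seq)
    if n == 1:
        return seq
    half = n // 2
    s1 = [min(seq[i], seq[i + half]) for i in range(half)]
    s2 = [max(seq[i], seq[i + half]) for i in range(half)]
    if not ascending:
        s1, s2 = s2, s1  # descending order: the larger half comes first
    return bitonic_merge(s1, ascending) + bitonic_merge(s2, ascending)

print(bitonic_merge([3, 7, 9, 8, 6, 2, 1, 0]))  # bitonic input: rises, then falls
```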

SLIDE 11

Bitonic Sort

[Figure: the bitonic sorting network for 16 elements, organized into phases and steps]

Complexity: O(n (log n)^2)
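A CPU sketch of the full sorting network in its recursive formulation, assuming a power-of-two input length; the GPU implementation runs the same compare-exchanges as data-parallel phases and steps:

```python
def bitonic_sort(seq, ascending=True):
    """Full bitonic sort: recursively sort the two halves in opposite
    directions (their concatenation is then bitonic), and bitonic-merge it.
    log2(n) phases of up to log2(n) steps give O(n (log n)^2) comparisons."""
    n = len(seq)
    if n == 1:
        return seq
    first = bitonic_sort(seq[: n // 2], True)
    second = bitonic_sort(seq[n // 2 :], False)
    return bitonic_merge(first + second, ascending)

def bitonic_merge(seq, ascending):
    # One merge phase: log2(n) rounds of compare-exchanges at shrinking strides.
    n = len(seq)
    if n == 1:
        return seq
    half = n // 2
    lo = [min(seq[i], seq[i + half]) for i in range(half)]
    hi = [max(seq[i], seq[i + half]) for i in range(half)]
    if not ascending:
        lo, hi = hi, lo
    return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)

print(bitonic_sort([5, 1, 7, 3, 0, 6, 2, 4]))  # → [0, 1, 2, 3, 4, 5, 6, 7]
```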

SLIDE 12

Bitonic Top-K

Finding the top-4 in 16 elements:

Phase 1 (Local Sort): sort the unsorted input into sorted sequences of length k.
Phase 2 (Merge): merge neighboring sorted sequences of length k, keeping only the largest k elements; the survivors form a bitonic sequence.
Phase 3 (Rebuild): sort each bitonic sequence of length k, yielding sorted sequences again.

Repeat Merge and Rebuild until a single list of size k remains: the top-k result.

Complexity: O(n (log k)^2)
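The three phases can be modeled on the CPU. This sketch assumes n and k are powers of two with k dividing n; it models the data flow of the algorithm, not the paper's kernel code:

```python
def bitonic_merge(seq):
    # Sort a bitonic sequence ascending (the Rebuild phase).
    n = len(seq)
    if n == 1:
        return seq
    half = n // 2
    lo = [min(seq[i], seq[i + half]) for i in range(half)]
    hi = [max(seq[i], seq[i + half]) for i in range(half)]
    return bitonic_merge(lo) + bitonic_merge(hi)

def bitonic_top_k(data, k):
    """Bitonic Top-K, CPU model of the slide's three phases.
    Phase 1: sort each chunk of length k.  Then repeatedly:
    Phase 2 (Merge): for each pair of sorted k-runs (one reversed so the pair
    is bitonic), k compare-exchanges keep the k larger elements, which again
    form a bitonic sequence; Phase 3 (Rebuild): sort that bitonic sequence so
    the halving can continue.  Total work is O(n (log k)^2)."""
    runs = [sorted(data[i:i + k]) for i in range(0, len(data), k)]  # Phase 1
    while len(runs) > 1:
        next_runs = []
        for a, b in zip(runs[0::2], runs[1::2]):
            b = b[::-1]                                   # a + b is bitonic
            top = [max(a[i], b[i]) for i in range(k)]     # Phase 2: top-k survive
            next_runs.append(bitonic_merge(top))          # Phase 3: rebuild
        runs = next_runs
    return runs[0][::-1]  # largest first

print(bitonic_top_k([9, 3, 14, 7, 1, 12, 6, 15, 2, 8, 0, 11, 5, 13, 4, 10], 4))
```

Note how the Merge phase discards half of the remaining candidates with only k comparisons per pair, which is where the (log k)^2 factor (instead of (log n)^2) comes from.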

SLIDE 13

On the GPU

Simplest way to partition the work into kernels: each column (step) gets its own kernel invocation, and each thread does one comparison; n/2 comparisons are needed, so n/2 threads are launched.

Time to find top-32 in a sequence of size 2^29:

  Sort: 521 ms
  Naive: 130 ms
  Final: 14.5 ms
  One Pass: 10 ms

SLIDE 14

Optimizations


SLIDE 15

Optimizations

GPU memory hierarchy: each SM has registers, L1, and shared memory (SMEM) on chip, reaching up to 3.5 TB/s; all SMs share an L2 cache in front of off-chip global memory at ~260 GB/s.

SLIDE 16

Optimization 1: Using Shared Memory

For a thread block with T threads, load 2T elements into shared memory and do the compare-exchanges there.

[Chart: time to find top-32 in a sequence of size 2^29, shared memory access vs. global memory access]

SLIDE 17

Optimization 2: Combining Phases

Instead of loading 2T elements, load 8T elements and combine the 5 phases into a single kernel. The kernel then becomes shared-memory bandwidth bound.

[Chart: shared memory access vs. global memory access]

SLIDE 18

Optimization 3: Combining Steps

Instead of performing one compare-exchange step per shared-memory round trip, each thread can perform three steps at a time, cutting the number of shared memory accesses.

[Diagram: one step at a time vs. three steps at a time; chart of shared memory accesses]

SLIDE 19

Optimization 4: Padding

Shared memory is split into banks, and simultaneous accesses that hit the same bank serialize. Inserting unused padding cells shifts addresses so that the threads' accesses land in distinct banks.

[Diagram: thread accesses to memory bank addresses before padding (conflicting) and after padding (unused cells eliminate the conflicts)]
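A toy model of why padding helps; the 32-bank layout and the pad-one-word-per-32 rule are standard assumptions about NVIDIA shared memory, not details taken from this slide:

```python
BANKS = 32  # shared-memory banks on recent NVIDIA GPUs (assumption)

def bank(index):
    """Shared memory is striped across banks word by word; two threads that
    touch different addresses in the same bank in one cycle conflict."""
    return index % BANKS

def padded(index):
    # Insert one unused word after every BANKS words so that power-of-two
    # strided accesses land in distinct banks (the padding optimization).
    return index + index // BANKS

# A stride-32 access pattern: all threads hit bank 0 without padding...
strided = [i * BANKS for i in range(8)]
print({bank(i) for i in strided})           # one bank: 8-way conflict
print({bank(padded(i)) for i in strided})   # 8 distinct banks: no conflict
```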

SLIDE 20

Optimization 5: Chunk Permutation

[Diagram: permuting the chunks processed across the sort steps, with padded cells]

Before: 17.8 ms → After: 16 ms

SLIDE 21

Evaluation


SLIDE 22

Setup

CPU: Intel i7, 16 cores, 64 GB main memory (60 GB/s)
GPU: Titan X, 12 GB GPU memory (260 GB/s)
CPU-GPU interconnect: 16 GB/s

SLIDE 23

Varying K

Input: 2^29 (half a billion) floats drawn from U(0,1).

SLIDE 24

Varying Distributions


SLIDE 25

Integration

Dataset: 250 million tweets, May 2017

SELECT id
FROM tweets
WHERE tweet_time < X
ORDER BY retweet_count DESC
LIMIT 50

4.5x faster.

SLIDE 26

Conclusion

Data analytics on GPUs is increasingly common, and Top-K on the GPU is non-trivial.

Bitonic Top-K: a novel Top-K algorithm for the GPU

  • Distribution independent
  • Best performing for K <= 256

Integrated into a real database: >4x performance improvement.