Efficient Top-K Query Processing on Massively Parallel Hardware - PowerPoint PPT Presentation



  1. Efficient Top-K Query Processing on Massively Parallel Hardware
     Anil Shanbhag, Holger Pirk, Sam Madden

  2. CPU vs. GPU
     • CPU: 16-24 cores, 60 GB/s to main memory (128-256 GB)
     • GPU: ~5000 cores, 250-900 GB/s to GPU memory (12-32 GB)
     • Link between main memory and GPU memory: 16-40 GB/s

  3. Top-K
     SELECT id FROM tweets WHERE tweet_time ∈ [X,Y] ORDER BY retweet_count + 0.5*likes_count DESC LIMIT K
     Typical K is 5-100.

  4. Top-K: Classic Sequential Algorithm
     Use a min-heap of size k to maintain the top-k items.
     (Figure: min-heap example; values shown: 7, 15, 20, 21, 32, 50, 40.)
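The sequential algorithm on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' code; the function name `top_k_min_heap` is hypothetical, and the input reuses the values visible in the slide's figure.

```python
import heapq

def top_k_min_heap(items, k):
    """Return the k largest items, largest first, using a min-heap of size k."""
    if k <= 0:
        return []
    heap = []  # heap[0] is always the smallest of the current top-k candidates
    for x in items:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            # x beats the weakest candidate: pop it and push x in one operation
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)

print(top_k_min_heap([15, 20, 21, 32, 50, 40, 7], 3))  # [50, 40, 32]
```

Each item costs at most one O(log k) heap operation, so the whole pass is O(n log k), but the data-dependent branch on `heap[0]` is what later causes trouble on a GPU.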

  5. Partition and Merge
     On a multi-core CPU: partition the data across cores, then merge the per-core results.
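A rough sketch of partition-and-merge, assuming each worker computes a local top-k and a final pass merges the partial results. The name `partition_and_merge_top_k` and the `num_partitions` parameter are hypothetical. Python threads are used only to show the structure; because of the interpreter lock they do not give the real parallel speedup a multi-core implementation would.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def partition_and_merge_top_k(items, k, num_partitions=4):
    """Split items into partitions, take a local top-k of each, then merge."""
    size = max(1, -(-len(items) // num_partitions))  # ceiling division
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        # Each worker returns at most k candidates from its own partition.
        partial = list(pool.map(lambda chunk: heapq.nlargest(k, chunk), chunks))
    # Merge step: the global top-k must be among the per-partition candidates.
    candidates = [x for part in partial for x in part]
    return heapq.nlargest(k, candidates)

print(partition_and_merge_top_k(list(range(100)), 5))  # [99, 98, 97, 96, 95]
```

The merge only has to look at `num_partitions * k` candidates, which is why this scheme works well when the number of cores is small.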

  6. On GPU
     This approach does not work well on the GPU execution model. Problems (within a warp of threads):
     • Significant thread divergence
     • Maintaining a heap of size k per thread limits performance

  7. Intuition
     • Sort: Heap Sort (sequential), Bitonic Sort (parallel)
     • Top-K: Priority Queue (sequential), ??? (parallel): Bitonic Top-K
     • Also shown: Sort + Top-K, Heap Per-Thread, Radix-Select, Bucket-Select

  8. Bitonic Top-K

  9. Bitonic Sequence
     Two monotonic sequences: a sequence $S = \langle a_0, a_1, a_2, \ldots, a_{n-1} \rangle$ such that
     • $a_0 \le a_1 \le \ldots \le a_k$
     • $a_{k+1} \ge a_{k+2} \ge \ldots \ge a_{n-1}$
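As a small illustration of the definition, the check below accepts a sequence that first rises and then falls. The helper name `is_bitonic` is hypothetical, and it covers only the rise-then-fall form written on the slide (a purely increasing or purely decreasing sequence also passes).

```python
def is_bitonic(seq):
    """True if seq is non-decreasing up to some index and non-increasing after it."""
    i, n = 0, len(seq)
    while i + 1 < n and seq[i] <= seq[i + 1]:  # walk up the rising part
        i += 1
    while i + 1 < n and seq[i] >= seq[i + 1]:  # walk down the falling part
        i += 1
    return i >= n - 1  # bitonic only if both walks together consumed everything

print(is_bitonic([1, 4, 6, 8, 7, 5, 3, 2]))  # True
print(is_bitonic([1, 3, 2, 4]))              # False
```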

  10. Bitonic Merge
      $S_1 = \langle \min(a_0, a_{n/2}), \min(a_1, a_{n/2+1}), \ldots, \min(a_{n/2-1}, a_{n-1}) \rangle$
      $S_2 = \langle \max(a_0, a_{n/2}), \max(a_1, a_{n/2+1}), \ldots, \max(a_{n/2-1}, a_{n-1}) \rangle$
      • $S_1$ and $S_2$ are both bitonic
      • $S_1 \le S_2$: every element in $S_1$ is no larger than any element of $S_2$
      • Apply recursively on $S_1$ and $S_2$
      • Sorting the entire sequence takes $\log(n)$ rounds
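The split into $S_1$ and $S_2$ and its recursive application can be sketched as follows. The function names are hypothetical, and the input is assumed to be a bitonic sequence whose length is a power of two.

```python
def bitonic_split(seq):
    """One round: elementwise min and max of the two halves of a bitonic sequence."""
    half = len(seq) // 2
    s1 = [min(seq[i], seq[i + half]) for i in range(half)]
    s2 = [max(seq[i], seq[i + half]) for i in range(half)]
    return s1, s2

def bitonic_merge(seq):
    """Sort a bitonic sequence in ascending order by splitting recursively."""
    if len(seq) <= 1:
        return list(seq)
    s1, s2 = bitonic_split(seq)
    # Everything in s1 is <= everything in s2, so the sorted halves concatenate.
    return bitonic_merge(s1) + bitonic_merge(s2)

s1, s2 = bitonic_split([1, 4, 6, 8, 7, 5, 3, 2])
print(s1, s2)                                   # [1, 4, 3, 2] [7, 5, 6, 8]
print(bitonic_merge([1, 4, 6, 8, 7, 5, 3, 2]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The recursion depth is log(n), which matches the number of rounds stated on the slide, and every comparison in a round is independent of the others.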

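  11. Bitonic Sort
      Complexity: $O(n \log^2 n)$
      (Figure: sorting network on 8 elements, indices 0-7. Phase 1 has one step with comparison distance 1; phase 2 has steps with distances 2, 1; phase 3 has steps with distances 4, 2, 1.)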
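The stated complexity follows from counting steps in the network. As a worked sketch, assume $n$ is a power of two, phase $p$ consists of $p$ steps (as in the figure), and every step performs $n/2$ independent compare-exchanges:

$$
\text{comparisons} \;=\; \frac{n}{2} \sum_{p=1}^{\log_2 n} p \;=\; \frac{n}{4}\,\log_2 n\,(\log_2 n + 1) \;=\; O(n \log^2 n)
$$

For the 8-element network in the figure this gives $1 + 2 + 3 = 6$ steps and $6 \cdot 4 = 24$ comparisons.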

  12. Bitonic Top-K
      Complexity: $O(n \log^2 k)$
      • Phase 1: Local Sort. Turn the unsorted sequence into sorted sequences of length k.
      • Phase 2: Merge. Merge neighboring sorted sequences of length k to select the largest k elements (a bitonic sequence).
      • Phase 3: Rebuild. Sort the bitonic sequence of length k.
      • Phases 2 and 3 repeat; when the list size = k, the result is the top-k.
      (Figure: finding the top-4 in 16 elements, indices 0-15, through P1: Local Sort, P2: Merge, P3: Rebuild, P2: Merge, P3: Rebuild.)
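The three phases can be sketched sequentially in Python to show the data flow. This is an illustrative sketch under stated assumptions, not the presenters' GPU implementation: the function names are hypothetical, `k` and `n / k` are assumed to be powers of two, and Python's `sorted` stands in for the bitonic sorting network in the local-sort phase.

```python
def rebuild(seq, ascending):
    """Phase 3: sort a bitonic sequence with recursive min/max splits."""
    if len(seq) <= 1:
        return list(seq)
    half = len(seq) // 2
    lo = [min(seq[i], seq[i + half]) for i in range(half)]
    hi = [max(seq[i], seq[i + half]) for i in range(half)]
    first, second = (lo, hi) if ascending else (hi, lo)
    return rebuild(first, ascending) + rebuild(second, ascending)

def bitonic_top_k(values, k):
    """Return the k largest values, largest first."""
    n = len(values)
    runs_count = n // k if k > 0 else 0
    assert k > 0 and k & (k - 1) == 0, "k must be a power of two"
    assert n >= k and n % k == 0 and runs_count & (runs_count - 1) == 0, \
        "n / k must be a power of two"
    # Phase 1: sorted runs of length k, alternating ascending / descending,
    # so that each neighboring pair of runs forms a bitonic sequence.
    runs = [sorted(values[i:i + k], reverse=(i // k) % 2 == 1)
            for i in range(0, n, k)]
    while len(runs) > 1:
        # Phase 2: the elementwise max of a neighboring pair keeps the largest
        # k of its 2k elements; the result is bitonic but not sorted.
        merged = [[max(a, b) for a, b in zip(runs[i], runs[i + 1])]
                  for i in range(0, len(runs), 2)]
        # Phase 3: re-sort each survivor run, alternating directions again.
        runs = [rebuild(m, ascending=(idx % 2 == 0))
                for idx, m in enumerate(merged)]
    return runs[0][::-1]  # the last run is ascending; report largest first

data = [5, 1, 8, 3, 9, 2, 7, 4, 16, 11, 14, 6, 13, 10, 15, 12]
print(bitonic_top_k(data, 4))  # [16, 15, 14, 13]
```

Every merge halves the amount of data, so the later rounds touch n/2, n/4, ... elements, and no step depends on the values themselves, only on positions.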

  13. On the GPU
      Simplest way to partition into kernels:
      • Each column has a kernel invocation
      • Each thread does 1 comparison
      • n/2 comparisons needed => n/2 threads launched
      Time to find top-32 in a sequence of size 2^29:
      • Naive: 521ms
      • Sort: 130ms
      • Final: 14.5ms
      • One Pass: 10ms
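The "one kernel per column, one comparison per thread" mapping can be simulated on the CPU as below. For simplicity the sketch runs the full bitonic sorting network rather than the top-k variant; the function names and the thread-index formula are assumptions made for illustration, and the inner `for` loop stands in for `n/2` GPU threads running in parallel.

```python
def compare_exchange_kernel(a, size, stride):
    """Simulate one kernel launch: n/2 'threads', one compare-exchange each."""
    for t in range(len(a) // 2):
        # Map thread id t to the lower index of its pair for this stride.
        i = 2 * stride * (t // stride) + (t % stride)
        j = i + stride
        ascending = (i & size) == 0  # direction of the run this pair belongs to
        if (a[i] > a[j]) == ascending:
            a[i], a[j] = a[j], a[i]

def naive_bitonic_sort(values):
    """Run the network one column at a time; return (sorted list, kernel launches)."""
    a = list(values)
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    launches = 0
    size = 2
    while size <= n:          # one phase per run length
        stride = size // 2
        while stride >= 1:    # one kernel launch per column of the network
            compare_exchange_kernel(a, size, stride)
            launches += 1
            stride //= 2
        size *= 2
    return a, launches

print(naive_bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))  # ([1, 2, 3, 4, 5, 6, 7, 8], 6)
```

In this naive mapping every launch reads and writes the whole array, which is consistent with the slide's point that the simplest partitioning is far from the one-pass time.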

  14. Optimizations

  15. Optimizations
      (Figure: GPU memory hierarchy.)
      • On chip: each SM (SM-1, SM-2, ..., SM-N) has registers and L1 / shared memory (SMEM); shared memory up to 3.5 TBps; L2 cache
      • Off chip: global memory, 260 GBps

  16. Optimization 1: Using Shared Memory
      For a thread block with T threads, load 2T elements into shared memory.
      (Figure: global memory access vs. shared memory access; chart of time to find top-32 in a sequence of size 2^29, values not recoverable.)

  17. Optimization 2: Combining Phases
      Instead of loading 2T, let's load 8T elements and combine the 5 phases.
      Shared memory bandwidth bound.
      (Figure: global memory access vs. shared memory access; timing chart, values not recoverable.)

  18. Optimization 3: Combining Steps
      Shared memory access: one step at a time vs. three steps at a time.
      (Figure: 16-element network, indices 0-15, steps with distances 4, 2, 1; timing chart, values not recoverable.)

  19. Optimization 4: Padding
      (Figure: addresses 0-15 mapped to memory banks 0-7, before and after padding. After padding, unused cells (X) are inserted between data elements. Legend: thread access, unused cell. Timing chart, values not recoverable.)
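Why padding helps can be shown with a toy bank model. Everything below is an assumption for illustration rather than the scheme in the slide's figure: addresses map to banks by `address % NUM_BANKS`, one unused cell is inserted after every `NUM_BANKS` elements, and the access pattern is a simple stride of `NUM_BANKS`.

```python
from collections import Counter

NUM_BANKS = 8  # the figure shows banks 0-7; real GPUs typically have 32 banks

def bank(address):
    """Bank that serves a shared-memory address in this toy model."""
    return address % NUM_BANKS

def padded(address):
    """Shift an address past one unused cell per NUM_BANKS elements (assumed scheme)."""
    return address + address // NUM_BANKS

def max_conflict(addresses):
    """Largest number of simultaneous accesses that hit the same bank."""
    return max(Counter(bank(a) for a in addresses).values())

# Eight threads, each reading an address NUM_BANKS apart.
strided = [t * NUM_BANKS for t in range(NUM_BANKS)]
print(max_conflict(strided))                       # 8: every access hits bank 0
print(max_conflict([padded(a) for a in strided]))  # 1: accesses spread over all banks
```

The cost is a small amount of wasted shared memory (one cell per `NUM_BANKS` in this model) in exchange for accesses that no longer serialize on a single bank.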

  20. Optimization 5: Chunk Permutation
      • Before: 17.8ms
      • After: 16ms
      (Figure: 16-element network, indices 0-15, with steps 1; 2, 1; 4, 2, 1 and the chunk layout across banks 0-7 including padded cells.)

  21. Evaluation

  22. Setup
      • CPU: Intel i7, 16 cores, 60 GB/s to main memory (64 GB)
      • GPU: Titan X, 260 GB/s to GPU memory (12 GB)
      • Link between main memory and GPU memory: 16 GB/s

  23. Varying K
      For 2^29 (1/2 billion) floats from U(0,1)

  24. Varying Distributions

  25. Integration
      Dataset: 250 million tweets, May 2017
      SELECT id FROM tweets WHERE tweet_time < X ORDER BY retweet_count DESC LIMIT 50
      Result: 4.5x faster

  26. Conclusion
      • Data analytics on GPUs is increasingly common, and Top-K on GPU is non-trivial
      • Bitonic Top-K: a novel Top-K algorithm for GPU
        ◦ Distribution independent
        ◦ Best performing for K <= 256
      • Integrated into a real database: >4x performance improvement
