
GPU Multisplit (Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens)



  1. GPU Multisplit
     Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens
     S. Ashkiani (UC Davis), GPU Multisplit, GTC 2016

  2. Motivating Simple Example: Compaction
     - Compaction (i.e., binary split)
     - Traditional approach: Flags → Scan → Shuffle
     - e.g., splitter: 10

     input keys    25  12   4  76   7  17   6   1
     compacted      4   7   6   1

  3. Motivating Simple Example: Compaction
     - Compaction (i.e., binary split)
     - Traditional approach: Flags → Scan → Shuffle
     - e.g., splitter: 10
     - Other option: split input into two buckets

     input keys    25  12   4  76   7  17   6   1
     buckets        1   1   0   1   0   1   0   0
     output keys    4   7   6   1  25  12  76  17
     buckets        0   0   0   0   1   1   1   1

  4. Motivating Simple Example: Compaction
     - Compaction (i.e., binary split)
     - Traditional approach: Flags → Scan → Shuffle
     - e.g., splitter: 10
     - Other option: split input into two buckets
     - Can also be solved by sorting keys
       - Not always possible
       - Loses "stability", i.e., initial order within buckets not preserved

     input keys    25  12   4  76   7  17   6   1
     buckets        1   1   0   1   0   1   0   0
     output keys    4   7   6   1  25  12  76  17
     buckets        0   0   0   0   1   1   1   1
     sorted keys    1   4   6   7  12  17  25  76
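
The Flags → Scan → Shuffle pipeline on this slide can be sketched sequentially as follows. This is only an illustrative Python model, not the GPU implementation: the function name `compact` is hypothetical, and the strict `<` comparison against the splitter is an assumption (the slide's example has no key equal to 10).

```python
from itertools import accumulate

def compact(keys, splitter):
    """Keep the keys below `splitter`, preserving their input order."""
    # Flags: 1 marks a key that survives compaction (assumed test: key < splitter).
    flags = [1 if k < splitter else 0 for k in keys]
    # Scan: exclusive prefix sum, i.e. how many surviving keys precede position i.
    offsets = [0] + list(accumulate(flags))[:-1]
    # Shuffle: scatter each surviving key to its scanned offset.
    out = [None] * sum(flags)
    for key, flag, offset in zip(keys, flags, offsets):
        if flag:
            out[offset] = key
    return out

print(compact([25, 12, 4, 76, 7, 17, 6, 1], 10))  # [4, 7, 6, 1]
```

On a GPU each of the three steps would run in parallel; the scan is the only step that needs communication between elements.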

  5. What is Multisplit?
     - Multisplit (generalization of binary split)
     - Let's try multiple buckets
     - e.g., splitters: 10 and 20

     input keys    25  17   4  76   7  12   6   1
     buckets        2   1   0   2   0   1   0   0
     output keys    4   7   6   1  17  12  25  76
     buckets        0   0   0   0   1   1   2   2

  6. What is Multisplit?
     - Multisplit (generalization of binary split)
     - Let's try multiple buckets
     - e.g., splitters: 10 and 20
     - Can also be solved by sorting keys

     input keys    25  17   4  76   7  12   6   1
     buckets        2   1   0   2   0   1   0   0
     output keys    4   7   6   1  17  12  25  76
     buckets        0   0   0   0   1   1   2   2
     sorted keys    1   4   6   7  12  17  25  76
     buckets        0   0   0   0   1   1   2   2

  7. Multisplit primitive
     - Input:
       - unordered set of keys (or key-value pairs)
       - m, number of buckets
       - a user-specified function to identify buckets for each key
     - Output:
       - keys (or key-value pairs) separated into m buckets

     [Figure: output array partitioned into contiguous buckets B_0, B_1, B_2, B_3]
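
As a reference for what the primitive computes, here is a minimal sequential Python sketch. The names `multisplit` and `two_splitters` are hypothetical, and the bucket function assumes strict `<` comparisons against the splitters 10 and 20 from the earlier example; it models only the input/output contract, not the GPU algorithm.

```python
def multisplit(keys, m, bucket_of):
    """Stable multisplit reference: group keys by bucket, keeping input order inside each bucket."""
    buckets = [[] for _ in range(m)]
    for key in keys:
        buckets[bucket_of(key)].append(key)
    # Concatenate B_0, B_1, ..., B_{m-1}.
    return [key for bucket in buckets for key in bucket]

def two_splitters(key):
    """User-specified bucket function for splitters 10 and 20 (assumed strict comparisons)."""
    if key < 10:
        return 0
    return 1 if key < 20 else 2

print(multisplit([25, 17, 4, 76, 7, 12, 6, 1], 3, two_splitters))
# [4, 7, 6, 1, 17, 12, 25, 76]
```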

  8. A Fast and Flexible Data-Organization Primitive
     - characterizing key-value pairs into buckets
     - General load balancing
     - Priority queues
     - Single Source Shortest Path (SSSP)
       - Serial (Dijkstra): processing the vertex with the lowest weight
       - Bellman-Ford-Moore → all vertices in parallel
       - delta-stepping formulation of SSSP [Davidson et al., 2014]
         - classifying vertices into buckets by their weights
         - processing the lowest weights in parallel
         - But no multisplit primitive → used radix-sort instead
         - By using our own multisplit → 2.1x faster
     - other applications
       - colored prefix-sum
       - reorganizing into 8 direction-based buckets in GPU-based ray tracers [Yang et al., 2013]
       - first step in building GPU hash tables [Alcantara et al., 2009]
       - in the shallow stages of k-d tree construction [Wu et al., 2011]

  9. Common Approaches:
     1. Recursive scan-based split
        - ⌈log(m)⌉ rounds of binary splits

     Buckets: B_0 = {i ≤ 40}, B_1 = {i > 40}

     index                          0   1   2   3   4   5   6   7
     initial keys                  59  46  31   6  24  82   3  17
     B_0 flags                      0   0   1   1   1   0   1   1
     exclusive scan                 0   0   0   1   2   3   3   4
     B_1 flags                      1   1   0   0   0   1   0   0
     right-to-left exclusive scan   2   1   1   1   1   0   0   0
     output keys                   31   6  24   3  17  59  46  82
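
One round of the scan-based split in this example can be modeled sequentially as below. This is a hedged Python sketch: the helper names are hypothetical, and the destination rule for B_1 keys (counting back from the end of the array using the right-to-left scan) is an assumption about how the two scans are combined, since the slide shows only the scans themselves.

```python
from itertools import accumulate

def exclusive_scan(xs):
    """Exclusive prefix sum: result[i] = sum of xs[0:i]."""
    return ([0] + list(accumulate(xs)))[:len(xs)]

def binary_split(keys, in_b0):
    n = len(keys)
    f0 = [1 if in_b0(k) else 0 for k in keys]   # B_0 flags
    f1 = [1 - f for f in f0]                    # B_1 flags
    left = exclusive_scan(f0)                   # B_0 keys before position i
    right = exclusive_scan(f1[::-1])[::-1]      # B_1 keys after position i (right-to-left scan)
    out = [None] * n
    for i, key in enumerate(keys):
        # Assumed addressing: B_0 fills from the front, B_1 from the back.
        dest = left[i] if f0[i] else n - 1 - right[i]
        out[dest] = key
    return out

keys = [59, 46, 31, 6, 24, 82, 3, 17]
print(binary_split(keys, lambda k: k <= 40))
# [31, 6, 24, 3, 17, 59, 46, 82]
```

Both buckets keep their input order under this addressing, so the round is stable.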

  10. Common Approaches:
      1. Recursive scan-based split
         - ⌈log(m)⌉ rounds of binary splits
      2. Radix sort
         - sorting keys
         - overkill (sorted within buckets)
         - initial order is not preserved

      index                  0        1        2        3        4        5        6        7
      initial keys          59       46       31        6       24       82        3       17
      binary representation 0111011  0101110  0011111  0000110  0011000  1010010  0000011  0010001
      sorted keys (≤ 7 splits)
                             3        6       17       24       31       46       59       82

  11. Common Approaches:
      1. Recursive scan-based split
         - ⌈log(m)⌉ rounds of binary splits
      2. Radix sort
         - sorting keys
         - overkill (sorted within buckets)
         - initial order is not preserved
      3. Reduced bit sort
         - sort (bucket ID, keys)
         - ⌈log m⌉-bit bucket IDs
         - key-value sort

      index                    0   1   2   3   4   5   6   7
      initial keys            59  46  31   6  24  82   3  17
      new keys (bucket IDs)    1   1   0   0   0   1   0   0
      new values (old keys)   59  46  31   6  24  82   3  17
      [Figure: output array after the key-value sort]
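
The reduced-bit-sort idea (label each key with its bucket ID, then run a key-value sort on the short labels) can be sketched as follows. Python's built-in `sorted` is used here only as a stand-in for a GPU key-value radix sort over ⌈log m⌉ bits; the function name `reduced_bit_sort` is hypothetical.

```python
def reduced_bit_sort(keys, bucket_of):
    """Multisplit via a key-value sort on bucket IDs (sequential stand-in)."""
    bucket_ids = [bucket_of(k) for k in keys]   # new keys: small bucket IDs
    pairs = zip(bucket_ids, keys)               # new values: the original keys
    # Sort on the bucket ID only. sorted() is stable, so ties keep input order.
    ordered = sorted(pairs, key=lambda pair: pair[0])
    return [key for _, key in ordered]

keys = [59, 46, 31, 6, 24, 82, 3, 17]
print(reduced_bit_sort(keys, lambda k: 0 if k <= 40 else 1))
# [31, 6, 24, 3, 17, 59, 46, 82]
```

The extra cost relative to a direct multisplit is that the labels must be generated and the original keys carried along as values.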

  12. Common Approaches:
      1. Recursive scan-based split
         - ⌈log(m)⌉ rounds of binary splits
      2. Radix sort
         - sorting keys
         - overkill (sorted within buckets)
         - initial order is not preserved
      3. Reduced bit sort
         - sort (bucket ID, keys)
         - ⌈log m⌉-bit bucket IDs
      4. Randomized insertions
         - a PRAM algorithm
         - large buffers for buckets
         - initial order is not preserved

      [Figure: the initial keys are randomly inserted into oversized buffers, B_0 = (17, 24, 3, 31, 6) and B_1 = (82, 46, 59), which are then compacted into the output array]

  13. Designing an Efficient Approach
      - Stable Multisplit → unique permutation + data movement
      1. Deriving all permutations → global computations
         - histogram (h_0, ..., h_{m−1})
         - key order per bucket

      $$u_i \in B_j \;\Rightarrow\; p(i) = \underbrace{\sum_{k=0}^{j-1} h_k}_{\text{number of keys in previous buckets}} + \underbrace{\left|\{u_r : u_r \in B_j,\ r < i\}\right|}_{\text{number of keys before me in my own bucket}}$$

      [Figure: output array partitioned into buckets B_0, B_1, B_2, B_3]

  14. Designing an Efficient Approach
      - Stable Multisplit → unique permutation + data movement
      1. Deriving all permutations → global computations
         - histogram (h_0, ..., h_{m−1})
         - key order per bucket
      2. Final data movements → global random scatters

      [Figure: output array partitioned into buckets B_0, B_1, B_2, B_3]
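
The two steps (derive the permutation from a histogram plus per-bucket key order, then scatter) can be written out directly. The sketch below is a sequential Python model with hypothetical function names; it computes, for a key in bucket j, the number of keys in buckets 0..j−1 plus the number of earlier keys in bucket j.

```python
from itertools import accumulate

def multisplit_permutation(bucket_ids, m):
    """Return p, where p[i] is the output position of the i-th input key."""
    histogram = [0] * m
    rank_in_bucket = []
    for b in bucket_ids:
        rank_in_bucket.append(histogram[b])   # keys of bucket b seen before position i
        histogram[b] += 1
    # Exclusive scan of the histogram: number of keys in all previous buckets.
    bucket_start = ([0] + list(accumulate(histogram)))[:m]
    return [bucket_start[b] + r for b, r in zip(bucket_ids, rank_in_bucket)]

def scatter(keys, perm):
    """Final data movement: write each key to its permuted position."""
    out = [None] * len(keys)
    for key, dest in zip(keys, perm):
        out[dest] = key
    return out

keys = [25, 17, 4, 76, 7, 12, 6, 1]
perm = multisplit_permutation([2, 1, 0, 2, 0, 1, 0, 0], 3)
print(perm)                 # [6, 4, 0, 7, 1, 5, 2, 3]
print(scatter(keys, perm))  # [4, 7, 6, 1, 17, 12, 25, 76]
```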

  15. Our high level ideas
      1. Global computations
         - Localize computations
           - several large enough local subproblems → local histograms
           - a single small enough global computation → global histogram
           - several large enough local subproblems → permutations + scatters
         - Avoid shared memory and synchronization: utilize intrinsics

      [Figure: pipeline of Pre-scan (local) → Scan (global) → Post-scan (local)]

  16. Our high level ideas
      1. Global computations
         - Localize computations
           - several large enough local subproblems → local histograms
           - a single small enough global computation → global histogram
           - several large enough local subproblems → permutations + scatters
         - Avoid shared memory and synchronization: utilize intrinsics
      2. Global random scatters
         - Reordering keys locally in the last stage → local multisplits
         - more computational cost but better memory access pattern (coalesced writes)
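
The localize-then-combine structure (local histograms, one global scan, local permutations and scatters) can be modeled sequentially as below. This is a sketch under assumptions: the function name, the `sub` parameter (subproblem size), and the use of plain Python lists are all hypothetical, and the local reordering step from item 2 is not modeled. The global scan runs over the counts in bucket-major order, so every subproblem's share of bucket 0 comes before any share of bucket 1.

```python
from itertools import accumulate

def localized_multisplit(keys, m, bucket_of, sub=4):
    chunks = [keys[s:s + sub] for s in range(0, len(keys), sub)]
    num = len(chunks)
    # Pre-scan (local): one small histogram per subproblem.
    local_hist = []
    for chunk in chunks:
        h = [0] * m
        for key in chunk:
            h[bucket_of(key)] += 1
        local_hist.append(h)
    # Scan (global): exclusive scan over m * num counts, bucket-major.
    flat = [local_hist[c][b] for b in range(m) for c in range(num)]
    base = ([0] + list(accumulate(flat)))[:len(flat)]
    # Post-scan (local): each subproblem ranks its own keys and scatters them.
    out = [None] * len(keys)
    for c, chunk in enumerate(chunks):
        seen = [0] * m
        for key in chunk:
            b = bucket_of(key)
            out[base[b * num + c] + seen[b]] = key
            seen[b] += 1
    return out

bucket = lambda k: 0 if k < 10 else (1 if k < 20 else 2)
print(localized_multisplit([25, 17, 4, 76, 7, 12, 6, 1], 3, bucket))
# [4, 7, 6, 1, 17, 12, 25, 76]
```

Only the middle step touches data from more than one subproblem, and its size depends on m and the number of subproblems rather than on the number of keys.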

  17. Granularity Tradeoffs
      We experimented with a couple of different subproblem granularities:
      1. Warp
         - warp-synchronous model with minimal warp divergence
         - fast communication via warp-wide ballot/shuffles
      2. Block
         - more expensive communication via shared memory
         - cheaper global computation (scan over m × N_blocks)
         - more locality to extract after reordering

      Property                  Direct MS   Warp-level MS          Block-level MS
      Subproblem                warp        warp                   block
      Reordering                –           warp-wide reordering   block-wide reordering
      Computational load        low         medium                 high
      Coalesced memory access   low         medium                 high

  18. Implementation details & Optimizations
      Warp-level MS
      1. Pre-scan (Local):
         - read keys
         - warp histogram
           - bit-by-bit balloting
           - ⌈log m⌉ rounds
         - store warp histogram

      [Figure: one histogram per warp]
      warp 0:     h_{0,0}    h_{1,0}    ...  h_{m−1,0}
      warp 1:     h_{0,1}    h_{1,1}    ...  h_{m−1,1}
      ...
      warp L−1:   h_{0,L−1}  h_{1,L−1}  ...  h_{m−1,L−1}

      procedure warp_histogram(bucket_id[0:31])
          Input: bucket_id[0:31]    Output: histo[0:m-1]
          for each thread i = 0:31 parallel warp do
              histo_bmp[i] = 0xFFFFFFFF;
              for (int k = 0; k < ceil(log2(m)); k++) do
                  temp_buffer = ballot(bucket_id[i] & 0x01);
                  if ((i >> k) & 0x01) then
                      histo_bmp[i] &= temp_buffer;
                  else
                      histo_bmp[i] &= XOR(0xFFFFFFFF, temp_buffer);
                  end if
                  bucket_id[i] >>= 1;
              end for
              histo[i] = popc(histo_bmp[i]);
          end for
          return histo[0:m-1];
      end procedure
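
To check the balloting logic outside a GPU, the warp histogram procedure can be emulated sequentially. In this Python sketch, `ballot` is a hypothetical emulation of a warp-wide vote (bit j of the result is set when lane j's predicate is true), and the loop over `i` stands in for the 32 lanes running in lockstep. It assumes m ≤ 32, so that lane i can own the count for bucket i.

```python
WARP = 32

def ballot(predicates):
    """Emulated warp vote: pack one predicate per lane into a 32-bit mask."""
    mask = 0
    for lane, p in enumerate(predicates):
        if p:
            mask |= 1 << lane
    return mask

def warp_histogram(bucket_id, m):
    assert len(bucket_id) == WARP and 1 <= m <= WARP
    ids = list(bucket_id)
    histo_bmp = [0xFFFFFFFF] * WARP      # lane i starts by matching every lane
    rounds = (m - 1).bit_length()        # equals ceil(log2(m)) for m >= 1
    for k in range(rounds):
        vote = ballot(b & 1 for b in ids)            # lanes whose current low bit is 1
        for i in range(WARP):
            if (i >> k) & 1:
                histo_bmp[i] &= vote                 # keep lanes that agree with bit k of i
            else:
                histo_bmp[i] &= 0xFFFFFFFF ^ vote
        ids = [b >> 1 for b in ids]                  # expose the next bit
    # Population count: lane i now holds a bitmap of the lanes in bucket i.
    return [bin(histo_bmp[i]).count("1") for i in range(m)]

print(warp_histogram([lane % 3 for lane in range(WARP)], 3))  # [11, 11, 10]
```

After ⌈log m⌉ rounds, lane i's bitmap marks exactly the lanes whose bucket ID equals i, so a single population count yields the bucket's size with no shared memory.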

  19. Implementation details & Optimizations
      Warp-level MS
      1. Pre-scan (Local):
         - read keys
         - warp histogram
           - bit-by-bit balloting
           - ⌈log m⌉ rounds
         - store warp histogram
      2. Scan (Global): exclusive scan on histograms
         - Scan m × N_warps elements

      [Figure: one histogram per warp]
      warp 0:     h_{0,0}    h_{1,0}    ...  h_{m−1,0}
      warp 1:     h_{0,1}    h_{1,1}    ...  h_{m−1,1}
      ...
      warp L−1:   h_{0,L−1}  h_{1,L−1}  ...  h_{m−1,L−1}

      [Figure: scan input, stored bucket by bucket]
      h_{0,0} h_{0,1} ... h_{0,L−1} | h_{1,0} h_{1,1} ... h_{1,L−1} | ... | h_{m−1,0} h_{m−1,1} ... h_{m−1,L−1}
