GPU Multisplit
Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens
- S. Ashkiani (UC Davis)
GPU Multisplit GTC 2016 1 / 16
Motivating Simple Example: Compaction

Compaction (i.e., binary split)
Traditional approach: Flags + Scan
[Figure: multisplit distributes the input keys into buckets B0, B1, B2, B3.]
1 Recursive scan-based split
Initial keys: 31 24 3 17 82 59 46 6, with B0 = {i ≤ 40} and B1 = {i > 40}.
[Figure: an exclusive scan of the B0 flags (left to right) and of the B1 flags (right to left) gives each key's destination.]
2 Radix sort
Initial keys: 31 24 3 17 82 59 46 6, i.e., in binary:
0011111 0011000 0000011 0010001 1010010 0111011 0101110 0000110
[Figure: radix sort orders the keys over all key bits, more work than bucketing requires.]
3 Reduced bit sort
Initial keys: 31 24 3 17 82 59 46 6.
[Figure: keys are sorted on their bucket IDs only, rather than on all key bits.]
4 Randomized insertions
Initial keys: 31 24 3 17 82 59 46 6; final keys: 31 24 6 3 17 46 59 82 (each bucket's keys end up contiguous, but their order within a bucket is arbitrary).
1 Deriving all permutations → global computations
2 Final data movements → global random scatters
[Figure: keys scattered into buckets B0, B1, B2, B3.]
1 Global computations: split into Pre-scan (Local), Scan (Global), Post-scan (Local)
2 Global random scatters
1 Pre-scan (Local):
Each of the L subproblems computes a local per-bucket histogram, giving counts h_{i,j} for bucket i in subproblem j:
h_{0,0} h_{1,0} · · · h_{m−1,0}   h_{0,1} h_{1,1} · · · h_{m−1,1}   · · ·   h_{0,L−1} h_{1,L−1} · · · h_{m−1,L−1}
procedure warp_histogram(bucket_id[0:31])
    Input:  bucket_id[0:31]    Output: histo[0:m-1]
    for each thread i = 0:31 in parallel within the warp do
        histo_bmp[i] = 0xFFFFFFFF;
        for (int k = 0; k < ceil(log2(m)); k++) do
            temp_buffer = ballot(bucket_id[i] & 0x01);
            if ((i >> k) & 0x01) then
                histo_bmp[i] &= temp_buffer;
            else
                histo_bmp[i] &= ~temp_buffer;
            end if
            bucket_id[i] >>= 1;
        end for
        histo[i] = popc(histo_bmp[i]);
    end for
    return histo[0:m-1];
end procedure
1 Pre-scan (Local): per-subproblem histograms h_{i,j} (bucket i, subproblem j):
h_{0,0} h_{1,0} · · · h_{m−1,0}   h_{0,1} h_{1,1} · · · h_{m−1,1}   · · ·   h_{0,L−1} h_{1,L−1} · · · h_{m−1,L−1}
2 Scan (Global): the histograms are rearranged bucket-major and exclusively scanned across all subproblems:
h_{0,0} h_{0,1} · · · h_{0,L−1}   h_{1,0} h_{1,1} · · · h_{1,L−1}   · · ·   h_{m−1,0} h_{m−1,1} · · · h_{m−1,L−1}
3 Post-scan (Local): each subproblem uses the scanned counts as base offsets to scatter its own keys into their final positions.
[Figure: average running time (msec) vs. number of buckets m for Warp-level MS, Block-level MS, Direct MS, and Reduced-bit sort; two panels.]
[Figure: speedup vs. number of buckets (m = 2 to 32) for Warp-level MS, Block-level MS, Direct MS, Reduced-bit sort, and Binary Split; two panels.]
[Figure: average running time (msec) vs. number of buckets (m = 32 to 65536) comparing Block-level MS and Reduced-bit sort against radix sort, key-only and key-value.]