GPU Multisplit. Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens. (PowerPoint presentation)



SLIDE 1

GPU Multisplit

Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens

  • S. Ashkiani (UC Davis)

GPU Multisplit GTC 2016 1 / 16

SLIDE 2

Motivating Simple Example: Compaction

Compaction (i.e., binary split)

Traditional approach: Flags → Scan → Shuffle

e.g., splitter: 10

input keys: 25 12  4 76  7 17  6  1
compacted:   4  7  6  1
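The flags → scan → shuffle pipeline above can be sketched sequentially. This is an illustrative Python sketch (function and variable names are mine, not from the talk), assuming keys below the splitter are kept:

```python
def compact(keys, splitter):
    # Flags: mark the keys to keep (here: keys below the splitter)
    flags = [1 if k < splitter else 0 for k in keys]
    # Scan: an exclusive prefix sum of the flags gives each kept key its slot
    offsets, total = [], 0
    for f in flags:
        offsets.append(total)
        total += f
    # Shuffle: scatter the kept keys to their slots
    out = [0] * total
    for i, k in enumerate(keys):
        if flags[i]:
            out[offsets[i]] = k
    return out

# The slide's example (splitter 10):
# compact([25, 12, 4, 76, 7, 17, 6, 1], 10) → [4, 7, 6, 1]
```

On the GPU each of the three stages is a parallel pass; the exclusive scan is the only step that needs global communication.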

SLIDE 3

Motivating Simple Example: Compaction

Compaction (i.e., binary split)

Traditional approach: Flags → Scan → Shuffle

e.g., splitter: 10

Other option: split input into two buckets

input keys:  25 12  4 76  7 17  6  1
buckets:      1  1  0  1  0  1  0  0

output keys:  4  7  6  1 25 12 76 17
buckets:      0  0  0  0  1  1  1  1

SLIDE 4

Motivating Simple Example: Compaction

Compaction (i.e., binary split)

Traditional approach: Flags → Scan → Shuffle

e.g., splitter: 10

Other option: split the input into two buckets

Can also be solved by sorting the keys:

  • Not always possible
  • Loses “stability”, i.e., the initial order within buckets is not preserved

input keys:  25 12  4 76  7 17  6  1
buckets:      1  1  0  1  0  1  0  0

output keys:  4  7  6  1 25 12 76 17
buckets:      0  0  0  0  1  1  1  1

sorted keys:  1  4  6  7 12 17 25 76
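The stability point is easy to check sequentially (a hypothetical Python snippet, not from the talk): sorting also separates the two buckets, but it reorders keys within them, while the stable split keeps the input order.

```python
keys = [25, 12, 4, 76, 7, 17, 6, 1]

# Stable two-way split (splitter 10): order within each bucket is preserved
stable = [k for k in keys if k < 10] + [k for k in keys if k >= 10]

# Sorting also separates the buckets, but destroys the original order
sorted_keys = sorted(keys)

print(stable)       # [4, 7, 6, 1, 25, 12, 76, 17]
print(sorted_keys)  # [1, 4, 6, 7, 12, 17, 25, 76]
```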

SLIDE 5

What is Multisplit?

Multisplit (generalization of binary split)

Let’s try multiple buckets

e.g., splitters: 10 and 20

input keys:  25 17  4 76  7 12  6  1
buckets:      2  1  0  2  0  1  0  0

output keys:  4  7  6  1 17 12 25 76
buckets:      0  0  0  0  1  1  2  2

SLIDE 6

What is Multisplit?

Multisplit (generalization of binary split)

Let’s try multiple buckets

e.g., splitters: 10 and 20

Can also be solved by sorting keys

input keys:  25 17  4 76  7 12  6  1
buckets:      2  1  0  2  0  1  0  0

output keys:  4  7  6  1 17 12 25 76
buckets:      0  0  0  0  1  1  2  2

sorted keys:  1  4  6  7 12 17 25 76
buckets:      0  0  0  0  1  1  2  2

SLIDE 7

Multisplit primitive

Input:

  • an unordered set of keys (or key-value pairs)
  • m, the number of buckets
  • a user-specified function that identifies the bucket for each key

Output:

keys (or key-value pairs) separated into m buckets

(Figure: output keys arranged contiguously into buckets B0 B1 B2 B3)
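A minimal sequential reference for the primitive (a Python sketch with names of my choosing; the actual implementation is a parallel GPU algorithm):

```python
def multisplit(keys, m, bucket_of):
    """Stable multisplit: separate keys into m buckets, preserving
    the input order within each bucket."""
    buckets = [[] for _ in range(m)]
    for k in keys:
        buckets[bucket_of(k)].append(k)  # appending keeps the split stable
    return [k for b in buckets for k in b]

# Slide 5's example: splitters 10 and 20 give three buckets
bucket_of = lambda k: 0 if k <= 10 else (1 if k <= 20 else 2)
# multisplit([25, 17, 4, 76, 7, 12, 6, 1], 3, bucket_of)
#   → [4, 7, 6, 1, 17, 12, 25, 76]
```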

SLIDE 8

A Fast and Flexible Data-Organization Primitive

Characterizing keys (or key-value pairs) into buckets:

  • General load balancing
  • Priority queues

Single Source Shortest Path (SSSP):

  • Serial (Dijkstra): process the vertex with the lowest weight
  • Bellman-Ford-Moore: process all vertices in parallel
  • Delta-stepping formulation of SSSP [Davidson et al., 2014]: classifies
    vertices into buckets by their weights and processes the lowest weights in
    parallel; with no multisplit primitive available it used radix sort
    instead; using our own multisplit makes it 2.1x faster

Other applications:

  • colored prefix-sum
  • reorganizing rays into 8 direction-based buckets in GPU-based ray tracers [Yang et al., 2013]
  • the first step in building GPU hash tables [Alcantara et al., 2009]
  • the shallow stages of k-d tree construction [Wu et al., 2011]

SLIDE 9

Common Approaches:

1 Recursive scan-based split

⌈log(m)⌉ rounds of binary splits

(Figure: initial keys 31 24 3 17 82 59 46 6 with B0 = {i ≤ 40} and B1 = {i > 40}; a left-to-right exclusive scan places the B0 keys and a right-to-left exclusive scan places the B1 keys.)
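The recursion can be sketched sequentially: each round performs one stable binary split on one bit of the bucket ID, so ⌈log₂ m⌉ rounds separate all m buckets. An illustrative Python sketch (my own naming):

```python
def scan_based_multisplit(keys, m, bucket_of):
    # ceil(log2(m)) rounds; round k splits stably on bit k of the bucket ID
    rounds = max(1, (m - 1).bit_length())
    for k in range(rounds):
        zeros = [x for x in keys if not (bucket_of(x) >> k) & 1]
        ones = [x for x in keys if (bucket_of(x) >> k) & 1]
        keys = zeros + ones  # one stable binary split (flags, scan, shuffle)
    return keys
```

Each `zeros + ones` line stands in for one flags → scan → shuffle pass over the whole input, which is why this approach costs ⌈log m⌉ full passes.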

SLIDE 10

Common Approaches:

1 Recursive scan-based split

⌈log(m)⌉ rounds of binary splits

2 Radix sort

sorting keys

  • Overkill (keys end up sorted within buckets)

initial order is not preserved

(Figure: the initial keys 31 24 3 17 82 59 46 6 shown in 7-bit binary representation; sorting them takes up to 7 rounds of binary splits.)

SLIDE 11

Common Approaches:

1 Recursive scan-based split

⌈log(m)⌉ rounds of binary splits

2 Radix sort

sorting keys

  • Overkill (keys end up sorted within buckets)

initial order is not preserved

3 Reduced bit sort

sort (bucket ID, key) pairs, using the ⌈log m⌉-bit bucket IDs as the sort keys

(Figure: the bucket IDs become the new keys and the original keys 31 24 3 17 82 59 46 6 become the values; a key-value sort on the ⌈log m⌉-bit bucket IDs yields the split.)
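Sequentially, the reduced-bit sort idea amounts to a stable sort keyed on the bucket ID alone (an illustrative Python sketch; on the GPU this is a radix sort over only ⌈log m⌉ bits instead of the full key width):

```python
def reduced_bit_sort(keys, bucket_of):
    # Python's sort is stable, so sorting on the bucket ID alone
    # reproduces a stable multisplit without comparing full keys
    return sorted(keys, key=bucket_of)
```

Because only ⌈log m⌉ bits are sorted, this is much cheaper than a full radix sort when m is small, and stability falls out of the stable sort.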

SLIDE 12

Common Approaches:

1 Recursive scan-based split

⌈log(m)⌉ rounds of binary splits

2 Radix sort

sorting keys

  • Overkill (keys end up sorted within buckets)

initial order is not preserved

3 Reduced bit sort

sort (bucket ID, key) pairs, using the ⌈log m⌉-bit bucket IDs as the sort keys

4 Randomized insertions

  • a PRAM algorithm
  • large buffers for buckets
  • random insertions
  • initial order is not preserved

(Figure: the keys 31 24 3 17 82 59 46 6 are randomly inserted into oversized buffers for B0 and B1, then compacted.)

SLIDE 13

Designing an Efficient Approach

Stable Multisplit → unique permutation + data movement

1 Deriving all permutations → global computations

histogram (h0, . . . , hm−1)
key order per bucket

If ui ∈ Bj, then

    p(i) = (h0 + h1 + · · · + h(j−1)) + |{ur : ur ∈ Bj, r < i}|

i.e., (number of keys in all previous buckets) + (number of keys before ui within its own bucket).

(Figure: output buckets B0 B1 B2 B3)
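The permutation formula can be evaluated directly from one histogram pass plus an exclusive scan. A sequential Python sketch (names are mine):

```python
def stable_permutation(keys, m, bucket_of):
    # Histogram: h[j] = number of keys in bucket j
    h = [0] * m
    for k in keys:
        h[bucket_of(k)] += 1
    # Exclusive scan of h: total keys in all previous buckets
    base, total = [], 0
    for count in h:
        base.append(total)
        total += count
    # p[i] = base[j] + (keys before u_i within its own bucket)
    p, seen = [0] * len(keys), [0] * m
    for i, k in enumerate(keys):
        j = bucket_of(k)
        p[i] = base[j] + seen[j]
        seen[j] += 1
    return p
```

Scattering with `out[p[i]] = keys[i]` then realizes the stable multisplit.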

SLIDE 14

Designing an Efficient Approach

Stable Multisplit → unique permutation + data movement

1 Deriving all permutations → global computations

histogram (h0, . . . , hm−1)
key order per bucket

2 Final data movements → global random scatters

(Figure: output buckets B0 B1 B2 B3)

SLIDE 15

Our high level ideas

1 Global computations

Localize computations

  • several large enough local subproblems → local histograms
  • a single small enough global computation → global histogram
  • several large enough local subproblems → permutations + scatters
  • Avoid shared memory and synchronization: utilize intrinsics

(Figure: pipeline of local pre-scan, global scan, and local post-scan)

SLIDE 16

Our high level ideas

1 Global computations

Localize computations

  • several large enough local subproblems → local histograms
  • a single small enough global computation → global histogram
  • several large enough local subproblems → permutations + scatters
  • Avoid shared memory and synchronization: utilize intrinsics

2 Global random scatters

Reordering keys locally in the last stage → local multisplits

more computational cost but better memory access pattern (coalesced writes)
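Putting the two ideas together, the whole pipeline (local histograms, one small global scan, then local reordering and coalesced scatter) can be emulated sequentially. This Python sketch uses small tiles where the real implementation uses warps or blocks (all names are mine):

```python
def multisplit_pipeline(keys, m, bucket_of, tile=4):
    tiles = [keys[t:t + tile] for t in range(0, len(keys), tile)]
    # Local pre-scan: one small histogram per tile (per warp/block on the GPU)
    hists = [[0] * m for _ in tiles]
    for t, tk in enumerate(tiles):
        for k in tk:
            hists[t][bucket_of(k)] += 1
    # Global scan: a single exclusive scan over m x len(tiles) counters,
    # traversed in bucket-major order
    offsets = [[0] * m for _ in tiles]
    total = 0
    for j in range(m):
        for t in range(len(tiles)):
            offsets[t][j] = total
            total += hists[t][j]
    # Local post-scan: each tile scatters its keys; one bucket's keys from a
    # tile land contiguously (coalesced writes on the GPU)
    out = [0] * len(keys)
    for t, tk in enumerate(tiles):
        local = [0] * m
        for k in tk:
            j = bucket_of(k)
            out[offsets[t][j] + local[j]] = k
            local[j] += 1
    return out
```

The global stage touches only m × (number of tiles) counters, regardless of how many keys there are, which is what keeps the global computation small.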

SLIDE 17

Granularity Tradeoffs

We experimented with a couple of different subproblem granularities:

1. warp

  • warp-synchronous model with minimal warp divergence
  • fast communication via warp-wide ballots/shuffles

2. block

  • more expensive communication via shared memory
  • cheaper global computation (scan over m × Nblocks elements)
  • more locality to extract after reordering

Property                  Direct MS   Warp-level MS        Block-level MS
Subproblem                warp        warp                 block
Reordering                –           warp-wide            block-wide
Computational load        low         medium               high
Coalesced memory access   low         medium               high

SLIDE 18

Implementation details & Optimizations

Warp-level MS

1 Pre-scan (Local):

  • read keys
  • warp histogram:
    1. bit-by-bit balloting
    2. ⌈log m⌉ rounds
  • store warp histogram

(Figure: each warp writes its histogram h(0,j), h(1,j), . . . , h(m−1,j) to global memory, for warps j = 0, . . . , L−1.)

procedure warp_histogram(bucket_id[0:31])
    Input:  bucket_id[0:31]
    Output: histo[0:m-1]
    for each thread i = 0:31 in parallel within the warp do
        histo_bmp[i] = 0xFFFFFFFF;
        for (int k = 0; k < ceil(log2(m)); k++) do
            temp_buffer = ballot(bucket_id[i] & 0x01);
            if ((i >> k) & 0x01) then
                histo_bmp[i] &= temp_buffer;
            else
                histo_bmp[i] &= XOR(0xFFFFFFFF, temp_buffer);
            end if
            bucket_id[i] >>= 1;
        end for
        histo[i] = popc(histo_bmp[i]);
    end for
    return histo[0:m-1];
end procedure
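A CPU emulation of the ballot-based warp histogram above, with Python integers standing in for the warp-wide ballot bitmasks (illustrative only; real device code would use warp intrinsics such as ballot and popc):

```python
def warp_histogram(bucket_ids, m):
    # bucket_ids holds one bucket ID per lane (32 lanes on real hardware).
    # After the loop, lane i's bitmask marks exactly the lanes whose key
    # falls in bucket i, so popc gives the count of bucket i (for i < m).
    width = len(bucket_ids)
    full = (1 << width) - 1
    ids = list(bucket_ids)
    histo_bmp = [full] * width
    rounds = max(1, (m - 1).bit_length())  # ceil(log2(m))
    for k in range(rounds):
        # ballot: one bit per lane, set if that lane's current low bit is 1
        ballot = 0
        for lane in range(width):
            if ids[lane] & 1:
                ballot |= 1 << lane
        for i in range(width):
            if (i >> k) & 1:
                histo_bmp[i] &= ballot
            else:
                histo_bmp[i] &= full ^ ballot
        ids = [b >> 1 for b in ids]
    return [bin(b).count("1") for b in histo_bmp]  # popc per lane

# e.g. warp_histogram([2, 1, 0, 2, 0, 1, 0, 0], 3)[:3] → [4, 2, 2]
```

Each round costs one ballot plus a few bitwise operations per lane, so the whole histogram takes only ⌈log m⌉ warp-wide votes and no shared memory.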

SLIDE 19

Implementation details & Optimizations

Warp-level MS

1 Pre-scan (Local):

  • read keys
  • warp histogram:
    1. bit-by-bit balloting
    2. ⌈log m⌉ rounds
  • store warp histogram

2 Scan (Global):

exclusive scan on histograms (m × Nwarps elements)

(Figure: the warp-major layout h(0,0) h(1,0) . . . h(m−1,0), h(0,1) . . . h(m−1,1), . . . is transposed to the bucket-major layout h(0,0) h(0,1) . . . h(0,L−1), h(1,0) . . . h(1,L−1), . . . before the exclusive scan.)

SLIDE 20

Implementation details & Optimizations

Warp-level MS

1 Pre-scan (Local):

read keys warp histogram

1

bit-by-bit balloting

2

⌈log m⌉ rounds store warp histogram

2 Scan (Global):

exclusive scan on histograms (m × Nwarps elements)

3 Post-scan (Local):

  • read keys (or key-value pairs)
  • recompute warp histograms
  • compute local offsets
  • warp-level reordering
  • compute final positions
  • final data movement

(Figure: the pre-scan, scan, and post-scan stages over the warp histograms, as on the previous slides.)

SLIDE 21

Performance Evaluation

In this presentation:

1. NVIDIA Tesla K40c GPU
2. Radix sort from CUB (including the one used in the reduced-bit sort method)
3. Device-wide exclusive scan from CUB
4. Uniform distribution of keys in buckets

More results in the paper:

1. Detailed timing of the different stages of our algorithm
2. Other GPU architectures: Maxwell
3. Different distributions of keys
4. Using our multisplit algorithms in an SSSP method

SLIDE 22

Average running time vs. number of buckets

(Plot: average running time (msec) vs. number of buckets m, for Block-level MS, Direct MS, Reduced-bit sort, and Warp-level MS; (a) key-only, (b) key-value.)

Memory access quality: Block-level MS > Warp-level MS > Direct MS
Computational load: Block-level MS > Warp-level MS > Direct MS

SLIDE 23

Performance vs Radix-Sort

(Plot: speedup over radix sort vs. number of buckets m (2 to 32), for Block-level MS, Direct MS, Reduced-bit sort, Warp-level MS, and Binary Split; (c) key-only, (d) key-value.)

key-only: 3.0x to 6.7x speedup
key-value: 4.4x to 8.0x speedup

SLIDE 24

More buckets

For more buckets than the warp width (m > 32):

  • warp histograms: each thread is put in charge of multiple buckets
  • shared memory capacity becomes the other bottleneck

(Plot: average running time (msec) vs. number of buckets m, from 32 up to 65536, for Block-level MS and Reduced-bit sort, key-only and key-value, with radix sort shown for reference.)

SLIDE 25

Conclusions

  • Introduced a new, efficient data-organization primitive
  • High performance, especially for a low or modest number of buckets

Full paper:

Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2016)

Code will soon be available in CUDPP:

https://github.com/cudpp/cudpp

SLIDE 26

Thank You

SLIDE 27

References

Alcantara, D. A., Sharf, A., Abbasinejad, F., Sengupta, S., Mitzenmacher, M., Owens, J. D., and Amenta, N. (2009). Real-time parallel hashing on the GPU. ACM Transactions on Graphics, 28(5):154:1–154:9.

Davidson, A., Baxter, S., Garland, M., and Owens, J. D. (2014). Work-efficient parallel GPU methods for single source shortest paths. In Proceedings of the 28th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2014, pages 349–359.

Wu, Z., Zhao, F., and Liu, X. (2011). SAH KD-tree construction on GPU. In Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, HPG ’11, pages 71–78.

Yang, X., Xu, D., and Zhao, L. (2013). Efficient data management for incoherent ray tracing. Applied Soft Computing, 13(1):1–8.
