Not Just a Universal Crutch: Other Useful Things To Do With - - PowerPoint PPT Presentation

not just a universal crutch other useful things to do
SMART_READER_LITE
LIVE PREVIEW

Not Just a Universal Crutch: Other Useful Things To Do With - - PowerPoint PPT Presentation

Not Just a Universal Crutch: Other Useful Things To Do With atomicCAS Mitglied der Helmholtz-Gemeinschaft S6220 - Elmar Westphal - Forschungszentrum Jlich Overview atomicCAS and the Universal Crutch Parallel hashing on GPUs


slide-1
SLIDE 1 Mitglied der Helmholtz-Gemeinschaft

Not Just a Universal Crutch: Other Useful Things To Do With atomicCAS

S6220 - Elmar Westphal - Forschungszentrum Jülich

slide-2
SLIDE 2 Mitglied der Helmholtz-Gemeinschaft

Overview

  • atomicCAS and the “Universal Crutch”
  • Parallel hashing on GPUs using atomicCAS
  • Example 1: Counting unique keys in a block
  • Example 2: Group by keys within a warp
  • Conclusions
  • Addendum: Sample source codes
slide-3
SLIDE 3 Mitglied der Helmholtz-Gemeinschaft

On atomicCAS

  • From the CUDA C Programming Manual:



 “int atomicCAS(int* address, int compare, int val);
 …
 reads … old … located … in global or shared memory, computes 



 (old == compare ? val : old) 
 
 and stores the result back to memory at the same address. … 
 The function returns old (Compare And Swap).”

slide-4
SLIDE 4 Mitglied der Helmholtz-Gemeinschaft

“The Universal Crutch”

  • According to said guide, 


“any atomic operation can
 be implemented based on
 atomicCAS()”

  • Example: double precision


atomicAdd

__device__ double atomicAdd(double* address, double val) { unsigned long long int* address_as_ull = (unsigned long long int*)address; unsigned long long int old = *address_as_ull, assumed; do { assumed = old;

  • ld = atomicCAS(address_as_ull, assumed,

__double_as_longlong(val + __longlong_as_double(assumed))); } while (assumed != old); return __longlong_as_double(old); }

slide-5
SLIDE 5 Mitglied der Helmholtz-Gemeinschaft

“The Universal Crutch”

  • According to said guide, 


“any atomic operation can
 be implemented based on
 atomicCAS()”

  • Example: double precision


atomicAdd

__device__ double atomicAdd(double* address, double val) { unsigned long long int* address_as_ull = (unsigned long long int*)address; unsigned long long int old = *address_as_ull, assumed; do { assumed = old;

  • ld = atomicCAS(address_as_ull, assumed,

__double_as_longlong(val + __longlong_as_double(assumed))); } while (assumed != old); return __longlong_as_double(old); }

“any atomic operation”, like that? Great, then we’re done here. Thank you for your time!

slide-6
SLIDE 6 Mitglied der Helmholtz-Gemeinschaft

But wait! There is more!

slide-7
SLIDE 7 Mitglied der Helmholtz-Gemeinschaft

Overview

  • atomicCAS and the “Universal Crutch”
  • Parallel hashing on GPUs using atomicCAS
  • Example 1: Counting unique keys in a block
  • Example 2: Group by keys within a warp
  • Conclusions
  • Addendum: Sample source codes
slide-8
SLIDE 8 Mitglied der Helmholtz-Gemeinschaft

Origin & Motivation

  • Originally developed as part of building (partial) linked lists in

shared memory (see GTC 2012, S2036)

  • original use became obsolete with Kepler’s faster atomics
  • general idea became useful again with Maxwell’s native shared

atomics

slide-9
SLIDE 9 Mitglied der Helmholtz-Gemeinschaft

Hashing using atomicCAS

  • atomicCAS can be used to implement parallel hashing functions
  • Works very efficiently in shared memory on Maxwell
  • Building block for several useful counting and grouping operations
  • Works best at warp- or block-level with Nkeys << Nthreads
  • Hashing function chosen must fit data properties:
  • Constantly strided keys may lead to repeated collisions
slide-10
SLIDE 10 Mitglied der Helmholtz-Gemeinschaft

A New Building Block

slide-11
SLIDE 11 Mitglied der Helmholtz-Gemeinschaft

How Does It Work?

  • The loop maps the threads’ arbitrary keys to hash indices within the warp or block
  • The hash index within a scope (warp/block) is then assigned to all threads with the same

key

  • atomicCAS tries to claim the calculated hash index for its thread’s key (“my_key”)
  • There are three possible outcomes for the return value of atomicCAS:
  • 1. UNCLAIMED: this thread is the first to claim a hash index, success
  • 2. Same key as my_key: hash index claimed by same key from different thread, success
  • 3. Key different from my_key: hash index claimed by different key (hash collision), try again

with new hash index

slide-12
SLIDE 12 Mitglied der Helmholtz-Gemeinschaft

Example, 13 Unique Keys in 16 Threads

Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6

Iteration 1 Hash function: (key+5*hash_iteration)%BLOCK_SIZE

(simple to show, but ineffective for data with stride of BLOCK_SIZE!)

assigned unassigned collision

slide-13
SLIDE 13 Mitglied der Helmholtz-Gemeinschaft

Example, 13 Unique Keys in 16 Threads

Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6

Iteration 1 Hash function: (key+5*hash_iteration)%BLOCK_SIZE

(simple to show, but ineffective for data with stride of BLOCK_SIZE!)

assigned unassigned collision

slide-14
SLIDE 14 Mitglied der Helmholtz-Gemeinschaft

Example, 13 Unique Keys in 16 Threads

Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6 Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 U U 14 15 Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6

Key from Slot

U 19 7 2 U 7 U U U U U U U U 7 U

Iteration 1 Hash function: (key+5*hash_iteration)%BLOCK_SIZE

(simple to show, but ineffective for data with stride of BLOCK_SIZE!)

assigned unassigned collision

slide-15
SLIDE 15 Mitglied der Helmholtz-Gemeinschaft

Example, 13 Unique Keys in 16 Threads

Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6 Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 U U 14 15 Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6

Key from Slot

U 19 7 2 U 7 U U U U U U U U 7 U

Iteration 1 Hash function: (key+5*hash_iteration)%BLOCK_SIZE

(simple to show, but ineffective for data with stride of BLOCK_SIZE!)

assigned unassigned collision

slide-16
SLIDE 16 Mitglied der Helmholtz-Gemeinschaft

Example, 13 Unique Keys in 16 Threads

Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6

Key from Slot

U 19 7 2 U 7 U U U U U U U U 7 U Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 U U 14 15

Iteration 2 Hash function: (key+5*hash_iteration)%BLOCK_SIZE

assigned unassigned collision

slide-17
SLIDE 17 Mitglied der Helmholtz-Gemeinschaft

Example, 13 Unique Keys in 16 Threads

Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6

Key from Slot

U 19 7 2 U 7 U U U U U U U U 7 U Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 U U 14 15 Hash 2 8 7 2 8 7 9 7 14 11 15 5 3 4 12 6

Key from Slot

  • Iteration 2

Hash function: (key+5*hash_iteration)%BLOCK_SIZE

assigned unassigned collision

slide-18
SLIDE 18 Mitglied der Helmholtz-Gemeinschaft

Example, 13 Unique Keys in 16 Threads

Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6

Key from Slot

U 19 7 2 U 7 U U U U U U U U 7 U Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 U U 14 15 Hash 2 8 7 2 8 7 9 7 14 11 15 5 3 4 12 6

Key from Slot

  • Iteration 2

Hash function: (key+5*hash_iteration)%BLOCK_SIZE

assigned unassigned collision

slide-19
SLIDE 19 Mitglied der Helmholtz-Gemeinschaft

Example, 13 Unique Keys in 16 Threads

Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6

Key from Slot

U 19 7 2 U 7 U U U U U U U U 7 U Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 U U 14 15 Hash 2 8 7 2 8 7 9 7 14 11 15 5 3 4 12 6

Key from Slot

  • Hash

2 8 7 2 8 7 9 7 14 11 15 5 3 4 12 6

Key from Slot

  • 8
  • U
  • Slot

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 23 U 14 15

Iteration 2 Hash function: (key+5*hash_iteration)%BLOCK_SIZE

assigned unassigned collision

slide-20
SLIDE 20 Mitglied der Helmholtz-Gemeinschaft

Example, 13 Unique Keys in 16 Threads

Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6

Key from Slot

U 19 7 2 U 7 U U U U U U U U 7 U Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 U U 14 15 Hash 2 8 7 2 8 7 9 7 14 11 15 5 3 4 12 6

Key from Slot

  • Hash

2 8 7 2 8 7 9 7 14 11 15 5 3 4 12 6

Key from Slot

  • 8
  • U
  • Slot

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 23 U 14 15

Iteration 2 Hash function: (key+5*hash_iteration)%BLOCK_SIZE

assigned unassigned collision

slide-21
SLIDE 21 Mitglied der Helmholtz-Gemeinschaft

Example, 13 Unique Keys in 16 Threads

Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 8 7 2 8 7 9 7 14 11 15 5 3 4 12 6

Key from Slot

  • 8
  • U
  • Slot

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 23 U 14 15

Iteration 3 Hash function: (key+5*hash_iteration)%BLOCK_SIZE

assigned unassigned collision

slide-22
SLIDE 22 Mitglied der Helmholtz-Gemeinschaft

Example, 13 Unique Keys in 16 Threads

Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 8 7 2 8 7 9 7 14 11 15 5 3 4 12 6

Key from Slot

  • 8
  • U
  • Slot

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 23 U 14 15 Hash 2 13 7 2 8 7 9 7 14 11 15 5 3 4 12 6

Key from Slot

  • Iteration 3

Hash function: (key+5*hash_iteration)%BLOCK_SIZE

assigned unassigned collision

slide-23
SLIDE 23 Mitglied der Helmholtz-Gemeinschaft

Example, 13 Unique Keys in 16 Threads

Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 8 7 2 8 7 9 7 14 11 15 5 3 4 12 6

Key from Slot

  • 8
  • U
  • Slot

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 23 U 14 15 Hash 2 13 7 2 8 7 9 7 14 11 15 5 3 4 12 6

Key from Slot

  • Iteration 3

Hash function: (key+5*hash_iteration)%BLOCK_SIZE

assigned unassigned collision

slide-24
SLIDE 24 Mitglied der Helmholtz-Gemeinschaft

Example, 13 Unique Keys in 16 Threads

Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 8 7 2 8 7 9 7 14 11 15 5 3 4 12 6

Key from Slot

  • 8
  • U
  • Slot

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 23 U 14 15 Hash 2 13 7 2 8 7 9 7 14 11 15 5 3 4 12 6

Key from Slot

  • Hash

2 13 7 2 8 7 9 7 14 11 15 5 3 4 12 6

Key from Slot

  • U
  • Slot

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 23 3 14 15

Iteration 3. Done. Hash function: (key+5*hash_iteration)%BLOCK_SIZE

assigned unassigned collision

slide-25
SLIDE 25 Mitglied der Helmholtz-Gemeinschaft

Overview

  • atomicCAS and the “Universal Crutch”
  • Parallel hashing on GPUs using atomicCAS
  • Example 1: Counting unique keys in a block
  • Example 2: Group by keys within a warp
  • Conclusions
  • Addendum: Sample source codes
slide-26
SLIDE 26 Mitglied der Helmholtz-Gemeinschaft

Use Case 1: Counting Unique Keys

slide-27
SLIDE 27 Mitglied der Helmholtz-Gemeinschaft

50 100 150 200 250 300 600 650 700 750 800 850 900 950

Kernel runtime for counting keys

10M keys, 1M unique keys, ordered by key with added noise from neighboring 3D cells

average number of keys per block (512 threads) kernel runtime GTX980 [µs]

slide-28
SLIDE 28 Mitglied der Helmholtz-Gemeinschaft

Variation of Use Case 1: Counting Unique Keys Using More Hash Values

slide-29
SLIDE 29 Mitglied der Helmholtz-Gemeinschaft

50 100 150 200 250 300 600 650 700 750 800 850 900 950

Kernel runtime for counting keys

10M keys, 1M unique keys, ordered by key with added noise from neighboring 3D cells

1 hash value per thread 2 hash values per thread average number of keys per block (512 threads) kernel runtime GTX980 [µs]

slide-30
SLIDE 30 Mitglied der Helmholtz-Gemeinschaft

Overview

  • atomicCAS and the “Universal Crutch”
  • Parallel hashing on GPUs using atomicCAS
  • Example 1: Counting unique keys in a block
  • Example 2: Group by keys within a warp
  • Conclusions
  • Addendum: Sample source codes
slide-31
SLIDE 31 Mitglied der Helmholtz-Gemeinschaft

Use Case 2: Finding Peers in a Warp

See also: GTC15, S5151 - Voting and Shuffling… Calculates a bit mask with bits set for all threads in this warp sharing the same key

slide-32
SLIDE 32 Mitglied der Helmholtz-Gemeinschaft

Variation of Use Case 2: UNCLAIMED is a valid key

slide-33
SLIDE 33 Mitglied der Helmholtz-Gemeinschaft

50 100 150 200 250 300 500 1000 1500 2000 2500

Kernel runtime for finding peers in a warp

10M keys, 1M unique keys, ordered by key with added noise from neighboring 3D cells hash loop warp vote loop average number of unique keys per block (512 threads) kernel runtime GTX980 [µs]

slide-34
SLIDE 34 Mitglied der Helmholtz-Gemeinschaft

50 100 150 200 250 300 500 1000 1500 2000 2500 3000 3500 4000

Kernel runtime for finding peers in a warp, different architectures

10M keys, 1M unique keys, ordered by key with added noise from neighboring 3D cells hash loop warp vote loop hash loop Kepler warp vote loop Kepler average number of unique keys per block (512 threads) kernel runtime GTX980/GTX690 [µs]

slide-35
SLIDE 35 Mitglied der Helmholtz-Gemeinschaft

Overview

  • atomicCAS and the “Universal Crutch”
  • Parallel hashing on GPUs using atomicCAS
  • Example 1: Counting unique keys in a block
  • Example 2: Group by keys within a warp
  • Conclusions
  • Addendum: Sample source codes
slide-36
SLIDE 36 Mitglied der Helmholtz-Gemeinschaft

Conclusions

  • atomicCAS based hashing loops are a versatile tool for counting

and grouping operations

  • With proper hashing function, runtime is relatively independent of

number of unique data keys

  • Benefits from Maxwell’s natively implemented shared atomics
  • Downside: significant performance penalty on older cards
slide-37
SLIDE 37 Mitglied der Helmholtz-Gemeinschaft

Thank you for your time!

Questions?

slide-38
SLIDE 38 Mitglied der Helmholtz-Gemeinschaft

Overview

  • atomicCAS and the “Universal Crutch”
  • Parallel hashing on GPUs using atomicCAS
  • Example 1: Counting unique keys in a block
  • Example 2: Group by keys within a warp
  • Conclusions
  • Addendum: Sample source codes
slide-39
SLIDE 39 Mitglied der Helmholtz-Gemeinschaft

Code example: count unique keys

template<int keys_per_thread=1,uint UNCLAIMED=0xffffffff> __global__ void count_keys_kernel2(uint *key, uint *n_keys, int n) { __shared__ uint hash[keys_per_thread*BLOCK_SIZE]; __shared__ uint found_unclaimed; #pragma unroll for(int i=0;i<keys_per_thread;++i) // initialise all hash values hash[BLOCK_SIZE*i+threadIdx.x]=UNCLAIMED; if (threadIdx.x==0) found_unclaimed=0; __syncthreads(); // all threads may access all elements, so we need to sync int i=GLOBAL_THREAD_INDEX; // use your own appropriate function if (i<n) { uint my_key=key[i]; // get actual data if (my_key==UNCLAIMED) { found_unclaimed=1; } else { uint hash_index=UNSET; uint old; do { hash_index=new_hash_index(my_key,hash_index,WITHIN_BLOCK*keys_per_thread);

  • ld=atomicCAS(hash+hash_index,UNCLAIMED,my_key);

} while (old!=UNCLAIMED && old!=my_key); } } __syncthreads(); // sync to ensure all loops from all threads are finished int total_claimed=found_unclaimed; #pragma unroll for(int i=0;i<keys_per_thread;++i) total_claimed+=__syncthreads_count(hash[BLOCK_SIZE*i+threadIdx.x]!=UNCLAIMED); // count claimed if (threadIdx.x==0) // only one thread writes the result n_keys[GLOBAL_BLOCK_INDEX]=total_claimed; }

slide-40
SLIDE 40 Mitglied der Helmholtz-Gemeinschaft

Code example: find warp peers

template<uint UNCLAIMED=0xffffffff> __global__ void get_warp_peers_shared(uint *key, uint *peers, int n) { int i=GLOBAL_THREAD_INDEX; // use your own appropriate function __shared__ int hash[BLOCK_SIZE]; hash[TX]=UNCLAIMED; // initialize hash, no sync necessary (access stays within warp) int lane=threadIdx.x%WARP_SIZE; int *hash_for_warp=&(hash[threadIdx.x-lane]); // saves lots of index calculations if (i<n) { uint my_key=key[i]; // get actual data uint my_peers; if (my_key==UNCLAIMED) // can be left out if UNCLAIMED is not a valid key my_peers=__ballot(my_key==UNCLAIMED); else { uint hash_index=UNSET; uint old; do { // look for hash unclaimed or claimed by same key hash_index=new_hash_index(my_key,hash_index,WITHIN_WARP);

  • ld=atomicCAS(hash_for_warp+hash_index,UNCLAIMED,my_key);

} while (old!=UNCLAIMED && old!=my_key); hash_for_warp[hash_index]=0; // shared memory is sparse, recycle it! atomicOr(hash_for_warp+hash_index,1<<lane); // build peers for separate keys in different places my_peers=hash_for_warp[hash_index]; } peers[i]=my_peers; } }