Not Just a Universal Crutch: Other Useful Things To Do With atomicCAS
S6220 - Elmar Westphal - Forschungszentrum Jülich
Not Just a Universal Crutch: Other Useful Things To Do With - - PowerPoint PPT Presentation
Not Just a Universal Crutch: Other Useful Things To Do With atomicCAS Mitglied der Helmholtz-Gemeinschaft S6220 - Elmar Westphal - Forschungszentrum Jlich Overview atomicCAS and the Universal Crutch Parallel hashing on GPUs
S6220 - Elmar Westphal - Forschungszentrum Jülich
“int atomicCAS(int* address, int compare, int val); … reads … old … located … in global or shared memory, computes
(old == compare ? val : old) and stores the result back to memory at the same address. … The function returns old (Compare And Swap).”
“any atomic operation can be implemented based on atomicCAS()”
atomicAdd
__device__ double atomicAdd(double* address, double val) { unsigned long long int* address_as_ull = (unsigned long long int*)address; unsigned long long int old = *address_as_ull, assumed; do { assumed = old;
__double_as_longlong(val + __longlong_as_double(assumed))); } while (assumed != old); return __longlong_as_double(old); }
“any atomic operation can be implemented based on atomicCAS()”
atomicAdd
__device__ double atomicAdd(double* address, double val) { unsigned long long int* address_as_ull = (unsigned long long int*)address; unsigned long long int old = *address_as_ull, assumed; do { assumed = old;
__double_as_longlong(val + __longlong_as_double(assumed))); } while (assumed != old); return __longlong_as_double(old); }
“any atomic operation”, like that? Great, then we’re done here. Thank you for your time!
shared memory (see GTC 2012, S2036)
atomics
key
with new hash index
Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6
Iteration 1 Hash function: (key+5*hash_iteration)%BLOCK_SIZE
(simple to show, but ineffective for data with stride of BLOCK_SIZE!)
assigned unassigned collision
Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6
Iteration 1 Hash function: (key+5*hash_iteration)%BLOCK_SIZE
(simple to show, but ineffective for data with stride of BLOCK_SIZE!)
assigned unassigned collision
Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6 Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 U U 14 15 Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6
Key from Slot
U 19 7 2 U 7 U U U U U U U U 7 U
Iteration 1 Hash function: (key+5*hash_iteration)%BLOCK_SIZE
(simple to show, but ineffective for data with stride of BLOCK_SIZE!)
assigned unassigned collision
Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6 Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 U U 14 15 Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6
Key from Slot
U 19 7 2 U 7 U U U U U U U U 7 U
Iteration 1 Hash function: (key+5*hash_iteration)%BLOCK_SIZE
(simple to show, but ineffective for data with stride of BLOCK_SIZE!)
assigned unassigned collision
Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6
Key from Slot
U 19 7 2 U 7 U U U U U U U U 7 U Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 U U 14 15
Iteration 2 Hash function: (key+5*hash_iteration)%BLOCK_SIZE
assigned unassigned collision
Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6
Key from Slot
U 19 7 2 U 7 U U U U U U U U 7 U Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 U U 14 15 Hash 2 8 7 2 8 7 9 7 14 11 15 5 3 4 12 6
Key from Slot
Hash function: (key+5*hash_iteration)%BLOCK_SIZE
assigned unassigned collision
Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6
Key from Slot
U 19 7 2 U 7 U U U U U U U U 7 U Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 U U 14 15 Hash 2 8 7 2 8 7 9 7 14 11 15 5 3 4 12 6
Key from Slot
Hash function: (key+5*hash_iteration)%BLOCK_SIZE
assigned unassigned collision
Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6
Key from Slot
U 19 7 2 U 7 U U U U U U U U 7 U Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 U U 14 15 Hash 2 8 7 2 8 7 9 7 14 11 15 5 3 4 12 6
Key from Slot
2 8 7 2 8 7 9 7 14 11 15 5 3 4 12 6
Key from Slot
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 23 U 14 15
Iteration 2 Hash function: (key+5*hash_iteration)%BLOCK_SIZE
assigned unassigned collision
Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6
Key from Slot
U 19 7 2 U 7 U U U U U U U U 7 U Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 U U 14 15 Hash 2 8 7 2 8 7 9 7 14 11 15 5 3 4 12 6
Key from Slot
2 8 7 2 8 7 9 7 14 11 15 5 3 4 12 6
Key from Slot
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 23 U 14 15
Iteration 2 Hash function: (key+5*hash_iteration)%BLOCK_SIZE
assigned unassigned collision
Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 8 7 2 8 7 9 7 14 11 15 5 3 4 12 6
Key from Slot
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 23 U 14 15
Iteration 3 Hash function: (key+5*hash_iteration)%BLOCK_SIZE
assigned unassigned collision
Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 8 7 2 8 7 9 7 14 11 15 5 3 4 12 6
Key from Slot
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 23 U 14 15 Hash 2 13 7 2 8 7 9 7 14 11 15 5 3 4 12 6
Key from Slot
Hash function: (key+5*hash_iteration)%BLOCK_SIZE
assigned unassigned collision
Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 8 7 2 8 7 9 7 14 11 15 5 3 4 12 6
Key from Slot
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 23 U 14 15 Hash 2 13 7 2 8 7 9 7 14 11 15 5 3 4 12 6
Key from Slot
Hash function: (key+5*hash_iteration)%BLOCK_SIZE
assigned unassigned collision
Slot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Hash 2 8 7 2 8 7 9 7 14 11 15 5 3 4 12 6
Key from Slot
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 23 U 14 15 Hash 2 13 7 2 8 7 9 7 14 11 15 5 3 4 12 6
Key from Slot
2 13 7 2 8 7 9 7 14 11 15 5 3 4 12 6
Key from Slot
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U 2 19 20 21 22 7 8 9 U 11 23 3 14 15
Iteration 3. Done. Hash function: (key+5*hash_iteration)%BLOCK_SIZE
assigned unassigned collision
50 100 150 200 250 300 600 650 700 750 800 850 900 950
Kernel runtime for counting keys
10M keys, 1M unique keys, ordered by key with added noise from neighboring 3D cells
average number of keys per block (512 threads) kernel runtime GTX980 [µs]
Variation of Use Case 1: Counting Unique Keys Using More Hash Values
50 100 150 200 250 300 600 650 700 750 800 850 900 950
Kernel runtime for counting keys
10M keys, 1M unique keys, ordered by key with added noise from neighboring 3D cells
1 hash value per thread 2 hash values per thread average number of keys per block (512 threads) kernel runtime GTX980 [µs]
See also: GTC15, S5151 - Voting and Shuffling… Calculates a bit mask with bits set for all threads in this warp sharing the same key
Variation of Use Case 2: UNCLAIMED is a valid key
50 100 150 200 250 300 500 1000 1500 2000 2500
Kernel runtime for finding peers in a warp
10M keys, 1M unique keys, ordered by key with added noise from neighboring 3D cells hash loop warp vote loop average number of unique keys per block (512 threads) kernel runtime GTX980 [µs]
50 100 150 200 250 300 500 1000 1500 2000 2500 3000 3500 4000
Kernel runtime for finding peers in a warp, different architectures
10M keys, 1M unique keys, ordered by key with added noise from neighboring 3D cells hash loop warp vote loop hash loop Kepler warp vote loop Kepler average number of unique keys per block (512 threads) kernel runtime GTX980/GTX690 [µs]
and grouping operations
number of unique data keys
template<int keys_per_thread=1,uint UNCLAIMED=0xffffffff> __global__ void count_keys_kernel2(uint *key, uint *n_keys, int n) { __shared__ uint hash[keys_per_thread*BLOCK_SIZE]; __shared__ uint found_unclaimed; #pragma unroll for(int i=0;i<keys_per_thread;++i) // initialise all hash values hash[BLOCK_SIZE*i+threadIdx.x]=UNCLAIMED; if (threadIdx.x==0) found_unclaimed=0; __syncthreads(); // all threads may access all elements, so we need to sync int i=GLOBAL_THREAD_INDEX; // use your own appropriate function if (i<n) { uint my_key=key[i]; // get actual data if (my_key==UNCLAIMED) { found_unclaimed=1; } else { uint hash_index=UNSET; uint old; do { hash_index=new_hash_index(my_key,hash_index,WITHIN_BLOCK*keys_per_thread);
} while (old!=UNCLAIMED && old!=my_key); } } __syncthreads(); // sync to ensure all loops from all threads are finished int total_claimed=found_unclaimed; #pragma unroll for(int i=0;i<keys_per_thread;++i) total_claimed+=__syncthreads_count(hash[BLOCK_SIZE*i+threadIdx.x]!=UNCLAIMED); // count claimed if (threadIdx.x==0) // only one thread writes the result n_keys[GLOBAL_BLOCK_INDEX]=total_claimed; }
template<uint UNCLAIMED=0xffffffff> __global__ void get_warp_peers_shared(uint *key, uint *peers, int n) { int i=GLOBAL_THREAD_INDEX; // use your own appropriate function __shared__ int hash[BLOCK_SIZE]; hash[TX]=UNCLAIMED; // initialize hash, no sync necessary (access stays within warp) int lane=threadIdx.x%WARP_SIZE; int *hash_for_warp=&(hash[threadIdx.x-lane]); // saves lots of index calculations if (i<n) { uint my_key=key[i]; // get actual data uint my_peers; if (my_key==UNCLAIMED) // can be left out if UNCLAIMED is not a valid key my_peers=__ballot(my_key==UNCLAIMED); else { uint hash_index=UNSET; uint old; do { // look for hash unclaimed or claimed by same key hash_index=new_hash_index(my_key,hash_index,WITHIN_WARP);
} while (old!=UNCLAIMED && old!=my_key); hash_for_warp[hash_index]=0; // shared memory is sparse, recycle it! atomicOr(hash_for_warp+hash_index,1<<lane); // build peers for separate keys in different places my_peers=hash_for_warp[hash_index]; } peers[i]=my_peers; } }