not just a universal crutch other useful things to do
play

Not Just a Universal Crutch: Other Useful Things To Do With - PowerPoint PPT Presentation

Not Just a Universal Crutch: Other Useful Things To Do With atomicCAS Mitglied der Helmholtz-Gemeinschaft S6220 - Elmar Westphal - Forschungszentrum Jlich Overview atomicCAS and the Universal Crutch Parallel hashing on GPUs


  1. Not Just a Universal Crutch: Other Useful Things To Do With atomicCAS Mitglied der Helmholtz-Gemeinschaft S6220 - Elmar Westphal - Forschungszentrum Jülich

  2. Overview • atomicCAS and the “Universal Crutch” • Parallel hashing on GPUs using atomicCAS • Example 1: Counting unique keys in a block • Example 2: Group by keys within a warp Mitglied der Helmholtz-Gemeinschaft • Conclusions • Addendum: Sample source codes

  3. 
 
 
 On atomicCAS • From the CUDA C Programming Manual: 
 “ int atomicCAS(int* address, int compare, int val); 
 … 
 reads … old … located … in global or shared memory, computes 
 (old == compare ? val : old) 
 Mitglied der Helmholtz-Gemeinschaft and stores the result back to memory at the same address. … 
 The function returns old (Compare And Swap).”

  4. “The Universal Crutch” __device__ double atomicAdd(double* address, double val) { • According to said guide, 
 unsigned long long int* address_as_ull = (unsigned long long int*)address; “any atomic operation can 
 unsigned long long int old = *address_as_ull, assumed; do { be implemented based on 
 assumed = old; old = atomicCAS(address_as_ull, assumed, atomicCAS()” __double_as_longlong(val + __longlong_as_double(assumed))); } while (assumed != old); return __longlong_as_double(old); • Example: double precision 
 } atomicAdd Mitglied der Helmholtz-Gemeinschaft

  5. “The Universal Crutch” __device__ double atomicAdd(double* address, double val) { • According to said guide, 
 unsigned long long int* address_as_ull = (unsigned long long int*)address; “any atomic operation can 
 unsigned long long int old = *address_as_ull, assumed; do { be implemented based on 
 assumed = old; old = atomicCAS(address_as_ull, assumed, atomicCAS()” __double_as_longlong(val + __longlong_as_double(assumed))); } while (assumed != old); return __longlong_as_double(old); • Example: double precision 
 } atomicAdd Mitglied der Helmholtz-Gemeinschaft “any atomic operation”, like that? Great, then we’re done here. Thank you for your time!

  6. Mitglied der Helmholtz-Gemeinschaft There is more! But wait!

  7. Overview • atomicCAS and the “Universal Crutch” • Parallel hashing on GPUs using atomicCAS • Example 1: Counting unique keys in a block • Example 2: Group by keys within a warp Mitglied der Helmholtz-Gemeinschaft • Conclusions • Addendum: Sample source codes

  8. Origin & Motivation • Originally developed as part of building (partial) linked lists in shared memory (see GTC 2012, S2036) • original use became obsolete with Kepler’s faster atomics • general idea became useful again with Maxwell’s native shared Mitglied der Helmholtz-Gemeinschaft atomics

  9. Hashing using atomicCAS • atomicCAS can be used to implement parallel hashing functions • Works very efficiently in shared memory on Maxwell • Building block for several useful counting and grouping operations • Works best at warp- or block-level with N keys << N threads Mitglied der Helmholtz-Gemeinschaft • Hashing function chosen must fit data properties: • Constantly strided keys may lead to repeated collisions

  10. Mitglied der Helmholtz-Gemeinschaft A New Building Block

  11. How Does It Work? • The loop maps the threads’ arbitrary keys to hash indices within the warp or block • The hash index within a scope (warp/block) is then assigned to all threads with the same key • atomicCAS tries to claim the calculated hash index for its thread’s key (“my_key”) • There are three possible outcomes for the return value of atomicCAS: 1. UNCLAIMED: this thread is the first to claim a hash index, success Mitglied der Helmholtz-Gemeinschaft 2. Same key as my_key: hash index claimed by same key from different thread, success 3. Key different from my_key: hash index claimed by different key (hash collision), try again with new hash index

  12. Example, 13 Unique Keys in 16 Threads Slot 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Mitglied der Helmholtz-Gemeinschaft Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6 Iteration 1 Hash function: (key+5*hash_iteration)%BLOCK_SIZE (simple to show, but ineffective for data with stride of BLOCK_SIZE!) assigned unassigned collision

  13. Example, 13 Unique Keys in 16 Threads Slot 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Mitglied der Helmholtz-Gemeinschaft Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6 Iteration 1 Hash function: (key+5*hash_iteration)%BLOCK_SIZE (simple to show, but ineffective for data with stride of BLOCK_SIZE!) assigned unassigned collision

  14. Example, 13 Unique Keys in 16 Threads Slot Slot 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 Key Key U U U U U 2 19 U 20 U 21 U 22 U U 7 U 8 U 9 U U 11 U U U U U 14 U 15 U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Mitglied der Helmholtz-Gemeinschaft Hash Hash 2 2 3 3 7 7 2 2 8 8 7 7 9 9 7 7 14 14 11 11 15 15 5 5 3 3 4 4 7 7 6 6 Key from U 19 7 2 U 7 U U U U U U U U 7 U Slot Iteration 1 Hash function: (key+5*hash_iteration)%BLOCK_SIZE (simple to show, but ineffective for data with stride of BLOCK_SIZE!) assigned unassigned collision

  15. Example, 13 Unique Keys in 16 Threads Slot Slot 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 Key Key U U U U U 2 19 U 20 U 21 U 22 U U 7 U 8 U 9 U U 11 U U U U U 14 U 15 U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Mitglied der Helmholtz-Gemeinschaft Hash Hash 2 2 3 3 7 7 2 2 8 8 7 7 9 9 7 7 14 14 11 11 15 15 5 5 3 3 4 4 7 7 6 6 Key from U 19 7 2 U 7 U U U U U U U U 7 U Slot Iteration 1 Hash function: (key+5*hash_iteration)%BLOCK_SIZE (simple to show, but ineffective for data with stride of BLOCK_SIZE!) assigned unassigned collision

  16. Example, 13 Unique Keys in 16 Threads Slot Slot 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 Key Key U U U U U 2 19 U 20 U 21 U 22 U U 7 U 8 U 9 U U 11 U U U U U 14 U 15 U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Mitglied der Helmholtz-Gemeinschaft Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6 Key from U 19 7 2 U 7 U U U U U U U U 7 U Slot Iteration 2 Hash function: (key+5*hash_iteration)%BLOCK_SIZE assigned unassigned collision

  17. Example, 13 Unique Keys in 16 Threads Slot Slot 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 Key Key U U U U U 2 19 U 20 U 21 U 22 U U 7 U 8 U 9 U U 11 U U U U U 14 U 15 U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Mitglied der Helmholtz-Gemeinschaft Hash Hash 2 2 8 3 7 7 2 2 8 8 7 7 9 9 7 7 14 14 11 11 15 15 5 5 3 3 4 4 12 7 6 6 Key from Key from U - 19 - 7 - 2 - U - 7 - U - U - U - U - U - U - U - U - 7 - U - Slot Slot Iteration 2 Hash function: (key+5*hash_iteration)%BLOCK_SIZE assigned unassigned collision

  18. Example, 13 Unique Keys in 16 Threads Slot Slot 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 Key Key U U U U U 2 19 U 20 U 21 U 22 U U 7 U 8 U 9 U U 11 U U U U U 14 U 15 U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Mitglied der Helmholtz-Gemeinschaft Hash Hash 2 2 8 3 7 7 2 2 8 8 7 7 9 9 7 7 14 14 11 11 15 15 5 5 3 3 4 4 12 7 6 6 Key from Key from U - 19 - 7 - 2 - U - 7 - U - U - U - U - U - U - U - U - 7 - U - Slot Slot Iteration 2 Hash function: (key+5*hash_iteration)%BLOCK_SIZE assigned unassigned collision

  19. Example, 13 Unique Keys in 16 Threads Slot Slot Slot 0 0 0 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 9 9 9 10 10 10 11 11 11 12 12 12 13 13 13 14 14 14 15 15 15 Key Key Key U U U U U U U 2 2 19 19 U 20 20 U 21 21 U 22 22 U U 7 7 U 8 8 U 9 9 U U U 11 11 U 23 U U U U U 14 14 U 15 15 U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Mitglied der Helmholtz-Gemeinschaft Hash Hash Hash 2 2 2 8 3 8 7 7 7 2 2 2 8 8 8 7 7 7 9 9 9 7 7 7 14 14 14 11 11 11 15 15 15 5 5 5 3 3 3 4 4 4 12 12 7 6 6 6 Key from Key from Key from U - - 19 8 - 7 - - 2 - - U - - 7 - - U - - U - - U - - U - - U - - U - - U - - U - - U 7 - U - - Slot Slot Slot Iteration 2 Hash function: (key+5*hash_iteration)%BLOCK_SIZE assigned unassigned collision

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend