
SLIDE 1

Shuffle: Tips and Tricks

Julien Demouth, NVIDIA

SLIDE 2

Glossary

  • Warp

— Implicitly synchronized group of threads (32 on current HW)

  • Warp ID (warpid)

— Identifier of the warp in a block: threadIdx.x / 32

  • Lane ID (laneid)

— Coordinate of the thread in a warp: threadIdx.x % 32

— Special register (available from PTX): %laneid
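A host-side sketch of that arithmetic (helper names are ours, not part of the CUDA API; `tid` stands in for `threadIdx.x`):

```cpp
// Hypothetical host-side helpers mirroring the glossary definitions.
int warp_id(int tid) { return tid / 32; } // identifier of the warp in the block
int lane_id(int tid) { return tid % 32; } // coordinate of the thread in its warp
```

Thread 37 of a block, for instance, is lane 5 of warp 1.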

SLIDE 3

Shuffle (SHFL)

  • Instruction to exchange data in a warp
  • Threads can “read” other threads’ registers
  • No shared memory is needed
  • It is available starting from SM 3.0
SLIDE 4

Variants

  • 4 variants (idx, up, down, bfly):

(Diagram: the letters a..h exchanged across 8 lanes for each variant.)

  • shfl.idx: indexed any-to-any
  • shfl.up: shift right to n-th neighbour
  • shfl.down: shift left to n-th neighbour
  • shfl.bfly: butterfly (XOR) exchange
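The four variants differ only in which source lane each thread reads from. A CPU sketch of that lane arithmetic (warp width fixed at 32, clamping as in the unpredicated case; function names are ours, not the CUDA API):

```cpp
// Which lane each thread reads from, per variant (CPU model, width 32).
constexpr int W = 32;
int src_idx (int lane, int b) { return b & (W - 1); }                     // shfl.idx: any-to-any
int src_up  (int lane, int b) { return lane - b >= 0 ? lane - b : lane; } // shfl.up: shift right
int src_down(int lane, int b) { return lane + b <  W ? lane + b : lane; } // shfl.down: shift left
int src_bfly(int lane, int b) { return lane ^ b; }                        // shfl.bfly: XOR exchange
```

When the computed source lane falls outside the warp (up/down at the edges), the thread keeps its own value, which is what the clamping above models.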

SLIDE 5

Instruction (PTX)

shfl.mode.b32 d[|p], a, b, c;

  • d: dst. register
  • p: optional dst. predicate
  • a: src. register
  • b: lane / offset / mask
  • c: bound

SLIDE 6

Implement SHFL for 64b Numbers

  • Generic SHFL: https://github.com/BryanCatanzaro/generics

    __device__ __inline__ double shfl(double x, int lane)
    {
        // Split the double number into 2 32b registers.
        int lo, hi;
        asm volatile("mov.b64 {%0,%1}, %2;" : "=r"(lo), "=r"(hi) : "d"(x));
        // Shuffle the two 32b registers.
        lo = __shfl(lo, lane);
        hi = __shfl(hi, lane);
        // Recreate the 64b number.
        asm volatile("mov.b64 %0, {%1,%2};" : "=d"(x) : "r"(lo), "r"(hi));
        return x;
    }

SLIDE 7

Performance Experiment

  • One element per thread
  • Each thread takes its right neighbor

(Diagram: values across the warp before and after one step; each thread's x takes the value held by its right neighbour.)

SLIDE 8

Performance Experiment

  • We run the following test on a K20
  • We launch 26 blocks of 1024 threads

— On K20, we have 13 SMs

— We need 2048 threads per SM to reach 100% occupancy

  • We time different variants of that kernel

    T x = input[tidx];
    for (int i = 0; i < 4096; ++i)
        x = get_right_neighbor(x);
    output[tidx] = x;
SLIDE 9

Performance Experiment

  • Shared memory (SMEM)
  • Shuffle (SHFL)
  • Shared memory without __syncthreads + volatile (unsafe)

    // Shared memory (SMEM):
    smem[threadIdx.x] = x;
    __syncthreads();
    x = smem[32*warpid + ((laneid+1) % 32)];
    __syncthreads();

    // Shuffle (SHFL):
    x = __shfl(x, (laneid+1) % 32);

    // Shared memory without __syncthreads + volatile (unsafe):
    __shared__ volatile T *smem = ...;
    smem[threadIdx.x] = x;
    x = smem[32*warpid + ((laneid+1) % 32)];
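All three variants implement the same warp-level rotation. A CPU model of one step, treating the warp as an array of 32 values:

```cpp
#include <array>

// CPU model of x = __shfl(x, (laneid + 1) % 32): every lane takes the
// value held by its right neighbour, and lane 31 wraps around to lane 0.
std::array<int, 32> rotate_left(const std::array<int, 32>& x) {
    std::array<int, 32> y{};
    for (int lane = 0; lane < 32; ++lane)
        y[lane] = x[(lane + 1) % 32];
    return y;
}
```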

SLIDE 10

Performance Experiment (fp32)

(Bar charts: Execution Time (ms) and SMEM per Block (KB) for SMEM, SMEM (unsafe), and SHFL.)

SLIDE 11

Performance Experiment (fp64)

(Bar charts: Execution Time (ms) and SMEM per Block (KB) for SMEM, SMEM (unsafe), and SHFL.)

SLIDE 12

Performance Experiment

  • Always faster than shared memory
  • Much safer than shared memory with no __syncthreads (and volatile)

— And never slower

  • Does not require shared memory

— Useful when occupancy is limited by SMEM usage

SLIDE 13

Broadcast

  • All threads read from a single lane
  • More complex example

    x = __shfl(x, 0); // All the threads read x from laneid 0.

    // All threads evaluate a predicate.
    int predicate = ...;
    // All threads vote.
    unsigned vote = __ballot(predicate);
    // All threads get x from the "last" lane which
    // evaluated the predicate to true.
    if (vote)
        x = __shfl(x, __bfind(vote));

    // __bfind(unsigned i): find the most significant set bit
    // in a 32b number (PTX bfind).
    __device__ __inline__ int __bfind(unsigned i)
    {
        int b;
        asm volatile("bfind.u32 %0, %1;" : "=r"(b) : "r"(i));
        return b;
    }
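The PTX bfind.u32 instruction returns the bit position of the most significant set bit, which is why the broadcast above picks the highest lane whose predicate was true. A CPU model of that semantics (hardware bfind yields 0xffffffff for a zero input; -1 plays that role here):

```cpp
// CPU model of PTX bfind.u32: position of the most significant set bit.
int bfind_model(unsigned v) {
    for (int b = 31; b >= 0; --b)
        if (v & (1u << b))
            return b;
    return -1; // hardware bfind yields 0xffffffff when no bit is set
}
```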

SLIDE 14

Reduce

  • Code
  • Performance

— Launch 26 blocks of 1024 threads

— Run the reduction 4096 times

    // Threads want to reduce the value in x.
    float x = ...;
    #pragma unroll
    for (int mask = WARP_SIZE / 2; mask > 0; mask >>= 1)
        x += __shfl_xor(x, mask);
    // The x variable of laneid 0 contains the reduction.

(Bar charts: Execution Time fp32 (ms) and SMEM per Block fp32 (KB) for SMEM, SMEM (unsafe), and SHFL.)
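The butterfly pattern can be checked on the CPU by modelling the warp as an array of 32 lanes; after the five XOR rounds, every lane (laneid 0 included) holds the full sum:

```cpp
#include <array>

// CPU model of the __shfl_xor butterfly reduction over a 32-lane warp.
std::array<float, 32> warp_reduce(std::array<float, 32> x) {
    for (int mask = 16; mask > 0; mask >>= 1) {  // WARP_SIZE / 2 = 16
        std::array<float, 32> y = x;             // lanes read pre-round values
        for (int lane = 0; lane < 32; ++lane)
            x[lane] += y[lane ^ mask];           // models x += __shfl_xor(x, mask)
    }
    return x;
}
```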

SLIDE 15

Scan

  • Code
  • Performance

— Launch 26 blocks of 1024 threads

— Run the scan 4096 times

    #pragma unroll
    for (int offset = 1; offset < 32; offset <<= 1) {
        float y = __shfl_up(x, offset);
        if (laneid() >= offset)
            x += y;
    }

(Bar charts: Execution Time fp32 (ms) and SMEM per Block fp32 (KB) for SMEM, SMEM (unsafe), and SHFL.)
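The same array-of-lanes model verifies this Kogge-Stone style inclusive scan: lane l ends with the sum of lanes 0..l:

```cpp
#include <array>

// CPU model of the __shfl_up-based inclusive warp scan.
std::array<float, 32> warp_scan(std::array<float, 32> x) {
    for (int offset = 1; offset < 32; offset <<= 1) {
        std::array<float, 32> y = x;      // y = __shfl_up(x, offset), pre-round values
        for (int lane = 0; lane < 32; ++lane)
            if (lane >= offset)           // models the laneid() >= offset guard
                x[lane] += y[lane - offset];
    }
    return x;
}
```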

SLIDE 16

Scan

  • Use the predicate from SHFL
  • Use CUB:

https://nvlabs.github.com/cub

    #pragma unroll
    for (int offset = 1; offset < 32; offset <<= 1) {
        asm volatile(
            "{"
            "  .reg .f32 r0;"
            "  .reg .pred p;"
            "  shfl.up.b32 r0|p, %0, %1, 0x0;"
            "  @p add.f32 r0, r0, %0;"
            "  mov.f32 %0, r0;"
            "}"
            : "+f"(x) : "r"(offset));
    }

(Bar chart: Execution Time fp32 (ms) for the intrinsics version vs. the version using the predicate.)

SLIDE 17

Bitonic Sort

    x:        10 15  9  7 11  3  8  5 14 13  6  1 12  4  2 …
    stride=1: 10 15  9  7  3 11  8  5 13 14  6  1  4 12  2 …
    stride=2: 10 15  9  7  3  5  8 11 13 14  6  1  2  4 12 …
    stride=1: 15 10  9  7  3  5  8 11 14 13  6  1  2  4 12 …

SLIDE 18

Bitonic Sort

    stride=4: 15 10  9 11  3  5  8  7  2  4  1 14 13  6 12 …
    stride=2:  9 10 15 11  3  5  8  7  4  2  1 14 13  6 12 …
    stride=1:  9 10 11 15  3  5  7  8  4  2  1 14 13 12  6 …

SLIDE 19

(Bar chart: Execution Time int32 (ms) for SMEM, SMEM (unsafe), and SHFL.)

Bitonic Sort

    int swap(int x, int mask, int dir)
    {
        int y = __shfl_xor(x, mask);
        return x < y == dir ? y : x;
    }

    x = swap(x, 0x01, bfe(laneid, 1) ^ bfe(laneid, 0)); // 2
    x = swap(x, 0x02, bfe(laneid, 2) ^ bfe(laneid, 1)); // 4
    x = swap(x, 0x01, bfe(laneid, 2) ^ bfe(laneid, 0));
    x = swap(x, 0x04, bfe(laneid, 3) ^ bfe(laneid, 2)); // 8
    x = swap(x, 0x02, bfe(laneid, 3) ^ bfe(laneid, 1));
    x = swap(x, 0x01, bfe(laneid, 3) ^ bfe(laneid, 0));
    x = swap(x, 0x08, bfe(laneid, 4) ^ bfe(laneid, 3)); // 16
    x = swap(x, 0x04, bfe(laneid, 4) ^ bfe(laneid, 2));
    x = swap(x, 0x02, bfe(laneid, 4) ^ bfe(laneid, 1));
    x = swap(x, 0x01, bfe(laneid, 4) ^ bfe(laneid, 0));
    x = swap(x, 0x10, bfe(laneid, 4)); // 32
    x = swap(x, 0x08, bfe(laneid, 3));
    x = swap(x, 0x04, bfe(laneid, 2));
    x = swap(x, 0x02, bfe(laneid, 1));
    x = swap(x, 0x01, bfe(laneid, 0));

    // int bfe(int i, int k): extract the k-th bit of i.
    // PTX: bfe dst, src, start, len (see p. 81, ptx_isa_3.1)

(Bar chart: SMEM per Block (KB) for SMEM, SMEM (unsafe), and SHFL.)
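The whole network can be checked on the CPU with the same array-of-lanes model: the stage/stride loop below generates exactly the swap(...) sequence listed above (with dir == 1 a lane keeps the larger value, with dir == 0 the smaller one):

```cpp
#include <array>

int bfe_bit(int i, int k) { return (i >> k) & 1; } // extract bit k of i

// CPU model of the warp-wide bitonic sort: the partner read models
// __shfl_xor(x, mask), and dir reproduces the bfe(laneid, .) ^ bfe(laneid, .)
// pattern of the swap() calls above.
std::array<int, 32> bitonic_sort(std::array<int, 32> x) {
    for (int sb = 1; sb <= 5; ++sb) {              // stages sort runs of 2 << (sb - 1)
        for (int kb = sb - 1; kb >= 0; --kb) {     // strides 1 << kb
            int mask = 1 << kb;
            std::array<int, 32> y = x;             // pre-exchange snapshot
            for (int lane = 0; lane < 32; ++lane) {
                int partner = y[lane ^ mask];      // __shfl_xor(x, mask)
                int dir = bfe_bit(lane, kb);
                if (sb < 5) dir ^= bfe_bit(lane, sb); // alternate direction per run
                // mirrors `return x < y == dir ? y : x;`
                x[lane] = ((y[lane] < partner) == dir) ? partner : y[lane];
            }
        }
    }
    return x;
}
```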

SLIDE 20

Transpose

  • When threads load or store arrays of structures, transposes enable fully coalesced memory operations
  • e.g. when loading, have the warp perform coalesced loads, then transpose to send the data to the appropriate thread

(Diagram: n threads in a warp (8 for illustration only), m elements per thread; loads and stores are coalesced in memory, and the transpose happens in registers.)
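The index arithmetic behind the picture can be written down directly; in this sketch n = 8 lanes and m = 3 elements per thread (illustration only), and the gather step that a GPU would do with SHFL is plain array indexing:

```cpp
#include <array>

constexpr int n = 8, m = 3; // n lanes, m elements per thread (illustration)

// Coalesced load: in pass j the n lanes read n consecutive addresses.
std::array<std::array<int, m>, n> coalesced_load(const int* mem) {
    std::array<std::array<int, m>, n> reg{};
    for (int j = 0; j < m; ++j)
        for (int t = 0; t < n; ++t)
            reg[t][j] = mem[j * n + t];
    return reg;
}

// Transpose: lane t gathers its own structure, elements t*m .. t*m+m-1.
std::array<std::array<int, m>, n> transpose(const std::array<std::array<int, m>, n>& reg) {
    std::array<std::array<int, m>, n> out{};
    for (int t = 0; t < n; ++t)
        for (int j = 0; j < m; ++j) {
            int idx = t * m + j;               // flat index of the wanted element
            out[t][j] = reg[idx % n][idx / n]; // on the GPU: a SHFL exchange
        }
    return out;
}
```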

SLIDE 21

Transpose

  • You can use SMEM to implement this transpose, or you can use SHFL

  • Code:

http://github.com/bryancatanzaro/trove

  • Performance

— Launch 104 blocks of 256 threads

— Run the transpose 4096 times

(Bar charts: Execution Time 7*int32 and SMEM per Block (KB) for SMEM, SMEM (unsafe), and SHFL.)

SLIDE 22

Array of Structures Access via Transpose

  • Transpose speeds access to arrays of structures
  • High-level interface: coalesced_ptr<T>

— Just dereference like any pointer

— Up to 6x faster than direct compiler-generated access

(Charts: bandwidth (GB/s) vs. size of structure in bytes. Contiguous AoS access: SHFL load/store vs. direct load/store. Random AoS access: SHFL gather/scatter vs. direct gather/scatter.)

SLIDE 23

Conclusion

  • SHFL is available on SM 3.0 and newer
  • It is always faster than “safe” shared memory
  • It is never slower than “unsafe” shared memory
  • It can be used in many different algorithms