Atomic Operations across GPU Generations



  1. Atomic Operations across GPU Generations
     Juan Gómez-Luna, University of Córdoba (Spain)
     University of Illinois at Urbana-Champaign. ECE 408. October 22, 2015

  2. About me
     • Juan Gómez-Luna
     • Telecommunications Engineering (University of Sevilla, 2001)
     • Since 2005, Lecturer at the University of Córdoba
     • PhD Thesis (University of Córdoba, 2012)
       – Programming Issues for Video Analysis on Graphics Processing Units
     • Research collaborations:
       – Technical University Munich (Germany)
       – Technical University Eindhoven (The Netherlands)
       – University of Illinois at Urbana-Champaign (USA)
       – University of Málaga (Spain)
       – Barcelona Supercomputing Center (Spain)
     • PI of the University of Córdoba GPU Education Center (supported by NVIDIA)

  3. Outline
     • Uses of atomic operations
     • Atomic operations on shared memory
       – Evolution across GPU generations
       – Case studies: stream compaction, histogramming, reduction
     • Atomic operations on global memory
       – Evolution across GPU generations
       – Case studies: scatter vs. gather, adjacent thread block synchronization

  4. Uses of atomic operations
     • Collaboration
       – Atomics on an array that will be the output of the kernel
       – Example: histogramming
     • Synchronization
       – Atomics on memory locations that are used for synchronization or coordination
       – Example: locks, flags… (see the spinlock sketch below)
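As an aside on the synchronization use above, here is a minimal sketch of a global-memory spinlock built from atomicCAS and atomicExch (the names acquire/release and the one-thread-per-block discipline are illustrative assumptions, not code from the slides):

    __device__ void acquire(int *lock) {
        // spin until the flag flips atomically from 0 (free) to 1 (taken)
        while (atomicCAS(lock, 0, 1) != 0)
            ;
    }

    __device__ void release(int *lock) {
        __threadfence();      // make protected writes visible before unlocking
        atomicExch(lock, 0);  // reset the flag atomically
    }

    __global__ void critical_k(int *lock, int *counter) {
        // only one thread per block contends: intra-warp spinning can
        // deadlock on these (pre-Volta) architectures
        if (threadIdx.x == 0) {
            acquire(lock);
            *counter += 1;    // protected update
            release(lock);
        }
    }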

  5. Uses of atomic operations
     • CUDA provides atomic functions on shared memory and global memory
     • Arithmetic functions
       – add, sub, max, min, exch, inc, dec, CAS
       – e.g., int atomicAdd(int*, int);
     • Bitwise functions
       – and, or, xor
     • Operand types: int, unsigned int, unsigned long long, and float
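A small kernel sketch exercising a few of these signatures (the kernel and buffer names are made up for illustration); every atomic returns the value that was at the address before the update:

    __global__ void atomics_demo_k(int *g_max, unsigned int *g_flags) {
        __shared__ int s_count;
        if (threadIdx.x == 0)
            s_count = 0;
        __syncthreads();

        int old = atomicAdd(&s_count, 1);             // arithmetic, on shared memory
        atomicMax(g_max, old);                        // arithmetic, on global memory
        atomicOr(g_flags, 1u << (threadIdx.x % 32));  // bitwise, on global memory
    }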

  6. Outline
     • Uses of atomic operations
     • Atomic operations on shared memory
       – Evolution across GPU generations
       – Case studies: stream compaction, histogramming, reduction
     • Atomic operations on global memory
       – Evolution across GPU generations
       – Case studies: scatter vs. gather, adjacent thread block synchronization

  7. Atomic operations on shared memory
     • Code
       – CUDA: int atomicAdd(int*, int);
       – PTX:  atom.shared.add.u32 %r25, [%rd14], 1;
       – SASS on Tesla, Fermi, and Kepler (lock/update/unlock):
           /*00a0*/      LDSLK P0, R9, [R8];
           /*00a8*/  @P0 IADD R10, R9, R7;
           /*00b0*/  @P0 STSCUL P1, [R8], R10;
           /*00b8*/ @!P1 BRA 0xa0;
       – SASS on Maxwell (native atomics):
           /*01f8*/ ATOMS.ADD RZ, [R7], R11;
     • Maxwell has native shared memory atomic operations for 32-bit integers, and 32-bit and 64-bit atomicCAS
     • Lock/Update/Unlock vs. native atomic operations
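That pre-Maxwell retry loop can also be written at the CUDA source level with atomicCAS; this sketch (the helper name is made up) mirrors what the lock/update/unlock SASS does for an add:

    __device__ int atomicAddViaCAS(int *addr, int val) {
        int old = *addr, assumed;
        do {
            assumed = old;
            // swap in assumed+val only if nobody else updated addr meanwhile
            old = atomicCAS(addr, assumed, assumed + val);
        } while (old != assumed);  // retry on failure, like the @!P1 BRA back-edge
        return old;
    }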

  8. Atomic operations on shared memory
     • Atomic conflict degree
       – Intra-warp conflict degree ranges from 1 to 32
     [Figure: no atomic conflict = concurrent votes (time t_base) vs. atomic conflict = serialized votes (time t_conflict); both diagrams target shared memory]

  9. Atomic operations on shared memory
     • Microbenchmarking on Tesla, Fermi, and Kepler
       – Position conflicts (GTX 580, Fermi)

  10. Atomic operations on shared memory
      • Microbenchmarking on Tesla, Fermi, and Kepler
        – Position conflicts (K20, Kepler)
      [Figure: execution time (ms) vs. number of threads in the warp voting in the same location as the warp leader thread (1 to 31), for block sizes 32, 64, 128, 256, 512, and 1024]

  11. Atomic operations on shared memory
      • Microbenchmarking on Maxwell
        – Position conflicts (GTX 980, Maxwell)
      [Figure: execution time (ms) vs. number of threads in the warp voting in the same location as the warp leader thread (1 to 31), for block sizes 32, 64, 128, 256, 512, and 1024]
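A minimal sketch of a position-conflict microbenchmark in this spirit (the kernel name and NITER are assumptions): the first k lanes of each warp vote into the warp leader's location, the remaining lanes into their own, so k sweeps the intra-warp conflict degree measured above.

    #define NITER 1000  // repetitions to make the timing measurable (assumed)

    __global__ void position_conflict_k(int *out, int k) {
        __shared__ int s_hist[1024];  // one slot per thread, block size <= 1024
        s_hist[threadIdx.x] = 0;
        __syncthreads();

        int lane = threadIdx.x % 32;
        int leader_slot = threadIdx.x - lane;             // warp leader's slot
        int pos = (lane < k) ? leader_slot : threadIdx.x; // collide or not
        for (int i = 0; i < NITER; i++)
            atomicAdd(&s_hist[pos], 1);
        __syncthreads();

        out[blockIdx.x * blockDim.x + threadIdx.x] = s_hist[threadIdx.x];
    }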

  12. Case studies: Filtering
      • Filtering / Stream compaction
        Input:     2 1 3 0 0 1 3 4 0 0 2 1
        Predicate: element > 0
        Output:    2 1 3 1 3 4 2 1

  13. Case studies: Filtering
      • Filtering / Stream compaction
        – Global memory atomics:

        __global__ void filter_k(int *dst, int *nres, const int *src, int n, int value) {
            int i = threadIdx.x + blockIdx.x * blockDim.x;
            if (i < n && src[i] != value) {
                int index = atomicAdd(nres, 1);
                dst[index] = src[i];
            }
        }

        – Shared memory atomics:

        __global__ void filter_shared_k(int *dst, int *nres, const int *src, int n, int value) {
            __shared__ int l_n;
            int i = blockIdx.x * (NPER_THREAD * BS) + threadIdx.x;
            for (int iter = 0; iter < NPER_THREAD; iter++) {
                // zero the counter
                if (threadIdx.x == 0)
                    l_n = 0;
                __syncthreads();
                // get the value, evaluate the predicate, and
                // increment the counter if needed
                int d, pos;
                if (i < n) {
                    d = src[i];
                    if (d != value)
                        pos = atomicAdd(&l_n, 1);
                }
                __syncthreads();
                // leader increments the global counter
                if (threadIdx.x == 0)
                    l_n = atomicAdd(nres, l_n);
                __syncthreads();
                // threads with true predicates write their elements
                if (i < n && d != value) {
                    pos += l_n;  // increment local pos by global counter
                    dst[pos] = d;
                }
                __syncthreads();
                i += BS;
            }
        }
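A possible host-side launch for the shared memory version (the BS and NPER_THREAD values are assumptions; they only need to match the kernel's index math):

    #define BS 256         // threads per block (assumed)
    #define NPER_THREAD 4  // elements handled per thread (assumed)

    int zero = 0;
    cudaMemcpy(d_nres, &zero, sizeof(int), cudaMemcpyHostToDevice);
    int per_block = BS * NPER_THREAD;                // elements covered by one block
    int nblocks = (n + per_block - 1) / per_block;
    filter_shared_k<<<nblocks, BS>>>(d_dst, d_nres, d_src, n, value);
    int h_nres;
    cudaMemcpy(&h_nres, d_nres, sizeof(int), cudaMemcpyDeviceToHost);
    // h_nres now holds how many elements passed the predicate into d_dst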

  14. Case studies: Filtering
      • Filtering / Stream compaction: shared memory atomics

  15. Case studies: Filtering
      • Filtering / Stream compaction
      • Find more: CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics
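For reference, the warp-aggregated increment from that Pro Tip looks roughly like this with the pre-Volta intrinsics of this era (a sketch, not a drop-in for current toolkits, where __ballot and __shfl become the *_sync variants):

    __device__ int atomicAggInc(int *ctr) {
        int mask = __ballot(1);                 // bitmask of active lanes in the warp
        int leader = __ffs(mask) - 1;           // lowest active lane is the leader
        int lane = threadIdx.x % warpSize;
        int res;
        if (lane == leader)
            res = atomicAdd(ctr, __popc(mask)); // one atomic for the whole warp
        res = __shfl(res, leader);              // broadcast the base offset
        // each lane claims its slot: base + number of active lanes below it
        return res + __popc(mask & ((1 << lane) - 1));
    }

In filter_k above, the per-thread atomicAdd(nres, 1) then becomes dst[atomicAggInc(nres)] = src[i];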

  16. Case studies: Histogramming
      • Privatization for histogram generation

  17. Case studies: Histogramming
      • Privatization
        – 256-bin histogram calculation for 100 real images
      • The shared memory implementation uses 1 sub-histogram per block
      • Global atomics were greatly improved in Kepler

  18. Case studies: Histogramming
      • Histogram calculation

        for (each pixel i in image I) {
            Pixel  = I[i]                   // read pixel
            Pixel' = Computation(Pixel)     // optional computation
            Histogram[Pixel']++             // vote in histogram bin
        }
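A minimal CUDA sketch of the privatized version with one 256-bin sub-histogram per block in shared memory (the kernel and buffer names are made up):

    __global__ void histo_private_k(unsigned int *g_histo,
                                    const unsigned char *image, int n) {
        __shared__ unsigned int s_histo[256];  // per-block sub-histogram
        for (int b = threadIdx.x; b < 256; b += blockDim.x)
            s_histo[b] = 0;
        __syncthreads();

        // grid-stride loop: votes go to the block-private copy (shared memory atomics)
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += blockDim.x * gridDim.x)
            atomicAdd(&s_histo[image[i]], 1u);
        __syncthreads();

        // merge the sub-histogram into the global one (global memory atomics)
        for (int b = threadIdx.x; b < 256; b += blockDim.x)
            atomicAdd(&g_histo[b], s_histo[b]);
    }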

  19. Case studies: Histogramming
      • Histogram calculation
      • Natural images show spatial correlation, so nearby pixels tend to vote into the same bins and cause position conflicts

  20. Case studies: Histogramming
      • Histogram calculation
      • Privatization + Replication + Padding (see the sketch below)
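One way to realize all three techniques together (a sketch; R = 32 replicas and the interleaved layout are illustrative assumptions, not necessarily the slides' exact scheme): each sub-histogram gets B + 1 entries, and since B + 1 = 257 ≡ 1 (mod 32), replica r of bin b lands in bank (r + b) mod 32, so the 32 replicas of any bin occupy 32 distinct banks.

    #define B 256  // histogram bins
    #define R 32   // replicas per block (assumed; one per lane)

    __global__ void histo_rep_pad_k(unsigned int *g_histo,
                                    const unsigned char *image, int n) {
        // R sub-histograms of B+1 entries each; the +1 is the padding (~32 KB total)
        __shared__ unsigned int s_histo[R * (B + 1)];
        for (int j = threadIdx.x; j < R * (B + 1); j += blockDim.x)
            s_histo[j] = 0;
        __syncthreads();

        int r = threadIdx.x % R;  // each lane votes into its own replica
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += blockDim.x * gridDim.x)
            atomicAdd(&s_histo[r * (B + 1) + image[i]], 1u);
        __syncthreads();

        // reduce the replicas and merge into the global histogram
        for (int b = threadIdx.x; b < B; b += blockDim.x) {
            unsigned int sum = 0;
            for (int k = 0; k < R; k++)
                sum += s_histo[k * (B + 1) + b];
            atomicAdd(&g_histo[b], sum);
        }
    }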

  21. Case studies: Histogramming
      • Histogram calculation: 100 real images, Privatization + Replication + Padding
      [Figure: execution time (ms) vs. replication factor (1 to 32) and histogram size (bins, up to 256), on GTX 580 (Fermi GF110), K40c (Kepler GK110), and GTX 980 (Maxwell GM204)]

  22. Case studies: Histogramming
      • Privatization
        – 256-bin histogram calculation for 100 real images
      • The shared memory implementation uses 1 sub-histogram per block
      • Global atomics were greatly improved in Kepler

  23. Case studies: Reduction
      • Reduction
        – A tree-based algorithm is recommended (avoid the scatter style)
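A minimal sketch of the recommended tree-based pattern, finishing with a single atomicAdd per block to combine partial sums (the names and the 256-thread, power-of-two block size are assumptions):

    __global__ void reduce_tree_k(int *g_sum, const int *in, int n) {
        __shared__ int s_data[256];  // one partial value per thread
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        s_data[threadIdx.x] = (i < n) ? in[i] : 0;
        __syncthreads();

        // tree reduction: halve the number of active threads each step
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                s_data[threadIdx.x] += s_data[threadIdx.x + stride];
            __syncthreads();
        }

        // one global atomic per block, instead of one per element
        if (threadIdx.x == 0)
            atomicAdd(g_sum, s_data[0]);
    }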
