Atomic Operations across GPU generations
Juan Gómez-Luna
University of Córdoba (Spain)
University of Illinois at Urbana-Champaign. ECE 408. October 22, 2015
Atomic Operations across GPU generations Juan Gmez-Luna University - - PowerPoint PPT Presentation
University of Illinois at Urbana-Champaign. ECE 408. October 22, 2015 Atomic Operations across GPU generations Juan Gmez-Luna University of Crdoba (Spain) About me Juan Gmez-Luna Telecommunications Engineering (University of Sevilla,
University of Illinois at Urbana-Champaign. ECE 408. October 22, 2015
2 Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
3 Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
4 Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
5 Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
6 Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
– CUDA: int atomicAdd(int*, int); – PTX: atom.shared.add.u32 %r25, [%rd14], 1; – SASS: Tesla, Fermi, Kepler Maxwell
Native atomic operations for 32-bit integer , and 32-bit and 64-bit atomicCAS
– Lock/Update/Unlock vs. Native atomic operations
7
/*00a0*/ LDSLK P0, R9, [R8]; /*00a8*/ @P0 IADD R10, R9, R7; /*00b0*/ @P0 STSCUL P1, [R8], R10; /*00b8*/ @!P1 BRA 0xa0; /*01f8*/ ATOMS.ADD RZ, [R7], R11; Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
8
tbase tconflict Shared memory Shared memory tbase
No atomic conflict = concurrent votes Atomic conflict = serialized votes
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
9 Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
10
0.00# 10.00# 20.00# 30.00# 40.00# 50.00# 60.00# 70.00# 80.00# 1# 4# 7#10# 13# 16#19# 22#25# 28# 31#1# 4# 7#10# 13# 16#19# 22#25# 28# 31#1# 4# 7#10# 13# 16#19# 22#25# 28# 31#1# 4# 7#10# 13# 16#19# 22#25# 28# 31#1# 4# 7#10# 13# 16#19# 22#25# 28# 31#1# 4# 7#10# 13# 16#19# 22#25# 28# 31# 32# 64# 128# 256# 512# 1024#
Execu2on#2me#(ms)# Number#of#threads#in#the#warp#vo2ng#in#the#same#loca2on#than#warp#leader#thread# Block#size#
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
11
0.00# 1.00# 2.00# 3.00# 4.00# 5.00# 6.00# 7.00# 8.00# 9.00# 10.00# 1# 4# 7#10#13#16#19#22#25#28#31#1# 4# 7#10#13#16#19#22#25#28#31#1# 4# 7#10#13#16#19#22#25#28#31#1# 4# 7#10#13#16#19#22#25#28#31#1# 4# 7#10#13#16#19#22#25#28#31#1# 4# 7#10#13#16#19#22#25#28#31# 32# 64# 128# 256# 512# 1024#
Execu2on#2me#(ms)# Number#of#threads#in#the#warp#vo2ng#in#the#same#loca2on#than#warp#leader#thread# Block#size#
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
12
2 1 3 1 3 4 2 1 2 1 3 1 3 4 2 1 Predicate: Element > 0 Input Output Stream compaction
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
__global__ void filter_k(int *dst, int *nres, const int *src, int n, int value) { int i = threadIdx.x + blockIdx.x * blockDim.x; if(i < n && src[i] != value){ int index = atomicAdd(nres, 1); dst[index] = src[i]; } }
– Global memory atomics – Shared memory atomics
13
__global__ void filter_shared_k(int *dst, int *nres, const int* src, int n, int value) { __shared__ int l_n; int i = blockIdx.x * (NPER_THREAD * BS) + threadIdx.x; for (int iter = 0; iter < NPER_THREAD; iter++) { // zero the counter if (threadIdx.x == 0) l_n = 0; __syncthreads(); // get the value, evaluate the predicate, and // increment the counter if needed int d, pos; if(i < n) { d = src[i]; if(d != value)
pos = atomicAdd(&l_n, 1);
} __syncthreads(); // leader increments the global counter if(threadIdx.x == 0)
l_n = atomicAdd(nres, l_n);
__syncthreads(); // threads with true predicates write their elements if(i < n && d != value) { pos += l_n; // increment local pos by global counter dst[pos] = d; } __syncthreads(); i += BS; } }
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
14
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
15
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
16 Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
– 256-bin histogram calculation for 100 real images
17 Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
18
For (each pixel i in image I){ Pixel = I[i]
Pixel’ = Computation(Pixel) // Optional computation Histogram[Pixel’]++
}
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
19
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
20
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
21
0.0# 0.2# 0.4# 0.6# 0.8# 1.0# 1.2# 1.4# 1# 2# 4# 8# 16# 32# 64# 128# 189# 1# 2# 4# 8# 16# 32# 47# 64# 256#
GTX#580#(Fermi#GF110)# K40c#(Kepler#GK110)# Replica=on#factor# Histogram#size#(bins)# Execu=on#=me#(ms)#
0.0# 0.2# 0.4# 0.6# 0.8# 1.0# 1.2# 1.4# 1# 2# 4# 8# 16# 32# 64# 128# 189# 1# 2# 4# 8# 16# 32# 47# 64# 256#
GTX#580#(Fermi#GF110)# K40c#(Kepler#GK110)# GTX#980#(Maxwell#GM204)# Replica@on#factor# Histogram#size#(bins)# Execu@on#@me#(ms)#
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
– 256-bin histogram calculation for 100 real images
22 Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
– Tree-based algorithm is recommended (avoid scatter style)
23
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
– 7 versions in CUDA samples: Tree-based reduction in shared memory
memory
24
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
25
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
26
27 Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
28
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
29 Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
30 __global__ void s2g_gpu_scatter_kernel(unsigned int* in, unsigned int* out, unsigned int num_in, unsigned int num_out) { unsigned int inIdx = blockIdx.x*blockDim.x + threadIdx.x; if(inIdx < num_in) { unsigned int intermediate = outInvariant(in[inIdx]); for(unsigned int outIdx = 0; outIdx < num_out; ++outIdx) { atomicAdd(&(out[outIdx]), outDependent(intermediate, inIdx, outIdx)); } } } __global__ void s2g_gpu_gather_kernel(unsigned int* in, unsigned int* out, unsigned int num_in, unsigned int num_out) { unsigned int outIdx = blockIdx.x*blockDim.x + threadIdx.x; if(outIdx < num_out) { unsigned int out_reg = 0; for(unsigned int inIdx = 0; inIdx < num_in; ++inIdx) { unsigned int intermediate = outInvariant(in[inIdx]);
}
} }
31
Tesla: No L2 cache
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
32
Fermi: ROPs in L2 cache
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
33
Kepler: Buffer in ROPs
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
34 Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
Padding
35
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
36
Less parallelism More parallelism
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
37 Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
38 Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
39
Load Row 4 Store Row 4 Load Row 3
A S
Store Row 3 Load Row 2 Store Row 2 Load Row 1 Store Row 1 Load Row 4 Store Row 4 Load Row 3 Store Row 3 Load Row 2 Store Row 2 Load Row 1 Store Row 1 Adjacent Synchronization Kernel Termination and Re-launch Time
A S A S A S
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
40
__syncthreads(); if (tid == 0){ // Wait while(atomicOr(&flags[bid_ - 1], 0) == 0){;} // Set flag atomicOr(&flags[bid_], 1); } __syncthreads();
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
__shared__ int bid_; if (tid == 0) bid_ = atomicAdd(&S, 1); __syncthreads;
41 Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
On-chip memory (registers and shared memory)
42
Global memory 4 concurrent blocks (size = 2 threads)
Flags
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
43
0.0# 20.0# 40.0# 60.0# 80.0# 100.0# 120.0# 140.0# 4950# 4955# 4960# 4965# 4970# 4975# 4980# 4985# 4990# 4995#
Throughput#(GB/s)# Number#of#Columns#(before#padding)# Number#of#Rows#=#5000#
DS#Padding# Baseline# 0.0# 20.0# 40.0# 60.0# 80.0# 100.0# 120.0# 4950# 4955# 4960# 4965# 4970# 4975# 4980# 4985# 4990# 4995#
Throughput#(GB/s)# Number#of#Columns#(before#padding)# Number#of#Rows#=#5000#
DS#Padding# Baseline# 0.0# 20.0# 40.0# 60.0# 80.0# 100.0# 120.0# 140.0# 4950# 4955# 4960# 4965# 4970# 4975# 4980# 4985# 4990# 4995#
Throughput#(GB/s)# Number#of#Columns#(aCer#unpadding)# Number#of#Rows#=#5000#
DS#Unpadding# Baseline# 0.0# 20.0# 40.0# 60.0# 80.0# 100.0# 4950# 4955# 4960# 4965# 4970# 4975# 4980# 4985# 4990# 4995#
Throughput#(GB/s)# Number#of#Columns#(aCer#unpadding)# Number#of#Rows#=#5000#
DS#Unpadding# Baseline#
Padding on NVIDIA Maxwell Padding on AMD Hawaii Unpadding on NVIDIA Maxwell Unpadding on AMD Hawaii
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
44 Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
45
__syncthreads(); if (tid == 0){ // Wait while(atomicOr(&flags[bid_ - 1], 0) == 0){;} // Set flag int flag = flags[bid_ - 1]; atomicAdd(&flags[bid_], flag + count); count = flag; } __syncthreads();
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
46
Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
kernel
47 Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
48 Juan Gómez Luna University of Illinois at Urbana-Champaign. ECE408. October 22, 2015
University of Illinois at Urbana-Champaign. ECE 408. October 22, 2015