

slide-1
SLIDE 1

Atomic Operations across GPU generations

Juan Gómez-Luna

University of Córdoba (Spain)

University of Illinois at Urbana-Champaign. ECE 408. October 22, 2015

slide-2
SLIDE 2

About me

  • Juan Gómez-Luna
  • Telecommunications Engineering (University of Sevilla, 2001)
  • Lecturer at the University of Córdoba since 2005
  • PhD Thesis (University of Córdoba, 2012)

– Programming Issues for Video Analysis on Graphics Processing Units

  • Research collaborations:

– Technical University Munich (Germany)
– Technical University Eindhoven (The Netherlands)
– University of Illinois at Urbana-Champaign (USA)
– University of Málaga (Spain)
– Barcelona Supercomputing Center (Spain)

  • PI of the University of Córdoba GPU Education Center (supported by NVIDIA)


slide-3
SLIDE 3

Outline

  • Uses of atomic operations
  • Atomic operations on shared memory
    – Evolution across GPU generations
    – Case studies
      • Stream compaction
      • Histogramming
      • Reduction
  • Atomic operations on global memory
    – Evolution across GPU generations
    – Case studies
      • Scatter vs. gather
      • Adjacent thread block synchronization


slide-4
SLIDE 4

Uses of atomic operations

  • Collaboration
    – Atomics on an array that will be the output of the kernel
    – Example: histogramming
  • Synchronization
    – Atomics on memory locations that are used for synchronization or coordination
    – Example: locks, flags…


slide-5
SLIDE 5

Uses of atomic operations

  • CUDA provides atomic functions on shared memory and global memory
  • Arithmetic functions
    – add, sub, max, min, exch, inc, dec, CAS
    – e.g., int atomicAdd(int*, int);
  • Bitwise functions
    – and, or, xor
  • Operand types: int, unsigned int, unsigned long long, and float (depending on the function)
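As a quick illustration, a made-up demo kernel exercising several of these functions; each call atomically reads, modifies, and writes its operand and returns the old value:

__global__ void atomics_demo(int *counter, unsigned int *mask, float *sum) {
    atomicAdd(counter, 1);                      // arithmetic: add
    atomicMax(counter, (int)threadIdx.x);       // arithmetic: max
    atomicOr(mask, 1u << (threadIdx.x & 31));   // bitwise: or
    atomicAdd(sum, 0.5f);                       // float add (compute capability 2.0+)
    atomicCAS(counter, 0, 42);                  // CAS: store 42 only if *counter == 0
}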


slide-6
SLIDE 6

Outline

  • Uses of atomic operations
  • Atomic operations on shared memory
    – Evolution across GPU generations
    – Case studies
      • Stream compaction
      • Histogramming
      • Reduction
  • Atomic operations on global memory
    – Evolution across GPU generations
    – Case studies
      • Scatter vs. gather
      • Adjacent thread block synchronization


slide-7
SLIDE 7
Atomic operations on shared memory

  • Code
    – CUDA: int atomicAdd(int*, int);
    – PTX: atom.shared.add.u32 %r25, [%rd14], 1;
    – SASS: lock/update/unlock on Tesla, Fermi, and Kepler vs. native atomic operations on Maxwell

Tesla, Fermi, Kepler (lock/update/unlock loop):

/*00a0*/ LDSLK P0, R9, [R8];
/*00a8*/ @P0 IADD R10, R9, R7;
/*00b0*/ @P0 STSCUL P1, [R8], R10;
/*00b8*/ @!P1 BRA 0xa0;

Maxwell (native atomic operations for 32-bit integers, plus 32-bit and 64-bit atomicCAS):

/*01f8*/ ATOMS.ADD RZ, [R7], R11;
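To see what the pre-Maxwell retry loop amounts to in CUDA terms, here is an illustrative sketch that emulates atomicAdd with atomicCAS, following the same load / update / conditional-store / retry pattern (an analogy to the SASS above, not actual compiler output):

__device__ int atomicAddViaCAS(int *addr, int val) {
    int old = *addr, assumed;
    do {
        assumed = old;                                  // load (cf. LDSLK)
        old = atomicCAS(addr, assumed, assumed + val);  // conditional store (cf. STSCUL)
    } while (assumed != old);                           // retry on interference (cf. BRA)
    return old;
}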

slide-8
SLIDE 8
Atomic operations on shared memory

  • Atomic conflict degree
    – Intra-warp conflict degree ranges from 1 to 32

[Figure: two shared memory timelines. No atomic conflict: votes proceed concurrently in time t_base. Atomic conflict: votes are serialized, taking t_base + t_conflict]
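The microbenchmarks on the next slides can be reproduced with a kernel along these lines (a sketch; the kernel name, iteration count, and output array are assumptions): a configurable number of lanes vote in the warp leader's location, and the remaining lanes vote in private locations.

__global__ void conflict_bench(int *out, int conflictDegree, int iters) {
    __shared__ int bins[1024];
    int lane = threadIdx.x & 31;
    bins[threadIdx.x] = 0;
    __syncthreads();
    // Lanes [0, conflictDegree) vote where the warp leader votes;
    // the other lanes vote in their own private location.
    int bin = (lane < conflictDegree) ? (threadIdx.x & ~31) : threadIdx.x;
    for (int i = 0; i < iters; ++i)
        atomicAdd(&bins[bin], 1);
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = bins[threadIdx.x];
}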


slide-9
SLIDE 9

Atomic operations on shared memory

  • Microbenchmarking on Tesla, Fermi and Kepler

– Position conflicts (GTX 580 – Fermi)


slide-10
SLIDE 10

Atomic operations on shared memory

  • Microbenchmarking on Tesla, Fermi and Kepler

– Position conflicts (K20 – Kepler)

[Chart: execution time (ms) vs. number of threads in the warp voting in the same location as the warp leader thread (1 to 31), grouped by block size (32, 64, 128, 256, 512, 1024)]


slide-11
SLIDE 11

Atomic operations on shared memory

  • Microbenchmarking on Maxwell

– Position conflicts (GTX 980 – Maxwell)

[Chart: execution time (ms) vs. number of threads in the warp voting in the same location as the warp leader thread (1 to 31), grouped by block size (32, 64, 128, 256, 512, 1024)]


slide-12
SLIDE 12
Case studies: Filtering

  • Filtering / Stream compaction

[Figure: stream compaction. The input array is filtered with the predicate "element > 0"; the elements that pass are written contiguously to the output array]


slide-13
SLIDE 13

Case studies: Filtering

  • Filtering / Stream compaction
    – Global memory atomics
    – Shared memory atomics

Global memory atomics:

__global__ void filter_k(int *dst, int *nres, const int *src, int n, int value) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n && src[i] != value) {
        int index = atomicAdd(nres, 1);   // reserve a unique output slot
        dst[index] = src[i];
    }
}

Shared memory atomics (BS = block size, NPER_THREAD = elements per thread, both defined elsewhere):

__global__ void filter_shared_k(int *dst, int *nres, const int *src, int n, int value) {
    __shared__ int l_n;
    int i = blockIdx.x * (NPER_THREAD * BS) + threadIdx.x;
    for (int iter = 0; iter < NPER_THREAD; iter++) {
        // zero the counter
        if (threadIdx.x == 0)
            l_n = 0;
        __syncthreads();
        // get the value, evaluate the predicate, and
        // increment the counter if needed
        int d, pos;
        if (i < n) {
            d = src[i];
            if (d != value)
                pos = atomicAdd(&l_n, 1);
        }
        __syncthreads();
        // leader increments the global counter
        if (threadIdx.x == 0)
            l_n = atomicAdd(nres, l_n);
        __syncthreads();
        // threads with true predicates write their elements
        if (i < n && d != value) {
            pos += l_n;   // increment local pos by global counter
            dst[pos] = d;
        }
        __syncthreads();
        i += BS;
    }
}


slide-14
SLIDE 14
Case studies: Filtering

  • Filtering / Stream compaction: Shared memory atomics


slide-15
SLIDE 15
Case studies: Filtering

  • Filtering / Stream compaction
    – Find more in the CUDA Pro Tip "Optimized Filtering with Warp-Aggregated Atomics"
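The technique in that Pro Tip lets one leader lane perform a single atomic on behalf of all active lanes in its warp. A minimal sketch using the Kepler-era __ballot/__shfl intrinsics that match this 2015 setting (post-Volta code would use the *_sync variants):

__device__ int atomicAggInc(int *ctr) {
    unsigned int active = __ballot(1);          // mask of lanes that reached this point
    int lane   = threadIdx.x & 31;
    int leader = __ffs(active) - 1;             // lowest active lane
    int base;
    if (lane == leader)
        base = atomicAdd(ctr, __popc(active));  // one atomic for the whole warp
    base = __shfl(base, leader);                // broadcast the base offset
    // each lane's slot: base + number of active lanes below it
    return base + __popc(active & ((1u << lane) - 1));
}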


slide-16
SLIDE 16

Case studies: Histogramming

  • Privatization for histogram generation


slide-17
SLIDE 17
Case studies: Histogramming

  • Privatization

– 256-bin histogram calculation for 100 real images

  • Shared memory implementation uses 1 sub-histogram per block
  • Global atomics were greatly improved in Kepler


slide-18
SLIDE 18
Case studies: Histogramming

  • Histogram calculation

for (each pixel i in image I) {
    Pixel  = I[i]                   // Read pixel
    Pixel' = Computation(Pixel)     // Optional computation
    Histogram[Pixel']++             // Vote in histogram bin
}
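In CUDA, this loop parallelizes naturally with privatization: each block votes into its own shared-memory sub-histogram and merges it into the global histogram at the end. A minimal 256-bin sketch (the kernel name and grid-stride loop are assumptions):

__global__ void histo_private_k(const unsigned char *image, int n, unsigned int *histo) {
    __shared__ unsigned int sub[256];           // per-block private histogram
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        sub[b] = 0;
    __syncthreads();
    // grid-stride loop: each thread votes for its pixels with shared memory atomics
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&sub[image[i]], 1u);
    __syncthreads();
    // merge this block's sub-histogram into the global one (256 global atomics per block)
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&histo[b], sub[b]);
}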


slide-19
SLIDE 19
Case studies: Histogramming

  • Histogram calculation
  • Natural images exhibit spatial correlation: neighboring pixels (and thus neighboring threads) tend to vote in the same bin, raising the atomic conflict degree


slide-20
SLIDE 20
Case studies: Histogramming

  • Histogram calculation
  • Privatization + Replication + Padding (see the indexing sketch below)
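Replication gives each block several sub-histograms so that a warp's votes spread across replicas, and one entry of padding per replica shifts same-numbered bins into different shared memory banks. A hedged indexing sketch (the replication factor R and the lane-to-replica assignment are assumptions):

#define BINS 256
#define R    8                                  // replication factor (assumption)

// inside the kernel:
__shared__ unsigned int sub[R * (BINS + 1)];    // +1 padding entry per replica
// ... zero sub[] cooperatively, then __syncthreads() ...
int replica = threadIdx.x % R;                  // spread a warp's lanes over replicas
atomicAdd(&sub[replica * (BINS + 1) + bin], 1u);
// ... __syncthreads(), then reduce the R replicas into the global histogram ...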


slide-21
SLIDE 21
Case studies: Histogramming

  • Histogram calculation: 100 real images
  • Privatization + Replication + Padding

[Charts: execution time (ms) vs. replication factor and histogram size (bins), on GTX 580 (Fermi GF110), K40c (Kepler GK110), and GTX 980 (Maxwell GM204)]


slide-22
SLIDE 22
Case studies: Histogramming

  • Privatization

– 256-bin histogram calculation for 100 real images

  • Shared memory implementation uses 1 sub-histogram per block
  • Global atomics were greatly improved in Kepler


slide-23
SLIDE 23
Case studies: Reduction

  • Reduction
    – A tree-based algorithm is recommended (avoid the scatter style); see the sketch below
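A minimal sketch of such a tree-based shared memory reduction, in the spirit of version 2 from the CUDA samples listed on the next slide (sequential addressing; the kernel name and fixed block size are assumptions):

#define BS 256
__global__ void reduce_tree_k(const int *in, int *out, int n) {
    __shared__ int sdata[BS];                   // assumes blockDim.x == BS
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0;
    __syncthreads();
    // tree-based reduction, sequential addressing: the stride halves each step
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];             // one partial sum per block
}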


slide-24
SLIDE 24
Case studies: Reduction

  • Reduction

– 7 versions in CUDA samples: Tree-based reduction in shared memory

  • Version 0: No whole warps active
  • Version 1: Contiguous threads, but many bank conflicts
  • Version 2: No bank conflicts
  • Version 3: First level of reduction when reading from global memory

  • Version 4: Warp shuffle or unrolling of final warp
  • Version 5: Warp shuffle or complete unrolling
  • Version 6: Multiple elements per thread sequentially
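Versions 4 and 5 replace the last levels of the shared memory tree with register-to-register warp shuffles. A sketch of the final-warp step with the pre-Volta __shfl_down intrinsic that matches this 2015 context (post-Volta code would use __shfl_down_sync):

__device__ int warpReduceSum(int val) {
    // each step folds the upper half of the warp onto the lower half
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down(val, offset);
    return val;   // lane 0 ends up holding the warp's sum
}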


slide-25
SLIDE 25

Case studies: Reduction

  • Reduction

slide-26
SLIDE 26

Case studies: Reduction

  • Reduction

slide-27
SLIDE 27

Outline

  • Uses of atomic operations
  • Atomic operations on shared memory
    – Evolution across GPU generations
    – Case studies
      • Stream compaction
      • Histogramming
      • Reduction
  • Atomic operations on global memory
    – Evolution across GPU generations
    – Case studies
      • Scatter vs. gather
      • Adjacent thread block synchronization


slide-28
SLIDE 28
Atomic operations on global memory

  • Tesla:
    – Atomics executed on DRAM
  • Fermi:
    – Atomics executed on L2
    – Atomic units near L2
  • Kepler and Maxwell:
    – The atomic units near L2 now have a small local cache (a buffer in the ROPs)


slide-29
SLIDE 29

Case study: Scatter vs. gather

  • Scatter vs. Gather


slide-30
SLIDE 30


Case study: Scatter vs. gather

  • Scatter vs. Gather

__global__ void s2g_gpu_scatter_kernel(unsigned int *in, unsigned int *out,
                                       unsigned int num_in, unsigned int num_out) {
    unsigned int inIdx = blockIdx.x * blockDim.x + threadIdx.x;
    if (inIdx < num_in) {
        unsigned int intermediate = outInvariant(in[inIdx]);
        for (unsigned int outIdx = 0; outIdx < num_out; ++outIdx) {
            // every input thread updates every output element: atomics required
            atomicAdd(&out[outIdx], outDependent(intermediate, inIdx, outIdx));
        }
    }
}

__global__ void s2g_gpu_gather_kernel(unsigned int *in, unsigned int *out,
                                      unsigned int num_in, unsigned int num_out) {
    unsigned int outIdx = blockIdx.x * blockDim.x + threadIdx.x;
    if (outIdx < num_out) {
        unsigned int out_reg = 0;
        for (unsigned int inIdx = 0; inIdx < num_in; ++inIdx) {
            unsigned int intermediate = outInvariant(in[inIdx]);
            // each output element is owned by exactly one thread: no atomics needed
            out_reg += outDependent(intermediate, inIdx, outIdx);
        }
        out[outIdx] += out_reg;
    }
}

slide-31
SLIDE 31

Case study: Scatter vs. gather

  • Scatter vs. Gather


Tesla: No L2 cache


slide-32
SLIDE 32

Case study: Scatter vs. gather

  • Scatter vs. Gather


Fermi: ROPs in L2 cache


slide-33
SLIDE 33

Case study: Scatter vs. gather

  • Scatter vs. Gather


Kepler: Buffer in ROPs


slide-34
SLIDE 34
Case study: Adjacent block synchronization

  • GPU programming with CUDA (or OpenCL) might not completely exploit the inherent parallelism in some algorithms
    – In-place operations
  • There can be dependences between consecutive thread blocks
    – Bulk synchronous parallel programming
  • Thread block synchronization requires kernel termination and relaunch


slide-35
SLIDE 35

Case study: Adjacent block synchronization

  • In-place matrix padding
    – Limited GPU memory makes an in-place implementation desirable

[Figure: matrix padding]


slide-36
SLIDE 36

Case study: Adjacent block synchronization

  • In-place matrix padding
    – Temporary storage in on-chip memory
    – Bulk synchronous programming
      • Global synchronization = kernel termination

[Figure: two scheduling alternatives, labeled "less parallelism" and "more parallelism"]


slide-37
SLIDE 37

Case study: Adjacent block synchronization

  • Motivation: In-place matrix padding
    – Padding a 5000x4900 matrix to 5000x5000
      • Almost 100 rows moved in the first iteration
      • 181 iterations with some parallelism
      • The last 99 iterations are moved sequentially
    – Effective throughput: less than 20% of peak bandwidth


slide-38
SLIDE 38

Case study: Adjacent block synchronization

  • Regular Data Sliding

– Dynamic thread block id allocation

  • Avoids deadlocks

– Loading stage

  • Coarsening factor

– Adjacent thread block synchronization

  • Avoids kernel termination and relaunch

– Storing stage


slide-39
SLIDE 39

Case study: Adjacent block synchronization

  • Timing comparison of the two approaches

[Figure: timeline of Load Row / Store Row pairs for rows 4 down to 1. With kernel termination and relaunch, each row's pass is a separate kernel launch; with adjacent synchronization, consecutive passes are separated only by a short wait (A) and set-flag (S) step, shortening the total time]


slide-40
SLIDE 40

Case study: Adjacent block synchronization

  • Regular Data Sliding

– Adjacent block synchronization (Yan et al., 2013)

  • Leader thread waits for previous block flag set
  • Avoids kernel termination and relaunch

__syncthreads();
if (tid == 0) {
    // Wait for the previous block's flag (atomicOr with 0 is an atomic read)
    while (atomicOr(&flags[bid_ - 1], 0) == 0) { ; }
    // Set this block's flag
    atomicOr(&flags[bid_], 1);
}
__syncthreads();


slide-41
SLIDE 41

Case study: Adjacent block synchronization

  • Regular Data Sliding

– Dynamic block id allocation

  • Avoids deadlocks: blocks scheduled earlier receive lower ids, so a block never waits on a flag owned by a block that has not yet been scheduled

__shared__ int bid_;
if (tid == 0)
    bid_ = atomicAdd(&S, 1);   // S: global counter, presumably zeroed before launch
__syncthreads();
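Putting the two idioms together, a hedged skeleton of a sliding kernel (the kernel name, the load/store placeholders, and the zero-initialization of S and flags are assumptions, not from the slides):

__device__ int S;   // dynamic block id counter, assumed zeroed before each launch

__global__ void slide_k(int *data, int *flags /* assumed zeroed */) {
    __shared__ int bid_;
    int tid = threadIdx.x;
    if (tid == 0)
        bid_ = atomicAdd(&S, 1);             // dynamic block id
    __syncthreads();
    // ... loading stage: read this block's chunk into registers/shared memory ...
    if (tid == 0) {
        if (bid_ > 0)                        // the first block has no predecessor
            while (atomicOr(&flags[bid_ - 1], 0) == 0) { ; }
        atomicOr(&flags[bid_], 1);           // publish this block's flag
    }
    __syncthreads();
    // ... storing stage: write the chunk to its new (slid) location ...
}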


slide-42
SLIDE 42

Case study: Adjacent block synchronization

[Figure: four concurrent blocks (two threads each) slide data through global memory, staging each chunk in on-chip memory (registers and shared memory); a flags array in global memory coordinates adjacent blocks]


slide-43
SLIDE 43

Case study: Adjacent block synchronization

  • Regular Data Sliding: Padding and Unpadding
    – Baseline = the bulk synchronous implementation from the motivation slide
    – Up to 9.11x faster (Maxwell) and up to 73.25x faster (Hawaii)

[Charts: throughput (GB/s) vs. number of columns before padding / after unpadding (4950 to 4995), with 5000 rows; DS padding/unpadding vs. baseline. Panels: padding on NVIDIA Maxwell, padding on AMD Hawaii, unpadding on NVIDIA Maxwell, unpadding on AMD Hawaii]


slide-44
SLIDE 44

Case study: Adjacent block synchronization

  • Irregular Data Sliding
    – Dynamic block id allocation
    – Loading stage
      • Local counter
    – Reduction
    – Adjacent block synchronization
    – Storing stage
      • Binary prefix-sum within the thread block (see the ballot-based sketch below)
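The binary prefix-sum (each thread contributes a 0/1 predicate) can be computed per warp with ballot and popcount, as in this sketch with pre-Volta intrinsics; warp totals still have to be combined across the block through shared memory:

// Per-warp binary prefix sum of a 0/1 predicate 'pred'
unsigned int ballot = __ballot(pred);                   // one bit per lane
int lane       = threadIdx.x & 31;
int my_prefix  = __popc(ballot & ((1u << lane) - 1));   // active lanes below me
int warp_total = __popc(ballot);                        // votes in this warp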


slide-45
SLIDE 45

Case study: Adjacent block synchronization

  • Irregular Data Sliding

– Adjacent block synchronization

  • count (a shared memory variable) contains the block's reduction result
  • flag + count is the prefix sum of the blocks' reductions
  • count = flag makes the prefix visible to all threads in the block

__syncthreads();
if (tid == 0) {
    // Wait for the previous block's flag
    while (atomicOr(&flags[bid_ - 1], 0) == 0) { ; }
    // Set this block's flag to the running prefix sum
    int flag = flags[bid_ - 1];
    atomicAdd(&flags[bid_], flag + count);
    count = flag;
}
__syncthreads();


slide-46
SLIDE 46

Case study: Adjacent block synchronization

  • Irregular Data Sliding: Select
    – Up to 3.05x faster than Thrust on Maxwell
    – 2.80x on Kepler
    – 1.78x on Fermi


slide-47
SLIDE 47

Case study: Adjacent block synchronization

  • Irregular Data Sliding
    – Stream compaction
      • Our in-place, stable implementation reaches 68% of the performance of the fastest out-of-place, unstable kernel
    – Unique
      • Up to 3.24x faster than Thrust on Maxwell
      • 2.73x on Kepler
      • 1.66x on Fermi
    – Partition
      • Up to 2.84x faster than Thrust on Maxwell
      • 2.88x on Kepler
      • 1.64x on Fermi


slide-48
SLIDE 48
Summary

  • Significant hardware improvements for atomic operations
    – Shared memory: native integer atomics
    – Global memory: L2 + buffer in the ROPs
  • They can free programmers from applying software optimizations
    – Histogramming
  • They may allow a more natural way of coding, saving many lines of code
    – Reduction
  • They may allow using new, faster algorithms
    – Filtering
    – Adjacent synchronization


slide-49
SLIDE 49

Atomic Operations across GPU generations

Juan Gómez-Luna

University of Córdoba (Spain)

el1goluj@uco.es
gomezlun@illinois.edu

University of Illinois at Urbana-Champaign. ECE 408. October 22, 2015