

slide-1
SLIDE 1

Atomic Operations across GPU generations

Juan Gómez-Luna

University of Córdoba (Spain)

University of Illinois at Urbana-Champaign. ECE 408. October 22, 2015

slide-2
SLIDE 2

About me

  • Juan Gómez-Luna
  • Telecommunications Engineering (University of Sevilla, 2001)
  • Lecturer at the University of Córdoba since 2005
  • PhD Thesis (University of Córdoba, 2012)

– Programming Issues for Video Analysis on Graphics Processing Units

  • Research collaborations:

– Technical University Munich (Germany)
– Technical University Eindhoven (The Netherlands)
– University of Illinois at Urbana-Champaign (USA)
– University of Málaga (Spain)
– Barcelona Supercomputing Center (Spain)

  • PI of the University of Córdoba GPU Education Center (supported by NVIDIA)


slide-3
SLIDE 3

Outline

  • Uses of atomic operations
  • Atomic operations on shared memory
    – Evolution across GPU generations
    – Case studies
      • Stream compaction
      • Histogramming
      • Reduction
  • Atomic operations on global memory
    – Evolution across GPU generations
    – Case studies
      • Scatter vs. gather
      • Adjacent thread block synchronization


slide-4
SLIDE 4

Uses of atomic operations

  • Collaboration
    – Atomics on an array that will be the output of the kernel
    – Example: histogramming
  • Synchronization
    – Atomics on memory locations that are used for synchronization or coordination
    – Example: locks, flags…


slide-5
SLIDE 5

Uses of atomic operations

  • CUDA provides atomic functions on shared memory and global memory
  • Arithmetic functions
    – add, sub, max, min, exch, inc, dec, CAS
    – e.g., int atomicAdd(int*, int);
  • Bitwise functions
    – and, or, xor
  • Operand types: int, unsigned int, unsigned long long, and float (depending on the function)
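As a quick illustration, a made-up demo kernel exercising several of these functions; each call atomically reads, modifies, and writes its operand and returns the old value:

__global__ void atomics_demo(int *counter, unsigned int *mask, float *sum) {
    atomicAdd(counter, 1);                      // arithmetic: add
    atomicMax(counter, (int)threadIdx.x);       // arithmetic: max
    atomicOr(mask, 1u << (threadIdx.x & 31));   // bitwise: or
    atomicAdd(sum, 0.5f);                       // float add (compute capability 2.0+)
    atomicCAS(counter, 0, 42);                  // CAS: store 42 only if *counter == 0
}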


slide-6
SLIDE 6

Outline

  • Uses of atomic operations
  • Atomic operations on shared memory
    – Evolution across GPU generations
    – Case studies
      • Stream compaction
      • Histogramming
      • Reduction
  • Atomic operations on global memory
    – Evolution across GPU generations
    – Case studies
      • Scatter vs. gather
      • Adjacent thread block synchronization


slide-7
SLIDE 7
Atomic operations on shared memory

  • Code
    – CUDA: int atomicAdd(int*, int);
    – PTX: atom.shared.add.u32 %r25, [%rd14], 1;
    – SASS: lock/update/unlock on Tesla, Fermi, and Kepler vs. native atomic operations on Maxwell

Tesla, Fermi, Kepler (lock/update/unlock loop):

/*00a0*/ LDSLK P0, R9, [R8];
/*00a8*/ @P0 IADD R10, R9, R7;
/*00b0*/ @P0 STSCUL P1, [R8], R10;
/*00b8*/ @!P1 BRA 0xa0;

Maxwell (native atomic operations for 32-bit integers, plus 32-bit and 64-bit atomicCAS):

/*01f8*/ ATOMS.ADD RZ, [R7], R11;
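To see what the pre-Maxwell retry loop amounts to in CUDA terms, here is an illustrative sketch that emulates atomicAdd with atomicCAS, following the same load / update / conditional-store / retry pattern (an analogy to the SASS above, not actual compiler output):

__device__ int atomicAddViaCAS(int *addr, int val) {
    int old = *addr, assumed;
    do {
        assumed = old;                                  // load (cf. LDSLK)
        old = atomicCAS(addr, assumed, assumed + val);  // conditional store (cf. STSCUL)
    } while (assumed != old);                           // retry on interference (cf. BRA)
    return old;
}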

slide-8
SLIDE 8
Atomic operations on shared memory

  • Atomic conflict degree
    – Intra-warp conflict degree ranges from 1 to 32

[Figure: two shared memory timelines. No atomic conflict: votes proceed concurrently in time t_base. Atomic conflict: votes are serialized, taking t_base + t_conflict]
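The microbenchmarks on the next slides can be reproduced with a kernel along these lines (a sketch; the kernel name, iteration count, and output array are assumptions): a configurable number of lanes vote in the warp leader's location, and the remaining lanes vote in private locations.

__global__ void conflict_bench(int *out, int conflictDegree, int iters) {
    __shared__ int bins[1024];
    int lane = threadIdx.x & 31;
    bins[threadIdx.x] = 0;
    __syncthreads();
    // Lanes [0, conflictDegree) vote where the warp leader votes;
    // the other lanes vote in their own private location.
    int bin = (lane < conflictDegree) ? (threadIdx.x & ~31) : threadIdx.x;
    for (int i = 0; i < iters; ++i)
        atomicAdd(&bins[bin], 1);
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = bins[threadIdx.x];
}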


slide-9
SLIDE 9

Atomic operations on shared memory

  • Microbenchmarking on Tesla, Fermi and Kepler

– Position conflicts (GTX 580 – Fermi)


slide-10
SLIDE 10

Atomic operations on shared memory

  • Microbenchmarking on Tesla, Fermi and Kepler

– Position conflicts (K20 – Kepler)

[Chart: execution time (ms) vs. number of threads in the warp voting in the same location as the warp leader thread (1 to 31), grouped by block size (32, 64, 128, 256, 512, 1024)]


slide-11
SLIDE 11

Atomic operations on shared memory

  • Microbenchmarking on Maxwell

– Position conflicts (GTX 980 – Maxwell)

[Chart: execution time (ms) vs. number of threads in the warp voting in the same location as the warp leader thread (1 to 31), grouped by block size (32, 64, 128, 256, 512, 1024)]


slide-12
SLIDE 12
Case studies: Filtering

  • Filtering / Stream compaction

[Figure: stream compaction. The input array is filtered with the predicate "element > 0"; the elements that pass are written contiguously to the output array]


slide-13
SLIDE 13

Case studies: Filtering

  • Filtering / Stream compaction
    – Global memory atomics
    – Shared memory atomics

Global memory atomics:

__global__ void filter_k(int *dst, int *nres, const int *src, int n, int value) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n && src[i] != value) {
        int index = atomicAdd(nres, 1);   // reserve a unique output slot
        dst[index] = src[i];
    }
}

Shared memory atomics (BS = block size, NPER_THREAD = elements per thread, both defined elsewhere):

__global__ void filter_shared_k(int *dst, int *nres, const int *src, int n, int value) {
    __shared__ int l_n;
    int i = blockIdx.x * (NPER_THREAD * BS) + threadIdx.x;
    for (int iter = 0; iter < NPER_THREAD; iter++) {
        // zero the counter
        if (threadIdx.x == 0)
            l_n = 0;
        __syncthreads();
        // get the value, evaluate the predicate, and
        // increment the counter if needed
        int d, pos;
        if (i < n) {
            d = src[i];
            if (d != value)
                pos = atomicAdd(&l_n, 1);
        }
        __syncthreads();
        // leader increments the global counter
        if (threadIdx.x == 0)
            l_n = atomicAdd(nres, l_n);
        __syncthreads();
        // threads with true predicates write their elements
        if (i < n && d != value) {
            pos += l_n;   // increment local pos by global counter
            dst[pos] = d;
        }
        __syncthreads();
        i += BS;
    }
}


slide-14
SLIDE 14
Case studies: Filtering

  • Filtering / Stream compaction: Shared memory atomics


slide-15
SLIDE 15
Case studies: Filtering

  • Filtering / Stream compaction
    – Find more in the CUDA Pro Tip "Optimized Filtering with Warp-Aggregated Atomics"
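The technique in that Pro Tip lets one leader lane perform a single atomic on behalf of all active lanes in its warp. A minimal sketch using the Kepler-era __ballot/__shfl intrinsics that match this 2015 setting (post-Volta code would use the *_sync variants):

__device__ int atomicAggInc(int *ctr) {
    unsigned int active = __ballot(1);          // mask of lanes that reached this point
    int lane   = threadIdx.x & 31;
    int leader = __ffs(active) - 1;             // lowest active lane
    int base;
    if (lane == leader)
        base = atomicAdd(ctr, __popc(active));  // one atomic for the whole warp
    base = __shfl(base, leader);                // broadcast the base offset
    // each lane's slot: base + number of active lanes below it
    return base + __popc(active & ((1u << lane) - 1));
}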


slide-16
SLIDE 16

Case studies: Histogramming

  • Privatization for histogram generation


slide-17
SLIDE 17
Case studies: Histogramming

  • Privatization

– 256-bin histogram calculation for 100 real images

  • Shared memory implementation uses 1 sub-histogram per block
  • Global atomics were greatly improved in Kepler


slide-18
SLIDE 18
Case studies: Histogramming

  • Histogram calculation

for (each pixel i in image I) {
    Pixel  = I[i]                   // Read pixel
    Pixel' = Computation(Pixel)     // Optional computation
    Histogram[Pixel']++             // Vote in histogram bin
}
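In CUDA, this loop parallelizes naturally with privatization: each block votes into its own shared-memory sub-histogram and merges it into the global histogram at the end. A minimal 256-bin sketch (the kernel name and grid-stride loop are assumptions):

__global__ void histo_private_k(const unsigned char *image, int n, unsigned int *histo) {
    __shared__ unsigned int sub[256];           // per-block private histogram
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        sub[b] = 0;
    __syncthreads();
    // grid-stride loop: each thread votes for its pixels with shared memory atomics
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&sub[image[i]], 1u);
    __syncthreads();
    // merge this block's sub-histogram into the global one (256 global atomics per block)
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&histo[b], sub[b]);
}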


slide-19
SLIDE 19
Case studies: Histogramming

  • Histogram calculation
  • Natural images exhibit spatial correlation: neighboring pixels (and thus neighboring threads) tend to vote in the same bin, raising the atomic conflict degree


slide-20
SLIDE 20
Case studies: Histogramming

  • Histogram calculation
  • Privatization + Replication + Padding (see the indexing sketch below)
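Replication gives each block several sub-histograms so that a warp's votes spread across replicas, and one entry of padding per replica shifts same-numbered bins into different shared memory banks. A hedged indexing sketch (the replication factor R and the lane-to-replica assignment are assumptions):

#define BINS 256
#define R    8                                  // replication factor (assumption)

// inside the kernel:
__shared__ unsigned int sub[R * (BINS + 1)];    // +1 padding entry per replica
// ... zero sub[] cooperatively, then __syncthreads() ...
int replica = threadIdx.x % R;                  // spread a warp's lanes over replicas
atomicAdd(&sub[replica * (BINS + 1) + bin], 1u);
// ... __syncthreads(), then reduce the R replicas into the global histogram ...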


slide-21
SLIDE 21
Case studies: Histogramming

  • Histogram calculation: 100 real images
  • Privatization + Replication + Padding

[Charts: execution time (ms) vs. replication factor and histogram size (bins), on GTX 580 (Fermi GF110), K40c (Kepler GK110), and GTX 980 (Maxwell GM204)]


slide-22
SLIDE 22
Case studies: Histogramming

  • Privatization

– 256-bin histogram calculation for 100 real images

  • Shared memory implementation uses 1 sub-histogram per block
  • Global atomics were greatly improved in Kepler


slide-23
SLIDE 23
Case studies: Reduction

  • Reduction
    – A tree-based algorithm is recommended (avoid the scatter style); see the sketch below
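A minimal sketch of such a tree-based shared memory reduction, in the spirit of version 2 from the CUDA samples listed on the next slide (sequential addressing; the kernel name and fixed block size are assumptions):

#define BS 256
__global__ void reduce_tree_k(const int *in, int *out, int n) {
    __shared__ int sdata[BS];                   // assumes blockDim.x == BS
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0;
    __syncthreads();
    // tree-based reduction, sequential addressing: the stride halves each step
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];             // one partial sum per block
}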


slide-24
SLIDE 24
Case studies: Reduction

  • Reduction

– 7 versions in CUDA samples: Tree-based reduction in shared memory

  • Version 0: No whole warps active
  • Version 1: Contiguous threads, but many bank conflicts
  • Version 2: No bank conflicts
  • Version 3: First level of reduction when reading from global memory

  • Version 4: Warp shuffle or unrolling of final warp
  • Version 5: Warp shuffle or complete unrolling
  • Version 6: Multiple elements per thread sequentially
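Versions 4 and 5 replace the last levels of the shared memory tree with register-to-register warp shuffles. A sketch of the final-warp step with the pre-Volta __shfl_down intrinsic that matches this 2015 context (post-Volta code would use __shfl_down_sync):

__device__ int warpReduceSum(int val) {
    // each step folds the upper half of the warp onto the lower half
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down(val, offset);
    return val;   // lane 0 ends up holding the warp's sum
}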


slide-25
SLIDE 25

Case studies: Reduction

  • Reduction

slide-26
SLIDE 26

Case studies: Reduction

  • Reduction

slide-27
SLIDE 27

Outline

  • Uses of atomic operations
  • Atomic operations on shared memory
    – Evolution across GPU generations
    – Case studies
      • Stream compaction
      • Histogramming
      • Reduction
  • Atomic operations on global memory
    – Evolution across GPU generations
    – Case studies
      • Scatter vs. gather
      • Adjacent thread block synchronization


slide-28
SLIDE 28
Atomic operations on global memory

  • Tesla:
    – Atomics executed on DRAM
  • Fermi:
    – Atomics executed on L2
    – Atomic units near L2
  • Kepler and Maxwell:
    – The atomic units near L2 now have a small local cache (a buffer in the ROPs)


slide-29
SLIDE 29

Case study: Scatter vs. gather

  • Scatter vs. Gather


slide-30
SLIDE 30


Case study: Scatter vs. gather

  • Scatter vs. Gather

__global__ void s2g_gpu_scatter_kernel(unsigned int *in, unsigned int *out,
                                       unsigned int num_in, unsigned int num_out) {
    unsigned int inIdx = blockIdx.x * blockDim.x + threadIdx.x;
    if (inIdx < num_in) {
        unsigned int intermediate = outInvariant(in[inIdx]);
        for (unsigned int outIdx = 0; outIdx < num_out; ++outIdx) {
            // every input thread updates every output element: atomics required
            atomicAdd(&out[outIdx], outDependent(intermediate, inIdx, outIdx));
        }
    }
}

__global__ void s2g_gpu_gather_kernel(unsigned int *in, unsigned int *out,
                                      unsigned int num_in, unsigned int num_out) {
    unsigned int outIdx = blockIdx.x * blockDim.x + threadIdx.x;
    if (outIdx < num_out) {
        unsigned int out_reg = 0;
        for (unsigned int inIdx = 0; inIdx < num_in; ++inIdx) {
            unsigned int intermediate = outInvariant(in[inIdx]);
            // each output element is owned by exactly one thread: no atomics needed
            out_reg += outDependent(intermediate, inIdx, outIdx);
        }
        out[outIdx] += out_reg;
    }
}

slide-31
SLIDE 31

Case study: Scatter vs. gather

  • Scatter vs. Gather


Tesla: No L2 cache


slide-32
SLIDE 32

Case study: Scatter vs. gather

  • Scatter vs. Gather


Fermi: ROPs in L2 cache


slide-33
SLIDE 33

Case study: Scatter vs. gather

  • Scatter vs. Gather


Kepler: Buffer in ROPs


slide-34
SLIDE 34
Case study: Adjacent block synchronization

  • GPU programming with CUDA (or OpenCL) might not completely exploit the inherent parallelism in some algorithms
    – In-place operations
  • There can be dependences between consecutive thread blocks
    – Bulk synchronous parallel programming
  • Thread block synchronization requires kernel termination and relaunch


slide-35
SLIDE 35

Case study: Adjacent block synchronization

  • In-place matrix padding
    – Limited GPU memory makes an in-place implementation desirable

[Figure: matrix padding]


slide-36
SLIDE 36

Case study: Adjacent block synchronization

  • In-place matrix padding
    – Temporary storage in on-chip memory
    – Bulk synchronous programming
      • Global synchronization = kernel termination

[Figure: two scheduling alternatives, labeled "less parallelism" and "more parallelism"]


slide-37
SLIDE 37

Case study: Adjacent block synchronization

  • Motivation: In-place matrix padding
    – Padding a 5000x4900 matrix to 5000x5000
      • Almost 100 rows moved in the first iteration
      • 181 iterations with some parallelism
      • The last 99 iterations are moved sequentially
    – Effective throughput: less than 20% of peak bandwidth


slide-38
SLIDE 38

Case study: Adjacent block synchronization

  • Regular Data Sliding

– Dynamic thread block id allocation

  • Avoids deadlocks

– Loading stage

  • Coarsening factor

– Adjacent thread block synchronization

  • Avoids kernel termination and relaunch

– Storing stage


slide-39
SLIDE 39

Case study: Adjacent block synchronization

  • Timing comparison of the two approaches

[Figure: timeline of Load Row / Store Row pairs for rows 4 down to 1. With kernel termination and relaunch, each row's pass is a separate kernel launch; with adjacent synchronization, consecutive passes are separated only by a short wait (A) and set-flag (S) step, shortening the total time]


slide-40
SLIDE 40

Case study: Adjacent block synchronization

  • Regular Data Sliding

– Adjacent block synchronization (Yan et al., 2013)

  • Leader thread waits for previous block flag set
  • Avoids kernel termination and relaunch

__syncthreads();
if (tid == 0) {
    // Wait for the previous block's flag (atomicOr with 0 is an atomic read)
    while (atomicOr(&flags[bid_ - 1], 0) == 0) { ; }
    // Set this block's flag
    atomicOr(&flags[bid_], 1);
}
__syncthreads();


slide-41
SLIDE 41

Case study: Adjacent block synchronization

  • Regular Data Sliding

– Dynamic block id allocation

  • Avoids deadlocks: blocks scheduled earlier receive lower ids, so a block never waits on a flag owned by a block that has not yet been scheduled

__shared__ int bid_;
if (tid == 0)
    bid_ = atomicAdd(&S, 1);   // S: global counter, presumably zeroed before launch
__syncthreads();
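Putting the two idioms together, a hedged skeleton of a sliding kernel (the kernel name, the load/store placeholders, and the zero-initialization of S and flags are assumptions, not from the slides):

__device__ int S;   // dynamic block id counter, assumed zeroed before each launch

__global__ void slide_k(int *data, int *flags /* assumed zeroed */) {
    __shared__ int bid_;
    int tid = threadIdx.x;
    if (tid == 0)
        bid_ = atomicAdd(&S, 1);             // dynamic block id
    __syncthreads();
    // ... loading stage: read this block's chunk into registers/shared memory ...
    if (tid == 0) {
        if (bid_ > 0)                        // the first block has no predecessor
            while (atomicOr(&flags[bid_ - 1], 0) == 0) { ; }
        atomicOr(&flags[bid_], 1);           // publish this block's flag
    }
    __syncthreads();
    // ... storing stage: write the chunk to its new (slid) location ...
}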


slide-42
SLIDE 42

Case study: Adjacent block synchronization

[Figure: four concurrent blocks (two threads each) slide data through global memory, staging each chunk in on-chip memory (registers and shared memory); a flags array in global memory coordinates adjacent blocks]


slide-43
SLIDE 43

Case study: Adjacent block synchronization

  • Regular Data Sliding: Padding and Unpadding
    – Baseline = the bulk synchronous implementation from the motivation slide
    – Up to 9.11x faster (Maxwell) and up to 73.25x faster (Hawaii)

[Charts: throughput (GB/s) vs. number of columns before padding / after unpadding (4950 to 4995), with 5000 rows; DS padding/unpadding vs. baseline. Panels: padding on NVIDIA Maxwell, padding on AMD Hawaii, unpadding on NVIDIA Maxwell, unpadding on AMD Hawaii]


slide-44
SLIDE 44

Case study: Adjacent block synchronization

  • Irregular Data Sliding
    – Dynamic block id allocation
    – Loading stage
      • Local counter
    – Reduction
    – Adjacent block synchronization
    – Storing stage
      • Binary prefix-sum within the thread block (see the ballot-based sketch below)
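The binary prefix-sum (each thread contributes a 0/1 predicate) can be computed per warp with ballot and popcount, as in this sketch with pre-Volta intrinsics; warp totals still have to be combined across the block through shared memory:

// Per-warp binary prefix sum of a 0/1 predicate 'pred'
unsigned int ballot = __ballot(pred);                   // one bit per lane
int lane       = threadIdx.x & 31;
int my_prefix  = __popc(ballot & ((1u << lane) - 1));   // active lanes below me
int warp_total = __popc(ballot);                        // votes in this warp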


slide-45
SLIDE 45

Case study: Adjacent block synchronization

  • Irregular Data Sliding

– Adjacent block synchronization

  • count (a shared memory variable) contains the block's reduction result
  • flag + count is the prefix sum of the blocks' reductions
  • count = flag makes the prefix visible to all threads in the block

__syncthreads();
if (tid == 0) {
    // Wait for the previous block's flag
    while (atomicOr(&flags[bid_ - 1], 0) == 0) { ; }
    // Set this block's flag to the running prefix sum
    int flag = flags[bid_ - 1];
    atomicAdd(&flags[bid_], flag + count);
    count = flag;
}
__syncthreads();


slide-46
SLIDE 46

Case study: Adjacent block synchronization

  • Irregular Data Sliding: Select
    – Up to 3.05x faster than Thrust on Maxwell
    – 2.80x on Kepler
    – 1.78x on Fermi


slide-47
SLIDE 47

Case study: Adjacent block synchronization

  • Irregular Data Sliding
    – Stream compaction
      • Our in-place, stable implementation reaches 68% of the performance of the fastest out-of-place, unstable kernel
    – Unique
      • Up to 3.24x faster than Thrust on Maxwell
      • 2.73x on Kepler
      • 1.66x on Fermi
    – Partition
      • Up to 2.84x faster than Thrust on Maxwell
      • 2.88x on Kepler
      • 1.64x on Fermi


slide-48
SLIDE 48
Summary

  • Significant hardware improvements for atomic operations
    – Shared memory: native integer atomics
    – Global memory: L2 + buffer in the ROPs
  • They can free programmers from applying software optimizations
    – Histogramming
  • They may allow a more natural way of coding, saving many lines of code
    – Reduction
  • They may allow using new, faster algorithms
    – Filtering
    – Adjacent synchronization


slide-49
SLIDE 49

Atomic Operations across GPU generations

Juan Gómez-Luna

University of Córdoba (Spain)

el1goluj@uco.es
gomezlun@illinois.edu

University of Illinois at Urbana-Champaign. ECE 408. October 22, 2015