Function Call Re-Vectorization - Pupil: Rubens Emilio Alves Moreira - PowerPoint PPT Presentation



slide-1
SLIDE 1

Function Call Re-Vectorization

Pupil: Rubens Emilio Alves Moreira Advisor: Fernando Magno Quintão Pereira

slide-2
SLIDE 2

Programmability Efficiency Function Call Re-Vectorization

slide-3
SLIDE 3

Programmability Efficiency

Dynamic Parallelism

Function Call Re-Vectorization

CUDA: kernel<<<#warps, #threads>>>(args...)

slide-7
SLIDE 7

Programmability Efficiency

Dynamic Parallelism Shuffle Nightmare

Function Call Re-Vectorization

CUDA: kernel<<<#warps, #threads>>>(args...)

...
__shuffle(data, tid, var)
__shuffle(data, tid, var)
...
__synchronize()
...
__shuffle(data, tid, var)
...
__synchronize()
__shuffle(data, tid, var)
...
__shuffle(data, tid, var)
...

High performance Simplicity

slide-11
SLIDE 11

void memcpy(int *dest, int *src, int N) {
  for (int i = threadId.x; i < N; i += threadDim.x) {
    dest[i] = src[i];
  }
}

Function Call Re-Vectorization

void memcpy_wrapper(int **dest, int **src, int *N, int mask) {
  EVERYWHERE {
    for (int i = 0; i < threadDim.x; ++i) {
      if (not (mask & (1 << i))) continue; // skip thread "i" if it is divergent
      dest_i = shuffle(dest, i);
      src_i  = shuffle(src, i);
      N_i    = shuffle(N, i);
      memcpy(dest_i, src_i, N_i);
    }
  }
}

Warp-synchronous wrapper for SIMD memory copy.

void memcpy_wrapper(int **dest, int **src, int *N, int mask) {
  memcpy<<<1, 4>>>(dest[tid], src[tid], N[tid]);
}

Wrapper using CUDA's nested kernel call: dynamic parallelism.

Too much overhead

Too many lines of code

void memcpy_wrapper(int **dest, int **src, int *N, int mask) {
  crev memcpy(dest[tid], src[tid], N[tid]);
}

Simplicity + Performance, a.k.a.

CREV

slide-12
SLIDE 12

Function Call Re-Vectorization

Our goal is to increase the programmability of languages that target SIMD-like machines, without sacrificing efficiency.

Programmability of algorithms involving function calls:

  • Quicksort
  • Depth-First Search
  • Leader election
  • String matching
  • etc

SIMD-like hardware

  • GPUs
  • SSE vector units
  • AVX vector units
  • etc

SIMD function calls within divergent regions

slide-13
SLIDE 13

DEPARTMENT OF COMPUTER SCIENCE UNIVERSIDADE FEDERAL DE MINAS GERAIS FEDERAL UNIVERSITY OF MINAS GERAIS, BRAZIL

CONCEPTS

slide-16
SLIDE 16

Concepts: Flynn’s Taxonomy

SIMD (Single Instruction Multiple Data):

  • One instruction fetcher
  • Multiple processing units
  • Global memory bank
  • Private memory bank
  • Lockstep execution

All processing units execute the same instruction!
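Since lockstep execution is the mechanism the rest of the deck builds on, here is a minimal CPU sketch of it (hypothetical C++, not CUDA; `kWarp`, `lockstep_step`, and the mask convention are illustrative): every lane executes each instruction, but only lanes enabled in the active mask commit results.

```cpp
#include <array>
#include <cstdint>
#include <functional>

constexpr int kWarp = 4; // simulated warp width

// One lockstep "instruction": every lane computes the result,
// but only the lanes enabled in `mask` commit it to their register.
void lockstep_step(std::array<int, kWarp>& regs, uint32_t mask,
                   const std::function<int(int, int)>& op) {
  for (int lane = 0; lane < kWarp; ++lane) {
    int result = op(lane, regs[lane]);            // all lanes execute in lockstep
    if (mask & (1u << lane)) regs[lane] = result; // inactive lanes discard it
  }
}
```

With mask 0x7 (lane T3 disabled), lanes T0 to T2 commit their results while T3 keeps its old value, mirroring a divergent warp.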

slide-17
SLIDE 17

void kernel(int **A, int **B, int *N) {
  int tid(threadId.x);
  if (tid < 3) {
    memcpy<<<1, 4>>>(A[tid], B[tid], N[tid]);
  } else {
    ;
  }
}

Kernel for parallel execution (CUDA).

Concepts: Lockstep Execution

slide-19
SLIDE 19

Concepts: Lockstep Execution

Control flow graph for kernel: if (threadId.x < 3), then memcpy(A, B, N); else ;

SIMD: threads T0 T1 T2 T3 run in LOCKSTEP!

slide-25
SLIDE 25


DIVERGENCES

slide-27
SLIDE 27

Divergences

Control flow graph for kernel: if (threadId.x < 3), then memcpy(A, B, N); else ;

DIVERGENCE! Threads T0 T1 T2 take the then branch; T3 takes the else branch.

slide-31
SLIDE 31

Divergences

Control flow graph for kernel: if (threadId.x < 3), then memcpy(A, B, N); else ;

DIVERGENCE under lockstep execution: while T0 T1 T2 run memcpy, thread T3 must WAIT!

And waiting to process can be quite costly!

slide-32
SLIDE 32

Interlude: The Kernels of Samuel

Source: http://homepages.dcc.ufmg.br/~fernando/classes/dcc888/ementa/slides/DivergenceAnalysis.pdf

__global__ void dec2zero(int *data, int N) {
  int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
  if (xIndex < N) {
    while (data[xIndex] > 0) {
      data[xIndex]--;
    }
  }
}

slide-35
SLIDE 35

Interlude: The Kernels of Samuel


Seeking for the lowest execution time, what is the best initialization of data[]?

slide-36
SLIDE 36

Interlude: The Kernels of Samuel

int idx = threadId.x; int dimx = threadDim.x;

void F(int *data, int size) {
  for (int i = idx; i < size; i += dimx) {
    data[i] = size - i + 1;
  }
}

F assigns the result of (size - i + 1) to data[i]

slide-37
SLIDE 37

Interlude: The Kernels of Samuel

void M(int *data, int size) {
  for (int i = idx; i < size; i += dimx) {
    data[i] = size;
  }
}

M assigns the constant value size to data[i]

slide-38
SLIDE 38

Interlude: The Kernels of Samuel

void Q(int *data, int size) {
  for (int i = idx; i < size; i += dimx) {
    if (i % 2) data[i] = size;
  }
}

Q does also assign size to data[i], but only for threads with odd index i

slide-39
SLIDE 39

Interlude: The Kernels of Samuel

void P(int *data, int size) {
  for (int i = idx; i < size; i += dimx) {
    data[i] = random() % size;
  }
}

P calls function random and assigns its value, modulo size, to data[i]

slide-45
SLIDE 45

Interlude: The Kernels of Samuel

Execution times of dec2zero for the different initializations of data[]:

  • all values are equal: 16153µs
  • values differ by a constant: 16250µs
  • normal distribution of values: 30210µs
  • half the values differ: 32193µs

Divergence is harmful to performance!
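A toy model of the effect measured above (a sketch under simplifying assumptions, not the experiment itself): under lockstep, a warp keeps issuing the loop body until its slowest lane finishes, so elapsed time tracks the maximum trip count while useful work is the sum of the trip counts.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Elapsed warp iterations under lockstep: lanes that finish early
// still wait for the longest-running lane.
int warp_time(const std::vector<int>& trip_counts) {
  return *std::max_element(trip_counts.begin(), trip_counts.end());
}

// Useful iterations actually performed across the lanes.
int useful_work(const std::vector<int>& trip_counts) {
  return std::accumulate(trip_counts.begin(), trip_counts.end(), 0);
}
```

Equal counts {8,8,8,8} and divergent counts {8,0,8,0} cost the same warp time, but the divergent warp performs half the work, which is why uneven initializations of data[] waste cycles.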

slide-46
SLIDE 46

void kernel(int **A, int **B, int *N) {
  int tid(threadId.x);
  if (tid < 3) {
    memcpy<<<1, 4>>>(A[tid], B[tid], N[tid]);
  } else {
    ;
  }
}

Divergences: Coda

Kernel for parallel execution (CUDA). Control flow graph for memcpy.

FUNCTION memcpy: DIVERGENCE! T0 T1 T2 T3

Divergent region: only active threads run memcpy.

slide-49
SLIDE 49

void kernel(int **A, int **B, int *N) {
  int tid(threadId.x);
  if (tid < 3) {
    memcpy<<<1, 4>>>(A[tid], B[tid], N[tid]);
  } else {
    ;
  }
}

Observed behavior: [timeline figure: threads t0 to t3 over time, copying elements a0 to a4, b0 to b7, and c0 to c6]

Divergences: Coda

Kernel for parallel execution (CUDA). Control flow graph for memcpy: DIVERGENCE! T0 T1 T2 T3

Suboptimal behavior: thread T3 is inactive. Right?

slide-50
SLIDE 50


Not really! We are using Dynamic Parallelism

slide-51
SLIDE 51


DYNAMIC PARALLELISM

slide-53
SLIDE 53

Dynamic Parallelism

CUDA's nested kernel call:

kernel<<<#warps, #threads>>>(args…)

Launches a new kernel, with all threads active, to process the target function

slide-54
SLIDE 54

Actual behavior with CUDA's dynamic parallelism: [timeline figure: the copies requested by T0, T1, and T2 are distributed over threads t0 to t3 and processed in parallel]

Dynamic Parallelism

memcpy runs once per active thread at the memcpy<<<1,4>>> call site!

slide-59
SLIDE 59

Actual behavior with CUDA's dynamic parallelism: [timeline figure: each active thread's copy is processed warp-wide by threads t0 to t3]

void memcpy(int *dest, int *src, int N) {
  for (int i = threadId.x; i < N; i += threadDim.x) {
    dest[i] = src[i];
  }
}

Dynamic Parallelism

SIMD implementation of memory copy. All threads work on a single vector! Dynamic parallelism changes the dimension of the parallelism. All threads are active upon entry!

slide-64
SLIDE 64

Dynamic Parallelism

CUDA’s Dynamic Parallelism: Nested kernel calls

Has the overhead of allocating and scheduling a new kernel

kernel<<<#warps, #threads>>>(args...);

Parallel Time ≈ Kernel Launching Overhead + Sequential Time / (#warps × #threads)

Important benefits when new work is invoked within an executing GPU program include removing the burden on the programmer to marshal and transfer the data on which to operate. Additional parallelism can be exposed to the GPU’s hardware schedulers and load balancers dynamically, adapting in response to data-driven decisions or workloads. Algorithms and programming patterns that had previously required modifications to eliminate recursion, irregular loop structure, or other constructs that do not fit a flat, single-level of parallelism can be more transparently expressed.

Dynamic Parallelism in CUDA

Source: http://developer.download.nvidia.com/assets/cuda/files/CUDADownloads/TechBrief_Dynamic_Parallelism_in_CUDA.pdf
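The estimate above can be turned into a small helper for reasoning about when a nested launch pays off (a sketch of the slide's formula; the inputs are placeholders, not measured numbers):

```cpp
// Parallel Time ≈ Kernel Launching Overhead + Sequential Time / (#warps × #threads)
double parallel_time(double launch_overhead, double sequential_time,
                     int warps, int threads) {
  return launch_overhead + sequential_time / (warps * threads);
}
```

For small copies the launch overhead dominates whatever speedup the extra threads provide, which is the motivation for replacing the nested launch with CREV.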

slide-65
SLIDE 65


WARP-SYNCHRONOUS PROGRAMMING

slide-66
SLIDE 66

Warp-Synchronous Programming

void memcpy(int *dest, int *src, int N) {
  for (int i = threadId.x; i < N; i += threadDim.x) {
    dest[i] = src[i];
  }
}

SIMD implementation of memory copy.

Warp-synchronous: All threads must be active upon entrance to the procedure!

slide-67
SLIDE 67

Warp-Synchronous Programming

void memcpy_wrapper(int **dest, int **src, int *N, int mask) {
  EVERYWHERE {
    for (int i = 0; i < threadDim.x; ++i) {
      if (not (mask & (1 << i))) continue; // skip thread "i" if it is divergent
      dest_i = shuffle(dest, i);
      src_i  = shuffle(src, i);
      N_i    = shuffle(N, i);
      memcpy(dest_i, src_i, N_i);
    }
  }
}

Warp-synchronous wrapper for SIMD memory copy.

slide-69
SLIDE 69

Warp-Synchronous Programming

Warp-level parallelism!

Mappings: int value = [10 20 30 10] (T0 T1 T2 T3); after increment: [11 21 31 11]

Reductions: int value = [10 20 30 10] (T0 T1 T2 T3); after sum, every thread holds the scalar 70
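The sum reduction above is typically built from shuffles. A hypothetical C++ simulation (warp registers modeled as an array; `shuffle` and `warp_sum` are illustrative names) of a butterfly reduction over a 4-lane warp:

```cpp
#include <array>

constexpr int kWarp = 4;

// Simulated shuffle: every lane reads `var` from lane `src`'s register space.
int shuffle(const std::array<int, kWarp>& var, int src) { return var[src]; }

// Butterfly (xor) reduction: after log2(kWarp) rounds, every lane holds the sum.
std::array<int, kWarp> warp_sum(std::array<int, kWarp> value) {
  for (int offset = kWarp / 2; offset > 0; offset /= 2) {
    std::array<int, kWarp> next = value;
    for (int lane = 0; lane < kWarp; ++lane)
      next[lane] = value[lane] + shuffle(value, lane ^ offset);
    value = next;
  }
  return value;
}
```

Starting from [10 20 30 10], every lane ends with 70, matching the slide.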

slide-70
SLIDE 70

Warp-Synchronous Programming: Everywhere blocks


* Everywhere blocks (early languages for SIMD machines):

  • C*
  • MPL
  • POMPC

Block wherein threads are temporarily re-enabled!

slide-73
SLIDE 73

Warp-Synchronous Programming: Everywhere blocks

* Everywhere blocks: EVERYWHERE { code... }

DIVERGENCE: T0 T1 T2 T3 → EVERYWHERE: T0 T1 T2 T3 → DIVERGENCE: T0 T1 T2 T3

All threads are temporarily re-enabled to process code within the EVERYWHERE block; on exit, divergences are restored!
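The save/re-enable/restore behavior can be sketched as a scope guard (hypothetical C++ simulation; `g_active_mask` and `Everywhere` are illustrative names): entering the block forces all lanes on, and leaving it restores the divergent mask.

```cpp
#include <cstdint>

uint32_t g_active_mask = 0xFu; // simulated 4-lane active mask

// Scope guard mimicking an EVERYWHERE block: all lanes are re-enabled
// on entry, and the previous (possibly divergent) mask is restored on exit.
struct Everywhere {
  uint32_t saved;
  Everywhere() : saved(g_active_mask) { g_active_mask = 0xFu; }
  ~Everywhere() { g_active_mask = saved; }
};
```

The guard makes the restore automatic even if the block is left early, which is the discipline the EVERYWHERE construct enforces.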

slide-74
SLIDE 74

Warp-Synchronous Programming: Everywhere blocks + Shuffle


* Shuffle (warp-aware instruction): shuffle(v, i) allows a thread to read the value stored in variable v, but in the register space of thread i.

slide-76
SLIDE 76

void memcpy_wrapper(int **dest, int **src, int *N, int mask) { EVERYWHERE { for (int i=0; i < threadDim.x; ++i) { if (not (mask & (1 << i))) continue; // skip thread “i” dest_i = shuffle(dest, i); // if it is divergent src_i = shuffle(src, i); N_i = shuffle(N, i); memcpy(dest_i, src_i, N_i); } } }

Warp-Synchronous Programming: Everywhere blocks + Shuffle

void memcpy(int *dest, int *src, int N) { for (int i=threadId.x; i < N; i+=threadDim.x) { dest[i] = src[i]; } }

SIMD implementation of memory copy. Warp-synchronous wrapper for SIMD memory copy.

  • 1. everywhere re-enables all threads!
slide-77
SLIDE 77

void memcpy_wrapper(int **dest, int **src, int *N, int mask) { EVERYWHERE { for (int i=0; i < threadDim.x; ++i) { if (not (mask & (1 << i))) continue; // skip thread “i” dest_i = shuffle(dest, i); // if it is divergent src_i = shuffle(src, i); N_i = shuffle(N, i); memcpy(dest_i, src_i, N_i); } } }

Warp-Synchronous Programming: Everywhere blocks + Shuffle

void memcpy(int *dest, int *src, int N) {
  for (int i = threadId.x; i < N; i += threadDim.x) {
    dest[i] = src[i];
  }
}

SIMD implementation of memory copy.

Warp-synchronous wrapper for SIMD memory copy.

  • 1. everywhere re-enables all threads!
  • 2. Skip formerly divergent threads!
slide-78
SLIDE 78

void memcpy_wrapper(int **dest, int **src, int *N, int mask) {
  EVERYWHERE {
    for (int i = 0; i < threadDim.x; ++i) {
      if (not (mask & (1 << i))) continue;  // skip thread “i” if it is divergent
      dest_i = shuffle(dest, i);
      src_i  = shuffle(src, i);
      N_i    = shuffle(N, i);
      memcpy(dest_i, src_i, N_i);
    }
  }
}

Warp-Synchronous Programming: Everywhere blocks + Shuffle

void memcpy(int *dest, int *src, int N) {
  for (int i = threadId.x; i < N; i += threadDim.x) {
    dest[i] = src[i];
  }
}

SIMD implementation of memory copy.

Warp-synchronous wrapper for SIMD memory copy.

  • 1. everywhere re-enables all threads!
  • 2. Skip formerly divergent threads!
  • 3. Extracts values for current thread “i”.
slide-79
SLIDE 79

void memcpy_wrapper(int **dest, int **src, int *N, int mask) {
  EVERYWHERE {
    for (int i = 0; i < threadDim.x; ++i) {
      if (not (mask & (1 << i))) continue;  // skip thread “i” if it is divergent
      dest_i = shuffle(dest, i);
      src_i  = shuffle(src, i);
      N_i    = shuffle(N, i);
      memcpy(dest_i, src_i, N_i);
    }
  }
}

Warp-Synchronous Programming: Everywhere blocks + Shuffle

void memcpy(int *dest, int *src, int N) {
  for (int i = threadId.x; i < N; i += threadDim.x) {
    dest[i] = src[i];
  }
}

SIMD implementation of memory copy.

Warp-synchronous wrapper for SIMD memory copy.

  • 1. everywhere re-enables all threads!
  • 2. Skip formerly divergent threads!
  • 3. Extracts values for current thread “i”.
  • 4. We then call our SIMD kernel memcpy.
slide-80
SLIDE 80

void memcpy_wrapper(int **dest, int **src, int *N, int mask) {
  EVERYWHERE {
    for (int i = 0; i < threadDim.x; ++i) {
      if (not (mask & (1 << i))) continue;  // skip thread “i” if it is divergent
      dest_i = shuffle(dest, i);
      src_i  = shuffle(src, i);
      N_i    = shuffle(N, i);
      memcpy(dest_i, src_i, N_i);
    }
  }
}

Warp-Synchronous Programming: Everywhere blocks + Shuffle

void memcpy(int *dest, int *src, int N) {
  for (int i = threadId.x; i < N; i += threadDim.x) {
    dest[i] = src[i];
  }
}

SIMD implementation of memory copy.

Warp-synchronous wrapper for SIMD memory copy.

  • 1. everywhere re-enables all threads!
  • 2. Skip formerly divergent threads!
  • 3. Extracts values for current thread “i”.
  • 4. We then call our SIMD kernel memcpy.

The target architecture must provide a directive to re-enable inactive threads.
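The wrapper above can be simulated on a CPU. The following Python sketch, assuming an illustrative 4-thread warp and hypothetical names throughout, mimics what memcpy_wrapper does: iterate over the active-mask bits, broadcast the active thread's arguments, and run the strided SIMD copy with all lanes enabled.

```python
WARP = 4  # warp width (illustrative)

def simd_memcpy(dest, src, n):
    # Lane t copies elements t, t + WARP, t + 2*WARP, ... (strided loop).
    for t in range(WARP):
        for i in range(t, n, WARP):
            dest[i] = src[i]

def memcpy_wrapper(dests, srcs, ns, mask):
    # EVERYWHERE: iterate over the lanes of the warp.
    for i in range(WARP):
        if not (mask & (1 << i)):   # skip formerly divergent thread i
            continue
        # "shuffle": broadcast thread i's private arguments to all lanes,
        # then call the SIMD memcpy with every lane active.
        simd_memcpy(dests[i], srcs[i], ns[i])

srcs  = [[1, 2, 3], [4, 5], None, [6]]
dests = [[0, 0, 0], [0, 0], None, [0]]
memcpy_wrapper(dests, srcs, [3, 2, 0, 1], mask=0b0011)  # threads 0, 1 active
print(dests[0], dests[1], dests[3])  # [1, 2, 3] [4, 5] [0]
```

Threads 2 and 3 are skipped because their mask bits are zero, so their buffers are left untouched.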

slide-81
SLIDE 81

Warp-Synchronous Programming: Everywhere blocks + Shuffle

SPMD/SIMT

handle divergences

SIMD

all threads must be active at the call site

everywhere

temporarily re-enables all threads within the warp

shuffle

extracts private values and broadcasts them to all threads

slide-82
SLIDE 82

SPMD/SIMT

handle divergences

SIMD

all threads must be active at the call site

Warp-Synchronous Programming: Everywhere blocks + Shuffle

everywhere

temporarily re-enables all threads within the warp

shuffle

extracts private values and broadcasts them to all threads

crev

slide-83
SLIDE 83

Warp-Synchronous Programming: Everywhere blocks + Shuffle

We have defined the semantics of EVERYWHERE in the SIMD world:

Semantics of everywhere in SIMD: it encodes the building blocks needed to implement this construct

slide-84
SLIDE 84

Warp-Synchronous Programming: Everywhere blocks + Shuffle

We have defined the semantics of EVERYWHERE in the SIMD world:

Implemented an abstract SIMD machine in Prolog, with support for everywhere blocks. Extended Intel's SPMD compiler with a new idiom, function call re-vectorization, which enhances native dynamic parallelism.

slide-85
SLIDE 85

Warp-Synchronous Programming: CREV

crev memcmp(i)

slide-86
SLIDE 86

DEPARTMENT OF COMPUTER SCIENCE UNIVERSIDADE FEDERAL DE MINAS GERAIS FEDERAL UNIVERSITY OF MINAS GERAIS, BRAZIL

FUNCTION CALL RE-VECTORIZATION

slide-87
SLIDE 87

void memcpy(int *dest, int *src, int N) {
  for (int i = threadId.x; i < N; i += threadDim.x) {
    dest[i] = src[i];
  }
}

Function Call Re-Vectorization: Reprise

SIMD implementation of memory copy.

SIMD function

slide-88
SLIDE 88

void memcpy(int *dest, int *src, int N) {
  for (int i = threadId.x; i < N; i += threadDim.x) {
    dest[i] = src[i];
  }
}

Function Call Re-Vectorization: Reprise

void memcpy_wrapper(int **dest, int **src, int *N, int mask) {
  memcpy<<<1, 4>>>(dest[tid], src[tid], N[tid]);
}

CUDA’s nested kernel call: dynamic parallelism.

SIMD implementation of memory copy.

Too much overhead

SIMD function

slide-89
SLIDE 89

void memcpy(int *dest, int *src, int N) {
  for (int i = threadId.x; i < N; i += threadDim.x) {
    dest[i] = src[i];
  }
}

Function Call Re-Vectorization: Reprise

void memcpy_wrapper(int **dest, int **src, int *N, int mask) {
  EVERYWHERE {
    for (int i = 0; i < threadDim.x; ++i) {
      if (not (mask & (1 << i))) continue;  // skip thread “i” if it is divergent
      dest_i = shuffle(dest, i);
      src_i  = shuffle(src, i);
      N_i    = shuffle(N, i);
      memcpy(dest_i, src_i, N_i);
    }
  }
}

Warp-synchronous wrapper for SIMD memory copy.

void memcpy_wrapper(int **dest, int **src, int *N, int mask) {
  memcpy<<<1, 4>>>(dest[tid], src[tid], N[tid]);
}

CUDA’s nested kernel call: dynamic parallelism.

SIMD implementation of memory copy.

Too much overhead

Too many lines of code

SIMD function

slide-90
SLIDE 90

void memcpy(int *dest, int *src, int N) {
  for (int i = threadId.x; i < N; i += threadDim.x) {
    dest[i] = src[i];
  }
}

Function Call Re-Vectorization: Reprise

void memcpy_wrapper(int **dest, int **src, int *N, int mask) {
  EVERYWHERE {
    for (int i = 0; i < threadDim.x; ++i) {
      if (not (mask & (1 << i))) continue;  // skip thread “i” if it is divergent
      dest_i = shuffle(dest, i);
      src_i  = shuffle(src, i);
      N_i    = shuffle(N, i);
      memcpy(dest_i, src_i, N_i);
    }
  }
}

Warp-synchronous wrapper for SIMD memory copy.

void memcpy_wrapper(int **dest, int **src, int *N, int mask) {
  memcpy<<<1, 4>>>(dest[tid], src[tid], N[tid]);
}

CUDA’s nested kernel call: dynamic parallelism.

SIMD implementation of memory copy.

Too much overhead

Too many lines of code

void memcpy_wrapper(int **dest, int **src, int *N, int mask) {
  crev memcpy(dest[tid], src[tid], N[tid]);
}

Simplicity + Performance, a.k.a.

CREV

slide-91
SLIDE 91

Programmability Efficiency

Function Call Re-Vectorization: Reprise

slide-92
SLIDE 92

Programmability Efficiency

Dynamic Parallelism

Function Call Re-Vectorization: Reprise

CUDA: kernel<<<#warps, #threads>>>(args...)

slide-93
SLIDE 93

Programmability Efficiency

Dynamic Parallelism Shuffle Nightmare

Function Call Re-Vectorization: Reprise

CUDA: kernel<<<#warps, #threads>>>(args...)

...
__shuffle(data, tid, var)
__shuffle(data, tid, var)
...
__synchronize()
...
__shuffle(data, tid, var)
...
__synchronize()
__shuffle(data, tid, var)
...
__shuffle(data, tid, var)
...

slide-94
SLIDE 94

Programmability Efficiency

Dynamic Parallelism Shuffle Nightmare

Function Call Re-Vectorization: Reprise

Function Call Re-Vectorization

CUDA: kernel<<<#warps, #threads>>>(args...)

...
__shuffle(data, tid, var)
__shuffle(data, tid, var)
...
__synchronize()
...
__shuffle(data, tid, var)
...
__synchronize()
__shuffle(data, tid, var)
...
__shuffle(data, tid, var)
...

slide-95
SLIDE 95

Programmability Efficiency

Dynamic Parallelism Shuffle Nightmare

Function Call Re-Vectorization: Reprise

Function Call Re-Vectorization

CUDA: kernel<<<#warps, #threads>>>(args...)

...
__shuffle(data, tid, var)
__shuffle(data, tid, var)
...
__synchronize()
...
__shuffle(data, tid, var)
...
__synchronize()
__shuffle(data, tid, var)
...
__shuffle(data, tid, var)
...

Simplicity

slide-96
SLIDE 96

Programmability Efficiency

Dynamic Parallelism Shuffle Nightmare

Function Call Re-Vectorization: Reprise

Function Call Re-Vectorization

CUDA: kernel<<<#warps, #threads>>>(args...)

...
__shuffle(data, tid, var)
__shuffle(data, tid, var)
...
__synchronize()
...
__shuffle(data, tid, var)
...
__synchronize()
__shuffle(data, tid, var)
...
__shuffle(data, tid, var)
...

High performance

Simplicity

slide-97
SLIDE 97

Programmability Efficiency

Dynamic Parallelism Shuffle Nightmare

Function Call Re-Vectorization: Reprise

Re-enable all threads within the warp, avoiding kernel allocation and scheduling

Function Call Re-Vectorization

CUDA: kernel<<<#warps, #threads>>>(args...)

...
__shuffle(data, tid, var)
__shuffle(data, tid, var)
...
__synchronize()
...
__shuffle(data, tid, var)
...
__synchronize()
__shuffle(data, tid, var)
...
__shuffle(data, tid, var)
...

High performance

Simplicity

slide-98
SLIDE 98

Programmability Efficiency

Dynamic Parallelism Shuffle Nightmare

Function Call Re-Vectorization: Reprise

Re-enable all threads within the warp, avoiding kernel allocation and scheduling

Function Call Re-Vectorization

CUDA: kernel<<<#warps, #threads>>>(args...)

...
__shuffle(data, tid, var)
__shuffle(data, tid, var)
...
__synchronize()
...
__shuffle(data, tid, var)
...
__synchronize()
__shuffle(data, tid, var)
...
__shuffle(data, tid, var)
...

High performance

Simplicity

Allows SIMD functions to be executed without diving into warp-synchronous coding!

slide-99
SLIDE 99

Function Call Re-Vectorization: Properties

Composability

We are able to nest everywhere blocks: crev can be called recursively!

“Important benefits when new work is invoked within an executing GPU program include removing the burden on the programmer to marshal and transfer the data on which to operate. Additional parallelism can be exposed to the GPU’s hardware schedulers and load balancers dynamically, adapting in response to data-driven decisions or workloads. Algorithms and programming patterns that had previously required modifications to eliminate recursion, irregular loop structure, or other constructs that do not fit a flat, single-level of parallelism can be more transparently expressed.” — Dynamic Parallelism in CUDA

Source: http://developer.download.nvidia.com/assets/cuda/files/CUDADownloads/TechBrief_Dynamic_Parallelism_in_CUDA.pdf

// Traverses the matrix in a depth-first fashion
void dfs(uniform struct Graph& graph, uniform int root, float * uniform f) {
  if (graph.node[root].visited) return;
  graph.node[root].visited = true;
  // Eventual computations
  f[root] = graph.node[root].length / (float) graph.num_nodes;
  // Traversal
  foreach (i = 0 ... graph.node[root].length) {
    int child = graph.node[root].edge[i].node;
    if (!graph.node[child].visited) {
      crev dfs(graph, child, f);
    }
  }
}
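As a sanity check on what the traversal computes, here is a scalar Python analogue of the dfs above; the dictionary-based graph layout is illustrative, and the recursive call marks the spot where the ISPC version uses crev.

```python
def dfs(graph, root, f):
    node = graph["nodes"][root]
    if node["visited"]:
        return
    node["visited"] = True
    # "Eventual computations": same formula as the ISPC version.
    f[root] = len(node["edges"]) / graph["num_nodes"]
    # Traversal of unvisited children.
    for child in node["edges"]:
        if not graph["nodes"][child]["visited"]:
            dfs(graph, child, f)  # `crev dfs(...)` in the vectorized version

g = {"num_nodes": 3,
     "nodes": [{"visited": False, "edges": [1, 2]},
               {"visited": False, "edges": [0]},
               {"visited": False, "edges": []}]}
f = [0.0] * 3
dfs(g, 0, f)
print(f)  # f[i] = out-degree of node i divided by num_nodes
```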

slide-100
SLIDE 100

Function Call Re-Vectorization: Properties

Multiplicative composition

The target crev function runs once per active thread. In a warp of W threads, the function may run up to W times; if the call is recursive to depth N, up to W^N times.

// Traverses the matrix in a depth-first fashion
void dfs(uniform struct Graph& graph, uniform int root, float * uniform f) {
  if (graph.node[root].visited) return;
  graph.node[root].visited = true;
  // Eventual computations
  f[root] = graph.node[root].length / (float) graph.num_nodes;
  // Traversal
  foreach (i = 0 ... graph.node[root].length) {
    int child = graph.node[root].edge[i].node;
    if (!graph.node[child].visited) {
      crev dfs(graph, child, f);
    }
  }
}

[Diagram: threads T0–T3 execute one vectorized dfs() call; recursion level n = 1]
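The W^N bound can be checked with a small Python sketch that counts invocations per nesting level, assuming an illustrative warp of W = 4 threads, all of them active at every crev call.

```python
W = 4        # warp width (illustrative)
counts = []  # number of invocations observed at each nesting level

def f(depth, level=0):
    if len(counts) <= level:
        counts.append(0)
    counts[level] += 1
    if depth == 0:
        return
    # A crev call re-runs the callee once per active thread.
    for _ in range(W):
        f(depth - 1, level + 1)

f(2)           # two nested crev levels
print(counts)  # [1, 4, 16]: level n executes W**n times
```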

slide-102
SLIDE 102

Function Call Re-Vectorization: Properties

Multiplicative composition

The target crev function runs once per active thread. In a warp of W threads, the function may run up to W times; if the call is recursive to depth N, up to W^N times.

// Traverses the matrix in a depth-first fashion
void dfs(uniform struct Graph& graph, uniform int root, float * uniform f) {
  if (graph.node[root].visited) return;
  graph.node[root].visited = true;
  // Eventual computations
  f[root] = graph.node[root].length / (float) graph.num_nodes;
  // Traversal
  foreach (i = 0 ... graph.node[root].length) {
    int child = graph.node[root].edge[i].node;
    if (!graph.node[child].visited) {
      crev dfs(graph, child, f);
    }
  }
}

[Diagram: recursion levels n = 1 and n = 2; at each level, threads T0–T3 execute vectorized dfs() calls]

slide-103
SLIDE 103

Function Call Re-Vectorization: Properties

Multiplicative composition

The target crev function runs once per active thread. In a warp of W threads, the function may run up to W times; if the call is recursive to depth N, up to W^N times.

// Traverses the matrix in a depth-first fashion
void dfs(uniform struct Graph& graph, uniform int root, float * uniform f) {
  if (graph.node[root].visited) return;
  graph.node[root].visited = true;
  // Eventual computations
  f[root] = graph.node[root].length / (float) graph.num_nodes;
  // Traversal
  foreach (i = 0 ... graph.node[root].length) {
    int child = graph.node[root].edge[i].node;
    if (!graph.node[child].visited) {
      crev dfs(graph, child, f);
    }
  }
}

[Diagram: recursion tree of vectorized dfs() calls across threads T0–T3, through levels n = 1, 2, 3]

slide-104
SLIDE 104

Function Call Re-Vectorization: Properties

Multiplicative composition

The target crev function runs once per active thread. In a warp of W threads, the function may run up to W times; if the call is recursive to depth N, up to W^N times.

// Traverses the matrix in a depth-first fashion
void dfs(uniform struct Graph& graph, uniform int root, float * uniform f) {
  if (graph.node[root].visited) return;
  graph.node[root].visited = true;
  // Eventual computations
  f[root] = graph.node[root].length / (float) graph.num_nodes;
  // Traversal
  foreach (i = 0 ... graph.node[root].length) {
    int child = graph.node[root].edge[i].node;
    if (!graph.node[child].visited) {
      crev dfs(graph, child, f);
    }
  }
}

[Diagram: recursion tree of vectorized dfs() calls across threads T0–T3, growing level by level from n = 1 up to depth n]

slide-105
SLIDE 105

Function Call Re-Vectorization: Properties

Commutativity

There is no predefined order among the executions of crev’s target function.

// Traverses the matrix in a depth-first fashion
void dfs(uniform struct Graph& graph, uniform int root, float * uniform f) {
  if (graph.node[root].visited) return;
  graph.node[root].visited = true;
  // Eventual computations
  f[root] = graph.node[root].length / (float) graph.num_nodes;
  // Traversal
  foreach (i = 0 ... graph.node[root].length) {
    int child = graph.node[root].edge[i].node;
    if (!graph.node[child].visited) {
      crev dfs(graph, child, f);
    }
  }
}

[Diagram: recursion tree of vectorized dfs() calls across threads T0–T3; sibling dfs() calls at the same level may execute in either order]

slide-106
SLIDE 106

Function Call Re-Vectorization: Properties

Synchronization parity

Synchronization primitives remain correct regardless of the crev nesting level: crev uses a context stack to keep track of divergences.

...
__shuffle(data, tid, var)
__shuffle(data, tid, var)
...
__synchronize()
...
__shuffle(data, tid, var)
...
__synchronize()
__shuffle(data, tid, var)
...
__shuffle(data, tid, var)
...

...
__shuffle(data, tid, var)
__shuffle(data, tid, var)
...
__synchronize()
...

crev f()

...
__shuffle(data, tid, var)
...
__synchronize()
__shuffle(data, tid, var)
...
__shuffle(data, tid, var)
...

crev
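A minimal Python sketch of the context-stack idea, with illustrative names: each crev entry pushes the current divergence mask and re-enables all lanes; each exit pops it, so synchronization sees the correct active mask at every nesting level.

```python
class Warp:
    def __init__(self, width=4):
        self.all_lanes = (1 << width) - 1
        self.mask = self.all_lanes    # currently active lanes
        self.stack = []               # context stack of divergence masks

    def diverge(self, mask):
        self.mask = mask              # some lanes take a branch

    def crev_enter(self):
        self.stack.append(self.mask)  # push the divergence context
        self.mask = self.all_lanes    # everywhere: re-enable every lane

    def crev_exit(self):
        self.mask = self.stack.pop()  # restore divergences on return

w = Warp()
w.diverge(0b0101)   # only lanes 0 and 2 remain active
w.crev_enter()      # nested crev call: all four lanes run the callee
w.crev_enter()      # nesting composes; each level saves its own mask
w.crev_exit()
w.crev_exit()
print(bin(w.mask))  # 0b101: the original divergence is back
```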

slide-107
SLIDE 107

DEPARTMENT OF COMPUTER SCIENCE UNIVERSIDADE FEDERAL DE MINAS GERAIS FEDERAL UNIVERSITY OF MINAS GERAIS, BRAZIL

EXPERIMENTAL EVALUATION

slide-108
SLIDE 108

string T = text, P = pattern;

void str_compare(int offset) {
  bool m = true;
  for (int i = threadId.x; i < |P|; i += threadDim.x)
    if (P[i] != T[i + offset]) { m = false; break; }
  if (all(m == true)) Found();
}

void StringMatch() {
  for (int i = threadId.x; i < (|T| - |P|); i += threadDim.x)
    if (P[0] == T[i]) crev str_compare(i);
}

Experimental Evaluation: String Matching

This is a CREV call! Example: if the warp size is 32 and 7 threads are enabled when the program flow hits this line, all 32 threads execute str_compare 7 times. In each case, the 32 threads temporarily take on the local state of the active thread that they are helping. Once done, these workers all get their local state restored.

We vectorize both loops
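For reference, this scalar Python function computes the same result as the vectorized StringMatch/str_compare pair above: every offset at which the pattern occurs. The function name and the list-based output are illustrative.

```python
def string_match(T, P):
    """Offsets where pattern P occurs in text T (what Found() reports)."""
    matches = []
    for i in range(len(T) - len(P) + 1):
        # Cheap first-character filter, as in StringMatch above,
        # followed by the full comparison done by str_compare.
        if P[0] == T[i] and T[i:i + len(P)] == P:
            matches.append(i)
    return matches

print(string_match("abracadabra", "abra"))  # [0, 7]
```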

slide-109
SLIDE 109

string T = text, P = pattern;

void str_compare(int offset) {
  bool m = true;
  for (int i = threadId.x; i < |P|; i += threadDim.x)
    if (P[i] != T[i + offset]) { m = false; break; }
  if (all(m == true)) Found();
}

void StringMatch() {
  for (int i = threadId.x; i < (|T| - |P|); i += threadDim.x)
    if (P[0] == T[i]) crev str_compare(i);
}

Experimental Evaluation: String Matching

This is a CREV call! Example: if the warp size is 32 and 7 threads are enabled when the program flow hits this line, all 32 threads execute str_compare 7 times. In each case, the 32 threads temporarily take on the local state of the active thread that they are helping. Once done, these workers all get their local state restored.

SIMD function: all threads must be active

slide-110
SLIDE 110

string T = text, P = pattern;

void str_compare(int offset) {
  bool m = true;
  for (int i = threadId.x; i < |P|; i += threadDim.x)
    if (P[i] != T[i + offset]) { m = false; break; }
  if (all(m == true)) Found();
}

void StringMatch() {
  for (int i = threadId.x; i < (|T| - |P|); i += threadDim.x)
    if (P[0] == T[i]) crev str_compare(i);
}

void NaiveStringMatch() {
  for (int i = threadId.x; i < (|T| - |P|); i += threadDim.x) {
    int j = 0, k = i;
    while (j < |P| and P[j] == T[k]) { j = j + 1; k = k + 1; }
    if (j == |P|) Found(k);
  }
}

Experimental Evaluation: String Matching

This is a CREV call!

SIMD function: all threads must be active

Naïve parallel approach

slide-111
SLIDE 111

Experimental Evaluation: Environment Setup

ISPC, the Intel SPMD Program Compiler, v 1.9.1

  • Compiler implemented on top of LLVM
  • Programming language (extension of C)

Benchmarks: seven algorithms implemented in different ways:

  • CREV: our contribution
  • PAR: implementation based on ISPC's constructs
  • SEQ: state-of-the-art sequential implementation
  • Launch: implementation using dynamic parallelism (pthreads)

Environment

  • 6-core 2.00 GHz Intel Xeon E5-2620 CPU (8-wide AVX vector units)
  • Linux Ubuntu 12.04 (kernel 3.2.0)
slide-112
SLIDE 112

Experimental Evaluation: String Matching

[Chart: runtime (millions of cycles, y-axis 2000–8000) versus pattern length (x-axis 4–32) for KMP, PAR, and CREV. Labels give the percentage speedup of CREV over PAR (white) and over KMP (grey): +42.5/+55.2, +17.0/+25.2, +10.1/+28.5, +5.2/+42.2, −17.1/+59.7, +0.0/+8.1, −1.8/+33.4, −10.2/+50.9.]

Input: 256MB in 5M lines from books from Project Gutenberg.

slide-113
SLIDE 113

Function Call Re-Vectorization: CREV

Execution times (in millions of cycles):

Algorithm                  | Sequential      | Parallel        | Launch         | CREV       | Dataset
BookFilter                 | not implemented | 8530.990        | 7857.980       | 7405.175   | bin-L20K-P16
String Matching (seq: KMP) | 6649.279        | 3576.143        | 393166.268     | 2737.939   | txt-256MB-P16
Bellman-Ford               | 141088.730      | 493619.688      | thread lim exc | 529856.065 | erdos-renyi
Depth-First Search         | 3754.101        | 3786.263        | thread lim exc | 3790.444   | ctree-D5
Connected-Component Leader | 4054.658        | 3983.088        | 5272.919       | 3984.795   | ctree-D5
Quicksort-bitonic          | 2.871           | not implemented | 204.278        | 2.878      | int-16K
Mergesort-bitonic          | 7.302           | not implemented | 104.985        | 4.114      | int-16K

Datasets:

  • bin-L20K-P16: 10K strings of 0s and 1s, each of length 20K, and target pattern of length 16.
  • txt-256MB-P16: 256MB in 5M lines from books from Project Gutenberg; target pattern has length 16.
  • erdos-renyi: random Erdos-Renyi graph with 2048 nodes and 80% probability of edges.
  • ctree-D5: 8-ary complete tree of depth 5 (root + five full levels of nodes).
  • int-16K: 16K random integers in the range [0, 100000).

✪ Fastest; ✪ 1st runner up; ✪ 2nd runner up.

slide-115
SLIDE 115

DEPARTMENT OF COMPUTER SCIENCE UNIVERSIDADE FEDERAL DE MINAS GERAIS FEDERAL UNIVERSITY OF MINAS GERAIS, BRAZIL

CONTRIBUTIONS

slide-116
SLIDE 116

Contributions: Rat

Paper published in CGO’16

Inference of Peak Density of Indirect Branches to Detect ROP Attacks

Webpage with static analysis available

http://cuda.dcc.ufmg.br/rip-rop-deducer

Paper published in SBSEG’15

Inferência Estática da Frequência Máxima de Instruções de Retorno para Detecção de Ataques ROP

ACM CGO-SRC’16 winner (1st place: gold medal)

Artifact Evaluated: paper benchmarks

slide-117
SLIDE 117

Contributions: Swan

Paper published in PPoPP’17

Function Call Re-Vectorization

Webpage with all source code and results available

http://cuda.dcc.ufmg.br/swan

Artifact Evaluated: paper benchmarks

Paper published in SBLP’16

Definição Semântica de Blocos Everywhere para Programação SIMD

Semantics of everywhere blocks in SIMD context

µSIMD Prolog abstract machine

slide-118
SLIDE 118

DEPARTMENT OF COMPUTER SCIENCE UNIVERSIDADE FEDERAL DE MINAS GERAIS FEDERAL UNIVERSITY OF MINAS GERAIS, BRAZIL

QUESTIONS?

Everything can happen. Everything is possible and probable. Time and space do not exist. On a flimsy framework of reality, the imagination spins, weaving new patterns.

Ingmar Bergman – Fanny and Alexander

Email us: Rubens Emilio Alves Moreira [rubens@dcc.ufmg.br] Fernando Magno Quintão Pereira [fernando@dcc.ufmg.br] Check our websites: http://cuda.dcc.ufmg.br/swan http://cuda.dcc.ufmg.br/rip-rop-deducer