Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures

José M. Andión
PhD advisors: Gabriel Rodríguez and Manuel Arenaz

Outline
1. Introduction
2. A Novel Compiler Support for Multicore Systems
3. Locality-Aware Automatic Parallelization for GPGPU
4. Trace-Based Affine Reconstruction of Code
5. Conclusions


  1. Building the KIR (I)

     1 for (i = 0; i < n; i++) {
     2   t = 0;
     3   for (j = 0; j < m; j++) {
     4     t = t + A[i][j] * x[j];
     5   }
     6   y[i] = t;
     7 }

     [Figure: control flow graph with basic blocks BB0-BB5 (i = 0; t = 0; j = 0; t = t + A[i][j] * x[j]; j++; if (j < m); y[i] = t; i++; if (i < n)) and the diKernels K<i_BB0>, K<i_BB4>, K<j_BB1>, K<j_BB2>, K<t_BB1>, K<t_BB2>, K<y_BB4>. Edges (1) and (2) are abstracted with diKernels.]

  2. diKernel-level Flow Dependences
     • Identification of the flow of information across the program
     • Statement-level dominance
     • Range of values of variable x produced and used throughout the execution of statements x_i and y_j: DEF(x, x_i) ⊇ USE(x, y_j)

  3. Building the KIR (II)
     • Example: i = 0 dominates i++ and DEF(i, i=0) ⊇ USE(i, i++), so a diKernel-level flow dependence is added from K<i_BB0> to K<i_BB4>
     [Figure: the control flow graph of Figure 2.2 annotated with the diKernels K<i_BB0>, K<i_BB4>, K<j_BB1>, K<j_BB2>, K<t_BB1>, K<t_BB2> and K<y_BB4> and their dependences.]

  4. Hierarchy of Execution Scopes
     • To expose the computational stages of the program
     • Based on the hierarchy of loops: one execution scope for each perfect loop nest
     • The root execution scope is a special node that represents the program as a whole
     • diKernels belong to the innermost execution scope that contains all of their statements
     • diKernels that compute the loop indices belong to the execution scope of the corresponding loop

  5. Building the KIR (and III)

     1 for (i = 0; i < n; i++) {
     2   t = 0;
     3   for (j = 0; j < m; j++) {
     4     t = t + A[i][j] * x[j];
     5   }
     6   y[i] = t;
     7 }

     diKernels: K<i_BB0>, K<i_BB4>, K<j_BB1>, K<j_BB2>, K<t_BB1>, K<t_BB2>, K<y_BB4>
     (the index diKernels are attached to the execution scopes of their loops)

     ROOT EXECUTION SCOPE
       ES_for_i (Figure 2.2, lines 1-7)
         K<t_BB1>   scalar assignment
         ES_for_j (Figure 2.2, lines 3-5)
           K<t_BB2>   scalar reduction
         K<y_BB4>   regular assignment

  6. Outline
     2. A Novel Compiler Support for Multicore Systems
        • KIR: A diKernel-based IR
        • Automatic Partitioning driven by the KIR
        • Automatic Parallelization of the Benchmark Suite
        • Experimental Evaluation

  7. Spurious diKernel-level Dependences
     • They do not prevent the parallelization: t is a privatizable scalar variable

     1 for (i = 0; i < n; i++) {
     2   t = 0;
     3   for (j = 0; j < m; j++) {
     4     t = t + A[i][j] * x[j];
     5   }
     6   y[i] = t;
     7 }

     ROOT EXECUTION SCOPE
       ES_for_i (Figure 2.2, lines 1-7)
         K<t_BB1>   scalar assignment
         ES_for_j (Figure 2.2, lines 3-5)
           K<t_BB2>   scalar reduction
         K<y_BB4>   regular assignment

  8. OpenMP-enabled Parallelization Strategy
     • Find the critical path in the KIR: diKernel-level flow dependences
     • Parallelizing transformations for each type of diKernel
     • Optimizations for the joint parallelization of loops:
       • Minimize synchronization between diKernels
       • Minimize thread creation/destruction

  9. Parallelizing Transformations

     FULLY PARALLEL LOOP:
       #pragma omp parallel for
       for (i = 0; i < n; i++) {
         A[i] = 2;
       }

     PARTIALLY PARALLEL LOOP:
       r = 0;
       #pragma omp parallel for reduction(+:r)
       for (i = 0; i < n; i++) {
         r = r + A[i];
       }

     • Array Expansion (sketched below)
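     The array expansion transformation named above is not expanded on this slide. Purely as a hedged illustration (not the code generated by the thesis compiler; the function name, y_priv and the sequential combine loop are assumptions of this sketch), an irregular reduction such as y[idx[i]] += w[i] can be parallelized by expanding the reduction array per thread and combining the private copies afterwards:

       /* Hypothetical sketch: array expansion for an irregular reduction.
        * Each thread accumulates into its own copy of y; the copies are
        * combined sequentially at the end. */
       #include <stdlib.h>
       #include <omp.h>

       void irregular_reduction(float *y, const float *w, const int *idx,
                                int n, int m) {       /* y has m elements */
         int nthreads = omp_get_max_threads();
         float *y_priv = calloc((size_t)nthreads * m, sizeof(float));

         #pragma omp parallel
         {
           float *mine = y_priv + (size_t)omp_get_thread_num() * m;
           #pragma omp for schedule(static)
           for (int i = 0; i < n; i++)
             mine[idx[i]] += w[i];                    /* no conflicts between threads */
         }

         /* reduction of the expanded copies into y */
         for (int t = 0; t < nthreads; t++)
           for (int j = 0; j < m; j++)
             y[j] += y_priv[(size_t)t * m + j];

         free(y_priv);
       }

     The privatization used later for ATMUX and EQUAKE follows the same initialization/computation/reduction structure.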

  10. Automatic Partitioning driven by the KIR (I)

      1 for (i = 0; i < n; i++) {
      2   t = 0;
      3   for (j = 0; j < m; j++) {
      4     t = t + A[i][j] * x[j];
      5   }
      6   y[i] = t;
      7 }

      Critical path in the KIR:
      ROOT EXECUTION SCOPE
        ES_for_i (Figure 2.2, lines 1-7)
          K<t_BB1>   scalar assignment
          ES_for_j (Figure 2.2, lines 3-5)
            K<t_BB2>   scalar reduction
          K<y_BB4>   regular assignment

  11. Automatic Partitioning driven by the KIR (and II)
      • Critical path: K<t_BB1> (scalar assignment), K<t_BB2> (scalar reduction), K<y_BB4> (regular assignment); ES_for_i is a FULLY PARALLEL LOOP

      Generated OpenMP code:
        #pragma omp parallel shared(A,x,y) private(i,j,t)
        {
          #pragma omp for schedule(static)
          for (i = 0; i < n; i = i + 1) {
            t = 0;
            for (j = 0; j < m; j = j + 1) {
              t = (t) + ((A[i][j]) * (x[j]));
            }
            y[i] = t;
          }
        }

  12. Outline
      2. A Novel Compiler Support for Multicore Systems
         • KIR: A diKernel-based IR
         • Automatic Partitioning driven by the KIR
         • Automatic Parallelization of the Benchmark Suite
         • Experimental Evaluation

  13. Automatic Parallelization of the Benchmark Suite
      • Synthetic Benchmarks
      • Dense/Sparse Matrix-Vector Multiplication
      • Sobel Edge Filter
      • SWIM from SPEC CPU2000
      • EQUAKE from SPEC CPU2000

  14. DenseAMUX & SparseAMUX

      DenseAMUX:
      1 for (i = 0; i < n; i++) {
      2   t = 0;
      3   for (j = 0; j < m; j++) {
      4     t = t + A[i][j] * x[j];
      5   }
      6   y[i] = t;
      7 }

      SparseAMUX:
      1 for (i = 0; i < n; i++) {
      2   t = 0;
      3   for (j = ia[i]; j < ia[i+1]-1; j++) {
      4     t = t + A[j] * x[ja[j]];
      5   }
      6   y[i] = t;
      7 }

      Common KIR:
      ROOT EXECUTION SCOPE
        ES_for_i (Figures 2.8a and 2.8b, lines 1-7)
          K<t_2>   scalar assignment
          ES_for_j (Figures 2.8a and 2.8b, lines 3-5)
            K<t_4>   scalar reduction
          K<y_6>   regular assignment

  15. SparseAMUX
      • ES_for_i is a FULLY PARALLEL LOOP

        #pragma omp parallel shared(A,ia,ja,x,y) private(i,j,t)
        {
          #pragma omp for schedule(static)
          for (i = 0; i < n; i++) {
            t = 0;
            for (j = ia[i]; j < (ia[i+1] - 1); j = j + 1) {
              t = (t) + ((A[j]) * (x[ja[j]]));
            }
            y[i] = t;
          }
        }

      KIR: ES_for_i (K<t_2> scalar assignment, K<y_6> regular assignment), ES_for_j (K<t_4> scalar reduction)

  16. AMUXMS & ATMUX

      AMUXMS:
      1 for (i = 0; i < n; i++) {
      2   y[i] = A[i] * x[i];
      3 }
      4 for (j = 0; j < n; j++) {
      5   for (l = ja[j]; l < ja[j+1]-1; l++) {
      6     y[j] = y[j] + A[l] * x[ja[l]];
      7   }
      8 }

      ATMUX:
      1 for (i = 0; i < n; i++) {
      2   y[i] = 0;
      3 }
      4 for (j = 0; j < n; j++) {
      5   for (l = ia[j]; l < ia[j+1]-1; l++) {
      6     y[ja[l]] = y[ja[l]] + x[j] * A[l];
      7   }
      8 }

      KIR:
      ROOT EXECUTION SCOPE
        ES_for_i (Figures 2.9a and 2.9b, lines 1-3)
          <y_2>   regular assignment
        ES_for_j,l (Figures 2.9a and 2.9b, lines 4-8)
          <y_6>   irregular reduction

  17. AMUXMS
      • Both loops are FULLY PARALLEL LOOPs

        #pragma omp parallel shared(A,x,ja,y) private(i,j,l,t)
        {
          #pragma omp for schedule(static) nowait
          for (i = 0; i < n; i = i + 1) {
            y[i] = (A[i]) * (x[i]);
          }
          #pragma omp for schedule(static)
          for (j = 0; j < n; j = j + 1) {
            for (l = ja[j]; l < (ja[j+1] - 1); l = l + 1) {
              y[j] = (y[j]) + ((A[l]) * (x[ja[l]]));
            }
          }
        }

      KIR: ES_for_i (<y_2> regular assignment), ES_for_j,l (<y_6> irregular reduction)

  18. ATMUX
      • PARTIALLY PARALLEL LOOP

        #pragma omp parallel shared(A,ia,ja,x,y) private(i,j,l,y___private)
        {
          /* Initialization */
          if (omp_get_thread_num() == 0) {
            y___private = y;
          } else {
            y___private = (float *) malloc(n * sizeof(float));
          }
          for (i = 0; i < n; i = i + 1) {
            y___private[i] = 0;
          }
          /* Computation */
          #pragma omp for schedule(static)
          for (j = 0; j < n; j = j + 1) {
            for (l = ia[j]; l < (ia[j+1] - 1); l = l + 1) {
              y___private[ja[l]] = (y___private[ja[l]]) + ((x[j]) * (A[l]));
            }
          }
          /* Reduction */
          #pragma omp critical
          if (omp_get_thread_num() != 0) {
            for (i = 0; i < n; i = i + 1) {
              y[i] += y___private[i];
            }
          }
          if (omp_get_thread_num() != 0) {
            free(y___private);
          }
        }

      KIR: ES_for_i (<y_2> regular assignment), ES_for_j,l (<y_6> irregular reduction)

  19. EQUAKE (I)

      1  for (iter = 1; iter <= timesteps; iter++) {
      2    for (i = 0; i < ARCHnodes; i++)
      3      for (j = 0; j < 3; j++)
      4        disp[disptplus][i][j] = 0.0;
      5    for (i = 0; i < ARCHnodes; i++) {
      6      Anext = ARCHmatrixindex[i]; Alast = ARCHmatrixindex[i+1];
      7      sum0 = K[Anext][0][0] * disp[dispt][i][0]
      8           + K[Anext][0][1] * disp[dispt][i][1]
      9           + K[Anext][0][2] * disp[dispt][i][2];
      10     sum1 = K[Anext][1][0] * ...; sum2 = K[Anext][2][0] * ...;
      11     Anext++;
      12     while (Anext < Alast) {
      13       col = ARCHmatrixcol[Anext];
      14       sum0 += K[Anext][0][0] * disp[dispt][col][0]
      15            + K[Anext][0][1] * disp[dispt][col][1]
      16            + K[Anext][0][2] * disp[dispt][col][2];
      17       sum1 += K[Anext][1][0] * ...; sum2 += K[Anext][2][0] * ...;
      18       disp[disptplus][col][0] +=
      19         K[Anext][0][0] * disp[dispt][i][0]
      20         + K[Anext][1][0] * disp[dispt][i][1]
      21         + K[Anext][2][0] * disp[dispt][i][2];
      22       disp[disptplus][col][1] += K[Anext][0][1] ...
      23       disp[disptplus][col][2] += K[Anext][0][2] ...
      24       Anext++;
      25     }
      26     disp[disptplus][i][0] += sum0; ...
      27   }

      KIR:
      ROOT EXECUTION SCOPE
        ES_for_iter (Figure 2.18, lines 1-46)
          ES_for_i,j (Figure 2.18, lines 2-4)        <disp_4>   regular assignment
          ES_for_i,while (Figure 2.18, lines 5-27)   <disp_26>  irregular reduction
          ES_for_i,j (Figure 2.18, lines 29-31)      <disp_31>  regular reduction
          ES_for_i,j (Figure 2.18, lines 32-37)      <disp_34>  regular reduction
          ES_for_i,j (Figure 2.18, lines 38-40)      <disp_40>  regular reduction
          ES_for_i,j (Figure 2.18, lines 41-44)      <vel_43>   regular assignment

  20. EQUAKE (II)

      28   time = iter * Exc.dt;
      29   for (i = 0; i < ARCHnodes; i++)
      30     for (j = 0; j < 3; j++)
      31       disp[disptplus][i][j] *= - Exc.dt * Exc.dt;
      32   for (i = 0; i < ARCHnodes; i++)
      33     for (j = 0; j < 3; j++)
      34       disp[disptplus][i][j] +=
      35         2.0 * M[i][j] * disp[dispt][i][j]
      36         - (M[i][j] - Exc.dt / 2.0 * C[i][j])
      37         * disp[disptminus][i][j] - ...
      38   for (i = 0; i < ARCHnodes; i++)
      39     for (j = 0; j < 3; j++)
      40       disp[disptplus][i][j] /= (M[i][j] + Exc.dt / 2.0 * C[i][j]);
      41   for (i = 0; i < ARCHnodes; i++)
      42     for (j = 0; j < 3; j++)
      43       vel[i][j] = 0.5 / Exc.dt * (disp[disptplus][i][j]
      44                 - disp[disptminus][i][j]);
      45   i = disptminus; disptminus = dispt; dispt = disptplus; disptplus = i;
      46 }

      KIR: as in EQUAKE (I).

  21. EQUAKE (III)
      • Minimization of thread creation/destruction

      1  #pragma omp parallel shared(disp) private(disp___disptplus___private,...)
      2  {
      3    if (omp_get_thread_num() == 0) {
      4      disp___disptplus___private = disp[disptplus];
      5    } else {
      6      disp___disptplus___private = (double **) malloc(ARCHnodes * sizeof(double *));
      7      for (i = 0; i < ARCHnodes; i = i + 1)
      8        disp___disptplus___private[i] = (double *) malloc(3 * sizeof(double));
      9    }
      10   for (iter = 1; iter < (timesteps + 1); iter = iter + 1) {   /* PARTIALLY PARALLEL LOOP */
      11     #pragma omp barrier
      12     for (i = 0; i < ARCHnodes; i = i + 1)                     /* Initialization */
      13       for (j = 0; j < 3; j = j + 1)
      14         disp___disptplus___private[i][j] = 0.0;
      15     #pragma omp for schedule(static)
      16     for (i = 0; i < ARCHnodes; i = i + 1) {                   /* Computation */
      17       Anext = ARCHmatrixindex[i]; Alast = ARCHmatrixindex[i+1];
      18       sum0 = K[Anext][0][0] * ...
      19       Anext++;
      20       while (Anext < Alast) {
      21         col = ARCHmatrixcol[Anext];
      22         sum0 += K[Anext][0][0] * ...
      23         disp___disptplus___private[col][0] += K[Anext][0][0] * ...
      24         Anext++;
      25       }
      26       disp___disptplus___private[i][0] += sum0; ...
      27     }
      28     #pragma omp critical                                      /* Reduction */
      29     if (omp_get_thread_num() != 0)
      30       for (i = 0; i < ARCHnodes; i = i + 1)
      31         for (j = 0; j < 3; j = j + 1)
      32           disp[disptplus][i][j] += disp___disptplus___private[i][j];
      33     #pragma omp barrier
      34     time = iter * Exc.dt;

  22. EQUAKE (and IV)

      34     time = iter * Exc.dt;
      35     #pragma omp for schedule(static) nowait                   /* FULLY PARALLEL LOOP */
      36     for (i = 0; i < ARCHnodes; i = i + 1)
      37       for (j = 0; j < 3; j = j + 1)
      38         disp[disptplus][i][j] *= - Exc.dt * Exc.dt;
      39     #pragma omp for schedule(static) nowait                   /* FULLY PARALLEL LOOP */
      40     for (i = 0; i < ARCHnodes; i = i + 1)
      41       for (j = 0; j < 3; j = j + 1)
      42         disp[disptplus][i][j] += ...
      43     #pragma omp for schedule(static) nowait                   /* FULLY PARALLEL LOOP */
      44     for (i = 0; i < ARCHnodes; i = i + 1)
      45       for (j = 0; j < 3; j = j + 1)
      46         disp[disptplus][i][j] /= ...
      47     #pragma omp for schedule(static) nowait                   /* FULLY PARALLEL LOOP */
      48     for (i = 0; i < ARCHnodes; i = i + 1)
      49       for (j = 0; j < 3; j = j + 1)
      50         vel[i][j] = ...
      51     i = disptminus; disptminus = dispt; dispt = disptplus; disptplus = i;
      52   } /* for iter */
      53   if (omp_get_thread_num() != 0) {
      54     for (i = 0; i < ARCHnodes; i = i + 1)
      55       free(disp___disptplus___private[i]);
      56     free(disp___disptplus___private);
      57   }
      58 }

  23. Outline
      2. A Novel Compiler Support for Multicore Systems
         • KIR: A diKernel-based IR
         • Automatic Partitioning driven by the KIR
         • Automatic Parallelization of the Benchmark Suite
         • Experimental Evaluation

  24. Program Characteristics
      [Table: for each benchmark, its program characteristics (Unknown LB, Complex CF, Irreg. writes, Irreg. reads, Temp. vars) and the effectiveness of the compilers PLUTO, GCC, ICC and KIR.]
      • Synthetic: reg. assig. (regular assignment), irreg. assig. (irregular assignment), sc. reduc. 1-3 (scalar reductions), reg. reduc. (regular reduction), irreg. reduc. (irregular reduction), reg. recurr. (regular recurrence)
      • Algebra: DenseAMUX (regular assignment), AMUX (regular assignment), AMUXMS (regular reduction), ATMUX (irregular reduction)
      • Image: sobel1 (regular assignment), sobel2 (regular assignment)
      • Apps: SWIM (regular recurrence), EQUAKE (irregular reduction)
      • Hardware: 2 Intel Xeon E5520 quad-core Nehalem processors at 2.26 GHz with 8 MB of cache memory per processor and 8 GB of RAM

  25. Performance: EQUAKE (Execution Time)
      [Figure: execution time (s) and speedup of KIR/ICC vs. ICC with 1, 2, 4 and 8 threads for workloads WL x 1, WL x 2 and WL x 3; the bars distinguish the irregular computation, the overhead and the remaining time.]

  26. Outline
      1. Introduction
      2. A Novel Compiler Support for Multicore Systems
      3. Locality-Aware Automatic Parallelization for GPGPU
      4. Trace-Based Affine Reconstruction of Code
      5. Conclusions

  27. Outline
      3. Locality-Aware Automatic Parallelization for GPGPU
         • GPGPU with CUDA and OpenHMPP
         • Locality-Aware Generation of Efficient GPGPU Code
         • CONV3D & SGEMM
         • Experimental Evaluation

  28. Outline
      3. Locality-Aware Automatic Parallelization for GPGPU
         • GPGPU with CUDA and OpenHMPP
         • Locality-Aware Generation of Efficient GPGPU Code
         • CONV3D & SGEMM
         • Experimental Evaluation

  29. GPGPU with CUDA
      • The first GPGPU programs looked like graphics applications
      • CUDA enables the use of C: a CUDA kernel specifies the operation of a single GPU thread
      • Main ideas:
        1. Lightweight parallel threads organized in a hierarchy: grid, block
        2. Shared memory
        3. Barriers
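      As a brief illustration of these three ideas (this example is not taken from the thesis; the kernel name and the block size of 256 threads are assumptions), a minimal CUDA kernel using the grid/block hierarchy, shared memory and a barrier could look as follows:

        // Hypothetical sketch: each thread loads one element into shared memory,
        // the block synchronizes with a barrier, and each thread then adds its
        // right neighbor within the block (the last thread of a block adds 0).
        __global__ void neighbor_sum(const float *in, float *out, int n) {
          __shared__ float tile[256];               // shared memory: visible to the block
          int tid = threadIdx.x;
          int i = blockIdx.x * blockDim.x + tid;    // position within the grid
          if (i < n) tile[tid] = in[i];
          __syncthreads();                          // barrier for all threads of the block
          if (i < n)
            out[i] = tile[tid] +
                     ((tid + 1 < blockDim.x && i + 1 < n) ? tile[tid + 1] : 0.0f);
        }

        // Launch example: neighbor_sum<<<(n + 255) / 256, 256>>>(d_in, d_out, n);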

  30. Example of CUDA-enabled GPU architecture
      [Figure omitted: diagram of a CUDA-enabled GPU architecture.]

  31. GPU Memories

      Memory          Location   Access         Scope
      registers       SM         read & write   one GPU thread
      local memory    DRAM       read & write   one GPU thread
      shared memory   SM         read & write   all GPU threads in a block
      global memory   DRAM       read & write   all GPU threads & CPU

      • Global memory requires explicit allocations and transfers
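      For reference, the explicit allocation and transfer for the global memory is done on the host with CUDA runtime calls; a minimal sketch (error checking omitted, function name assumed for this example):

        #include <cuda_runtime.h>

        void copy_to_gpu(const float *h_in, float **d_in, size_t n) {
          cudaMalloc((void **)d_in, n * sizeof(float));   /* allocate global memory */
          cudaMemcpy(*d_in, h_in, n * sizeof(float),
                     cudaMemcpyHostToDevice);             /* explicit CPU -> GPU transfer */
        }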

  32. GPU Programming Features in CUDA (ordered by impact on performance, from highest to lowest)
      1. Threadification
      2. Thread grouping: warps
      3. Minimization of CPU-GPU data transfers
      4. Coalescing
      5. Maximization of the usage of registers and shared memory
      6. Divergency
      7. Occupancy
      8. Threads per block

  33. GPGPU with OpenHMPP
      Comparison of directive-based GPGPU approaches in terms of:
      • interaction: RPC
      • address spaces: disjoint
      • data transfers: automatic & manual
      • software-managed caches: explicit vs. automatic handling
      • parallelism specification: gangs, workers, SIMD / loop iterations, tasks, SIMD / loop iterations
      • loop transformations: standard directives vs. none
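      For orientation only (this example is not from the thesis; the directive spellings follow the OpenHMPP codelet/callsite model used later in the deck and should be treated as an assumption), offloading a loop nest with OpenHMPP looks roughly like this:

        /* Hypothetical sketch of the OpenHMPP offloading model: the codelet is the
         * function executed on the GPU and the callsite marks the offloaded call. */
        #pragma hmpp vecadd codelet, target=CUDA, args[c].io=out
        void vecadd(int n, const float a[n], const float b[n], float c[n]) {
          for (int i = 0; i < n; i++)     /* threadified by the HMPP compiler */
            c[i] = a[i] + b[i];
        }

        int main(void) {
          enum { N = 1024 };
          float a[N], b[N], c[N];
          for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }
          #pragma hmpp vecadd callsite
          vecadd(N, a, b, c);             /* data transfers handled by the runtime */
          return 0;
        }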

  34. Outline
      3. Locality-Aware Automatic Parallelization for GPGPU
         • GPGPU with CUDA and OpenHMPP
         • Locality-Aware Generation of Efficient GPGPU Code
         • CONV3D & SGEMM
         • Experimental Evaluation

  35. GPU Programming Features addressed by our Automatic Technique (ordered by impact on performance, from highest to lowest)
      1. Threadification
      2. Thread grouping: warps
      3. Minimization of CPU-GPU data transfers
      4. Coalescing
      5. Maximization of the usage of registers and shared memory
      6. Divergency
      7. Occupancy
      8. Threads per block

  36. Chains of Recurrences (chrecs)
      • Algebraic formalism
      • Definition 3.3.1. Given a constant φ ∈ Z, a function g : N₀ → Z, and the operator +, the chrec f = {φ, +, g} is defined as a function f : N₀ → Z such that:
            {φ, +, g}(i) = φ + Σ_{j=0}^{i−1} g(j)
      • Useful for representing the iterations of a loop and array access patterns:

            for (i = 0; i <= N; i++) {
              for (j = 0; j <= N; j++) {
                ... x[i][j] ...
              }
            }
            CHRECS_xk = [{0,+,1}][{0,+,1}]

      • We instantiate (particularize) them for each GPU thread
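      As a quick check of the definition (a worked instance, not taken from the slides): {5, +, 2}(i) = 5 + Σ_{j=0}^{i−1} 2 = 5 + 2i, i.e. the affine sequence 5, 7, 9, ... produced by an index that starts at 5 and is incremented by 2 in each iteration. In the loop nest above, x[i][j] advances by one row per iteration of i and one column per iteration of j, which is exactly what [{0,+,1}][{0,+,1}] encodes.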

  37. Detection of Coalesced Accesses to the GPU Global Memory
      • CHRECS_xk = [{0,+,1}][{0,+,1}], row-major storage

      (a) Source code S1: only for_i is threadified
            for (i = 0; i <= N; i++) {
              for (j = 0; j <= N; j++) {
                ... x[i][j] ...
              }
            }
      (b) Non-coalesced accesses: threads T0 (i=0), T1 (i=1), T2 (i=2) access x[0][j], x[1][j], x[2][j].
          Instantiated chrecs — 1st dim: {0}, {1}, {2}; 2nd dim: {0,+,1}, {0,+,1}, {0,+,1}.

      (c) Source code S2: only for_j is threadified
            for (j = 0; j <= N; j++) {
              for (i = 0; i <= N; i++) {
                ... x[i][j] ...
              }
            }
      (d) Coalesced accesses: threads T0 (j=0), T1 (j=1), T2 (j=2) access x[i][0], x[i][1], x[i][2].
          Instantiated chrecs — 1st dim: the same {0,+,1} for all threads; 2nd dim: {0}, {1}, {2}, a convex set.

  38. Detection of whether an Access to the GPU Global Memory can be Coalesced

      1:  FUNCTION IsCoalescedAccess
          Input:  access x_k[i_k,1][i_k,2]...[i_k,n] to an n-dimensional array x stored in row-major order
          Input:  loop nest L = L_1, L_2, ..., L_l where L_1 is the threadified loop
          Output: returns whether the given access x_k can be coalesced after threadifying the loop nest L
      2:  CHRECS_x_k ← [{φ_k,1, +, g_k,1}][{φ_k,2, +, g_k,2}]...[{φ_k,n, +, g_k,n}]
      3:  W ← warp of GPU threads {T0, T1, T2, ...}
      4:  for each thread Ti in W do
      5:      CHRECS_x_k^Ti ← [{φ_k,1^Ti, +, g_k,1^Ti}][{φ_k,2^Ti, +, g_k,2^Ti}]...[{φ_k,n^Ti, +, g_k,n^Ti}]
      6:  end for
      7:  if (∃ d ∈ {1 ... n−1}, Tj ∈ W − {T0} : {φ_k,d^Tj, +, g_k,d^Tj} ≠ {φ_k,d^T0, +, g_k,d^T0}) then
      8:      return false                          ▷ first n−1 chrecs differ
      9:  end if
      10: CHRECS_RANGE_x_k,n ← ∪_Ti {φ_k,n^Ti, +, g_k,n^Ti}
      11: if CHRECS_RANGE_x_k,n defines a convex set then
      12:     return true                           ▷ threads of the warp access consecutive locations
      13: else
      14:     return (∀ Tj ∈ W − {T0} : {φ_k,n^Tj, +, g_k,n^Tj} = {φ_k,n^T0, +, g_k,n^T0})
                                                    ▷ threads of the warp access the same location
      15: end if
      16: end FUNCTION
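      To make the test concrete, here is a hedged C sketch of the check for instantiated chrecs (not the thesis implementation: the struct, function names and the simplification of the convexity test to "consecutive starting offsets of the last dimension" are assumptions of this example):

        #include <stdbool.h>
        #include <stdlib.h>

        typedef struct { long phi; long g; } chrec;    /* {phi, +, g} with constant step */

        static int cmp_long(const void *a, const void *b) {
          long x = *(const long *)a, y = *(const long *)b;
          return (x > y) - (x < y);
        }

        /* c[t][d] is the chrec of dimension d instantiated for thread t of the warp */
        bool is_coalesced_access(int warp_size, int ndims, chrec c[warp_size][ndims]) {
          /* the first n-1 chrecs must be identical for every thread of the warp */
          for (int d = 0; d < ndims - 1; d++)
            for (int t = 1; t < warp_size; t++)
              if (c[t][d].phi != c[0][d].phi || c[t][d].g != c[0][d].g)
                return false;

          /* last dimension: either all threads touch the same location ... */
          bool same = true;
          for (int t = 1; t < warp_size && same; t++)
            same = (c[t][ndims-1].phi == c[0][ndims-1].phi &&
                    c[t][ndims-1].g   == c[0][ndims-1].g);
          if (same) return true;

          /* ... or their starting offsets form a convex (consecutive) range */
          long *offs = malloc(sizeof(long) * warp_size);
          for (int t = 0; t < warp_size; t++) offs[t] = c[t][ndims-1].phi;
          qsort(offs, warp_size, sizeof(long), cmp_long);
          bool convex = true;
          for (int t = 1; t < warp_size && convex; t++)
            convex = (offs[t] == offs[t-1] + 1);
          free(offs);
          return convex;
        }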

  39. Usage of Registers to Store Reused Data within a GPU Thread

      1:  PROCEDURE StoreReusedDataInRegisters
          Input:  n-dimensional array x[s_1][s_2]...[s_n]
          Input:  loop nest L = L_1, L_2, ..., L_l where L_1 is the threadified loop
          Output: a modified program that exploits reused data to maximize the usage of the GPU registers
      2:  collect accesses x_k[i_k,1][i_k,2]...[i_k,n] with k ∈ {1, ..., m}
      3:  CHRECS_x_k ← [{φ_k,1, +, g_k,1}][{φ_k,2, +, g_k,2}]...[{φ_k,n, +, g_k,n}]
      4:  for each thread Ti do
      5:      CHRECS_x_k^Ti ← [{φ_k,1^Ti, +, g_k,1^Ti}]...[{φ_k,n^Ti, +, g_k,n^Ti}]
      6:      REUSED_DATA_x^Ti ← ∩_{k=1}^{m} CHRECS_x_k^Ti
      7:      if (REUSED_DATA_x^Ti ≠ ∅) then
      8:          store the data reused between the accesses made by Ti in its set of registers, if the data are private
      9:      end if
      10: end for
      11: end PROCEDURE

  40. Usage of the GPU Shared Memory for Data Shared between the Threads of a Block

      1:  PROCEDURE StoreSharedDataInSharedMemory
          Input:  n-dimensional array x[s_1][s_2]...[s_n]
          Input:  loop nest L = L_1, L_2, ..., L_l where L_1 is the threadified loop
          Output: a modified program using the GPU shared memory to share data between the threads of a block
      2:  collect accesses x_k[i_k,1][i_k,2]...[i_k,n] with k ∈ {1, ..., m}
      3:  CHRECS_x_k ← [{φ_k,1, +, g_k,1}]...[{φ_k,n, +, g_k,n}]
      4:  for each block B do
      5:      for each thread Ti in B do
      6:          CHRECS_x_k^Ti ← [{φ_k,1^Ti, +, g_k,1^Ti}]...[{φ_k,n^Ti, +, g_k,n^Ti}]
      7:      end for
      8:      SHDATA_x ← ∩_{Ti} CHRECS_x_k^Ti with k ∈ {1, ..., m}
      9:      if (SHDATA_x ≠ ∅) then
      10:         store the data shared between the threads of block B in the shared memory
      11:     end if
      12: end for
      13: end PROCEDURE

  41. Increase the Computational Load of a GPU Thread

      1: PROCEDURE IncreaseLoad
         Input:  access x_k[i_k,1][i_k,2]...[i_k,n] to an n-dimensional array x stored in row-major order
         Input:  loop nest L = L_1, L_2, ..., L_l where both L_1 and L_2 are threadified
         Input:  amount of data ∆ to be processed by a GPU thread
         Output: a modified program after applying loop tiling under the OpenHMPP programming model
      2: increment the step of the outer loop L_1 to ∆
      3: for each scalar variable s in L do
      4:     promote s to an array s[∆]
      5:     transform reads and writes to s into loops of ∆ iterations
      6: end for
      7: end PROCEDURE

  42. Use Scalar Variables to Enable GPU Compiler Optimizations

      1: PROCEDURE IncreaseLoad
         Input:  loop nest L = L_1, L_2, L_3, ..., L_l resulting from Algorithm 3.4, where both L_1 and L_2 are threadified, the step of L_1 is ∆, and L_3 is the created loop with ∆ iterations
         Output: a modified program that uses more scalar variables to enable GPU compiler optimizations
      2: apply loop fission to L_3, the loop created in line 5 of Algorithm 3.4
      3: for each loop L'_3 resulting from the fission of L_3 do
      4:     interchange loops until L'_3 is the innermost one
      5:     insert a fullunroll directive before L'_3
      6: end for
      7: end PROCEDURE

  43. Outline
      3. Locality-Aware Automatic Parallelization for GPGPU
         • GPGPU with CUDA and OpenHMPP
         • Locality-Aware Generation of Efficient GPGPU Code
         • CONV3D & SGEMM
         • Experimental Evaluation

  44. CONV3D & SGEMM
      GPU features exploited by each code version:

      Version        Coalescing  Registers  Shared Memory
      conv3d-cpu     -           -          -
      conv3d-hmpp1   √           -          -
      conv3d-hmpp2   √           √          -
      conv3d-hmpp3   √           √          √
      sgemm-cpu      -           -          -
      sgemm-mkl      -           -          -
      sgemm-hmpp1    √           -          -
      sgemm-hmpp2    √           -          -
      sgemm-hmpp3    √           √          -
      sgemm-hmpp4    √           √          √
      sgemm-cublas   -           -          -

  45. CONV3D (I)
      • Shaded diKernels are omitted in the discovery of parallelism

      1  int sizex, sizey, sizez, bound = 4;
      2
      3  void conv3d(float output[sizex][sizey][sizez],
      4              float input[bound+sizex+bound][4+sizey+4][4+sizez+4],
      5              float coefx, float coefy, float coefz) {
      6
      7    for (int i = 0; i < sizex; i++) {
      8      for (int j = 0; j < sizey; j++) {
      9        for (int k = 0; k < sizez; k++) {
      10         float tempx = input[i][j][k] + coefx *
      11           (
      12             input[i-1][j][k] + input[i+1][j][k] +
      13             input[i-2][j][k] + input[i+2][j][k] +
      14             input[i-3][j][k] + input[i+3][j][k] +
      15             input[i-4][j][k] + input[i+4][j][k]
      16           );
      17         float tempy = input[i][j][k] + coefy *
      18           (
      19             input[i][j-1][k] + input[i][j+1][k] +
      20             input[i][j-2][k] + input[i][j+2][k] +
      21             input[i][j-3][k] + input[i][j+3][k] +
      22             input[i][j-4][k] + input[i][j+4][k]
      23           );
      24         float tempz = input[i][j][k] + coefz *
      25           (
      26             input[i][j][k-1] + input[i][j][k+1] +
      27             input[i][j][k-2] + input[i][j][k+2] +
      28             input[i][j][k-3] + input[i][j][k+3] +
      29             input[i][j][k-4] + input[i][j][k+4]
      30           );
      31         output[i][j][k] =
      32           output[i][j][k] + tempx + tempy + tempz;
      33       }
      34     }
      35   }
      36 }

      KIR:
      ROOT EXECUTION SCOPE
        ES_for_i,j,k (Figure 3.4, lines 7-35) — FULLY PARALLEL LOOP
          K<tempx_10>   scalar assignment (shaded)
          K<tempy_17>   scalar assignment (shaded)
          K<tempz_24>   scalar assignment (shaded)
          K<output_31>  regular reduction

  46. CONV3D (II)
      • conv3d-cpu: Sequential code
      • conv3d-hmpp1: Coalescing

        int i, j, k, size_x, size_y, size_z;
        float coefx, coefy, coefz, *input, *output;

        for (i = 0; i < size_x; i++) {
          for (j = 0; j < size_y; j++) {
            for (k = 0; k < size_z; k++) {
              float tempx = input[i][j][k] + coefx*
              ...

      • CHRECS_input1 = [{0,+,1}][{0,+,1}][{0,+,1}]
      • Default OpenHMPP policy:
            CHRECS_input1^T0 = [{0}][{0}][{0,+,1}]
            CHRECS_input1^T1 = [{0}][{1}][{0,+,1}]
      • The loop nest is permuted to for_j, for_k, for_i:
            CHRECS_input1^T0 = [{0,+,1}][{0}][{0}]
            CHRECS_input1^T1 = [{0,+,1}][{0}][{1}]

  47. CONV3D (III)
      • conv3d-hmpp2: Registers

        for (j = 0; j < size_y; j++) {
          for (k = 0; k < size_z; k++) {
            for (i = 0; i < size_x; i++) {
              float tempx = input[i][j][k] + coefx*
              (
                input[i-1][j][k] + input[i+1][j][k] +
                ...

      • CHRECS_input1 = [{0,+,1}][{0,+,1}][{0,+,1}]
        CHRECS_input2 = [{-1,+,1}][{0,+,1}][{0,+,1}]
        CHRECS_input3 = [{1,+,1}][{0,+,1}][{0,+,1}]
      • Instantiated for thread T0:
        CHRECS_input1^T0 = [{0,+,1}][{0}][{0}]
        CHRECS_input2^T0 = [{-1,+,1}][{0}][{0}]
        CHRECS_input3^T0 = [{1,+,1}][{0}][{0}]
        Their intersection is non-empty (∩ ≠ ∅): the reused data are stored in registers

  48. CONV3D (IV)

      #pragma hmpp conv3d___hmpp2 codelet
      void conv3d___hmpp2(float output[sizex][sizey][sizez],
                          float input[bound+sizex+bound][4+sizey+4][4+sizez+4],
                          float coefx, float coefy, float coefz) {

        #pragma hmppcg gridify (j, k)
        for (int j = 0; j < sizey; j++) {
          for (int k = 0; k < sizez; k++) {
            float i___minus4 = 0;
            float i___minus3 = input[-4][j][k];
            float i___minus2 = input[-3][j][k];
            float i___minus1 = input[-2][j][k];
            float i___plus0  = input[-1][j][k];
            float i___plus1  = input[0][j][k];
            float i___plus2  = input[1][j][k];
            float i___plus3  = input[2][j][k];
            float i___plus4  = input[3][j][k];
            for (int i = 0; i < sizex; i++) {
              i___minus4 = i___minus3;
              i___minus3 = i___minus2;
              i___minus2 = i___minus1;
              i___minus1 = i___plus0;
              i___plus0  = i___plus1;
              i___plus1  = i___plus2;
              i___plus2  = i___plus3;
              i___plus3  = i___plus4;
              i___plus4  = input[i+4][j][k];
              float tempx = i___plus0 + coefx *
                (
                  i___minus1 + i___plus1 +
                  i___minus2 + i___plus2 +
                  i___minus3 + i___plus3 +
                  i___minus4 + i___plus4
                );
              float tempy = ...
              float tempz = ...
              output[i][j][k] =
                output[i][j][k] + tempx + tempy + tempz;
            }
          }
        }
      }

  49. CONV3D (V)
      • conv3d-hmpp3: Shared memory

        for (j = 0; j < size_y; j++) {
          for (k = 0; k < size_z; k++) {
            for (i = 0; i < size_x; i++) {
              ...
              float tempz = input[i][j][k] + coefz*
              (
                input[i][j][k-1] + input[i][j][k+1] +
                input[i][j][k-2] + input[i][j][k+2] +
                input[i][j][k-3] + input[i][j][k+3] +
                input[i][j][k-4] + input[i][j][k+4]
              );
              ...

      • Instantiated chrecs of the accesses in tempz (1st, 2nd and 3rd dimensions):

                              T0                        T1
        CHRECS_input19    {0,+,1} {0} {0}        {0,+,1} {0} {1}
        CHRECS_input20    {0,+,1} {0} {-1}       {0,+,1} {0} {0}
        CHRECS_input21    {0,+,1} {0} {1}        {0,+,1} {0} {2}
        CHRECS_input22    {0,+,1} {0} {-2}       {0,+,1} {0} {-1}
        CHRECS_input23    {0,+,1} {0} {2}        {0,+,1} {0} {3}
        CHRECS_input24    {0,+,1} {0} {-3}       {0,+,1} {0} {-2}
        CHRECS_input25    {0,+,1} {0} {3}        {0,+,1} {0} {4}
        CHRECS_input26    {0,+,1} {0} {-4}       {0,+,1} {0} {-3}
        CHRECS_input27    {0,+,1} {0} {4}        {0,+,1} {0} {5}

      • Data shared between the threads of a block → shared clause of the gridify directive

  50. CONV3D (and VI)

      #pragma hmpp conv3d___hmpp3 codelet
      void conv3d___hmpp3(float output[sizex][sizey][sizez],
                          float input[bound+sizex+bound][4+sizey+4][4+sizez+4],
                          float coefx, float coefy, float coefz) {
        float input___shared[bound+8+bound][bound+32+bound];
        #pragma hmppcg gridify (j,k), blocksize (32x8), shared (input___shared), unguarded
        for (int j = 0; j < sizey; j++) {
          for (int k = 0; k < sizez; k++) {
            int tx = 0;
            int ty = 0;
            #pragma hmppcg set tx = RankInBlockX()
            #pragma hmppcg set ty = RankInBlockY()
            int rk = tx + bound;
            int rj = ty + bound;
            float i___minus4 = ...
            for (int i = 0; i < sizex; i++) {
              i___minus4 = ...
              #pragma hmppcg grid barrier
              input___shared[rj-bound][rk-bound] = input[i][j-bound][k-bound];
              input___shared[rj+bound][rk-bound] = input[i][j+bound][k-bound];
              input___shared[rj-bound][rk+bound] = input[i][j-bound][k+bound];
              input___shared[rj+bound][rk+bound] = input[i][j+bound][k+bound];
              #pragma hmppcg grid barrier
              float tempx = ...
              float tempy = i___plus0 + coefy *
                (
                  input___shared[rj-1][rk] + input___shared[rj+1][rk] +
                  input___shared[rj-2][rk] + input___shared[rj+2][rk] +
                  input___shared[rj-3][rk] + input___shared[rj+3][rk] +
                  input___shared[rj-4][rk] + input___shared[rj+4][rk]
                );
              float tempz = i___plus0 + coefz *
                (
                  input___shared[rj][rk-1] + input___shared[rj][rk+1] +
                  input___shared[rj][rk-2] + input___shared[rj][rk+2] +
                  input___shared[rj][rk-3] + input___shared[rj][rk+3] +
                  input___shared[rj][rk-4] + input___shared[rj][rk+4]
                );
              output[i][j][k] =
                output[i][j][k] + tempx + tempy + tempz;
            }
          }
        }
      }

  51. SGEMM (I)

      1  int m, n, k;
      2  void sgemm(float C[m][n], float alpha, float A[m][k],
      3             float B[k][n], float beta) {
      4
      5    for (int i = 0; i < m; i++) {
      6      for (int j = 0; j < n; j++) {
      7        float prod = 0;
      8        for (int l = 0; l < k; l++) {
      9          prod += A[i][l] * B[l][j];
      10       }
      11       C[i][j] = alpha * prod + beta * C[i][j];
      12     }
      13   }
      14 }

      KIR:
      ROOT EXECUTION SCOPE
        ES_for_i,j (Figure 3.8, lines 5-13) — FULLY PARALLEL LOOP
          K<prod_7>   scalar assignment
          ES_for_l (Figure 3.8, lines 8-10)
            K<prod_9>   scalar reduction
          K<C_11>   regular reduction

  52. SGEMM (II)
      • sgemm-cpu: Sequential code
      • sgemm-mkl: Intel MKL
      • sgemm-hmpp1: Offloading (and check coalescing)

        (source code of Figure 3.8, as on the previous slide)

      • Chrecs of the accesses (1st and 2nd dimensions):

                     not instantiated        T0                T1
        CHRECS_A     [{0,+,1}][{0,+,1}]      [{0}][{0,+,1}]    [{0}][{0,+,1}]
        CHRECS_B     [{0,+,1}][{0,+,1}]      [{0,+,1}][{0}]    [{0,+,1}][{1}]
        CHRECS_C     [{0,+,1}][{0,+,1}]      [{0}][{0}]        [{0}][{1}]

  53. SGEMM (III)
      • sgemm-hmpp2: Tiling preserving coalescing

        int m, n, k;
        #define DELTA 16

        #pragma hmpp sgemm___hmpp2 codelet
        void sgemm___hmpp2(float C[m][n], float alpha, float A[m][k],
                           float B[k][n], float beta) {

          #pragma hmppcg gridify (i,j), blocksize (128x1)
          for (int i = 0; i < m; i = i + DELTA) {
            for (int j = 0; j < n; j++) {
              float prod[DELTA];
              for (int t = 0; t < DELTA; t++) {
                prod[t] = 0;
                for (int l = 0; l < k; l++) {
                  prod[t] += A[i+t][l] * B[l][j];
                }
                C[i+t][j] = alpha * prod[t] + beta * C[i+t][j];
              }
            }
          }
        }

  54. SGEMM (and IV)
      • sgemm-hmpp3: Let the compiler use the registers (fullunroll)
      • sgemm-hmpp4: Use the shared memory for B
      • sgemm-cublas: NVIDIA CUBLAS library

        int m, n, k;
        #define DELTA 16

        #pragma hmpp sgemm___hmpp3 codelet
        void sgemm___hmpp3(float C[m][n], float alpha, float A[m][k],
                           float B[k][n], float beta) {

          #pragma hmppcg gridify (i,j), blocksize (128x1)
          for (int i = 0; i < m; i = i + DELTA) {
            for (int j = 0; j < n; j++) {
              float prod[DELTA];
              #pragma hmppcg fullunroll
              for (int t = 0; t < DELTA; t++) {
                prod[t] = 0;
              }
              for (int l = 0; l < k; l++) {
                #pragma hmppcg fullunroll
                for (int t = 0; t < DELTA; t++) {
                  prod[t] += A[i+t][l] * B[l][j];
                }
              }
              #pragma hmppcg fullunroll
              for (int t = 0; t < DELTA; t++) {
                C[i+t][j] = alpha * prod[t] + beta * C[i+t][j];
              }
            }
          }
        }

  55. Outline
      3. Locality-Aware Automatic Parallelization for GPGPU
         • GPGPU with CUDA and OpenHMPP
         • Locality-Aware Generation of Efficient GPGPU Code
         • CONV3D & SGEMM
         • Experimental Evaluation

  56. Performance Evaluation: CONV3D
      • sizex, sizey and sizez in 128, 256, 384, 512, 640 and 768
      [Figure: GFLOPS (0-120) of conv3d-cpu, conv3d-hmpp1, conv3d-hmpp2 and conv3d-hmpp3 on the CPU (nova), GPU Tesla S1070 (nova) and GPU Tesla S2050 (pluton).]
      • Fermi cards introduced memory caches

  57. Performance Evaluation: SGEMM (I)
      • m, n and k in 128, 256, 384, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 4096, 6144 and 8192
      [Figure: GFLOPS (0-500) of sgemm-cpu, sgemm-mkl, sgemm-hmpp1, sgemm-hmpp2, sgemm-hmpp3, sgemm-hmpp4 and sgemm-cublas on the CPU (nova), GPU Tesla S1070 (nova) and GPU Tesla S2050 (pluton).]
      • The biggest improvement factor is the usage of the GPU shared memory

  58. Performance Evaluation: SGEMM (and II)
      [Figure: 3D plot over m, n and k (128 up to 8192) on the GPU Tesla S2050 (pluton); blue: sgemm-cublas, red: sgemm-hmpp4, black: sgemm-mkl.]

  59. Outline
      1. Introduction
      2. A Novel Compiler Support for Multicore Systems
      3. Locality-Aware Automatic Parallelization for GPGPU
      4. Trace-Based Affine Reconstruction of Code
      5. Main Contributions and Future Research Lines

  60. Outline
      4. Trace-Based Affine Reconstruction of Code
         • Problem Formulation
         • Problem Resolution with CHOLESKY
         • Extensions for Supporting Nonlinear Traces
         • Experimental Evaluation

  61. Outline
      4. Trace-Based Affine Reconstruction of Code
         • Problem Formulation
         • Problem Resolution with CHOLESKY
         • Extensions for Supporting Nonlinear Traces
         • Experimental Evaluation

  62. Problem Statement

        for (i = 0; i <= 29; i++) {
          for (j = 0; j <= 29-i; j++) {
            for (k = 0; k < i; k++) {
              ... A[i][k] ...
            }
          }
        }

      Memory trace (excerpt):
        1  0x1e2d140
        2  0x1e2d140
        ...
        30 0x1e2d140
        31 0x1e2d240
        32 0x1e2d248
        33 0x1e2d240
        34 0x1e2d248
        ...
        88 0x1e2d248
        89 0x1e2d340
        90 0x1e2d348
        91 0x1e2d350
        92 0x1e2d340
        93 0x1e2d348
        94 0x1e2d350
        ...

      • We assume that:
        • Addresses are generated by a single instruction
        • The instruction is enclosed in an affine loop nest
      • Existing memory optimization techniques based on the polyhedral model, and any other static or dynamic optimization technique in the absence of source and/or binary code, can be applied

  63. Problem Formulation (I)

      DO i_1 = 0, u_1(ī)
        DO i_2 = 0, u_2(ī)
          ...
            DO i_n = 0, u_n(ī)
              V[f_1(ī)]...[f_m(ī)]

      • The access V[f_1(ī)]...[f_m(ī)] is linearized as V[c_0 + i_1·c_1 + ... + i_n·c_n]

  64. Problem Formulation (II)
      • In our model, only three possible variations of a loop index between two consecutive iterations are allowed:
        1. i_j does not change      ⇒ δ_j^k = 0
        2. i_j is increased by one  ⇒ δ_j^k = 1
        3. i_j is reset to 0        ⇒ δ_j^k = −i_j^k

      • The difference between consecutive iteration vectors is
            δ⃗^k = ī^{k+1} − ī^k = [ i_1^{k+1} − i_1^k, i_2^{k+1} − i_2^k, ..., i_n^{k+1} − i_n^k ]^T = [ δ_1^k, δ_2^k, ..., δ_n^k ]^T

  65. Problem Formulation (and III)
      • The stride between two consecutive accesses is a linear combination of the coefficients of the loop indices:
            σ^k = V(ī^{k+1}) − V(ī^k)
                = (c_0 + c_1·i_1^{k+1} + ... + c_n·i_n^{k+1}) − (c_0 + c_1·i_1^k + ... + c_n·i_n^k)
                = c_1·δ_1^k + ... + c_n·δ_n^k
                = c⃗ · δ⃗^k

  66. Outline
      4. Trace-Based Affine Reconstruction of Code
         • Problem Formulation
         • Problem Resolution with CHOLESKY
         • Extensions for Supporting Nonlinear Traces
         • Experimental Evaluation

  67. Solution Space
      [Figure: tree of candidate solutions explored at each step of the reconstruction — 2n + 1 candidates.]

  68. Problem Resolution (I)
      • A solution S^k for the subtrace {a_1, ..., a_k} is a tuple {c⃗, I^k, U, w⃗}:
        • Coefficients of the loop indices: vector c⃗ ∈ Z^n
        • Iteration indices: matrix I^k = [ī^1 | ... | ī^k] ∈ Z^(n×k)
        • Bounds: loop bounds of the form u_j(ī) = w_j + u_{j,1}·i_1 + ... + u_{j,(j−1)}·i_{(j−1)}, encoded by a matrix U ∈ Z^(n×n) and a vector w⃗ ∈ Z^n such that U·ī + w⃗ ≥ 0⃗^T, with

              U = [ −1      0      0    ...  0 ]        w⃗ = [ w_1 ]
                  [ u_2,1  −1      0    ...  0 ]             [ w_2 ]
                  [ u_3,1   u_3,2 −1    ...  0 ]             [ w_3 ]
                  [  ...                       ]             [ ... ]
                  [ u_n,1   u_n,2  u_n,3 ... −1 ]            [ w_n ]

  69. Problem Resolution (II)
      • To be a valid solution:
        • Each consecutive pair of indices must be sequential: U·I^k + w⃗·1_{1×k} ≥ 0_{n×k}
        • The observed strides must be coherent with the reconstructed ones: c⃗·(ī^{k+1} − ī^k) = c⃗·δ⃗^k = σ^k

  70. Problem Resolution: CHOLESKY (I)

        #define N 32
        double p[N], A[N][N], x;
        int i, j, k;

        #pragma scop
        for (i = 0; i < N; ++i) {
          x = A[i][i];
          for (j = 0; j <= i - 1; ++j)
            x = x - A[i][j] * A[i][j];
          p[i] = 1.0 / sqrt(x);
          for (j = i + 1; j < N; ++j) {
            x = A[i][j];
            for (k = 0; k <= i - 1; ++k)
              x = x - A[j][k] * A[i][k];
            A[j][i] = x * p[i];
          }
        }
        #pragma endscop

      Memory trace of the access A[i][k] (excerpt): 0x1e2d140, 0x1e2d140, ..., 0x1e2d140 (entry 30), 0x1e2d240, 0x1e2d248, 0x1e2d240, 0x1e2d248, ..., 0x1e2d340, 0x1e2d348, 0x1e2d350, ...

      • Initial solution after the first two accesses:
            σ^1 = [a_2 − a_1] = [0]
            c⃗ = [0]
            I^2 = [ī^1 | ī^2] = [0, 1]
            U = [−1]
            w⃗ = [1]^T

  71. Problem Resolution: Building the System
      • Calculate the observed stride σ^k = a_{k+1} − a_k
      • Build a diophantine linear equation system: c⃗·(ī^{k+1} − ī^k) = σ^k, i.e. c⃗^T·δ⃗^k = σ^k
      • One or more solutions: explore them independently
      • No solution under the current boundaries:
        • Increase dimensionality adding a new loop
        • Modify boundaries
        • Discard this branch

  72. Problem Resolution: Solving the System
      • c⃗·(ī^{k+1} − ī^k) = σ^k, i.e. c⃗^T·δ⃗^k = σ^k
      • As indices must be sequential, there are at most n solutions: {ī^{k+1} = +(l, ī^k), 0 < l ≤ n}, where +(l, ī^k) increments index l by one and resets the inner indices to 0
      • We only need to calculate the predicted stride σ̂_{l,k} = c⃗·δ⃗_l^k for each of the n valid indices and compare it with the observed stride
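      To make the matching step concrete, here is a hedged C sketch (not the thesis implementation; the function names and the greedy choice of the innermost matching index are assumptions — the full algorithm keeps every matching candidate as a separate branch of the search):

        #include <stdbool.h>

        /* predicted stride if index l (0-based) is incremented and indices l+1..n-1 reset */
        static long predicted_stride(const long *c, const long *idx, int n, int l) {
          long s = c[l];                       /* delta_l = +1 */
          for (int r = l + 1; r < n; r++)
            s -= c[r] * idx[r];                /* delta_r = -idx[r] (reset to 0) */
          return s;
        }

        /* Returns true and updates idx in place if some candidate matches sigma. */
        bool advance_iteration(const long *c, long *idx, int n, long sigma) {
          for (int l = n - 1; l >= 0; l--) {   /* try the innermost index first */
            if (predicted_stride(c, idx, n, l) == sigma) {
              idx[l] += 1;
              for (int r = l + 1; r < n; r++) idx[r] = 0;
              return true;
            }
          }
          return false;                        /* no candidate: grow the loop nest or backtrack */
        }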

  73. Problem Resolution: CHOLESKY (II)
      • Solution for the first two accesses:
            σ^1 = [a_2 − a_1] = [0],  c⃗ = [0],  I^2 = [ī^1 | ī^2] = [0, 1],  U = [−1],  w⃗ = [1]^T
      • Processing the third access:
            a_3 = 0x1e2d140
            σ^2 = a_3 − a_2 = 0x1e2d140 − 0x1e2d140 = 0
            σ̂_{1,2} = c⃗·δ⃗_1 = [0]·[1]^T = 0 = σ^2
            I^3 = [I^2 | +(1, ī^2)] = [0 1 2]
      • The reconstruction continues until
            σ^30 = a_31 − a_30 = 0x1e2d240 − 0x1e2d140 = 0x100 = 256

  74. Problem Resolution: Increasing the Solution Dimensionality (I)
      • Add a new loop at nesting level p: ī^{k+1} = f(p, ī^k)

            I^{k+1} = [ I^k(1:p, :)    |            ]
                      [ 0_{1×k}        |  ī^{k+1}   ]
                      [ I^k(p+1:n, :)  |            ]

      • In CHOLESKY:
            I = [ 0 ...  0  1 ]  =  [ 0_{1×30}  1 ]
                [ 0 ... 29  0 ]     [ I(1:30)   0 ]

  75. Problem Resolution: Increasing the Solution Dimensionality (and II)
      • The coefficient for the new loop can be derived from the observed stride:
            c⃗·(ī^{k+1} − ī^k) = σ^k
            ⇒ [c_1, ..., c_p, c'_p, c_{p+1}, ..., c_n] · [0, ..., 0, 1, −i_{p+1}^k, ..., −i_n^k]^T = σ^k
            ⇒ c'_p = σ^k + Σ_{r=p+1}^{n} c_r·i_r^k
      • In CHOLESKY:
            c'_1 = σ^30 + c_1·i_1^30 = 256 + 0·29 = 256  ⇒  c⃗ = [256, 0]

  76. Problem Resolution: Updating the Loop Bounds
      • Loop indices must be sequential and stay within the loop bounds:
            U'·I^{k+1} + w⃗'·1_{1×(k+1)} ≥ 0_{n×(k+1)}
      • Inconsistent system: the branch is discarded
      • System with solutions: the system is overdetermined — the computation of w⃗' is O(1), while the computation of U' becomes O(n²)

  77. Problem Resolution: Accelerating the Traversal
      • In the general case, the complexity of exploring the solution space for a trace with A addresses generated by n loops is O(n^A)
      • γ⃗^k = U·ī^k + w⃗: each element indicates how many more iterations of each index are left before it resets under the bounds
      • The most plausible value for the next index is ī^{k+1} = +(l, ī^k), where l is the position of the innermost positive element of γ⃗^k
      • Several accesses are recognized in a single block

  78. Problem Resolution: CHOLESKY (III)
      • After adding the outer loop:
            I = [ 0 ...  0  1 ]        U = [ −1  0 ]        w⃗ = [1 | 29]^T
                [ 0 ... 29  0 ]            [  0 −1 ]
      • Checking U·I + w⃗·1_{1×31} ≥ 0_{2×31}:
            [ 0   0   0  ...   0  −1 ]   [  1   1 ...  1   1 ]   [  1   1   1 ... 1   0 ]
            [ 0  −1  −2  ... −29   0 ] + [ 29  29 ... 29  29 ] = [ 29  28  27 ... 0  29 ]  ≥ 0_{2×31}

  79. Problem Resolution: CHOLESKY (and IV)

        #define N 32
        double p[N], A[N][N], x;
        int i, j, k;

        #pragma scop
        for (i = 0; i < N; ++i) {
          x = A[i][i];
          for (j = 0; j <= i - 1; ++j)
            x = x - A[i][j] * A[i][j];
          p[i] = 1.0 / sqrt(x);
          for (j = i + 1; j < N; ++j) {
            x = A[i][j];
            for (k = 0; k <= i - 1; ++k)
              x = x - A[j][k] * A[i][k];
            A[j][i] = x * p[i];
          }
        }
        #pragma endscop

      • Reconstructed access A[i][k] as A[256*i + 8*k]:
            c⃗ = [256, 0, 8]

            U = [ −1   0   0 ]        w⃗ = [29, 29, 0]^T
                [ −1  −1   0 ]
                [  1   0  −1 ]

  80. Outline
      4. Trace-Based Affine Reconstruction of Code
         • Problem Formulation
         • Problem Resolution with CHOLESKY
         • Extensions for Supporting Nonlinear Traces
         • Experimental Evaluation

  81. Supporting Nonlinearity: Input Noise
      • Some trace files mainly contain references issued by a single access, but mixed with unrelated ones (e.g., nearly affine or unlabeled traces)
      • The exploration of the solution space can be modified to discard up to max observed accesses, accepting a candidate whenever
            σ̂^k = Σ_{r=0}^{e} σ^{k+r},   0 < e ≤ max
      • Tolerance parameter for discarding a branch
