

SLIDE 1

Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives

José M. Andión, Manuel Arenaz, François Bodin, Gabriel Rodríguez and Juan Touriño

7th International Symposium on High-Level Parallel Programming and Applications (HLPP 2014) July 3-4, 2014 — Amsterdam, Netherlands

SLIDE 2

J.M. Andión et al. Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives. HLPP 2014.

Outline

  • Motivation: General Purpose Computation with GPUs
  • GPGPU with CUDA & OpenHMPP
  • The KIR: an IR for the Detection of Parallelism
  • Locality-Aware Generation of Efficient GPGPU Code
  • Case Studies: CONV3D & SGEMM
  • Performance Evaluation
  • Conclusions & Future Work
SLIDE 3

Outline

  • Motivation: General Purpose Computation with GPUs
  • GPGPU with CUDA & OpenHMPP
  • The KIR: an IR for the Detection of Parallelism
  • Locality-Aware Generation of Efficient GPGPU Code
  • Case Studies: CONV3D & SGEMM
  • Performance Evaluation
  • Conclusions & Future Work
SLIDE 4

[Chart: growth in uniprocessor performance since 1978, measured relative to the VAX-11/780: 25%/year until the mid-1980s, 52%/year until the early 2000s, and 22%/year afterwards, reaching roughly 24,129x by 2012. Representative machines range from the VAX-11/785 to a 6-core 3.3 GHz Intel Xeon.]

The Parallel Challenge

David A. Patterson and John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Elsevier, 2014.

SLIDE 5

[Chart: number of TOP500 systems with accelerators/co-processors, 2006-2014, by vendor: ClearSpeed CSX600, IBM Cell, ATI/AMD, NVIDIA and Intel.]

General Purpose Computing with GPUs

The TOP500 List, June 2014.

SLIDE 6

Outline

  • Motivation: General Purpose Computation with GPUs
  • GPGPU with CUDA & OpenHMPP
  • The KIR: an IR for the Detection of Parallelism
  • Locality-Aware Generation of Efficient GPGPU Code
  • Case Studies: CONV3D & SGEMM
  • Performance Evaluation
  • Conclusions & Future Work
SLIDE 7

GPGPU with CUDA

  • The first GPGPU programs had to look like graphics applications
  • CUDA enables the use of the C language
  • A CUDA kernel specifies the operation of a single GPU thread
  • Main ideas:
    1. Lightweight parallel threads organized in a hierarchy: grid, block
    2. Shared memory
    3. Barriers

SLIDE 8

GPU Programming Features in CUDA

  1. Threadification
  2. Thread grouping: warps
  3. Minimization of CPU-GPU data transfers
  4. Coalescing
  5. Maximization of the usage of registers and shared memory
  6. Divergence
  7. Occupancy
  8. Threads per block

SLIDE 9

GPGPU with OpenHMPP

  • Directive-based approaches provide several advantages:
  • More readable code
  • Only one file
  • Independent from the hardware platform
  • Reasonable performance
SLIDE 10

GPGPU with OpenHMPP

  • Directive-based approaches provide several advantages:
  • More readable code
  • Only one file
  • Independent from the hardware platform
  • Reasonable performance

explicit control of software-managed caches

SLIDE 11

GPGPU with OpenHMPP

  • Directive-based approaches provide several advantages:
  • More readable codes
  • Only one file
  • Independent from the hardware platform
  • Reasonable performance

explicit control of software-managed caches; standard loop transformations

SLIDE 12

GPU Programming Features with OpenHMPP

  1. Threadification
  2. Thread grouping
  3. Minimization of CPU-GPU data transfers
  4. Coalescing
  5. Maximization of the usage of registers and shared memory
  6. Divergence
  7. Occupancy
  8. Threads per block

Relevant OpenHMPP directives: gridify, advancedLoad, delegatedStore, permute, unroll, fuse, tile, shared

SLIDE 13

Outline

  • Motivation: General Purpose Computation with GPUs
  • GPGPU with CUDA & OpenHMPP
  • The KIR: an IR for the Detection of Parallelism
  • Locality-Aware Generation of Efficient GPGPU Code
  • Case Studies: CONV3D & SGEMM
  • Performance Evaluation
  • Conclusions & Future Work
SLIDE 14

diKernel: Domain-Independent Computational Kernel

  • Characterizes the computations carried out in a program without being affected by how they are coded
  • Exposes multiple levels of parallelism

Levels of abstraction: TEXT LEVEL (ASCII code), SYNTACTIC LEVEL (abstract syntax tree), SEMANTIC LEVEL (control flow and data dependence graphs), DOMAIN-INDEPENDENT CONCEPT LEVEL (programming practice), DOMAIN-SPECIFIC CONCEPT LEVEL (problem solving methods and application domain)

  • M. Arenaz et al. XARK: An Extensible Framework for Automatic Recognition of Computational Kernels. ACM Transactions on Programming Languages and Systems, 30(6), 2008.

SLIDE 15

Building the KIR

  • Non-statement-based, high-level, hierarchical IR
    1. diKernel recognition on the data dependence graph (DDG)
    2. Identification of flow dependences
    3. Hierarchy of execution scopes reflecting the computational stages & diKernel classification

J.M. Andión et al. A Novel Compiler Support for Automatic Parallelization on Multicore Systems. Parallel Computing, 39(9), 2013.

SLIDE 16

Example of KIR: CONV3D

 1. int i, j, k, size_x, size_y, size_z;
 2. float coefx, coefy, coefz, *input, *output;
 3.
 4. for (i = 0; i < size_x; i++) {
 5.   for (j = 0; j < size_y; j++) {
 6.     for (k = 0; k < size_z; k++) {
 7.       float tempx = input[i][j][k] + coefx *
 8.         (
 9.           input[i-1][j][k] + input[i+1][j][k] +
10.           input[i-2][j][k] + input[i+2][j][k] +
11.           input[i-3][j][k] + input[i+3][j][k] +
12.           input[i-4][j][k] + input[i+4][j][k]
13.         );
14.       float tempy = input[i][j][k] + coefy *
15.         (
16.           input[i][j-1][k] + input[i][j+1][k] +
17.           input[i][j-2][k] + input[i][j+2][k] +
18.           input[i][j-3][k] + input[i][j+3][k] +
19.           input[i][j-4][k] + input[i][j+4][k]
20.         );
21.       float tempz = input[i][j][k] + coefz *
22.         (
23.           input[i][j][k-1] + input[i][j][k+1] +
24.           input[i][j][k-2] + input[i][j][k+2] +
25.           input[i][j][k-3] + input[i][j][k+3] +
26.           input[i][j][k-4] + input[i][j][k+4]
27.         );
28.       output[i][j][k] =
29.         output[i][j][k] + tempx + tempy + tempz;
30.     }
31.   }
32. }

KIR: ROOT EXECUTION SCOPE contains ES_for_i,j,k (Fig. 1a, lines 4-32) with:
  K<tempx_7>: scalar assignment
  K<tempy_14>: scalar assignment
  K<tempz_21>: scalar assignment
  K<output_28>: regular reduction

The scalar assignments are shaded: they are omitted in the discovery of parallelism.

SLIDE 17

Outline

  • Motivation: General Purpose Computation with GPUs
  • GPGPU with CUDA & OpenHMPP
  • The KIR: an IR for the Detection of Parallelism
  • Locality-Aware Generation of Efficient GPGPU Code
  • Case Studies: CONV3D & SGEMM
  • Performance Evaluation
  • Conclusions & Future Work
SLIDE 18

GPU Programming Features addressed by our Automatic Technique

  1. Threadification
  2. Thread grouping
  3. Minimization of CPU-GPU data transfers
  4. Coalescing
  5. Maximization of the usage of registers and shared memory
  6. Divergence
  7. Occupancy
  8. Threads per block

SLIDE 19

Detection of coalesced accesses to the GPU global memory

1 // only for_i is threadified
2 for (i = 0; i <= N; i++) {
3   for (j = 0; j <= N; j++) {
4     ... x[i][j] ...
5   }
6 }

(a) Source code S1.

         T0 (i=0)   T1 (i=1)   T2 (i=2)
j=0      x[0][0]    x[1][0]    x[2][0]
j=1      x[0][1]    x[1][1]    x[2][1]
j=2      x[0][2]    x[1][2]    x[2][2]
...      ...        ...        ...
chrecs:
1st dim  {0}        {1}        {2}
2nd dim  {0,+,1}    {0,+,1}    {0,+,1}

(b) Non-coalesced accesses.

1 // only for_j is threadified
2 for (j = 0; j <= N; j++) {
3   for (i = 0; i <= N; i++) {
4     ... x[i][j] ...
5   }
6 }

(c) Source code S2.

         T0 (j=0)   T1 (j=1)   T2 (j=2)
i=0      x[0][0]    x[0][1]    x[0][2]
i=1      x[1][0]    x[1][1]    x[1][2]
i=2      x[2][0]    x[2][1]    x[2][2]
...      ...        ...        ...
chrecs:
1st dim  {0,+,1}    {0,+,1}    {0,+,1}
2nd dim  {0}        {1}        {2}

(d) Coalesced accesses.

CHRECS_x = [{0,+,1}][{0,+,1}]

SLIDE 20

Detection of coalesced accesses to the GPU global memory

1 // only for_i is threadified
2 for (i = 0; i <= N; i++) {
3   for (j = 0; j <= N; j++) {
4     ... x[i][j] ...
5   }
6 }

(a) Source code S1.

         T0 (i=0)   T1 (i=1)   T2 (i=2)
j=0      x[0][0]    x[1][0]    x[2][0]
j=1      x[0][1]    x[1][1]    x[2][1]
j=2      x[0][2]    x[1][2]    x[2][2]
...      ...        ...        ...
chrecs:
1st dim  {0}        {1}        {2}
2nd dim  {0,+,1}    {0,+,1}    {0,+,1}

(b) Non-coalesced accesses.

1 // only for_j is threadified
2 for (j = 0; j <= N; j++) {
3   for (i = 0; i <= N; i++) {
4     ... x[i][j] ...
5   }
6 }

(c) Source code S2.

         T0 (j=0)   T1 (j=1)   T2 (j=2)
i=0      x[0][0]    x[0][1]    x[0][2]
i=1      x[1][0]    x[1][1]    x[1][2]
i=2      x[2][0]    x[2][1]    x[2][2]
...      ...        ...        ...
chrecs:
1st dim  {0,+,1}    {0,+,1}    {0,+,1}
2nd dim  {0}        {1}        {2}

(d) Coalesced accesses.

CHRECS_x = [{0,+,1}][{0,+,1}] (the threads traverse the same contiguous range)

SLIDE 21

Locality-Aware Generation of Efficient GPGPU Code (and III)

  2. Usage of registers to store reused data within a GPU thread
  3. Usage of the GPU shared memory for data shared between the threads of a warp
  4. Increase of the computational load of a GPU thread (loop tiling preserving coalescing)

SLIDE 22

Outline

  • Motivation: General Purpose Computation with GPUs
  • GPGPU with CUDA & OpenHMPP
  • The KIR: an IR for the Detection of Parallelism
  • Locality-Aware Generation of Efficient GPGPU Code
  • Case Studies: CONV3D & SGEMM
  • Performance Evaluation
  • Conclusions & Future Work
SLIDE 23

Case Study: CONV3D (I)

 1. int i, j, k, size_x, size_y, size_z;
 2. float coefx, coefy, coefz, *input, *output;
 3.
 4. for (i = 0; i < size_x; i++) {
 5.   for (j = 0; j < size_y; j++) {
 6.     for (k = 0; k < size_z; k++) {
 7.       float tempx = input[i][j][k] + coefx *
 8.         (
 9.           input[i-1][j][k] + input[i+1][j][k] +
10.           input[i-2][j][k] + input[i+2][j][k] +
11.           input[i-3][j][k] + input[i+3][j][k] +
12.           input[i-4][j][k] + input[i+4][j][k]
13.         );
14.       float tempy = input[i][j][k] + coefy *
15.         (
16.           input[i][j-1][k] + input[i][j+1][k] +
17.           input[i][j-2][k] + input[i][j+2][k] +
18.           input[i][j-3][k] + input[i][j+3][k] +
19.           input[i][j-4][k] + input[i][j+4][k]
20.         );
21.       float tempz = input[i][j][k] + coefz *
22.         (
23.           input[i][j][k-1] + input[i][j][k+1] +
24.           input[i][j][k-2] + input[i][j][k+2] +
25.           input[i][j][k-3] + input[i][j][k+3] +
26.           input[i][j][k-4] + input[i][j][k+4]
27.         );
28.       output[i][j][k] =
29.         output[i][j][k] + tempx + tempy + tempz;
30.     }
31.   }
32. }

KIR: ROOT EXECUTION SCOPE contains ES_for_i,j,k (Fig. 1a, lines 4-32) with:
  K<tempx_7>: scalar assignment
  K<tempy_14>: scalar assignment
  K<tempz_21>: scalar assignment
  K<output_28>: regular reduction

The scalar assignments are shaded: they are omitted in the discovery of parallelism.

SLIDE 24

Case Study: CONV3D (II)

  • conv3d-cpu
  • conv3d-hmpp1: Coalescing
    • Default OpenHMPP policy
    • Loop nest is permuted to for_j, for_k, for_i (permute directive)

 1. int i, j, k, size_x, size_y, size_z;
 2. float coefx, coefy, coefz, *input, *output;
 3.
 4. for (i = 0; i < size_x; i++) {
 5.   for (j = 0; j < size_y; j++) {
 6.     for (k = 0; k < size_z; k++) {
 7.       float tempx = input[i][j][k] + coefx *

CHRECS_input1 = [{0,+,1}][{0,+,1}][{0,+,1}]

Default OpenHMPP policy (for_i and for_j threadified):
  CHRECS_input1_T0 = [{0}][{0}][{0,+,1}]
  CHRECS_input1_T1 = [{0}][{1}][{0,+,1}]

After permutation to for_j, for_k, for_i (for_j and for_k threadified):
  CHRECS_input1_T0 = [{0,+,1}][{0}][{0}]
  CHRECS_input1_T1 = [{0,+,1}][{0}][{1}]

SLIDE 25

Case Study: CONV3D (III)

  • conv3d-hmpp2: Registers

 4. for (i = 0; i < size_x; i++) {
 5.   for (j = 0; j < size_y; j++) {
 6.     for (k = 0; k < size_z; k++) {
 7.       float tempx = input[i][j][k] + coefx *
 8.         (
 9.           input[i-1][j][k] + input[i+1][j][k] +
CHRECS_input1 = [{0,+,1}][{0,+,1}][{0,+,1}]     CHRECS_input1_T0 = [{0,+,1}][{0}][{0}]
CHRECS_input2 = [{-1,+,1}][{0,+,1}][{0,+,1}]    CHRECS_input2_T0 = [{-1,+,1}][{0}][{0}]
CHRECS_input3 = [{1,+,1}][{0,+,1}][{0,+,1}]     CHRECS_input3_T0 = [{1,+,1}][{0}][{0}]

The pairwise intersections are non-empty (∩ ≠ ∅): successive iterations of for_i access overlapping data, so the reused values are stored in registers.

SLIDE 26

Case Study: CONV3D (and IV)

  • conv3d-hmpp3: Shared memory

 4. for (i = 0; i < size_x; i++) {
 5.   for (j = 0; j < size_y; j++) {
 6.     for (k = 0; k < size_z; k++) {
...
21.       float tempz = input[i][j][k] + coefz *
22.         (
23.           input[i][j][k-1] + input[i][j][k+1] +
24.           input[i][j][k-2] + input[i][j][k+2] +
25.           input[i][j][k-3] + input[i][j][k+3] +
26.           input[i][j][k-4] + input[i][j][k+4]
27.         );

                 T0                              T1
                 1st dim   2nd dim   3rd dim     1st dim   2nd dim   3rd dim
CHRECS_input19   {0,+,1}   {0}       {0}         {0,+,1}   {0}       {1}
CHRECS_input20   {0,+,1}   {0}       {1}         {0,+,1}   {0}       {0}
CHRECS_input21   {0,+,1}   {0}       {1}         {0,+,1}   {0}       {2}
CHRECS_input22   {0,+,1}   {0}       {2}         {0,+,1}   {0}       {1}
CHRECS_input23   {0,+,1}   {0}       {2}         {0,+,1}   {0}       {3}
CHRECS_input24   {0,+,1}   {0}       {3}         {0,+,1}   {0}       {2}
CHRECS_input25   {0,+,1}   {0}       {3}         {0,+,1}   {0}       {4}
CHRECS_input26   {0,+,1}   {0}       {4}         {0,+,1}   {0}       {3}
CHRECS_input27   {0,+,1}   {0}       {4}         {0,+,1}   {0}       {5}

These accesses are shared between the threads of a warp and are placed in the GPU shared memory through the shared clause of the gridify directive.

SLIDE 27

Case Study: SGEMM (I)

 1. int i, j, l, m, n, k;
 2. float A[m][k], B[k][n], C[m][n];
 3. float alpha, beta, prod;
 4.
 5. for (i = 0; i < m; i++) {
 6.   for (j = 0; j < n; j++) {
 7.     prod = 0;
 8.     for (l = 0; l < k; l++) {
 9.       prod += A[i][l] * B[l][j];
10.     }
11.     C[i][j] = alpha * prod + beta * C[i][j];
12.   }
13. }

KIR: ROOT EXECUTION SCOPE contains ES_for_i,j (Fig. 3a, lines 5-13), which nests ES_for_l (Fig. 3a, lines 8-10), with:
  K<prod_7>: scalar assignment
  K<prod_9>: scalar reduction
  K<C_11>: regular reduction

SLIDE 28

Case Study: SGEMM (II)

  • sgemm-cpu
  • sgemm-mkl: Intel MKL
  • sgemm-hmpp1: Offloading (and check coalescing)

 1. int i, j, l, m, n, k;
 2. float A[m][k], B[k][n], C[m][n];
 3. float alpha, beta, prod;
 4.
 5. for (i = 0; i < m; i++) {
 6.   for (j = 0; j < n; j++) {
 7.     prod = 0;
 8.     for (l = 0; l < k; l++) {
 9.       prod += A[i][l] * B[l][j];
10.     }
11.     C[i][j] = alpha * prod + beta * C[i][j];
12.   }
13. }

          not instantiated       T0                  T1
          1st dim   2nd dim   1st dim   2nd dim   1st dim   2nd dim
CHRECS_A  {0,+,1}   {0,+,1}   {0}       {0,+,1}   {0}       {0,+,1}
CHRECS_B  {0,+,1}   {0,+,1}   {0,+,1}   {0}       {0,+,1}   {1}
CHRECS_C  {0,+,1}   {0,+,1}   {0}       {0}       {0}       {1}

SLIDE 29

Case Study: SGEMM (and III)

  • sgemm-hmpp2: Tiling preserving coalescing

 1. int i, j, l, m, n, k;
 2. float A[m][k], B[k][n], C[m][n];
 3. float alpha, beta, prod;
 4.
 5. for (i = 0; i < m; i++) {
 6.   for (j = 0; j < n; j++) {
 7.     prod = 0;
 8.     for (l = 0; l < k; l++) {
 9.       prod += A[i][l] * B[l][j];
10.     }
11.     C[i][j] = alpha * prod + beta * C[i][j];
12.   }
13. }

Algorithm 4: Increase the computational load of a GPU thread

1: procedure INCREASELOAD
   Input: access x_k[i_k,1][i_k,2]...[i_k,n] to an n-dimensional array x stored in row-major order
   Input: loop nest L = L1, L2, ..., Ll where both L1, L2 are threadified
   Input: amount of data Δ to be processed by a GPU thread
2:   increment the step of the outer loop L1 to Δ
3:   for each scalar variable s in L do
4:     promote s to an array s[Δ]
5:     transform reads and writes to s into loops of Δ iterations
6:   end for
7: end procedure

  • sgemm-hmpp3: Let the compiler use the registers (unroll)
  • sgemm-hmpp4: Use the shared memory for B
  • sgemm-cublas: NVIDIA CUBLAS library
SLIDE 30

Outline

  • Motivation: General Purpose Computation with GPUs
  • GPGPU with CUDA & OpenHMPP
  • The KIR: an IR for the Detection of Parallelism
  • Locality-Aware Generation of Efficient GPGPU Code
  • Case Studies: CONV3D & SGEMM
  • Performance Evaluation
  • Conclusions & Future Work
SLIDE 31

Performance Evaluation: CONV3D

[Chart: GFLOPS achieved by conv3d-cpu, conv3d-hmpp1, conv3d-hmpp2 and conv3d-hmpp3 on the CPU (nova), the GPU Tesla S1070 (nova) and the GPU Tesla S2050 (pluton); y-axis up to 120 GFLOPS.]

Fermi cards introduced memory caches

size_x, size_y and size_z in 128, 256, 384, 512, 640 and 768

SLIDE 32

Performance Evaluation: SGEMM

[Chart: GFLOPS achieved by sgemm-cpu, sgemm-mkl, sgemm-hmpp1, sgemm-hmpp2, sgemm-hmpp3, sgemm-hmpp4 and sgemm-cublas on the CPU (nova), the GPU Tesla S1070 (nova) and the GPU Tesla S2050 (pluton); y-axis up to 500 GFLOPS.]

The biggest improvement factor is the usage of the GPU shared memory.

m, n and k in 128, 256, 384, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 4096, 6144 and 8192

SLIDE 33

Outline

  • Motivation: General Purpose Computation with GPUs
  • GPGPU with CUDA & OpenHMPP
  • The KIR: an IR for the Detection of Parallelism
  • Locality-Aware Generation of Efficient GPGPU Code
  • Case Studies: CONV3D & SGEMM
  • Performance Evaluation
  • Conclusions & Future Work
SLIDE 34

Conclusions

KIR-based locality-aware automatic parallelization technique that targets GPU-based heterogeneous systems

  • exploits data locality in the complex GPU memory hierarchy
  • validated with two representative case studies: CONV3D & SGEMM
  • chains of recurrences model the accesses to n-dimensional arrays
  • OpenHMPP directives enabled great understandability and portability of the generated GPU code

SLIDE 35

Future Work

  • New automatic partitioning algorithm of the KIR to handle the interactions between computations in full-scale applications
  • Auto-tuning approaches to select the best performance on a given hardware architecture
  • Tests with a larger benchmark suite and on other manycore accelerators

SLIDE 36

Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives

José M. Andión, Manuel Arenaz, François Bodin, Gabriel Rodríguez and Juan Touriño

7th International Symposium on High-Level Parallel Programming and Applications (HLPP 2014) July 3-4, 2014 — Amsterdam, Netherlands