

  1. Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives José M. Andión, Manuel Arenaz, François Bodin, Gabriel Rodríguez and Juan Touriño 7th International Symposium on High-Level Parallel Programming and Applications (HLPP 2014) July 3-4, 2014 — Amsterdam, Netherlands

  2. Outline • Motivation: General Purpose Computation with GPUs • GPGPU with CUDA & OpenHMPP • The KIR: an IR for the Detection of Parallelism • Locality-Aware Generation of Efficient GPGPU Code • Case Studies: CONV3D & SGEMM • Performance Evaluation • Conclusions & Future Work

  3. Outline • Motivation: General Purpose Computation with GPUs • GPGPU with CUDA & OpenHMPP • The KIR: an IR for the Detection of Parallelism • Locality-Aware Generation of Efficient GPGPU Code • Case Studies: CONV3D & SGEMM • Performance Evaluation • Conclusions & Future Work

  4. The Parallel Challenge [Figure: growth in uniprocessor performance relative to the VAX-11/780, 1978-2012, with growth rates of roughly 25%/year up to the mid-1980s, 52%/year through the early 2000s, and 22%/year since.] David A. Patterson and John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Elsevier, 2014.

  5. General Purpose Computing with GPUs [Figure: number of TOP500 systems equipped with accelerators/co-processors (ATI/AMD, Intel, NVIDIA, IBM Cell, ClearSpeed CSX600), 2006-2014.] The TOP500 List, June 2014.

  6. Outline • Motivation: General Purpose Computation with GPUs • GPGPU with CUDA & OpenHMPP • The KIR: an IR for the Detection of Parallelism • Locality-Aware Generation of Efficient GPGPU Code • Case Studies: CONV3D & SGEMM • Performance Evaluation • Conclusions & Future Work

  7. GPGPU with CUDA • First GPGPU programs looked like graphics applications • CUDA enables the use of C: a CUDA kernel specifies the operation of a single GPU thread • Main ideas: 1. Lightweight parallel threads organized in a hierarchy: grid, block 2. Shared memory 3. Barriers
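A minimal CUDA sketch (not taken from the slides) of these three ideas: a grid of thread blocks, a software-managed shared-memory buffer, and a block-level barrier. The kernel name, the TILE size and the use of managed memory are illustrative assumptions.

#include <cuda_runtime.h>
#include <stdio.h>

#define TILE 256   /* threads per block (illustrative choice) */

/* Each GPU thread computes one element of c = a + b. */
__global__ void vecAddKernel(const float *a, const float *b, float *c, int n)
{
    __shared__ float tile_a[TILE];                   /* software-managed shared memory */
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index in the grid */
    tile_a[threadIdx.x] = (i < n) ? a[i] : 0.0f;     /* stage one operand per thread */
    __syncthreads();                                 /* barrier: whole thread block */
    if (i < n)
        c[i] = tile_a[threadIdx.x] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    int blocks = (n + TILE - 1) / TILE;              /* grid size: one block per TILE elements */
    vecAddKernel<<<blocks, TILE>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %.1f\n", c[0]);                   /* expected: 3.0 */
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}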

  8. GPU Programming Features in CUDA 1 Threadification 2 Thread grouping: warps 3 Minimization of CPU-GPU data transfers 4 Coalescing 5 Maximization of the usage of registers and shared memory 6 Divergence 7 Occupancy 8 Threads per block
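As a hedged illustration of feature 4 (not from the slides), the two CUDA kernels below copy the same row-major 2-D array but threadify it differently: in copy_coalesced, consecutive threads of a warp access consecutive addresses; in copy_strided their accesses are a full row apart, so each warp needs many separate memory transactions. Kernel names and launch configurations are assumptions for the example.

/* Coalesced: threadIdx.x walks along a row of the row-major array. */
__global__ void copy_coalesced(const float *in, float *out, int rows, int cols)
{
    int row = blockIdx.y;                              /* one grid row per array row */
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows && col < cols)
        out[row * cols + col] = in[row * cols + col];  /* a warp reads a contiguous segment */
}
/* launch, e.g.: copy_coalesced<<<dim3((cols + 255) / 256, rows), 256>>>(d_in, d_out, rows, cols); */

/* Strided: threadIdx.x walks down a column, so warp accesses are cols floats apart. */
__global__ void copy_strided(const float *in, float *out, int rows, int cols)
{
    int col = blockIdx.y;
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows && col < cols)
        out[row * cols + col] = in[row * cols + col];  /* uncoalesced: one transaction per thread */
}
/* launch, e.g.: copy_strided<<<dim3((rows + 255) / 256, cols), 256>>>(d_in, d_out, rows, cols); */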

  9. GPGPU with OpenHMPP • Directive-based approaches provide several advantages: • More readable code • Only one source file • Independence from the hardware platform • Reasonable performance

  10. GPGPU with OpenHMPP • Directive-based approaches provide several advantages: • More readable code • Only one source file • Independence from the hardware platform • Reasonable performance • Explicit control of software-managed caches

  11. GPGPU with OpenHMPP • Directive-based approaches provide several advantages: • More readable code • Only one source file • Independence from the hardware platform • Reasonable performance • Explicit control of software-managed caches • Standard loop transformations

  12. GPU Programming Features with OpenHMPP 1 Threadification 2 Thread grouping 3 Minimization of CPU-GPU data transfers 4 Coalescing 5 Maximization of the usage of registers and shared memory 6 Divergence 7 Occupancy 8 Threads per block • Addressed through OpenHMPP/HMPPCG directives: gridify, advancedLoad, delegatedStore, permute, unroll, fuse, tile, shared
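A hedged sketch of how these directives are used in practice (not from the slides): a SAXPY-like codelet offloaded with OpenHMPP and threadified with the hmppcg gridify hint. The codelet name and the exact directive spellings follow common HMPP Workbench usage and may differ across compiler versions.

/* A codelet is a pure function that the HMPP compiler turns into a GPU kernel. */
#pragma hmpp saxpy codelet, target=CUDA, args[y].io=inout
void saxpy(int n, float a, const float x[n], float y[n])
{
    int i;
#pragma hmppcg gridify(i)   /* threadify this loop: one GPU thread per iteration */
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    enum { N = 1024 };
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

#pragma hmpp saxpy callsite   /* run the codelet on the accelerator */
    saxpy(N, 3.0f, x, y);

    return (y[0] == 5.0f) ? 0 : 1;   /* 3*1 + 2 = 5 */
}

As we understand the directive set, advancedLoad and delegatedStore decouple the CPU-GPU transfers from the callsite (feature 3), while permute, unroll, fuse, tile and shared are further code-generator hints applied to the loops and arrays of the codelet.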

  13. Outline • Motivation: General Purpose Computation with GPUs • GPGPU with CUDA & OpenHMPP • The KIR: an IR for the Detection of Parallelism • Locality-Aware Generation of Efficient GPGPU Code • Case Studies: CONV3D & SGEMM • Performance Evaluation • Conclusions & Future Work

  14. diKernel: Domain-Independent Computational Kernel
  [Figure: levels of abstraction of a program]
  • DOMAIN-SPECIFIC CONCEPT LEVEL (problem solving methods and application domain)
  • DOMAIN-INDEPENDENT CONCEPT LEVEL (programming practice): the level at which diKernels are defined
  • SEMANTIC LEVEL (control flow and data dependence graphs)
  • SYNTACTIC LEVEL (abstract syntax tree)
  • TEXT LEVEL (ASCII code)
  • A diKernel characterizes the computations carried out in a program without being affected by how they are coded
  • Exposes multiple levels of parallelism
  M. Arenaz et al. XARK: An Extensible Framework for Automatic Recognition of Computational Kernels. ACM Transactions on Programming Languages and Systems, 30(6), 2008.

  15. Building the KIR • Non-statement-based, high-level, hierarchical IR 1. diKernel recognition on the DDG 2. Identification of flow dependences 3. Hierarchy of execution scopes reflecting the computational stages & diKernel classification J.M. Andión et al. A Novel Compiler Support for Automatic Parallelization on Multicore Systems. Parallel Computing, 39(9), 2013.

  16. Example of KIR: CONV3D
  Source code (Fig. 1a):
   1. int i, j, k, size_x, size_y, size_z;
   2. float coefx, coefy, coefz, *input, *output;
   3.
   4. for (i = 0; i < size_x; i++) {
   5.   for (j = 0; j < size_y; j++) {
   6.     for (k = 0; k < size_z; k++) {
   7.       float tempx = input[i][j][k] + coefx *
   8.         (
   9.           input[i-1][j][k] + input[i+1][j][k] +
  10.           input[i-2][j][k] + input[i+2][j][k] +
  11.           input[i-3][j][k] + input[i+3][j][k] +
  12.           input[i-4][j][k] + input[i+4][j][k]
  13.         );
  14.       float tempy = input[i][j][k] + coefy *
  15.         (
  16.           input[i][j-1][k] + input[i][j+1][k] +
  17.           input[i][j-2][k] + input[i][j+2][k] +
  18.           input[i][j-3][k] + input[i][j+3][k] +
  19.           input[i][j-4][k] + input[i][j+4][k]
  20.         );
  21.       float tempz = input[i][j][k] + coefz *
  22.         (
  23.           input[i][j][k-1] + input[i][j][k+1] +
  24.           input[i][j][k-2] + input[i][j][k+2] +
  25.           input[i][j][k-3] + input[i][j][k+3] +
  26.           input[i][j][k-4] + input[i][j][k+4]
  27.         );
  28.       output[i][j][k] =
  29.         output[i][j][k] + tempx + tempy + tempz;
  30.     }
  31.   }
  32. }
  KIR (hierarchy of execution scopes and diKernels):
  • ROOT EXECUTION SCOPE
    • ES_for_i,j,k (Fig. 1a, lines 4-32)
      • K < tempx_7 > scalar assignment
      • K < tempy_14 > scalar assignment
      • K < tempz_21 > scalar assignment
      • K < output_28 > regular reduction
  Shaded diKernels are omitted in the discovering of parallelism.
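To connect this case study with the OpenHMPP directives of slide 12, here is a hedged sketch (not the code generated by the authors' technique) of one possible offload of this loop nest: the codelet is threadified over i and j with gridify, and the stencil body is restated compactly. Whether this choice of threadified loops yields coalesced accesses, that is, consecutive threads of a warp touching consecutive positions along the contiguous k dimension, is exactly the locality question the following slides address. The function name, the array-size parameters and the shrunken loop bounds (to keep the stencil in bounds) are assumptions for the example.

#pragma hmpp conv3d codelet, target=CUDA, args[output].io=inout
void conv3d(int size_x, int size_y, int size_z,
            float coefx, float coefy, float coefz,
            const float input[size_x][size_y][size_z],
            float output[size_x][size_y][size_z])
{
    int i, j, k;
#pragma hmppcg gridify(i, j)   /* one GPU thread per (i, j); each thread runs the k loop */
    for (i = 4; i < size_x - 4; i++)
        for (j = 4; j < size_y - 4; j++)
            for (k = 4; k < size_z - 4; k++) {
                /* compact restatement of the stencil shown in Fig. 1a */
                float tempx = input[i][j][k], tempy = tempx, tempz = tempx;
                for (int r = 1; r <= 4; r++) {
                    tempx += coefx * (input[i-r][j][k] + input[i+r][j][k]);
                    tempy += coefy * (input[i][j-r][k] + input[i][j+r][k]);
                    tempz += coefz * (input[i][j][k-r] + input[i][j][k+r]);
                }
                output[i][j][k] += tempx + tempy + tempz;
            }
}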

  17. Outline • Motivation: General Purpose Computation with GPUs • GPGPU with CUDA & OpenHMPP • The KIR: an IR for the Detection of Parallelism • Locality-Aware Generation of Efficient GPGPU Code • Case Studies: CONV3D & SGEMM • Performance Evaluation • Conclusions & Future Work

  18. GPU Programming Features addressed by our Automatic Technique 1 Threadification 2 Thread grouping 3 Minimization of CPU-GPU data transfers 4 Coalescing 5 Maximization of the usage of registers and shared memory 6 Divergence 7 Occupancy 8 Threads per block
