
COMP 633 - Parallel Computing, Lecture 15 (October 1, 2020): Programming Accelerators using Directives



  1. COMP 633 - Parallel Computing, Lecture 15, October 1, 2020: Programming Accelerators using Directives.
     Credits: Introduction to OpenACC and toolkit – Jeff Larkin, Nvidia – Oct 2015

  2. Heterogeneous Parallel Computers
     • Composed of
       – CPU(s): low-latency processors optimized for sequential execution; large memory size and deep memory hierarchy
       – 1-8 accelerator(s): high-throughput SIMD or MIMD processors optimized for data-parallel execution; high-performance local memory with limited size (16-24 GB) and a shallow memory hierarchy
     • Example – multisocket compute server
       – Host: two sockets with 20-40 Intel Xeon cores and 128-512 GB of CC-NUMA shared memory
       – Accelerators: 1-8 accelerators (e.g. Nvidia Cuda cards) connected via PCIe x16 interfaces (16 GB/s); the host controls data movement to/from accelerator memory

  3. Scaling accelerators and interconnect
     • DGX-2 (2018): 16 GPUs with a 300 GB/s full-bisection-bandwidth interconnect

  4. Basic Programming Models
     • Offload model – idea: offload computational kernels to the accelerator (GPU, Xeon Phi)
       – send data
       – call kernel(s)
       – retrieve data
     • Accelerator-specific compiler support (see the sketch after this slide)
       – Cuda compiler (nvcc) for Nvidia GPUs
       – Intel vectorizing compiler (icc -mmic) for Intel Xeon Phi KNL
         #pragma offload target(mic:n) in(…) out(…) inout(…)
     • Accelerator-neutral OpenCL
       – Cuda-like notation
       – OpenCL compiler can target Nvidia GPUs or Intel Xeon Phi
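     A minimal sketch of the offload model using the Intel offload pragma shown above. The function name, array names, sizes, and the kernel body are illustrative, not from the lecture; only the pragma form follows the slide.

     /* Offload-model sketch: send a, run the loop on coprocessor 0, retrieve b. */
     void scale(const double *a, double *b, int n)
     {
         /* in() copies a to the device, out() copies b back to the host */
         #pragma offload target(mic:0) in(a:length(n)) out(b:length(n))
         for (int i = 0; i < n; i++)
             b[i] = 2.0 * a[i];
     }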

  5. Emerging Programming Models
     • Directive model – idea: identify sections of code to be compiled for accelerator(s)
       – data transfer and kernel invocation generated by the compiler
     • Accelerator-neutral efforts
       – OpenACC
           #pragma acc parallel loop
           for (…) { … }
         gang, worker, vector (map to threadblock, warp, warp in SIMT lockstep)
         compilers: gcc 5, PGI, Cray, CAPS, Nvidia
       – OpenMP 4.0
         similar directives to (but more general than) OpenACC
         implemented by the gcc 4.9 and icc compilers
     • Accelerator-specific compiler support
       – Intel Cilk Plus and C++ compilers for Intel Xeon Phi
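     To make the comparison concrete, here is a sketch of the same data-parallel loop written with each directive set. The saxpy-style loop, function names, and variable names are illustrative; only the directive spellings are the standard OpenACC and OpenMP 4.0 forms.

     /* OpenACC: the compiler generates the kernel and the data transfers */
     void saxpy_acc(int n, float a, const float *x, float *y)
     {
         #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
         for (int i = 0; i < n; i++)
             y[i] = a * x[i] + y[i];
     }

     /* OpenMP 4.0 target directives expressing the same computation */
     void saxpy_omp(int n, float a, const float *x, float *y)
     {
         #pragma omp target teams distribute parallel for map(to:x[0:n]) map(tofrom:y[0:n])
         for (int i = 0; i < n; i++)
             y[i] = a * x[i] + y[i];
     }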

  6. Introduction to OpenACC – Jeff Larkin, NVIDIA Developer Technologies

  7. Why OpenACC?

  8. Why OpenACC? Simple | Powerful | Portable
       main() {
         <serial code>
         #pragma acc kernels   // OpenACC: automatically runs on GPU
         {
           <parallel code>
         }
       }
     • Fueling the next wave of scientific discoveries in HPC; 8000+ developers using OpenACC
     • University of Illinois, PowerGrid MRI reconstruction: 70x speed-up with 2 days of effort
     • RIKEN Japan, NICAM climate modeling: 7-8x speed-up with 5% of code modified
     Sources:
       http://www.cray.com/sites/default/files/resources/OpenACC_213462.12_OpenACC_Cosmo_CS_FNL.pdf
       http://www.hpcwire.com/off-the-wire/first-round-of-2015-hackathons-gets-underway
       http://on-demand.gputechconf.com/gtc/2015/presentation/S5297-Hisashi-Yashiro.pdf
       http://www.openacc.org/content/experiences-porting-molecular-dynamics-code-gpus-cray-xk7

  9. OpenACC Directives
       #pragma acc data copyin(a,b) copyout(c)   // manage data movement
       {
         ...
         #pragma acc parallel                    // initiate parallel execution
         {
           #pragma acc loop gang vector          // optimize loop mappings
           for (i = 0; i < n; ++i) {
             z[i] = x[i] + y[i];
             ...
           }
         }
         ...
       }
     • Incremental • Single source • Interoperable • Performance portable: CPU, GPU, MIC

  10. Accelerated Computing Fundamentals

  11. Accelerated Computing: 10x performance & 5x energy efficiency for HPC
     • GPU accelerator: optimized for parallel tasks
     • CPU: optimized for serial tasks

  12. What is Heterogeneous Programming?
     • Application code = compute-intensive functions + the rest of the sequential code
     • Compute-intensive functions (a few % of the code, a large % of the time) run on the GPU
     • The rest of the sequential code runs on the CPU

  13. Portability & Performance
     • Accelerated libraries: high performance with little or no code change; limited by what libraries are available
     • Compiler directives: high level, based on existing languages; simple, familiar, portable; performance may not be optimal
     • Parallel language extensions: greater flexibility and control for maximum performance; often less portable and more time consuming to implement

  14. Code for Portability & Performance
     • Libraries: implement as much as possible using portable libraries
     • Directives: use directives for rapid and portable development
     • Languages: use lower-level languages for important kernels

  15. OpenACC Programming Cycle

  16. The programming cycle: Identify Available Parallelism → Express Loop Parallelism → Express Data Movement → Optimize Performance (and repeat)

  17. Example: Jacobi Iteration
     • Iteratively converges to the correct value (e.g. temperature) by computing new values at each point from the average of the neighboring points
     • Common, useful algorithm
     • Example: solve the Laplace equation in 2D:  $\nabla^2 f(x,y) = 0$
     • Each point is updated from its four neighbors A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1):
       $A_{new}(i,j) = \frac{A(i-1,j) + A(i+1,j) + A(i,j-1) + A(i,j+1)}{4}$

  18. Jacobi Iteration: C Code
       while ( err > tol && iter < iter_max ) {      // iterate until converged
         err = 0.0;
         for( int j = 1; j < n-1; j++) {             // iterate across matrix elements
           for( int i = 1; i < m-1; i++ ) {
             // calculate new value from neighbors
             Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                  A[j-1][i] + A[j+1][i]);
             // compute max error for convergence
             err = max(err, abs(Anew[j][i] - A[j][i]));
           }
         }
         for( int j = 1; j < n-1; j++) {             // swap input/output arrays
           for( int i = 1; i < m-1; i++ ) {
             A[j][i] = Anew[j][i];
           }
         }
         iter++;
       }

  19. Programming cycle: Identify Available Parallelism → Express Loop Parallelism → Express Data Movement → Optimize Performance

  20. Identify Parallelism
       while ( err > tol && iter < iter_max ) {      // data dependency between iterations
         err = 0.0;
         for( int j = 1; j < n-1; j++) {             // independent loop iterations
           for( int i = 1; i < m-1; i++ ) {
             Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                  A[j-1][i] + A[j+1][i]);
             err = max(err, abs(Anew[j][i] - A[j][i]));
           }
         }
         for( int j = 1; j < n-1; j++) {             // independent loop iterations
           for( int i = 1; i < m-1; i++ ) {
             A[j][i] = Anew[j][i];
           }
         }
         iter++;
       }

  21. Programming cycle: Identify Available Parallelism → Express Loop Parallelism → Express Data Movement → Optimize Performance

  22. OpenACC kernels Directive
     The kernels directive identifies a region that may contain loops that the compiler can turn into parallel kernels. Here the compiler identifies two parallelizable loops and generates two kernels:
       #pragma acc kernels
       {
         for(int i=0; i<N; i++)      // kernel 1
         {
           x[i] = 1.0;
           y[i] = 2.0;
         }
         for(int i=0; i<N; i++)      // kernel 2
         {
           y[i] = a*x[i] + y[i];
         }
       }
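     For comparison, the same two loops could also be written with the parallel loop directive mentioned on slide 5: with kernels the compiler decides what is safe to parallelize, while parallel loop asserts that each marked loop is parallel. A minimal sketch (the enclosing function name is hypothetical; x, y, a, N follow the slide above):

     void init_and_saxpy(int N, float a, float *restrict x, float *restrict y)
     {
         /* programmer asserts this loop is parallel */
         #pragma acc parallel loop
         for (int i = 0; i < N; i++) {
             x[i] = 1.0f;
             y[i] = 2.0f;
         }

         /* second kernel, also asserted parallel */
         #pragma acc parallel loop
         for (int i = 0; i < N; i++) {
             y[i] = a * x[i] + y[i];
         }
     }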

  23. Parallelize with OpenACC kernels
       while ( err > tol && iter < iter_max ) {
         err = 0.0;
         #pragma acc kernels        // look for parallelism within this region
         {
           for( int j = 1; j < n-1; j++) {
             for( int i = 1; i < m-1; i++ ) {
               Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                    A[j-1][i] + A[j+1][i]);
               err = max(err, abs(Anew[j][i] - A[j][i]));
             }
           }
           for( int j = 1; j < n-1; j++) {
             for( int i = 1; i < m-1; i++ ) {
               A[j][i] = Anew[j][i];
             }
           }
         }
         iter++;
       }
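     The compiler feedback on the next slide shows that a max reduction is generated automatically for err. A sketch of a more explicit alternative, replacing the kernels region inside the while loop with parallel loop directives and an explicit reduction clause (not the code the lecture builds on, just the equivalent spelled out):

     /* Explicit form: assert both loop nests parallel and reduce err with max */
     #pragma acc parallel loop reduction(max:err)
     for (int j = 1; j < n-1; j++) {
         #pragma acc loop reduction(max:err)
         for (int i = 1; i < m-1; i++) {
             Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                  A[j-1][i] + A[j+1][i]);
             err = max(err, abs(Anew[j][i] - A[j][i]));
         }
     }

     #pragma acc parallel loop
     for (int j = 1; j < n-1; j++) {
         for (int i = 1; i < m-1; i++) {
             A[j][i] = Anew[j][i];
         }
     }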

  24. Building the code
       $ pgcc -fast -ta=tesla -Minfo=all laplace2d.c
       main:
            40, Loop not fused: function call before adjacent loop
                Generated vector sse code for the loop
            51, Loop not vectorized/parallelized: potential early exits
            55, Generating copyout(Anew[1:4094][1:4094])
                Generating copyin(A[:][:])
                Generating copyout(A[1:4094][1:4094])
                Generating Tesla code
            57, Loop is parallelizable
            59, Loop is parallelizable
                Accelerator kernel generated
                57, #pragma acc loop gang /* blockIdx.y */
                59, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
                63, Max reduction generated for error
            67, Loop is parallelizable
            69, Loop is parallelizable
                Accelerator kernel generated
                67, #pragma acc loop gang /* blockIdx.y */
                69, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

  25. Speed-up (higher is better): Intel Xeon E5-2698 v3 @ 2.30GHz (Haswell) vs. NVIDIA Tesla K40
     • Single thread: 1.00X
     • 2 threads: 1.66X
     • 4 threads: 2.77X
     • 6 threads: 2.91X
     • 8 threads: 3.29X
     • OpenACC: 0.90X (why did OpenACC slow down here?)

  26. Very low compute/memcpy ratio: compute 5 seconds, memory copy 62 seconds

  27. Profiler timeline: PCIe copies dominate, 104 ms/iteration

  28. Excessive Data Transfers
       while ( err > tol && iter < iter_max ) {
         err = 0.0;
                                                   // A, Anew resident on host
         #pragma acc kernels                       // copy to accelerator
         {
                                                   // A, Anew resident on accelerator
           for( int j = 1; j < n-1; j++) {
             for( int i = 1; i < m-1; i++ ) {
               Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                    A[j-1][i] + A[j+1][i]);
               err = max(err, abs(Anew[j][i] - A[j][i]));
             }
           }
           ...
                                                   // A, Anew resident on accelerator
         }                                         // copy back to host
                                                   // A, Anew resident on host
         ...
       }
     These copies happen every iteration of the outer while loop!
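     Slide 9 introduced the data directive for managing data movement. One plausible next step (a sketch of the usual fix for this tutorial's code, not necessarily what the following slides show) is to wrap the while loop in a data region so A and Anew stay resident on the accelerator across iterations, assuming A and Anew are n-by-m arrays:

     /* Hoist the transfers out of the convergence loop with a data region:
        copy() moves A to the device once and back at the end of the region;
        create() allocates Anew on the device without any host copies. */
     #pragma acc data copy(A[0:n][0:m]) create(Anew[0:n][0:m])
     while ( err > tol && iter < iter_max ) {
         err = 0.0;
         #pragma acc kernels
         {
             for (int j = 1; j < n-1; j++) {
                 for (int i = 1; i < m-1; i++) {
                     Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                          A[j-1][i] + A[j+1][i]);
                     err = max(err, abs(Anew[j][i] - A[j][i]));
                 }
             }
             for (int j = 1; j < n-1; j++) {
                 for (int i = 1; i < m-1; i++) {
                     A[j][i] = Anew[j][i];
                 }
             }
         }
         iter++;
     }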
