
Programming Heterogeneous Systems. F. Bodin, June 2013, Uppsala.

Introduction
 HPC and embedded software are undergoing dramatic changes to adapt to massive parallelism
o A huge market issue
o Many codes and users are not ready
 Directive-based approaches can ease the transition


  1. Other Source-to-Source Technologies
   PoCC (Polyhedral Compiler Collection)
  o http://www.cse.ohio-state.edu/~pouchet/software/pocc/
   Rose source-to-source translators
  o http://rosecompiler.org/
   Cetus, source-to-source compiler infrastructure
  o http://cetus.ecn.purdue.edu
   EDG, Edison Design Group technology
  o http://www.edg.com
   Clang
  o http://clang.llvm.org
   Insieme Compiler
  o http://www.insieme-compiler.org/
   …
  5/06/13 Uppsala 22

  2. CAPS Auto-tuning Approach

  3. An Embedded DSL Oriented Approach to Auto-Tuning
   Source-to-source approach
  o Exploits native compilers
   Code partitioning to help the offline approach
  o CodeletFinder
   Scripting to implement DSL approaches
  o Generates domain / code specific search / optimization spaces
  o Static selection of variants
   Runtime instantiation of code parameters
  o Variant parameters fixed at runtime
   Low level API for "auto-tuning drivers"
  o Separates objective functions from optimization space generation
   An engineering issue
  o How to embed code/domain specific strategies so they are ready for programmers to use
  o Still dealing with legacy code
  o Integration in the compiling process is a key feature
   Focus on node level issues

  4. Auto-tuning Flow Overview
  [Diagram: the source code goes through CodeletFinder and the HMPP compiler, driven by optimizing scripts and an optimizing strategy; the resulting auto-tunable executable produces profiling data that a tuning driver uses to explore the variant space through the auto-tuning code interface]

  5. Code Partitioning for Auto-Tuning
   Tuning and analyzing the performance of large/complex applications is usually a challenge
  o Execution time with real data sets is usually too long to be compatible with the trial/experiment cycle
  o Many performance or tuning tools cannot be used at large scale
   The compute intensive parts of the code usually represent a small portion of the total application
  o Extract them to focus the effort on them
  o Allows many analysis and optimizing tools to be used
  o Faster experiment cycles
   Similar works: Code Isolator (Y.-J. Lee and M. W. Hall) and Rose Outliner (C. Liao, D. J. Quinlan, R. Vuduc, and T. Panas)

  6. CodeletFinder
   Decomposes applications into hotspots
   Each hotspot can be efficiently analyzed separately
  o Outlined hotspots
   Mix of static and dynamic analysis
  o Code and data

  7. CodeletFinder Process Overview
   Hotspot Finder: finds hotspots in the application using execution profiles; statically extracts potential hotspots
   Codelet Builder: builds the codelets based on the identified hotspots (code outliner); patterns are given to build the codelets
   Micro Bencher: captures data for the micro-benches; creates standalone micro-benches; runs the micro-benches
   Project Capture: captures the build process; captures execution parameters; replays the build on demand

  8. Scripting to Generate the Search Space
   Most tuning strategies are code/domain specific
  o With regard to the code structure and runtime properties
  o Many codes live long, which amortizes code specific approaches
   Many different high-level approaches can be embedded
  o Stencil code generators (e.g. Patus)
  o Polyhedral model based approaches (e.g. PoCC)
  o Libraries
  o Data structure transformations
  o …

  9. New CAPS Compiler Features
   Code transformation scripts (in Lua) can be added as a pre-compilation phase
  o Application/domain specific scripts feed a scripting engine between the language frontends and the code generation
   Scripts can read and modify the source code AST
  [Diagram: C/C++/Fortran frontends, extraction module producing host code and kernels (Fun #1, #2, #3), instrumentation module, OpenCL/CUDA generation, native CPU compilers (gcc, ifort, …)]

  10. Tuning Script Implementation
   Directives convey programmer knowledge
   The code provides low level information
  o e.g. loop indices, variable names, …
   Scripts hide low level code transformation details
   Many loop transformations can be implemented using hmppcg directives
   The directive arguments provide high level information to the scripts and select the script to activate:

    !$capstune scriptName scriptInput
      ! code region
    !$capstune end scriptName
    …

  11. Simple Example-1
  Specify the script to generate an optimized stencil code using various methods:
  - multiple variants
  - external tools
  - using a library

    …
    !$capstune stencil …
    !$acc kernel
    !$acc loop independent
    do i=1,10
    !$acc loop independent
      do j=1,10
        a(i,j) = … b(i,j) …
      end do
    end do
    !$acc end kernel
    !$capstune end stencil
    …

  12. Simple Example-2
  Transform a data structure for an accelerator:
  - Take slices of a derived type
  - The decision usually cannot be made on local code analysis alone

    TYPE foo
      REAL :: w(10,10)
      REAL :: x(10,10)
      REAL :: y(10,10)
      REAL :: z(10,10)
    END TYPE foo
    …
    !$capstune scalarize state_x => state%x , state_z => state%z
    !$acc parallel num_gangs(10) num_workers(10) copyout(state_x) copyin(state_z)
    !$acc loop gang
    do i=1,10
    !$acc loop worker
      do j=1,10
        state%x(i,j) = state%z(i,j) + i+j/1000.0
      end do
    end do
    !$acc end parallel
    !$capstune end scalarize

  13. Making OpenMP Codes Heterogeneous

  14. Code Generation Process Overview
   Converts OpenMP code to use the GPU automatically
   Currently focusing on AMD APUs
   Incremental process to make the OpenMP code GPU friendly

  15. Data Uses Analysis
   Necessary to allocate data on the accelerator and to estimate basic data transfer overheads
   Keeps the analysis overhead low
   Analysis based on an abstract execution of the OpenMP loop nest sequence

  16. Preliminary Example of Experiments
  [Figure: experiments display]

  17. PART I Conclusion
   OpenMP is a good start to migrate codes
  o Data use analysis is a key feature
   Source-to-source technology is well adapted to heterogeneity
  o Avoids the "one compiler fits all" approach
   Auto-tuning techniques help to simplify code tuning and deployment
   The DSL approach helps to guide the auto-tuning process

  18. PART II OpenACC Directives for Accelerators

  19. Credits
   http://www.openacc.org/
  o V1.0: November 2011 Specification
   OpenACC, Directives for Accelerators, Nvidia Slideware
   CAPS Compilers-3.x OpenACC Reference Manual, CAPS entreprise

  20. Agenda
   OpenACC Overview and Compilers
  o Lab Session 1: Using CAPS Compilers
   Programming Model
  o Lab Session 2: Offloading Computations
   Managing Data
  o Lab Session 3: Optimizing Data Transfers
   Specifying Parallelization
  o Lab Session 4: Optimizing Compute Kernels
   Asynchronism
  o Lab Session 5: Performing Asynchronous Computations
   Runtime API
   OpenACC 2.0 Draft Specification

  21. OpenACC Overview and Compilers

  22. Directive-based Programming (1)
   Three ways of programming GPGPU applications:
  o Libraries: ready-to-use acceleration
  o Directives: quickly accelerate existing applications
  o Programming languages: maximum performance

  23. Directive-based Programming (2)

  24. Advantages of Directive-based Programming
   Simple and fast development of accelerated applications
   Non-intrusive
   Helps to keep a single version of the code
  o To preserve code assets
  o To reduce maintenance cost
  o To be portable across several accelerators
   Incremental approach
   Enables "portable" performance

  25. OpenACC Initiative
   A CAPS, CRAY, Nvidia and PGI initiative
   Open standard
   A directive-based approach for programming heterogeneous many-core hardware from C and FORTRAN applications
   http://www.openacc-standard.com

  26. OpenACC Compilers (1)
  CAPS Compilers:
  o Source-to-source compiler
  o Supports Intel Xeon Phi, NVIDIA GPUs, AMD GPUs and APUs
  PGI Accelerator:
  o Extension of the x86 PGI compilers
  o Supports Intel Xeon Phi, NVIDIA GPUs, AMD GPUs and APUs
  Cray Compiler:
  o Provided with Cray systems only

  27. CAPS Compilers (2)
  Source-to-source compilers, composed of 3 parts:
   The directives (OpenACC or OpenHMPP)
  o Define the parts of the code to be accelerated
  o Indicate resource allocation and communication
  o Ensure portability
   The toolchain
  o Helps building manycore applications
  o Includes compilers and target code generators
  o Insulates hardware specific computations
  o Uses the hardware vendor SDKs
   The runtime
  o Helps to adapt to the platform configuration
  o Manages hardware resource availability

  28. CAPS Compilers (3)
   Take the original application as input and generate another application source code as output
  o Automatically turn the OpenACC source code into an accelerator-specific source code (CUDA, OpenCL)
   Compile the entire hybrid application
   Just prefix the original compilation line with capsmc to produce a hybrid application

    $ capsmc gcc myprogram.c
    $ capsmc gfortran myprogram.f90

  29. CAPS Compilers (4)
   CAPS Compilers drive all compilation passes
   Host application compilation
  o Calls traditional CPU code compilers
  o The CAPS Runtime is linked to the host part of the application
   Device code production
  o According to the specified target
  o A dynamic library is built
  [Diagram: C/C++/Fortran frontends feed an extraction module that splits host code from codelets (Fun #1, #2, #3); an instrumentation module plus OpenCL/CUDA code generation feed the CPU compiler (gcc, ifort, …) and the OpenCL/CUDA compilers, producing the executable (mybin.exe) linked with the CAPS Runtime and the HWA code as a dynamic library]

  30. CAPS Compilers Options
   Usage:

    $ capsmc [CAPSMC_FLAGS] <host_compiler> [HOST_COMPILER_FLAGS] <source_files>

   To display the compilation process:

    $ capsmc -d -c gcc myprogram.c

   To specify the accelerator-specific code target:

    $ capsmc --openacc-target CUDA gcc myprogram.c    # (default)
    $ capsmc --openacc-target OPENCL gcc myprogram.c  # (AMD and Phi)

  31. Lab Session 1: Using CAPS Compilers

  32. Lab 1: Using CAPS Compilers
   Compile and execute a simple "Hello world!" application
   Use the -d and -c flags to display the compilation process
   Use ldd on the output executable to print library dependencies

  33. Programming Model

  34. Programming Model
   Express the data and computations to be executed on an accelerator
  o Using marked code regions
   Main OpenACC constructs
  o Parallel and kernels regions
  o Parallel loops
  o Data regions
  o Runtime API
  [Diagram: CPU and HWA (e.g. a CUDA/OpenCL device) linked by a PCIe bus; data/stream/vector parallelism is exploited by the HWA]

  35. Execution Model
   Among the bulk of computations executed by the CPU, some regions can be offloaded to hardware accelerators
  o Parallel regions
  o Kernels regions
   The host is responsible for:
  o Allocating memory space on the accelerator
  o Initiating data transfers
  o Launching computations
  o Waiting for completion
  o Deallocating memory space
   Accelerators execute parallel regions:
  o Using work-sharing directives
  o Specifying the level of parallelization

  36. OpenACC Execution Model
   Host-controlled execution
   Based on three parallelism levels
  o Gangs: coarse grain
  o Workers: fine grain
  o Vectors: finest grain
  [Diagram: a device contains gangs; each gang contains workers; each worker contains vectors]

  37. Gangs, Workers, Vectors
   In CAPS Compilers, gangs, workers and vectors map onto a CUDA grid as follows:
  o gridDim.x = number of gangs
  o gridDim.y = 1
  o blockDim.y = number of workers
  o blockDim.x = number of vectors
   Beware: this mapping is compiler-dependent

  38. Directive Syntax
   C:

    #pragma acc directive-name [clause [, clause] …]
    { code to offload }

   Fortran:

    !$acc directive-name [clause [, clause] …]
    code to offload
    !$acc end directive-name

  39. Parallel Construct
   Starts parallel execution on the accelerator
   Creates gangs and workers
   The number of gangs and workers remains constant for the parallel region
   One worker in each gang begins executing the code in the region

    #pragma acc parallel […]
    {
      …
      for(i=0; i < n; i++) {
        for(j=0; j < n; j++) {
          …                      /* code executed on the hardware accelerator */
        }
      }
      …
    }

  40. Kernels Construct
   Defines a region of code to be compiled into a sequence of accelerator kernels
  o Typically, each loop nest will be a distinct kernel
   The number of gangs and workers can be different for each kernel

  C:
    #pragma acc kernels […]
    {
      for(i=0; i < n; i++) {   /* 1st kernel */
        …
      }
      …
      for(j=0; j < n; j++) {   /* 2nd kernel */
        …
      }
    }

  Fortran:
    !$acc kernels […]
    DO i=1,n                   ! 1st kernel
      …
    END DO
    …
    DO j=1,n                   ! 2nd kernel
      …
    END DO
    !$acc end kernels
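A minimal sketch of a kernels region containing two loop nests (hypothetical arrays and sizes; a plain C compiler ignores the pragma, so the code also runs sequentially on the host, and a real OpenACC compiler would additionally want a reduction clause for `total`):

```c
#include <stddef.h>

/* Fill a[] and then sum it inside one kernels region. An OpenACC
 * compiler would typically compile each loop nest into a distinct
 * accelerator kernel, possibly with different gang/worker counts. */
double two_kernels(double *a, size_t n) {
    double total = 0.0;
    #pragma acc kernels
    {
        for (size_t i = 0; i < n; i++) {   /* 1st kernel */
            a[i] = (double)i * 2.0;
        }
        for (size_t i = 0; i < n; i++) {   /* 2nd kernel */
            total += a[i];
        }
    }
    return total;
}
```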

  41. Lab Session 2: Offloading Computations

  42. Lab 2: Offloading Computations
   Offload two SAXPY operations on the accelerator device: Y = alpha * X + Y
  o X, Y are vectors
  o alpha is a scalar
   Use the parallel and kernels constructs
   Pay attention to the compiler notifications
   Use the logger to understand the behavior of the accelerator

    $ export HMPPRT_LOG_LEVEL=info

   Use CUDA profiling to display the CUDA grid properties
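A possible shape for the SAXPY kernel of this lab (a sketch, not the official lab solution; the data clauses assume n-element arrays, and without an OpenACC compiler the pragma is ignored and the loop runs on the host):

```c
/* SAXPY: y = alpha * x + y, offloaded with a kernels region. */
void saxpy(int n, float alpha, const float *x, float *y) {
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++) {
        y[i] = alpha * x[i] + y[i];
    }
}
```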

  43. Managing Data

  44. What is the problem using discrete accelerators?
   PCIe transfers have huge latencies
   In kernels and parallel regions, data are implicitly managed
  o Data are automatically transferred to and from the device
  o This implies possibly useless communications
   Avoiding transfers leads to better performance
   OpenACC offers a solution to control transfers

  45. Device Memory Reuse
   In this example:
  o A and B are allocated and transferred for the first kernels region
  o A and C are allocated and transferred for the second kernels region
   How to reuse A between the two kernels regions?
  o And save transfer and allocation time

    float A[n];
    #pragma acc kernels
    {
      for(i=0; i < n; i++) {
        A[i] = B[n - i];
      }
    }
    …
    init(C)
    …
    #pragma acc kernels
    {
      for(i=0; i < n; i++) {
        C[i] += A[i] * alpha;
      }
    }

  46. Memory Allocations
   Avoid data reallocation using the create clause
  o It declares variables, arrays or subarrays to be allocated in the device memory
  o Data specified in this clause are not copied between host and device
   The scope of such a clause corresponds to a data region
  o Data regions are used to define such scopes (by themselves, they have no other effect)
  o They define scalars, arrays and subarrays to be allocated in the device memory for the duration of the region
   Kernels and parallel regions implicitly define data regions

  47. Data Presence
   How to tell the compiler that data has already been allocated?
   The present clause declares data that are already present on the device
  o Thanks to a data region that contains this region of code
   The CAPS Runtime will find and use the data on the device

  48. Data Construct: Create and Present Clauses

    float A[n];
    #pragma acc data create(A)        /* allocation of A of size n on the device */
    {
      #pragma acc kernels present(A)  /* reuse of A already allocated on the device */
      {
        for(i=0; i < n; i++) {
          A[i] = B[n - i];
        }
      }
      …
      init(C)
      …
      #pragma acc kernels present(A)  /* reuse of A already allocated on the device */
      {
        for(i=0; i < n; i++) {
          C[i] += A[i] * alpha;
        }
      }
    }                                 /* deallocation of A on the device */

  49. Data Storage: Mirroring
  How is the data stored in a data region?
   A data construct defines a section of code where data are mirrored between the host and the device
   Mirroring duplicates a CPU memory block into the HWA memory
   The mirror identifier is the CPU memory block address
  o Only one mirror per CPU block
  o Users ensure the consistency of the copies via directives
  [Diagram: a CAPS runtime descriptor links the master copy in host memory to the mirror copy in HWA memory]

  50. Arrays and Subarrays (1)
   In C and C++, subarrays are specified with a start and a length

    #pragma acc data create(a[0:n])    /* or create(a[:n]) */

  o Allocation of an array a of size n

    #pragma acc data create(a[2:n/2])

  o Allocation of a subarray of a of size n/2
  o i.e. elements a[2], a[3], …, a[n/2-1+2]
   Static arrays can be allocated automatically
   The length of dynamically allocated arrays must be explicitly specified

  51. Arrays and Subarrays (2)
   In Fortran, subarrays are specified with a list of range specifications

    !$acc data create(a(0:n,0:m))

  o Allocation of an array a of size (n+1)*(m+1)

    !$acc data create(a(1:3,5:5))

  o Allocation of a subarray of a of size 3*1
  o i.e. elements a(1,5), a(2,5), a(3,5)
   In any language, any array or subarray must be a contiguous block of memory

  52. Arrays and Subarrays Example

  C:
    #pragma acc data create(A[:n])
    {
      #pragma acc kernels present(A[:n])
      {
        for(i=0; i < n; i++) {
          A[i] = B[n - i];
        }
      }
      …
      init(C)
      …
      #pragma acc kernels present(A[:n])
      {
        for(i=0; i < n; i++) {
          C[i] += A[i] * alpha;
        }
      }
    }

  Fortran:
    !$acc data create(A(1:n))
    !$acc kernels present(A(1:n))
    do i=1,n
      A(i) = B(n - i)
    end do
    !$acc end kernels
    …
    init(C)
    …
    !$acc kernels present(A(1:n))
    do i=1,n
      C(i) = A(i) * alpha + C(i)
    end do
    !$acc end kernels
    !$acc end data

  53. Redundant Transfers
   In this example:
  o A is allocated for the data section; no data transfer of A between host and device
  o B is allocated and transferred for the first kernels region (input and output transfers)
  o C is allocated and transferred for the second kernels region (input and output transfers)
   How to avoid the useless data transfers for B and C?

    #pragma acc data create(A[:n])
    {
      #pragma acc kernels present(A[:n])
      {
        for(i=0; i < n; i++) {
          A[i] = B[n - i];
        }
      }
      …
      #pragma acc kernels present(A[:n])
      {
        for(i=0; i < n; i++) {
          C[i] = A[i] * alpha;
        }
      }
    }

  54. Input Transfers: Copyin Clause
   Declares data that only need to be copied from the host to the device when entering the data section
  o Performs input transfers only
   It defines scalars, arrays and subarrays to be allocated in the device memory for the duration of the data region

    #pragma acc data create(A[:n])
    {
      #pragma acc kernels present(A[:n]) copyin(B[:n])
      {
        for(i=0; i < n; i++) {
          A[i] = B[n - i];
        }
      }
      …
      #pragma acc kernels present(A[:n])
      {
        for(i=0; i < n; i++) {
          C[i] = A[i] * alpha;
        }
      }
    }

  55. Output Transfers: Copyout Clause
   Declares data that only need to be copied from the device to the host when exiting the data section
  o Performs output transfers only
   It defines scalars, arrays and subarrays to be allocated in the device memory for the duration of the data region

    #pragma acc data create(A[:n])
    {
      #pragma acc kernels present(A[:n]) copyin(B[:n])
      {
        for(i=0; i < n; i++) {
          A[i] = B[n - i];
        }
      }
      …
      #pragma acc kernels present(A[:n]) copyout(C[:n])
      {
        for(i=0; i < n; i++) {
          C[i] = A[i] * alpha;
        }
      }
    }

  56. Input/Output Transfers: Copy Clause
   If we change the example, how to express that both input and output transfers of C are required?
   Use the copy clause to:
  o Declare data that need to be copied from the host to the device when entering the data section
  o Copy values assigned on the device back to the host when exiting the data section
  o Allocate scalars, arrays and subarrays in the device memory for the duration of the data region
   It corresponds to the default behavior in our example

    #pragma acc data create(A[:n])
    {
      #pragma acc kernels present(A[:n]) copyin(B[:n])
      {
        for(i=0; i < n; i++) {
          A[i] = B[n - i];
        }
      }
      …
      init(C)
      …
      #pragma acc kernels present(A[:n]) copy(C[:n])
      {
        for(i=0; i < n; i++) {
          C[i] += A[i] * alpha;
        }
      }
    }

  57. Transfer Example: Summary

    #pragma acc data create(A[:n])       /* allocation of A of size n on the device */
    {
      #pragma acc kernels present(A[:n]) /* reuse of A already allocated on the device */ \
                         copyin(B[:n])   /* allocation of B of size n on the device and
                                            host-to-device transfer of B */
      {
        for(i=0; i < n; i++) {
          A[i] = B[n - i];
        }
      }                                  /* deallocation of B on the device */
      …
      init(C)
      …
      #pragma acc kernels present(A[:n]) /* reuse of A already allocated on the device */ \
                         copy(C[:n])     /* allocation of C of size n on the device and
                                            host-to-device transfer of C */
      {
        for(i=0; i < n; i++) {
          C[i] += A[i] * alpha;
        }
      }                                  /* device-to-host transfer of C and deallocation
                                            of C on the device */
    }                                    /* deallocation of A on the device */

  58. Alternative Behaviors
   In this example:
  o X is allocated for the data region
  o The first call to subroutine f1 reuses that allocation through the dummy argument A
   What happens for the second call to f1?
  o A is specified as present, but the allocation was released at the end of the data region
  o It leads to an error at execution

    program main
      …
      !$acc data create(X(1:n))
      call f1( n, X, Y )
      …
      !$acc end data
      …
      call f1( n, X, Z )
      …
    contains
      subroutine f1 ( n, A, B )
        …
        !$acc kernels present(A(1:n)) copyin(B(1:n))
        do i=1,n
          A(i) = B(n - i)
        end do
        !$acc end kernels
      end subroutine f1
      …
    end program main

  59. Present_or_create Clause
   Combines two behaviors
   Declares data that may be present
  o If the data is already present, use its value in the device memory
  o If not, allocate it on the device when entering the region and deallocate it when exiting
   May be shortened to pcreate

  60. Present_or_copyin/copyout Clauses
   If the data is already present, use its value in the device memory
   If not:
  o Both present_or_copyin and present_or_copyout allocate memory on the device at region entry
  o present_or_copyin copies the value from the host at region entry
  o present_or_copyout copies the value from the device to the host at region exit
  o Both deallocate the memory at region exit
   May be shortened to pcopyin and pcopyout
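A small sketch of these clauses (hypothetical function and arrays; the pragmas are no-ops under a plain C compiler, so the code also runs on the host):

```c
/* Scale src into dst. pcopyin/pcopyout allocate and transfer only if the
 * data is not already present on the device, e.g. from an enclosing
 * data region created by the caller. */
void scale(int n, float factor, const float *src, float *dst) {
    #pragma acc kernels pcopyin(src[0:n]) pcopyout(dst[0:n])
    for (int i = 0; i < n; i++) {
        dst[i] = factor * src[i];
    }
}
```

Called from inside a `data` region that already holds `src` and `dst`, no transfer happens; called on its own, the clauses fall back to plain copyin/copyout, which is why the slides call the `present_or_*` forms generally safer.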

  61. Present_or_copy Clause
   If the data is already present, use its value in the device memory
   If not:
  o Allocates the data on the device and copies the value from the host at region entry
  o Copies the value from the device to the host and deallocates the memory at region exit
   May be shortened to pcopy

  62. Present_or_* Clauses Example
  o First call: A of size n is already allocated on the device (by the data region) and is reused; B is allocated on the device for the duration of the subroutine, with an input transfer of B
  o Second call (outside the data region): A and B of size n are both allocated on the device for the duration of the subroutine, with an input transfer of B and an output transfer of A
   Present_or_* clauses are generally safer

    program main
      …
      !$acc data create(A(1:n))
      call f1( n, A, B )
      …
      !$acc end data
      …
      call f1( n, A, C )
      …
    contains
      subroutine f1 ( n, A, B )
        …
        !$acc kernels pcopyout(A(1:n)) copyin(B(1:n))
        do i=1,n
          A(i) = B(n - i)
        end do
        !$acc end kernels
      end subroutine f1
      …
    end program main

  63. Default Behavior
   CAPS Compilers are able to detect the variables required on the device for the kernels and parallel constructs
   According to the specification, the variables follow these policies depending on their type:
  o Arrays: present_or_copy behavior
  o Scalars:
  • private behavior if the variable is neither live-in nor live-out
  • copy behavior otherwise

  64. Constructs and Directives
   OpenACC defines two ways of managing accelerator allocations and transfers
  o With data constructs carrying allocation or transfer clauses
  o Or with standalone directives for allocations or transfers
   Data constructs are declarative
  o They define properties for a code region and its variables
   Imperative directives are standalone statements

  65. Declare Directive
   In Fortran: used in the declaration section of a subroutine
   In C/C++: follows a variable declaration
   Specifies variables or arrays to be allocated in the device memory for the duration of the function, subroutine or program
   Specifies the kind of transfer to realize (create, copy, copyin, etc.)

  With a data construct:
    float A[n];
    #pragma acc data create(A)
    {
      #pragma acc kernels present(A)
      {
        for(i=0; i < n; i++) {
          A[i] = B[n - i];
        }
      }
    }

  With a declare directive:
    float A[n];
    #pragma acc declare create(A)
    …
    #pragma acc kernels present(A)
    {
      for(i=0; i < n; i++) {
        A[i] = B[n - i];
      }
    }

  66. Update Directive
   Used within an explicit or implicit data region
   Updates all or part of host memory arrays with values from the device when used with the host clause
   Updates all or part of device memory arrays with values from the host when used with the device clause

  With copy clauses:
    !$acc kernels copyout(A(1:n)) copyin(B(1:n))
    do i=1,n
      A(i) = B(n - i)
    end do
    !$acc end kernels

  With update directives:
    !$acc data create( A(1:n), B(1:n) )
    !$acc update device(B(1:n))
    !$acc kernels
    do i=1,n
      A(i) = B(n - i)
    end do
    !$acc end kernels
    !$acc update host(A(1:n))
    !$acc end data
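A C sketch of the same idea (hypothetical function; with a plain C compiler the pragmas are ignored, so host and "mirror" are trivially the same memory):

```c
/* Inside a data region, the update host directive refreshes the host
 * mirror of y after the kernel, so the host can read the result without
 * closing the region and losing the device allocation. */
void fill_and_read(int n, float *y) {
    #pragma acc data create(y[0:n])
    {
        #pragma acc kernels present(y[0:n])
        for (int i = 0; i < n; i++) {
            y[i] = (float)i;
        }
        /* bring the device values back to the host mirror */
        #pragma acc update host(y[0:n])
    }
}
```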

  67. Lab session 3: Data Management

  68. Lab 3: Data Management
   Offload two SAXPY operations (cf. Lab 2) where the arrays are allocated dynamically
   Specify the data sizes on the kernels and parallel regions, and the appropriate transfers
   Avoid deallocating and reallocating the data on the accelerator by defining a data section
   Ensure the data displayed between the two compute regions are correct by updating the host mirror
   Observe how the performance evolves, and use the logger to understand why
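One possible shape for this lab, combining the pieces above (a sketch, not the official solution; arrays, sizes and the printed message are illustrative, and the pragmas are no-ops under a plain C compiler):

```c
#include <stdio.h>

/* Two SAXPY regions sharing one device allocation of x and y. The
 * update host directive makes the intermediate values of y visible
 * for printing between the two compute regions. */
void lab3(int n, float alpha, const float *x, float *y) {
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        #pragma acc kernels present(x[0:n], y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = alpha * x[i] + y[i];

        /* refresh the host mirror so the printed values are correct */
        #pragma acc update host(y[0:n])
        printf("after first region: y[0] = %f\n", y[0]);

        #pragma acc kernels present(x[0:n], y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = alpha * x[i] + y[i];
    }
}
```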

  69. Specifying Parallelization

  70. Parallel and Kernels Constructs: Default Behavior
   By default, CAPS Compilers create 192 gangs and 256 workers containing 1 vector each for parallel and kernels regions
  o The resulting CUDA grid size will be 192 thread blocks
  o Each thread block contains 256*1 CUDA threads
   CAPS Compilers detect data-independent loops and distribute their iterations among gangs and workers
  o Here, loop 'i' is shared among the gangs (192) and workers (256)
   How to modify the number of gangs, workers or vectors?

  71. Gangs, Workers, Vectors in Parallel Constructs
   In parallel constructs, the number of gangs, workers and vectors is the same for the entire section
   The clauses:
  o num_gangs
  o num_workers
  o vector_length
   specify the number of gangs, workers and vectors in the corresponding parallel section

    #pragma acc parallel num_gangs(128) \
                         num_workers(256)
    {
      …
      for(i=0; i < n; i++) {
        for(j=0; j < m; j++) {
          …
        }
      }
      …
    }

  [Diagram: 128 gangs of 256 workers each]

  72. Loop Constructs
   A loop directive applies to the loop that immediately follows the directive
   The parallelism to use is described by one of the following clauses:
  o gang, for coarse-grain parallelism
  o worker, for middle-grain parallelism
  o vector, for fine-grain parallelism

  73. Gangs (1)
   Gang clause:
  o The iterations of the following loop are executed in parallel
  o Iterations are distributed among the available gangs
  o In a parallel construct, no argument is allowed

    #pragma acc parallel num_gangs(128) \
                         num_workers(192)
    {
      …
      #pragma acc loop gang
      for(i=0; i < n; i++) {
        for(j=0; j < m; j++) {
          …
        }
      }
      …
    }

  [Diagram: iterations i=0, 1, 2, … distributed over 128 gangs of 192 workers]

  74. Gangs (2)
  With two gangs, the loop below behaves as if it were split in two halves, each half executed by one gang:

    #pragma acc parallel num_gangs(2)
    {
      #pragma acc loop gang
      for(i = 0; i < n; i++) {
        A[i] = B[i] * B[i] * 3.14;
      }
    }

  which is equivalent to executing, in parallel:

    for(i = 0; i < n/2; i++) {
      A[i] = B[i] * B[i] * 3.14;
    }

    for(i = n/2; i < n; i++) {
      A[i] = B[i] * B[i] * 3.14;
    }

  75. Workers
   Worker clause:
  o The iterations of the following loop are executed in parallel
  o Iterations are distributed among the multiple workers within a single gang
  o Loop iterations must be data independent, unless the loop performs a reduction operation
  o In a parallel construct, no argument is allowed

    #pragma acc parallel num_gangs(128) \
                         num_workers(192)
    {
      …
      #pragma acc loop gang
      for(i=0; i < n; i++) {
        #pragma acc loop worker
        for(j=0; j < n; j++) {
          …
        }
      }
      …
    }

  [Diagram: gangs over i; the 192 workers of each gang over j]

  76. Vector
   Vector clause:
  o The iterations of the following loop are executed in SIMD mode
  o Iterations are distributed among the vector lanes within a single worker
  o In a parallel construct, no argument is allowed

    #pragma acc parallel num_gangs(128) \
                         num_workers(192)
    {
      …
      #pragma acc loop gang
      for(i=0; i < n; i++) {
        #pragma acc loop worker
        for(j=0; j < m; j++) {
          #pragma acc loop vector
          for(k=0; k < l; k++) {
            …
          }
        }
      }
      …
    }

  [Diagram: gangs over i, workers over j, vector lanes over k=0, 1, 2, …]

  77. Gang, Worker, Vector in Kernels Constructs
   The parallelism description is the same as in parallel sections
   However, in kernels constructs these clauses accept an argument specifying the number of gangs, workers or vectors to use
   Every loop can have a different number of gangs, workers or vectors in the same kernels region

    #pragma acc kernels
    {
      …
      #pragma acc loop gang(128)
      for(i=0; i < n; i++) {
        …
      }
      …
      #pragma acc loop gang(64)
      for(j=0; j < m; j++) {
        …
      }
    }

  78. Data Independency
   In kernels sections, the independent clause specifies that the iterations of the loop are data-independent
  o The user does not have to think about gang, worker or vector parameters
   Allows the compiler to generate code that executes the iterations in parallel with no synchronization
   Beware: marking the loops below independent is a programming error, since each iteration reads the value written by the previous one

    A[0] = 0;
    #pragma acc loop independent
    for(i=1; i<n; i++) {
      A[i] = A[i-1];
    }

    A(1) = 0
    !$acc loop independent
    DO i=2,n
      A(i) = A(i-1)
    END DO

  79. Sequential Execution
   It is possible to specify sequential loops using the seq clause
   Useful to increase the work per thread, for example

    !$acc loop independent
    DO i=0,n
      !$acc loop seq
      DO j=1,4
        A(j)…
      ENDDO
    ENDDO
