Programming Heterogeneous Systems
- F. Bodin
June 2013 Uppsala
Introduction: HPC and embedded software are going through dramatic changes to adapt to massive parallelism.
- A huge market issue
- Many codes and users are not ready
Directive-based approaches are part of the answer.
*see ETP4HPC Strategic Research Agenda
Codes need to move in this space, and new hardware is still to come.
x86 / ARM multi-cores (fat out-of-order cores), Intel MIC / KALRAY MPPA (light cores), NVIDIA / AMD / ARM GPUs (SIMT cores).
CPU compilers target x86, ARM, MIPS, PowerPC, …; accelerator compilers target x86, PTX, HSA, the Kalray MPPA ISA, …, down to the hardware.
from Paulius Micikevicius, NVIDIA
OpenACC aims to be programmer-friendly while being as efficient as native code.
Efficiency loss of the code variants (lower is better); a value of 0.00% indicates that the variant reaches the best performance.
*http://www.caps-entreprise.com/wp-content/uploads/2012/08/One-OpenCL-to-rule-them-all.pdf
infrastructure is important
abstract
configurations
Parallel hardware-independent code (e.g. C, Fortran) vs. parallel hardware-dependent code (e.g. CUDA, OpenCL): code generation gets closer to the hardware, but high-level information cannot be reconstructed from the low-level code. Performance can be adjusted dynamically via dynamic parameters, such as OpenACC num_gangs, num_workers and vector_length, and via program transformations such as the ones provided in OpenHMPP.
[Diagram: the HMPP compiler generates several codelet variants (variant 1, variant 2, variant 3, …); a variant is selected dynamically at run time based on execution feedback.]
size_t gangs[]   = { 8, 16, 32, 64, 128, 128, 8, 16, 32, 64, 128, 256 };
size_t workers[] = { 16, 16, 16, 16, 16, 16, 24, 24, 24, 24, 24, 24 };
…
while (nber_of_iterations < max_iterations) {
  …
  variant = variantSelectorState("kernel.c:21",
                                 (sizeof(gangs)/sizeof(size_t))-1);
  blur(images[(currentImage + 1) % 2], image_caps, width, height,
       blockSize, gangs[variant], workers[variant]);
  …
}

#pragma acc parallel, copyin(dst_caps[0:height*width]), \
            copyout(src_caps[0:height*width]), \
            num_gangs(gangs), num_workers(workers), vector_length(32)
{
  #pragma acc loop, gang
  for (tileY = 0; tileY < tileCountY; tileY++) {
    for (tileX = 0; tileX < tileCountX; tileX++) {
      …
Parameterized parallel regions define a parameter space to explore; set the auto-tuning driver on.
[Chart: kernel time per call; an exploration phase is followed by a steady state.] Data can be collected over multiple executions.
[Charts: kernel computation time (in sec) per configuration; lower is better.] Kepler: config 10 = 256 G x 128 W. CARMA: config. 8 = 14 G x 16 W.
Best auto-tuned configuration per device:
  AMD Trinity APU : best config. 8  = 14 G x 8 W
  AMD 7970 GPU    : best config. 16 = 64 G x 8 W
  Intel Xeon Phi  : best config. 16 = 64 G x 8 W
  Nvidia Fermi    : best config. 10 = 256 G x 128 W
targets
dependencies
(e.g. Fortran)
compiler
[Diagram: CAPS compilation flow]
C++ / C / Fortran frontends feed two paths: the extraction module pulls out functions (Fun #1, Fun #2, Fun #3) as kernels, which go through OpenCL/CUDA generation and native compilers to produce the HWA code (dynamic library); the instrumentation module and the CPU compiler (gcc, ifort, …) produce the executable (mybin.exe) host code. The CAPS Runtime links host code and kernels.
programmers
[Diagram: auto-tuning infrastructure]
The HMPP compiler produces an auto-tunable executable; the CAPS profiling/tuning interface and the auto-tuning driver collect profiling data and explore the variants space. Source code goes through CodeletFinder (codelets CT0 … CT2), performance tools, an optimizing strategy and optimizing scripts.
with the trial/experiment cycle
Tuning process tools:
- Project Capture: captures the application build process and parameters
- Hotspot Finder: identifies potential hotspots in the application using execution profiles
- Codelet Builder: uses the identified hotspots to build the codelets (code outliner)
- Micro Bencher: builds and runs the micro-benches
[Diagram: compilation flow extended with tuning scripts]
The same C++ / C / Fortran frontends, extraction and instrumentation modules, CPU compiler (gcc, ifort, …), OpenCL/CUDA generation and native compilers as before, producing host code and kernels; a Scripting Engine drives the process with application/domain-specific scripts.
Script directive syntax (scriptName selects the script to be activated; scriptInput carries expressions providing high-level information to the scripts):

!$capstune scriptName scriptInput
  code region
!$capstune end scriptName
…
…
!$capstune stencil
…
!$acc kernels
!$acc loop independent
do i=1,10
  !$acc loop independent
  do j=1,10
    a(i,j) = … b(i,j) …
  end do
end do
!$acc end kernels
!$capstune end stencil
…

Specifies the script to generate an optimized stencil code using various methods.
TYPE foo
  REAL :: w(10,10)
  REAL :: x(10,10)
  REAL :: y(10,10)
  REAL :: z(10,10)
END TYPE foo
…
!$capstune scalarize state_x => state%x , state_z => state%z
!$acc parallel num_gangs(10) num_workers(10) copyout(state_x) copyin(state_z)
!$acc loop gang
do i=1,10
  !$acc loop worker
  do j=1,10
    state%x(i,j) = state%z(i,j) + i+j/1000.0
  end do
end do
!$acc end parallel
!$capstune end scalarize
Transform a data structure for an accelerator; the transformation is made based on local code analysis.
Ready-to-use Acceleration
Quickly Accelerate Existing Applications
Maximum Performance
specific source code (CUDA, OpenCL)
$ capsmc gcc myprogram.c
$ capsmc gfortran myprogram.f90
The host compiler is applied to the host part of the application; the accelerator code is built for the specified target.
[Diagram: capsmc compilation flow]
C++ / C / Fortran frontends feed the extraction module, which outlines the codelets (Fun #1, Fun #2, Fun #3); CUDA code generation and OpenCL generation feed the CUDA and OpenCL compilers to produce the HWA code (dynamic library). The instrumentation module and the CPU compiler (gcc, ifort, …) produce the executable (mybin.exe) host code; the CAPS Runtime links host code and codelets.
$ capsmc [CAPSMC_FLAGS] <host_compiler> [HOST_COMPILER_FLAGS] <source_files>

$ capsmc -d -c gcc myprogram.c
$ capsmc --openacc-target CUDA gcc myprogram.c    # (default)
$ capsmc --openacc-target OPENCL gcc myprogram.c  # (AMD and Phi)
Data/stream/vector parallelism to be exploited by the HWA (e.g. with CUDA / OpenCL); CPU and HWA are linked with a PCI Express bus.
[Diagram: the device runs gangs; each gang runs workers; each worker runs vectors.]
gridDim.x  = number of gangs
gridDim.y  = 1
blockDim.y = number of workers
blockDim.x = number of vectors
Fortran:
!$acc directive-name [clause [, clause] …]
  code to offload
!$acc end directive-name

C:
#pragma acc directive-name [clause [, clause] …]
{
  code to offload
}
#pragma acc parallel […]
{
  …
  for(i=0; i < n; i++) {
    for(j=0; j < n; j++) {
      …
    }
  }
  …
}

The enclosed code is executed on the hardware accelerator.
#pragma acc kernels […]
{
  for(i=0; i < n; i++) {   /* 1st kernel */
    …
  }
  …
  for(j=0; j < n; j++) {   /* 2nd kernel */
    …
  }
}

!$acc kernels […]
DO i=1,n    ! 1st kernel
  …
END DO
…
DO j=1,n    ! 2nd kernel
  …
END DO
!$acc end kernels
- X, Y are vectors
- alpha is a scalar
$ export HMPPRT_LOG_LEVEL=info
float A[n];
#pragma acc kernels
{
  for(i=0; i < n; i++) {
    A[i] = B[n - i];
  }
}
…
init(C)
…
#pragma acc kernels
{
  for(i=0; i < n; i++) {
    C[i] += A[i] * alpha;
  }
}
Without a data construct, each kernels region transfers its data between host and device memory (a costly side effect). The data construct allocates data in device memory for the duration of the region.
float A[n];
#pragma acc data create(A)
{
  #pragma acc kernels present(A)
  {
    for(i=0; i < n; i++) {
      A[i] = B[n - i];
    }
  }
  …
  init(C)
  …
  #pragma acc kernels present(A)
  {
    for(i=0; i < n; i++) {
      C[i] += A[i] * alpha;
    }
  }
}
A of size n is allocated on the device at data-region entry, reused (already allocated) by both kernels regions, and deallocated at data-region exit.
device
[Diagram: host memory holds the master copy; the CAPS runtime keeps a descriptor linking it to the mirror copy in HWA memory.]
#pragma acc data create(a[0:n])   /* or, equivalently */
#pragma acc data create(a[:n])

#pragma acc data create(a[2:n/2])
!$acc data create(a(0:n,0:m))
!$acc data create(a(1:3,5:5))
#pragma acc data create(A[:n])
{
  #pragma acc kernels present(A[:n])
  {
    for(i=0; i < n; i++) {
      A[i] = B[n - i];
    }
  }
  …
  init(C)
  …
  #pragma acc kernels present(A[:n])
  {
    for(i=0; i < n; i++) {
      C[i] += A[i] * alpha;
    }
  }
}

!$acc data create(A(1:n))
!$acc kernels present(A(1:n))
do i=1,n
  A(i) = B(n - i)
end do
!$acc end kernels
…
init(C)
…
!$acc kernels present(A(1:n))
do i=1,n
  C(i) = A(i) * alpha + C(i)
end do
!$acc end kernels
!$acc end data
The data section keeps A allocated on the device, so no transfer between host and device is needed, neither for the first kernels region nor for the second kernels region.
#pragma acc data create(A[:n])
{
  #pragma acc kernels present(A[:n])
  {
    for(i=0; i < n; i++) {
      A[i] = B[n - i];
    }
  }
  …
  #pragma acc kernels present(A[:n])
  {
    for(i=0; i < n; i++) {
      C[i] = A[i] * alpha;
    }
  }
}
copyin declares subarrays to be copied from the host to the device when entering the data section; create declares subarrays to be allocated in device memory for the duration of the data region.
#pragma acc data create(A[:n])
{
  #pragma acc kernels present(A[:n]) copyin(B[:n])
  {
    for(i=0; i < n; i++) {
      A[i] = B[n - i];
    }
  }
  …
  #pragma acc kernels present(A[:n])
  {
    for(i=0; i < n; i++) {
      C[i] = A[i] * alpha;
    }
  }
}
copyout declares subarrays to be copied from the device to the host when exiting the data section; create declares subarrays to be allocated in device memory for the duration of the data region.
#pragma acc data create(A[:n])
{
  #pragma acc kernels present(A[:n]) copyin(B[:n])
  {
    for(i=0; i < n; i++) {
      A[i] = B[n - i];
    }
  }
  …
  #pragma acc kernels present(A[:n]) copyout(C[:n])
  {
    for(i=0; i < n; i++) {
      C[i] = A[i] * alpha;
    }
  }
}
How do we express that both input and output transfers of C are required? The copy clause declares subarrays that are copied from the host to the device when entering the region, kept in device memory for the duration of the region, and copied back to the host when exiting the region. This is the desired behavior in our example.
#pragma acc data create(A[:n])
{
  #pragma acc kernels present(A[:n]) copyin(B[:n])
  {
    for(i=0; i < n; i++) {
      A[i] = B[n - i];
    }
  }
  …
  init(C)
  …
  #pragma acc kernels present(A[:n]) copy(C[:n])
  {
    for(i=0; i < n; i++) {
      C[i] += A[i] * alpha;
    }
  }
}
#pragma acc data create(A[:n])
{
  #pragma acc kernels present(A[:n]) copyin(B[:n])
  {
    for(i=0; i < n; i++) {
      A[i] = B[n - i];
    }
  }
  …
  init(C)
  …
  #pragma acc kernels present(A[:n]) copy(C[:n])
  {
    for(i=0; i < n; i++) {
      C[i] += A[i] * alpha;
    }
  }
}

Annotations: A of size n is allocated on the device at data-region entry and deallocated at exit; both kernels regions reuse the already-allocated A. B of size n is allocated on the device with an input transfer for the first kernels region and deallocated after it. C of size n is allocated with an input transfer for the second kernels region, then transferred from device to host and deallocated after it.
Inside the data region, f1 reuses the A already allocated on the device. In the second call, outside the region, A has been released at the end of the data section, so the present clause fails when the kernels region is executed.
program main
  …
  !$acc data create(X(1:n))
  call f1( n, X, Y )
  …
  !$acc end data
  …
  call f1( n, X, Z )
  …
contains
  subroutine f1( n, A, B )
    …
    !$acc kernels present(A(1:n)) copyin(B(1:n))
    do i=1,n
      A(i) = B(n - i)
    end do
    !$acc end kernels
  end subroutine f1
  …
end program main
when exiting
device at region entry
region exit
region exit
entry
at region exit
program main
  …
  !$acc data create(A(1:n))
  call f1( n, A, B )
  …
  !$acc end data
  …
  call f1( n, A, C )
  …
contains
  subroutine f1( n, A, B )
    …
    !$acc kernels pcopyout(A(1:n)) copyin(B(1:n))
    do i=1,n
      A(i) = B(n - i)
    end do
    !$acc end kernels
  end subroutine f1
  …
end program main
First call (inside the data region): A of size n is already allocated on the device and is reused; B of size n is allocated for the duration of the subroutine, with an input transfer. A is deallocated at the end of the data region. Second call: A and B of size n are both allocated for the duration of the subroutine, with an input transfer of B and an output transfer of A.
Present_or_* clauses are generally safer
float A[n];
#pragma acc data create(A)
{
  #pragma acc kernels present(A)
  {
    for(i=0; i < n; i++) {
      A[i] = B[n - i];
    }
  }
  …
}

float A[n];
#pragma acc declare create(A)
#pragma acc kernels present(A)
{
  for(i=0; i < n; i++) {
    A[i] = B[n - i];
  }
}
!$acc kernels copyout(A(1:n)) copyin(B(1:n))
do i=1,n
  A(i) = B(n - i)
end do
!$acc end kernels

!$acc data create( A(1:n), B(1:n) )
!$acc update device(B(1:n))
!$acc kernels
do i=1,n
  A(i) = B(n - i)
end do
!$acc end kernels
!$acc update host(A(1:n))
!$acc end data
Loop ‘i’ was shared among gangs(192) and workers(256)
#pragma acc parallel, num_gangs(128) num_workers(256)
{
  …
  for(i=0; i < n; i++) {
    for(j=0; j < m; j++) {
      …
    }
  }
  …
}
The loop gang directive declares that the iterations of the following loop are executed in parallel, distributed among the gangs available; in a parallel region, no argument is allowed.
#pragma acc parallel, num_gangs(128) num_workers(192)
{
  …
  #pragma acc loop gang
  for(i=0; i < n; i++) {
    for(j=0; j < m; j++) {
      …
    }
  }
  …
}
#pragma acc parallel num_gangs(2)
{
  #pragma acc loop gang
  for(i = 0; i < n; i++) {
    A[i] = B[i] * B[i] * 3.14;
  }
}

is equivalent to:

for(i = 0; i < n/2; i++) {
  A[i] = B[i] * B[i] * 3.14;
}
for(i = n/2; i < n; i++) {
  A[i] = B[i] * B[i] * 3.14;
}
The loop worker directive declares that the iterations of the following loop are executed in parallel among the multiple workers within a single gang; the loop must be independent, unless it performs a reduction; in a parallel region, no argument is allowed.
#pragma acc parallel, num_gangs(128) num_workers(192)
{
  …
  #pragma acc loop gang
  for(i=0; i < n; i++) {
    #pragma acc loop worker
    for(j=0; j < n; j++) {
      …
    }
  }
  …
}
#pragma acc parallel, num_gangs(128) num_workers(192)
{
  …
  #pragma acc loop gang
  for(i=0; i < n; i++) {
    #pragma acc loop worker
    for(j=0; j < m; j++) {
      #pragma acc loop vector
      for(k=0; k < l; k++) {
        …
      }
    }
  }
  …
}
In kernels regions the loop mapping description is the same as in parallel sections, but here gang, worker and vector accept an argument to specify the number of gangs, workers or vectors to use; different loops may use a different number of gangs, workers or vectors in the same kernels region.
#pragma acc kernels
{
  …
  #pragma acc loop gang(128)
  for(i=0; i < n; i++) {
    …
  }
  …
  #pragma acc loop gang(64)
  for(j=0; j < m; j++) {
    …
  }
}
The independent clause asserts that the iterations of the loop are data-independent, so they can be executed in parallel with no synchronization.
The C loop below is truly independent; the Fortran loop is not (each A(i) reads A(i-1)), so marking it independent is a programming error:

A[0] = 0;
#pragma acc loop independent
for(i=1; i<n; i++) {
  A[i] = A[i] - 1;
}

A(1) = 0
!$acc loop independent
DO i=2,n
  A(i) = A(i-1)
END DO
!$acc loop independent
DO i=0,n
  !$acc loop seq
  DO j=1,4
    A(j)…
  ENDDO
ENDDO
#pragma acc loop collapse(2)
for(i=0; i<n; i++) {
  for(j=0; j<m; j++) {
    A[i][j] = …
  }
}

is equivalent to:

#pragma acc loop
for(k=0; k<n*m; k++) {
  int i = k / m;
  int j = k % m;
  A[i][j] = …
}
int w;
#pragma acc loop independent, private(w)
for(i = 0; i < n; i++) {
  w = i*i;
  b[i] = b[i] + w*a[i];
}
…
A private copy of the variable is created for each iteration of the associated loop; the value computed inside the loop is not stored back in the original variable.
#pragma acc loop worker, reduction(+: sum)
for(i=0; i < n; i++) {
  sum += foo(i, tab[i]);
}
[Diagram: workers #0 … #n-1 each accumulate a partial sum of foo(i, tab[i]) terms; the partial sums are then combined pairwise in a tree until a single sum remains.]
Reduction operators and initialization values:

C and C++                   Fortran
Operator  Init. value       Operator  Init. value
+         0                 +         0
*         1                 *         1
max       least             max       least
min       largest           min       largest
&         ~0                iand      all bits on
|         0                 ior       0
&&        1                 ieor      0
||        0                 .and.     .true.
                            .or.      .false.
                            .eqv.     .true.
                            .neqv.    .false.
C = alpha * A . B + beta * C, where A, B and C are matrices and alpha, beta are scalars.
By default, the host waits for the completion of the parallel or kernels region before continuing.
[Diagram: timelines of CPU and HWA activity (steps 1-5), synchronous vs. asynchronous execution.]
#pragma acc kernels, async
{
  …
}
#pragma acc kernels, async
{
  …
}
#pragma acc wait

!$acc kernels, async(1)
…
!$acc end kernels
…
!$acc kernels, async(2)
…
!$acc end kernels
…
!$acc wait(1)
executed
#pragma acc kernels if(cond)
{
  for(i=0; i < n; i++) {
    …
  }
  …
}

!$acc kernels if(cond)
DO i=1,n
  …
END DO
…
!$acc end kernels
C = alpha * A . B + beta * C, where A, B and C are matrices and alpha, beta are scalars.
available
void acc_init( acc_device_t )          (C)
subroutine acc_init( devicetype )      (Fortran)

void acc_shutdown( acc_device_t )      (C)
subroutine acc_shutdown( devicetype )  (Fortran)
int acc_get_num_devices( acc_device_t )             (C)
integer function acc_get_num_devices( devicetype )  (Fortran)

void acc_set_device_type( acc_device_t )            (C)
subroutine acc_set_device_type( devicetype )        (Fortran)

acc_device_t acc_get_device_type( void )            (C)
function acc_get_device_type()                      (Fortran)
int dev;
dev = acc_get_num_devices(acc_device_cuda);
#pragma acc data copy(A[0:N]) if(dev)
{
  #pragma acc kernels if(dev)
  ...
  #pragma acc kernels if(dev)
  for (int i = 0+t*N/2; i < (1+t)*N/2; ++i) {
    A[i] = A[i] ...;
  }
  ...
}
Check number of CUDA devices available on the system
If no device is available, the host code is executed
void acc_set_device_num( int, acc_device_t )            (C)
subroutine acc_set_device_num( devicenum, devicetype )  (Fortran)

int acc_get_device_num( acc_device_t )                  (C)
integer function acc_get_device_num( devicetype )       (Fortran)
#pragma omp parallel for
for (int t = 0; t < 2; ++t) {
  acc_set_device_num(t, acc_device_default);
  #pragma acc kernels copy(A[t*N/2:N/2])
  {
    #pragma acc loop independent
    for (int i = 0+t*N/2; i < (1+t)*N/2; ++i) {
      A[i] = A[i] ...;
    }
    ...
  }
  acc_shutdown(acc_device_default);
}
Two CPU threads are created with OpenMP:
Data set is split in two: each set will be processed by one device
float *a = (float *)acc_malloc(sizeof(float)*size);
float *b = (float *)acc_malloc(sizeof(float)*size);
float *c = (float *)malloc(sizeof(float)*size);
#pragma acc kernels deviceptr(a, b) copyout(c[0:size])
{
  // a and b initialisation
  ...
  #pragma acc loop independent
  for (i = 0; i < size; ++i) {
    c[i] += a[i] * b[i];
  }
}
acc_free(a);
acc_free(b);
free(c);
Arrays a and b are not mirrored on the host.
Functions can be called inside parallel or kernels constructs, thanks to the routine directive.
#pragma acc data copy(A)
{
  …
}

#pragma acc enter data copyin(A)
…
#pragma acc exit data copyout(A)
construct
#pragma acc loop tile(32,5)
for(i=0; i < n; i++) {
  for(j=0; j < n; j++) {
    …
  }
}

is equivalent to:

#pragma acc loop
for(i_1=0; i_1 < n; i_1=i_1+5) {
  for(i_2=0; i_2 < 5; i_2++) {
    for(j_1=0; j_1 < n; j_1=j_1+32) {
      for(j_2=0; j_2 < 32; j_2++) {
        …
      }
    }
  }
}
Directive and clause summary:

Clause                                   Parallel  Kernels  Data  Loop  Declare  Update
if                                          x         x      x                     x
async                                       x         x                            x
private                                     x                       x
firstprivate                                x
reduction                                   x                       x
create/present, copy/pcopy,
copyin/pcopyin, copyout/pcopyout,
deviceptr                                   x         x      x            x
collapse                                                            x
gang/worker/vector                                                  x
num_gangs/num_workers/vector_length         x
seq                                                                 x
independent                                                         x
host/device                                                                        x
Visit CAPS Website: www.caps-entreprise.com