Optimizing Large Reductions in BerkeleyGW on GPUs
Using OpenMP and OpenACC
Rahulkumar Gayatri, Charlene Yang
National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory
March 8, 2019
rgayatri@lbl.gov, cjyang@lbl.gov
Motivation
Port the BerkeleyGW GPP kernel to GPUs using directive-based programming models (OpenMP 4.5 and OpenACC) and compare their performance and portability.
Overview
BerkeleyGW
A materials science code that computes the excited-state properties of materials.
Test Case Kernel
GPP: the kernel computes the electron self-energy within the General Plasmon Pole approximation.
The arithmetic is dominated by complex multiply and add instructions (partial FMA).
The FLOP-per-byte ratio is between 1 and 10, i.e., the kernel has to be optimized for memory locality and vectorization/SIMT efficiency.
Complex Number in C/C++
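The custom complex class itself is not reproduced in this transcript; a minimal sketch of a device-friendly complex type (hypothetical names, the actual class in GPP may differ) is:

// Hypothetical sketch of a lightweight complex type usable on host and device.
struct Complex {
    double re, im;
};

inline Complex operator+(Complex a, Complex b) { return {a.re + b.re, a.im + b.im}; }
inline Complex operator*(Complex a, Complex b) {
    // (a.re + i*a.im) * (b.re + i*b.im)
    return {a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re};
}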
GPP
Code
for(X){                              // X = 512
  for(N){                            // N = 1638
    for(M){                          // M = 32768
      for(int iw = 0; iw < 3; ++iw){
        // Some computation
      }
    }
  }
}
The memory footprint of the input data structures is O(2 GB).
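To make the reduction structure concrete, here is an illustrative sketch; the array names, sizes, and arithmetic are placeholders, not the actual BerkeleyGW data structures:

// Illustrative sketch of the reduction structure only.
void gpp_like(int X, int N, int M,
              const double* a_re, const double* a_im,   // size X*M (placeholder)
              const double* b_re, const double* b_im,   // size N*M (placeholder)
              double output_re[3], double output_im[3])
{
    for (int x = 0; x < X; ++x)                 // X = 512
        for (int n = 0; n < N; ++n)             // N = 1638
            for (int m = 0; m < M; ++m) {       // M = 32768
                // complex multiply (partial FMA); the real code computes a distinct term per iw
                double re = a_re[x*M + m] * b_re[n*M + m] - a_im[x*M + m] * b_im[n*M + m];
                double im = a_re[x*M + m] * b_im[n*M + m] + a_im[x*M + m] * b_re[n*M + m];
                for (int iw = 0; iw < 3; ++iw) {
                    output_re[iw] += re;        // large reduction over all X*N*M iterations
                    output_im[iw] += im;
                }
            }
}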
GPP CPU Parallelization
#pragma omp parallel for reduction(+: output_re[0:3], output_im[0:3])
for(X){
  for(N){
    for(M){                              // Vectorize
      for(int iw = 0; iw < 3; ++iw){     // Unroll
        // Store local
      }
    }
    for(int iw = 0; iw < 3; ++iw){
      // Accumulate the local sums into output_re[iw], output_im[iw]
    }
  }
}
This restructuring improved the runtime by 10%.
Compiler pragmas were needed (intel/2018) to guarantee vectorization and unrolling of the inner loops.
Scalar reduction variables are used with compilers that do not support array reduction (see the sketch below).
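Where array reductions are unsupported, a scalar fallback along these lines is presumably used (variable names are illustrative):

// Hypothetical sketch of the scalar-reduction fallback.
void gpp_cpu_fallback(int X)
{
    double out_re0 = 0.0, out_re1 = 0.0, out_re2 = 0.0;
    double out_im0 = 0.0, out_im1 = 0.0, out_im2 = 0.0;
    #pragma omp parallel for reduction(+: out_re0, out_re1, out_re2, out_im0, out_im1, out_im2)
    for (int x = 0; x < X; ++x) {
        // ... the N, M, and iw loops accumulate into the six scalars here ...
    }
    // combine out_re0..2 / out_im0..2 into the output arrays afterwards
}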
GPP Performance on CPU
[Figure: Performance of GPP on Cori (Haswell vs. Xeon Phi) at NERSC, LBL. Y-axis: T [secs], lower is better.]
We expect KNL execution to be 4x faster than Haswell.
The measured speedup is approximately 3.5x over Haswell.
GPP Performance on CPU
Xeon Phi - 2.2 seconds
[Figure: Performance of GPP on Cori (Haswell vs. Xeon Phi) at LBNL. Y-axis: T [secs], lower is better.]
We expect KNL execution to be 4x faster than Haswell.
The measured speedup is 3x over Haswell.
Parallelism on GPU: KNL to Volta
A KNL node runs a few hundred hardware threads (68 cores with 4 hyperthreads each), whereas a V100 GPU keeps roughly 164K threads in flight.
Work assigned to a threadblock executes on a single SM.
GPU Programming Models
Target hardware: Volta (V100) GPUs, available on Cori (GPU nodes) and Summit.
GPU Programming Models: OpenMP 4.5
OpenMP 4.5 Offload Directives
Directives to distribute work across GPU threads
target - offload the enclosed code block onto the device
teams - spawn one or more thread teams
distribute - distribute loop iterations onto the master threads of the teams
parallel for - distribute loop iterations among the threads within a threadblock
simd - interpretation is implementation dependent (varies by compiler)

#pragma omp target teams distribute
for(){                    // Distribute the loop across threadblocks
  #pragma omp parallel for
  for(){                  // Distribute the loop across threads within a threadblock
  }
}
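A hypothetical usage sketch (not from the slides), offloading a simple doubly nested loop with these directives:

// Hypothetical sketch: outer loop -> threadblocks, inner loop -> threads within a block.
void add2d(int n, int m, const double* a, const double* b, double* c)
{
    #pragma omp target teams distribute map(to: a[0:n*m], b[0:n*m]) map(from: c[0:n*m])
    for (int i = 0; i < n; ++i) {
        #pragma omp parallel for
        for (int j = 0; j < m; ++j)
            c[i*m + j] = a[i*m + j] + b[i*m + j];
    }
}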
OpenMP 4.5 Data Movement
Allocate and delete data on the device
#pragma omp target enter data map(alloc: list-of-data-structures[:])
#pragma omp target exit data map(delete: list-of-data-structures[:])
Update data on device and host
#pragma omp target update to(list-of-data-structures[:])     // to: HostToDevice
#pragma omp target update from(list-of-data-structures[:])   // from: DeviceToHost
Clauses to use with target directives
map(to:...) map(from:...) map(tofrom:...)
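A hypothetical sketch that combines these data-movement directives around a simple offloaded reduction (names are illustrative):

// Hypothetical sketch: allocate, update, compute, then free the device copy.
double sum_on_device(int n, double* a)
{
    double sum = 0.0;
    #pragma omp target enter data map(alloc: a[0:n])          // allocate on device
    #pragma omp target update to(a[0:n])                       // host -> device
    #pragma omp target teams distribute parallel for map(tofrom: sum) reduction(+: sum)
    for (int i = 0; i < n; ++i)
        sum += a[i];
    #pragma omp target exit data map(delete: a[0:n])           // free device copy
    return sum;
}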
OpenMP 4.5 Routines on Device
Routines called from within a target region must be declared for the device:

#pragma omp declare target
void foo();
#pragma omp end declare target
Not necessary if routines are inlined
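A hypothetical usage sketch, with an illustrative routine name:

// Hypothetical sketch: the routine is compiled for the device via declare target.
#pragma omp declare target
double scale(double x) { return 2.0 * x; }
#pragma omp end declare target

void scale_all(int n, double* y)
{
    #pragma omp target teams distribute parallel for map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = scale(y[i]);          // device-side call to the declared routine
}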
OpenMP Offload of GPP
#pragma omp target teams distribute map(to:...) \
        map(tofrom: output_re[0:3], output_im[0:3])
for(X){
  #pragma omp parallel for
  for(N){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    for(int iw = 0; iw < 3; ++iw){
      #pragma omp atomic
      // update output_re[iw]
      #pragma omp atomic
      // update output_im[iw]
    }
  }
}
The X loop is distributed across threadblocks.
The N loop is distributed across the threads in a threadblock.
A reduction over the output arrays was not feasible with this arrangement of OpenMP 4.5 directives, hence atomic is used to maintain correctness.
This increases the overhead of synchronization.
Optimized Implementation
#pragma omp target enter data map(alloc: input[0:X])
#pragma omp target update to(input[0:X])
#pragma omp target teams distribute \
        parallel for collapse(2) \
        reduction(+: output_re0, output_re1, output_re2, output_im0, output_im1, output_im2)
for(X){
  for(N){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
  }
}
#pragma omp target exit data map(delete: input[0:X])
Collapsing the X and N loops and distributing them across threadblocks and threads within a block gave the best performance with the same parallelization technique.
Storing partial results in local variables improved the performance of the kernel by 10%.
Device memory is managed with target enter/exit data, so the input arrays are not re-mapped at every kernel launch, which shows up in the performance.
GPP on GPU
[Figure: Performance of GPP on V100 with OpenMP 4.5 for the GCC, Cray, Clang, and XL compilers, with the Xeon Phi time as reference. Y-axis: T [secs], lower is better.]
XL
than XL
takes 26 seconds
OpenMP 4.5 Directives: Compiler Interpretations
          Grid               Thread
GCC       teams distribute   parallel for
XL        teams distribute   parallel for
Clang     teams distribute   parallel for
Cray      teams distribute   simd

Table 1: OpenMP 4.5 mapping onto GPU hardware
XL Implementation
#pragma omp target enter data map(alloc: input[0:X])
#pragma omp target teams distribute \
        parallel for collapse(2) \
        map(to: input[0:X]) \
        reduction(+: output_re0, output_re1, output_re2, output_im0, output_im1, output_im2)
for(X){
  for(N){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
  }
}
#pragma omp target exit data map(delete: input[0:X])
With XL, variables that appear in the reduction clause should not be passed to any other clause in the same directive.
Every data structure used inside the target region has to be passed via a map clause.
Clang Implementation
#pragma omp target enter data map(alloc: input[0:X])
#pragma omp target update to(input[0:X])
#pragma omp target teams distribute \
        parallel for collapse(2) \
        map(tofrom: output_re0, output_re1, output_re2, output_im0, output_im1, output_im2) \
        reduction(+: output_re0, output_re1, output_re2, output_im0, output_im1, output_im2)
for(X){
  for(N){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
  }
}
#pragma omp target exit data map(delete: input[0:X])
Data structures already allocated on the device using OpenMP 4.5 directives need not be passed via map clauses.
Variables in the reduction clause have to also be passed to map clauses.
Cray Implementation
#pragma omp target enter data map(alloc: input[0:X])
#pragma omp target update to(input[0:X])
#pragma omp target teams distribute \
        simd collapse(2) \
        map(tofrom: output_re0, output_re1, output_re2, output_im0, output_im1, output_im2) \
        reduction(+: output_re0, output_re1, output_re2, output_im0, output_im1, output_im2)
for(X){
  for(N){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
  }
}
#pragma omp target exit data map(delete: input[0:X])
The Cray compiler executes parallel for sequentially inside the target region; simd is mapped onto the threads of a threadblock.
Variables in the reduction clause have to also be passed to the map clauses; data structures that are already allocated on the device need not be passed via the map clauses.
Routines called inside the target region must be annotated with declare target.
GCC Implementation
#pragma omp target enter data map(alloc: input[0:X])
#pragma omp target teams distribute \
        parallel for collapse(2) \
        map(tofrom: output_re0, output_re1, output_re2, output_im0, output_im1, output_im2) \
        reduction(+: output_re0, output_re1, output_re2, output_im0, output_im1, output_im2)
for(X){
  for(N){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
  }
}
#pragma omp target exit data map(delete: input[0:X])
If data structures are allocated on the device beforehand using map(alloc: ...) clauses, they need not be passed to map clauses again.
Variables in the reduction clause have to also be passed to map clauses.
OpenMP 4.5 Summary
Pass the data structures used inside the target region via map clauses:
⊲ Even if they are allocated on the device beforehand
⊲ Even if one of them is a reduction variable
Exception: with XL, reduction variables should not be passed to other clauses.
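A hypothetical sketch that follows these rules, with everything used in the target region, including the reduction variables, appearing in a map clause (note the XL caveat above); names are illustrative:

// Hypothetical sketch: map clauses cover the inputs and the reduction scalars.
void reduce2d(int n, int m, const double* in_re, const double* in_im,
              double* out_re, double* out_im)
{
    double sum_re = 0.0, sum_im = 0.0;
    #pragma omp target teams distribute parallel for collapse(2) \
            map(to: in_re[0:n*m], in_im[0:n*m]) \
            map(tofrom: sum_re, sum_im) \
            reduction(+: sum_re, sum_im)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j) {
            sum_re += in_re[i*m + j];
            sum_im += in_im[i*m + j];
        }
    *out_re = sum_re;
    *out_im = sum_im;
}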
OpenACC
OpenACC Directives: OpenMP-OpenACC 1-1 Directive Map
OpenACC:
  #pragma acc parallel loop gang    (gang - threadblock)
  #pragma acc loop vector           (vector - threads in a threadblock)
  #pragma acc loop worker           (worker - y dimension inside a threadblock, PGI compiler)

OpenMP equivalents: teams distribute, parallel for, simd
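A hypothetical usage sketch of the OpenACC counterparts on a simple doubly nested loop (names are illustrative):

// Hypothetical sketch: outer loop -> gangs (threadblocks), inner loop -> vector lanes (threads).
void add2d_acc(int n, int m, const double* a, const double* b, double* c)
{
    #pragma acc parallel loop gang copyin(a[0:n*m], b[0:n*m]) copyout(c[0:n*m])
    for (int i = 0; i < n; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < m; ++j)
            c[i*m + j] = a[i*m + j] + b[i*m + j];
    }
}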
OpenACC Directives: Data Directionality Clauses
#pragma acc enter data copyin(...)    // allocate on device and copy host -> device
#pragma acc enter data create(...)    // allocate on device only
#pragma acc exit data copyout(...)    // copy device -> host and deallocate
#pragma acc exit data delete(...)     // deallocate on device
#pragma acc data copy(...)            // structured region: copy in at entry, copy out at exit
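A hypothetical sketch of unstructured data management built from these clauses (array name and size are illustrative):

// Hypothetical sketch: allocate, update, compute, then free the device copy.
void double_in_place(int n, double* a)
{
    #pragma acc enter data create(a[0:n])    // allocate on device, no copy
    #pragma acc update device(a[0:n])        // host -> device
    #pragma acc parallel loop gang vector present(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] *= 2.0;
    #pragma acc update self(a[0:n])          // device -> host
    #pragma acc exit data delete(a[0:n])     // free device copy
}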
OpenACC Implementation of GPP: PGI vs Cray
#pragma acc enter data create(input[0:X])
#pragma acc update device(input[0:X])
#pragma acc parallel loop gang collapse(2) present(input) \
        reduction(+: output_re0, output_re1, output_re2, output_im0, output_im1, output_im2)
for(X){
  for(N){
    #pragma acc loop vector \
            reduction(+: output_re0, output_re1, output_re2, output_im0, output_im1, output_im2)
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
  }
}
The X and N loops are collapsed and distributed across threadblocks.
The M loop is distributed across the threads of a threadblock.
The reduction clause is needed at both the gang and vector level since the output variables are updated by every thread.
OpenACC Implementation of GPP: PGI vs Cray
#pragma acc enter data create(input[0:X])
#pragma acc update device(input[0:X])
#pragma acc parallel loop gang vector collapse(2) present(input[0:X]) \
        reduction(+: output_re0, output_re1, output_re2, output_im0, output_im1, output_im2)
for(X){
  for(N){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
  }
}
Collapse the X and N loops to distribute them across threadblocks and threads within a block.
Data structures already on the device have to be passed to the present clause.
GPP on GPU: OpenACC Performance on Volta
[Figure: Performance of GPP on V100 with OpenACC (Cray and PGI), with the Xeon Phi time as reference. Y-axis: T [secs], lower is better.]
PGI gives the better runtime.
Comparison of Performance on GPU: Performance on Volta
[Figure: Performance comparison of GPP on V100: OpenACC (Cray), OpenMP (Cray), OpenACC (PGI), OpenMP (Clang), OpenMP (XL), and CUDA, with the Xeon Phi time as reference. Y-axis: Time [secs], lower is better.]
The Xeon Phi (KNL) time is shown as the reference time.
The OpenMP and OpenACC versions give similar performance and are slower than Xeon Phi.
the 2nd best implementation
CUDA Implementation of GPP (cuda/10.0)
CUDA
for(X){                              // blockIdx.x
  for(N){                            // blockIdx.y
    for(M){                          // threadIdx.x
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    // atomicAdd into output_re[iw]
    // atomicAdd into output_im[iw]
  }
}
The X and N loops are mapped onto the block grid.
The M loop is mapped onto the threads of a threadblock.
atomicAdd is used for the final update to maintain correctness.
dim3 numBlocks(X, N, 1);
dim3 numThreads(64, 1, 1);
gpp_kernel<<<numBlocks, numThreads>>>(...);
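For reference, a minimal sketch of a kernel with this mapping; the array names and the arithmetic are illustrative, not the actual BerkeleyGW code:

// Illustrative sketch: each block handles one (x, n) pair, threads stride over M,
// and per-thread partial sums are combined with atomicAdd.
__global__ void gpp_kernel_sketch(int M, const double* a_re, const double* a_im,
                                  double* output_re, double* output_im)
{
    int x = blockIdx.x;                                   // X loop -> blockIdx.x
    int n = blockIdx.y;                                   // N loop -> blockIdx.y
    double loc_re[3] = {0.0, 0.0, 0.0};
    double loc_im[3] = {0.0, 0.0, 0.0};

    for (int m = threadIdx.x; m < M; m += blockDim.x) {   // M loop -> threadIdx.x
        for (int iw = 0; iw < 3; ++iw) {
            // placeholder arithmetic; the real kernel performs complex multiply-adds
            loc_re[iw] += a_re[(size_t)x * M + m];
            loc_im[iw] += a_im[(size_t)n * M + m];
        }
    }
    // combine per-thread partial sums; atomicAdd keeps the update correct
    for (int iw = 0; iw < 3; ++iw) {
        atomicAdd(&output_re[iw], loc_re[iw]);
        atomicAdd(&output_im[iw], loc_im[iw]);
    }
}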
OpenMP Implementation to Match CUDA: Loop Reordering
CUDA
for(X){                              // blockIdx.x
  for(N){                            // blockIdx.y
    for(M){                          // threadIdx.x
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
  }
}
OpenMP
#pragma omp target teams distribute \
        parallel for collapse(2) \
        map(to:...) \
        reduction(+: output_re0, output_re1, output_re2, output_im0, output_im1, output_im2)
for(N){
  for(X){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
  }
}
Comparison of Performance on GPU: Performance on Volta
[Figure: Performance comparison of GPP on V100 after loop re-ordering: OpenACC (Cray), OpenMP (Cray), OpenACC (PGI), OpenMP (Clang), OpenMP (XL), and CUDA, with the Xeon Phi time as reference. Y-axis: Time [secs], lower is better.]
The OpenMP and OpenACC versions are 2x faster after loop re-ordering; the best directive-based versions are now similar to optimized CUDA.
Performance Portability
#pragma omp target enter data map(alloc: input[0:X])
#pragma omp target update to(input[0:X])
#pragma omp target teams distribute \
        parallel for collapse(2) \
        map(tofrom: output_re0, output_re1, output_re2, output_im0, output_im1, output_im2) \
        reduction(+: output_re0, output_re1, output_re2, output_im0, output_im1, output_im2)
for(N){
  for(X){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
  }
}
#pragma omp target exit data map(delete: input[0:X])
On the CPU, the compiler creates a single team and associates all threads to that team.
Collapse the N and X loops and distribute them across threads.
The CPU compiler ignores the device-related directives, for example the device memory allocation directives.
Performance Portability
[Figure: CPU-optimized vs. GPU-optimized OpenMP 4.5 versions of GPP on Volta (clang compiler) and Xeon Phi (intel/2018 compiler). Y-axis: T [secs], lower is better.]
The GPU-optimized OpenMP 4.5 code run on Xeon Phi is only 10% slower than the optimized Xeon Phi version.
The CPU-optimized OpenMP 4.5 code is 30x slower on Volta.
Conclusions
With appropriate loop ordering, the directive-based implementations approach CUDA performance.