SLIDE 1

Optimizing Large Reductions in BerkeleyGW on GPUs

Using OpenMP and OpenACC

Rahulkumar Gayatri, Charlene Yang

National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory
March 8, 2019

rgayatri@lbl.gov, cjyang@lbl.gov

slide-2
SLIDE 2

Motivation

Why Attend this Talk

  • 5 of the top 10 supercomputers use NVIDIA GPUs
  • Most codes optimized for CPUs now have to be rewritten for GPUs
  • Compiler-directive-based approaches are attractive due to their ease of use
  • Large codes can be ported incrementally
  • This talk provides a detailed analysis of the current state of directive-based programming models:
  • Their performance compared to optimized CUDA code
  • Supported compilers
  • Differences in compiler implementations

SLIDE 3

Overview

Outline of the Presentation

  • BerkeleyGW, a materials science code
  • General Plasmon Pole (GPP), a mini-app
  • Baseline CPU implementation
  • GPU programming models (OpenMP, OpenACC, CUDA)
  • GPP on GPU
  • Naive implementation
  • Optimized implementation
  • Compare the approaches and performance of each implementation
  • Backport the GPU implementation to the CPU for performance portability

SLIDE 4

BerkeleyGW

BerkeleyGW

  • The GW method is an accurate approach to simulating the excited-state properties of materials:
  • What happens when you add or remove an electron from a system?
  • How do electrons behave when you apply a voltage?
  • How does the system respond to light or X-rays?
  • Stand-alone kernels are extracted from BerkeleyGW and run as mini-apps

SLIDE 5

Test Case Kernel

General Plasmon Pole (GPP)

  • Mini-app from BerkeleyGW
  • Computes the electron self-energy using the General Plasmon Pole approximation
  • Characteristics of GPP:
  • Reduction over a series of double-complex arrays involving multiply, divide and add instructions (partial FMA)
  • For typical calculations the arithmetic intensity (Flops/Byte) is between 1 and 10, i.e., the kernel has to be optimized for memory locality and vectorization/SIMT efficiency

SLIDE 6

Complex Number in C/C++

Complex Number Class

  • BerkeleyGW consists of double-complex number calculations
  • std::complex difficulties:
  • Performance issues
  • Difficult to vectorize
  • Cannot offload operations onto the device using OpenMP 4.5
  • thrust::complex:
  • Challenges in offloading complex operator routines onto the device
  • Built an in-house complex class (minimal sketch below)
  • 2 doubles on CPU
  • double2 vector type on GPU
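
A minimal sketch of what such an in-house class can look like, assuming the CPU build stores two doubles and the CUDA build maps them onto the double2 layout; the name GPUComplex and the operator set shown here are illustrative, not the actual BerkeleyGW implementation:

// Sketch of an in-house double-complex type (illustrative)
#ifdef __CUDACC__
  #define CPLX_DECOR __host__ __device__
#else
  #define CPLX_DECOR
#endif

class GPUComplex {
public:
    double re, im;   // two doubles on the CPU; same layout as double2 on the GPU

    CPLX_DECOR GPUComplex(double r = 0.0, double i = 0.0) : re(r), im(i) {}

    // Only the operators the kernel needs, so they inline cleanly on the device
    CPLX_DECOR GPUComplex operator+(const GPUComplex& b) const {
        return GPUComplex(re + b.re, im + b.im);
    }
    CPLX_DECOR GPUComplex operator*(const GPUComplex& b) const {
        return GPUComplex(re * b.re - im * b.im, re * b.im + im * b.re);
    }
    CPLX_DECOR GPUComplex operator*(double s) const {
        return GPUComplex(re * s, im * s);
    }
};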

SLIDE 7

GPP

GPP pseudocode - reduction in the innermost loop

Code

for(X){                              // X = 512
  for(N){                            // N = 1638
    for(M){                          // M = 32768
      for(int iw = 0; iw < 3; ++iw){
        // Some computation
        output[iw] += ...
      }
    }
  }
}

  • Memory footprint: O(2 GB)
  • Typical single-node problem size
  • output is double complex

SLIDE 8

GPP On CPU

SLIDE 9

GPP CPU Parallelization

OpenMP 3.0 parallelization of GPP

#pragma omp parallel for reduction(+: output_re[0-2], output_im[0-2])
for(X){
  for(N){
    for(M){                              // Vectorize
      for(int iw = 0; iw < 3; ++iw){     // Unroll
        // Store local
      }
    }
    for(int iw = 0; iw < 3; ++iw){
      output_re[iw] += ...
      output_im[iw] += ...
    }
  }
}

  • Unroll the innermost iw-loop
  • Vectorize the M-loop
  • collapse increased the runtime by 10%
  • Check compiler reports (intel/2018) to guarantee vectorization and unrolling
  • Flatten the arrays into scalars with compilers that do not support array reduction (sketched below)
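
Where the compiler does not support array reductions, the flattening works as in the following sketch; the loop bounds come from the slide above, while the loop body is a stand-in for the real GPP computation:

#include <cstdio>

int main() {
    const int X = 512, N = 1638;
    // Flatten the 3-element output array into scalars so the reduction clause
    // also works with compilers that lack array-reduction support.
    double out0 = 0.0, out1 = 0.0, out2 = 0.0;

    #pragma omp parallel for reduction(+: out0, out1, out2)
    for (int x = 0; x < X; ++x) {
        for (int n = 0; n < N; ++n) {
            double t = 1.0 / (1.0 + x + n);   // stand-in for the real kernel body
            out0 += t;
            out1 += 2.0 * t;
            out2 += 3.0 * t;
        }
    }
    std::printf("%f %f %f\n", out0, out1, out2);
    return 0;
}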

SLIDE 10

GPP Performance on CPU

Runtime of GPP on Cori

[Figure: Performance of GPP on Cori; runtime T (secs) per CPU architecture (Haswell, Xeon Phi); lower is better]

  • Performance numbers from Cori at NERSC, LBNL
  • Haswell
  • Xeon Phi
  • intel/2018 compilers
  • Perfect scaling would make the KNL execution 4× faster than Haswell
  • The KNL implementation of GPP is approximately 3.5× faster than Haswell

SLIDE 11

GPP Performance on CPU

Runtime of GPP on Cori

Xeon Phi - 2.2 seconds

[Figure: Performance of GPP on Cori; runtime T (secs) per CPU architecture (Haswell, Xeon Phi); lower is better]

  • Performance numbers from Cori at LBNL
  • Haswell
  • Xeon Phi
  • intel/2018 compilers
  • Perfect scaling would make the KNL execution 4× faster than Haswell
  • The KNL implementation of GPP is 3× faster than Haswell

SLIDE 12

GPP On GPU

SLIDE 13

Parallelism on GPU: KNL to Volta

GPU Hardware

KNL vs GPU

  • Going from 272 threads on KNL to ~164K threads on the GPU
  • 80 SMs
  • 2048 threads within an SM

SLIDE 14

GPU Programming Models

Programming models used to port GPP to the GPU

  • OpenMP 4.5
  • Cray
  • XL(IBM)
  • Clang
  • GCC
  • OpenACC
  • PGI
  • Cray
  • CUDA

Volta GPU available on Cori and Summit

SLIDE 15

GPU Programming Models: OpenMP 4.5

OpenMP offloading to GPU

  • OpenMP 4.5
  • Cray
  • XL(IBM)
  • Clang
  • GCC
  • OpenACC
  • PGI
  • Cray
  • CUDA

Volta GPU available on Cori and Summit

SLIDE 16

OpenMP 4.5 Offload Directives

OpenMP directives to offload code-blocks onto GPUs

Directives to distribute work across GPU threads

target       - offload the code block onto the device
teams        - spawn one or more thread teams
distribute   - distribute loop iterations onto the master threads of the teams
parallel for - distribute loop iterations among the threads of a threadblock
simd         - implementation is compiler dependent

#pragma omp target teams distribute
for()   // Distribute the loop across threadblocks
  #pragma omp parallel for
  for()   // Distribute the loop across threads within a threadblock

SLIDE 17

OpenMP 4.5 Data Movement

OpenMP 4.5 directives to move data between host and device

Allocate and delete data on the device

#pragma omp target enter data map(alloc: list-of-data-structures[:])
#pragma omp target exit data map(delete: list-of-data-structures[:])

Update data on device and host

#pragma omp target update to/from (list-of-data-structures[:])
to   - HostToDevice
from - DeviceToHost

Clauses to use with target directives

map(to:...) map(from:...) map(tofrom:...)
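
Putting these directives together, a minimal sketch of the typical allocate/update/compute/free lifecycle; the array name, its size and the reduction loop are illustrative, not taken from GPP:

#include <cstdio>

int main() {
    const int n = 1 << 20;                                  // illustrative size
    double *input = new double[n];
    for (int i = 0; i < n; ++i) input[i] = 1.0;
    double sum = 0.0;

    #pragma omp target enter data map(alloc: input[0:n])   // allocate on the device
    #pragma omp target update to(input[0:n])                // copy host -> device

    #pragma omp target teams distribute parallel for reduction(+: sum) map(tofrom: sum)
    for (int i = 0; i < n; ++i)
        sum += input[i];

    #pragma omp target exit data map(delete: input[0:n])    // release the device copy
    delete[] input;
    std::printf("sum = %f\n", sum);
    return 0;
}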

SLIDE 18

OpenMP 4.5 Routines on Device

OpenMP 4.5 directives to offload routines onto the device

Routines

#pragma omp declare target
void foo();
#pragma omp end declare target

Not necessary if routines are inlined
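
For example, a routine called from inside a target region can be declared as in the sketch below; scale and run are illustrative names, not BerkeleyGW routines:

#pragma omp declare target
double scale(double x) { return 2.0 * x; }   // callable from device code
#pragma omp end declare target

void run(double *a, int n) {
    #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = scale(a[i]);
}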

SLIDE 19

OpenMP Offload of GPP

Naive OpenMP 4.5 implementation of GPP

#pragma omp target teams distribute map(to:...) \
        map(tofrom: output_re[0-2], output_im[0-2])
for(X){
  #pragma omp parallel for
  for(N){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    for(int iw = 0; iw < 3; ++iw){
      #pragma omp atomic
      output_re[iw] += ...
      #pragma omp atomic
      output_im[iw] += ...
    }
  }
}

  • Distribute the M-loop across threadblocks
  • Distribute the N-loop among threads in a threadblock
  • No array reduction with OpenMP 4.5 directives, hence atomics are used to maintain correctness
  • Parallelizing the M-loop increases the overhead of synchronization

SLIDE 20

Optimized Implementation

Optimized implementation with OpenMP 4.5

#pragma omp target enter data map(alloc: input[0:X])
#pragma omp target update to(input[0:X])
#pragma omp target teams distribute \
        parallel for collapse(2) \
        reduction(+: output_re(0,1,2), output_im(0,1,2))
for(X){
  for(N){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    output_re(0,1,2) += ...
    output_im(0,1,2) += ...
  }
}
#pragma omp target exit data map(delete: input)

  • XL, Clang, Cray and GCC gave the best performance with the same parallelization technique
  • Collapse the N and M loops and distribute them across threadblocks and threads within a block
  • Allocating device memory beforehand with #pragma omp target enter/exit data improved the performance of the kernel by 10%
  • The reduction gave a 3× boost in performance
  • Flatten the arrays to scalars

SLIDE 21

GPP on GPU

Performance of GPP on V100 with OpenMP 4.5

[Figure: Performance of GPP on V100 with OpenMP 4.5; runtime T (secs) for GCC, Cray, Clang and XL, with the Xeon Phi time as reference; lower is better]

  • Cray is 3× slower than XL
  • Clang is 30% slower than XL
  • The GCC implementation takes 26 seconds

SLIDE 22

OpenMP 4.5 Directives: Compiler Interpretations

OpenMP 4.5 directives map onto hardware

Compiler   Grid (threadblock)     Thread
GCC        teams distribute       parallel for
XL         teams distribute       parallel for
Clang      teams distribute       parallel for
Cray       teams distribute       simd

Table 1: OpenMP 4.5 mapping onto GPU hardware

SLIDE 23

XL Implementation

Optimized implementation with XL

#pragma omp target enter data map(alloc: input[0:X])
#pragma omp target teams distribute \
        parallel for collapse(2) \
        map(to: input[0:X]) \
        reduction(+: output_re(0,1,2), output_im(0,1,2))
for(X){
  for(N){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    output_re(0,1,2) += ...
    output_im(0,1,2) += ...
  }
}
#pragma omp target exit data map(delete: input)

  • Did not support class operators in older versions
  • Variables passed to the reduction clause should not be passed to any other clause in the same directive
  • All data accessed inside the target region has to be passed via a map clause
  • simd has no effect

SLIDE 24

Clang Implementation

Optimized implementation with Clang

#pragma omp target enter data map(alloc: input[0:X])
#pragma omp target update to(input[0:X])
#pragma omp target teams distribute \
        parallel for collapse(2) \
        map(tofrom: output_re(0,1,2), output_im(0,1,2)) \
        reduction(+: output_re(0,1,2), output_im(0,1,2))
for(X){
  for(N){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    output_re(0,1,2) += ...
    output_im(0,1,2) += ...
  }
}
#pragma omp target exit data map(delete: input)

  • Data allocated on the device using OpenMP 4.5 directives need not be passed via map clauses
  • Variables passed to the reduction clause also have to be passed to map clauses

SLIDE 25

Cray Implementation

Optimized Cray implementation

#pragma omp target enter data map(alloc: input[0:X])
#pragma omp target update to(input[0:X])
#pragma omp target teams distribute \
        simd collapse(2) \
        map(tofrom: output_re(0,1,2), output_im(0,1,2)) \
        reduction(+: output_re(0,1,2), output_im(0,1,2))
for(X){
  for(N){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    output_re(0,1,2) += ...
    output_im(0,1,2) += ...
  }
}
#pragma omp target exit data map(delete: input)

  • parallel for is executed sequentially inside the target region
  • simd distributes the loop across the threads of a threadblock
  • reduction variables have to be passed to the map clauses
  • Previously allocated data need not be passed via the map clauses
  • printf is not supported inside routines annotated with declare target

SLIDE 26

GCC Implementation

Optimized GCC implementation

#pragma omp target enter data map(alloc: input[0:X])
#pragma omp target teams distribute \
        parallel for collapse(2) \
        map(tofrom: output_re(0,1,2), output_im(0,1,2)) \
        reduction(+: output_re(0,1,2), output_im(0,1,2))
for(X){
  for(N){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    output_re(0,1,2) += ...
    output_im(0,1,2) += ...
  }
}
#pragma omp target exit data map(delete: input)

  • simd gives a compiler error
  • If data is allocated beforehand using map(alloc:...) clauses, it need not be passed to map clauses again
  • Variables passed to the reduction clause also have to be passed to map clauses

SLIDE 27

OpenMP 4.5 Summary

Cheat Sheet of Do's and Don'ts

  • XL
  • Everything accessed inside the target region has to be mapped explicitly via map clauses
    ⊲ Even if it is allocated on the device beforehand
  • Do not pass the same data to two different clauses in the same directive
    ⊲ Even if one of them is a reduction clause
  • Clang, GCC, Cray
  • Always pass the directionality information for the reduction variables via map clauses
  • GCC: do not use simd

SLIDE 28

OpenACC

OpenACC offloading to GPU

  • OpenMP
  • Cray
  • XL(IBM)
  • Clang
  • GCC
  • OpenACC
  • PGI
  • Cray
  • CUDA

SLIDE 29

OpenACC Directives: OpenMP-OpenACC 1-1 Directive Map

OpenACC directive map on GPU

OpenACC mapping onto the GPU

gang   - threadblock
vector - threads in a threadblock
worker - y dimension inside a threadblock (PGI compiler)

Corresponding directives

OpenMP:  teams distribute, parallel for, simd
OpenACC: #pragma acc parallel loop gang, #pragma acc loop vector, #pragma acc loop worker

SLIDE 30

OpenACC Directives: Data Directionality Clauses

OpenACC directives for memory movement

#pragma acc enter data copyin(...)
#pragma acc enter data create(...)
#pragma acc exit data copyout(...)
#pragma acc exit data delete(...)
#pragma acc data copy(...)
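
A minimal sketch of the allocate/update/compute/free lifecycle these clauses enable, matching the pattern used for GPP on the following slides; array name, size and the reduction loop are illustrative:

#include <cstdio>

int main() {
    const int n = 1 << 20;                          // illustrative size
    double *input = new double[n];
    for (int i = 0; i < n; ++i) input[i] = 1.0;
    double sum = 0.0;

    #pragma acc enter data create(input[0:n])       // allocate on the device
    #pragma acc update device(input[0:n])           // copy host -> device

    #pragma acc parallel loop gang vector present(input[0:n]) reduction(+: sum)
    for (int i = 0; i < n; ++i)
        sum += input[i];

    #pragma acc exit data delete(input[0:n])        // release the device copy
    delete[] input;
    std::printf("sum = %f\n", sum);
    return 0;
}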

SLIDE 31

OpenACC Implementation of GPP: PGI vs Cray

Optimized GPP implementation with PGI OpenACC

#pragma acc enter data create(input[0:X])
#pragma acc update device(input[0:X])
#pragma acc parallel loop gang collapse(2) present(input) \
        reduction(+: output_re(0,1,2), output_im(0,1,2))
for(X){
  for(N){
    #pragma acc loop vector \
            reduction(+: output_re(0,1,2), output_im(0,1,2))
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    output_re(0,1,2) += ...
    output_im(0,1,2) += ...
  }
}

  • Collapse the X and N loops to distribute across threadblocks
  • Distribute the M loop across the threads of a threadblock
  • reduction is required at both the gang and vector level since the output variables are updated by every thread

SLIDE 32

OpenACC Implementation of GPP: PGI vs Cray

Optimized GPP implementation with Cray OpenACC

#pragma acc enter data create(input[0:X])
#pragma acc update device(input[0:X])
#pragma acc parallel loop gang vector collapse(2) present(input[0:X]) \
        reduction(+: output_re(0,1,2), output_im(0,1,2))
for(X){
  for(N){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    output_re(0,1,2) += ...
    output_im(0,1,2) += ...
  }
}

  • Collapse the X and N loops to distribute them across threadblocks and threads within a block
  • The dimensions of the data structures have to be passed to the present clause

SLIDE 33

GPP on GPU: OpenACC Performance on Volta

Cray and PGI implementations of GPP using OpenACC

[Figure: Performance of GPP on V100 with OpenACC; runtime T (secs) for Cray and PGI, with the Xeon Phi time as reference; lower is better]

  • Cray is 3× slower than PGI
  • Cray is 50% slower than the optimized Xeon Phi runtime

SLIDE 34

Comparison of Performance on GPU: Volta

Performance comparison of all GPU implementations

[Figure: Performance comparison of GPP on V100; Time (secs) for OpenACC(Cray), OpenMP(Cray), OpenACC(PGI), OpenMP(Clang), OpenMP(XL) and CUDA, with the Xeon Phi reference; lower is better]

  • The dashed line is the Xeon Phi reference time
  • Cray OpenMP and OpenACC give similar performance and are slower than Xeon Phi
  • CUDA is 2× faster than the 2nd-best implementation

SLIDE 35

CUDA Implementation of GPP (cuda/10.0)

CUDA Implementation of GPP

CUDA

for(X){                              // blockIdx.x
  for(N){                            // blockIdx.y
    for(M){                          // threadIdx.x
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    output_re{0,1,2} += ...          // Atomic Add
    output_im{0,1,2} += ...          // Atomic Add
  }
}

  • 2-dimensional grid for the X and N loops
  • Distribute the M-loop across threads in a threadblock
  • CUDA atomics to maintain correctness

dim3 numBlocks(X, N, 1);
dim3 numThreads(64, 1, 1);
gpp_kernel<<<numBlocks, numThreads>>>(...);
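
A sketch of what a kernel with this mapping can look like: one block per (X, N) pair, threads striding over M, and atomicAdd for the reduction. The kernel body, argument list and names are illustrative, not the actual BerkeleyGW kernel (double-precision atomicAdd assumes compute capability 6.0 or newer, which V100 satisfies):

// Illustrative kernel: blockIdx.x -> X, blockIdx.y -> N, threadIdx.x strides over M
__global__ void gpp_kernel(const double *in, int M, double *output_re) {
    const int x = blockIdx.x;
    const int n = blockIdx.y;
    double local[3] = {0.0, 0.0, 0.0};

    for (int m = threadIdx.x; m < M; m += blockDim.x) {       // M-loop across threads
        double t = in[(size_t)x * gridDim.y + n] * m;          // stand-in computation
        for (int iw = 0; iw < 3; ++iw)
            local[iw] += t * (iw + 1);
    }
    for (int iw = 0; iw < 3; ++iw)
        atomicAdd(&output_re[iw], local[iw]);                  // keep the reduction correct
}

// Launched with the 2-D grid shown above:
//   dim3 numBlocks(X, N, 1);
//   dim3 numThreads(64, 1, 1);
//   gpp_kernel<<<numBlocks, numThreads>>>(d_in, M, d_output_re);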

SLIDE 36

OpenMP Implementation to Match CUDA: Loop Reordering

OpenMP loop reordering to match the CUDA implementation

CUDA

for(X){                              // blockIdx.x
  for(N){                            // blockIdx.y
    for(M){                          // threadIdx.x
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    output_re{0,1,2} += ...          // Atomic
    output_im{0,1,2} += ...          // Atomic
  }
}

OpenMP

#pragma omp target teams distribute \
        parallel for collapse(2) \
        map(to:...) \
        reduction(+: output_re(0,1,2), output_im(0,1,2))
for(N){
  for(X){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    output_re(0,1,2) += ...
    output_im(0,1,2) += ...
  }
}

SLIDE 37

Comparison of Performance on GPU: Volta

Performance of GPP implementations after loop reordering

[Figure: Performance of GPP on V100 after loop reordering; Time (secs) for OpenACC(Cray), OpenMP(Cray), OpenACC(PGI), OpenMP(Clang), OpenMP(XL) and CUDA, with the Xeon Phi reference; lower is better]

  • OpenMP (XL and Clang) are 2× faster after loop reordering
  • OpenACC (PGI) is 30% faster
  • OpenACC (Cray) is 3× faster
  • XL and Clang OpenMP are now similar to optimized CUDA

SLIDE 38

Performance Portability

SLIDE 39

Performance Portability

Interpretation of OpenMP 4.5 directives on the CPU

#pragma omp target enter data map(alloc: input[0:X])
#pragma omp target update to(input[0:X])
#pragma omp target teams distribute \
        parallel for collapse(2) \
        map(tofrom: output_re(0,1,2), output_im(0,1,2)) \
        reduction(+: output_re(0,1,2), output_im(0,1,2))
for(N){
  for(X){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    output_re(0,1,2) += ...
    output_im(0,1,2) += ...
  }
}
#pragma omp target exit data map(delete: input)

  • intel/2018 compilers
  • teams creates a single team and associates all threads with that team
  • Reverse the order of the X and N loops and distribute them across threads
  • Ignores other OpenMP 4.5 device-related directives, for example device memory allocation directives

SLIDE 40

Performance Portability

Performance of GPU implementations on CPU

[Figure: Runtime T (secs) of the CPU-optimized and GPU-optimized OpenMP 4.5 versions on Volta and Xeon Phi; lower is better]

GPU: Clang compiler; CPU: intel/2018 compilers

  • GPU-optimized OpenMP is 10% slower than optimized Xeon Phi
  • CPU-optimized OpenMP is 30× slower on Volta

SLIDE 41

Conclusions

Summary of the Presentation

  • Multiple implementations of OpenMP offloading gave us close to optimized CUDA performance
  • Differences exist in compiler interpretations of OpenMP 4.5 offload directives
  • Loop reordering might provide benefits due to changes in data access patterns
  • OpenACC had difficulty with CPU vectorization
  • Portable code, but not performance portable
