[PPT] - S6240 - High-Level GPU Programming Using OpenMP 4.5 and PowerPoint Presentation

SLIDE 1

S6240 ¡-‑ ¡High-‑Level ¡GPU ¡ Programming ¡Using ¡OpenMP ¡4.5 ¡ and ¡Clang/LLVM

Arpith ¡Jacob, ¡Alexandre ¡Eichenberger, ¡Samuel ¡Antao, ¡Carlo ¡Bertolli, ¡Tong ¡ Chen, ¡Zehra ¡Sura, ¡Hyojin ¡Sung, ¡Georgios ¡Rokos, ¡Kevin ¡O’Brien ¡ IBM ¡T. ¡J. ¡Watson ¡Research ¡Center ¡

SLIDE 2

IBM ¡is ¡building ¡heterogeneous ¡systems ¡with ¡Power ¡+ ¡GPU ¡
AdvocaKng ¡the ¡use ¡of ¡the ¡OpenMP ¡programming ¡model ¡
IBM ¡Research ¡is ¡contribuKng ¡OpenMP ¡support ¡for ¡NVIDIA ¡GPUs ¡in ¡

Clang/LLVM ¡

Upstreaming ¡in ¡progress. ¡download ¡at: ¡ibm.biz/ykt-omp

2 ¡

CPU ¡

GPU ¡

Overview

SLIDE 3

T ¡ T ¡ T ¡ T ¡ T ¡ T ¡ T ¡ T ¡

C ¡ C ¡ C ¡ C ¡ C ¡ C ¡ C ¡ C ¡ C ¡ C ¡ C ¡ C ¡

SMTs ¡ 12 ¡CORES ¡ PER ¡SOCKET ¡ SMX ¡ Streaming ¡ Mul8processors ¡ SP ¡CUDA ¡cores, ¡ DP ¡units, ¡etc.. ¡ SMX ¡ SMX ¡ SMX ¡ SMX ¡ SMX ¡ SMX ¡ SMX ¡ SMX ¡ SMX ¡ SMX ¡ SMX ¡ SMX ¡

ff ¡chip ¡

L1 ¡ L2 ¡ L3 ¡ L4 ¡ L3 ¡ DRAM ¡

ff ¡chip ¡

Constant ¡ Read-‑Only ¡ L1+SMEM ¡ L2 ¡ DRAM ¡

2 ¡TB+ ¡ 12 ¡GB ¡

Processing ¡ Storage ¡ Kepler/Maxwell ¡ Kepler/Maxwell ¡ Kepler/Maxwell ¡

P8 ¡ P8 ¡

NVLINK/PCI-‑E ¡

3 ¡

Kepler/Maxwell ¡ Latency ¡SensiRve ¡ High ¡single ¡thread ¡performance ¡ Hide ¡latency ¡via ¡memory ¡prefetch ¡or, ¡ Cache ¡hierarchy ¡for ¡spaKal ¡and ¡temporal ¡locality ¡ Throughput ¡OpRmized ¡ OpKmized ¡for ¡mulK-‑threaded ¡code ¡ Low ¡overhead ¡context ¡switch ¡ Hide ¡memory ¡latency ¡with ¡threads ¡

ExploiFng ¡Heterogeneous ¡Node ¡Resources

Kepler ¡

SLIDE 4

ApplicaKons ¡must ¡exploit ¡heterogeneous ¡resources ¡in ¡a ¡

performance ¡portable ¡manner ¡ ¡ ¡

Use ¡vendor ¡specific ¡languages ¡and ¡direcKves? ¡
Compiler ¡specific ¡pragmas? ¡
Mix ¡of ¡programming ¡models? ¡OpenMP, ¡OpenACC, ¡CUDA ¡

4 ¡

CPU ¡

GPU ¡

Programmability ¡Challenge

SLIDE 5

Fortran ¡

RAJA ¡ KOKKOS ¡ OP2 ¡OPS ¡ … ¡

Kepler/Maxwell ¡ Kepler/Maxwell ¡

P8 ¡ P8 ¡

NVLINK/PCI-‑E ¡

Kepler/Maxwell ¡ Kepler/Maxwell ¡

OpenMP is widely used to program CPUs; latest specs support accelerators
Write performance portable code using flexible parallelism models
Industry-wide acceptance: IBM, Intel, PathScale, Cray, PGI, Oracle, MS

5 ¡

Programming ¡Overview

SLIDE 6

6 ¡

node ¡memory

¡

¡ ¡ ¡ ¡ ¡ ¡

SMX ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡

¡ ¡ ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

cache ¡

SMX ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡

cache ¡

¡ ¡ ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

core ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L2 ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

core ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L2 ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

core ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L2 ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

core ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L2 ¡ ¡ ¡ ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

core ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L2 ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

core ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L2 ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

core ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L2 ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

core ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L2 ¡

socket ¡ socket ¡

CPU

¡

GPU

¡

OpenMP ¡Memory ¡Model

SLIDE 7

7 ¡

node ¡memory

¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L3 ¡

¡ ¡ ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

core ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L2 ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

core ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L2 ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

core ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L2 ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

core ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L2 ¡ ¡ ¡ ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

core ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L2 ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

core ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L2 ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

core ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L2 ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

core ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L2 ¡

socket ¡ socket ¡

CPU

¡

CPU ¡threads ¡may ¡have ¡their ¡own ¡view ¡of ¡

shared ¡variables ¡

Relaxed ¡consistency ¡
Explicit ¡flush ¡operaKons ¡on ¡host ¡

required ¡

Unsynchronized ¡accesses ¡may ¡lead ¡to ¡data ¡

races ¡

OpenMP ¡Memory ¡Model

SLIDE 8

8 ¡

node ¡memory

¡

device ¡memory ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡

¡ ¡ ¡ ¡ ¡ ¡

SMX ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡

¡ ¡ ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

cache ¡

SMX ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡

cache ¡

¡ ¡ ¡

GPU

¡

OMP4 ¡extends ¡views ¡to ¡target ¡

devices ¡

Map: ¡control ¡data ¡views ¡
Target ¡data ¡enter/exit ¡
Target ¡update ¡
Unsynchronized ¡accesses ¡may ¡lead ¡

to ¡data ¡races ¡

OpenMP ¡Memory ¡Model

SLIDE 9

9 ¡

node ¡memory

¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L3 ¡

device ¡memory ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡

¡ ¡ ¡ ¡ ¡ ¡

SMX ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡

¡ ¡ ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

cache ¡

SMX ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡

cache ¡

¡ ¡ ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

core ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L2 ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

core ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L2 ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

core ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L2 ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

core ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L2 ¡ ¡ ¡ ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

core ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L2 ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

core ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L2 ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

core ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L2 ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

core ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡

L2 ¡

socket ¡ socket ¡

CPU

¡

GPU

¡

OpenMP ¡Memory ¡Model

SLIDE 10

How ¡do ¡we ¡use ¡OpenMP ¡offload? ¡ ¡ ¡

10 ¡

#pragma omp target map(to: cls, len, compression[0:len]) \ map(from: bvc[0:len]) for (int i=0; i<len; i++) { bvc[i] = cls * (compression[i] + 1.0); }

OpenMP ¡Offload ¡Model

SLIDE 11

11 ¡

#pragma omp target map(to: cls, len, compression[0:len]) \ map(from: bvc[0:len]) for (int i=0; i<len; i++) { bvc[i] = cls * (compression[i] + 1.0); }

GPU GPU ¡ CPU CPU ¡ map() ¡ target ¡ funcRon ¡shipping ¡ compression[], ¡cls, ¡len ¡ ¡ b v c [ ] ¡ device ¡execuRon ¡

inacKve ¡host ¡thread ¡

OpenMP ¡Offload ¡Model

SLIDE 12

12 ¡

Loop work-sharing

#pragma omp parallel for for (i = 0; i < M; i++) for (j = 0; j < N; j++) A[i][j] += u1[i] * v1[j] + u2[i] * v2[j];

¡ ¡ ¡ ¡ ¡ ¡

core ¡0 ¡ ¡ ¡ core ¡1 ¡ ¡ ¡ core ¡2 ¡ ¡ ¡ core ¡3 ¡ ¡ ¡ core ¡4 ¡ ¡ ¡ core ¡5 ¡ ¡ ¡ core ¡6 ¡ ¡ ¡ core ¡7 ¡ ¡ ¡

socket ¡0 ¡ socket ¡1 ¡

Affinity: spread threads to maximize bandwidth ¡

¡ ¡ ¡ ¡ ¡ ¡

core ¡0 ¡ ¡ ¡ core ¡1 ¡ ¡ ¡ core ¡2 ¡ ¡ ¡ core ¡3 ¡ ¡ ¡ core ¡4 ¡ ¡ ¡ core ¡5 ¡ ¡ ¡ core ¡6 ¡ ¡ ¡ core ¡7 ¡ ¡ ¡

socket ¡0 ¡ socket ¡1 ¡

Affinity: pack threads to reuse cache locality ¡

Exploit ¡Hardware ¡Threads ¡on ¡POWER ¡CPUs

SLIDE 13

13 ¡

Loop work-sharing on GPUs with a target task

CUDA Grid CUDA Block

GPU

SMs, warps, lanes

CUDA

grid, blocks, threads

OpenMP

teams, threads, simd lanes

Block(0,1) Block(1,1) Block(2,1) Block(0,0) Block(1,0) Block(2,0)

Team0

parallel team master

Th0 Th1 Th31

. . .

Th32 Th33 Th63

. . .

T0 T1 T31

. . .

T32 T33 T63 Warp0

L0 L1 L31

. . .

Warp1

L0 L1 L31

#pragma omp target teams distribute #pragma omp parallel for for (i = 0; i < M; i++) for (j = 0; j < N; j++) A[i][j] += u1[i] * v1[j] + u2[i] * v2[j];

Exploit ¡Streaming ¡MulFprocessors ¡on ¡GPUs

SLIDE 14

14 ¡

SIMD and other OpenMP forms supported on the GPU

Team0

parallel team master

Th0 Th1

simd simd

T0 T1 T31

. . .

T32 T33 T63 Warp0

L0 L1 L31

. . .

Warp1

L0 L1 L31

CUDA Block

#pragma omp target teams distribute #pragma omp parallel for for (i = 0; i < M; i++) #pragma omp simd for (j = 0; j < N; j++) A[i][j] += u1[i] * v1[j] + u2[i] * v2[j];

14 ¡

Exploit ¡Streaming ¡MulFprocessors ¡on ¡GPUs

SLIDE 15

15 ¡

Task Parallelism

#pragma omp parallel #pragma omp single { #pragma omp task depend(out: a) TraverseForward(A); #pragma omp task depend(in: a) TraverseReverse(B); … }

Tasks are well suited for parallelism that is dynamically uncovered: e.g. searches, graph processing
Tasks are load balanced between threads in the parallel region
A task is fired once all its dependent tasks have completed

Exploit ¡Hardware ¡Threads ¡on ¡POWER ¡CPUs

SLIDE 16

16 ¡

Target constructs are implicit tasks
A host thread may initiate several target tasks asynchronously
Target tasks may have dependencies

Dependencies between target tasks are resolved completely on the GPU without host intervention

Host task ¡ Target task ¡

CPU ¡& ¡GPU ¡Parallelism ¡using ¡Tasks

SLIDE 17

17 ¡

Concurrency in a node

Host threads and device threads
Multiple GPUs in a node
Overlap device computation

and communication

Concurrent target tasks on

a GPU with task dependencies

CPU ¡& ¡GPU ¡Parallelism ¡using ¡Tasks

SLIDE 18

▪ CLANG

Front-end to parse source code

and generate LLVM IR code

Modified to generate code for

OpenMP device constructs

Produces two copies of code for

target regions

Inserts calls to standardized

OMP runtime interface functions

Compiler driver modified to

process code copies through different backends

▪ NVPTX backend

Produces ptx code which is then

processed through ptxas to generate CUDA binary

Collaborating with wider community and industry partners (LLVM open-source, OMP standards) C/C++ with OpenMP 4.5 CLANG

Outlining/Duplication

LLVM Power backend LLVM NVPTX backend GPU OpenMP Library Host OpenMP Library

Device Linker Host Linker

Executable

18 ¡

LLVM ¡Compiler ¡SchemaFc

SLIDE 19

19 ¡

Compiler responsible for

thread-activation and thread-coordination

OpenMP ¡Codegen ¡Internals

SLIDE 20

LULESH: ¡proxy ¡for ¡hydrodynamics ¡code ¡

¡ ¡

S6513 ¡-‑ ¡GPU ¡OpRmizaRon ¡of ¡the ¡Kripke ¡Neutral-‑ParRcle ¡Transport ¡Mini-‑App, ¡Thursday, ¡

15:30 ¡at ¡Marrioh ¡Salon ¡3 ¡

20 ¡

heps://codesign.llnl.gov/lulesh.php ¡

Performance ¡Preview

Kernel* CUDA (us) OpenMP 4.0 (us) Acceleration Calculation 3.2 ¡ 4.3 ¡ Apply Boundary Acceleration 5.1 ¡ 4.8 ¡ Position and Velocity Calculation 3.2 ¡ 4.8 ¡ 4.1 ¡ Kinematics and Monotonic Gradient Calculation 17 ¡ 6.5 ¡ 58 ¡ 40 ¡ Monotonic Region Calculation 11 ¡ 15 ¡ Apply Material Properties to Regions 92 ¡ 102.8 ¡

Performance ¡Analysis ¡of ¡OpenMP ¡on ¡a ¡GPU ¡Using ¡a ¡CORAL ¡Proxy ¡ApplicaRon, ¡Bercea ¡et ¡al. ¡PMBS ¡'15. ¡

SLIDE 21

21 ¡

Opensource: download and installation instructions at:

ibm.biz/ykt-omp

Currently supports OpenMP 4.0, with offload to GPU
Open source host runtime based on Intel contributed KMPC lib
Open source GPU runtime developed and contributed by IBM Research
Working on upstreaming 4.5 implementation to Clang/LLVM
Contact: acjacob@us.ibm.com

Compiler ¡Availability ¡and ¡Roadmap

This ¡work ¡is ¡parKally ¡supported ¡by ¡the ¡CORAL ¡project ¡LLNS ¡Subcontract ¡No. ¡B604142. ¡

S6240 ¡-­‑ ¡High-­‑Level ¡GPU ¡ Programming ¡Using ¡OpenMP ¡4.5 ¡ and ¡Clang/LLVM

Clang/LLVM ¡

Overview

P8 ¡ P8 ¡

ExploiFng ¡Heterogeneous ¡Node ¡Resources

performance ¡portable ¡manner ¡ ¡ ¡

Programmability ¡Challenge

P8 ¡ P8 ¡

Programming ¡Overview

OpenMP ¡Memory ¡Model

OpenMP ¡Memory ¡Model

OpenMP ¡Memory ¡Model

OpenMP ¡Memory ¡Model

How ¡do ¡we ¡use ¡OpenMP ¡offload? ¡ ¡ ¡

#pragma omp target map(to: cls, len, compression[0:len]) \ map(from: bvc[0:len]) for (int i=0; i<len; i++) { bvc[i] = cls * (compression[i] + 1.0); }

OpenMP ¡Offload ¡Model

OpenMP ¡Offload ¡Model

Loop work-sharing

Exploit ¡Hardware ¡Threads ¡on ¡POWER ¡CPUs

Loop work-sharing on GPUs with a target task

Exploit ¡Streaming ¡MulFprocessors ¡on ¡GPUs

SIMD and other OpenMP forms supported on the GPU

Exploit ¡Streaming ¡MulFprocessors ¡on ¡GPUs

Task Parallelism

Exploit ¡Hardware ¡Threads ¡on ¡POWER ¡CPUs

CPU ¡& ¡GPU ¡Parallelism ¡using ¡Tasks

CPU ¡& ¡GPU ¡Parallelism ¡using ¡Tasks

LLVM ¡Compiler ¡SchemaFc

OpenMP ¡Codegen ¡Internals

Performance ¡Preview

ibm.biz/ykt-omp

Compiler ¡Availability ¡and ¡Roadmap

S6240 ¡-‑ ¡High-‑Level ¡GPU ¡ Programming ¡Using ¡OpenMP ¡4.5 ¡ and ¡Clang/LLVM