Compiler Optimization For (OpenMP) Accelerator Offloading
SLIDE 1

Compiler Optimization For (OpenMP) Accelerator Offloading

EuroLLVM — April 8, 2019 — Brussels, Belgium

Johannes Doerfert and Hal Finkel

Leadership Computing Facility, Argonne National Laboratory — https://www.alcf.anl.gov/

SLIDE 2

Acknowledgment

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation's exascale computing imperative.

SLIDE 3

Compiler Optimization — Original Program / After Optimizations

Original:

int y = 7;
for (i = 0; i < N; i++) {
  f(y, i);
}
g(y);

After:

for (i = 0; i < N; i++) {
  f(7, i);
}
g(7);

SLIDE 4

Compiler Optimization — Original Program / After Optimizations

int y = 7;
#pragma omp parallel for
for (i = 0; i < N; i++) {
  f(y, i);
}
g(y);

SLIDE 5

Compiler Optimization For Parallelism — Original Program / After Optimizations

Original:

int y = 7;
#pragma omp parallel for
for (i = 0; i < N; i++) {
  f(y, i);
}
g(y);

After (identical — nothing was optimized):

int y = 7;
#pragma omp parallel for
for (i = 0; i < N; i++) {
  f(y, i);
}
g(y);

SLIDE 6

Current Compiler Optimization For Parallelism

SLIDE 7

Current Compiler Optimization For Parallelism

None⋆†

⋆ At least for LLVM/Clang up to 8.0
† And not considering smart runtime libraries!

SLIDE 8

Performance Implications

Why is this important?

SLIDES 9-12

Performance Implications (performance figures not captured in this transcript)

SLIDE 13

Optimization Categories

Optimizations for sequential aspects
  • May reuse existing transformations (patches up for review!)
  ⇒ Introduce suitable abstractions to bridge the indirection (DONE!)

Optimizations for parallel aspects
  • New explicit parallelism-aware transformations (see IWOMP'18 [a])
  ⇒ Introduce a unifying abstraction layer (see EuroLLVM'18 [b])

[a] Compiler Optimizations For OpenMP, J. Doerfert, H. Finkel, IWOMP 2018
[b] A Parallel IR in Real Life: Optimizing OpenMP, H. Finkel, J. Doerfert, X. Tian, G. Stelle, EuroLLVM Meeting 2018

SLIDE 19

The Compiler Black Box

>> clang -O3 -fopenmp-targets=...

{
  #pragma omp target teams parallel
  work1();
}

“relatively” good performance :)

SLIDE 22

The Compiler Black Box

>> clang -O3 -fopenmp-targets=...

{
  #pragma omp target teams parallel
  work1();
  #pragma omp target teams
  #pragma omp parallel
  work2();
}

“relatively” good performance :)

SLIDE 24

The Compiler Black Box

>> clang -O3 -fopenmp-targets=...

#pragma omp target teams
{
  #pragma omp parallel
  work1();
  #pragma omp parallel
  work2();
}

probably poor performance :(

SLIDE 26

The Compiler Black Box — Behind the Curtain (of Clang)

{
  #pragma omp target teams
  foo();
  #pragma omp target teams parallel
  work(); // <- Hotspot
  #pragma omp target teams
  bar();
}

N teams, with M threads each, all executing work concurrently.

SLIDE 30

The Compiler Black Box — Behind the Curtain (of Clang)

#pragma omp target teams
{
  foo();
  #pragma omp parallel
  work(); // <- Hotspot
  bar();
}

1 master and N−1 worker teams, worker teams with M threads:
  • Masters execute foo concurrently, workers idle.
  • Masters delegate work for concurrent execution.
  • Masters execute bar concurrently, workers idle.

Problems:
  • a separate master team costs resources
  • synchronization has overhead
  • currently impossible to optimize
slide-40
SLIDE 40

OpenMP Offload — Overview

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code + few RT Calls + Logic Device Opt. Device RT +Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-41
SLIDE 41

OpenMP Offload — Overview

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code + few RT Calls + Logic Device Opt. Device RT +Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-42
SLIDE 42

OpenMP Offload — Overview

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code + few RT Calls + Logic Device Opt. Device RT +Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-43
SLIDE 43

OpenMP Offload — Overview

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code + few RT Calls + Logic Device Opt. Device RT +Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-44
SLIDE 44

OpenMP Offload — Overview

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+ few RT Calls + Logic Device Opt. Device RT +Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-45
SLIDE 45

OpenMP Offload — Overview

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+ few RT Calls + Logic Device Opt. Device RT +Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-46
SLIDE 46

OpenMP Offload — Overview

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+ few RT Calls + Logic Device Opt. Device RT +Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-47
SLIDE 47

OpenMP Offload — Overview

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+ few RT Calls + Logic Device Opt. Device RT +Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-48
SLIDE 48

OpenMP Offload — Overview

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+ few RT Calls + Logic Device Opt. Device RT +Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-49
SLIDE 49

OpenMP Offload — Overview

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+ few RT Calls + Logic Device Opt. Device RT +Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-50
SLIDE 50

OpenMP Offload — Overview & Directions

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+ few RT Calls + Logic Device Opt. Device RT +Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-51
SLIDE 51

OpenMP Offload — Overview & Directions

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+ few RT Calls + Logic Device Opt. Device RT+Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-52
SLIDE 52

OpenMP Offload — Overview & Directions

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+few RT Calls + Logic Device Opt. Device RT+Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-53
SLIDE 53

OpenMP Offload — Overview & Directions

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+few RT Calls + Logic Device Opt. Device RT+Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

Pending patches “fix” the motivating problem and allow for more to come! Reviewers are needed! Interested? Take a look and contact me :)

slide-54
SLIDE 54

OpenMP Offload — Overview & Directions

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+few RT Calls + Logic Device Opt. Device RT+Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

Pending patches “fix” the motivating problem and allow for more to come! Reviewers are needed! Interested? Take a look and contact me :)

slide-55
SLIDE 55

OpenMP Offload — Overview & Directions

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+few RT Calls + Logic Device Opt. Device RT+Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

Pending patches “fix” the motivating problem and allow for more to come! Reviewers are needed! Interested? Take a look and contact me :)

slide-56
SLIDE 56

OpenMP Offload — Overview & Directions

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+few RT Calls + Logic Device Opt. Device RT+Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-57
SLIDE 57

OpenMP Offload — Overview & Directions

  • 2. Optimize Device and Host Code Together

Code Code Gen. Device Code+few RT Calls + Logic Device Opt. Device RT+Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-58
SLIDE 58

OpenMP Offload — Overview & Directions

  • 2. Optimize Device and Host Code Together

Code Code Gen. Device Code+few RT Calls + Logic Device Opt. Device RT+Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-59
SLIDE 59

OpenMP Offload — Overview & Directions

  • 2. Optimize Device and Host Code Together

Code Code Gen. Device Code+few RT Calls + Logic Device Opt. Host AND Device Opti- mization Device RT+Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

SLIDE 60

Pending Patches — Target Region Interface

  • A straightforward #pragma omp target front-end:
    ⋄ simplified implementation: CGOpenMPRuntimeNVPTX.cpp ~5.0k LoC vs. CGOpenMPRuntimeTRegion.cpp ~0.5k LoC
    ⋄ improved reusability (F18, ...)
  • Interface exposes information and implementation choices:
    ⋄ "smartness" is moved into the compiler middle-end
    ⋄ simplifies analyses and transformations in LLVM
  • Device RT interface & implementation are separated:
    ⋄ simplifies generated LLVM-IR
    ⋄ most LLVM & Clang parts become target agnostic

SLIDE 65

1. Offload-Specific Optimizations — "SPMD-zation"

  • use inter-procedural reasoning to place minimal guards/synchronization
  • if legal, switch all boolean UseSPMDMode flags to true
  • currently, no (unknown) global side-effects are allowed outside parallel regions

SLIDE 68

1. Offload-Specific Optimizations — Custom State Machines

  • use optimized state machines when unavoidable
  • reachability & post-dominance restrict the set of potential next parallel regions to work on
  • reuse already communicated/shared values if possible
  • currently, a simple state machine is generated with explicit conditionals for all known parallel regions in the module

SLIDE 72

2. Optimize Device and Host Together — Abstract Call Sites

[Diagram: CallInst / InvokeInst → CallSite → IPO passes; TransitiveCallSite → AbstractCallSite → IPO passes.]

Functional changes required for Inter-procedural Constant Propagation:

SLIDE 75

Abstract Call Sites — Performance Results

(Performance charts not captured in this transcript.)

SLIDE 79

Conclusion
SLIDE 85

OpenMP-Aware Optimizations

(see IWOMP'18)

I: Attribute Propagation — Bidirectional Information Transfer:
   read/write-only, restrict/noalias, ...
II: Variable Privatization — Limit Variable Lifetimes:
   shared(var) ⟶ firstprivate(var) ⟶ private(var)
III: Parallel Region Expansion — Maximize Parallel Contexts:
   ⟹ reduce start/stop overheads and expose barriers
IV: Barrier Elimination — Eliminate Redundant Barriers
V: Communication Optimization — Move Computations Around:
   seq. compute & result comm. vs. operand comm. & par. compute

SLIDE 92

OpenMP-Aware Optimizations

(see IWOMP'18)

I: Attribute Propagation — In LLVM: Attribute Deduction (IPO!)
   read/write-only, restrict/noalias, ...
II: Variable Privatization — In LLVM: Argument Promotion (IPO!)
   shared(var) ⟶ firstprivate(var) ⟶ private(var)
III: Parallel Region Expansion — Maximize Parallel Contexts:
   ⟹ reduce start/stop overheads and expose barriers
IV: Barrier Elimination — Eliminate Redundant Barriers
V: Communication Optimization — Move Computations Around:
   seq. compute & result comm. vs. operand comm. & par. compute
SLIDE 93

Early Outlining

OpenMP Input:

#pragma omp parallel for
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

// Parallel region replaced by a runtime call.
omp_rt_parallel_for(0, N, &body_fn,
                    &N, &In, &Out);

// Parallel region outlined in the front-end (clang)!
static void body_fn(int tid, int* N, float** In, float** Out) {
  int lb = omp_get_lb(tid), ub = omp_get_ub(tid);
  for (int i = lb; i < ub; i++)
    (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
}
SLIDE 97

An Abstract Parallel IR

OpenMP Input:

#pragma omp parallel for
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

// Parallel region replaced by an annotated loop.
for /* parallel */ (int i = 0; i < N; i++)
  body_fn(i, &N, &In, &Out);

// Parallel region outlined in the front-end (clang)!
static void body_fn(int i, int* N, float** In, float** Out) {
  (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
}
slide-99
SLIDE 99

Early Outlined + Transitive Calls — OpenMP Input:

#pragma omp parallel for
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

// Parallel region replaced by a runtime call.
omp_rt_parallel_for(0, N, &body_fn, &N, &In, &Out);
// Model transitive call: body_fn(?, &N, &In, &Out);

// Parallel region outlined in the front-end (clang)!
static void body_fn(int tid, int *N, float **In, float **Out) {
  int lb = omp_get_lb(tid), ub = omp_get_ub(tid);
  for (int i = lb; i < ub; i++)
    (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
}

+ valid and executable IR
+ no unintended interactions
− integration cost per IPO
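The payoff of modeling the transitive call can be shown with a tiny broker example. This is an illustrative sketch, not LLVM's `AbstractCallSite` API: `rt`, `twice`, and `call_through_broker` are made-up names, and the point is only that a known forwarding convention turns a constant broker argument into a constant callee parameter.

```c
/* Broker call: rt(fn, arg) always invokes fn(arg). If IPO knows this
   convention (the "abstract call site" idea), a constant at the broker
   call site becomes a constant parameter of the callee, exactly as it
   would for a direct call. */
static int rt(int (*fn)(int), int arg) { return fn(arg); }

static int twice(int x) { return 2 * x; }

int call_through_broker(void) {
  /* modeled transitive call: twice(21), foldable by constant
     propagation once the forwarding is visible */
  return rt(twice, 21);
}
```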

slide-100
SLIDE 100

IPO in LLVM

CallInst, InvokeInst ⟶ CallSite ⟶ Passes (IPOs)
TransitiveCallSite ⟶ AbstractCallSite ⟶ Passes (IPOs)

slide-102
SLIDE 102

Transitive Call Sites in LLVM

CallInst, InvokeInst ⟶ CallSite ⟶ Passes (IPOs)
TransitiveCallSite ⟶ AbstractCallSite ⟶ Passes (IPOs)

Functional changes required for Inter-procedural Constant Propagation:

slide-103
SLIDE 103

Evaluated Versions

Version | Description                                         | Opt.
base    | plain “-O3”, thus no parallel optimizations         | —
attr    | attribute propagation through attr. deduction (IPO) | I
argp    | variable privatization through arg. promotion (IPO) | II
—       | constant propagation (IPO)                          | n/a

slide-104
SLIDE 104
slide-105
SLIDE 105

Some Context

Examples — Examples are given in a C-like language with OpenMP annotations.
Transformations — Our transformations work on the LLVM intermediate representation (LLVM-IR); they take and produce LLVM-IR.
OpenMP Runtime Library — We encounter OpenMP annotations as OpenMP runtime library calls, and the situation is most often more complicated than presented here.

slide-107
SLIDE 107

Evaluation Environment

  • Run with 1 thread²
  • Median and variance of 51 runs are shown
  • Rodinia 3.1 benchmarks and LULESH v1.0 (OpenMP)
  • Only time in parallel constructs was measured

² Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz

slide-108
SLIDE 108

Performance Results

slide-113
SLIDE 113

Action Item I

1) Run your OpenMP code sequentially (export OMP_NUM_THREADS=1), with and without OpenMP.
2) Email me the results! (jdoerfert@anl.gov)



slide-119
SLIDE 119

Action Item II

1) Always use default(none) and firstprivate(...) for scalars/pointers if you do not have explicit synchronization.
2) Revisit Action Item I.

NO need to “share” the variable A!

slide-120
SLIDE 120

Constant Propagation Example

double gamma[4][8];
gamma[0][0] = 1;
// ... and so on till ...
gamma[3][7] = -1;

Kokkos::parallel_for(
    "CalcFBHourglassForceForElems A", numElem,
    KOKKOS_LAMBDA(const int &i2) {
      // Use gamma[0][0] ... gamma[3][7]
    });
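The pattern can be reduced to a few lines of C. This is a sketch of the situation, not the LULESH code: `gamma_tab` and `hourglass_term` are illustrative names, and the table is tiny. The point is that once interprocedural analysis proves the table is filled with constants and never changes before the (outlined) body runs, the loads inside the body fold to literals.

```c
/* A table filled only with compile-time constants before the parallel
   call; with constant propagation across the (abstract) call site, the
   load below becomes a literal. */
static const double gamma_tab[2][2] = { {  1.0, -1.0 },
                                        { -1.0,  1.0 } };

double hourglass_term(int i2) {
  /* after propagation this is just a constant times 3.0 */
  return gamma_tab[i2 & 1][0] * 3.0;
}
```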

slide-121
SLIDE 121

Constant Propagation Performance

slide-122
SLIDE 122

Optimization I: Attribute Propagation OpenMP Input:

#pragma omp parallel for firstprivate(...)
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

slide-123
SLIDE 123

Optimization I: Attribute Propagation OpenMP Input:

#pragma omp parallel for firstprivate(...)
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

// Parallel region replaced by a runtime call.
omp_rt_parallel_for(0, N, &body_fn, N, In, Out);

// Parallel region outlined in the front-end (clang)!
void body_fn(int i, int N, float *In, float *Out) {
  Out[i] = In[i] + In[i + N];
}

slide-124
SLIDE 124

Optimization I: Attribute Propagation OpenMP Input:

#pragma omp parallel for firstprivate(...)
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

// Parallel region replaced by a runtime call.
omp_rt_parallel_for(0, N, &body_fn, N, In, Out);

// Parallel region outlined in the front-end (clang)!
void body_fn(int i, int N,
             float * /* read-only & no-escape */ In,
             float * /* write-only & no-escape */ Out) {
  Out[i] = In[i] + In[i + N];
}

slide-125
SLIDE 125

Optimization I: Attribute Propagation OpenMP Input:

#pragma omp parallel for firstprivate(...)
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

// Parallel region replaced by a runtime call.
omp_rt_parallel_for(0, N, &body_fn, N,
                    /* ro & no-esc */ In, /* wo & no-esc */ Out);

// Parallel region outlined in the front-end (clang)!
void body_fn(int i, int N,
             float * /* read-only & no-escape */ In,
             float * /* write-only & no-escape */ Out) {
  Out[i] = In[i] + In[i + N];
}
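At the C source level the deduced attributes can be approximated with qualifiers, which is a useful mental model (the actual transformation happens on LLVM-IR attributes, not C types). In this sketch, `const` stands in for read-only and `restrict` for no-alias/no-escape, so the store to Out provably cannot clobber In.

```c
/* Source-level stand-in for the deduced attributes: with const and
   restrict, the loads from In need not be reissued after the store to
   Out, since the two cannot alias. */
void body_fn_attr(int N, const float *restrict In, float *restrict Out) {
  for (int i = 0; i < N; i++)
    Out[i] = In[i] + In[i + N];
}
```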

slide-126
SLIDE 126

Optimization I: Attribute Propagation (cont)

int foo() {
  int a = 0;
  #pragma omp parallel
  {
    #pragma omp critical
    { a += 1; }
    bar();
    #pragma omp critical
    { a *= 2; }
  }
  return a;
}

slide-127
SLIDE 127

Optimization I: Attribute Propagation (cont)

int foo() {
  int a = 0;
  #pragma omp parallel
  {
    #pragma omp critical
    { a += 1; }
    bar();
    #pragma omp critical
    { a *= 2; }
  }
  return a;
}

⟶

int foo() {
  int a = 0;
  int *restrict p = &a;
  omp_rt_parallel_for(pwork, p);
  return a;
}

void pwork(int tid, int *p) {
  if (omp_critical(tid)) {
    *p = *p + 1;
    omp_critical_end(tid);
  }
  bar();
  if (omp_critical(tid)) {
    *p = *p * 2;
    omp_critical_end(tid);
  }
}

slide-128
SLIDE 128

Optimization I: Attribute Propagation (cont)

// After attribute propagation: p is restrict in pwork, so the two
// guarded updates can be combined into one.
void pwork(int tid, int *restrict p) {
  if (omp_critical(tid)) {
    omp_critical_end(tid);
  }
  bar();
  if (omp_critical(tid)) {
    *p = 2 * (*p + 1);
    omp_critical_end(tid);
  }
}
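The combined update can be checked single-threaded. This is a simplified sketch under the slide's assumption that `bar()` cannot reach `*p`; `pwork_combined` is an illustrative name and the critical-section plumbing is dropped, leaving only the arithmetic rewrite: "+1 then *2" collapses to 2 * (*p + 1).

```c
/* Stand-in for an external call that provably cannot access *p. */
static void bar(void) { }

/* Single-threaded sketch of the combined update enabled by restrict. */
int pwork_combined(int *restrict p) {
  bar();
  *p = 2 * (*p + 1); /* was: *p += 1; ... *p *= 2; */
  return *p;
}
```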

slide-129
SLIDE 129

Optimization I: Attribute Propagation (cont)

// Blocker: without further information, the call to bar() may "use" p.
void pwork(int tid, int *restrict p) {
  if (omp_critical(tid)) {
    *p = *p + 1;
    omp_critical_end(tid);
  }
  bar();  // May "use" p.
  if (omp_critical(tid)) {
    *p = *p * 2;
    omp_critical_end(tid);
  }
}

slide-130
SLIDE 130

Optimization II: Variable Privatization OpenMP Input:

#pragma omp parallel for shared(...)
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

slide-131
SLIDE 131

Optimization II: Variable Privatization OpenMP Input:

#pragma omp parallel for shared(...)
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

// Parallel region replaced by a runtime call.
omp_rt_parallel_for(0, N, &body_fn, &N, &In, &Out);

// Parallel region outlined in the front-end (clang)!
void body_fn(int i, int *N, float **In, float **Out) {
  (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
}

slide-132
SLIDE 132

Optimization II: Variable Privatization OpenMP Input:

#pragma omp parallel for shared(...)
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

// Parallel region replaced by a runtime call.
omp_rt_parallel_for(0, N, &body_fn, &N, &In, &Out);

// Parallel region outlined in the front-end (clang)!
void body_fn(int i, int * /* ro & ne */ N,
             float ** /* ro & ne */ In,
             float ** /* ro & ne */ Out) {
  (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
}

slide-133
SLIDE 133

Optimization II: Variable Privatization OpenMP Input:

#pragma omp parallel for firstprivate(...)
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

// Parallel region replaced by a runtime call.
omp_rt_parallel_for(0, N, &body_fn, N, In, Out);

// Parallel region outlined in the front-end (clang)!
void body_fn(int i, int N, float *In, float *Out) {
  Out[i] = In[i] + In[i + N];
}
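The lowering difference between shared and firstprivate can be written out in plain C. This is a sketch with illustrative names: `sum_shared` takes the addresses the shared(...) lowering would pass, `sum_firstprivate` takes the values the firstprivate(...) lowering would pass; the by-value form removes one level of indirection per access.

```c
/* shared(...) lowering: &var of type T* is passed, every access goes
   through memory. */
float sum_shared(const int *N, float **In, int i) {
  return (*In)[i] + (*In)[i + *N];
}

/* firstprivate(...) lowering: var of type T is passed, N and In can
   live in registers and are trivially thread-local. */
float sum_firstprivate(int N, const float *In, int i) {
  return In[i] + In[i + N];
}
```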

slide-134
SLIDE 134

Optimization III: Parallel Region Expansion

slide-135
SLIDE 135

Optimization III: Parallel Region Expansion

void copy(float *dst, float *src, int N) {
  #pragma omp parallel for
  for (int i = 0; i < N; i++) {
    dst[i] = src[i];
  } // implicit barrier!
}

void compute_step_factor(int nelr, float *vars, float *areas, float *sf) {
  #pragma omp parallel for
  for (int blk = 0; blk < nelr / block_length; ++blk) {
    ...
  } // implicit barrier!
}

slide-136
SLIDE 136

Optimization III: Parallel Region Expansion

#pragma omp parallel
for (int i = 0; i < iterations; i++) {
  copy(old_vars, vars, nelr * NVAR);
  compute_step_factor(nelr, vars, areas, sf);
  for (int j = 0; j < RK; j++) {
    compute_flux(nelr, ese, normals, vars, fluxes, ff_vars,
                 ff_m_x, ff_m_y, ff_m_z, ff_denergy);
    time_step(j, nelr, old_vars, vars, sf, fluxes);
    ...

slide-137
SLIDE 137

Optimization III: Parallel Region Expansion

#pragma omp parallel
for (int i = 0; i < iterations; i++) {
  #pragma omp parallel for // copy
  for (...) { /* write old_vars, read vars */ }
  // implicit barrier!
  compute_step_factor(nelr, vars, areas, sf);
  for (int j = 0; j < RK; j++) {
    compute_flux(nelr, ese, normals, vars, fluxes, ff_vars,
                 ff_m_x, ff_m_y, ff_m_z, ff_denergy);
    time_step(j, nelr, old_vars, vars, sf, fluxes);
    ...

slide-138
SLIDE 138

Optimization III: Parallel Region Expansion

#pragma omp parallel
for (int i = 0; i < iterations; i++) {
  #pragma omp parallel for // copy
  for (...) { /* write old_vars, read vars */ }
  // implicit barrier!
  #pragma omp parallel for // compute_step_factor
  for (...) { /* write sf, read vars & area */ }
  // implicit barrier!
  for (int j = 0; j < RK; j++) {
    #pragma omp parallel for // compute_flux
    for (...) { /* write fluxes, read vars & ... */ }
    // implicit barrier!
    ...

slide-139
SLIDE 139

Optimization III: Parallel Region Expansion

#pragma omp parallel
for (int i = 0; i < iterations; i++) {
  #pragma omp for // copy
  for (...) { /* write old_vars, read vars */ }
  // explicit barrier in LLVM-IR!
  #pragma omp for // compute_step_factor
  for (...) { /* write sf, read vars & area */ }
  // explicit barrier in LLVM-IR!
  for (int j = 0; j < RK; j++) {
    #pragma omp for // compute_flux
    for (...) { /* write fluxes, read vars & ... */ }
    // explicit barrier in LLVM-IR!
    ...
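A minimal, runnable form of the expanded region looks as follows. This is a sketch with simplified loop bodies, not the cfd code: one enclosing parallel region, two worksharing loops, and a barrier after the first loop that is genuinely needed here because the second loop reads what the first wrote. Without -fopenmp the pragmas are ignored and the function runs sequentially with the same result.

```c
#define N 8

/* One parallel region with two "omp for" loops: threads are started
   once, and the implicit barrier after the first loop keeps the
   second loop correct (it consumes the copy). */
void fused_steps(float *dst, const float *src, float *sf) {
  #pragma omp parallel
  {
    #pragma omp for
    for (int i = 0; i < N; i++)
      dst[i] = src[i];              /* copy */
    /* implicit barrier: required, next loop reads dst */
    #pragma omp for
    for (int i = 0; i < N; i++)
      sf[i] = 0.5f * dst[i];        /* consumes the copy */
  }
}
```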

slide-140
SLIDE 140

Optimization IV: Barrier Elimination

#pragma omp parallel
for (int i = 0; i < iterations; i++) {
  #pragma omp for // copy
  for (...) { /* write old_vars, read vars */ }
  // explicit barrier in LLVM-IR!
  #pragma omp for // compute_step_factor
  for (...) { /* write sf, read vars & area */ }
  // explicit barrier in LLVM-IR!
  for (int j = 0; j < RK; j++) {
    #pragma omp for // compute_flux
    for (...) { /* write fluxes, read vars & ... */ }
    // explicit barrier in LLVM-IR!
    ...

slide-142
SLIDE 142

Optimization IV: Barrier Elimination

#pragma omp parallel
for (int i = 0; i < iterations; i++) {
  #pragma omp for nowait // copy
  for (...) { /* write old_vars, read vars */ }
  #pragma omp for nowait // compute_step_factor
  for (...) { /* write sf, read vars & area */ }
  for (int j = 0; j < RK; j++) {
    #pragma omp for // compute_flux
    for (...) { /* write fluxes, read vars & ... */ }
    // explicit barrier in LLVM-IR!
    ...
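The nowait case can also be demonstrated concretely. This sketch uses the cfd variable names but simplified stand-in bodies: the two loops touch disjoint arrays and neither reads the other's output, so dropping the implicit barriers is legal; the function is sequentially equivalent when compiled without -fopenmp.

```c
/* Independent loops: old_vars and sf are disjoint, and neither loop
   reads the other's output, so both barriers can become "nowait". */
void independent_steps(float *old_vars, const float *vars,
                       float *sf, const float *areas, int n) {
  #pragma omp parallel
  {
    #pragma omp for nowait           /* copy */
    for (int i = 0; i < n; i++)
      old_vars[i] = vars[i];
    #pragma omp for nowait           /* compute_step_factor stand-in */
    for (int i = 0; i < n; i++)
      sf[i] = 2.0f * areas[i];
  }
}
```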

slide-143
SLIDE 143

Optimization V: Communication Optimization

slide-144
SLIDE 144

Optimization V: Communication Optimization

void f(int *X, int *restrict Y) {
  int L = *X;        // immovable
  int N = 512;       // movable
  int A = N + L;     // movable
  #pragma omp parallel for \
      firstprivate(X, Y, N, L, A)
  for (int i = 0; i < N; i++) {
    int K = *Y;      // movable
    int M = N * K;   // movable
    X[i] = M + A*L*i; // immovable
  }
}

slide-147
SLIDE 147

Optimization V: Communication Optimization

void f(int *X, int *restrict Y) {
  int L = *X;        // immovable
  int N = 512;       // movable
  int A = N + L;     // movable
  #pragma omp parallel for \
      firstprivate(X, Y, N, L, A)
  for (int i = 0; i < N; i++) {
    int K = *Y;      // movable
    int M = N * K;   // movable
    X[i] = M + A*L*i; // immovable
  }
}

⟶

void g(int *X, int *restrict Y) {
  int L = *X;        // immovable
  int K = *Y;        // hoisted: load cost > communication cost
  int M = 512 * K;   // hoisted: multiply + load cost > communication cost
  #pragma omp parallel \
      firstprivate(X, M, L)
  {
    int A = 512 + L; // kept inside: add cost < communication cost
    #pragma omp for \
        firstprivate(X, M, A, L)
    for (int i = 0; i < 512; i++) {
      X[i] = M + A*L*i; // immovable
    }
  }
}
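A runnable version of the hoisted form, in the spirit of g() above, can be sketched as follows. `g_hoisted` is an illustrative name and the cost annotations are informal: the load of *Y and the multiply are loop-invariant, so they are computed once before the region and communicated as firstprivate scalars, while the cheap add stays inside. Without -fopenmp the pragma is ignored and the result is identical.

```c
/* Communication optimization sketch: expensive invariant work is
   hoisted out of the parallel body and passed in as scalars. */
void g_hoisted(int *X, const int *restrict Y) {
  int L = X[0];                 /* immovable: X is written below */
  int M = 512 * Y[0];           /* hoisted: load + multiply */
  #pragma omp parallel for firstprivate(M, L)
  for (int i = 0; i < 512; i++) {
    int A = 512 + L;            /* cheap: recomputing beats communicating */
    X[i] = M + A * L * i;
  }
}
```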

slide-148
SLIDE 148

Early Outlining: Sequential Optimization Problems

NO information transfer: outlined function ⟺ runtime library call site

Value Transfer

Declaration | OpenMP Clause     | Communicated Value
T var;      | default (shared)  | &var of type T*
T var;      | shared(var)       | &var of type T*
T var;      | lastprivate(var)  | &var of type T*
T var;      | firstprivate(var) | var of type T
T var;      | private(var)      | none

slide-154
SLIDE 154
slide-155
SLIDE 155

Target Region — The Interface

void kernel(...) {
init:
  char ThreadKind = __kmpc_target_region_kernel_init(...);
  if (ThreadKind == -1) {         // actual worker thread
    if (!UsedLibraryStateMachine)
      user_code_state_machine();
    goto exit;
  } else if (ThreadKind == 0) {   // surplus worker thread
    goto exit;
  } else {                        // team master thread
    goto user_code;
  }
user_code:
  // User-defined kernel code; parallel regions are replaced
  // by __kmpc_target_region_kernel_parallel(...) calls.
  // Fallthrough to de-initialization.
deinit:
  __kmpc_target_region_kernel_deinit(...);
exit:
  /* exit the kernel */
}

slide-156
SLIDE 156

Target Region — The Interface

// Initialization
int8_t __kmpc_target_region_kernel_init(ident_t *Ident,
                                        bool UseSPMDMode,
                                        bool RequiresOMPRuntime,
                                        bool UseStateMachine,
                                        bool RequiresDataSharing);

// De-initialization
void __kmpc_target_region_kernel_deinit(ident_t *Ident,
                                        bool UseSPMDMode,
                                        bool RequiredOMPRuntime);

// Parallel execution
typedef void (*ParallelWorkFnTy)(void * /* SharedValues */,
                                 void * /* PrivateValues */);
CALLBACK(ParallelWorkFnTy, SharedValues, PrivateValues)
void __kmpc_target_region_kernel_parallel(ident_t *Ident,
                                          bool UseSPMDMode,
                                          bool RequiredOMPRuntime,
                                          ParallelWorkFnTy ParallelWorkFn,
                                          void *SharedValues,
                                          uint16_t SharedValuesBytes,
                                          void *PrivateValues,
                                          uint16_t PrivateValuesBytes,
                                          bool SharedMemPointers);

slide-157
SLIDE 157

Target Region — The Implementation

  • (almost) the same as with the current NVPTX backend, except for shared/firstprivate variables
  • implemented in CUDA as part of the library, not generated into the user code/module/TU by Clang
  • the boolean flags are commonly constant; after inlining, all target region abstraction is gone

slide-158
SLIDE 158

Action Item III

1) Review your OpenMP target code.
2) Email me (jdoerfert@anl.gov) if you use the “bad” pattern!


slide-161
SLIDE 161

Current Work — Reviews, Evaluation, Features, Hardening

  • started the review process
  • more test-cases needed to determine benefit
  • more developers needed to add missing features
  • more users/developers needed to improve test coverage
slide-165
SLIDE 165

Current Work — Reviews, Evaluation, Features, Hardening

  • started the review process
  • more test-cases needed to determine benefit
  • more developers needed to add missing features
  • more users/developers needed to improve test coverage

Interested? Please let me know!

slide-166
SLIDE 166

Future Work — Optimizations, Front-ends, Targets

  • improve and extend LLVM’s OpenMP optimizations: connection to abstract call sites, memory placement, …
  • use target regions in other “front-ends”: F18, Polly, Rust?, …
  • implement the interface for other targets: GPUs, FPGAs?, …
slide-169
SLIDE 169

Future Work — Optimizations, Front-ends, Targets

  • improve and extend LLVM’s OpenMP optimizations: connection to abstract call sites, memory placement, …
  • use target regions in other “front-ends”: F18, Polly, Rust?, …
  • implement the interface for other targets: GPUs, FPGAs?, …

Interested? Please let me know!