
SLIDE 1

Optimizing Indirections, or using abstractions without remorse

LLVMDev’18 — October 18, 2018 — San Jose, California, USA

Johannes Doerfert, Hal Finkel

Leadership Computing Facility, Argonne National Laboratory (https://www.alcf.anl.gov/)

SLIDE 2

Acknowledgment

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative.

SLIDE 3

Context & Motivation

SLIDE 4

Context — Optimizations For Parallel Programs

Optimizations for sequential aspects

  • Can reuse (improved) existing transformations

⇒ Introduce suitable abstractions and transformations to bridge the indirection

Optimizations for parallel aspects

  • New explicit parallelism-aware transformations (see IWOMP’18 [a])

⇒ Introduce a unifying abstraction layer (see EuroLLVM’18 talk [b])

[a] Compiler Optimizations For OpenMP, J. Doerfert, H. Finkel, IWOMP 2018
[b] A Parallel IR in Real Life: Optimizing OpenMP, H. Finkel, J. Doerfert, X. Tian, G. Stelle, Euro-LLVM Meeting 2018


Interested? Contact me and come to our BoF!

SLIDE 11

Context — Compiler Optimization

Original program:

int y = 7;
for (i = 0; i < N; i++) {
  f(y, i);
}
g(y);

After optimizations:

for (i = 0; i < N; i++) {
  f(7, i);
}
g(7);

SLIDE 13

Motivation — Compiler Optimization For Parallelism

Original program:

int y = 7;
#pragma omp parallel for
for (i = 0; i < N; i++) {
  f(y, i);
}
g(y);

After optimizations (unchanged: the constant is not propagated into the parallel region):

int y = 7;
#pragma omp parallel for
for (i = 0; i < N; i++) {
  f(y, i);
}
g(y);

SLIDE 14

Sequential Performance of Parallel Programs

Why is this important?

SLIDE 21

Early Outlining

OpenMP input:

#pragma omp parallel for
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

// Parallel region replaced by a runtime call.
omp_rt_parallel_for(0, N, &body_fn, &N, &In, &Out);

// Parallel region outlined in the front-end (clang)!
static void body_fn(int tid, int *N, float **In, float **Out) {
  int lb = omp_get_lb(tid), ub = omp_get_ub(tid);
  for (int i = lb; i < ub; i++)
    (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
}
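To make the outlining concrete, here is a minimal, sequential, runnable sketch of the shape of the code the front-end emits. The runtime names (`omp_rt_parallel_for`, `omp_get_lb`, `omp_get_ub`) follow the slide's simplified runtime, not the real libomp API, and the single-"thread" broker is an assumption for illustration:

```c
#include <stddef.h>

/* Simplified stand-in for the runtime's fork/join entry point: it
   invokes the outlined body once per "thread"; here a single thread
   owns the whole iteration space. */
typedef void (*body_fn_t)(int tid, int *N, float **In, float **Out);

static int g_lb, g_ub; /* per-"thread" bounds, set by the broker */

static int omp_get_lb(int tid) { (void)tid; return g_lb; }
static int omp_get_ub(int tid) { (void)tid; return g_ub; }

static void omp_rt_parallel_for(int lb, int ub, body_fn_t body,
                                int *N, float **In, float **Out) {
  g_lb = lb; g_ub = ub;     /* one thread owns all iterations */
  body(/*tid=*/0, N, In, Out);
}

/* The outlined parallel-region body, as the front-end would emit it. */
static void body_fn(int tid, int *N, float **In, float **Out) {
  int lb = omp_get_lb(tid), ub = omp_get_ub(tid);
  for (int i = lb; i < ub; i++)
    (*Out)[i] = (*In)[i] + (*In)[i + *N];
}

/* What the user wrote: Out[i] = In[i] + In[i+N] for i in [0, N). */
static void run(int N, float *In, float *Out) {
  omp_rt_parallel_for(0, N, &body_fn, &N, &In, &Out);
}
```

Note how every variable reaches the body through an extra level of pointer indirection, and the body is only reachable through a function pointer: exactly the indirections the talk is about.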


SLIDE 23

An Abstract Parallel IR

OpenMP input:

#pragma omp parallel for
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

// Parallel region replaced by an annotated loop.
parfor (int i = 0; i < N; i++)
  body_fn(i, &N, &In, &Out);

// Parallel region outlined in the front-end (clang)!
static void body_fn(int i, int *N, float **In, float **Out) {
  (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
}


SLIDE 25

Early Outlining + Transitive Calls

OpenMP input:

#pragma omp parallel for
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

// Parallel region replaced by a runtime call.
omp_rt_parallel_for(0, N, &body_fn, &N, &In, &Out);
// Model transitive call: body_fn(?, &N, &In, &Out);

// Parallel region outlined in the front-end (clang)!
static void body_fn(int tid, int *N, float **In, float **Out) {
  int lb = omp_get_lb(tid), ub = omp_get_ub(tid);
  for (int i = lb; i < ub; i++)
    (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
}


+ valid and executable IR
+ no unintended interactions
+ >1k function-pointer arguments in LLVM-TS + SPEC
− integration cost per IPO
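The pattern the transitive-call model captures can be illustrated outside LLVM: a broker function receives a function pointer plus payload arguments and forwards them. If the compiler knows the broker's forwarding convention, it can reason about the call `callback(?, payload)` even though the IR only shows a call to the broker. A minimal sketch; the `broker`/`times_two_plus_tid` names are illustrative, not from the talk:

```c
/* A broker that, like omp_rt_parallel_for, does not use its payload
   itself but forwards it to the callback. The first callback argument
   (tid) is unknown at the broker call site; the rest are known. */
typedef int (*callback_t)(int tid, int *payload);

static int broker(callback_t cb, int *payload) {
  /* Transitive call site: cb(?, payload). An IPO that understands this
     forwarding convention can propagate facts about `payload` into cb. */
  return cb(/*tid=*/0, payload);
}

static int times_two_plus_tid(int tid, int *payload) {
  return tid + 2 * *payload;
}

static int use(void) {
  int value = 21;
  /* From this site, once the broker's convention is modeled, the
     transitive call times_two_plus_tid(?, &value) becomes visible. */
  return broker(&times_two_plus_tid, &value);
}
```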

SLIDE 27

Call Abstraction in LLVM

(Diagram: CallInst / InvokeInst → CallSite → Passes (IPOs))

SLIDE 28

Call Abstraction in LLVM + Transitive Call Sites

(Diagram: CallInst / InvokeInst / TransitiveCallSite → AbstractCallSite → Passes (IPOs))


Functional Changes for Inter-Procedural Constant Propagation:

SLIDE 31

Inter-Procedural Optimization (IPO) in LLVM

SLIDE 32

IPO — Attribute Inference

-O3 -disable-inlining

static int* internal_ret1_rrw(int *r0, int *r1, int *w0);
static int* internal_ret0_nw(int *n0, int *w0);
static int* internal_ret1_rw(int *r0, int *w0);
int* external_source_ret2_nrw(int *n0, int *r0, int *w0);
int* external_sink_ret2_nrw(int *n0, int *r0, int *w0);
int* external_ret2_nrw(int *n0, int *r0, int *w0);

SLIDE 33

IPO — Attribute Inference

-O3 -disable-inlining

static int* internal_ret1_rrw(int *r0, int *r1, int *w0) {
  if (!*r0)
    return r1;
  internal_ret1_rw(r0, w0);
  *w0 = *r0 + *r1;
  internal_ret1_rw(r1, w0);
  internal_ret0_nw(r0, w0);
  internal_ret0_nw(w0, w0);
  external_ret2_nrw(r0, r1, w0);
  external_ret2_nrw(r1, r0, w0);
  external_sink_ret2_nrw(r0, r1, w0);
  external_sink_ret2_nrw(r1, r0, w0);
  return internal_ret0_nw(r1, w0);
}


Interested? See our RFC: “Properly” Derive Function/Argument/Parameter Attributes


SLIDE 38

IPO — Constant Propagation

-O3 -disable-inlining

static int foo(int a, int b) {
  return a + b; // 5?
}
int bar() { return foo(2, 3); }

struct Pair { int a, b; };
static int foo(struct Pair p) {
  return p.a + p.b; // 5?
}
int bar() {
  struct Pair p = {2, 3};
  return foo(p);
}
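Both variants on the slide compute the same constant at run time; the question is only whether the compiler proves it statically. A runnable version of the two forms, with `foo` renamed per variant so both fit in one file:

```c
/* Scalar arguments: inter-procedural constant propagation readily
   folds foo_scalar(2, 3) into the constant 5. */
static int foo_scalar(int a, int b) { return a + b; }
static int bar_scalar(void) { return foo_scalar(2, 3); }

/* The same values passed as a by-value struct: semantically identical,
   but the aggregate argument makes the propagation harder for IPO. */
struct Pair { int a, b; };
static int foo_pair(struct Pair p) { return p.a + p.b; }
static int bar_pair(void) {
  struct Pair p = {2, 3};
  return foo_pair(p);
}
```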

SLIDE 39

IPO — Constant Propagation

-O3 -disable-inlining

static int foo(int a, int b) { return 5; }
int bar() { return foo(2, 3); }

struct Pair { int a, b; };
static int foo(struct Pair p) { return 5; }
int bar() {
  struct Pair p = {2, 3};
  return foo(p);
}

SLIDE 40

IPO — Constant Propagation

-O3 -disable-inlining

struct Pair { int a, b; };
static int foo(struct Pair *p) {
  return p->a + p->b; // 5?
}
int bar() {
  struct Pair p = {2, 3};
  return foo(&p);
}

struct Tuple { int a, b, c, d; };
static int foo(struct Tuple t) {
  return t.a + t.b + t.c + t.d; // 5?
}
int bar() {
  struct Tuple t = {2, 3, 0, 0};
  return foo(t);
}

SLIDE 41

IPO — Constant Propagation

-O3 -disable-inlining

struct Pair { int a, b; };
static int foo(struct Pair *p) { return p->a + p->b; }
int bar() {
  struct Pair p = {2, 3};
  return foo(&p);
}

struct Tuple { int a, b, c, d; };
static int foo(struct Tuple t) { return t.a + t.b + t.c + t.d; }
int bar() {
  struct Tuple t = {2, 3, 0, 0};
  return foo(t);
}


Why? The pipeline is less tuned for this path, and passes are conservative for IPO.

SLIDE 43

IPO — Object Arguments

-O3 -disable-inlining

struct Tuple { int a, b, c, d, e, f, g, *h; };
static int f(struct Tuple *t) {
  return t->a + t->c + t->e + t->g;
}
int bar(struct Tuple *t) {
  t->a = 3; t->e = 7; /* ... */
  f(t); // ... t->h does escape in f!
}

SLIDE 44

IPO — Object Arguments — 1. Fan Out Early

-O3 -disable-inlining

struct Tuple { int a, b, c, d, e, f, g, *h; };
static int f(struct Tuple *t) {
  return t->a + t->c + t->e + t->g;
}
int bar(struct Tuple *t) {
  t->a = 3; t->e = 7; /* ... */
  f(t); // ... t->h does escape in f!
}

// After fanning out the used fields:
static int f(int a, int c, int e, int g) {
  return a + c + e + g;
}
int bar(struct Tuple *t) {
  t->a = 3; t->e = 7; /* ... */
  f(3, t->c, 7, t->g); // ... t->h does not escape in f!
}

SLIDE 45

IPO — Object Arguments — 2. Optimize

-O3 -disable-inlining

struct Tuple { int a, b, c, d, e, f, g, *h; };
static int f(struct Tuple *t) {
  return t->a + t->c + t->e + t->g;
}
int bar(struct Tuple *t) {
  t->a = 3; t->e = 7; /* ... */
  f(t); // ... t->h does escape in f!
}

// After optimizing the fanned-out version:
static int f(int c, int g) {
  return 3 + c + 7 + g;
}
int bar(struct Tuple *t) {
  t->a = 3; t->e = 7; /* ... */
  f(t->c, t->g); // ... t->h does not escape in f!
}

SLIDE 46

IPO — Object Arguments — 3. Condense Late

-O3 -disable-inlining

struct Tuple { int a, b, c, d, e, f, g, *h; };
static int f(struct Tuple *t) {
  return t->a + t->c + t->e + t->g;
}
int bar(struct Tuple *t) {
  t->a = 3; t->e = 7; /* ... */
  f(t); // ... t->h does escape in f!
}

// After condensing back to the original signature:
static int f(struct Tuple *t) {
  return 3 + t->c + 7 + t->g;
}
int bar(struct Tuple *t) {
  t->a = 3; t->e = 7; /* ... */
  f(t); // ... t->h does escape in f!
}


Aggressively unpack object arguments early and condense arguments late, as an alternative/substitute for inlining.
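The fan-out/condense steps can be mimicked by hand in plain C. In this sketch `f_original` is the slide's callee, `f_fanned` is what steps 1 and 2 produce (used fields as scalar parameters, caller-stored constants folded), and both are kept side by side to show they agree; the names and the `use_fanned` switch are illustrative:

```c
struct Tuple { int a, b, c, d, e, f, g, *h; };

/* Original callee: takes the whole object, so every field (including
   the pointer h) is visible inside f and h may escape. */
static int f_original(struct Tuple *t) {
  return t->a + t->c + t->e + t->g;
}

/* Steps 1+2: fan out the used fields into scalar parameters and fold
   the constants the caller stores (t->a = 3, t->e = 7). h can no
   longer escape because t is not passed at all. */
static int f_fanned(int c, int g) {
  return 3 + c + 7 + g;
}

static int bar(struct Tuple *t, int use_fanned) {
  t->a = 3; t->e = 7;
  return use_fanned ? f_fanned(t->c, t->g) : f_original(t);
}
```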

SLIDE 48

IPO — Additional Proposals/Prototypes

  • track values of fields across function calls, e.g., closure initialization [prototype]

  • determine performance impact of missing static information [ongoing]

  • export attributes for libraries, e.g., add __attribute__((const)) [planned]


Interested? Contact me!

SLIDE 50

Evaluation

SLIDE 51

OpenMP Optimizations

Version | Description                                         | Opt.
--------+-----------------------------------------------------+-----
base    | plain “-O3”, thus no parallel optimizations         |
attr    | attribute propagation through attr. deduction (IPO) | I
argp    | variable privatization through arg. promotion (IPO) | II
n/a     | constant propagation (IPO)                          |

SLIDE 52

OpenMP Optimizations — Performance Results

(Figure: performance result charts.)

SLIDE 58

Array Constant Propagation Example

double gamma[4][8];
gamma[0][0] = 1;
// ... and so on till ...
gamma[3][7] = -1;

Kokkos::parallel_for(
    "CalcFBHourglassForceForElems A", numElem,
    KOKKOS_LAMBDA(const int &i2) {
      // Use gamma[0][0] ... gamma[3][7]
    });
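Stripped of Kokkos, the pattern reduces to a constant-initialized array read inside a parallel body; array constant propagation would let the body use the literal values instead of loads through the capture. A plain-C stand-in, with the `parallel_for` modeled as a sequential loop and the `gamma_tbl`/`body`/`run` names purely illustrative:

```c
/* gamma_tbl is filled with known constants before the "parallel"
   loop; array constant propagation could replace the loads in the
   body with those literals. */
static double gamma_tbl[4][8];

static double body(int i2) {
  /* Each "element" reads some gamma entries. */
  return gamma_tbl[0][0] + i2 * gamma_tbl[3][7];
}

static double run(int numElem) {
  gamma_tbl[0][0] = 1;
  /* ... remaining entries ... */
  gamma_tbl[3][7] = -1;

  double sum = 0;
  for (int i2 = 0; i2 < numElem; i2++)  /* models parallel_for */
    sum += body(i2);
  return sum;
}
```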

SLIDE 59

Array Constant Propagation Performance

SLIDE 60

Conclusion