SLIDE 1

An alternative OpenMP Backend for Polly

Michael Halkenhäuser 2019 European LLVM Developers’ Meeting

2019-04-08 | Embedded Systems and Applications Group | Michael Halkenhäuser | 1 / 19

SLIDE 5

Polly

◮ Polyhedral framework on LLVM-IR
    ◮ Efficient analyses and transformations
    ◮ Code generation
◮ Example transformations
    ◮ Loop interchange / fission / fusion
    ◮ Strip mining (Vectorization)
    ◮ Automatic parallelization

SLIDE 8

Polly – Sample Parallelization

◮ Automatic parallelization
    ◮ No need for manual annotation

Input:

    // "matvect" -- Sequential
    // (Simplified dependencies)
    for (i = 0; i <= n; i++) {
      for (j = 0; j <= n; j++)
        s[i] = s[i] + a[i][j] * x[j];
    }

Output:

    // "matvect" -- OpenMP parallelized
    // Equivalent to the LLVM-IR output
    #pragma omp parallel for [...] \
        schedule (dynamic, 1) num_threads(N)
    for (i = 0; i <= n; i++) {
      for (j = 0; j <= n; j++)
        s[i] = s[i] + a[i][j] * x[j];
    }
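A compilable sketch of the two variants above, assuming C++ with `std::vector` storage and exclusive upper bounds instead of the slide's `i <= n`. The names `matvect_seq` and `matvect_omp` are illustrative and do not come from the talk. The pragma is guarded so the code also builds without OpenMP support, in which case both functions simply run sequentially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Sequential reference: s[i] accumulates the dot product of row i and x.
void matvect_seq(const Matrix &a, const std::vector<double> &x,
                 std::vector<double> &s) {
  assert(a.size() == s.size());
  for (std::size_t i = 0; i < a.size(); i++) {
    assert(a[i].size() == x.size());
    for (std::size_t j = 0; j < x.size(); j++)
      s[i] = s[i] + a[i][j] * x[j];
  }
}

// Parallel variant: iteration i writes only s[i], so the outer
// iterations are independent and may be distributed across threads.
void matvect_omp(const Matrix &a, const std::vector<double> &x,
                 std::vector<double> &s, int num_threads) {
  assert(a.size() == s.size() && num_threads > 0);
  (void)num_threads;  // unused when built without OpenMP
  const long rows = static_cast<long>(a.size());
#ifdef _OPENMP
#pragma omp parallel for schedule(dynamic, 1) num_threads(num_threads)
#endif
  for (long i = 0; i < rows; i++) {
    const std::size_t row = static_cast<std::size_t>(i);
    for (std::size_t j = 0; j < x.size(); j++)
      s[row] = s[row] + a[row][j] * x[j];
  }
}
```

Building with `-fopenmp` (GCC or clang) activates the pragma. Polly derives the equivalent of this annotation automatically, whereas here it is written by hand.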

SLIDE 9

Polly – Parallelization Scheme

◮ Polly detects parallelizable code regions
    ◮ Moved into an outlined function
    ◮ Executed using OpenMP API
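The outlining step can be pictured roughly as follows. This is a sketch under assumptions, not Polly's generated code: `MatvectContext`, `matvect_outlined`, and `run_chunks` are hypothetical names, and `run_chunks` only stands in for an OpenMP runtime entry point by executing the chunks one after another on the calling thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Values the loop body needs, packed so they can cross a function boundary.
struct MatvectContext {
  const std::vector<std::vector<double>> *a;
  const std::vector<double> *x;
  std::vector<double> *s;
};

// Outlined function: the former loop body, restricted to rows [lb, ub).
void matvect_outlined(long lb, long ub, void *raw) {
  auto *ctx = static_cast<MatvectContext *>(raw);
  for (long i = lb; i < ub; i++) {
    const std::size_t row = static_cast<std::size_t>(i);
    for (std::size_t j = 0; j < ctx->x->size(); j++)
      (*ctx->s)[row] += (*ctx->a)[row][j] * (*ctx->x)[j];
  }
}

// Hypothetical stand-in for the runtime: a real OpenMP runtime would hand
// these chunks to worker threads; this sketch runs them sequentially.
void run_chunks(void (*fn)(long, long, void *), void *ctx, long n,
                long chunk) {
  assert(n >= 0 && chunk > 0);
  for (long lb = 0; lb < n; lb += chunk) {
    const long ub = (lb + chunk < n) ? lb + chunk : n;
    fn(lb, ub, ctx);
  }
}
```

The point of the design is that the original function keeps only a single call that passes a function pointer and a context pointer, which is the shape a runtime API can accept.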

SLIDE 12

Motivation for an alternative OpenMP Backend

◮ Limited influence on OpenMP execution
    ◮ Increase number of user options
    ◮ Improve fine-tuning possibilities
◮ Dependent on GNU OpenMP API
    ◮ Expand the scope of application
◮ LLVM OpenMP implementation available
    ◮ Enable direct use of LLVM’s OpenMP runtime
    ◮ Support automated testing

SLIDE 16

LLVM OpenMP Backend

◮ Extension of the preexisting backend
    ◮ Reused common functionalities
        ◮ Moved into abstract base class
    ◮ API-specific call creation and placement
        ◮ Implemented in derived class per backend
◮ User may choose backend
    ◮ Via command-line switch, similar to
        ◮ Number of threads
◮ Additional options
    ◮ Scheduling type
    ◮ Chunk size
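One way to picture the base-class and derived-class split is sketched below. All class, function, and option names are hypothetical and are not taken from Polly's sources; the sketch returns descriptive strings instead of emitting LLVM-IR. The two runtime symbol names serve only as labels for "the GNU entry point" and "the LLVM entry point", and the assumption that only the LLVM variant forwards scheduling options is made for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>

enum class Scheduling { Static, Dynamic, Guided };

const char *to_string(Scheduling s) {
  switch (s) {
  case Scheduling::Static: return "static";
  case Scheduling::Dynamic: return "dynamic";
  case Scheduling::Guided: return "guided";
  }
  return "unknown";
}

struct ParallelOptions {
  Scheduling scheduling = Scheduling::Dynamic;
  int chunk_size = 1;
};

// Abstract base class: holds the steps shared by every backend.
class ParallelBackend {
public:
  virtual ~ParallelBackend() = default;

  std::string emit_parallel_loop(const std::string &fn,
                                 const ParallelOptions &opts) const {
    assert(opts.chunk_size > 0);
    return "outline " + fn + "; " + emit_runtime_call(fn, opts);
  }

protected:
  // API-specific part, implemented once per backend.
  virtual std::string emit_runtime_call(const std::string &fn,
                                        const ParallelOptions &opts) const = 0;
};

class GnuBackend : public ParallelBackend {
protected:
  std::string emit_runtime_call(const std::string &fn,
                                const ParallelOptions &) const override {
    return "call GOMP_parallel_loop_runtime_start(" + fn + ")";
  }
};

class LlvmBackend : public ParallelBackend {
protected:
  std::string emit_runtime_call(const std::string &fn,
                                const ParallelOptions &opts) const override {
    return "call __kmpc_fork_call(" + fn + ") schedule=" +
           to_string(opts.scheduling) +
           " chunk=" + std::to_string(opts.chunk_size);
  }
};

// Backend selection by name, standing in for a command-line switch.
std::unique_ptr<ParallelBackend> make_backend(const std::string &name) {
  if (name == "LLVM")
    return std::unique_ptr<ParallelBackend>(new LlvmBackend());
  return std::unique_ptr<ParallelBackend>(new GnuBackend());
}
```

Calling `make_backend("LLVM")->emit_parallel_loop("f", opts)` exercises the shared path and the LLVM-specific call in one step; swapping the name string is the only change needed to switch backends.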

SLIDE 17

LLVM OpenMP Backend – Options

◮ Scheduling type determines work distribution

static:   Predetermined, uniform distribution of iterations
dynamic:  Threads request work shares of chunk size
guided:   Hybrid scheduling of static and dynamic, using a minimum chunk size
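The three distribution rules can be modelled in a few lines. This is a simplified sketch, not the behaviour of any particular OpenMP runtime: `static_owner` assumes static scheduling with an explicit chunk size (round-robin chunks), and `guided_chunks` assumes each request receives the remaining iterations divided by the thread count, rounded up and bounded below by the minimum chunk size. All function names are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// static: chunk k is assigned to thread k % threads before the loop runs.
int static_owner(long iteration, long chunk, int threads) {
  assert(iteration >= 0 && chunk > 0 && threads > 0);
  return static_cast<int>((iteration / chunk) % threads);
}

// dynamic: every request hands out `chunk` iterations (the last one may be
// smaller); which thread gets which chunk is decided at run time.
std::vector<long> dynamic_chunks(long n, long chunk) {
  assert(n >= 0 && chunk > 0);
  std::vector<long> sizes;
  for (long remaining = n; remaining > 0; remaining -= chunk)
    sizes.push_back(std::min(chunk, remaining));
  return sizes;
}

// guided (simplified model): chunk sizes shrink as the loop progresses,
// but never fall below min_chunk except for the final remainder.
std::vector<long> guided_chunks(long n, int threads, long min_chunk) {
  assert(n >= 0 && threads > 0 && min_chunk > 0);
  std::vector<long> sizes;
  long remaining = n;
  while (remaining > 0) {
    long size = std::max((remaining + threads - 1) / threads, min_chunk);
    size = std::min(size, remaining);
    sizes.push_back(size);
    remaining -= size;
  }
  return sizes;
}
```

With 16 iterations, 4 threads, and a minimum chunk size of 1, the guided model yields chunks of 4, 3, 3, 2, 1, 1, 1, 1: large shares first (few requests, as in static) and small shares at the end (fine-grained balancing, as in dynamic).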

SLIDE 19

LLVM OpenMP Backend – Options

◮ Scheduling type determines work distribution

                         static   dynamic   guided
Load Balancing             –         +
Organization Overhead      +         –

◮ static
    suited for constant computational demands
◮ dynamic
    suited for shifting computational demands
◮ guided
    suited for "both"
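The trade-off between load balancing and organization overhead can be illustrated with a toy cost model. This is a sketch under stated assumptions, not a measurement: static scheduling is modelled as contiguous equal-sized blocks with no dispatch cost, dynamic scheduling as chunk size 1 with a fixed, hypothetical `overhead` per request, and the function names are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Static (block) model: thread t gets one contiguous block of iterations.
// Returns the finishing time of the most heavily loaded thread.
double makespan_static(const std::vector<double> &cost, int threads) {
  assert(threads > 0);
  const std::size_t n = cost.size();
  const std::size_t t = static_cast<std::size_t>(threads);
  const std::size_t block = (n + t - 1) / t;
  double worst = 0.0;
  for (std::size_t begin = 0; begin < n; begin += block) {
    double load = 0.0;
    for (std::size_t i = begin; i < std::min(n, begin + block); i++)
      load += cost[i];
    worst = std::max(worst, load);
  }
  return worst;
}

// Dynamic model, chunk size 1: the next iteration goes to whichever thread
// becomes free first; every request costs an extra `overhead`.
double makespan_dynamic(const std::vector<double> &cost, int threads,
                        double overhead) {
  assert(threads > 0 && overhead >= 0.0);
  std::vector<double> busy_until(static_cast<std::size_t>(threads), 0.0);
  for (double c : cost) {
    auto next = std::min_element(busy_until.begin(), busy_until.end());
    *next += c + overhead;
  }
  return *std::max_element(busy_until.begin(), busy_until.end());
}
```

In this model, four iterations of cost 1 on two threads finish at time 2 statically but at time 3 dynamically with an overhead of 0.5, while iteration costs rising from 1 to 8 finish at 26 statically and 22 dynamically. That matches the qualitative guidance above: static for constant demands, dynamic for shifting ones.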

SLIDE 22

Experimental Methodology

◮ PolyBench¹
    ◮ Provides multiple datasets
    ◮ Triggers auto-parallelization in 18 benchmarks
◮ Runtime results
    ◮ Average from 50 out of 60 runs (10% trimmed mean)
    ◮ Utilized CPU: AMD R5 1600X
◮ Plots show relative speedup
    ◮ speedup = (runtime of baseline) / (runtime of competitor)

¹ https://sourceforge.net/projects/polybench/
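The two measurement steps can be sketched as follows. The slide only states that 50 of 60 runs are averaged; dropping the same number of runs from the fast and the slow end (for example 5 and 5) is an assumption of this sketch, as are the function names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Trimmed mean: sort the runtimes, discard `drop_each_side` samples at
// both ends, and average what is left.
double trimmed_mean(std::vector<double> runs, std::size_t drop_each_side) {
  assert(runs.size() > 2 * drop_each_side);
  std::sort(runs.begin(), runs.end());
  const auto offset = static_cast<std::ptrdiff_t>(drop_each_side);
  const auto first = runs.begin() + offset;
  const auto last = runs.end() - offset;
  return std::accumulate(first, last, 0.0) /
         static_cast<double>(last - first);
}

// Relative speedup as defined on the slide: values above 1 mean the
// competitor is faster than the baseline.
double speedup(double baseline_runtime, double competitor_runtime) {
  assert(competitor_runtime > 0.0);
  return baseline_runtime / competitor_runtime;
}
```

For 60 measured runs, `trimmed_mean(runs, 5)` would average the middle 50 under the assumption stated above; the resulting means for baseline and competitor are then passed to `speedup`.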

SLIDE 23

Performance Impact of chunk size

[Figure: "LLVM OpenMP Chunk Size Comparison" (Large Dataset · No Vectorization · Dynamic Scheduling · 12 Threads · Baseline: Chunk Size 1). x-axis: PolyBench benchmark (adi, atax, bicg, cholesky, correlation, covariance, deriche, doitgen, gemver, gesummv, g-schmidt, lu, ludcmp, mvt, symm, syr2k, syrk, trmm); y-axis: achieved speedup; series: Chunk Size 2, Chunk Size 3, Chunk Size 4, Chunk Size 6.]

SLIDE 24

Performance Impact of scheduling type

[Figure: "LLVM OpenMP Scheduling Comparison" (No Vectorization · 12 Threads · Baseline: Dynamic Scheduling). x-axis: PolyBench benchmark (adi, atax, bicg, cholesky, correlation, covariance, deriche, doitgen, gemver, gesummv, g-schmidt, lu, ludcmp, mvt, symm, syr2k, syrk, trmm); y-axis: achieved speedup; series: Guided Scheduling · Large Dataset, Static Scheduling · Large Dataset.]

SLIDE 26

Intermezzo – Customization Options

◮ Chunk size
    ◮ 1 is usually a reasonable choice
    ◮ Other chunk sizes are very beneficial in particular cases
        ◮ More than 3× speedup possible
◮ Scheduling type
    ◮ Dynamic: Good overall performance
    ◮ Guided: Performs at least as well as dynamic
    ◮ Static: Problem-dependent
        ◮ May achieve 8× speedup compared to dynamic

SLIDE 27

Backend Comparison: LLVM versus GNU OpenMP Backend

[Figure: "GNU & LLVM Backend Comparison" (Large Dataset · No Vectorization · 4 Threads · Baseline: GNU Backend). x-axis: PolyBench benchmark (adi, atax, bicg, cholesky, correlation, covariance, deriche, doitgen, gemver, gesummv, g-schmidt, lu, ludcmp, mvt, symm, syr2k, syrk, trmm); y-axis: achieved speedup; series: LLVM OpenMP · Best Result.]

SLIDE 28

Backend Comparison: LLVM versus GNU OpenMP Backend

[Figure: "GNU & LLVM Backend Comparison" (Large Dataset · No Vectorization · 12 Threads · Baseline: GNU Backend). x-axis: PolyBench benchmark (adi, atax, bicg, cholesky, correlation, covariance, deriche, doitgen, gemver, gesummv, g-schmidt, lu, ludcmp, mvt, symm, syr2k, syrk, trmm); y-axis: achieved speedup; series: LLVM OpenMP · Best Result.]

SLIDE 31

Intermezzo – Backend Comparison

◮ Using the maximum number of available threads
◮ Our "LLVM" backend
    ◮ Achieves comparable performance
    ◮ Performs significantly faster than "GNU" in seven cases
    ◮ Reaches up to 1.6× speedup
◮ GNU backend
    ◮ Only a single, considerable lead
◮ Additional switches
    ◮ Allow problem-specific adjustments
    ◮ ... without depending on environment variables

SLIDE 32

General Comparison: LLVM OpenMP Backend versus clang

[Figure: "clang Comparison" (Large Dataset · With Vectorization · Baseline: clang-8 -O3). x-axis: PolyBench benchmark (adi, atax, bicg, cholesky, correlation, covariance, deriche, doitgen, gemver, gesummv, g-schmidt, lu, ludcmp, mvt, symm, syr2k, syrk, trmm); y-axis: achieved speedup; series: LLVM OpenMP · Best Result · 12 Threads.]

SLIDE 36

Conclusion

◮ Our "LLVM" OpenMP backend for Polly
    ◮ Represents a superior alternative
    ◮ Acts as drop-in replacement
    ◮ Provides more customization options
    ◮ Carries no clear drawbacks, but instead ...
        ◮ Reaches up to 1.6× speedup

SLIDE 37

Conclusion

◮ Our "LLVM" OpenMP backend for Polly
    ◮ Is publicly available
        ◮ Review accepted on March 19th: https://reviews.llvm.org/D59100
        ◮ Currently on Polly's master branch: https://github.com/llvm/llvm-project/commit/89251ed
◮ References:
    ◮ Title graphic: https://polly.llvm.org/images/header-background.png
    ◮ T. Grosser, H. Zheng, R. Aloor, A. Simbürger, A. Größlinger, and L.-N. Pouchet, “Polly - Polyhedral optimization in LLVM,” in Proceedings of the First International Workshop on Polyhedral Compilation Techniques (IMPACT), vol. 2011, 2011, p. 1.

SLIDE 38

Questions?

◮ Ask them now, or ...
◮ Find me tomorrow, at the poster session
    ◮ 09:00 am - 10:00 am (Foyer)
