

SLIDE 1

Pipelined Multithreading Generation in a Polyhedral Compiler

January 22nd 2020, IMPACT’20, HiPEAC, Bologna, Italy
Harenome Ranaivoarivony-Razanajato, Cédric Bastoul, Vincent Loechner

University of Strasbourg and Inria Nancy - Grand Est

Team ICPS | Scientific and Parallel Computing, University of Strasbourg

SLIDE 2

Motivating Example

for (int i = 1; i < N; ++i)
    A[i] = f1(A[i], A[i - 1]); // S1
for (int i = 1; i < N; ++i)
    B[i] = f2(A[i], B[i - 1]); // S2
/* ... */
for (int i = 1; i < N; ++i)
    F[i] = f6(E[i], F[i - 1]); // S6

(a) Sequential Program

(b) Dependency Graph: S1 → S2 → … → S6 (each Si also depends on its own previous iteration)

Pipelined Multithreading Generation in a Polyhedral Compiler, Harenome Razanajato et al.


SLIDE 4

Motivating Example

(a) Sequential Program (same as above)

(b) Pipelined Execution (wavefront over time):
  step 1: thread 1 runs S1(1)
  step 2: thread 1 runs S2(1), thread 2 runs S1(2)
  step 3: thread 1 runs S3(1), thread 2 runs S2(2), thread 3 runs S1(3)



SLIDE 10

Motivating Example

(a) Sequential Program (same as above)

#pragma omp parallel
{
    #pragma omp for schedule(static) ordered nowait
    for (int i = 1; i < N; ++i)
        #pragma omp ordered
        A[i] = f1(A[i], A[i - 1]); // S1
    #pragma omp for schedule(static) ordered nowait
    for (int i = 1; i < N; ++i)
        #pragma omp ordered
        B[i] = f2(A[i], B[i - 1]); // S2
    /* ... */
    #pragma omp for schedule(static) ordered nowait
    for (int i = 1; i < N; ++i)
        #pragma omp ordered
        F[i] = f6(E[i], F[i - 1]); // S6
}

(b) Pipelined OpenMP target program

Speedup: 2.89× with 6 stages on an Intel Xeon E5-2620v3 @ 2.40 GHz, N = 100,000



SLIDE 12

Goals

  • Identify software pipelines in a polyhedral compiler
  • Generate pipelined multithreading using OpenMP


SLIDE 13

Polyhedral Model

  • Introduction
  • Background
  • Pipelined Multithreading Generation
  • Experimental Results
  • Conclusion

SLIDE 14

OpenMP

  • #pragma based API for shared memory parallelism
  • Worksharing constructs
    • #pragma omp for
    • #pragma omp task
  • Synchronization
    • #pragma omp barrier: explicit synchronization barrier
    • omp_set_lock() and omp_unset_lock(): explicit lock mechanism
  • Clauses
    • nowait clause on worksharing constructs: omit the implicit barrier at the end of a worksharing construct
    • ordered clause on worksharing constructs: sequentialize a region



SLIDE 17

Polyhedral Model

  • Introduction
  • Background
  • Pipelined Multithreading Generation
    • Sequential Loop Fission
    • Relaxed nowait prerequisites
    • Alternative: Explicit synchronization
  • Experimental Results
  • Conclusion

SLIDE 18

Sequential Loop Fission

  • Goal: maximize the number of pipeline stages
  • Dependence analysis: identify Strongly Connected Components


SLIDE 19

Sequential Loop Fission

for (int i = 2; i < N; ++i) {
    a[i] = h[i - 1] + R[i];     // S1
    b[i] = a[i - 1] + a[i];     // S2
    c[i] = b[i - 1] + b[i];     // S3
    d[i] = c[i - 1] + c[i];     // S4
    e[i] = d[i - 2] + d[i - 1]; // S5
    f[i] = e[i - 2] + e[i - 1]; // S6
    g[i] = f[i] + X[i];         // S7
    h[i] = g[i] + Y[i];         // S8
    u[i] = v[i - 1] + d[i];     // S9
    v[i] = u[i] + Z[i];         // S10
}

(a) Original loop body

for (int i = 2; i < N; ++i) {
    a[i] = h[i - 1] + R[i];     // S1
    b[i] = a[i - 1] + a[i];     // S2
    c[i] = b[i - 1] + b[i];     // S3
    d[i] = c[i - 1] + c[i];     // S4
    e[i] = d[i - 2] + d[i - 1]; // S5
    f[i] = e[i - 2] + e[i - 1]; // S6
    g[i] = f[i] + X[i];         // S7
    h[i] = g[i] + Y[i];         // S8
}
for (int i = 2; i < N; ++i) {
    u[i] = v[i - 1] + d[i];     // S9
    v[i] = u[i] + Z[i];         // S10
}

(b) Fission of Strongly Connected Components


SLIDE 20

Conditions on the nowait clause for parallel loops

The safe use of the nowait clause between two parallel loops requires that there are no dependencies between the loops or that:

  • the sizes of the iteration domains are equal
  • the chunk size is either the same or not specified
  • both loops are bound to the same parallel region
  • none of the loops is associated with a SIMD construct
  • the second loop depends only on the same logical iteration of the first loop


SLIDE 21

Relaxed conditions on the nowait clause for ordered loops

The safe use of the nowait clause between two ordered loops requires that there are no dependencies between the loops or that:

  • the sizes of the iteration domains are equal
  • the chunk size is either the same or not specified
  • both loops are bound to the same parallel region
  • none of the loops is associated with a SIMD construct
  • relaxed: the second loop depends on the same logical iteration or previous logical iterations of the first loop (instead of only the same logical iteration)



SLIDE 23

Relaxed conditions on the nowait clause for ordered loops

#pragma omp parallel
{
    #pragma omp for nowait
    for (int i = 0; i < N; ++i)
        A[i] = f1(A[i]);
    #pragma omp for
    for (int i = 0; i < N; ++i)
        B[i] = f2(B[i], A[i]);
}

(a) Parallel for and nowait

#pragma omp parallel
{
    #pragma omp for ordered nowait
    for (int i = 0; i < N; ++i)
        #pragma omp ordered
        A[i] = f1(A[i]);
    #pragma omp for ordered
    for (int i = 0; i < N; ++i)
        #pragma omp ordered
        B[i] = f2(B[i], A[i-1]);
}

(b) Ordered for and nowait


SLIDE 24

Annotations

  • Annotate sequential loops with #pragma omp for ordered
  • Enclose sequential loop bodies in #pragma omp ordered regions
  • Annotate loops with nowait where possible
  • Optimize by reverting ordered loops without nowait clauses to #pragma omp single regions


SLIDE 25

Polyhedral Model

  • Introduction
  • Background
  • Pipelined Multithreading Generation
    • Sequential Loop Fission
    • Relaxed nowait prerequisites
    • Alternative: Explicit synchronization
  • Experimental Results
  • Conclusion

SLIDE 26

Explicit synchronization

  • Loop blocking and loop fusion
  • #pragma omp for schedule(static, 1) on the blocking loop
  • omp_set_lock() and omp_unset_lock() before and after each loop of the pipeline
  • up to n × m locks required for m pipeline stages over n threads


SLIDE 27

Explicit synchronization

#pragma omp parallel
{
    #pragma omp for ordered nowait
    for (size_t i = 1; i < N; ++i)
        #pragma omp ordered
        A[i] = f(A[i], A[i - 1]);
    /* Other stages */
    #pragma omp for ordered
    for (size_t i = 1; i < N; ++i)
        #pragma omp ordered
        F[i] = f(E[i], F[i - 1]);
}

(a) Original program

omp_lock_t** locks;
#pragma omp parallel
{
    /* Choose num_threads, block_size, block_count. */
    /* Allocate, initialize and set the locks. */
    #pragma omp for schedule(static, 1)
    for (size_t block = 0; block < block_count; ++block) {
        /* Local loop bounds and indexes. */
        const size_t start = 1 + block * block_size;
        const size_t end = MIN(start + block_size, N);
        const size_t self = block % num_threads;
        const size_t next = (block + 1) % num_threads;
        omp_set_lock(&locks[self][0]);
        for (size_t i = start; i < end; ++i)
            A[i] = f(A[i], A[i-1]);
        omp_unset_lock(&locks[next][0]);
        /* Other stages of the pipeline */
        omp_set_lock(&locks[self][5]);
        for (size_t i = start; i < end; ++i)
            F[i] = f(E[i], F[i-1]);
        omp_unset_lock(&locks[next][5]);
    }
    /* Destroy and free locks. */
}

(b) Pipelined OpenMP target program


SLIDE 28

Future Work

  • Introduction
  • Background
  • Pipelined Multithreading Generation
  • Experimental Results
  • Conclusion

SLIDE 29

Experimental Setup

  • Tested on an Intel Xeon E5-2620v3 @ 2.40 GHz, Linux 5.3.7
  • Code compiled using gcc 9.2.1 and clang 9.0.0 with options -O3 -march=native -fopenmp
  • FIFO scheduling enabled and process priority set to 75
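A plausible reproduction of the build and run commands implied above (the source file name is hypothetical; chrt requires appropriate privileges):

```shell
# Build with both toolchains, as described in the setup.
gcc   -O3 -march=native -fopenmp pipeline.c -o pipeline_gcc
clang -O3 -march=native -fopenmp pipeline.c -o pipeline_clang

# Run under FIFO scheduling with priority 75.
chrt --fifo 75 ./pipeline_gcc
chrt --fifo 75 ./pipeline_clang
```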


SLIDE 30

Benchmarks

benchmark     | parallel loops | stages
teaser        |                | 5
van_dongen¹   |                | 2
wdf²          |                | 2
mix           | 1              | 2

¹ Vincent H. Van Dongen, Guang R. Gao, and Qi Ning. “A polynomial time method for optimal software pipelining”. In: Parallel Processing: CONPAR 92—VAPP V. Springer, 1992, pp. 613–624.

² Alfred Fettweis. “Wave digital filters: Theory and practice”. In: Proceedings of the IEEE 74.2 (1986), pp. 270–327.


SLIDE 31

Results — gcc/libgomp

Figure 7: Speedups or slowdowns over the sequential version (gcc/libgomp) for vandongen, wdf, teaser, teaser+nanosleep and mix, comparing the tasks+fission, tasks, ordered, ordered+nowait and locks variants.


SLIDE 32

Results — clang/libomp

vandongen wdf teaser teaser+nanosleep mix 1 2 3 4

5.5 · 10−5 3.2 · 10−5 0.13 9 · 10−5 4.7 · 10−5 3.7 · 10−5 1.01 2.4 · 10−2 0.65 0.2 8.94 · 10−2 1 0.1 2.08 0.19 9.18 · 10−2 1 3.95 2.28 1.35 1.08 1

speedup tasks+fission tasks

  • rdered
  • rdered+nowait

locks

Figure 8: Speedups or slowdowns over sequential version


SLIDE 33

Future Work

  • Introduction
  • Background
  • Pipelined Multithreading Generation
  • Experimental Results
  • Conclusion

SLIDE 34

Contributions and Future Work

  • Contributions:
    • Identifying software pipelines in a polyhedral compiler
    • Generating pipelined multithreading
  • Future work:
    • Integration in an automatic parallelizer
    • Investigate methods to determine optimal block sizes



SLIDE 36

Appendix


SLIDE 37

References

[1] Alfred Fettweis. “Wave digital filters: Theory and practice”. In: Proceedings of the IEEE 74.2 (1986), pp. 270–327.
[2] Harenome Razanajato, Cédric Bastoul, and Vincent Loechner. “Pipelined Multithreading Generation in a Polyhedral Compiler”. In: IMPACT 2020, 10th International Workshop on Polyhedral Compilation Techniques. Bologna, Italy, Jan. 2020.
[3] Vincent H. Van Dongen, Guang R. Gao, and Qi Ning. “A polynomial time method for optimal software pipelining”. In: Parallel Processing: CONPAR 92—VAPP V. Springer, 1992, pp. 613–624.
