PCERE: Fine-grained Parallel Benchmark Decomposition for Scalability Prediction


SLIDE 1

PCERE: Fine-grained Parallel Benchmark Decomposition for Scalability Prediction

Mihail Popov, Chadi Akel, Florent Conti, William Jalby, Pablo de Oliveira Castro

UVSQ - PRiSM - ECR

May 28, 2015

SLIDE 2

Introduction

Evaluate strong scalability

Evaluating the strong scalability of OpenMP applications is costly and time-consuming: the whole application must be executed multiple times with different thread configurations, wasting resources.

According to Amdahl's law, sequential parts do not scale. Parallel regions may share similar performance across invocations.

M.Popov C.Akel F.Conti W.Jalby P.Oliveira PCERE May 28, 2015 1 / 18

SLIDE 3

Introduction

PCERE: Parallel Codelet Extractor and REplayer

PCERE accelerates strong scalability evaluation. It is part of the CERE (Codelet Extractor and REplayer) framework. PCERE decomposes applications into small pieces called codelets: each codelet maps a parallel region and is a standalone executable. Codelets are extracted once and then replayed, instead of the whole application, with different numbers of threads.

SLIDE 4

Introduction

Prediction model

int main() {
    for (i = 0; i < 3; i++) {
        // sequential code
        #pragma omp parallel
        A
    }
    // sequential code
    #pragma omp parallel
    B
    // sequential code
}

[Timeline diagram: invocations A, A, A, B]

Executing the whole application with different thread configurations

SLIDE 5

Introduction

Prediction model

int main() {
    for (i = 0; i < 3; i++) {
        // sequential code
        #pragma omp parallel
        A
    }
    // sequential code
    #pragma omp parallel
    B
    // sequential code
}

[Timeline diagram: invocations A, A, A, B]

Extracting parallel regions A and B, measuring the sequential execution time, and directly replaying the parallel regions

SLIDE 6

Introduction

Prediction model

int main() {
    for (i = 0; i < 3; i++) {
        // sequential code
        #pragma omp parallel
        A
    }
    // sequential code
    #pragma omp parallel
    B
    // sequential code
}

[Timeline diagram: replayed invocations A, A, A, A plus sequential segments S, S]

Add the sequential time and the multiple invocations of each parallel region

SLIDE 7

Outline

1 Overview
2 Extract and replay codelets
3 Prediction model evaluation

SLIDE 8

Overview

Codelet capture and replay

[Workflow diagram]

Region Capture: outlining of parallel regions in OpenMP applications; capture of representative working sets; working-set memory dump; generation of codelet wrappers.

Codelet Replay: warmup + replay; fast performance prediction; retargeting to a different architecture; changing the number of threads or the affinity.

SLIDE 9

Overview

LLVM OpenMP Intermediate Representation extraction

Extract codelets at the Intermediate Representation level for language portability and cross-architecture evaluation

[Compilation pipeline: C/C++ OpenMP applications → Clang OpenMP front end → LLVM IR → codelet extraction passes → LLVM IR → LLVM opt (optimization) → LLVM llc static compiler → object files → linking → executable binary]

SLIDE 10

Overview

Clang front end transforms source code into IR

void main() {
    #pragma omp parallel
    {
        int p = omp_get_thread_num();
        printf("%d", p);
    }
}

C code

Clang OpenMP front end:

define i32 @main() {
entry:
  ...
  call @__kmpc_fork_call @.omp_microtask.(...)
  ...
}

define internal void @.omp_microtask.(...) {
entry:
  %p = alloca i32, align 4
  %call = call i32 @omp_get_thread_num()
  store i32 %call, i32* %p, align 4
  %1 = load i32* %p, align 4
  call @printf(%1)
}

LLVM simplified IR (thread execution model)

SLIDE 11

Extract and replay codelets

Deterministic codelet replay

Parallel region capture: dump call, then exit.
Parallel region replay: restore call, then direct jump to the parallel region.

SLIDE 12

Extract and replay codelets

Memory dump

System memory snapshot at the beginning of each parallel region

define i32 @main() {
entry:
  ...
  call @__kmpc_fork_call @.omp_microtask.(...)
  ...
}

define internal void @.omp_microtask.(...) {
entry:
  %p = alloca i32, align 4
  %call = call i32 @omp_get_thread_num()
  store i32 %call, i32* %p, align 4
  %1 = load i32* %p, align 4
  call @printf(%1)
}

LLVM simplified IR

define i32 @main() {
entry:
  ...
  call @__extracted__.omp_microtask.(...)
  ...
}

define internal void @__extracted__.omp_microtask.(...) {
newFuncRoot:
  call void @dump(...)
  call @__kmpc_fork_call @.omp_microtask.(...)
}

define internal void @.omp_microtask.(...) {
entry:
  ...
}

LLVM simplified IR after the extract + dump passes

slide-13
SLIDE 13

Extract and replay codelets

Codelet replay

Reload the codelet working set. Reproduce the cache state with an optimistic cache warm-up. A single codelet may have multiple working sets.

SLIDE 14

Extract and replay codelets

Codelets with different working sets

[Plot: execution time in cycles (0 to 3e8) vs. invocation number (1 to 40) for the replayed codelet]

Figure : MG resid execution time over the different invocations replayed with 4 threads

SLIDE 15

Extract and replay codelets

Lock Support

Lock support on Linux uses futexes. Each futex allocates a kernel-space wait queue, but the memory capture saves only user-space memory. A lock capture step therefore detects all the locks accessed by a codelet, and the replay wrapper initializes the required locks in kernel space.

SLIDE 16

Prediction model evaluation

Test benchmarks and architectures

Using the NAS Parallel Benchmarks, OpenMP 3.0 C version, based on the Omni Compiler Project

                  Core2   Nehalem     Sandy Bridge  Ivy Bridge
CPU               E7500   Xeon E5620  E5            i7-3770
Frequency (GHz)   2.93    2.40        2.7           3.4
Sockets           1       2           2             1
Cores per socket  2       4           8             4
Threads per core  1       1           2             2
L1 cache (KB)     32      32          32            32
L2 cache          3 MB    256 KB      256 KB        256 KB
L3 cache (MB)     -       12          20            8
RAM (GB)          4       24          64            32

Figure : Test architectures

SLIDE 17

Prediction model evaluation

Reproducing parallel regions scaling with codelets

[Plot: runtime in cycles (1e8 scale) vs. number of threads (1 to 32) for SP compute rhs, real vs. predicted]

Figure : Real vs. PCERE execution time predictions on Sandy Bridge for the SP compute rhs codelet

SLIDE 18

Prediction model evaluation

Prediction accuracy

              BT   EP   LU   FT   SP   CG    IS   MG
Core2         2.4  0.1  1.4  2.3  1.8  1     4.2  1.5
Nehalem       3    0.4  0.6  0.5  6    9.8   2.2  0.6
Sandy Bridge  8.9  1.5  1.4  1.6  0.9  14.5  18   12.1
Ivy Bridge    0.7  1    2    3.4  1.2  3.7   5.3  5.4

Figure : NAS 3.0 C version, average prediction error (%)

On Ivy Bridge, PCERE predicts FT execution time scalability with an error of 3.4%

SLIDE 19

Prediction model evaluation

Benchmarking acceleration

              BT    EP  LU    FT   SP    CG    IS   MG
Core2         31.5  1   54.3  1.5  87.2  24.2  1.1  0.9
Nehalem       43.2  1   51.9  2.1  97    21.1  1.2  2.2
Sandy Bridge  45.5  1   44.6  2.4  79    13.2  1.2  2.4
Ivy Bridge    39    1   45.5  2.1  82    17    1.1  1.8

Figure : NAS 3.0 C version average benchmarking acceleration

On Core2, PCERE CG scalability evaluation is 24.2 times faster than with normal executions

SLIDE 20

Prediction model evaluation

PCERE prediction accuracy and benchmarking acceleration

              Core2  Nehalem  Sandy Bridge  Ivy Bridge
Accuracy      1.8%   2.9%     7.4%          2.8%
Acceleration  25.2   27.4     23.7          23.7

Figure : NAS 3.0 C version average prediction accuracy and benchmarking acceleration per architecture

SLIDE 21

Prediction model evaluation

Cross micro-architecture codelet replay

Capture and replay are micro-architecture agnostic: capture on Nehalem → replay on Sandy Bridge

Threads   1  2    4  8    16    32
Accuracy  3  2.3  3  7.8  11.5  17.6

Figure : NAS 3.0 C version, average cross-replay error (%) per thread count

Application  BT   EP   LU   FT   SP   CG    IS   MG
Accuracy     9.8  0.3  1.1  3.8  3.9  18.1  6.4  17

Figure : NAS 3.0 C version, average cross-replay error (%) per application

SLIDE 22

Prediction model evaluation

Limitation and future work

Limitations

No acceleration on applications with a single parallel region and no significant sequential parts (EP). Prediction error when the sequential time varies across thread configurations (IS).

Future work

Improve the warm-up strategy using CERE page-trace warm-up. Apply a clustering approach over codelets. Explore the OpenMP parameter space with codelets.

SLIDE 23

Conclusion

Conclusion

To be released with CERE at http://benchmark-subsetting.github.io/pcere/
Extract codelets once, replay them many times.
Cross micro-architecture and thread-configuration extraction and replay.
Accelerates strong scalability evaluation 25 times, with an average prediction error of 3.7%.

SLIDE 24

Backup

Codelet replay

Optimistic cache warm-up: assuming that the codelet working set is hot in the original run

define void @run__extracted__.omp_microtask.() {
entry:
  call void @load(...)
  ...                               ; arrange arguments
  call @__extracted__.omp_microtask.(...)
}

define internal void @__extracted__.omp_microtask.(...) {
newFuncRoot:
  call @__kmpc_fork_call @.omp_microtask.(...)
}

define internal void @.omp_microtask.(...) {
entry:
  ...
}

LLVM simplified IR

Updated main C code:

void main() {
    int i;
    int iteration = 1;
    for (i = 0; i < iteration; i++)
        run__extracted__.omp_microtask.();
}

extract + replay passes

SLIDE 25

Backup

Related work

Cross-platform performance prediction of parallel applications using partial execution. Yang, Leo T.; Ma, Xiaosong; Mueller, Frank. SC 2005.

Detecting Phases in Parallel Applications on Shared Memory Architectures. Perelman, Erez; Polito, Marzia; Bouguet, J-Y; Sampson, Jack; Calder, Brad; Dulong, Carole. IPDPS 2006.

BarrierPoint: Sampled Simulation of Multi-Threaded Applications. Carlson, Trevor E.; Heirman, Wim; Van Craeynest, Kenzo; Eeckhout, Lieven. ISPASS 2014.

Effective source-to-source outlining to support whole program empirical optimization. Liao, Chunhua; Quinlan, Daniel J.; Vuduc, Richard; Panas, Thomas. LCPC 2010.

SLIDE 26

Backup

Flags exploration

void main() {
    (...)
    Loop A
    Loop B
    (...)
}

[Workflow diagram]

Loop profiling and extraction: CERE extracts the loop IR with no optimization and captures the working sets of representative invocations (e.g. Loop A invocations 2 and 48).

Codelet optimization and replay: for each optimization point in the space to explore (e.g. -O2), the codelet is compiled, its representative invocations are replayed, and a prediction model reconstructs the loop execution time. This gives fast optimization-point evaluation.

SLIDE 27

Backup

Flags exploration

For each optimization sequence, only the relevant parts are replayed. Codelets are matched against over 200 optimization sequences.

Application  Median error  Average error
CG           2.0           3.0
EP           0.1           0.3
FT           3.6           5.0
IS           1.3           4.8
LU           1.2           2.2
MG           1.9           3.4
SP           2.1           2.6
RTM          5.3           5.8

Figure : Matching error percentage per application

Speed-up evaluation versus matching error

-O2 RTM evaluation is 237 times cheaper with codelets
