Piecewise Holistic Autotuning of Compiler and Runtime Parameters
Mihail Popov, Chadi Akel, William Jalby, Pablo de Oliveira Castro
University of Versailles, Exascale Computing Research
August 2016
Context
◮ Architecture, system, and application complexities increase
◮ Systems provide default, good-enough parameter configurations
  ◮ Compiler optimizations: -O2, -O3
  ◮ Thread affinity: scatter
◮ Outperforming the default parameters yields substantial benefits but is a costly process
◮ Execution-driven studies test different configurations
  ◮ Applications have redundancies
  ◮ Executing an application is time consuming
  ◮ The search space is huge
◮ Studies reduce the exploration cost by smartly navigating the search space
1 / 23
Piecewise Exploration
◮ Codelet Extractor and REplayer (CERE) decomposes applications into small pieces called codelets
◮ Each codelet maps a loop or a parallel region and is a standalone executable
◮ Extract codelets once
◮ Replay codelets instead of full applications under different configurations, avoiding redundancies
2 / 23
IS Motivating Example
int main() {
    create_seq();
    for (i = 0; i < 11; i++)
        rank();
}
◮ IS benchmark
◮ IS create_seq covers 40% of the execution time
◮ The IS rank sorting algorithm performs 11 invocations with the same execution time
◮ Piecewise exploration benefits:
  ◮ Avoid executing create_seq
  ◮ Evaluate a single invocation of rank
◮ IS rank and create_seq are not sensitive to the same optimizations
3 / 23
Outline
◮ Codelet Extractor and REplayer (CERE)
◮ Prediction Model
◮ Thread and Compiler Tuning
4 / 23
CERE Workflow
[Workflow diagram: applications go through region outlining at the LLVM IR level; region capture records the working sets and cache state (working-set memory dumps, codelet wrapper generation); codelet replay performs warmup + replay for fast performance prediction, retargeting codelets to different architectures or optimizations and changing the number of threads, affinity, or runtime parameters; invocation and codelet subsetting reduces the number of replays.]
CERE can extract codelets from:
◮ Hot loops
◮ OpenMP non-nested parallel regions
5 / 23
Codelet Capture and Replay
◮ Codelets are extracted at the LLVM Intermediate Representation (IR) level
◮ The user can recompile each codelet and replay it while changing compile options, runtime parameters, or the target system
◮ Performance-accurate replay requires capturing the cache state
◮ Semantically accurate replay requires capturing the memory
6 / 23
Memory Page Capture
Region to capture:
◮ Protect static and currently allocated process memory (/proc/self/maps)
◮ Intercept memory allocation functions with LD_PRELOAD (a = malloc(256);): (1) allocate memory, (2) protect it and return to the user program
◮ Segmentation fault handler, triggered by the first access to a protected page (a[i]++;): (1) dump the accessed memory to disk, (2) unlock the accessed page and return to the user program
◮ Capture accesses at page granularity: coarse but fast
◮ Small dump footprint: only touched pages are saved
7 / 23
Cache State Capture
◮ Cold: do not capture cache effects
◮ Working Set: warm the whole working set during replay (optimistic)
◮ Page Trace: before replay, warm the last N pages accessed to restore a cache state close to the original
8 / 23
CERE Cache Warmup
for (i=0; i < size; i++) a[i] += b[i];
[Diagram: the pages of a[] and b[] touched by the loop enter a FIFO of the most recently unprotected page addresses; this FIFO is the warmup page trace, and the oldest page (here page 20) is reprotected when it falls out of the FIFO.]
9 / 23
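The FIFO bookkeeping behind the page trace can be sketched as follows, assuming a depth of 4 for readability (TRACE_DEPTH and trace_page are illustrative names; CERE warms the last N pages recorded this way before replaying the codelet):

```c
#include <string.h>

#define TRACE_DEPTH 4   /* illustrative N; CERE warms the last N pages */

static long page_fifo[TRACE_DEPTH];   /* most recently unprotected pages */
static int fifo_len;

/* Record a page the moment the capture run unprotects it. Returns the
 * page that falls out of the FIFO and must be reprotected, so that a
 * later access to it faults again and re-enters the trace; returns -1
 * when nothing is evicted. At replay time, touching every page still in
 * the FIFO restores a cache state close to the original. */
long trace_page(long page) {
    long evicted = -1;
    if (fifo_len == TRACE_DEPTH) {
        evicted = page_fifo[0];                        /* oldest page */
        memmove(page_fifo, page_fifo + 1,
                (TRACE_DEPTH - 1) * sizeof page_fifo[0]);
        fifo_len--;
    }
    page_fifo[fifo_len++] = page;
    return evicted;
}
```

Reprotecting the evicted page keeps the trace honest: if the program touches that page again, it faults again and re-enters the FIFO as a recent page.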
OpenMP Regions Support
C code:

int main() {
    #pragma omp parallel
    {
        int p = omp_get_thread_num();
        printf("%d", p);
    }
}

Simplified LLVM IR (the Clang OpenMP front end outlines the parallel region into a microtask forked by the runtime, which is the thread execution model CERE captures):

define i32 @main() {
entry:
  ...
  call @__kmpc_fork_call @.omp_microtask.(...)
  ...
}

define internal void @.omp_microtask.(...) {
entry:
  %p = alloca i32, align 4
  %call = call i32 @omp_get_thread_num()
  store i32 %call, i32* %p, align 4
  %1 = load i32* %p, align 4
  call @printf(%1)
}
10 / 23
Selecting Representative Invocations
◮ A region can have thousands of invocations
◮ Performance differs across invocations due to different working sets
◮ Cluster invocations to select representatives

Figure: SPEC tonto make_ft@shell2.F90:1133 execution trace (cycles per invocation, with replayed representatives). 90% of NAS codelets can be reduced to four or fewer representatives.
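As an illustration of the idea, here is a sketch of representative selection over an invocation trace, assuming that invocations within a relative tolerance of an already-selected representative belong to the same performance class (select_representatives and the tolerance value are assumptions for the sketch; CERE uses a proper clustering of the trace):

```c
#include <math.h>

#define MAX_REPS 16

/* Scan the trace in order; an invocation within 'tol' (relative) of an
 * already-selected representative joins that class, otherwise it opens
 * a new class and becomes its representative. Returns the number of
 * classes; 'reps' receives the representative invocation indices. */
int select_representatives(const double *cycles, int n,
                           double tol, int *reps) {
    int nreps = 0;
    for (int i = 0; i < n; i++) {
        int found = 0;
        for (int r = 0; r < nreps; r++) {
            double ref = cycles[reps[r]];
            if (fabs(cycles[i] - ref) <= tol * ref) { found = 1; break; }
        }
        if (!found && nreps < MAX_REPS)
            reps[nreps++] = i;    /* first invocation of a new class */
    }
    return nreps;
}
```

Replaying only the representatives, weighted by their class sizes, predicts the whole region's execution time.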
11 / 23
Performance Classes Across Parameters
Figure: "MG resid" invocation execution times (megacycles), original trace vs. replay, under O0 + 4 threads and O3 + 2 threads.

◮ Use three invocations to predict the application execution time
◮ Parameters do not change the performance classes
12 / 23
NUMA Aware Warmup
◮ First-touch policy: a memory page is allocated on the NUMA domain of the thread that first touches it
◮ Detect the first thread that touches each memory page
◮ During warmup, the recorded NUMA domains are restored

Figure: "BT xsolve" replay cycles across thread counts, on 1 NUMA domain (compact) and 2 NUMA domains (scatter), comparing the original run, single-thread warmup, and NUMA-aware warmup.
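The first-touch placement the warmup relies on can be illustrated with a parallel-initialization sketch, assuming OpenMP and a Linux first-touch policy (alloc_first_touch is a hypothetical helper, not a CERE function):

```c
#include <stdlib.h>

/* Allocate and initialize an array in parallel. With a static schedule,
 * thread t first-touches the same chunk it will later compute on, so
 * those pages are placed on t's NUMA domain. A serial initialization
 * would put every page on the master thread's domain instead, which is
 * exactly the distortion a single-thread warmup introduces at replay. */
double *alloc_first_touch(size_t n) {
    double *a = malloc(n * sizeof *a);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = 0.0;
    return a;
}
```

Without OpenMP enabled the pragma is simply ignored and the loop runs serially, so the sketch stays correct either way.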
13 / 23
Test Architectures and Applications
◮ NAS SER and NPB OpenMP 3.0 C version, CLASS A
◮ Blackscholes from the PARSEC benchmarks
◮ Reverse Time Migration (RTM) proto-application
◮ Compiler: LLVM 3.4

                     Sandy Bridge   Ivy Bridge
CPU                  E5             i7-3770
Frequency (GHz)      2.7            3.4
Sockets              2              1
Cores per socket     8              4
Threads per core     2              2
L1 cache (KB)        32             32
L2 cache (KB)        256            256
L3 cache (MB)        20             8
RAM (GB)             64             16

Figure: Test architectures
14 / 23
Blackscholes Thread Affinities Exploration
◮ Different thread affinities to evaluate:
  ◮ sn: n scatter threads
  ◮ cn: n compact threads without hyperthreading
  ◮ hn: n compact threads with hyperthreading

Figure: PARSEC Blackscholes thread-configuration search (original vs. replay cycles for configurations 1, s2, c2, h2, ..., h32)
15 / 23
Outperforming Default Thread Configuration
Figure: NAS thread-configuration tuning. Speedup of the compact (c8) and hyperthreaded (h32) configurations over the standard s16, original vs. replay, for BT, CG, EP, FT, IS, LU, MG, and SP.
16 / 23
Autotuning LLVM Middle End Optimizations
◮ The LLVM middle end offers more than 50 optimization passes
◮ Codelet replay enables fast per-region optimization tuning

Figure: "SP ysolve" codelet on Ivy Bridge (3.4 GHz). 1000 schedules of random pass combinations, built from the O3 passes, explored; original vs. replay cycles with O3 as reference.

CERE is 149× cheaper than running the full benchmark (27× cheaper when tuning codelets covering 75% of SP).
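One way to draw such a random schedule is a Fisher-Yates shuffle over a pass list. A sketch, assuming a small, truncated subset of pass names (the real exploration permutes and samples the full O3 pass list):

```c
#include <stdlib.h>

/* In-place Fisher-Yates shuffle of a pass-name list; each call with a
 * fresh seed yields one candidate schedule to recompile and replay. */
void shuffle_passes(const char **passes, int n, unsigned seed) {
    srand(seed);
    for (int i = n - 1; i > 0; i--) {
        int j = rand() % (i + 1);       /* uniform position in [0, i] */
        const char *tmp = passes[i];
        passes[i] = passes[j];
        passes[j] = tmp;
    }
}
```

Each shuffled list is then handed to the compiler's pass manager, and only the codelet, not the full benchmark, is replayed to time it.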
17 / 23
Hybridization
Flag selection:
1. Extract application hotspots into codelets
2. Replay the codelets with each selected flag set to test (O3, O2, avx, sse, ...)
3. Assign to each codelet the flag set that gave the best performance

Hybridization:
1. Extract the hotspots into new files
2. Compile each file with its respective best flag set
3. Link the hybrid optimized binary
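The flag-selection step amounts to a per-region argmin, and the payoff over a single whole-program flag choice shows up in a tiny sketch (the 2×2 shape and cycle values are made up; by construction the hybrid total can never exceed the monolithic one):

```c
#define NR 2   /* regions, e.g. rhs and zsolve */
#define NF 2   /* candidate flag sets, e.g. O2 and O3 */

/* Total cycles when each region keeps its own best flag set. */
double hybrid_total(double c[NR][NF]) {
    double total = 0.0;
    for (int r = 0; r < NR; r++) {
        double best = c[r][0];
        for (int f = 1; f < NF; f++)
            if (c[r][f] < best) best = c[r][f];
        total += best;                 /* per-region argmin */
    }
    return total;
}

/* Total cycles when a single flag set must serve every region. */
double monolithic_total(double c[NR][NF]) {
    double best = -1.0;
    for (int f = 0; f < NF; f++) {
        double total = 0.0;
        for (int r = 0; r < NR; r++)
            total += c[r][f];
        if (best < 0.0 || total < best) best = total;
    }
    return best;
}
```

Whenever two regions prefer different flags, as SP's regions do, the hybrid strictly beats every monolithic choice.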
18 / 23
Hybrid Compilation over the NAS
◮ Four parallel regions of SP cover 93% of the execution time
◮ No single pass sequence is best for all the regions
◮ Codelets explore parameters for each region separately
◮ Produce a hybrid where each region is compiled with its best sequence

Figure: Hybrid compilation speeds up SP OpenMP by 1.06× (gigacycles with 8 compact threads for rhs, zsolve, xsolve+ysolve, and total; comparing hybrid, O2, O3, rhs-best, and z-best).
19 / 23
Piecewise Exploration Benefits
Figure: Piecewise exploration of the NAS SER (BT, IS, SP): speedup over -O3 of the hybrid (original and replay exploration) against the monolithic best and the standard flags, and the cost of piecewise exploration versus the overhead of monolithic exploration, in compiler optimization sequences, for each region.
20 / 23
Codelets Tuning Results
         Compiler passes                       Thread affinity
     #Regions  Accuracy (%)  Acceleration   #Regions  Accuracy (%)  Acceleration
BT       3        98.73         79.63           4        95.24          5.28
CG       2        98.65          3.39           2        79.48          1.23
FT       5        98.3           2.6            5        90.71          2.17
IS       3        96.64          1.26           2        94.85          1.04
SP       6        98.78         68.9            4        97.66         20.07
LU       7        95.04          8.49           2        99.00         12.64
EP       1        83.08          0.36           1        99.31          0.25
MG       4        97.22          0.28           4        93.04          0.45
AVG               95.8          20.61                    93.66          5.39

◮ NAS SER and OpenMP benchmarks: average speedup of 1.08×
◮ Tuning a single codelet is 13× faster than tuning the full application
◮ Codelet average accuracy is 94.6%
◮ RTM tuning through a codelet is 200× faster and achieves a speedup of 1.11×
21 / 23
Related Work
◮ Kulkarni et al. ”Improving Both the Performance Benefits and
Speed of Optimization Phase Sequence Searches” (ACM Sigplan Notices 2010)
◮ Fursin et al. ”Quick and practical run-time evaluation of
multiple program optimizations” (HiPEAC 2007)
◮ Fursin et al. ”Milepost gcc: Machine learning enabled
self-tuning compiler” (Int. J. Parallel Prog. 2011)
◮ Purini et al. ”Finding good optimization sequences covering
program space” (TACO 2013)
22 / 23
Conclusion
◮ Piecewise tuning with codelets:
  ◮ Accelerates the exploration process
  ◮ Improves the benefits
◮ Discussion:
  ◮ Some regions are not independent: LU jacu and jacld
  ◮ Piecewise tuning is sensitive to the data set
◮ Future Work:
  ◮ Combine codelet tuning with genetic algorithms
  ◮ Use a clustering approach over codelets
  ◮ Improve the parallel warmup strategy
◮ https://benchmark-subsetting.github.io/cere/
23 / 23