Piecewise Holistic Autotuning of Compiler and Runtime Parameters
Mihail Popov, Chadi Akel, William Jalby, Pablo de Oliveira Castro
University of Versailles, Exascale Computing Research
August 2016
Context
◮ Architecture, system, and application complexities increase
◮ Systems provide default, good-enough parameter configurations
  ◮ Compiler optimizations: -O2, -O3
  ◮ Thread affinity: scatter
◮ Outperforming the default parameters yields substantial benefits but is a costly process
◮ Execution-driven studies test different configurations
  ◮ Applications have redundancies
  ◮ Executing an application is time consuming
  ◮ The search space is huge
◮ Studies reduce the exploration cost by smartly navigating the search space
1 / 23
Piecewise Exploration
◮ Codelet Extractor and REplayer (CERE) decomposes applications into small pieces called codelets
◮ Each codelet maps a loop or a parallel region and is a standalone executable
◮ Extract codelets once
◮ Replay codelets instead of full applications under different configurations, avoiding redundancies
2 / 23
IS Motivating Example
int main() {
    create_seq();
    for (i = 0; i < 11; i++)
        rank();
}
◮ IS benchmark
◮ IS create_seq covers 40% of the execution time
◮ The IS rank sorting algorithm performs 11 invocations with the same execution time
◮ Piecewise exploration benefits:
  ◮ Avoid executing create_seq
  ◮ Evaluate a single invocation of rank
◮ IS rank and create_seq are not sensitive to the same optimizations
3 / 23
Outline
◮ Codelet Extractor and REplayer (CERE)
◮ Prediction Model
◮ Thread and Compiler Tuning
4 / 23
CERE Workflow
[Workflow diagram: applications go through region outlining at the LLVM IR level; region capture records the working sets and cache state (working-set memory dumps, codelet wrapper generation); codelet replay performs warmup + replay for fast performance prediction, retargeting codelets to different architectures or optimizations and changing the number of threads, affinity, or runtime parameters; invocation and codelet subsetting reduces the number of replays.]
CERE can extract codelets from:
◮ Hot loops
◮ OpenMP non-nested parallel regions
5 / 23
Codelet Capture and Replay
◮ Codelets are extracted at the LLVM Intermediate Representation (IR) level
◮ The user can recompile each codelet and replay it while changing compile options, runtime parameters, or the target system
◮ Performance-accurate replay requires capturing the cache state
◮ Semantically accurate replay requires capturing the memory
6 / 23
Memory Page Capture
Region to capture:
◮ Protect static and currently allocated process memory (/proc/self/maps)
◮ Intercept memory allocation functions with LD_PRELOAD (a = malloc(256);): (1) allocate memory, (2) protect it and return to the user program
◮ Segmentation fault handler, triggered by the first access to a protected page (a[i]++;): (1) dump the accessed memory to disk, (2) unlock the accessed page and return to the user program
◮ Capture accesses at page granularity: coarse but fast
◮ Small dump footprint: only touched pages are saved
7 / 23
Cache State Capture
◮ Cold: do not capture cache effects
◮ Working Set: warm the whole working set during replay (optimistic)
◮ Page Trace: before replay, warm the last N pages accessed to restore a cache state close to the original
8 / 23
CERE Cache Warmup
for (i=0; i < size; i++) a[i] += b[i];
[Diagram: the pages of a[] and b[] touched by the loop enter a FIFO of the most recently unprotected page addresses; this FIFO is the warmup page trace, and the oldest page (here page 20) is reprotected when it falls out of the FIFO.]
9 / 23
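The FIFO bookkeeping behind the page trace can be sketched as follows, assuming a depth of 4 for readability (TRACE_DEPTH and trace_page are illustrative names; CERE warms the last N pages recorded this way before replaying the codelet):

```c
#include <string.h>

#define TRACE_DEPTH 4   /* illustrative N; CERE warms the last N pages */

static long page_fifo[TRACE_DEPTH];   /* most recently unprotected pages */
static int fifo_len;

/* Record a page the moment the capture run unprotects it. Returns the
 * page that falls out of the FIFO and must be reprotected, so that a
 * later access to it faults again and re-enters the trace; returns -1
 * when nothing is evicted. At replay time, touching every page still in
 * the FIFO restores a cache state close to the original. */
long trace_page(long page) {
    long evicted = -1;
    if (fifo_len == TRACE_DEPTH) {
        evicted = page_fifo[0];                        /* oldest page */
        memmove(page_fifo, page_fifo + 1,
                (TRACE_DEPTH - 1) * sizeof page_fifo[0]);
        fifo_len--;
    }
    page_fifo[fifo_len++] = page;
    return evicted;
}
```

Reprotecting the evicted page keeps the trace honest: if the program touches that page again, it faults again and re-enters the FIFO as a recent page.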
OpenMP Regions Support
C code:

int main() {
    #pragma omp parallel
    {
        int p = omp_get_thread_num();
        printf("%d", p);
    }
}

Simplified LLVM IR (the Clang OpenMP front end outlines the parallel region into a microtask forked by the runtime, which is the thread execution model CERE captures):

define i32 @main() {
entry:
  ...
  call @__kmpc_fork_call @.omp_microtask.(...)
  ...
}

define internal void @.omp_microtask.(...) {
entry:
  %p = alloca i32, align 4
  %call = call i32 @omp_get_thread_num()
  store i32 %call, i32* %p, align 4
  %1 = load i32* %p, align 4
  call @printf(%1)
}
10 / 23
Selecting Representative Invocations
◮ A region can have thousands of invocations
◮ Performance differs across invocations due to different working sets
◮ Cluster invocations to select representatives

Figure: SPEC tonto make_ft@shell2.F90:1133 execution trace (cycles per invocation, with replayed representatives). 90% of NAS codelets can be reduced to four or fewer representatives.
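As an illustration of the idea, here is a sketch of representative selection over an invocation trace, assuming that invocations within a relative tolerance of an already-selected representative belong to the same performance class (select_representatives and the tolerance value are assumptions for the sketch; CERE uses a proper clustering of the trace):

```c
#include <math.h>

#define MAX_REPS 16

/* Scan the trace in order; an invocation within 'tol' (relative) of an
 * already-selected representative joins that class, otherwise it opens
 * a new class and becomes its representative. Returns the number of
 * classes; 'reps' receives the representative invocation indices. */
int select_representatives(const double *cycles, int n,
                           double tol, int *reps) {
    int nreps = 0;
    for (int i = 0; i < n; i++) {
        int found = 0;
        for (int r = 0; r < nreps; r++) {
            double ref = cycles[reps[r]];
            if (fabs(cycles[i] - ref) <= tol * ref) { found = 1; break; }
        }
        if (!found && nreps < MAX_REPS)
            reps[nreps++] = i;    /* first invocation of a new class */
    }
    return nreps;
}
```

Replaying only the representatives, weighted by their class sizes, predicts the whole region's execution time.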
11 / 23
Performance Classes Across Parameters
Figure: "MG resid" invocation execution times (megacycles), original trace vs. replay, under O0 + 4 threads and O3 + 2 threads.

◮ Use three invocations to predict the application execution time
◮ Parameters do not change the performance classes
12 / 23
NUMA Aware Warmup
◮ First-touch policy: a memory page is allocated on the NUMA domain of the thread that first touches it
◮ Detect the first thread that touches each memory page
◮ During warmup, the recorded NUMA domains are restored

Figure: "BT xsolve" replay cycles across thread counts, on 1 NUMA domain (compact) and 2 NUMA domains (scatter), comparing the original run, single-thread warmup, and NUMA-aware warmup.
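The first-touch placement the warmup relies on can be illustrated with a parallel-initialization sketch, assuming OpenMP and a Linux first-touch policy (alloc_first_touch is a hypothetical helper, not a CERE function):

```c
#include <stdlib.h>

/* Allocate and initialize an array in parallel. With a static schedule,
 * thread t first-touches the same chunk it will later compute on, so
 * those pages are placed on t's NUMA domain. A serial initialization
 * would put every page on the master thread's domain instead, which is
 * exactly the distortion a single-thread warmup introduces at replay. */
double *alloc_first_touch(size_t n) {
    double *a = malloc(n * sizeof *a);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = 0.0;
    return a;
}
```

Without OpenMP enabled the pragma is simply ignored and the loop runs serially, so the sketch stays correct either way.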
13 / 23
Test Architectures and Applications
◮ NAS SER and NPB OpenMP 3.0 C version, CLASS A
◮ Blackscholes from the PARSEC benchmarks
◮ Reverse Time Migration (RTM) proto-application
◮ Compiler: LLVM 3.4

                     Sandy Bridge   Ivy Bridge
CPU                  E5             i7-3770
Frequency (GHz)      2.7            3.4
Sockets              2              1
Cores per socket     8              4
Threads per core     2              2
L1 cache (KB)        32             32
L2 cache (KB)        256            256
L3 cache (MB)        20             8
RAM (GB)             64             16

Figure: Test architectures
14 / 23
Blackscholes Thread Affinities Exploration
◮ Different thread affinities to evaluate:
  ◮ sn: n scatter threads
  ◮ cn: n compact threads without hyperthreading
  ◮ hn: n compact threads with hyperthreading

Figure: PARSEC Blackscholes thread-configuration search (original vs. replay cycles for configurations 1, s2, c2, h2, ..., h32)
15 / 23
Outperforming Default Thread Configuration
Figure: NAS thread-configuration tuning. Speedup of the compact (c8) and hyperthreaded (h32) configurations over the standard s16, original vs. replay, for BT, CG, EP, FT, IS, LU, MG, and SP.
16 / 23
Autotuning LLVM Middle End Optimizations
◮ The LLVM middle end offers more than 50 optimization passes
◮ Codelet replay enables fast per-region optimization tuning

Figure: "SP ysolve" codelet on Ivy Bridge (3.4 GHz). 1000 schedules of random pass combinations, built from the O3 passes, explored; original vs. replay cycles with O3 as reference.

CERE is 149× cheaper than running the full benchmark (27× cheaper when tuning codelets covering 75% of SP).
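One way to draw such a random schedule is a Fisher-Yates shuffle over a pass list. A sketch, assuming a small, truncated subset of pass names (the real exploration permutes and samples the full O3 pass list):

```c
#include <stdlib.h>

/* In-place Fisher-Yates shuffle of a pass-name list; each call with a
 * fresh seed yields one candidate schedule to recompile and replay. */
void shuffle_passes(const char **passes, int n, unsigned seed) {
    srand(seed);
    for (int i = n - 1; i > 0; i--) {
        int j = rand() % (i + 1);       /* uniform position in [0, i] */
        const char *tmp = passes[i];
        passes[i] = passes[j];
        passes[j] = tmp;
    }
}
```

Each shuffled list is then handed to the compiler's pass manager, and only the codelet, not the full benchmark, is replayed to time it.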
17 / 23
Hybridization
Flag selection:
1. Extract application hotspots into codelets
2. Replay the codelets with each selected flag set to test (O3, O2, avx, sse, ...)
3. Assign to each codelet the flag set that gave the best performance

Hybridization:
1. Extract the hotspots into new files
2. Compile each file with its respective best flag set
3. Link the hybrid optimized binary
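The flag-selection step amounts to a per-region argmin, and the payoff over a single whole-program flag choice shows up in a tiny sketch (the 2×2 shape and cycle values are made up; by construction the hybrid total can never exceed the monolithic one):

```c
#define NR 2   /* regions, e.g. rhs and zsolve */
#define NF 2   /* candidate flag sets, e.g. O2 and O3 */

/* Total cycles when each region keeps its own best flag set. */
double hybrid_total(double c[NR][NF]) {
    double total = 0.0;
    for (int r = 0; r < NR; r++) {
        double best = c[r][0];
        for (int f = 1; f < NF; f++)
            if (c[r][f] < best) best = c[r][f];
        total += best;                 /* per-region argmin */
    }
    return total;
}

/* Total cycles when a single flag set must serve every region. */
double monolithic_total(double c[NR][NF]) {
    double best = -1.0;
    for (int f = 0; f < NF; f++) {
        double total = 0.0;
        for (int r = 0; r < NR; r++)
            total += c[r][f];
        if (best < 0.0 || total < best) best = total;
    }
    return best;
}
```

Whenever two regions prefer different flags, as SP's regions do, the hybrid strictly beats every monolithic choice.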
18 / 23
Hybrid Compilation over the NAS
◮ Four parallel regions of SP cover 93% of the execution time
◮ No single pass sequence is best for all the regions
◮ Codelets explore parameters for each region separately
◮ Produce a hybrid where each region is compiled with its best sequence

Figure: Hybrid compilation speeds up SP OpenMP by 1.06× (gigacycles with 8 compact threads for rhs, zsolve, xsolve+ysolve, and total; comparing hybrid, O2, O3, rhs-best, and z-best).
19 / 23
Piecewise Exploration Benefits
Figure: Piecewise exploration of the NAS SER (BT, IS, SP): speedup over -O3 of the hybrid (original and replay exploration) against the monolithic best and the standard flags, and the cost of piecewise exploration versus the overhead of monolithic exploration, in compiler optimization sequences, for each region.
20 / 23
Codelets Tuning Results
         Compiler passes                       Thread affinity
     #Regions  Accuracy (%)  Acceleration   #Regions  Accuracy (%)  Acceleration
BT       3        98.73         79.63           4        95.24          5.28
CG       2        98.65          3.39           2        79.48          1.23
FT       5        98.3           2.6            5        90.71          2.17
IS       3        96.64          1.26           2        94.85          1.04
SP       6        98.78         68.9            4        97.66         20.07
LU       7        95.04          8.49           2        99.00         12.64
EP       1        83.08          0.36           1        99.31          0.25
MG       4        97.22          0.28           4        93.04          0.45
AVG               95.8          20.61                    93.66          5.39

◮ NAS SER and OpenMP benchmarks: average speedup of 1.08×
◮ Tuning a single codelet is 13× faster than tuning the full application
◮ Codelet average accuracy is 94.6%
◮ RTM tuning through a codelet is 200× faster and achieves a speedup of 1.11×
21 / 23
Related Work
◮ Kulkarni et al. ”Improving Both the Performance Benefits and
Speed of Optimization Phase Sequence Searches” (ACM Sigplan Notices 2010)
◮ Fursin et al. ”Quick and practical run-time evaluation of
multiple program optimizations” (HiPEAC 2007)
◮ Fursin et al. ”Milepost gcc: Machine learning enabled
self-tuning compiler” (Int. J. Parallel Prog. 2011)
◮ Purini et al. ”Finding good optimization sequences covering
program space” (TACO 2013)
22 / 23
Conclusion
◮ Piecewise tuning with codelets:
  ◮ Accelerates the exploration process
  ◮ Improves the benefits
◮ Discussion:
  ◮ Some regions are not independent: LU jacu and jacld
  ◮ Piecewise tuning is sensitive to the data set
◮ Future Work:
  ◮ Combine codelet tuning with genetic algorithms
  ◮ Use a clustering approach over codelets
  ◮ Improve the parallel warmup strategy
◮ https://benchmark-subsetting.github.io/cere/
23 / 23