Friends Don’t Let Friends Tune Code Jeffrey K. Hollingsworth - PowerPoint PPT Presentation

Friends Don’t Let Friends Tune Code
Jeffrey K. Hollingsworth, University of Maryland, hollings@cs.umd.edu
Ananta Tiwari (UMD)

About the Title

Why Automate Performance Tuning?
- Too many parameters that impact performance.
- Optimal performance for a given system depends on:
  - Details of the processor
  - Details of the inputs (workload)
  - Which nodes are assigned to the program
  - Other things running on the system
- Parameters come from:
  - User code
  - Libraries
  - Compiler choices
Automated parameter tuning can be used for adaptive tuning in complex software.

Automated Performance Tuning
- Goal: Maximize achieved performance
- Problems:
  - Large number of parameters to tune
  - Shape of objective function unknown
  - Multiple libraries and coupled applications
  - Analytical model may not be available
- Requirements:
  - Runtime tuning for long-running programs
  - Don’t try too many configurations
  - Avoid gradients

Active Harmony
- Runtime performance optimization
  - Can also support training runs
- Automatic library selection (code)
  - Monitor library performance
  - Switch library if necessary
- Automatic performance tuning (parameter)
  - Monitor system performance
  - Adjust runtime parameters
- Hooks for compiler frameworks
  - Working to integrate USC/ISI CHiLL
  - Looking at others too

Parallel Rank Ordering Algorithm
- All points of the simplex move except the best one.
- Computations can be done in parallel.
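The move described on this slide can be sketched in C. This is a minimal illustration only, assuming the standard simplex reflection r = 2 * best - p for every non-best point; the names `pro_reflect`, `best_index`, `DIM`, and `NPTS`, and the toy `cost` function, are hypothetical and are not taken from Active Harmony.

```c
#include <stddef.h>

#define DIM 2   /* number of tunable parameters (assumed) */
#define NPTS 3  /* simplex size, DIM + 1 (assumed) */

/* Toy objective standing in for a measured run time (hypothetical). */
static double cost(const double p[DIM]) {
    double dx = p[0] - 3.0, dy = p[1] - 1.0;
    return dx * dx + dy * dy;
}

/* Return the index of the lowest-cost point in the simplex. */
static size_t best_index(double simplex[NPTS][DIM]) {
    size_t best = 0;
    for (size_t i = 1; i < NPTS; i++)
        if (cost(simplex[i]) < cost(simplex[best]))
            best = i;
    return best;
}

/* Reflect every point except the best one through the best point:
 * r = 2 * best - p. The best point itself stays where it is.
 * No reflected point depends on another, so in a real tuner the
 * reflected configurations could be evaluated concurrently. */
static void pro_reflect(double simplex[NPTS][DIM]) {
    size_t b = best_index(simplex);
    for (size_t i = 0; i < NPTS; i++) {
        if (i == b)
            continue;
        for (size_t d = 0; d < DIM; d++)
            simplex[i][d] = 2.0 * simplex[b][d] - simplex[i][d];
    }
}
```

Calling `pro_reflect` on the simplex {(0,0), (2,0), (0,2)} leaves the best point (2,0) in place and moves the other two to (4,0) and (4,-2). A full search would then compare the reflected points with the old ones and decide whether to expand or shrink the simplex; that logic is omitted here.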

Application Parameter Tuning: GS2
- Physics application (DOE SciDAC project)
- Developed to study low-frequency turbulence in magnetized plasma
- Performance (execution time) improvement by changing layout and three parameters (negrid, ntheta, nodes)
- Data layout analysis (benchmarking runs)
- 55.06 s → 16.25 s (3.4x faster, without collision)
- 71.08 s → 31.55 s (2.3x faster, with collision)
[Figure: execution time (seconds) by data layout (lexys, lxyes, lyxes, yxels, yxles) for Linux 64x2, Seaborg 16x8, Seaborg 8x16, and Seaborg 13x10.]

Tool Integration: CHiLL + Active Harmony
Generate and evaluate different optimizations that would have been prohibitively time consuming for a programmer to explore manually.
Ananta Tiwari, Chun Chen, Jacqueline Chame, Mary Hall, Jeffrey K. Hollingsworth, “A Scalable Auto-tuning Framework for Compiler Optimization,” IPDPS 2009, Rome, May 2009.

SMG2000 Optimization

Outlined Code

    for (si = 0; si < stencil_size; si++)
      for (kk = 0; kk < hypre__mz; kk++)
        for (jj = 0; jj < hypre__my; jj++)
          for (ii = 0; ii < hypre__mx; ii++)
            rp[((ri+ii)+(jj*hypre__sy3))+(kk*hypre__sz3)] -=
              ((Ap_0[((ii+(jj*hypre__sy1))+(kk*hypre__sz1))+(((A->data_indices)[i])[si])])*
               (xp_0[((ii+(jj*hypre__sy2))+(kk*hypre__sz2))+((*dxp_s)[si])]));

CHiLL Transformation Recipe

    permute([2,3,1,4])
    tile(0,4,TI)
    tile(0,3,TJ)
    tile(0,3,TK)
    unroll(0,6,US)
    unroll(0,7,UI)

Constraints on Search

    0 ≤ TI, TJ, TK ≤ 122
    0 ≤ UI ≤ 16
    0 ≤ US ≤ 10
    compilers ∈ {gcc, icc}

Search space: 122^3 x 16 x 10 x 2 = 581M points
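The search-space figure on this slide is the product of the number of choices for each parameter. A minimal C sketch of that arithmetic follows; the function name `search_space_size` and the array `smg2000_choices` are hypothetical, and the per-parameter counts (122, 122, 122, 16, 10, 2) are taken as given from the slide's own product.

```c
#include <stdint.h>

/* Multiply the number of choices per parameter to get the size of the
 * full search space. */
static uint64_t search_space_size(const uint64_t *choices, int n) {
    uint64_t total = 1;
    for (int i = 0; i < n; i++)
        total *= choices[i];
    return total;
}

/* Choice counts as used in the slide's product: TI, TJ, TK, UI, US,
 * and the compiler (gcc or icc). */
static const uint64_t smg2000_choices[] = {122, 122, 122, 16, 10, 2};
```

With these counts, `search_space_size(smg2000_choices, 6)` gives 581,071,360, which the slide rounds to 581M points.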

SMG2000 Search and Results
- Parallel search evaluates 490 points and converges in 20 steps
- Selected parameters: TI=122, TJ=106, TK=56, UI=8, US=3, Comp=gcc
- Performance gain on residual computation: 2.37x
- Performance gain on full app: 27.23% improvement

Auto Tuning for Different Platforms
- Fixed parameters:
  - Code: PMLB
  - Processors: 64
- Study how parameters differ for the two systems
- Use Harmony-determined parameters from one system
- Run a post-line run (parameters fixed for the entire run) on another

                Speedup of post-line run      Speedup of post-line run
                on UMD Cluster                on Carver Cluster
Problem Size    UMD Best      Carver Best     Carver Best    UMD Best
                Config        Config          Config         Config
384^3           1.44          1.19            1.32           1.30
448^3           1.42          1.13            1.51           1.38
512^3           1.30          1.26            1.34           1.30
576^3           1.38          1.16            1.42           1.39

Autotuning PFloTran (Trisolve)

Outlined Code

    #define SIZE 15
    void forward_solve_kernel( … ) {
      ….
      for (cntr = SIZE - 1; cntr >= 0; cntr--) {
        x[cntr] = t + bs * (*vi ++);
        for (j=0; j<bs; j++)
          for (k=0; k<bs; k++)
            s[k] -= v[cntr][bs* j+k] * x[cntr][j];
      }
    }

CHiLL Transformation Recipe

    original()
    known(bs > 14)
    known(bs < 16)
    unroll(1,2,u1)
    unroll(1,3,u2)

Constraints on Search

    0 <= u1 <= 16
    0 <= u2 <= 16
    compilers ∈ {gnu, pathscale, cray, pgi}

Search space: 17 x 17 x 4 = 1156 points
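A search space this small can be enumerated exhaustively. The C sketch below walks the 17 x 17 grid of (u1, u2) values for one compiler and keeps the fastest pair. It is an illustration under stated assumptions: `exhaustive_search`, `search_result`, and `time_fn` are hypothetical names, and `fake_time` is a made-up cost function standing in for generating, compiling, and timing a real variant.

```c
#include <float.h>

#define U_MAX 16  /* upper bound on both unroll factors, from the slide */

/* Callback that returns the run time for one (u1, u2) setting. */
typedef double (*time_fn)(int u1, int u2);

typedef struct {
    int u1, u2;     /* best unroll factors found */
    double time;    /* run time at the best setting */
    int evaluated;  /* number of settings tried */
} search_result;

/* Try every (u1, u2) pair in [0, U_MAX] x [0, U_MAX], keep the fastest. */
static search_result exhaustive_search(time_fn measure) {
    search_result best = {0, 0, DBL_MAX, 0};
    for (int u1 = 0; u1 <= U_MAX; u1++) {
        for (int u2 = 0; u2 <= U_MAX; u2++) {
            double t = measure(u1, u2);
            best.evaluated++;
            if (t < best.time) {
                best.u1 = u1;
                best.u2 = u2;
                best.time = t;
            }
        }
    }
    return best;
}

/* Made-up stand-in for timing one variant; its minimum is at (4, 8). */
static double fake_time(int u1, int u2) {
    return 0.5 + 0.01 * ((u1 - 4) * (u1 - 4) + (u2 - 8) * (u2 - 8));
}
```

`exhaustive_search(fake_time)` evaluates 289 settings, and repeating it for each of the four compilers accounts for the 1156 points on the slide.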

PFloTran: Trisolve Results

Compiler    Original    Active Harmony                 Exhaustive
            Time        Time    (u1,u2)   Speedup      Time    (u1,u2)   Speedup
pathscale   0.58        0.32    (3,11)    1.81         0.30    (3,15)    1.93
gnu         0.71        0.47    (5,13)    1.51         0.46    (5,7)     1.54
pgi         0.90        0.53    (5,3)     1.70         0.53    (5,3)     1.70
cray        1.13        0.70    (15,5)    1.61         0.69    (15,15)   1.63
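The Speedup columns are consistent with the original time divided by the tuned time. A minimal C sketch of that calculation follows; the function name `speedup` and its handling of a non-positive tuned time are assumptions made for illustration.

```c
/* Speedup of a tuned run relative to the original: original / tuned.
 * Returns 0.0 if the tuned time is not positive. */
static double speedup(double original_time, double tuned_time) {
    if (tuned_time <= 0.0)
        return 0.0;
    return original_time / tuned_time;
}
```

For the pathscale row, `speedup(0.58, 0.32)` is about 1.81, matching the Active Harmony column. The table was presumably computed from unrounded timings, so recomputing from the rounded times shown may differ in the last digit.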

Compiling New Code Variants at Runtime
[Figure: Harmony timeline and application execution timeline. At each search step (SS 1, SS 2, …, SS N), Active Harmony passes the outlined code section and transformation parameters to a code server, where code generation tools and compilers build the variants (v1s.so, v2s.so, …, vNs.so). A READY signal is sent to the application, whose timeline includes a stall_phase, and performance measurements (PM 1, PM 2, …, PM N) flow back to Harmony.]

Online Code Generation Results
- Two platforms
  - umd-cluster (64 nodes, Intel Xeon dual-core nodes), Myrinet interconnect
  - Carver (1120 compute nodes, Intel Nehalem, two quad-core processors), InfiniBand interconnect
- Code servers
  - UMD-cluster: local idle machines
  - Carver: outsourced to a machine at UMD
- Codes
  - Poisson Solver
  - PMLB (Parallel Multi-block Lattice Boltzmann)
  - SMG2000

How Many Nodes to Generate Code?
- Fixed parameters:
  - Code: Poisson solver
  - Problem size (1024^3)
  - Number of processors (128)
- Up to 128 new variants are generated at each search step

Code Servers   Search Steps+   Stalled Steps+   Variations Evaluated+   Speedup+
1              6*              46               502                     0.75
2              17*             13               710                     0.97
4              27              7.2              928                     1.04
8              23              4.5              818                     1.23
12             22              4.1              833                     1.21
16             26              3.6              931                     1.24

* Search did not complete before application terminated
+ Mean of 5 runs

Conclusions and Future Work
- Ongoing Work
  - More end-to-end application studies
  - Continued evaluation of online code generation
- Conclusions
  - Auto tuning can be done at many levels
    - Offline, using training runs (choices fixed for an entire run)
      - Compiler options
      - Programmer-supplied per-run tunable parameters
      - Compiler transformations
    - Online, training or production (choices change during execution)
      - Programmer-supplied per-timestep parameters
      - Compiler transformations
  - It Works!
    - Real programs run faster
