Unification of static analyses and runtime measurements for improving vectorization
Ashay Rane, Rakesh Krishnaiyer, Chris Newburn, James Browne, Leo Fialho and Zakhar Matveev
4th August, 2014
Petascale Tools Workshop

Overview of this work
Goal: To increase the applicability and efficiency of vectorization by:
1. Understanding compiler vectorization messages.
2. Finding what information the compiler is missing.
3. Gathering and analyzing runtime measurements.
4. Feeding runtime information back to the compiler.

Why vectorization?
• SIMD vector lengths have increased, hence a performance boost.
• Improves energy efficiency of the processor.
• The compiler has inherent limitations because it lacks runtime information.
• Lots of headroom available to improve vectorization.

Time taken by non-vectorized loops
Application         Time
heartwall            7.43%
euler               12.42%
kmeans              19.54%
backprop            32.52%
leukocyte           35.01%
lavaMD              37.42%
srad_v1             48.45%
pre_euler_double    71.60%
pre_euler           75.94%
euler_double        78.99%
streamcluster       85.58%

Causes of poor vectorization
Limited information is available at compile time, so the compiler assumes:
• Inter-iteration dependence.
• Varying trip count (non-countable loop).
• Temporal array references.
• Misaligned loads and stores.

Reasons for poor vectorization
Example: Rodinia LavaMD.
• Hot function kernel_cpu(box* b, fp* qv, ...) defined in kernel_cpu.c.
• The compiler does not know the caller's arguments when compiling kernel_cpu.c.
• It assumes pointers b and qv may overlap in memory.
• It concludes that a vector dependence may exist.

Reasons for poor vectorization
Example: NAS CG.
• Unknown loop trip count: for (k = rowstr[j]; k < rowstr[j+1]; k++) { }
• Double indirection in loop body: suml += a[k]*p[colidx[k]];
• Compiler generates gather and scatter instructions for each iteration.
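For context, the two fragments quoted on this slide fit together roughly as follows. This is a minimal sketch of a sparse matrix-vector product in compressed sparse row form, not the NAS CG source itself: the names a, p, colidx, rowstr and suml come from the slide, while csr_matvec, nrows and y are hypothetical names added for illustration.

```c
/* Sketch only: y = A * p for a sparse matrix A in compressed sparse row form.
 * csr_matvec, nrows and y are hypothetical; the other names follow the slide. */
void csr_matvec(int nrows, const int *rowstr, const int *colidx,
                const double *a, const double *p, double *y)
{
    for (int j = 0; j < nrows; j++) {
        double suml = 0.0;
        /* The trip count is rowstr[j+1] - rowstr[j], known only at run time. */
        for (int k = rowstr[j]; k < rowstr[j + 1]; k++) {
            /* p is indexed through colidx, so this load is indirect. */
            suml += a[k] * p[colidx[k]];
        }
        y[j] = suml;
    }
}
```

Both the loop bounds and the index into p depend on data that the compiler cannot see, which is why runtime measurements of trip counts and strides are informative for loops of this shape.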

Reasons for poor vectorization
Example: NBody.
• Operates on dynamically allocated (malloc()ed) arrays.
• The memory allocator may place objects wherever it chooses.
• The compiler cannot guarantee alignment of objects to a cache-line boundary.

Our contributions - MACVEC tool
1. What information does the compiler need?
2. How to measure without high overhead?
3. How to feed information back to compiler?

Tool (MACVEC) workflow
1. Profile application for hotspots using production inputs.
2. Parse compiler vectorization reports to find loops not fully vectorized.
3. Instrument hot loops that are not fully vectorized.
4. Gather measurements, analyze results and generate recommendations.
5. Verify validity of the recommended changes.
6. Implement changes, measure performance gains.

[Slide repeated with a legend marking each workflow step as an automated step or a manual step.]

Dynamic profiling measurements
• Loop trip counts.
• Array access strides.
• Alignment of arrays.
• Overlapping pointers.
• Non-temporal or streaming stores.
• Branch path outcomes.
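To make two of these measurements concrete, the sketch below shows one way loop trip counts and access strides could be recorded by calls inserted into an instrumented loop. It is an illustration under stated assumptions, not MACVEC's instrumentation: trip_stats, stride_stats, record_trip_count and record_access are hypothetical names, and both structs are assumed to be zero-initialized before use.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-loop record of trip counts (zero-initialize before use). */
typedef struct {
    uint64_t invocations;  /* how many times the loop was entered */
    uint64_t total_iters;  /* iterations summed over all invocations */
    uint64_t max_iters;    /* largest trip count observed */
} trip_stats;

/* Call once after each execution of the loop with its iteration count. */
void record_trip_count(trip_stats *s, uint64_t iters)
{
    s->invocations++;
    s->total_iters += iters;
    if (iters > s->max_iters)
        s->max_iters = iters;
}

/* Hypothetical per-array record of access strides (zero-initialize before use). */
typedef struct {
    uintptr_t last_addr;  /* address of the previous access */
    ptrdiff_t stride;     /* first observed stride, in bytes */
    int seen;             /* accesses recorded so far, saturating at 2 */
    int fixed;            /* 1 while every observed stride equals `stride` */
} stride_stats;

/* Call on every access to the array inside the instrumented loop. */
void record_access(stride_stats *s, const void *addr)
{
    uintptr_t a = (uintptr_t)addr;
    if (s->seen == 0) {
        s->fixed = 1;                 /* no stride observed yet */
    } else {
        ptrdiff_t d = (ptrdiff_t)(a - s->last_addr);
        if (s->seen == 1)
            s->stride = d;            /* first stride becomes the reference */
        else if (d != s->stride)
            s->fixed = 0;             /* a differing stride was observed */
    }
    s->last_addr = a;
    if (s->seen < 2)
        s->seen++;
}
```

Each call does a constant amount of work per loop execution or per access, so the recorded summaries stay small regardless of how long the program runs.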

Measurement collection overhead
Measurement        Overhead (geo. mean)
Trip count         1.08x
Strides            1.05x
Alignment          1.12x
Pointer overlap    1.07x
Branch outcomes    1.07x

Rule-based recommendations
Loop trip count
Precondition:
• Loop trip count less than threshold (1024).
Recommendation: #pragma loop_count(n)
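A minimal sketch of how this hint might look in source follows. The function short_sum and the value 16 are hypothetical, chosen only for illustration. The pragma is specific to Intel compilers, so the sketch guards it with the __INTEL_COMPILER macro (an assumption about the build environment) and other compilers see a plain loop.

```c
#include <stddef.h>

/* Hypothetical kernel: sums a short array whose measured trip count is small. */
double short_sum(const double *x, size_t n)
{
    double s = 0.0;
#if defined(__INTEL_COMPILER)
    /* Illustrative value: pretend profiling reported about 16 iterations. */
    #pragma loop_count (16)
#endif
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}
```

The pragma is only a hint about the expected trip count; the loop still computes the correct result for any n.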

Rule-based recommendations
Stride
Precondition:
• Non-unit but fixed-length strides for specific data structures.
Recommendation: Convert from array-of-structs to struct-of-arrays refs.
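The following sketch illustrates the layout change this rule refers to. All names (point_aos, points_soa, aos_to_soa and the two sum functions) are hypothetical. Reading the x field of every element in the array-of-structs layout strides by sizeof(point_aos) bytes, whereas the struct-of-arrays layout reads x with unit stride.

```c
#include <stddef.h>

/* Array-of-structs element: fields of one point are adjacent in memory. */
typedef struct { double x, y, z; } point_aos;

/* Struct-of-arrays: each field lives in its own contiguous array. */
typedef struct { double *x, *y, *z; } points_soa;

/* Copies n points into caller-provided SoA arrays of length n. */
void aos_to_soa(const point_aos *in, points_soa *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}

/* Non-unit but fixed stride: consecutive x values are sizeof(point_aos) apart. */
double sum_x_aos(const point_aos *p, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += p[i].x;
    return s;
}

/* Unit stride: consecutive x values are adjacent. */
double sum_x_soa(const points_soa *p, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += p->x[i];
    return s;
}
```

Both sum functions return the same value; only the memory access pattern differs.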

Rule-based recommendations
Stride
Precondition:
• Code to be compiled for Intel Xeon Phi.
• Fixed-length strides that are more than 4 cache lines apart.
Recommendation: #pragma prefetch array, -opt-gather-scatter-unroll.

Rule-based recommendations
Alignment
Precondition:
• All arrays aligned to cache-line boundary.
• Loop is vectorizable.
Recommendation: #pragma vector aligned.
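As a rough illustration of the precondition and the recommendation together, the sketch below allocates cache-line-aligned arrays, checks alignment with a constant-time test, and marks the loop with the pragma. The helper names are hypothetical, a 64-byte cache line is assumed, C11 aligned_alloc is assumed to be available, and the pragma is guarded by __INTEL_COMPILER because it is specific to Intel compilers.

```c
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64  /* assumed cache-line size in bytes */

/* Constant-time check: 1 if p sits on a cache-line boundary. */
int is_cache_aligned(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE) == 0;
}

/* Hypothetical helper: allocates n doubles on a cache-line boundary.
 * The size is rounded up because aligned_alloc expects a multiple of
 * the alignment. Release the result with free(). */
double *alloc_aligned_doubles(size_t n)
{
    size_t bytes = n * sizeof(double);
    bytes = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    return aligned_alloc(CACHE_LINE, bytes);
}

/* The caller must pass cache-line-aligned dst and src; the pragma asserts it. */
void scale_aligned(double *dst, const double *src, double k, size_t n)
{
#if defined(__INTEL_COMPILER)
    #pragma vector aligned
#endif
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}
```

Because the pragma is an assertion rather than a request, a caller would typically confirm is_cache_aligned() for every array before taking this path.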

Rule-based recommendations
Non-temporal stores
Precondition:
• Low reuse for arrays used in loop body.
• Loop is vectorizable.
Recommendation: #pragma vector nontemporal.

Rule-based recommendations
Streaming stores
Precondition:
• Arrays are written but never read back.
• Arrays are accessed with unit stride, no mask register.
• Low reuse for specific array.
Recommendation: -opt-streaming-stores=always.

Rule-based recommendations
Pointer-overlap checks
Precondition:
• Span of memory accessed using pointers does not overlap with other pointer accesses.
Recommendation: restrict keyword.
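The sketch below shows what the precondition and the recommendation could look like in code. It is an illustration only: spans_disjoint and add_restrict are hypothetical names, and the check compares the two byte ranges in constant time, independent of the array length.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if [a, a + a_bytes) and [b, b + b_bytes) share no byte. */
int spans_disjoint(const void *a, size_t a_bytes,
                   const void *b, size_t b_bytes)
{
    uintptr_t a0 = (uintptr_t)a;
    uintptr_t b0 = (uintptr_t)b;
    return a0 + a_bytes <= b0 || b0 + b_bytes <= a0;
}

/* restrict promises the compiler that out and in never alias, so it need
 * not assume a dependence between iterations. The caller must uphold this. */
void add_restrict(double *restrict out, const double *restrict in, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] += in[i];
}
```

A caller that cannot prove disjointness statically could test spans_disjoint() first and fall back to a version without restrict when the ranges overlap.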

Rule-based recommendations
Branch path analysis
Precondition:
• Branch evaluates to always true or always false.
Recommendation: __builtin_expect().
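As a small illustration of this hint, the sketch below wraps __builtin_expect() in a macro and applies it to a branch that is assumed to be almost never taken. The macro UNLIKELY and the function clamp_negatives are hypothetical names, and the fallback definition is there only so the code still compiles where the builtin is unavailable.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__) || defined(__INTEL_COMPILER)
/* Tells the compiler that the condition is expected to be false. */
#define UNLIKELY(cond) __builtin_expect(!!(cond), 0)
#else
#define UNLIKELY(cond) (cond)
#endif

/* Hypothetical kernel: replaces negative entries with zero and returns how
 * many were replaced. Profiling is assumed to show the branch rarely fires. */
size_t clamp_negatives(double *x, size_t n)
{
    size_t replaced = 0;
    for (size_t i = 0; i < n; i++) {
        if (UNLIKELY(x[i] < 0.0)) {
            x[i] = 0.0;
            replaced++;
        }
    }
    return replaced;
}
```

The hint affects only code layout and optimization decisions; the function behaves identically if the prediction turns out to be wrong.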

Results: running time improvements
Validation applications    Xeon     Xeon Phi
NBody                      0.93x    1.45x
STREAM Copy                1.06x    1.00x
STREAM Scale               1.41x    1.32x
STREAM Add                 1.30x    1.29x
STREAM Triad               1.29x    1.30x

Results: running time improvements
Small benchmarks    Xeon     Xeon Phi
NAS CG              1.06x    2.18x
LavaMD              2.19x    8.99x
SRAD                0.99x    1.09x

Results: running time improvements
Full applications    Xeon     Xeon Phi
LBM                  1.06x    1.20x
Lulesh               1.03x    1.00x
MILC                 1.10x    1.60x

Safety of recommended changes
• Are recommendations independent of standard compiler optimizations?
• Will recommendations be applicable across multiple program inputs?
• Seven of the nine recommendations are guaranteed to be safe.
• O(1) runtime checks guarantee safety for remaining recommendations.

Summary
• Identified some key metrics necessary to improve vectorization.
• Combined static and dynamic information to generate recommendations.
• MACVEC will be available in the next release of PerfExpert.
