[PPT] - Using Graph-Based Characterization for Predictive Modeling of Vectorizable Loop Nests

SLIDE 1

Using Graph-Based Characterization for Predictive Modeling of Vectorizable Loop Nests

William Killian

PhD Preliminary Exam Presentation
Department of Computer and Information Science
Committee: John Cavazos and Xiaoming Li
January 20, 2015

SLIDE 2

Code Optimization

[Diagram: Optimization at the intersection of Application, Architecture, and Compiler]

SLIDE 3

Problems with Optimizing Code

New compilers

– New optimizations
– Extended language features

New architectures

– Old programming model won't work well (GPUs)
– New/improved capabilities (ISA, cache coherency)
– Compilers don't update with the architecture

Old code

– Legacy code expected to work
– Maintenance of existing code

SLIDE 4

Workflow

[Diagram: Original Code → Programmer Optimized Code → Compiler Optimized Code → Compiler Generated Executable → Execution Information]

SLIDE 5

Workflow with Optimizations

[Diagram: Original Code → Programmer Optimized Code → Compiler Optimized Code → Compiler Generated Executable → Execution Information]

Manual optimizations
Code Transformations
Command-line flags
Compiler Directives
Built-in Heuristics
Generated ASM
Target Arch

[Legend: App, Arch, Opt, Compiler]

SLIDE 6

Optimizations

Manual

– Loop transformations: unrolling, fusion/fission
– Data structure changes: Array of Structures → Structure of Arrays

Compiler Directives

– Source code hints to the compiler indicated by user
– Usually used for local (scope or loop) optimizations
– May automatically transform code (e.g. #pragma unroll 4)

Command-Line Flags

– Optimizations applied to entire program (e.g. -O3, or -unroll=4 with the Intel compiler)
– May specify target architecture and features permitted (e.g. -mavx)
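
The Array of Structures to Structure of Arrays change listed under manual optimizations can be sketched as follows. This is only an illustration: the type names, the field names, and the element count `N` are hypothetical and do not come from the presentation.

```c
#include <assert.h>
#include <stddef.h>

#define N 8 /* hypothetical element count for this sketch */

/* Array of Structures: x, y, z of one point sit next to each other. */
struct point_aos { float x, y, z; };

/* Structure of Arrays: all x values are contiguous, likewise y and z. */
struct points_soa { float x[N], y[N], z[N]; };

/* Sum of x over AoS data: consecutive x values are 3 floats apart. */
float sum_x_aos(const struct point_aos *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += p[i].x;
    return s;
}

/* Sum of x over SoA data: unit-stride accesses, which map naturally
 * onto vector loads. */
float sum_x_soa(const struct points_soa *p, size_t n) {
    assert(n <= N);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += p->x[i];
    return s;
}

/* Copy AoS data into the SoA layout. */
void aos_to_soa(const struct point_aos *in, struct points_soa *out, size_t n) {
    assert(n <= N);
    for (size_t i = 0; i < n; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}
```

Both sums return the same value; only the memory layout seen by the loop differs.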

SLIDE 7

(Intel) SIMD Architecture Evolution

Vector Machines (1980s)
64-bit: MMX (1997)
128-bit: SSE (SP), SSE2 (DP, INT), SSE3, SSSE3, SSE4 (1999-2008)
256-bit: AVX (SP, DP), AVX2 (INT) (2011-2013)
512-bit: NVI – Xeon Phi, AVX-512 (2013-2015)

SLIDE 8

(Intel) SIMD Architecture Evolution

How can we choose the best optimizations to exploit vectorization?

SLIDE 9

Research Questions

Optimization Search Space:
With many types of vectorization optimizations, how do we choose which ones to apply?

Automation:
How can we automatically select optimizations, apply optimizations, and evaluate performance?

Automatic Performance Improvement:
How can we quickly select good vectorization optimizations that improve performance?

SLIDE 10

Research Questions

Optimization Search Space:
With many types of vectorization optimizations, how do we choose which ones to apply?

Automation:
How can we automatically select optimizations, apply optimizations, and evaluate performance?

Automatic Performance Improvement:
How can we quickly select good vectorization optimizations that improve performance?

SLIDE 11

Optimization Search Space

Use source code directives to drive the optimization selection and modification
Guide the internal compiler vectorization heuristics to improve performance
Use 6 different optimizations that provide varying levels of guidance to the compiler
Exhaustive search space of directives on individual loop nests

SLIDE 12

Optimization Search Space

Apply No Optimization

– Let the compiler perform default vectorization

#pragma vector always - May generate slower code

– Ignore speed-up factor predicted by internal model

#pragma ivdep - May generate invalid code

– Ignore built-in check for all unproven vector dependences
– Loops with proven vector dependences will not be vectorized

#pragma simd - May generate invalid code

– Ignore all dependences and reductions
– Can vectorize an entire loop nest (outer-loop vectorization)
– Optional argument vectorlength(n). Vector length states how many safe iterations can be done at once (n = 2, 4, 8)
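
To make the directive choices above concrete, here is a minimal sketch that applies each one to the same loop. The function names are hypothetical. The pragmas are specific to the Intel compiler; other compilers are expected to ignore them, possibly with an unknown-pragma warning, so the four functions only differ in behavior under icc.

```c
#include <assert.h>

/* No directive: the compiler's default vectorization heuristics decide. */
void add_default(float *a, const float *b, const float *c, int n) {
    for (int i = 0; i < n; i++) a[i] = b[i] + c[i];
}

/* vector always: vectorize even if the internal cost model predicts
 * no speedup, so the result may be slower. */
void add_always(float *a, const float *b, const float *c, int n) {
#pragma vector always
    for (int i = 0; i < n; i++) a[i] = b[i] + c[i];
}

/* ivdep: assumed (unproven) dependences are ignored. This is only safe
 * when the caller guarantees that a does not alias b or c. */
void add_ivdep(float *a, const float *b, const float *c, int n) {
    assert(a != b && a != c);
#pragma ivdep
    for (int i = 0; i < n; i++) a[i] = b[i] + c[i];
}

/* simd vectorlength(4): force vectorization and state that four
 * iterations can safely execute at once. */
void add_simd4(float *a, const float *b, const float *c, int n) {
    assert(a != b && a != c);
#pragma simd vectorlength(4)
    for (int i = 0; i < n; i++) a[i] = b[i] + c[i];
}
```

For this loop all four versions compute the same result; the "may generate invalid code" cases arise when the loop really carries a dependence that the directive tells the compiler to ignore.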

SLIDE 13

Research Questions

Optimization Search Space:
With many types of vectorization optimizations, how do we choose which ones to apply?

Automation:
How can we automatically select optimizations, apply optimizations, and evaluate performance?

Automatic Performance Improvement:
How can we quickly select good vectorization optimizations that improve performance?

SLIDE 14

Version Generation Automation

Creation of two utilities

autovec

– Simplified directive language; provides support for permutation of optimizations
– Source-to-source compiler

VALT – Vectorization And Loop Transformation

– Provides developer with concise language to specify vectorization and loop optimization directives
– Extension of autovec
– Supports multiple backend compilers

SLIDE 15

autovec language to Intel directives

autovec directive          Intel-specific pragma
#pragma autovec vl(x)      #pragma simd vectorlength(x)
#pragma autovec ivdep      #pragma ivdep
#pragma autovec always     #pragma vector always
#pragma autovec none       (none)
#pragma autovec permute    Generates each of the following versions into a new file

Permute was configured to generate:

No optimization
#pragma vector always
#pragma ivdep
#pragma simd vectorlength(2)
#pragma simd vectorlength(4)
#pragma simd vectorlength(8)
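
The six permute choices multiply across the loops of a nest, which is where the exhaustive search space comes from. The sketch below enumerates the versions for a nest of a given depth. It is not the autovec implementation: the function names are hypothetical, the short labels (N, VA, IV, VL2, VL4, VL8) are borrowed from the result plots, and treating the first label as the outermost loop is an assumption.

```c
#include <assert.h>
#include <string.h>

#define NUM_OPTS 6

/* Short labels in the order of the permute list (assumed mapping). */
static const char *const OPT_LABEL[NUM_OPTS] = {
    "N", "VA", "IV", "VL2", "VL4", "VL8"
};

/* Number of versions for a nest with `depth` loops: 6^depth. */
unsigned long version_count(int depth) {
    assert(depth >= 0 && depth <= 16);
    unsigned long n = 1;
    for (int d = 0; d < depth; d++) n *= NUM_OPTS;
    return n;
}

/* Write the label of version `index` (0-based) into buf, e.g. "VL8_N".
 * The index is read as a base-6 number whose most significant digit
 * is the directive on the outermost loop (an assumption). */
void version_label(unsigned long index, int depth, char *buf, size_t buflen) {
    int digits[16];
    assert(depth > 0 && depth <= 16 && buflen > 0);
    assert(index < version_count(depth));
    for (int d = depth - 1; d >= 0; d--) {
        digits[d] = (int)(index % NUM_OPTS);
        index /= NUM_OPTS;
    }
    buf[0] = '\0';
    for (int d = 0; d < depth; d++) {
        if (d > 0) strncat(buf, "_", buflen - strlen(buf) - 1);
        strncat(buf, OPT_LABEL[digits[d]], buflen - strlen(buf) - 1);
    }
}
```

A two-deep nest yields 36 versions, which matches the number of optimization sequences on the s256 and s126 plots.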

SLIDE 16

VALT language grammar

SLIDE 17

VALT language to Intel directives

VALT directive             Intel-specific pragma
#pragma vector(default)    No code emitted
#pragma vector(none)       #pragma novector
#pragma vector(always)     #pragma vector always
#pragma vector(ignore)     #pragma ivdep
#pragma vector(aligned)    #pragma vector aligned
#pragma vector(temp)       #pragma vector temporal
#pragma vector(nontemp)    #pragma vector nontemporal
#pragma vectorsize(x)      #pragma simd vectorlength(x)
#pragma loop(unroll(x))    #pragma unroll(x)
#pragma loop(jam(x))       #pragma unroll_and_jam(x)
#pragma loop(nofusion)     #pragma nofusion
#pragma loop(dist)         #pragma distribute_point

SLIDE 18

Version Generation Workflow

[Diagram: SRC → autovec → GEN (K versions) → ICC 15 → optimization sequences and speedups]

SLIDE 19

Research Questions

Optimization Search Space:
With many types of vectorization optimizations, how do we choose which ones to apply?

Automation:
How can we automatically select optimizations, apply optimizations, and evaluate performance?

Automatic Performance Improvement:
How can we quickly select good vectorization optimizations that improve performance?

SLIDE 20

Machine Learning - Previous Solutions

Stock et al. proposed using machine learning techniques to improve automatic vectorization
Park et al. proposed using graph-based learning techniques to optimize programs at loop-nest granularity

SLIDE 21

Proposed Solution:

Use graph-based learning techniques to choose vectorization optimizations for vectorizable loop nests
Construct a graph-based speedup predictor that can predict a speedup when applying vectorization optimizations to a loop nest

SLIDE 22

Feature Extraction

[Diagram: SRC → LLVM → IR → MinIR → CFG → CFG topology (bb0 -> bb1, bb2; bb1 -> bb3; …; bb4 -> bb2) and feature vectors (bb0 … bbN)]

LLVM used to generate IR
MinIR used to generate CFG
Feature vector generated per basic block

– Total # of instructions
– # of Add/Sub/Mul/Div
– # of Load/Store
– # of comparisons
– # of conditional branches
– # of unconditional branches
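
The per-basic-block feature vector can be sketched as a struct of six counters. This is an illustration only: the opcode classes below are hypothetical stand-ins, and a real extraction pass would have to map LLVM IR opcodes onto them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instruction classes for this sketch. */
enum op_class {
    OP_ARITH,      /* add, sub, mul, div */
    OP_LOAD_STORE, /* load or store */
    OP_CMP,        /* comparison */
    OP_COND_BR,    /* conditional branch */
    OP_UNCOND_BR,  /* unconditional branch */
    OP_OTHER       /* counted only in the total */
};

/* One feature vector per basic block: the six counts listed above. */
struct bb_features {
    unsigned total, arith, load_store, cmp, cond_br, uncond_br;
};

/* Count instruction classes over the n instructions of one block. */
struct bb_features extract_features(const enum op_class *ops, size_t n) {
    struct bb_features f = {0, 0, 0, 0, 0, 0};
    assert(ops != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        f.total++;
        switch (ops[i]) {
        case OP_ARITH:      f.arith++;      break;
        case OP_LOAD_STORE: f.load_store++; break;
        case OP_CMP:        f.cmp++;        break;
        case OP_COND_BR:    f.cond_br++;    break;
        case OP_UNCOND_BR:  f.uncond_br++;  break;
        case OP_OTHER:                      break;
        }
    }
    return f;
}
```

An empty basic block simply produces an all-zero vector.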

SLIDE 23

Example Control Flow Graph for Loop Nest

Total of 6 basic blocks
One entry, one return
Basic blocks may contain no code

float aa[LEN2][LEN2];
for (int i = 0; i < LEN2; i++)
  for (int j = 0; j < i; j++)
    aa[i][j] = aa[j][i] + bb[i][j];

SLIDE 24

Machine Learning Model Construction

[Diagram: for each training loop nest (1 to N-1), the CFG feature vectors (bb0 … bbN) and topology (bb0 -> bb1, bb2; bb1 -> bb3; …; bb4 -> bb2), together with its optimization sequences and speedups, are fed to the Machine Learning Algorithm, which produces the Machine Learning Model]

SLIDE 25

Optimization Encoding

Bit Configuration    Encoded Optimization
000000               No Loop
100000               No Optimization Performed
110000               #pragma vector always
111000               #pragma ivdep
111100               #pragma simd vectorlength(2)
111110               #pragma simd vectorlength(4)
111111               #pragma simd vectorlength(8)
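
The table above is a thermometer code: each step up in vectorization guidance sets one more bit from the left. A minimal sketch of the encoding follows; the enum and function names are hypothetical.

```c
#include <assert.h>

/* Optimization levels in the order of the encoding table. */
enum opt_level {
    OPT_NO_LOOP = 0,
    OPT_NONE,
    OPT_VECTOR_ALWAYS,
    OPT_IVDEP,
    OPT_SIMD_VL2,
    OPT_SIMD_VL4,
    OPT_SIMD_VL8
};

#define ENC_BITS 6

/* Level k sets the k leftmost of the six bits (thermometer code). */
unsigned encode_opt(enum opt_level level) {
    unsigned full = (1u << ENC_BITS) - 1u; /* 111111 */
    assert((int)level >= 0 && (int)level <= ENC_BITS);
    return full & ~(full >> (unsigned)level);
}

/* Number of set bits; for this encoding it equals the level. */
int count_ones(unsigned bits) {
    int n = 0;
    while (bits) {
        n += (int)(bits & 1u);
        bits >>= 1;
    }
    return n;
}
```

For example, `encode_opt(OPT_IVDEP)` yields binary 111000. Because the code is cumulative, two encodings that are close in "level" also share most of their bits.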

SLIDE 26

Machine Learning Algorithm

1. Training data:
2. Construct kernel similarity matrix by computing similarity between all training data points (loop nest + optimization)
3. Train on kernel matrix with speedup as prediction target

$L$ = set of all loop nests
$O$ = set of all optimization sequences
$\mathrm{speedup}(l, o)$ = observed speedup from applying $o$ to loop nest $l$

$$\mathrm{scores} = \{\, (l, o, \mathrm{speedup}(l, o)) \mid \forall l \in L,\ \forall o \in O,\ l_{size} = o_{size} \wedge \mathrm{valid}(l, o) \,\}$$

[Equation garbled in extraction; the surviving fragments reference count1 terms and index ranges starting at 0]

SLIDE 27

Machine Learning Algorithm

Graph kernel functions used to transform the training data into a different, linearly-separable feature space.

– Shortest path graph kernel
– Used previously by Park et al. for speedup predictors
– Similarity calculated by normalizing intersection kernel matrix

Linear classifier is constructed that separates the points into multiple classes.
Support vector machines (SVMs) used to construct predictive models from the kernel similarity matrix
Predicts a speedup
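
As a rough illustration of the shortest-path graph kernel idea, the sketch below compares two CFGs by the lengths of their shortest paths only. It is a deliberately simplified stand-in, not the kernel used in this work: it ignores the per-block feature vectors, leaves out the normalization step, and all names and the `MAX_BB` limit are hypothetical.

```c
#include <assert.h>

#define MAX_BB 8         /* hypothetical cap on basic blocks per CFG */
#define INF (MAX_BB + 1) /* marks "unreachable"; longer than any path */

/* CFG as an adjacency matrix: adj[i][j] != 0 means an edge bb_i -> bb_j. */
struct cfg {
    int n;
    int adj[MAX_BB][MAX_BB];
};

/* Floyd-Warshall all-pairs shortest paths over unit-weight edges. */
void shortest_paths(const struct cfg *g, int dist[MAX_BB][MAX_BB]) {
    assert(g->n > 0 && g->n <= MAX_BB);
    for (int i = 0; i < g->n; i++)
        for (int j = 0; j < g->n; j++)
            dist[i][j] = (i == j) ? 0 : (g->adj[i][j] ? 1 : INF);
    for (int k = 0; k < g->n; k++)
        for (int i = 0; i < g->n; i++)
            for (int j = 0; j < g->n; j++)
                if (dist[i][k] + dist[k][j] < dist[i][j])
                    dist[i][j] = dist[i][k] + dist[k][j];
}

/* hist[len] = number of ordered block pairs whose shortest path has
 * length len (len >= 1; unreachable pairs are skipped). */
void path_histogram(const struct cfg *g, int hist[MAX_BB]) {
    int dist[MAX_BB][MAX_BB];
    shortest_paths(g, dist);
    for (int len = 0; len < MAX_BB; len++) hist[len] = 0;
    for (int i = 0; i < g->n; i++)
        for (int j = 0; j < g->n; j++)
            if (i != j && dist[i][j] < INF) hist[dist[i][j]]++;
}

/* Unnormalized intersection kernel over the two path-length histograms. */
int sp_kernel(const struct cfg *a, const struct cfg *b) {
    int ha[MAX_BB], hb[MAX_BB], k = 0;
    path_histogram(a, ha);
    path_histogram(b, hb);
    for (int len = 1; len < MAX_BB; len++)
        k += (ha[len] < hb[len]) ? ha[len] : hb[len];
    return k;
}
```

Evaluating `sp_kernel` for every pair of training graphs fills a kernel matrix, which is the kind of input an SVM with a precomputed kernel consumes.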

SLIDE 28

Using Machine Learning Model for Unseen Program

[Diagram: SRC → Feature Extract → Machine Learning Model → optimization sequences with predicted speedups]

SLIDE 29

EXPERIMENT SETUP

SLIDE 30

Benchmark Selection

TSVC

– 151 loop nests with varying access patterns, computations, and memory access types
– Originally used to evaluate how well a compiler can recognize patterns for vectorization
– Millisecond timing granularity with each loop nest within a repeat loop

PolyBench/C

– 30 static control-flow micro-benchmarks from several scientific domains
– Modified to create different individual versions optimizing single loop nests (65 total loop nests)
– Clock-tick timing granularity across the entire kernel execution

SLIDE 31

Machine Configuration

Nehalem (NHM)

Core i7 950
3.06GHz quad-core
8MB L3 cache
24GB DDR3-1333
128-bit vector width
Up to SSE 4.2 ISA
45nm
Q2 2009

Haswell (HSW)

Core i7 5930K
3.5GHz six-core
15MB L3 cache
32GB DDR4-2133
256-bit vector width
Up to AVX2 ISA
22nm
Q3 2014

Processor dynamic frequency scaling was disabled for all experiments.
We only analyzed speedup for cross-architecture comparison, not performance.

SLIDE 33

Execution Configuration

Each loop nest executed 10 times
Ensured execution times within 1% (0.8% observed)
Verified correctness of execution for each version by dumping live-out data (PolyBench loop nests) or checksum (TSVC loop nests)
Average speedup recorded for each loop nest and optimization sequence pair

– Used for exhaustive search performance and speedup predictor
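
The measurement procedure above can be sketched as follows. Two details are assumptions, since the slides do not spell them out: run-to-run variation is taken as the largest relative deviation of any run from the mean, and the recorded speedup is taken as the ratio of mean execution times. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of n execution times. */
double mean_time(const double *t, size_t n) {
    double sum = 0.0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) sum += t[i];
    return sum / (double)n;
}

/* Largest relative deviation of any run from the mean (0.01 means 1%). */
double max_rel_deviation(const double *t, size_t n) {
    double m = mean_time(t, n), worst = 0.0;
    assert(m > 0.0);
    for (size_t i = 0; i < n; i++) {
        double d = (t[i] > m) ? t[i] - m : m - t[i];
        if (d / m > worst) worst = d / m;
    }
    return worst;
}

/* Speedup of an optimized version over the baseline, both measured
 * over n runs (assumed definition: ratio of mean times). */
double speedup(const double *base, const double *opt, size_t n) {
    return mean_time(base, n) / mean_time(opt, n);
}
```

A measurement would be accepted when `max_rel_deviation` stays below 0.01 for the repeated runs, and `speedup` then gives the value stored for that loop nest and optimization sequence pair.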

SLIDE 34

EXHAUSTIVE SEARCH SPACE SPEEDUP

Experiment Results

SLIDE 35

TSVC Results

[Figure: two panels, Nehalem and Haswell. y-axis: speedup normalized over '-O3 -xHOST' (ticks 1, 2, 4, 8). x-axis: TSVC loop nests (N = 151) sorted by increasing speedup]

SLIDE 36

TSVC Cross-Architecture Analysis

[Figure: series Nehalem and Haswell. y-axis: speedup normalized over '-O3 -xHOST' (ticks 1, 2, 4, 8). x-axis: TSVC loop nests (N = 151) sorted by increasing speedup on Haswell]

Correlation: C = 0.8945

SLIDE 37

PolyBench Results - Nehalem

[Figure: y-axis: speedup normalized over '-O3 -xHOST' (ticks 1 to 32). x-axis: PolyBench loop nests (N = 65) sorted by increasing speedup]

SLIDE 38

PolyBench Results - Haswell

[Figure: y-axis: speedup normalized over '-O3 -xHOST' (ticks 1 to 32). x-axis: PolyBench loop nests (N = 65) sorted by increasing speedup]

SLIDE 39

PolyBench Cross-Architecture Analysis

[Figure: series Nehalem and Haswell. y-axis: speedup normalized over '-O3 -xHOST' (ticks 1 to 32). x-axis: PolyBench loop nests (N = 65) sorted by increasing speedup on Haswell]

Correlation: C = 0.8894

SLIDE 40

EXHAUSTIVE SEARCH SPACE VALID CODE GENERATION

Experiment Results

SLIDE 41

Version Generation Example

TSVC s256 Speedups

[Figure: y-axis: speedup over -O3 -xHOST (ticks 1 to 8). x-axis: the 36 two-directive optimization sequences (e.g. VL8_N, IV_VL4, N_N). Bars marked Valid or Invalid]

SLIDE 42

Version Generation Example

TSVC s126 Speedups

[Figure: y-axis: speedup over -O3 -xHOST (ticks 1 to 8). x-axis: the 36 two-directive optimization sequences (e.g. IV_N, VL8_VL8, VL2_IV). Bars marked Valid or Invalid]

SLIDE 43

Version Generation Statistics

Version Generation Across Architecture

Benchmark    Arch       Valid    Invalid    Error
TSVC         Nehalem    1832     151        3
TSVC         Haswell    1826     155        5
PolyBench    Nehalem    5204     3826       0
PolyBench    Haswell    5204     3826       0

Version Generation Performance for TSVC

Architecture    Valid Fastest    Invalid Fastest
Nehalem         140              11
Haswell         143              8

SLIDE 44

GRAPH-BASED SPEEDUP PREDICTOR

Experiment Results

SLIDE 45

Evaluation Model

Leave-One-Out Cross Validation

– For a given loop nest, construct a model based on all other loop nests as training data
– Compare predictor's speedup to actual speedup
– 151 models for TSVC, 65 models for PolyBench

Evaluation Method

– 1-shot: only consider top prediction
– 3-shot: consider top three predictions
– Top prediction is the optimization with best observed speedup
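
The 1-shot and 3-shot evaluation can be sketched as below. The reading assumed here is that the model ranks candidate optimization sequences by predicted speedup, and the k-shot result is the best observed speedup among the k highest-ranked candidates. The function name and the 64-candidate limit are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CANDIDATES 64 /* hypothetical upper bound for this sketch */

/* Best observed speedup among the k candidates with the highest
 * predicted speedup. predicted[i] and observed[i] refer to the same
 * optimization sequence. */
double k_shot(const double *predicted, const double *observed,
              size_t n, size_t k) {
    int used[MAX_CANDIDATES] = {0};
    double best = 0.0;
    assert(n > 0 && n <= MAX_CANDIDATES && k > 0 && k <= n);
    for (size_t shot = 0; shot < k; shot++) {
        size_t top = n; /* n means "no candidate picked yet" */
        for (size_t i = 0; i < n; i++)
            if (!used[i] && (top == n || predicted[i] > predicted[top]))
                top = i;
        used[top] = 1;
        if (shot == 0 || observed[top] > best) best = observed[top];
    }
    return best;
}
```

With k equal to the number of candidates the function returns the search-space optimal, so the ratio of a k-shot result to that value gives a "percent of optimal" figure for one loop nest.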

SLIDE 46

TSVC Speedup Predictor - Nehalem

loop           1-shot    3-shot    Optimal
s126           1.91      1.92      6.28
s221           0.39      1.42      1.42
s2251          2.03      2.44      2.44
s244           0.46      1.36      1.36
s256           0.98      0.99      7.92
s3112          1.99      4.03      4.03
s321           0.50      0.97      2.13
s424           0.99      1.94      2.88
Arith. Mean    0.70      0.98      1.47
Geo. Mean      0.61      0.85      1.32

74% of Optimal on Nehalem 3-shot

SLIDE 47

TSVC Speedup Predictor - Haswell

loop           1-shot    3-shot    Optimal
s126           1.74      1.84      5.95
s221           0.36      1.35      1.35
s2251          1.49      1.64      1.64
s244           0.32      1.18      1.30
s256           0.99      1.00      7.88
s3112          1.99      4.52      4.52
s321           0.44      1.00      1.65
s424           1.00      1.77      2.57
Arith. Mean    0.62      0.94      1.36
Geo. Mean      0.51      0.80      1.21

77% of Optimal on Haswell 3-shot

SLIDE 48

PolyBench Speedup Predictor - Nehalem

loop              1-shot    3-shot    Optimal
2mm-1             0.25      0.25      1.00
adi-4             0.99      1.32      1.40
correlation-1     1.00      1.22      1.22
covariance-1      1.37      1.68      1.68
dynprog-1         0.98      1.00      1.10
floyd-warshall    5.22      8.52      11.30
gemm              0.14      0.32      1.00
gramschmidt-3     1.00      7.38      8.17
Arith. Mean       0.97      1.21      1.46
Geo. Mean         0.85      0.98      1.24

84.44% of Optimal on Nehalem 3-shot

SLIDE 49

PolyBench Speedup Predictor - Haswell

loop              1-shot    3-shot    Optimal
2mm-1             0.67      0.68      1.00
adi-4             1.22      1.22      1.22
correlation-1     0.98      1.00      1.01
covariance-1      1.02      1.02      1.02
dynprog-1         0.99      1.00      1.00
floyd-warshall    25.88     25.88     28.86
gemm              0.12      0.38      1.00
gramschmidt-3     3.37      3.40      5.22
Arith. Mean       1.34      1.47      1.66
Geo. Mean         0.90      1.03      1.20

88.74% of Optimal on Haswell 3-shot

SLIDE 50

Threats to Validity

Correctness of generated code

– PolyBench: analyzing live-out data still may not verify correctness
– TSVC: used a checksum computation; invalid results still possible

Speedup measurement

– Execution performed in single-user mode with timing at a kernel level
– Speedup is a "trend" for PolyBench, not representative of observable speedup for entire kernel

Machine learning model

– Optimization bit vector only defines a "level" of vectorization
– SVM training parameters not explored
– Generated model seems to find smaller variations between code; similar kernel matrices are generated

SLIDE 51

Contributions

VALT directive compiler to simplify code generation across different compiler backends
autovec: exhaustive search code version generator
Graph-based speedup predictor designed to predict the best vector optimizations to apply to a given loop nest
Performance analysis of vectorizable micro-benchmarks that can carry across to similar types of kernels

SLIDE 52

Differences from Related Work

Stock et al. only targeted tensor contraction and stencil kernels and didn't use graph-based learning

– Our approach works on many different types of code and uses graph-based features to construct the model

Park et al. focused on a different set of optimizations, primarily targeting loop transformations, autoparallelization, and choosing whether or not to vectorize

– Our approach explores the vectorization search space of loop nests, and allows us to potentially reach a more local maximum speedup given our optimization search space

SLIDE 53

Future Work

Extend VALT to support multiple backends (PGI Compiler)
Change how optimizations are represented

– Annotate graph-based representation
– Would eliminate encoding for maximum loop nest size

Extend work to additional compilers

– Newer versions of GCC (4.9+), PGI compiler

Target wider vector size architectures

– Xeon Phi (Knights Corner): 512-bit vector width; limited ISA
– Knights Landing and Skylake: AVX-512

SLIDE 54

Conclusion

Provided automated and manual techniques for improving the performance of codes with vectorization optimizations
Non-experts can use the utilities developed to automatically optimize codes to exploit vector hardware
With the contributions presented, we

– achieved up to a 30x speedup through exhaustive search
– reached up to 88% of the search-space optimal speedup using the proposed speedup predictor

SLIDE 55

QUESTIONS?

Thank You!
