
SLIDE 1

GRAPHITE Two Years After

First Lessons Learned From Real-World Polyhedral Compilation

Konrad Trifunovic2, Albert Cohen2, David Edelsohn3, Li Feng6, Tobias Grosser5, Harsha Jagasia1, Razya Ladelsky4, Sebastian Pop1, Jan Sjödin1, Ramakrishna Upadrasta2

1 Open Source Compiler Engineering, AMD, Austin, Texas, USA
2 INRIA Saclay – Île-de-France and LRI, Paris-Sud 11 University, Orsay, France
3 IBM T. J. Watson Research, Yorktown Heights, USA
4 IBM Haifa Research, Haifa, Israel
5 University of Passau, Passau, Germany
6 Xi'an Jiaotong University, Xi'an, China

January 30, 2010

GROW Workshop, Jan 2010, Pisa, Italy 1 / 13


SLIDE 5

1. Motivation

Keeping sustained performance increase:
• Multi-level parallelism
  • Instruction-level parallelism, ILP (instruction scheduling)
  • Data-level parallelism (vectorization)
  • Thread-level parallelism (automatic parallelization)
• Memory hierarchy
  • Caches
  • Registers
  • Scratchpad memories
• Need for complex program (loop) optimizations


SLIDE 8

2. Why polyhedral model in GCC?

• Source-to-source compilers:
  • Syntax based
  • Output source code might lose semantic information
  • Need for source code normalization
• Low-level internal polyhedral representation:
  • Semantics based: SSA GIMPLE form
  • Scalar evolution analysis (inductions, reductions)
  • Leverages > 100 optimization passes of GCC
  • Tight interaction with vectorizer, parallelizer and memory layout optimizations

SLIDE 9

3. Compilation workflow

GCC pipeline: C / C++ / F95 front ends → GENERIC → GIMPLE → GIMPLE+CFG+SSA+LOOP → GRAPHITE → RTL → ASM (x86, PPC, SPU).

The GRAPHITE pass, starting from GIMPLE, SSA and CFG:
SCoP detection → SCoPs → GPOLY construction → GPOLY → Transformations (with legality check) → transformed GPOLY → GLOOG (CLooG based) → back to GIMPLE, SSA, CFG.


SLIDE 11

4. Polyhedral model – Domains

GPOLY – Iteration domains

for (v=0; v<N; v++)
  for (h=0; h<N; h++)
    out[v][h] = 0;

D_S = {(v, h) | 0 ≤ v, h ≤ N − 1}

In matrix form, over the vector (v, h, N, 1):

\begin{pmatrix} 1 & 0 & 0 & 0 \\ -1 & 0 & 1 & -1 \\ 0 & 1 & 0 & 0 \\ 0 & -1 & 1 & -1 \end{pmatrix}
\begin{pmatrix} v \\ h \\ N \\ 1 \end{pmatrix} \ge 0
\quad\Longleftrightarrow\quad
v \ge 0,\; v \le N-1,\; h \ge 0,\; h \le N-1

SLIDE 12

4. Polyhedral model – Data accesses

Data accesses – mapping iterations to memory:

f(i, g) = F × (i, g, 1)^T

For the loop nest with domain 0 ≤ v, h ≤ N − 1, the access out[v][h] maps each iteration point (v, h) to an array cell:

out[1][1] out[1][2] out[1][3]
out[2][1] out[2][2] out[2][3]
out[3][1] out[3][2] out[3][3]

Linearized memory layout.


SLIDE 21

4. Polyhedral model – Scheduling

Scheduling – execution order:

t = θ_S(i) = Θ_S × (i, g, 1)^T

Original loop nest (domain 0 ≤ v, h ≤ N − 1):

for (v=0; v<N; v++)
  for (h=0; h<N; h++)
    out[v][h] = 0;

with the identity schedule Θ_S (t1 = v, t2 = h).

After loop interchange, the generated code scans the same domain in the transposed order:

for (t1=0; t1<N; t1++)
  for (t2=0; t2<N; t2++)
    out[t2][t1] = 0;

with the interchanged schedule Θ'_S (t1 = h, t2 = v).

SLIDE 22

5. SSA-based polyhedral model

MVT kernel:

for (i = 0; i < N; i++) {
  b[i] = 0;
  for (j = 0; j < N; j++)
    b[i] += A[i][j] * x[j];
}

In SSA GIMPLE form this becomes a CFG of basic blocks: bb 3 initializes b[i_21] and loads b_I_lsm.5_16; bb 4 and bb 5 form the inner reduction loop, with PHI nodes for j_22 and the partial sum pre.3_28; bb 6 stores the accumulated value back into b[i_21] and increments i; bb 7 and bb 8 are the outer-loop latch and exit.

Polyhedral representation:

D^S_bb3 = {(i) | 0 ≤ i ≤ N − 1}
D^S_bb4 = {(i, j) | 0 ≤ i ≤ N − 1 ∧ 0 ≤ j ≤ N − 1}
D^S_bb6 = {(i) | 0 ≤ i ≤ N − 1}

F_dr1 = {(i, a, s1) | a = 0 ∧ s1 = i ∧ 0 ≤ s1 ≤ N − 1}
F_dr2 = {(i, j, a, s1) | a = 1 ∧ s1 = j ∧ 0 ≤ s1 ≤ N − 1}
F_dr4 = {(i, a, s1) | a = 0 ∧ s1 = i ∧ 0 ≤ s1 ≤ N − 1}

θ_bb3 = {(i, t1, t2, t3) | t1 = 0 ∧ t2 = i ∧ t3 = 0}
θ_bb4 = {(i, j, t1, t2, t3, t4, t5) | t1 = 0 ∧ t2 = i ∧ t3 = 1 ∧ t4 = j ∧ t5 = 0}


SLIDE 26

6. Research – Cost-modelling for vectorization

Cost-modelling for vectorization

Scalar 2D convolution:

for (v=0; v<N; v++)
  for (h=0; h<N; h++) {
    s = 0;
    for (i=0; i<K; i++)
      for (j=0; j<K; j++)
        s += img[v+i][h+j] * filter[i][j];
    out[v][h] = s;
  }

Vectorized inner loop (VF = 4):

for (v=0; v<N; v++)
  for (h=0; h<N; h++) {
    s = 0;
    for (i=0; i<K; i++) {
      vs[0:3] = {0, 0, 0, 0};
      for (j=0; j<K; j+=4)
        vs[0:3] += img[v+i][h+j:h+j+3] * filter[i][j:j+3];
      s += sum(vs[0:3]);
    }
    out[v][h] = s;
  }

Reduction costs: the sum operation reducing vector vs into scalar s executes N² · K times.
Benefits: VF = 4 scalar operations are replaced by 1 vector operation.

[Trifunovic et al. 2009]


SLIDE 28

6. Research – Cost-modelling for vectorization

[Diagram: compilation flow from the front end through the middle end (GIMPLE-SSA) to the back end (RTL). The GRAPHITE loop-nest optimization pass uses an analytical, loop-nest-level cost model; the vectorization pass uses an instruction-level model; pass selection decides at which level a loop is vectorized, through a shared vectorization API.]

SLIDE 29

6. Research – Automatic parallelization

Autopar

(a) Original loop nest (the h loop carries no dependence; the v loop does):

parloop () {
  for (h = 0; h < N; h++)
    for (v = 1; v < N; v++)
      x[h][v] = x[h][v-1] + 1;
}

(b) After outlining for GOMP:

parloop () {
  .paral_data.x = &x;
  __builtin_GOMP_parallel_start (parloop._loopfn, &.paral_data, 4);
  parloop._loopfn (&.paral_data);
  __builtin_GOMP_parallel_end ();
}

parloop._loopfn (.paral_data) {
  for (h = start; h < end; h++)
    for (v = 1; v < N; v++)
      (*.paral_data->x)[h][v] = x[h][v-1] + 1;
}

SLIDE 30

7. Alias Analysis and GRAPHITE

Encoding aliasing information

• Dependence analysis requires alias information.
• Alias sets are encoded as an extra dimension of the access functions.

int a[10], b[10];
void foo (int *p);

[Figure: alias graph over p, a, b with alias sets A_1 and A_2.]

Points-to mapping: a → {A1}, p → {A1, A2}, b → {A2}.
Computing the alias sets is equivalent to solving Minimum Edge Clique Cover (ECC), an NP-complete problem.

SLIDE 31

7. Alias Analysis and GRAPHITE

Empirical analysis on alias graphs (4481 graphs):
• Only 11 graphs are interesting, with up to 90 vertices.
• In all the others, every connected component is a clique!

[Figure: alias graph from H.264, shown (i) as extracted and (ii) with its clique structure; vertex groups 1,5 / 3,7 / 4,6 / 2,8 and 9–16.]

Future work:
• A faster algorithm using modular decomposition properties.
• Currently the fastest known algorithm is O(|V||E|) ([Gramm et al. 2009]; their implementation, in Haskell using Patricia trees, does not seem simple to reimplement).

SLIDE 32

8. Development

Development

Libraries used:
• PPL – The Parma Polyhedra Library, http://www.cs.unipr.it/ppl/
• CLooG – the Chunky Loop Generator, http://www.cloog.org

Commit activity:

Year | Commits
2006 |  10
2007 |  36
2008 | 216
2009 | 497

Weekly phone calls: every Wednesday, 15.00 Pisa time, sip:00077723146@iptel.org

http://gcc.gnu.org/wiki/Graphite

SLIDE 33

9. Bibliography

Bibliography

[Gramm et al. 2009] J. Gramm, J. Guo, F. Hüffner and R. Niedermeier. Data reduction and exact algorithms for clique cover. J. Exp. Algorithmics, 14:2.2–2.15, 2009.

[Trifunovic et al. 2009] K. Trifunovic, D. Nuzman, A. Cohen, A. Zaks and I. Rosen. Polyhedral-Model Guided Loop-Nest Auto-Vectorization. In Parallel Architectures and Compilation Techniques (PACT'09), Raleigh, North Carolina, Sept. 2009.

SLIDE 34

10. Questions

Thank you for your attention.

Questions?