SLIDE 1

Optimal ILP and Register Tiling: Analytical Model and Optimization Framework

Lakshminarayanan Renganarayana, Upadrasta Ramakrishna, Sanjay Rajopadhye
Computer Science Department
Colorado State University

SLIDE 2

October 21, 2005 LCPC '05

Overview

  • ILP and register reuse
  • Execution time and register pressure functions
  • Optimal ILP and register tiling problem
  • Optimal tiling problem as a convex optimization problem
  • Validation
  • Related work
  • Conclusions & future work

SLIDE 3

ILP and Register Reuse

  • Loop programs
      • dominate application execution time
      • are the main sources of ILP and register reuse
  • Transformations
      • expose / exploit ILP
      • enable register reuse
  • These transformations interact in subtle ways
  • ILP - Register Reuse tradeoff?

SLIDE 4

ILP - Register Reuse Tradeoff

  • Optimal combination of transformations
  • Quantification of interactions
  • A mathematical model
      • to study the interactions
      • to choose the optimal transformation parameters
  • To the best of our knowledge, no such model has been studied

SLIDE 5

Contributions

  • Cost model with transformation parameters as variables
      • closed forms: execution time & register pressure
  • Convex optimization problem formulation
  • A globally optimal solution
  • First such formulation & optimal solution

SLIDE 6

Exposing and Exploiting ILP

  • Exposing ILP
      • Unroll and Jam
      • Loop permutation or skewing
      • Multi-dimensional scheduling
  • Exploiting ILP
      • DAG schedulers
      • Software pipelining

SLIDE 7

Exposing ILP with Unroll and Jam

for i1 = 1 to 6
  for i2 = 1 to 6
    A[i1,i2] = A[i1-1,i2] + A[i1,i2-1]

  • DAG exposes parallelism
  • Unrolled loop body (9 iterations)
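The 9-iteration unrolled body can be illustrated with a minimal Python sketch. It assumes a 3 x 3 unroll-and-jam (which is what 9 iterations suggests), boundary values of 1.0 in row 0 and column 0, and the helper names `make_grid`, `original`, `unrolled_3x3`, and `dag_level_widths`, none of which come from the slides.

```python
def make_grid(n):
    """(n+1) x (n+1) grid; row 0 and column 0 hold assumed boundary values."""
    return [[1.0] * (n + 1) for _ in range(n + 1)]


def original(n=6):
    A = make_grid(n)
    for i1 in range(1, n + 1):
        for i2 in range(1, n + 1):
            A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1]
    return A


def unrolled_3x3(n=6):
    """Both loops step by 3; the body holds 9 copies of the statement."""
    assert n % 3 == 0
    A = make_grid(n)
    for a in range(1, n + 1, 3):
        for b in range(1, n + 1, 3):
            A[a][b] = A[a - 1][b] + A[a][b - 1]
            A[a][b + 1] = A[a - 1][b + 1] + A[a][b]
            A[a][b + 2] = A[a - 1][b + 2] + A[a][b + 1]
            A[a + 1][b] = A[a][b] + A[a + 1][b - 1]
            A[a + 1][b + 1] = A[a][b + 1] + A[a + 1][b]
            A[a + 1][b + 2] = A[a][b + 2] + A[a + 1][b + 1]
            A[a + 2][b] = A[a + 1][b] + A[a + 2][b - 1]
            A[a + 2][b + 1] = A[a + 1][b + 1] + A[a + 2][b]
            A[a + 2][b + 2] = A[a + 1][b + 2] + A[a + 2][b + 1]
    return A


def dag_level_widths(u1=3, u2=3):
    """Statements per dependence level of a u1 x u2 unrolled body.

    Copy (d1, d2) depends on (d1-1, d2) and (d1, d2-1), so it sits at
    level d1 + d2 of the DAG.
    """
    widths = [0] * (u1 + u2 - 1)
    for d1 in range(u1):
        for d2 in range(u2):
            widths[d1 + d2] += 1
    return widths
```

Statements at the same DAG level have no dependences between them, so a scheduler is free to issue them together; the level widths give a rough picture of how much parallelism the unrolled body exposes.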

SLIDE 8

Exposing ILP with Permutation

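for i2 = 1 to 6
  for i1 = 1 to 6
    A[i1,i2] = 3.23 * A[i1,i2-1]

for i1 = 1 to 6
  for i2 = 1 to 6
    A[i1,i2] = 3.23 * A[i1,i2-1]

The only dependence is along i2, so with i1 as the inner-most loop all inner iterations are independent.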

SLIDE 9

Exposing ILP with Skewing

for i1 = 1 to 6
  for i2 = 1 to 6
    A[i1,i2] = A[i1-1,i2] + A[i1,i2-1]

  • After skewing, all the iterations of the inner-most loop are parallel
  • Sufficient ILP
  • Performance limited only by the available execution resources
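A small Python sketch can make the effect of skewing concrete. It assumes the skew (j1, j2) = (i1 + i2, i2), which is one common choice and not necessarily the one used on the slide; the function names and the boundary values of 1.0 are also assumptions. Running the inner loop in reverse order gives the same result, which is a simple check that its iterations are independent.

```python
def make_grid(n):
    """(n+1) x (n+1) grid; row 0 and column 0 hold assumed boundary values."""
    return [[1.0] * (n + 1) for _ in range(n + 1)]


def original(n=6):
    A = make_grid(n)
    for i1 in range(1, n + 1):
        for i2 in range(1, n + 1):
            A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1]
    return A


def skewed(n=6, reverse_inner=False):
    """Same computation after the assumed skew j1 = i1 + i2, j2 = i2."""
    A = make_grid(n)
    for j1 in range(2, 2 * n + 1):              # wavefront index
        inner = range(max(1, j1 - n), min(n, j1 - 1) + 1)
        if reverse_inner:
            inner = reversed(inner)             # any order works within a wavefront
        for j2 in inner:
            i1, i2 = j1 - j2, j2                # map back to original indices
            # Both operands lie on wavefront j1 - 1, already computed.
            A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1]
    return A
```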

SLIDE 10

Register Reuse

  • Unroll and Jam
  • Scalar replacement
      • scalar replacement enables register placement
      • classic register allocators are sufficient
  • Loop tiling
  • Array register allocation
      • registers allocated to array values
      • no code size increase
      • requires an array register allocator

SLIDE 11

Scalar Replacement

Original loop:

for i1 = 1 to 6
  for i2 = 1 to 6
    A[i1,i2] = A[i1-1,i2] + A[i1,i2-1]

Unroll 2 x 2:

for i1 = 1 to 6 step 2
  for i2 = 1 to 6 step 2
    A[i1,i2]     = A[i1-1,i2]   + A[i1,i2-1]
    A[i1+1,i2]   = A[i1,i2]     + A[i1+1,i2-1]
    A[i1,i2+1]   = A[i1-1,i2+1] + A[i1,i2]
    A[i1+1,i2+1] = A[i1,i2+1]   + A[i1+1,i2]

Replace A[i1,i2] by a scalar T:

for i1 = 1 to 6 step 2
  T = A[i1,0]
  for i2 = 1 to 6 step 2
    T            = A[i1-1,i2]   + T
    A[i1+1,i2]   = T            + A[i1+1,i2-1]
    A[i1,i2+1]   = A[i1-1,i2+1] + T
    A[i1+1,i2+1] = A[i1,i2+1]   + A[i1+1,i2]
    A[i1,i2]     = T

  • Saves 2 loop independent loads plus 1 loop carried load
  • T can be allocated to a register

Which array references to scalar replace?
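The following Python sketch shows one way scalar replacement can be checked against the original loop. It is an illustration under assumptions, not the slide's exact code: the boundary values and function names are made up, and the scalar `t` is refreshed after the second update so that it holds A[i1, i2+1] when the inner loop advances by 2, which is the value the next inner iteration reads as A[i1, i2-1].

```python
def make_grid(n):
    """(n+1) x (n+1) grid with assumed, non-uniform boundary values."""
    A = [[0.0] * (n + 1) for _ in range(n + 1)]
    for k in range(n + 1):
        A[0][k] = float(k + 1)
        A[k][0] = float(2 * k + 1)
    return A


def original(n=6):
    A = make_grid(n)
    for i1 in range(1, n + 1):
        for i2 in range(1, n + 1):
            A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1]
    return A


def scalar_replaced(n=6):
    """2 x 2 unrolled loop with row i1 carried in the scalar t."""
    assert n % 2 == 0
    A = make_grid(n)
    for i1 in range(1, n + 1, 2):
        t = A[i1][0]                          # A[i1, i2-1] on entry to the row
        for i2 in range(1, n + 1, 2):
            t = A[i1 - 1][i2] + t             # t now equals A[i1, i2]
            A[i1][i2] = t
            A[i1 + 1][i2] = t + A[i1 + 1][i2 - 1]
            t = A[i1 - 1][i2 + 1] + t         # t now equals A[i1, i2+1]
            A[i1][i2 + 1] = t
            A[i1 + 1][i2 + 1] = t + A[i1 + 1][i2]
    return A
```

In this variant every read of row i1 inside the unrolled body comes from `t`, so only the stores to row i1 touch memory.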

SLIDE 12

Tiling

for i1 = 1 to 6
  for i2 = 1 to 6
    A[i1,i2] = A[i1-1,i2] + A[i1,i2-1]

Tiling

  • Similar to Unroll and Jam
  • Decreases lifetime of values
  • Limits MAXLIVE

SLIDE 13

Register tiling

for i1 = 1 to 6
  for i2 = 1 to 6
    A[i1,i2] = A[i1-1,i2] + A[i1,i2-1]

3x3 register tile

Tile sizes:

  • Affect load/store savings
  • Are constrained by the number of registers

How to choose the tile sizes?
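To see how tile sizes affect load savings for this particular stencil, the sketch below counts the distinct values a t1 x t2 tile must read from outside itself and compares that with the 2 loads per iteration of the untransformed loop. The function name `tile_loads` is hypothetical, and the count assumes every value produced inside the tile stays in a register for as long as the tile needs it; the register limit itself is not modeled here.

```python
def tile_loads(t1, t2):
    """Return (loads with register tiling, loads without any reuse)
    for one t1 x t2 tile of A[i1,i2] = A[i1-1,i2] + A[i1,i2-1]."""
    tile = {(a, b) for a in range(t1) for b in range(t2)}
    outside = set()
    for (a, b) in tile:
        for src in ((a - 1, b), (a, b - 1)):   # the two operands
            if src not in tile:
                outside.add(src)               # must come from memory
    return len(outside), 2 * t1 * t2
```

For a 3 x 3 tile this gives 6 loads instead of 18: only the values just above and just left of the tile are read from memory.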

SLIDE 14

Traditional vs. Our Approach

Traditional Approach

  • Code transformation: Unroll and Jam + Scalar Promotion
  • Choose optimal unroll and scalar promotion parameters
  • Sched. & Reg. Alloc.: DAG Scheduler or Software Pipelining + Scalar Register Allocation

Our Approach

  • Code transformation: Permutation or Skewing + Tiling
  • Choose optimal skew and tile parameters
  • Sched. & Reg. Alloc.: Software Pipelining + Scalar & Array Register Allocation

SLIDE 15

Program, Tiling, and Architecture Class

  • Input loops:
      • perfectly nested, rectangular loops
      • uniform dependence bodies
  • Rectangular tiling
      • we assume: input loop nest admits rectangular tiling
  • ILP exposed by: permutation or skewing
  • Architectures: superscalar or VLIW

SLIDE 16

Execution Time

(When permutation exposes ILP)

T               = (ntiles * tile_cost) + loop_overhead
tile_cost       = max(comp_cost, load_store_cost)
comp_cost       = α * tile_vol
load_store_cost = β * LS(t,D)
loop_overhead   = η * LO(t,N)
ntiles          = N1/t1 * … * Nn/tn

where

  • t = vector of tile sizes
  • N = vector of iteration space sizes
  • D = dependence matrix
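The model above can be evaluated directly once LS and LO are supplied. The slides do not give closed forms for them, so the sketch below plugs in two clearly hypothetical stand-ins: `surface_ls` (twice the tile surface, ignoring D) and `loop_count_lo` (total trip count over the tile-loop levels). Only the structure of `execution_time` follows the slide; the stand-ins and all names are assumptions.

```python
from math import prod


def surface_ls(t, D=None):
    """HYPOTHETICAL stand-in for LS(t, D): twice the tile surface.
    The real LS depends on the dependence matrix D; this one ignores it."""
    return 2 * sum(prod(t[:i] + t[i + 1:]) for i in range(len(t)))


def loop_count_lo(t, N):
    """HYPOTHETICAL stand-in for LO(t, N): trip counts summed over loop levels."""
    total, trips = 0.0, 1.0
    for n_k, t_k in zip(N, t):
        trips *= n_k / t_k
        total += trips
    return total


def execution_time(t, N, D, alpha, beta, eta, LS=surface_ls, LO=loop_count_lo):
    """T = ntiles * max(comp_cost, load_store_cost) + loop_overhead."""
    ntiles = prod(n_k / t_k for n_k, t_k in zip(N, t))
    tile_vol = prod(t)
    comp_cost = alpha * tile_vol
    load_store_cost = beta * LS(t, D)
    tile_cost = max(comp_cost, load_store_cost)
    loop_overhead = eta * LO(t, N)
    return ntiles * tile_cost + loop_overhead


# Example with made-up constants: 3 x 3 tiles over a 6 x 6 iteration space.
T = execution_time((3, 3), (6, 6), None, alpha=1.0, beta=2.0, eta=0.5)
```

With these made-up constants the tile is load/store bound (24 versus 9 per tile), so `T` is 4 * 24 + 3 = 99.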

SLIDE 17

Execution Time Model

(When permutation cannot expose ILP: skew)

Skewing affects

  • iteration space shape: makes counting partial tiles, full tiles, and the total number of tiles hard.
      • Partial tiles are treated as full tiles.
      • Number of tiles approximated by N1/t1 * … * Nn/tn
  • dependence lengths: affects the amount of data loaded / stored in a tile.
      • Dep. matrix = SD
      • LS(t,SD) is the load store volume
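As a small worked example of the skewed dependence matrix SD, the sketch below uses the running loop A[i1,i2] = A[i1-1,i2] + A[i1,i2-1], whose dependence vectors (1,0) and (0,1) form the columns of D. The skew S is an assumed example, and `inner_loop_parallel` is a hypothetical helper that applies the usual test for a 2-deep nest: the inner loop is parallel when the outer loop carries every dependence.

```python
def matmul(S, D):
    """Plain-list matrix product S * D."""
    return [[sum(S[i][k] * D[k][j] for k in range(len(D)))
             for j in range(len(D[0]))]
            for i in range(len(S))]


def inner_loop_parallel(dep_matrix):
    """2-deep nest: True if every dependence (column) has a positive
    first component, i.e. is carried by the outer loop."""
    return all(x > 0 for x in dep_matrix[0])


D = [[1, 0],
     [0, 1]]          # columns: dependence vectors (1,0) and (0,1)
S = [[1, 1],
     [0, 1]]          # assumed skew: j1 = i1 + i2, j2 = i2
SD = matmul(S, D)     # columns: (1,0) and (1,1)
```

Here the second dependence grows from (0,1) to (1,1), which is the kind of change in dependence length that feeds into LS(t,SD).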

SLIDE 18

Optimal ILP and Register Tiling: Optimization Problem Formulation

minimize    TotalExecutionTime(t,S)
subject to  LoadStoreVolume(t,S) ≤ Registers

For a fixed skew S

  • t is the only variable
  • the optimization problem reduces to an integer convex optimization problem
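For tiny problems the fixed-skew formulation can be explored by exhaustive search, which helps build intuition even though the slides solve it as a convex program. In the sketch below the cost and volume functions are hypothetical placeholders for a 2D loop (per-tile cost max(α·t1·t2, β·(t1+t2)) and load/store volume t1+t2), and the constants, register count, and function names are all assumptions.

```python
from itertools import product


def total_time(t, N=(6, 6), alpha=1.0, beta=2.0):
    """HYPOTHETICAL TotalExecutionTime for a fixed skew (2D, no loop overhead)."""
    ntiles = (N[0] / t[0]) * (N[1] / t[1])
    return ntiles * max(alpha * t[0] * t[1], beta * (t[0] + t[1]))


def ls_volume(t):
    """HYPOTHETICAL LoadStoreVolume: values entering a t1 x t2 tile."""
    return t[0] + t[1]


def best_tile(N=(6, 6), registers=6):
    """Enumerate integer tile sizes and keep the cheapest feasible one."""
    candidates = [t for t in product(range(1, N[0] + 1), range(1, N[1] + 1))
                  if ls_volume(t) <= registers]
    best = min(candidates, key=total_time)
    return best, total_time(best)
```

Enumeration grows exponentially with loop depth and tile range, which is one reason a convex formulation with standard solvers is attractive.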

SLIDE 19

Solution Steps

Can permutation expose a parallel loop?

  • Yes!
      • No skewing, only tiling
      • Fix S=I in the optimization problem
      • Solve for optimal tile sizes
      • Single integer convex optimization problem
  • No!
      • Construct set (Γ) of valid skews
      • For each element in Γ solve the fixed skew optimization problem
      • Pick the best
      • Only d(d-1) problems

SLIDE 20

Solving for Optimal Tile Sizes

  • The optimization problem for tile sizes is an Integer Geometric Program (à la Integer Linear Programs)
  • GPs can be transformed into convex optimization problems
  • Standard solvers are available
  • Running time:
      • depends on #vars & #constraints
      • few seconds (< 10 secs.)
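For reference, the standard textbook transformation from a geometric program to a convex problem is sketched below. The slides do not spell out the posynomials of the tiling problem, so the symbols f_i, c_{ik}, a_{ikj}, and y are generic notation, and how the integrality constraint on t is handled is not stated on the slide.

```latex
% Geometric program in standard form (each f_i is a posynomial in t):
\begin{aligned}
\text{minimize}\quad   & f_0(t) \\
\text{subject to}\quad & f_i(t) \le 1, \quad i = 1, \dots, m, \\
                       & t_j \in \mathbb{Z}_{>0}, \quad j = 1, \dots, n,
\end{aligned}
\qquad
f_i(t) = \sum_{k} c_{ik} \prod_{j=1}^{n} t_j^{a_{ikj}}, \quad c_{ik} > 0 .

% Change of variables t_j = e^{y_j}, then take logarithms:
\log f_i\!\left(e^{y}\right)
  = \log \sum_{k} \exp\!\left(a_{ik}^{\top} y + \log c_{ik}\right)
```

Each transformed function is a log-sum-exp of affine functions of y, which is convex, so the continuous relaxation can be handed to a standard convex solver.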

SLIDE 21

Validation

  • Experimental validation requires
      • array register allocator
      • architectural support (like rotating registers)
  • Similar model used for finding optimal unroll factor
      • optimal unroll factors can be found with small tweaks
  • In tiling for memory hierarchy
      • we have successfully used a similar model
      • almost all the cost models used by other researchers can be cast into our GP framework [RR-SC04]

SLIDE 22

Related Work

  • Unroll and Jam approach
      • [Callahan et al.-90], [Carr-Kennedy-94], [Sarkar-01]
  • Hierarchical tiling
      • [Carter et al.-95], [Mitchell et al.-98]
  • Software pipelining of loop nests
      • [Ramanujam-94], [Rong et al.-04], [Rong et al.-05]
  • Code generation for register tiling
      • [Jimenez et al.-02], [Sarkar-01]

SLIDE 23

Conclusions & Future Work

  • A mathematical formulation of the combined ILP and register tiling problem.
  • A globally optimal solution.
  • Future work:
      • adapting modulo schedulers to pipeline skewed loops
      • developing an array register allocator
      • experimental validation on benchmarks