Optimal ILP and Register Tiling: Analytical Model and Optimization Framework


  1. Optimal ILP and Register Tiling: Analytical Model and Optimization Framework
     Lakshminarayanan Renganarayana, Upadrasta Ramakrishna, Sanjay Rajopadhye
     Computer Science Department, Colorado State University

  2. Overview
     - ILP and register reuse
     - Execution time and register pressure functions
     - Optimal ILP and register tiling problem
     - Optimal tiling problem as convex opt. problem
     - Validation
     - Related work
     - Conclusions & Future work

     October 21, 2005, LCPC '05

  3. ILP and Register Reuse
     - Loop programs
       - dominate application execution time
       - main sources of ILP and register reuse
     - Transformations
       - expose / exploit ILP
       - enable register reuse
     - These transformations interact in subtle ways
     - ILP - Register Reuse tradeoff?

  4. ILP - Register Reuse Tradeoff
     - Optimal combination of transformations
     - Quantification of interactions
     - A mathematical model
       - to study the interactions
       - to choose the optimal trans. parameters
     - To the best of our knowledge, no such model has been studied

  5. Contributions
     - Cost model with trans. params. as variables
       - closed forms: execution time & register pressure
     - Convex optimization problem formulation
     - A globally optimal solution
     - First such formulation & optimal solution

  6. Exposing and Exploiting ILP
     - Exposing ILP
       - Unroll and Jam
       - Loop permutation or skewing
       - Multi-dimensional scheduling
     - Exploiting ILP
       - DAG schedulers
       - Software pipelining

  7. Exposing ILP with Unroll and Jam

         for i1 = 1 to 6
           for i2 = 1 to 6
             A[i1,i2] = A[i1-1,i2] + A[i1,i2-1]

     - Unrolled loop body (9 iterations)
     - DAG exposes parallelism
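
How much parallelism the DAG of an unrolled body exposes can be estimated by levelling it. The sketch below is not from the slides: it assumes a 3 x 3 unroll (9 iterations), unit latency for every statement, and only the two dependences of the example loop; `dag_levels` is a hypothetical helper name.

```python
from collections import Counter

def dag_levels(u1, u2):
    """Earliest issue level of each statement in a u1 x u2 unrolled body.

    Assumes unit latency and the example's dependences:
    (i1, i2) depends on (i1-1, i2) and (i1, i2-1).
    """
    level = {}
    for a in range(u1):
        for b in range(u2):
            # Only predecessors inside the unrolled body constrain the schedule.
            preds = [level[p] for p in ((a - 1, b), (a, b - 1)) if p in level]
            level[(a, b)] = 1 + max(preds, default=0)
    return level

levels = dag_levels(3, 3)
width = Counter(levels.values())       # statements that can issue together per level
critical_path = max(levels.values())   # length of the longest dependence chain
print(sorted(width.items()), critical_path)
```

Under these assumptions the 9 statements fit in 5 levels, with up to 3 independent statements in the widest level.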

  8. Exposing ILP with Permutation

     Before permutation (the inner i2 loop carries the dependence):

         for i1 = 1 to 6
           for i2 = 1 to 6
             A[i1,i2] = 3.23 * A[i1,i2-1]

     After permutation (iterations of the inner i1 loop are independent):

         for i2 = 1 to 6
           for i1 = 1 to 6
             A[i1,i2] = 3.23 * A[i1,i2-1]
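
A quick way to check that a permuted inner loop is parallel is to run it in a different order and compare results. This is a minimal sketch, not part of the slides; the boundary values in column 0 and the helper name `run` are assumptions made for illustration.

```python
def run(order, inner_reversed=False, n=6):
    """Run A[i1][i2] = 3.23 * A[i1][i2-1] with the given loop order.

    order is "i1,i2" (i2 innermost) or "i2,i1" (i1 innermost).
    inner_reversed walks the innermost loop backwards, which preserves
    the result only when that loop carries no dependence.
    """
    # Column 0 holds assumed boundary values; row 0 is unused.
    A = [[float(i1 + 1)] + [0.0] * n for i1 in range(n + 1)]
    inner = range(n, 0, -1) if inner_reversed else range(1, n + 1)
    for outer_idx in range(1, n + 1):
        for inner_idx in inner:
            if order == "i1,i2":
                i1, i2 = outer_idx, inner_idx
            else:
                i1, i2 = inner_idx, outer_idx
            A[i1][i2] = 3.23 * A[i1][i2 - 1]
    return A

reference = run("i1,i2")
print(run("i2,i1") == reference)                       # permutation keeps the result
print(run("i2,i1", inner_reversed=True) == reference)  # inner i1 loop is order-independent
print(run("i1,i2", inner_reversed=True) == reference)  # inner i2 loop is not
```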

  9. Exposing ILP with Skewing

         for i1 = 1 to 6
           for i2 = 1 to 6
             A[i1,i2] = A[i1-1,i2] + A[i1,i2-1]

     - After skewing, all the iterations of the inner-most loop are parallel
     - Sufficient ILP
     - Performance limited only by the available execution resources
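
The slide does not say which skew it uses. The sketch below assumes the common wavefront skew w = i1 + i2 and all-ones boundary values, and checks that the skewed nest computes the same array even when its inner loop runs in reverse.

```python
def original(n=6):
    A = [[1.0] * (n + 1) for _ in range(n + 1)]  # row 0 and column 0 are boundary values
    for i1 in range(1, n + 1):
        for i2 in range(1, n + 1):
            A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1]
    return A

def skewed(n=6, inner_reversed=False):
    """Same computation after the assumed skew (i1, i2) -> (w, i1), w = i1 + i2."""
    A = [[1.0] * (n + 1) for _ in range(n + 1)]
    for w in range(2, 2 * n + 1):                # outer loop walks the wavefronts
        lo, hi = max(1, w - n), min(n, w - 1)    # keeps 1 <= i2 = w - i1 <= n
        inner = range(hi, lo - 1, -1) if inner_reversed else range(lo, hi + 1)
        for i1 in inner:                         # both sources lie on wavefront w - 1
            i2 = w - i1
            A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1]
    return A

print(skewed() == original())
print(skewed(inner_reversed=True) == original())
```

Every iteration on wavefront w reads only values produced on wavefront w - 1, which is why the inner loop can run in any order.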

  10. Register Reuse
      - Unroll and Jam
      - Scalar replacement
        - scalar replacement enables register placement
        - classic register allocators are sufficient
      - Loop tiling
        - array register allocation
        - registers allocated to array values
        - no code size increase
        - requires an array register allocator

  11. Scalar Replacement

      Original loop:

          for i1 = 1 to 6
            for i2 = 1 to 6
              A[i1,i2] = A[i1-1,i2] + A[i1,i2-1]

      After unrolling 2 x 2:

          for i1 = 1 to 6 step 2
            for i2 = 1 to 6 step 2
              A[i1,i2]     = A[i1-1,i2]   + A[i1,i2-1]
              A[i1+1,i2]   = A[i1,i2]     + A[i1+1,i2-1]
              A[i1,i2+1]   = A[i1-1,i2+1] + A[i1,i2]
              A[i1+1,i2+1] = A[i1,i2+1]   + A[i1+1,i2]

      After replacing A[i1,i2] by a scalar T:

          for i1 = 1 to 6 step 2
            T = A[i1,0]
            for i2 = 1 to 6 step 2
              T            = A[i1-1,i2]   + T
              A[i1+1,i2]   = T            + A[i1+1,i2-1]
              A[i1,i2+1]   = A[i1-1,i2+1] + T
              A[i1+1,i2+1] = A[i1,i2+1]   + A[i1+1,i2]
              A[i1,i2]     = T

      - Saves 2 loop independent loads plus 1 loop carried load
      - T can be allocated to a register
      - Which array references to scalar replace?
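
The effect of scalar replacement on memory traffic can be checked by counting array loads. The sketch below is a variant written for illustration, not the slide's exact code: it assumes all-ones boundary values and reuses T for A[i1,i2] and then for A[i1,i2+1], so that the value carried into the next i2 iteration is the one that iteration reads. Its load counts therefore differ from the slide's.

```python
def original(n=6):
    A = [[1.0] * (n + 1) for _ in range(n + 1)]
    loads = 0
    for i1 in range(1, n + 1):
        for i2 in range(1, n + 1):
            A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1]
            loads += 2                            # two array reads per statement
    return A, loads

def unrolled_scalar_replaced(n=6):
    """2 x 2 unrolled nest with one scalar T (n must be even)."""
    A = [[1.0] * (n + 1) for _ in range(n + 1)]
    loads = 0
    for i1 in range(1, n + 1, 2):
        T = A[i1][0]                              # one load per i1 iteration
        loads += 1
        for i2 in range(1, n + 1, 2):
            T = A[i1 - 1][i2] + T                 # T now holds A[i1][i2]
            A[i1][i2] = T
            A[i1 + 1][i2] = T + A[i1 + 1][i2 - 1]
            T = A[i1 - 1][i2 + 1] + T             # T now holds A[i1][i2+1]
            A[i1][i2 + 1] = T
            A[i1 + 1][i2 + 1] = T + A[i1 + 1][i2]
            loads += 4                            # remaining array reads in the body
    return A, loads

A0, loads0 = original()
A1, loads1 = unrolled_scalar_replaced()
print(A0 == A1, loads0, loads1)
```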

  12. Tiling
      - Similar to Unroll and Jam
      - Decreases life time of values
      - Limits MAXLIVE

          for i1 = 1 to 6
            for i2 = 1 to 6
              A[i1,i2] = A[i1-1,i2] + A[i1,i2-1]

  13. Register tiling

          for i1 = 1 to 6
            for i2 = 1 to 6
              A[i1,i2] = A[i1-1,i2] + A[i1,i2-1]

      3x3 register tile

      Tile sizes:
      - Affect load/store savings
      - Constrained by number of registers
      - How to choose the tile sizes?

  14. Traditional vs. Our Approach
      - Traditional Approach
        - Code transformation: Unroll and Jam + Scalar Promotion
        - Sched. & Reg. Alloc: DAG Scheduler or Software Pipelining + Scalar Register Allocation
        - Choose optimal unroll and scalar promotion parameters
      - Our Approach
        - Code transformation: Permutation or Skewing + Tiling
        - Sched. & Reg. Alloc: Software Pipelining + Scalar & Array Register Allocation
        - Choose optimal skew and tile parameters

  15. Program, Tiling, and Architecture Class
      - Input loops:
        - perfectly nested, rectangular loops
        - uniform dependence bodies
      - Rectangular tiling
        - we assume: input loop nest admits rectangular tiling
      - ILP-exposed by: permutation or skewing
      - Architectures: superscalar or VLIW

  16. Execution Time (when permutation exposes ILP)

          T               = (ntiles * tile_cost) + loop_overhead
          tile_cost       = max(comp_cost, load_store_cost)
          comp_cost       = α * tile_vol
          load_store_cost = β * LS(t, D)
          loop_overhead   = η * LO(t, N)
          ntiles          = N_1/t_1 * … * N_n/t_n

      where
      - t = vector of tile sizes
      - N = vector of iter. space sizes
      - D = dependence matrix
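
The execution time model can be evaluated directly once LS and LO are given. The slide does not define either, so the sketch below uses loudly hypothetical stand-ins: `ls_volume` counts, per dependence, the tile points whose source lies outside the tile, and the loop overhead LO is taken to be one unit per tile. The values of α, β, η in the usage line are arbitrary.

```python
from math import prod

def ls_volume(t, D):
    """Hypothetical LS(t, D): tile points whose dependence source is outside the tile."""
    vol = prod(t)
    return sum(
        vol - prod(max(tk - abs(dk), 0) for tk, dk in zip(t, d))
        for d in D                                # D is a list of dependence vectors
    )

def execution_time(t, N, D, alpha, beta, eta):
    ntiles = prod(Nk / tk for Nk, tk in zip(N, t))
    comp_cost = alpha * prod(t)                   # tile_vol = product of tile sizes
    load_store_cost = beta * ls_volume(t, D)
    tile_cost = max(comp_cost, load_store_cost)
    loop_overhead = eta * ntiles                  # hypothetical LO(t, N): one unit per tile
    return ntiles * tile_cost + loop_overhead

# 3 x 3 tiles on the 6 x 6 example with dependences (1,0) and (0,1).
print(execution_time((3, 3), (6, 6), [(1, 0), (0, 1)], alpha=1.0, beta=2.0, eta=4.0))
```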

  17. Execution Time Model (when permutation cannot expose ILP: skew)
      - Skewing affects
        - iteration space shape: makes counting partial tiles, full tiles, and the total number of tiles hard
        - dependence lengths: affects the amount of data loaded / stored in a tile
      - Partial tiles treated as full tiles
      - Number of tiles approximated by N_1/t_1 * … * N_n/t_n
      - Dep. matrix = SD
      - LS(t, SD) is the load store volume

  18. Optimal ILP and Register Tiling: Optimization Problem Formulation

          minimize    TotalExecutionTime(t, S)
          subject to  LoadStoreVolume(t, S) ≤ Registers

      - For a fixed skew S
        - t is the only variable
        - opt. prob. reduces to an integer convex opt. prob.
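
For a fixed skew and a tiny iteration space, the constrained problem can be explored by plain enumeration. The sketch below is an exhaustive search, which is not the convex optimization method the slides use; `best_tile`, the toy cost and pressure functions, and the constants (8 registers, α = 1, β = 2, η = 4) are all assumptions for illustration.

```python
from itertools import product
from math import prod

def best_tile(N, cost, pressure, registers):
    """Try every integer tile size 1 <= t_k <= N_k and keep the cheapest feasible one."""
    best = None
    for t in product(*(range(1, Nk + 1) for Nk in N)):
        if pressure(t) > registers:               # register constraint violated
            continue
        c = cost(t)
        if best is None or c < best[0]:
            best = (c, t)
    return best

N = (6, 6)

def pressure(t):
    return sum(t)                                 # toy load/store volume of a 2-D tile

def cost(t):
    ntiles = prod(Nk / tk for Nk, tk in zip(N, t))
    return ntiles * max(1.0 * prod(t), 2.0 * pressure(t)) + 4.0 * ntiles

best = best_tile(N, cost, pressure, registers=8)
print(best)
```

Enumeration grows with the product of the iteration space sizes, so it is only useful as a cross-check on small cases.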

  19. Solution Steps

      Can permutation expose a parallel loop?
      - Yes!
        - No skewing, only tiling
        - Fix S = I in opt. prob.
        - Solve for optimal tile sizes
        - Single integer convex opt. problem
      - No!
        - Construct set (Γ) of valid skews
        - For each element in Γ, solve the fixed skew optimization problem
        - Pick the best
        - Only d(d-1) problems

  20. Solving for Optimal Tile Sizes
      - Opt. prob. for tile sizes is an Integer Geometric Program (à la Integer Linear Programs)
      - GPs can be transformed into convex opt. probs.
      - Standard solvers are available
      - Running time:
        - depends on #vars & #constraints
        - few seconds (< 10 secs.)
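
The transformation of a geometric program into a convex problem is the standard logarithmic change of variables. The sketch below shows it for one posynomial f in the tile sizes t; the symbols c_k, a_k, b_k, y and K are introduced here for illustration and do not appear on the slides. It covers only the continuous part of the problem, not the integrality of t.

```latex
\begin{aligned}
f(t) &= \sum_{k=1}^{K} c_k \, t_1^{a_{1k}} \cdots t_n^{a_{nk}}, \qquad c_k > 0 \\
t_j &= e^{y_j}, \qquad b_k = \log c_k, \qquad a_k = (a_{1k}, \dots, a_{nk}) \\
\log f(e^{y}) &= \log \sum_{k=1}^{K} \exp\!\left(a_k^{\top} y + b_k\right)
\end{aligned}
```

The last expression is a log-sum-exp of affine functions of y, which is convex, so minimizing a posynomial subject to posynomial constraints of the form f_i(t) ≤ 1 becomes a convex problem in y.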

  21. Validation
      - Experimental validation requires
        - array register allocator
        - architectural support (like rotating registers)
      - Similar model used for finding optimal unroll factor
        - optimal unroll factors can be found with small tweaks
      - In tiling for memory hierarchy
        - we have successfully used a similar model
        - almost all the cost models used by other researchers can be cast into our GP framework [RR-SC04]

  22. Related Work
      - Unroll and Jam approach
        - [Callahan et al.-90], [Carr-Kennedy-94], [Sarkar-01]
      - Hierarchical tiling
        - [Carter et al.-95], [Mitchell et al.-98]
      - Software pipelining of loop nests
        - [Ramanujam-94], [Rong et al.-04], [Rong et al.-05]
      - Code generation for register tiling
        - [Jiménez et al.-02], [Sarkar-01]

  23. Conclusions & Future Work
      - A mathematical formulation of the combined ILP and register tiling problem
      - A globally optimal solution
      - Future work:
        - adapting modulo schedulers to pipeline skewed loops
        - developing an array register allocator
        - experimental validation on benchmarks
