Register Allocation and Instruction Scheduling in Unison



  1. Register Allocation and Instruction Scheduling in Unison. Roberto Castañeda Lozano (SICS, KTH). Joint work with G. Hjort Blindell (KTH, SICS), M. Carlsson (SICS), F. Drejhammar (SICS), and C. Schulte (KTH, SICS). This research has been partially funded by Ericsson AB and the Swedish Research Council (VR 621-2011-6229).

  2. Code Generation in LLVM. Pipeline: … → instruction scheduling → register allocation → instruction scheduling → … (stage labels on the slide: PreRA, PreEmit). Stages, heuristics. Pros: compilation speed. Cons: suboptimal, complex.

  3. Introducing Unison. Pipeline: … → register allocation and instruction scheduling, integrated into one combinatorial problem and handed to a constraint solver → … (stage labels on the slide: PreRA, PreEmit). Integration, combinatorial optimization. Pros: simple, optimal. Cons: compilation slowdown.

  4. Unison Is Practical and Effective. For LLVM users: traditional LLVM for the compile/debug cycle; LLVM + Unison for release builds. For LLVM developers: evaluation of heuristics; identification of improvement opportunities.

  5. Outline: 1 Optimal Approaches, 2 Model, 3 Results, 4 Case Studies, 5 Conclusion.

  6. Earlier Optimal Approaches. Global register allocation and local instruction scheduling: practical and effective. Global instruction scheduling: scales up to medium-size problems. Integrated optimal approaches: ignore essential register allocation subproblems; do not scale beyond small problems.

  7. Integrated Optimal Approaches. Columns compared on the slide: GL; register allocation (SP, RA, CO, SO, RP, LS, RM, MB); instruction scheduling (BD, MU, 2D); maximum problem size.

       approach        max. size   ✓ count (of 12 columns)
       Wilson 1994     30          7
       Chang 1997      ∼10         4
       Gebotys 1997    108         6
       ICG 1999        23          7
       PROPAN 2000     42          6
       Nagar. 2007     ?           5
       OPTIMIST 2012   100         7
       Unison          605         12

     Few global approaches. Few register allocation subproblems. Low scalability. Unison: all subproblems, better scalability. Key: constraint programming.

  8. Outline: 1 Optimal Approaches, 2 Model, 3 Results, 4 Case Studies, 5 Conclusion.

  9. Model. Register allocation: allocate temps to registers or memory (register assignment, spilling, coalescing, live range splitting, …). Instruction scheduling. Connection between the two: temp live ranges.

  10. Register Assignment as Rectangle Packing. Correspondence: temp live ranges ↔ rectangles; temp size (16 bits, 32 bits, …) ↔ rectangle width; interfering temps cannot share registers ↔ rectangles cannot overlap. [Figure: temps t1 to t4 and their live ranges over time (0, 1, 2), packed as rectangles into registers R1, R2, R3, R4, ….] Model based on Pereira and Palsberg, 2008. Constraint: $\text{no-overlap}(\langle r_{t_1}, r_{t_1}+1, ls_{t_1}, le_{t_1}\rangle, \langle r_{t_2}, r_{t_2}+2, ls_{t_2}, le_{t_2}\rangle, \ldots)$
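The no-overlap condition on slide 10 can be illustrated with a small checker. This is only a sketch under stated assumptions: the struct `TempRect`, its field names, the function names, and the use of half-open intervals are all hypothetical choices for illustration, not part of Unison. A constraint solver searches for register values that satisfy the condition; the code below merely verifies a given assignment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One temporary viewed as a rectangle (hypothetical layout):
   registers [reg, reg + width) on one axis,
   live range [live_start, live_end) on the time axis. */
typedef struct {
    int reg;        /* first register occupied, r_t on the slide */
    int width;      /* number of registers occupied (temp size) */
    int live_start; /* ls_t on the slide */
    int live_end;   /* le_t on the slide */
} TempRect;

/* Two rectangles overlap only if they share a register AND a point in time. */
bool rects_overlap(TempRect a, TempRect b) {
    bool share_register = a.reg < b.reg + b.width && b.reg < a.reg + a.width;
    bool share_time = a.live_start < b.live_end && b.live_start < a.live_end;
    return share_register && share_time;
}

/* Pairwise check of the no-overlap condition for a complete assignment. */
bool no_overlap(const TempRect *temps, size_t n) {
    assert(temps != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (rects_overlap(temps[i], temps[j]))
                return false;
    return true;
}
```

With a width-1 temp and a width-2 temp, as in the constraint on the slide, the check accepts assignments where the two use disjoint registers or disjoint live ranges, and rejects any assignment where both register span and live range intersect.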

  11. Outline: 1 Optimal Approaches, 2 Model, 3 Results, 4 Case Studies, 5 Conclusion.

  12. Speedup over LLVM 3.8. [Figure: speedup per function (axis from 0% to 50%), grouped by benchmark: adpcm, epic, g721, gsm, jpeg, mpeg2; ◾ marks provably optimal results.] 50 MediaBench functions; Hexagon V4 processor. Provably optimal (◾) for 54% of the functions. Compilation time: from seconds to minutes.

  13. Outline: 1 Optimal Approaches, 2 Model, 3 Results, 4 Case Studies, 5 Conclusion.

  14. Disclaimer: the case studies are generated with LLVM 3.8 for Hexagon V4.

  15. Disclaimer: opt is run with different optimization levels and some llc passes are disabled for simplicity (llc is always run with O3).

  16. Disclaimer: I am no expert on LLVM, just a humble user.

  17. Case Study: fac. Simple iterative factorial: exposes an opportunity for better coalescing; illustrates the effect of integrated reasoning. Source:

       int fac(int n) {
         int f = 1;
         while (n > 0) {
           f = f * n;
           n--;
         }
         return f;
       }

  18. fac, generated code. LLVM (one bundle per line):

       r0 = #1; r1 = r0
       p0 = cmp.gt(r1, #0); if (!p0.new) jump:nt .LBB0_2
       r0 = mpyi(r0, r1); r2 = add(r1, #-1); p0 = cmp.gt(r1, #1)
       if (p0) jump .LBB0_1; r1 = r2
       jumpr r31

  19. fac, generated code. LLVM (one bundle per line):

       r0 = #1; r1 = r0
       p0 = cmp.gt(r1, #0); if (!p0.new) jump:nt .LBB0_2
       r0 = mpyi(r0, r1); r2 = add(r1, #-1); p0 = cmp.gt(r1, #1)
       if (p0) jump .LBB0_1; r1 = r2
       jumpr r31

     Unison (one bundle per line):

       r1 = #1; p0 = cmp.gt(r0, #0); if (!p0.new) jump:nt .LBB0_2
       r1 = mpyi(r1, r0); r0 = add(r0, #-1); p0 = cmp.gt(r0, #1); if (p0.new) jump:t .LBB0_1
       r0 = r1; jumpr r31

  20. LLVM's loop is twice as slow, why? LLVM (one bundle per line):

       r0 = #1; r1 = r0
       p0 = cmp.gt(r1, #0); if (!p0.new) jump:nt .LBB0_2
       r0 = mpyi(r0, r1); r2 = add(r1, #-1); p0 = cmp.gt(r1, #1)
       if (p0) jump .LBB0_1; r1 = r2
       jumpr r31

     Unison (one bundle per line):

       r1 = #1; p0 = cmp.gt(r0, #0); if (!p0.new) jump:nt .LBB0_2
       r1 = mpyi(r1, r0); r0 = add(r0, #-1); p0 = cmp.gt(r0, #1); if (p0.new) jump:t .LBB0_1
       r0 = r1; jumpr r31

  21. LLVM's loop body in machine IR:

       # After Simple Register Coalescing:
       ..
       BB#1:
         %vreg2 = A2_addi %vreg10, -1
         %vreg9 = M2_mpyi %vreg9, %vreg10
         %vreg8 = C2_cmpgti %vreg10, 1
         %vreg10 = COPY %vreg2
         J2_jumpt %vreg8, <BB#1>
       ..

     LLVM (one bundle per line):

       r0 = #1; r1 = r0
       p0 = cmp.gt(r1, #0); if (!p0.new) jump:nt .LBB0_2
       r0 = mpyi(r0, r1); r2 = add(r1, #-1); p0 = cmp.gt(r1, #1)
       if (p0) jump .LBB0_1; r1 = r2
       jumpr r31

  22. %vreg2 and %vreg10 are not coalesced:

       # After Simple Register Coalescing:
       ..
       BB#1:
         %vreg2 = A2_addi %vreg10, -1
         %vreg9 = M2_mpyi %vreg9, %vreg10
         %vreg8 = C2_cmpgti %vreg10, 1
         %vreg10 = COPY %vreg2
         J2_jumpt %vreg8, <BB#1>
       ..

     LLVM (one bundle per line):

       r0 = #1; r1 = r0
       p0 = cmp.gt(r1, #0); if (!p0.new) jump:nt .LBB0_2
       r0 = mpyi(r0, r1); r2 = add(r1, #-1); p0 = cmp.gt(r1, #1)
       if (p0) jump .LBB0_1; r1 = r2
       jumpr r31

  23. The corresponding move (r1 = r2) requires a new bundle:

       # After Simple Register Coalescing:
       ..
       BB#1:
         %vreg2 = A2_addi %vreg10, -1
         %vreg9 = M2_mpyi %vreg9, %vreg10
         %vreg8 = C2_cmpgti %vreg10, 1
         %vreg10 = COPY %vreg2
         J2_jumpt %vreg8, <BB#1>
       ..

     LLVM (one bundle per line):

       r0 = #1; r1 = r0
       p0 = cmp.gt(r1, #0); if (!p0.new) jump:nt .LBB0_2
       r0 = mpyi(r0, r1); r2 = add(r1, #-1); p0 = cmp.gt(r1, #1)
       if (p0) jump .LBB0_1; r1 = r2
       jumpr r31

  24. Postponing A2_addi enables coalescing %vreg2 and %vreg10:

       # After Simple Register Coalescing:
       ..
       BB#1:
         %vreg9 = M2_mpyi %vreg9, %vreg10
         %vreg8 = C2_cmpgti %vreg10, 1
         %vreg2 = A2_addi %vreg10, -1
         %vreg10 = COPY %vreg2
         J2_jumpt %vreg8, <BB#1>
       ..

     LLVM (one bundle per line):

       r0 = #1; r1 = r0
       p0 = cmp.gt(r1, #0); if (!p0.new) jump:nt .LBB0_2
       r0 = mpyi(r0, r1); r2 = add(r1, #-1); p0 = cmp.gt(r1, #1)
       if (p0) jump .LBB0_1; r1 = r2
       jumpr r31
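Why the position of A2_addi decides whether %vreg2 and %vreg10 can share a register can be sketched with a deliberately simplified interference test. The assumptions here are loud ones: the `Instr` struct and the function name are hypothetical, only the three loop-body instructions are modeled, and the rule used is the textbook one (two values interfere if one is defined while the other is still needed later), not LLVM's actual coalescer logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified view of one loop-body instruction: only the facts needed
   to reason about %vreg2 and %vreg10 are recorded (hypothetical model). */
typedef struct {
    const char *opcode;
    bool defines_vreg2;
    bool uses_vreg10;
} Instr;

/* %vreg2 and %vreg10 interfere if %vreg2 is defined strictly before the
   last use of %vreg10, because both values are then live at once.
   A definition in the same instruction as the last use does not interfere. */
bool vreg2_interferes_with_vreg10(const Instr *body, size_t n) {
    assert(body != NULL || n == 0);
    size_t def_pos = n;   /* n means "no definition found" */
    size_t last_use = 0;
    bool has_use = false;
    for (size_t i = 0; i < n; i++) {
        if (body[i].defines_vreg2 && def_pos == n)
            def_pos = i;
        if (body[i].uses_vreg10) {
            last_use = i;
            has_use = true;
        }
    }
    return def_pos < n && has_use && def_pos < last_use;
}
```

Under this model, the order A2_addi, M2_mpyi, C2_cmpgti makes the two virtual registers interfere, while the order M2_mpyi, C2_cmpgti, A2_addi does not, so the COPY between them could be removed.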

  25. Unison's initialization is one cycle faster, why? LLVM (one bundle per line):

       r0 = #1; r1 = r0
       p0 = cmp.gt(r1, #0); if (!p0.new) jump:nt .LBB0_2
       r0 = mpyi(r0, r1); r2 = add(r1, #-1); p0 = cmp.gt(r1, #1)
       if (p0) jump .LBB0_1; r1 = r2
       jumpr r31

     Unison (one bundle per line):

       r1 = #1; p0 = cmp.gt(r0, #0); if (!p0.new) jump:nt .LBB0_2
       r1 = mpyi(r1, r0); r0 = add(r0, #-1); p0 = cmp.gt(r0, #1); if (p0.new) jump:t .LBB0_1
       r0 = r1; jumpr r31

  26. Calling convention: argument and return value in R0. [Figure: fac source next to the assignment of f and n to registers R0 and R1, once for LLVM and once for Unison.]

  27. However, n and f interfere: a move is required. [Figure: fac source next to the assignment of f and n to registers R0 and R1, once for LLVM and once for Unison.]

  28. LLVM moves n in the initialization block. [Figure: fac source next to the assignment of f and n to registers R0 and R1, once for LLVM and once for Unison.]

  29. Unison moves f in the return block: does it matter? [Figure: fac source next to the assignment of f and n to registers R0 and R1, once for LLVM and once for Unison.]

  30. It does! Unison's move can be scheduled in parallel. LLVM (one bundle per line):

       r0 = #1; r1 = r0
       p0 = cmp.gt(r1, #0); if (!p0.new) jump:nt .LBB0_2
       r0 = mpyi(r0, r1); r2 = add(r1, #-1); p0 = cmp.gt(r1, #1)
       if (p0) jump .LBB0_1; r1 = r2
       jumpr r31

     Unison (one bundle per line):

       r1 = #1; p0 = cmp.gt(r0, #0); if (!p0.new) jump:nt .LBB0_2
       r1 = mpyi(r1, r0); r0 = add(r0, #-1); p0 = cmp.gt(r0, #1); if (p0.new) jump:t .LBB0_1
       r0 = r1; jumpr r31

  31. What if cmp.gt could choose r0 during scheduling? LLVM (one bundle per line):

       r0 = #1; r1 = r0
       p0 = cmp.gt(r1, #0); if (!p0.new) jump:nt .LBB0_2
       r0 = mpyi(r0, r1); r2 = add(r1, #-1); p0 = cmp.gt(r1, #1)
       if (p0) jump .LBB0_1; r1 = r2
       jumpr r31

     Unison (one bundle per line):

       r1 = #1; p0 = cmp.gt(r0, #0); if (!p0.new) jump:nt .LBB0_2
       r1 = mpyi(r1, r0); r0 = add(r0, #-1); p0 = cmp.gt(r0, #1); if (p0.new) jump:t .LBB0_1
       r0 = r1; jumpr r31
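The bundle counts behind the fac case study can be checked with a small simulation of both schedules. It is a sketch under several assumptions that are not stated on the slides: .LBB0_1 is taken to be the loop bundle and .LBB0_2 the return bundle, every instruction in a bundle reads register values from before that bundle, and each executed bundle counts as one step (real cycle counts may differ because of stalls). The names `Run`, `fac_llvm`, and `fac_unison` are hypothetical.

```c
#include <assert.h>

typedef struct {
    int result;  /* value left in r0 when the function returns */
    int bundles; /* number of bundles executed */
} Run;

/* LLVM schedule: two bundles of initialization, two bundles per loop
   iteration, one bundle to return. */
Run fac_llvm(int n) {
    int r0 = n, r1 = 0, r2 = 0, p0 = 0, bundles = 0;
    r1 = r0; r0 = 1;              /* r0 = #1; r1 = r0 (r1 reads the old r0) */
    bundles++;
    p0 = r1 > 0;                  /* p0 = cmp.gt(r1, #0); if (!p0.new) jump */
    bundles++;
    while (p0) {
        r0 = r0 * r1;             /* r0 = mpyi(r0, r1) */
        r2 = r1 - 1;              /* r2 = add(r1, #-1) */
        p0 = r1 > 1;              /* p0 = cmp.gt(r1, #1) */
        bundles++;
        r1 = r2;                  /* if (p0) jump; r1 = r2 */
        bundles++;
    }
    bundles++;                    /* jumpr r31 */
    return (Run){r0, bundles};
}

/* Unison schedule: one bundle each for initialization, loop iteration,
   and the final move plus return. */
Run fac_unison(int n) {
    int r0 = n, r1 = 0, p0 = 0, bundles = 0;
    r1 = 1; p0 = r0 > 0;          /* r1 = #1; p0 = cmp.gt(r0, #0); if (!p0.new) jump */
    bundles++;
    while (p0) {
        int old_r0 = r0;          /* all reads see the pre-bundle value of r0 */
        r1 = r1 * old_r0;         /* r1 = mpyi(r1, r0) */
        r0 = old_r0 - 1;          /* r0 = add(r0, #-1) */
        p0 = old_r0 > 1;          /* p0 = cmp.gt(r0, #1); if (p0.new) jump */
        bundles++;
    }
    r0 = r1;                      /* r0 = r1; jumpr r31 */
    bundles++;
    return (Run){r0, bundles};
}
```

For n iterations the model gives 2n + 3 bundles for the LLVM schedule and n + 2 for the Unison schedule, which matches the three observations in the case study: a loop half as long, an initialization one bundle shorter, and a move that shares a bundle with the return.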

  32. Case Study: fib. Simple recursive Fibonacci: exposes an opportunity for better spilling. Source:

       int fib(int n) {
         if (n <= 2) {
           return 1;
         }
         return fib(n-1) + fib(n-2);
       }

  33. fib, generated code. LLVM (one bundle per line):

       memd(r29 + #-16) = r17:16; allocframe(#16)
       p0 = cmp.gt(r0, #2); if (p0.new) jump:nt .LBB0_2; memw(r29+#0) = r0
       r0 = memw(r29 + #0)
       call fib; r0 = add(r0, #-1)
       r0 = #1; jump .LBB0_3; memw(r29+#4) = r0.new
       r16 = r0; r1 = memw(r29 + #0)
       call fib; r0 = add(r1, #-2)
       r0 = add(r16, r0); memw(r29+#4) = r0.new
       r0 = memw(r29 + #4); r17:16 = memd(r29 + #8)
       dealloc_return
