Register Allocation and Instruction Scheduling in Unison
Roberto Castañeda Lozano (SICS, KTH)
Joint work with: G. Hjort Blindell (KTH, SICS), M. Carlsson (SICS), F. Drejhammar (SICS), C. Schulte (KTH, SICS)
This research has been partially funded by Ericsson AB and the Swedish Research Council (VR 621-2011-6229)
Code Generation in LLVM
… → instruction scheduling (PreRA) → register allocation → instruction scheduling (PreEmit) → …
- Stages, heuristics
- Pros: compilation speed
- Cons: suboptimal, complex
Introducing Unison
… → (PreRA) register allocation + instruction scheduling as one integrated combinatorial problem → constraint solver → (PreEmit) …
- Integration, combinatorial optimization
- Pros: simple, optimal
- Cons: compilation slowdown
Unison Is Practical and Effective
For LLVM users:
- traditional LLVM for compile/debug cycle
- LLVM + Unison for release builds
For LLVM developers:
- evaluation of heuristics
- identification of improvement opportunities
1 Optimal Approaches
2 Model
3 Results
4 Case Studies
5 Conclusion
Earlier Optimal Approaches
- Global register allocation, local instruction scheduling: practical and effective
- Global instruction scheduling: scales up to medium-size problems
- Integrated optimal approaches: ignore essential register allocation subproblems, do not scale beyond small problems
Integrated Optimal Approaches
Columns on the slide: approach; GL; register allocation and instr. sched. subproblems (SP, RA, CO, SO, RP, LS, RM, MB, BD, MU, 2D); max. size

approach        max. size   columns marked ✓ (of 12)
Wilson 1994     30          7
Chang 1997      ∼10         4
Gebotys 1997    108         6
ICG 1999        23          7
PROPAN 2000     42          6
Nagar. 2007     ?           5
OPTIMIST 2012   100         7
Unison          605         12

- Few global approaches
- Few register allocation subproblems
- Low scalability
- Unison: all subproblems, better scalability (key: constraint programming)
Model
- Register allocation: allocate temps to registers or memory
  - register assignment
  - spilling
  - coalescing
  - live range splitting
  - …
- Instruction scheduling
- Connection: temp live ranges
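The connection through live ranges can be made concrete with a small sketch. It is only an illustration under stated assumptions: the three-instruction block, the `Scheduled` tuple layout, and the `live_ranges` helper are hypothetical, and a live range is simplified to the span from the definition cycle to the last use cycle, which may differ from the exact definition in the Unison model.

```python
from typing import Dict, List, Optional, Tuple

# One scheduled instruction: (issue cycle, temp it defines or None, temps it uses).
# This layout is an assumption made for the example.
Scheduled = Tuple[int, Optional[str], Tuple[str, ...]]


def live_ranges(schedule: List[Scheduled]) -> Dict[str, Tuple[int, int]]:
    """Map each temp to (cycle of its definition, cycle of its last use)."""
    ranges: Dict[str, Tuple[int, int]] = {}
    for cycle, defined, uses in sorted(schedule, key=lambda ins: ins[0]):
        # Extend the range of every temp read by this instruction.
        for temp in uses:
            start, end = ranges[temp]
            ranges[temp] = (start, max(end, cycle))
        # A newly defined temp starts a range at its issue cycle.
        if defined is not None:
            ranges[defined] = (cycle, cycle)
    return ranges


# Same three instructions, two hypothetical schedules.
early = [(0, "t1", ()), (1, "t2", ()), (2, "t3", ("t1", "t2"))]
late = [(1, "t1", ()), (1, "t2", ()), (2, "t3", ("t1", "t2"))]

print(live_ranges(early))
print(live_ranges(late))
```

Issuing the definition of `t1` one cycle later shortens its live range from cycles 0 to 2 down to cycles 1 to 2, so a scheduling decision directly changes what the register allocator has to place.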
Register Assignment as Rectangle Packing

Register Assignment                          Rectangle Packing
temp live ranges                             rectangles
temp size (16 bits, 32 bits, …)              rectangle width
interfering temps cannot share registers     rectangles cannot overlap

[Figure: live ranges of temps t1 to t4 drawn as rectangles over registers R1, R2, R3, R4, … and time (0, 1, 2)]

$\text{no-overlap}(\langle r_{t_1}, r_{t_1} + 1, ls_{t_1}, le_{t_1} \rangle, \langle r_{t_2}, r_{t_2} + 2, ls_{t_2}, le_{t_2} \rangle, \ldots)$

Model based on Pereira and Palsberg, 2008
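The packing view can be sketched in a few lines of Python. This is a minimal illustration, not Unison's model: the `Rect` type, the helper names, and the example live ranges are hypothetical, live range ends are treated as exclusive, and a brute-force search stands in for what a constraint solver does with a no-overlap constraint.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Rect:
    """A temp placed in the register file: columns are registers, rows are time."""
    reg: int     # first register column occupied
    width: int   # number of register columns (temp size)
    start: int   # live range start
    end: int     # live range end (exclusive, an assumption of this sketch)


def overlap(a: Rect, b: Rect) -> bool:
    """Two rectangles overlap iff they intersect on both axes."""
    share_regs = a.reg < b.reg + b.width and b.reg < a.reg + a.width
    share_time = a.start < b.end and b.start < a.end
    return share_regs and share_time


def no_overlap(rects: Iterable[Rect]) -> bool:
    """True if no pair of rectangles overlaps."""
    rs = list(rects)
    return not any(overlap(rs[i], rs[j])
                   for i in range(len(rs)) for j in range(i + 1, len(rs)))


def assign(temps: Dict[str, Tuple[int, int, int]],
           num_regs: int) -> Optional[Dict[str, int]]:
    """Brute-force register assignment; temps maps name -> (width, start, end)."""
    names = list(temps)
    for regs in product(range(num_regs), repeat=len(names)):
        rects = [Rect(r, *temps[n]) for n, r in zip(names, regs)]
        fits = all(rc.reg + rc.width <= num_regs for rc in rects)
        if fits and no_overlap(rects):
            return dict(zip(names, regs))
    return None


# Hypothetical temps: (width, live range start, live range end).
temps = {"t1": (1, 0, 1), "t2": (2, 0, 2), "t3": (1, 1, 2)}
print(assign(temps, num_regs=3))
print(assign(temps, num_regs=2))
```

With three registers the search places `t1` and `t3` in the same register column, because their live ranges do not intersect, and the double-width `t2` next to them. With two registers no packing exists and the function returns `None`, which in a full model is where spilling would come in.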
Speedup over LLVM 3.8
[Figure: speedup per function, axis from 0% to 50%; benchmarks epic, g721, gsm, jpeg, mpeg2, adpcm; ◾ marks provably optimal results]
- 50 MediaBench functions
- Hexagon V4 processor
- Provably optimal (◾) for 54% of the functions
- Compilation time: from seconds to minutes
Disclaimer
- The case studies are generated with LLVM 3.8 for Hexagon V4
- opt is run with different optimization levels and some llc passes are disabled for simplicity (llc is always run with O3)
- I am no expert on LLVM, just a humble user
Case Study: fac
Simple iterative factorial:

    int fac(int n) {
      int f = 1;
      while (n > 0) {
        f = f * n;
        n--;
      }
      return f;
    }

- Exposes opportunity for better coalescing
- Illustrates effect of integrated reasoning
(LLVM)
    r0 = #1; r1 = r0
    p0 = cmp.gt(r1, #0); if (!p0.new) jump:nt .LBB0_2
    r0 = mpyi(r0, r1); r2 = add(r1, #-1); p0 = cmp.gt(r1, #1)
    if (p0) jump .LBB0_1; r1 = r2
    jumpr r31

(Unison)
    r1 = #1; p0 = cmp.gt(r0, #0); if (!p0.new) jump:nt .LBB0_2
    r1 = mpyi(r1, r0); r0 = add(r0, #-1); p0 = cmp.gt(r0, #1); if (p0.new) jump:t .LBB0_1
    r0 = r1; jumpr r31

LLVM's loop is twice as slow, why?

    # After Simple Register Coalescing:
    ..
    BB#1:
      %vreg2 = A2_addi %vreg10, -1
      %vreg9 = M2_mpyi %vreg9, %vreg10
      %vreg8 = C2_cmpgti %vreg10, 1
      %vreg10 = COPY %vreg2
      J2_jumpt %vreg8, <BB#1>
    ..

- %vreg2 and %vreg10 not coalesced
- corresponding move requires a new bundle
- postponing A2_addi enables coalescing %vreg2, %vreg10:

    ..
    BB#1:
      %vreg9 = M2_mpyi %vreg9, %vreg10
      %vreg8 = C2_cmpgti %vreg10, 1
      %vreg2 = A2_addi %vreg10, -1
      %vreg10 = COPY %vreg2
      J2_jumpt %vreg8, <BB#1>
    ..

Unison's initialization is one cycle faster, why?
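The coalescing observation for the loop body can be checked mechanically. The sketch below is a simplified, hypothetical model and not LLVM's coalescer: instructions are (opcode, defined register, used registers) tuples in a single straight-line pass over BB#1, the instruction order is an assumption reconstructed from the listing after Simple Register Coalescing, and `can_coalesce` is an illustrative helper.

```python
from typing import List, Optional, Tuple

# (opcode, defined register or None, used registers); layout assumed for this sketch.
Instr = Tuple[str, Optional[str], Tuple[str, ...]]


def can_coalesce(block: List[Instr], copy_dst: str, copy_src: str) -> bool:
    """True if `copy_dst = COPY copy_src` can be removed by merging both registers.

    Simplified straight-line rule: the old value of copy_dst must not be read
    after copy_src has been defined (the COPY itself does not count).
    """
    src_def = next(i for i, (_, d, _) in enumerate(block) if d == copy_src)
    copy_at = next(i for i, (op, d, _) in enumerate(block)
                   if op == "COPY" and d == copy_dst)
    return not any(copy_dst in uses
                   for _, _, uses in block[src_def + 1:copy_at])


# Loop body in the order LLVM keeps it (assumed from the listing).
llvm_order: List[Instr] = [
    ("A2_addi",   "%vreg2",  ("%vreg10",)),
    ("M2_mpyi",   "%vreg9",  ("%vreg9", "%vreg10")),
    ("C2_cmpgti", "%vreg8",  ("%vreg10",)),
    ("COPY",      "%vreg10", ("%vreg2",)),
    ("J2_jumpt",  None,      ("%vreg8",)),
]
# Same instructions with A2_addi postponed past the last read of %vreg10.
postponed: List[Instr] = [llvm_order[1], llvm_order[2], llvm_order[0]] + llvm_order[3:]

print(can_coalesce(llvm_order, "%vreg10", "%vreg2"))  # False
print(can_coalesce(postponed, "%vreg10", "%vreg2"))   # True
```

In the first order, M2_mpyi and C2_cmpgti still read the old `%vreg10` after `%vreg2` has been written, so the two registers interfere and the COPY survives. Once A2_addi is postponed, nothing reads `%vreg10` between the definition of `%vreg2` and the COPY, so the copy can be folded away. A coalescer that runs before scheduling cannot make this move, which is the effect of integrated reasoning that the case study illustrates.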
[Figure: registers R0 and R1 holding f and n along the source of fac, (LLVM) vs. (Unison)]
- Calling convention: argument, return value in R0
- However n and f interfere: move required
- LLVM moves n in the initialization block
- Unison moves f in the return block: does it matter?
It does! Unison's move can be scheduled in parallel: r0 = r1 shares a bundle with jumpr r31. In LLVM's code, p0 = cmp.gt(r1, #0) has to wait for r1 = r0, which costs an extra bundle in the initialization.
What if cmp.gt could choose r0 during scheduling?
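The gap between the two fac listings can be put into numbers. The sketch below rests on assumptions that go beyond the slides: every bundle costs one cycle, branch and call overhead is ignored, and the split of each listing into initialization, loop, and exit bundles is read off the code by hand. `bundles_executed` is a hypothetical helper.

```python
from typing import Dict, List

Listing = Dict[str, List[str]]

# One string per bundle; the init/loop/exit split is an assumption of this sketch.
LLVM: Listing = {
    "init": ["r0 = #1; r1 = r0",
             "p0 = cmp.gt(r1, #0); if (!p0.new) jump:nt .LBB0_2"],
    "loop": ["r0 = mpyi(r0, r1); r2 = add(r1, #-1); p0 = cmp.gt(r1, #1)",
             "if (p0) jump .LBB0_1; r1 = r2"],
    "exit": ["jumpr r31"],
}
UNISON: Listing = {
    "init": ["r1 = #1; p0 = cmp.gt(r0, #0); if (!p0.new) jump:nt .LBB0_2"],
    "loop": ["r1 = mpyi(r1, r0); r0 = add(r0, #-1); p0 = cmp.gt(r0, #1); "
             "if (p0.new) jump:t .LBB0_1"],
    "exit": ["r0 = r1; jumpr r31"],
}


def bundles_executed(code: Listing, n: int) -> int:
    """Bundles executed by fac(n): the loop body runs once per positive value of n."""
    iterations = max(n, 0)
    return len(code["init"]) + iterations * len(code["loop"]) + len(code["exit"])


for n in (0, 1, 10):
    print(n, bundles_executed(LLVM, n), bundles_executed(UNISON, n))
```

Under these assumptions fac(10) executes 23 bundles with LLVM's code and 12 with Unison's. The loop accounts for the factor of two, and the initialization for the remaining one-bundle difference.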
Case Study: fib
Simple recursive Fibonacci:

    int fib(int n) {
      if (n <= 2) {
        return 1;
      }
      return fib(n-1) + fib(n-2);
    }

- Exposes opportunity for better spilling
(LLVM)
    memd(r29 + #-16) = r17:16; allocframe(#16)
    p0 = cmp.gt(r0, #2); if (p0.new) jump:nt .LBB0_2; memw(r29+#0) = r0
    r0 = memw(r29 + #0)
    call fib; r0 = add(r0, #-1)
    r0 = #1; jump .LBB0_3; memw(r29+#4) = r0.new
    r16 = r0; r1 = memw(r29 + #0)
    call fib; r0 = add(r1, #-2)
    r0 = add(r16, r0); memw(r29+#4) = r0.new
    r0 = memw(r29 + #4); r17:16 = memd(r29 + #8)
    dealloc_return