AArch64 performance analysis and resulted enhancements on GCC



SLIDE 1

November 23, 2019

AArch64 performance analysis and resulted enhancements on GCC

Feng Xue, Jiangning Liu

SLIDE 2

Agenda

  • Loop split on semi-invariant conditional statement
  • IPA constant propagation and recursive function versioning
  • Some issues in current register allocator
  • Trapless conditional selection instruction generation

SLIDE 3

Loop conditional statement elimination

  • Loop Split

Before:

for (i = 0; i < 100; i++) {
  if (i < 40)
    S1;
  else
    S2;
}

After:

for (i = 0; i < 40; i++)
  S1;
for (i = 40; i < 100; i++)
  S2;

  • Loop Unswitch

Before:

for (i = 0; i < 100; i++) {
  if (a != b)
    S1;
  else
    S2;
}

After:

if (a != b) {
  for (i = 0; i < 100; i++)
    S1;
} else {
  for (i = 0; i < 100; i++)
    S2;
}
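The loop split above can be checked with a small C sketch. Here S1 and S2 are replaced by hypothetical stand-ins (adding 1 or 2 to a sum), and the split point 40 is taken from the slide; the function names are illustrative only.

```c
#include <assert.h>

/* Original loop: the branch on i is evaluated in every iteration. */
int sum_original(int n)
{
    assert(n >= 40);            /* the split point below assumes n >= 40 */
    int s = 0;
    for (int i = 0; i < n; i++) {
        if (i < 40)
            s += 1;             /* stands in for S1 */
        else
            s += 2;             /* stands in for S2 */
    }
    return s;
}

/* After loop split: two loops, neither body contains the branch. */
int sum_split(int n)
{
    assert(n >= 40);
    int s = 0;
    for (int i = 0; i < 40; i++)
        s += 1;                 /* S1 only */
    for (int i = 40; i < n; i++)
        s += 2;                 /* S2 only */
    return s;
}
```

Both functions return the same value for any n of at least 40, which is the property the transformation has to preserve.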

SLIDE 4

Loop semi-invariant conditional statement

  • Loop invariant condition?

extern int flag;
for (i = 0; i < 100; i++) {
  if (flag)
    printf(…);
}

  • Simple semi-invariant pattern

for (i = 0; i < 100; i++) {
  if (a < 10)
    a = new_value ();
}

[Diagram: the condition f(a)? has two outcomes, one with no change to a and one that assigns a = ...]

SLIDE 5

How to eliminate semi-invariant condition?

  • Loop Unswitch

if (flag) {
  for (i = 0; i < 100; i++) {
    if (flag)
      printf(…);
    S1;
  }
} else {
  for (i = 0; i < 100; i++) {
    S1;
  }
}

  • Loop Split

for (i = 0; i < 100; i++) {
  if (flag)
    printf(…);
  else {
    S1;
    i++;
    break;
  }
}
for (; i < 100; i++)
  S1;
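A minimal runnable sketch of the split, using the "a < 10" pattern from the previous slide. It assumes a hypothetical new_value() that adds 4, and uses "sum += a" as a stand-in for S1; the tests counter exists only to show how often the condition is evaluated.

```c
#include <assert.h>

/* Hypothetical stand-in for new_value(): grows a by 4 on each call. */
static int new_value(int a) { return a + 4; }

/* Original loop: "a < 10" is tested in all n iterations. */
int loop_original(int a, int n, int *tests)
{
    assert(tests != 0);
    int sum = 0;
    *tests = 0;
    for (int i = 0; i < n; i++) {
        (*tests)++;
        if (a < 10)
            a = new_value(a);
        sum += a;               /* stands in for S1 */
    }
    return sum;
}

/* Split loop: once "a < 10" is false, a never changes again, so the
 * remaining iterations run in a second loop without the test. */
int loop_split(int a, int n, int *tests)
{
    assert(tests != 0);
    int sum = 0, i;
    *tests = 0;
    for (i = 0; i < n; i++) {
        (*tests)++;
        if (a < 10) {
            a = new_value(a);
            sum += a;
        } else {
            sum += a;           /* finish this iteration, then leave */
            i++;
            break;
        }
    }
    for (; i < n; i++)
        sum += a;               /* condition-free remainder */
    return sum;
}
```

With a = 0 and n = 100, both versions compute the same sum, but the split version evaluates the condition only until it first turns false.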

SLIDE 6

Identify semi-invariant condition

  • Conditional expression tree evaluation
  • Normal value operation
  • SSA-PHI merge operation

foo(int p, int q, int r)
{
  a = r;
  for (i = 0; i < 100; i++) {
    if (a)
      b = q;
    else
      b = p;
    if (b * b < 10)
      a = new_value();
  }
}

SSA form of the loop body (B_1 and B_2 are defined in the two branches of if (A_1)):

A_1 = PHI(...)
if (A_1)
B_1 = ...
B_2 = ...
B_3 = PHI(B_1, B_2)
cond = (B_3 * B_3 < 10)

Both the value expression and the condition that it is control-dependent on must be semi-invariant.

SLIDE 7

Identify semi-invariant condition

  • Semi-invariant loop iteration value

V_1 = PHI(init, V_5)
if (cond)
  V_4 = ...
V_3 = PHI(V_1, V_4)
V_5 = V_3

SLIDE 8

IPA constant propagation

  • Jump function

f(int a, int b)
{
  g(b, 3, -a, a + 1);
}

JF{f->g}[0] = param#1
JF{f->g}[1] = 3
JF{f->g}[2] = -param#0
JF{f->g}[3] = param#0 + 1

  • In-memory constant

f()
{
  int a = 1;
  struct {f0, f1} b = {2, 3};
  g(&a, b);
}

JF_agg{f->g}[0, @0] = 1
JF_agg{f->g}[1, @0] = 2
JF_agg{f->g}[1, @4] = 3
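The scalar jump functions for the call f -> g can be modelled with a toy evaluator. This is only a sketch: the enum, the struct and the function names are hypothetical and do not reflect GCC's internal data structures.

```c
#include <assert.h>

/* Toy jump-function kinds, just enough for g(b, 3, -a, a + 1). */
enum jf_kind { JF_CONST, JF_PASS, JF_NEG, JF_PLUS };

struct jump_func {
    enum jf_kind kind;
    int param;                  /* index of the caller parameter, if any */
    int value;                  /* constant, or addend for JF_PLUS */
};

/* JF{f->g}[0..3] as listed on the slide. */
static const struct jump_func jf_f_to_g[4] = {
    { JF_PASS,  1, 0 },         /* param#1     */
    { JF_CONST, 0, 3 },         /* 3           */
    { JF_NEG,   0, 0 },         /* -param#0    */
    { JF_PLUS,  0, 1 },         /* param#0 + 1 */
};

/* Compute the callee argument from known caller argument values. */
int jf_eval(const struct jump_func *jf, const int *caller_args, int nargs)
{
    assert(jf->kind == JF_CONST || (jf->param >= 0 && jf->param < nargs));
    switch (jf->kind) {
    case JF_CONST: return jf->value;
    case JF_PASS:  return caller_args[jf->param];
    case JF_NEG:   return -caller_args[jf->param];
    case JF_PLUS:  return caller_args[jf->param] + jf->value;
    }
    return 0;                   /* not reached for valid kinds */
}

/* Value of g's argument idx when f is called as f(a, b). */
int g_arg_for_f_call(int a, int b, int idx)
{
    int args[2] = { a, b };
    assert(idx >= 0 && idx < 4);
    return jf_eval(&jf_f_to_g[idx], args, 2);
}
```

If f is known to be called as f(2, 5), evaluating the four jump functions gives the constants that reach g: 5, 3, -2 and 3.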

SLIDE 9

IPA constant propagation

  • Parameter passing in FORTRAN

subroutine f(a)
  integer, intent(in) :: a
  call g(a + 1)
end subroutine

Equivalent C:

f(int *a)
{
  int t = *a + 1;
  g(&t);
}

  • Enhanced in-memory constant propagation

▪ JF_agg[i, @offset] = constant
▪ JF_agg[i, @offset] = param#j OP constant
▪ JF_agg[i, @offset] = *(param#j + offset2) OP constant
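A runnable C version of the by-reference call makes the third form concrete. In this reading (an interpretation, not taken from the slides), the value g finds in memory would be described as JF_agg{f->g}[0, @0] = *(param#0 + 0) + 1. The helper value_reaching_g and the global seen_by_g are hypothetical, added only so the result can be observed.

```c
#include <assert.h>

static int seen_by_g;           /* records what g read through its pointer */

/* Callee: reads its argument from memory, as a Fortran callee would. */
static void g(const int *p)
{
    assert(p != 0);
    seen_by_g = *p;
}

/* Caller: the expression a + 1 is stored to a temporary and passed
 * by address, mirroring "call g(a + 1)". */
void f(const int *a)
{
    int t = *a + 1;
    g(&t);
}

/* Helper: what value does g see when f is called with *a == a_value? */
int value_reaching_g(int a_value)
{
    f(&a_value);
    return seen_by_g;
}
```

When the value behind f's parameter is a known constant, the value g loads is a constant as well, even though only addresses are passed.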

SLIDE 10

Recursive function optimizations

f(int i)
{
  if (i == 4) {
    do_work();
    return;
  }
  do_prepare();
  f(i + 1);
  do_post();
}

main()
{
  f(1);
}

  • Recursive tail call transformation
  • Recursive inlining
  • Recursive versioning

main() -> f<i=1>() -> f<i=2>() -> f<i=3>() -> f<i=4>()
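Written out by hand, the versioned call chain might look like the sketch below. The names f_1 to f_4, the counters and the trace helpers are hypothetical; a compiler would generate its own clone names.

```c
#include <assert.h>

static int prepare_count, work_count, post_count;
static void do_prepare(void) { prepare_count++; }
static void do_work(void)    { work_count++; }
static void do_post(void)    { post_count++; }

/* Original self-recursive function from the slide. */
void f(int i)
{
    assert(i <= 4);             /* the recursion only terminates for i <= 4 */
    if (i == 4) {
        do_work();
        return;
    }
    do_prepare();
    f(i + 1);
    do_post();
}

/* One version per known value of i; the "i == 4" test is folded away. */
static void f_4(void) { do_work(); }
static void f_3(void) { do_prepare(); f_4(); do_post(); }
static void f_2(void) { do_prepare(); f_3(); do_post(); }
void f_1(void)        { do_prepare(); f_2(); do_post(); }

static void reset(void) { prepare_count = work_count = post_count = 0; }
static int signature(void)
{
    return prepare_count * 100 + work_count * 10 + post_count;
}

/* Encode the call counts of each variant as a single number. */
int trace_original(void)  { reset(); f(1);  return signature(); }
int trace_versioned(void) { reset(); f_1(); return signature(); }
```

Both traces record three do_prepare() calls, one do_work() call and three do_post() calls; the versioned chain contains no comparison against i and can be inlined further.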

SLIDE 11

Recursive function versioning

  • Only for self-recursive function
  • New option for recursive versioning depth
  • Recursive constant propagation strategy

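f(int i)
{
  g(i);
  f(i + 1);
}

B() { f(1); }
C() { f(6); }
D() { g(0); }

[Call graph: B() calls f(i) with 1 and C() calls f(i) with 6; the recursive calls of f then pass 2,3,4 and 7,8,9 respectively; f(i) and D() call g(i).]

The versioning depth is assumed to be 4.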
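The edge values in the call graph follow from simple arithmetic on the recursive call f(i + 1). The sketch below only reproduces that arithmetic; versioned_values is a hypothetical helper and says nothing about how GCC stores or limits these values internally.

```c
#include <assert.h>

/* For a self-recursive call f(i) -> f(i + 1) entered with the constant
 * `seed`, list the argument values covered by `depth` versions.
 * Writes depth entries to out[] and returns depth. */
int versioned_values(int seed, int depth, int *out)
{
    assert(depth >= 0 && out != 0);
    for (int d = 0; d < depth; d++)
        out[d] = seed + d;      /* each recursion level adds 1 */
    return depth;
}
```

With a depth of 4, the seed 1 from B() yields 1, 2, 3, 4 and the seed 6 from C() yields 6, 7, 8, 9, matching the values shown on the call graph.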

SLIDE 12

IPA constant propagation TODOs

  • Global variable value propagation

int CST;

init() { CST = 4; }
calc(int i) { return i / CST; }
main()
{
  init();
  ... = calc(100);
}

calc(100) -> calc(100, CST)

  • Extend jump function

f(int a, int b)
{
  g(1 - a, b ? 1 : 2, a + b);
}

JF{f->g}[0] = 1 - param#0
JF{f->g}[1] = param#1 ? 1 : 2
JF{f->g}[2] = param#0 + param#1
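The rewrite calc(100) -> calc(100, CST) can be sketched in plain C by turning the global into an explicit extra parameter. The name calc_with_cst and the two run helpers are hypothetical; the point is only that, once the value is an argument, ordinary constant propagation can see it.

```c
#include <assert.h>

static int CST;

static void init(void) { CST = 4; }

/* Original: the divisor is read from a global at run time. */
int calc(int i)
{
    assert(CST != 0);
    return i / CST;
}

/* Hypothetical rewritten form: the global's value is passed as an
 * extra argument, which makes it visible as a constant at call sites. */
int calc_with_cst(int i, int cst)
{
    assert(cst != 0);
    return i / cst;
}

int run_original(void)   { init(); return calc(100); }
int run_propagated(void) { init(); return calc_with_cst(100, 4); }
```

Both variants return the same result; in the second, the divisor at the call site is a literal, so the division can be specialized for it.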

SLIDE 13

Issues in register allocator

  • Context sensitive

f1() { S1 }
f2() { if (cond) S1 else S2 }

[Diagram labels: irrelevant code; different allocation result]

▪ Code generation instability impacts inlining
▪ Hard to do code and performance comparison

  • Root cause

▪ Execution profile normalization error

f1() {
  BB1 (30)   -> 30/10 = 3
  BB2 (1000) -> 1000/10 = 100
}
f2() {
  if (cond)
    BB1 (3)   -> 3/10 = 0.3 ≈ 1
    BB2 (100) -> 100/10 = 10
}
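The normalization error can be reproduced with integer arithmetic. This sketch assumes that a non-zero frequency is never scaled below 1, as the "0.3 ≈ 1" on the slide suggests; the exact rounding rule used by the allocator is not given in the slides, and both function names are hypothetical.

```c
#include <assert.h>

/* Scale a raw block frequency down by `scale`. Assumption: a block
 * that executes at all keeps a frequency of at least 1. */
int normalize_freq(int raw, int scale)
{
    assert(raw >= 0 && scale > 0);
    int q = raw / scale;
    return (raw > 0 && q == 0) ? 1 : q;
}

/* Ratio of the hot block BB2 to the cold block BB1 after scaling. */
int hot_cold_ratio(int bb1, int bb2, int scale)
{
    assert(bb1 > 0);
    return normalize_freq(bb2, scale) / normalize_freq(bb1, scale);
}
```

In f1 the scaled ratio BB2:BB1 is 100:3, roughly 33 to 1. In f2 the same code under if (cond) ends up at 10:1, because 0.3 is rounded up to 1. The relative weight of the two blocks changes even though the code is identical, which can lead to a different allocation.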

SLIDE 14

Issues in register allocator

  • Top-down allocation order

▪ Local information influences global allocation decisions at too early a stage

[Diagram: Region 1 and Region 2, with the definition "v1 = ..." and the use "... = v1"; location labels: mem, reg, spill, reload]

  • Possible solutions

▪ Use live range splitting instead of spilling
▪ Do post-refinement on the outer region

SLIDE 15

Trapless conditional selection instruction generation

int f(int k, int b)
{
  int a[2];
  if (b < a[k]) {
    a[k] = b;
  }
  return a[0]+a[2];
}

▪ Because “a” is a local variable and therefore always writable, introducing an extra write to “a” cannot cause a trap.

Conditional store with a branch:

sub sp, sp, #16
uxtw x0, w0
add x2, sp, 8
ldr w3, [x2, x0, lsl 2]
cmp w3, w1
bls .L2
str w1, [x2, x0, lsl 2]
.L2:
ldr w1, [sp, 8]
ldr w0, [sp, 16]
add sp, sp, 16
add w0, w1, w0
ret

Unconditional store with csel:

sub sp, sp, #16
uxtw x2, w0
add x3, sp, 8
ldr w5, [sp, 16]
ldr w4, [x3, x2, lsl 2]
cmp w4, w1
csel w1, w1, w4, hi
str w1, [x3, x2, lsl 2]
ldr w0, [sp, 8]
add sp, sp, 16
add w0, w0, w5
ret
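At the source level, the two instruction sequences correspond roughly to the two functions below. This is a sketch under stated assumptions: unlike the slide's code, the array is initialized from the hypothetical parameters a0 and a1, and the return reads a[0] + a[1], so that the example is well defined and testable.

```c
#include <assert.h>

/* Branch form: the store to a[k] happens only when b < a[k]. */
int f_branch(int k, int b, int a0, int a1)
{
    int a[2] = { a0, a1 };
    assert(k == 0 || k == 1);
    if (b < a[k])
        a[k] = b;               /* conditional store */
    return a[0] + a[1];
}

/* Select form: the value is chosen first, then stored unconditionally.
 * The extra store is safe because a is a local, writable array. */
int f_select(int k, int b, int a0, int a1)
{
    int a[2] = { a0, a1 };
    assert(k == 0 || k == 1);
    int old = a[k];
    a[k] = (b < old) ? b : old; /* maps naturally onto csel + str */
    return a[0] + a[1];
}
```

The two functions return the same result for all valid inputs. The select form writes a[k] even when the value does not change, which is only legal when the write is known not to trap.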

SLIDE 16

Build something with us. Create the future together with us!

http://developer.amperecomputing.com

SLIDE 17

Thanks