Improving Machine Outliner for ThinLTO
(Global Machine Outliner + Frame Code Outliner)
Facebook Kyungwoo Lee, Nikolai Tillmann
The Machine Outliner Today
Machine outliner in LLVM significantly reduces code size
Works quite well with …
a.c:
    int f1(int x) {
      // ...more code...
      return x * 128 + 77;
    }
    int f2(int x) {
      // ...more code...
      return x * 128 + 77;
    }

b.c:
    int g(int x) {
      // ...more code...
      return x * 128 + 77;
    }

LTO:
    int f1(int x) {
      // ...more code...
      return __outlined(x);
    }
    int f2(int x) {
      // ...more code...
      return __outlined(x);
    }
    int g(int x) {
      // ...more code...
      return __outlined(x);
    }
    int __outlined(int x) {
      return x * 128 + 77;
    }

ThinLTO:
    int f1(int x) {
      // ...more code...
      return __outlined(x);
    }
    int f2(int x) {
      // ...more code...
      return __outlined(x);
    }
    int __outlined(int x) {
      return x * 128 + 77;
    }
    int g(int x) {
      // ...more code...
      return x * 128 + 77;
    }
instructions and micro-
and/or local
(X30) often comes late in the prologue or early in the epilogue
(Prologue)
    stp x22, x21, [sp, #-48]!
    stp x20, x19, [sp, #16]
    stp x29, x30, [sp, #32]    // Can’t outline
    add x29, sp, #32
    ...
(Epilogue)
    ldp x29, x30, [sp, #32]    // Can’t outline
    ldp x20, x19, [sp, #16]
    ldp x22, x21, [sp], #48
    ret
[Figure: Text Size Reduction (y-axis from 0.6 to 1) for 7zip, Bullet, ClamAV, SPASS, consumer-typeset, kimwitu++, lencod, mafft, sqlite3, tramp3d-v4, and geomean. Series: ThinLTO, LTO.]
[Diagram: ThinLTO pipeline for six modules. Frontend: IR in .o files. Linker: Interprocedural Analysis, then Opt and CG per module, then Traditional Linking.]
… files in parallel
… analysis, runs in parallel for each module:
… combines results
[Diagram: ThinLTO pipeline with two CodeGen rounds for six modules. Frontend: IR in .o files. Linker: Interprocedural Analysis, Opt, 1st CG round, synchronization (Gathering of all outlined MIR hashes), 2nd CG round, Traditional Linking.]
1st round: … functions
2nd round: … candidates that match MIR hashes
Linking: … modules
a.c:
    int __outlined1(int x) { return x * 128 + 77; }
        mov eax, DWORD PTR [rbp-4]
        sal eax, 7
        add eax, 77
    int __outlined2(int x) { return x * 128 + 33; }
        mov eax, DWORD PTR [rbp-4]
        sal eax, 7
        add eax, 33

root
  mov eax, DWORD PTR [rbp-4]
    sal eax, 7
      add eax, 77
      add eax, 33
a.c:
    int __outlined1(int x) { return x * 128 + 77; }
        mov eax, DWORD PTR [rbp-4]    // Y
        sal eax, 7                    // B
        add eax, 77                   // U
    int __outlined2(int x) { return x * 128 + 33; }
        mov eax, DWORD PTR [rbp-4]    // Y
        sal eax, 7                    // B
        add eax, 33                   // Q

Stable Hashes (actual hashes are 64-bit):
root
  Y
    B
      U
      Q
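The tree of stable hashes above can be sketched in C as a prefix tree whose nodes each hold one instruction hash. This is only an illustration under stated assumptions: the names HashNode, hash_node_new, hash_tree_insert, and hash_tree_free are hypothetical and are not taken from LLVM, and single letters stand in for the real 64-bit hashes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical node type: one node per instruction hash.
   Children of a node are kept in a singly linked sibling list. */
typedef struct HashNode {
    uint64_t hash;             /* stable hash of one instruction */
    int is_outlined_end;       /* nonzero if an outlined sequence ends here */
    struct HashNode *child;    /* first child */
    struct HashNode *sibling;  /* next node with the same parent */
} HashNode;

/* Allocate a zeroed node carrying the given hash (NULL on failure). */
HashNode *hash_node_new(uint64_t hash) {
    HashNode *n = calloc(1, sizeof *n);
    if (n) n->hash = hash;
    return n;
}

/* Insert one hash sequence below root, reusing nodes for a shared prefix.
   Returns the number of newly created nodes, or -1 if allocation fails. */
int hash_tree_insert(HashNode *root, const uint64_t *seq, size_t len) {
    assert(root != NULL && (seq != NULL || len == 0));
    HashNode *cur = root;
    int created = 0;
    for (size_t i = 0; i < len; ++i) {
        HashNode *c = cur->child;
        while (c && c->hash != seq[i]) c = c->sibling;  /* look for seq[i] */
        if (!c) {                                       /* not found: add it */
            c = hash_node_new(seq[i]);
            if (!c) return -1;
            c->sibling = cur->child;
            cur->child = c;
            ++created;
        }
        cur = c;
    }
    cur->is_outlined_end = 1;
    return created;
}

/* Free a node, its siblings, and everything below them. */
void hash_tree_free(HashNode *n) {
    while (n) {
        HashNode *next = n->sibling;
        hash_tree_free(n->child);
        free(n);
        n = next;
    }
}
```

Inserting the sequence Y B U and then Y B Q into an empty root creates three nodes for the first sequence and only one (Q) for the second, which mirrors the shared Y, B prefix in the slide.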
1) For an outlining candidate (a sequence occurring at least 2 times)
2) For a sequence occurring only once in a module
b.c:
    …
    mov DWORD PTR [rbp-8], eax    // H
    mov eax, DWORD PTR [rbp-4]    // Y
    sal eax, 7                    // B
    add eax, 77                   // U
    add eax, 33                   // R
    mov DWORD PTR [rbp-8], eax    // A
    …

root
  Y
    B
      U
      Q
We found a match… Outline this sequence!
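The walk that leads to this match can be sketched in C as follows. It is a minimal sketch, not the actual LLVM implementation: the names HashNode, match_length, and find_first_match are hypothetical, single letters stand in for 64-bit stable hashes, and the policy of taking the longest match at the first matching position is an assumption made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical read-only node of the tree of stable hashes. */
typedef struct HashNode {
    uint64_t hash;                   /* stable hash of one instruction */
    int is_outlined_end;             /* nonzero if an outlined sequence ends here */
    const struct HashNode *child;    /* first child */
    const struct HashNode *sibling;  /* next node with the same parent */
} HashNode;

/* Length of the longest outlined sequence in the tree that is a prefix
   of seq[0..len); 0 if no outlined sequence starts at seq[0]. */
size_t match_length(const HashNode *root, const uint64_t *seq, size_t len) {
    const HashNode *cur = root;
    size_t best = 0;
    for (size_t i = 0; i < len; ++i) {
        const HashNode *c = cur->child;
        while (c && c->hash != seq[i]) c = c->sibling;
        if (!c) break;                        /* walked off the tree */
        if (c->is_outlined_end) best = i + 1; /* remember the longest hit */
        cur = c;
    }
    return best;
}

/* Scan a module's hash sequence for the first position with a match.
   Returns 1 and fills *start and *length on success, 0 otherwise. */
int find_first_match(const HashNode *root, const uint64_t *seq, size_t len,
                     size_t *start, size_t *length) {
    for (size_t i = 0; i < len; ++i) {
        size_t n = match_length(root, seq + i, len - i);
        if (n > 0) {
            *start = i;
            *length = n;
            return 1;
        }
    }
    return 0;
}
```

With a tree holding Y B U and Y B Q, scanning the b.c sequence H Y B U R A fails at H, then matches Y, B, U starting at index 1 and stops at R, so the reported match has length 3.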
a.c:
    int f1(int x) {
      // ...more code...
      return x * 128 + 77;
    }
    int f2(int x) {
      // ...more code...
      return x * 128 + 77;
    }

b.c:
    int g(int x) {
      // ...more code...
      return x * 128 + 77;
    }

ThinLTO with 2-round CodeGen:
    int f1(int x) {
      // ...more code...
      return __outlined1(x);
    }
    int f2(int x) {
      // ...more code...
      return __outlined1(x);
    }
    int g(int x) {
      // ...more code...
      return __outlined2(x);
    }
    int __outlined1(int x) {
      return x * 128 + 77;
    }
    int __outlined2(int x) {
      return x * 128 + 77;
    }
with examples for AArch64/iOS
while saving CSR
while restoring CSR
(Prologue)
    stp x29, x30, [sp, #-16]!
    stp x20, x19, [sp, #-16]!
    stp x22, x21, [sp, #-16]!
    add x29, sp, #32
    ...
(Epilogue)
    ldp x22, x21, [sp], #16
    ldp x20, x19, [sp], #16
    ldp x29, x30, [sp], #16
    ret
compiler
helpers in each module pass
ODR to deduplicate helpers by linker
at each prologue site.
(Prologue)
    stp x29, x30, [sp, #-16]!
    bl _PROLOG_INTEGER_19202122
    add x29, sp, #32
    ...
(Epilogue)
    bl _EPILOG_INTEGER_21221920
    ldp x29, x30, [sp], #16
    ret
to stash/restore LR value to the context of epilogue.
epilogue that a direct branch follows.
(Epilogue)
    bl _EPILOG_INTEGER_21221920LRFP
    ret
(Helpers)
_EPILOG_INTEGER_21221920LRFP:
    mov x16, x30               // Save LR of epilogue to X16
    ldp x22, x21, [sp], #16
    ldp x20, x19, [sp], #16
    ldp x29, x30, [sp], #16    // Restore LR (of caller)
    br x16                     // Jump on X16 back to epilogue
the helper
(BL) at epilogue
from the helper
merged at different offsets for further saving
(Epilogue)
    b _EPILOG_INTEGER_21221920LRFP_TAIL
(Helper)
_EPILOG_INTEGER_21221920LRFP_TAIL:
    ldp x22, x21, [sp], #16
    ldp x20, x19, [sp], #16
    ldp x29, x30, [sp], #16
    ret
[Figure: Text Size Reduction (ThinLTO), y-axis from 0.6 to 1, for 7zip, Bullet, ClamAV, SPASS, consumer-typeset, kimwitu++, lencod, mafft, sqlite3, tramp3d-v4, and geomean. Series: BaseOutline, Global, Global/FrameOpt.]
[Figure: LinkTime Relative to NoOutline (ThinLTO), y-axis from 0.60 to 2.00, for 7zip, Bullet, ClamAV, SPASS, consumer-typeset, kimwitu++, lencod, mafft, sqlite3, tramp3d-v4, and geomean. Series: BaseOutline, Global, Global/FrameOpt.]
internal benchmark
time overhead for large internal benchmark