
slide-1
SLIDE 1

THE ATTRIBUTOR: A VERSATILE INTER-PROCEDURAL FIXPOINT ITERATION FRAMEWORK

LLVM-Dev’19 — October 22, 2019 — San Jose, CA, USA

Johannes Doerfert*, Hideto Ueno, Stefan Stipanovic — *Leadership Computing Facility, Argonne National Laboratory (https://www.alcf.anl.gov/)

slide-2
SLIDE 2

ACKNOWLEDGMENT

Two of the authors were supported by Google Summer of Code (GSoC)! This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative.

1/16

slide-3
SLIDE 3
  • I. BACKGROUND
slide-4
SLIDE 4

FIXPOINT DATA FLOW ANALYSIS — ALIGNMENT EXAMPLE

int *checkAndAdvance(int * __attribute__((aligned(16))) p) {
  if (*p == 0)
    return checkAndAdvance(p + 4);
  return p;
}

What is the alignment of:

  (1) the return type?
  (2) the returned value?
  (3) the argument?

(The slide animates the (known, assumed) alignment states, which evolve from (1, ∞) through (1, 16) to the fixpoint (16, 16).)

2/16
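The iteration the slide steps through can be sketched outside LLVM as a tiny stand-alone model (the names and the gcd-based transfer function are mine, not the Attributor's API): the assumed alignment starts optimistic and shrinks until it stops changing.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>

// Toy model of the slide's fixpoint iteration for checkAndAdvance():
// the returned value is either `p` (aligned 16 by the attribute) or the
// recursive result advanced by 4 ints = 16 bytes. Start with an
// optimistic assumption and iterate until nothing changes.
constexpr uint64_t Optimistic = 1ULL << 32; // stands in for "infinity"

uint64_t fixpointReturnAlignment() {
  uint64_t Assumed = Optimistic;
  while (true) {
    // Alignment of `rec + 16` if the recursive result is Assumed-aligned.
    uint64_t ViaRecursion = std::gcd(Assumed, uint64_t(16));
    // Meet over both return sites: `return p;` contributes 16.
    uint64_t Next = std::min(uint64_t(16), ViaRecursion);
    if (Next == Assumed)
      return Assumed; // fixpoint: (16, 16) in the slide's notation
    Assumed = Next;
  }
}
```

Starting from the optimistic "∞" this converges to 16 after a single update, matching the slide's final state.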

slide-21
SLIDE 21

ABSTRACT STATES

3/16
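The slide body did not survive the scrape; as a rough sketch of what an abstract state looks like in this design (simplified from, and not identical to, the LLVM classes), each state keeps a pessimistic "known" and an optimistic "assumed" part:

```cpp
#include <cassert>

// Simplified sketch of an abstract state: Known only grows with
// justified facts, Assumed starts optimistic and may later be
// retracted down to Known. A fixpoint is reached when they agree.
struct BooleanState {
  bool Known = false;  // proven, e.g., "definitely nofree"
  bool Assumed = true; // optimistic until contradicted

  bool isValidState() const { return !Known || Assumed; }
  bool isAtFixpoint() const { return Known == Assumed; }

  // Give up: collapse the optimistic assumption onto what is proven.
  void indicatePessimisticFixpoint() { Assumed = Known; }
  // Success: promote the assumption to known information.
  void indicateOptimisticFixpoint() { Known = Assumed; }
};
```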


slide-25
SLIDE 25

FIXPOINT DATA FLOW ANALYSIS — ALIGNMENT EXAMPLE

int *checkAndAdvance(int * __attribute__((aligned(16))) p) {
  if (*p == 0)
    return checkAndAdvance(p + 4);
  return p;
}

4/16

slide-26
SLIDE 26

THE ATTRIBUTOR — USAGE

Attributor A;
// Select what information is to be deduced.
IRPosition IRPRet = IRPosition::returned(Fn);
const auto &AA = A.getOrCreateAAFor<AAAlign>(IRPRet);
// Deduce information and manifest it in the IR.
auto Changed = A.run(*Fn->getParent());

5/16


slide-29
SLIDE 29

THE ATTRIBUTOR — USAGE

// Restrict deduction to specific abstract attributes.
auto Whitelist = {&AAAlign::ID};
Attributor A(Whitelist);
// Select what information is to be deduced.
IRPosition IRPRet = IRPosition::returned(Fn);
const auto &AA = A.getOrCreateAAFor<AAAlign>(IRPRet);
// Deduce information and manifest it in the IR.
auto Changed = A.run(*Fn->getParent());

5/16


slide-31
SLIDE 31

THE ATTRIBUTOR — USAGE

// Restrict deduction to specific abstract attributes.
auto Whitelist = {&AAAlign::ID,
                  /* Think IP-SCCP */
                  &AAIsDead::ID, &AAValueSimplify::ID};
Attributor A(Whitelist);
// Select what information is to be deduced.
IRPosition IRPRet = IRPosition::returned(Fn);
const auto &AA = A.getOrCreateAAFor<AAAlign>(IRPRet);
// Deduce information and manifest it in the IR.
auto Changed = A.run(*Fn->getParent());

5/16

AAAlign is unaware of AAIsDead and AAValueSimplify!

slide-32
SLIDE 32

THE ATTRIBUTOR — WHAT IT IS

  • easy way to perform fixpoint analyses

dependence tracking, work list algorithm, timeouts, …

  • powerful way to perform fixpoint analyses

utilize concurrently deduced information, e.g., liveness

  • alternative to inlining

IPO + internalization + function rewriting, e.g., argument promotion

6/16
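The bullets above can be made concrete with a miniature of the driver loop: a work list of abstract attributes, dependence tracking (only re-run readers of changed state), and an iteration cap standing in for the timeout. All names here are hypothetical stand-ins, not the LLVM API:

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <functional>
#include <map>
#include <numeric>
#include <set>
#include <string>
#include <vector>

// Hypothetical miniature of the Attributor's driver loop.
struct MiniAttributor {
  struct AA {
    std::set<std::string> Reads; // state keys this AA depends on
    std::string Writes;          // state key this AA updates
    std::function<int(std::map<std::string, int> &)> Update;
  };
  std::vector<AA> AAs;

  // Returns true if a fixpoint was reached within MaxIterations.
  bool run(std::map<std::string, int> &State, int MaxIterations = 32) {
    std::deque<size_t> Worklist;
    for (size_t I = 0; I < AAs.size(); ++I)
      Worklist.push_back(I);
    while (!Worklist.empty() && MaxIterations-- > 0) {
      size_t I = Worklist.front();
      Worklist.pop_front();
      int New = AAs[I].Update(State);
      if (State[AAs[I].Writes] == New)
        continue; // no change, nothing to propagate
      State[AAs[I].Writes] = New;
      // Dependence tracking: re-queue every AA that reads what changed.
      for (size_t J = 0; J < AAs.size(); ++J)
        if (AAs[J].Reads.count(AAs[I].Writes))
          Worklist.push_back(J);
    }
    return Worklist.empty();
  }
};
```

Registering a single "alignment of the returned value" update that depends on its own result reproduces the earlier alignment example: from an optimistic starting value the state converges to 16 and the work list drains.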


All good, but why?

slide-37
SLIDE 37
  • II. MOTIVATION
slide-38
SLIDE 38

THE ATTRIBUTOR — WHY IPO?

inlining has limits:

  • recursion ≡ loops

  • code size
  • parallelism (think pthread_create) ⇑
  • (declarations) ⇒

7/16


"Header Time Optimization": Cross-Translation Unit Optimization via Annotated Headers

William S. Moses (wmoses@mit.edu), Johannes Doerfert (jdoerfert@anl.gov) — MIT CSAIL, Argonne National Lab

Writing Optimizable Code is Hard

How do we ensure that norm is hoisted outside the loop (and normalize vectorized)?

double norm(double *A, int n);
void normalize(double *out, double *in, int n) {
  for (int i = 0; i < n; ++i)
    out[i] = in[i] / norm(in, n);
}

We could try adding: restrict type, const type, pure attribute, #pragma vectorize(enable), #pragma interleave(enable), __declspec((noalias)). None of those work. What we really want are two LLVM attributes:

__attribute__((fn_attr("readonly"), fn_attr("argmemonly")))
double norm(double *A, int n);
void normalize(double *restrict out, double *restrict in, int n);

This is a problem in real programs! In the DOE RSBench benchmark [2], adding "readnone" to fast_cexp gives a 7% improvement to the entire program (with another 1% for "unwind").

Automatically Making Code Optimizable

LLVM automatically derives these attributes as part of the compilation process, then throws them away when it’s done. Let’s ensure this information is accessible across translation units.

Why not always use LTO? Running LTO (even ThinLTO [3]) is a burden on compile times; LTO may not be available in your build / operating system; and it’s often impossible to run LTO on your entire program (e.g., when using an external library). Also, it’s interesting to see how much of LTO’s speedups come from “easily fixable” mechanisms, and to give users the agency to fix them in source code (making the speedups available to everyone, independent of the compiler/linker used).

Header Files

HTO creates new files in a given directory that can be included in any C/C++ program (chosen for easiest experimentation). Not all LLVM attributes are representable with existing Clang attributes, so we created a generic way to represent LLVM attributes in Clang:

struct Vector;
struct Matrix;
__attribute__((fn_attr("readonly"), arg_attr(0, "readonly"), ret_attr("noalias")))
Vector *matvec(Matrix *M, Vector *B);

Introducing "Header Time Optimization"

At the end of the compilation process, we denote which derived attributes can safely be added to functions, using LLVM’s existing analyses and the Attributor [1]. Header time optimization has three modes of operation: remark mode (Figure 1), pipeline mode (Figures 2 and 3), and diff mode (in progress), where we create a diff for the original source tree.

Figure 1. Remark mode: print optimization remarks (clang -Rannotations) for attributes that should be added to functions.
Figure 2. Pipeline mode: automatically generate a new header file with the derived information (clang -hto_dir=hto), then recompile the source including this header (clang -include hto/*).
Figure 3. Pipeline mode for a library: the annotated header is shipped with the library and used to compile user code.

Present Limitations & Future Work

We currently don’t generate annotations for functions with anonymous structs (we have a script to automatically generate random names), C++ member functions (since they can’t be forward declared), or array types of structs/classes (a type mystruct[3] is incomplete ahead of time). When we allow users to output a diff (easier for integration) rather than a pipeline (easier for experiments), these limitations are resolved and we get more performance gains. In the future we plan to generate standard C/C++ attributes where they exist.

Experiments

We ran multi-source benchmarks in the LLVM test suite. Annotated headers allow LLVM to perform better optimizations: a 165% increase in mem2reg promotions, a 33% increase in correlated value propagations, a 28% increase in common subexpression eliminations, etc.

HTO was able to find significant speedups for many programs. Comparing with LTO, there are three cases of interest: where neither found a speedup, where LTO found a speedup HTO didn’t, and where both HTO and LTO found a speedup.

Figure 4. Speedups of HTO and LTO on the LLVM multi-source test suite.

Looking at the benchmarks where either LTO or HTO found a speedup: more than half of the LTO speedups can be obtained by function annotations / HTO alone. For the other half, LTO takes significantly longer to compile, implying that inlining/IPO is necessary.

Figure 5. Comparison between LTO and HTO on codes where a speedup exists.

Acknowledgements & References

William S. Moses was supported in part by a DOE Computational Sciences Graduate Fellowship DE-SC0019323, Google Summer of Code, NSF Grants 1533644 and 1533644, LANL grant 531711, and IBM grant W1771646. Johannes Doerfert was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative.

[1] J. Doerfert, H. Ueno, and S. Stipanovic. The Attributor: A Versatile Inter-procedural Fixpoint Iteration Framework. US LLVM Dev Meeting, 2019.
[2] J. Doerfert, B. Homerding, and H. Finkel. Performance Exploration Through Optimistic Static Program Annotations. International Conference on High Performance Computing, pages 247–268. Springer, 2019.
[3] T. Johnson, M. Amini, and X. D. Li. ThinLTO: Scalable and Incremental LTO. 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 111–121. IEEE, 2017.
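The hoisting the poster asks for can be approximated today with the standard GCC/Clang `pure` attribute in place of the proposed fn_attr("readonly") spelling, so the sketch below is an adaptation of the poster's running example, not its exact mechanism:

```cpp
#include <cassert>
#include <cmath>

// `pure` tells the optimizer norm() only reads memory and has no side
// effects, so the call inside normalize()'s loop becomes hoistable.
__attribute__((pure)) double norm(const double *A, int N) {
  double S = 0;
  for (int I = 0; I < N; ++I)
    S += A[I] * A[I];
  return std::sqrt(S);
}

void normalize(double *Out, const double *In, int N) {
  for (int I = 0; I < N; ++I)
    Out[I] = In[I] / norm(In, N); // hoistable once norm is known pure
}
```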
slide-45
SLIDE 45

THE ATTRIBUTOR — WHY A FRAMEWORK?

8/16


slide-48
SLIDE 48
  • III. DESIGN
slide-49
SLIDE 49

LLVM-IR POSITIONS

9/16


slide-51
SLIDE 51

AAVALUESIMPLIFYRETURNED::UPDATEIMPL(ATTRIBUTOR &A)

ChangeStatus updateImpl(Attributor &A) override { }

10/16

slide-52
SLIDE 52

AAVALUESIMPLIFYRETURNED::UPDATEIMPL(ATTRIBUTOR &A)

ChangeStatus updateImpl(Attributor &A) override {
  Optional<Value *> Before = getAssumedSimplifiedValue();
  Optional<Value *> After = getAssumedSimplifiedValue();
  if (Before == After)
    return ChangeStatus::UNCHANGED;
  return ChangeStatus::CHANGED;
}

10/16

slide-53
SLIDE 53

AAVALUESIMPLIFYRETURNED::UPDATEIMPL(ATTRIBUTOR &A)

ChangeStatus updateImpl(Attributor &A) override {
  Optional<Value *> Before = getAssumedSimplifiedValue();
  auto Pred = [&](Instruction &I) {
  };
  if (!A.checkForAllInstructions(Pred, this, {Instruction::Ret}))
    return indicatePessimisticFixpoint();
  Optional<Value *> After = getAssumedSimplifiedValue();
  if (Before == After)
    return ChangeStatus::UNCHANGED;
  return ChangeStatus::CHANGED;
}

10/16

slide-54
SLIDE 54

AAVALUESIMPLIFYRETURNED::UPDATEIMPL(ATTRIBUTOR &A)

ChangeStatus updateImpl(Attributor &A) override {
  Optional<Value *> Before = getAssumedSimplifiedValue();
  auto Pred = [&](Instruction &I) {
    A.getAAFor<AAValueSimplify>(this, I.getOperand(0));
  };
  if (!A.checkForAllInstructions(Pred, this, {Instruction::Ret}))
    return indicatePessimisticFixpoint();
  Optional<Value *> After = getAssumedSimplifiedValue();
  if (Before == After)
    return ChangeStatus::UNCHANGED;
  return ChangeStatus::CHANGED;
}

10/16

slide-55
SLIDE 55

AAVALUESIMPLIFYRETURNED::UPDATEIMPL(ATTRIBUTOR &A)

ChangeStatus updateImpl(Attributor &A) override {
  Optional<Value *> Before = getAssumedSimplifiedValue();
  auto Pred = [&](Instruction &I) {
    return combine(A.getAAFor<AAValueSimplify>(this, I.getOperand(0)));
  };
  if (!A.checkForAllInstructions(Pred, this, {Instruction::Ret}))
    return indicatePessimisticFixpoint();
  Optional<Value *> After = getAssumedSimplifiedValue();
  if (Before == After)
    return ChangeStatus::UNCHANGED;
  return ChangeStatus::CHANGED;
}

10/16
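The contract of this update step can be mocked without LLVM types: compare the assumed state before and after the update, give up pessimistically when a return cannot be handled, and report whether anything moved. All types and names below are stand-ins, not the real classes:

```cpp
#include <cassert>
#include <optional>
#include <vector>

enum class ChangeStatus { UNCHANGED, CHANGED };

// Mock of the AAValueSimplifyReturned update contract: `Simplified`
// plays the role of Optional<Value *>, ints stand in for returned values.
struct MockAA {
  std::optional<int> Simplified;

  ChangeStatus indicatePessimisticFixpoint() {
    Simplified.reset(); // give up: no single simplified value
    return ChangeStatus::CHANGED;
  }

  ChangeStatus updateImpl(const std::vector<int> &ReturnedValues) {
    std::optional<int> Before = Simplified;
    for (int V : ReturnedValues) {
      if (!Simplified)
        Simplified = V;          // first return seen
      else if (*Simplified != V) // returns disagree
        return indicatePessimisticFixpoint();
    }
    std::optional<int> After = Simplified;
    return Before == After ? ChangeStatus::UNCHANGED : ChangeStatus::CHANGED;
  }
};
```

Running the update twice on agreeing returns yields CHANGED then UNCHANGED, which is exactly how the driver detects convergence.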

slide-56
SLIDE 56

NEW ATTRIBUTES

11/16

slide-57
SLIDE 57

NEW ATTRIBUTES

nofree

11/16

slide-58
SLIDE 58

NEW ATTRIBUTES

nosync

11/16

slide-59
SLIDE 59

NEW ATTRIBUTES

willreturn

11/16

slide-60
SLIDE 60

NEW ATTRIBUTES

dereferenceable_globally

11/16

slide-61
SLIDE 61

NON-ATTRIBUTE DEDUCTIONS

12/16

slide-62
SLIDE 62

NON-ATTRIBUTE DEDUCTIONS

liveness

12/16

slide-63
SLIDE 63

NON-ATTRIBUTE DEDUCTIONS

returned values

12/16

slide-64
SLIDE 64

NON-ATTRIBUTE DEDUCTIONS

value simplify

12/16

slide-65
SLIDE 65

NON-ATTRIBUTE DEDUCTIONS

heap-2-stack

12/16
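What the heap-2-stack deduction enables can be shown as a hand-written before/after pair (an illustration, not compiler output): once the allocation provably does not escape, is freed on every path, and the function will return, the malloc/free pair can become a stack allocation.

```cpp
#include <cassert>
#include <cstdlib>

// Before: temporary buffer lives on the heap but never escapes.
int sumBefore(const int *In, int N) {
  int *Tmp = (int *)std::malloc(N * sizeof(int));
  int S = 0;
  for (int I = 0; I < N; ++I) {
    Tmp[I] = In[I] * 2;
    S += Tmp[I];
  }
  std::free(Tmp);
  return S;
}

// After: the allocation is promoted to the stack (bound known/small here).
int sumAfter(const int *In, int N) {
  int Tmp[16];
  int S = 0;
  for (int I = 0; I < N; ++I) {
    Tmp[I] = In[I] * 2;
    S += Tmp[I];
  }
  return S;
}
```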

slide-66
SLIDE 66

NON-ATTRIBUTE DEDUCTIONS

pointer privatization

12/16

slide-67
SLIDE 67

THE ATTRIBUTOR — CHALLENGES

when to specialize for call sites (≡ “inlining + outlining”)

13/16


slide-69
SLIDE 69

THE ATTRIBUTOR — CHALLENGES

how to seed abstract attributes (heuristics, pgo-based, ...)

13/16

slide-70
SLIDE 70

THE ATTRIBUTOR — CHALLENGES

reduce overheads

13/16

slide-71
SLIDE 71

THE ATTRIBUTOR — CHALLENGES

combine deduction schemes, e.g., context-based & def-use-based

13/16


slide-73
SLIDE 73

EVALUATION — FUNCTIONATTRS (LATE) VS. ATTRIBUTOR (EARLY)

loc.  attribute         # w/o A.   # w/ A.      w/ A. Δ   % tot. w/o A.  % tot. w/ A.
fn.   nosync                   0      7612          —            0.0%         4.36%
arg.  dereferenceable      61825     66317       +7.27%         35.4%         38.0%
fn.   nofree                5762     10188      +76.81%          3.3%         5.83%
fn.   willreturn               0      4146          —            0.0%         2.37%
arg.  writeonly                0      3562          —            0.0%         2.04%
arg.  readnone              5377      6040      +12.33%         3.08%         3.46%
fn.   noreturn               965      1611      +66.94%        0.553%        0.923%
arg.  align                  419       900     +114.80%         0.24%        0.515%
ret.  dereferenceable      19041     19479       +2.30%         11.2%         11.4%
arg.  nocapture            28991     29413       +1.46%         16.6%         16.8%
arg.  readonly             14946     15281       +2.24%         8.56%         8.75%
arg.  returned               512       599      +16.99%        0.293%        0.343%
arg.  noalias               4098      4158       +1.46%         2.35%         2.38%
ret.  noalias               1150      1194       +3.83%        0.676%        0.701%

14/16


Details on our poster!

slide-77
SLIDE 77

EVALUATION — (ATTRIBUTOR AIDED) “HEADER TIME OPTIMIZATION” (HTO)

(Figure: HTO vs. LTO speedups — legend: “LTO better than HTO”, “HTO matches LTO”.)

15/16


Details on our poster!

slide-79
SLIDE 79

THE ATTRIBUTOR FRAMEWORK @ LLVM-DEV’19

Tutorial: tomorrow 1:45pm - 2:55pm Posters: tomorrow 4:00pm - 5:00pm

  • contrived but illustrative background example
  • initialization of abstract attributes with “known” information derived through existing attributes
  • propagation happens step-by-step until a fixpoint is reached
  • work-list style, dependence tracking, timeouts, …
  • manifest the final state in the IR: add attributes, remove dead code, replace constants, rewrite function signatures, … — functions have more information
  • abstractions & helpers available for this “standard” layout; other designs possible though
  • deduce information on one call site / instruction / value / … at a time
  • this coupling is the reason we want a single framework

"Header Time Optimization": Cross-Translation Unit Optimization via Annotated Headers

William S. Moses (wmoses@mit.edu), Johannes Doerfert (jdoerfert@anl.gov) MIT CSAIL, Argonne National Lab Writing Optimizable Code is Hard How do we ensure that norm is hoisted outside the loop (and normalize vectorized)? double norm(double *A, int n); void normalize(double *out, double *in, int n) { for (int i = 0; i < n; ++i)
out[i] = in[i] / norm(in, n);
}

We could try adding: a restrict type, a const type, the pure attribute, #pragma vectorize(enable), #pragma interleave(enable), or __declspec(noalias). None of those work. What we really want are two LLVM attributes:

__attribute__((fn_attr("readonly"), fn_attr("argmemonly")))
double norm(double *A, int n);
void normalize(double *restrict out, double *restrict in, int n);

This is a problem in real programs! In the DOE RSBench benchmark [2], adding "readnone" to fast_cexp gives a 7% improvement to the entire program (with another 1% for "nounwind").

Automatically Making Code Optimizable

LLVM automatically derives these attributes as part of the compilation process, then throws them away when it is done. Let's ensure this information is accessible across translation units.

Why not always use LTO?

• Running LTO (even ThinLTO [3]) is a burden on compile times.
• LTO may not be available in your build / operating system.
• It is often impossible to run LTO on your entire program (e.g., when using an external library).

Also, it is interesting to see how much of LTO's speedups come from "easily fixable" mechanisms, and to give users the agency to fix them in source code (making the speedups available to everyone, independent of the compiler and linker used).

Header Files

HTO creates new files in a given directory that can be included in any C/C++ program (chosen for easiest experimentation). Not all LLVM attributes are representable with existing Clang attributes. We created a generic way to represent LLVM attributes in Clang (shown below).

struct Vector;
struct Matrix;
__attribute__((fn_attr("readonly"), arg_attr(0, "readonly"), ret_attr("noalias")))
Vector* matvec(Matrix *M, Vector *B);

Introducing "Header Time Optimization"

At the end of the compilation process, denote which derived attributes can safely be added to functions, using LLVM's existing analyses and the Attributor [1]. Header time optimization has three modes of operation: remark mode (Figure 1), pipeline mode (Figures 2 and 3), and diff mode (in progress), where we create a diff for the original source tree.

Figure 1. Remark Mode: print optimization remarks for attributes that should be added to functions. Compiling file1.c with clang -Rannotations yields, e.g.:

file1.c:2:1: remark: derived following attributes: fn_attr("readonly") arg_attr(0, "readonly") [-Rannotations]
double fcexp(double* a, int n) {

Figure 2. Pipeline Mode: automatically generate header files with the derived information (clang -hto_dir=hto writes hto/file1.h ... hto/fileN.h, each containing annotated declarations such as __attribute__((fn_attr("readnone"))) double fcexp(double *A);), then recompile the sources with these headers (clang -include hto/*) to produce the executable.

Figure 3. Pipeline mode for a library: the annotated header (e.g., sum.h with __attribute__((fn_attr("readnone"))) double sum(double *A);) is shipped with the library (libsum.o) and used to compile user code (clang user.c -lsum).

Present Limitations & Future Work

We currently do not generate annotations for functions with anonymous structs (we have a script to automatically generate random names), C++ member functions (since they cannot be forward declared), or array types of structs/classes (the type mystruct[3] is incomplete ahead of time). When we allow users to output a diff (easier for integration) rather than a pipeline (easier for experiments), these limitations are resolved and we get more performance gains. In the future we plan to generate standard C/C++ attributes where they exist.

Experiments

We ran the multi-source benchmarks in the LLVM test suite. Annotated headers allow LLVM to perform more and better optimizations: a 165% increase in mem2reg promotions, a 33% increase in correlated value propagations, a 28% increase in common subexpression eliminations, etc.
HTO was able to find significant speedups for many programs. Comparing with LTO, we find three cases of interest: benchmarks where neither found a speedup, where LTO found a speedup and HTO did not, and where both HTO and LTO found a speedup.

Figure 4. Speedups of HTO and LTO on the LLVM multi-source test suite (per-benchmark speedup of HTO over Normal, and of HTO and LTO over Normal; differences within ±2% are treated as noise, distinguishing "LTO better than HTO" from "HTO matches LTO").

Let's now look at the benchmarks where either LTO or HTO found a speedup. We see that more than half of the LTO speedups can be obtained by function annotations/HTO alone. For the other half, LTO takes significantly longer to compile, implying that inlining/IPO is necessary.
Figure 5. Comparison between LTO and HTO on codes where a speedup exists: per-benchmark speedups of HTO and LTO (bucketed into "HTO < LTO ±2%" and "HTO ≈ LTO ±2%"), and the compile+link time overhead of LTO over HTO (up to 205%).

Acknowledgements & References

William S. Moses was supported in part by a DOE Computational Sciences Graduate Fellowship (DE-SC0019323), Google Summer of Code, NSF Grant 1533644, LANL grant 531711, and IBM grant W1771646. Johannes Doerfert was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation's exascale computing imperative.

[1] Johannes Doerfert, Hideto Ueno, and Stefan Stipanovic. The Attributor: A Versatile Inter-procedural Fixpoint Iteration Framework. US LLVM Developers' Meeting, 2019.
[2] Johannes Doerfert, Brian Homerding, and Hal Finkel. Performance Exploration Through Optimistic Static Program Annotations. In International Conference on High Performance Computing, pages 247–268. Springer, 2019.
[3] Teresa Johnson, Mehdi Amini, and Xinliang David Li. ThinLTO: Scalable and Incremental LTO. In 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 111–121. IEEE, 2017.

16/16

slide-80
SLIDE 80

THE ATTRIBUTOR FRAMEWORK @ LLVM-DEV’19

Tutorial: tomorrow 1:45pm - 2:55pm Posters: tomorrow 4:00pm - 5:00pm


"Header Time Optimization": Cross-Translation Unit Optimization via Annotated Headers

William S. Moses (wmoses@mit.edu), Johannes Doerfert (jdoerfert@anl.gov), MIT CSAIL / Argonne National Lab

1) introduce a new llvm::Attribute
2) derive the new llvm::Attribute with the Attributor
3) use the new llvm::Attribute to improve alias analysis
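The three steps above can be sketched as schematic pseudocode in the style of LLVM's C++ API around 2019. The names AANoFoo and Attribute::NoFoo are hypothetical placeholders, and the method signatures are abbreviated, not copy-paste ready:

```
// (1) Declare a new enum attribute, e.g. Attribute::NoFoo, so it can
//     appear on functions/arguments in the IR.

// (2) Derive it with an abstract attribute driven by the Attributor's
//     fixpoint iteration:
struct AANoFoo : public AbstractAttribute {
  void initialize(Attributor &A) override;        // optimistic assumed state
  ChangeStatus updateImpl(Attributor &A) override; // query other AAs, call sites
  ChangeStatus manifest(Attributor &A) override;   // write Attribute::NoFoo
};
// seeded for interesting positions, e.g.:
//   A.getOrCreateAAFor<AANoFoo>(IRPosition::function(F));

// (3) Consume the manifested attribute, e.g. in alias analysis:
//   if (F->hasFnAttribute(Attribute::NoFoo)) { /* stronger AA result */ }
```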

slide-81
SLIDE 81

THE ATTRIBUTOR FRAMEWORK @ LLVM-DEV’19


slide-82
SLIDE 82

THE ATTRIBUTOR FRAMEWORK @ LLVM-DEV’19


slide-83
SLIDE 83

THE ATTRIBUTOR FRAMEWORK @ LLVM-DEV’19

  • 5%
0% 5% 10% 15% 20% 25% Speedup 1 HTO Normal Speedup of HTO and LTO over Normal
  • 10%
  • 5%
0% 5% 10% 15% 20% 25% 2% LTO speedup
  • 10%
  • 5%
0% 5% 10% 15% 20% 25% 2% HTO speedup LTO beer than HTO HTO matches LTO Figure 4. Speedups of HTO and LTO on the LLVM mulsource test suite Let’s now look at the benchmarks where either LTO or HTO found a speedup. We see that for more than half of the LTO speedups can be simply derived by funcon annotaons/HTO alone. For the other half of the speedups, LTO takes sigificantly longer to compile, implying that inlining/IPO is necessary. Speedup of HTO and LTO Benchmark
  • 2%
0% 2% 5% 10% 15% 20% 25% Speedup HTO < LTO ±2% HTO LTO ±2% HTO LTO Compile+Link me overhead of LTO over HTO Benchmark 0% 25% 50% 75% 100% 125% 150% LTO Compilation Slowdown LTO HTO 1 205% (runtime) HTO < LTO ±2% (runtime) HTO LTO ±2% Figure 5. Comparison between LTO and HTO on codes where a speedup exists. Acknowledgements & References William S. Moses was supported in part by a DOE Computaonal Sciences Graduate Fellowship DE-SC0019323, Google Summer of Code, NSF Grant 1533644 and 1533644, LANL grant 531711, and IBM grant W1771646. Johannes Doerfert was supported by the Exascale Compung Project (17-SC-20-SC), a collaborave effort of two U.S. Department of Energy
  • rganizaons (Office of Science and the Naonal Nuclear Security Administraon) responsible for the planning and prepa-
raon of a capable exascale ecosystem, including soware, applicaons, hardware, advanced system engineering, and early testbed plaorms, in support of the naon’s exascale compung imperave. [1] J. Doerfert, H. Ueno, and S.0 Spanovic: The Aributor: A Versale Inter-procedural Fixpoint Iteraon Framework. US LLVM Dev Meeng, 2019. [2] Johannes Doerfert, Brian Homerding, and Hal Finkel. Performance exploraon through opmisc stac program annotaons. In Internaonal Conference on High Performance Compung, pages 247–268. Springer, 2019. [3] Teresa Johnson, Mehdi Amini, and Xinliang David Li. Thinlto: scalable and incremental lto. In 2017 IEEE/ACM Internaonal Symposium on Code Generaon and Opmizaon (CGO), pages 111–121. IEEE, 2017.

16/16
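The poster's fn_attr(...) spelling is a proposed Clang extension and will not compile with a stock toolchain. As a rough stand-in, stock Clang/GCC already offer __attribute__((pure)) ("only reads memory, no other side effects"), which is enough to let the compiler hoist the norm call out of the normalize loop from the example above. The sketch below is ours, not the HTO implementation; norm1 (an L1 norm, chosen so no libm is needed) and the const-qualified signatures are our assumptions:

```c
#include <stddef.h>

/* `pure` is the stock Clang/GCC cousin of the proposed fn_attr("readonly"):
   it promises norm1 only reads memory and has no other side effects, so a
   compiler may hoist the call out of the loop in normalize. */
__attribute__((pure))
double norm1(const double *A, int n) {
  double s = 0.0;
  for (int i = 0; i < n; ++i)
    s += A[i] < 0.0 ? -A[i] : A[i]; /* sum of |A[i]| */
  return s;
}

void normalize(double *restrict out, const double *restrict in, int n) {
  for (int i = 0; i < n; ++i)
    out[i] = in[i] / norm1(in, n); /* hoistable thanks to `pure` */
}
```

Without the attribute (or cross-TU knowledge of norm1's body), the compiler must assume norm1 could write to out and has to re-evaluate it every iteration; with it, the call is loop-invariant.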

slide-84
SLIDE 84

THE ATTRIBUTOR FRAMEWORK @ LLVM-DEV’19

Tutorial: tomorrow 1:45pm - 2:55pm Posters: tomorrow 4:00pm - 5:00pm

ATTRIBUTOR, A FRAMEWORK FOR INTER-PROCEDURAL INFORMATION DEDUCTION
Johannes Doerfert, Hideto Ueno, Stefan Stipanovic
Argonne National Laboratory, University of Tokyo, University of Novi Sad

Abstract
LLVM functions, arguments, and other entities can be tagged with attributes to encode information, e.g., readonly if a function only reads memory, or nounwind if a function will not throw exceptions. These attributes are used, explicitly or implicitly, by many optimizations to decide whether a transformation is valid. The goal of this project was to replace the current function attribute inference algorithms as well as strongly entangled IPOs, e.g., argument promotion. This is accomplished via intra- and inter-procedural fixpoint analyses in which the (optimistic) state can be shared at will. The Attributor makes this process easy, through new abstractions that prove to be useful not only for attribute deduction but for other transformations and analyses. For example, the Attributor will not deduce information for dead code, it will simplify values (think IPSCCP), perform heap-to-stack conversion, and more. As part of this work we also added and infer new attributes (nofree, nosync, willreturn) and we started to use the now available information in more places, e.g., dereferenceable is now used to improve alias queries.

Running Example

Acknowledgements
Hideto Ueno and Stefan Stipanovic were supported by Google Summer of Code (GSoC) and through LLVM Foundation travel grants. This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation's exascale computing imperative. Additionally, this research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

Design: (LLVM) IRPositions + Abstract States — (1) Initialize, (2) Propagate, (3) Manifest

Evaluation: # Attributes in the IR

location  attribute        # w/o Attributor  # w/ Attributor  Attributor Δ  total w/o  total w/
argument  nonnull                 806             67469          82.7×        0.46%     38.6%
function  nosync                    —              7612             —          0.0%      4.36%
returned  nonnull                 950             20046         20.11×        0.558%    11.8%
argument  dereferenceable       61825             66317         +7.27%       35.4%     38.0%
function  nofree                 5762             10188        +76.81%        3.3%      5.83%
function  willreturn                —              4146             —          0.0%      2.37%
argument  writeonly                 —              3562             —          0.0%      2.04%
argument  readnone               5377              6040        +12.33%        3.08%     3.46%
function  noreturn                965              1611        +66.94%        0.553%    0.923%
argument  align                   419               900       +114.80%        0.24%     0.515%
returned  dereferenceable       19041             19479         +2.30%       11.2%     11.4%
returned  align                     —               432             —          0.0%      0.254%
argument  nocapture             28991             29413         +1.456%      16.6%     16.8%
argument  readonly              14946             15281         +2.24%        8.56%     8.75%
argument  returned                512               599        +16.99%        0.293%    0.343%
function  norecurse              8627              8714         +1.00%        4.94%     4.99%
function  nounwind              92888             92823         -0.07%       53.2%     53.2%
argument  noalias                4098              4158         +1.46%        2.35%     2.38%
returned  noalias                1150              1194         +3.83%        0.676%    0.701%
function  readnone               2324              2336         +0.52%        1.33%     1.34%
function  writeonly              1344              1354         +0.74%        0.77%     0.775%

Abstract Attribute Hierarchy — Liveness-Aware Helpers — Attribute Interaction
Abstract Attribute Highlights: AAIsDead (omnipresent liveness information); Must-Be-Executed-Context-based deduction; Heap-To-Stack Conversion

Poster callouts: contrived but illustrative; background information; initialization of abstract attributes with known "IR-knowledge" derived through existing APIs and attributes; propagation happens "step-by-step" until a fixpoint is reached (work-list style, dependence tracking, timeout, ...); manifest the final state in the IR: add attributes, remove dead code, replace constants, rewrite function signatures, ...; > 40% of functions have more information; abstractions & helpers are available for this "standard" layout, other designs are possible though: deduce information one call site/instruction/value/... at a time — this coupling is the reason we want a single framework.

"Header Time Optimization": Cross-Translation Unit Optimization via Annotated Headers

William S. Moses (wmoses@mit.edu), Johannes Doerfert (jdoerfert@anl.gov)
MIT CSAIL, Argonne National Lab

Writing Optimizable Code is Hard
How do we ensure that norm is hoisted outside the loop (and normalize vectorized)?

double norm(double *A, int n);
void normalize(double *out, double *in, int n) {
  for (int i = 0; i < n; ++i)
    out[i] = in[i] / norm(in, n);
}

(The remainder of the poster is reproduced above.)

16/16

slide-85
SLIDE 85


Visit our posters and tutorial!

slide-86
SLIDE 86
slide-87
SLIDE 87

THE ATTRIBUTOR — EVALUATION — ASSUMING EXACT DEFINITIONS

loc.  attribute        # w/o A.   # w/ A.     A. Δ      tot. w/o A.  tot. w/ A.
fn.   nosync                —       78491        —          0.0%       45.90%
arg.  dereferenceable   59578       64214     +7.78%       34.8%       37.50%
fn.   nofree            25649       76719   +199.11%       15.0%       44.90%
fn.   willreturn            —       64748        —          0.0%       37.90%
arg.  writeonly             —        4229        —          0.0%        2.47%
arg.  readnone          40505       38414     -5.16%       23.7%       22.50%
fn.   noreturn            879        2394   +172.36%        0.514%      1.40%
arg.  align               449        1028   +128.95%        0.263%      0.60%
ret.  dereferenceable   18064       19419     +7.50%       10.8%       11.60%
arg.  nocapture        153523      155294     +1.15%       89.8%       90.80%
arg.  returned           9418       13937    +47.98%        5.51%       8.15%
arg.  noalias            4113        4189     +1.85%        2.41%       2.45%
ret.  noalias            3015        3310     +9.78%        1.81%       1.98%
fn.   writeonly          8089        9877    +22.10%        4.73%       5.78%
fn.   nounwind         123516      125480     +1.59%       72.2%       73.40%

slide-88
SLIDE 88

MUST-BE-EXECUTED-CONTEXT

slide-89
SLIDE 89

MUST-BE-EXECUTED-CONTEXT

slide-90
SLIDE 90

INLINING VS. IPO

The “inline-first” approach:
  I: aggressive inlining, e.g., all 𝑂 call sites
  II: perform intra-procedural analyses + transformations (𝑂 times)
  III: derive information + transformation opportunities inter-procedurally

The “IPO-first” approach:
  I: derive information + transformation opportunities inter-procedurally
  II: internalize & specialize functions if necessary & beneficial
  III: inline where a benefit can be expected

slide-91
SLIDE 91

INLINING VS. IPO

slide-92
SLIDE 92

INLINING VS. IPO