Advances in Loop Analysis Frameworks and Optimizations Adam Nemet - - PowerPoint PPT Presentation


slide-1
SLIDE 1

Advances in Loop Analysis Frameworks and Optimizations

Adam Nemet & Michael Zolotukhin Apple

slide-2
SLIDE 2

Loop Unrolling

for (x = 0; x < 6; x++) {
  foo(x);
}

slide-3
SLIDE 3

Loop Unrolling

for (x = 0; x < 6; x += 2) {
  foo(x);
  foo(x + 1);
}

slide-4
SLIDE 4

Loop Unrolling

{
  foo(x);
  foo(x + 1);
  foo(x + 2);
  foo(x + 3);
  foo(x + 4);
  foo(x + 5);
}
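
The three stages above can be compared directly. The following is a minimal sketch, not code from the talk: `foo` is a hypothetical stand-in that records its argument, and `check` is an illustrative helper. It shows that the rolled loop, the loop unrolled by 2, and the fully unrolled body all issue the same calls in the same order.

```c
#include <stddef.h>

/* Hypothetical stand-in for the slides' foo(): records each argument. */
static int calls[16];
static size_t ncalls;

static void foo(int x) { calls[ncalls++] = x; }

/* Original loop: six iterations, one call each. */
static void rolled(void) {
  for (int x = 0; x < 6; x++) {
    foo(x);
  }
}

/* Unrolled by a factor of 2: three iterations, two calls each. */
static void unrolled_by_2(void) {
  for (int x = 0; x < 6; x += 2) {
    foo(x);
    foo(x + 1);
  }
}

/* Fully unrolled: no loop, so no increment, compare, or branch remains. */
static void fully_unrolled(void) {
  int x = 0;
  foo(x);
  foo(x + 1);
  foo(x + 2);
  foo(x + 3);
  foo(x + 4);
  foo(x + 5);
}

/* Runs one variant from a clean slate; returns 1 if it called
   foo(0), foo(1), ..., foo(5) in exactly that order. */
static int check(void (*variant)(void)) {
  ncalls = 0;
  variant();
  if (ncalls != 6) return 0;
  for (size_t i = 0; i < 6; i++)
    if (calls[i] != (int)i) return 0;
  return 1;
}
```

Calling `check(rolled)`, `check(unrolled_by_2)`, and `check(fully_unrolled)` should each return 1; the variants differ only in how much loop control code surrounds the calls.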

slide-5
SLIDE 5

Unrolling: Pros and Cons

+ Removes loop overhead
+ Enables other optimizations
– Increases code size
– Increases compile time
– Might regress performance

slide-6
SLIDE 6

New Heuristics

  • Aim for bigger loops
  • Analyze the loop body and predict potential optimization candidates for later passes
slide-7
SLIDE 7

Example

const int b[50] = {1, 0, 0, …, 0, 0};

int foo(int *a) {
  int r = 0;

  for (int i = 0; i < 50; i++) {
    r += a[i] * b[i];
  }

  return r;
}

slide-8
SLIDE 8

Example

const int b[50] = {1, 0, 0, …, 0, 0};

int foo(int *a) {
  int r = 0;
  r += a[0] * b[0];
  r += a[1] * b[1];
  …
  r += a[48] * b[48];
  r += a[49] * b[49];
  return r;
}

slide-9
SLIDE 9

Example

const int b[50] = {1, 0, 0, …, 0, 0};

int foo(int *a) {
  int r = 0;
  r += a[0] * 1;
  r += a[1] * 0;
  …
  r += a[48] * 0;
  r += a[49] * 0;
  return r;
}

slide-10
SLIDE 10

Example

const int b[50] = {1, 0, 0, …, 0, 0};

int foo(int *a) {
  return a[0];
}
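
The end result of the example can be checked in a few lines. This is a small sketch under assumptions: the names `foo_loop`, `foo_folded`, and `same_result` are illustrative, the parameter is `const int *` rather than the slides' `int *`, and the array is written as `{1}` so that the other 49 elements are zero-initialized.

```c
/* b has a single non-zero entry; the remaining 49 elements are implicitly 0. */
static const int b[50] = {1};

/* The loop as written on the slides. */
static int foo_loop(const int *a) {
  int r = 0;
  for (int i = 0; i < 50; i++) {
    r += a[i] * b[i];
  }
  return r;
}

/* What is left after full unrolling and constant folding. */
static int foo_folded(const int *a) {
  return a[0];
}

/* Illustrative helper: fills a[i] = seed + 3*i and compares both versions. */
static int same_result(int seed) {
  int a[50];
  for (int i = 0; i < 50; i++) a[i] = seed + 3 * i;
  return foo_loop(a) == foo_folded(a);
}
```

For any input that does not overflow `int`, `same_result` should return 1, which is why a compiler that can see the contents of `b` may reduce the whole loop to a single load.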
 
 
 
 
 


slide-11
SLIDE 11

Analyzing Loop

  • Simulate the loop execution instruction by instruction, iteration by iteration
  • Try to predict possible simplifications of every instruction
  • Compute accurate costs of the original loop and its unrolled version
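
To make the idea concrete, here is a rough sketch of such a simulation for the dot-product example. Everything about the cost model is an assumption made for illustration: each instruction costs one unit, an iteration of the rolled loop has seven instructions (two loads, multiply, add, increment, compare, branch), and the names `loop_costs` and `estimate` are hypothetical. It is not the analysis implemented in LLVM.

```c
#include <stddef.h>

/* Hypothetical cost model: every instruction costs 1 unit. */
enum { COST_PER_INSTR = 1, INSTRS_PER_ITER = 7 };

struct loop_costs {
  unsigned rolled;   /* cost of executing the loop as written */
  unsigned unrolled; /* cost left after the predicted simplifications */
};

/*
 * Walks the loop body "r += a[i] * b[i]" iteration by iteration,
 * for a constant array b and an unknown array a.
 */
static struct loop_costs estimate(const int *b, size_t trip_count) {
  struct loop_costs c = {0, 0};
  int r_is_zero = 1; /* r starts as the constant 0 */

  for (size_t i = 0; i < trip_count; i++) {
    /* Rolled: load b, load a, mul, add, increment, compare, branch. */
    c.rolled += INSTRS_PER_ITER * COST_PER_INSTR;

    /* Unrolled: i is a known constant, so the load of b[i], the
       increment, the compare, and the branch all fold away. */
    if (b[i] == 0) {
      /* a[i] * 0 == 0 and r + 0 == r: load of a, mul, and add vanish. */
      continue;
    }
    c.unrolled += COST_PER_INSTR;                 /* load of a[i] survives */
    if (b[i] != 1) c.unrolled += COST_PER_INSTR;  /* mul survives unless b[i] == 1 */
    if (!r_is_zero) c.unrolled += COST_PER_INSTR; /* add survives unless r == 0 */
    r_is_zero = 0;
  }
  return c;
}
```

With `b = {1, 0, …, 0}` and 50 iterations, this model reports a rolled cost of 350 and an unrolled cost of 1, matching the single load of `a[0]` that remains in the example.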

slide-12
SLIDE 12

How It Works


Iteration 0

%r = 0
loop:
  %y = b[i]
  %x = a[i]
  %t = %x * %y
  %r = %r + %t
  %i = %i + 1
  %cmp = %i < 50
  br %cmp, loop, exit
exit:
  ret %r

Original loop cost
Unrolled loop cost

slide-13
SLIDE 13

How It Works


Iteration 0

%r = 0
loop:
  %y = b[i]              = 1
  %x = a[i]
  %t = %x * %y
  %r = %r + %t
  %i = %i + 1
  %cmp = %i < 50
  br %cmp, loop, exit
exit:
  ret %r

Original loop cost
Unrolled loop cost

slide-15
SLIDE 15

How It Works


Iteration 0

%r = 0
loop:
  %y = b[i]              = 1
  %x = a[i]
  %t = %x * %y           = %x
  %r = %r + %t
  %i = %i + 1
  %cmp = %i < 50
  br %cmp, loop, exit
exit:
  ret %r

Original loop cost
Unrolled loop cost

slide-16
SLIDE 16

How It Works


Iteration 0

%r = 0
loop:
  %y = b[i]              = 1
  %x = a[i]
  %t = %x * %y           = %x
  %r = %r + %t           = %t
  %i = %i + 1
  %cmp = %i < 50
  br %cmp, loop, exit
exit:
  ret %r

Original loop cost
Unrolled loop cost

slide-17
SLIDE 17

How It Works


Iteration 0

%r = 0
loop:
  %y = b[i]              = 1
  %x = a[i]
  %t = %x * %y           = %x
  %r = %r + %t           = %t
  %i = %i + 1            = 1
  %cmp = %i < 50
  br %cmp, loop, exit
exit:
  ret %r

Original loop cost
Unrolled loop cost

slide-18
SLIDE 18

How It Works


Iteration 0

%r = 0
loop:
  %y = b[i]              = 1
  %x = a[i]
  %t = %x * %y           = %x
  %r = %r + %t           = %t
  %i = %i + 1            = 1
  %cmp = %i < 50         = true
  br %cmp, loop, exit
exit:
  ret %r

Original loop cost
Unrolled loop cost

slide-20
SLIDE 20

How It Works


Iteration 1

%r = 0
loop:
  %y = b[i]
  %x = a[i]
  %t = %x * %y
  %r = %r + %t
  %i = %i + 1
  %cmp = %i < 50
  br %cmp, loop, exit
exit:
  ret %r

Original loop cost
Unrolled loop cost

slide-21
SLIDE 21

How It Works


Iteration 1

%r = 0
loop:
  %y = b[i]              = 0
  %x = a[i]
  %t = %x * %y           = 0
  %r = %r + %t           = %r
  %i = %i + 1            = 2
  %cmp = %i < 50         = true
  br %cmp, loop, exit
exit:
  ret %r

Original loop cost
Unrolled loop cost

slide-22
SLIDE 22

How It Works

Iteration 49

%r = 0
loop:
  %y = b[i]
  %x = a[i]
  %t = %x * %y
  %r = %r + %t
  %i = %i + 1
  %cmp = %i < 50
  br %cmp, loop, exit
exit:
  ret %r

Original loop cost
Unrolled loop cost

slide-23
SLIDE 23

How It Works


Original loop cost
Unrolled loop cost

Execution speed-up

slide-24
SLIDE 24

How It Works


Tiny? -> Unroll
Huge? -> Do not unroll
Great speed-up? -> Unroll
Otherwise -> Do not unroll
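
One way to read this decision chain is sketched below. This is an interpretation, not the heuristic as implemented: it assumes "tiny" and "huge" refer to the unrolled loop cost, and the three thresholds, the `decide` function, and the `unroll_decision` enum are hypothetical values and names chosen only for the sketch.

```c
/* Possible outcomes of the heuristic. */
enum unroll_decision { DO_NOT_UNROLL = 0, UNROLL = 1 };

/* All three thresholds are hypothetical values chosen for this sketch. */
enum {
  TINY_COST = 20,       /* unrolled loops at or below this always unroll */
  HUGE_COST = 2000,     /* unrolled loops above this never unroll */
  MIN_SPEEDUP_PCT = 50  /* otherwise require at least this predicted gain */
};

static enum unroll_decision decide(unsigned rolled_cost,
                                   unsigned unrolled_cost) {
  if (unrolled_cost <= TINY_COST) return UNROLL;
  if (unrolled_cost > HUGE_COST) return DO_NOT_UNROLL;
  if (unrolled_cost >= rolled_cost) return DO_NOT_UNROLL; /* no gain at all */

  /* Predicted speed-up: share of the rolled cost that unrolling removes.
     rolled_cost is non-zero here because it exceeds unrolled_cost. */
  unsigned long long saved = rolled_cost - unrolled_cost;
  unsigned long long saved_pct = 100ULL * saved / rolled_cost;
  return saved_pct >= MIN_SPEEDUP_PCT ? UNROLL : DO_NOT_UNROLL;
}
```

With the costs from the dot-product example (350 rolled, 1 unrolled), this sketch unrolls immediately because the unrolled body is tiny; a loop whose unrolled cost is moderate is unrolled only when the predicted saving is large.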

slide-25
SLIDE 25

Unrolling: Results

  • Up to 70% performance gains on kernels
  • Few performance gains across various testsuites
  • No performance regressions
  • Some compile time regressions
slide-26
SLIDE 26

Unrolling: Future Work

  • Enable new heuristics by default after investigating compile time regressions

  • Model other optimizations
  • Find trip count
slide-27
SLIDE 27

Next Up

slide-28
SLIDE 28

General Optimizations

Approach

Loop Transformations Case Study

slide-29
SLIDE 29

General Optimizations

Approach

Loop Transformations 456.hmmer from SPECint 2006

slide-30
SLIDE 30

Case Study

for (k = 1; k <= M; k++) {
  mc[k] = mpp[k-1] + tpmm[k-1];
  if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
  mc[k] += ms[k];
  if (mc[k] < -INFTY) mc[k] = -INFTY;

  dc[k] = dc[k-1] + tpdd[k-1];
  if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  if (dc[k] < -INFTY) dc[k] = -INFTY;

  if (k < M) {
    ic[k] = mpp[k] + tpmi[k];
    if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
    ic[k] += is[k];
    if (ic[k] < -INFTY) ic[k] = -INFTY;
  }
}

slide-31
SLIDE 31

Case Study

for (k = 1; k <= M; k++) {
  mc[k] = mpp[k-1] + tpmm[k-1];
  if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
  mc[k] += ms[k];
  if (mc[k] < -INFTY) mc[k] = -INFTY;

  dc[k] = dc[k-1] + tpdd[k-1];
  if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  if (dc[k] < -INFTY) dc[k] = -INFTY;

  if (k < M) {
    ic[k] = mpp[k] + tpmi[k];
    if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
    ic[k] += is[k];
    if (ic[k] < -INFTY) ic[k] = -INFTY;
  }
}

What does this loop do?

slide-32
SLIDE 32

Case Study

mc[k] = mpp[k-1] + tpmm[k-1];
if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
mc[k] += ms[k];
if (mc[k] < -INFTY) mc[k] = -INFTY;

slide-34
SLIDE 34

Case Study

dc[k] = dc[k-1] + tpdd[k-1];
if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
if (dc[k] < -INFTY) dc[k] = -INFTY;

slide-36
SLIDE 36

Case Study

if (k < M) {
  ic[k] = mpp[k] + tpmi[k];
  if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
  ic[k] += is[k];
  if (ic[k] < -INFTY) ic[k] = -INFTY;
}

slide-38
SLIDE 38

Case Study

for (k = 1; k <= M; k++) {
  mc[k] = mpp[k-1] + tpmm[k-1];
  if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
  mc[k] += ms[k];
  if (mc[k] < -INFTY) mc[k] = -INFTY;

  dc[k] = dc[k-1] + tpdd[k-1];
  if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  if (dc[k] < -INFTY) dc[k] = -INFTY;

  if (k < M) {
    ic[k] = mpp[k] + tpmi[k];
    if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
    ic[k] += is[k];
    if (ic[k] < -INFTY) ic[k] = -INFTY;
  }
}

How can we optimize this loop?

slide-39
SLIDE 39

Can We Vectorize It?

for (k = 1; k <= M; k++) {
  mc[k] = mpp[k-1] + tpmm[k-1];
  if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
  mc[k] += ms[k];
  if (mc[k] < -INFTY) mc[k] = -INFTY;

  dc[k] = dc[k-1] + tpdd[k-1];
  if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  if (dc[k] < -INFTY) dc[k] = -INFTY;

  if (k < M) {
    ic[k] = mpp[k] + tpmi[k];
    if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
    ic[k] += is[k];
    if (ic[k] < -INFTY) ic[k] = -INFTY;
  }
}

No!

slide-40
SLIDE 40

Can We Vectorize It?

dc[k] = dc[k-1] + tpdd[k-1];
if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
if (dc[k] < -INFTY) dc[k] = -INFTY;

slide-41
SLIDE 41

Can We Vectorize It?

dc[k] = dc[k-1] + tpdd[k-1];

slide-42
SLIDE 42

Can We Vectorize It?

t = dc[k-1] + tpdd[k-1];
dc[k] = t;

slide-45
SLIDE 45

Can We Vectorize It?

Iteration K:
t = dc[k-1] + tpdd[k-1];
dc[k] = t;

Iteration K+1:
t2 = dc[k] + tpdd[k];
dc[k+1] = t2;

slide-47
SLIDE 47

Can We Vectorize It?

dc[k] = dc[k-1] + tpdd[k-1];
if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
if (dc[k] < -INFTY) dc[k] = -INFTY;
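
The problem with the dc recurrence can be shown with a small experiment. The sketch below is illustrative only: it keeps just the first statement of the dc block, uses a hypothetical array length `N`, and emulates a naive 4-wide vector step in scalar code by performing all four loads before any store.

```c
#include <stddef.h>

enum { N = 9 }; /* hypothetical array length: elements 0..8 */

/* Scalar recurrence: iteration k reads what iteration k-1 just wrote. */
static void dc_scalar(int *dc, const int *tpdd) {
  for (size_t k = 1; k < N; k++) {
    dc[k] = dc[k - 1] + tpdd[k - 1];
  }
}

/* Naive 4-wide "vectorization": all four loads happen before any store,
   so lanes 1..3 read stale dc values. This is deliberately wrong. */
static void dc_naive_vector(int *dc, const int *tpdd) {
  for (size_t k = 1; k + 4 <= N; k += 4) {
    int lane[4];
    for (size_t j = 0; j < 4; j++) lane[j] = dc[k + j - 1] + tpdd[k + j - 1];
    for (size_t j = 0; j < 4; j++) dc[k + j] = lane[j];
  }
}

/* Returns 1 if both versions agree on every element, 0 otherwise. */
static int versions_agree(void) {
  int a[N] = {1, 0, 0, 0, 0, 0, 0, 0, 0};
  int v[N] = {1, 0, 0, 0, 0, 0, 0, 0, 0};
  int tpdd[N] = {1, 1, 1, 1, 1, 1, 1, 1, 1};
  dc_scalar(a, tpdd);
  dc_naive_vector(v, tpdd);
  for (size_t k = 0; k < N; k++)
    if (a[k] != v[k]) return 0;
  return 1;
}
```

Here `versions_agree()` returns 0: the scalar loop produces a running sum, while the naive vector step computes lanes 1 to 3 from values that have not been updated yet. That store-to-load dependence with distance one is what blocks vectorization of this block.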

slide-48
SLIDE 48

Case Study

for (k = 1; k <= M; k++) {
  mc[k] = mpp[k-1] + tpmm[k-1];
  if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
  mc[k] += ms[k];
  if (mc[k] < -INFTY) mc[k] = -INFTY;

  dc[k] = dc[k-1] + tpdd[k-1];
  if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  if (dc[k] < -INFTY) dc[k] = -INFTY;

  if (k < M) {
    ic[k] = mpp[k] + tpmi[k];
    if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
    ic[k] += is[k];
    if (ic[k] < -INFTY) ic[k] = -INFTY;
  }
}

dc block: Non-vectorizable

slide-49
SLIDE 49

Case Study

for (k = 1; k <= M; k++) {
  mc[k] = mpp[k-1] + tpmm[k-1];
  if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
  mc[k] += ms[k];
  if (mc[k] < -INFTY) mc[k] = -INFTY;

  dc[k] = dc[k-1] + tpdd[k-1];
  if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  if (dc[k] < -INFTY) dc[k] = -INFTY;

  if (k < M) {
    ic[k] = mpp[k] + tpmi[k];
    if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
    ic[k] += is[k];
    if (ic[k] < -INFTY) ic[k] = -INFTY;
  }
}

mc block: Vectorizable
dc block: Non-vectorizable

slide-50
SLIDE 50

Case Study

if (k < M) {
  ic[k] = mpp[k] + tpmi[k];
  if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
  ic[k] += is[k];
  if (ic[k] < -INFTY) ic[k] = -INFTY;
}

slide-52
SLIDE 52

Case Study

for (k = 1; k <= M; k++) {
  mc[k] = mpp[k-1] + tpmm[k-1];
  if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
  mc[k] += ms[k];
  if (mc[k] < -INFTY) mc[k] = -INFTY;

  dc[k] = dc[k-1] + tpdd[k-1];
  if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  if (dc[k] < -INFTY) dc[k] = -INFTY;

  if (k < M) {
    ic[k] = mpp[k] + tpmi[k];
    if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
    ic[k] += is[k];
    if (ic[k] < -INFTY) ic[k] = -INFTY;
  }
}

mc block: Vectorizable
dc block: Non-vectorizable
ic block: Non-vectorizable

slide-53
SLIDE 53

Case Study

for (k = 1; k <= M; k++) {
  mc[k] = mpp[k-1] + tpmm[k-1];
  if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
  mc[k] += ms[k];
  if (mc[k] < -INFTY) mc[k] = -INFTY;
}
for (k = 1; k <= M; k++) {
  dc[k] = dc[k-1] + tpdd[k-1];
  if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  if (dc[k] < -INFTY) dc[k] = -INFTY;

  if (k < M) {
    ic[k] = mpp[k] + tpmi[k];
    if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
    ic[k] += is[k];
    if (ic[k] < -INFTY) ic[k] = -INFTY;
  }
}

First loop (mc block): Vectorizable
Second loop (dc and ic blocks): Non-vectorizable

slide-54
SLIDE 54

Plan

  • Distribute loop
  • Let LoopVectorizer vectorize top loop

=> Partial Loop Vectorization
slide-55
SLIDE 55

Loop Distribution

slide-56
SLIDE 56

Pros and Cons

+ Partial loop vectorization
+ Improve memory access pattern:
  • Cache associativity
  • Number of HW prefetcher streams
+ Reduce spilling
– Loop overhead
– Instructions duplicated across new loops
– Instruction-level parallelism
slide-57
SLIDE 57

Legality

for (k = 1; k <= M; k++) {
  mc[k] = mpp[k-1] + tpmm[k-1];
  if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
  mc[k] += ms[k];
  if (mc[k] < -INFTY) mc[k] = -INFTY;
}
for (k = 1; k <= M; k++) {
  dc[k] = dc[k-1] + tpdd[k-1];
  if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  if (dc[k] < -INFTY) dc[k] = -INFTY;

  if (k < M) {
    ic[k] = mpp[k] + tpmi[k];
    if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
    ic[k] += is[k];
    if (ic[k] < -INFTY) ic[k] = -INFTY;
  }
}

Loop Dependence Analysis
Run-time Alias Checks
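
The combination of distribution and run-time alias checks can be sketched at source level. The code below is an illustration under assumptions, not what the pass emits: the loop is a simplified version of the case study (only `mc`, `dc`, `mpp`, `tpmm`, and `tpdd`), every array is assumed to be accessed within indices 0..M, and the helper names `overlap`, `fused`, and `versioned` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the n-element int ranges starting at p and q share any byte. */
static int overlap(const int *p, const int *q, size_t n) {
  uintptr_t a = (uintptr_t)p, b = (uintptr_t)q;
  uintptr_t len = (uintptr_t)(n * sizeof(int));
  return a < b + len && b < a + len;
}

/* Original fused loop (simplified from the case study). */
static void fused(int *mc, int *dc, const int *mpp, const int *tpmm,
                  const int *tpdd, size_t M) {
  for (size_t k = 1; k <= M; k++) {
    mc[k] = mpp[k - 1] + tpmm[k - 1];
    dc[k] = dc[k - 1] + tpdd[k - 1];
    if (mc[k - 1] > dc[k]) dc[k] = mc[k - 1];
  }
}

/* Versioned entry point: returns 1 if the distributed version ran,
   0 if the checks failed and the original loop ran instead. */
static int versioned(int *mc, int *dc, const int *mpp, const int *tpmm,
                     const int *tpdd, size_t M) {
  size_t n = M + 1; /* assumption: each array is accessed within 0..M */

  /* Run-time alias checks: the stores of one partition must not touch
     memory accessed by the other partition. */
  if (overlap(mc, dc, n) || overlap(mc, tpdd, n) ||
      overlap(dc, mpp, n) || overlap(dc, tpmm, n)) {
    fused(mc, dc, mpp, tpmm, tpdd, M);
    return 0;
  }

  /* Partition 1: no loop-carried dependence, a vectorization candidate. */
  for (size_t k = 1; k <= M; k++)
    mc[k] = mpp[k - 1] + tpmm[k - 1];

  /* Partition 2: the dc recurrence stays scalar. */
  for (size_t k = 1; k <= M; k++) {
    dc[k] = dc[k - 1] + tpdd[k - 1];
    if (mc[k - 1] > dc[k]) dc[k] = mc[k - 1];
  }
  return 1;
}
```

Keeping the original loop as the fallback is the design choice that makes the transformation safe when the arrays might alias: the distributed path only runs when the checks prove the two partitions do not interfere.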

slide-58
SLIDE 58

Loop Access Analysis

  • Born from the Loop Vectorizer
  • Generalized as new analysis pass
  • Computed on-demand and cached
  • New Loop Versioning utility
slide-59
SLIDE 59

Algorithm

  • Light-weight
  • Uses only LoopAccessAnalysis
  • No Program Dependence Graph
  • No Control Dependence
  • Inner loops only
  • Different from textbook algorithm
  • No reordering of memory operations
slide-60
SLIDE 60

Algorithm

[Figure: instruction nodes mul 1, st 2, ld 3, st 4, ld 5, add 6, st 7, ld 8, mul 9, st 10]

slide-61
SLIDE 61

Algorithm

[Figure: graph over the instruction nodes mul 1, st 2, ld 3, st 4, ld 5, add 6, st 7, ld 8, mul 9, st 10]

slide-71
SLIDE 71

Algorithm

[Figure: graph over the instruction nodes mul 1, st 2, ld 3, st 4, ld 5, add 6, st 7, ld 8, mul 9, st 10, plus one extra node labeled "dup of mul 1"]

slide-76
SLIDE 76

Algorithm

[Figure: graph over the instruction nodes mul 1, st 2, ld 3, st 4, ld 5, add 6, st 7, ld 8, mul 9, st 10, plus two extra nodes labeled "dup of mul 1" and "dup of ld 3"]

slide-77
SLIDE 77

Algorithm

[Figure: graph over the instruction nodes mul 1, st 2, ld 3, st 4, ld 5, add 6, st 7, ld 8, mul 9, st 10, plus one extra node labeled "dup of mul 1"]

slide-80
SLIDE 80

Recap

  • Distributed loop
  • Versioned with run-time alias checks
  • Top loop vectorized
slide-81
SLIDE 81

Case Study

for (k = 1; k <= M; k++) {
  mc[k] = mpp[k-1] + tpmm[k-1];
  if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
  mc[k] += ms[k];
  if (mc[k] < -INFTY) mc[k] = -INFTY;
}
for (k = 1; k <= M; k++) {
  dc[k] = dc[k-1] + tpdd[k-1];
  if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  if (dc[k] < -INFTY) dc[k] = -INFTY;

  if (k < M) {
    ic[k] = mpp[k] + tpmi[k];
    if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
    ic[k] += is[k];
    if (ic[k] < -INFTY) ic[k] = -INFTY;
  }
}

First loop: Vectorized

slide-82
SLIDE 82

Case Study

for (k = 1; k <= M; k++) {
  dc[k] = dc[k-1] + tpdd[k-1];
  if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  if (dc[k] < -INFTY) dc[k] = -INFTY;

  if (k < M) {
    ic[k] = mpp[k] + tpmi[k];
    if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
    ic[k] += is[k];
    if (ic[k] < -INFTY) ic[k] = -INFTY;
  }
}

slide-83
SLIDE 83

Case Study

dc[k] = dc[k-1] + tpdd[k-1];
if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
if (dc[k] < -INFTY) dc[k] = -INFTY;

slide-84
SLIDE 84

Case Study

[Figure: data-flow graph of the dc block with nodes Load, Load, Add, Load, Load, Add, Cmp, Csel, Cmp, Csel, Store; an arrow leads from DC[k] to DC[k-1]]

slide-85
SLIDE 85

Case Study

[Figure: data-flow graph of the dc block with nodes Load, Load, Add, Load, Load, Add, Cmp, Csel, Cmp, Csel, Store; annotation: HW st -> ld forwarding]

slide-86
SLIDE 86

Case Study

[Figure: data-flow graph of the dc block with nodes Load, Load, Add, Load, Load, Add, Cmp, Csel, Cmp, Csel, Store; annotations: HW st -> ld forwarding, SW st -> ld forwarding]

slide-87
SLIDE 87

Case Study

dc[k] = dc[k-1] + tpdd[k-1];
if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
if (dc[k] < -INFTY) dc[k] = -INFTY;

slide-88
SLIDE 88

Loop Load Elimination

slide-89
SLIDE 89

Algorithm

  1. Find loop-carried dependences with iteration distance of one
  2. Between store -> load?
  3. No (may-)intervening store
  4. Propagate value stored to uses of load
slide-90
SLIDE 90

Algorithm

for (k = 1; k <= M; k++) {
  dc[k] = dc[k-1] + tpdd[k-1];
  if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  if (dc[k] < -INFTY) dc[k] = -INFTY;

  if (k < M) {
    ic[k] = mpp[k] + tpmi[k];
    if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
    ic[k] += is[k];
    if (ic[k] < -INFTY) ic[k] = -INFTY;
  }
}

slide-91
SLIDE 91

Algorithm

for (k = 1; k <= M; k++) {
  dc[k] = T = dc[k-1] + tpdd[k-1];
  if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = T = sc;
  if (dc[k] < -INFTY) dc[k] = T = -INFTY;

  if (k < M) {
    ic[k] = mpp[k] + tpmi[k];
    if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
    ic[k] += is[k];
    if (ic[k] < -INFTY) ic[k] = -INFTY;
  }
}

slide-92
SLIDE 92

Algorithm

for (k = 1; k <= M; k++) {
  dc[k] = T = T + tpdd[k-1];
  if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = T = sc;
  if (dc[k] < -INFTY) dc[k] = T = -INFTY;

  if (k < M) {
    ic[k] = mpp[k] + tpmi[k];
    if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
    ic[k] += is[k];
    if (ic[k] < -INFTY) ic[k] = -INFTY;
  }
}

slide-93
SLIDE 93

Algorithm

T = dc[0];
for (k = 1; k <= M; k++) {
  dc[k] = T = T + tpdd[k-1];
  if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = T = sc;
  if (dc[k] < -INFTY) dc[k] = T = -INFTY;

  if (k < M) {
    ic[k] = mpp[k] + tpmi[k];
    if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
    ic[k] += is[k];
    if (ic[k] < -INFTY) ic[k] = -INFTY;
  }
}

slide-94
SLIDE 94

Algorithm

T = dc[0];
for (k = 1; k <= M; k++) {
  dc[k] = T = T + tpdd[k-1];
  if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = T = sc;
  if (dc[k] < -INFTY) dc[k] = T = -INFTY;

  if (k < M) {
    ic[k] = mpp[k] + tpmi[k];
    if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
    ic[k] += is[k];
    if (ic[k] < -INFTY) ic[k] = -INFTY;
  }
}
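
The effect of the transformation on the dc block can be checked with a before/after pair. This is a hedged sketch rather than the pass itself: it keeps only the dc statements, assumes `dc` does not alias `mc`, `tpdd`, or `tpmd` (the condition the run-time checks would establish), assumes `dc[0]` is readable even when `M` is 0, and uses an assumed value for `INFTY`. The function names are hypothetical.

```c
#include <stddef.h>

#define INFTY 987654321 /* assumed sentinel value for this sketch */

/* Before: every iteration reloads dc[k-1], which the previous one stored. */
static void dc_before(int *dc, const int *mc, const int *tpdd,
                      const int *tpmd, size_t M) {
  int sc;
  for (size_t k = 1; k <= M; k++) {
    dc[k] = dc[k - 1] + tpdd[k - 1];
    if ((sc = mc[k - 1] + tpmd[k - 1]) > dc[k]) dc[k] = sc;
    if (dc[k] < -INFTY) dc[k] = -INFTY;
  }
}

/* After: T carries the last stored value into the next iteration,
   so the load of dc[k-1] disappears from the loop body. */
static void dc_after(int *dc, const int *mc, const int *tpdd,
                     const int *tpmd, size_t M) {
  int sc;
  int T = dc[0]; /* initial value, loaded once before the loop */
  for (size_t k = 1; k <= M; k++) {
    dc[k] = T = T + tpdd[k - 1];
    if ((sc = mc[k - 1] + tpmd[k - 1]) > dc[k]) dc[k] = T = sc;
    if (dc[k] < -INFTY) dc[k] = T = -INFTY;
  }
}
```

Both functions should leave identical contents in `dc[0..M]` for non-aliasing inputs. Every store to `dc[k]` also updates `T`, which is what keeps `T` equal to the value the next iteration would otherwise have loaded.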

slide-95
SLIDE 95

Loop Load Elimination

  • Simple and cheap using Loop Access Analysis
  • With Loop Versioning can optimize more loops
  • GVN Load-PRE can be simplified to not worry about loop cases

slide-96
SLIDE 96

Recap

  • Distributed loop into two loops
  • Versioned with run-time alias checks
  • Vectorized top loop
  • Store-to-load forwarding in bottom loop
  • Versioned with run-time alias checks
slide-97
SLIDE 97

Results

  • 20-30% gain on 456.hmmer on ARM64 and x86
  • Loop Access Analysis pass
  • Loop Versioning utility
  • Loop Distribution pass
  • Loop Load Elimination pass
slide-98
SLIDE 98

Future Work

  • Commit Loop Load Elimination
  • Tune Loop Distribution and turn it on by default
  • Loop Distribution with Program Dependence Graph

slide-99
SLIDE 99

Acknowledgements

  • Chandler Carruth
  • Hal Finkel
  • Arnold Schwaighofer
  • Daniel Berlin
slide-100
SLIDE 100

Q&A

slide-101
SLIDE 101

Back-up

slide-102
SLIDE 102

Loop Vectorizer

InnerLoopVectorizer::canVectorizeMemory()

[Figure: components used by the Loop Vectorizer: Access Analysis (Memory Accesses, processMemAccesses(), canCheckPtrAtRT()), Memory Dependence Checker (areDepsSafe(CheckDeps, DepCands)), Runtime Pointer Check (Pointers)]

slide-103
SLIDE 103

Loop Access Analysis

LoopAccessInfo

[Figure: components inside LoopAccessInfo: Access Analysis (Memory Accesses, processMemAccesses(), canCheckPtrAtRT()), Memory Dependence Checker (areDepsSafe(CheckDeps, DepCands)), Runtime Pointer Check (Pointers); interface: getChecks(), getDependences(), canVectorizeMemory()]

slide-104
SLIDE 104

Loop Access Analysis

LoopAccessAnalysis

[Figure: LoopAccessAnalysis holding several LoopAccessInfo objects, each with the components and interface listed for LoopAccessInfo; lookup via getInfo(Loop)]

slide-105
SLIDE 105

Loop Versioning

[Figure: boxes labeled Alias Checks, Loop, and Original Loop; the checks come from LoopAccessInfo via getChecks()]

slide-106
SLIDE 106

Loop Versioning

  • Users:
  • Loop Distribution
  • Loop Load Elimination
  • WIP LICM-based Loop Versioning
  • Future work:
  • Run-time trip count check
  • Merge versions into a slow path and a fast path
slide-107
SLIDE 107

Loop Versioning

Loop

slide-108
SLIDE 108

Loop Versioning

Loop 2 Loop 1

slide-109
SLIDE 109

Loop Versioning

Loop 2 Distribution Checks Undistributed Loop Loop 1

slide-110
SLIDE 110

Loop Versioning

Vectorized Loop 2 Distribution Checks Undistributed Loop Vectorization Check Scalar Loop 2 Loop 1

slide-111
SLIDE 111

Loop Versioning

Vectorized Loop 2 Distribution Checks Undistributed Loop Vectorization Check Scalar Loop 2 Loop 1

+ Metadata

slide-113
SLIDE 113

Loop Versioning

Vectorized Loop 2 Distribution Checks Undistributed Loop Vectorization Check Loop 1

+ Metadata

slide-115
SLIDE 115

Loop Versioning

Vectorized Loop 2 Distribution + Vectorization Checks Undistributed Loop Loop 1

+ Metadata