In Search of Near-Optimal Optimization Phase Orderings


SLIDE 1

Florida State University

Languages, Compilers, and Tools for Embedded Systems

In Search of Near-Optimal Optimization Phase Orderings

Prasad A. Kulkarni, David B. Whalley, Gary S. Tyson, Jack W. Davidson

SLIDE 2

Optimization Phase Ordering

  • Optimizing compilers apply several optimization phases to improve the performance of applications.
  • Optimization phases interact with each other.
  • Determining the order of applying optimization phases to obtain the best performance has been a long-standing problem in compilers.

SLIDE 3

Exhaustive Phase Order Evaluation

  • Determine the performance of all possible orderings of optimization phases.
  • Exhaustive phase order evaluation involves
      • generating all distinct function instances that can be produced by changing optimization phase orderings (CGO ’06)
      • determining the dynamic performance of each distinct function instance for each function

SLIDE 4

Outline

  • Experimental framework
  • Exhaustive phase order space enumeration
  • Accurately determining dynamic performance
  • Correlation between dynamic frequency measures and processor cycles

  • Genetic algorithm performance results
  • Future work and conclusions
SLIDE 6

Experimental Framework

  • We used the VPO compilation system
      • established compiler framework, started development in 1988
      • comparable performance to gcc -O2
  • VPO performs all transformations on a single representation (RTLs), so it is possible to perform most phases in an arbitrary order.
  • Experiments use all 15 available optimization phases in VPO.
  • Target architecture was the StrongARM SA-100 processor.

SLIDE 7

Disclaimers

  • Instruction scheduling and predication not included.
  • VPO does not contain optimization phases normally associated with compiler front ends
      • no memory hierarchy optimizations
      • no inlining or other interprocedural optimizations
  • Did not vary how phases are applied.
  • Did not include optimizations that require profile data.

SLIDE 8

Benchmarks

  • Used one program from each of the six MiBench categories.
  • Total of 111 functions.

Category   Program        Description
auto       bitcount       test processor bit manipulation abilities
network    dijkstra       Dijkstra’s shortest path algorithm
telecomm   fft            fast Fourier transform
consumer   jpeg           image compression / decompression
security   sha            secure hash algorithm
office     stringsearch   searches for given words in phrases

SLIDE 10

Exhaustive Phase Order Enumeration

  • Exhaustive enumeration is difficult
      • compilers typically contain many different optimization phases
      • optimizations may be successful multiple times for each function / program
  • On average, we would need to evaluate 15^12 different phase orders per function.
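As a rough illustration of why enumerating every attempted sequence is impractical, the sketch below counts naive phase orderings. The 15 phases come from the framework slide; the sequence length of 12 is an assumption, reading the slide's figure as 15 raised to the 12th power. The function name is hypothetical.

```python
def naive_orderings(num_phases: int, sequence_length: int) -> int:
    """Count attempted sequences when any phase may occupy any position."""
    return num_phases ** sequence_length


# 15 phases (from the framework slide); a length of 12 is an assumed
# average sequence length, not a value confirmed elsewhere in the deck.
total = naive_orderings(15, 12)
print(f"{total:,} attempted phase orders per function")
```

Even at a million evaluations per second, a count of this size would take years per function, which is why the following slides prune the space instead of walking it directly.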

SLIDE 11

Naive Optimization Phase Order Space

  • All combinations of optimization phase sequences are attempted.

[Figure: tree of attempted phase sequences over phases a, b, c, d, shown for levels L0 to L2]

SLIDE 12

Eliminating Dormant Phases

  • Get feedback from the compiler indicating if any transformations were successfully applied in a phase.

[Figure: phase order tree over phases a, b, c, d, levels L0 to L2, with dormant phases removed]

SLIDE 13

Identical / Equivalent Function Instances

  • Some optimization phases are independent
      • example: branch chaining and register allocation
  • Different phase sequences can produce the same code.
  • Two function instances can be identical except for register numbers or basic block numbers used.

SLIDE 14

Resulting Search Space

  • Merging equivalent function instances transforms the tree to a DAG.

[Figure: DAG of distinct function instances over phases a, b, c, d, levels L0 to L2]
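The pruning steps on the last few slides (skip dormant phases, merge identical instances) can be sketched as a breadth-first search over function instances. This is a toy model, not VPO's implementation: a function instance is represented as a frozenset of pending optimization opportunities, and `remove`, `enumerate_instances`, and the three phases are hypothetical names chosen for illustration.

```python
from collections import deque


def enumerate_instances(start, phases):
    """Breadth-first enumeration of distinct function instances.

    Returns (edges, leaves): edges maps each instance to the list of
    (phase_id, successor) pairs for its active phases; leaves are the
    instances on which every phase is dormant.
    """
    edges = {start: []}
    queue = deque([start])
    while queue:
        inst = queue.popleft()
        for phase_id, phase in phases.items():
            nxt = phase(inst)
            if nxt == inst:        # dormant phase: nothing was transformed
                continue
            edges[inst].append((phase_id, nxt))
            if nxt not in edges:   # unseen instance becomes a new DAG node
                edges[nxt] = []
                queue.append(nxt)
    leaves = [inst for inst, out in edges.items() if not out]
    return edges, leaves


def remove(tag, requires_absent=None):
    """Toy phase: removes `tag`, optionally only once another tag is gone."""
    def phase(inst):
        enabled = requires_absent is None or requires_absent not in inst
        return inst - {tag} if tag in inst and enabled else inst
    return phase


# Phase c is only enabled after phase a has been applied.
phases = {"a": remove("x"), "b": remove("y"), "c": remove("z", requires_absent="x")}
edges, leaves = enumerate_instances(frozenset("xyz"), phases)
print(len(edges), "distinct instances,", len(leaves), "leaf")
```

In this toy space the instance containing only `z` is reached both by applying a then b and by applying b then a, so it is stored once and the tree collapses into a DAG.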

SLIDE 16

Finding the Best Dynamic Function Instance

  • On average, there were over 25,000 distinct function instances for each studied function.
  • Executing all distinct function instances would be too time consuming.
  • Many embedded development environments use simulation instead of direct execution.
  • Use data obtained from a few executions to estimate the performance of all remaining function instances.

SLIDE 17

Quickly Obtaining Dynamic Frequency Measures

  • Two different instances of the same function having identical control-flow graphs will execute each block the same number of times.
  • Statically estimate the number of cycles required to execute each basic block.
  • dynamic frequency measure = Σ (static cycles * block frequency)
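A minimal sketch of the measure defined above, assuming block execution counts have been collected once for a given control-flow graph and are reused for every instance that shares it. The block names and all numbers are hypothetical.

```python
def dynamic_frequency_measure(static_cycles, block_frequency):
    """Sum of (static cycle estimate * execution count) over all basic blocks."""
    return sum(static_cycles[b] * block_frequency[b] for b in static_cycles)


# Execution counts measured once for this control flow (hypothetical values).
block_frequency = {"entry": 1, "loop": 1000, "exit": 1}

# Static cycle estimates for two instances that share the control flow.
instance_a = {"entry": 5, "loop": 4, "exit": 2}
instance_b = {"entry": 6, "loop": 3, "exit": 2}

measure_a = dynamic_frequency_measure(instance_a, block_frequency)
measure_b = dynamic_frequency_measure(instance_b, block_frequency)
print(measure_a, measure_b)
```

Only one execution is needed per distinct control flow; every other instance with the same graph is scored from its static estimates alone.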

SLIDE 18

Dynamic Frequency Statistics

                                                % from optimal
Function          Insts.      CF     Leaf      Batch     Worst
AR_btbl...(b)         40      88        2       0.00      4.55
BW_btbl...(b)         56     198        4       0.00      4.00
bit_count.(b)        155      72        4       1.40      1.40
bit_shifter(b)       147      82        3       0.00      3.96
bitcount(b)           86      63       10       2.40      4.33
main(b)            92834      45      171       8.33    233.31
ntbl_bitcnt(b)       253      50       20      18.69     18.69
ntbl_bit…(b)          48      33        8       4.09      4.68
dequeue(d)           102      59       14       0.00     12.00
dijkstra(d)        86370      44     1168       0.04     51.12
enqueue(d)           570      40        9       0.20      4.49
main(d)             8566      30      143       4.29     75.32
….                  ....    ....     ....       ….        ….
average          25362.6    27.5    182.8       4.60     47.64

SLIDE 20

Cycle level Simulation

  • SimpleScalar toolset includes several different simulators
      • sim-uop - functional simulator, relatively fast, provides only dynamic instruction counts
      • sim-outorder - cycle-accurate simulator, much slower, also models the microarchitecture
  • Extended sim-outorder to switch to a functional mode when not in the function of interest.

SLIDE 21

Complete Function Correlation

SLIDE 22

Complete Function Correlation

SLIDE 23

Leaf Function Correlation

  • Leaf function instances are generated from optimization sequences when no additional phases can be successfully applied.
  • On average there are only about 183 leaf function instances, as compared to over 25,000 total instances.
  • Leaf function instances represent possible code that can be generated from an iterative compiler when the phase order is varied.

SLIDE 24

Leaf versus Nonleaf Performance

SLIDE 25

Leaf Function Correlation Statistics

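  • Pearson’s correlation coefficient

$$P_{corr} = \frac{\sum xy - (\sum x \sum y)/n}{\sqrt{\left(\sum x^2 - (\sum x)^2/n\right)\left(\sum y^2 - (\sum y)^2/n\right)}}$$

$$L_{corr} = \frac{\text{cycle count for best leaf}}{\text{cycle count for leaf with best dynamic frequency count}}$$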
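Both statistics can be computed directly from paired measurements of the leaf instances. The sketch below follows the two formulas on this slide, taking x as the dynamic frequency count and y as the simulated cycle count of each leaf; that pairing, the function names, and the sample numbers are assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient in the computational form on the slide."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    numerator = sum_xy - (sum_x * sum_y) / n
    denominator = sqrt((sum_xx - sum_x ** 2 / n) * (sum_yy - sum_y ** 2 / n))
    return numerator / denominator


def lcorr(freq_counts, cycle_counts):
    """Cycles of the best leaf divided by cycles of the leaf that the
    dynamic frequency count would have picked."""
    picked = min(range(len(freq_counts)), key=freq_counts.__getitem__)
    return min(cycle_counts) / cycle_counts[picked]


# Hypothetical measurements for four leaf instances of one function.
freq_counts = [100, 120, 90, 150]
cycle_counts = [210, 250, 200, 310]
print(round(pearson(freq_counts, cycle_counts), 3), lcorr(freq_counts, cycle_counts))
```

A ratio of 1.0 means the leaf chosen by the cheap estimate is also the leaf with the fewest simulated cycles; values below 1.0 measure how much performance the estimate gives up.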
SLIDE 26

Leaf Function Correlation Statistics (cont…)

                           Lcorr 0%           Lcorr 1%
Function          Pcorr    Ratio   Leaves     Ratio   Leaves
AR_btbl...(b)     1.00     1.00    1          1.00    1
BW_btbl...(b)     1.00     1.00    2          1.00    2
bit_count.(b)     1.00     1.00    2          1.00    2
bit_shifter(b)    1.00     1.00    2          1.00    2
bitcount(b)       0.89     0.92    1          0.92    1
main(b)           1.00     1.00    6          1.00    23
ntbl_bitcnt(b)    1.00     0.95    2          0.95    2
ntbl_bit…(b)      0.99     1.00    2          1.00    2
dequeue(d)        0.99     1.00    6          1.00    6
dijkstra(d)       1.00     0.97    4          1.00    269
enqueue(d)        1.00     1.00    2          1.00    4
main(d)           0.98     1.00    4          1.00    4
….                ....     ….      ….         ….      ….
average           0.96     0.98    4.38       0.996   21

SLIDE 28

Genetic Algorithm Properties

  • Genes are phases, chromosomes are sequences.
  • There are 20 chromosomes per generation.
  • Crossover is used to replace 4 poorly performing chromosomes per generation.
  • All except the best sequence and the 4 newly generated sequences are subject to mutation.
  • We modified our GA to use phase enabling and disabling relationships during the mutation phase of the GA.
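One generation with these properties might look like the sketch below. The population size (20), the number of replaced chromosomes (4), and the mutation exemptions come from the slide; the sequence length, mutation rate, single-point crossover, parent selection from the better half, and the toy fitness function are all assumptions. The phase enabling and disabling bias of the modified GA is not modeled.

```python
import random

PHASES = "bcdghijklnoqrsu"  # the 15 one-letter VPO phase IDs
POP_SIZE = 20               # chromosomes per generation (from the slide)
NUM_REPLACED = 4            # poor performers replaced by crossover (from the slide)
SEQ_LEN = 12                # assumed sequence length
MUTATION_RATE = 0.05        # assumed per-gene mutation probability


def next_generation(population, fitness, rng):
    """Produce one new generation; lower fitness is better."""
    ranked = sorted(population, key=fitness)
    survivors = ranked[:POP_SIZE - NUM_REPLACED]

    # Crossover: build replacements for the worst chromosomes.
    children = []
    for _ in range(NUM_REPLACED):
        p1, p2 = rng.sample(survivors[:POP_SIZE // 2], 2)
        cut = rng.randrange(1, SEQ_LEN)
        children.append(p1[:cut] + p2[cut:])

    # Mutation: skip the best sequence and the newly generated children.
    mutated = [survivors[0]]
    for chrom in survivors[1:]:
        genes = [rng.choice(PHASES) if rng.random() < MUTATION_RATE else g
                 for g in chrom]
        mutated.append("".join(genes))
    return mutated + children


def toy_fitness(chrom, target="sckshlqscdbi"):
    """Stand-in for measured performance: distance to a made-up ordering."""
    return sum(a != b for a, b in zip(chrom, target))


rng = random.Random(0)
population = ["".join(rng.choice(PHASES) for _ in range(SEQ_LEN))
              for _ in range(POP_SIZE)]
initial_best = min(map(toy_fitness, population))
for _ in range(50):
    population = next_generation(population, toy_fitness, rng)
final_best = min(map(toy_fitness, population))
print(initial_best, final_best)
```

In the study's setting the fitness would be the dynamic frequency measure of the code produced by each sequence, and the exhaustively enumerated space tells whether the GA's result is actually optimal.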
SLIDE 29

GA Evaluation Results

                  Original GA       Modified GA
Function          Opt     Diff      Opt     Diff
AR_btbl...(b)     Y       0.00      Y       0.00
BW_btbl...(b)     Y       0.00      Y       0.00
bit_count.(b)     Y       0.00      Y       0.00
bit_shifter(b)    Y       0.00      Y       0.00
bitcount(b)       Y       0.00      Y       0.00
main(b)           Y       0.00      Y       0.00
ntbl_bitcnt(b)    Y       0.00      N       6.55
ntbl_bit…(b)      Y       0.00      Y       0.00
dequeue(d)        Y       0.00      Y       0.00
dijkstra(d)       Y       0.00      Y       0.00
enqueue(d)        Y       0.00      Y       0.00
main(d)           Y       0.00      N       3.96
....              ….      ….        ….      ….
average           0.97    0.02      0.87    0.51

SLIDE 31

Future Work

  • Find more equivalently performing function instances to further reduce the phase order space.
  • Study effect of limiting scope of phases so that the most deeply nested loops of a function are optimized first.
  • Improve conventional compilation speed and performance.

SLIDE 32

Conclusions

  • We demonstrated how a near-optimal phase ordering can be obtained in a short period of time.
  • We showed that our measure of dynamic frequency counts correlates extremely well with simulator cycles.
  • We also showed how the enumerated space can be used to evaluate the effectiveness of heuristic phase order search algorithms.

SLIDE 33

Optimization Space Properties

  • The phase ordering problem can be made more manageable by exploiting certain properties of the optimization search space
      • optimization phases might not apply any transformations
      • many optimization phases are independent
  • Thus, many different orderings of optimization phases produce the same code.
SLIDE 34

Re-stating the Phase Ordering Problem

  • Rather than considering all attempted phase sequences, the phase ordering problem can be addressed by enumerating all distinct function instances that can be produced by combinations of optimization phases.
  • We were able to exhaustively enumerate 109 out of 111 functions, in a few minutes for most.

SLIDE 35

Detecting Identical Function Instances

  • Some optimization phases are independent
      • example: branch chaining & register allocation
  • Different phase sequences can produce the same code

r[2] = 1;
r[3] = r[4] + r[2];
    ⇒ instruction selection
r[3] = r[4] + 1;

r[2] = 1;
r[3] = r[4] + r[2];
    ⇒ constant propagation
r[2] = 1;
r[3] = r[4] + 1;
    ⇒ dead assignment elimination
r[3] = r[4] + 1;

SLIDE 36

VPO Optimization Phases

  • Register assignment (assigning pseudo registers to hardware registers) is implicitly performed before the first phase that requires it.
  • Some phases are applied after the sequence
      • fixing the entry and exit of the function to manage the run-time stack
      • exploiting predication on the ARM
      • performing instruction scheduling
SLIDE 37

VPO Optimization Phases

ID  Optimization Phase        ID  Optimization Phase
b   branch chaining           l   loop transformations
c   common subexpr. elim.     n   code abstraction
d   remv. unreachable code    o   eval. order determin.
g   loop unrolling            q   strength reduction
h   dead assignment elim.     r   reverse branches
i   block reordering          s   instruction selection
j   minimize loop jumps       u   remv. useless jumps
k   register allocation

SLIDE 38

Eliminating Consecutively Applied Phases

  • A phase just applied in our compiler cannot be immediately active again.

[Figure: phase order tree over phases a, b, c, d, levels L0 to L2, with consecutively repeated phases removed]

SLIDE 39

Detecting Equivalent Function Instances

Source Code

    sum = 0;
    for (i = 0; i < 1000; i++)
        sum += a[i];

Register Allocation before Code Motion

    r[10]=0;
    r[12]=HI[a];
    r[12]=r[12]+LO[a];
    r[1]=r[12];
    r[9]=4000+r[12];
L3  r[8]=M[r[1]];
    r[10]=r[10]+r[8];
    r[1]=r[1]+4;
    IC=r[1]?r[9];
    PC=IC<0,L3;

Code Motion before Register Allocation

    r[11]=0;
    r[10]=HI[a];
    r[10]=r[10]+LO[a];
    r[1]=r[10];
    r[9]=4000+r[10];
L5  r[8]=M[r[1]];
    r[11]=r[11]+r[8];
    r[1]=r[1]+4;
    IC=r[1]?r[9];
    PC=IC<0,L5;

After Mapping Registers

    r[32]=0;
    r[33]=HI[a];
    r[33]=r[33]+LO[a];
    r[34]=r[33];
    r[35]=4000+r[33];
L01 r[36]=M[r[34]];
    r[32]=r[32]+r[36];
    r[34]=r[34]+4;
    IC=r[34]?r[35];
    PC=IC<0,L01;
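The mapping shown on this slide can be reproduced by renaming registers and labels in order of first appearance. The sketch below is one way to do that with regular expressions; the starting number 32 and the `L01` label format are taken from the slide's mapped listing, while `canonicalize` and its regex approach are assumptions and not necessarily how VPO performs the remapping.

```python
import re


def canonicalize(rtls: str) -> str:
    """Rename registers and labels in order of first appearance."""
    reg_map, label_map = {}, {}

    def rename_reg(match):
        reg = match.group(0)
        if reg not in reg_map:
            reg_map[reg] = f"r[{32 + len(reg_map)}]"
        return reg_map[reg]

    def rename_label(match):
        label = match.group(0)
        if label not in label_map:
            label_map[label] = f"L{len(label_map) + 1:02d}"
        return label_map[label]

    # re.sub scans left to right once, so renamed text is never re-matched.
    renamed = re.sub(r"r\[\d+\]", rename_reg, rtls)
    # L followed by digits is a label; LO[a] does not match this pattern.
    return re.sub(r"\bL\d+\b", rename_label, renamed)


alloc_then_motion = (
    "r[10]=0; r[12]=HI[a]; r[12]=r[12]+LO[a]; r[1]=r[12]; r[9]=4000+r[12]; "
    "L3 r[8]=M[r[1]]; r[10]=r[10]+r[8]; r[1]=r[1]+4; IC=r[1]?r[9]; PC=IC<0,L3;"
)
motion_then_alloc = (
    "r[11]=0; r[10]=HI[a]; r[10]=r[10]+LO[a]; r[1]=r[10]; r[9]=4000+r[10]; "
    "L5 r[8]=M[r[1]]; r[11]=r[11]+r[8]; r[1]=r[1]+4; IC=r[1]?r[9]; PC=IC<0,L5;"
)
print(canonicalize(alloc_then_motion) == canonicalize(motion_then_alloc))
```

Once both listings are in canonical form, a plain string comparison (or a hash of the canonical text) is enough to recognize them as the same function instance.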

SLIDE 40

Case when No Leaf is Optimal