
SLIDE 1

An SSA-based Algorithm for Optimal Speculative Code Motion under an Execution Profile

Hucheng Zhou Tsinghua University June 2011

Joint work with: Wenguang Chen (Tsinghua University), Fred Chow (ICube Technology Corp.)

SLIDE 2

June 2011 MC-SSAPRE PLDI 2

Partial Redundancy Elimination (PRE)

Eliminates expressions redundant on some (not necessarily all) paths
One of the most important and widely applied target-independent global optimizations
Subsumes global common subexpression elimination and loop-invariant code motion

[Figure: CFG (B1–B5) before and after PRE. Before: a+b is computed in B2, B4, and B5. After: t=a+b is computed in B1 and B2, and B4 and B5 use t.]
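As a minimal sketch of the transformation, assume a diamond-shaped CFG in which one branch already computes a+b and the code after the join recomputes it. The function names and the branch condition are illustrative and not taken from the slides.

```python
def before_pre(cond, a, b):
    # One branch computes a+b, the other does not.
    if cond:
        x = a + b          # a+b is available on this path only
    else:
        x = 0
    # Partially redundant: recomputed even when the branch above already did it.
    return x + (a + b)


def after_pre(cond, a, b):
    # PRE inserts t = a+b on the path that lacked it, so the later
    # occurrence becomes fully redundant and is replaced by t.
    if cond:
        t = a + b
        x = t
    else:
        t = a + b          # inserted computation
        x = 0
    return x + t
```

After the transformation every path evaluates a+b exactly once, where the path through the first branch used to evaluate it twice.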

SLIDE 4

PRE Facts

Applied to each lexically identified expression independently – e.g., (a+b), (a-b), (a*c)

Formulated as a Placement problem:

Step 1 – Determine where to perform insertions

– Render more computations fully redundant

Step 2 – Delete fully redundant computations

Main challenge is in Step 1

SLIDE 5

The Most Popular PRE Algorithms

Lazy Code Motion (Knoop et al.)
– Computationally and life-time optimal
– Ordinary program representation
– Bit-vector-based iterative data flow analyses
SSAPRE
– Computationally and life-time optimal
– SSA form of program representation
– Sparse solution of data flow properties
– Subsumes local common subexpression elimination
– Insensitive to basic block boundaries

SLIDE 6

Static Single Assignment (SSA)

Program representation with built-in use-def information
Use-def edges factored at join points in CFG
Use-def implicitly represented via unique names
Each renamed variable has only one definition

[Figure: three views of the same CFG (B1–B5). "CFG": a is defined in B1 and B2 and used in B4 and B5. "Use-def": use-def edges from each use to the definitions. "Factored use-def": a2= in B1, a1= in B2, a3 = Φ(a1,a2) at the join, and =a3 in B4 and B5.]



SLIDE 7

Factored Redundancy Graph (FRG)

Used in SSAPRE to represent redundancy relationships among occurrences of the same expression via edges
The redundancy edges are factored as in SSA
Can be viewed as SSA applied to expressions
– Effectively puts the temporary t that stores the expression after PRE in SSA form

[Figure: three views of the same CFG (B1–B5) with a+b computed in B1, B2, B4, and B5. "CFG", "Redundancy" (redundancy edges between occurrences), and "Factored redundancy": t2=a+b in B1, t1=a+b in B2, t3 = Φ(t1,t2) at the join, and t3 used in B4 and B5.]



SLIDE 8

Speculative Code Motion

Classical PRE only inserts at places where the expression is anticipated (down-safe)
– Many redundant computations cannot be eliminated
Speculative code motion ignores the safety constraint
– Can remove more redundancies
– Not applicable to computations that may trigger runtime exceptions

[Figure: CFG (B1–B5) with a+b in B2 and B5, compared under classical PRE and under speculation (t=a+b in B1 and B2, t in B5); the unsafe path is marked.]

SLIDE 9

While Loop Example

[Figure: a while loop under classical PRE and under speculation.]

Invariant code motion involves speculation

SLIDE 10

While Loop Restructuring

[Figure: a while loop, the loop after restructuring, and the result of PRE.]

The common solution
Speculation no longer necessary
But code size increases

SLIDE 11

Speculation not always beneficial

Useless computations introduced for some paths
Beneficial only if removed computations executed more frequently than inserted computations
Requires execution frequency information

[Figure: CFG with node frequencies B1 50, B2 100, B3 150, B4 50, B5 100, before and after speculation; after speculation t=a+b appears in B1 and B2 and one later occurrence is replaced by t.]

Non-beneficial because freq(B2) > freq(B4)
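The benefit test can be checked with a little arithmetic. The sketch below assumes, based on the stated comparison of freq(B2) and freq(B4), that a+b is originally computed in B1 and B4 and that speculation inserts it in B2 so that the occurrence in B4 can be deleted. The helper name `dynamic_count` is hypothetical.

```python
# Node frequencies from the slide's example profile.
freq = {"B1": 50, "B2": 100, "B3": 150, "B4": 50, "B5": 100}


def dynamic_count(blocks_computing, freq):
    """Total number of times the expression is evaluated at run time."""
    return sum(freq[b] for b in blocks_computing)


# Assumed placement before speculation: a+b computed in B1 and B4.
before = dynamic_count(["B1", "B4"], freq)
# Assumed placement after speculation: inserted in B2, deleted from B4.
after = dynamic_count(["B1", "B2"], freq)

print(before, after)  # 100 150
```

Under these assumptions the insertion costs 100 evaluations and saves only 50, so the dynamic count rises from 100 to 150.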

SLIDE 12

Problem Statement

How to minimize the dynamic execution count of an expression under an execution profile
A more aggressive form of PRE
– Classical PRE beneficial regardless of execution frequencies
Cai and Xue (2003, 2006) first to apply min-cut to solve this problem optimally
– Algorithm called MC-PRE
– Uses bit-vector-based data flow analyses
– Min-cut applied to CFG
No SSA-based technique exists yet

SLIDE 13

Topic of this Paper

MC-SSAPRE – a new algorithm that yields optimal code placement under the SSAPRE framework
Overview:
– Form an essential flow graph (EFG) out of the FRG
– Map the BB execution frequencies to the EFG nodes
– Apply min-cut to the EFG

SLIDE 14

Algorithm Steps

SSAPRE Steps
Construct FRG
– Φ insertion
– Rename
Data Flow Attributes
– DownSafety
– WillBeAvail
Book-keeping
– Finalize
– CodeMotion

MC-SSAPRE Steps
Construct FRG
– Φ insertion
– Rename
Form EFG and perform min-cut
– Data flow
– Graph reduction
– Single source
– Single sink
– Minimum cut
– WillBeAvail
Book-keeping
– Finalize
– CodeMotion

SLIDE 15

Running example in SSA Form

[Figure: input program. CFG with node frequencies B1 50, B2 20, B3 70, B4 50, B5 10, B6 10, B7 50, B8 60, B9 5, B10 5, and two further blocks with frequencies 60 and 5; a1+b1 occurs five times and four blocks lead to exit.]

SLIDE 16

FRG for Running Example

Introduce h so the FRG can be viewed from an SSA perspective

[Figure: the input program next to its FRG after Φ insertion and Rename. FRG nodes: h1 (B1, 50); h2 = Φ(h1,⊥) (B3, 70); h3 (B4, 50); h2 (B6, 10); h4 = Φ(h3,h2) and h4 (B8, 60); h2 (B9, 5).]

SLIDE 17

Roles of Factored Redundancy Graph

Insertions need to be considered only at Φ’s
– associated with the Φ operands
Medium to compute data flow properties to disqualify more Φ’s from being insertion candidates
SSA form for t (temporary to store the computed value) will be carved out of the FRG
Three kinds of nodes:
1. Real occurrences in original program
– Def – always non-redundant
– Use – partially redundant (including fully redundant)
2. Φ (def)
3. Φ operand (use) – can be ⊥

SLIDE 18

Data Flow Properties for MC-SSAPRE

Fully available
– Insertions at these Φ’s always unnecessary because the computed values are available
Partially anticipated
– Insertions should only be at these Φ’s; otherwise, the inserted computation would have no use

SLIDE 19

Graph Reduction

Use computed data flow properties to further narrow down the Φ candidates for insertion
Delete:
– Φ’s that are fully available
– Φ’s that are not partially anticipated
– Use nodes (real occurrences or Φ operands) that are fully redundant
– Edges from/to above nodes

SLIDE 20

Graph Reduction for Running Example

[Figure: graph reduction applied to the FRG of the running example, shown in stages. The reduced graph keeps h2 = Φ(h1,⊥) (B3, 70), the real occurrence h2 (B6, 10), and h4 = Φ(h3,h2) with the real occurrence h4 (B8, 60); excluded occurrences are marked rg_excluded.]

rg_excluded – fully redundant occurrences determined during Renaming

SLIDE 21

Form Essential Flow Graph (EFG)

Introduce a virtual source node
– Add an edge from it to each ⊥ Φ operand
Introduce a virtual sink node
– Add an edge from each real occurrence to it
Result is a complete flow network

[Figure: EFG for the running example: source, h2 = Φ(h1,⊥) (B3, 70), h2 (B6, 10), h4 = Φ(h3,h2) and h4 (B8, 60), sink; the edges from the source and to the sink are marked as new edges.]

SLIDE 22

Edges in EFG

Edges to the sink are never insertion candidates
– Mark with ∞ frequency
Other edges are:
– Type 1 edge – Edges ending at a Φ operand
– Type 2 edge – Edges from a Φ to a real occurrence

[Figure: the EFG for the running example with its Type 1 and Type 2 edges labeled and the edges to the sink marked ∞.]

SLIDE 23

Mapping Frequencies to EFG Edges

Model insertion at a Type 1 edge by inserting at exit of the predecessor BB corresponding to the Φ operand
– Annotate the Type 1 edge by the node frequency of that predecessor BB
Insertion at a Type 2 edge means performing the computation in place
– Annotate the Type 2 edge by the frequency of the real occurrence
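A minimal sketch of the annotation rule, assuming the EFG is supplied as lists of Type 1 and Type 2 edges. The function `annotate_efg`, the node names, the choice of B2 as the predecessor supplying the ⊥ operand, and the placeholder block `P8` (the predecessor of B8 that feeds the h2 operand, assumed to have frequency 10) are illustrative assumptions and are not taken from the slides.

```python
INF = float("inf")


def annotate_efg(type1, type2, freq):
    """Return {(src, dst): weight} for an EFG.

    type1: (src, phi, pred_bb) edges ending at a phi operand; the weight is
           the frequency of the predecessor block that supplies the operand.
    type2: (phi, occ, occ_bb) edges from a phi to a real occurrence; the
           weight is the frequency of the block holding the occurrence.
    Edges from real occurrences to the sink get infinite weight.
    """
    weights = {}
    for src, phi, pred_bb in type1:
        weights[(src, phi)] = freq[pred_bb]
    for phi, occ, occ_bb in type2:
        weights[(phi, occ)] = freq[occ_bb]
        weights[(occ, "sink")] = INF  # never an insertion candidate
    return weights


# Frequencies of the blocks involved; P8 is a placeholder name (assumption).
freq = {"B2": 20, "B6": 10, "B8": 60, "P8": 10}
type1 = [("source", "phi_h2", "B2"), ("phi_h2", "phi_h4", "P8")]
type2 = [("phi_h2", "occ_B6", "B6"), ("phi_h4", "occ_B8", "B8")]
efg = annotate_efg(type1, type2, freq)
```

Keeping the weights on edges rather than nodes lets the same map be handed directly to a max-flow/min-cut routine.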

SLIDE 24

EFG annotated with Frequencies

[Figure: the original program next to the final EFG. The EFG edges are annotated with frequencies 20, 10, 10, and 60, and the two edges to the sink with ∞.]

SLIDE 25

Performing Minimum Cut

A minimum cut separates the flow network into two halves such that the sum of the weights of the cut edges is minimized
By performing insertions at the cut edges, the number of executions of the computation is minimized
– Implies computational optimality
If min-cut not unique, choose the cut nearest the sink
– Induces life-time optimality

SLIDE 26

Our Example

[Figure: the annotated EFG (edge weights 60, 10, 10, 20, and ∞ on the edges to the sink) with two min-cuts marked.]

Two possible min-cuts
Pick later red one
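The choice between the two cuts can be reproduced with a small max-flow computation. The sketch below uses Edmonds-Karp, chosen here for brevity; the slides do not say which max-flow algorithm the implementation uses. The assignment of the weights 20, 10, 10, 60 to particular edges and the node names are assumptions read off the figure. The cut nearest the sink is obtained by collecting the nodes that can still reach the sink in the residual graph.

```python
from collections import deque

INF = float("inf")


def max_flow(cap, s, t):
    """Edmonds-Karp. Returns (residual capacities, max-flow value).

    Assumes every s-t path contains at least one finite-capacity edge.
    """
    res = {u: dict(vs) for u, vs in cap.items()}
    for u, vs in cap.items():
        for v in vs:
            res.setdefault(v, {}).setdefault(u, 0)  # reverse edges
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:            # BFS for a shortest path
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return res, flow
        bottleneck, v = INF, t
        while parent[v] is not None:
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:                # push flow along the path
            u = parent[v]
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
            v = u
        flow += bottleneck


def latest_min_cut(cap, s, t):
    """Min-cut nearest the sink: edges entering the set that can reach t."""
    res, value = max_flow(cap, s, t)
    reach, stack = {t}, [t]
    while stack:
        v = stack.pop()
        for u in res:
            if u not in reach and res[u].get(v, 0) > 0:
                reach.add(u)
                stack.append(u)
    cut = [(u, v) for u, vs in cap.items() for v in vs
           if u not in reach and v in reach]
    return value, sorted(cut)


# Assumed EFG of the running example (weights as read from the figure).
cap = {
    "source": {"phi_h2": 20},
    "phi_h2": {"occ_B6": 10, "phi_h4": 10},
    "phi_h4": {"occ_B8": 60},
    "occ_B6": {"sink": INF},
    "occ_B8": {"sink": INF},
}
value, cut = latest_min_cut(cap, "source", "sink")
print(value, cut)  # 20 [('phi_h2', 'occ_B6'), ('phi_h2', 'phi_h4')]
```

Both the single edge leaving the source and the pair of edges leaving the first Φ have total weight 20; the residual-reachability rule selects the pair, which is the cut closer to the sink.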

SLIDE 27

Final Result

[Figure: the annotated EFG with the chosen min-cut, next to the final transformed program, which shows the temporaries t1 and t2 (t1=a1+b1, t2=a1+b1) and their uses. Block labels in the transformed program include B11 (10) and B13 (5).]

SLIDE 28

Complexity of MC-SSAPRE

V – number of FRG nodes
E – number of FRG edges
Except the minimum cut step, all the steps are O(V+E)
Performing minimum cut is $O(V^2 E)$
In general, $V_{cfg} > V_{frg} > V_{efg}$

MC-SSAPRE Steps
Construct FRG
– Φ insertion
– Rename
Form EFG and perform min-cut
– Data flow
– Graph reduction
– Single source
– Single sink
– Minimum cut
– WillBeAvail
Book-keeping
– Finalize
– CodeMotion

SLIDE 29

Our Implementation

Implemented MC-SSAPRE in the open source Path64 compiler, a descendant of the compiler with the original SSAPRE
Leveraged existing SSAPRE infrastructure
Resulting compiler will perform:
– SSAPRE when no profile available
  – Perform speculation for loop-invariant computations
– MC-SSAPRE with profile data
Compiler always restructures while loops

SLIDE 30

Setup of Experiment 1

Target is Intel Core™ i7-970 at 2.67GHz with 8MB cache
Ubuntu 9.10
With 6GB on board memory
Compare run-time performances of all of SPEC CPU2006 (29 benchmarks)
The 3 runs:
– SSAPRE – no speculation, no profile data
– SSAPREsp – loop-based speculation, no profile data
– MC-SSAPRE – speculation based on profile data

SLIDE 31

Experimental Results – CINT2006

Average speedup of 2.13% over SSAPRE
Average speedup of 2.25% over SSAPREsp

[Figure: bar chart comparing SSAPRE, SSAPREsp, and MC-SSAPRE (y-axis from 0.94 to 1.08) for 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 471.omnetpp, 473.astar, 483.xalancbmk, and the average.]

SLIDE 32

Experimental Results – CFP2006

Average speedup of 2.76% over SSAPRE
Average speedup of 1.96% over SSAPREsp

[Figure: bar chart comparing SSAPRE, SSAPREsp, and MC-SSAPRE (y-axis from 0.94 to 1.1) for 410.bwaves, 416.gamess, 433.milc, 434.zeusmp, 435.gromacs, 436.cactusADM, 437.leslie3d, 444.namd, 447.dealII, 450.soplex, 453.povray, 454.calculix, 459.GemsFDTD, 465.tonto, 470.lbm, 481.wrf, 482.sphinx3, and the average.]

SLIDE 33

Setup of Experiment 2

Calculate size of EFGs formed during MC-SSAPRE
Same 29 SPEC CPU2006 benchmarks
Target-independent
Show
– Optimization overhead in MC-SSAPRE
– Impact of sparse approach
Exclude empty EFGs
Smallest EFG is 4 nodes:
– Source, sink, Φ, real occurrence

SLIDE 34

Sizes of EFGs

183152 EFGs in the 29 SPEC CPU2006 benchmarks
Nearly 50% of EFGs have only 4 nodes
86.5% of EFGs have fewer than 10 nodes
99.0% of EFGs have fewer than 50 nodes
24 EFGs larger than 300 nodes (largest size is 805)

[Figure: histogram of the number of EFGs (axis 0 to 100000) with cumulative percentage (0% to 100%), by number of nodes in the EFG, in bins 4, 5, 6, 7, 8, 9, 10, 11–15, 16–20, 21–30, 31–40, 41–50, 51–60, 61–70, >=71.]

SLIDE 35

Conclusion

The minimum-cut technique for flow networks can effectively be applied to SSA graphs
SSA-based compilers can apply MC-SSAPRE to achieve optimal speculative code motion under an execution profile
The sparse approach is effective in reducing the problem sizes
The polynomial time complexity of min-cut only has limited effect on MC-SSAPRE’s optimization efficiency
MC-SSAPRE always improves program performance over SSAPRE

SLIDE 36

Contents

Basic Concepts PRE SSA SSAPRE Speculative Code Motion MC-SSAPRE Algorithm Complexity Experiments Conclusion
