

SLIDE 1

An SSA-based Algorithm for Optimal Speculative Code Motion under an Execution Profile

Hucheng Zhou Tsinghua University June 2011

Joint work with: Wenguang Chen (Tsinghua University), Fred Chow (ICube Technology Corp.)

SLIDE 2

June 2011 MC-SSAPRE PLDI 2

Contents

  • Basic Concepts
– PRE
– SSA
– SSAPRE
– Speculative Code Motion
  • MC-SSAPRE Algorithm
  • Complexity
  • Experiments
  • Conclusion

SLIDE 3

Partial Redundancy Elimination (PRE)

  • Eliminates expressions redundant on some (not necessarily all) paths
  • One of the most important and widely applied target-independent global optimizations
  • Subsumes global common subexpression elimination and loop-invariant code motion

[Figure: PRE example on a five-block CFG (B1–B5). Before: three occurrences of a+b. After PRE: t=a+b is computed in B1 and B2, and B4 and B5 use t.]

SLIDE 4

PRE Facts

  • Applied to each lexically identified expression independently
– e.g. (a+b), (a-b), (a*c)

  • Formulated as a Placement problem:

Step 1 – Determine where to perform insertions

– Render more computations fully redundant

Step 2 – Delete fully redundant computations

  • Main challenge is in Step 1
SLIDE 5

The Most Popular PRE Algorithms

Lazy Code Motion (Knoop et al.)
– Computationally and lifetime optimal
– Ordinary program representation
– Bit-vector-based iterative data flow analyses

SSAPRE
– Computationally and lifetime optimal
– SSA form of program representation
– Sparse solution of data flow properties
– Subsumes local common subexpression elimination
  • Insensitive to basic block boundaries
SLIDE 6

Static Single Assignment (SSA)

  • Program representation with built-in use-def information

  • Use-def edges factored at join points in CFG
  • Use-def implicitly represented via unique names
  • Each renamed variable has only one definition

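[Figure: three views of one CFG in which a is defined in B1 and B2 and used in B4 and B5. Labels as extracted: CFG, use-def, USE-DEF, factored use-def. In the factored view, B1 defines a2, B2 defines a1, a3 = φ(a1,a2) merges them at the join, and B4 and B5 use a3.]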

SLIDE 7

Factored Redundancy Graph (FRG)

  • Used in SSAPRE to represent redundancy relationships among occurrences of the same expression via edges
  • The redundancy edges are factored as in SSA
  • Can be viewed as SSA applied to expressions
– Effectively puts the temporary t, which stores the expression after PRE, in SSA form

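[Figure: three views of one CFG with a+b computed in B1, B2, B4 and B5. Labels as extracted: CFG, redundancy, Redundancy, factored redundancy. In the factored view, B1 holds t2=a+b, B2 holds t1=a+b, t3 = Φ(t1,t2) merges them at the join, and B4 and B5 use t3.]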

SLIDE 8

Speculative Code Motion

Classical PRE only inserts at places where the expression is anticipated (down-safe)

– Many redundant computations cannot be eliminated

Speculative code motion ignores the safety constraint
– Can remove more redundancies
– Not applicable to computations that may trigger runtime exceptions

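[Figure: a five-block CFG (B1–B5) with two occurrences of a+b, shown under classical PRE and under speculation. Labels as extracted: Classical PRE, CFG, Speculation, Unsafe Path. With speculation, t=a+b is computed in both B1 and B2 and B5 uses t.]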

SLIDE 9

While Loop Example

[Figure: a while loop in two panels, labeled Classical PRE and Speculation]

Invariant code motion involves speculation

SLIDE 10

While Loop Restructuring

[Figure: a while loop is first restructured, then PRE is applied. Labels as extracted: while loop, restructure, PRE]

  • The common solution
  • Speculation no longer necessary
  • But code size increases
SLIDE 11

Speculation not always beneficial

  • Useless computations are introduced on some paths
  • Beneficial only if the removed computations execute more frequently than the inserted computations
  • Requires execution frequency information

[Figure: a five-block CFG annotated with execution frequencies (B1 50, B2 100, B3 150, B4 50, B5 100), shown before and after speculative insertion of t=a+b.]

Non-beneficial because freq(B2) > freq(B4)
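The break-even rule on this slide is simple arithmetic: speculation pays off only when the executions removed outnumber the executions added. A minimal sketch follows; the function names and the frequencies in the usage lines are hypothetical and are not taken from the figure.

```python
def net_saving(removed_freqs, inserted_freqs):
    """Dynamic computations saved: executions removed minus executions added."""
    return sum(removed_freqs) - sum(inserted_freqs)


def is_beneficial(removed_freqs, inserted_freqs):
    """Speculation is worthwhile only when the net saving is positive."""
    return net_saving(removed_freqs, inserted_freqs) > 0


# Hypothetical profile A: one occurrence executed 100 times becomes redundant
# after an insertion into a block executed 30 times.
saving_a = net_saving([100], [30])

# Hypothetical profile B: the insertion block is hotter than the removed occurrence.
saving_b = net_saving([50], [100])
```

With these made-up numbers, profile A saves 70 dynamic computations, while profile B adds 50 and should be rejected.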

SLIDE 12

Problem Statement

How to minimize the dynamic execution count of an expression under an execution profile

  • A more aggressive form of PRE

– Classical PRE beneficial regardless of execution frequencies

  • Cai and Xue (2003, 2006) were the first to apply min-cut to solve this problem optimally
– Algorithm called MC-PRE
– Uses bit-vector-based data flow analyses
– Min-cut applied to CFG

  • No SSA-based technique exists yet
SLIDE 13

Topic of this Paper

MC-SSAPRE – a new algorithm that yields optimal code placement under the SSAPRE framework

Overview:
  • Form an essential flow graph (EFG) out of the FRG
  • Map the BB execution frequencies to the EFG nodes
  • Apply min-cut to the EFG
SLIDE 14

Algorithm Steps

SSAPRE Steps
  • Construct FRG
– Φ insertion
– Rename
  • Data Flow Attributes
– DownSafety
– WillBeAvail
  • Book-keeping
– Finalize
– CodeMotion

MC-SSAPRE Steps
  • Construct FRG
– Φ insertion
– Rename
  • Form EFG and perform min-cut
– Data flow
– Graph reduction
– Single source
– Single sink
– Minimum cut
  • WillBeAvail
  • Book-keeping
– Finalize
– CodeMotion
SLIDE 15

Running example in SSA Form

[Figure: the input program as a CFG in SSA form with several occurrences of a1+b1 and several exits. Block frequencies as extracted: B1 50, B2 20, B3 70, B4 50, B5 10, B6 10, B7 50, B8 60, B9 5, B10 5, B12 60, B12 5.]

SLIDE 16

FRG for Running Example

Introduce h so the FRG can be viewed from an SSA perspective

[Figure: the input program (left) and its FRG after Φ insertion and Rename (right). FRG nodes as extracted: h1 (B1, 50); h2 = Φ(h1,⊥) (B3, 70); h4 = Φ(h3,h2); real occurrences h4, h3, h2, h2, with blocks B4 50, B6 10, B8 60, B9 5.]

SLIDE 17

Roles of Factored Redundancy Graph

  • Insertions need to be considered only at Φ's
– associated with the Φ operands
  • Medium to compute data flow properties to disqualify more Φ's from being insertion candidates
  • SSA form for t (temporary to store the computed value) will be carved out of the FRG
  • Three kinds of nodes:
1. Real occurrences in original program
  • Def – always non-redundant
  • Use – partially redundant (including fully redundant)
2. Φ (def)
3. Φ operand (use) – can be ⊥
SLIDE 18

Data Flow Properties for MC-SSAPRE

Fully available
  • Insertions at these Φ's are always unnecessary because the computed values are available

Partially anticipated
  • Insertions should only be at these Φ's; otherwise, the inserted computation would have no use

SLIDE 19

Graph Reduction

Use the computed data flow properties to further narrow down the Φ candidates for insertion

Delete:
  • Φ's that are fully available
  • Φ's that are not partially anticipated
  • Use nodes (real occurrences or Φ operands) that are fully redundant
  • Edges from/to the above nodes
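The deletion rules above can be sketched as a filter over a node list and an edge list. This is only an illustration under assumed data structures: the `Node` class, its field names, and the sample graph are hypothetical, and the data flow flags are taken as already computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    kind: str                           # "phi" for a Phi node, "use" for a use node
    fully_available: bool = False       # consulted for Phi nodes only
    partially_anticipated: bool = True  # consulted for Phi nodes only
    fully_redundant: bool = False       # consulted for use nodes only


def reduce_graph(nodes, edges):
    """Drop disqualified nodes and every edge that touches one of them."""
    def is_deleted(node):
        if node.kind == "phi":
            return node.fully_available or not node.partially_anticipated
        return node.fully_redundant

    kept = [n for n in nodes if not is_deleted(n)]
    kept_names = {n.name for n in kept}
    kept_edges = [(u, v) for u, v in edges if u in kept_names and v in kept_names]
    return kept, kept_edges


# Hypothetical graph: phi_a is fully available, phi_c is not partially
# anticipated, and use_1 is fully redundant, so all three disappear.
nodes = [
    Node("phi_a", "phi", fully_available=True),
    Node("phi_b", "phi"),
    Node("phi_c", "phi", partially_anticipated=False),
    Node("use_1", "use", fully_redundant=True),
    Node("use_2", "use"),
]
edges = [("phi_a", "use_1"), ("phi_a", "phi_b"), ("phi_b", "phi_c"), ("phi_b", "use_2")]
kept, kept_edges = reduce_graph(nodes, edges)
```

Only `phi_b`, `use_2`, and the edge between them survive, which is the kind of shrinkage that keeps the later min-cut problem small.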
SLIDE 20

Graph Reduction for Running Example

[Figure: graph reduction applied to the FRG of the running example. The reduced graph keeps h2 = Φ(h1,⊥) (B3, 70) with real occurrence h2 (B6, 10), and h4 = Φ(h3,h2) with real occurrence h4 (B8, 60). Removed nodes are marked rg_excluded.]

rg_excluded – fully redundant occurrences determined during Renaming

SLIDE 21

Form Essential Flow Graph (EFG)

  • Introduce a virtual source node

– Add an edge from it to each ⊥ Φ operand

  • Introduce a virtual sink node

– Add an edge from each real occurrence to it

  • Result is a complete flow network

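[Figure: EFG for the running example. A source node with new edges into the graph, h2 = Φ(h1,⊥) (B3, 70), real occurrence h2 (B6, 10), h4 = Φ(h3,h2), real occurrence h4 (B8, 60), and a sink node.]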

SLIDE 22

Edges in EFG

Edges to the sink are never insertion candidates
– Mark with ∞ frequency

Other edges are:
Type 1 edge – Edges ending at a Φ operand
Type 2 edge – Edges from a Φ to a real occurrence

[Figure: the same EFG (source, h2 = Φ(h1,⊥) at B3, h2 at B6, h4 = Φ(h3,h2), h4 at B8, sink) with Type 1 and Type 2 edges labeled.]

SLIDE 23

Mapping Frequencies to EFG Edges

  • Model insertion at a Type 1 edge by inserting at the exit of the predecessor BB corresponding to the Φ operand
– Annotate the Type 1 edge with the node frequency of that predecessor BB
  • Insertion at a Type 2 edge means performing the computation in place
– Annotate the Type 2 edge with the frequency of the real occurrence
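A small sketch of how the weighted EFG could be assembled from these rules. The representation is an assumption made for illustration: Φ operands are folded into the edge that ends at their Φ, the function and node names are hypothetical, and the block assignments in the usage lines are one reading of the flattened running-example figure (in particular, B5 as the predecessor feeding the h2 operand of h4 is a guess).

```python
INF = float("inf")


def build_efg(type1, type2, block_freq):
    """Return {(from_node, to_node): weight} for a single-source, single-sink EFG.

    type1: (from_node, phi, pred_block) triples; from_node is "source" when the
           operand is bottom. Weight is the frequency of the predecessor block.
    type2: (phi, occurrence, occ_block) triples. Weight is the frequency of the
           block holding the real occurrence.
    """
    edges = {}
    for from_node, phi, pred_block in type1:
        edges[(from_node, phi)] = block_freq[pred_block]
    for phi, occurrence, occ_block in type2:
        edges[(phi, occurrence)] = block_freq[occ_block]
        # Edges into the sink are never insertion candidates.
        edges[(occurrence, "sink")] = INF
    return edges


# Assumed reading of the running example (block choices are not confirmed).
block_freq = {"B2": 20, "B5": 10, "B6": 10, "B8": 60}
type1 = [("source", "phi_h2", "B2"), ("phi_h2", "phi_h4", "B5")]
type2 = [("phi_h2", "occ_B6", "B6"), ("phi_h4", "occ_B8", "B8")]
efg = build_efg(type1, type2, block_freq)
```

Under that reading the finite weights come out as 20, 10, 10 and 60, the same four numbers that appear on the annotated EFG.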

SLIDE 24

EFG annotated with Frequencies

[Figure: the original program (left) and the final EFG (right). EFG nodes: source, h2 = Φ(h1,⊥) (B3, 70), real occurrence h2 (B6, 10), h4 = Φ(h3,h2), real occurrence h4 (B8, 60), sink. Edge weights as extracted: 60, 10, 10, 20; edges into the sink are ∞. Type 1 and Type 2 edges are labeled.]

SLIDE 25

Performing Minimum Cut

A minimum cut
  • separates the flow network into two halves, such that
  • the sum of the weights of the cut edges is minimized

By performing insertions at the cut edges, the number of executions of the computation is minimized
– Implies computational optimality

If the min-cut is not unique, choose the cut nearest the sink
– Induces lifetime optimality
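The cut selection described here can be sketched with any max-flow routine. The sketch below uses Edmonds-Karp (BFS augmenting paths) purely for brevity; it is not claimed to be the algorithm used in the authors' implementation. After the flow saturates, the nodes that can still reach the sink in the residual graph form the sink side of the min-cut nearest the sink. The function name, node names, and graph topology are assumptions: the topology is one reading of the flattened running-example figure.

```python
from collections import deque

INF = float("inf")


def min_cut_nearest_sink(edges, source, sink):
    """edges: {(u, v): capacity}. Returns (cut_value, sorted list of cut edges)."""
    # Residual capacities, with zero-capacity reverse edges.
    residual = {}
    for (u, v), cap in edges.items():
        residual.setdefault(u, {})
        residual.setdefault(v, {})
        residual[u][v] = residual[u].get(v, 0) + cap
        residual[v].setdefault(u, 0)

    def find_augmenting_path():
        """BFS from the source; returns a parent map if the sink is reachable."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    if v == sink:
                        return parent
                    queue.append(v)
        return None

    flow = 0
    while (parent := find_augmenting_path()) is not None:
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][w] for u, w in path)
        for u, w in path:
            residual[u][w] -= bottleneck
            residual[w][u] += bottleneck
        flow += bottleneck

    # Backward search: nodes that can still reach the sink in the residual graph.
    reaches_sink = {sink}
    queue = deque([sink])
    while queue:
        v = queue.popleft()
        for u in residual:
            if residual[u].get(v, 0) > 0 and u not in reaches_sink:
                reaches_sink.add(u)
                queue.append(u)

    cut = sorted((u, v) for (u, v) in edges
                 if u not in reaches_sink and v in reaches_sink)
    return flow, cut


# Assumed topology for the running example, with the weights 20, 10, 10, 60.
EFG_EDGES = {
    ("source", "phi_h2"): 20,
    ("phi_h2", "occ_B6"): 10,
    ("phi_h2", "phi_h4"): 10,
    ("phi_h4", "occ_B8"): 60,
    ("occ_B6", "sink"): INF,
    ("occ_B8", "sink"): INF,
}
cut_value, cut_edges = min_cut_nearest_sink(EFG_EDGES, "source", "sink")
```

On this assumed graph two cuts tie at weight 20: the single edge leaving the source, and the pair of edges leaving `phi_h2`. The backward search selects the pair, which is the later of the two and therefore shortens the live range of the temporary.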

SLIDE 26

Our Example

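[Figure: the annotated EFG (source, h2 = Φ(h1,⊥) at B3, h2 at B6, h4 = Φ(h3,h2), h4 at B8, sink) with two candidate cuts, each labeled min-cut. Edge weights as extracted: 60, 10, 10, 20; edges into the sink are ∞.]

  • Two possible min-cuts
  • Pick the later (red) one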

SLIDE 27

Final Result

 

[Figure: the EFG with the chosen min-cut (left) and the final transformed program (right). In the transformed program, a1+b1 is computed into t1 once and into t2 twice, and the remaining occurrences are replaced by t1 or t2. Block frequencies as extracted: B1 50, B2 20, B3 70, B4 50, B5 10, B6 10, B7 50, B8 60, B9 5, B10 5, B11 10, B13 5.]

SLIDE 28

Complexity of MC-SSAPRE

V – number of FRG nodes
E – number of FRG edges

  • Except the minimum cut step, all the steps are O(V+E)
  • Performing minimum cut is O(V²E)
  • In general, Vcfg > Vfrg > Vefg

MC-SSAPRE Steps
  • Construct FRG
– Φ insertion
– Rename
  • Form EFG and perform min-cut
– Data flow
– Graph reduction
– Single source
– Single sink
– Minimum cut
  • WillBeAvail
  • Book-keeping
– Finalize
– CodeMotion
SLIDE 29

Our Implementation

  • Implemented MC-SSAPRE in the open-source Path64 compiler, a descendant of the compiler with the original SSAPRE
  • Leveraged existing SSAPRE infrastructure
  • Resulting compiler will perform:
– SSAPRE when no profile is available
  • Performs speculation for loop-invariant computations
– MC-SSAPRE with profile data
  • Compiler always restructures while loops
SLIDE 30

Setup of Experiment 1

  • Target is Intel Core™ i7-970 at 2.67GHz with 8MB cache
  • Ubuntu 9.10
  • With 6GB on-board memory
  • Compare run-time performance of all of SPEC CPU2006 (29 benchmarks)
  • The 3 runs:
– SSAPRE – no speculation, no profile data
– SSAPREsp – loop-based speculation, no profile data
– MC-SSAPRE – speculation based on profile data

SLIDE 31

Experimental Results – CINT2006

  • Average speedup of 2.13% over SSAPRE
  • Average speedup of 2.25% over SSAPREsp

[Figure: bar chart comparing SSAPRE, SSAPREsp and MC-SSAPRE per benchmark; y-axis from 0.94 to 1.08. Benchmarks: 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 471.omnetpp, 473.astar, 483.xalancbmk, and the average.]

SLIDE 32

Experimental Results – CFP2006

  • Average speedup of 2.76% over SSAPRE
  • Average speedup of 1.96% over SSAPREsp

[Figure: bar chart comparing SSAPRE, SSAPREsp and MC-SSAPRE per benchmark; y-axis from 0.94 to 1.1. Benchmarks: 410.bwaves, 416.gamess, 433.milc, 434.zeusmp, 435.gromacs, 436.cactusADM, 437.leslie3d, 444.namd, 447.dealII, 450.soplex, 453.povray, 454.calculix, 459.GemsFDTD, 465.tonto, 470.lbm, 481.wrf, 482.sphinx3, and the average.]

SLIDE 33

Setup of Experiment 2

  • Calculate size of EFGs formed during MC-SSAPRE
  • Same 29 SPEC CPU2006 benchmarks
  • Target-independent
  • Show
– Optimization overhead in MC-SSAPRE
– Impact of sparse approach
  • Exclude empty EFGs
  • Smallest EFG is 4 nodes:
– Source, sink, Φ, real occurrence

SLIDE 34

Sizes of EFGs

  • 183152 EFGs in the 29 SPEC CPU2006 benchmarks
  • Nearly 50% of EFGs have only 4 nodes
  • 86.5% of EFGs have fewer than 10 nodes
  • 99.0% of EFGs have fewer than 50 nodes
  • 24 EFGs have more than 300 nodes (largest size is 805)

[Figure: histogram of the number of EFGs (axis up to 100000) with cumulative percentage (0% to 100%), plotted against the number of nodes in the EFG, binned as 4, 5, 6, 7, 8, 9, 10, 11–15, 16–20, 21–30, 31–40, 41–50, 51–60, 61–70, >=71.]

SLIDE 35

Conclusion

  • The minimum-cut technique for flow networks can effectively be applied to SSA graphs
  • SSA-based compilers can apply MC-SSAPRE to achieve optimal speculative code motion under an execution profile
  • The sparse approach is effective in reducing the problem sizes
  • The polynomial time complexity of min-cut has only a limited effect on MC-SSAPRE's optimization efficiency
  • MC-SSAPRE always improves program performance over SSAPRE
SLIDE 36

Questions?