Optimization++: Complexities and Strategies of Optimization - PowerPoint Presentation



SLIDE 1

Optimization++

  • Complexities and strategies of optimization
  • Instruction Scheduling
  • Register Allocation
SLIDE 2

Optimization Recap

  • 1. Intermediate language (IL) module
  • better separation of front-end and back-end modules
  • permits multi-pass optimization
  • we're focusing on 3-address code
  • 2. Basic blocks (BBs) & Control Flow Graphs (CFGs)
  • BBs are jump-free sequences of code
  • CFGs link up BBs
  • clearly and efficiently identify jump-free sequences of code and control flow

  • 3. IL-based optimization
  • data-flow analysis (abstract program execution of facts)

* available expressions, avail. copies, useless vars, …

  • atomized, recombinable optimizations

* common subexpression elimination, copy propagation, useless expression (statement) elimination, reduction in strength with induction variable elimination
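The BB/CFG idea above can be sketched concretely: a minimal leader-based splitter that cuts a stream of 3-address instructions into basic blocks. The tuple encoding and opcode names here are illustrative, not the course's IL.

```python
# Split 3-address code into basic blocks using the classic "leader" rule:
# the first instruction, every jump target, and every instruction that
# follows a jump each begin a new block.
def basic_blocks(code):
    # code: list of (op, *args); jumps are ('jmp', label) / ('br', cond, label);
    # labels are ('label', name) pseudo-instructions
    leaders = {0}
    targets = {args[-1] for op, *args in code if op in ('jmp', 'br')}
    for i, (op, *args) in enumerate(code):
        if op in ('jmp', 'br') and i + 1 < len(code):
            leaders.add(i + 1)              # instruction after a jump
        if op == 'label' and args[0] in targets:
            leaders.add(i)                  # jump target
    cuts = sorted(leaders)
    return [code[a:b] for a, b in zip(cuts, cuts[1:] + [len(code)])]

prog = [
    ('add', 'x', 'y', 'z'),
    ('label', 'L1'),
    ('mul', 'x', 'x', 't'),
    ('br', 't', 'L1'),        # back edge: a loop
    ('st', 't', 'w'),
]
blocks = basic_blocks(prog)   # three jump-free blocks
```

The CFG is then just edges from each block's final jump (and fall-through) to the blocks that start at its targets.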

SLIDE 3

Optimization Strategies

Optimizations can be done:

  • Locally (within a BB)
  • Globally (over a function's CFG); truly whole-program optimization is rare
  • Functions can be inlined to get better results

– can bloat code; replication hurts instruction-cache locality

  • Peephole: sliding window over IL or assembler

– e.g., reduce 2 simple instructions to 1 complex instruction

80/20 rules of optimization

  • 50% of the improvement can be achieved with local opts

– Also easier to implement (a 50/10 rule? ;)

  • 80% of instructions executed are in 20% of the code – in inner loops

– Focusing optimizations on inner loops makes the optimizer run much faster yet still capture most (~80%) of the benefit
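The peephole strategy above can be sketched as a tiny pass over a hypothetical textual IL: slide a two-instruction window over the stream and rewrite recognized pairs. The store-then-reload pattern and mnemonics are illustrative examples, not a fixed catalog.

```python
# Peephole optimization: examine a 2-instruction window and replace
# recognized pairs with something cheaper.
def peephole(instrs):
    out, i = [], 0
    while i < len(instrs):
        if i + 1 < len(instrs):
            a, b = instrs[i], instrs[i + 1]
            # pattern: store to X immediately followed by a load from X;
            # keep the store, turn the load into a register-to-register move
            if a[0] == 'st' and b[0] == 'ld' and a[2] == b[1]:
                out.append(a)
                out.append(('mov', a[1], b[2]))
                i += 2
                continue
        out.append(instrs[i])
        i += 1
    return out

code = [('st', 'r1', 'x'), ('ld', 'x', 'r2'), ('add', 'r1', 'r2', 'r3')]
optimized = peephole(code)   # the reload of x becomes a mov
```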

SLIDE 4

Complexities – Subject of Research

Pointers

  • Unknown what is being set, so facts must be killed to be safe – suppressing optimization

Function calls

  • Like huge outer loops (think recursion), but used in many places
  • Algorithmic cost to handle them is high; often treated like pointers

Debugging

  • Optimization makes it hard to set breakpoints, inspect values, etc.

Runtime optimization

  • Java virtual machines compile & optimize during execution
  • Must be very lean; may use runtime information

Jeanne Ferrante, Brad Calder, Andrew Chien, Scott Baden

SLIDE 5

Instruction Scheduling

  • RISC machines separate memory instructions from the rest
  • Each instruction does memory or computation, not both
  • Allows most instructions to execute in a few cycles (say, 5); only multiplies and divides take longer
  • If adjacent operations are unrelated, they can be overlapped (compute on x not near load of x)
  • pipeline: fetch, decode, execute, (execute,) store (e.g.)
  • one-cycle net cost if perfectly overlapped, no cache misses
  • also, cannot do two fetches, etc., at a time (hazards)
  • To get best performance, then, need to reorder instructions to minimize 'stalling' dependences

SLIDE 6

Example of Pipeline Stalls

ld x, r1         f d e e s
ld y, r2         f d e e s
add r1, r2, r3   f d e s
st r3, z         f d e e s

  • 10 cycles (3 cycles of stalls)
  • Ideas? These in front:

ld a, r4
ld b, r5
sub r4, r5, r6
st r6, c
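The stall counts above can be approximated with a crude dependence-distance model: each instruction takes one issue slot, a load's result is usable two slots later, and an ALU result one slot later. This is my simplification (it reports 1 stall cycle for the 4-instruction sequence rather than the diagram's 3, because it ignores the full 5-stage overlap), but it shows why interleaving the two independent computations removes all stalls.

```python
LAT = {'ld': 2}   # loads deliver their result 2 slots later; others 1

def stall_cycles(instrs):
    # instrs: list of (op, reads, writes); count slots spent waiting for
    # a result that is not ready yet, in program order.
    ready, slot, total = {}, 0, 0
    for op, reads, writes in instrs:
        need = max([ready.get(r, 0) for r in reads], default=0)
        if need > slot:                # operand not ready: stall
            total += need - slot
            slot = need
        for w in writes:
            ready[w] = slot + LAT.get(op, 1)
        slot += 1
    return total

naive = [
    ('ld',  [],           ['r1']),   # ld x, r1
    ('ld',  [],           ['r2']),   # ld y, r2
    ('add', ['r1', 'r2'], ['r3']),   # add r1, r2, r3
    ('st',  ['r3'],       []),       # st r3, z
]
interleaved = [
    ('ld',  [],           ['r4']),   # ld a, r4
    ('ld',  [],           ['r5']),   # ld b, r5
    ('ld',  [],           ['r1']),   # ld x, r1
    ('ld',  [],           ['r2']),   # ld y, r2
    ('sub', ['r4', 'r5'], ['r6']),   # sub r4, r5, r6
    ('st',  ['r6'],       []),       # st r6, c
    ('add', ['r1', 'r2'], ['r3']),   # add r1, r2, r3
    ('st',  ['r3'],       []),       # st r3, z
]
```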

SLIDE 7

Reorder instructions - no stalls/bubbles!

ld a, r4         f d e e s
ld b, r5         f d e e s
ld x, r1         f d e e s
ld y, r2         f d e e s s
sub r4, r5, r6   f d e s s
st r6, c         f d e e s s
add r1, r2, r3   f d e s s
st r3, z         f d e e s

  • Assumes no cache misses on loads
  • Any ideas for implementation (an algorithm?)
SLIDE 8

Maximize distance between dependences

ld a, r4
ld b, r5
sub r4, r5, r6
st r6, c
ld x, r1
ld y, r2
add r1, r2, r3
st r3, z

(dependence-graph edge annotations from the slide: (1s, 2i) (1s, 2i) (0s, 1i) (1s, 2i) (1s, 2i) (0s, 1i))

Repeat until no instructions remain:

  • 1. Examine all "roots" (all predecessors scheduled)
  • 2. Schedule one that
  • a. can cause stalls on successors
  • b. has most successors (creates most choices)
  • c. has the longest path to a leaf
  • 3. Delete it from the graph
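This list-scheduling loop can be sketched in Python. The dependence-graph encoding and tie-breaking details are mine; priorities (a)–(c) follow the slide, with "can cause stalls on successors" approximated as "has latency greater than 1".

```python
# List scheduling over a dependence DAG: repeatedly pick the best "root"
# (a node whose predecessors are all scheduled) and delete it.
def longest_path(node, succs, memo=None):
    if memo is None:
        memo = {}
    if node not in memo:
        memo[node] = 1 + max((longest_path(s, succs, memo)
                              for s in succs[node]), default=0)
    return memo[node]

def list_schedule(nodes, preds, succs, latency):
    scheduled, order = set(), []
    while len(order) < len(nodes):
        roots = [n for n in nodes
                 if n not in scheduled and preds[n] <= scheduled]
        best = max(roots, key=lambda n: (latency[n] > 1,      # (a) can stall
                                         len(succs[n]),       # (b) successors
                                         longest_path(n, succs)))  # (c)
        order.append(best)
        scheduled.add(best)
    return order

# the slide's 8-instruction example
nodes = ['ld_a', 'ld_b', 'ld_x', 'ld_y', 'sub', 'st_c', 'add', 'st_z']
preds = {'ld_a': set(), 'ld_b': set(), 'ld_x': set(), 'ld_y': set(),
         'sub': {'ld_a', 'ld_b'}, 'st_c': {'sub'},
         'add': {'ld_x', 'ld_y'}, 'st_z': {'add'}}
succs = {'ld_a': ['sub'], 'ld_b': ['sub'], 'ld_x': ['add'], 'ld_y': ['add'],
         'sub': ['st_c'], 'st_c': [], 'add': ['st_z'], 'st_z': []}
latency = {n: 2 if n.startswith('ld') else 1 for n in nodes}
order = list_schedule(nodes, preds, succs, latency)
```

The four loads are scheduled first (they have latency > 1), so the dependent arithmetic lands far from the loads that feed it.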
SLIDE 9

Optimizing Register Allocation

Memory operations are expensive

  • ‘Extra’ instructions: they take longer, cause stalls, and can miss the cache

Best never to load or store data from memory

  • Keep all data in registers – but registers are in short supply
  • How to prioritize; ideas?

Temporal locality

Values to be

  • a. used the most
  • b. in the shortest span of instructions

SLIDE 10

Greedy Algorithm

Policy: let a variable loaded into a register stay in that register until the register is needed for something else.

x := y + y

Without reuse:        With reuse:
ld [fp-4], r1         ld [fp-4], r1
ld [fp-4], r2         add r1, r1, r3
add r1, r2, r3        st r3, [fp-8]
st r3, [fp-8]

  • Achieved by having each variable's STO remember its register, and vice versa

R = Machine.smartGetReg(varSTO);

  • If called on an STO that already has a register, just returns that register
  • If all registers are in use, chooses a register to free
  • Keeps the memory-based model – a modular change
SLIDE 11

smartGetReg

Register smartGetReg(STO var) {
    if (!(reg = varFile.varsReg(var))) {        // var not already in a register
        if (!(reg = getReg())) {                // no free register available
            STO ovar = varFile.stalestVar();    // pick LRU var to evict
            Register oreg = varFile.varsReg(ovar);
            varFile.remVar(ovar);               // remove LRU var/reg from file
            freeReg(oreg);
            reg = getReg();                     // guaranteed to succeed now
        }
        varFile.put(var, reg);
        emitLoad(var, reg);                     // DOES PREEMPTIVE LOAD!!
    }
    varFile.markUsed(var);                      // updates LRU 'timestamp'
    return reg;
}

  • Must free all at end of function
  • Can also do lazy stores, but not as big a win
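The same LRU policy can be restated as a runnable Python sketch. The class and its bookkeeping are illustrative stand-ins, not the course's Machine/STO API; an OrderedDict plays the role of the "var file".

```python
from collections import OrderedDict

class GreedyRegFile:
    """LRU register file: a variable keeps its register until evicted."""
    def __init__(self, regs):
        self.free = list(regs)
        self.file = OrderedDict()   # var -> reg, least recently used first
        self.loads = []             # record of emitted (preemptive) loads

    def get_reg(self, var):
        if var in self.file:                 # hit: already in a register
            self.file.move_to_end(var)       # refresh the LRU 'timestamp'
            return self.file[var]
        if not self.free:                    # miss, no free reg: evict LRU
            _, oreg = self.file.popitem(last=False)
            self.free.append(oreg)
        reg = self.free.pop()
        self.file[var] = reg
        self.loads.append(var)               # preemptive load of var
        return reg

rf = GreedyRegFile(['r1', 'r2'])
rf.get_reg('x'); rf.get_reg('y')
r = rf.get_reg('x')       # hit: no new load, refreshes x
rf.get_reg('z')           # evicts 'y', the least recently used variable
```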

SLIDE 12

Global Register Allocation

Actually examine temporal locality

  • Values likely to be used the most
  • In the shortest span of instructions

In short, reference density

  • Static: number of variable references in the code
  • Dynamic: number of times variable references occur (loops)

Two challenges

  • Determining (guessing) dynamic density

– estimate from static density, or from profiling via test cases

  • Choosing the allocation that gets the most dense variables (most references) into registers
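Static density as described above is just a count of operand occurrences per variable; weighting each count by an assumed loop trip count gives a crude dynamic estimate. The ×10-per-nesting-level weight here is a guess of mine, the kind a compiler uses when no profile is available.

```python
from collections import Counter

LOOP_WEIGHT = 10   # assumed iterations per loop nesting level (a guess)

def reference_density(code):
    # code: list of (nesting_depth, [variables referenced by the instr])
    static, dynamic = Counter(), Counter()
    for depth, vars_ in code:
        for v in vars_:
            static[v] += 1                       # raw count in the code
            dynamic[v] += LOOP_WEIGHT ** depth   # loop-weighted estimate
    return static, dynamic

code = [(0, ['x', 'y']),   # x := y, outside any loop
        (0, ['x']),        # another straight-line use of x
        (1, ['i'])]        # a single inner-loop use of i
static, dynamic = reference_density(code)
# x wins statically, but the loop variable i dominates dynamically
```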

SLIDE 13

Graph-Coloring Register Allocation

Schedule variables into registers like classes into rooms

  • Schedule the most classes possible for the available rooms
  • Can’t have two classes (variables) in a room (register) at the same time

[1] z := 1
[2] x := 2 * z
[3] y := 3 * z
[4] w := x + y
[5] z := y + z
[6] x := y * w

(figure: lifetime bars for x, y, w, z alongside the code; a naive left-to-right try of r1 r2 r3 runs out, leaving “? ?”)

  • r: r1 r2 r1 r3

SLIDE 14

Graph-Coloring Register Allocation

Create a conflict graph; an edge means “cannot be scheduled in same ‘room’ (register) because (life)times overlap”.

[1] z := 1
[2] x := 2 * z
[3] y := 3 * z
[4] w := x + y
[5] z := y + z
[6] x := y * w

(figure: conflict graph over nodes x, y, w, z with lifetime bars)

Repeat until all variables are allocated:

  • Select a node from the unallocated set*
  • Give it a register not in use by any neighbor
  • Remove the node from the unallocated set

*prioritize non-trivial, dense nodes:

  • a. constrained nodes (more edges than registers)
  • b. nodes with higher reference density

(figure annotations: per-node constraints/density (2c, 1/3d) (3c, 1d) (3c, 3/4d) (2c, 1/2d); resulting registers r3 r1 r2 r3)
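The coloring loop above can be sketched in Python: visit unallocated nodes most-constrained first (standing in for the slide's constrained/dense priority) and give each the lowest register not used by an already-colored neighbor. The edge set below is one plausible reading of the slide's figure, in which x and w do not conflict and so can share a register.

```python
def color_registers(conflicts, num_regs):
    # conflicts: var -> set of vars whose lifetimes overlap with it
    order = sorted(conflicts, key=lambda v: len(conflicts[v]), reverse=True)
    assignment = {}
    for v in order:
        taken = {assignment[n] for n in conflicts[v] if n in assignment}
        reg = next((f'r{i}' for i in range(1, num_regs + 1)
                    if f'r{i}' not in taken), None)
        if reg is None:
            raise RuntimeError(f'must spill {v}')   # not enough registers
        assignment[v] = reg
    return assignment

# a plausible conflict graph for the slide's example (x and w don't overlap)
conflicts = {'x': {'y', 'z'},
             'y': {'x', 'z', 'w'},
             'w': {'y', 'z'},
             'z': {'x', 'y', 'w'}}
assignment = color_registers(conflicts, 3)   # x and w end up sharing a reg
```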

SLIDE 15

Global Register Allocation

[1] r2 := 1
[2] r3 := 2 * r2
[3] r1 := 3 * r2
[4] r3 := r3 + r1   ! x and w “share”
[5] z := r1 + r2
[6] x := r1 * r3

  • Entire calculation done in registers
  • Note that this is done on the IL – restricted/typed temps
  • Accuracy of the allocation depends on the mapping to actual assembly

SLIDE 16

Lessons Learned

Significant benefits are possible at little cost or complexity

Modelling (formalization) of problems

  • 3-addr code, BB, CFG, dependence graphs

Clarifies structure

  • Exposes differences, similarities, (mis)matches
  • Identifies opportunities (for optimization)
  • Efficient and simple algorithms available “off the shelf”

Also useful for thinking, software design, debugging, etc.