Optimization++: Complexities and strategies of optimization


  1. Optimization++
     • Complexities and strategies of optimization
     • Instruction Scheduling
     • Register Allocation

  2. Optimization Recap
     1. Intermediate language (IL) module
        - better separation of front-end and back-end modules
        - permits multi-pass optimization
        - we’re focusing on 3-address code
     2. Basic blocks (BBs) & control flow graphs (CFGs)
        - BBs are jump-free sequences of code
        - CFGs link up BBs
        - clearly, efficiently identify jump-free sequences of code and control flow
     3. IL-based optimization
        - data-flow analysis (abstract program execution of facts)
          * available expressions, available copies, useless variables, …
        - atomized, recombinable optimizations
          * common subexpression elimination, copy propagation, useless expression (statement) elimination, reduction in strength with induction variable elimination
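
As a minimal sketch of one of the atomized optimizations listed in this recap, the Python below performs common subexpression elimination inside a single basic block of 3-address code, using a table of available expressions. The tuple encoding `(dest, op, a, b)`, the `copy` pseudo-op, and the name `local_cse` are assumptions made for illustration; they do not come from the slides.

```python
def local_cse(block):
    """Local common subexpression elimination on one basic block.

    block is a list of 3-address tuples (dest, op, a, b) meaning dest := a op b.
    This encoding is an illustrative assumption, not the course's IL.
    """
    available = {}  # (op, a, b) -> variable currently holding that value
    out = []
    for dest, op, a, b in block:
        key = (op, a, b)
        holder = available.get(key)
        if holder is not None:
            # The value was already computed: reuse it instead of recomputing.
            out.append((dest, "copy", holder, None))
        else:
            out.append((dest, op, a, b))
        # Assigning dest kills every expression that reads dest or is held in dest.
        available = {k: v for k, v in available.items()
                     if v != dest and dest not in (k[1], k[2])}
        # Record the new expression unless dest overwrote one of its own operands.
        if holder is None and dest not in (a, b):
            available[key] = dest
    return out


block = [
    ("t1", "+", "a", "b"),
    ("t2", "+", "a", "b"),    # same expression, still available
    ("a", "+", "t1", "t2"),   # redefines a, so a + b is killed
    ("t3", "+", "a", "b"),    # must be recomputed
]
optimized = local_cse(block)
```

In this run only the second statement becomes a copy of `t1`. The sketch treats `a + b` and `b + a` as different expressions, which is conservative but safe.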

  3. Optimization Strategies
     Optimizations can be done:
     • Locally (within a BB)
     • Globally (over the CFG for a function); truly global optimization is rare
     • Functions can be inlined to get better results
       – can bloat code; replication hurts instruction cache locality
     • Peep-hole: sliding window over IL or assembler
       – e.g., reduce 2 simple instructions to 1 complex instruction
     80/20 rules of optimization
     • 50% of improvement can be achieved in local opts
       – Also easier to implement (50/10 rule? ;)
     • 80% of instructions executed are in 20% of the code
       – in inner loops
       – Focusing optimizations on inner loops makes optimization 80% faster, yet still 80% as effective
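
The peep-hole idea can be sketched as a two-instruction window sliding over a list of instructions. The single rewrite rule here (merging two constant adds on the same register) and the tuple encoding `(opcode, src, src_or_constant, dest)` are hypothetical choices for illustration; a real peephole pass would carry a table of such rules and would have to respect the target's immediate range.

```python
def peephole(code):
    """Slide a two-instruction window over code and merge adjacent constant adds.

    Instructions are tuples; adds look like ("add", src, constant, dest),
    mirroring the slides' operand order. The encoding is an assumption.
    """
    out = []
    for instr in code:
        prev = out[-1] if out else None
        if (prev is not None
                and prev[0] == "add" and instr[0] == "add"
                and isinstance(prev[2], int) and isinstance(instr[2], int)
                and prev[1] == prev[3] == instr[1] == instr[3]):
            # add r, c1, r ; add r, c2, r  ->  add r, c1 + c2, r
            out[-1] = ("add", prev[1], prev[2] + instr[2], prev[3])
            continue
        out.append(instr)
    return out


code = [
    ("add", "r1", 4, "r1"),
    ("add", "r1", 8, "r1"),
    ("st", "r1", "z"),
]
shorter = peephole(code)
```

Because the merged instruction is written back to the end of the output, a third matching add would be folded into it as well.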

  4. Complexities – Subject of Research
     Pointers
     • Unknown what is being set; to be safe, kill the optimization
     Function calls
     • Like huge outer loops (think recursion), but used in many places
     • Algorithmic cost to handle is high; often treated like pointers
     Debugging
     • Optimization makes it hard to set break points, inspect values, etc.
     Runtime optimization
     • Java virtual machines compile & optimize during execution
     • Must be very lean, may use runtime information
     Researchers: Jeanne Ferrante, Brad Calder, Andrew Chien, Scott Baden

  5. Instruction Scheduling
     • RISC machines separate memory instructions from the rest
       - Each instruction does memory or computation, not both
       - Allows most instructions to execute in a few cycles (say, 5); only multiplies and divides take longer
     • If adjacent operations are unrelated, they can be overlapped (keep the compute on x away from the load of x)
       - pipeline: fetch, decode, execute, (execute,) store (e.g.)
       - one-cycle net cost if perfectly overlapped, no cache misses
       - also, cannot do two fetches, etc., at a time (hazards)
     • To get the best performance, then, need to reorder instructions to minimize ‘stalling’ dependences

  6. Example of Pipeline Stalls
     (f = fetch, d = decode, e = execute, s = store)
         ld  x, r1        f d e e s
         ld  y, r2        f d e e s
         add r1, r2, r3   f d e s
         st  r3, z        f d e e s
     • 10 cycles (3 cycles of stalls)
     • Ideas? These in front:
         ld  a, r4
         ld  b, r5
         sub r4, r5, r6
         st  r6, c

  7. Reorder instructions - no stalls/bubbles!
         ld  a, r4        f d e e s
         ld  b, r5        f d e e s
         ld  x, r1        f d e e s
         ld  y, r2        f d e e s s
         sub r4, r5, r6   f d e s s
         st  r6, c        f d e e s s
         add r1, r2, r3   f d e s s
         st  r3, z        f d e e s
     • Assumes no cache misses on loads
     • Any ideas for implementation (an algorithm?)

  8. Maximize distance between dependences
     Dependence graph:
         ld a, r4 (1s, 2i)     ld b, r5 (1s, 2i)
                sub r4, r5, r6 (0s, 1i)
                st r6, c
         ld x, r1 (1s, 2i)     ld y, r2 (1s, 2i)
                add r1, r2, r3 (0s, 1i)
                st r3, z
     Repeat until no instructions:
     1. Examine all “roots” (all predecessors scheduled)
     2. Schedule one that
        a. can cause stalls on successors
        b. has most successors (creates most choices)
        c. has the longest path to a leaf
     3. Delete it from graph
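
The loop on this slide can be sketched in Python as a small list scheduler. Several points are assumptions of the sketch, not statements from the slides: criteria a, b and c are applied as a strict priority order, ties fall back to program order, the `(1s, 2i)` labels are read as "stalls it can cause, instructions on the longest path to a leaf", and only loads are given a stall count. The name `list_schedule` and the string encoding of instructions are hypothetical.

```python
def list_schedule(instrs, deps, stalls):
    """Order instrs so that dependent instructions are pushed apart.

    instrs: unique instruction strings in program order.
    deps:   (pred, succ) pairs, succ needs a result of pred.
    stalls: opcode -> stall cycles it may cause on a successor (assumed table).
    """
    succs = {i: [] for i in instrs}
    npreds = {i: 0 for i in instrs}
    for pred, succ in deps:
        succs[pred].append(succ)
        npreds[succ] += 1

    path = {}

    def longest(i):
        # Number of instructions on the longest dependence path below i.
        if i not in path:
            path[i] = max((1 + longest(s) for s in succs[i]), default=0)
        return path[i]

    def priority(i):
        opcode = i.split()[0]
        return (stalls.get(opcode, 0), len(succs[i]), longest(i))

    remaining = list(instrs)
    order = []
    while remaining:
        # Roots: instructions whose predecessors have all been scheduled.
        roots = [i for i in remaining if npreds[i] == 0]
        best = max(roots, key=priority)  # first maximum wins, i.e. program order
        order.append(best)
        remaining.remove(best)           # "delete it from graph"
        for s in succs[best]:
            npreds[s] -= 1
    return order


instrs = ["ld a, r4", "ld b, r5", "sub r4, r5, r6", "st r6, c",
          "ld x, r1", "ld y, r2", "add r1, r2, r3", "st r3, z"]
deps = [("ld a, r4", "sub r4, r5, r6"), ("ld b, r5", "sub r4, r5, r6"),
        ("sub r4, r5, r6", "st r6, c"),
        ("ld x, r1", "add r1, r2, r3"), ("ld y, r2", "add r1, r2, r3"),
        ("add r1, r2, r3", "st r3, z")]
schedule = list_schedule(instrs, deps, {"ld": 1})
```

With these inputs all four loads are hoisted to the front, followed by `sub`, `add` and the two stores. Other tie-breaking choices give other valid orders, so the exact sequence should not be read as the one true schedule.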

  9. Optimizing Register Allocation
     Memory operations are expensive
     - ‘Extra’ instructions, take longer, cause stalls, miss cache
     Best never to load or store data from memory
     • Keep all data in registers, but they are in short supply
     • How to prioritize; ideas?
     Temporal locality: values to be
       a. used the most
       b. in the shortest span of instructions

  10. Greedy Algorithm
      Policy: let a variable loaded into a register stay in the register until the register is needed for something else
          x := y + y
          before:                 with the policy:
          ld  [fp-4], r1          ld  [fp-4], r1
          ld  [fp-4], r2          add r1, r1, r3
          add r1, r2, r3          st  r3, [fp-8]
          st  r3, [fp-8]
      • Achieved by having each VarSTO remember its register, and vice versa
          R = Machine.smartGetReg(varSTO);
        - If called on an STO that has a register, just returns its register
        - If all registers are in use, chooses a register to free
      • Keeps the memory-based model; a modular change

  11. smartGetReg
      • Must free all at end of function
      • Can also do lazy stores, but not as big a win

      Register smartGetReg(STO var) {
        if (!(reg = varFile.varsReg(var))) {    // var is not in a register yet
          if (!(reg = getReg())) {              // no free register left
            ovar = varFile.stalestVar();        // remove LRU var/reg from file
            oreg = varFile.varsReg(ovar);
            varFile.remVar(ovar);
            freeReg(oreg);
            reg = getReg();                     // guaranteed to be the same
          }
          varFile.put(var, reg);
          emitLoad(var, reg);                   // DOES PREEMPTIVE LOAD!!
        }
        varFile.markUsed(var);                  // updates LRU ‘timestamp’
        return reg;
      }
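
A runnable Python sketch of the same policy follows, for experimenting with the LRU behaviour. The class `RegisterFile`, its method names and the `emitted` list are assumptions made for this sketch; only the logic (reuse a variable's register, otherwise take a free one, otherwise evict the least recently used variable, then load) follows the slide. Like the slide's memory-based model, it assumes values are already stored in memory, so eviction emits no store.

```python
from collections import OrderedDict


class RegisterFile:
    """Greedy LRU register assignment, modelled on the slide's smartGetReg."""

    def __init__(self, registers):
        self.free = list(registers)      # registers not holding any variable
        self.var_reg = OrderedDict()     # var -> reg, least recently used first
        self.emitted = []                # loads "emitted" so far

    def smart_get_reg(self, var):
        if var not in self.var_reg:
            if not self.free:
                # Evict the stalest variable and recycle its register.
                _old_var, old_reg = self.var_reg.popitem(last=False)
                self.free.append(old_reg)
            reg = self.free.pop(0)
            self.var_reg[var] = reg
            self.emitted.append(f"ld {var}, {reg}")   # preemptive load
        self.var_reg.move_to_end(var)    # refresh the LRU 'timestamp'
        return self.var_reg[var]


# x := y + y needs only one load of y under this policy.
rf = RegisterFile(["r1", "r2"])
first = rf.smart_get_reg("y")
second = rf.smart_get_reg("y")
```

Both calls return `r1` and a single load is recorded, which is the saving shown in the greedy example for `x := y + y`.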

  12. Global Register Allocation
      Actually examine temporal locality
      • Values likely to be used the most
      • In the shortest span of instructions
      In short, reference density
      • Static: number of var references in the code
      • Dynamic: number of var references that occur at run time (loops)
      Two challenges
      • Determining (guessing) dynamic density
        – estimate from static, or from profiling via test cases
      • Choosing an allocation that gets the most dense variables (most references) into registers
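
One way to "estimate from static" can be sketched as follows: count each reference once for the static figure, and weight it by loop nesting depth for a guessed dynamic figure. The weighting of 10 per nesting level is purely an assumption of this sketch (a common rule of thumb, not something the slides specify), as are the function name and the input encoding.

```python
from collections import Counter


def reference_counts(stmts, loop_weight=10):
    """Return (static, estimated dynamic) reference counts per variable.

    stmts: list of (loop_depth, [variables referenced]) pairs.
    loop_weight: assumed number of iterations per loop level (a guess).
    """
    static = Counter()
    dynamic = Counter()
    for depth, refs in stmts:
        for var in refs:
            static[var] += 1
            dynamic[var] += loop_weight ** depth
    return static, dynamic


stmts = [
    (0, ["i"]),                  # i := 0
    (1, ["s", "s", "a", "i"]),   # s := s + a[i]   (inside a loop)
    (1, ["i", "i"]),             # i := i + 1      (inside a loop)
    (0, ["s"]),                  # print s
]
static, dynamic = reference_counts(stmts)
```

Profiling would replace the guessed weights with measured execution counts; the ranking of variables is what the allocator consumes either way.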

  13. Graph-Coloring Register Allocation
      Schedule variables into registers like classes into rooms
      • Schedule the most classes possible for available rooms
      • Can’t have two classes (variables) in a room (register) at same time
          lifetime:           x  y  w  z
          [1] z := 1
          [2] x := 2 * z               |
          [3] y := 3 * z      |        |
          [4] w := x + y      |  |     |
          [5] z := y + z         |  |  |
          [6] x := y * w         |  |
          try:                r1 r2 r3 ?
          or:                 r1 r2 r1 r3

  14. Graph-Coloring Register Allocation
      Create conflict graph; an edge means “cannot be scheduled in same ‘room’ (register) because (life)times overlap”
                              x1 y1 w1 z1
          [1] z := 1
          [2] x := 2 * z               |
          [3] y := 3 * z      |        |
          [4] w := x + y      |  |     |
          [5] z := y + z         |  |  |
          [6] x := y * w         |  |
      Conflict graph nodes, with assigned registers:
          x: r3 (2c, 1/2d)     w: r3 (2c, 1/3d)
          y: r1 (3c, 3/4d)     z: r2 (3c, 1d)
      Repeat until all nodes are allocated:
      • Select node from unallocated set*
      • Give register not in use by any neighbor
      • Remove node from unallocated set
      *prioritize to non-trivial, dense:
        a. constrained nodes (more edges than regs)
        b. nodes with higher reference density
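
The conflict graph and the allocation loop can be sketched in Python as below. The sketch makes several assumptions that are not in the slides: each variable is modelled by a single live range `(definition line, last use line)`, a range that ends on the line where another begins does not conflict with it (so `x` and `w` can share), nodes are ordered by number of conflicts and then by density, and the density values are simply the `d` figures read off the slide's node labels. All function names are hypothetical.

```python
def build_conflicts(ranges):
    """ranges: var -> (def_line, last_use_line). Returns var -> set of neighbors."""
    conflicts = {v: set() for v in ranges}
    names = list(ranges)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            (a_def, a_end), (b_def, b_end) = ranges[a], ranges[b]
            # Strict comparison: a range ending at line n does not clash
            # with a range that is defined at line n.
            if a_def < b_end and b_def < a_end:
                conflicts[a].add(b)
                conflicts[b].add(a)
    return conflicts


def color(conflicts, density, registers):
    """Greedy coloring: most constrained first, then highest density."""
    order = sorted(conflicts, key=lambda v: (-len(conflicts[v]), -density[v]))
    alloc = {}
    for v in order:
        taken = {alloc[n] for n in conflicts[v] if n in alloc}
        free = [r for r in registers if r not in taken]
        # None marks a node left without a register; the slides do not
        # say what happens in that case.
        alloc[v] = free[0] if free else None
    return alloc


# Live ranges of the six-line example, first lifetime of each variable only.
ranges = {"x": (2, 4), "y": (3, 6), "w": (4, 6), "z": (1, 5)}
density = {"x": 1 / 2, "y": 3 / 4, "w": 1 / 3, "z": 1}
conflicts = build_conflicts(ranges)
alloc = color(conflicts, density, ["r1", "r2", "r3"])
```

The computed graph has two conflicts for `x` and `w` and three for `y` and `z`, matching the `c` counts on the slide, and `x` and `w` end up sharing a register. Which register number each node receives depends on the visiting order, so the names can differ from the slide while the sharing pattern stays the same.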

  15. Global Register Allocation
          [1] r2 := 1
          [2] r3 := 2 * r2
          [3] r1 := 3 * r2
          [4] r3 := r3 + r1     ! x and w “share”
          [5] z := r1 + r2
          [6] x := r1 * r3
      • Entire calculation done in registers
      • Note that this is done on the IL – restricted/typed temps
        - Accuracy of allocation depends on mapping to actual assembly

  16. Lessons Learned
      Significant benefits possible at little cost or complexity
      Modelling (formalization) of problems
      • 3-addr code, BB, CFG, dependence graphs
      Clarifies structure
      • Exposes differences, similarities, (mis)matches
      • Identifies opportunities (for optimization)
      • Efficient and simple algorithms available “off the shelf”
      Also useful for thinking, software design, debugging, etc.
