
EECS 583 Class 10 Code Generation, University of Michigan, October 6, 2014



  1. EECS 583 – Class 10
     Code Generation
     University of Michigan
     October 6, 2014

  2. Announcements
     ❖ Reminder: HW 2
       » Due this Thursday. You should have started by now.
     ❖ Class project proposals
       » Think about partners/topic!

  3. Course Project – Time to Start Thinking About This
     ❖ Mission statement: Design and implement something “interesting” in a compiler
       » LLVM preferred, but others are fine
       » Groups of 2-4 people (1 or 5 persons is possible in some cases)
       » Extend existing research paper or go out on your own
     ❖ Topic areas
       » Dynamic optimization
       » Approximate computing
       » Memory system optimization
       » Machine learning for compilation
       » Automatic parallelization/SIMDization
       » Compiling for GPU/GPU-like architecture
       » Creating custom processors
       » Reliability
       » Energy

  4. Course Projects – Timetable
     ❖ Now
       » Start thinking about potential topics, identify group members
     ❖ Oct 20-22 (week after fall break): Project proposals
       » No class that week
       » Chang-hung and I will meet with each group, slot signups in class Oct 15
       » Ideas/proposal discussed at meeting
       » Written proposal (a paragraph or 2 plus some references) due Monday, Oct 29 from each group
     ❖ Nov 3 – Dec 3: Research presentations
       » Each group presents a research paper related to their project (20 mins + 5 mins Q&A)
     ❖ Late Nov: Project checkpoint
       » Update on your progress, what is left to do
     ❖ Dec 8-12: Project demos
       » Each group, 30 min slot: presentation/demo/whatever you like
       » Turn in short report on your project

  5. Class Problem from Last Time → Answer
     Optimize this applying induction variable strength reduction.

     Original:
         r1 = 0
         r2 = 0

         r5 = r5 + 1
         r11 = r5 * 2
         r10 = r11 + 2
         r12 = load(r10+0)
         r9 = r1 << 1
         r4 = r9 - 10
         r3 = load(r4+4)
         r3 = r3 + 1
         store(r4+0, r3)
         r7 = r3 << 2
         r6 = load(r7+0)
         r13 = r2 - 1
         r1 = r1 + 1
         r2 = r2 + 1

         liveout: r13, r12, r6, r10

     Answer:
         r1 = 0
         r2 = 0
         r111 = r5 * 2
         r109 = r1 << 1
         r113 = r2 - 1

         r5 = r5 + 1
         r111 = r111 + 2
         r11 = r111
         r10 = r11 + 2
         r12 = load(r10+0)
         r9 = r109
         r4 = r9 - 10
         r3 = load(r4+4)
         r3 = r3 + 1
         store(r4+0, r3)
         r7 = r3 << 2
         r6 = load(r7+0)
         r13 = r113
         r1 = r1 + 1
         r109 = r109 + 2
         r2 = r2 + 1
         r113 = r113 + 1

         liveout: r13, r12, r6, r10

     Note: after copy propagation, r10 and r4 can be strength reduced as well.

  6. ILP Optimization
     ❖ Traditional optimizations
       » Redundancy elimination
       » Reducing operation count
     ❖ ILP (instruction-level parallelism) optimizations
       » Increase the amount of parallelism and the ability to overlap operations
       » Operation count is secondary; often trade extra instructions for parallelism (but avoid code explosion)
     ❖ ILP increased by breaking dependences
       » True or flow = read after write dependence
       » False (anti/output) = write after read, write after write
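The three dependence kinds can be made concrete with a small sketch. The `Op` tuple and the `classify` helper are hypothetical names introduced only for illustration; the check looks at register names alone and assumes no other operation between the two redefines the registers involved.

```python
from typing import NamedTuple, Set, Tuple


class Op(NamedTuple):
    dest: str               # register written by the operation
    srcs: Tuple[str, ...]   # registers read by the operation


def classify(first: Op, second: Op) -> Set[str]:
    """Register dependences from `first` to a later operation `second`."""
    deps = set()
    if first.dest in second.srcs:
        deps.add("flow")    # read after write (true dependence)
    if second.dest in first.srcs:
        deps.add("anti")    # write after read (false dependence)
    if first.dest == second.dest:
        deps.add("output")  # write after write (false dependence)
    return deps


mul = Op("r5", ("r1", "r3"))    # r5 = r1 * r3
acc = Op("r6", ("r6", "r5"))    # r6 = r6 + r5
mul2 = Op("r5", ("r1", "r3"))   # a later r5 = r1 * r3 that reuses the name r5

print(classify(mul, acc))    # flow dependence through r5
print(classify(acc, mul2))   # anti dependence on r5
print(classify(mul, mul2))   # output dependence on r5
```

Only the flow dependence carries a value. The anti and output dependences come from reusing the name r5, which is why renaming can remove them.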

  7. Back Substitution
     ❖ Generation of expressions by compiler frontends is very sequential
       » Account for operator precedence
       » Apply left-to-right within same precedence
     ❖ Back substitution
       » Create larger expressions
         • Iteratively substitute RHS expression for LHS variable
       » Note: may correspond to multiple source statements
       » Enable subsequent optis
     ❖ Optimization
       » Re-compute expression in a more favorable manner

     Example: y = a + b + c - d + e - f;
         r9 = r1 + r2
         r10 = r9 + r3
         r11 = r10 - r4
         r12 = r11 + r5
         r13 = r12 - r6

     Subs r12:  r13 = r11 + r5 - r6
     Subs r11:  r13 = r10 - r4 + r5 - r6
     Subs r10:  r13 = r9 + r3 - r4 + r5 - r6
     Subs r9:   r13 = r1 + r2 + r3 - r4 + r5 - r6
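Back substitution on the example chain can be sketched as follows. This is a minimal illustration, not the course's implementation: it assumes only + and - operations, that every temporary is defined exactly once, and it represents an expression as a list of (sign, operand) pairs. The names `back_substitute` and `show` are hypothetical.

```python
def back_substitute(ops, target):
    """Expand `target` into signed primary inputs.

    ops: list of (dest, [(sign, operand), ...]); each dest is defined once.
    """
    defs = dict(ops)

    def expand(name, sign):
        if name not in defs:           # a primary input, nothing to substitute
            return [(sign, name)]
        terms = []
        for s, operand in defs[name]:  # substitute the RHS for the LHS variable
            terms.extend(expand(operand, sign * s))
        return terms

    return expand(target, +1)


def show(terms):
    """Render signed terms as a flat expression string."""
    text = ""
    for i, (sign, name) in enumerate(terms):
        if i == 0:
            text += name if sign > 0 else "-" + name
        else:
            text += (" + " if sign > 0 else " - ") + name
    return text


chain = [
    ("r9", [(+1, "r1"), (+1, "r2")]),
    ("r10", [(+1, "r9"), (+1, "r3")]),
    ("r11", [(+1, "r10"), (-1, "r4")]),
    ("r12", [(+1, "r11"), (+1, "r5")]),
    ("r13", [(+1, "r12"), (-1, "r6")]),
]
print("r13 =", show(back_substitute(chain, "r13")))
# r13 = r1 + r2 + r3 - r4 + r5 - r6
```

Tracking signs explicitly keeps the substitution correct even when the substituted temporary appears on the right of a subtraction, where a purely textual replacement would need parentheses.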

  8. Tree Height Reduction
     ❖ Re-compute expression as a balanced binary tree
       » Obey precedence rules
       » Essentially re-parenthesize
       » Combine literals if possible
     ❖ Effects
       » Height reduced (n terms, assuming unit latency)
         • From n-1
         • To ceil(log2(n))
       » Number of operations remains constant
       » Cost
         • Temporary registers “live” longer
       » Watch out for
         • Always ok for integer arithmetic
         • Floating-point: may not be!!

     original:
         r9 = r1 + r2
         r10 = r9 + r3
         r11 = r10 - r4
         r12 = r11 + r5
         r13 = r12 - r6

     after back subs:
         r13 = r1 + r2 + r3 - r4 + r5 - r6

     balanced tree:
         r13 = ((r1 + r2) + (r3 - r4)) + (r5 - r6)

     final code:
         t1 = r1 + r2
         t2 = r3 - r4
         t3 = r5 - r6
         t4 = t1 + t2
         r13 = t4 + t3
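The re-parenthesization can be sketched as a level-by-level pairing of adjacent terms. This is an illustrative sketch under stated assumptions: only + and - terms, unit latency, at least two terms, and at least one positive term (so the root of the tree is never a negated value). The function name `tree_height_reduce` and the temporary names t1, t2, ... are hypothetical.

```python
from itertools import count


def tree_height_reduce(terms, target):
    """Pair adjacent signed terms level by level.

    terms: list of (sign, name). Returns (code, height), where code is a
    list of (dest, left, op, right) tuples in emission order.
    """
    if len(terms) < 2 or all(sign < 0 for sign, _ in terms):
        raise ValueError("need at least two terms and one positive term")
    temps = count(1)
    code, level, height = [], list(terms), 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            (s1, a), (s2, b) = level[i], level[i + 1]
            dest = target if len(level) == 2 else f"t{next(temps)}"
            if s1 > 0:                       # a + b  or  a - b
                code.append((dest, a, "+" if s2 > 0 else "-", b))
                sign = +1
            elif s2 > 0:                     # -a + b  becomes  b - a
                code.append((dest, b, "-", a))
                sign = +1
            else:                            # -a - b  becomes  -(a + b)
                code.append((dest, a, "+", b))
                sign = -1
            nxt.append((sign, dest))
        if len(level) % 2:                   # odd term carries to next level
            nxt.append(level[-1])
        level, height = nxt, height + 1
    return code, height


terms = [(+1, "r1"), (+1, "r2"), (+1, "r3"), (-1, "r4"), (+1, "r5"), (-1, "r6")]
code, height = tree_height_reduce(terms, "r13")
for dest, left, op, right in code:
    print(f"{dest} = {left} {op} {right}")
print("height:", height)
```

For the six-term example this emits the same five operations as the final code on the slide, with height 3 instead of 5. The operation count is unchanged; only the dependence chain is shorter.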

  9. Class Problem
     Assume latencies: + = 1, * = 3

     Operand arrival times:
         r1  r2  r3  r4  r5  r6
         0   0   0   1   2   0

     Code:
         r10 = r1 * r2
         r11 = r10 + r3
         r12 = r11 + r4
         r13 = r12 - r5
         r14 = r13 + r6

     ❖ Back substitute
     ❖ Re-express in tree-height reduced form
     ❖ Account for latency and arrival times
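One way to account for latency and arrival times is a greedy heuristic that always combines the two terms that become ready first. The sketch below is an assumption-laden illustration, not an official answer: it assumes subtraction has the same latency as addition, unlimited function units, and the hypothetical helper names `earliest_finish` and `sequential_finish`.

```python
import heapq

ADD_LATENCY = 1   # latency of + (and, by assumption, of -)
MUL_LATENCY = 3   # latency of *


def earliest_finish(ready_times):
    """Greedily add the two earliest-ready terms until one value remains."""
    heap = list(ready_times)
    heapq.heapify(heap)
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        heapq.heappush(heap, max(a, b) + ADD_LATENCY)
    return heap[0]


def sequential_finish(first_ready, later_arrivals):
    """Finish time of the original left-to-right chain."""
    t = first_ready
    for arrival_time in later_arrivals:
        t = max(t, arrival_time) + ADD_LATENCY
    return t


arrival = {"r1": 0, "r2": 0, "r3": 0, "r4": 1, "r5": 2, "r6": 0}
product_ready = max(arrival["r1"], arrival["r2"]) + MUL_LATENCY   # r1 * r2
others = [arrival[r] for r in ("r3", "r4", "r5", "r6")]

print(sequential_finish(product_ready, others))   # 7
print(earliest_finish([product_ready] + others))  # 4
```

Under these assumptions the greedy order corresponds to t1 = r1 * r2, t2 = r3 + r6, t3 = t2 + r4, t4 = t3 - r5, r14 = t1 + t4: the cheap additions are done while the multiply is still in flight.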

  10. Optimizing Unrolled Loops
     Unroll = replicate loop body n-1 times.
     Hope to enable overlap of operation execution from different iterations.
     Not possible!

     Original loop:
         loop:  r1 = load(r2)
                r3 = load(r4)
                r5 = r1 * r3
                r6 = r6 + r5
                r2 = r2 + 4
                r4 = r4 + 4
                if (r4 < 400) goto loop

     Unrolled 3 times:
         loop:
         iter1: r1 = load(r2)
                r3 = load(r4)
                r5 = r1 * r3
                r6 = r6 + r5
                r2 = r2 + 4
                r4 = r4 + 4
         iter2: r1 = load(r2)
                r3 = load(r4)
                r5 = r1 * r3
                r6 = r6 + r5
                r2 = r2 + 4
                r4 = r4 + 4
         iter3: r1 = load(r2)
                r3 = load(r4)
                r5 = r1 * r3
                r6 = r6 + r5
                r2 = r2 + 4
                r4 = r4 + 4
                if (r4 < 400) goto loop

  11. Register Renaming on Unrolled Loop

         Before renaming                After renaming
         loop:                          loop:
         iter1: r1 = load(r2)           iter1: r1 = load(r2)
                r3 = load(r4)                  r3 = load(r4)
                r5 = r1 * r3                   r5 = r1 * r3
                r6 = r6 + r5                   r6 = r6 + r5
                r2 = r2 + 4                    r2 = r2 + 4
                r4 = r4 + 4                    r4 = r4 + 4
         iter2: r1 = load(r2)           iter2: r11 = load(r2)
                r3 = load(r4)                  r13 = load(r4)
                r5 = r1 * r3                   r15 = r11 * r13
                r6 = r6 + r5                   r6 = r6 + r15
                r2 = r2 + 4                    r2 = r2 + 4
                r4 = r4 + 4                    r4 = r4 + 4
         iter3: r1 = load(r2)           iter3: r21 = load(r2)
                r3 = load(r4)                  r23 = load(r4)
                r5 = r1 * r3                   r25 = r21 * r23
                r6 = r6 + r5                   r6 = r6 + r25
                r2 = r2 + 4                    r2 = r2 + 4
                r4 = r4 + 4                    r4 = r4 + 4
                if (r4 < 400) goto loop        if (r4 < 400) goto loop

  12. Register Renaming is Not Enough!
     ❖ Still not much overlap possible
     ❖ Problems
       » r2, r4, r6 sequentialize the iterations
       » Need to rename these
     ❖ 2 specialized renaming optis
       » Accumulator variable expansion (r6)
       » Induction variable expansion (r2, r4)

         loop:
         iter1: r1 = load(r2)
                r3 = load(r4)
                r5 = r1 * r3
                r6 = r6 + r5
                r2 = r2 + 4
                r4 = r4 + 4
         iter2: r11 = load(r2)
                r13 = load(r4)
                r15 = r11 * r13
                r6 = r6 + r15
                r2 = r2 + 4
                r4 = r4 + 4
         iter3: r21 = load(r2)
                r23 = load(r4)
                r25 = r21 * r23
                r6 = r6 + r25
                r2 = r2 + 4
                r4 = r4 + 4
                if (r4 < 400) goto loop

  13. Accumulator Variable Expansion
     ❖ Accumulator variable
       » x = x + y or x = x - y
       » where y is loop variant!!
     ❖ Create n-1 temporary accumulators
     ❖ Each iteration targets a different accumulator
     ❖ Sum up the accumulator variables at the end
     ❖ May not be safe for floating-point values

                r16 = r26 = 0
         loop:
         iter1: r1 = load(r2)
                r3 = load(r4)
                r5 = r1 * r3
                r6 = r6 + r5
                r2 = r2 + 4
                r4 = r4 + 4
         iter2: r11 = load(r2)
                r13 = load(r4)
                r15 = r11 * r13
                r16 = r16 + r15
                r2 = r2 + 4
                r4 = r4 + 4
         iter3: r21 = load(r2)
                r23 = load(r4)
                r25 = r21 * r23
                r26 = r26 + r25
                r2 = r2 + 4
                r4 = r4 + 4
                if (r4 < 400) goto loop
                r6 = r6 + r16 + r26
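The effect of accumulator expansion, including the floating-point caveat, can be seen in a small sketch. The function names are hypothetical, and the expanded version assumes the input length is a multiple of 3 (the unroll factor used on the slide).

```python
def dot_sequential(a, b):
    """One accumulator: every add waits for the previous add."""
    acc = 0
    for x, y in zip(a, b):
        acc = acc + x * y
    return acc


def dot_expanded(a, b):
    """Unrolled by 3 with one accumulator per unrolled iteration.

    Assumes len(a) == len(b) and that the length is a multiple of 3.
    """
    r6 = r16 = r26 = 0
    for i in range(0, len(a), 3):
        r6 = r6 + a[i] * b[i]               # iter1
        r16 = r16 + a[i + 1] * b[i + 1]     # iter2, independent of r6
        r26 = r26 + a[i + 2] * b[i + 2]     # iter3, independent of r6 and r16
    return r6 + r16 + r26                   # sum the accumulators at the end


ia, ib = [1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]
print(dot_sequential(ia, ib) == dot_expanded(ia, ib))   # integers agree

fa = [1e16, 1.0, 1.0, -1e16, 1.0, 1.0]
fb = [1.0] * 6
print(dot_sequential(fa, fb), dot_expanded(fa, fb))     # 2.0 versus 4.0
```

Integer addition is associative, so both versions agree. With floating-point values the expansion changes the order of the additions, and rounding makes the two results differ in this example.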

  14. Induction Variable Expansion
     ❖ Induction variable
       » x = x + y or x = x - y
       » where y is loop invariant!!
     ❖ Create n-1 additional induction variables
     ❖ Each iteration uses and modifies a different induction variable
     ❖ Initialize induction variables to init, init+step, init+2*step, etc.
     ❖ Step increased to n*original step
     ❖ Now iterations are completely independent!!

                r12 = r2 + 4, r22 = r2 + 8
                r14 = r4 + 4, r24 = r4 + 8
                r16 = r26 = 0
         loop:
         iter1: r1 = load(r2)
                r3 = load(r4)
                r5 = r1 * r3
                r6 = r6 + r5
                r2 = r2 + 12
                r4 = r4 + 12
         iter2: r11 = load(r12)
                r13 = load(r14)
                r15 = r11 * r13
                r16 = r16 + r15
                r12 = r12 + 12
                r14 = r14 + 12
         iter3: r21 = load(r22)
                r23 = load(r24)
                r25 = r21 * r23
                r26 = r26 + r25
                r22 = r22 + 12
                r24 = r24 + 12
                if (r4 < 400) goto loop
                r6 = r6 + r16 + r26

  15. Better Induction Variable Expansion
     ❖ With base+displacement addressing, often don’t need additional induction variables
       » Just change offsets in each iteration to reflect step
       » Change final increments to n * original step

                r16 = r26 = 0
         loop:
         iter1: r1 = load(r2)
                r3 = load(r4)
                r5 = r1 * r3
                r6 = r6 + r5
         iter2: r11 = load(r2+4)
                r13 = load(r4+4)
                r15 = r11 * r13
                r16 = r16 + r15
         iter3: r21 = load(r22+0)
                r23 = load(r4+8)
                r25 = r21 * r23
                r26 = r26 + r25
                r2 = r2 + 12
                r4 = r4 + 12
                if (r4 < 400) goto loop
                r6 = r6 + r16 + r26
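The following sketch checks that the three ways of stepping an induction variable touch the same addresses. It models a single pointer register with a do-while loop shaped like the slides' loops; the function names, the `start`/`limit` parameters, and the address trace are hypothetical constructs for illustration.

```python
def addresses_unrolled(start, limit, step=4, n=3):
    """Plain unrolled loop: every copy waits for the previous increment."""
    r2, out = start, []
    while True:
        for _ in range(n):
            out.append(r2)        # load(r2)
            r2 += step            # r2 = r2 + 4
        if not r2 < limit:
            return out


def addresses_expanded(start, limit, step=4, n=3):
    """n induction variables initialized to init, init+step, init+2*step."""
    ivs = [start + k * step for k in range(n)]   # r2, r12, r22
    out = []
    while True:
        for k in range(n):
            out.append(ivs[k])    # load(r2), load(r12), load(r22)
            ivs[k] += n * step    # independent increments of n * step
        if not ivs[0] < limit:
            return out


def addresses_displacement(start, limit, step=4, n=3):
    """One induction variable plus base+displacement offsets."""
    r2, out = start, []
    while True:
        for k in range(n):
            out.append(r2 + k * step)   # load(r2+0), load(r2+4), load(r2+8)
        r2 += n * step                  # single final increment
        if not r2 < limit:
            return out


trace = addresses_unrolled(0, 400)
print(trace == addresses_expanded(0, 400) == addresses_displacement(0, 400))
```

In the expanded and displacement forms, no load address depends on an increment from an earlier copy in the same trip, which is what allows the copies to overlap.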

  16. Homework Problem
     Optimize the unrolled loop:
     ❖ Renaming
     ❖ Tree height reduction
     ❖ Ind/Acc expansion

     Original loop:
         loop:  r1 = load(r2)
                r5 = r6 + 3
                r6 = r5 + r1
                r2 = r2 + 4
                if (r2 < 400) goto loop

     Unrolled loop:
         loop:  r1 = load(r2)
                r5 = r6 + 3
                r6 = r5 + r1
                r2 = r2 + 4
                r1 = load(r2)
                r5 = r6 + 3
                r6 = r5 + r1
                r2 = r2 + 4
                r1 = load(r2)
                r5 = r6 + 3
                r6 = r5 + r1
                r2 = r2 + 4
                if (r2 < 400) goto loop

  17. Code Generation
     ❖ Map optimized “machine-independent” assembly to final assembly code
     ❖ Input code
       » Classical optimizations
       » ILP optimizations
       » Formed regions (sbs, hbs), applied if-conversion (if appropriate)
     ❖ Virtual → physical binding
       » 2 big steps
       » 1. Scheduling
         • Determine when every operation executes
         • Create MultiOps
       » 2. Register allocation
         • Map virtual → physical registers
         • Spill to memory if necessary
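The scheduling step can be illustrated with a very small as-soon-as-possible sketch that groups operations issued in the same cycle into one MultiOp. This is a simplification under loud assumptions, not the scheduler used in the course: unlimited function units, unit latency by default, each register written once, and only flow dependences considered. `asap_schedule` is a hypothetical name.

```python
from collections import defaultdict


def asap_schedule(ops, latency=1):
    """Place each op in the earliest cycle at which its sources are ready.

    ops: list of (dest, srcs) in program order; each dest is written once.
    Returns a list of MultiOps, each a list of the dests issued together.
    """
    ready = {}                    # register -> cycle its value is available
    multiops = defaultdict(list)
    for dest, srcs in ops:
        # Registers never written in `ops` are inputs, available at cycle 0.
        cycle = max((ready.get(s, 0) for s in srcs), default=0)
        multiops[cycle].append(dest)
        ready[dest] = cycle + latency
    return [multiops[c] for c in sorted(multiops)]


sequential = [
    ("r9", ("r1", "r2")), ("r10", ("r9", "r3")), ("r11", ("r10", "r4")),
    ("r12", ("r11", "r5")), ("r13", ("r12", "r6")),
]
balanced = [
    ("t1", ("r1", "r2")), ("t2", ("r3", "r4")), ("t3", ("r5", "r6")),
    ("t4", ("t1", "t2")), ("r13", ("t4", "t3")),
]
print(asap_schedule(sequential))   # five MultiOps with one op each
print(asap_schedule(balanced))     # three MultiOps
```

Applied to the tree height reduction example, the sequential chain needs five cycles while the balanced form fits in three, which shows how the earlier ILP optimizations pay off at scheduling time. A real scheduler also has to respect resource limits and false dependences.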
