limits of superscalar architecture
play

Limits of Superscalar Architecture Virendra Singh Associate - PowerPoint PPT Presentation

Limits of Superscalar Architecture Virendra Singh Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay http://www.ee.iitb.ac.in/~viren/ E-mail:


  1. Limits of Superscalar Architecture Virendra Singh Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay http://www.ee.iitb.ac.in/~viren/ E-mail: viren@ee.iitb.ac.in CS-683: Advanced Computer Architecture Lecture 21 (09 Oct 2013) CADSL

  2. VLIW Processors • Package multiple operations into one instruction • Example VLIW processor: – One integer instruction (or branch) – Two independent floating-point operations – Two independent memory references  Must be enough parallelism in code to fill the available slots CADSL 09 Oct 2013 CS-683@IITB 2

  3. VLIW Processors • Disadvantages: – Statically finding parallelism – Code size – No hazard detection hardware – Binary code compatibility CADSL 09 Oct 2013 CS-683@IITB 3

  4. Summary: Multiple Issue CADSL 09 Oct 2013 CS-683@IITB 4

  5. Recap: Advanced Superscalars • Even simple branch prediction can be quite effective • Path-based predictors can achieve >95% accuracy • BTB redirects control flow early in pipe, BHT cheaper per entry but must wait for instruction decode • Branch mispredict recovery requires snapshots of pipeline state to reduce penalty • Unified physical register file design, avoids reading data from multiple locations (ROB+arch regfile) • Superscalars can rename multiple dependent instructions in one clock cycle • Need speculative store buffer to avoid waiting for stores to commit CADSL 09 Oct 2013 5 CS-683@IITB

  6. Recap: Branch Prediction and Speculative Execution Update predictors kill Branch Branch kill Resolution Prediction kill kill Decode & PC Fetch Reorder Buffer Commit Rename Reg. File Branch Store ALU MEM D$ Unit Buffer CADSL Execute 09 Oct 2013 6 CS-683@IITB

  7. Little ’ s Law Parallelism = Throughput * Latency or = × N T L Throughput per Cycle One Operation Latency in Cycles CADSL 09 Oct 2013 7 CS-683@IITB

  8. Example Pipelined ILP Machine Max Throughput, Six Instructions per Cycle One Pipeline Stage Two Integer Units, Latency in Single Cycle Latency Cycles Two Load/Store Units, Three Cycle Latency Two Floating-Point Units, Four Cycle Latency • How much instruction-level parallelism (ILP) required to keep machine pipelines busy? + + (2x1 2x3 2x4) 2 2 T = = = = × = 6 L 2 N 6 2 1 6 6 3 3 CADSL 09 Oct 2013 8 CS-683@IITB

  9. Superscalar Control Logic Scaling Issue Width W Issue Group Previously Issued Lifetime L Instructions • Each issued instruction must check against W*L instructions, i.e., growth in hardware ∝ W*(W*L) • For in-order machines, L is related to pipeline latencies • For out-of-order machines, L also includes time spent in instruction buffers (instruction window or ROB) • As W increases, larger instruction window is needed to find enough parallelism to keep machine busy => greater L => Out-of-order control logic grows faster than W2 (~W3) CADSL 09 Oct 2013 9 CS-683@IITB

  10. Superscalar Scenario • Interest in multiple-issue because wanted to improve performance without affecting uniprocessor programming model • Taking advantage of ILP is conceptually simple, but design problems are amazingly complex in practice • Conservative in ideas, just faster clock and bigger • Processors of last 10 years (Pentium 4, IBM Power 5, AMD Opteron) have the same basic structure and similar sustained issue rates (3 to 4 instructions per clock) as the 1st dynamically scheduled, multiple-issue processors announced in 1995 – Clocks 10 to 20X faster, caches 4 to 8X bigger, 2 to 4X as many renaming registers, and 2X as many load- store units  performance 8 to 16X • Peak v. delivered performance gap increasing CADSL 09 Oct 2013 CS-683@IITB 10

  11. Out-of-Order Control Complexity: MIPS R10000 Control Logic [ SGI/MIPS Technologies Inc., 1995 ] CADSL 09 Oct 2013 11 CS-683@IITB

  12. Sequential ISA Bottleneck Sequential Sequential Superscalar compiler source code machine code a = foo(b); for (i=0, i< Find independent Schedule operations operations Superscalar processor Schedule Check instruction execution dependencies CADSL 09 Oct 2013 12 CS-683@IITB

  13. For most apps, most execution units lie idle For an 8-way superscalar. From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism, ISCA 1995. CADSL 09 Oct 2013 CS-683@IITB 13

  14. Limits to ILP • Conflicting studies of amount Benchmarks (vectorized Fortran FP vs. integer C programs) – Hardware sophistication – Compiler sophistication – • How much ILP is available using existing mechanisms with increasing HW budgets? • Do we need to invent new HW/SW mechanisms to keep on processor performance curve? – Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock – Motorola AltaVec: 128 bit ints and FPs – Supersparc Multimedia ops, etc. – CADSL 09 Oct 2013 CS-683@IITB 14

  15. Overcoming Limits • Advances in compiler technology + significantly new and different hardware techniques may be able to overcome limitations assumed in studies • However, unlikely such advances when coupled with realistic hardware will overcome these limits in near future CADSL 09 Oct 2013 CS-683@IITB 15

  16. Limits to ILP Initial HW Model here; MIPS compilers. Assumptions for ideal/perfect machine to start: 1. Register renaming – infinite virtual registers  all register WAW & WAR hazards are avoided 2. Branch prediction – perfect; no mispredictions 3. Jump prediction – all jumps perfectly predicted (returns, case statements) 2 & 3  no control dependencies; perfect speculation & an unbounded buffer of instructions available 4. Memory-address alias analysis – addresses known & a load can be moved before a store provided addresses not equal; 1&4 eliminates all but RAW Also: perfect caches; 1 cycle latency for all instructions (FP *,/); unlimited instructions issued/clock cycle; CADSL 09 Oct 2013 CS-683@IITB 16

  17. Limits to ILP HW Model comparison Model Power 5 Instructions Issued Infinite 4 per clock Instruction Window Infinite 200 Size Renaming Infinite 48 integer + Registers 40 Fl. Pt. Branch Prediction Perfect 2% to 6% misprediction (Tournament Branch Predictor) Cache Perfect 64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias Perfect ?? Analysis CADSL 17 09 Oct 2013 CS-683@IITB

  18. Upper Limit to ILP: Ideal Machin e FP: 75 - 150 Instructions Per Clock Integer: 18 - 60 CADSL 09 Oct 2013 CS-683@IITB 18

  19. Limits to ILP HW Model comparison New Model Power 5 Model Instructions Infinite Infinite 4 Issued per clock Instruction Infinite, 2K, Infinite 200 Window Size 512, 128, 32 Renaming Infinite Infinite 48 integer + Registers 40 Fl. Pt. Branch Perfect Perfect 2% to 6% Prediction misprediction (Tournament Branch Predictor) Cache Perfect Perfect 64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias Perfect Perfect ?? CADSL 19 09 Oct 2013 CS-683@IITB

  20. More Realistic HW: Window Impact Change from Infinite window FP: 9 - 150 2048, 512, 128, 32 Click to edit the outline text format 160 ● 150 Second Outline Level – 140 Third Outline Level 119 ● Instructions Per Cl 120 – Fourth Outline Level Integer: 8 - 63 Fifth Outline ● 100 Level 75 Sixth Outline IPC ● 80 Level 63 61 60 59 55 • Seventh Outline LevelClick to edit 60 49 45 Master text styles 41 36 35 34 – Second level 40 18 16 15 15 15 14 14 ● Third level 13 12 20 11 10 10 9 9 8 8 – Fourth level 0 gcc espresso li fpppp ● Fifth level doduc tomcatv Infinite 2048 512 128 32 20 09 Oct 2013 CS-683@IITB CADSL

  21. Limits to ILP HW Model comparison New Model Power 5 Model Instructions 64 Infinite 4 Issued per clock Instruction 2048 Infinite 200 Window Size Renaming Infinite Infinite 48 integer + Registers 40 Fl. Pt. Branch Perfect vs. 8K Perfect 2% to 6% Prediction Tournament misprediction vs. 512 2-bit (Tournament Branch vs. profile vs. Predictor) none Cache Perfect Perfect 64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias Perfect Perfect ?? CADSL 21 09 Oct 2013 CS-683@IITB

  22. More Realistic HW: Branch Impact Change from Infinite window to examine to 2048 and maximum issue of 64 instructions per clock cycle FP: 15 - 45 Integer: 6 - 12 IPC BHT (512) CADSL 09 Oct 2013 CS-683@IITB 22 Profile

  23. Misprediction Rates 35% 30% 30% Misprediction Rate 23% 25% 18% 18% 20% 16% 14% 14% 15% 12% 12% 10% 6% 5% 4% 3% 5% 2% 2% 1% 1% 0% 0% tomcatv doduc fpppp li espresso gcc Profile-based 2-bit counter Tournament CADSL 09 Oct 2013 CS-683@IITB 23

  24. Limits to ILP HW Model comparison New Model Model Power 5 Instructions 64 Infinite 4 Issued per clock Instruction 2048 Infinite 200 Window Size Renaming Infinite v. 256, Infinite 48 integer + Registers 128, 64, 32, none 40 Fl. Pt. Branch 8K 2-bit Perfect Tournament Branch Prediction Predictor Cache Perfect Perfect 64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Perfect Perfect Perfect Alias CADSL 24 09 Oct 2013 CS-683@IITB

  25. More Realistic HW: Renaming Register Impact (N int + N fp) FP: 11 - 45 Change 2048 instr window, 64 instr issue, 8K 2 level Prediction Integer: 5 - 15 IPC CADSL 09 Oct 2013 CS-683@IITB 25

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend