Limits of Superscalar Architecture Virendra Singh Associate - PowerPoint PPT Presentation

Limits of Superscalar Architecture Virendra Singh Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay http://www.ee.iitb.ac.in/~viren/ E-mail: viren@ee.iitb.ac.in CS-683: Advanced Computer Architecture Lecture 21 (09 Oct 2013) CADSL

VLIW Processors • Package multiple operations into one instruction • Example VLIW processor: – One integer instruction (or branch) – Two independent floating-point operations – Two independent memory references  Must be enough parallelism in code to fill the available slots CADSL 09 Oct 2013 CS-683@IITB 2

VLIW Processors • Disadvantages: – Statically finding parallelism – Code size – No hazard detection hardware – Binary code compatibility CADSL 09 Oct 2013 CS-683@IITB 3

Summary: Multiple Issue CADSL 09 Oct 2013 CS-683@IITB 4

Recap: Advanced Superscalars • Even simple branch prediction can be quite effective • Path-based predictors can achieve >95% accuracy • BTB redirects control flow early in pipe, BHT cheaper per entry but must wait for instruction decode • Branch mispredict recovery requires snapshots of pipeline state to reduce penalty • Unified physical register file design, avoids reading data from multiple locations (ROB+arch regfile) • Superscalars can rename multiple dependent instructions in one clock cycle • Need speculative store buffer to avoid waiting for stores to commit CADSL 09 Oct 2013 5 CS-683@IITB

Recap: Branch Prediction and Speculative Execution Update predictors kill Branch Branch kill Resolution Prediction kill kill Decode & PC Fetch Reorder Buffer Commit Rename Reg. File Branch Store ALU MEM D$ Unit Buffer CADSL Execute 09 Oct 2013 6 CS-683@IITB

Little ’ s Law Parallelism = Throughput * Latency or = × N T L Throughput per Cycle One Operation Latency in Cycles CADSL 09 Oct 2013 7 CS-683@IITB

Example Pipelined ILP Machine Max Throughput, Six Instructions per Cycle One Pipeline Stage Two Integer Units, Latency in Single Cycle Latency Cycles Two Load/Store Units, Three Cycle Latency Two Floating-Point Units, Four Cycle Latency • How much instruction-level parallelism (ILP) required to keep machine pipelines busy? + + (2x1 2x3 2x4) 2 2 T = = = = × = 6 L 2 N 6 2 1 6 6 3 3 CADSL 09 Oct 2013 8 CS-683@IITB

Superscalar Control Logic Scaling Issue Width W Issue Group Previously Issued Lifetime L Instructions • Each issued instruction must check against W*L instructions, i.e., growth in hardware ∝ W*(W*L) • For in-order machines, L is related to pipeline latencies • For out-of-order machines, L also includes time spent in instruction buffers (instruction window or ROB) • As W increases, larger instruction window is needed to find enough parallelism to keep machine busy => greater L => Out-of-order control logic grows faster than W2 (~W3) CADSL 09 Oct 2013 9 CS-683@IITB

Superscalar Scenario • Interest in multiple-issue because wanted to improve performance without affecting uniprocessor programming model • Taking advantage of ILP is conceptually simple, but design problems are amazingly complex in practice • Conservative in ideas, just faster clock and bigger • Processors of last 10 years (Pentium 4, IBM Power 5, AMD Opteron) have the same basic structure and similar sustained issue rates (3 to 4 instructions per clock) as the 1st dynamically scheduled, multiple-issue processors announced in 1995 – Clocks 10 to 20X faster, caches 4 to 8X bigger, 2 to 4X as many renaming registers, and 2X as many load- store units  performance 8 to 16X • Peak v. delivered performance gap increasing CADSL 09 Oct 2013 CS-683@IITB 10

Out-of-Order Control Complexity: MIPS R10000 Control Logic [ SGI/MIPS Technologies Inc., 1995 ] CADSL 09 Oct 2013 11 CS-683@IITB

Sequential ISA Bottleneck Sequential Sequential Superscalar compiler source code machine code a = foo(b); for (i=0, i< Find independent Schedule operations operations Superscalar processor Schedule Check instruction execution dependencies CADSL 09 Oct 2013 12 CS-683@IITB

For most apps, most execution units lie idle For an 8-way superscalar. From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism, ISCA 1995. CADSL 09 Oct 2013 CS-683@IITB 13

Limits to ILP • Conflicting studies of amount Benchmarks (vectorized Fortran FP vs. integer C programs) – Hardware sophistication – Compiler sophistication – • How much ILP is available using existing mechanisms with increasing HW budgets? • Do we need to invent new HW/SW mechanisms to keep on processor performance curve? – Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock – Motorola AltaVec: 128 bit ints and FPs – Supersparc Multimedia ops, etc. – CADSL 09 Oct 2013 CS-683@IITB 14

Overcoming Limits • Advances in compiler technology + significantly new and different hardware techniques may be able to overcome limitations assumed in studies • However, unlikely such advances when coupled with realistic hardware will overcome these limits in near future CADSL 09 Oct 2013 CS-683@IITB 15

Limits to ILP Initial HW Model here; MIPS compilers. Assumptions for ideal/perfect machine to start: 1. Register renaming – infinite virtual registers  all register WAW & WAR hazards are avoided 2. Branch prediction – perfect; no mispredictions 3. Jump prediction – all jumps perfectly predicted (returns, case statements) 2 & 3  no control dependencies; perfect speculation & an unbounded buffer of instructions available 4. Memory-address alias analysis – addresses known & a load can be moved before a store provided addresses not equal; 1&4 eliminates all but RAW Also: perfect caches; 1 cycle latency for all instructions (FP *,/); unlimited instructions issued/clock cycle; CADSL 09 Oct 2013 CS-683@IITB 16

Limits to ILP HW Model comparison Model Power 5 Instructions Issued Infinite 4 per clock Instruction Window Infinite 200 Size Renaming Infinite 48 integer + Registers 40 Fl. Pt. Branch Prediction Perfect 2% to 6% misprediction (Tournament Branch Predictor) Cache Perfect 64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias Perfect ?? Analysis CADSL 17 09 Oct 2013 CS-683@IITB

Upper Limit to ILP: Ideal Machin e FP: 75 - 150 Instructions Per Clock Integer: 18 - 60 CADSL 09 Oct 2013 CS-683@IITB 18

Limits to ILP HW Model comparison New Model Power 5 Model Instructions Infinite Infinite 4 Issued per clock Instruction Infinite, 2K, Infinite 200 Window Size 512, 128, 32 Renaming Infinite Infinite 48 integer + Registers 40 Fl. Pt. Branch Perfect Perfect 2% to 6% Prediction misprediction (Tournament Branch Predictor) Cache Perfect Perfect 64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias Perfect Perfect ?? CADSL 19 09 Oct 2013 CS-683@IITB

More Realistic HW: Window Impact Change from Infinite window FP: 9 - 150 2048, 512, 128, 32 Click to edit the outline text format 160 ● 150 Second Outline Level – 140 Third Outline Level 119 ● Instructions Per Cl 120 – Fourth Outline Level Integer: 8 - 63 Fifth Outline ● 100 Level 75 Sixth Outline IPC ● 80 Level 63 61 60 59 55 • Seventh Outline LevelClick to edit 60 49 45 Master text styles 41 36 35 34 – Second level 40 18 16 15 15 15 14 14 ● Third level 13 12 20 11 10 10 9 9 8 8 – Fourth level 0 gcc espresso li fpppp ● Fifth level doduc tomcatv Infinite 2048 512 128 32 20 09 Oct 2013 CS-683@IITB CADSL

Limits to ILP HW Model comparison New Model Power 5 Model Instructions 64 Infinite 4 Issued per clock Instruction 2048 Infinite 200 Window Size Renaming Infinite Infinite 48 integer + Registers 40 Fl. Pt. Branch Perfect vs. 8K Perfect 2% to 6% Prediction Tournament misprediction vs. 512 2-bit (Tournament Branch vs. profile vs. Predictor) none Cache Perfect Perfect 64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias Perfect Perfect ?? CADSL 21 09 Oct 2013 CS-683@IITB

More Realistic HW: Branch Impact Change from Infinite window to examine to 2048 and maximum issue of 64 instructions per clock cycle FP: 15 - 45 Integer: 6 - 12 IPC BHT (512) CADSL 09 Oct 2013 CS-683@IITB 22 Profile

Misprediction Rates 35% 30% 30% Misprediction Rate 23% 25% 18% 18% 20% 16% 14% 14% 15% 12% 12% 10% 6% 5% 4% 3% 5% 2% 2% 1% 1% 0% 0% tomcatv doduc fpppp li espresso gcc Profile-based 2-bit counter Tournament CADSL 09 Oct 2013 CS-683@IITB 23

Limits to ILP HW Model comparison New Model Model Power 5 Instructions 64 Infinite 4 Issued per clock Instruction 2048 Infinite 200 Window Size Renaming Infinite v. 256, Infinite 48 integer + Registers 128, 64, 32, none 40 Fl. Pt. Branch 8K 2-bit Perfect Tournament Branch Prediction Predictor Cache Perfect Perfect 64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Perfect Perfect Perfect Alias CADSL 24 09 Oct 2013 CS-683@IITB

More Realistic HW: Renaming Register Impact (N int + N fp) FP: 11 - 45 Change 2048 instr window, 64 instr issue, 8K 2 level Prediction Integer: 5 - 15 IPC CADSL 09 Oct 2013 CS-683@IITB 25

Limits of Superscalar Architecture Virendra Singh Associate - PowerPoint PPT Presentation

Limits of Superscalar Architecture Virendra Singh Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay http://www.ee.iitb.ac.in/~viren/ E-mail:

City Limits Lions Clubs City Limits Lions Clubs City Limits Lions Clubs City Limits Lions

Different Types of Limits Besides ordinary, two-sided limits, there are one-sided limits (left-

MAT 166 Calculus for Bus/Soc Chapter 3 Notes Limits The Deriviative David J. Gisch Limits

Out- -of of- -Order Order Out Tomasulos Algorithm Superscalar CPU Superscalar CPU -

Out- -of of- -Order Order Out Superscalar CPU Superscalar CPU Cliff Frey and Vicky Liu May

Overview Computer architecture Scaling performance and CMOS 1 Trends in Microprocessor

Limits (the size of the pie) allocation limits minimum reliability flow of supply Limits

Medical Programs Overview Table 1. Caption Medical SNAP TANF Programs Income Limits Income

Scope & Limits of Scope & Limits of Scope & Limits of Legal Authority Legal

A Fault Tolerant Superscalar Processor 1 [Based on Coverage of a Microarchitecture-level

1 Register Renaming Examples Register Mapping Status Loop: Renamed dynamic instructions: R1

Superscalar Processors Raul Queiroz Feitosa Parts of these slides are from the support material

Sequential Presentation Of Long Instructions Limits of pipelining, The case for superscalar,

Superscalar Organization Nima Honarmand Spring 2016 :: CSE 502 Computer Architecture

Modeling Limits Jaroslav Neetil Patrice Ossona de Mendez Charles University CAMS, CNRS/EHESS

DB server limits (process/sessions) DB server limits (process/sessions) Carlos Fernando Gamboa,

Superscalar Organization Instructor: Nima Honarmand Spring 2015 :: CSE 502 Computer

Collaborators: Lee Armus, Danny Dale, Tanio Diaz-Santos, Chris Hayward, Alex Pope, Anna Sajina,

Hidden surface removal Visibility of primitives Clipping algorithms will discard objects or

Beyond binary classification Subhransu Maji CMPSCI 689: Machine Learning 19 February 2015

CPI < 1 Pipelined CPUs may have multiple execution units of different types (to

CENG3420 Lecture 12: Instruction-Level Parallelism Bei Yu byu@cse.cuhk.edu.hk (Latest update:

Proving Skipping Refinement with ACL2s Mitesh Jain and Pete Manolios Northeastern University

Simulating Multi-Core RISC-V Systems in gem5 Tuan Ta, Lin Cheng, and Christopher Batten School

Limits of Superscalar Architecture Virendra Singh Associate - PowerPoint PPT Presentation

Limits of Superscalar Architecture Virendra Singh Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay http://www.ee.iitb.ac.in/~viren/ E-mail:

City Limits Lions Clubs City Limits Lions Clubs City Limits Lions Clubs City Limits Lions

Different Types of Limits Besides ordinary, two-sided limits, there are one-sided limits (left-

MAT 166 Calculus for Bus/Soc Chapter 3 Notes Limits The Deriviative David J. Gisch Limits

Out- -of of- -Order Order Out Tomasulos Algorithm Superscalar CPU Superscalar CPU -

Out- -of of- -Order Order Out Superscalar CPU Superscalar CPU Cliff Frey and Vicky Liu May

Overview Computer architecture Scaling performance and CMOS 1 Trends in Microprocessor

Limits (the size of the pie) allocation limits minimum reliability flow of supply Limits

Medical Programs Overview Table 1. Caption Medical SNAP TANF Programs Income Limits Income

Scope &amp; Limits of Scope &amp; Limits of Scope &amp; Limits of Legal Authority Legal

A Fault Tolerant Superscalar Processor 1 [Based on Coverage of a Microarchitecture-level

1 Register Renaming Examples Register Mapping Status Loop: Renamed dynamic instructions: R1

Superscalar Processors Raul Queiroz Feitosa Parts of these slides are from the support material

Sequential Presentation Of Long Instructions Limits of pipelining, The case for superscalar,

Superscalar Organization Nima Honarmand Spring 2016 :: CSE 502 Computer Architecture

Modeling Limits Jaroslav Neetil Patrice Ossona de Mendez Charles University CAMS, CNRS/EHESS

DB server limits (process/sessions) DB server limits (process/sessions) Carlos Fernando Gamboa,

Superscalar Organization Instructor: Nima Honarmand Spring 2015 :: CSE 502 Computer

Collaborators: Lee Armus, Danny Dale, Tanio Diaz-Santos, Chris Hayward, Alex Pope, Anna Sajina,

Hidden surface removal Visibility of primitives Clipping algorithms will discard objects or

Beyond binary classification Subhransu Maji CMPSCI 689: Machine Learning 19 February 2015

CPI &lt; 1 Pipelined CPUs may have multiple execution units of different types (to

CENG3420 Lecture 12: Instruction-Level Parallelism Bei Yu byu@cse.cuhk.edu.hk (Latest update:

Proving Skipping Refinement with ACL2s Mitesh Jain and Pete Manolios Northeastern University

Simulating Multi-Core RISC-V Systems in gem5 Tuan Ta, Lin Cheng, and Christopher Batten School

Scope & Limits of Scope & Limits of Scope & Limits of Legal Authority Legal

CPI < 1 Pipelined CPUs may have multiple execution units of different types (to