Unit 8: Superscalar Pipelines


  1. Unit 8: Superscalar Pipelines
     CIS 501: Computer Architecture
     Slides developed by Milo Martin & Amir Roth at the University of Pennsylvania, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.

     A Key Theme: Parallelism
     • Previously: pipeline-level parallelism
       • Work on execute of one instruction in parallel with decode of next
     • Next: instruction-level parallelism (ILP)
       • Execute multiple independent instructions fully in parallel
     • Then: static & dynamic scheduling
       • Extract much more ILP
     • Data-level parallelism (DLP)
       • Single-instruction, multiple data (one insn., four 64-bit adds)
     • Thread-level parallelism (TLP)
       • Multiple software threads running on multiple cores

     Readings
     • Textbook (MA:FSPTCM)
       • Sections 3.1, 3.2 (but not the “Sidebar” in 3.2), 3.5.1
       • Sections 4.2, 4.3, 5.3.3

     This Unit: (In-Order) Superscalar Pipelines
     [Figure: applications, system software, and hardware (Mem, CPU, I/O) layers]
     • Idea of instruction-level parallelism
     • Superscalar hardware issues
       • Bypassing and register file
       • Stall logic
       • Fetch
     • “Superscalar” vs. VLIW/EPIC

  2. “Scalar” Pipeline & the Flynn Bottleneck
     [Figure: scalar pipeline with branch predictor (BP), I$, register file, and D$]
     • So far we have looked at scalar pipelines
       • One instruction per stage
       • With control speculation, bypassing, etc.
     – Performance limit (aka “Flynn Bottleneck”) is CPI = IPC = 1
     – Limit is never even achieved (hazards)
     – Diminishing returns from “super-pipelining” (hazards + overhead)

     An Opportunity…
     • But consider:
         ADD r1, r2 -> r3
         ADD r4, r5 -> r6
     • Why not execute them at the same time? (We can!)
     • What about:
         ADD r1, r2 -> r3
         ADD r4, r3 -> r6
       • In this case, dependences prevent parallel execution
     • What about three instructions at a time?
     • Or four instructions at a time?

     What Checking Is Required?
     • For two instructions: 2 checks
         ADD src1_1, src2_1 -> dest_1
         ADD src1_2, src2_2 -> dest_2   (2 checks)
     • For three instructions: 6 checks
         ADD src1_1, src2_1 -> dest_1
         ADD src1_2, src2_2 -> dest_2   (2 checks)
         ADD src1_3, src2_3 -> dest_3   (4 checks)
     • For four instructions: 12 checks
         ADD src1_1, src2_1 -> dest_1
         ADD src1_2, src2_2 -> dest_2   (2 checks)
         ADD src1_3, src2_3 -> dest_3   (4 checks)
         ADD src1_4, src2_4 -> dest_4   (6 checks)
     • Plus checking for load-to-use stalls from the prior n loads
     (A small code sketch of this cross-check follows below.)
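To make the quadratic growth of these checks concrete, here is a minimal Python sketch. It is not the slides' hardware (which performs all of the comparisons in parallel with comparators); it assumes a hypothetical encoding of each instruction as a (dest, src1, src2) triple of register numbers and omits the load-to-use check mentioned on the slide.

```python
def cross_check(group):
    """Compare each instruction's sources against the destinations of all
    earlier instructions in the issue group: 2 comparisons per earlier
    instruction, so 2, 6, 12 total checks for groups of 2, 3, 4."""
    checks = 0
    for j in range(1, len(group)):
        _, src1, src2 = group[j]
        earlier_dests = [dest for dest, _, _ in group[:j]]
        checks += 2 * len(earlier_dests)
        if src1 in earlier_dests or src2 in earlier_dests:
            return False, checks   # dependence found: cannot all issue together
    return True, checks

# ADD r1, r2 -> r3 ; ADD r4, r3 -> r6: the second ADD reads r3
print(cross_check([(3, 1, 2), (6, 4, 3)]))                           # (False, 2)
# Four independent ADDs: 2 + 4 + 6 = 12 comparisons, all pass
print(cross_check([(3, 1, 2), (6, 4, 5), (9, 7, 8), (12, 10, 11)]))  # (True, 12)
```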

  3. Multiple-Issue or “Superscalar” Pipeline
     [Figure: dual-issue pipeline with branch predictor (BP), I$, register file, and D$]
     • Overcome this limit using multiple issue
       • Also called superscalar
       • Two instructions per stage at once, or three, or four, or eight…
     • “Instruction-Level Parallelism (ILP)” [Fisher, IEEE TC’81]
     • Today, typically “4-wide” (Intel Core i7, AMD Opteron)
       • Some more (Power5 is 5-issue; Itanium is 6-issue)
       • Some less (dual-issue is common for simple cores)

     How do we build such “superscalar” hardware?

     A Typical Dual-Issue Pipeline (1 of 2)
     • Fetch an entire 16B or 32B cache block
       • 4 to 8 instructions (assuming a 4-byte average instruction length)
       • Predict a single branch per cycle
     • Parallel decode
       • Need to check for conflicting instructions
       • Is the output register of I1 an input register of I2?
       • Other stalls, too (for example, load-use delay)
     (A sketch of this decode decision follows below.)

     A Typical Dual-Issue Pipeline (2 of 2)
     • Multi-ported register file
       • Larger area, latency, power, cost, complexity
     • Multiple execution units
       • Simple adders are easy, but bypass paths are expensive
     • Memory unit
       • A single load per cycle (stall at decode) is probably okay for dual issue
       • Alternative: add a read port to the data cache
         • Larger area, latency, power, cost, complexity
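As a companion to the decode checks listed under “A Typical Dual-Issue Pipeline (1 of 2)”, here is a hedged sketch of the dual-issue stall decision: stall on a load-to-use hazard, split the pair if I2 reads I1's result, otherwise issue both. The field names (op, dest, srcs), the three-way outcome, and the load-tracking argument are illustrative choices, not the slides' exact logic.

```python
def decode_pair(i1, i2, load_dests_in_flight):
    """Decide what a dual-issue decode stage does with a fetched pair.
    load_dests_in_flight: destination regs of loads whose data isn't back yet."""
    # Load-to-use: either instruction reads a value an in-flight load produces.
    if any(r in i1["srcs"] + i2["srcs"] for r in load_dests_in_flight):
        return "stall"
    # Intra-pair RAW: I2 reads I1's result, so they can't execute side by side.
    if i1["dest"] in i2["srcs"]:
        return "issue I1 only"
    return "issue both"

i1 = {"op": "add", "dest": 3, "srcs": [1, 2]}   # ADD r1, r2 -> r3
i2 = {"op": "add", "dest": 6, "srcs": [4, 3]}   # ADD r4, r3 -> r6
print(decode_pair(i1, i2, load_dests_in_flight=[]))    # issue I1 only
print(decode_pair(i1, i2, load_dests_in_flight=[1]))   # stall (load-use on r1)
```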

  4. Superscalar Implementation Challenges

     Superscalar Challenges - Front End
     • Superscalar instruction fetch
       • Modest: fetch multiple instructions per cycle
       • Aggressive: buffer instructions and/or predict multiple branches
     • Superscalar instruction decode
       • Replicate decoders
     • Superscalar instruction issue
       • Determine when instructions can proceed in parallel
       • More complex stall logic - order N² for an N-wide machine
       • Not all combinations of instruction types are possible
     • Superscalar register read
       • A port for each register read (4-wide superscalar → 8 read “ports”)
       • Each port needs its own set of address and data wires
       • Latency & area ∝ (#ports)²

     Superscalar Challenges - Back End
     • Superscalar instruction execution
       • Replicate arithmetic units (but not all of them, say, the integer divider)
       • Perhaps multiple cache ports (slower access, higher energy)
         • Only for 4-wide or larger (why? only ~35% of insns are loads/stores)
     • Superscalar bypass paths
       • More possible sources for data values
       • Order (N² * P) for an N-wide machine with execute pipeline depth P
     • Superscalar instruction register writeback
       • One write port per instruction that writes a register
       • Example: 4-wide superscalar → 4 write ports
     • Fundamental challenge:
       • Amount of ILP (instruction-level parallelism) in the program
       • The compiler must schedule code and extract parallelism

     Superscalar Bypass
     • N² bypass network
       – N+1 input muxes at each ALU input
       – N² point-to-point connections
       – Routing lengthens wires
       – Heavy capacitive load
     • And this is just one bypass stage (MX)!
       • There is also WX bypassing
       • Even more for deeper pipelines
     • One of the big problems of superscalar
       • Why? It is on the critical path of the single-cycle “bypass & execute” loop
     (The rough growth of these structures is tabulated in the sketch below.)
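The following back-of-the-envelope helper (an addition, not from the slides) tabulates how the bypass structures grow with issue width, using the slide's rough formulas: roughly N bypass sources per bypass stage feeding each ALU input (plus the register-file value), and on the order of N² point-to-point paths per stage, so about N² * P overall. The default of two bypass stages (MX and WX) is an assumption.

```python
def bypass_growth(n_wide, bypass_stages=2):
    """Rough size of the bypass network for an n_wide machine.
    bypass_stages counts the bypass levels (e.g. 2 for MX and WX)."""
    mux_inputs = n_wide * bypass_stages + 1   # bypass sources per ALU input, + regfile read
    paths = n_wide ** 2 * bypass_stages       # ~N^2 point-to-point connections per stage
    return mux_inputs, paths

for n in (1, 2, 4, 8):
    mux, paths = bypass_growth(n)
    print(f"{n}-wide: {mux} mux inputs per operand, ~{paths} bypass paths")
```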

  5. Not All N² Created Equal
     • N² bypass vs. N² stall logic & dependence cross-check
     • Which is the bigger problem?
     • N² bypass … by far
       • 64-bit quantities (vs. 5-bit register specifiers)
       • Multiple levels (MX, WX) of bypass (vs. one level of stall logic)
       • Must fit in one clock period with the ALU (vs. not)
     • The dependence cross-check is not even the 2nd biggest N² problem
       • The regfile is also an N² problem (think latency, where N is the number of ports)
       • And it is also more serious than the cross-check

     Mitigating N² Bypass & Register File
     • Clustering: mitigates N² bypass
       • Group ALUs into K clusters
       • Full bypassing within a cluster
       • Limited bypassing between clusters
         • With a 1- or 2-cycle delay
       • Can hurt IPC, but allows a faster clock
       • (N/K) + 1 inputs at each mux
       • (N/K)² bypass paths in each cluster
     • Steering: key to performance
       • Steer dependent insns to the same cluster
     • Cluster the register file, too
       • Replicate a register file per cluster
       • All register writes update all replicas
       • Fewer read ports; each serves only its cluster

     Mitigating the N² RegFile: Clustering++
     [Figure: two execution clusters, each with its own register file replica (RF0, RF1), sharing the data memory]
     • Clustering: split the N-wide execution pipeline into K clusters
       • With a centralized register file: 2N read ports and N write ports
     • Clustered register file: extend clustering to the register file
       • Replicate the register file (one replica per cluster)
       • Each register file supplies register operands to just its cluster
       • All register writes go to all register files (keep them in sync)
       • Advantage: fewer read ports per register file!
       • K register files, each with 2N/K read ports and N write ports

     Another Challenge: Superscalar Fetch
     • What is involved in fetching multiple instructions per cycle?
     • In the same cache block? → no problem
       • A 64-byte cache block is 16 instructions (~4 bytes per instruction)
       • Favors a larger block size (independent of hit rate)
     • What if the next instruction is the last instruction in a block?
       • Fetch only one instruction that cycle
       • Or, some processors may allow fetching from 2 consecutive blocks
     • What about taken branches?
       • How many instructions can be fetched on average?
       • Average number of instructions per taken branch?
         • Assume 20% branches, 50% taken → ~10 instructions
     • Consider a 5-instruction loop with a 4-issue processor
       • Without smarter fetch, ILP is limited to 2.5 (not 4, which is bad)
     (The arithmetic behind these last two numbers is worked out in the sketch below.)
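To spell out the two numbers at the end of the fetch slide, here is the arithmetic as a short sketch; the 20%/50% branch mix and the 5-instruction loop are the slide's own assumptions.

```python
import math

# Average run length between taken branches: 20% branches, half of them taken.
branch_frac, taken_frac = 0.20, 0.50
print(1 / (branch_frac * taken_frac))         # 10.0 instructions per taken branch

# A 5-instruction loop on a 4-wide fetch that stops at the taken back-branch:
loop_insns, fetch_width = 5, 4
cycles = math.ceil(loop_insns / fetch_width)  # 4 insns one cycle, then 1 the next
print(loop_insns / cycles)                    # 2.5 instructions fetched per cycle, not 4
```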
