Unit 8: Superscalar Pipelines

CIS 501: Computer Architecture
Unit 8: Superscalar Pipelines
Slides developed by Milo Martin & Amir Roth at the University of Pennsylvania, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood
CIS 501: Comp. Arch. | Prof. Milo Martin | Superscalar

A Key Theme: Parallelism
• Previously: pipeline-level parallelism
  • Work on execute of one instruction in parallel with decode of next
• Next: instruction-level parallelism (ILP)
  • Execute multiple independent instructions fully in parallel
• Then:
  • Static & dynamic scheduling
    • Extract much more ILP
  • Data-level parallelism (DLP)
    • Single-instruction, multiple data (one insn., four 64-bit adds)
  • Thread-level parallelism (TLP)
    • Multiple software threads running on multiple cores

Readings
• Textbook (MA:FSPTCM)
  • Sections 3.1, 3.2 (but not “Sidebar” in 3.2), 3.5.1
  • Sections 4.2, 4.3, 5.3.3

This Unit: (In-Order) Superscalar Pipelines
[Figure: system layers (App, System software, Mem / CPU / I/O)]
• Idea of instruction-level parallelism
• Superscalar hardware issues
  • Bypassing and register file
  • Stall logic
  • Fetch
• “Superscalar” vs VLIW/EPIC

“Scalar” Pipeline & the Flynn Bottleneck
[Figure: pipeline diagram (BP, I$, regfile, D$)]
• So far we have looked at scalar pipelines
  • One instruction per stage
  • With control speculation, bypassing, etc.
  – Performance limit (aka “Flynn Bottleneck”) is CPI = IPC = 1
  – Limit is never even achieved (hazards)
  – Diminishing returns from “super-pipelining” (hazards + overhead)

An Opportunity…
• But consider:
    ADD r1, r2 -> r3
    ADD r4, r5 -> r6
  • Why not execute them at the same time? (We can!)
• What about:
    ADD r1, r2 -> r3
    ADD r4, r3 -> r6
  • In this case, dependences prevent parallel execution
• What about three instructions at a time?
  • Or four instructions at a time?

What Checking Is Required?
• For two instructions: 2 checks
    ADD src1_1, src2_1 -> dest_1
    ADD src1_2, src2_2 -> dest_2   (2 checks)
• For three instructions: 6 checks
    ADD src1_1, src2_1 -> dest_1
    ADD src1_2, src2_2 -> dest_2   (2 checks)
    ADD src1_3, src2_3 -> dest_3   (4 checks)
• For four instructions: 12 checks
    ADD src1_1, src2_1 -> dest_1
    ADD src1_2, src2_2 -> dest_2   (2 checks)
    ADD src1_3, src2_3 -> dest_3   (4 checks)
    ADD src1_4, src2_4 -> dest_4   (6 checks)
• Plus checking for load-to-use stalls from prior n loads
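The check counts on the "What Checking Is Required?" slide follow a simple pattern: each instruction compares both of its source registers against the destination of every earlier instruction in the same issue group. The sketch below reproduces that arithmetic. It assumes two-source instructions like the ADDs in the slides, and the function name `cross_checks` is made up for illustration.

```python
def cross_checks(n: int) -> int:
    """Register comparisons needed to issue n instructions together.

    Assumption: every instruction has two source registers, and
    instruction i (1-based) checks both sources against the
    destination of each of the i - 1 instructions ahead of it.
    """
    return sum(2 * (i - 1) for i in range(1, n + 1))


if __name__ == "__main__":
    for width in (2, 3, 4, 8):
        # The sum collapses to n * (n - 1), i.e. order N^2 growth.
        print(f"{width}-wide: {cross_checks(width)} checks")
```

The closed form n(n-1) is what makes the stall logic an "order N²" problem for an N-wide machine. Load-to-use checks against earlier loads come on top of this count.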

Multiple-Issue or “Superscalar” Pipeline
[Figure: pipeline diagram (BP, I$, regfile, D$)]
• Overcome this limit using multiple issue
  • Also called superscalar
  • Two instructions per stage at once, or three, or four, or eight…
  • “Instruction-Level Parallelism (ILP)” [Fisher, IEEE TC’81]
• Today, typically “4-wide” (Intel Core i7, AMD Opteron)
  • Some more (Power5 is 5-issue; Itanium is 6-issue)
  • Some less (dual-issue is common for simple cores)

How do we build such “superscalar” hardware?

A Typical Dual-Issue Pipeline (1 of 2)
[Figure: dual-issue pipeline diagram (BP, I$, regfile, D$)]
• Fetch an entire 16B or 32B cache block
  • 4 to 8 instructions (assuming 4-byte average instruction length)
  • Predict a single branch per cycle
• Parallel decode
  • Need to check for conflicting instructions
  • Is the output register of I1 an input register to I2?
  • Other stalls, too (for example, load-use delay)

A Typical Dual-Issue Pipeline (2 of 2)
[Figure: dual-issue pipeline diagram (BP, I$, regfile, D$)]
• Multi-ported register file
  • Larger area, latency, power, cost, complexity
• Multiple execution units
  • Simple adders are easy, but bypass paths are expensive
• Memory unit
  • Single load per cycle (stall at decode) probably okay for dual issue
  • Alternative: add a read port to data cache
  • Larger area, latency, power, cost, complexity
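The parallel-decode check for a dual-issue pipeline can be sketched in a few lines. This is an illustrative model, not the hardware described in the slides: the `Insn` record, the `"LD"` opcode name, and the function `can_dual_issue` are assumptions introduced here. It covers only the two conditions named above (I1's output feeding I2, and a single load per cycle) and ignores other stalls such as load-use delay from earlier cycles.

```python
from typing import NamedTuple, Optional, Tuple


class Insn(NamedTuple):
    op: str                  # opcode mnemonic, e.g. "ADD" or "LD" (hypothetical)
    srcs: Tuple[str, ...]    # source register names
    dest: Optional[str]      # destination register, or None if no register is written


def can_dual_issue(i1: Insn, i2: Insn) -> bool:
    """Return True if i2 may issue in the same cycle as the older i1."""
    # Data dependence: i2 reads the register that i1 writes.
    if i1.dest is not None and i1.dest in i2.srcs:
        return False
    # Structural limit assumed here: the data cache serves one load per cycle.
    if i1.op == "LD" and i2.op == "LD":
        return False
    return True


if __name__ == "__main__":
    a = Insn("ADD", ("r1", "r2"), "r3")
    b = Insn("ADD", ("r4", "r5"), "r6")   # independent of a
    c = Insn("ADD", ("r4", "r3"), "r6")   # reads r3, which a writes
    print(can_dual_issue(a, b))
    print(can_dual_issue(a, c))
```

When the check fails, an in-order pipeline issues I1 alone and holds I2 at decode for the next cycle.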

Superscalar Implementation Challenges

Superscalar Challenges - Front End
• Superscalar instruction fetch
  • Modest: fetch multiple instructions per cycle
  • Aggressive: buffer instructions and/or predict multiple branches
• Superscalar instruction decode
  • Replicate decoders
• Superscalar instruction issue
  • Determine when instructions can proceed in parallel
  • More complex stall logic - order N² for N-wide machine
  • Not all combinations of types of instructions possible
• Superscalar register read
  • Port for each register read (4-wide superscalar → 8 read “ports”)
  • Each port needs its own set of address and data wires
  • Latency & area ∝ #ports²

Superscalar Challenges - Back End
• Superscalar instruction execution
  • Replicate arithmetic units (but not all, say, integer divider)
  • Perhaps multiple cache ports (slower access, higher energy)
  • Only for 4-wide or larger (why? only ~35% are load/store insn)
• Superscalar bypass paths
  • More possible sources for data values
  • Order (N² * P) for N-wide machine with execute pipeline depth P
• Superscalar instruction register writeback
  • One write port per instruction that writes a register
  • Example, 4-wide superscalar → 4 write ports
• Fundamental challenge:
  • Amount of ILP (instruction-level parallelism) in the program
  • Compiler must schedule code and extract parallelism

Superscalar Bypass
[Figure: bypass networks compared]
• N² bypass network
  – N+1 input muxes at each ALU input
  – N² point-to-point connections
  – Routing lengthens wires
  – Heavy capacitive load
• And this is just one bypass stage (MX)!
  • There is also WX bypassing
  • Even more for deeper pipelines
• One of the big problems of superscalar
  • Why? On the critical path of single-cycle “bypass & execute” loop
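To get a feel for how quickly these costs grow, the order-of-magnitude figures from the front-end, back-end, and bypass slides can be tabulated for a few issue widths. The sketch below only restates those formulas (2N read ports, N write ports, N+1 mux inputs, N² paths per bypass stage, N² * P overall). The function name `superscalar_costs` and the dictionary keys are illustrative, and the numbers are growth trends rather than exact wire counts.

```python
def superscalar_costs(n: int, p: int = 1) -> dict:
    """Rough hardware counts for an n-wide machine with p bypass stages.

    Assumptions: two source operands and at most one register write per
    instruction, as in the slides' ADD examples.
    """
    return {
        "read_ports": 2 * n,              # one port per register read
        "write_ports": n,                 # one port per register write
        "mux_inputs_per_alu_input": n + 1,  # n bypassed values + register file value
        "bypass_paths_per_stage": n * n,  # every producer to every consumer
        "bypass_paths_total": n * n * p,  # order N^2 * P
    }


if __name__ == "__main__":
    for width in (1, 2, 4, 8):
        costs = superscalar_costs(width, p=2)  # p=2 models MX and WX bypassing
        print(width, costs)
```

Doubling the width from 4 to 8 doubles the port and mux-input counts but quadruples the bypass paths, which is why the bypass network is singled out as one of the big problems of superscalar design.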

Not All N² Created Equal
• N² bypass vs. N² stall logic & dependence cross-check
  • Which is the bigger problem?
• N² bypass … by far
  • 64-bit quantities (vs. 5-bit)
  • Multiple levels (MX, WX) of bypass (vs. 1 level of stall logic)
  • Must fit in one clock period with ALU (vs. not)
• Dependence cross-check not even 2nd biggest N² problem
  • Regfile is also an N² problem (think latency where N is #ports)
  • And also more serious than cross-check

Mitigating N² Bypass & Register File
• Clustering: mitigates N² bypass
  • Group ALUs into K clusters
  • Full bypassing within a cluster
  • Limited bypassing between clusters
    • With 1 or 2 cycle delay
    • Can hurt IPC, but faster clock
  • (N/K) + 1 inputs at each mux
  • (N/K)² bypass paths in each cluster
• Steering: key to performance
  • Steer dependent insns to same cluster
• Cluster register file, too
  • Replicate the register file per cluster
  • All register writes update all replicas
  • Fewer read ports; only for cluster

Mitigating N² RegFile: Clustering++
[Figure: cluster 0 with RF0, cluster 1 with RF1, shared DM]
• Clustering: split N-wide execution pipeline into K clusters
  • With centralized register file, 2N read ports and N write ports
• Clustered register file: extend clustering to register file
  • Replicate the register file (one replica per cluster)
  • Register file supplies register operands to just its cluster
  • All register writes go to all register files (keep them in sync)
  • Advantage: fewer read ports per register!
  • K register files, each with 2N/K read ports and N write ports

Another Challenge: Superscalar Fetch
• What is involved in fetching multiple instructions per cycle?
• In same cache block? → no problem
  • 64-byte cache block is 16 instructions (~4 bytes per instruction)
  • Favors larger block size (independent of hit rate)
• What if next instruction is last instruction in a block?
  • Fetch only one instruction that cycle
  • Or, some processors may allow fetching from 2 consecutive blocks
• What about taken branches?
  • How many instructions can be fetched on average?
  • Average number of instructions per taken branch?
    • Assume: 20% branches, 50% taken → ~10 instructions
• Consider a 5-instruction loop with a 4-issue processor
  • Without smarter fetch, ILP is limited to 2.5 (not 4, which is bad)
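The two fetch numbers on the last slide can be checked with a short calculation. The sketch assumes that fetch stops at a taken branch (nothing past the loop's back edge is fetched in the same cycle) and that the loop body starts at the beginning of a fetch group. The function names are invented for this example.

```python
import math


def insns_per_taken_branch(branch_frac: float, taken_frac: float) -> float:
    """Average run length of sequential instructions between taken branches."""
    return 1.0 / (branch_frac * taken_frac)


def loop_fetch_ipc(loop_len: int, width: int) -> float:
    """Fetch-limited IPC for a loop of loop_len instructions.

    Assumption: each iteration needs ceil(loop_len / width) fetch cycles,
    because the fetch group ends at the taken back-edge branch.
    """
    cycles_per_iteration = math.ceil(loop_len / width)
    return loop_len / cycles_per_iteration


if __name__ == "__main__":
    # 20% of instructions are branches and 50% of those are taken.
    print(insns_per_taken_branch(0.20, 0.50))
    # 5-instruction loop on a 4-wide machine: 4 insns, then 1 insn, per iteration.
    print(loop_fetch_ipc(5, 4))
```

With these assumptions a 4-instruction loop would reach the full width of 4, while a 5-instruction loop wastes three fetch slots every other cycle. That gap is what buffering instructions or predicting multiple branches per cycle tries to recover.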
