Superscalar Organization Nima Honarmand Spring 2016 :: CSE 502 - PowerPoint PPT Presentation

Spring 2016 :: CSE 502 – Computer Architecture Superscalar Organization Nima Honarmand

Spring 2016 :: CSE 502 – Computer Architecture Instruction-Level Parallelism (ILP) • Recall: “Parallelism is the number of independent tasks available” • ILP is a measure of inter-dependencies between insns. • Average ILP = num. instruction / num. cyc required in an “ideal machine” code1: ILP = 1 i.e. must execute serially code2: ILP = 3 i.e. can execute at the same time r1  r2 + 1 r1  r2 + 1 code1: code2: r3  r9 / 17 r3  r1 / 17 r4  r0 - r10 r4  r0 - r3

Spring 2016 :: CSE 502 – Computer Architecture ILP != IPC • ILP usually assumes – Infinite resources – Perfect fetch – Unit-latency for all instructions • ILP is a property of the program dataflow • IPC is the “real” observed metric – How many insns. are executed per cycle • ILP is an upper-bound on the attainable IPC – Specific to a particular program

Spring 2016 :: CSE 502 – Computer Architecture Purported Limits on ILP Weiss and Smith [1984] 1.58 Sohi and Vajapeyam [1987] 1.81 Tjaden and Flynn [1970] 1.86 Tjaden and Flynn [1973] 1.96 Uht [1986] 2.00 Smith et al. [1989] 2.00 Jouppi and Wall [1988] 2.40 Johnson [1991] 2.50 Acosta et al. [1986] 2.79 Wedig [1982] 3.00 Butler et al. [1991] 5.8 Melvin and Patt [1991] 6 Wall [1991] 7 Kuck et al. [1972] 8 Riseman and Foster [1972] 51 Nicolau and Fisher [1984] 90

Spring 2016 :: CSE 502 – Computer Architecture ILP Limits of Scalar Pipelines (1) • Scalar upper bound on throughput – Limited to CPI >= 1 – Solution: superscalar pipelines with multiple insns at each stage Prefetch Decode1 U-pipe V-pipe Decode2 Decode2 Execute Execute Pentium Pipeline Writeback Writeback

Spring 2016 :: CSE 502 – Computer Architecture ILP Limits of Scalar Pipelines (2) • Inefficient unified IF • • • pipeline ID • • • – Lower resource utilization and longer RD • • • instruction latency EX ALU MEM1 FP1 BR – Solution: diversified pipelines MEM2 FP2 FP3 WB • • •

Spring 2016 :: CSE 502 – Computer Architecture ILP Limits of Scalar Pipelines (3) • Rigid pipeline stall IF • • • policy ID • • • – A stalled RD instruction stalls • • • ( in order ) Dispatch all newer Buffer ( out of order ) instructions EX ALU MEM1 FP1 BR – Solution 1: MEM2 FP2 out-of-order execution FP3 ( out of order ) Reorder Buffer ( in order ) WB • • •

Spring 2016 :: CSE 502 – Computer Architecture ILP Limits of Scalar Pipelines (3) • Rigid pipeline stall Fetch policy Instruction Buffer In – A stalled Decode Program instruction stalls Order Dispatch Buffer all newer Dispatch instructions Issuing Buffer – Solution 1: Out Execute of out-of-order Order Completion Buffer execution Complete – Solution 2: inter- In Program stage buffers Store Buffer Order Retire

Spring 2016 :: CSE 502 – Computer Architecture ILP Limits of Scalar Pipelines (4) • Instruction dependencies limit parallelism – Frequent stalls due to data and control dependencies – Solution 1: renaming – for WAR and WAW register dependences – Solution 2: speculation – for control dependences and memory dependences

Spring 2016 :: CSE 502 – Computer Architecture ILP Limits of Scalar Pipelines (Summary) 1. Scalar upper bound on throughput – Limited to CPI >= 1 – Solution: superscalar pipelines with multiple insns at each stage 2. Inefficient unified pipeline – Lower resource utilization and longer instruction latency – Solution: diversified pipelines 3. Rigid pipeline stall policy – A stalled instruction stalls all newer instructions – Solution: out-of-order execution and inter-stage buffers 4. Instruction dependencies limit parallelism – Frequent stalls due to data and control dependencies – Solutions: renaming and speculation State of the art: Out-of-Order Superscalar Pipelines

Spring 2016 :: CSE 502 – Computer Architecture Overall Picture • Fetch issues: – Fetch multiple isns I-cache – Branches Instruction – Branch target mis-alignment Branch FETCH Flow Predictor Instruction • Decode issues: Buffer DECODE – Identify insns – Find dependences Memory Integer Floating-point Media • Execution issues: – Dispatch insns Memory – Resolve dependences Data – Bypass networks Flow EXECUTE – Multiple outstanding memory Reorder accesses Buffer Register (ROB) • Completion issues: Data COMMIT Flow D-cache Store – Out-of-order completion Queue – Speculative instructions – Precise exceptions State of the art: Out-of-Order Superscalar Pipelines

Superscalar Organization Nima Honarmand Spring 2016 :: CSE 502 - PowerPoint PPT Presentation

Spring 2016 :: CSE 502 Computer Architecture Superscalar Organization Nima Honarmand Spring 2016 :: CSE 502 Computer Architecture Instruction-Level Parallelism (ILP) Recall: Parallelism is the number of independent tasks

Out- -of of- -Order Order Out Tomasulos Algorithm Superscalar CPU Superscalar CPU -

Out- -of of- -Order Order Out Superscalar CPU Superscalar CPU Cliff Frey and Vicky Liu May

A Fault Tolerant Superscalar Processor 1 [Based on Coverage of a Microarchitecture-level

1 Register Renaming Examples Register Mapping Status Loop: Renamed dynamic instructions: R1

Superscalar Processors Raul Queiroz Feitosa Parts of these slides are from the support material

Sequential Presentation Of Long Instructions Limits of pipelining, The case for superscalar,

Banked Multiported Register Files for High-Frequency Superscalar Microprocessors Jessica H. Tseng

CIS 371 Computer Organization and Design Unit 9: Superscalar Pipelines Slides developed by Milo

FabScalar RISC-V Rangeen Basu Roy Chowdhury Anil Kumar Kannepalli Eric Rotenberg FabScalar

Caches Out-of-order execution Data flow model Samira Khan Superscalar processor March

Superscalar Design: Instruction Flow Techniques Virendra Singh Associate Professor C omputer A

Superscalar Pipelines Slides developed by Joe Devietti, Milo Martin & Amir Roth at U. Penn

Task Superscalar: Using Processors as Functional Units Yoav Etsion Alex Ramirez Rosa M.

CSC2/458 Parallel and Distributed Systems Automatic Parallelization in Hardware Sreepathi Pai

Superscalar Design: An Introduction Virendra Singh Associate Professor C omputer A rchitecture

Superscalar Organization Nima Honarmand Spring 2018 :: CSE 502 Review: Instruction-Level

Gaug auge field field as as a dark da rk m matter tter cand ndidate Y A S A M A N A M A

DATA LEVEL PARALLELISM Mahdi Nazm Bojnordi Assistant Professor School of Computing University

Vector Spaces Sets Closed Under Operations Defn. A set S is closed under some opera- tion if

The Road to Advanced Mission Critical Linux Support: an Open Source Approach marco bill-peter

Hyperbolic Field Space and Swampland Conjecture for DBI Scalar Speaker: Yun-Long Zhang Yukawa

Black-hole binary inspiral and merger in scalar-tensor theory of gravity U. Sperhake DAMTP ,

Higgs Boson Decays to Light Scalars at ATLAS University of Birmingham, 22 nd April 2020 Elliot

Variably scaled kernels M. Bozzini jointed with L. Lenarduzzi, M. Rossini, R. Schaback Maia

Superscalar Organization Nima Honarmand Spring 2016 :: CSE 502 - PowerPoint PPT Presentation

Spring 2016 :: CSE 502 Computer Architecture Superscalar Organization Nima Honarmand Spring 2016 :: CSE 502 Computer Architecture Instruction-Level Parallelism (ILP) Recall: Parallelism is the number of independent tasks

Out- -of of- -Order Order Out Tomasulos Algorithm Superscalar CPU Superscalar CPU -

Out- -of of- -Order Order Out Superscalar CPU Superscalar CPU Cliff Frey and Vicky Liu May

A Fault Tolerant Superscalar Processor 1 [Based on Coverage of a Microarchitecture-level

1 Register Renaming Examples Register Mapping Status Loop: Renamed dynamic instructions: R1

Superscalar Processors Raul Queiroz Feitosa Parts of these slides are from the support material

Sequential Presentation Of Long Instructions Limits of pipelining, The case for superscalar,

Banked Multiported Register Files for High-Frequency Superscalar Microprocessors Jessica H. Tseng

CIS 371 Computer Organization and Design Unit 9: Superscalar Pipelines Slides developed by Milo

FabScalar RISC-V Rangeen Basu Roy Chowdhury Anil Kumar Kannepalli Eric Rotenberg FabScalar

Caches Out-of-order execution Data flow model Samira Khan Superscalar processor March

Superscalar Design: Instruction Flow Techniques Virendra Singh Associate Professor C omputer A

Superscalar Pipelines Slides developed by Joe Devietti, Milo Martin &amp; Amir Roth at U. Penn

Task Superscalar: Using Processors as Functional Units Yoav Etsion Alex Ramirez Rosa M.

CSC2/458 Parallel and Distributed Systems Automatic Parallelization in Hardware Sreepathi Pai

Superscalar Design: An Introduction Virendra Singh Associate Professor C omputer A rchitecture

Superscalar Organization Nima Honarmand Spring 2018 :: CSE 502 Review: Instruction-Level

Gaug auge field field as as a dark da rk m matter tter cand ndidate Y A S A M A N A M A

DATA LEVEL PARALLELISM Mahdi Nazm Bojnordi Assistant Professor School of Computing University

Vector Spaces Sets Closed Under Operations Defn. A set S is closed under some opera- tion if

The Road to Advanced Mission Critical Linux Support: an Open Source Approach marco bill-peter

Hyperbolic Field Space and Swampland Conjecture for DBI Scalar Speaker: Yun-Long Zhang Yukawa

Black-hole binary inspiral and merger in scalar-tensor theory of gravity U. Sperhake DAMTP ,

Higgs Boson Decays to Light Scalars at ATLAS University of Birmingham, 22 nd April 2020 Elliot

Variably scaled kernels M. Bozzini jointed with L. Lenarduzzi, M. Rossini, R. Schaback Maia

Superscalar Pipelines Slides developed by Joe Devietti, Milo Martin & Amir Roth at U. Penn