Superscalar Organization Nima Honarmand Spring 2018 :: CSE 502 - PowerPoint PPT Presentation

Spring 2018 :: CSE 502 Superscalar Organization Nima Honarmand

Review: Instruction-Level Parallelism (ILP)
• “Parallelism is the number of independent tasks available”
• ILP is a measure of inter-dependencies between insns
• Average ILP = number of instructions / number of cycles required in an “ideal machine”

code1: ILP = 1, i.e. must execute serially
    r1 ← r2 + 1
    r3 ← r1 / 17
    r4 ← r0 - r3

code2: ILP = 3, i.e. can execute at the same time
    r1 ← r2 + 1
    r3 ← r9 / 17
    r4 ← r0 - r10
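The ILP figures for code1 and code2 can be checked with a small calculation. The sketch below is illustrative and not part of the original slides: it assumes unit latency, unlimited resources, and only true (read-after-write) dependences, and the function name `average_ilp` and the tuple encoding of instructions are hypothetical.

```python
def average_ilp(program):
    """Return instructions / critical-path length for a straight-line program.

    program: list of (dest, [sources]) in program order.
    Assumes an "ideal machine": unit latency, unlimited resources,
    and only true (read-after-write) dependences.
    """
    ready = {}  # register -> cycle in which its latest value is produced
    depth = 0   # length of the longest dependence chain seen so far
    for dest, srcs in program:
        # An instruction executes one cycle after its latest producer.
        cycle = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = cycle
        depth = max(depth, cycle)
    return len(program) / depth


# code1: each instruction reads the result of the previous one.
code1 = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r0", "r3"])]
# code2: no instruction reads another's result.
code2 = [("r1", ["r2"]), ("r3", ["r9"]), ("r4", ["r0", "r10"])]

print(average_ilp(code1))  # 1.0
print(average_ilp(code2))  # 3.0
```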

ILP != IPC
• ILP usually assumes
  – Infinite resources
  – Perfect fetch and branch prediction
  – Unit latency for all instructions
• ILP is a property of the program dataflow
• IPC is the “real” observed metric
  – How many insns are executed per cycle
• ILP is an upper bound on the attainable IPC
  – Specific to a particular program
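To see why ILP only bounds IPC from above, one can cap the number of instructions that may execute per cycle. The following is a minimal sketch under stated assumptions, not something taken from the slides: unit latency, true dependences only, a greedy schedule, and a hypothetical `width` parameter standing in for finite machine resources.

```python
def ipc(program, width):
    """Instructions per cycle under a greedy dataflow schedule.

    program: list of (dest, [sources]) in program order.
    width:   maximum number of instructions executed per cycle
             (a stand-in for finite resources; an assumption of this sketch).
    Unit latency and true dependences only.
    """
    ready = {}  # register -> cycle in which its value is produced
    used = {}   # cycle -> number of instructions already placed there
    last = 0
    for dest, srcs in program:
        cycle = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        while used.get(cycle, 0) >= width:  # cycle is full, try the next one
            cycle += 1
        used[cycle] = used.get(cycle, 0) + 1
        ready[dest] = cycle
        last = max(last, cycle)
    return len(program) / last


# Three mutually independent instructions, so ILP = 3.
code2 = [("r1", ["r2"]), ("r3", ["r9"]), ("r4", ["r0", "r10"])]

print(ipc(code2, 1))  # 1.0  (scalar machine)
print(ipc(code2, 2))  # 1.5
print(ipc(code2, 3))  # 3.0  (reaches the ILP)
print(ipc(code2, 8))  # 3.0  (extra width cannot exceed the ILP)
```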

Purported Limits on ILP

Study                          ILP
Weiss and Smith [1984]         1.58
Sohi and Vajapeyam [1987]      1.81
Tjaden and Flynn [1970]        1.86
Tjaden and Flynn [1973]        1.96
Uht [1986]                     2.00
Smith et al. [1989]            2.00
Jouppi and Wall [1988]         2.40
Johnson [1991]                 2.50
Acosta et al. [1986]           2.79
Wedig [1982]                   3.00
Butler et al. [1991]           5.8
Melvin and Patt [1991]         6
Wall [1991]                    7
Kuck et al. [1972]             8
Riseman and Foster [1972]      51
Nicolau and Fisher [1984]      90

ILP Limits of Scalar Pipelines (1)
• Scalar upper bound on throughput
  – Limited to IPC <= 1
  – Solution: superscalar pipelines with multiple insns at each stage

[Figure: Pentium pipeline. Prefetch and Decode1 feed two parallel pipes, U-pipe and V-pipe, each with Decode2, Execute, and Writeback stages.]

ILP Limits of Scalar Pipelines (2)
• Unified pipeline: a pipeline where all instructions go through the same stages
  – Like our 5-stage pipeline
• Unified pipelines are inefficient
  – Lower resource utilization and longer instruction latency
  – Solution: diversified pipelines

[Figure: diversified pipeline. IF, ID, and RD stages feed separate execution pipes (ALU; MEM1, MEM2; FP1, FP2, FP3; BR), followed by WB.]

ILP Limits of Scalar Pipelines (3)
• Rigid pipeline stall policy
  – A stalled instruction stalls all newer instructions
  – Solution 1: out-of-order execution

[Figure: IF, ID, and RD stages (in order) feed a Dispatch Buffer (in order in, out of order out); execution pipes (ALU; MEM1, MEM2; FP1, FP2, FP3; BR) feed a Reorder Buffer (out of order in, in order out), followed by WB.]

ILP Limits of Scalar Pipelines (3)
• Rigid pipeline stall policy
  – A stalled instruction stalls all newer instructions
  – Solution 1: out-of-order execution
  – Solution 2: inter-stage buffers

[Figure: pipeline with inter-stage buffers. Fetch, Instruction Buffer, Decode, Dispatch Buffer, Dispatch, Issuing Buffer, Execute, Completion Buffer, Complete, Store Buffer, Retire. Labels: In Program Order (front end), Out of Order (execute), In Program Order (completion and retirement).]
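The cost of the rigid stall policy can be illustrated with a toy single-issue model. This is a sketch under stated assumptions rather than the slides' design: one instruction issues per cycle, every destination register is distinct (so WAR and WAW hazards do not arise), and the function name `issue_cycles`, the latencies, and the four-instruction program are all hypothetical.

```python
def issue_cycles(program, out_of_order):
    """Number of cycles needed to issue every instruction.

    program: list of (dest, [sources], latency), oldest first.
    In-order:     only the oldest unissued instruction may issue, so a
                  stalled instruction blocks all newer ones.
    Out-of-order: the oldest *ready* unissued instruction may issue.
    Assumes single issue and distinct destination registers.
    """
    avail = {}               # register -> first cycle its value can be read
    pending = list(program)  # unissued instructions, oldest first
    cycle = 0
    while pending:
        cycle += 1
        window = pending if out_of_order else pending[:1]
        for i, (dest, srcs, latency) in enumerate(window):
            # A source is not ready if an older unissued instruction writes it.
            older_writes = {d for d, _, _ in pending[:i]}
            if all(s not in older_writes and avail.get(s, 0) <= cycle
                   for s in srcs):
                avail[dest] = cycle + latency
                del pending[i]
                break  # single issue: at most one instruction per cycle
    return cycle


program = [
    ("r1", ["r0"], 3),        # long-latency load into r1
    ("r2", ["r1"], 1),        # depends on the load, must wait
    ("r3", ["r4", "r5"], 1),  # independent of the load
    ("r6", ["r3"], 1),        # depends only on r3
]

print(issue_cycles(program, out_of_order=False))  # 6
print(issue_cycles(program, out_of_order=True))   # 4
```

In the in-order run, the two instructions that do not depend on the load sit behind the stalled consumer of r1; in the out-of-order run they fill the cycles the load leaves idle.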

ILP Limits of Scalar Pipelines (4)
• Instruction dependencies limit parallelism
  – Frequent stalls due to data and control dependencies
  – Solution 1: renaming, for WAR and WAW register dependences
  – Solution 2: speculation, for control dependences and memory dependences
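Renaming can be sketched as a mapping from architectural to physical registers in which every write receives a fresh physical register. The code below is a simplified illustration, not the mechanism described in the slides: it assumes an unbounded supply of physical registers with no free list or reclamation, and the names `rename`, `p0`, `p1`, and so on are hypothetical.

```python
from itertools import count


def rename(program):
    """Rename architectural registers to physical registers.

    program: list of (dest, [sources]) in program order.
    Every destination gets a fresh physical register, which removes
    WAR and WAW dependences; true (RAW) dependences are preserved
    because sources read the current mapping.
    Assumes unlimited physical registers (no free list).
    """
    fresh = (f"p{n}" for n in count())
    mapping = {}  # architectural register -> current physical register
    renamed = []
    for dest, srcs in program:
        new_srcs = []
        for s in srcs:
            if s not in mapping:  # live-in value: map it on first use
                mapping[s] = next(fresh)
            new_srcs.append(mapping[s])
        mapping[dest] = next(fresh)  # fresh destination for every write
        renamed.append((mapping[dest], new_srcs))
    return renamed


program = [
    ("r1", ["r2"]),  # r1 <- r2 + 1
    ("r3", ["r1"]),  # r3 <- r1 / 17  (RAW on r1)
    ("r1", ["r4"]),  # r1 <- r4 - 5   (WAW with insn 0, WAR with insn 1)
]

for insn in rename(program):
    print(insn)
# ('p1', ['p0'])
# ('p2', ['p1'])
# ('p4', ['p3'])
```

After renaming, the third instruction writes p4 rather than the register read by the second, so it no longer has to wait for either earlier instruction.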

Summary: ILP Limits of Scalar Pipelines
1) Scalar upper bound on throughput
  – Limited to IPC <= 1
  – Solution: superscalar pipelines with multiple insns at each stage
2) Inefficient unified pipeline
  – Lower resource utilization and longer instruction latency
  – Solution: diversified pipelines
3) Rigid pipeline stall policy
  – A stalled instruction stalls all newer instructions
  – Solution: out-of-order execution and inter-stage buffers
4) Instruction dependencies limit parallelism
  – Frequent stalls due to data and control dependencies
  – Solutions: renaming and speculation

State of the art: Out-of-Order Superscalar Speculative Pipelines

Superscalar Pipelines: Overall Picture
• Fetch issues:
  – Fetch multiple insns
  – Branches and speculation
• Decode issues:
  – Identify insns
  – Find dependences
• Execution issues:
  – Dispatch insns
  – Resolve dependences
  – Forwarding networks
  – Multiple outstanding memory accesses
• Completion issues:
  – Out-of-order completion
  – Speculative instructions
  – Precise exceptions

[Figure: overall organization. FETCH (I-cache, Branch Predictor), Instruction Buffer, DECODE, EXECUTE (Integer, Floating-point, Media, Memory units), Reorder Buffer (ROB), COMMIT (Store Queue, D-cache). Flows shown: Instruction Flow, Register Data Flow, Memory Data Flow.]

State of the art: Out-of-Order Superscalar Speculative Pipelines
