Overview Basics of Pipelining Pipeline Hazards Appendix A - PDF document

Appendix A
Pipelining: Basic and Intermediate Concepts

Overview
• Basics of Pipelining
• Pipeline Hazards
• Pipeline Implementation
• Pipelining + Exceptions
• Pipeline to handle Multicycle Operations

Unpipelined Execution of 3 LD Instructions
• Assumed are the following delays: Memory access = 2 nsec, ALU operation = 2 nsec, Register file access = 1 nsec.
[Figure: program execution order (in instructions) against time, 2 to 18 ns. ld r1, 100(r4), ld r2, 200(r5), and ld r3, 300(r6) each pass through Instruction fetch, Reg, ALU, Data access, Reg; each ld takes 8 ns.]
• Assuming a 2 nsec clock cycle time (i.e. a 500 MHz clock), every ld instruction needs 4 clock cycles (i.e. 8 nsec) to execute.
• The total time to execute this sequence is 12 clock cycles (i.e. 24 nsec). CPI = 12 cycles / 3 instructions = 4 cycles / instruction.

Pipelining: It's Natural!
• Laundry Example
• Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes

Sequential Laundry
[Figure: timeline from 6 PM to midnight; task order A, B, C, D; each load takes 30 + 40 + 20 minutes, one load after another.]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM to midnight; task order A, B, C, D; stage times along the axis are 30, 40, 40, 40, 40, 20 minutes.]
• Pipelined laundry takes 3.5 hours for 4 loads
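The 6-hour and 3.5-hour laundry figures can be reproduced with a small sketch. It assumes one washer, one dryer, and one folder, that all four loads are ready at 6 PM, and that a finished load can wait in a basket until the next machine is free. The function name `completion_times` is illustrative and does not come from the slides.

```python
def completion_times(stage_minutes, num_tasks):
    """Finish time of each task in a pipeline with one unit per stage.

    Assumption: a task may wait between stages until the next unit is free.
    """
    stage_free = [0] * len(stage_minutes)  # time at which each unit becomes idle
    finish = []
    for _ in range(num_tasks):
        t = 0  # every load is ready at time 0 (6 PM)
        for s, duration in enumerate(stage_minutes):
            start = max(t, stage_free[s])  # wait for the previous load to leave
            t = start + duration
            stage_free[s] = t
        finish.append(t)
    return finish


STAGES = [30, 40, 20]  # washer, dryer, folder (minutes)
LOADS = 4              # Ann, Brian, Cathy, Dave

sequential_minutes = LOADS * sum(STAGES)
pipelined_minutes = completion_times(STAGES, LOADS)[-1]

print(sequential_minutes / 60)  # 6.0 hours
print(pipelined_minutes / 60)   # 3.5 hours
```

The pipelined total is 30 + 4 × 40 + 20 = 210 minutes: the 40-minute dryer is the slowest stage, so it sets the rate once the pipeline is full.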

Key Definitions
Pipelining is a key implementation technique used to build fast processors. It allows the execution of multiple instructions to overlap in time.
A pipeline within a processor is similar to a car assembly line. Each assembly station is called a pipe stage or a pipe segment.
The throughput of an instruction pipeline is the measure of how often an instruction exits the pipeline.

Pipeline Stages
We can divide the execution of an instruction into the following 5 “classic” stages:
IF: Instruction Fetch
ID: Instruction Decode, register fetch
EX: Execution
MEM: Memory Access
WB: Register write Back

Pipeline Throughput and Latency
IF      ID      EX      MEM     WB
5 ns    4 ns    5 ns    10 ns   4 ns
Consider the pipeline above with the indicated delays. We want to know what is the pipeline throughput and the pipeline latency.
Pipeline throughput: instructions completed per second.
Pipeline latency: how long does it take to execute a single instruction in the pipeline.

Pipeline Throughput and Latency
IF      ID      EX      MEM     WB
5 ns    4 ns    5 ns    10 ns   4 ns
Pipeline throughput: how often an instruction is completed.
  = 1 instr / max[lat(IF), lat(ID), lat(EX), lat(MEM), lat(WB)]
  = 1 instr / max[5 ns, 4 ns, 5 ns, 10 ns, 4 ns]
  = 1 instr / 10 ns (ignoring pipeline register overhead)
Pipeline latency: how long does it take to execute an instruction in the pipeline.
  L = lat(IF) + lat(ID) + lat(EX) + lat(MEM) + lat(WB)
    = 5 ns + 4 ns + 5 ns + 10 ns + 4 ns = 28 ns
Is this right?
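Before answering that question, the two numbers themselves can be checked with a few lines of Python. This is only the arithmetic from the slide; the dictionary name `STAGE_LATENCY_NS` is illustrative, and pipeline register overhead is ignored as on the slide.

```python
# Stage delays from the slide, in nanoseconds.
STAGE_LATENCY_NS = {"IF": 5, "ID": 4, "EX": 5, "MEM": 10, "WB": 4}

# The slowest stage sets the rate at which instructions can complete.
cycle_ns = max(STAGE_LATENCY_NS.values())
throughput_per_ns = 1 / cycle_ns

# Summing the stage delays gives the latency of one instruction on its own.
isolated_latency_ns = sum(STAGE_LATENCY_NS.values())

print(f"throughput: 1 instr / {cycle_ns} ns")            # 1 instr / 10 ns
print(f"isolated latency: {isolated_latency_ns} ns")     # 28 ns
```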
Pipelining Lessons
[Figure: pipelined laundry timeline from 6 PM to 9 PM; task order A, B, C, D; stage times along the axis are 30, 40, 40, 40, 40, 20 minutes.]
• Pipelining doesn’t help latency of single task, it helps throughput of entire workload
• Pipeline rate limited by slowest pipeline stage
• Multiple tasks operating simultaneously
• Potential speedup = Number pipe stages
• Unbalanced lengths of pipe stages reduces speedup
• Time to “fill” pipeline and time to “drain” it reduces speedup

Pipeline Throughput and Latency
IF      ID      EX      MEM     WB
5 ns    4 ns    5 ns    10 ns   4 ns
Simply adding the latencies to compute the pipeline latency would work only for an isolated instruction.
I1   IF  ID  EX  MEM WB                 L(I1) = 28 ns
I2       IF  ID  EX  MEM WB             L(I2) = 33 ns
I3           IF  ID  EX  MEM WB         L(I3) = 38 ns
I4               IF  ID  EX  MEM WB     L(I4) = 43 ns
We are in trouble! The latency is not constant. This happens because this is an unbalanced pipeline: each instruction waits longer than the one before it for the 10 ns MEM stage. The solution is to make every stage the same length as the longest one.
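The growing latencies (28, 33, 38, 43 ns) can be reproduced with a short simulation. It is a sketch under two assumptions that the slides do not state explicitly: an instruction enters IF as soon as IF is free, and it may wait in front of a busy stage. The function name `instruction_latencies` is illustrative.

```python
def instruction_latencies(stage_ns, num_instructions):
    """Latency of each instruction, measured from the start of its IF stage.

    Assumptions: one unit per stage, and an instruction waits in front of a
    busy stage until the previous instruction has left it.
    """
    stage_free = [0] * len(stage_ns)  # time at which each stage becomes idle
    latencies = []
    for _ in range(num_instructions):
        fetch_start = stage_free[0]   # enter IF as soon as it is free
        t = fetch_start
        for s, duration in enumerate(stage_ns):
            start = max(t, stage_free[s])
            t = start + duration
            stage_free[s] = t
        latencies.append(t - fetch_start)
    return latencies


unbalanced = instruction_latencies([5, 4, 5, 10, 4], 4)
balanced = instruction_latencies([10, 10, 10, 10, 10], 4)

print(unbalanced)  # [28, 33, 38, 43]
print(balanced)    # [50, 50, 50, 50]
```

With every stage stretched to 10 ns, the latency of a single instruction rises to 50 ns but stays constant, and one instruction still completes every 10 ns.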

Other Definitions
• Pipe stage or pipe segment
  – A decomposable unit of the fetch-decode-execute paradigm
• Pipeline depth
  – Number of stages in a pipeline
• Machine cycle
  – Clock cycle time
• Latch
  – Per phase/stage local information storage unit

Design Issues
• Balance the length of each pipeline stage
  \[ \text{Throughput} = \frac{\text{Depth of the pipeline}}{\text{Time per instruction on unpipelined machine}} \]
• Problems
  – Usually, stages are not balanced
  – Pipelining overhead
  – Hazards (conflicts)
• Performance (throughput, CPU performance equation)
  – Decrease of the CPI
  – Decrease of cycle time

Basic Pipeline
              Clock number
Instr #   1    2    3    4    5    6    7    8    9
i         IF   ID   EX   MEM  WB
i+1            IF   ID   EX   MEM  WB
i+2                 IF   ID   EX   MEM  WB
i+3                      IF   ID   EX   MEM  WB
i+4                           IF   ID   EX   MEM  WB

Pipelined Datapath with Resources
[Figure: pipelined datapath with resources.]

Physics of Clock Skew
• Basically caused because the clock edge reaches different parts of the chip at different times
  – Capacitance-charge-discharge rates
    • All wires, leads, transistors, etc. have capacitance
    • Longer wire, larger capacitance
  – Repeaters used to drive current, handle fan-out problems
    • C is inversely proportional to rate-of-change of V
      – Time to charge/discharge adds to delay
      – Dominant problem in old integration densities.
    • For a fixed C, rate-of-change of V is proportional to I
      – Problem with this approach is power requirements go up
      – Power dissipation becomes a problem.
  – Speed-of-light propagation delays
    • Dominates current integration densities as nowadays capacitances are much lower.
    • But nowadays clock rates are much faster (even small delays will consume a large part of the clock cycle)
• Current day research → asynchronous chip designs

Pipeline Registers
[Figure: pipeline registers.]
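The Basic Pipeline chart follows one rule: with no stalls, instruction i+k occupies stage number s (counting from 0) in clock cycle k + s + 1. A minimal sketch that rebuilds the chart from that rule is shown here; the function name `pipeline_diagram` and the column widths are illustrative choices.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(num_instructions, stages=STAGES):
    """One row per instruction, one column per clock cycle (ideal, no stalls)."""
    total_cycles = num_instructions + len(stages) - 1
    rows = []
    for k in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            row[k + s] = name  # instruction i+k is in stage s during cycle k+s+1
        rows.append(row)
    return rows


rows = pipeline_diagram(5)
print("instr  " + " ".join(f"{c:<4}" for c in range(1, len(rows[0]) + 1)))
for k, row in enumerate(rows):
    label = "i" if k == 0 else f"i+{k}"
    print(f"{label:<6} " + " ".join(f"{cell:<4}" for cell in row))
```

Five instructions on a five-stage pipeline finish in 5 + 5 - 1 = 9 clock cycles, which matches the nine columns of the chart.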

Computing Pipeline Speedup
\[ \text{Speedup} = \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}} \]
\[ \text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instr} \]
\[ \text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall clock cycles per instr}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}} \]
\[ \text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}} \]
Remember that average instruction time = CPI × Clock Cycle, and that the ideal CPI for a pipelined machine is 1.

Performance Issues
• Unpipelined processor
  – 1.0 nsec clock cycle
  – 4 cycles for ALU and branches
  – 5 cycles for memory
  – Frequencies: ALU (40%), Branch (20%), and Memory (40%)
• Clock skew and setup adds 0.2 ns overhead
• Speedup with pipelining?

Pipeline Hazards
• Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
  – Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
  – Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)
  – Control hazards: Pipelining of branches & other instructions that change the PC
• Common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline

Structural Hazards
• Overlapped execution of instructions:
  – Pipelining of functional units
  – Duplication of resources
• Structural Hazard
  – When the pipeline cannot accommodate some combination of instructions
• Consequences
  – Stall
  – Increase of CPI from its ideal value (1)

Structural Hazard with 1 port per Memory
[Figure: structural hazard with one port per memory.]

Pipelining of Functional Units
[Figure: an FP Multiply unit with stages M1 to M5 placed between IF, ID and MEM, WB, shown in three variants: fully pipelined, partially pipelined, and not pipelined.]
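The question on the Performance Issues slide can be worked through with the speedup definition. The sketch assumes, as the slide implies but does not state, that the pipelined machine reaches the ideal CPI of 1 (no stalls) and that its clock cycle is the unpipelined 1.0 ns plus the 0.2 ns overhead. The variable names are illustrative.

```python
CLOCK_UNPIPELINED_NS = 1.0
OVERHEAD_NS = 0.2  # clock skew and setup

# Instruction class -> (frequency, cycles on the unpipelined machine).
MIX = {"alu": (0.40, 4), "branch": (0.20, 4), "memory": (0.40, 5)}

# Average instruction time = CPI x clock cycle.
avg_cpi_unpipelined = sum(freq * cycles for freq, cycles in MIX.values())
time_unpipelined_ns = avg_cpi_unpipelined * CLOCK_UNPIPELINED_NS

# Assumption: ideal pipelined CPI of 1, clock stretched by the overhead.
PIPELINED_CPI = 1
time_pipelined_ns = PIPELINED_CPI * (CLOCK_UNPIPELINED_NS + OVERHEAD_NS)

speedup = time_unpipelined_ns / time_pipelined_ns
print(f"{time_unpipelined_ns:.1f} ns / {time_pipelined_ns:.1f} ns = {speedup:.2f}")
```

Under these assumptions the average unpipelined instruction takes 4.4 ns, the pipelined one 1.2 ns, and the speedup is about 3.7. Any pipeline stalls would raise the pipelined CPI above 1 and lower this figure.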
