Appendix A: Pipelining - Basic and Intermediate Concepts (PowerPoint PPT Presentation)



slide-1
SLIDE 1

Appendix A

Pipelining: Basic and Intermediate Concepts

slide-2
SLIDE 2

Overview

  • Basics of Pipelining
  • Pipeline Hazards
  • Pipeline Implementation
  • Pipelining + Exceptions
  • Pipeline to Handle Multicycle Operations

slide-3
SLIDE 3

Unpipelined Execution of 3 LD Instructions

  • Assumed are the following delays: memory access = 2 nsec,
    ALU operation = 2 nsec, register file access = 1 nsec.

[Figure: program execution order (in instructions) against a time axis
of 2-18 ns. Each of ld r1, 100(r4); ld r2, 200(r5); ld r3, 300(r6)
occupies 8 ns: instruction fetch, Reg, ALU, data access, Reg.]

  • Assuming a 2 nsec clock cycle time (i.e. a 500 MHz clock), every ld
    instruction needs 4 clock cycles (i.e. 8 nsec) to execute.
  • The total time to execute this sequence is 12 clock cycles (i.e.
    24 nsec). CPI = 12 cycles / 3 instructions = 4 cycles per instruction.
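The arithmetic above can be checked with a few lines of Python (stage names are illustrative; the delays and the 2 ns clock are taken from the slide):

```python
# Sanity check of the unpipelined-execution numbers above.
STAGE_DELAYS_NS = {"fetch": 2, "reg_read": 1, "alu": 2, "mem": 2, "reg_write": 1}

def unpipelined_time_ns(n_instructions, clock_ns=2):
    """Each instruction runs start to finish before the next begins."""
    per_instr_ns = sum(STAGE_DELAYS_NS.values())   # 8 ns per ld
    cycles_per_instr = per_instr_ns // clock_ns    # 4 clock cycles
    return n_instructions * cycles_per_instr * clock_ns

total_ns = unpipelined_time_ns(3)   # 24 ns for the three loads
cpi = (total_ns // 2) / 3           # 12 cycles / 3 instructions
```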

slide-4
SLIDE 4

Pipelining: It's Natural!

  • Laundry Example
  • Ann, Brian, Cathy, Dave (A, B, C, D)
    each have one load of clothes
    to wash, dry, and fold
  • Washer takes 30 minutes
  • Dryer takes 40 minutes
  • “Folder” takes 20 minutes
slide-5
SLIDE 5

Sequential Laundry

[Figure: task order A, B, C, D down the page against a time axis from
6 PM to midnight; each load occupies its 30 + 40 + 20 minutes back to
back, with no overlap between loads.]

  • Sequential laundry takes 6 hours for 4 loads
  • If they learned pipelining, how long would laundry take?
slide-6
SLIDE 6

Pipelined Laundry: Start Work ASAP

[Figure: same time axis, 6 PM to midnight. Load B enters the washer as
soon as A moves to the dryer, so after the first 30 minutes the
40-minute dryer stage is busy continuously.]

  • Pipelined laundry takes 3.5 hours for 4 loads
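The two laundry schedules can be simulated directly (stage times in minutes are from the slides; the one-washer/one-dryer/one-folder assumption is the slide's):

```python
# Laundry pipeline timing.
WASH, DRY, FOLD = 30, 40, 20

def sequential_minutes(loads):
    """Each load finishes completely before the next starts."""
    return loads * (WASH + DRY + FOLD)

def pipelined_minutes(loads):
    """Each stage takes the next load as soon as both are free; the
    40-minute dryer is the bottleneck stage."""
    finish_wash = finish_dry = finish_fold = 0
    for _ in range(loads):
        finish_wash = finish_wash + WASH
        finish_dry = max(finish_wash, finish_dry) + DRY
        finish_fold = max(finish_dry, finish_fold) + FOLD
    return finish_fold

seq = sequential_minutes(4)    # 360 min = 6 hours
pipe = pipelined_minutes(4)    # 210 min = 3.5 hours
```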
slide-7
SLIDE 7

Key Definitions

Pipelining is a key implementation technique used to build fast processors. It allows the execution of multiple instructions to overlap in time. A pipeline within a processor is similar to a car assembly line. Each assembly station is called a pipe stage or a pipe segment. The throughput of an instruction pipeline is the measure of how often an instruction exits the pipeline.

slide-8
SLIDE 8

Pipeline Stages

We can divide the execution of an instruction into the following 5 “classic” stages:

IF:  Instruction Fetch
ID:  Instruction Decode, register fetch
EX:  Execution
MEM: Memory Access
WB:  Register Write Back

slide-9
SLIDE 9

Pipeline Throughput and Latency

IF ID EX MEM WB

5 ns 4 ns 5 ns 10 ns 4 ns

  • Consider the pipeline above with the indicated delays. We want to
    know the pipeline throughput and the pipeline latency.
    Pipeline throughput: instructions completed per second.
    Pipeline latency: how long it takes to execute a single instruction
    in the pipeline.

slide-10
SLIDE 10

Pipeline Throughput and Latency

IF ID EX MEM WB

5 ns 4 ns 5 ns 10 ns 4 ns

Pipeline throughput: how often an instruction is completed.

Throughput = 1 instr / max[ lat(IF), lat(ID), lat(EX), lat(MEM), lat(WB) ]
           = 1 instr / max[ 5 ns, 4 ns, 5 ns, 10 ns, 4 ns ]
           = 1 instr / 10 ns       (ignoring pipeline register overhead)

Pipeline latency: how long it takes to execute an instruction in the
pipeline.

L = lat(IF) + lat(ID) + lat(EX) + lat(MEM) + lat(WB)
  = 5 ns + 4 ns + 5 ns + 10 ns + 4 ns = 28 ns

Is this right?

slide-11
SLIDE 11

Pipeline Throughput and Latency

IF ID EX MEM WB

5 ns 4 ns 5 ns 10 ns 4 ns

Simply adding the latencies to compute the pipeline latency would only
work for an isolated instruction:

L(I1) = 28 ns,  L(I2) = 33 ns,  L(I3) = 38 ns,  L(I4) = 43 ns, ...

We are in trouble! The latency is not constant. This happens because
this is an unbalanced pipeline. The solution is to make every stage
the same length as the longest one.
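The growing latencies can be reproduced with a small occupancy model (stage delays from the slide; the model simply makes each instruction wait until the next stage is free):

```python
# Per-instruction latency in the unbalanced 5-stage pipeline above.
STAGE_NS = [5, 4, 5, 10, 4]   # IF, ID, EX, MEM, WB

def latencies(n):
    """Completion latency of each of n back-to-back instructions."""
    free = [0] * len(STAGE_NS)        # time at which each stage frees up
    out = []
    for i in range(n):
        start = t = i * STAGE_NS[0]   # a new IF can begin every 5 ns
        for s, d in enumerate(STAGE_NS):
            t = max(t, free[s]) + d   # wait for the stage, then use it
            free[s] = t
        out.append(t - start)
    return out
```

`latencies(4)` grows by 5 ns per instruction because the 10 ns MEM stage backs everything up; with all stages padded to 10 ns the latency is a constant 50 ns.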

slide-12
SLIDE 12

Pipelining Lessons

  • Pipelining doesn’t help the latency of a single task; it helps the
    throughput of the entire workload
  • Pipeline rate is limited by the slowest pipeline stage
  • Multiple tasks operate simultaneously
  • Potential speedup = number of pipe stages
  • Unbalanced lengths of pipe stages reduce speedup
  • Time to “fill” the pipeline and time to “drain” it reduce speedup

[Figure: the pipelined laundry chart again, tasks A-D in order against
the 6 PM-9 PM time axis, stages of 30, 40, 40, 40, 40, 20 minutes.]

slide-13
SLIDE 13

Other Definitions

  • Pipe stage or pipe segment
    – A decomposable unit of the fetch-decode-execute paradigm
  • Pipeline depth
    – Number of stages in a pipeline
  • Machine cycle
    – Clock cycle time
  • Latch
    – Per phase/stage local information storage unit

slide-14
SLIDE 14

Design Issues

  • Balance the length of each pipeline stage:

        Time per instruction (pipelined) =
            Time per instruction (unpipelined) / Depth of the pipeline

  • Problems
    – Usually, stages are not balanced
    – Pipelining overhead
    – Hazards (conflicts)
  • Performance (throughput CPU performance equation)
    – Decrease of the CPI
    – Decrease of cycle time

slide-15
SLIDE 15

Basic Pipeline

Clock number:  1   2   3   4   5   6   7   8   9
Instr i        IF  ID  EX  MEM WB
Instr i+1          IF  ID  EX  MEM WB
Instr i+2              IF  ID  EX  MEM WB
Instr i+3                  IF  ID  EX  MEM WB
Instr i+4                      IF  ID  EX  MEM WB
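A staircase diagram like the one above is easy to generate; this sketch is purely illustrative (stage names from the slides, one instruction issued per clock):

```python
# A tiny generator for the classic 5-stage staircase diagram.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(n_instructions):
    """One row per instruction; each starts one clock after its predecessor."""
    rows = []
    for i in range(n_instructions):
        label = "i" if i == 0 else f"i+{i}"
        cells = ["   "] * i + [s.ljust(3) for s in STAGES]
        rows.append(label.ljust(5) + " ".join(cells).rstrip())
    return "\n".join(rows)
```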

slide-16
SLIDE 16

Pipelined Datapath with Resources


slide-17
SLIDE 17

Pipeline Registers


slide-18
SLIDE 18

Physics of Clock Skew

  • Basically caused because the clock edge reaches different parts of
    the chip at different times
    – Capacitance-charge-discharge rates
      • All wires, leads, transistors, etc. have capacitance
      • Longer wire, larger capacitance
      • Repeaters used to drive current, handle fan-out problems
      • C is inversely proportional to rate-of-change of V
        – Time to charge/discharge adds to delay
        – Dominant problem in old integration densities
      • For a fixed C, rate-of-change of V is proportional to I
        – Problem with this approach is that power requirements go up
        – Power dissipation becomes a problem
    – Speed-of-light propagation delays
      • Dominate current integration densities, as nowadays
        capacitances are much lower
      • But nowadays clock rates are much faster (even small delays
        will consume a large part of the clock cycle)
  • Current-day research: asynchronous chip designs
slide-19
SLIDE 19

Performance Issues

  • Unpipelined processor
    – 1.0 nsec clock cycle
    – 4 cycles for ALU operations and branches
    – 5 cycles for memory operations
    – Frequencies: ALU (40%), branch (20%), memory (40%)
  • Clock skew and setup adds 0.2 ns overhead to the pipelined clock
  • Speedup with pipelining?
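One way to work the numbers (the answer is not on the slide itself, so treat this as a worked check using the slide's figures and an ideal pipelined CPI of 1):

```python
# Worked check for the speedup question above.
freq = {"alu": 0.40, "branch": 0.20, "mem": 0.40}
cycles = {"alu": 4, "branch": 4, "mem": 5}

unpipelined_cpi = sum(freq[k] * cycles[k] for k in freq)   # 4.4 cycles
unpipelined_time_ns = unpipelined_cpi * 1.0                # 1.0 ns clock

pipelined_time_ns = 1.0 * (1.0 + 0.2)   # CPI = 1, clock stretched by skew/setup
speedup = unpipelined_time_ns / pipelined_time_ns          # about 3.7
```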

slide-20
SLIDE 20

Computing Pipeline Speedup

Speedup = (average instruction time unpipelined) /
          (average instruction time pipelined)

CPI pipelined = Ideal CPI pipelined + pipeline stall clock cycles per instr

Speedup = [Ideal CPI x Pipeline depth / (Ideal CPI + Pipeline stalls per instr)]
          x (Clock cycle unpipelined / Clock cycle pipelined)

Speedup = [Pipeline depth / (1 + Pipeline stall CPI)]
          x (Clock cycle unpipelined / Clock cycle pipelined)

Remember that average instruction time = CPI x clock cycle, and that
the ideal CPI for a pipelined machine is 1.

slide-21
SLIDE 21

Pipeline Hazards

  • Limits to pipelining: hazards prevent the next instruction from
    executing during its designated clock cycle
    – Structural hazards: HW cannot support this combination of
      instructions (single person to fold and put clothes away)
    – Data hazards: instruction depends on the result of a prior
      instruction still in the pipeline (missing sock)
    – Control hazards: pipelining of branches & other instructions
      that change the PC
  • Common solution is to stall the pipeline until the hazard is
    resolved, inserting one or more “bubbles” in the pipeline

slide-22
SLIDE 22

Structural Hazards

  • Overlapped execution of instructions requires:
    – Pipelining of functional units
    – Duplication of resources
  • Structural Hazard
    – When the pipeline cannot accommodate some combination of
      instructions
  • Consequences
    – Stall
    – Increase of CPI from its ideal value (1)

slide-23
SLIDE 23

Structural Hazard with 1 port per Memory


slide-24
SLIDE 24

Pipelining of Functional Units

[Figure: three versions of an FP multiply unit with stages M1-M5,
each shown with two overlapped instructions (IF ID ... MEM WB).
 - Fully pipelined: a second multiply may enter M1 one cycle after
   the first (IF ID M1 M2 M3 M4 M5 MEM WB, back to back).
 - Partially pipelined: a second multiply must wait until the first
   has cleared part of the unit.
 - Not pipelined: EX is occupied for the whole operation, so a second
   multiply waits until the first has left the unit entirely.]

slide-25
SLIDE 25

To Pipeline or Not to Pipeline

  • Elements to consider
    – Effects of pipelining and duplicating units
      • Increased costs
      • Higher latency (pipeline register overhead)
    – Frequency of structural hazards
  • Example: unpipelined FP multiply unit in MIPS
    – Latency: 5 cycles
    – Impact on the mdljdp2 program?
      • Frequency of FP instructions: 14%
    – Depends on the distribution of FP multiplies
      • Best case: uniform distribution
      • Worst case: clustered, back-to-back multiplies
slide-26
SLIDE 26

Example: Dual-port vs. Single-port Memory

  • Machine A: dual-ported memory
  • Machine B: single-ported memory, but its pipelined implementation
    has a 1.05 times faster clock rate
  • Ideal CPI = 1 for both
  • Loads are 40% of instructions executed

SpeedUpA = Pipeline Depth / (1 + 0) x (clock_unpipe / clock_pipe)
         = Pipeline Depth

SpeedUpB = Pipeline Depth / (1 + 0.4 x 1)
           x (clock_unpipe / (clock_unpipe / 1.05))
         = (Pipeline Depth / 1.4) x 1.05
         = 0.75 x Pipeline Depth

SpeedUpA / SpeedUpB = Pipeline Depth / (0.75 x Pipeline Depth) = 1.33

  • Machine A is 1.33 times faster
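The example plugs straight into the speedup formula from the earlier slide; a quick check (the pipeline depth of 5 is an arbitrary choice, since it cancels in the ratio):

```python
# Check of the dual-port vs. single-port example.
def pipelined_speedup(depth, stall_cpi, clock_ratio=1.0):
    """Speedup = depth / (1 + stall CPI) x (unpipelined/pipelined clock ratio)."""
    return depth / (1 + stall_cpi) * clock_ratio

depth = 5                                        # cancels in the ratio below
speedup_a = pipelined_speedup(depth, 0.0)        # dual-ported: no memory stalls
speedup_b = pipelined_speedup(depth, 0.4, 1.05)  # 40% loads stall 1 cycle
ratio = speedup_a / speedup_b                    # about 1.33
```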
slide-27
SLIDE 27

Data Hazards


slide-28
SLIDE 28

Three Generic Data Hazards

InstrI followed by InstrJ

  • Read After Write (RAW)

    InstrJ tries to read an operand before InstrI writes it

slide-29
SLIDE 29

Three Generic Data Hazards (Cont’d)

InstrI followed by InstrJ

  • Write After Read (WAR)

    InstrJ tries to write an operand before InstrI reads it
    – Gets wrong operand

  • Can’t happen in MIPS 5-stage pipeline because:
    – All instructions take 5 stages, and
    – Reads are always in stage 2, and
    – Writes are always in stage 5

slide-30
SLIDE 30

Three Generic Data Hazards (Cont’d)

InstrI followed by InstrJ

  • Write After Write (WAW)

    InstrJ tries to write an operand before InstrI writes it
    – Leaves wrong result (InstrI, not InstrJ)

  • Can’t happen in MIPS 5-stage pipeline because:
    – All instructions take 5 stages, and
    – Writes are always in stage 5

  • Will see WAR and WAW in later, more complicated pipes

slide-31
SLIDE 31

Examples in More Complicated Pipelines

  • WAW - write after write

    LW  R1, 0(R2)     IF ID EX M1 M2 WB
    ADD R1, R2, R3       IF ID EX WB

  • WAR - write after read

    ADD R1, R2, R3    IF ID EX WB
    SW  0(R1), R2        IF ID EX M1 M2 WB
    ADD R2, R3, R4          IF ID EX WB

    This is a problem if register writes are during the first half of
    the cycle and reads during the second half.

slide-32
SLIDE 32

Avoiding Data Hazards with Forwarding

slide-33
SLIDE 33

Forwarding of Operands by Stores


slide-34
SLIDE 34

Stalls in spite of Forwarding


slide-35
SLIDE 35

Software Scheduling to Avoid Load Hazards

Try producing fast code for

    a = b + c;
    d = e - f;

assuming a, b, c, d, e, and f are in memory.

Slow code:            Fast code:
  LW  Rb,b              LW  Rb,b
  LW  Rc,c              LW  Rc,c
  ADD Ra,Rb,Rc          LW  Re,e
  SW  a,Ra              ADD Ra,Rb,Rc
  LW  Re,e              LW  Rf,f
  LW  Rf,f              SW  a,Ra
  SUB Rd,Re,Rf          SUB Rd,Re,Rf
  SW  d,Rd              SW  d,Rd

slide-36
SLIDE 36

Effect of Software Scheduling

[Pipeline listing for both schedules, each instruction flowing
IF ID EX MEM WB. In the slow code, ADD Ra,Rb,Rc and SUB Rd,Re,Rf each
immediately follow the load that produces one of their operands and
must stall a cycle; in the fast code the loads are hoisted so no
instruction stalls.]
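The load-use stalls in the two schedules can be counted mechanically. This is a minimal sketch, assuming full forwarding (so only a load followed immediately by a consumer of its result costs one stall) and the simple three-operand syntax used on the slide:

```python
# Minimal load-use stall counter for straight-line MIPS-style code.
def parse(instr):
    op, rest = instr.split(None, 1)
    ops = [r.strip() for r in rest.split(",")]
    if op == "LW":
        return op, ops[0], []        # destination register, no reg sources
    if op == "SW":
        return op, None, [ops[1]]    # the stored register is a source
    return op, ops[0], ops[1:]       # ALU op: dest, then sources

def load_use_stalls(code):
    stalls = 0
    prev_op, prev_dest = None, None
    for instr in code:
        op, dest, srcs = parse(instr)
        if prev_op == "LW" and prev_dest in srcs:
            stalls += 1              # consumer right behind the load
        prev_op, prev_dest = op, dest
    return stalls

slow = ["LW Rb,b", "LW Rc,c", "ADD Ra,Rb,Rc", "SW a,Ra",
        "LW Re,e", "LW Rf,f", "SUB Rd,Re,Rf", "SW d,Rd"]
fast = ["LW Rb,b", "LW Rc,c", "LW Re,e", "ADD Ra,Rb,Rc",
        "LW Rf,f", "SW a,Ra", "SUB Rd,Re,Rf", "SW d,Rd"]
```

The slow schedule pays two stall cycles; the scheduled version pays none.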

slide-37
SLIDE 37

Compiler Scheduling

  • Eliminates load interlocks
  • Demands more registers
  • Simple scheduling
    – Basic block (sequential segment of code)
    – Good for simple pipelines
    – Percentage of loads that result in a stall:
      • FP: 13%
      • Int: 25%

slide-38
SLIDE 38

Control (Branch) Hazards

Branch              IF ID EX MEM WB
Branch successor       IF stall stall IF ID EX MEM WB
Branch successor+1                       IF ID EX MEM WB
Branch successor+2                          IF ID EX MEM WB
Branch successor+3                             IF ID EX MEM
Branch successor+4                                IF ID EX

  • Stall the pipeline until we reach MEM
    – Easy, but expensive
    – Three cycles for every branch
  • To reduce the branch delay:
    – Find out whether the branch is taken or not taken ASAP
    – Compute the branch target ASAP

slide-39
SLIDE 39

Impact of Branch Stall on Pipeline Speedup

  • If CPI = 1 and 30% of instructions are branches, the 3-cycle
    branch stall of the previous slide gives
    CPI = 1 + 0.3 x 3 = 1.9, nearly halving the ideal speedup.
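The CPI degradation is a one-liner (branch frequency and ideal CPI from this slide; the 3-cycle penalty from the previous one):

```python
# Effect of a 3-cycle branch stall on CPI.
ideal_cpi = 1.0
branch_freq = 0.30
branch_penalty = 3      # stall cycles per branch, from the previous slide

real_cpi = ideal_cpi + branch_freq * branch_penalty   # 1.9
slowdown = real_cpi / ideal_cpi                       # 1.9x worse than ideal
```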

slide-40
SLIDE 40

Reduction of Branch Penalties

Static, compile-time branch prediction schemes:

1 Stall the pipeline
    Simple in hardware and software
2 Treat every branch as not taken
    Continue execution as if the branch were a normal instruction;
    if the branch is taken, turn the fetched instruction into a no-op
3 Treat every branch as taken
    Useless in MIPS ... Why?
4 Delayed branch
    Sequential successors (in delay slots) are executed anyway;
    no branches in the delay slots

slide-41
SLIDE 41

Delayed Branch

#4: Delayed Branch

  – Define the branch to take place AFTER a following instruction:

        branch instruction
        sequential successor1
        sequential successor2        } branch delay of length n
        ........
        sequential successorn
        branch target if taken

  – A 1-slot delay allows a proper decision and branch target address
    in a 5-stage pipeline
  – MIPS uses this

slide-42
SLIDE 42

Predict-not-taken Scheme

Untaken branch    IF ID EX MEM WB
Instruction i+1      IF ID EX MEM WB
Instruction i+2         IF ID EX MEM WB
Instruction i+3            IF ID EX MEM WB

Taken branch      IF ID EX MEM WB
Instruction i+1      IF stall stall stall stall   (clear the IF/ID register)
Branch target           IF ID EX MEM WB
Branch target+1            IF ID EX MEM WB
Branch target+2               IF ID EX MEM WB

The compiler organizes code so that the most frequent path is the
not-taken one.

slide-43
SLIDE 43

Canceling Branch Instructions

A canceling branch includes the predicted direction:

  • Incorrect prediction => the delay-slot instruction becomes a no-op
  • Helps the compiler fill branch delay slots (no requirements for
    cases b and c of the branch-slot table)
  • Behavior of a predicted-taken canceling branch:

Taken branch      IF ID EX MEM WB
Instruction i+1      IF ID EX MEM WB          (delay slot executes)
Branch target           IF ID EX MEM WB
Branch target+1            IF ID EX MEM WB
Branch target+2               IF ID EX MEM WB

Untaken branch    IF ID EX MEM WB
Instruction i+1      IF stall stall stall stall   (clear the IF/ID register)
Instruction i+2         IF ID EX MEM WB
Instruction i+3            IF ID EX MEM WB
Instruction i+4               IF ID EX MEM WB

slide-44
SLIDE 44

Delayed Branch

  • Where to get instructions to fill the branch delay slot?
    – From before the branch instruction
    – From the target address: only valuable when branch taken
    – From fall-through: only valuable when branch not taken
  • Compiler effectiveness for a single branch delay slot:
    – Fills about 60% of branch delay slots
    – About 80% of instructions executed in branch delay slots are
      useful in computation
    – About 50% (60% x 80%) of slots usefully filled
  • Delayed branch downside: 7-8 stage pipelines, multiple
    instructions issued per clock (superscalar)

slide-45
SLIDE 45

Optimizations of the Branch Slot

From before:
    DADD R1,R2,R3                 if R2=0 then
    if R2=0 then          =>          delay slot: DADD R1,R2,R3

From target:
    DSUB R4,R5,R6                 DADD R1,R2,R3
    ...                           if R1=0 then
    DADD R1,R2,R3         =>          delay slot: DSUB R4,R5,R6
    if R1=0 then

From fall-through:
    DADD R1,R2,R3                 DADD R1,R2,R3
    if R1=0 then                  if R1=0 then
    OR R7,R8,R9           =>          delay slot: OR R7,R8,R9
    DSUB R4,R5,R6                 DSUB R4,R5,R6

slide-46
SLIDE 46

Branch Slot Requirements

Strategy          Requirements                         Improves performance
a) From before    Branch must not depend on the        Always
                  delayed instruction
b) From target    Must be OK to execute the delayed    When branch is taken
                  instruction if branch is not taken
c) From fall-     Must be OK to execute the delayed    When branch is not
   through        instruction if branch is taken       taken

Limitations in delayed-branch scheduling:
  Restrictions on instructions that are scheduled
  Ability to predict branches at compile time

slide-47
SLIDE 47

Some Working Examples


slide-48
SLIDE 48

Branch Behavior in Programs

                                 Integer    FP
Forward conditional branches       13%       7%
Backward conditional branches       3%       2%
Unconditional branches              4%       1%
Branches taken                     62%      70%

Pipeline speedup = Pipeline depth / (1 + Branch frequency x Branch penalty)

Branch penalty for predict-taken = 1
Branch penalty for predict-not-taken = probability of branches taken
Branch penalty for delayed branches is a function of how often the delay
slot is usefully filled (not cancelled); always guaranteed to be as good
as or better than the other approaches.
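Plugging the integer-program statistics into the speedup formula above (the pipeline depth of 5 is an assumption for illustration, and the penalty of one cycle per resolved-taken branch follows the slide's convention):

```python
# Speedup under the two static prediction policies, integer column.
depth = 5
branch_freq = 0.13 + 0.03 + 0.04        # all branch types = 20%
taken_prob = 0.62

def speedup(branch_penalty):
    return depth / (1 + branch_freq * branch_penalty)

predict_taken = speedup(1)               # pay 1 cycle on every branch
predict_not_taken = speedup(taken_prob)  # pay 1 cycle only when taken
```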

slide-49
SLIDE 49

Static Branch Prediction for Scheduling to Avoid Data Hazards

  • Correct predictions
    – Reduce the branch hazard penalty
    – Help the scheduling around data hazards:

        LW   R1, 0(R2)
        SUB  R1, R1, R3
        BEQZ R1, L
        OR   R4, R5, R6
        ADD  R10, R4, R3
    L:  ADD  R7, R8, R9

      If the branch is almost always taken, the instruction at L
      (ADD R7,R8,R9) can be scheduled into the load delay; if it is
      almost never taken, the fall-through OR R4,R5,R6 can be.
  • Prediction methods
    – Examination of program behavior (benchmarks)
    – Use of profile information from previous runs

slide-50
SLIDE 50

How is Pipelining Implemented? MIPS Instruction Formats

I-type:  Opcode (bits 0-5) | rs1 (6-10) | rd (11-15) | immediate (16-31)

R-type:  Opcode (bits 0-5) | rs1 (6-10) | rs2 (11-15) | rd (16-20) |
         shamt/function (21-31)

J-type:  Opcode (bits 0-5) | address (6-31)

Fixed-field decoding
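Fixed-field decoding means every format keeps the register specifiers in the same bit positions, so they can be extracted before the opcode is known. A sketch (note the slides number bits from the MSB, so bit 0 is the leftmost bit; the opcode value below is made up for illustration):

```python
# Fixed-field decoding of the I-type layout above.
def field(word, hi_bit, lo_bit):
    """Extract bits hi_bit..lo_bit of a 32-bit word, bit 0 = MSB."""
    width = lo_bit - hi_bit + 1
    shift = 31 - lo_bit
    return (word >> shift) & ((1 << width) - 1)

def decode_i_type(word):
    return {
        "opcode": field(word, 0, 5),
        "rs1": field(word, 6, 10),
        "rd": field(word, 11, 15),
        "immediate": field(word, 16, 31),
    }

# A hand-built word: opcode 0x23 (arbitrary), rs1 = 2, rd = 1, imm = 45
word = (0x23 << 26) | (2 << 21) | (1 << 16) | 45
```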

slide-51
SLIDE 51

1st and 2nd Instruction cycles

  • Instruction fetch (IF):

      IR  <- Mem[PC];
      NPC <- PC + 4

  • Instruction decode & register fetch (ID):

      A   <- Regs[IR6..10];
      B   <- Regs[IR11..15];
      Imm <- ((IR16)^16 ## IR16..31)      (sign-extended immediate)

slide-52
SLIDE 52

3rd Instruction cycle

  • Execution & effective address (EX)
    – Memory reference:
        ALUOutput <- A + Imm
    – Register-register ALU instruction:
        ALUOutput <- A func B
    – Register-immediate ALU instruction:
        ALUOutput <- A op Imm
    – Branch:
        ALUOutput <- NPC + Imm;
        Cond <- (A op 0)

slide-53
SLIDE 53

4th Instruction cycle

  • Memory access & branch completion (MEM)
    – Memory reference:
        PC <- NPC;
        LMD <- Mem[ALUOutput]      (load)
        Mem[ALUOutput] <- B        (store)
    – Branch:
        if (cond) PC <- ALUOutput; else PC <- NPC

slide-54
SLIDE 54

5th Instruction cycle

  • Write-back (WB)
    – Register-register ALU instruction:
        Regs[IR16..20] <- ALUOutput
    – Register-immediate ALU instruction:
        Regs[IR11..15] <- ALUOutput
    – Load instruction:
        Regs[IR11..15] <- LMD
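The five cycles can be traced for a single load in a toy model. Everything below is a sketch: memory and the register file are plain dicts, and the opcode value is arbitrary; only the RTL steps follow slides 51-54:

```python
# Toy walk through IF/ID/EX/MEM/WB for one load instruction.
def run_load(pc, imem, dmem, regs):
    # IF: fetch and compute next PC
    ir = imem[pc]
    npc = pc + 4
    # ID: I-type fields (bit 0 = MSB): rs1 = 6..10, rd = 11..15, imm = 16..31
    rs1 = (ir >> 21) & 0x1F
    rd = (ir >> 16) & 0x1F
    imm = ir & 0xFFFF
    a = regs[rs1]
    # EX: effective address
    alu_output = a + imm
    # MEM: load from data memory
    lmd = dmem[alu_output]
    # WB: write the loaded value into the register file
    regs[rd] = lmd
    return npc

# LD R1, 45(R2) with a made-up opcode value of 0x23
ir = (0x23 << 26) | (2 << 21) | (1 << 16) | 45
regs = {1: 0, 2: 100}
dmem = {145: 7777}
npc = run_load(0, {0: ir}, dmem, regs)
```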

slide-55
SLIDE 55

5 Stages of MIPS Datapath


slide-56
SLIDE 56

5 Stages of MIPS Datapath with Registers

slide-57
SLIDE 57

Events on Every Pipe Stage

slide-58
SLIDE 58

Implementing the Control for the MIPS Pipeline

  • LD R1, 45(R2)
    DADD R5, R6, R7
    DSUB R8, R6, R7
    OR   R9, R6, R7      (no use of R1)

  • LD R1, 45(R2)
    DADD R5, R1, R7      (R1 used by the next instruction)
    DSUB R8, R6, R7
    OR   R9, R6, R7

  • LD R1, 45(R2)
    DADD R5, R6, R7
    DSUB R8, R1, R7      (R1 used two instructions later)
    OR   R9, R6, R7

  • LD R1, 45(R2)
    DADD R5, R6, R7
    DSUB R8, R6, R7
    OR   R9, R1, R7      (R1 used three instructions later)

slide-59
SLIDE 59

Pipeline Interlocks

[Figure: overlapped datapaths (IM, Reg, ALU, DM) for LW R1,0(R2)
followed by SUB R4,R1,R5; AND R6,R1,R7; OR R8,R1,R9.]

LW  R1, 0(R2)    IF ID EX    MEM WB
SUB R4, R1, R5      IF ID    stall EX  MEM WB
AND R6, R1, R7         IF    stall ID  EX  MEM WB
OR  R8, R1, R9               stall IF  ID  EX  MEM WB

slide-60
SLIDE 60

Load Interlock Implementation

  • RAW load interlock detection during ID
    – Load instruction in EX
    – Instruction that needs the load data in ID
  • Logic to detect a load interlock:

    Opcode in EX    Opcode in ID                     Comparison
    (ID/EX.IR 0..5) (IF/ID.IR 0..5)
    Load            r-r ALU                ID/EX.IR[RT] == IF/ID.IR[RS]
    Load            r-r ALU                ID/EX.IR[RT] == IF/ID.IR[RT]
    Load            Load, store, r-i ALU,  ID/EX.IR[RT] == IF/ID.IR[RS]
                    branch

  • Action (insert the pipeline stall):
    – ID/EX.IR0..5 = 0 (no-op)
    – Re-circulate the contents of IF/ID
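The comparison table above boils down to a small predicate. This is a sketch, not the hardware: instructions are plain dicts, and the field names RS/RT follow the I-format of slide 50:

```python
# Load-interlock detection, following the comparison table above.
def needs_load_stall(ex_instr, id_instr):
    """True if the instruction in EX is a load whose destination (RT)
    is a source register of the instruction now in ID."""
    if ex_instr is None or ex_instr["op"] != "load":
        return False
    rt = ex_instr["rt"]
    if id_instr["op"] == "rr_alu":
        return rt in (id_instr["rs"], id_instr["rt"])   # either source
    if id_instr["op"] in ("load", "store", "ri_alu", "branch"):
        return rt == id_instr["rs"]                     # base/source register
    return False

lw = {"op": "load", "rs": 2, "rt": 1}               # LW R1, 0(R2)
sub = {"op": "rr_alu", "rs": 1, "rt": 5, "rd": 4}   # SUB R4, R1, R5
```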

slide-61
SLIDE 61

Forwarding Implementation

  • Source: ALU or MEM output
  • Destination: ALU, MEM, or Zero? input(s)
  • Compare (forwarding to ALU input)
  • Important
    – Please refer to Fig. A.22 in slide #63

slide-62
SLIDE 62

Forwarding Implementation (Cont’d)


slide-63
SLIDE 63

Forwarding Implementation - All Possible Forwarding


slide-64
SLIDE 64

Handling Branch Hazards

slide-65
SLIDE 65

Revised Pipeline Structure


slide-66
SLIDE 66

Exceptions

  • I/O device request
  • Operating system call
  • Tracing instruction execution
  • Breakpoint
  • Integer overflow
  • FP arithmetic anomaly
  • Page fault
  • Misaligned memory access
  • Memory protection violation
  • Undefined instruction
  • Hardware malfunctions
  • Power failure

slide-67
SLIDE 67

Exception Categories

  • Synchronous vs. asynchronous
  • User requested vs. coerced
  • User maskable vs. nonmaskable
  • Within vs. between instructions
  • Resume vs. terminate
  • Most difficult:
    – Occur in the middle of the instruction
    – Must be able to restart
    – Require intervention of another program (OS)

slide-68
SLIDE 68

Overview of Exceptions


slide-69
SLIDE 69

Exception Handling

[Figure: a memory access by an instruction in the pipeline
(IF ID EX MEM WB) misses in the cache, then in memory, and goes to
disk. The CPU suspends execution of the following instructions, jumps
to the trap address, runs the exception-handling procedure, and
returns with RFE so the faulting instruction can complete.]

slide-70
SLIDE 70

Stopping and Restarting Execution

  • TRAP, RFE (return-from-exception) instructions
  • IAR register saves the PC of the faulting instruction
  • Safely save the state of the pipeline:
    – Force a TRAP on the next IF
    – Until the TRAP is taken, turn off all writes for the faulting
      instruction and the following ones
    – The exception-handling routine saves the PC of the faulting
      instruction
  • For delayed branches we need to save more PCs
  • Precise Exceptions

slide-71
SLIDE 71

Exceptions in MIPS

Pipeline Stage   Exceptions
IF               Page fault, misaligned memory access,
                 memory-protection violation
ID               Undefined opcode
EX               Arithmetic exception
MEM              Page fault, misaligned memory access,
                 memory-protection violation
WB               None

slide-72
SLIDE 72

Exception Handling in MIPS

[Figure: an LW followed by an ADD, each flowing through
IF ID EX MEM WB. Exceptions raised in any stage are recorded in a
per-instruction exception status vector that travels down the pipe
with the instruction; the vector is checked, in instruction order,
when the instruction reaches WB.]

slide-73
SLIDE 73

ISA and Exceptions

  • Instructions before the faulting one complete, instructions after
    it do not, and exceptions are handled in order: Precise Exceptions
  • Precise exceptions are simple in MIPS:
    – Only one result per instruction
    – Result is written at the end of execution
  • Problems
    – Instructions that change machine state in the middle of execution:
      • Autoincrement addressing modes
      • Multicycle operations
  • Many machines have two modes
    – Imprecise (efficient)
    – Precise (relatively inefficient)

slide-74
SLIDE 74

Handling Multicycle Operations

slide-75
SLIDE 75

Handling Multicycle Operations (Cont’d)


slide-76
SLIDE 76

Latencies and Initiation Intervals

Functional Unit    Latency   Initiation Interval
Integer ALU           0              1
Data Memory           1              1
FP adder              3              1
FP/int multiply       6              1
FP/int divider       24             25

MULTD   IF ID M1 M2 M3 M4 M5 M6 M7 Mem WB
ADDD       IF ID A1 A2 A3 A4 Mem WB
LD            IF ID EX Mem WB
SD               IF ID EX Mem WB

slide-77
SLIDE 77

Hazards in FP pipelines

  • Structural hazards in the DIV unit
  • Structural hazards in WB
  • WAW hazards are possible (WAR not possible)
  • Out-of-order completion
    – Exception handling issues
  • More frequent RAW hazards
    – Longer pipelines

LD    F4, 0(R2)   IF ID EX Mem WB
MULTD F0, F4, F6     IF ID stall M1 M2 M3 M4 M5 M6 M7 Mem WB
ADDD  F2, F0, F8        IF ID stall stall stall stall stall stall stall
                                                      A1 A2 A3 A4 Mem WB

slide-78
SLIDE 78

Hazards in FP pipelines (Cont’d)


slide-79
SLIDE 79

Hazard Detection Logic at ID

  • Check for structural hazards
    – Divide unit: make sure the register write port is available
      when needed
  • Check for RAW hazards
    – Check source registers against the destination registers in the
      pipeline latches of instructions that are ahead in the pipeline;
      similar to the integer pipeline
  • Check for WAW hazards
    – Determine if any instruction in A1-A4, M1-M7 has the same
      register destination as this instruction

slide-80
SLIDE 80

Handling Hazards - Working Examples