 
Appendix A
Pipelining: Basic and Intermediate Concepts

Overview
• Basics of Pipelining
• Pipeline Hazards
• Pipeline Implementation
• Pipelining + Exceptions
• Pipeline to handle Multicycle Operations

Unpipelined Execution of 3 LD Instructions
• Assumed are the following delays: memory access = 2 ns, ALU operation = 2 ns, register file access = 1 ns.
[Figure: program execution order (in instructions) against time, 2 to 18 ns. ld r1, 100(r4), ld r2, 200(r5), and ld r3, 300(r6) execute back to back; each goes through instruction fetch, Reg, ALU, data access, Reg and takes 8 ns.]
• Assuming a 2 ns clock cycle time (i.e. a 500 MHz clock), every ld instruction needs 4 clock cycles (i.e. 8 ns) to execute.
• The total time to execute this sequence is 12 clock cycles (i.e. 24 ns). CPI = 12 cycles / 3 instructions = 4 cycles / instruction.

Pipelining: It's Natural!
• Laundry Example
• Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes

Sequential Laundry
[Figure: timeline from 6 PM to midnight, task order A, B, C, D. Each load takes 30 + 40 + 20 minutes and starts only after the previous load has finished.]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM to midnight, task order A, B, C, D. The loads overlap: 30 minutes of washing, then four back-to-back 40-minute dryer slots, then 20 minutes of folding.]
• Pipelined laundry takes 3.5 hours for 4 loads

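The 6-hour and 3.5-hour figures can be checked with a small sketch. The function names are illustrative, and the model assumes that a finished load may wait in a basket until the next machine is free.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Each load starts a stage as soon as both the load and the machine are ready."""
    machine_free_at = [0] * len(stage_minutes)
    for _ in range(loads):
        ready = 0  # time at which this load leaves the previous stage
        for s, duration in enumerate(stage_minutes):
            start = max(ready, machine_free_at[s])
            machine_free_at[s] = start + duration
            ready = machine_free_at[s]
    return machine_free_at[-1]


laundry = [30, 40, 20]  # washer, dryer, folder (minutes)
print(sequential_minutes(laundry, 4) / 60)  # hours, sequential
print(pipelined_minutes(laundry, 4) / 60)   # hours, pipelined
```

The pipelined total is the time for one load plus three extra dryer slots, which is why the slowest stage (the dryer) sets the rate.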
Key Definitions
Pipelining is a key implementation technique used to build fast processors. It allows the execution of multiple instructions to overlap in time.
A pipeline within a processor is similar to a car assembly line. Each assembly station is called a pipe stage or a pipe segment.
The throughput of an instruction pipeline is the measure of how often an instruction exits the pipeline.

Pipeline Stages
We can divide the execution of an instruction into the following 5 “classic” stages:
IF: Instruction Fetch
ID: Instruction Decode, register fetch
EX: Execution
MEM: Memory Access
WB: Register Write Back

Pipeline Throughput and Latency
Stage delays: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM = 10 ns, WB = 4 ns
Consider the pipeline above with the indicated delays. We want to know the pipeline throughput and the pipeline latency.
Pipeline throughput: instructions completed per second.
Pipeline latency: how long it takes to execute a single instruction in the pipeline.

Pipeline Throughput and Latency
Stage delays: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM = 10 ns, WB = 4 ns
Pipeline throughput: how often an instruction is completed.
$$\text{throughput} = \frac{1\ \text{instr}}{\max[\text{lat}(IF), \text{lat}(ID), \text{lat}(EX), \text{lat}(MEM), \text{lat}(WB)]} = \frac{1\ \text{instr}}{\max[5\ \text{ns}, 4\ \text{ns}, 5\ \text{ns}, 10\ \text{ns}, 4\ \text{ns}]} = \frac{1\ \text{instr}}{10\ \text{ns}}$$
(ignoring pipeline register overhead)
Pipeline latency: how long it takes to execute an instruction in the pipeline.
$$L = \text{lat}(IF) + \text{lat}(ID) + \text{lat}(EX) + \text{lat}(MEM) + \text{lat}(WB) = 5\ \text{ns} + 4\ \text{ns} + 5\ \text{ns} + 10\ \text{ns} + 4\ \text{ns} = 28\ \text{ns}$$
Is this right?

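The two formulas on this slide amount to a max and a sum over the stage delays. A minimal sketch, with illustrative names, using the slide's delays:

```python
stage_ns = {"IF": 5, "ID": 4, "EX": 5, "MEM": 10, "WB": 4}


def throughput_instr_per_ns(stages):
    """One instruction completes per slowest-stage delay (register overhead ignored)."""
    return 1 / max(stages.values())


def isolated_latency_ns(stages):
    """Latency of a single instruction with nothing else in the pipeline."""
    return sum(stages.values())


print(throughput_instr_per_ns(stage_ns))  # instructions per ns
print(isolated_latency_ns(stage_ns))      # ns
```

The sum is only the latency of an instruction that never has to wait for a busy stage.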
Pipeline Throughput and Latency
Stage delays: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM = 10 ns, WB = 4 ns
Simply adding the stage latencies to compute the pipeline latency only works for an isolated instruction.
I1: IF ID EX MEM WB    L(I1) = 28 ns
I2: IF ID EX MEM WB    L(I2) = 33 ns
I3: IF ID EX MEM WB    L(I3) = 38 ns
I4: IF ID EX MEM WB    L(I4) = 43 ns
We are in trouble! The latency is not constant. This happens because this is an unbalanced pipeline: each instruction waits longer than the previous one for the 10 ns MEM stage. The solution is to make every stage the same length as the longest one.

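The growing latencies can be reproduced with a small simulation. This is a sketch under two assumptions that the slide does not state: an instruction enters IF as soon as IF is free, and an instruction that has finished a stage may wait between stages until the next stage is free.

```python
def instruction_latencies(stage_ns, count):
    """Return the IF-start to WB-end latency of each of `count` instructions."""
    stage_free_at = [0] * len(stage_ns)
    latencies = []
    for _ in range(count):
        start = stage_free_at[0]  # enter IF as soon as IF is free
        ready = start
        for s, delay in enumerate(stage_ns):
            begin = max(ready, stage_free_at[s])  # wait for a busy stage
            stage_free_at[s] = begin + delay
            ready = stage_free_at[s]
        latencies.append(ready - start)
    return latencies


print(instruction_latencies([5, 4, 5, 10, 4], 4))     # unbalanced pipeline
print(instruction_latencies([10, 10, 10, 10, 10], 4)) # every stage stretched to 10 ns
```

With the unbalanced delays each instruction takes 5 ns longer than the one before it. With every stage stretched to 10 ns the latency is a constant 50 ns, and the throughput is still one instruction per 10 ns.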
Pipelining Lessons
[Figure: pipelined laundry timeline, 6 PM to 9 PM, task order A, B, C, D, with stage times of 30, 40, and 20 minutes.]
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup

Other Definitions
• Pipe stage or pipe segment
  – A decomposable unit of the fetch-decode-execute paradigm
• Pipeline depth
  – Number of stages in a pipeline
• Machine cycle
  – Clock cycle time
• Latch
  – Per phase/stage local information storage unit

Design Issues
• Balance the length of each pipeline stage
$$\text{Throughput} = \frac{\text{Depth of the pipeline}}{\text{Time per instruction on unpipelined machine}}$$
• Problems
  – Usually, stages are not balanced
  – Pipelining overhead
  – Hazards (conflicts)
• Performance (throughput; CPU performance equation)
  – Decrease of the CPI
  – Decrease of cycle time

Basic Pipeline

                Clock number
Instr #   1    2    3    4    5    6    7    8    9
i         IF   ID   EX   MEM  WB
i+1            IF   ID   EX   MEM  WB
i+2                 IF   ID   EX   MEM  WB
i+3                      IF   ID   EX   MEM  WB
i+4                           IF   ID   EX   MEM  WB

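The diagram follows one rule: instruction i+k occupies stage s during clock cycle k + s + 1. A minimal sketch that rebuilds the table for an ideal pipeline with no stalls (the function name is illustrative):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(num_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage occupied in clock c+1."""
    total_cycles = num_instructions + len(stages) - 1
    rows = []
    for k in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            row[k + s] = name  # one new instruction enters per clock
        rows.append(row)
    return rows


for k, row in enumerate(pipeline_diagram(5)):
    label = "i" if k == 0 else f"i+{k}"
    print(f"{label:<6}" + "".join(f"{cell:<5}" for cell in row))
```

Five instructions on a five-stage pipeline finish in 5 + 5 - 1 = 9 clock cycles, matching the nine columns of the slide.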
Pipelined Datapath with Resources

Pipeline Registers

Physics of Clock Skew
• Basically caused because the clock edge reaches different parts of the chip at different times
  – Capacitance-charge-discharge rates
    • All wires, leads, transistors, etc. have capacitance
    • Longer wire, larger capacitance
      – Repeaters used to drive current, handle fan-out problems
    • C is inversely proportional to rate-of-change of V
      – Time to charge/discharge adds to delay
      – Dominant problem in old integration densities
    • For a fixed C, rate-of-change of V is proportional to I
      – Problem with this approach is power requirements go up
      – Power dissipation becomes a problem
  – Speed-of-light propagation delays
    • Dominates current integration densities, as nowadays capacitances are much lower
    • But nowadays clock rates are much faster (even small delays will consume a large part of the clock cycle)
• Current day research → asynchronous chip designs

Performance Issues
• Unpipelined processor
  – 1.0 ns clock cycle
  – 4 cycles for ALU and branches
  – 5 cycles for memory
  – Frequencies: ALU (40%), Branch (20%), and Memory (40%)
• Clock skew and setup add 0.2 ns overhead
• Speedup with pipelining?

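One way to answer the question is sketched below. It assumes, beyond what the slide states, that the pipelined machine has an ideal CPI of 1 with no stalls and that the 0.2 ns overhead is simply added to the 1.0 ns clock cycle.

```python
CLOCK_NS = 1.0      # unpipelined clock cycle
OVERHEAD_NS = 0.2   # clock skew and setup, added to the pipelined clock (assumption)

# instruction class -> (frequency, cycles on the unpipelined machine)
MIX = {"ALU": (0.40, 4), "branch": (0.20, 4), "memory": (0.40, 5)}


def unpipelined_time_ns():
    """Average instruction time = clock cycle x average CPI."""
    average_cpi = sum(freq * cycles for freq, cycles in MIX.values())
    return CLOCK_NS * average_cpi


def pipelined_time_ns():
    """With CPI = 1 (assumed), one instruction completes every clock cycle."""
    return CLOCK_NS + OVERHEAD_NS


speedup = unpipelined_time_ns() / pipelined_time_ns()
print(f"{speedup:.2f}")
```

Under these assumptions the average unpipelined instruction takes 4.4 ns, the pipelined one 1.2 ns, and the speedup is about 3.7.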
Computing Pipeline Speedup
$$\text{Speedup} = \frac{\text{average instruction time unpipelined}}{\text{average instruction time pipelined}}$$
$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instr}$$
$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
Remember that average instruction time = CPI × Clock Cycle, and that the ideal CPI for a pipelined machine is 1.

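The final formula translates directly into a function. This is a minimal sketch with illustrative parameter names. The 0.25 stall cycles per instruction in the second call is a made-up value that shows the effect of stalls.

```python
def pipeline_speedup(depth, stall_cycles_per_instr,
                     clock_unpipelined=1.0, clock_pipelined=1.0):
    """Speedup = depth / (1 + stall CPI) x (unpipelined clock / pipelined clock).

    Assumes an ideal pipelined CPI of 1, as on the slide.
    """
    cpi_factor = depth / (1 + stall_cycles_per_instr)
    clock_factor = clock_unpipelined / clock_pipelined
    return cpi_factor * clock_factor


print(pipeline_speedup(5, 0.0))   # no stalls, equal clocks: speedup equals the depth
print(pipeline_speedup(5, 0.25))  # hypothetical 0.25 stall cycles per instruction
```

With no stalls and equal clock cycles the speedup equals the pipeline depth. Every stall cycle per instruction increases the denominator and lowers the speedup.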
Pipeline Hazards
• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  – Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
  – Data hazards: instruction depends on the result of a prior instruction still in the pipeline (missing sock)
  – Control hazards: pipelining of branches & other instructions that change the PC
• Common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline

Structural Hazards
• Overlapped execution of instructions:
  – Pipelining of functional units
  – Duplication of resources
• Structural hazard
  – When the pipeline cannot accommodate some combination of instructions
• Consequences
  – Stall
  – Increase of CPI from its ideal value (1)

Structural Hazard with 1 port per Memory

Pipelining of Functional Units
[Figure: three versions of the IF, ID, EX, MEM, WB pipeline with an FP Multiply unit made of stages M1 to M5: fully pipelined, partially pipelined, and not pipelined.]