
Unit 5: Pipelining



This Unit: Pipelining
[Figure: layered system diagram: applications on top of system software on top of hardware (Mem, CPU, I/O)]
• Single-cycle & multi-cycle datapaths
• Latency vs. throughput & performance
• Basic pipelining
• Data hazards
  • Bypassing
  • Load-use stalling
• Pipelined multi-cycle operations
• Control hazards
  • Branch prediction

CIS 501: Comp. Arch. | Prof. Milo Martin | Pipelining
Slides developed by Milo Martin & Amir Roth at the University of Pennsylvania, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.

Readings
• Chapter 2.1 of MA:FSPTCM

In-Class Exercise
• You have a washer, a dryer, and a "folder"
  • Each takes 30 minutes per load
  • How long for one load in total?
  • How long for two loads of laundry?
  • How long for 100 loads of laundry?
• Now assume: washing takes 30 minutes, drying 60 minutes, and folding 15 minutes
  • How long for one load in total?
  • How long for two loads of laundry?
  • How long for 100 loads of laundry?
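One way to check your answers, previewing where this unit is headed: when the laundry stages are overlapped ("pipelined"), the first load pays every stage, and each additional load is limited only by the slowest stage. A minimal Python sketch (the function name and list encoding are illustrative, not from the slides):

```python
def laundry_time(stage_minutes, loads):
    """Total minutes for `loads` loads with pipelined stages:
    the first load traverses every stage; each later load adds
    only one bottleneck (slowest) stage time."""
    return sum(stage_minutes) + (loads - 1) * max(stage_minutes)

# Equal stages: wash, dry, fold = 30 minutes each
print(laundry_time([30, 30, 30], 1))    # 90
print(laundry_time([30, 30, 30], 2))    # 120
print(laundry_time([30, 30, 30], 100))  # 3060

# Unequal stages: wash 30, dry 60, fold 15
print(laundry_time([30, 60, 15], 1))    # 105
print(laundry_time([30, 60, 15], 100))  # 6045
```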

In-Class Exercise Answers
• Washer, dryer, and "folder", each taking 30 minutes per load:
  • One load in total: 90 minutes
  • Two loads of laundry: 90 + 30 = 120 minutes
  • 100 loads of laundry: 90 + 30*99 = 3060 minutes
• Washing 30 minutes, drying 60 minutes, folding 15 minutes:
  • One load in total: 105 minutes
  • Two loads of laundry: 105 + 60 = 165 minutes
  • 100 loads of laundry: 105 + 60*99 = 6045 minutes

Datapath Background

Recall: The Sequential Model
• Basic structure of all modern ISAs
  • Often called "von Neumann", but it appeared in the ENIAC before
• Program order: a total order on dynamic instructions
  • Order and named storage define the computation
• Convenient feature: program counter (PC)
  • The instruction itself is stored in memory at the location pointed to by the PC
  • The next PC is the next instruction, unless the instruction says otherwise
• The processor logically executes a fetch/execute loop
  • Atomic: an instruction finishes before the next instruction starts
  • Implementations can break this constraint physically, but must maintain the illusion to preserve correctness

Recall: Maximizing Performance
Execution time = (instructions/program) * (seconds/cycle) * (cycles/instruction)
  (1 billion instructions) * (1ns per cycle) * (1 cycle per insn) = 1 second
• Instructions per program:
  • Determined by program, compiler, and instruction set architecture (ISA)
• Cycles per instruction: "CPI"
  • Typical range today: 2 to 0.5
  • Determined by program, compiler, ISA, and micro-architecture
• Seconds per cycle: "clock period", the same each cycle
  • Typical range today: 2ns to 0.25ns
  • Reciprocal is frequency: 0.5 GHz to 4 GHz (1 Hz = 1 cycle per second)
  • Determined by micro-architecture and technology parameters
• For minimum execution time, minimize each term
  • Difficult: the terms often pull against one another
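The execution-time equation above multiplies out directly. A small sketch using the slide's own example values (the function name is illustrative):

```python
def execution_time(insns, cpi, clock_period_s):
    """Execution time = (insns/program) * (cycles/insn) * (seconds/cycle)."""
    return insns * cpi * clock_period_s

# Slide example: 1 billion instructions, CPI = 1, 1ns clock period
t = execution_time(1e9, 1, 1e-9)
print(t)  # 1.0 (second)
```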

Single-Cycle Datapath
[Figure: PC, instruction memory, register file, ALU, data memory in one combinational path; T_singlecycle = T_insn-mem + T_regfile + T_ALU + T_data-mem + T_regfile]
• Single-cycle datapath: a true "atomic" fetch/execute loop
• Fetch, decode, and execute one complete instruction every cycle
+ Takes 1 cycle to execute any instruction by definition ("CPI" is 1)
– Long clock period: must accommodate the slowest instruction
  (worst-case delay through the circuit; must wait this long every time)

Multi-Cycle Datapath
[Figure: same units with IR and A/B/O/D latches between them]
• Multi-cycle datapath: attacks the slow clock
• Fetch, decode, and execute one complete instruction over multiple cycles
• Allows instructions to take different numbers of cycles
+ Opposite of single-cycle: short clock period (less "work" per cycle)
– Multiple cycles per instruction (higher "CPI")

Recap: Single-cycle vs. Multi-cycle
Single-cycle: insn0.fetch,dec,exec | insn1.fetch,dec,exec
Multi-cycle:  insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec
• Single-cycle datapath: fetch, decode, and execute one complete instruction every cycle
  + Low CPI: 1 by definition
  – Long clock period: to accommodate the slowest instruction
• Multi-cycle datapath: attacks the slow clock
  • Fetch, decode, and execute one complete instruction over multiple cycles
  • Allows instructions to take different numbers of cycles
  ± Opposite of single-cycle: short clock period, high CPI (think: CISC)

Single-cycle vs. Multi-cycle Performance
• Single-cycle
  • Clock period = 50ns, CPI = 1
  • Performance = 50ns/insn
• Multi-cycle has the opposite performance split of single-cycle
  + Shorter clock period
  – Higher CPI
• Multi-cycle example
  • Branch: 20% (3 cycles), load: 20% (5 cycles), ALU: 60% (4 cycles)
  • Clock period = 11ns, CPI = (20%*3) + (20%*5) + (60%*4) = 4
  • Why is the clock period 11ns and not 10ns? Overheads
  • Performance = 44ns/insn
• Aside: CISC makes perfect sense in a multi-cycle datapath
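The multi-cycle CPI above is a weighted average over the instruction mix, and performance per instruction is just CPI times the clock period. A sketch of that arithmetic using the slide's numbers (percentages kept as integers to avoid floating-point noise):

```python
def avg_cpi(mix):
    """Weighted-average CPI; mix = list of (percent, cycles) pairs."""
    return sum(pct * cycles for pct, cycles in mix) / 100

# Slide example: branch 20% (3 cycles), load 20% (5), ALU 60% (4)
mix = [(20, 3), (20, 5), (60, 4)]
cpi = avg_cpi(mix)
print(cpi)        # 4.0
print(cpi * 11)   # 44.0 ns/insn (multi-cycle, 11ns clock)
print(1 * 50)     # 50 ns/insn (single-cycle, 50ns clock, CPI = 1)
```

Despite its higher CPI, the multi-cycle design wins here because its clock period shrank by more than the CPI grew.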

Recall: Latency vs. Throughput
• Latency (execution time): time to finish a fixed task
• Throughput (bandwidth): number of tasks completed in a fixed time
  • Different: exploit parallelism for throughput, not latency (e.g., baking bread)
  • Often contradictory (latency vs. throughput); we will see many examples of this
• Choose the definition of performance that matches your goals
  • Scientific program? Latency. Web server? Throughput.
• Example: move people 10 miles
  • Car: capacity = 5, speed = 60 miles/hour
  • Bus: capacity = 60, speed = 20 miles/hour
  • Latency: car = 10 minutes, bus = 30 minutes
  • Throughput: car = 15 people/hour (counting the return trip), bus = 60 people/hour
• Fastest way to send 1TB of data? (at 100+ Mbit/s)

Pipelined Datapath

Latency versus Throughput
• Can we have both low CPI and a short clock period?
  • Not if the datapath executes only one instruction at a time
• Latency and throughput: two views of performance:
  (1) at the program level and (2) at the instruction level
• Single-instruction latency
  • Doesn't matter: programs comprise billions of instructions
  • Difficult to reduce anyway
• The goal is to make programs, not individual instructions, go faster
  • Instruction throughput → program latency
  • Key: exploit inter-instruction parallelism

Pipelining
Multi-cycle: insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec
Pipelined:   insn0.fetch | insn0.dec  | insn0.exec
                         | insn1.fetch | insn1.dec  | insn1.exec
• An important performance technique
  • Improves instruction throughput rather than instruction latency
• Begin with the multi-cycle design
  • When an instruction advances from stage 1 to stage 2, the next instruction enters stage 1
  • A form of parallelism: "insn-stage parallelism"
• Maintains the illusion of a sequential fetch/execute loop
• An individual instruction takes the same number of stages
  + But instructions enter and leave at a much faster rate
• Laundry analogy
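The timelines above can be turned into a simple cycle-count model. Under the idealized assumption of no stalls (an assumption, not a claim from the slides), a multi-cycle machine pays every stage for every instruction, while a pipeline pays the full stage count only for the first instruction and then retires one instruction per cycle:

```python
def multicycle_cycles(n_insns, stages):
    """Each instruction occupies the datapath alone for all its stages."""
    return n_insns * stages

def pipelined_cycles(n_insns, stages):
    """First instruction fills the pipeline; after that, one
    instruction finishes per cycle (idealized: no stalls)."""
    return stages + (n_insns - 1)

n, stages = 1_000_000, 5
print(multicycle_cycles(n, stages))  # 5000000
print(pipelined_cycles(n, stages))   # 1000004
```

Note that a single instruction still takes 5 cycles either way; only the rate at which instructions complete improves, which is exactly the throughput-over-latency point made above.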

5 Stage Multi-Cycle Datapath
[Figure: PC, instruction memory, register file (ports s1, s2, d), ALU, data memory, with latches between units]

5 Stage Pipeline: Inter-Insn Parallelism
• Pipelining: cut the datapath into N stages (here, 5)
  • One instruction in each stage in each cycle
  + Clock period = MAX(T_insn-mem, T_regfile, T_ALU, T_data-mem)
  + Base CPI = 1: an instruction enters and leaves every cycle
  – Actual CPI > 1: the pipeline must often "stall"
• Individual instruction latency increases (pipeline overhead), but that is not the point

5 Stage Pipelined Datapath
[Figure: pipelined datapath with pipeline registers between stages]
• Five stages: Fetch, Decode, eXecute, Memory, Writeback
• Latches (pipeline registers) are named by the stages they begin: PC, D, X, M, W
• Nothing magical about 5 stages (the Pentium 4 had 22 stages!)

More Terminology & Foreshadowing
• Scalar pipeline: one instruction per stage per cycle
  • Alternative: "superscalar" (later)
• In-order pipeline: instructions enter the execute stage in order
  • Alternative: "out-of-order" (later)
• Pipeline depth: number of pipeline stages
  • Nothing magical about five
  • Contemporary high-performance cores have ~15-stage pipelines
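The clock-period claim above (pipelined clock = slowest stage) can be made concrete with made-up stage delays. All numbers below are illustrative assumptions, not values from the slides:

```python
# Assumed per-stage delays in ns (hypothetical, for illustration only)
t_insn_mem, t_regfile, t_alu, t_data_mem = 10, 8, 11, 10

# Single-cycle: one long cycle traverses the whole datapath
# (the register file appears twice: read in decode, write in writeback)
t_single = t_insn_mem + t_regfile + t_alu + t_data_mem + t_regfile  # 47 ns

# Pipelined: clock period is set by the slowest stage
t_pipe = max(t_insn_mem, t_regfile, t_alu, t_data_mem)  # 11 ns

# Ideal speedup at base CPI = 1, ignoring stalls and latch overhead
print(t_single / t_pipe)  # ~4.3x
```

In practice the speedup is lower: pipeline-register overhead inflates the clock period, and stalls push the actual CPI above 1, as the next sections on hazards show.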
