Processor Pipeline, Nima Honarmand, Spring 2016 :: CSE 502 Computer Architecture

Spring 2016 :: CSE 502 – Computer Architecture Processor Pipeline Nima Honarmand

Generic Instruction Cycle
• Steps in processing an instruction:
  – Instruction Fetch (IF_STEP)
  – Instruction Decode (ID_STEP)
  – Operand Fetch (OF_STEP)
    • Might be from registers or memory
  – Execute (EX_STEP)
    • Perform computation on the operands
  – Result Store or Write Back (RS_STEP)
    • Write the execution results back to registers or memory
• ISA determines what needs to be done in each step for each instruction
• μArch determines how HW implements the steps
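The five generic steps can be made concrete with a toy interpreter. This is a minimal sketch, assuming a hypothetical four-opcode ISA (`add`, `addi`, `addm`, `st`) encoded as Python tuples rather than real MIPS words; `Machine`, `step`, and `run` are illustrative names. Each comment marks which generic step the following lines perform, and `addm`/`st` show that operands and results may live in memory as well as in registers.

```python
from dataclasses import dataclass, field

# Hypothetical toy ISA, NOT the MIPS encoding. Each instruction is a tuple
# (opcode, dest, src1, src2), where src2 is a register number for "add"
# and a constant or memory address for the other opcodes.
#   add  d, s1, s2   : R[d]    = R[s1] + R[s2]
#   addi d, s1, imm  : R[d]    = R[s1] + imm
#   addm d, s1, addr : R[d]    = R[s1] + M[addr]  (operand fetched from memory)
#   st   _, s1, addr : M[addr] = R[s1]            (result stored to memory)


@dataclass
class Machine:
    imem: list
    regs: list = field(default_factory=lambda: [0] * 8)
    dmem: dict = field(default_factory=dict)
    pc: int = 0

    def step(self) -> None:
        # IF_STEP: fetch the instruction at PC, then advance PC
        instr = self.imem[self.pc]
        self.pc += 1
        # ID_STEP: split the instruction into opcode and operand specifiers
        op, dest, src1, src2 = instr
        # OF_STEP: fetch operands from registers, the instruction, or memory
        a = self.regs[src1]
        if op == "add":
            b = self.regs[src2]
        elif op == "addm":
            b = self.dmem.get(src2, 0)
        else:
            b = src2
        # EX_STEP: perform the computation ("st" just passes R[s1] through)
        result = a if op == "st" else a + b
        # RS_STEP: write the result back to a register or to memory
        if op == "st":
            self.dmem[src2] = result
        else:
            self.regs[dest] = result

    def run(self) -> None:
        while self.pc < len(self.imem):
            self.step()
```

For example, running `Machine([("addi", 1, 0, 5), ("addi", 2, 0, 7), ("add", 3, 1, 2), ("st", 0, 3, 100), ("addm", 4, 1, 100)])` leaves 12 in `regs[3]` and `dmem[100]`, and 17 in `regs[4]`. This interpreter finishes every step of one instruction before starting the next, which is exactly the "one instruction at a time" behavior that pipelining later removes.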

Datapath vs. Control Logic
• Datapath is the collection of HW components and their connections in a processor
  – Determines the static structure of the processor
• Control logic determines the dynamic flow of data between the components
  – E.g., the control lines of the MUXes and the ALU in the datapath
  – Control is a function of:
    • Instruction words
    • State of the processor
    • Execution results at each stage
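A decoder that maps the instruction word to control signals makes "control is a function of the instruction word" concrete. This is a minimal sketch: the MIPS primary opcodes are real, but the `Control` fields are an assumed signal set in the style of a classic single-issue MIPS datapath, not a specific design. `take_branch` shows a decision that also depends on an execution result (the ALU's zero output); processor state, such as a stall, would add further inputs.

```python
from dataclasses import dataclass


# Assumed control-signal set; a real design may use different or more signals.
@dataclass(frozen=True)
class Control:
    reg_write: bool = False    # write a result into the register file
    alu_src_imm: bool = False  # ALU's second input is the immediate, not regB
    mem_read: bool = False     # read the data cache
    mem_write: bool = False    # write the data cache
    mem_to_reg: bool = False   # write-back data comes from memory, not the ALU
    branch: bool = False       # instruction may redirect the PC


# MIPS primary opcodes (bits 31..26 of the instruction word)
R_TYPE, BEQ, ADDI, LW, SW = 0x00, 0x04, 0x08, 0x23, 0x2B

CONTROL_TABLE = {
    R_TYPE: Control(reg_write=True),
    ADDI: Control(reg_write=True, alu_src_imm=True),
    LW: Control(reg_write=True, alu_src_imm=True, mem_read=True, mem_to_reg=True),
    SW: Control(alu_src_imm=True, mem_write=True),
    BEQ: Control(branch=True),
}


def decode_control(instr_word: int) -> Control:
    """Signals that depend only on the instruction word (unknown opcodes raise KeyError)."""
    return CONTROL_TABLE[(instr_word >> 26) & 0x3F]


def take_branch(ctrl: Control, alu_zero: bool) -> bool:
    """A control decision that also depends on an execution result."""
    return ctrl.branch and alu_zero
```

For instance, `decode_control(0x8D280004)` (a `lw`) sets `mem_read` and `mem_to_reg`, while an R-type word such as `0x012A4020` sets only `reg_write`.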

Generic Datapath Components
• Main components
  – Instruction Cache
  – Data Cache
  – Register File
  – Functional Units (ALU, Floating Point Unit, Memory Unit, …)
  – Pipeline Registers
• Auxiliary Components (in advanced processors)
  – Reservation Stations
  – Reorder Buffer
  – Branch Predictor
  – Prefetchers
  – …
• Lots of glue logic (often multiplexors) to glue these together

Example: MIPS Instruction Set
• All instructions are 32 bits
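Because every MIPS instruction is 32 bits with fixed field positions, decoding reduces to shifts and masks. This sketch uses the standard R/I/J field layout (opcode in bits 31..26); `decode_fields` is a hypothetical helper name, and treating every non-R, non-J opcode as I-format is a simplification.

```python
def decode_fields(word: int) -> dict[str, int]:
    """Split a 32-bit MIPS instruction word into its R-, I-, or J-format fields."""
    opcode = (word >> 26) & 0x3F                   # bits 31..26
    if opcode == 0x00:                             # R-format (add, sub, and, ...)
        return {"opcode": opcode,
                "rs": (word >> 21) & 0x1F, "rt": (word >> 16) & 0x1F,
                "rd": (word >> 11) & 0x1F, "shamt": (word >> 6) & 0x1F,
                "funct": word & 0x3F}
    if opcode in (0x02, 0x03):                     # J-format (j, jal)
        return {"opcode": opcode, "target": word & 0x03FF_FFFF}
    # Everything else is treated as I-format (a simplification).
    imm = word & 0xFFFF
    if imm & 0x8000:                               # sign-extend, as for addi/lw/sw/beq;
        imm -= 0x10000                             # andi/ori would zero-extend instead
    return {"opcode": opcode, "rs": (word >> 21) & 0x1F,
            "rt": (word >> 16) & 0x1F, "imm": imm}
```

For example, `0x012A4020` is `add $t0, $t1, $t2` and decodes to rs = 9, rt = 10, rd = 8, funct = 0x20; `0x8D28FFFC` is `lw $t0, -4($t1)` and decodes to opcode 0x23, rs = 9, rt = 8, imm = -4.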

A Simple MIPS Datapath
[Figure: PC, +1 incrementer, I-cache, register file, ALU, and D-cache, grouped into Inst. Fetch (IF), Inst. Decode & Register Read (ID), Execute (EX), Memory (MEM), and Write-Back (WB), with the generic steps IF_STEP, ID_STEP, OF_STEP, EX_STEP, and RS_STEP marked beneath]

Single-Instruction Datapath
[Figure: timeline. Single-cycle: ins0.(fetch,dec,ex,mem,wb) | ins1.(fetch,dec,ex,mem,wb). Multi-cycle: ins0.fetch | ins0.(dec,ex) | ins0.(mem,wb) | ins1.fetch | ins1.(dec,ex) | ins1.(mem,wb)]
• Process one instruction at a time
• Single-cycle control: hardwired
  – Low CPI (1)
  – Long clock period (to accommodate the slowest instruction)
• Multi-cycle control: typically micro-programmed
  – Short clock period
  – High CPI
• Can we have both low CPI and a short clock period?
  – Not if the datapath executes only one instruction at a time
  – No good way to make a single instruction go faster
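The trade-off above is the standard performance equation: execution time = instruction count × CPI × clock period. A minimal sketch; the 10 ns and 2 ns clock periods and the multi-cycle CPI of 4.5 are made-up numbers for illustration, not measurements of any real design.

```python
def exec_time_ns(n_insns: int, cpi: float, cycle_ns: float) -> float:
    """Execution time = instruction count x cycles per instruction x clock period."""
    return n_insns * cpi * cycle_ns


# Hypothetical numbers: the single-cycle clock must fit the slowest instruction,
# while the multi-cycle clock fits only one step but needs several cycles per insn.
single_cycle = exec_time_ns(1_000_000, cpi=1.0, cycle_ns=10.0)
multi_cycle = exec_time_ns(1_000_000, cpi=4.5, cycle_ns=2.0)
print(f"single-cycle: {single_cycle:,.0f} ns, multi-cycle: {multi_cycle:,.0f} ns")
```

With these numbers neither design wins by much, because each improves one factor at the cost of the other; the goal of pipelining is CPI near 1 with the short clock.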

Pipelined Datapath
[Figure: timeline. Multi-cycle: ins0.fetch | ins0.(dec,ex) | ins0.(mem,wb) | ins1.fetch | ins1.(dec,ex) | ins1.(mem,wb). Pipelined: ins0, ins1, and ins2 each pass through fetch, (dec,ex), (mem,wb), each starting one cycle after the previous one]
• Start with the multi-cycle design
• When insn0 goes from stage 1 to stage 2 … insn1 starts stage 1
• Each instruction passes through all stages … but instructions enter and leave at a faster rate

Style          Ideal CPI   Cycle Time (1/freq)
Single-cycle   1           Long
Multi-cycle    > 1         Short
Pipelined      1           Short

• A pipeline can have as many insns in flight as there are stages
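The pipelined timeline follows one rule: instruction i occupies stage s in cycle i + s, so n instructions through k stages take n + k − 1 cycles. A minimal sketch that builds that occupancy table; `pipeline_diagram` is a hypothetical helper, and the stage names are taken from the figure.

```python
def pipeline_diagram(n_insns: int, stages: list[str]) -> list[list[str]]:
    """Rows = instructions, columns = cycles; insn i is in stage s during cycle i + s."""
    n_cycles = n_insns + len(stages) - 1
    rows = []
    for i in range(n_insns):
        row = [""] * n_cycles
        for s, name in enumerate(stages):
            row[i + s] = name
        rows.append(row)
    return rows


if __name__ == "__main__":
    stages = ["fetch", "(dec,ex)", "(mem,wb)"]
    for i, row in enumerate(pipeline_diagram(3, stages)):
        print(f"ins{i}: " + " ".join(f"{cell:9}" for cell in row))
```

With 3 instructions and 3 stages the pipelined schedule takes 3 + 3 − 1 = 5 cycles, against 9 for the multi-cycle timeline, and in the middle cycle all three instructions are in flight at once.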

Pipeline Illustrated
[Figure: combinational logic of n gate delays between latches (L), shown unpipelined (BW ~ 1/n), cut into 2 stages of n/2 gate delays (BW ~ 2/n), and cut into 3 stages of n/3 gate delays (BW ~ 3/n)]
• Pipeline latency = n gate delays + (p − 1) register delays, where p is the number of stages
• Improves throughput at the expense of latency
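The slide's model can be evaluated directly. A minimal sketch following the slide's formulas: latency = n + (p − 1) register delays, and bandwidth ~ p / n, which ignores latch overhead in the clock period. The 12 gate delays and the register delay of 1 gate delay are assumed example values.

```python
def pipeline_metrics(n: float, p: int, reg_delay: float) -> tuple[float, float]:
    """Latency and approximate bandwidth for n gate delays of logic cut into p stages.

    latency   = n gate delays + (p - 1) register delays   (slide's formula)
    bandwidth ~ p / n results per gate delay               (latch overhead ignored)
    """
    latency = n + (p - 1) * reg_delay
    bandwidth = p / n
    return latency, bandwidth


for p in (1, 2, 3):
    lat, bw = pipeline_metrics(12, p, reg_delay=1)
    print(f"p={p}: latency={lat} gate delays, bandwidth~{bw:.3f} results/gate delay")
```

Each extra stage raises bandwidth by another 1/n while adding one register delay to latency, which is the "throughput at the expense of latency" trade-off in numbers.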

5-Stage MIPS Pipeline

Stage 1: Fetch
• Fetch an instruction from the instruction cache every cycle
  – Use PC to index the instruction cache
  – Increment PC (assume no branches for now)
• Write state to the pipeline register (IF/ID)
  – The next stage will read this pipeline register
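One cycle of the Fetch stage can be sketched as a function that reads the instruction cache and fills the IF/ID pipeline register. This is an illustrative sketch: `IFID` and `fetch` are hypothetical names, the instruction cache is modeled as a plain list, and addresses are word-addressed so "increment PC" means +1, as on the slides.

```python
from dataclasses import dataclass


@dataclass
class IFID:
    """IF/ID pipeline register (illustrative field names)."""
    instr: int       # instruction bits read from the instruction cache
    pc_plus_1: int   # PC + 1, passed down for the branch-target calculation


def fetch(pc: int, icache: list[int]) -> tuple[IFID, int]:
    """One cycle of Fetch with no branches: returns (new IF/ID contents, next PC)."""
    instr = icache[pc]    # use the PC to index the instruction cache
    next_pc = pc + 1      # word-addressed, so "increment" means +1
    return IFID(instr, next_pc), next_pc
```

Calling `fetch` once per cycle with the returned PC streams consecutive instructions into IF/ID; a taken branch would instead load the PC from the target input of the MUX, which this sketch omits.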

Stage 1: Fetch Diagram
[Figure: a MUX selects the next PC from PC + 1 and the branch target; the PC indexes the instruction cache; the instruction bits and PC + 1 are latched into the IF/ID pipeline register, which feeds Decode]

Stage 2: Decode
• Decode opcode bits
  – Set up control signals for later stages
• Read input operands from the register file
  – Specified by decoded instruction bits
• Write state to the pipeline register (ID/EX)
  – Opcode
  – Register contents, immediate operand
  – PC+1 (even though decode didn't use it)
  – Control signals (from insn) for opcode and destReg
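A Decode stage for MIPS-format words might look like the sketch below. Assumptions: `IDEX` and `decode` are hypothetical names, regA/regB are taken to be the MIPS rs/rt fields, and the opcode stands in for the full set of control signals that a real ID/EX register would carry.

```python
from dataclasses import dataclass


@dataclass
class IDEX:
    """ID/EX pipeline register (illustrative field names)."""
    opcode: int      # stands in for the control signals
    pc_plus_1: int   # passed through even though Decode does not use it
    rega_val: int    # contents of regA (rs)
    regb_val: int    # contents of regB (rt)
    imm: int         # sign-extended 16-bit immediate
    dest_reg: int


def decode(instr: int, pc_plus_1: int, regfile: list[int]) -> IDEX:
    opcode = (instr >> 26) & 0x3F
    rs, rt, rd = (instr >> 21) & 0x1F, (instr >> 16) & 0x1F, (instr >> 11) & 0x1F
    imm = instr & 0xFFFF
    if imm & 0x8000:              # sign-extend the 16-bit immediate
        imm -= 0x10000
    dest = rd if opcode == 0 else rt   # R-format writes rd; I-format (lw, addi) writes rt
    # Read both source registers from the register file
    return IDEX(opcode, pc_plus_1, regfile[rs], regfile[rt], imm, dest)
```

Reading both registers unconditionally is cheap in hardware, so Decode does it for every instruction and lets later stages ignore what they do not need.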

Stage 2: Decode Diagram
[Figure: the instruction bits from the IF/ID pipeline register select regA and regB in the register file; regA contents, regB contents, PC + 1, destReg, and control signals/immediate are latched into the ID/EX pipeline register, which feeds Execute; the register file also has write inputs (destReg, data)]

Stage 3: Execute
• Perform ALU operations
  – Calculate the result of the instruction
    • Control signals select the operation
    • Contents of regA used as one input
    • Either regB or constant offset (imm from insn) used as second input
  – Calculate PC-relative branch target
    • PC+1+(constant offset)
• Write state to the pipeline register (EX/Mem)
  – ALU result, contents of regB, and PC+1+offset
  – Control signals (from insn) for opcode and destReg
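The Execute stage combines the ALU, the MUX on its second input, and the branch-target adder. A minimal sketch, assuming a small hypothetical ALU operation set selected by name; `EXMem`, `ALU_OPS`, and `execute` are illustrative names, and Python's unbounded integers stand in for a 32-bit ALU that would wrap around.

```python
from dataclasses import dataclass


@dataclass
class EXMem:
    """EX/Mem pipeline register (illustrative field names)."""
    alu_result: int
    regb_val: int        # carried along as store data
    branch_target: int   # PC + 1 + offset
    dest_reg: int


ALU_OPS = {              # assumed subset of ALU operations
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
}


def execute(alu_op: str, use_imm: bool, pc_plus_1: int,
            rega_val: int, regb_val: int, imm: int, dest_reg: int) -> EXMem:
    a = rega_val                       # contents of regA are always the first input
    b = imm if use_imm else regb_val   # MUX: regB or the constant offset
    result = ALU_OPS[alu_op](a, b)     # control signals select the operation
    target = pc_plus_1 + imm           # PC-relative branch target (word-addressed)
    return EXMem(result, regb_val, target, dest_reg)
```

For a load like `lw $t0, -4($t1)` with $t1 = 100, `execute("add", True, 5, 100, 7, -4, 8)` produces address 96; the branch target is computed anyway and simply ignored when the instruction is not a branch.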

Stage 3: Execute Diagram
[Figure: regA contents feed one ALU input while a MUX selects regB contents or the immediate for the other; an adder forms the branch target PC + 1 + offset; the ALU result, regB contents, branch target, destReg, and control signals are latched into the EX/Mem pipeline register, which feeds Memory]

Stage 4: Memory
• Perform data cache access
  – ALU result contains the address for LD or ST
  – Opcode bits control R/W and enable signals
• Write state to the pipeline register (Mem/WB)
  – ALU result and loaded data
  – Control signals (from insn) for opcode and destReg
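The Memory stage uses the ALU result as the address and regB as the store data. A minimal sketch: the data cache is modeled as a dict (unwritten addresses read as 0), the `is_load`/`is_store` flags stand in for the R/W and enable signals, and `MemWB` and `memory` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemWB:
    """Mem/WB pipeline register (illustrative field names)."""
    alu_result: int
    loaded_data: Optional[int]   # None when the instruction is not a load
    dest_reg: int


def memory(is_load: bool, is_store: bool, alu_result: int, regb_val: int,
           dest_reg: int, dcache: dict[int, int]) -> MemWB:
    loaded = None
    if is_load:                           # ALU result is the load address
        loaded = dcache.get(alu_result, 0)
    elif is_store:                        # regB contents are the store data
        dcache[alu_result] = regb_val
    # Non-memory instructions pass the ALU result straight through
    return MemWB(alu_result, loaded, dest_reg)
```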

Stage 4: Memory Diagram
[Figure: the ALU result drives the data cache address (in_addr) and regB contents drive its write data (in_data), with control signals driving the enable and R/W inputs; the ALU result, loaded data, destReg, and control signals are latched into the Mem/WB pipeline register, which feeds Write-back; the branch target (PC + 1 + offset) is also shown leaving this stage]

Stage 5: Write-back
• Write the result to the register file (if required)
  – Write loaded data to destReg for LD
  – Write ALU result to destReg for ALU insn
  – Opcode bits control the register write enable signal
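Write-back is a MUX plus a gated register-file write. A minimal sketch: `reg_write` and `mem_to_reg` are assumed names for the write-enable and MUX-select signals, `write_back` is a hypothetical helper, and register 0 is kept at zero as with MIPS `$zero`.

```python
def write_back(reg_write: bool, mem_to_reg: bool, alu_result: int,
               loaded_data: int, dest_reg: int, regfile: list[int]) -> None:
    if not reg_write:                 # e.g. stores and branches write nothing
        return
    # MUX: loaded data for LD, ALU result for ALU instructions
    value = loaded_data if mem_to_reg else alu_result
    if dest_reg != 0:                 # MIPS $zero is hard-wired to 0
        regfile[dest_reg] = value
```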

Stage 5: Write-back Diagram
[Figure: MUXes on the outputs of the Mem/WB pipeline register select either the ALU result or the loaded data as the write data, and supply destReg, to the register file; control signals drive the MUX selects]

Putting It All Together
[Figure: the complete 5-stage datapath with PC, instruction cache, register file, ALU, and data cache, plus a +1 adder, a branch-target adder, an eq? comparison, and MUXes selecting the next PC, the second ALU input, and the write-back data; the stages are separated by the IF/ID, ID/EX, EX/Mem, and Mem/WB pipeline registers]
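The five stages can be wired together into a cycle-by-cycle simulation. This is a minimal sketch under loud assumptions: a hypothetical tuple ISA (not real MIPS encoding), no branches, and no hazard detection or forwarding, so the program itself must keep dependent instructions at least 3 slots apart. `run_pipeline` is an illustrative name. The key design choice is evaluating the stages from WB back to IF each cycle, so every stage consumes the pipeline register filled in the previous cycle before that register is overwritten, mimicking edge-triggered latches.

```python
# Hypothetical tuple ISA, not real MIPS encoding:
#   ("addi", rt, rs, imm)  R[rt] = R[rs] + imm
#   ("add",  rd, rs, rt)   R[rd] = R[rs] + R[rt]
#   ("lw",   rt, rs, imm)  R[rt] = M[R[rs] + imm]
#   ("sw",   rt, rs, imm)  M[R[rs] + imm] = R[rt]
# No branches, hazard detection, or forwarding.


def run_pipeline(program, regs, dmem):
    """Simulate a 5-stage pipeline; return the number of cycles taken."""
    pc, cycles = 0, 0
    if_id = id_ex = ex_mem = mem_wb = None     # pipeline registers (None = empty)
    while pc < len(program) or any((if_id, id_ex, ex_mem, mem_wb)):
        # Stages run from WB back to IF. WB also writes the register file
        # before ID reads it (write in first half of cycle, read in second).
        if mem_wb:                                         # WB
            op, dest, value = mem_wb
            if op != "sw":
                regs[dest] = value
        if ex_mem:                                         # MEM
            op, dest, addr_or_result, store_val = ex_mem
            if op == "lw":
                addr_or_result = dmem.get(addr_or_result, 0)
            elif op == "sw":
                dmem[addr_or_result] = store_val
            mem_wb = (op, dest, addr_or_result)
        else:
            mem_wb = None
        if id_ex:                                          # EX
            op, dest, a, b, store_val = id_ex
            ex_mem = (op, dest, a + b, store_val)
        else:
            ex_mem = None
        if if_id:                                          # ID
            op, d, s, t = if_id
            b = regs[t] if op == "add" else t              # regB or immediate
            store_val = regs[d] if op == "sw" else None
            id_ex = (op, d, regs[s], b, store_val)
        else:
            id_ex = None
        if pc < len(program):                              # IF
            if_id = program[pc]
            pc += 1
        else:
            if_id = None
        cycles += 1
    return cycles
```

A program of N instructions finishes in N + 4 cycles: 4 cycles to fill the pipeline, then one instruction completes per cycle.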

Issues With Pipelining

Pipelining Idealism
• Uniform Sub-operations
  – Operation can be partitioned into uniform-latency sub-ops
• Repetition of Identical Operations
  – Same ops performed on many different inputs
• Independent Operations
  – All ops are mutually independent

Pipeline Realism
• Uniform Sub-operations … NOT!
  – Balance pipeline stages
    • Stage quantization to yield balanced stages
    • Minimize internal fragmentation (left-over time near end of cycle)
• Repetition of Identical Operations … NOT!
  – Unifying instruction types
    • Coalescing instruction types into one “multi-function” pipe
    • Minimize external fragmentation (idle stages to match length)
• Independent Operations … NOT!
  – Resolve data and resource hazards
    • Inter-instruction dependency detection and resolution
Pipelining is expensive
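Dependency detection, the first half of resolving data hazards, can be sketched as a scan for read-after-write (RAW) pairs that are too close together. Assumptions: `Insn` and `raw_hazards` are hypothetical names, and `distance` models how many slots apart a producer and consumer must be for the result to reach the register file in time. For a 5-stage pipeline without forwarding, where WB writes in the first half of a cycle and ID reads in the second, that distance is 3.

```python
from typing import NamedTuple, Optional


class Insn(NamedTuple):
    name: str
    dest: Optional[int]      # register written, or None (e.g. stores, branches)
    srcs: tuple[int, ...]    # registers read


def raw_hazards(program: list[Insn], distance: int) -> list[tuple[int, int, int]]:
    """Return (producer, consumer, reg) for each RAW dependence closer than `distance`."""
    hazards = []
    for j, consumer in enumerate(program):
        for r in sorted(set(consumer.srcs)):
            if r == 0:                       # MIPS $zero never carries a dependence
                continue
            for i in range(j - 1, -1, -1):   # nearest earlier writer of r
                if program[i].dest == r:
                    if j - i < distance:     # result not yet in the register file
                        hazards.append((i, j, r))
                    break
    return hazards
```

Only the nearest earlier writer of each register matters, since an older write to the same register is superseded; resolution (stalling or forwarding) would then act on the reported pairs.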

The Generic Instruction Pipeline
• IF: Instruction Fetch
• ID: Instruction Decode
• OF: Operand Fetch
• EX: Instruction Execute
• WB: Write-back

Balancing Pipeline Stages
• Stage latencies: IF $T_{IF} = 6$ units, ID $T_{ID} = 2$ units, OF $T_{OF} = 9$ units, EX $T_{EX} = 5$ units, WB $T_{OS} = 9$ units
• Without pipelining: $T_{cyc} \approx T_{IF} + T_{ID} + T_{OF} + T_{EX} + T_{OS} = 31$ units
• Pipelined: $T_{cyc} \approx \max\{T_{IF}, T_{ID}, T_{OF}, T_{EX}, T_{OS}\} = 9$ units
• Speedup = 31 / 9 ≈ 3.44
• Can we do better?
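The arithmetic above can be checked in a few lines. A minimal sketch using the slide's stage latencies; like the slide's ≈, it ignores latch overhead, and `balance` is a hypothetical helper name. The idle time it reports is the internal fragmentation that the earlier realism slide asks us to minimize.

```python
def balance(latencies: dict[str, int]) -> tuple[int, int, float]:
    """Return (unpipelined cycle, pipelined cycle, speedup), ignoring latch overhead."""
    unpipelined = sum(latencies.values())   # all sub-ops in one long cycle
    pipelined = max(latencies.values())     # the slowest stage sets the clock
    return unpipelined, pipelined, unpipelined / pipelined


stage_units = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "WB": 9}
t_seq, t_pipe, speedup = balance(stage_units)
idle = t_pipe * len(stage_units) - t_seq    # internal fragmentation, in units
print(t_seq, t_pipe, round(speedup, 2), idle)  # 31 9 3.44 14
```

The 14 idle units out of 45 (5 stages × 9 units) explain why the speedup is 3.44 rather than the ideal 5 for a 5-stage pipeline.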

Balancing Pipeline Stages (1/2)
• Two methods for stage quantization
  – Divide sub-ops into smaller pieces
  – Merge multiple sub-ops into one
• Recent/Current trends
  – Deeper pipelines (more and more stages)
  – Pipelining of memory accesses
  – Multiple different pipelines/sub-pipelines

Balancing Pipeline Stages (2/2)
• Coarser-grained machine cycle: 4 machine cycles/instruction
  – Stages: IF&ID ($T_{IF\&ID} = 8$ units), OF ($T_{OF} = 9$ units), EX ($T_{EX} = 5$ units), WB ($T_{OS} = 9$ units)
  – # stages = 4, $T_{cyc} = 9$ units
• Finer-grained machine cycle: 11 machine cycles/instruction
  – Stages: IF, IF, ID, OF, OF, OF, EX, EX, WB, WB, WB
  – # stages = 11, $T_{cyc} = 3$ units
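Both quantizations follow from one rule: a sub-op of latency T needs ⌈T / T_cyc⌉ stages. A minimal sketch with a hypothetical `quantize` helper; the coarse case merges IF and ID by hand beforehand, since the helper only divides sub-ops and never merges them.

```python
import math


def quantize(latencies: dict[str, int], t_cyc: int) -> tuple[int, int]:
    """Cut each sub-op into ceil(T / t_cyc) stages; return (#stages, idle units)."""
    n_stages = sum(math.ceil(t / t_cyc) for t in latencies.values())
    idle = n_stages * t_cyc - sum(latencies.values())
    return n_stages, idle


coarse = {"IF&ID": 8, "OF": 9, "EX": 5, "WB": 9}   # IF and ID merged into one stage
fine = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "WB": 9}
print(quantize(coarse, 9))  # (4, 5)
print(quantize(fine, 3))    # (11, 2)
```

The finer cycle wastes less time per instruction (2 units versus 5), at the cost of more pipeline registers and more instructions in flight.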
