SLIDE 1

PIPELINING: 5-STAGE PIPELINE

CS/ECE 6810: Computer Architecture

Mahdi Nazm Bojnordi

Assistant Professor School of Computing University of Utah

SLIDE 2

Overview

• Announcement
  • Tonight: Homework 1 deadline (11:59 PM)
    • Verify your uploaded files before the deadline
• This lecture
  • Impacts of pipelining on performance
  • The MIPS five-stage pipeline
  • Pipeline hazards
    • Structural hazards
    • Data hazards

SLIDE 3

Single-cycle RISC Architecture

• Example: simple MIPS architecture
  • Critical path includes all of the processing steps

[Figure: single-cycle datapath with PC, Inst. Memory, Register File, ALU, Data Memory, and Controller spanning the Inst. Fetch, Inst. Decode, Execute, Memory, and Write Back steps]

SLIDE 4

Single-cycle RISC Architecture

• Example program
  • CT = 6 ns; CPU Time = ?

    AND R1,R2,R3
    XOR R4,R2,R3
    SUB R5,R1,R4
    ADD R6,R1,R4
    MUL R7,R5,R6

  CPU Time = IC x CPI x CT

SLIDE 5

Single-cycle RISC Architecture

• Example program
  • CT = 6 ns; CPU Time = 5 x 1 x 6 ns = 30 ns

    AND R1,R2,R3
    XOR R4,R2,R3
    SUB R5,R1,R4
    ADD R6,R1,R4
    MUL R7,R5,R6

  CPU Time = IC x CPI x CT

How to improve?
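The CPU-time formula on this slide can be checked with a short sketch (the function name is mine):

```python
# CPU Time = IC x CPI x CT: instruction count, times cycles per
# instruction, times cycle time.
def cpu_time_ns(ic, cpi, ct_ns):
    return ic * cpi * ct_ns

# Single-cycle machine: 5 instructions, CPI = 1, CT = 6 ns
print(cpu_time_ns(5, 1, 6))  # 30
```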

SLIDE 6

Reusing Idle Resources

• Each processing step finishes in a fraction of a cycle
  • Idle resources can be reused for processing the next instructions

[Figure: datapath with PC, Inst. Memory, Register File, ALU, and Data Memory across the Inst. Fetch, Inst. Decode, Execute, Memory, and Write Back steps]

SLIDE 7

Pipelined Architecture

• Five-stage pipeline
  • Critical path determines the cycle time

[Figure: pipelined datapath; stage delays are 1.5 ns, 1.5 ns, 1.25 ns, 1.05 ns, and 0.7 ns across the Inst. Fetch, Inst. Decode, Execute, Memory, and Write Back stages]

SLIDE 8

Pipelined Architecture

• Example program
  • CT = 1.5 ns; CPU Time = ?

    AND R1,R2,R3
    XOR R4,R2,R3
    SUB R5,R1,R4
    ADD R6,R1,R4
    MUL R7,R5,R6

  CPU Time = IC x CPI x CT

SLIDE 9

Pipelined Architecture

• Example program
  • CT = 1.5 ns; CPU Time = 5 x 5 x 1.5 ns = 37.5 ns > 30 ns. WORSE!!
    (CPI = 5 when each instruction runs through all five stages without overlap)

    AND R1,R2,R3
    XOR R4,R2,R3
    SUB R5,R1,R4
    ADD R6,R1,R4
    MUL R7,R5,R6

  CPU Time = IC x CPI x CT

SLIDE 10

Pipelined Architecture

• Example program
  • CT = 1.5 ns; CPU Time = ?

    AND R1,R2,R3
    XOR R4,R2,R3
    SUB R5,R1,R4
    ADD R6,R1,R4
    MUL R7,R5,R6

  CPU Time = IC x CPI x CT

SLIDE 11

Pipelined Architecture

• Example program
  • CT = 1.5 ns; CPU Time = 9 x 1 x 1.5 ns = 13.5 ns
    (the five instructions overlap and finish in 5 + 4 = 9 cycles)

    AND R1,R2,R3
    XOR R4,R2,R3
    SUB R5,R1,R4
    ADD R6,R1,R4
    MUL R7,R5,R6

  CPU Time = IC x CPI x CT

What is the cost of pipelining?
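The 9-cycle figure follows from overlap: the first instruction occupies all five stages, and each later instruction completes one cycle behind its predecessor. A minimal sketch (names are mine):

```python
# Ideal pipeline: total cycles = stages + (n - 1), because the first
# instruction fills the pipeline and each later one retires a cycle apart.
def pipelined_time_ns(n_insts, n_stages, ct_ns):
    cycles = n_stages + (n_insts - 1)
    return cycles * ct_ns

print(pipelined_time_ns(5, 5, 1.5))  # 13.5
```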

SLIDE 12

Pipelining Technique

• Improving throughput at the expense of latency
  • Delay: D = T + nδ
  • Throughput: IPS = n/(T + nδ)

[Figure: one combinational logic block with critical path delay 30]

SLIDE 13

Pipelining Technique

• Improving throughput at the expense of latency
  • Delay: D = T + nδ
  • Throughput: IPS = n/(T + nδ)

[Figure: the same logic unpipelined (critical path delay 30), split into two stages of delay 15 each, and split into three stages of delay 10 each; for each version, D = ? and IPS = ?]

SLIDE 14

Pipelining Technique

• Improving throughput at the expense of latency
  • Delay: D = T + nδ
  • Throughput: IPS = n/(T + nδ)

[Figure: the same logic pipelined into one stage (delay 30), two stages (delay 15 each), and three stages (delay 10 each)]

  • n = 1: D = 31, IPS = 1/31
  • n = 2: D = 32, IPS = 2/32
  • n = 3: D = 33, IPS = 3/33
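The numbers above follow from the model with T = 30 units of logic delay and δ = 1 unit of pipeline-register overhead per stage (values implied by D = 31, 32, 33 for n = 1, 2, 3). A sketch:

```python
# D = T + n*delta and IPS = n / (T + n*delta), with T = 30 and delta = 1
# inferred from the slide's worked answers.
T, DELTA = 30, 1

def delay(n):
    return T + n * DELTA          # total latency through the pipeline

def throughput(n):
    return n / delay(n)           # completions per unit time, steady state

for n in (1, 2, 3):
    print(n, delay(n), round(throughput(n), 4))
```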

SLIDE 15

Pipelining Latency vs. Throughput

• Theoretical delay and throughput models for perfect pipelining

[Figure: relative performance vs. number of pipeline stages, showing the Delay (D) curve]

SLIDE 16

Pipelining Latency vs. Throughput

• Theoretical delay and throughput models for perfect pipelining

[Figure: relative performance vs. number of pipeline stages, showing both the Delay (D) and Throughput (IPS) curves]

SLIDE 17

Five Stage MIPS Pipeline

SLIDE 18

Simple Five Stage Pipeline

• A pipelined load-store architecture that processes up to one instruction per cycle

[Figure: five-stage datapath with PC, Inst. Memory, Register File, ALU, and Data Memory across the Inst. Fetch, Inst. Decode, Execute, Memory, and Write Back stages]

SLIDE 19

Instruction Fetch

• Read an instruction from memory (I-Memory)
  • Use the program counter (PC) to index into the I-Memory
  • Compute NPC by incrementing the current PC
    • What about branches?
• Update pipeline registers
  • Write the instruction into the pipeline registers
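A minimal sketch of the fetch step, assuming a word-addressed instruction list and byte-granular PCs (names are mine):

```python
# Instruction fetch: index I-Memory with the PC, compute NPC = PC + 4.
def fetch(imem, pc):
    inst = imem[pc // 4]  # each 32-bit MIPS instruction occupies 4 bytes
    npc = pc + 4          # next sequential PC; branches are resolved later
    return inst, npc      # both are written into the IF/ID pipeline register
```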

SLIDE 20

Instruction Fetch

[Figure: fetch-stage datapath; the PC indexes the instruction memory, a +4 adder produces NPC, a mux selects between NPC and the branch target, and the instruction and NPC are clocked into the pipeline register. Why increment by 4? NPC = PC + 4]

SLIDE 21

Instruction Fetch

[Figure: the same fetch-stage datapath with three timing paths P1, P2, and P3 marked; Critical Path = Max{P1, P2, P3}]

SLIDE 22

Instruction Decode

• Generate control signals from the opcode bits
• Read source operands from the register file (RF)
  • Use the register specifiers to index the RF
    • How many read ports are required?
• Update pipeline registers
  • Send the operand and immediate values to the next stage
  • Pass control signals and NPC to the next stage
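A sketch of decode using the standard MIPS field positions (the return layout and names are mine):

```python
# Instruction decode: split off the opcode and register specifiers, read
# two source operands (hence two RF read ports), extract the immediate.
def decode(regfile, inst, npc):
    opcode = (inst >> 26) & 0x3F   # control signals derive from these bits
    rs = (inst >> 21) & 0x1F       # first source specifier
    rt = (inst >> 16) & 0x1F       # second source specifier
    a, b = regfile[rs], regfile[rt]
    imm = inst & 0xFFFF            # immediate field of I-type instructions
    return opcode, a, b, imm, npc  # forwarded in the ID/EX pipeline register
```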

SLIDE 23

Instruction Decode

[Figure: decode-stage datapath; the instruction and NPC arrive from the fetch pipeline register, the decoder produces control signals and the target fields, the register file reads two source registers, and the register values, control signals, and NPC are written into the next pipeline register]

SLIDE 24

Execute Stage

• Perform the ALU operation
  • Compute the ALU result
    • Operation type: control signals
    • First operand: contents of a register
    • Second operand: either a register or the immediate value
  • Compute the branch target
    • Target = NPC + immediate
• Update pipeline registers
  • Control signals, branch target, ALU result, and destination
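A sketch of the execute step under an assumed control encoding (the `alu_ops` table and flag names are mine):

```python
# Execute: pick the second operand (register or immediate), apply the ALU
# operation selected by control, and compute the branch target.
def execute(alu_ops, op, a, b, imm, npc, use_imm):
    src2 = imm if use_imm else b
    result = alu_ops[op](a, src2)  # operation type comes from control signals
    target = npc + imm             # branch target = NPC + immediate
    return result, target          # into the EX/MEM pipeline register

alu_ops = {"add": lambda x, y: x + y, "sub": lambda x, y: x - y}
print(execute(alu_ops, "add", 5, 7, 100, 4, False))  # (12, 104)
```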

SLIDE 25

Execute Stage

[Figure: execute-stage datapath; the ALU takes one register operand and either a second register or the immediate under control signals, an adder computes NPC + immediate as the branch target, and the result, target, and control signals go into the next pipeline register]

SLIDE 26

Memory Access

• Access data memory
  • Load/store address: the ALU outcome
  • Control signals determine read or write access
• Update pipeline registers
  • ALU result from execute
  • Loaded data from D-Memory
  • Destination register
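A sketch of the memory step, modeling D-Memory as a word-addressed dict (names are mine):

```python
# Memory access: the ALU result is the load/store address; control
# signals decide read vs. write. The ALU result and any loaded data
# both continue on to write-back.
def memory_access(dmem, alu_result, store_val, is_load, is_store):
    loaded = None
    if is_load:
        loaded = dmem[alu_result]      # read access
    elif is_store:
        dmem[alu_result] = store_val   # write access
    return alu_result, loaded

dmem = {8: 42}
print(memory_access(dmem, 8, None, True, False))  # (8, 42)
```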

SLIDE 27

Memory Access

[Figure: memory-stage datapath; the ALU result addresses the data memory, control signals select read or write, and the ALU result, loaded data, destination register, and control signals are written into the next pipeline register]

SLIDE 28

Register Write Back

• Update the register file
  • Control signals determine whether a register write is needed
  • Only one write port is required
    • Write the ALU result to the destination register, or
    • Write the loaded data into the register file
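A sketch of write-back; the single write port commits either the ALU result or the loaded data (the flag names are mine):

```python
# Write back: control decides whether to write at all, and whether the
# value comes from the ALU or from memory. One write port suffices.
def write_back(regfile, dest, alu_result, load_data, reg_write, mem_to_reg):
    if reg_write:
        regfile[dest] = load_data if mem_to_reg else alu_result

regs = [0] * 32
write_back(regs, 5, 99, 42, True, True)  # a load: commit the memory data
print(regs[5])  # 42
```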

SLIDE 29

Five Stage Pipeline

• Ideal pipeline: IPC = 1
  • Are there enough resources to keep the pipeline stages busy all the time?

[Figure: full five-stage pipeline with PC and +4 adder, instruction memory, register file, ALU, data memory, and register-file write-back across the Inst. Fetch, Decode, Execute, Memory, and Writeback stages]

SLIDE 30

Pipeline Hazards

SLIDE 31

Pipeline Hazards

• Structural hazards: multiple instructions compete for the same resource
• Data hazards: a dependent instruction cannot proceed because it needs a value that hasn't been produced
• Control hazards: the next instruction cannot be fetched because the outcome of an earlier branch is unknown

SLIDE 32

Structural Hazards

• 1. Unified memory for instructions and data

    R1 ← Mem[R2]
    R7 ← R1 + R0
    R6 ← R4 - R5
    R3 ← Mem[R20]

SLIDE 33

Structural Hazards

• 1. Unified memory for instructions and data

    R1 ← Mem[R2]
    R7 ← R1 + R0
    R6 ← R4 - R5
    R3 ← Mem[R20]

  Solution: separate instruction and data memories.

SLIDE 34

Structural Hazards

• 1. Unified memory for instructions and data
• 2. Register file with shared read/write access ports

    R1 ← Mem[R2]
    R7 ← R1 + R0
    R6 ← R4 - R5
    R3 ← Mem[R20]

SLIDE 35

Structural Hazards

• 1. Unified memory for instructions and data
• 2. Register file with shared read/write access ports

    R1 ← Mem[R2]
    R7 ← R1 + R0
    R6 ← R4 - R5
    R3 ← Mem[R20]

  Solution: access the register file in half cycles (write in the first half, read in the second).
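The half-cycle fix can be illustrated with a toy register file in which write-back uses the first half of the cycle and decode reads in the second half, so a read and a write in the same cycle no longer conflict (class and method names are mine):

```python
# Half-cycle register access: writes land in the first half of a cycle
# and reads happen in the second half, so a read in the same cycle as a
# write already observes the new value.
class RegFile:
    def __init__(self):
        self.regs = [0] * 32

    def write_first_half(self, dest, val):
        if dest != 0:               # register 0 is hardwired to zero in MIPS
            self.regs[dest] = val

    def read_second_half(self, src):
        return self.regs[src]

rf = RegFile()
rf.write_first_half(1, 42)          # write-back of an older instruction
print(rf.read_second_half(1))       # 42: a younger instruction's decode
```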