Chapter Six 1 2004 Morgan Kaufmann Publishers Pipelining The - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Chapter Six


2004 Morgan Kaufmann Publishers

slide-2
SLIDE 2

Pipelining

  • The laundry analogy for pipelining:


slide-3
SLIDE 3

Pipelining

  • Improve performance by increasing instruction throughput

– Single cycle: 2400 ps
– Pipelining: 1400 ps


Ideal speedup is number of stages in the pipeline. Do we achieve this?
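A rough sketch of the arithmetic behind the two totals on this slide. The slide does not say how the numbers arise, so the figures here are assumptions: three back-to-back instructions, an 800 ps single-cycle clock, and five pipeline stages of 200 ps each. The function names are hypothetical.

```python
def single_cycle_time(n_instr, cycle_ps):
    """Total time when each instruction takes one long clock cycle."""
    return n_instr * cycle_ps


def pipelined_time(n_instr, n_stages, stage_ps):
    """Total time for an ideal pipeline with no stalls.

    The first instruction needs n_stages cycles to drain; every later
    instruction finishes one stage time after the one before it.
    """
    return (n_stages + n_instr - 1) * stage_ps


# Assumed figures (hypothetical, chosen to match the slide's totals).
N_STAGES, STAGE_PS, CYCLE_PS = 5, 200, 800

for n in (3, 1_000_000):
    t_single = single_cycle_time(n, CYCLE_PS)
    t_pipe = pipelined_time(n, N_STAGES, STAGE_PS)
    print(f"{n} instrs: {t_single} ps vs {t_pipe} ps, "
          f"speedup {t_single / t_pipe:.2f}")
```

Under these assumed numbers the speedup for a long instruction stream approaches 800/200 = 4 rather than the ideal 5, because the stage time is set by the slowest stage and the stages are not perfectly balanced.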

slide-4
SLIDE 4
  • Pipelining: key to making processors fast

– An implementation technique in which multiple instructions are overlapped in execution.

  • Execution of MIPS instructions classically takes 5 steps:
  • 1. IF: Fetch instr from mem.
  • 2. ID: Read regs while decoding the instr.
  • 3. EX: Execute the op or calculate an addr.
  • 4. MEM: Access an operand in data mem.
  • 5. WB: Write the result into a reg.
  • Graphical representation of the instr pipeline
  • Memory and the register file are written in the first half of the clock cycle and read in the last half (shaded area: the unit is in use)

[Figure: add $s0, $t0, $t1 passing through IF, ID, EX, MEM, WB; time axis 2 to 10]

  • The MIPS pipeline uses 5 stages

slide-5
SLIDE 5

Pipelining

  • What makes it easy

– all instructions are the same length
– just a few instruction formats
– memory operands appear only in loads and stores
– operands must be aligned in memory

  • What makes it hard?

Situations in pipelining in which a planned instruction cannot execute in the proper clock cycle

  • Pipeline hazards:

– structural hazards: suppose we had only one memory
– data hazards: an instruction depends on a previous instruction
– control hazards: need to worry about branch instructions

slide-6
SLIDE 6

An Overview of Pipelining – Structural Hazard

  • Structural Hazards

– The situation in which a planned instr cannot execute in the proper clock cycle because the hardware cannot support the combination of instructions that we want to execute in the same clock cycle.

  • Example

– The first instr is accessing data from memory while the fourth instr is fetching an instr from that same memory.

  • Solution: Add more hardware (add another memory)

[Figure: a structural hazard occurs if there is only one memory]
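The single-memory conflict in the example can be checked with a small stage-occupancy model. This is a sketch under simple assumptions: one instruction enters the pipeline per cycle, there are no stalls, and the helper names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_of(instr_index, cycle):
    """Stage occupied by instruction `instr_index` (0-based) in `cycle`
    (1-based), or None if it is not in the pipeline that cycle."""
    pos = cycle - 1 - instr_index
    return STAGES[pos] if 0 <= pos < len(STAGES) else None


def memory_conflicts(n_instr, uses_data_memory):
    """Return (cycle, fetching_instr, accessing_instr) triples where a
    single shared memory would be needed by IF and MEM at once."""
    conflicts = []
    for cycle in range(1, n_instr + len(STAGES)):
        in_if = [i for i in range(n_instr) if stage_of(i, cycle) == "IF"]
        in_mem = [i for i in range(n_instr)
                  if stage_of(i, cycle) == "MEM" and uses_data_memory[i]]
        for f in in_if:
            for m in in_mem:
                conflicts.append((cycle, f, m))
    return conflicts


# Four instructions; only the first (a load) touches data memory.
print(memory_conflicts(4, [True, False, False, False]))
```

The one reported conflict is in cycle 4, between the fetch of the fourth instruction (index 3) and the data access of the first (index 0), which is the case the slide describes.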

slide-7
SLIDE 7

An Overview of Pipelining – Data Hazard

  • Data Hazard:

– The situation in which a planned instr cannot execute in the proper clock cycle because data needed to execute the instr is still in the pipeline (not yet available).

  • Example 1

add $s0, $t0, $t1
sub $t2, $s0, $t3

  • Example 2 (load-use data hazard)

lw $s0, 20($t1)
sub $t2, $s0, $t3

  • Solutions

– Data forwarding (bypassing)

  • Retrieving the data early from internal buffers rather than waiting for it to arrive at the registers or memory.

– Pipeline stall (bubbles)

  • A stall initiated in order to resolve a hazard.

– Reordering code

slide-8
SLIDE 8

An Overview of Pipelining – Data Hazard solution

  • Forwarding (Example 1)

add $s0, $t0, $t1
sub $t2, $s0, $t3

[Figure: pipeline diagram of the add/sub pair with forwarding; time axis 2 to 10, program execution order in instructions]

  • Forwarding + stall (Example 2)

lw $s0, 20($t1)
sub $t2, $s0, $t3

[Figure: pipeline diagram of the lw/sub pair with a bubble inserted; time axis 2 to 14, program execution order in instructions]

Forwarding alone is not enough here

slide-9
SLIDE 9

An Overview of Pipelining – Data Hazard solution

Reordering code

(Ans.) The hazard occurs on $t2 between the 2nd lw and the 1st sw, so swapping the two sw instrs and using forwarding removes the stall.

slide-10
SLIDE 10
An Overview of Pipelining – Control Hazard

  • Control Hazard:

– The situation in which a planned instr cannot execute in the proper clock cycle because the instr that was fetched is not the one that is needed; that is, the flow of instr addresses is not what the pipeline expected.

  • Example: Branch instr.
  • Solutions

– Stall

  • Wait until the pipeline determines the outcome of the branch and knows what instr address to fetch from.

– Prediction

  • Predict the branch to be taken or untaken. When the guess is wrong, restart the pipeline from the proper branch address.

– Delayed branch

  • Place an instr that is not affected by the branch immediately after the branch instr. A taken branch then changes the address of the instr that follows this safe instr.
slide-11
SLIDE 11

An Overview of Pipelining – Control Hazard Solution

  • Stall

– For branch instrs:

  • Assumption: put in enough extra hardware to test regs, calculate the branch addr, and update the PC during the 2nd stage
  • E.g.: pipeline stall, bubble
  • Ex. Estimate the impact on the CPI of stalling on branches. Assume all other instrs have a CPI of 1 and branches are 13% of the instructions. (Ans. Each branch costs one extra stall cycle, so CPI = 1 + 0.13 × 1 = 1.13)
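The exercise's answer can be reproduced with a one-line model. This sketch assumes one stall cycle per branch, which follows from resolving the branch in the 2nd stage; the function name is hypothetical.

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, stall_cycles):
    """Average CPI when every branch adds `stall_cycles` extra cycles."""
    return base_cpi + branch_fraction * stall_cycles


# 13% branches, one stall cycle each, all other instructions at CPI 1.
print(round(cpi_with_branch_stalls(1.0, 0.13, 1), 2))
```

The same function shows why deeper pipelines suffer more: with three stall cycles per branch the CPI under the same mix would rise to 1.39.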

slide-12
SLIDE 12

An Overview of Pipelining – Control Hazard Solution

  • Prediction
  • 1. Always predict branches as untaken

[Figure: pipeline diagrams for the branch-untaken and branch-taken cases]

slide-13
SLIDE 13
An Overview of Pipelining – Control Hazard Solution

  • Prediction
  • 2. Predict some branches as taken and some as untaken
  • For example, always predict taken for branches that jump to an earlier address.
  • 3. Dynamic hardware prediction:
  • Make guesses depending on the behavior of each branch; the prediction for a branch may change over the life of a program.
  • E.g.: keep a history for each branch as taken or untaken, and then use the past to predict the future (90% accuracy)
  • Misprediction:

– When the guess is wrong, the pipeline control must ensure that the instrs following the wrongly guessed branch have no effect, and must restart the pipeline from the proper branch addr.
– Longer pipelines exacerbate the problem.

slide-14
SLIDE 14
An Overview of Pipelining – Control Hazard Solution

  • Delayed branch: used by MIPS

– Always executes the next sequential instr, with the branch taking place after that one-instr delay

beq $1, $2, 40
add $4, $5, $6 (delayed branch slot)
lw $3, 300($0)

[Figure: pipeline diagram, time axis 2 to 14, program execution order in instructions; each instr passes through Instruction fetch, Reg, ALU, Data access, Reg, and successive instrs start 2 ns apart]

slide-15
SLIDE 15

Pipeline Overview Summary

  • Pipelining:

– exploits parallelism among the instrs in a sequential instr stream
– Substantial advantage: it is fundamentally invisible to the programmer

  • Big Picture:

– Pipelining increases the # of simultaneously executing instrs and the rate at which instrs are started and completed.
– Pipelining does not reduce the time it takes to complete an individual instr.
– Pipelining improves instr throughput rather than individual instr execution time.
– Pipeline designers must cope with structural, control, and data hazards.
– Branch prediction, forwarding, and stalls help make a computer fast while still getting the right answers.

slide-16
SLIDE 16

Basic Idea

  • The single-cycle datapath from Ch5

  • There are two right-to-left flows

– WB stage: can cause data hazards
– Selection of the next value of the PC: can cause control hazards

slide-17
SLIDE 17

Pipelined Datapath

  • Separate each pipeline stage by inserting pipeline registers
  • Assume the register file is written in the first half of the clock cycle and read in the last half

Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?

slide-18
SLIDE 18

Pipeline Examples: Load and Store

LOAD

  • IF

– instr -> IF/ID
– PC+4 -> PC
– PC+4 -> IF/ID

  • ID

– Reg[IF/ID.rs] -> ID/EX
– Reg[IF/ID.rt] -> ID/EX
– IF/ID.Sign-extended 32bits -> ID/EX
– IF/ID.pc+4 -> ID/EX

  • EX

– mem-addr -> EX/MEM

  • MEM

– mem-data = MEM[EX/MEM.mem-addr] -> MEM/WB

  • WB

– MEM/WB.mem-data -> Reg[rt]

IF/ID.rt -> ID/EX -> EX/MEM -> MEM/WB

STORE

  • IF

– instr -> IF/ID
– PC+4 -> PC
– PC+4 -> IF/ID

  • ID

– Reg[IF/ID.rs] -> ID/EX
– Reg[IF/ID.rt] -> ID/EX
– IF/ID.Sign-extended 32bits -> ID/EX
– IF/ID.pc+4 -> ID/EX

  • EX

– mem-addr -> EX/MEM

  • MEM

– Reg[rt] -> MEM[EX/MEM.mem-addr]

  • WB

– Do nothing

ID/EX.Reg[rt] -> EX/MEM

slide-19
SLIDE 19

Corrected Datapath

– ID/EX.Reg[rt] -> EX/MEM for sw
– IF/ID.rt -> ID/EX -> EX/MEM -> MEM/WB for lw

slide-20
SLIDE 20

Graphically Representing Pipelines

[Figure: multiple-clock-cycle pipeline diagram; time in clock cycles CC 1 to CC 7, program execution order in instructions; each instr passes through IM, Reg, ALU, DM, Reg]

lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)

  • Can help with answering questions like:

– how many cycles does it take to execute this code?
– what is the ALU doing during cycle 4?
– use this representation to help understand datapaths
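The two questions on this slide can be answered by regenerating the diagram in text form. This is a minimal sketch assuming one instruction issued per cycle with no stalls; the stage labels follow the figure, and the function names are hypothetical.

```python
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]


def total_cycles(n_instr, n_stages=len(STAGES)):
    """Cycles to run n_instr instructions with no stalls."""
    return n_stages + n_instr - 1


def stage_in_cycle(instr_index, cycle):
    """Stage used by instruction instr_index (0-based) in clock cycle
    `cycle` (1-based), or None if the instruction is not in flight."""
    pos = cycle - 1 - instr_index
    return STAGES[pos] if 0 <= pos < len(STAGES) else None


program = ["lw $1, 100($0)", "lw $2, 200($0)", "lw $3, 300($0)"]

n_cycles = total_cycles(len(program))
print(" " * 16 + "".join(f"CC{c:<4}" for c in range(1, n_cycles + 1)))
for i, text in enumerate(program):
    row = "".join(f"{stage_in_cycle(i, c) or '':<6}"
                  for c in range(1, n_cycles + 1))
    print(f"{text:<16}{row}")

# Which instruction occupies the ALU during cycle 4?
in_alu = [p for i, p in enumerate(program) if stage_in_cycle(i, 4) == "ALU"]
print("ALU in CC4:", in_alu)
```

For the three loads this gives 5 + 3 - 1 = 7 cycles, and in cycle 4 the ALU is computing the address for the second lw.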

slide-21
SLIDE 21
Pipeline Control

  • We have 5 stages. What needs to be controlled in each stage?

– Instruction Fetch and PC Increment
– Instruction Decode / Register Fetch
– Execution
– Memory Stage
– Write Back

  • How would control be handled in an automobile plant?

– a fancy control center telling everyone what to do?
– should we use a finite state machine?

slide-22
SLIDE 22

Pipeline Control

[Figure: pipelined datapath with the control lines identified; stages IF, ID, EX, MEM, WB]

  • 9 control signals in total

slide-23
SLIDE 23
Pipeline Control

  • Pass control signals along just like the data

Instruction | Execution/address calculation stage | Memory access stage     | Write-back stage
            | RegDst ALUOp1 ALUOp0 ALUSrc         | Branch MemRead MemWrite | RegWrite MemtoReg
R-format    | 1      1      0      0              | 0      0       0        | 1        0
lw          | 0      0      0      1              | 0      1       0        | 1        1
sw          | X      0      0      1              | 0      0       1        | 0        X
beq         | X      0      1      0              | 1      0       0        | 0        X

Control information is created during the ID stage and then used in the appropriate stage as the instr moves down the pipeline.

slide-24
SLIDE 24

Datapath with Control


slide-25
SLIDE 25
Dependencies

  • Problem with starting the next instruction before the first is finished

– dependencies that “go backward in time” are data hazards
– The register file is assumed to be written in the first half of a clock cycle and read in the last half

slide-26
SLIDE 26
Software Solution

  • Have the compiler guarantee no hazards
  • Where do we insert the “nops”?

sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

  • Problem: this really slows us down!
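One way to answer the "where do the nops go" question is to compute the minimum spacing mechanically. This sketch assumes no forwarding and a register file written in the first half of a cycle and read in the second half, so a reader must sit at least three slots after the writer. The tuple encoding of instructions and the function name are hypothetical.

```python
def insert_nops(program, min_distance=3):
    """program: list of (text, dest_reg or None, [source_regs]).

    Returns a new list of instruction texts with "nop" inserted so that
    every reader of a register is at least min_distance slots after the
    instruction that wrote it.
    """
    out = []              # scheduled instruction texts
    last_write = {}       # register -> index in `out` of its last writer
    for text, dest, sources in program:
        # Earliest slot at which every source register is readable.
        earliest = max(
            (last_write[r] + min_distance for r in sources if r in last_write),
            default=0,
        )
        while len(out) < earliest:
            out.append("nop")
        out.append(text)
        if dest is not None:
            last_write[dest] = len(out) - 1
    return out


program = [
    ("sub $2, $1, $3", "$2", ["$1", "$3"]),
    ("and $12, $2, $5", "$12", ["$2", "$5"]),
    ("or $13, $6, $2", "$13", ["$6", "$2"]),
    ("add $14, $2, $2", "$14", ["$2"]),
    ("sw $15, 100($2)", None, ["$15", "$2"]),
]
scheduled = insert_nops(program)
print("\n".join(scheduled))
```

Under these assumptions two nops are needed, both between the sub and the and; the later readers of $2 are then already far enough from the sub.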
slide-27
SLIDE 27
Forwarding

  • Use temporary results, don’t wait for them to be written

– ALU forwarding

Hazard Conditions

1a. EX/MEM.RegRd = ID/EX.RegRs
1b. EX/MEM.RegRd = ID/EX.RegRt
2a. MEM/WB.RegRd = ID/EX.RegRs
2b. MEM/WB.RegRd = ID/EX.RegRt

slide-28
SLIDE 28

Forwarding

[Figure a: No forwarding (before adding forwarding); Registers, ID/EX, ALU, EX/MEM, Data memory, MEM/WB]

[Figure b: With forwarding (after adding forwarding); a forwarding unit takes Rs and Rt from ID/EX together with EX/MEM.RegisterRd and MEM/WB.RegisterRd, and drives the ForwardA and ForwardB muxes at the ALU inputs]

Conditions checked by the forwarding unit:

1a. EX/MEM.RegRd = ID/EX.RegRs
1b. EX/MEM.RegRd = ID/EX.RegRt
2a. MEM/WB.RegRd = ID/EX.RegRs
2b. MEM/WB.RegRd = ID/EX.RegRt

slide-29
SLIDE 29

– The control values for the forwarding multiplexors:


slide-30
SLIDE 30
  • Hazard detection unit (done in the EX stage)
  • 1. EX hazard: (e.g., the sub-and pair in P.27)

if ((EX/MEM.RegWrite) and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10
if ((EX/MEM.RegWrite) and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10

  • 2. MEM hazard: (e.g., the sub-or pair in P.27)

if ((MEM/WB.RegWrite) and (MEM/WB.RegisterRd ≠ 0)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01
if ((MEM/WB.RegWrite) and (MEM/WB.RegisterRd ≠ 0)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01

slide-31
SLIDE 31

Consider: summing a vector of numbers in a single reg

add $1, $1, $2
add $1, $1, $3
add $1, $1, $4

For the third add, both EX/MEM.RegisterRd = ID/EX.RegisterRs and MEM/WB.RegisterRd = ID/EX.RegisterRs hold. The more recent result in EX/MEM must be the one forwarded, so the MEM hazard condition is extended:

  • 2. Extended MEM hazard:

if ((MEM/WB.RegWrite) and (MEM/WB.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRs)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01
if ((MEM/WB.RegWrite) and (MEM/WB.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRt)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
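The EX and MEM hazard conditions can be collected into one small function. This is a sketch, not the slide's circuit: the `PipeReg` class and function names are hypothetical, the value 00 for "no forwarding" is assumed because the table of mux control values is not in the text, and the priority of EX/MEM over MEM/WB is expressed by testing the EX hazard first rather than by the extra inequality term.

```python
from dataclasses import dataclass


@dataclass
class PipeReg:
    """Fields of a pipeline register that the forwarding unit reads."""
    reg_write: bool = False
    rd: int = 0


def forward_select(ex_mem, mem_wb, src):
    """Return the 2-bit mux control for one ALU operand.

    "10": take the ALU result held in EX/MEM (EX hazard)
    "01": take the value held in MEM/WB (MEM hazard)
    "00": no hazard, use the register-file value from ID/EX (assumed)
    """
    if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == src:
        return "10"
    if mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == src:
        return "01"
    return "00"


def forwarding_unit(ex_mem, mem_wb, id_ex_rs, id_ex_rt):
    """Compute (ForwardA, ForwardB) for the instruction in EX."""
    return (forward_select(ex_mem, mem_wb, id_ex_rs),
            forward_select(ex_mem, mem_wb, id_ex_rt))


# Third add of the running-sum example: add $1, $1, $4 is in EX while
# the two earlier adds (both writing $1) sit in EX/MEM and MEM/WB.
print(forwarding_unit(PipeReg(True, 1), PipeReg(True, 1), 1, 4))
```

For the running-sum case the first operand is taken from EX/MEM (the newest value of $1) and the second operand comes straight from the register file.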

slide-32
SLIDE 32
Can't always forward

  • Forwarding can’t resolve all data hazards; a stall is sometimes necessary

– E.g., a load-use data hazard needs a stall

Hazard Detection (done in the ID stage)

if (ID/EX.MemRead
    and ((ID/EX.RegisterRt = IF/ID.RegisterRs)
      or (ID/EX.RegisterRt = IF/ID.RegisterRt)))
  stall the pipeline

– Insert the stall between the load and its use
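The load-use test translates almost directly into code. In this sketch the function name and the "bubble" key are hypothetical; PCWrite and IF/IDWrite are the write-enable signals for the PC and the IF/ID register, which are deasserted to hold the two younger instructions in place for one cycle.

```python
def hazard_detection_unit(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Decide whether the instruction in ID must wait for a load in EX.

    Returns the control outputs as a dict: when a stall is needed the
    PC and IF/ID register are frozen and a bubble is sent down ID/EX.
    """
    stall = bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)
    return {"PCWrite": not stall, "IF/IDWrite": not stall, "bubble": stall}


# lw $2, 20($1) is in EX (rt = 2); and $4, $2, $5 is in ID (rs = 2, rt = 5).
print(hazard_detection_unit(True, 2, 2, 5))
```

After the one-cycle stall the loaded value is available in MEM/WB, so ordinary forwarding can supply it to the dependent instruction.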

slide-33
SLIDE 33

Stalling

  • We can stall the pipeline by keeping an instruction in the same stage

– Bubble: change the EX, MEM, and WB control fields of ID/EX to 0
– Keep the instructions in IF and ID in the same stages for one more cycle (the and and or instrs in the figure)

[Figure: pipeline diagram in which the hazard is detected and the and and or instrs repeat their stages]
slide-34
SLIDE 34

Hazard Detection Unit

  • Stall by letting an instruction that won’t write anything go forward

– 0 as mux input: the EX, MEM, and WB control fields of ID/EX become 0 (bubble)
– PCWrite = 0: fetch the same instruction again in IF (the or instr)
– IF/IDWrite = 0: make ID decode the same instr stored in IF/ID (the and instr)

slide-35
SLIDE 35
Branch Hazards

  • When we decide to branch, other instructions are in the pipeline!
  • We are predicting “branch not taken”

– need to add hardware for flushing instructions if we are wrong
– Since the branch instr decides whether to branch in the MEM stage, the instructions in the IF, ID, and EX stages must be flushed from the pipeline.

slide-36
SLIDE 36
Flushing Instructions

  • Reduce Branch Penalty

– Add some hardware circuits to move the branch decision to the ID stage

  • Address calculation and equality test (XOR the two registers and OR the result bits)

– For a wrong branch decision, only one instr (in the IF stage) needs to be flushed

  • Add a control line IF.Flush to clear the instr field of IF/ID (it becomes a nop)
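The XOR-then-OR equality test used for the early branch decision can be illustrated in software. This is only a bit-level sketch of the idea, assuming 32-bit registers; the function name is hypothetical and the loop stands in for what hardware does with a single wide OR gate.

```python
def registers_equal(a, b, width=32):
    """Equality test built from XOR and an OR-reduction.

    XOR yields a 1 in every bit position where a and b differ; OR-ing
    all of those bits together gives 1 if any bit differs.
    """
    mask = (1 << width) - 1
    diff = (a ^ b) & mask
    any_bit_set = 0
    for bit in range(width):
        any_bit_set |= (diff >> bit) & 1
    return any_bit_set == 0


print(registers_equal(40, 40), registers_equal(40, 41))
```

Because this needs no carry chain, it is much faster than a full ALU subtraction, which is what makes it practical to fit into the ID stage.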

slide-37
SLIDE 37

Branches

  • If the branch is taken, we have a penalty of one cycle
  • For our simple design, this is reasonable
  • With deeper pipelines, the penalty increases and static branch prediction drastically hurts performance
  • Solution: dynamic branch prediction

A 2-bit prediction scheme
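The state diagram for the 2-bit scheme is not reproduced in the text, so the following is a sketch of one common variant, a 2-bit saturating counter, and may differ in detail from the slide's figure. The class and function names are hypothetical.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not taken,
    states 2-3 predict taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def accuracy(outcomes, predictor):
    """Fraction of branch outcomes the predictor guesses correctly."""
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct / len(outcomes)


# A loop branch taken 9 times and then not taken, repeated 10 times.
outcomes = ([True] * 9 + [False]) * 10
print(accuracy(outcomes, TwoBitPredictor(state=3)))
```

The point of the second bit is that a single loop exit does not flip the prediction: the counter drops from 3 to 2, still predicts taken on re-entry, and so mispredicts only once per pass through the loop.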

slide-38
SLIDE 38

Branch Hazard – Delayed Branch

  • Branch delay slot: the slot directly after a delayed branch instruction
  • Delayed branch: in the MIPS architecture, the branch delay slot is filled by an instruction that does not affect the branch

– From before, from target, from fall-through (not taken)

[Figure: the ways of filling the delay slot, annotated "Always correct" and "Must be aware of whether register is changed"]

slide-39
SLIDE 39

Improving Performance

  • Try to avoid stalls! E.g., reorder these instructions:

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
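The benefit of reordering this sequence can be checked by counting load-use stalls before and after. This sketch assumes full forwarding, so the only remaining stall is an instruction that reads a register loaded by the lw immediately before it; the tuple encoding and function name are hypothetical.

```python
def count_load_use_stalls(program):
    """program: list of (op, dest_reg or None, [source_regs]).

    Counts one stall for each instruction that reads the register
    loaded by the lw immediately before it.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    return stalls


original = [
    ("lw", "$t0", ["$t1"]),          # lw $t0, 0($t1)
    ("lw", "$t2", ["$t1"]),          # lw $t2, 4($t1)
    ("sw", None, ["$t2", "$t1"]),    # sw $t2, 0($t1)
    ("sw", None, ["$t0", "$t1"]),    # sw $t0, 4($t1)
]
# Swap the two stores so the use of $t2 moves one slot later.
reordered = [original[0], original[1], original[3], original[2]]

print(count_load_use_stalls(original), count_load_use_stalls(reordered))
```

The swap is safe because the two stores write different addresses, and it moves the use of $t2 far enough from its load that forwarding alone suffices.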

  • Dynamic Pipeline Scheduling

– Hardware chooses which instructions to execute next
– Will execute instructions out of order (e.g., doesn’t wait for a dependency to be resolved, but rather keeps going!)
– Speculates on branches and keeps the pipeline full (may need to roll back if the prediction is incorrect)

  • Trying to exploit instruction-level parallelism
slide-40
SLIDE 40

Advanced Pipelining

  • Increase the depth of the pipeline
  • Start more than one instruction each cycle (multiple issue), giving CPI < 1
  • Loop unrolling to expose more ILP (better scheduling)
  • “Superscalar” processors

– DEC Alpha 21264: 9-stage pipeline, 6-instruction issue
– Dynamic multiple issue: the processor dynamically chooses which instructions to execute in a given cycle while trying to avoid hazards.

  • All modern processors are superscalar and issue multiple instructions, usually with some limitations (e.g., different “pipes”)
  • VLIW: very long instruction word, static multiple issue (relies more on compiler technology for packing instructions and handling hazards)
  • This class has given you the background you need to learn more!
slide-41
SLIDE 41

Chapter 6 Summary

  • Pipelining does not improve latency, but does improve throughput
