Control (Branch) Hazards A: beqz R2, L1 B C D ------ L1: P - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Control (Branch) Hazards

A: beqz R2, L1
B
C
D
------
L1: P

Naïve (Lazy) Implementation of a Conditional Branch Instruction in the DLX Pipeline:

  • IF: Fetch the branch instruction from IM
  • ID: Decode the instruction and read the registers to be used in the comparison
  • EX: Determine the branch outcome: compare the source register values to zero (or to each other) and set FLAG in the output pipeline register
  • MEM: Compute the target address: (PC + 4) carried along with the instruction + sign-extended and shifted displacement
  • At end of clock cycle: PC is assigned the target address if the instruction now in MEM is a successful branch; otherwise execution continues as normal

Successful (or Taken) Branch: (i) the instruction is a branch and (ii) the branch condition is true
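The target-address and PC-update rules above can be sketched in a few lines of Python. This is only an illustration: the function names, the 16-bit displacement width, and the shift by 2 (word-aligned instructions) are assumptions made for the sketch, not details stated on the slide.

```python
def sign_extend(value: int, bits: int = 16) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    mask = (1 << bits) - 1
    value &= mask
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


def next_pc_beqz(pc: int, rs_value: int, disp16: int) -> int:
    """PC selected once a BEQZ fetched at address `pc` has been resolved."""
    pc_plus_4 = pc + 4                               # carried along with the instruction
    taken = (rs_value == 0)                          # EX: compare register to zero, set FLAG
    target = pc_plus_4 + (sign_extend(disp16) << 2)  # MEM: (PC + 4) + shifted displacement
    return target if taken else pc_plus_4            # PC mux: target only for a taken branch
```

For example, `next_pc_beqz(0x100, 0, 4)` selects the target `0x114`, while a non-zero register value leaves the fall-through address `0x104`.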

slide-2
SLIDE 2

Naive Implementation of Branch Equal Instruction: BEQZ Rs, d

[Figure: DLX datapath for BEQZ across the IF, ID, EX, MEM, and WB stages, showing the PC, IM, PC + 4 adder, decode logic and register file, sign extension (SE), shift left (<<), ALU with zero FLAG, ADD for the target address, and the AND gate and MUX that select the next PC. Annotations: "Branch outcome known" and "Compute target address".]

Outcome of the branch known at end of cycle 3. PC updated with new value (branch target or PC+4 value) at end of cycle 4.

slide-3
SLIDE 3

Control Hazard

[Figure: pipeline snapshots (IF, ID, EX, MEM, WB) at T = 1 to T = 4 following branch A through the pipeline. At T = 4 the next instruction is B (BRANCH NOT TAKEN) or P (TAKEN BRANCH).]

slide-4
SLIDE 4

Control (Branch) Hazards


Problems:

  • The target address of the branch is not known (at least) till instruction is decoded
  • What is the address of instruction P?
  • The outcome of the branch (taken/ not taken) is determined deep in the pipeline
  • Should we execute B or P after A?

What should the pipeline (processor) do after fetching the branch instruction?

SOLUTION 1

  • Delay the next instruction
  • Till we know the outcome of the branch and the address of the next instruction

Software: Add 3 NOPs after every branch instruction

Hardware: Hazard Detection Unit checks for a branch instruction in the ID, EX, or MEM stage and stalls the PC / inserts NOPs

slide-5
SLIDE 5

Simple Software Solution: Insert NOPs

[Figure: pipeline timing diagram over cycles 1 to 9 for A, three NOPs, and then B or P, each passing through IF, ID, EX, MEM, WB.]

A: beqz R2, label
NOP
NOP
NOP
B
label: P

Possible execution sequences:

  • Branch Not Taken: A, NOP, NOP, NOP, B
  • Branch Taken: A, NOP, NOP, NOP, P
  • Adds 3 cycles to execution time for every branch
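The software scheme amounts to a simple pass over the instruction stream. The sketch below is a hypothetical illustration: the `insert_nops` name, the list-of-strings program format, and the set of branch mnemonics are assumptions, not part of any real assembler.

```python
BRANCH_OPCODES = {"beqz", "bnez", "beq", "bne"}  # assumed mnemonics, for illustration only
DELAY_SLOTS = 3                                   # matches the 3-cycle naive pipeline


def insert_nops(program: list[str], slots: int = DELAY_SLOTS) -> list[str]:
    """Return a copy of `program` with `slots` NOPs after every branch."""
    out = []
    for instr in program:
        out.append(instr)
        body = instr.split(":", 1)[-1].strip()    # drop an optional "label:" prefix
        opcode = body.split()[0].lower() if body else ""
        if opcode in BRANCH_OPCODES:
            out.extend(["NOP"] * slots)           # pad until the new PC value can be set
    return out
```

Calling `insert_nops(["A: beqz R2, label", "B", "label: P"])` yields the branch followed by three NOPs, then `B` and `label: P`.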

slide-6
SLIDE 6

Control Hazard

[Figure: pipeline snapshots (IF, ID, EX, MEM, WB) at T = 1 to T = 4 with NOPs following branch A through the pipeline; P or B enters once the branch is resolved.]

slide-7
SLIDE 7

Hardware-Controlled Pipeline Stall

A: BEQ R1, R2, L1
B: ----
C: ---
L1: P

[Figure: pipeline timing diagram over cycles 1 to 9; the fetch (IF) is repeated while the pipeline is stalled behind A, after which P enters.]

Branch Taken: 3 Additional Cycles

slide-8
SLIDE 8

Hardware-Controlled Pipeline Stall

A: BEQ R1, R2, L1
B: ----
C: ---
L1: P

[Figure: pipeline timing diagram over cycles 1 to 9; the fetch (IF) is repeated while the pipeline is stalled behind A, after which B and C enter.]

Branch Not Taken: 3 Additional Cycles

slide-9
SLIDE 9

Hardware-Controlled Pipeline Stall

A: BEQ R1, R2, L1
B: ----
C: ---
L1: P

[Figure: pipeline timing diagram over cycles 1 to 9 for the optimized stall; the already-fetched B proceeds and C is fetched next.]

Optimized Branch Not Taken: PC gets address of C

slide-10
SLIDE 10

Hazard Detection Unit

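[Figure: Hazard Detection Unit (HDU) attached to the pipeline (IF, ID, EX, MEM, WB) and the PC. HDU outputs: "Freeze register: do not update" (to the PC) and "Insert NOP".]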

Stall PC and Insert NOP into IF/ID if there is a Branch instruction in either the IF/ID, ID/EX or EX/MEM pipeline register
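The stall condition can be expressed as a small predicate over the three pipeline registers. The following is a minimal sketch under stated assumptions: `PipelineRegs`, `hdu_stall`, and `count_stall_cycles` are hypothetical names, each register is modeled as just an opcode string, and only the straight-line (not taken) path is simulated.

```python
from dataclasses import dataclass

BRANCHES = {"beqz", "bnez", "beq", "bne"}  # assumed branch mnemonics


@dataclass
class PipelineRegs:
    if_id: str   # opcode latched in each pipeline register ("nop" if empty)
    id_ex: str
    ex_mem: str


def hdu_stall(regs: PipelineRegs) -> bool:
    """True if a branch sits in IF/ID, ID/EX, or EX/MEM: freeze PC, insert NOP."""
    return any(op.lower() in BRANCHES for op in (regs.if_id, regs.id_ex, regs.ex_mem))


def count_stall_cycles(opcodes: list[str]) -> int:
    """Feed straight-line opcodes through the front of the pipeline, counting stalls."""
    regs = PipelineRegs("nop", "nop", "nop")
    pc, stalls = 0, 0
    while pc < len(opcodes):
        stall = hdu_stall(regs)
        fetched = "nop" if stall else opcodes[pc]  # a stalled fetch is replaced by a NOP
        if stall:
            stalls += 1                            # PC frozen: refetch the same address
        else:
            pc += 1
        # every pipeline register advances by one stage each cycle
        regs = PipelineRegs(if_id=fetched, id_ex=regs.if_id, ex_mem=regs.id_ex)
    return stalls
```

With one branch followed by two other instructions, the model reports three stalled fetches, which agrees with the 3-cycle penalty of the non-optimized hardware stall.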

slide-11
SLIDE 11

Hardware Controlled Pipeline Stall

[Figure: pipeline snapshots (IF, ID, EX, MEM, WB) at T = 2 and T = 3: A advances while B is held in IF (Stall).]

  • Instruction B (address) held in PC register until A reaches WB stage
  • Internally generated NOPs propagated forward while B is stalled

A: BEQ R1, R2, L1
B:
C:
L1: P
slide-12
SLIDE 12

Hardware Controlled Pipeline Stall

[Figure: pipeline snapshots (IF, ID, EX, MEM, WB) at T = 4 and T = 5 for three cases, TAKEN BRANCH, BRANCH NOT TAKEN, and OPTIMIZED, showing which of B, C, or P follows A.]

slide-13
SLIDE 13

Branch Delay Slots

Branch Delay slots:

  • Delay introduced by software to avoid control hazards
  • Dummy instructions following the branch instruction, for the purpose of creating delays till the new PC value can be set
  • Instructions in the Branch Delay slots are always executed
  • In our design: 3 Branch Delay Slots
  • Microarchitecture might choose not to expose all the delay slots and use some hardware mechanisms for providing the remaining delay

Software Solution:

  • Software must delay the execution of the next-in-line instruction after the Branch
  • Delay depends on the pipeline structure
  • Microarchitecture is exposed to the software (compiler)

slide-14
SLIDE 14

Performance of Simple Stall Based Schemes

Stall scheme has a branch penalty of 3 cycles (may be 2 in an optimized hardware design):

  1. Software-inserted NOPs (3 cycles)
  2. Hardware-inserted stall cycles (3, non-optimized)

Example: Suppose the branch frequency is 20% and 60% of branches are taken. Assume the software solution with penalties as above. Assume the compiler is able to fill 20% of the Branch Delay slots with useful instructions. How is CPI affected?

Each branch instruction incurs an extra delay of 3 cycles, except for the delay slots filled with useful instructions. The 60% taken rate does not enter the calculation, because the delay slots are executed whether or not the branch is taken.

Branch Penalty (per executed instruction) = 20% x 3 (delay slots) x 80% (unfilled delay slots) = 0.48 cycles

CPI = Nominal CPI + Penalty Cycles (per instruction)

Assuming no other causes of stalls: CPI = 1.0 + 0.48 = 1.48
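The arithmetic in this example can be checked with a short calculation. The function name and parameter names below are illustrative assumptions; the formula itself is the one used in the example (branch frequency x delay slots x unfilled fraction).

```python
def cpi_with_delay_slots(branch_freq: float, delay_slots: int,
                         fill_rate: float, base_cpi: float = 1.0) -> float:
    """CPI when every branch is followed by `delay_slots` slots, of which a
    fraction `fill_rate` hold useful instructions and the rest are NOPs."""
    penalty = branch_freq * delay_slots * (1.0 - fill_rate)  # wasted cycles per instruction
    return base_cpi + penalty


# Numbers from the example: 20% branches, 3 delay slots, 20% of slots filled.
example_cpi = cpi_with_delay_slots(0.20, 3, 0.20)  # approximately 1.48
```

Note that the taken/not-taken split is not a parameter: in this scheme the penalty is paid on every branch.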

slide-15
SLIDE 15

Alternate Hardware Solution

A: beqz R2, label
B
C
D
E
label: P

[Figure: pipeline timing diagram over cycles 1 to 9: A, B, C, D, E enter the pipeline in consecutive cycles, each passing through IF, ID, EX, MEM, WB.]

  • Why delay in-line instructions B, C, D etc?
  • Let instructions following A enter pipeline normally

Works if Branch Not Taken!


slide-16
SLIDE 16

Control Hazard

[Figure: pipeline snapshots (IF, ID, EX, MEM, WB) at T = 1 to T = 4 showing the in-line instructions B, C, D, E entering the pipeline behind A.]

BRANCH NOT TAKEN: No Stall Cycles

slide-17
SLIDE 17

Speculation: Alternate Hardware Solution

A: beqz R2, label
B
C
D
E
label: P

[Figure: pipeline timing diagram over cycles 1 to 9: B, C, D enter behind A, and P is fetched after the branch is resolved.]

What if Branch is Taken?

  • B, C, D have not updated machine state at cycle 4
  • Flush B, C, D at end of cycle 4

slide-18
SLIDE 18

Control Hazard

[Figure: pipeline snapshots (IF, ID, EX, MEM, WB) at T = 1 to T = 4: B, C, D enter behind A; at T = 4, P is fetched while B, C, D are still in the pipeline.]

TAKEN BRANCH

slide-19
SLIDE 19

Control Hazard

[Figure: pipeline snapshots (IF, ID, EX, MEM, WB) at T = 4 to T = 7: B, C, D continue through the pipeline ahead of the target instructions P, Q, R, S.]

TAKEN BRANCH: WRITES to MEM or REG by B, C, or D will result in error

slide-20
SLIDE 20

Alternate Hardware Solution

[Figure: pipeline snapshots (IF, ID, EX, MEM, WB) at T = 2 to T = 4: B, C, D follow A into the pipeline; once the branch is found to be taken, P is fetched and the trailing instructions are replaced by NOPs.]

Taken Branch: Insert NOP in IF/ID, ID/EX, EX/MEM
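The flush can be modeled as one pipeline step taken at the end of the cycle in which the branch is in MEM. This is a minimal sketch with assumed names (`STAGES`, `advance`) and a dictionary that maps each stage to the instruction it currently holds; it is not a model of the actual pipeline registers.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def advance(stages: dict[str, str], taken: bool,
            next_inline: str, target: str) -> dict[str, str]:
    """Occupancy of each stage in the cycle after the branch leaves MEM."""
    nxt = {}
    for earlier, later in zip(STAGES, STAGES[1:]):
        nxt[later] = stages[earlier]       # every instruction moves one stage forward
    if taken:
        for stage in ("ID", "EX", "MEM"):  # NOPs written into IF/ID, ID/EX, EX/MEM
            nxt[stage] = "NOP"
        nxt["IF"] = target                 # PC was loaded with the target address
    else:
        nxt["IF"] = next_inline            # speculation was correct: keep going
    return nxt


# Branch A in MEM with the three trailing in-line instructions behind it.
cycle4 = {"IF": "D", "ID": "C", "EX": "B", "MEM": "A", "WB": "-"}
after_taken = advance(cycle4, taken=True, next_inline="E", target="P")
```

In `after_taken`, `A` has moved to WB, the three trailing slots hold `NOP`, and `P` is in IF, which corresponds to the 3 penalty cycles of a mispredicted branch in this design.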

slide-21
SLIDE 21

Branch Penalty in Modified Hardware Scheme

  • More than an optimized implementation of stall
  • Simple form of control speculation
  • Speculating it is a NOT TAKEN Branch
  • Continue fetching in-line instructions
  • Performance depends on accuracy of speculation
  • Speculation correct (NOT TAKEN Branch): Continue with no stalls (0 Penalty Cycles)
  • Speculation incorrect (TAKEN Branch): Flush 3 trailing instructions (3 Penalty Cycles)

Example:

  • Branch Frequency: 20%
  • 5% of Branches are Unconditional Branches
  • 70% of Conditional Branches are NOT TAKEN

CPI = Nominal CPI + Penalty Cycles for TAKEN Branch + Penalty Cycles for NOT TAKEN Branch

Penalty Cycles for NOT TAKEN Branch = 0 (speculation correct)

Penalty Cycles for TAKEN Branch = Penalty Cycles for UNCONDITIONAL Branch + Penalty Cycles for TAKEN CONDITIONAL Branch = 20% x 5% x 3 + 20% x 95% x 30% x 3 = 0.03 + 0.171 = 0.201

CPI = 1.0 + 0.201 = 1.201
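The same calculation as a small function, so the inputs can be varied. The names are illustrative assumptions; the model is the one in the example, where every taken branch (unconditional or conditional) flushes 3 trailing instructions and not-taken branches cost nothing.

```python
def cpi_predict_not_taken(branch_freq: float, uncond_frac: float,
                          not_taken_frac: float, flush_penalty: int = 3,
                          base_cpi: float = 1.0) -> float:
    """CPI for the speculate-not-taken scheme with a fixed flush penalty."""
    # Unconditional branches are always taken, so they always pay the flush.
    uncond = branch_freq * uncond_frac * flush_penalty
    # Conditional branches pay only when the not-taken speculation is wrong.
    taken_cond = (branch_freq * (1.0 - uncond_frac)
                  * (1.0 - not_taken_frac) * flush_penalty)
    return base_cpi + uncond + taken_cond


# Numbers from the example: 20% branches, 5% unconditional, 70% not taken.
example_cpi = cpi_predict_not_taken(0.20, 0.05, 0.70)  # approximately 1.201
```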

slide-22
SLIDE 22

Predict branch: Not Taken; Actually Not Taken

[Figure: pipeline snapshots (IF, ID, EX, MEM, WB) at T = 1 to T = 4 showing the in-line instructions B, C, D, E entering the pipeline behind A.]

BRANCH NOT TAKEN: No Stall Cycles. DO NOTHING.

slide-23
SLIDE 23

Predict branch: Not Taken; Actually Taken

[Figure: pipeline snapshots (IF, ID, EX, MEM, WB) at T = 1 to T = 4: B, C, D enter behind A; at T = 4, P is fetched.]

Branch actually taken: FLUSH pipeline. Make B, C, D NOPs.

slide-24
SLIDE 24

More Control Speculation

Can we predict branch as taken?

  • Speculatively fetch and execute instructions at the branch target
  • Useful only if target address is known earlier than branch outcome
  • May require stall cycles until target address known
  • Flush pipeline if prediction is incorrect
  • Must ensure that flushed instructions do not update any machine state
  • Assume that target address is computed in the ID stage
  • Stall of 1 cycle till PC updated with target address (ALWAYS!)
  • Assume branch outcome known at the end of cycle 3 in EX stage


slide-25
SLIDE 25

Predict branch taken

[Figure: pipeline snapshots (IF, ID, EX, MEM, WB) at T = 1 to T = 4: B is fetched behind A, followed by P, Q, R from the branch target.]

Branch actually taken: Single stall cycle

slide-26
SLIDE 26

Predict branch taken

[Figure: pipeline snapshots (IF, ID, EX, MEM, WB) at T = 1 to T = 4: B is fetched behind A, then P from the branch target; after the branch resolves as not taken, B and then C are fetched.]

Branch actually not taken: 2 wasted cycles. FLUSH pipeline: make P a NOP.

slide-27
SLIDE 27

More Control Speculation (contd …)

Reduce branch delay (from 3 cycles of first design) to 1 or 2

Early Branch Detection hardware to compute:

  • Target address: Easy to move to ID stage
  • Branch outcome: Easy to move to EX stage

Predict Not Taken:

  • Actually Not Taken: No stalls
  • Actually Taken: 2 cycles

Predict Taken:

  • Actually Taken: 1 cycle
  • Actually Not Taken: 2 cycles
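The four penalty values can be derived by listing what the IF stage fetches in cycles 1 to 4. The sketch below is a hypothetical illustration that assumes, as on the slides, that the target address is available at the end of cycle 2 (ID) and the outcome at the end of cycle 3 (EX); `speculate` and the instruction names A, B, C, D, P, Q are just labels.

```python
def speculate(predict_taken: bool, actually_taken: bool) -> tuple[list[str], list[str]]:
    """Return (instructions fetched in cycles 1..4, fetches that are squashed)."""
    fetched = ["A", "B"]        # cycle 2 always fetches B: target not yet known
    fetched.append("P" if predict_taken else "C")    # cycle 3 follows the prediction
    if predict_taken:
        # B is discarded as soon as the PC is redirected to the target.
        squashed = ["B"] if actually_taken else ["B", "P"]
        fetched.append("Q" if actually_taken else "B")  # refetch B on misprediction
    else:
        squashed = ["B", "C"] if actually_taken else []
        fetched.append("P" if actually_taken else "D")
    return fetched, squashed


def penalty(predict_taken: bool, actually_taken: bool) -> int:
    """Branch penalty in cycles = number of squashed fetches."""
    return len(speculate(predict_taken, actually_taken)[1])
```

Under these assumptions `penalty` returns 0 and 2 for predict-not-taken (actually not taken, actually taken) and 1 and 2 for predict-taken (actually taken, actually not taken).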

slide-28
SLIDE 28

More Control Speculation

Instruction mix:

  • 16% of instructions were conditional branches
  • 4% of instructions were unconditional branches
  • 62% of conditional branches were taken

Predict Branch Taken:

  • Branch Actually Taken: 1 cycle penalty
  • Branch Actually Not Taken: 2 cycle penalty

CPI = 1 + 16% x 62% x 1 (Taken Conditional Branch) + 16% x 38% x 2 (Not Taken Conditional Branch) + 4% x 1 (Unconditional Branch, Taken) = 1.26

Predict Not Taken:

  • Branch Actually Taken: 2 cycle penalty
  • Branch Actually Not Taken: 0 cycle penalty

CPI = 1 + 16% x 62% x 2 + 4% x 1 = 1.24

The unconditional branch costs only 1 cycle in both schemes because the hardware knows it is TAKEN and only has to wait for the target address.
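Both CPI figures can be recomputed from the penalty table. The sketch below uses assumed names (`PENALTY`, `UNCONDITIONAL_PENALTY`, `cpi`) and charges unconditional branches 1 cycle under either prediction, as in the formulas on the slide.

```python
# (prediction, actual outcome) -> penalty cycles, from the slide's table
PENALTY = {
    ("taken", "taken"): 1,
    ("taken", "not_taken"): 2,
    ("not_taken", "taken"): 2,
    ("not_taken", "not_taken"): 0,
}
UNCONDITIONAL_PENALTY = 1  # known to be taken; wait one cycle for the target address


def cpi(predict: str, cond_freq: float, uncond_freq: float,
        taken_frac: float, base_cpi: float = 1.0) -> float:
    """CPI for a static prediction policy ('taken' or 'not_taken')."""
    cond = cond_freq * (taken_frac * PENALTY[(predict, "taken")]
                        + (1.0 - taken_frac) * PENALTY[(predict, "not_taken")])
    return base_cpi + cond + uncond_freq * UNCONDITIONAL_PENALTY


cpi_taken = cpi("taken", 0.16, 0.04, 0.62)          # 1 + 0.0992 + 0.1216 + 0.04
cpi_not_taken = cpi("not_taken", 0.16, 0.04, 0.62)  # 1 + 0.1984 + 0.04
```

With this instruction mix the two policies come out at roughly 1.26 (predict taken) and 1.24 (predict not taken).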

slide-29
SLIDE 29

Summary: Control (Branch) Hazard

  • Do branch resolution (outcome and target address) early
  • Stall pipeline for required number of cycles

Methods that reduce the branch penalty

1. Prediction

  • Speculatively predict branch outcome
  • Recover if mispredicted

2. Delayed Branch

  • Always execute instruction(s) following the branch.
  • Compiler tries to fill the branch delay slot(s) with useful instructions.

slide-30
SLIDE 30

Summary: Control (Branch) Hazard

Default pipeline: Fetch and execute in-line instructions following a branch

  • Incorrect if the branch is actually taken

Solutions:

Delay the actual instruction following a branch:

Software: Exposed Branch Delay Slots

  • Insert NOP instructions (3 for our pipeline)
  • (Optimization) Move useful instructions into the Delay Slot

Microarchitecture: Insert stall cycles in the pipeline dynamically

  • Freeze the PC, insert NOP into the IF/ID Pipeline Register

Optimizations:

  • Do branch resolution (outcome and target address) early
  • Compute Target Address in ID stage
  • PC can be updated with target address at end of cycle 2
  • Compute Branch Outcome in EX stage (or even ID stage!)

slide-31
SLIDE 31

Summary: Control (Branch) Hazard

Optimizations (contd …)

Use Branch Prediction to reduce Branch Penalty

  • Speculatively predict branch outcome
  • Recover if mispredicted

Predict Branch Not Taken:

  • No penalty if Branch is actually not taken
  • If Branch actually taken: flush the pipeline of the in-line instructions that are already in the pipeline, and update PC with the Target Address

Predict Branch Taken:

  • Compute Target Address in ID stage and update PC
  • Requires 1 stall cycle to compute the address
  • Compute Branch Outcome in (say) EX stage
  • If Branch Actually Taken (prediction is correct): do nothing
  • If Branch Actually Not Taken (misspeculation): flush the pipeline of the instruction at the target that has entered the pipeline, and update PC with the in-line instruction address