SLIDE 1

BRANCH PREDICTORS

CS/ECE 6810: Computer Architecture

Mahdi Nazm Bojnordi

Assistant Professor
School of Computing
University of Utah

SLIDE 2

Overview

¨ Announcements

¤ Homework 2 release: Sept. 26th

¨ This lecture

¤ Dynamic branch prediction
¤ Counter-based branch predictor
¤ Correlating branch predictor
¤ Global vs. local branch predictors

SLIDE 3

Big Picture: Why Branch Prediction?

¨ Problem: performance is mainly limited by the number of instructions fetched per second
¨ Solution: deeper and wider frontend
¨ Challenge: handling branch instructions

SLIDE 4

Big Picture: How to Predict Branch?

¨ Static prediction (based on direction or profile)

¤ Always not-taken

▪ Target = next PC

¤ Always taken

▪ Target = unknown

¨ Dynamic prediction

¤ Special hardware using PC

[Figure: fetch-stage datapath; labels: Inst. Memory, PC + 4, NPC, Instruction, target, direction, clk]

SLIDE 5

Recall: Dynamic Branch Prediction

¨ Hardware unit capable of learning at runtime

¤ 1. Prediction logic

▪ Direction (taken or not-taken)
▪ Target address (where to fetch next)

¤ 2. Outcome validation and training

▪ Outcome is computed regardless of prediction

¤ 3. Recovery from misprediction

▪ Nullify the effect of instructions on the wrong path

SLIDE 7

Branch Prediction

¨ Goal: avoiding stall cycles caused by branches
¨ Solution: static or dynamic branch predictor

¤ 1. prediction
¤ 2. validation and training
¤ 3. recovery from misprediction

¨ Performance is influenced by the frequency of branches (b), prediction accuracy (a), and misprediction cost (c)

$$\text{Speedup} = \frac{\text{Old Time}}{\text{New Time}} = \frac{CPI_{old}}{CPI_{new}} = \frac{1 + bc}{1 + (1 - a)\,bc}$$

SLIDE 9

Problem

¨ A pipelined processor requires 3 stall cycles to compute the outcome of every branch before fetching the next instruction. Due to perfect forwarding/bypassing, no stall cycles are required for data/structural hazards. Every 5th instruction is a branch.

¤ Compute the speedup gained by a branch predictor with 90% accuracy

With b = 0.2, a = 0.9, and c = 3:

Speedup = (1 + 0.2×3) / (1 + 0.1×0.2×3) = 1.6 / 1.06 ≈ 1.51
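
The calculation above can be checked with a few lines of C. This is a minimal sketch of the speedup formula from the lecture; the function name `branch_speedup` is made up for illustration, and the baseline is assumed to stall c cycles on every branch.

```c
/* Speedup of a branch predictor over a pipeline that stalls on every branch:
 *   (1 + b*c) / (1 + (1 - a)*b*c)
 * b = fraction of instructions that are branches
 * a = prediction accuracy (0.0 to 1.0)
 * c = misprediction cost in stall cycles
 * The name branch_speedup is hypothetical, chosen for this sketch. */
double branch_speedup(double b, double a, double c)
{
    double cpi_old = 1.0 + b * c;              /* every branch stalls c cycles */
    double cpi_new = 1.0 + (1.0 - a) * b * c;  /* only mispredictions stall */
    return cpi_old / cpi_new;
}
```

Calling `branch_speedup(0.2, 0.9, 3.0)` reproduces the problem's numbers (1.6 / 1.06). With a = 0 the predictor never helps and the speedup is 1.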

SLIDE 14

Bimodal Branch Predictors

¨ One-bit branch predictor

¤ Keep track of and use the outcome of the last branch

while (1) {
    for (i = 0; i < 10; i++) { }   /* branch-1 */
    for (j = 0; j < 20; j++) { }   /* branch-2 */
}

[Figure: two-state diagram with states N (predict not-taken) and T (predict taken); a taken outcome leads to T and a not-taken outcome leads to N]

¨ Shared predictor
¨ Two mispredictions per loop

Accuracy = 26/30 ≈ 0.87

How to improve?

SLIDE 18

Bimodal Branch Predictors

¨ Two-bit branch predictor

¤ Increment if taken
¤ Decrement if not taken

while (1) {
    for (i = 0; i < 10; i++) { }   /* branch-1 */
    for (j = 0; j < 20; j++) { }   /* branch-2 */
}

[Figure: four-state diagram of a two-bit counter with states 00, 01, 10, 11; taken outcomes move toward 11 and not-taken outcomes move toward 00]

  • One misprediction on loop exit
  • Accuracy = 28/30 ≈ 0.93
  • How to improve?
  • 3-bit predictor?
  • Problem?
  • A single predictor shared among many branches
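
The accuracy figures for the shared one-bit and two-bit predictors can be reproduced with a small simulation. The sketch below makes two assumptions that the slides do not state: each for-loop compiles to a single backward branch that is taken on every iteration except the last, and a counter predicts taken when it is in its upper half. The function names are made up for illustration.

```c
/* Outcome k of the 30 branches executed in one pass of the outer loop.
 * Assumption: each for-loop is one loop-bottom branch, taken except on exit. */
static int outcome(int k)
{
    if (k < 10) return k != 9;   /* branch-1: 9 taken, then not-taken */
    return k != 29;              /* branch-2: 19 taken, then not-taken */
}

/* Correct predictions in one steady-state pass for a single shared
 * saturating counter of the given width. A 1-bit counter is the same
 * as "predict the last outcome". */
int correct_per_pass(int bits)
{
    int max = (1 << bits) - 1;
    int threshold = 1 << (bits - 1);   /* predict taken when counter >= threshold */
    int counter = 0, correct = 0;

    for (int pass = 0; pass < 4; pass++) {   /* early passes warm up the counter */
        correct = 0;
        for (int k = 0; k < 30; k++) {
            int taken = outcome(k);
            int predict = counter >= threshold;
            if (predict == taken) correct++;
            if (taken) { if (counter < max) counter++; }
            else       { if (counter > 0)   counter--; }
        }
    }
    return correct;   /* count for the final pass */
}
```

Under these assumptions `correct_per_pass(1)` gives 26 and `correct_per_pass(2)` gives 28 out of 30. A 3-bit counter also gives 28, since the two loop exits are still mispredicted, which suggests that widening the counter alone does not help this pattern.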

SLIDE 22

Using Multiple Counters

¨ How to assign a branch to each counter?

[Figure: the a-bit PC of each branch in the program code (branch-1, branch-2, branch-3) selects one of the n-bit counters]

  • 1. How many branches are in a program?
  • 2. How many counters are used?

Cost = $n \cdot 2^{a}$ bits

SLIDE 25

Using Multiple Counters

¨ How to assign a branch to each counter?

¤ Decode History Table (DHT)

▪ Reduced HW with aliasing

[Figure: b bits of the PC select one of the n-bit counters; branch-1, branch-2, and branch-3 in the program code share the table]

Least significant bits are used to select a counter
(+) Reduced hardware
(⎼) Branch aliasing

Cost = $n \cdot 2^{b}$ bits
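
A minimal sketch of such an untagged table follows. The sizes B = 4 and N = 2 are illustrative values, the type and function names are made up, and the rule "predict taken when the counter is in its upper half" is an assumption rather than something the slide specifies.

```c
#include <stdint.h>

#define B 4                        /* index bits b (illustrative value) */
#define N 2                        /* bits per counter n (illustrative value) */
#define ENTRIES (1u << B)
#define COUNTER_MAX ((1u << N) - 1)

typedef struct {
    uint8_t counter[ENTRIES];      /* 2^b untagged n-bit counters */
} Dht;

/* The b least significant PC bits select the counter. */
unsigned dht_index(uint32_t pc)
{
    return pc & (ENTRIES - 1);
}

/* Assumed rule: predict taken when the counter is in its upper half. */
int dht_predict(const Dht *t, uint32_t pc)
{
    return t->counter[dht_index(pc)] >= (1u << (N - 1));
}

/* Saturating update: increment if taken, decrement otherwise. */
void dht_update(Dht *t, uint32_t pc, int taken)
{
    uint8_t *c = &t->counter[dht_index(pc)];
    if (taken) { if (*c < COUNTER_MAX) (*c)++; }
    else       { if (*c > 0) (*c)--; }
}

/* Storage cost: n * 2^b bits. */
unsigned dht_cost_bits(void)
{
    return N * ENTRIES;
}
```

With these sizes, branches at 0x1004 and 0x2004 map to the same counter, so training one changes the prediction for the other. That is the aliasing the slide lists as the drawback.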

SLIDE 28

Using Multiple Counters

¨ How to assign a branch to each counter?

¤ Decode History Table (DHT)

▪ Reduced HW with aliasing

¤ Branch History Table (BHT)

▪ Precisely tracking branches

[Figure: b PC bits index a table of tags and n-bit counters; the stored tag is compared (=) with the remaining a-b PC bits to produce hit/miss*]

Most significant bits are used as tags
(+) No aliasing
(⎼) Missing entries

Cost = $(a - b + n) \cdot 2^{b}$ bits
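
The tagged lookup could look like the sketch below. Several details are assumptions that the slide does not give: the sizes A = 32, B = 4, N = 2, the valid bits (not counted in the slide's cost formula), and the policy of allocating an entry in a weak state on a miss. All names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define A 32                       /* PC bits a (illustrative value) */
#define B 4                        /* index bits b (illustrative value) */
#define N 2                        /* bits per counter n (illustrative value) */
#define ENTRIES (1u << B)

typedef struct {
    bool     valid[ENTRIES];       /* assumption: not part of the slide's cost */
    uint32_t tag[ENTRIES];         /* upper a-b PC bits */
    uint8_t  counter[ENTRIES];     /* n-bit saturating counters */
} Bht;

static unsigned bht_index(uint32_t pc) { return pc & (ENTRIES - 1); }
static uint32_t bht_tag(uint32_t pc)   { return pc >> B; }

/* Returns true on a hit and stores the predicted direction in *taken. */
bool bht_lookup(const Bht *t, uint32_t pc, bool *taken)
{
    unsigned i = bht_index(pc);
    if (!t->valid[i] || t->tag[i] != bht_tag(pc)) return false;   /* miss */
    *taken = t->counter[i] >= (1u << (N - 1));
    return true;
}

void bht_update(Bht *t, uint32_t pc, bool taken)
{
    unsigned i = bht_index(pc);
    if (!t->valid[i] || t->tag[i] != bht_tag(pc)) {
        /* Miss: allocate the entry in a weak state (assumed policy). */
        t->valid[i] = true;
        t->tag[i] = bht_tag(pc);
        t->counter[i] = taken ? (1u << (N - 1)) : (1u << (N - 1)) - 1;
        return;
    }
    if (taken) { if (t->counter[i] < (1u << N) - 1) t->counter[i]++; }
    else       { if (t->counter[i] > 0) t->counter[i]--; }
}

/* Storage cost: (a - b + n) * 2^b bits. */
unsigned bht_cost_bits(void) { return (A - B + N) * ENTRIES; }
```

Here a branch at 0x2004 misses after 0x1004 has been installed, even though both use the same index: the tag removes aliasing at the price of entries that may be missing.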

SLIDE 31

Using Multiple Counters

¨ How to assign a branch to each counter?

¤ Decode History Table (DHT)

▪ Reduced HW with aliasing

¤ Branch History Table (BHT)

▪ Precisely tracking branches

¤ Combined BHT and DHT

▪ BHT is used on a hit
▪ DHT is used/updated on a miss

[Figure: the PC indexes both the DHT and the BHT; the BHT tag is compared (=) with the remaining a-b PC bits; each table holds n-bit counters]

Cost = $(a - b + 2n) \cdot 2^{b}$ bits

DHT typically has more entries than BHT

SLIDE 34

Correlating Branch Predictor

¨ Executed branches of a program stream may be correlated

while (1) {
    if (x == 0) y = 0;   /* branch-1 */
    …
    if (y == 0) x = 1;   /* branch-2 */
}

while: BNEQ R1, R0, skp1
       ADDI R2, R0, #0
skp1:  ...
       BNEQ R2, R0, skp2
       ADDI R1, R0, #1
skp2:  J while

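SLIDE 37

Correlating Branch Predictor

¨ Executed branches of a program stream may be correlated

while (1) {
    if (x == 0) y = 0;   /* branch-1 */
    …
    if (y == 0) x = 1;   /* branch-2 */
}

Global History Register: an r-bit shift register that maintains outcome history

[Figure: each branch outcome (taken?) is shifted into the r-bit register; the r history bits together with b PC bits select one of the n-bit counters]

Cost = $r + n \cdot 2^{b+r}$ bits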
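
One way to read this organization is a counter table indexed by the concatenation of b PC bits and the r history bits, which matches the $2^{b+r}$ term in the cost. The sketch below follows that reading; the sizes, the names, and the threshold rule for predicting taken are assumptions.

```c
#include <stdint.h>

#define B 4                          /* PC index bits b (illustrative value) */
#define R 2                          /* history bits r (illustrative value) */
#define N 2                          /* bits per counter n (illustrative value) */

typedef struct {
    unsigned ghr;                    /* r-bit global history shift register */
    uint8_t  counter[1u << (B + R)]; /* 2^(b+r) n-bit counters */
} Correlating;

/* Concatenate b PC bits with the r history bits. */
static unsigned corr_index(const Correlating *p, uint32_t pc)
{
    unsigned pc_bits = pc & ((1u << B) - 1);
    return (pc_bits << R) | p->ghr;
}

int corr_predict(const Correlating *p, uint32_t pc)
{
    return p->counter[corr_index(p, pc)] >= (1u << (N - 1));
}

void corr_update(Correlating *p, uint32_t pc, int taken)
{
    uint8_t *c = &p->counter[corr_index(p, pc)];
    if (taken) { if (*c < (1u << N) - 1) (*c)++; }
    else       { if (*c > 0) (*c)--; }
    /* Shift the newest outcome into the history register. */
    p->ghr = ((p->ghr << 1) | (taken != 0)) & ((1u << R) - 1);
}

/* Storage cost: r + n * 2^(b+r) bits. */
unsigned corr_cost_bits(void) { return R + N * (1u << (B + R)); }
```

A branch that strictly alternates between taken and not-taken defeats a plain two-bit counter, but after a short warm-up this sketch predicts it correctly, because each history value selects its own counter.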

SLIDE 40

Global Branch Predictor

¨ GHR is merged with PC bits to choose a counter

[Figure: b PC bits and the r-bit GHR are combined by XOR to index the shared n-bit counters]

Cost = $r + n \cdot 2^{\max\{b, r\}}$ bits
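
The XOR merge and its cost can be sketched as two small helpers. The function names are hypothetical, and b and r are assumed to be smaller than 32 so the shifts are well defined.

```c
#include <stdint.h>

/* XOR-merge b PC bits with an r-bit history into a max(b, r)-bit index.
 * Assumes b < 32 and r < 32. */
unsigned global_index(uint32_t pc, unsigned ghr, unsigned b, unsigned r)
{
    unsigned pc_bits = pc & ((1u << b) - 1);
    unsigned history = ghr & ((1u << r) - 1);
    return pc_bits ^ history;
}

/* Storage cost: r history bits plus 2^max(b, r) counters of n bits each. */
unsigned global_cost_bits(unsigned n, unsigned b, unsigned r)
{
    unsigned width = b > r ? b : r;
    return r + n * (1u << width);
}
```

For n = 2, b = 4, r = 2 this gives 34 bits, compared with 130 bits when the same PC and history bits are concatenated. The saving comes from folding the history into the PC bits instead of widening the index.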

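SLIDE 42

Local Branch Predictor

¨ One history register per branch

[Figure: b PC bits select one of the r-bit history registers; the selected history indexes the shared n-bit counters]

Cost = $r \cdot 2^{b} + n \cdot 2^{r}$ bits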

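SLIDE 45

Local Branch Predictor

¨ One history register per branch

[Figure: b PC bits select one of the r-bit history registers; the selected history is XORed with the PC bits to index the n-bit predictors]

Cost = $r \cdot 2^{b} + n \cdot 2^{\max\{r, b\}}$ bits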
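
A two-level local predictor can be sketched as follows. This version implements the simpler organization, where the selected history indexes the shared counters directly (cost $r \cdot 2^{b} + n \cdot 2^{r}$), not the XOR variant. Sizes, names, and the threshold rule are assumptions made for illustration.

```c
#include <stdint.h>

#define B 4                        /* PC index bits b (illustrative value) */
#define R 3                        /* history bits r (illustrative value) */
#define N 2                        /* bits per counter n (illustrative value) */

typedef struct {
    uint8_t history[1u << B];      /* 2^b local history registers, r bits each */
    uint8_t counter[1u << R];      /* 2^r shared n-bit counters */
} Local;

int local_predict(const Local *p, uint32_t pc)
{
    unsigned h = p->history[pc & ((1u << B) - 1)];   /* level 1: per-branch history */
    return p->counter[h] >= (1u << (N - 1));         /* level 2: counter for that history */
}

void local_update(Local *p, uint32_t pc, int taken)
{
    unsigned slot = pc & ((1u << B) - 1);
    unsigned h = p->history[slot];
    uint8_t *c = &p->counter[h];
    if (taken) { if (*c < (1u << N) - 1) (*c)++; }
    else       { if (*c > 0) (*c)--; }
    /* Shift the newest outcome into this branch's history. */
    p->history[slot] = (uint8_t)(((h << 1) | (taken != 0)) & ((1u << R) - 1));
}

/* Storage cost: r * 2^b + n * 2^r bits. */
unsigned local_cost_bits(void) { return R * (1u << B) + N * (1u << R); }
```

With r = 3, a loop branch that repeats taken, taken, not-taken is predicted correctly after warm-up, including the loop exit, because the three history values seen in steady state each map to a different counter.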

SLIDE 46

Tournament Branch Predictor

¨ Local predictor may work well for some applications, while global predictor works well for some other programs

¤ Include both and identify/use the best one for each branch

[Figure: the PC feeds a local predictor and a global predictor; a tournament predictor built from two-bit saturating counters selects which prediction drives the output]
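
The selection logic could be sketched as a table of two-bit saturating chooser counters. Indexing the chooser by PC bits, the encoding "2 or 3 means use the global predictor", and training only when the two predictors disagree are common choices assumed here; the slide does not spell them out. Predictions and outcomes are expected to be 0 or 1.

```c
#include <stdint.h>

#define B 4   /* chooser index bits (illustrative value) */

typedef struct {
    uint8_t chooser[1u << B];   /* two-bit counters; >= 2 selects the global predictor */
} Tournament;

/* Pick the prediction of whichever component the chooser currently favors. */
int tournament_predict(const Tournament *t, uint32_t pc,
                       int local_pred, int global_pred)
{
    return t->chooser[pc & ((1u << B) - 1)] >= 2 ? global_pred : local_pred;
}

/* Move the chooser toward the component that was right, but only when
 * the two components disagreed (agreement carries no information). */
void tournament_update(Tournament *t, uint32_t pc,
                       int local_pred, int global_pred, int taken)
{
    uint8_t *c = &t->chooser[pc & ((1u << B) - 1)];
    if (local_pred == global_pred) return;
    if (global_pred == taken) { if (*c < 3) (*c)++; }
    else                      { if (*c > 0) (*c)--; }
}
```

Starting from zero, a chooser entry follows the local predictor until the global predictor has been right twice while the local one was wrong.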

SLIDE 47

Branch Prediction Summary

¨ Dedicated predictor per branch

¤ Program counter is used for assigning predictors to branches

¨ Capturing correlation among branches

¤ Shift register is used to track history

¨ Predicting branch direction is not enough

¤ Which instruction should be fetched if taken?

¨ Storing the target instruction can eliminate fetching

¤ Extra hardware is required

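SLIDE 49

Branch Target Buffer

¨ Store tags and target addresses for each branch

[Figure: the PC indexes a table of valid bit (V), tag, and target; the stored tag is compared (=) with the PC and ANDed with V to produce hit/miss*, and the entry supplies the target address]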
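
The hit logic in the figure (valid AND tag match) can be sketched as below. The direct-mapped organization, the size B = 4, and all names are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define B 4   /* index bits (illustrative value) */

typedef struct {
    bool     valid;    /* V */
    uint32_t tag;      /* upper PC bits of the branch */
    uint32_t target;   /* address to fetch from if the branch is taken */
} BtbEntry;

typedef struct {
    BtbEntry entry[1u << B];   /* assumption: direct-mapped table */
} Btb;

/* Hit = V AND (stored tag == PC tag). On a hit, *target receives the
 * stored target address; on a miss, *target is left unchanged. */
bool btb_lookup(const Btb *btb, uint32_t pc, uint32_t *target)
{
    const BtbEntry *e = &btb->entry[pc & ((1u << B) - 1)];
    bool hit = e->valid && e->tag == (pc >> B);
    if (hit) *target = e->target;
    return hit;
}

/* Install or overwrite the entry for a branch once its target is known. */
void btb_insert(Btb *btb, uint32_t pc, uint32_t target)
{
    BtbEntry *e = &btb->entry[pc & ((1u << B) - 1)];
    e->valid = true;
    e->tag = pc >> B;
    e->target = target;
}
```

A lookup for a branch that shares the index but not the tag of an installed entry reports a miss, so the fetch unit would fall back to PC + 4 in that case.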