
CS252/Culler Lec 18.1 4/2/02

CS252 Graduate Computer Architecture

Lecture 18: Branch Prediction + analysis resources => ILP

April 2, 2002
Prof. David E. Culler
Computer Science 252
Spring 2002


Today’s Big Idea

  • Reactive: past actions cause system to adapt use
    – do what you did before better
    – ex: caches
    – TCP windows
    – URL completion, ...
  • Proactive: uses past actions to predict future actions
    – optimize speculatively, anticipate what you are about to do
    – branch prediction
    – long cache blocks
    – ???


Review: Case for Branch Prediction when Issue N instructions per clock cycle

  • 1. Branches will arrive up to n times faster in an n-issue processor
  • 2. Amdahl’s Law => relative impact of the control stalls will be larger with the lower potential CPI in an n-issue processor

Conversely, need branch prediction to ‘see’ potential parallelism
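The Amdahl's Law point can be illustrated with a small CPI model. This is only a sketch: the branch frequency (20%), misprediction rate (10%), and 3-cycle penalty are hypothetical numbers chosen for illustration, not values from the lecture, and `stall_fraction` is a made-up helper name.

```python
def stall_fraction(issue_width, branch_freq=0.20, mispredict_rate=0.10, penalty=3):
    """Fraction of total cycles spent on control stalls (all parameters are assumptions)."""
    ideal_cpi = 1.0 / issue_width                          # best case for an n-issue machine
    stall_cpi = branch_freq * mispredict_rate * penalty    # stall cycles per instruction
    return stall_cpi / (ideal_cpi + stall_cpi)


fractions = {n: stall_fraction(n) for n in (1, 2, 4, 8)}
for n, frac in fractions.items():
    print(f"{n}-issue: {frac:.1%} of cycles lost to control stalls")
```

The stall cycles per instruction stay fixed while the ideal CPI shrinks, so the same misprediction behavior takes a growing share of execution time as issue width increases.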


Review: 7 Branch Prediction Schemes

  • 1. 1-bit Branch-Prediction Buffer
  • 2. 2-bit Branch-Prediction Buffer
  • 3. Correlating Branch Prediction Buffer
  • 4. Tournament Branch Predictor
  • 5. Branch Target Buffer
  • 6. Integrated Instruction Fetch Units
  • 7. Return Address Predictors


Review: Dynamic Branch Prediction

  • Performance = ƒ(accuracy, cost of misprediction)
  • Branch History Table: Lower bits of PC address index table of 1-bit values
    – Says whether or not branch taken last time
    – No address check (saves HW, but may not be right branch)
  • Problem: in a loop, 1-bit BHT will cause 2 mispredictions (avg is 9 iterations before exit):
    – End of loop case, when it exits instead of looping as before
    – First time through loop on next time through code, when it predicts exit instead of looping
    – Only 80% accuracy even if loop 90% of the time
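The 80% figure can be checked with a short simulation. This is a minimal sketch of a single 1-bit entry, assuming a loop branch that is taken 9 times and then falls through, with the loop entered 100 times; the function name and the initial state are illustrative choices.

```python
def one_bit_mispredictions(outcomes, last_taken=False):
    """1-bit predictor: predict whatever this branch did last time."""
    misses = 0
    for taken in outcomes:
        if taken != last_taken:      # prediction is simply the stored bit
            misses += 1
        last_taken = taken           # always overwrite with the latest outcome
    return misses


# Loop branch: taken 9 times, then not taken once; the loop runs 100 times.
outcomes = ([True] * 9 + [False]) * 100
misses = one_bit_mispredictions(outcomes)
accuracy = 1 - misses / len(outcomes)
print(misses, "mispredictions,", f"{accuracy:.0%} accuracy")
```

Each pass through the loop costs two mispredictions (the exit, and the first iteration of the next pass), even though the branch is taken 90% of the time.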


Review: Dynamic Branch Prediction

  • Better Solution: 2-bit scheme where change prediction only if get misprediction twice (Jim Smith, 1981):

[State diagram: four states, two Predict Taken and two Predict Not Taken, connected by T (taken) and NT (not taken) transitions]

  • Red: stop, not taken
  • Green: go, taken
  • Adds hysteresis to decision making process
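The effect of hysteresis can be sketched with a 2-bit saturating counter, one common way to realize a 2-bit scheme. The exact transitions in the slide's state diagram may differ, so treat the counter encoding (0 to 3, predict taken when the value is 2 or more) and the initial value as assumptions.

```python
def two_bit_mispredictions(outcomes, counter=2):
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses


# Same loop branch as before: taken 9 times, then not taken, repeated 100 times.
outcomes = ([True] * 9 + [False]) * 100
misses = two_bit_mispredictions(outcomes)
print(misses, "mispredictions out of", len(outcomes))
```

A single loop exit only moves the counter from strongly taken to weakly taken, so the first iteration of the next pass is still predicted correctly and only the exit itself is mispredicted.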


Consider 3 Scenarios

  • Branch for loop test
  • Check for error or exception
  • Alternating taken / not-taken
    – example?
  • Your worst-case prediction scenario


Correlating Branches

Idea: taken/not taken of recently executed branches is related to behavior of next branch (as well as the history of that branch behavior)
    – Then behavior of recent branches selects between, say, 4 predictions of next branch, updating just that prediction

  • (2,2) predictor: 2-bit global, 2-bit local

[Figure: the branch address (4 bits) selects a row of 2-bit-per-branch local predictors; the 2-bit recent global branch history (01 = not taken then taken) selects which prediction in that row is used]
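A (2,2) predictor can be sketched as a table of 2-bit counters indexed by both the low branch-address bits and the global history. This is an illustrative sketch, not the exact hardware organization: the class name, the initial counter value, and the use of `history_bits=0` to model a plain 2-bit BHT are assumptions.

```python
class CorrelatingPredictor:
    """(m, 2) predictor: m bits of global history select one of 2**m 2-bit counters per branch."""

    def __init__(self, addr_bits=4, history_bits=2):
        self.addr_mask = (1 << addr_bits) - 1
        self.hist_mask = (1 << history_bits) - 1
        self.history = 0  # most recent outcome in the low bit
        self.counters = [[1] * (1 << history_bits) for _ in range(1 << addr_bits)]

    def predict(self, pc):
        return self.counters[(pc >> 2) & self.addr_mask][self.history] >= 2

    def update(self, pc, taken):
        row = self.counters[(pc >> 2) & self.addr_mask]
        c = row[self.history]
        row[self.history] = min(3, c + 1) if taken else max(0, c - 1)  # only this entry changes
        self.history = ((self.history << 1) | int(taken)) & self.hist_mask


def count_misses(predictor, outcomes, pc=0x40):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


alternating = [True, False] * 100                      # taken / not-taken scenario
correlating_misses = count_misses(CorrelatingPredictor(), alternating)
plain_misses = count_misses(CorrelatingPredictor(history_bits=0), alternating)
print(correlating_misses, plain_misses)
```

With these starting values the plain 2-bit counter oscillates between its two weak states and mispredicts every alternating branch, while the history-indexed counters each see a consistent outcome and stop mispredicting after warm-up.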

Accuracy of Different Schemes

(Figure 3.15, p. 206)

[Bar chart: frequency of mispredictions (0% to 18%) for three predictors: 4096 entries 2-bit BHT, unlimited entries 2-bit BHT, and 1024 entries (2,2) BHT]

What’s missing in this picture?


Re-evaluating Correlation

  • Several of the SPEC benchmarks have less than a dozen branches responsible for 90% of taken branches:

program     branch %    static    # = 90%
compress    14%         236       13
eqntott     25%         494       5
gcc         15%         9531      2020
mpeg        10%         5598      532
real gcc    13%         17361     3214

  • Real programs + OS more like gcc
  • Small benefits beyond benchmarks for correlation? problems with branch aliases?


BHT Accuracy

  • Mispredict because either:
    – Wrong guess for that branch
    – Got branch history of wrong branch when index the table
  • 4096 entry table programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
  • For SPEC92, 4096 about as good as infinite table


Tournament Predictors

  • Motivation for correlating branch predictors is 2-bit predictor failed on important branches; by adding global information, performance improved
  • Tournament predictors: use 2 predictors, 1 based on global information and 1 based on local information, and combine with a selector
  • Hopes to select right predictor for right branch (or right context of branch)


Dynamically finding structure in Spaghetti

?


Tournament Predictor in Alpha 21264

  • 4K 2-bit counters to choose from among a global predictor and a local predictor
  • Global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor
    – 12-bit pattern: ith bit 0 => ith prior branch not taken; ith bit 1 => ith prior branch taken;
  • Local predictor consists of a 2-level predictor:
    – Top level: a local history table consisting of 1024 10-bit entries; each 10-bit entry corresponds to the most recent 10 branch outcomes for the entry. 10-bit history allows patterns of up to 10 branches to be discovered and predicted.
    – Next level: selected entry from the local history table is used to index a table of 1K entries consisting of 3-bit saturating counters, which provide the local prediction
  • Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180,000 transistors)
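The structure described above can be sketched in a few lines. This is a rough model under stated assumptions, not the actual 21264 logic: indexing the chooser by global history, the chooser update rule, the initial counter values, and the class and method names are all illustrative choices. Only the table sizes come from the slide.

```python
class TournamentPredictor:
    """Sketch of a 21264-style tournament predictor using the table sizes on the slide."""

    def __init__(self):
        self.ghist = 0                     # 12-bit global history
        self.global_ctr = [1] * 4096       # 4K 2-bit counters, indexed by global history
        self.choice = [1] * 4096           # 4K 2-bit choosers; >= 2 picks global (assumption)
        self.lhist = [0] * 1024            # 1K 10-bit local histories, indexed by PC bits
        self.local_ctr = [3] * 1024        # 1K 3-bit counters, indexed by local history

    def _components(self, pc):
        slot = (pc >> 2) % 1024
        local_pred = self.local_ctr[self.lhist[slot]] >= 4
        global_pred = self.global_ctr[self.ghist] >= 2
        return slot, local_pred, global_pred

    def predict(self, pc):
        _, local_pred, global_pred = self._components(pc)
        return global_pred if self.choice[self.ghist] >= 2 else local_pred

    def update(self, pc, taken):
        slot, local_pred, global_pred = self._components(pc)
        if local_pred != global_pred:      # train chooser only when the two disagree
            step = 1 if global_pred == taken else -1
            self.choice[self.ghist] = min(3, max(0, self.choice[self.ghist] + step))
        lh, g = self.lhist[slot], self.global_ctr[self.ghist]
        self.local_ctr[lh] = min(7, self.local_ctr[lh] + 1) if taken else max(0, self.local_ctr[lh] - 1)
        self.global_ctr[self.ghist] = min(3, g + 1) if taken else max(0, g - 1)
        self.lhist[slot] = ((lh << 1) | int(taken)) & 0x3FF
        self.ghist = ((self.ghist << 1) | int(taken)) & 0xFFF

    def size_bits(self):
        return 4096 * 2 + 4096 * 2 + 1024 * 10 + 1024 * 3


tp = TournamentPredictor()
late_misses = 0
for i in range(200):                       # alternating taken / not-taken branch
    taken = (i % 2 == 0)
    if i >= 100 and tp.predict(0x40) != taken:
        late_misses += 1
    tp.update(0x40, taken)
print(tp.size_bits() // 1024, "Kbits;", late_misses, "late mispredictions")
```

Once both histories fill with the alternating pattern, each component sees a consistent outcome per history value, so either choice of the selector yields a correct prediction.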


% of predictions from local predictor in Tournament Prediction Scheme

Program      % from local predictor
nasa7        98%
matrix300    100%
tomcatv      94%
doduc        90%
spice        55%
fpppp        76%
gcc          72%
espresso     63%
eqntott      37%
li           69%


Accuracy of Branch Prediction

fig 3.40

[Bar chart: branch prediction accuracy (0% to 100%) for gcc, espresso, li, fpppp, doduc, tomcatv under three predictors: profile-based, 2-bit counter, tournament. Extracted bar labels, in groups of six: 94% 96% 98% 98% 97% 100%; 70% 82% 77% 82% 84% 99%; 88% 86% 88% 86% 95% 99%]

  • Profile: branch profile from last execution (static in that it is encoded in instruction, but profile)


Accuracy v. Size (SPEC89)

[Line chart: vertical axis 0% to 10%; horizontal axis total predictor size, 8 to 128 Kbits; one curve each for local, correlating, and tournament predictors]


Need Address at Same Time as Prediction

  • Branch Target Buffer (BTB): Address of branch index to get prediction AND branch address (if taken)
    – Note: must check for branch match now, since can’t use wrong branch address (Figure 3.19, 3.20)

[Figure: the PC of the instruction in FETCH is compared (=?) against the Branch PC stored in the BTB; each entry holds Branch PC, Predicted PC, and extra prediction state bits. Yes: instruction is branch and use predicted PC as next PC. No: branch not predicted, proceed normally (Next PC = PC+4)]
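The lookup in the figure can be sketched as a small direct-mapped table with a tag check. This is a simplified model: the 16-entry size, the policy of dropping an entry when its branch falls through, and the names used are assumptions for illustration.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: each entry stores (branch PC, predicted target PC)."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries

    def _index(self, pc):
        return (pc >> 2) % self.entries            # drop byte offset, keep low-order bits

    def next_pc(self, pc):
        entry = self.table[self._index(pc)]
        if entry is not None and entry[0] == pc:   # the "=?" branch match check
            return entry[1]                        # use predicted PC as next PC
        return pc + 4                              # not predicted: proceed normally

    def update(self, pc, taken, target):
        idx = self._index(pc)
        if taken:
            self.table[idx] = (pc, target)
        elif self.table[idx] is not None and self.table[idx][0] == pc:
            self.table[idx] = None                 # branch fell through: drop the entry


btb = BranchTargetBuffer()
cold = btb.next_pc(0x100)                  # nothing stored yet
btb.update(0x100, taken=True, target=0x80)
warm = btb.next_pc(0x100)                  # hit: predicted target
aliased = btb.next_pc(0x140)               # same index, different PC: match check fails
print(hex(cold), hex(warm), hex(aliased))
```

Unlike the untagged BHT, the match check matters here: redirecting fetch to another branch's target would send an ordinary instruction to the wrong address.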


Predicated Execution

  • Avoid branch prediction by turning branches into conditionally executed instructions:
    if (x) then A = B op C else NOP
    – If false, then neither store result nor cause exception
    – Expanded ISA of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instr.
    – IA-64: 64 1-bit condition fields selected so conditional execution of any instruction
    – This transformation is called “if-conversion”
  • Drawbacks to conditional instructions
    – Still takes a clock even if “annulled”
    – Stall if condition evaluated late
    – Complex conditions reduce effectiveness; condition becomes known late in pipeline


Special Case Return Addresses

  • Register indirect branch hard to predict address
  • SPEC89: 85% such branches for procedure return
  • Since stack discipline for procedures, save return address in small buffer that acts like a stack: 8 to 16 entries has small miss rate
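The return address buffer can be sketched as a bounded stack. This is an illustrative model: the overflow policy (discard the oldest entry) and the method names are assumptions, since the slide only states that the buffer acts like a stack.

```python
class ReturnAddressStack:
    """Small hardware-style stack of predicted return addresses."""

    def __init__(self, entries=8):
        self.entries = entries
        self.stack = []

    def on_call(self, return_pc):
        if len(self.stack) == self.entries:
            self.stack.pop(0)              # full: lose the oldest return address
        self.stack.append(return_pc)

    def predict_return(self):
        return self.stack.pop() if self.stack else None   # None: no prediction


ras = ReturnAddressStack(entries=2)
for return_pc in (0x100, 0x200, 0x300):    # three nested calls, only two slots
    ras.on_call(return_pc)
predictions = [ras.predict_return() for _ in range(3)]
print(predictions)
```

The two innermost returns are predicted correctly, and only the call that overflowed the buffer is missed, which is why a modest number of entries covers most call depths.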


Pitfall: Sometimes bigger and dumber is better

  • 21264 uses tournament predictor (29 Kbits)
  • Earlier 21164 uses a simple 2-bit predictor with 2K entries (or a total of 4 Kbits)
  • SPEC95 benchmarks, 21264 outperforms
    – 21264 avg. 11.5 mispredictions per 1000 instructions
    – 21164 avg. 16.5 mispredictions per 1000 instructions
  • Reversed for transaction processing (TP)!
    – 21264 avg. 17 mispredictions per 1000 instructions
    – 21164 avg. 15 mispredictions per 1000 instructions
  • TP code much larger & 21164 holds 2X branch predictions based on local behavior (2K vs. 1K local predictor in the 21264)


Dynamic Branch Prediction Summary

  • Prediction becoming important part of scalar execution
  • Branch History Table: 2 bits for loop accuracy
  • Correlation: Recently executed branches correlated with next branch.
    – Either different branches
    – Or different executions of same branches
  • Tournament Predictor: more resources to competitive solutions and pick between them
  • Branch Target Buffer: include branch address & prediction
  • Predicated Execution can reduce number of branches, number of mispredicted branches
  • Return address stack for prediction of indirect jump


Administrivia

  • Looking for initial project results next week
  • Midterm back Thurs
  • Homework
  • Dan Sorin to talk on April 16 @ 3:30, SafetyNet: Improving the Availability and Designability of Shared Memory


Getting CPI < 1: Issuing Multiple Instructions/Cycle

  • Vector Processing: Explicit coding of independent loops as operations on large vectors of numbers
    – Multimedia instructions being added to many processors
  • Superscalar: varying no. instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo)
    – IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium III/4
  • (Very) Long Instruction Words (V)LIW: fixed number of instructions (4-16) scheduled by the compiler; put ops into wide templates (TBD)
    – Intel Architecture-64 (IA-64) 64-bit address
        » Renamed: “Explicitly Parallel Instruction Computer (EPIC)”
  • Anticipated success of multiple instructions lead to Instructions Per Clock cycle (IPC) vs. CPI


Getting CPI < 1: Issuing Multiple Instructions/Cycle

  • Superscalar MIPS: 2 instructions, 1 FP & 1 anything
    – Fetch 64-bits/clock cycle; Int on left, FP on right
    – Can only issue 2nd instruction if 1st instruction issues
    – More ports for FP registers to do FP load & FP op in a pair

Type              Pipe Stages
Int. instruction  IF  ID  EX  MEM WB
FP instruction    IF  ID  EX  MEM WB
Int. instruction      IF  ID  EX  MEM WB
FP instruction        IF  ID  EX  MEM WB
Int. instruction          IF  ID  EX  MEM WB
FP instruction            IF  ID  EX  MEM WB

  • 1 cycle load delay expands to 3 instructions in SS
    – instruction in right half can’t use it, nor instructions in next slot


Multiple Issue Issues

  • issue packet: group of instructions from fetch unit that could potentially issue in 1 clock
    – If instruction causes structural hazard or a data hazard either due to earlier instruction in execution or to earlier instruction in issue packet, then instruction does not issue
    – 0 to N instruction issues per clock cycle, for N-issue
  • Performing issue checks in 1 cycle could limit clock cycle time: O(n^2 - n) comparisons
    – => issue stage usually split and pipelined
    – 1st stage decides how many instructions from within this packet can issue, 2nd stage examines hazards among selected instructions and those already been issued
    – => higher branch penalties => prediction accuracy important
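One way to arrive at the n^2 - n figure is to count register comparisons inside a single issue packet. The sketch below assumes each instruction has 2 source registers and 1 destination, and that only dependences on earlier instructions in the same packet are checked; the slide does not spell out this derivation, so treat it as one plausible reading.

```python
def count_issue_comparisons(n, sources_per_instr=2):
    """Count source-vs-destination register comparisons within an n-instruction packet."""
    count = 0
    for later in range(n):
        for earlier in range(later):
            count += sources_per_instr     # each source of `later` vs. dest of `earlier`
    return count


counts = {n: count_issue_comparisons(n) for n in range(1, 9)}
for n, c in counts.items():
    print(f"{n}-issue: {c} comparisons (n^2 - n = {n * n - n})")
```

The count grows quadratically with issue width, which is the pressure on cycle time that motivates splitting and pipelining the issue stage.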


Multiple Issue Challenges

  • While Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with:
    – Exactly 50% FP operations AND No hazards
  • If more instructions issue at same time, greater difficulty of decode and issue:
    – Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue; (N-issue ~O(N^2 - N) comparisons)
    – Register file: need 2x reads and 1x writes/cycle
    – Rename logic: must be able to rename same register multiple times in one cycle! For instance, consider 4-way issue:

        add r1, r2, r3         add p11, p4, p7
        sub r4, r1, r2    ⇒    sub p22, p11, p4
        lw  r1, 4(r4)          lw  p23, 4(p22)
        add r5, r1, r2         add p12, p23, p4

      Imagine doing this transformation in a single cycle!
    – Result buses: Need to complete multiple instructions/cycle
        » So, need multiple buses with associated matching logic at every reservation station.
        » Or, need multiple forwarding paths
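The renaming transformation in the 4-way issue example can be reproduced sequentially in software. In this sketch the initial mapping (r2 to p4, r3 to p7) and the order of free physical registers (p11, p22, p23, p12) are inferred from the slide's example; the `rename` helper and the tuple encoding of instructions are illustrative assumptions.

```python
def rename(instrs, rename_map, free_list):
    """Rename architectural registers to physical ones, one instruction at a time."""
    free = iter(free_list)
    renamed = []
    for op, dest, srcs in instrs:
        new_srcs = [rename_map[s] for s in srcs]   # read sources BEFORE remapping dest
        rename_map[dest] = next(free)              # fresh physical register for the result
        renamed.append((op, rename_map[dest], new_srcs))
    return renamed


program = [
    ("add", "r1", ["r2", "r3"]),
    ("sub", "r4", ["r1", "r2"]),
    ("lw",  "r1", ["r4"]),          # lw r1, 4(r4): r1 is renamed a second time
    ("add", "r5", ["r1", "r2"]),
]
renamed = rename(program, {"r2": "p4", "r3": "p7"}, ["p11", "p22", "p23", "p12"])
for op, dest, srcs in renamed:
    print(op, dest, srcs)
```

Each instruction depends on the map updates made by the ones before it (r1 is mapped twice), which is why doing all four renames within a single cycle is hard in hardware.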


Dynamic Scheduling in Superscalar: The easy way

  • How to issue two instructions and keep in-order instruction issue for Tomasulo?
    – Assume 1 integer + 1 floating point
    – 1 Tomasulo control for integer, 1 for floating point
  • Issue 2X Clock Rate, so that issue remains in order
  • Only loads/stores might cause dependency between integer and FP issue:
    – Replace load reservation station with a load queue; operands must be read in the order they are fetched
    – Load checks addresses in Store Queue to avoid RAW violation
    – Store checks addresses in Load Queue to avoid WAR, WAW


Register renaming, virtual registers versus Reorder Buffers

  • Alternative to Reorder Buffer is a larger virtual set of registers and register renaming
  • Virtual registers hold both architecturally visible registers + temporary values
    – replace functions of reorder buffer and reservation station
  • Renaming process maps names of architectural registers to registers in virtual register set
    – Changing subset of virtual registers contains architecturally visible registers
  • Simplifies instruction commit: mark register as no longer speculative, free register with old value
  • Adds 40-80 extra registers: Alpha, Pentium, …
    – Size limits no. instructions in execution (used until commit)


How much to speculate?

  • Speculation Pro: uncover events that would otherwise stall the pipeline (cache misses)
  • Speculation Con: speculate costly if exceptional event occurs when speculation was incorrect
  • Typical solution: speculation allows only low-cost exceptional events (1st-level cache miss)
  • When expensive exceptional event occurs, (2nd-level cache miss or TLB miss) processor waits until the instruction causing event is no longer speculative before handling the event
  • Assuming single branch per cycle: future may speculate across multiple branches!


Limits to ILP

  • Conflicting studies of amount
    – Benchmarks (vectorized Fortran FP vs. integer C programs)
    – Hardware sophistication
    – Compiler sophistication
  • How much ILP is available using existing mechanisms with increasing HW budgets?
  • Do we need to invent new HW/SW mechanisms to keep on processor performance curve?
    – Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints
    – Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock
    – Motorola AltiVec: 128 bit ints and FPs
    – Supersparc Multimedia ops, etc.


Limits to ILP

Initial HW Model here; MIPS compilers.
Assumptions for ideal/perfect machine to start:

  • 1. Register renaming – infinite virtual registers => all register WAW & WAR hazards are avoided
  • 2. Branch prediction – perfect; no mispredictions
  • 3. Jump prediction – all jumps perfectly predicted
    2 & 3 => machine with perfect speculation & an unbounded buffer of instructions available
  • 4. Memory-address alias analysis – addresses are known & a store can be moved before a load provided addresses not equal

Also: unlimited number of instructions issued/clock cycle; perfect caches; 1 cycle latency for all instructions (FP *, /);


Upper Limit to ILP: Ideal Machine

(Figure 3.35 p. 242)

Program     IPC
gcc         54.8
espresso    62.6
li          17.9
fpppp       75.2
doduc       118.7
tomcatv     150.1

Integer: 18 - 60
FP: 75 - 150

How is this data generated?

More Realistic HW: Branch Impact

Figure 3.37

Change from infinite window to examine to 2000 and maximum issue of 64 instructions per clock cycle

[Bar chart: IPC (0 to 60) for gcc, espresso, li, fpppp, doduc, tomcatv under five branch prediction schemes: Perfect, Tournament (selective predictor), BHT (512) (standard 2-bit), Profile (static), No prediction]

FP: 15 - 45
Integer: 6 - 12

More Realistic HW: Renaming Register Impact

Figure 3.41

Change 2000 instr window, 64 instr issue, 8K 2 level Prediction

[Bar chart: IPC (0 to 70) for gcc, espresso, li, fpppp, doduc, tomcatv with Infinite, 256, 128, 64, 32, and no renaming registers]

Integer: 5 - 15
FP: 11 - 45

More Realistic HW: Memory Address Alias Impact

Figure 3.44

Change 2000 instr window, 64 instr issue, 8K 2 level Prediction, 256 renaming registers

[Bar chart: IPC (0 to 50) for gcc, espresso, li, fpppp, doduc, tomcatv under four alias analysis models: Perfect; Global/Stack perfect, heap conflicts; Inspection of assembly; None]

FP: 4 - 45 (Fortran, no heap)
Integer: 4 - 9


Realistic HW: Window Impact

(Figure 3.46)

Perfect disambiguation (HW), 1K Selective Prediction, 16 entry return, 64 registers, issue as many as window

[Bar chart: IPC (0 to 60) for gcc, espresso, li, fpppp, doduc, tomcatv with window sizes Infinite, 256, 128, 64, 32, 16, 8, 4]

Integer: 6 - 12
FP: 8 - 45


How to Exceed ILP Limits of this study?

  • WAR and WAW hazards through memory
    – eliminated WAW and WAR hazards on registers through renaming, but not in memory usage
  • Unnecessary dependences (compiler not unrolling loops so iteration variable dependence)
  • Overcoming the data flow limit: value prediction, predicting values and speculating on prediction
    – Address value prediction and speculation predicts addresses and speculates by reordering loads and stores; could provide better aliasing analysis, only need to predict whether addresses are equal
  • Use multiple threads of control


Workstation Microprocessors 3/2001

Source: Microprocessor Report, www.MPRonline.com

  • Max issue: 4 instructions (many CPUs)
  • Max rename registers: 128 (Pentium 4)
  • Max BHT: 4K x 9 (Alpha 21264B), 16K x 2 (Ultra III)
  • Max Window Size (OOO): 126 instructions (Pent. 4)
  • Max Pipeline: 22/24 stages (Pentium 4)


SPEC 2000 Performance 3/2001

Source: Microprocessor Report, www.MPRonline.com

[Chart: annotated performance ratios 1.6X, 3.8X, 1.2X, 1.7X, 1.5X]


Conclusion

  • 1985-2000: 1000X performance
    – Moore’s Law transistors/chip => Moore’s Law for Performance/MPU
  • Hennessy: industry been following a roadmap of ideas known in 1985 to exploit Instruction Level Parallelism and (real) Moore’s Law to get 1.55X/year
    – Caches, Pipelining, Superscalar, Branch Prediction, Out-of-order execution, …
  • ILP limits: To make performance progress in future need to have explicit parallelism from programmer vs. implicit parallelism of ILP exploited by compiler, HW?
    – Otherwise drop to old rate of 1.3X per year?
    – Less than 1.3X because of processor-memory performance gap?
  • Impact on you: if you care about performance, better think about explicitly parallel algorithms vs. rely on ILP?
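The growth rates quoted in the Conclusion can be compounded to see what is at stake. This is simple arithmetic over the 15 years from 1985 to 2000; the helper name is illustrative, and the result is a back-of-the-envelope check rather than a figure from the lecture.

```python
def cumulative_speedup(rate_per_year, years):
    """Total speedup after compounding a yearly improvement factor."""
    return rate_per_year ** years


years = 2000 - 1985
ilp_era = cumulative_speedup(1.55, years)    # rate credited to ILP plus Moore's Law
old_rate = cumulative_speedup(1.3, years)    # the "old rate" mentioned on the slide
print(f"1.55X/year for {years} years: about {ilp_era:.0f}X")
print(f"1.3X/year for {years} years:  about {old_rate:.0f}X")
```

Compounding 1.55X/year gives a factor in the high hundreds, the same order of magnitude as the 1000X on the slide, while 1.3X/year over the same period yields only about a fiftyfold improvement.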