
CS252/Culler, Lec 15.1, 3/12/02

CS252 Graduate Computer Architecture

Lecture 15: Instruction Level Parallelism and Dynamic Execution

March 11, 2002
Prof. David E. Culler
Computer Science 252, Spring 2002

Recall from Pipelining Review

  • Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls

– Ideal pipeline CPI: measure of the maximum performance attainable by the implementation
– Structural hazards: HW cannot support this combination of instructions
– Data hazards: instruction depends on result of prior instruction still in the pipeline
– Control hazards: caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
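As a small illustration of the CPI breakdown above, the sketch below just adds the stall contributions to the ideal CPI. The function name and the stall rates are hypothetical values chosen for illustration; they are not figures from the lecture.

```python
def pipeline_cpi(ideal_cpi, structural, data_hazard, control):
    """Sum the ideal CPI and the average stall cycles per instruction."""
    return ideal_cpi + structural + data_hazard + control


# Hypothetical stall rates (cycles per instruction), for illustration only.
cpi = pipeline_cpi(ideal_cpi=1.0, structural=0.05, data_hazard=0.20, control=0.15)
print(f"Pipeline CPI = {cpi:.2f}")
```

Each technique discussed later in the lecture attacks one or more of these terms.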

Recall Data Hazard Resolution: In-order issue, in-order completion

[Figure: pipeline diagram. Time (clock cycles) runs across, instruction order runs down. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg; bubbles stall the instructions that follow the load.]

lw  r1, 0(r2)
sub r4, r1, r6
and r6, r2, r7
or  r8, r2, r9

Extend to multiple instruction issue? What if load had longer delay? Can "and" issue?

In-Order Issue, Out-of-order Completion

  • Which hazards are present? RAW? WAR? WAW?

load r3 <- r1, r2
add  r1 <- r5, r2
sub  r3 <- r3, r1   or   r3 <- r2, r1

  • Register Reservations

– when issue, mark destination register busy till complete
– check all register reservations before issue

[Figure: pipeline diagram with stages Ifetch, Reg, ALU, Add, DMem, DMem’, Reg]

Ideas to Reduce Stalls

Technique                                   Reduces
Dynamic scheduling                          Data hazard stalls
Dynamic branch prediction                   Control stalls
Issuing multiple instructions per cycle     Ideal CPI
Speculation                                 Data and control stalls
Dynamic memory disambiguation               Data hazard stalls involving memory
Loop unrolling                              Control hazard stalls
Basic compiler pipeline scheduling          Data hazard stalls
Compiler dependence analysis                Ideal CPI and data hazard stalls
Software pipelining and trace scheduling    Ideal CPI and data hazard stalls
Compiler speculation                        Ideal CPI, data and control stalls

The first five techniques are marked Chapter 3 and the last five Chapter 4.

Instruction-Level Parallelism (ILP)

  • Basic Block (BB) ILP is quite small

– BB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit
– average dynamic branch frequency 15% to 25% => 4 to 7 instructions execute between a pair of branches
– Plus instructions in BB likely to depend on each other

  • To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks
  • Simplest: loop-level parallelism to exploit parallelism among iterations of a loop

– Vector is one way
– If not vector, then either dynamic via branch prediction or static via loop unrolling by compiler
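The "4 to 7 instructions" figure follows from the branch frequency: if a fraction f of dynamic instructions are branches, then on average one instruction in 1/f is a branch. A minimal sketch of that arithmetic (the function name is made up for illustration):

```python
def instructions_between_branches(branch_frequency):
    """Average number of dynamic instructions per branch, given the branch fraction."""
    return 1.0 / branch_frequency


for freq in (0.15, 0.25):
    print(f"branch frequency {freq:.0%}: about "
          f"{instructions_between_branches(freq):.1f} instructions per branch")
```

With so few instructions per basic block, and with those instructions often dependent on each other, there is little parallelism to find without looking across branches.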


Data Dependence and Hazards

  • Instr J is data dependent on Instr I: Instr J tries to read operand before Instr I writes it

I: add r1,r2,r3
J: sub r4,r1,r3

  • or Instr J is data dependent on Instr K, which is dependent on Instr I
  • Caused by a “True Dependence” (compiler term)
  • If a true dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard

Data Dependence and Hazards

  • Dependences are a property of programs
  • Presence of a dependence indicates potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline
  • Importance of the data dependencies

1) indicates the possibility of a hazard
2) determines the order in which results must be calculated
3) sets an upper bound on how much parallelism can possibly be exploited

  • Today: looking at HW schemes to avoid hazards

Name Dependence #1: Anti-dependence

  • Name dependence: when 2 instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name; 2 versions of name dependence
  • Instr J writes operand before Instr I reads it

I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7

Called an “anti-dependence” in compiler work. This results from reuse of the name “r1”

  • If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard

Name Dependence #2: Output dependence

  • Instr J writes operand before Instr I writes it.

I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7

  • Called an “output dependence” by compiler writers

This also results from the reuse of name “r1”

  • If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard
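The three dependence types can be checked mechanically from register names. The sketch below is illustrative only: the `Instr` tuple and the `dependences` function are hypothetical names, and the check compares just one earlier instruction I with one later instruction J, ignoring anything in between.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def dependences(i: Instr, j: Instr) -> List[str]:
    """Classify how a later instruction j depends on an earlier instruction i."""
    found = []
    if i.dest in j.srcs:
        found.append("RAW")  # true dependence: j reads what i writes
    if j.dest in i.srcs:
        found.append("WAR")  # anti-dependence: j overwrites what i reads
    if j.dest == i.dest:
        found.append("WAW")  # output dependence: both write the same name
    return found


# The I/J pairs from the three slides above.
print(dependences(Instr("add", "r1", ("r2", "r3")), Instr("sub", "r4", ("r1", "r3"))))
print(dependences(Instr("sub", "r4", ("r1", "r3")), Instr("add", "r1", ("r2", "r3"))))
print(dependences(Instr("sub", "r1", ("r4", "r3")), Instr("add", "r1", ("r2", "r3"))))
```

The labels name the hazard that could arise; whether a stall actually occurs depends on the pipeline, as noted earlier.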


ILP and Data Hazards

  • Program order: the order instructions would execute in if executed sequentially, 1 at a time, as determined by the original source program
  • HW/SW goal: exploit parallelism by preserving the appearance of program order

– modify order in a manner that cannot be observed by the program
– must not affect the outcome of the program

  • Ex: instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so the instructions do not conflict

– Register renaming resolves name dependence for regs
– Either by compiler or by HW

add r1, r2, r3
sub r2, r4, r5
and r3, r2, 1
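A minimal sketch of renaming applied to the three instructions above, assuming an unlimited supply of fresh names (`t0`, `t1`, ...). The `rename` function and the tuple encoding are hypothetical; a compiler or hardware would draw from a finite register pool.

```python
from itertools import count


def rename(program):
    """Give every destination a fresh name; sources read the latest mapping."""
    fresh = (f"t{n}" for n in count())
    latest = {}  # architectural name -> most recent renamed name
    renamed = []
    for op, dest, srcs in program:
        # Look up sources before remapping the destination of this instruction.
        new_srcs = tuple(latest.get(s, s) for s in srcs)
        new_dest = next(fresh)
        latest[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


program = [
    ("add", "r1", ("r2", "r3")),
    ("sub", "r2", ("r4", "r5")),
    ("and", "r3", ("r2", "1")),
]
for instr in rename(program):
    print(instr)
```

After renaming, `sub` and `and` no longer write names that `add` reads, so the WAR dependences disappear; only the true dependence of `and` on the result of `sub` remains.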


Control Dependencies

  • Every instruction is control dependent on some set of branches, and, in general, these control dependencies must be preserved to preserve program order

if p1 { S1; };
if p2 { S2; }

  • S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.


Control Dependence Ignored

  • Control dependence need not always be preserved

– willing to execute instructions that should not have been executed, thereby violating the control dependences, if can do so without affecting correctness of the program

  • Instead, 2 properties critical to program correctness are exception behavior and data flow

Exception Behavior

  • Preserving exception behavior => any changes in instruction execution order must not change how exceptions are raised in program (=> no new exceptions)
  • Example:

DADDU R2,R3,R4
BEQZ  R2,L1
LW    R1,0(R2)
L1:

  • Problem with moving LW before BEQZ?

Data Flow

  • Data flow: actual flow of data values among instructions that produce results and those that consume them

– branches make flow dynamic, determine which instruction is supplier of data

  • Example:

DADDU R1,R2,R3
BEQZ  R4,L
DSUBU R1,R5,R6
L: …
OR    R7,R1,R8

  • OR depends on DADDU or DSUBU? Must preserve data flow on execution

CS 252 Administrivia

  • Final Project Proposals due 3/17

– send URL to page containing
» title & participants
» problem statement
» annotated bibliography
– we’ll monitor progress through the pages

  • Assignment 3 out, due 3/19
  • Quiz 3/21

Advantages of Dynamic Scheduling

  • Handles cases when dependences are unknown at compile time

– (e.g., because they may involve a memory reference)

  • It simplifies the compiler
  • Allows code that was compiled for one pipeline to run efficiently on a different pipeline
  • Enables hardware speculation, a technique with significant performance advantages that builds on dynamic scheduling

HW Schemes: Instruction Parallelism

  • Key idea: allow instructions behind a stall to proceed

DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14

  • Enables out-of-order execution and allows out-of-order completion
  • Will distinguish when an instruction begins execution and when it completes execution; between the 2 times, the instruction is in execution
  • In a dynamically scheduled pipeline, all instructions pass through the issue stage in order (in-order issue)

Dynamic Scheduling Step 1

  • Simple pipeline had 1 stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue
  • Split the ID pipe stage of the simple 5-stage pipeline into 2 stages:
  • Issue: decode instructions, check for structural hazards
  • Read operands: wait until no data hazards, then read operands

A Dynamic Algorithm: Tomasulo’s Algorithm

  • For IBM 360/91 (before caches!)
  • Goal: high performance without special compilers
  • Small number of floating point registers (4 in 360) prevented interesting compiler scheduling of operations

– This led Tomasulo to try to figure out how to get more effective registers: renaming in hardware!

  • Why study a 1966 computer?
  • The descendants of this have flourished!

– Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, …

Tomasulo Algorithm

  • Control & buffers distributed with Function Units (FU)

– FU buffers called “reservation stations”; have pending operands

  • Registers in instructions replaced by values or pointers to reservation stations (RS)

– a form of register renaming
– avoids WAR, WAW hazards
– More reservation stations than registers, so can do optimizations compilers can’t

  • Results go to FUs from RSs, not through registers, over a Common Data Bus that broadcasts results to all FUs
  • Loads and stores treated as FUs with RSs as well
  • Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue

Tomasulo Organization

[Figure: block diagram. The FP Op Queue and FP Registers feed the Reservation Stations: Add1, Add2, Add3 in front of the FP adders and Mult1, Mult2 in front of the FP multipliers. Load Buffers (Load1 to Load6) receive data from memory; Store Buffers send data to memory. The Common Data Bus (CDB) connects the units.]

Reservation Station Components

Op: Operation to perform in the unit (e.g., + or –)

Vj, Vk: Value of source operands

– Store buffers have a V field, the result to be stored

Qj, Qk: Reservation stations producing source registers (value to be written)

– Note: Qj, Qk = 0 => ready
– Store buffers only have Qi for the RS producing the result

Busy: Indicates reservation station or FU is busy

Register result status: Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register.
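The fields above map naturally onto a small record type. This is a sketch under assumptions, not the 360/91 hardware layout: the class name, the use of `None` in place of "Q = 0", and the numeric operand values are all illustrative choices.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None  # value of source operand j, once known
    vk: Optional[float] = None  # value of source operand k, once known
    qj: Optional[str] = None    # station producing operand j; None stands in for "Qj = 0"
    qk: Optional[str] = None    # station producing operand k; None stands in for "Qk = 0"

    def ready(self) -> bool:
        """A busy station may execute once neither operand is still pending."""
        return self.busy and self.qj is None and self.qk is None


# Register result status: register name -> station that will write it.
register_status: Dict[str, str] = {"F0": "Mult1"}

# A multiply still waiting on Load2 for its first operand (operand value is made up).
mult1 = ReservationStation("Mult1", busy=True, op="MULTD", vk=2.5, qj="Load2")
print(mult1.ready())
```

Keeping a value field and a tag field per operand is what lets a station hold either the operand itself or a pointer to its producer.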


Three Stages of Tomasulo Algorithm

  • 1. Issue: get instruction from FP Op Queue

If reservation station free (no structural hazard), control issues instr & sends operands (renames registers).

  • 2. Execute: operate on operands (EX)

When both operands ready then execute; if not ready, watch Common Data Bus for result

  • 3. Write result: finish execution (WB)

Write on Common Data Bus to all awaiting units; mark reservation station available

  • Normal data bus: data + destination (“go to” bus)
  • Common data bus: data + source (“come from” bus)

– 64 bits of data + 4 bits of Functional Unit source address
– Write if matches expected Functional Unit (produces result)
– Does the broadcast

  • Example speed: 3 clocks for Fl. pt. +, -; 10 for *; 40 clks for /
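The "come from" bus idea can be sketched as a broadcast keyed by the producer's tag: every waiting station compares the tag against its Qj and Qk fields and captures the value on a match. The dictionary layout and the function names below are hypothetical, and operand values are kept as symbolic strings in the style of the worked example that follows.

```python
def broadcast(tag, value, stations, register_status, registers):
    """Write-result step: every consumer waiting on `tag` captures `value`."""
    for rs in stations:
        if rs["name"] == tag:
            rs["busy"] = False                # the producing station becomes available
        if rs["qj"] == tag:
            rs["vj"], rs["qj"] = value, None  # operand j captured from the bus
        if rs["qk"] == tag:
            rs["vk"], rs["qk"] = value, None  # operand k captured from the bus
    # Only registers still waiting on this producer are written.
    for reg in [r for r, producer in register_status.items() if producer == tag]:
        registers[reg] = value
        del register_status[reg]


def station(name, op, vj=None, vk=None, qj=None, qk=None):
    return {"name": name, "busy": True, "op": op, "vj": vj, "vk": vk, "qj": qj, "qk": qk}


# Two stations and one register waiting on Load2.
stations = [
    station("Add1", "SUBD", vj="M(A1)", qk="Load2"),
    station("Mult1", "MULTD", vk="R(F4)", qj="Load2"),
]
register_status = {"F0": "Mult1", "F2": "Load2", "F8": "Add1"}
registers = {}

broadcast("Load2", "M(A2)", stations, register_status, registers)
print(stations[0]["vk"], stations[1]["vj"], registers["F2"])
```

Because consumers match on the source tag rather than on a register number, a register that has since been claimed by a later producer is simply not written.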


Tomasulo Example

Instruction status:

Instruction   j     k     Issue   Exec Comp   Write Result
LD     F6     34+   R2
LD     F2     45+   R3
MULTD  F0     F2    F4
SUBD   F8     F6    F2
DIVD   F10    F0    F6
ADDD   F6     F8    F2

Load buffers:

Name    Busy   Address
Load1   No
Load2   No
Load3   No

Reservation Stations:

                           S1   S2   RS   RS
Time   Name    Busy   Op   Vj   Vk   Qj   Qk
       Add1    No
       Add2    No
       Add3    No
       Mult1   No
       Mult2   No

Register result status:

Clock        F0   F2   F4   F6   F8   F10   F12   ...   F30
        FU

Slide annotations: clock cycle counter; FU countdown (Time column); instruction stream; 3 load buffers; 3 FP adder R.S.; 2 FP mult R.S.


Tomasulo Example Cycle 1

Instruction status:

Exec Write

Instruction j k

Issue Comp Result Busy Address

LD F6 34+ R2 1 Load1 Yes 34+R2 LD F2 45+ R3 Load2 No MULTD F0 F2 F4 Load3 No SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2

Reservation Stations:

S1 S2 RS RS

Time Name Busy

Op Vj Vk Qj Qk

Add1 No Add2 No Add3 No Mult1 No Mult2 No

Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30

1 FU Load1


Tomasulo Example Cycle 2

Instruction status:

Exec Write

Instruction j k

Issue Comp Result Busy Address

LD F6 34+ R2 1 Load1 Yes 34+R2 LD F2 45+ R3 2 Load2 Yes 45+R3 MULTD F0 F2 F4 Load3 No SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2

Reservation Stations:

S1 S2 RS RS

Time Name Busy

Op Vj Vk Qj Qk

Add1 No Add2 No Add3 No Mult1 No Mult2 No

Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30

2 FU Load2 Load1

Note: Can have multiple loads outstanding

Tomasulo Example Cycle 3

Instruction status:

Exec Write

Instruction j k

Issue Comp Result Busy Address

LD F6 34+ R2 1 3 Load1 Yes 34+R2 LD F2 45+ R3 2 Load2 Yes 45+R3 MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2

Reservation Stations:

S1 S2 RS RS

Time Name Busy

Op Vj Vk Qj Qk

Add1 No Add2 No Add3 No Mult1 Yes MULTD R(F4) Load2 Mult2 No

Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30

3 FU Mult1 Load2 Load1

  • Note: register names are removed (“renamed”) in Reservation Stations; MULT issued
  • Load1 completing; what is waiting for Load1?

Tomasulo Example Cycle 4

Instruction status:

Exec Write

Instruction j k

Issue Comp Result Busy Address

LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 Load2 Yes 45+R3 MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 DIVD F10 F0 F6 ADDD F6 F8 F2

Reservation Stations:

S1 S2 RS RS

Time Name Busy

Op Vj Vk Qj Qk

Add1 Yes SUBD M(A1) Load2 Add2 No Add3 No Mult1 Yes MULTD R(F4) Load2 Mult2 No

Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30

4 FU Mult1 Load2 M(A1) Add1

  • Load2 completing; what is waiting for Load2?

Tomasulo Example Cycle 5

Instruction status:

Exec Write

Instruction j k

Issue Comp Result Busy Address

LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 DIVD F10 F0 F6 5 ADDD F6 F8 F2

Reservation Stations:

S1 S2 RS RS

Time Name Busy

Op Vj Vk Qj Qk

2 Add1 Yes SUBD M(A1) M(A2) Add2 No Add3 No 10 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30

5 FU Mult1 M(A2) M(A1) Add1 Mult2

  • Timer starts counting down for Add1, Mult1

Tomasulo Example Cycle 6

Instruction status:

Exec Write

Instruction j k

Issue Comp Result Busy Address

LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6

Reservation Stations:

S1 S2 RS RS

Time Name Busy

Op Vj Vk Qj Qk

1 Add1 Yes SUBD M(A1) M(A2) Add2 Yes ADDD M(A2) Add1 Add3 No 9 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30

6 FU Mult1 M(A2) Add2 Add1 Mult2

  • Issue ADDD here despite name dependency on F6?

Tomasulo Example Cycle 7

Instruction status:

Exec Write

Instruction j k

Issue Comp Result Busy Address

LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 7 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6

Reservation Stations:

S1 S2 RS RS

Time Name Busy

Op Vj Vk Qj Qk

0 Add1 Yes SUBD M(A1) M(A2) Add2 Yes ADDD M(A2) Add1 Add3 No 8 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30

7 FU Mult1 M(A2) Add2 Add1 Mult2

  • Add1 (SUBD) completing; what is waiting for it?

Tomasulo Example Cycle 8

Instruction status:

Exec Write

Instruction j k

Issue Comp Result Busy Address

LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6

Reservation Stations:

S1 S2 RS RS

Time Name Busy

Op Vj Vk Qj Qk

Add1 No 2 Add2 Yes ADDD (M-M) M(A2) Add3 No 7 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30

8 FU Mult1 M(A2) Add2 (M-M) Mult2


Tomasulo Example Cycle 9

Instruction status:

Exec Write

Instruction j k

Issue Comp Result Busy Address

LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6

Reservation Stations:

S1 S2 RS RS

Time Name Busy

Op Vj Vk Qj Qk

Add1 No 1 Add2 Yes ADDD (M-M) M(A2) Add3 No 6 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30

9 FU Mult1 M(A2) Add2 (M-M) Mult2


Tomasulo Example Cycle 10

Instruction status:

Exec Write

Instruction j k

Issue Comp Result Busy Address

LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 10

Reservation Stations:

S1 S2 RS RS

Time Name Busy

Op Vj Vk Qj Qk

Add1 No 0 Add2 Yes ADDD (M-M) M(A2) Add3 No 5 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30

10 FU Mult1 M(A2) Add2 (M-M) Mult2

  • Add2 (ADDD) completing; what is waiting for it?

Tomasulo Example Cycle 11

Instruction status:

Exec Write

Instruction j k

Issue Comp Result Busy Address

LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 10 11

Reservation Stations:

S1 S2 RS RS

Time Name Busy

Op Vj Vk Qj Qk

Add1 No Add2 No Add3 No 4 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30

11 FU Mult1 M(A2) (M-M+M) (M-M) Mult2

  • Write result of ADDD here?
  • All quick instructions complete in this cycle!

Tomasulo Example Cycle 12

Instruction status:

Exec Write

Instruction j k

Issue Comp Result Busy Address

LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 10 11

Reservation Stations:

S1 S2 RS RS

Time Name Busy

Op Vj Vk Qj Qk

Add1 No Add2 No Add3 No 3 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30

12 FU Mult1 M(A2) (M-M+M) (M-M) Mult2


Tomasulo Example Cycle 13

Instruction status:

Exec Write

Instruction j k

Issue Comp Result Busy Address

LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 10 11

Reservation Stations:

S1 S2 RS RS

Time Name Busy

Op Vj Vk Qj Qk

Add1 No Add2 No Add3 No 2 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30

13 FU Mult1 M(A2) (M-M+M) (M-M) Mult2


Tomasulo Example Cycle 14

Instruction status:

Exec Write

Instruction j k

Issue Comp Result Busy Address

LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 10 11

Reservation Stations:

S1 S2 RS RS

Time Name Busy

Op Vj Vk Qj Qk

Add1 No Add2 No Add3 No 1 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30

14 FU Mult1 M(A2) (M-M+M) (M-M) Mult2


Tomasulo Example Cycle 15

Instruction status:

Exec Write

Instruction j k

Issue Comp Result Busy Address

LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 15 Load3 No SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 10 11

Reservation Stations:

S1 S2 RS RS

Time Name Busy

Op Vj Vk Qj Qk

Add1 No Add2 No Add3 No 0 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30

15 FU Mult1 M(A2) (M-M+M) (M-M) Mult2

  • Mult1 (MULTD) completing; what is waiting for it?

Tomasulo Example Cycle 16

Instruction status:

Exec Write

Instruction j k

Issue Comp Result Busy Address

LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 15 16 Load3 No SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 10 11

Reservation Stations:

S1 S2 RS RS

Time Name Busy

Op Vj Vk Qj Qk

Add1 No Add2 No Add3 No Mult1 No 40 Mult2 Yes DIVD M*F4 M(A1)

Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30

16 FU M*F4 M(A2) (M-M+M) (M-M) Mult2

  • Just waiting for Mult2 (DIVD) to complete

Faster than light computation (skip a couple of cycles)

Tomasulo Example Cycle 55

Instruction status:

Exec Write

Instruction j k

Issue Comp Result Busy Address

LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 15 16 Load3 No SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 10 11

Reservation Stations:

S1 S2 RS RS

Time Name Busy

Op Vj Vk Qj Qk

Add1 No Add2 No Add3 No Mult1 No 1 Mult2 Yes DIVD M*F4 M(A1)

Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30

55 FU M*F4 M(A2) (M-M+M) (M-M) Mult2


Tomasulo Example Cycle 56

Instruction status:

Exec Write

Instruction j k

Issue Comp Result Busy Address

LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 15 16 Load3 No SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 56 ADDD F6 F8 F2 6 10 11

Reservation Stations:

S1 S2 RS RS

Time Name Busy

Op Vj Vk Qj Qk

Add1 No Add2 No Add3 No Mult1 No 0 Mult2 Yes DIVD M*F4 M(A1)

Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30

56 FU M*F4 M(A2) (M-M+M) (M-M) Mult2

  • Mult2 (DIVD) is completing; what is waiting for it?

Tomasulo Example Cycle 57

Instruction status:

Instruction   j     k     Issue   Exec Comp   Write Result
LD     F6     34+   R2    1       3           4
LD     F2     45+   R3    2       4           5
MULTD  F0     F2    F4    3       15          16
SUBD   F8     F6    F2    4       7           8
DIVD   F10    F0    F6    5       56          57
ADDD   F6     F8    F2    6       10          11

Load buffers:

Name    Busy   Address
Load1   No
Load2   No
Load3   No

Reservation Stations:

                             S1     S2      RS   RS
Time   Name    Busy   Op     Vj     Vk      Qj   Qk
       Add1    No
       Add2    No
       Add3    No
       Mult1   No
       Mult2   Yes    DIVD   M*F4   M(A1)

Register result status:

Clock        F0     F2      F4   F6        F8      F10      F12   ...   F30
57      FU   M*F4   M(A2)        (M-M+M)   (M-M)   Result

  • Once again: in-order issue, out-of-order execution and out-of-order completion.
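The Issue / Exec Comp / Write Result columns of the example can be reproduced with a very small timing model. This is a sketch, not a full Tomasulo simulator: it assumes one instruction issues per cycle, a reservation station is always free, and no two results compete for the CDB. The latencies (2 for loads, adds and subtracts, 10 for multiply, 40 for divide) are inferred from the countdown timers in the tables above, and all names are illustrative.

```python
# Latencies inferred from the countdown timers in the example tables (assumption).
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}

# (op, destination, source registers); load addresses are ignored in this model.
PROGRAM = [
    ("LD",    "F6",  ()),
    ("LD",    "F2",  ()),
    ("MULTD", "F0",  ("F2", "F4")),
    ("SUBD",  "F8",  ("F6", "F2")),
    ("DIVD",  "F10", ("F0", "F6")),
    ("ADDD",  "F6",  ("F8", "F2")),
]


def schedule(program):
    """Return (op, dest, issue, exec_complete, write_result) per instruction."""
    pending_write = {}  # register -> cycle in which its latest issued producer writes the CDB
    rows = []
    for index, (op, dest, srcs) in enumerate(program):
        issue = index + 1  # in-order issue, one instruction per cycle
        # Execution counts down from the cycle in which the last operand arrives.
        operands_ready = max([issue] + [pending_write.get(s, 0) for s in srcs])
        exec_complete = operands_ready + LATENCY[op]
        write_result = exec_complete + 1
        rows.append((op, dest, issue, exec_complete, write_result))
        pending_write[dest] = write_result  # only later instructions see this producer
    return rows


for row in schedule(PROGRAM):
    print(row)
```

Note how DIVD picks up F6 from the first load, because ADDD has not yet issued when DIVD reads the mapping. That is the renaming effect that lets ADDD write F6 long before DIVD finishes.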

Tomasulo Drawbacks

  • Complexity

– delays of 360/91, MIPS 10000, Alpha 21264, IBM PPC 620 in CA:AQA 2/e, but not in silicon!

  • Many associative stores (CDB) at high speed
  • Performance limited by Common Data Bus

– Each CDB must go to multiple functional units ⇒ high capacitance, high wiring density
– Number of functional units that can complete per cycle limited to one!
» Multiple CDBs ⇒ more FU logic for parallel assoc stores

  • Non-precise interrupts!

– We will address this later

Tomasulo Loop Example

Loop: LD    F0 R1
      MULTD F4 F0 F2
      SD    F4 R1
      SUBI  R1 R1 #8
      BNEZ  R1 Loop

  • This time assume multiply takes 4 clocks
  • Assume 1st load takes 8 clocks (L1 cache miss), 2nd load takes 1 clock (hit)
  • To be clear, will show clocks for SUBI, BNEZ

– Reality: integer instructions ahead of Fl. Pt. instructions

  • Show 2 iterations

Loop Example

Instruction status:

ITER   Instruction         Issue   Exec Comp   Write Result
1      LD    F0 R1
1      MULTD F4 F0 F2
1      SD    F4 R1
2      LD    F0 R1
2      MULTD F4 F0 F2
2      SD    F4 R1

Load and store buffers:

Name     Busy   Addr   Fu
Load1    No
Load2    No
Load3    No
Store1   No
Store2   No
Store3   No

Reservation Stations:

                           S1   S2   RS
Time   Name    Busy   Op   Vj   Vk   Qj   Qk
       Add1    No
       Add2    No
       Add3    No
       Mult1   No
       Mult2   No

Code:

LD    F0 R1
MULTD F4 F0 F2
SD    F4 R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status:

Clock   R1        F0   F2   F4   F6   F8   F10   F12   ...   F30
        80   Fu

Slide annotations: added store buffers; value of register used for address, iteration control (R1); instruction loop (Code); iteration count (ITER).

Loop Example Cycle 1

Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu

1 LD F0 R1 1 Load1 Yes 80 Load2 No Load3 No Store1 No Store2 No Store3 No

Reservation Stations:

S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 R1 Mult1 No SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop

Register result status

Clock

R1

F0 F2 F4 F6 F8 F10 F12 ... F30

1 80

Fu

Load1


Loop Example Cycle 2

Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu

1 LD F0 R1 1 Load1 Yes 80 1 MULTD F4 F0 F2 2 Load2 No Load3 No Store1 No Store2 No Store3 No

Reservation Stations:

S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 R1 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop

Register result status

Clock

R1

F0 F2 F4 F6 F8 F10 F12 ... F30

2 80

Fu

Load1 Mult1


Loop Example Cycle 3

Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu

1 LD F0 R1 1 Load1 Yes 80 1 MULTD F4 F0 F2 2 Load2 No 1 SD F4 R1 3 Load3 No Store1 Yes 80 Mult1 Store2 No Store3 No

Reservation Stations:

S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 R1 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop

Register result status

Clock

R1

F0 F2 F4 F6 F8 F10 F12 ... F30

3 80

Fu

Load1 Mult1

  • Implicit renaming sets up data flow graph

Loop Example Cycle 4

Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu

1 LD F0 R1 1 Load1 Yes 80 1 MULTD F4 F0 F2 2 Load2 No 1 SD F4 R1 3 Load3 No Store1 Yes 80 Mult1 Store2 No Store3 No

Reservation Stations:

S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 R1 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop

Register result status

Clock

R1

F0 F2 F4 F6 F8 F10 F12 ... F30

4 80

Fu

Load1 Mult1

  • Dispatching SUBI instruction (not in FP queue)

Loop Example Cycle 5

Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu

1 LD F0 R1 1 Load1 Yes 80 1 MULTD F4 F0 F2 2 Load2 No 1 SD F4 R1 3 Load3 No Store1 Yes 80 Mult1 Store2 No Store3 No

Reservation Stations:

S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 R1 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop

Register result status

Clock

R1

F0 F2 F4 F6 F8 F10 F12 ... F30

5 72

Fu

Load1 Mult1

  • And, BNEZ instruction (not in FP queue)

Loop Example Cycle 6

Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu

1 LD F0 R1 1 Load1 Yes 80 1 MULTD F4 F0 F2 2 Load2 Yes 72 1 SD F4 R1 3 Load3 No 2 LD F0 R1 6 Store1 Yes 80 Mult1 Store2 No Store3 No

Reservation Stations:

S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 R1 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop

Register result status

Clock

R1

F0 F2 F4 F6 F8 F10 F12 ... F30

6 72

Fu

Load2 Mult1

  • Notice that F0 never sees Load from location 80

Loop Example Cycle 7

Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu

1 LD F0 R1 1 Load1 Yes 80 1 MULTD F4 F0 F2 2 Load2 Yes 72 1 SD F4 R1 3 Load3 No 2 LD F0 R1 6 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 7 Store2 No Store3 No

Reservation Stations:

S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 R1 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop

Register result status

Clock

R1

F0 F2 F4 F6 F8 F10 F12 ... F30

7 72

Fu

Load2 Mult2

  • Register file completely detached from computation
  • First and second iteration completely overlapped

Loop Example Cycle 8

Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu

1 LD F0 R1 1 Load1 Yes 80 1 MULTD F4 F0 F2 2 Load2 Yes 72 1 SD F4 R1 3 Load3 No 2 LD F0 R1 6 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2 2 SD F4 R1 8 Store3 No

Reservation Stations:

S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 R1 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop

Register result status

Clock

R1

F0 F2 F4 F6 F8 F10 F12 ... F30

8 72

Fu

Load2 Mult2


Loop Example Cycle 9

Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu

1 LD F0 R1 1 9 Load1 Yes 80 1 MULTD F4 F0 F2 2 Load2 Yes 72 1 SD F4 R1 3 Load3 No 2 LD F0 R1 6 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2 2 SD F4 R1 8 Store3 No

Reservation Stations:

S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 R1 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop

Register result status

Clock

R1

F0 F2 F4 F6 F8 F10 F12 ... F30

9 72

Fu

Load2 Mult2

  • Load1 completing: who is waiting?
  • Note: Dispatching SUBI


Loop Example Cycle 10

Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu

1 LD F0 R1 1 9 10 Load1 No 1 MULTD F4 F0 F2 2 Load2 Yes 72 1 SD F4 R1 3 Load3 No 2 LD F0 R1 6 10 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2 2 SD F4 R1 8 Store3 No

Reservation Stations:

S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 R1 4 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8 Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop

Register result status

Clock

R1

F0 F2 F4 F6 F8 F10 F12 ... F30

10 64

Fu

Load2 Mult2

  • Load2 completing: who is waiting?
  • Note: Dispatching BNEZ


Loop Example Cycle 11

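Instruction status:

| ITER | Instruction j k | Issue | Exec Comp | Write Result | Buffer | Busy | Addr | Fu |
|---|---|---|---|---|---|---|---|---|
| 1 | LD F0 R1 | 1 | 9 | 10 | Load1 | No | | |
| 1 | MULTD F4 F0 F2 | 2 | | | Load2 | No | | |
| 1 | SD F4 R1 | 3 | | | Load3 | Yes | 64 | |
| 2 | LD F0 R1 | 6 | 10 | 11 | Store1 | Yes | 80 | Mult1 |
| 2 | MULTD F4 F0 F2 | 7 | | | Store2 | Yes | 72 | Mult2 |
| 2 | SD F4 R1 | 8 | | | Store3 | No | | |

Reservation Stations:

| Time | Name | Busy | Op | Vj (S1) | Vk (S2) | Qj (RS) | Qk (RS) | Code |
|---|---|---|---|---|---|---|---|---|
| | Add1 | No | | | | | | LD F0 R1 |
| | Add2 | No | | | | | | MULTD F4 F0 F2 |
| | Add3 | No | | | | | | SD F4 R1 |
| 3 | Mult1 | Yes | Multd | M[80] | R(F2) | | | SUBI R1 R1 #8 |
| 4 | Mult2 | Yes | Multd | M[72] | R(F2) | | | BNEZ R1 Loop |

Register result status:

| | Clock | R1 | F0 | F2 | F4 | F6 | F8 | F10 | F12 | ... | F30 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Fu | 11 | 64 | Load3 | | Mult2 | | | | | | |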

  • Next load in sequence


Loop Example Cycle 12

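Instruction status:

| ITER | Instruction j k | Issue | Exec Comp | Write Result | Buffer | Busy | Addr | Fu |
|---|---|---|---|---|---|---|---|---|
| 1 | LD F0 R1 | 1 | 9 | 10 | Load1 | No | | |
| 1 | MULTD F4 F0 F2 | 2 | | | Load2 | No | | |
| 1 | SD F4 R1 | 3 | | | Load3 | Yes | 64 | |
| 2 | LD F0 R1 | 6 | 10 | 11 | Store1 | Yes | 80 | Mult1 |
| 2 | MULTD F4 F0 F2 | 7 | | | Store2 | Yes | 72 | Mult2 |
| 2 | SD F4 R1 | 8 | | | Store3 | No | | |

Reservation Stations:

| Time | Name | Busy | Op | Vj (S1) | Vk (S2) | Qj (RS) | Qk (RS) | Code |
|---|---|---|---|---|---|---|---|---|
| | Add1 | No | | | | | | LD F0 R1 |
| | Add2 | No | | | | | | MULTD F4 F0 F2 |
| | Add3 | No | | | | | | SD F4 R1 |
| 2 | Mult1 | Yes | Multd | M[80] | R(F2) | | | SUBI R1 R1 #8 |
| 3 | Mult2 | Yes | Multd | M[72] | R(F2) | | | BNEZ R1 Loop |

Register result status:

| | Clock | R1 | F0 | F2 | F4 | F6 | F8 | F10 | F12 | ... | F30 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Fu | 12 | 64 | Load3 | | Mult2 | | | | | | |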

  • Why not issue third multiply?
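A minimal sketch of the issue check behind this question, assuming the station pools shown in the tables (three load buffers, three store buffers, two multiply stations) and strictly in-order issue of one instruction at a time. The names `STATION_POOLS` and `next_issue` are hypothetical, and operand capture is omitted.

```python
STATION_POOLS = {
    "LD": ["Load1", "Load2", "Load3"],
    "SD": ["Store1", "Store2", "Store3"],
    "MULTD": ["Mult1", "Mult2"],
}


def next_issue(queue, busy):
    """Try to issue only the instruction at the head of the queue.

    Returns (op, station) on success, or None when the head must stall
    because every station of its type is busy (a structural hazard).
    """
    if not queue:
        return None
    op = queue[0]
    for name in STATION_POOLS[op]:
        if not busy[name]:
            busy[name] = True
            return (queue.pop(0), name)
    return None


# Busy bits as shown in the cycle 12 tables.
busy = {
    "Load1": False, "Load2": False, "Load3": True,
    "Store1": True, "Store2": True, "Store3": False,
    "Mult1": True, "Mult2": True,
}
queue = ["MULTD", "SD"]  # third-iteration multiply, then third-iteration store

first = next_issue(queue, busy)
print(first)  # None: both multiply stations are occupied

busy["Mult1"] = False  # Mult1 is released once its result has been written
second = next_issue(queue, busy)
third = next_issue(queue, busy)
print(second, third)  # ('MULTD', 'Mult1') ('SD', 'Store3')
```

Because issue is in order, the store behind the stalled multiply also waits even though Store3 is free, which is consistent with the later tables where Mult1 is reused at cycle 16 and Store3 becomes busy at cycle 17.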

Loop Example Cycle 13

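Instruction status:

| ITER | Instruction j k | Issue | Exec Comp | Write Result | Buffer | Busy | Addr | Fu |
|---|---|---|---|---|---|---|---|---|
| 1 | LD F0 R1 | 1 | 9 | 10 | Load1 | No | | |
| 1 | MULTD F4 F0 F2 | 2 | | | Load2 | No | | |
| 1 | SD F4 R1 | 3 | | | Load3 | Yes | 64 | |
| 2 | LD F0 R1 | 6 | 10 | 11 | Store1 | Yes | 80 | Mult1 |
| 2 | MULTD F4 F0 F2 | 7 | | | Store2 | Yes | 72 | Mult2 |
| 2 | SD F4 R1 | 8 | | | Store3 | No | | |

Reservation Stations:

| Time | Name | Busy | Op | Vj (S1) | Vk (S2) | Qj (RS) | Qk (RS) | Code |
|---|---|---|---|---|---|---|---|---|
| | Add1 | No | | | | | | LD F0 R1 |
| | Add2 | No | | | | | | MULTD F4 F0 F2 |
| | Add3 | No | | | | | | SD F4 R1 |
| 1 | Mult1 | Yes | Multd | M[80] | R(F2) | | | SUBI R1 R1 #8 |
| 2 | Mult2 | Yes | Multd | M[72] | R(F2) | | | BNEZ R1 Loop |

Register result status:

| | Clock | R1 | F0 | F2 | F4 | F6 | F8 | F10 | F12 | ... | F30 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Fu | 13 | 64 | Load3 | | Mult2 | | | | | | |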

  • Why not issue third store?


Loop Example Cycle 14

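Instruction status:

| ITER | Instruction j k | Issue | Exec Comp | Write Result | Buffer | Busy | Addr | Fu |
|---|---|---|---|---|---|---|---|---|
| 1 | LD F0 R1 | 1 | 9 | 10 | Load1 | No | | |
| 1 | MULTD F4 F0 F2 | 2 | 14 | | Load2 | No | | |
| 1 | SD F4 R1 | 3 | | | Load3 | Yes | 64 | |
| 2 | LD F0 R1 | 6 | 10 | 11 | Store1 | Yes | 80 | Mult1 |
| 2 | MULTD F4 F0 F2 | 7 | | | Store2 | Yes | 72 | Mult2 |
| 2 | SD F4 R1 | 8 | | | Store3 | No | | |

Reservation Stations:

| Time | Name | Busy | Op | Vj (S1) | Vk (S2) | Qj (RS) | Qk (RS) | Code |
|---|---|---|---|---|---|---|---|---|
| | Add1 | No | | | | | | LD F0 R1 |
| | Add2 | No | | | | | | MULTD F4 F0 F2 |
| | Add3 | No | | | | | | SD F4 R1 |
| | Mult1 | Yes | Multd | M[80] | R(F2) | | | SUBI R1 R1 #8 |
| 1 | Mult2 | Yes | Multd | M[72] | R(F2) | | | BNEZ R1 Loop |

Register result status:

| | Clock | R1 | F0 | F2 | F4 | F6 | F8 | F10 | F12 | ... | F30 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Fu | 14 | 64 | Load3 | | Mult2 | | | | | | |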

  • Mult1 completing. Who is waiting?


Loop Example Cycle 15

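Instruction status:

| ITER | Instruction j k | Issue | Exec Comp | Write Result | Buffer | Busy | Addr | Fu |
|---|---|---|---|---|---|---|---|---|
| 1 | LD F0 R1 | 1 | 9 | 10 | Load1 | No | | |
| 1 | MULTD F4 F0 F2 | 2 | 14 | 15 | Load2 | No | | |
| 1 | SD F4 R1 | 3 | | | Load3 | Yes | 64 | |
| 2 | LD F0 R1 | 6 | 10 | 11 | Store1 | Yes | 80 | [80]*R2 |
| 2 | MULTD F4 F0 F2 | 7 | 15 | | Store2 | Yes | 72 | Mult2 |
| 2 | SD F4 R1 | 8 | | | Store3 | No | | |

Reservation Stations:

| Time | Name | Busy | Op | Vj (S1) | Vk (S2) | Qj (RS) | Qk (RS) | Code |
|---|---|---|---|---|---|---|---|---|
| | Add1 | No | | | | | | LD F0 R1 |
| | Add2 | No | | | | | | MULTD F4 F0 F2 |
| | Add3 | No | | | | | | SD F4 R1 |
| | Mult1 | No | | | | | | SUBI R1 R1 #8 |
| | Mult2 | Yes | Multd | M[72] | R(F2) | | | BNEZ R1 Loop |

Register result status:

| | Clock | R1 | F0 | F2 | F4 | F6 | F8 | F10 | F12 | ... | F30 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Fu | 15 | 64 | Load3 | | Mult2 | | | | | | |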

  • Mult2 completing. Who is waiting?


Loop Example Cycle 16

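Instruction status:

| ITER | Instruction j k | Issue | Exec Comp | Write Result | Buffer | Busy | Addr | Fu |
|---|---|---|---|---|---|---|---|---|
| 1 | LD F0 R1 | 1 | 9 | 10 | Load1 | No | | |
| 1 | MULTD F4 F0 F2 | 2 | 14 | 15 | Load2 | No | | |
| 1 | SD F4 R1 | 3 | | | Load3 | Yes | 64 | |
| 2 | LD F0 R1 | 6 | 10 | 11 | Store1 | Yes | 80 | [80]*R2 |
| 2 | MULTD F4 F0 F2 | 7 | 15 | 16 | Store2 | Yes | 72 | [72]*R2 |
| 2 | SD F4 R1 | 8 | | | Store3 | No | | |

Reservation Stations:

| Time | Name | Busy | Op | Vj (S1) | Vk (S2) | Qj (RS) | Qk (RS) | Code |
|---|---|---|---|---|---|---|---|---|
| | Add1 | No | | | | | | LD F0 R1 |
| | Add2 | No | | | | | | MULTD F4 F0 F2 |
| | Add3 | No | | | | | | SD F4 R1 |
| 4 | Mult1 | Yes | Multd | | R(F2) | Load3 | | SUBI R1 R1 #8 |
| | Mult2 | No | | | | | | BNEZ R1 Loop |

Register result status:

| | Clock | R1 | F0 | F2 | F4 | F6 | F8 | F10 | F12 | ... | F30 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Fu | 16 | 64 | Load3 | | Mult1 | | | | | | |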


Loop Example Cycle 17

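Instruction status:

| ITER | Instruction j k | Issue | Exec Comp | Write Result | Buffer | Busy | Addr | Fu |
|---|---|---|---|---|---|---|---|---|
| 1 | LD F0 R1 | 1 | 9 | 10 | Load1 | No | | |
| 1 | MULTD F4 F0 F2 | 2 | 14 | 15 | Load2 | No | | |
| 1 | SD F4 R1 | 3 | | | Load3 | Yes | 64 | |
| 2 | LD F0 R1 | 6 | 10 | 11 | Store1 | Yes | 80 | [80]*R2 |
| 2 | MULTD F4 F0 F2 | 7 | 15 | 16 | Store2 | Yes | 72 | [72]*R2 |
| 2 | SD F4 R1 | 8 | | | Store3 | Yes | 64 | Mult1 |

Reservation Stations:

| Time | Name | Busy | Op | Vj (S1) | Vk (S2) | Qj (RS) | Qk (RS) | Code |
|---|---|---|---|---|---|---|---|---|
| | Add1 | No | | | | | | LD F0 R1 |
| | Add2 | No | | | | | | MULTD F4 F0 F2 |
| | Add3 | No | | | | | | SD F4 R1 |
| | Mult1 | Yes | Multd | | R(F2) | Load3 | | SUBI R1 R1 #8 |
| | Mult2 | No | | | | | | BNEZ R1 Loop |

Register result status:

| | Clock | R1 | F0 | F2 | F4 | F6 | F8 | F10 | F12 | ... | F30 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Fu | 17 | 64 | Load3 | | Mult1 | | | | | | |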


Loop Example Cycle 18

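Instruction status:

| ITER | Instruction j k | Issue | Exec Comp | Write Result | Buffer | Busy | Addr | Fu |
|---|---|---|---|---|---|---|---|---|
| 1 | LD F0 R1 | 1 | 9 | 10 | Load1 | No | | |
| 1 | MULTD F4 F0 F2 | 2 | 14 | 15 | Load2 | No | | |
| 1 | SD F4 R1 | 3 | 18 | | Load3 | Yes | 64 | |
| 2 | LD F0 R1 | 6 | 10 | 11 | Store1 | Yes | 80 | [80]*R2 |
| 2 | MULTD F4 F0 F2 | 7 | 15 | 16 | Store2 | Yes | 72 | [72]*R2 |
| 2 | SD F4 R1 | 8 | | | Store3 | Yes | 64 | Mult1 |

Reservation Stations:

| Time | Name | Busy | Op | Vj (S1) | Vk (S2) | Qj (RS) | Qk (RS) | Code |
|---|---|---|---|---|---|---|---|---|
| | Add1 | No | | | | | | LD F0 R1 |
| | Add2 | No | | | | | | MULTD F4 F0 F2 |
| | Add3 | No | | | | | | SD F4 R1 |
| | Mult1 | Yes | Multd | | R(F2) | Load3 | | SUBI R1 R1 #8 |
| | Mult2 | No | | | | | | BNEZ R1 Loop |

Register result status:

| | Clock | R1 | F0 | F2 | F4 | F6 | F8 | F10 | F12 | ... | F30 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Fu | 18 | 64 | Load3 | | Mult1 | | | | | | |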


Loop Example Cycle 19

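Instruction status:

| ITER | Instruction j k | Issue | Exec Comp | Write Result | Buffer | Busy | Addr | Fu |
|---|---|---|---|---|---|---|---|---|
| 1 | LD F0 R1 | 1 | 9 | 10 | Load1 | No | | |
| 1 | MULTD F4 F0 F2 | 2 | 14 | 15 | Load2 | No | | |
| 1 | SD F4 R1 | 3 | 18 | 19 | Load3 | Yes | 64 | |
| 2 | LD F0 R1 | 6 | 10 | 11 | Store1 | No | | |
| 2 | MULTD F4 F0 F2 | 7 | 15 | 16 | Store2 | Yes | 72 | [72]*R2 |
| 2 | SD F4 R1 | 8 | 19 | | Store3 | Yes | 64 | Mult1 |

Reservation Stations:

| Time | Name | Busy | Op | Vj (S1) | Vk (S2) | Qj (RS) | Qk (RS) | Code |
|---|---|---|---|---|---|---|---|---|
| | Add1 | No | | | | | | LD F0 R1 |
| | Add2 | No | | | | | | MULTD F4 F0 F2 |
| | Add3 | No | | | | | | SD F4 R1 |
| | Mult1 | Yes | Multd | | R(F2) | Load3 | | SUBI R1 R1 #8 |
| | Mult2 | No | | | | | | BNEZ R1 Loop |

Register result status:

| | Clock | R1 | F0 | F2 | F4 | F6 | F8 | F10 | F12 | ... | F30 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Fu | 19 | 56 | Load3 | | Mult1 | | | | | | |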


Loop Example Cycle 20

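Instruction status:

| ITER | Instruction j k | Issue | Exec Comp | Write Result | Buffer | Busy | Addr | Fu |
|---|---|---|---|---|---|---|---|---|
| 1 | LD F0 R1 | 1 | 9 | 10 | Load1 | Yes | 56 | |
| 1 | MULTD F4 F0 F2 | 2 | 14 | 15 | Load2 | No | | |
| 1 | SD F4 R1 | 3 | 18 | 19 | Load3 | Yes | 64 | |
| 2 | LD F0 R1 | 6 | 10 | 11 | Store1 | No | | |
| 2 | MULTD F4 F0 F2 | 7 | 15 | 16 | Store2 | No | | |
| 2 | SD F4 R1 | 8 | 19 | 20 | Store3 | Yes | 64 | Mult1 |

Reservation Stations:

| Time | Name | Busy | Op | Vj (S1) | Vk (S2) | Qj (RS) | Qk (RS) | Code |
|---|---|---|---|---|---|---|---|---|
| | Add1 | No | | | | | | LD F0 R1 |
| | Add2 | No | | | | | | MULTD F4 F0 F2 |
| | Add3 | No | | | | | | SD F4 R1 |
| | Mult1 | Yes | Multd | | R(F2) | Load3 | | SUBI R1 R1 #8 |
| | Mult2 | No | | | | | | BNEZ R1 Loop |

Register result status:

| | Clock | R1 | F0 | F2 | F4 | F6 | F8 | F10 | F12 | ... | F30 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Fu | 20 | 56 | Load1 | | Mult1 | | | | | | |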

  • Once again: In-order issue, out-of-order execution, and out-of-order completion.
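The claim can be checked directly against the Issue, Exec Comp and Write Result columns of the cycle 20 instruction status table. The short sketch below does that; the row labels and the helper `in_order` are illustrative names, and the numbers are copied from the table.

```python
# (instruction, issue cycle, execution-complete cycle, write-result cycle),
# listed in program order, from the cycle 20 instruction status table.
rows = [
    ("LD    iter 1", 1, 9, 10),
    ("MULTD iter 1", 2, 14, 15),
    ("SD    iter 1", 3, 18, 19),
    ("LD    iter 2", 6, 10, 11),
    ("MULTD iter 2", 7, 15, 16),
    ("SD    iter 2", 8, 19, 20),
]


def in_order(cycles):
    """True when the cycle numbers never decrease in program order."""
    return all(a <= b for a, b in zip(cycles, cycles[1:]))


issue = [r[1] for r in rows]
exec_comp = [r[2] for r in rows]
write = [r[3] for r in rows]

print(in_order(issue))      # True
print(in_order(exec_comp))  # False
print(in_order(write))      # False
```

The second-iteration load writes its result at cycle 11, before the first-iteration multiply (15) and store (19), so the iterations overlap.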


Why can Tomasulo overlap iterations of loops?

  • Register renaming

– Multiple iterations use different physical destinations for registers (dynamic loop unrolling).

  • Reservation stations

– Permit instruction issue to advance past integer control flow operations
– Also buffer old values of registers, totally avoiding the WAR stall that we saw in the scoreboard.

  • Other perspective: Tomasulo builds the data flow dependency graph on the fly.
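The renaming effect can be sketched with the register result status table alone: each source register is replaced by the tag of its most recent producer at issue time. This is a simplified illustration, not the 360/91 logic; `rename_stream` and the tuple layout are hypothetical, and the integer address register R1 is left out.

```python
def rename_stream(instrs):
    """Rename sources to producer tags.

    instrs: list of (tag, dest, sources) in issue order, where tag is the
    reservation station or buffer that holds the instruction.
    Returns a list of (tag, renamed_sources).
    """
    status = {}  # register result status: register -> tag of latest producer
    renamed = []
    for tag, dest, sources in instrs:
        renamed.append((tag, [status.get(reg, reg) for reg in sources]))
        if dest is not None:
            status[dest] = tag  # later readers of dest wait on this tag
    return renamed


# Two iterations of the loop body, using the station names from the tables.
two_iterations = [
    ("Load1", "F0", []),
    ("Mult1", "F4", ["F0", "F2"]),
    ("Store1", None, ["F4"]),
    ("Load2", "F0", []),
    ("Mult2", "F4", ["F0", "F2"]),
    ("Store2", None, ["F4"]),
]

renamed = rename_stream(two_iterations)
for tag, sources in renamed:
    print(tag, sources)
```

The two multiplies end up waiting on Load1 and Load2 respectively, and the two stores on Mult1 and Mult2, so the second iteration never touches the first iteration's copies of F0 and F4. The renamed list is also the edge list of the data flow graph mentioned in the last bullet.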


Tomasulo’s scheme offers 2 major advantages

(1) the distribution of the hazard detection logic

– distributed reservation stations and the CDB
– If multiple instructions are waiting on a single result, and each instruction already has its other operand, then the instructions can be released simultaneously by broadcast on the CDB
– If a centralized register file were used, the units would have to read their results from the registers when register buses are available.

(2) the elimination of stalls for WAW and WAR hazards
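A minimal sketch of why advantage (2) holds, assuming a register file that records the tag of the latest producer per register. `RegFile`, `read_at_issue`, `claim` and `write_back` are hypothetical names, and the Add1 instruction and the numeric values are invented for the illustration.

```python
class RegFile:
    def __init__(self, values):
        self.values = dict(values)
        self.pending = {}  # register -> tag of the latest instruction writing it

    def read_at_issue(self, reg):
        """Return ('tag', t) if a producer is pending, else ('value', v)."""
        if reg in self.pending:
            return ("tag", self.pending[reg])
        return ("value", self.values[reg])

    def claim(self, reg, tag):
        """Record tag as the newest producer of reg."""
        self.pending[reg] = tag

    def write_back(self, reg, tag, value):
        """Only the newest producer may update the register."""
        if self.pending.get(reg) == tag:
            self.values[reg] = value
            del self.pending[reg]


rf = RegFile({"F2": 2.0, "F4": 0.0})

# WAR: a multiply copies F2 into its reservation station at issue, so a
# later writer of F2 (hypothetical Add1) can finish first without harm.
operand = rf.read_at_issue("F2")
rf.claim("F2", "Add1")
rf.write_back("F2", "Add1", 7.0)
print(operand)  # ('value', 2.0)

# WAW: Mult1 then Mult2 both target F4; Mult2 is newer and finishes first.
rf.claim("F4", "Mult1")
rf.claim("F4", "Mult2")
rf.write_back("F4", "Mult2", 20.0)
rf.write_back("F4", "Mult1", 10.0)  # stale tag, ignored
print(rf.values["F4"])  # 20.0
```

Neither case needs a stall: the old value of F2 lives in the station, and the stale write to F4 is dropped because its tag no longer matches.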


What about Precise Interrupts?

  • State of machine looks as if no instruction beyond the faulting instruction has issued
  • Tomasulo had: in-order issue, out-of-order execution, and out-of-order completion
  • Need to “fix” the out-of-order completion aspect so that we can find a precise breakpoint in the instruction stream.


Relationship between precise interrupts and speculation:

  • Speculation: guess and check
  • Important for branch prediction:

– Need to “take our best shot” at predicting branch direction.

  • If we speculate and are wrong, need to back up and restart execution at the point where we predicted incorrectly:

– This is exactly the same as precise exceptions!

  • Technique for both precise interrupts/exceptions and speculation: in-order completion or commit


HW support for precise interrupts

  • Need HW buffer for results of uncommitted instructions: reorder buffer

– 3 fields: instr, destination, value
– Use reorder buffer number instead of reservation station when execution completes
– Supplies operands between execution complete & commit
– (Reorder buffer can be operand source => more registers like RS)
– Instructions commit
– Once instruction commits, result is put into register
– As a result, easy to undo speculated instructions on mispredicted branches or exceptions

[Figure: block diagram with FP Op Queue, Reorder Buffer, FP Regs, reservation stations, and FP adders]
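The buffer described above can be sketched as a queue whose entries carry the three fields from the slide (instr, destination, value) plus a done flag. The class below is a hypothetical illustration: it commits as many finished entries as it can per call, whereas real hardware has a fixed commit width, and values are kept as strings.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # oldest instruction at the left
        self.regs = {}          # architectural registers, updated only at commit

    def allocate(self, instr, dest):
        entry = {"instr": instr, "dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        """Execution finished: hold the result in the buffer, not the register."""
        entry["value"] = value
        entry["done"] = True

    def commit(self):
        """Retire finished instructions strictly from the head."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            if entry["dest"] is not None:
                self.regs[entry["dest"]] = entry["value"]
            committed.append(entry["instr"])
        return committed


rob = ReorderBuffer()
ld1 = rob.allocate("LD F0 (iter 1)", "F0")
mul1 = rob.allocate("MULTD F4 (iter 1)", "F4")
ld2 = rob.allocate("LD F0 (iter 2)", "F0")

rob.complete(ld1, "M[80]")
rob.complete(ld2, "M[72]")  # finishes before the older multiply
first = rob.commit()
f0_after_first = rob.regs["F0"]
print(first, f0_after_first)  # ['LD F0 (iter 1)'] M[80]

rob.complete(mul1, "M[80]*R(F2)")
second = rob.commit()
print(second)  # ['MULTD F4 (iter 1)', 'LD F0 (iter 2)']
```

The second load completes early but cannot update F0 until the older multiply has committed, so the register state always corresponds to a point in program order.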


Four Steps of Speculative Tomasulo Algorithm

  • 1. Issue: get instruction from FP Op Queue

If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the reorder buffer number for the destination (this stage is sometimes called “dispatch”).

  • 2. Execution: operate on operands (EX)

When both operands are ready, execute; if not, watch the CDB for the result and execute once both are in the reservation station. This step checks RAW hazards (sometimes called “issue”).

  • 3. Write result: finish execution (WB)

Write on the Common Data Bus to all awaiting FUs and the reorder buffer; mark the reservation station available.

  • 4. Commit: update register with reorder result

When the instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer. A mispredicted branch flushes the reorder buffer (this stage is sometimes called “graduation”).
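Step 4 is where speculation is resolved. The sketch below models a single commit attempt, assuming each reorder buffer entry is a dictionary with a `done` flag and an optional `mispredicted` flag; both the field names and `commit_step` are hypothetical.

```python
def commit_step(rob, regs):
    """Try to commit the entry at the head of rob (a list, oldest first).

    Returns "wait" if the head has no result yet, "flush" if the head is a
    mispredicted branch (all younger entries are discarded), else "commit".
    """
    if not rob or not rob[0]["done"]:
        return "wait"
    head = rob.pop(0)
    if head.get("mispredicted", False):
        rob.clear()  # speculative work past the branch never reaches regs
        return "flush"
    if head["dest"] is not None:
        regs[head["dest"]] = head["value"]
    return "commit"


regs = {}
rob = [
    {"instr": "BNEZ R1 Loop", "dest": None, "value": None,
     "done": True, "mispredicted": True},
    {"instr": "LD F0 R1", "dest": "F0", "value": "M[56]", "done": True},
]

action = commit_step(rob, regs)
print(action, rob, regs)  # flush [] {}
```

The load behind the branch had already produced a value, but because results reach the registers only at commit, discarding it leaves no trace in the architectural state.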


What are the hardware complexities with reorder buffer (ROB)?

[Figure: the reorder buffer datapath (FP Op Queue, Reorder Buffer, FP Regs, reservation stations, FP adders) with an added comparison network]

  • How do you find the latest version of a register?

– (As specified by Smith paper) need associative comparison network
– Could use future file or just use the register result status buffer to track which specific reorder buffer has received the value

  • Need as many ports on ROB as register file

Reorder Table fields: Dest Reg, Result, Exceptions?, Valid, Program Counter
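The two lookup options can be contrasted in software. This is only an analogy under stated assumptions: the loop stands in for the parallel associative comparison network, the dictionary stands in for a register result status table indexed by register, and both function names are hypothetical.

```python
def latest_by_search(rob, reg):
    """Associative-style lookup: scan from youngest to oldest entry."""
    for index in range(len(rob) - 1, -1, -1):
        if rob[index]["dest"] == reg:
            return index
    return None


def build_status_table(rob):
    """Table-style lookup: remember the newest ROB slot per register."""
    status = {}
    for index, entry in enumerate(rob):
        if entry["dest"] is not None:
            status[entry["dest"]] = index  # newer entries overwrite older ones
    return status


# Destinations of two loop iterations in reorder buffer order (oldest first).
rob = [
    {"dest": "F0"},   # LD    iter 1
    {"dest": "F4"},   # MULTD iter 1
    {"dest": None},   # SD    iter 1
    {"dest": "F0"},   # LD    iter 2
    {"dest": "F4"},   # MULTD iter 2
]

status = build_status_table(rob)
print(latest_by_search(rob, "F0"), latest_by_search(rob, "F4"))  # 3 4
print(status)  # {'F0': 3, 'F4': 4}
```

Both approaches name the same entries; the search compares against every entry on each lookup, while the table pays one update per issued instruction and then answers with a single indexed read.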


Summary

  • Reservation stations: implicit register renaming to a larger set of registers + buffering source operands

– Prevents registers as bottleneck
– Avoids WAR, WAW hazards of Scoreboard
– Allows loop unrolling in HW

  • Not limited to basic blocks (integer unit gets ahead, beyond branches)
  • Today, helps cache misses as well

– Don’t stall for L1 Data cache miss (insufficient ILP for L2 miss?)

  • Lasting Contributions

– Dynamic scheduling
– Register renaming
– Load/store disambiguation

  • 360/91 descendants are Pentium III; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264