Parallel architectures Electronic Computers LM Parallelism 1 - - PowerPoint PPT Presentation



slide-1
SLIDE 1

1

Parallel architectures

Electronic Computers LM

Parallelism

slide-2
SLIDE 2

2

Architecture

  • Architecture: functional behaviour of a computer. For instance, a processor which executes DLX code.

  • Implementation: a logic network implementing the architecture. It is also called the microarchitecture. There are many implementations of the same architecture. Example: the x86 family.

  • Synthesis: a physical implementation. There are many possible syntheses of the same implementation (for instance, in different technologies).

The architecture is defined by the machine language, that is, the instruction set (assembly language). Instruction Set Architecture -> ISA

The ISA varies slowly while the implementations change rapidly (see for instance IA8, IA16, IA32…). The longer an ISA survives, the more programs are written for it, and therefore compatibility becomes the main issue.

Parallelism

slide-3
SLIDE 3

3

Parallelism

Instruction level parallelism

  • Sequential

A single instruction executed at a time

  • Pipelined

Multiple instructions executed simultaneously

  • Superpipelined

Multiple stages for each operation (EX, MEM etc.) in order to increase the clock frequency (e.g. Pentium IV)

  • Scalar

A single pipeline

  • Superscalar

Multiple pipelines; many instructions started at the same time. Possible out-of-order execution (run-time decision)

  • Very Long Instruction Word

Multiple pipelines; many instructions started at the same time. Instruction order decided at compile time

  • Superscalar superpipelined (e.g. Pentium IV, I5, I7 etc.)

Parallelism

slide-4
SLIDE 4

4

Parallelism architectures

  • Memory level parallelism

A memory able to provide multiple data at different addresses at the same time (outstanding requests - DDR2, DDR3 etc.)

  • Multicore (core level parallelism)

Many processors in the same chip (e.g. Core Duo, Nehalem, Sandy Bridge…)

  • Multithread (thread level parallelism)

Pipelines of the same processor used by different processes at the same time (time sharing), as if it were a multicore (e.g. Pentium IV, Nehalem, Sandy Bridge etc.)

Parallelism

slide-5
SLIDE 5

5

Deep Pipeline (Superpipeline)

[Figure: the five-stage pipeline (Fetch, Decode, Execute, Memory, Writeback) next to its superpipelined version, with the branch penalty marked on each]

  • Each stage is subdivided into three substages: higher clock frequency but higher branch penalty

  • Higher power consumption!
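The cost of the longer branch penalty can be put in numbers with a rough sketch. The formula below (base CPI plus branch fraction times misprediction rate times penalty) is a common first-order model rather than something stated in the slides, and every figure in it (20% branches, 10% mispredicted, penalties of 2 and 6 cycles) is a hypothetical value chosen only for illustration.

```python
def effective_cpi(branch_fraction, mispredict_rate, penalty_cycles, base_cpi=1.0):
    """First-order model: each mispredicted branch costs `penalty_cycles` extra clocks."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload: 20% of the instructions are branches, 10% of them mispredicted.
shallow = effective_cpi(0.20, 0.10, penalty_cycles=2)  # assumed penalty, 5-stage pipeline
deep = effective_cpi(0.20, 0.10, penalty_cycles=6)     # each stage split into three substages

print(f"CPI with the shallow pipeline: {shallow:.2f}")
print(f"CPI with the deep pipeline:    {deep:.2f}")
```

Under these assumptions the deep pipeline pays off only if its clock frequency gain exceeds the CPI increase.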

Parallelism

slide-6
SLIDE 6

6

Parallel pipelines

  • Sequential
  • Time parallelism: pipeline
  • Space parallelism: VLIW
  • Space-time parallelism (e.g. I5, I7…)

Parallelism

slide-7
SLIDE 7

7

Diversified pipelines - 1

[Figure: stages IF, ID, RD, then the EX stage made of four dedicated pipelines (ALU; MEM1, MEM2; FP1, FP2, FP3; BR), then WB. F => Floating point]

  • Multi-instruction buffer to avoid blocking the pipelines.
  • Dedicated pipelines.
  • The instruction sequence is defined at compile time. Careful compilation is fundamental in order to avoid underexploiting the pipelines.
  • Problem: different execution times
  • Problem: instruction interdependency

slide-8
SLIDE 8

8

Diversified pipelines - 2

[Figure: stages IF, ID, RD (“in order” execution), Dispatch Buffer, EX stage with the dedicated pipelines ALU; MEM1, MEM2; FP1, FP2, FP3; BR («out of order» execution), Reorder Buffer, WB (“in order” retirement)]

Parallelism

slide-9
SLIDE 9

9

Floating Point DLX – F instructions

[Figure: DLX pipeline with multicycle stages. After IF and ID an instruction enters one of four execution paths and then goes on to MEM and WB:
  • Integer unit: EX
  • FP multiplier: M1 M2 M3 M4 M5 M6 M7 (pipelined)
  • FP adder: A1 A2 A3 A4 (pipelined)
  • FP/integer divider (e.g. 24 clock cycles, one instruction executed at a time)]

slide-10
SLIDE 10

10

DLX revisited

  • Example (no interdependency between the instructions in this sequence)

FMUL F1, F2, F2
FADD F3, F4, F5
FLD  F6, 10(R8)
FST  40(R10), F9

FMUL  IF  ID  M1  M2  M3  M4  M5  M6  M7  MEM WB
FADD      IF  ID  A1  A2  A3  A4  MEM WB
FLD           IF  ID  EX  MEM WB
FST               IF  ID  EX  MEM (WB)

(In the original slide the stages where the operands are needed are shown in violet, the stages where results are produced in green, and the execution stages in red squares. Callouts mark where data are written and where data are required for computing the address.)

  • Because of the different instruction execution times, Read After Write (RAW) hazards are more frequent
  • Very important structural changes (more intermediate registers, a more complex ID stage to send each instruction to the appropriate execution stage)
  • Hazard problems: the instructions do not end in the order of their issue.
  • Since the divider is normally a single functional unit, stalls of up to 40 clocks may occur in this case
  • Multiple instructions at the same time in the same stage (in particular in WB)
  • Write After Write (WAW) hazards: e.g. if a FADD F6, F4, F5 (four EX cycles) directly preceded a FLD F6, 10(R8) (one EX cycle), a write sequence error on the same destination register would occur (although in this case the compiler would have dropped the FADD as useless)
  • Instructions are not completed in order
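The out-of-order completion of the example can be checked with a few lines of Python. This is only a sketch: it assumes one instruction fetched per clock with no stalls, and the stage counts are read off the pipeline diagram of this slide (FST has no real write-back, so its WB cycle is nominal).

```python
# Number of execution stages per instruction, as in the diagram above.
EX_STAGES = {"FMUL": 7, "FADD": 4, "FLD": 1, "FST": 1}


def wb_cycle(fetch_cycle, op):
    """IF and ID take one cycle each, then the EX stages, then MEM, then WB."""
    return fetch_cycle + 2 + EX_STAGES[op] + 1


sequence = ["FMUL", "FADD", "FLD", "FST"]  # program order, one fetch per clock
wb = {op: wb_cycle(cycle, op) for cycle, op in enumerate(sequence, start=1)}
completion_order = sorted(wb, key=wb.get)

print(wb)
print(completion_order)
```

The last instruction fetched among the FP ones finishes first and FMUL, fetched first, finishes last.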

slide-11
SLIDE 11

11

DLX revisited

  • For WAW hazards consider the following example

FMUL F0, F4, F6   IF  ID  M1  M2  M3  M4  M5  M6  M7  MEM WB
…                     IF  ID  EX  MEM WB
…                         IF  ID  EX  MEM WB
FADD F2, F4, F6               IF  ID  A1  A2  A3  A4  MEM WB
…                                 IF  ID  EX  MEM WB
…                                     IF  ID  EX  MEM WB
FLD  F2, 0(R2)                            IF  ID  EX  MEM WB

Multiple RF write operations take place in the same clock. If FADD had started one clock later, a Write After Write hazard would have taken place!

  • To cope with simultaneous write operations to different registers, the number of input ports of the RF can be increased (expensive) or stalls must be introduced (normally in the MEM or WB stages, so as to choose the instructions to be stalled). More complex pipelines

  • RAW hazards are solved through forwarding

Normally the hazards are detected in the ID stage, considering the preceding and following instructions, so as to introduce the required stalls (in this case FLD would have been stalled one clock). Hazards normally occur among homogeneous registers (FP or integer), except for FLOAD and FSTOR, which use integer registers for address computation.
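A small check of the write-back timing makes the two problems of this slide visible. This is a sketch under stated assumptions: the fetch cycles (1, 4 and 7) are read off the diagram, no stalls are modelled, and `check_writes` is a hypothetical helper name.

```python
EX_STAGES = {"FMUL": 7, "FADD": 4, "FLD": 1}


def wb_cycle(fetch_cycle, op):
    """IF and ID take one cycle each, then the EX stages, then MEM, then WB."""
    return fetch_cycle + 2 + EX_STAGES[op] + 1


def check_writes(instrs):
    """instrs: (fetch_cycle, op, dest) tuples in program order."""
    writes = [(wb_cycle(c, op), op, dest) for c, op, dest in instrs]
    cycles = [w for w, _, _ in writes]
    # More than one instruction in WB during the same clock.
    port_conflict = len(set(cycles)) < len(cycles)
    # WAW: a later instruction writes the same register BEFORE an earlier one.
    waw = [(first[1], second[1], first[2])
           for i, first in enumerate(writes)
           for second in writes[i + 1:]
           if first[2] == second[2] and second[0] < first[0]]
    return port_conflict, waw


slide_case = [(1, "FMUL", "F0"), (4, "FADD", "F2"), (7, "FLD", "F2")]
late_fadd = [(1, "FMUL", "F0"), (5, "FADD", "F2"), (7, "FLD", "F2")]

print(check_writes(slide_case))  # simultaneous writes, no WAW
print(check_writes(late_fadd))   # FADD one clock later: WAW on F2
```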

slide-12
SLIDE 12

12

DLX revisited

  • How can we guarantee that the final result is that of the program?
  • In the previous case FLD F2, 0(R2) must be stalled until FADD F2, F4, F6 has reached the MEM stage. It must however be assumed that between the two instructions there is at least one instruction using, through forwarding, the result of FADD F2, F4, F6; otherwise the compiler would have dropped the FADD!
  • The situation would have been even worse if FLD had been completed before the FADD.
  • In any case it is always possible that different instructions are completed in an order different from that of their issue

Parallelism

slide-13
SLIDE 13

Compiler

13

Let’s consider these high-level language statements

X = Y + Z
A = B * C

to be executed in a processor with the following pipeline (in-order emission):

Fetch (F), Decode (D), Issue (I), Execute (E), Execute (E), Execute (E), Writeback (W)

The issue of the addition (multiplication) is possible only AFTER the execution of the previous instruction calculating R2 (R5), that is, after its last EX stage, possibly with forwarding.

[Figure: timing diagram of the instruction sequence, with callouts: busy decoder; the issue is possible here since the data for R1 and R2 have already been produced; multiply waits for results (RAW stalls, data not available); D freed by the addition; busy decoder (RAW); addition result not yet ready]

slide-14
SLIDE 14

Compiler

14

But we can modify the emission order without modifying the result: 16 cycles instead of 22!

[Figure: timing diagrams before and after the reordering, on the pipeline Fetch (F), Decode (D), Issue (I), Execute (E), Execute (E), Execute (E), Writeback (W), with callouts: waiting for R5; busy decoder; waiting for R6; emission possible since R1 and R2 are already ready]

slide-15
SLIDE 15

Multicycle hazards

15

Let’s suppose we have an FP adder (1 cycle, in red) and a multiplier (3 cycles, in green).

I1  F1 = F2 + F3
I2  F2 = F4 x F5
I3  F3 = F3 + F4
I4  F6 = F6 x F6
I5  F1 = F3 + F5
I6  F2 = F3 + F4

[Figure: timing chart of I1 … I6 over the cycles T1 … T10]

[Figure: dependency graph among the instructions, with the edges
  I1 -> I2  WAR (F2)
  I2 -> I6  WAW (F2)
  I1 -> I3  WAR (F3)
  I3 -> I6  RAW (F3)
  I3 -> I5  RAW (F3)
  I1 -> I5  WAW (F1)]

NB: in this graph the hazards are potential, since only the registers are considered, no matter how many cycles the executions require
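The potential hazards can be listed mechanically by comparing only the register names of every ordered pair of instructions, as the note says. The sketch below does exactly that for I1 … I6; the encoding of the program as a dictionary and the function name are choices made here for illustration. Being exhaustive over all pairs, it may report pairs that a hand-drawn graph leaves out.

```python
from itertools import combinations

# instruction -> (destination register, source registers), in program order
PROGRAM = {
    "I1": ("F1", ("F2", "F3")),
    "I2": ("F2", ("F4", "F5")),
    "I3": ("F3", ("F3", "F4")),
    "I4": ("F6", ("F6", "F6")),
    "I5": ("F1", ("F3", "F5")),
    "I6": ("F2", ("F3", "F4")),
}


def potential_hazards(program):
    """Compare register names only: execution times are ignored."""
    found = []
    for (a, (dest_a, src_a)), (b, (dest_b, src_b)) in combinations(program.items(), 2):
        if dest_a in src_b:
            found.append((a, b, "RAW", dest_a))  # b reads what a writes
        if dest_b in src_a:
            found.append((a, b, "WAR", dest_b))  # b overwrites what a reads
        if dest_a == dest_b:
            found.append((a, b, "WAW", dest_a))  # both write the same register
    return found


hazards = potential_hazards(PROGRAM)
for first, second, kind, reg in hazards:
    print(f"{first} -> {second}: {kind} ({reg})")
```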

slide-16
SLIDE 16

Dynamic instructions scheduling

16

  • Systems with out-of-order execution but commitment always in order
  • Temporal dependencies (hazards) not known at compile time
  • It allows the execution of the code on different pipelines and on superscalar processors with no implications for the compiler.
  • It allows the execution of instructions ahead of their position (in the following case FSUB F12, F8, F14) if the conditions allow it

FDIV F0, F2, F4
FADD F10, F0, F8    (RAW: must wait for F0)
FSUB F12, F8, F14   (can be executed anyway)

slide-17
SLIDE 17

17

Scoreboard

  • Consider the following sequence

FDIV F0, F2, F4
FADD F10, F0, F8
FSUB F8, F8, F14

FADD reads F0 written by FDIV: Read After Write (RAW). FADD and FSUB must read the same value of F8, which FSUB then overwrites: Write After Read (WAR).

  • There is an antidependency (WAR hazard) between FADD and FSUB: should FSUB end before FADD has read F8, an error would occur (F8 already updated)

  • A possible Write After Write (WAW) hazard would occur if F10 instead of F8 had been used as the destination of FSUB (in case FSUB ended before FADD)

  • “Scoreboard” technique: the goal is to complete one instruction per clock, executing each instruction as soon as possible.

slide-18
SLIDE 18

18

Scoreboard

[Figure: the registers connected to the functional units (FP MUL, FP MUL, FP DIV, FP ADD, INTEG), all controlled by the scoreboard]

The scoreboard is somehow equivalent to the ID stage (just after the fetch) and determines when an instruction can read its operands and start its execution. The scoreboard considers all system state changes and decides when the first instruction in the FIFO queue (as produced by the compiler) can be started.

Parallelism

slide-19
SLIDE 19

19

Scoreboard

  • Obviously some stalls can be induced because the number of busses available for transfers is small

  • The four stages equivalent to ID, EX and WB in DLX are:

1. Emission: if a functional unit for the instruction is available (free), the instruction is issued, unless another functional unit already holds an instruction which must write into the same destination register (therefore no WAW hazards). In this latter case the instruction is stalled, which blocks the emission of all the following instructions in the prefetch queue even when all other conditions for them are met! The operands need not be ready at emission: as the cycle-by-cycle example shows, they are awaited in stage 2.
2. Operand read: the instruction has been emitted. If the operands are available and no already executing instruction must write them, they are read; otherwise the instruction stalls in the functional unit.
3. Execution: when the result has been computed, the scoreboard is informed so as to unblock a possibly waiting instruction.
4. Write result: in case of a possible WAR, the instruction is stalled and does not write the result if there is a previous instruction which has not yet read its operands and one of them is the destination register of the considered instruction. Once the operand has been read, the result can be written.

  • It must be noticed that with this organisation forwarding is avoided, since the results are written as soon as they are produced (except for the WAR wait of point 4). The scoreboard technique allows instructions to pass directly from the EX to the WB stage (reducing the RAW risks).
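The three checks that can stall an instruction (stages 1, 2 and 4) can be sketched as simple predicates. This is a simplified illustration, not a full scoreboard: the function names and the use of plain dictionaries and sets are assumptions made here, and the sample state is the one shown in the cycle 7 and cycle 17 slides of the example that follows.

```python
def can_issue(fu_busy, result_status, fu, dest):
    """Stage 1 (emission): the unit is free and nobody else will write `dest` (no WAW)."""
    return not fu_busy[fu] and result_status.get(dest) is None


def can_read_operands(result_status, sources, own_fu):
    """Stage 2: no other unit still has to produce one of the source registers (RAW)."""
    return all(result_status.get(src) in (None, own_fu) for src in sources)


def can_write_result(dest, registers_still_to_be_read):
    """Stage 4: no earlier instruction still has to read the old value of `dest` (WAR)."""
    return dest not in registers_still_to_be_read


# Cycle 7, just before SUBD F8, F6, F2 is issued: LD F2 and MULTD F0 are in flight.
fu_busy = {"Integer": True, "Mult1": True, "Mult2": False, "Add": False, "Divide": False}
result_status = {"F2": "Integer", "F0": "Mult1"}  # register -> unit that will write it

subd_issues = can_issue(fu_busy, result_status, "Add", "F8")
result_status["F8"] = "Add"  # once issued, SUBD owns F8
subd_reads = can_read_operands(result_status, ["F6", "F2"], "Add")

# Cycle 17: ADDD has its result, but DIVD has not yet read F0 and F6.
addd_writes = can_write_result("F6", {"F0", "F6"})

print(subd_issues, subd_reads, addd_writes)
```

SUBD can be emitted but must wait for F2 before reading its operands, and ADDD must hold its result until DIVD has read F6.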

Parallelism

slide-20
SLIDE 20

20

An example

Hypothetical timing for the different instructions (which includes the operand read and the execution)

FLD          1 cycle
FADD, FSUB   2 cycles
FMUL         10 cycles
FDIV         40 cycles

LD   F6, 34(R2)
LD   F2, 45(R3)
FMUL F0, F2, F4    (MULTD)
FSUB F8, F6, F2    (SUBD)
FDIV F10, F0, F6   (DIVD)
FADD F6, F8, F2    (ADDD)

[Figure callouts: RAW, WAR and RAW arrows between the instructions; the LD instructions use the integer unit]

Do you find more hypothetical hazards? For instance, what about F0?

slide-21
SLIDE 21

21

Scoreboard entities

Instruction stages: emission, operand read, execution and writeback

Status of the functional units (FU): 9 parameters

  Busy     Unit busy
  Op       Operation code presently executed
  Fi       Destination (result) register of the instruction
  Fj, Fk   Source registers of the operands
  Qj, Qk   Functional units producing the required operands (if not yet ready) for the registers Fj and Fk
  Rj, Rk   Flags (yes) indicating whether Fj, Fk have already been updated

Result status register: indicates which functional unit will write each register. Empty when no functional unit has to do with the specific register

N.B. It must be remembered that in case of a possible WAW the emission of the instructions is stalled (point 1 of the rules)

N.B. In the following example we suppose that two multiplication/division units are available
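The nine parameters map naturally onto a small record type. The sketch below is one possible encoding (the class and field names are chosen here, not taken from the slides); the sample values are those of the Divide unit and of the register result status in the cycle 8 slide of the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FunctionalUnitStatus:
    busy: bool = False
    op: Optional[str] = None  # operation code presently executed
    fi: Optional[str] = None  # destination register
    fj: Optional[str] = None  # first source register
    fk: Optional[str] = None  # second source register
    qj: Optional[str] = None  # unit producing Fj (None if already available)
    qk: Optional[str] = None  # unit producing Fk (None if already available)
    rj: bool = False          # Fj ready to be read?
    rk: bool = False          # Fk ready to be read?


# Divide unit right after DIVD F10, F0, F6 has been issued (cycle 8).
divide = FunctionalUnitStatus(busy=True, op="Div", fi="F10", fj="F0", fk="F6",
                              qj="Mult1", rj=False, rk=True)

# Result status register: register -> unit that will write it (absent = nobody).
result_status = {"F0": "Mult1", "F8": "Add", "F10": "Divide"}

operands_ready = divide.rj and divide.rk
print(operands_ready)  # DIVD still waits for F0 from Mult1
```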

Parallelism

slide-22
SLIDE 22

22

Example (here we assume that F0 is a “normal”register and not always “0”)

[Table template, filled in cycle by cycle in the following slides]

Instruction status. Columns: Instruction, j, k, Issue, Read Op, Execution complete, Write Result. Rows:

LD     F6   34  R2
LD     F2   45  R3
MULTD  F0   F2  F4
SUBD   F8   F6  F2
DIVD   F10  F0  F6
ADDD   F6   F8  F2

Functional unit status (FU = Functional Unit). Columns: Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk. Rows: Integer, Mult1, Mult2, Add, Divide (1 integer unit, 2 multiplication units, 1 add/sub unit, 1 division unit).

  Time     number of execution cycles yet to elapse
  Fi       destination register
  Fj, Fk   source 1, source 2
  Qj, Qk   FU for j, FU for k
  Rj, Rk   Fj ready? Fk ready?

Register result status. Columns: Clock (the progression clock of the instruction states), F0, F2, F4, F6, F8, F10, F12, ..., F31. Row FU: the functional unit producing the result for the floating point register Fx.

Rj and Rk indicate whether the data can be read (possibly in the next cycle, if just produced) from the source registers of the executed instruction. Qj and Qk are the functional units which produce them (if not yet ready). Fj and Fk are the registers where the data produced by Qj and Qk are stored (or will be stored in the next cycle; the data are available if the corresponding Rj or Rk is yes), to be used by the executed instruction.

NB: LD = FLD, MULTD = FMUL, SUBD = FSUB, DIVD = FDIV, ADDD = FADD

FLD 1 cycle; FADD, FSUB 2 cycles; FMUL 10 cycles; FDIV 40 cycles

slide-23
SLIDE 23

23

Cycle 1

Instruction status Read Execution Write Instruction j k Issue Op/Excomplete Result LD F6 34 R2 LD F2 45 R3 MULTDF0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status dest

S1 S2 FUj FUk Fj? Fk?

Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 No Mult2 No Add No Divide No Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Integer

Integer entry at clock 1: Busy Yes, Op Load, Fi F6, Fk R2, Rk Yes. Register result status: F6 is produced by Integer (the functional unit used for producing the result in F6).

R2 is supposed to be already available and can therefore be used in the next clock. LD uses the integer unit.

At clock 1 the instruction stage of LD F6, 34(R2) is Issue.

(In the original slide, state changes are shown in brown.)

slide-24
SLIDE 24

24

Cycle 2

Instruction status Read Execution Write Instruction j k Issue Op/Ex complete Result LD F6 34 R2 1 2 LD F2 45 R3 MULTDF0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F6 R2 Mult1 No Mult2 No Add No Divide No Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Integer

Data ready in R2: the instruction can proceed.

NB: the second LD cannot be emitted because the only integer unit is busy, and the same applies to MULTD because instructions must be emitted in order.

slide-25
SLIDE 25

25

Cycle 3

Instruction status Read Execution Write Instruction j k Issue complete Result LD F6 34 R2 1 2 3 LD F2 45 R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F6 R2 Mult1 No Mult2 No Add No Divide No Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Integer

FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles

Parallelism

Op/Ex 3

slide-26
SLIDE 26

26

Cycle 4

Instruction status Read Execution Write Instruction j k Issue complete Result LD F6 34 R2 1 2 3 LD F2 45 R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load R2 Mult1 No Mult2 No Add No Divide No Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU 4 F6

The register has been written at the end of the period, and the functional unit is freed at the end of the period.

The change of status of the FUs indicates their value at the positive clock edge ending the current cycle (future status). For instance, the integer functional unit is freed at the end of cycle 4, together with the writeback of the result. LD F6, 34(R2) disappears totally from the scoreboard at the positive clock edge ending cycle 4.

slide-27
SLIDE 27

27

Cycle 5

Instruction status Read Execution Write Instruction j k Issue complete Result LD F6 34 R2 1 2 3 LD F2 45 R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status dest Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 No Mult2 No Add No Divide No Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 RU S1 S2 RUj RUk Rj? Rk?

Integer entry at clock 5: Busy Yes, Op Load, Fi F2, Fk R3, Rk Yes. Register result status: F2 is produced by Integer.

R3 is supposed to be already ready, as in the previous case. The integer functional unit must produce a new value for F2. At the beginning of cycle 5 the integer unit is already free, so LD F2, 45(R3) can start.

slide-28
SLIDE 28

28

Cycle 6

Instruction status Read ExecutionWrite Instruction j k Issue complete Result LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status dest Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 Mult1 Mult2 No Add No Divide No Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Integer

S1 S2 FUj FUk Fj? Fk?

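Mult1 entry at clock 6: Busy Yes, Op Mult, Fi F0, Fj F2, Fk F4, Qj Integer, Rj No, Rk Yes. Register result status: F0 is produced by Mult.

F4 is supposed to be already present. MULTD waits for F2 from the integer unit!

MULTD F0, F2, F4 can start because its FU is free and its destination register F0 is not awaited from any other unit.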

slide-29
SLIDE 29

29

Cycle 7

MULTD stalled in the execution unit because F2 not yet ready.

Instruction status Read ExecutionWrite Instruction j k Issue complete Result LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 MULTD F0 F2 F4 6 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status dest Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 Mult1 Yes Mult F0 F2 F4 Integer No Yes Mult2 No Add Divide No Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Mult Integer

S1 S2 FUj FUk Fj? Fk?

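(NB: the FP adder executes FP subtractions too)

Add entry at clock 7: Busy Yes, Op Subd, Fi F8, Fj F6, Fk F2, Qk Integer, Rj Yes, Rk No. Register result status: F8 is produced by Add.

SUBD F8, F6, F2 can start because the FP add/subtract unit is free. Like MULTD, SUBD is then stalled in its unit because F2 is not yet ready.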

slide-30
SLIDE 30

30

Cycle 8

Instruction status (columns: Instruction, j, k, Issue, Read Op, EX complete, Write Result)

LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTDF0 F2 F4 6 SUBD F8 F6 F2 7 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status dest Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 Yes Mult F0 F2 F4 Yes Mult2 No Add Yes Sub F8 F6 F2 Yes Divide Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Mult1 Add

S1 S2 FUj FUk Fj? Fk?

Divide entry at clock 8: Busy Yes, Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj No, Rk Yes. Register result status: F10 is produced by Divide.

DIVD F10, F0, F6 can start because the FP divide FU is free; F0 is not yet available.

F2 is written, and therefore the integer unit (entry Load F2 R3) is freed; the status is updated at the end of the cycle. F2 is now available: this allows MULTD and SUBD (ready flags now Yes) to read their operands during the next cycle.

slide-31
SLIDE 31

31

Cycle 9 - 10

Instruction status Read EX Write Instruction j k Issue

complete Result

LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 SUBD F8 F6 F2 7

N.B.: MULTD and SUBD can read their operands because F2 is available (see cycle 8). DIVD is still stalled because of F0.

9 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 10 clock Mult1 Yes Mult F0 F2 F4 Mult2 No 2 clock Add Yes Sub F8 F6 F2 Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Mult1 Add Divide 40 clock

Parallelism ADDD cannot start because SUBD uses the adder FU

Op/Ex

9-10

slide-32
SLIDE 32

32

Cycle 11

Note: the Add FU requires 2 cycles for the SUBD, and therefore nothing happens in cycle 10, while MULTD is still processing its data. NB: ADDD will use the result of the SUBD but has not yet started, because SUBD occupies the Add FU.

Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No

8 clocks more

Mult1 Yes Mult F0 F2 F4 Mult2 No 0 Add Yes Sub F8 F6 F2 Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Mult1 Add Divide

Instruction status Read EX Write Instruction j k Issue

complete Result

LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11

Parallelism Op/Ex

11

slide-33
SLIDE 33

33

Cycle 12

Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name BusyOp Fi Fj Fk Qj Qk Rj Rk Integer No

7 clocks more

Mult1 Yes Mult F0 F2 F4 Mult2 No Add No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Mult1 Divide

Instruction status Read EX Write Instruction j k Issue

completeResult

LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11

SUBD ends freeing the FU. In the next period ADDD can start

12

F8 is written and the ADD/SUB FU is freed. FLD 1 cycle; FADD and FSUB 2 cycles; FMUL 10 cycles; FDIV 40 cycles

Parallelism Op/Ex

12

slide-34
SLIDE 34

34

Cycle 13

Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name BusyOp Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 No Add Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Mult1 Divide

Instruction status Read EX Write Instruction j k Issue Op/Ex complete Result LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11 12

Now ADDD can start because SUBD has finished its execution and has freed the FU

Yes Add F6 F8 F2 Yes Yes Add

13

6 Clocks more

FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles

Parallelism

13

slide-35
SLIDE 35

35

Cycle 14

Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name BusyOp Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 No Add Yes Add F6 F8 F2 Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Mult1 Add Divide

Instruction status Read EX Write Instruction j k Issue

completeResult

LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11 12 13 14

5 clocks more 2 Clocks more

FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles

Parallelism Op/Ex

14

slide-36
SLIDE 36

36

Cycle 15

ADDD requires two cycles and therefore no system status change

Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name BusyOp Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 No Add Yes Add F6 F8 F2 Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Mult1 Add Divide

Instruction status Read EX Write Instruction j k Issue

complete Result

LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11 12 13 14

4 Clocks more 1 Clock more

FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles

Parallelism Op/Ex

15

slide-37
SLIDE 37

37

Cycle 16

Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 No Add Yes Add F6 F8 F2 Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Mult1 Add Divide

Instruction status Read EX Write Instruction j k Issue

completeResult

LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11 12 13 14

ADDD ended its EX stage while MULTD and DIVD keep executing

16

3 clocks more

FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles

Parallelism Op/Ex

16

slide-38
SLIDE 38

38

Cycle 17

Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 No Add Yes Add F6 F8 F2 Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Mult1 Add Divide

Instruction status Read EX Write Instruction j k Issue

completeResult

LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11 12 13 14 16

NB! ADDD is stalled (it cannot write) because of a WAR hazard with DIVD on F6. DIVD has not yet read F6 because it is waiting for F0, produced by MULTD (the operands are read in parallel). MULTD keeps executing while DIVD keeps waiting.

2 Clocks more

FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles

Parallelism Op/Ex

17

slide-39
SLIDE 39

39

Cycle 18

MULT still executing DIVD still stalled

Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 No Add Yes Add F6 F8 F2 Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Mult1 Add Divide

Instruction status Read EX Write Instruction j k Issue

complete Result

LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11 12 13 14 16

1 clock more

FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles

Parallelism Op/Ex

18

slide-40
SLIDE 40

40

Cycle 19

Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 No Add Yes Add F6 F8 F2 Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Mult1 Add Divide

Instruction status Read EX Write Instruction j k Issue

completeResult

LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11 12 13 14 16

MULT ends its execution, will write in cycle 20 (after 10 cycles) which will unblock DIVD and then ADDD

19

FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles

Parallelism Op/Ex

19

slide-41
SLIDE 41

41

Cycle 20

Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add Yes Add F6 F8 F2 Divide Yes Div F10 F0 F6 Yes Yes Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Add Divide

Instruction status Read EX Write Instruction j k Issue

completeResult

LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11 12 13 14 16 19

MULTD writes F0

unblocking DIVD

20

FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles

Parallelism Op/Ex

20

slide-42
SLIDE 42

42

Cycle 21

Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add Yes Add F6 F8 F2 Divide Yes Div F10 F0 F6 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Add Divide

Instruction status Read EX Write Instruction j k Issue

complete Result

LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11 12 13 14 16 19 20

DIVD reads both F0 and F6 (which could not be written by ADDD because of WAR) unblocking ADDD which can write F6 in the next cycle

21

Parallelism Op/Ex

21

slide-43
SLIDE 43

43

Cycle 22

Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add No Divide Yes Div F10 F0 F6 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Divide

Instruction status Read EX Write Instruction j k Issue

complete Result

LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11 12 13 14 16 19 20 21

Now ADDD can write F6, since the WAR hazard with DIVD has disappeared. For 6 cycles ADDD could not write F6 although its result was available.

22

Op/Ex

22

slide-44
SLIDE 44

44

Cycle 61

Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add No Divide Yes Div F10 F0 F6 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Divide

Instruction status Read EX Write Instruction j k Issue

complete

LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11 12 13 14 16 19 20 21 22

DIVD execution ends after 40 cycles

61 Result

Parallelism Op/Ex

61

slide-45
SLIDE 45

45

Cycle 62

All executions ended

Functional unit status at clock 62

Name      Busy
Integer   No
Mult1     No
Mult2     No
Add       No
Divide    No

Register result status at clock 62: no functional unit has a pending write to any of the registers F0 ... F31.

Instruction status

Instruction          Issue   Read Op   EX complete   Write Result
LD    F6   34  R2      1        2           3             4
LD    F2   45  R3      5        6           7             8
MULTD F0   F2  F4      6        9          19            20
SUBD  F8   F6  F2      7        9          11            12
DIVD  F10  F0  F6      8       21          61            62
ADDD  F6   F8  F2     13       14          16            22
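The whole timeline of the example can be reproduced with a short single-pass model. This is a simplified sketch, not a cycle-accurate scoreboard: it relies on the fact that, with in-order emission, every wait depends only on earlier instructions, and it assumes the "+1 cycle" conventions visible in the preceding slides (a unit or register freed by a write in cycle n is usable from cycle n+1). Function and variable names are choices made here.

```python
LATENCY = {"LD": 1, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}
UNIT_CLASS = {"LD": "Integer", "ADDD": "Add", "SUBD": "Add",
              "MULTD": "Mult", "DIVD": "Divide"}
UNITS = {"Integer": 1, "Mult": 2, "Add": 1, "Divide": 1}

PROGRAM = [  # (operation, destination, source registers)
    ("LD", "F6", ["R2"]),
    ("LD", "F2", ["R3"]),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]


def scoreboard_times(program):
    free_from = {cls: [1] * n for cls, n in UNITS.items()}  # first free cycle per unit
    last_write = {}  # register -> cycle of its latest write
    last_read = {}   # register -> latest cycle in which it is read
    prev_issue = 0
    rows = []
    for op, dest, srcs in program:
        units = free_from[UNIT_CLASS[op]]
        u = units.index(min(units))
        # Stage 1: in order, unit free, no pending write to the same register (WAW).
        issue = max(prev_issue + 1, units[u], last_write.get(dest, 0) + 1)
        # Stage 2: wait until every source has been written (RAW).
        read = max([issue + 1] + [last_write.get(s, 0) + 1 for s in srcs])
        # Stage 3: execution.
        complete = read + LATENCY[op]
        # Stage 4: wait until earlier instructions have read the old value (WAR).
        write = max(complete + 1, last_read.get(dest, 0) + 1)
        units[u] = write + 1
        last_write[dest] = write
        for s in srcs:
            last_read[s] = max(last_read.get(s, 0), read)
        prev_issue = issue
        rows.append((op, issue, read, complete, write))
    return rows


rows = scoreboard_times(PROGRAM)
for row in rows:
    print(row)
```

Under these assumptions the model yields the same issue, read, completion and write cycles as the instruction status table of the example.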

slide-46
SLIDE 46

46

Scoreboard limits

  • An instruction can be emitted only if all previous instructions have been emitted

FDIV  F0, F2, F4
FADD  F6, F0, F8
FSTOR F6, 0(R1)
FSUB  F8, F10, F14
FMUL  F6, F10, F8

[Callouts in the original slide mark the RAW, WAR and WAW dependencies among these instructions]

N.B. The hazards of the sequence are only potential: their occurrence depends on the instruction execution times

  • Register values must in any case be read in parallel and only from the register file (which means that they must already have been stored in the registers: no RAW problem)

slide-47
SLIDE 47

47

Renaming – Tomasulo Algorithm

Tomasulo algorithm: “renaming” is based on the concept of “reservation stations”, which are buffers of the functional units where instructions can be «parked» while waiting for the requested FU and the needed data to become available.

«Renaming» indicates a location different from the RF where a requested datum is produced/stored and can be obtained. The name «renaming» is used because it is as if the source registers of an instruction were renamed.

The following benefits occur:

  • A reservation station is a place in a FU where an instruction emitted from the instruction queue waits until the FU is free and the needed data arrive, as soon as they are produced (N.B. before being written into the RF). For each operand, EITHER the source register data OR the reservation station producing it is indicated (whence renaming). The renaming occurs at run time.

  • A reservation station captures a required operand exactly when and where it is produced (without waiting until it is written, thus avoiding the register file access). This is similar to forwarding.

  • When multiple writes to the same register occur (WAW, possible only if multiple busses between FUs and RF are available), only the most recently produced data are written (for each register a TAG is used, indicating the FU which has the right to write).

  • Hazard detection and execution control are distributed (not centralised as in the scoreboard): only the information stored in the reservation stations of each functional unit determines whether an instruction can execute in that FU, since in any case the source (where the datum is being produced) and NOT the RF is indicated. RAW hazards are no longer a problem, since the requested data are provided as soon as they are produced. The same holds for WAR.

  • Results are transferred directly to the reservation stations of the waiting FUs through the common data busses, without the need to read the RF (multiple reservation stations, in addition to the RF registers, can be accessed at the same time when multiple busses are available).

slide-48
SLIDE 48

48

Tomasulo Algorithm

Tomasulo eliminates not only WAWs but also WARs

FLD F6, 32(R2)
FLD F2, 44(R3)
FMUL F0, F2, F4
FSUB F8, F2, F6
FDIV F10, F0, F6
FADD F6, F8, F2

Renaming (T = functional unit producing the datum):

FLD [T/F6], 32(R2)
FLD F2, 44(R3)
FMUL F0, F2, F4
FSUB F8, F2, [T]
FDIV F10, F0, [T]
FADD F6, F8, F2

Possible WAW. As far as the WAW on F6 between FLD and FADD is concerned, the mechanism guarantees that only the most recent instruction in the RS using a destination register can write the register.

When an instruction is inserted in a RS it is checked whether one or more of its operands are being produced elsewhere by other RS: if so, they are renamed.

Possible WAR. For the FADD a potential WAR with the FDIV could occur if FADD ended before FDIV has read its operands (F8 of FSUB and F2 of FLD were both immediately available for FADD), but since FDIV points for F6 to the RS of FLD F6, 32(R2) and not to the RF the problem does not occur. The same holds for FSUB.
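The hazards of this instruction sequence can be listed mechanically. The following is only a minimal sketch in Python (not part of the original slides): the names `Instr` and `find_hazards` are hypothetical, only the FP registers are considered, and every pair of instructions is compared naively without checking for intermediate redefinitions.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: tuple


# The six instructions of the slide (address registers R2/R3 are ignored here).
PROGRAM = [
    Instr("FLD", "F6", ()),
    Instr("FLD", "F2", ()),
    Instr("FMUL", "F0", ("F2", "F4")),
    Instr("FSUB", "F8", ("F2", "F6")),
    Instr("FDIV", "F10", ("F0", "F6")),
    Instr("FADD", "F6", ("F8", "F2")),
]


def find_hazards(program):
    """Return (kind, earlier index, later index, register) for each dependent pair."""
    hazards = []
    for i, early in enumerate(program):
        for j in range(i + 1, len(program)):
            late = program[j]
            if early.dest in late.srcs:      # later instruction reads what the earlier writes
                hazards.append(("RAW", i, j, early.dest))
            if late.dest in early.srcs:      # later instruction overwrites an earlier source
                hazards.append(("WAR", i, j, late.dest))
            if late.dest == early.dest:      # both write the same register
                hazards.append(("WAW", i, j, late.dest))
    return hazards


for kind, i, j, reg in find_hazards(PROGRAM):
    print(f"{kind} on {reg}: {PROGRAM[i].op} -> {PROGRAM[j].op}")
```

Under these assumptions the sketch reports the WAW on F6 between the first FLD and FADD and the two WARs on F6 (FSUB and FDIV versus FADD) discussed on the slide.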

slide-49
SLIDE 49

49

Tomasulo Algorithm

Very high performance without special compilers

Differences with scoreboard

  • Buffers and controls directly distributed in the FUs (there is no centralized control): the buffers are called “reservation stations”
  • Source register names substituted by pointers to the buffers of the reservation stations (if the requested data are being produced there)
  • “Renaming”: a direct pointer to the sources and not to the register
  • One or more Common Data Buses for sending results to all FUs requiring them
  • Loads and stores considered as FUs (a STORE can also be a source for a RS executing a LOAD)

slide-50
SLIDE 50

50

Tomasulo Algorithm

In this example it is assumed that the MUL unit executes the DIVs too and that the ADD unit executes the SUBs too. LOADs and STOREs are handled as other instructions.

In this example:

  • 3 RS for add/sub
  • 2 RS for mult/div
  • 5 RS for store
  • 5 RS for load

In this example there is only one Data Bus. Please notice that the same Common Data Bus is used also by the RS waiting for data.

Each RS (more than one for each FU) stores an emitted instruction and, for each operand, one of two elements: either the operand value (i.e. read from the RF) or the name of the RS which is producing it (renaming)

For the data produced by the FUs


slide-51
SLIDE 51

51

Tomasulo Algorithm

  • Load buffers are used to store the load addresses
  • Store buffers contain the computed addresses and the data to be written in memory
  • Loads and stores must be executed in sequence if they refer to the same address. In the other cases it is possible to anticipate the LOADs (never the STOREs)
  • In the figure there are 3 phases (each of which can last several clocks):
  • Emission: the instructions are extracted in order from the general instruction queue when there is a free RS for the requested FU (the only condition); otherwise the instruction queue stalls. Operands are taken from the RF or from the producing FU as indicated. In case of WAW it must be determined which instruction must provide the data
  • Execution: if one or more operands are not yet available the CDB(s) must be monitored (data must be transferred over a bus anyway) in order to catch them (and their sources) as soon as they are available: RAW hazards are therefore avoided (we are sure not to read stale data in the RF).
  • Writeback: as soon as a datum is produced, it is transferred over one CDB (when more than one is available) to the RF and to the RS waiting for it.
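The load/store ordering rule of this slide can be illustrated with a very small Python sketch. It is not taken from the slides: the function name `load_may_bypass` is hypothetical, and it is assumed that the addresses of all older pending stores have already been computed (as stored in the store buffers).

```python
def load_may_bypass(load_address, pending_store_addresses):
    """Return True if a LOAD can be anticipated with respect to older pending STOREs.

    Assumption: every address in pending_store_addresses is already computed.
    The LOAD must wait only if an older STORE refers to the same address.
    """
    return all(load_address != address for address in pending_store_addresses)


# Hypothetical addresses, for illustration only.
print(load_may_bypass(0x1000, [0x2000, 0x3000]))  # different addresses
print(load_may_bypass(0x1000, [0x2000, 0x1000]))  # same address as an older STORE
```

The first call returns True (the LOAD can be anticipated), the second returns False (the LOAD must follow the STORE). STOREs are never anticipated, so no symmetric function is needed.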

slide-52
SLIDE 52

52

Tomasulo Algorithm

Let’s see the scoreboard example in a Tomasulo architecture. Let’s suppose that the execution times are the same as for the scoreboard (FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles). NB: “+1” is for the writeback.

LD F6, 34(R2)
LD F2, 45(R3)
FMUL F0, F2, F4
FSUB F8, F6, F2
FDIV F10, F0, F6
FADD F6, F8, F2

slide-53
SLIDE 53

53

Reservation Station

Register File Status: indicates which FU will write the register (if any). A blank means that there are no instructions which must write the register and therefore its value can be used directly.

N.B. From the general instruction queue one instruction per clock is emitted when a RS of the FU required by that instruction is available; otherwise stall. In our example we assume only one CDB.

  • Op: opcode of the instruction to be executed
  • Vj, Vk: source operands already available (read from the RF or captured from the FUs producing them)
  • Qj, Qk: functional units producing the source operands. A blank indicates that the source operands are already in Vj or Vk or that they are not required
  • Busy: the FU is busy
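The fields listed above can be put together in a small Python sketch of emission and writeback. This is only an illustration under simplifying assumptions, not the implementation described in the slides: the names `ReservationStation`, `read_operand`, `issue` and `broadcast` are hypothetical, operand values are plain floats, and a single CDB delivers one result at a time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None  # first operand value, if already available
    vk: Optional[float] = None  # second operand value, if already available
    qj: Optional[str] = None    # RS producing the first operand (blank = None)
    qk: Optional[str] = None    # RS producing the second operand (blank = None)

    def ready(self):
        return self.busy and self.qj is None and self.qk is None


def read_operand(reg, reg_file, reg_status):
    """Return (value, producer): the value from the RF, or the producing RS (renaming)."""
    producer = reg_status.get(reg)
    if producer is None:
        return reg_file[reg], None
    return None, producer


def issue(rs, op, dest, src_j, src_k, reg_file, reg_status):
    """Emission: park the instruction in a free RS and rename its sources."""
    rs.busy, rs.op = True, op
    rs.vj, rs.qj = read_operand(src_j, reg_file, reg_status)
    rs.vk, rs.qk = read_operand(src_k, reg_file, reg_status)
    reg_status[dest] = rs.name  # the most recent writer owns the register (WAW)


def broadcast(tag, value, stations, reg_file, reg_status):
    """Writeback on the CDB: waiting RS capture the value on the fly."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg, owner in list(reg_status.items()):
        if owner == tag:  # the RF is written only if this RS is still the owner
            reg_file[reg] = value
            del reg_status[reg]


# MULTD F0, F2, F4 emitted while F2 is still being loaded by Load2 (values are made up).
reg_file = {"F0": 0.0, "F2": 0.0, "F4": 2.5}
reg_status = {"F2": "Load2"}
mult1 = ReservationStation("Mult1")
issue(mult1, "MULTD", "F0", "F2", "F4", reg_file, reg_status)
print(mult1.qj, mult1.vk, mult1.ready())
broadcast("Load2", 7.0, [mult1], reg_file, reg_status)
print(mult1.vj, mult1.ready(), reg_status)
```

In this sketch the register result status is a dictionary in which a missing key plays the role of the blank entry of the slide.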

slide-54
SLIDE 54

54

Cycle 0

Instruction status

Instruction   j    k    Issue   Execution   Write
LD    F6      34   R2
LD    F2      45   R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2

Reservation Stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)

Time   Name    Busy   Op   Vj   Vk   Qj   Qk
0      Add1    No
0      Add2    No
       Add3    No
0      Mult1   No
0      Mult2   No

Register result status

Clock   F0   F2   F4   F6   F8   F10   F12   ...   F31
FU

  • FU (register result status): the FU producing the new value
  • Qj, Qk: producing FU; if blank it means that the datum is in the RF
  • Vj, Vk: operand registers; if blank the datum is being produced in the corresponding Q FU
  • NB. For LD (ST is not used here) there is a limited number of RS. Their BUSY status is displayed here differently from the FU (see next slide)
  • For the sake of simplicity Rj and Rk (ready/not ready) are not displayed since their values are implicit in the status of Qj and Qk
  • Load/store not indicated in the status table

FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles (NB: “+1” for the writeback)
slide-55
SLIDE 55

55

Cycle 1

Instruction status Execution Write Instruction j k Issue Busy LD F6 34 R2 LD F2 45 R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk 0 Add1 No 0 Add2 No Add3 No 0 Mult1 No 0 Mult2 No Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 1 FU Address 1 Load1 Yes Load1 34+R2

3 RS for adder/sub 2 RS for mul/div NB: Here it is assumed that R2 and R3 are already available 5 RS for the LOAD

slide-56
SLIDE 56

56

Cycle 2

Instruction status Execution Write Instruction j k Issue Busy LD F6 34 R2 1 LD F2 45 R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No Mult2 No Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 2 FU Load1 Address Load1 Yes 34+R2

5 RS for LOAD

2- 2 Load2 Yes 45+R3 Load2

The second LD is emitted. One instruction per clock is emitted (when possible)

N.B. A second LOAD has been emitted (not possible with the scoreboard) and parked in the RS. The value of R3 is already available in the RF.

NB: Load -> 2 cycles: the first one for computing the address and the second one for reading the datum
slide-57
SLIDE 57

57

Cycle 3

Instruction status Execution Write Instruction j k Issue Busy LD F6 34 R2 1 2--3 Load1 Yes LD F2 45 R3 2 3- Load2 Yes MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 Mult2 No Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 3 FU Load2 Load1 Address 34+R2 45+R3

Yet10 cycles LD two cycles

MULTD can be emitted although F2 NOT yet available . F2-> renaming

3 Yes Mult F4 Mult1 Load2

MULTD emitted (free RS )

Datum supposed already in the RF

slide-58
SLIDE 58

58

Cycle 4

The FUs execute both sums and subtractions

Instruction status Execution Write Instruction j k Issue Busy LD F6 34 R2 1 2--3 LD F2 45 R3 2 3--4 Load2 Yes MULTD F0 F2 F4 3 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 Add2 No Add3 No Mult1 Yes Mult F4 Load2 Mult2 No Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 4 FU Mult1 Load2 Address 45+R3

Yet 3 cycles Yet 10 cycles

4

The datum read from memory LD F6 34(R2) is written both in the RF and in the RS of SUBD which is waiting for it

4 Add1 Yes Sub F6 (captured on the fly) Load2

SUBD is emitted (RS free) F6 available in RF at the end of the cycle

FU freed

slide-59
SLIDE 59

59

Cycle 5

Instruction status Execution Write Instruction j k Issue LD F6 34 R2 1 2--3 4 LD F2 45 R3 2 3--4 MULTD F0 F2 F4 3 SUBD F8 F6 F2 4 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk

3 Add1

Yes Sub F6 (capt.) F2 (capt) 0 Add2 No Add3 No 10 Mult1 Yes Mult F2 (capt) F4 0 Mult2 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 5 FU Mult1 Add1

Cycles yet to be executed for completing the execution

5 5 Yes Div F6 Mult1 Mult2

DIVD is emitted (RS free)

Wait for F0 FU freed

The datum read from memory with LD F2 45(R3) is written both in register F2 and in the RS of SUBD and MULTD which are waiting for it
slide-60
SLIDE 60

60

Cycle 6

Cycles yet to be executed for completing the execution

Instruction status Execution Write Instruction j k Issue LD F6 34 R2 1 2--3 4 LD F2 45 R3 2 3--4 5 MULTD F0 F2 F4 3 6 -- SUBD F8 F6 F2 4 6 -- DIVD F10 F0 F6 5 ADDD F6 F8 F2 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk

2 Add1

Yes Sub F6 F2 Add2 Yes Add F2 Add1 Add3 No 9 Mult1 Yes Mult F2 F4 Mult2 Yes Div F6 Mult1 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 6 FU Mult1 Add1 Mult2

Yet 40 cycles

Now MULTD can execute (F2 and F4 available)

6 Add2

ADDD is emitted (RS free)

Wait for F0 Wait for F8

slide-61
SLIDE 61

61

Cycle 7

Instruction status Execution Write Instruction j k Issue LD F6 34 R2 1 2--3 4 LD F2 45 R3 2 3--4 5 MULTD F0 F2 F4 3 6 -- SUBD F8 F6 F2 4 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk

1 Add1

Yes Sub F6 F2 Add2 Yes Add F2 Add1 Add3 No 8 Mult1 Yes Mult F2 F4 Mult2 Yes Div F6 Mult1 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 7 FU Mult1 Add2 Add1 Mult2 6 -- 7

SUBD (as ADDD) takes two cycles. ADDD is stalled waiting for SUBD (F8). The datum in F6 will be overwritten by ADDD, but it was already read and is present in the RS of DIVD

Yet 40 cycles

slide-62
SLIDE 62

62

Cycle 8

Instruction status Execution Write Instruction j k Issue LD F6 34 R2 1 2--3 4 LD F2 45 R3 2 3--4 5 MULTD F0 F2 F4 3 6 -- SUBD F8 F6 F2 4 6 -- 7 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 No 2 Add2 Yes Add F8 F2 Add3 No 7 Mult1 Yes Mult F2 F4 Mult2 Yes Div F6 Mult1 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 8 FU Mult1 Add2 Mult2 Yet 40 8

NB: SUBD ends before MULTD and allows ADDD (which captures the result of F8) to start executing

FU freed

slide-63
SLIDE 63

63

Cycle 9

Instruction status Execution Write Instruction j k Issue LD F6 34 R2 1 2--3 4 LD F2 45 R3 2 3--4 5 MULTD F0 F2 F4 3 6 -- SUBD F8 F6 F2 4 6 -- 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 9 -- Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 No 2 Add2 Yes Add F8 F2 Add3 No 6 Mult1 Yes Mult F2 F4 Mult2 Yes Div F6 Mult1 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 9 FU Mult1 Add2 Mult2 Yet 40

ADDD executing
slide-64
SLIDE 64

64

Cycle 10

Instruction status Execution Write Instruction j k Issue LD F6 34 R2 1 2--3 4 LD F2 45 R3 2 3--4 5 MULTD F0 F2 F4 3 6 -- SUBD F8 F6 F2 4 6 -- 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 No

1 Add2

Yes Add F8 F2 Add3 No 5 Mult1 Yes Mult F2 F4 Mult2 Yes Div Mult1 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 10 FU Mult1 Add2 Mult2 9 -- 10

Two execution cycles

Yet 40 F6

slide-65
SLIDE 65

65

Cycle 11

Instruction status Execution Write Instruction j k Issue LD F6 34 R2 1 2--3 4 LD F2 45 R3 2 3--4 5 MULTD F0 F2 F4 3 6 -- SUBD F8 F6 F2 4 6 -- 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 9 -- 10 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No 4 Mult1 Yes Mult F2 F4 Mult2 Yes Div F6 Mult1 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 11 FU Mult1 Mult2 40 11

ADDD too ends before MULTD and DIVD

FU freed

Cycles yet to be executed for completing the execution

slide-66
SLIDE 66

66

Cycle 12

Instruction status Execution Write Instruction j k Issue LD F6 34 R2 1 2--3 4 LD F2 45 R3 2 3--4 5 MULTD F0 F2 F4 3 6 -- SUBD F8 F6 F2 4 6 -- 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 9 -- 10 11 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No 3 Mult1 Yes Mult F2 F4 Mult2 Yes Div F6 Mult1 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 12 FU Mult1 Mult2 40

Waiting for the datum produced by MULTD

Cycles yet to be executed for completing the execution

slide-67
SLIDE 67

67

Cycle 15

Instruction status Execution Write Instruction j k Issue LD F6 34 R2 1 2--3 4 LD F2 45 R3 2 3--4 5 MULTD F0 F2 F4 3 6 -- 15 SUBD F8 F6 F2 4 6 -- 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 9 -- 10 11 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No

1 Mult1

Yes Mult F2 F4 Yet 40 Mult2 Yes Div F6 Mult1 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 15 FU Mult1 Mult2

Waiting for the datum produced by MULTD
slide-68
SLIDE 68

68

Cycle 16

Instruction status Execution Write Instruction j k Issue LD F6 34 R2 1 2--3 4 LD F2 45 R3 2 3--4 5 MULTD F0 F2 F4 3 6 -- 15 SUBD F8 F6 F2 4 6 -- 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 9 -- 10 11 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No 40 Mult2 Yes Div F0 F6 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 16 FU Mult2

Now DIVD can execute

16

FU freed

slide-69
SLIDE 69

69

Cycle 56

Instruction status Execution Write Instruction j k Issue LD F6 34 R2 1 2--3 4 LD F2 45 R3 2 3--4 5 MULTD F0 F2 F4 3 6 -- 15 16 SUBD F8 F6 F2 4 6 -- 7 8 DIVD F10 F0 F6 5 17 -- 56 ADDD F6 F8 F2 6 9 -- 10 11 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No 0 Mult2 Yes Div F0

F6

Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 56 FU Mult2


slide-70
SLIDE 70

70

Cycle 57

Instruction status

Instruction   j    k    Issue   Execution complete   Write Result
LD    F6      34   R2   1       2--3                 4
LD    F2      45   R3   2       3--4                 5
MULTD F0      F2   F4   3       6--15                16
SUBD  F8      F6   F2   4       6--7                 8
DIVD  F10     F0   F6   5       17--56               57
ADDD  F6      F8   F2   6       9--10                11

Reservation Stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)

Time   Name    Busy   Op   Vj   Vk   Qj   Qk
       Add1    No
       Add2    No
       Add3    No
       Mult1   No
       Mult2   No

Register result status

Clock   F0   F2   F4   F6   F8   F10   F12   ...   F31
57      FU
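The issue/execution/write cycles of this example can be reproduced with a few lines of Python. This is a deliberately simplified timing sketch, not a Tomasulo simulator: the names `LATENCY`, `PROGRAM` and `schedule` are hypothetical, structural hazards (free RS, busy FU, CDB conflicts) are ignored because none occurs in this example, and LD is modelled with 2 execution cycles (address computation plus memory read), as shown in the cycle tables.

```python
# Execution cycles per operation (the writeback takes one more cycle).
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}

# (operation, destination, FP source registers) in program order.
PROGRAM = [
    ("LD", "F6", []),
    ("LD", "F2", []),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]


def schedule(program, latency):
    """Return (op, dest, issue, exec start, exec end, write) for each instruction."""
    last_write = {}  # register -> write cycle of its most recent producer (renaming)
    rows = []
    for n, (op, dest, srcs) in enumerate(program, start=1):
        issue = n  # one instruction emitted per clock, in order
        operands_ready = max((last_write.get(r, 0) for r in srcs), default=0)
        start = max(issue, operands_ready) + 1  # operands are captured at writeback
        end = start + latency[op] - 1
        write = end + 1
        rows.append((op, dest, issue, start, end, write))
        last_write[dest] = write
    return rows


for row in schedule(PROGRAM, LATENCY):
    print(row)
```

Because `last_write` is updated in program order, DIVD takes F6 from the first LD and not from the later ADDD, which is the effect of renaming shown in the slides.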

slide-71
SLIDE 71

71

A demo can be found at http://www.ecs.umass.edu/ece/koren/architecture/Tomasulo1/tomasulo_files/tomasulo.htm


slide-72
SLIDE 72

Limits of Tomasulo Algorithm

73

  • Very complex
  • Each CDB must be connected to each RS – complex wiring – reducing the number of CDBs means reduced efficiency
  • If a single CDB is present only one instruction per cycle can end
  • Out-of-order instruction completion!
  • NOT precise interrupts
slide-73
SLIDE 73

Exceptions

74

  • Exception/interrupt: non-programmed control transfer
  • Return address and all other information necessary to restore the interrupted situation must be saved
  • «Response» subroutine (handler) must be executed
  • Two exception types: interrupt and trap
  • Interrupts: external causes
    - The user program is interrupted and then restored
    - Asynchronous to the current process
    - Acknowledged at the end of the current instruction (if interrupts are enabled)
    - The handler is the responsibility of the user program
  • Traps: internal causes
    - Exceptional conditions (overflow, division by zero etc.)
    - Errors (i.e. parity)
    - Page fault (or – see later – segment fault): data not available in memory
    - Synchronous to the current process
    - Operating system handler
    - An instruction can be interrupted during its execution (i.e. page fault) and therefore must be «restartable». The executing program is normally temporarily suspended.

slide-74
SLIDE 74

Examples

75

Instruction Restart


slide-75
SLIDE 75

Precise exceptions/interrupts

76

  • Exceptions must be “precise”, that is their behaviour must be the same that would occur in a “non-pipelined” architecture
  • Precise: the machine status is saved as if the code had been executed up to the exception:
  • All preceding instructions must be terminated
  • All instructions following the instruction which provoked the exception must be handled as if they never started
  • The same code must execute identically on different architectures
  • Complex problem with pipeline, OOO execution (see later) etc.
  • Scoreboard and Tomasulo have: in-order emission, out-of-order execution (and therefore out-of-order termination)
  • Precise exceptions (interrupts): instruction commitment in order

slide-76
SLIDE 76

78

Reorder Buffer (ROB)

[Figure: ROB, FP Op Queue, FP Adders, Reservation Stations, FP Regs]

  • FIFO queue
  • Stores pointers to all instructions in FIFO order as they are emitted. For the sake of simplicity we say that the instruction is virtually inserted in the ROB
  • When instructions terminate the results are stored in the ROB (instead of the RF), which also provides the operands to the other instructions which require them (renaming!)
  • Commitment: the result of the instruction in the top slot of the FIFO is transferred to the architectural registers
  • Automatic WAW avoidance
  • Easy “undo” of speculated instructions (see later), of erroneously predicted branches or of exceptions
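The in-order commitment performed by the ROB can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not the hardware of the slides: the names `RobEntry` and `ReorderBuffer` are hypothetical, the ROB is an unbounded FIFO, and an exception found at the top of the FIFO simply drops all the remaining entries (the handler invocation is not modelled).

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    dest: str
    value: Optional[float] = None
    done: bool = False       # execution terminated, result stored in the ROB
    exception: bool = False  # the instruction raised an exception


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # FIFO: the oldest instruction is on the left

    def emit(self, dest):
        entry = RobEntry(dest)
        self.entries.append(entry)
        return entry

    def commit(self, reg_file):
        """Commit terminated instructions from the top of the FIFO, in order."""
        committed = []
        while self.entries and self.entries[0].done:
            entry = self.entries[0]
            if entry.exception:
                # Precise exception: younger instructions are handled as if never started.
                self.entries.clear()
                break
            self.entries.popleft()
            reg_file[entry.dest] = entry.value  # architectural register updated only now
            committed.append(entry.dest)
        return committed


rob, reg_file = ReorderBuffer(), {}
first, second = rob.emit("F0"), rob.emit("F10")
second.value, second.done = 3.0, True  # the younger instruction ends first
print(rob.commit(reg_file))            # nothing commits: the top is not terminated
first.value, first.done = 1.0, True
print(rob.commit(reg_file), reg_file)  # both commit, in program order
```

The register file is written only at commitment, so a result produced out of order never becomes architecturally visible before the results of the preceding instructions.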

slide-77
SLIDE 77

79

Tomasulo again


slide-78
SLIDE 78

80

Tomasulo in 4 steps

  • Emission: emission of an instruction from the instruction queue when a RS and a ROB slot are available. The RS indicates the sources of the operands and the ROB slot where the instruction will be “parked” after its execution (this phase is called “dispatch”). The results are NOT written in the RF until the commitment phase. NB: the lack of either of the two conditions blocks the emission of the following instructions
  • Execution: operands transformation. If the operands are not yet ready they can be in the ROB (in this case the operand value computed by the nearest previous instruction is used) or still being computed in a FU. This phase is indicated as “issue”.
  • Result writeback: execution ends. The result is transmitted on the CDB to the RS waiting for it and to the ROB.
  • Commitment: register (or memory) update with the results stored in the ROB when the instruction is at the top of the ROB FIFO. In case of an erroneously predicted branch the ROB results are just dropped. This phase is also called “graduation”.

EMISSION IN ORDER, COMMITMENT IN ORDER

N.B. Sometimes more instructions can be committed simultaneously. If the destination is the same (unlikely, otherwise the compiler would have dropped the first one) the result of the most recent instruction is used.

slide-79
SLIDE 79

81

HW with ROB

Reorder Buffer FP Op Queue FP Adder FP Adder Res Stations Res Stations FP Regs Compar network

  • ROB is a circular queue

ROB entry fields: Destination Register, Result, Exception?, Valid (terminated), Program Counter

slide-80
SLIDE 80

82

Example

LD F0, 10(R2)
FADD F10, F4, F0
FDIV F2, F10, F6
BRNE F2, +100
LD F4, 0(R3)
FADD F0, F4, F6
ST 0(R3), F4

slide-81
SLIDE 81

83

To memory FP adders FP multipliers Reservation Stations FP Op queue

ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1

F0 LD F0,10(R2) N

Completed? Dest. Dest ROB top ROB end From memory 1 10+R2 Dest

ROB

Tomasulo with ROB – cycle 1

Source M1

LD F0, 10(R2) FADD F10, F4, F0 FDIV F2, F10, F6 BRNE F2, +100 LD F4, 0(R3) FADD F0, F4, F6 ST 0(R3), F4

Instruction Dest.

Parallelism

FP registers

  • Dest. => destination position in ROB
slide-82
SLIDE 82

84

2 FADD F10,F4, ROB1

FP adders FP multipliers Reservation Stations FP Op queue

ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1

F10 F0 ROB1

LD F0,10(R2)

N Ex Dest. Dest Top End 1 10+R2 Dest

ROB

Tomasulo with ROB – cycle 2

FADD F10, F4, F0

To memory From memory

M1

LD F0, 10(R2) FADD F10, F4, F0 FDIV F2, F10, F6 BRNE F2, +100 LD F4, 0(R3) FADD F0, F4, F6 ST 0(R3), F4

Renaming!! (Memory: 2 clocks)

Three slots for memory operations

Completed?

Source Instruction Dest.

There can also be two ROB sources

FP registers

slide-83
SLIDE 83

85

3

2 FADD F10, F4, ROB1

FP adders FP multipliers Reservation Stations FP Op queue

ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1

F2 F10 F0 ROB 2 LD F0,10(R2) N N Ex

Dest. Dest Top End 1 10+R2 Dest

ROB

Tomasulo with ROB – cycle 3

FADD F10, F4, F0 FDIV F2, F10, F6

FDIV F2, ROB2, F6

To memory From memory

M1

LD F0, 10(R2) FADD F10, F4, F0 FDIV F2, F10, F6 BRNE F2, +100 LD F4, 0(R3) FADD F0, F4, F6 ST 0(R3), F4

ROB 1

Completed?

Source Instruction Dest.

Parallelism

FP registers

slide-84
SLIDE 84

86

3

2 FADD F10, F4, M1 6 FADD F0, ROB5, F6

FP adders FP multipliers Reservation Stations FP Op queue

ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1

F0 ROB5 FADD F0, F4, F6 N F4 LD F4,0(R3) Ex

  • N

F2 F10 ROB2

Completed and committed

N Ex

Dest. Dest Top End 5 0+R3 Dest

Tomasulo with ROB – cycle 5

FADD F10, F4, F0 FDIV F2, F10, F6

FDIV F2, ROB2, F6

To memory From memory

BRNE F2, +100 M1 F0 (Memory 1) In cycle 4 (end of the first LD) FADD F10, F4, F0 started executing

Emitted in cycle 4 in parallel with LD F4, 0(R3)

LD F0, 10(R2) FADD F10, F4, F0 FDIV F2, F10, F6 BRNE F2, +100 LD F4, 0(R3) FADD F0, F4, F6 ST 0(R3), F4

ROB1

Datum captured on the fly. No longer present in the ROB

Completed?

Source Instruction Dest.

Parallelism

slide-85
SLIDE 85

87

3

2 FADD F10, F4, M1 6 FADD F0, ROB5, F6

FP adders FP multipliers Reservation Stations FP Op queue

ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1

F0 ROB5 ROB5 ST 0(R3), F4 FADD F0, F4, F6 N F4 LD F4, 0(R3) Ex

  • N

F2 F10 ROB2 N Ex

Dest. Dest Top End 5 0+R3 Dest

Tomasulo with ROB – cycle 6

FADD F10, F4, F0 FDIV F2, F10, F6

FDIV F2, ROB2, F6

To memory From memory

FP registers

BRNE F2, +100 M1 F0 M1

NB: ST can start its execution when LD F4, 0(R3) has terminated its execution, NOT when it is committed (one cycle later)

LD F0, 10(R2) FADD F10, F4, F0 FDIV F2, F10, F6 BRNE F2, +100 LD F4, 0(R3) FADD F0, F4, F6 ST 0(R3), F4

ROB1

Completed?

Source Instruction Dest.

Parallelism

N

slide-86
SLIDE 86

88

Register Renaming

  • But when an emitted instruction must use a register, where can its value be found? In the ROB or in the RF? The entire ROB should be analysed and the most recent slot whose destination is the required register should be found: the instruction should point either to that slot (if any) or to the RF
  • Solution: use a number of physical registers greater than that of the architectural registers (ISA) and keep a pointer to the most recently committed one (which is actually the architectural register).
  • Whenever an instruction inserted in the ROB must write a register (i.e. F17), it points to a new physical register associated with the involved register (F17), where the result will be temporarily stored. Any following instruction which must use that register (F17) will use that physical register
  • At each commitment the pointer to the architectural register is moved to the physical register linked to the committed instruction. When a new instruction regarding the same architectural register is committed the pointer is changed again (and the physical register previously embodying the architectural register is freed).
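The renaming mechanism described above can be sketched in Python. This is only an illustration under stated assumptions: the class name `RenameTable` and its methods are hypothetical, a single pool of physical registers shared by all architectural registers is assumed, and the physical register numbers are arbitrary.

```python
class RenameTable:
    def __init__(self, arch_regs, n_physical):
        self.free = list(range(n_physical))  # free physical registers
        self.latest = {}     # architectural register -> most recent physical register
        self.committed = {}  # architectural register -> physical register holding its committed value
        for reg in arch_regs:
            phys = self.free.pop(0)
            self.latest[reg] = phys
            self.committed[reg] = phys

    def rename(self, dest, srcs):
        """Return (physical dest, physical sources), or None if the instruction must stall."""
        if not self.free:
            return None  # no free physical register: the emission is stalled
        phys_srcs = [self.latest[s] for s in srcs]  # sources are read before dest is remapped
        phys_dest = self.free.pop(0)
        self.latest[dest] = phys_dest
        return phys_dest, phys_srcs

    def commit(self, dest, phys_dest):
        """At commitment phys_dest becomes the architectural register; the old one is freed."""
        self.free.append(self.committed[dest])
        self.committed[dest] = phys_dest


table = RenameTable(["R2", "R5", "R6", "R8", "R9", "R10"], n_physical=12)
print(table.rename("R2", ["R5"]))         # LD  R2, 10(R5)
print(table.rename("R8", ["R2", "R5"]))   # MUL R8, R2, R5  uses the new R2
print(table.rename("R2", ["R9", "R6"]))   # ADD R2, R9, R6  gets another R2
print(table.rename("R2", ["R2", "R10"]))  # DIV R2, R2, R10 reads the previous one
```

Since every write gets a fresh physical register, two instructions writing the same architectural register no longer conflict (no WAW), and a later write cannot destroy a value still needed by an earlier reader (no WAR).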

slide-87
SLIDE 87

89

An example with R2

Circular queue of physical registers for R2: R2-0, R2-1, R2-2, R2-3, R2-4, R2-5, R2-6, R2-7, R2-8

  • R2-1: architectural register (it becomes R2-2 at the commitment of the instruction using R2-2)
  • Let’s suppose that R2-2 and R2-3 are already occupied by previous, not yet committed instructions
  • R2-4: pointer to the first free R2 register when LD R2, 10(R5) is emitted

LD R2, 10(R5) ; R2-4 (dest.): first free physical register associated to R2
MUL R8, R2, R5 ; R2-4 (source)
RADD R2, R9, R6 ; R2-5 (dest.)
DIV R2, R2, R10 ; R2-6 (dest.) and R2-5 (source)

(Normally the compiler would drop the next to the last instruction)

  • When LD R2, 10(R5) is emitted register R2-4 is given to it as destination, which will be used by MUL (as soon as the new datum is computed). Now R2-2, R2-3 and R2-4 are «busy» and the first free register will be R2-5. R2-2, R2-3 and R2-4 will be freed as soon as the related instructions end. If the commitment is “in-order” all hazards disappear. R2-1 is the architectural R2 register. R2-2 will become the architectural R2 register at the commitment of the related instruction. The busy registers are freed when no longer needed
  • Normally there are 40-120 physical registers

slide-88
SLIDE 88

90

HW support for register renaming

  • Great number of physical registers
  • Fast mapping between architectural and physical registers (at run time)
  • Free/busy register table. Two solutions: one pool of physical registers for all architectural registers or one pool for each architectural register.
  • If no physical register is available (in the circular queue) the instruction is stalled. There is no emission either if no free slot in the ROB or no RS is available
slide-89
SLIDE 89

ROB «without Tomasulo»

91

  • Instructions are emitted as soon as a free slot in the ROB and a physical destination register are available, using register renaming (the registers used are normally not yet committed)
  • For each FU there is a virtual queue whose slots point to the ROB slots which require that FU.
  • The instructions of this queue are executed as soon as the required operands are available.
  • When two instructions are ready for execution, the FIFO rule applies (so as to speed up the commitment, which is always in order)

slide-90
SLIDE 90

92

ROB and speculation

  • Dynamic instruction execution granting precise interrupts, which are checked at the instruction commitment, always in order
  • Cancellation of speculative instructions when a branch is erroneously predicted
  • The prediction error must be revealed ASAP. The cancellation of the post-branch instructions erroneously executed allows the preceding instructions to keep executing. The erroneously executed instructions are not yet committed
  • The early detection of a wrong prediction avoids the execution of useless instructions (sometimes very time expensive). It must be remembered that not only the ROB flush occurs but also the cancellation of all the instructions already in the pipeline
  • Need of a separate Return Stack Buffer (RSB) for the speculative calls (otherwise the stack could be damaged). It is a separate stack whose content is copied onto the stack if the branch has been correctly predicted as taken. All instructions following a not yet committed branch use this stack. In case of misprediction the RSB content is cancelled
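The cancellation of the speculative instructions can be sketched in Python on the instruction sequence of the earlier ROB example. This is only an illustration: the function name `squash_after` is hypothetical, the ROB is modelled as a plain list in emission order, and it is assumed that the branch has been found to be mispredicted while all seven instructions are still in the ROB.

```python
def squash_after(rob, branch_index):
    """Split the ROB into the entries kept and the entries cancelled after a misprediction.

    Everything up to and including the branch keeps executing and committing;
    every younger (speculative) entry is dropped before it can commit.
    """
    kept = rob[: branch_index + 1]
    cancelled = rob[branch_index + 1:]
    return kept, cancelled


# ROB content in emission order (oldest first).
rob = [
    "LD F0, 10(R2)",
    "FADD F10, F4, F0",
    "FDIV F2, F10, F6",
    "BRNE F2, +100",
    "LD F4, 0(R3)",
    "FADD F0, F4, F6",
    "ST 0(R3), F4",
]

kept, cancelled = squash_after(rob, rob.index("BRNE F2, +100"))
print("kept:", kept)
print("cancelled:", cancelled)
```

Since the cancelled instructions were never committed, neither the architectural registers nor the memory have been modified by them, which is why the “undo” is easy.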

slide-91
SLIDE 91

93

Example - 1

FLD F4, 0(R10)
FDIV F8, F0, F4
FMUL F4, F2, F3
FMUL F4, F4, F4
FADD F6, F10, F4
FLD F4, 0(R5)

Hazards marked in the figure: RAW, WAW, RAW, WAW, RAW

slide-92
SLIDE 92

94

F10 F8 F6 F4 F2 FU Instruction status Exe. Instruction j k Issue Compl. Busy Load1 Load2 Address Load2 Write Result Store 1 Store 2 Reservation Stations Time Name Busy Op Vj Vk Qj Qk Register result status Clock 0 Add1 Add2 Add3 Mult1 Mult2

FLD F4 0 R10 FDIV F8 F0 F4 FMUL F4 F2 F3 FMUL F4 F4 F4 FADD F6 F10 F4 FLD F4 0 R5

Tomasulo without ROB and with renaming. The multiplication FU executes the divisions too.

Three RS for LOAD, 2 for STORE, 2 for MUL/DIV

slide-93
SLIDE 93

95

Load1 F10 F8 F6 F4 F2 R10 yes 1 FU Instruction status Exe. Instruction j k Issue Compl. Busy Load1 Load2 Address Load2 Write Result Store 1 Store 2 Reservation Stations Time Name Busy Op Vj Vk Qj Qk Register result status Clock 1 Add1 Add2 Add3 Mult1 Mult2

FLD F4 0 R10 FDIV F8 F0 F4 FMUL F4 F2 F3 FMUL F4 F4 F4 FADD F6 F10 F4 FLD F4 0 R5

Three RS for LOAD, 2 for STORE, 2 for MUL/DIV

Parallelism

CLOCK 1

slide-94
SLIDE 94

96

Mult1 F10 F8 F6 F4 F2 F0 div

yes 2 R10 yes 2- 1

FU Instruction status Exe. Instruction j k Issue Compl. Busy Load1 Load2 Address Load2 Write Result Store 1 Store 2 Reservation Stations Time Name Busy Op Vj Vk Qj Qk Register result status Clock 2 Add1 Add2 Add3 Mult1 Mult2 Load1 Load1

FLD F4 0 R10 FDIV F8 F0 F4 FMUL F4 F2 F3 FMUL F4 F4 F4 FADD F6 F10 F4 FLD F4 0 R5

Three RS for LOAD, 2 for STORE, 2 for MUL/DIV

Parallelism

CLOCK 2

slide-95
SLIDE 95

97

Mult1 F10 F8 F6 F4 F2 mul yes F0 div

yes

3

2 R10 yes 2-3 1

FU Instruction status Exe. Instruction j k Issue Compl. Busy Load1 Load2 Address Load2 Write Result Store 1 Store 2 Reservation Stations Time Name Busy Op Vj Vk Qj Qk Register result status Clock 3 Add1 Add2 Add3 Mult1 Mult2 Mult2 Load1 F2 F3

FLD F4 0 R10 FDIV F8 F0 F4 FMUL F4 F2 F3 FMUL F4 F4 F4 FADD F6 F10 F4 FLD F4 0 R5

Three RS for LOAD, 2 for STORE, 2 for MUL/DIV

Parallelism

CLOCK 3

slide-96
SLIDE 96

98

Mult1 F10 F8 F6 F4 F2 mul yes F0 div

yes

yet 9 cycles

3

2 2-3 1

FU Instruction status Exe. Instruction j k Issue Compl. Busy Load1 Load2 Address Load2 Write Result Store 1 Store 2 Reservation Stations Time Name Busy Op Vj Vk Qj Qk Register result status Clock 4 Add1 Add2 Add3 Mult1 Mult2 Mult2 F2 F3 4-

4

FLD F4 0 R10 FDIV F8 F0 F4 FMUL F4 F2 F3 FMUL F4 F4 F4 FADD F6 F10 F4 FLD F4 0 R5

Stalled for lack of a free RS until cycle 13 (end of the preceding multiplication – only two slots), blocking the emission of FADD, which could be executed since there are two free slots in the corresponding RS

4-

yet 39 cycles

F4

Parallelism

Three RS for LOAD, 2 for STORE, 2 for MUL/DIV

CLOCK 4

slide-97
SLIDE 97

99

80000000: FLD F4, 0(R10) 80000004: FDIV F8, F0, F4 80000008: FMUL F4, F2, F3 8000000C: FMUL F4, F4, F4 80000010: FADD F6, F10,F4 80000014: FLD F4, 0(R5) RAW WAW RAW WAW RAW

ROB and register renaming. The instructions are in any case inserted in the ROB when a free slot is available and then executed when the FU and the operands are available (policy of all modern processors). By so doing instructions are not only terminated OOO (but with results reordered in the ROB) but also emitted even if the FU is not available The execution is totally OOO but with an In-Order commitment

Example - 2

Same instruction stream


SLIDE 98

Addr      Op.   Dest  Source

P0 P1

Free Free. Free Free Free Arch

P2 P3 P4 P5

F4

ROB RAT

Q0 Q1

Busy

  • Free. Free Free

Free Arch

Q2 Q3 Q4 Q5

F6

Z0 Z1

Busy Free Free Free Free Arch

Z2 Z3 Z4 Z5

F8

Initial situation

Renaming registers for F4, F6 and F8

1 2 3 4 5


Register Allocation Table

Top free registers of the circular queues

These are the architectural registers which a program monitor would display.

These are registers in use by not yet committed instructions. They will become architectural registers when the related instruction is committed.

Here we assume that the instruction using Z0 precedes the instruction using Q0. RAT for R5, R10, F0, F2, F10 not displayed
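The renaming scheme described here can be sketched in a few lines. This is an illustration under assumptions, not the slides' implementation: each renamed architectural register has its own circular queue of physical registers (P0..P5 for F4, Q0..Q5 for F6, Z0..Z5 for F8), Q0 and Z0 are already taken by earlier instructions, registers without a queue (R5, R10, F0, F2, F10) are read under their architectural name, and no check is made that the allocated register is actually free. The class and function names are hypothetical.

```python
class RenameTable:
    """One circular queue of physical registers per architectural register."""

    def __init__(self, queues, next_free):
        self.queues = queues              # e.g. {"F4": ["P0", ..., "P5"]}
        self.next_free = dict(next_free)  # index of the top free register
        self.latest = {}                  # most recent physical register per name

    def source(self, reg):
        # A register never renamed so far is read under its architectural name.
        return self.latest.get(reg, reg)

    def allocate(self, reg):
        queue = self.queues[reg]
        phys = queue[self.next_free[reg]]
        self.next_free[reg] = (self.next_free[reg] + 1) % len(queue)
        self.latest[reg] = phys
        return phys


def rename(program, table):
    """Rename (op, dest, sources) tuples in program order."""
    renamed = []
    for op, dst, srcs in program:
        new_srcs = [table.source(s) for s in srcs]  # read sources first
        new_dst = table.allocate(dst)               # then rename the destination
        renamed.append((op, new_dst, new_srcs))
    return renamed


QUEUES = {
    "F4": [f"P{i}" for i in range(6)],
    "F6": [f"Q{i}" for i in range(6)],
    "F8": [f"Z{i}" for i in range(6)],
}
# Q0 and Z0 are assumed busy, so the top free registers are P0, Q1 and Z1.
TABLE = RenameTable(QUEUES, {"F4": 0, "F6": 1, "F8": 1})

PROGRAM = [
    ("FLD", "F4", ["R10"]),
    ("FDIV", "F8", ["F0", "F4"]),
    ("FMUL", "F4", ["F2", "F3"]),
    ("FMUL", "F4", ["F4", "F4"]),
    ("FADD", "F6", ["F10", "F4"]),
    ("FLD", "F4", ["R5"]),
]
RENAMED = rename(PROGRAM, TABLE)
for op, dst, srcs in RENAMED:
    print(op, dst, ",".join(srcs))
```

Reading the sources before allocating the destination matters for FMUL F4, F4, F4: it must read the previous F4 (P1) and write a new one (P2).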

SLIDE 99

R10 yes 1 Instruction status Exe. Instruction j k Issue Compl. Busy Load1 Load2 Address Write Result Load3 Store 1 Store 2 Time Name Busy Op Vj Vk Qj Qk Clock 1 Add1 Add2 Add3 Mult1 Mult2

FLD F4 0 R10 FDIV F8 F0 F4 FMUL F4 F2 F3 FMUL F4 F4 F4 FADD F6 F10 F4 FLD F4 0 R5


CLOCK 1

Addr      Op.   Dest  Source
80000000  FLD   P0    0,R10

P0 P1

Busy

  • Free. Free Free Free

Arch

P2 P3 P4 P5 Q0 Q1

Busy Free Free Free Free Arch

Q2 Q3 Q4 Q5

F6

Z0 Z1

Busy Free Free Free Free Arch

Z2 Z3 Z4 Z5

F8

ROB RAT

1 2 3 4 5

Renaming

F4

SLIDE 100

Mult2 F0 div

yes 2 R10 yes 2- 1

Instruction status Exe. Instruction j k Issue Compl. Busy Load1 Load2 Address Load3 Write Result Store 1 Store 2 Time Name Busy Op Vj Vk Qj Qk Clock 2 Add1 Add2 Add3 Mult1 P0

FLD F4 0 R10 FDIV F8 F0 F4 FMUL F4 F2 F3 FMUL F4 F4 F4 FADD F6 F10 F4 FLD F4 0 R5


CLOCK 2

Addr      Op.   Dest  Source
80000000  FLD   P0    0,R10
80000004  FDIV  Z1    F0,P0

P0 P1

Busy Free Free Free Free Arch

P2 P3 P4 P5 Q0 Q1

Busy Free Free Free Free Arch

Q2 Q3 Q4 Q5

F6

Z0 Z1

Busy Busy Free Free Free Arch

Z2 Z3 Z4 Z5

F8

Most recent physical register for F4

1 2 3 4 5

ROB RAT Renaming

F4

SLIDE 101

mul yes F0 div

yes

3

2 R10 yes 2-3 1

Instruction status Exe. Instruction j k Issue Compl. Busy Load1 Load2 Address Load3 Write Result Store 1 Store 2 Time Name Busy Op Vj Vk Qj Qk Clock 3 Add1 Add2 Add3 Mult1 Mult2 P0 F2 F3

FLD F4 0 R10 FDIV F8 F0 F4 FMUL F4 F2 F3 FMUL F4 F4 F4 FADD F6 F10 F4 FLD F4 0 R5

Addr      Op.   Dest  Source
80000000  FLD   P0    0,R10
80000004  FDIV  Z1    F0,P0
80000008  FMUL  P1    F2,F3

P0 P1

Busy Busy. Free Free Free Free

P2 P3 P4 P5

F4

ROB RAT

Q0 Q1

Busy Free Free Free Free Free

Q2 Q3 Q4 Q5

F6

Z0 Z1

Arch Busy Free Free Free Free

Z2 Z3 Z4 Z5

F8

1 2 3 4 5


CLOCK 3 waiting for F4 (P0)

The previous instruction using Z0 has ended its execution: Z0 is now the architectural register.

SLIDE 102

P0

Op Vj Vk Qj Qk


mul yes F0 div

yes

9 cycles remaining

3

2 2-3 1

Instruction status Exe. Instruction j k Issue Compl. Busy Load1 Load2 Address Write Result Time Name Busy Op Clock 4 Add1 Add2 Add3 Mult1 Mult2 F2 F3 4-

4

FLD F4 0 R10 FDIV F8 F0 F4 FMUL F4 F2 F3 FMUL F4 F4 F4 FADD F6 F10 F4 FLD F4 0 R5

4

39 cycles remaining

4-

Not yet executable, but inserted in the ROB anyway. It does not block the emission of the following instructions.


Load3 Store 1 Store 2 Integer Load2 Store 1 Store 2 Qk

Addr      Op.   Dest  Source
80000000  FLD   P0    0,R10
80000004  FDIV  Z1    F0,P0
80000008  FMUL  P1    F2,F3
8000000C  FMUL  P2    P1,P1

P1 P0

Busy Busy. Busy Free Free Arch

P2 P3 P4 P5

F4

ROB RAT

Q0 Q1

Arch Free Free Free Free Free

Q2 Q3 Q4 Q5

F6

Z0 Z1

Arch Busy Free Free Free Free

Z2 Z3 Z4 Z5

F8

1 2 3 4 5

Ended but not yet committed!

CLOCK 4

The instruction using Q0 has ended its execution: Q0 is now the architectural register.

SLIDE 103

FLD F4 0 R10 FDIV F8 F0 F4 FMUL F4 F2 F3 FMUL F4 F4 F4 FADD F6 F10 F4 FLD F4 0 R5

mul yes F0 div

yes

8 cycles remaining

3

2 2-3 1

Instruction status Exe. Instruction j k Issue Compl. Busy Load1 Address Write Result Time Name Busy Op Vj Vk Clock 5 Add1 Add2 Add3 Mult1 Mult2 P0 (F4) F2 F3 4-

4

4 5 yes add F10 P2

38 cycles remaining

Load2 3-


Load3 Store 1 Store 2 CLOCK 5 waiting for F4 (P1) Qj Qk Integer Load2 Store 1 Store 2 Qk

Addr      Op.   Dest  Source
80000004  FDIV  Z1    F0,P0
80000008  FMUL  P1    F2,F3
8000000C  FMUL  P2    P1,P1
80000010  FADD  Q1    F10,P2

P1 P0

Arch Busy. Busy Busy Free Free

P2 P3 P4 P5

F4

ROB RAT

Q0 Q1

Arch Busy Free Free Free Free

Q2 Q3 Q4 Q5

F6

Z0 Z1

Arch Busy Free Free Free Free

Z2 Z3 Z4 Z5

F8

1 2 3 4 5

FLD committed: the architectural register F4 is now P0