Parallel architectures
Electronic Computers LM
Parallelism
Architecture: the functional behaviour of a computer, for instance a processor which executes DLX code. Implementation: a logical network implementing the
implementation (for instance different technologies). The ISA changes slowly while implementations change rapidly (see for instance IA8, IA16, IA32…). The longer an ISA lasts, the more programs are written for it, and therefore compatibility becomes the main issue.
DLX code
family x86 The architecture is defined by the machine language that is the instruction set (assembly language). Instruction Set Architecture -> ISA
Instruction level parallelism
Single instruction executed at a time
Multiple instructions executed simultaneously
Multiple stages for each operation (EX, MEM etc.) in order to increase the clock frequency (e.g. Pentium IV)
A single pipeline
Multiple pipelines; many instructions started at the same time. Possible Out Of Order execution (run time decision)
Multiple pipelines; many instructions started at the same time. Instruction
A memory able to provide multiple data at different addresses at the same time (outstanding requests - DDR2, DDR3 etc.)
Many processors in the same chip (e.g. Core Duo, Nehalem, Sandy Bridge…)
Pipelines of the same processor used by different processes at the same time (time sharing), as if it were a multicore (e.g. Pentium IV, Nehalem, Sandy Bridge etc.)
[Figure: pipeline stages Fetch, Decode, Execute, Memory, Writeback, with the branch penalty marked]
Sequential
Time parallelism: pipeline
Space parallelism: VLIW
Space-time parallelism (e.g. i5, i7…)
IF ID RD MEM2 FP2 FP3 WB
Multi-instruction buffer to avoid blocking the pipelines. Dedicated pipelines.
The instruction sequence is defined at compile time. Careful compilation is fundamental in order to avoid underexploitation of the pipelines.
Different execution times problem
Instruction interdependency problem
ALU MEM1 FP1 BR EX F => Floating
IF ID RD EX ALU MEM1 FP1 BR MEM2 FP2 FP3 Dispatch Buffer Reorder Buffer
«Out Of Order» execution
”In order” execution WB ”In order” retirement
IF ID MEM WB Integer FP Multipl. FP adder FP/Int. Divid. multicycle stages IF ID MEM WB Ex Integer M1 M2 M3 M4 M5 M6 M7 FP Multiply A1 A2 A3 A4 FP Add
FP/INT. Divide (e.g. 24 clock cycles, one instruction executed at a time)
Pipelined
FADD F3, F4, F5 FLD F6, 10(R8) FST 40(R10), F9
FMUL IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB FADD IF ID A1 A2 A3 A4 MEM WB FLD IF ID EX MEM WB FST IF ID EX MEM (WB)
Data written Data required for computing the address In violet the stages where the operands are needed and in green the stages where results are produced
nop Same destination register Write sequence error
instruction to the appropriate execution stage)
FLD F6, 10(R8) (one EX cycle) (although in this case the FADD would have been dropped by the compiler since useless)
Red squares: execution
IF ID M1 M2 M3 M4 M5 M6 M7
MEM WB
IF ID EX
MEM WB
IF ID EX
MEM WB
IF ID A1 A2 A3 A4
MEM WB
IF ID EX
MEM WB
IF ID EX
MEM WB
IF ID EX
MEM WB
FMUL F0, F4, F6
………….. …………..
FADD F2, F4, F6
………….. …………..
FLD F2, 0(R2)
If FADD had been started one clock later, a Write After Write hazard would have taken place! Multiple RF write operations
the RF can be increased (expensive) or stalls must be introduced (normally in MEM or WB stages so as to choose the instructions to be stalled). More complex pipelines
Normally the hazards are detected in the ID stage, considering the preceding and following instructions, so as to introduce the required stalls (in this case FLD would have been stalled one clock).
Hazards normally occur among homogeneous registers (FP or Integer), except for FLOAD and FSTOR, which use integer registers for address computation.
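The write-back collision described above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the fetch cycles (1, 4 and 7) are an assumption read off the diagram, where two unrelated instructions separate FMUL, FADD and FLD, and the names `EX_STAGES`, `wb_cycle` and `write_conflicts` are hypothetical.

```python
# Number of execution stages per operation, as drawn in the pipeline diagram
# (M1..M7 for FMUL, A1..A4 for FADD, a single EX for FLD).
EX_STAGES = {"FMUL": 7, "FADD": 4, "FLD": 1}


def wb_cycle(op, fetch_cycle):
    """Clock cycle in which an instruction fetched in fetch_cycle reaches WB."""
    # IF happens in fetch_cycle, ID one cycle later, then the EX stages,
    # then MEM and finally WB.
    return fetch_cycle + 1 + EX_STAGES[op] + 2


def write_conflicts(schedule):
    """schedule: list of (op, dest, fetch_cycle) tuples in program order.

    Returns the WB cycle of every instruction, the pairs that write the
    register file in the same cycle, and the pairs with a WAW hazard
    (same destination, the older instruction writing after the younger one).
    """
    wb = [(op, dest, wb_cycle(op, start)) for op, dest, start in schedule]
    same_cycle, waw = [], []
    for i in range(len(wb)):
        for j in range(i + 1, len(wb)):
            if wb[i][2] == wb[j][2]:
                same_cycle.append((i, j))
            if wb[i][1] == wb[j][1] and wb[i][2] > wb[j][2]:
                waw.append((i, j))
    return wb, same_cycle, waw


# Schedule assumed from the slide: all three instructions reach WB together.
on_slide = [("FMUL", "F0", 1), ("FADD", "F2", 4), ("FLD", "F2", 7)]
# Same sequence with FADD started one clock later.
fadd_late = [("FMUL", "F0", 1), ("FADD", "F2", 5), ("FLD", "F2", 7)]

wb_a, same_a, waw_a = write_conflicts(on_slide)
wb_b, same_b, waw_b = write_conflicts(fadd_late)
```

With the assumed schedule the three instructions all write the RF in the same cycle and no WAW appears; delaying FADD by one clock makes it write F2 after the younger FLD, which is the WAW case mentioned in the text.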
MEM stage. It must be however assumed that between the two instructions there must at least
have dropped the instruction !
from that of their issue
Let’s consider these high-level language statements
X = Y + Z
A = B * C
to be executed in a processor with the following pipeline
Fetch F Dec. D Issue I Ex. E Ex. E Ex. E WB W
In order emission
The issue of the addition (multiply) is possible only AFTER the previous instruction execution calculating R2 (R5) that is after the last EX stage possibly with forwarding
Busy decoder The issue is possible here since the data for R1 and R2 have already been produced Multiply: waits for results RAW Stalls Data not available
D freed by the addition
Busy decoder- RAW
Decoder busy
Addition result not yet ready Parallelism
But we can modify the emission without modifying the result: 16 cycles instead of 22!
before after Fetch F Dec. D Issue I Ex. E Ex. E Ex. E WB W
Waiting for R5 Busy decoder Parallelism Waiting for R6
Emission possible since R1 and R2 already ready
Let’s suppose we have an FP adder (1 cycle, in red) and a multiplier (3 cycles, in green).
I1 F1 = F2 + F3
I2 F2 = F4 x F5
I3 F3 = F3 + F4
I4 F6 = F6 x F6
I5 F1 = F3 + F5
I6 F2 = F3 + F4
[Timing chart: I1 to I6 against cycles T1 to T10]
NB: in this graph the hazards are potential, since only the registers are considered, no matter how many cycles the executions require
I1 I2
WAR (F2)
I6
WAW(F2)
I3
WAR (F3) RAW (F3)
I5
RAW (F3) WAW(F1)
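The potential hazards labelled in the graph can be enumerated mechanically by comparing registers only, exactly as the note says (execution times are ignored). The following is a minimal sketch under that assumption; the tuple layout and the function name `potential_hazards` are hypothetical, and the function lists every pairwise register conflict, which may be more edges than the slide chooses to draw.

```python
# The six instructions of the example: (name, destination, sources).
PROGRAM = [
    ("I1", "F1", ("F2", "F3")),
    ("I2", "F2", ("F4", "F5")),
    ("I3", "F3", ("F3", "F4")),
    ("I4", "F6", ("F6", "F6")),
    ("I5", "F1", ("F3", "F5")),
    ("I6", "F2", ("F3", "F4")),
]


def potential_hazards(program):
    """Return (kind, register, older, younger) for every register conflict."""
    found = []
    for i, (old_name, old_dest, old_srcs) in enumerate(program):
        for new_name, new_dest, new_srcs in program[i + 1:]:
            if old_dest in new_srcs:      # younger reads what older writes
                found.append(("RAW", old_dest, old_name, new_name))
            if new_dest in old_srcs:      # younger writes what older reads
                found.append(("WAR", new_dest, old_name, new_name))
            if new_dest == old_dest:      # both write the same register
                found.append(("WAW", old_dest, old_name, new_name))
    return found


hazards = potential_hazards(PROGRAM)
for kind, reg, older, younger in hazards:
    print(f"{kind}({reg}): {older} -> {younger}")
```

Whether a listed conflict becomes a real hazard depends on the actual timing of the adder and the multiplier, which this sketch deliberately does not model.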
no implications for the compiler.
F12,F8,F14) if the conditions allow it
FDIV F0,F2,F4
FADD F10,F0,F8 (RAW - must wait for F0)
FSUB F12,F8,F14 (can be executed anyway)
FDIV F0, F2, F4
FADD F10, F0, F8
FSUB F8, F8, F14
They must read the same value
Write After Read (WAR)
FADD has read F8 an error would occur (F8 already updated)
been used as destination (in case FSUB would end before FADD)
instruction as soon as possible.
Read after Write (RAW)
FP MUL FP MUL FP DIV
Registers
FP ADD INTEG
Scoreboard
The scoreboard is roughly equivalent to the ID stage (just after the fetch) and determines when an instruction can read its operands and start its execution. The scoreboard tracks all system state changes and decides when the first instruction in the FIFO queue (as produced by the compiler) can be started.
Functional units
small
1. Emission: if a functional unit for the instruction is available (free), the instruction is issued unless another functional unit already holds an instruction which must write into the same destination register (no WAW hazards, therefore). In this latter case the instruction is stalled, which blocks the emission of all the following instructions in the prefetch queue even when all other conditions for them are met!
2. Operand read: the instruction has been emitted. If the operands are available and no already executing instruction must write them, the operands are read; otherwise the instruction stalls in the functional unit.
3. Execution: when the result has been computed and stored, the scoreboard is informed so as to unblock a possibly waiting instruction.
4. Write: in case of a possible WAR the instruction is stalled and does not write its result while a previous instruction has not yet read its operands and one of them is the destination register of the instruction under consideration. Once the operand has been read, the result can be written.
written as soon as produced (except for the WAR wait, point 4). The scoreboard technique allows instructions to be transferred directly from the EX to the WB stage (reducing the RAW risks).
Hypothetical timing for different instructions (which includes the operands read and execution)
FLD 1 cycle
FADD, FSUB 2 cycles
FMUL 10 cycles
FDIV 40 cycles
LD F6, 34(R2)
LD F2, 45(R3)
FMUL F0, F2, F4 (MULTD)
FSUB F8, F6, F2 (SUBD)
FDIV F10, F0, F6 (DIVD)
FADD F6, F8, F2 (ADDD)
RAW < WAR RAW
Integer Do you find more hypothetical hazards?
For instance what about F0?
Instruction stages: emission, operands read, execution and writeback
Statuses of the functional units (FU): 9 parameters
Busy: unit busy
Op: operation code presently executed
Fi: instruction destination (result) register
Fj, Fk: operand source registers
Qj, Qk: functional units producing the required operands (if not yet ready) for the registers Fj and Fk
Rj, Rk: flags (yes) indicating whether Fj, Fk have already been updated
Result status register: indicates which functional unit will write each register. Void when no functional unit has to do with the specific register
N.B. It must be remembered that in case of a possible WAW the emission of instructions is stalled (point 1 of the rules)
N.B. In the following example we suppose that two multiplication/division units are available
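The nine status parameters and the result status register map naturally onto a small record plus a dictionary. The sketch below is an assumption-laden illustration, not the slides' own notation: `FUStatus`, `issue` and the dictionary layout are hypothetical names, and only the emission rule (free unit, no pending write to the same destination) is modelled. The example state is the one the walkthrough shows when MULTD is emitted while the integer unit is still loading F2.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FUStatus:
    busy: bool = False
    op: Optional[str] = None   # operation being executed
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # first source register
    fk: Optional[str] = None   # second source register
    qj: Optional[str] = None   # unit that will produce Fj (None if not pending)
    qk: Optional[str] = None   # unit that will produce Fk (None if not pending)
    rj: bool = False           # Fj can be read
    rk: bool = False           # Fk can be read


def issue(units, name, op, dest, src_j, src_k, result_status):
    """Try to emit an instruction into unit `name`; return True on success."""
    if units[name].busy:                    # structural hazard: unit occupied
        return False
    if result_status.get(dest) is not None: # WAW: another unit will write dest
        return False
    qj = result_status.get(src_j)
    qk = result_status.get(src_k)
    units[name] = FUStatus(True, op, dest, src_j, src_k,
                           qj, qk, qj is None, qk is None)
    result_status[dest] = name              # this unit now owns the register
    return True


units = {n: FUStatus() for n in ("Integer", "Mult1", "Mult2", "Add", "Divide")}
units["Integer"] = FUStatus(busy=True, op="Load", fi="F2", fk="R3", rk=True)
result_status = {"F2": "Integer"}           # F2 will be written by the load

ok = issue(units, "Mult1", "Mult", "F0", "F2", "F4", result_status)
waw_blocked = issue(units, "Mult2", "Mult", "F0", "F4", "F4", result_status)
```

After the call, `units["Mult1"]` holds `Qj = Integer`, `Rj = No`, `Rk = Yes`, and a second instruction targeting F0 is refused even though Mult2 is free, which is the WAW stall of rule 1.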
Example (here we assume that F0 is a “normal”register and not always “0”)
Instruction status Read ExecutionWrite Instruction j k Issue Op complete Result LD F6 34 R2 LD F2 45 R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 Functional unit status Time Name Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F31
Functional Unit producing the result for the floating point register Fx (Qj, Qk)
Instructions states Progression clock
1 integer unit 2 multipl. units 1 add/sub unit 1 division unit
Rj and Rk indicates whether (possibly in the next cycle if just produced) the data can be read from the operands source registers of the executed
yet ready). Fj and Fk are the registers where data produced by Qj and Qk are stored (or will be stored in the next cycle – data available if the corresponding Ri is yes) to be used in the executed instruction
F2 dest Source1Source 2 Integer Mult1 Mult2 Add Divide FU for j FU for k Fj? Fk? Busy Op Fi Fj Fk Qj Qk Rj Rk
Register Qi Ready ? FU=Functional Unit
execution yet to elapse
NB: LD = FLD, MULTD = FMUL, SUBD = FSUB, DIVD = FDIV, ADDD = FADD
FLD 1 cycle FADD, FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles
Parallelism Floating point result registers
23
Instruction status Read Execution Write Instruction j k Issue Op/Excomplete Result LD F6 34 R2 LD F2 45 R3 MULTDF0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status dest
S1 S2 FUj FUk Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 No Mult2 No Add No Divide No Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Integer
Functional unit used for producing the result in F6 R2 is supposed to be already available and therefore in the next clock can be
At clock 1 the instruction stage of LD F6,34(R2) is Issue
Yes Load F6 R2 Yes
R2
1
Brown colour for state change
1
24
Instruction status Read Execution Write Instruction j k Issue Op/Ex complete Result LD F6 34 R2 1 2 LD F2 45 R3 MULTDF0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F6 R2 Mult1 No Mult2 No Add No Divide No Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Integer
Data ready in R2: instruction can proceed
NB: The second LD cannot be emitted because the integer unit is busy, and the same applies to MULTD because instructions must be emitted in order
2
25
Instruction status Read Execution Write Instruction j k Issue complete Result LD F6 34 R2 1 2 3 LD F2 45 R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F6 R2 Mult1 No Mult2 No Add No Divide No Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Integer
FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles
Op/Ex 3
26
Instruction status Read Execution Write Instruction j k Issue complete Result LD F6 34 R2 1 2 3 LD F2 45 R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load R2 Mult1 No Mult2 No Add No Divide No Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU 4 F6
Register at the end of the period has been written Functional unit freed at the end of the period
The change of status of the FUs indicates their value at the positive clock edge ending the current cycle (future status). For instance the integer functional unit is freed at the end of cycle 4 together with the result writeback. LD F6, 34(R2) disappears totally from the scoreboard at the positive clock edge ending cycle 4.
Op/Ex Integer 4
27
Instruction status Read Execution Write Instruction j k Issue complete Result LD F6 34 R2 1 2 3 LD F2 45 R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status dest Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 No Mult2 No Add No Divide No Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 RU S1 S2 RUj RUk Rj? Rk?
R3 supposed already ready as in the previous case
5 Yes Load F2 R3 Yes Integer
The Integer Functional Unit must produce a new value for F2
At the beginning of cycle 5 the integer unit is already free and therefore LD F2, 45(R3) can start
4 Op/Ex 5
28
Instruction status Read ExecutionWrite Instruction j k Issue complete Result LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status dest Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 Mult1 Mult2 No Add No Divide No Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Integer
S1 S2 FUj FUk Fj? Fk?
F4 supposed already present
Yes Mult F0 F2 F4 Integer No Yes Mult
MULTD waits for F2 from the integer unit !!!!
6
MULTD F0 F2, F4 can start because its FU is free and the destination register is F0
Op/Ex 6
29
MULTD stalled in the execution unit because F2 not yet ready.
Instruction status Read ExecutionWrite Instruction j k Issue complete Result LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 MULTD F0 F2 F4 6 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status dest Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 Mult1 Yes Mult F0 F2 F4 Integer No Yes Mult2 No Add Divide No Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Mult Integer
S1 S2 FUj FUk Fj? Fk?
(NB : FP adder executes FP subtractions too)
F8 Yes Subd F6 F2 Integer Yes No Add 7
SUBD F8, F6, F2 can start because the FP add/subtract unit is free. Like MULTD, SUBD waits for F2.
Op/Ex 7
30
Instruction status
Read EX Write
Instruction j k
Issue
LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTDF0 F2 F4 6 SUBD F8 F6 F2 7 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status dest Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 Yes Mult F0 F2 F4 Yes Mult2 No Add Yes Sub F8 F6 F2 Yes Divide Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Mult1 Add
S1 S2 FUj FUk Fj? Fk?
F0 not yet available
Yes Load F2 R3
8 Yes Div F10 F0 F6 Mult1 No Yes Divide
DIVD F10 F0, F6 can start because the divide FP FU is free
Updated at the end of the cycle
Yes Yes
F2 available !!
F2 written allows MULTD and SUBD to read the operands during the next cycle
F2 is written and therefore the integer unit is free
Op/Ex 8
31
Instruction status Read EX Write Instruction j k Issue
complete Result
LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 SUBD F8 F6 F2 7
N.B.: MULTD and SUBD can read the operands because F2 is available (see cycle 8). DIVD is still stalled because of F0.
9 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 10 clock Mult1 Yes Mult F0 F2 F4 Mult2 No 2 clock Add Yes Sub F8 F6 F2 Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Mult1 Add Divide 40 clock
Parallelism ADDD cannot start because SUBD uses the adder FU
Op/Ex
9-10
32
Note: the Add FU requires 2 cycles for the SUBD and therefore nothing happens in cycle 10, while MULTD still processes its data
NB: ADDD will use the result of the SUBD but has not started yet because SUBD still occupies the adder FU
Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No
8 clocks more
Mult1 Yes Mult F0 F2 F4 Mult2 No 0 Add Yes Sub F8 F6 F2 Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Mult1 Add Divide
Instruction status Read EX Write Instruction j k Issue
complete Result
LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11
Parallelism Op/Ex
11
33
Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name BusyOp Fi Fj Fk Qj Qk Rj Rk Integer No
7 clocks more
Mult1 Yes Mult F0 F2 F4 Mult2 No Add No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Mult1 Divide
Instruction status Read EX Write Instruction j k Issue
completeResult
LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11
SUBD ends freeing the FU. In the next period ADDD can start
12
F8 is written and the ADD/SUB FU is freed
FLD 1 cycle FADD and FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles
Parallelism Op/Ex
12
34
Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name BusyOp Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 No Add Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Mult1 Divide
Instruction status Read EX Write Instruction j k Issue Op/Ex complete Result LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11 12
Now ADDD can start because SUBD has finished its execution and has freed the FU
Yes Add F6 F8 F2 Yes Yes Add
13
6 Clocks more
FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles
13
35
Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name BusyOp Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 No Add Yes Add F6 F8 F2 Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Mult1 Add Divide
Instruction status Read EX Write Instruction j k Issue
completeResult
LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11 12 13 14
5 clocks more 2 Clocks more
FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles
Parallelism Op/Ex
14
36
ADDD requires two cycles and therefore no system status change
Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name BusyOp Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 No Add Yes Add F6 F8 F2 Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Mult1 Add Divide
Instruction status Read EX Write Instruction j k Issue
complete Result
LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11 12 13 14
4 Clocks more 1 Clock more
FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles
Parallelism Op/Ex
15
37
Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 No Add Yes Add F6 F8 F2 Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Mult1 Add Divide
Instruction status Read EX Write Instruction j k Issue
completeResult
LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11 12 13 14
ADDD ended its EX stage while MULTD and DIVD keep executing
16
3 clocks more
FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles
Parallelism Op/Ex
16
38
Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 No Add Yes Add F6 F8 F2 Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Mult1 Add Divide
Instruction status Read EX Write Instruction j k Issue
completeResult
LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11 12 13 14 16
NB: ADDD is stalled (it cannot write) because of a WAR with DIVD on F6. DIVD has not read F6 yet because it waits for F0 produced by MULTD (operands are read in parallel). MULTD keeps executing; DIVD is still waiting.
Stalled because of the WAR on F6
2 Clocks more
FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles
Parallelism Op/Ex
17
39
MULT still executing DIVD still stalled
Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 No Add Yes Add F6 F8 F2 Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Mult1 Add Divide
Instruction status Read EX Write Instruction j k Issue
complete Result
LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11 12 13 14 16
1 clock more
FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles
Parallelism Op/Ex
18
40
Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 No Add Yes Add F6 F8 F2 Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Mult1 Add Divide
Instruction status Read EX Write Instruction j k Issue
completeResult
LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11 12 13 14 16
MULT ends its execution, will write in cycle 20 (after 10 cycles) which will unblock DIVD and then ADDD
19
FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles
Parallelism Op/Ex
19
41
Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add Yes Add F6 F8 F2 Divide Yes Div F10 F0 F6 Yes Yes Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Add Divide
Instruction status Read EX Write Instruction j k Issue
completeResult
LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11 12 13 14 16 19
MULTD writes F0
unblocking DIVD
20
FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles
Parallelism Op/Ex
20
42
Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add Yes Add F6 F8 F2 Divide Yes Div F10 F0 F6 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Add Divide
Instruction status Read EX Write Instruction j k Issue
complete Result
LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11 12 13 14 16 19 20
DIVD reads both F0 and F6 (which could not be written by ADDD because of WAR) unblocking ADDD which can write F6 in the next cycle
21
Parallelism Op/Ex
21
43
Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add No Divide Yes Div F10 F0 F6 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Divide
Instruction status Read EX Write Instruction j k Issue
complete Result
LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11 12 13 14 16 19 20 21
Now ADDD can write F6, since the WAR hazard with DIVD has disappeared. For 6 cycles ADDD couldn’t write F6 although its result was available
22
Op/Ex
22
44
Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add No Divide Yes Div F10 F0 F6 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU Divide
Instruction status Read EX Write Instruction j k Issue
complete
LD F6 34 R2 1 2 3 4 LD F2 45 R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 11 12 13 14 16 19 20 21 22
DIVD execution ends after 40 cycles
61 Result
Parallelism Op/Ex
61
45
All executions ended (clock 62)
Instruction status
Instruction          Issue  Read Op  Ex complete  Write Result
LD    F6   34  R2      1       2         3            4
LD    F2   45  R3      5       6         7            8
MULTD F0   F2  F4      6       9        19           20
SUBD  F8   F6  F2      7       9        11           12
DIVD  F10  F0  F6      8      21        61           62
ADDD  F6   F8  F2     13      14        16           22
Functional unit status: Integer, Mult1, Mult2, Add and Divide are all free (Busy = No)
Register result status: no register from F0 to F31 is waiting for a functional unit
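The whole walkthrough can be replayed with a small cycle-by-cycle model. What follows is a sketch under stated assumptions, not the slides' implementation: one instruction at most is emitted per clock, a unit and a destination register are released only in the cycle after the write, an operand written in cycle n can be read from cycle n+1, and execution completes `LATENCY` cycles after the operand read. The names `Instr`, `simulate`, `LATENCY`, `UNIT` and `UNIT_COUNT` are hypothetical.

```python
from dataclasses import dataclass

LATENCY = {"LD": 1, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}
UNIT = {"LD": "Integer", "ADDD": "Add", "SUBD": "Add",
        "MULTD": "Mult", "DIVD": "Divide"}
UNIT_COUNT = {"Integer": 1, "Mult": 2, "Add": 1, "Divide": 1}


@dataclass
class Instr:
    op: str
    dest: str
    srcs: tuple
    issue: int = 0   # 0 means "stage not reached yet"
    read: int = 0
    done: int = 0
    write: int = 0


def simulate(program):
    issued = []      # instructions already emitted, in program order
    cycle = 0
    while len(issued) < len(program) or any(i.write == 0 for i in issued):
        cycle += 1
        happened = lambda t: 0 < t < cycle   # event visible in this cycle
        in_flight = [i for i in issued if not happened(i.write)]
        for pos, ins in enumerate(issued):
            older = issued[:pos]
            if ins.read == 0:
                # RAW: every older writer of a source must have written.
                if all(happened(o.write) for o in older if o.dest in ins.srcs):
                    ins.read = cycle
                    ins.done = cycle + LATENCY[ins.op]
            elif ins.write == 0 and ins.done < cycle:
                # WAR: every older reader of the destination must have read.
                if all(happened(o.read) for o in older if ins.dest in o.srcs):
                    ins.write = cycle
        if len(issued) < len(program):
            cand = program[len(issued)]      # in-order emission
            unit_busy = sum(UNIT[i.op] == UNIT[cand.op] for i in in_flight)
            waw = any(i.dest == cand.dest for i in in_flight)
            if unit_busy < UNIT_COUNT[UNIT[cand.op]] and not waw:
                cand.issue = cycle
                issued.append(cand)
    return [(i.op, i.issue, i.read, i.done, i.write) for i in program]


timeline = simulate([
    Instr("LD", "F6", ("R2",)),
    Instr("LD", "F2", ("R3",)),
    Instr("MULTD", "F0", ("F2", "F4")),
    Instr("SUBD", "F8", ("F6", "F2")),
    Instr("DIVD", "F10", ("F0", "F6")),
    Instr("ADDD", "F6", ("F8", "F2")),
])
for row in timeline:
    print(row)
```

Each printed row is (operation, issue, operand read, execution complete, write result). Under the assumptions above the model gives ADDD a write in cycle 22, held back by the WAR with DIVD on F6, and DIVD a write in cycle 62.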
WAW WAR
N.B Hazards of the sequence are only potential: their occurrence depends
that they must have been already stored in the registers – no RAW problem)
RAW
Tomasulo algorithm: “renaming” is based on the concept of “reservation stations”, which are functional-unit buffers where instructions can be «parked» waiting for the availability of the requested FU and of the needed data. The following benefits occur
«Renaming» indicates a location different from the RF where a requested datum is produced/stored and can be
free and the needed data arrive as soon as produced (N.B. before being written in the RF). For its operands EITHER the source register data OR the reservation stations producing them are indicated (whence renaming). The renaming occurs at run-time
the register file access). Similar to the case of forwarding
available) only the most recently produced data are written (for each register a TAG is used indicating the FU which has the right to write)
stored in the reservation stations of each functional unit determines whether an instruction can execute in the FU, since the source (where the datum is being produced) and NOT the RF is in any case indicated. RAW hazards are no longer possible since the requested data are provided as soon as produced. The same holds for WAR
through the common data busses (multiple reservation stations in addition to RF register can be accessed at the same time when multiple busses are available)
Tomasulo eliminates not only WAWs but also WARs
FLD F6, 32(R2)
FLD F2, 44(R3)
FMUL F0, F2, F4
FSUB F8, F2, F6
FDIV F10, F0, F6
FADD F6, F8, F2
Renaming (functional unit producing the datum)
Possible WAW. As far as the WAW between FLD and FADD on F6 is concerned, the mechanism guarantees that only the most recent instruction in the RS using a destination register can write the register.
FLD [T/F6], 32(R2)
FLD F2, 44(R3)
FMUL F0, F2, F4
FSUB F8, F2, [T]
FDIV F10, F0, [T]
FADD F6, F8, F2
When an instruction is inserted in an RS it is checked whether one or more of its operands are being produced elsewhere by other RS: if so, renaming takes place.
Possible WAR. For the FADD a potential WAR with the FDIV could occur if FADD ended before FDIV has read its operands. Since FDIV points for F6 to the RS of FLD F6, 32(R2) and not to the RF, the problem does not occur. The same holds for FSUB.
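Renaming can be pictured as a table that maps each register to the reservation station that will produce its newest value. The sketch below is an illustration only: it assumes that none of the six instructions has completed yet, and the station tags (`Load1`, `Mult1`, ...) and the function name `rename` are hypothetical.

```python
def rename(program):
    """program: list of (tag, op, dest, sources) in emission order.

    Each source is replaced by the tag of the station that will produce it,
    or kept as a register name when its value is already in the RF.
    """
    producer = {}                     # register -> tag of its latest writer
    renamed = []
    for tag, op, dest, srcs in program:
        operands = tuple(producer.get(reg, reg) for reg in srcs)
        renamed.append((tag, op, dest, operands))
        producer[dest] = tag          # only the latest writer may update dest
    return renamed, producer


PROGRAM = [
    ("Load1", "FLD", "F6", ("R2",)),
    ("Load2", "FLD", "F2", ("R3",)),
    ("Mult1", "FMUL", "F0", ("F2", "F4")),
    ("Add1", "FSUB", "F8", ("F2", "F6")),
    ("Mult2", "FDIV", "F10", ("F0", "F6")),
    ("Add2", "FADD", "F6", ("F8", "F2")),
]

renamed, producer = rename(PROGRAM)
for tag, op, dest, operands in renamed:
    print(tag, op, dest, operands)
```

FDIV ends up pointing at `Load1` for F6 rather than at the register, so a later write of F6 by FADD cannot disturb it (no WAR), and the table leaves `Add2` as the only station entitled to write F6 (no WAW).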
Very high performance without special compilers
Differences with scoreboard
Buffers and controls directly distributed in the FUs (there is no centralized control): the buffers are called “reservation stations”
Source register names replaced by pointers to the buffers of the reservation stations (if the requested data are being produced there)
“Renaming”: a direct pointer to the sources and not to the register
One or more Common Data Buses for sending results to all FUs requiring them
Loads and Stores considered as FUs (a STORE can also be a source for an RS executing a LOAD)
In this example it is assumed that the MUL unit executes the DIVs too and that the ADD unit executes the SUBs too. LOADs and STOREs are handled as other instructions.
In this example:
3 RS for add/sub
2 RS for mult/div
5 RS for store
5 RS for load
In this example there is only one Data Bus. Please notice that the same Common Data Bus is also used by the RS waiting for data.
Each RS (more than one for each FU) stores an emitted instruction and, for each operand, one of two elements: either the operand value (i.e. read from RF) or the name of the RS which is producing it (renaming)
For the data produced by the FUs
available) to the RF and to the RS waiting for it. Parallelism
free RS for the requested FU (the only condition) otherwise the instruction queue stalls. Operands are extracted from RF or the producing FU as indicated. In case of WAW it must be determined which instruction must provide the data
transferred over a bus anyway) in order to catch them (and their sources) as soon as available: RAW are therefore avoided (we are sure not to read stale data in the RF).
Let’s see the scoreboard example in a Tomasulo architecture. Let’s suppose that the execution times are the same as in the scoreboard example (FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles). NB: the “+1” is for the writeback
LD F6, 34(R2)
LD F2, 45(R3)
FMUL F0, F2, F4
FSUB F8, F6, F2
FDIV F10, F0, F6
FADD F6, F8, F2
Register File Status: indicates which FU will write the register (if needed). A blank means that there are no instructions which must write the register and therefore its value can be directly used
N.B. From the general instruction queue one instruction per clock is emitted when an RS of the FU for that instruction is available; otherwise the queue stalls. In our example we assume only one CDB.
Op: opcode of the instruction to be executed
Vj, Vk: places of the operands (either RF or the FUs producing them)
Qj, Qk: functional units producing the results. A blank indicates that the source operands are already in Vj or Vk or that they are not required
Busy: FU busy
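The fields just listed, together with the Common Data Bus, can be sketched as follows. This is a simplified illustration and not the slides' exact table: here Vj/Vk hold operand values once captured, Qj/Qk hold the tag of the producing station, and the names `ReservationStation` and `broadcast` as well as the numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # value of the first operand, once known
    vk: Optional[float] = None   # value of the second operand, once known
    qj: Optional[str] = None     # station producing the first operand
    qk: Optional[str] = None     # station producing the second operand

    def ready(self):
        """The instruction can execute when no operand is still pending."""
        return self.busy and self.qj is None and self.qk is None


def broadcast(tag, value, stations, reg_status, reg_file):
    """Put (tag, value) on the common data bus."""
    for rs in stations:               # waiting stations capture it on the fly
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg, owner in reg_status.items():
        if owner == tag:              # only the latest writer updates the RF
            reg_file[reg] = value
            reg_status[reg] = None


# State assumed while the second load is still fetching F2.
mult1 = ReservationStation("Mult1", True, "Mult", vk=4.0, qj="Load2")
add1 = ReservationStation("Add1", True, "Sub", vj=6.0, qk="Load2")
reg_status = {"F0": "Mult1", "F2": "Load2", "F8": "Add1"}
reg_file = {"F4": 4.0, "F6": 6.0}

broadcast("Load2", 2.5, [mult1, add1], reg_status, reg_file)
```

After the broadcast both stations hold the loaded value and are ready, and F2 is written in the RF in the same step, which mirrors the remark that the datum is written both in the register and in the waiting RS.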
Instruction status Execution Write Instruction j k Issue LD F6 34 R2 LD F2 45 R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk 0 Add1 No 0 Add2 No Add3 No 0 Mult1 No 0 Mult2 No Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 FU
The FU producing the new value
Producing FU – if blank it means that the datum is in RF
Operands register. If blank the datum is produced in the corresponding Q FU
number of RS. Their BUSY status is here displayed differently from the FU (see next slide)
For the sake of simplicity Rj and Rk (ready/not ready) are not displayed since their values are implicit in the status of Qj and Qk
Load/store not indicated in the status table
FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles)
55
Instruction status Execution Write Instruction j k Issue Busy LD F6 34 R2 LD F2 45 R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk 0 Add1 No 0 Add2 No Add3 No 0 Mult1 No 0 Mult2 No Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 1 FU Address 1 Load1 Yes Load1 34+R2
3 RS for adder/sub 2 RS for mul/div NB: Here it is assumed that R2 and R3 are already available 5 RS for the LOAD
FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles)
56
Instruction status Execution Write Instruction j k Issue Busy LD F6 34 R2 1 LD F2 45 R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No Mult2 No Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 2 FU Load1 Address Load1 Yes 34+R2
5 RS for LOAD
2- 2 Load2 Yes 45+R3 Load2
The second LD is emitted. One instruction per clock is emitted (when possible)
N.B. A second LOAD has been emitted (not possible with the scoreboard) and parked in the RS. The R3 value is already available in the RF
NB: Load -> 2 cycles: the first one for computing the address and the second for reading the datum
FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles)
57
Instruction status Execution Write Instruction j k Issue Busy LD F6 34 R2 1 2--3 Load1 Yes LD F2 45 R3 2 3- Load2 Yes MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 Mult2 No Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 3 FU Load2 Load1 Address 34+R2 45+R3
10 cycles remaining. LD: two cycles
MULTD can be emitted although F2 is NOT yet available. F2 -> renaming
3 Yes Mult F4 Mult1 Load2
MULTD emitted (free RS)
Datum assumed to be already in the RF
The FUs execute both sums and subtractions
Instruction status Execution Write Instruction j k Issue Busy LD F6 34 R2 1 2--3 LD F2 45 R3 2 3--4 Load2 Yes MULTD F0 F2 F4 3 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 Add2 No Add3 No Mult1 Yes Mult F4 Load2 Mult2 No Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 4 FU Mult1 Load2 Address 45+R3
Yet 3 cycles Yet 10 cycles
4
The datum read from memory LD F6 34(R2) is written both in the RF and in the RS of SUBD which is waiting for it
4 Add1 Yes Sub F6 (captured on the fly) Load2
SUBD is emitted (RS free). F6 is available in the RF at the end of the cycle
FU freed
Instruction status Execution Write Instruction j k Issue LD F6 34 R2 1 2--3 4 LD F2 45 R3 2 3--4 MULTD F0 F2 F4 3 SUBD F8 F6 F2 4 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk
3 Add1
Yes Sub F6 (capt.) F2 (capt) 0 Add2 No Add3 No 10 Mult1 Yes Mult F2 (capt) F4 0 Mult2 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 5 FU Mult1 Add1
Cycles remaining to complete the execution
5 5 Yes Div F6 Mult1 Mult2
DIVD is emitted (RS free)
Waiting for F0. FU freed
The datum read from memory with LD F2 45(R3) is written both in register F2 and in the RS of SUBD and MULTD which are waiting for it
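The capture step described here (one result going both to the register and to every reservation station waiting for it) can be sketched in Python. This is only an illustrative model, not the slides' implementation: the class `ReservationStation`, the function `broadcast`, and the numeric values stored in the registers are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    op: str
    vj: Optional[float] = None   # value of the first operand, if already captured
    vk: Optional[float] = None   # value of the second operand, if already captured
    qj: Optional[str] = None     # producer of the first operand, if still pending
    qk: Optional[str] = None     # producer of the second operand, if still pending

    def ready(self) -> bool:
        # Rj/Rk are implicit: an operand is ready when its Q field is empty.
        return self.qj is None and self.qk is None


def broadcast(tag, value, stations, reg_status, reg_file):
    """Model one CDB write: every consumer of `tag` captures `value`."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg, producer in list(reg_status.items()):
        if producer == tag:
            reg_file[reg] = value
            reg_status[reg] = None


# Situation before Load2 writes its result (register values are made up).
reg_file = {"F4": 2.0, "F6": 1.5}
reg_status = {"F0": "Mult1", "F2": "Load2", "F8": "Add1"}
mult1 = ReservationStation("Mult1", "Mult", vk=reg_file["F4"], qj="Load2")
add1 = ReservationStation("Add1", "Sub", vj=reg_file["F6"], qk="Load2")

broadcast("Load2", 3.0, [mult1, add1], reg_status, reg_file)
```

After the call, both `mult1` and `add1` hold the loaded value and report `ready()`, while F0 and F8 still point at their producers in `reg_status`.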
Cycles remaining to complete the execution
Instruction status Execution Write Instruction j k Issue LD F6 34 R2 1 2--3 4 LD F2 45 R3 2 3--4 5 MULTD F0 F2 F4 3 6 -- SUBD F8 F6 F2 4 6 -- DIVD F10 F0 F6 5 ADDD F6 F8 F2 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk
2 Add1
Yes Sub F6 F2 Add2 Yes Add F2 Add1 Add3 No 9 Mult1 Yes Mult F2 F4 Mult2 Yes Div F6 Mult1 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 6 FU Mult1 Add1 Mult2
40 cycles remaining
Now MULTD can execute (F2 and F4 available)
6 Add2
ADDD is emitted (RS free)
Waiting for F0. Waiting for F8
Instruction status Execution Write Instruction j k Issue LD F6 34 R2 1 2--3 4 LD F2 45 R3 2 3--4 5 MULTD F0 F2 F4 3 6 -- SUBD F8 F6 F2 4 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk
1 Add1
Yes Sub F6 F2 Add2 Yes Add F2 Add1 Add3 No 8 Mult1 Yes Mult F2 F4 Mult2 Yes Div F6 Mult1 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 7 FU Mult1 Add2 Add1 Mult2 6 -- 7
SUBD (like ADDD) takes two cycles. ADDD is stalled waiting for SUBD (F8). The datum in F6 will be overwritten by ADDD, but it was already read and is present in the RS of DIVD
40 cycles remaining
Instruction status Execution Write Instruction j k Issue LD F6 34 R2 1 2--3 4 LD F2 45 R3 2 3--4 5 MULTD F0 F2 F4 3 6 -- SUBD F8 F6 F2 4 6 -- 7 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 No 2 Add2 Yes Add F8 F2 Add3 No 7 Mult1 Yes Mult F2 F4 Mult2 Yes Div F6 Mult1 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 8 FU Mult1 Add2 Mult2 Yet 40 8
NB: SUBD ends before MULTD and allows ADDD (which captures the result of F8) to start executing
FU freed
Instruction status Execution Write Instruction j k Issue LD F6 34 R2 1 2--3 4 LD F2 45 R3 2 3--4 5 MULTD F0 F2 F4 3 6 -- SUBD F8 F6 F2 4 6 -- 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 9 -- Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 No 2 Add2 Yes Add F8 F2 Add3 No 6 Mult1 Yes Mult F2 F4 Mult2 Yes Div F6 Mult1 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 9 FU Mult1 Add2 Mult2 Yet 40
ADDD executing
Instruction status Execution Write Instruction j k Issue LD F6 34 R2 1 2--3 4 LD F2 45 R3 2 3--4 5 MULTD F0 F2 F4 3 6 -- SUBD F8 F6 F2 4 6 -- 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 No
1 Add2
Yes Add F8 F2 Add3 No 5 Mult1 Yes Mult F2 F4 Mult2 Yes Div Mult1 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 10 FU Mult1 Add2 Mult2 9 -- 10
Two execution cycles
Yet 40 F6
Instruction status Execution Write Instruction j k Issue LD F6 34 R2 1 2--3 4 LD F2 45 R3 2 3--4 5 MULTD F0 F2 F4 3 6 -- SUBD F8 F6 F2 4 6 -- 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 9 -- 10 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No 4 Mult1 Yes Mult F2 F4 Mult2 Yes Div F6 Mult1 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 11 FU Mult1 Mult2 40 11
ADDD also ends before MULTD and DIVD
FU freed
Cycles remaining to complete the execution
Instruction status Execution Write Instruction j k Issue LD F6 34 R2 1 2--3 4 LD F2 45 R3 2 3--4 5 MULTD F0 F2 F4 3 6 -- SUBD F8 F6 F2 4 6 -- 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 9 -- 10 11 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No 3 Mult1 Yes Mult F2 F4 Mult2 Yes Div F6 Mult1 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 12 FU Mult1 Mult2 40
Waiting for the datum produced by MULTD
Cycles remaining to complete the execution
Instruction status Execution Write Instruction j k Issue LD F6 34 R2 1 2--3 4 LD F2 45 R3 2 3--4 5 MULTD F0 F2 F4 3 6 -- 15 SUBD F8 F6 F2 4 6 -- 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 9 -- 10 11 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No
1 Mult1
Yes Mult F2 F4 Yet 40 Mult2 Yes Div F6 Mult1 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 15 FU Mult1 Mult2
Waiting for the datum produced by MULTD
Instruction status Execution Write Instruction j k Issue LD F6 34 R2 1 2--3 4 LD F2 45 R3 2 3--4 5 MULTD F0 F2 F4 3 6 -- 15 SUBD F8 F6 F2 4 6 -- 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 9 -- 10 11 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No 40 Mult2 Yes Div F0 F6 Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 16 FU Mult2
Now DIVD can execute
16
FU freed
Instruction status Execution Write Instruction j k Issue LD F6 34 R2 1 2--3 4 LD F2 45 R3 2 3--4 5 MULTD F0 F2 F4 3 6 -- 15 16 SUBD F8 F6 F2 4 6 -- 7 8 DIVD F10 F0 F6 5 17 -- 56 ADDD F6 F8 F2 6 9 -- 10 11 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No 0 Mult2 Yes Div F0
F6
Register result status Clock F0 F2 F4 F6 F8 F10 F12 ... F31 56 FU Mult2
Instruction status (clock 57)

Instruction   j    k    Issue   Execution complete   Write Result
LD     F6     34   R2   1       2--3                 4
LD     F2     45   R3   2       3--4                 5
MULTD  F0     F2   F4   3       6--15                16
SUBD   F8     F6   F2   4       6--7                 8
DIVD   F10    F0   F6   5       17--56               57
ADDD   F6     F8   F2   6       9--10                11

Reservation Stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk): Add1, Add2, Add3, Mult1, Mult2 all Busy = No
Register result status (F0, F2, F4, F6, F8, F10, F12, ..., F31): no FU pending
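The issue, execution and write cycles of this example can be reproduced with a very small timing model. The Python sketch below is a simplification and not the full algorithm: it assumes one instruction issued per clock, that a reservation station and a functional unit are always free, that there is no contention on the CDB, and that an instruction starts executing the cycle after both its issue and the last write of its operands. The names `LATENCY`, `PROGRAM` and `schedule` are chosen only for this illustration.

```python
# Execution cycles per operation (the "+1" write cycle is added separately).
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}

# (operation, destination, floating-point sources) in program order.
PROGRAM = [
    ("LD", "F6", ()),
    ("LD", "F2", ()),
    ("MULTD", "F0", ("F2", "F4")),
    ("SUBD", "F8", ("F6", "F2")),
    ("DIVD", "F10", ("F0", "F6")),
    ("ADDD", "F6", ("F8", "F2")),
]


def schedule(program, latency):
    """Return (op, dest, issue, exec_start, exec_end, write) for each instruction."""
    last_write = {}  # register -> cycle in which its most recent producer writes
    rows = []
    for index, (op, dest, sources) in enumerate(program):
        issue = index + 1
        # Operands not produced by an earlier instruction are ready at time 0.
        operands_ready = max([last_write.get(s, 0) for s in sources], default=0)
        start = max(issue, operands_ready) + 1
        end = start + latency[op] - 1
        write = end + 1
        # Updated only now, so earlier readers (DIVD reading F6) are unaffected
        # by a later writer (ADDD writing F6): the WAR hazard disappears.
        last_write[dest] = write
        rows.append((op, dest, issue, start, end, write))
    return rows


TIMELINE = schedule(PROGRAM, LATENCY)
for row in TIMELINE:
    print(row)
```

Under these assumptions the model gives the same cycles as the status table: for instance DIVD issues at clock 5, executes from 17 to 56 and writes at 57.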
A demo can be found at http://www.ecs.umass.edu/ece/koren/architecture/Tomasulo1/tomasulo_files/tomasulo.htm
means reduced efficiency
74
Exceptional conditions (overflow, division by zero, etc.)
Errors (e.g. parity)
Page fault (or, see later, segment fault): data not available in memory
Synchronous to the current process
Handled by the operating system
An instruction can be interrupted during its execution (e.g. by a page fault) and must therefore be «restartable». The executing program is normally temporarily aborted.
must be saved
The user program is interrupted and then restored
Asynchronous to the current process
Acknowledged at the end of the current instruction (if interrupts are enabled)
The handler is the responsibility of the user program
75
Instruction Restart
“non-pipelined” architecture
they never started
In-order emission; out-of-order execution (and therefore out-of-order termination)
78
ROB FP Op Queue FP Adder FP Adder Res Stations Res Stations FP Regs
Parallelism
that the instruction is virtually inserted in the ROB
RF) which provides also the operands to other instructions which requires them (renaming!)
Commitment
transferred to the architectural registers
79
Parallelism
80
N.B. Sometimes several instructions can be committed simultaneously. If the destination is the same (unlikely, otherwise the compiler would have dropped the first one), the result of the most recent instruction is used.
a ROB slot available. In the RS are indicated the operand sources and the ROB slot where the instruction will be “parked” after its execution (this phase is called “dispatch”). The results are NOT written in the RF until the commitment phase. NB: the lack of either of the two conditions blocks the emission of the following instructions
this case the operand value computed by the nearest previous instructions is used)
Execution ends. The result is transmitted on the CDB to the RS waiting for it and to the ROB.
the instruction is at the top of the ROB FIFO. In case of an erroneously predicted branch, the ROB results are just dropped (“graduation”). EMISSION IN ORDER, COMMITMENT IN ORDER
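The in-order commitment and the dropping of speculative results on a mispredicted branch can be sketched as a FIFO in Python. This is a minimal illustration under stated assumptions, not the actual hardware policy: the names `RobEntry`, `ReorderBuffer`, `emit` and `commit` are hypothetical, results are plain numbers, and all the instructions that are ready at the head are committed in the same call.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    dest: Optional[str]            # architectural destination (None for a branch)
    value: Optional[float] = None  # result, once received from the CDB
    done: bool = False             # execution terminated
    mispredicted: bool = False     # only meaningful for branches


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()     # left end = top of the FIFO (oldest instruction)

    def emit(self, dest):
        """Reserve a slot in program order; None means emission must stall."""
        if len(self.entries) == self.size:
            return None
        entry = RobEntry(dest)
        self.entries.append(entry)
        return entry

    def commit(self, reg_file):
        """Commit terminated instructions from the top, strictly in order."""
        committed = []
        while self.entries and self.entries[0].done:
            head = self.entries.popleft()
            if head.mispredicted:
                self.entries.clear()   # younger results are simply dropped
                break
            if head.dest is not None:
                reg_file[head.dest] = head.value
                committed.append(head.dest)
        return committed


rob = ReorderBuffer(size=7)
regs = {}
ld = rob.emit("F0")
fadd = rob.emit("F10")
fadd.value, fadd.done = 3.0, True      # terminates out of order
first = rob.commit(regs)               # nothing commits: LD is still at the top
ld.value, ld.done = 1.0, True
second = rob.commit(regs)              # LD, then FADD, in program order

branch = rob.emit(None)
young = rob.emit("F4")                 # speculative instruction after the branch
young.value, young.done = 9.0, True
branch.done, branch.mispredicted = True, True
third = rob.commit(regs)               # branch reaches the top: F4 is discarded
```

The register file only ever sees results of committed instructions, which is what makes the state precise at an exception or misprediction.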
Reorder Buffer FP Op Queue FP Adder FP Adder Res Stations Res Stations FP Regs Compar network
Destination Register Result Exception? Valid (terminated ) Program Counter
82
LD F0, 10(R2) FADD F10, F4, F0 FDIV F2, F10, F6 BRNE F2, +100 LD F4, 0(R3) FADD F0, F4, F6 ST 0(R3), F4
Parallelism
83
To memory FP adders FP multipliers Reservation Stations FP Op queue
ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
F0 LD F0,10(R2) N
Completed? Dest. Dest ROB top ROB end From memory 1 10+R2 Dest
Source M1
LD F0, 10(R2) FADD F10, F4, F0 FDIV F2, F10, F6 BRNE F2, +100 LD F4, 0(R3) FADD F0, F4, F6 ST 0(R3), F4
Instruction Dest.
Parallelism
FP registers
84
2 FADD F10,F4, ROB1
FP adders FP multipliers Reservation Stations FP Op queue
ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
F10 F0 ROB1
LD F0,10(R2)
N Ex Dest. Dest Top End 1 10+R2 Dest
FADD F10, F4, F0
To memory From memory
M1
LD F0, 10(R2) FADD F10, F4, F0 FDIV F2, F10, F6 BRNE F2, +100 LD F4, 0(R3) FADD F0, F4, F6 ST 0(R3), F4
Renaming !! (Memory 2 clocks) Three slots for memory
Completed?
Source Instruction Dest.
There can also be two ROB sources
FP registers
85
3
2 FADD F10, F4, ROB1
FP adders FP multipliers Reservation Stations FP Op queue
ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
F2 F10 F0 ROB 2 LD F0,10(R2) N N Ex
Dest. Dest Top End 1 10+R2 Dest
FADD F10, F4, F0 FDIV F2, F10, F6
FDIV F2, ROB2, F6
To memory From memory
M1
LD F0, 10(R2) FADD F10, F4, F0 FDIV F2, F10, F6 BRNE F2, +100 LD F4, 0(R3) FADD F0, F4, F6 ST 0(R3), F4
ROB 1
Completed?
Source Instruction Dest.
Parallelism
FP registers
86
3
2 FADD F10, F4, M1 6 FADD F0, ROB5, F6
FP adders FP multipliers Reservation Stations FP Op queue
ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
F0 ROB5 FADD F0, F4, F6 N F4 LD F4,0(R3) Ex
F2 F10 ROB2
Completed and committed
N Ex
Dest. Dest Top End 5 0+R3 Dest
FADD F10, F4, F0 FDIV F2, F10, F6
FDIV F2, ROB2, F6
To memory From memory
BRNE F2, +100 M1 F0 (Memory 1) In cycle 4 (end of the first LD) FADD F10, F4, F0 started executing
Emitted in cycle 4 in parallel with LD F4, 0(R3)
LD F0, 10(R2) FADD F10, F4, F0 FDIV F2, F10, F6 BRNE F2, +100 LD F4, 0(R3) FADD F0, F4, F6 ST 0(R3), F4
ROB1
Datum captured
more present in the ROB
Completed?
Source Instruction Dest.
Parallelism
87
3
2 FADD F10, F4, M1 6 FADD F0, ROB5, F6
FP adders FP multipliers Reservation Stations FP Op queue
ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
F0 ROB5 ROB5 ST 0(R3), F4 FADD F0, F4, F6 N F4 LD F4, 0(R3) Ex
F2 F10 ROB2 N Ex
Dest. Dest Top End 5 0+R3 Dest
FADD F10, F4, F0 FDIV F2, F10, F6
FDIV F2, ROB2, F6
To memory From memory
BRNE F2, +100 M1 F0 M1
NB: ST can start its execution when LD F4, 0(R3) has finished executing, NOT when it is committed (one cycle later)
LD F0, 10(R2) FADD F10, F4, F0 FDIV F2, F10, F6 BRNE F2, +100 LD F4, 0(R3) FADD F0, F4, F6 ST 0(R3), F4
ROB1
Completed?
Source Instruction Dest.
Parallelism
N
88
register linked to the committed instruction. When a new instruction regarding the same architectural register is committed, the pointer to it is changed (and the physical register previously embodying the architectural register is freed).
Parallelism
ROB or in the RF ? The entire ROB should be analysed and the most recent slot found whose destination is the required register: the instruction should either point to it (if any) or to the RF
registers (ISA) and to keep a pointer to the most recent (which is actually the architectural register).
a new physical register associated with the involved register (F17), where the result will be temporarily stored. Any following instruction which must use that register (F17) will use that physical register
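The two ways of locating an operand mentioned on this slide (scanning the whole ROB for the most recent producer, or keeping a pointer to it) can be compared with a short Python sketch. It is only illustrative: the helper names `locate_by_scan` and `locate_by_table` are hypothetical, the ROB content reuses the seven-instruction example of the previous slides, and it assumes that none of those instructions has been committed yet.

```python
# ROB slots in program order: (slot, destination register or None).
ROB = [
    ("ROB1", "F0"),    # LD   F0, 10(R2)
    ("ROB2", "F10"),   # FADD F10, F4, F0
    ("ROB3", "F2"),    # FDIV F2, F10, F6
    ("ROB4", None),    # BRNE F2, +100
    ("ROB5", "F4"),    # LD   F4, 0(R3)
    ("ROB6", "F0"),    # FADD F0, F4, F6
]


def locate_by_scan(rob, reg):
    """Search from the most recent slot back to the oldest one."""
    for slot, dest in reversed(rob):
        if dest == reg:
            return slot
    return "RF"  # no pending producer: read the register file


def build_table(rob):
    """Keep, for each register, a pointer to its most recent producer."""
    table = {}
    for slot, dest in rob:
        if dest is not None:
            table[dest] = slot  # a later writer overrides an earlier one
    return table


def locate_by_table(table, reg):
    return table.get(reg, "RF")


TABLE = build_table(ROB)
for reg in ("F0", "F4", "F6"):
    print(reg, locate_by_scan(ROB, reg), locate_by_table(TABLE, reg))
```

Both lookups give the same answer; the table avoids the associative search over the whole ROB at the cost of being updated at every emission.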
89
R2-0 R2-1 R2-3 R2-4 R2-5 R2-6 R2-7 R2-8
Circular queue of register R2
Pointer to the first free R2 register when LD R2, 10(R5) is emitted
Let's suppose that R2-2 and R2-3 are already occupied by previous, not yet committed instructions
LD R2, 10(R5) ; R2-4 (dest.) first free physical register associated with R2
Parallelism
(as soon as the new datum is computed). Now R2-2, R2-3 and R2-4 are «busy» and the first free register will be R2-5. R2-2, R2-3 and R2-4 will be freed as soon as the related instructions end. If the commitment is “in-order” all hazards disappear. R2-1 is the architectural R2 register. R2-2 will become the architectural R2 register at the commitment of the related instruction. The busy registers are freed when no longer needed
R2-2 Architectural register
MUL R8, R2, R5 ; R2-4 (src.)
RADD R2, R9, R6 ; R2-5 (dest.)
DIV R2, R2, R10 ; R2-6 (dest.) and R2-5 (src.)
(Normally the compiler would drop the next-to-last instruction)
(commitment of instruction using R2-2) R2-1 R2-2
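The circular queue of physical registers for R2 can be modelled in a few lines of Python. The sketch below is an assumption-laden illustration of the scheme on this slide, not a description of a real processor: the class `RenameQueue` and its methods are hypothetical names, the queue has nine entries (R2-0 to R2-8), and freeing is modelled only at commitment.

```python
class RenameQueue:
    """Circular queue of physical copies of one architectural register."""

    def __init__(self, name, size, arch, busy):
        self.name = name
        self.size = size
        self.arch = arch            # index of the current architectural copy
        self.pending = list(busy)   # allocated, not yet committed (oldest first)

    def latest(self):
        """Copy that a newly emitted instruction must read as a source."""
        return self.pending[-1] if self.pending else self.arch

    def allocate(self):
        """Reserve the first free copy for a new destination."""
        nxt = (self.latest() + 1) % self.size
        if nxt == self.arch:
            raise RuntimeError("no free physical register: emission stalls")
        self.pending.append(nxt)
        return nxt

    def commit(self):
        """Oldest pending copy becomes architectural; the old one is freed."""
        self.arch = self.pending.pop(0)
        return self.arch

    def label(self, index):
        return f"{self.name}-{index}"


# R2-1 is architectural; R2-2 and R2-3 belong to not yet committed instructions.
r2 = RenameQueue("R2", size=9, arch=1, busy=[2, 3])

ld_dest = r2.label(r2.allocate())    # LD   R2, 10(R5)
mul_src = r2.label(r2.latest())      # MUL  R8, R2, R5
add_dest = r2.label(r2.allocate())   # RADD R2, R9, R6
div_src = r2.label(r2.latest())      # DIV  R2, R2, R10 (source read first)
div_dest = r2.label(r2.allocate())   # DIV  R2, R2, R10 (then destination)
new_arch = r2.label(r2.commit())     # commitment of the instruction using R2-2

print(ld_dest, mul_src, add_dest, div_src, div_dest, new_arch)
```

Reading the source of DIV before allocating its destination is what lets the same instruction use R2-5 as input and R2-6 as output.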
90
no emission also if no free slot in the ROB is available and no RS is available
Parallelism
architectural registers or one pool for each architectural register.
commitment, always in order)
available using the register renaming (the used registers are normally not yet committed)
FU.
available.
92
the stack could be damaged). It is a separated stack whose content is copied
instructions following a branch not yet committed use this stack. In case of misprediction the RSB content is cancelled
Parallelism
instruction commitment always in order
instructions erroneously executed allows preceding instruction to keep
(sometimes very time expensive). It must be remembered that not only does the ROB flush occur, but also the cancellation of all the instructions already in the pipeline
93
FLD F4, 0(R10)
FDIV F8, F0, F4
FMUL F4, F2, F3
FMUL F4, F4, F4
FADD F6, F10, F4
FLD F4, 0(R5)
Hazards marked between these instructions: RAW, WAW, RAW, WAW, RAW
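The dependencies in this instruction sequence can be listed mechanically. The Python sketch below is a hedged illustration: `find_hazards` is a hypothetical helper, each instruction is reduced to a destination and a list of source registers, and a hazard is reported only against the nearest earlier writer (RAW, WAW) or against the readers since the last write (WAR), which is an assumption about how the slide counts them.

```python
# (destination, sources) for each instruction, in program order.
PROGRAM = [
    ("F4", ["R10"]),        # 0: FLD  F4, 0(R10)
    ("F8", ["F0", "F4"]),   # 1: FDIV F8, F0, F4
    ("F4", ["F2", "F3"]),   # 2: FMUL F4, F2, F3
    ("F4", ["F4", "F4"]),   # 3: FMUL F4, F4, F4
    ("F6", ["F10", "F4"]),  # 4: FADD F6, F10, F4
    ("F4", ["R5"]),         # 5: FLD  F4, 0(R5)
]


def find_hazards(program):
    """Return a list of (kind, earlier_index, later_index, register)."""
    hazards = []
    last_writer = {}          # register -> index of its most recent writer
    readers_since_write = {}  # register -> readers since that last write
    for i, (dest, sources) in enumerate(program):
        unique_sources = list(dict.fromkeys(sources))  # drop repeated operands
        for reg in unique_sources:
            if reg in last_writer:
                hazards.append(("RAW", last_writer[reg], i, reg))
        if dest in last_writer:
            hazards.append(("WAW", last_writer[dest], i, dest))
        for reader in readers_since_write.get(dest, []):
            hazards.append(("WAR", reader, i, dest))
        for reg in unique_sources:
            readers_since_write.setdefault(reg, []).append(i)
        last_writer[dest] = i
        readers_since_write[dest] = []  # a new value starts with no readers
    return hazards


HAZARDS = find_hazards(PROGRAM)
for hazard in HAZARDS:
    print(hazard)
```

With renaming, only the RAW entries of this list remain true dependencies; the WAW and WAR entries are the ones a new physical register per destination removes.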
Parallelism
94
F10 F8 F6 F4 F2 FU Instruction status Exe. Instruction j k Issue Compl. Busy Load1 Load2 Address Load2 Write Result Store 1 Store 2 Reservation Stations Time Name Busy Op Vj Vk Qj Qk Register result status Clock 0 Add1 Add2 Add3 Mult1 Mult2
FLD F4 0 R10 FDIV F8 F0 F4 FMUL F4 F2 F3 FMUL F4 F4 F4 FADD F6 F10 F4 FLD F4 0 R5
Tomasulo without ROB and with renaming. The multiplication FUs execute the divisions too.
Three RS for LOAD, 2 for STORE, 2 for MUL/DIV
95
Load1 F10 F8 F6 F4 F2 R10 yes 1 FU Instruction status Exe. Instruction j k Issue Compl. Busy Load1 Load2 Address Load2 Write Result Store 1 Store 2 Reservation Stations Time Name Busy Op Vj Vk Qj Qk Register result status Clock 1 Add1 Add2 Add3 Mult1 Mult2
FLD F4 0 R10 FDIV F8 F0 F4 FMUL F4 F2 F3 FMUL F4 F4 F4 FADD F6 F10 F4 FLD F4 0 R5
Three RS for LOAD, 2 for STORE, 2 for MUL/DIV
Parallelism
CLOCK 1
96
Mult1 F10 F8 F6 F4 F2 F0 div
yes 2 R10 yes 2- 1
FU Instruction status Exe. Instruction j k Issue Compl. Busy Load1 Load2 Address Load2 Write Result Store 1 Store 2 Reservation Stations Time Name Busy Op Vj Vk Qj Qk Register result status Clock 2 Add1 Add2 Add3 Mult1 Mult2 Load1 Load1
FLD F4 0 R10 FDIV F8 F0 F4 FMUL F4 F2 F3 FMUL F4 F4 F4 FADD F6 F10 F4 FLD F4 0 R5
Three RS for LOAD, 2 for STORE, 2 for MUL/DIV
Parallelism
CLOCK 2
97
Mult1 F10 F8 F6 F4 F2 mul yes F0 div
yes
3
2 R10 yes 2-3 1
FU Instruction status Exe. Instruction j k Issue Compl. Busy Load1 Load2 Address Load2 Write Result Store 1 Store 2 Reservation Stations Time Name Busy Op Vj Vk Qj Qk Register result status Clock 3 Add1 Add2 Add3 Mult1 Mult2 Mult2 Load1 F2 F3
FLD F4 0 R10 FDIV F8 F0 F4 FMUL F4 F2 F3 FMUL F4 F4 F4 FADD F6 F10 F4 FLD F4 0 R5
Three RS for LOAD, 2 for STORE, 2 for MUL/DIV
Parallelism
CLOCK 3
98
Mult1 F10 F8 F6 F4 F2 mul yes F0 div
yes
9 cycles remaining
3
2 2-3 1
FU Instruction status Exe. Instruction j k Issue Compl. Busy Load1 Load2 Address Load2 Write Result Store 1 Store 2 Reservation Stations Time Name Busy Op Vj Vk Qj Qk Register result status Clock 4 Add1 Add2 Add3 Mult1 Mult2 Mult2 F2 F3 4-
4
FLD F4 0 R10 FDIV F8 F0 F4 FMUL F4 F2 F3 FMUL F4 F4 F4 FADD F6 F10 F4 FLD F4 0 R5
Stalled for lack of free RS until cycle 13 (end of the preceding multiplication – only two slots) blocking the emission
since there are two free slots in the corresponding RS
4-
39 cycles remaining
F4
Parallelism
Three RS for LOAD, 2 for STORE, 2 for MUL/DIV
CLOCK 4
99
80000000: FLD F4, 0(R10)
80000004: FDIV F8, F0, F4
80000008: FMUL F4, F2, F3
8000000C: FMUL F4, F4, F4
80000010: FADD F6, F10, F4
80000014: FLD F4, 0(R5)
Hazards marked between these instructions: RAW, WAW, RAW, WAW, RAW
ROB and register renaming. The instructions are in any case inserted in the ROB when a free slot is available, and then executed when the FU and the operands are available (the policy of all modern processors). By so doing, instructions are not only terminated out of order (OOO), with results reordered in the ROB, but also emitted even if the FU is not available. The execution is totally OOO, but with in-order commitment.
Same instruction stream
Parallelism
100
Addr Op. Dest Src P0 P1
Free Free. Free Free Free Arch
P2 P3 P4 P5
F4
ROB RAT
Q0 Q1
Busy
Free Arch
Q2 Q3 Q4 Q5
F6
Z0 Z1
Busy Free Free Free Free Arch
Z2 Z3 Z4 Z5
F8
Initial situation
Renaming registers for F4, F6 and F8 1 2 3 4 5
Parallelism
Register Allocation Table
Top free registers of the circular queues
These are the architectural registers which a program monitor would display.
These are registers in use by not yet committed instructions. They will become architectural registers when the related instruction is committed
Here we assume that the instruction using Z0 precedes the instruction using Q0. RAT for R5, R10, F0, F2, F10 not displayed
102
R10 yes 1 Instruction status Exe. Instruction j k Issue Compl. Busy Load1 Load2 Address Write Result Load3 Store 1 Store 2 Time Name Busy Op Vj Vk Qj Qk Clock 1 Add1 Add2 Add3 Mult1 Mult2
FLD F4 0 R10 FDIV F8 F0 F4 FMUL F4 F2 F3 FMUL F4 F4 F4 FADD F6 F10 F4 FLD F4 0 R5
Parallelism
CLOCK 1
0,R10 P0 FLD 80000000 Addr Op. Dest Src P0 P1
Busy
Arch
P2 P3 P4 P5 Q0 Q1
Busy Free Free Free Free Arch
Q2 Q3 Q4 Q5
F6
Z0 Z1
Busy Free Free Free Free Arch
Z2 Z3 Z4 Z5
F8
ROB RAT
1 2 3 4 5
Renaming
F4
103
Mult2 F0 div
yes 2 R10 yes 2- 1
Instruction status Exe. Instruction j k Issue Compl. Busy Load1 Load2 Address Load3 Write Result Store 1 Store 2 Time Name Busy Op Vj Vk Qj Qk Clock 2 Add1 Add2 Add3 Mult1 P0
FLD F4 0 R10 FDIV F8 F0 F4 FMUL F4 F2 F3 FMUL F4 F4 F4 FADD F6 F10 F4 FLD F4 0 R5
Parallelism
CLOCK 2
F0,P0 Z1 FDIV 80000004 0,R10 P0 FLD 80000000 Addr
Src P0 P1
Busy Free Free Free Free Arch
P2 P3 P4 P5 Q0 Q1
Busy Free Free Free Free Arch
Q2 Q3 Q4 Q5
F6
Z0 Z1
Busy Busy Free Free Free Arch
Z2 Z3 Z4 Z5
F8
Most recent physical register for F4
1 2 3 4 5
ROB RAT Renaming
F4
104
mul yes F0 div
yes
3
2 R10 yes 2-3 1
Instruction status Exe. Instruction j k Issue Compl. Busy Load1 Load2 Address Load3 Write Result Store 1 Store 2 Time Name Busy Op Vj Vk Qj Qk Clock 3 Add1 Add2 Add3 Mult1 Mult2 P0 F2 F3
FLD F4 0 R10 FDIV F8 F0 F4 FMUL F4 F2 F3 FMUL F4 F4 F4 FADD F6 F10 F4 FLD F4 0 R5
F2,F3 P1 FMUL 80000008 F0,P0 Z1 FDIV 80000004 0,R10 P0 FLD 80000000 Addr Op. Dest Src P0 P1
Busy Busy. Free Free Free Free
P2 P3 P4 P5
F4
ROB RAT
Q0 Q1
Busy Free Free Free Free Free
Q2 Q3 Q4 Q5
F6
Z0 Z1
Arch Busy Free Free Free Free
Z2 Z3 Z4 Z5
F8
1 2 3 4 5
Parallelism
CLOCK 3 waiting for F4 (P0)
The previous instruction using Z0 has ended its execution. Z0 is now the architectural register
P0
Op Vj Vk Qj Qk
105
mul yes F0 div
yes
9 cycles remaining
3
2 2-3 1
Instruction status Exe. Instruction j k Issue Compl. Busy Load1 Load2 Address Write Result Time Name Busy Op Clock 4 Add1 Add2 Add3 Mult1 Mult2 F2 F3 4-
4
FLD F4 0 R10 FDIV F8 F0 F4 FMUL F4 F2 F3 FMUL F4 F4 F4 FADD F6 F10 F4 FLD F4 0 R5
4
39 cycles remaining
4-
Not yet executable, but nevertheless inserted in the ROB. It does not block the emission
Parallelism
Load3 Store 1 Store 2 Integer Load2 Store 1 Store 2 Qk
P1,P1 P2 FMUL 8000000C F2,F3 P1 FMUL 80000008 F0,P0 Z1 FDIV 80000004 0,R10 P0 FLD 80000000 Addr
Src P1 P0
Busy Busy. Busy Free Free Arch
P2 P3 P4 P5
F4
ROB RAT
Q0 Q1
Arch Free Free Free Free Free
Q2 Q3 Q4 Q5
F6
Z0 Z1
Arch Busy Free Free Free Free
Z2 Z3 Z4 Z5
F8
1 2 3 4 5 Ended but not yet committed !
CLOCK 4
The instruction using Q0 has ended its execution. Q0 is now the architectural register
106
FLD F4 0 R10 FDIV F8 F0 F4 FMUL F4 F2 F3 FMUL F4 F4 F4 FADD F6 F10 F4 FLD F4 0 R5
mul yes F0 div
yes
8 cycles remaining
3
2 2-3 1
Instruction status Exe. Instruction j k Issue Compl. Busy Load1 Address Write Result Time Name Busy Op Vj Vk Clock 5 Add1 Add2 Add3 Mult1 Mult2 P0 (F4) F2 F3 4-
4
4 5 yes add F10 P2
38 cycles remaining
Load2 3-
Parallelism
Load3 Store 1 Store 2 CLOCK 5 waiting for F4 (P1) Qj Qk Integer Load2 Store 1 Store 2 Qk
F10,P2 Q1 FADD 80000010 P1,P1 P2 FMUL 8000000C F2,F3 P2 FMUL 80000008 F0,P1 Z1 FDIV 80000004 Addr Op. Dest Src P1 P0
Arch Busy. Busy Busy Free Free
P2 P3 P4 P5
F4
ROB RAT
Q0 Q1
Arch Busy Free Free Free Free
Q2 Q3 Q4 Q5
F6
Z0 Z1
Arch Busy Free Free Free Free
Z2 Z3 Z4 Z5
F8
1 2 3 4 5 FLD committed: the architectural register F4 is now P0