Parallel architectures Electronic Computers LM Parallelism 1

Scoreboard

[Figure: the scoreboard, the registers and the functional units (two FP MUL, FP DIV, FP ADD, INTEG)]

The scoreboard is roughly equivalent to the ID stage (just after the fetch) and determines when an instruction can read its operands and start its execution. The scoreboard considers all system state changes and decides when the first instruction in the FIFO queue (as produced by the compiler) can be started.

Scoreboard

• The four stages, equivalent to ID, EX and WB in DLX, are:

1. Emission: if a functional unit for the instruction is available (free) and the required operands are available in RF with updated values, the instruction is issued, unless another functional unit already holds an instruction which must write into the same destination register. Therefore no WAW hazards can occur. In this latter case the instruction is stalled, which blocks the emission of all the following instructions in the prefetch queue, even when all other conditions for them are met!
2. Operand read: the instruction has been emitted. If the operands are available and no already executing instruction must write them, they are read; otherwise the instruction stalls in the functional unit.
3. Execution: when the result has been computed and stored, the scoreboard is informed so as to unblock a possibly waiting instruction.
4. Writeback: in case of a possible WAR the instruction is stalled and does not write the result if there is a previous instruction which has not yet read its operands and one of them is the destination register of the considered instruction. Once the operand has been read, the result can be written.

• Note that with this organisation forwarding is avoided, since the results are written as soon as they are produced (except for the WAR wait of point 4).
• Obviously some stalls can be induced because the number of busses available for transfers is small.
• The scoreboard technique allows instructions to move directly from the EX stage to the WB stage (reducing the RAW risks).
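The emission and writeback rules above can be sketched as two small checks. This is only an illustrative Python sketch under stated assumptions: the names `can_issue`, `can_write_back`, `fu_busy`, `result_status` and `pending_reads` are hypothetical, and the issue check tests only for a free functional unit and for a WAW conflict, which is how the cycle-by-cycle example behaves (MULTD is issued while still waiting for F2).

```python
def can_issue(unit, dest, fu_busy, result_status):
    """Emission check: the unit must be free and no active instruction
    may already be due to write the same destination register (WAW)."""
    return not fu_busy[unit] and result_status.get(dest) is None


def can_write_back(dest, pending_reads):
    """Writeback check: wait while an earlier instruction still has to
    read the old value of `dest` (WAR)."""
    return dest not in pending_reads


# Hypothetical state, loosely modelled on the example: the Add unit is free,
# F0 and F10 are claimed by busy units, DIVD still has to read F0 and F6.
fu_busy = {"Integer": False, "Mult1": True, "Add": False, "Divide": True}
result_status = {"F0": "Mult1", "F10": "Divide"}
pending_reads = {"F0", "F6"}

addd_can_issue = can_issue("Add", "F6", fu_busy, result_status)      # free unit, no WAW on F6
addd_can_write = can_write_back("F6", pending_reads)                 # blocked: WAR on F6
writer_of_f0_can_issue = can_issue("Add", "F0", fu_busy, result_status)  # blocked: WAW on F0
```

Keeping the two checks separate mirrors the text: a WAW conflict stops an instruction before it enters a unit, while a WAR conflict only delays the final write.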

An example

LD F6, 34(R2)
LD F2, 45(R3)
FMUL F0, F2, F4 (MULTD)
FSUB F8, F6, F2 (SUBD)
FDIV F10, F0, F6 (DIVD)
FADD F6, F8, F2 (ADDD)

Hazards marked in the figure: RAW on the registers loaded by the two LD instructions (which use the integer unit) and WAR on F6 (FADD writes F6, which FDIV reads).

Do you find more hypothetical hazards? For instance, what about F0?

Hypothetical timing for the different instructions (which includes the operands read and the execution):
FLD: 1 cycle
FADD, FSUB: 2 cycles
FMUL: 10 cycles
FDIV: 40 cycles
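One way to answer the question about further hypothetical hazards is to enumerate them mechanically. The following is a minimal Python sketch, not part of the original slides: `PROGRAM` and `potential_hazards` are hypothetical names, and the check is purely pairwise (it compares every earlier instruction with every later one and ignores timing, so it lists potential hazards only).

```python
from itertools import combinations

# (opcode, destination register, source registers) for the example sequence
PROGRAM = [
    ("LD", "F6", ("R2",)),
    ("LD", "F2", ("R3",)),
    ("FMUL", "F0", ("F2", "F4")),
    ("FSUB", "F8", ("F6", "F2")),
    ("FDIV", "F10", ("F0", "F6")),
    ("FADD", "F6", ("F8", "F2")),
]


def potential_hazards(program):
    """Return (kind, register, earlier_index, later_index) for every pair."""
    hazards = []
    for (i, (_, d1, s1)), (j, (_, d2, s2)) in combinations(enumerate(program), 2):
        if d1 in s2:      # later instruction reads what the earlier one writes
            hazards.append(("RAW", d1, i, j))
        if d2 in s1:      # later instruction writes what the earlier one reads
            hazards.append(("WAR", d2, i, j))
        if d1 == d2:      # both write the same register
            hazards.append(("WAW", d1, i, j))
    return hazards


hazards = potential_hazards(PROGRAM)
for kind, reg, i, j in hazards:
    print(f"{kind} on {reg}: {PROGRAM[i][0]} (#{i}) -> {PROGRAM[j][0]} (#{j})")
```

Among the pairs reported are the RAW on F0 between FMUL and FDIV, the WAR on F6 between FDIV and FADD, and a WAW on F6 between the first LD and FADD.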

Scoreboard entities

Instruction stages: emission, operands read, execution and writeback.

Statuses of the functional units (FU): 9 parameters
Busy: unit busy
Op: operation code presently executed
Fi: instruction destination (result) register
Fj, Fk: operand source registers
Qj, Qk: functional units producing the required operands (if not yet ready) for the registers Fj and Fk
Rj, Rk: flags (yes) indicating whether Fj, Fk have already been updated

Result status register: indicates which functional unit will write each register. Void when no functional unit has to do with the specific register.

N.B. Remember that in case of a possible WAW the emission of instructions is stalled (point 1 of the rules).
N.B. In the following example we suppose that two multiplication/division units are available.
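The nine parameters and the result status register map naturally onto a small record plus a dictionary. The sketch below is an assumption-laden illustration in Python: `FUStatus` and `issue` are hypothetical names, and the bookkeeping shown (Qj/Qk copied from the result status, Rj/Rk set when no unit is pending) is one plausible reading of the definitions above, not the author's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FUStatus:
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # first source register
    fk: Optional[str] = None   # second source register
    qj: Optional[str] = None   # unit producing Fj (None if nothing is pending)
    qk: Optional[str] = None   # unit producing Fk (None if nothing is pending)
    rj: bool = False           # True when Fj can be read (or is not needed)
    rk: bool = False           # True when Fk can be read (or is not needed)


def issue(units, result_status, name, op, fi, fj, fk):
    """Fill the entry of unit `name` and claim the destination register."""
    qj, qk = result_status.get(fj), result_status.get(fk)
    units[name] = FUStatus(True, op, fi, fj, fk, qj, qk, qj is None, qk is None)
    result_status[fi] = name


units = {n: FUStatus() for n in ("Integer", "Mult1", "Mult2", "Add", "Divide")}
result_status = {}  # register -> unit that will write it

issue(units, result_status, "Integer", "Load", "F2", None, "R3")  # LD F2, 45(R3)
issue(units, result_status, "Mult1", "Mult", "F0", "F2", "F4")    # MULTD F0, F2, F4
print(units["Mult1"])
```

After the two calls the Mult1 entry has Qj = Integer, Rj = no and Rk = yes, which is the situation the example reaches when MULTD is issued while the second load is still in flight.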

Example (here we assume that F0 is a “normal” register and not always “0”)

Mnemonics: LD = FLD, MULTD = FMUL, SUBD = FSUB, DIVD = FDIV, ADDD = FADD.
Timing: FLD 1 cycle; FADD, FSUB 2 cycles; FMUL 10 cycles; FDIV 40 cycles.
Functional units (FU): 1 integer unit, 2 multiplication units, 1 add/sub unit, 1 division unit.

Instruction status (columns: Instruction, j, k, Issue, Read Op, Execution complete, Write Result), all empty at clock 0:
LD F6, 34(R2)
LD F2, 45(R3)
MULTD F0, F2, F4
SUBD F8, F6, F2
DIVD F10, F0, F6
ADDD F6, F8, F2

Functional unit status (columns: Time, Name, Busy, Op, Fi (dest), Fj (source 1), Fk (source 2), Qj (FU for j), Qk (FU for k), Rj (Fj ready?), Rk (Fk ready?)), with rows Integer, Mult1, Mult2, Add, Divide. Time is the number of execution cycles yet to elapse.

Rj and Rk indicate whether (possibly in the next cycle if just produced) the data can be read from the operand source registers of the executed instruction. Qj and Qk are the functional units which produce them (if not yet ready). Fj and Fk are the registers where the data produced by Qj and Qk are stored (or will be stored in the next cycle; the data are available if the corresponding Rj or Rk is yes) to be used in the executed instruction.

Register result status (clock 0): for each floating point register F0, F2, F4, F6, F8, F10, F12, ..., F31, the functional unit producing its result (Qj, Qk).

Cycle 1

At clock 1 the instruction LD F6, 34(R2) is in the Issue stage. R2 is supposed to be already available and can therefore be used in the next clock. LD uses the integer unit. (On the slides, state changes are shown in brown.)

Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / - / - / -
LD F2, 45(R3): - / - / - / -
MULTD F0, F2, F4: - / - / - / -
SUBD F8, F6, F2: - / - / - / -
DIVD F10, F0, F6: - / - / - / -
ADDD F6, F8, F2: - / - / - / -

Functional unit status:
Integer: busy, Op Load, Fi F6, Fk R2, Rk yes
Mult1, Mult2, Add, Divide: not busy

Register result status (clock 1): F6 = Integer (the functional unit used for producing the result in F6)

Cycle 2

Data ready in R2: the instruction can proceed.
NB: The second LD cannot be emitted because the only integer unit is busy, and the same applies to MULTD because instructions must be emitted in order.

Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / - / -
LD F2, 45(R3): - / - / - / -
MULTD F0, F2, F4: - / - / - / -
SUBD F8, F6, F2: - / - / - / -
DIVD F10, F0, F6: - / - / - / -
ADDD F6, F8, F2: - / - / - / -

Functional unit status:
Integer: busy, Op Load, Fi F6, Fk R2
Mult1, Mult2, Add, Divide: not busy

Register result status (clock 2): F6 = Integer

Cycle 3

Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / -
LD F2, 45(R3): - / - / - / -
MULTD F0, F2, F4: - / - / - / -
SUBD F8, F6, F2: - / - / - / -
DIVD F10, F0, F6: - / - / - / -
ADDD F6, F8, F2: - / - / - / -

Functional unit status:
Integer: busy, Op Load, Fi F6, Fk R2
Mult1, Mult2, Add, Divide: not busy

Register result status (clock 3): F6 = Integer

Cycle 4

The change of status of the FUs indicates their value at the positive clock edge concluding the current cycle (future status). For instance, the integer functional unit is freed at the end of cycle 4 together with the result writeback. LD F6, 34(R2) disappears totally from the scoreboard at the positive clock edge concluding the current cycle 4.

Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): - / - / - / -
MULTD F0, F2, F4: - / - / - / -
SUBD F8, F6, F2: - / - / - / -
DIVD F10, F0, F6: - / - / - / -
ADDD F6, F8, F2: - / - / - / -

Functional unit status:
Integer: busy, Op Load, Fi F6, Fk R2 (functional unit freed at the end of the period)
Mult1, Mult2, Add, Divide: not busy

Register result status (clock 4): F6 = Integer (the register has been written at the end of the period)

Cycle 5

At the beginning of cycle 5 the integer unit is already free, so LD F2, 45(R3) can start. R3 is supposed to be already available, as in the previous case. The integer functional unit must produce a new value for F2.

Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): 5 / - / - / -
MULTD F0, F2, F4: - / - / - / -
SUBD F8, F6, F2: - / - / - / -
DIVD F10, F0, F6: - / - / - / -
ADDD F6, F8, F2: - / - / - / -

Functional unit status:
Integer: busy, Op Load, Fi F2, Fk R3, Rk yes
Mult1, Mult2, Add, Divide: not busy

Register result status (clock 5): F2 = Integer

Cycle 6

MULTD F0, F2, F4 can start because its FU is free and the destination register is F0. MULTD waits for F2 from the integer unit! F4 is supposed to be already present.

Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): 5 / 6 / - / -
MULTD F0, F2, F4: 6 / - / - / -
SUBD F8, F6, F2: - / - / - / -
DIVD F10, F0, F6: - / - / - / -
ADDD F6, F8, F2: - / - / - / -

Functional unit status:
Integer: busy, Op Load, Fi F2, Fk R3
Mult1: busy, Op Mult, Fi F0, Fj F2, Fk F4, Qj Integer, Rj no, Rk yes
Mult2, Add, Divide: not busy

Register result status (clock 6): F0 = Mult1, F2 = Integer

Cycle 7

SUBD F8, F6, F2 can start because the FP add/subtract unit is free (NB: the FP adder executes FP subtractions too). MULTD is stalled in the execution unit because F2 is not yet ready. The same holds for SUBD.

Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): 5 / 6 / 7 / -
MULTD F0, F2, F4: 6 / - / - / -
SUBD F8, F6, F2: 7 / - / - / -
DIVD F10, F0, F6: - / - / - / -
ADDD F6, F8, F2: - / - / - / -

Functional unit status:
Integer: busy, Op Load, Fi F2, Fk R3
Mult1: busy, Op Mult, Fi F0, Fj F2, Fk F4, Qj Integer, Rj no, Rk yes
Add: busy, Op Subd, Fi F8, Fj F6, Fk F2, Qk Integer, Rj yes, Rk no
Mult2, Divide: not busy

Register result status (clock 7): F0 = Mult1, F2 = Integer, F8 = Add

Cycle 8

DIVD F10, F0, F6 can start because the FP divide FU is free. F2 is written, which allows MULTD and SUBD to read their operands during the next cycle. F2 is available; F0 is not yet available. F2 is written and therefore the integer unit is free (status updated at the end of the cycle).

Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): 5 / 6 / 7 / 8
MULTD F0, F2, F4: 6 / - / - / -
SUBD F8, F6, F2: 7 / - / - / -
DIVD F10, F0, F6: 8 / - / - / -
ADDD F6, F8, F2: - / - / - / -

Functional unit status:
Integer: busy, Op Load, Fi F2, Fk R3 (freed at the end of the cycle)
Mult1: busy, Op Mult, Fi F0, Fj F2, Fk F4, Rj yes, Rk yes
Add: busy, Op Sub, Fi F8, Fj F6, Fk F2, Rj yes, Rk yes
Divide: busy, Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj no, Rk yes
Mult2: not busy

Register result status (clock 8): F0 = Mult1, F8 = Add, F10 = Divide

Cycle 9 - 10

N.B.: MULTD and SUBD can read their operands because F2 is available (see cycle 8). DIVD is still stalled because of F0. ADDD cannot start because SUBD uses the adder FU.

Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): 5 / 6 / 7 / 8
MULTD F0, F2, F4: 6 / 9 / - / -
SUBD F8, F6, F2: 7 / 9 / - / -
DIVD F10, F0, F6: 8 / - / - / -
ADDD F6, F8, F2: - / - / - / -

Functional unit status:
Integer: not busy
Mult1: busy, Op Mult, Fi F0, Fj F2, Fk F4 (time: 10 clocks)
Add: busy, Op Sub, Fi F8, Fj F6, Fk F2 (time: 2 clocks)
Divide: busy, Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj no, Rk yes (time: 40 clocks)
Mult2: not busy

Register result status (clock 9-10): F0 = Mult1, F8 = Add, F10 = Divide

Cycle 11

Note: the Add FU requires 2 cycles for the SUBD and therefore nothing happens in cycle 10, while MULTD still processes its data.
NB: ADDD will use the result of the SUBD but has not yet started because SUBD occupies the adder.

Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): 5 / 6 / 7 / 8
MULTD F0, F2, F4: 6 / 9 / - / -
SUBD F8, F6, F2: 7 / 9 / 11 / -
DIVD F10, F0, F6: 8 / - / - / -
ADDD F6, F8, F2: - / - / - / -

Functional unit status:
Integer: not busy
Mult1: busy, Op Mult, Fi F0, Fj F2, Fk F4 (8 clocks more)
Add: busy, Op Sub, Fi F8, Fj F6, Fk F2 (0 clocks more)
Divide: busy, Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj no, Rk yes
Mult2: not busy

Register result status (clock 11): F0 = Mult1, F8 = Add, F10 = Divide

Cycle 12

SUBD ends, freeing the FU. In the next period ADDD can start. F8 is written and the ADD/SUB FU is freed.

Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): 5 / 6 / 7 / 8
MULTD F0, F2, F4: 6 / 9 / - / -
SUBD F8, F6, F2: 7 / 9 / 11 / 12
DIVD F10, F0, F6: 8 / - / - / -
ADDD F6, F8, F2: - / - / - / -

Functional unit status:
Integer: not busy
Mult1: busy, Op Mult, Fi F0, Fj F2, Fk F4 (7 clocks more)
Divide: busy, Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj no, Rk yes
Mult2, Add: not busy

Register result status (clock 12): F0 = Mult1, F10 = Divide

Cycle 13

Now ADDD can start because SUBD has finished its execution and has freed the FU.

Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): 5 / 6 / 7 / 8
MULTD F0, F2, F4: 6 / 9 / - / -
SUBD F8, F6, F2: 7 / 9 / 11 / 12
DIVD F10, F0, F6: 8 / - / - / -
ADDD F6, F8, F2: 13 / - / - / -

Functional unit status:
Integer: not busy
Mult1: busy, Op Mult, Fi F0, Fj F2, Fk F4 (6 clocks more)
Add: busy, Op Add, Fi F6, Fj F8, Fk F2, Rj yes, Rk yes
Divide: busy, Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj no, Rk yes
Mult2: not busy

Register result status (clock 13): F0 = Mult1, F6 = Add, F10 = Divide

Cycle 14

Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): 5 / 6 / 7 / 8
MULTD F0, F2, F4: 6 / 9 / - / -
SUBD F8, F6, F2: 7 / 9 / 11 / 12
DIVD F10, F0, F6: 8 / - / - / -
ADDD F6, F8, F2: 13 / 14 / - / -

Functional unit status:
Integer: not busy
Mult1: busy, Op Mult, Fi F0, Fj F2, Fk F4 (5 clocks more)
Add: busy, Op Add, Fi F6, Fj F8, Fk F2 (2 clocks more)
Divide: busy, Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj no, Rk yes
Mult2: not busy

Register result status (clock 14): F0 = Mult1, F6 = Add, F10 = Divide

Cycle 15

ADDD requires two cycles and therefore there is no system status change.

Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): 5 / 6 / 7 / 8
MULTD F0, F2, F4: 6 / 9 / - / -
SUBD F8, F6, F2: 7 / 9 / 11 / 12
DIVD F10, F0, F6: 8 / - / - / -
ADDD F6, F8, F2: 13 / 14 / - / -

Functional unit status:
Integer: not busy
Mult1: busy, Op Mult, Fi F0, Fj F2, Fk F4 (4 clocks more)
Add: busy, Op Add, Fi F6, Fj F8, Fk F2 (1 clock more)
Divide: busy, Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj no, Rk yes
Mult2: not busy

Register result status (clock 15): F0 = Mult1, F6 = Add, F10 = Divide

Cycle 16

ADDD has ended its EX stage, while MULTD keeps executing and DIVD is still waiting for F0.

Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): 5 / 6 / 7 / 8
MULTD F0, F2, F4: 6 / 9 / - / -
SUBD F8, F6, F2: 7 / 9 / 11 / 12
DIVD F10, F0, F6: 8 / - / - / -
ADDD F6, F8, F2: 13 / 14 / 16 / -

Functional unit status:
Integer: not busy
Mult1: busy, Op Mult, Fi F0, Fj F2, Fk F4 (3 clocks more)
Add: busy, Op Add, Fi F6, Fj F8, Fk F2
Divide: busy, Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj no, Rk yes
Mult2: not busy

Register result status (clock 16): F0 = Mult1, F6 = Add, F10 = Divide

Cycle 17

NB! ADDD is stalled (it cannot write) because of a WAR with DIVD on F6. DIVD does not read F6 because it is waiting for F0 produced by MULTD (operands are read in parallel). MULTD keeps executing and DIVD is still waiting.

Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): 5 / 6 / 7 / 8
MULTD F0, F2, F4: 6 / 9 / - / -
SUBD F8, F6, F2: 7 / 9 / 11 / 12
DIVD F10, F0, F6: 8 / - / - / -
ADDD F6, F8, F2: 13 / 14 / 16 / -

Functional unit status:
Integer: not busy
Mult1: busy, Op Mult, Fi F0, Fj F2, Fk F4 (2 clocks more)
Add: busy, Op Add, Fi F6, Fj F8, Fk F2 (stalled because of the WAR on F6)
Divide: busy, Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj no, Rk yes
Mult2: not busy

Register result status (clock 17): F0 = Mult1, F6 = Add, F10 = Divide

Cycle 18

MULTD still executing. DIVD still stalled.

Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): 5 / 6 / 7 / 8
MULTD F0, F2, F4: 6 / 9 / - / -
SUBD F8, F6, F2: 7 / 9 / 11 / 12
DIVD F10, F0, F6: 8 / - / - / -
ADDD F6, F8, F2: 13 / 14 / 16 / -

Functional unit status:
Integer: not busy
Mult1: busy, Op Mult, Fi F0, Fj F2, Fk F4 (1 clock more)
Add: busy, Op Add, Fi F6, Fj F8, Fk F2
Divide: busy, Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj no, Rk yes
Mult2: not busy

Register result status (clock 18): F0 = Mult1, F6 = Add, F10 = Divide

Cycle 19

MULTD ends its execution (after 10 cycles) and will write in cycle 20, which will unblock DIVD and then ADDD.

Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): 5 / 6 / 7 / 8
MULTD F0, F2, F4: 6 / 9 / 19 / -
SUBD F8, F6, F2: 7 / 9 / 11 / 12
DIVD F10, F0, F6: 8 / - / - / -
ADDD F6, F8, F2: 13 / 14 / 16 / -

Functional unit status:
Integer: not busy
Mult1: busy, Op Mult, Fi F0, Fj F2, Fk F4
Add: busy, Op Add, Fi F6, Fj F8, Fk F2
Divide: busy, Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj no, Rk yes
Mult2: not busy

Register result status (clock 19): F0 = Mult1, F6 = Add, F10 = Divide

Cycle 20

MULTD writes F0, unblocking DIVD.

Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): 5 / 6 / 7 / 8
MULTD F0, F2, F4: 6 / 9 / 19 / 20
SUBD F8, F6, F2: 7 / 9 / 11 / 12
DIVD F10, F0, F6: 8 / - / - / -
ADDD F6, F8, F2: 13 / 14 / 16 / -

Functional unit status:
Integer, Mult1, Mult2: not busy
Add: busy, Op Add, Fi F6, Fj F8, Fk F2
Divide: busy, Op Div, Fi F10, Fj F0, Fk F6, Rj yes, Rk yes

Register result status (clock 20): F6 = Add, F10 = Divide

Cycle 21

DIVD reads both F0 and F6 (which could not be written by ADDD because of the WAR), unblocking ADDD, which can write F6 in the next cycle.

Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): 5 / 6 / 7 / 8
MULTD F0, F2, F4: 6 / 9 / 19 / 20
SUBD F8, F6, F2: 7 / 9 / 11 / 12
DIVD F10, F0, F6: 8 / 21 / - / -
ADDD F6, F8, F2: 13 / 14 / 16 / -

Functional unit status:
Integer, Mult1, Mult2: not busy
Add: busy, Op Add, Fi F6, Fj F8, Fk F2
Divide: busy, Op Div, Fi F10, Fj F0, Fk F6

Register result status (clock 21): F6 = Add, F10 = Divide

Cycle 22

Now ADDD can write F6, since the WAR hazard with DIVD has disappeared. For 6 cycles ADDD could not write F6 although its result was available.

Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): 5 / 6 / 7 / 8
MULTD F0, F2, F4: 6 / 9 / 19 / 20
SUBD F8, F6, F2: 7 / 9 / 11 / 12
DIVD F10, F0, F6: 8 / 21 / - / -
ADDD F6, F8, F2: 13 / 14 / 16 / 22

Functional unit status:
Integer, Mult1, Mult2, Add: not busy
Divide: busy, Op Div, Fi F10, Fj F0, Fk F6

Register result status (clock 22): F10 = Divide

Cycle 61

DIVD execution ends after 40 cycles.

Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): 5 / 6 / 7 / 8
MULTD F0, F2, F4: 6 / 9 / 19 / 20
SUBD F8, F6, F2: 7 / 9 / 11 / 12
DIVD F10, F0, F6: 8 / 21 / 61 / -
ADDD F6, F8, F2: 13 / 14 / 16 / 22

Functional unit status:
Integer, Mult1, Mult2, Add: not busy
Divide: busy, Op Div, Fi F10, Fj F0, Fk F6

Register result status (clock 61): F10 = Divide

Cycle 62

All executions ended.

Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): 5 / 6 / 7 / 8
MULTD F0, F2, F4: 6 / 9 / 19 / 20
SUBD F8, F6, F2: 7 / 9 / 11 / 12
DIVD F10, F0, F6: 8 / 21 / 61 / 62
ADDD F6, F8, F2: 13 / 14 / 16 / 22

Functional unit status:
Integer, Mult1, Mult2, Add, Divide: not busy

Register result status (clock 62): no pending writes
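The whole trace can be summarised by a small timing model. The Python sketch below is an illustration under stated assumptions, not the author's simulator: the names `Instr`, `schedule`, `LATENCY`, `UNIT_OF` and `UNIT_COUNT` are hypothetical, one instruction is issued per cycle in order, a register or unit released in cycle n becomes usable in cycle n+1, and bus contention is ignored. Because every constraint of an instruction depends only on earlier instructions, the four cycle numbers can be computed in program order.

```python
from dataclasses import dataclass

LATENCY = {"LD": 1, "MULTD": 10, "SUBD": 2, "ADDD": 2, "DIVD": 40}
UNIT_OF = {"LD": "Integer", "MULTD": "Mult", "SUBD": "Add", "ADDD": "Add", "DIVD": "Divide"}
UNIT_COUNT = {"Integer": 1, "Mult": 2, "Add": 1, "Divide": 1}


@dataclass
class Instr:
    op: str
    dest: str
    srcs: tuple
    issue: int = 0
    read: int = 0
    done: int = 0
    write: int = 0


def schedule(program):
    free_at = {name: [1] * n for name, n in UNIT_COUNT.items()}  # first free cycle per unit
    for i, ins in enumerate(program):
        earlier = program[:i]
        units = free_at[UNIT_OF[ins.op]]
        u = units.index(min(units))                      # earliest available unit of the class
        in_order = earlier[-1].issue + 1 if earlier else 1
        waw = max((e.write + 1 for e in earlier if e.dest == ins.dest), default=1)
        ins.issue = max(in_order, units[u], waw)         # free unit, in order, no WAW
        raw = max((e.write + 1 for e in earlier if e.dest in ins.srcs), default=1)
        ins.read = max(ins.issue + 1, raw)               # operands read after producers wrote
        ins.done = ins.read + LATENCY[ins.op]
        war = max((e.read + 1 for e in earlier if ins.dest in e.srcs), default=1)
        ins.write = max(ins.done + 1, war)               # wait for earlier readers (WAR)
        units[u] = ins.write + 1                         # unit freed by the writeback
    return program


program = schedule([
    Instr("LD", "F6", ("R2",)),
    Instr("LD", "F2", ("R3",)),
    Instr("MULTD", "F0", ("F2", "F4")),
    Instr("SUBD", "F8", ("F6", "F2")),
    Instr("DIVD", "F10", ("F0", "F6")),
    Instr("ADDD", "F6", ("F8", "F2")),
])
table = [(p.op, p.issue, p.read, p.done, p.write) for p in program]
for row in table:
    print(row)
```

With these assumptions the model gives the same Issue, Read Op, Execution complete and Write Result cycles as the final table of cycle 62, including the write of ADDD delayed to cycle 22 by the WAR on F6.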

Scoreboard limits

• Register values must in any case be read in parallel and only from the register file (which means that they must already have been stored in the registers: no RAW problem)
• An instruction can be emitted only if all previous instructions have been emitted

N.B. The hazards of the following sequence are only potential: their occurrence depends on the execution times of the instructions.

FDIV F0, F2, F4
FADD F6, F0, F8
FSTOR F6, 0(R1)
FSUB F8, F10, F14
FMUL F6, F10, F8

The figure marks WAR, WAW and RAW hazards. In this sequence, for instance, FSUB writes F8, which FADD reads (WAR); FADD and FMUL both write F6 (WAW); and FMUL reads the F8 written by FSUB (RAW).

Renaming – Tomasulo Algorithm

«Renaming» indicates a location different from the RF where a requested datum is produced/stored and can be obtained. The name «renaming» is used because it is as if the source registers of an instruction were renamed.

Tomasulo algorithm: «renaming» is based on the concept of «reservation stations», which are functional unit buffers where instructions can be «parked» waiting for the availability of the requested FU and of the needed data.

• A reservation station is a place of a FU where an instruction emitted from the instruction queue waits until the FU is free and the needed data arrive, as soon as they are produced (N.B. before being written in the RF). For its operands EITHER the source register data OR the reservation stations producing them are indicated (whence renaming). The renaming occurs at run-time.
• A reservation station captures a required operand exactly when and where it is produced (without waiting until it is written, thus avoiding the register file access). This is similar to forwarding.
• When multiple writes to the same register occur (WAW, possible only if multiple busses between FUs and RF are available) only the most recently produced data are written (for each register a TAG is used, indicating the FU which has the right to write).

The following benefits occur:

• Hazard detection and execution control are distributed (not grouped as for the scoreboard): only the information stored in the reservation stations of each functional unit determines whether an instruction can execute in the FU, since the source (where the datum is being produced) and NOT the RF is in any case indicated. RAW hazards are no longer possible since the requested data are provided as soon as they are produced. The same holds for WAR.
• Results are transferred directly to the reservation stations of the waiting FUs through the common data busses, without the necessity of reading the RF (multiple reservation stations, in addition to the RF registers, can be accessed at the same time when multiple busses are available).

Tomasulo Algorithm

Tomasulo eliminates not only WAWs but also WARs.

FLD F6, 32(R2)
FLD F2, 44(R3)
FMUL F0, F2, F4
FSUB F8, F2, F6
FDIV F10, F0, F6
FADD F6, F8, F2

Possible WAW: FLD and FADD both write F6. Possible WAR: FDIV reads F6, which FADD writes.

After renaming (T is the functional unit producing the datum):

FLD [T/F6], 32(R2)
FLD F2, 44(R3)
FMUL F0, F2, F4
FSUB F8, F2, [T]
FDIV F10, F0, [T]
FADD F6, F8, F2

When an instruction is inserted in a RS, it is checked whether one or more of its operands are being produced elsewhere by other RS: if so, the operand is renamed.

For the FADD a potential WAR with the FDIV could occur if FADD ended before FDIV has read its operands (F8 of FSUB and F2 of FLD were both immediately available for FADD), but since for F6 FDIV points to the RS of FLD F6, 32(R2) and not to the RF, the problem does not occur. The same holds for FSUB.

As far as the WAW between FLD and FADD on F6 is concerned, the mechanism guarantees that only the most recent instruction in the RS using a destination register can write the register.

Tomasulo Algorithm

Very high performance without special compilers.

Differences with the scoreboard:
• Buffers and controls are directly distributed in the FUs (there is no centralized control): the buffers are called “reservation stations”
• Source register names are substituted by pointers to the buffers of the reservation stations (if the requested data are being produced there)
• “Renaming”: a direct pointer to the sources and not to the register
• One or more Common Data Busses for sending results to all the FUs requiring them
• Loads and Stores are considered as FUs (a STORE can also be a source for a RS executing a LOAD)

Tomasulo Algorithm

In this example:
• 3 RS for add/sub
• 2 RS for mult/div
• 5 RS for store
• 5 RS for load

In this example it is assumed that the MUL unit executes the DIVs too and that the ADD unit executes the SUBs too. LOADs and STOREs are handled as other instructions.

In this example there is only one Common Data Bus for the data produced by the FUs. Please notice that the same Common Data Bus is also used by the RS waiting for data.

Each RS (more than one for each FU) stores an emitted instruction and, for each operand, one of two elements: either the operand value (i.e. read from the RF) or the name of the RS which is producing it (renaming).

Tomasulo Algorithm

• Load buffers are used to store the load addresses
• Store buffers contain the computed addresses and the data to be written in memory
• Loads and stores must be executed in sequence if they are related to the same addresses. In the other cases it is possible to anticipate the LOADs (never the STOREs)
• In the figure there are 3 phases (each of which can last several clocks):
  • Emission: the instructions are extracted in order from the general instruction queue when there is a free RS for the requested FU (the only condition); otherwise the instruction queue stalls. Operands are extracted from the RF or from the producing FU as indicated. In case of WAW it must be determined which instruction must provide the data
  • Execution: if one or more operands are not yet available, the CDB(s) must be monitored (data must be transferred over a bus anyway) in order to catch them (and their sources) as soon as they are available: RAW hazards are therefore avoided (we are sure not to read stale data in the RF)
  • Writeback: as soon as a datum is produced, it is transferred over one CDB (when more than one is available) to the RF and to the RS waiting for it

Tomasulo Algorithm

Let's see the scoreboard example in a Tomasulo architecture. Let's suppose that the execution times are the same as for the scoreboard (FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles). N.B. “+1” is for the writeback.

LD F6, 34(R2)
LD F2, 45(R3)
FMUL F0, F2, F4
FSUB F8, F6, F2
FDIV F10, F0, F6
FADD F6, F8, F2

Reservation Station

Op: opcode of the instruction to be executed
Vj, Vk: places of the operands (either RF or the FUs producing them)
Qj, Qk: functional units producing the results. A blank indicates that the source operands are already in Vj or Vk or that they are not required
Busy: busy FU

Register File Status: indicates which FU will write the register (if needed). A blank means that there are no instructions which must write the register and therefore its value can be directly used.

N.B. From the general instruction queue one instruction per clock is emitted when a RS of the FU for that instruction is available; otherwise the queue stalls. In our example we assume only one CDB.
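The fields listed above, together with the renaming at emission and the capture of results from the common data bus, can be sketched as follows. This is a minimal Python illustration under stated assumptions: `ReservationStation`, `issue` and `broadcast` are hypothetical names, the register values (4.0, 6.0, 2.5) are made up for the example, and here Vj/Vk hold operand values while Qj/Qk hold the name of the producing station.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # value of the first operand, once known
    vk: Optional[float] = None   # value of the second operand, once known
    qj: Optional[str] = None     # station producing the first operand (None = value in vj)
    qk: Optional[str] = None     # station producing the second operand (None = value in vk)


def issue(rs, name, op, dest, src_j, src_k, regs, reg_status):
    """Park an instruction in station `name`, renaming pending sources."""
    def operand(reg):
        producer = reg_status.get(reg)
        return (None, producer) if producer else (regs[reg], None)

    vj, qj = operand(src_j)
    vk, qk = operand(src_k)
    rs[name] = ReservationStation(True, op, vj, vk, qj, qk)
    reg_status[dest] = name          # this station is now the latest writer of dest


def broadcast(rs, regs, reg_status, producer, value):
    """Common data bus: waiting stations capture the value; the RF is written
    only for registers whose latest writer is still `producer`."""
    for station in rs.values():
        if station.qj == producer:
            station.vj, station.qj = value, None
        if station.qk == producer:
            station.vk, station.qk = value, None
    for reg, tag in list(reg_status.items()):
        if tag == producer:
            regs[reg] = value
            del reg_status[reg]


regs = {"F2": 0.0, "F4": 4.0, "F6": 6.0}   # illustrative values only
reg_status = {"F2": "Load2"}               # Load2 is still fetching F2
rs = {}

issue(rs, "Mult1", "Mult", "F0", "F2", "F4", regs, reg_status)   # MULTD F0, F2, F4
waiting_on = rs["Mult1"].qj                                      # renamed source
broadcast(rs, regs, reg_status, "Load2", 2.5)                    # Load2 writes back
```

After the broadcast Mult1 holds both operand values and no longer names a producer, so it can start executing without ever reading F2 from the register file.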

Cycle 0

NB. For LD (ST is not used here) there is a limited number of RS. Their BUSY status is displayed differently from that of the FUs (see next slide).

Timing: FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles. NB. “+1” is for the writeback.

Instruction status (Issue / Execution / Write), all empty:
LD F6, 34(R2)
LD F2, 45(R3)
MULTD F0, F2, F4
SUBD F8, F6, F2
DIVD F10, F0, F6
ADDD F6, F8, F2

Reservation stations (columns: Time, Name, Busy, Op, Vj (S1), Vk (S2), Qj (RS for j), Qk (RS for k)):
Add1, Add2, Add3, Mult1, Mult2: not busy
Vj, Vk: operand registers. If blank, the datum is produced in the corresponding Q FU.
Qj, Qk: producing FU. If blank, the datum is in the RF.
For the sake of simplicity Rj and Rk (ready/not ready) are not indicated in the status table, since their values are implicit in the status of Qj and Qk. Load/store stations are not displayed in this table.

Register result status (clock 0): for each register F0, F2, F4, F6, F8, F10, F12, ..., F31, the FU producing the new value; all empty.

Cycle 1

NB: Here it is assumed that R2 and R3 are already available. There are 5 RS for the LOAD, 3 RS for adder/sub and 2 RS for mul/div.

Instruction status (Issue / Execution / Write):
LD F6, 34(R2): 1 / - / -
LD F2, 45(R3): - / - / -
MULTD F0, F2, F4: - / - / -
SUBD F8, F6, F2: - / - / -
DIVD F10, F0, F6: - / - / -
ADDD F6, F8, F2: - / - / -

Load RS:
Load1: busy, address 34+R2

Reservation stations:
Add1, Add2, Add3, Mult1, Mult2: not busy

Register result status (clock 1): F6 = Load1

Cycle 2

The second LD is emitted. One instruction per clock is emitted (when possible).
NB: Load -> 2 cycles: the first one for computing the address and the second for reading the datum.
N.B. A second LOAD has been emitted (not possible with the scoreboard) and parked in the RS. The R3 value is already available in the RF.

Instruction status (Issue / Execution / Write):
LD F6, 34(R2): 1 / 2- / -
LD F2, 45(R3): 2 / - / -
MULTD F0, F2, F4: - / - / -
SUBD F8, F6, F2: - / - / -
DIVD F10, F0, F6: - / - / -
ADDD F6, F8, F2: - / - / -

Load RS:
Load1: busy, address 34+R2
Load2: busy, address 45+R3

Reservation stations:
Add1, Add2, Add3, Mult1, Mult2: not busy

Register result status (clock 2): F2 = Load2, F6 = Load1

Cycle 3

MULTD is emitted (free RS). LD takes two cycles. MULTD can be emitted although F2 is NOT yet available: F2 -> renaming. The F4 datum is supposed to be already in the RF.

Instruction status (Issue / Execution / Write):
LD F6, 34(R2): 1 / 2--3 / -
LD F2, 45(R3): 2 / 3- / -
MULTD F0, F2, F4: 3 / - / -
SUBD F8, F6, F2: - / - / -
DIVD F10, F0, F6: - / - / -
ADDD F6, F8, F2: - / - / -

Load RS:
Load1: busy, address 34+R2
Load2: busy, address 45+R3

Reservation stations:
Mult1: busy, Op Mult, Vk F4, Qj Load2 (10 cycles yet to execute)
Add1, Add2, Add3, Mult2: not busy

Register result status (clock 3): F0 = Mult1, F2 = Load2, F6 = Load1

Cycle 4

The datum read from memory by LD F6, 34(R2) is written both in the RF and in the RS of SUBD, which is waiting for it. F6 is available in the RF at the end of the cycle. SUBD is emitted (RS free). The add FUs execute both sums and subtractions. The Load1 FU is freed.

Instruction status (Issue / Execution / Write):
LD F6, 34(R2): 1 / 2--3 / 4
LD F2, 45(R3): 2 / 3--4 / -
MULTD F0, F2, F4: 3 / - / -
SUBD F8, F6, F2: 4 / - / -
DIVD F10, F0, F6: - / - / -
ADDD F6, F8, F2: - / - / -

Load RS:
Load2: busy, address 45+R3

Reservation stations:
Add1: busy, Op Sub, Vj F6 (captured on the fly), Qk Load2 (3 cycles yet to execute)
Mult1: busy, Op Mult, Vk F4, Qj Load2 (10 cycles yet to execute)
Add2, Add3, Mult2: not busy

Register result status (clock 4): F0 = Mult1, F2 = Load2, F8 = Add1

Cycle 5

The datum read from memory by LD F2, 45(R3) is written both in register F2 and in the RS of SUBD and MULTD, which are waiting for it. DIVD is emitted (RS free). The Load2 FU is freed. The Time column gives the cycles yet to be executed for completing the execution.

Instruction status (Issue / Execution / Write):
LD F6, 34(R2): 1 / 2--3 / 4
LD F2, 45(R3): 2 / 3--4 / 5
MULTD F0, F2, F4: 3 / - / -
SUBD F8, F6, F2: 4 / - / -
DIVD F10, F0, F6: 5 / - / -
ADDD F6, F8, F2: - / - / -

Reservation stations:
Add1: busy, Op Sub, Vj F6 (captured), Vk F2 (captured), time 3
Mult1: busy, Op Mult, Vj F2 (captured), Vk F4, time 10
Mult2: busy, Op Div, Vk F6, Qj Mult1 (waits for F0)
Add2, Add3: not busy

Register result status (clock 5): F0 = Mult1, F8 = Add1, F10 = Mult2

Cycle 6

ADDD is emitted (RS free). Now MULTD can execute (F2 and F4 available).

Instruction status (Issue / Execution / Write):
LD F6, 34(R2): 1 / 2--3 / 4
LD F2, 45(R3): 2 / 3--4 / 5
MULTD F0, F2, F4: 3 / 6-- / -
SUBD F8, F6, F2: 4 / 6-- / -
DIVD F10, F0, F6: 5 / - / -
ADDD F6, F8, F2: 6 / - / -

Reservation stations:
Add1: busy, Op Sub, Vj F6, Vk F2, time 2
Add2: busy, Op Add, Vk F2, Qj Add1 (waits for F8)
Mult1: busy, Op Mult, Vj F2, Vk F4, time 9
Mult2: busy, Op Div, Vk F6, Qj Mult1 (waits for F0; 40 cycles yet to execute)
Add3: not busy

Register result status (clock 6): F0 = Mult1, F6 = Add2, F8 = Add1, F10 = Mult2

Cycle 7

The datum in F6 will be overwritten by ADDD, but it was already read and is present in the RS of DIVD. SUBD (like ADDD) takes two cycles. ADDD is stalled waiting for SUBD (F8).

Instruction status (Issue / Execution / Write):
LD F6, 34(R2): 1 / 2--3 / 4
LD F2, 45(R3): 2 / 3--4 / 5
MULTD F0, F2, F4: 3 / 6-- / -
SUBD F8, F6, F2: 4 / 6--7 / -
DIVD F10, F0, F6: 5 / - / -
ADDD F6, F8, F2: 6 / - / -

Reservation stations:
Add1: busy, Op Sub, Vj F6, Vk F2, time 1
Add2: busy, Op Add, Vk F2, Qj Add1
Mult1: busy, Op Mult, Vj F2, Vk F4, time 8
Mult2: busy, Op Div, Vk F6, Qj Mult1 (40 cycles yet to execute)
Add3: not busy

Register result status (clock 7): F0 = Mult1, F6 = Add2, F8 = Add1, F10 = Mult2

Cycle 8

Instruction status
Instruction          Issue  Execution  Write
LD    F6, 34(R2)     1      2--3       4
LD    F2, 45(R3)     2      3--4       5
MULTD F0, F2, F4     3      6 --
SUBD  F8, F6, F2     4      6 -- 7     8
DIVD  F10, F0, F6    5
ADDD  F6, F8, F2     6

Reservation stations
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
0     Add1   No                                  (FU freed)
2     Add2   Yes   Add   F8   F2
      Add3   No
7     Mult1  Yes   Mult  F2   F4
      Mult2  Yes   Div        F6   Mult1

Register result status (clock 8): F0 = Mult1, F6 = Add2, F10 = Mult2

Notes
• NB: SUBD ends before MULTD and allows ADDD (which captures the result F8) to start executing
• DIVD still has 40 cycles to execute

Cycle 9

Instruction status
Instruction          Issue  Execution  Write
LD    F6, 34(R2)     1      2--3       4
LD    F2, 45(R3)     2      3--4       5
MULTD F0, F2, F4     3      6 --
SUBD  F8, F6, F2     4      6 -- 7     8
DIVD  F10, F0, F6    5
ADDD  F6, F8, F2     6      9 --

Reservation stations
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
      Add1   No
2     Add2   Yes   Add   F8   F2
      Add3   No
6     Mult1  Yes   Mult  F2   F4
      Mult2  Yes   Div        F6   Mult1

Register result status (clock 9): F0 = Mult1, F6 = Add2, F10 = Mult2

Notes
• ADDD is executing
• DIVD still has 40 cycles to execute

Cycle 10

Instruction status
Instruction          Issue  Execution  Write
LD    F6, 34(R2)     1      2--3       4
LD    F2, 45(R3)     2      3--4       5
MULTD F0, F2, F4     3      6 --
SUBD  F8, F6, F2     4      6 -- 7     8
DIVD  F10, F0, F6    5
ADDD  F6, F8, F2     6      9 -- 10

Reservation stations
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
      Add1   No
1     Add2   Yes   Add   F8   F2
      Add3   No
5     Mult1  Yes   Mult  F2   F4
      Mult2  Yes   Div        F6   Mult1

Register result status (clock 10): F0 = Mult1, F6 = Add2, F10 = Mult2

Notes
• ADDD takes two execution cycles
• DIVD still has 40 cycles to execute

Cycle 11

Instruction status
Instruction          Issue  Execution  Write
LD    F6, 34(R2)     1      2--3       4
LD    F2, 45(R3)     2      3--4       5
MULTD F0, F2, F4     3      6 --
SUBD  F8, F6, F2     4      6 -- 7     8
DIVD  F10, F0, F6    5
ADDD  F6, F8, F2     6      9 -- 10    11

Reservation stations (Time = cycles yet to be executed for completing the execution)
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
      Add1   No
0     Add2   No                                  (FU freed)
      Add3   No
4     Mult1  Yes   Mult  F2   F4
40    Mult2  Yes   Div        F6   Mult1

Register result status (clock 11): F0 = Mult1, F10 = Mult2

Notes
• ADDD too ends before MULTD and DIVD

Cycle 12

Instruction status
Instruction          Issue  Execution  Write
LD    F6, 34(R2)     1      2--3       4
LD    F2, 45(R3)     2      3--4       5
MULTD F0, F2, F4     3      6 --
SUBD  F8, F6, F2     4      6 -- 7     8
DIVD  F10, F0, F6    5
ADDD  F6, F8, F2     6      9 -- 10    11

Reservation stations (Time = cycles yet to be executed for completing the execution)
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
      Add1   No
      Add2   No
      Add3   No
3     Mult1  Yes   Mult  F2   F4
40    Mult2  Yes   Div        F6   Mult1

Register result status (clock 12): F0 = Mult1, F10 = Mult2

Notes
• DIVD is waiting for the datum produced by MULTD

Cycle 15

Instruction status
Instruction          Issue  Execution  Write
LD    F6, 34(R2)     1      2--3       4
LD    F2, 45(R3)     2      3--4       5
MULTD F0, F2, F4     3      6 -- 15
SUBD  F8, F6, F2     4      6 -- 7     8
DIVD  F10, F0, F6    5
ADDD  F6, F8, F2     6      9 -- 10    11

Reservation stations
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
      Add1   No
      Add2   No
      Add3   No
1     Mult1  Yes   Mult  F2   F4
      Mult2  Yes   Div        F6   Mult1

Register result status (clock 15): F0 = Mult1, F10 = Mult2

Notes
• DIVD is waiting for the datum produced by MULTD; it still has 40 cycles to execute

Cycle 16

Instruction status
Instruction          Issue  Execution  Write
LD    F6, 34(R2)     1      2--3       4
LD    F2, 45(R3)     2      3--4       5
MULTD F0, F2, F4     3      6 -- 15    16
SUBD  F8, F6, F2     4      6 -- 7     8
DIVD  F10, F0, F6    5
ADDD  F6, F8, F2     6      9 -- 10    11

Reservation stations
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
      Add1   No
      Add2   No
      Add3   No
0     Mult1  No                                  (FU freed)
40    Mult2  Yes   Div   F0   F6

Register result status (clock 16): F10 = Mult2

Notes
• Now DIVD can execute

Cycle 56

Instruction status
Instruction          Issue  Execution  Write
LD    F6, 34(R2)     1      2--3       4
LD    F2, 45(R3)     2      3--4       5
MULTD F0, F2, F4     3      6 -- 15    16
SUBD  F8, F6, F2     4      6 -- 7     8
DIVD  F10, F0, F6    5      17 -- 56
ADDD  F6, F8, F2     6      9 -- 10    11

Reservation stations
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
0     Mult2  Yes   Div   F0   F6

Register result status (clock 56): F10 = Mult2

Cycle 57

Instruction status
Instruction          Issue  Execution complete  Write result
LD    F6, 34(R2)     1      2--3                4
LD    F2, 45(R3)     2      3--4                5
MULTD F0, F2, F4     3      6 -- 15             16
SUBD  F8, F6, F2     4      6 -- 7              8
DIVD  F10, F0, F6    5      17 -- 56            57
ADDD  F6, F8, F2     6      9 -- 10             11

Reservation stations
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status (clock 57): no register is waiting for a result
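The execution windows of the four floating-point instructions in the final table can be re-derived with a little arithmetic. The sketch below rests on one assumption inferred from the tables, not stated on the slides: an instruction starts executing in the cycle after both its issue and the write of its last missing operand. The function name `schedule` is hypothetical, and CDB conflicts are ignored.

```python
def schedule(issue, operands_written, latency):
    """Return (first execution cycle, last execution cycle, write cycle)."""
    start = max(issue, operands_written) + 1
    end = start + latency - 1
    return start, end, end + 1


LD_F2_WRITE = 5  # LD F2, 45(R3) writes its result in cycle 5

multd = schedule(issue=3, operands_written=LD_F2_WRITE, latency=10)
subd = schedule(issue=4, operands_written=LD_F2_WRITE, latency=2)
divd = schedule(issue=5, operands_written=multd[2], latency=40)  # needs F0 from MULTD
addd = schedule(issue=6, operands_written=subd[2], latency=2)    # needs F8 from SUBD

for name, window in [("MULTD", multd), ("SUBD", subd), ("DIVD", divd), ("ADDD", addd)]:
    print(name, window)
```

Under that assumption the windows agree with the table: DIVD cannot start before cycle 17 because MULTD writes F0 in cycle 16, so it executes for 40 cycles until cycle 56 and writes in cycle 57.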

A demo can be found at http://www.ecs.umass.edu/ece/koren/architecture/Tomasulo1/tomasulo_files/tomasulo.htm

Limits of the Tomasulo algorithm
• Very complex
• Each CDB must be connected to each RS
 – Complex wiring
 – Reducing the number of CDBs reduces efficiency
• If a single CDB is present, only one instruction per cycle can complete
• Out-of-order instruction completion!
• Interrupts are NOT precise

Exceptions
• Exception/interrupt: non-programmed control transfer
 – The return address and all other information needed to restore the interrupted state must be saved
 – A «response» subroutine (handler) must be executed
• Two exception types: interrupt and trap
• Interrupts: external causes
 – The user program is interrupted and then restored
 – Asynchronous to the current process
 – Acknowledged at the end of the current instruction (if interrupts are enabled)
 – The handler is the responsibility of the user program
• Traps: internal causes
 – Exceptional conditions (overflow, division by zero, etc.)
 – Errors (e.g. parity)
 – Page fault (or, see later, segment fault): data not available in memory
 – Synchronous to the current process
 – Handled by the operating system
 – An instruction can be interrupted during its execution (e.g. by a page fault) and must therefore be «restartable». The executing program is normally suspended temporarily.

Examples: instruction restart

Precise exceptions/interrupts
• Exceptions must be "precise", that is, their behaviour must be the same as would occur in a non-pipelined architecture
• Precise: the machine status is saved as if the code had been executed up to the exception:
 – All preceding instructions must have terminated
 – All instructions following the one that caused the exception must be handled as if they had never started
 – The same code must execute identically on different architectures
• Complex problem with pipelines, OOO execution (see later), etc.
• Scoreboard and Tomasulo have in-order emission but out-of-order execution (and therefore out-of-order termination)
• Precise exceptions (interrupts) require in-order instruction commitment

Reorder Buffer (ROB)
• FIFO queue
• Stores pointers to all instructions in FIFO order as they are emitted. For the sake of simplicity we say that the instruction is virtually inserted in the ROB
• When instructions terminate, their results are stored in the ROB (instead of the RF), which also provides the operands to the other instructions that require them (renaming!)
• Commitment: the result of the instruction in the top slot of the FIFO is transferred to the architectural registers
• Easy "undo" of speculated instructions (see later), of erroneously predicted branches, or of exceptions
• Automatic WAW avoidance
[Figure: FP Op Queue, ROB, FP Regs (written at commitment), reservation stations, FP adders]
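The FIFO behaviour listed above (results parked in the ROB, architectural registers updated only from the top slot) can be sketched as follows. This is a toy model under stated assumptions: the class `ReorderBuffer`, its methods and the three independent instructions writing F0, F6 and F2 are all hypothetical names chosen for illustration.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: a bounded FIFO whose entries commit strictly from the top."""

    def __init__(self, size):
        self.size = size
        self.slots = deque()  # leftmost element = top of the FIFO (oldest)

    def emit(self, dest):
        """Reserve a slot at emission; return None when the ROB is full (stall)."""
        if len(self.slots) == self.size:
            return None
        entry = {"dest": dest, "value": None, "done": False}
        self.slots.append(entry)
        return entry

    def commit(self, regfile):
        """Move finished results from the top of the FIFO to the architectural registers."""
        committed = []
        while self.slots and self.slots[0]["done"]:
            entry = self.slots.popleft()
            regfile[entry["dest"]] = entry["value"]
            committed.append(entry["dest"])
        return committed


rob, regfile = ReorderBuffer(size=4), {}
first_i, second_i, third_i = rob.emit("F0"), rob.emit("F6"), rob.emit("F2")

# The second instruction terminates first, but it cannot commit past the first one.
second_i.update(value=3.5, done=True)
early = rob.commit(regfile)

# Once the oldest instruction terminates, both commit together, in program order.
first_i.update(value=1.5, done=True)
late = rob.commit(regfile)
```

Here `early` is empty and `late` lists F0 before F6, so the register file only ever sees results in program order even though termination was out of order.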

Tomasulo again

Tomasulo in 4 steps
• Emission: an instruction is emitted from the instruction queue when an RS and a ROB slot are available. The RS records the operand sources and the ROB slot where the instruction will be "parked" after its execution (this phase is called "dispatch"). The results are NOT written into the RF until the commitment phase. NB: the lack of either of the two conditions blocks the emission of the following instructions
• Execution: the operands are processed. If an operand is not yet ready it may be in the ROB (in which case the value computed by the nearest preceding instruction is used) or still being computed in an FU. This phase is called "issue"
• Result writeback: execution ends. The result is transmitted on the CDB to the RSs waiting for it and to the ROB
• Commitment: the register (or memory) is updated with the result stored in the ROB when the instruction is at the top of the ROB FIFO. In case of an erroneously predicted branch the ROB results are just dropped. This phase is also called "graduation"
EMISSION IN ORDER, COMMITMENT IN ORDER
N.B. Sometimes several instructions can be committed simultaneously. If the destination is the same (unlikely, otherwise the compiler would have dropped the first one), the result of the most recent instruction is used.
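The remark that the ROB results are "just dropped" after a mispredicted branch can be illustrated with a short sketch. It assumes the ROB is modelled as a plain list ordered from oldest to youngest; the function name `squash_younger` is hypothetical, and the instruction strings are borrowed from the example stream used in this deck.

```python
def squash_younger(rob, branch_slot):
    """Keep the branch and everything older; drop every younger (speculative) entry."""
    kept = rob[: branch_slot + 1]
    dropped = rob[branch_slot + 1:]
    return kept, dropped


rob = [
    "LD F0, 10(R2)",
    "FADD F10, F4, F0",
    "FDIV F2, F10, F6",
    "BRNE F2, +100",
    "LD F4, 0(R3)",     # emitted speculatively after the branch
    "FADD F0, F4, F6",  # speculative
    "ST 0(R3), F4",     # speculative
]
kept, dropped = squash_younger(rob, rob.index("BRNE F2, +100"))
```

Since none of the dropped entries has been committed, neither the architectural registers nor memory need any repair; only the ROB tail moves back.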

HW with ROB
[Figure: FP Op Queue, Reorder Buffer, FP Regs, reservation stations and FP adders, with a comparison network. Fields of a ROB entry: destination register, result, valid (terminated), exception?, program counter]
• ROB is a circular queue

Example
LD F0, 10(R2)
FADD F10, F4, F0
FDIV F2, F10, F6
BRNE F2, +100
LD F4, 0(R3)
FADD F0, F4, F6
ST 0(R3), F4

Tomasulo with ROB – cycle 1

ROB (slots ROB1 to ROB7; ROB1 is the top of the queue)
Slot        Dest.  Source  Instruction      Completed?
ROB1 (top)  F0             LD F0, 10(R2)    N
ROB2 to ROB7: empty

Reservation stations (Dest. => destination position in ROB)
Memory unit M1: Dest. 1, address 10+R2
FP adders: empty
FP multipliers: empty

[Figure: FP Op queue holding the instructions of the example, ROB, FP registers, paths to and from memory]

Tomasulo with ROB – cycle 2

ROB (ROB1 is the top of the queue)
Slot        Dest.  Source  Instruction        Completed?
ROB2        F10    ROB1    FADD F10, F4, F0   N
ROB1 (top)  F0             LD F0, 10(R2)      Ex
ROB3 to ROB7: empty

Reservation stations
FP adders: Dest. 2, FADD F10, F4, ROB1
Memory unit M1: Dest. 1, address 10+R2
FP multipliers: empty

Notes
• Renaming!! The F0 operand of FADD is replaced by its producer, ROB1
• There can also be two ROB sources
• Three slots for memory operations (a memory access takes 2 clocks)

Tomasulo with ROB – cycle 3

ROB (ROB1 is the top of the queue)
Slot        Dest.  Source  Instruction        Completed?
ROB3        F2     ROB2    FDIV F2, F10, F6   N
ROB2        F10    ROB1    FADD F10, F4, F0   N
ROB1 (top)  F0             LD F0, 10(R2)      Ex
ROB4 to ROB7: empty

Reservation stations
FP adders: Dest. 2, FADD F10, F4, ROB1
FP multipliers: Dest. 3, FDIV F2, ROB2, F6
Memory unit M1: Dest. 1, address 10+R2

Tomasulo with ROB – cycle 5

ROB (ROB2 is now the top of the queue)
Slot        Dest.  Source  Instruction        Completed?
ROB6        F0     ROB5    FADD F0, F4, F6    N
ROB5        F4             LD F4, 0(R3)       Ex
ROB4        --             BRNE F2, +100      N
ROB3        F2     ROB2    FDIV F2, F10, F6   N
ROB2 (top)  F10    ROB1    FADD F10, F4, F0   Ex
ROB1        completed and committed (LD F0, 10(R2))
ROB7: empty

FP registers: F0 holds the value loaded by memory unit M1

Reservation stations
FP adders: Dest. 2, FADD F10, F4, M1; Dest. 6, FADD F0, ROB5, F6
FP multipliers: Dest. 3, FDIV F2, ROB2, F6
Memory unit M1: Dest. 5, address 0+R3

Notes
• BRNE F2, +100 and LD F4, 0(R3) were emitted in parallel in cycle 4
• In cycle 4 (end of the first LD) FADD F10, F4, F0 started executing
• The datum of the first LD was captured on the fly by the waiting FADD; it is no longer present in the ROB

Tomasulo with ROB – cycle 6

ROB (ROB2 is the top of the queue)
Slot        Dest.  Source  Instruction        Completed?
ROB7        --     ROB5    ST 0(R3), F4       N
ROB6        F0     ROB5    FADD F0, F4, F6    N
ROB5        F4             LD F4, 0(R3)       Ex
ROB4        --             BRNE F2, +100      N
ROB3        F2     ROB2    FDIV F2, F10, F6   N
ROB2 (top)  F10    ROB1    FADD F10, F4, F0   Ex
ROB1        committed (LD F0, 10(R2))

FP registers: F0 holds the value loaded by memory unit M1

Reservation stations
FP adders: Dest. 2, FADD F10, F4, M1; Dest. 6, FADD F0, ROB5, F6
FP multipliers: Dest. 3, FDIV F2, ROB2, F6
Memory unit M1: Dest. 5, address 0+R3

Notes
• NB: ST can start its execution when LD F4, 0(R3) has terminated its execution, NOT when it is committed (one cycle later)

Register Renaming
• When an emitted instruction must use a register, where can it be found: in the ROB or in the RF? The entire ROB would have to be searched for the most recent slot whose destination is the required register; the instruction should point either to that slot (if any) or to the RF
• Solution: use more physical registers than architectural (ISA) registers and keep a pointer to the most recently committed one (which is actually the architectural register)
• Whenever an instruction inserted in the ROB must write a register (e.g. F17), it points to a new physical register associated with that register (F17), where the result will be temporarily stored. Any following instruction which must use that register (F17) will use that physical register
• At each commitment the pointer to the architectural register is moved to the physical register linked to the committed instruction. When a new instruction writing the same architectural register is committed, the pointer is changed again (and the physical register previously embodying the architectural register is freed)

An example with R2
LD R2, 10(R5) ; R2-4 (dest.)
MUL R8, R2, R5 ; R2-4 (source)
ADD R2, R9, R6 ; R2-5 (dest.)
DIV R2, R2, R10 ; R2-6 (dest.) and R2-5 (source)
(Normally the compiler would drop the next-to-last instruction)

[Figure: circular queue of physical registers R2-0 to R2-8 for the architectural register R2. R2-1 is the current architectural register R2; the pointer moves to R2-2 at the commitment of the instruction using R2-2. Let us suppose that R2-2 and R2-3 are already occupied by previous, not yet committed instructions. R2-4 is the first free physical register associated with R2 when LD R2, 10(R5) is emitted; a pointer marks the first free register]

• When LD R2, 10(R5) is emitted, register R2-4 is given to it as destination, and MUL will use it as source (as soon as the new datum is computed). Now R2-2, R2-3 and R2-4 are «busy» and the first free register is R2-5. R2-2, R2-3 and R2-4 will be freed as soon as the related instructions end. If the commitment is in-order all hazards disappear. R2-1 is the architectural R2 register. R2-2 will become the architectural R2 register at the commitment of the related instruction. Busy registers are freed when no longer needed
• Normally there are 40-120 physical registers
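The allocation in this example can be replayed with a small model of one circular queue per architectural register. It is only a sketch under assumptions: the class `RenameQueue` and its methods are hypothetical, the queue has the nine entries R2-0 to R2-8 shown in the figure, and a commitment simply frees the register that previously embodied the architectural one.

```python
class RenameQueue:
    """Circular pool of physical registers private to one architectural register."""

    def __init__(self, arch, size, architectural, latest, in_flight):
        self.arch, self.size = arch, size
        self.architectural = architectural        # index holding the committed value
        self.latest = latest                      # most recently allocated index
        self.busy = {architectural, *in_flight}

    def name(self, index):
        return f"{self.arch}-{index}"

    def source(self):
        """A reader uses the most recently allocated physical register."""
        return self.name(self.latest)

    def dest(self):
        """A writer takes the next register in circular order, or None (stall) if busy."""
        nxt = (self.latest + 1) % self.size
        if nxt in self.busy:
            return None
        self.busy.add(nxt)
        self.latest = nxt
        return self.name(nxt)

    def commit(self, index):
        """Physical register `index` becomes architectural; the previous one is freed."""
        self.busy.discard(self.architectural)
        self.architectural = index


# R2-1 is architectural; R2-2 and R2-3 belong to not yet committed instructions.
r2 = RenameQueue("R2", size=9, architectural=1, latest=3, in_flight={2, 3})
ld_dest = r2.dest()      # LD  R2, 10(R5)
mul_src = r2.source()    # MUL R8, R2, R5
add_dest = r2.dest()     # ADD R2, R9, R6
div_src = r2.source()    # DIV R2, R2, R10 reads R2 before renaming its destination
div_dest = r2.dest()
r2.commit(2)             # the instruction using R2-2 commits
```

Reading the source before allocating the destination matters for DIV: it must see the register written by ADD, not its own new destination.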

HW support for register renaming
• Free/busy register table. Two solutions: one pool of physical registers shared by all architectural registers, or one pool for each architectural register
• Fast mapping between architectural and physical registers (at run time)
• Large number of physical registers
• If no physical register is available (the pool is scanned circularly) the instruction is stalled. There is also no emission if no free ROB slot or no RS is available

ROB «without Tomasulo»
• Instructions are emitted as soon as a free ROB slot and a physical destination register are available, using register renaming (the registers used are normally not yet committed)
• For each FU there is a virtual queue whose slots point to the ROB slots that require that FU
• The instructions of this queue are executed as soon as the required operands are available
• When two instructions are ready for execution the FIFO rule applies (so as to speed up the commitment, which is always in order)
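The selection rule for a per-FU virtual queue (execute as soon as the operands are available, oldest first when several are ready) can be sketched as follows. The function `pick_next`, the tuple layout and the register names are illustrative assumptions, not part of the slides.

```python
def pick_next(fu_queue, ready):
    """Return the ROB slot of the oldest instruction whose sources are all ready.

    fu_queue: list of (rob_slot, sources) kept in emission (FIFO) order.
    ready:    set of register names whose values are available.
    """
    for rob_slot, sources in fu_queue:
        if all(src in ready for src in sources):
            return rob_slot
    return None  # nothing can start on this FU in this cycle


# Virtual queue of a multiply/divide unit; P0 and P1 are physical registers still pending.
mul_queue = [
    (2, ["F0", "P0"]),  # oldest entry, waits for P0
    (3, ["F2", "F3"]),  # ready
    (4, ["P1", "P1"]),  # waits for P1
]
while_p0_pending = pick_next(mul_queue, ready={"F0", "F2", "F3"})
after_p0_written = pick_next(mul_queue, ready={"F0", "F2", "F3", "P0"})
```

While P0 is pending the younger, ready entry (slot 3) is chosen; as soon as P0 is written the oldest entry (slot 2) wins, which is the FIFO rule that helps in-order commitment.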

ROB and speculation
• Dynamic instruction execution that guarantees precise interrupts, which are checked at instruction commitment (always in order)
• Cancellation of speculative instructions when a branch is erroneously predicted
 – The prediction error must be detected ASAP. Cancelling the erroneously executed post-branch instructions lets the preceding instructions keep executing. The erroneously executed instructions have not yet been committed
 – Detecting the misprediction early avoids the execution of useless instructions (sometimes very time-expensive). It must be remembered that not only is the ROB flushed, but all the instructions already in the pipeline are cancelled too
 – A separate Return Stack Buffer is needed for speculative calls (otherwise the stack could be damaged). It is a separate stack whose content is copied onto the stack if the branch has been correctly predicted as taken. All instructions following a not yet committed branch use this stack. In case of misprediction the RSB content is cancelled

Example - 1
FLD F4, 0(R10)
FDIV F8, F0, F4
FMUL F4, F2, F3
FMUL F4, F4, F4
FADD F6, F10, F4
FLD F4, 0(R5)
Hazards marked on the slide, in order: RAW, WAW, RAW, WAW, RAW (all of them involve F4)
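The dependences in this instruction stream can be enumerated mechanically. The sketch below is an illustration under assumptions: each instruction is reduced to a (destination, sources) pair, only the nearest producer or reader is reported, and the function name `hazards` is hypothetical. It also reports WAR hazards, which the slide does not label.

```python
def hazards(program):
    """List (kind, earlier, later, register) for the nearest RAW, WAW and WAR pairs."""
    last_writer, readers, found = {}, {}, []
    for i, (dest, sources) in enumerate(program, start=1):
        for src in dict.fromkeys(sources):       # each source once, order preserved
            if src in last_writer:
                found.append(("RAW", last_writer[src], i, src))
            readers.setdefault(src, []).append(i)
        if dest in last_writer:
            found.append(("WAW", last_writer[dest], i, dest))
        for reader in readers.get(dest, []):
            if reader != i:                      # reading one's own destination is no hazard
                found.append(("WAR", reader, i, dest))
        last_writer[dest] = i
        readers[dest] = []                       # later readers see the new value
    return found


program = [
    ("F4", ["R10"]),        # 1: FLD  F4, 0(R10)
    ("F8", ["F0", "F4"]),   # 2: FDIV F8, F0, F4
    ("F4", ["F2", "F3"]),   # 3: FMUL F4, F2, F3
    ("F4", ["F4", "F4"]),   # 4: FMUL F4, F4, F4
    ("F6", ["F10", "F4"]),  # 5: FADD F6, F10, F4
    ("F4", ["R5"]),         # 6: FLD  F4, 0(R5)
]
found = hazards(program)
raw = [(a, b) for kind, a, b, _ in found if kind == "RAW"]
waw = [(a, b) for kind, a, b, _ in found if kind == "WAW"]
war = [(a, b) for kind, a, b, _ in found if kind == "WAR"]
```

Every hazard found is on F4, which is why renaming the successive writes of F4 to different physical registers removes the WAW and WAR pairs and leaves only the true RAW dependences.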

Tomasulo without ROB and with renaming. The multiplication FUs execute the divisions too.
Three RS for LOAD, 2 for STORE, 2 for MUL/DIV

CLOCK 0

Instruction status
Instruction        Issue  Exe. compl.  Write result
FLD  F4, 0(R10)
FDIV F8, F0, F4
FMUL F4, F2, F3
FMUL F4, F4, F4
FADD F6, F10, F4
FLD  F4, 0(R5)

Load/store buffers (Busy, Address): Load1, Load2, Load3, Store1, Store2 all free

Reservation stations (Time, Name, Busy, Op, Vj, Vk, Qj, Qk): Add1, Add2, Add3, Mult1, Mult2 all free

Register result status (F2, F4, F6, F8, F10): no register is waiting for a result

CLOCK 1

Instruction status
Instruction        Issue  Exe. compl.  Write result
FLD  F4, 0(R10)    1
FDIV F8, F0, F4
FMUL F4, F2, F3
FMUL F4, F4, F4
FADD F6, F10, F4
FLD  F4, 0(R5)

Load/store buffers: Load1 busy (address R10); Load2, Load3, Store1, Store2 free

Reservation stations: Add1, Add2, Add3, Mult1, Mult2 all free

Register result status (clock 1): F4 = Load1

CLOCK 2

Instruction status
Instruction        Issue  Exe. compl.  Write result
FLD  F4, 0(R10)    1      2-
FDIV F8, F0, F4    2
FMUL F4, F2, F3
FMUL F4, F4, F4
FADD F6, F10, F4
FLD  F4, 0(R5)

Load/store buffers: Load1 busy (address R10); Load2, Load3, Store1, Store2 free

Reservation stations
Time  Name   Busy  Op   Vj   Vk   Qj   Qk
      Add1
      Add2
      Add3
      Mult1  yes   div  F0             Load1
      Mult2

Register result status (clock 2): F4 = Load1, F8 = Mult1

CLOCK 3

Instruction status
Instruction        Issue  Exe. compl.  Write result
FLD  F4, 0(R10)    1      2-3
FDIV F8, F0, F4    2
FMUL F4, F2, F3    3
FMUL F4, F4, F4
FADD F6, F10, F4
FLD  F4, 0(R5)

Load/store buffers: Load1 busy (address R10); Load2, Load3, Store1, Store2 free

Reservation stations
Time  Name   Busy  Op   Vj   Vk   Qj   Qk
      Add1
      Add2
      Add3
      Mult1  yes   div  F0             Load1
      Mult2  yes   mul  F2   F3

Register result status (clock 3): F4 = Mult2, F8 = Mult1

CLOCK 4

Instruction status
Instruction        Issue  Exe. compl.  Write result
FLD  F4, 0(R10)    1      2-3          4
FDIV F8, F0, F4    2      4-
FMUL F4, F2, F3    3      4-
FMUL F4, F4, F4
FADD F6, F10, F4
FLD  F4, 0(R5)

Load/store buffers: Load1, Load2, Load3, Store1, Store2 free

Reservation stations
Time            Name   Busy  Op   Vj   Vk   Qj   Qk
                Add1
                Add2
                Add3
yet 39 cycles   Mult1  yes   div  F0   F4
yet 9 cycles    Mult2  yes   mul  F2   F3

Register result status (clock 4): F4 = Mult2, F8 = Mult1

Notes
• FMUL F4, F4, F4 is stalled for lack of a free RS until cycle 13 (end of the preceding multiplication; there are only two slots). This blocks the emission of FADD, which could be executed since there are free slots in the corresponding RS

Example - 2
Same instruction stream:
80000000: FLD F4, 0(R10)
80000004: FDIV F8, F0, F4
80000008: FMUL F4, F2, F3
8000000C: FMUL F4, F4, F4
80000010: FADD F6, F10, F4
80000014: FLD F4, 0(R5)
Hazards marked on the slide, in order: RAW, WAW, RAW, WAW, RAW

ROB and register renaming. The instructions are in any case inserted in the ROB when a free slot is available, and then executed when the FU and the operands are available (the policy of all modern processors). In this way instructions not only terminate OOO (with the results reordered in the ROB) but are also emitted even if the FU is not available. The execution is totally OOO but with in-order commitment.

Initial situation

ROB (slots 1 to 5; columns: Addr, Op., Des, Source): empty

RAT (Register Allocation Table): one circular queue of renaming registers for each of F4, F6 and F8
F4:  P0 Free   P1 Free   P2 Free   P3 Free   P4 Free   P5 Arch
F6:  Q0 Busy   Q1 Free   Q2 Free   Q3 Free   Q4 Free   Q5 Arch
F8:  Z0 Busy   Z1 Free   Z2 Free   Z3 Free   Z4 Free   Z5 Arch

Notes
• A pointer marks the top free register of each circular queue
• The "Arch" entries are the architectural registers for F4, F6 and F8, which a program monitor would display
• The "Busy" entries are registers in use by not yet committed instructions. They will become architectural registers when the related instruction is committed
• Here we assume that the instruction using Z0 precedes the instruction using Q0
• The RAT entries for R5, R10, F0, F2 and F10 are not displayed

CLOCK 1

Instruction status
Instruction        Issue  Exe. compl.  Write result
FLD  F4, 0(R10)    1
FDIV F8, F0, F4
FMUL F4, F2, F3
FMUL F4, F4, F4
FADD F6, F10, F4
FLD  F4, 0(R5)

Load/store buffers: Load1 busy (address R10); Load2, Load3, Store1, Store2 free

Reservation stations: Add1, Add2, Add3, Mult1, Mult2 all free

ROB
Slot  Addr      Op.  Des  Source
1     80000000  FLD  P0   0,R10
2 to 5: empty

RAT (renaming)
F4:  P0 Busy   P1 Free   P2 Free   P3 Free   P4 Free   P5 Arch
F6:  Q0 Busy   Q1 Free   Q2 Free   Q3 Free   Q4 Free   Q5 Arch
F8:  Z0 Busy   Z1 Free   Z2 Free   Z3 Free   Z4 Free   Z5 Arch

CLOCK 2

Instruction status
Instruction        Issue  Exe. compl.  Write result
FLD  F4, 0(R10)    1      2-
FDIV F8, F0, F4    2
FMUL F4, F2, F3
FMUL F4, F4, F4
FADD F6, F10, F4
FLD  F4, 0(R5)

Load/store buffers: Load1 busy (address R10); Load2, Load3, Store1, Store2 free

Reservation stations
Time  Name   Busy  Op   Vj   Vk   Qj   Qk
      Add1
      Add2
      Add3
      Mult1  yes   div  F0             P0
      Mult2

ROB
Slot  Addr      Op.   Des  Source
1     80000000  FLD   P0   0,R10
2     80000004  FDIV  Z1   F0,P0
3 to 5: empty

RAT (renaming)
F4:  P0 Busy   P1 Free   P2 Free   P3 Free   P4 Free   P5 Arch   (P0 is the most recent physical register for F4)
F6:  Q0 Busy   Q1 Free   Q2 Free   Q3 Free   Q4 Free   Q5 Arch
F8:  Z0 Busy   Z1 Busy   Z2 Free   Z3 Free   Z4 Free   Z5 Arch
