1
DLX computer Electronic Computers M 1 RISC architectures RISC vs - - PowerPoint PPT Presentation
2
RISC architectures
- RISC vs CISC (Reduced Instruction Set Computer vs Complex Instruction Set Computer)
- In CISC architectures 10% of the instructions are used in 90% of cases
- Waste of silicon
- Bottleneck: the bus
- Mid ‘80s: a new architecture, RISC
- Solution: reduction of instruction number and complexity (fewer, simpler machine instructions)
- Fixed instruction format (simpler instruction decoders)
- Simpler control logic network, leaving room to increase the number of on-chip registers
- Reduction of bus/memory accesses
- Increase of the machine instructions needed for a job, which is (in many cases) more than compensated (in terms of time) by the reduction of bus accesses
- CISC and RISC are each the best solution in different application fields
- Nowadays both architectures coexist in the same processor: analysis at the end of the course
- A simplified RISC architecture: DLX (implemented as real processor in the ‘80s as R4000)
3
DLX (fixed) instruction format
R format
Op-code (6 bit, 31..26) | Ra (5 bit, 25..21) | Rb (5 bit, 20..16) | Rc (5 bit, 15..11) | Op-code extension (11 bit, 10..0)
Arithmetic or logic instructions (i.e. Ra ← Rb op Rc) or Set Condition between registers.
I format
Op-code (31..26) | Ra (25..21) | Rb (20..16) | Immediate operand or offset (15..0)
Data transfer (Load, Store), conditional Branch, JR and JALR (control transfer via register), Set Condition and ALU with immediate operand. In Load and ALU instructions Ra = destination; in the Store Ra = source. Rb is the ALU operand for the immediate instructions.
J format
Op-code (31..26) | 26 bit (PC relative) offset (25..0)
Direct, unconditional control transfer (J and JAL).
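The three formats share the same bit boundaries, so a single routine can extract every field. A minimal sketch in Python, assuming only the bit positions listed above; the function and key names (`decode_fields`, `encode_r`, `imm16`, `off26`) are hypothetical, chosen for illustration:

```python
def decode_fields(instr: int) -> dict:
    """Split a 32-bit DLX instruction word into the fields of the R, I and J formats."""
    instr &= 0xFFFFFFFF
    return {
        "opcode": (instr >> 26) & 0x3F,   # bits 31..26 (all formats)
        "ra": (instr >> 21) & 0x1F,       # bits 25..21 (R and I)
        "rb": (instr >> 16) & 0x1F,       # bits 20..16 (R and I)
        "rc": (instr >> 11) & 0x1F,       # bits 15..11 (R only)
        "ext": instr & 0x7FF,             # bits 10..0, op-code extension (R only)
        "imm16": instr & 0xFFFF,          # bits 15..0, immediate or offset (I only)
        "off26": instr & 0x3FFFFFF,       # bits 25..0, PC-relative offset (J only)
    }


def encode_r(opcode: int, ra: int, rb: int, rc: int, ext: int) -> int:
    """Hypothetical helper: pack the five fields of an R-format word."""
    return (opcode << 26) | (ra << 21) | (rb << 16) | (rc << 11) | ext


word = encode_r(0, 3, 1, 4, 0x20)
fields = decode_fields(word)
```

Which fields are meaningful depends on the op-code; the decoder simply exposes all of them, much as the hardware presents all IR bits to the decode logic.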
4
DLX non floating-point instructions
(31x32bit registers R31…R1 - R0=0 fixed - Ra and Rb any of the 32 registers)
Data Transfer
LW Ra, offset(Rb); LB Ra, offset(Rb); LBU Ra, offset(Rb); LHU Ra, offset(Rb); LH Ra, offset(Rb)
SW Ra, offset(Rb); SH Ra, offset(Rb); SB Ra, offset(Rb)
LHI Ra, value
Arithmetic/Logic
ADD Ra,Rb,Rc; ADDI Ra,Rb,value; ADDU Ra,Rb,Rc; ADDUI Ra,Rb,value
SUB Ra,Rb,Rc; SUBI Ra,Rb,value; SUBU Ra,Rb,Rc; SUBUI Ra,Rb,value
DIV Ra,Rb,Rc; DIVI Ra,Rb,value
MULU Ra,Rb,Rc; MULI Ra,Rb,value
SLL Ra,Rb,Rc; SLLI Ra,Rb,value
SHR Ra,Rb,Rc; SHRI Ra,Rb,value
SLA Ra,Rb,Rc; SLAI Ra,Rb,value
OR Ra,Rb,Rc; ORI Ra,Rb,value
XOR Ra,Rb,Rc; XORI Ra,Rb,value
AND Ra,Rb,Rc; ANDI Ra,Rb,value
Control
SETx Ra,Rb,Rc; SETIx Ra,Rb,value
BEQZ Ra, offset (target = offset + [PC]); BNEQZ Ra, offset (target = offset + [PC])
J offset; JR Ra
JL offset (target = offset + [PC]); JLR Ra
N.B.
- Postfix x (set condition) can be LT, GT, LE, GE, EQ, NE
- JL (via or not via register): jump and link, saving PC in R31
- Offset is a value within the instruction
- Postfix I means «immediate» (value within the instruction)
- Postfix A means «arithmetic» (sign extension)
- Postfix U means «unsigned»
- Value is the immediate within the instruction
- No STACK registers
5
DLX ALU operations
Two input data, one output datum plus flags.
S1, S2: ALU inputs (32 bit)
Operations:
- S1 + S2
- S1 – S2
- S1 and S2
- S1 or S2
- S1 exor S2
- Left shift of S1 by S2 positions
- Right shift of S1 by S2 positions
- Arithmetic right shift of S1 by S2 positions
- S1
- S2
- 1
Output flags: Zero, Negative sign
ALU is a combinatorial circuit !!!
[Figure: ALU block with 32-bit inputs S1 and S2, 32-bit output OUT, Controls input and Flags output. Ready ?]
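As an illustration of the operations and flags listed for the ALU, here is a minimal behavioural sketch in Python. The operation names and the choice of using only the low 5 bits of S2 as shift amount are assumptions, not taken from the slides; the pass-through outputs (S1, S2, 1) are omitted.

```python
MASK = 0xFFFFFFFF


def to_signed(x: int) -> int:
    """Interpret a 32-bit pattern as a two's complement number."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


def alu(op: str, s1: int, s2: int):
    """Return (out, zero_flag, negative_flag) for one combinatorial ALU operation."""
    s1 &= MASK
    s2 &= MASK
    shift = s2 & 31  # assumption: only the low 5 bits select the shift amount
    if op == "add":
        out = s1 + s2
    elif op == "sub":
        out = s1 - s2
    elif op == "and":
        out = s1 & s2
    elif op == "or":
        out = s1 | s2
    elif op == "xor":
        out = s1 ^ s2
    elif op == "sll":
        out = s1 << shift
    elif op == "srl":
        out = s1 >> shift                 # logical: zeros enter from the left
    elif op == "sra":
        out = to_signed(s1) >> shift      # arithmetic: the sign bit is replicated
    else:
        raise ValueError(f"unknown ALU operation: {op}")
    out &= MASK
    return out, out == 0, bool(out & 0x80000000)
```

Being a pure function of its inputs, the model mirrors the combinatorial nature of the circuit: there is no internal state between calls.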
6
Abstract instruction execution (sequential DLX)
This is a synchronous state diagram:
INSTRUCTION FETCH: [REGINSTR] <= M[PC]
INSTRUCTION DECODE: [PC] <= [PC] + 4; [A] <= [Ra]; [B] <= [Rb]; [C] <= [Rc]; [X] <= num[Ra] ([X]: number of the destination register)
INSTRUCTION EXECUTION: Data transfer, ALU, Set, Jump or Branch
Next instruction
PC is the Program Counter, A and B are two internal scratchpad registers, REGinstr is the register where the newly fetched instruction is stored. All these registers are unknown to the programmer.
7
Example: LB (LOAD BYTE, format I)
LB Ra, offset(Rb)
Op-code (31..26) | Ra (25..21) | Rb (20..16) | offset (15..0); bit 31 is the MSbit, bit 0 the LSbit
INSTR <= M[PC]
[PC] <= [PC] + 4; [A] <= [Ra]; [B] <= [Rb]; [C] <= [Rc]; [X] <= num[Ra]
Byte address computation: Addr. <= [B] + (Instr15)16 ## Instr15..0
LOAD byte into register: [Ra] <= (M[Addr.]7)24 ## M[Addr.]7..0
## => JOIN operator
Instr15..0 is the instruction offset; the address is always 32 bit.
Sign extension of the offset: instruction bit 15 (sign) is extended to the left 16 times.
Sign extension of the data !! Example: M[Addr]7..0 = A7H = (10100111)b => sign-extended value loaded into Ra = FFFFFFA7H
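The `(Instr15)16 ## Instr15..0` notation replicates the sign bit and joins it to the low bits. A minimal Python sketch of that operation and of the byte-address computation; the function names are hypothetical and values are modelled as unsigned 32-bit patterns:

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value: int, bits: int) -> int:
    """Replicate bit (bits-1) of value over the upper bits of a 32-bit word."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    # XOR/subtract trick: subtracts 2**bits exactly when the sign bit is set
    return ((value ^ sign) - sign) & MASK32


def effective_address(b: int, instr: int) -> int:
    """Addr <= [B] + (Instr15)16 ## Instr15..0, truncated to 32 bits."""
    return (b + sign_extend(instr & 0xFFFF, 16)) & MASK32


loaded = sign_extend(0xA7, 8)   # the A7H example from the LB slide
```

The same helper covers the 8-bit data extension of LB, the 16-bit offset extension, and the 26-bit extension used by J-format instructions.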
8
Sign extension - example with IR
(IR15)16 ## IR15..0
[Figure: IR bits 15..0 drive bus bits 15..0 and IR bit 15 is replicated on bus bits 31..16, through tri-state devices enabled by the Control Unit.]
9
Data transfer instructions (I format): examples
LW Ra, offset(Rb); LB Ra, offset(Rb); LBU Ra, offset(Rb) (unsigned); LH Ra, offset(Rb); LHU Ra, offset(Rb) (unsigned); SW Ra, offset(Rb)
All: Addr. <= [B] + (Instr15)16 ## Instr15..0
LB (byte, signed): [Ra] <= (M[Addr]7)24 ## M[Addr]7..0
LBU (byte, unsigned): [Ra] <= (0)24 ## M[Addr]7..0
LH (half word, signed): [Ra] <= (M[Addr]15)16 ## M[Addr]15..0
LHU (half word, unsigned): [Ra] <= (0)16 ## M[Addr]15..0
LW (word): [Ra] <= M[Addr]
SW: M[Addr] <= [A]
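The load variants differ only in how many bytes are read and in how the upper register bits are filled. A minimal Python sketch, assuming a big-endian byte-addressed memory modelled as a `bytes` object (the byte order is an assumption; the slides do not state it) and a hypothetical `load` helper:

```python
LOADS = {
    # mnemonic: (size in bytes, sign-extended?)
    "LB": (1, True),
    "LBU": (1, False),
    "LH": (2, True),
    "LHU": (2, False),
    "LW": (4, False),
}


def load(mnemonic: str, memory: bytes, addr: int) -> int:
    """Return the 32-bit pattern written into Ra by the given load instruction."""
    size, signed = LOADS[mnemonic]
    raw = memory[addr:addr + size]
    # signed=True replicates the top bit, signed=False fills with zeros
    return int.from_bytes(raw, "big", signed=signed) & 0xFFFFFFFF


memory = bytes([0xA7, 0x12, 0x34, 0x56])
```

With this memory image, `load("LB", memory, 0)` reproduces the A7H example, while `LBU` on the same byte leaves the upper 24 bits at zero.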
10
ALU instructions examples (R and I formats)
ADD Ra,Rb,Rc; ADDI Ra,Rb,value; ADDU Ra,Rb,Rc; ADDUI Ra,Rb,value; ………………………
Operand selection:
- Register (format R): [T] <= [Rc]
- Immediate (format I): [T] <= (Instr15)16 ## Instr15..0
Register content is treated as signed for arithmetic operations.
(T is a hidden register, unknown to the programmer, storing temporary data)
ADD: [Ra] <= [Rb] + [T]
SUB: [Ra] <= [Rb] - [T]
AND: [Ra] <= [Rb] and [T]
OR: [Ra] <= [Rb] or [T]
XOR: [Ra] <= [Rb] xor [T]
The same scheme applies to the shifts etc. A and B are generic registers (Ra, Rb).
11
SET instructions (see branch)
- Ex. SLT Ra,Rb,Rc: set Ra = 1 if Rb is less than Rc, otherwise Ra = 0
Operand selection:
- Register (format R): [T] <= [Rc]
- Immediate (format I): [T] <= (Instr15)16 ## Instr15..0
SEQ: [Ra] = 1 if [Rb] = [T]
SLT: [Ra] = 1 if [Rb] < [T]
SGE: [Ra] = 1 if [Rb] >= [T]
SLE: [Ra] = 1 if [Rb] <= [T]
SGT: [Ra] = 1 if [Rb] > [T]
SNE: [Ra] = 1 if [Rb] != [T]
Register content is treated as signed. (T is a hidden register, unknown to the programmer, storing temporary data)
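Because the register content is treated as signed, the comparison must be made on two's complement values rather than on raw bit patterns. A minimal Python sketch of the six set conditions; `set_condition` and `as_signed` are hypothetical names:

```python
import operator

CONDITIONS = {
    "EQ": operator.eq,
    "NE": operator.ne,
    "LT": operator.lt,
    "LE": operator.le,
    "GT": operator.gt,
    "GE": operator.ge,
}


def as_signed(x: int) -> int:
    """Interpret a 32-bit pattern as a two's complement number."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def set_condition(postfix: str, rb: int, t: int) -> int:
    """Value written into Ra by SETx: 1 if the condition holds, else 0."""
    return 1 if CONDITIONS[postfix](as_signed(rb), as_signed(t)) else 0
```

For example, the pattern FFFFFFFFH is -1 when signed, so `set_condition("LT", 0xFFFFFFFF, 1)` yields 1 even though the raw pattern is the larger unsigned number.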
12
JUMP Instructions
- J offset (jump address): format J
- JR Ra (jump register): format I
- JL offset (jump and link address): format J
- JLR Ra (jump and link register): format I
(JL and JLR are labelled JAL and JALR in the state diagram.)
State sequence, starting from INIT:
- JAL, JALR: [T] <= [PC], then [R31] <= [T] (for saving [PC] in R31)
- J (JMP), JAL: [PC] <= [PC] + (Instr25)6 ## Instr25..0
- JR, JALR: [PC] <= [Ra]
13
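Branch Instructions (format I)
- BEQZ: is [Ra] = 0? YES: branch taken; NO: next instruction
- BNEZ: is [Ra] != 0 (i.e. [Ra] = 1 after a SET instruction)? YES: branch taken; NO: next instruction
- Branch taken: [PC] <= [PC] + (Instr15)16 ## Instr15..0
- Ex. BNEQZ R5, 100: jump to PC+100 if R5 is not equal to 0 (the PC has already been incremented by 4 in the decode state)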
14
The Pipelining Principle
Pipelining is the main basic technique used for “speeding-up” a CPU.
The key idea of pipelining is general, and is currently applied in several industrial fields (production lines, oil pipelines, …).
A system S must operate N times on a task Ai, producing result Ri:
A1, A2, A3 … AN => S => R1, R2, R3 … RN
Latency: time elapsing between the beginning and the end of task A (TA).
Throughput: frequency of task completion.
15
The Pipelining Principle
1) Sequential System: a new instruction starts when the previous instruction is finished.
[Figure: A1, A2, A3 … An executed one after the other along the time axis t; each lasts TA.]
An: n-th instruction. Latency (execution time of a single instruction) = TAn. Instructions may have different execution times.
2) Pipelined System: instructions are subdivided into stages, each stage lasting one n-th (1/4 in this example) of the entire instruction time. The stages of successive instructions overlap.
[Figure: system S split into the pipeline stages S1 S2 S3 S4; task A split into P1 P2 P3 P4 along the time axis t. Si: pipeline stage.]
16
The Pipelining Principle
[Figure: tasks A1, A2, A3, A4 … An, each split into the stages P1 P2 P3 P4; each task starts one pipeline cycle TP after the previous one.]
TP: pipeline cycle (ideally one clock). In each cycle one instruction terminates: in the figure A1 terminates at tx, in the next cycle A2 terminates at ty, etc.
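The gain can be quantified with a small calculation. The sketch below assumes an ideal pipeline (every stage lasts exactly one cycle TP, no stalls); the formulas and function names are illustrative and are not stated on the slides:

```python
def sequential_time(n_tasks: int, n_stages: int, tp: int) -> int:
    """Each task runs alone, start to finish: n_tasks * n_stages cycles."""
    return n_tasks * n_stages * tp


def pipelined_time(n_tasks: int, n_stages: int, tp: int) -> int:
    """The first task needs n_stages cycles; every later one finishes one cycle after."""
    return (n_stages + n_tasks - 1) * tp


# Four tasks through the four stages P1..P4, as in the figure
seq = sequential_time(4, 4, 1)
pipe = pipelined_time(4, 4, 1)
speedup = seq / pipe
```

The latency of a single task is unchanged (four cycles); what improves is the throughput, which approaches one task per cycle as the number of tasks grows.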
17
Typical instruction stages
- IF: instruction fetch (from memory)
- ID: instruction decode
- EX: instruction execution (ALU)
- MEM: data memory access (if needed; not needed by register instructions)
- WB: write-back (if needed; not needed by jumps)
N.B. The execution time (latency) of all instructions must be the same, to maintain the order of the results. Some stages are not used by some instructions (the stage is a NOP for them), e.g. the MEM stage for register operations.
18
Pipelining of a CPU (DLX)
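Instruction sequence: I1, I2, I3 … IN
[Figure: CPU (datapath) split into the stages IF ID EX MEM WB, separated by the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; instruction j crosses IF, ID, EX, MEM, WB along the time axis t. Legend: registers (pipeline registers, D flip-flops) and combinatorial circuits.]
ClockPerInstruction (CPI) = 1 (ideally !)
Pipeline cycle = clock cycle, determined by the delay of the slowest stage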
19
DLX Pipeline
[Figure: instructions i, i+1, i+2, i+3, i+4, each going through IF ID EX MEM WB and each starting one clock after the previous one.]
CPI (ideally) = 1
Overhead introduced by the Pipeline Registers:
Tclk = Td + TP + Tsu
- Td: switch delay of the input stage register
- TP: delay of the slowest combinatorial stage
- Tsu: set-up time of the output stage register
[Figure: D register, combinatorial circuit (delay TP), D register.]
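A worked instance of the clock-period formula may help. The delay values below are invented for illustration (arbitrary time units); only the formula Tclk = Td + TP + Tsu comes from the slide:

```python
def clock_period(td: int, stage_delays: list, tsu: int) -> int:
    """Tclk = Td + TP + Tsu, where TP is the delay of the slowest combinatorial stage."""
    return td + max(stage_delays) + tsu


# Hypothetical delays of the IF, ID, EX, MEM and WB combinatorial logic
stages = [4, 5, 3, 5, 2]
tclk = clock_period(1, stages, 1)
overhead = tclk - max(stages)   # time spent in the pipeline registers each cycle
```

Splitting the slowest stage further would reduce TP but not Td + Tsu, which is why the register overhead limits how deep a pipeline can usefully be.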
20
21
Pipeline implementation requirements
- Each stage is active at each clock cycle.
- The PC is incremented in the IF stage.
- An ADDER should be introduced in the IF stage (PC <= PC + 4: one instruction is 4 bytes). But instructions are aligned (each one starts at an address that is a multiple of the instruction length in bytes), so bits 1 and 0 of the PC are always 0 and only a 30-bit register (bits 31..2, a programmable counter for jumps) is used, incremented by 1 each clock cycle.
- Two Memory Data Registers are required (referred to as LMDR and SMDR). In fact, when a LOAD is immediately followed by a STORE, the WB and MEM stages overlap: two data are therefore waiting to be written (one onto the memory, the other onto a register of the RF).
- In each clock cycle 2 memory accesses may have to be executed (IF, MEM): Instruction Memory (IM) and Data Memory (DM), i.e. “Harvard” architecture.
- The CPU clock is determined by the slowest stage.
- Pipeline Registers store both data and control information (“distributed” control unit).
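The 30-bit program counter mentioned in the list above can be sketched as follows. This is an illustrative Python model, not the hardware description; the class name `WordPC` and its methods are hypothetical:

```python
class WordPC:
    """Program counter stored as a 30-bit word counter (address bits 31..2)."""

    MASK30 = 0x3FFFFFFF

    def __init__(self, address: int = 0):
        self.jump(address)

    def jump(self, address: int) -> None:
        """Programmable load, used by jumps and taken branches."""
        if address % 4:
            raise ValueError("instruction addresses must be multiples of 4")
        self.counter = (address >> 2) & self.MASK30

    def step(self) -> None:
        """Sequential fetch: +1 on the counter is +4 on the byte address."""
        self.counter = (self.counter + 1) & self.MASK30

    @property
    def address(self) -> int:
        """Full 32-bit byte address: bits 1 and 0 are always 0."""
        return self.counter << 2


pc = WordPC(0x100)
pc.step()
```

After the step, `pc.address` is 0x104, which shows that counting by 1 on 30 bits is equivalent to adding 4 to the byte address.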
DLX Pipelined Datapath
[Figure: five-stage datapath, IF | ID | EX | MEM | WB, with the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB between the stages. Blocks: PC, ADD 4, INSTR MEM, DEC, RF, SE, A, B, C, ALU, =0?, DATA MEM and several MUXes.]
- PC: actually a programmable counter; ADD 4 and a MUX select the next PC value (the MUX switches if jump)
- RF read ports: Ra = IR25-21, Rb = IR20-16, Rc = IR15-11
- Num [Ra]: number of the destination register in case of LOAD and ALU instructions
- SE: sign extension, for operations with immediates
- ALU input MUXes: for computing the new PC value when branch
- =0?: for Branch, and for Set Condition (also <0 and >0) [it acts on the output]
- JL and JLR: PC saved in R31
- RF write port: D = data (from register, memory, or PC for link); DR = destination register number (1-31)
23
ID stage (N.B. stage layout different from previous slide!)
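[Figure: ID stage between the IF/ID and ID/EX registers, containing RF, DEC and SE. The information listed below travels with the instruction.]
- From IF/ID: IR and PC (32 bit each)
- DEC inputs: IR31-26 (opcode) and IR10-00 (R instructions)
- SE: sign extension on IR15 (fills bits 31-16, for Immed./Branch) and on IR25 (fills bits 31-26, for the 26-bit offset of J and JL)
- IR15-0: offset/immediate (bits 15-11 hold the destination register in R instructions)
- IR25-16: Jump; Jump and Link
- PC31-0: JL and JLR
- From the WB stage: number of the destination register (DR, 5 bit) and data (D, 32 bit)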
24
DLX Pipelined Datapath
[Figure: the five-stage datapath (IF | ID | EX | MEM | WB) showing the content of the pipeline registers. IF/ID: IR1, PC1. ID/EX: IR2, PC2 and the RF outputs. EX/MEM: IR3, PC3, COND, Z, SMDR. MEM/WB: IR4, PC4, Y, LMDR. Blocks: PC, ADD 4, IM, DEC, RF (Ra, Rb, Rc, Num [Ra]), SE, ALU, =0?, DM (Address, Data) and several MUXes.]
- Z: computed data, or memory address, or branch address
- Y: computed data from the previous stage
- SMDR => Store Memory Data Register
- LMDR => Load Memory Data Register
- IRi => Instruction Register i
- =0?: for Branch, and for Set Condition (also <0 and >0) [it acts on the output]
- JL, JLR: PC saved in R31
- RF write port: D = data; DR = destination register number
25
Pipelined execution of an “ALU” instruction
The result of each stage is sampled at the end of its cycle.
IF: IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
ID: A <= Ra; B <= Rb; C <= Rc; PC2 <= PC1; IR2 <= IR1; ID/EX <= Instruction decode; [X] <= num[Ra]
EX: Z <= A op B, or Z <= A op [(IR2_15)16 ## IR2_15..0]; [PC3 <= PC2]; [IR3 <= IR2]
MEM: Y <= Z (temporary storage for WB); [PC4 <= PC3]; [IR4 <= IR3]
WB: Ra <= Y
The decoded opcode travels through all stages.
NOTE: IRi bits are dropped stage by stage when no more needed, for all instructions. Why ? JAL, JALR!!
26
Pipelined execution of a “MEM” instruction
IF: IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
ID: A <= Ra; B <= Rb; C <= Rc; PC2 <= PC1; IR2 <= IR1; ID/EX <= Instruction decode; [X] <= num[Ra]
EX: MAR <= B op (IR2_15)16 ## IR2_15..0; SMDR <= A; [PC3 <= PC2]; [IR3 <= IR2]
MEM: LMDR <= M[MAR] (if LOAD), or M[MAR] <= SMDR (if STORE); [PC4 <= PC3]; [IR4 <= IR3]
WB: Ra <= LMDR (if LOAD) [Sign ext.]
The decoded opcode travels through all stages.
27
Pipelined execution of a “BRANCH” instruction (normally after a SCn instruction – see later)
X : “BTA (BRANCH TARGET ADDRESS)”
IF: IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
ID: A <= Ra; B <= Rb; C <= Rc; PC2 <= PC1; IR2 <= IR1; ID/EX <= Instruction decode; [X] <= num[Ra]
EX: Z <= PC2 op (IR2_15)16 ## IR2_15..0 (computed new PC address); Cond <= A op 0 (branch on Reg A value, 0/1); [PC3 <= PC2]; [IR3 <= IR2]
MEM: if (Cond) PC <= Z (new value in PC at the end of this cycle); [PC4 <= PC3]; [IR4 <= IR3]
WB: (NOP)
The decoded opcode travels through all stages.
When the branch is taken, 3 new unwanted instructions have already started.
28
Pipelined execution of a “JR” instruction
IF: IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
ID: A <= Ra; B <= Rb; C <= Rc; PC2 <= PC1; IR2 <= IR1; ID/EX <= Instruction decode; [X] <= num[Ra]
EX: Z <= A (new PC address); [PC3 <= PC2]; [IR3 <= IR2]
MEM: PC <= Z (new value in PC in this interval); [PC4 <= PC3]; [IR4 <= IR3]
WB: (NOP)
The decoded opcode travels through all stages.
When the jump is executed, 3 new unwanted instructions have already started.
Which would be the stage sequence for a J instruction?
29
Pipelined execution of a “JL or JLR” instruction
IF: IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
ID: A <= Ra; B <= Rb; C <= Rc; PC2 <= PC1; IR2 <= IR1; ID/EX <= Instruction decode; [X] <= num[Ra]
EX: PC3 <= PC2; Z <= A (if JLR), or Z <= PC2 + (IR2_25)6 ## IR2_25..0 (if JL); [IR3 <= IR2]
MEM: PC <= Z (new value in PC in this interval); PC4 <= PC3; [IR4 <= IR3]
WB: R31 <= PC4
The decoded opcode travels through all stages. In this case the PCi values are used.
NOTE: the write on R31 CANNOT be performed on the fly, since it could overlap with another register write.
When the jump is executed, 3 new unwanted instructions have already started.
30
Which would be the sequence in case of SCn (ex SLT R1,R2,R3) ?
IF: IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
ID: A <= Ra; B <= Rb; C <= Rc; PC2 <= PC1; IR2 <= IR1; ID/EX <= Instruction decode; [X] <= num[Ra]
EX: ?
MEM: ?
WB: ?
31
Pipeline Hazards
A “Hazard” occurs when an instruction currently in a pipeline stage can’t be executed in the current clock cycle.
- Structural Hazards: the same resource is used by two different pipeline stages, so the instructions currently in those stages can’t be executed simultaneously.
- Data Hazards: they are due to instruction dependencies. For example, an instruction needs to read an RF register not yet written by a previous instruction (Read After Write).
- Control Hazards: the instructions following a branch depend on the branch result (taken/not taken).
The instruction that cannot be executed must be stalled (“pipeline stall” or “pipeline bubbling”), together with all the following instructions, while the previous instructions must proceed normally (so as to eliminate the hazard).
32
Hazards and stalls
[Figure: Clk 1 … Clk 12. Ii-3, Ii-2 and Ii-1 go through IF ID EX MEM WB in successive clocks; Ii is held in ID and Ii+1 in IF for three stall cycles (S S S) before proceeding to EX, MEM and WB.]
The consequence of a data hazard: if instruction Ii needs the result of instruction Ii-1 (data are read in the ID stage), it must wait until after the WB of Ii-1.
Instruction stalls: Ti = 8 * CLK = (5 + 3) * CLK = 5 * (1 + 3/5) * CLK
Stall: the clock signal for Ii, Ii+1, etc. is blocked for three periods.
Normally the three stalled instructions are transformed into NOPs to avoid clock blocking.
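The three-stall case above is the worst one: the consumer immediately follows the producer. A minimal sketch of the stall count as a function of the distance between the two instructions, assuming (as stated in these slides) that registers are read in ID, written at the end of WB, and that no forwarding is available; the function names are hypothetical:

```python
PIPELINE_DEPTH = 5  # IF ID EX MEM WB


def raw_stalls(distance: int) -> int:
    """Stall cycles for a consumer issued `distance` instructions after the producer.

    The consumer's ID must come after the producer's WB, i.e. at least
    4 cycles after the producer's ID.
    """
    if distance < 1:
        raise ValueError("the consumer must follow the producer")
    return max(0, 4 - distance)


def latency_in_clocks(distance: int) -> int:
    """Clocks from IF to WB for the consumer, stalls included."""
    return PIPELINE_DEPTH + raw_stalls(distance)
```

With `distance = 1` the function gives 3 stalls and a latency of 8 clocks, matching Ti = (5 + 3) * CLK; from a distance of 4 onwards the hazard disappears.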
33
Forwarding
Forwarding allows eliminating almost all RAW hazards of the pipeline without stalling the pipeline. (NOTE: in DLX, registers are modified only in WB stage)
[Figure: Clk 1 … Clk 9; each instruction goes through IF ID EX MEM WB, starting one clock after the previous one. A, B, C and the OpCode are sampled in ID.]
ADD R3, R1, R4
SUB R7, R3, R5      hazard
OR R1, R3, R5       hazard
LW R6, 100 (R3)     hazard: here too the requested data is not yet in the RF, since it is written on the positive clock edge at the end of WB (the register value is read in ID!)
AND R9, R5, R3      no hazard
Data are read from the registers in the ID stage.
34
Forward implementation
[Figure: forwarding datapath. The Forwarding Unit (FU) controls the MUXes on the ALU inputs, which select between the ID/EX values (A, B, C, Offset, PC) and the results held in EX/MEM (ALU) and MEM/WB (Mem), through the paths FD1, FD2 and FD3. A bypass MUX sits at the RF output.]
- Rd1, Rd2 (with their opcodes, from IR3 and IR4): destination registers 1-31
- A, B, C: source registers 1-31
- FU: combinatorial!! Comparison between A, B, C and Rd1, Rd2, and the opcodes
- Bypass MUX: often performed inside the RF. It allows “the anticipation” of the register on ID/EX. MUX control: IF opcode and comparison of RD with the Ra, Rb and Rc numbers
35
Forward Unit implementation
[Flowchart: the answers to the following questions select the forwarding paths FD1, FD2, FD3, or no forwarding (No FD1, No FD2, No FD3).]
- Will the instruction in the MEM or WB stage write a register whose number is identical to the Ra, Rb or Rc number?
- Does the instruction in the MEM stage want to write a register? Is the destination register number identical to the Ra, Rb or Rc number? Does the fetched instruction need the register in the MEM stage?
- Does the instruction in the WB stage want to write a register? Is the destination register number identical to the Ra, Rb or Rc number, and different from the register which will be written by the MEM stage? Does the fetched instruction need the register being written by the WB stage?
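The decision questions above can be condensed into a small selection function. This is a simplified Python sketch, not the slide's actual FD1/FD2/FD3 wiring (which cannot be recovered from the text): it only decides, for one source register, whether the operand comes from the instruction in MEM, from the one in WB, or from the register file. All names are hypothetical.

```python
def forward_select(src: int, mem_dest: int, mem_writes: bool,
                   wb_dest: int, wb_writes: bool) -> str:
    """Choose the source of an operand for the instruction needing register `src`.

    R0 is fixed at 0 and is never forwarded (destinations are 1-31).
    The MEM-stage result has priority because it is the more recent one.
    """
    if src != 0 and mem_writes and mem_dest == src:
        return "EX/MEM"
    if src != 0 and wb_writes and wb_dest == src:
        return "MEM/WB"
    return "RF"


# ADD R3,R1,R4 followed by SUB R7,R3,R5: when SUB needs R3, ADD is in MEM
choice = forward_select(3, mem_dest=3, mem_writes=True, wb_dest=0, wb_writes=False)
```

The priority rule corresponds to the condition "different from the register which will be written by the MEM stage" in the WB question: when both stages target the same register, the newer value wins.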
36
Data hazard due to LOAD instructions
NOTE: the data required by the ADD is available only at the end of the MEM stage. This hazard cannot be eliminated by forwarding (unless there is an additional input in the MUXes between memory and ALU: delays!)
The pipeline needs to be stalled.
Sequence: LW R1,32(R6); ADD R4,R1,R7; SUB R5,R1,R8; AND R6,R1,R7
[Figure: pipeline diagram of the sequence, first without and then with the stall.]
LW R1,32(R6)     IF ID EX MEM WB
ADD R4,R1,R7     IF ID S EX MEM
SUB R5,R1,R8     IF ID EX
AND R6,R1,R7     IF ID
Implementation: the ADD is transformed into a NOP (NOP: IF ID EX MEM WB) and PC <- PC - 4, so that ADD R4,R1,R7 is re-fetched (IF ID EX MEM). From the end of this stage onwards: standard forwarding.
This slide must be viewed using its .PSM version
37
Delayed load
In many RISC CPUs, the special hazard associated with the LOAD instruction (which would in any case lead to a stall) is not handled by stalling the pipeline but by software, through the compiler (delayed load). Please notice that in any case a hardware forwarding network is required.
LOAD instruction
delay slot
Next instruction
The compiler tries to fill the delay slot with a “useful” instruction (worst case: NOP).
Example:
LW R1,32(R6)
LW R3,10(R4)
ADD R5,R1,R3
LW R6,20(R7)
LW R8,40(R9)
In this example R3 is needed by the ADD instruction while it is still being read from memory [instruction LW R3,10(R4)]; the forwarding hardware is still used.
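The worst case of the delayed-load scheme (a NOP in the delay slot) can be sketched in a few lines of Python. This is only the fallback behaviour: a real compiler would first try to move an independent instruction into the slot. The tuple representation `(opcode, destination, sources)` and the function name are assumptions made for illustration.

```python
NOP = ("NOP", None, ())


def fill_load_delay_slots(program):
    """Insert a NOP after every LW whose next instruction reads the loaded register."""
    scheduled = []
    for i, (op, dest, sources) in enumerate(program):
        scheduled.append((op, dest, sources))
        has_next = i + 1 < len(program)
        if op == "LW" and has_next and dest in program[i + 1][2]:
            scheduled.append(NOP)   # worst case: nothing useful to put in the slot
    return scheduled


program = [
    ("LW", "R1", ("R6",)),          # LW R1,32(R6)
    ("LW", "R3", ("R4",)),          # LW R3,10(R4)
    ("ADD", "R5", ("R1", "R3")),    # ADD R5,R1,R3 needs R3 right after its load
    ("LW", "R6", ("R7",)),          # LW R6,20(R7)
    ("LW", "R8", ("R9",)),          # LW R8,40(R9)
]
scheduled = fill_load_delay_slots(program)
```

On the example sequence a single NOP is inserted, between `LW R3,10(R4)` and the `ADD`; R1 causes no problem because another instruction already separates its load from its use.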
38
Control Hazards
Address      Instruction
PC           BEQZ R4, 200
PC+4         SUB R7, R3, R5
PC+8         OR R1, R3, R5
PC+12        LW R6, 100 (R8)
PC+4+200     AND R9, R5, R3 (BTA)
[Figure: pipeline diagram over Clk 1 … Clk 8. BEQZ goes through IF ID EX MEM WB; SUB, OR and LW are fetched in the following three clocks; then comes the fetch with the new PC.]
- New computed PC value (Aluout)
- New value in PC (one clock after: the new value must be clocked onto the PC)
- Fetch with the new PC
Next instruction address:
- R4 = 0: Branch Target Address (taken)
- R4 ≠ 0: PC+4 (not taken)
39
DLX Branch or JMP
BEQZ R4, 200
[Figure: pipelined datapath (detailed datapath: see the slide “DLX Pipelined Datapath”) while a JMP, assumed here to be the I-th instruction, moves through the stages. When the JMP is in MEM, the instructions I+1, I+2 and I+3 occupy EX, ID and IF.]
When the new PC acts on the IM, three instructions have already travelled through the first three stages (EX included).
NOTE: if the feedback signal of the new PC were taken directly from the ALU output instead of Z, the required stalls would be only two, at the price of a slower clock!
40
Handling the Control Hazards
BEQZ R4,200
- Always Stall (three-clock block being propagated)
[Figure: Clk 1 … Clk 8. BEQZ: IF ID EX MEM WB; the following instruction: IF S S S, then fetch at the new PC.]
Real situation: repeated IF, with PC <= PC - 4. At the first IF the previous instruction (BEQZ) has not yet been decoded.
- Predict Not Taken
[Figure: Clk 1 … Clk 8. BEQZ R4, 200; SUB R7, R3, R5; OR R1, R3, R5; LW R6, 100 (R8) enter the pipeline in successive clocks; the branch completes in MEM.]
Figure annotations:
- Here the new value of the PC has been computed
- Here the new value is sampled by the PC
- If branch taken: flush. The three instructions become NOPs (NOP NOP NOP). No data has been written yet: no problem, because no instruction is in the WB stage.
41
Stalls with jumps (1/3)
[Figure: pipelined datapath (IF | ID | EX | MEM | WB) with the new PC fed back from the EX/MEM register; a signal, active if jump, forces a NOP into IF/ID, ID/EX and EX/MEM.]
When the Branch Target Address is clocked into the PC, three unwanted instructions are already in IF/ID, ID/EX and EX/MEM. Three NOPs MUST replace the 3 unwanted instructions already started.
42
Stalls with jump (2/3)
[Figure: pipelined datapath (IF | ID | EX | MEM | WB) in which a signal, active if jump, forces a NOP into two pipeline registers.]
NOTE: in this case the jump condition detection and the new PC value are input to the MUX in the same clock interval. Two NOPs MUST replace the 2 unwanted instructions already started.
43
Stalls with jump (3/3)
[Figure: pipelined datapath (IF | ID | EX | MEM | WB) in which a signal, active if jump, turns one pipeline register into a NOP.]
NOTE: in this case the jump condition and the new PC act on the MUX in the same period in which the condition is detected. A NOP MUST replace the unwanted instruction already started.
Very slow clock solution !
44
Delayed branch
Similarly to the LOAD case, in several RISC CPUs the hazard of the BRANCH instructions is handled by software through the compiler (delayed branch):
BRANCH instruction
delay slot
delay slot
delay slot
Next instruction
The compiler tries to fill the delay slots with “useful” instructions (worst case: NOP).
45
Delayed branch/jump
Original:
Add R5, R4, R3
Sub R6, R5, R2
Or R14, R6, R21
Sne R1, R8, R9 ; branch condition
Br R1, +100
Compiled:
Sne R1, R8, R9 ; branch condition
Br R1, +100
Add R5, R4, R3
Sub R6, R5, R2
Or R14, R6, R21
Add, Sub and Or are executed in both cases. Obviously in this instruction group there must be no jumps!!!
Instead of one or more “postponed” instructions, the compiler inserts NOPs when no suitable instructions are available.
48
Handling the Control Hazards
Dynamic Prediction: Branch Target Buffer => no stall (almost..)
[Figure: Branch Target Buffer. Each entry holds TAGS, the Predicted PC and a T/NT (taken/not taken) field; the PC is compared (=) with the TAGS.]
- HIT: fetch with the predicted PC
- MISS: fetch with PC + 4
- Correct prediction: no stalls
- Wrong prediction: 1-3 stalls (correct fetch in ID or EX, see before)
N.B. Here the branch slot is selected during the IF clock cycle that loads IR1 in IF/ID
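The hit/miss rule of the Branch Target Buffer can be sketched as a lookup table indexed by the PC. This Python model is a simplification under stated assumptions: it is fully associative and unbounded (a real buffer has a fixed number of entries and compares only tag bits), and the class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Maps the address of a branch to (predicted PC, taken/not-taken bit)."""

    def __init__(self):
        self.entries = {}

    def next_fetch(self, pc: int) -> int:
        """HIT with T: fetch at the predicted PC. Otherwise: fetch at PC + 4."""
        entry = self.entries.get(pc)
        if entry is not None and entry[1]:
            return entry[0]
        return pc + 4

    def update(self, pc: int, target: int, taken: bool) -> None:
        """Record the actual outcome once the branch has been resolved."""
        self.entries[pc] = (target, taken)


btb = BranchTargetBuffer()
first = btb.next_fetch(0x40)            # miss: sequential fetch
btb.update(0x40, 0x10C, taken=True)     # the branch at 0x40 was taken to 0x10C
second = btb.next_fetch(0x40)           # hit: fetch with the predicted PC
```

The lookup happens during IF, before the instruction is even decoded, which is what removes the stall when the prediction is correct.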
49
Prediction Buffer: the simplest implementation uses a single bit that records what happened the last time the branch was executed. When one outcome is predominant, each occurrence of the opposite outcome causes two consecutive errors.
[Figure: two nested loops, Loop1 (outer) and Loop2 (inner).]
When the program leaves Loop2, the prediction fails (the branch is predicted as taken but is actually untaken); it then fails again when, on re-entering Loop2, it predicts untaken.
50
Usually two bits are used.
[Figure: state diagram of the two-bit predictor, with TAKEN and UNTAKEN transitions between its states.]
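The benefit of the second bit can be checked with a short simulation. The sketch below models the two-bit scheme as a saturating counter, which is one common variant and an assumption here, since the slide's exact state diagram is not recoverable from the text; the function name and the initial state (predict taken) are also illustrative choices.

```python
def mispredictions(outcomes, bits: int) -> int:
    """Count wrong predictions of an n-bit saturating-counter predictor.

    The counter starts at its maximum (predict taken); the upper half of
    its range predicts taken, the lower half predicts untaken.
    """
    top = (1 << bits) - 1
    threshold = (top + 1) // 2
    state = top
    misses = 0
    for taken in outcomes:
        if (state >= threshold) != taken:
            misses += 1
        # move one step towards the actual outcome, saturating at the ends
        state = min(top, state + 1) if taken else max(0, state - 1)
    return misses


# Inner-loop branch: taken three times, then untaken on exit; the loop runs 3 times
inner_loop = ([True] * 3 + [False]) * 3
one_bit = mispredictions(inner_loop, bits=1)
two_bit = mispredictions(inner_loop, bits=2)
```

On this trace the one-bit predictor is wrong 5 times (at each loop exit and again at each re-entry), while the two-bit counter is wrong only 3 times (at the exits): a single untaken outcome is not enough to flip its prediction.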