DLX computer Electronic Computers M 1 RISC architectures RISC vs - - PowerPoint PPT Presentation



slide-1
SLIDE 1

1

DLX computer

Electronic Computers M

slide-2
SLIDE 2

2

RISC architectures

  • RISC vs CISC (Reduced Instruction Set Computer vs Complex Instruction Set Computer)
  • In CISC architectures, 10% of the instructions are used in 90% of cases
  • Waste of silicon
  • Bottleneck: the bus
  • Mid ‘80s: a new architecture, RISC
  • Solution: reduction of instruction number and complexity (fewer, simpler machine instructions)
  • Fixed instruction format (simpler instruction decoders)
  • Simpler control logic network, allowing a larger number of on-chip registers
  • Reduction of bus/memory accesses
  • Increase of the machine instructions needed for a job, which is (in many cases) more than compensated (in terms of time) by the reduction of bus accesses
  • CISC and RISC are each the best solution in different application fields
  • Nowadays both architectures coexist in the same processor: analysis at the end of the course
  • A simplified RISC architecture: DLX (implemented as a real processor in the ‘80s as R4000)

slide-3
SLIDE 3

3

DLX (fixed) instruction format

R format: arithmetic or logic instructions (i.e. Ra ← Rb op Rc) or Set Condition between registers

  Bits    31..26    25..21   20..16   15..11   10..0
  Field   Op-code   Ra       Rb       Rc       Op-code extension
  Width   6 bit     5 bit    5 bit    5 bit    11 bit

I format: data transfer (Load, Store), conditional Branch, JR and JALR (control transfer via register), Set Condition and ALU with immediate operand. In Load and ALU instructions Ra = destination; in Store Ra = source. Rb supplies the register operand of the ALU in the immediate instructions.

  Bits    31..26    25..21   20..16   15..0
  Field   Op-code   Ra       Rb       Immediate operand or offset

J format: direct, unconditional control transfer (J and JAL)

  Bits    31..26    25..0
  Field   Op-code   26 bit (PC relative) offset
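The field boundaries of the three formats can be checked with a small decoder. This is a minimal sketch in Python: the function name `decode` and the dictionary keys are illustrative assumptions, and the op-code value used in the usage note is made up, since the slides do not list op-code encodings.

```python
def decode(word):
    """Split a 32-bit DLX instruction word into the fields of the R, I and J formats.

    All fields are extracted regardless of the actual format; the caller picks
    the ones that apply to the op-code at hand.
    """
    word &= 0xFFFFFFFF
    return {
        "opcode":   (word >> 26) & 0x3F,   # bits 31..26 (all formats)
        "ra":       (word >> 21) & 0x1F,   # bits 25..21 (R and I formats)
        "rb":       (word >> 16) & 0x1F,   # bits 20..16 (R and I formats)
        "rc":       (word >> 11) & 0x1F,   # bits 15..11 (R format)
        "ext":      word & 0x7FF,          # bits 10..0, op-code extension (R format)
        "imm16":    word & 0xFFFF,         # bits 15..0, immediate or offset (I format)
        "offset26": word & 0x3FFFFFF,      # bits 25..0, PC-relative offset (J format)
    }
```

For instance, a word built as `(1 << 26) | (3 << 21) | (1 << 16) | (4 << 11) | 0x20` (hypothetical op-code 1) decodes to Ra = 3, Rb = 1, Rc = 4 and extension 0x20.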

slide-4
SLIDE 4

4

DLX non floating-point instructions

(31 x 32-bit registers R31…R1; R0 = 0 fixed; Ra and Rb can be any of the 32 registers)

Data Transfer
  LW Ra, offset(Rb)
  LB Ra, offset(Rb)
  LBU Ra, offset(Rb)
  LHU Ra, offset(Rb)
  LH Ra, offset(Rb)
  SW Ra, offset(Rb)
  SH Ra, offset(Rb)
  SB Ra, offset(Rb)
  LHI Ra, value

Arithmetic/Logic
  ADD Ra,Rb,Rc     ADDI Ra,Rb,value
  ADDU Ra,Rb,Rc    ADDUI Ra,Rb,value
  SUB Ra,Rb,Rc     SUBI Ra,Rb,value
  SUBU Ra,Rb,Rc    SUBUI Ra,Rb,value
  DIV Ra,Rb,Rc     DIVI Ra,Rb,value
  MULU Ra,Rb,Rc    MULI Ra,Rb,value
  SLL Ra,Rb,Rc     SLLI Ra,Rb,value
  SHR Ra,Rb,Rc     SHRI Ra,Rb,value
  SLA Ra,Rb,Rc     SLAI Ra,Rb,value
  OR Ra,Rb,Rc      ORI Ra,Rb,value
  XOR Ra,Rb,Rc     XORI Ra,Rb,value
  AND Ra,Rb,Rc     ANDI Ra,Rb,value

Control
  SETx Ra,Rb,Rc
  SETIx Ra,Rb,value
  BEQZ Ra, offset    (offset added to [PC])
  BNEQZ Ra, offset   (offset added to [PC])
  J offset
  JR Ra
  JL offset          (offset added to [PC])
  JLR Ra

N.B.
  • Postfix x (set condition) can be LT, GT, LE, GE, EQ, NE
  • JL (via or not via register): Jump and Link, saving PC in R31
  • Offset is a value within the instruction
  • Postfix I means «immediate» (value within the instruction)
  • Postfix A means «arithmetic» (sign extension)
  • Postfix U means «unsigned»
  • Value is the immediate within the instruction
  • No STACK registers

slide-5
SLIDE 5

5

DLX ALU operations

Two input data, one output data plus flags.

S1, S2: ALU inputs (32 bit); OUT: ALU output (32 bit). The Controls inputs select the operation.

Operations:
  • S1 + S2
  • S1 – S2
  • S1 and S2
  • S1 or S2
  • S1 exor S2
  • Left Shift S1 by S2 positions
  • Right Shift S1 by S2 positions
  • Arithmetic Right Shift S1 by S2 positions
  • S1
  • S2
  • 1

Output flags: Zero, Negative sign

ALU is a combinatorial circuit !!!
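The operation list above can be mimicked by a purely functional model, which also reflects the fact that the ALU is combinatorial (no internal state). This is a hedged sketch: the operation names (`"add"`, `"sll"`, …) are invented for illustration, and using only the low 5 bits of S2 as the shift amount is an assumption not stated on the slide.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def alu(op, s1, s2):
    """Return (OUT, flags) for 32-bit inputs S1 and S2."""
    s1 &= MASK32
    s2 &= MASK32
    shift = s2 & 31  # assumption: only the low 5 bits of S2 select the shift amount
    if op == "add":
        out = s1 + s2
    elif op == "sub":
        out = s1 - s2
    elif op == "and":
        out = s1 & s2
    elif op == "or":
        out = s1 | s2
    elif op == "xor":
        out = s1 ^ s2
    elif op == "sll":              # left shift
        out = s1 << shift
    elif op == "srl":              # logical right shift
        out = s1 >> shift
    elif op == "sra":              # arithmetic right shift (sign bit replicated)
        out = to_signed(s1) >> shift
    elif op == "pass_s1":
        out = s1
    elif op == "pass_s2":
        out = s2
    elif op == "one":
        out = 1
    else:
        raise ValueError("unknown operation: " + op)
    out &= MASK32
    flags = {"zero": out == 0, "negative": bool(out & 0x80000000)}
    return out, flags
```

As a usage example, `alu("sub", 0, 1)` yields `0xFFFFFFFF` with the Negative flag set, and `alu("sra", 0x80000000, 4)` yields `0xF8000000`, whereas the logical shift `"srl"` on the same inputs yields `0x08000000`.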

slide-6
SLIDE 6

Abstract instruction execution (sequential DLX)

PC is the Program Counter, A and B are two internal scratchpad registers, and REGinstr is the register where the newly fetched instruction is stored. All these registers are unknown to the programmer.

This is a synchronous state diagram:

  • INSTRUCTION FETCH (Ready ?): [REGinstr] <= M[PC]
  • INSTRUCTION DECODE: [PC] <= [PC] + 4; [A] <= [Ra]; [B] <= [Rb]; [C] <= [Rc]; [X] <= num[Ra] ([X] is the number of the destination register)
  • INSTRUCTION EXECUTION: Data transfer, ALU, Set, Jump or Branch

slide-7
SLIDE 7

Example: LB (LOAD BYTE, format I)

LB Ra, offset(Rb)

  Bits    31..26    25..21   20..16   15..0
  Field   Op-code   Ra       Rb       offset

(bit 31 is the MSbit, bit 0 the LSbit)

State sequence:
  • Fetch: INSTR <= M[PC]
  • Decode: [PC] <= [PC] + 4; [A] <= [Ra]; [B] <= [Rb]; [C] <= [Rc]; [X] <= num[Ra]
  • Byte address compute: Addr <= [B] + (Instr[15])^16 ## Instr[15..0]
  • LOAD byte (byte in register): [Ra] <= (M[Addr][7])^24 ## M[Addr][7..0]
  • Next Instruction

## is the JOIN operator; (b)^n means bit b repeated n times.

Instr[15..0] is the instruction offset. The address is always 32 bit.

Sign extension of the offset: instruction bit 15 (sign) is left-extended 16 times.

Sign extension of the loaded byte !! Example: M[Addr][7..0] = A7H => (10100111)b; the sign-extended value is FFFFFFA7H.
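The two sign extensions of LB (16-bit offset to 32 bits, loaded byte to 32 bits) can be sketched as follows. The helper names `sign_extend` and `load_byte`, and the use of a Python dictionary as byte-addressed memory, are assumptions made only for this illustration.

```python
def sign_extend(value, bits):
    """Extend the two's complement number held in the low `bits` bits to 32 bits."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return ((value ^ sign) - sign) & 0xFFFFFFFF

def load_byte(memory, b, instr):
    """Model LB: `memory` maps byte addresses to byte values,
    `b` is the content of register B, `instr` the 32-bit instruction word."""
    # Addr <= [B] + sign-extended Instr[15..0]
    addr = (b + sign_extend(instr & 0xFFFF, 16)) & 0xFFFFFFFF
    # [Ra] <= sign-extended M[Addr][7..0]
    return sign_extend(memory[addr], 8)
```

With the slide's example byte, `load_byte({0x1004: 0xA7}, 0x1000, 0x0004)` returns `0xFFFFFFA7`; a byte such as `0x27`, whose bit 7 is 0, is returned unchanged.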

slide-8
SLIDE 8

8

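Sign extension - example with IR

(IR[15])^16 ## IR[15..0]

[Figure: IR bits 15..0 drive output bits 15..0, while IR bit 15 drives output bits 31..16. The outputs pass through tri-state devices enabled from the Control Unit.]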

slide-9
SLIDE 9

Data transfer instructions (I format): examples

  LW Ra, offset(Rb)
  LB Ra, offset(Rb)
  LBU Ra, offset(Rb)    unsigned
  LHU Ra, offset(Rb)    unsigned
  SW Ra, offset(Rb)

Address computation, common to all of them:

  Addr <= [B] + (Instr[15])^16 ## Instr[15..0]

Transfers:

  LW                           [Ra] <= M[Addr]
  LB (byte, signed)            [Ra] <= (M[Addr][7])^24 ## M[Addr][7..0]
  LBU (byte, unsigned)         [Ra] <= (0)^24 ## M[Addr][7..0]
  LH (half word, signed)       [Ra] <= (M[Addr][15])^16 ## M[Addr][15..0]
  LHU (half word, unsigned)    [Ra] <= (0)^16 ## M[Addr][15..0]
  SW                           M[Addr] <= [A]

slide-10
SLIDE 10

ALU instructions: examples (R and I formats)

  ADD Ra,Rb,Rc      ADDI Ra,Rb,value
  ADDU Ra,Rb,Rc     ADDUI Ra,Rb,value
  ………………………

The second operand is first copied into T (T is a hidden register, unknown to the programmer, storing temporary data):

  Register (format R):     [T] <= [Rc]
  Immediate (format I):    [T] <= (Instr[15])^16 ## Instr[15..0]

Then the operation is executed:

  ADD    [Ra] <= [Rb] + [T]
  SUB    [Ra] <= [Rb] - [T]
  AND    [Ra] <= [Rb] and [T]
  OR     [Ra] <= [Rb] or [T]
  XOR    [Ra] <= [Rb] xor [T]

Register content is treated as signed in arithmetic operations. The same scheme holds for the shifts etc. A and B are generic registers (Ra, Rb).

slide-11
SLIDE 11

11

SET instructions (see branch)

  • Example: SLT Ra,Rb,Rc sets Ra = 1 if Rb is less than Rc, otherwise Ra = 0

The second operand is first copied into T (T is a hidden register, unknown to the programmer, storing temporary data):

  Register (format R):     [T] <= [Rc]
  Immediate (format I):    [T] <= (Instr[15])^16 ## Instr[15..0]

Then the condition is evaluated (register content treated as signed):

  SEQ    [Ra] = 1 if [Rb] = [T]
  SNE    [Ra] = 1 if [Rb] != [T]
  SLT    [Ra] = 1 if [Rb] < [T]
  SLE    [Ra] = 1 if [Rb] <= [T]
  SGT    [Ra] = 1 if [Rb] > [T]
  SGE    [Ra] = 1 if [Rb] >= [T]

slide-12
SLIDE 12

12

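JUMP instructions

  J offset      (jump address)                 format J
  JR Ra         (jump register)                format I
  JL offset     (jump and link address)        format J
  JLR Ra        (jump and link register)       format I

State sequences (JMP, JR, JAL, JALR in the diagram):

  J (JMP)    [PC] <= [PC] + (Instr[25])^6 ## Instr[25..0]
  JR         [PC] <= [Ra]
  JAL        [T] <= [PC]; [R31] <= [T]; [PC] <= [PC] + (Instr[25])^6 ## Instr[25..0]
  JALR       [T] <= [PC]; [R31] <= [T]; [PC] <= [Ra]

The steps [T] <= [PC] and [R31] <= [T] are for saving [PC] in R31.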

slide-13
SLIDE 13

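Branch instructions (format I)

State diagram, starting from INIT:

  • BEQZ: is [Ra] = 0 ? NO: branch not taken. YES: BRANCH.
  • BNEZ: is [Ra] != 0 (e.g. [Ra] = 1 after a Set Condition) ? NO: branch not taken. YES: BRANCH.
  • BRANCH: [PC] <= [PC] + (Instr[15])^16 ## Instr[15..0]

  • Example: BNEQZ R5, 100 jumps to PC+100 if R5 is not equal to 0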
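The next-PC selection for branches and jumps can be summarised in one function. This is only a sketch under stated assumptions: the function name `next_pc` and its parameters are hypothetical, `pc` is taken to be the value already incremented by 4 in the decode state, and the mnemonics BEQZ/BNEZ/J/JAL/JR/JALR are passed as strings instead of being decoded from an op-code.

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value, bits):
    """Extend the two's complement number held in the low `bits` bits to 32 bits."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return ((value ^ sign) - sign) & MASK32

def next_pc(pc, mnemonic, instr, ra_value=0):
    """Return the PC after executing a control-transfer instruction.

    pc       -- PC already incremented by 4 (assumption, as in the decode state)
    instr    -- 32-bit instruction word holding the offset
    ra_value -- content of register Ra (tested by branches, target of JR/JALR)
    """
    if mnemonic in ("J", "JAL"):
        return (pc + sign_extend(instr & 0x3FFFFFF, 26)) & MASK32
    if mnemonic in ("JR", "JALR"):
        return ra_value & MASK32
    if mnemonic == "BEQZ":
        taken = ra_value == 0
    elif mnemonic == "BNEZ":
        taken = ra_value != 0
    else:
        raise ValueError("not a control-transfer instruction: " + mnemonic)
    if taken:
        return (pc + sign_extend(instr & 0xFFFF, 16)) & MASK32
    return pc
```

For example, with `pc = 0x104`, `next_pc(0x104, "BNEZ", 100, ra_value=5)` gives `0x168` (branch taken), while `next_pc(0x104, "BEQZ", 100, ra_value=5)` leaves the PC at `0x104`.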

slide-14
SLIDE 14

14

The Pipelining Principle

Pipelining is the main technique used for “speeding up” a CPU. The key idea of pipelining is general, and is currently applied in several industrial fields (production lines, oil pipelines, …).

A system S must operate N times on a task Ai, producing result Ri:

A1, A2, A3 … AN → S → R1, R2, R3 … RN

Latency: time elapsed between the beginning and the end of task A (TA).

Throughput: frequency of task completion.

slide-15
SLIDE 15

15

The Pipelining Principle

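1) Sequential System: a new instruction starts when the previous instruction is finished.

[Figure: A1, A2, A3, … An executed one after the other along the time axis t; TA marks the duration of A1.]

An: n-th instruction. Latency (execution time of a single instruction) = TAn. Different instructions can have different execution times.

2) Pipelined System: instructions are subdivided into stages, each stage lasting one n-th (1/4 in this example) of the entire instruction time. The stages of successive instructions overlap.

[Figure: system S split into stages S1, S2, S3, S4 (Si: pipeline stage); task A split into phases P1, P2, P3, P4 along the time axis t.]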

slide-16
SLIDE 16

16

The Pipelining Principle

[Figure: A1, A2, A3, A4, … An each go through P1, P2, P3, P4 along the time axis t; each instruction starts one pipeline cycle TP after the previous one. A1 terminates at tx, A2 at ty.]

TP: pipeline cycle (ideally one clock). At each cycle one instruction terminates: in the figure A1 terminates at tx, at the next cycle A2 terminates at ty, etc.
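The gain from overlapping stages can be quantified with a short calculation. The sketch below assumes an ideal pipeline (every stage lasts exactly one cycle TP, no stalls) and a sequential system in which each instruction takes all its stages before the next one starts; the function names are illustrative.

```python
def sequential_time(n_instructions, n_stages, tp):
    """Total time when each instruction finishes before the next one starts."""
    return n_instructions * n_stages * tp

def pipelined_time(n_instructions, n_stages, tp):
    """Total time in an ideal pipeline: the first instruction needs n_stages cycles,
    then one instruction terminates at every further cycle."""
    return (n_stages + n_instructions - 1) * tp

def speedup(n_instructions, n_stages):
    """Ratio between sequential and pipelined time (independent of tp)."""
    return sequential_time(n_instructions, n_stages, 1) / pipelined_time(n_instructions, n_stages, 1)
```

With the 4-stage example of the figure, 4 instructions take 16 cycles sequentially and 7 cycles pipelined; for long instruction sequences the speedup approaches the number of stages.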

slide-17
SLIDE 17

Typical instruction stages

  • IF: instruction fetch (from memory)
  • ID: instruction decode
  • EX: instruction execution (ALU)
  • MEM: data memory access (if needed; not needed by register instructions)
  • WB: write-back (if needed; not needed by jumps)

N.B. The execution time (latency) of all instructions must be the same, in order to maintain the order of the results. Some stages are not used by some instructions (the stage is a NOP for them), e.g. the MEM stage for register operations.

slide-18
SLIDE 18

18

Pipelining of a CPU (DLX)

Instruction sequence: I1, I2, I3 … IN

[Figure: instruction j goes through IF, ID, EX, MEM, WB along the time axis t. In the CPU (datapath) the stages IF, ID, EX, MEM, WB are combinatorial circuits separated by the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB (D flip-flops).]

ClockPerInstruction (CPI) = 1 (ideally !)

Pipeline cycle = clock cycle, determined by the delay of the slowest stage.

slide-19
SLIDE 19

19

DLX Pipeline

  Clock cycle   1    2    3    4    5    6    7    8    9
  Instr i       IF   ID   EX   MEM  WB
  Instr i+1          IF   ID   EX   MEM  WB
  Instr i+2               IF   ID   EX   MEM  WB
  Instr i+3                    IF   ID   EX   MEM  WB
  Instr i+4                         IF   ID   EX   MEM  WB

CPI (ideally) = 1

Overhead introduced by the pipeline registers on the clock cycle:

Tclk = Td + TP + Tsu

  • Td: switch delay of the input stage register
  • TP: delay of the slowest combinatorial stage
  • Tsu: set-up time of the output stage register
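The clock-cycle formula Tclk = Td + TP + Tsu can be evaluated for a set of stage delays as in the sketch below. The function name and all numeric delays in the usage note are made-up values for illustration, not figures from the course.

```python
def clock_period(td, stage_delays, tsu):
    """Tclk = Td + TP + Tsu, where TP is the delay of the slowest combinatorial stage.

    td           -- switch delay of the input stage register
    stage_delays -- combinatorial delays of the pipeline stages (IF, ID, EX, MEM, WB)
    tsu          -- set-up time of the output stage register
    """
    tp = max(stage_delays)
    return td + tp + tsu
```

With hypothetical delays in picoseconds, `clock_period(50, [400, 350, 500, 450, 300], 30)` gives 580: the slowest stage (500) sets TP, and the registers add 80 of overhead.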

slide-20
SLIDE 20

[Figure: two D registers with a Combinatorial Circuit between them. Labels: switch delay of the input stage register; Tp, delay of the slowest combinatorial stage; set-up time of the output stage register.]

slide-21
SLIDE 21

21

Pipeline implementation requirements

  • Each stage is active at each clock cycle.
  • The PC is incremented in the IF stage.
  • An ADDER should be introduced in the IF stage (PC <= PC+4: one instruction is 4 bytes). But instructions are aligned (each one is located at an address that is a multiple of the instruction length in bytes), so bits 1 and 0 of the 32-bit PC are always 0. Therefore only a 30-bit register is used (a programmable counter, for jumps), incremented by 1 each clock cycle.
  • Two Memory Data Registers are required (referred to as LMDR and SMDR). In fact, when a LOAD is immediately followed by a STORE, the WB and MEM stages overlap: two data are therefore waiting to be written (one to the memory, the other to a register of the RF).
  • Each clock cycle, up to 2 memory accesses may have to be executed (IF, MEM): Instruction Memory (IM) and Data Memory (DM), i.e. “Harvard” architecture.
  • The CPU clock is determined by the slowest stage.
  • Pipeline registers store both data and control information (“distributed” control unit).
slide-22
SLIDE 22

DLX Pipelined Datapath

[Figure: DLX pipelined datapath. The five stages IF, ID, EX, MEM, WB are separated by the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Blocks shown: PC, ADD 4, INSTR MEM, DEC, RF (ports Ra, Rb, Rc, Num [Ra], DR, D), SE (sign extension), ALU, the =0? tests, DATA MEM and several MUXes.]

Figure annotations:
  • PC: actually a programmable counter
  • For computing the new PC value when branch
  • For operations with immediates
  • MUX active if jump
  • For Set Condition (also <0 and >0) [it acts on the output]
  • =0? for Branch
  • JL and JLR (PC in R31)
  • Sign extension
  • Data (from register, memory, or PC for link)
  • Destination register number (1-31)
  • Number of the destination register in case of LOAD and ALU instructions

slide-23
SLIDE 23

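ID stage (N.B. stage layout different from previous slide!)

[Figure: ID stage between the IF/ID and ID/EX pipeline registers. Blocks shown: IR, PC, RF, DEC, SE.]

Figure annotations:
  • RF read ports: Ra = IR25-21, Rb = IR20-16, Rc = IR15-11; outputs A, B, C and Num Ra
  • RF write port: number of the destination register (from WB stage, 5 bit) on DR; data (from WB stage, 32 bit) on D
  • DEC inputs: IR31-26 (Opcode) and IR10-00 (R instr.)
  • SE (sign extension): IR15 is replicated on bits 31-16 for Immediate/Branch (16 bit to 32 bit); IR25 is replicated on bits 31-26 for Jump (26 bit, J and JL, to 32 bit)
  • IR15-0: Offset/Immediate (bits 11-15 as destination register in R instructions)
  • IR25-16: Jump; Jump and Link
  • PC31-0: JL and JLR
  • Info travelling with the instruction
  • Examples shown: LB, SW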

slide-24
SLIDE 24

DLX Pipelined Datapath

[Figure: DLX pipelined datapath with the contents of the pipeline registers. Stages IF, ID, EX, MEM, WB; pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Blocks shown: PC, ADD 4, IM, DEC, RF (ports Ra, Rb, Rc, Num [Ra], DR, D), SE, ALU, the =0? tests, DM and several MUXes.]

Registers travelling through the pipeline:
  • IR1, IR2, IR3, IR4. IRi => Instruction Register i
  • PC1, PC2, PC3, PC4
  • COND
  • Z
  • X: computed data or memory address or branch address
  • Y: computed data from the previous stage
  • SMDR => Store Memory Data Register
  • LMDR => Load Memory Data Register

Figure annotations:
  • Address and Data (towards DM)
  • Destination register number
  • For Set Condition (also <0 and >0) [it acts on the output]
  • =0? for Branch
  • JL, JLR (PC saved in R31)

slide-25
SLIDE 25

25

Pipelined execution of an “ALU” instruction

The result of each stage is sampled at the end of its cycle.

  • IF: IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  • ID: A <= Ra; B <= Rb; C <= Rc; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode; [X] <= num[Ra]
  • EX: Z <= A op B, or Z <= A op [(IR2[15])^16 ## IR2[15..0]]; [PC3 <= PC2]; [IR3 <= IR2]
  • MEM: Y <= Z (temporary storage for WB); [PC4 <= PC3]; [IR4 <= IR3]
  • WB: Ra <= Y

The decoded opcode travels through all stages.

NOTE: the IRi bits are dropped stage by stage when no longer needed by any instruction. Why do the PCi values travel along? JAL, JALR!!

slide-26
SLIDE 26

26

Pipelined execution of a “MEM” instruction

  • IF: IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  • ID: A <= Ra; B <= Rb; C <= Rc; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode; [X] <= num[Ra]
  • EX: MAR <= B op (IR2[15])^16 ## IR2[15..0]; SMDR <= A; [PC3 <= PC2]; [IR3 <= IR2]
  • MEM: LMDR <= M[MAR] (if LOAD), or M[MAR] <= SMDR (if STORE); [PC4 <= PC3]; [IR4 <= IR3]
  • WB: Ra <= LMDR (if LOAD) [sign extension]

The decoded opcode travels through all stages.

slide-27
SLIDE 27

27

Pipelined execution of a “BRANCH” instruction (normally after a SCn instruction, see later)

X: “BTA (BRANCH TARGET ADDRESS)”

  • IF: IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  • ID: A <= Ra; B <= Rb; C <= Rc; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode; [X] <= num[Ra]
  • EX: Z <= PC2 op (IR2[15])^16 ## IR2[15..0] (computed new PC address); Cond <= A op 0 (branch on Reg A value, 0/1); [PC3 <= PC2]; [IR3 <= IR2]
  • MEM: if (Cond) PC <= Z; [PC4 <= PC3]; [IR4 <= IR3]
  • WB: (NOP)

The decoded opcode travels through all stages.

The new value is in the PC at the end of the MEM cycle. When the branch is taken, 3 new unwanted instructions have already started.

slide-28
SLIDE 28

28

Pipelined execution of a “JR” instruction

  • IF: IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  • ID: A <= Ra; B <= Rb; C <= Rc; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode; [X] <= num[Ra]
  • EX: Z <= A (new PC address); [PC3 <= PC2]; [IR3 <= IR2]
  • MEM: PC <= Z; [PC4 <= PC3]; [IR4 <= IR3]
  • WB: (NOP)

The decoded opcode travels through all stages.

The new value is in the PC during the MEM interval. When the jump is executed, 3 new unwanted instructions have already started.

Which would be the stage sequence for a J instruction?

slide-29
SLIDE 29

29

Pipelined execution of a “JL or JLR” instruction

  • IF: IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  • ID: A <= Ra; B <= Rb; C <= Rc; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode; [X] <= num[Ra]
  • EX: PC3 <= PC2; Z <= A (if JLR), or Z <= PC2 + (IR2[25])^6 ## IR2[25..0] (if JL); [IR3 <= IR2]
  • MEM: PC <= Z; PC4 <= PC3; [IR4 <= IR3]
  • WB: R31 <= PC4

The decoded opcode travels through all stages. In this case the PCi values are used.

The new value is in the PC during the MEM interval. When the jump is executed, 3 new unwanted instructions have already started.

NOTE: the write on R31 CANNOT be performed on the fly, since it could overlap with another register write.

slide-30
SLIDE 30

30

Which would be the sequence in case of SCn (e.g. SLT R1,R2,R3) ?

  • IF: IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  • ID: A <= Ra; B <= Rb; C <= Rc; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode; [X] <= num[Ra]
  • EX: ?
  • MEM: ?
  • WB: ?

slide-31
SLIDE 31

31

Pipeline Hazards

A “Hazard” occurs when an instruction currently in a pipeline stage cannot be executed in the current clock cycle.

  • Structural Hazards: the same resource is used by two different pipeline stages, so the instructions currently in those stages cannot be executed simultaneously.
  • Data Hazards: they are due to instruction dependencies. For example, an instruction needs to read an RF register not yet written by a previous instruction (Read After Write).
  • Control Hazards: the instructions following a branch depend on the branch result (taken/not taken).

The instruction that cannot be executed must be stalled (“pipeline stall” or “pipeline bubbling”), together with all the following instructions, while the previous instructions must proceed normally (so as to eliminate the hazard).

slide-32
SLIDE 32

Hazards and stalls

The consequence of a data hazard: if instruction Ii needs the result of instruction Ii-1 (data are read in the ID stage), it must wait until after the WB of Ii-1.

[Figure: pipeline diagram over Clk 1 to Clk 12 for instructions Ii-3, Ii-2, Ii-1, Ii, Ii+1. Ii and Ii+1 show three stall cycles (S) before proceeding.]

Instruction stalls:

Ti = 8 * CLK = (5 + 3) * CLK
Ti = 5 * (1 + 3/5) * CLK

Stall: the clock signal for Ii, Ii+1 … etc. is blocked for three periods. Normally the three stalled instructions are transformed into NOPs to avoid blocking the clock.

slide-33
SLIDE 33

33

Forwarding

Forwarding allows eliminating almost all RAW hazards of the pipeline without stalling the pipeline. (NOTE: in DLX, registers are modified only in the WB stage.)

  Clock cycle               1    2    3    4    5    6    7    8    9
  ADD R3, R1, R4            IF   ID   EX   MEM  WB
  SUB R7, R3, R5 (hazard)        IF   ID   EX   MEM  WB
  OR R1, R3, R5 (hazard)              IF   ID   EX   MEM  WB
  LW R6, 100 (R3) (hazard)                 IF   ID   EX   MEM  WB
  AND R9, R5, R3 (no hazard)                    IF   ID   EX   MEM  WB

Data are read from registers in the ID stage. For LW too the requested data is not yet in the RF, since it is written on the positive clock edge at the end of WB (the register value is read in ID!).
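The hazard classification of the five-instruction example can be reproduced with a small model of a pipeline without forwarding. The sketch rests on assumptions taken from the slides: a register written by an instruction becomes readable only in the cycle after its WB, and operands are read in ID. The function name `analyse` and the tuple layout `(text, dest, sources)` are illustrative choices.

```python
def analyse(program):
    """For each instruction return (text, raw_hazard, stall_cycles) without forwarding.

    program -- list of (text, dest, sources); registers are numbers, R0 is never a hazard.
    A reader must perform ID at least 4 cycles after the ID of the writer
    (writer: ID, EX, MEM, WB; reader's ID comes after the writer's WB).
    """
    last_writer = {}   # register number -> index of the most recent writer
    id_cycle = []      # clock cycle in which each instruction performs ID
    report = []
    for i, (text, dest, sources) in enumerate(program):
        earliest = id_cycle[-1] + 1 if id_cycle else 2   # in-order issue
        ready = earliest
        hazard = False
        for reg in sources:
            if reg != 0 and reg in last_writer:
                w = last_writer[reg]
                if i - w <= 3:              # writer is 1, 2 or 3 instructions ahead
                    hazard = True
                ready = max(ready, id_cycle[w] + 4)
        id_cycle.append(ready)
        report.append((text, hazard, ready - earliest))
        if dest:
            last_writer[dest] = i
    return report

EXAMPLE = [
    ("ADD R3, R1, R4", 3, (1, 4)),
    ("SUB R7, R3, R5", 7, (3, 5)),
    ("OR R1, R3, R5", 1, (3, 5)),
    ("LW R6, 100 (R3)", 6, (3,)),
    ("AND R9, R5, R3", 9, (5, 3)),
]
```

Running `analyse(EXAMPLE)` flags SUB, OR and LW as hazards and AND as hazard-free, as in the table. Without forwarding only SUB actually stalls (3 cycles), because its stall already delays OR and LW past the WB of ADD.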

slide-34
SLIDE 34

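Forward implementation

[Figure: forwarding datapath. The FU block receives A, B, C and the OpCode from ID/EX, Rd1 (/OpCode) and Rd2/OpCode from the following pipeline registers (IR3, IR4), and drives the outputs FD1, FD2, FD3, which control the MUXes at the ALU inputs (sources: A, B, Offset, PC, ALU result from EX/MEM, Mem/ALU data from MEM/WB) and a Bypass MUX at the RF.]

Figure annotations:
  • The FU is combinatorial!! It performs the comparison between A, B, C and Rd1, Rd2 and the opcodes.
  • Rd1, Rd2: destination registers (1-31). A, B, C: source registers (1-31).
  • Bypass MUX at the RF: often performed inside the RF. It allows “the anticipation” of the register on ID/EX. MUX control: IF opcode and comparison of RD with the Ra, Rb and Rc numbers.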

slide-35
SLIDE 35

Forward Unit implementation (flowchart)

Overall question: will the instruction in the MEM or WB stage write a register whose number is identical to the Ra, Rb or Rc number?

Decision questions (the Yes/No outcomes select FD1, FD2, FD3 or no forwarding):
  • Does the instruction in the MEM stage want to write a register? Is the destination register number identical to the Ra, Rb or Rc number?
  • Does the instruction in the WB stage want to write a register? Is the destination register number identical to the Ra, Rb or Rc number and different from the register which will be written by the MEM stage?
  • Does the instruction in the WB stage want to write a register? Is the destination register number identical to the Ra, Rb or Rc number?
  • Does the fetched instruction need the register in the MEM stage?
  • Does the fetched instruction need the register being written by the WB stage?
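The priority expressed by these questions (the MEM-stage result wins over the WB-stage result when both write the same register) can be sketched as a selection function. This is not the FD1/FD2/FD3 encoding of the slides, whose exact mapping is not recoverable here: the function name, the return labels "MEM", "WB", "RF" and the use of `None` for "no register written" are all assumptions.

```python
def forward_select(src, mem_dest, wb_dest):
    """Choose where a source operand should come from.

    src      -- number of the source register needed (Ra, Rb or Rc)
    mem_dest -- destination register of the instruction in the MEM stage, or None
    wb_dest  -- destination register of the instruction in the WB stage, or None
    Returns "MEM" (forward from the MEM stage), "WB" (forward from the WB stage)
    or "RF" (no forwarding: use the value read from the register file).
    """
    if src == 0:
        return "RF"        # R0 is fixed to 0 and is never a forwarding target
    if mem_dest == src:
        return "MEM"       # most recent result has priority
    if wb_dest == src:
        return "WB"        # used only if the MEM stage does not write the same register
    return "RF"
```

For example, `forward_select(3, 3, 3)` returns `"MEM"`, while `forward_select(3, None, 3)` returns `"WB"`.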

slide-36
SLIDE 36

36

Data hazard due to LOAD instructions

NOTE: the data required by the ADD is available only at the end of the MEM stage. This hazard cannot be eliminated by forwarding (unless there is an additional input in the MUXs between memory and ALU: delays!). The pipeline needs to be stalled.

Stage sequences with the stall (S):

  LW R1,32(R6)     IF ID EX MEM WB
  ADD R4,R1,R7     IF ID S EX MEM
  SUB R5,R1,R8     IF ID EX
  AND R6,R1,R7     IF ID

Actual implementation: the ADD is transformed into a NOP and re-fetched (PC <= PC - 4):

  LW R1,32(R6)               IF ID EX MEM WB
  NOP                        IF ID EX MEM WB
  ADD R4,R1,R7 (re-fetch)    IF ID EX MEM

From the end of the MEM stage of the LW onwards: standard forwarding.

This slide must be viewed using its .PSM version

slide-37
SLIDE 37

37

Delayed load

In many RISC CPUs, the special hazard associated with the LOAD instruction (which would in any case lead to a stall) is not handled by stalling the pipeline but by software, through the compiler (delayed load). In this example R3 is needed by the ADD instruction while it is still being read from memory [instruction LW R3, 10(R4)]. Please notice that a hardware forwarding network is required in any case.

Instruction order: LOAD instruction, delay slot, next instruction.

The compiler tries to fill the delay slot with a “useful” instruction (worst case: NOP).

Example sequence (with forward hardware):

  LW R1,32(R6)
  LW R3,10 (R4)
  ADD R5,R1,R3
  LW R6, 20 (R7)
  LW R8, 40(R9)

slide-38
SLIDE 38

38

Control Hazards

  Address      Instruction
  PC           BEQZ R4, 200
  PC+4         SUB R7, R3, R5
  PC+8         OR R1, R3, R5
  PC+12        LW R6, 100 (R8)
  PC+4+200     AND R9, R5, R3    (BTA)

Next instruction address:
  • R4 = 0: Branch Target Address (taken)
  • R4 ≠ 0: PC+4 (not taken)

[Figure: pipeline diagram over Clk 1 to Clk 8 for BEQZ and the following SUB, OR, LW. Annotations: "New computed PC value (Aluout)"; "New value in PC (one clock after: new value must be clocked onto the PC)"; "Fetch with the new PC".]

slide-39
SLIDE 39

DLX Branch or JMP

Example: BEQZ R4, 200

[Figure: pipelined datapath (Instruction Fetch, ID, EX, MEM, WB; pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; blocks PC, ADD 4, IM, DEC, RF, SE, ALU, =0?, DM, MUXes, register Z) showing the JMP instruction advancing through the stages while instructions I+1, I+2, I+3 enter the pipeline behind it.]

When the new PC acts on the IM, three instructions have already travelled through the first three stages (EX included).

NOTE: if the feedback signal of the new PC were taken directly from the ALU output instead of Z, only two stalls would be required, at the price of a slower clock!

Detailed datapath slide: see DLX Pipelined Datapath. Here we assume that the JMP instruction is the I-th instruction.

slide-40
SLIDE 40

40

Handling the Control Hazards

  • Always Stall (three-clock block being propagated)

  Clock cycle     1    2    3    4    5
  BEQZ R4,200     IF   ID   EX   MEM  WB
  Next fetch           S    S    S    IF (fetch at new PC)

Real situation: repeated IF, with PC <= PC - 4. At the IF following the BEQZ, the previous instruction (BEQZ) has not yet been decoded.

  • Predict Not Taken

  Clock cycle        1    2    3    4    5    6    7    8
  BEQZ R4, 200       IF   ID   EX   MEM  WB
  SUB R7, R3, R5          IF   ID   EX   MEM  WB
  OR R1, R3, R5                IF   ID   EX   MEM  WB
  LW R6, 100 (R8)                   IF   ID   EX   MEM  WB

Annotations:
  • At the end of EX of the BEQZ the new value of PC has been computed; at the end of its MEM stage (branch completion) the new value is sampled by the PC.
  • If the branch is taken: flush. The three following instructions become NOPs (NOP NOP NOP). No data has been written yet: no problem, because none of them has reached the WB stage.
slide-41
SLIDE 41

Stalls with jumps (1/3)

[Figure: pipelined datapath (IF, ID, EX, MEM, WB; IF/ID, ID/EX, EX/MEM, MEM/WB; PC, ADD 4, IM, DEC, RF, SE, ALU, =0?, DM, MUXes). The PC MUX is active if jump; the jump forces NOPs into IF/ID, ID/EX and EX/MEM.]

When the Branch Target Address is clocked into the PC, three unwanted instructions are already in IF/ID, ID/EX and EX/MEM. Three NOPs MUST replace the 3 unwanted instructions already started.

slide-42
SLIDE 42

Stalls with jump (2/3)

[Figure: pipelined datapath (IF, ID, EX, MEM, WB; IF/ID, ID/EX, EX/MEM, MEM/WB; PC, ADD 4, IM, DEC, RF, SE, ALU, =0?, DM, MUXes). The PC MUX is active if jump; the jump forces NOPs into two pipeline registers.]

NOTE: in this case the jump condition detection and the new PC value are input to the MUX in the same clock interval. Two NOPs MUST replace the 2 unwanted instructions already started.

slide-43
SLIDE 43

Stalls with jump (3/3)

[Figure: pipelined datapath (IF, ID, EX, MEM, WB; IF/ID, ID/EX, EX/MEM, MEM/WB; PC, ADD 4, IM, DEC, RF, SE, ALU, =0?, DM, MUXes). The PC MUX is active if jump; one pipeline register becomes NOP if jump.]

NOTE: in this case the jump condition and the new PC act on the MUX in the same period in which the condition is detected. A NOP MUST replace the unwanted instruction already started.

Very slow clock solution !

slide-44
SLIDE 44

44

Delayed branch

Similarly to the LOAD case, in several RISC CPUs the BRANCH instruction hazard is handled by software, through the compiler (delayed branch).

Instruction order: BRANCH instruction, delay slot, delay slot, delay slot, next instruction.

The compiler tries to fill the delay slots with “useful” instructions (worst case: NOP).

slide-45
SLIDE 45

45

Delayed branch/jump

Original:

  Add R5, R4, R3
  Sub R6, R5, R2
  Or R14, R6, R21
  Sne R1, R8, R9     ; branch condition
  Br R1, +100

Compiled:

  Sne R1, R8, R9     ; branch condition
  Br R1, +100
  Add R5, R4, R3
  Sub R6, R5, R2
  Or R14, R6, R21

The Add, Sub and Or instructions are executed in both cases. Obviously there must be no jumps in this instruction group!!!

When no suitable instructions are available, the compiler inserts NOPs instead of one or more “postponed” instructions.

slide-46
SLIDE 46

Handling the Control Hazards

Dynamic Prediction: Branch Target Buffer => no stall (almost..)

[Figure: Branch Target Buffer. The PC is compared (=) with the TAGS; each entry holds a Predicted PC and a T/NT bit (T/NT: taken/not taken).]

  • HIT: fetch with the predicted PC
  • MISS: fetch with PC + 4

  • Correct prediction: no stalls
  • Wrong prediction: 1-3 stalls (correct fetch in ID or EX, see before)

N.B. Here the branch slot is selected during the IF clock cycle that loads IR1 in IF/ID.

slide-47
SLIDE 47

Prediction Buffer: the simplest implementation uses a single bit that records what happened the last time the branch was executed. When one outcome predominates, each occurrence of the opposite outcome causes two consecutive mispredictions.

[Figure: two loops, Loop1 and Loop2.]

When the program leaves Loop2, the prediction fails (the branch is predicted as taken but is actually untaken); it then fails again when Loop2 is entered once more, because the branch is now predicted as untaken.

slide-48
SLIDE 48

Usually two bits.

[Figure: four-state prediction diagram with TAKEN / UNTAKEN transitions between the states.]
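The difference between one and two prediction bits can be illustrated on a loop-closing branch. The sketch below models the predictor as a saturating counter, which is one common two-bit scheme and an assumption here, since the exact transitions of the state diagram are not recoverable from the slide; the function name, the initial state and the outcome pattern are also illustrative.

```python
def mispredictions(outcomes, bits):
    """Count wrong predictions of a `bits`-bit saturating-counter predictor.

    outcomes -- sequence of booleans, True = branch taken
    bits     -- 1 (remember the last outcome) or 2 (saturating counter)
    The predictor starts in the strongest "taken" state (assumption).
    """
    max_state = (1 << bits) - 1
    threshold = 1 << (bits - 1)      # states >= threshold predict "taken"
    state = max_state
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= threshold
        if predicted_taken != taken:
            misses += 1
        # Move towards the observed outcome, saturating at the ends
        state = min(state + 1, max_state) if taken else max(state - 1, 0)
    return misses

# Inner-loop branch: taken 3 times, then untaken at loop exit; the loop is entered 3 times.
PATTERN = ([True] * 3 + [False]) * 3
```

On this pattern the single-bit predictor misses 5 times (at every loop exit and at every re-entry after the first), while the two-bit counter misses only 3 times (at the loop exits).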