Outline Processors and Instruction Sets Review of pipelining - - PowerPoint PPT Presentation




SLIDE 1

Advanced Topics on Heterogeneous System Architectures

Politecnico di Milano, Seminar Room @ DEIB
30 November 2017
Antonio R. Miele, Marco D. Santambrogio
Politecnico di Milano

Pipelining

SLIDE 2

Outline

  • Processors and Instruction Sets
  • Review of pipelining
  • MIPS
    – Reduced Instruction Set of MIPS™ Processor
    – Implementation of MIPS Processor Pipeline
    – The Problem of Pipeline Hazards
    – Performance Issues in Pipelining

SLIDE 3

Main Characteristics of MIPS™ Architecture

  • RISC (Reduced Instruction Set Computer) Architecture
    Based on the concept of executing only simple instructions in a reduced basic cycle, to improve performance over CISC CPUs.
  • LOAD/STORE Architecture
    ALU operands come from the CPU general-purpose registers and cannot come directly from memory. Dedicated instructions are necessary to:
    – load data from memory to registers
    – store data from registers to memory
  • Pipeline Architecture
    Performance optimization technique based on overlapping the execution of multiple instructions derived from a sequential execution flow.

SLIDE 4

A Typical RISC ISA

  • 32-bit fixed format instruction (3 formats)
  • 32 32-bit GPRs (R0 contains zero, DP (double-precision) values take a register pair)
  • 3-address, reg-reg arithmetic instruction
  • Single address mode for load/store: base + displacement
    – no indirection
  • Simple branch conditions
  • Delayed branch
  • Examples: SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3

SLIDE 5

Approaching an ISA

  • Instruction Set Architecture
    – Defines set of operations, instruction format, hardware-supported data types, named storage, addressing modes, sequencing
  • Meaning of each instruction is described by RTL on architected registers and memory
  • Given technology constraints, assemble an adequate datapath
    – Architected storage mapped to actual storage
    – Function units to do all the required operations
    – Possible additional storage (e.g. MAR, MBR, …)
    – Interconnect to move information among regs and FUs
  • Map each instruction to a sequence of RTLs
  • Collate sequences into a symbolic controller state transition diagram (STD)
  • Implement controller

SLIDE 6

Example: MIPS

Instruction formats (bit ranges in brackets):

  Register-Register    Op [31–26] | Rs1 [25–21] | Rs2 [20–16] | Rd [15–11] | Opx [10–0]
  Register-Immediate   Op [31–26] | Rs1 [25–21] | Rd [20–16] | immediate [15–0]
  Branch               Op [31–26] | Rs1 [25–21] | Rs2/Opx [20–16] | immediate [15–0]
  Jump / Call          Op [31–26] | target [25–0]

SLIDE 7

Datapath vs Control

  • Datapath: storage, FUs and interconnect sufficient to perform the desired functions
    – Inputs are control points
    – Outputs are signals
  • Controller: state machine to orchestrate operation on the datapath
    – Based on desired function and signals

Figure: the Controller drives the Datapath through control points; the Datapath returns signals to the Controller.

SLIDE 8

Datapath vs Control

Figure: block diagram of the CPU (Control Unit with PC, IR and PSW; Data path with Registers and ALU) connected to the Memory (Instruction and Data) through the Data bus, Address bus and Control bus.

SLIDE 9

The code…

  …
  0789 load R02,4000
  0790 load R03,4004
  0791 add R01,R02,R03
  0792 load R02,4008
  0793 add R01,R01,R02
  0794 store R01,4000
  …

SLIDE 10

Starting scenario

Figure: CPU (Control Unit with PC, IR and PSW; Data path with registers R00 to R05 and ALU) connected to the Memory through the data, address and control buses.

  • PC = 0789
  • Memory (instructions): the program at addresses 0789 to 0794
  • Memory (data): 4000 = 1492, 4004 = 1918, 4008 = 2006

SLIDE 11

Read Instruction 0789

Figure: CPU and Memory diagram during the fetch of instruction 0789.

  • The PC value 0789 is placed on the address bus and the memory is read
  • IR ← load R02,4000
  • PC ← PC + 1 = 0790

SLIDE 12

Exe Instruction 0789

Figure: CPU and Memory diagram during the execution of instruction 0789.

  • PC = 0790, IR = load R02,4000
  • The address 4000 is placed on the address bus and the memory is read
  • The value 1492 arrives on the data bus and is loaded into R02

SLIDE 13

Read Instruction 0790

Figure: CPU and Memory diagram during the fetch of instruction 0790.

  • The PC value 0790 is placed on the address bus and the memory is read
  • IR ← load R03,4004 (previous instruction: load R02,4000)
  • PC ← PC + 1 = 0791
  • Registers: R02 = 1492

SLIDE 14

Exe Instruction 0790

Figure: CPU and Memory diagram during the execution of instruction 0790.

  • PC = 0791, IR = load R03,4004
  • The address 4004 is placed on the address bus and the memory is read
  • The value 1918 arrives on the data bus and is loaded into R03
  • Registers: R02 = 1492

SLIDE 15

Read Instruction 0791

Figure: CPU and Memory diagram during the fetch of instruction 0791.

  • The PC value 0791 is placed on the address bus and the memory is read
  • IR ← add R01,R02,R03 (previous instruction: load R03,4004)
  • PC ← PC + 1 = 0792
  • Registers: R02 = 1492, R03 = 1918

SLIDE 16

Exe Instruction 0791

Figure: CPU and Memory diagram during the execution of instruction 0791.

  • PC = 0792, IR = add R01,R02,R03
  • The ALU receives 1492 and 1918 together with the add command
  • The result 3410 is written into R01

SLIDE 17

Read Instruction 0792

Figure: CPU and Memory diagram during the fetch of instruction 0792.

  • The PC value 0792 is placed on the address bus and the memory is read
  • IR ← load R02,4008 (previous instruction: add R01,R02,R03)
  • PC ← PC + 1 = 0793
  • Registers: R01 = 3410, R02 = 1492, R03 = 1918

SLIDE 18

Exe Instruction 0792

Figure: CPU and Memory diagram during the execution of instruction 0792.

  • PC = 0793, IR = load R02,4008
  • The address 4008 is placed on the address bus and the memory is read
  • The value 2006 arrives on the data bus and replaces 1492 in R02
  • Registers: R01 = 3410, R03 = 1918

SLIDE 19

Read Instruction 0793

Figure: CPU and Memory diagram during the fetch of instruction 0793.

  • The PC value 0793 is placed on the address bus and the memory is read
  • IR ← add R01,R01,R02 (previous instruction: load R02,4008)
  • PC ← PC + 1 = 0794
  • Registers: R01 = 3410, R02 = 2006, R03 = 1918

SLIDE 20

Exe Instruction 0793

Figure: CPU and Memory diagram during the execution of instruction 0793.

  • PC = 0794, IR = add R01,R01,R02
  • The ALU receives 3410 and 2006 together with the add command
  • The result 5416 replaces 3410 in R01
  • Registers: R02 = 2006, R03 = 1918

SLIDE 21

Read Instruction 0794

Figure: CPU and Memory diagram during the fetch of instruction 0794.

  • The PC value 0794 is placed on the address bus and the memory is read
  • IR ← store R01,4000 (previous instruction: add R01,R01,R02)
  • PC ← PC + 1 = 0795
  • Registers: R01 = 5416, R02 = 2006, R03 = 1918

SLIDE 22

Exe Instruction 0794

Figure: CPU and Memory diagram during the execution of instruction 0794.

  • PC = 0795, IR = store R01,4000
  • The address 4000 is placed on the address bus and the value 5416 (from R01) on the data bus; the memory is written
  • Memory location 4000, which held 1492, receives 5416
  • Registers: R01 = 5416, R02 = 2006, R03 = 1918
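The fetch/execute walkthrough of the six-instruction program can be replayed with a small simulation. This is only a sketch: the tuple encoding of the instructions and the `run` helper are illustrative assumptions, not something defined in the slides.

```python
# Toy model of the walkthrough: each loop iteration is one "Read Instruction"
# step (fetch into IR, PC + 1) followed by one "Exe Instruction" step.
# The instruction encoding and the run() helper are hypothetical.

def run(program, data, pc):
    """Execute until the PC leaves the program; return (registers, data, pc)."""
    regs = {}
    while pc in program:
        ir = program[pc]              # fetch: IR <- Memory[PC]
        pc += 1                       # PC <- PC + 1
        op = ir[0]
        if op == "load":              # load Rd, address
            _, rd, address = ir
            regs[rd] = data[address]
        elif op == "add":             # add Rd, Ra, Rb
            _, rd, ra, rb = ir
            regs[rd] = regs[ra] + regs[rb]
        elif op == "store":           # store Rs, address
            _, rs, address = ir
            data[address] = regs[rs]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return regs, data, pc


program = {
    789: ("load", "R02", 4000),
    790: ("load", "R03", 4004),
    791: ("add", "R01", "R02", "R03"),
    792: ("load", "R02", 4008),
    793: ("add", "R01", "R01", "R02"),
    794: ("store", "R01", 4000),
}
initial_data = {4000: 1492, 4004: 1918, 4008: 2006}

regs, data_after, final_pc = run(program, dict(initial_data), 789)
print(regs, data_after[4000], final_pc)
```

The final state agrees with the last step of the walkthrough: R01 holds 5416, which is also written to memory location 4000.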

SLIDE 23

Reduced Instruction Set of MIPS Processor

  • ALU instructions:
      add  $s1, $s2, $s3       # $s1 ← $s2 + $s3
      addi $s1, $s1, 4         # $s1 ← $s1 + 4
  • Load/store instructions:
      lw $s1, offset($s2)      # $s1 ← M[$s2+offset]
      sw $s1, offset($s2)      # M[$s2+offset] ← $s1
  • Branch instructions to control the flow of the program:
    – Conditional branches: the branch is taken only if the condition is satisfied. Examples: beq (branch on equal) and bne (branch on not equal)
      beq $s1, $s2, L1         # go to L1 if ($s1 == $s2)
      bne $s1, $s2, L1         # go to L1 if ($s1 != $s2)
    – Unconditional jumps: the branch is always taken. Examples: j (jump) and jr (jump register)
      j  L1                    # go to L1
      jr $s1                   # go to the address contained in $s1

SLIDE 24

Execution of MIPS Instructions

Every instruction in the MIPS subset can be implemented in at most 5 clock cycles as follows:

  • Instruction Fetch Cycle:
    – Send the content of the Program Counter register to the Instruction Memory and fetch the current instruction from it. Update the PC to the next sequential address by adding 4 to the PC (since each instruction is 4 bytes).
  • Instruction Decode and Register Read Cycle:
    – Decode the current instruction (fixed-field decoding) and read from the Register File the one or two registers specified in the instruction fields.
    – Sign-extend the offset field of the instruction in case it is needed.

SLIDE 25

Execution of MIPS Instructions

  • Execution Cycle
    The ALU operates on the operands prepared in the previous cycle, depending on the instruction type:
    – Register-Register ALU Instructions: the ALU executes the specified operation on the operands read from the RF
    – Register-Immediate ALU Instructions: the ALU executes the specified operation on the first operand read from the RF and the sign-extended immediate operand
    – Memory Reference: the ALU adds the base register and the offset to calculate the effective address
    – Conditional branches: compare the two registers read from the RF and compute the possible branch target address by adding the sign-extended offset to the incremented PC

SLIDE 26

Execution of MIPS Instructions

  • Memory Access (ME)
    – Load instructions require a read access to the Data Memory using the effective address
    – Store instructions require a write access to the Data Memory using the effective address, to write the data from the source register read from the RF
    – Conditional branches can update the content of the PC with the branch target address, if the conditional test yielded true
  • Write-Back Cycle (WB)
    – Load instructions write the data read from memory into the destination register of the RF
    – ALU instructions write the ALU results into the destination register of the RF

SLIDE 27

Execution of MIPS Instructions

  • ALU Instructions: op $x,$y,$z
    Instr. Fetch & PC Increm. → Read of Source Regs. $y and $z → ALU Op. ($y op $z) → Write Back of Destinat. Reg. $x
  • Conditional Branch: beq $x,$y,offset
    Instr. Fetch & PC Increm. → Read of Source Regs. $x and $y → ALU Op. ($x-$y) & (PC+4+offset) → Write PC
  • Load Instructions: lw $x,offset($y)
    Instr. Fetch & PC Increm. → Read of Base Reg. $y → ALU Op. ($y+offset) → Read Mem. M($y+offset) → Write Back of Destinat. Reg. $x
  • Store Instructions: sw $x,offset($y)
    Instr. Fetch & PC Increm. → Read of Base Reg. $y & Source $x → ALU Op. ($y+offset) → Write Mem. M($y+offset)

SLIDE 28

MIPS Data path

Figure: MIPS data path organized in five sections: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back. Components shown: Next PC multiplexer, +4 adder (Next SEQ PC), instruction memory, register file (RS1, RS2, RD), sign extend (Imm), ALU with Zero? test, data memory with LMD, multiplexers, WB Data.

  IR <= mem[PC]
  PC <= PC + 4
  Reg[IR_rd] <= Reg[IR_rs] op_IRop Reg[IR_rt]

SLIDE 29

Instructions Latency

All values in ns.

  Instruction Type   Instruct. Mem.   Register Read   ALU Op.   Data Memory   Write Back   Total Latency
  ALU Instr.         2                1               2         -             1            6 ns
  Load               2                1               2         2             1            8 ns
  Store              2                1               2         2             -            7 ns
  Cond. Branch       2                1               2         -             -            5 ns
  Jump               2                -               -         -             -            2 ns
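The totals in the latency table are just the sum of the module delays each instruction type uses. A minimal sketch, assuming hypothetical dictionary names (`MODULE_NS`, `USES`) for the table's columns and rows:

```python
# Module delays in ns, taken from the latency table. The dictionary names and
# keys are illustrative, not part of the slides.
MODULE_NS = {
    "instr_mem": 2,
    "reg_read": 1,
    "alu": 2,
    "data_mem": 2,
    "write_back": 1,
}

# Modules used by each instruction type (the filled cells of each table row).
USES = {
    "alu": ["instr_mem", "reg_read", "alu", "write_back"],
    "load": ["instr_mem", "reg_read", "alu", "data_mem", "write_back"],
    "store": ["instr_mem", "reg_read", "alu", "data_mem"],
    "branch": ["instr_mem", "reg_read", "alu"],
    "jump": ["instr_mem"],
}


def latency_ns(kind):
    """Total latency of one instruction type: sum of the modules it uses."""
    return sum(MODULE_NS[module] for module in USES[kind])


latencies = {kind: latency_ns(kind) for kind in USES}
critical_ns = max(latencies.values())   # the slowest instruction type
frequency_mhz = 1000 / critical_ns      # 1 / (T in ns) expressed in MHz

print(latencies, critical_ns, frequency_mhz)
```

The slowest instruction type is the load, which is why a single-cycle implementation has to size its clock on it.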

SLIDE 30

Single-cycle Implementation of MIPS

  • The length of the clock cycle is defined by the critical path, given by the load instruction: T = 8 ns (f = 125 MHz).
  • We assume each instruction is executed in a single clock cycle
    – Each module can be used only once in a clock cycle
    – The modules used more than once in a cycle must be duplicated
  • We need an Instruction Memory separated from the Data Memory.
  • Some modules must be duplicated, while other modules must be shared among different instruction flows
  • To share a module between two different instructions, we need a multiplexer that accepts multiple inputs to the module and selects one of them based on the configuration of the control lines.

SLIDE 31

Implementation of MIPS data path with Control Unit

Figure: single-cycle MIPS data path with Control Unit. Components shown: PC, +4 adder, Instruction Memory (read address, instruction), Register File (Read Register 1, Read Register 2, Write Register, Write Data, Content Reg. 1, Content Reg. 2) fed by the instruction fields [25-21], [20-16] and [15-11], Sign Extension of field [15-0] from 16 to 32 bits, 2-bit Left Shifter and Adder, ALU (inputs A and B, Result, Zero), Data Memory (Read Address, Write Address, Write Data, Read Data) and multiplexers. The Control Unit takes OP [31-26] and drives RegWR, MemWR, MemRD, Destination Register, Branch, MemToReg, ALU_opB and ALU_op; the ALU Control Unit also takes field [5-0].

SLIDE 32

Multi-cycle Implementation

  • The instruction execution is distributed over multiple cycles (5 cycles for MIPS)
  • The basic cycle is shorter (2 ns ⇒ instruction latency = 10 ns)
  • Implementation of multi-cycle CPU:
    – Each phase of the instruction execution requires a clock cycle
    – Each module can be used more than once per instruction in different clock cycles: possible sharing of modules
    – We need internal registers to store the values to be used in the next clock cycles

SLIDE 33

Pipelining

  • Performance optimization technique based on overlapping the execution of multiple instructions derived from a sequential execution flow.
  • Pipelining exploits the parallelism among instructions in a sequential instruction stream.
  • Basic idea: the execution of an instruction is divided into different phases (pipeline stages), each requiring a fraction of the time necessary to complete the instruction.
  • The stages are connected one to the next to form the pipeline: instructions enter the pipeline at one end, progress through the stages, and exit from the other end, as in an assembly line.

SLIDE 34

Pipelining

  • Advantage: the technique is transparent to the programmer.
  • The technique is similar to an assembly line: a new car exits from the assembly line in the time necessary to complete one of the phases.
  • An assembly line does not reduce the time necessary to complete a car, but it increases the number of cars in production at the same time and the rate at which cars are completed.

SLIDE 35

Sequential vs. Pipelining Execution

Sequential execution (10 ns per instruction):

  I1   IF ID EX MEM WB
  I2                    IF ID EX MEM WB
  …

Pipelined execution (one stage per 2 ns clock cycle, a new instruction starts every 2 ns):

  I1   IF ID EX MEM WB
  I2      IF ID EX MEM WB
  I3         IF ID EX MEM WB
  I4            IF ID EX MEM WB
  I5               IF ID EX MEM WB

SLIDE 36

Pipelining

  • The time to advance an instruction by one stage in the pipeline corresponds to a clock cycle.
  • The pipeline stages must be synchronized: the duration of a clock cycle is defined by the time required by the slowest stage of the pipeline (here, 2 ns).
  • The goal is to balance the length of each pipeline stage.
  • If the stages are perfectly balanced, the ideal speedup due to pipelining is equal to the number of pipeline stages.
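The rule that the slowest stage sets the clock can be checked numerically. The sketch below assumes that the five module delays of the latency table (2, 1, 2, 2, 1 ns) map one-to-one onto the IF, ID, EX, MEM and WB stages; the function names are illustrative.

```python
# Assumed stage delays in ns (IF, ID, EX, MEM, WB), taken from the module
# delays of the latency table. The one-to-one mapping is an assumption.
STAGE_NS = [2, 1, 2, 2, 1]


def pipeline_cycle_ns(stage_ns):
    """All stages advance together, so the slowest stage sets the clock."""
    return max(stage_ns)


def pipelined_latency_ns(stage_ns):
    """Every instruction spends one full clock cycle in every stage."""
    return len(stage_ns) * pipeline_cycle_ns(stage_ns)


cycle_ns = pipeline_cycle_ns(STAGE_NS)       # clock cycle of the pipeline
latency_ns = pipelined_latency_ns(STAGE_NS)  # time from IF to WB
useful_work_ns = sum(STAGE_NS)               # actual work of a load

print(cycle_ns, latency_ns, useful_work_ns)
```

Because the stages are not perfectly balanced, an instruction that needs only 8 ns of actual work takes 10 ns to traverse the pipeline.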

SLIDE 37

Performance Improvement

  • Ideal case (asymptotically): if we consider the single-cycle unpipelined CPU1 with a clock cycle of 8 ns and the pipelined CPU2 with 5 stages of 2 ns:
    – The latency (total execution time) of each instruction gets worse: from 8 ns to 10 ns
    – The throughput (number of instructions completed per unit of time) improves by a factor of 4: (1 instruction completed every 8 ns) vs. (1 instruction completed every 2 ns)

SLIDE 38

Performance Improvement

  • Ideal case (asymptotically): if we consider the multi-cycle unpipelined CPU3 composed of 5 cycles of 2 ns and the pipelined CPU2 with 5 stages of 2 ns:
    – The latency (total execution time) of each instruction does not change (10 ns)
    – The throughput (number of instructions completed per unit of time) improves by a factor of 5: (1 instruction completed every 10 ns) vs. (1 instruction completed every 2 ns)
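The word "asymptotically" matters: the factors 4 and 5 are only approached for long instruction streams, because the pipeline needs a few cycles to fill. A small sketch, assuming no hazards or stalls and using the hypothetical helpers `total_time_ns` and `speedups`:

```python
def total_time_ns(n, cycle_ns, cycles_per_instr=1, pipelined=False):
    """Time to run n instructions, assuming no hazards and no stalls."""
    if pipelined:
        # The first instruction needs all stages; each following one
        # completes one clock cycle later.
        return (cycles_per_instr + n - 1) * cycle_ns
    return n * cycles_per_instr * cycle_ns


def speedups(n):
    """Speedup of the pipelined CPU2 over CPU1 and CPU3 for n instructions."""
    cpu1 = total_time_ns(n, cycle_ns=8)                       # single-cycle
    cpu3 = total_time_ns(n, cycle_ns=2, cycles_per_instr=5)   # multi-cycle
    cpu2 = total_time_ns(n, cycle_ns=2, cycles_per_instr=5, pipelined=True)
    return cpu1 / cpu2, cpu3 / cpu2


for n in (1, 10, 1_000_000):
    over_cpu1, over_cpu3 = speedups(n)
    print(n, round(over_cpu1, 3), round(over_cpu3, 3))
```

For a single instruction the pipeline is actually slower than the single-cycle CPU (10 ns against 8 ns); the gain comes only from overlapping many instructions.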

SLIDE 39

Pipeline Execution of MIPS Instructions

Stages: IF Instruction Fetch, ID Instruction Decode, EX Execution, ME Memory Access, WB Write Back

  • ALU Instructions: op $x,$y,$z
      IF: Instr. Fetch & PC Increm.
      ID: Read of Source Regs. $y and $z
      EX: ALU Op. ($y op $z)
      ME: (not used)
      WB: Write Back of Destinat. Reg. $x
  • Load Instructions: lw $x,offset($y)
      IF: Instr. Fetch & PC Increm.
      ID: Read of Base Reg. $y
      EX: ALU Op. ($y+offset)
      ME: Read Mem. M($y+offset)
      WB: Write Back of Destinat. Reg. $x
  • Store Instructions: sw $x,offset($y)
      IF: Instr. Fetch & PC Increm.
      ID: Read of Base Reg. $y & Source $x
      EX: ALU Op. ($y+offset)
      ME: Write Mem. M($y+offset)
      WB: (not used)
  • Conditional Branches: beq $x,$y,offset
      IF: Instr. Fetch & PC Increm.
      ID: Read of Source Regs. $x and $y
      EX: ALU Op. ($x-$y) & (PC+4+offset)
      ME: Write PC
      WB: (not used)

SLIDE 40

Implementation of MIPS pipeline

Figure: pipelined MIPS data path. Components shown: PC, +4 adder, Instruction Memory, RF (Register Read 1, Register Read 2, Register Write, Write Data, Content register 1, Content register 2) fed by the instruction fields [25-21], [20-16] and [15-11], Sign extension of field [15-0] from 16 to 32 bits, 2-bit Left Shifter and Adder, ALU (Result, Zero), Data Memory (Read Address, Write Address, Write Data, Read Data) and multiplexers, with control signals labelled WR, WR, RD and OP. The stages are separated by the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB.

Stages: IF Instruction Fetch, ID Instruction Decode, EX Execution, MEM Memory Access, WB Write Back

SLIDE 41

Visualizing Pipelining

Figure: instruction order (vertical axis) versus time in clock cycles, Cycle 1 to Cycle 7 (horizontal axis). Four instructions each pass through Ifetch, Reg, ALU, DMem and Reg, each one starting a cycle after the previous one.

SLIDE 42

Note: Optimized Pipeline

  • Register File used in 2 stages: read access during ID and write access during WB
    – What happens if read and write refer to the same register in the same clock cycle?
      It is necessary to insert one stall
  • Optimized Pipeline: the RF read occurs in the second half of the clock cycle and the RF write in the first half of the clock cycle
    – What happens if read and write refer to the same register in the same clock cycle?
      It is not necessary to insert one stall
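The effect of ordering the two half-cycles can be illustrated with a toy register file. This is a behavioural sketch only: the `RegisterFile` class and its `write_first` flag are hypothetical, and the non-optimized case is modelled simply as "the read sees the old value".

```python
class RegisterFile:
    """Toy register file with one write port and one read port per cycle."""

    def __init__(self, write_first):
        self.values = {}
        # True models the optimized pipeline: write in the first half of the
        # cycle, read in the second half.
        self.write_first = write_first

    def cycle(self, write=None, read=None):
        """Perform one clock cycle; return the value seen by the read."""
        if self.write_first and write is not None:
            reg, value = write
            self.values[reg] = value
        result = self.values.get(read) if read is not None else None
        if not self.write_first and write is not None:
            reg, value = write
            self.values[reg] = value
        return result


plain = RegisterFile(write_first=False)
plain.values["$2"] = 10
optimized = RegisterFile(write_first=True)
optimized.values["$2"] = 10

# Same cycle: one instruction writes $2 in WB while another reads $2 in ID.
stale = plain.cycle(write=("$2", 20), read="$2")
fresh = optimized.cycle(write=("$2", 20), read="$2")
print(stale, fresh)
```

In the non-optimized model the reader gets the old value, so it would have to wait one more cycle; with the write in the first half it gets the new value immediately.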

SLIDE 43

Note: Optimized Pipeline

From now on, this is the pipeline we are going to use.

SLIDE 44

5 Steps of MIPS Datapath

Figure A.3, Page A-9

Figure: pipelined MIPS data path with the sections Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access and Write Back, separated by the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB. Components shown: Next PC multiplexer, +4 adder (Next SEQ PC), instruction memory, register file (RS1, RS2, RD), sign extend (Imm), ALU with Zero? test, data memory, multiplexers, WB Data. The RD field travels along the pipeline registers.

  IR <= mem[PC]; PC <= PC + 4
  A <= Reg[IR_rs]; B <= Reg[IR_rt]
  rslt <= A op_IRop B
  WB <= rslt
  Reg[IR_rd] <= WB

  • Data stationary control: local decode for each instruction phase / pipeline stage

SLIDE 45

The Problem of Hazards


SLIDE 46

The Problem of Hazards

  • A hazard is created whenever there is a dependence between instructions, and the instructions are close enough that the overlap caused by pipelining would change the order of access to the operands involved in the dependence.
  • Hazards prevent the next instruction in the pipeline from executing during its designated clock cycle.
  • Hazards reduce the performance from the ideal speedup gained by pipelining.

SLIDE 47

Three Classes of Hazards

  • Structural Hazards: attempt to use the same resource by different instructions simultaneously
    – Example: single memory for instructions and data
  • Data Hazards: attempt to use a result before it is ready
    – Example: instruction depending on a result of a previous instruction still in the pipeline
  • Control Hazards: attempt to make a decision on the next instruction to execute before the condition is evaluated
    – Example: conditional branch execution

SLIDE 48

Structural Hazards

  • No structural hazards in MIPS architecture:
    – Instruction Memory separated from Data Memory
    – Register File used in the same clock cycle: read access by an instruction and write access by another instruction

Figure: pipeline diagram of five instructions I1 to I5 over time, one 2 ns clock cycle per stage; each instruction uses IM, REG, ALU, DM and REG in successive cycles.

SLIDE 49

Data Hazards

  • If the instructions executed in the pipeline are dependent, data hazards can arise when the instructions are too close
  • Example:

      sub $2, $1, $3       # Reg. $2 written by sub
      and $12, $2, $5      # 1st operand ($2) depends on sub
      or  $13, $6, $2      # 2nd operand ($2) depends on sub
      add $14, $2, $2      # 1st ($2) & 2nd ($2) operands depend on sub
      sw  $15, 100($2)     # Base reg. ($2) depends on sub
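The dependences annotated in the example can be found mechanically by looking, for every source register, at the most recent earlier instruction that writes it. A sketch under the assumption that each instruction is encoded as a hypothetical `(opcode, destination, sources)` tuple:

```python
# Hypothetical encoding: (opcode, destination register or None, source registers).
program = [
    ("sub", "$2", ["$1", "$3"]),
    ("and", "$12", ["$2", "$5"]),
    ("or", "$13", ["$6", "$2"]),
    ("add", "$14", ["$2", "$2"]),
    ("sw", None, ["$15", "$2"]),   # sw reads both the data and the base register
]


def raw_dependences(program):
    """Return (producer index, consumer index, register) for each RAW dependence."""
    found = []
    for j, (_, _, sources) in enumerate(program):
        for reg in sorted(set(sources)):
            # Walk backwards to the most recent writer of this register.
            for i in range(j - 1, -1, -1):
                if program[i][1] == reg:
                    found.append((i, j, reg))
                    break
    return found


deps = raw_dependences(program)
for producer, consumer, reg in deps:
    print(f"{program[consumer][0]} depends on {program[producer][0]} through {reg}")
```

Every instruction after sub depends on it through $2; whether a dependence becomes a hazard then depends on how close the two instructions are in the pipeline.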

SLIDE 50

Data Hazards in the Optimized Pipeline: Example

  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)

Figure: pipeline diagram of the five instructions (instruction order versus time, one 2 ns clock cycle per stage); each instruction uses IM, REG, ALU, DM and REG in successive cycles.

It is necessary to insert two stalls.

SLIDE 51

Type of Data Hazard

  • Read After Write (RAW)
    Instr_J tries to read an operand before Instr_I writes it

      I: add r1,r2,r3
      J: sub r4,r1,r3

  • Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

SLIDE 52

Data Hazards: Possible Solutions

  • Compilation Techniques:
    – Insertion of nop (no operation) instructions
    – Instruction scheduling, to keep dependent instructions from being too close
      The compiler tries to insert independent instructions between dependent instructions.
      When the compiler does not find independent instructions, it inserts nops.
  • Hardware Techniques:
    – Insertion of "bubbles" or stalls in the pipeline
    – Data Forwarding or Bypassing
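The nop-insertion technique can be sketched as a single pass over the program. The sketch assumes the optimized 5-stage pipeline without forwarding, where a consumer must sit at least three slots after its producer so that its ID falls in the same cycle as the producer's WB; the tuple encoding, `MIN_DISTANCE` and `insert_nops` are illustrative names.

```python
# Assumption: optimized pipeline, no forwarding. The producer writes in WB
# (first half of the cycle) and the consumer reads in ID (second half), so the
# consumer must be issued at least 3 slots after the producer.
MIN_DISTANCE = 3
NOP = ("nop", None, [])

program = [
    ("sub", "$2", ["$1", "$3"]),
    ("and", "$12", ["$2", "$5"]),
    ("or", "$13", ["$6", "$2"]),
    ("add", "$14", ["$2", "$2"]),
    ("sw", None, ["$15", "$2"]),
]


def insert_nops(program):
    """Pad the program with nops so every RAW distance is at least MIN_DISTANCE."""
    scheduled = []
    last_write = {}                       # register -> slot of its latest writer
    for op, dest, sources in program:
        slot = len(scheduled)
        for reg in sources:
            if reg in last_write:
                slot = max(slot, last_write[reg] + MIN_DISTANCE)
        scheduled.extend([NOP] * (slot - len(scheduled)))
        if dest is not None:
            last_write[dest] = len(scheduled)
        scheduled.append((op, dest, sources))
    return scheduled


padded = insert_nops(program)
print([op for op, _, _ in padded])
```

Only two nops are needed, right after sub: once `and` has been delayed, the following instructions are far enough from sub on their own. A scheduler would try to fill those two slots with independent instructions instead.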

SLIDE 53

Just an example...

Figure: pipeline diagram of the instructions sub, and, or, add and sw (instruction execution order versus time, one 2 ns clock cycle per stage); each instruction uses IM, REG, ALU, DM and REG in successive cycles.

We need to insert 2 bubbles (2 stalls).

SLIDE 54

Insertion of nops

  sub $2, $1, $3       IF ID EX ME WB
  nop                     IF ID EX ME WB
  nop                        IF ID EX ME WB
  and $12, $2, $5               IF ID EX ME WB
  or  $13, $6, $2                  IF ID EX ME WB
  add $14, $2, $2                     IF ID EX ME WB
  sw  $15, 100($2)                       IF ID EX ME WB

SLIDE 55

Insertion of bubbles and stalls

Figure: pipeline diagram over time (clock cycles CK1 to CK12) of the instructions

  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)

Two bubbles are inserted after the IF of the and instruction; the following instructions are delayed accordingly.

Content of $2: 10 in CK1 to CK4, -20 in CK5 to CK12.

SLIDE 56

Scheduling: Example

  Original code            Scheduled code
  sub $2, $1, $3           sub $2, $1, $3
  and $12, $2, $5          add $4, $10, $11
  or  $13, $6, $2          and $7, $8, $9
  add $14, $2, $2          lw  $16, 100($18)
  sw  $15, 100($2)         lw  $17, 200($19)
  add $4, $10, $11         and $12, $2, $5
  and $7, $8, $9           or  $13, $6, $2
  lw  $16, 100($18)        add $14, $2, $2
  lw  $17, 200($19)        sw  $15, 100($2)

SLIDE 57

Forwarding

  • Data forwarding uses temporary results stored in the pipeline registers instead of waiting for the write back of results in the RF.
  • We need to add multiplexers at the inputs of the ALU to fetch inputs from the pipeline registers, to avoid the insertion of stalls in the pipeline.
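Which pipeline register a forwarded operand comes from depends on how many instructions separate the consumer from the producer. The sketch below uses the path names that appear in the forwarding figures (EX/EX, MEM/EX, MEM/ID); the mapping from distance to path is an assumption that holds for a producer computing its result in EX (an ALU instruction), and the tuple encoding is hypothetical.

```python
# Assumed mapping for a producer whose result is ready at the end of EX:
#   distance 1 -> EX/EX path, distance 2 -> MEM/EX path,
#   distance 3 -> MEM/ID path, larger -> value already in the register file.
PATH_BY_DISTANCE = {1: "EX/EX", 2: "MEM/EX", 3: "MEM/ID"}

program = [
    ("sub", "$2", ["$1", "$3"]),
    ("and", "$12", ["$2", "$5"]),
    ("or", "$13", ["$6", "$2"]),
    ("add", "$14", ["$2", "$2"]),
    ("sw", None, ["$15", "$2"]),
]


def forwarding_path(distance):
    """Name of the path used by a consumer `distance` slots after the producer."""
    if distance < 1:
        raise ValueError("the consumer must follow the producer")
    return PATH_BY_DISTANCE.get(distance, "register file")


def paths_for(program):
    """Return (consumer opcode, register, path) for every dependent operand."""
    last_writer = {}
    result = []
    for index, (op, dest, sources) in enumerate(program):
        for reg in sorted(set(sources)):
            if reg in last_writer:
                path = forwarding_path(index - last_writer[reg])
                result.append((op, reg, path))
        if dest is not None:
            last_writer[dest] = index
    return result


paths = paths_for(program)
for op, reg, path in paths:
    print(f"{op} gets {reg} through: {path}")
```

In hardware this choice is made by the select lines of the multiplexers in front of the ALU, not by a table lookup; the dictionary only summarizes the decision.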

SLIDE 58

Forwarding: Example

  sub $2, $1, $3       IF ID EX ME WB
  and $12, $2, $5         IF ID EX ME WB             (EX/EX path)
  or  $13, $6, $2            IF ID EX ME WB          (MEM/EX path)
  add $14, $2, $2               IF ID EX ME WB       (MEM/ID path)
  sw  $15, 100($2)                 IF ID EX ME WB

SLIDE 59

Implementation of MIPS with Forwarding Unit

Figure: pipelined data path (PC, Instr Memory, Reg., ALU, Data Memory, multiplexers) with the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB and a Forwarding unit. The Forwarding unit receives IF/ID.RegisterRs, IF/ID.RegisterRt, IF/ID.RegisterRd, EX/MEM.RegisterRd and MEM/WB.RegisterRd and drives the multiplexers at the ALU inputs. Paths shown: EX/EX path, MEM/EX path, MEM/ID path, WB path.

SLIDE 60

Data Hazards: Load/Use Hazard

  L1: lw  $s0, 4($t1)        # $s0 <- M[4 + $t1]
  L2: add $s5, $s0, $s1      # 1st operand depends on L1

Figure: pipeline diagram (IF, ID, EX, MEM, WB) of the two instructions over clock cycles CK1 to CK7.

SLIDE 61

Data Hazards: Load/Use Hazard

  • With forwarding using the MEM/EX path: 1 stall

  lw  $s0, 4($t1)
  add $s5, $s0, $s1

Figure: pipeline diagram (IF, ID, EX, MEM, WB) of the two instructions over clock cycles CK1 to CK7.
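The single stall of the load/use case follows from when the value becomes available: a load produces it at the end of MEM, one stage later than an ALU instruction. A sketch, assuming a consumer that needs the operand at the start of its EX stage and full forwarding; `READY_AFTER` and `stalls_with_forwarding` are illustrative names.

```python
# Number of slots after the producer at which a forwarded value can first
# reach the EX stage of a consumer (assumption: full forwarding).
READY_AFTER = {
    "alu": 1,   # result available at the end of EX
    "lw": 2,    # result available at the end of MEM
}


def stalls_with_forwarding(producer, distance):
    """Stall cycles for a consumer `distance` slots after the producer."""
    if distance < 1:
        raise ValueError("the consumer must follow the producer")
    return max(0, READY_AFTER[producer] - distance)


# lw $s0, 4($t1) immediately followed by add $s5, $s0, $s1
load_use = stalls_with_forwarding("lw", distance=1)
# the same pair with one independent instruction scheduled in between
load_gap = stalls_with_forwarding("lw", distance=2)
# ALU producer immediately followed by its consumer
alu_use = stalls_with_forwarding("alu", distance=1)
print(load_use, load_gap, alu_use)
```

The model covers only consumers that use the value in EX; a store that needs the loaded value as data to write uses it later, in its memory stage, and is a separate case.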

SLIDE 62

Data Hazards: Load/Store

  L1: lw $s0, 4($t1)       # $s0 <- M[4 + $t1]
  L2: sw $s0, 4($t2)       # M[4 + $t2] <- $s0

Figure: pipeline diagram (IF, ID, EX, MEM, WB) of the two instructions over clock cycles CK1 to CK7.

Content of $s0: 10 in CK1 to CK4, 10/20 in CK5, 20 in CK6 and CK7.

Without forwarding: 3 stalls

SLIDE 63

Data Hazards: Load/Store

  • Forwarding: Stall = 0
  • We need a forwarding path to bring the load result from the memory (in MEM/WB) to the memory's input for the store.

  lw $s0, 4($t1)
  sw $s0, 4($t2)

Figure: pipeline diagram (IF, ID, EX, MEM, WB) of the two instructions over clock cycles CK1 to CK7.

SLIDE 64

Implementation of MIPS with Forwarding Unit

Figure: pipelined data path (PC, Instruction Memory, Reg., ALU, Data Memory, multiplexers) with the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB and a Forwarding unit. The Forwarding unit receives IF/ID.RegisterRs, IF/ID.RegisterRt, IF/ID.RegisterRd, EX/MEM.RegisterRd and MEM/WB.RegisterRd. An additional multiplexer at the Data Memory input implements the MEM/MEM path. A WB path is also shown.

Forwarding paths:

  • EX/EX path
  • MEM/EX path
  • MEM/ID path
  • MEM/MEM path

SLIDE 65

Data Hazards

  • Data hazards analyzed up to now are:
    – RAW (READ AFTER WRITE) hazards: instruction n+1 tries to read a source register before the previous instruction n has written it in the RF.
    – Example:
        add $r1, $r2, $r3
        sub $r4, $r1, $r5
  • By using forwarding, it is always possible to solve this conflict without introducing stalls, except for the load/use hazards, where it is necessary to add one stall

SLIDE 66

Data Hazards

  • Other types of data hazards in the pipeline:
    – WAW (WRITE AFTER WRITE)
    – WAR (WRITE AFTER READ)

SLIDE 67

Data Hazards: WAW

  • Instruction n+1 tries to write a destination operand before it has been written by the previous instruction n ⇒ write operations executed in the wrong order
  • This type of hazard cannot occur in the MIPS pipeline because all the register write operations occur in the WB stage

SLIDE 68

Data Hazards: WAW

  • Example: if we assume that the register write in the ALU instructions occurs in the fourth stage and that load instructions require two stages (MEM1 and MEM2) to access the data memory, we can have:

                         CK1  CK2  CK3  CK4   CK5   CK6  CK7
  lw  $r1, 0($r2)        IF   ID   EX   MEM1  MEM2  WB
  add $r1, $r2, $r3           IF   ID   EX    WB

SLIDE 69

Data Hazards: WAW

  • Example: if we assume that the floating point ALU operations require a multi-cycle execution, we can have:

                         CK1  CK2  CK3   CK4   CK5   CK6   CK7  CK8
  mul $f6, $f2, $f2      IF   ID   MUL1  MUL2  MUL3  MUL4  MEM  WB
  add $f6, $f2, $f2           IF   ID    AD1   AD2   MEM   WB

SLIDE 70

WAW Data Hazards

  • Write After Write (WAW)
    Instr_J writes an operand before Instr_I writes it.

      I: sub r1,r4,r3
      J: add r1,r2,r3
      K: mul r6,r1,r7

  • Called an "output dependence" by compiler writers. This also results from the reuse of the name "r1".
  • Can't happen in the MIPS 5-stage pipeline because:
    – All instructions take 5 stages, and
    – Writes are always in stage 5
  • Will see WAR and WAW in more complicated pipes

SLIDE 71

Data Hazards: WAR

  • Instruction n+1 tries to write a destination operand before it has been read by the previous instruction n ⇒ instruction n reads the wrong value.
  • This type of hazard cannot occur in the MIPS pipeline because the operand read operations occur in the ID stage and the write operations in the WB stage.
  • As before, if we assume that the register write in the ALU instructions occurs in the fourth stage and that we need two stages to access the data memory, some instructions could read operands too late in the pipeline

SLIDE 72

Data Hazards: WAR

  • Example: instruction sw reads $r2 in the second half of the MEM2 stage and instruction add writes $r2 in the first half of the WB stage ⇒ sw reads the new value of $r2

                         CK1  CK2  CK3  CK4   CK5   CK6  CK7
  sw  $r1, 0($r2)        IF   ID   EX   MEM1  MEM2  WB
  add $r2, $r3, $r4           IF   ID   EX    WB

SLIDE 73

WAR Data Hazards

  • Write After Read (WAR)
    Instr_J writes an operand before Instr_I reads it

      I: sub r4,r1,r3
      J: add r1,r2,r3
      K: mul r6,r1,r7

  • Called an "anti-dependence" by compiler writers. This results from the reuse of the name "r1".
  • Can't happen in the MIPS 5-stage pipeline because:
    – All instructions take 5 stages, and
    – Reads are always in stage 2, and
    – Writes are always in stage 5
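The three kinds of data dependence differ only in which of the two instructions reads and which writes the shared register. A minimal classifier, assuming each instruction is given as a hypothetical `(destination, sources)` pair, with I the earlier instruction and J the later one:

```python
def classify(instr_i, instr_j):
    """Return the set of dependence kinds from the earlier I to the later J."""
    dest_i, sources_i = instr_i
    dest_j, sources_j = instr_j
    kinds = set()
    if dest_i is not None and dest_i in sources_j:
        kinds.add("RAW")   # J reads what I writes (true dependence)
    if dest_i is not None and dest_i == dest_j:
        kinds.add("WAW")   # both write the same register (output dependence)
    if dest_j is not None and dest_j in sources_i:
        kinds.add("WAR")   # J writes what I reads (anti-dependence)
    return kinds


# The I/J pairs used as examples in the slides.
raw_pair = classify(("r1", ["r2", "r3"]), ("r4", ["r1", "r3"]))  # add ; sub
waw_pair = classify(("r1", ["r4", "r3"]), ("r1", ["r2", "r3"]))  # sub ; add
war_pair = classify(("r4", ["r1", "r3"]), ("r1", ["r2", "r3"]))  # sub ; add
print(raw_pair, waw_pair, war_pair)
```

A dependence is a property of the program; it turns into a hazard only when the pipeline timing lets the two accesses happen in the wrong order, which is why WAW and WAR cannot occur in the 5-stage MIPS pipeline.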

SLIDE 74

THANK YOU FOR YOUR ATTENTION