Pipelining (PowerPoint PPT Presentation)
Pipelining

Instruction pipelining is the use of pipelining to allow more than one instruction to be in some stage of execution at the same time. Ferranti ATLAS (1963):

  • Pipelining reduced the average time per instruction by 375% (roughly a 4x speedup)
  • Memory could not keep up with the CPU, so a cache was needed
SLIDE 2

What Is Pipelining

° Laundry Example
° Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
° Washer takes 30 minutes
° Dryer takes 40 minutes
° “Folder” takes 20 minutes

[Figure: four loads A, B, C, D]

SLIDE 3

What Is Pipelining

Sequential laundry takes 6 hours for 4 loads. If they learned pipelining, how long would laundry take?

[Figure: sequential timeline, 6 PM to midnight; each load A, B, C, D runs 30 + 40 + 20 minutes back to back]

SLIDE 4

What Is Pipelining: Start work ASAP

° Pipelined laundry takes 3.5 hours for 4 loads

[Figure: pipelined timeline, 6 PM to 9:30 PM; the washer starts the next load as soon as it empties, so the 40-minute dryer runs back to back: 30 + 4x40 + 20 minutes]

SLIDE 5

Pipelining Lessons

° Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
° Pipeline rate is limited by the slowest pipeline stage
° Multiple tasks operate simultaneously
° Potential speedup = number of pipe stages
° Unbalanced pipe-stage lengths reduce speedup
° Time to “fill” the pipeline and time to “drain” it reduce speedup

[Figure: pipelined laundry timeline, 6 PM to 9:30 PM]
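These lessons can be checked with a small timing sketch (the function names are mine, not from the slides). It reproduces the laundry numbers: 6 hours sequential, 3.5 hours pipelined, with the 40-minute dryer as the rate-limiting stage.

```python
# Minimal sketch of the laundry timing from the slides (hypothetical helpers).
def sequential_time(stage_times, n_tasks):
    """Each task runs all stages to completion before the next starts."""
    return n_tasks * sum(stage_times)

def pipelined_time(stage_times, n_tasks):
    """The pipeline advances at the rate of its slowest stage:
    fill the stages before the bottleneck, pay one slowest-stage slot
    per task, then drain the stages after the bottleneck."""
    slowest = max(stage_times)
    bottleneck = stage_times.index(slowest)
    fill = sum(stage_times[:bottleneck])        # stages before the bottleneck
    drain = sum(stage_times[bottleneck + 1:])   # stages after the bottleneck
    return fill + n_tasks * slowest + drain

wash_dry_fold = [30, 40, 20]                 # minutes, from the slides
print(sequential_time(wash_dry_fold, 4))     # 360 min = 6 hours
print(pipelined_time(wash_dry_fold, 4))      # 30 + 4*40 + 20 = 210 min = 3.5 hours
```

Note how the unbalanced stages show up directly: the 30- and 20-minute stages idle while the 40-minute dryer sets the pace.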

SLIDE 6

Models

Single-cycle model (non-overlapping)

  • The instruction executes in a single cycle
  • Every instruction and clock cycle must be stretched to the slowest instruction

Pipeline model (overlapping)

  • The instruction executes in multiple cycles
  • The clock cycle must be stretched to the slowest step
  • The throughput is nearly one clock cycle per instruction
  • Gains efficiency by overlapping the execution of multiple instructions, increasing hardware utilization

Multi-cycle model (non-overlapping)

  • The instruction executes in multiple cycles
  • The clock cycle must be stretched to the slowest step
  • Ability to share functional units within the execution of a single instruction
SLIDE 7

Review: Single-Cycle Datapath

[Figure: single-cycle datapath with PC, instruction memory, register file, sign extend, ALU, data memory, PC+4 adder, and branch-target adder; control signals RegWrite, MemWrite, ALUctl, MemRead, Branch, MemtoReg, ALUSrc, RegDst]

Harvard architecture: separate instruction and data memory. Two adders: the PC+4 adder and the branch/jump offset adder.

SLIDE 8

Review: Multi vs. Single-cycle Processor Datapath

[Figure: multi-cycle datapath with a single shared memory, IR, MDR, register file, A/B registers, ALU, and ALUOut; controls IorD, MemRead, MemWrite, IRWrite, RegDst, RegWrite, ALUSrcA, ALUSrcB, ALUOp, MemtoReg]

Multi-cycle = 1 ALU + 1 Mem + 5½ Muxes + 5 registers (IR, A, B, MDR, ALUOut) + FSM
Single-cycle = 1 ALU + 2 Mem + 4 Muxes + 2 adders + opcode decoders

Combining the adders adds 1½ muxes and 3 temporary registers (A, B, ALUOut); combining the memories adds 1 mux and 2 temporary registers (IR, MDR).

SLIDE 9

Multi-cycle Processor Datapath

[Figure: the same multi-cycle datapath as on the previous slide]

Multi-cycle = 1 ALU + 1 Mem + 5½ Muxes + 5 registers (IR, A, B, MDR, ALUOut) + FSM
Single-cycle = 1 ALU + 2 Mem + 4 Muxes + 2 adders + opcode decoders

5 x 32 = 160 additional flip-flops for the multi-cycle processor over the single-cycle processor.

SLIDE 10

Datapath Registers

[Figure: pipelined datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB registers; the fields carried between stages include PC (32 bits), IR (32), A (32), B (32), sign-extended immediate (32), RT (5), RD (5), Zero (1), ALUOut (32), MDR (32), and destination register (5)]

The datapath pipeline registers add 213 flip-flops beyond the multi-cycle model's 160; carrying the control bits down the pipe (EX = 4, M = 3, WB = 2, so 9 bits in ID/EX, 5 in EX/MEM, 2 in MEM/WB) adds 16 more.

213 + 16 = 229 additional FFs for the pipeline over the multi-cycle processor.

SLIDE 11

Single-cycle model

  • 8 ns Clock (125 MHz), (non-overlapping)
  • 1 ALU + 2 adders
  • 0 Muxes
  • 0 Datapath Register bits (Flip-Flops)

Multi-cycle model

  • 2 ns Clock (500 MHz), (non-overlapping)
  • 1 ALU + Controller
  • 5 Muxes
  • 160 Datapath Register bits (Flip-Flops)

Pipeline model

  • 2 ns Clock (500 MHz), (overlapping)
  • 2 ALU + Controller
  • 4 Muxes
  • 373 Datapath + 16 Controlpath Register bits (Flip-Flops)

Trade-off: overhead in chip area (muxes, registers, control) buys clock speed and overlap.

SLIDE 12

Comparison: CISC vs. RISC

CISC: Any instruction may reference memory | RISC: Only load/store references memory
CISC: Many instructions & addressing modes | RISC: Few instructions & addressing modes
CISC: Variable instruction formats | RISC: Fixed instruction formats
CISC: Single register set | RISC: Multiple register sets
CISC: Multi-clock-cycle instructions | RISC: Single-clock-cycle instructions
CISC: Micro-program interprets instructions | RISC: Hardware (FSM) executes instructions
CISC: Complexity is in the micro-program | RISC: Complexity is in the compiler
CISC: Little to no pipelining | RISC: Highly pipelined
CISC: Small program code size | RISC: Large program code size

SLIDE 13

Pipelining versus Parallelism

Most high-performance computers exhibit a great deal of concurrency. However, it is not desirable to call every modern computer a parallel computer. Pipelining and parallelism are two methods used to achieve concurrency. Pipelining increases concurrency by dividing a computation into a number of steps. Parallelism is the use of multiple resources to increase concurrency.

SLIDE 14

Can pipelining get us into trouble? ° Yes: Pipeline Hazards

  • structural hazards: attempt to use the same resource two different ways at the same time
  • e.g., multiple memory accesses, multiple register writes
  • solutions: multiple memories (separate instruction & data memory), stretch the pipeline
  • control hazards: attempt to make a decision before the condition is evaluated
  • e.g., any conditional branch
  • solutions: prediction, delayed branch
  • data hazards: attempt to use an item before it is ready
  • e.g., add r1,r2,r3; sub r4,r1,r5; lw r6,0(r7); or r8,r6,r9
  • solutions: forwarding/bypassing, stall/bubble
SLIDE 15

When stalls are unavoidable ° There are numerous instances where data hazards lead unavoidably to stalls!

  • perhaps the most surprising case is the simple assignment statement: A = B + C;
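A = B + C compiles to two loads, an add, and a store; the add needs the second load's value one cycle too early. A minimal cycle counter makes this concrete, assuming a classic 5-stage pipeline with full forwarding, where the only unavoidable stall is the one-cycle load-use delay (the tuple instruction format is a made-up sketch, not real MIPS encoding).

```python
# Count cycles for a 5-stage pipeline (IF ID EX MEM WB) with forwarding.
# Instruction format (hypothetical): (opcode, destination, list_of_sources).
def cycles(program):
    total = len(program) + 4            # n instructions + 4 cycles to fill/drain
    for i in range(1, len(program)):
        op_prev, dest_prev, _ = program[i - 1]
        _, _, srcs = program[i]
        if op_prev == "lw" and dest_prev in srcs:
            total += 1                  # one bubble: lw data ready only after MEM
    return total

a_eq_b_plus_c = [
    ("lw", "r1", []),             # r1 = B
    ("lw", "r2", []),             # r2 = C
    ("add", "r3", ["r1", "r2"]),  # r3 = B + C  (load-use on r2: 1 stall)
    ("sw", None, ["r3"]),         # A = r3
]
print(cycles(a_eq_b_plus_c))      # 4 + 4 + 1 = 9 cycles
```

This simplistic model only checks the immediately preceding instruction, which is enough to expose the stall in A = B + C.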

SLIDE 16

Classification of data hazards

° Assume that instruction J occurs after instruction I

  • RAW (read after write): J tries to read a source before I writes to it. J gets a possibly incorrect value
  • WAR (write after read): J tries to write a destination before it is read by I. I reads a possibly incorrect value
  • WAW (write after write): J tries to write an operand before it is written by I. The writes are performed in the wrong order

° MIPS avoids the latter two hazards in its integer pipeline
° However, they can still arise in its floating-point pipeline, since instructions have different lengths
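The I-before-J definitions above can be sketched as a small classifier over each instruction's read and write register sets (an illustration; the function and argument names are mine, not from the slides).

```python
# Classify the hazard(s) between an earlier instruction I and a later
# instruction J, given the registers each one reads and writes.
def classify(i_reads, i_writes, j_reads, j_writes):
    hazards = []
    if i_writes & j_reads:
        hazards.append("RAW")   # J reads what I writes
    if i_reads & j_writes:
        hazards.append("WAR")   # J writes what I reads
    if i_writes & j_writes:
        hazards.append("WAW")   # both write the same register
    return hazards

# add r1, r2, r3  followed by  sub r4, r1, r5: sub needs r1 from the add.
print(classify({"r2", "r3"}, {"r1"}, {"r1", "r5"}, {"r4"}))   # ['RAW']
```

With in-order single-issue integer MIPS, J can never overtake I's write, which is why only RAW matters there, as the slide notes.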
SLIDE 17

The Five Stages of Load

° Ifetch: Instruction Fetch

  • Fetch the instruction from the Instruction Memory

° Reg/Dec: Register Fetch and Instruction Decode
° Exec: Calculate the memory address
° Mem: Read the data from the Data Memory
° Wr: Write the data back to the register file

[Timing: Load occupies Ifetch, Reg/Dec, Exec, Mem, Wr in Cycles 1-5]

SLIDE 18

RISCEE 4 Architecture

[Figure: accumulator-style multi-cycle datapath with PC, IR, MDR, accumulator, and ALU; controls MemWrite, RegDst, RegWrite, ALUsrcA, ALUsrcB, PCSrc, IorD, MemRead, IRWrite; ALUop encodings: 1 = X+0, 2 = X-Y, 3 = 0+Y, 4 = 0, 5 = X+Y]

Clock = load value into register

SLIDE 19

Single Cycle, Multiple Cycle, vs. Pipeline

[Figure: timing diagrams over ten cycles. Single-cycle: each instruction (Load, Store, R-type) takes one long cycle, with waste on shorter instructions. Multi-cycle: Load runs Ifetch/Reg/Exec/Mem/Wr over five short cycles, Store over four. Pipeline: a new instruction (Load, Store, R-type) starts every short cycle.]

SLIDE 20

Why Pipeline?

° Suppose we execute 100 instructions ° Single Cycle Machine

  • 45 ns/cycle x 1 CPI x 100 inst = 4500 ns

° Multicycle Machine

  • 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns

° Ideal pipelined machine

  • 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
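The slide's arithmetic, reproduced as a short sketch:

```python
# Execution time for 100 instructions under the three models from the slide.
n = 100
single_cycle = 45 * 1 * n            # 45 ns/cycle, CPI 1
multi_cycle = 10 * round(4.6 * n)    # 10 ns/cycle, 4.6 CPI from the mix -> 460 cycles
pipelined = 10 * (1 * n + 4)         # 10 ns/cycle, CPI 1, +4 cycles to drain
print(single_cycle, multi_cycle, pipelined)   # 4500 4600 1040
```

The pipeline wins not by a faster clock than the multi-cycle machine (both are 10 ns) but by overlapping: its effective CPI approaches 1.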
SLIDE 21

Why Pipeline? Because the resources are there!

[Figure: five instructions overlapped across clock cycles; once the pipeline is full, every resource (instruction memory, data memory, register read ports, register write port, ALU) is busy each cycle instead of idle]

SLIDE 22

Can pipelining get us into trouble?

° Yes: Pipeline Hazards

  • structural hazards: attempt to use the same resource two different ways at the same time
  • E.g., a combined washer/dryer would be a structural hazard, or the folder is busy doing something else (watching TV)
  • data hazards: attempt to use an item before it is ready
  • E.g., one sock of a pair is in the dryer and one in the washer; can’t fold until the sock gets from the washer through the dryer
  • an instruction depends on the result of a prior instruction still in the pipeline
  • control hazards: attempt to make a decision before the condition is evaluated
  • E.g., washing football uniforms and needing the proper detergent level; need to see the result after the dryer before starting the next load
  • branch instructions

° Can always resolve hazards by waiting

  • pipeline control must detect the hazard
  • take action (or delay action) to resolve hazards
SLIDE 23

Single Memory (Inst & Data) is a Structural Hazard

structural hazards: attempt to use the same resource two different ways at the same time

[Figure: Load followed by three instructions in a pipeline with one memory; the Load's data access collides with a later instruction's fetch. In the resource chart, the right half of a highlight means read, the left half means write.]

Detection is easy in this case! (The previous example used separate InstMem and DataMem.)

SLIDE 24

Single Memory (Inst & Data) is a Structural Hazard

structural hazards: attempt to use the same resource two different ways at the same time. By changing the architecture from a Harvard style (separate instruction and data memory) to a von Neumann memory, we actually created a structural hazard! Structural hazards can be avoided by changing

  • hardware: the design of the architecture (splitting resources)
  • software: re-ordering the instruction sequence
  • software: delay
SLIDE 25

Pipelining

° Improve performance by increasing instruction throughput. Ideal speedup is the number of stages in the pipeline. Do we achieve this?

[Figure: three loads (lw $1, 100($0); lw $2, 200($0); lw $3, 300($0)), each passing through instruction fetch, register read, ALU, data access, and register write. Non-pipelined: one instruction every 8 ns. Pipelined: 2 ns stages; after the first instruction, one instruction completes every 2 ns.]

SLIDE 26

Figure 6.4: Stall on Branch

[Figure: beq $1, $2, 40, then add $4, $5, $6, then lw $3, 300($0); a stall is inserted after the branch, so the add starts 4 ns after the beq instead of the usual 2 ns]

SLIDE 27

Figure 6.5: Predicting branches

[Figure: top, branch predicted correctly: beq $1, $2, 40, add $4, $5, $6, and lw $3, 300($0) proceed at 2 ns intervals with no penalty. Bottom, branch mispredicted: beq $1, $2, 40, add $4, $5, $6, or $7, $8, $9; bubbles are inserted and the correct instruction starts 4 ns after the branch.]

SLIDE 28

Figure 6.6: Delayed branch

[Figure: beq $1, $2, 40 is followed by add $4, $5, $6 in the delayed branch slot; the add always executes, and lw $3, 300($0) follows 2 ns later]

SLIDE 29

Figure 6.7: Instruction pipeline

[Figure: add $s0, $t0, $t1 passing through IF, ID, EX, MEM, WB]

Pipeline stages

  • IF: instruction fetch (read)
  • ID: instruction decode and register read (read)
  • EX: execute ALU operation
  • MEM: data memory (read or write)
  • WB: write back to register

Resources

  • Mem: instruction & data memory
  • RegRead1: register read port #1
  • RegRead2: register read port #2
  • RegWrite: register write
  • ALU: ALU operation

SLIDE 30

Figure 6.8: Forwarding

[Figure: add $s0, $t0, $t1 followed by sub $t2, $s0, $t3; the add's EX result is forwarded directly to the sub's EX stage, avoiding a stall]

SLIDE 31

Figure 6.9: Load Forwarding

[Figure: lw $s0, 20($t1) followed by sub $t2, $s0, $t3; even with forwarding, the sub must stall one cycle (a bubble) because the loaded value is not available until after the lw's MEM stage]

SLIDE 32

Figure 6.9: Reordering

Original sequence:

lw $t0, 0($t1)    # $t0 = Memory[0 + $t1]
lw $t2, 4($t1)    # $t2 = Memory[4 + $t1]
sw $t2, 0($t1)    # Memory[0 + $t1] = $t2
sw $t0, 4($t1)    # Memory[4 + $t1] = $t0

Reordered (swap the loads so no store immediately follows the load it depends on):

lw $t2, 4($t1)
lw $t0, 0($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
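A small sketch of why the reordering helps, under the simplified rule that a value loaded by lw cannot be used by the immediately following instruction without a stall (in some real pipelines a MEM-to-MEM forward can cover the lw-to-sw case, so treat this as illustrative):

```python
# Count load-use stalls: a lw whose destination is read by the very next
# instruction costs one bubble. Instruction format: (op, dest, sources).
def load_use_stalls(program):
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        op_prev, dest_prev, _ = prev
        _, _, srcs = curr
        if op_prev == "lw" and dest_prev in srcs:
            stalls += 1
    return stalls

original = [
    ("lw", "$t0", []), ("lw", "$t2", []),
    ("sw", None, ["$t2"]), ("sw", None, ["$t0"]),
]
reordered = [
    ("lw", "$t2", []), ("lw", "$t0", []),
    ("sw", None, ["$t2"]), ("sw", None, ["$t0"]),
]
print(load_use_stalls(original), load_use_stalls(reordered))   # 1 0
```

Swapping the two independent loads inserts useful work between lw $t2 and sw $t2, so the compiler removes the stall for free.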

SLIDE 33

Basic Idea: split the datapath

° What do we need to add to actually split the datapath into stages?

[Figure: single-cycle datapath divided into five sections: IF (instruction fetch), ID (instruction decode/register file read), EX (execute/address calculation), MEM (memory access), WB (write back)]

SLIDE 34

Graphically Representing Pipelines

° Can help with answering questions like:

  • how many cycles does it take to execute this code?
  • what is the ALU doing during cycle 4?
  • use this representation to help understand datapaths

[Figure: lw $10, 20($1) and sub $11, $2, $3 drawn as IM, Reg, ALU, DM, Reg boxes across clock cycles CC1 to CC6]

SLIDE 35

Pipeline datapath with registers

[Figure 6.12: pipelined datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB registers inserted between the stages]

SLIDE 36

Load instruction fetch and decode

[Figure 6.13: two copies of the pipelined datapath: instruction fetch of lw (instruction memory and PC+4 active, results written into IF/ID), then instruction decode of lw (register reads and sign extension active, results written into ID/EX)]

SLIDE 37

Load instruction execution

[Figure 6.14: execution of lw: the ALU adds the base register to the sign-extended offset, and the result is written into EX/MEM]

SLIDE 38

Load instruction memory and write back

[Figure 6.15: memory stage of lw: data memory is read at the ALU-computed address into MEM/WB; write-back stage: the loaded data is written to the register file]

SLIDE 39

Store instruction execution

[Figure 6.16: execution of sw: the ALU computes the effective address while the register value to be stored is carried forward in the pipeline register]

SLIDE 40

Store instruction memory and write back

[Figure 6.17: memory stage of sw: the data value is written to memory at the computed address; the write-back stage does nothing for a store]

SLIDE 41

Load instruction: corrected datapath

[Figure 6.18: the write-register number is carried along through the pipeline registers to MEM/WB, so the write back targets the correct destination register]

SLIDE 42

Load instruction: overall usage

[Figure 6.19: the portions of the pipelined datapath used by lw in each of its five stages]

SLIDE 43

Multi-clock-cycle pipeline diagram

[Figures 6.20-21: lw $10, 20($1) and sub $11, $2, $3 across clock cycles CC1 to CC6, shown once with resource boxes (IM, Reg, ALU, DM, Reg) and once with stage names (instruction fetch, instruction decode, execution, data access, write back)]

SLIDE 44

Single-cycle #1-2

[Figure 6.22: Clock 1, instruction fetch of lw $10, 20($1); Clock 2, instruction decode of lw while sub $11, $2, $3 is fetched]

SLIDE 45

Single-cycle #3-4

[Figure 6.23: Clock 3, execution of lw $10, 20($1) while sub $11, $2, $3 is decoded; Clock 4, memory stage of lw while sub executes]

SLIDE 46

Single-cycle #5-6

[Figure 6.24: Clock 5, write back of lw $10, 20($1) while sub $11, $2, $3 is in the memory stage; Clock 6, write back of sub]

SLIDE 47

Conventional Pipelined Execution Representation

[Figure: six instructions, each drawn as IFetch, Dcd, Exec, Mem, WB, staggered by one cycle along the time axis in program order]

SLIDE 48

Structural Hazards limit performance

° Example: if there are 1.3 memory accesses per instruction and only one memory access per cycle, then

  • the average CPI is at least 1.3
  • otherwise the resource would be more than 100% utilized
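The bound follows from simple bandwidth accounting; a sketch (variable names are mine):

```python
# With one memory port, the memory can service at most one access per cycle.
# If each instruction needs 1.3 accesses on average (fetch + 0.3 data),
# the pipeline cannot sustain more than one instruction per 1.3 cycles.
accesses_per_instruction = 1.3
accesses_per_cycle = 1               # a single shared memory port
min_cpi = accesses_per_instruction / accesses_per_cycle
print(min_cpi)                       # CPI cannot drop below 1.3
```

Splitting instruction and data memory raises the port count to 2 and removes this particular bound, which is exactly the hardware fix from the earlier structural-hazard slide.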
SLIDE 49

Control Hazard Solutions

° Stall: wait until the decision is clear

  • It’s possible to move the decision up to the 2nd stage by adding hardware to check the registers as they are read

° Impact: 2 clock cycles per branch instruction => slow

[Figure: Add, Beq, Load in program order; the Load after the branch is held back until the branch resolves]

SLIDE 50

Control Hazard Solutions

° Predict: guess one direction, then back up if wrong

  • Predict not taken

° Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right ~50% of the time)
° More dynamic scheme: history of 1 branch (~90%)

[Figure: Add, Beq, Load; under predict-not-taken, the Load issues immediately after the Beq]
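The per-branch costs of these schemes follow from a simple expected-value sketch (the 2-cycle wrong-path cost and the 50%/90% accuracies are from the slides; the function is mine):

```python
# Average cycles spent per branch, given the cost when the guess is right,
# the cost when it is wrong, and the prediction accuracy.
def avg_branch_cycles(correct_cost, wrong_cost, accuracy):
    return accuracy * correct_cost + (1 - accuracy) * wrong_cost

print(avg_branch_cycles(2, 2, 1.0))   # always stall: 2 cycles per branch
print(avg_branch_cycles(1, 2, 0.5))   # static predict, ~50% right: ~1.5
print(avg_branch_cycles(1, 2, 0.9))   # 1-bit history, ~90% right: ~1.1
```

Better prediction accuracy moves the average cost toward the 1-cycle best case, which is why dynamic history schemes pay off.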

SLIDE 51

Control Hazard Solutions

° Redefine branch behavior (the branch takes place after the next instruction): “delayed branch”
° Impact: 0 clock cycles per branch instruction if we can find an instruction to put in the “slot” (~50% of the time)
° As we launch more instructions per clock cycle, this becomes less useful

[Figure: Add, Beq, Misc, Load; the Misc instruction fills the delay slot after the Beq, and the Load follows at the branch target]

SLIDE 52

Data Hazard on r1

add r1, r2, r3
sub r4, r1, r3
and r6, r1, r7
or r8, r1, r9
xor r10, r1, r11

SLIDE 53

Data Hazard on r1:

  • Dependencies backwards in time are hazards

[Figure: add r1,r2,r3 / sub r4,r1,r3 / and r6,r1,r7 / or r8,r1,r9 / xor r10,r1,r11 across IF, ID/RF, EX, MEM, WB; the sub and and read r1 before the add's WB has written it]

SLIDE 54

Data Hazard Solution:

  • “Forward” the result from one stage to another
  • The “or” is OK if we define register read/write timing properly within a cycle

[Figure: the same add/sub/and/or/xor sequence, with forwarding paths from the add's ALU output to the later instructions' EX inputs]

SLIDE 55

Forwarding (or Bypassing): What about Loads?

  • Dependencies backwards in time are hazards
  • Can’t solve with forwarding alone
  • Must delay/stall the instruction dependent on the load

[Figure: lw r1, 0(r2) followed by sub r4, r1, r3; the loaded value is not available until after MEM, one cycle too late for the sub's EX stage]

SLIDE 56

Pipelining the Load Instruction

° The five independent functional units in the pipeline datapath are:

  • Instruction Memory for the Ifetch stage
  • Register File’s Read ports (bus A and busB) for the Reg/Dec stage
  • ALU for the Exec stage
  • Data Memory for the Mem stage
  • Register File’s Write port (bus W) for the Wr stage

[Figure: three consecutive lw instructions, each occupying Ifetch, Reg/Dec, Exec, Mem, Wr in successive cycles across Cycles 1-7]

SLIDE 57

The Four Stages of R-type

° Ifetch: Instruction Fetch

  • Fetch the instruction from the Instruction Memory

° Reg/Dec: Register Fetch and Instruction Decode
° Exec:

  • ALU operates on the two register operands
  • Update PC

° Wr: Write the ALU output back to the register file

[Timing: R-type occupies Ifetch, Reg/Dec, Exec, Wr in Cycles 1-4]

SLIDE 58

Pipelining the R-type and Load Instruction

° We have pipeline conflict or structural hazard:

  • Two instructions try to write to the register file at the same time!
  • Only one write port

[Figure: a mix of R-type and Load instructions over Cycles 1-9; in one cycle an R-type's 4th-stage Wr collides with an earlier Load's 5th-stage Wr. Oops! We have a problem!]

SLIDE 59

Important Observation

° Each functional unit can only be used once per instruction ° Each functional unit must be used at the same stage for all instructions:

  • Load uses Register File’s Write Port during its 5th stage
  • R-type uses Register File’s Write Port during its 4th stage

[Timing: Load = Ifetch(1), Reg/Dec(2), Exec(3), Mem(4), Wr(5); R-type = Ifetch(1), Reg/Dec(2), Exec(3), Wr(4)]

° 2 ways to solve this pipeline hazard.

SLIDE 60

Solution 1: Insert “Bubble” into the Pipeline

° Insert a “bubble” into the pipeline to prevent 2 writes at the same cycle

  • The control logic can be complex.
  • Lose instruction fetch and issue opportunity.

° No instruction is started in Cycle 6!

[Figure: pipeline diagram over Cycles 1-9 with a bubble inserted between the Load and the following R-type instructions, so their register-file writes no longer collide]

SLIDE 61

Solution 2: Delay R-type’s Write by One Cycle

° Delay R-type’s register write by one cycle:

  • Now R-type instructions also use Reg File’s write port at Stage 5
  • Mem stage is a NOOP stage: nothing is being done.

[Figure: R-type instructions now flow through Ifetch, Reg/Dec, Exec, Mem (NOOP), Wr, so every instruction writes the register file in stage 5]

SLIDE 62

The Four Stages of Store

° Ifetch: Instruction Fetch

  • Fetch the instruction from the Instruction Memory

° Reg/Dec: Registers Fetch and Instruction Decode ° Exec: Calculate the memory address ° Mem: Write the data into the Data Memory

[Timing: Store occupies Ifetch, Reg/Dec, Exec, Mem in Cycles 1-4; Wr is a no-op]

SLIDE 63

The Three Stages of Beq

° Ifetch: Instruction Fetch

  • Fetch the instruction from the Instruction Memory

° Reg/Dec:

  • Registers Fetch and Instruction Decode

° Exec:

  • Compare the two register operands
  • Select the correct branch target address
  • Latch it into the PC

[Timing: Beq occupies Ifetch, Reg/Dec, Exec in Cycles 1-3; Mem and Wr are no-ops]

SLIDE 64

Summary: Pipelining

° What makes it easy

  • all instructions are the same length
  • just a few instruction formats
  • memory operands appear only in loads and stores

° What makes it hard?

  • structural hazards: suppose we had only one memory
  • control hazards: need to worry about branch instructions
  • data hazards: an instruction depends on a previous instruction

° We’ll build a simple pipeline and look at these issues ° We’ll talk about modern processors and what really makes it hard:

  • exception handling
  • trying to improve performance with out-of-order execution, etc.
SLIDE 65

Summary

° Pipelining is a fundamental concept

  • multiple steps using distinct resources

° Utilize capabilities of the Datapath by pipelined instruction processing

  • start next instruction while working on the current one
  • limited by length of longest stage (plus fill/flush)
  • detect and resolve hazards
SLIDE 66

Pipeline Control: Controlpath Register bits

[Figure 6.29: control signals generated during decode travel with the instruction: 9 control bits in ID/EX (EX + M + WB), 5 in EX/MEM (M + WB), 2 in MEM/WB (WB)]
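The 9/5/2 split can be checked by summing the per-stage control widths (a sketch; the widths are from the slide):

```python
# Control-signal widths consumed in each stage of the MIPS pipeline.
ex, m, wb = 4, 3, 2            # EX, MEM, and WB control bits
id_ex = ex + m + wb            # ID/EX carries everything still needed: 9 bits
ex_mem = m + wb                # EX bits are dropped after use: 5 bits
mem_wb = wb                    # only WB bits survive to the last stage: 2 bits
print(id_ex, ex_mem, mem_wb, id_ex + ex_mem + mem_wb)   # 9 5 2 16
```

The total of 16 is exactly the "+16 controlpath FFs" counted on the earlier Datapath Registers slide.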

SLIDE 67

Pipeline Control: Controlpath table

Figure 5.20 (single cycle) and Figure 6.28 (pipelined) use the same per-instruction control settings:

Instruction | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0
R-format    |   1    |   0    |    0     |    1     |    0    |    0     |   0    |   1    |   0
lw          |   0    |   1    |    1     |    1     |    1    |    0     |   0    |   0    |   0
sw          |   X    |   1    |    X     |    0     |    0    |    1     |   0    |   0    |   0
beq         |   X    |   0    |    X     |    0     |    0    |    0     |   1    |   0    |   1

In the pipelined version, RegDst, ALUOp, and ALUSrc are the ID/EX control lines; Branch, MemRead, and MemWrite the EX/MEM lines; RegWrite and MemtoReg the MEM/WB lines.

SLIDE 68

Pipeline Hazards

  • Solution #1 always works (for non-realtime applications): stall, delay & procrastinate!

Structural Hazards (i.e., fetching from the same memory bank)

  • Solution #2: partition the architecture

Control Hazards (i.e., branching)

  • Solution #1: stall! but decreases throughput
  • Solution #2: guess and back-track
  • Solution #3: delayed decision: delay the branch & fill the slot

Data Hazards (i.e., register dependencies)

  • Worst-case situation
  • Solution #2: re-order instructions
  • Solution #3: forwarding or bypassing: delayed load

slide-69
SLIDE 69

69

[Figure: pipelined datapath and controlpath. The IF/ID, ID/EX, EX/MEM, and MEM/WB registers sit between the PC/instruction memory, register file, ALU, and data memory; the EX, M, and WB control bits are carried alongside the data.]

Pipeline Datapath and Controlpath

slide-70
SLIDE 70

70

[Figure: the same pipelined datapath, single-stepping a load instruction.]

Clock 0: PC=0
Clock 1: PC=4; IR=lw $10,20($1)
Clock 2: PC=4; A=C$1, B=X, S=20, T=$10, D=0; WB=11, M=010, EX=0001
Clock 3: PC=4+20<<2; ALU=20+C$1, D=$10; WB=11, M=010
Clock 4: MDR=Mem[20+C$1], D=$10; WB=11

load inst.

slide-72
SLIDE 72

72

Clock | IF/ID <PC, IR> | ID/EX <PC, A, B, S, Rt, Rd> | EX/MEM <PC, Z, ALU, B, R> | MEM/WB <MDR, ALU, R>
  0   | <0,?>                | <?,?,?,?,?,?>                  | <?,?,?,?,?>                     | <?,?,?>
  1   | <4, lw $10,20($1)>   | <0,?,?,?,?,?>                  | <?,?,?,?,?>                     | <?,?,?>
  2   | <8, sub $11,$2,$3>   | <4, C$1→3, C$10→8, 20, $10, 0> | <0,?,?,?,?>                     | <?,?,?>
  3   | <12, and $12,$4,$5>  | <8, C$2→4, C$3→4, X, $3, $11>  | <4+20<<2→84, 0, 20+3→23, 8, $10>| <?,?,?>
  4   | <16, or $13,$6,$7>   | <12, C$4→6, C$5→7, X, $5, $12> | <X, 1, 4-4=0, 4, $11>           | <Mem[23]→9, 23, $10>
  5   | <20, add $14,$8,$9>  | <16, C$6, C$7, X, $7, $13>     | <X, 0, 1, 7, $12>               | <X, 0, $11>

Assumptions: contents of register 1 = C$1 = 3; C$2=4; C$3=4; C$4=6; C$5=7; C$10=8; …; Memory[23]=9.
Formats: add $rd,$rs=A,$rt=B; lw $rt=B,@($rs=A)

Pipeline single stepping

slide-73
SLIDE 73

73

[Figure 6.31a: pipelined datapath with control, clock 1.]

IF: lw $10, 20($1)  |  ID: before<1>  |  EX: before<2>  |  MEM: before<3>  |  WB: before<4>

Clock 1: PC=0 → 4; IF/ID latches IR=lw $10,20($1)

(One clock later: IF: sub $11, $2, $3 | ID: lw $10, 20($1) | EX: before<1> | MEM: before<2> | WB: before<3>)

Clock 1: Figure 6.31a

slide-74
SLIDE 74

74

[Figure 6.31b: pipelined datapath with control, clock 2.]

IF: sub $11, $2, $3  |  ID: lw $10, 20($1)  |  EX: before<1>  |  MEM: before<2>  |  WB: before<3>

Clock 2: IF/ID: PC=8, IR=sub $11,$2,$3; ID/EX: PC=4, A=C$1, B=X, S=20, T=$10, D=0

Figure 6.31b

slide-75
SLIDE 75

75

[Figure 6.32a: pipelined datapath with control, clock 3.]

IF: and $12, $4, $5  |  ID: sub $11, $2, $3  |  EX: lw $10, . . .  |  MEM: before<1>  |  WB: before<2>

Clock 3: IF/ID: PC=12, IR=and $12,$4,$5; ID/EX: PC=8, A=C$2, B=C$3, S=X, T=$3, D=$11; EX/MEM: PC=4+20<<2, ALU=20+C$1, D=$10

slide-76
SLIDE 76

76

[Figure 6.32b: pipelined datapath with control, clock 4.]

IF: or $13, $6, $7  |  ID: and $12, $4, $5  |  EX: sub $11, . . .  |  MEM: lw $10, . . .  |  WB: before<1>

Clock 4: IF/ID: PC=16, IR=or $13,$6,$7; ID/EX: PC=12, A=C$4, B=C$5, S=X, T=$5, D=$12; EX/MEM: PC=X, ALU=C$2-C$3, D=$11; MEM/WB: MDR=Mem[20+C$1], D=$10

Clock 4: Figure 6.32b

slide-77
SLIDE 77

77

Data Dependencies: that can be resolved by forwarding

Program execution order (in instructions), CC 1 through CC 9:
  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)

Value of register $2: 10 10 10 10 10/–20 –20 –20 –20 –20

[Pipeline diagram: each instruction passes through IM, Reg, DM, Reg in successive cycles.]

Forward in time: not a hazard. At the same time: not a hazard. Data hazards: resolved by forwarding.

slide-78
SLIDE 78

78

Data Hazards: arithmetic

Program execution order (in instructions), CC 1 through CC 9:
  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)

Value of register $2: 10 10 10 10 10/–20 –20 –20 –20 –20
Value of EX/MEM:       X  X  X –20   X   X   X   X   X
Value of MEM/WB:       X  X  X   X –20   X   X   X   X

Forwards in time: can be resolved. At the same time: not a hazard.

slide-79
SLIDE 79

79

Data Dependencies: no forwarding

  sub $2,$1,$3:   IF(1) ID(2) EX(3) M(4) WB(5)
  and $12,$2,$5:  IF(2) ID(3) Stall ID(4) Stall ID(5) EX(6) M(7) WB(8)

The register file writes in the 1st half of the clock and reads in the 2nd half, so the ID in clock 5 can read the value that WB writes in the same clock.

Suppose every instruction is dependent: 1 + 2 stalls = 3 clocks
MIPS = Clock/CPI = 500 MHz / 3 = 167 MIPS

slide-80
SLIDE 80

80

A dependent instruction takes 1 + 2 stalls = 3 clocks.
An independent instruction takes 1 + 0 stalls = 1 clock.
Suppose 10% of the time the instructions are dependent:
Average instruction time = 10%*3 + 90%*1 = 0.10*3 + 0.90*1 = 1.2 clocks

MIPS = Clock/CPI:
  500 MHz / 3   = 167 MIPS (100% dependency)
  500 MHz / 1.2 = 417 MIPS (10% dependency)
  500 MHz / 1   = 500 MIPS (0% dependency)

Data Dependencies: no forwarding
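The dependency arithmetic on this slide is easy to check mechanically. A minimal Python sketch, assuming (as the slide does) a fixed 2-stall penalty per dependent instruction and a 500 MHz clock; the function names are illustrative:

```python
def avg_cpi(dep_fraction, stall_cycles=2):
    """Average clocks per instruction when a fraction of the
    instructions pays a fixed stall penalty (no forwarding)."""
    return dep_fraction * (1 + stall_cycles) + (1 - dep_fraction) * 1

def mips(clock_mhz, cpi):
    """Millions of instructions per second = clock rate / CPI."""
    return clock_mhz / cpi

print(avg_cpi(1.0))                     # 3.0 clocks (100% dependent)
print(avg_cpi(0.10))                    # 1.2 clocks (10% dependent)
print(round(mips(500, avg_cpi(0.10))))  # 417 MIPS
```

The three MIPS figures on the slide fall out of the same two lines by varying `dep_fraction` between 0, 0.10, and 1.0.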

slide-81
SLIDE 81

81

Data Dependencies: with forwarding

  sub $2,$1,$3:   IF(1) ID(2) EX(3) M(4) WB(5)
  and $12,$2,$5:  IF(2) ID(3) EX(4) M(5) WB(6)

Detected data hazard 1a: ID/EX.$rs = EX/MEM.$rd, so the ALU result is forwarded.

Suppose every instruction is dependent: 1 + 0 stalls = 1 clock
MIPS = Clock/CPI = 500 MHz / 1 = 500 MIPS

slide-82
SLIDE 82

82

Data Dependencies: Hazard Conditions

A data hazard occurs whenever a data source needs a previous result that is not yet available at its data destination. Data hazard detection always compares a destination with a source.

Hazard type: Source = Destination
  1a. ID/EX.$rs = EX/MEM.$rd
  1b. ID/EX.$rt = EX/MEM.$rd
  2a. ID/EX.$rs = MEM/WB.$rd
  2b. ID/EX.$rt = MEM/WB.$rd

Example:
  sub $2, $1, $3    sub $rd, $rs, $rt
  and $12, $2, $5   and $rd, $rs, $rt

slide-83
SLIDE 83

83

Data Dependencies: Hazard Conditions

1a Data Hazard: EX/MEM.$rd = ID/EX.$rs
  sub $2, $1, $3    sub $rd, $rs, $rt
  and $12, $2, $5   and $rd, $rs, $rt

1b Data Hazard: EX/MEM.$rd = ID/EX.$rt
  sub $2, $1, $3    sub $rd, $rs, $rt
  and $12, $1, $2   and $rd, $rs, $rt

2a Data Hazard: MEM/WB.$rd = ID/EX.$rs
  sub $2, $1, $3    sub $rd, $rs, $rt
  and $12, $1, $5   and $rd, $rs, $rt
  or  $13, $2, $1   or  $rd, $rs, $rt

2b Data Hazard: MEM/WB.$rd = ID/EX.$rt
  sub $2, $1, $3    sub $rd, $rs, $rt
  and $12, $1, $5   and $rd, $rs, $rt
  or  $13, $6, $2   or  $rd, $rs, $rt

slide-84
SLIDE 84

84

Data Dependencies: Worst case

  sub $2, $1, $3    sub $rd, $rs, $rt
  and $12, $2, $2   and $rd, $rs, $rt
  or  $13, $2, $2   or  $rd, $rs, $rt

Data Hazard 1a: EX/MEM.$rd = ID/EX.$rs
Data Hazard 1b: EX/MEM.$rd = ID/EX.$rt
Data Hazard 2a: MEM/WB.$rd = ID/EX.$rs
Data Hazard 2b: MEM/WB.$rd = ID/EX.$rt
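The four forwarding conditions are just equality tests on register numbers, so they can be sketched in a few lines of Python (field names are illustrative, not from any real simulator):

```python
def hazards(id_ex, ex_mem_rd, mem_wb_rd):
    """Return which of the 1a/1b/2a/2b forwarding conditions fire.
    id_ex = (rs, rt) of the instruction now in EX;
    ex_mem_rd, mem_wb_rd = destination registers of the two older
    instructions (None if they write no register)."""
    rs, rt = id_ex
    found = []
    if ex_mem_rd == rs: found.append("1a")
    if ex_mem_rd == rt: found.append("1b")
    if mem_wb_rd == rs: found.append("2a")
    if mem_wb_rd == rt: found.append("2b")
    return found

# Worst case from the slide: sub $2,$1,$3 ; and $12,$2,$2 ; or $13,$2,$2
# When "and" is in EX, sub's $rd=2 is in EX/MEM -> hazards 1a and 1b:
print(hazards(id_ex=(2, 2), ex_mem_rd=2, mem_wb_rd=None))   # ['1a', '1b']
# One clock later "or" is in EX, sub's $rd=2 is in MEM/WB -> 2a and 2b:
print(hazards(id_ex=(2, 2), ex_mem_rd=12, mem_wb_rd=2))     # ['2a', '2b']
```

A real forwarding unit would also suppress the match when the older instruction does not write a register (modeled here by passing `None`) and would give 1a/1b priority over 2a/2b.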

slide-85
SLIDE 85

85

Data Dependencies: Hazard Conditions

Hazard type: Source = Destination
  1a. ID/EX.$rs = EX/MEM.$rd
  1b. ID/EX.$rt = EX/MEM.$rd
  2a. ID/EX.$rs = MEM/WB.$rd
  2b. ID/EX.$rt = MEM/WB.$rd

Pipeline registers compared: the source fields $rs and $rt in ID/EX against the destination field $rd in EX/MEM and in MEM/WB.

slide-86
SLIDE 86

86

[Figure: (a) no forwarding, the register file feeds the ALU directly; (b) with forwarding, a forwarding unit compares the Rs/Rt fields in ID/EX against EX/MEM.RegisterRd and MEM/WB.RegisterRd and drives the ForwardA/ForwardB muxes at the ALU inputs.]

slide-87
SLIDE 87

87

Data Hazards: Loads

Program execution order (in instructions), CC 1 through CC 9:
  lw  $2, 20($1)
  and $4, $2, $5
  or  $8, $2, $6
  add $9, $4, $2
  slt $1, $6, $7

Backwards in time: cannot be resolved. Forwards in time: cannot be resolved. At the same time: not a hazard.

slide-88
SLIDE 88

88

Data Hazards: load stalling

Program execution order (in instructions), CC 1 through CC 10:
  lw  $2, 20($1)
  and $4, $2, $5   (stalled one cycle: a bubble is inserted)
  or  $8, $2, $6
  add $9, $4, $2
  slt $1, $6, $7

slide-89
SLIDE 89

89

Data Hazards: Hazard detection unit (page 490)

Stall condition: ID/EX.MemRead = 1 ∧ (ID/EX.$rt = IF/ID.$rs ∨ ID/EX.$rt = IF/ID.$rt)

Stall example:
  lw  $2, 20($1)    lw $rt, addr($rs)
  and $4, $2, $5    and $rd, $rs, $rt

No-stall example (only need to look at the next instruction):
  lw  $2, 20($1)    lw $rt, addr($rs)
  and $4, $1, $5    and $rd, $rs, $rt
  or  $8, $2, $6    or $rd, $rs, $rt
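The load-use stall condition on this slide translates directly into code. A minimal sketch (the parameter names are illustrative renderings of the pipeline-register fields):

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Load-use hazard: the instruction in EX is a load (MemRead = 1)
    and its destination register ($rt) is a source of the
    instruction currently in ID."""
    return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)

# Stall example:    lw $2, 20($1) ; and $4, $2, $5
print(must_stall(True, 2, if_id_rs=2, if_id_rt=5))   # True
# No-stall example: lw $2, 20($1) ; and $4, $1, $5
print(must_stall(True, 2, if_id_rs=1, if_id_rt=5))   # False
```

Only the instruction immediately behind the load needs this check; one instruction later the loaded value can be forwarded from MEM/WB.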
slide-90
SLIDE 90

90

Data Hazards: Hazard detection unit (page 490)

No-stall example (only need to look at the next instruction):
  lw  $2, 20($1)    lw $rt, addr($rs)
  and $4, $1, $5    and $rd, $rs, $rt
  or  $8, $2, $6    or $rd, $rs, $rt

What is the average number of clocks for the load? Assume half of the load instructions are immediately followed by an instruction that uses the result:
  load instruction time = 50%*(1 clock) + 50%*(2 clocks) = 1.5 clocks

slide-91
SLIDE 91

91

Hazard Detection Unit: when to stall

[Figure: hazard detection unit in the pipeline. It watches ID/EX.MemRead and compares ID/EX.RegisterRt with IF/ID.RegisterRs and IF/ID.RegisterRt; on a hit it deasserts PCWrite and IF/IDWrite to stall. The forwarding unit separately compares EX/MEM.RegisterRd and MEM/WB.RegisterRd against the Rs/Rt fields.]

slide-92
SLIDE 92

92

Data Dependency Units

Stall condition (source vs. destination):
  ID/EX.MemRead = 1 ∧ (ID/EX.$rt = IF/ID.$rs ∨ ID/EX.$rt = IF/ID.$rt)

Forwarding conditions (source vs. destination):
  ID/EX.$rs or ID/EX.$rt = EX/MEM.$rd
  ID/EX.$rs or ID/EX.$rt = MEM/WB.$rd

slide-93
SLIDE 93

93

Data Dependency Units: pipeline register fields compared

  Forwarding comparisons: $rs and $rt in ID/EX vs. $rd in EX/MEM and in MEM/WB
  Stalling comparisons: $rs and $rt in IF/ID vs. $rt in ID/EX, when ID/EX.MemRead = 1

slide-94
SLIDE 94

94

Branch Hazards: Soln #1, Stall until Decision made (fig. 6.4)

Program execution order (in instructions), 2 ns per stage:
  beq $1, $2, 40:  Instruction fetch | Reg | ALU | Data access | Reg
  add $4, $5, $6:  (4 ns stall) Instruction fetch | Reg | ALU | Data access | Reg
  lw $3, 300($0):  Instruction fetch | Reg | ALU | Data access | Reg

Decision made in ID stage: do the load.

  @3C: add $4, $5, $6
  @40: beq $1, $3, 7
  @44: and $12, $2, $5
  @48: or  $13, $6, $2
  @4C: add $14, $2, $2
  @50: lw  $4, 50($7)

Soln #1: stall until the decision is made.

slide-95
SLIDE 95

95

Branch Hazards: Soln #2, Predict until Decision made

  beq $1,$3,7:    IF(1) ID(2) EX(3) M(4) WB(5)
  and $12,$2,$5:  IF(2), then discarded
  lw $4,50($7):   IF(3) ID(4) EX(5) M(6) WB(7)

Predict branch not taken. Decision made in ID stage: discard the "and $12,$2,$5" instruction and branch.

slide-96
SLIDE 96

96

Branch Hazards: Soln #3, Delayed Decision

  beq $1,$3,7:   IF(1) ID(2) EX(3) M(4) WB(5)
  add $4,$6,$6:  IF(2) ID(3) EX(4) M(5) WB(6)   (moved into the delay slot from before the branch)
  lw $4,50($7):  IF(3) ID(4) EX(5) M(6) WB(7)

Move an instruction from before the branch into the delay slot. Decision made in ID stage: branch. No need to discard the delay-slot instruction.

slide-97
SLIDE 97

97

Branch Hazards: Soln #3, Delayed Decision

  beq $1,$3,7:    IF(1) ID(2) EX(3) M(4) WB(5)
  and $12,$2,$5:  IF(2) ID(3) EX(4) M(5) WB(6)   (fills the delay slot)
  lw $4,50($7):   IF(3) ID(4) EX(5) M(6) WB(7)

Decision made in ID stage: do the branch.

slide-98
SLIDE 98

98

Branch Hazards: Decision made in the ID stage (figure 6.4)

  beq $1,$3,7:   IF(1) ID(2) EX(3) M(4) WB(5)
  nop:           IF(2) ID(3) EX(4) M(5) WB(6)
  lw $4,50($7):  IF(3) ID(4) EX(5) M(6) WB(7)

No decision yet in clock 2: insert a nop. Decision: do the load.

slide-99
SLIDE 99

99

Branch Hazards: Soln #2, Predict until Decision made

Program execution order (in instructions), CC 1 through CC 9:
  40 beq $1, $3, 7
  44 and $12, $2, $5
  48 or  $13, $6, $2
  52 add $14, $2, $2
  72 lw  $4, 50($7)

Predict branch not taken. Branch decision made in MEM stage: discard the in-flight values on a wrong prediction. Same effect as 3 stalls.

slide-100
SLIDE 100

100

Figure 6.51

[Figure: pipeline with early branch hardware. A comparator (=) and a shift-left-2 target adder are moved into the ID stage, and an IF.Flush signal clears IF/ID; the hazard detection and forwarding units are unchanged.]

Early branch comparison. Flush: on a wrong prediction, insert nops.

slide-101
SLIDE 101

101

Performance

Load: assume half of the load instructions are immediately followed by an instruction that uses the result.
  load instruction time = 50%*(1 clock) + 50%*(2 clocks) = 1.5 clocks
Branch: assume the misprediction penalty is 1 clock cycle and that 25% of the branches are mispredicted.
  branch time = 75%*(1 clock) + 25%*(2 clocks) = 1.25 clocks
Jump: assume jumps always pay 1 full clock cycle of delay (stall).
  jump instruction time = 2 clocks

slide-102
SLIDE 102

102

Performance, page 504

Instruction class | Pipeline cycles            | Mix
  loads           | 1.5 (50% dependency)       | 23%
  stores          | 1                          | 13%
  arithmetic      | 1                          | 43%
  branches        | 1.25 (25% misprediction)   | 19%
  jumps           | 2                          |  2%

                 | Pipeline        | Single-Cycle   | Multi-Cycle
  Clock speed    | 500 MHz (2 ns)  | 125 MHz (8 ns) | 500 MHz (2 ns)
  CPI            | 1.18            | 1              | 4.02
  MIPS           | 424             | 125            | 125

Multi-cycle clocks per class: loads 5, stores 4, arithmetic 4, branches 3, jumps 3.

Pipeline CPI = Σ Cycles*Mix; MIPS = Clock/CPI
  load instruction time = 50%*(1 clock) + 50%*(2 clocks) = 1.5
  branch time = 75%*(1 clock) + 25%*(2 clocks) = 1.25

CPI*clock is also the instruction latency within the pipeline; MIPS measures pipeline throughput.
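The weighted CPI sum is quick to verify. A sketch using the per-class cycles and mix from this slide:

```python
# (cycles, fraction of instruction mix) per class, from the slide
mix = {
    "loads":      (1.5,  0.23),   # 50% load-use dependency
    "stores":     (1.0,  0.13),
    "arithmetic": (1.0,  0.43),
    "branches":   (1.25, 0.19),   # 25% misprediction
    "jumps":      (2.0,  0.02),
}

cpi = sum(cycles * frac for cycles, frac in mix.values())  # sum of Cycles*Mix
mips = 500 / cpi                                           # 500 MHz clock
print(round(cpi, 4))   # 1.1825
print(round(mips, 1))
```

The exact CPI is 1.1825 and 500/1.1825 ≈ 422.8 MIPS; the slide's 424 comes from rounding CPI to 1.18 before dividing.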

slide-103
SLIDE 103

103

Branch prediction ° Datapath parallelism only useful if you can keep it fed. ° Easy to fetch multiple (consecutive) instructions per cycle

  • essentially speculating on sequential flow

° Jump: unconditional change of control flow

  • Always taken

° Branch: conditional change of control flow

  • Taken about 50% of the time
  • Backward: 30% x 80% taken
  • Forward: 70% x 40% taken
slide-104
SLIDE 104

104

A Big Idea for Today

° Reactive: past actions cause the system to adapt its behavior

  • do what you did before better
  • ex: caches
  • TCP windows
  • URL completion, ...

° Proactive: uses past actions to predict future actions

  • optimize speculatively, anticipate what you are about to do
  • branch prediction
  • long cache blocks
  • ???
slide-105
SLIDE 105

105

Case for Branch Prediction when Issuing N instructions per clock cycle

  • 1. Branches will arrive up to n times faster in an n-issue processor.
  • 2. Amdahl’s Law => the relative impact of control stalls is larger with the lower potential CPI of an n-issue processor.

Conversely, branch prediction is needed to ‘see’ the potential parallelism.

slide-106
SLIDE 106

106

Branch Prediction Schemes

  • 0. Static Branch Prediction

° 1-bit Branch-Prediction Buffer ° 2-bit Branch-Prediction Buffer ° Correlating Branch Prediction Buffer ° Tournament Branch Predictor ° Branch Target Buffer ° Integrated Instruction Fetch Units ° Return Address Predictors

slide-107
SLIDE 107

107

Dynamic Branch Prediction

° Performance = ƒ(accuracy, cost of misprediction)
° Branch History Table: the lower bits of the PC address index a table of 1-bit values

  • Says whether or not the branch was taken last time
  • No address check: saves hardware, but the entry may belong to a different branch
  • If the instruction is a branch, update the table with the outcome

° Problem: in a loop, a 1-bit BHT causes 2 mispredictions per pass

  • At the end of the loop, when it exits instead of looping as before
  • On the first iteration of the next pass, when it predicts exit instead of looping
  • With an average of 9 iterations before exit: only about 80% accuracy, even though the branch is taken about 90% of the time

° Local history

  • This particular branch instruction
  • Or another branch that maps into the same slot (PC aliasing)
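The two-mispredictions-per-pass claim is easy to reproduce. A sketch of a single 1-bit history entry on a 9-iteration loop (the function is illustrative, not a full BHT):

```python
def mispredicts_1bit(outcomes, init=True):
    """Count mispredictions of one 1-bit history entry:
    predict the last outcome seen, then record the new outcome."""
    pred, miss = init, 0
    for taken in outcomes:
        if pred != taken:
            miss += 1
        pred = taken
    return miss

# One pass of a 9-iteration loop: taken 8 times, then falls out.
loop = [True] * 8 + [False]
# Run 10 passes: after warm-up, both the loop exit and the first
# branch of the next pass mispredict, i.e. 2 misses per pass.
print(mispredicts_1bit(loop * 10))   # 19 misses out of 90 branches
```

That is (90 − 19)/90 ≈ 79% accuracy, matching the slide's "only 80% accuracy even though the branch is taken about 90% of the time".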

slide-108
SLIDE 108

108

° 2-bit scheme: change the prediction only after two consecutive mispredictions
° Red: stop, not taken
° Green: go, taken
° Adds hysteresis to the decision-making process
° Generalizes to an n-bit saturating counter

2-bit Dynamic Branch Prediction

[State diagram: Predict Taken (strong) ⇄ Predict Taken (weak) ⇄ Predict Not Taken (weak) ⇄ Predict Not Taken (strong); T edges move toward the taken states, NT edges toward the not-taken states.]
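The 2-bit scheme removes one of the two per-pass misses from the 1-bit case. A sketch with states 0 to 3, predicting taken when the counter is 2 or more:

```python
def mispredicts_2bit(outcomes, counter=3):
    """2-bit saturating counter: predict taken iff counter >= 2;
    increment on taken, decrement on not taken, clamped to 0..3."""
    miss = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            miss += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return miss

loop = [True] * 8 + [False]          # one pass of a 9-iteration loop
print(mispredicts_2bit(loop * 10))   # 10: only the exit mispredicts
```

The exit only drops the counter from 3 to 2, so the first branch of the next pass is still predicted taken; that is the hysteresis the slide describes.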

slide-109
SLIDE 109

109

Consider 3 Scenarios

° Branch for a loop test
° Check for error or exception
° Alternating taken / not-taken

  • example?

° Your worst-case prediction scenario
° How could HW predict “this loop will execute 3 times” using a simple mechanism?

(Hint: a global history of taken predictors.)

slide-110
SLIDE 110

110

Correlating Branches

Idea: the taken/not-taken behavior of recently executed branches is related to the behavior of the next branch (as well as that branch’s own history).

  • The behavior of recent branches then selects between, say, 4 predictions of the next branch, updating just that prediction.

° (2,2) predictor: 2-bit global, 2-bit local

[Figure: the branch address (4 bits) indexes rows of 2-bit local predictors; the 2-bit recent global branch history (01 = not taken then taken) selects which column supplies the prediction.]
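A (2,2) predictor keeps four 2-bit counters per branch entry, with the 2-bit global history choosing among them. A compact sketch (the class name and 16-entry table size are illustrative):

```python
class TwoTwoPredictor:
    """(2,2) correlating predictor: 2 bits of global history select
    one of four 2-bit saturating counters per branch-address entry."""
    def __init__(self, entries=16):
        self.table = [[1] * 4 for _ in range(entries)]  # weakly not-taken
        self.ghist = 0                                   # 2-bit global history
        self.entries = entries

    def predict(self, pc):
        return self.table[pc % self.entries][self.ghist] >= 2

    def update(self, pc, taken):
        ctrs = self.table[pc % self.entries]
        ctrs[self.ghist] = min(3, ctrs[self.ghist] + 1) if taken \
                           else max(0, ctrs[self.ghist] - 1)
        self.ghist = ((self.ghist << 1) | taken) & 0b11

p = TwoTwoPredictor()
# An alternating taken/not-taken branch: each history pattern (01 vs 10)
# trains its own counter, so the predictor locks on after warm-up.
hits = 0
for i in range(100):
    taken = (i % 2 == 0)
    hits += (p.predict(0x40) == taken)
    p.update(0x40, taken)
print(hits)   # 98: only two warm-up mispredictions
```

A plain 2-bit counter would hover near 50% on this alternating pattern; the global history is what makes it predictable.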

slide-111
SLIDE 111

111

[Chart: frequency of mispredictions (0%–18%) on nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, and li for three schemes: 4,096 entries at 2 bits/entry, unlimited entries at 2 bits/entry, and a 1,024-entry (2,2) predictor.]

Accuracy of Different Schemes

4096-entry 2-bit BHT vs. unlimited-entry 2-bit BHT vs. 1024-entry (2,2) BHT.
What’s missing in this picture?

slide-112
SLIDE 112

112

Re-evaluating Correlation

° Several of the SPEC benchmarks have less than a dozen branches responsible for 90% of taken branches:

  program    branch %   static #   # = 90%
  compress   14%        236        13
  eqntott    25%        494        5
  gcc        15%        9531       2020
  mpeg       10%        5598       532
  real gcc   13%        17361      3214

° Real programs + OS are more like gcc
° Small benefits beyond benchmarks for correlation? Problems with branch aliases?

slide-113
SLIDE 113

113

BHT Accuracy

° Mispredict because either:

  • Wrong guess for that branch
  • Got branch history of wrong branch when index the table

° With a 4096-entry table, programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12% ° For SPEC92, 4096 entries are about as good as an infinite table

slide-114
SLIDE 114

114

Dynamically finding structure in Spaghetti

?

slide-115
SLIDE 115

115

Tournament Predictors

° The motivation for correlating branch predictors is that the 2-bit predictor failed on important branches; adding global information improved performance
° Tournament predictors: use 2 predictors, one based on global information and one based on local information, and combine them with a selector
° Use the predictor that tends to guess correctly

[Figure: the address and the history feed Predictor A and Predictor B; a selector chooses between their predictions.]

slide-116
SLIDE 116

116

Tournament Predictor in Alpha 21264

° 4K 2-bit counters choose between a global predictor and a local predictor
° The global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor

  • 12-bit pattern: ith bit 0 => ith prior branch not taken; ith bit 1 => ith prior branch taken

° The local predictor consists of a 2-level predictor:

  • Top level: a local history table of 1024 10-bit entries; each 10-bit entry records the most recent 10 outcomes for its branch, so patterns of up to 10 branches can be discovered and predicted
  • Next level: the selected entry from the local history table indexes a table of 1K 3-bit saturating counters, which provide the local prediction

° Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180,000 transistors)

slide-117
SLIDE 117

117

% of predictions from local predictor in Tournament Prediction Scheme

98% 100% 94% 90% 55% 76% 72% 63% 37% 69% 0% 20% 40% 60% 80% 100%

nasa7 matrix300 tomcatv doduc spice fpppp gcc espresso eqntott li

slide-118
SLIDE 118

118

[Chart, fig 3.40: branch prediction accuracy on gcc, espresso, li, fpppp, doduc, and tomcatv for profile-based, 2-bit counter, and tournament predictors; accuracies range roughly from 70% to 100%, with the tournament predictor best overall.]

Accuracy of Branch Prediction

fig 3.40

slide-119
SLIDE 119

119

Accuracy v. Size (SPEC89)

[Chart: misprediction rate (0%–10%) vs. total predictor size (8–128 Kbits) for local, correlating, and tournament predictors.]

slide-120
SLIDE 120

120

GSHARE

° A good simple predictor ° Sprays predictions for a given address across a large table for different histories

index (n bits) = address XOR global history

slide-121
SLIDE 121

121

Need Address At Same Time as Prediction

° Branch Target Buffer (BTB): the address of the branch indexes the table to get the prediction AND the branch target address (if taken)

  • Note: must check for a branch-address match, since we can’t use the wrong branch’s target address

(Figure 3.19, 3.20)

[Figure: the PC of the instruction in FETCH indexes a table of {Branch PC, Predicted PC, extra prediction state bits}; a comparator (=?) checks the fetch PC against the stored branch PC.
  Yes: the instruction is a branch; use the predicted PC as the next PC.
  No: the branch is not predicted; proceed normally (next PC = PC+4).]

slide-122
SLIDE 122

122

° Avoid branch prediction by turning branches into conditionally executed instructions: if (x) then A = B op C else NOP

  • If false, neither store the result nor cause an exception
  • The expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have conditional move; PA-RISC can annul any following instruction
  • IA-64: 64 1-bit predicate fields can be selected, allowing conditional execution of any instruction
  • This transformation is called “if-conversion”

° Drawbacks of conditional instructions

  • Still takes a clock even if “annulled”
  • Stall if the condition is evaluated late
  • Complex conditions reduce effectiveness; the condition becomes known late in the pipeline

Predicated Execution
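If-conversion replaces control flow with data flow: both "paths" are computed and the predicate selects the result. A branch-free sketch of `if (x) A = B op C` using a mask, with Python integers standing in for predicated hardware (works for non-negative values here):

```python
def predicated_add(x, a, b, c):
    """A = B + C if predicate x, else A unchanged.
    No branch: both values are computed, the mask selects one."""
    mask = -int(bool(x))             # all-ones if x is true, zero otherwise
    return (mask & (b + c)) | (~mask & a)

print(predicated_add(1, a=7, b=2, c=3))   # 5  (predicate true)
print(predicated_add(0, a=7, b=2, c=3))   # 7  (predicate false)
```

This mirrors the slide's trade-off: the `b + c` work is done whether or not the predicate holds, which is the "still takes a clock even if annulled" cost.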

slide-123
SLIDE 123

123

Special Case: Return Addresses

° Register-indirect branches are hard to predict the target address of
° In SPEC89, 85% of such branches are procedure returns
° Since procedures follow a stack discipline, save return addresses in a small buffer that acts like a stack: 8 to 16 entries gives a small miss rate
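The return-address predictor really is just a small stack: push on call, pop on return. A sketch with the 8-entry depth the slide suggests (the class name is illustrative):

```python
class ReturnAddressStack:
    """Small fixed-depth stack that predicts the target of
    register-indirect returns: push on call, pop on return."""
    def __init__(self, depth=8):
        self.stack, self.depth = [], depth

    def call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)            # overflow: drop the oldest entry
        self.stack.append(return_pc)

    def predict_return(self):
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.call(0x400)                          # main calls f, returns to 0x400
ras.call(0x520)                          # f calls g, returns to 0x520
print(hex(ras.predict_return()))         # 0x520
print(hex(ras.predict_return()))         # 0x400
```

Mispredictions only occur when call depth exceeds the buffer (old entries were dropped) or when the program breaks call/return discipline, which is why 8 to 16 entries already gives a small miss rate.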

slide-124
SLIDE 124

124

Pitfall: Sometimes bigger and dumber is Better

° The 21264 uses a tournament predictor (29 Kbits) ° The earlier 21164 uses a simple 2-bit predictor with 2K entries (a total of 4 Kbits) ° On the SPEC95 benchmarks, the 21264 outperforms:

  • 21264 avg. 11.5 mispredictions per 1000 instructions
  • 21164 avg. 16.5 mispredictions per 1000 instructions

° Reversed for transaction processing (TP)!

  • 21264 avg. 17 mispredictions per 1000 instructions
  • 21164 avg. 15 mispredictions per 1000 instructions

° TP code is much larger, and the 21164 holds 2X as many branch predictions based on local behavior (2K vs. 1K local predictor in the 21264)

slide-125
SLIDE 125

125

A “zero-cycle” jump

° What really has to be done at runtime?

  • Once an instruction has been detected as a jump or JAL, we might recode it in the internal cache.
  • A very limited form of dynamic compilation?

° Use of a “pre-decoded” instruction cache

  • Called “branch folding” in the Bell Labs CRISP processor.
  • The original CRISP cache had two addresses and could thus fold a complete branch into the previous instruction.
  • Notice that JAL introduces a structural hazard on the register write.

Code:
  A:    and  r3,r1,r5
  A+4:  addi r2,r3,#4
  A+8:  sub  r4,r2,r1
  A+12: jal  doit
  A+16: subi r1,r1,#1

Internal cache state (instruction, next address, link?):
  and  r3,r1,r5   A+4   N
  addi r2,r3,#4   A+8   N
  sub  r4,r2,r1   doit  L   (the jal is folded into the sub)
  subi r1,r1,#1   A+20  N

slide-126
SLIDE 126

126

The internal cache state can also reflect PREDICTIONS and remove delay slots

° This causes the next instruction to be immediately fetched from the branch destination (predict taken) ° If the branch ends up being not taken, then squash the destination instruction and restart the pipeline at address A+16

Code:
  A:    and  r3,r1,r5
  A+4:  addi r2,r3,#4
  A+8:  sub  r4,r2,r1
  A+12: bne  r4,loop
  A+16: subi r1,r1,#1

Internal cache state (instruction, next address):
  and  r3,r1,r5   A+4   N
  addi r2,r3,#4   A+8   N
  sub  r4,r2,r1   A+12  N
  bne  r4,loop    loop  N   (predict taken)
  subi r1,r1,#1   A+20  N

If the branch is not taken: squash the destination instruction; next = A+16.

slide-127
SLIDE 127

127

Dynamic Branch Prediction Summary

° Prediction is becoming an important part of scalar execution ° Branch History Table: 2 bits for loop accuracy ° Correlation: recently executed branches are correlated with the next branch.

  • Either different branches
  • Or different executions of the same branch

° Tournament Predictor: give more resources to competing solutions and pick between them ° Branch Target Buffer: include the branch address & prediction ° Predicated Execution can reduce the number of branches and the number of mispredicted branches ° Return address stack for prediction of indirect jumps

slide-128
SLIDE 128

128

Exceptions and Interrupts

(Hardware)

slide-129
SLIDE 129

129

Example: Device Interrupt

(Say, arrival of network message)

User code:
  … add r1,r2,r3; subi r4,r1,#4; slli r4,r4,#2; [Hiccup(!)] lw r2,0(r4); lw r3,4(r4); add r2,r2,r3; sw 8(r4),r2; …

External interrupt: PC saved, all ints disabled, switch to supervisor mode.

“Interrupt Handler”:
  Raise priority
  Reenable all ints
  Save registers
  …
  lw r1,20(r0); lw r2,0(r1); addi r3,r0,#5; sw 0(r1),r3
  …
  Restore registers
  Clear current int
  Disable all ints
  Restore priority
  RTE (restore PC, return to user mode)

slide-130
SLIDE 130

130

Alternative: Polling
(again, for arrival of a network message)

Polling point (check device register):
  Disable network intr
  …
  subi r4,r1,#4; slli r4,r4,#2; lw r2,0(r4); lw r3,4(r4); add r2,r2,r3; sw 8(r4),r2
  lw r1,12(r0)
  beq r1,no_mess
“Handler”:
  lw r1,20(r0); lw r2,0(r1); addi r3,r0,#5; sw 0(r1),r3
  Clear network intr
no_mess:
  …

slide-131
SLIDE 131

131

Polling is faster/slower than Interrupts.

° Polling is faster than interrupts because

  • Compiler knows which registers in use at polling point. Hence, do not

need to save and restore registers (or not as many).

  • Other interrupt overhead avoided (pipeline flush, trap priorities, etc).

° Polling is slower than interrupts because

  • Overhead of polling instructions is incurred regardless of whether or not

handler is run. This could add to inner-loop delay.

  • Device may have to wait for service for a long time.

° When to use one or the other?

  • Multi-axis tradeoff
  • Frequent/regular events good for polling, as long as device can be

controlled at user level.

  • Interrupts good for infrequent/irregular events
  • Interrupts good for ensuring regular/predictable service of events.
slide-132
SLIDE 132

132

Exception/Interrupt classifications

° Exceptions: relevant to the current process

  • Faults, arithmetic traps, and synchronous traps
  • Invoke software on behalf of the currently executing process

° Interrupts: caused by asynchronous, outside events

  • I/O devices requiring service (DISK, network)
  • Clock interrupts (real time scheduling)

° Machine Checks: caused by serious hardware failure

  • Not always restartable
  • Indicate that bad things have happened.
  • Non-recoverable ECC error
  • Machine room fire
  • Power outage
slide-133
SLIDE 133

133

A related classification: Synchronous vs. Asynchronous

° Synchronous: means related to the instruction stream, i.e. during the execution of an instruction

  • Must stop an instruction that is currently executing
  • Page fault on load or store instruction
  • Arithmetic exception
  • Software Trap Instructions

° Asynchronous: means unrelated to the instruction stream, i.e. caused by an outside event.

  • Does not have to disrupt instructions that are already

executing

  • Interrupts are asynchronous
  • Machine checks are asynchronous

° SemiSynchronous (or high-availability interrupts):

  • Caused by external event but may have to disrupt current

instructions in order to guarantee service

slide-134
SLIDE 134

134

Interrupt Priorities Must be Handled

User code:                     Interrupt handler:
  add  r1,r2,r3                  Raise priority
  subi r4,r1,#4                  Reenable All Ints
  slli r4,r4,#2                  Save registers
  Hiccup(!)                      lw   r1,20(r0)
  lw   r2,0(r4)                  lw   r2,0(r1)
  lw   r3,4(r4)                  addi r3,r0,#5
  add  r2,r2,r3                  sw   0(r1),r3
  sw   8(r4),r2                  Restore registers
                                 Clear current Int
                                 Disable All Ints
                                 Restore priority
                                 RTE

(Network Interrupt: PC saved, Disable All Ints, enter Supervisor Mode → run handler → Restore PC, return to User Mode. The handler could be interrupted by disk.)

Note that priority must be raised to avoid recursive interrupts!

slide-135
SLIDE 135

135

Interrupt controller hardware and mask levels

° Operating system constructs a hierarchy of masks that reflects some form of interrupt priority. ° For instance:

  • This reflects an order of urgency among interrupts
  • For instance, this ordering says that disk events can interrupt the interrupt handlers for network interrupts.

Priority examples:
  • Software interrupts: 2
  • Network interrupts: 4
  • Sound card: 5
  • Disk interrupt: 6
  • Real-time clock
  • Non-Maskable Ints (power)
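A minimal sketch of how such a mask hierarchy gates delivery. The numeric levels follow the example list above; the real-time-clock level is an assumed value:

```python
# Sketch of an interrupt-priority mask; higher number = more urgent.
# Levels follow the slide's example; the clock level is an assumption.
PRIORITY = {
    "software": 2,
    "network": 4,
    "sound": 5,
    "disk": 6,
    "clock": 7,   # assumed level for the real-time clock
}

def can_preempt(pending, current_level):
    """A pending interrupt is delivered only if its priority exceeds
    the level the CPU was raised to (the current mask)."""
    return PRIORITY[pending] > current_level

# Disk events can interrupt the network handler, but not vice versa.
print(can_preempt("disk", PRIORITY["network"]))   # True
print(can_preempt("network", PRIORITY["disk"]))   # False
```

Raising the current level to an interrupt's own priority on entry is exactly what prevents the recursive-interrupt problem noted on the previous slide.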
slide-136
SLIDE 136

136

Can we have fast interrupts?

° Pipeline Drain: Can be very Expensive ° Priority Manipulations ° Register Save/Restore

  • 128 registers + cache misses + etc.

User code:                     Interrupt handler:
  add  r1,r2,r3                  Raise priority
  subi r4,r1,#4                  Reenable All Ints
  slli r4,r4,#2                  Save registers
  Hiccup(!)                      lw   r1,20(r0)
  lw   r2,0(r4)                  lw   r2,0(r1)
  lw   r3,4(r4)                  addi r3,r0,#5
  add  r2,r2,r3                  sw   0(r1),r3
  sw   8(r4),r2                  Restore registers
                                 Clear current Int
                                 Disable All Ints
                                 Restore priority
                                 RTE

(Fine-Grain Interrupt: PC saved, Disable All Ints, enter Supervisor Mode → run handler → Restore PC, return to User Mode. The handler could be interrupted by disk.)

slide-137
SLIDE 137

137

SPARC (and RISC I) had register windows

° On interrupt or procedure call, simply switch to a different set of registers ° Really saves on interrupt overhead

  • Interrupts can happen at any point in the execution, so the compiler cannot help with knowledge of live registers.
  • Conservative handlers must save all registers
  • Short handlers might be able to save only a few, but this analysis is complicated

° Not as big a deal with procedure calls

  • Original statement by Patterson was that Berkeley didn’t have a compiler team, so they used a hardware solution
  • Good compilers can allocate registers across procedure boundaries
  • Good compilers know what registers are live at any one time

° However, register windows have returned!

  • IA64 has them
  • Many other processors have shadow registers for interrupts
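A toy model of the register-window idea: on interrupt, the handler gets a fresh window by bumping a pointer, so nothing is spilled to memory. Window and register counts are arbitrary, and real SPARC windows overlap for argument passing, which this sketch omits:

```python
# Minimal model of register windows: switch the current window pointer
# on interrupt instead of saving/restoring registers to memory.
class WindowedRegFile:
    def __init__(self, n_windows=8, regs_per_window=16):
        self.windows = [[0] * regs_per_window for _ in range(n_windows)]
        self.cwp = 0  # current window pointer

    def write(self, reg, value):
        self.windows[self.cwp][reg] = value

    def read(self, reg):
        return self.windows[self.cwp][reg]

    def take_interrupt(self):
        self.cwp += 1  # handler gets a fresh window; nothing is spilled

    def return_from_interrupt(self):
        self.cwp -= 1  # the user's registers reappear untouched

rf = WindowedRegFile()
rf.write(1, 42)            # user program's r1
rf.take_interrupt()
rf.write(1, 7)             # handler clobbers *its* r1, not the user's
rf.return_from_interrupt()
print(rf.read(1))          # 42
```

A real implementation must also handle window overflow (running out of windows), which is where the memory spills come back in.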
slide-138
SLIDE 138

138

Supervisor State

° Typically, processors have some amount of state that user programs are not allowed to touch.

  • Page mapping hardware/TLB
  • TLB prevents one user from accessing memory of another
  • TLB protection prevents user from modifying mappings
  • Interrupt controllers -- user code is prevented from crashing the machine by disabling interrupts, ignoring device interrupts, etc.
  • Real-time clock interrupts ensure that users cannot lock up/crash the machine even if they run code that goes into a loop:
  • “Preemptive Multitasking” vs “non-preemptive multitasking”

° Access to hardware devices restricted

  • Prevents malicious user from stealing network packets
  • Prevents user from writing over disk blocks

° Distinction made with at least two levels: USER/SYSTEM (one hardware mode-bit)

  • x86 architectures actually provide 4 different levels; only two are usually used by the OS (or only 1 in older Microsoft OSs)

slide-139
SLIDE 139

139

Entry into Supervisor Mode

° Entry into supervisor mode typically happens on interrupts, exceptions, and special trap instructions. ° Entry goes through kernel instructions:

  • interrupts, exceptions, and trap instructions change to supervisor mode, then jump (indirectly) through a table of instructions in the kernel:

    intvec: j handle_int0
            j handle_int1
            …
            j handle_fp_except0
            …
            j handle_trap0
            j handle_trap1

  • OS “System Calls” are just trap instructions:

    read(fd,buffer,count)  =>  st   20(r0),r1
                               st   24(r0),r2
                               st   28(r0),r3
                               trap $READ

° OS overhead can be serious concern for achieving fast interrupt behavior.
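The trap-table dispatch above can be sketched in miniature. The trap number and handler below are hypothetical stand-ins for the kernel's intvec table:

```python
# Sketch of the kernel's trap dispatch: a trap instruction indexes a
# table of handlers. The number and handler are illustrative stand-ins.
READ = 0  # hypothetical trap number for the read() system call

def handle_read(fd, buffer, count):
    # Stand-in for the kernel's read implementation.
    return f"read {count} bytes from fd {fd}"

intvec = {READ: handle_read}  # models the intvec: jump table

def trap(number, *args):
    """Models the mode switch + indirect jump through the vector."""
    return intvec[number](*args)

print(trap(READ, 3, "buf", 20))   # read 20 bytes from fd 3
```

In hardware the arguments arrive in registers (the `st 20(r0),r1` stores above) rather than as Python parameters, but the indirection through a table is the same.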

slide-140
SLIDE 140

140

Precise Interrupts/Exceptions

° An interrupt or exception is considered precise if there is a single instruction (or interrupt point) for which:

  • All instructions before that have committed their state
  • No following instructions (including the interrupting instruction) have modified any state.

° This means that you can restart execution at the interrupt point and “get the right answer”

  • Implicit in our previous example of a device interrupt:
  • Interrupt point is at first lw instruction

  add  r1,r2,r3
  subi r4,r1,#4
  slli r4,r4,#2
  lw   r2,0(r4)   ← interrupt point
  lw   r3,4(r4)
  add  r2,r2,r3
  sw   8(r4),r2

(External Interrupt: PC saved, Disable All Ints, enter Supervisor Mode → int handler → Restore PC, return to User Mode)
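The definition can be stated as a small predicate: given which instructions (in program order) have committed, a candidate interrupt point is precise iff everything before it has committed and nothing at or after it has. A toy check:

```python
# Toy check of the precise-interrupt condition: committed[i] says
# whether instruction i (in program order) has committed its state.
def is_precise(committed, point):
    """The interrupt point is precise iff all earlier instructions
    committed and no instruction at or after the point did."""
    return all(committed[:point]) and not any(committed[point:])

# As in the slide's example: the first three instructions committed,
# and the interrupt point is at the first lw (index 3).
print(is_precise([True, True, True, False, False], 3))   # True
# With out-of-order commits, no single index satisfies the condition:
print(is_precise([True, False, True, False, False], 3))  # False
```

The second case is exactly what "imprecise" means: there is no index that cleanly separates committed from uncommitted state.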

slide-141
SLIDE 141

141

Precise interrupt point may require multiple PCs

° On SPARC, interrupt hardware produces “pc” and “npc” (next pc) ° On MIPS, only “pc” – must fix point in software

        addi r4,r3,#4
        sub  r1,r2,r3
  PC:   bne  r1,there
  PC+4: and  r2,r3,r5
        <other insts>

Interrupt point described as <PC,PC+4>

        addi r4,r3,#4
        sub  r1,r2,r3
  PC:   bne  r1,there
  PC+4: and  r2,r3,r5
        <other insts>

Interrupt point described as: <PC+4,there>  (branch was taken)
                          or  <PC+4,PC+8>  (branch was not taken)
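A sketch of how software might form the <pc, npc> pair, assuming 4-byte instructions and a single branch with one delay slot; the addresses and branch target below are made up:

```python
# Sketch of forming the <pc, npc> pair for a precise interrupt point,
# assuming 4-byte instructions and a one-slot branch delay. Addresses
# and the branch target are illustrative.
def interrupt_point(pc, branch_at, taken, target):
    """Return (pc, npc) for an interrupt arriving at address `pc`."""
    if pc == branch_at + 4 and taken:   # in the delay slot of a taken branch
        return (pc, target)             # <PC+4, there>
    return (pc, pc + 4)                 # straight-line code: <PC, PC+4>

branch_at, target = 8, 100              # bne at address 8, "there" = 100
print(interrupt_point(4, branch_at, False, target))   # (4, 8)
print(interrupt_point(12, branch_at, True, target))   # (12, 100)
print(interrupt_point(12, branch_at, False, target))  # (12, 16)
```

This is the fixup MIPS must do in software; SPARC's trap hardware delivers both values directly.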

slide-142
SLIDE 142

142

Why are precise interrupts desirable?

° Restartability doesn’t require preciseness. However, preciseness makes it a lot easier to restart. ° Simplifies the task of the operating system a lot

  • Less state needs to be saved away if unloading process.
  • Quick to restart (making for fast interrupts)

° Many types of interrupts/exceptions need to be restartable. Easier to figure out what actually happened:

  • E.g. TLB faults. Need to fix the translation, then restart the load/store
  • IEEE gradual underflow, illegal operation, etc:

e.g. Suppose you are computing f(x) = sin(x)/x. Then, for x = 0, f(0) = 0/0 ⇒ NaN (illegal operation). Want to take the exception, replace NaN with 1, then restart.
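The sin(x)/x fix-up can be sketched directly. Here Python's float division exception at 0/0 stands in for the IEEE illegal-operation trap:

```python
import math

# Sketch of "take the exception, fix up, restart" for f(x) = sin(x)/x:
# Python's ZeroDivisionError stands in for the IEEE illegal-operation
# trap that would produce NaN in hardware.
def f(x):
    try:
        return math.sin(x) / x
    except ZeroDivisionError:      # the "exception" at x = 0
        return 1.0                 # replace NaN with lim_{x->0} sin(x)/x = 1

print(f(0.0))                      # 1.0
print(f(1.0))                      # sin(1)/1
```

The key point is that the computation continues from the faulting operation with a patched result, rather than being abandoned; preciseness is what makes that restart well-defined.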

slide-143
SLIDE 143

143

Approximations to precise interrupts

° Hardware has imprecise state at time of interrupt ° Exception handler must figure out how to find a precise PC at which to restart program.

  • Emulate instructions that may remain in the pipeline
  • Example: SPARC allows limited parallelism between FP and integer core:
  • possible that integer instructions #1 - #4 have already executed at the time that the first floating instruction gets a recoverable exception
  • Interrupt handler code must fix up <float 1>, then emulate both <float 1> and <float 2>
  • At that point, the precise interrupt point is integer instruction #5.

  <float 1> <int 1> <int 2> <int 3> <float 2> <int 4> <int 5>

° VAX had string-move instructions that could be in the middle of execution at the time a page fault occurred. ° Could be arbitrary processor state that needs to be restored to restart execution.

slide-144
SLIDE 144

144

Precise Exceptions in simple 5-stage pipeline:

° Exceptions may occur at different stages in the pipeline (i.e. out of order):

  • Arithmetic exceptions occur in execution stage
  • TLB faults can occur in instruction fetch or memory stage

° What about interrupts? The doctor’s mandate of “do no harm” applies here: try to interrupt the pipeline as little as possible ° All of this is solved by tagging instructions in the pipeline as “cause exception or not” and waiting until the end of the memory stage to flag the exception

  • Interrupts become marked NOPs (like bubbles) that are placed into the pipeline instead of an instruction.
  • Assume that the interrupt condition persists in case the NOP is flushed
  • Clever instruction fetch might start fetching instructions from the interrupt vector, but this is complicated by the need for a supervisor mode switch, saving of one or more PCs, etc

slide-145
SLIDE 145

145

Another look at the exception problem

° Use pipeline to sort this out!

  • Pass exception status along with instruction.
  • Keep track of PCs for every instruction in pipeline.
  • Don’t act on the exception until it reaches the WB stage

° Handle interrupts through “faulting noop” in IF stage ° When instruction reaches WB stage:

  • Save PC ⇒ EPC, Interrupt vector addr ⇒ PC
  • Turn all instructions in earlier stages into noops!

(Diagram: four overlapped instructions flowing through IFetch / Dcd / Exec / Mem / WB, with exceptions arising at different stages: Inst TLB fault in IFetch, Bad Inst in Dcd, Overflow in Exec, Data TLB fault in Mem.)
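A toy model of the "tag and wait until WB" rule: each in-flight instruction carries its exception status, only the instruction reaching WB may raise it, and all younger instructions are squashed into noops. The stage layout is simplified:

```python
# Toy model of "pass exception status along, act only in WB".
# pipeline[0] is the oldest instruction (in WB); the rest are younger.
def step(pipeline):
    """Advance one cycle: retire or raise from WB, shift the rest."""
    name, exc = pipeline[0]
    if exc is not None:
        # Oldest exception fires; every younger instruction is squashed.
        return exc, [("noop", None)] * len(pipeline)
    return None, pipeline[1:] + [("next", None)]

# An overflow detected in EX is not acted on until the add reaches WB:
pipe = [("lw", None), ("add", "overflow"), ("sub", None),
        ("and", None), ("or", None)]
exc, pipe = step(pipe)    # lw retires cleanly
print(exc)                # None
exc, pipe = step(pipe)    # the add is now oldest: its exception fires
print(exc)                # overflow
```

Because only the oldest instruction may raise, exceptions are always reported in program order even though they were detected out of order.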

slide-146
SLIDE 146

146

How to achieve precise interrupts when instructions executing in arbitrary order?

° Jim Smith’s classic paper discusses several methods for getting precise interrupts:

  • In-order instruction completion
  • Reorder buffer
  • History buffer

° We will discuss these after we see the advantages of out-of-order execution.
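As a preview, the reorder-buffer idea can be sketched in a few lines: instructions may complete out of order, but retirement stops at the first unfinished instruction, so architectural state always corresponds to a precise point (a heavy simplification of Smith's scheme):

```python
# Minimal reorder-buffer sketch: complete out of order, retire in
# program order. A heavy simplification of Smith's scheme.
def retire_in_order(rob):
    """rob: list of (name, done) in program order. Retire the longest
    prefix of completed instructions; stop at the first unfinished."""
    retired = []
    for name, done in rob:
        if not done:
            break
        retired.append(name)
    return retired

# i2 finished before i1, but nothing past i1 may retire until i1 does:
print(retire_in_order([("i0", True), ("i1", False), ("i2", True)]))  # ['i0']
```

The retire pointer is exactly the precise interrupt point: everything before it has committed, nothing after it has.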
slide-147
SLIDE 147

147

Summary

° Control flow causes lots of trouble with pipelining

  • Other hazards can be “fixed” with more transistors or forwarding
  • We will spend a lot of time on branch prediction techniques

° Some pre-decode techniques can transform dynamic decisions into static ones (VLIW-like)

  • Beginnings of dynamic compilation techniques

° Interrupts and Exceptions either interrupt the current instruction or happen between instructions

  • Possibly large quantities of state must be saved before interrupting

° Machines with precise exceptions provide one single point in the program to restart execution

  • All instructions before that point have completed
  • No instructions after or including that point have completed

° Hardware techniques exist for precise exceptions even in the face of out-of-order execution!

  • Important enabling factor for out-of-order execution