SLIDE 1

dt10 2011 13.1

Previous lecture

stalls

– reduce performance
– but are required to get correct results

compiler

– arranges code to avoid hazards and stalls
– requires knowledge of the pipeline structure

SLIDE 2

Branch hazards

branch outcome is determined in MEM stage

[Figure: pipelined datapath with PC; annotation: "Flush these instructions (set control values to 0)"]

SLIDE 3

Reducing branch delay

move hardware to determine outcome to ID stage

– target address adder
– register comparator

example: branch taken

36: sub $10, $4, $8
40: beq $1, $3, 7
44: and $12, $2, $5
48: or $13, $2, $6
52: add $14, $4, $2
56: slt $15, $6, $7
...
??: lw $4, 50($7)
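The address left open in the listing can be worked out from the branch encoding. A minimal sketch, assuming the standard MIPS rule that the branch immediate is a word offset relative to the instruction after the branch; the function name `branch_target` is illustrative, not from the slides:

```python
def branch_target(branch_pc: int, word_offset: int) -> int:
    """Target of a MIPS conditional branch: (PC + 4) + 4 * offset.

    Assumes byte addresses and a word offset, as in the MIPS I-format.
    """
    return branch_pc + 4 + 4 * word_offset


# beq at address 40 with offset 7, as in the example listing
print(branch_target(40, 7))  # 72
```

The target address adder that the slide moves into the ID stage performs this addition in hardware.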

SLIDE 4

Example: branch taken

SLIDE 5

Example: branch taken

SLIDE 6

Delay slots: clawing back the stalls

taken branch always means one stall cycle

– nothing we can do to get rid of it
– can we use the stall cycle to do something useful?

MIPS approach : change the ISA specification

– instruction following branch is always executed
– delay slot instruction : executed even when branch taken

do{
  $2 = $2 * $3;
  $3 = $3 - 1;
}while($3==0);
$3 = $2 + $4;

taken : no delay slot

24 mul $2, $2, $3
28 addi $3, $3, -1
32 beq $3, $0, -3
   stall
36 add $3, $2, $4

taken : with delay slot

24 mul $2, $2, $3
28 beq $3, $0, -2
32 addi $3, $3, -1
36 add $3, $2, $4

SLIDE 7

Data hazards for branches

if a comparison register is a destination of 2nd or 3rd preceding ALU instruction

add $4, $5, $6        IF  ID  EX  MEM WB
add $1, $2, $3            IF  ID  EX  MEM WB
…                             IF  ID  EX  MEM WB
beq $1, $4, target                IF  ID  EX  MEM WB

can resolve using forwarding

SLIDE 8

Data hazards for branches

two data hazards that cause stall on branch

– comparison reg. is destination of preceding ALU instr.
– comparison reg. is destination of 2nd preceding load instr.

need 1 stall cycle

lw  $1, addr          IF  ID  EX  MEM WB
add $4, $5, $6            IF  ID  EX  MEM WB
beq $1, $4, target            IF  ID      EX  MEM WB

SLIDE 9

Data hazards for branches

two data hazards that cause stall on branch

– comparison reg. is destination of preceding ALU instr.
– comparison reg. is destination of 2nd preceding load instr.

need 1 stall cycle

lw  $1, addr          IF  ID  EX  MEM WB
add $4, $5, $6            IF  ID  EX  MEM WB
beq stalled                   IF  ID
beq $1, $4, target                    ID  EX  MEM WB

SLIDE 10

Data hazards for branches

if a comparison register is a destination of immediately preceding load instruction

– need 2 stall cycles

lw  $1, addr          IF  ID  EX  MEM WB
beq stalled               IF  ID
beq stalled                       ID
beq $1, $0, target                    ID  EX  MEM WB
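The stall counts on the data-hazard slides follow one rule. The sketch below is a simplified model, not the hazard-detection hardware itself: it assumes the branch is resolved in ID, full forwarding is available, an ALU result exists at the end of EX and a loaded value at the end of MEM. The names `RESULT_READY_STAGE` and `branch_stalls` are hypothetical.

```python
# Stage at the end of which a producer's result exists and can be forwarded
# (stage numbering: IF=0, ID=1, EX=2, MEM=3, WB=4).
RESULT_READY_STAGE = {"alu": 2, "load": 3}


def branch_stalls(producer_kind: str, distance: int) -> int:
    """Stall cycles for a branch that compares its registers in ID.

    producer_kind: "alu" or "load" (labels invented for this sketch).
    distance: 1 if the producer immediately precedes the branch,
              2 if it is the 2nd preceding instruction, and so on.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    # The producer reaches its ready stage in cycle k (counting from its own IF);
    # the branch would reach ID in cycle distance + 1 and needs the value by then.
    return max(0, RESULT_READY_STAGE[producer_kind] - distance)


for kind, distance in [("alu", 1), ("alu", 2), ("load", 1), ("load", 2), ("load", 3)]:
    print(kind, distance, branch_stalls(kind, distance))
```

Under these assumptions the model gives one stall for an immediately preceding ALU instruction or a 2nd preceding load, two stalls for an immediately preceding load, and none for older producers.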

SLIDE 11

Dynamic branch prediction

deeper and superscalar pipelines

– branch penalty is more significant

use dynamic prediction

– branch prediction buffer (aka branch history table)
– indexed by recent branch instruction addresses
– stores outcome (taken/not taken)

dynamic prediction: execute a branch

– check table, expect the same outcome
– start fetching from fall-through or target
– if wrong, flush pipeline and flip prediction

SLIDE 12

1-bit predictor: shortcoming

inner loop branches mispredicted twice!

outer: …
       …
inner: …
       …
       beq …, …, inner
       …
       beq …, …, outer

– mispredict as taken on last iteration of inner loop
– then mispredict as not taken on first iteration of inner loop next time around
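The double misprediction can be reproduced with a few lines of simulation. This is a sketch under assumed numbers: the inner-loop branch is taken three times and then falls through, and the outer loop runs four times. The function name `run_one_bit` is illustrative.

```python
def run_one_bit(outcomes, prediction=True):
    """Count mispredictions of a 1-bit predictor over taken/not-taken outcomes."""
    mispredictions = 0
    for taken in outcomes:
        if prediction != taken:
            mispredictions += 1
        prediction = taken  # one bit of state: remember only the last outcome
    return mispredictions


# Inner-loop branch: taken, taken, taken, not taken; repeated for 4 outer iterations.
inner_loop = [True, True, True, False]
history = inner_loop * 4
print(run_one_bit(history))  # 7
```

After the first pass through the inner loop, every later pass costs two mispredictions: one on loop exit and one on re-entry.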

SLIDE 13

2-Bit predictor

only change prediction on two successive mispredictions
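One common way to realise this is a 2-bit saturating counter. The sketch below assumes that encoding (states 0 and 1 predict not taken, 2 and 3 predict taken); the slide's own state diagram is not reproduced in the text, so the exact transitions here are an assumption, and the class and function names are hypothetical.

```python
class TwoBitPredictor:
    """2-bit saturating counter: 0, 1 predict not taken; 2, 3 predict taken."""

    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step towards the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, predictor):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# Same assumed inner-loop pattern: taken 3 times, then not taken, for 4 outer iterations.
history = [True, True, True, False] * 4
print(count_mispredictions(history, TwoBitPredictor()))  # 4
```

Starting from the strongly-taken state, a single loop exit only weakens the prediction, so the branch is mispredicted once per pass through the inner loop instead of twice.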

SLIDE 14

Calculating the branch target

even with predictor, still need to calculate the target address

– 1-cycle penalty for a taken branch

branch target buffer

– cache of target addresses
– indexed by PC when instruction fetched
– if hit and instruction is branch predicted taken, can fetch target immediately
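A branch target buffer can be sketched as a small direct-mapped cache keyed by the fetch PC. The table size, the indexing by word address, and all names below are assumptions made for illustration; real designs differ in associativity and tag width.

```python
class BranchTargetBuffer:
    """Direct-mapped cache of branch targets, looked up with the fetch PC."""

    def __init__(self, entries=16):
        self.entries = entries
        # Each slot holds (branch_pc, target, predict_taken) or None.
        self.table = [None] * entries

    def _index(self, pc):
        return (pc >> 2) % self.entries  # drop the byte offset within a word

    def next_fetch(self, pc):
        """Address to fetch next: the cached target on a predicted-taken hit, else PC + 4."""
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc and slot[2]:
            return slot[1]
        return pc + 4

    def update(self, pc, target, taken):
        """Record the resolved branch so the next fetch of this PC can use it."""
        self.table[self._index(pc)] = (pc, target, taken)


btb = BranchTargetBuffer()
print(btb.next_fetch(40))   # 44: miss, fall through
btb.update(40, 72, True)
print(btb.next_fetch(40))   # 72: hit and predicted taken
```

The full PC is stored as a tag so that a different instruction mapping to the same slot does not pick up the wrong target.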

SLIDE 15

The role of the compiler

compilers can have a huge impact on performance

– register allocation
– instruction selection
– data placement

optimisation is subordinate to correctness

– must always compile against ISA specification
– can try to optimise code according to architecture

CPU specific optimisation may reduce performance

– optimise for P4 → might be slower than generic code on P3

ISA extensions remove backwards compatibility

– optimise for P4 → SSE not available on P2

SLIDE 16

Compiling to avoid hazards

data hazards

– instruction scheduling: avoid load-use data hazard
– register allocation: avoid immediate re-use of registers
– MIPS: large number of registers to make this easier

structural hazards

– instruction selection: select simple instructions
– e.g.: sll $1,$2,1 vs. add $1,$2,$2
– instruction scheduling: move instructions apart

control hazards

– instruction selection: eliminate branches if possible
– e.g.: cmov : conditional move
– e.g.: predicated execution
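To illustrate the instruction-scheduling point for data hazards, the sketch below counts load-use hazards in a short sequence and shows that reordering independent instructions removes them. The tuple encoding `(opcode, destination, sources)`, the example program, and the function name are all invented for this illustration; it checks only the immediately following instruction.

```python
def load_use_stalls(program):
    """Count instructions that read a register loaded by the instruction just before."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_srcs = curr
        if prev_op == "lw" and prev_dest in curr_srcs:
            stalls += 1
    return stalls


# (opcode, destination register, source registers)
original = [
    ("lw", "$1", ["$10"]),
    ("add", "$3", ["$1", "$2"]),   # uses $1 right after it is loaded
    ("lw", "$4", ["$11"]),
    ("sub", "$5", ["$4", "$2"]),   # uses $4 right after it is loaded
]

# Hoist the second load above the add; the two are independent, so results are unchanged.
scheduled = [original[0], original[2], original[1], original[3]]

print(load_use_stalls(original))   # 2
print(load_use_stalls(scheduled))  # 0
```

A real compiler must also verify that the reordering preserves every dependence, including memory dependences, which this sketch does not model.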

SLIDE 17

Exceptions and interrupts

unexpected events requiring change in flow of control

– different ISAs use the terms differently

exception: internal signal, arises from within the CPU

– e.g. undefined opcode, overflow, syscall, …

interrupt: external signal, source is outside CPU

– e.g. external IO: hard disk saying “your data is ready now”!

must handle them without sacrificing performance

– exceptions are... exceptional
– not the common/expected case
– interrupts are frequent, but not that frequent

CPU instruction rate: >1 GHz; interrupt rate: <10 kHz

SLIDE 18

Handling exceptions in MIPS

exceptions managed by System Control Coprocessor

– follows set of steps to record and handle exception

1. save PC of offending (or interrupted) instruction

– in Exception Program Counter, EPC

2. save indication of the problem

– in Cause Register
– 0 for undefined opcode, 1 for overflow

3. jump to handler at 8000 0180
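The three steps can be sketched as a small state update. This models only what the slide lists; the class `Coprocessor0`, the function `raise_exception`, and the constant names are hypothetical, and the handler address is taken to be 0x80000180, matching the handler listing later in the lecture.

```python
HANDLER_ADDRESS = 0x8000_0180

# Cause values as given on the slide.
CAUSE_UNDEFINED_OPCODE = 0
CAUSE_OVERFLOW = 1


class Coprocessor0:
    """Minimal stand-in for the System Control Coprocessor state used here."""

    def __init__(self):
        self.epc = 0
        self.cause = 0


def raise_exception(cp0, offending_pc, cause):
    """Record the exception and return the address to fetch from next."""
    cp0.epc = offending_pc    # step 1: save PC of offending instruction
    cp0.cause = cause         # step 2: save indication of the problem
    return HANDLER_ADDRESS    # step 3: jump to the handler


cp0 = Coprocessor0()
next_pc = raise_exception(cp0, 0x4C, CAUSE_OVERFLOW)
print(hex(next_pc), hex(cp0.epc), cp0.cause)
```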

SLIDE 19

An alternate mechanism

vectored interrupts

– handler address determined by Cause Register

example:

– undefined opcode: C000 0000
– overflow: C000 0020
– …: C000 0040

instructions either

– deal with the interrupt, or
– jump to real handler
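The example addresses are consistent with a fixed base plus a fixed spacing per cause. A minimal sketch, assuming consecutive cause codes 0, 1, 2 and taking the base and the 0x20 spacing from the example above; the names are illustrative:

```python
VECTOR_BASE = 0xC000_0000
VECTOR_SPACING = 0x20  # bytes between handler entry points in the example


def vector_address(cause_code):
    """Handler entry point for a given cause code under the assumed layout."""
    return VECTOR_BASE + VECTOR_SPACING * cause_code


for cause in range(3):
    print(cause, hex(vector_address(cause)))

# Room per entry, in 4-byte MIPS instructions.
print(VECTOR_SPACING // 4)  # 8
```

With only eight instruction slots per entry, anything beyond a very short handler has to jump to the real handler elsewhere.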

SLIDE 20

Handler actions

read Cause Register, and transfer to relevant handler
determine action required
if restartable

– take corrective action
– use EPC to return to program

otherwise

– terminate program
– report error using EPC, Cause, …

SLIDE 21

Exceptions in a pipeline

another form of control hazard
consider exception on add in EX stage

add $1, $2, $1

– prevent $1 from being clobbered
– complete previous instructions
– flush add and subsequent instructions
– set Cause and EPC register values
– transfer control to handler

similar to mispredicted branch

– use much of the same hardware

SLIDE 22

Pipeline with exceptions

SLIDE 23

Exception properties

restartable exceptions

– pipeline can flush the instruction
– handler executes, then returns to the instruction
– refetched and executed from scratch

PC saved in EPC register

– identifies offending instruction
– actually PC + 4 is saved, handler must adjust

SLIDE 24

Exception example

exception on add in

40 sub $11, $2, $4
44 and $12, $2, $5
48 or $13, $2, $6
4C add $1, $2, $1
50 slt $15, $6, $7
54 lw $16, 50($7)
…

handler

80000180 sw $25, 1000($0)
80000184 sw $26, 1004($0)
…
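For this listing, the instructions that complete and those that are flushed can be separated with a short sketch. It assumes an in-order five-stage pipeline with the add in EX, so the two younger instructions are in ID and IF; the function name and the `younger_in_flight` parameter are hypothetical.

```python
PROGRAM = [
    (0x40, "sub $11, $2, $4"),
    (0x44, "and $12, $2, $5"),
    (0x48, "or $13, $2, $6"),
    (0x4C, "add $1, $2, $1"),
    (0x50, "slt $15, $6, $7"),
    (0x54, "lw $16, 50($7)"),
]


def split_on_exception(program, faulting_pc, younger_in_flight=2):
    """Return (completed, flushed) instruction texts for an exception at faulting_pc.

    Older instructions are allowed to complete; the faulting instruction and
    the younger ones already fetched are flushed.
    """
    completed = [text for pc, text in program if pc < faulting_pc]
    last_fetched = faulting_pc + 4 * younger_in_flight
    flushed = [text for pc, text in program if faulting_pc <= pc <= last_fetched]
    return completed, flushed


completed, flushed = split_on_exception(PROGRAM, 0x4C)
print(completed)
print(flushed)
```

After the flush, fetch continues at the handler address, so the next instructions entering the pipeline are the two sw instructions of the handler.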

SLIDE 25

Exception example

SLIDE 26

Exception example

SLIDE 27

Multiple exceptions

pipelining overlaps multiple instructions

– could have multiple exceptions at once

simple way: deal with exception from earliest instruction

– flush subsequent instructions
– “precise” exceptions

in complex pipelines

– multiple instructions issued per cycle
– out-of-order completion
– maintaining precise exceptions is difficult!

SLIDE 28

Pipelining: summary

ISA influences design of datapath and control
datapath and control influence design of ISA
pipelining improves instruction throughput

– using parallelism
– more instructions completed per second
– but latency for each instruction not reduced

hazards: structural, data, control
advanced issues
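The throughput and latency points in the summary can be checked with a short calculation. This is an idealised sketch: it assumes every stage takes one cycle, the cycle time is the same with and without pipelining, and there are no stalls or flushes. The function names are illustrative.

```python
def cycles_unpipelined(instructions, stages):
    """Each instruction passes through all stages before the next one starts."""
    return instructions * stages


def cycles_pipelined(instructions, stages):
    """The first instruction fills the pipeline; after that one completes per cycle."""
    return stages + instructions - 1


n, k = 1000, 5
print(cycles_unpipelined(n, k))  # 5000
print(cycles_pipelined(n, k))    # 1004
print(cycles_pipelined(1, k))    # 5: a single instruction still takes k cycles
```

For long instruction sequences the speed-up approaches the number of stages, while the time for any one instruction stays at k cycles; hazards reduce the real figure below this ideal.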