Previous lecture stalls reduce performance but are required to - - PowerPoint PPT Presentation



slide-1
SLIDE 1

dt10 2011 13.1

Previous lecture

  • stalls

– reduce performance
– but are required to get correct results

  • compiler

– arranges code to avoid hazards and stalls
– requires knowledge of the pipeline structure

slide-2
SLIDE 2

dt10 2011 13.2

Branch hazards

  • branch outcome is determined in MEM stage

[Figure: pipeline diagram. Labels: PC; "Flush these instructions (set control values to 0)"]

slide-3
SLIDE 3

dt10 2011 13.3

Reducing branch delay

  • move hardware to determine outcome to ID stage

– target address adder
– register comparator

  • example: branch taken

36: sub $10, $4, $8
40: beq $1, $3, 7
44: and $12, $2, $5
48: or $13, $2, $6
52: add $14, $4, $2
56: slt $15, $6, $7
...
??: lw $4, 50($7)

slide-4
SLIDE 4

dt10 2011 13.4

Example: branch taken

slide-5
SLIDE 5

dt10 2011 13.5

Example: branch taken

slide-6
SLIDE 6

dt10 2011 13.6

Delay slots: clawing back the stalls

  • taken branch always means one stall cycle

– nothing we can do to get rid of it
– can we use the stall cycle to do something useful?

  • MIPS approach: change the ISA specification

– instruction following branch is always executed
– delay slot instruction: executed even when branch taken

do {
    $2 = $2 * $3;
    $3 = $3 - 1;
} while ($3 == 0);
$3 = $2 + $4;

taken: no delay slot

24 mul $2, $2, $3
28 addi $3, $3, -1
32 beq $3, $0, -3
   stall
36 add $3, $2, $4

taken: with delay slot

24 mul $2, $2, $3
28 beq $3, $0, -2
32 addi $3, $3, -1
36 add $3, $2, $4
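The branch offset changes from -3 to -2 between the two listings because the branch moves up by one instruction. A minimal sketch of the arithmetic, assuming the usual MIPS convention that the offset counts words relative to the instruction after the branch (PC + 4); `branch_target` is an illustrative name, not part of the lecture material.

```python
def branch_target(pc: int, offset_words: int) -> int:
    """Target of a taken MIPS-style branch: (pc + 4) + 4 * offset."""
    return pc + 4 + 4 * offset_words


# No delay slot: beq sits at address 32 with offset -3.
no_slot = branch_target(32, -3)
# With delay slot: beq moved up to address 28, so the offset becomes -2.
with_slot = branch_target(28, -2)

# Both versions loop back to the mul at address 24.
print(no_slot, with_slot)
```

Both calls give the address of the `mul` at 24, which is why the two listings describe the same loop entry point.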

slide-7
SLIDE 7

dt10 2011 13.7

Data hazards for branches

  • if a comparison register is a destination of 2nd or 3rd preceding ALU instruction

add $4, $5, $6        IF ID EX MEM WB
add $1, $2, $3           IF ID EX MEM WB
…                           IF ID EX MEM WB
beq $1, $4, target             IF ID EX MEM WB

  • can resolve using forwarding
slide-8
SLIDE 8

dt10 2011 13.8

Data hazards for branches

  • two data hazards that cause stall on branch

– comparison reg. is destination of preceding ALU instr.
– comparison reg. is destination of 2nd preceding load instr.

  • need 1 stall cycle

lw  $1, addr          IF ID EX MEM WB
add $4, $5, $6           IF ID EX MEM WB
beq $1, $4, target          IF ID    EX MEM WB

slide-9
SLIDE 9

dt10 2011 13.9

Data hazards for branches

  • two data hazards that cause stall on branch

– comparison reg. is destination of preceding ALU instr.
– comparison reg. is destination of 2nd preceding load instr.

  • need 1 stall cycle

lw  $1, addr          IF ID EX MEM WB
add $4, $5, $6           IF ID EX MEM WB
beq $1, $4, target          IF ID ID EX MEM WB    (beq stalled)

slide-10
SLIDE 10

dt10 2011 13.10

Data hazards for branches

  • if a comparison register is a destination of immediately preceding load instruction

– need 2 stall cycles

lw  $1, addr          IF ID EX MEM WB
beq $1, $0, target       IF ID ID ID EX MEM WB    (beq stalled twice)
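The three branch data-hazard cases (forwarding only, 1 stall, 2 stalls) can be summarised in one rule. The sketch below assumes the pipeline described in these slides, with the branch resolved in ID and full forwarding; the function name and the `latency` values are illustrative choices made to match the slides' cases, not something the lecture defines.

```python
def branch_stall_cycles(producer: str, distance: int) -> int:
    """Stall cycles before a branch resolved in ID can read a register.

    producer: "alu" or "load", the kind of instruction writing the register.
    distance: 1 if the producer immediately precedes the branch,
              2 for the 2nd preceding instruction, and so on.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    # Assumed model: an ALU result can be forwarded to ID two cycles after
    # the producer is decoded, a load result three cycles after. Each
    # instruction between producer and branch hides one of those cycles.
    latency = {"alu": 2, "load": 3}[producer]
    return max(0, latency - distance)


for kind, dist in [("alu", 3), ("alu", 2), ("alu", 1), ("load", 2), ("load", 1)]:
    print(kind, dist, branch_stall_cycles(kind, dist))
```

A compiler that knows this rule can try to place independent instructions between the producer and the branch.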

slide-11
SLIDE 11

dt10 2011 13.11

Dynamic branch prediction

  • deeper and superscalar pipelines

– branch penalty is more significant

  • use dynamic prediction

– branch prediction buffer (aka branch history table)
– indexed by recent branch instruction addresses
– stores outcome (taken/not taken)

  • dynamic prediction: to execute a branch

– check table, expect the same outcome
– start fetching from fall-through or target
– if wrong, flush pipeline and flip prediction

slide-12
SLIDE 12

dt10 2011 13.12

1-bit predictor: shortcoming

  • inner loop branches mispredicted twice!

outer: …
       …
inner: …
       …
       beq …, …, inner
       …
       beq …, …, outer

– mispredict as taken on last iteration of inner loop
– then mispredict as not taken on first iteration of inner loop next time around
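The double misprediction can be checked with a small simulation. This is a sketch of a 1-bit predictor for a single branch, assuming an inner loop whose branch is taken three times and then falls through, repeated for five outer iterations; the loop counts and the function name are made up for illustration.

```python
def mispredictions_1bit(outcomes, initial=False):
    """Count mispredictions of a 1-bit predictor over branch outcomes.

    outcomes: iterable of booleans, True meaning the branch was taken.
    initial:  the prediction stored before the first execution.
    """
    prediction = initial
    wrong = 0
    for taken in outcomes:
        if taken != prediction:
            wrong += 1
        prediction = taken  # 1-bit scheme: remember only the last outcome
    return wrong


# Inner-loop branch: taken 3 times, then not taken, for 5 outer iterations.
inner = ([True] * 3 + [False]) * 5
print(mispredictions_1bit(inner))
```

With these assumed counts the predictor is wrong twice per pass through the inner loop: once on exit and once on re-entry.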

slide-13
SLIDE 13

dt10 2011 13.13

2-Bit predictor

  • only change prediction on two successive mispredictions
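One common way to realise this is a 2-bit saturating counter. The slide does not say which state machine is used, so the encoding below (0 and 1 predict not taken, 2 and 3 predict taken) is an assumption, as are the loop counts in the example.

```python
def mispredictions_2bit(outcomes, state=0):
    """Count mispredictions of a 2-bit saturating counter.

    state: 0 or 1 predicts not taken, 2 or 3 predicts taken.
    """
    wrong = 0
    for taken in outcomes:
        if taken != (state >= 2):
            wrong += 1
        # Move one step towards the observed outcome, saturating at 0 and 3.
        state = min(3, state + 1) if taken else max(0, state - 1)
    return wrong


# Inner-loop branch: taken 3 times, then not taken, for 5 outer iterations.
inner = ([True] * 3 + [False]) * 5
print(mispredictions_2bit(inner))           # counter starts at "strongly not taken"
print(mispredictions_2bit(inner, state=3))  # counter starts at "strongly taken"
```

Once the counter has warmed up, the loop exit costs one misprediction per pass instead of two, because a single not-taken outcome only moves the counter from 3 to 2 and the prediction stays "taken".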

slide-14
SLIDE 14

dt10 2011 13.14

Calculating the branch target

  • even with predictor, still need to calculate the target address

– 1-cycle penalty for a taken branch

  • branch target buffer

– cache of target addresses
– indexed by PC when instruction fetched
– if hit and instruction is branch predicted taken, can fetch target immediately
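A branch target buffer can be sketched as a small direct-mapped cache. The table size, the indexing by word address, and all names below are assumptions for illustration; real designs differ in associativity and tag handling.

```python
class BranchTargetBuffer:
    """Direct-mapped cache of branch targets, indexed by fetch PC."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries  # each slot: (tag_pc, target) or None

    def _index(self, pc):
        return (pc >> 2) % self.entries  # drop the byte offset within a word

    def lookup(self, pc):
        """Return the cached target on a hit, else None."""
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:
            return slot[1]
        return None

    def update(self, pc, target):
        """Record the target once the branch has been resolved."""
        self.table[self._index(pc)] = (pc, target)


def next_fetch_pc(btb, pc, predicted_taken):
    """Choose the next fetch address: cached target or fall-through."""
    target = btb.lookup(pc)
    if target is not None and predicted_taken:
        return target
    return pc + 4


btb = BranchTargetBuffer()
print(next_fetch_pc(btb, 32, True))  # miss: fall through to 36
btb.update(32, 24)
print(next_fetch_pc(btb, 32, True))  # hit and predicted taken: fetch 24
```

The lookup happens in the fetch stage, so on a hit the target is available without waiting for the address adder.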

slide-15
SLIDE 15

dt10 2011 13.15

The role of the compiler

  • compilers can have a huge impact on performance

– register allocation
– instruction selection
– data placement

  • optimisation is subordinate to correctness

– must always compile against ISA specification
– can try to optimise code according to architecture

  • CPU specific optimisation may reduce performance

– optimise for P4 → might be slower than generic code on P3

  • ISA extensions remove backwards compatibility

– optimise for P4 → SSE not available on P2

slide-16
SLIDE 16

dt10 2011 13.16

Compiling to avoid hazards

  • data hazards

– instruction scheduling: avoid load-use data hazard
– register allocation: avoid immediate re-use of registers
– MIPS: large number of registers to make this easier

  • structural hazards

– instruction selection: select simple instructions
– e.g.: sll $1,$2,1 vs. add $1,$2,$2
– instruction scheduling: move instructions apart

  • control hazards

– instruction selection: eliminate branches if possible
– e.g.: cmov: conditional move
– e.g.: predicated execution
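As an illustration of instruction scheduling for data hazards, the sketch below counts load-use stalls in a short instruction sequence and shows that reordering independent instructions removes them. It assumes a pipeline with forwarding where only a load followed immediately by a consumer stalls (one cycle); the instruction tuples and register numbers are invented for the example.

```python
def load_use_stalls(program):
    """Count load-use stalls, one per load immediately followed by a user.

    program: list of (opcode, destination, sources) tuples in program order.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return stalls


naive = [
    ("lw", "$1", ("$10",)),
    ("add", "$3", ("$1", "$2")),  # uses $1 right after its load
    ("lw", "$4", ("$11",)),
    ("add", "$5", ("$4", "$2")),  # uses $4 right after its load
]
# Hoist the second load: it does not depend on the first add.
scheduled = [naive[0], naive[2], naive[1], naive[3]]

print(load_use_stalls(naive), load_use_stalls(scheduled))
```

The reordering is only legal because the hoisted load neither reads nor writes a register used by the instruction it moves past.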

slide-17
SLIDE 17

dt10 2011 13.17

Exceptions and interrupts

  • unexpected events requiring change in flow of control

– different ISAs use the terms differently

  • exception: internal signal, arises from within the CPU

– e.g. undefined opcode, overflow, syscall, …

  • interrupt: external signal, source is outside CPU

– e.g. external IO: hard disk saying “your data is ready now”!

  • must handle them without sacrificing performance

– exceptions are... exceptional: not the common/expected case
– interrupts are frequent, but not that frequent
– CPU instruction rate: >1 GHz; interrupt rate: <10 kHz
slide-18
SLIDE 18

dt10 2011 13.18

Handling exceptions in MIPS

  • exceptions managed by System Control Coprocessor

– follows set of steps to record and handle exception

  • 1. save PC of offending (or interrupted) instruction

– in Exception Program Counter, EPC

  • 2. save indication of the problem

– in Cause Register
– 0 for undefined opcode, 1 for overflow

  • 3. jump to handler at 8000 0180
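The three steps can be written out as a tiny model of the processor state. This is only a sketch: the dictionary-based `cpu` state and the function name are hypothetical, and the cause codes are the two given on the slide.

```python
HANDLER_ADDRESS = 0x80000180  # fixed handler entry point
CAUSE_UNDEFINED_OPCODE = 0
CAUSE_OVERFLOW = 1


def raise_exception(cpu, cause):
    """Apply the three exception-entry steps to a dict-based CPU state."""
    cpu["epc"] = cpu["pc"]       # 1. save PC of the offending instruction
    cpu["cause"] = cause         # 2. record what went wrong
    cpu["pc"] = HANDLER_ADDRESS  # 3. continue fetching from the handler
    return cpu


cpu = {"pc": 0x4C, "epc": 0, "cause": 0}
raise_exception(cpu, CAUSE_OVERFLOW)
print(hex(cpu["epc"]), cpu["cause"], hex(cpu["pc"]))
```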
slide-19
SLIDE 19

dt10 2011 13.19

An alternate mechanism

  • vectored interrupts

– handler address determined by Cause Register

  • example:

– undefined opcode: C000 0000
– overflow: C000 0020
– …: C000 0040

  • instructions either

– deal with the interrupt, or jump to real handler
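In the example addresses, successive entries are 0x20 bytes apart. A minimal sketch of the address calculation, assuming the cause code is used directly as the table index; the constant and function names are illustrative, and the base address is the one from the example rather than a documented MIPS value.

```python
VECTOR_BASE = 0xC0000000
VECTOR_STRIDE = 0x20  # 32 bytes: room for 8 four-byte instructions per entry


def vector_address(cause):
    """Handler entry address for a given cause code."""
    return VECTOR_BASE + cause * VECTOR_STRIDE


print(hex(vector_address(0)), hex(vector_address(1)), hex(vector_address(2)))
```

Eight instructions is rarely enough for a full handler, which is why an entry often just jumps to the real handler.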

slide-20
SLIDE 20

dt10 2011 13.20

Handler actions

  • read Cause Register, and transfer to relevant handler
  • determine action required
  • if restartable

– take corrective action
– use EPC to return to program

  • otherwise

– terminate program
– report error using EPC, Cause, …

slide-21
SLIDE 21

dt10 2011 13.21

Exceptions in a pipeline

  • another form of control hazard
  • consider exception on add in EX stage

add $1, $2, $1

– prevent $1 from being clobbered
– complete previous instructions
– flush add and subsequent instructions
– set Cause and EPC register values
– transfer control to handler

  • similar to mispredicted branch

– use much of the same hardware

slide-22
SLIDE 22

dt10 2011 13.22

Pipeline with exceptions

slide-23
SLIDE 23

dt10 2011 13.23

Exception properties

  • restartable exceptions

– pipeline can flush the instruction
– handler executes, then returns to the instruction
– refetched and executed from scratch

  • PC saved in EPC register

– identifies offending instruction
– actually PC + 4 is saved, handler must adjust
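The adjustment is a single subtraction. As a sketch, assuming 4-byte instructions and an offending instruction at address 0x4C (an illustrative value); the function name is hypothetical.

```python
INSTRUCTION_BYTES = 4


def restart_address(saved_epc):
    """Address to resume at when the hardware saved PC + 4 in EPC."""
    return saved_epc - INSTRUCTION_BYTES


offending_pc = 0x4C
saved_epc = offending_pc + INSTRUCTION_BYTES  # what the hardware stores
print(hex(saved_epc), hex(restart_address(saved_epc)))
```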

slide-24
SLIDE 24

dt10 2011 13.24

Exception example

  • exception on add in:

40 sub $11, $2, $4
44 and $12, $2, $5
48 or $13, $2, $6
4C add $1, $2, $1
50 slt $15, $6, $7
54 lw $16, 50($7)
…

  • handler

80000180 sw $25, 1000($0)
80000184 sw $26, 1004($0)
…

slide-25
SLIDE 25

dt10 2011 13.25

Exception example

slide-26
SLIDE 26

dt10 2011 13.26

Exception example

slide-27
SLIDE 27

dt10 2011 13.27

Multiple exceptions

  • pipelining overlaps multiple instructions

– could have multiple exceptions at once

  • simple way: deal with exception from earliest instruction

– flush subsequent instructions
– “precise” exceptions

  • in complex pipelines

– multiple instructions issued per cycle
– out-of-order completion
– maintaining precise exceptions is difficult!
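The "simple way" for an in-order pipeline can be sketched as follows: among the instructions that fault in the same cycle, take the exception of the one that is earliest in program order, let older instructions complete, and flush the rest. The sequence numbers and cause strings are illustrative assumptions.

```python
def handle_simultaneous(in_flight, faults):
    """Pick the exception to take when several are raised in one cycle.

    in_flight: sequence numbers of instructions in the pipeline, in
               program order (smaller means earlier).
    faults:    non-empty dict mapping sequence number to cause.
    Returns ((seq, cause) taken, instructions to complete, instructions to flush).
    """
    first = min(faults)  # earliest faulting instruction in program order
    complete = [s for s in in_flight if s < first]
    flush = [s for s in in_flight if s >= first]
    return (first, faults[first]), complete, flush


taken, complete, flush = handle_simultaneous(
    [3, 4, 5, 6, 7], {6: "undefined opcode", 5: "overflow"}
)
print(taken, complete, flush)
```

The exception raised by instruction 6 is simply dropped: if instruction 6 is re-executed after the handler returns, it will fault again on its own.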

slide-28
SLIDE 28

dt10 2011 13.28

Pipelining: summary

  • ISA influences design of datapath and control
  • datapath and control influence design of ISA
  • pipelining improves instruction throughput

– using parallelism
– more instructions completed per second
– but latency for each instruction not reduced

  • hazards: structural, data, control
  • advanced issues

– instruction-level parallelism, system-on-chip
– courses: custom computing, advanced architectures
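The throughput/latency point in the summary can be made concrete with a cycle count. This is an idealised sketch that assumes equal-length stages and no stalls or flushes; the instruction count and stage count are arbitrary example values.

```python
def unpipelined_cycles(n_instructions, n_stages):
    """Each instruction occupies the whole datapath for n_stages cycles."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """First instruction takes n_stages cycles, then one completes per cycle."""
    return n_stages + n_instructions - 1


n, k = 1000, 5
speedup = unpipelined_cycles(n, k) / pipelined_cycles(n, k)
print(unpipelined_cycles(n, k), pipelined_cycles(n, k), round(speedup, 2))
```

Every instruction still spends `k` cycles in the pipeline, so the latency of a single instruction is unchanged; only the rate of completion improves, approaching a factor of `k` for long instruction streams.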