CS356 Unit 12a: Processor Hardware Organization, Basic HW, Pipelining


12a.1

CS356 Unit 12a

Processor Hardware Organization Pipelining

12a.2

BASIC HW

12a.3

Logic Circuits

  • Combinational logic

– Performs a specific function (mapping of ___ input combinations to desired output combinations)

– No internal state or feedback

  • Given a set of inputs, we will always get the same output after some time (propagation) delay

  • Sequential logic (Storage devices)

– Registers are the fundamental building blocks

  • Remembers a set of bits for later use
  • Acts like a variable from software
  • Controlled by a "clock" signal

Figure: Combinational Logic. Inputs feed a block of combinational logic (usually operations like +, -, *, /, &, |, <<) that produces the outputs. Outputs depend only on current inputs.

Figure: Sequential Logic. Current inputs and a register holding "state" feed the combinational logic; sequential values feed back and provide "memory". Outputs depend on current inputs and previous inputs (previous inputs summarized via state).

12a.4

Combinational Logic Gates

  • Main Idea: Circuits called "gates" perform

logic operations to produce desired outputs from some digital inputs

Figure: example gate circuit built from OR, AND, and NOT gates, with signals labeled N, P, R, V, T, B, and C.


12a.5

Propagation Delay

  • Main Idea: All digital logic circuits have

propagation delay

– Time it takes for output to change when inputs are changed

Figure: the same gate circuit (signals N, P, R, V, T, B, C): 4 gate delays for an input to propagate to the outputs.

12a.6

Combinational Logic Functions

  • Map input combinations of n-bits to desired

m-bit output

  • Can describe function with a truth table and

then find its circuit implementation

Figure: a logic circuit with inputs IN0, IN1, IN2 and outputs OUT0, OUT1, shown alongside its truth table.

12a.7

ALUs

  • Perform a selected operation on two input numbers.

– FS[5:0] select the desired operation

Figure: ALU block with 32-bit inputs X[31:0] (A) and Y[31:0] (B), carry-in C0, a 6-bit function select FS[5:0] (Func), a 32-bit result RES[31:0], and flag outputs (OF, ZF, CF).

Func. Code   Op.             Func. Code   Op.
00_0000      A SHL B         10_0000      A+B
00_0010      A SHR B         10_0010      A-B
00_0011      A SAR B
…            …               …            …
01_1000      A * B           10_0100      A AND B
01_1001      A * B (uns.)    10_0101      A OR B
01_1010      A / B           10_0110      A XOR B
01_1011      A / B (uns.)    10_0111      A NOR B
…            …               …            …
                             10_1010      A < B

Exercise: with FS = 100010, X = 0x00000001, and Y = 0x80000000, find RES = __________ and the flags CF = __, OF = __, ZF = __.
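The select-then-compute behavior of the ALU can be sketched in a few lines of Python. This is only an illustration covering a handful of the function codes in the table; the function name `alu`, the helper `to_signed`, and the flag conventions (CF treated as a borrow on subtraction, as x86 does) are assumptions made for the sketch, not something the slides specify.

```python
MASK = 0xFFFFFFFF  # 32-bit datapath


def to_signed(v):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    return v - (1 << 32) if v & 0x80000000 else v


def alu(fs, a, b):
    """Return (res, cf, of, zf, sf) for a few function codes from the table."""
    cf = of = 0
    if fs == 0b100000:            # A + B
        full = a + b
        res = full & MASK
        cf = int(full > MASK)     # carry out of bit 31
        of = int(to_signed(a) + to_signed(b) != to_signed(res))
    elif fs == 0b100010:          # A - B
        res = (a - b) & MASK
        cf = int(a < b)           # borrow (assumed x86-style CF for subtraction)
        of = int(to_signed(a) - to_signed(b) != to_signed(res))
    elif fs == 0b100100:          # A AND B
        res = a & b
    elif fs == 0b100101:          # A OR B
        res = a | b
    elif fs == 0b100110:          # A XOR B
        res = a ^ b
    else:
        raise ValueError("function code not modeled in this sketch")
    zf = int(res == 0)            # zero flag
    sf = res >> 31                # sign flag = bit 31 of the result
    return res, cf, of, zf, sf
```

Calling `alu(0b100010, x, y)` with the exercise operands is one way to check a hand-computed answer.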

12a.8

Sequential Devices (Registers)

  • Registers sample the D input value when a control input (aka the clock signal) transitions from 0 to 1 (clock edge) and store that value at the Q output until the next clock edge

  • A register is like a variable in software. It stores a value for later use.

  • We can choose to only clock the register at "________" times when we want the register to capture a new value (i.e. when it is the destination of an instruction)

  • Key Idea: Registers hold data while we operate on those values

Figure: Block Diagram of a Register. The clock pulse (positive edge) causes q(t) to sample and hold the current d(t) value until another clock pulse.

Figure: %rax feeds the ALU, and the ALU sum is captured back into %rax, for the sequence:

add %rbx,%rax
add %rcx,%rax
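A register's sample-and-hold behavior can be sketched as a small Python class. The class name `Register`, the method `tick`, and the starting values are hypothetical choices for illustration; the sketch assumes a positive-edge-triggered register as described in the figure.

```python
class Register:
    """Edge-triggered register: q changes only on a rising clock edge."""

    def __init__(self, q=0):
        self.q = q
        self._clk = 0          # last clock level seen

    def tick(self, clk, d):
        if self._clk == 0 and clk == 1:   # rising (positive) edge
            self.q = d                    # sample d and hold it
        self._clk = clk
        return self.q


rax = Register(q=10)   # assumed starting value of %rax
rbx, rcx = 3, 4        # assumed values of %rbx and %rcx

# add %rbx,%rax : the ALU computes the sum; %rax captures it on the clock edge
rax.tick(0, rax.q + rbx)   # clock low: the sum is ignored, q stays 10
rax.tick(1, rax.q + rbx)   # rising edge: q becomes 13
# add %rcx,%rax
rax.tick(0, rax.q + rcx)
rax.tick(1, rax.q + rcx)   # rising edge: q becomes 17
```

Between edges the ALU output may change freely; only the value present at the rising edge is stored.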


12a.9

Clock Signal

  • Alternating high/low voltage

pulse train

  • Controls the ordering and timing of operations performed in the processor

  • 1 cycle is usually measured from rising edge to rising edge

  • Clock frequency = # of cycles per second (e.g. 2.8 GHz = 2.8 * 10^9 cycles per second)

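Figure: the clock signal entering the processor alternates between 0 (0V) and 1 (5V); 1 cycle spans one rising edge to the next, and successive cycles sequence Op. 1, Op. 2, Op. 3.

2.8 GHz = 2.8 * 10^9 cycles per second = 0.357 ns/cycle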
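The frequency-to-period conversion on this slide is a one-line calculation. The helper names `period_ns` and `freq_mhz` below are hypothetical, chosen only for this sketch.

```python
def period_ns(freq_hz):
    """Clock period in nanoseconds for a frequency given in hertz."""
    return 1e9 / freq_hz


def freq_mhz(period_in_ns):
    """Clock frequency in megahertz for a period given in nanoseconds."""
    return 1e3 / period_in_ns


print(f"{period_ns(2.8e9):.3f} ns/cycle")   # 0.357 ns/cycle
print(f"{freq_mhz(20):.0f} MHz")            # 50 MHz
```

The second call uses a 20 ns period, the same figure used for the unpipelined processor later in the unit.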

12a.10

FROM X86 TO RISC

Basic HW organization for a simplified instruction set

12a.11

From CISC to RISC

  • Complex Instruction Set Computers often have instructions that vary widely in how much work they perform and how much time they take to execute

– Benefit is fewer instructions are needed to accomplish a task

  • Reduced Instruction Set Computers favor instructions that take roughly the same time to execute and follow a common sequence of steps

– It often requires more instructions to describe the overall task (larger code size)

  • See the CISC vs. RISC Equivalents example
  • RISC makes the hardware design easier so let's tweak our x86 instructions to be more RISC-like
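The equivalence between the single CISC instruction `movq 0x40(%rdi, %rsi, 4), %rax` and a RISC-style sequence of one-operation instructions can be checked numerically. The sketch below models only the address arithmetic; the function names and the register values are made up for illustration.

```python
def cisc_address(regs):
    """Address used by movq 0x40(%rdi,%rsi,4), %rax."""
    return 0x40 + regs["rdi"] + regs["rsi"] * 4


def risc_address(regs):
    """Same address built one ALU operation at a time, with %rbx as a temporary."""
    rbx = regs["rsi"]         # mov %rsi, %rbx
    rbx = rbx << 2            # shl $2, %rbx     -> %rsi * 4
    rbx = rbx + regs["rdi"]   # add %rdi, %rbx   -> %rdi + (%rsi*4)
    rbx = rbx + 0x40          # add $0x40, %rbx  -> 0x40 + %rdi + (%rsi*4)
    return rbx                # mov (%rbx), %rax would read memory here


regs = {"rdi": 0x1000, "rsi": 3}   # hypothetical register contents
assert cisc_address(regs) == risc_address(regs) == 0x104C
```

Each line of `risc_address` corresponds to one instruction that does a single ALU operation, which is the property the RISC subset relies on.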

CISC vs. RISC Equivalents

// CISC instruction
movq 0x40(%rdi, %rsi, 4), %rax

// RISC Equiv. w/ 1 mem. or ALU op. per instruction
mov %rsi, %rbx      # use %rbx as a temp.
shl $2, %rbx        # %rsi * 4
add %rdi, %rbx      # %rdi + (%rsi*4)
add $0x40, %rbx     # 0x40 + %rdi + (%rsi*4)
mov (%rbx), %rax    # %rax = *%rbx

12a.12

A RISC Subset of x86

  • Split mov instructions that access memory into separate instructions:

– ld = load (read) from memory
– st = store (write) to memory

  • Limit ld & st instructions to use at most a displacement plus one base register

– No ld 0x04(%rdi, %rsi, 4), %rax

  • Too much work

– At most ld disp(%reg), %rax or st %rax, disp(%reg)

  • Limit arithmetic & logic instructions to only operate on registers

– No add (%rsp), %rax since this implicitly accesses (dereferences) memory
– Only add %reg1, %reg2

// CISC instruction
add %rax, (%rsp)

// Equiv. RISC sequence (w/ ld and st)
ld  0(%rsp), %rbx
add %rax, %rbx
st  %rbx, 0(%rsp)

// 3 x86 memory read instructions
mov (%rdi), %rax              // 1
mov 0x40(%rdi), %rax          // 2
mov 0x40(%rdi,%rsi), %rax     // 3

// Equiv. load sequences
ld  0x0(%rdi), %rax           // 1
ld  0x40(%rdi), %rax          // 2
mov %rsi, %rbx                // 3a
add %rdi, %rbx                // 3b
ld  0x40(%rbx), %rax          // 3c

// 3 x86 memory write instructions
mov %rax, (%rdi)              // 1
mov %rax, 0x40(%rdi)          // 2
mov %rax, 0x40(%rdi,%rsi)     // 3

// Equiv. store sequences
st  %rax, 0x0(%rdi)           // 1
st  %rax, 0x40(%rdi)          // 2
mov %rsi, %rbx                // 3a
add %rdi, %rbx                // 3b
st  %rax, 0x40(%rbx)          // 3c


12a.13

Developing a Processor Organization

  • Identify which hardware components each instruction type

would use and in what order: ALU-Type, LD, ST, Jump

Components: PC, I-Cache / I-MEM (Addr., Data), Registers (%rax, %rbx, etc., aka RegFile), ALU (Res., Zero), Cond. Codes, D-Cache / D-MEM (Addr., Data)

Instruction types to trace through the components (steps 1. to 6.):

  • ALU-Type: add %rax,%rbx
  • LD: ld 8(%rax),%rbx
  • ST: st %rbx, 8(%rax)
  • JE: je label/displacement

12a.14

Processor Block Diagram

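Figure: five stages in sequence, each with 10 ns of delay: Fetch (PC, I-Cache), Decode (Registers, aka RegFile), Exec. (ALU, condition codes ZF/OF/CF/SF), Mem (D-Cache), WB. Fetch passes the instruction (machine code) to Decode, Decode passes the operands to Exec., Exec. passes the ALU output (addr. or result) to Mem, and WB receives the data to write to the dest. register. Decode also produces the control signals (e.g. ALU operation, Read/Write D-Cache, etc.).

Clock Cycle Time = Sum of delay through worst case pathway = 50 ns (five stages at 10 ns each)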

12a.15

Processor Execution (add)

Figure: the same datapath executing add %rax,%rdx [Machine Code: 48 01 c2]. Decode reads %rax and %rdx, the ALU computes %rax+%rdx, and WB writes %rdx = %rax+%rdx.

12a.16

Processor Execution (load)

Figure: the same datapath executing ld 0x40(%rbx),%rax [Machine Code: 48 8b 43 40]. Decode reads %rbx, the ALU adds 0x40 to form the address (addr), the D-Cache returns the data at that address, and WB writes %rax = data.


12a.17

Processor Execution (store)

Figure: the same datapath executing st %rax,0x40(%rbx) [Machine Code: 48 89 43 40]. Decode reads %rbx and %rax, the ALU adds 0x40 to %rbx to form the address (addr), and the D-Cache writes the %rax value at that address.

12a.18

Processor Execution (branch/jump)

Figure: the same datapath executing je L1 (disp. = 0x08) [Machine Code: 74 08]. The ALU computes PC + 0x08 from the PC and the displacement 0x08, and the condition codes (ZF, OF, CF, SF) determine whether that value becomes the new PC.

12a.19

PIPELINING

12a.20

Example

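for(i=0; i < 100; i++)
  C[i] = (A[i] + B[i]) / 4;

10 ns per input set = 1000 ns total

Figure: a counter (Cntr) supplies i to index the arrays A, B, and C in memory; A[i] and B[i] are read and combined, and the result is stored to C[i].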


12a.21

Pipelining Example

Figure: the logic split into Stage 1 and Stage 2 with a register between them, shown at Clock 0, Clock 1, and Clock 2, for the loop:

for(i=0; i < 100; i++)
  C[i] = (A[i] + B[i]) / 4;

Pipelining refers to insertion of registers to split combinational logic into smaller stages that can be overlapped in time (i.e. create an assembly line)
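The benefit of overlapping stages can be estimated with a short calculation. The sketch below assumes the 10 ns of logic splits into two equal 5 ns stages and that the inserted register adds no delay; neither number is given on the slide, and the function names are hypothetical.

```python
def unpipelined_ns(n_items, logic_delay_ns):
    """Total time when each input set passes through all the logic before the next starts."""
    return n_items * logic_delay_ns


def pipelined_ns(n_items, stage_delays_ns):
    """Total time with one register-separated stage per entry in stage_delays_ns.

    The clock period is set by the slowest stage. The first item needs one
    cycle per stage; every later item finishes one cycle after the previous one.
    """
    period = max(stage_delays_ns)
    cycles = len(stage_delays_ns) + n_items - 1
    return cycles * period


print(unpipelined_ns(100, 10))     # 1000
print(pipelined_ns(100, [5, 5]))   # 505
```

With 100 input sets the two-stage version takes roughly half the time, because after the first result a new one completes every 5 ns.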

12a.22

Need for Registers

  • Provides separation between combinational functions

– Without registers, fast signals could “catch up” to data values in the next operation stage

Figure: performing an operation yields signals with different paths and delays (e.g. Signal i takes 5 ns and Signal j takes 2 ns). We don't want signals from two different data values mixing. Therefore we must collect and synchronize the values from the previous operation in a clocked (CLK) register before passing them on to the next.

12a.23

Processors & Pipelines

  • Overlaps execution of multiple instructions
  • Natural breakdown into stages

– Fetch, Decode, Execute

  • Fetch an instruction, while decoding another, while

executing another

Pipelining (Instruction View)

          CLK 1    CLK 2    CLK 3    CLK 4
Inst 1    Fetch    Decode   Exec.
Inst 2             Fetch    Decode   Exec.
Inst 3                      Fetch    Decode
Inst 4                               Fetch

Pipelining (Stage View)

         Fetch      Decode     Exec.
Clk 1    Inst. 1
Clk 2    Inst. 2    Inst. 1
Clk 3    Inst. 3    Inst. 2    Inst. 1
Clk 4    Inst. 4    Inst. 3    Inst. 2
Clk 5    Inst. 5    Inst. 4    Inst. 3
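The stage view can be generated mechanically from one rule: instruction i enters stage s (counting Fetch as 0) at clock i + s. The sketch below is an idealized three-stage pipeline with no hazards; the names `STAGES` and `stage_view` are hypothetical.

```python
STAGES = ["Fetch", "Decode", "Exec."]


def stage_view(n_insts, n_clocks):
    """rows[c][s] is the instruction number in stage s during clock c+1, or None."""
    rows = []
    for clk in range(1, n_clocks + 1):
        row = []
        for s in range(len(STAGES)):
            inst = clk - s   # instruction i is in stage s at clock i + s
            row.append(inst if 1 <= inst <= n_insts else None)
        rows.append(row)
    return rows


for clk, row in enumerate(stage_view(5, 5), start=1):
    cells = ["-" if i is None else f"Inst {i}" for i in row]
    print(f"Clk {clk}:", "  ".join(cells))
```

Once the pipeline is full (from Clk 3 on), every stage holds a different instruction in every cycle.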

12a.24

Balancing Pipeline Stages

  • Clock period must equal the LONGEST

delay from register to register

  • Fig. 1: If total logic delay is 20ns => 50MHz

– Throughput: 1 instruc. / 20 ns

  • Fig. 2: Unbalanced stage delays limit the

clock speed to the slowest stage (worst case)

– Throughput: 1 instruc. / 10 ns => 100MHz

  • Fig. 3: Better to split into more, balanced

stages

– Throughput: 1 instruc. / 5 ns => 200MHz

  • Fig. 4: Are more stages better?

– Ideally: 2x stages => 2x throughput
– Throughput: 1 instruc. / 2.5 ns => 400MHz
– Each register adds extra delay so at some point deeper pipelines don't pay off

  • Fig. 1: Processor Logic (Fetch + Decode + Execute), 20 ns, followed by a register
  • Fig. 2: Fetch (5 ns), Decode (5 ns), Exec (10 ns), each followed by a register
  • Fig. 3: Fetch, Decode, Exec. 1, Exec. 2, each 5 ns and each followed by a register
  • Fig. 4: F1, F2, D1, D2, E1a, E1b, E2a, E2b, each 2.5 ns
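The throughput numbers for the four figures follow from the rule that the clock period equals the longest register-to-register delay. The sketch below reproduces them and adds an optional per-register overhead to show the diminishing returns; the 0.5 ns overhead is an assumed value for illustration, not a number from the slides, and the function names are hypothetical.

```python
def clock_period_ns(stage_delays_ns, reg_overhead_ns=0.0):
    """Clock period = slowest stage plus any per-register overhead."""
    return max(stage_delays_ns) + reg_overhead_ns


def throughput_mhz(stage_delays_ns, reg_overhead_ns=0.0):
    """Instructions per microsecond (MHz) when one instruction completes per cycle."""
    return 1e3 / clock_period_ns(stage_delays_ns, reg_overhead_ns)


figs = {
    "Fig. 1": [20],
    "Fig. 2": [5, 5, 10],
    "Fig. 3": [5, 5, 5, 5],
    "Fig. 4": [2.5] * 8,
}
for name, stages in figs.items():
    ideal = throughput_mhz(stages)
    real = throughput_mhz(stages, reg_overhead_ns=0.5)   # assumed overhead
    print(f"{name}: ideal {ideal:.0f} MHz, with register overhead {real:.1f} MHz")
```

With the overhead included, going from 4 to 8 stages no longer doubles the throughput, which is the point of the last sub-bullet under Fig. 4.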

12a.25

Balancing Pipeline Stages

Main Points:

  • Latency of any single instruction is unchanged

  • Throughput and thus overall program performance can be dramatically improved

– Ideally K stage pipeline will lead to throughput increase by a factor of K
– Reality is splitting stages adds some overhead and thus hits a point of diminishing returns

  • Fig. 1: Processor Logic (Fetch + Decode + Execute), 20 ns. Non-pipelined (Latency = 20ns, Throughput = 1x)
  • Fig. 2: Fetch, Decode, Exec. 1, Exec. 2, each 5 ns. 4 Stage Pipeline (Latency = 20ns, Throughput = 4x)
  • Fig. 3: F1, F2, D1, D2, E1a, E1b, E2a, E2b, each 2.5 ns. 8 Stage Pipeline (Latency = 20ns, Throughput = 8x)

12a.26

5-Stage Pipeline

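Figure: the datapath with pipeline registers between the five stages. Fetch (PC, I-Cache) passes the instruction (machine code) to Decode (Reg. File), which passes operands & control signals to Exec. (ALU, condition codes ZF/OF/CF/SF), which passes the ALU output (addr. or result) to Mem (D-Cache), which passes the result of the instruction to WB.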

12a.27

Pipelining

  • Let's see how a sequence of instructions can

be executed

Instruction

ld  0x8(%rbx), %rax
add %rcx,%rdx
je  L1

12a.28

Sample Sequence - 1

  • Fetch: LD (fetch LD)


12a.29

Sample Sequence - 2

  • Fetch: ADD (fetch ADD)
  • Decode: LD 0x8(%rbx), %rax (decode instruction and fetch operands)

12a.30

Sample Sequence - 3

  • Fetch: JE (fetch JE)
  • Decode: ADD %rcx, %rdx (decode instruction and fetch operands)
  • Exec.: LD (0x8 / %rbx / READ; add displacement 0x8 to %rbx)

12a.31

Sample Sequence - 4

  • Fetch: i+1 (fetch instruc. i+1)
  • Decode: JE / displacement (decode instruction and fetch operands)
  • Exec.: ADD (%rcx / %rdx / ADD; add %rcx+%rdx)
  • Mem: LD (%rbx + 0x8 / READ; read word from memory)

12a.32

Sample Sequence - 5

  • Fetch: i+2 (fetch next instruc. i+2)
  • Decode: i+1 (instruc. i+1 machine code; decode instruction i+1 and fetch operands)
  • Exec.: JE (PC / displacement; check if condition is true)
  • Mem: ADD (%rdx / %rcx + %rdx; just pass sum to next stage)
  • WB: LD (%rax / value from memory; write word to %rax)


12a.33

Sample Sequence - 6

  • Fetch: i+3 (fetch next instruc. i+3)
  • Decode: i+2 (instruc. i+2 machine code; decode instruction i+2 and fetch operands)
  • Exec.: i+1 (instruc. i+1 operands; use the ALU)
  • Mem: JE (new PC; update PC)
  • WB: ADD (%rdx / sum; write word to %rdx)

12a.34

Sample Sequence - 7

  • Fetch: target (fetch the instruction at the jump target)
  • Decode: i+3 (deleted)
  • Exec.: i+2 (deleted)
  • Mem: i+1 (deleted)
  • WB: JE (do nothing)

12a.35

HAZARDS

Problems from overlapping instruction execution…

12a.36

Hazards

  • Hazards prevent parallel or overlapped execution!
  • Control Hazards

– Problem: We don't know what instruction comes next but we need to
– Examples: Jumps (branches) and calls

  • Data Hazards / Data Dependencies

– Problem: When a later instruction needs data from a previous instruction
– Examples:

  • sub %rdx,%rax
  • add %rax,%rcx
  • Structural Hazards

– Problem: Due to limited resources, the HW doesn't support overlapping a certain sequence of instructions

– Examples: See next slides


12a.37

Structural Hazards

  • Example structural hazard: A single cache rather

than separate instruction & data caches

– Structural hazard any time an instruction needs to perform a data access (i.e. ld or st) since we always want to fetch a new instruction each clock cycle

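Figure: the pipeline with a single Cache shared by Fetch and Mem. LD is in the Mem stage while i+3 is being fetched (i+2 and i+1 are in Decode and Exec.), so both need the cache in the same cycle. Hazard!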

12a.38

Data Hazard - 1

sub %rdx,%rax
add %rax,%rcx

  • Fetch: i+1 (fetch i+1)
  • Decode: ADD %rax, %rcx (decode and get register operands; do we get the desired %rax value?)
  • Exec.: SUB (%rdx / %rax / SUB; perform %rax-%rdx)

12a.39

Data Hazard - 2

sub %rdx,%rax
add %rax,%rcx

  • Fetch: i+2 (fetch i+2)
  • Decode: i+1 (decode i+1)
  • Exec.: ADD (%rax / %rcx / ADD; perform %rax+%rcx using the wrong value!)
  • Mem: SUB (new value for %rax, which has not been written back yet)

12a.40

Stalling

  • Solution 1: Stall the ADD instruction in the DECODE stage and insert nops into the pipeline until the new value of the needed register is present, at the cost of lower performance

sub %rdx,%rax
add %rax,%rcx

  • Fetch: i+1
  • Decode: ADD %rax, %rcx (stalled)
  • Exec.: nop
  • Mem: nop
  • WB: SUB (new value of %rax)


12a.41

Forwarding

  • Solution 2: Create new hardware paths to hand-off (forward) the data from the producing instruction in the pipeline to the consuming instruction

sub %rdx,%rax
add %rax,%rcx

  • Fetch: i+2
  • Decode: i+1
  • Exec.: ADD (%rax / %rcx / ADD), receiving the forwarded new value for %rax
  • Mem: SUB (new value for %rax)

12a.42

Solving Data Hazards

  • Key Point: Data dependencies (i.e. instructions

needing values produced by earlier ones) limit performance

  • Forwarding solves many of the data hazards (data

dependencies) that exist

– It allows instructions to continue to flow through the pipeline without the need to stall and waste time
– The cost is additional hardware and added complexity

  • Even forwarding cannot solve all the issues

– A data hazard still exists when a load produces a value needed by the next instruction

12a.43

LD + Dependent Instruction Hazard

  • Even forwarding cannot prevent the need to stall when a Load instruction produces a value needed by the instruction behind it

– Would require performing 2 cycles worth of work in only a single cycle

ld 8(%rdx),%rax
add %rax,%rcx

  • Fetch: i+2
  • Decode: i+1
  • Exec.: ADD (%old rax / %rcx / ADD)
  • Mem: LD (Address (8+%rdx) / READ), producing the new value for %rax

12a.44

LD + Dependent Instruction Hazard

  • We would need to introduce 1 stall cycle (nop) into the pipeline to get the timing correct
  • Keep this in mind as we move through the next slides

ld 8(%rdx),%rax
add %rax,%rcx

  • Fetch: i+2
  • Decode: i+1
  • Exec.: ADD (%old rax / %rcx / ADD), receiving the forwarded new value for %rax
  • Mem: nop
  • WB: LD (new value of %rax)
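The stall counts from the last few slides (two nops without forwarding, none with forwarding, and one for a load followed by a dependent instruction) can be reproduced with a small model. This is a sketch under stated assumptions: a 5-stage pipeline, a register file that can be written in WB and read in Decode during the same cycle, forwarding that delivers an ALU result to the very next instruction, and loaded data that arrives one cycle later. The names `Inst` and `stall_cycles` are hypothetical.

```python
from collections import namedtuple

Inst = namedtuple("Inst", "op srcs dst")


def stall_cycles(program, forwarding):
    """Count the nops needed so every instruction sees up-to-date source registers."""
    decode_ok = {}   # register -> earliest cycle in which a reader may be in Decode
    cycle = 0        # Decode cycle of the previous instruction
    stalls = 0
    for inst in program:
        earliest = cycle + 1                       # back-to-back issue
        needed = max([decode_ok.get(r, 0) for r in inst.srcs] + [earliest])
        stalls += needed - earliest
        cycle = needed
        if inst.dst is not None:
            if not forwarding:
                gap = 3      # reader decodes while the producer is in WB
            elif inst.op == "ld":
                gap = 2      # loaded value is forwarded after the Mem stage
            else:
                gap = 1      # ALU result is forwarded to the next instruction
            decode_ok[inst.dst] = cycle + gap
    return stalls


alu_pair = [Inst("sub", ["%rdx", "%rax"], "%rax"),
            Inst("add", ["%rax", "%rcx"], "%rcx")]
load_pair = [Inst("ld", ["%rdx"], "%rax"),
             Inst("add", ["%rax", "%rcx"], "%rcx")]

print(stall_cycles(alu_pair, forwarding=False))   # 2
print(stall_cycles(alu_pair, forwarding=True))    # 0
print(stall_cycles(load_pair, forwarding=True))   # 1
```

Only the gap between producer and consumer changes between the three cases, which is why forwarding removes the ALU stalls but still leaves one cycle for the load.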


12a.45

Control Hazards

  • Branches/Jumps require us to know

– Where we want to jump to (aka branch/jump target location)…really just the new value of the PC
– If we should branch or not (checking the jump condition)

  • Problem: We often don't know those values until later in the pipeline and thus we are not sure what instructions should be fetched in the interim

– Requires us to flush unwanted instructions and waste time

12a.46

Control Hazard - 1

  • Fetch: i+3 (fetch next instruc. i+3)
  • Decode: i+2 (instruc. i+2 machine code; decode instruction i+2 and fetch operands)
  • Exec.: i+1 (instruc. i+1 operands; use the ALU)
  • Mem: JE (new PC; update PC)
  • WB: ADD (%rdx / sum; write word to %rdx)

12a.47

Control Hazard - 2

  • Fetch: target (fetch the instruction at the jump target)
  • Decode: i+3 (deleted)
  • Exec.: i+2 (deleted)
  • Mem: i+1 (deleted)
  • WB: JE (do nothing)

Need to "flush" wrongly fetched instructions

12a.48

A FIRST LOOK: CODE REORDERING

Enlisting the help of the compiler


12a.49

Two Sides of the Coin

  • If the hardware has some problems it just can't solve, can software (i.e. the compiler) help?

– Yes!

  • Compilers can reorder instructions to take best advantage of the processor (pipeline) organization

  • Identify the dependencies that will incur stalls and slow performance

– Loads followed by a dependent instruction
– Jump instructions

C code and its assembly translation

void sum(int* data, int n, int x)
{
  for(int i=0; i < n; i++){
    data[i] += x;
  }
}

sum: mov $0x0,%ecx
L1:  cmp %esi,%ecx
     jge L2
     ld  0(%rdi), %eax
     add %edx, %eax
     st  %eax, 0(%rdi)
     add $4, %rdi
     add $1, %ecx
     j   L1
L2:  retq

12a.50

How Can the Compiler Help

  • Compilers are written with general parsing and semantic representation front ends but processor-specific backends that generate code optimized for a particular processor

  • Q: How could the compiler help improve pipelined performance while still maintaining the external behavior that the high level code indicates?

  • A: By finding independent instructions and reordering the code

– Could we have moved any other instruction into that slot? ______
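One way a compiler backend might search for an instruction to place in the slot after a load is a register-dependency check. The sketch below applies that check to the loop body of `sum`; it is an illustration only, with hypothetical names (`Inst`, `can_hoist`, `fill_load_slot`), and it assumes register names never alias and ignores condition codes and memory dependencies.

```python
from collections import namedtuple

Inst = namedtuple("Inst", "text reads writes")

# Loop body of sum() starting at the load (registers read / written by each).
body = [
    Inst("ld 0(%rdi), %eax", {"%rdi"}, {"%eax"}),
    Inst("add %edx, %eax", {"%edx", "%eax"}, {"%eax"}),
    Inst("st %eax, 0(%rdi)", {"%eax", "%rdi"}, set()),
    Inst("add $4, %rdi", {"%rdi"}, {"%rdi"}),
    Inst("add $1, %ecx", {"%ecx"}, {"%ecx"}),
]


def can_hoist(body, src, dst):
    """True if body[src] may move up to position dst without changing results."""
    x = body[src]
    for y in body[dst:src]:   # every instruction that x would jump over
        if x.reads & y.writes or x.writes & y.reads or x.writes & y.writes:
            return False
    return True


def fill_load_slot(body, load_idx=0):
    """Move the first legal, load-independent instruction into the slot after the load."""
    slot = load_idx + 1
    load = body[load_idx]
    for i in range(slot + 1, len(body)):
        if not (body[i].reads & load.writes) and can_hoist(body, i, slot):
            return body[:slot] + [body[i]] + body[slot:i] + body[i + 1:]
    return body   # nothing found: the stall (nop) remains


for inst in fill_load_slot(body):
    print(inst.text)
```

Under these assumptions `add $4, %rdi` is rejected because the store between it and the load still reads `%rdi`, while `add $1, %ecx` touches no register used by the instructions it passes.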

Original Code (incurring 1 stall cycle)

sum: mov $0x0,%ecx
L1:  cmp %esi,%ecx
     jge L2
     ld  0(%rdi), %eax
     stall/nop
     add %edx, %eax
     st  %eax, 0(%rdi)
     add $4, %rdi
     add $1, %ecx
     j   L1
L2:  retq

Updated Code (w/ Compiler reordering)

sum: mov $0x0,%ecx
L1:  cmp %esi,%ecx
     jge L2
     ld  0(%rdi), %eax
     add $1, %ecx
     add %edx, %eax
     st  %eax, 0(%rdi)
     add $4, %rdi
     j   L1
L2:  retq

12a.51

Taken or Not Taken: Branch Behavior

  • When a conditional jump/branch is

– True, we say it is Taken
– False, we say it is Not Taken

  • Currently our pipeline will fetch sequentially and then potentially flush if the branch is taken

– Effectively, our pipeline "predicts" that each branch is Not Taken

  • The j L1 instruction is always Taken and thus will incur wasted clock cycles each time it is executed

  • Most of the time the jge L2 will be Not Taken and perform well
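The cost of always fetching sequentially can be estimated by counting taken branches in the `sum` loop. The sketch below assumes a penalty of 3 flushed instructions per taken branch, matching the three deleted instructions (i+1, i+2, i+3) in the sample sequence; the function names are hypothetical.

```python
def branch_trace(n):
    """(branch, taken) pairs in execution order for sum() over n elements."""
    trace = []
    i = 0
    while True:
        exit_loop = i >= n
        trace.append(("jge L2", exit_loop))   # taken only when the loop ends
        if exit_loop:
            return trace
        i += 1
        trace.append(("j L1", True))          # unconditional: always taken


def wasted_cycles(trace, flush_penalty=3):
    """Cycles lost to flushes when every branch is predicted not taken."""
    return sum(flush_penalty for _, taken in trace if taken)


trace = branch_trace(100)
print(sum(1 for _, taken in trace if taken))   # 101 taken branches
print(wasted_cycles(trace))                    # 303
```

Almost all of the lost cycles come from `j L1`, which is executed once per iteration and never falls through.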

C code and its assembly translation

sum: mov $0x0,%ecx
L1:  cmp %esi,%ecx
     jge L2               # NT most of the time
     ld  0(%rdi), %eax
     add $1, %ecx
     add %edx, %eax
     st  %eax, 0(%rdi)
     add $4, %rdi
     j   L1               # always T
L2:  retq

12a.52

Branch Delay Slots

  • Problem: After a jump/branch we fetch instructions that we are not sure should be executed

  • Idea: Find an instruction(s) that should always be executed (independent of whether branch is taken or not), move those instructions to directly after the branch, and have HW just let them be executed (not flushed) no matter what the branch outcome is
  • Branch delay slot(s) = instruction(s) that the HW will always execute (not flush) after a jump/branch instruction


12a.53

Branch Delay Slot Example

Assume a single instruction delay slot. Move an ALWAYS executed instruction down into the delay slot and let it execute no matter what.

Original order (delay slot not yet filled)

      ld  0(%rdi), %rcx
      add %rbx, %rax
      cmp %rcx, %rdx
      je  NEXT
      delay slot instruc.
      NOT TAKEN CODE
      …
NEXT: TAKEN CODE

With add %rbx, %rax moved into the delay slot

      ld  0(%rdi), %rcx
      cmp %rcx, %rdx
      je  NEXT
      add %rbx, %rax
      NOT TAKEN CODE
      …
NEXT: TAKEN CODE

Flowchart perspective of the delay slot: the “Before” Code (ld 0(%rdi), %rcx; add %rbx, %rax; cmp %rcx, %rdx) leads to je, whose T and NT edges lead to the Taken Path Code and the Not Taken Path Code, which rejoin at the “After” Code. The figure marks the Delay Slot in two places.

12a.54

Implementing Branch Delay Slots

  • HW will define the number of branch delay slots (usually a small number…1 or 2)

  • Compiler will be responsible for arranging instructions to fill the delay slots

– Must find instructions that the branch does NOT DEPEND on
– If no instructions can be rearranged, can always insert 'nop' and just waste those cycles

Example 1:

ld  0(%rdi), %rcx
add %rbx, %rax
cmp %rcx, %rdx
je  NEXT
delay slot instruc.

Cannot move ‘ld’ into delay slot because je needs the %rcx value generated by it

Example 2:

ld  0(%rdi), %rcx
add %rbx, %rax
cmp %rcx, %rax
je  NEXT
delay slot instruc.

If no instruction can be found a 'nop' can be inserted by the compiler
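The compiler's choice for the delay slot in the two examples can be sketched as a dependency check: an instruction may sink below the compare and the branch only if none of the instructions it passes read what it writes or write what it uses. The names `Inst`, `can_sink`, `fill_delay_slot`, and `example` are hypothetical, and the sketch does not model condition codes or memory dependencies.

```python
from collections import namedtuple

Inst = namedtuple("Inst", "text reads writes")


def can_sink(block, i):
    """True if block[i] may move below every later instruction in block."""
    x = block[i]
    return all(
        not (y.reads & x.writes or y.writes & x.reads or y.writes & x.writes)
        for y in block[i + 1:]
    )


def fill_delay_slot(block):
    """block ends with cmp and the branch; returns the new order including the delay slot."""
    for i in range(len(block) - 3, -1, -1):   # candidates above cmp, nearest first
        if can_sink(block, i):
            return block[:i] + block[i + 1:] + [block[i]]
    return block + [Inst("nop", set(), set())]


def example(cmp_text, cmp_reads):
    return [
        Inst("ld 0(%rdi), %rcx", {"%rdi"}, {"%rcx"}),
        Inst("add %rbx, %rax", {"%rbx", "%rax"}, {"%rax"}),
        Inst(cmp_text, cmp_reads, set()),
        Inst("je NEXT", set(), set()),
    ]


first = fill_delay_slot(example("cmp %rcx, %rdx", {"%rcx", "%rdx"}))
second = fill_delay_slot(example("cmp %rcx, %rax", {"%rcx", "%rax"}))
print([i.text for i in first])    # add %rbx, %rax ends up after je NEXT
print([i.text for i in second])   # a nop is appended instead
```

In the second example both candidates produce a register that the compare reads, so neither can move and the slot is filled with a nop.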

12a.55

A Look Ahead: Branch Prediction

  • Currently our pipeline assumes Not Taken and fetches down the sequential path after a jump/branch

  • Could we build a pipeline that could predict taken?

– _________! Location to jump to (branch target) not known until later in the pipeline

  • But suppose we could overcome those problems, would we even know how to predict the outcome of a jump/branch before actually looking at the condition codes deeper in the pipeline?

  • We could allow a static prediction per instruction (give a hint with the branch that indicates T or NT)

  • We could allow dynamic prediction per instruction (use its run-time history)
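The difference between a fixed prediction and a history-based one can be seen by scoring both on a couple of outcome patterns. The sketch below uses the simplest possible history scheme (predict whatever the branch did last time); that scheme, the function names, and the two outcome patterns are assumptions for illustration, not something the slides specify.

```python
def static_accuracy(outcomes, predict_taken):
    """Fraction of correct predictions when the prediction never changes."""
    hits = sum(1 for taken in outcomes if taken == predict_taken)
    return hits / len(outcomes)


def last_outcome_accuracy(outcomes, initial=False):
    """Dynamic 1-bit scheme: predict whatever this branch did last time."""
    prediction, hits = initial, 0
    for taken in outcomes:
        hits += (prediction == taken)
        prediction = taken
    return hits / len(outcomes)


# Loop-closing branch: taken 9 times, then not taken, for 10 runs of the loop.
loop_branch = ([True] * 9 + [False]) * 10
# Data-dependent if: one long phase taken, then one long phase not taken.
if_branch = [True] * 50 + [False] * 50

print(static_accuracy(loop_branch, True), last_outcome_accuracy(loop_branch))   # 0.9 0.8
print(static_accuracy(if_branch, True), last_outcome_accuracy(if_branch))       # 0.5 0.98
```

On the loop pattern a static "taken" prediction does well, while on the data-dependent pattern only the history-based predictor adapts.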

Figure: Loops. A loop body ends in a loop branch (T: loop, NT: done) followed by the NT code. High probability of being Taken. Prediction can be static.

Figure: If Statements. An if..else branch (T: if, NT: else) selects the T code or the NT code, followed by the After code. May exhibit data dependent behavior. Prediction may need to be dynamic.

12a.56

Demo


12a.57

Summary 1

  • Pipelining is an effective and important technique to

improve the throughput of a processor

  • Overlapping execution creates hazards which lead to

stalls or wasted cycles

– Data, Control, Structural
– More hardware can be invested to attempt to mitigate the stalls (e.g. forwarding)

  • The compiler can help reorder code to avoid stalls

and perform useful work (e.g. delay slots)