CS356 Unit 12: Processor Hardware Organization & Pipelining



12.1

CS356 Unit 12

Processor Hardware Organization Pipelining


12.2

BASIC HW

From combinational to sequential logic


12.3

Logic Circuits

  • Combinational Logic

– Performs a specific function (a mapping of 2^n input combinations to desired output combinations)
– No internal state or feedback

  • Given a set of inputs, we will always get the same output after some time (the propagation delay)

  • Sequential Logic

– Registers: fundamental building blocks

  • Remembers a set of bits for later use
  • Acts like a variable from software
  • Controlled by a "clock" signal

Outputs depend only on current inputs

Inputs -> Combinational Logic -> Outputs (usually operations like +, -, *, /, &, |, <<)

Outputs depend on current inputs and previous inputs (previous inputs summarized by current state)

Sequential Logic: current inputs and a register holding "state" feed the combinational logic; the state values feed back and provide "memory".
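The feedback structure above can be sketched in Python (a minimal model, not from the slides; the class and variable names are hypothetical):

```python
# Minimal sketch of sequential logic: a clocked register whose output
# feeds a combinational adder, whose result feeds back to the register.

class Register:
    """Edge-triggered register: the output Q changes only on clock()."""
    def __init__(self, init=0):
        self.q = init        # current stored value (output)
        self.d = init        # value waiting at the D input

    def clock(self):
        self.q = self.d      # rising edge: capture D at Q

reg = Register()
for current_input in [1, 0, 1]:       # inputs over three cycles
    reg.d = reg.q + current_input     # combinational path (adder)
    reg.clock()                       # rising clock edge latches the sum

print(reg.q)  # state after three cycles -> 2
```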


12.4

Combinational Logic Gates

  • Circuits called gates perform logic operations to

produce desired outputs from some digital inputs

Example gate circuit: OR, AND, and NOT gates computing outputs from digital inputs.


12.5

Propagation Delay

  • All digital logic circuits have propagation delay

– Time for output to change when inputs are changed

Example: 4 gate delays for an input change to propagate to the outputs.


12.6

Combinational Logic Functions

  • Map input combinations of n-bits

to desired m-bit output

  • Can describe function with a truth table and

then find its circuit implementation

Figure: a logic circuit with inputs IN0-IN2 and outputs OUT0-OUT1, defined by a truth table.
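A combinational function like this can be modeled as a pure table lookup. The table values below are hypothetical (the slide's entries are not recoverable), but the sketch shows the key property: the same inputs always give the same outputs.

```python
# Hypothetical 3-input, 2-output combinational function as a truth table:
# every input combination maps to one fixed output (no state, no feedback).

TRUTH_TABLE = {
    # (IN2, IN1, IN0): (OUT1, OUT0)
    (0, 0, 0): (0, 0),
    (0, 0, 1): (0, 1),
    (0, 1, 0): (0, 1),
    (0, 1, 1): (1, 0),
    (1, 0, 0): (0, 1),
    (1, 0, 1): (1, 0),
    (1, 1, 0): (1, 0),
    (1, 1, 1): (1, 1),
}

def logic_circuit(in2, in1, in0):
    # Combinational: output depends only on the current inputs.
    return TRUTH_TABLE[(in2, in1, in0)]

print(logic_circuit(0, 1, 1))  # -> (1, 0)
```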


12.7

ALUs

  • Perform a selected operation on two input numbers

– FS[5:0] selects the desired operation

ALU block: 32-bit inputs X[31:0] and Y[31:0] (A, B), carry-in C0, 32-bit result RES[31:0], flags OF and ZF; the 6-bit function select FS[5:0] chooses the operation.

Func. Code -> Op.:
00_0000: A SHL B
00_0010: A SHR B
00_0011: A SAR B
01_1000: A * B
01_1001: A * B (uns.)
01_1010: A / B
01_1011: A / B (uns.)
10_0000: A + B
10_0010: A - B
10_0100: A AND B
10_0101: A OR B
10_0110: A XOR B
10_0111: A NOR B
10_1010: A < B
… (remaining codes omitted)

Example: FS = 10_0010 (A - B) with A = 0x80000000, B = 0x7fffffff gives RES = 0x00000001 (CF = 1 in the slide's figure).
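A few rows of the function-code table can be sketched as a Python ALU model. This is an illustration, not the course's implementation: only four codes are modeled, results wrap to 32 bits, and flag handling is simplified to ZF.

```python
# Sketch of the slide's ALU: FS[5:0] selects the operation applied to
# 32-bit inputs A and B. Only a few function codes are modeled here.

MASK32 = 0xFFFFFFFF

def alu(a, b, fs):
    if fs == 0b100000:        # A + B
        res = a + b
    elif fs == 0b100010:      # A - B
        res = a - b
    elif fs == 0b100100:      # A AND B
        res = a & b
    elif fs == 0b100101:      # A OR B
        res = a | b
    else:
        raise ValueError("function code not modeled in this sketch")
    res &= MASK32             # truncate result to 32 bits
    zf = int(res == 0)        # zero flag
    return res, zf

# The slide's example: FS = 10_0010 (A - B)
print(alu(0x80000000, 0x7FFFFFFF, 0b100010))  # -> (1, 0)
```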


12.8

Sequential Devices (Registers)

  • Registers capture the D input value when a control input

(clock signal) transitions from 0 to 1 (clock edge) and store that value at the Q output until the next clock edge

  • A register is similar to a variable in software: at the clock

edge, it stores a value for later use.

  • We can choose to only clock the register at "certain"

times when we want the register to capture a new value (e.g., when it is the destination of an instruction)

  • Key Idea

Registers store data while we operate on those values

Block diagram of a register: a clock pulse (positive edge) causes q(t) to sample and hold the current d(t) value until another clock pulse. Example: %rax feeds the ALU, and the sum is written back into %rax (add %rbx,%rax then add %rcx,%rax).


12.9

Clock Signal

  • Alternating high/low voltage

pulse train

  • Controls the ordering and timing of operations performed in the processor

  • 1 cycle is usually measured from

rising edge to rising edge

  • Clock frequency = # of cycles per second (e.g., 2.8 GHz = 2.8 × 10^9 cycles per second)

Figure: the clock signal, an alternating 0 (0 V) / 1 (5 V) waveform, drives the processor; 1 cycle spans rising edge to rising edge, and 2.8 GHz = 2.8 × 10^9 cycles per second ≈ 0.357 ns/cycle, sequencing Op. 1, Op. 2, Op. 3.
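The frequency-to-period arithmetic can be checked directly:

```python
# Converting clock frequency to cycle time, as in the slide's example:
# a period in nanoseconds is the reciprocal of the frequency in GHz.

def cycle_time_ns(freq_ghz):
    """Clock period in ns for a clock of freq_ghz GHz."""
    return 1.0 / freq_ghz

print(round(cycle_time_ns(2.8), 3))  # 2.8 GHz -> 0.357 ns/cycle
```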

12.10

FROM X86 TO RISC

Basic HW organization for a simplified instruction set


12.11

From CISC to RISC

  • Complex Instruction Set Computers (CISC) often have instructions that vary widely in how much work they perform and how much time they take to execute

– Fewer instructions are needed for a task

  • Reduced Instruction Set Computers (RISC)

favor instructions that take roughly the same time to execute and follow a common sequence of steps

– More instructions needed, each faster

// CISC instruction
movq 0x40(%rdi, %rsi, 4), %rax

// RISC equivalent with 1 memory or ALU
// operation per instruction
mov %rsi, %rbx     # use %rbx as a temp.
shl $2, %rbx       # %rsi * 4
add %rdi, %rbx     # %rdi + (%rsi*4)
add $0x40, %rbx    # 0x40 + %rdi + (%rsi*4)
mov (%rbx), %rax   # %rax = *%rbx

CISC vs. RISC Equivalents

John Hennessy and David Patterson, ACM Turing Award Lecture, 2017


12.12

A RISC Subset of x86

  • Split mov instructions that access memory

into separate instructions:

– ld = Load/Read from memory
– st = Store/Write to memory

  • Limit ld & st instructions to use at most

indirect w/ displacement

– No ld 0x04(%rdi, %rsi, 4), %rax

  • Too much work

– At most ld 0x40(%rdi), %rax or st %rax, 0x40(%rdi)

  • Limit arithmetic & logic instructions to only operate on registers

– No add (%rsp), %rax since this implicitly accesses (dereferences) memory
– Only add %reg1, %reg2

// CISC instruction
add %rax, (%rsp)

// Equivalent RISC sequence with ld / st
ld 0(%rsp), %rbx
add %rax, %rbx
st %rbx, 0(%rsp)

// 3 x86 memory read instructions
mov (%rdi), %rax            // 1
mov 0x40(%rdi), %rax        // 2
mov 0x40(%rdi,%rsi), %rax   // 3

// Equivalent load sequences
ld 0x0(%rdi), %rax          // 1
ld 0x40(%rdi), %rax         // 2
mov %rsi, %rbx              // 3a
add %rdi, %rbx              // 3b
ld 0x40(%rbx), %rax         // 3c

// 3 x86 memory write instructions
mov %rax, (%rdi)            // 1
mov %rax, 0x40(%rdi)        // 2
mov %rax, 0x40(%rdi,%rsi)   // 3

// Equivalent store sequences
st %rax, 0x0(%rdi)          // 1
st %rax, 0x40(%rdi)         // 2
mov %rsi, %rbx              // 3a
add %rdi, %rbx              // 3b
st %rax, 0x40(%rbx)         // 3c
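The load sequence 3a-3c above can be checked in Python: the RISC register arithmetic computes the same effective address as the x86 mode 0x40(%rdi,%rsi). This is a sketch with arbitrary example register values.

```python
# Checking that the RISC sequence (3a-3c) computes the same effective
# address as the addressing mode 0x40(%rdi,%rsi) = %rdi + %rsi + 0x40.

def cisc_addr(rdi, rsi):
    return rdi + rsi + 0x40

def risc_addr(rdi, rsi):
    rbx = rsi             # mov %rsi, %rbx
    rbx += rdi            # add %rdi, %rbx
    return rbx + 0x40     # ld 0x40(%rbx), ... addresses %rbx + 0x40

print(cisc_addr(0x1000, 8) == risc_addr(0x1000, 8))  # -> True
```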


12.13

Developing a Processor Organization

Hardware components used by each instruction type:

Instruction types: ALU-type (add %rax,%rbx), LD (ld 8(%rax),%rbx), ST (st %rbx, 8(%rax)), JE (je label/displacement)

Components: PC; I-Cache / I-MEM (Addr., Data); D-Cache / D-MEM (Addr., Data); Registers (%rax, %rbx, etc., aka RegFile); ALU (Res., Zero); Cond. Codes

ALU-type: 1. PC (addr. of instruc.) -> 2. I-Cache (fetch instruc.) -> 3. Registers (get %rax, %rbx) -> 4. ALU (sum %rax + %rbx) -> 5. Registers (save result to %rbx)

LD: 1. PC (addr. of instruc.) -> 2. I-Cache (fetch instruc.) -> 3. Registers (get %rax) -> 4. ALU (sum %rax + 8) -> 5. D-Cache (read data) -> 6. Registers (save data to %rbx)

ST: 1. PC (addr. of instruc.) -> 2. I-Cache (fetch instruc.) -> 3. Registers (get %rax / %rbx) -> 4. ALU (sum %rax + 8) -> 5. D-Cache (write %rbx data)

JE: 1. PC (addr. of instruc.) -> 2. I-Cache (fetch instruc.) -> 3. ALU (if cond = TRUE, PC = PC + disp.)

12.14

Processor Block Diagram

Stages: Fetch (PC, I-Cache) -> Decode (Decoder, Registers aka RegFile) -> Exec. (ALU, flags ZF/OF/CF/SF) -> Mem (D-Cache: Addr/Data) -> WB

Values passed between stages: Instruction (machine code) -> Operands -> ALU output (addr. or result) -> Data to write to dest. register

Control signals (e.g., ALU operation, read/write D-Cache, etc.) accompany each instruction.

Clock cycle time = sum of delay through the worst-case pathway = 50 ns (10 ns per stage)


12.15

Processor Execution (add)

The 5-stage datapath (Fetch, Decode, Exec., Mem, WB) executes add %rax,%rdx [machine code: 48 01 c2]:

Fetch: PC and I-Cache supply the instruction.
Decode: the decoder and RegFile read operands %rax and %rdx.
Exec.: the ALU computes %rax + %rdx and updates flags ZF/OF/CF/SF.
Mem: no memory access.
WB: the result is written back: %rdx = %rax + %rdx.


12.16

Processor Execution (ld)

The 5-stage datapath executes ld 0x40(%rbx),%rax [machine code: 48 8b 43 40]:

Fetch: PC and I-Cache supply the instruction.
Decode: the RegFile reads %rbx; the displacement is 0x40.
Exec.: the ALU computes the address %rbx + 0x40.
Mem: the D-Cache reads the data at that address.
WB: the data is written back: %rax = data.


12.17

Processor Execution (st)

The 5-stage datapath executes st %rax,0x40(%rbx) [machine code: 48 89 43 40]:

Fetch: PC and I-Cache supply the instruction.
Decode: the RegFile reads %rbx (base) and %rax (data); the displacement is 0x40.
Exec.: the ALU computes the address %rbx + 0x40.
Mem: the %rax data is written to the D-Cache at that address.
WB: nothing to write back.


12.18

Processor Execution (je)

The 5-stage datapath executes je L1 (disp. = 0x08) [machine code: 74 08]:

Fetch: PC and I-Cache supply the instruction.
Decode: the displacement 0x08 is extracted.
Exec.: the ALU computes PC + 0x08 and the condition codes are checked (here ZF = 1).
Mem/WB: since the condition is true, the PC is updated to PC + 0x08.


12.19

PIPELINING


12.20

Example

for(i=0; i < 100; i++)
    C[i] = (A[i] + B[i]) / 4;

10 ns per input set = 1000 ns total

Figure: a counter (Cntr) generates i; arrays A and B in memory supply A[i] and B[i], and C[i] is written back.


12.21

Pipelining Example

for(i=0; i < 100; i++)
    C[i] = (A[i] + B[i]) / 4;

Stage 1 computes the add; Stage 2 computes the divide:

Clock 0: Stage 1: A[0] + B[0]
Clock 1: Stage 1: A[1] + B[1]; Stage 2: (A[0] + B[0]) / 4
Clock 2: Stage 1: A[2] + B[2]; Stage 2: (A[1] + B[1]) / 4

Pipelining refers to inserting registers to split combinational logic into smaller stages that can be overlapped in time (i.e., creating an assembly line)
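The two-stage schedule above can be simulated in Python (a sketch using a 3-element example; integer division stands in for the divide-by-4 stage, and the tuple between the stages plays the role of the pipeline register):

```python
# Two-stage pipeline: stage 1 adds, stage 2 divides by 4, and the stages
# overlap so that one result completes per clock once the pipe is full.

A = [4, 8, 12]
B = [4, 0, 4]
C = [None] * len(A)

stage1_out = None                     # "register" between the two stages
for clock in range(len(A) + 1):
    # Stage 2 works on the value latched at the end of the last cycle.
    if stage1_out is not None:
        i, s = stage1_out
        C[i] = s // 4
    # Stage 1 works on the next input pair (if any remain).
    stage1_out = (clock, A[clock] + B[clock]) if clock < len(A) else None

print(C)  # -> [2, 2, 4]
```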


12.22

Need for Registers

  • Provides separation between combinational functions

– Without registers, fast signals could "catch up" to data values in the next operation stage


12.23

Processors & Pipelines

  • Overlaps execution of multiple instructions
  • Natural breakdown into stages

– Fetch, Decode, Execute, Memory, Write-Back

  • Fetch an instruction, while decoding another, while

executing another

Pipelining (instruction view):

CLK 1: Inst 1: Fetch
CLK 2: Inst 1: Decode; Inst 2: Fetch
CLK 3: Inst 1: Execute; Inst 2: Decode; Inst 3: Fetch
CLK 4: Inst 2: Execute; Inst 3: Decode; Inst 4: Fetch

Pipelining (stage view):

Clk 1: Fetch: Inst. 1
Clk 2: Fetch: Inst. 2; Decode: Inst. 1
Clk 3: Fetch: Inst. 3; Decode: Inst. 2; Exec.: Inst. 1
Clk 4: Fetch: Inst. 4; Decode: Inst. 3; Exec.: Inst. 2
Clk 5: Fetch: Inst. 5; Decode: Inst. 4; Exec.: Inst. 3


12.24

Balancing Pipeline Stages

  • Clock period must equal the LONGEST

delay from register to register

  • Fig. 1: If total logic delay is 20ns => 50MHz

– Throughput: 1 instruc. / 20 ns

  • Fig. 2: Unbalanced stage delays limit the

clock speed to the slowest stage (worst case)

– Throughput: 1 instruc. / 10 ns => 100MHz

  • Fig. 3: Better to split into more, balanced

stages

– Throughput: 1 instruc. / 5 ns => 200MHz

  • Fig. 4: Are more stages better?

– Ideally: 2x stages => 2x throughput
– Throughput: 1 instruc. / 2.5 ns => 400MHz
– Each register adds extra delay, so at some point deeper pipelines don't pay off

Fig. 1: one 20 ns block of processor logic (Fetch + Decode + Execute) between registers.
Fig. 2: Fetch (5 ns), Decode (5 ns), Exec (10 ns) stages between registers.
Fig. 3: Fetch, Decode, Exec. 1, Exec. 2 stages (5 ns each).
Fig. 4: F1, F2, D1, D2, E1a, E1b, E2a, E2b stages (2.5 ns each).
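The figures' timing arithmetic can be sketched directly: the clock is set by the slowest register-to-register stage, and throughput is one instruction per clock.

```python
# The slide's rule: the clock period equals the LONGEST stage delay,
# so unbalanced stages waste the headroom of the faster ones.

def throughput_mhz(stage_delays_ns):
    period = max(stage_delays_ns)    # clock limited by the worst stage
    return 1000.0 / period           # 1000 ns/us -> MHz

print(throughput_mhz([20]))          # Fig. 1: one 20 ns block -> 50.0
print(throughput_mhz([5, 5, 10]))    # Fig. 2: slowest stage wins -> 100.0
print(throughput_mhz([5, 5, 5, 5]))  # Fig. 3: balanced 4-stage -> 200.0
```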

12.25

Balancing Pipeline Stages

Main Points:

  • Latency of any single

instruction is unaffected

  • Throughput, and thus overall program performance, can be dramatically improved

– Ideally a K-stage pipeline will lead to a throughput increase by a factor of K
– In reality, splitting stages adds some delay and thus hits a point of diminishing returns

Fig. 1: Non-pipelined: one 20 ns block (Fetch + Decode + Execute). Latency = 20 ns, Throughput = 1x.
Fig. 2: 4-stage pipeline: Fetch, Decode, Exec. 1, Exec. 2 (5 ns each). Latency = 20 ns, Throughput = 4x.
Fig. 3: 8-stage pipeline: F1, F2, D1, D2, E1a, E1b, E2a, E2b (2.5 ns each). Latency = 20 ns, Throughput = 8x.


12.26

Throughput and Latency

n | clock (ps) | tput (GIPS)
1 | 320 | 3.125
2 | 170 | 5.882
3 | 120 | 8.333
4 | 95 | 10.526
5 | 80 | 12.500
6 | 70 | 14.286

clock = 300/n + 20,  tput = 1/clock,  delay = n*clock
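The table follows directly from the slide's model: 300 ps of total logic split over n stages, plus 20 ps of register overhead per stage.

```python
# The slide's pipeline-depth model: deeper pipelines shrink the logic
# per stage, but each stage pays a fixed 20 ps register overhead.

def clock_ps(n):
    return 300 / n + 20            # per-stage delay incl. register

def tput_gips(n):
    return 1000 / clock_ps(n)      # instructions per ns = GIPS

for n in (1, 2, 6):
    print(n, clock_ps(n), round(tput_gips(n), 3))
# -> 1 320.0 3.125
# -> 2 170.0 5.882
# -> 6 70.0 14.286
```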


12.27

5-Stage Pipeline

Pipeline registers between stages carry: Instruction (machine code) -> Operands & control signals -> ALU output (addr. or result) -> Result of instruction

Stages: Fetch (PC, I-Cache) -> Decode (Decoder, Reg. File) -> Exec. (ALU, flags ZF/OF/CF/SF) -> Mem (D-Cache) -> WB


12.28

Pipelining

  • Let's see how a sequence of instructions can

be executed

Instruction sequence:

ld 0x40(%rbx),%rax
add %rcx,%rdx
je L1


12.29

Sample Sequence - 1

Fetch (LD): fetch the LD instruction; the Decode, Exec., Mem, and WB stages are still empty.


12.30

Sample Sequence - 2

Fetch (ADD): fetch the ADD instruction.
Decode (LD): decode ld 0x40(%rbx), %rax and fetch its operands.


12.31

Sample Sequence - 3

Fetch (JE): fetch the JE instruction.
Decode (ADD): decode add %rcx, %rdx and fetch its operands.
Exec. (LD): add the displacement 0x40 to %rbx (0x40 / %rbx / READ passes to the next stage).


12.32

Sample Sequence - 4

Fetch (i+1): fetch instruction i+1.
Decode (JE): decode the JE and fetch its operand (displacement).
Exec. (ADD): add %rcx + %rdx.
Mem (LD): read the word from memory at %rbx + 0x40.


12.33

Sample Sequence - 5

Fetch (i+2): fetch next instruction i+2.
Decode (i+1): decode instruction i+1 and fetch its operands.
Exec. (JE): check if the condition is true and add the displacement to the PC.
Mem (ADD): just pass the sum (%rcx + %rdx) to the next stage.
WB (LD): write the value from memory to %rax.


12.34

Sample Sequence - 6

Fetch (i+3): fetch next instruction i+3.
Decode (i+2): decode instruction i+2 and fetch its operands.
Exec. (i+1): use the ALU.
Mem (JE): update the PC (the branch outcome is known only here).
WB (ADD): write the sum to %rdx.


12.35

Sample Sequence - 7

The JE was taken, so the instructions fetched after it must be removed:

Fetch (target): fetch from the branch target.
Decode (i+3): delete i+3.
Exec. (i+2): delete i+2.
Mem (i+1): delete i+1.
WB (JE): do nothing (no input for JE).


12.36

HAZARDS

Problems from overlapping instruction execution…


12.37

Hazards

Hazards prevent parallel or overlapped execution!

  • Control Hazards

– Problem: We don't know what instruction to fetch next, but we need to fetch something
– Examples: Jumps (branches) and procedure calls

  • Data Hazards / Data Dependencies

– Problem: A later instruction needs data produced by a previous instruction
– Examples:

  • sub %rdx,%rax
  • add %rax,%rcx

  • Structural Hazards

– Problem: Due to limited resources, the HW doesn't support overlapping a certain sequence of instructions
– Examples: See next slides


12.38

Structural Hazards

  • Example structural hazard: A single cache rather

than separate instruction & data caches

– A structural hazard occurs any time an instruction needs to perform a data access (i.e., ld or st), since we always want to fetch a new instruction each clock cycle

Figure: with one cache, the LD in the Mem stage and the fetch of i+3 both need the cache in the same cycle: Hazard!


12.39

Data Hazard - 1

sub %rdx,%rax
add %rax,%rcx

Fetch (i+1): fetch i+1.
Decode (ADD): decode and get register operands. (Do we get the desired %rax value?)
Exec. (SUB): perform %rax - %rdx.


12.40

Data Hazard - 2

sub %rdx,%rax
add %rax,%rcx

Fetch (i+2): fetch i+2.
Decode (i+1): decode i+1.
Exec. (ADD): perform %rax + %rcx using the wrong value! The new value of %rax has not been written back yet.
Mem (SUB): the SUB's new value for %rax is still traveling down the pipeline.


12.41

Stalling

  • Solution 1: Halt/Stall the ADD instruction in the DECODE stage and insert nops into the pipeline until the new value of the needed register is present, at the cost of lower performance

sub %rdx,%rax
add %rax,%rcx

Fetch (i+1): stalled behind the ADD.
Decode (ADD): the ADD waits here until the new value of %rax is available.
Exec. (nop) / Mem (nop): bubbles inserted into the pipeline.
WB (SUB): the SUB finally writes the new value of %rax.


12.42

Forwarding

  • Solution 2: Create new hardware paths to hand-off (forward)

the data from the producing instruction in the pipeline to the consuming instruction

sub %rdx,%rax
add %rax,%rcx

Fetch (i+2): fetch i+2.
Decode (i+1): decode i+1.
Exec. (ADD): the new value of %rax is forwarded from the SUB (now in its Mem stage) directly to the ALU input, so the ADD uses the correct value.


12.43

Solving Data Hazards

  • Key Point: Data dependencies (i.e., instructions

needing values produced by earlier ones) limit performance

  • Forwarding solves many of the data hazards (data

dependencies) that exist

– It allows instructions to continue to flow through the pipeline without the need to stall and waste time
– The cost is additional hardware and added complexity

  • Even forwarding cannot solve all the issues

– A hazard still exists when a ld reads a value needed by the very next instruction (the load-use hazard)
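The stall rules above can be sketched as a small decision function (an illustration of the policy, not real pipeline hardware; the operand names are examples): an ALU result can be forwarded to the very next instruction, but a load's data arrives one stage later, so a load followed by a dependent instruction still costs one bubble.

```python
# Sketch: how many stall cycles a consumer needs behind its producer,
# assuming full forwarding in a 5-stage pipeline like the slides'.

def stalls_needed(producer_op, producer_dst, consumer_srcs):
    if producer_dst not in consumer_srcs:
        return 0          # no dependency, no stall
    if producer_op == "ld":
        return 1          # load-use hazard: one bubble even with forwarding
    return 0              # ALU result forwarded in time

print(stalls_needed("sub", "%rax", ["%rax", "%rcx"]))  # forwarded -> 0
print(stalls_needed("ld", "%rax", ["%rax", "%rcx"]))   # must stall -> 1
```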


12.44

LD + Dependent Instruction Hazard

  • Even forwarding cannot prevent the need to stall when a ld instruction

produces a value needed by the instruction behind it

– Would require performing 2 cycles' worth of work in only a single cycle

ld 8(%rdx),%rax
add %rax,%rcx

Fetch (i+2): fetch i+2.
Decode (i+1): decode i+1.
Exec. (ADD): needs the new %rax now, but only the old %rax value is available.
Mem (LD): the memory read (address 8 + %rdx) is happening only this cycle, too late to forward.


12.45

LD + Dependent Instruction Hazard

  • We would need to introduce 1 stall cycle (nop) into the

pipeline to get the timing correct

  • Keep this in mind as we move through the next slides

ld 8(%rdx),%rax
add %rax,%rcx

Fetch (i+2): fetch i+2.
Decode (i+1): decode i+1.
Exec. (ADD): after the one-cycle stall, the new value of %rax (now at the LD's WB stage) can be forwarded to the ADD in time.
Mem (nop): the inserted bubble.
WB (LD): the LD writes the new value of %rax.


12.46

Control Hazards

  • Branches/Jumps require us to know

– Where we want to jump to (aka the branch/jump target location): really just the new value of the PC
– Whether we should branch or not (checking the jump condition)

  • Problem: We often don't know those values until

deep in the pipeline and thus we are not sure what instructions should be fetched in the interim

– Requires us to flush unwanted instructions and waste time


12.47

Control Hazard - 1

Fetch (i+3): fetch next instruction i+3.
Decode (i+2): decode instruction i+2 and fetch its operands.
Exec. (i+1): use the ALU.
Mem (JE): update the PC (the branch outcome is known only here).
WB (ADD): write the sum to %rdx.

12.48

Control Hazard - 2

The JE was taken, so the sequentially fetched instructions are wrong:

Fetch (target): fetch from the branch target.
Decode (i+3): delete i+3.
Exec. (i+2): delete i+2.
Mem (i+1): delete i+1.
WB (JE): do nothing.

Need to "flush" wrongly fetched instructions


12.49

A FIRST LOOK: CODE REORDERING

Enlisting the help of the compiler


12.50

Two Sides of the Coin

  • If the hardware has some problems it

just can't solve, can software (i.e., the compiler) help?

– Yes!!

  • Compilers can re-order instructions to

take best advantage of the processor (pipeline) organization

  • Identify the dependencies that will

incur stalls and slow performance

– A load followed by a dependent add
– Jump instructions

void sum(int *data, int n, int x)
{
    for(int i=0; i < n; i++) {
        data[i] += x;
    }
}

sum: mov $0x0,%ecx
L1:  cmp %esi,%ecx
     jge L2
     ld 0(%rdi), %eax
     add %edx, %eax
     st %eax, 0(%rdi)
     add $4, %rdi
     add $1, %ecx
     j L1
L2:  retq

C code and its assembly translation


12.51

How Can the Compiler Help

  • Compilers are written with general parsing and

semantic representation front ends but architecture-specific backends that generate code optimized for a particular processor

  • Q: How could the compiler help improve pipelined performance while still maintaining the external behavior that the high-level code indicates?

  • A: By finding independent instructions and

reordering the code

– Could we have moved any other instruction into that slot? No!

Original code (incurring 1 stall cycle):

sum: mov $0x0,%ecx
L1:  cmp %esi,%ecx
     jge L2
     ld 0(%rdi), %eax
     stall/nop
     add %edx, %eax
     st %eax, 0(%rdi)
     add $4, %rdi
     add $1, %ecx
     j L1
L2:  retq

Updated code (w/ compiler reordering, filling the stall slot):

sum: mov $0x0,%ecx
L1:  cmp %esi,%ecx
     jge L2
     ld 0(%rdi), %eax
     add $1, %ecx
     add %edx, %eax
     st %eax, 0(%rdi)
     add $4, %rdi
     j L1
L2:  retq


12.52

Taken or Not Taken: Branch Behavior

  • When a conditional jump/branch is

– True, we say it is Taken – False, we say it is Not Taken

  • Currently our pipeline will fetch sequentially

and then potentially flush if the branch is taken

– Effectively, our pipeline "predicts" that each branch is Not Taken

  • The j L1 instruction is always taken and

thus will incur wasted clock cycles each time it is executed

  • Most of the time the jge L2 will be not

taken and perform well

sum: mov $0x0,%ecx
L1:  cmp %esi,%ecx
     jge L2            # usually Not Taken (NT)
     ld 0(%rdi), %eax
     add $1, %ecx
     add %edx, %eax
     st %eax, 0(%rdi)
     add $4, %rdi
     j L1              # always Taken (T)
L2:  retq


12.53

Branch Delay Slots

  • Problem: After a jump/branch we fetch instructions

that we are not sure should be executed

  • Idea: Find an instruction(s) that should ALWAYS be

executed (independent of whether branch is taken or not), move those instructions to directly after the branch, and have HW just let them be executed (not flushed) no matter what the branch outcome is

  • Branch delay slots = # of instructions that the HW will

always execute (not flush) after a jump/branch instruction


12.54

Branch Delay Slot Example

Assume a single-instruction delay slot. Move an ALWAYS executed instruction down into the delay slot and let it execute no matter what:

"Before" code:

ld 0(%rdi), %rcx
add %rbx, %rax      # always executed, independent of the branch
cmp %rcx, %rdx
je NEXT
NOT TAKEN CODE
…
NEXT: TAKEN CODE

"After" code:

ld 0(%rdi), %rcx
cmp %rcx, %rdx
je NEXT
add %rbx, %rax      # delay slot instruc.: executes whether taken or not
NOT TAKEN CODE
…
NEXT: TAKEN CODE

Flowchart perspective: the delay slot instruction sits on both the Taken (T) and Not Taken (NT) paths.


12.55

Implementing Branch Delay Slots

  • HW will define the number of

branch delay slots (usually a small number…1 or 2)

  • Compiler will be responsible for

arranging instructions to fill the delay slots

– Must find instructions that the branch does NOT DEPEND on
– If no instructions can be rearranged, the compiler can always insert a 'nop' and just waste those cycles

ld 0(%rdi), %rcx
add %rbx, %rax
cmp %rcx, %rdx
je NEXT
delay slot instruc.

Cannot move the ld into the delay slot, because the je depends (through the cmp) on the %rcx value the ld generates. If no independent instruction can be found, the compiler inserts a 'nop' into the slot.


12.56

A Look Ahead: Branch Prediction

  • Currently our pipeline assumes Not Taken and

fetches down the sequential path after a jump/branch

  • Could we build a pipeline that could predict taken?

– Not yet! Location to jump to (branch target) not known until later stages

  • But suppose we could overcome those problems: would we even know how to predict the outcome of a jump/branch before actually looking at the condition codes deeper in the pipeline?

  • We could allow a static prediction per instruction

(give a hint with the branch that indicates T or NT)

  • We could allow dynamic prediction per instruction

(use its runtime history)

Loops: the loop-back branch (T: loop, NT: done) has a high probability of being Taken, so prediction can be static.

If statements: an if..else branch (T: if code, NT: else code) may exhibit data-dependent behavior, so prediction may need to be dynamic.
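Dynamic prediction from runtime history can be sketched with the classic 2-bit saturating counter (an illustrative scheme, not necessarily the one this course builds; the loop outcomes below are a made-up example):

```python
# 2-bit saturating counter: states 0..3, predict Taken in the upper half.
# Two misses in a row are needed to flip a strongly-held prediction.

def predict(counter):
    return counter >= 2              # 2, 3 -> predict Taken

def update(counter, taken):
    if taken:
        return min(counter + 1, 3)   # saturate at strongly Taken
    return max(counter - 1, 0)       # saturate at strongly Not Taken

counter, hits = 3, 0                 # loop branch starts strongly Taken
outcomes = [True] * 9 + [False]      # 10-iteration loop: taken 9x, then exit
for taken in outcomes:
    hits += predict(counter) == taken
    counter = update(counter, taken)

print(hits, "of", len(outcomes), "predicted correctly")  # -> 9 of 10
```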


12.57

Summary 1

  • Pipelining is an effective and important technique to

improve the throughput of a processor

  • Overlapping execution creates hazards which lead to

stalls or wasted cycles

– Data, Control, Structural
– More hardware can be investigated to attempt to mitigate the stalls (e.g., forwarding)

  • The compiler can help reorder code to avoid stalls

and perform useful work (e.g. delay slots)