[PPT] - ECE 550D Fundamentals of Computer Systems and Engineering Fall 2016 PowerPoint Presentation

SLIDE 1

ECE 550D

Fundamentals of Computer Systems and Engineering

Fall 2016

Datapaths

Tyler Bletsch
Duke University
Slides are derived from work by Andrew Hilton (Duke) and Amir Roth (Penn)

SLIDE 2


Last time

What did we do last time?
MIPS Assembly
Practice translating C to assembly together
Using functions
Calling conventions
jal (call)
jr (return)

SLIDE 3


Now confluence of MIPS + digital logic

Start of semester: Digital Logic
Building blocks of digital design
Most recently: MIPS assembly, ISA
Lowest level software
Now: where they meet…
Datapaths: hardware implementation of processors
By the way: homework 4 = build a datapath
With some components from the TAs…

SLIDE 4


Necessary ingredient: the ALU

ALU: Arithmetic/Logic Unit
Performs any supported math or logic operation on two inputs
Which operation is chosen by a third input

[Figure: ALU symbol with data inputs A and B, an operation-select input, and an output]

SLIDE 5

Add/Subtract With Overflow Detection

[Figure: chain of n full adders; inputs a0..an-1, b0..bn-1, and Add/Sub; outputs S0..Sn-1 and Overflow]
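The adder/subtractor in the figure can be mimicked in software. The following is a minimal sketch, assuming the usual construction: the Add/Sub line inverts each b bit (XOR) and also feeds the first carry-in, and overflow is the XOR of the carry into and the carry out of the top bit. The function name `add_sub` and the default width are illustrative, not taken from the slides.

```python
def add_sub(a, b, sub, n=8):
    """Ripple-carry add/subtract on n-bit two's-complement bit patterns.

    a, b: raw bit patterns in 0 .. 2**n - 1
    sub:  0 for a + b, 1 for a - b
    Returns (result_bits, overflow).
    """
    carry = sub                        # Add/Sub doubles as the first carry-in
    carry_into_msb = 0
    result = 0
    for i in range(n):
        ai = (a >> i) & 1
        bi = ((b >> i) & 1) ^ sub      # XOR with Add/Sub inverts b when subtracting
        if i == n - 1:
            carry_into_msb = carry
        s = ai ^ bi ^ carry            # full-adder sum bit
        carry = (ai & bi) | (ai & carry) | (bi & carry)  # full-adder carry-out
        result |= s << i
    overflow = carry_into_msb ^ carry  # signed overflow check (assumed scheme)
    return result, overflow
```

For example, with 8 bits `add_sub(100, 100, 0)` produces the bit pattern 200 with the overflow flag set, since 200 does not fit in a signed 8-bit value.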

SLIDE 6


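ALU Slice

[Figure: one-bit ALU slice with inputs a, b, Cin, Add/sub, and a 2-bit select F; outputs Q and Cout]

A = Add/sub, F = function select, X = don't care

A  F  Q
0  0  a + b
1  0  a - b
X  1  NOT b
X  2  a OR b
X  3  a AND b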
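The function table for the slice translates directly into code. This is a rough behavioral sketch only (it operates on whole words rather than bit by bit); the function name `alu` and the width parameter `n` are assumptions made for illustration.

```python
def alu(a, b, add_sub, f, n=32):
    """Behavioral model of the ALU function table.

    add_sub: 0 selects a + b, 1 selects a - b (only matters when f == 0)
    f:       2-bit function select
    Returns the n-bit result Q as an unsigned bit pattern.
    """
    mask = (1 << n) - 1
    if f == 0:
        q = a - b if add_sub else a + b
    elif f == 1:
        q = ~b
    elif f == 2:
        q = a | b
    elif f == 3:
        q = a & b
    else:
        raise ValueError("F is a 2-bit select: expected 0..3")
    return q & mask  # wrap to n bits, as the hardware would
```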

SLIDE 7


The ALU

[Figure: n ALU slices side by side sharing the ALU control lines; inputs a0..an-1 and b0..bn-1; outputs Q0..Qn-1, Overflow, and an "Is non-zero?" flag. The whole unit is drawn as the ALU symbol with inputs A and B.]

SLIDE 8

Datapath for MIPS ISA

Consider only the following instructions

add $1,$2,$3
addi $1,$3,2
lw $1,4($3)
sw $1,4($3)
beq $1,$2,PC_relative_target
j absolute_target

Why only these?
Most other instructions are the same from datapath viewpoint
The ones that aren’t are left for you to figure out

SLIDE 9

Remember The von Neumann Model?

[Figure: loop of Instruction Fetch, Instruction Decode, Operand Fetch, Execute, Result Store, Next Instruction]

Instruction Fetch:

Read instruction bits from memory

Decode:

Figure out what those bits mean

Operand Fetch:

Read registers (+ mem to get sources)

Execute:

Do the actual operation (e.g., add the #s)

Result Store:

Write result to register or memory

Next Instruction:

Figure out mem addr of next insn, repeat

SLIDE 10


Start With Fetch

Same for all instructions (don’t know insn yet)
PC and instruction memory
A +4 incrementer computes default next instruction PC
Details of Insn Mem: later…
For now: just assume a bunch of DFFs

[Figure: PC addresses Insn Mem; a +4 adder computes the default next PC]

SLIDE 11

First Instruction: add

Add register file and ALU

[Figure: fetch logic plus Register File (read ports s1, s2; write port d) and ALU]

R-type: Op(6) Rs(5) Rt(5) Rd(5) Sh(5) Func(6)
Decoding: very easy in MIPS

SLIDE 12

Second Instruction: addi

Destination register can now be either Rd or Rt
Add sign extension unit and mux into second ALU input

[Figure: datapath with a sign-extension unit (SX) and a mux on the second ALU input]

I-type: Op(6) Rs(5) Rt(5) Immed(16)
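Because the field positions are fixed, decoding is just shifting and masking. Here is a small sketch of that idea, using the R-type and I-type field widths listed on these slides; the function names `decode` and `sign_extend` are made up for this example, and all fields are extracted unconditionally, the way hardware wires would be.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode(insn):
    """Slice a 32-bit MIPS instruction word into its fields.

    R-type uses op/rs/rt/rd/sh/func; I-type uses op/rs/rt/imm.
    Which fields are meaningful depends on the opcode.
    """
    return {
        "op":   (insn >> 26) & 0x3F,   # Op(6)
        "rs":   (insn >> 21) & 0x1F,   # Rs(5)
        "rt":   (insn >> 16) & 0x1F,   # Rt(5)
        "rd":   (insn >> 11) & 0x1F,   # Rd(5), R-type only
        "sh":   (insn >> 6) & 0x1F,    # Sh(5), R-type only
        "func": insn & 0x3F,           # Func(6), R-type only
        "imm":  sign_extend(insn & 0xFFFF, 16),  # Immed(16), I-type only
    }
```

As a usage example, `decode(0x00430820)` (the standard encoding of `add $1,$2,$3`) gives rs=2, rt=3, rd=1, while `decode(0x2061FFFE)` (an `addi` with immediate -2) gives rs=3, rt=1, imm=-2.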

SLIDE 13


Third Instruction: lw

Add data memory, address is ALU output
Add register write data mux to select memory output or ALU output

[Figure: datapath extended with Data Mem (address input a from the ALU, data output d) and a register write-data mux; instruction format shown is I-type]

SLIDE 14

Fourth Instruction: sw

Add path from second input register to data memory data input

[Figure: datapath with a new path from the second register read port to the Data Mem data input; instruction format shown is I-type]

SLIDE 15

Fifth Instruction: beq

Add left shift unit and adder to compute PC-relative branch target
Add PC input mux to select PC+4 or branch target
Note: shift by fixed amount very simple

[Figure: datapath with a << 2 shifter and adder producing the branch target, a PC input mux, and the ALU zero output (z); instruction format shown is I-type]

SLIDE 16

Sixth Instruction: j

Add shifter to compute left shift of 26-bit immediate
Add additional PC input mux for jump target

[Figure: datapath with a second << 2 shifter for the jump target and an additional PC input mux]

J-type: Op(6) Immed(26)
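The PC-selection logic built up over the last two slides can be sketched as one function. This is a sketch under stated assumptions: the branch is taken when BR is set and the ALU reports zero, and the jump target is completed with the upper four bits of PC+4, which is standard MIPS behavior but is not spelled out on the slide. The names `next_pc`, `br`, `jp`, and `zero` are illustrative.

```python
def sign_extend16(value):
    """Interpret a raw 16-bit field as a two's-complement number."""
    return (value & 0x7FFF) - (value & 0x8000)


def next_pc(pc, imm16=0, imm26=0, br=0, jp=0, zero=0):
    """Choose the next PC the way the PC input muxes would.

    imm16: raw 16-bit immediate from a beq
    imm26: raw 26-bit immediate from a j
    br/jp: control signals; zero: ALU zero flag
    """
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    if jp:
        # shift the 26-bit immediate left by 2; upper 4 bits come from PC+4 (assumed)
        return (pc_plus_4 & 0xF0000000) | ((imm26 & 0x3FFFFFF) << 2)
    if br and zero:
        # PC-relative target: PC+4 plus the sign-extended immediate shifted left by 2
        return (pc_plus_4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF
    return pc_plus_4  # default: next sequential instruction
```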

SLIDE 17


More Instructions…

Figure out datapath modifications for
jal (J-type)
jr (R-type)

[Figure: the six-instruction datapath (PC, Insn Mem, Register File, SX, ALU, Data Mem, +4 adder, two << 2 shifters)]

SLIDE 18

Jal

For jal, need to get PC+4 to RF write mux

[Figure: datapath with PC+4 routed to the register file write mux]

SLIDE 19

JR

For JR need to get RF read value to next PC mux

[Figure: datapath with a register file read value routed to the next-PC mux]

SLIDE 20

Good practice: Try other insns

Pick other MIPS instructions, contemplate how to add them

[Figure: the datapath built so far]

SLIDE 21

“Continuous Read” Datapath Timing

Works because writes (PC, RegFile, DMem) are independent
And because no read logically follows any write

[Figure: datapath annotated with Read IMem, Read Registers, Read DMEM, Write DMEM, Write Registers, Write PC]

SLIDE 22

What Is Control?

8 signals control flow of data through this datapath
MUX selectors, or register/memory write enable signals
A real datapath has 300-500 control signals

[Figure: datapath with its control signals labeled]

Control signals: Rwe, ALUinB, DMwe, JP, ALUop, BR, Rwd, Rdst

SLIDE 23

Example: Control for add

Control for an instruction:
Values of all control signals to correctly execute it

[Figure: datapath with control signal values for add]

BR=0 JP=0 Rwd=0 DMwe=0 ALUop=0 ALUinB=0 Rdst=1 Rwe=1

SLIDE 24

Example: Control for sw

Difference between sw and add is 5 signals
3 if you don’t count the X (don’t care) signals

[Figure: datapath with control signal values for sw]

Rwe=0 ALUinB=1 DMwe=1 JP=0 ALUop=0 BR=0 Rwd=X Rdst=X

SLIDE 25

Example: Control for beq

Difference between sw and beq is only 4 signals

[Figure: datapath with control signal values for beq]

Rwe=0 ALUinB=0 DMwe=0 JP=0 ALUop=1 BR=1 Rwd=X Rdst=X

SLIDE 26

You all figure out LW

How would these control signals be set for LW?

[Figure: datapath with control signals left blank]

Control signals: Rwe, ALUinB, DMwe, JP, ALUop, BR, Rwd, Rdst

SLIDE 27

Example: Control for LW

[Figure: datapath with control signal values for lw]

BR=0 JP=0 Rwd=1 DMwe=0 ALUop=0 ALUinB=1 Rdst=0 Rwe=1
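The signal counts quoted on the sw and beq slides can be checked mechanically. The sketch below records the control values exactly as they appear in the four examples (add, sw, beq, lw) and counts differences; the table name `CONTROL`, the helper `differing_signals`, and the use of `None` for X are choices made for this illustration.

```python
X = None  # don't care

# Control values copied from the add, sw, beq, and lw example slides.
CONTROL = {
    "add": dict(BR=0, JP=0, Rwd=0, DMwe=0, ALUop=0, ALUinB=0, Rdst=1, Rwe=1),
    "sw":  dict(BR=0, JP=0, Rwd=X, DMwe=1, ALUop=0, ALUinB=1, Rdst=X, Rwe=0),
    "beq": dict(BR=1, JP=0, Rwd=X, DMwe=0, ALUop=1, ALUinB=0, Rdst=X, Rwe=0),
    "lw":  dict(BR=0, JP=0, Rwd=1, DMwe=0, ALUop=0, ALUinB=1, Rdst=0, Rwe=1),
}


def differing_signals(insn_a, insn_b, count_dont_cares=True):
    """Return the names of control signals whose values differ."""
    a, b = CONTROL[insn_a], CONTROL[insn_b]
    diff = []
    for name in a:
        if a[name] == b[name]:
            continue
        # optionally ignore positions where either side is a don't care
        if not count_dont_cares and (a[name] is X or b[name] is X):
            continue
        diff.append(name)
    return diff
```

With this table, add and sw differ in 5 signals (3 when X entries are ignored), and sw and beq differ in 4, matching the slides.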

SLIDE 28


How Is Control Implemented?

[Figure: datapath with all eight control signals (Rwe, ALUinB, DMwe, JP, ALUop, BR, Rwd, Rdst) coming from a box labeled "Control?"]

SLIDE 29

Implementing Control

Each insn has a unique set of control signals
Most are function of opcode
Some may be encoded in the instruction itself
E.g., the ALUop signal is some portion of the MIPS Func field

+ Simplifies controller implementation
– Requires careful ISA design

SLIDE 30

Control Implementation: ROM

ROM (read only memory): think rows of bits
Bits in data words are control signals
Lines indexed by opcode
Example: ROM control for 6-insn MIPS datapath
X is “don’t care”

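opcode  BR  JP  ALUinB  ALUop  DMwe  Rwe  Rdst  Rwd
add     0   0   0       0      0     1    0     0
addi    0   0   1       0      0     1    1     0
lw      0   0   1       0      0     1    1     1
sw      0   0   1       0      1     0    X     X
beq     1   0   0       1      0     0    X     X
j       0   1   0       0      0     0    X     X

Note: this table encodes Rdst as 0 = Rd, 1 = Rt, the reverse of the add and lw examples on slides 23 and 27.

SLIDE 31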

Control Implementation: Random Logic

Real machines have 100+ insns, 300+ control signals
30,000+ control bits (~4KB)
Not huge, but hard to make faster than datapath (important!)
Alternative: random logic (random = ‘non-repeating’)
Exploits the observation: many signals have few 1s or few 0s
Example: random logic control for 6-insn MIPS datapath

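[Figure: the opcode is decoded into one line per instruction (add, addi, lw, sw, beq, j); small gates combine those lines into BR, JP, ALUinB, DMwe, Rwd, Rdst, ALUop, and Rwe]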

Yes, “random logic” is a very dumb and misleading name for this concept. Sorry.
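As a rough software analogue of random logic, each control signal can be written as a small Boolean expression over "is this instruction X?" lines instead of a table lookup. The sketch below assumes the signal values from the add, sw, beq, and lw examples, plus the natural settings for addi (immediate operand, register write) and j (JP only). Rdst is omitted because its polarity depends on how the destination mux is wired. The function name `random_logic_control` is hypothetical.

```python
INSNS = ("add", "addi", "lw", "sw", "beq", "j")


def random_logic_control(insn):
    """Compute control signals as simple gates over decoded opcode lines."""
    if insn not in INSNS:
        raise ValueError(f"unsupported instruction: {insn}")
    # one-hot decode of the opcode: exactly one of these is True
    is_add, is_addi, is_lw, is_sw, is_beq, is_j = (insn == name for name in INSNS)
    return {
        "BR":     int(is_beq),                     # a single wire
        "JP":     int(is_j),                       # a single wire
        "DMwe":   int(is_sw),                      # a single wire
        "Rwd":    int(is_lw),                      # don't care for sw/beq/j, 0 chosen
        "ALUop":  int(is_beq),                     # subtract/compare only for beq
        "ALUinB": int(is_addi or is_lw or is_sw),  # 3-input OR
        "Rwe":    int(is_add or is_addi or is_lw), # 3-input OR
    }
```

Most signals collapse to a single wire or one OR gate, which is the observation the slide relies on: few 1s per column means little logic.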

SLIDE 32


Datapath and Control Timing

[Figure: datapath plus a Control ROM/random logic block, annotated with Read IMem, Read Registers, (Read Control ROM), Read DMEM, Write DMEM, Write Registers, Write PC]

SLIDE 33

Single-Cycle Datapath Performance

+ Low Cycles Per Instruction (CPI): 1
– Long clock period: to accommodate slowest insn
  Goes against make common case fast (MCCF) principle

[Figure: single-cycle datapath with Control ROM/random logic]

SLIDE 34

Interlude: Performance

Previous slide alludes to something new: Performance
Don’t just want it to work…
But want it to go fast!
Three components to performance:

Number of instructions
x Cycles per instruction (CPI)
x Clock Period (1 / Clock frequency)

\[
\frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} = \frac{\text{Seconds}}{\text{Program}}
\]

SLIDE 35

Interlude: Performance

Three components to performance:

Number of instructions <- Compiler’s Job
x Cycles per instruction (CPI)
x Clock Period (1 / Clock frequency)

\[
\frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} = \frac{\text{Seconds}}{\text{Program}}
\]

Insns/Program: determined by compiler + ISA
Generally assume fixed program when doing micro-architecture

SLIDE 36


Micro-architectural factors

Micro-architecture:
The details of how the ISA is implemented
Affects CPI and Clock frequency
Often will look at fixed program, and consider MIPS
Million Instructions Per Second
MIPS = IPC * Frequency (in MHz)
IPC = Instructions Per Cycle (1 / CPI)
Gives “Bigger is better” number

\[
\underbrace{\frac{\text{Instructions}}{\text{Cycle}}}_{\text{IPC}} \times \underbrace{\frac{\text{Cycles}}{\text{Second}}}_{\text{Frequency}} = \underbrace{\frac{\text{Instructions}}{\text{Second}}}_{\text{Throughput}}
\]

The use of “MIPS” to mean “Millions of Instructions Per Second” has nothing to do with the CPU architecture also called “MIPS”, which actually stands for “Microprocessor without Interlocked Pipeline Stages”. The fact that a major CPU architecture shares a name with an important metric for performance is incredibly confusing and dumb, and I apologize. I blame the cocaine-fueled CPU architects of the 1980s.
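Both performance formulas are simple products, so they are easy to sanity-check in code. The following is a sketch with hypothetical helper names (`seconds_per_program`, `mips_rating`); the numbers in the usage note are illustrative only.

```python
def seconds_per_program(insn_count, cpi, clock_period_s):
    """Instructions/Program x Cycles/Instruction x Seconds/Cycle."""
    return insn_count * cpi * clock_period_s


def mips_rating(ipc, freq_mhz):
    """Millions of instructions per second = IPC x frequency in MHz."""
    return ipc * freq_mhz
```

For instance, one million instructions at CPI = 1 with a 50 ns clock take 0.05 seconds. The same machine runs at 20 MHz with IPC = 1, so `mips_rating(1.0, 20.0)` gives 20 MIPS, and the two views agree: 1,000,000 instructions / 0.05 s = 20 million instructions per second.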

SLIDE 37


“Best” IPC

For now, best we can do: IPC = 1 (CPI = 1)
Do 1 instruction every cycle
Later:
Real processors can do multiple instructions at once!
Potentially: IPC > 1! (CPI < 1!)
Best possible IPC depends on design

SLIDE 38


Performance vs ….

1990s: Performance at all cost
Actually more “clock frequency” at all cost…
Now: Care about other things
Energy (electric bill, battery life)
Power (cooling, also affects energy)
Area (chip cost)
Reliability (tolerance of transient faults: e.g., charged particle strikes)
…
Important metric these days: “Performance / Watt”
Throughput divided by power consumption
Why?

SLIDE 39


Performance Modeling and Analysis

Speaking of performance
Making a processor takes time (years) and money (millions)
Want to know it will perform well before you finish
If it’s wrong, doing it all over is painful…
Performance can be simulated in software
Estimate what IPC will be
Guide design

SLIDE 40


Single-Cycle Datapath Performance

+ Low Cycles Per Instruction (CPI): 1
– Long clock period: to accommodate slowest insn
  Goes against make common case fast (MCCF) principle

[Figure: single-cycle datapath with Control ROM/random logic]

SLIDE 41

Alternative: Multi-Cycle Datapath

Multi-cycle datapath: attacks high clock period
Cut datapath into multiple stages (5 here), isolate using FFs
FSM control “walks” insns thru stages (by staging control signals)

+ Insns can bypass stages and exit early

[Figure: datapath cut into stages by intermediate registers (labeled IR, D, O, B, A), with control signals tagged by stage (s3, s4, s5)]

SLIDE 42

Multi-cycle Datapath FSM

First state: Get a New Instruction
Output signals to fetch (e.g., read enable IMEM)
Next State: Always Decode

[State diagram: states Next Insn, Decode Insn]

SLIDE 43

Multi-cycle Datapath FSM

Second State: Decode
Output signals to decode instruction (RdEn RegFile)
Go to Next Insn if NOP
Otherwise Execute

[State diagram: states Next Insn, Decode Insn, Execute Insn; edge label NOP]

SLIDE 44

Multi-cycle Datapath FSM

Execute State
Execute Insn (varies by insn type)
Next State: Also depends on insn type
Branches: Next Insn

[State diagram: states Next Insn, Decode Insn, Execute Insn; edge labels Branch, NOP]

SLIDE 45

Multi-cycle Datapath FSM

Execute State
Execute Insn (varies by insn type)
Next State: Also depends on insn type
ALU op: write register

[State diagram: states Next Insn, Decode Insn, Execute Insn, Writeback; edge labels Branch, ALU, NOP]

SLIDE 46

Multi-cycle Datapath FSM

Execute State
Execute Insn (varies by insn type)
Next State: Also depends on insn type
Load: Read Memory

[State diagram: states Next Insn, Decode Insn, Execute Insn, Writeback, Read DMEM; edge labels Branch, ALU, Load, NOP]

SLIDE 47

Multi-cycle Datapath FSM

Execute State
Execute Insn (varies by insn type)
Next State: Also depends on insn type
Store: Write Memory

[State diagram: states Next Insn, Decode Insn, Execute Insn, Writeback, Read DMEM, Write DMEM; edge labels Branch, ALU, Load, Store, NOP]

SLIDE 48

Multi-cycle Datapath FSM

Read DMEM State
Control signals enable DMEM Read
Next state is writeback

[State diagram: states Next Insn, Decode Insn, Execute Insn, Writeback, Read DMEM, Write DMEM; edge labels Branch, ALU, Load, Store, NOP]

SLIDE 49

Multi-cycle Datapath FSM

Writeback state
Control signals enable regfile write
Next state: Next Insn

[State diagram: states Next Insn, Decode Insn, Execute Insn, Writeback, Read DMEM, Write DMEM; edge labels Branch, ALU, Load, Store, NOP]

SLIDE 50

Multi-cycle Datapath FSM

Write DMEM state
Control signals enable memory write
Next state: Next Insn

[State diagram: states Next Insn, Decode Insn, Execute Insn, Writeback, Read DMEM, Write DMEM; edge labels Branch, ALU, Load, Store, NOP]
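The state machine described across the preceding slides can be written down as a next-state function. This is a minimal sketch using the transitions stated in the bullets; the state and instruction-kind names (`"next_insn"`, `"alu"`, and so on) and the helper `cycles` are labels chosen for this example.

```python
def next_state(state, kind):
    """Next FSM state given the current state and the instruction kind.

    kind is one of: "nop", "branch", "alu", "load", "store".
    """
    if state == "next_insn":
        return "decode"                       # always decode after fetch
    if state == "decode":
        return "next_insn" if kind == "nop" else "execute"
    if state == "execute":
        return {
            "branch": "next_insn",            # branches finish here
            "alu": "writeback",               # ALU ops write a register
            "load": "read_dmem",              # loads read memory first
            "store": "write_dmem",            # stores write memory
        }[kind]
    if state == "read_dmem":
        return "writeback"
    if state in ("writeback", "write_dmem"):
        return "next_insn"
    raise ValueError(f"unknown state: {state}")


def cycles(kind):
    """Count states visited (one per cycle) before returning to next_insn."""
    state, n = "next_insn", 0
    while True:
        n += 1
        state = next_state(state, kind)
        if state == "next_insn":
            return n
```

Walking the machine this way gives 3 cycles for a branch, 4 for an ALU op or a store, and 5 for a load.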

SLIDE 54


Multi-Cycle Datapath Example: Add

Example: Add
Cycle 1: Read IMEM
Cycle 2: Decode + Read RF
Cycle 3: ALU
Cycle 4: Writeback + Increment PC

[Figure: multi-cycle datapath with intermediate registers (labeled IR, D, O, B, A)]

SLIDE 55

Multi-Cycle Datapath Performance

Opposite performance split of single-cycle datapath

+ Short clock period
– High CPI

[Figure: multi-cycle datapath with intermediate registers (labeled IR, D, O, B, A)]

SLIDE 59


Multi-cycle Data-path CPI

CPI depends on instructions
Branches / Jumps: 3 cycles
ALU: 4 cycles
Stores: 4 cycles
Loads: 5 cycles
Overall CPI is weighted average
Example:
20% loads, 15% stores, 20% branches, 45% ALU

CPI= 0.20 * 5 + 0.15 * 4 + 0.20 * 3 + 0.45 * 4 = 4.0
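The weighted average is easy to reproduce for other instruction mixes. Below is a small sketch using the per-class cycle counts from this slide; the names `CYCLES` and `average_cpi` are illustrative.

```python
# Cycles per instruction class on the multi-cycle datapath (from the slide).
CYCLES = {"load": 5, "store": 4, "branch": 3, "alu": 4}


def average_cpi(mix):
    """Weighted-average CPI for a mix given as {class: fraction}."""
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"fractions must sum to 1, got {total}")
    return sum(fraction * CYCLES[kind] for kind, fraction in mix.items())
```

Calling `average_cpi({"load": 0.20, "store": 0.15, "branch": 0.20, "alu": 0.45})` reproduces the 4.0 computed above.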

SLIDE 60


Multi-cycle Datapath Performance

Single-cycle
Clock period = 50ns, CPI = 1
Performance = 50 ns/insn
Multi-cycle
Clock period = 10ns
CPI = (0.2*3+0.2*5+0.6*4) = 4
Performance = 40 ns/insn
But wait…

SLIDE 61


Multi-Cycle Datapath Performance

Did not just cut up existing logic into 5 pieces
Also added logic (flip flops)
So clock period not 1/5 of single cycle, but slightly longer

[Figure: multi-cycle datapath with intermediate registers (labeled IR, D, O, B, A)]

SLIDE 62

Multi-cycle Datapath Performance

Single-cycle
Clock period = 50ns, CPI = 1
Performance = 50 ns/insn
Multi-cycle
Clock period = 12ns
CPI = (0.2*3+0.2*5+0.6*4) = 4
Performance = 48 ns/insn
Better, but not as exciting…
Can we do better still?
Have our cake (low CPI) and eat it too (high clock frequency)?
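To see how much the extra flip-flop delay eats into the gain, the comparison can be computed directly. This sketch uses the clock periods and instruction mix from the slides; the helper name `ns_per_insn` is made up for this example.

```python
def ns_per_insn(clock_period_ns, cpi):
    """Average time per instruction = clock period x CPI."""
    return clock_period_ns * cpi


# Instruction mix from the slide: 20% at 3 cycles, 20% at 5, 60% at 4.
multi_cpi = 0.2 * 3 + 0.2 * 5 + 0.6 * 4

single = ns_per_insn(50, 1)                # single-cycle datapath
multi_ideal = ns_per_insn(10, multi_cpi)   # clock exactly 1/5 of single-cycle
multi_real = ns_per_insn(12, multi_cpi)    # clock including flip-flop overhead

speedup_ideal = single / multi_ideal
speedup_real = single / multi_real
```

Under these numbers the ideal multi-cycle design would be 1.25x faster than single-cycle, but with the 12 ns clock the speedup drops to roughly 1.04x.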

SLIDE 63


Clock Period and CPI

Single-cycle datapath
+ Low CPI: 1
– Long clock period: to accommodate slowest insn

Multi-cycle datapath
+ Short clock period
– High CPI

Can we have both low CPI and short clock period?
– No good way to make a single insn go faster
+ Insn latency doesn’t matter anyway … insn throughput matters

Key: exploit inter-insn parallelism

Single-cycle: | insn0.fetch, dec, exec | insn1.fetch, dec, exec |
Multi-cycle:  | insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec |

SLIDE 64


Pipelining

Pipelining: important performance technique
Improves insn throughput rather than insn latency
Exploits parallelism at insn-stage level to do so
Begin with multi-cycle design
When insn advances from stage 1 to 2, next insn enters stage 1
Individual insns take same number of stages

+ But insns enter and leave at a much faster rate

Physically breaks “atomic” VN loop ... but must maintain illusion
Revisit at end of semester (hopefully)

Multi-cycle: | insn0.fetch | insn0.dec   | insn0.exec | insn1.fetch | insn1.dec | insn1.exec |
Pipelined:   | insn0.fetch | insn0.dec   | insn0.exec |
             |             | insn1.fetch | insn1.dec  | insn1.exec |
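The throughput benefit can be estimated with a simple cycle count. The sketch below assumes an idealized pipeline: every instruction passes through every stage, each stage takes one cycle, and there are no stalls or hazards (real pipelines are messier). The function names are illustrative.

```python
def multicycle_cycles(n_insns, n_stages):
    """Each insn finishes all its stages before the next one starts."""
    return n_insns * n_stages


def pipelined_cycles(n_insns, n_stages):
    """A new insn enters stage 1 every cycle (ideal, no stalls)."""
    if n_insns == 0:
        return 0
    return n_stages + n_insns - 1   # fill the pipe once, then one insn per cycle


def pipeline_timeline(n_insns, stage_names):
    """One row per insn; column c holds the stage that insn occupies in cycle c."""
    total = pipelined_cycles(n_insns, len(stage_names))
    rows = []
    for i in range(n_insns):
        row = [""] * total
        for s, name in enumerate(stage_names):
            row[i + s] = name       # insn i enters stage s in cycle i + s
        rows.append(row)
    return rows
```

For two instructions and three stages this gives 4 cycles pipelined versus 6 unpipelined. For 1000 instructions and 5 stages it is 1004 versus 5000, so throughput approaches one instruction per cycle even though each instruction still takes 5 cycles.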

SLIDE 65


Summary

Datapaths:
Single Cycle
What do we need?
Control
How control is implemented
Multi-cycle
Faster clock (yay!)
Worse CPI (boooo)
Performance:
IPC
Performance / Watt
CPU Performance Equation
Pipelining
Teaser for later!