Single-Cycle DataPath, Lecture 15, CDA 3103, 07-09-2014

SLIDE 1

Single-Cycle DataPath

Lecture 15 CDA 3103 07-09-2014

SLIDE 2

Review of Virtual Memory

Next level in the memory hierarchy

– Provides illusion of very large main memory
– Working set of “pages” residing in main memory (subset of all pages residing on disk)

Main goal: Avoid reaching all the way back to disk as much as possible

Additional goals:

– Let OS share memory among many programs and protect them from each other
– Each process thinks it has all the memory to itself

SLIDE 3

Review: Paging Terminology

Programs use virtual addresses (VAs)

– Space of all virtual addresses called virtual memory (VM)
– Divided into pages indexed by virtual page number (VPN)

Main memory indexed by physical addresses (PAs)

– Space of all physical addresses called physical memory (PM)
– Divided into pages indexed by physical page number (PPN)

SLIDE 4

Review: Translation Look-Aside Buffers (TLBs)

TLBs usually small, typically 128 - 256 entries
Like any other cache, the TLB can be direct mapped, set associative, or fully associative

[Figure: Processor sends VA to TLB Lookup; on a TLB hit the PA goes to the Cache, on a TLB miss Translation is performed; a Cache hit returns data, a Cache miss goes to Main Memory]

On TLB miss, get page table entry from main memory

SLIDE 5

Review: Memory Hierarchy

[Figure: memory hierarchy from upper level (faster) to lower level (larger): Regs, Cache, L2 Cache, Memory, Disk, Tape; units moved between levels: Instr. Operands, Blocks, Blocks, Pages, Files. Last Week: Virtual Memory]

SLIDE 6

Review Example 1

• A set-associative cache consists of 64 lines, or slots, divided into four-line sets. Main memory contains 4K blocks of 128 words each. Show the format of main memory addresses.

SLIDE 7

Solution

• The cache is divided into 16 sets of 4 lines each. Therefore, 4 bits are needed to identify the set number. Main memory consists of 4K = 2^12 blocks. Therefore, the set plus tag lengths must be 12 bits, and so the tag length is 8 bits. Each block contains 128 words. Therefore, 7 bits are needed to specify the word.

Address format (19 bits): TAG (8 bits) | SET (4 bits) | WORD (7 bits)
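The arithmetic in this solution can be checked with a short Python sketch. The function name `address_fields` and its parameters are made up for illustration, and it assumes every quantity is a power of two.

```python
def log2_exact(n):
    """Return log2(n) for a positive power of two, else raise ValueError."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1


def address_fields(cache_lines, lines_per_set, memory_blocks, units_per_block):
    """Return (tag_bits, set_bits, offset_bits) for a set-associative cache.

    Hypothetical helper: `units_per_block` is words for a word-addressed
    memory, bytes for a byte-addressed one.
    """
    set_bits = log2_exact(cache_lines // lines_per_set)  # which set
    block_bits = log2_exact(memory_blocks)               # tag + set
    offset_bits = log2_exact(units_per_block)            # word within block
    return block_bits - set_bits, set_bits, offset_bits


# Review Example 1: 64 lines, 4 lines per set, 4K blocks of 128 words.
print(address_fields(64, 4, 4 * 1024, 128))  # (8, 4, 7)
```

The three widths add up to 19 bits, which is the width of a main memory address in this example.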

SLIDE 8

Review Example 2

• A two-way set-associative cache has lines of 16 bytes and a total size of 8 kbytes. The 64-Mbyte main memory is byte addressable. Show the format of main memory addresses.

SLIDE 9

Solution

• There are a total of 8 kbytes/16 bytes = 512 lines in the cache. Thus the cache consists of 256 sets of 2 lines each. Therefore 8 bits are needed to identify the set number. For the 64-Mbyte main memory, a 26-bit address is needed. Main memory consists of 64-Mbyte/16 bytes = 2^22 blocks. Therefore, the set plus tag lengths must be 22 bits, so the tag length is 14 bits and the word field length is 4 bits.

Address format (26 bits): TAG (14 bits) | SET (8 bits) | WORD (4 bits)
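As an illustration of how this format is used, the sketch below splits one 26-bit address into its three fields with shifts and masks. The sample address 0x2ABCDEF and the name `split_address` are arbitrary choices, not part of the lecture.

```python
# Field widths from Review Example 2 (26-bit byte address).
OFFSET_BITS = 4   # 16-byte lines
SET_BITS = 8      # 256 sets
TAG_BITS = 14     # 26 - 8 - 4


def split_address(addr):
    """Split a 26-bit address into (tag, set_index, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    set_index = (addr >> OFFSET_BITS) & ((1 << SET_BITS) - 1)
    tag = (addr >> (OFFSET_BITS + SET_BITS)) & ((1 << TAG_BITS) - 1)
    return tag, set_index, offset


tag, set_index, offset = split_address(0x2ABCDEF)  # arbitrary sample address
print(hex(tag), hex(set_index), hex(offset))  # 0x2abc 0xde 0xf
```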

SLIDE 10

Agenda

Stages of the Datapath
Datapath Instruction Walkthroughs
Datapath Design

Dr Dan Garcia

SLIDE 11

Five Components of a Computer

Computer

– Processor: Control, Datapath
– Memory (passive): where programs, data live when running
– Devices: Input (Keyboard, Mouse), Output (Display, Printer), Disk (where programs, data live when not running)

SLIDE 12

The CPU

Processor (CPU): the active part of the computer that does all the work (data manipulation and decision-making)

Datapath: portion of the processor that contains hardware necessary to perform operations required by the processor (the brawn)

Control: portion of the processor (also in hardware) that tells the datapath what needs to be done (the brain)

SLIDE 13

Stages of the Datapath : Overview

Problem: a single, atomic block that “executes an instruction” (performs all necessary operations beginning with fetching the instruction) would be too bulky and inefficient

Solution: break up the process of “executing an instruction” into stages, and then connect the stages to create the whole datapath

– smaller stages are easier to design
– easy to optimize (change) one stage without touching the others

SLIDE 14

Five Stages of the Datapath

Stage 1: Instruction Fetch
Stage 2: Instruction Decode
Stage 3: ALU (Arithmetic-Logic Unit)
Stage 4: Memory Access
Stage 5: Register Write


SLIDE 15

Stages of the Datapath (1/5)

There is a wide variety of MIPS instructions: so what general steps do they have in common?

Stage 1: Instruction Fetch

– no matter what the instruction, the 32-bit instruction word must first be fetched from memory (the cache-memory hierarchy)
– also, this is where we Increment PC (that is, PC = PC + 4, to point to the next instruction: byte addressing so + 4)
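A minimal sketch of Stage 1 follows, assuming instruction memory is modeled as a Python list of 32-bit words (so byte address PC maps to index PC // 4). The function name `fetch` and the two sample words (encodings of add $3,$1,$2 and slti $3,$1,17) are illustrative choices.

```python
def fetch(instruction_memory, pc):
    """Stage 1: return (instruction_word, next_pc).

    Assumption: memory is a list of 32-bit words, so the byte address
    in PC is divided by 4 to get the list index.
    """
    word = instruction_memory[pc // 4]
    return word, pc + 4  # byte addressing, 4 bytes per instruction


program = [0x00221820, 0x28230011]  # add $3,$1,$2 ; slti $3,$1,17
word, next_pc = fetch(program, 0)
print(hex(word), next_pc)  # 0x221820 4
```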

SLIDE 16

Stages of the Datapath (2/5)

Stage 2: Instruction Decode

– upon fetching the instruction, we next gather data from the fields (decode all necessary instruction data)
– first, read the opcode to determine instruction type and field lengths
– second, read in data from all necessary registers

for add, read two registers
for addi, read one register
for jal, no reads necessary
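The field extraction in Stage 2 can be sketched as below. The bit positions are the standard MIPS32 R-type and I-type layouts, which the slides do not spell out, and the function name `decode` is hypothetical.

```python
def decode(word):
    """Split a 32-bit MIPS instruction word into its fields.

    All fields are extracted unconditionally; the opcode then tells
    the control logic which of them are meaningful.
    """
    return {
        "opcode": (word >> 26) & 0x3F,  # bits 31..26
        "rs": (word >> 21) & 0x1F,      # bits 25..21
        "rt": (word >> 16) & 0x1F,      # bits 20..16
        "rd": (word >> 11) & 0x1F,      # bits 15..11 (R-type)
        "shamt": (word >> 6) & 0x1F,    # bits 10..6  (R-type)
        "funct": word & 0x3F,           # bits 5..0   (R-type)
        "imm": word & 0xFFFF,           # bits 15..0  (I-type)
    }


fields = decode(0x00221820)  # add $3, $1, $2
print(fields["rs"], fields["rt"], fields["rd"])  # 1 2 3
```

For add, both rs and rt name registers to read; for addi only rs is read and imm carries the constant.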


SLIDE 17

Stages of the Datapath (3/5)

Stage 3: ALU (Arithmetic-Logic Unit)

– the real work of most instructions is done here: arithmetic (+, -, *, /), shifting, logic (&, |), comparisons (slt)
– what about loads and stores?

lw $t0, 40($t1)

the address we are accessing in memory = the value in $t1 PLUS the value 40
so we do this addition in this stage

SLIDE 18

Stages of the Datapath (4/5)

Stage 4: Memory Access

– actually only the load and store instructions do anything during this stage; the others remain idle during this stage or skip it altogether
– since these instructions have a unique step, we need this extra stage to account for them
– as a result of the cache system, this stage is expected to be fast

SLIDE 19

Stages of the Datapath (5/5)

Stage 5: Register Write

– most instructions write the result of some computation into a register
– examples: arithmetic, logical, shifts, loads, slt
– what about stores, branches, jumps?

don’t write anything into a register at the end
these remain idle during this fifth stage or skip it altogether

SLIDE 20

Chapter 4 — The Processor — 20

Logic Design Basics

§4.2 Logic Design Conventions

• Information encoded in binary
– Low voltage = 0, High voltage = 1
– One wire per bit
– Multi-bit data encoded on multi-wire buses
• Combinational element
– Operate on data
– Output is a function of input
• State (sequential) elements
– Store information

SLIDE 21

Combinational Elements

• AND-gate
– Y = A & B
• Multiplexer
– Y = S ? I1 : I0
• Adder
– Y = A + B
• Arithmetic/Logic Unit
– Y = F(A, B)

SLIDE 22

ALU Control

§4.4 A Simple Implementation Scheme

• ALU used for
– Load/Store: F = add
– Branch: F = subtract
– R-type: F depends on funct field

ALU control   Function
0000          AND
0001          OR
0010          add
0110          subtract
0111          set-on-less-than
1100          NOR
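The ALU control table can be read as a dispatch on the 4-bit control value. The sketch below is one way to model it in Python, assuming 32-bit operands held as unsigned integers and a signed comparison for set-on-less-than; the names `alu` and `to_signed` are illustrative.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def alu(control, a, b):
    """Return (result, zero) for the 4-bit ALU control codes in the table."""
    if control == 0b0000:
        result = a & b                                   # AND
    elif control == 0b0001:
        result = a | b                                   # OR
    elif control == 0b0010:
        result = a + b                                   # add
    elif control == 0b0110:
        result = a - b                                   # subtract
    elif control == 0b0111:
        result = 1 if to_signed(a) < to_signed(b) else 0  # set-on-less-than
    elif control == 0b1100:
        result = ~(a | b)                                # NOR
    else:
        raise ValueError(f"unsupported ALU control {control:04b}")
    result &= MASK32  # keep the result to 32 bits
    return result, result == 0


print(alu(0b0010, 40, 2))  # (42, False)
print(alu(0b0110, 5, 5))   # (0, True)
```

The second output of the pair models the Zero signal that a branch checks after a subtract.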

SLIDE 23

Sequential Elements

• Register: stores data in a circuit
– Uses a clock signal to determine when to update the stored value
– Edge-triggered: update when Clk changes from 0 to 1

[Figure: register with inputs D and Clk and output Q; timing diagram of Clk, D, Q]

SLIDE 24

Sequential Elements

• Register with write control
– Only updates on clock edge when write control input is 1
– Used when stored value is required later

[Figure: register with inputs D, Clk, Write and output Q; timing diagram of Clk, Write, D, Q]
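The behavior of an edge-triggered register with write control can be modeled in a few lines. This is a behavioral sketch only; the class name `Register` and the `tick` method are made up for illustration.

```python
class Register:
    """Rising-edge-triggered register with a write-control input."""

    def __init__(self, value=0):
        self.q = value        # stored value, visible on output Q
        self._prev_clk = 0    # last clock level seen, to detect edges

    def tick(self, clk, d, write=1):
        """Present inputs Clk, D, Write; return the output Q."""
        rising_edge = self._prev_clk == 0 and clk == 1
        if rising_edge and write == 1:
            self.q = d        # update only on a 0 -> 1 edge with Write = 1
        self._prev_clk = clk
        return self.q


r = Register()
print(r.tick(clk=1, d=5, write=0))  # 0: edge, but write control is 0
print(r.tick(clk=0, d=7))           # 0: no rising edge
print(r.tick(clk=1, d=7))           # 7: rising edge with write = 1
print(r.tick(clk=1, d=9))           # 7: clock is high but there is no new edge
```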

SLIDE 25

Clocking Methodology

• Combinational logic transforms data during clock cycles
– Between clock edges
– Input from state elements, output to state element
– Longest delay determines clock period

SLIDE 26

Building a Datapath

§4.3 Building a Datapath

• Datapath
– Elements that process data and addresses in the CPU
– Registers, ALUs, muxes, memories, …
• We will build a MIPS datapath incrementally
– Refining the overview design

SLIDE 27

Instruction Fetch

[Figure: instruction fetch datapath. Annotations: PC is a 32-bit register; increment by 4 for next instruction]

SLIDE 28

R-Format Instructions

• Read two register operands
• Perform arithmetic/logical operation
• Write register result

SLIDE 29

Load/Store Instructions

• Read register operands
• Calculate address using 16-bit offset
– Use ALU, but sign-extend offset
• Load: Read memory and update register
• Store: Write register value to memory

SLIDE 30

Branch Instructions

• Read register operands
• Compare operands
– Use ALU, subtract and check Zero output
• Calculate target address
– Sign-extend displacement
– Shift left 2 places (word displacement)
– Add to PC + 4 (already calculated by instruction fetch)

SLIDE 31

Branch Instructions

[Figure: branch datapath. Annotations: the shift-left-2 unit just re-routes wires; sign extension replicates the sign-bit wire]
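The branch target calculation (sign-extend the 16-bit displacement, shift left 2, add to PC + 4) can be sketched as follows. The function names are illustrative, and the result is wrapped to 32 bits as an assumption about the PC width.

```python
def sign_extend16(imm16):
    """Sign-extend a 16-bit field to a Python int (replicates bit 15)."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def branch_target(pc_plus_4, imm16):
    """Target = (PC + 4) + (sign-extended displacement << 2), kept to 32 bits."""
    word_displacement = sign_extend16(imm16)
    return (pc_plus_4 + (word_displacement << 2)) & 0xFFFFFFFF


print(hex(branch_target(0x1004, 3)))       # 0x1010: forward 3 instructions
print(hex(branch_target(0x1004, 0xFFFD)))  # 0xff8: backward 3 instructions
```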

SLIDE 32

Composing the Elements

• First-cut data path does an instruction in one clock cycle
– Each datapath element can only do one function at a time
– Hence, we need separate instruction and data memories
• Use multiplexers where alternate data sources are used for different instructions
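As a sketch of what "alternate data sources" means, the code below models two multiplexers using the Y = S ? I1 : I0 definition from the Combinational Elements slide. The select-signal names `alu_src` and `mem_to_reg` are assumptions borrowed from common textbook usage, not names given in these slides.

```python
def mux(select, i0, i1):
    """Two-input multiplexer: Y = S ? I1 : I0."""
    return i1 if select else i0


def alu_second_operand(alu_src, rt_value, imm_value):
    """R-type instructions use a register; loads and stores use the immediate."""
    return mux(alu_src, rt_value, imm_value)


def write_back_value(mem_to_reg, alu_result, mem_data):
    """Arithmetic writes back the ALU result; a load writes back memory data."""
    return mux(mem_to_reg, alu_result, mem_data)


print(alu_second_operand(0, rt_value=23, imm_value=17))     # 23
print(alu_second_operand(1, rt_value=23, imm_value=17))     # 17
print(write_back_value(1, alu_result=117, mem_data=123))    # 123
```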

SLIDE 33


R-Type/Load/Store Datapath

SLIDE 34


Full Datapath

SLIDE 35

Generic Steps of Datapath

[Figure: datapath blocks: PC, instruction memory, +4, registers (rs, rt, rd), imm, ALU, Data memory]

1. Instruction Fetch
2. Decode/Register Read
3. Execute
4. Memory
5. Register Write

SLIDE 36

Datapath Walkthroughs (1/3)

add $r3,$r1,$r2 # r3 = r1+r2

SLIDE 37

Datapath Walkthroughs (1/3)

add $r3,$r1,$r2 # r3 = r1+r2

– Stage 1: fetch this instruction, increment PC
– Stage 2: decode to determine it is an add, then read registers $r1 and $r2
– Stage 3: add the two values retrieved in Stage 2
– Stage 4: idle (nothing to write to memory)
– Stage 5: write result of Stage 3 into register $r3

SLIDE 38

Example: add Instruction

[Figure: add r3, r1, r2 traced through the datapath: register numbers 2, 1, 3; registers output reg[1] and reg[2]; ALU computes reg[1] + reg[2]]

SLIDE 39

Datapath Walkthroughs (2/3)

slti $r3,$r1,17 # if (r1 < 17) r3 = 1 else r3 = 0

SLIDE 40

Datapath Walkthroughs (2/3)

slti $r3,$r1,17 # if (r1 < 17) r3 = 1 else r3 = 0

– Stage 1: fetch this instruction, increment PC
– Stage 2: decode to determine it is an slti, then read register $r1
– Stage 3: compare value retrieved in Stage 2 with the integer 17
– Stage 4: idle
– Stage 5: write the result of Stage 3 (1 if reg source was less than signed immediate, 0 otherwise) into register $r3

SLIDE 41

Example: slti Instruction

[Figure: slti r3, r1, 17 traced through the datapath: register numbers 3, 1, x; registers output reg[1]; imm supplies 17; ALU computes reg[1] < 17?]

SLIDE 42

Datapath Walkthroughs (3/3)

sw $r3,17($r1) # Mem[r1+17]=r3

SLIDE 43

Datapath Walkthroughs (3/3)

sw $r3,17($r1) # Mem[r1+17]=r3

– Stage 1: fetch this instruction, increment PC
– Stage 2: decode to determine it is a sw, then read registers $r1 and $r3
– Stage 3: add 17 to value in register $r1 (retrieved in Stage 2) to compute address
– Stage 4: write the value contained in register $r3 (retrieved in Stage 2) into memory address computed in Stage 3
– Stage 5: idle (nothing to write into a register)

SLIDE 44

Example: sw Instruction

[Figure: SW r3, 17(r1) traced through the datapath: register numbers 3, 1, x; registers output reg[1] and reg[3]; imm supplies 17; ALU computes reg[1] + 17; Data memory performs MEM[r1+17] <= r3]

SLIDE 45

Why Five Stages? (1/2)

Could we have a different number of stages?

– Yes, and other architectures do

So why does MIPS have five if instructions tend to idle for at least one stage?

– Five stages are the union of all the operations needed by all the instructions.
– One instruction uses all five stages: the load

SLIDE 46

Why Five Stages? (2/2)

lw $r3,17($r1) # r3=Mem[r1+17]

– Stage 1: fetch this instruction, increment PC
– Stage 2: decode to determine it is a lw, then read register $r1
– Stage 3: add 17 to value in register $r1 (retrieved in Stage 2)
– Stage 4: read value from memory address computed in Stage 3
– Stage 5: write value read in Stage 4 into register $r3
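The four walkthroughs (add, slti, sw, lw) can be replayed with a small behavioral model. This is a sketch under simplifying assumptions: instructions are given as already-decoded tuples rather than 32-bit words, data memory is a Python dict keyed by address with no alignment checks, and values are not wrapped to 32 bits. The function name `run_instruction` and the starting register values are hypothetical.

```python
def run_instruction(instr, regs, mem, pc):
    """Push one decoded instruction through the five stages; return new PC."""
    # Stage 1: the instruction has been fetched; increment PC.
    next_pc = pc + 4

    # Stage 2: decode and read the registers this instruction needs.
    op = instr[0]
    if op == "add":                      # ("add", rd, rs, rt)
        _, dest, src1, src2 = instr
        a, b = regs[src1], regs[src2]
    elif op == "slti":                   # ("slti", rt, rs, imm)
        _, dest, src1, imm = instr
        a, b = regs[src1], imm
    elif op in ("lw", "sw"):             # ("lw"/"sw", rt, imm, base)
        _, data_reg, imm, base = instr
        a, b = regs[base], imm
    else:
        raise ValueError(f"unsupported instruction {op}")

    # Stage 3: ALU does a compare for slti, an add for everything else.
    alu_out = (1 if a < b else 0) if op == "slti" else a + b

    # Stage 4: only loads and stores touch data memory.
    mem_out = None
    if op == "lw":
        mem_out = mem[alu_out]
    elif op == "sw":
        mem[alu_out] = regs[data_reg]

    # Stage 5: register write (idle for sw).
    if op in ("add", "slti"):
        regs[dest] = alu_out
    elif op == "lw":
        regs[data_reg] = mem_out
    return next_pc


regs = [0] * 32
regs[1], regs[2] = 100, 23               # hypothetical starting values
mem = {}
pc = 0
for instr in [("add", 3, 1, 2), ("sw", 3, 17, 1),
              ("slti", 3, 1, 17), ("lw", 3, 17, 1)]:
    pc = run_instruction(instr, regs, mem, pc)
print(regs[3], mem[117], pc)  # 123 123 16
```

Only the lw path performs work in every stage, which matches the point that the load is the one instruction using all five.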

SLIDE 47

Example: lw Instruction

[Figure: LW r3, 17(r1) traced through the datapath: register numbers 3, 1, x; registers output reg[1]; imm supplies 17; ALU computes reg[1] + 17; Data memory reads MEM[r1+17]]

SLIDE 48

Peer Instruction

How many places in this diagram will need a multiplexor to select one from multiple inputs?

a) 0
b) 1
c) 2
d) 3
e) 4 or more

SLIDE 49

Peer Instruction

• Can’t just join wires together
– Use multiplexers

SLIDE 50

Datapath and Control

Datapath based on data transfers required to perform instructions

Controller causes the right transfers to happen

[Figure: datapath (PC, instruction memory, +4, registers rs/rt/rd, imm, ALU, Data memory) driven by a Controller whose inputs are opcode, funct]

SLIDE 51

What Hardware Is Needed? (1/2)

PC: a register that keeps track of address of the next instruction to be fetched

General Purpose Registers

– Used in Stages 2 (Read) and 5 (Write)
– MIPS has 32 of these

Memory

– Used in Stages 1 (Fetch) and 4 (R/W)
– Caches make these stages as fast as the others (on average, otherwise multicycle stall)

SLIDE 52

What Hardware Is Needed? (2/2)

ALU

– Used in Stage 3
– Performs all necessary functions: arithmetic, logicals, etc.

Miscellaneous Registers

– One stage per clock cycle: Registers inserted between stages to hold intermediate data and control signals as they travel from stage to stage
– Note: Register is a general purpose term meaning something that stores bits. Realize that not all registers are in the “register file”

SLIDE 53

CPU Clocking (1/2)

For each instruction, how do we control the flow of information through the datapath?

Single Cycle CPU: All stages of an instruction completed within one long clock cycle

– Clock cycle sufficiently long to allow each instruction to complete all stages without interruption within one cycle

[Figure: one clock cycle spanning 1. Instruction Fetch, 2. Decode/Register Read, 3. Execute, 4. Memory, 5. Reg. Write]

SLIDE 54

CPU Clocking (2/2)

Alternative multiple-cycle CPU: only one stage of instruction per clock cycle

– Clock is made as long as the slowest stage
– Several significant advantages over single cycle execution: Unused stages in a particular instruction can be skipped OR instructions can be pipelined (overlapped)

[Figure: one clock cycle per stage: 1. Instruction Fetch, 2. Decode/Register Read, 3. Execute, 4. Memory, 5. Register Write]
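The two clocking schemes can be compared numerically. The stage delays below are hypothetical values chosen only for illustration (the lecture gives none), and the per-instruction stage lists follow the walkthroughs for add, sw, and lw.

```python
# Hypothetical stage delays in picoseconds (not from the lecture).
STAGE_DELAY_PS = {"IF": 200, "ID": 150, "ALU": 200, "MEM": 200, "WR": 150}

# Stages each instruction actually uses, per the walkthroughs.
STAGES_USED = {
    "add": ["IF", "ID", "ALU", "WR"],
    "sw": ["IF", "ID", "ALU", "MEM"],
    "lw": ["IF", "ID", "ALU", "MEM", "WR"],
}


def single_cycle_period(delays):
    """One long cycle must cover every stage back to back."""
    return sum(delays.values())


def multi_cycle_period(delays):
    """One stage per cycle: the clock is as long as the slowest stage."""
    return max(delays.values())


def multi_cycle_time(instr, delays):
    """Time for one instruction when its unused stages are skipped."""
    return len(STAGES_USED[instr]) * multi_cycle_period(delays)


print(single_cycle_period(STAGE_DELAY_PS))      # 900
print(multi_cycle_period(STAGE_DELAY_PS))       # 200
print(multi_cycle_time("add", STAGE_DELAY_PS))  # 800
print(multi_cycle_time("lw", STAGE_DELAY_PS))   # 1000
```

With these made-up numbers an add finishes sooner in the multiple-cycle design while a lw takes longer, so the outcome depends on the instruction mix and on how evenly the stages are balanced.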

SLIDE 55

Processor Design

Analyze instruction set architecture (ISA) to determine datapath requirements

– Meaning of each instruction is given by register transfers
– Datapath must include storage element for ISA registers
– Datapath must support each register transfer

Select set of datapath components and establish clocking methodology

Assemble datapath components to meet requirements

Analyze each instruction to determine sequence of control point settings to implement the register transfer

Assemble the control logic to perform this sequencing

SLIDE 56

Instruction Level Parallelism

[Figure: pipeline diagram over clock periods P 1 to P 12; Instr 1 through Instr 8 each pass through IF, ID, ALU, MEM, WR, with successive instructions overlapped]
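The overlap in the diagram can be sketched as a schedule, assuming an ideal pipeline in which each instruction starts one clock period after the previous one and nothing ever stalls. That assumption and the function names are illustrative, not stated in the slide.

```python
STAGES = ["IF", "ID", "ALU", "MEM", "WR"]


def pipeline_schedule(num_instructions):
    """Map instruction number -> {clock period: stage}, assuming no stalls."""
    schedule = {}
    for i in range(1, num_instructions + 1):
        # Instruction i enters IF in period i and advances one stage per period.
        schedule[i] = {i + s: stage for s, stage in enumerate(STAGES)}
    return schedule


def total_periods(num_instructions):
    """Periods needed to finish all instructions in the ideal pipeline."""
    return num_instructions + len(STAGES) - 1


schedule = pipeline_schedule(8)
print(schedule[1])        # {1: 'IF', 2: 'ID', 3: 'ALU', 4: 'MEM', 5: 'WR'}
print(total_periods(8))   # 12
```

Eight instructions finish in 12 periods under this assumption, which matches the P 1 to P 12 span of the figure; run strictly one after another they would need 8 × 5 = 40 periods.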

SLIDE 57

Summary

CPU design involves Datapath, Control

– 5 Stages for MIPS Instructions

1. Instruction Fetch
2. Instruction Decode & Register Read
3. ALU (Execute)
4. Memory
5. Register Write
Datapath timing: single long clock cycle or one short clock cycle per stage