CSEE 3827: Fundamentals of Computer Systems, Lectures 18, 19, & 20
April 2009
Martha Kim (martha@cs.columbia.edu)
CSEE 3827, Spring 2009 Martha Kim
Outline
- We will examine two MIPS implementations
  - A single-cycle version
  - A pipelined version
- Simple subset of MIPS, showing most aspects
  - Memory reference: lw, sw
  - Arithmetic/logical: add, sub, and, or, slt
  - Control transfer: beq, j
- CPU performance factors
  - Instruction count (determined by ISA and compiler)
  - Cycles per instruction and cycle time (determined by CPU hardware)
Instruction Execution
- PC → instruction memory, fetch instruction
- Register numbers → register file, read registers
- Depending on instruction class:
  - Use ALU to calculate:
    - Arithmetic or logical result
    - Memory address for load/store
    - Branch target address
  - Access data for load/store
  - PC ← target address or PC + 4
CPU Overview
Can’t just join wires together; use muxes
Control
MIPS Datapath
Combinational Elements
- AND gate (Y = A & B)
- Multiplexer (Y = S ? A : B)
- Adder (Y = A + B)
- Arithmetic/Logic Unit (ALU) (Y = F(A, B))
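The four elements above can be modeled as pure functions of their inputs. The following is a minimal Python sketch, assuming 32-bit words; the function names and the string-valued `f` selector for the ALU are illustrative choices, not something the slides define.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit datapath words


def and_gate(a: int, b: int) -> int:
    """Y = A & B"""
    return a & b


def mux(s: int, a: int, b: int) -> int:
    """Y = S ? A : B"""
    return a if s else b


def adder(a: int, b: int) -> int:
    """Y = A + B, wrapped to 32 bits."""
    return (a + b) & MASK32


def alu(f: str, a: int, b: int) -> int:
    """Y = F(A, B); `f` is a hypothetical name for the selected function."""
    if f == "and":
        return a & b
    if f == "or":
        return a | b
    if f == "add":
        return (a + b) & MASK32
    if f == "sub":
        return (a - b) & MASK32
    raise ValueError(f"unsupported ALU function: {f}")
```

None of these functions keeps state: the output depends only on the current inputs, which is what makes an element combinational.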
Clocking Methodology
Combinational logic transforms data during each clock cycle, between clock edges. The longest combinational delay determines the clock period.
Building a datapath incrementally
- Datapath: elements that process data and addresses in the CPU
- Datapath will execute one instruction in one clock cycle
  - Each datapath element can only do one function at a time
  - Hence, we need separate instruction and data memories
- Use multiplexers where alternate data sources are used for different instructions
Instruction Fetch
- Fetch the instruction at the address held in the PC register from instruction memory
- Compute PC + 4 as the address of the next instruction
Part 1: Instruction Fetch
R-Format Instructions
- Read two register operands
- Perform arithmetic/logical operation
- Write register result
Load/Store Instructions
- Read register operands
- Calculate address using 16-bit offset (use ALU but sign-extend offset)
- Load: read memory and update register
- Store: write register value to memory
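The address calculation for loads and stores can be sketched in a few lines. This is an illustrative Python model, assuming 32-bit addresses, a list for the register file, and a dictionary for data memory; the helper names are hypothetical.

```python
def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of `value` as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(base: int, offset16: int) -> int:
    """ALU add of the base register and the sign-extended 16-bit offset."""
    return (base + sign_extend16(offset16)) & 0xFFFFFFFF


def load_word(regs: list, memory: dict, rs: int, rt: int, offset16: int) -> None:
    """lw: read memory and update register rt."""
    regs[rt] = memory[effective_address(regs[rs], offset16)]


def store_word(regs: list, memory: dict, rs: int, rt: int, offset16: int) -> None:
    """sw: write the value of register rt to memory."""
    memory[effective_address(regs[rs], offset16)] = regs[rt]
```

Sign extension is what lets a 16-bit field express negative offsets such as -4 (encoded as 0xFFFC).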
Part 2: R-Type/Load/Store Datapath
Branch Instructions
- Read register operands
- Compare operands (use ALU: subtract and check zero output)
- Calculate target address
  - Sign-extend displacement
  - Shift left two places (word displacement)
  - Add to PC+4 (already calculated by instruction fetch)
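The branch steps above can be sketched as follows. This is a minimal Python model, assuming 32-bit addresses; `branch_target` and `next_pc_beq` are hypothetical names used only for illustration.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit addresses


def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of `value` as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc: int, displacement16: int) -> int:
    """Sign-extend the displacement, shift left two places, add to PC + 4."""
    pc_plus_4 = (pc + 4) & MASK32
    return (pc_plus_4 + (sign_extend16(displacement16) << 2)) & MASK32


def next_pc_beq(pc: int, rs_value: int, rt_value: int, displacement16: int) -> int:
    """beq: subtract in the ALU and take the branch when the result is zero."""
    zero = ((rs_value - rt_value) & MASK32) == 0
    return branch_target(pc, displacement16) if zero else (pc + 4) & MASK32
```

The shift by two converts a displacement counted in words into one counted in bytes, since every MIPS instruction is 4 bytes long.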
Part 3: Instruction Fetch w. Branch
Full Datapath
MIPS Datapath Control
Datapath Control Scheme
- Main control drives the whole datapath based on the opcode
- ALU control drives the ALU based on ALUOp (derived from the opcode by main control) and the function field (funct)
ALU Control Inputs/Outputs
ALU control inputs:
- ALUOp (2 bits, from main control): R-type → 10, lw → 00, sw → 00, beq → 01
- Instruction[5:0]
ALU control output:
- Operation (4 bits, to the ALU): 0000 → AND, 0001 → OR, 0010 → add, 0110 → subtract, 0111 → set on less than
(See Appendix C of text for implementation of corresponding ALU.)
ALU Control Implementation
| opcode | ALUOp from main control | Instruction[5:0] | Instruction | ALU action | Operation |
|---|---|---|---|---|---|
| lw | 00 | xxxxxx | load word | add | 0010 |
| sw | 00 | xxxxxx | store word | add | 0010 |
| beq | 01 | xxxxxx | branch equal | subtract | 0110 |
| R-type | 10 | 100000 | add | add | 0010 |
| R-type | 10 | 100010 | subtract | subtract | 0110 |
| R-type | 10 | 100100 | AND | AND | 0000 |
| R-type | 10 | 100101 | OR | OR | 0001 |
| R-type | 10 | 101010 | set on less than | set on less than | 0111 |
ALU Control Truth Table
| ALUOp from main control | Instruction[5:0] | Operation |
|---|---|---|
| 00 | xxxxxx | 0010 |
| 00 | xxxxxx | 0010 |
| 01 | xxxxxx | 0110 |
| 10 | 100000 | 0010 |
| 10 | 100010 | 0110 |
| 10 | 100100 | 0000 |
| 10 | 100101 | 0001 |
| 10 | 101010 | 0111 |
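The truth table above maps directly onto a small lookup function. This is a sketch in Python rather than the gate-level implementation; the name `alu_control` and the use of an exception for unlisted inputs are assumptions made for illustration.

```python
# funct field (Instruction[5:0]) → 4-bit ALU operation, used when ALUOp = 10
R_TYPE_OPERATIONS = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # subtract
    0b100100: 0b0000,  # AND
    0b100101: 0b0001,  # OR
    0b101010: 0b0111,  # set on less than
}


def alu_control(alu_op: int, funct: int) -> int:
    """Return the 4-bit ALU operation for a 2-bit ALUOp and 6-bit funct."""
    if alu_op == 0b00:  # lw, sw: funct is a don't-care, ALU adds
        return 0b0010
    if alu_op == 0b01:  # beq: funct is a don't-care, ALU subtracts
        return 0b0110
    if alu_op == 0b10:  # R-type: operation is selected by funct
        return R_TYPE_OPERATIONS[funct]
    raise ValueError(f"ALUOp {alu_op:02b} is not in the truth table")
```

Because funct is a don't-care for ALUOp 00 and 01, the function ignores it entirely in those two cases.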
ALU Control Truth Table 2
Datapath Control Scheme
Main control signals derive from instruction types
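R-type:

| 31:26 | 25:21 | 20:16 | 15:11 | 10:6 | 5:0 |
|---|---|---|---|---|---|
| 0 | rs | rt | rd | shamt | funct |

Load/Store:

| 31:26 | 25:21 | 20:16 | 15:0 |
|---|---|---|---|
| 35 or 43 | rs | rt | constant |

Branch:

| 31:26 | 25:21 | 20:16 | 15:0 |
|---|---|---|---|
| 4 | rs | rt | constant |

- rs (25:21): always read
- rt (20:16): read, except for load
- Destination register (rd for R-type, rt for load): write for R-type and load
- constant (15:0): sign-extend and add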
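The field boundaries above translate into shifts and masks. The following is a minimal Python sketch; `decode` and its dictionary keys are hypothetical names, and every field is extracted regardless of format, since the format is only known once the opcode has been examined.

```python
def decode(instruction: int) -> dict:
    """Split a 32-bit MIPS instruction word into its fields."""
    return {
        "opcode": (instruction >> 26) & 0x3F,   # bits 31:26
        "rs": (instruction >> 21) & 0x1F,       # bits 25:21
        "rt": (instruction >> 16) & 0x1F,       # bits 20:16
        "rd": (instruction >> 11) & 0x1F,       # bits 15:11 (R-type only)
        "shamt": (instruction >> 6) & 0x1F,     # bits 10:6  (R-type only)
        "funct": instruction & 0x3F,            # bits 5:0   (R-type only)
        "constant": instruction & 0xFFFF,       # bits 15:0  (load/store/branch)
    }


# Usage: an lw with opcode 35, rs = 29, rt = 8, constant = 4
fields = decode((35 << 26) | (29 << 21) | (8 << 16) | 4)
```

Hardware does the same thing with wires: each field is a fixed slice of the instruction bus, so no logic is needed to extract it.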
R-Type Control Signals
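[Figure: datapath annotated with control values for R-type instructions. ALUOp = 10; two control signals are set to 1.]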
(Alt. illustration: Fig. 4.19)
lw Control Signals
[Figure: datapath annotated with control values for lw. ALUOp = 00; four control signals are set to 1.]
(Alt. illustration: Fig. 4.20)
sw Control Signals
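[Figure: datapath annotated with control values for sw. ALUOp = 00; two control signals are set to 1 and two are don't-cares (x).]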
beq Control Signals
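[Figure: datapath annotated with control values for beq. ALUOp = 01; one control signal is set to 1 and two are don't-cares (x).]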
(Alt. illustration: Fig. 4.21)
Main Control Truth Table
Instruction[31:26]:
- 000000: R-type
- 100011: lw
- 101011: sw
- 000100: beq
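The body of the truth table did not survive in these notes, so the sketch below fills it in from the standard textbook single-cycle MIPS datapath. The signal names (RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch) are an assumption taken from that textbook design, not from the slides; the ALUOp values match the ones given earlier. Don't-care entries are modeled as `None`.

```python
X = None  # don't-care

# Assumed signal names from the standard textbook datapath, keyed by Instruction[31:26]
MAIN_CONTROL = {
    0b000000: dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                   MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),  # R-type
    0b100011: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                   MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),  # lw
    0b101011: dict(RegDst=X, ALUSrc=1, MemtoReg=X, RegWrite=0,
                   MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),  # sw
    0b000100: dict(RegDst=X, ALUSrc=0, MemtoReg=X, RegWrite=0,
                   MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),  # beq
}


def main_control(opcode: int) -> dict:
    """Look up the control signals for a 6-bit opcode."""
    return MAIN_CONTROL[opcode]
```

RegDst and MemtoReg are don't-cares for sw and beq because neither instruction writes a register, so the multiplexers they steer have no effect.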
Implementing Jumps
The j instruction
- Unconditional jump to instruction at label
- Instruction encoded in J-type format
- Jump uses word addresses
- Update PC with concatenation of:
  - Top 4 bits of old PC
  - 26-bit jump address
  - 00

j label:

| 31:26 | 25:0 |
|---|---|
| 2 | address |
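The concatenation that forms the new PC can be written with a mask and a shift. This is a minimal Python sketch with hypothetical function names; note that the slide says "old PC", while the textbook datapath takes the top 4 bits from PC + 4, so the caller is assumed to pass whichever value matches the design being modeled.

```python
def jump_target(pc: int, address26: int) -> int:
    """Concatenate the top 4 bits of `pc`, the 26-bit jump address, and 00."""
    top4 = pc & 0xF0000000                       # bits 31:28 of the PC
    word_address = address26 & 0x03FFFFFF        # 26-bit field from the instruction
    return top4 | (word_address << 2)            # shifting left by 2 appends 00


def encode_j(address26: int) -> int:
    """Build a J-type instruction word: opcode 2 in bits 31:26, address in 25:0."""
    return (2 << 26) | (address26 & 0x03FFFFFF)
```

Because only 26 bits of address are encoded, a jump can reach any instruction within the current 256 MB region selected by the top 4 PC bits.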
Implementing the jump instruction
Implementing the jump instruction: in-class solution
CPU Performance
Understanding Performance
- Algorithm → determines number of operations executed
- Programming language, compiler, architecture → determine number of machine instructions executed per operation
- Processor and memory system → determine how fast instructions are executed
- I/O system (including OS) → determines how fast I/O operations are executed
Defining Performance
- Which airplane has the best performance?
[Figure: four bar charts comparing the Douglas DC-8-50, BAC/Sud Concorde, Boeing 747, and Boeing 777 on passenger capacity, cruising range (miles), cruising speed (mph), and passengers x mph.]
Response Time and Throughput
- Response time: how long it takes to do a task, sometimes also called latency [time/work]
- Throughput: total work done per unit time [work/time]
- How are response time and throughput affected by...
  - Replacing the processor with a faster version?
  - Adding more processors?
- For now, we’ll focus on response time
Relative Performance
Define: Performance = 1 / Execution Time
“X is n times faster than Y” → Performance X / Performance Y = Execution Time Y / Execution Time X = n
Example:
- Program takes 10 s to run on machine A, 15 s on machine B
- Execution Time B / Execution Time A = 15 / 10 = 1.5
- “A is 1.5 times faster than B”
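The definition and the example can be checked with a couple of lines. This is a small Python sketch; the function names are illustrative only.

```python
def performance(execution_time_s: float) -> float:
    """Performance = 1 / Execution Time."""
    return 1.0 / execution_time_s


def times_faster(time_x_s: float, time_y_s: float) -> float:
    """How many times faster X is than Y: Performance X / Performance Y."""
    return performance(time_x_s) / performance(time_y_s)


# Machine A takes 10 s, machine B takes 15 s
ratio = times_faster(10, 15)
print(f"A is {ratio:.1f} times faster than B")
```

The ratio of performances equals the inverse ratio of execution times, which is why the example divides B's time by A's.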
Measuring Execution Time
Define: Elapsed Time
- Total response time including all aspects (processing, I/O, overhead, idle time)
Define: CPU Time
- Time spent processing a given job (discounts I/O time, other jobs' shares)
Elapsed Time > CPU Time
CPU Clocking
Operation of digital hardware is governed by a constant-rate clock.
[Figure: timing diagram of the clock over time, showing one clock period with data transfer and computation followed by a state update.]
- Clock period: duration of a clock cycle, e.g., 250 ps = 0.25 ns
- Clock frequency (rate): cycles per second, e.g., 4.0 GHz = 4000 MHz
CPU Time
CPU Time = CPU Clock Cycles * Clock Cycle Time = CPU Clock Cycles / Clock Rate
Performance improved by:
1. Reducing number of clock cycles
2. Increasing clock rate (reducing clock period)
Hardware designer must often trade off clock rate against cycle count.
CPU Time Example
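Computer A: 2 GHz clock, 10 s CPU time
Designing Computer B:
- Aim for 6 s CPU time
- Clock rate increase requires 1.2x the number of cycles
How fast must Computer B’s clock be?

$$\text{Clock Rate}_B = \frac{\text{Clock Cycles}_B}{\text{CPU Time}_B} = \frac{1.2 \times \text{Clock Cycles}_A}{6\,\text{s}}$$

$$\text{Clock Cycles}_A = \text{CPU Time}_A \times \text{Clock Rate}_A = 10\,\text{s} \times 2\,\text{GHz} = 20 \times 10^9$$

$$\text{Clock Rate}_B = \frac{1.2 \times 20 \times 10^9}{6\,\text{s}} = \frac{24 \times 10^9}{6\,\text{s}} = 4\,\text{GHz}$$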
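The same calculation can be scripted, which makes it easy to try other targets. This is a minimal Python sketch; the function and variable names are illustrative, and floating-point arithmetic means the result is only approximately 4 GHz.

```python
def clock_cycles(cpu_time_s: float, clock_rate_hz: float) -> float:
    """CPU Clock Cycles = CPU Time * Clock Rate."""
    return cpu_time_s * clock_rate_hz


def required_clock_rate(cycles: float, target_cpu_time_s: float) -> float:
    """Clock Rate = CPU Clock Cycles / CPU Time."""
    return cycles / target_cpu_time_s


cycles_a = clock_cycles(10, 2e9)           # Computer A: 10 s at 2 GHz
cycles_b = 1.2 * cycles_a                  # Computer B needs 1.2x the cycles
rate_b = required_clock_rate(cycles_b, 6)  # target: 6 s CPU time
print(f"Computer B clock rate: {rate_b / 1e9:.1f} GHz")
```

Doubling the clock rate only cuts CPU time from 10 s to 6 s here because the faster design also spends 20% more cycles.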
Instruction Count and CPI
Clock Cycles = Instruction Count * Cycles per Instruction
CPU Time = Instruction Count * CPI * Clock Cycle Time = (Instruction Count * CPI) / Clock Rate
Instruction count
- Determined by program, ISA, and compiler
Average cycles per instruction (CPI)
- Determined by CPU hardware
- If different instructions have different CPI, can compute a weighted average based on instruction mix
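A weighted-average CPI is a sum of per-class CPIs scaled by how often each class occurs. The Python sketch below uses a made-up instruction mix purely for illustration; the fractions and per-class CPI values are hypothetical, not data from the lecture.

```python
def average_cpi(mix: list) -> float:
    """mix: (fraction of instructions, CPI) pairs whose fractions sum to 1."""
    return sum(fraction * cpi for fraction, cpi in mix)


def cpu_time(instruction_count: float, cpi: float, clock_cycle_time_s: float) -> float:
    """CPU Time = Instruction Count * CPI * Clock Cycle Time."""
    return instruction_count * cpi * clock_cycle_time_s


# Hypothetical mix: 50% of instructions at CPI 1, 30% at CPI 2, 20% at CPI 3
hypothetical_mix = [(0.5, 1), (0.3, 2), (0.2, 3)]
cpi = average_cpi(hypothetical_mix)
seconds = cpu_time(1e9, cpi, 250e-12)  # hypothetical: 1e9 instructions, 250 ps cycle
```

The weighting matters: a rare instruction with a high CPI contributes less to the average than a common one with a moderate CPI.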
CPI Example
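Computer A: cycle time = 250 ps, CPI = 2.0
Computer B: cycle time = 500 ps, CPI = 1.2
Same ISA
Which is faster, and by how much?

$$\text{CPU Time}_A = \text{Instruction Count} \times \text{CPI}_A \times \text{Cycle Time}_A = I \times 2.0 \times 250\,\text{ps} = I \times 500\,\text{ps}$$

$$\text{CPU Time}_B = \text{Instruction Count} \times \text{CPI}_B \times \text{Cycle Time}_B = I \times 1.2 \times 500\,\text{ps} = I \times 600\,\text{ps}$$

A is faster...

$$\frac{\text{CPU Time}_B}{\text{CPU Time}_A} = \frac{I \times 600\,\text{ps}}{I \times 500\,\text{ps}} = 1.2$$

...by this much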
Amdahl’s Law
Be aware when optimizing...

$$T_{\text{improved}} = \frac{T_{\text{affected}}}{\text{improvement factor}} + T_{\text{unaffected}}$$

Example: On machine A, multiplication accounts for 80 s out of 100 s total CPU time. How much improvement in multiplication performance is needed to get a 5x speedup overall?
Corollary: make the common case fast
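Working the example through shows why the question is a trap. A 5x overall speedup means a target time of 100 s / 5 = 20 s, but the 20 s not spent multiplying is untouched, so the multiplication time would have to drop to zero. The Python sketch below makes that explicit; the function names and the choice of `None` to signal an unreachable target are illustrative assumptions.

```python
def improved_time(t_affected: float, t_unaffected: float, factor: float) -> float:
    """T_improved = T_affected / improvement factor + T_unaffected."""
    return t_affected / factor + t_unaffected


def required_factor(t_affected: float, t_unaffected: float, target_time: float):
    """Improvement factor needed to hit target_time, or None if unreachable."""
    budget = target_time - t_unaffected  # time left for the affected part
    if budget <= 0:
        return None                      # no finite factor can reach the target
    return t_affected / budget


# 80 s of multiplication out of 100 s total; 5x overall speedup means 20 s
answer = required_factor(80, 20, 100 / 5)
```

Here `answer` is `None`: no improvement to multiplication alone can deliver a 5x speedup. A more modest target, such as 40 s overall, needs a factor of 4.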
Performance Summary
$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$

Performance depends on all of these things. Algorithm, programming language, and compiler affect instruction count and CPI. The ISA affects all three terms.
Single-Cycle CPU Performance Issues
- Longest delay determines clock period
- Critical path: load instruction
  - instruction memory → register file → ALU → data memory → register file
- Not feasible to vary clock period for different instructions
- We will improve performance by pipelining
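To see why the load instruction sets the clock period, it helps to add up per-stage delays along each instruction's path. The delays in the Python sketch below are hypothetical round numbers chosen only for illustration (the slides give none), and the paths for sw, R-type, and beq are assumptions based on which units each instruction class uses.

```python
# Hypothetical stage delays in picoseconds (illustrative values, not from the slides)
DELAY_PS = {
    "instruction memory": 200,
    "register read": 100,
    "ALU": 200,
    "data memory": 200,
    "register write": 100,
}

# Units each instruction class passes through in a single-cycle datapath
PATHS = {
    "lw": ["instruction memory", "register read", "ALU", "data memory", "register write"],
    "sw": ["instruction memory", "register read", "ALU", "data memory"],
    "R-type": ["instruction memory", "register read", "ALU", "register write"],
    "beq": ["instruction memory", "register read", "ALU"],
}


def path_delay_ps(instruction: str) -> int:
    """Total combinational delay along one instruction's path."""
    return sum(DELAY_PS[unit] for unit in PATHS[instruction])


critical = max(PATHS, key=path_delay_ps)   # instruction with the longest path
clock_period_ps = path_delay_ps(critical)  # every instruction pays this period
```

With these assumed delays, every instruction takes as long as lw, even a beq that needs barely more than half that time. That wasted slack is what pipelining recovers.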
Pipelining Preview: Laundry Analogy