SLIDE 1 Lecture 2: Architectural Performance Laws and Rules of Thumb
- Prof. V. Catania
- Lab. Calcolatori
SLIDE 2 Measurement Tools
- Benchmarks, Traces, Mixes
- Cost, delay, area, power estimation
- Simulation (many levels)
– ISA, RT, Gate, Circuit
- Queuing Theory
- Rules of Thumb
- Fundamental Laws
SLIDE 3 The Bottom Line: Performance (and Cost)
"X is n times faster than Y" means:
$\frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = n$
- Speed of Concorde vs. Boeing 747
- Throughput of Boeing 747 vs. Concorde
SLIDE 4 Performance Terminology
“X is n% faster than Y” means:
$\frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = 1 + \frac{n}{100}$
$n = \frac{100\,(\text{Performance}(X) - \text{Performance}(Y))}{\text{Performance}(Y)}$
Example: Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X?
SLIDE 5
Example
$\frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{15}{10} = \frac{1.5}{1.0} = \frac{\text{Performance}(X)}{\text{Performance}(Y)}$
$n = \frac{100\,(1.5 - 1.0)}{1.0} = 50\%$
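The "n% faster" definition can be checked with a few lines of Python. This is a minimal sketch; the function name `percent_faster` and its parameter names are illustrative and do not come from the slides.

```python
def percent_faster(extime_x: float, extime_y: float) -> float:
    """Return n such that X is n% faster than Y, given execution times."""
    # ExTime(Y) / ExTime(X) = 1 + n/100  =>  n = 100 * (ExTime(Y)/ExTime(X) - 1)
    return 100.0 * (extime_y / extime_x - 1.0)


# Example from the slide: Y takes 15 s, X takes 10 s.
n = percent_faster(extime_x=10.0, extime_y=15.0)
print(f"X is {n:.0f}% faster than Y")  # prints "X is 50% faster than Y"
```

Note that the ratio is taken over execution times, so the slower machine's time goes in the numerator.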
SLIDE 6
Amdahl's Law
MAKE THE COMMON CASE FAST!
The performance improvement that can be gained by making some activity faster is limited by the fraction of time in which that activity takes place.
SPEEDUP: a measure of how much faster a task runs on the ENHANCED machine.
SLIDE 7 Amdahl's Law
Speedup due to enhancement E:
$\text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E}$
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:
ExTime(E) =
Speedup(E) =
SLIDE 8
Amdahl’s Law
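$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right]$
$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$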
SLIDE 9 Amdahl’s Law
- Floating point instructions improved to run 2X, but only 10% of actual instructions are FP
$\text{ExTime}_{new} =$
$\text{Speedup}_{overall} =$
SLIDE 10 Amdahl’s Law
- Floating point instructions improved to run 2X, but only 10% of actual instructions are FP
$\text{ExTime}_{new} = \text{ExTime}_{old} \times (0.9 + 0.1/2) = 0.95 \times \text{ExTime}_{old}$
$\text{Speedup}_{overall} = \frac{1}{0.95} = 1.053$
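Amdahl's Law as stated on the previous slides fits in one small function. The sketch below reproduces the floating point example; the name `amdahl_speedup` and the argument checks are illustrative additions, not part of the lecture.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when `fraction_enhanced` of the original execution
    time is accelerated by a factor `speedup_enhanced`."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be in [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    # New time relative to the old time: unaffected part + accelerated part.
    relative_time = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1.0 / relative_time


# FP example: 10% of the time is FP, and FP runs 2x faster.
fp_speedup = amdahl_speedup(fraction_enhanced=0.10, speedup_enhanced=2.0)
print(f"Overall speedup: {fp_speedup:.3f}")  # prints "Overall speedup: 1.053"
```

Doubling the speed of something that accounts for 10% of the time yields only about a 5% overall gain.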
SLIDE 11
Amdahl's Law
- CPU speed improved 5x
- CPU cost increased 5x
- CPU is in use 50% of the time (the other 50% is I/O)
- CPU cost = 1/3 of total computer cost
Evaluate the investment from a cost/performance viewpoint:
$\text{Speedup} = \frac{1}{0.5 + \frac{0.5}{5}} = 1.67$
$\text{New cost} = \frac{2}{3} \times 1 + \frac{1}{3} \times 5 = 2.33$ times the original cost
Cost increase > performance improvement!
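The cost/performance comparison above can be reproduced numerically. This is a sketch under the slide's assumptions (the CPU is busy 50% of the time, the CPU is one third of total cost, and only the CPU changes); the variable names are illustrative.

```python
cpu_time_fraction = 0.5      # fraction of time the CPU is in use (rest is I/O)
cpu_speedup = 5.0            # the new CPU is 5x faster
cpu_cost_share = 1.0 / 3.0   # CPU share of total computer cost
cpu_cost_factor = 5.0        # the new CPU costs 5x as much

# Amdahl's Law: only the CPU-bound half of the time is accelerated.
speedup = 1.0 / ((1.0 - cpu_time_fraction) + cpu_time_fraction / cpu_speedup)

# Total cost relative to the original: unchanged part + scaled CPU part.
new_cost = (1.0 - cpu_cost_share) * 1.0 + cpu_cost_share * cpu_cost_factor

print(f"speedup  = {speedup:.2f}")
print(f"new cost = {new_cost:.2f} x original")
print("worth it" if speedup >= new_cost else "cost grows faster than performance")
```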
SLIDE 12 Amdahl's Law
- FPSQR ops. are responsible for 20% of execution time
- FP ops. are responsible for 50% of execution time
Alternative enhancements:
- 1. Implement FPSQR ops. in hardware, with a speedup of 10
- 2. Make ALL FP ops. run 2x faster, at the same cost as 1
Comparison:
$\text{Speedup}_{FPSQR} = \frac{1}{(1 - 0.2) + \frac{0.2}{10}} = 1.22$
$\text{Speedup}_{FP} = \frac{1}{(1 - 0.5) + \frac{0.5}{2}} = 1.33$
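The two alternatives can be compared side by side with the same formula. The sketch below uses the numbers from this slide; the dictionary keys and the helper name are illustrative labels.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of execution time is accelerated."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# (fraction of execution time affected, speedup of that fraction)
alternatives = {
    "FPSQR in hardware": (0.20, 10.0),
    "all FP ops 2x faster": (0.50, 2.0),
}

speedups = {name: amdahl_speedup(f, s) for name, (f, s) in alternatives.items()}
best = max(speedups, key=speedups.get)

for name, s in speedups.items():
    print(f"{name}: {s:.2f}")
print("better choice:", best)
```

The smaller local speedup wins because it applies to a larger share of the execution time.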
SLIDE 13 Corollary: Make The Common Case Fast
- All instructions require an instruction fetch; only a fraction require a data fetch/store.
– Optimize instruction access over data access
- Programs exhibit locality
– Spatial locality
– Temporal locality
- Access to small memories is faster
– Provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories.
Registers → Cache → Memory → Disk / Tape
SLIDE 14 Amdahl's Law
- Cache memory is 5x faster than main memory
- 90% of CPU time is spent in a fraction of the code that could be put in cache
What is the overall speedup using the cache?
$\text{Speedup} = \frac{1}{(1 - \text{\% time cache can be used}) + \frac{\text{\% time cache can be used}}{\text{Speedup using cache}}}$
$\text{Speedup} = \frac{1}{(1 - 0.9) + \frac{0.9}{5}} = 3.6$
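The cache example also shows the ceiling that Amdahl's Law imposes. The sketch below computes the speedup for the 5x cache and the limit reached if the cached fraction took no time at all; that limit, 1 / (1 - fraction), follows from the formula but is not stated on the slide.

```python
cache_fraction = 0.9   # fraction of time in which the cache can be used
cache_speedup = 5.0    # cache is 5x faster than main memory

speedup = 1.0 / ((1.0 - cache_fraction) + cache_fraction / cache_speedup)

# Upper bound: even an infinitely fast cache leaves the other 10% untouched.
upper_bound = 1.0 / (1.0 - cache_fraction)

print(f"speedup with 5x cache: {speedup:.2f}")    # 3.57, shown as 3.6 on the slide
print(f"limit for an ideal cache: {upper_bound:.2f}")
```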
SLIDE 15 Occam's Toothbrush
- The simple case is usually the most frequent and the easiest to optimize!
- Do simple, fast things in hardware and be sure the rest can be handled correctly in software
SLIDE 16 Metrics of Performance
Figure: levels of the system, from top to bottom:
- Application
- Programming Language
- Compiler
- ISA
- Datapath, Control
- Function Units
- Transistors, Wires, Pins
Metrics shown alongside the levels:
- Answers per month
- Operations per second
- (Millions) of instructions per second: MIPS
- (Millions) of (FP) operations per second: MFLOP/s
- Megabytes per second
- Cycles per second (clock rate)
SLIDE 17 Cycles Per Instruction
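$\text{CPU time} = \text{CK cycles for a program} \times T_{CK}$
Average cycles per instruction:
$\text{CPI} = \frac{\text{CK cycles for a program}}{\text{Instruction Count}} = \frac{\text{CPU time} \times \text{CK rate}}{\text{Instruction Count}}$
$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i$, where $F_i = \frac{I_i}{\text{Instruction Count}}$ is the instruction frequency
NB: $\text{CPI}_i$ should be measured and not just derived from the CPU reference manual (it must include cache misses, etc.)
$\text{CPU time} = IC \times \text{CPI} \times T_{CK} = IC \times T_{CK} \times \sum_{i=1}^{n} \text{CPI}_i \times F_i$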
SLIDE 18 Aspects of CPU Performance
$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$
              Instr. Cnt   CPI   Clock Rate
Program
Compiler
Organization
Technology
SLIDE 19 Aspects of CPU Performance
$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$
              Inst Count   CPI   Clock Rate
Program       X
Compiler      X            (X)
Instr. Set    X            X
Organization               X     X
Technology                       X
SLIDE 20 CPI Example
Base Machine (A): COMPARE + BRANCH are 2 separate instructions
New Machine (B): COMPARE + BRANCH are 1 integrated instruction ($T_{CK,B} = 1.25\, T_{CK,A}$)
Which machine is faster?
Machine A:
OP       Freq.   Cycles   CPI(i)
BRANCH   20%     2        0.4
COMP     20%     1        0.2
Others   60%     1        0.6
Total    100%             1.2
Machine B:
OP       Freq.   Cycles
BRANCH   ?       2
Others   ?       1
Total    100%
SLIDE 21 CPI Example
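$\text{CPU time}_A = IC_A \times 1.2 \times T_{CK,A}$
Machine B: $IC_B = IC_A - 20\%\, IC_A = 0.8\, IC_A$
$\text{Branch freq.} = \frac{20\%\, IC_A}{80\%\, IC_A} = 25\%$
OP       Freq.   Cycles   CPI(i)
BRANCH   25%     2        0.5
Others   75%     1        0.75
Total    100%             1.25
$\text{CPU time}_B = IC_B \times \text{CPI}_B \times 1.25 \times T_{CK,A} = 0.8\, IC_A \times 1.25 \times 1.25\, T_{CK,A} = 1.25 \times IC_A \times T_{CK,A}$
CPU A is faster than CPU B (1.2 < 1.25).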
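The comparison can be replayed in code by deriving machine B's instruction mix from machine A's. This is a sketch in normalized units (machine A's instruction count and clock period are both set to 1); the helper name `cpi` and the dictionary layout are illustrative.

```python
def cpi(mix):
    """Average CPI from {op: (frequency, cycles)}."""
    return sum(freq * cycles for freq, cycles in mix.values())


# Machine A: frequencies are fractions of A's instruction count.
mix_a = {"BRANCH": (0.20, 2), "COMP": (0.20, 1), "Others": (0.60, 1)}

# Machine B: every COMP is folded into its BRANCH, so COMP instructions vanish.
counts_b = {"BRANCH": 0.20, "Others": 0.60}   # still relative to A's count
ic_b = sum(counts_b.values())                 # 0.8 of A's instruction count
cycles_b = {"BRANCH": 2, "Others": 1}
mix_b = {op: (count / ic_b, cycles_b[op]) for op, count in counts_b.items()}

# CPU time = IC x CPI x T_CK, with IC_A = 1 and T_CK,A = 1.
time_a = 1.0 * cpi(mix_a) * 1.0
time_b = ic_b * cpi(mix_b) * 1.25

print(f"CPU time A = {time_a:.2f}, CPU time B = {time_b:.2f}")
print("A is faster" if time_a < time_b else "B is faster")
```

Fewer instructions do not help machine B here, because its higher CPI and slower clock more than cancel the 20% reduction in instruction count.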
SLIDE 22 Marketing Metrics
MIPS = Instruction Count / (Time * 10^6) = Clock Rate / (CPI * 10^6)
- Machines with different instruction sets ?
- Programs with different instruction mixes ?
– Dynamic frequency of instructions
- Uncorrelated with performance
MFLOP/s = FP Operations / (Time * 10^6)
- Machine dependent
- Often not where time is spent
Normalized:
– add, sub, compare, mult: 1
– divide, sqrt: 4
– exp, sin, ...: 8
SLIDE 23
Cycles Per Instruction
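“Average Cycles per Instruction”:
$\text{CPI} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Cycles}}{\text{Instruction Count}}$
$\text{CPU time} = \text{CycleTime} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$
“Instruction Frequency”:
$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i$, where $F_i = \frac{I_i}{\text{Instruction Count}}$
Invest resources where time is spent!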
SLIDE 24 Organizational Trade-offs
Figure: levels of the system, from top to bottom (Application; Programming Language; Compiler; ISA; Datapath, Control; Function Units; Transistors, Wires, Pins), annotated with Instruction Mix, CPI, and Cycle Time.
SLIDE 25 Example: Calculating CPI
Base Machine (Reg / Reg), Typical Mix
Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        .5       (33%)
Load     20%    2        .4       (27%)
Store    10%    2        .2       (13%)
Branch   20%    2        .4       (27%)
Total                    1.5
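The CPI and "% Time" columns of this table follow directly from the frequency and cycle columns. A minimal sketch (the variable names are illustrative):

```python
# {op: (frequency, cycles)} for the base Reg/Reg machine
base_mix = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}

# CPI = sum over instruction classes of CPI_i * F_i
cpi_total = sum(freq * cycles for freq, cycles in base_mix.values())

# Share of execution time per class = (CPI_i * F_i) / CPI
time_share = {op: freq * cycles / cpi_total for op, (freq, cycles) in base_mix.items()}

print(f"CPI = {cpi_total:.1f}")
for op, share in time_share.items():
    print(f"{op:<7}{100 * share:.0f}%")
```

ALU operations are half of the instructions but only a third of the time, which is why the "% Time" column, not the frequency column, shows where optimization pays off.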
SLIDE 26 Example
Base Machine (Reg / Reg), Typical Mix
Op       Freq   Cycles
ALU      50%    1
Load     20%    2
Store    10%    2
Branch   20%    2
Add register / memory operations:
– One source operand in memory
– One source operand in register
– Cycle count of 2
Branch cycle count to increase to 3.
What fraction of the loads (in the base machine) must be eliminated for this to pay off?
SLIDE 27
Example Solution
Exec Time = Instr Cnt x CPI x Clock
Op        Freq   Cycles   CPI(i)
ALU       .50    1        .5
Load      .20    2        .4
Store     .10    2        .2
Branch    .20    2        .4
Reg/Mem
Total     1.00            1.5
SLIDE 28
Example Solution
Exec Time = Instr Cnt x CPI x Clock
          Base machine             New machine
Op        Freq   Cycles   CPI(i)   Freq     Cycles   CPI(i)
ALU       .50    1        .5       .5 - X   1        .5 - X
Load      .20    2        .4       .2 - X   2        .4 - 2X
Store     .10    2        .2       .1       2        .2
Branch    .20    2        .4       .2       3        .6
Reg/Mem                            X        2        2X
Total     1.00            1.5      1 - X             (1.7 - X)/(1 - X)
CPINew must be normalized to the new instruction count:
$\text{CPI}_{New} = \frac{\text{Cycles}_{New}}{\text{Instructions}_{New}}$
SLIDE 29
Example Solution
Exec Time = Instr Cnt x CPI x Clock
          Base machine             New machine
Op        Freq   Cycles   CPI(i)   Freq     Cycles   CPI(i)
ALU       .50    1        .5       .5 - X   1        .5 - X
Load      .20    2        .4       .2 - X   2        .4 - 2X
Store     .10    2        .2       .1       2        .2
Branch    .20    2        .4       .2       3        .6
Reg/Mem                            X        2        2X
Total     1.00            1.5      1 - X             (1.7 - X)/(1 - X)
Instr CntOld x CPIOld x ClockOld = Instr CntNew x CPINew x ClockNew
1.00 x 1.5 = (1 - X) x (1.7 - X)/(1 - X)
SLIDE 30 Example Solution
Exec Time = Instr Cnt x CPI x Clock
          Base machine             New machine
Op        Freq   Cycles   CPI(i)   Freq     Cycles   CPI(i)
ALU       .50    1        .5       .5 - X   1        .5 - X
Load      .20    2        .4       .2 - X   2        .4 - 2X
Store     .10    2        .2       .1       2        .2
Branch    .20    2        .4       .2       3        .6
Reg/Mem                            X        2        2X
Total     1.00            1.5      1 - X             (1.7 - X)/(1 - X)
Instr CntOld x CPIOld x ClockOld = Instr CntNew x CPINew x ClockNew
1.00 x 1.5 = (1 - X) x (1.7 - X)/(1 - X)
1.5 = 1.7 - X
0.2 = X
ALL loads must be eliminated for this to be a win!
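The break-even value of X can be recomputed from the table. The sketch below assumes, as the slide does, that the clock is unchanged, so comparing total cycles per original instruction is enough. Exact fractions avoid floating point noise; the function names are illustrative.

```python
from fractions import Fraction as Fr

# Base machine: {op: (frequency, cycles)}
BASE_MIX = {
    "ALU": (Fr(50, 100), 1),
    "Load": (Fr(20, 100), 2),
    "Store": (Fr(10, 100), 2),
    "Branch": (Fr(20, 100), 2),
}


def base_cycles():
    """Cycles per instruction on the base machine (Instr Cnt = 1)."""
    return sum(freq * cycles for freq, cycles in BASE_MIX.values())


def new_cycles(x):
    """Total cycles, per original instruction, on the new machine when a
    fraction x of instructions (load + ALU pairs) becomes reg/mem ops."""
    return (
        (Fr(1, 2) - x) * 1    # remaining ALU ops
        + (Fr(1, 5) - x) * 2  # remaining loads
        + Fr(1, 10) * 2       # stores, unchanged
        + Fr(1, 5) * 3        # branches now take 3 cycles
        + x * 2               # new reg/mem ops
    )


# new_cycles is linear in x, so two evaluations are enough to solve for x.
slope = new_cycles(Fr(1)) - new_cycles(Fr(0))
break_even = (base_cycles() - new_cycles(Fr(0))) / slope

print("break-even X =", float(break_even))
print("fraction of loads eliminated =", float(break_even / BASE_MIX["Load"][0]))
```

Since loads are 20% of the base mix, X = 0.2 means every load would have to be folded into a reg/mem operation just to break even.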
SLIDE 31
CPU time including Cache
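$\text{CPU time} = (\text{CPU CK cycles} + \text{Memory stall cycles}) \times T_{CK}$
$\text{Memory stall cycles} = \text{number of misses} \times \text{miss penalty}$
$= IC \times \text{misses per instruction} \times \text{miss penalty}$
$= IC \times \text{mem. ref. per instr.} \times \text{miss rate} \times \text{miss penalty}$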
SLIDE 32
Example
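Base Machine (A): CPI = 2, all accesses are cache hits
Compare with a new machine B:
- 40% of instructions are Load/Store ops.
- 2% miss rate
- miss penalty: 25 cycles
$\text{CPU time}_A = (\text{CPU CK cycles} + \text{Memory stall cycles}) \times T_{CK} = IC \times (\text{CPI} + 0) \times T_{CK} = 2 \times IC \times T_{CK}$
$\text{Memory stall cycles} = IC \times \text{mem. ref. per instr.} \times \text{miss rate} \times \text{miss penalty} = IC \times (1 + 0.4) \times 0.02 \times 25 = IC \times 0.7$
$\text{CPU time}_B = (IC \times 2 + IC \times 0.7) \times T_{CK} = 2.7 \times IC \times T_{CK}$
$\text{CPU time}_B / \text{CPU time}_A = 1.35$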
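The same calculation can be expressed per instruction, so that IC and the clock period cancel in the ratio. This is a sketch; `effective_cpi` is an illustrative name, and machine A is modelled as machine B with a miss rate of zero.

```python
def effective_cpi(cpi_hit, mem_refs_per_instr, miss_rate, miss_penalty):
    """CPI including memory stall cycles per instruction."""
    stall_cycles = mem_refs_per_instr * miss_rate * miss_penalty
    return cpi_hit + stall_cycles


# Every instruction is fetched (1 reference) and 40% also access data.
mem_refs = 1.0 + 0.4

cpi_a = effective_cpi(2.0, mem_refs, miss_rate=0.0, miss_penalty=25)   # all hits
cpi_b = effective_cpi(2.0, mem_refs, miss_rate=0.02, miss_penalty=25)

# IC and T_CK are identical on both machines, so the time ratio is the CPI ratio.
ratio = cpi_b / cpi_a
print(f"CPI A = {cpi_a:.1f}, CPI B = {cpi_b:.1f}, time ratio B/A = {ratio:.2f}")
```

A 2% miss rate looks small, yet with a 25-cycle penalty it adds 0.7 cycles to every instruction, a 35% slowdown.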