Lecture 2: Architectural Performance Laws and Rules of Thumb Prof. - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Lecture 2: Architectural Performance Laws and Rules of Thumb

  • Prof. V. Catania
  • Lab. Calcolatori
slide-2
SLIDE 2

Measurement Tools

  • Benchmarks, Traces, Mixes
  • Cost, delay, area, power estimation
  • Simulation (many levels)

– ISA, RT, Gate, Circuit

  • Queuing Theory
  • Rules of Thumb
  • Fundamental Laws
slide-3
SLIDE 3

The Bottom Line: Performance (and Cost)

"X is n times faster than Y" means:

$$\frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = n$$

  • Speed of Concorde vs. Boeing 747
  • Throughput of Boeing 747 vs. Concorde
slide-4
SLIDE 4

Performance Terminology

“X is n% faster than Y” means:

$$\frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = 1 + \frac{n}{100}$$

$$n = \frac{100\,(\text{Performance}(X) - \text{Performance}(Y))}{\text{Performance}(Y)}$$

Example: Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X?

slide-5
SLIDE 5

Example

$$\frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{15}{10} = \frac{1.5}{1.0} = \frac{\text{Performance}(X)}{\text{Performance}(Y)}$$

$$n = \frac{100\,(1.5 - 1.0)}{1.0} = 50\%$$
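The same arithmetic can be sketched in a few lines of Python. The function name `percent_faster` is an illustrative choice, not something taken from the lecture; it simply rearranges ExTime(Y)/ExTime(X) = 1 + n/100 for n.

```python
def percent_faster(extime_y: float, extime_x: float) -> float:
    """Return n such that X is n% faster than Y, given both execution times."""
    if extime_x <= 0 or extime_y <= 0:
        raise ValueError("execution times must be positive")
    ratio = extime_y / extime_x   # equals Performance(X) / Performance(Y)
    return 100.0 * (ratio - 1.0)  # rearranged from ratio = 1 + n/100


# Y takes 15 seconds, X takes 10 seconds.
print(percent_faster(15.0, 10.0))
```

With the 15 s and 10 s figures of the example this gives n = 50.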

slide-6
SLIDE 6

Amdahl's Law

MAKE THE COMMON CASE FAST!

The performance improvement that can be gained by making some activity faster is limited by the fraction of time in which that activity takes place.

SPEEDUP: a measure of how much faster a task runs on the ENHANCED machine.

slide-7
SLIDE 7

Amdahl's Law

Speedup due to enhancement E:

$$\text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E}$$

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected, then:

ExTime(E) = ?

Speedup(E) = ?

slide-8
SLIDE 8

Amdahl’s Law

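$$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right]$$

$$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$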
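As a minimal sketch, the overall-speedup formula translates directly into a function. The name `amdahl_speedup` and the sample values in the loop are assumptions made for illustration only.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction of the execution time is accelerated."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must lie in [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    # Normalized new execution time: untouched part plus accelerated part.
    new_time = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1.0 / new_time


# Illustrative sweep: accelerating 10% of the time by ever larger factors.
for s in (2, 10, 1000):
    print(s, round(amdahl_speedup(0.10, s), 3))
```

However large the enhanced speedup becomes, the result stays below 1 / (1 - 0.10), about 1.11, which is the point of the law.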

slide-9
SLIDE 9

Amdahl’s Law

  • Floating point instructions improved to run 2X; but only 10% of actual instructions are FP

ExTime_new = ?

Speedup_overall = ?

slide-10
SLIDE 10

Amdahl’s Law

  • Floating point instructions improved to run 2X; but only 10% of actual instructions are FP

$$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left( 0.9 + \frac{0.1}{2} \right) = 0.95 \times \text{ExTime}_{old}$$

$$\text{Speedup}_{overall} = \frac{1}{0.95} = 1.053$$

slide-11
SLIDE 11

Amdahl's Law

  • CPU speed improved x5
  • CPU cost increased x5
  • CPU in use 50% of the time (the other 50% is I/O)
  • CPU cost = 1/3 of total computer cost

Evaluate the investment from the cost/performance viewpoint.

$$\text{Speedup} = \frac{1}{0.5 + \dfrac{0.5}{5}} = 1.67$$

$$\text{New cost} = \frac{2}{3} \times 1 + \frac{1}{3} \times 5 = 2.33 \text{ times the original cost}$$

Cost increase > performance improvement!
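The cost/performance comparison can be checked with a short sketch. The helper names `amdahl_speedup` and `relative_cost` are hypothetical; the numbers are the ones stated on the slide.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup from Amdahl's Law."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def relative_cost(component_share: float, component_cost_factor: float) -> float:
    """New system cost relative to the old one when one component changes price."""
    return (1.0 - component_share) + component_share * component_cost_factor


speedup = amdahl_speedup(0.5, 5.0)  # CPU busy 50% of the time, made 5x faster
cost = relative_cost(1.0 / 3.0, 5.0)  # CPU is 1/3 of the cost, now 5x dearer

print(f"speedup = {speedup:.2f}, cost = {cost:.2f}")
print(f"performance per unit cost vs. original = {speedup / cost:.2f}")
```

A ratio below 1 in the last line means the enhanced machine delivers less performance per unit of cost than the original.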

slide-12
SLIDE 12

Amdahl's Law

  • FPSQR ops. are responsible for 20% of execution time
  • FP ops. are responsible for 50% of execution time

Alternative enhancements:

  • 1. Implement FPSQR ops. in hardware, with a speedup of 10
  • 2. Make ALL FP ops. run 2x faster, at the same cost as alternative 1

Comparison:

$$\text{Speedup}_{FPSQR} = \frac{1}{(1 - 0.2) + \dfrac{0.2}{10}} = 1.22$$

$$\text{Speedup}_{FP} = \frac{1}{(1 - 0.5) + \dfrac{0.5}{2}} = 1.33$$

slide-13
SLIDE 13

Corollary: Make The Common Case Fast

  • All instructions require an instruction fetch, only a fraction require a data fetch/store.

– Optimize instruction access over data access

  • Programs exhibit locality

– Spatial Locality
– Temporal Locality

  • Access to small memories is faster

– Provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories: Registers, Cache, Memory, Disk / Tape

slide-14
SLIDE 14

Amdahl's Law

  • Cache memory is 5x faster than main memory
  • 90% of CPU time is spent in a fraction of the code which could be put in cache

What is the overall speedup using the cache?

$$\text{Speedup} = \frac{1}{(1 - \text{\% time cache can be used}) + \dfrac{\text{\% time cache can be used}}{\text{Speedup using cache}}}$$

$$\text{Speedup} = \frac{1}{(1 - 0.9) + \dfrac{0.9}{5}} = 3.57 \approx 3.6$$

slide-15
SLIDE 15

Occam's Toothbrush

  • The simple case is usually the most frequent and the easiest to optimize!
  • Do simple, fast things in hardware and be sure the rest can be handled correctly in software

slide-16
SLIDE 16

Metrics of Performance

[Figure: the levels of a system (Application, Programming Language, Compiler, ISA, Datapath, Control, Function Units, Transistors, Wires, Pins), annotated with these metrics:]

  • (millions) of Instructions per second: MIPS
  • (millions) of (FP) operations per second: MFLOP/s
  • Cycles per second (clock rate)
  • Megabytes per second
  • Answers per month
  • Operations per second

slide-17
SLIDE 17

Cycles Per Instruction

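$$\text{CPU time} = \text{CK cycles for a program} \times T_{CK}$$

Average Cycles per Instruction:

$$CPI = \frac{\text{CK cycles for a program}}{\text{Instruction Count}} = \frac{\text{CPU time} \times \text{CK rate}}{\text{Instruction Count}}$$

$$CPI = \sum_{i=1}^{n} CPI_i \times F_i \quad \text{where} \quad F_i = \frac{I_i}{\text{Instruction Count}} \quad \text{(Instruction Frequency)}$$

NB: CPI_i should be measured and not just derived from the CPU reference manual (it must include cache misses, etc.)

$$\text{CPU time} = IC \times CPI \times T_{CK} = IC \times T_{CK} \times \sum_{i=1}^{n} CPI_i \times F_i$$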

slide-18
SLIDE 18

Aspects of CPU Performance

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

                Instr. Cnt   CPI   Clock Rate
Program
Compiler
Instr. Set
Organization
Technology

slide-19
SLIDE 19

Aspects of CPU Performance

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

                Inst Count   CPI   Clock Rate
Program         X
Compiler        X            (X)
Inst. Set.      X            X
Organization                 X     X
Technology                         X

slide-20
SLIDE 20

CPI Example

Base Machine (A): COMPARE + BRANCH are 2 separate instructions

New Machine (B): COMPARE + BRANCH are 1 integrated instruction (T_CK,B = 1.25 T_CK,A)

Which machine is faster?

Machine A:

OP        Freq.   Cycles   CPI(i)
BRANCH    20%     2        0.4
COMP      20%     1        0.2
Others    60%     1        0.6
Total     100%             1.2

Machine B:

OP        Freq.   Cycles
BRANCH    ?       2
Others    ?       1
Total     100%

slide-21
SLIDE 21

CPI Example

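$$\text{CPU time}_A = IC_A \times 1.2 \times T_{CK,A}$$

Machine B:

$$IC_B = IC_A - 20\%\,IC_A = 0.8\,IC_A$$

$$\text{Branch freq.} = \frac{20\%\,IC_A}{80\%\,IC_A} = 25\%$$

OP        Freq.   Cycles   CPI(i)
BRANCH    25%     2        0.5
Others    75%     1        0.75
Total     100%             1.25

$$\text{CPU time}_B = IC_B \times CPI_B \times 1.25 \times T_{CK,A} = 0.8\,IC_A \times 1.25 \times 1.25\,T_{CK,A} = 1.25 \times IC_A \times T_{CK,A}$$

CPU A is FASTER than CPU B (1.2 < 1.25, in units of IC_A × T_CK,A)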
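The comparison between the two machines can be reproduced with a short sketch. The function `weighted_cpi` and the dictionary layout are illustrative assumptions; times are expressed relative to IC_A × T_CK,A, so machine A's instruction count and clock period are both taken as 1.

```python
def weighted_cpi(mix: dict) -> float:
    """mix maps an op class to (frequency, cycles); frequencies must sum to 1."""
    total = sum(freq for freq, _ in mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("instruction frequencies must sum to 1")
    return sum(freq * cycles for freq, cycles in mix.values())


# Machine A: compare and branch are separate instructions.
mix_a = {"branch": (0.20, 2), "compare": (0.20, 1), "others": (0.60, 1)}

# Machine B: the compares disappear, so B executes 80% of A's instructions
# and the remaining classes are renormalized over that smaller count.
ic_ratio_b = 1.0 - 0.20
mix_b = {"branch": (0.20 / ic_ratio_b, 2), "others": (0.60 / ic_ratio_b, 1)}
clock_ratio_b = 1.25  # B's clock period is 25% longer than A's

time_a = 1.0 * weighted_cpi(mix_a) * 1.0
time_b = ic_ratio_b * weighted_cpi(mix_b) * clock_ratio_b

print(f"time A = {time_a:.3f}, time B = {time_b:.3f}, B/A = {time_b / time_a:.3f}")
```

Fewer instructions do not help machine B here, because both its CPI and its clock period go up.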

slide-22
SLIDE 22

Marketing Metrics

$$\text{MIPS} = \frac{\text{Instruction Count}}{\text{Time} \times 10^6} = \frac{\text{Clock Rate}}{CPI \times 10^6}$$

  • Machines with different instruction sets ?
  • Programs with different instruction mixes ?

– Dynamic frequency of instructions

  • Uncorrelated with performance

$$\text{MFLOP/s} = \frac{\text{FP Operations}}{\text{Time} \times 10^6}$$

  • Machine dependent
  • Often not where time is spent

Normalized FP operation weights:

Operation                  Weight
add, sub, compare, mult    1
divide, sqrt               4
exp, sin, . . .            8
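A small sketch shows how the normalization weights change the reported figure. The function `mflops`, the weight table keys, and the operation counts are all hypothetical; only the weights 1, 4 and 8 come from the slide, and the operations the slide elides after "exp, sin" are deliberately not filled in.

```python
# Weights from the slide; operations elided on the slide are left out.
NORMALIZED_WEIGHTS = {
    "add": 1, "sub": 1, "compare": 1, "mult": 1,
    "divide": 4, "sqrt": 4,
    "exp": 8, "sin": 8,
}


def mflops(op_counts: dict, seconds: float, normalized: bool = False) -> float:
    """MFLOP/s = FP operations / (time * 10^6), optionally with normalized weights."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    if normalized:
        total = sum(NORMALIZED_WEIGHTS[op] * n for op, n in op_counts.items())
    else:
        total = sum(op_counts.values())
    return total / (seconds * 1e6)


# Hypothetical dynamic FP operation counts for a run lasting 2 seconds.
counts = {"add": 4_000_000, "mult": 3_000_000, "divide": 500_000, "sqrt": 250_000}

print(mflops(counts, 2.0))                   # native MFLOP/s
print(mflops(counts, 2.0, normalized=True))  # normalized MFLOP/s
```

The two numbers differ for the same program and the same run time, which illustrates why the metric depends on how operations are counted.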

slide-23
SLIDE 23

Cycles Per Instruction

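$$\text{CPU time} = \text{CycleTime} \times \sum_{i=1}^{n} CPI_i \times I_i$$

$$CPI = \sum_{i=1}^{n} CPI_i \times F_i \quad \text{where} \quad F_i = \frac{I_i}{\text{Instruction Count}}$$

F_i is the “Instruction Frequency”.

Invest Resources where time is Spent!

$$CPI = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Cycles}}{\text{Instruction Count}}$$

CPI is the “Average Cycles per Instruction”.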

slide-24
SLIDE 24

Organizational Trade-offs

[Figure: the levels of a system (Application, Programming Language, Compiler, ISA, Datapath, Control, Function Units, Transistors, Wires, Pins), annotated with the quantities being traded off: Instruction Mix, CPI, Cycle Time]

slide-25
SLIDE 25

Example: Calculating CPI

Typical Mix, Base Machine (Reg / Reg)

Op        Freq   Cycles   CPI(i)   (% Time)
ALU       50%    1        .5       (33%)
Load      20%    2        .4       (27%)
Store     10%    2        .2       (13%)
Branch    20%    2        .4       (27%)
Total                     1.5
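The table can be recomputed with a short sketch that derives both the average CPI and the share of time spent in each instruction class. The function name `cpi_breakdown` is an assumption for illustration; the mix is the one in the table.

```python
def cpi_breakdown(mix: dict):
    """Return (average CPI, share of execution time per op class)."""
    # Each class contributes frequency * cycles to the average CPI.
    contributions = {op: freq * cycles for op, (freq, cycles) in mix.items()}
    cpi = sum(contributions.values())
    # A class's share of time is its contribution divided by the total CPI.
    shares = {op: c / cpi for op, c in contributions.items()}
    return cpi, shares


base_mix = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}

cpi, shares = cpi_breakdown(base_mix)
print(f"CPI = {cpi:.2f}")
for op, share in shares.items():
    print(f"{op:<7}{share:.0%}")
```

The time shares, not the raw frequencies, indicate where optimization effort pays off: ALU operations are half of the instructions but only about a third of the time.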

slide-26
SLIDE 26

Example

Typical Mix, Base Machine (Reg / Reg)

Op        Freq   Cycles
ALU       50%    1
Load      20%    2
Store     10%    2
Branch    20%    2

Add register / memory operations:

– One source operand in memory
– One source operand in register
– Cycle count of 2

The branch cycle count increases to 3. What fraction of the loads (in the base machine) must be eliminated for this to pay off?

slide-27
SLIDE 27

Example Solution

Exec Time = Instr Cnt x CPI x Clock

Op        Freq   Cycles   CPI(i)
ALU       .50    1        .5
Load      .20    2        .4
Store     .10    2        .2
Branch    .20    2        .4
Reg/Mem
Total     1.00            1.5

slide-28
SLIDE 28

Example Solution

Exec Time = Instr Cnt x CPI x Clock

          Old                        New
Op        Freq   Cycles   CPI(i)     Freq      Cycles   CPI(i)
ALU       .50    1        .5         .5 - X    1        .5 - X
Load      .20    2        .4         .2 - X    2        .4 - 2X
Store     .10    2        .2         .1        2        .2
Branch    .20    2        .4         .2        3        .6
Reg/Mem                              X         2        2X
Total     1.00            1.5        1 - X              (1.7 - X)/(1 - X)

CPI_New must be normalized to the new instruction frequency:

$$CPI_{New} = \frac{\text{Cycles}_{New}}{\text{Instructions}_{New}}$$

slide-29
SLIDE 29

Example Solution

Exec Time = Instr Cnt x CPI x Clock

          Old                        New
Op        Freq   Cycles   CPI(i)     Freq      Cycles   CPI(i)
ALU       .50    1        .5         .5 - X    1        .5 - X
Load      .20    2        .4         .2 - X    2        .4 - 2X
Store     .10    2        .2         .1        2        .2
Branch    .20    2        .4         .2        3        .6
Reg/Mem                              X         2        2X
Total     1.00            1.5        1 - X              (1.7 - X)/(1 - X)

$$\text{Instr Cnt}_{Old} \times CPI_{Old} \times \text{Clock}_{Old} = \text{Instr Cnt}_{New} \times CPI_{New} \times \text{Clock}_{New}$$

$$1.00 \times 1.5 = (1 - X) \times \frac{1.7 - X}{1 - X}$$

slide-30
SLIDE 30

Example Solution

Exec Time = Instr Cnt x CPI x Clock

          Old                        New
Op        Freq   Cycles   CPI(i)     Freq      Cycles   CPI(i)
ALU       .50    1        .5         .5 - X    1        .5 - X
Load      .20    2        .4         .2 - X    2        .4 - 2X
Store     .10    2        .2         .1        2        .2
Branch    .20    2        .4         .2        3        .6
Reg/Mem                              X         2        2X
Total     1.00            1.5        1 - X              (1.7 - X)/(1 - X)

$$\text{Instr Cnt}_{Old} \times CPI_{Old} \times \text{Clock}_{Old} = \text{Instr Cnt}_{New} \times CPI_{New} \times \text{Clock}_{New}$$

$$1.00 \times 1.5 = (1 - X) \times \frac{1.7 - X}{1 - X}$$

$$1.5 = 1.7 - X$$

$$X = 0.2$$

ALL loads must be eliminated for this to be a win!
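The break-even point can also be found numerically. This is a sketch under the same assumptions as the slide: each register/memory operation replaces one load and one ALU operation, and the clock is unchanged. The function names are illustrative.

```python
def new_cycles_per_old_instruction(x: float) -> float:
    """Total cycles of the new machine, per instruction of the base machine.

    x is the fraction of base-machine instructions turned into reg/mem ops.
    """
    alu = (0.50 - x) * 1      # one ALU op absorbed per reg/mem op
    load = (0.20 - x) * 2     # one load absorbed per reg/mem op
    store = 0.10 * 2
    branch = 0.20 * 3         # branches now take 3 cycles
    reg_mem = x * 2
    return alu + load + store + branch + reg_mem


OLD_CYCLES_PER_INSTRUCTION = 1.5  # base machine CPI, clock assumed unchanged


def break_even_fraction() -> float:
    """Solve new_cycles(x) = old cycles; the cost is linear in x."""
    c0 = new_cycles_per_old_instruction(0.0)
    slope = new_cycles_per_old_instruction(1.0) - c0
    return (OLD_CYCLES_PER_INSTRUCTION - c0) / slope


x = break_even_fraction()
print(f"break-even X = {x:.3f}, i.e. {x / 0.20:.0%} of the loads")
```

Since loads are 20% of the base mix, a break-even value of X equal to the whole load frequency means the change only ties the base machine even in the best case.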

slide-31
SLIDE 31

CPU time including Cache

CPUTIME=CPUCK cyclesMemory Stall Cycles × TCK Memory Stall Cycles =N ° misses×miss penalty = =I C×miss per instruction ×miss penalty =

=I C× mem. ref . per instr .× miss rate × miss penalty

slide-32
SLIDE 32

Example

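Base Machine (A): CPI = 2, every access is a cache hit (no memory stalls)

Compare with a new machine B, with the same CPI = 2 for cache hits and:

  • 40% of instructions are Load/Store ops.
  • 2% miss rate
  • miss penalty: 25 cycles

$$\text{CPU time}_A = (\text{CPU CK cycles} + \text{Memory Stall Cycles}) \times T_{CK} = IC \times (CPI + 0) \times T_{CK} = 2 \times IC \times T_{CK}$$

For machine B, each instruction makes 1 instruction fetch plus 0.4 data references:

$$\text{Memory Stall Cycles} = IC \times \text{mem. ref. per instr.} \times \text{miss rate} \times \text{miss penalty} = IC \times (1 + 0.4) \times 0.02 \times 25 = IC \times 0.7$$

$$\text{CPU time}_B = (IC \times 2 + IC \times 0.7) \times T_{CK} = 2.7 \times IC \times T_{CK}$$

$$\frac{\text{CPU time}_B}{\text{CPU time}_A} = 1.35$$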
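The stall-cycle calculation can be sketched as follows. The function names are hypothetical, and the model assumes, as the example does, one instruction fetch per instruction plus the stated fraction of data references, all sharing one miss rate and one miss penalty.

```python
def stall_cycles_per_instruction(data_refs_per_instr: float,
                                 miss_rate: float,
                                 miss_penalty: float) -> float:
    """Memory stall cycles per instruction = refs/instr * miss rate * penalty."""
    mem_refs_per_instr = 1.0 + data_refs_per_instr  # instruction fetch + data
    return mem_refs_per_instr * miss_rate * miss_penalty


def effective_cpi(cpi_hit: float, data_refs_per_instr: float,
                  miss_rate: float, miss_penalty: float) -> float:
    """CPI including memory stalls."""
    return cpi_hit + stall_cycles_per_instruction(
        data_refs_per_instr, miss_rate, miss_penalty)


cpi_a = 2.0                               # machine A: every access hits
cpi_b = effective_cpi(2.0, 0.4, 0.02, 25)  # machine B: 2% misses, 25 cycles each

# IC and T_CK are the same on both machines, so the CPI ratio is the time ratio.
print(f"CPI_A = {cpi_a:.2f}, CPI_B = {cpi_b:.2f}, time B / time A = {cpi_b / cpi_a:.2f}")
```

Even a 2% miss rate adds 0.7 cycles to every instruction on average, which is why measured CPI values must include cache behaviour.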