Measuring and Reasoning About Performance
Readings: 1.4-1.5
1
Goals for this Class
Understand how CPUs run programs
How do we express the computation to the CPU?
How does the CPU execute it?
How does the CPU support other
2
work?
performance?
running it?
3
s
Upgradability
consumption
Facebook likes
performance
Res?
4
efficiency
throughput
capacity
capacity
keyboard
capacity
interface
compatibility
5
6
better)
possible
clock cycles, etc.
better)
time as possible
instructions/s, instructions/cycle
money as possible
dissipating as few joules/ sec as possible
sec)
joules as possible
instruction, Joules/execution
probability of failure
failure” MTTF -- the average time until a failure occurs.
7
architecture
8
9
is-better metrics, “improved” means “decrease”. Likewise, for “worsened,” “was degraded,” etc.
10
comparison of two systems without reference to an absolute unit
without knowing anything about a concrete latency.
1,254 seconds, doubling the clock rate would reduce the latency to 627 seconds.”
11
smaller-is-better
bigger-is-better
12
13
designs are equally good
half
consumption by 33%
14
good
reducing ED2. By what factor can delay increase?
15
systems
bandwidth)
16
Medium | Description | Cargo | Speed | Latency (s) | BW (GB/s) | Tb-m/s
Fiber-optic cable | State of the art networking medium (sent 585 GB) | | | 1800 | 1.13 | 272,400
Subaru Outback | Sensible station wagon | 183 kg | 119 MPH | 563,984 | 0.0014 | 344,690
B1-B | Supersonic bomber | 25,515 kg | 950 MPH | 70,646 | 1.6 | 382,409,815
Hellespont Alhambra | World’s largest supertanker | 400,975,655 kg | 18.9 MPH | 1,587,301 | 1114.5 | 267,000,000,000
18
representative of a class
available online)
(SPECFP)
benchmark suite.
in the applications in the suite, they are flawed
be selected for all kinds of reasons.
comparisons possible, benchmarks usually are;
conditions
these things.
19
applications.
20
21
Application | Language | Description
400.perlbench | C | PERL Programming Language
401.bzip2 | C | Compression
403.gcc | C | C Compiler
429.mcf | C | Combinatorial Optimization
445.gobmk | C | AI: go
456.hmmer | C | Search Gene Sequence
458.sjeng | C | AI: chess
462.libquantum | C | Quantum Computing
464.h264ref | C | Video Compression
471.omnetpp | C++ | Discrete Event Simulation
473.astar | C++ | Path-finding Algorithms
483.xalancbmk | C++ | XML Processing
22
[Figure: Relative Performance (log scale, 1 to 100,000) vs. Year (1990 to 2015); series: specINT95 Perf, specINT2000 Perf, specINT2006 Perf]
23
24
25
26
27
28
29
work?
performance?
running it?
30
31
CPU executes
execution to the number of instructions executed.
32
Latency = Instruction Count * Cycles/Instruction * Seconds/Cycle
L = IC * CPI * CT
model
performance
net win.
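The performance equation above can be evaluated directly. The following is a minimal sketch: the function name `latency_seconds` and the sample workload (one billion instructions, a CPI of 1.5, a 2 GHz clock) are illustrative assumptions, not figures from the slides.

```python
def latency_seconds(instruction_count, cpi, cycle_time_s):
    """Performance equation: L = IC * CPI * CT."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical workload (assumed values, for illustration only).
ic = 1_000_000_000            # dynamic instruction count
cpi = 1.5                     # average cycles per instruction
clock_hz = 2_000_000_000      # 2 GHz clock
cycle_time = 1.0 / clock_hz   # seconds per cycle

example_latency = latency_seconds(ic, cpi, cycle_time)
print(f"{example_latency:.3f} s")
```

Each term is a separate lever: halving any one of IC, CPI, or CT halves the latency, provided the other two do not change as a side effect.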
33
Latency = Instructions * Cycles/Instruction * Seconds/Cycle
time will be shorter.
technology, its clock speed will go up.
problems.
run.
34
35
Latency = Instructions * Cycles/Instruction * Seconds/Cycle
Latency = (Instructions * Cycles/Instruction)/(Clock speed in Hz)
counted at run time
dynamic instructions.
it was compiled
10 static instructions.
36
computation
sort)
instructions, the PE predicts it will be faster
same
1/(1-0.01*x) times
37
38
int i, sum = 0;
for(i=0;i<10;i++)
    sum += i;

        sw   0($sp), $zero   # sum = 0
        sw   4($sp), $zero   # i = 0
loop:   lw   $s1, 4($sp)
        nop
        sub  $s3, $s1, 10
        beq  $s3, $s0, end
        lw   $s2, 0($sp)
        nop
        add  $s2, $s2, $s1
        st   0($sp), $s2
        addi $s1, $s1, 1
        b    loop
        st   4($sp), $s1     # br delay
end:
file: cpi-noopt.s
int i, sum = 0;
for(i=0;i<10;i++)
    sum += i;

loop:   sub  $t3, $t1, 10
        beq  $t3, $t0, end
        nop
        add  $t2, $t2, $t1
        b    loop
        addi $t1, $t1, 1
end:    sw   $t2, 0($sp)
file: cpi-opt.s
depending on its input
program
instructions on different ISAs
computer systems, they must be doing the same work.
because they use different ISAs or different compilers)
40
many aspects of processor design impact it
instruction
program and the IC for that program. It is an average.
intuitive, because it emphasizes that it is an average.
41
Spec FP 2006: Integer 19.90%, Floating Point 37.40%, Branch 4.40%, Memory 35.60%
Spec INT 2006: Integer 49.10%, Branch 18.80%, Memory 31.90%
uses.
programs execute (their instruction mix) varies.
42
Spec INT and Spec FP are popular benchmark suites
impacts CPI because some instructions require extra cycles to execute
the ISA.
43
Instruction Type | Cycles
Integer +, -, |, &, branches | 1
Integer multiply | 3-5
Integer divide | 11-100
Floating point +, -, *, etc. | 3-5
Floating point /, sqrt | 7-27
Loads and Stores | 1-100s
These values are for Intel’s Nehalem processor
44
int i, sum = 0;
for(i=0;i<10;i++)
    sum += i;

        sw   0($sp), $zero   # sum = 0
        sw   4($sp), $zero   # i = 0
loop:   lw   $s1, 4($sp)
        nop
        sub  $s3, $s1, 10
        beq  $s3, $s0, end
        lw   $s2, 0($sp)
        nop
        add  $s2, $s2, $s1
        st   0($sp), $s2
        addi $s1, $s1, 1
        b    loop
        st   4($sp), $s1     # br delay
end:
file: cpi-noopt.s

Type    CPI    Static #   Dyn #
mem     5      6          42
int     1      5          50
br      1      2          20
Total   2.5    13         112
Average CPI: (5*42 + 1*50 + 1*20)/112 = 2.5
int i, sum = 0;
for(i=0;i<10;i++)
    sum += i;

loop:   sub  $t3, $t1, 10
        beq  $t3, $t0, end
        nop
        add  $t2, $t2, $t1
        b    loop
        addi $t1, $t1, 1
end:    sw   $t2, 0($sp)
file: cpi-opt.s

Type    CPI    Static #   Dyn #
mem     5      1          1
int     1      6          42
br      1      2          20
Total   1.06   9          63
Average CPI: (5*1 + 1*42 + 1*20)/63 = 1.06
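The two "Average CPI" lines are weighted averages: each instruction type's CPI is weighted by its dynamic count. A short sketch of that calculation follows, using the per-type CPIs and dynamic counts from the two tables; the function name `average_cpi` and the dictionary layout are assumptions made for illustration.

```python
def average_cpi(mix):
    """Weighted-average CPI.

    `mix` maps an instruction type to a (cycles per instruction,
    dynamic instruction count) pair.
    """
    total_cycles = sum(cpi * count for cpi, count in mix.values())
    total_insts = sum(count for _, count in mix.values())
    return total_cycles / total_insts


# Per-type CPI and dynamic counts taken from the tables above.
unoptimized = {"mem": (5, 42), "int": (1, 50), "br": (1, 20)}
optimized = {"mem": (5, 1), "int": (1, 42), "br": (1, 20)}

cpi_noopt = average_cpi(unoptimized)  # 280 cycles / 112 instructions
cpi_opt = average_cpi(optimized)      # 67 cycles / 63 instructions
print(f"{cpi_noopt:.2f} {cpi_opt:.2f}")
```

Most of the CPI difference comes from the memory row: the optimized loop keeps `sum` and `i` in registers, so the 5-cycle memory operations almost disappear from the dynamic mix.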
46
Unoptimized Code (UC) IC: 112 CPI: 2.5
Optimized Code (OC) IC: 63 CPI: 1.06
L_UC = IC_UC * CPI_UC * CT_UC = 112 * 2.5 * CT_UC
L_OC = IC_OC * CPI_OC * CT_OC = 63 * 1.06 * CT_OC
Speedup = L_UC / L_OC = (112 * 2.5 * CT_UC) / (63 * 1.06 * CT_OC) = (112/63) * (2.5/1.06) = 4.19x
Since hardware is unchanged, CT is the same and cancels.
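The speedup calculation on this slide can be checked numerically. The sketch below uses the IC and CPI values from the slide; the function name `speedup` and the 1 ns cycle time in the second call are assumptions, included only to show that an identical CT on both sides does not change the ratio.

```python
def speedup(ic_old, cpi_old, ic_new, cpi_new, ct_old=1.0, ct_new=1.0):
    """Ratio of old latency to new latency, with L = IC * CPI * CT."""
    old_latency = ic_old * cpi_old * ct_old
    new_latency = ic_new * cpi_new * ct_new
    return old_latency / new_latency


# IC and CPI from the slide; CT left at its default on both sides.
s_default = speedup(112, 2.5, 63, 1.06)

# Hypothetical 1 ns cycle time on both sides: the ratio is the same
# (up to floating-point rounding) because CT cancels.
s_1ns = speedup(112, 2.5, 63, 1.06, ct_old=1e-9, ct_new=1e-9)

print(f"{s_default:.2f}x {s_1ns:.2f}x")
```

Both calls evaluate to roughly 4.19, matching the slide.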
count) of the program.
47
relevant
measured in cycles: Instructions * Cycles/Instruction == Cycles.
be measured in Instructions/Second: 1/(Cycles/Instruction * Seconds/Cycle)
to CPI (smaller-is-better). Alternately, performance is equivalent to Instructions per Cycle (IPC; bigger-is-better).
48
Latency = Instructions * Cycles/Instruction * Seconds/Cycle
You can only ignore terms in the PE if they are identical across the two systems.
instruction mix.
49
application, but CPIguess is an estimate.
resulting L is also an estimate. IC may not be an estimate.
50
insts/sec)”
capable of 10 GOPS under perfect conditions
standard benchmark
benchmark?
51
Science (AICS) (Japan)
workload.
52
53
valuable it is
the impact of an optimization.
–Speeds up JPEG decode by 10x!!!
–Act now! While Supplies Last!
purpose whatsoever. It may not even build. It may cause fatigue, blindness, lethargy, malaise, and irritability. Debugging may be hazardous. It will almost certainly cause ennui. Do not taunt SuperJPEG-O-Rama. Will not, on grounds of principle, decode images of Justin Bieber. Images of Lady Gaga may be transposed, and meat dresses may be rendered as tofu. Not covered by US export control laws or the Geneva Convention, although it probably should be. Beware of dog. Increases processor cost by 45%. Objects in the rear view mirror may appear closer than they are. Or is it farther? Either way, watch out! If you use SuperJPEG-O-Rama, the cake will not be a lie. All your base are belong to 141L. No whining or complaining. Wingeing is allowed, but only in countries where “wingeing” is a word.
56
JPEG Decode
w/o JOR2k: 30s
w/ JOR2k: 21s
Performance: 30/21 = 1.43x Speedup != 10x
Amdahl ate our Speedup!
Is this worth the 45% increase in cost?
Metric = Latency * Cost =>
Metric = Latency^2 * Cost =>
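Whether the 45% cost increase pays off depends on which metric is used. The sketch below evaluates both metrics named on the slide with the 30 s and 21 s latencies; normalizing the baseline cost to 1.0 is an assumption (only the 45% relative increase comes from the slides), and `weighted_metric` is a hypothetical helper name.

```python
def weighted_metric(latency_s, cost, latency_exponent=1):
    """Smaller-is-better metric of the form Latency^n * Cost."""
    return (latency_s ** latency_exponent) * cost


BASE_COST = 1.0               # assumed normalization of the baseline cost
NEW_COST = BASE_COST * 1.45   # 45% increase in processor cost

# Latency * Cost
lc_old = weighted_metric(30.0, BASE_COST)
lc_new = weighted_metric(21.0, NEW_COST)

# Latency^2 * Cost
l2c_old = weighted_metric(30.0, BASE_COST, latency_exponent=2)
l2c_new = weighted_metric(21.0, NEW_COST, latency_exponent=2)

print(f"Latency * Cost:   {lc_old:.2f} -> {lc_new:.2f}")
print(f"Latency^2 * Cost: {l2c_old:.2f} -> {l2c_new:.2f}")
```

Under these assumptions Latency * Cost gets slightly worse (30 to 30.45), while Latency^2 * Cost improves (900 to about 639), so the answer depends on how heavily latency is weighted.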
57
spends 20% of time doing integer instructions
make the code run 10 hours faster?
make the code run 50 hours faster?
A)1.1 B)1.25 C)1.75 D)1.31 E) 10.0 F) 50.0 G) 1 million times H) Other
61
and spends 20% of time doing integer instructions
unit to make the code run 50 hours faster?
possible
0.2?
63
instructions by 25% (assume each integer instruction takes the same amount of time)
64
65
large)!
“most frequent”
compiler options, optimizations you’ve applied, etc.
common case.
66
functions.
Common case
7x => 1.4x
4x => 1.3x
1.3x => 1.1x
Total = 20/10 = 2x
for large p.
usefulness.
68
application to use 2 processors for 80% of execution.
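Amdahl's law gives the overall speedup when a fraction x of execution is sped up by a factor S: S_tot = 1/((1 - x) + x/S). The sketch below applies it to the two-processor case, assuming the 80% portion runs exactly twice as fast on two processors (an idealization that the slide fragment does not confirm); `amdahl_speedup` is a hypothetical helper name.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of execution time runs `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# 80% of execution uses 2 processors (assumed ideal 2x on that portion).
two_proc = amdahl_speedup(0.8, 2)

# Upper bound as the number of processors grows without limit: 1 / (1 - x).
upper_bound = 1.0 / (1.0 - 0.8)

print(f"{two_proc:.2f}x (limit {upper_bound:.1f}x)")
```

The serial 20% limits the result: two processors give about 1.67x rather than 2x, and no processor count can exceed 5x.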
69
1000*Oldlat
71
This one is tricky
72
73
Memory time (fraction of original execution time)
Case                 L1      L2      n/a     Not memory   Total
Original             0.24    0.03    0.03    0.7          1
L1 sped up           0.06    0.03    0.03    0.7          0.82
L1 and L2 sped up    0.06    0.015   0.03    0.7          0.805
Percentage labels in the figure: 24%, 3%, 3%, 70%; 85%, 4.2%, 4.2%, 8.6%
Amdahl’s law.
74
fraction of execution that the L2 affects actually grows
75
execution and measure its speedup independently
x2only - x1&2))
be higher or lower.
76
= 1/(0.06 + 0.015 + 0.73) = 1.24 times
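The 1.24x result can be reproduced by treating each optimization as acting on its own disjoint slice of the original execution time. In the sketch below, the fractions (0.24 for L1, 0.03 for L2) come from the memory-time figure, and the 4x and 2x factors are inferred from the 0.24 to 0.06 and 0.03 to 0.015 reductions in that figure; `amdahl_multi` is a hypothetical helper name.

```python
def amdahl_multi(parts):
    """Overall speedup for several optimizations on disjoint slices.

    `parts` is a list of (fraction of original time, speedup factor) pairs.
    """
    optimized_time = sum(fraction / factor for fraction, factor in parts)
    untouched_time = 1.0 - sum(fraction for fraction, _ in parts)
    return 1.0 / (untouched_time + optimized_time)


l1_only = amdahl_multi([(0.24, 4)])             # 1 / 0.82
l1_and_l2 = amdahl_multi([(0.24, 4), (0.03, 2)])  # 1 / 0.805

print(f"L1 only: {l1_only:.2f}x, L1 and L2: {l1_and_l2:.2f}x")
```

The untouched time (0.73 here) dominates the denominator, which is why adding the L2 improves the total only from about 1.22x to about 1.24x.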
77
78
79
increased BW for many tasks.
utilization is better (there is always work available for tellers)
work onto resources
increases throughput but hurts latency.
80
81
computer must actually do
Big systems need 0.3-1 Watt of cooling for every watt of compute.
82
83
every cycles)
Less useless transistor switchings
84
metrics describe a hardware capability
(Guaranteed not to exceed) numbers.
85
86
87
Inst Count CPI Cycle time Program x Compiler x (x)
x x (x) Implementation x x Technology x
88
Latency = Instructions * Cycles/Instruction * Seconds/Cycle
89
some require “extra” cycles to execute
the ISA
many of each instruction type executes
90
Instruction Type Total Cycles “Extra” cycles Integer +, -, |, &, branches 1
instruction
integer add.
91
Type    CPI    Static #   Dyn #
mem     5      6          42
int     1      3          30
br      1      2          20
Total   2.8    11         92

Type    CPI    Static #   Dyn #
mem     5      1          1
int     1      5          32
br      1      2          20
Total   1.01   8          53
94