Designing for Performance
Raul Queiroz Feitosa
Objective
“In this section … we examine the most common approach to assessing processor and computer system performance” (W. Stallings)
Outline
Performance Assessment
Amdahl’s Law
Performance Assessment
EPYC 7601: Cache 64 MB, Freq.: 2.2 GHz, 32 Cores
Intel Xeon Platinum 8280L: Cache 38.5 MB, Freq.: 2.7 GHz, 28 Cores
Which one would you choose?
Performance Assessment
What matters?
Cost
Size
Reliability
Security
Power Consumption
Performance
Performance Assessment
Main CPU operations
Fetch and decode instructions
Load and store data
Logic and arithmetic operations
  Fixed-point
  Floating-point
Performance Assessment
Performance factors
Clock speed or clock rate ( f )
Expressed in multiples of Hz (MHz, GHz).
Clock cycle or clock tick
One pulse of the system clock.
Cycle time ( τ )
Time between consecutive pulses: τ = 1/f.
Performance Assessment
Performance factors
Clock speed
Usually multiple clock cycles are required per instruction.
The amount of work implied by one instruction varies considerably.
Pipelining gives simultaneous execution of instructions.
So, clock speed is not the whole story!
Performance Assessment
Performance factors
Instruction Execution Rate
Expressed in millions of instructions per second (MIPS) or millions of floating-point operations per second (MFLOPS).
Heavily dependent on instruction set, compiler design, processor implementation, cache & memory hierarchy.
Performance Assessment
Performance factors
CPI - average number of cycles per instruction.
Ii - number of machine instructions of type i executed by a program.
CPIi - number of cycles per instruction of type i.
Ic - number of machine instructions executed by a program.
$$I_c = \sum_{i=1}^{n} I_i$$
$$CPI = \frac{\sum_{i=1}^{n} (CPI_i \times I_i)}{I_c}$$
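A minimal sketch of the two formulas above, assuming the per-type instruction counts and CPIs of a program are known. The function name `average_cpi` and the sample counts are illustrative assumptions, not values from the slides.

```python
def average_cpi(counts, cpis):
    """Average CPI from per-type instruction counts I_i and per-type CPI_i."""
    if len(counts) != len(cpis):
        raise ValueError("counts and cpis must have the same length")
    total_instructions = sum(counts)  # I_c = sum of I_i
    if total_instructions == 0:
        raise ValueError("the program executes no instructions")
    # Total cycles = sum of CPI_i * I_i over all instruction types.
    total_cycles = sum(cpi * n for cpi, n in zip(cpis, counts))
    return total_cycles / total_instructions


# Hypothetical program with three instruction types (made-up numbers).
example_counts = [500, 300, 200]
example_cpis = [1, 2, 4]
example_cpi = average_cpi(example_counts, example_cpis)
print(f"average CPI = {example_cpi}")
```

Weighting each CPIi by its count Ii is what makes the result depend on the instruction mix of the program and not only on the processor.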
Performance Assessment
Performance factors
T – processor time needed to execute a program.
$$T = I_c \times CPI \times \tau$$
A refinement yields
$$T = I_c \times [p + (m \times k)] \times \tau$$
where
p is the number of processor cycles needed to decode and execute the instruction,
m is the number of memory references needed,
k is the ratio between memory cycle time and processor cycle time.
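The two expressions for T can be compared numerically. The sketch below assumes hypothetical values for Ic, p, m, k and the clock rate; the function names are illustrative only.

```python
def execution_time(instruction_count, cpi, cycle_time):
    """T = I_c * CPI * tau."""
    return instruction_count * cpi * cycle_time


def refined_execution_time(instruction_count, p, m, k, cycle_time):
    """T = I_c * [p + (m * k)] * tau."""
    return instruction_count * (p + m * k) * cycle_time


# Hypothetical values: 1 million instructions on a 1 GHz clock (tau = 1 ns).
tau = 1 / 1e9
t_basic = execution_time(1_000_000, 2.0, tau)
# p = 1.5 processor cycles, m = 0.25 memory references, k = 2,
# so p + m * k = 2.0 cycles per instruction, the same CPI as above.
t_refined = refined_execution_time(1_000_000, 1.5, 0.25, 2, tau)
print(f"T (basic)   = {t_basic * 1e3:.3f} ms")
print(f"T (refined) = {t_refined * 1e3:.3f} ms")
```

The refined form separates the cycles spent inside the processor (p) from the cycles spent waiting on memory (m × k), which is why a slower memory (larger k) raises T even when p is unchanged.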
Performance Assessment
Performance factors
System attributes affecting the performance factors
Performance factors: Ic, p, m, k, τ
System attributes:
Instruction set architecture
Compiler technology
Processor implementation
Cache and memory hierarchy
Performance Assessment
Exercise 1
A program involves the execution of 2 million instructions on a 400 MHz processor.
Instruction type              CPI   Instruction mix
Arithmetic and logic          1     60%
Load/store with cache hit     2     18%
Branch                        4     12%
Load/store with cache miss    8     10%
Compute the average CPI:
CPI = 0.6 + (2 × 0.18) + (4 × 0.12) + (8 × 0.1) = 2.24
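The average CPI of Exercise 1 can be checked with a few lines of Python. The exercise statement is truncated in this transcript, so the MIPS rate and execution time computed below are an assumption about what else is asked; they use the usual relations MIPS = f / (CPI × 10^6) and T = Ic × CPI / f.

```python
# (CPI, fraction of the instruction mix) for each instruction type.
mix = {
    "arithmetic and logic": (1, 0.60),
    "load/store with cache hit": (2, 0.18),
    "branch": (4, 0.12),
    "load/store with cache miss": (8, 0.10),
}

clock_rate_hz = 400e6          # 400 MHz
instruction_count = 2_000_000  # 2 million instructions

# Average CPI is the mix-weighted sum of the per-type CPIs.
cpi = sum(type_cpi * fraction for type_cpi, fraction in mix.values())
# Assumed follow-up quantities (not stated in the surviving slide text).
mips_rate = clock_rate_hz / (cpi * 1e6)
exec_time_s = instruction_count * cpi / clock_rate_hz

print(f"CPI       = {cpi:.2f}")
print(f"MIPS rate = {mips_rate:.1f}")
print(f"T         = {exec_time_s * 1e3:.1f} ms")
```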
Performance Assessment
Exercise 2
Consider two hardware implementations, M1 and M2, of the same instruction set. There are three instruction classes: F, I and N. The M1 clock rate is 600 MHz. The clock cycle of M2 is 2 ns. The average CPIs for the three instruction classes are:
Class   CPI of M1   CPI of M2   Comments
F       5.0         4.0         floating-point
I       2.0         3.8         integer
N       2.4         2.0         non-arithmetic
a) Compute the peak performance for M1 and M2 in MIPS.
b) If 50% of the instructions executed in a given program belong to class N and the others are equally distributed between F and I, which machine is faster and by what factor?
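One possible way to set up parts (a) and (b) is sketched below. It assumes that "peak performance" means a program made up only of the class with the lowest CPI; the helper names are illustrative.

```python
# CPI per instruction class, taken from the exercise table.
cpi = {
    "M1": {"F": 5.0, "I": 2.0, "N": 2.4},
    "M2": {"F": 4.0, "I": 3.8, "N": 2.0},
}
clock_hz = {
    "M1": 600e6,  # 600 MHz
    "M2": 500e6,  # 2 ns cycle time, i.e. 1 / (2 ns) = 500 MHz
}


def peak_mips(machine):
    """Assumed definition: every instruction belongs to the fastest class."""
    return clock_hz[machine] / min(cpi[machine].values()) / 1e6


def average_cpi(machine, mix):
    """Mix-weighted CPI; mix maps class name to fraction of instructions."""
    return sum(cpi[machine][cls] * fraction for cls, fraction in mix.items())


def time_per_instruction(machine, mix):
    return average_cpi(machine, mix) / clock_hz[machine]


# Part (b): 50% class N, the rest split equally between F and I.
mix_b = {"N": 0.50, "F": 0.25, "I": 0.25}
speedup_m1_over_m2 = (
    time_per_instruction("M2", mix_b) / time_per_instruction("M1", mix_b)
)

for machine in ("M1", "M2"):
    print(machine, f"peak = {peak_mips(machine):.0f} MIPS,",
          f"average CPI = {average_cpi(machine, mix_b):.2f}")
print(f"M1 is faster than M2 by a factor of {speedup_m1_over_m2:.2f}")
```

With this mix both machines end up with the same average CPI, so the difference in execution time comes entirely from the clock rates.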
Performance Assessment
Exercise 2
c) A designer of M1 plans to change the design to improve performance. Assuming the instruction mix in (b), which of the options below would be most beneficial?
1. Use an FPU twice as fast (CPI = 2.5 for class F).
2. Add a second ALU to reduce the CPI for integer operations to 1.20.
3. Use faster logic that allows a clock rate of 750 MHz while keeping the same CPI values.
d) The CPIs given above include cache misses, which occur 5 times per 100 executed instructions. Each cache miss incurs a 10-cycle penalty. The fourth redesign option consists of using a larger instruction cache so as to reduce the miss ratio from 5% to 3%. Compare this alternative with the
e) Characterize application programs that execute faster on M1 than on M2, i.e., discuss the instruction composition of such applications. Hint: Let x, y and 1-x-y be the fractions of instructions belonging to classes F, I and N, respectively.
Performance Assessment
Exercise 3
Consider two codes produced by two compilers for the same source program. The instructions of the machine that will execute these codes can be divided into class A (CPI = 1) and class B (CPI = 2). The number of executed instructions of each class is given below.
Class   Compiler 1   Compiler 2   Comments
A       600M         400M         CPI = 1
B       400M         400M         CPI = 2
a) Compute the execution time for both codes assuming a clock rate of 1 GHz.
b) Which compiler produces the more efficient code and by what factor?
c) Which code executes at the highest MIPS?
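A short sketch for checking Exercise 3, assuming execution time is total cycles divided by clock rate and MIPS is executed instructions divided by (time × 10^6). The helper names are illustrative.

```python
CLOCK_HZ = 1e9            # 1 GHz
CPI = {"A": 1, "B": 2}    # cycles per instruction for each class

# Executed instructions per class for each compiler's code.
counts = {
    "compiler 1": {"A": 600e6, "B": 400e6},
    "compiler 2": {"A": 400e6, "B": 400e6},
}


def exec_time(code):
    """Execution time in seconds: total cycles / clock rate."""
    cycles = sum(CPI[cls] * n for cls, n in counts[code].items())
    return cycles / CLOCK_HZ


def mips(code):
    """Millions of instructions executed per second."""
    instructions = sum(counts[code].values())
    return instructions / (exec_time(code) * 1e6)


for code in counts:
    print(f"{code}: T = {exec_time(code):.2f} s, MIPS = {mips(code):.1f}")

time_ratio = exec_time("compiler 1") / exec_time("compiler 2")
print(f"code 2 runs {time_ratio:.3f} times faster than code 1")
```

The code that finishes first is not the one with the higher MIPS figure, which is the point the exercise is built to expose.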
Performance Assessment
Benchmarks: motivation
A high-level language statement
    A = B + C   /* assume all quantities in main memory */
Compiled code on CISC
    add mem(B),mem(C),mem(A)
Compiled code on RISC
    load mem(B),reg(1);
    load mem(C),reg(2);
    add reg(1),reg(2),reg(3);
    store reg(3),mem(A);
Both machines execute the same high-level code in the same time.
So, if MIPS_CISC = 1, then MIPS_RISC = 4.
Performance Assessment
Benchmarks: definition
Programs designed to test performance
Written in a high-level language → portable
Represent a particular application or system programming area (systems, numerical, commercial)
Easily measured and widely distributed
The best-known collection of benchmark suites is maintained by the Standard Performance Evaluation Corporation (SPEC)
The best known of the SPEC suites is CPU2017:
  contains 43 benchmarks organized into four suites
  includes an optional metric for measuring energy consumption
Performance Assessment
SPECspeed metric
SPEC benchmarks are not concerned with instruction execution rates
Base runtime is defined for each benchmark using a reference machine
Speed metric is the ratio of reference time to system run time
$$r_i = \frac{Tref_i}{Tsut_i}$$
Trefi: execution time for benchmark i on the reference machine
Tsuti: execution time of benchmark i on the test system
Performance Assessment
SPECrate Metric
Measures throughput or rate of a machine carrying out a number of tasks
Multiple copies of benchmarks run simultaneously
  Typically, same as number of processors
Ratio is calculated as follows:
$$rate_i = N \times \frac{Tref_i}{Tsut_i}$$
Trefi: reference execution time for benchmark i
N: number of copies running simultaneously
Tsuti: elapsed time from start of execution of all N programs until completion of all copies of the program
Again, a geometric mean is calculated
Performance Assessment
Averaging SPEC metrics
For both SPECspeed and SPECrate, the selected ratios are averaged using the geometric mean.
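The per-benchmark ratios and their geometric mean can be sketched as follows. The ratio formulas are assumed to have the usual form (reference time over measured time, multiplied by the number of copies N for the rate metric); the benchmark times are made up and the function names are illustrative.

```python
import math


def speed_ratio(t_ref, t_sut):
    """SPECspeed-style ratio for one benchmark: Tref_i / Tsut_i."""
    return t_ref / t_sut


def rate_ratio(t_ref, t_sut, copies):
    """SPECrate-style ratio for one benchmark: N * Tref_i / Tsut_i."""
    return copies * t_ref / t_sut


def geometric_mean(ratios):
    """n-th root of the product of n ratios."""
    return math.prod(ratios) ** (1 / len(ratios))


# Hypothetical reference and measured times, in seconds, for two benchmarks.
ref_times = [100.0, 400.0]
sut_times = [50.0, 50.0]

speed_ratios = [speed_ratio(r, s) for r, s in zip(ref_times, sut_times)]
overall_speed = geometric_mean(speed_ratios)
print("speed ratios:", speed_ratios)
print("geometric mean:", overall_speed)
```

A system that runs every benchmark twice as fast doubles every ratio and therefore doubles the geometric mean, regardless of which reference machine was used.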
Performance Assessment
Exercise 4
The table below shows the execution times, in seconds, for 3 different processors.
a) Compute the arithmetic mean value for each system using X as the reference machine and then using Y as the reference machine.
b) Compute the geometric mean value for each system using X as the reference machine and then using Y as the reference machine. Which is the most realistic result?
Benchmark   Processor X   Processor Y   Processor Z
1           20            10            40
2           40            80            20
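The comparison asked for in Exercise 4 can be explored with the sketch below. It assumes each time is normalized as reference time divided by system time, as in the SPECspeed ratio; the helper names are illustrative.

```python
import math

# Execution times in seconds for benchmarks 1 and 2, from the table.
times = {"X": [20, 40], "Y": [10, 80], "Z": [40, 20]}


def ratios(system, reference):
    """Per-benchmark ratios t_ref / t_sys (assumed normalization)."""
    return [t_ref / t_sys
            for t_ref, t_sys in zip(times[reference], times[system])]


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return math.prod(values) ** (1 / len(values))


for reference in ("X", "Y"):
    print(f"reference machine: {reference}")
    for system in times:
        r = ratios(system, reference)
        print(f"  {system}: arithmetic = {arithmetic_mean(r):.3f}, "
              f"geometric = {geometric_mean(r):.3f}")
```

The arithmetic means change the ranking when the reference machine changes, while the geometric means stay consistent with each other.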
Outline
Performance Assessment
Amdahl’s Law
Amdahl’s Law
Estimates the potential speedup of a program using multiple processors
Fraction f of code parallelizable with no scheduling overhead
Fraction (1-f) of code inherently serial
T is total execution time for program on single processor
N is number of processors that fully exploit parallel portions of code
$$Speedup = \frac{T(1-f) + Tf}{T(1-f) + \frac{Tf}{N}} = \frac{1}{(1-f) + \frac{f}{N}}$$
Amdahl’s Law
Conclusions
Code needs to be parallelizable/parallelized!
If f is small, parallel processors have little effect.
As N → ∞, speedup is bounded by 1/(1 – f).
Speedup is bounded, giving diminishing returns for more processors.
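The conclusions above can be seen numerically with a small sketch of Amdahl's law, Speedup = 1 / ((1 - f) + f / N). The function names and the sample values of f and N are illustrative assumptions.

```python
import math


def amdahl_speedup(f, n):
    """Speedup with parallelizable fraction f on n processors."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - f) + f / n)


def speedup_limit(f):
    """Upper bound on the speedup as n grows without limit: 1 / (1 - f)."""
    return math.inf if f == 1.0 else 1.0 / (1.0 - f)


for f in (0.1, 0.5, 0.9):
    row = ", ".join(f"N={n}: {amdahl_speedup(f, n):.2f}"
                    for n in (2, 10, 100, 1000))
    print(f"f={f}: {row} (limit {speedup_limit(f):.2f})")
```

For f = 0.1 the speedup barely moves above 1 no matter how many processors are added, and for f = 0.9 most of the achievable gain is already reached with a few tens of processors.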
Amdahl’s Law
Exercise 5
A program spends 60% of its execution time with floating point operations. 90%
coordination and synchronization between parts make the part not involving floating-point operations 10% longer.
a) Find the improvement in terms of execution time achieved by doubling the speed of the floating-point unit.
b) Find the improvement in terms of execution time achieved by using two processors having the same speed and structure as the original one.
c) What would be the improvement if both changes are implemented?
Amdahl’s Law
Generalization for any design improvement
Suppose that the enhancement affects a fraction f of the total execution time before the enhancement, and that the speedup of that fraction brought by the enhancement is SUf. Thus
$$Speedup = \frac{\text{Execution time before enhancement}}{\text{Execution time after enhancement}} = \frac{1}{(1-f) + \frac{f}{SU_f}}$$
Amdahl’s Law
Generalized Amdahl’s Law example
Suppose that a task spends 40% of its time on floating-point operations and that a new FPU has speedup K.
$$Speedup = \frac{1}{(1 - 0.4) + \frac{0.4}{K}}$$
So, as K → ∞, the maximum speedup is 1/0.6 ≈ 1.67.
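The example can be tabulated for a few values of K. This is a minimal sketch of the generalized formula with f = 0.4; the function name and the chosen K values are illustrative assumptions.

```python
def generalized_speedup(f, su_f):
    """Overall speedup when a fraction f of the time is sped up by su_f."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if su_f <= 0:
        raise ValueError("su_f must be positive")
    return 1.0 / ((1.0 - f) + f / su_f)


f = 0.4  # fraction of time spent in floating-point operations
for k in (2, 10, 100, 1000):
    print(f"K = {k:4d}: speedup = {generalized_speedup(f, k):.3f}")

# The bound reached as K grows without limit.
max_speedup = 1.0 / (1.0 - f)
print(f"maximum speedup = {max_speedup:.2f}")
```

Even a very fast FPU cannot push the overall speedup past the bound, because the remaining 60% of the execution time is untouched by the enhancement.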
Homework
Exercise 6
A processor is used for an application where 30 %, 25% and 10% of the processing time is spent with floating-point addition, multiplication and division,
should be selected?
a) Redesign the adder making it twice as fast as the older one.
b) Redesign the multiplier making it three times as fast as the older one.
c) Redesign the divider making it ten times as fast as the older one.
Homework
Exercise 7:
T is the average processing time of a computer operating at frequency f. Instructions are grouped in 3 types, as shown below. Typically a program executes the same proportion of instructions from all three groups/types. Compute the MIPS and the new execution time, if the FPU becomes twice as fast.
Instruction type            CPI
Floating point arithmetic   10
Integer arithmetic          5
Non-arithmetic              2
Homework
Exercise 8:
Let f1 and f2 be the operating frequencies of processors P1 and P2, respectively. Assume that two compilers generate different executable codes for the same source program, which may be executed by P1 as well as by P2. The codes have the characteristics given below. Compute the ratio f1/f2 for which the processing time of P1 executing code 1 equals the processing time of P2 executing code 2.
Instruction type            CPI   Proportion compiler 1   Proportion compiler 2
Floating point arithmetic   10    20%                     30%
Integer arithmetic          5     30%                     10%
Non-arithmetic              2     50%                     60%
Homework
Exercise 9:
The code of an application can be separated in a sequential part (S) and in a parallelizable part (P). The number of executed instructions of type P is twice as many as
multiple processors the number of instructions of type S increases by 10%. Consider the following two configurations:
A) Single-processor machine operating at frequency 2f.
B) Four-processor machine operating at frequency f.
a) Determine the limit ratio r between the CPI of instructions of type P and type S (r = CPIP/CPIS) for which configuration A) is faster than configuration B).
b) Compute the upper limit for the speedup that can be achieved using multiple processors without changing the operating frequency.
Textbook References
The topics are covered in
Stallings
Tanenbaum - section 8.4