Designing for Performance - Raul Queiroz Feitosa



SLIDE 1

Designing for Performance

Raul Queiroz Feitosa

SLIDE 2

Objective

“In this section … we examine the most common approach to assessing processor and computer system performance”

  • W. Stallings
SLIDE 3

Outline

  • Performance Assessment
  • Amdahl’s Law

SLIDE 4

Performance Assessment

  • AMD EPYC 7601: cache 64 MB, frequency 2.2 GHz, 32 cores
  • Intel Xeon Platinum 8280L: cache 38.5 MB, frequency 2.7 GHz, 28 cores

Which one would you choose?

SLIDE 5

Performance Assessment

What matters?

Cost Size Reliability Security Power Consumption Performance

Designing for Performance 5
SLIDE 6

Performance Assessment

Main CPU operations

  • Fetch and decode instructions
  • Load and store data
  • Logic and arithmetic operations
      • Fixed-point
      • Floating-point
SLIDE 7

Performance Assessment

Performance factors

Clock speed or clock rate (f)

  • Expressed in multiples of Hz.

Clock cycle or clock tick

  • One increment, or pulse, of the clock.

Cycle time (τ)

  • Time between consecutive pulses, so τ = 1/f.

SLIDE 8

Performance Assessment

Performance factors

Clock speed

  • Usually multiple clock cycles are required per instruction.
  • The amount of work implied by one instruction varies considerably.
  • Pipelining overlaps the execution of several instructions.
  • So, clock speed is not the whole story!

SLIDE 9

Performance Assessment

Performance factors

Instruction execution rate

  • Expressed in millions of instructions per second (MIPS) or millions of floating-point operations per second (MFLOPS).
  • Heavily dependent on instruction set, compiler design, processor implementation, and cache and memory hierarchy.

SLIDE 10

Performance Assessment

Performance factors

  • CPI: average number of cycles per instruction.
  • I_i: number of machine instructions of type i executed by a program.
  • CPI_i: number of cycles per instruction of type i.
  • I_c: total number of machine instructions executed by a program.

$$ I_c = \sum_{i=1}^{n} I_i $$

$$ CPI = \frac{\sum_{i=1}^{n} CPI_i \times I_i}{I_c} $$

SLIDE 11

Performance Assessment

Performance factors

  • T: processor time needed to execute a program.

$$ T = I_c \times CPI \times \tau $$

A refinement yields

$$ T = I_c \times [p + (m \times k)] \times \tau $$

where

  • p is the number of processor cycles needed to decode and execute the instruction,
  • m is the number of memory references needed,
  • k is the ratio between memory cycle time and processor cycle time.
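As a minimal sketch, the two expressions for T can be evaluated as shown below. The function names and the sample values (1 million instructions, a 1 GHz clock, p = 1, m = 0.5, k = 2) are illustrative assumptions, not data from the slides.

```python
def processor_time(instr_count, cpi, clock_hz):
    """T = Ic * CPI * tau, with tau = 1 / f."""
    tau = 1.0 / clock_hz
    return instr_count * cpi * tau


def processor_time_refined(instr_count, p, m, k, clock_hz):
    """T = Ic * [p + (m * k)] * tau, with tau = 1 / f."""
    tau = 1.0 / clock_hz
    return instr_count * (p + m * k) * tau


# Hypothetical program: 1 million instructions on a 1 GHz clock, CPI = 2.
t_basic = processor_time(1_000_000, cpi=2.0, clock_hz=1e9)

# Same program with the CPI split into p = 1 processor cycle plus
# m = 0.5 memory references per instruction, each costing k = 2 cycles.
t_refined = processor_time_refined(1_000_000, p=1.0, m=0.5, k=2.0, clock_hz=1e9)

print(f"T (basic)   = {t_basic * 1e3:.3f} ms")
print(f"T (refined) = {t_refined * 1e3:.3f} ms")
```

With these assumed values p + m × k equals the CPI of 2, so both forms give the same time; the refined form only makes the memory contribution explicit.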

SLIDE 12

Performance Assessment

Performance factors

System attributes affecting the performance factors

                              | Ic | p | m | k | τ
Instruction set architecture  | X  | X |   |   |
Compiler technology           | X  | X | X |   |
Processor implementation      |    | X |   |   | X
Cache and memory hierarchy    |    |   |   | X | X

SLIDE 13

Performance Assessment

Exercise 1

A program involves the execution of 2 million instructions on a 400 MHz processor. The CPI and the proportion of four instruction types are given below.

Instruction type            | CPI | Instruction mix
Arithmetic and logic        | 1   | 60%
Load/store with cache hit   | 2   | 18%
Branch                      | 4   | 12%
Load/store with cache miss  | 8   | 10%

Compute the average CPI:

CPI = (1 × 0.6) + (2 × 0.18) + (4 × 0.12) + (8 × 0.1) = 2.24
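The average CPI can be checked with a short script. The sketch below uses only the numbers given in Exercise 1; it additionally derives the MIPS rate and the execution time from the standard relations MIPS = f / (CPI × 10^6) and T = Ic × CPI / f, which the exercise itself does not ask for.

```python
# Instruction mix from Exercise 1: (CPI, fraction of executed instructions).
mix = {
    "arithmetic and logic": (1, 0.60),
    "load/store, cache hit": (2, 0.18),
    "branch": (4, 0.12),
    "load/store, cache miss": (8, 0.10),
}
instr_count = 2_000_000  # instructions executed by the program
clock_hz = 400e6         # 400 MHz processor

# Average CPI is the mix-weighted sum of the per-type CPI values.
cpi = sum(type_cpi * fraction for type_cpi, fraction in mix.values())

# Extra quantities derived from the same data (not requested by the exercise).
mips = clock_hz / (cpi * 1e6)
exec_time = instr_count * cpi / clock_hz

print(f"CPI  = {cpi:.2f}")
print(f"MIPS = {mips:.1f}")
print(f"T    = {exec_time * 1e3:.1f} ms")
```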

SLIDE 14

Performance Assessment

Exercise 2

Consider two hardware implementations, M1 and M2, of the same instruction set. There are three instruction classes: F, I and N. The M1 clock rate is 600 MHz. The clock cycle of M2 is 2 ns. The average CPI for each instruction class is given below.

Class | CPI of M1 | CPI of M2 | Comments
F     | 5.0       | 4.0       | floating-point
I     | 2.0       | 3.8       | integer
N     | 2.4       | 2.0       | non-arithmetic

a) Compute the peak performance of M1 and M2 in MIPS.

b) If 50% of the instructions executed in a given program belong to class N and the rest are equally distributed between F and I, which is the faster machine, and by what factor?

SLIDE 15

Performance Assessment

Exercise 2

c) A designer of M1 plans to change the design to improve performance. Assuming the instruction mix in (b), which of the options below would be the most beneficial?

   1. Use an FPU twice as fast (CPI = 2.5 for class F).
   2. Add a second ALU to reduce the CPI for integer operations to 1.20.
   3. Use faster logic that allows a clock rate of 750 MHz while keeping the same CPI values.

d) The CPI values given above include cache misses, which occur 5 times per 100 executed instructions. Each cache miss incurs a 10-cycle penalty. The fourth redesign option is to use a larger instruction cache so as to reduce the miss ratio from 5% to 3%. Compare this alternative with the options above.

e) Characterize the application programs that execute faster on M1 than on M2, i.e., discuss the instruction composition of such applications. Hint: let x, y and 1-x-y be the fractions of instructions belonging to classes F, I and N, respectively.

SLIDE 16

Performance Assessment

Exercise 3

Consider two codes produced by two compilers for the same source program. The instructions of the machine that will execute these codes can be divided into class A (CPI = 1) and class B (CPI = 2). The number of executed instructions of each class is given below.

Class | Compiler 1 | Compiler 2 | Comments
A     | 600M       | 400M       | CPI = 1
B     | 400M       | 400M       | CPI = 2

a) Compute the execution time of both codes assuming a clock rate of 1 GHz.

b) Which compiler produces the more efficient code, and by what factor?

c) Which code executes at the higher MIPS rate?

SLIDE 17

Performance Assessment

Benchmarks: motivation

A high-level language statement:

    A = B + C   /* assume all quantities in main memory */

Compiled code on CISC:

    add   mem(B), mem(C), mem(A)

Compiled code on RISC:

    load  mem(B), reg(1);
    load  mem(C), reg(2);
    add   reg(1), reg(2), reg(3);
    store reg(3), mem(A);

If both machines execute this high-level statement in the same time and MIPS_CISC = 1, then MIPS_RISC = 4. The MIPS ratings differ by a factor of four although the useful work done is identical.
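The effect can be reproduced numerically. In this sketch the instruction counts (1 for CISC, 4 for RISC) come from the compiled code on the slide, while the function name and the 1 µs statement time are assumptions chosen only to make the arithmetic concrete.

```python
def mips_rating(instr_count, exec_time_s):
    """MIPS = instructions executed / (execution time in seconds * 10^6)."""
    return instr_count / (exec_time_s * 1e6)


# A = B + C compiles to 1 instruction on the CISC and 4 on the RISC machine.
cisc_instr, risc_instr = 1, 4

# Assumption: both machines complete the statement in the same time (1 us).
statement_time_s = 1e-6

mips_cisc = mips_rating(cisc_instr, statement_time_s)
mips_risc = mips_rating(risc_instr, statement_time_s)

print(f"MIPS (CISC) = {mips_cisc:.1f}")
print(f"MIPS (RISC) = {mips_risc:.1f}")
print(f"ratio       = {mips_risc / mips_cisc:.1f}")
```

The ratio depends only on the instruction counts, which is why MIPS alone cannot be used to compare machines with different instruction sets.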

SLIDE 18

Performance Assessment

Benchmarks: definition

  • Programs designed to test performance
  • Written in a high-level language → portable
  • Represent a particular application or system programming area (systems, numerical, commercial)
  • Easily measured and widely distributed
  • The best-known collection of benchmark suites is maintained by the Standard Performance Evaluation Corporation (SPEC)
  • The best known of the SPEC suites is CPU2017:
      • contains 43 benchmarks organized into four suites
      • includes an optional metric for measuring energy consumption

SLIDE 19

Performance Assessment

SPECspeed metric

  • SPEC benchmarks are not concerned with instruction execution rates
  • A base runtime is defined for each benchmark using a reference machine
  • The speed metric is the ratio of reference time to system run time:

$$ r_i = \frac{Tref_i}{Tsut_i} $$

  • Tref_i: execution time of benchmark i on the reference machine
  • Tsut_i: execution time of benchmark i on the system under test
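A minimal sketch of the speed ratio, r_i = Tref_i / Tsut_i, applied per benchmark. The benchmark names and all times below are hypothetical placeholders, not SPEC data.

```python
def spec_speed_ratio(t_ref, t_sut):
    """Ratio of reference-machine time to system-under-test time."""
    return t_ref / t_sut


# Hypothetical execution times in seconds (not real SPEC measurements).
ref_times = {"bench_a": 1000.0, "bench_b": 2400.0}
sut_times = {"bench_a": 250.0, "bench_b": 800.0}

ratios = {
    name: spec_speed_ratio(ref_times[name], sut_times[name])
    for name in ref_times
}

for name, ratio in ratios.items():
    print(f"{name}: ratio = {ratio:.2f}")
```

A ratio above 1 means the system under test ran that benchmark faster than the reference machine.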
SLIDE 20

Performance Assessment

SPECrate metric

  • Measures the throughput, or rate, of a machine carrying out a number of tasks
  • Multiple copies of the benchmarks run simultaneously
      • Typically, the number of copies equals the number of processors
  • The ratio is calculated as follows:

$$ rate_i = N \times \frac{Tref_i}{Tsut_i} $$

  • Tref_i: reference execution time for benchmark i
  • N: number of copies running simultaneously
  • Tsut_i: elapsed time from the start of execution of all N copies until the completion of all copies of the program
  • Again, a geometric mean is calculated
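The rate ratio, rate_i = N × Tref_i / Tsut_i, can be sketched in the same way. The function name and the sample numbers (4 copies, 1000 s reference time, 500 s elapsed time) are hypothetical.

```python
def spec_rate_ratio(n_copies, t_ref, t_sut):
    """rate_i = N * Tref_i / Tsut_i.

    t_sut is the elapsed time from the start of all N copies
    until the last copy finishes.
    """
    return n_copies * t_ref / t_sut


# Hypothetical run: 4 simultaneous copies finish after 500 s in total,
# while the reference machine needs 1000 s for a single copy.
rate = spec_rate_ratio(n_copies=4, t_ref=1000.0, t_sut=500.0)
print(f"rate = {rate:.1f}")
```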
SLIDE 21

Performance Assessment

Averaging SPEC metrics

  • For both SPECspeed and SPECrate, the selected ratios are averaged using the geometric mean, which is reported as the overall metric.
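The geometric mean of n ratios is the n-th root of their product. A minimal sketch, using two hypothetical ratios and computing the mean through logarithms to avoid overflow on long lists:

```python
import math


def geometric_mean(ratios):
    """n-th root of the product of the ratios, computed via logarithms."""
    if not ratios:
        raise ValueError("at least one ratio is required")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical per-benchmark ratios (not SPEC data).
ratios = [2.0, 8.0]
overall = geometric_mean(ratios)
print(f"overall metric = {overall:.2f}")
```

For these two values the arithmetic mean would be 5, while the geometric mean is 4, so a single large ratio has less influence on the overall metric.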
SLIDE 22

Performance Assessment

Exercise 4

The table below shows the execution times, in seconds, for 3 different processors.

Benchmark | X  | Y  | Z
1         | 20 | 10 | 40
2         | 40 | 80 | 20

a) Compute the arithmetic mean value for each system using X as the reference machine and then using Y as the reference machine.

b) Compute the geometric mean value for each system using X as the reference machine and then using Y as the reference machine. Which is the most realistic result?

SLIDE 23

Outline

  • Performance Assessment
  • Amdahl’s Law

SLIDE 24

Amdahl’s Law

Estimates the potential speedup of a program using multiple processors.

  • Fraction f of the code is parallelizable with no scheduling overhead
  • Fraction (1-f) of the code is inherently serial
  • T is the total execution time of the program on a single processor
  • N is the number of processors that fully exploit the parallel portions of the code
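From these definitions the standard form of Amdahl's law follows in two steps: the serial part still takes T(1-f), while the parallel part Tf is divided among N processors. The derivation below uses only the symbols f, T and N defined above.

```latex
\text{Speedup}
  = \frac{\text{time on one processor}}{\text{time on } N \text{ processors}}
  = \frac{T(1-f) + Tf}{T(1-f) + \dfrac{Tf}{N}}
  = \frac{1}{(1-f) + \dfrac{f}{N}}
```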

SLIDE 25

Amdahl’s Law

Conclusions

  • Code needs to be parallelizable/parallelized!
  • If f is small, using parallel processors has little effect.
  • As N → ∞, the speedup is bounded by 1/(1 - f).
  • The speedup is bounded, giving diminishing returns for more processors.
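The diminishing returns can be made visible with a few numbers. This sketch evaluates Speedup = 1 / ((1 - f) + f/N) for a hypothetical parallel fraction f = 0.9 and compares each value with the bound 1/(1 - f); the function name and the chosen values of N are assumptions for illustration.

```python
def amdahl_speedup(f, n):
    """Speedup = 1 / ((1 - f) + f / N) for parallel fraction f on N processors."""
    return 1.0 / ((1.0 - f) + f / n)


f = 0.9                  # hypothetical parallelizable fraction
limit = 1.0 / (1.0 - f)  # bound on the speedup as N grows without limit

for n in (1, 2, 4, 8, 16, 1024):
    s = amdahl_speedup(f, n)
    print(f"N = {n:5d}  speedup = {s:6.2f}  ({s / limit:5.1%} of the bound)")
```

Each doubling of N adds less speedup than the previous one, and no number of processors pushes the result past the bound.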

SLIDE 26

Amdahl’s Law

Exercise 5

A program spends 60% of its execution time on floating-point operations, and 90% of them are executed in parallelizable loops. When the code is parallelized, coordination and synchronization between the parts make the part not involving floating-point operations 10% longer.

a) Find the improvement in execution time achieved by doubling the speed of the floating-point unit.

b) Find the improvement in execution time achieved by using two processors with the same speed and structure as the original one.

c) What would the improvement be if both changes were implemented?

SLIDE 27

Amdahl’s Law

Generalization for any design improvement

Suppose that the enhancement affects a fraction f of the total runtime before the enhancement, and that the speedup of that fraction is SU_f. Thus

$$ \text{Speedup} = \frac{\text{Execution time before enhancement}}{\text{Execution time after enhancement}} = \frac{1}{(1 - f) + \dfrac{f}{SU_f}} $$
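A minimal sketch of the generalized formula, Speedup = 1 / ((1 - f) + f / SU_f). The function name, the input checks and the sample values (half of the runtime made twice as fast) are assumptions for illustration.

```python
def overall_speedup(f, su_f):
    """Generalized Amdahl's law.

    f    -- fraction of the original runtime affected by the enhancement
    su_f -- speedup of that fraction
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if su_f <= 0.0:
        raise ValueError("su_f must be positive")
    return 1.0 / ((1.0 - f) + f / su_f)


# Hypothetical enhancement: 50% of the runtime becomes 2x faster.
s = overall_speedup(f=0.5, su_f=2.0)
print(f"overall speedup = {s:.3f}")
```

Setting su_f = N recovers the multiprocessor form of the law, so the same function covers both cases.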

SLIDE 28

Amdahl’s Law

Generalized Amdahl’s Law example

Suppose that a task spends 40% of its time on floating-point operations and that a new FPU has speedup K. Then the overall speedup is

$$ \text{Speedup} = \frac{1}{(1 - 0.4) + \dfrac{0.4}{K}} $$

So, the maximum speedup is 1.67.
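The bound of 1.67 is the limit of this expression as K grows without bound: the term 0.4/K vanishes and only the unaffected 60% of the runtime remains.

```latex
\lim_{K \to \infty} \frac{1}{(1 - 0.4) + \dfrac{0.4}{K}}
  = \frac{1}{0.6}
  \approx 1.67
```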

SLIDE 29

Homework

Exercise 6

A processor is used for an application where 30%, 25% and 10% of the processing time is spent on floating-point addition, multiplication and division, respectively. For a new version of the processor, 3 alternatives are being considered, all of them involving nearly the same design and implementation cost. Which one should be selected?

a) Redesign the adder, making it twice as fast as the old one.

b) Redesign the multiplier, making it three times as fast as the old one.

c) Redesign the divider, making it ten times as fast as the old one.

SLIDE 30

Homework

Exercise 7

T is the average processing time of a computer operating at frequency f. Instructions are grouped into 3 types, as shown below. Typically, a program executes the same proportion of instructions from each of the three types. Compute the MIPS rate and the new execution time if the FPU becomes twice as fast.

Instruction type          | CPI
Floating-point arithmetic | 10
Integer arithmetic        | 5
Non-arithmetic            | 2

SLIDE 31

Homework

Exercise 8

Let f1 and f2 be the operating frequencies of processors P1 and P2, respectively. Assume that two compilers generate different executable codes for the same source program, each of which may be executed by P1 as well as by P2. The codes have the characteristics given below.

Instruction type          | CPI | Proportion compiler 1 | Proportion compiler 2
Floating-point arithmetic | 10  | 20%                   | 30%
Integer arithmetic        | 5   | 30%                   | 10%
Non-arithmetic            | 2   | 50%                   | 60%

Compute the ratio f1/f2 for which the processing time of P1 executing code 1 equals the processing time of P2 executing code 2.

SLIDE 32

Homework

Exercise 9

The code of an application can be separated into a sequential part (S) and a parallelizable part (P). When the application runs on a single processor, the number of executed instructions of type P is twice the number of type S. When the application runs on multiple processors, the number of instructions of type S increases by 10%. Consider the following two configurations:

A) Single-processor machine operating at frequency 2f.
B) Four-processor machine operating at frequency f.

a) Determine the limit ratio r between the CPI of instructions of type P and type S (r = CPI_P / CPI_S) for which configuration A is faster than configuration B.

b) Compute the upper limit of the speedup that can be achieved using multiple processors without changing the operating frequency.

SLIDE 33

Text Book References

The topics are covered in

  • Stallings, chapter 2
  • Tanenbaum, section 8.4

SLIDE 34

Designing for Performance

END
