
Designing for Performance - Raul Queiroz Feitosa



  1. Designing for Performance Raul Queiroz Feitosa

  2. Objective: “In this section … we examine the most common approach to assessing processor and computer system performance” (W. Stallings)

  3. Outline
     - Performance Assessment
     - Amdahl’s Law

  4. Performance Assessment - Which one would you choose?

     Processor                    Cache     Frequency   Cores
     Intel Xeon Platinum 8280L    38.5 MB   2.7 GHz     28
     EPYC 7601                    64 MB     2.2 GHz     32

  5. Performance Assessment - What matters?
     - Cost
     - Size
     - Reliability
     - Security
     - Power Consumption
     - Performance

  6. Performance Assessment - Main CPU operations
     - Fetch and decode instructions
     - Load and store data
     - Logic and arithmetic operations
       - Fixed-point
       - Floating-point

  7. Performance Assessment - Performance factors
     - Clock speed or clock rate ($f$): expressed in multiples of Hz.
     - Clock cycle or clock tick: one increment, or pulse, of the clock.
     - Clock time or cycle time ($\tau$): the time between consecutive pulses, so $\tau = 1/f$.

  8. Performance Assessment - Performance factors
     - Clock speed
       - Usually, multiple clock cycles are required per instruction.
       - The amount of work done by one instruction varies considerably.
       - Pipelining allows several instructions to be in execution simultaneously.
       - So, clock speed is not the whole story!

  9. Performance Assessment - Performance factors
     - Instruction execution rate
       - Expressed in millions of instructions per second (MIPS) or millions of floating-point operations per second (MFLOPS).
       - Heavily dependent on the instruction set, compiler design, processor implementation, and the cache and memory hierarchy.

  10. Performance Assessment - Performance factors
      - CPI: average number of cycles per instruction.
      - $I_i$: number of machine instructions of type $i$ executed by a program.
      - $CPI_i$: number of cycles per instruction of type $i$.
      - $I_c$: number of machine instructions executed by a program.

      $$I_c = \sum_{i=1}^{n} I_i \qquad CPI = \frac{\sum_{i=1}^{n} (CPI_i \times I_i)}{I_c}$$

  11. Performance Assessment - Performance factors
      - $T$: processor time needed to execute a program.

      $$T = I_c \times CPI \times \tau$$

      A refinement yields

      $$T = I_c \times [p + (m \times k)] \times \tau$$

      where
      - $p$ is the number of processor cycles needed to decode and execute the instruction,
      - $m$ is the number of memory references needed,
      - $k$ is the ratio between memory cycle time and processor cycle time.
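The refined formula can be tried out with a minimal Python sketch. The function name `processor_time` and all numeric values are hypothetical, chosen only for illustration; none of them come from the slides.

```python
def processor_time(ic, p, m, k, tau):
    """T = I_c * [p + (m * k)] * tau, in the same time unit as tau."""
    return ic * (p + m * k) * tau


# Hypothetical values, chosen only for illustration (not from the slides).
ic = 1_000_000   # I_c: executed instructions
p = 2            # processor cycles to decode and execute an instruction
m = 1            # memory references per instruction
k = 4            # memory cycle time / processor cycle time
tau = 1e-9       # cycle time of a 1 GHz clock, in seconds

# Comparing the two formulas for T shows that CPI = p + (m * k).
effective_cpi = p + m * k
t = processor_time(ic, p, m, k, tau)
print(f"CPI = {effective_cpi}, T = {t * 1e3:.1f} ms")
```

With these assumed numbers the memory term contributes four of the six cycles per instruction, which is why the memory hierarchy shows up as a performance factor of its own.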

  12. Performance Assessment - Performance factors
      System attributes affecting the performance factors (X marks a factor the attribute affects):

      System attribute               I_c   p    m    k    τ
      Instruction set architecture   X     X
      Compiler technology            X     X    X
      Processor implementation             X              X
      Cache and memory hierarchy                     X    X

  13. Performance Assessment - Exercise 1
      A program involves the execution of 2 million instructions on a 400 MHz processor. The CPI and the proportion of each of the four instruction types are given below. Compute the average CPI.

      Instruction type              CPI   Instruction mix
      Arithmetic and logic          1     60%
      Load/store with cache hit     2     18%
      Branch                        4     12%
      Load/store with cache miss    8     10%

      The average CPI is

      $$CPI = 0.6 + (2 \times 0.18) + (4 \times 0.12) + (8 \times 0.1) = 2.24$$
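The same calculation can be checked with a short Python sketch. It also derives the execution time and the MIPS rate; the relation MIPS = f / (CPI × 10^6) is the standard one but is not stated on the slide, so treat that part as an added assumption.

```python
# Instruction mix from Exercise 1: (CPI of the type, fraction of executed instructions)
mix = {
    "arithmetic and logic": (1, 0.60),
    "load/store, cache hit": (2, 0.18),
    "branch": (4, 0.12),
    "load/store, cache miss": (8, 0.10),
}

clock_hz = 400e6          # 400 MHz processor
instruction_count = 2e6   # I_c: 2 million executed instructions

# Average CPI is the mix-weighted sum of the per-type CPI values.
avg_cpi = sum(cpi * fraction for cpi, fraction in mix.values())

# T = I_c * CPI * tau, with tau = 1 / f.
exec_time_s = instruction_count * avg_cpi / clock_hz

# MIPS rate = f / (CPI * 10^6); this relation is not on the slide.
mips = clock_hz / (avg_cpi * 1e6)

print(f"average CPI = {avg_cpi:.2f}")
print(f"T = {exec_time_s * 1e3:.1f} ms")
print(f"MIPS rate = {mips:.1f}")
```

The script reproduces the CPI of 2.24 and gives an execution time of about 11.2 ms at roughly 178.6 MIPS.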

  14. Performance Assessment - Exercise 2
      Consider two hardware implementations, M1 and M2, of the same instruction set. There are three instruction classes: F, I and N. The M1 clock rate is 600 MHz. The clock cycle of M2 is 2 ns. The average CPI values for the three instruction classes are:

      Class   CPI of M1   CPI of M2   Comments
      F       5.0         4.0         floating-point
      I       2.0         3.8         integer
      N       2.4         2.0         non-arithmetic

      a) Compute the peak performance of M1 and M2 in MIPS.
      b) If 50% of the instructions executed in a given program belong to class N and the others are equally distributed between F and I, which is the faster machine, and by what factor?

  15. Performance Assessment - Exercise 2 (continued)
      c) A designer of M1 plans to change the design to improve performance. Assuming the instruction mix in (b), which of the options below would be the most beneficial?
         1. Use an FPU twice as fast (CPI = 2.5 for class F).
         2. Add a second ALU to reduce the CPI for integer operations to 1.20.
         3. Use faster logic that allows a clock rate of 750 MHz while keeping the same CPI values.
      d) The CPI values given above include the effect of cache misses, which occur 5 times per 100 executed instructions. Each cache miss incurs a 10-cycle penalty. A fourth redesign option is to use a larger instruction cache that reduces the miss ratio from 5% to 3%. Compare this alternative with the previous options.
      e) Characterize the application programs that execute faster on M1 than on M2, i.e., discuss the instruction composition of such applications. Hint: let x, y and 1-x-y be the fractions of instructions belonging to classes F, I and N, respectively.
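Parts (a) to (c) of Exercise 2 can be explored numerically with the sketch below; parts (d) and (e) are left out. It assumes that "peak performance" means the rate reached when only the lowest-CPI class executes, which the slides do not state explicitly. All function and variable names are illustrative.

```python
# CPI per instruction class, taken from the Exercise 2 table.
CPI_M1 = {"F": 5.0, "I": 2.0, "N": 2.4}
CPI_M2 = {"F": 4.0, "I": 3.8, "N": 2.0}

CLOCK_M1_HZ = 600e6  # M1 clock rate: 600 MHz
CLOCK_M2_HZ = 500e6  # M2 cycle time of 2 ns corresponds to 500 MHz

# Mix from part (b): 50% class N, the rest split equally between F and I.
MIX = {"F": 0.25, "I": 0.25, "N": 0.50}


def peak_mips(cpi, clock_hz):
    """Assumption: the peak rate occurs when only the lowest-CPI class runs."""
    return clock_hz / (min(cpi.values()) * 1e6)


def time_per_instruction_ns(cpi, clock_hz, mix=MIX):
    """Average time per instruction: (mix-weighted CPI) * (cycle time), in ns."""
    avg_cpi = sum(cpi[c] * mix[c] for c in mix)
    return avg_cpi / clock_hz * 1e9


# Part (a): peak MIPS of each machine.
print("peak MIPS:", peak_mips(CPI_M1, CLOCK_M1_HZ), peak_mips(CPI_M2, CLOCK_M2_HZ))

# Part (b): compare the two machines on the given mix.
t_m1 = time_per_instruction_ns(CPI_M1, CLOCK_M1_HZ)
t_m2 = time_per_instruction_ns(CPI_M2, CLOCK_M2_HZ)
print(f"M1: {t_m1:.3f} ns, M2: {t_m2:.3f} ns, ratio M2/M1: {t_m2 / t_m1:.2f}")

# Part (c): the three redesign options for M1.
OPTIONS = {
    "1 (faster FPU)": ({**CPI_M1, "F": 2.5}, CLOCK_M1_HZ),
    "2 (second ALU)": ({**CPI_M1, "I": 1.2}, CLOCK_M1_HZ),
    "3 (750 MHz clock)": (CPI_M1, 750e6),
}
option_times = {
    name: time_per_instruction_ns(cpi, hz) for name, (cpi, hz) in OPTIONS.items()
}
for name, t in option_times.items():
    print(f"option {name}: {t:.3f} ns per instruction")
best_option = min(option_times, key=option_times.get)
print("lowest time per instruction: option", best_option)
```

Both machines turn out to have the same average CPI on this mix, so in part (b) the difference comes entirely from the clock rate.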

  16. Performance Assessment - Exercise 3
      Consider two codes produced by two compilers for the same source program. The instructions of the machine that will execute these codes can be divided into class A (CPI = 1) and class B (CPI = 2). The number of executed instructions of each class is given below.

      Class   Compiler 1   Compiler 2   Comments
      A       600M         400M         CPI = 1
      B       400M         400M         CPI = 2

      a) Compute the execution time for both codes assuming a clock rate of 1 GHz.
      b) Which compiler produces the more efficient code, and by what factor?
      c) Which code executes at the higher MIPS rate?
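A small Python sketch for Exercise 3, using the instruction counts from the table and the 1 GHz clock rate. The helper names are illustrative, and the MIPS rate is taken as instructions executed divided by (execution time × 10^6).

```python
CLOCK_HZ = 1e9                 # 1 GHz clock rate
CPI = {"A": 1, "B": 2}         # cycles per instruction for each class

# Executed instructions per class for the code from each compiler.
COUNTS = {
    "compiler 1": {"A": 600e6, "B": 400e6},
    "compiler 2": {"A": 400e6, "B": 400e6},
}


def exec_time_s(counts):
    """Total cycles divided by the clock rate."""
    cycles = sum(counts[c] * CPI[c] for c in counts)
    return cycles / CLOCK_HZ


def mips_rate(counts):
    """Millions of instructions executed per second."""
    return sum(counts.values()) / (exec_time_s(counts) * 1e6)


for name, counts in COUNTS.items():
    print(f"{name}: T = {exec_time_s(counts):.2f} s, {mips_rate(counts):.1f} MIPS")
```

Running it shows that the code with the shorter execution time is not the one with the higher MIPS rate, which is the point the exercise is making about MIPS as a metric.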

  17. Performance Assessment - Benchmarks: motivation
      A high-level language statement:
          A = B + C   /* assume all quantities in main memory */
      Compiled code on CISC:
          add mem(B),mem(C),mem(A)
      Compiled code on RISC:
          load mem(B),reg(1);
          load mem(C),reg(2);
          add reg(1),reg(2),reg(3);
          store reg(3),mem(A);
      Suppose both machines execute the high-level statement in the same time. The RISC machine executes four instructions where the CISC machine executes one, so if the CISC machine is rated at 1 MIPS, the RISC machine is rated at 4 MIPS, although both do the same amount of work.

  18. Performance Assessment - Benchmarks: definition
      - Programs designed to test performance
      - Written in a high-level language, and therefore portable
      - Representative of a particular application or system programming area (systems, numerical, commercial)
      - Easily measured and widely distributed
      - The best-known collection of benchmark suites is maintained by the Standard Performance Evaluation Corporation (SPEC)
      - The best known of the SPEC suites is SPEC CPU2017:
        - contains 43 benchmarks organized into four suites
        - includes an optional metric for measuring energy consumption

  19. Performance Assessment - SPECspeed metric
      - SPEC benchmarks are not concerned with instruction execution rates
      - A base runtime is defined for each benchmark using a reference machine
      - The speed metric is the ratio of the reference time to the system run time:

        $$r_i = \frac{Tref_i}{Tsut_i}$$

      - $Tref_i$: execution time of benchmark $i$ on the reference machine
      - $Tsut_i$: execution time of benchmark $i$ on the system under test

  20. Performance Assessment - SPECrate metric
      - Measures the throughput or rate of a machine carrying out a number of tasks
      - Multiple copies of the benchmarks run simultaneously
        - Typically, the number of copies equals the number of processors
      - The ratio is calculated as follows:

        $$r_i = \frac{N \times Tref_i}{Tsut_i}$$

      - $Tref_i$: reference execution time for benchmark $i$
      - $N$: number of copies running simultaneously
      - $Tsut_i$: elapsed time from the start of execution of all $N$ copies until the completion of all copies of the program
      - Again, a geometric mean is calculated

  21. Performance Assessment - Averaging SPEC metrics
      - For both SPECspeed and SPECrate, the selected ratios are averaged using the geometric mean, which is reported as the overall metric.
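For reference, the geometric mean of $n$ per-benchmark ratios can be written as follows. The symbols $r_i$ (ratio for benchmark $i$) and $r_G$ (overall metric) are notation assumed here for illustration.

```latex
r_G = \left( \prod_{i=1}^{n} r_i \right)^{1/n}
```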

  22. Performance Assessment - Exercise 4
      The table below shows the execution times, in seconds, of two benchmarks on three different processors.

      Benchmark   Processor X   Processor Y   Processor Z
      1           20            10            40
      2           40            80            20

      a) Compute the arithmetic mean value for each system using X as the reference machine and then using Y as the reference machine.
      b) Compute the geometric mean value for each system using X as the reference machine and then using Y as the reference machine. Which is the most realistic result?
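Exercise 4 can be worked through with the Python sketch below. It assumes SPEC-style ratios, i.e. reference time divided by the time on the system being rated; the exercise does not say which way to normalize, so that choice is an assumption. Function names are illustrative.

```python
from math import prod

# Execution times in seconds for benchmarks 1 and 2 on each processor.
TIMES = {"X": [20, 40], "Y": [10, 80], "Z": [40, 20]}


def ratios(system, reference):
    """Assumed SPEC-style ratio per benchmark: reference time / system time."""
    return [t_ref / t_sys for t_ref, t_sys in zip(TIMES[reference], TIMES[system])]


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return prod(values) ** (1 / len(values))


for reference in ("X", "Y"):
    print(f"reference machine: {reference}")
    for system in TIMES:
        r = ratios(system, reference)
        print(
            f"  {system}: arithmetic mean = {arithmetic_mean(r):.3f}, "
            f"geometric mean = {geometric_mean(r):.3f}"
        )
```

Under this assumption the arithmetic means change the ranking of the systems when the reference machine changes, whereas the geometric means rate the systems the same way for either reference.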

  23. Outline
      - Performance Assessment
      - Amdahl’s Law

  24. Amdahl’s Law
      Estimates the potential speedup of a program using multiple processors.
      - A fraction $f$ of the code is parallelizable with no scheduling overhead
      - A fraction $(1 - f)$ of the code is inherently serial
      - $T$ is the total execution time of the program on a single processor
      - $N$ is the number of processors that fully exploit the parallel portions of the code

      $$Speedup = \frac{T(1-f) + Tf}{T(1-f) + \frac{Tf}{N}} = \frac{1}{(1-f) + \frac{f}{N}}$$

  25. Amdahl’s Law - Conclusions
      - Code needs to be parallelizable, and actually parallelized!
      - When $f$ is small, adding parallel processors has little effect.
      - As $N \to \infty$, the speedup is bounded by $1/(1 - f)$.
      - Because the speedup is bounded, adding more processors gives diminishing returns.
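These conclusions can be seen numerically with a minimal Python sketch of Amdahl's law, speedup = 1 / ((1 - f) + f / N). The function names and the sample value f = 0.9 are hypothetical choices for illustration.

```python
def amdahl_speedup(f, n):
    """Speedup on n processors when a fraction f of the code is parallelizable."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - f) + f / n)


def speedup_bound(f):
    """Limit of the speedup as n grows without bound: 1 / (1 - f)."""
    return float("inf") if f == 1.0 else 1.0 / (1.0 - f)


f = 0.9  # hypothetical parallelizable fraction
for n in (1, 2, 8, 64, 1024):
    print(f"N = {n:4d}: speedup = {amdahl_speedup(f, n):.2f}")
print(f"bound for f = {f}: {speedup_bound(f):.2f}")
```

For the assumed f = 0.9 the speedup approaches but never reaches 10, however many processors are added, which illustrates the diminishing returns noted above.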
