
Designing for Performance Raul Queiroz Feitosa

Objective
“In this section … we examine the most common approach to assessing processor and computer system performance” (W. Stallings)

Outline
• Performance Assessment
• Amdahl’s Law

Performance Assessment
Which one would you choose?

             Intel Xeon Platinum 8280L    EPYC 7601
Cache        38.5 MB                      64 MB
Frequency    2.7 GHz                      2.2 GHz
Cores        28                           32

Performance Assessment
What matters?
• Cost
• Size
• Reliability
• Security
• Power consumption
• Performance

Performance Assessment
Main CPU operations
• Fetch and decode instructions
• Load and store data
• Logic and arithmetic operations
  • Fixed-point
  • Floating-point

Performance Assessment
Performance factors
• Clock speed or clock rate (f): expressed in multiples of Hz.
• Clock cycle or clock tick: one increment, or pulse, of the clock.
• Clock time or cycle time (τ): the time between consecutive pulses, τ = 1/f.

Performance Assessment
Performance factors: clock speed
• An instruction usually requires multiple clock cycles.
• The amount of work done by one instruction varies considerably.
• Pipelining allows several instructions to execute simultaneously.
• So clock speed is not the whole story!

Performance Assessment
Performance factors: instruction execution rate
• Expressed in millions of instructions per second (MIPS) or millions of floating-point operations per second (MFLOPS).
• Heavily dependent on the instruction set, compiler design, processor implementation, and the cache and memory hierarchy.

Performance Assessment
Performance factors
• CPI: average number of cycles per instruction.
• I_i: number of machine instructions of type i executed by a program.
• CPI_i: number of cycles per instruction of type i.
• I_c: total number of machine instructions executed by a program.

$$I_c = \sum_{i=1}^{n} I_i \qquad CPI = \frac{\sum_{i=1}^{n} (CPI_i \times I_i)}{I_c}$$

Performance Assessment
Performance factors
• T: processor time needed to execute a program.

$$T = I_c \times CPI \times \tau$$

A refinement yields

$$T = I_c \times [p + (m \times k)] \times \tau$$

where
• p is the number of processor cycles needed to decode and execute the instruction,
• m is the number of memory references needed,
• k is the ratio between memory cycle time and processor cycle time.
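The formulas on the last two slides can be sketched in a few lines of Python. This is only an illustration: the function names and the sample values at the bottom are assumptions, not part of the slides.

```python
def average_cpi(counts, cpis):
    """CPI = sum(CPI_i * I_i) / I_c, where I_c = sum(I_i)."""
    total_instructions = sum(counts)
    total_cycles = sum(cpi * n for cpi, n in zip(cpis, counts))
    return total_cycles / total_instructions


def processor_time(instruction_count, cpi, clock_hz):
    """T = I_c * CPI * tau, with tau = 1 / f."""
    return instruction_count * cpi / clock_hz


def processor_time_refined(instruction_count, p, m, k, clock_hz):
    """T = I_c * (p + m * k) * tau."""
    return instruction_count * (p + m * k) / clock_hz


# Hypothetical program: 1 million instructions on a 100 MHz processor.
# Two instruction types with CPI 1 and 3, executed equally often.
cpi = average_cpi(counts=[500_000, 500_000], cpis=[1, 3])
print("average CPI:", cpi)
print("T (s):", processor_time(1_000_000, cpi, 100e6))
# Same program seen through the refinement: p = 1, m = 0.5, k = 2.
print("T refined (s):", processor_time_refined(1_000_000, 1, 0.5, 2, 100e6))
```

With these made-up numbers both forms agree, because p + m·k equals the average CPI of 2.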

Performance Assessment
Performance factors
System attributes affecting the performance factors (τ, I_c, p, m, k):
• Instruction set architecture
• Compiler technology
• Processor implementation
• Cache and memory hierarchy

Performance Assessment
Exercise 1
A program involves the execution of 2 million instructions on a 400 MHz processor. The CPI and the proportion of each of the four instruction types are given below. Compute the average CPI.

Instruction type              CPI    Instruction mix
Arithmetic and logic          1      60%
Load/store with cache hit     2      18%
Branch                        4      12%
Load/store with cache miss    8      10%

The average CPI is
CPI = 0.6 + (2 × 0.18) + (4 × 0.12) + (8 × 0.1) = 2.24
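As a check on Exercise 1, the sketch below recomputes the average CPI from the instruction mix. It also derives the MIPS rate and the processor time, which the slide does not ask for; they follow from the standard relations MIPS = f / (CPI × 10^6) and T = I_c × CPI / f. Variable names are illustrative.

```python
# (instruction type, CPI, fraction of the instruction mix) from the exercise
MIX = [
    ("Arithmetic and logic", 1, 0.60),
    ("Load/store with cache hit", 2, 0.18),
    ("Branch", 4, 0.12),
    ("Load/store with cache miss", 8, 0.10),
]
CLOCK_HZ = 400e6               # 400 MHz processor
INSTRUCTION_COUNT = 2_000_000  # 2 million instructions

# Weighted average of the per-type CPI values.
cpi = sum(type_cpi * fraction for _, type_cpi, fraction in MIX)

# Extra quantities, not requested by the slide.
mips = CLOCK_HZ / (cpi * 1e6)
exec_time_s = INSTRUCTION_COUNT * cpi / CLOCK_HZ

print(f"average CPI = {cpi:.2f}")
print(f"MIPS rate   = {mips:.1f}")
print(f"T           = {exec_time_s * 1e3:.1f} ms")
```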

Performance Assessment
Exercise 2
Consider two hardware implementations, M1 and M2, of the same instruction set. There are three instruction classes: F, I and N. The clock rate of M1 is 600 MHz. The clock cycle of M2 is 2 ns. The average CPI for each instruction class is given below.

Class    CPI of M1    CPI of M2    Comments
F        5.0          4.0          floating-point
I        2.0          3.8          integer
N        2.4          2.0          non-arithmetic

a) Compute the peak performance of M1 and M2 in MIPS.
b) If 50% of the instructions executed in a given program belong to class N and the others are equally distributed between F and I, which machine is faster, and by what factor?

Performance Assessment
Exercise 2 (continued)
c) A designer of M1 plans to change the design to improve performance. Assuming the instruction mix in (b), which of the options below would be most beneficial?
   1. Use an FPU twice as fast (CPI = 2.5 for class F).
   2. Add a second ALU to reduce the CPI for integer operations to 1.20.
   3. Use faster logic that allows a clock rate of 750 MHz while keeping the same CPI values.
d) The CPI values given above include a cache miss that occurs 5 times per 100 executed instructions. Each cache miss incurs a 10-cycle penalty. A fourth redesign option consists of using a larger instruction cache so as to reduce the miss ratio from 5% to 3%. Compare this alternative with the previous options.
e) Characterize the application programs that execute faster on M1 than on M2, i.e., discuss the instruction composition of such applications.
   Hint: let x, y and 1-x-y be the fractions of instructions belonging to classes F, I and N respectively.
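One possible way to work through parts (a) to (d) of Exercise 2 is sketched below. It rests on two assumptions that the slides do not state: "peak performance" is taken to mean that every instruction comes from the class with the lowest CPI, and the cache-miss penalty in (d) is taken to affect the average CPI uniformly (miss ratio × 10 cycles). All names are illustrative.

```python
# CPI per instruction class, from the exercise table.
CPI_M1 = {"F": 5.0, "I": 2.0, "N": 2.4}
CPI_M2 = {"F": 4.0, "I": 3.8, "N": 2.0}
CLOCK_M1_HZ = 600e6  # 600 MHz
CLOCK_M2_HZ = 500e6  # 2 ns cycle time, i.e. 1 / 2e-9 Hz

# Instruction mix of part (b): 50% N, the rest split evenly between F and I.
MIX = {"F": 0.25, "I": 0.25, "N": 0.50}


def average_cpi(cpi_by_class, mix):
    """Weighted average CPI for a given instruction mix."""
    return sum(cpi_by_class[c] * mix[c] for c in mix)


def mips(clock_hz, cpi):
    """MIPS rate = f / (CPI * 10^6)."""
    return clock_hz / (cpi * 1e6)


# (a) Peak MIPS, assuming only the cheapest instruction class is executed.
peak_m1 = mips(CLOCK_M1_HZ, min(CPI_M1.values()))
peak_m2 = mips(CLOCK_M2_HZ, min(CPI_M2.values()))

# (b) Both machines run the same instruction count, so the MIPS ratio
# equals the speed ratio.
mips_m1 = mips(CLOCK_M1_HZ, average_cpi(CPI_M1, MIX))
mips_m2 = mips(CLOCK_M2_HZ, average_cpi(CPI_M2, MIX))
speed_factor = mips_m1 / mips_m2

# (c) and (d) Redesign options for M1, all evaluated with the mix of (b).
base_cpi = average_cpi(CPI_M1, MIX)
options = {
    "1: faster FPU": mips(CLOCK_M1_HZ, average_cpi({**CPI_M1, "F": 2.5}, MIX)),
    "2: second ALU": mips(CLOCK_M1_HZ, average_cpi({**CPI_M1, "I": 1.2}, MIX)),
    "3: 750 MHz clock": mips(750e6, base_cpi),
    # Assumption: 2% fewer misses at 10 cycles each lowers the CPI by 0.2.
    "4: larger cache": mips(CLOCK_M1_HZ, base_cpi - (0.05 - 0.03) * 10),
}

print(f"peak MIPS: M1 = {peak_m1:.0f}, M2 = {peak_m2:.0f}")
print(f"M1 is faster than M2 by a factor of {speed_factor:.2f}")
for name, rate in options.items():
    print(f"option {name}: {rate:.1f} MIPS")
```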

Performance Assessment
Exercise 3
Consider two codes produced by two compilers for the same source program. The instructions of the machine that will execute these codes can be divided into class A (CPI = 1) and class B (CPI = 2). The number of executed instructions of each class is given below.

Class    Compiler 1    Compiler 2    Comments
A        600M          400M          CPI = 1
B        400M          400M          CPI = 2

a) Compute the execution time for both codes, assuming a clock rate of 1 GHz.
b) Which compiler produces the most efficient code, and by what factor?
c) Which code executes at the highest MIPS rate?
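Exercise 3 can be checked with a short calculation such as the one below. It is a sketch with illustrative names; "most efficient" is read here as "shortest execution time", which is an assumption.

```python
CLOCK_HZ = 1e9                  # 1 GHz
CLASS_CPI = {"A": 1, "B": 2}

# Executed instruction counts per class, from the exercise table.
COUNTS = {
    "compiler 1": {"A": 600e6, "B": 400e6},
    "compiler 2": {"A": 400e6, "B": 400e6},
}


def execution_time(counts):
    """Total cycles divided by the clock rate."""
    cycles = sum(CLASS_CPI[c] * n for c, n in counts.items())
    return cycles / CLOCK_HZ


def mips_rate(counts):
    """Millions of instructions executed per second."""
    instructions = sum(counts.values())
    return instructions / (execution_time(counts) * 1e6)


times = {name: execution_time(c) for name, c in COUNTS.items()}
rates = {name: mips_rate(c) for name, c in COUNTS.items()}

for name in COUNTS:
    print(f"{name}: T = {times[name]:.2f} s, {rates[name]:.1f} MIPS")
print(f"time ratio = {times['compiler 1'] / times['compiler 2']:.3f}")
```

The code from compiler 2 finishes sooner, yet the code from compiler 1 shows the higher MIPS rate, which illustrates why MIPS alone can be misleading.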

Performance Assessment
Benchmarks: motivation
A high-level language statement:

    A=B+C /* assume all quantities in main memory */

Compiled code on CISC:

    add mem(B),mem(C),mem(A)

Compiled code on RISC:

    load mem(B),reg(1);
    load mem(C),reg(2);
    add reg(1),reg(2),reg(3);
    store reg(3),mem(A);

Suppose both machines execute the same high-level code in the same time. The RISC machine executes four instructions in the time the CISC machine executes one, so if the CISC machine rates 1 MIPS, the RISC machine rates 4 MIPS.

Performance Assessment
Benchmarks: definition
• Programs designed to test performance
• Written in a high-level language, hence portable
• Represent a particular application or system programming area (systems, numerical, commercial)
• Easily measured and widely distributed
• The best-known collection of benchmark suites is maintained by the Standard Performance Evaluation Corporation (SPEC)
• The best known of the SPEC suites is CPU2017:
  • contains 43 benchmarks organized into four suites
  • includes an optional metric for measuring energy consumption

Performance Assessment
SPECspeed metric
• SPEC benchmarks are not concerned with instruction execution rates.
• A base runtime is defined for each benchmark using a reference machine.
• The speed metric is the ratio of the reference time to the system run time:

$$r_i = \frac{Tref_i}{Tsut_i}$$

• Tref_i: execution time of benchmark i on the reference machine
• Tsut_i: execution time of benchmark i on the system under test

Performance Assessment
SPECrate metric
• Measures the throughput, or rate, of a machine carrying out a number of tasks.
• Multiple copies of each benchmark run simultaneously.
• Typically, the number of copies equals the number of processors.
• The ratio is calculated as follows:

$$rate_i = \frac{N \times Tref_i}{Tsut_i}$$

• Tref_i: reference execution time for benchmark i
• N: number of copies running simultaneously
• Tsut_i: elapsed time from the start of execution of all N copies until the completion of all copies of the program
• Again, a geometric mean is calculated.

Performance Assessment
Averaging SPEC metrics
• For both SPECspeed and SPECrate, the selected ratios are averaged using the geometric mean, which is reported as the overall metric.
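The SPECspeed and SPECrate ratios and their geometric mean can be sketched as follows. The function names and the timing values are made up for illustration; they are not real SPEC results.

```python
import math


def spec_speed_ratio(t_ref, t_sut):
    """SPECspeed ratio: reference time over time on the system under test."""
    return t_ref / t_sut


def spec_rate_ratio(t_ref, t_sut, copies):
    """SPECrate ratio: N copies run at once, t_sut is the time for all to finish."""
    return copies * t_ref / t_sut


def geometric_mean(ratios):
    """n-th root of the product of n ratios."""
    return math.prod(ratios) ** (1.0 / len(ratios))


# Hypothetical timings in seconds for two benchmarks.
t_ref = [1000.0, 2000.0]
t_sut = [250.0, 1000.0]

speed_ratios = [spec_speed_ratio(r, s) for r, s in zip(t_ref, t_sut)]
rate_ratios = [spec_rate_ratio(r, s, copies=4) for r, s in zip(t_ref, t_sut)]

print("speed ratios:", speed_ratios)
print("overall speed metric:", geometric_mean(speed_ratios))
print("overall rate metric:", geometric_mean(rate_ratios))
```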

Performance Assessment
Exercise 4
The table below shows the execution times, in seconds, of two benchmarks on three different processors.

             Processor
Benchmark    X     Y     Z
1            20    10    40
2            40    80    20

a) Compute the arithmetic mean value for each system using X as the reference machine and then using Y as the reference machine.
b) Compute the geometric mean value for each system using X as the reference machine and then using Y as the reference machine.
Which is the most realistic result?
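A sketch for Exercise 4 follows. It assumes the normalization used for the SPEC ratios, that is, reference time divided by system time, so a larger value means a faster system. The exercise does not say which way to normalize, so this is one reading. Names are illustrative.

```python
import math

# Execution times in seconds for benchmarks 1 and 2, from the exercise table.
TIMES = {"X": [20, 40], "Y": [10, 80], "Z": [40, 20]}


def normalized_ratios(system, reference):
    """Per-benchmark ratio: reference time / system time (assumed convention)."""
    return [t_ref / t_sys for t_ref, t_sys in zip(TIMES[reference], TIMES[system])]


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return math.prod(values) ** (1.0 / len(values))


for reference in ("X", "Y"):
    print(f"reference machine: {reference}")
    for system in TIMES:
        ratios = normalized_ratios(system, reference)
        print(
            f"  {system}: arithmetic = {arithmetic_mean(ratios):.3f}, "
            f"geometric = {geometric_mean(ratios):.3f}"
        )
```

Under this convention the arithmetic means change the ranking when the reference machine changes, while the geometric means stay the same for either reference.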

Outline
• Performance Assessment
• Amdahl’s Law

Amdahl’s Law
Estimates the potential speedup of a program using multiple processors.
• A fraction f of the code is parallelizable with no scheduling overhead.
• A fraction (1 - f) of the code is inherently serial.
• T is the total execution time of the program on a single processor.
• N is the number of processors that fully exploit the parallel portions of the code.

$$Speedup = \frac{T}{T(1-f) + \frac{T f}{N}} = \frac{1}{(1-f) + \frac{f}{N}}$$

Amdahl’s Law
Conclusions
• Code needs to be parallelizable, and actually parallelized!
• When f is small, adding parallel processors has little effect.
• As N → ∞, the speedup is bounded by 1/(1 - f).
• Because the speedup is bounded, adding more processors gives diminishing returns.
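These conclusions can be seen numerically with a small sketch of Amdahl's law, Speedup = 1 / ((1 - f) + f/N). The function name and the sample values of f and N are illustrative choices.

```python
def amdahl_speedup(f, n):
    """Speedup on n processors when a fraction f of the code is parallelizable."""
    return 1.0 / ((1.0 - f) + f / n)


for f in (0.1, 0.5, 0.9):
    bound = 1.0 / (1.0 - f)  # limit of the speedup as n grows without bound
    row = ", ".join(f"N={n}: {amdahl_speedup(f, n):.2f}" for n in (2, 8, 64, 1024))
    print(f"f = {f}: {row} (bound {bound:.2f})")
```

For small f the speedup barely moves as N grows, and for any f it levels off below 1/(1 - f).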
