Lecture 2: Architectural Performance Laws and Rules of Thumb Prof. V. Catania Lab. Calcolatori

Measurement Tools
• Benchmarks, Traces, Mixes
• Cost, delay, area, power estimation
• Simulation (many levels)
  – ISA, RT, Gate, Circuit
• Queuing Theory
• Rules of Thumb
• Fundamental Laws

The Bottom Line: Performance (and Cost)
"X is n times faster than Y" means
$$\frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = n$$
• Speed of Concorde vs. Boeing 747
• Throughput of Boeing 747 vs. Concorde

Performance Terminology
“X is n% faster than Y” means:
$$\frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = 1 + \frac{n}{100}$$
$$n = \frac{100\,\bigl(\text{Performance}(X) - \text{Performance}(Y)\bigr)}{\text{Performance}(Y)}$$
Example: Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X?

Example
$$\frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{15}{10} = \frac{1.5}{1.0} = \frac{\text{Performance}(X)}{\text{Performance}(Y)}$$
$$n = \frac{100\,(1.5 - 1.0)}{1.0} = 50\%$$
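The "n% faster" definition can be checked with a few lines of Python. This is a minimal sketch; the function name `percent_faster` is an illustrative choice, not something from the slides. It uses the fact that performance is the reciprocal of execution time, so n = 100 × (ExTime(Y) − ExTime(X)) / ExTime(X).

```python
def percent_faster(extime_y: float, extime_x: float) -> float:
    """Return n such that X is n% faster than Y, given both execution times.

    Performance is 1 / execution time, so
    n = 100 * (Perf(X) - Perf(Y)) / Perf(Y) = 100 * (ExTime(Y) - ExTime(X)) / ExTime(X).
    """
    if extime_x <= 0 or extime_y <= 0:
        raise ValueError("execution times must be positive")
    return 100.0 * (extime_y - extime_x) / extime_x


# Slide example: Y takes 15 seconds, X takes 10 seconds.
n = percent_faster(extime_y=15.0, extime_x=10.0)
print(f"X is {n:.0f}% faster than Y")
```

Note that the result is not symmetric: swapping the two machines does not simply flip the sign of n, because the denominator changes.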

Amdahl's Law
MAKE THE COMMON CASE FAST!
The performance improvement that can be gained by making some activity faster is limited by the fraction of time in which that activity takes place.
SPEEDUP: a measure of how much faster a task runs on the ENHANCED machine

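Amdahl's Law
Speedup due to enhancement E:
$$\text{Speedup}(E) = \frac{\text{ExTime without } E}{\text{ExTime with } E} = \frac{\text{Performance with } E}{\text{Performance without } E}$$
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:
$$\text{ExTime}(E) = \text{ExTime}(\text{without } E) \times \left[(1 - F) + \frac{F}{S}\right]$$
$$\text{Speedup}(E) = \frac{1}{(1 - F) + \dfrac{F}{S}}$$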

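Amdahl’s Law
$$\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left[(1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}\right]$$
$$\text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$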
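A direct transcription of the formula into Python makes its limiting behavior easy to see. This is a sketch only; the function name `amdahl_speedup` and the sample values in the loop are illustrative assumptions. However large the local speedup becomes, the overall speedup never reaches 1 / (1 − Fraction enhanced).

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when `fraction_enhanced` of the original execution
    time is accelerated by a factor `speedup_enhanced` (Amdahl's Law)."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be in [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Illustrative sweep (values chosen for this sketch): half of the time is
# enhanced, so the overall speedup stays below 1 / (1 - 0.5) = 2.
for s in (2, 10, 100, 1_000_000):
    print(f"local speedup {s:>9}: overall {amdahl_speedup(0.5, s):.4f}")
```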

Amdahl’s Law
• Floating point instructions improved to run 2X; but only 10% of actual instructions are FP
$$\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left(0.9 + \frac{0.1}{2}\right) = 0.95 \times \text{ExTime}_{\text{old}}$$
$$\text{Speedup}_{\text{overall}} = \frac{1}{0.95} = 1.053$$

Amdahl's Law
• CPU speed improved by 5x
• CPU cost increased by 5x
• CPU is used 50% of the time (the other 50% is I/O)
• CPU cost = 1/3 of total computer cost
Evaluate the investment from a cost/performance viewpoint:
$$\text{Speedup} = \frac{1}{0.5 + \dfrac{0.5}{5}} = 1.67$$
$$\text{New cost} = \frac{2}{3} \times 1 + \frac{1}{3} \times 5 = 2.33 \text{ times the original cost}$$
Cost increase > performance improvement!
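The cost/performance comparison above can be reproduced numerically. The sketch below uses only the figures from the slide; the variable names, and the final ratio `cost_per_performance` (a value above 1 means the upgraded machine costs more per unit of performance), are assumptions added for illustration.

```python
# Figures from the slide.
cpu_time_fraction = 0.5      # CPU is busy 50% of the time, the rest is I/O
cpu_speedup = 5.0            # new CPU is 5x faster
cpu_cost_fraction = 1.0 / 3  # CPU is one third of the total machine cost
cpu_cost_factor = 5.0        # new CPU costs 5x as much

# Amdahl's Law: only the CPU-bound half of the time is accelerated.
speedup = 1.0 / ((1.0 - cpu_time_fraction) + cpu_time_fraction / cpu_speedup)

# New total cost relative to the original machine.
cost_ratio = (1.0 - cpu_cost_fraction) * 1.0 + cpu_cost_fraction * cpu_cost_factor

# Relative cost per unit of performance (assumed metric for this sketch).
cost_per_performance = cost_ratio / speedup

print(f"speedup    = {speedup:.2f}")
print(f"cost ratio = {cost_ratio:.2f}")
print(f"cost/performance vs. original = {cost_per_performance:.2f}")
```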

Amdahl's Law
• FPSQR operations account for 20% of execution time
• FP operations account for 50% of execution time
Alternative enhancements:
1. Implement FPSQR operations in hardware, with a speedup of 10
2. Make ALL FP operations run 2x faster, at the same cost as alternative 1
Comparison:
$$\text{Speedup}_{\text{FPSQR}} = \frac{1}{(1 - 0.2) + \dfrac{0.2}{10}} = 1.22$$
$$\text{Speedup}_{\text{FP}} = \frac{1}{(1 - 0.5) + \dfrac{0.5}{2}} = 1.33$$
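The comparison can be sketched in Python, together with one extra observation that follows from the same formula: since FPSQR covers only 20% of the time, even an infinitely fast FPSQR unit is bounded by 1 / (1 − 0.2) = 1.25, which is still below the 1.33 of the second alternative. The dictionary keys and helper name are illustrative assumptions.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup according to Amdahl's Law."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# (fraction of execution time affected, local speedup), taken from the slide.
alternatives = {
    "FPSQR in hardware": (0.2, 10.0),
    "all FP 2x faster": (0.5, 2.0),
}

results = {name: amdahl_speedup(f, s) for name, (f, s) in alternatives.items()}
best = max(results, key=results.get)

# Upper bound for the FPSQR alternative as its local speedup grows without limit.
fpsqr_upper_bound = 1.0 / (1.0 - 0.2)

for name, value in results.items():
    print(f"{name}: {value:.2f}")
print(f"best alternative: {best}")
print(f"FPSQR upper bound: {fpsqr_upper_bound:.2f}")
```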

Corollary: Make The Common Case Fast
• All instructions require an instruction fetch, only a fraction require a data fetch/store.
  – Optimize instruction access over data access
• Programs exhibit locality
  – Spatial Locality
  – Temporal Locality
• Access to small memories is faster
  – Provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories.
Storage hierarchy: Reg's → Cache → Memory → Disk / Tape

Amdahl's Law
• Cache memory is 5x faster than main memory
• 90% of CPU time is spent in a fraction of the code that could be put in cache
What is the overall speedup using the cache?
$$\text{Speedup} = \frac{1}{(1 - \text{fraction of time cache can be used}) + \dfrac{\text{fraction of time cache can be used}}{\text{Speedup using cache}}}$$
$$\text{Speedup} = \frac{1}{(1 - 0.9) + \dfrac{0.9}{5}} \approx 3.6$$

Occam's Toothbrush
• The simple case is usually the most frequent and the easiest to optimize!
• Do simple, fast things in hardware and be sure the rest can be handled correctly in software

Metrics of Performance
System levels, top to bottom, with the metric typically used at each:
• Application: answers per month, operations per second
• Programming Language
• Compiler
• ISA: (millions) of instructions per second (MIPS); (millions) of (FP) operations per second (MFLOP/s)
• Datapath, Control: megabytes per second
• Function Units, Transistors, Wires, Pins: cycles per second (clock rate)

Cycles Per Instruction
$$\text{CPU time} = \text{CK cycles for a program} \times T_{CK}$$
Average Cycles per Instruction:
$$CPI = \frac{\text{CK cycles for a program}}{\text{Instruction Count}} = \frac{\text{CPU time} \times \text{CK rate}}{\text{Instruction Count}}$$
Instruction Frequency:
$$CPI = \sum_{i=1}^{n} CPI_i \times F_i \quad \text{where} \quad F_i = \frac{I_i}{\text{Instruction Count}}$$
$$\text{CPU time} = IC \times CPI \times T_{CK} = IC \times T_{CK} \times \sum_{i=1}^{n} CPI_i \times F_i$$
NB: CPI_i should be measured, not just derived from the CPU reference manual (it must include cache misses, etc.)
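The two formulas compose naturally in code. The following is a minimal sketch: the function names, the three-class instruction mix, the instruction count, and the 1 ns cycle time are all hypothetical values chosen for illustration, not data from the lecture.

```python
def average_cpi(mix: list[tuple[float, float]]) -> float:
    """Weighted CPI from (frequency F_i, cycles CPI_i) pairs.

    The frequencies must sum to 1.
    """
    total_freq = sum(freq for freq, _ in mix)
    if abs(total_freq - 1.0) > 1e-9:
        raise ValueError(f"frequencies sum to {total_freq}, expected 1.0")
    return sum(freq * cycles for freq, cycles in mix)


def cpu_time(instruction_count: float, cpi: float, cycle_time_s: float) -> float:
    """CPU time = IC x CPI x T_CK."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical mix: 50% 1-cycle ops, 30% 2-cycle ops, 20% 4-cycle ops.
mix = [(0.5, 1), (0.3, 2), (0.2, 4)]
cpi = average_cpi(mix)

# Hypothetical program: one million instructions on a 1 ns clock.
t = cpu_time(instruction_count=1_000_000, cpi=cpi, cycle_time_s=1e-9)
print(f"CPI = {cpi:.2f}, CPU time = {t * 1e3:.2f} ms")
```

In practice the per-class cycle counts would come from measurement, as the slide's note says, so that effects such as cache misses are included.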

Aspects of CPU Performance
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

| | Inst Count | CPI | Clock Rate |
|---|---|---|---|
| Program | X | | |
| Compiler | X | (X) | |
| Inst. Set | X | X | |
| Organization | | X | X |
| Technology | | | X |

CPI Example
Base Machine (A): COMPARE + BRANCH → 2 separate instructions

| OP | Freq. | Cycles | CPI(i) |
|---|---|---|---|
| BRANCH | 20% | 2 | 0.4 |
| COMP | 20% | 1 | 0.2 |
| Others | 60% | 1 | 0.6 |
| Total | 100% | | 1.2 |

New Machine (B): COMPARE + BRANCH → 1 integrated instruction, with $T_{CK,B} = 1.25\,T_{CK,A}$

| OP | Freq. | Cycles |
|---|---|---|
| BRANCH | ? | 2 |
| Others | ? | 1 |
| Total | 100% | |

Which machine is faster?

CPI Example
$$\text{CPU time}_A = IC_A \times 1.2 \times T_{CK,A}$$
Machine B:
$$IC_B = IC_A - 20\%\,IC_A = 0.8\,IC_A$$
$$\text{Branch freq.} = \frac{20\%\,IC_A}{80\%\,IC_A} = 25\%$$

| OP | Freq. | Cycles | CPI(i) |
|---|---|---|---|
| BRANCH | 25% | 2 | 0.5 |
| Others | 75% | 1 | 0.75 |
| Total | 100% | | 1.25 |

$$\text{CPU time}_B = IC_B \times CPI_B \times 1.25\,T_{CK,A} = 0.8\,IC_A \times 1.25 \times 1.25\,T_{CK,A} = 1.25 \times IC_A \times T_{CK,A}$$
CPU A is FASTER than CPU B
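The comparison can be replayed step by step. In this sketch the instruction count and cycle time of machine A are both normalized to 1, which is an assumption made purely for convenience, so the two results are relative CPU times.

```python
# Machine A: (frequency, cycles) per instruction class, from the slide.
mix_a = {"branch": (0.20, 2), "compare": (0.20, 1), "others": (0.60, 1)}
cpi_a = sum(freq * cycles for freq, cycles in mix_a.values())

# Machine B fuses each compare into its branch, so every compare disappears.
ic_b = 1.0 - mix_a["compare"][0]                 # relative to IC_A = 1
branch_freq_b = mix_a["branch"][0] / ic_b        # renormalized branch frequency
cpi_b = branch_freq_b * 2 + (1.0 - branch_freq_b) * 1

# Relative CPU times, with IC_A = 1 and T_CK_A = 1; B's clock is 25% slower.
time_a = 1.0 * cpi_a * 1.0
time_b = ic_b * cpi_b * 1.25

print(f"CPI_A = {cpi_a:.2f}, CPI_B = {cpi_b:.2f}")
print(f"time_A = {time_a:.2f}, time_B = {time_b:.2f}")
print("A is faster" if time_a < time_b else "B is faster")
```

Machine B executes fewer instructions, but its higher CPI and slower clock more than cancel that gain.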

Marketing Metrics
$$\text{MIPS} = \frac{\text{Instruction Count}}{\text{Time} \times 10^6} = \frac{\text{Clock Rate}}{CPI \times 10^6}$$
• Machines with different instruction sets?
• Programs with different instruction mixes?
  – Dynamic frequency of instructions
• Uncorrelated with performance
$$\text{MFLOP/s} = \frac{\text{FP Operations}}{\text{Time} \times 10^6}$$
• Machine dependent
• Often not where time is spent
Normalized FP operation weights:

| Operation | Weight |
|---|---|
| add, sub, compare, mult | 1 |
| divide, sqrt | 4 |
| exp, sin, . . . | 8 |
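A small numeric sketch shows why MIPS can be uncorrelated with performance when instruction sets differ. All figures below (a 100 MHz clock, the instruction counts, and the CPI values for the two machines) are hypothetical and chosen only to illustrate the point.

```python
def mips(clock_rate_hz: float, cpi: float) -> float:
    """MIPS = Clock Rate / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


def exec_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """Execution time in seconds = IC x CPI / Clock Rate."""
    return instruction_count * cpi / clock_rate_hz


CLOCK_HZ = 100e6  # hypothetical clock rate shared by both machines

# Hypothetical machine 1: simple instructions, so more of them are needed.
mips_1 = mips(CLOCK_HZ, cpi=1.0)
time_1 = exec_time(10e6, cpi=1.0, clock_rate_hz=CLOCK_HZ)

# Hypothetical machine 2: richer instructions, half the count, higher CPI.
mips_2 = mips(CLOCK_HZ, cpi=1.5)
time_2 = exec_time(5e6, cpi=1.5, clock_rate_hz=CLOCK_HZ)

print(f"machine 1: {mips_1:.1f} MIPS, {time_1 * 1e3:.1f} ms")
print(f"machine 2: {mips_2:.1f} MIPS, {time_2 * 1e3:.1f} ms")
```

Machine 1 has the higher MIPS rating yet takes longer to run the same program, because MIPS ignores how much work each instruction does.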

Cycles Per Instruction
“Average Cycles per Instruction”
$$CPI = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Cycles}}{\text{Instruction Count}}$$
$$\text{CPU time} = \text{CycleTime} \times \sum_{i=1}^{n} CPI_i \times I_i$$
“Instruction Frequency”
$$CPI = \sum_{i=1}^{n} CPI_i \times F_i \quad \text{where} \quad F_i = \frac{I_i}{\text{Instruction Count}}$$
Invest resources where time is spent!

Organizational Trade-offs
• System levels, top to bottom: Application, Programming Language, Compiler, ISA, Datapath, Control, Function Units, Transistors, Wires, Pins
• Trade-off quantities, top to bottom of the stack: Instruction Mix, CPI, Cycle Time

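Example: Calculating CPI
Base Machine (Reg / Reg), typical mix:

| Op | Freq | Cycles | CPI(i) | (% Time) |
|---|---|---|---|---|
| ALU | 50% | 1 | .5 | (33%) |
| Load | 20% | 2 | .4 | (27%) |
| Store | 10% | 2 | .2 | (13%) |
| Branch | 20% | 2 | .4 | (27%) |
| Total | | | 1.5 | |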
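The "% Time" column is each class's CPI contribution divided by the total CPI. A short sketch using the table's own numbers (the variable names are illustrative):

```python
# (frequency, cycles) per instruction class, from the base machine table.
mix = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}

# Per-class CPI contribution and total CPI.
contribution = {op: freq * cycles for op, (freq, cycles) in mix.items()}
cpi = sum(contribution.values())

# Share of execution time spent in each class.
time_share = {op: c / cpi for op, c in contribution.items()}

print(f"CPI = {cpi:.1f}")
for op, share in time_share.items():
    print(f"{op:<6} CPI(i) = {contribution[op]:.1f}  time = {share:.0%}")
```

ALU operations are half of all instructions but only a third of the time, which is why frequency alone is a poor guide to where optimization pays off.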

Example
Add register / memory operations:
  – One source operand in memory
  – One source operand in register
  – Cycle count of 2
Branch cycle count to increase to 3.
What fraction of the loads (in the base machine) must be eliminated for this to pay off?
Base Machine (Reg / Reg), typical mix:

| Op | Freq | Cycles |
|---|---|---|
| ALU | 50% | 1 |
| Load | 20% | 2 |
| Store | 10% | 2 |
| Branch | 20% | 2 |

Example Solution
Exec Time = Instr Cnt x CPI x Clock
Let X be the fraction of base-machine instructions that become Reg/Mem operations; each one replaces one ALU operation and one load.

| Op | Old Freq | Old Cycles | Old Freq x Cycles | New Freq | New Cycles | New Freq x Cycles |
|---|---|---|---|---|---|---|
| ALU | .50 | 1 | .5 | .5 – X | 1 | .5 – X |
| Load | .20 | 2 | .4 | .2 – X | 2 | .4 – 2X |
| Store | .10 | 2 | .2 | .1 | 2 | .2 |
| Branch | .20 | 2 | .4 | .2 | 3 | .6 |
| Reg/Mem | | | | X | 2 | 2X |
| Total | 1.00 | | 1.5 | 1 – X | | 1.7 – X |

CPI New = Cycles New / Instructions New = (1.7 – X)/(1 – X): the new cycle total must be normalized to the new instruction count.
Instr Cnt Old x CPI Old x Clock Old = Instr Cnt New x CPI New x Clock New
1.00 x 1.5 = (1 – X) x (1.7 – X)/(1 – X)
1.5 = 1.7 – X
X = 0.2
ALL loads must be eliminated for this to be a win!
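The break-even point can be double-checked numerically. This sketch assumes, as the solution does, that the clock is unchanged and that each Reg/Mem operation replaces exactly one ALU operation and one load; the function name `new_total_cycles` is illustrative. Cycle totals are expressed per original instruction.

```python
def new_total_cycles(x: float) -> float:
    """Cycles per original instruction on the new machine, where x is the
    fraction of original instructions converted to Reg/Mem operations."""
    alu = (0.5 - x) * 1      # each Reg/Mem op removes one ALU op
    load = (0.2 - x) * 2     # ... and one load
    store = 0.1 * 2
    branch = 0.2 * 3         # branches now take 3 cycles
    reg_mem = x * 2          # new Reg/Mem ops take 2 cycles
    return alu + load + store + branch + reg_mem


OLD_CYCLES = 1.5  # base machine: 1.00 x CPI of 1.5

# The total is linear in x, so two samples determine the break-even point.
c0 = new_total_cycles(0.0)
c1 = new_total_cycles(0.1)
slope = (c1 - c0) / 0.1
break_even_x = (OLD_CYCLES - c0) / slope

print(f"cycles with no loads removed: {c0:.2f}")
print(f"break-even X = {break_even_x:.2f} (base load frequency is 0.20)")
```

Because loads make up only 20% of the base mix, X cannot exceed 0.2, so the change merely breaks even when every load is folded into a Reg/Mem operation.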
