Unit 4: Performance & Benchmarking

CIS 501: Computer Architecture
Unit 4: Performance & Benchmarking
Slides developed by Milo Martin & Amir Roth at the University of Pennsylvania, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood
CIS 501: Comp. Arch. | Prof. Milo Martin | Performance

This Unit
• Metrics
  • Latency and throughput
  • Speedup
  • Averaging
• CPU Performance
• Performance Pitfalls
• Benchmarking

Performance Metrics

Performance: Latency vs. Throughput
• Latency (execution time): time to finish a fixed task
• Throughput (bandwidth): number of tasks in fixed time
• Different: exploit parallelism for throughput, not latency (e.g., bread)
• Often contradictory (latency vs. throughput)
  • Will see many examples of this
• Choose definition of performance that matches your goals
  • Scientific program? Latency. Web server? Throughput.
• Example: move people 10 miles
  • Car: capacity = 5, speed = 60 miles/hour
  • Bus: capacity = 60, speed = 20 miles/hour
  • Latency: car = 10 min, bus = 30 min
  • Throughput: car = 15 people per hour (PPH, counting the return trip), bus = 60 PPH
• Fastest way to send 10TB of data? (1+ gbits/second)
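The car/bus numbers above can be reproduced with a short sketch. The function names are illustrative (not from the slides), and the throughput calculation assumes, as the slide does, that each vehicle must make an empty return trip before carrying the next load.

```python
def latency_minutes(distance_miles, speed_mph):
    """Time for a single one-way trip, in minutes."""
    return distance_miles * 60 / speed_mph


def throughput_pph(capacity, distance_miles, speed_mph):
    """People delivered per hour, counting the empty return trip.

    One round trip covers 2 * distance, so the vehicle completes
    speed / (2 * distance) round trips per hour.
    """
    return capacity * speed_mph / (2 * distance_miles)


car_latency = latency_minutes(10, 60)       # 10 minutes
bus_latency = latency_minutes(10, 20)       # 30 minutes
car_throughput = throughput_pph(5, 10, 60)  # 15 people per hour
bus_throughput = throughput_pph(60, 10, 20) # 60 people per hour

print(f"latency:    car = {car_latency:.0f} min, bus = {bus_latency:.0f} min")
print(f"throughput: car = {car_throughput:.0f} PPH, bus = {bus_throughput:.0f} PPH")
```

The car wins on latency while the bus wins on throughput, which is the point of the example: the two metrics can rank the same systems in opposite order.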

Amazon Does This…

Comparing Performance - Speedup
• A is X times faster than B if
  • X = Latency(B)/Latency(A) (divide by the faster)
  • X = Throughput(A)/Throughput(B) (divide by the slower)
• A is X% faster than B if
  • Latency(A) = Latency(B) / (1+X/100)
  • Throughput(A) = Throughput(B) * (1+X/100)
• Car/bus example
  • Latency? Car is 3 times (and 200%) faster than bus
  • Throughput? Bus is 4 times (and 300%) faster than car

Speedup and % Increase and Decrease
• Program A runs for 200 cycles
• Program B runs for 350 cycles
• Percent increase and decrease are not the same
  • % increase: ((350 - 200)/200) * 100 = 75%
  • % decrease: ((350 - 200)/350) * 100 = 42.9%
• Speedup:
  • 350/200 = 1.75, so program A is 1.75x faster than program B
  • As a percentage: (1.75 - 1) * 100 = 75%
• If program C is 1x faster than A, how many cycles does C run for? 200 (the same as A)
• What if C is 1.5x faster? 133 cycles (50% faster than A)

Mean (Average) Performance Numbers
• Arithmetic: (1/N) * ∑_{P=1..N} Latency(P)
  • For units that are proportional to time (e.g., latency)
• Harmonic: N / ∑_{P=1..N} (1/Throughput(P))
  • For units that are inversely proportional to time (e.g., throughput)
• You can add latencies, but not throughputs
  • Latency(P1+P2,A) = Latency(P1,A) + Latency(P2,A)
  • Throughput(P1+P2,A) != Throughput(P1,A) + Throughput(P2,A)
  • 1 mile @ 30 miles/hour + 1 mile @ 90 miles/hour
  • Average is not 60 miles/hour
• Geometric: (∏_{P=1..N} Speedup(P))^(1/N), i.e., the N-th root of the product
  • For unitless quantities (e.g., speedup ratios)
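The speedup and percentage definitions above can be checked with a minimal sketch. The helper names are assumptions made for illustration; the cycle counts are the ones from the Program A / Program B example.

```python
def speedup(latency_slower, latency_faster):
    """How many times faster the second system is (divide by the faster)."""
    return latency_slower / latency_faster


def percent_faster(x):
    """Convert an 'X times faster' ratio into 'X% faster'."""
    return (x - 1) * 100


def percent_change(old, new):
    """Signed percent change going from old to new, relative to old."""
    return (new - old) / old * 100


cycles_a, cycles_b = 200, 350

a_over_b = speedup(cycles_b, cycles_a)          # 1.75x
a_percent = percent_faster(a_over_b)            # 75% faster
increase = percent_change(cycles_a, cycles_b)   # +75% going from A to B
decrease = percent_change(cycles_b, cycles_a)   # about -42.9% going from B to A

# A program C that is 1.5x faster than A runs for 200 / 1.5 cycles.
cycles_c = cycles_a / 1.5                       # about 133 cycles

print(f"A is {a_over_b:.2f}x ({a_percent:.0f}%) faster than B")
print(f"A -> B: {increase:+.1f}%   B -> A: {decrease:+.1f}%")
print(f"C runs for about {cycles_c:.0f} cycles")
```

The asymmetry between the two percentages comes only from the choice of denominator: the same 150-cycle gap is measured against 200 in one direction and against 350 in the other.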

For Example…
• You drive two miles
  • 30 miles per hour for the first mile
  • 90 miles per hour for the second mile
• Question: what was your average speed?
  • Hint: the answer is not 60 miles per hour
  • Why?
• Would the answer be different if each segment was equal time (versus equal distance)?

Answer
• 0.03333 hours per mile for 1 mile
• 0.01111 hours per mile for 1 mile
• 0.02222 hours per mile on average
• = 45 miles per hour

CPU Performance
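The three means, and the driving example, can be sketched as follows. The functions are written out by hand for clarity, and the speedup list at the end is hypothetical data chosen only to illustrate the geometric mean.

```python
import math


def arithmetic_mean(values):
    """For quantities proportional to time (e.g., latencies)."""
    return sum(values) / len(values)


def harmonic_mean(values):
    """For quantities inversely proportional to time (e.g., throughputs, speeds)."""
    return len(values) / sum(1.0 / v for v in values)


def geometric_mean(values):
    """For unitless ratios (e.g., speedups); the N-th root of the product."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


speeds_mph = [30, 90]

# Equal distance at each speed: time adds up, so the harmonic mean applies.
avg_equal_distance = harmonic_mean(speeds_mph)   # about 45 mph

# Equal time at each speed: distance adds up, so the arithmetic mean applies.
avg_equal_time = arithmetic_mean(speeds_mph)     # 60 mph

# Hypothetical speedups: 2x on one program, 0.5x on another, cancel out.
hypothetical_speedups = [2.0, 0.5]
overall_speedup = geometric_mean(hypothetical_speedups)  # about 1.0

print(f"equal distance: {avg_equal_distance:.1f} mph")
print(f"equal time:     {avg_equal_time:.1f} mph")
print(f"geometric mean of speedups: {overall_speedup:.2f}")
```

This also addresses the follow-up question on the slide: with equal-time segments the average speed is 60 miles per hour, because miles per hour is then being averaged over hours rather than over miles.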

Recall: CPU Performance Equation
• Multiple aspects to performance: helps to isolate them
• Latency = seconds / program = (insns / program) * (cycles / insn) * (seconds / cycle)
• Insns / program: dynamic insn count
  • Impacted by program, compiler, ISA
• Cycles / insn: CPI
  • Impacted by program, compiler, ISA, micro-arch
• Seconds / cycle: clock period (the reciprocal of clock frequency, which is measured in Hz)
  • Impacted by micro-arch, technology
• For low latency (better performance) minimize all three
  – Difficult: often pull against one another
• Example we have seen: RISC vs. CISC ISAs
  ± RISC: low CPI/clock period, high insn count
  ± CISC: low insn count, high CPI/clock period

Cycles per Instruction (CPI)
• CPI: cycles per instruction, on average
  • IPC = 1/CPI
  • Used more frequently than CPI
  • Favored because "bigger is better", but harder to compute with
• Different instructions have different cycle costs
  • E.g., "add" typically takes 1 cycle, "divide" takes >10 cycles
• Depends on relative instruction frequencies
• CPI example
  • A program executes equal parts integer, floating point (FP), and memory ops
  • Cycles per instruction type: integer = 1, memory = 2, FP = 3
  • What is the CPI? (1/3 * 1) + (1/3 * 2) + (1/3 * 3) = 2
• Caveat: this sort of calculation ignores many effects
  • Back-of-the-envelope arguments only

CPI Example
• Assume a processor with instruction frequencies and costs
  • Integer ALU: 50%, 1 cycle
  • Load: 20%, 5 cycles
  • Store: 10%, 1 cycle
  • Branch: 20%, 2 cycles
• Which change would improve performance more?
  • A. "Branch prediction" to reduce branch cost to 1 cycle?
  • B. Faster data memory to reduce load cost to 3 cycles?
• Compute CPI
  • Base = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*2 = 2 CPI
  • A = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*1 = 1.8 CPI (1.11x or 11% faster)
  • B = 0.5*1 + 0.2*3 + 0.1*1 + 0.2*2 = 1.6 CPI (1.25x or 25% faster)
  • B is the winner

Measuring CPI
• How are CPI and execution time actually measured?
  • Execution time? Stopwatch timer (Unix "time" command)
  • CPI = (CPU time * clock frequency) / dynamic insn count
  • How is dynamic instruction count measured?
• More useful is CPI breakdown (CPI_CPU, CPI_MEM, etc.)
  • So we know what performance problems are and what to fix
• Hardware event counters
  • Available in most processors today
  • One way to measure dynamic instruction count
  • Calculate CPI using counter frequencies / known event costs
• Cycle-level micro-architecture simulation
  + Measure exactly what you want … and impact of potential fixes!
  • Method of choice for many micro-architects
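The CPI example and the measurement formula above can be worked through in a short sketch. The instruction mix and cycle costs are taken from the CPI Example slide; the function names, and the CPU time, clock frequency, and instruction count used at the end, are hypothetical values chosen for illustration. As the slides caution, this is a back-of-the-envelope model that ignores many effects.

```python
def weighted_cpi(mix):
    """Average CPI: sum of (frequency * cycle cost) over instruction types."""
    return sum(freq * cycles for freq, cycles in mix.values())


def measured_cpi(cpu_time_s, clock_hz, dynamic_insn_count):
    """CPI = (CPU time * clock frequency) / dynamic instruction count."""
    return cpu_time_s * clock_hz / dynamic_insn_count


# instruction type -> (frequency, cycles), from the CPI Example slide
base = {"alu": (0.5, 1), "load": (0.2, 5), "store": (0.1, 1), "branch": (0.2, 2)}

# Option A: branch prediction reduces branch cost to 1 cycle.
option_a = dict(base, branch=(0.2, 1))
# Option B: faster data memory reduces load cost to 3 cycles.
option_b = dict(base, load=(0.2, 3))

cpi_base = weighted_cpi(base)   # about 2.0
cpi_a = weighted_cpi(option_a)  # about 1.8
cpi_b = weighted_cpi(option_b)  # about 1.6

# With instruction count and clock period unchanged, speedup = CPI ratio.
speedup_a = cpi_base / cpi_a    # about 1.11x
speedup_b = cpi_base / cpi_b    # about 1.25x

print(f"base CPI = {cpi_base:.2f}")
print(f"A: CPI = {cpi_a:.2f}, speedup = {speedup_a:.2f}x")
print(f"B: CPI = {cpi_b:.2f}, speedup = {speedup_b:.2f}x")

# Hypothetical measurement: 2.0 s of CPU time at 3 GHz for 4 billion insns.
example_cpi = measured_cpi(cpu_time_s=2.0, clock_hz=3e9, dynamic_insn_count=4e9)
print(f"measured CPI (hypothetical run) = {example_cpi:.2f}")
```

Comparing CPI ratios directly is only valid because both options leave the dynamic instruction count and the clock period untouched; a change that altered either would need the full three-term performance equation.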
