
CIS 371: Computer Organization and Design
Unit 7: Performance Metrics
Based on slides by Prof. Amir Roth & Prof. Milo Martin


This Unit
• CPU performance equation
• Clock vs. CPI
• Performance metrics
• Benchmarking
(Figure: system stack of App / App / App, System software, and Mem / CPU / I/O)

Readings
• P&H: revisit Chapters 1.4, 1.8, and 1.9

As You Get Settled…
• You drive two miles
  • 30 miles per hour for the first mile
  • 90 miles per hour for the second mile
• Question: what was your average speed?
  • Hint: the answer is not 60 miles per hour
  • Why?
  • Would the answer be different if each segment was equal time (versus equal distance)?
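The warm-up question can be checked with a few lines of Python. This is a minimal sketch (the helper functions and their names are mine, not part of the slides); it works through the answer given on the next slide and also covers the equal-time variant:

```python
# Minimal sketch of the warm-up question; helper names are mine.
# Average speed = total distance / total time, not the average of the speeds.

def avg_speed_equal_distance(speeds_mph, miles_each=1.0):
    """Each segment covers the same distance, at a different speed."""
    total_miles = miles_each * len(speeds_mph)
    total_hours = sum(miles_each / s for s in speeds_mph)
    return total_miles / total_hours          # harmonic mean of the speeds

def avg_speed_equal_time(speeds_mph, hours_each=1.0):
    """Each segment lasts the same amount of time."""
    total_hours = hours_each * len(speeds_mph)
    total_miles = sum(s * hours_each for s in speeds_mph)
    return total_miles / total_hours          # arithmetic mean of the speeds

print(avg_speed_equal_distance([30, 90]))     # 45.0 mph (worked out on the next slide)
print(avg_speed_equal_time([30, 90]))         # 60.0 mph (equal-time segments do average to 60)
```

The equal-distance case is a harmonic mean of the speeds, while the equal-time case is an arithmetic mean, which is why the two answers differ.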

Answer
• You drive two miles
  • 30 miles per hour for the first mile
  • 90 miles per hour for the second mile
• Question: what was your average speed?
  • Hint: the answer is not 60 miles per hour
• 0.03333 hours per mile for the first mile
• 0.01111 hours per mile for the second mile
• 0.02222 hours per mile on average = 45 miles per hour

Reasoning About Performance

Recall: Latency vs. Throughput
• Latency (execution time): time to finish a fixed task
• Throughput (bandwidth): number of tasks in a fixed time
  • Different: exploit parallelism for throughput, not latency (e.g., bread)
• Often contradictory (latency vs. throughput)
  • Will see many examples of this
• Choose the definition of performance that matches your goals
  • Scientific program? Latency. Web server? Throughput.
• Example: move people 10 miles
  • Car: capacity = 5, speed = 60 miles/hour
  • Bus: capacity = 60, speed = 20 miles/hour
  • Latency: car = 10 min, bus = 30 min
  • Throughput: car = 15 PPH (counting the return trip), bus = 60 PPH
• Fastest way to send 1TB of data? (when networks run at 100+ Mbits/second)

Comparing Performance
• A is X times faster than B if
  • Latency(A) = Latency(B) / X
  • Throughput(A) = Throughput(B) * X
• A is X% faster than B if
  • Latency(A) = Latency(B) / (1 + X/100)
  • Throughput(A) = Throughput(B) * (1 + X/100)
• Car/bus example
  • Latency? Car is 3 times (and 200%) faster than the bus
  • Throughput? Bus is 4 times (and 300%) faster than the car
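As a quick check of the "X times faster" and "X% faster" definitions, here is a small sketch applying them to the car/bus numbers from the slide (the function names are mine):

```python
# Sketch of the "X times faster" / "X% faster" definitions, applied to the
# car/bus numbers from the slide; function names are mine.

def times_faster(latency_a, latency_b):
    """A is X times faster than B if Latency(A) = Latency(B) / X."""
    return latency_b / latency_a

def percent_faster(latency_a, latency_b):
    """A is X% faster than B if Latency(A) = Latency(B) / (1 + X/100)."""
    return (latency_b / latency_a - 1) * 100

# Latency: car = 10 min, bus = 30 min
print(times_faster(10, 30), percent_faster(10, 30))          # 3.0 times, 200.0% faster

# Throughput: car = 15 PPH, bus = 60 PPH; higher throughput means less time
# per person, so feed the reciprocals in as "latencies"
print(times_faster(1 / 60, 1 / 15), percent_faster(1 / 60, 1 / 15))  # 4.0 times, 300.0% faster
```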

CPI Example
• Assume a processor with these instruction frequencies and costs
  • Integer ALU: 50%, 1 cycle
  • Load: 20%, 5 cycles
  • Store: 10%, 1 cycle
  • Branch: 20%, 2 cycles
• Which change would improve performance more?
  • A. "Branch prediction" to reduce branch cost to 1 cycle?
  • B. Faster data memory to reduce load cost to 3 cycles?
• Compute CPI
  • Base = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*2 = 2 CPI
  • A = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*1 = 1.8 CPI (1.11x or 11% faster)
  • B = 0.5*1 + 0.2*3 + 0.1*1 + 0.2*2 = 1.6 CPI (1.25x or 25% faster)
  • B is the winner

Mean (Average) Performance Numbers
• Arithmetic: (1/N) * Σ(P=1..N) Latency(P)
  • For units that are proportional to time (e.g., latency)
  • You can add latencies, but not throughputs
    • Latency(P1+P2, A) = Latency(P1, A) + Latency(P2, A)
    • Throughput(P1+P2, A) != Throughput(P1, A) + Throughput(P2, A)
    • 1 mile @ 30 miles/hour + 1 mile @ 90 miles/hour: the average is not 60 miles/hour
• Harmonic: N / Σ(P=1..N) (1/Throughput(P))
  • For units that are inversely proportional to time (e.g., throughput)
• Geometric: (Π(P=1..N) Speedup(P))^(1/N)
  • For unitless quantities (e.g., speedups)
  • (All three means are sketched in code after this group of slides)

Processor Performance and Workloads
• Q: what does the performance of a chip mean?
• A: nothing; there must be some associated workload
  • Workload: set of tasks someone (you) cares about
• Benchmarks: standard workloads
  • Used to compare performance across machines
  • Either are, or are highly representative of, actual programs people run
• Micro-benchmarks: non-standard non-workloads
  • Tiny programs used to isolate certain aspects of performance
  • Not representative of the complex behaviors of real applications
  • Examples: binary tree search, towers-of-hanoi, 8-queens, etc.

Benchmarking
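Before moving on to benchmark suites, the three means from the "Mean (Average) Performance Numbers" slide above can be written directly as code. A minimal sketch (function names and the sample values are mine); the geometric mean is the same "average of ratios" used by the SPECmark scoring on the upcoming slides:

```python
# The three means from the "Mean (Average) Performance Numbers" slide;
# function names and the sample numbers are mine.
import math

def arithmetic_mean(latencies):        # units proportional to time (e.g., latency)
    return sum(latencies) / len(latencies)

def harmonic_mean(throughputs):        # units inversely proportional to time (e.g., throughput)
    return len(throughputs) / sum(1.0 / x for x in throughputs)

def geometric_mean(speedups):          # unitless quantities (e.g., speedups)
    return math.prod(speedups) ** (1.0 / len(speedups))

print(arithmetic_mean([10, 30]))       # 20.0  (e.g., two latencies, in minutes)
print(harmonic_mean([30, 90]))         # 45.0  (e.g., the 30/90 mph driving segments)
print(geometric_mean([1.11, 1.25]))    # ~1.18 (e.g., the two CPI-example speedups)
```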

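And here is a toy micro-benchmark in the spirit of the examples above (towers-of-hanoi, 8-queens). The code is entirely hypothetical, not from the slides, and, as the slide warns, it isolates one narrow behavior rather than representing real applications:

```python
# A toy micro-benchmark: times a recursive towers-of-hanoi move count.
# It isolates call/recursion overhead and says nothing about the memory
# behavior of real programs.
import time

def hanoi(n, src="A", dst="C", via="B"):
    """Recursively count the moves needed to solve an n-disc puzzle."""
    if n == 0:
        return 0
    return hanoi(n - 1, src, via, dst) + 1 + hanoi(n - 1, via, dst, src)

start = time.perf_counter()
moves = hanoi(20)                              # latency: time to finish one fixed task
elapsed = time.perf_counter() - start
print(f"latency {elapsed:.3f} s, throughput {moves / elapsed / 1e6:.2f} M moves/s")
```
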
SPEC Benchmarks
• SPEC (Standard Performance Evaluation Corporation)
  • http://www.spec.org/
  • Consortium that collects, standardizes, and distributes benchmarks
  • Posts SPECmark results for different processors
    • One number that represents performance for the entire suite
  • Benchmark suites for CPU, Java, I/O, Web, Mail, etc.
  • Updated every few years, so companies don't target the benchmarks
• SPEC CPU 2006
  • 12 "integer": bzip2, gcc, perl, hmmer (genomics), h264, etc.
  • 17 "floating point": wrf (weather), povray, sphynx3 (speech), etc.
  • Written in C/C++ and Fortran

SPECmark 2006
• Reference machine: Sun UltraSPARC II (@ 296 MHz)
• Latency SPECmark
  • For each benchmark
    • Take an odd number of samples and choose the median
    • Take the latency ratio (reference machine / your machine)
  • Take the "average" (geometric mean) of the ratios over all benchmarks
  • (This scoring is sketched in code after this group of slides)
• Throughput SPECmark
  • Run multiple benchmarks in parallel on a multiple-processor system
• Leaders (a few years out of date, but Intel still at top)
  • SPECint: Intel 3.3 GHz Xeon W5590 (34.2)
  • SPECfp: Intel 3.2 GHz Xeon W3570 (39.3)

Other Benchmarks
• Parallel benchmarks
  • SPLASH2: Stanford Parallel Applications for Shared Memory
  • NAS: another parallel benchmark suite
  • SPECopenMP: parallelized versions of SPECfp 2000
  • SPECjbb: Java multithreaded database-like workload
• Transaction Processing Council (TPC)
  • TPC-C: on-line transaction processing (OLTP)
  • TPC-H/R: decision support systems (DSS)
  • TPC-W: e-commerce database backend workload
  • Have parallelism (intra-query and inter-query)
  • Heavy I/O and memory components

Pitfalls of Partial Performance Metrics
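To make the latency SPECmark rule above concrete (per-benchmark ratio against the reference machine, then a geometric mean), here is a minimal sketch. The benchmark names are SPEC CPU 2006 components, but all of the times below are made up purely for illustration:

```python
# Sketch of the latency SPECmark scoring rule: per-benchmark latency ratio
# (reference machine / your machine), then a geometric mean of the ratios.
# All times below are made up.
import math

ref_seconds  = {"bzip2": 9650, "gcc": 8050, "hmmer": 9330}   # hypothetical reference-machine times
your_seconds = {"bzip2": 290,  "gcc": 250,  "hmmer": 310}    # hypothetical times on the tested machine

ratios = [ref_seconds[b] / your_seconds[b] for b in ref_seconds]
specmark = math.prod(ratios) ** (1 / len(ratios))            # geometric mean of the ratios
print(round(specmark, 1))                                    # ~31.8 with these made-up numbers
```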

Recall: CPU Performance Equation
• Multiple aspects to performance: helps to isolate them
• Latency = seconds / program = (insns / program) * (cycles / insn) * (seconds / cycle)
  • Insns / program: dynamic insn count = f(program, compiler, ISA)
  • Cycles / insn: CPI = f(program, compiler, ISA, micro-arch)
  • Seconds / cycle: clock period = f(micro-arch, technology)
• For low latency (better performance), minimize all three
  – Difficult: they often pull against one another
• Example we have seen: RISC vs. CISC ISAs
  ± RISC: low CPI / short clock period, but high insn count
  ± CISC: low insn count, but high CPI / long clock period

MIPS (performance metric, not the ISA)
• (Micro)architects often ignore dynamic instruction count
  • Typically work in one ISA / one compiler → treat it as fixed
• The CPU performance equation becomes
  • Latency: seconds / insn = (cycles / insn) * (seconds / cycle)
  • Throughput: insns / second = (insns / cycle) * (cycles / second)
• MIPS (millions of instructions per second)
  • Cycles / second: clock frequency (in MHz)
  • Example: CPI = 2, clock = 500 MHz → 0.5 * 500 MHz = 250 MIPS
• Pitfall: MIPS may vary inversely with actual performance
  – Compiler removes insns, program gets faster, MIPS goes down
  – Work per instruction varies (e.g., multiply vs. add, FP vs. integer)
  – (A code sketch of this pitfall appears at the very end)

MHz (Megahertz) and GHz (Gigahertz)
• 1 Hertz = 1 cycle per second
  • 1 GHz is 1 cycle per nanosecond; 1 GHz = 1000 MHz
• (Micro-)architects often ignore dynamic instruction count…
• …but the general public (mostly) also ignores CPI
  • Equates clock frequency with performance!
• Which processor would you buy?
  • Processor A: CPI = 2, clock = 5 GHz
  • Processor B: CPI = 1, clock = 3 GHz
  • Probably A, but B is faster (assuming the same ISA/compiler)
• Classic example
  • 800 MHz Pentium III faster than 1 GHz Pentium 4!
  • More recent example: Core i7 faster clock-for-clock than Core 2
  • Same ISA and compiler!
• Meta-point: danger of partial performance metrics!

CPI and Clock Frequency
• Clock frequency implies the processor "core" clock frequency
  • Other system components have their own clocks (or not)
  • E.g., increasing the processor clock doesn't accelerate memory latency
• Example: a 1 GHz processor (1 ns clock period)
  • 80% non-memory instructions @ 1 cycle (1 ns)
  • 20% memory instructions @ 6 cycles (6 ns)
  • (80% * 1 ns) + (20% * 6 ns) = 2 ns per instruction (also 500 MIPS)
• Impact of doubling the core clock frequency, without speeding up the memory?
  • Non-memory instruction latency is now 0.5 ns (still 1 cycle)
  • Memory instructions keep their 6 ns latency (now 12 cycles)
  • (80% * 0.5 ns) + (20% * 6 ns) = 1.6 ns per instruction (also 625 MIPS)
  • Speedup = 2 / 1.6 = 1.25, which is << 2
• What about an infinite clock frequency? (non-memory instructions become free)
  • Only a factor of 1.66 speedup (an example of Amdahl's Law)
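The clock-doubling example on the last slide can be reproduced with a few lines. A minimal sketch of that hypothetical machine (80% non-memory instructions at one core cycle, 20% memory instructions fixed at 6 ns; the function name is mine):

```python
# Sketch of the clock-scaling example: 80% non-memory instructions take one
# core cycle, 20% memory instructions take a fixed 6 ns regardless of clock.

def ns_per_insn(clock_ghz, mem_frac=0.2, mem_ns=6.0):
    cycle_ns = 1.0 / clock_ghz               # core clock period in ns
    return (1 - mem_frac) * cycle_ns + mem_frac * mem_ns

base    = ns_per_insn(1.0)                   # 2.0 ns per insn  (500 MIPS)
doubled = ns_per_insn(2.0)                   # 1.6 ns per insn  (625 MIPS)
print(base / doubled)                        # 1.25x, far short of 2x

memory_only = 0.2 * 6.0                      # infinite clock: only the memory time is left
print(base / memory_only)                    # ~1.67x, the Amdahl's Law ceiling
```

Driving the clock to infinity leaves only the memory time, which caps the speedup at 2.0 / 1.2 ≈ 1.67x, the Amdahl's Law ceiling the slide mentions.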

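Finally, a sketch of the MIPS pitfall from the slide above. The before/after numbers are made up, but they show a compiler removing instructions so that the program gets faster while its MIPS rating drops:

```python
# Sketch of the MIPS pitfall: fewer (but heavier) instructions after compiler
# optimization, so latency improves while MIPS falls. Numbers are made up.

CLOCK_HZ = 1e9                               # hypothetical 1 GHz clock

def latency_s(insns, cpi):
    return insns * cpi / CLOCK_HZ

def mips(insns, cpi):
    return insns / latency_s(insns, cpi) / 1e6

before = (1_000_000, 1.0)                    # 1M dynamic insns at CPI 1.0
after  = (600_000, 1.5)                      # fewer insns, but the surviving mix has higher CPI

print(latency_s(*before), mips(*before))     # 0.001 s,  1000 MIPS
print(latency_s(*after),  mips(*after))      # 0.0009 s (faster), ~667 MIPS (lower)
```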