CS305 Computer Architecture Fall 2009 Lecture 04 Bhaskaran Raman



slide-1
SLIDE 1

CS305 Computer Architecture Fall 2009 Lecture 04

Bhaskaran Raman Department of CSE, IIT Bombay

http://www.cse.iitb.ac.in/~br/ http://www.cse.iitb.ac.in/synerg/doku.php?id=public:courses:cs305-fall09:start

slide-2
SLIDE 2

Today's Topics

  • Performance metrics, CPI
  • Performance comparison
  • Benchmarks
slide-3
SLIDE 3

Performance Comparison

  • What performance metric to use?
  • User cares about response time
  • Performance is inversely proportional to execution time
  • What is execution time?
  • Response time
  • CPU time: User time + System time
  • System performance vs. CPU performance
  • Throughput vs. response-time
  • We will focus on CPU performance
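The difference between response time and CPU time can be observed directly. The following is a minimal sketch, assuming Python's standard `time` module: `time.perf_counter()` is used for elapsed (response) time and `time.process_time()` for user + system CPU time of the current process. The `measure` and `workload` helpers are hypothetical names introduced for illustration.

```python
import time

def measure(work):
    """Run work() and return (response_time, cpu_time) in seconds."""
    start_wall = time.perf_counter()   # wall-clock timer
    start_cpu = time.process_time()    # user + system CPU time of this process
    work()
    cpu = time.process_time() - start_cpu
    wall = time.perf_counter() - start_wall
    return wall, cpu

def workload():
    sum(i * i for i in range(200_000))  # CPU-bound part
    time.sleep(0.2)                     # waiting: adds response time, almost no CPU time

response_time, cpu_time = measure(workload)
print(f"response time: {response_time:.3f} s, CPU time: {cpu_time:.3f} s")
```

The sleep stands in for I/O or time spent waiting on other processes, which is why a loaded system can show a response time far above the CPU time.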
slide-4
SLIDE 4

Which Program's Execution Time?

  • Real “workload” is ideal
  • Practical options:
  • Real programs: compilers, office-suite, scientific...
  • Kernels: key pieces of programs

– Example: Livermore loops

  • Toy benchmarks: small programs

– Examples: Quick-sort, tower of Hanoi...

  • Synthetic benchmarks: try to capture “average” frequency of instructions in real programs

– Examples: Whetstone, Dhrystone

slide-5
SLIDE 5

More on Performance Comparisons...

  • Caveat of benchmarks
  • They are needed
  • But manufacturers tend to optimize for benchmarks
  • Need to be updated periodically
  • Benchmark suite: collection of programs
  • E.g. SPEC2000
  • Reporting performance
  • Reproducibility: program version, compiler, flags
  • SPEC specifies compiler flags for baseline comparison
slide-6
SLIDE 6

Some Numerics...

  • Total (or average) execution time is a possible metric
  • Weighted execution time is better

                     Computer A   Computer B   Computer C
Program P1 (secs)             1           10           20
Program P2 (secs)          1000          100           20
Total (secs)               1001          110           40

Weighted execution time: $\sum_{i} W_i \times T_i$
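The effect of the weights can be checked with a few lines of Python. This is a sketch only: the execution times are the ones from the table, while the weight sets `equal` and `mostly_p1` are hypothetical workloads chosen for illustration.

```python
# Execution times in seconds, copied from the table on this slide.
times = {
    "A": {"P1": 1, "P2": 1000},
    "B": {"P1": 10, "P2": 100},
    "C": {"P1": 20, "P2": 20},
}

def weighted_time(machine_times, weights):
    """Return the sum of W_i * T_i over all programs."""
    return sum(weights[prog] * t for prog, t in machine_times.items())

equal = {"P1": 1.0, "P2": 1.0}          # equal weights reproduce the Total row
mostly_p1 = {"P1": 0.99, "P2": 0.01}    # hypothetical workload dominated by P1

for name, machine_times in times.items():
    total = weighted_time(machine_times, equal)
    weighted = weighted_time(machine_times, mostly_p1)
    print(f"Computer {name}: total = {total:.0f} s, weighted = {weighted:.2f} s")
```

With equal weights Computer C looks best, but under the hypothetical P1-dominated workload A and B come out ahead of C, which is why the weights should reflect the real workload.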

slide-7
SLIDE 7

Normalizing the Performance

  • Normalize such that all programs take the same time, on some machine
  • Arithmetic mean predicts performance
  • Geometric mean?

        Normalized to A         Normalized to B         Normalized to C
        A      B      C         A      B      C         A      B      C
P1      1      10     20        0.1    1      2         0.05   0.5    1
P2      1      0.1    0.02      10     1      0.2       50     5      1
AM      1      5.05   10.01     5.05   1      1.1       25.03  2.75   1
GM      1      1      0.63      1      1      0.63      1.58   1.58   1
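The normalized means can be recomputed as follows. This is a small sketch using the P1 and P2 times from the earlier table; the helper names `normalize`, `arithmetic_mean`, and `geometric_mean` are introduced here for illustration.

```python
import math

# Execution times in seconds for programs P1 and P2, copied from the slides.
times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}

def normalize(times, ref):
    """Divide each machine's times by the reference machine's times."""
    return {m: [t / r for t, r in zip(ts, times[ref])] for m, ts in times.items()}

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    return math.prod(xs) ** (1 / len(xs))

for ref in times:
    norm = normalize(times, ref)
    am = {m: round(arithmetic_mean(v), 2) for m, v in norm.items()}
    gm = {m: round(geometric_mean(v), 2) for m, v in norm.items()}
    print(f"normalized to {ref}: AM = {am}, GM = {gm}")
```

The arithmetic mean of normalized times ranks a different machine first for each choice of reference machine, whereas the ratio of geometric means between any two machines is the same whichever machine is used as the reference.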

slide-8
SLIDE 8

Summary

  • Performance inversely proportional to execution time
  • We are concerned with CPU time of unloaded machine
  • Weighted execution time with weights from real workload is ideal
  • Else, normalize w.r.t. one machine
slide-9
SLIDE 9

Amdahl's Law

  • Amdahl's law:
  • Diminishing returns
  • Limit on overall speedup
  • Corollary: make the common case fast

[Figure: execution time segments labeled 1−F and F before the enhancement, 1−F and F/Speedup after it]

slide-10
SLIDE 10

Amdahl's Law

  • Amdahl's law:
  • Diminishing returns
  • Limit on overall speedup

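[Figure: execution time segments labeled 1−F and F before the enhancement, 1−F and F/Speedup after it]

  • Corollary: make the common case fast

$\text{Overall speedup} = \frac{(1-F) + F}{(1-F) + \frac{F}{\text{Speedup}}} = \frac{1}{(1-F) + \frac{F}{\text{Speedup}}}$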
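A short derivation of Amdahl's law, under the usual reading that F is the fraction of the original execution time affected by the enhancement. The symbols $T_{old}$ and $T_{new}$ are introduced here for illustration and do not appear on the slides.

```latex
\begin{align*}
T_{new} &= T_{old} \left[ (1 - F) + \frac{F}{\text{Speedup}} \right] \\
\text{Overall speedup} &= \frac{T_{old}}{T_{new}}
  = \frac{1}{(1 - F) + \frac{F}{\text{Speedup}}} \\
\lim_{\text{Speedup} \to \infty} \text{Overall speedup} &= \frac{1}{1 - F}
\end{align*}
```

The last line is the limit on overall speedup: with F = 0.5, for instance, the overall speedup stays below 2 however fast the enhanced part becomes.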

slide-11
SLIDE 11

Illustrating Amdahl's Law

  • Example: implement faster memory, or faster ALU?
  • Proposed memory speedup: 10x
  • Proposed ALU speedup: 3x
  • Depends on fraction of instructions

– Suppose $F_{mem} = 0.2$, $F_{alu} = 0.5$, $F_{other} = 0.3$

$\text{Speedup with faster memory} = \frac{1}{0.8 + 0.2/10} = 1.22$

$\text{Speedup with faster ALU} = \frac{1}{0.5 + 0.5/3} = 1.5$
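The two speedups can be reproduced numerically. This is a minimal sketch: `overall_speedup` is a hypothetical helper that evaluates Amdahl's law, and the fractions and local speedups are the ones assumed on this slide.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: speedup of the whole program when `fraction` of the
    execution time is sped up by `local_speedup`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

f_mem, f_alu = 0.2, 0.5                      # fractions assumed on the slide
faster_memory = overall_speedup(f_mem, 10)   # proposed memory speedup: 10x
faster_alu = overall_speedup(f_alu, 3)       # proposed ALU speedup: 3x

print(f"faster memory: {faster_memory:.2f}")  # prints 1.22
print(f"faster ALU:    {faster_alu:.2f}")     # prints 1.50
```

Although the memory enhancement is the larger local speedup, the ALU enhancement wins here because it applies to a larger fraction of the execution.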

slide-12
SLIDE 12

Example continued...

  • Fixing $F_{alu} = 0.5$, for what value of $F_{mem}$ is going for a faster memory better?

$\frac{1}{(1 - F_{mem}) + F_{mem}/10} > 1.5 \Rightarrow F_{mem} > \frac{10}{27} \approx 0.37$

slide-13
SLIDE 13

The CPU Performance Equation

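For a program,

$\text{CPU time} = \text{Num. of clock cycles} \times \text{Clock cycle time}$

OR

$\text{CPU time} = \text{Num. of clock cycles} \div \text{Clock rate}$

$\text{Num. of clock cycles} = \text{Instruction Count} \times \text{Cycles Per Instruction} = IC \times CPI$

Putting these together,

$\text{CPU time} = IC \times CPI \times \text{Clock cycle time}$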
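The CPU performance equation translates directly into code. The sketch below uses a hypothetical helper `cpu_time_seconds` and made-up numbers (2 billion instructions, CPI of 1.5, 1 GHz clock) purely for illustration.

```python
def cpu_time_seconds(instruction_count, cpi, clock_rate_hz):
    """CPU time = IC x CPI x clock cycle time = IC x CPI / clock rate."""
    clock_cycles = instruction_count * cpi
    return clock_cycles / clock_rate_hz

# Hypothetical program: 2e9 instructions, CPI = 1.5, 1 GHz clock.
t = cpu_time_seconds(2e9, 1.5, 1e9)
print(f"CPU time: {t} s")  # prints "CPU time: 3.0 s"
```

Halving any one of IC or CPI, or doubling the clock rate, halves the CPU time, provided the other two factors stay unchanged.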

slide-14
SLIDE 14

More on the Equation

  • This form is convenient
  • Involves many relevant parameters
  • Remembering is easy

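$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Seconds}}{\text{Clock cycle}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Instructions}}{\text{Program}}$

  • With CPI as the independent variable

$CPI = \frac{\text{CPU time}}{\text{Clock cycle time} \times IC}$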

slide-15
SLIDE 15

Other Convenient Forms of the Equation

  • Number of clock cycles can be counted as:

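$\text{CPU clock cycles} = \sum_{i=1}^{n} CPI_i \times IC_i$

Hence,

$\text{CPU time} = \left( \sum_{i=1}^{n} CPI_i \times IC_i \right) \times \text{Clock cycle time}$

  • Calculating $CPI$ in terms of $CPI_i$:

$CPI = \frac{\text{CPU time}}{\text{Clock cycle time} \times IC} = \sum_{i=1}^{n} CPI_i \times \frac{IC_i}{IC}$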

slide-16
SLIDE 16

Usefulness of the Equation

  • $IC_i$ easier to measure than $F_i$
  • Equivalently, $F_i$ is measured through $IC_i$
  • Equation includes relevant parameters such as the cycle time

slide-17
SLIDE 17

Measuring the Parameters for the Equation

  • Clock cycle time:
  • Easy for existing architectures
  • Needs to be estimated in the design process
  • Instruction Count:
  • Requires a compiler
  • And, simulator/interpreter, or instrumentation code
  • CPI for each instruction type:
  • Easy for simple architectures
  • Pipelines, caches introduce complications
  • Need to simulate and measure average CPI
slide-18
SLIDE 18

A Design Example

  • A design choice for conditional branch instructions:
  • Choice 1: condition code is set by a compare instruction, checked by the next (branch) instruction

– 20% of instructions are branches, and another 20% are compares
– 2 cycles per branch, 1 cycle for all others
– Clock-rate is 25% faster

  • Choice 2: single instruction for compare and branch
  • Which choice is better?
slide-19
SLIDE 19

Solution for Design Example

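$\text{CPU time}_1 = \frac{IC_1 \times [0.8 \times 1 + 0.2 \times 2]}{1.25 \times C} = \frac{IC_1}{C} \times \frac{1.2}{1.25}$

$\text{CPU time}_2 = \frac{IC_1 \times [0.6 \times 1 + 0.2 \times 2]}{C} = \frac{IC_1}{C}$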