
Slides for Lecture 4

ENCM 501: Principles of Computer Architecture Winter 2014 Term Steve Norman, PhD, PEng

Electrical & Computer Engineering Schulich School of Engineering University of Calgary

21 January, 2014

ENCM 501 W14 Slides for Lecture 4

slide 2/30

Previous Lecture

◮ completion of Wed Jan 15 tutorial
◮ energy and power use in processors
◮ brief coverage of trends in cost


slide 3/30

Today’s Lecture

◮ a little more about die yield
◮ measuring and reporting computer performance
◮ quantitative principles of computer design

Related reading in Hennessy & Patterson: Sections 1.8–1.9


slide 4/30

More about die yields

Here is the formula presented last lecture:

die yield = wafer yield × 1 / (1 + defects per unit area × die area)^N

The formula is derived from many years of IC process data. N is called the process-complexity factor. 2010 numbers are 11.5 to 15.5 for N and 0.016 to 0.057 defects per cm². Examples in the textbook with wafer yield = 100%, N = 13.5, and 0.031 defects per cm² give yields of

◮ 66% for 1.0 cm × 1.0 cm dies;
◮ 40% for 1.5 cm × 1.5 cm dies.
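These numbers are easy to check. Here is a short Python sketch of the yield formula as I read it (the function name is mine; the constants are the textbook's example values):

```python
def die_yield(wafer_yield, defects_per_cm2, die_area_cm2, N):
    """Die yield model: wafer yield / (1 + defect density * die area)^N."""
    return wafer_yield / (1.0 + defects_per_cm2 * die_area_cm2) ** N

# Textbook example: wafer yield = 100%, N = 13.5, 0.031 defects per cm^2.
small = die_yield(1.0, 0.031, 1.0 * 1.0, 13.5)  # 1.0 cm x 1.0 cm die
large = die_yield(1.0, 0.031, 1.5 * 1.5, 13.5)  # 1.5 cm x 1.5 cm die
print(f"1.0 cm x 1.0 cm: {small:.0%}")  # about 66%
print(f"1.5 cm x 1.5 cm: {large:.0%}")  # about 40%
```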


slide 5/30

Let’s think about that 66% yield for a minute. The defect density is about 3 per 100 cm². With a 1 cm² die size, that suggests about 3 defects spread over every 100 dies. So why is the yield not approximately 97%? With a couple of hours of Google search I found the yield formula (poorly explained) in multiple technical documents, often along with competing formulas.

Here is my best guess as to what is correct: N represents a number of process layers, and the defect density is specified per process layer. N is adjusted up or down from the real number of process layers to reflect the fact that some layers are more defect-prone than others. Regardless, it is true that for a given IC fabrication process, die yield gets worse as die size increases.


slide 6/30

Textbook Section 1.7: Dependability

We’re not going to cover this material in ENCM 501.


slide 7/30

How to evaluate performance (1)

Given two different computer designs, how do you decide which is “better”? Think about comparing other kinds of machines. For example, which is “better”: (a) a “3/4 ton” pickup truck, or (b) a midsize luxury AWD sedan? Do you want to

◮ . . . move construction supplies?
◮ . . . pull a large trailer?
◮ . . . commute comfortably to an office job?


slide 8/30

How to evaluate performance (2)

The analogy to vehicle selection can be used to make two key points . . .

◮ Obviously, making the best choice of machine, or at least a reasonably good choice, depends on what the machine is going to be used for.

◮ No single narrow-scope measurement of performance is very useful. It doesn’t make sense to use fastest acceleration from 0 to 60 mph, or fastest time to sort an array of 10 million doubles, as a sole criterion.


slide 9/30

Often this makes sense: performance ∝ 1/time

Think about these examples:

◮ Software developer builds an executable from a large body of C or C++ code.
◮ Digital designer runs a detailed simulation of a complex circuit.
◮ Meteorologist runs a 5-day weather forecast program using current atmospheric data as input.

These tasks can take minutes or hours to run. There are obvious incentives to find hardware that will help minimize running time.


slide 10/30

Use ratios of running time to compare time-based performance

For a given task run on Systems A and B,

performance_A / performance_B = time_B / time_A

Example: For some task, time_A = 1000 s and time_B = 750 s. Then, for this task, System B is 1000/750 = 1.33 times as fast as System A. Equivalently, System B provides a speedup of 1.33 relative to System A.

Ratios are easier to work with and harder to misinterpret than other ways to compare speed. For example, avoid saying things like, “System B gives a 25% decrease in running time,” or, “System B gives a 33% increase in speed.”
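A trivially small sketch of the ratio convention (the function name is mine):

```python
def speedup(time_a, time_b):
    """Speedup of System B relative to System A for one task:
    performance_B / performance_A == time_A / time_B."""
    return time_a / time_b

s = speedup(1000.0, 750.0)  # the example above
print(f"System B is {s:.2f} times as fast as System A")  # 1.33
```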


slide 11/30

What might System A and System B be?

There are lots of different kinds of interesting practical comparisons. Some of the many possibilities:

◮ same source code, different ISAs, different hardware, different compilers
◮ same source code, same ISA, same compiler, different hardware
◮ same source code, same ISA, same hardware, different compiler
◮ same source code, same ISA, same hardware, same compiler, different compiler options
◮ different source codes for the same task, same everything else

Don’t forget about the last one! Choice of data structures and algorithms can be a huge factor!


slide 12/30

What programs should be used for performance evaluation?

This is a hard question, because every user is different. SPEC (Standard Performance Evaluation Corporation, www.spec.org) takes the position that complete runs of “suites” of carefully-chosen real-world programs are the best way to get general performance indexes for computer systems. Alternatives, such as runs of much smaller programs that are supposedly representative of practical code, are problematic:

◮ the small programs will more likely fail to test some important features that real-world programs depend on;
◮ hardware designers and compiler and library writers can sometimes “game” synthetic benchmarks.


slide 13/30

SPEC CPU benchmark suites

Quote from www.spec.org/cpu2006/Docs/readme1st.html: “SPEC CPU2006 focuses on compute intensive performance, which means these benchmarks emphasize the performance of

◮ the computer processor (CPU),
◮ the memory architecture, and
◮ the compilers.

“It is important to remember the contribution of the latter two components. SPEC CPU performance intentionally depends on more than just the processor.”


slide 14/30

More quotes from the same source . . . “SPEC CPU2006 contains two suites that focus on two different types of compute intensive performance:

◮ The CINT2006 suite measures compute-intensive integer performance, and
◮ The CFP2006 suite measures compute-intensive floating point performance.”

“SPEC CPU2006 is not intended to stress other computer components such as networking, the operating system, graphics, or the I/O system. For single-CPU tests, the effects from such components on SPEC CPU2006 performance are usually minor.”


slide 15/30

“compute-intensive integer performance”

Programs suitable for this suite would tend to

◮ have a lot of integer arithmetic instructions, especially add, subtract, and compare, and logical operations such as shifts, bitwise AND, OR, NOR or XOR, etc.;
◮ do a lot of load and store operations between general-purpose registers and the memory hierarchy;
◮ frequently encounter (conditional) branches and (unconditional) jumps;
◮ have very few floating-point instructions or none at all.


slide 16/30

“compute-intensive floating-point performance”

Programs suitable for this suite would tend to have some of the same properties as “compute-intensive integer” programs, but would also have

◮ relatively heavy concentrations of floating-point instructions for operations such as +, −, *, /, sqrt, etc.;
◮ a lot of load and store operations between floating-point registers and the memory hierarchy.

Why would a “compute-intensive floating-point” program have a lot of integer arithmetic instructions?


slide 17/30

Arithmetic means and geometric means

Notation for a sum of N times:

Time_1 + Time_2 + · · · + Time_N = Σ_{k=1}^{N} Time_k

Notation for a product of N times:

Time_1 × Time_2 × · · · × Time_N = Π_{k=1}^{N} Time_k

Arithmetic mean (average) of times:

(1/N) Σ_{k=1}^{N} Time_k

Geometric mean of times:

( Π_{k=1}^{N} Time_k )^{1/N}

It turns out that the geometric mean is a better way to combine program run times than is the arithmetic mean . . .
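For concreteness, here is how the two means can be computed in Python (the three sample run times, in seconds, are illustrative):

```python
import math

def arithmetic_mean(times):
    return sum(times) / len(times)

def geometric_mean(times):
    # Product-then-root can overflow for many large times, so use logs:
    # (t1 * t2 * ... * tN) ** (1/N) == exp(mean of log(tk)).
    return math.exp(sum(math.log(t) for t in times) / len(times))

times = [500.0, 1000.0, 8000.0]  # sample run times in seconds
print(f"AM = {arithmetic_mean(times):.1f} s")  # 3166.7 s
print(f"GM = {geometric_mean(times):.1f} s")   # 1587.4 s
```

Note how strongly the one long run time pulls the arithmetic mean up, while the geometric mean treats each ratio-scale change equally.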


slide 18/30

An example, reflecting the structure of SPEC CPU benchmark reporting:

◮ Ref is an older, slower “reference” machine.
◮ Foo and Bar are newer, faster machines.
◮ All times, arithmetic means, and geometric means are in seconds.

                program run time
machine       A       B       C      AM      GM
Ref        1000    2000   10000    4333    2714
Foo         500    1000    8000    3166    1587
Bar         750    1600    6000    2783    1931

Let’s check the geometric mean calculation for Foo. Let’s make an argument that we should ignore arithmetic mean, and use geometric mean to conclude that Foo is faster overall than Bar.

slide 19/30

Using the same run times as on the previous slide, we can make a table of speedups and geometric means of speedups.

             speedup wrt Ref
machine       A      B      C     GM
Ref        1.00   1.00   1.00   1.00
Foo        2.00   2.00   1.25   1.71
Bar        1.33   1.25   1.67   1.41

Figure 1.17 on page 43 of the textbook shows real SPECfp2000 data for three machines: a Sun Ultra 5 (the reference machine), one based on an AMD Opteron processor and one based on an Intel Itanium 2. The figure caption makes the useful point that reference machine performance doesn’t actually matter in comparing the newer machines. Also interesting: For some programs the Opteron is much faster than the Itanium, and for others, it’s the opposite.
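The reference-machine point can be checked directly: in a ratio of geometric means, the reference machine’s times cancel. A small Python sketch using the Ref/Foo/Bar times from these slides:

```python
import math

def gmean(xs):
    """Geometric mean via logs (avoids overflow on long lists)."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

ref = [1000, 2000, 10000]  # run times (s) on the reference machine
foo = [500, 1000, 8000]
bar = [750, 1600, 6000]

gm_speedup_foo = gmean([r / t for r, t in zip(ref, foo)])  # about 1.71
gm_speedup_bar = gmean([r / t for r, t in zip(ref, bar)])  # about 1.41

# The comparison of Foo and Bar does not depend on Ref at all:
assert abs(gm_speedup_foo / gm_speedup_bar - gmean(bar) / gmean(foo)) < 1e-9
print(f"Foo is {gm_speedup_foo / gm_speedup_bar:.2f} times as fast as Bar")
```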


slide 20/30

More benchmarks

The screenshot is from the SPEC home page. Note that there are lots of benchmarks for loads that are not CPU-intensive.

Some non-SPEC benchmarks are TPC benchmarks, oriented towards database systems, and CoreMark benchmarks, oriented toward embedded processors.


slide 21/30

Quantitative Principles of Computer Design

The title of this slide is the title of Section 1.9 in the textbook. The first three subsection titles are:

◮ Take Advantage of Parallelism
◮ Principle of Locality
◮ Focus on the Common Case

These are important ideas, so please do the reading. There’s not much I can add by writing lecture slides on this material.


slide 22/30

Amdahl’s law

(This law is named after Gene Amdahl, a pioneer in computing.)

Let t_old be the running time of a program that performs some task. Suppose the program—or the hardware it runs on—is enhanced to perform the same task, but faster. Suppose F_E is the fraction of run time affected by the enhancement, and S_E is the speedup factor for the enhancement. Then the new running time will be

t_new = (1 − F_E) × t_old + (F_E / S_E) × t_old

And the overall speedup will be

t_old / t_new = 1 / ((1 − F_E) + F_E / S_E)


slide 23/30

Amdahl’s law example

Let’s suppose that a task has three steps. Our current program uses just one processor core, and takes 50 s for Step 1, 900 s for Step 2, and 50 s for Step 3. Suppose that Step 1 and Step 3 are hard to enhance, but Step 2 is “embarrassingly parallel”—it’s possible to divide the work evenly over N cores, with very little overhead. Suppose the program is rewritten to parallelize Step 2, so that N can be chosen when the program is run. What is F_E and what is S_E?


slide 24/30

Amdahl’s law example (continued)

Here is a table of results for some choices of N. (Times for Steps and total times are in seconds.)

  N   Step 1   Step 2   Step 3   t_new   overall speedup
  1     50      900       50     1000         1.00
  2     50      450       50      550         1.82
  4     50      225       50      325         3.08
 16     50       56.3     50      156.3       6.40
 64     50       14.1     50      114.1       8.77

This example illustrates a general point: As the speedup to the enhanced part of the program gets large, time spent in the non-enhanced part starts to dominate running time.


slide 25/30

Processor Performance Equation (1)

CPI: Clock cycles per instruction.

IC: Instruction count. This is the number of instructions actually executed, not the program size. Instructions in loops count each time they are executed; instructions skipped by if statements don’t count.

CPU time = IC × CPI × clock period

CPI is processor-dependent and also program-dependent, so this equation by itself is not very powerful. However, it’s a great starting point for performance analysis—variations on the equation are quite useful.


slide 26/30

Processor Performance Equation (2)

Here’s a useful variation:

CPU time = ( Σ_{i=1}^{n} IC_i × CPI_i ) × clock period

The summation is over the various kinds of instructions used in a program. For example, i = 1 could be LD, i = 2 could be SD, i = 3 could be DADDU and all similar instructions, i = 4 could be conditional branches, and so on.

For any modern desktop processor, the above equation is still an approximation. (Or, put another way, each CPI_i could be somewhat program dependent.) Why?
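To make the variation concrete, here is a small Python sketch; the instruction classes, counts, CPI values, and the 2 GHz clock are all made up for illustration, not measurements of any real processor:

```python
# Per-instruction-class form of the performance equation:
# CPU time = (sum over i of IC_i * CPI_i) * clock period.
# All numbers below are hypothetical.

clock_period = 0.5e-9  # seconds per cycle, i.e. an assumed 2 GHz clock

# (instruction class, dynamic count IC_i, cycles per instruction CPI_i)
mix = [
    ("loads",    2_000_000, 2.0),
    ("stores",   1_000_000, 2.0),
    ("integer",  5_000_000, 1.0),
    ("branches", 1_500_000, 1.5),
]

cycles = sum(count * cpi for _, count, cpi in mix)
total_ic = sum(count for _, count, _ in mix)
cpu_time = cycles * clock_period

print(f"overall CPI = {cycles / total_ic:.3f}")   # 1.395
print(f"CPU time    = {cpu_time * 1e3:.3f} ms")   # 6.625 ms
```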


slide 27/30

Bonus slides about power dissipation in CMOS logic

The next two slides try to show why energy losses (heat generation) in 1 → 0 and 0 → 1 transitions of CMOS gate outputs are both (1/2) C V_DD², regardless of how well or poorly the pull-down and pull-up networks conduct.

My math depends on a crude resistor-and-switch model for pull-down and pull-up networks, but I’m pretty sure the same results can be derived without making such rough assumptions about NMOS and PMOS transistors. (This course is Computer Architecture, not Digital CMOS VLSI, so I’m not going to put any more time into this issue!)


slide 28/30

1 → 0 transition: Let t = 0 be the instant when the input switches to cause a 1 → 0 transition on the output. (In reality, input changes are not instant.)

[Circuit model: supply V_DD; pull-up network modeled as R_PU (switched off); pull-down network modeled as R_PD, discharging load capacitance C; output voltage V_out.]

V_out(t) = V_DD exp(−t / (R_PD C))

So energy lost in R_PD is

∫_{t=0}^{∞} V_out(t)² / R_PD dt = (V_DD² / R_PD) ∫_{t=0}^{∞} exp(−2t / (R_PD C)) dt
                                = (V_DD² / R_PD) [ −(R_PD C / 2) exp(−2t / (R_PD C)) ]_{t=0}^{∞}
                                = (1/2) C V_DD²


slide 29/30

0 → 1 transition: Let t = 0 be the instant when the input switches to cause a 0 → 1 transition on the output.

[Circuit model: supply V_DD charges load capacitance C through the pull-up resistance R_PU; V_PU is the voltage across R_PU; pull-down R_PD is switched off; output voltage V_out.]

V_out(t) = V_DD (1 − exp(−t / (R_PU C)))
V_PU(t) = V_DD exp(−t / (R_PU C))

Energy lost in R_PU is

∫_{t=0}^{∞} V_PU(t)² / R_PU dt = (V_DD² / R_PU) [ −(R_PU C / 2) exp(−2t / (R_PU C)) ]_{t=0}^{∞}
                               = (1/2) C V_DD²


slide 30/30

Upcoming Topics

◮ a survey of ISA design ideas

Related reading in Hennessy & Patterson: Sections A.1–A.7