CSE775: Computer Architecture, Chapter 1: Fundamentals of Computer Design

SLIDE 1

CSE775: Computer Architecture

Chapter 1: Fundamentals of Computer Design

SLIDE 2

Computer Architecture Topics

  • Input/Output and Storage
– Disks, WORM, Tape, RAID, Emerging Technologies
  • Memory Hierarchy
– DRAM, Interleaving Memories, L2 Cache, L1 Cache
– Coherence, Bandwidth, Latency
  • Instruction Set Architecture
– Addressing, Protection, Exception Handling
  • Pipelining and Instruction-Level Parallelism
– Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, DSP
  • VLSI

SLIDE 3

Computer Architecture Topics

  • Multiprocessors
– Shared Memory, Message Passing, Data Parallelism
  • Networks and Interconnections
– Processor-Memory-Switch organization: processor (P) and memory (M) nodes connected by an interconnection network
– Network Interfaces
– Topologies, Routing, Bandwidth, Latency, Reliability

SLIDE 4

Measurement and Evaluation

Architecture is an iterative process:

  • Searching the space of possible designs
  • At all levels of computer systems

[Diagram: Design and Analysis form a loop fed by Creativity; Cost / Performance Analysis sorts candidate designs into Good Ideas, Mediocre Ideas, and Bad Ideas.]

SLIDE 5

Issues for a Computer Designer

  • Functional Requirements Analysis (Target)
– Scientific Computing: high-performance floating point
– Business: transactional support, decimal arithmetic
– General Purpose: balanced performance for a range of tasks
  • Level of software compatibility
– Programming-language (PL) level: flexible, but needs a new compiler; portability is an issue
– Binary level (x86 architecture): little flexibility; portability requirements are minimal
  • OS requirements
– Address space issues, memory management, protection
  • Conformance to standards
– Languages, OS, networks, I/O, IEEE floating point

SLIDE 6

Computer Systems: Technology Trends

  • 1988
– Supercomputers
– Massively Parallel Processors
– Mini-supercomputers
– Minicomputers
– Workstations
– PCs
  • 2008
– Powerful PCs and laptops
– Clusters delivering petaflop performance
– Embedded computers
– PDAs, iPhones, ..

SLIDE 7

Technology Trends

  • Integrated circuit logic technology: transistor count on a chip grows by about 40% to 55% per year.
  • Semiconductor RAM: capacity increases by about 40% per year, while cycle time has improved very slowly, decreasing by about one-third in 10 years. Cost has decreased at about the rate at which capacity increases.
  • Magnetic disk technology: in the 1990s, disk density improved 60% to 100% per year, compared with about 30% per year before 1990. Since 2004, it has dropped back to about 30% per year.
  • Network technology: both latency and bandwidth are important. Internet infrastructure in the U.S. has been doubling in bandwidth every year, and high-performance system area networks (such as InfiniBand) continue to reduce latency.
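
To make these annual rates more concrete, here is a small sketch that converts a yearly growth rate into a doubling time and a 10-year growth factor. It assumes the rate compounds steadily every year, which is a simplification; the function names are illustrative only.

```python
import math

def doubling_time_years(annual_rate: float) -> float:
    """Years for a quantity growing at `annual_rate` (0.40 = 40%/yr) to double."""
    return math.log(2) / math.log(1 + annual_rate)

def growth_over(years: int, annual_rate: float) -> float:
    """Total growth factor after `years` of steady compounding."""
    return (1 + annual_rate) ** years

# Rates quoted on the slide, assumed to compound every year.
rates = [
    ("transistors (low)", 0.40),
    ("transistors (high)", 0.55),
    ("DRAM capacity", 0.40),
    ("disk density, 1990s (high)", 1.00),
    ("disk density, since 2004", 0.30),
]

for label, rate in rates:
    print(f"{label:28s} doubles every {doubling_time_years(rate):.2f} yr, "
          f"x{growth_over(10, rate):.0f} in 10 yr")
```

At 40% per year a quantity doubles roughly every two years, which is why small differences in annual rate add up to large gaps over a decade.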

SLIDE 8

Why Such Change in 20 years?

  • Performance
– Technology advances
  • CMOS (complementary metal oxide semiconductor) VLSI dominates older technologies such as TTL (transistor-transistor logic) in both cost AND performance
– Computer architecture advances improve the low end
  • RISC, pipelining, superscalar, RAID, …
  • Price: lower costs due to …
– Simpler development
  • CMOS VLSI: smaller systems, fewer components
– Higher volumes
– Lower margins by class of computer, due to fewer services

SLIDE 9

Growth in Microprocessor Performance

Figure 1.1

In the 1990s, the main source of innovation in computer design was RISC-style pipelined processors. In the last several years, the annual growth rate has been only 10% to 20%.

SLIDE 10

Growth in Performance of RAM & CPU

Figure 5.2

  • Mismatch between CPU performance growth and memory performance growth!
  • Memory latency has remained almost unchanged
  • Little instruction-level parallelism is left to exploit efficiently
  • The maximum power dissipation of air-cooled chips has been reached

SLIDE 11

Cost of Six Generations of DRAMs

SLIDE 12

Cost of Microprocessors

SLIDE 13

Components of Price for a $1000 PC

SLIDE 14

Integrated Circuits Costs

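$$\text{IC cost} = \frac{\text{Die cost} + \text{Testing cost} + \text{Packaging cost}}{\text{Final test yield}}$$

$$\text{Die cost} = \frac{\text{Wafer cost}}{\text{Dies per wafer} \times \text{Die yield}}$$

$$\text{Dies per wafer} = \frac{\pi \times (\text{Wafer\_diam}/2)^2}{\text{Die\_Area}} - \frac{\pi \times \text{Wafer\_diam}}{\sqrt{2 \times \text{Die\_Area}}} - \text{Test dies}$$

$$\text{Die yield} = \text{Wafer yield} \times \left\{1 + \frac{\text{Defects\_per\_unit\_area} \times \text{Die\_Area}}{\alpha}\right\}^{-\alpha}$$

Die cost goes roughly with (die area)^4.

DAP.S98 1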
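
The cost formulas above can be evaluated directly. The sketch below is a straightforward transcription of them; the input values (a 30 cm wafer, a 1.5 cm^2 die, 0.4 defects per cm^2, alpha = 4, $5000 per wafer, wafer yield of 1) are hypothetical numbers chosen only for illustration, and the function names are our own.

```python
import math

def dies_per_wafer(wafer_diam: float, die_area: float, test_dies: int = 0) -> float:
    """Wafer area / die area, minus dies lost at the round edge and test dies."""
    wafer_area = math.pi * (wafer_diam / 2) ** 2
    edge_loss = math.pi * wafer_diam / math.sqrt(2 * die_area)
    return wafer_area / die_area - edge_loss - test_dies

def die_yield(defects_per_area: float, die_area: float,
              alpha: float, wafer_yield: float = 1.0) -> float:
    """Wafer yield * (1 + defects_per_area * die_area / alpha) ** -alpha."""
    return wafer_yield * (1 + defects_per_area * die_area / alpha) ** -alpha

def die_cost(wafer_cost: float, wafer_diam: float, die_area: float,
             defects_per_area: float, alpha: float) -> float:
    """Wafer cost spread over the good dies on the wafer."""
    good_dies = dies_per_wafer(wafer_diam, die_area) * die_yield(
        defects_per_area, die_area, alpha)
    return wafer_cost / good_dies

def ic_cost(die: float, testing: float, packaging: float,
            final_test_yield: float) -> float:
    """(Die + testing + packaging cost) / final test yield."""
    return (die + testing + packaging) / final_test_yield

# Hypothetical inputs (units: cm and cm^2; dollars per wafer).
d = die_cost(5000, 30, 1.5, 0.4, 4)
print(f"dies/wafer = {dies_per_wafer(30, 1.5):.1f}, "
      f"yield = {die_yield(0.4, 1.5, 4):.3f}, die cost = ${d:.2f}")
```

Re-running with a larger `die_area` shows both effects at once: fewer dies fit on the wafer, and a smaller fraction of them are good.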

SLIDE 15

Failures and Dependability

  • Failures at any level cost money
– Integrated circuits (processor, memory)
– Disks
– Networks
  • One hour of downtime can cost millions of dollars (Amazon, Google, ..)
  • There is no "middle of the night" when downtime is acceptable
  • Systems need to be designed with fault tolerance
– Hardware
– Software

SLIDE 16

Performance and Cost

Plane               Speed      DC to Paris   Passengers   Throughput (pmph)
Boeing 747          610 mph    6.5 hours     470          286,700
BAC/Sud Concorde    1350 mph   3 hours       132          178,200

  • Time to run the task (ExTime)
– Execution time, response time, latency
  • Tasks per day, hour, week, sec, ns … (Performance)
– Throughput, bandwidth
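
The throughput column is simply speed times passengers, in passenger-miles per hour (pmph). A minimal sketch using the figures from this slide's table (the dictionary layout and function name are illustrative):

```python
# Figures from the slide; pmph = passenger-miles per hour.
planes = {
    "Boeing 747": {"mph": 610, "passengers": 470, "dc_paris_h": 6.5},
    "Concorde": {"mph": 1350, "passengers": 132, "dc_paris_h": 3.0},
}

def throughput_pmph(mph: int, passengers: int) -> int:
    """Throughput = speed * passengers."""
    return mph * passengers

for name, p in planes.items():
    print(f"{name}: latency {p['dc_paris_h']} h, "
          f"throughput {throughput_pmph(p['mph'], p['passengers']):,} pmph")
```

The Concorde wins on latency (time to run one task), while the 747 wins on throughput (tasks per unit time).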

SLIDE 17

The Bottom Line: Performance (and Cost)

"X is n times faster than Y" means

$$n = \frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)}$$

  • Speed of Concorde vs. Boeing 747
  • Throughput of Boeing 747 vs. Concorde
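
Applying this definition to the Boeing 747 / Concorde figures from the Performance and Cost slide (pmph = passenger-miles per hour) shows that the answer depends on which metric is taken as "performance":

$$n_{\text{time}} = \frac{\text{ExTime}(\text{747})}{\text{ExTime}(\text{Concorde})} = \frac{6.5\ \text{h}}{3\ \text{h}} \approx 2.17$$

$$n_{\text{speed}} = \frac{1350\ \text{mph}}{610\ \text{mph}} \approx 2.21$$

$$n_{\text{throughput}} = \frac{286{,}700\ \text{pmph}}{178{,}200\ \text{pmph}} \approx 1.61$$

So the Concorde is roughly 2.2 times faster for a single trip, while the 747 is roughly 1.6 times faster in passenger throughput.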

SLIDE 18

Metrics of Performance

  • Application: answers per month, operations per second
  • Programming Language
  • Compiler
  • ISA: (millions of) instructions per second (MIPS); (millions of) floating-point operations per second (MFLOP/s)
  • Datapath, Control: megabytes per second
  • Function Units
  • Transistors, Wires, Pins: cycles per second (clock rate)
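
MIPS and MFLOP/s are both rates derived from a count and an execution time. A minimal sketch, with a hypothetical run of 2 billion instructions (including 300 million floating-point operations) taking 4 seconds:

```python
def mips(instruction_count: float, exec_time_s: float) -> float:
    """Millions of instructions per second."""
    return instruction_count / (exec_time_s * 1e6)

def mflops(fp_operations: float, exec_time_s: float) -> float:
    """Millions of floating-point operations per second."""
    return fp_operations / (exec_time_s * 1e6)

# Hypothetical run: 2e9 instructions, 3e8 FP operations, 4 seconds.
print(mips(2e9, 4.0), mflops(3e8, 4.0))
```

Because each rate depends on what is counted, two machines with different instruction sets can report different MIPS for the same program and the same execution time.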

SLIDE 19

Computer Engineering Methodology

The methodology is a repeating cycle:

  • Evaluate existing systems for bottlenecks
  • Simulate new designs and organizations
  • Implement the next-generation system

The cycle is driven by benchmarks, workloads, technology trends, and implementation complexity.

SLIDE 20

Measurement Tools

  • Benchmarks, Traces, Mixes

  • Hardware: Cost, delay, area, power estimation
  • Simulation (many levels)

– ISA, RT (register transfer), Gate, Circuit

  • Queuing Theory
  • Rules of Thumb
  • Fundamental “Laws”/Principles
  • Understanding the limitations of any measurement tool is crucial.

SLIDE 21

Issues with Benchmark Engineering

  • Motivated by the bottom dollar: good performance on classic suites means more customers and better sales
  • Benchmark engineering limits the longevity of benchmark suites
  • Technology and applications limit the longevity of benchmark suites

SLIDE 22

SPEC: System Performance Evaluation Cooperative

  • First round, 1989
– 10 programs yielding a single number ("SPECmarks")
  • Second round, 1992
– SPECint92 (6 integer programs) and SPECfp92 (14 floating-point programs)
– "Benchmarks useful for 3 years"
  • SPEC CPU2000: 12 integer benchmarks (CINT2000) and 14 floating-point benchmarks (CFP2000)
  • SPEC CPU2006 (CINT2006, CFP2006)
  • Server benchmarks
– SPECweb
– SPECsfs
  • TPC (TPC-A, TPC-C, TPC-H, TPC-W, …)

SLIDE 23

SPEC 2000 (CINT2000) Results

SLIDE 24

SPEC 2000 (CFP2000) Results

SLIDE 25

Reporting Performance Results

  • Reproducibility
  • Apply them to publicly available benchmarks
  • Pecking/picking order
– Real programs
– Real kernels
– Toy benchmarks
– Synthetic benchmarks

SLIDE 26

How to Summarize Performance

  • Arithmetic mean (weighted arithmetic mean) tracks execution time: sum(Ti)/n or sum(Wi*Ti)
  • Harmonic mean (weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: n/sum(1/Ri) or 1/sum(Wi/Ri)

Here Ti is the execution time of program i, Ri is its rate, and the weights Wi sum to 1.
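
A small sketch of why rates should be summarized with the harmonic mean. It assumes, hypothetically, two programs that each perform the same 100 MFLOP of work, one running at 10 MFLOPS and one at 50 MFLOPS; the equal-work assumption is what makes the unweighted harmonic mean track total time.

```python
def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def harmonic_mean(rates):
    return len(rates) / sum(1 / r for r in rates)

# Hypothetical: two programs, each doing 100 MFLOP of work.
work_mflop = 100
rates = [10.0, 50.0]                      # MFLOPS on each program
times = [work_mflop / r for r in rates]   # 10 s and 2 s

print(arithmetic_mean(times))               # mean execution time
print(work_mflop / harmonic_mean(rates))    # time implied by the harmonic mean of rates
print(work_mflop / arithmetic_mean(rates))  # time implied by the arithmetic mean of rates
```

The harmonic mean of the rates reproduces the mean execution time (6 s), while the arithmetic mean of the rates implies about 3.3 s and overstates performance.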

SLIDE 27

How to Summarize Performance (Cont’d)

  • Normalized execution time is handy for scaling performance (e.g., X times faster than a SPARCstation 10)
  • But do not take the arithmetic mean of normalized execution times; use the geometric mean = (Product(Ri))^(1/n)
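
The reason for the geometric mean is that it gives the same ranking no matter which machine is used as the reference, while the arithmetic mean of normalized times does not. A sketch with hypothetical execution times (seconds) for two programs on machines A and B:

```python
import math

def geometric_mean(xs):
    return math.prod(xs) ** (1 / len(xs))

def normalize(times, reference):
    """Execution times divided by the reference machine's times."""
    return [t / r for t, r in zip(times, reference)]

# Hypothetical execution times for two programs.
a = [1.0, 1000.0]
b = [10.0, 100.0]

for ref_name, ref in (("A", a), ("B", b)):
    na, nb = normalize(a, ref), normalize(b, ref)
    print(f"ref={ref_name}: arith A={sum(na) / 2:.2f} B={sum(nb) / 2:.2f} | "
          f"geo A={geometric_mean(na):.2f} B={geometric_mean(nb):.2f}")
```

With A as the reference, the arithmetic mean says B is slower; with B as the reference, it says A is slower. The geometric mean rates the two machines equally in both cases.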

SLIDE 28

Performance Evaluation

  • “For better or worse, benchmarks shape a field”
  • Good products are created when you have:
– Good benchmarks
– Good ways to summarize performance
  • Given that sales are partly a function of performance relative to the competition, companies invest in improving the product as reported by the performance summary
  • If the benchmarks or summary are inadequate, a company must choose between improving the product for real programs and improving the product to get more sales; sales almost always wins!
  • Execution time is the measure of computer performance!

SLIDE 29

Simulations

  • When are simulations useful?
  • What are their limitations, i.e., what real-world phenomena do they not account for?
  • The larger the simulation trace, the less tractable the post-processing analysis.

SLIDE 30

Queuing Theory

  • What are the distributions of arrival rates and the values of the other parameters?
  • Are they realistic?
  • What happens when the parameters or distributions are changed?

SLIDE 31

Quantitative Principles of Computer Design

  • Make the Common Case Fast

– Amdahl’s Law

  • CPU Performance Equation

– Clock cycle time
– CPI
– Instruction count

  • Principles of Locality
  • Take advantage of Parallelism

SLIDE 32

Amdahl's Law

Speedup due to enhancement E:

$$\text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E}$$

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected.

SLIDE 33

Amdahl’s Law

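$$\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left[(1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}\right]$$

$$\text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$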
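
Amdahl's Law translates directly into a one-line function. The sketch below uses a hypothetical enhancement that applies to 40% of the original execution time, and shows how the unenhanced 60% caps the overall speedup at 1 / 0.6, about 1.67, no matter how large the local speedup becomes.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only `fraction_enhanced` of the original
    execution time is accelerated by `speedup_enhanced`."""
    new_time = (1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1 / new_time

# Hypothetical: 40% of the time is enhanced; try increasing local speedups.
for s in (2, 10, 100, 1e9):
    print(s, round(amdahl_speedup(0.4, s), 4))
```

Going from a 10x to a 100x local speedup gains relatively little overall, which is the quantitative form of "make the common case fast".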

SLIDE 34

Amdahl’s Law (Cont’d)

  • Floating-point instructions are improved to run 2X faster, but only 10% of actual instructions are FP

ExTime_new =

Speedup_overall =
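
A worked version of this exercise, assuming (as the example intends) that the FP instructions account for 10% of the original execution time, so Fraction_enhanced = 0.1 and Speedup_enhanced = 2:

$$\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left[(1 - 0.1) + \frac{0.1}{2}\right] = 0.95 \times \text{ExTime}_{\text{old}}$$

$$\text{Speedup}_{\text{overall}} = \frac{1}{0.95} \approx 1.053$$

Doubling FP speed therefore buys only about a 5% overall improvement.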

SLIDE 35

CPU Performance Equation

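$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

Which factors affect each term:

                Inst Count   CPI    Clock Rate
Program         X
Compiler        X            (X)
Inst. Set       X            X
Organization                 X      X
Technology                          X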
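
The CPU performance equation can be evaluated term by term. This sketch uses a hypothetical program (10^9 instructions, CPI 1.5, 2 GHz clock) and a hypothetical compiler change that removes 10% of the instructions but raises CPI to 1.6, illustrating that the compiler affects both instruction count and CPI.

```python
def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = Instructions x (Cycles / Instruction) x (Seconds / Cycle)."""
    cycle_time_s = 1 / clock_rate_hz
    return instruction_count * cpi * cycle_time_s

# Hypothetical baseline program.
base = cpu_time(1e9, 1.5, 2e9)
# Hypothetical compiler change: 10% fewer instructions, but CPI rises to 1.6.
tuned = cpu_time(0.9e9, 1.6, 2e9)
print(base, tuned)
```

Neither instruction count nor CPI alone predicts the result; only their product with the cycle time does.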

SLIDE 36

Cycles Per Instruction

"Average Cycles per Instruction"

$$\text{CPI} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Cycles}}{\text{Instruction Count}}$$

$$\text{CPU time} = \text{CycleTime} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$$

where I_i is the number of executed instructions of type i.

"Instruction Frequency"

$$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i \qquad \text{where } F_i = \frac{I_i}{\text{Instruction Count}}$$

Invest resources where time is spent!

SLIDE 37

Example: Calculating CPI

Base Machine (Reg / Reg), typical mix:

Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        .5       (33%)
Load     20%    2        .4       (27%)
Store    10%    2        .2       (13%)
Branch   20%    2        .4       (27%)

Total CPI = 1.5
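
The table follows from CPI = sum of CPI_i * F_i. A minimal sketch that recomputes the total CPI and each class's share of execution time from the slide's instruction mix (the dictionary layout is illustrative):

```python
# Instruction mix from the slide: op -> (frequency, cycles)
mix = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}

def average_cpi(mix):
    """CPI = sum over instruction classes of CPI_i * F_i."""
    return sum(freq * cycles for freq, cycles in mix.values())

def time_share(mix):
    """Fraction of execution time spent in each instruction class."""
    cpi = average_cpi(mix)
    return {op: freq * cycles / cpi for op, (freq, cycles) in mix.items()}

print(round(average_cpi(mix), 2))
for op, share in time_share(mix).items():
    print(f"{op:6s} {share:.0%}")
```

Editing a cycle count in `mix` (for example, making loads take 1 cycle) shows immediately how much total CPI would improve, in the spirit of "invest resources where time is spent".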