


CS252/ Culler Lec 1. 1 1/ 22/ 02

CS252 Graduate Computer Architecture
Lecture 1: Introduction

January 22, 2002
Prof. David E. Culler
Computer Science 252, Spring 2002

CS252/ Culler Lec 1. 2 1/ 22/ 02

Outline

  • Why Take CS252?
  • Fundamental Abstractions & Concepts
  • Instruction Set Architecture & Organization
  • Administrivia
  • Pipelined Instruction Processing
  • Performance
  • The Memory Abstraction
  • Summary

CS252/ Culler Lec 1. 3 1/ 22/ 02

Why take CS252?

  • To design the next great instruction set? ...well...

– instruction set architecture has largely converged
– especially in the desktop / server / laptop space
– dictated by powerful market forces

  • Tremendous organizational innovation relative to established ISA abstractions
  • Many new instruction sets or equivalent

– embedded space, controllers, specialized devices, ...

  • Design, analysis, implementation concepts vital to all aspects of EE & CS

– systems, PL, theory, circuit design, VLSI, comm.

  • Equip you with an intellectual toolbox for dealing with a host of systems design challenges

CS252/ Culler Lec 1. 4 1/ 22/ 02

Example Hot Developments ca. 2002

  • Manipulating the instruction set abstraction

– Itanium: translate ISA64 -> micro-op sequences
– Transmeta: continuous dynamic translation of IA32
– Tensilica: synthesize the ISA from the application
– reconfigurable HW

  • Virtualization

– VMware: emulate full virtual machine
– JIT: compile to abstract virtual machine, dynamically compile to host

  • Parallelism

– wide issue, dynamic instruction scheduling, EPIC
– multithreading (SMT)
– chip multiprocessors

  • Communication

– network processors, network interfaces

  • Exotic explorations

– nanotechnology, quantum computing

CS252/ Culler Lec 1. 5 1/ 22/ 02

Forces on Computer Architecture

[Figure: forces acting on Computer Architecture: Technology, Programming Languages, Operating Systems, History, Applications. (A = F / M)]

CS252/ Culler Lec 1. 6 1/ 22/ 02

Amazing Underlying Technology Change


CS252/ Culler Lec 1. 7 1/ 22/ 02

A take on Moore’s Law

[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year (1970 to 2005) for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000; annotated eras: bit-level parallelism, instruction-level, thread-level (?)]

CS252/ Culler Lec 1. 8 1/ 22/ 02

Technology Trends

  • Clock Rate: ~30% per year
  • Transistor Density: ~35%
  • Chip Area: ~15%
  • Transistors per chip: ~55%
  • Total Performance Capability: ~100%
  • by the time you graduate...

– 3x clock rate (3-4 GHz)
– 10x transistor count (1 billion transistors)
– 30x raw capability

  • plus 16x DRAM density, 32x disk density
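
The annual rates above compound, which is where the "by the time you graduate" multipliers come from. The sketch below is a rough check only: the five-year span (`YEARS`) is an assumption, since the slide does not say how long graduation takes, and the function name is illustrative.

```python
# Compound the slide's annual growth rates over an assumed time span.
# YEARS = 5 is an assumption; the slide only says "by the time you graduate".
YEARS = 5

def growth(annual_rate, years=YEARS):
    """Total growth factor after compounding annual_rate for `years` years."""
    return (1.0 + annual_rate) ** years

clock = growth(0.30)        # clock rate, ~30% per year
transistors = growth(0.55)  # transistors per chip, ~55% per year
capability = growth(1.00)   # total performance capability, ~100% per year

# Transistors per chip grow as density times area: 1.35 * 1.15 is about 1.55.
per_chip_rate = 1.35 * 1.15 - 1.0

print(f"clock rate:     {clock:.1f}x")
print(f"transistors:    {transistors:.1f}x")
print(f"capability:     {capability:.0f}x")
print(f"density x area: {per_chip_rate:.0%} per year")
```

Under the five-year assumption the factors land near 3.7x, 8.9x, and 32x, in line with the slide's rounded 3x, 10x, and 30x.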

CS252/ Culler Lec 1. 9 1/ 22/ 02

Performance Trends

[Figure: relative performance (log scale, 0.1 to 100) versus year (1965 to 1995) for supercomputers, mainframes, minicomputers, and microprocessors]

CS252/ Culler Lec 1. 10 1/ 22/ 02

Measurement and Evaluation

Architecture is an iterative process

  • searching the space of possible designs
  • at all levels of computer systems

[Figure: design loop labeled Creativity, Design, Analysis, and Cost / Performance Analysis; outcomes labeled Good Ideas, Mediocre Ideas, Bad Ideas]

CS252/ Culler Lec 1. 11 1/ 22/ 02

What is “Computer Architecture”?

[Figure: levels of abstraction: Application, Operating System, Compiler, Firmware, Instruction Set Architecture (Instr. Set Proc., I/O system), Datapath & Control, Digital Design, Circuit Design, Layout]

  • Coordination of many levels of abstraction
  • Under a rapidly changing set of forces
  • Design, Measurement, and Evaluation

CS252/ Culler Lec 1. 12 1/ 22/ 02

Coping with CS 252

  • Students with too varied background?

– In the past, CS grad students took written prelim exams on undergraduate material in hardware, software, and theory
– 1st 5 weeks reviewed background, helped 252, 262, 270
– Prelims were dropped => some unprepared for CS 252?

  • In-class exam on Tues Jan. 29 (30 mins)

– Doesn’t affect grade, only admission into class
– 2 grades: Admitted or audit / take CS 152 1st
– Improves your experience if we recapture common background

  • Review: Chapter 1, CS 152 home page, maybe “Computer Organization and Design (COD) 2/e”

– Chapters 1 to 8 of COD if never took prerequisite
– If took a class, be sure COD Chapters 2, 6, 7 are familiar
– Copies in Bechtel Library on 2-hour reserve

  • FAST review this week of basic concepts

CS252/ Culler Lec 1. 13 1/ 22/ 02

Review of Fundamental Concepts

  • Instruction Set Architecture
  • Machine Organization
  • Instruction Execution Cycle
  • Pipelining
  • Memory
  • Bus (Peripheral Hierarchy)
  • Performance Iron Triangle

CS252/ Culler Lec 1. 14 1/ 22/ 02

The Instruction Set: a Critical Interface

[Figure: the instruction set drawn as the interface between software and hardware]

CS252/ Culler Lec 1. 15 1/ 22/ 02

Instruction Set Architecture

... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation. – Amdahl, Blaauw, and Brooks, 1964

  • Organization of Programmable Storage
  • Data Types & Data Structures: Encodings & Representations
  • Instruction Formats
  • Instruction (or Operation Code) Set
  • Modes of Addressing and Accessing Data Items and Instructions
  • Exceptional Conditions

CS252/ Culler Lec 1. 16 1/ 22/ 02

Organization

Logic Designer's View

[Figure: ISA Level above FUs & Interconnect]

  • Capabilities & Performance Characteristics of Principal Functional Units

– (e.g., Registers, ALU, Shifters, Logic Units, ...)

  • Ways in which these components are interconnected
  • Information flows between components
  • Logic and means by which such information flow is controlled
  • Choreography of FUs to realize the ISA
  • Register Transfer Level (RTL) Description

CS252/ Culler Lec 1. 17 1/ 22/ 02

Review: MIPS R3000 (core)

[Figure: registers r0, r1, ..., r31, plus PC, lo, hi]

Programmable storage

– 2^32 x bytes
– 31 x 32-bit GPRs (R0=0)
– 32 x 32-bit FP regs (paired DP)
– HI, LO, PC

Data types? Format? Addressing modes?

Arithmetic logical

Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU, AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI, SLL, SRL, SRA, SLLV, SRLV, SRAV

Memory Access

LB, LBU, LH, LHU, LW, LWL, LWR, SB, SH, SW, SWL, SWR

Control

J, JAL, JR, JALR, BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL

32-bit instructions on word boundary

CS252/ Culler Lec 1. 18 1/ 22/ 02

Review: Basic ISA Classes

Accumulator:
    1 address      add A           acc ← acc + mem[A]
    1+x address    addx A          acc ← acc + mem[A + x]
Stack:
    0 address      add             tos ← tos + next
General Purpose Register:
    2 address      add A B         EA(A) ← EA(A) + EA(B)
    3 address      add A B C       EA(A) ← EA(B) + EA(C)
Load/Store:
    3 address      add Ra Rb Rc    Ra ← Rb + Rc
                   load Ra Rb      Ra ← mem[Rb]
                   store Ra Rb     mem[Rb] ← Ra
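
To make the classes concrete, the sketch below computes C = A + B in three of the styles listed above and counts the instructions each one needs. It is a toy model, not a real simulator: the dict-based memory, the function names, and the push/pop and load/store mnemonics in the comments are illustrative assumptions.

```python
# Toy models of three ISA classes computing mem["C"] = mem["A"] + mem["B"].
# The dict-based memory and the mnemonics in comments are assumptions.

def run_accumulator(mem):
    acc = mem["A"]            # load A     acc <- mem[A]
    acc = acc + mem["B"]      # add B      acc <- acc + mem[B]
    mem["C"] = acc            # store C    mem[C] <- acc
    return 3                  # instructions executed

def run_stack(mem):
    stack = []
    stack.append(mem["A"])    # push A
    stack.append(mem["B"])    # push B
    nxt = stack.pop()         # add        tos <- tos + next
    tos = stack.pop()
    stack.append(tos + nxt)
    mem["C"] = stack.pop()    # pop C
    return 4

def run_load_store(mem):
    regs = {}
    regs["r1"] = mem["A"]                  # load r1, A
    regs["r2"] = mem["B"]                  # load r2, B
    regs["r3"] = regs["r1"] + regs["r2"]   # add r3, r1, r2
    mem["C"] = regs["r3"]                  # store r3, C
    return 4

for run in (run_accumulator, run_stack, run_load_store):
    memory = {"A": 2, "B": 3}
    count = run(memory)
    print(run.__name__, "C =", memory["C"], "instructions =", count)
```

The load/store version uses more instructions than the accumulator version for this one statement, but every arithmetic instruction touches only registers, which is the property the later pipelining slides rely on.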


CS252/ Culler Lec 1. 19 1/ 22/ 02

Instruction Formats

[Figure: variable, fixed, and hybrid instruction format layouts]

  • Addressing modes

– each operand requires address specifier => variable format

  • code size => variable length instructions
  • performance => fixed length instructions

– simple decoding, predictable operations

  • With load/store instruction arch, only one memory address and few addressing modes
  • => simple format, address mode given by opcode

CS252/ Culler Lec 1. 20 1/ 22/ 02

MIPS Addressing Modes & Formats

  • Simple addressing modes
  • All instructions 32 bits wide

[Figure: one instruction layout per addressing mode.
 Register (direct): op, rs, rt, rd; operand in register.
 Immediate: op, rs, rt, immed.
 Base+index: op, rs, rt, immed; register + immed addresses Memory.
 PC-relative: op, rs, rt, immed; PC + immed addresses Memory.]

  • Register Indirect?

CS252/ Culler Lec 1. 21 1/ 22/ 02

Cray-1: the original RISC

[Figure: two instruction formats. Register-Register: Op, Rd, Rs1, R2. Load, Store and Branch: Op, Rd, Rs1, Immediate. Bit positions marked at 15, 9, 8, 6, 5, 3, 2.]

CS252/ Culler Lec 1. 22 1/ 22/ 02

VAX-11: the canonical CISC

  • Rich set of orthogonal address modes

– immediate, offset, indexed, autoinc/dec, indirect, indirect+offset
– applied to any operand

  • Simple and complex instructions

– synchronization instructions
– data structure operations (queues)
– polynomial evaluation

[Figure: instruction layout: OpCode followed by A/M specifiers, bytes 0, 1, ..., n, ..., m]

Variable format, 2 and 3 address instructions

CS252/ Culler Lec 1. 23 1/ 22/ 02

Review: Load/Store Architectures

[Figure: MEM and reg; three fixed formats: op r r r; op r r immed; op offset]

  • Substantial increase in instructions
  • Decrease in data BW (due to many registers)
  • Even more significant decrease in CPI (pipelining)
  • Cycle time, Real estate, Design time, Design complexity
  • 3 address GPR
  • Register to register arithmetic
  • Load and store with simple addressing modes (reg + immediate)
  • Simple conditionals

– compare ops + branch z
– compare&branch
– condition code + branch on condition

  • Simple fixed-format encoding

CS252/ Culler Lec 1. 24 1/ 22/ 02

MIPS R3000 ISA (Summary)

  • Instruction Categories

– Load/Store
– Computational
– Jump and Branch
– Floating Point
    » coprocessor
– Memory Management
– Special

Registers: R0 - R31, PC, HI, LO

3 Instruction Formats: all 32 bits wide

    OP | rs | rt | rd | sa | funct
    OP | rs | rt | immediate
    OP | jump target
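
Because all three formats are 32 bits wide with fields in fixed positions, decoding reduces to shifts and masks. The sketch below assumes the standard MIPS I field widths (6-bit OP, 5-bit rs/rt/rd/sa, 6-bit funct, 16-bit immediate, 26-bit jump target); the function and key names are illustrative, not taken from the slides.

```python
# Split a 32-bit MIPS instruction word into the fields of the three formats.
# Field widths follow the standard MIPS I layout; names are illustrative.

def decode(word):
    return {
        "op":        (word >> 26) & 0x3F,   # bits 31..26, all formats
        "rs":        (word >> 21) & 0x1F,   # bits 25..21
        "rt":        (word >> 16) & 0x1F,   # bits 20..16
        "rd":        (word >> 11) & 0x1F,   # bits 15..11 (register format)
        "sa":        (word >> 6) & 0x1F,    # bits 10..6  (shift amount)
        "funct":     word & 0x3F,           # bits 5..0
        "immediate": word & 0xFFFF,         # bits 15..0  (immediate format)
        "target":    word & 0x3FFFFFF,      # bits 25..0  (jump format)
    }

# 0x012A4020 is the encoding of: add $8, $9, $10
fields = decode(0x012A4020)
print({k: fields[k] for k in ("op", "rs", "rt", "rd", "sa", "funct")})
```

The decoder extracts every field unconditionally; which subset is meaningful depends on the opcode, so a real decoder would select among them after reading `op`.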


CS252/ Culler Lec 1. 25 1/ 22/ 02

CS 252 Administrivia

  • TA: Jason Hill, jhill@cs.berkeley.edu
  • All assignments, lectures via WWW page: http://www.cs.berkeley.edu/~culler/252S02/
  • 2 Quizzes: 3/21 and ~14th week (maybe take home)
  • Text:

– Pages of 3rd edition of Computer Architecture: A Quantitative Approach
    » available from Cindy Palwick (MWF) or Jeanette Cook ($30 1-5)
– “Readings in Computer Architecture” by Hill et al.

  • In class, prereq quiz 1/29, last 30 minutes

– Improves 252 experience if we recapture common background
– Bring 1 sheet of paper with notes on both sides
– Doesn’t affect grade, only admission into class
– 2 grades: Admitted or audit / take CS 152 1st
– Review: Chapter 1, CS 152 home page, maybe “Computer Organization and Design (COD) 2/e”
– If did take a class, be sure COD Chapters 2, 5, 6, 7 are familiar
– Copies in Bechtel Library on 2-hour reserve

CS252/ Culler Lec 1. 26 1/ 22/ 02

Research Paper Reading

  • As graduate students, you are now researchers.
  • Most information of importance to you will be in research papers.
  • Ability to rapidly scan and understand research papers is key to your success.
  • So: 1-2 papers / week in this course

– Quick 1 paragraph summaries will be due in class
– Important supplement to book.
– Will discuss papers in class

  • Papers: “Readings in Computer Architecture” or online
  • Think about methodology and approach

CS252/ Culler Lec 1. 27 1/ 22/ 02

First Assignment (due Tu 2/ 5)

  • Read

– Amdahl, Blaauw, and Brooks, Architecture of the IBM System/360
– Lonergan and King, B5000

  • Four each prepare for in-class debate 1/29
  • Rest write analysis of the debate
  • Read “Programming the EDSAC”, Campbell-Kelly

– write subroutine sum(A, n) to sum an array A of n numbers
– write recursive fact(n) = if n==1 then 1 else n*fact(n-1)

CS252/ Culler Lec 1. 28 1/ 22/ 02

Grading

  • 10% Homeworks (work in pairs)
  • 40% Examinations (2 Quizzes)
  • 40% Research Project (work in pairs)

– Draft of Conference Quality Paper
– Transition from undergrad to grad student
– Berkeley wants you to succeed, but you need to show initiative
– pick topic
– meet 3 times with faculty/TA to see progress
– give oral presentation
– give poster session
– written report like conference paper
– 3 weeks work full time for 2 people (over more weeks)
– Opportunity to do “research in the small” to help make transition from good student to research colleague

  • 10% Class Participation

CS252/ Culler Lec 1. 29 1/ 22/ 02

Course Profile

  • 3 weeks: basic concepts

– instruction processing, storage

  • 3 weeks: hot areas

– latency tolerance, low power, embedded design, network processors, NIs, virtualization

  • Proposals due
  • 2 weeks: advanced microprocessor design
  • Quiz & Spring Break
  • 3 weeks: Parallelism (MPs, CMPs, Networks)
  • 2 weeks: Methodology / Analysis / Theory
  • 1 week: Topics: nano, quantum
  • 1 week: Project Presentations

CS252/ Culler Lec 1. 30 1/ 22/ 02

Levels of Representation (61C Review)

High Level Language Program

    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;

(Compiler)

Assembly Language Program

    lw $15, 0($2)
    lw $16, 4($2)
    sw $16, 0($2)
    sw $15, 4($2)

(Assembler)

Machine Language Program

    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111

(Machine Interpretation)

Control Signal Specification

    ALUOP[0:3] <= InstReg[9:11] & MASK


CS252/ Culler Lec 1. 31 1/ 22/ 02

Execution Cycle

  • Instruction Fetch: Obtain instruction from program storage
  • Instruction Decode: Determine required actions and instruction size
  • Operand Fetch: Locate and obtain operand data
  • Execute: Compute result value or status
  • Result Store: Deposit results in storage for later use
  • Next Instruction: Determine successor instruction

CS252/ Culler Lec 1. 32 1/ 22/ 02

What’s a Clock Cycle?

  • Old days: 10 levels of gates
  • Today: determined by numerous time-of-flight issues + gate delays

– clock propagation, wire lengths, drivers

[Figure: combinational logic placed between latches or registers]

CS252/ Culler Lec 1. 33 1/ 22/ 02

Fast, Pipelined Instruction Interpretation

[Figure: stages Next Instruction (NI), Instruction Fetch (IF), Decode & Operand Fetch (D), Execute (E), Store Results (W), connected through Instruction Address, Instruction Register, Operand Registers, and Result Registers (Registers or Mem). Over time, successive instructions each run NI IF D E W, overlapped with their neighbors.]

CS252/ Culler Lec 1. 34 1/ 22/ 02

Sequential Laundry

  • Sequential laundry takes 6 hours for 4 loads
  • If they learned pipelining, how long would laundry take?

[Figure: task order (loads A, B, C, D) versus time, 6 PM to midnight; each load takes 30 + 40 + 20 minutes, and the loads run back to back]

CS252/ Culler Lec 1. 35 1/ 22/ 02

Pipelined Laundry: Start work ASAP

  • Pipelined laundry takes 3.5 hours for 4 loads

[Figure: task order (loads A, B, C, D) versus time from 6 PM; overlapped loads occupy intervals of 30, 40, 40, 40, 40, 20 minutes]
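
The 6-hour and 3.5-hour totals can be reproduced with a few lines of arithmetic. The sketch below uses the stage times from the figure (30, 40, and 20 minutes) and assumes a finished load can wait between stages without blocking the stage behind it; the function names are illustrative.

```python
# Laundry timing from the slides: stages of 30, 40, and 20 minutes, 4 loads.
STAGES = [30, 40, 20]   # minutes per stage, from the figure
LOADS = 4

def sequential_minutes(stages, loads):
    # Each load finishes every stage before the next load starts.
    return loads * sum(stages)

def pipelined_minutes(stages, loads):
    # finish[s] is the time at which the previous load left stage s.
    finish = [0] * len(stages)
    for _ in range(loads):
        ready = 0  # time at which this load is ready for its next stage
        for s, duration in enumerate(stages):
            start = max(ready, finish[s])  # wait for the load and the stage
            finish[s] = start + duration
            ready = finish[s]
    return finish[-1]

print(sequential_minutes(STAGES, LOADS) / 60, "hours sequential")
print(pipelined_minutes(STAGES, LOADS) / 60, "hours pipelined")
```

The pipelined total is 30 + 4 x 40 + 20 minutes: once the pipeline fills, one load completes every 40 minutes, the length of the slowest stage.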

CS252/ Culler Lec 1. 36 1/ 22/ 02

Pipelining Lessons

  • Pipelining doesn’t help latency of a single task; it helps throughput of the entire workload
  • Pipeline rate limited by slowest pipeline stage
  • Multiple tasks operating simultaneously
  • Potential speedup = Number of pipe stages
  • Unbalanced lengths of pipe stages reduce speedup
  • Time to “fill” pipeline and time to “drain” it reduce speedup

[Figure: task order (loads A, B, C, D) versus time, 6 PM to 9 PM; overlapped loads occupy intervals of 30, 40, 40, 40, 40, 20 minutes]


CS252/ Culler Lec 1. 37 1/ 22/ 02

Instruction Pipelining

  • Execute billions of instructions, so throughput is what matters

– except when?

  • What is desirable in instruction sets for pipelining?

– Variable length instructions vs. all instructions same length?
– Memory operands part of any operation vs. memory operands only in loads or stores?
– Register operand many places in instruction format vs. registers located in same place?

CS252/ Culler Lec 1. 38 1/ 22/ 02

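Example: MIPS (Note register location)

[Figure: four 32-bit instruction formats, with field boundaries marked at bits 31, 26, 25, 21, 20, 16, 15, 11, 10, 6, 5.
 Register-Register: Op (31-26), Rs1 (25-21), Rs2 (20-16), Rd (15-11), Opx (low bits).
 Register-Immediate: Op (31-26), Rs1 (25-21), Rd (20-16), immediate (15-0).
 Branch: Op (31-26), Rs1 (25-21), Rs2/Opx (20-16), immediate (15-0).
 Jump / Call: Op (31-26), target (25-0).]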

CS252/ Culler Lec 1. 39 1/ 22/ 02

5 Steps of MIPS Datapath

Figure 3.1, Page 130, CA:AQA 2e

[Figure: unpipelined datapath in five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Components: instruction Memory, Adder (+4), Next SEQ PC, Next PC, Reg File (RS1, RS2, RD), Sign Extend (Imm), Zero? test, ALU, Data Memory, LMD, MUXes, WB Data]

CS252/ Culler Lec 1. 40 1/ 22/ 02

5 Steps of MIPS Datapath

Figure 3.4, Page 134, CA:AQA 2e

[Figure: pipelined datapath with the same five steps (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back) separated by pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; RD and Next SEQ PC are carried along through the stages]

  • Data stationary control

– local decode for each instruction phase / pipeline stage

CS252/ Culler Lec 1. 41 1/ 22/ 02

Visualizing Pipelining

Figure 3.3, Page 133, CA:AQA 2e

[Figure: instruction order versus time (clock cycles 1 to 7); four instructions, each passing through Ifetch, Reg, ALU, DMem, Reg, with each instruction starting one cycle after the previous one]

CS252/ Culler Lec 1. 42 1/ 22/ 02

It's Not That Easy for Computers

  • Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle

– Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
– Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)
– Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).


CS252/ Culler Lec 1. 43 1/ 22/ 02

Review of Performance

CS252/ Culler Lec 1. 44 1/ 22/ 02

Which is faster?

Plane               Speed      DC to Paris   Passengers   Throughput (pmph)
Boeing 747          610 mph    6.5 hours     470          286,700
BAC/Sud Concorde    1350 mph   3 hours       132          178,200

  • Time to run the task (ExTime)

– Execution time, response time, latency

  • Tasks per day, hour, week, sec, ns ... (Performance)

– Throughput, bandwidth
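
The plane figures above answer "which is faster?" differently depending on the metric. The sketch below recomputes both views from the slide's numbers; treating throughput as passengers times speed (passenger-miles per hour) matches the table's values, and the variable names are illustrative.

```python
# Latency versus throughput for the two planes, using the slide's numbers.
planes = {
    "Boeing 747": {"speed_mph": 610, "hours": 6.5, "passengers": 470},
    "Concorde":   {"speed_mph": 1350, "hours": 3.0, "passengers": 132},
}

def throughput_pmph(plane):
    # Passenger-miles per hour: passengers carried times speed.
    return plane["passengers"] * plane["speed_mph"]

# Concorde wins on latency (time for one trip) ...
latency_ratio = planes["Boeing 747"]["hours"] / planes["Concorde"]["hours"]
# ... while the 747 wins on throughput (work done per unit time).
throughput_ratio = (throughput_pmph(planes["Boeing 747"])
                    / throughput_pmph(planes["Concorde"]))

print(f"Concorde is {latency_ratio:.2f}x faster per trip")
print(f"747 has {throughput_ratio:.2f}x the throughput")
```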

CS252/ Culler Lec 1. 45 1/ 22/ 02

Definitions

  • Performance is in units of things per sec

– bigger is better

  • If we are primarily concerned with response time

– performance(x) = 1 / execution_time(x)

"X is n times faster than Y" means

$$ n = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = \frac{\text{Execution\_time}(Y)}{\text{Execution\_time}(X)} $$

CS252/ Culler Lec 1. 46 1/ 22/ 02

Computer Performance

$$ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} $$

The three factors are instruction count, CPI, and cycle time.

                Inst Count   CPI    Clock Rate
Program         X
Compiler        X            (X)
Inst. Set       X            X
Organization                 X      X
Technology                          X
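
As a quick illustration of how the three factors trade off, the sketch below evaluates the CPU time product for two hypothetical configurations. All workload numbers are made-up assumptions for illustration, as is the function name.

```python
# CPU time = instruction count x CPI x cycle time.
# The workload numbers below are made-up assumptions for illustration.

def cpu_time_seconds(inst_count, cpi, clock_rate_hz):
    cycle_time = 1.0 / clock_rate_hz   # seconds per cycle
    return inst_count * cpi * cycle_time

# Baseline: 1e9 instructions at CPI 1.5 on a 1 GHz clock.
base = cpu_time_seconds(inst_count=1_000_000_000, cpi=1.5, clock_rate_hz=1e9)

# Hypothetical compiler change: 10% fewer instructions, but CPI rises to 1.6.
tuned = cpu_time_seconds(inst_count=900_000_000, cpi=1.6, clock_rate_hz=1e9)

print(f"base:  {base:.2f} s")
print(f"tuned: {tuned:.2f} s")
```

In this made-up case the change still pays off, because the drop in instruction count outweighs the rise in CPI; the point is that no single factor can be judged in isolation.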

CS252/ Culler Lec 1. 47 1/ 22/ 02

Cycles Per Instruction (Throughput)

“Average Cycles per Instruction”

CPI = (CPU Time × Clock Rate) / Instruction Count = Cycles / Instruction Count

$$ \text{CPU Time} = \text{Cycle Time} \times \sum_{j=1}^{n} CPI_j \times I_j $$

“Instruction Frequency”

$$ CPI = \sum_{j=1}^{n} CPI_j \times F_j \quad \text{where} \quad F_j = \frac{I_j}{\text{Instruction Count}} $$

CS252/ Culler Lec 1. 48 1/ 22/ 02

Example: Calculating CPI bottom up

Base Machine (Reg / Reg), typical mix of instruction types in program

Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        .5       (33%)
Load     20%    2        .4       (27%)
Store    10%    2        .2       (13%)
Branch   20%    2        .4       (27%)
Total                    1.5
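
The bottom-up calculation is a weighted sum over the instruction mix above. The sketch below recomputes the overall CPI and each class's share of execution time from the slide's frequencies and cycle counts; the names `MIX`, `weighted_cpi`, and `time_share` are illustrative.

```python
# Bottom-up CPI from the instruction mix: CPI = sum of freq_j * cycles_j.
MIX = {  # op class: (frequency, cycles), taken from the slide
    "ALU":    (0.50, 1),
    "Load":   (0.20, 2),
    "Store":  (0.10, 2),
    "Branch": (0.20, 2),
}

def weighted_cpi(mix):
    return sum(freq * cycles for freq, cycles in mix.values())

def time_share(mix):
    # Fraction of total cycles spent in each instruction class.
    total = weighted_cpi(mix)
    return {op: freq * cycles / total for op, (freq, cycles) in mix.items()}

cpi = weighted_cpi(MIX)
shares = time_share(MIX)
print(f"CPI = {cpi:.1f}")
for op, share in shares.items():
    print(f"{op:7s} {share:.0%}")
```

Note that ALU operations are half of all instructions but only a third of the time, which is why frequency alone is a poor guide to where optimization effort should go.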


CS252/ Culler Lec 1. 49 1/ 22/ 02

Example: Branch Stall Impact

  • Assume CPI = 1.0 ignoring branches (ideal)
  • Assume solution was stalling for 3 cycles
  • If 30% branch, Stall 3 cycles on 30%

Op       Freq   Cycles   CPI(i)   (% Time)
Other    70%    1        .7       (37%)
Branch   30%    4        1.2      (63%)

  • => new CPI = 1.9
  • New machine is 1/1.9 = 0.52 times faster (i.e. slow!)

CS252/ Culler Lec 1. 50 1/ 22/ 02

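Speed Up Equation for Pipelining

$$ CPI_{\text{pipelined}} = \text{Ideal CPI} + \text{Average stall cycles per inst} $$

$$ \text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}} $$

For simple RISC pipeline, CPI = 1:

$$ \text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}} $$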
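
The speedup equation can be evaluated directly. The sketch below plugs in the branch-stall numbers from the earlier example (30% branches, 3 stall cycles each); the pipeline depth of 5 and the equal cycle times are assumptions for illustration, and the function name is hypothetical.

```python
# Pipeline speedup = (ideal CPI * depth) / (ideal CPI + stall CPI)
#                    * (unpipelined cycle time / pipelined cycle time)

def pipeline_speedup(depth, stall_cpi, ideal_cpi=1.0, cycle_ratio=1.0):
    # cycle_ratio is unpipelined cycle time divided by pipelined cycle time.
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * cycle_ratio

# 30% branches, each stalling 3 cycles: 0.9 stall cycles per instruction.
stall = 0.30 * 3

# depth=5 is an assumption (a five-stage pipeline); cycle times taken as equal.
speedup = pipeline_speedup(depth=5, stall_cpi=stall)
ideal = pipeline_speedup(depth=5, stall_cpi=0.0)

print(f"ideal speedup:       {ideal:.2f}")
print(f"with branch stalls:  {speedup:.2f}")
```

With these assumed inputs, stalls cut the five-stage pipeline's speedup from 5 to roughly 2.6, which is the same factor of 1.9 seen in the branch stall example.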

CS252/ Culler Lec 1. 51 1/ 22/ 02

Now, Review of Memory Hierarchy

CS252/ Culler Lec 1. 52 1/ 22/ 02

The Memory Abstraction

  • Association of <name, value> pairs

– typically named as byte addresses
– often values aligned on multiples of size

  • Sequence of Reads and Writes
  • Write binds a value to an address
  • Read of addr returns most recently written value bound to that address

[Figure: memory interface signals: address (name), command (R/W), data (W), data (R), done]

CS252/ Culler Lec 1. 53 1/ 22/ 02

Recap: Who Cares About the Memory Hierarchy?

Processor-DRAM Memory Gap (latency)

[Figure: performance (log scale, 1 to 1000) versus time (1980 to 2000) for CPU and DRAM. Labels: “Joy’s Law”; µProc 60%/yr. (2X/1.5yr); DRAM 9%/yr. (2X/10 yrs); Processor-Memory Performance Gap: (grows 50% / year)]
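
The gap's growth rate follows from the two annual rates on the plot. The sketch below derives it, along with the doubling time implied by each rate; the constant and function names are illustrative, and the rates are the slide's rounded figures.

```python
import math

# Annual improvement rates quoted on the slide.
CPU_RATE = 0.60    # processor performance, 60% per year
DRAM_RATE = 0.09   # DRAM latency improvement, 9% per year

def doubling_time_years(rate):
    # Years needed to double when compounding at `rate` per year.
    return math.log(2) / math.log(1 + rate)

# The gap is the ratio of the two curves, so its yearly growth is the
# ratio of the yearly factors.
gap_growth = (1 + CPU_RATE) / (1 + DRAM_RATE) - 1

def gap_after(years):
    return ((1 + CPU_RATE) / (1 + DRAM_RATE)) ** years

print(f"gap grows {gap_growth:.0%} per year")
print(f"CPU doubles every {doubling_time_years(CPU_RATE):.1f} years")
print(f"gap after 10 years: {gap_after(10):.0f}x")
```

The computed gap growth is about 47% per year, which the slide rounds to 50%. Compounding 9% per year doubles in roughly 8 years rather than 10, so the slide's figures are best read as approximate.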

CS252/ Culler Lec 1. 54 1/ 22/ 02

Levels of the Memory Hierarchy

Level            Capacity            Access Time              Cost
CPU Registers    100s Bytes          <1s ns
Cache            10s-100s K Bytes    1-10 ns                  $10/MByte
Main Memory      M Bytes             100ns-300ns              $1/MByte
Disk             10s G Bytes         10 ms (10,000,000 ns)    $0.0031/MByte
Tape             infinite            sec-min                  $0.0014/MByte

Between levels        Staging Xfer Unit    Managed by         Size
Registers / Cache     Instr. Operands      prog./compiler     1-8 bytes
Cache / Memory        Blocks               cache cntl         8-128 bytes
Memory / Disk         Pages                OS                 512-4K bytes
Disk / Tape           Files                user/operator      Mbytes

Upper Level: faster. Lower Level: larger.


CS252/ Culler Lec 1. 55 1/ 22/ 02

The Principle of Locality

  • The Principle of Locality:

– Programs access a relatively small portion of the address space at any instant of time.

  • Two Different Types of Locality:

– Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
– Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straightline code, array access)

  • Last 15 years, HW (hardware) relied on locality for speed

CS252/ Culler Lec 1. 56 1/ 22/ 02

Memory Hierarchy: Terminology

  • Hit: data appears in some block in the upper level (example: Block X)

– Hit Rate: the fraction of memory accesses found in the upper level
– Hit Time: Time to access the upper level, which consists of RAM access time + Time to determine hit/miss

  • Miss: data needs to be retrieved from a block in the lower level (Block Y)

– Miss Rate = 1 - (Hit Rate)
– Miss Penalty: Time to replace a block in the upper level + Time to deliver the block to the processor

  • Hit Time << Miss Penalty (500 instructions on 21264!)

[Figure: Upper Level Memory holding Blk X, with data moving to and from the processor; Lower Level Memory holding Blk Y]

CS252/ Culler Lec 1. 57 1/ 22/ 02

Cache Measures

  • Hit rate: fraction found in that level

– So high that usually talk about Miss rate
– Miss rate fallacy: as MIPS to CPU performance, miss rate to average memory access time in memory

  • Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)
  • Miss penalty: time to replace a block from lower level, including time to replace in CPU

– access time: time to lower level = f(latency to lower level)
– transfer time: time to transfer block = f(BW between upper & lower levels)
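
The average memory-access time formula above is easy to evaluate, and doing so shows why miss rate alone can mislead. In the sketch below, both cache configurations and all their numbers are made-up assumptions for illustration; only the formula comes from the slide.

```python
# Average memory-access time = hit time + miss rate * miss penalty.
# Units are clock cycles; all configuration numbers are made-up assumptions.

def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# Hypothetical small cache: fast hits, more misses.
small_fast = amat(hit_time=1, miss_rate=0.05, miss_penalty=100)

# Hypothetical large cache: slower hits, fewer misses.
large_slow = amat(hit_time=2, miss_rate=0.02, miss_penalty=100)

print(f"small/fast cache: {small_fast:.1f} cycles")
print(f"large/slow cache: {large_slow:.1f} cycles")
```

With these assumed numbers the larger cache wins despite its doubled hit time. With a much smaller miss penalty the ranking could flip, which is why the comparison has to be made on average access time rather than on hit time or miss rate alone.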

CS252/ Culler Lec 1. 58 1/ 22/ 02

Simplest Cache: Direct Mapped

[Figure: memory addresses 0 through F mapping onto a 4 Byte Direct Mapped Cache with cache index 0 through 3]

  • Location 0 can be occupied by data from:

– Memory location 0, 4, 8, ... etc.
– In general: any memory location whose 2 LSBs of the address are 0s
– Address<1:0> => cache index

  • Which one should we place in the cache?
  • How can we tell which one is in the cache?

CS252/ Culler Lec 1. 59 1/ 22/ 02

1 KB Direct Mapped Cache, 32B blocks

  • For a 2 ** N byte cache:

– The uppermost (32 - N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block Size = 2 ** M)

[Figure: 32-bit address split into Cache Tag (bits 31 to 10, example: 0x50), Cache Index (bits 9 to 5, ex: 0x01), and Byte Select (bits 4 to 0, ex: 0x00). The cache has 32 entries (index 0 to 31), each with a Valid Bit, a Cache Tag stored as part of the cache “state”, and 32 bytes of Cache Data (Byte 0 ... Byte 31, Byte 32 ... Byte 63, ..., Byte 992 ... Byte 1023)]
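
For the 1 KB cache with 32 B blocks, N = 10 and M = 5, so an address splits into a 22-bit tag, a 5-bit index, and a 5-bit byte select. The sketch below performs that split with shifts and masks and rebuilds the slide's example (tag 0x50, index 0x01, byte select 0x00); the constant and function names are illustrative.

```python
# Address decomposition for a direct-mapped cache: 1 KB total, 32 B blocks.
CACHE_BYTES = 1024
BLOCK_BYTES = 32

OFFSET_BITS = BLOCK_BYTES.bit_length() - 1                  # M = 5
INDEX_BITS = (CACHE_BYTES // BLOCK_BYTES).bit_length() - 1  # N - M = 5

def split_address(addr):
    byte_select = addr & (BLOCK_BYTES - 1)                   # lowest M bits
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)  # next N - M bits
    tag = addr >> (OFFSET_BITS + INDEX_BITS)                 # remaining upper bits
    return tag, index, byte_select

# Rebuild the slide's example: tag 0x50, index 0x01, byte select 0x00.
addr = (0x50 << 10) | (0x01 << 5) | 0x00
tag, index, byte_select = split_address(addr)
print(hex(addr), hex(tag), hex(index), hex(byte_select))
```

A lookup would read the entry at `index`, then report a hit only if its valid bit is set and its stored tag equals `tag`; that comparison is how the cache tells which of the many aliasing memory blocks it currently holds.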

CS252/ Culler Lec 1. 60 1/ 22/ 02

The Cache Design Space

  • Several interacting dimensions

– cache size
– block size
– associativity
– replacement policy
– write-through vs write-back

  • The optimal choice is a compromise

– depends on access characteristics
    » workload
    » use (I-cache, D-cache, TLB)
– depends on technology / cost

  • Simplicity often wins

[Figure: design space with axes Associativity, Cache Size, Block Size; a plot of Bad to Good versus Less to More for Factor A and Factor B]


CS252/ Culler Lec 1. 61 1/ 22/ 02

Relationship of Caching and Pipelining

[Figure: the five-stage pipelined datapath (pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB) with the instruction memory implemented as an I-Cache and the data memory as a D-Cache]

CS252/ Culler Lec 1. 62 1/ 22/ 02

Computer System Components

[Figure: Proc, Caches, Busses, Memory, and I/O Devices (controllers, adapters: disks, displays, keyboards, networks)]

  • All have interfaces & organizations
  • Bus & Bus Protocol is key to composition => peripheral hierarchy

CS252/ Culler Lec 1. 63 1/ 22/ 02

A Modern Memory Hierarchy

  • By taking advantage of the principle of locality:

– Present the user with as much memory as is available in the cheapest technology.
– Provide access at the speed offered by the fastest technology.

  • Requires servicing faults on the processor

[Figure: Processor (Control, Datapath, Registers, On-Chip Cache) backed by Second Level Cache (SRAM), Main Memory (DRAM), Secondary Storage (Disk), and Tertiary Storage (Disk/Tape)]

Level                                 Speed (ns)                   Size (bytes)
Processor Registers                   1s                           100s
On-Chip / Second Level Cache (SRAM)   10s                          Ks
Main Memory (DRAM)                    100s                         Ms
Secondary Storage (Disk)              10,000,000s (10s ms)         Gs
Tertiary Storage (Disk/Tape)          10,000,000,000s (10s sec)    Ts

CS252/ Culler Lec 1. 64 1/ 22/ 02

TLB, Virtual Memory

  • Caches, TLBs, Virtual Memory all understood by examining how they deal with 4 questions: 1) Where can block be placed? 2) How is block found? 3) What block is replaced on miss? 4) How are writes handled?
  • Page tables map virtual address to physical address
  • TLBs make virtual memory practical

– Locality in data => locality in addresses of data, temporal and spatial

  • TLB misses are significant in processor performance

– funny times, as most systems can’t access all of 2nd level cache without TLB misses!

  • Today VM allows many processes to share single memory without having to swap all processes to disk; today VM protection is more important than memory hierarchy

CS252/ Culler Lec 1. 65 1/ 22/ 02

Summary

  • Modern Computer Architecture is about managing and optimizing across several levels of abstraction wrt dramatically changing technology and application load
  • Key Abstractions

– instruction set architecture
– memory
– bus

  • Key concepts

– HW/SW boundary
– Compile Time / Run Time
– Pipelining
– Caching

  • Performance Iron Triangle relates combined effects

– Total Time = Inst. Count x CPI x Cycle Time