

CS252/Culler Lec 1.1 1/22/02

CS252 Graduate Computer Architecture
Lecture 1: Introduction
January 22, 2002
Prof. David E Culler
Computer Science 252, Spring 2002


Outline

Why Take CS252?
Fundamental Abstractions & Concepts
Instruction Set Architecture & Organization
Administrivia
Pipelined Instruction Processing
Performance
The Memory Abstraction
Summary


Why take CS252?

To design the next great instruction set?...well...

– instruction set architecture has largely converged
– especially in the desktop / server / laptop space
– dictated by powerful market forces

Tremendous organizational innovation relative to established ISA abstractions

Many new instruction sets or equivalent

– embedded space, controllers, specialized devices, ...

Design, analysis, implementation concepts vital to all aspects of EE & CS

– systems, PL, theory, circuit design, VLSI, comm.

Equip you with an intellectual toolbox for dealing with a host of systems design challenges


Example Hot Developments ca. 2002

Manipulating the instruction set abstraction

– Itanium: translate ISA64 -> micro-op sequences
– Transmeta: continuous dynamic translation of IA32
– Tensilica: synthesize the ISA from the application
– reconfigurable HW

Virtualization

– VMware: emulate full virtual machine
– JIT: compile to abstract virtual machine, dynamically compile to host

Parallelism

– wide issue, dynamic instruction scheduling, EPIC
– multithreading (SMT)
– chip multiprocessors

Communication

– network processors, network interfaces

Exotic explorations

– nanotechnology, quantum computing


Forces on Computer Architecture

[Figure: Computer Architecture at the center, acted on by Technology, Programming Languages, Operating Systems, History, and Applications. (A = F / M)]


Amazing Underlying Technology Change


A take on Moore’s Law

[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970 to 2005, for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000. Eras marked: bit-level parallelism, instruction-level, thread-level (?).]

Technology Trends

Clock Rate: ~30% per year
Transistor Density: ~35%
Chip Area: ~15%
Transistors per chip: ~55%
Total Performance Capability: ~100%

by the time you graduate...

– 3x clock rate (3-4 GHz)
– 10x transistor count (1 Billion transistors)
– 30x raw capability

plus 16x DRAM density, 32x disk density
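
The yearly rates on this slide compound. As a rough check, the sketch below assumes a five-year horizon for "by the time you graduate" (the slide does not state one) and computes the cumulative factors. It also shows that density growth and area growth multiply to the quoted per-chip rate. The function name `compound_growth` is illustrative only.

```python
# Compound-growth check for the trend figures on this slide.
# ASSUMPTION: a 5-year horizon; the slide only says "by the time you graduate".

def compound_growth(annual_rate, years):
    """Cumulative growth factor after `years` at `annual_rate` (0.30 means 30%)."""
    return (1.0 + annual_rate) ** years


YEARS = 5  # assumed horizon, not given on the slide

# Transistors per chip grows as (density growth) x (area growth).
transistors_rate = (1 + 0.35) * (1 + 0.15) - 1

clock_factor = compound_growth(0.30, YEARS)       # compare with "3x clock rate"
transistor_factor = compound_growth(0.55, YEARS)  # compare with "10x transistor count"
capability_factor = compound_growth(1.00, YEARS)  # compare with "30x raw capability"

if __name__ == "__main__":
    print(f"transistors per chip: {transistors_rate:.1%} per year")
    print(f"clock: {clock_factor:.1f}x, transistors: {transistor_factor:.1f}x, "
          f"capability: {capability_factor:.0f}x over {YEARS} years")
```

Under that assumed horizon the factors come out near 3.7x, 9x, and 32x, in the same range as the slide's rounded 3x, 10x, and 30x.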


Performance Trends

[Figure: performance (log scale, 0.1 to 100) versus year, 1965 to 1995, for supercomputers, mainframes, minicomputers, and microprocessors.]


Measurement and Evaluation

Architecture is an iterative process

- searching the space of possible designs
- at all levels of computer systems

[Figure: iterative loop of Creativity, Design, and Analysis; Cost / Performance Analysis sorts candidates into Good Ideas, Mediocre Ideas, and Bad Ideas.]


What is “Computer Architecture”?

[Figure: levels of abstraction: Application, Operating System, Compiler, Firmware, Instruction Set Architecture (Instr. Set Proc., I/O system), Datapath & Control, Digital Design, Circuit Design, Layout.]

Coordination of many levels of abstraction
Under a rapidly changing set of forces
Design, Measurement, and Evaluation


Coping with CS 252

Students with too varied background?

– In the past, CS grad students took written prelim exams on undergraduate material in hardware, software, and theory
– 1st 5 weeks reviewed background, helped 252, 262, 270
– Prelims were dropped => some unprepared for CS 252?

In-class exam on Tues Jan. 29 (30 mins)

– Doesn’t affect grade, only admission into class
– 2 grades: Admitted or audit / take CS 152 1st
– Improves your experience if we recapture common background

Review: Chapters 1, CS 152 home page, maybe “Computer Organization and Design (COD) 2/e”

– Chapters 1 to 8 of COD if never took prerequisite
– If took a class, be sure COD Chapters 2, 6, 7 are familiar
– Copies in Bechtel Library on 2-hour reserve

FAST review this week of basic concepts


Review of Fundamental Concepts

Instruction Set Architecture
Machine Organization
Instruction Execution Cycle
Pipelining
Memory
Bus (Peripheral Hierarchy)
Performance Iron Triangle


The Instruction Set: a Critical Interface

[Figure: the instruction set as the interface between software and hardware.]


Instruction Set Architecture

... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation. – Amdahl, Blaauw, and Brooks, 1964

- Organization of Programmable Storage
- Data Types & Data Structures: Encodings & Representations
- Instruction Formats
- Instruction (or Operation Code) Set
- Modes of Addressing and Accessing Data Items and Instructions
- Exceptional Conditions


Organization

[Figure: Logic Designer's View: the ISA Level above FUs & Interconnect.]

Capabilities & Performance Characteristics of Principal Functional Units

– (e.g., Registers, ALU, Shifters, Logic Units, ...)

Ways in which these components are interconnected

Information flows between components

Logic and means by which such information flow is controlled.

Choreography of FUs to realize the ISA

Register Transfer Level (RTL) Description


Review: MIPS R3000 (core)

Programmable storage

– 2^32 x bytes
– 31 x 32-bit GPRs (R0=0)
– 32 x 32-bit FP regs (paired DP)
– HI, LO, PC

[Figure: register file r0, r1, ..., r31, plus PC, lo, hi.]

Data types? Format? Addressing Modes?

Arithmetic logical

– Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU
– AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI
– SLL, SRL, SRA, SLLV, SRLV, SRAV

Memory Access

– LB, LBU, LH, LHU, LW, LWL, LWR
– SB, SH, SW, SWL, SWR

Control

– J, JAL, JR, JALR
– BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL

32-bit instructions on word boundary


Review: Basic ISA Classes

Accumulator:
  1 address      add A          acc ← acc + mem[A]
  1+x address    addx A         acc ← acc + mem[A + x]
Stack:
  0 address      add            tos ← tos + next
General Purpose Register:
  2 address      add A B        EA(A) ← EA(A) + EA(B)
  3 address      add A B C      EA(A) ← EA(B) + EA(C)
Load/Store:
  3 address      add Ra Rb Rc   Ra ← Rb + Rc
                 load Ra Rb     Ra ← mem[Rb]
                 store Ra Rb    mem[Rb] ← Ra
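
To make the classes concrete, here is a minimal sketch of three toy interpreters, each computing mem[C] = mem[A] + mem[B]. The slide defines only the `add` semantics (plus `load`/`store` for the load/store class). The `load`, `store`, `push`, and `pop` mnemonics used for the accumulator and stack machines, and all function names, are assumptions made for illustration.

```python
# Toy interpreters for three ISA classes; each computes mem[C] = mem[A] + mem[B].
# ASSUMPTION: mnemonics other than `add` on the accumulator and stack machines
# (load/store/push/pop) are illustrative, not taken from the slide.

def run_accumulator(mem, program):
    """1-address machine: every operation implicitly uses the accumulator."""
    acc = 0
    for op, addr in program:
        if op == "load":
            acc = mem[addr]
        elif op == "add":          # acc <- acc + mem[A]
            acc = acc + mem[addr]
        elif op == "store":
            mem[addr] = acc
    return mem


def run_stack(mem, program):
    """0-address machine: add pops two operands and pushes their sum."""
    stack = []
    for op, *operand in program:
        if op == "push":
            stack.append(mem[operand[0]])
        elif op == "add":          # tos <- tos + next
            stack.append(stack.pop() + stack.pop())
        elif op == "pop":
            mem[operand[0]] = stack.pop()
    return mem


def run_load_store(mem, regs, program):
    """3-address load/store machine: only load and store touch memory."""
    for op, *args in program:
        if op == "load":           # Ra <- mem[Rb]
            ra, rb = args
            regs[ra] = mem[regs[rb]]
        elif op == "add":          # Ra <- Rb + Rc
            ra, rb, rc = args
            regs[ra] = regs[rb] + regs[rc]
        elif op == "store":        # mem[Rb] <- Ra
            ra, rb = args
            mem[regs[rb]] = regs[ra]
    return mem


def fresh_memory():
    return {"A": 2, "B": 3, "C": 0}


acc_result = run_accumulator(
    fresh_memory(), [("load", "A"), ("add", "B"), ("store", "C")])
stack_result = run_stack(
    fresh_memory(), [("push", "A"), ("push", "B"), ("add",), ("pop", "C")])
ls_result = run_load_store(
    fresh_memory(),
    {"r1": "A", "r2": "B", "r3": "C"},  # registers preloaded with addresses
    [("load", "r4", "r1"), ("load", "r5", "r2"),
     ("add", "r6", "r4", "r5"), ("store", "r6", "r3")])
```

The same computation takes three instructions on the accumulator machine and four on the stack and load/store machines here, but only the load/store version keeps memory access out of the arithmetic instruction.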


Instruction Formats

[Figure: variable, fixed, and hybrid instruction format layouts.]

Addressing modes

– each operand requires address specifier => variable format

code size => variable length instructions
performance => fixed length instructions

– simple decoding, predictable operations

With load/store instruction arch, only one memory address and few addressing modes

=> simple format, address mode given by opcode


MIPS Addressing Modes & Formats

Simple addressing modes
All instructions 32 bits wide

[Figure: instruction fields (op, rs, rt, rd or immed) for four addressing modes: Register (direct), operand in a register; Immediate, operand is immed; Base+index, register + immed addresses Memory; PC-relative, PC + immed addresses Memory.]

Register Indirect?


Cray-1: the original RISC

[Figure: two instruction formats with field boundaries marked at bits 15, 9, 8, 6, 5, 3, 2: Register-Register (Op, Rd, Rs1, R2) and Load, Store and Branch (Op, Rd, Rs1, Immediate).]


VAX-11: the canonical CISC

Rich set of orthogonal address modes

– immediate, offset, indexed, autoinc/dec, indirect, indirect+offset
– applied to any operand

Simple and complex instructions

– synchronization instructions
– data structure operations (queues)
– polynomial evaluation

[Figure: instruction layout: OpCode byte followed by A/M specifiers, bytes 0, 1, ..., n, ..., m.]

Variable format, 2 and 3 address instruction


Review: Load/Store Architectures

[Figure: MEM connected to reg.]

° Substantial increase in instructions
° Decrease in data BW (due to many registers)
° Even more significant decrease in CPI (pipelining)
° Cycle time, Real estate, Design time, Design complexity

° 3 address GPR
° Register to register arithmetic
° Load and store with simple addressing modes (reg + immediate)
° Simple conditionals
    compare ops + branch z
    compare&branch
    condition code + branch on condition
° Simple fixed-format encoding

[Figure: three fixed formats: op r r r; op r r immed; op offset.]


MIPS R3000 ISA (Summary)

Instruction Categories

– Load/Store
– Computational
– Jump and Branch
– Floating Point
  » coprocessor
– Memory Management
– Special

Registers: R0 - R31, PC, HI, LO

3 Instruction Formats: all 32 bits wide

  OP | rs | rt | rd | sa | funct
  OP | rs | rt | immediate
  OP | jump target
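
The three formats can be unpacked with shifts and masks. The sketch below assumes the standard MIPS field widths (6/5/5/5/5/6 bits for the register format, a 16-bit immediate, a 26-bit jump target); the slide names the fields but does not give widths. The function names are illustrative.

```python
# Field extraction for the three 32-bit MIPS instruction formats.
# ASSUMPTION: standard MIPS field widths; the slide names the fields only.

def decode_r_type(word):
    """OP | rs | rt | rd | sa | funct"""
    return {
        "op": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "sa": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }


def decode_i_type(word):
    """OP | rs | rt | immediate"""
    return {
        "op": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "immediate": word & 0xFFFF,
    }


def decode_j_type(word):
    """OP | jump target"""
    return {"op": (word >> 26) & 0x3F, "target": word & 0x3FFFFFF}


# add $8, $9, $10 encodes as 0x012A4020 in the standard MIPS encoding.
add_fields = decode_r_type(0x012A4020)
# lw $15, 0($2) encodes as 0x8C4F0000.
lw_fields = decode_i_type(0x8C4F0000)
```

Because the register fields sit at the same bit positions in every format, the register file can be read before the opcode is fully decoded.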


CS 252 Administrivia

TA: Jason Hill, jhill@cs.berkeley.edu
All assignments, lectures via WWW page:

http://www.cs.berkeley.edu/~culler/252S02/

2 Quizzes: 3/21 and ~14th week (maybe take home)
Text:

– Pages of 3rd edition of Computer Architecture: A Quantitative Approach
  » available from Cindy Palwick (MWF) or Jeanette Cook ($30 1-5)
– “Readings in Computer Architecture” by Hill et al

In class, prereq quiz 1/29 last 30 minutes

– Improves 252 experience if we recapture common background
– Bring 1 sheet of paper with notes on both sides
– Doesn’t affect grade, only admission into class
– 2 grades: Admitted or audit / take CS 152 1st
– Review: Chapters 1, CS 152 home page, maybe “Computer Organization and Design (COD) 2/e”
– If did take a class, be sure COD Chapters 2, 5, 6, 7 are familiar
– Copies in Bechtel Library on 2-hour reserve


Research Paper Reading

As graduate students, you are now researchers.
Most information of importance to you will be in research papers.

Ability to rapidly scan and understand research papers is key to your success.

So: 1-2 papers / week in this course

– Quick 1 paragraph summaries will be due in class
– Important supplement to book.
– Will discuss papers in class

Papers: “Readings in Computer Architecture” or online
Think about methodology and approach


First Assignment (due Tu 2/5)

Read

– Amdahl, Blaauw, and Brooks, Architecture of the IBM System/360
– Lonergan and King, B5000

Four each prepare for in-class debate 1/29
rest write analysis of the debate
Read “Programming the EDSAC”, Campbell-Kelly

– write subroutine sum(A, n) to sum an array A of n numbers
– write recursive fact(n) = if n==1 then 1 else n*fact(n-1)


Grading

10% Homeworks (work in pairs)
40% Examinations (2 Quizzes)
40% Research Project (work in pairs)

– Draft of Conference Quality Paper
– Transition from undergrad to grad student
– Berkeley wants you to succeed, but you need to show initiative
– pick topic
– meet 3 times with faculty/TA to see progress
– give oral presentation
– give poster session
– written report like conference paper
– 3 weeks work full time for 2 people (over more weeks)
– Opportunity to do “research in the small” to help make transition from good student to research colleague

10% Class Participation


Course Profile

3 weeks: basic concepts

– instruction processing, storage

3 weeks: hot areas

– latency tolerance, low power, embedded design, network processors, NIs, virtualization

Proposals due
2 weeks: advanced microprocessor design
Quiz & Spring Break
3 weeks: Parallelism (MPs, CMPs, Networks)
2 weeks: Methodology / Analysis / Theory
1 week: Topics: nano, quantum
1 week: Project Presentations


Levels of Representation (61C Review)

High Level Language Program

  temp = v[k]; v[k] = v[k+1]; v[k+1] = temp;

    (Compiler)

Assembly Language Program

  lw $15, 0($2)
  lw $16, 4($2)
  sw $16, 0($2)
  sw $15, 4($2)

    (Assembler)

Machine Language Program

  0000 1001 1100 0110 1010 1111 0101 1000
  1010 1111 0101 1000 0000 1001 1100 0110
  1100 0110 1010 1111 0101 1000 0000 1001
  0101 1000 0000 1001 1100 0110 1010 1111

    (Machine Interpretation)

Control Signal Specification

  ALUOP[0:3] <= InstReg[9:11] & MASK


Execution Cycle

Instruction Fetch: Obtain instruction from program storage
Instruction Decode: Determine required actions and instruction size
Operand Fetch: Locate and obtain operand data
Execute: Compute result value or status
Result Store: Deposit results in storage for later use
Next Instruction: Determine successor instruction


What’s a Clock Cycle?

Old days: 10 levels of gates
Today: determined by numerous time-of-flight issues + gate delays

– clock propagation, wire lengths, drivers

[Figure: combinational logic between a latch or register on each side.]


Fast, Pipelined Instruction Interpretation

[Figure: pipeline stages Next Instruction (NI), Instruction Fetch (IF), Decode & Operand Fetch (D), Execute (E), Store Results (W), linked by the Instruction Address, Instruction Register, Operand Registers, and Result Registers, with results going to Registers or Mem. Over time, successive instructions each run NI IF D E W, overlapped with one another.]


Sequential Laundry

Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would laundry take?

[Figure: loads A to D in task order versus time, 6 PM to midnight; each load takes 30 + 40 + 20 minutes, one load after another.]


Pipelined Laundry: Start work ASAP

Pipelined laundry takes 3.5 hours for 4 loads

[Figure: loads A to D in task order versus time, overlapped starting at 6 PM; segment times 30, 40, 40, 40, 40, 20 minutes.]
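
The 6 hour and 3.5 hour totals follow from the 30, 40, and 20 minute stage times. A minimal sketch, assuming identical loads that may wait between stages; the function names are illustrative:

```python
# Sequential versus pipelined laundry time for identical loads.
# ASSUMPTION: a load may wait between stages, so the slowest stage sets the pace.

def sequential_minutes(stage_minutes, loads):
    """Each load runs all stages before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """First load takes the full latency; each later load adds one bottleneck interval."""
    bottleneck = max(stage_minutes)
    return sum(stage_minutes) + (loads - 1) * bottleneck


stages = [30, 40, 20]  # wash, dry, fold (minutes), as on the slides
seq = sequential_minutes(stages, 4)
pipe = pipelined_minutes(stages, 4)

if __name__ == "__main__":
    print(f"sequential: {seq / 60} h, pipelined: {pipe / 60} h, speedup: {seq / pipe:.2f}")
```

The speedup here is about 1.7, not 3, because the stages are unbalanced and only four loads amortize the fill and drain time.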


Pipelining Lessons

Pipelining doesn’t help latency of single task, it helps throughput of entire workload

Pipeline rate limited by slowest pipeline stage

Multiple tasks operating simultaneously

Potential speedup = Number pipe stages

Unbalanced lengths of pipe stages reduces speedup

Time to “fill” pipeline and time to “drain” it reduces speedup

[Figure: pipelined laundry, loads A to D in task order versus time from 6 PM to 9 PM; segment times 30, 40, 40, 40, 40, 20 minutes.]


Instruction Pipelining

Execute billions of instructions, so throughput is what matters

– except when?

What is desirable in instruction sets for pipelining?

– Variable length instructions vs. all instructions same length?
– Memory operands part of any operation vs. memory operands only in loads or stores?
– Register operand many places in instruction format vs. registers located in same place?


Example: MIPS (Note register location)

Register-Register
  Op [31:26] | Rs1 [25:21] | Rs2 [20:16] | Rd [15:11] | Opx [10:0] (ticks also at bits 6 and 5)

Register-Immediate
  Op [31:26] | Rs1 [25:21] | Rd [20:16] | immediate [15:0]

Branch
  Op [31:26] | Rs1 [25:21] | Rs2/Opx [20:16] | immediate [15:0]

Jump / Call
  Op [31:26] | target [25:0]


5 Steps of MIPS Datapath

Figure 3.1, Page 130, CA:AQA 2e

[Figure: unpipelined datapath in five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Components: instruction Memory, Adder (+4) producing Next SEQ PC, Next PC MUX, Reg File (RS1, RS2, RD, Imm), Sign Extend, MUXes, ALU, Zero? test, Data Memory, LMD, WB Data.]

5 Steps of MIPS Datapath

Figure 3.4, Page 134, CA:AQA 2e

[Figure: pipelined datapath with the same five steps (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back) separated by pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Next SEQ PC and RD are carried along from stage to stage.]

Data stationary control

– local decode for each instruction phase / pipeline stage


Visualizing Pipelining

Figure 3.3, Page 133, CA:AQA 2e

[Figure: instruction order versus time (clock cycles 1 to 7); four instructions, each passing through Ifetch, Reg, ALU, DMem, Reg, each starting one cycle after the previous one.]


It's Not That Easy for Computers

Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle

– Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
– Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)
– Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).


Review of Performance


Which is faster?

Time to run the task (ExTime)

– Execution time, response time, latency

Tasks per day, hour, week, sec, ns ... (Performance)

– Throughput, bandwidth

Plane              Speed      DC to Paris   Passengers   Throughput (pmph)
Boeing 747         610 mph    6.5 hours     470          286,700
BAC/Sud Concorde   1350 mph   3 hours       132          178,200
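
The throughput column is passengers times speed (passenger-miles per hour). A short sketch that recomputes it and compares the two planes on latency and on throughput; the dictionary layout and function name are illustrative:

```python
# Latency versus throughput for the two planes in the table.

planes = {
    "Boeing 747": {"speed_mph": 610, "hours": 6.5, "passengers": 470},
    "Concorde": {"speed_mph": 1350, "hours": 3.0, "passengers": 132},
}


def throughput_pmph(plane):
    """Passenger-miles per hour: passengers carried times cruising speed."""
    return plane["passengers"] * plane["speed_mph"]


b747 = planes["Boeing 747"]
concorde = planes["Concorde"]

# Concorde wins on latency (time for one trip) ...
latency_ratio = b747["hours"] / concorde["hours"]
# ... while the 747 wins on throughput.
throughput_ratio = throughput_pmph(b747) / throughput_pmph(concorde)

if __name__ == "__main__":
    print(f"Concorde is {latency_ratio:.2f}x faster per trip")
    print(f"747 has {throughput_ratio:.2f}x the throughput")
```

Which plane is "faster" depends on whether the metric is time per task or tasks per unit time.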


Definitions

Performance is in units of things per sec

– bigger is better

If we are primarily concerned with response time

– performance(x) = 1 / execution_time(x)

"X is n times faster than Y" means

\[ n = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = \frac{\text{Execution\_time}(Y)}{\text{Execution\_time}(X)} \]


Computer Performance

\[ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]

               Inst Count   CPI   Clock Rate
Program            X
Compiler           X        (X)
Inst. Set.         X         X
Organization                 X        X
Technology                            X

[Figure: triangle with corners inst count, CPI, and Cycle time.]


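Cycles Per Instruction (Throughput)

“Average Cycles per Instruction”

\[ \text{CPI} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Cycles}}{\text{Instruction Count}} \]

\[ \text{CPU Time} = \text{Cycle Time} \times \sum_{j=1}^{n} \text{CPI}_j \times I_j \]

“Instruction Frequency”

\[ \text{CPI} = \sum_{j=1}^{n} \text{CPI}_j \times F_j \quad \text{where} \quad F_j = \frac{I_j}{\text{Instruction Count}} \]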


Example: Calculating CPI bottom up

Base Machine (Reg / Reg)
Typical Mix of instruction types in program

Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        .5       (33%)
Load     20%    2        .4       (27%)
Store    10%    2        .2       (13%)
Branch   20%    2        .4       (27%)
                         1.5
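
The table can be reproduced by weighting each class's cycle count by its frequency. A minimal sketch with illustrative names:

```python
# Bottom-up CPI: sum over instruction classes of frequency x cycles.

mix = [
    # (op, frequency, cycles)
    ("ALU", 0.50, 1),
    ("Load", 0.20, 2),
    ("Store", 0.10, 2),
    ("Branch", 0.20, 2),
]


def average_cpi(instruction_mix):
    """CPI = sum of F_j x CPI_j over all instruction classes."""
    return sum(freq * cycles for _, freq, cycles in instruction_mix)


def time_percent(instruction_mix):
    """Share of execution time per class, rounded to whole percent."""
    total = average_cpi(instruction_mix)
    return {op: round(100 * freq * cycles / total)
            for op, freq, cycles in instruction_mix}


cpi = average_cpi(mix)
shares = time_percent(mix)

if __name__ == "__main__":
    print(f"CPI = {cpi:.1f}")
    print(shares)
```

ALU operations are half the instructions but only a third of the time, which is why the (% Time) column is the one to consult when deciding what to optimize.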


Example: Branch Stall Impact

Assume CPI = 1.0 ignoring branches (ideal)
Assume solution was stalling for 3 cycles
If 30% branch, Stall 3 cycles on 30%

Op       Freq   Cycles   CPI(i)   (% Time)
Other    70%    1        .7       (37%)
Branch   30%    4        1.2      (63%)

=> new CPI = 1.9
New machine is 1/1.9 = 0.52 times faster (i.e. slow!)
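
The same result follows from adding the stall cycles per instruction to the ideal CPI. A small sketch; the function name is illustrative:

```python
# Effect of branch stalls on CPI: ideal CPI plus stall cycles per instruction.

def cpi_with_branch_stalls(ideal_cpi, branch_freq, stall_cycles):
    """Every branch pays `stall_cycles` extra cycles on top of the ideal CPI."""
    return ideal_cpi + branch_freq * stall_cycles


new_cpi = cpi_with_branch_stalls(ideal_cpi=1.0, branch_freq=0.30, stall_cycles=3)
relative_speed = 1.0 / new_cpi            # versus the ideal CPI = 1.0 machine
branch_time_share = (0.30 * 4) / new_cpi  # branches now take 4 cycles each

if __name__ == "__main__":
    print(f"new CPI = {new_cpi:.1f}, relative speed = {relative_speed:.2f}, "
          f"branch share of time = {branch_time_share:.0%}")
```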


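Speed Up Equation for Pipelining

\[ \text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Average Stall cycles per Inst} \]

\[ \text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}} \]

For simple RISC pipeline, CPI = 1:

\[ \text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}} \]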
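
As a sketch, the speedup equation for this slide can be written as a single function. The parameter names, the defaults, and the sample values (a 5-stage pipeline with 0.9 stall cycles per instruction) are assumptions chosen for illustration:

```python
# Pipelining speedup over an unpipelined machine.
# ASSUMPTION: parameter names and sample values are illustrative.

def pipeline_speedup(depth, stall_cpi, ideal_cpi=1.0, cycle_time_ratio=1.0):
    """Speedup = (ideal CPI x depth) / (ideal CPI + stall CPI) x cycle time ratio.

    cycle_time_ratio is unpipelined cycle time divided by pipelined cycle time.
    """
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * cycle_time_ratio


no_stalls = pipeline_speedup(depth=5, stall_cpi=0.0)    # upper bound: the depth
with_stalls = pipeline_speedup(depth=5, stall_cpi=0.9)  # stalls eat into the gain

if __name__ == "__main__":
    print(f"ideal: {no_stalls:.2f}x, with 0.9 stall cycles/inst: {with_stalls:.2f}x")
```

With no stalls and equal cycle times the speedup equals the pipeline depth, matching "Potential speedup = Number pipe stages".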


Now, Review of Memory Hierarchy


The Memory Abstraction

Association of <name, value> pairs

– typically named as byte addresses
– often values aligned on multiples of size

Sequence of Reads and Writes
Write binds a value to an address
Read of addr returns most recently written value bound to that address

[Figure: memory interface signals: address (name), command (R/W), data (W), data (R), done.]


Recap: Who Cares About the Memory Hierarchy?

[Figure: “Joy’s Law” Processor-DRAM Memory Gap (latency). Performance (log scale, 1 to 1000) versus time, 1980 to 2000. CPU (µProc): 60%/yr. (2X/1.5 yr). DRAM: 9%/yr. (2X/10 yrs). Processor-Memory Performance Gap: grows 50% / year.]


Levels of the Memory Hierarchy

Level: Capacity, Access Time, Cost

– CPU Registers: 100s Bytes, <1s ns
– Cache: 10s-100s K Bytes, 1-10 ns, $10/MByte
– Main Memory: M Bytes, 100ns-300ns, $1/MByte
– Disk: 10s G Bytes, 10 ms (10,000,000 ns), $0.0031/MByte
– Tape: infinite, sec-min, $0.0014/MByte

Staging Xfer Unit (managed by, size)

– Registers / Cache: Instr. Operands (prog./compiler, 1-8 bytes)
– Cache / Memory: Blocks (cache cntl, 8-128 bytes)
– Memory / Disk: Pages (OS, 512-4K bytes)
– Disk / Tape: Files (user/operator, Mbytes)

Upper Level: faster. Lower Level: larger.


The Principle of Locality

The Principle of Locality:

– Programs access a relatively small portion of the address space at any instant of time.

Two Different Types of Locality:

– Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
– Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straightline code, array access)

Last 15 years, HW (hardware) relied on locality for speed


Memory Hierarchy: Terminology

Hit: data appears in some block in the upper level (example: Block X)

– Hit Rate: the fraction of memory accesses found in the upper level
– Hit Time: Time to access the upper level, which consists of RAM access time + Time to determine hit/miss

Miss: data needs to be retrieved from a block in the lower level (Block Y)

– Miss Rate = 1 - (Hit Rate)
– Miss Penalty: Time to replace a block in the upper level + Time to deliver the block to the processor

Hit Time << Miss Penalty (500 instructions on 21264!)

[Figure: the processor exchanges data with Upper Level Memory (Blk X), which exchanges blocks with Lower Level Memory (Blk Y).]


Cache Measures

Hit rate: fraction found in that level

– So high that usually talk about Miss rate
– Miss rate fallacy: as MIPS to CPU performance, miss rate to average memory access time in memory

Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)

Miss penalty: time to replace a block from lower level, including time to replace in CPU

– access time: time to lower level = f(latency to lower level)
– transfer time: time to transfer block = f(BW between upper & lower levels)
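
The average memory-access time formula is a one-liner, and it makes the miss rate fallacy easy to see. In the sketch below the two designs and all of their numbers are made up for illustration; only the formula comes from the slide.

```python
# Average memory-access time (AMAT) = hit time + miss rate x miss penalty.
# ASSUMPTION: designs A and B and their parameters are invented examples.

def amat(hit_time, miss_rate, miss_penalty):
    """All times in the same unit (ns or clocks)."""
    return hit_time + miss_rate * miss_penalty


# Design B has the lower miss rate but pays for it with a slower hit.
design_a = amat(hit_time=1, miss_rate=0.05, miss_penalty=50)
design_b = amat(hit_time=2, miss_rate=0.04, miss_penalty=50)

if __name__ == "__main__":
    print(f"A: {design_a:.1f} clocks, B: {design_b:.1f} clocks")
```

Design B misses less often yet has the higher average access time, so miss rate alone is a misleading figure of merit.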


Simplest Cache: Direct Mapped

[Figure: a 16-location Memory (Memory Address 0 to F) mapped onto a 4 Byte Direct Mapped Cache (Cache Index 0 to 3).]

Location 0 can be occupied by data from:

– Memory location 0, 4, 8, ... etc.
– In general: any memory location whose 2 LSBs of the address are 0s
– Address<1:0> => cache index

Which one should we place in the cache?

How can we tell which one is in the cache?
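
One common answer to the second question is to store a tag (the remaining address bits) and a valid bit alongside each cache location. The sketch below models the 4-byte cache with one-byte blocks under that assumption; the class and method names are illustrative, and on a miss it simply replaces whatever was resident.

```python
# Tiny direct mapped cache model: one-byte blocks, index = low address bits.
# ASSUMPTION: each line stores a tag and a valid bit to identify its contents;
# a miss always replaces the resident block.

class DirectMappedCache:
    def __init__(self, num_lines=4):
        self.num_lines = num_lines
        self.valid = [False] * num_lines
        self.tags = [0] * num_lines

    def access(self, address):
        """Return True on a hit; on a miss, install the new block and return False."""
        index = address % self.num_lines   # Address<1:0> for a 4-line cache
        tag = address // self.num_lines    # the remaining upper address bits
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            self.valid[index] = True
            self.tags[index] = tag
        return hit


cache = DirectMappedCache()
trace = [0x0, 0x4, 0x0, 0x1, 0x1]
results = [cache.access(address) for address in trace]
```

Addresses 0x0 and 0x4 share cache index 0, so they keep evicting each other even though the other three lines are empty; only the repeated access to 0x1 hits.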


1 KB Direct Mapped Cache, 32B blocks

For a 2 ** N byte cache:

– The uppermost (32 - N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block Size = 2 ** M)

[Figure: 32-bit address split into Cache Tag (bits 31 to 10, example: 0x50), Cache Index (bits 9 to 5, ex: 0x01), and Byte Select (bits 4 to 0, ex: 0x00). The cache has entries at index 0, 1, 2, 3, ..., 31. Each entry holds a Valid Bit, a Cache Tag (stored as part of the cache “state”, 0x50 in the example entry), and 32 bytes of Cache Data: Byte 0 to Byte 31, Byte 32 to Byte 63, ..., Byte 992 to Byte 1023.]
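
For the 1 KB cache with 32-byte blocks, N = 10 and M = 5, so the address splits into a 22-bit tag, a 5-bit index, and a 5-bit byte select. A minimal sketch of that split, assuming power-of-two sizes; the function name is illustrative:

```python
# Split a 32-bit address into (tag, index, byte select) for a direct mapped cache.
# ASSUMPTION: cache size and block size are powers of two.

def split_address(address, cache_bytes=1024, block_bytes=32):
    offset_bits = block_bytes.bit_length() - 1                  # M
    index_bits = (cache_bytes // block_bytes).bit_length() - 1  # N - M
    byte_select = address & (block_bytes - 1)
    index = (address >> offset_bits) & ((1 << index_bits) - 1)
    tag = address >> (offset_bits + index_bits)                 # upper 32 - N bits
    return tag, index, byte_select


# Rebuild the slide's example: tag 0x50, index 0x01, byte select 0x00.
example_address = (0x50 << 10) | (0x01 << 5) | 0x00
tag, index, byte_select = split_address(example_address)
```

A lookup reads the entry at `index`, checks its valid bit, compares the stored tag with `tag`, and on a match uses `byte_select` to pick the byte within the 32-byte block.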


The Cache Design Space

Several interacting dimensions

– cache size
– block size
– associativity
– replacement policy
– write-through vs write-back

The optimal choice is a compromise

– depends on access characteristics
  » workload
  » use (I-cache, D-cache, TLB)
– depends on technology / cost

Simplicity often wins

[Figure: design space with axes Cache Size, Associativity, and Block Size; a Bad-to-Good versus Less-to-More plot for Factor A and Factor B.]


Relationship of Caching and Pipelining

[Figure: the pipelined MIPS datapath (pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB) with an I-Cache at the instruction memory and a D-Cache at the data memory.]


Computer System Components

[Figure: Proc, Caches, Busses, Memory, and I/O Devices (Controllers, adapters): Disks, Displays, Keyboards, Networks.]

All have interfaces & organizations
Bus & Bus Protocol is key to composition

=> peripheral hierarchy


A Modern Memory Hierarchy

By taking advantage of the principle of locality:

– Present the user with as much memory as is available in the cheapest technology.
– Provide access at the speed offered by the fastest technology.

Requires servicing faults on the processor

[Figure: Processor (Control, Datapath, Registers), On-Chip Cache, Second Level Cache (SRAM), Main Memory (DRAM), Secondary Storage (Disk), Tertiary Storage (Disk/Tape). Speed (ns), fastest to slowest: 1s, 10s, 100s, 10,000,000s (10s ms), 10,000,000,000s (10s sec). Size (bytes), smallest to largest: 100s, Ks, Ms, Gs, Ts.]


TLB, Virtual Memory

Caches, TLBs, Virtual Memory all understood by examining how they deal with 4 questions:

1) Where can block be placed?
2) How is block found?
3) What block is replaced on miss?
4) How are writes handled?

Page tables map virtual address to physical address
TLBs make virtual memory practical

– Locality in data => locality in addresses of data, temporal and spatial

TLB misses are significant in processor performance

– funny times, as most systems can’t access all of 2nd level cache without TLB misses!

Today VM allows many processes to share single memory without having to swap all processes to disk; today VM protection is more important than memory hierarchy


Summary

Modern Computer Architecture is about managing and optimizing across several levels of abstraction wrt dramatically changing technology and application load

Key Abstractions

– instruction set architecture
– memory
– bus

Key concepts

– HW/SW boundary
– Compile Time / Run Time
– Pipelining
– Caching

Performance Iron Triangle relates combined effects

– Total Time = Inst. Count x CPI x Cycle Time