CS252/Culler Lec 4.1 1/31/02

CS252 Graduate Computer Architecture
Lecture 4: Cache Design

January 31, 2002
Prof. David Culler

CS252/Culler Lec 4.2 1/31/02

Who Cares About the Memory Hierarchy?

  • 1980: no cache in µproc; 1995: 2-level cache on chip (1989: first Intel µproc with a cache on chip)

(Figure: performance vs. year, 1980–2000, log scale 1–1000. µProc performance grows 60%/yr ("Moore's Law"); DRAM grows 7%/yr ("Less' Law?"); the processor-memory performance gap grows 50%/year.)

CS252/Culler Lec 4.3 1/31/02

Generations of Microprocessors

  • Time of a full cache miss in instructions executed:

    1st Alpha: 340 ns / 5.0 ns =  68 clks x 2 or 136
    2nd Alpha: 266 ns / 3.3 ns =  80 clks x 4 or 320
    3rd Alpha: 180 ns / 1.7 ns = 108 clks x 6 or 648

  • 1/2X latency x 3X clock rate x 3X instr/clock ⇒ ≈5X

CS252/Culler Lec 4.4 1/31/02

Processor-Memory Performance Gap “Tax”

Processor          % Area (≈cost)   % Transistors (≈power)
Alpha 21164             37%               77%
StrongArm SA110         61%               94%
Pentium Pro             64%               88%
  (2 dies per package: Proc/I$/D$ + L2$)

  • Caches have no “inherent value”; they only try to close the performance gap

CS252/Culler Lec 4.5 1/31/02

What is a cache?

  • Small, fast storage used to improve average access time to slow memory.
  • Exploits spatial and temporal locality
  • In computer architecture, almost everything is a cache!
    – Registers: “a cache” on variables (software managed)
    – First-level cache: a cache on second-level cache
    – Second-level cache: a cache on memory
    – Memory: a cache on disk (virtual memory)
    – TLB: a cache on the page table
    – Branch prediction: a cache on prediction information?

(Hierarchy: Proc/Regs, L1-Cache, L2-Cache, Memory, Disk/Tape, etc.; lower levels are bigger, upper levels are faster.)

CS252/Culler Lec 4.6 1/31/02

Traditional Four Questions for Memory Hierarchy Designers

  • Q1: Where can a block be placed in the upper level? (Block placement)
    – Fully Associative, Set Associative, Direct Mapped
  • Q2: How is a block found if it is in the upper level? (Block identification)
    – Tag/Block
  • Q3: Which block should be replaced on a miss? (Block replacement)
    – Random, LRU
  • Q4: What happens on a write? (Write strategy)
    – Write Back or Write Through (with Write Buffer)


CS252/Culler Lec 4.7 1/31/02

What are all the aspects of cache organization that impact performance?

CS252/Culler Lec 4.8 1/31/02

Review: Cache performance

  • Miss-oriented approach to memory access:
    – CPI_Execution includes ALU and memory instructions

    CPUtime = IC x (CPI_Execution + MemAccess/Inst x MissRate x MissPenalty) x CycleTime
    CPUtime = IC x (CPI_Execution + MemMisses/Inst x MissPenalty) x CycleTime

  • Separating out the memory component entirely:
    – AMAT = Average Memory Access Time
    – CPI_AluOps does not include memory instructions

    CPUtime = IC x (AluOps/Inst x CPI_AluOps + MemAccess/Inst x AMAT) x CycleTime
    AMAT = HitTime + MissRate x MissPenalty
         = (HitTime_Inst + MissRate_Inst x MissPenalty_Inst)
         + (HitTime_Data + MissRate_Data x MissPenalty_Data)

CS252/Culler Lec 4.9 1/31/02

Impact on Performance

  • Suppose a processor executes at
    – Clock Rate = 200 MHz (5 ns per cycle), ideal (no misses) CPI = 1.1
    – 50% arith/logic, 30% ld/st, 20% control
  • Suppose that 10% of memory operations get a 50-cycle miss penalty
  • Suppose that 1% of instructions get the same miss penalty
  • CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles/ins)
    + [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycle/miss)]
    + [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycle/miss)]
    = (1.1 + 1.5 + 0.5) cycle/ins = 3.1
  • ≈65% of the time (2.0 of every 3.1 cycles) the proc is stalled waiting for memory!
  • AMAT = (1/1.3) x [1 + 0.01 x 50] + (0.3/1.3) x [1 + 0.1 x 50] = 2.54

CS252/Culler Lec 4.10 1/31/02

Unified vs Split Caches

  • Unified vs Separate I&D
  • Example:
    – 16KB I&D: Inst miss rate = 0.64%, Data miss rate = 6.47%
    – 32KB unified: Aggregate miss rate = 1.99%
  • Which is better (ignore L2 cache)?
    – Assume 33% data ops ⇒ 75% of accesses are instruction fetches (1.0/1.33)
    – hit time = 1, miss time = 50
    – Note that a data hit has 1 extra stall for the unified cache (only one port)

AMAT_Harvard = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
AMAT_Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24

(Figure: a split organization with I-Cache-1 and D-Cache-1 per processor backed by Unified Cache-2, vs. a Unified Cache-1 backed by Unified Cache-2.)

CS252/Culler Lec 4.11 1/31/02

How to Improve Cache Performance?

  • 1. Reduce the miss rate,
  • 2. Reduce the miss penalty, or
  • 3. Reduce the time to hit in the cache.

AMAT = HitTime + MissRate x MissPenalty

CS252/Culler Lec 4.12 1/31/02

Where do misses come from?

  • Classifying Misses: 3 Cs
    – Compulsory—the first access to a block cannot hit in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses in even an infinite cache.)
    – Capacity—if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative cache of size X.)
    – Conflict—if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X.)
  • 4th “C”:
    – Coherence—misses caused by cache coherence.


CS252/Culler Lec 4.13 1/31/02

3Cs Absolute Miss Rate (SPEC92)

(Figure: miss rate, 0.02–0.14, vs. cache size 1–128 KB for 1-way through 8-way associativity, with the conflict, capacity, and compulsory components stacked.)

CS252/Culler Lec 4.14 1/31/02

Cache Size

  • Old rule of thumb: 2x size => 25% cut in miss rate
  • What does it reduce?

(Figure: the 3Cs miss-rate plot again: miss rate 0.02–0.14 vs. cache size 1–128 KB for 1-way through 8-way, capacity and compulsory components marked.)

CS252/Culler Lec 4.15 1/31/02

Huge Caches => Working Sets

(Figure: miss rate (%) vs. per-processor cache size, 1 KB–4 MB, for 4-, 8-, 16-, and 32-node configurations. Data traffic breaks down into cold-start (compulsory) traffic, capacity-generated traffic (including conflicts), inherent communication, and other capacity-independent communication; the knees in the curves mark the first and second working sets as replication capacity (cache size) grows.)

Example: LU Decomposition from the NAS Parallel Benchmarks

CS252/Culler Lec 4.16 1/31/02

Cache Organization?

  • Assume total cache size is not changed
  • What happens if we:
    1) Change Block Size
    2) Change Associativity
    3) Change Compiler
  • Which of the 3Cs is obviously affected?

CS252/Culler Lec 4.17 1/31/02

Larger Block Size (fixed size & assoc)

(Figure: miss rate, 0%–25%, vs. block size 16–256 bytes for cache sizes 1K–256K.)

Reduced compulsory misses; increased conflict misses.

What else drives up block size?

CS252/Culler Lec 4.18 1/31/02

Associativity

(Figure: the 3Cs absolute miss-rate plot, 0.02–0.14 vs. cache size 1–128 KB; moving from 1-way to 8-way shrinks the conflict component.)


CS252/Culler Lec 4.19 1/31/02

3Cs Relative Miss Rate

(Figure: miss rate normalized to 100% vs. cache size 1–128 KB, showing the relative shares of the conflict, capacity, and compulsory components for 1-way through 8-way.)

Flaws: for fixed block size. Good: insight => invention.

CS252/Culler Lec 4.20 1/31/02

Associativity vs Cycle Time

  • Beware: execution time is the only final measure!
  • Why is cycle time tied to hit time?
  • Will clock cycle time increase?
    – Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2%
    – suggested big and dumb caches

Effective cycle time of associativity [Przybylski, ISCA]

CS252/Culler Lec 4.21 1/31/02

Example: Avg. Memory Access Time vs. Miss Rate

  • Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. CCT of direct mapped

Cache Size (KB)   1-way   2-way   4-way   8-way
   1              2.33    2.15    2.07    2.01
   2              1.98    1.86    1.76    1.68
   4              1.72    1.67    1.61    1.53
   8              1.46    1.48    1.47    1.43
  16              1.29    1.32    1.32    1.32
  32              1.20    1.24    1.25    1.27
  64              1.14    1.20    1.21    1.23
 128              1.10    1.17    1.18    1.20

(Red in the original marks entries where A.M.A.T. is not improved by more associativity.)

CS252/Culler Lec 4.22 1/31/02

Fast Hit Time + Low Conflict => Victim Cache

  • How to combine the fast hit time of direct mapped yet still avoid conflict misses?
  • Add a buffer to hold data discarded from the cache
  • Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache
  • Used in Alpha, HP machines

(Figure: victim cache organization; a small fully associative buffer whose entries each hold a tag, a comparator, and one cache line of data, sitting beside the direct-mapped cache with a path to the next lower level in the hierarchy.)

CS252/Culler Lec 4.23 1/31/02

Reducing Misses via “Pseudo-Associativity”

  • How to combine the fast hit time of direct mapped and have the lower conflict misses of a 2-way SA cache?
  • Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, it is a pseudo-hit (slow hit)
  • Drawback: CPU pipeline design is hard if a hit takes 1 or 2 cycles
    – Better for caches not tied directly to the processor (L2)
    – Used in the MIPS R10000 L2 cache; similar in UltraSPARC

(Timeline: hit time < pseudo hit time < miss penalty.)

CS252/Culler Lec 4.24 1/31/02

Reducing Misses by Hardware Prefetching of Instructions & Data

  • E.g., instruction prefetching
    – Alpha 21064 fetches 2 blocks on a miss
    – Extra block placed in “stream buffer”
    – On miss, check the stream buffer
  • Works with data blocks too:
    – Jouppi [1990]: 1 data stream buffer got 25% of misses from a 4KB cache; 4 streams got 43%
    – Palacharla & Kessler [1994]: for scientific programs, 8 streams got 50% to 70% of misses from 2 64KB, 4-way set-associative caches
  • Prefetching relies on having extra memory bandwidth that can be used without penalty


CS252/Culler Lec 4.25 1/31/02

Reducing Misses by Software Prefetching Data

  • Data prefetch
    – Load data into register (HP PA-RISC loads)
    – Cache prefetch: load into cache (MIPS IV, PowerPC, SPARC v.9)
    – Special prefetching instructions cannot cause faults; a form of speculative execution
  • Prefetching comes in two flavors:
    – Binding prefetch: requests load directly into a register.
      » Must be correct address and register!
    – Non-binding prefetch: load into cache.
      » Can be incorrect. Faults?
  • Issuing prefetch instructions takes time
    – Is the cost of prefetch issues < the savings in reduced misses?
    – Wider superscalar issue reduces the difficulty of issue bandwidth

CS252/Culler Lec 4.26 1/31/02

Reducing Misses by Compiler Optimizations

  • McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks, in software
  • Instructions
    – Reorder procedures in memory so as to reduce conflict misses
    – Profiling to look at conflicts (using tools they developed)
  • Data
    – Merging Arrays: improve spatial locality by a single array of compound elements vs. 2 arrays
    – Loop Interchange: change the nesting of loops to access data in the order stored in memory
    – Loop Fusion: combine 2 independent loops that have the same looping and some variables overlap
    – Blocking: improve temporal locality by accessing “blocks” of data repeatedly vs. going down whole columns or rows

CS252/Culler Lec 4.27 1/31/02

Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val & key; improves spatial locality

CS252/Culler Lec 4.28 1/31/02

Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality

CS252/Culler Lec 4.29 1/31/02

Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }

2 misses per access to a & c vs. one miss per access; improves temporal locality

CS252/Culler Lec 4.30 1/31/02

Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    r = 0;
    for (k = 0; k < N; k = k+1)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  }

  • Two inner loops:
    – Read all NxN elements of z[]
    – Read N elements of 1 row of y[] repeatedly
    – Write N elements of 1 row of x[]
  • Capacity misses are a function of N & cache size:
    – 2N^3 + N^2 words accessed (assuming no conflicts; otherwise …)
  • Idea: compute on a BxB submatrix that fits

CS252/Culler Lec 4.31 1/31/02

Blocking Example

/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B-1,N); j = j+1) {
        r = 0;
        for (k = kk; k < min(kk+B-1,N); k = k+1)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      }

  • B is called the Blocking Factor
  • Capacity misses drop from 2N^3 + N^2 to N^3/B + 2N^2
  • Conflict misses too?

CS252/Culler Lec 4.32 1/31/02

Reducing Conflict Misses by Blocking

  • Conflict misses in caches that are not fully associative vs. blocking size
    – Lam et al [1991]: a blocking factor of 24 had a fifth the misses vs. 48, despite both fitting in cache

(Figure: miss rate, 0.05–0.1, vs. blocking factor 50–150 for a fully associative cache and a direct-mapped cache.)

CS252/Culler Lec 4.33 1/31/02

Summary of Compiler Optimizations to Reduce Cache Misses (by hand)

(Figure: performance improvement, 1x–3x, from merged arrays, loop interchange, loop fusion, and blocking on compress, cholesky (nasa7), spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), and vpenta (nasa7).)

CS252/Culler Lec 4.34 1/31/02

Summary: Miss Rate Reduction

  • 3 Cs: Compulsory, Capacity, Conflict
  • 0. Larger cache
  • 1. Reduce misses via Larger Block Size
  • 2. Reduce misses via Higher Associativity
  • 3. Reducing misses via Victim Cache
  • 4. Reducing misses via Pseudo-Associativity
  • 5. Reducing misses by HW Prefetching of Instr, Data
  • 6. Reducing misses by SW Prefetching of Data
  • 7. Reducing misses by Compiler Optimizations
  • Prefetching comes in two flavors:
    – Binding prefetch: requests load directly into a register.
      » Must be correct address and register!
    – Non-binding prefetch: load into cache.
      » Can be incorrect. Frees HW/SW to guess!

CPUtime = IC x (CPI_Execution + MemAccess/Inst x MissRate x MissPenalty) x CycleTime

CS252/Culler Lec 4.35 1/31/02

Review: Improving Cache Performance

  • 1. Reduce the miss rate,
  • 2. Reduce the miss penalty, or
  • 3. Reduce the time to hit in the cache.

CS252/Culler Lec 4.36 1/31/02

Write Policy: Write-Through vs Write-Back

  • Write-through: all writes update the cache and the underlying memory/cache
    – Can always discard cached data; the most up-to-date data is in memory
    – Cache control bit: only a valid bit
  • Write-back: all writes simply update the cache
    – Can’t just discard cached data; may have to write it back to memory
    – Cache control bits: both valid and dirty bits
  • Other advantages:
    – Write-through:
      » memory (or other processors) always has the latest data
      » simpler management of the cache
    – Write-back:
      » much lower bandwidth, since data is often overwritten multiple times
      » better tolerance to long-latency memory?


CS252/Culler Lec 4.37 1/31/02

Write Policy 2: Write Allocate vs Non-Allocate (What happens on a write miss?)

  • Write allocate: allocate a new cache line in the cache
    – Usually means a “read miss” to fill in the rest of the cache line!
    – Alternative: per-word valid bits
  • Write non-allocate (or “write-around”):
    – Simply send the write data through to the underlying memory/cache; don’t allocate a new cache line!

CS252/Culler Lec 4.38 1/31/02

1. Reducing Miss Penalty: Read Priority over Write on Miss

(Figure: the CPU reads and writes through the cache; a write buffer sits between the cache and DRAM (or lower memory) so that writes can drain in the background.)

CS252/Culler Lec 4.39 1/31/02

1. Reducing Miss Penalty: Read Priority over Write on Miss

  • Write-through with write buffers => RAW conflicts with main memory reads on cache misses
    – If we simply wait for the write buffer to empty, we might increase the read miss penalty (by 50% on the old MIPS 1000)
    – Check write buffer contents before a read; if there are no conflicts, let the memory access continue
  • Write-back: want the buffer to hold displaced blocks
    – Read miss replacing a dirty block
    – Normal: write the dirty block to memory, and then do the read
    – Instead: copy the dirty block to a write buffer, then do the read, and then do the write
    – The CPU stalls less, since it restarts as soon as the read is done

CS252/Culler Lec 4.40 1/31/02

2. Reduce Miss Penalty: Early Restart and Critical Word First

  • Don’t wait for the full block to be loaded before restarting the CPU
    – Early restart—as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
    – Critical Word First—request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
  • Generally useful only with large blocks
  • Spatial locality => we tend to want the next sequential word in the block, so it is not clear whether early restart helps

CS252/Culler Lec 4.41 1/31/02

3. Reduce Miss Penalty: Non-blocking Caches to Reduce Stalls on Misses

  • A non-blocking cache (lockup-free cache) allows the data cache to continue to supply cache hits during a miss
    – requires F/E bits on registers or out-of-order execution
    – requires multi-bank memories
  • “hit under miss” reduces the effective miss penalty by working during a miss vs. ignoring CPU requests
  • “hit under multiple miss” or “miss under miss” may further lower the effective miss penalty by overlapping multiple misses
    – Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
    – Requires multiple memory banks (otherwise it cannot be supported)
    – Pentium Pro allows 4 outstanding memory misses

CS252/Culler Lec 4.42 1/31/02

Value of Hit Under Miss for SPEC

  • FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
  • Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
  • 8 KB Data Cache, Direct Mapped, 32B block, 16-cycle miss

(Figure: “Hit under n Misses”; AMAT relative to the blocking-cache base, 0.2–2.0, for 0->1, 1->2, and 2->64 outstanding misses, across integer benchmarks eqntott, espresso, xlisp, compress, mdljsp2 and floating-point benchmarks ear, fpppp, tomcatv, swm256, doduc, su2cor, wave5, mdljdp2, hydro2d, alvinn, nasa7, spice2g6.)


CS252/Culler Lec 4.43 1/31/02

4: Add a Second-Level Cache

  • L2 Equations

    AMAT = HitTime_L1 + MissRate_L1 x MissPenalty_L1
    MissPenalty_L1 = HitTime_L2 + MissRate_L2 x MissPenalty_L2
    AMAT = HitTime_L1 + MissRate_L1 x (HitTime_L2 + MissRate_L2 x MissPenalty_L2)

  • Definitions:
    – Local miss rate—misses in this cache divided by the total number of memory accesses to this cache (MissRate_L2)
    – Global miss rate—misses in this cache divided by the total number of memory accesses generated by the CPU
    – The global miss rate is what matters

CS252/Culler Lec 4.44 1/31/02

Comparing Local and Global Miss Rates

  • 32 KByte 1st-level cache; increasing 2nd-level cache
  • Global miss rate is close to the single-level cache rate, provided L2 >> L1
  • Don’t use the local miss rate
  • L2 is not tied to the CPU clock cycle!
  • Cost & A.M.A.T.
  • Generally fast hit times and fewer misses
  • Since hits are few, target miss reduction

(Figure: local and global miss rates vs. L2 cache size, on linear and log scales.)

CS252/Culler Lec 4.45 1/31/02

Reducing Misses: Which Apply to the L2 Cache?

  • Reducing Miss Rate
  • 1. Reduce misses via Larger Block Size
  • 2. Reduce conflict misses via Higher Associativity
  • 3. Reducing conflict misses via Victim Cache
  • 4. Reducing conflict misses via Pseudo-Associativity
  • 5. Reducing misses by HW Prefetching of Instr, Data
  • 6. Reducing misses by SW Prefetching of Data
  • 7. Reducing capacity/conflict misses by Compiler Optimizations

CS252/Culler Lec 4.46 1/31/02

L2 cache block size & A.M.A.T.

  • 32KB L1; 8-byte path to memory

(Figure: relative CPU time vs. L2 block size: 1.36 at 16B, 1.28 at 32B, 1.27 at 64B, 1.34 at 128B, 1.54 at 256B, 1.95 at 512B.)

CS252/Culler Lec 4.47 1/31/02

Reducing Miss Penalty Summary

  • Four techniques
    – Read priority over write on miss
    – Early Restart and Critical Word First on miss
    – Non-blocking Caches (Hit under Miss, Miss under Miss)
    – Second Level Cache
  • Can be applied recursively to Multilevel Caches
    – The danger is that the time to DRAM will grow with multiple levels in between
    – First attempts at L2 caches can make things worse, since the increased worst case is worse

CPUtime = IC x (CPI_Execution + MemAccess/Inst x MissRate x MissPenalty) x CycleTime

CS252/Culler Lec 4.48 1/31/02

What is the Impact of What You’ve Learned About Caches?

  • 1960–1985: Speed = f(no. operations)
  • 1990
    – Pipelined Execution & Fast Clock Rate
    – Out-of-Order execution
    – Superscalar Instruction Issue
  • 1998: Speed = f(non-cached memory accesses)
  • Superscalar, Out-of-Order machines hide an L1 data cache miss (≈5 clocks) but not an L2 cache miss (≈50 clocks)?

(Figure: the CPU vs. DRAM performance plot again, 1980–2000, log scale 1–1000.)


CS252/Culler Lec 4.49 1/31/02

Cache Optimization Summary

Technique                           MR   MP   HT   Complexity
Larger Block Size                   +    –         0
Higher Associativity                +         –    1
Victim Caches                       +              2
Pseudo-Associative Caches           +              2
HW Prefetching of Instr/Data        +              2
Compiler Controlled Prefetching     +              3
Compiler Reduce Misses              +              0
Priority to Read Misses                  +         1
Early Restart & Critical Word 1st        +         2
Non-Blocking Caches                      +         3
Second Level Caches                      +         2

(MR = miss rate, MP = miss penalty, HT = hit time)