
CS252 Graduate Computer Architecture Lecture 7 Cache Design (continued)

Feb 12, 2002, Prof. David Culler


How to Improve Cache Performance?

  • 1. Reduce the miss rate,
  • 2. Reduce the miss penalty, or
  • 3. Reduce the time to hit in the cache.

AMAT = Hit Time + Miss Rate × Miss Penalty
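For instance (illustrative numbers, not from the slides): with a 1-cycle hit time, a 5% miss rate, and a 20-cycle miss penalty, AMAT = 1 + 0.05 × 20 = 2 cycles, so misses double the average access time.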


Where do misses come from?

  • Classifying Misses: 3 Cs

– Compulsory—The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses in even an Infinite Cache)

– Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in Fully Associative Size X Cache)

– Conflict—If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in N-way Associative, Size X Cache)

  • 4th “C”:

– Coherence—Misses caused by invalidations needed to keep caches coherent in multiprocessors.


3Cs Absolute Miss Rate (SPEC92)

[Figure: miss rate (0.02-0.14) vs. cache size (1-128 KB) for 1-way through 8-way associativity, decomposed into compulsory, capacity, and conflict components]

Reducing Misses by Hardware Prefetching of Instructions & Data

  • E.g., Instruction Prefetching

– Alpha 21064 fetches 2 blocks on a miss
– Extra block placed in “stream buffer”
– On miss, check stream buffer

  • Works with data blocks too:

– Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4KB cache; 4 streams caught 43%
– Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of misses from two 64KB, 4-way set-associative caches

  • Prefetching relies on having extra memory bandwidth that can be used without penalty


Reducing Misses by Software Prefetching Data

  • Data Prefetch

– Load data into register (HP PA-RISC loads)
– Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v.9)
– Special prefetching instructions cannot cause faults; a form of speculative execution

  • Prefetching comes in two flavors:

– Binding prefetch: requests load directly into register. » Must be correct address and register!
– Non-binding prefetch: load into cache. » Can be incorrect. Faults?

  • Issuing prefetch instructions takes time (see the sketch after this list)

– Is the cost of prefetch issues < savings in reduced misses?
– Higher superscalar reduces the difficulty of issue bandwidth
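To make the non-binding flavor concrete, here is a minimal sketch (mine, not from the slides) using GCC/Clang's __builtin_prefetch intrinsic; the prefetch distance of 16 elements is an untuned, illustrative guess.

/* Sketch: software-prefetched reduction. The prefetch is non-binding:
 * it may be dropped by the hardware and can never fault. */
double sum_with_prefetch(const double *a, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], /*rw=*/0, /*locality=*/1);
        sum += a[i];   /* by now a[i] is hopefully already in the cache */
    }
    return sum;
}

Whether this pays off is exactly the slide's question: the prefetch instructions cost issue slots, so the savings in misses must exceed that cost.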


Reducing Misses by Compiler Optimizations

  • McFarling [1989] reduced cache misses by 75% in software on an 8KB direct-mapped cache with 4-byte blocks
  • Instructions

– Reorder procedures in memory so as to reduce conflict misses
– Profiling to look at conflicts (using tools they developed)

  • Data

– Merging Arrays: improve spatial locality by a single array of compound elements vs. 2 arrays
– Loop Interchange: change nesting of loops to access data in the order stored in memory
– Loop Fusion: combine 2 independent loops that have the same looping and some variables overlap
– Blocking: improve temporal locality by accessing “blocks” of data repeatedly vs. going down whole columns or rows


Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val & key; improves spatial locality (a usage sketch follows).
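A hypothetical traversal (not from the slides) showing the access pattern that benefits: with the merged layout, m[i].val and m[i].key share a cache block, so one miss serves both, versus up to two misses when walking separate val[] and key[] arrays. The function and parameter names are illustrative.

/* Hypothetical helper: sum vals whose key matches target_key. */
int sum_matching_vals(const struct merge *m, int n, int target_key)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        if (m[i].key == target_key)   /* key and val arrive together */
            sum += m[i].val;
    return sum;
}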


Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality.
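To put numbers on the stride (illustrative assumptions: row-major C layout, 4-byte ints, 32-byte cache blocks): x[i][j] and x[i][j+1] are adjacent, but x[i][j] and x[i+1][j] are 100 × 4 = 400 bytes apart. The before version therefore touches a new cache block on nearly every access, while the after version gets about 8 accesses per block fetched.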


Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

Before: 2 misses per access to a & c; after: one miss per access. Improves temporal locality: a[i][j] and c[i][j] are reused while still in the cache.


Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

  • Two Inner Loops:

– Read all N×N elements of z[]
– Read N elements of 1 row of y[] repeatedly
– Write N elements of 1 row of x[]

  • Capacity misses are a function of N & cache size:

– 2N³ + N² words accessed => (assuming no conflict; otherwise …)

  • Idea: compute on a B×B submatrix that fits in the cache


Blocking Example

/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B, N); j = j+1) {  /* bound fixed from jj+B-1: covers the full B-wide block */
                r = 0;
                for (k = kk; k < min(kk+B, N); k = k+1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;                 /* accumulates: assumes x initialized to 0 */
            }

  • B is called the Blocking Factor
  • Capacity misses drop from 2N³ + N² to N³/B + 2N² (worked example below)
  • Conflict Misses Too?
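A quick illustrative check of the formula (numbers mine, not from the slides): for N = 100, the unblocked version touches about 2·100³ + 100² ≈ 2.01 million words, while blocking with B = 20 brings this to 100³/20 + 2·100² = 70,000 words, roughly a 29× reduction in capacity traffic.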

Reducing Conflict Misses by Blocking

  • Conflict misses in caches that are not fully associative, vs. blocking size

– Lam et al [1991]: a blocking factor of 24 had a fifth the misses of 48, despite both fitting in the cache

[Figure: miss rate (0.05-0.1) vs. blocking factor (50-150) for a fully associative cache and a direct-mapped cache]


Summary of Compiler Optimizations to Reduce Cache Misses (by hand)

[Figure: performance improvement (1x-3x) from merged arrays, loop interchange, loop fusion, and blocking on compress, cholesky (nasa7), spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), vpenta (nasa7)]


Summary: Miss Rate Reduction

  • 3 Cs: Compulsory, Capacity, Conflict
  • 0. Larger cache
  • 1. Reduce Misses via Larger Block Size
  • 2. Reduce Misses via Higher Associativity
  • 3. Reducing Misses via Victim Cache
  • 4. Reducing Misses via Pseudo-Associativity
  • 5. Reducing Misses by HW Prefetching Instr, Data
  • 6. Reducing Misses by SW Prefetching Data
  • 7. Reducing Misses by Compiler Optimizations
  • Prefetching comes in two flavors:

– Binding prefetch: requests load directly into register. » Must be correct address and register!
– Non-binding prefetch: load into cache. » Can be incorrect. Frees HW/SW to guess!

CPUtime = IC × (CPI_Execution + (Memory accesses / Instruction) × Miss rate × Miss penalty) × Clock cycle time
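Plugging in illustrative values (mine, not from the slides): with CPI_Execution = 1.0, 1.3 memory accesses per instruction, a 2% miss rate, and a 50-cycle miss penalty, the effective CPI is 1.0 + 1.3 × 0.02 × 50 = 2.3, i.e., memory stalls more than double execution time.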


Review: Improving Cache Performance

  • 1. Reduce the miss rate,
  • 2. Reduce the miss penalty, or
  • 3. Reduce the time to hit in the cache.


Write Policy: Write-Through vs Write-Back

  • Write-through: all writes update cache and underlying memory/cache

– Can always discard cached data - most up-to-date data is in memory
– Cache control bit: only a valid bit

  • Write-back: all writes simply update cache

– Can’t just discard cached data - may have to write it back to memory
– Cache control bits: both valid and dirty bits

  • Other Advantages:

– Write-through:
» memory (or other processors) always have latest data
» simpler management of cache
– Write-back:
» much lower bandwidth, since data often overwritten multiple times
» better tolerance to long-latency memory?


Write Policy 2: Write Allocate vs Non-Allocate (what happens on a write miss)

  • Write allocate: allocate a new cache line in the cache

– Usually means that you have to do a “read miss” to fill in the rest of the cache line!
– Alternative: per-word valid bits

  • Write non-allocate (or “write-around”):

– Simply send write data through to underlying memory/cache - don’t allocate a new cache line!


  • 1. Reducing Miss Penalty: Read Priority over Write on Miss

[Figure: CPU read and write paths through a write buffer to DRAM (or lower memory); the write buffer sits between the cache and memory]


  • 1. Reducing Miss Penalty: Read Priority over Write on Miss

  • Write-through with write buffers => RAW conflicts with main memory reads on cache misses

– If we simply wait for the write buffer to empty, we might increase the read miss penalty (old MIPS 1000 by 50%)
– Check write buffer contents before the read; if no conflicts, let the memory access continue (a behavioral sketch follows this list)

  • Write-back: want the buffer to hold displaced blocks

– Read miss replacing a dirty block
– Normal: write the dirty block to memory, and then do the read
– Instead copy the dirty block to a write buffer, then do the read, and then do the write
– CPU stalls less since it restarts as soon as the read is done
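A minimal behavioral sketch (mine, not from the slides) of the “check write buffer contents before read” rule; the 4-entry size loosely echoes the Alpha 21064 described later, and all names are illustrative.

#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4
struct wb_entry { uint64_t addr; uint64_t data; bool valid; };
static struct wb_entry write_buffer[WB_ENTRIES];

/* On a read miss, scan the buffer for a RAW conflict. If a pending
 * write matches, forward its data; otherwise the read may proceed to
 * memory ahead of the buffered writes, without draining them. */
bool read_miss_forward(uint64_t addr, uint64_t *data_out)
{
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (write_buffer[i].valid && write_buffer[i].addr == addr) {
            *data_out = write_buffer[i].data;  /* RAW conflict: forward */
            return true;
        }
    }
    return false;  /* no conflict: safe to give the read priority */
}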


  • 2. Reduce Miss Penalty: Early Restart and Critical Word First
  • Don’t wait for the full block to be loaded before restarting the CPU

– Early restart—As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
– Critical Word First—Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling in the rest of the words in the block. Also called wrapped fetch and requested word first

  • Generally useful only with large blocks
  • Spatial locality => tend to want the next sequential word in the block, so it is not clear there is a benefit from early restart
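To put numbers on this (illustrative assumptions: 32-byte blocks, an 8-byte path to memory, and the missed word in the last transfer): without critical word first the CPU waits for four bus transfers; with it, the needed word arrives on the first transfer, cutting the exposed penalty to roughly a quarter.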


  • 3. Reduce Miss Penalty: Non-blocking Caches to Reduce Stalls on Misses
  • Non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss

– requires F/E bits on registers or out-of-order execution
– requires multi-bank memories

  • “hit under miss” reduces the effective miss penalty by working during the miss instead of ignoring CPU requests
  • “hit under multiple miss” or “miss under miss” may further lower the effective miss penalty by overlapping multiple misses

– Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
– Requires multiple memory banks (otherwise cannot support it)
– Pentium Pro allows 4 outstanding memory misses


Value of Hit Under Miss for SPEC

  • FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
  • Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
  • 8 KB Data Cache, Direct Mapped, 32B blocks, 16-cycle miss

[Figure: “Hit under n Misses” - AMAT (0.2-2.0) for 0->1, 1->2, and 2->64 outstanding misses vs. Base, across integer (eqntott, espresso, xlisp, compress, mdljsp2) and floating-point (ear, fpppp, tomcatv, swm256, doduc, su2cor, wave5, mdljdp2, hydro2d, alvinn, nasa7, spice2g6, ora) benchmarks]


4: Add a second-level cache

  • L2 Equations:

AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)

  • Definitions (worked example below):

– Local miss rate—misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
– Global miss rate—misses in this cache divided by the total number of memory accesses generated by the CPU (for L2: Miss Rate_L1 × Miss Rate_L2)
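An illustrative calculation (numbers mine, not from the slides): if Miss Rate_L1 = 4% and the local Miss Rate_L2 = 25%, the global L2 miss rate is 0.04 × 0.25 = 1%; with Hit Time_L1 = 1, Hit Time_L2 = 10, and Miss Penalty_L2 = 100 cycles, AMAT = 1 + 0.04 × (10 + 0.25 × 100) = 2.4 cycles.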


Partner Discussion

What’s different in L2 vs L1 caches?


Comparing Local and Global Miss Rates

  • 32 KByte 1st-level cache; increasing 2nd-level cache
  • Global miss rate close to the single-level cache rate, provided L2 >> L1
  • Don’t use the local miss rate
  • L2 not tied to CPU clock cycle!
  • Cost & A.M.A.T.
  • Generally fast hit times and fewer misses
  • Since hits are few, target miss reduction

[Figure: local and global miss rates vs. L2 cache size, plotted on linear and log scales]


Reducing Misses: Which apply to L2 Cache?

  • Reducing Miss Rate
  • 1. Reduce Misses via Larger Block Size
  • 2. Reduce Conflict Misses via Higher Associativity
  • 3. Reducing Conflict Misses via Victim Cache
  • 4. Reducing Conflict Misses via Pseudo-Associativity
  • 5. Reducing Misses by HW Prefetching Instr, Data
  • 6. Reducing Misses by SW Prefetching Data
  • 7. Reducing Capacity/Conf. Misses by Compiler Optimizations


L2 cache block size & A.M.A.T.

  • 32KB L1, 8-byte path to memory

Block size (bytes):  16    32    64    128   256   512
Relative CPU time:   1.36  1.28  1.27  1.34  1.54  1.95

(minimum at 64-byte blocks)


Reducing Miss Penalty Summary

  • Four techniques

– Read priority over write on miss
– Early Restart and Critical Word First on miss
– Non-blocking Caches (Hit under Miss, Miss under Miss)
– Second-Level Cache

  • Can be applied recursively to Multilevel Caches

– Danger is that time to DRAM will grow with multiple levels in between
– First attempts at L2 caches can make things worse, since the increased worst case is worse

CPUtime = IC × (CPI_Execution + (Memory accesses / Instruction) × Miss rate × Miss penalty) × Clock cycle time


What is the Impact of What You’ve Learned About Caches?

  • 1960-1985: Speed = ƒ(no. operations)
  • 1990

– Pipelined Execution & Fast Clock Rate
– Out-of-Order execution
– Superscalar Instruction Issue

  • 1998: Speed = ƒ(non-cached memory accesses)
  • Superscalar, out-of-order machines hide an L1 data cache miss (~5 clocks) but not an L2 cache miss (~50 clocks)?

[Figure: DRAM vs. CPU performance, 1980-2000, log scale 1-1000x]


  • 1. Fast Hit Times via Small and Simple Caches
  • Why does the Alpha 21164 have 8KB instruction and 8KB data caches + a 96KB second-level cache?

– Small data cache enables a fast clock rate

  • Direct mapped, on chip


Address Translation

  • Page table is a large data structure in memory
  • Two memory accesses for every load, store, or instruction fetch!!!
  • Virtually addressed cache?

– synonym problem

  • Cache the address translations?

[Figure: CPU -> Translation -> Cache -> Main Memory, with VA/PA, hit/miss, and data paths]


TLBs

A way to speed up translation is to use a special cache of recently used page table entries - this has many names, but the most frequently used is Translation Lookaside Buffer or TLB. Really just a cache on the page table mappings. TLB access time is comparable to cache access time (much less than main memory access time).

[Figure: TLB entry fields - Virtual Address, Physical Address, Dirty, Ref, Valid, Access]


Translation Look-Aside Buffers

Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped. TLBs are usually small, typically not more than 128-256 entries even on high-end machines; this permits fully associative lookup on those machines. Most mid-range machines use small n-way set-associative organizations. A lookup sketch follows.

[Figure: Translation with a TLB - CPU -> TLB Lookup (hit: 1/2 t) -> Cache (t) -> Main Memory; on a TLB miss, translation costs 20 t]
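A minimal direct-mapped TLB lookup sketch in C (mine, not from the slides; the 64-entry size, 4KB pages, and field layout are illustrative assumptions).

#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64   /* illustrative; real TLBs hold up to ~128-256 entries */
#define PAGE_SHIFT  12   /* assumes 4KB pages */

struct tlb_entry {
    uint64_t vpn;        /* virtual page number (the tag) */
    uint64_t pfn;        /* physical frame number */
    bool     valid;
};
static struct tlb_entry tlb[TLB_ENTRIES];

/* Translate va; returns true on a TLB hit. On a miss, a real machine
 * would walk the page table (in HW, or in SW as on the Alpha) and refill. */
bool tlb_lookup(uint64_t va, uint64_t *pa)
{
    uint64_t vpn = va >> PAGE_SHIFT;
    struct tlb_entry *e = &tlb[vpn & (TLB_ENTRIES - 1)];  /* direct mapped */
    if (e->valid && e->vpn == vpn) {
        *pa = (e->pfn << PAGE_SHIFT) | (va & ((1ULL << PAGE_SHIFT) - 1));
        return true;
    }
    return false;  /* TLB miss */
}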


  • 2. Fast Hits by Avoiding Address Translation

[Figure: three organizations - Conventional (CPU -> TB -> $ -> MEM, physically addressed); Virtually Addressed Cache (CPU -> $ -> TB -> MEM: translate only on miss, synonym problem); Overlapped (overlap $ access with VA translation: requires the $ index to remain invariant across translation; VA tags in L1, PA tags in L2 $)]


  • 2. Fast Cache Hits by Avoiding Translation: Index with Physical Portion of Address
  • If the index is in the physical part of the address, can start tag access in parallel with translation, so the compare can use the physical tag
  • Limits cache to page size: what if we want bigger caches while using the same trick? (a worked example follows)

– Higher associativity moves the barrier to the right
– Page coloring

[Figure: address breakdown - Page Address | Page Offset on top, Address Tag | Index | Block Offset below; the index must fit within the page offset]
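To see the page-size limit concretely (illustrative numbers, not from the slides): 4KB pages leave 12 untranslated offset bits; with 32-byte blocks (5 offset bits), a direct-mapped cache can use at most 12 - 5 = 7 index bits, i.e., 128 sets × 32B = 4KB. Going 2-way set associative keeps 128 sets but doubles capacity to 8KB without widening the index, which is how higher associativity "moves the barrier to the right."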


  • 2. Fast Hits by Avoiding Address Translation
  • Send the virtual address to the cache? Called a Virtually Addressed Cache or just Virtual Cache, vs. Physical Cache

– Every time a process is switched, must logically flush the cache; otherwise get false hits
» Cost is time to flush + “compulsory” misses from the empty cache
» Alternative: add a process-identifier tag that identifies the process as well as the address within the process: can’t get a hit if wrong process

  • Dealing with aliases (sometimes called synonyms): two different virtual addresses map to the same physical address

– solve by fiat: no aliasing! What are the implications?
– HW antialiasing: guarantees every cache block has a unique address
» verify on miss (rather than on every hit)
» cache set size <= page size?
» what if it gets larger?
– How can SW simplify the problem? (called page coloring)
– I/O must interact with the cache, so it needs the virtual address


3: Fast Hits by Pipelining the Cache - Case Study: MIPS R4000

  • 8-Stage Pipeline:

– IF–first half of instruction fetch; PC selection happens here, as well as initiation of the instruction cache access.
– IS–second half of access to the instruction cache.
– RF–instruction decode and register fetch, hazard checking, and also instruction cache hit detection.
– EX–execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
– DF–data fetch, first half of access to the data cache.
– DS–second half of access to the data cache.
– TC–tag check, to determine whether the data cache access hit.
– WB–write back for loads and register-register operations.

  • What is the impact on load delay?

– Need 2 instructions between a load and its use!


Case Study: MIPS R4000

[Figure: pipeline timing diagrams showing the TWO-cycle load latency and the THREE-cycle branch latency (conditions evaluated during the EX phase); delay slot plus two stalls; branch-likely cancels the delay slot if not taken]


R4000 Performance

  • Not the ideal CPI of 1:

– Load stalls (1 or 2 clock cycles)
– Branch stalls (2 cycles + unfilled slots)
– FP result stalls: RAW data hazard (latency)
– FP structural stalls: not enough FP hardware (parallelism)

[Figure: CPI breakdown (0.5-4.5) into Base, Load stalls, Branch stalls, FP result stalls, and FP structural stalls for eqntott, espresso, gcc, li, doduc, nasa7, ora, spice2g6, su2cor, tomcatv]

What is the Impact of What You’ve Learned About Caches?

  • 1960-1985: Speed = ƒ(no. operations)
  • 1990

– Pipelined Execution & Fast Clock Rate
– Out-of-Order execution
– Superscalar Instruction Issue

  • 1998: Speed = ƒ(non-cached memory accesses)
  • What does this mean for

– Compilers? Operating Systems? Algorithms? Data Structures?

[Figure: DRAM vs. CPU performance, 1980-2000, log scale 1-1000x]

Alpha 21064

  • Separate Instr & Data TLBs & Caches
  • TLBs fully associative
  • TLB updates in SW (“Priv Arch Libr”)
  • Caches 8KB direct mapped, write through
  • Critical 8 bytes first
  • Prefetch instr. stream buffer
  • 2 MB L2 cache, direct mapped, WB (off-chip)
  • 256-bit path to main memory, 4 x 64-bit modules
  • Victim Buffer: to give read priority over write
  • 4-entry write buffer between D$ & L2$

[Figure: 21064 memory hierarchy with Instr and Data caches, Stream Buffer, Write Buffer, and Victim Buffer]


Alpha Memory Performance: Miss Rates of SPEC92

  • 8K I$, 8K D$, 2M L2

[Figure: miss rates (0.01%-100%, log scale) of I$, D$, and L2 across AlphaSort, TPC-B (db1), Li, Sc, Compress, Ora, Ear, Doduc, Tomcatv, Mdljp2, Spice, Su2cor; annotated points include I$ miss = 2%, D$ miss = 13%, L2 miss = 0.6%; I$ miss = 1%, D$ miss = 21%, L2 miss = 0.3%; I$ miss = 6%, D$ miss = 32%, L2 miss = 10%]


Alpha CPI Components

  • Instruction stall: branch mispredict (green)
  • Data cache (blue); instruction cache (yellow); L2$ (pink)
  • Other: compute + register conflicts, structural conflicts

[Figure: CPI (0-5) broken into L2, I$, D$, I Stall, and Other for AlphaSort, TPC-B (db1), Li, Sc, Compress, Ora, Ear, Doduc, Tomcatv, Mdljp2]


Pitfall: Predicting Cache Performance from a Different Program (ISA, compiler, ...)

  • 4KB data cache: miss rate 8%, 12%, or 28%?
  • 1KB instr cache: miss rate 0%, 3%, or 10%?
  • Alpha vs. MIPS for 8KB Data $: 17% vs. 10%
  • Why 2X Alpha v. MIPS?

[Figure: miss rate (0%-35%) vs. cache size (1-128 KB) for the data and instruction caches of tomcatv, gcc, and espresso]


Cache Optimization Summary

(MR = miss rate, MP = miss penalty, HT = hit time; + helps, – hurts)

Technique                           MR   MP   HT   Complexity
Larger Block Size                   +    –         0
Higher Associativity                +         –    1
Victim Caches                       +              2
Pseudo-Associative Caches           +              2
HW Prefetching of Instr/Data        +              2
Compiler Controlled Prefetching     +              3
Compiler Reduce Misses              +              0
Priority to Read Misses                  +         1
Early Restart & Critical Word 1st        +         2
Non-Blocking Caches                      +         3
Second Level Caches                      +         2
Better memory system                     +         3
Small & Simple Caches               –         +    0
Avoiding Address Translation                  +    2
Pipelining Caches                             +    2