



  1. CS252 Graduate Computer Architecture
     Lecture 4: Cache Design
     January 31, 2002
     Prof. David Culler

Who Cares About the Memory Hierarchy?
[Figure: Processor-DRAM performance gap. Performance (log scale, 1 to 1000) vs. year (1980 to 2000). µProc (CPU): 60%/yr. ("Moore's Law"); DRAM: 7%/yr. ("Less' Law?"); the processor-memory performance gap grows 50%/year.]
• 1980: no cache in µproc; 1995: 2-level cache on chip (1989: first Intel µproc with a cache on chip)

Generations of Microprocessors
• Time of a full cache miss in instructions executed:
  1st Alpha: 340 ns / 5.0 ns = 68 clks x 2 or 136
  2nd Alpha: 266 ns / 3.3 ns = 80 clks x 4 or 320
  3rd Alpha: 180 ns / 1.7 ns = 108 clks x 6 or 648
• 1/2X latency x 3X clock rate x 3X Instr/clock ⇒ ~5X

Processor-Memory Performance Gap "Tax"
  Processor          % Area (~cost)   % Transistors (~power)
  Alpha 21164        37%              77%
  StrongArm SA110    61%              94%
  Pentium Pro        64%              88%
  – Pentium Pro: 2 dies per package: Proc/I$/D$ + L2$
• Caches have no "inherent value"; they only try to close the performance gap

What is a cache?
• Small, fast storage used to improve average access time to slow memory.
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
  – Registers: "a cache" on variables (software managed)
  – First-level cache: a cache on second-level cache
  – Second-level cache: a cache on memory
  – Memory: a cache on disk (virtual memory)
  – TLB: a cache on page table
  – Branch-prediction: a cache on prediction information?
[Figure: memory hierarchy, top to bottom: Proc/Regs, L1-Cache, L2-Cache, Memory, Disk, Tape, etc. Levels get bigger going down and faster going up.]

Traditional Four Questions for Memory Hierarchy Designers
• Q1: Where can a block be placed in the upper level? (Block placement)
  – Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in the upper level? (Block identification)
  – Tag/Block
• Q3: Which block should be replaced on a miss? (Block replacement)
  – Random, LRU
• Q4: What happens on a write? (Write strategy)
  – Write Back or Write Through (with Write Buffer)
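
Q1 (block placement) and Q2 (block identification) can be illustrated with a small sketch of how a byte address splits into tag, set index, and block offset. This is only an illustration: the class name `CacheGeometry`, its `split` method, and the 1 KB / 32-byte-block sizes are assumptions made here, not taken from the slides.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CacheGeometry:
    """Hypothetical cache shape; all sizes are assumed to be powers of two."""
    size_bytes: int
    block_bytes: int
    ways: int  # 1 = direct mapped; size_bytes // block_bytes = fully associative

    @property
    def num_sets(self) -> int:
        return self.size_bytes // (self.block_bytes * self.ways)

    def split(self, addr: int) -> Tuple[int, int, int]:
        """Return (tag, set_index, block_offset) for a byte address."""
        offset = addr % self.block_bytes
        block_number = addr // self.block_bytes
        set_index = block_number % self.num_sets  # Q1: where the block may go
        tag = block_number // self.num_sets       # Q2: what is compared to find it
        return tag, set_index, offset


# Same capacity and block size, three placement policies (illustrative sizes).
direct = CacheGeometry(size_bytes=1024, block_bytes=32, ways=1)
two_way = CacheGeometry(size_bytes=1024, block_bytes=32, ways=2)
fully = CacheGeometry(size_bytes=1024, block_bytes=32, ways=32)

addr = 0x1234
for name, geom in [("direct mapped", direct), ("2-way", two_way), ("fully assoc.", fully)]:
    tag, index, offset = geom.split(addr)
    print(f"{name:14s} sets={geom.num_sets:2d} tag={tag} index={index} offset={offset}")
```

As associativity grows, the number of sets shrinks, so fewer address bits select the set and more bits go into the tag; in the fully associative case there is a single set and the whole block number is the tag.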

  2. Review: Cache performance
• Miss-oriented approach to memory access:

$$\text{CPUtime} = \text{IC} \times \left( \text{CPI}_{\text{Execution}} + \frac{\text{MemAccess}}{\text{Inst}} \times \text{MissRate} \times \text{MissPenalty} \right) \times \text{CycleTime}$$

$$\text{CPUtime} = \text{IC} \times \left( \text{CPI}_{\text{Execution}} + \frac{\text{MemMisses}}{\text{Inst}} \times \text{MissPenalty} \right) \times \text{CycleTime}$$

  – CPI_Execution includes ALU and Memory instructions
• Separating out the memory component entirely:
  – AMAT = Average Memory Access Time
  – CPI_AluOps does not include memory instructions

$$\text{CPUtime} = \text{IC} \times \left( \frac{\text{AluOps}}{\text{Inst}} \times \text{CPI}_{\text{AluOps}} + \frac{\text{MemAccess}}{\text{Inst}} \times \text{AMAT} \right) \times \text{CycleTime}$$

$$\begin{aligned}
\text{AMAT} &= \text{HitTime} + \text{MissRate} \times \text{MissPenalty} \\
&= \left( \text{HitTime}_{\text{Inst}} + \text{MissRate}_{\text{Inst}} \times \text{MissPenalty}_{\text{Inst}} \right) + \left( \text{HitTime}_{\text{Data}} + \text{MissRate}_{\text{Data}} \times \text{MissPenalty}_{\text{Data}} \right)
\end{aligned}$$

What are all the aspects of cache organization that impact performance?

Unified vs Split Caches
• Unified vs Separate I&D
[Figure: split organization (Proc → I-Cache-1 and D-Cache-1 → Unified Cache-2) next to unified organization (Proc → Unified Cache-1 → Unified Cache-2).]
• Example:
  – 16KB I&D: Inst miss rate = 0.64%, Data miss rate = 6.47%
  – 32KB unified: Aggregate miss rate = 1.99%
• Which is better (ignore L2 cache)?
  – Assume 33% data ops ⇒ 75% accesses from instructions (1.0/1.33)
  – hit time = 1, miss time = 50
  – Note that data hit has 1 stall for unified cache (only one port)
  AMAT_Harvard = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
  AMAT_Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24

Impact on Performance
• Suppose a processor executes at
  – Clock Rate = 200 MHz (5 ns per cycle), Ideal (no misses) CPI = 1.1
  – 50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of memory operations get 50 cycle miss penalty
• Suppose that 1% of instructions get same miss penalty
• CPI = ideal CPI + average stalls per instruction
  = 1.1 (cycles/ins)
    + [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycle/miss)]
    + [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycle/miss)]
  = (1.1 + 1.5 + 0.5) cycle/ins = 3.1
• About 65% of the time (2.0 of 3.1 cycles) the proc is stalled waiting for memory!
• AMAT = (1/1.3) x [1 + 0.01 x 50] + (0.3/1.3) x [1 + 0.1 x 50] = 2.54

Where do misses come from?
• Classifying Misses: 3 Cs
  – Compulsory: The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses in even an Infinite Cache)
  – Capacity: If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in Fully Associative Size X Cache)
  – Conflict: If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in N-way Associative, Size X Cache)
• 4th "C":
  – Coherence: Misses caused by cache coherence.

How to Improve Cache Performance?

$$\text{AMAT} = \text{HitTime} + \text{MissRate} \times \text{MissPenalty}$$

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
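
The CPI and AMAT figures quoted in the worked examples can be re-derived with a few lines of arithmetic. The sketch below only plugs the slide's stated rates and penalties into the AMAT and CPI formulas; the function and variable names are illustrative choices, not part of the lecture.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = HitTime + MissRate x MissPenalty (cycles)."""
    return hit_time + miss_rate * miss_penalty


def cpi_with_stalls(ideal_cpi, data_ops_per_inst, data_miss_rate,
                    inst_miss_rate, miss_penalty):
    """CPI = ideal CPI + average memory stall cycles per instruction."""
    data_stalls = data_ops_per_inst * data_miss_rate * miss_penalty
    inst_stalls = 1.0 * inst_miss_rate * miss_penalty  # one fetch per instruction
    return ideal_cpi + data_stalls + inst_stalls


# "Impact on Performance": ideal CPI 1.1, 30% ld/st, 10% data misses,
# 1% instruction misses, 50-cycle miss penalty.
cpi = cpi_with_stalls(1.1, 0.30, 0.10, 0.01, 50)
amat_impact = (1 / 1.3) * amat(1, 0.01, 50) + (0.3 / 1.3) * amat(1, 0.10, 50)

# "Unified vs Split Caches": 75% of accesses are instruction fetches, 25% data.
INST_FRAC, DATA_FRAC = 0.75, 0.25
amat_harvard = INST_FRAC * amat(1, 0.0064, 50) + DATA_FRAC * amat(1, 0.0647, 50)
# A data access to the single-ported unified cache pays one extra stall cycle.
amat_unified = INST_FRAC * amat(1, 0.0199, 50) + DATA_FRAC * amat(1 + 1, 0.0199, 50)

print(f"CPI           = {cpi:.3f}")
print(f"AMAT (impact) = {amat_impact:.3f}")
print(f"AMAT Harvard  = {amat_harvard:.3f}")
print(f"AMAT Unified  = {amat_unified:.3f}")
```

Under these assumptions the split (Harvard) organization comes out ahead despite its higher data miss rate, because the unified cache's extra port-conflict cycle is paid on every data access.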

  3. 3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate (0 to 0.14) vs. cache size (1 to 128 KB) for 1-way, 2-way, 4-way, and 8-way caches; regions labeled Conflict, Capacity, and Compulsory.]

Cache Size
[Figure: same 3Cs plot, miss rate (0 to 0.14) vs. cache size (1 to 128 KB), 1-way through 8-way.]
• Old rule of thumb: 2x size => 25% cut in miss rate
• What does it reduce?

Cache Organization?
• Assume total cache size not changed:
• What happens if:
  1) Change Block Size:
  2) Change Associativity:
  3) Change Compiler:
Which of 3Cs is obviously affected?

Huge Caches => Working Sets
[Figure: data traffic vs. replication capacity (cache size), with knees at the first and second working sets. Traffic components: capacity-generated traffic (including conflicts), other capacity-independent communication, inherent communication, cold-start (compulsory) traffic.]
[Figure: Example, LU Decomposition from NAS Parallel Benchmarks. Miss rate (0 to 14%) vs. per-processor cache size (1 to 4096 KB) for 4-node, 8-node, 16-node, and 32-node configurations.]

Larger Block Size (fixed size & assoc)
[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes 1K, 4K, 16K, 64K, and 256K. Annotations: reduced compulsory misses at larger blocks; increased conflict misses at the largest blocks.]
What else drives up block size?

Associativity
[Figure: miss rate (0 to 0.14) vs. cache size (1 to 128 KB) for 1-way, 2-way, 4-way, and 8-way caches; regions labeled Conflict, Capacity, and Compulsory.]
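
The effect of associativity on the 3Cs can be sketched with a tiny trace-driven simulation that follows the slide definitions: compulsory misses are those of an infinite cache, capacity misses are the additional misses of a fully associative cache of the same size, and conflict misses are whatever the real organization adds on top. Everything here is an illustrative assumption: the function names, LRU replacement, the block-number trace, and the 8-block cache size.

```python
from collections import OrderedDict


def count_misses(trace, num_blocks, ways):
    """Misses for an LRU set-associative cache that holds num_blocks blocks."""
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in trace:
        lru = sets[block % num_sets]
        if block in lru:
            lru.move_to_end(block)       # hit: mark as most recently used
        else:
            misses += 1
            if len(lru) == ways:
                lru.popitem(last=False)  # set is full: evict least recently used
            lru[block] = True
    return misses


def three_cs(trace, num_blocks, ways):
    """Split the misses of a cache into compulsory, capacity, and conflict."""
    compulsory = len(set(trace))  # first touches miss even in an infinite cache
    fully_assoc = count_misses(trace, num_blocks, ways=num_blocks)
    actual = count_misses(trace, num_blocks, ways)
    return {
        "compulsory": compulsory,
        "capacity": fully_assoc - compulsory,
        "conflict": actual - fully_assoc,
    }


# Blocks 0 and 8 map to the same set of an 8-block direct-mapped cache.
trace = [0, 8, 0, 8, 0, 8]
direct_mapped = three_cs(trace, num_blocks=8, ways=1)
two_way = three_cs(trace, num_blocks=8, ways=2)
print("direct mapped:", direct_mapped)
print("2-way:        ", two_way)
```

On this ping-pong trace the direct-mapped cache misses on every access, while two ways are enough to remove all conflict misses. The subtraction-based breakdown is a common simplification; with LRU, some traces make the fully associative cache miss more often than the set-associative one, in which case the "conflict" count comes out negative.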
