now handout page 1

NOW Handout Page 1, CS258 S99 (PDF document)



  1. COSC 5351 Advanced Computer Architecture, 10/26/2011. Slides modified from Hennessy CS252 course slides.
     • 11 Advanced Cache Optimizations
     • Memory Technology and DRAM optimizations
     • Virtual Machines
     • Xen VM: Design and Performance
     • AMD Opteron Memory Hierarchy
     • Opteron Memory Performance vs. Pentium 4
     • Fallacies and Pitfalls
     • Conclusion

     • How does a memory hierarchy improve performance?
     • What costs are associated with a memory access?
     [Figure: Processor-Memory Performance Gap (growing). Processor and memory performance on a log scale from 1 to 100,000, plotted against year from 1980 to 2010.]

     • Virtual memory (VM) is 2^64 bytes, or 16 EB.

  2. • Physical memory is 2^41 bytes, or 2 TB.
     • Page size is 2^13 bytes, or 8 KB.
     • The TLB has 2^8 entries and is direct mapped in this case (TLBs are often fully associative).
     • L1 is 2^13 bytes (8 KB), direct mapped, with 64 B blocks.
     • Compare the 43-bit tag with the tag in the appropriate TLB slot.
     • If the page is in the TLB, check the L1 cache tag in the appropriate line to see whether the block is in L1.
     • If it is not in L1, build the physical address (PA) from the 28-bit TLB data plus the page offset, and use it to access the L2 cache.
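The bit widths quoted in this walk-through can be rederived from the sizes alone. The following is a minimal sketch, not part of the original slides: the sizes are the ones given in the handout, while the function name `field_widths` and the dictionary keys are hypothetical names chosen for illustration.

```python
# Field widths for the address-translation walk-through.
# All sizes come from the slides; function and key names are illustrative.

def log2_exact(n: int) -> int:
    """Return log2(n) for a power of two, raising on anything else."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1


def field_widths(va_bits=64, pa_bits=41, page_bytes=2**13,
                 tlb_entries=2**8, l1_bytes=2**13, block_bytes=64):
    """Split the virtual and physical addresses into their fields."""
    page_offset = log2_exact(page_bytes)
    vpn = va_bits - page_offset                     # virtual page number
    tlb_index = log2_exact(tlb_entries)             # selects the direct-mapped TLB slot
    block_offset = log2_exact(block_bytes)
    l1_index = log2_exact(l1_bytes // block_bytes)  # selects the direct-mapped L1 line
    return {
        "page_offset": page_offset,
        "tlb_index": tlb_index,
        "tlb_tag": vpn - tlb_index,
        "tlb_data": pa_bits - page_offset,          # physical page number
        "l1_block_offset": block_offset,
        "l1_index": l1_index,
        "l1_tag": pa_bits - l1_index - block_offset,
    }


if __name__ == "__main__":
    for name, width in field_widths().items():
        print(f"{name:16s}{width:3d} bits")
```

With the handout's sizes this yields a 43-bit TLB tag and 28 bits of TLB data, matching the slides. The L1 index and block offset together take 13 bits, the same width as the page offset, so the L1 line can be selected from untranslated address bits.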

  3. • L2 is 2^22 bytes (4 MB), direct mapped, with 64 B blocks.
     • Compare the L2 tag to see whether the block is actually in the L2 cache.

     • Reducing hit time
       1. Giving Reads Priority over Writes
          • E.g., Read completes before earlier writes in write buffer
       2. Avoiding Address Translation during Cache Indexing (use page offset)
     • Reducing Miss Penalty
       3. Multilevel Caches (avoid larger vs faster)
     • Reducing Miss Rate
       4. Larger Block size (Compulsory misses)
       5. Larger Cache size (Capacity misses)
       6. Higher Associativity (Conflict misses)
     Do these always improve performance?

     • Reducing hit time
       1. Small and simple caches
       2. Way prediction
       3. Trace caches
     • Increasing cache bandwidth
       4. Pipelined caches
       5. Multibanked caches
       6. Nonblocking caches
     • Reducing Miss Penalty
       7. Critical word first
       8. Merging write buffers
     • Reducing Miss Rate
       9. Compiler optimizations
     • Reducing miss penalty or miss rate via parallelism
       10. Hardware prefetching
       11. Compiler prefetching

     • Indexing the tag memory and then comparing takes time.
     • A small cache can help hit time, since a smaller memory takes less time to index.
       ◦ E.g., L1 caches same size for 3 generations of AMD microprocessors: K6, Athlon, and Opteron
       ◦ Also, an L2 cache small enough to fit on chip with the processor avoids the time penalty of going off chip
     • Simple ⇒ direct mapping
       ◦ Can overlap tag check with data transmission since no choice
     • Access time estimate for 90 nm using CACTI model 4.0
       ◦ Median ratios of access time relative to the direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches
     [Figure: access time (ns), 0 to 2.50, versus cache size (16 KB to 1 MB) for 1-way, 2-way, 4-way, and 8-way caches.]

     • Assume the 2-way hit time is 1.1x faster than the 4-way hit time.
     • Miss rates will be .049 for 2-way and .044 for 4-way (from C.8).
     • A hit is 1 clock cycle; the miss penalty is 10 clocks (to go to L2, where it hits).
     • Avg Mem Access = Hit time + Miss Rate × Miss penalty
     • 2-way: Avg Mem Access = 1 + .049*10 = 1.49
     • 4-way: Avg Mem Access = 1.1 + .044*9 = 1.50
       ◦ Elapsed time should be about the same: 9*1.1 = 9.9 ~ 10
       ◦ This means the clock would be slower, though, so everything else is slower.
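The 2-way versus 4-way comparison above can be checked with a few lines of code. This is a minimal sketch using only the numbers on the slide; the function name `amat` is an illustrative choice, and the 4-way miss penalty of 9 clocks is the slide's rounded figure, taken as given.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Numbers as given on the slide.
two_way = amat(hit_time=1.0, miss_rate=0.049, miss_penalty=10)
four_way = amat(hit_time=1.1, miss_rate=0.044, miss_penalty=9)

print(f"2-way AMAT: {two_way:.2f}")   # prints 1.49
print(f"4-way AMAT: {four_way:.2f}")  # prints 1.50
```

The two results are nearly equal, so the lower miss rate of the 4-way cache is cancelled by its slower hit, which is the point the slide makes.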

  4. • How to combine the fast hit time of Direct Mapped and have the lower conflict misses of a 2-way SA cache?
     • Way prediction: keep extra bits in cache to predict the “way,” or block within the set, of next cache access.
       ◦ Multiplexor is set early to select desired block, only 1 tag comparison performed that clock cycle in parallel with reading the cache data
       ◦ Miss ⇒ first check other blocks for matches in next clock cycle
     [Figure: timing diagram comparing Hit Time, Way-Miss Hit Time, and Miss Penalty.]
     • Accuracy ≈ 85% (seen 97.9%)
     • Drawback: CPU pipeline is harder if variable hit times
       ◦ Used for instruction caches (speculative) vs. data caches

     • Find more instruction level parallelism? How to avoid translation from x86 to micro-ops?
     • Trace cache in Pentium 4
       1. Dynamic traces of the executed instructions vs. static sequences of instructions as determined by layout in memory
          ◦ Built-in branch predictor
       2. Cache the micro-ops vs. x86 instructions
          ◦ Decode/translate from x86 to micro-ops on trace cache miss
     • + 1. ⇒ better utilize long blocks (don’t exit in middle of block, don’t enter at label in middle of block)
     • - 1. ⇒ complicated address mapping since addresses no longer aligned to power-of-2 multiples of word size
     • - 1. ⇒ instructions may appear in multiple dynamic traces due to different branch outcomes, decreasing cache space usage efficiency

     • Pipeline cache access
       ◦ Allows higher clock
       ◦ Gives higher bandwidth
       ◦ But multiple clocks for a hit => higher latency
     • Example: cycles to access instruction cache
       ◦ 1: Pentium
       ◦ 2: Pentium Pro through Pentium III
       ◦ 4: Pentium 4
     • => greater penalty on mispredicted branches
     • => more cycles between load issue & data use
     • + Easier to have higher associativity

     • Non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
       ◦ requires F/E bits on registers or out-of-order execution
       ◦ requires multi-bank memories
     • “hit under miss” reduces the effective miss penalty by working during a miss vs. ignoring CPU requests
     • “hit under multiple miss” or “miss under miss” may further lower the effective miss penalty by overlapping multiple misses
       ◦ Significantly increases the complexity of the cache controller as there can be multiple outstanding memory accesses
       ◦ Requires multiple memory banks (otherwise cannot support)
       ◦ Pentium Pro allows 4 outstanding memory misses

     [Figure: “Hit Under i Misses” (“Hit under n Misses”), bar chart on a scale of 0 to 2 with series 0->1, 1->2, 2->64, and Base, for the Integer and Floating Point benchmarks espresso, doduc, eqntott, xlisp, compress, ear, fpppp, tomcatv, su2cor, wave5, hydro2d, nasa7, spice2g6, ora, mdljsp2, swm256, mdljdp2, alvinn.]
     • FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
     • Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
     • 8 KB Data Cache, Direct Mapped, 32 B block, 16 cycle miss, SPEC 92
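For the way-prediction slide on this handout page, the effect of prediction accuracy on average hit time can be estimated with a weighted average. This is a minimal sketch under stated assumptions: a correctly predicted hit is assumed to take 1 cycle and a way-miss hit 2 cycles (the slide says only that the other blocks are checked in the next clock cycle), and the function name `way_predicted_hit_time` is hypothetical.

```python
def way_predicted_hit_time(accuracy, predicted_cycles=1.0, mispredicted_cycles=2.0):
    """Expected hit time when a fraction `accuracy` of hits are in the predicted way.

    The 1-cycle and 2-cycle defaults are assumptions for illustration,
    not figures taken from the slides.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return accuracy * predicted_cycles + (1.0 - accuracy) * mispredicted_cycles


# The two accuracy figures quoted on the slide.
for accuracy in (0.85, 0.979):
    cycles = way_predicted_hit_time(accuracy)
    print(f"accuracy {accuracy:.1%}: {cycles:.3f} cycles per hit")
```

Under these assumptions the average hit costs about 1.15 cycles at 85% accuracy and about 1.02 cycles at 97.9%. The variable latency behind that average is what the slide flags as the difficulty for the CPU pipeline.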
