

  1. COSC 5351 Advanced Computer Architecture Slides modified from Hennessy CS252 course slides

  2. Outline: 11 Advanced Cache Optimizations; Memory Technology and DRAM Optimizations; Virtual Machines; Xen VM: Design and Performance; AMD Opteron Memory Hierarchy; Opteron Memory Performance vs. Pentium 4; Fallacies and Pitfalls; Conclusion. COSC5351 Advanced Computer Architecture, 10/26/2011

  3. [Figure: processor vs. memory performance, 1980 to 2010, log scale from 1 to 100,000: the processor-memory performance gap keeps growing.]

  4. How does a memory hierarchy improve performance? What costs are associated with a memory access?

  5. [Figure: memory-hierarchy diagram walked through in the following slides.]

  6. The virtual address space is 2^64, or 16 EB.

  7. Physical memory is 2^41, or 2 TB.

  8. The page size is 2^13, or 8 KB.

  9. The L1 is 2^13 B (8 KB), direct mapped, with 64 B blocks.

  10. The TLB has 2^8 entries, direct mapped in this case (often it is fully associative). Compare the 43-bit tag with the tag in the appropriate TLB slot.

  11. If the translation is in the TLB, check the L1 cache tag in the appropriate line to see if the block is in L1.

  12. If not in L1, build the physical address from the 28-bit TLB data plus the page offset, and use it to access the L2 cache.

  13. The L2 is 2^22 B (4 MB), direct mapped, with 64 B blocks.

  14. Compare the L2 tag to see if the block is actually in the L2 cache.
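A minimal Python sketch of the address-field arithmetic behind slides 6-14. All sizes come from the slides; every field width below is derived from them, and the helper name `physical_address` is my own.

```python
import math

# Parameters from the slides: 2^64 B virtual addresses, 2^41 B physical
# memory, 8 KB pages, 8 KB direct-mapped L1, 4 MB direct-mapped L2,
# 64 B blocks, 2^8-entry direct-mapped TLB.
VA_BITS, PA_BITS = 64, 41
PAGE, L1_SIZE, L2_SIZE, BLOCK = 2**13, 2**13, 2**22, 64
TLB_ENTRIES = 2**8

page_offset_bits = int(math.log2(PAGE))           # 13
vpn_bits = VA_BITS - page_offset_bits             # 51-bit virtual page number
tlb_index_bits = int(math.log2(TLB_ENTRIES))      # 8
tlb_tag_bits = vpn_bits - tlb_index_bits          # 43: the tag compared in the TLB
ppn_bits = PA_BITS - page_offset_bits             # 28: the "TLB data"

block_offset_bits = int(math.log2(BLOCK))         # 6
l1_index_bits = int(math.log2(L1_SIZE // BLOCK))  # 7 (128 L1 lines)
l2_index_bits = int(math.log2(L2_SIZE // BLOCK))  # 16 (64K L2 lines)

def physical_address(ppn, page_offset):
    """Build the PA: 28-bit PPN from the TLB concatenated with the 13-bit offset."""
    return (ppn << page_offset_bits) | page_offset
```

Note that the L1 index plus block offset is 7 + 6 = 13 bits, exactly the page offset, which is why the L1 can be indexed before translation completes (slide 15's "avoiding address translation" point).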

  15. Reducing hit time:
     1. Giving reads priority over writes (e.g., a read completes before earlier writes still sitting in the write buffer).
     2. Avoiding address translation during cache indexing (index with the page offset).
     Reducing miss penalty:
     3. Multilevel caches (sidestep the larger-vs.-faster tradeoff).
     Reducing miss rate:
     4. Larger block size (compulsory misses).
     5. Larger cache size (capacity misses).
     6. Higher associativity (conflict misses).
     Do these always improve performance?

  16. Reducing hit time:
     1. Small and simple caches
     2. Way prediction
     3. Trace caches
     Increasing cache bandwidth:
     4. Pipelined caches
     5. Multibanked caches
     6. Nonblocking caches
     Reducing miss penalty:
     7. Critical word first
     8. Merging write buffers
     Reducing miss rate:
     9. Compiler optimizations
     Reducing miss penalty or miss rate via parallelism:
     10. Hardware prefetching
     11. Compiler prefetching

  17. Small and simple caches: indexing the tag memory and then comparing tags takes time.
     A small cache helps hit time, since a smaller memory takes less time to index.
     E.g., the L1 caches stayed the same size across three generations of AMD microprocessors: K6, Athlon, and Opteron. Also, an L2 cache small enough to fit on chip with the processor avoids the time penalty of going off chip.
     Simple means direct mapped: the tag check can overlap data transmission, since there is no way to select.
     Access time estimates for 90 nm using the CACTI 4.0 model: median access-time ratios relative to direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches.
     [Figure: access time (ns) vs. cache size, 16 KB to 1 MB, for 1-, 2-, 4-, and 8-way caches.]

  18. Example: assume the 2-way hit time is 1.1x faster than the 4-way, so the 4-way clock is 1.1x slower.
     Miss rates are 0.049 (2-way) and 0.044 (4-way), from Figure C.8. A hit takes 1 clock cycle; the miss penalty is 10 clocks (go to L2 and hit there).
     Avg memory access time = hit time + miss rate x miss penalty.
     2-way: AMAT = 1 + 0.049 * 10 = 1.49.
     4-way: the 10-clock penalty is about 9 of its slower cycles (9 * 1.1 = 9.9, roughly 10), so AMAT = 1.1 + 0.044 * 9 = 1.50.
     Elapsed time per access is about the same, but the 4-way's slower clock makes everything else slower too.
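The slide's arithmetic can be checked directly. Units are 2-way clock cycles: the 4-way's hit time is 1.1 cycles because its clock is 1.1x slower, and its 10-clock miss penalty rounds to about 9 of its own slower cycles.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

two_way  = amat(1.0, 0.049, 10)  # about 1.49
four_way = amat(1.1, 0.044, 9)   # about 1.50
```

The near-identical AMATs are the point of the example: the associativity win is eaten by the slower clock, and the slower clock also penalizes every non-memory instruction.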

  19. How to combine the fast hit time of direct mapped with the lower conflict misses of a 2-way set-associative cache?
     Way prediction: keep extra bits in the cache to predict the "way" (block within the set) of the next cache access.
     The multiplexor is set early to select the desired block, so only 1 tag comparison is performed that clock cycle, in parallel with reading the cache data.
     On a way mispredict, check the other blocks for matches in the next clock cycle, so a way-miss costs extra hit time rather than a full miss penalty.
     Prediction accuracy is roughly 85% (97.9% has been observed).
     Drawback: variable hit times make the CPU pipeline harder, so way prediction is used for instruction caches (speculatively) rather than data caches.
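A small model, not from the slide, of the expected hit time under way prediction: a correct prediction hits in one cycle, and a way-miss pays one extra cycle to check the other way. The one-cycle costs are my assumption.

```python
def avg_hit_time(accuracy, fast=1.0, extra=1.0):
    """Expected hit time: predicted-way hits cost `fast` cycles,
    way-misses cost one extra cycle to recheck the other way(s)."""
    return accuracy * fast + (1.0 - accuracy) * (fast + extra)

avg_hit_time(0.85)   # 1.15 cycles at the 85% accuracy the slide quotes
avg_hit_time(0.979)  # about 1.02 cycles at the observed 97.9%
```

This is the "variable hit time" the drawback refers to: the average is good, but the pipeline must tolerate the occasional 2-cycle hit.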

  20. How to find more instruction-level parallelism? How to avoid translating from x86 to micro-ops on every fetch? The trace cache in the Pentium 4:
     1. Caches dynamic traces of the executed instructions, rather than static sequences as determined by layout in memory (with a built-in branch predictor).
     2. Caches the micro-ops rather than x86 instructions; decode/translation from x86 to micro-ops happens only on a trace cache miss.
     + Better utilization of long blocks (don't exit in the middle of a block, don't enter at a label in the middle of a block).
     - Complicated address mapping, since addresses are no longer aligned to power-of-2 multiples of the word size.
     - Instructions may appear in multiple dynamic traces due to different branch outcomes, decreasing cache-space efficiency.

  21. Pipelined cache access:
     Allows a higher clock rate and gives higher bandwidth, but a hit takes multiple clocks, so latency is higher.
     Example, cycles to access the instruction cache: 1 on the Pentium; 2 on the Pentium Pro through Pentium III; 4 on the Pentium 4.
     - Greater penalty on mispredicted branches, and more cycles between load issue and data use.
     + Easier to have higher associativity.

  22. A non-blocking (lockup-free) cache allows the data cache to continue to supply hits during a miss; this requires full/empty bits on registers or out-of-order execution, and multi-bank memories.
     "Hit under miss" reduces the effective miss penalty by doing useful work during a miss instead of ignoring CPU requests.
     "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses. This significantly increases the complexity of the cache controller, since there can be multiple outstanding memory accesses, and requires multiple memory banks (otherwise the overlap cannot be supported).
     The Pentium Pro allows 4 outstanding memory misses.

  23. [Figure: "Hit under i misses": AMAT for SPEC92 benchmarks (integer: espresso, eqntott, xlisp, compress; floating point: doduc, ear, nasa7, ora, wave5, fpppp, su2cor, hydro2d, spice2g6, tomcatv, alvinn, mdljsp2, swm256, mdljdp2) for 0->1, 1->2, and 2->64 outstanding misses vs. the blocking base.]
     FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26.
     Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19.
     8 KB data cache, direct mapped, 32 B blocks, 16-cycle miss penalty, SPEC 92.


  25. The previous study is old and used a small cache. A newer study by Li, Chen, Brockman, and Jouppi (2011) uses a 32 KB L1 (4-cycle access latency), a 256 KB L2 (10-cycle), and a 2 MB L3 (36-cycle).

  26. Rather than treating the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses.
     E.g., the T1 ("Niagara") and ARM Cortex-A8 have 4 L2 banks; the Intel Core i7 has four L1 banks and eight L2 banks.
     Banking works best when accesses are spread across the banks, so the mapping of addresses to banks affects the behavior of the memory system.
     A simple mapping that works well is sequential interleaving: spread block addresses sequentially across banks. E.g., with 4 banks, bank 0 holds all blocks whose address modulo 4 is 0, bank 1 all blocks whose address modulo 4 is 1, and so on.
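Sequential interleaving as described above can be sketched in a couple of lines: the block address modulo the bank count selects the bank, so consecutive blocks land in consecutive banks.

```python
NUM_BANKS = 4  # as in the slide's 4-bank example

def bank_of(block_address):
    """Sequential interleaving: block address mod bank count picks the bank."""
    return block_address % NUM_BANKS

banks = [bank_of(b) for b in range(8)]  # [0, 1, 2, 3, 0, 1, 2, 3]
```

A sequential sweep through memory touches every bank in turn, which is exactly the access pattern that lets all banks work in parallel.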
