
Cache Systems: Memory Hierarchy Design (Chapter 5 and Appendix C)



  1. Cache Systems (Memory Hierarchy Design, Chapter 5 and Appendix C)
     • Without a cache: CPU (400 MHz) talks to main memory (10 MHz) over a 66 MHz bus
     • With a cache: CPU - Cache - Main Memory (10 MHz), 66 MHz bus
     • Data objects are transferred between the CPU and the cache; blocks are transferred between the cache and main memory

     Example: Two-Level Hierarchy
     [Figure: access time falls from T1 + T2 toward T1 as the hit ratio goes from 0 to 1]

     Overview
     • Problem: CPU vs memory performance imbalance
     • Solution: memory hierarchies, driven by temporal and spatial locality
       – Fast L1, L2, L3 caches
       – Larger but slower main memories
       – Even larger but even slower secondary storage
       – Keep most of the action in the higher levels

     Locality of Reference
     • Temporal and spatial locality
     • Sequential access to memory
     • Unit-stride loop (cache lines = 256 bits):
       for (i = 1; i < 100000; i++)
         sum = sum + a[i];
     • Non-unit-stride loop (cache lines = 256 bits):
       for (i = 0; i <= 100000; i = i+8)
         sum = sum + a[i];
       (a C sketch contrasting the two patterns follows this page)

     Basic Cache Read Operation
     • CPU requests the contents of a memory location
     • Check the cache for this data
     • If present, get it from the cache (fast)
     • If not present, read the required block from main memory into the cache
     • Then deliver it from the cache to the CPU
     • The cache includes tags to identify which block of main memory is in each cache slot
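The two loops on the Locality of Reference slide can be made concrete with a small, self-contained C sketch. It is illustrative only: the int array type, the exact bounds, and the use of clock() for rough timing are assumptions, not part of the slides. With 256-bit (32-byte) lines and 4-byte ints, the unit-stride loop reuses each line eight times, while the stride-8 loop touches a new line on every access.

#include <stdio.h>
#include <time.h>

#define N 100000

int a[N];   /* 4-byte ints: eight per 32-byte (256-bit) cache line */

int main(void) {
    long sum;
    clock_t t0, t1, t2;

    for (int i = 0; i < N; i++) a[i] = 1;

    /* Unit-stride loop: consecutive accesses fall in the same cache line,
       so spatial locality turns most of them into hits. */
    t0 = clock();
    sum = 0;
    for (int i = 1; i < N; i++)
        sum = sum + a[i];
    t1 = clock();
    printf("unit stride: sum=%ld  %.3f ms\n",
           sum, 1e3 * (double)(t1 - t0) / CLOCKS_PER_SEC);

    /* Stride-8 loop: with 32-byte lines and 4-byte ints, every access
       lands in a different line, so spatial locality is lost. */
    sum = 0;
    for (int i = 0; i < N; i = i + 8)
        sum = sum + a[i];
    t2 = clock();
    printf("stride of 8: sum=%ld  %.3f ms\n",
           sum, 1e3 * (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}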

  2. Number of Caches
     • Increased logic density => on-chip cache
       – Internal cache: level 1 (L1)
       – External cache: level 2 (L2)
     • Unified cache
       – Balances the load between instruction and data fetches
       – Only one cache needs to be designed / implemented
     • Split caches (data and instruction)
       – Suited to pipelined, parallel architectures

     Elements of Cache Design
     • Cache size
     • Line (block) size
     • Number of caches
     • Mapping function
       – Block placement
       – Block identification
     • Replacement algorithm
     • Write policy

     Cache Size
     • Cache size << main memory size
     • Small enough
       – Minimize cost
       – Speed up access (fewer gates to address the cache)
       – Keep the cache on chip
     • Large enough
       – Minimize average access time
     • Optimum size depends on the workload
     • Practical size?

     Mapping Function
     • Cache lines << main memory blocks
     • Direct mapping
       – Maps each block into only one possible line
       – (block address) MOD (number of lines)
     • Fully associative
       – A block can be placed anywhere in the cache
     • Set associative
       – A block can be placed in a restricted set of lines
       – (block address) MOD (number of sets in the cache)

     Line Size
     • Optimum size depends on the workload
     • Small blocks do not exploit the locality-of-reference principle
     • Larger blocks reduce the number of blocks
       – Replacement overhead
     • Practical sizes?

     Cache Addressing
     • Block address = Tag | Index, followed by the block offset
       – Block offset: selects the data object within the block
       – Index: selects the block set
       – Tag: used to detect a hit
       (a C sketch of this address split follows this page)
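To make the Cache Addressing slide concrete, here is a minimal C sketch that splits a byte address into tag, index, and block offset. The geometry (32-byte lines, 256 lines, direct mapped) is an assumption chosen only for illustration.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical geometry: 256 lines of 32 bytes in a direct-mapped cache. */
#define LINE_BYTES 32u    /* block (line) size    -> 5 offset bits */
#define NUM_LINES  256u   /* number of lines/sets -> 8 index bits  */

/* Split a byte address exactly as the Cache Addressing slide describes:
   the offset selects the data object within the block, the index selects
   the set (here a single line), and the tag is compared to detect a hit. */
static void split_address(uint32_t addr,
                          uint32_t *tag, uint32_t *index, uint32_t *offset) {
    *offset = addr % LINE_BYTES;                 /* low 5 bits               */
    *index  = (addr / LINE_BYTES) % NUM_LINES;   /* (block address) MOD lines */
    *tag    = addr / (LINE_BYTES * NUM_LINES);   /* remaining high bits       */
}

int main(void) {
    uint32_t tag, index, offset;
    split_address(0x12345678u, &tag, &index, &offset);
    printf("tag=0x%x index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}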

  3. Direct Mapping
     [Figure]

     Replacement Algorithm
     • Simple for direct-mapped caches: no choice
     • Random
       – Simple to build in hardware
     • LRU (see the C sketch after this page)
     • Miss rates by replacement policy and associativity:

       Size      Two-way           Four-way          Eight-way
                 LRU     Random    LRU     Random    LRU     Random
       16 KB     5.18%   5.69%     4.67%   5.29%     4.39%   4.96%
       64 KB     1.88%   2.01%     1.54%   1.66%     1.39%   1.53%
       256 KB    1.15%   1.17%     1.13%   1.13%     1.12%   1.12%

     Associative Mapping
     [Figure]

     Write Policy
     • Writes are more complex than reads
       – The write and the tag comparison cannot proceed simultaneously
       – Only a portion of the line has to be updated
     • Write policies
       – Write through: write to both the cache and memory
       – Write back: write only to the cache (dirty bit)
     • Write miss
       – Write allocate: load the block on a write miss
       – No-write allocate: update directly in memory

     K-Way Set Associative Mapping
     [Figure]

     Alpha AXP 21064 Cache
     [Figure: CPU address split into tag <21>, index <8>, offset <5>; valid bit, tag, and 256-bit data per line; tag comparator (=?); write buffer; lower-level memory]
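The LRU policy from the Replacement Algorithm slide can be sketched for one set of a hypothetical 4-way set-associative cache. The age-counter scheme below is one simple way to model true LRU in software; real hardware typically uses cheaper approximations.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WAYS 4   /* hypothetical 4-way set-associative set */

/* One set: age[i] counts how long ago way i was used; the oldest way is evicted. */
struct set {
    uint32_t tag[WAYS];
    bool     valid[WAYS];
    unsigned age[WAYS];
};

/* Returns true on a hit; on a miss, an empty way or the LRU way is replaced. */
static bool access_set(struct set *s, uint32_t tag) {
    for (int i = 0; i < WAYS; i++) s->age[i]++;      /* every way ages one step */

    for (int i = 0; i < WAYS; i++) {                 /* 1. look for a hit */
        if (s->valid[i] && s->tag[i] == tag) {
            s->age[i] = 0;                           /* mark most recently used */
            return true;
        }
    }

    int victim = 0;                                  /* 2. miss: pick empty or LRU way */
    for (int i = 1; i < WAYS; i++)
        if (!s->valid[i] || (s->valid[victim] && s->age[i] > s->age[victim]))
            victim = i;

    s->tag[victim]   = tag;
    s->valid[victim] = true;
    s->age[victim]   = 0;
    return false;
}

int main(void) {
    struct set s = {0};
    uint32_t trace[] = {1, 2, 3, 4, 1, 5, 2};        /* tag reference stream */
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
        printf("tag %u -> %s\n", (unsigned)trace[i],
               access_set(&s, trace[i]) ? "hit" : "miss");
    return 0;
}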

  4. Write Merging
     [Figure: four sequential writes to addresses 100, 104, 108, and 112 each occupy a separate write-buffer entry with one valid word; after merging they share a single entry at address 100 with all four valid bits set]

     Cache Performance Improvements
     • Average memory-access time = Hit time + Miss rate x Miss penalty
     • Cache optimizations
       – Reducing the miss rate
       – Reducing the miss penalty
       – Reducing the hit time

     DECstation 5000 Miss Rates
     [Figure: miss rate (%) versus cache size (1 KB to 128 KB) for the instruction cache, data cache, and unified cache]
     • Direct-mapped cache with 32-byte blocks
     • Percentage of instruction references is 75%

     Example
     • Which has the lower average memory access time:
       – a 16-KB instruction cache plus a 16-KB data cache, or
       – a 32-KB unified cache?
     • Hit time = 1 cycle; miss penalty = 50 cycles
     • Load/store hit = 2 cycles on the unified cache
     • Given: 75% of memory accesses are instruction references
     • Overall miss rate for the split caches = 0.75 * 0.64% + 0.25 * 6.47% = 2.10%
     • Miss rate for the unified cache = 1.99%
     • Average memory access times:
       – Split = 0.75 * (1 + 0.0064 * 50) + 0.25 * (1 + 0.0647 * 50) = 2.05
       – Unified = 0.75 * (1 + 0.0199 * 50) + 0.25 * (2 + 0.0199 * 50) = 2.24
       (a C sketch of this calculation follows this page)

     Cache Performance Measures
     • Hit rate: fraction of accesses found in that level
       – Usually so high that we talk about the miss rate instead
       – Miss rate can be as misleading a metric as MIPS is for CPU performance
     • Average memory-access time = Hit time + Miss rate x Miss penalty (ns)
     • Miss penalty: time to replace a block from the lower level, including the time to deliver it to the CPU
       – Access time to the lower level = f(latency to the lower level)
       – Transfer time: time to transfer the block = f(bandwidth)

     Cache Performance Equations
     • CPU time = (CPU execution cycles + Memory stall cycles) * Cycle time
     • Memory stall cycles = Memory accesses * Miss rate * Miss penalty
     • CPU time = IC * (CPI_execution + Memory accesses per instruction * Miss rate * Miss penalty) * Cycle time
     • Misses per instruction = Memory accesses per instruction * Miss rate
     • CPU time = IC * (CPI_execution + Misses per instruction * Miss penalty) * Cycle time
       (a worked C sketch of these equations follows below)
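A minimal C sketch of the split-versus-unified comparison from the Example slide, using the miss rates quoted there (0.64% instruction, 6.47% data, 1.99% unified) and AMAT = hit time + miss rate x miss penalty.

#include <stdio.h>

int main(void) {
    const double instr_frac   = 0.75;   /* 75% of accesses are instruction fetches */
    const double miss_penalty = 50.0;   /* cycles */

    /* Split 16-KB instruction + 16-KB data caches, 1-cycle hit. */
    double amat_split = instr_frac * (1.0 + 0.0064 * miss_penalty)
                      + (1.0 - instr_frac) * (1.0 + 0.0647 * miss_penalty);

    /* 32-KB unified cache: a load/store hit takes 2 cycles because the
       single cache port is also busy with the instruction fetch. */
    double amat_unified = instr_frac * (1.0 + 0.0199 * miss_penalty)
                        + (1.0 - instr_frac) * (2.0 + 0.0199 * miss_penalty);

    printf("split:   %.2f cycles\n", amat_split);    /* ~2.05 */
    printf("unified: %.2f cycles\n", amat_unified);  /* ~2.24 */
    return 0;
}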

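The CPU-time equations on the last slide of this page can likewise be exercised numerically. The inputs below (instruction count, base CPI, accesses per instruction, miss rate, penalty, cycle time) are assumed values chosen only to illustrate the algebra.

#include <stdio.h>

int main(void) {
    /* Assumed workload and machine parameters (not from the slides). */
    double ic            = 1e9;    /* instruction count                */
    double cpi_execution = 1.0;    /* CPI ignoring memory stalls       */
    double mem_per_instr = 1.5;    /* memory accesses per instruction  */
    double miss_rate     = 0.02;   /* 2% miss rate                     */
    double miss_penalty  = 50.0;   /* cycles per miss                  */
    double cycle_time    = 1e-9;   /* 1 ns clock                       */

    /* Misses per instruction = memory accesses per instruction * miss rate. */
    double misses_per_instr = mem_per_instr * miss_rate;

    /* CPU time = IC * (CPI_execution + misses per instr * miss penalty) * cycle time. */
    double cpi_total = cpi_execution + misses_per_instr * miss_penalty;
    double cpu_time  = ic * cpi_total * cycle_time;

    printf("misses per instruction = %.3f\n", misses_per_instr);  /* 0.030   */
    printf("effective CPI          = %.2f\n", cpi_total);         /* 2.50    */
    printf("CPU time               = %.3f s\n", cpu_time);        /* 2.500 s */
    return 0;
}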
  5. Reducing Miss Penalty
     • Multi-level caches
     • Critical word first and early restart
     • Priority to read misses over writes
     • Merging write buffers
     • Victim caches

     Critical Word First and Early Restart
     • Critical word first: request the missed word first from memory
     • Early restart: fetch in normal order, but as soon as the requested word arrives, send it to the CPU

     Multi-Level Caches
     • Avg memory access time = Hit time(L1) + Miss rate(L1) x Miss penalty(L1)
     • Miss penalty(L1) = Hit time(L2) + Miss rate(L2) x Miss penalty(L2)
     • Avg memory access time = Hit time(L1) + Miss rate(L1) x (Hit time(L2) + Miss rate(L2) x Miss penalty(L2))
     • Local miss rate: number of misses in a cache divided by the total number of accesses to that cache
     • Global miss rate: number of misses in a cache divided by the total number of memory accesses generated by the CPU
       (a C sketch of the two-level formula follows this page)

     Giving Priority to Read Misses over Writes
     • SW R3, 512(R0)
       LW R1, 1024(R0)
       LW R2, 512(R0)
     • Direct-mapped, write-through cache mapping 512 and 1024 to the same block, and a four-word write buffer
     • Will R2 = R3?
     • Priority for the read miss?

     Performance of Multi-Level Caches
     [Figure]

     Victim Caches
     [Figure]
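A small C sketch of the two-level formula from the Multi-Level Caches slide. The L1/L2 hit times, miss rates, and memory penalty used here are assumed values, not taken from the slides.

#include <stdio.h>

/* AMAT = HitTime(L1) + MissRate(L1) * (HitTime(L2) + MissRate(L2) * MissPenalty(L2)),
   where both miss rates are local miss rates. */
static double amat_two_level(double hit_l1, double local_miss_l1,
                             double hit_l2, double local_miss_l2,
                             double penalty_l2) {
    double miss_penalty_l1 = hit_l2 + local_miss_l2 * penalty_l2;
    return hit_l1 + local_miss_l1 * miss_penalty_l1;
}

int main(void) {
    /* Assumed numbers: 1-cycle L1, 4% L1 miss rate, 10-cycle L2,
       50% L2 local miss rate, 100-cycle memory penalty. */
    double amat = amat_two_level(1.0, 0.04, 10.0, 0.50, 100.0);

    /* Global L2 miss rate = local L2 miss rate * L1 miss rate. */
    double global_l2 = 0.50 * 0.04;

    printf("AMAT           = %.2f cycles\n", amat);        /* 1 + 0.04*(10+50) = 3.40 */
    printf("global L2 miss = %.2f%%\n", global_l2 * 100);   /* 2.00%                  */
    return 0;
}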

  6. Types of Cache Misses
     • Compulsory
       – First-reference or cold-start misses
     • Capacity
       – The working set is too big for the cache
       – Occur even in fully associative caches
     • Conflict (collision)
       – Many blocks map to the same block frame (line)
       – Affects set-associative and direct-mapped caches

     Reducing Miss Rates: 1. Larger Block Size
     • Effects of larger block sizes
       – Reduction of compulsory misses (spatial locality)
       – Increase of the miss penalty (transfer time)
       – Reduction of the number of blocks: potential increase of conflict misses
     • Latency and bandwidth of the lower-level memory
       – High latency and high bandwidth => large block size, with only a small increase in miss penalty
       (a miss-counting C sketch follows this page)

     Miss Rates: Absolute and Distribution
     [Figure]

     Example
     [Figure]

     Reducing the Miss Rates
     1. Larger block size
     2. Larger caches
     3. Higher associativity
     4. Pseudo-associative caches
     5. Compiler optimizations

     2. Larger Caches
     • More blocks
     • Higher probability of finding the data
     • Longer hit time and higher cost
     • Primarily used in 2nd-level caches
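The effect of block size on compulsory misses, discussed under Larger Block Size, can be explored with a tiny direct-mapped miss counter. The 1-KB cache, the block sizes, and the sequential address trace are assumptions for illustration only; on this trace, doubling the block size halves the miss count because each fetched block satisfies twice as many later references.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CACHE_BYTES 1024u   /* assumed direct-mapped cache capacity */

/* Count misses for a direct-mapped cache with the given block size. */
static unsigned count_misses(const uint32_t *trace, unsigned n, unsigned block_bytes) {
    unsigned num_lines = CACHE_BYTES / block_bytes;
    uint32_t tags[CACHE_BYTES];      /* oversized: big enough for any block size >= 1 */
    uint8_t  valid[CACHE_BYTES];
    memset(valid, 0, sizeof valid);

    unsigned misses = 0;
    for (unsigned i = 0; i < n; i++) {
        uint32_t block = trace[i] / block_bytes;
        uint32_t index = block % num_lines;       /* (block address) MOD (number of lines) */
        uint32_t tag   = block / num_lines;
        if (!valid[index] || tags[index] != tag) {  /* compulsory, capacity, or conflict miss */
            valid[index] = 1;
            tags[index]  = tag;
            misses++;
        }
    }
    return misses;
}

int main(void) {
    /* Sequential walk over 4 KB of byte addresses: every miss here is a
       compulsory miss, so larger blocks directly reduce the miss count. */
    enum { N = 4096 };
    static uint32_t trace[N];
    for (unsigned i = 0; i < N; i++) trace[i] = i;

    for (unsigned block = 16; block <= 128; block *= 2)
        printf("block %3u bytes: %u misses\n", block, count_misses(trace, N, block));
    return 0;
}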
