chapter seven

Chapter Seven © 2004 Morgan Kaufmann Publishers Memories: Review - PowerPoint PPT Presentation



  1. Chapter Seven © 2004 Morgan Kaufmann Publishers

  2. Memories: Review
     • SRAM:
       – value is stored on a pair of inverting gates
       – very fast, but takes up more space than DRAM (4 to 6 transistors)
     • DRAM:
       – value is stored as a charge on a capacitor (must be refreshed)
       – very small, but slower than SRAM (by a factor of 5 to 10)
     [Figure: DRAM cell showing the word line, pass transistor, capacitor, and bit line]

  3. Exploiting Memory Hierarchy
     • Users want large and fast memories!
       – SRAM access times are 0.5 to 5 ns, at a cost of $4000 to $10,000 per GB.
       – DRAM access times are 50 to 70 ns, at a cost of $100 to $200 per GB.
       – Disk access times are 5 to 20 million ns, at a cost of $0.50 to $2 per GB.
     • Try to give it to them anyway
       – build a memory hierarchy

  4. Locality
     • A principle that makes having a memory hierarchy a good idea
     • If an item is referenced:
       – temporal locality: it will tend to be referenced again soon
       – spatial locality: nearby items will tend to be referenced soon
       Why does code have locality?
     • Our initial focus: two levels (upper, lower)
       – block: minimum unit of data transferred between adjacent levels
       – hit: data requested is in the upper level
       – miss: data requested is not in the upper level
       – miss penalty: the time required to fetch a block into a level of the memory hierarchy from the lower level

  5. Cache
     • Two issues:
       – How do we know if a data item is in the cache?
       – If it is, how do we find it?
     • Our first example:
       – block size is one word of data
       – "direct mapped"
     For each item of data at the lower level, there is exactly one location in the cache where it might be; that is, many items at the lower level share each location in the upper level.

  6. Direct Mapped Cache
     • Mapping: cache index = (block address) modulo (number of blocks in the cache)
     [Figure: an 8-block cache (indices 000 to 111) and memory addresses 00001, 01001, 10001, 11001, which map to cache block 001, and 00101, 01101, 10101, 11101, which map to cache block 101]
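The modulo mapping can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `cache_index` is hypothetical, and the addresses are the eight memory addresses shown in the slide's figure for an 8-block cache.

```python
def cache_index(block_address: int, num_blocks: int) -> int:
    """Return the direct-mapped cache index for a block address."""
    return block_address % num_blocks


NUM_BLOCKS = 8  # 8-block cache, as in the slide's figure

# Memory addresses taken from the figure on the slide.
memory_addresses = [0b00001, 0b00101, 0b01001, 0b01101,
                    0b10001, 0b10101, 0b11001, 0b11101]

for address in memory_addresses:
    index = cache_index(address, NUM_BLOCKS)
    print(f"{address:05b} -> cache block {index:03b}")
```

Because the number of blocks is a power of two, the modulo simply keeps the low-order 3 bits of the address, which is why every address ending in 001 lands in block 001 and every address ending in 101 lands in block 101.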

  7. Direct Mapped Cache
     For MIPS:
     • Tag – contains the address information needed to identify whether the associated block corresponds to a requested word
     • Valid bit – a bit to indicate that the associated block in the hierarchy contains valid data
     Why a "valid" bit?

  8. Direct Mapped Cache - Example

     Memory references (8-block cache, one-word blocks):

     Request   Decimal address   Binary address   Hit or miss   Assigned cache block
     a         22                10110            Miss          10110 mod 8 = 110
     b         26                11010            Miss          11010 mod 8 = 010
     c         18                10010            Miss          10010 mod 8 = 010
     b         26                11010            Miss          11010 mod 8 = 010
     a         22                10110            Hit           10110 mod 8 = 110

     Cache contents after each request (all other indices stay invalid, V = N):

     After request   Index 010 (V, Tag, Data)   Index 110 (V, Tag, Data)
     a (Miss)        N                          Y, 10, a
     b (Miss)        Y, 11, b                   Y, 10, a
     c (Miss)        Y, 10, c                   Y, 10, a
     b (Miss)        Y, 11, b                   Y, 10, a
     a (Hit)         Y, 11, b                   Y, 10, a
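The reference sequence a = 22, b = 26, c = 18, b, a can be replayed with a small simulator. The sketch below is an illustration under the slide's assumptions (8 blocks, one-word blocks); the function name `simulate_direct_mapped` is hypothetical.

```python
def simulate_direct_mapped(references, num_blocks=8):
    """Replay block addresses through a direct-mapped cache with one-word blocks.

    Returns a list of (address, index, tag, outcome) tuples,
    where outcome is "hit" or "miss".
    """
    valid = [False] * num_blocks
    tags = [None] * num_blocks
    trace = []
    for address in references:
        index = address % num_blocks   # low-order bits select the cache entry
        tag = address // num_blocks    # remaining high-order bits identify the block
        if valid[index] and tags[index] == tag:
            outcome = "hit"
        else:
            outcome = "miss"
            valid[index] = True        # fetch the block, replacing any old entry
            tags[index] = tag
        trace.append((address, index, tag, outcome))
    return trace


for address, index, tag, outcome in simulate_direct_mapped([22, 26, 18, 26, 22]):
    print(f"{address:2d} = {address:05b}  index {index:03b}  tag {tag:02b}  {outcome}")
```

The run shows b and c evicting each other from index 010 because they differ only in their tag bits, while a stays in index 110 and hits on its second reference.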

  9. Direct Mapped Cache
     • Increase the block size:
       – e.g., a 16KB cache contains 256 blocks with 16 words per block
       – What kind of locality are we taking advantage of? (spatial locality)
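With multiword blocks, a 32-bit byte address splits into four fields: tag, index, word offset within the block, and byte offset within the word. The following sketch shows that split for the 16KB example (256 blocks gives 8 index bits, 16 words per block gives 4 word-offset bits). The function name `split_address` and the sample address `0x12345678` are hypothetical, chosen only for illustration.

```python
def split_address(address: int, index_bits: int, word_offset_bits: int):
    """Split a 32-bit byte address into (tag, index, word offset, byte offset)."""
    byte_offset = address & 0b11                                   # 4 bytes per word
    word_offset = (address >> 2) & ((1 << word_offset_bits) - 1)   # word within the block
    index = (address >> (2 + word_offset_bits)) & ((1 << index_bits) - 1)
    tag = address >> (2 + word_offset_bits + index_bits)           # remaining high bits
    return tag, index, word_offset, byte_offset


# 256 blocks -> 8 index bits; 16 words per block -> 4 word-offset bits.
# The address below is an arbitrary example value, not from the slides.
tag, index, word, byte = split_address(0x12345678, index_bits=8, word_offset_bits=4)
print(f"tag={tag:#x} index={index} word={word} byte={byte}")
```

Neighbouring words share the same tag and index and differ only in the word offset, so one miss brings in all 16 of them; that is the spatial locality the slide refers to.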

  10. Analysis of Tag Bits and Index Bits
      • Assume a 32-bit byte address and a direct-mapped cache of $2^n$ blocks with $2^m$-word ($2^{m+2}$-byte) blocks
        – Tag field: $32 - (n + m + 2)$ bits
        – Cache size: $2^n \times (2^m \times 32 + (32 - n - m - 2) + 1)$ bits
      • Ex. 1: Bits in a cache. How many total bits are required for a direct-mapped cache with 16KB of data and 4-word blocks, assuming a 32-bit address?
        – 16KB = $2^{14}$ bytes = $2^{12}$ words
        – Number of blocks = $2^{12} / 4 = 2^{10}$ blocks
        – Tag field = $32 - (2 + 2 + 10) = 18$ bits (the address is divided into tag, index, word offset, and byte offset)
        – Total size = $2^{10} \times (4 \times 32 + 18 + 1) = 2^{10} \times 147$ bits = 147 Kbits
      • Ex. 2: Mapping an address to a multiword cache block. Consider a cache with 64 blocks and a block size of 16 bytes. What block number does byte address 1200 map to?
        – Block address = $\lfloor 1200 / 16 \rfloor = 75$
        – Block number = 75 modulo 64 = 11
        – Block 11 holds byte addresses 1200 to 1215
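Both worked examples are easy to check numerically. The sketch below encodes the slide's cache-size formula and the block-mapping rule; the function names `total_cache_bits` and `block_number` are hypothetical.

```python
def total_cache_bits(n: int, m: int, address_bits: int = 32) -> int:
    """Total bits in a direct-mapped cache with 2**n blocks of 2**m words each."""
    tag_bits = address_bits - (n + m + 2)   # 2 bits of byte offset per 4-byte word
    data_bits = (2 ** m) * 32               # 32 bits per word
    return (2 ** n) * (data_bits + tag_bits + 1)   # +1 for the valid bit


def block_number(byte_address: int, block_bytes: int, num_blocks: int) -> int:
    """Cache block that a byte address maps to in a direct-mapped cache."""
    block_address = byte_address // block_bytes    # floor division
    return block_address % num_blocks


# Ex. 1: 16KB of data, 4-word blocks -> n = 10, m = 2.
bits = total_cache_bits(n=10, m=2)
print(bits, "bits =", bits // 1024, "Kbits")   # 150528 bits = 147 Kbits

# Ex. 2: 64 blocks of 16 bytes, byte address 1200.
print(block_number(1200, block_bytes=16, num_blocks=64))   # 11
```

Note that the cache stores 147 Kbits to hold 128 Kbits (16KB) of data; the difference is the tag and valid-bit overhead.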

  11. Hits vs. Misses
      • Read hits
        – this is what we want!
      • Read misses
        – stall the CPU, fetch the block from memory, deliver it to the cache, restart
      • Write hits:
        – replace the data in both the cache and memory (write-through)
          • Writes always update both the cache and the memory, so the two are always consistent.
        – write the data only into the cache (write-back: update memory later)
          • Modified (dirty) blocks are written to the lower level of the hierarchy when they are replaced.
        – write the data into the cache and a buffer (write buffer)
          • After writing into the buffer, the CPU continues execution; the memory controller handles the write to memory.
          • If the buffer is full, the CPU must wait for a free buffer entry.
      • Write misses:
        – read the entire block into the cache, then write the word
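The difference between write-through and write-back shows up in how often memory is written. The toy model below is a deliberately simplified sketch, not a description of real hardware: it assumes a cache holding a single one-word block, ignores the write buffer, and uses hypothetical names (`OneBlockCache`, `count_memory_writes`).

```python
class OneBlockCache:
    """Toy cache holding a single one-word block, to contrast write policies."""

    def __init__(self, write_back: bool):
        self.write_back = write_back
        self.address = None      # address of the cached word, if any
        self.value = None
        self.dirty = False
        self.memory = {}         # stands in for the lower level of the hierarchy
        self.memory_writes = 0

    def _write_memory(self, address, value):
        self.memory[address] = value
        self.memory_writes += 1

    def write(self, address, value):
        if self.address != address:
            # Write miss: a dirty block must be written back before it is replaced.
            # (With one-word blocks the fetched word is overwritten immediately,
            # so the block read is omitted here.)
            if self.write_back and self.dirty:
                self._write_memory(self.address, self.value)
            self.address = address
            self.dirty = False
        self.value = value
        if self.write_back:
            self.dirty = True                     # defer the memory update
        else:
            self._write_memory(address, value)    # write-through: update both


def count_memory_writes(write_back: bool) -> int:
    cache = OneBlockCache(write_back)
    # Three writes to address 0, then one write to address 8 (arbitrary example).
    for address, value in [(0, 1), (0, 2), (0, 3), (8, 4)]:
        cache.write(address, value)
    return cache.memory_writes


print("write-through:", count_memory_writes(write_back=False))   # 4
print("write-back:   ", count_memory_writes(write_back=True))    # 1
```

Repeated writes to the same block cost one memory write each under write-through, but only a single write at replacement time under write-back, at the price of memory being temporarily out of date.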

  12. Hardware Issues
      • Assume a cache block of 4 words (transfer 4 words for one miss)
      • Make reading multiple words easier by using banks of memory
      [Figure: three memory organizations. a. One-word-wide memory organization (CPU, cache, bus, memory); b. Wide memory organization (CPU, multiplexor, cache, bus, memory); c. Interleaved memory organization (CPU, cache, bus, memory banks 0 to 3; the width of the bus and cache need not change)]
      • Ex. 3: Assume the following memory access times:
        – 1 clock cycle to send the address
        – 15 clock cycles for each DRAM access initiated
        – 1 clock cycle to send a word of data
        Assume the cache block is 4 words. What is the cache miss penalty for the different memory organizations?
        – a. One-word-wide memory: 1 + 4 × 15 + 4 × 1 = 65 clock cycles
        – b. Wide memory (2 words wide): 1 + 2 × 15 + 2 × 1 = 33 clock cycles
        – b. Wide memory (4 words wide): 1 + 1 × 15 + 1 × 1 = 17 clock cycles
        – c. Interleaved memory (4 banks): 1 + 1 × 15 + 4 × 1 = 20 clock cycles
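The four miss penalties in Ex. 3 follow one pattern: one address cycle, plus one DRAM access per group of words that can be read at once, plus one bus transfer per bus-width chunk. The sketch below captures that pattern; the function name `miss_penalty` and its parameter names are hypothetical, and it assumes interleaved banks are accessed fully in parallel.

```python
def miss_penalty(block_words, width_words=1, banks=1,
                 address_cycles=1, dram_cycles=15, transfer_cycles=1):
    """Clock cycles needed to fetch one cache block on a miss.

    width_words: width of the memory and bus, in words.
    banks: number of interleaved banks accessed in parallel (assumption).
    """
    words_per_access = width_words * banks
    accesses = -(-block_words // words_per_access)   # ceiling division
    transfers = -(-block_words // width_words)       # one transfer per bus-width chunk
    return address_cycles + accesses * dram_cycles + transfers * transfer_cycles


print(miss_penalty(4))                  # one-word-wide memory: 65
print(miss_penalty(4, width_words=2))   # 2-word-wide memory:   33
print(miss_penalty(4, width_words=4))   # 4-word-wide memory:   17
print(miss_penalty(4, banks=4))         # 4-bank interleaved:   20
```

Interleaving recovers most of the benefit of a 4-word-wide memory (20 versus 17 cycles) because the DRAM latency is paid only once, while the bus and cache stay one word wide.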

  13. Performance
      • Use split caches because there is more spatial locality in code:

        Program   Block size in words   Instruction miss rate   Data miss rate   Effective combined miss rate
        gcc       1                     6.1%                    2.1%             5.4%
        gcc       4                     2.0%                    1.7%             1.9%
        spice     1                     1.2%                    1.3%             1.2%
        spice     4                     0.3%                    0.6%             0.4%

      • Increasing the block size tends to decrease the miss rate.
      • However, for a fixed cache size, once the block size grows past a threshold, the miss rate increases. (Why?)

  14. Performance
      • Simplified model:
        execution time = (execution cycles + stall cycles) × cycle time
        stall cycles = # of instructions × miss ratio × miss penalty
      • Two ways of improving performance:
        – decreasing the miss ratio
        – decreasing the miss penalty
      What happens if we increase block size?

  15. Cache Performance Examples
      • Ex. 4(a): The instruction cache miss rate is 2%, the data cache miss rate is 4%, the CPI is 2 without any memory stalls, and the miss penalty is 100 clock cycles. How much faster would the processor run with a perfect cache that never missed? (Assume lw and sw make up 36% of instructions; I is the instruction count.)
        – Instruction miss cycles = I × 2% × 100 = 2.00 × I
        – Data miss cycles = I × 36% × 4% × 100 = 1.44 × I
        – CPI with memory stalls = 2 + 3.44 = 5.44
        – CPU_time_stall / CPU_time_nostall = (I × CPI_stall × cycle_time) / (I × CPI_nostall × cycle_time) = CPI_stall / CPI_nostall = 5.44 / 2 = 2.72
      • Ex. 4(b): What happens if the processor is made twice as fast by reducing the CPI from 2 to 1, but the memory system is not?
        – (1 + 3.44) / 1 = 4.44
      • Ex. 4(c): The clock rate is doubled, but the time to handle a cache miss does not change. How much faster will the computer be with the same miss rate?
        – Miss cycles per instruction = (2% × 200) + 36% × (4% × 200) = 6.88
        – Performance_fast / Performance_slow = execution_time_slow / execution_time_fast = (IC × CPI_slow × cycle_time) / (IC × CPI_fast × (cycle_time / 2)) = 5.44 / (8.88 × 0.5) = 1.23
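The three parts of Ex. 4 can be reproduced with a few lines of arithmetic. The sketch below uses the numbers given on the slide; the function name `stall_cycles_per_instruction` and the variable names are hypothetical.

```python
def stall_cycles_per_instruction(i_miss_rate, d_miss_rate, mem_fraction, miss_penalty):
    """Memory stall cycles per instruction for split instruction/data caches."""
    instruction_stalls = i_miss_rate * miss_penalty
    data_stalls = mem_fraction * d_miss_rate * miss_penalty
    return instruction_stalls + data_stalls


# Ex. 4(a): base CPI 2, miss penalty 100 cycles, 36% loads/stores.
stalls = stall_cycles_per_instruction(0.02, 0.04, 0.36, 100)
slowdown_a = (2 + stalls) / 2
print(f"(a) stalls/instr = {stalls:.2f}, perfect cache is {slowdown_a:.2f}x faster")

# Ex. 4(b): base CPI reduced to 1, same memory system.
slowdown_b = (1 + stalls) / 1
print(f"(b) perfect cache is {slowdown_b:.2f}x faster")

# Ex. 4(c): clock rate doubled, so the same miss time costs 200 cycles.
stalls_fast = stall_cycles_per_instruction(0.02, 0.04, 0.36, 200)
speedup_c = (2 + stalls) / ((2 + stalls_fast) * 0.5)
print(f"(c) stalls/instr = {stalls_fast:.2f}, speedup = {speedup_c:.2f}")
```

In both (b) and (c) the processor gets nominally twice as fast, yet memory stalls take a larger share of execution time, so the realized gain is far below 2x (about 1.23x in part (c)).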

  16. Decreasing miss ratio with associativity
      • Direct-mapped placement
      • Fully associative placement
        – a block can be placed in any location in the cache
      • Set-associative cache
        – each block can be placed in a fixed number of locations (at least two)
        – Mapping: set index = (block number) modulo (number of sets in the cache)
      [Figure: where a block may be placed, and which tags are searched, in a direct-mapped cache (blocks 0 to 7), a set-associative cache (sets 0 to 3), and a fully associative cache]
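The three placement schemes are the same rule with different associativity: direct mapped is 1-way, and fully associative is a single set containing every block. The sketch below lists the cache locations a block may occupy in an 8-block cache; the function name `candidate_locations` and the example block number 12 are assumptions for illustration, not taken from the slide text.

```python
def candidate_locations(block_number, num_blocks, associativity):
    """Cache locations where a memory block may be placed.

    associativity = 1 is direct mapped; associativity = num_blocks is
    fully associative; anything in between is set associative.
    """
    num_sets = num_blocks // associativity
    set_index = block_number % num_sets       # (block number) modulo (number of sets)
    first = set_index * associativity         # first cache location of that set
    return list(range(first, first + associativity))


BLOCK = 12  # example block number (assumption)
print("direct mapped:    ", candidate_locations(BLOCK, 8, 1))   # [4]
print("2-way set assoc.: ", candidate_locations(BLOCK, 8, 2))   # [0, 1]
print("fully associative:", candidate_locations(BLOCK, 8, 8))   # [0, 1, 2, 3, 4, 5, 6, 7]
```

Higher associativity gives a block more places to go, which reduces conflict misses, but every location in the set must have its tag searched on each access.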
