  1. Caches & Memory Hakim Weatherspoon CS 3410 Computer Science Cornell University [Weatherspoon, Bala, Bracy, McKee, and Sirer]

  2. Programs 101

     C Code:

         int main (int argc, char* argv[ ]) {
             int i;
             int m = n;
             int sum = 0;
             for (i = 1; i <= m; i++) {
                 sum += i;
             }
             printf ("...", n, sum);
         }

     RISC-V Assembly:

         main: addi sp,sp,-48
               sw   x1,44(sp)
               sw   fp,40(sp)
               move fp,sp
               sw   x10,-36(fp)
               sw   x11,-40(fp)
               la   x15,n
               lw   x15,0(x15)
               sw   x15,-28(fp)
               sw   x0,-24(fp)
               li   x15,1
               sw   x15,-20(fp)
         L2:   lw   x14,-20(fp)
               lw   x15,-28(fp)
               blt  x15,x14,L3
               . . .

     Load/Store Architectures:
     • Read data from memory (put in registers)
     • Manipulate it
     • Store it back to memory
     → Instructions that read from or write to memory…

  3. Programs 101

     C Code:

         int main (int argc, char* argv[ ]) {
             int i;
             int m = n;
             int sum = 0;
             for (i = 1; i <= m; i++) {
                 sum += i;
             }
             printf ("...", n, sum);
         }

     RISC-V Assembly (ABI register names):

         main: addi sp,sp,-48
               sw   ra,44(sp)
               sw   fp,40(sp)
               move fp,sp
               sw   a0,-36(fp)
               sw   a1,-40(fp)
               la   a5,n
               lw   a5,0(a5)
               sw   a5,-28(fp)
               sw   x0,-24(fp)
               li   a5,1
               sw   a5,-20(fp)
         L2:   lw   a4,-20(fp)
               lw   a5,-28(fp)
               blt  a5,a4,L3
               . . .

     Load/Store Architectures:
     • Read data from memory (put in registers)
     • Manipulate it
     • Store it back to memory
     → Instructions that read from or write to memory…

  4. 1 Cycle Per Stage: the Biggest Lie (So Far)

     [Figure: the 5-stage pipeline datapath — Instruction Fetch, Instruction Decode, Execute, Memory, Write-Back, with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB, forwarding and hazard-detect units. Code is stored in memory (also, data and stack).]

  5. What’s the problem?

     CPU ↔ Main Memory
     + big
     – slow
     – far away

     [Image: SandyBridge motherboard, 2011 — http://news.softpedia.com]

  6. The Need for Speed

     [Figure: CPU pipeline]

  7. The Need for Speed

     Instruction speeds:
     • add, sub, shift: 1 cycle
     • mult: 3 cycles
     • load/store: 100 cycles

     Off-chip memory takes 50(-70) ns, while a 2(-3) GHz processor has a ~0.5 ns clock — so one off-chip access costs 50 ns ÷ 0.5 ns ≈ 100 cycles.

  9. What’s the solution? Caches!

     [Die photo: Intel Pentium 3, 1999 — Level 1 Data $, Level 1 Insn $, Level 2 $]

  10. Aside
     • Go back to 04-state and 05-memory and look at how registers, SRAM, and DRAM are built.

  11. What’s the solution? Caches!

     [Die photo: Intel Pentium 3, 1999 — Level 1 Data $, Level 1 Insn $, Level 2 $]

     What lucky data gets to go here?

  12. Locality Locality Locality

     If you ask for something, you’re likely to ask for:
     • the same thing again soon → Temporal Locality
     • something near that thing, soon → Spatial Locality

         total = 0;
         for (i = 0; i < n; i++)
             total += a[i];
         return total;
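
     In the loop above — a reading of mine, not on the slide — total and i are reused every iteration (temporal locality), while a[i] walks consecutive addresses (spatial locality). As a contrast, a minimal sketch of a strided traversal (function name is illustrative) that gives up most of the spatial locality:

         #include <stddef.h>

         /* Touch one int out of every `stride` — with a big stride each
          * access lands in a different cache block, so the neighbors
          * brought in alongside a[i] are never used. */
         long sum_every_stride(const int *a, size_t n, size_t stride) {
             long total = 0;                 /* reused every iteration */
             for (size_t i = 0; i < n; i += stride)
                 total += a[i];
             return total;
         }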

  13. Your life is full of Locality

     Last Called • Speed Dial • Favorites • Contacts • Google/Facebook/email

  14. Your life is full of Locality

     [Image-only slide]

  15. The Memory Hierarchy

     Small & fast at the top; big & slow at the bottom:

         Registers      1 cycle      128 bytes
         L1 Caches      4 cycles     64 KB
         L2 Cache       12 cycles    256 KB
         L3 Cache       36 cycles    2-20 MB
         Main Memory    50-70 ns     512 MB – 4 GB
         Disk           5-20 ms      16 GB – 4 TB

     Intel Haswell Processor, 2013

  16. Some Terminology

     Cache hit
     • data is in the Cache
     • t_hit: time it takes to access the cache
     • Hit rate (%hit): # cache hits / # cache accesses

     Cache miss
     • data is not in the Cache
     • t_miss: time it takes to get the data from below the $
     • Miss rate (%miss): # cache misses / # cache accesses

     Cacheline (or cacheblock, or simply line or block)
     • Minimum unit of info that is present (or not) in the cache

  17. The Memory Hierarchy

     Average access time:

         t_avg = t_hit + %miss × t_miss
               = 4 + 5% × 100
               = 9 cycles

     [Hierarchy as on slide 15: Registers 1 cycle / 128 bytes; L1 4 cycles / 64 KB; L2 12 cycles / 256 KB; L3 36 cycles / 2-20 MB; Main Memory 50-70 ns / 512 MB – 4 GB; Disk 5-20 ms / 16 GB – 4 TB.]

     Intel Haswell Processor, 2013
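
     The same calculation as a minimal C sketch of mine (not course code), assuming t_hit and t_miss are in cycles and miss_rate is a fraction:

         #include <stdio.h>

         /* Average memory access time: every access pays t_hit; the
          * fraction that misses additionally pays t_miss. */
         double t_avg(double t_hit, double miss_rate, double t_miss) {
             return t_hit + miss_rate * t_miss;
         }

         int main(void) {
             /* The slide's numbers: 4-cycle hit, 5% misses, 100-cycle penalty. */
             printf("t_avg = %.1f cycles\n", t_avg(4.0, 0.05, 100.0));  /* 9.0 */
             return 0;
         }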

  18. Single Core Memory Hierarchy

     Processor (ON CHIP): Regs → I$/D$ → L2
     Registers → L1 Caches → L2 Cache → L3 Cache → Main Memory → Disk

  19. Multi-Core Memory Hierarchy

     ON CHIP: four processors, each with its own Regs, I$/D$, and L2, sharing a single L3.
     Below that: Main Memory → Disk

  20. Memory Hierarchy by the Numbers

     CPU clock rates ~0.33 ns – 2 ns (3 GHz – 500 MHz)

         Memory            Transistor        Access time   Access time      $ per GiB    Capacity
         technology        count*                          in cycles        in 2012
         SRAM (on chip)    6-8 transistors   0.5-2.5 ns    1-3 cycles       $4k          256 KB
         SRAM (off chip)   6-8 transistors   1.5-30 ns     5-15 cycles      $4k          32 MB
         DRAM              1 transistor      50-70 ns      150-200 cycles   $10-$20      8 GB
                           (needs refresh)
         SSD (Flash)       —                 5k-50k ns     tens of          $0.75-$1     512 GB
                                                           thousands
         Disk              —                 5M-20M ns     millions         $0.05-$0.1   4 TB

     *Registers, D-Flip Flops: 10-100’s of registers

  21. Basic Cache Design: Direct Mapped Caches

  22. A 16-Byte Memory

     MEMORY:
         addr   data
         0000   A
         0001   B
         0010   C
         0011   D
         0100   E
         0101   F
         0110   G
         0111   H
         1000   J
         1001   K
         1010   L
         1011   M
         1100   N
         1101   O
         1110   P
         1111   Q

     load 1100 → r1

     • Byte-addressable memory
     • 4 address bits → 16 bytes total
     • b addr bits → 2^b bytes in memory
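
     A quick check of the b-bits-to-bytes relationship (a sketch of mine, not slide content):

         #include <stdio.h>

         int main(void) {
             /* b address bits -> 2^b addressable bytes: 4 bits -> 16 bytes. */
             for (unsigned b = 1; b <= 8; b++)
                 printf("%u address bits -> %u bytes\n", b, 1u << b);
             return 0;
         }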

  23. A 4-Byte, Direct Mapped Cache

     [Memory as on slide 22: addresses 0000–1111 hold A…Q]

     Address = XX|XX — the low two bits are the index.

     CACHE:
         index   data
         00      A
         01      B
         10      C
         11      D

     Cache entry = row = (cache) line = (cache) block
     Block Size: 1 byte

     Direct mapped:
     • Each address maps to 1 cache block
     • 4 entries → 2 index bits (2^n entries → n bits)

     Index with LSB:
     • Supports spatial locality
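
     A minimal sketch of mine (not the course's code) of how the bits fall out of an address for this 4-entry, 1-byte-block cache:

         #include <stdint.h>

         #define INDEX_BITS 2                         /* 4 entries -> 2 index bits */

         /* With 1-byte blocks, the low bits of the address select the
          * cache entry and the remaining high bits are the tag. */
         static uint32_t cache_index(uint32_t addr) {
             return addr & ((1u << INDEX_BITS) - 1);  /* low 2 bits */
         }

         static uint32_t cache_tag(uint32_t addr) {
             return addr >> INDEX_BITS;               /* everything above the index */
         }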

  24. Analogy to a Spice Rack

     Spice Wall (Memory): spices A, B, C, D, E, F, … Z
     Spice Rack (Cache): a few indexed slots

     Compared to your spice wall, the rack is:
     • Smaller
     • Faster
     • More costly (per oz.)

     [Image: http://www.bedbathandbeyond.com]

  25. Analogy to a Spice Rack

     Spice Rack (Cache): each slot now has an index, a tag, and a spice — e.g., the jar tagged “innamon” holds Cinnamon.

     • How do you know what’s in the jar?
     • Need labels
     Tag = Ultra-minimalist label

  26. A 4-Byte, Direct Mapped Cache

     [Memory as on slide 22]

     Address = tag|index (XX|XX).

     CACHE:
         index   tag   data
         00      00    A
         01      00    B
         10      00    C
         11      00    D

     Tag: minimalist label/address
     address = tag + index

  27. A 4-Byte, Direct Mapped Cache

     [Memory as on slide 22]

     One last tweak: the valid bit.

     CACHE:
         index   V   tag   data
         00      0   00    X
         01      0   00    X
         10      0   00    X
         11      0   00    X

  28. Simulation #1 of a 4-byte, DM Cache

     [Memory as on slide 22]

     Address = tag|index (XX|XX).

     CACHE:
         index   V   tag   data
         00      0   11    X
         01      0   11    X
         10      0   11    X
         11      0   11    X

     load 1100

     Lookup:
     • Index into $
     • Check tag
     • Check valid bit

     Entry 00’s tag (11) matches, but its valid bit is 0 — so this is a miss.
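
     Putting the whole lookup together — a minimal C sketch of mine (not the course's code) of a 4-entry direct-mapped cache with 1-byte blocks, valid bits, and tags:

         #include <stdbool.h>
         #include <stdint.h>

         #define NUM_ENTRIES 4
         #define INDEX_BITS  2                    /* log2(NUM_ENTRIES) */

         struct line {
             bool    valid;                       /* has this entry ever been filled? */
             uint8_t tag;                         /* high address bits of the cached byte */
             uint8_t data;
         };

         static struct line cache[NUM_ENTRIES];

         /* Direct-mapped lookup: index into the cache, then a hit requires
          * BOTH a matching tag AND a set valid bit. */
         bool dm_lookup(uint8_t addr, uint8_t *out) {
             uint8_t index = addr & (NUM_ENTRIES - 1);
             uint8_t tag   = addr >> INDEX_BITS;
             struct line *l = &cache[index];
             if (l->valid && l->tag == tag) {     /* hit */
                 *out = l->data;
                 return true;
             }
             return false;                        /* miss: fetch from memory, fill, retry */
         }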

  29. Block Diagram

     4-entry, direct mapped cache. Address 1101 = tag|index = 11|01 (2 bits each):

     CACHE:
         index   V   tag   data
         00      1   00    1111 0000
         01      1   11    1010 0101
         10      0   01    1010 1010
         11      1   11    0000 0000

     Index 01 selects the second entry; its tag equals the address tag (11) and V = 1, so the 8-bit data out is 1010 0101. Hit!

     Great! Are we done?

  30. Simulation #2: 4-byte, DM Cache

     [Memory as on slide 22]

     load 1100 → Miss
     load 1101
     load 0100
     load 1100

     CACHE (after the first load fills entry 00):
         index   V   tag   data
         00      1   11    N
         01      0   11    X
         10      0   11    X
         11      0   11    X

     Lookup:
     • Index into $
     • Check tag
     • Check valid bit

  31. Reducing Cold Misses by Increasing Block Size
     • Leveraging Spatial Locality

  32. Increasing Block Size

     [Memory as on slide 22]

     CACHE:
         index   V   tag   data
         00      0   x     A | B
         01      0   x     C | D
         10      0   x     E | F
         11      0   x     G | H

     • Block Size: 2 bytes
     • Block Offset: the least significant bits indicate where you live in the block (address = XXX|X)
     • Which bits are the index? tag?
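
     A minimal sketch of mine, following the slide's 4-entry, 2-byte-block layout, of how a 4-bit address splits into tag, index, and offset:

         #include <stdint.h>

         #define OFFSET_BITS 1    /* 2-byte blocks   -> 1 offset bit  */
         #define INDEX_BITS  2    /* 4 cache entries -> 2 index bits  */

         /* Address layout (4-bit addresses): tag | index | offset = 1 | 2 | 1 bits. */
         static uint32_t block_offset(uint32_t addr) {
             return addr & ((1u << OFFSET_BITS) - 1);
         }

         static uint32_t cache_index(uint32_t addr) {
             return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
         }

         static uint32_t cache_tag(uint32_t addr) {
             return addr >> (OFFSET_BITS + INDEX_BITS);
         }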

  33. Simulation #3: 8-byte, DM Cache

     [Memory as on slide 22]

     Address = tag|index|offset (X|XX|X).

     CACHE:
         index   V   tag   data
         00      0   x     X | X
         01      0   x     X | X
         10      0   x     X | X
         11      0   x     X | X

     load 1100
     load 1101
     load 0100
     load 1100

     Lookup:
     • Index into $
     • Check tag
     • Check valid bit

  34. Removing Conflict Misses with Fully-Associative Caches

  35. Simulation #4: 8-byte, FA Cache

     [Memory as on slide 22]

     Address = tag|offset (XXX|X) — no index bits; a block can live in any way.

     CACHE (4 ways, 2-byte blocks):
         V   tag   data   |  V   tag   data   |  V   tag   data   |  V   tag   data
         0   xxx   X | X  |  0   xxx   X | X  |  0   xxx   X | X  |  0   xxx   X | X

     load 1100 → Miss
     load 1101
     load 0100
     load 1100

     Lookup:
     • Check tags (all of them)
     • Check valid bits

     LRU Pointer
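
     A minimal sketch of mine (not the course's code) of the fully-associative lookup: every way's tag is compared, and the slide's LRU pointer picks the victim on a miss — here simplified to a round-robin pointer, which is an assumption, not the deck's exact policy:

         #include <stdbool.h>
         #include <stdint.h>

         #define NUM_WAYS    4
         #define BLOCK_BYTES 2
         #define OFFSET_BITS 1                    /* 2-byte blocks -> 1 offset bit */

         struct way {
             bool    valid;
             uint8_t tag;                         /* addr >> OFFSET_BITS: no index bits */
             uint8_t data[BLOCK_BYTES];
         };

         static struct way ways[NUM_WAYS];
         static unsigned   victim;                /* replacement pointer */

         /* Fully associative: any block can live in any way, so a lookup
          * compares the tag against every way (hardware does this in parallel). */
         bool fa_lookup(uint8_t addr, uint8_t *out) {
             uint8_t tag    = addr >> OFFSET_BITS;
             uint8_t offset = addr & (BLOCK_BYTES - 1);
             for (unsigned w = 0; w < NUM_WAYS; w++) {
                 if (ways[w].valid && ways[w].tag == tag) {
                     *out = ways[w].data[offset]; /* hit */
                     return true;
                 }
             }
             return false;                        /* miss: fill via fa_fill(), retry */
         }

         /* On a miss, install the fetched block into the victim way and
          * advance the pointer (round-robin stand-in for the LRU pointer). */
         void fa_fill(uint8_t addr, const uint8_t block[BLOCK_BYTES]) {
             struct way *v = &ways[victim];
             v->valid = true;
             v->tag   = addr >> OFFSET_BITS;
             for (unsigned i = 0; i < BLOCK_BYTES; i++)
                 v->data[i] = block[i];
             victim = (victim + 1) % NUM_WAYS;
         }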
