Caches and Memory — Anne Bracy, CS 3410 Computer Science, Cornell (PowerPoint presentation)



slide-1
SLIDE 1

Caches and Memory

Anne Bracy CS 3410 Computer Science Cornell University

See P&H Chapter: 5.1-5.4, 5.8, 5.10, 5.13, 5.15, 5.17

1 Slides by Anne Bracy with 3410 slides by Professors Weatherspoon, Bala, McKee, and Sirer.

slide-2
SLIDE 2

Programs 101

Load/Store Architectures:

  • Read data from memory

(put in registers)

  • Manipulate it
  • Store it back to memory

int main (int argc, char* argv[ ]) {
  int i;
  int m = n;
  int sum = 0;
  for (i = 1; i <= m; i++) {
    sum += i;
  }
  printf ("...", n, sum);
}

C Code


main:
    addiu $sp,$sp,-48
    sw    $31,44($sp)
    sw    $fp,40($sp)
    move  $fp,$sp
    sw    $4,48($fp)
    sw    $5,52($fp)
    la    $2,n
    lw    $2,0($2)
    sw    $2,28($fp)
    sw    $0,32($fp)
    li    $2,1
    sw    $2,24($fp)
$L2:
    lw    $2,24($fp)
    lw    $3,28($fp)
    slt   $2,$3,$2
    bne   $2,$0,$L3
    . . .

2

MIPS Assembly

Instructions that read from or write to memory…
slide-3
SLIDE 3

1 Cycle Per Stage: the Biggest Lie (So Far)

3

[Figure: 5-stage pipeline datapath — Instruction Fetch, Instruction Decode, Execute, Memory, Write-Back — with IF/ID, ID/EX, EX/MEM, MEM/WB latches, register file, ALU, jump/branch target computation, forward unit, and hazard detection]

Code Stored in Memory (also, data and stack)

slide-4
SLIDE 4

What’s the problem?

+ big – slow – far away

SandyBridge Motherboard, 2011 http://news.softpedia.com

CPU Main Memory

4

slide-5
SLIDE 5

The Need for Speed

CPU Pipeline

5

slide-6
SLIDE 6

Instruction speeds:

  • add,sub,shift: 1 cycle
  • mult: 3 cycles
  • load/store: 100 cycles
  • off-chip 50(-70) ns

2(-3) GHz processor → 0.5 ns clock

The Need for Speed

CPU Pipeline

6

slide-7
SLIDE 7

The Need for Speed

CPU Pipeline

7

slide-8
SLIDE 8

What’s the solution?

What lucky data gets to go here?

Level 2 $

Level 1 Data $ Level 1 Insn $ Intel Pentium 3, 1999

Caches !

8

slide-9
SLIDE 9

Locality Locality Locality

If you ask for something, you’re likely to ask for:

  • the same thing again soon

→ Temporal Locality

  • something near that thing, soon

→ Spatial Locality

total = 0;
for (i = 0; i < n; i++)
  total += a[i];
return total;

9

slide-10
SLIDE 10

Your life is full of Locality

10

Last Called Speed Dial Favorites Contacts Google/Facebook/email

slide-11
SLIDE 11

Your life is full of Locality

11

slide-12
SLIDE 12

The Memory Hierarchy

Registers: 1 cycle, 128 bytes
L1 Caches: 4 cycles, 64 KB
L2 Cache: 12 cycles, 256 KB
L3 Cache: 36 cycles, 2-20 MB
Main Memory: 50-70 ns, 512 MB – 4 GB
Disk: 5-20 ms, 16 GB – 4 TB

Intel Haswell Processor, 2013

Small, Fast → Big, Slow

12

slide-13
SLIDE 13

Some Terminology

Cache hit

  • data is in the Cache
  • thit : time it takes to access the cache
  • %hit: Hit rate. # cache hits / # cache accesses

Cache miss

  • data is not in the Cache
  • tmiss : time it takes to get the data from below the $
  • Miss rate (%miss): # cache misses / # cache accesses

13

slide-14
SLIDE 14

The Memory Hierarchy

Registers: 1 cycle, 128 bytes
L1 Caches: 4 cycles, 64 KB
L2 Cache: 12 cycles, 256 KB
L3 Cache: 36 cycles, 2-20 MB
Main Memory: 50-70 ns, 512 MB – 4 GB
Disk: 5-20 ms, 16 GB – 4 TB

Intel Haswell Processor, 2013

average access time
tavg = thit + %miss × tmiss = 4 + 5% × 100 = 9 cycles

14

slide-15
SLIDE 15

Single Core Memory Hierarchy

16

[Figure: the hierarchy pyramid (Registers, L1 Caches, L2 Cache, L3 Cache, Main Memory, Disk) beside a single-core chip — Processor with Regs, I$, D$, and L2 ON CHIP; Main Memory and Disk off chip]

slide-16
SLIDE 16

Multi-Core Memory Hierarchy

[Figure: the hierarchy pyramid again, beside a multi-core chip — four Processors, each with Regs, I$, D$, and a private L2; a shared L3 ON CHIP; Main Memory and Disk off chip]
17

slide-17
SLIDE 17

Memory Hierarchy by the Numbers

CPU clock rates ~0.33 ns – 2 ns (3 GHz – 500 MHz)

*Registers, D-Flip-Flops: 10-100's of registers

Memory technology | Transistor count* | Access time | Access time in cycles | $ per GiB in 2012 | Capacity
SRAM (on chip)    | 6-8 transistors   | 0.5-2.5 ns  | 1-3 cycles            | $4k        | 256 KB
SRAM (off chip)   |                   | 1.5-30 ns   | 5-15 cycles           | $4k        | 32 MB
DRAM              | 1 transistor (needs refresh) | 50-70 ns | 150-200 cycles | $10-$20    | 8 GB
SSD (Flash)       |                   | 5k-50k ns   | tens of thousands     | $0.75-$1   | 512 GB
Disk              |                   | 5M-20M ns   | millions              | $0.05-$0.1 | 4 TB

18

slide-18
SLIDE 18

Basic Cache Design

Direct Mapped Caches

19

slide-19
SLIDE 19

addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

16 Byte Memory

MEMORY

  • Byte-addressable memory
  • 4 address bits → 16 bytes total
  • b addr bits → 2^b bytes in memory

load 0x1100 → r1

20

slide-20
SLIDE 20

4-Byte, Direct Mapped Cache

addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

MEMORY CACHE

data A B C D

Direct mapped:

  • Each address maps to 1 cache block
  • 4 entries → 2 index bits (2^n → n bits)

Index with LSB:

  • Supports spatial locality

index XXXX

index 00 01 10 11

21

← Cache entry = row = (cache) line = (cache) block. Block Size: 1 byte

slide-21
SLIDE 21

Analogy to a Spice Rack

  • Compared to your spice wall

– Smaller – Faster – More costly (per oz.)

A B C D E F … Z

http://www.bedbathandbeyond.com

Spice Wall (Memory) Spice Rack (Cache)

index spice 22

slide-22
SLIDE 22

Cinnamon

  • How do you know what’s in the jar?
  • Need labels

Tag = Ultra-minimalist label

Analogy to a Spice Rack


Spice Wall (Memory)

A B C D E F … Z

Spice Rack (Cache)

index spice tag 23

slide-23
SLIDE 23

tag|index XXXX

data A B C D tag 00 00 00 00

4-Byte, Direct Mapped Cache

MEMORY CACHE

Tag: minimalist label/address

address = tag + index

index 00 01 10 11 addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

24

slide-24
SLIDE 24

4-Byte, Direct Mapped Cache

MEMORY CACHE

One last tweak: valid bit

V

tag

data 00 X 00 X 00 X 00 X index 00 01 10 11 addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

25

slide-25
SLIDE 25

addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

Lookup:

  • Index into $
  • Check tag
  • Check valid bit

Simulation #1

  • of a 4-byte, DM Cache

MEMORY CACHE

V

tag

data 11 X 11 X 11 X 11 X

load 0x1100

Miss

tag|index XXXX

index 00 01 10 11

26

slide-26
SLIDE 26

addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

Simulation #1

  • of a 4-byte, DM Cache

MEMORY CACHE

V

tag

data 1 11 N xx X xx X xx X

load 0x1100

Miss

Lookup:

  • Index into $
  • Check tag
  • Check valid bit

tag|index XXXX

index 00 01 10 11

27

slide-27
SLIDE 27

Simulation #1

  • of a 4-byte, DM Cache

MEMORY CACHE

V

tag

data 1 11 N 11 X 11 X 11 X

load 0x1100 ... load 0x1100

Miss Hit!

Awesome!

tag|index XXXX

Lookup:

  • Index into $
  • Check tag
  • Check valid bit

index 00 01 10 11 addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

28

slide-28
SLIDE 28

Block Diagram

4-entry, direct mapped Cache

CACHE (V | tag | data):
1 | 00 | 1111 0000
1 | 11 | 1010 0101
1 | 01 | 1010 1010
1 | 11 | 0000 0000

[Figure: lookup of address 1101 — 2-bit tag (11) | 2-bit index (01); the index selects a line, an = comparator checks the stored tag → Hit!, and the 8-bit data 1010 0101 is read out]

Great! Are we done?

29

slide-29
SLIDE 29

addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

Simulation #2: 4-byte, DM Cache

MEMORY CACHE

V

tag

data 11 X 11 X 11 X 11 X

load 0x1100 load 0x1101 load 0x0100 load 0x1100

Miss

Lookup:

  • Index into $
  • Check tag
  • Check valid bit

tag|index XXXX

index 00 01 10 11

30

slide-30
SLIDE 30

addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

Simulation #2: 4-byte, DM Cache

MEMORY CACHE

V

tag

data 1 11 N xx X xx X xx X

load 0x1100 load 0x1101 load 0x0100 load 0x1100

Miss

tag|index XXXX

Lookup:

  • Index into $
  • Check tag
  • Check valid bit

index 00 01 10 11

31

slide-31
SLIDE 31

addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

Simulation #2: 4-byte, DM Cache

MEMORY CACHE

V

tag

data 1 11 N 11 X 11 X 11 X

load 0x1100 load 0x1101 load 0x0100 load 0x1100

Miss Miss

tag|index XXXX

Lookup:

  • Index into $
  • Check tag
  • Check valid bit

index 00 01 10 11

32

slide-32
SLIDE 32

addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

Simulation #2: 4-byte, DM Cache

MEMORY CACHE

V

tag

data 1 11 N 1 11 O 11 X 11 X

load 0x1100 load 0x1101 load 0x0100 load 0x1100

Miss Miss

tag|index XXXX

Lookup:

  • Index into $
  • Check tag
  • Check valid bit

index 00 01 10 11

33

slide-33
SLIDE 33

addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

Simulation #2: 4-byte, DM Cache

MEMORY CACHE

V

tag

data 1 11 N 1 11 O xx X xx X

load 0x1100 load 0x1101 load 0x0100 load 0x1100

Miss Miss Miss

tag|index XXXX

Lookup:

  • Index into $
  • Check tag
  • Check valid bit

index 00 01 10 11

34

slide-34
SLIDE 34

addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

Simulation #2: 4-byte, DM Cache

MEMORY CACHE

V

tag

data 1 01 E 1 11 O 11 X 11 X

load 0x1100 load 0x1101 load 0x0100 load 0x1100

Miss Miss Miss

tag|index XXXX

Lookup:

  • Index into $
  • Check tag
  • Check valid bit

index 00 01 10 11

35

slide-35
SLIDE 35

Simulation #2: 4-byte, DM Cache

MEMORY CACHE

V

tag

data 1 01 E 1 11 O 11 X 11 X

load 0x1100 load 0x1101 load 0x0100 load 0x1100

Miss Miss Miss Miss

tag|index XXXX

Lookup:

  • Index into $
  • Check tag
  • Check valid bit

index 00 01 10 11 addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

36

slide-36
SLIDE 36

Simulation #2: 4-byte, DM Cache

MEMORY CACHE

V

tag

data 1 11 N 1 11 O 11 X 11 X

load 0x1100 load 0x1101 load 0x0100 load 0x1100

Miss Miss Miss Miss

Disappointed! ☹

tag|index XXXX

cold cold cold

index 00 01 10 11 addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

37

slide-37
SLIDE 37

Reducing Cold Misses by Increasing Block Size

Leveraging Spatial Locality

38

slide-38
SLIDE 38

Increasing Block Size

addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

CACHE

V

tag

data x A | B x C | D x E | F x G | H

MEMORY

  • Block Size: 2 bytes
  • Block Offset: least significant bits

indicate where you live in the block

  • Which bits are the index? tag? offset?

tag|index|offset XXXX

index 00 01 10 11

39

slide-39
SLIDE 39

addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

Simulation #3: 8-byte, DM Cache

MEMORY CACHE

V

tag

data x X | X x X | X x X | X x X | X

load 0x1100 load 0x1101 load 0x0100 load 0x1100

Miss

tag|index|offset XXXX

Lookup:

  • Index into $
  • Check tag
  • Check valid bit

index 00 01 10 11

40

slide-40
SLIDE 40

addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q V

tag

data x X | X x X | X 1 1 N | O x X | X

Simulation #3: 8-byte, DM Cache

MEMORY CACHE load 0x1100 load 0x1101 load 0x0100 load 0x1100 tag|index|offset XXXX

Miss

Lookup:

  • Index into $
  • Check tag
  • Check valid bit

index 00 01 10 11

41

slide-41
SLIDE 41

V

tag

data x X | X x X | X 1 1 N | O x X | X

Simulation #3: 8-byte, DM Cache

MEMORY CACHE load 0x1100 load 0x1101 load 0x0100 load 0x1100

Hit!

tag|index|offset XXXX

Miss

Lookup:

  • Index into $
  • Check tag
  • Check valid bit

index 00 01 10 11 addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

42

slide-42
SLIDE 42

addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q V

tag

data x X | X x X | X 1 1 N | O x X | X

Simulation #3: 8-byte, DM Cache

MEMORY CACHE load 0x1100 load 0x1101 load 0x0100 load 0x1100

Hit! Miss Miss

tag|index|offset XXXX

Lookup:

  • Index into $
  • Check tag
  • Check valid bit

index 00 01 10 11

43

slide-43
SLIDE 43

addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q V

tag

data x X | X x X | X 1 E | F x X | X

Simulation #3: 8-byte, DM Cache

MEMORY CACHE load 0x1100 load 0x1101 load 0x0100 load 0x1100

Hit! Miss Miss

tag|index|offset XXXX

Lookup:

  • Index into $
  • Check tag
  • Check valid bit

index 00 01 10 11

44

slide-44
SLIDE 44

V

tag

data x X | X x X | X 1 E | F x X | X

Simulation #3: 8-byte, DM Cache

MEMORY CACHE load 0x1100 load 0x1101 load 0x0100 load 0x1100

Hit! Miss Miss Miss

tag|index|offset XXXX

Lookup:

  • Index into $
  • Check tag
  • Check valid bit

index 00 01 10 11 addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

45

slide-45
SLIDE 45

V

tag

data x X | X x X | X 1 E | F x X | X

Simulation #3: 8-byte, DM Cache

MEMORY CACHE load 0x1100 load 0x1101 load 0x0100 load 0x1100

Hit! Miss Miss Miss

1 hit, 3 misses 3 bytes don’t fit in an 8 byte cache?

cold cold conflict

index 00 01 10 11 addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

46

slide-46
SLIDE 46

Removing Conflict Misses with Fully-Associative Caches

47

slide-47
SLIDE 47

8 byte, fully-associative Cache

CACHE MEMORY

What should the offset be? What should the index be? What should the tag be?

tag|offset XXXX

V | tag | data:
x | xxx | X | X
x | xxx | X | X
x | xxx | X | X
x | xxx | X | X

addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

48

slide-48
SLIDE 48

V | tag | data:
x | xxx | X | X
x | xxx | X | X
x | xxx | X | X
x | xxx | X | X

Simulation #4: 8-byte, FA Cache

addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

MEMORY load 0x1100 load 0x1101 load 0x0100 load 0x1100

Miss

XXXX tag|offset

Lookup:

  • Index into $
  • Check tags
  • Check valid bits

CACHE

49

LRU Pointer

slide-49
SLIDE 49

V | tag | data:
1 | 110 | N | O
0 | xxx | X | X
x | xxx | X | X
x | xxx | X | X

Simulation #4: 8-byte, FA Cache

MEMORY load 0x1100 load 0x1101 load 0x0100 load 0x1100

Miss

XXXX tag|offset

Lookup:

  • Index into $
  • Check tags
  • Check valid bits

CACHE

Hit!

addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

50

slide-50
SLIDE 50

addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

V | tag | data:
1 | 110 | N | O
0 | xxx | X | X
x | xxx | X | X
x | xxx | X | X

Simulation #4: 8-byte, FA Cache

MEMORY load 0x1100 load 0x1101 load 0x0100 load 0x1100

Miss

XXXX tag|offset

Lookup:

  • Index into $
  • Check tags
  • Check valid bits

CACHE

Hit! Miss 51

LRU Pointer

slide-51
SLIDE 51

V | tag | data:
1 | 110 | N | O
1 | 010 | E | F
x | xxx | X | X
x | xxx | X | X

Simulation #4: 8-byte, FA Cache

MEMORY load 0x1100 load 0x1101 load 0x0100 load 0x1100

Miss

XXXX tag|offset

Lookup:

  • Index into $
  • Check tags
  • Check valid bits

CACHE

Hit! Miss Hit!

addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

52

LRU Pointer

slide-52
SLIDE 52

Pros and Cons of Full Associativity

+ No more conflicts!
+ Excellent utilization!
But either:
– Parallel Reads: lots of reading!
– Serial Reads: lots of waiting

tavg = thit + %miss* tmiss

= 4 + 5% × 100 = 9 cycles
= 6 + 3% × 100 = 9 cycles

53

slide-53
SLIDE 53

Pros & Cons

                     | Direct Mapped | Fully Associative
Tag Size             | Smaller       | Larger
SRAM Overhead        | Less          | More
Controller Logic     | Less          | More
Speed                | Faster        | Slower
Price                | Less          | More
Scalability          | Very          | Not Very
# of conflict misses | Lots          | Zero
Hit Rate             | Low           | High
Pathological Cases   | Common        | ?

slide-54
SLIDE 54

Reducing Conflict Misses with Set-Associative Caches

Not too conflict-y. Not too slow. … Just Right!

55

slide-55
SLIDE 55

8 byte, 2-way set associative Cache

CACHE MEMORY

What should the offset be? What should the index be? What should the tag be?

tag|index|offset XXXX

V | tag | data (2 sets × 2 ways):
set 0: xx | E | F    xx | N | O
set 1: xx | C | D    xx | P | Q

index 1 addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

56

slide-56
SLIDE 56

addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

8 byte, 2-way set associative Cache

CACHE

index 1

MEMORY tag|index|offset XXXX

V

tag

data xx X | X xx X | X V

tag

data xx X | X xx X | X

load 0x1100 load 0x1101 load 0x0100 load 0x1100

Miss

Lookup:

  • Index into $
  • Check tag
  • Check valid bit

58

LRU Pointer

slide-57
SLIDE 57

addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

8 byte, 2-way set associative Cache

CACHE

index 1

MEMORY tag|index|offset XXXX

V

tag

data 1 11 N | O xx X | X V

tag

data xx X | X xx X | X

load 0x1100 load 0x1101 load 0x0100 load 0x1100

Miss

Lookup:

  • Index into $
  • Check tag
  • Check valid bit

Hit! 59

LRU Pointer

slide-58
SLIDE 58

addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

8 byte, 2-way set associative Cache

CACHE

index 1

MEMORY tag|index|offset XXXX

V

tag

data 1 11 N | O xx X | X V

tag

data xx X | X xx X | X

load 0x1100 load 0x1101 load 0x0100 load 0x1100

Miss

Lookup:

  • Index into $
  • Check tag
  • Check valid bit

Hit! Miss 60

LRU Pointer

slide-59
SLIDE 59

8 byte, 2-way set associative Cache

addr data 0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 J 1001 K 1010 L 1011 M 1100 N 1101 O 1110 P 1111 Q

CACHE

index 1

MEMORY tag|index|offset XXXX

V

tag

data 1 11 N | O xx X | X V

tag

data 1 01 E | F xx X | X

load 0x1100 load 0x1101 load 0x0100 load 0x1100

Miss

Lookup:

  • Index into $
  • Check tag
  • Check valid bit

Hit! Miss Hit! 61

LRU Pointer

slide-60
SLIDE 60

Eviction Policies

Which cache line should be evicted from the cache to make room for a new line?

  • Direct-mapped: no choice, must evict line

selected by index

  • Associative caches
  • Random: select one of the lines at random
  • Round-Robin: similar to random
  • FIFO: replace oldest line
  • LRU: replace line that has not been used in the

longest time

62

slide-61
SLIDE 61

Misses: the Three C’s

  • Cold (compulsory) Miss:

never seen this address before

  • Conflict Miss:

cache associativity is too low

  • Capacity Miss:

cache is too small

63

slide-62
SLIDE 62

Miss Rate vs. Block Size

64

slide-63
SLIDE 63

Block Size Tradeoffs

  • For a given total cache size,

Larger block sizes mean….

– fewer lines – so fewer tags, less overhead – and fewer cold misses (within-block “prefetching”)

  • But also…

– fewer blocks available (for scattered accesses!) – so more conflicts – can decrease performance if working set can’t fit in $ – and larger miss penalty (time to fetch block)

slide-64
SLIDE 64

Miss Rate vs. Associativity

66

slide-65
SLIDE 65

ABCs of Caches

+ Associativity: ⬇ conflict misses ☺  ⬆ hit time ☹
+ Block Size: ⬇ cold misses ☺  ⬆ conflict misses ☹
+ Capacity: ⬇ capacity misses ☺  ⬆ hit time ☹

tavg = thit + %miss* tmiss

67

slide-66
SLIDE 66

Which caches get what properties?

L1 Caches

L2 Cache

L3 Cache Fast Big

More Associative Bigger Block Sizes Larger Capacity

tavg = thit + %miss* tmiss

Design with miss rate in mind Design with speed in mind

68

slide-67
SLIDE 67

Roadmap

  • Things we have covered:

– The Need for Speed – Locality to the Rescue! – Calculating average memory access time – $ Misses: Cold, Conflict, Capacity – $ Characteristics: Associativity, Block Size, Capacity

  • Things we will now cover:

– Cache Figures – Cache Performance Examples – Writes

69

slide-68
SLIDE 68

More Slides Coming…

slide-69
SLIDE 69

2-Way Set Associative Cache (Reading)

71

[Figure: address split into Tag | Index | Offset (32-bit addresses, 64-byte lines); the index selects a set, two = tag comparators drive line select and hit?, and the offset drives word select for the data output]

slide-70
SLIDE 70

3-Way Set Associative Cache (Reading)

72

[Figure: the same read path with three ways — three = tag comparators; Tag | Index | Offset, 32-bit addresses, 64-byte lines]

slide-71
SLIDE 71

How Big is the Cache?

n bit index, m bit offset, N-way Set Associative Question: How big is cache?

  • Data only?

(what we usually mean when we ask “how big” is the cache)

  • Data + overhead?

73

Tag Index Offset

slide-72
SLIDE 72

Cache Performance Example

tavg for accessing 16 words? Memory Parameters (very simplified):

  • Main Memory: 4GB

– Data cost: 50 cycles for the first word, plus 3 cycles per subsequent word

  • L1: 512 x 64 byte cache lines, direct mapped

– Data cost: 3 cycles per word access
– Lookup cost: 2 cycles
Performance if %hit = 90%? Performance if %hit = 95%?
Note: here thit splits up lookup vs. data cost. Why are there two ways?

75

tavg = thit + %miss* tmiss

slide-73
SLIDE 73

Performance Calculation with $ Hierarchy

  • Parameters

– Reference stream: all loads – D$: thit = 1ns, %miss = 5% – L2: thit = 10ns, %miss = 20% (local miss rate) – Main memory: thit = 50ns

  • What is tavgD$ without an L2?

– tmissD$ = – tavgD$ =

  • What is tavgD$ with an L2?

– tmissD$ = – tavgL2 = – tavgD$ =

77

tavg = thit + %miss* tmiss

slide-74
SLIDE 74

Performance Summary

Average memory access time (AMAT) depends on:

  • cache architecture and size
  • Hit and miss rates
  • Access times and miss penalty

Cache design a very complex problem:

  • Cache size, block size (aka line size)
  • Number of ways of set-associativity (1, N, ∞)

  • Eviction policy
  • Number of levels of caching, parameters for each
  • Separate I-cache from D-cache, or Unified cache
  • Prefetching policies / instructions
  • Write policy

79

slide-75
SLIDE 75

Takeaway

Direct Mapped → fast, but low hit rate. Fully Associative → higher hit cost, higher hit rate. Set Associative → middle ground. Line size matters: larger cache lines can increase performance due to prefetching, BUT can also decrease performance if the working set cannot fit in the cache. Cache performance is measured by the average memory access time (AMAT), which depends on cache architecture and size, but also on the hit access time, miss penalty, and hit rate.

80

slide-76
SLIDE 76

What about Stores?

Where should you write the result of a store?

  • If that memory location is in the cache?

– Send it to the cache – Should we also send it to memory right away? (write-through policy) – Wait until we evict the block (write-back policy)

  • If it is not in the cache?

– Allocate the line (put it in the cache)? (write allocate policy) – Write it directly to memory without allocation? (no write allocate policy)

slide-77
SLIDE 77

Cache Write Policies

Q: How to write data?

CPU Cache SRAM Memory DRAM

addr data

If data is already in the cache… No-Write

writes invalidate the cache and go directly to memory

Write-Through

writes go to main memory and cache

Write-Back

CPU writes only to cache cache writes to main memory later (when block is evicted)

slide-78
SLIDE 78

Write Allocation Policies

Q: How to write data?

CPU Cache SRAM Memory DRAM

addr data

If data is not in the cache… Write-Allocate

allocate a cache line for new data (and maybe write-through)

No-Write-Allocate

ignore cache, just go to main memory

slide-79
SLIDE 79

Write-Through Stores

29 123 150 162 18 33 19 210

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Instructions: LB $1 ← M[ 1 ] LB $2 ← M[ 7 ] SB $2 → M[ 0 ] SB $1 → M[ 5 ] LB $2 ← M[ 10 ] SB $1 → M[ 5 ] SB $1 → M[ 10 ] Cache Register File

$0 $1 $2 $3

Memory

78 120 71 173 21 28 200 225

Misses: Hits: 16 byte, byte-addressed memory 4 byte, fully-associative cache: 2-byte blocks, write-allocate 4 bit addresses: 3 bit tag, 1 bit offset lru V tag data

1

slide-80
SLIDE 80

Cache Register File

Write-Through (REF 1)

29 123 150 162 18 33 19 210

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 $0 $1 $2 $3

78 120 71 173 21 28 200 225

Misses: Hits: Memory Instructions: LB $1 ← M[ 1 ] LB $2 ← M[ 7 ] SB $2 → M[ 0 ] SB $1 → M[ 5 ] LB $2 ← M[ 10 ] SB $1 → M[ 5 ] SB $1 → M[ 10 ] lru V tag data

1

slide-81
SLIDE 81

Cache Register File

Write-Through (REF 1)

29 123 150 162 18 33 19 210

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

000

$0 $1 $2 $3

78 120 71 173 21 28 200 225

Misses: 1 Hits:

1 29 78 29 Addr: 0001

Memory Instructions: LB $1 ← M[ 1 ] LB $2 ← M[ 7 ] SB $2 → M[ 0 ] SB $1 → M[ 5 ] LB $2 ← M[ 10 ] SB $1 → M[ 5 ] SB $1 → M[ 10 ] lru V tag data M

1

slide-82
SLIDE 82

Cache Register File

Write-Through (REF 2)

29 123 150 162 18 33 19 210

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

000

$0 $1 $2 $3

78 120 71 173 21 28 200 225

Misses: 1 Hits:

1 29 78 29

Memory Instructions: LB $1 ← M[ 1 ] LB $2 ← M[ 7 ] SB $2 → M[ 0 ] SB $1 → M[ 5 ] LB $2 ← M[ 10 ] SB $1 → M[ 5 ] SB $1 → M[ 10 ] lru V tag data M

1

slide-83
SLIDE 83

Cache Register File

Write-Through (REF 2)

29 123 150 162 18 33 19 210

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

000

$0 $1 $2 $3

011 78 120 71 173 21 28 200 225

Misses: 2 Hits:

1 1 29 78 29 162 173 173 Addr: 0111

Memory Instructions: LB $1 ← M[ 1 ] LB $2 ← M[ 7 ] SB $2 → M[ 0 ] SB $1 → M[ 5 ] LB $2 ← M[ 10 ] SB $1 → M[ 5 ] SB $1 → M[ 10 ] lru V tag data M M

1

slide-84
SLIDE 84

Cache Register File

Write-Through (REF 3)

29 123 150 162 18 33 19 210

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

000

$0 $1 $2 $3

011 78 120 71 173 21 28 200 225

Misses: 2 Hits:

1 1 29 78 29 162 173 173

Memory Instructions: LB $1 ← M[ 1 ] LB $2 ← M[ 7 ] SB $2 → M[ 0 ] SB $1 → M[ 5 ] LB $2 ← M[ 10 ] SB $1 → M[ 5 ] SB $1 → M[ 10 ] lru V tag data M M

1

slide-85
SLIDE 85

Cache Register File

Write-Through (REF 3)

29 123 150 162 18 33 19 210

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

000

$0 $1 $2 $3

011 120 71 173 21 28 200 225

Misses: 2 Hits: 1

1 1 29 29 162 173 173 173 173 Addr: 0000

Memory Instructions: LB $1 ← M[ 1 ] LB $2 ← M[ 7 ] SB $2 → M[ 0 ] SB $1 → M[ 5 ] LB $2 ← M[ 10 ] SB $1 → M[ 5 ] SB $1 → M[ 10 ] lru V tag data M M Hit

1

slide-86
SLIDE 86

Cache Register File

Write-Through (REF 4)

29 123 150 162 18 33 19 210

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

000

$0 $1 $2 $3

010 173 120 71 173 21 28 200 225

Misses: 2 Hits: 1

1 1 29 173 29 173 Addr: 0101 162 173 150 71

Memory Instructions: LB $1 ← M[ 1 ] LB $2 ← M[ 7 ] SB $2 → M[ 0 ] SB $1 → M[ 5 ] LB $2 ← M[ 10 ] SB $1 → M[ 5 ] SB $1 → M[ 10 ] lru V tag data M M Hit M

1

slide-87
SLIDE 87

Cache Register File

29 123 150 162 18 33 19 210

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

000

$0 $1 $2 $3

010 173 120 71 173 21 28 200 225

Misses: 3 Hits: 1

1 1 29 173 29 173 150 71 150 29

Write-Through (REF 4)

Memory Instructions: LB $1 ← M[ 1 ] LB $2 ← M[ 7 ] SB $2 → M[ 0 ] SB $1 → M[ 5 ] LB $2 ← M[ 10 ] SB $1 → M[ 5 ] SB $1 → M[ 10 ] lru V tag data M M Hit M

1 29

slide-88
SLIDE 88

Cache Register File

Write-Through (REF 5)

29 123 29 162 18 33 19 210

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

101

$0 $1 $2 $3

010 173 120 71 173 21 28 200 225

Misses: 3 Hits: 1

1 1 29 173 29 173 29 71 Addr: 1010

Memory Instructions: LB $1 ← M[ 1 ] LB $2 ← M[ 7 ] SB $2 → M[ 0 ] SB $1 → M[ 5 ] LB $2 ← M[ 10 ] SB $1 → M[ 5 ] SB $1 → M[ 10 ] lru V tag data M M Hit M

1

slide-89
SLIDE 89

Cache Register File

Write-Through (REF 5)

29 123 29 162 18 33 19 210

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

101

$0 $1 $2 $3

010 173 120 71 173 21 28 200 225

Misses: 4 Hits: 1

1 1 29 29 71 33 28 33

Memory Instructions: LB $1 ← M[ 1 ] LB $2 ← M[ 7 ] SB $2 → M[ 0 ] SB $1 → M[ 5 ] LB $2 ← M[ 10 ] SB $1 → M[ 5 ] SB $1 → M[ 10 ] lru V tag data M M Hit M M

1

slide-90
SLIDE 90

Cache Register File

Write-Through (REF 6)

29 123 29 162 18 33 19 210

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

101

$0 $1 $2 $3

010 173 120 71 173 21 28 200 225

Misses: 4 Hits: 1

1 1 29 29 71 33 28 33 29 29 Addr: 0101

Memory Instructions: LB $1 ← M[ 1 ] LB $2 ← M[ 7 ] SB $2 → M[ 0 ] SB $1 → M[ 5 ] LB $2 ← M[ 10 ] SB $1 → M[ 5 ] SB $1 → M[ 10 ] lru V tag data M M Hit M M

1

slide-91
SLIDE 91

Cache Register File

Write-Through (REF 6)

29 123 29 162 18 33 19 210

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

101

$0 $1 $2 $3

010 173 120 71 173 21 28 200 225

Misses: 4 Hits: 2

1 1 29 29 71 33 28 33 29 29

Memory Instructions: LB $1 ← M[ 1 ] LB $2 ← M[ 7 ] SB $2 → M[ 0 ] SB $1 → M[ 5 ] LB $2 ← M[ 10 ] SB $1 → M[ 5 ] SB $1 → M[ 10 ] lru V tag data M M Hit M M Hit

1

slide-92
SLIDE 92

Cache Register File

Write-Through (REF 7)

29 123 29 162 18 33 19 210

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

101

$0 $1 $2 $3

010 173 120 71 173 21 28 200 225

Misses: 4 Hits: 2

1 1 29 29 71 33 28 33 29 29 Addr: 1011

Memory Instructions: LB $1 ← M[ 1 ] LB $2 ← M[ 7 ] SB $2 → M[ 0 ] SB $1 → M[ 5 ] LB $2 ← M[ 10 ] SB $1 → M[ 5 ] SB $1 → M[ 10 ] lru V tag data M M Hit M M Hit

1

slide-93
SLIDE 93

Cache Register File

33 29

Write-Through (REF 7)

29 123 29 162 18 19 210

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

101

$0 $1 $2 $3

010 173 120 71 173 21 28 200 225

Misses: 4 Hits: 3

1 1 29 29 71 28 33 29 29 33 29

Memory Instructions: LB $1 ← M[ 1 ] LB $2 ← M[ 7 ] SB $2 → M[ 0 ] SB $1 → M[ 5 ] LB $2 ← M[ 10 ] SB $1 → M[ 5 ] SB $1 → M[ 10 ] lru V tag data M M Hit M M Hit Hit

1

slide-94
SLIDE 94

How Many Memory References?

Write-through performance

  • How many memory reads?
  • How many memory writes?
  • Overhead? Do we need a dirty bit?
slide-95
SLIDE 95

Cache Register File

Write-Through (REF 8,9)

29 123 29 162 18 29 19 210

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

101

$0 $1 $2 $3

010 173 120 71 173 21 28 200 225

Misses: 4 Hits: 3

1 1 29 29 71 29 28 33 29 29 29 29

Memory Instructions: ... SB $1 → M[ 5 ] LB $2 ← M[ 10 ] SB $1 → M[ 5 ] SB $1 → M[ 10 ] SB $1 → M[ 5 ] SB $1 → M[ 10 ] lru V tag data M M Hit M M Hit Hit

1

slide-96
SLIDE 96

Cache Register File

Write-Through (REF 8,9)

29 123 29 162 18 29 19 210

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

101

$0 $1 $2 $3

010 173 120 71 173 21 28 200 225

Misses: 4 Hits: 5

1 1 29 29 71 29 28 33 29 29 29

Memory Instructions: ... SB $1 → M[ 5 ] LB $2 ← M[ 10 ] SB $1 → M[ 5 ] SB $1 → M[ 10 ] SB $1 → M[ 5 ] SB $1 → M[ 10 ] lru V tag data M M Hit M M Hit Hit Hit Hit

1

slide-97
SLIDE 97

Summary: Write Through

Write-through policy with write allocate

  • Cache miss: read entire block from memory
  • Write: write only updated item to memory
  • Eviction: no need to write to memory
slide-98
SLIDE 98

Next Goal: Write-Through vs. Write-Back

Can we also design the cache NOT to write all stores immediately to memory?

– Keep the current copy in cache, and update memory when data is evicted (write-back policy) – Write-back all evicted lines?

  • No, only written-to blocks
slide-99
SLIDE 99

Write-Back Meta-Data (Valid, Dirty Bits)

  • V = 1 means the line has valid data
  • D = 1 means the bytes are newer than main memory
  • When allocating line:

– Set V = 1, D = 0, fill in Tag and Data

  • When writing line:

– Set D = 1

  • When evicting line:

– If D = 0: just set V = 0
– If D = 1: write back Data, then set D = 0, V = 0

V D Tag Byte 1 Byte 2 … Byte N
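The valid/dirty transitions listed above can be sketched directly in C. This is a minimal illustration; the names `Line`, `allocate`, `line_write`, and `evict` are illustrative, and the actual write-back of data to memory is elided.

```c
/* Write-back line metadata (sketch): V = valid, D = dirty. */
typedef struct { int V, D, tag; unsigned char data[2]; } Line;

void allocate(Line *l, int tag) {     /* on allocation: V=1, D=0 */
    l->V = 1; l->D = 0; l->tag = tag; /* (block fill not shown)  */
}

void line_write(Line *l, int off, unsigned char b) {
    l->data[off] = b;                 /* on write: set D=1        */
    l->D = 1;
}

int evict(Line *l) {                  /* returns 1 iff data must  */
    int must_write_back = (l->V && l->D); /* be written back      */
    /* if D == 1: write data to memory first (not shown) */
    l->D = 0; l->V = 0;
    return must_write_back;
}
```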

slide-100
SLIDE 100

Write-back Example

  • Example: How does a write-back cache work?
  • Assume write-allocate
slide-101
SLIDE 101

Handling Stores (Write-Back)

16-byte, byte-addressed memory
4-byte, fully-associative cache: 2-byte blocks, write-allocate
4-bit addresses: 3-bit tag, 1-bit offset
Trace: LB $1 ← M[1]; LB $2 ← M[7]; SB $2 → M[0]; SB $1 → M[5]; LB $2 ← M[10]; SB $1 → M[5]; SB $1 → M[10]
Misses: 0  Hits: 0
[Figure: cache state (lru, V, D, tag, data), register file, and memory contents]

slide-102
SLIDE 102

Write-Back (REF 1)

Trace: LB $1 ← M[1]; LB $2 ← M[7]; SB $2 → M[0]; SB $1 → M[5]; LB $2 ← M[10]; SB $1 → M[5]; SB $1 → M[10]
Misses: 0  Hits: 0
[Figure: cache state (lru, V, D, tag, data), register file, and memory contents]

slide-103
SLIDE 103

Write-Back (REF 1)

Trace: LB $1 ← M[1]; LB $2 ← M[7]; SB $2 → M[0]; SB $1 → M[5]; LB $2 ← M[10]; SB $1 → M[5]; SB $1 → M[10]
Outcomes so far: M  (current ref Addr: 0001)
Misses: 1  Hits: 0
[Figure: cache state (lru, V, D, tag, data), register file, and memory contents]

slide-104
SLIDE 104

Write-Back (REF 1)

Trace: LB $1 ← M[1]; LB $2 ← M[7]; SB $2 → M[0]; SB $1 → M[5]; LB $2 ← M[10]; SB $1 → M[5]; SB $1 → M[10]
Outcomes so far: M
Misses: 1  Hits: 0
[Figure: cache state (lru, V, D, tag, data), register file, and memory contents]

slide-105
SLIDE 105

Write-Back (REF 2)

Trace: LB $1 ← M[1]; LB $2 ← M[7]; SB $2 → M[0]; SB $1 → M[5]; LB $2 ← M[10]; SB $1 → M[5]; SB $1 → M[10]
Outcomes so far: M
Misses: 1  Hits: 0
[Figure: cache state (lru, V, D, tag, data), register file, and memory contents]

slide-106
SLIDE 106

Write-Back (REF 2)

Trace: LB $1 ← M[1]; LB $2 ← M[7]; SB $2 → M[0]; SB $1 → M[5]; LB $2 ← M[10]; SB $1 → M[5]; SB $1 → M[10]
Outcomes so far: M, M  (current ref Addr: 0111)
Misses: 2  Hits: 0
[Figure: cache state (lru, V, D, tag, data), register file, and memory contents]

slide-107
SLIDE 107

Write-Back (REF 3)

Trace: LB $1 ← M[1]; LB $2 ← M[7]; SB $2 → M[0]; SB $1 → M[5]; LB $2 ← M[10]; SB $1 → M[5]; SB $1 → M[10]
Outcomes so far: M, M
Misses: 2  Hits: 0
[Figure: cache state (lru, V, D, tag, data), register file, and memory contents]

slide-108
SLIDE 108

Write-Back (REF 3)

Trace: LB $1 ← M[1]; LB $2 ← M[7]; SB $2 → M[0]; SB $1 → M[5]; LB $2 ← M[10]; SB $1 → M[5]; SB $1 → M[10]
Outcomes so far: M, M, Hit  (current ref Addr: 0000)
Misses: 2  Hits: 1
[Figure: cache state (lru, V, D, tag, data), register file, and memory contents]

slide-109
SLIDE 109

Write-Back (REF 4)

Trace: LB $1 ← M[1]; LB $2 ← M[7]; SB $2 → M[0]; SB $1 → M[5]; LB $2 ← M[10]; SB $1 → M[5]; SB $1 → M[10]
Outcomes so far: M, M, Hit
Misses: 2  Hits: 1
[Figure: cache state (lru, V, D, tag, data), register file, and memory contents]

slide-110
SLIDE 110

Write-Back (REF 4)

Trace: LB $1 ← M[1]; LB $2 ← M[7]; SB $2 → M[0]; SB $1 → M[5]; LB $2 ← M[10]; SB $1 → M[5]; SB $1 → M[10]
Outcomes so far: M, M, Hit, M  (current ref Addr: 0101)
Misses: 3  Hits: 1
[Figure: cache state (lru, V, D, tag, data), register file, and memory contents]

slide-111
SLIDE 111

Write-Back (REF 5)

Trace: LB $1 ← M[1]; LB $2 ← M[7]; SB $2 → M[0]; SB $1 → M[5]; LB $2 ← M[10]; SB $1 → M[5]; SB $1 → M[10]
Outcomes so far: M, M, Hit, M  (current ref Addr: 1010)
Misses: 3  Hits: 1
[Figure: cache state (lru, V, D, tag, data), register file, and memory contents]

slide-112
SLIDE 112

Write-Back (REF 5)

Trace: LB $1 ← M[1]; LB $2 ← M[7]; SB $2 → M[0]; SB $1 → M[5]; LB $2 ← M[10]; SB $1 → M[5]; SB $1 → M[10]
Outcomes so far: M, M, Hit, M  (current ref Addr: 1010)
Misses: 3  Hits: 1
[Figure: cache state (lru, V, D, tag, data), register file, and memory contents]

slide-113
SLIDE 113

Write-Back (REF 5)

Trace: LB $1 ← M[1]; LB $2 ← M[7]; SB $2 → M[0]; SB $1 → M[5]; LB $2 ← M[10]; SB $1 → M[5]; SB $1 → M[10]
Outcomes so far: M, M, Hit, M, M  (current ref Addr: 1010)
Misses: 4  Hits: 1
[Figure: cache state (lru, V, D, tag, data), register file, and memory contents]

slide-114
SLIDE 114

Write-Back (REF 6)

Trace: LB $1 ← M[1]; LB $2 ← M[7]; SB $2 → M[0]; SB $1 → M[5]; LB $2 ← M[10]; SB $1 → M[5]; SB $1 → M[10]
Outcomes so far: M, M, Hit, M, M  (current ref Addr: 0101)
Misses: 4  Hits: 1
[Figure: cache state (lru, V, D, tag, data), register file, and memory contents]

slide-115
SLIDE 115

Write-Back (REF 6)

Trace: LB $1 ← M[1]; LB $2 ← M[7]; SB $2 → M[0]; SB $1 → M[5]; LB $2 ← M[10]; SB $1 → M[5]; SB $1 → M[10]
Outcomes so far: M, M, Hit, M, M, Hit
Misses: 4  Hits: 2
[Figure: cache state (lru, V, D, tag, data), register file, and memory contents]

slide-116
SLIDE 116

Write-Back (REF 7)

Trace: LB $1 ← M[1]; LB $2 ← M[7]; SB $2 → M[0]; SB $1 → M[5]; LB $2 ← M[10]; SB $1 → M[5]; SB $1 → M[10]
Outcomes so far: M, M, Hit, M, M, Hit
Misses: 4  Hits: 2
[Figure: cache state (lru, V, D, tag, data), register file, and memory contents]

slide-117
SLIDE 117

Write-Back (REF 7)

Trace: LB $1 ← M[1]; LB $2 ← M[7]; SB $2 → M[0]; SB $1 → M[5]; LB $2 ← M[10]; SB $1 → M[5]; SB $1 → M[10]
Outcomes so far: M, M, Hit, M, M, Hit, Hit
Misses: 4  Hits: 3
[Figure: cache state (lru, V, D, tag, data), register file, and memory contents]

slide-118
SLIDE 118

Write-Back (REF 8,9)

Trace: …; SB $1 → M[5]; LB $2 ← M[10]; SB $1 → M[5]; SB $1 → M[10]; SB $1 → M[5]; SB $1 → M[10]
Outcomes so far: M, M, Hit, M, M, Hit, Hit
Misses: 4  Hits: 3
[Figure: cache state (lru, V, D, tag, data), register file, and memory contents]

slide-119
SLIDE 119

Write-Back (REF 8,9)

Trace: …; SB $1 → M[5]; LB $2 ← M[10]; SB $1 → M[5]; SB $1 → M[10]; SB $1 → M[5]; SB $1 → M[10]
Outcomes so far: M, M, Hit, M, M, Hit, Hit, Hit, Hit
Misses: 4  Hits: 5
[Figure: cache state (lru, V, D, tag, data), register file, and memory contents]

slide-120
SLIDE 120

How Many Memory References?

Write-back performance

  • How many reads?
  • How many writes?
slide-121
SLIDE 121

Write-back vs. Write-through Example

Assume: large associative cache, 16-byte lines; A and B are n-word arrays

Loop 1: for (i = 1; i < n; i++) A[0] += A[i];
  • Write-through: n/16 reads, n writes
  • Write-back: n/16 reads, 1 write

Loop 2: for (i = 0; i < n; i++) B[i] = A[i];
  • Write-through: 2 × n/16 reads, n writes
  • Write-back: 2 × n/16 reads, n writes

slide-122
SLIDE 122

So is write back just better?

Short Answer: Yes (fewer writes is a good thing)
Long Answer: It’s complicated.

  • Evictions require the entire line to be written back to memory (vs. just the data that was written)
  • Write-back can lead to incoherent caches on multi-core processors (later lecture)

slide-123
SLIDE 123

Cache Conscious Programming

// H = 12, W = 10
int A[H][W];
for (x = 0; x < W; x++)
    for (y = 0; y < H; y++)
        sum += A[y][x];

  • Every access a cache miss!
  • (unless the entire matrix fits in cache)

[Figure: column-major traversal order 1, 2, 3, … 12 down each column]

slide-124
SLIDE 124

Cache Conscious Programming

// H = 12, W = 10
int A[H][W];
for (y = 0; y < H; y++)
    for (x = 0; x < W; x++)
        sum += A[y][x];

  • Block size = 4 → 75% hit rate
  • Block size = 8 → 87.5% hit rate
  • Block size = 16 → 93.75% hit rate
  • And you can easily prefetch to warm the cache

[Figure: row-major traversal order 1, 2, 3, … across each row]
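The hit rates above follow from simple arithmetic: in a sequential (row-major) traversal, each block of B consecutive elements misses once (the first touch) and hits the remaining B − 1 times. A minimal sketch, assuming block size is measured in array elements:

```c
/* Hit rate of a sequential traversal: one miss per block of
 * block_size consecutive elements, hits on the rest. */
double seq_hit_rate(int block_size) {
    return (block_size - 1) / (double)block_size;
}
```

For block sizes 4, 8, and 16 this gives exactly the 75%, 87.5%, and 93.75% figures on the slide.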

slide-125
SLIDE 125

By the end of the cache lectures…

slide-126
SLIDE 126
slide-127
SLIDE 127

Summary

  • Memory performance matters!
    – often more than CPU performance
    – … because it is the bottleneck, and not improving much
    – … because most programs move a LOT of data
  • Design space is huge
    – Gambling against program behavior
    – Cuts across all layers: users → programs → OS → hardware
  • NEXT: Multi-core processors are complicated
    – Inconsistent views of memory
    – Extremely complex protocols, very hard to get right