General Cache Mechanics (PowerPoint Presentation)




SLIDE 1

Memory Hierarchy: Cache

  • Memory hierarchy
  • Cache basics
  • Locality
  • Cache organization
  • Cache-aware programming

General Cache Mechanics

[Diagram: memory blocks 0-15; cache holds blocks 8, 9, 14, 3]

Cache: smaller, faster, more expensive. Stores a subset of the memory's blocks (lines).
Memory: larger, slower, cheaper. Partitioned into blocks (lines).
Data is moved between cache and memory in block units.
Block: the unit of data in cache and memory (a.k.a. line).

Cache Hit

[Diagram: memory blocks 0-15; cache holds blocks 8, 9, 14, 3]

  • 1. Request data in block b (here, Request: 14).
  • 2. Cache hit: block b is in the cache, so the data is returned without accessing memory.

Cache Miss

[Diagram: memory blocks 0-15; cache holds blocks 8, 9, 14, 3]

  • 1. Request data in block b (here, Request: 12).
  • 2. Cache miss: block 12 is not in the cache.
  • 3. Cache eviction: evict a block to make room (here block 9), maybe storing it to memory.
    Placement policy: where to put the block in the cache.
    Replacement policy: which block to evict.
  • 4. Cache fill: fetch block 12 from memory, store it in the cache, and satisfy the request.

SLIDE 2

Locality #1

Data:
  • Temporal: sum referenced in each iteration.
  • Spatial: array a[] accessed in stride-1 pattern.

Instructions:
  • Temporal: the loop is executed repeatedly.
  • Spatial: instructions are executed in sequence.

Assessing locality in code is an important programming skill.


What is stored in memory?

sum = 0;
for (i = 0; i < n; i++) {
    sum += a[i];
}
return sum;

Locality #2


[Diagram: row-major layout of a[3][4] in memory: a[0][0] a[0][1] a[0][2] a[0][3] a[1][0] ... a[2][3]; access order 1: a[0][0], 2: a[0][1], ..., 12: a[2][3]]

stride 1; row-major M x N 2D array in C

int sum_array_rows(int a[M][N]) {
    int sum = 0;
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            sum += a[i][j];
        }
    }
    return sum;
}

Locality #3


int sum_array_cols(int a[M][N]) {
    int sum = 0;
    for (int j = 0; j < N; j++) {
        for (int i = 0; i < M; i++) {
            sum += a[i][j];
        }
    }
    return sum;
}

stride N; row-major M x N 2D array in C

[Diagram: same row-major layout of a[3][4]; access order 1: a[0][0], 2: a[1][0], 3: a[2][0], 4: a[0][1], ..., 12: a[2][3]]

Locality #4

What is "wrong" with this code? How can it be fixed?


int sum_array_3d(int a[M][N][N]) {
    int sum = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            for (int k = 0; k < M; k++) {
                sum += a[k][i][j];
            }
        }
    }
    return sum;
}

SLIDE 3

Cache Performance Metrics

Miss Rate

Fraction of memory accesses to data not in cache (misses / accesses)

Typically: 3% - 10% for L1; maybe < 1% for L2, depending on size, etc.

Hit Time

Time to find and deliver a block in the cache to the processor.

Typically: 1 - 2 clock cycles for L1; 5 - 20 clock cycles for L2

Miss Penalty

Additional time required on cache miss = main memory access time

Typically 50 - 200 cycles for L2 (trend: increasing!)


Memory Hierarchy: why does it work?

From small, fast, power-hungry, expensive (top) to large, slow, power-efficient, cheap (bottom):
  • registers (explicitly program-controlled)
  • L1 cache (SRAM, on-chip)
  • L2 cache (SRAM, on-chip)
  • L3 cache (SRAM, off-chip)
  • main memory (DRAM)
  • persistent storage (hard disk, flash, over network, cloud, etc.)

Cache Organization: Key Points

Block: fixed-size unit of data in memory/cache.

Placement policy: where in the cache should a given block be stored?
  • direct-mapped, set associative

Replacement policy: what if there is no room in the cache for requested data?
  • least recently used, most recently used

Write policy: when should writes update lower levels of the memory hierarchy?
  • write back, write through, write allocate, no write allocate

Blocks

Divide the address space into fixed-size, aligned blocks (block size is a power of 2).

For a full byte address:
  • Block ID = address bits - offset bits
  • Offset within block = low log2(block size) bits

Example: block size = 8. Block boundaries fall at byte addresses 00000000, 00001000, 00010000, 00011000, ... Address 00010010 lies in block 2, which spans bytes 00010000 - 00010111. (Remember withinSameBlock? from the Pointers Lab.)

... Note: drawing address order differently from here on!

SLIDE 4

Placement: Direct-Mapped

[Diagram: memory block IDs 0000 - 1111 map into a cache with S = # slots = 4, indices 00, 01, 10, 11]

Mapping: index(Block ID) = Block ID mod S.
(Easy when S is a power of 2: the index is just the low bits of the Block ID.)

Placement: Tags resolve ambiguity

[Diagram: memory block IDs 0000 - 1111 map into a 4-slot cache; each slot stores a tag next to its data. Tags shown: 00, 11, 01, 01]

The Block ID bits not used for the index are stored as the tag, which disambiguates which block currently occupies a slot. Mapping: index(Block ID) = Block ID mod S.

Address = Tag, Index, Offset

Full byte address (example: 00010010):
  • Block ID = address bits - offset bits.
  • Offset: b = log2(block size) bits. Where within a block?
  • Index: s = log2(# cache slots) bits. What slot in the cache?
  • Tag: Block ID bits - index bits, i.e., (a - s - b) bits of an a-bit address. Disambiguates slot contents.

a-bit address = [ Tag: (a - s - b) bits | Index: s bits | Offset: b bits ]

A puzzle.

Cache starts empty. Access (address, hit/miss) stream: (10, miss), (11, hit), (12, miss) What could the block size be?


Block size >= 2 bytes: 11 hit, so 10 and 11 share a block. Block size < 8 bytes: 12 missed, so 12 is not in 10's block. Since block sizes are powers of 2, the block size is 2 or 4 bytes.

SLIDE 5

Placement: direct mapping conflicts

What happens when accessing in repeated pattern: 0010, 0110, 0010, 0110, 0010...?

[Diagram: block IDs 0000 - 1111 and a 4-slot cache; blocks 0010 and 0110 both map to index 10]

cache conflict

Every access suffers a miss, evicts cache line needed by next access.

Placement: Set Associative

Ways of organizing the same 8 block slots:
  • 1-way: 8 sets, 1 block each (direct mapped)
  • 2-way: 4 sets, 2 blocks each
  • 4-way: 2 sets, 4 blocks each
  • 8-way: 1 set, 8 blocks (fully associative)

Mapping: index(Block ID) = Block ID mod S, where S = # sets in the cache.

One index per set of block slots. A block may be stored in any slot within its set.

Replacement policy: if the set is full, which block should be replaced? Common: least recently used (LRU), but hardware usually implements "not most recently used".

Example: Tag, Index, Offset?

4-bit address. Direct-mapped: 4 slots, 2-byte blocks. Address fields: | Tag | Index | Offset |

tag bits ____   set index bits ____   block offset bits ____
index(1101) = ____

Example: Tag, Index, Offset?

16-bit address. E-way set-associative, 16-byte blocks. Address fields: | Tag | Index | Offset |

  • E = 1-way, S = 8 sets: tag bits ____  set index bits ____  block offset bits ____  index(0x1833) ____
  • E = 2-way, S = 4 sets: tag bits ____  set index bits ____  block offset bits ____  index(0x1833) ____
  • E = 4-way, S = 2 sets: tag bits ____  set index bits ____  block offset bits ____  index(0x1833) ____

SLIDE 6

Replacement Policy

If set is full, what block should be replaced?

Common: least recently used (LRU), but hardware usually implements "not most recently used".

Another puzzle: Cache starts empty, uses LRU. Access (address, hit/miss) stream (10, miss); (12, miss); (10, miss)


What is the associativity of the cache? 12 is not in the same block as 10 (the stream never hits), and 12's block replaced 10's block. With LRU and 2 or more ways, 12 would have filled an empty way and the second access to 10 would have hit, so this must be a direct-mapped cache.

General Cache Organization (S, E, B)


S sets, E lines per set ("E-way"). Each cache line (block slot) holds a valid bit, a tag, and B = 2^b bytes of data (the data block, bytes 0 .. B-1).

cache capacity: S x E x B data bytes
address size: t + s + b address bits

S, E, and B are powers of 2.

Cache Read


E = 2^e lines per set, S = 2^s sets, B = 2^b bytes of data per cache line (the data block), plus a valid bit per line.

Address of byte in memory: t tag bits, s set index bits, b block offset bits. The requested data begins at this offset within the block.

  • 1. Locate the set by index.
  • 2. Hit if any line in the set is valid and has a matching tag.
  • 3. Get the data at the offset in the block.

Direct-Mapped Cache Practice

12-bit address. 16 lines, 4-byte block size. Direct mapped.

Index | Valid | Tag | B0 B1 B2 B3
  0   |   1   | 19  | 99 11 23 11
  1   |   0   | 15  |  –  –  –  –
  2   |   1   | 1B  | 00 02 04 08
  3   |   0   | 36  |  –  –  –  –
  4   |   1   | 32  | 43 6D 8F 09
  5   |   1   | 0D  | 36 72 F0 1D
  6   |   0   | 31  |  –  –  –  –
  7   |   1   | 16  | 11 C2 DF 03
  8   |   1   | 24  | 3A 00 51 89
  9   |   0   | 2D  |  –  –  –  –
  A   |   1   | 2D  | 93 15 DA 3B
  B   |   0   | 0B  |  –  –  –  –
  C   |   0   | 12  |  –  –  –  –
  D   |   1   | 16  | 04 96 34 15
  E   |   1   | 13  | 83 77 1B D3
  F   |   0   | 14  |  –  –  –  –

Offset bits? Index bits? Tag bits? Then look up addresses 0x354 and 0xA20.

SLIDE 7

Example (E = 1)

int sum_array_rows(double a[16][16]) {
    double sum = 0;
    for (int r = 0; r < 16; r++) {
        for (int c = 0; c < 16; c++) {
            sum += a[r][c];
        }
    }
    return sum;
}

int sum_array_cols(double a[16][16]) {
    double sum = 0;
    for (int c = 0; c < 16; c++) {
        for (int r = 0; r < 16; r++) {
            sum += a[r][c];
        }
    }
    return sum;
}

Assume: cold (empty) cache; 3-bit set index, 5-bit offset (32 bytes = 4 doubles per block). Locals in registers. Assume a is aligned such that &a[r][c] is aa...a rrrr cccc 000, which splits as tag aa...arrr, index rcc, offset cc000.

[Diagram: each 32-byte block (4 doubles) holds a[r][4k..4k+3]; 4 blocks per row of the array]

sum_array_rows: 4 misses per row of the array; 4 * 16 = 64 misses.
sum_array_cols: every access is a miss; 16 * 16 = 256 misses.

a[0][0]: aa...a000 000 00000
a[0][4]: aa...a000 001 00000
a[1][0]: aa...a000 100 00000
a[2][0]: aa...a001 000 00000

Example (E = 1)

int dotprod(int x[8], int y[8]) {
    int sum = 0;
    for (int i = 0; i < 8; i++) {
        sum += x[i]*y[i];
    }
    return sum;
}

If x and y are mutually aligned (e.g., at 0x00 and 0x80), x's block and y's block map to the same set and evict each other on every access: thrashing.
If x and y are mutually unaligned (e.g., at 0x00 and 0xA0), their blocks map to different sets and both stay resident.

block = 16 bytes; 8 sets in cache. How many block offset bits? How many set index bits?
Address bits: ttt....t sss bbbb
  B = 16 = 2^b: b = 4 offset bits
  S = 8 = 2^s: s = 3 index bits
Addresses as bits:
  0x00000000: 000....0 000 0000
  0x00000080: 000....1 000 0000
  0x000000A0: 000....1 010 0000

16 bytes = 4 ints

Example (E = 2)

float dotprod(float x[8], float y[8]) {
    float sum = 0;
    for (int i = 0; i < 8; i++) {
        sum += x[i]*y[i];
    }
    return sum;
}

[Diagram: 2-way cache, 4 sets, 2 blocks/lines per set]

If x and y are aligned (e.g., &x[0] = 0, &y[0] = 128), their blocks map to the same set but can still both fit, because each set has space for two blocks/lines.

Writing to cache

Multiple copies of the data exist and must be kept in sync.

Write-hit policy:
  • Write-through: update memory immediately on every write.
  • Write-back: update memory only when the block is evicted; needs a dirty bit.

Write-miss policy:
  • Write-allocate: fetch the block into the cache, then write.
  • No-write-allocate: write directly to memory, bypassing the cache.

Typical caches:
  • Write-back + write-allocate, usually.
  • Write-through + no-write-allocate, occasionally.


SLIDE 8

Write-back, write-allocate example


[Diagram: cache holds block U (data 0xCAFE); memory: T = 0xFACE, U = 0xCAFE; each cache line has a tag and a dirty bit]

  • 1. mov $T, %ecx
  • 2. mov $U, %edx
    Cache/memory not involved: these only set registers. Now eax = 0xCAFE, ecx = T, edx = U.
  • 3. mov $0xFEED, (%ecx)
  • a. Miss on T.

Write-back, write-allocate example


[Diagram: memory: T = 0xFACE, U = 0xCAFE; each cache line has a tag and a dirty bit]

  • 1. mov $T, %ecx
  • 2. mov $U, %edx
  • 3. mov $0xFEED, (%ecx)
  • a. Miss on T.
  • b. Evict U (clean: discard).

  • c. Fill T (write-allocate).

  • d. Write T in cache (dirty).
  • 4. mov (%edx), %eax
  • a. Miss on U.

(Cache line: tag T, data 0xFEED, dirty bit 1. Memory: T still 0xFACE. Registers: eax = 0xCAFE, ecx = T, edx = U.)

Write-back, write-allocate example


[Diagram: cache again holds block U (data 0xCAFE); dirty T has been written back to memory. Registers: eax = 0xCAFE, ecx = T, edx = U.]

  • 1. mov $T, %ecx
  • 2. mov $U, %edx
  • 3. mov $0xFEED, (%ecx)
  • a. Miss on T.
  • b. Evict U (clean: discard).

  • c. Fill T (write-allocate).

  • d. Write T in cache (dirty).
  • 4. mov (%edx), %eax
  • a. Miss on U.
  • b. Evict T (dirty: write back).

  • c. Fill U.

  • d. Set %eax.
  • 5. DONE.

Final memory: T = 0xFEED, U = 0xCAFE.