  1. Advanced cache memory optimizations. Computer Architecture. J. Daniel García Sánchez (coordinator), David Expósito Singh, Francisco Javier García Blas. ARCOS Group, Computer Science and Engineering Department, University Carlos III of Madrid. cbed – Computer Architecture – ARCOS Group – http://www.arcos.inf.uc3m.es

  2. Outline: 1 Introduction. 2 Advanced optimizations. 3 Conclusion.

  3. Introduction: Why do we use caching? To overcome the memory wall: from 1980 to 2010, processor performance improved by orders of magnitude more than memory performance, and since 2005 the situation has worsened with emerging multi-core architectures. Caching reduces both data and instruction access times, making the average memory access time closer to the cache access time while offering the illusion of a cache whose size approaches main memory size. It works because of the principle of locality.
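A tiny illustration of the principle of locality (my own example, not from the slides): a simple reduction loop exhibits both kinds of locality that caches exploit.

```cpp
#include <cstddef>
#include <vector>

// Summing an array touches consecutive addresses (spatial locality) and
// reuses the accumulator on every iteration (temporal locality), so after
// each block's first miss almost all accesses hit in the cache.
double sum(const std::vector<double>& v) {
    double acc = 0.0;                 // reused every iteration: temporal locality
    for (std::size_t i = 0; i < v.size(); ++i)
        acc += v[i];                  // consecutive elements: spatial locality
    return acc;
}
```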

  4. Introduction: Average memory access time. 1 level: t = t_h(L1) + m_L1 × t_p(L1). 2 levels: t = t_h(L1) + m_L1 × (t_h(L2) + m_L2 × t_p(L2)). 3 levels: t = t_h(L1) + m_L1 × (t_h(L2) + m_L2 × (t_h(L3) + m_L3 × t_p(L3))). Here t_h is the hit time, m the miss rate, and t_p the miss penalty of each level.
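A minimal sketch of how these formulas nest, assuming purely illustrative hit times, miss rates, and last-level miss penalty (none of the numbers come from the slides):

```cpp
#include <cstdio>
#include <vector>

// Average memory access time for a multi-level hierarchy, following the
// recurrence above: t(Li) = t_h(Li) + m_Li * t(Li+1), where the cost beyond
// the last level is its miss penalty t_p.
struct Level { double hit_time; double miss_rate; };

double amat(const std::vector<Level>& levels, double last_level_penalty) {
    double t = last_level_penalty;                 // cost paid past the last level
    for (auto it = levels.rbegin(); it != levels.rend(); ++it)
        t = it->hit_time + it->miss_rate * t;      // t_h + m * (cost of next level)
    return t;
}

int main() {
    std::vector<Level> hierarchy = {{4, 0.10}, {12, 0.40}, {40, 0.30}};  // L1, L2, L3
    std::printf("AMAT = %.1f cycles\n", amat(hierarchy, 200.0));         // prints 9.2
}
```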

  5. Introduction: Basic optimizations. 1. Increase block size. 2. Increase cache size. 3. Increase associativity. 4. Introduce multi-level caches. 5. Give priority to read misses. 6. Avoid address translation during indexing.

  6. Introduction: Advanced optimizations. Metrics to be decreased: hit time, miss rate, miss penalty. Metric to be increased: cache bandwidth. Observation: every advanced optimization aims to improve at least one of these metrics.

  7. Outline: 1 Introduction. 2 Advanced optimizations. 3 Conclusion.

  8. Advanced optimizations, subsections: Small and simple caches. Way prediction. Pipelined access to cache. Non-blocking caches. Multi-bank caches. Critical word first and early restart. Write buffer merge. Compiler optimizations. Hardware prefetching.

  9. Small and simple caches: Small caches. Lookup procedure: select a line using the index, read the line tag, and compare it with the address tag. Lookup time grows as the cache grows. A smaller cache allows simpler lookup hardware and fits better on the processor chip, so a small cache improves lookup time.
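As a reference for the lookup steps above, here is a minimal sketch of how an address is split into tag, index, and block offset; the cache geometry (64-byte blocks, 512 lines) is an illustrative assumption, not taken from the slides:

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative direct-mapped geometry: 64-byte blocks, 512 lines (32 KB).
constexpr uint64_t kBlockSize = 64;   // 6 offset bits
constexpr uint64_t kNumLines  = 512;  // 9 index bits

int main() {
    uint64_t addr   = 0x12345678;
    uint64_t offset = addr % kBlockSize;               // byte within the block
    uint64_t index  = (addr / kBlockSize) % kNumLines; // selects the cache line
    uint64_t tag    = addr / (kBlockSize * kNumLines); // compared with the stored tag
    std::printf("tag=%#llx index=%llu offset=%llu\n",
                (unsigned long long)tag, (unsigned long long)index,
                (unsigned long long)offset);
}
```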

  10. Small and simple caches: Simple caches. Cache simplification: use mapping mechanisms that are as simple as possible. Direct mapping allows tag comparison and data transfer to be performed in parallel. Observation: most modern processors focus more on keeping caches small than on simplifying them.

  11. Small and simple caches: Intel Core i7. L1 cache (one per core): 32 KB instructions plus 32 KB data; latency 3 cycles; 4-way (instructions) and 8-way (data) set associative. L2 cache (one per core): 256 KB; latency 9 cycles; 8-way set associative. L3 cache (shared): 8 MB; latency 39 cycles; 16-way set associative.
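Plugging these latencies into the three-level formula from slide 4, with purely hypothetical miss rates (m_L1 = 0.05, m_L2 = 0.30, m_L3 = 0.20) and an assumed 200-cycle main-memory penalty: t = 3 + 0.05 × (9 + 0.30 × (39 + 0.20 × 200)) ≈ 4.6 cycles, which illustrates how effectively the hierarchy hides main-memory latency.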

  12. Advanced optimizations: Way prediction.

  13. Way prediction. Problem: direct mapping is fast but causes many misses; set-associative mapping causes fewer misses but has a slower hit time, since several ways must be searched. Way prediction: additional bits are stored to predict the way to be selected on the next access, so the predicted block is fetched and compared against a single tag; if that comparison misses, the remaining tags are compared.
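A minimal sketch of the idea for one two-way set, assuming a single prediction field per set (block data and replacement are omitted, and this is not any particular processor's implementation):

```cpp
#include <array>
#include <cstdint>
#include <optional>

// One two-way set with a prediction field: the fast path compares only the
// predicted way's tag; the slow path checks the other way and retrains the
// prediction on a "second chance" hit.
struct Way { uint64_t tag = 0; bool valid = false; };

struct PredictedSet {
    std::array<Way, 2> ways{};
    unsigned predicted_way = 0;   // the extra prediction bits

    std::optional<unsigned> lookup(uint64_t tag) {
        unsigned w = predicted_way;
        if (ways[w].valid && ways[w].tag == tag)
            return w;                                   // fast hit: one comparison
        unsigned other = 1u - w;
        if (ways[other].valid && ways[other].tag == tag) {
            predicted_way = other;                      // retrain the predictor
            return other;                               // slower hit: extra comparison
        }
        return std::nullopt;                            // miss in both ways
    }
};
```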

  14. Advanced optimizations: Pipelined access to cache.

  15. Pipelined access to cache. Goal: improve cache bandwidth. Solution: pipeline the cache access over multiple clock cycles. Effects: the clock cycle can be shortened, a new access can be initiated every clock cycle, and cache bandwidth increases, at the cost of increased latency. This has a positive effect in superscalar processors.

  16. Advanced optimizations: Non-blocking caches.

  17. Non-blocking caches. Problem: a cache miss stalls the processor until the block is obtained. With out-of-order execution the processor can keep working during a miss, but then the cache must keep accepting accesses while the miss is resolved. Hit under miss: allow accesses that hit while waiting for an outstanding miss; this reduces the miss penalty. Hit under several misses / miss under miss: allow overlapped misses; this needs a multi-channel memory and is highly complex.
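A minimal sketch of the bookkeeping this requires, assuming a small table of outstanding misses (MSHR-like registers); the structure, its capacity, and the result codes are illustrative and not taken from the slides:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

// Tracks which blocks are cached and which misses are outstanding, so hits
// (and further misses) can be served while earlier misses are in flight.
class NonBlockingCache {
public:
    enum class Result { Hit, MissIssued, MissMerged, Stall };

    Result access(uint64_t block_addr) {
        if (present_.count(block_addr))
            return Result::Hit;                    // hit under miss: served immediately
        if (pending_.count(block_addr)) {
            ++pending_[block_addr];                // merge with an outstanding miss
            return Result::MissMerged;
        }
        if (pending_.size() >= kMaxOutstanding)
            return Result::Stall;                  // no free miss register: stall
        pending_[block_addr] = 1;                  // issue a new miss to memory
        return Result::MissIssued;
    }

    void fill(uint64_t block_addr) {               // memory delivered the block
        pending_.erase(block_addr);
        present_.insert(block_addr);
    }

private:
    static constexpr std::size_t kMaxOutstanding = 4;   // hypothetical capacity
    std::unordered_set<uint64_t> present_;               // blocks currently in the cache
    std::unordered_map<uint64_t, int> pending_;          // outstanding misses
};
```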

  18. Advanced optimizations: Multi-bank caches.

  19. Multi-bank caches. Goal: allow simultaneous accesses to different cache locations. Solution: divide the cache into independent banks. Effect: bandwidth is increased. Example: the Sun Niagara L2 has 4 banks.

  20. Multi-bank caches: Bandwidth. To increase bandwidth, accesses must be distributed across the banks. Simple approach: sequential interleaving, a round-robin of block addresses across banks. With four banks: Bank 0 holds blocks 0, 4, 8, 12; Bank 1 holds blocks 1, 5, 9, 13; Bank 2 holds blocks 2, 6, 10, 14; Bank 3 holds blocks 3, 7, 11, 15.
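The interleaving rule is just a modulo on the block address; a minimal sketch with the slide's four banks (the 64-byte block size is an illustrative assumption):

```cpp
#include <cstdint>
#include <cstdio>

constexpr uint64_t kBlockSize = 64;  // bytes per block (assumed)
constexpr uint64_t kNumBanks  = 4;   // as in the slide's example

int main() {
    // Reproduce the table: block addresses 0..15 assigned round-robin to banks.
    for (uint64_t block = 0; block < 16; ++block)
        std::printf("block %2llu -> bank %llu\n",
                    (unsigned long long)block,
                    (unsigned long long)(block % kNumBanks));
    // For a byte address, first compute the block address: addr / kBlockSize.
}
```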

  21. Advanced optimizations: Critical word first and early restart.

  22. Critical word first and early restart. Observation: the processor usually needs a single word to proceed. Solution: do not wait until the whole block has been transferred from memory. Alternatives: Critical word first: request the block so that the word the processor needs is transferred first, and the rest of the block follows. Early restart: the block is received in its normal order, but the processor proceeds as soon as the requested word arrives. Effects: the benefit depends on block size; the larger the block, the larger the benefit.
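A minimal sketch of the transfer order under critical word first, assuming 8-word blocks and a wrap-around order after the requested word (both are illustrative assumptions):

```cpp
#include <cstdio>

constexpr int kWordsPerBlock = 8;   // assumed block size in words

int main() {
    int critical = 5;               // word the processor is waiting for
    for (int i = 0; i < kWordsPerBlock; ++i) {
        int word = (critical + i) % kWordsPerBlock;   // wrap around the block
        std::printf("transfer word %d%s\n", word,
                    i == 0 ? "  <- processor can restart here" : "");
    }
}
```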
