Caches Electronic Computers M Caches 1 Cache LOCALITY PRINCIPLE - PDF document

Caches
Electronic Computers M

Cache

LOCALITY PRINCIPLE (SPATIAL AND TEMPORAL), WORKING SET

Memory hierarchy: CPU registers, level 1 cache, level 2 cache, level 3 cache, memory, disk, tape

• The cache is a memory whose access time is some orders of magnitude shorter than that of the main memory, BUT whose size is much smaller. It contains a small (see later) replicated portion of the main memory.
• The CPU, when accessing an item (code or data), FIRST tries to find it in the cache (hit); only when the item is not found there (miss) does it access the main memory.
• The cache does not hold single bytes BUT groups of bytes with contiguous addresses (normally 32, 64, 128 or more, and in any case "aligned", that is, starting at an address that is a multiple of the group size): each group is called a «line».

Cache

Line number: the line number (tag) is the complete address of the first byte of the line without its least significant bits (the bits that select a byte within the line, i.e. log2 of the line size), which are all zero because of alignment.

[Figure: memory lines (m, m+1, n, n+2, ...) replicated in the cache. The processor-generated address is split into line number and in-line offset; the line number is used to detect the position of the line in the cache, the offset selects the data within the line.]

• Memory access time: more than 100 clock cycles
• Cache access time: 1 to 4 clock cycles
• Cache line: 32 to 256 bytes of data per line
• Accessed data range: from a single byte to the entire line
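The split of an address into line number and in-line offset can be sketched as follows. This is a minimal illustration, not part of the original slides: the function names and the 64-byte default line size are assumptions chosen for the example.

```python
LINE_SIZE = 64  # bytes per line (assumed for this example; must be a power of two)


def split_address(addr: int, line_size: int = LINE_SIZE) -> tuple[int, int]:
    """Return (line_number, in_line_offset) for a byte address."""
    offset_bits = line_size.bit_length() - 1  # log2 of the line size
    line_number = addr >> offset_bits         # address without its LSBs
    offset = addr & (line_size - 1)           # the LSBs only
    return line_number, offset


def line_base(addr: int, line_size: int = LINE_SIZE) -> int:
    """Address of the first byte of the line containing addr (LSBs forced to zero)."""
    return addr & ~(line_size - 1)


if __name__ == "__main__":
    line_number, offset = split_address(0x1234)
    print(line_number, offset, hex(line_base(0x1234)))
```

Joining the line number with zeroed offset bits gives back the aligned address of the first byte of the line, which is why the offset bits need not be stored with the line.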

Cache

Let's consider a cache line of 32 bytes. The in-line offset (0, 1, 2, ..., 31) is given by the 5 LSBs of the data/instruction address.

[Figure: the cache line consists of bytes 31, 30, ..., 1, 0, addressed by the in-line offset.]

Please notice that the cache offset has nothing to do with the page offset.

• In this cache, for example, the processor can read/write:
   a byte at any offset
   a half word (2 bytes) starting at an even offset
   a word (4 bytes) starting at an offset that is a multiple of 4
   a double word (8 bytes) starting at an offset that is a multiple of 8
   a quad word (16 bytes) starting at an offset that is a multiple of 16
  The size of the readable data depends on the processor parallelism. The data read from or written to the cache MUST be aligned (address multiple of the data size).
• In RISC computers alignment is mandatory for memory accesses too, which is not the case for many CISC computers. There a misaligned datum can straddle two lines, so two consecutive accesses (and therefore two cache accesses) are sometimes needed. Why? Because the most significant part of the address (the line number) must be incremented to reach the second part of the datum.
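The alignment rule and the two-access case can be checked with a few lines of arithmetic. This is a hedged sketch for the 32-byte line of the example; the helper names are hypothetical.

```python
LINE_SIZE = 32  # bytes per line, as in the example above


def is_aligned(addr: int, size: int) -> bool:
    """True if an access of `size` bytes at `addr` starts at a multiple of its size."""
    return addr % size == 0


def lines_touched(addr: int, size: int, line_size: int = LINE_SIZE) -> int:
    """Number of cache lines covered by an access of `size` bytes at `addr`."""
    first_line = addr // line_size
    last_line = (addr + size - 1) // line_size
    return last_line - first_line + 1


if __name__ == "__main__":
    # An aligned word never crosses a line; a misaligned one may.
    print(is_aligned(28, 4), lines_touched(28, 4))
    print(is_aligned(30, 4), lines_touched(30, 4))
```

An aligned access of a power-of-two size no larger than the line always stays within one line, which is why enforcing alignment removes the need for a second access.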

Associative memories (Content Addressable Memories)

• Associative memories: they include BOTH the data lines and the most significant bits of the address of the lowest byte of each line (line number, TAG).
• An item is found not by decoding the CPU address BUT by means of a parallel comparison between the line numbers (TAGs) of all cache lines and the most significant bits of the CPU address. The comparison can be either successful (hit) or not (miss).

[Figure: an associative memory in which each entry stores a line number together with its data.]

Fully associative cache

[Figure: memory lines (line 0, 1, 2, ..., k, ..., w, w+1, ..., z) and a cache whose slots each hold a TAG, a validity bit and a line of 256 bytes.]

Slot   TAG    Validity
0      315    1
1      7225   0
2      7226   1
m      57     1
n      88     1

• Any memory line can be stored in any slot. The TAG is the line number.
• The cache size is always a power of 2, as is the line size.
• For instance: 64 GB memory (36-bit address) and 256-byte lines. In-line offset: 8 bits. Tag = 36 - 8 = 28 bits.

Processor-generated address: line number (28 bits) | in-line offset (8 bits)

The line number is compared with all the cache TAGs. In case of a HIT (and if the validity bit is 1) the requested data is present. The address offset is the position of the first requested byte within the line (the requested data can be a byte, a word, a double word and so on, provided it lies within the line boundary).

This cache organization makes the best use of the cache, but it is very complex since it requires many comparators: if the cache has 1024 slots (in this case the cache size is 256 KB), 1024 comparators of 28 bits each are required, and caches normally have 64K slots or more.

Each cache line also has status bits (2 or more). In this case each slot stores tag, data and status, so the cache memory holds 1024 x (28 + 256 x 8 + 2) bits = 2,127,872 bits.
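The storage figure for this example can be reproduced with a short calculation. The function below is only a sketch of the arithmetic; its name and parameters are chosen for illustration.

```python
def fully_associative_bits(slots: int, address_bits: int,
                           line_bytes: int, status_bits: int) -> int:
    """Total storage in bits: every slot holds a tag, the data line and the status bits."""
    offset_bits = line_bytes.bit_length() - 1  # log2(line size), 8 for 256-byte lines
    tag_bits = address_bits - offset_bits      # the whole line number is the tag
    return slots * (tag_bits + line_bytes * 8 + status_bits)


if __name__ == "__main__":
    # 1024 slots, 36-bit addresses, 256-byte lines, 2 status bits
    print(fully_associative_bits(1024, 36, 256, 2))
```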

Directly mapped cache

[Figure: memory lines (line 0, 1, 2, ..., k, ..., w, w+1, ..., z) and a cache whose slots (0, 1, 2, ..., m, ..., n) each hold a TAG, a validity bit and a line.]

In each cache slot only a subset of all the memory lines can be stored. For instance, slot 0 can hold only the lines whose line number divided by the number of slots gives remainder 0, slot 1 those with remainder 1, and so on. Obviously the initial memory address of the data in each slot is the line number joined with zeroes (LSBs).

For instance: 1 MB main memory, 64-byte lines => 16K different lines. If the cache has 128 slots (the cache size is therefore 128 x 64 bytes = 8 KB), slot 0 can hold lines number 0, 128, 256, etc., slot 1 lines number 1, 129, 257, etc.
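The remainder rule of this example can be written directly. The constants come from the example above; the function name is an assumption made for illustration.

```python
MEMORY_BYTES = 1 << 20   # 1 MB main memory
LINE_BYTES = 64          # bytes per line
SLOTS = 128              # cache slots (128 x 64 bytes = 8 KB)


def slot_for_line(line_number: int, slots: int = SLOTS) -> int:
    """A directly mapped cache stores a line in slot (line number mod number of slots)."""
    return line_number % slots


TOTAL_LINES = MEMORY_BYTES // LINE_BYTES  # 16K lines

if __name__ == "__main__":
    in_slot_0 = [n for n in range(TOTAL_LINES) if slot_for_line(n) == 0]
    in_slot_1 = [n for n in range(TOTAL_LINES) if slot_for_line(n) == 1]
    print(TOTAL_LINES, in_slot_0[:3], in_slot_1[:3], len(in_slot_0))
```

Each slot competes for 16K / 128 = 128 different memory lines, which is the source of the conflicts typical of this organization.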

Directly mapped cache: an example (4-byte lines)

[Figure: a memory of 16 lines (line 0 to line 15) and a directly mapped cache with 4 slots (0 to 3), each with TAG, validity bit and line.]

Directly mapped cache

The LSBs of the line number indicate the only cache slot where the line can be stored. Consider a processor with 36-bit addresses (64 GB) and 256-byte lines (8 offset bits): the line number has 28 bits (how many lines? 2^28 = 2^10 x 2^10 x 2^8). If the cache has 1024 slots (256 KB), the 10 LSBs (2^10 = 1024) of the line number (the index) indicate the slot where a line must be stored.

Processor-generated address: line number (28 bits) | in-line offset (8 bits)
Seen by the cache:           TAG (18 bits) | slot index (10 bits) | in-line offset (8 bits)

Only one 10-bit decoder (to detect the involved slot) and only one 18-bit comparator are needed, but there is very little flexibility.
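The field extraction for this example (18-bit TAG, 10-bit index, 8-bit offset) can be sketched with shifts and masks. The names are illustrative and not taken from the slides.

```python
ADDRESS_BITS = 36
OFFSET_BITS = 8    # 256-byte lines
INDEX_BITS = 10    # 1024 slots
TAG_BITS = ADDRESS_BITS - INDEX_BITS - OFFSET_BITS  # 18


def decompose(addr: int) -> tuple[int, int, int]:
    """Split a 36-bit address into (tag, index, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


if __name__ == "__main__":
    addr = (5 << 18) | (3 << 8) | 7   # tag 5, slot 3, offset 7
    print(decompose(addr))
```

The index plays the role of the decoder input, while the tag is the only value that must be compared with the TAG stored in the selected slot.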

Directly mapped cache

[Figure: the processor-generated address is split into TAG, slot index and offset; the index selects one cache slot, whose stored TAG is compared with the address TAG.]

In each slot only one line for each index can be stored.

A compromise: n-way set-associative cache

[Figure: the processor-generated address is split into TAG, slot index and offset; the index selects a set of n (TAG, DATA) pairs, one per way.]

• N-way set-associative: many lines (one per way) for each index.
• N comparators for an n-way cache. Each comparator has the same width (parallelism) as the one of the directly mapped cache.
• In directly mapped caches the data can be provided before the validity and TAG check is complete. In set-associative caches only after the check.
• Sometimes speculative mechanisms are used (the data of way 0 is provided first and then checked).

Set-associative cache

[Figure: the address is split into tag, index and offset. A decoder uses the index to select one set; each way of the set holds a cache line made of status, tag and data. The tag comparators produce the hit/miss signal and drive the data selection, and the offset selects the data word, giving the requested data.]

Therefore...

• In a fully associative cache a line can be stored in any slot.
• In a directly mapped cache a line can be stored in only one slot, the one corresponding to the INDEX.
• In a set-associative cache a line can be stored in any way of the slot corresponding to the INDEX.
• Caches are normally multiport: 2 or 3 addresses can be presented to the cache, which answers them simultaneously. This solves the Harvard problem discussed for the DLX.
• http://www.ecs.umass.edu/ece/koren/architecture/Cache/default.htm
• http://www.ecs.umass.edu/ece/koren/architecture/Cache/page3.htm
• http://www.ecs.umass.edu/ece/koren/architecture/Cache/frame2.htm
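The three organizations summarized above can be seen as one parametric structure: with one way per set the cache is directly mapped, with a single set it is fully associative. The toy model below is only a sketch under that assumption; the class name is hypothetical, and the round-robin victim choice is a placeholder, not a policy taken from the slides.

```python
class SetAssociativeCache:
    """Toy model that tracks only tags and validity bits, not the data."""

    def __init__(self, num_sets: int, num_ways: int, line_bytes: int):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.line_bytes = line_bytes
        # One list per set; each way holds a (valid, tag) pair.
        self.sets = [[(False, 0)] * num_ways for _ in range(num_sets)]
        # Placeholder replacement state: next way to overwrite in each set.
        self.next_way = [0] * num_sets

    def access(self, addr: int) -> bool:
        """Return True on a hit; on a miss load the line and return False."""
        line_number = addr // self.line_bytes
        index = line_number % self.num_sets   # selects the set
        tag = line_number // self.num_sets    # compared with the stored TAGs
        ways = self.sets[index]
        if any(valid and stored == tag for valid, stored in ways):
            return True
        victim = self.next_way[index]
        ways[victim] = (True, tag)
        self.next_way[index] = (victim + 1) % self.num_ways
        return False


if __name__ == "__main__":
    direct = SetAssociativeCache(num_sets=4, num_ways=1, line_bytes=4)
    two_way = SetAssociativeCache(num_sets=4, num_ways=2, line_bytes=4)
    # Addresses 0 and 16 map to the same index in both caches.
    for cache in (direct, two_way):
        print([cache.access(a) for a in (0, 16, 0, 16)])
```

With one way the two conflicting lines keep evicting each other, while with two ways both remain resident after the first two misses.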

Replacement algorithms

Caches are of limited size and therefore it is necessary (e.g. in case of a read miss) to select a line which must be discarded (overwritten if not modified; written back to memory and then overwritten if modified).

There are basically three possible policies, with different efficiency and complexity: RAND (Random), LRU (Least Recently Used) and FIFO (First In First Out).

RAND: in this case the logic network must first detect whether invalid lines are present (and, if so, overwrite one of them); if there are none, it must select the line to be replaced according to a random number generator (e.g. a shift register fed back through an EX-OR gate). The algorithm can be refined by selecting the non-modified lines first. Although non-optimal, this algorithm is very cost-effective.
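The RAND policy can be sketched in software with a linear feedback shift register, i.e. a shift register whose input is the EX-OR of some of its bits. The 16-bit width, the tap positions and the function names are assumptions made for this illustration; the slides do not specify them.

```python
def lfsr_step(state: int) -> int:
    """One step of a 16-bit Fibonacci LFSR (feedback = XOR of bits 0, 2, 3 and 5)."""
    bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def choose_victim(valid: list[bool], state: int) -> tuple[int, int]:
    """Return (way to replace, new LFSR state) for one set.

    Invalid ways are used first; only when every way is valid does the
    pseudo-random generator pick the victim. The state must be non-zero.
    """
    for way, is_valid in enumerate(valid):
        if not is_valid:
            return way, state
    state = lfsr_step(state)
    return state % len(valid), state


if __name__ == "__main__":
    state = 0xACE1  # arbitrary non-zero seed
    print(choose_victim([True, False, True, True], state))
    way, state = choose_victim([True, True, True, True], state)
    print(way, hex(state))
```

The refinement mentioned above (preferring non-modified lines) would add one more scan over a list of modified bits before falling back to the generator.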

Replacement algorithms: LRU

NB: the same network is replicated for each set. When a hit occurs, the hit way must become the most recent one and the other ways move down in rank, with no rank change among them.

Let's suppose there are 4 ways and that all the lines of the set are valid. The way (its number) in position Ra is the most recently hit. The other lines were hit in the past, in the order given by their positions.

[Figure: a right shift register made of Rx, Ra, Rb, Rc, Rd holding the way numbers X, Na, Nb, Nc, Nd. Ex-OR gates compare the stored way numbers (an Ex-OR output is zero if its inputs are identical) and AND gates combine their outputs with the clock CLK of the registers.]

• Na, Nb, Nc, Nd: the numbers of the hit ways (0, 1, 2, 3 in any order, according to the history of the set!)
• Rx, Ra, Rb, Rc, Rd: 2-bit registers
• Rx stores the number of the way hit by the present access (if any, i.e. no miss)
• Ra stores the number of the most recently hit way
• Rd stores the number of the least recently hit way (it identifies the oldest line). That way is the candidate for replacement in case of a miss in the set.
• For each hit the contents of the 2-bit registers are shifted right by one position, down to the register that previously held the number of the hit way.
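The behaviour of the registers Ra to Rd can be modelled with a list ordered from most to least recently hit. This is a software sketch of the rank update only, not of the gate-level network; the function names are hypothetical.

```python
def lru_touch(order: list[int], hit_way: int) -> list[int]:
    """Return the new [Ra, Rb, Rc, Rd] contents after a hit on hit_way.

    The hit way moves to the front (Ra); the ways that were ahead of it
    shift one position to the right; the ways behind it do not move.
    """
    pos = order.index(hit_way)
    return [hit_way] + order[:pos] + order[pos + 1:]


def lru_victim(order: list[int]) -> int:
    """The way stored in Rd (last position) is the replacement candidate."""
    return order[-1]


if __name__ == "__main__":
    order = [2, 0, 3, 1]           # Ra=2 (most recent) ... Rd=1 (least recent)
    order = lru_touch(order, 3)    # hit on way 3
    print(order, lru_victim(order))
    # On a miss the victim way receives the new line and becomes the most recent.
    order = lru_touch(order, lru_victim(order))
    print(order)
```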
