SLIDE 1
Chapter 4 Cache Memory
SLIDE 2 Contents
- Computer memory system overview
—Characteristics of memory systems
—Memory hierarchy
- Cache memory principles
- Elements of cache design
—Cache size
—Mapping function
—Replacement algorithms
—Write policy
—Line size
—Number of caches
- Pentium 4 and PowerPC cache organizations
SLIDE 3 Key Points
- The memory hierarchy
—processor registers
—cache
—main memory
—fixed hard disk
—ZIP cartridges, optical disks, and tape
- Going down the hierarchy
—decreasing cost, increasing capacity, and slower access time
- Principle of locality
—during the execution of a program, memory references tend to cluster
SLIDE 4 4.1 Computer Memory System Overview
- Characteristics of memory systems
—Location
—Capacity
—Unit of transfer
—Access method
—Performance
—Physical type
—Physical characteristics
– volatile/nonvolatile
– erasable/nonerasable
—Organization
SLIDE 5 Location
- Internal
—main memory
—cache
- External
—peripheral storage devices
– disk, tape
SLIDE 6 Capacity
- Word size
—natural unit of organization
—8, 16, 32, and 64 bits
- Number of words or bytes
—memory capacity
SLIDE 7 Unit of Transfer
- Internal
—Usually governed by data bus width
- External
—Usually a block which is much larger than a word
SLIDE 8 Access Methods (1)
- Sequential access
—Start at the beginning and read through in order
—Access time depends on location of data and previous location
—e.g. tape
- Direct access
—Individual blocks have a unique address
—Access is by jumping to vicinity plus sequential search
—Access time depends on location of data and previous location
—e.g. disk
SLIDE 9 Access Methods (2)
- Random access
—Each location has a unique address
—Access time is independent of location or previous access
—e.g. RAM
- Associative access
—Data is retrieved based on a portion of its contents rather than its address
—Access time is independent of location or previous access
—e.g. cache
SLIDE 10 Performance
- Access time
—For random-access memory
– time between presenting the address and getting the valid data
—For non-random-access memory
– time to position the read-write head at the location
- Memory cycle time (primarily applied to random-access memory)
—Time may be required for the memory to “recover” before the next access
– transients must die out on signal lines
– data must be regenerated if they are read destructively
—cycle time = access time + recovery time
- Transfer rate
—For random-access memory, equal to 1/(cycle time)
SLIDE 11 Performance
- For non-random-access memory, the following
relationship holds:
T_N = T_A + N/R
where
T_N = average time to read or write N bits
T_A = average access time
N = number of bits
R = transfer rate, in bits per second (bps)
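- As a quick numeric check, a small Python sketch; the values are invented for illustration (not from the slides):

    # Illustrative values only (not from the slides)
    T_A = 0.1e-6       # average access time: 0.1 us, in seconds
    R = 1e9            # transfer rate: 1 Gbps
    N = 4 * 8          # a 4-byte transfer = 32 bits
    T_N = T_A + N / R  # average time to read or write N bits
    print(T_N)         # 1.32e-07 seconds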
SLIDE 12 Physical Types
- Semiconductor
—RAM, ROM
- Magnetic
—Disk, Tape
- Optical
—CD, CD-R, CD-RW, DVD
SLIDE 13 Physical Characteristics
- Volatile/Nonvolatile
- Erasable/Nonerasable
SLIDE 14 Questions on Memory Design
- How much?
—Capacity
- How fast?
—Time is money
SLIDE 15 Hierarchy List
- Registers
- L1 Cache
- L2 Cache
- Main memory
- Disk cache
- Disk
- Optical
- Tape
SLIDE 16
Memory Hierarchy - Diagram
SLIDE 17 Going Down the Hierarchy
- Decreasing cost per bit
- Increasing capacity
- Increasing access time
- Decreasing frequency of access to the memory by the processor
SLIDE 18 An Example
- Suppose we have two levels of memory
—L1 : 1000 words, 0.01 us access time
—L2 : 100,000 words, 0.1 us access time
—H = fraction of all memory accesses found in L1
—T1 = access time to L1
—T2 = access time to L2
—For H = 0.95 : (0.95)(0.01 us) + (0.05)(0.01 us + 0.1 us) = 0.0095 + 0.0055 = 0.015 us
—average access time is much closer to 0.01 us (T1) than to 0.1 us (T2)
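- The same calculation as a Python sketch (times in microseconds, values from the slide; the function name is ours):

    # Average access time for a two-level memory, assuming a miss
    # costs T1 + T2 (L1 is checked first, then L2).
    def avg_access_time(H, T1, T2):
        return H * T1 + (1 - H) * (T1 + T2)

    print(avg_access_time(0.95, 0.01, 0.1))  # ~0.015 us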
SLIDE 19
SLIDE 20 Principle of Locality
- Going down the hierarchy, the frequency of access by the processor decreases
—this is possible due to the principle of locality
- During the course of the execution of a
program, memory references tend to cluster
—programs contain loops and procedures
– there are repeated references to a small set of instructions
—operations on arrays involve access to a clustered set
– there are repeated references to a small set of data
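- A toy Python loop showing both effects (illustration only):

    # Temporal locality: the same few loop instructions run repeatedly.
    # Spatial locality: data[i] walks through consecutive addresses.
    data = list(range(1024))
    total = 0
    for i in range(len(data)):
        total += data[i]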
SLIDE 21 4.2 Cache Memory Principles
—Small amount of fast memory local to processor
—Sits between main memory and CPU
SLIDE 22
Cache/Main Memory Structure
SLIDE 23 Cache Read Operation
- CPU requests contents of memory location
- Check cache for this data
- If present, get from cache (fast)
- If not present, read required block from main
memory to cache
- Then deliver from cache to CPU
- Cache includes tags to identify which block of
main memory is in each cache slot
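- A minimal Python sketch of this read sequence, assuming a direct-mapped cache whose entries are (tag, block) pairs; using the whole block number as the tag is a simplification, and all names are illustrative:

    def cache_read(cache, memory, block_number):
        line = block_number % len(cache)     # slot this block may occupy
        entry = cache[line]
        if entry is not None and entry[0] == block_number:
            return entry[1]                  # hit: deliver from cache (fast)
        block = memory[block_number]         # miss: read block from memory...
        cache[line] = (block_number, block)  # ...into the cache, with its tag
        return block                         # then deliver from cache to CPU

    cache = [None] * 16384                   # e.g. 16 K lines
    memory = {b: b for b in range(4096)}     # stand-in for main memory blocks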
SLIDE 24
Cache Read Operation
SLIDE 25 4.3 Elements of Cache Design
—Size
—Mapping Function
– direct, associative, set associative
—Replacement Algorithm
– LRU, FIFO, LFU, Random
—Write Policy
– Write through, write back
—Line Size
—Number of Caches
– single or two level
– unified or split
SLIDE 26 Size Does Matter
- Small enough to make it cost effective
- Large enough for performance reasons
—but larger caches tend to be slightly slower than small ones
SLIDE 27 Mapping Function
- Fewer cache lines than main memory blocks
—mapping is needed
—also need to know which memory block is in the cache
- Three techniques
—Direct
—Associative
—Set associative
- Example used for all three techniques
—Cache size : 64 KByte
—Line size : 4 Bytes
– cache is organized as 16 K lines
—Main memory size : 16 Mbytes
– each byte is directly addressable by a 24-bit address
SLIDE 28 Direct Mapping
- Maps each block into only one possible cache line
- Mapping function
i = j modulo m
where
i = cache line number
j = main memory block number
m = number of lines in the cache
- Address is in three parts
—Least significant w bits identify a unique word within a block
—Most significant s bits specify one memory block
– these are split into a cache line field of r bits and a tag of s – r bits (most significant)
SLIDE 29 Direct Mapping - Address Structure
- Address length = (s + w) bits
- Number of addressable units = 2^(s+w) words or bytes
- Block size = line size = 2^w words or bytes
- Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
- Number of lines in cache = m = 2^r
- Size of tag = (s – r) bits
SLIDE 30 Direct Mapping - Address Structure
Tag (s–r): 8 bits | Line or slot (r): 14 bits | Word (w): 2 bits
- 24 bit address (22 + 2)
- 2 bit word identifier (4 bytes in a block)
- 22 bit block identifier
— 8 bit tag (= 22 – 14)
— 14 bit slot or line
- No two blocks mapping into the same line have the same
tag field
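- A Python sketch of this address split, using the slide's field widths (w = 2, r = 14, tag = 8 bits); the example address is made up:

    def split_direct(addr, w=2, r=14):
        word = addr & ((1 << w) - 1)         # least significant w bits
        line = (addr >> w) & ((1 << r) - 1)  # next r bits: cache line
        tag = addr >> (w + r)                # most significant s - r bits
        return tag, line, word

    # e.g. address 0x16339C -> tag 0x16, line 0xce7, word 0x0
    print([hex(f) for f in split_direct(0x16339C)])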
SLIDE 31
Direct Mapping - Cache Line Mapping
Cache line    Main memory blocks assigned
0             0, m, 2m, 3m, …, 2^s – m
1             1, m+1, 2m+1, …, 2^s – m + 1
…             …
m–1           m–1, 2m–1, 3m–1, …, 2^s – 1
SLIDE 32
Direct Mapping - Cache Line Mapping
Cache line    Starting memory address of block
0             000000, 010000, …, FF0000
1             000004, 010004, …, FF0004
…             …
m–1           00FFFC, 01FFFC, …, FFFFFC
SLIDE 33
Direct Mapping - Cache Organization
SLIDE 34
Direct Mapping Example
SLIDE 35 Direct Mapping Pros & Cons
- Simple and inexpensive to implement
- Fixed cache location for any given block
—If a program repeatedly accesses 2 blocks that map to the same line, cache misses are very high (thrashing)
SLIDE 36 Associative Mapping
- A main memory block can be loaded into any
line of cache
- Memory address is interpreted as a tag and a
word field
—Tag field uniquely identifies a block of memory
- Every line’s tag is simultaneously examined for a
match
—Cache searching gets complex and expensive
SLIDE 37 Associative Mapping - Address Structure
- Address length = (s + w) bits
- Number of addressable units = 2s+ w words or bytes
- Block size = line size = 2w words or bytes
- Number of blocks in main memory = 2s+ w/2w = 2s
- Number of lines in cache = cannot specify using s or w
- Size of tag = s bits
SLIDE 38 Associative Mapping - Address Structure
Tag: 22 bits | Word: 2 bits
- 22 bit tag stored with each 32 bit block of data
- Compare tag field with tag entry in cache to
check for hit
- Least significant 2 bits of address identify which
byte is required from 32 bit data block
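- A Python sketch of the lookup with this split; a dict stands in for the parallel tag comparators, and main memory is assumed to be indexed by block number (which, with a full-length tag, equals the tag):

    def assoc_read(cache, memory, addr):
        tag, word = addr >> 2, addr & 0x3  # 22-bit tag, 2-bit word
        if tag not in cache:               # every stored tag is checked
            cache[tag] = memory[tag]       # miss: load the 4-byte block
        return cache[tag][word]            # deliver the requested byte

    cache = {}                             # tag -> 4-byte block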
SLIDE 39
Fully Associative Cache Organization
SLIDE 40
Associative Mapping - Example
SLIDE 41 Associative Mapping Pros & Cons
- Flexible as to which block to replace when a
new block is read into the cache
—need to select one which is not going to be used in the near future
- Complex circuitry is required to examine the
tags of all cache lines
SLIDE 42 Set Associative Mapping
- A compromise of direct and associative methods
- Cache is divided into a number of sets (v)
- Each set contains a number of lines (k)
- The relationships are
m = v x k
i = j modulo v
where
i = cache set number
j = main memory block number
m = number of lines in the cache
SLIDE 43 Set Associative Mapping - Address
- Address length = (s + w) bits
- Number of addressable units = 2^(s+w) words or bytes
- Block size = line size = 2^w words or bytes
- Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
- Number of lines in set = k
- Number of sets = v = 2^d
- Number of lines in cache = kv = k x 2^d
- Size of tag = (s – d) bits
- Address is interpreted as 3 fields
— tag, set, and word
SLIDE 44 Set Associative Mapping - Address
- Use set field to determine cache set to look in
—this determines the mapping of blocks into sets
- Compare tag field to see if we have a hit
—the k lines within the set are examined simultaneously (two in this example)
- If v = m, k = 1, same as direct mapping
- If v = 1, k = m, same as associative mapping
- two- or four-way set associative mappings are
common
Tag: 9 bits | Set: 13 bits | Word: 2 bits
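- The corresponding field extraction as a Python sketch (w = 2, d = 13, so the tag is s – d = 9 bits); names are illustrative:

    def split_set_assoc(addr, w=2, d=13):
        word = addr & ((1 << w) - 1)      # 2-bit word within the block
        s = (addr >> w) & ((1 << d) - 1)  # 13-bit set number
        tag = addr >> (w + d)             # 9-bit tag, compared against
        return tag, s, word               # the k lines of set s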
SLIDE 45
Two Way Set Associative Cache Organization
SLIDE 46
Two Way Set Associative Mapping - Example
SLIDE 47 Replacement Algorithms
- When a new block is brought into cache, one of
the existing blocks must be replaced
- Direct mapping
—Only one possible line for any particular block
– No choice
– Replace that line
SLIDE 48 Replacement Algorithms
- Associative & Set Associative mapping
—To achieve high speed, need to be implemented in hardware
—5 common algorithms
– Least Recently Used (LRU)
+ the most effective one
+ e.g. in 2-way set associative
- can be implemented using a USE bit for each line (see the sketch after this list)
- Which of the 2 blocks is LRU? The one whose USE bit is 0
– First In First Out (FIFO)
+ replace block that has been in cache longest
+ can be implemented using a circular buffer
– Least Frequently Used
+ replace block which has fewest references
+ can be implemented using a counter for each line
– Random
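- A Python sketch of the USE-bit scheme for one two-way set, as described under LRU above (a toy model, not a hardware description):

    use = [0, 0]              # one USE bit per line in the set

    def referenced(line):     # called on every access to `line`
        use[line] = 1         # mark it most recently used
        use[1 - line] = 0     # the other line is now the LRU one

    def lru_victim():         # on a miss, replace the line whose bit is 0
        return 0 if use[0] == 0 else 1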
SLIDE 49
Replacement Algorithms
– Clock algorithm
+ upgraded version of FIFO
+ Additional bit called a use bit
+ When a block is first loaded in cache, its use bit is set to 1
+ When the block is referenced, its use bit is set to 1
+ When it is time to replace a block, the first block encountered with the use bit set to 0 is replaced
+ During the search for replacement, each use bit of 1 is changed to 0
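- A Python sketch of the policy (a circular scan over the lines' use bits; names are illustrative):

    class ClockPolicy:
        def __init__(self, nlines):
            self.use = [0] * nlines  # one use bit per line
            self.hand = 0            # position of the "clock hand"

        def referenced(self, i):     # on first load and on each reference
            self.use[i] = 1

        def victim(self):            # line to replace
            while self.use[self.hand] == 1:
                self.use[self.hand] = 0  # passed over: clear its use bit
                self.hand = (self.hand + 1) % len(self.use)
            v = self.hand                # first line found with use bit 0
            self.hand = (self.hand + 1) % len(self.use)
            return v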
SLIDE 50
Example of clock policy operation
SLIDE 51
Example of clock policy operation
SLIDE 52 Write Policy
- When a block in the cache is to be replaced,
need to consider whether it has been altered
—If not, that block may simply be overwritten
—If so, main memory needs to be updated before the block is replaced
- Problems to contend with
—More than one device may have access to main memory
—Multiple processors with their own caches
- Two techniques
—Write through
—Write back
SLIDE 53 Write through
- All writes go to main memory as well as cache
- Other CPUs can monitor main memory traffic to
keep their cache up to date
—generates substantial memory traffic
SLIDE 54 Write back
- Updates are made only in cache
—Update bit for the cache line is set
—If a block is to be replaced, write it to main memory
– only if the update bit is set
- Portions of main memory may be invalid
—accesses by other devices are allowed only through the cache
—15% of memory references are writes
—For HPC, 33–50% are writes
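- The two policies contrasted in a Python sketch, on the toy direct-mapped model used earlier (entries become (tag, block, dirty) triples; illustration only):

    def store_write_through(cache, memory, line, block_no, data):
        cache[line] = (block_no, data, False)
        memory[block_no] = data          # every write also goes to memory

    def store_write_back(cache, memory, line, block_no, data):
        old = cache[line]
        if old is not None and old[2]:   # victim's update bit is set:
            memory[old[0]] = old[1]      # only now is memory updated
        cache[line] = (block_no, data, True)  # set the update (dirty) bit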
SLIDE 55 Cache Coherency
- Cache coherency in shared memory multiprocessors
—Bus watching with write through
– each cache controller monitors the address lines
– if another master writes to a location that also resides in its cache, the cache entry is invalidated
—Hardware transparency
– all updates to main memory are reflected in all caches
—Noncacheable memory
– only a portion of memory is shared by processors
– this is designated as noncacheable
– this portion is never copied into the cache
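- A toy Python model of the first approach (bus watching with write through); dicts stand in for the caches, and every controller "sees" the write on the bus:

    caches = [dict(), dict()]      # tag -> block, one dict per processor

    def bus_write(writer, tag, block, memory):
        memory[tag] = block        # write through: main memory is updated
        for i, c in enumerate(caches):
            if i != writer and tag in c:
                del c[tag]         # snooping cache invalidates its copy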
SLIDE 56 Line Size
—Larger blocks mean a smaller number of blocks fit in the cache
– there will be more replacements
– data will be overwritten shortly after they are fetched
—As a block becomes larger, each additional word is farther from the requested word, and less likely to be needed in the near future
– principle of locality does not apply well
- Sizes of 8 to 32 bytes are close to optimum
- For HPC, sizes of 64 to 128 bytes are common
SLIDE 57 Number of Caches
—On-chip cache
– speeds up execution and increases performance
—Is off-chip cache still desirable?
– the speed gap between CPU and main memory is too big
– most designs include both caches
– usually called L1 and L2 caches
—trend is toward split caches
– one cache each for instructions and data
– eliminates contention for the cache between the instruction fetch/decode unit and the execution unit
SLIDE 58 Pentium 4 Cache
- 80386 – no on-chip cache
- 80486 – 8 KB on-chip using 16 byte lines and four-way
set associative organization
- Pentium (all versions) – two on-chip L1 caches
— Data & instructions
— L1 caches
– 8 KB
– 64 byte lines
– four-way set associative
— L2 cache
– Feeding both L1 caches
– 256 KB
– 128 byte lines
– 8-way set associative
SLIDE 59
Pentium 4 Diagram
SLIDE 60 Pentium 4 Processor Core
- Fetch/decode unit
—Fetches instructions from L2 cache
—Decodes into micro-ops
—Stores micro-ops in L1 cache
- Out of order execution logic
—Schedules micro-ops
—Based on data dependencies and resources
—May speculatively execute
- Execution units
—Execute micro-ops
—Data from L1 cache
—Results in registers
- Memory subsystem
—L2 cache and system bus
SLIDE 61 Pentium 4 Design Reasoning
- Decodes instructions into RISC-like micro-ops before L1 cache
— Superscalar pipelining and scheduling
- Pentium instructions long & complex
- Performance improved by separating decoding from
scheduling & pipelining
— (More later – ch14)
— Data cache is write back
– Can be configured to write through
SLIDE 62 PowerPC Cache Organization
- 601 – single 32 KB 8-way set associative
- 603 – 16 KB (2 x 8 KB) 2-way set associative
- 604 – 32 KB (2 x 8 KB) 4-way set associative
- 620 – 64 KB (2 x 32 KB) 8-way set associative
- G3 & G4
—64 KB L1 cache
– 8-way set associative
—256 KB, 512 KB or 1MB L2 cache
– 2-way set associative
SLIDE 63
PowerPC G4
SLIDE 64
Comparison of Cache Sizes