Chapter 4 Cache Memory Contents Computer memory system overview - - PowerPoint PPT Presentation



SLIDE 1

Chapter 4 Cache Memory

SLIDE 2

Contents

  • Computer memory system overview

—Characteristics of memory systems —Memory hierarchy

  • Cache memory principles
  • Elements of cache design

—Cache size
—Mapping function
—Replacement algorithms
—Write policy
—Line size
—Number of caches

  • Pentium 4 and PowerPC cache organizations
SLIDE 3

Key Points

  • Memory hierarchy

—processor registers
—cache
—main memory
—fixed hard disk
—ZIP cartridges, optical disks, and tape

  • Going down the hierarchy

—decreasing cost, increasing capacity, and slower access time

  • Principles of locality

—during the execution of a program, memory references tend to cluster

SLIDE 4

4.1 Computer Memory System Overview

  • Characteristics of memory systems

—Location
—Capacity
—Unit of transfer
—Access method
—Performance
—Physical type
—Physical characteristics

– volatile/nonvolatile – erasable/nonerasable

—Organization

SLIDE 5

Location

  • CPU
  • Internal

—main memory —cache

  • External(secondary)

—peripheral storage devices —disk, tape

SLIDE 6

Capacity

  • Word size

—natural unit of organization —8, 16, 32, and 64 bits

  • Number of words

—memory capacity

SLIDE 7

Unit of Transfer

  • Internal memory

—Usually governed by data bus width

  • External memory

—Usually a block which is much larger than a word

SLIDE 8

Access Methods (1)

  • Sequential

—Start at the beginning and read through in order
—Access time depends on location of data and previous location
—e.g. tape

  • Direct

—Individual blocks have unique addresses
—Access is by jumping to the vicinity plus a sequential search
—Access time depends on location of data and previous location
—e.g. disk

SLIDE 9

Access Methods (2)

  • Random

—Each location has a unique address
—Access time is independent of location or previous access
—e.g. RAM

  • Associative

—Data is retrieved based on a portion of its contents rather than its address
—Access time is independent of location or previous access
—e.g. cache

SLIDE 10

Performance

  • Access time (latency)

—For random-access memory

– time between presenting the address and getting the valid data

—For non-random-access memory

– time to position the read-write head at the location

  • Memory cycle time (primarily applied to random-access memory)

—Additional time may be required for the memory to “recover” before the next access

– transients die out on signal lines
– data are regenerated if they are read destructively

—cycle time = access time + recovery time

  • Transfer rate

—For random-access memory, equal to 1/(cycle time)

SLIDE 11

Performance

  • For non-random-access memory, the following relationship holds:

TN = TA + N/R

where
  TN = average time to read or write N bits
  TA = average access time
  N = number of bits
  R = transfer rate, in bits per second (bps)
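This relationship is easy to check with a short script. The device parameters below (access time, block size, transfer rate) are illustrative assumptions, not values from the slides:

```python
def read_time(ta, n, r):
    """TN = TA + N/R: average time to read or write n bits, given
    average access time ta (seconds) and transfer rate r (bps)."""
    return ta + n / r

# Hypothetical disk-like device: 0.1 ms access time, 1 Mbps transfer
# rate, reading a 4096-bit block.
tn = read_time(0.0001, 4096, 1_000_000)  # 0.004196 seconds
```

Note that for large transfers the N/R term dominates, while for small ones the access time TA dominates.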

SLIDE 12

Physical Types

  • Semiconductor

—RAM, ROM

  • Magnetic

—Disk, Tape

  • Optical

—CD, CD-R, CD-RW, DVD

SLIDE 13

Physical Characteristics

  • Volatile/Nonvolatile
  • Erasable/Nonerasable
SLIDE 14

Questions on Memory Design

  • How much?

—Capacity

  • How fast?

—Time is money

  • How expensive?
SLIDE 15

Hierarchy List

  • Registers
  • L1 Cache
  • L2 Cache
  • Main memory
  • Disk cache
  • Disk
  • Optical
  • Tape
SLIDE 16

Memory Hierarchy - Diagram

SLIDE 17

Going Down the Hierarchy

  • Decreasing cost per bit
  • Increasing capacity
  • Increasing access time
  • Decreasing frequency of access to the memory by the processor

SLIDE 18

An Example

  • Suppose we have two levels of memory

—L1: 1,000 words, 0.01 µs access time
—L2: 100,000 words, 0.1 µs access time
—H = fraction of all memory accesses found in L1
—T1 = access time of L1
—T2 = access time of L2

  • Suppose H = 0.95

—average access time = (0.95)(0.01 µs) + (0.05)(0.01 µs + 0.1 µs) = 0.0095 + 0.0055 = 0.015 µs
—the average access time is much closer to 0.01 µs than to 0.1 µs
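The calculation above generalizes to any hit ratio. A minimal sketch, using the slides' model in which a miss costs the L1 probe plus the L2 access:

```python
def avg_access_time(h, t1, t2):
    """Two-level average access time: a fraction h of accesses is
    served in t1; a miss costs t1 + t2 (probe L1, then fetch L2)."""
    return h * t1 + (1 - h) * (t1 + t2)

# The slides' example: H = 0.95, T1 = 0.01 us, T2 = 0.1 us
t = avg_access_time(0.95, 0.01, 0.1)  # 0.015 us
```

Raising H toward 1 drives the average toward T1, which is why locality (next slides) makes the hierarchy work.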

SLIDE 19

SLIDE 20

Principle of Locality

  • Going down the hierarchy, the frequency of access by the processor decreases

—this is possible due to the principle of locality

  • During the course of the execution of a program, memory references tend to cluster

—programs contain loops and procedures

– there are repeated references to a small set of instructions

—operations on arrays involve access to a clustered set of data

– there are repeated references to a small set of data

SLIDE 21

4.2 Cache Memory Principles

  • Cache

—Small amount of fast memory local to processor —Sits between main memory and CPU

SLIDE 22

Cache/Main Memory Structure

SLIDE 23

Cache Read Operation

  • CPU requests contents of a memory location
  • Check cache for this data
  • If present, get it from cache (fast)
  • If not present, read the required block from main memory into the cache
  • Then deliver from cache to CPU
  • Cache includes tags to identify which block of main memory is in each cache slot

SLIDE 24

Cache Read Operation

SLIDE 25

4.3 Elements of Cache Design

  • Design issues

—Size —Mapping Function

– direct, associative, set associative

—Replacement Algorithm

– LRU, FIFO, LFU, Random

—Write Policy

– Write through, write back

—Line Size —Number of Caches

– single or two level – unified or split

SLIDE 26

Size Does Matter

  • Small enough to make it cost effective
  • Large enough for performance reasons

—but larger caches tend to be slightly slower than small ones

SLIDE 27

Mapping Function

  • Fewer cache lines than main memory blocks

—mapping is needed —also need to know which memory block is in cache

  • Techniques

—Direct
—Associative
—Set associative

  • Example case

—Cache size: 64 KBytes
—Line size: 4 bytes

– cache is organized as 16K lines

—Main memory size: 16 MBytes

– each byte is directly addressable by a 24-bit address

SLIDE 28

Direct Mapping

  • Maps each main memory block into one fixed cache line
  • Mapping function

i = j modulo m
where
  i = cache line number
  j = main memory block number
  m = number of lines in the cache

  • Address is in three parts

—Least significant w bits identify a unique word within a block
—Most significant s bits specify one memory block

– these are split into a cache line field of r bits and a tag of s-r bits (most significant)
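For the example case (24-bit addresses, w = 2, r = 14), the field extraction can be sketched as follows; the sample address in the comment is illustrative:

```python
def split_direct(addr, w=2, r=14):
    """Split an address into (tag, line, word) fields for direct
    mapping; defaults match the 8/14/2-bit example in these slides."""
    word = addr & ((1 << w) - 1)          # low w bits: byte in block
    line = (addr >> w) & ((1 << r) - 1)   # next r bits: cache line
    tag = addr >> (w + r)                 # remaining s-r bits: tag
    return tag, line, word

# e.g. split_direct(0x16339C) -> (0x16, 0xCE7, 0)
```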

SLIDE 29

Direct Mapping - Address Structure

  • Address length = (s + w) bits
  • Number of addressable units = 2^(s+w) words or bytes
  • Block size = line size = 2^w words or bytes
  • Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
  • Number of lines in cache = m = 2^r
  • Size of tag = (s – r) bits
SLIDE 30

Direct Mapping - Address Structure

Tag: s-r = 8 bits | Line (slot): r = 14 bits | Word: w = 2 bits

  • 24-bit address (22 + 2)
  • 2-bit word identifier (4 bytes in a block)
  • 22-bit block identifier

— 8-bit tag (= 22 - 14)
— 14-bit slot or line number

  • No two blocks mapping into the same line have the same tag field

SLIDE 31

Direct Mapping - Cache Line Mapping

Cache line    Main memory blocks assigned
0             0, m, 2m, 3m, …, 2^s - m
1             1, m+1, 2m+1, …, 2^s - m + 1
…
m-1           m-1, 2m-1, 3m-1, …, 2^s - 1

SLIDE 32

Direct Mapping - Cache Line Mapping

Cache line    Starting memory address of block
0             000000, 010000, …, FF0000
1             000004, 010004, …, FF0004
…
m-1           00FFFC, 01FFFC, …, FFFFFC

SLIDE 33

Direct Mapping - Cache Organization

SLIDE 34

Direct Mapping Example

SLIDE 35

Direct Mapping Pros & Cons

  • Simple and inexpensive to implement
  • Fixed cache location for any given block

—If a program accesses 2 blocks that map to the same line repeatedly, cache misses are very high

SLIDE 36

Associative Mapping

  • A main memory block can be loaded into any line of the cache
  • The memory address is interpreted as a tag and a word field

—Tag field uniquely identifies a block of memory

  • Every line’s tag is simultaneously examined for a match

—Cache searching gets complex and expensive

SLIDE 37

Associative Mapping - Address Structure

  • Address length = (s + w) bits
  • Number of addressable units = 2^(s+w) words or bytes
  • Block size = line size = 2^w words or bytes
  • Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
  • Number of lines in cache = not determined by the address format
  • Size of tag = s bits
SLIDE 38

Associative Mapping - Address Structure

Tag: 22 bits | Word: 2 bits

  • A 22-bit tag is stored with each 32-bit block of data
  • Compare the tag field with each tag entry in the cache to check for a hit
  • The least significant 2 bits of the address identify which byte is required from the 32-bit data block
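The same kind of sketch works for associative mapping, where everything above the word field is tag (sample address illustrative):

```python
def split_associative(addr, w=2):
    """Tag/word split for fully associative mapping: the entire block
    number (here 22 bits of a 24-bit address) serves as the tag."""
    return addr >> w, addr & ((1 << w) - 1)

# e.g. split_associative(0x16339C) -> (0x58CE7, 0)
```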

SLIDE 39

Fully Associative Cache Organization

SLIDE 40

Associative Mapping - Example

SLIDE 41

Associative Mapping Pros & Cons

  • Flexible as to which block to replace when a new block is read into the cache

—need to select one which is not going to be used in the near future

  • Complex circuitry is required to examine the tags of all cache lines

SLIDE 42

Set Associative Mapping

  • A compromise between the direct and associative methods
  • Cache is divided into a number of sets (v)
  • Each set contains a number of lines (k)
  • The relationships are

m = v × k
i = j modulo v
where
  i = cache set number
  j = main memory block number
  m = number of lines in the cache

SLIDE 43

Set Associative Mapping - Address

  • Address length = (s + w) bits
  • Number of addressable units = 2^(s+w) words or bytes
  • Block size = line size = 2^w words or bytes
  • Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
  • Number of lines in set = k
  • Number of sets = v = 2^d
  • Number of lines in cache = kv = k × 2^d
  • Size of tag = (s – d) bits
  • Address is interpreted as 3 fields

— tag, set, and word

SLIDE 44

Set Associative Mapping - Address

  • Use the set field to determine which cache set to look in

—this determines the mapping of blocks into lines

  • Compare the tag field to see if we have a hit

—the k lines within the selected set are examined simultaneously

  • If v = m and k = 1, this is the same as direct mapping
  • If v = 1 and k = m, this is the same as associative mapping
  • Two- and four-way set associative mappings are common

Tag: 9 bits | Set: 13 bits | Word: 2 bits
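The 9/13/2-bit split above can be sketched the same way as the earlier mappings (sample address illustrative):

```python
def split_set_assoc(addr, w=2, d=13):
    """Split an address into (tag, set, word) fields for set-associative
    mapping; defaults match the slides' 9/13/2-bit example."""
    word = addr & ((1 << w) - 1)
    set_no = (addr >> w) & ((1 << d) - 1)  # set = block number mod v
    tag = addr >> (w + d)
    return tag, set_no, word

# e.g. split_set_assoc(0x16339C) -> (0x2C, 0xCE7, 0)
```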

SLIDE 45

Two-Way Set Associative Cache Organization

SLIDE 46

Two-Way Set Associative Mapping - Example

SLIDE 47

Replacement Algorithms

  • When a new block is brought into the cache, one of the existing blocks must be replaced
  • Direct mapping

—Only one possible line for any particular block

– No choice
– Replace that line

SLIDE 48

Replacement Algorithms

  • Associative & set associative mapping

—To achieve high speed, the algorithm needs to be implemented in hardware
—5 common algorithms

– Least Recently Used (LRU)

+ the most effective one
+ e.g. in 2-way set associative

  • can be implemented with a USE bit per line: when a line is referenced, its USE bit is set to 1 and the other line’s USE bit is set to 0
  • the block in the line whose USE bit is 0 is the LRU one and is replaced

– First In First Out (FIFO)

+ replace the block that has been in the cache longest
+ can be implemented using a circular buffer

– Least Frequently Used (LFU)

+ replace the block which has the fewest references
+ can be implemented using a counter for each line

– Random
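As a software illustration of LRU (real caches do this in hardware; the class name and interface here are my own), an ordered dictionary gives the recency ordering for free:

```python
from collections import OrderedDict

class LRUCache:
    """Toy fully associative cache with LRU replacement."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()   # block number -> cached data

    def access(self, block):
        """Return True on a hit; on a miss, load the block, evicting
        the least recently used line if the cache is full."""
        if block in self.lines:
            self.lines.move_to_end(block)    # now most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)   # evict the LRU line
        self.lines[block] = None
        return False
```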

SLIDE 49

Replacement Algorithms

– Clock algorithm

+ an upgraded version of FIFO
+ each line has an additional bit called the use bit
+ when a block is first loaded into the cache, its use bit is set to 1
+ when the block is referenced, its use bit is set to 1
+ when it is time to replace a block, the first block encountered with its use bit set to 0 is replaced
+ during the search for a replacement, each use bit that is 1 is changed to 0
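The clock policy described above can be sketched in a few lines of Python (a software illustration; the names are my own):

```python
class ClockCache:
    """Toy fully associative cache using clock replacement."""
    def __init__(self, nlines):
        self.blocks = [None] * nlines
        self.use = [0] * nlines
        self.hand = 0                      # next candidate for replacement

    def access(self, block):
        """Return True on a hit (setting the use bit); on a miss, replace
        the first line found with use bit 0, clearing bits as we pass."""
        if block in self.blocks:
            self.use[self.blocks.index(block)] = 1
            return True
        while self.use[self.hand] == 1:
            self.use[self.hand] = 0
            self.hand = (self.hand + 1) % len(self.blocks)
        self.blocks[self.hand] = block
        self.use[self.hand] = 1            # newly loaded block: use bit 1
        self.hand = (self.hand + 1) % len(self.blocks)
        return False
```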

SLIDE 50

Example of clock policy operation

SLIDE 51

Example of clock policy operation

SLIDE 52

Write Policy

  • When a block in the cache is to be replaced, we need to consider whether it has been altered

—If not, the block may simply be overwritten
—If so, main memory needs to be updated

  • Two problems to contend with

—More than one device may have access to main memory
—Multiple processors with their own caches

  • Two techniques

—Write through
—Write back

SLIDE 53

Write through

  • All writes go to main memory as well as to the cache
  • Other CPUs can monitor main memory traffic to keep their caches up to date
  • Disadvantage

—generates substantial memory traffic

SLIDE 54

Write back

  • Updates are made only in the cache

—The update (dirty) bit for the cache line is set
—When a block is replaced, it is written back to main memory only if its update bit is set

  • Portions of main memory may be invalid

—accesses by other devices are allowed only through the cache

  • Experience

—15% of memory references are writes
—For HPC, 33% to 50% are writes
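A toy model of the write-back bookkeeping described above (a single cache line with a dict as main memory; the class and names are illustrative):

```python
class WriteBackLine:
    """One cache line under the write-back policy: writes only set the
    update (dirty) bit; memory is updated when the line is replaced."""
    def __init__(self, memory):
        self.memory = memory   # backing store: block -> value
        self.block = None
        self.value = None
        self.dirty = False

    def load(self, block):
        if self.block != block:
            if self.dirty:                 # write back only if dirty
                self.memory[self.block] = self.value
            self.block, self.dirty = block, False
            self.value = self.memory.get(block)

    def write(self, block, value):
        self.load(block)
        self.value = value
        self.dirty = True                  # main memory is now stale
```

Between the write and the replacement, main memory holds a stale value; that window is exactly what the coherency mechanisms on the next slide address.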

SLIDE 55

Cache Coherency

  • Cache coherency in shared-memory multiprocessors

—Bus watching with write through

– each cache controller monitors the address lines
– if another master writes to a location that also resides in its cache, the cache entry is invalidated

—Hardware transparency

– all updates to main memory are reflected in all caches

—Noncacheable memory

– only a portion of memory is shared by processors
– this portion is designated as noncacheable
– noncacheable memory is never copied into the cache

SLIDE 56

Line Size

  • Effects of line size

—Larger blocks mean a smaller number of blocks in the cache

– there will be more replacements
– data will be overwritten shortly after being fetched

—As a block becomes larger, each additional word is farther from the requested word and less likely to be needed in the near future

– the principle of locality no longer applies well

  • Sizes of 8 to 32 bytes are close to optimum
  • For HPC, sizes of 64 to 128 bytes are common
SLIDE 57

Number of Caches

  • Multilevel caches

—On-chip cache

– speeds up execution and increases performance

—Is off-chip cache still desirable?

– the speed gap between the CPU and main memory is too big
– most designs include both caches, usually called L1 and L2

  • Unified vs. split caches

—trend is toward split caches

– one cache each for instructions and data
– eliminates contention for the cache between the instruction fetch/decode unit and the execution unit

SLIDE 58

Pentium 4 Cache

  • 80386 – no on-chip cache
  • 80486 – 8 KB on-chip, 16-byte lines, four-way set associative organization
  • Pentium (all versions) – two on-chip L1 caches

— Data & instructions

  • Pentium 4

— L1 caches

– 8 KB
– 64-byte lines
– four-way set associative

— L2 cache

– feeds both L1 caches
– 256 KB
– 128-byte lines
– 8-way set associative
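As a quick check on these parameters, the number of lines and sets follows directly from cache size, line size, and associativity:

```python
def cache_geometry(size_bytes, line_bytes, ways):
    """Number of lines and sets implied by cache size, line size, and
    associativity (e.g. the Pentium 4 L1: 8 KB, 64-byte lines, 4-way)."""
    lines = size_bytes // line_bytes
    return lines, lines // ways

# cache_geometry(8 * 1024, 64, 4) -> (128, 32): 128 lines in 32 sets
```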

SLIDE 59

Pentium 4 Diagram

SLIDE 60

Pentium 4 Processor Core

  • Fetch/Decode Unit

— Fetches instructions from the L2 cache
— Decodes them into micro-ops
— Stores micro-ops in the L1 cache

  • Out-of-order execution logic

— Schedules micro-ops
— Based on data dependencies and resources
— May speculatively execute

  • Execution units

— Execute micro-ops
— Data from the L1 cache
— Results in registers

  • Memory subsystem

— L2 cache and system bus

SLIDE 61

Pentium 4 Design Reasoning

  • Decodes instructions into RISC-like micro-ops before the L1 cache
  • Micro-ops are fixed length

— Superscalar pipelining and scheduling

  • Pentium instructions are long & complex
  • Performance is improved by separating decoding from scheduling & pipelining

— (More later – ch. 14)

  • Data cache is write back

— Can be configured to write through

SLIDE 62

PowerPC Cache Organization

  • 601 – single 32 KB cache, 8-way set associative
  • 603 – 16 KB (2 x 8 KB), 2-way set associative
  • 604 – 32 KB (2 x 16 KB), 4-way set associative
  • 620 – 64 KB (2 x 32 KB), 8-way set associative
  • G3 & G4

—64 KB L1 cache

– 8-way set associative

—256 KB, 512 KB or 1MB L2 cache

– 2-way set associative

SLIDE 63

PowerPC G4

SLIDE 64

Comparison of Cache Sizes