CS422 Computer Architecture, Spring 2004, Lecture 18, 26 Feb 2004

SLIDE 1

CS422 Computer Architecture
Spring 2004, Lecture 18, 26 Feb 2004
Bhaskaran Raman
Department of CSE, IIT Kanpur
http://web.cse.iitk.ac.in/~cs422/index.html

SLIDE 2

Memory Hierarchy

  • Two principles:
    – Smaller is faster
    – Principle of locality
  • Processor speed grows much faster than memory speed
  • Registers – Cache – Memory – Disk
    – Upper level vs. lower level
  • Cache design
SLIDE 3

Cache Design Questions

  • Cache is arranged in terms of blocks
    – To take advantage of spatial locality
  • Design choices:
    – Q1: block placement – where to place a block in the upper level?
    – Q2: block identification – how to find a block in the upper level?
    – Q3: block replacement – which block to replace on a miss?
    – Q4: write strategy – what happens on a write?

SLIDE 4

Block Placement: Fully Associative

[Figure: memory blocks mapped to an 8-block cache – block 11 can go anywhere in the cache]

SLIDE 5

Block Placement: Direct

[Figure: memory blocks mapped to an 8-block cache – block 11 can go only in cache block number 11 mod 8]

SLIDE 6

Block Placement: Set Associative

[Figure: memory blocks mapped to an 8-block, 2-way set-associative cache – block 11 can go in set number 11 mod 4]

SLIDE 7

Continuum of Choices

  • Memory has n blocks, cache has m blocks
  • Fully associative is the same as set associative with one set (m-way set associative)
  • Direct placement is the same as 1-way set associative (with m sets)
  • Most processors use direct-mapped, 2-way, or 4-way set associative
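The continuum above can be sketched as one function: only the associativity k changes which cache blocks a memory block may occupy. A minimal Python sketch, using the figures' 8-block cache and assuming the k ways of a set are laid out contiguously:

```python
def candidate_blocks(b, m, k):
    """Cache block indices where memory block b may be placed in an
    m-block cache organized as m // k sets of k ways each."""
    num_sets = m // k
    s = b % num_sets                        # the set block b maps to
    return list(range(s * k, (s + 1) * k))  # the k ways of that set

m = 8
fully_assoc = candidate_blocks(11, m, k=m)  # one set of 8: anywhere
direct      = candidate_blocks(11, m, k=1)  # 8 sets of 1: only 11 mod 8 = 3
two_way     = candidate_blocks(11, m, k=2)  # 4 sets of 2: set 11 mod 4 = 3
```

With k = m this reproduces the fully associative figure, and with k = 1 the direct-mapped one.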

SLIDE 8

Block Identification

  • How many different blocks of memory can be mapped (at different times) to a cache block?
    – Fully associative: n
    – Direct: n/m
    – k-way set associative: k*n/m
  • Each cache block has a tag saying which block of memory is currently present in it
    – A valid bit is set to 0 if no memory block is in the cache block currently

SLIDE 9

Block Identification (continued)

  • How many bits for the tag? log2(k*n/m)
  • How many sets in cache? m/k
  • How many bits to identify the correct set? log2(m/k)

SLIDE 10

Block Identification (continued)

  • How many blocks in memory? n, so log2(n) bits to represent a block number in memory
  • Given a memory address, it splits into:
    – Tag: log2(k) + log2(n) − log2(m) bits
    – Index: log2(m) − log2(k) bits
    – Block offset: log2(block size) bits
  • Using the address:
    – Select set using index, block within set using tag
    – Select location within block using block offset
    – tag + index = block address
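The split above can be sketched in Python. The parameters here (a 64-block, 4-way cache with 32-byte blocks, and the address 0x1234) are made up for illustration:

```python
def split_address(addr, m, k, block_size):
    """Split a byte address into (tag, index, block offset) for an
    m-block, k-way cache with the given block size (powers of two)."""
    offset_bits = block_size.bit_length() - 1   # log2(block size)
    index_bits = (m // k).bit_length() - 1      # log2(m/k) sets
    offset = addr & (block_size - 1)
    index = (addr >> offset_bits) & (m // k - 1)
    tag = addr >> (offset_bits + index_bits)    # remaining upper bits
    return tag, index, offset

tag, index, offset = split_address(0x1234, m=64, k=4, block_size=32)
# 0x1234 = 4660: offset = 4660 mod 32 = 20, index = 145 mod 16 = 1, tag = 9
```

Note that tag + index together identify the block address, matching the last bullet.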

SLIDE 11

Block Replacement Policy

  • Cache miss ==> bring block onto cache

– What if no free block in set? – Need to replace a block

  • Possible policies:

– Random – Least-Recently Used (LRU)

  • Lesser miss-rate, but harder to implement
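A minimal LRU sketch for a single k-way set (illustrative only; real hardware typically uses cheaper approximations of true LRU):

```python
from collections import OrderedDict

class LRUSet:
    """One k-way set with LRU replacement."""
    def __init__(self, k):
        self.k = k
        self.blocks = OrderedDict()          # tag -> None, oldest first

    def access(self, tag):
        """Return True on a hit; on a miss install tag, evicting the LRU."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)     # now most-recently used
            return True
        if len(self.blocks) >= self.k:
            self.blocks.popitem(last=False)  # evict least-recently used
        self.blocks[tag] = None
        return False

s = LRUSet(k=2)
hits = [s.access(t) for t in [8, 16, 8, 24, 16]]  # only the third access hits
```

The difficulty the slide alludes to: tracking exact recency order in hardware needs extra state and updates on every access, which is why Random is attractive despite its higher miss rate.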
SLIDE 12

Replacement Policy Performance

[Chart: cache miss rate (0%–6%) vs. cache size (16 KB, 64 KB, 256 KB) for 2-way, 4-way, and 8-way set-associative caches under LRU and Random replacement]

SLIDE 13

Write Strategy

  • Reads are dominant
    – All instruction fetches are reads
    – Even for data, loads dominate over stores
  • Reads can be fast
    – Can read from multiple blocks while performing tag comparison
    – Cannot do the same with writes
  • Should pay attention to write performance too!

SLIDE 14

When do Writes go to Memory?

  • Write through: each write is mirrored to memory also
    – Easier to implement
  • Write back: write to memory only when the block is replaced
    – Faster writes
    – Some writes do not go to memory at all!
    – But a read miss may cause more delay
      • The block being replaced has to be written back
      • Optimize using a dirty bit
    – Also, bad for multiprocessors and I/O
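The dirty-bit optimization can be sketched as follows (illustrative; the `Block` and `replace` names are hypothetical, not from the slides). Only blocks that were actually modified are written back on replacement; clean blocks are simply dropped, since memory already holds their contents:

```python
class Block:
    def __init__(self, tag):
        self.tag = tag
        self.dirty = False   # set when a store modifies the block

writebacks = 0               # memory writes caused by replacement

def replace(old, new_tag):
    """Evict `old` in favor of new_tag; write back only if old is dirty."""
    global writebacks
    if old.dirty:
        writebacks += 1      # dirty block must go to memory first
    return Block(new_tag)    # a clean block is dropped with no traffic

b = Block(tag=11)
b = replace(b, 16)           # block 11 was never written: no write-back
b.dirty = True               # a store hits block 16
b = replace(b, 24)           # block 16 is dirty: one write-back
```

This is also why a read miss can be slower under write-back: if the victim is dirty, its write-back happens before (or alongside) fetching the new block.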

SLIDE 15

Write Stalls

  • In write through, the processor may have to stall waiting for a write to complete
    – Called a write stall
    – Can employ a write buffer to enable the processor to proceed during the write-through
SLIDE 16

What to do on a Write Miss?

  • Write-allocate (or, fetch on write): load the block on a cache miss during a write
  • No-write allocate (or, write around): just write directly to main memory
  • Write-allocate usually goes with write-back, and no-write allocate goes with write-through

SLIDE 17

The Alpha AXP 21064 Cache

  • 34-bit physical address
    – 29 bits for block address
    – 5 bits for block offset
  • 8 KB cache, direct-mapped
    – 8 bits for index
    – 29 − 8 = 21 bits for tag
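The breakdown above can be checked arithmetically from the stated parameters (34-bit physical address, 8 KB direct-mapped cache, 32-byte blocks):

```python
address_bits = 34
cache_bytes = 8 * 1024
block_bytes = 32

offset_bits = (block_bytes - 1).bit_length()         # 5-bit block offset
num_blocks = cache_bytes // block_bytes              # 256 blocks = 256 sets
index_bits = (num_blocks - 1).bit_length()           # 8-bit index
tag_bits = address_bits - index_bits - offset_bits   # 34 - 8 - 5 = 21
```

The 29-bit block address on the slide is simply the 34 address bits minus the 5 offset bits.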

SLIDE 18

Steps in Memory Read

  • Four steps:
    – Step 1: CPU puts out the address
    – Step 2: Index selection
    – Step 3: Tag comparison; data read
    – Step 4: Data returned to CPU (assuming a hit)
  • This takes two cycles
SLIDE 19

Steps in Memory Write

  • Write-through policy is used
  • Write buffer with four entries
    – Each entry can have up to 4 words from the same block
    – Write merging: successive writes to the same block use the same write-buffer entry
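Write merging can be sketched as below. This is illustrative only: it assumes 16-byte blocks (4 words of 4 bytes), a buffer that never drains during the write burst, and byte-granular offset tracking:

```python
BLOCK = 16  # hypothetical block size: 4 words of 4 bytes

def buffer_writes(addresses, entries=4):
    """Merge a sequence of byte-address writes into buffer entries.
    Each entry is keyed by block address and records the offsets written."""
    buf = {}
    for a in addresses:
        blk = a - a % BLOCK          # block-aligned address
        if blk in buf:
            buf[blk].add(a - blk)    # merge into the existing entry
        elif len(buf) < entries:
            buf[blk] = {a - blk}     # allocate a fresh entry
    return buf

# Four sequential word writes occupy only two entries instead of four:
buf = buffer_writes([100, 104, 108, 112])
```

Merging keeps the small buffer from filling (and stalling the processor) under bursts of writes to nearby addresses.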

SLIDE 20

Some More Details

  • What happens on a miss?
    – Cache sends a signal to the CPU asking it to wait
    – No replacement policy required (direct-mapped)
    – Write miss ==> write-around
  • 8 KB separate instruction cache
SLIDE 21

Separate versus Unified Cache

  • Direct-mapped cache, 32-byte blocks, SPEC92, on DECstation 5000
  • Unified cache has twice the size of I-cache or D-cache
  • 75% instruction references

Miss rates:

Size     I-Cache   D-Cache   U-Cache
1 KB     3.06%     24.61%    13.34%
2 KB     2.26%     20.57%     9.78%
4 KB     1.78%     15.94%     7.24%
8 KB     1.10%     10.19%     4.57%
16 KB    0.64%      6.47%     2.87%
32 KB    0.39%      4.82%     1.99%
64 KB    0.15%      3.77%     1.35%
128 KB   0.02%      2.88%     0.95%
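Using the table's 16 KB rows and the 75% instruction-reference mix, a split configuration (16 KB I + 16 KB D) can be compared against a unified cache of the same total size (the 32 KB row); a sketch:

```python
i_miss, d_miss = 0.0064, 0.0647   # 16 KB I-cache and D-cache rows
u_miss = 0.0199                   # 32 KB unified row (same total size)

# Effective miss rate of the split caches, weighted by the reference mix:
split_miss = 0.75 * i_miss + 0.25 * d_miss
# About 2.10% for the split pair vs. 1.99% for the unified cache.
```

Despite the slightly higher effective miss rate, the split design lets instruction and data accesses proceed in the same cycle, which is one reason the 21064 uses separate 8 KB caches.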