CS240 Computer Organization

Department of Computer Science Wellesley College

Exploiting memory hierarchy

Cache basics

Basic structure of memory hierarchy

    Level           Speed       Size    Cost ($/GB)    Technology
    CPU registers   < 1 ns      32      high           CPU
    Cache           0.5-5 ns    KB-MB   $4K-$10K       SRAM
    Main memory     50-70 ns    MB-GB   $100-$200      DRAM
    Magnetic disk   5M-10M ns   GB-TB   $.50-$2        Magnetic disk

Copying blocks of information

  • A memory hierarchy can consist of many levels, but data is copied between only two adjacent levels at a time.
  • We focus on just two levels: the upper -- the one closer to the processor -- and the lower.

[Figure: the processor above the two levels of the hierarchy; a data block is transferred between the upper and lower levels]

Batting average

  • If the data requested by the processor appears in some block in the upper level, it’s a hit. Otherwise, it’s a miss.
  • The hit ratio is the fraction of hits to total requests.
  • Hit time is the time to access the upper level of memory.
  • Miss penalty is the time to replace a block in the upper level with the corresponding block from the lower level, plus the time to deliver this block to the processor.
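These quantities combine into the average memory access time: hit time plus miss rate times miss penalty. A minimal sketch in C (the formula is standard; the sample numbers are invented for illustration):

    #include <stdio.h>

    /* Average memory access time = hit time + miss rate * miss penalty. */
    double amat(double hit_time_ns, double miss_rate, double miss_penalty_ns) {
        return hit_time_ns + miss_rate * miss_penalty_ns;
    }

    int main(void) {
        /* Invented numbers: 1 ns hit time, 5% miss rate, 100 ns penalty. */
        printf("AMAT = %.1f ns\n", amat(1.0, 0.05, 100.0));  /* 6.0 ns */
        return 0;
    }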


Direct-mapped cache

[Figure: an eight-block direct-mapped cache. Memory addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, 11101 all end in 001, so all of them map to cache block 001: each memory block can live in exactly one cache block, given by (block address) modulo (number of cache blocks).]

Tags and validation stickers

[Figure: a 1024-entry direct-mapped cache. The 32-bit address (bits 31-0) is split into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2), and a 2-bit byte offset (bits 1-0). The index selects one of the 1024 entries; a hit is signaled when the entry is valid and its stored tag equals the address tag, and the 32-bit data word is read out.]

*MIPS words are aligned to multiples of 4 bytes, so the least significant 2 bits of every address are ignored.
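A sketch of how those three fields might be carved out of an address in C, using the 20/10/2 split from the figure (the variable names are just for illustration):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t addr   = 0x12345678;           /* an arbitrary example    */
        uint32_t offset = addr & 0x3;           /* byte offset: bits 1-0   */
        uint32_t index  = (addr >> 2) & 0x3FF;  /* index:       bits 11-2  */
        uint32_t tag    = addr >> 12;           /* tag:         bits 31-12 */
        printf("tag=0x%05x index=%u offset=%u\n",
               (unsigned)tag, (unsigned)index, (unsigned)offset);
        return 0;
    }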

Miss rate versus block size

  • Data are moved between memory units in blocks of words.
  • Larger blocks exploit spatial locality to lower the miss rate.
  • However, the miss rate may increase if the block size becomes a significant fraction of the cache size. Why?
  • Increased block size also increases the miss penalty.

[Plot: miss rate (0-40%) versus block size (4-256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB]

Handling cache misses

1. Send the original PC value (current PC - 4) to the memory.
2. Instruct main memory to perform a read and wait for the memory to complete its access.
3. Write the cache, putting the data from memory in the data portion, writing the upper bits of the address into the tag field, and turning the valid bit on.
4. Restart the instruction execution at the first step, which will refetch the instruction.
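Steps 2 and 3 amount to a block fill. A minimal sketch of that fill for a software model of the 1024-entry cache above (one word per block; memory_read is a stand-in for main memory, not a real API):

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_BLOCKS 1024

    struct line { bool valid; uint32_t tag; uint32_t data; };
    static struct line cache[NUM_BLOCKS];

    /* Stand-in for a main-memory read (step 2). */
    static uint32_t memory_read(uint32_t addr) { return addr ^ 0xDEADBEEF; }

    /* On a miss: fetch the word, then install data, tag, and valid bit
       (step 3). The caller restarts the access afterwards (step 4). */
    void handle_miss(uint32_t addr) {
        uint32_t word = memory_read(addr);
        uint32_t idx  = (addr >> 2) & (NUM_BLOCKS - 1);
        cache[idx].data  = word;
        cache[idx].tag   = addr >> 12;
        cache[idx].valid = true;
    }

    int main(void) { handle_miss(0x12345678); return 0; }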


What’s wrong with this picture?

[Figure: a store instruction writes data into the cache only, so the cache and main memory now hold different values for the same block]

Write-through

[Figure: a store instruction writes data into the cache and memory at the same time]

Write buffer

[Figure: a store instruction writes data into the cache . . . and into a write buffer . . . where it can cool its heels until written to main memory]

Write-back

[Figure: a store instruction writes data ONLY into the cache; the modified block is written into main memory only when it is replaced]
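A minimal sketch of the two store policies side by side (a toy model; the struct, the dirty-bit handling, and the tiny memory array are illustrative, not from the slides):

    #include <stdint.h>
    #include <stdbool.h>

    struct line { bool valid, dirty; uint32_t tag, data; };

    static uint32_t memory[1 << 16];   /* toy word-addressed main memory */

    /* Write-through: the cache line and memory are updated together. */
    void store_write_through(struct line *l, uint32_t addr, uint32_t val) {
        l->data = val;
        memory[(addr >> 2) & 0xFFFF] = val;
    }

    /* Write-back: only the cache line is updated; the dirty bit
       remembers that memory is stale. */
    void store_write_back(struct line *l, uint32_t val) {
        l->data  = val;
        l->dirty = true;
    }

    /* On replacement, a dirty line must be flushed to memory first. */
    void evict(struct line *l, uint32_t addr) {
        if (l->dirty)
            memory[(addr >> 2) & 0xFFFF] = l->data;
        l->valid = l->dirty = false;
    }

    int main(void) {
        struct line l = { true, false, 0, 0 };
        store_write_through(&l, 0x40, 7);   /* memory updated immediately  */
        store_write_back(&l, 9);            /* memory updated only at ...  */
        evict(&l, 0x40);                    /* ... eviction time           */
        return 0;
    }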


An example: Intrinsity FastMATH

  • MIPS architecture with a 12-stage pipeline and a simple cache implementation.
  • To pipeline without stalling, separate instruction and data caches are used.
  • Each cache is 16 KB, or 4K words, with 16-word blocks.

256-block cache with 16 words per block

[Figure: one FastMATH cache: 256 entries of 512 data bits (16 words) plus an 18-bit tag each. The 32-bit address is split into an 18-bit tag (bits 31-14), an 8-bit index (bits 13-6), a 4-bit block offset (bits 5-2), and a 2-bit byte offset (bits 1-0); the index selects an entry, the tag is compared for a hit, and a multiplexor uses the block offset to pick one of the 16 words.]
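Those field widths follow from the cache geometry; a quick sketch of the arithmetic (all values come from the slide above):

    #include <stdio.h>

    int main(void) {
        int cache_bytes   = 16 * 1024;   /* 16 KB per cache          */
        int words_per_blk = 16;          /* 16-word blocks           */
        int block_bytes   = words_per_blk * 4;
        int num_blocks    = cache_bytes / block_bytes;      /* 256   */

        int byte_offset_bits  = 2;       /* 4 bytes per word         */
        int block_offset_bits = 4;       /* 2^4 = 16 words per block */
        int index_bits        = 8;       /* 2^8 = 256 blocks         */
        int tag_bits = 32 - index_bits - block_offset_bits
                          - byte_offset_bits;               /* 18    */

        printf("%d blocks, %d tag bits\n", num_blocks, tag_bits);
        return 0;
    }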

Approximate instruction and data misses for SPEC2000 benchmarks

    Instruction miss rate    Data miss rate    Effective combined miss rate
    0.4%                     11.4%             3.2%

*This isn’t the whole story; the ultimate measure is the effect of memory on program execution time. More soon . . .

Increasing physical or logical memory width

[Figure: three memory organizations: (a) a one-word-wide memory connected to the cache by a one-word-wide bus; (b) a wide memory and bus, with a multiplexor to deliver one word at a time; (c) an interleaved organization with four one-word-wide memory banks]

*Assume a four-word cache block; 1 bus cycle to send the address; 15 bus cycles for each DRAM access initiated; 1 bus cycle to send a word.
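Under those assumptions, the miss penalty for fetching a four-word block differs sharply across the three organizations; a sketch of the arithmetic (this follows the usual textbook accounting for these figures):

    #include <stdio.h>

    int main(void) {
        /* (a) one word wide: 1 cycle to send the address, then for each
           of the 4 words a 15-cycle access plus a 1-cycle transfer. */
        int one_word_wide = 1 + 4 * 15 + 4 * 1;   /* 65 bus cycles */

        /* (b) four words wide: one access, one wide transfer. */
        int wide          = 1 + 15 + 1;           /* 17 bus cycles */

        /* (c) interleaved: the four banks access in parallel, then the
           words go over the narrow bus one at a time. */
        int interleaved   = 1 + 15 + 4 * 1;       /* 20 bus cycles */

        printf("a=%d b=%d c=%d\n", one_word_wide, wide, interleaved);
        return 0;
    }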


Memory references: 0, 1, 2, 3, 4, 3, 4, 15

[Table: a direct-mapped cache with four one-word blocks processing the sequence. References 0, 1, 2, 3 are cold misses that fill the four blocks; 4 misses and replaces Mem(0), since 4 mod 4 = 0; 3 and 4 then hit; 15 misses and replaces Mem(3), since 15 mod 4 = 3. Final tally: 6 misses, 2 hits.]

*Start with an empty cache - all blocks initially marked as not valid.

Memory references: 0, 4, 0, 4, 0, 4, 0, 4

[Table: the same direct-mapped cache. Blocks 0 and 4 both map to cache block 0 (0 mod 4 = 4 mod 4 = 0), so each reference evicts the other and all 8 references miss.]

*Start with an empty cache - all blocks initially marked as not valid.
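A toy direct-mapped simulator that reproduces both traces (four one-word blocks, as in these slides; the code is a sketch):

    #include <stdio.h>
    #include <stdbool.h>

    #define BLOCKS 4

    static bool valid[BLOCKS];
    static int  tags[BLOCKS];

    /* Returns true on a hit; on a miss, installs the block. */
    bool cache_access(int block_addr) {
        int idx = block_addr % BLOCKS;
        int tag = block_addr / BLOCKS;
        if (valid[idx] && tags[idx] == tag) return true;
        valid[idx] = true;          /* miss: replace whatever was here */
        tags[idx]  = tag;
        return false;
    }

    int main(void) {
        /* The ping-pong trace: prints 8 misses. The earlier trace
           0,1,2,3,4,3,4,15 gives 6 misses and 2 hits. */
        int refs[] = {0, 4, 0, 4, 0, 4, 0, 4};
        for (int i = 0; i < 8; i++)
            printf("%d: %s\n", refs[i],
                   cache_access(refs[i]) ? "hit" : "miss");
        return 0;
    }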

We seek to decrease miss rate, while not increasing hit time

  • We may be able to reduce cache misses by more flexible placement of blocks.
  • In a direct-mapped cache, a block can go in exactly one place.
  • That makes it easy to find, but . . .

Fully associative

  • At the other extreme is a scheme where a block can be placed anywhere in the cache.
  • But then how do we find it?


Middle ground: n-way set-associative cache

[Figure: eight cache blocks arranged four ways: one-way set associative (direct mapped; 8 sets of 1 block), two-way set associative (4 sets of 2), four-way set associative (2 sets of 4), and eight-way set associative (fully associative; 1 set of 8); each block holds a tag and data]

*The set containing a memory block is given by (Block number) modulo (Number of sets in the cache).
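A one-line consequence of that formula for the trace on the next slide, sketched in C (a two-way cache with four blocks, hence two sets):

    #include <stdio.h>

    int main(void) {
        int num_sets = 2;   /* four blocks, two per set */
        /* Blocks 0 and 4 map to the same set ... */
        printf("block 0 -> set %d\n", 0 % num_sets);  /* set 0 */
        printf("block 4 -> set %d\n", 4 % num_sets);  /* set 0 */
        /* ... but the set now has two ways, so they can coexist. */
        return 0;
    }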

Memory references: 0, 4, 0, 4, 0, 4, 0, 4

[Table: the same trace in a two-way set-associative cache. Blocks 0 and 4 share set 0 but occupy different ways, so only the first two references miss: 2 misses, 6 hits.]

*Start with an empty cache - all blocks initially marked as not valid.

Performance of set-associative caches

[Plot: miss rate (0-15%) versus associativity (one-way, two-way, four-way, eight-way) for cache sizes from 1 KB to 128 KB]

Implementing a 4-way set-associative cache

[Figure: a four-way set-associative cache with 256 sets. The address supplies a 22-bit tag (bits 31-10) and an 8-bit index (bits 9-2); the index reads four (valid, tag, data) entries in parallel, four comparators check the tags, and a 4-to-1 multiplexor delivers the data from the matching way along with the hit signal.]


Choosing which block to replace

  • When a miss occurs in a

direct-mapped cache, the requested block can only go in one place.

  • In an associative cache we

have a choice.

  • The most commonly

scheme least recently used (LRU).
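For a two-way set, LRU needs only a single bit per set recording which way was used last; a minimal sketch (the struct and names are illustrative):

    #include <stdio.h>
    #include <stdbool.h>

    struct set {
        int  tag[2];
        bool valid[2];
        int  mru;   /* index (0 or 1) of the most-recently-used way */
    };

    /* Pick the victim way for a miss in this set under LRU. */
    int lru_victim(const struct set *s) {
        if (!s->valid[0]) return 0;   /* fill an invalid way first   */
        if (!s->valid[1]) return 1;
        return 1 - s->mru;            /* evict the way not used last */
    }

    int main(void) {
        struct set s = { {5, 9}, {true, true}, 0 };  /* way 0 used last */
        printf("victim: way %d\n", lru_victim(&s));  /* way 1 */
        return 0;
    }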

Reducing miss penalty using multiple caches

  • A two-level cache allows the primary cache to focus on minimizing the hit time to yield a shorter clock cycle, . . .
  • . . . while the secondary cache focuses on miss rate to reduce the penalty of long memory access times.

Balancing cache accounts

  • Suppose we have a 5 GHz processor with a base CPI of 1.0, assuming all references hit in the primary cache.
  • Assume a main memory access time of 100 ns, including all miss handling.
  • Suppose the miss rate per instruction at the primary cache is 2%.
  • How much faster would the processor be if we add a secondary cache with a 5 ns access time that is large enough to reduce the miss rate to main memory to 0.5%? (A worked sketch follows.)
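A worked sketch of the arithmetic (the accounting follows the standard treatment of this kind of example):

    #include <stdio.h>

    int main(void) {
        double cycle_ns = 1.0 / 5.0;            /* 5 GHz -> 0.2 ns/cycle */

        double mem_cycles = 100.0 / cycle_ns;   /* 100 ns -> 500 cycles  */
        double l2_cycles  =   5.0 / cycle_ns;   /* 5 ns   -> 25 cycles   */

        /* One level: 2% of instructions pay the full memory penalty. */
        double cpi_one = 1.0 + 0.02 * mem_cycles;            /* 11.0 */

        /* Two levels: 2% pay the L2 penalty; 0.5% still go to memory. */
        double cpi_two = 1.0 + 0.02 * l2_cycles
                             + 0.005 * mem_cycles;           /* 4.0 */

        printf("speedup = %.2f\n", cpi_one / cpi_two);       /* 2.75 */
        return 0;
    }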

If only things were that simple

[Plots: theoretical behavior of Radix sort vs. Quicksort, and observed behavior of Radix sort vs. Quicksort, as the problem size grows from 4 to 4096 K items to sort]


The real scoop

  • Memory system performance is often the critical factor.
  • Multilevel caches and pipelined processors make it harder to predict outcomes.

[Plot: Radix sort vs. Quicksort as the size grows from 4 to 4096 K items to sort; vertical scale 1 to 5]