SLIDE 1

Memory Hierarchy

Instructor: Jun Yang

11/19/2009


SLIDE 2

Motivation

[Figure: Processor-DRAM memory gap (latency), performance vs. time, 1980-2000. Processor performance grows ~60%/yr ("Moore's Law", 2X/1.5yr) while DRAM grows ~9%/yr (2X/10 yrs); the processor-memory performance gap grows ~50% per year.]

SLIDE 3

The Goal: Illusion of Large, Fast, Cheap Memory

  • Goal: a large and fast memory
  • Fact:

– Large memories are slow
– Fast memories are small

  • How do we create a memory that is large, cheap, and fast (most of the time)?

– Hierarchy


SLIDE 4

Memory Hierarchy of a Modern Computer System

  • By taking advantage of the principle of locality:

– Present the user with as much memory as is available in the cheapest technology.
– Provide access at the speed offered by the fastest technology.


[Figure: levels of the hierarchy — processor (control, datapath, registers), on-chip caches, off-chip caches (SRAM), main memory (DRAM), secondary storage (disk), tertiary storage (tape). Speed grows from ~1 ns at the registers to 10,000,000s of ns (10s of ms) at disk and 10,000,000,000s of ns (10s of seconds) at tape; size grows from 100s of bytes at the registers through Ks, Ms, and Gs to Ts at tape.]

SLIDE 5

Memory Hierarchy: Why Does it Work? Locality!

  • Temporal Locality (Locality in Time):

=> Keep most recently accessed data items closer to the processor

  • Spatial Locality (Locality in Space):

=> Move blocks consisting of contiguous words to the upper levels


[Figures: probability of reference vs. address space (0 to 2^n − 1), illustrating locality; and an upper memory level and a lower memory level exchanging blocks X and Y to/from the processor.]
SLIDE 6

Memory Hierarchy Technology

  • Random Access:

– "Random" is good: access time is the same for all locations
– DRAM: Dynamic Random Access Memory

  • High density, low power, cheap, slow
  • Dynamic: needs to be "refreshed" regularly

– SRAM: Static Random Access Memory

  • Low density, high power, expensive, fast
  • Static: content lasts "forever" (until power is lost)

  • "Not-so-random" Access Technology:

– Access time varies from location to location and from time to time
– Examples: Disk, CD-ROM

  • Sequential Access Technology: access time linear in location (e.g., Tape)

  • We will concentrate on random access technology

– Main Memory: DRAMs; Caches: SRAMs


SLIDE 7

Introduction to Caches

  • Cache

– is a small, very fast memory (SRAM, expensive)
– contains copies of the most recently accessed memory locations (data and instructions): temporal locality
– is fully managed by hardware (unlike virtual memory)
– storage is organized in blocks of contiguous memory locations: spatial locality
– unit of transfer to/from main memory (or L2) is the cache block

  • General structure

– n blocks per cache, organized in s sets
– b bytes per block
– total cache size: n*b bytes


SLIDE 8

Caches

  • For each block:

– an address tag: unique identifier
– state bits:

  • (in)valid
  • modified

– the data: b bytes

  • Basic cache operation

– every memory access is first presented to the cache
– hit: the word being accessed is in the cache; it is returned to the CPU
– miss: the word is not in the cache:

  • a whole block is fetched from memory (L2)
  • an “old” block is evicted from the cache (kicked out), which one?
  • the new block is stored in the cache
  • the requested word is sent to the cpu


SLIDE 9

Direct Mapped Cache

[Figure: an 8-block direct-mapped cache (indices 000-111); memory blocks 00001, 01001, 10001, 11001 all map to cache block 001, and 00101, 01101, 10101, 11101 all map to cache block 101.]

  • Cache stores a subset of memory blocks
  • Mapping: a memory block's cache index is its block address modulo the number of blocks in the cache (see the sketch below)
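
A tiny Python sketch of this mapping (not from the slides), using the 8-block cache shown in the figure:

    # Direct-mapped placement: each memory block maps to exactly one cache block.
    NUM_CACHE_BLOCKS = 8            # the 8-entry cache (indices 000-111) above

    def cache_index(block_address):
        # With a power-of-two number of blocks this keeps just the low-order bits.
        return block_address % NUM_CACHE_BLOCKS

    for addr in (0b00001, 0b01001, 0b10001, 0b11001):
        print(f"memory block {addr:05b} -> cache block {cache_index(addr):03b}")  # all -> 001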


SLIDE 10

Block Size > 4 bytes (1 word)

[Figure: a 4-block cache (indices 00-11) with multi-byte blocks; consecutive byte addresses such as 00000-00011 fall in the same block, so the block address (upper bits) selects the cache index and the low-order bits select the byte within the block.]

SLIDE 11

Two-way set-associative mapping

[Figure: a two-way set-associative cache with 4 sets (indices 00-11) in each of Way 0 and Way 1, 4 bytes per block; a memory block may be placed in either way of its set.]

SLIDE 12

Addressing the Cache

– Direct mapped cache: one block per set.
– Set-associative mapping: n/s blocks per set.
– Fully associative mapping: one set per cache (s = n).


Address layout (most significant bits on the left):

– Direct mapping:          | tag | index (log n bits)   | offset (log b bits) |
– Set-associative mapping: | tag | index (log n/s bits) | offset (log b bits) |
– Fully associative:       | tag |                        offset (log b bits) |
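
A minimal Python sketch of these field widths (an illustration, not from the slides). To avoid the overloaded symbol s, it is parameterized by the total number of blocks, the blocks per set (associativity), and the block size in bytes:

    from math import log2

    def address_fields(num_blocks, blocks_per_set, block_bytes, addr_bits=32):
        # blocks_per_set = 1 is direct mapped; = num_blocks is fully associative.
        num_sets = num_blocks // blocks_per_set
        offset_bits = int(log2(block_bytes))     # log b
        index_bits = int(log2(num_sets))         # 0 for fully associative
        tag_bits = addr_bits - index_bits - offset_bits
        return tag_bits, index_bits, offset_bits

    # Assumed example: 32 blocks of 32 bytes each.
    print(address_fields(32, 1, 32))    # direct mapped     -> (22, 5, 5)
    print(address_fields(32, 2, 32))    # 2-way set assoc.  -> (23, 4, 5)
    print(address_fields(32, 32, 32))   # fully associative -> (27, 0, 5)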

SLIDE 13

Example: 1 KB Direct Mapped Cache with 32 B Blocks

  • For a 2^N byte cache:

– The uppermost (32 - N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block Size = 2^M)
– On a cache miss, pull in the complete "Cache Block" (or "Cache Line")

[Figure: 1 KB direct-mapped cache with 32 B blocks. The 32-bit address splits into Cache Tag (bits 31-10, stored as part of the cache "state", example 0x50), Cache Index (bits 9-5, example 0x01), and Byte Select (bits 4-0, example 0x00); each of the 32 entries holds a Valid Bit, a Cache Tag, and 32 bytes of Cache Data (Byte 0 ... Byte 31, Byte 32 ... Byte 63, ..., Byte 992 ... Byte 1023).]
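
A short Python check of this breakdown (the address 0x14020 is an assumption chosen to tie together the example field values shown in the figure):

    # 1 KB direct-mapped cache, 32 B blocks: 32 blocks -> 5 index bits, 5 byte-select bits.
    CACHE_BYTES, BLOCK_BYTES = 1024, 32
    NUM_BLOCKS = CACHE_BYTES // BLOCK_BYTES          # 32

    def split_address(addr):
        byte_select = addr & (BLOCK_BYTES - 1)       # bits [4:0]
        index = (addr >> 5) & (NUM_BLOCKS - 1)       # bits [9:5]
        tag = addr >> 10                             # bits [31:10]
        return tag, index, byte_select

    assert split_address(0x14020) == (0x50, 0x01, 0x00)   # tag, index, byte select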

SLIDE 14

Set Associative Cache Architecture

  • N-way set associative: N entries for each Cache Index

– N direct-mapped caches operate in parallel

  • Example: Two-way set associative cache

– Cache Index selects a "set" from the cache
– The two tags in the set are compared to the address tag in parallel
– Data is selected based on the tag comparison result

[Figure: two-way set-associative cache datapath — the Cache Index selects one set in each way; each way's Valid bit and Cache Tag are compared against the address tag in parallel, the compare results are ORed to form Hit, and Sel1/Sel0 drive a mux that picks the matching way's Cache Block.]

SLIDE 15

Example: Fully Associative Architecture

  • Fully Associative Cache

– Forget about the Cache Index
– Compare the Cache Tags of all cache entries in parallel
– Example: with 32 B blocks, we need N 27-bit comparators

  • By definition: Conflict Miss = 0 for a fully associative cache

[Figure: fully associative cache — every entry holds a Valid Bit, a 27-bit Cache Tag, and 32 bytes of Cache Data; the address's Cache Tag is compared against all entries' tags in parallel (one comparator per entry), and the Byte Select (e.g., 0x01) picks the byte within the matching block.]

SLIDE 16

Cache Organization – Example

  • Instruction & Data caches:

– Because instructions have much lower miss rates, the cache is split (at the L1 level) into instruction and data caches; otherwise it is unified.
– Main advantages: no interference, data misses do not stall instruction fetch, each cache can be optimized for different access patterns, etc.

  • AMD Opteron Cache

– 64 KB I & D caches (L1), 2-way set associative
– b = 64 bytes
– write-back; write-allocate on a write miss
– organization shown in Fig. 5.19
– read & write hits: 2 cycles, pipelined
– victim buffer: 8 entries


SLIDE 17

Alpha 21264 Cache Organization


SLIDE 18

Block Replacement

  • Block replacement

– not an issue with direct mapped caches
– replacement strategy is more important in small caches than in large ones
– replacement policies:

  • LRU: evict the block that has been unused for the longest time. Good for temporal locality, but requires complex state

  • pseudo-random: randomly selected
  • FIFO: oldest block. Difference with LRU?


SLIDE 19
LRU example

  • Assume a fully-associative cache with two blocks. Which of the following memory references miss in the cache?

– assume distinct addresses go to distinct blocks

References: A B A C B A B

SLIDE 20

LRU example

  • Assume a fully-associative cache with two blocks. Which of the following memory references miss in the cache?


References:      A      B      A      C      B      A      B
Hit or miss:     miss   miss   hit    miss   miss   miss   hit
Tags afterwards: A -    A B    A B    A C    B C    B A    B A

On a miss, we replace the LRU block. On a hit, we just update the LRU.
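
A small Python simulation (not from the lecture) that reproduces the hit/miss pattern above for a two-block, fully associative LRU cache:

    from collections import OrderedDict

    def simulate_lru(refs, num_blocks=2):
        cache = OrderedDict()                  # keys kept in LRU -> MRU order
        outcome = []
        for block in refs:
            if block in cache:
                cache.move_to_end(block)       # hit: just update the LRU order
                outcome.append("hit")
            else:
                if len(cache) == num_blocks:
                    cache.popitem(last=False)  # miss: evict the LRU block
                cache[block] = True
                outcome.append("miss")
        return outcome

    print(simulate_lru("ABACBAB"))
    # ['miss', 'miss', 'hit', 'miss', 'miss', 'miss', 'hit']  -> 5 misses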

SLIDE 21

Exercise

  • Assume you have a fully associative cache with 4 entries. For the following memory block address sequence, which entry becomes the LRU at the end?

8 9 5 2 6 5 9 10 3


SLIDE 22

Write Policies

  • Writes are hard

– read: the tag check and the data read can proceed concurrently
– write: it is destructive, so the tag must be checked first; writes are slower

  • Write strategies

– when to update memory?

  • on every write (write-through)
  • when a modified block is replaced (write-back)

– what to do on a write miss?

  • fetch the block into the cache (write-allocate), used with write-back
  • do not fetch (write-around), used with write-through

  • Trade-offs

– Write-back:

  • uses less bus bandwidth

– Write-through:

  • keeps main memory (MM) consistent with the cache (CM),
  • good for DMA.
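
A toy Python sketch (an illustration, not the slides' example) contrasting the two strategies on a single cache block; the only thing it counts is traffic to the next level:

    class WriteThroughCache:
        # Write-through + write-around: memory sees every write; a write miss
        # does not allocate a block.
        def __init__(self):
            self.block, self.mem_writes = None, 0
        def write(self, tag, data):
            self.mem_writes += 1
            if self.block and self.block[0] == tag:
                self.block = (tag, data)             # update the cached copy on a hit

    class WriteBackCache:
        # Write-back + write-allocate: memory is written only when a dirty block
        # is evicted; a write miss allocates the block (the fetch is omitted here).
        def __init__(self):
            self.block, self.mem_writes = None, 0
        def write(self, tag, data):
            if self.block and self.block[0] == tag:
                self.block = (tag, data, True)       # hit: mark the block dirty
            else:
                if self.block and self.block[2]:
                    self.mem_writes += 1             # evicted dirty block goes to memory
                self.block = (tag, data, True)

    wt, wb = WriteThroughCache(), WriteBackCache()
    for t in (1, 1, 1, 2):                           # three writes to block 1, then block 2
        wt.write(t, "x"); wb.write(t, "x")
    print(wt.mem_writes, wb.mem_writes)              # 4 vs. 1: write-back uses less bandwidth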


SLIDE 23

Write Buffers

  • Write-through Buffers

– buffers words to be written to the L2 cache/memory, along with their addresses
– 2 to 4 entries deep
– all read misses are checked against pending writes for dependencies (associatively)
– can coalesce writes to the same address (see the sketch after this list)

  • Write-back Buffers

– sits between a write-back cache and L2 or MM
– algorithm:

  • move the dirty block to the write-back buffer

  • read new block
  • write dirty block in L2 or MM

– can be associated with victim cache (later)
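
A minimal Python sketch (an illustration, not the slides' design) of a write-through buffer that coalesces writes to the same address and that read misses consult before going to L2:

    class WriteBuffer:
        def __init__(self, depth=4):
            self.depth = depth
            self.pending = {}                        # address -> data waiting to reach L2/MM

        def add_write(self, addr, data):
            if addr in self.pending or len(self.pending) < self.depth:
                self.pending[addr] = data            # coalesce, or take a free entry
                return True
            return False                             # buffer full: the write stalls

        def check_read_miss(self, addr):
            # A read miss with a pending write to the same address must get the new data.
            return self.pending.get(addr)

    wb = WriteBuffer(depth=2)
    wb.add_write(0x100, 7)
    wb.add_write(0x100, 8)                           # coalesced with the first write
    print(wb.check_read_miss(0x100))                 # 8: forwarded from the buffer
    print(wb.check_read_miss(0x200))                 # None: go on to L2/memory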

[Figure: the write buffer sits between the L1 and L2 caches; read data returns to the CPU.]

SLIDE 24


Improving CPU Performance

  • In the past 10 years, there have been over 5000 research papers on reducing the gap between CPU and memory speeds.

  • We will address some of them in four categories:

– Reducing the miss rate
– Reducing the cache miss penalty
– Reducing the cache miss penalty or miss rate via parallelism
– Reducing the hit time

SLIDE 25


Cache Misses

  • Types of cache misses

– the three Cs:

  • Compulsory: first access to a block is a miss.
  • Conflict: collision misses, blocks map to the same set.
  • Capacity: replaced blocks that are later referenced, cache too small.

– the fourth C:

  • Coherence: shared memory, invalid copies
  • Relative effects

– Fully associative placement: no conflict misses.
– Larger block size might reduce compulsory misses.
– Larger caches have lower capacity misses.

Most of these have negative impacts on hit time and therefore cycle time:
– a direct mapped cache can be faster than a set associative one

SLIDE 26


Reducing Miss Rate (1–3)

  • 1. Larger block size
  • Takes advantage of locality
  • But, longer miss penalty and maybe more conflict misses (with the same cache size)

  • Block size increase should not reach the point where the miss rate increases.

  • 2. Larger cache size
  • Longer hit time – suitable for lower level caches.
  • 3. Higher associativity
  • 2:1 cache rule of thumb: a direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2 (for cache sizes < 128 KB)

  • Longer hit time

Improving an aspect of AMAT comes at the expense of another!

SLIDE 27


Reducing Miss Rate (4)

  • 4. Pseudoassociative cache – a direct-mapped cache having the same hit rate as a 2-way set-associative cache

  • If the first access is a miss, try an alternative block (by modifying an address portion)

  • A normal hit time and a pseudo-hit time – in addition to the miss penalty

SLIDE 28


Reducing Cache Miss Penalty (1)

  • 1. Multilevel Caches – “the more the merrier”

– Add another level behind the L1 cache to speed up access to memory (why not combine the two levels into one? Because a larger cache incurs a longer access time and could stretch the clock cycle, hurting all instructions!)

AMAT = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Local miss rate_L2 × Miss penalty_L2)

Average memory stall time = Miss rate_L1 × Hit time_L2 + Miss rate_L1 × Local miss rate_L2 × Miss penalty_L2

Average memory stalls per instruction = Misses per instruction_L1 × Hit time_L2 + Misses per instruction_L2 × Miss penalty_L2

SLIDE 29


Example

  • For every 1000 memory references there are 40 misses in L1 and 20 misses in L2; the hit time in L1 is 1 cycle and in L2 is 10 cycles; the miss penalty from L2 to memory is 100 cycles; there are 1.5 memory references per instruction. What are the AMAT and the average stall cycles per instruction?

– AMAT = [1 + 40/1000 × (10 + 20/40 × 100)] cycles = 3.4 cycles
– Average stall cycles per instruction = 1.5 × 40/1000 × 10 + 1.5 × 20/1000 × 100 = 3.6 cycles

  • Note: we have not distinguished reads and writes. L2 is accessed only on an L1 miss, i.e., a write-back cache.
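
A quick Python check of the arithmetic (all values come from the example above):

    misses_L1, misses_L2, refs = 40, 20, 1000        # misses per 1000 memory references
    hit_L1, hit_L2, penalty_L2 = 1, 10, 100          # cycles
    refs_per_instr = 1.5

    miss_rate_L1 = misses_L1 / refs                  # 0.04
    local_miss_rate_L2 = misses_L2 / misses_L1       # 0.5

    amat = hit_L1 + miss_rate_L1 * (hit_L2 + local_miss_rate_L2 * penalty_L2)
    stalls = (refs_per_instr * misses_L1 / refs * hit_L2 +
              refs_per_instr * misses_L2 / refs * penalty_L2)
    print(amat, stalls)                              # 3.4 and 3.6 cycles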
SLIDE 30


Reducing Cache Miss Penalty (2,3)

  • 2. Critical word first – “impatience”

– Fetch the requested (critical) word first and wrap around the rest of the L2 cache block (wrapped fetch)

  • 3. Serve reads before writes have been completed – "preference"

– Recall: in an OoO processor, the reorder buffer contains loads/stores (waiting for address computation or memory) in program order.

[Figure: an 8-word L2 block (words 1-8) delivered in wrapped order 5 6 7 8 1 2 3 4 when word 5 is requested; and a load/store queue holding Ld r4, 1000 and St 35, 1000 headed to memory.]
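
A one-function Python sketch of the wrapped fetch order (words numbered 1..N as in the figure; an illustration, not from the slides):

    def wrapped_fetch_order(requested_word, words_per_block=8):
        # Start with the critical word, then wrap around the rest of the block.
        return [(requested_word - 1 + i) % words_per_block + 1
                for i in range(words_per_block)]

    print(wrapped_fetch_order(5))    # [5, 6, 7, 8, 1, 2, 3, 4]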

SLIDE 31


Reducing Cache Miss Penalty (3,4)

– Complications with the write buffer: it stores updated blocks that L2 hasn't seen yet! What happens on an L1 read miss?

  • Conventionally, the read miss stalls and waits until the write buffer flushes its contents to L2, and only then accesses L2 (slow).

  • Alternatively, the L1 read miss can check the write buffer before going to L2 (faster).

  • 4. Improving write buffer (in write-through) efficiency – "companionship"

– Two writes to the same address are coalesced.
– Writes stall if no empty entries are present -> utilize the entries efficiently using write merging.

SLIDE 32


Write Merge
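
The original figure is not reproduced here. The Python sketch below (an assumption, not the slide's own example) shows the merging idea: writes to different words of the same block share one buffer entry, each entry keeping per-word data:

    WORDS_PER_ENTRY = 4                                      # assumed 16-byte entries, 4-byte words

    class MergingWriteBuffer:
        def __init__(self, depth=4):
            self.entries = []                                # each: {"block": addr, "words": {word#: data}}
            self.depth = depth

        def write(self, addr, data):
            block, offset = divmod(addr, WORDS_PER_ENTRY * 4)
            word = offset // 4
            for e in self.entries:
                if e["block"] == block:                      # merge into an existing entry
                    e["words"][word] = data
                    return True
            if len(self.entries) < self.depth:
                self.entries.append({"block": block, "words": {word: data}})
                return True
            return False                                     # buffer full: the write stalls

    buf = MergingWriteBuffer(depth=1)
    for a in (96, 100, 104, 108):                            # four consecutive word writes
        buf.write(a, 0)
    print(len(buf.entries))                                  # 1: all four merged into a single entry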

SLIDE 33


Reducing Cache Miss Penalty (5)

  • 5. Victim Cache – "recycling"

– Holds victim blocks discarded from the L1 cache due to replacement.
– Small (otherwise it would be an L2) and fully associative.
– Checked on an L1 miss. If the block is found, it is swapped back into L1 (the block that previously took its place moves into the victim cache).
– Works better for small L1 caches, since it saves victim blocks from conflict misses.
– Effective: a 4-entry victim cache can remove about 1/4 of the misses in a 4 KB L1 cache [Jouppi, 1990].

[Figure: on an L1 miss that hits in the victim cache, the block is returned to L1 instead of being fetched from L2.]
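
A compact Python sketch of the check-and-swap on an L1 miss (the FIFO policy inside the victim cache is an assumption; the slide does not fix one):

    from collections import deque

    class VictimCache:
        def __init__(self, entries=4):
            self.blocks = deque(maxlen=entries)          # oldest victim falls out when full

        def insert(self, victim_block):
            self.blocks.append(victim_block)             # block just evicted from L1

        def lookup_and_swap(self, missed_block, displaced_l1_block):
            # On an L1 miss: if the block is here, send it back to L1 and keep
            # the block it displaces; otherwise the request goes on to L2.
            if missed_block in self.blocks:
                self.blocks.remove(missed_block)
                self.blocks.append(displaced_l1_block)
                return True
            return False

    vc = VictimCache()
    vc.insert("A")                                       # A was just evicted from L1
    print(vc.lookup_and_swap("A", "B"))                  # True: A returns to L1, B becomes a victim
    print(vc.lookup_and_swap("C", "D"))                  # False: fetch C from L2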

SLIDE 34


Reducing Hit Time

  • 1. Use small and simple L1 cache

– Small hardware is faster
– A direct-mapped cache is faster

  • 2. Pipeline writes

– On a write, the tag must be checked BEFORE the data write is done.
– Separate the tag check and the data write, delaying the data write relative to the tag check, so back-to-back writes can be pipelined (Alpha 21064).

  • 3. Trace cache, another type of I-cache (Pentiums)

– Stores dynamic instruction sequences (traces) instead of static sequences.
– Includes multiple taken branches, turning them into straight-line code.

  • Reduces I-cache misses due to fetching branch targets, since the target is now just the next instruction in the trace

– Downside:

  • An instruction may occur repeatedly in the trace cache – wasting space
SLIDE 35


Reducing Hit Time

4. Avoid address translation

– Addresses sent to the cache need to be translated from virtual to physical addresses – done by the translation lookaside buffer (TLB)
– Translation occurs before going to the L1 cache – this costs time
– Virtually addressed cache:

+ Index the cache using virtual addresses, avoid TLB accesses, save time
− A context switch causes flushing of the entire cache
− Aliases: different virtual addresses may map to the same physical address – need a protection mechanism!

5. Way prediction – approaching the hit time of a direct-mapped cache for set-associative caches

  • Do not compare all the tags.
  • Predict one way and access it just as in a direct-mapped cache.
  • Prediction is done during the previous cache access (extra prediction bits are maintained).

  • On a prediction miss, all the remaining ways are compared, as in a normal set-associative cache – this takes longer.

  • Used in the Alpha 21264 instruction cache.
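
A rough Python sketch of the idea for a 2-way cache (an illustration only, not the Alpha 21264 design; the random choice of fill way is an assumption):

    import random

    class WayPredictedCache:
        def __init__(self, num_sets=4):
            self.tags = [[None, None] for _ in range(num_sets)]   # 2 ways per set
            self.predicted_way = [0] * num_sets                   # one prediction bit per set

        def access(self, tag, index):
            pred = self.predicted_way[index]
            if self.tags[index][pred] == tag:
                return "fast hit"                    # only the predicted way was compared
            other = 1 - pred
            if self.tags[index][other] == tag:
                self.predicted_way[index] = other    # remember this way for next time
                return "slow hit"                    # extra time to compare the other way
            victim = random.choice((0, 1))           # miss: fill one way, predict it next time
            self.tags[index][victim] = tag
            self.predicted_way[index] = victim
            return "miss"

    c = WayPredictedCache()
    print(c.access(0x50, 2), c.access(0x50, 2))      # miss, then fast hit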
SLIDE 36


Parallelize Memory Access with Execution

  • 1. Nonblocking cache (lockup-free)

– Continue to service cache accesses during a miss – works for OoO processors.
– Significantly increases the complexity of the cache controller.

  • 2. Hardware prefetching

– Get the data or instructions before they are accessed.
– Does not slow down other cache activities:

  • Continue to serve other instructions or data while waiting for the prefetched data – normally nonblocking.

– Never replace useful data (use additional buffers).
– Stream buffer for instructions:

  • On I-cache miss, check stream buffer.
  • Stream buffer hit → move the block into the I-cache, refill (prefetch) the stream buffer with the next block.

  • Stream buffer miss → fetch the target block into the I-cache and the next block into the stream buffer (sketched below).
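
A minimal Python sketch of this algorithm (simplified, by assumption, to a one-entry buffer; real stream buffers hold several sequential blocks):

    class StreamBuffer:
        def __init__(self):
            self.block = None                        # address of the prefetched block

        def on_icache_miss(self, icache, target_block):
            if self.block == target_block:           # stream buffer hit
                icache.add(target_block)             # move the block into the I-cache
                self.block = target_block + 1        # refill with the next sequential block
                return "prefetch hit"
            icache.add(target_block)                 # stream buffer miss: fetch the target...
            self.block = target_block + 1            # ...and prefetch the next block
            return "prefetch miss"

    icache, sb = set(), StreamBuffer()
    print(sb.on_icache_miss(icache, 100))            # prefetch miss: 100 fetched, 101 prefetched
    print(sb.on_icache_miss(icache, 101))            # prefetch hit: 101 comes from the stream buffer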