Why memory hierarchy (3rd Ed: p.468-487, 4th Ed: p.452-470) - PowerPoint PPT Presentation

SLIDE 1

dt10 2011 6.1

Why memory hierarchy

(3rd Ed: p.468-487, 4th Ed: p. 452-470)

  • users want unlimited fast memory
  • fast memory expensive, slow memory cheap
  • cache: small, fast memory near CPU
  • large, slow memory (main memory, disk, …)
  • connected to faster memory (one level up)
SLIDE 2

Core-2 Duo Extreme

SLIDE 3

Locality principles

  • temporal locality: items recently used by the CPU tend to be referenced again soon

– guides the cache replacement policy: what to replace when the cache is full

  • spatial locality: items with addresses close to recently-used items tend to be referenced soon

– motivates fetching multiple adjacent words (a block) at once

SLIDE 4

Cache operation

  • cache organised in blocks of one or more words
  • cache hit: CPU finds required block in cache
  • otherwise, cache miss: get data from main memory
  • hit rate: ratio of cache hits to total memory accesses
  • hit time: time to access the cache + time to determine hit or miss
  • miss penalty: time to replace a block in the cache + time to transfer it to the CPU

SLIDE 5

Direct mapped cache

  • each memory location mapped to one cache location

e.g. cache index = (memory block address) mod (number of blocks in cache)

  • multiple memory locations -> one cache location

e.g. 8 blocks in cache, cache location 001 may contain items from memory locations 00001, 01001, 10001
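The modulo mapping above can be sketched in a few lines of Python. The 8-block cache matches the slide's example; the function name is ours:

```python
# Sketch of the direct mapped index calculation: a memory block address
# maps to exactly one cache location via modulo arithmetic.
NUM_BLOCKS = 8  # number of blocks in the cache (indices 000 to 111)

def cache_index(block_address: int) -> int:
    """Map a memory block address to its unique cache location."""
    return block_address % NUM_BLOCKS

# All of these memory locations share cache index 001, as on the slide:
for addr in (0b00001, 0b01001, 0b10001, 0b11001):
    assert cache_index(addr) == 0b001
```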

SLIDE 6

  • address is modulo the number of blocks in the cache

Direct mapped cache mapping

[Figure: 8-block direct mapped cache. Memory addresses 00001, 01001, 10001, 11001 map to cache index 001; addresses 00101, 01101, 10101, 11101 map to cache index 101; cache indices run from 000 to 111.]

SLIDE 7

Address translation

  • lower portion of the address:

– cache index

  • upper portion:

– compared with the stored tag

  • hit if:

– tag matches
– and block is valid

  • What about a miss?
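As a sketch, a direct mapped lookup with tag compare and valid bit might look like the following; the names and the fill-on-miss shortcut are illustrative, not from the slides:

```python
# Minimal direct mapped cache lookup: lower address bits select the index,
# upper bits are compared against the stored tag, and a hit also requires
# the valid bit. On a miss we simply install the new tag (fetch elided).
NUM_BLOCKS = 8              # cache blocks (3-bit index)
valid = [False] * NUM_BLOCKS
tags = [0] * NUM_BLOCKS

def lookup(block_address: int) -> bool:
    index = block_address % NUM_BLOCKS       # lower portion: cache index
    tag = block_address // NUM_BLOCKS        # upper portion: tag
    if valid[index] and tags[index] == tag:  # hit: valid and tag matches
        return True
    valid[index] = True                      # miss: fill block, record tag
    tags[index] = tag
    return False

assert lookup(0b01001) is False   # cold miss fills index 001 with tag 01
assert lookup(0b01001) is True    # same block again: hit
assert lookup(0b10001) is False   # same index, different tag: conflict miss
```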
SLIDE 8

Handling misses: by exception

  • cache miss on instruction read:

– restore PC: PC = PC - 4
– send address to main memory and wait (stall)
– write data received from memory into the cache
– refetch the instruction (from the restored PC)

  • cache miss on data read:

– similar: stall CPU until data from main memory are available in cache

  • which is probably more common?
SLIDE 9

Cache write policy

  • need to maintain cache consistency

– how to keep data in cache and in memory consistent?

  • write back: write to cache only

– complex control: need a ‘dirty’ bit for cache entries
– have to flush dirty entries when they are evicted

  • write through: write to both cache and memory

– memory write bandwidth can cause bottleneck

  • What happens with shared memory multiprocessors?
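An illustrative (and deliberately simplified) way to see the bandwidth difference between the two policies: count memory writes for a run of stores that all hit the same cache block. The function and numbers are ours, not from the slides:

```python
# Memory writes generated by a stream of stores that all hit one cache
# block, under each write policy. Write-through sends every store to
# memory; write-back writes the dirty block back once, on eviction.
def memory_writes(policy: str, num_stores: int) -> int:
    if policy == "write-through":
        return num_stores   # every store also goes to memory
    if policy == "write-back":
        return 1            # one write-back when the dirty block is evicted
    raise ValueError(policy)

assert memory_writes("write-through", 100) == 100
assert memory_writes("write-back", 100) == 1
```

This is why write-through needs the write buffer of the next slide: its memory write traffic scales with store frequency.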
SLIDE 10

Write buffer for write through

  • insert buffer between cache and memory

– processor: writes data into the cache and the write buffer
– memory controller: writes the contents of the write buffer to memory

  • write buffer is just a FIFO queue

– works fine if store frequency (w.r.t. time) « 1 / (memory write cycle time)
– otherwise have write buffer saturation

SLIDE 11

Write buffer saturation

  • store buffer overflows when

– CPU cycle time too fast with respect to memory access time
– too many store instructions in a row

  • solutions to write buffer saturation

– use a write-back cache
– install a second-level (L2) cache
– store compression

SLIDE 12

Exploiting spatial locality

SLIDE 13

Block size and performance

  • ↑ block size, ↓ miss rate generally (especially for instructions)
  • large block for small cache: ↑ miss rate - too few blocks
  • ↑ block size: ↑ transfer time between cache and main memory
SLIDE 14

Impact of block size on miss rate

[Figure: miss rate (0% to 40%) versus block size in bytes (4 to 256), one curve per total cache size: 1 KB, 8 KB, 16 KB, 64 KB, 256 KB.]

SLIDE 15

Multi-word cache: handling misses

  • cache miss on read:

– same way as a single-word block
– bring back the entire multi-word block from memory

  • cache miss on write, given write-through cache:

– single-word block: disregard hit or miss, just write to cache and write buffer / memory
– do the same for a multi-word block?

SLIDE 16

Cache performance

  • CPU time = (execution cycles + memory stall cycles) × cycle time
  • memory stall cycles = read stall cycles + write stall cycles
  • read stall cycles = (reads / program) × read miss rate (%) × read miss penalty (cycles)
    (i.e. read misses per program × number of cycles per read miss)
  • write stall cycles = (writes / program) × write miss rate (%) × write miss penalty (cycles) + write buffer stalls (when full)
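The stall-cycle formulas above as a small calculation; all workload numbers here are made up for illustration:

```python
# Read and write stall cycles for a hypothetical program, following
# the formulas on this slide. All parameter values are assumptions.
reads_per_program = 1_000_000
writes_per_program = 250_000
read_miss_rate = 0.04        # 4%
write_miss_rate = 0.02       # 2%
miss_penalty = 100           # cycles, same for reads and writes here
write_buffer_stalls = 0      # assume the write buffer never fills

read_stalls = reads_per_program * read_miss_rate * miss_penalty
write_stalls = (writes_per_program * write_miss_rate * miss_penalty
                + write_buffer_stalls)
memory_stall_cycles = read_stalls + write_stalls

assert memory_stall_cycles == 4_500_000   # 4M read + 0.5M write stall cycles
```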

SLIDE 17

Effect on CPU

  • assume hit time insignificant (data transfer time dominates)
  • let c: CPI without stalls, i: instruction miss rate, p: miss penalty, d: data miss rate, n: instruction count, f: load/store frequency
  • total memory stall cycles: nip + nfdp
  • total CPU cycles without stalls: nc
    total CPU cycles with stalls: n(c + ip + fdp)
  • % time on stall: (ip + fdp) / (c + ip + fdp)
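Evaluating that stall fraction for some assumed parameter values (ours, not the lecture's):

```python
# Fraction of CPU time spent on memory stalls, for made-up parameters.
c = 2.0     # CPI without stalls
i = 0.02    # instruction miss rate
p = 100     # miss penalty (cycles)
d = 0.04    # data miss rate
f = 0.3     # load/store frequency

stall_fraction = (i * p + f * d * p) / (c + i * p + f * d * p)

# Stalls contribute 3.2 of 5.2 cycles per instruction, about 62% of time.
assert abs(stall_fraction - 3.2 / 5.2) < 1e-12
```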

SLIDE 18

Faster CPU

  • same memory speed, halved CPI

– % time on stall = (ip + fdp) / (½c + ip + fdp) > (ip + fdp) / (c + ip + fdp)
– lower CPI results in greater impact of stall cycles

  • same memory speed, halved clock cycle time t

– miss penalty in cycles doubles: 2p
– total CPU cycles with new clock: n(c + 2ip + 2fdp)
– performance improvement
  = exec. time with old clock / exec. time with new clock
  = [n(c + ip + fdp) × t] / [n(c + 2ip + 2fdp) × t/2]
  = 2(c + ip + fdp) / (c + 2ip + 2fdp) < 2
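A quick numeric check of the "< 2" claim, with made-up parameter values: halving the cycle time doubles the miss penalty in cycles, so the speedup falls short of 2.

```python
# Execution time per instruction before and after halving the clock
# cycle time. Parameter values are assumptions, not from the slides.
c = 2.0     # CPI without stalls
i = 0.02    # instruction miss rate
p = 100     # miss penalty in cycles, old clock
d = 0.04    # data miss rate
f = 0.3     # load/store frequency
t = 1.0     # old cycle time (arbitrary units)

old_time = (c + i * p + f * d * p) * t            # 5.2 cycles x t
new_time = (c + 2 * i * p + 2 * f * d * p) * (t / 2)  # 8.4 cycles x t/2
speedup = old_time / new_time

assert 1.0 < speedup < 2.0    # better, but not the full factor of 2
```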

SLIDE 19

Multi-level cache hierarchy

  • how can we look at the cache hierarchy?

– performance view
– capacity view
– physical hierarchy
– abstract hierarchy

SLIDE 20

Typical scale

  • L1:

– size: tens of KB
– hit time: completes in one clock cycle
– miss rate: 1-5%

  • L2:

– size: hundreds of KB
– hit time: a few clock cycles
– miss rate: 10-20%

  • L2 miss rate: fraction of L1 misses that also miss in L2

– why so high?

  • complication: L1 and L2 may use different block sizes and placement policies
SLIDE 21

Average Memory Access Time

  • want the Average Memory Access Time (AMAT)

– take into account all levels of the hierarchy
– calculate AMAT_CPU: AMAT for ISA-level accesses
– follow the abstract hierarchy:

AMAT_CPU = AMAT_L1
AMAT_L1 = HitTime_L1 + MissRate_L1 × AMAT_L2
AMAT_L2 = HitTime_L2 + MissRate_L2 × AMAT_M
AMAT_M = constant
⇒ AMAT_CPU = HitTime_L1 + MissRate_L1 × (HitTime_L2 + MissRate_L2 × AMAT_M)
SLIDE 22

Example

  • assume

– L1 hit time = 1 cycle
– L1 miss rate = 5%
– L2 hit time = 5 cycles
– L2 miss rate = 15% (% of L1 misses that also miss in L2)
– L2 miss penalty = 200 cycles

  • L1 miss penalty = 5 + 0.15 × 200 = 35 cycles
  • AMAT = 1 + 0.05 × 35 = 2.75 cycles

SLIDE 23

Example: without L2 cache

  • assume

– L1 hit time = 1 cycle
– L1 miss rate = 5%
– L1 miss penalty = 200 cycles

  • AMAT = 1 + 0.05 × 200 = 11 cycles
  • 4 times faster with the L2 cache! (2.75 versus 11)
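Both worked examples follow from the recursive formula AMAT = hit time + miss rate × (AMAT of the next level); a small sketch that reproduces them:

```python
# Recursive AMAT calculation, reproducing the two examples above:
# a two-level hierarchy (L1 over L2 over memory) versus L1 alone.
def amat(hit_time: float, miss_rate: float, next_level_amat: float) -> float:
    """Average memory access time at one level of the hierarchy."""
    return hit_time + miss_rate * next_level_amat

memory_time = 200                                     # cycles (constant)
with_l2 = amat(1, 0.05, amat(5, 0.15, memory_time))   # L1 -> L2 -> memory
without_l2 = amat(1, 0.05, memory_time)               # L1 -> memory

assert with_l2 == 2.75
assert without_l2 == 11.0
assert without_l2 / with_l2 == 4.0   # the slide's 4x improvement
```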