SLIDE 1

Chapter Seven

© 2004 Morgan Kaufmann Publishers

SLIDE 2

Memories: Review

  • SRAM:

– value is stored on a pair of inverting gates
– very fast, but takes up more space than DRAM (4 to 6 transistors)

  • DRAM:

– value is stored as a charge on a capacitor (must be refreshed)
– very small, but slower than SRAM (by a factor of 5 to 10)


[Figure: an SRAM cell (cross-coupled inverters with internal nodes A, bit lines B, word line W) and a DRAM cell (word line, pass transistor, capacitor, bit line)]

SLIDE 3

Exploiting Memory Hierarchy

  • Users want large and fast memories!

Technology   Access time          Cost per GB
SRAM         0.5 – 5 ns           $4,000 – $10,000
DRAM         50 – 70 ns           $100 – $200
Disk         5 – 20 million ns    $0.50 – $2

  • Try and give it to them anyway

– build a memory hierarchy

SLIDE 4

Locality

  • A principle that makes having a memory hierarchy a good idea
  • If an item is referenced,

– temporal locality: it will tend to be referenced again soon
– spatial locality: nearby items will tend to be referenced soon

Why does code have locality? (Loops re-execute the same instructions, and instructions and data are laid out and accessed sequentially.)

  • Our initial focus: two levels (upper, lower)

– block: minimum unit of data transferred between adjacent levels
– hit: data requested is in the upper level
– miss: data requested is not in the upper level
– miss penalty: the time required to fetch a block into a level of the memory hierarchy from the lower level

SLIDE 5

Cache

  • Two issues:

– How do we know if a data item is in the cache?
– If it is, how do we find it?

  • Our first example:

– block size is one word of data
– "direct mapped"

For each item of data at the lower level, there is exactly one location in the cache where it might be; i.e., lots of items at the lower level share locations in the upper level.

SLIDE 6

Direct Mapped Cache

  • Mapping: cache block index = (memory address) modulo (the number of blocks in the cache)

[Figure: an eight-block direct-mapped cache (indices 000 – 111); memory addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, 11101 each map to the cache block given by their low-order three bits]
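As a concrete illustration (not from the original slides), here is a minimal Python sketch of the index/tag split for the word-addressed, one-word-block cache in the figure above; the 8-block size comes from the slide, everything else is an assumption:

```python
# Minimal sketch of direct-mapped address decomposition, assuming a
# word-addressed memory and a block size of one word (as on this slide).
NUM_BLOCKS = 8  # must be a power of two

def index(address: int) -> int:
    """Cache index = address modulo the number of cache blocks."""
    return address % NUM_BLOCKS

def tag(address: int) -> int:
    """Tag = the address bits left over once the index bits are removed."""
    return address // NUM_BLOCKS

for addr in (0b00001, 0b00101, 0b01001, 0b01101):
    print(f"address {addr:05b} -> index {index(addr):03b}, tag {tag(addr):02b}")
```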

SLIDE 7

Direct Mapped Cache for MIPS

  • Tag – contains the address information needed to identify whether the associated block corresponds to a requested word
  • Valid bit – a bit to indicate that the associated block in the hierarchy contains valid data

Why a "valid" bit? (After power-up or a process switch the cache holds no useful data, so each entry must be marked invalid until a block is actually loaded into it.)

SLIDE 8

Direct Mapped Cache - Example

Memory request   Decimal address   Binary address   Hit or miss   Assigned cache block
a                22                10110            Miss          10110 mod 8 = 110
b                26                11010            Miss          11010 mod 8 = 010
c                18                10010            Miss          10010 mod 8 = 010
b                26                11010            Miss          11010 mod 8 = 010
a                22                10110            Hit           10110 mod 8 = 110

Cache contents after each reference (Index / V / Tag / Data; unlisted indices remain invalid):

a (Miss): index 110 -> Y, tag 10, a
b (Miss): index 010 -> Y, tag 11, b; index 110 -> Y, tag 10, a
c (Miss): index 010 -> Y, tag 10, c (replaces b); index 110 -> Y, tag 10, a
b (Miss): index 010 -> Y, tag 11, b (replaces c); index 110 -> Y, tag 10, a
a (Hit):  unchanged; index 110 still holds a

SLIDE 9

Direct Mapped Cache

  • Increase block size:

– E.g., a 16KB cache contains 256 blocks with 16 words per block
– What kind of locality are we taking advantage of? (spatial locality)

SLIDE 10

Analysis of Tag Bits and Index Bits

  • Assume a 32-bit byte address and a direct-mapped cache of 2^n blocks with 2^m-word (2^(m+2)-byte) blocks

– Tag field: 32 − (n + m + 2) bits
– Cache size: 2^n × (2^m × 32 + (32 − n − m − 2) + 1) bits

  • Ex1. Bits in a cache
  • How many total bits are required for a direct-mapped cache with 16KB of data and 4-word blocks, assuming a 32-bit address?

– 16KB = 2^14 bytes = 2^12 words
– Number of blocks = 2^12 / 4 = 2^10 blocks
– Tag field = 32 − (2 + 2 + 10) = 18 bits
– Total size = 2^10 × (4 × 32 + 18 + 1) = 147 Kbits

  • Ex2. Mapping an address to a multiword cache block
  • Consider a cache with 64 blocks and a block size of 16 bytes. What block number does byte address 1200 map to?

– Block address = ⌊1200 / 16⌋ = 75
– Block number = 75 modulo 64 = 11
– Block number 11 holds byte addresses 1200 through 1215

[Address fields, high to low: tag | index | word offset | byte offset]
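A sketch (not from the slides) encoding the total-bits formula above and checking both exercises:

```python
# Total bits = 2^n blocks x (data bits + tag bits + valid bit),
# for a direct-mapped cache with a 32-bit byte address.
def cache_total_bits(n: int, m: int) -> int:
    data_bits = (2 ** m) * 32                 # 2^m words of 32 bits
    tag_bits = 32 - (n + m + 2)
    return (2 ** n) * (data_bits + tag_bits + 1)

# Ex1: 16KB of data in 4-word blocks -> n = 10, m = 2
print(cache_total_bits(10, 2) // 1024)        # 147 (Kbits)

# Ex2: block number for byte address 1200 (64 blocks, 16-byte blocks)
print((1200 // 16) % 64)                      # 11
```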

SLIDE 11

Hits vs. Misses

  • Read hits

– this is what we want!

  • Read misses

– stall the CPU, fetch the block from memory, deliver it to the cache, restart

  • Write hits:

– replace the data in cache and memory (write-through)

  • The writes always update both the cache and the memory, ensuring that data is always consistent between the two.

– write the data only into the cache, and write the cache block back later (write-back)

  • The modified blocks (dirty blocks) are written to the lower level of the hierarchy when the block is replaced.

– write the data into the cache and a buffer (write buffer)

  • After writing into the buffer, the CPU continues execution; writing to memory is controlled by the memory controller
  • If the buffer is full, the CPU must wait for a free buffer entry

  • Write misses:

– read the entire block into the cache, then write the word
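An illustrative sketch (not from the slides) contrasting the write-through and write-back policies; the dict-based "cache" and "memory" are toy stand-ins:

```python
# Write-through: every write updates both levels.
# Write-back: writes mark the block dirty; memory is updated on eviction.
class WriteThroughCache:
    def __init__(self, memory: dict):
        self.memory, self.cache = memory, {}

    def write(self, block: int, value: int) -> None:
        self.cache[block] = value
        self.memory[block] = value        # memory stays consistent

class WriteBackCache:
    def __init__(self, memory: dict):
        self.memory, self.cache, self.dirty = memory, {}, set()

    def write(self, block: int, value: int) -> None:
        self.cache[block] = value
        self.dirty.add(block)             # defer the memory update

    def evict(self, block: int) -> None:
        if block in self.dirty:           # dirty block: write it back now
            self.memory[block] = self.cache[block]
            self.dirty.discard(block)
        self.cache.pop(block, None)
```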

SLIDE 12

Hardware Issues

  • Assume a cache block of 4 words (transfer 4 words for one miss)
  • Make reading multiple words easier by using banks of memory

[Figure: three memory organizations: a. one-word-wide memory; b. wide memory (wider bus and memory, with a multiplexor between cache and CPU); c. interleaved memory (four banks). With interleaving, the width of the bus and cache need not change.]

  • Ex. 3 Assume the following memory access times:

– 1 clock cycle to send the address
– 15 clock cycles for each DRAM access initiated
– 1 clock cycle to send a word of data

Assume the cache block is 4 words. What is the cache miss penalty for the different memory organizations?

– a. One-word-wide memory: 1 + 4 × 15 + 4 × 1 = 65
– b. Wide memory: 1 + 1 × 15 + 1 = 17 (4-word wide); 1 + 2 × 15 + 2 = 33 (2-word wide)
– c. Interleaved memory: 1 + 1 × 15 + 4 × 1 = 20
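A sketch (not from the slides) of the Ex. 3 arithmetic:

```python
# Miss penalty = address cycle + DRAM accesses + word transfers,
# for a 4-word cache block.
ADDR, DRAM, XFER, WORDS = 1, 15, 1, 4

one_word_wide = ADDR + WORDS * DRAM + WORDS * XFER   # 65 cycles
four_word_wide = ADDR + 1 * DRAM + 1 * XFER          # 17 cycles
two_word_wide = ADDR + 2 * DRAM + 2 * XFER           # 33 cycles
interleaved = ADDR + 1 * DRAM + WORDS * XFER         # 20 cycles (4 banks)

print(one_word_wide, four_word_wide, two_word_wide, interleaved)
```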

SLIDE 13

Performance

  • Use split caches because there is more spatial locality in code:
  • Increasing the block size tends to decrease the miss rate:

Program   Block size (words)   Instruction miss rate   Data miss rate   Effective combined miss rate
gcc       1                    6.1%                    2.1%             5.4%
gcc       4                    2.0%                    1.7%             1.9%
spice     1                    1.2%                    1.3%             1.2%
spice     4                    0.3%                    0.6%             0.4%

  • However, for a fixed cache size, as block size increases past a threshold value, the miss rate will increase. (Why? With fewer, larger blocks, blocks compete for space and are evicted before most of their words are used.)

SLIDE 14

Performance

  • Simplified model:

execution time = (execution cycles + stall cycles) × cycle time
stall cycles = # of instructions × miss ratio × miss penalty

  • Two ways of improving performance:

– decreasing the miss ratio
– decreasing the miss penalty

What happens if we increase block size?

SLIDE 15

Cache Performance Examples

  • Ex. 4(a) Instruction cache miss rate = 2%, data cache miss rate = 4%, CPI = 2 without any memory stalls, miss penalty = 100 clock cycles. How much faster would the processor run with a perfect cache that never missed? (Assume the percentage of lw and sw instructions is 36%.)

– Instruction miss cycles = I × 2% × 100 = 2.00 × I
– Data miss cycles = I × 36% (lw and sw percentage) × 4% × 100 = 1.44 × I
– CPI with memory stalls = 2 + 3.44 = 5.44
– CPU_time_stall / CPU_time_nostall = (I × CPI_stall × cycle_time) / (I × CPI_nostall × cycle_time) = CPI_stall / CPI_nostall = 5.44 / 2 = 2.72

  • Ex. 4(b) What happens if the processor is made twice as fast by reducing the CPI from 2 to 1, but the memory system is not?

– (1 + 3.44) / 1 = 4.44

  • Ex. 4(c) Double the clock rate; the time to handle a cache miss does not change. How much faster will the computer be with the same miss rate?

– Miss cycles/instruction = (2% × 200) + 36% × (4% × 200) = 6.88
– Performance_fast / Performance_slow = execution_time_slow / execution_time_fast = (IC × CPI_slow × cycle_time) / (IC × CPI_fast × (cycle_time / 2)) = 5.44 / (8.88 × 0.5) = 1.23
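A sketch (not from the slides) reproducing the Ex. 4 arithmetic with the simplified model, CPI_with_stalls = base CPI + miss cycles per instruction:

```python
def cpi_with_stalls(base_cpi, miss_penalty,
                    i_miss_rate=0.02, d_miss_rate=0.04, mem_frac=0.36):
    """Base CPI plus instruction- and data-miss stall cycles per instruction."""
    stalls = i_miss_rate * miss_penalty + mem_frac * d_miss_rate * miss_penalty
    return base_cpi + stalls

# Ex. 4(a): speedup of a perfect cache over the real one
print(cpi_with_stalls(2, 100) / 2)            # 2.72

# Ex. 4(b): base CPI reduced to 1, same memory system
print(cpi_with_stalls(1, 100) / 1)            # 4.44

# Ex. 4(c): clock doubled, so the miss penalty doubles to 200 cycles;
# compare times, remembering the fast machine's cycle is half as long
print(cpi_with_stalls(2, 100) / (cpi_with_stalls(2, 200) * 0.5))  # ~1.23
```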

SLIDE 16

Decreasing miss ratio with associativity

  • Direct-mapped placement – each block can go in exactly one location
  • Fully associative placement – a block can be placed in any location in the cache
  • Set-associative placement – each block can be placed in a fixed number of locations (at least two). Mapping = (block number) modulo (number of sets in the cache)

[Figure: locating a block under the three placements: direct mapped indexes a single Tag/Data entry, set associative searches the entries of one set, fully associative searches every entry]

SLIDE 17

Decreasing miss ratio with associativity

[Figure: an eight-block cache configured as direct mapped (8 sets of 1), two-way set associative (4 sets of 2), four-way set associative (2 sets of 4), and eight-way set associative, i.e., fully associative (1 set of 8)]

m-way set associative: m blocks per set

Compared to direct mapped, give a series of references that:

– results in a lower miss ratio using a 2-way set associative cache
– results in a higher miss ratio using a 2-way set associative cache

assuming we use the "least recently used" replacement strategy

SLIDE 18

Misses and Associativity Example

  • Ex. 5 Question
  • There are three small caches, each consisting of four one-word blocks. One cache is direct mapped, a second is two-way set associative, and the third is fully associative. Find the number of misses for each cache organization given the following sequence of block addresses: 0, 8, 0, 6, 8. (Assume LRU is used for replacement.)
  • Least recently used (LRU): the block replaced is the one that has been unused for the longest time

SLIDE 19

Misses and Associativity Example

Answer

Direct mapped (block address mod 4: 0 -> 0, 6 -> 2, 8 -> 0):

Address  Hit or miss  Block 0  Block 1  Block 2  Block 3
0        Miss         Mem[0]
8        Miss         Mem[8]
0        Miss         Mem[0]
6        Miss         Mem[0]            Mem[6]
8        Miss         Mem[8]            Mem[6]

Five misses.

Two-way set associative (block address mod 2: 0, 6, and 8 all map to set 0):

Address  Hit or miss  Set 0   Set 0   Set 1  Set 1
0        Miss         Mem[0]
8        Miss         Mem[0]  Mem[8]
0        Hit          Mem[0]  Mem[8]
6        Miss         Mem[0]  Mem[6]
8        Miss         Mem[8]  Mem[6]

Four misses.

Fully associative:

Address  Hit or miss  Block 0  Block 1  Block 2  Block 3
0        Miss         Mem[0]
8        Miss         Mem[0]   Mem[8]
0        Hit          Mem[0]   Mem[8]
6        Miss         Mem[0]   Mem[8]   Mem[6]
8        Hit          Mem[0]   Mem[8]   Mem[6]

Three misses.
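A runnable sketch (not from the slides) that replays the Ex. 5 trace through an LRU set-associative cache; for this 4-block cache, ways=1 is direct mapped and ways=4 (one set) is fully associative:

```python
from collections import OrderedDict

def count_misses(trace, num_blocks=4, ways=1):
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]  # per-set LRU order
    misses = 0
    for block in trace:
        s = sets[block % num_sets]
        if block in s:
            s.move_to_end(block)          # hit: mark most recently used
        else:
            misses += 1
            if len(s) == ways:
                s.popitem(last=False)     # evict the least recently used
            s[block] = True
    return misses

trace = [0, 8, 0, 6, 8]
for ways in (1, 2, 4):
    print(ways, count_misses(trace, ways=ways))  # 1 -> 5, 2 -> 4, 4 -> 3
```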

SLIDE 20

An Implementation of a Four-Way Set-Associative Cache

[Figure: a four-way set-associative cache of 256 sets; the address is split into a 22-bit tag and an 8-bit index, four comparators check the four tags in the indexed set, and a 4-to-1 multiplexor selects the hit data]

More specifically, the hit logic is only four AND gates and one OR gate.

SLIDE 21

Performance

[Figure: miss rate (3% – 15%) versus associativity (one-way, two-way, four-way, eight-way) for cache sizes from 1 KB to 128 KB]

SLIDE 22

Size of Tags versus Set Associativity

  • Increasing associativity requires more comparators and more tag bits per cache block.
  • Ex. 6 Assume a cache of 4K blocks, a four-word block size, and a 32-bit address. Find the total number of sets and the total number of tag bits for caches that are direct mapped, two-way and four-way set associative, and fully associative.

– Direct mapped: byte offset 2 bits, word offset 2 bits, index 12 bits (4K sets), ∴ tag bits = 32 − 16 = 16 and 16 × 4K = 64 Kbits
– Two-way set associative: byte offset 2 bits, word offset 2 bits, 4K/2 = 2K sets, index 11 bits, ∴ tag bits = 32 − 15 = 17 and 17 × 2K × 2 = 68 Kbits
– Four-way set associative: byte offset 2 bits, word offset 2 bits, 4K/4 = 1K sets, index 10 bits, ∴ tag bits = 32 − 14 = 18 and 18 × 1K × 4 = 72 Kbits
– Fully associative: byte offset 2 bits, word offset 2 bits, 1 set, index 0 bits, ∴ tag bits = 32 − 4 = 28 and 28 × 4K × 1 = 112 Kbits
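A sketch (not from the slides) of the Ex. 6 tag-bit arithmetic:

```python
# 4K-block cache, four-word blocks, 32-bit byte address:
# 2 byte-offset bits + 2 word-offset bits = 4 offset bits.
from math import log2

BLOCKS, OFFSET_BITS = 4 * 1024, 4

for ways in (1, 2, 4, BLOCKS):     # direct mapped ... fully associative
    sets = BLOCKS // ways
    index_bits = int(log2(sets))
    tag_bits = 32 - OFFSET_BITS - index_bits
    total_kbits = tag_bits * BLOCKS // 1024
    print(f"{ways}-way: {sets} sets, tag = {tag_bits} bits, "
          f"total = {total_kbits} Kbits")   # 64, 68, 72, 112 Kbits
```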

SLIDE 23

Decreasing miss penalty with multilevel caches

  • Add a second-level cache:

– often the primary cache is on the same chip as the processor
– use SRAMs to add another cache above primary memory (DRAM)
– miss penalty goes down if the data is in the 2nd-level cache

  • Ex. 7: CPI of 1.0 on a 5 GHz machine with a 2% miss rate and 100ns DRAM access. How much faster will the machine be if we add a 2nd-level cache with a 5ns access time that decreases the miss rate to main memory to 0.5%?

– Miss penalty for main memory = 100ns / (1 / 5 GHz) = 500 cycles
– Total CPI = base CPI + memory stall cycles/instruction = 1.0 + 2% × 500 = 11
– Miss penalty on the second-level cache = 5ns / (1 / 5 GHz) = 25 cycles
– Total CPI for the 2-level cache = 1 + primary stalls/instr + secondary stalls/instr = 1 + 2% × 25 + 0.5% × 500 = 1 + 0.5 + 2.5 = 4.0
– 11 / 4 = 2.8 times faster if the 2nd-level cache is used

  • Using multilevel caches:

– try to optimize the hit time on the 1st-level cache
– try to optimize the miss rate on the 2nd-level cache
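A sketch (not from the slides) of the Ex. 7 multilevel-cache arithmetic:

```python
CLOCK_GHZ = 5.0
cycle_ns = 1 / CLOCK_GHZ                     # 0.2 ns per cycle

main_penalty = 100 / cycle_ns                # 500 cycles
l2_penalty = 5 / cycle_ns                    # 25 cycles

cpi_one_level = 1.0 + 0.02 * main_penalty    # 11.0
cpi_two_level = 1.0 + 0.02 * l2_penalty + 0.005 * main_penalty  # 4.0

print(cpi_one_level / cpi_two_level)         # 2.75, i.e. ~2.8x faster
```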

SLIDE 24

Cache Complexities

  • Not always easy to understand the implications of caches:

[Figure: theoretical behavior of Radix sort vs. Quicksort (instructions per item) and observed behavior (clock cycles per item), plotted against the size of the array to sort, 4K – 4096K items]

SLIDE 25

Cache Complexities

  • Here is why:

[Figure: cache misses per item for Radix sort vs. Quicksort, plotted against the size of the array to sort, 4K – 4096K items]

  • Memory system performance is often the critical factor

– multilevel caches and pipelined processors make it harder to predict outcomes
– compiler optimizations to increase locality sometimes hurt ILP

  • Difficult to predict the best algorithm: need experimental data

SLIDE 26

Virtual Memory

  • Main memory can act as a cache for the secondary storage (disk)

[Figure: address translation maps virtual addresses to physical addresses in main memory or to disk addresses]

  • Advantages:

– illusion of having more physical memory
– program relocation
– protection

SLIDE 27

Pages: virtual memory blocks

  • Page faults: the data is not in memory, so retrieve it from disk

– huge miss penalty, thus pages should be fairly large (e.g., 4KB)
– reducing page faults is important (LRU is worth the price)
– the faults can be handled in software instead of hardware
– write-through is too expensive, so we use write-back

SLIDE 28

Page Tables

SLIDE 29

Page Tables

SLIDE 30

Page table

  • Page table implementation

– often resides in memory
– is indexed by the page number from the virtual address

  • With 2^20 pages in the virtual address space => 2^20 entries in the page table

– each program has its own page table, in a contiguous memory space

  • Page table register: holds the start address of the page table
  • State of a program: PC, registers, page table, ...

– Page fault: handled by the O.S. through an exception

  • Check the valid bit (off means a page fault has occurred)
  • The OS takes over control (through the exception)
  • Find the page's address on disk
  • Find a page in memory to replace (LRU, approximate LRU, ...)

– Write the old page to disk (if its dirty bit is on) -> use write-back. Why? (Writing through to disk on every store would be far too slow; see Slide 27.)
– Read the page from disk into memory, and set the reference bit on

  • Transfer control back to the user process

– Swap space

  • The space on the disk reserved for the virtual memory space of a process
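A toy, runnable sketch (not from the slides) of the page-fault path just described; the dict "disk", the two-frame "memory", and all names are hypothetical stand-ins for real OS machinery, with true LRU replacement:

```python
from collections import OrderedDict

disk = {vpn: f"page-{vpn}-data" for vpn in range(8)}  # backing store
NUM_FRAMES = 2
frames = OrderedDict()   # vpn -> (data, dirty), kept in LRU order

def access(vpn, write=False):
    if vpn in frames:                       # valid: no fault
        frames.move_to_end(vpn)
    else:                                   # page fault: "OS" takes over
        if len(frames) == NUM_FRAMES:
            victim, (data, dirty) = frames.popitem(last=False)
            if dirty:                       # write-back: dirty pages only
                disk[victim] = data
        frames[vpn] = (disk[vpn], False)    # read the page from disk
        print(f"page fault on {vpn}")
    data, dirty = frames[vpn]
    frames[vpn] = (data, dirty or write)    # a store sets the dirty bit

for vpn, wr in [(0, False), (1, True), (0, False), (2, False)]:
    access(vpn, wr)   # faults on 0, 1, 2; dirty page 1 is written back
```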
SLIDE 31

Making Address Translation Fast: TLB

  • Translation lookaside buffer (TLB): a cache for holding a portion of the page table

SLIDE 32

TLB

  • TLB implementation

– Tag in the TLB

  • The TLB needs a tag field because it holds only a portion of the page table entries.
  • The page table does not need a tag field because it holds an entry for every virtual page.

– Associativity in the TLB

  • Small TLB: fully associative -> low miss rate, and the cost is not too high (because it is small)
  • Large TLB: small associativity -> full associativity (with LRU) would cost too much

  • Associativity of page placement in memory

– Fully associative placement of pages in memory (but no full search is needed; the page table is indexed directly)
– Replacement: LRU, approximate LRU, or more sophisticated algorithms
– Some typical values for a TLB might be:

  • TLB size: 16 – 512 entries; block size: 1 – 2 page table entries
  • Hit time: 0.5 – 1 clock cycle; miss penalty: 10 – 100 clock cycles
  • Miss rate: 0.01% – 1%

– A TLB miss can be handled by hardware or software

  • A page fault is often handled by software (the OS, through an exception)
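A toy sketch (not from the slides) of a fully associative TLB with LRU replacement in front of a page table; the sizes and names are illustrative:

```python
from collections import OrderedDict

page_table = {vpn: vpn + 100 for vpn in range(32)}  # vpn -> frame number
TLB_ENTRIES = 4
tlb = OrderedDict()       # vpn -> frame, kept in LRU order

def translate(vpn):
    if vpn in tlb:                        # TLB hit: fast path
        tlb.move_to_end(vpn)
        return tlb[vpn]
    frame = page_table[vpn]               # TLB miss: walk the page table
    if len(tlb) == TLB_ENTRIES:
        tlb.popitem(last=False)           # evict the LRU translation
    tlb[vpn] = frame
    return frame

for vpn in (1, 2, 1, 3, 4, 5, 1):
    print(vpn, translate(vpn))
```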
SLIDE 33

TLBs and Caches

  • Physically addressed cache
  • Virtually addressed cache
  • Aliasing problem? (With a virtually addressed cache, two different virtual addresses can refer to the same physical page, so the same data can end up in two cache locations.)

SLIDE 34

  • The possible combinations of events in the TLB, virtual memory system, and cache:

TLB   Page table   Cache   Possible? If so, under what circumstance?
Hit   Hit          Miss    Possible
Miss  Hit          Hit     Possible
Miss  Hit          Miss    Possible
Miss  Miss         Miss    Possible
Hit   Miss         Miss    Impossible (a translation cannot be in the TLB if the page is not in memory)
Hit   Miss         Hit     Impossible (a translation cannot be in the TLB if the page is not in memory)
Miss  Miss         Hit     Impossible (data cannot be in the cache if the page is not in memory)

SLIDE 35

Modern Systems

SLIDE 36

Modern Systems

  • Things are getting complicated!

SLIDE 37

Some Issues

  • Processor speeds continue to increase very fast

– much faster than either DRAM or disk access times

  • Design challenge: dealing with this growing disparity

– Prefetching? 3rd-level caches and more? Memory design?

SLIDE 38

Summary (p. 578)

  • Cache performance:

– The total # of cycles spent on a program is the sum of the processor cycles and the memory-stall cycles.
– As processors get faster, the relative effect of the memory-stall cycles increases.
– The # of memory-stall cycles depends on both the miss rate and the miss penalty.

  • Reduce the miss rate: associative placement schemes

– The choice among the different placement strategies depends on the cost of a miss vs. the cost of implementing associativity, both in time and in extra hardware.

  • Reduce the miss penalty: multilevel caches

– Allow a larger secondary cache to handle misses to the primary cache.

  • Memory hierarchies: the 3 Cs and 4 Qs.
SLIDE 39

The Three Cs

  • Cache miss rates: the 3 Cs

– Compulsory miss (also called cold-start miss)

  • First-ever reference to a given block of memory
  • Increasing the block size may reduce this rate (but increases the miss penalty)

– Capacity miss

  • Working set exceeds cache capacity
  • Increasing the total cache size may reduce this rate (but increases access time)

– Conflict miss (also called collision miss)

  • Placement restrictions cause useful blocks to be replaced
  • Higher associativity may reduce this rate (but increases access time)

  • Cache miss rate is determined by

– Locality
– Cache organization
SLIDE 40

The Four Qs

  • Question 1: Where can a block be placed?

– One place (direct mapped), a few places (set-associative), or any place (fully associative)

  • Question 2: How is a block found?

– Indexing (direct mapped), limited search (set-associative), full search (fully associative), or a separate lookup table (as in a page table)

  • Question 3: What block is replaced on a miss?

– Typically, LRU or a random block

  • Question 4: How are writes handled?

– Write-through, write-back, write buffer