
Chapter Overview

5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main Memory 5.7 Virtual Memory 5.8 Protection and Examples of Virtual Memory


Introduction

5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main Memory 5.7 Virtual Memory 5.8 Protection and Examples of Virtual Memory

The Big Picture: Where are We Now?

The Five Classic Components of a Computer

[Figure: the five classic components: Processor (Control + Datapath), Memory, Input, Output.]

  • Topics In This Chapter:

– SRAM Memory Technology
– DRAM Memory Technology
– Memory Organization


Technology Trends

DRAM generations:

Year   Size     Cycle Time
1980   64 Kb    250 ns
1983   256 Kb   220 ns
1986   1 Mb     190 ns
1989   4 Mb     165 ns
1992   16 Mb    145 ns
1995   64 Mb    120 ns

Capacity vs. speed (latency):
Logic: capacity 2x in 3 years; speed 2x in 3 years
DRAM: capacity 4x in 3 years; speed 2x in 10 years
Disk:  capacity 4x in 3 years; speed 2x in 10 years

Over these 15 years of DRAM: capacity 1000:1, speed only 2:1!


[Figure: "Moore's Law" Processor-DRAM Memory Gap (latency). Performance vs. time, 1980-2000, log scale 1 to 1000: µProc improves 60%/yr (2X/1.5 yr); DRAM improves 9%/yr (2X/10 yrs); the processor-memory performance gap grows 50%/year.]

Who Cares About the Memory Hierarchy?


Interfacing Memory

  • Memory is connected to the cache via a bus
  • Board-level wiring: 100 MHz is "good"

[Figure: processor, cache, and memory connected by a bus; the bus is usually narrower than the cache block size (e.g. 8 bytes vs. 32).]


Miss Penalty

  • Three components to the miss penalty:

– 1. Wait for the bus
– 2. Memory latency (wait for the first byte)
– 3. Transfer time for the rest of the block

[Figure: Thit covers the processor-cache path; Tmiss extends across the bus to memory.]


Memory Busses

  • Crude: lock the bus for the entire transaction

– simple: can use the DRAM core interface exactly

  • Better: split-transaction

– pass commands and data as separate chunks
– requires buffering at the end-points, tagging of requests if you permit multiple outstanding requests, etc.

[Figure: bus timeline for a locked transaction: address, then a long idle gap (bus unused in the meantime), then data.]


Latency vs. Bandwidth

  • Two metrics of interest:

– Latency: i.e., the bulk of the cache miss penalty

  • Access time: time between the request and when the word arrives
  • Cycle time: minimum time between requests

– Bandwidth: contributes to transfer time

  • relevant for large cache blocks
  • relevant for I/O

– Bandwidth is easier to improve than latency (just add money)


Memory Organizations

  • Simple:

– CPU, Cache, Bus, Memory same width (32 or 64 bits)

  • Wide:

– CPU/Mux 1 word; Mux/Cache, bus, and memory N words (Alpha: 64 bits & 256 bits; UltraSPARC: 512)

  • Interleaved:

– CPU, Cache, Bus 1 word: Memory N Modules (4 Modules); example is word interleaved

[Figure: the three organizations: simple (P, C, M on one narrow bus), wide (mux between P and C; wide bus and memory), interleaved (P and C with four memory modules sharing one bus).]


Performance: Simple

Each memory transaction is handled separately. Example: 32-byte cache lines, 8-byte-wide bus and memory; bus at 100 MHz (10 ns), memory at 80 ns.

Total: 40 cycles = 4 x (1 + 8 + 1)

[Figure: bus timeline A1 D1 A2 D2 A3 ...: each address/data pair completes before the next transaction starts.]


Performance: Wide

The bus and memory are 32 bytes wide; 32-byte cache lines are fetched in one transaction. Bus: 100 MHz, memory 80 ns.

Total: 10 cycles = 1 x (1 + 8 + 1)

Works great, but is expensive!


Performance: Interleaved

Memory is 8 bytes wide, but there are four banks; 32-byte cache lines are fetched in four overlapped transactions. Bus: 100 MHz, memory 80 ns.

Total: 13 cycles = 1 + 8 + 4

[Figure: bus timeline A, then D1 D2 D3 D4 arriving back-to-back as the four banks deliver their words.]

A nice tradeoff.
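The three totals follow from the same timing parameters; here is a minimal sketch of the arithmetic, assuming 1 bus cycle to send the address, 8 bus cycles (80 ns at 100 MHz) of memory latency, and 1 bus cycle per 8-byte transfer:

    /* Miss-penalty arithmetic for the three organizations above.
       Assumed timing: 1 cycle address, 8 cycles memory latency,
       1 cycle per bus-width data transfer.                      */
    #include <stdio.h>

    int main(void)
    {
        int addr = 1, mem = 8, xfer = 1;
        int words = 4;  /* 32-byte line / 8-byte bus */

        int simple      = words * (addr + mem + xfer);  /* 4 full transactions   */
        int wide        = addr + mem + xfer;            /* one 32-byte transfer  */
        int interleaved = addr + mem + words * xfer;    /* latencies overlap but
                                                           data is still
                                                           serialized on the bus */
        printf("simple=%d wide=%d interleaved=%d\n",
               simple, wide, interleaved);              /* 40, 10, 13 */
        return 0;
    }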


Interleaving: Two Variations

  • 1. Strictly sequential accesses

– i.e., for a cache line fill, as above
– can be implemented with one DRAM array using column-only ("page mode") accesses (up to the width of a column)

  • 2. Arbitrary accesses

– requires a source of multiple, overlapped requests:

  • advanced CPU and cache technology: write buffers, non-blocking miss-under-miss caches, prefetching
  • I/O with a DMA-capable controller
  • multiprocessors

– requires multiple independent arrays


Independent Memory Banks

  • How many banks?

– IF accesses are sequential, THEN making the number of banks equal to the number of clock cycles needed to access one bank allows all latency to be covered.
– BUT if the pattern is non-sequential, the first bank will be reused prematurely, inducing a stall.

[Figure: with 4 banks, a sequential access stream (1, 2, 3, ...) keeps every bank busy -> fast; a pathological stride that returns to a bank before it has finished its cycle -> slow.]


Avoiding Bank Conflicts

  • Lots of banks helps; but consider:

    int x[256][512];
    for (j = 0; j < 512; j = j + 1)
        for (i = 0; i < 256; i = i + 1)
            x[i][j] = 2 * x[i][j];

  • Even with 128 banks, since 512 is a multiple of 128, successive accesses down a column are 512 words apart and land in the same bank, so word accesses conflict

  • Solutions:

– software: loop interchange – software: adjust the array size to a prime # (“array padding”) – hardware: prime number of banks (e.g. 17)
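A minimal sketch of the two software fixes, assuming the layout above (one 4-byte word per bank, bank = word address mod 128); the array name y and the padding to 513 columns are illustrative:

    #include <stdio.h>

    int x[256][512];
    int y[256][513];   /* padded: row length 513 is not a multiple of 128 */

    int main(void)
    {
        /* Loop interchange: walk along rows, so successive accesses are
           one word apart and hit successive banks.                      */
        for (int i = 0; i < 256; i = i + 1)
            for (int j = 0; j < 512; j = j + 1)
                x[i][j] = 2 * x[i][j];

        /* Array padding: with a 513-word row, successive elements of a
           column are 513 words apart, and 513 mod 128 != 0, so walking a
           column no longer reuses one bank. The extra column is never
           referenced.                                                   */
        for (int j = 0; j < 512; j = j + 1)
            for (int i = 0; i < 256; i = i + 1)
                y[i][j] = 2 * y[i][j];

        printf("%d %d\n", x[0][0], y[0][0]);
        return 0;
    }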


Improved DRAM interfaces

  • Multiple CAS accesses: several names (page mode)

– Extended Data Out (EDO): 30% faster in page mode

  • New DRAMs to address the gap; what will they cost, and will they survive?

– Synchronous DRAM: 2-4 banks on chip, a clock signal to the DRAM, transfers synchronous to the system clock (66-150 MHz)
– RAMBUS: startup company; reinvented the DRAM interface

  • Each chip is a module vs. a slice of memory (8-16 banks)
  • Short bus between CPU and chips
  • Does its own refresh
  • Block transfer mechanism, arbitrary sequencing
  • 2 bytes @ 800 MHz (1.6 GB/s per bus)

Current Memory Technology: Direct RDRAM

(see also SDRAM, DDR SDRAM)

  • Packet-switched bus interface

– 18 bits of data, 8 bits of control, at 800 MHz
– collected into packets of 8 at 100 MHz (10 ns)
– that's 16 bytes with ECC, plus controls for multiple banks

  • Internally banked

– a 128 Mbit (16 Mbyte) part has 32 banks

  • Timing

– TRA is 40 ns (but effectively 60 ns due to the interface); one row cycle time is 80 ns, but a new bank can start every 10 ns
– TCA is 20 ns (effectively 40 ns); a new column can start every 10 ns


Today’s Situation: Microprocessor

  • Rely on caches to bridge gap
  • Microprocessor-DRAM performance gap

– time of a full cache miss in instructions executed:

    1st Alpha (7000):   340 ns / 5.0 ns =  68 clks x 2, or 136 instructions
    2nd Alpha (8400):   266 ns / 3.3 ns =  80 clks x 4, or 320 instructions
    3rd Alpha (t.b.d.): 180 ns / 1.7 ns = 108 clks x 6, or 648 instructions

– 1/2X latency x 3X clock rate x 3X instr/clock ⇒ ≈5X


Levels of the Memory Hierarchy

Level         Capacity    Access Time            Cost                      Staging Xfer Unit   Managed by        Unit Size
Registers     100s bytes  ~1 ns                                            instr. operands     prog./compiler    1-8 bytes
Cache         K bytes     4 ns                   1-0.1 cents/bit           blocks              cache controller  8-128 bytes
Main memory   M bytes     100-300 ns             0.0001-0.00001 cents/bit  pages               OS                512 B - 4 KB
Disk          G bytes     10 ms (10,000,000 ns)  10^-5 - 10^-6 cents/bit   files               user/operator     Mbytes
Tape          infinite    sec-min                10^-8 cents/bit

Upper levels are faster; lower levels are larger.


The ABCs of Caches

In this section we will:

  • Learn lots of definitions about caches: you can't talk about something until you understand it (this is true in computer science at least!)
  • Answer some fundamental questions about caches:

– Q1: Where can a block be placed in the upper level? (Block placement)
– Q2: How is a block found if it is in the upper level? (Block identification)
– Q3: Which block should be replaced on a miss? (Block replacement)
– Q4: What happens on a write? (Write strategy)

5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main Memory 5.7 Virtual Memory 5.8 Protection and Examples of Virtual Memory

The Principle of Locality

  • The Principle of Locality:

– Programs access a relatively small portion of the address space at any instant of time.

  • Two different types of locality:

– Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
– Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)

  • For the last 15 years, hardware has relied on locality for speed


Memory Hierarchy: Terminology

  • Hit: the data appears in some block in the upper level (example: Block X)

– Hit rate: the fraction of memory accesses found in the upper level
– Hit time: time to access the upper level, which consists of RAM access time + time to determine hit/miss

  • Miss: the data must be retrieved from a block in the lower level (Block Y)

– Miss rate = 1 - (hit rate)
– Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor

  • Hit time << miss penalty (500 instructions on the 21264!)

[Figure: the processor exchanges data with upper-level memory (Blk X); blocks move between the upper and lower levels (Blk Y).]


Cache Measures

  • Hit rate: the fraction of accesses found in that level

– usually so high that we talk about the miss rate instead
– miss-rate fallacy: miss rate can be as misleading a measure of memory performance as MIPS is of CPU performance; average memory access time is what matters

  • Average memory access time

= Hit time + Miss rate x Miss penalty (ns or clocks)

  • Miss penalty: time to replace a block from the lower level, including the time to deliver it to the CPU

– access time: time to the lower level = f(latency to lower level)
– transfer time: time to transfer the block = f(bandwidth between upper & lower levels)
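As a quick check of the formula, here is a minimal sketch; the numbers (1-cycle hit time, 5% miss rate, 50-cycle miss penalty) are illustrative, not from the text:

    /* Average memory access time = hit time + miss rate x miss penalty. */
    #include <stdio.h>

    double amat(double hit_time, double miss_rate, double miss_penalty)
    {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void)
    {
        printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 50.0));  /* 3.50 */
        return 0;
    }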


Simplest Cache: Direct Mapped

[Figure: a 16-location memory (addresses 0-F) mapping into a 4-byte direct-mapped cache (indices 0-3).]

  • Cache location 0 can be occupied by data from:

– memory location 0, 4, 8, ... etc.
– in general: any memory location whose 2 LSBs of the address are 0s
– Address<1:0> => cache index

  • Which one should we place in the cache?
  • How can we tell which one is in the cache?


1 KB Direct Mapped Cache, 32B blocks

  • For a 2^N-byte cache with 2^M-byte blocks:

– the uppermost (32 - N) bits are always the Cache Tag
– the lowest M bits are the Byte Select (block size = 2^M)
– the bits in between form the Cache Index

[Figure: 1 KB direct-mapped cache, 32-byte blocks. A 32-bit address splits into a 22-bit Cache Tag (example: 0x50), a 5-bit Cache Index (ex: 0x01), and a 5-bit Byte Select (ex: 0x00). Each of the 32 entries holds a Valid Bit, the tag (stored as part of the cache "state"), and 32 bytes of data (Byte 0 ... Byte 31; Byte 32 ... Byte 63; ...; Byte 992 ... Byte 1023).]
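A small sketch of the address split for this cache (N = 10, M = 5); the example address is chosen so it reproduces the figure's values (tag 0x50, index 0x01, byte select 0x00):

    /* Address split for a 1 KB (2^10) direct-mapped cache with
       32-byte (2^5) blocks: tag = bits 31..10, index = bits 9..5,
       byte select = bits 4..0.                                    */
    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_BITS 5    /* M: 32-byte blocks */
    #define CACHE_BITS 10   /* N: 1 KB cache     */

    int main(void)
    {
        uint32_t addr = 0x00014020u;  /* illustrative address */
        uint32_t byte_sel = addr & ((1u << BLOCK_BITS) - 1);
        uint32_t index = (addr >> BLOCK_BITS)
                         & ((1u << (CACHE_BITS - BLOCK_BITS)) - 1);
        uint32_t tag = addr >> CACHE_BITS;

        printf("tag=0x%02x index=0x%02x byte=0x%02x\n",
               (unsigned)tag, (unsigned)index, (unsigned)byte_sel);
        /* prints: tag=0x50 index=0x01 byte=0x00 */
        return 0;
    }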


Simplest Cache: Direct Mapped (16 KB, 32-byte blocks)

[Figure: memory addresses mapping into a 16 KB direct-mapped cache with indices 0-511.]

With 512 lines of 32 bytes, bits <4:0> are the offset within the line, bits <13:5> the cache index, and bits <31:14> the tag:

Address 16384:  tag 000000000000000001  index 000000000  offset 00000
Address 16415:  tag 000000000000000001  index 000000000  offset 11111
Address 16416:  tag 000000000000000001  index 000000001  offset 00000


Simplest Cache: Direct Mapped (continued)

Addresses that differ only in the tag map to the same cache line:

Address 16384:  tag 000000000000000001  index 000000000  offset 00000
Address 32768:  tag 000000000000000010  index 000000000  offset 00000
Address 49152:  tag 000000000000000011  index 000000000  offset 00000

(Tag | Index | Location in cache line)


Set Associative Cache: 4-byte, 2-way

[Figure: a 16-location memory (addresses 0-F) mapping into a 4-byte, 2-way set-associative cache with 2 sets.]

  • Set 0 can be occupied by data from:

– memory location 0, 2, 4, 6, 8, ... etc.
– in general: any memory location whose low-order address bit is 0
– Address<0> => cache index

  • Which one should we place in the cache?
  • How can we tell which one is in the cache?


Set Associative Cache: 16 KB, 4-way

[Figure: memory addresses mapping into a 16 KB 4-way set-associative cache with 128 sets (index 0-127); addresses 4096, 8192, 12288, 16384, 20480, 24576 all land in set 0.]

Direct mapped vs. 4-way set associative for address 16384:

Direct mapped:  tag 000000000000000001 (bits 31-14)    index 000000000 (13-5)  offset 00000
4-way:          tag 00000000000000000100 (bits 31-12)  index 0000000 (11-5)    offset 00000

Why? The 4-way cache has only 128 sets, so the index shrinks to 7 bits and the two freed bits join the tag.


4-Way Set Associative Cache (continued)

Address 16416:  tag 00000000000000000100  index 0000001  offset 00000
Address 16384:  tag 00000000000000000100  index 0000000  offset 00000
Address 16415:  tag 00000000000000000100  index 0000000  offset 11111
Address 8192:   tag 00000000000000000010  index 0000000  offset 00000
Address 12288:  tag 00000000000000000011  index 0000000  offset 00000

[Figure: addresses 4096, 8192, 12288, 16384, 20480, 24576 all map into set 0 of the 4-way cache, distinguished by tag.]


Q1: Where Can A Block Be Placed In A Cache?

  • Block 12 placed in an 8-block cache:

– fully associative: anywhere
– direct mapped: only block (12 mod 8) = 4
– 2-way set associative: either way of set (12 mod 4) = 0
– set-associative mapping = block number modulo the number of sets


Two-way Set Associative Cache

  • N-way set associative: N entries for each cache index

– N direct-mapped caches operate in parallel (N typically 2 to 4)

  • Example: two-way set associative cache

– the cache index selects a "set" from the cache
– the two tags in the set are compared in parallel
– data is selected based on the tag comparison

[Figure: two ways, each with Valid, Cache Tag, and Cache Data arrays; the Cache Index selects one set, both tags are compared with the address tag, the comparator outputs are ORed into Hit, and Sel1/Sel0 steer a mux that picks the hitting way's cache block.]


Disadvantage of Set Associative Cache

  • N-way set associative cache vs. direct mapped cache:

– N comparators vs. 1
– extra MUX delay for the data
– data comes AFTER hit/miss

  • In a direct mapped cache, the cache block is available BEFORE hit/miss:

– possible to assume a hit and continue; recover later if it was a miss


Q2: How is a block found if it is in the cache?

  • Tag on each block

– no need to check the index or block offset

  • Increasing associativity shrinks the index and expands the tag

[Figure 5.3: the block address splits into Tag and Index, followed by the Block Offset.]


Q3: Which block should be replaced on a cache miss?

  • Easy for direct mapped
  • Set associative or fully associative:

– Random
– LRU (Least Recently Used)

Miss rates (this is Figure 5.4):

           2-way            4-way            8-way
Size       LRU     Random   LRU     Random   LRU     Random
16 KB      5.2%    5.7%     4.7%    5.3%     4.4%    5.0%
64 KB      1.9%    2.0%     1.5%    1.7%     1.4%    1.5%
256 KB     1.15%   1.17%    1.13%   1.13%    1.12%   1.12%


Q4: What happens on a write?

  • Write through: the information is written both to the block in the cache and to the block in the lower-level memory.
  • Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.

– is the block clean or dirty?

  • Pros and cons of each?

– WT: read misses cannot result in writes
– WB: no repeated writes to the same location

  • WT is always combined with write buffers, so the processor doesn't wait for the lower-level memory


  • A Write Buffer is needed between the Cache and Memory

– Processor: writes data into the cache and the write buffer – Memory controller: write contents of the buffer to memory

  • Write buffer is just a FIFO:

– Typical number of entries: 4 – Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle

  • Memory system designer’s nightmare:

– Store frequency (w.r.t. time) -> 1 / DRAM write cycle – Write buffer saturation

[Figure: Processor → Cache, with a Write Buffer between the cache and DRAM.]
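A minimal sketch of such a 4-entry FIFO write buffer; the struct and function names are illustrative, not from the text:

    #include <stdint.h>
    #include <stdbool.h>

    #define WB_ENTRIES 4

    struct wb_entry { uint32_t addr; uint32_t data; };

    struct write_buffer {
        struct wb_entry e[WB_ENTRIES];
        int head, tail, count;
    };

    /* Processor side: returns false (stall) when the buffer is saturated. */
    bool wb_enqueue(struct write_buffer *wb, uint32_t addr, uint32_t data)
    {
        if (wb->count == WB_ENTRIES)
            return false;                     /* write buffer saturation */
        wb->e[wb->tail] = (struct wb_entry){ addr, data };
        wb->tail = (wb->tail + 1) % WB_ENTRIES;
        wb->count++;
        return true;
    }

    /* Memory-controller side: drain one entry per DRAM write cycle. */
    bool wb_dequeue(struct write_buffer *wb, struct wb_entry *out)
    {
        if (wb->count == 0)
            return false;
        *out = wb->e[wb->head];
        wb->head = (wb->head + 1) % WB_ENTRIES;
        wb->count--;
        return true;
    }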


Write-miss Policy: Write Allocate versus Not Allocate (Q5: What happens on a write?)

  • Assume a 16-bit write to memory location 0x0 causes a miss

– Do we read in the block?

  • Yes: Write Allocate
  • No: Write Not Allocate

[Figure: the 1 KB direct-mapped cache from before; the write at 0x0 selects cache index 0x00 with tag 0x00.]


Cache Performance (The ABCs of Caches)

  • Suppose a processor executes at:

– clock rate = 1000 MHz (1 ns per cycle)
– CPI = 1.0
– 50% arithmetic/logic, 30% load/store, 20% control

  • Suppose that 10% of memory operations get a 100-cycle miss penalty
  • CPI = ideal CPI + average stall cycles per instruction

= 1.0 (cycle)
+ 0.30 (data-operations/instruction) x 0.10 (misses/data-op) x 100 (cycles/miss)
= 1.0 cycle + 3.0 cycles = 4.0 cycles

  • 75% of the time the processor is stalled waiting for memory!
  • A 1% instruction miss rate would add an additional 1.0 cycle to the CPI!

[Chart: CPI breakdown: ideal CPI 1.0, data misses 1.5, instruction misses 0.5.]

Cache Performance Example (Pages 384-5)

Which has a lower miss rate:

  • A 16-KB instruction cache with a 16-KB data cache, or
  • A 32-KB unified cache?
  • Assume a hit takes 1 clock cycle and the miss penalty is 50 cycles.
  • Assume a load or store takes 1 extra clock cycle on a unified cache, since there is only one cache port.

  • Assume 75% of memory accesses are instruction references.

Miss rates:

Size    Instruction Cache   Data Cache   Unified Cache
8 KB    1.10%               10.19%       4.57%
16 KB   0.64%               6.47%        2.87%
32 KB   0.39%               4.82%        1.99%
64 KB   0.15%               3.77%        1.35%

Overall miss rate (split): (75% x 0.64%) + (25% x 6.47%) = 2.10%

Average memory access time (split)
  = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50)
  = 0.990 + 1.059 = 2.05

Average memory access time (unified)
  = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50)
  = 1.496 + 0.749 = 2.24


Improving Cache Performance (The ABCs of Caches)

The next few sections look at ways to improve cache and memory access times.

CPU time = IC x (CPI_Execution + (Memory Accesses / Instruction) x Miss Rate x Miss Penalty) x Clock Cycle Time

Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty
(Hit Time: Section 5.5; Miss Rate: Section 5.3; Miss Penalty: Section 5.4)

Does this equation make sense?


Reducing Cache Misses

5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main Memory 5.7 Virtual Memory 5.8 Protection and Examples of Virtual Memory

Classifying Misses: the 3 Cs

– Compulsory: the first access to a block is not in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses in even an infinite cache.)
– Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur because blocks are discarded and later retrieved. (Misses in a fully associative cache of size X.)
– Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X.)


3Cs Absolute Miss Rate (SPEC92)

[Figure: miss rate (0.02-0.14) vs. cache size (1-128 KB) for 1-, 2-, 4-, and 8-way associativity; the regions from top to bottom are conflict, capacity, and compulsory misses. Compulsory misses are vanishingly small.]


2:1 Cache Rule

miss rate of a 1-way associative cache of size X
  = miss rate of a 2-way associative cache of size X/2

[Figure: the same 3Cs miss-rate plot (1-128 KB, 1- to 8-way), illustrating the rule; conflict misses shrink with associativity.]


3Cs Relative Miss Rate

[Figure: the same data normalized to 100%: relative miss rate vs. cache size (1-128 KB) for 1- to 8-way associativity, split into conflict, capacity, and compulsory components.]

Flaw: the plot is for a fixed block size. Good: insight => invention.

1. Larger Block Size (Reducing Cache Misses)

Uses the principle of locality: the larger the block, the greater the chance parts of it will be used again.

[Figure: miss rate (0%-25%) vs. block size (16-256 bytes) for cache sizes 1K, 4K, 16K, 64K, and 256K.]


2. Higher Associativity (Reducing Cache Misses)

  • 2:1 Cache Rule:

miss rate of a direct-mapped cache of size N = miss rate of a 2-way cache of size N/2

  • But beware: execution time is the only final measure we can believe!

– Will the clock cycle time increase as a result of having a more complicated cache?
– Hill [1988] suggested the hit time for 2-way vs. 1-way is +10% for an external cache, +2% for an internal one


Example: Avg. Memory Access Time vs. Miss Rate (2. Higher Associativity)

The time to access memory has several components; the equation is:

Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty

The miss penalty is 50 cycles, and the clock cycle time grows with associativity:

Associativity    Clock Cycle Time
1                1.00
2                1.10
4                1.12
8                1.14



3. Victim Caches (Reducing Cache Misses)

  • How do we combine the fast hit time of direct mapped and still avoid conflict misses?
  • Add a small buffer that holds data discarded from the cache
  • A 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache
  • Used in Alpha and HP machines
  • In effect, this gives the same behavior as associativity, but only on those cache lines that really need it
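A minimal sketch of a direct-mapped cache backed by a 4-entry victim cache; the sizes and names are illustrative. On a main-cache miss that hits in the victim cache, the two lines are swapped, so only the lines that need associativity get it:

    #include <stdint.h>
    #include <stdbool.h>

    #define SETS    128
    #define VICTIMS 4

    struct line  { bool valid; uint32_t tag;   };   /* direct-mapped entry */
    struct vline { bool valid; uint32_t block; };   /* full block address  */

    struct line  cache[SETS];
    struct vline victim[VICTIMS];
    int next_victim;                                /* FIFO replacement    */

    /* Returns true on a hit in either structure. */
    bool lookup(uint32_t block)
    {
        uint32_t set = block % SETS, tag = block / SETS;

        if (cache[set].valid && cache[set].tag == tag)
            return true;                            /* ordinary hit */

        for (int i = 0; i < VICTIMS; i++)
            if (victim[i].valid && victim[i].block == block) {
                /* Victim hit: swap, so the promoted line enters the cache
                   and the displaced line becomes the new victim entry.   */
                struct vline displaced = { cache[set].valid,
                                           cache[set].tag * SETS + set };
                cache[set] = (struct line){ true, tag };
                victim[i] = displaced;
                return true;                        /* no memory access */
            }

        /* True miss: the evicted line (if any) goes to the victim cache,
           and the requested block is fetched into the main cache.        */
        if (cache[set].valid) {
            victim[next_victim] =
                (struct vline){ true, cache[set].tag * SETS + set };
            next_victim = (next_victim + 1) % VICTIMS;
        }
        cache[set] = (struct line){ true, tag };
        return false;
    }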

Reducing Cache Miss Penalty

The time to handle a miss is increasingly the controlling factor, because processor speed has improved far faster than memory speed.

5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main Memory 5.7 Virtual Memory 5.8 Protection and Examples of Virtual Memory

Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty


Prioritization of Read Misses over Writes (Reducing Cache Miss Penalty)

  • Write through with write buffers offers RAW conflicts with main-memory reads on cache misses
  • If we simply wait for the write buffer to empty, we might increase the read miss penalty (by 50% on the old MIPS 1000)
  • Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue
  • Write back?

– Read miss replacing a dirty block
– Normal: write the dirty block to memory, then do the read
– Instead: copy the dirty block to a write buffer, do the read, and then do the write
– The CPU stalls less, since it restarts as soon as the read is done


Sub-block Placement for Reduced Miss Penalty (Reducing Cache Miss Penalty)

  • Don’t have to load the full block on a miss
  • Valid bits per sub-block indicate which parts are valid
  • (Originally invented to reduce tag storage)

[Figure: one tag per block, with valid bits per sub-block.]


Early Restart and Critical Word First (Reducing Cache Miss Penalty)

  • Don’t wait for the full block to be loaded before restarting the CPU

– Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
– Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch or requested word first

  • Generally useful only with large blocks
  • Spatial locality is a problem: the CPU tends to want the next sequential word, so it is not clear how much early restart helps


Second Level Caches (Reducing Cache Miss Penalty)

  • L2 equations:

Average Memory Access Time = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
Average Memory Access Time = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)

  • Definitions:

– Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
– Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)
– The global miss rate is what matters
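A quick sketch of these equations with illustrative numbers (not from the text): a 1-cycle L1 hit, 4% L1 miss rate, 10-cycle L2 hit, 50% local L2 miss rate, and 100-cycle L2 miss penalty:

    #include <stdio.h>

    int main(void)
    {
        double hit_l1 = 1, miss_rate_l1 = 0.04;
        double hit_l2 = 10, local_miss_l2 = 0.50, penalty_l2 = 100;

        double penalty_l1 = hit_l2 + local_miss_l2 * penalty_l2;  /* 60 cycles  */
        double amat = hit_l1 + miss_rate_l1 * penalty_l1;         /* 3.4 cycles */
        double global_miss_l2 = miss_rate_l1 * local_miss_l2;     /* 2% of all
                                                                     accesses   */
        printf("AMAT=%.1f, global L2 miss rate=%.1f%%\n",
               amat, 100 * global_miss_l2);
        return 0;
    }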


Comparing Local and Global Miss Rates (Second Level Caches)

  • 32 KB first-level cache; increasing second-level cache
  • The global miss rate is close to the single-level cache rate, provided L2 >> L1
  • Don't use the local miss rate
  • L2 is not tied to the CPU clock cycle!
  • What matters: cost & average memory access time
  • L2 generally gives fast hit times and fewer misses
  • Since hits in L2 are few, target miss reduction

[Figure: local vs. global miss rate as a function of L2 cache size, plotted on linear and log scales.]


Reducing Hit Time

This section is about how to reduce the time to access data that IS in the cache: what techniques are useful for quickly and efficiently finding out whether data is in the cache and, if it is, getting that data out of the cache.

5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main Memory 5.7 Virtual Memory 5.8 Protection and Examples of Virtual Memory

Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty


Reducing Hit Time

Small and Simple Caches

  • Why does the Alpha 21164 have 8 KB instruction and 8 KB data caches plus a 96 KB second-level cache?

– a small data cache keeps the clock rate high

  • Direct mapped, on chip
  • Since most data DOES hit the cache, saving a cycle on cache access is a significant result.


Avoid Address Translation During Indexing of the Cache (Reducing Hit Time)

  • Send the virtual address to the cache? Called a Virtually Addressed Cache (or just Virtual Cache), vs. a Physical Cache

– Every time the process is switched, the cache logically must be flushed; otherwise we get false hits

  • The cost is the time to flush + the “compulsory” misses from an empty cache

– Must deal with aliases (sometimes called synonyms): two different virtual addresses that map to the same physical address
– I/O uses physical addresses, so it needs virtual addresses to interact with a virtual cache

  • Solutions to aliases:

– HW guarantees that every cache block has a unique physical address
– SW guarantee: aliases must agree in their lower n bits; as long as these cover the index field and the cache is direct mapped, blocks are unique; called page coloring

  • Solution to the cache flush:

– Add a process-identifier tag that identifies the process as well as the address within the process: you can't get a hit if the process is wrong


How to do parallel operations with Virtually Addressed Caches

[Figure: three organizations.
(1) Conventional: CPU -> TLB (VA to PA) -> cache -> main memory; the cache sees only physical addresses.
(2) Virtually addressed cache: CPU -> cache (VA); translate only on a miss; has the synonym problem.
(3) Overlapped: the cache access proceeds in parallel with VA translation, using PA tags and a physically addressed L2; requires the cache index to remain invariant across translation.]

Fast Cache Hits by Avoiding Translation: Process ID impact

[Figure: miss rate (up to 20%) vs. cache size (2 KB to 1024 KB). Black: uniprocess. Light gray: multiprocess when flushing the cache on a switch. Dark gray: multiprocess when using a process-ID tag.]


Pipelining Writes for Fast Write Hits (Reducing Hit Time)

  • Pipeline the tag check and the cache update as separate stages: the current write's tag check overlaps the previous write's cache update
  • Only STORES are in the pipeline; it is empty during a miss

[Figure: "Store r2,(r1)" checks the tag for r1; the intervening Add and Sub need no cache stage; by the time "Store r4,(r3)" checks r3, the update M[r1]<-r2 completes.]

  • The shaded stage is a "Delayed Write Buffer"; it must be checked on reads: either complete the write first or read from the buffer


Cache Optimization Summary

Technique                           MR    MP    HT    Complexity
Larger Block Size                   +     –
Higher Associativity                +           –     1
Victim Caches                       +                 2
Pseudo-Associative Caches           +                 2
HW Prefetching of Instr/Data        +                 2
Compiler Controlled Prefetching     +                 3
Compiler Reduce Misses              +
Priority to Read Misses                   +           1
Subblock Placement                        +     +     1
Early Restart & Critical Word 1st         +           2
Non-Blocking Caches                       +           3
Second Level Caches                       +           2
Small & Simple Caches               –           +
Avoiding Address Translation                    +     2
Pipelining Writes                               +     1

(MR = miss rate, MP = miss penalty, HT = hit time; + improves the measure, – hurts it)


Main Memory

5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main Memory 5.7 Virtual Memory 5.8 Protection and Examples of Virtual Memory

  • Performance of main memory:

– Latency: determines the cache miss penalty

  • Access time: time between the request and when the word arrives
  • Cycle time: minimum time between requests

– Bandwidth: determines the miss penalty for I/O and large (L2) blocks

  • Main memory is DRAM: Dynamic Random Access Memory

– dynamic, since it needs to be refreshed periodically (8 ms, 1% of the time)
– addresses are divided into 2 halves (memory as a 2D matrix):

  • RAS, or Row Access Strobe
  • CAS, or Column Access Strobe

  • Caches use SRAM: Static Random Access Memory

– no refresh (6 transistors/bit vs. 1 transistor/bit; the area is 10X)
– the address is not divided: the full address arrives at once

  • Size: DRAM/SRAM is 4-8x; cost and cycle time: SRAM/DRAM is 8-16x


Core Memory (Main Memory: Memory Technology & Organization)

  • “Out-of-Core”, “In-Core,” “Core Dump”?
  • “Core memory”?
  • Non-volatile, magnetic
  • Lost to 4 Kbit DRAM (today using 64 Kbit DRAM)
  • Access time 750 ns, cycle time 1500-3000 ns


DRAM logical organization (4 Mbit)

  • Square root of bits per RAS/CAS

[Figure: 11 address bits (A0-A10) drive a row decoder and a column decoder around a 2,048 x 2,048 memory array; sense amps & I/O connect the array to Data In / Data Out; each bit is a storage cell selected by a word line.]


DRAM logical organization (4 Mbit), block view

[Figure: four blocks (Block 0 - Block 3), each with a 9:512 block row decoder; the row address and column address together select 8 I/Os per block, driving the D input and Q output.]


DRAM Performance

  • A 60 ns (tRAC) DRAM can

– perform a row access only every 110 ns (tRC) – perform column access (tCAC) in 15 ns, but time between column accesses is at least 35 ns (tPC).

  • In practice, external address delays and turning around buses

make it 40 to 50 ns

  • These times do not include the time to drive the addresses off the

microprocessor nor the memory controller overhead!



Simple:
– CPU, cache, bus, memory the same width (32 or 64 bits)
Wide:
– CPU/Mux 1 word; Mux/Cache, bus, memory N words (Alpha: 64 bits & 256 bits; UltraSPARC: 512)
Interleaved:
– CPU, cache, bus 1 word; memory N modules (4 modules); example is word interleaved

[Figure: (a) one-word-wide memory organization; (b) wide memory organization with a multiplexor between cache and CPU; (c) interleaved memory organization with four memory banks on one bus.]

Main Memory

Main Memory Performance


  • Timing model (word size is 32 bits):

– 1 cycle to send the address
– 6 cycles access time, 1 cycle to send data
– cache block is 4 words

  • Simple miss penalty = 4 x (1 + 6 + 1) = 32
  • Wide miss penalty = 1 + 6 + 1 = 8
  • Interleaved miss penalty = 1 + 6 + 4 x 1 = 11


Independent Memory Banks (Main Memory)

  • Memory banks for independent accesses vs. faster sequential accesses:

– Multiprocessor
– I/O
– CPU with hit under n misses, non-blocking cache

  • Superbank: all memory active on one block transfer (or bank)
  • Bank: portion within a superbank that is word interleaved

[Figure: memory divided into superbanks, each containing word-interleaved banks.]


Virtual Memory

5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main Memory 5.7 Virtual Memory 5.8 Protection and Examples of Virtual Memory

WHY VIRTUAL MEMORY?

Virtual memory permits a program's memory to be physically noncontiguous, so it can be allocated from wherever memory is available. This avoids fragmentation and compaction.

  • We've previously required the entire logical space of the process to be in memory before the process could run. We will now look at alternatives to this.
  • Most code/data isn't needed at any instant, or even within a finite time; we can bring it in only as needed.

VIRTUES

  • Gives a higher level of multiprogramming
  • Program size isn't constrained (thus the term 'virtual memory'); virtual memory allows very large logical address spaces
  • Smaller swap sizes
Definitions

Virtual memory: the conceptual separation of user logical memory from physical memory. Thus we can have large virtual memory on a small physical memory.

[Figure: a virtual address space mapped, page by page, through a memory map onto physical memory and disk.]


Paging

Permits a program's memory to be physically noncontiguous so it can be allocated from wherever available. This avoids fragmentation and compaction.

HARDWARE: an address is determined by:

  page number (index into table) + offset
    ---> mapping into --->
  base address (from table) + offset

Frames = physical blocks; pages = logical blocks. The size of frames/pages is defined by hardware (a power of 2, to ease calculations).

Paging Example: 32-byte memory with 4-byte pages

Logical memory (16 bytes):
  0 a  1 b  2 c  3 d  4 e  5 f  6 g  7 h  8 i  9 j  10 k  11 l  12 m  13 n  14 o  15 p

Page table: page 0 -> frame 5, page 1 -> frame 6, page 2 -> frame 1, page 3 -> frame 2

Physical memory:
  bytes 4-7:   i j k l
  bytes 8-11:  m n o p
  bytes 20-23: a b c d
  bytes 24-27: e f g h
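A minimal sketch of the translation in this example (the function name is illustrative); with 4-byte pages the offset is the low 2 bits and the page number is the rest:

    #include <stdio.h>

    #define PAGE_SIZE 4

    int page_table[4] = { 5, 6, 1, 2 };   /* page -> frame, as above */

    int translate(int logical)
    {
        int page   = logical / PAGE_SIZE;
        int offset = logical % PAGE_SIZE;
        return page_table[page] * PAGE_SIZE + offset;
    }

    int main(void)
    {
        /* logical 0 ('a') -> physical 20; logical 13 ('n') -> physical 9 */
        printf("%d %d\n", translate(0), translate(13));
        return 0;
    }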

IMPLEMENTATION OF THE PAGE TABLE

– A 32-bit machine can address 4 gigabytes, which is 4 million pages (at 1024 bytes/page). WHO says how big a page is, anyway?
– Could use dedicated registers (OK only with small tables.)
– Could use a register pointing to a table in memory (slow access.)
– Cache or associative memory (TLB = Translation Lookaside Buffer): simultaneous search is fast and uses only a few registers.

[Figure: TLB lookup, with TLB hit and TLB miss paths.]


IMPLEMENTATION OF THE PAGE TABLE

Issues include:
  • key and value
  • hit rate of 90-98% with 100 registers
  • add an entry if not found

Effective access time = %fast x time_fast + %slow x time_slow

Relevant times:
  • 20 nanoseconds to search the associative memory (the TLB)
  • 200 nanoseconds to access memory and bring the entry into the TLB for next time

Time of access:
  • hit = 1 search + 1 memory reference
  • miss = 1 search + 1 memory reference (of the page table) + 1 memory reference
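A worked sketch of the effective-access-time formula using the times above (20 ns TLB search, 200 ns memory) and an assumed 98% hit rate (within the 90-98% range quoted):

    #include <stdio.h>

    int main(void)
    {
        double tlb = 20.0, mem = 200.0, hit_rate = 0.98;
        double t_hit  = tlb + mem;        /* 1 search + 1 memory reference   */
        double t_miss = tlb + mem + mem;  /* + extra reference to page table */
        double eff = hit_rate * t_hit + (1 - hit_rate) * t_miss;
        printf("effective access time = %.1f ns\n", eff);  /* 224.0 ns */
        return 0;
    }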


SHARED PAGES

Data occupying one physical page, but pointed to by multiple logical pages.

Useful for common code, which must be write-protected. (NO writable data mixed with code.)

Extremely useful for read/write communication between processes.


INVERTED PAGE TABLE

One entry for each real page of memory. An entry consists of the virtual address of the page stored in that real memory location, with information about the process that owns the page.

Essential when you need to do work on the page and must find out what process owns it.

Use a hash table to limit the search to one, or at most a few, page-table entries.
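A minimal sketch of a hashed inverted page table; the sizes, hash function, and field names are illustrative. The index of the matching entry is itself the physical frame number:

    #include <stdint.h>

    #define FRAMES       1024
    #define HASH_BUCKETS 1024

    struct ipt_entry {
        int      pid;
        uint32_t vpage;
        int      next;      /* next frame in this hash chain, or -1 */
    };

    struct ipt_entry ipt[FRAMES];
    int hash_anchor[HASH_BUCKETS];   /* first frame per bucket, or -1 */

    void ipt_init(void)
    {
        for (int b = 0; b < HASH_BUCKETS; b++)
            hash_anchor[b] = -1;
    }

    static unsigned hash(int pid, uint32_t vpage)
    {
        return (vpage * 2654435761u ^ (unsigned)pid) % HASH_BUCKETS;
    }

    /* Returns the frame number, or -1 on a page fault. */
    int ipt_lookup(int pid, uint32_t vpage)
    {
        for (int f = hash_anchor[hash(pid, vpage)]; f != -1; f = ipt[f].next)
            if (ipt[f].pid == pid && ipt[f].vpage == vpage)
                return f;   /* the entry's index IS the physical frame */
        return -1;
    }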

MULTILEVEL PAGE TABLE

A means of using page tables for large address spaces.

HARDWARE (segmentation)

Must map a dyad (segment / offset) into a one-dimensional address.

[Figure: the CPU issues a logical address (S, D); the segment table supplies Limit and Base for segment S; if D < Limit, the physical address sent to memory is Base + D, otherwise a fault.]


Basic Issues in VM System Design

  • Size of information blocks that are transferred from secondary to main storage (M)
  • If a block of information is brought into M and M is full, some region of M must be released to make room for the new block: the replacement policy
  • Which region of M is to hold the new block: the placement policy
  • The missing item is fetched from secondary memory only on the occurrence of a fault: the demand load policy

[Figure: registers, cache, memory, and disk; pages move between memory frames and disk.]


4 Questions for Virtual Memory

– Q1: Where can a block be placed in the upper level? Fully associative, set associative, direct mapped
– Q2: How is a block found if it is in the upper level? Tag/block
– Q3: Which block should be replaced on a miss? Random, LRU
– Q4: What happens on a write? Write back or write through (with a write buffer)


Techniques for Fast Address Translation

Virtual Address and a Cache

[Figure: CPU -> translation (VA to PA) -> cache -> main memory; a hit returns data, a miss goes to main memory.]

It takes an extra memory access to translate a VA to a PA. This makes cache access very expensive, and this is the "innermost loop" that you want to go as fast as possible.


A Brief Detour: why access the cache with the PA at all? VA caches have a problem!

Synonym / alias problem: two different virtual addresses map to the same physical address, giving two different cache entries holding data for the same physical address! On an update, all cache entries with the same physical address must be updated, or memory becomes inconsistent.

Determining this requires significant hardware: essentially an associative lookup on the physical address tags to see if you have multiple hits. Alternatively, a software-enforced alias boundary: VA and PA must share the same least-significant bits up to the cache size.


TLBs

A way to speed up translation is to use a special cache of recently used page table entries. This has many names, but the most frequently used is Translation Lookaside Buffer, or TLB.

A TLB entry holds a virtual address, physical address, dirty, ref, valid, and access bits. It is really just a cache on the page table mappings; TLB access time is comparable to cache access time (much less than main memory access time).

Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped. TLBs are usually small, typically not more than 128-256 entries even on high-end machines. This permits a fully associative lookup on these machines. Most mid-range machines use small n-way set-associative organizations.

Translation with a TLB:

[Figure: the CPU sends the VA to a TLB lookup (time t); on a TLB hit, the PA goes to the cache (1/2 t) and data returns; on a TLB miss, full translation (20 t) is performed, then the cache and main memory are accessed.]


Machines with TLBs go one step further to reduce cycles per cache access: they overlap the cache access with the TLB access. The high-order bits of the VA are used to look in the TLB while the low-order bits are used as the index into the cache.


Overlapped Cache & TLB Access

[Figure: the 20-bit virtual page number feeds an associative TLB lookup while the 12-bit displacement (10-bit index + 2-bit "00" offset) indexes a 1K-entry, 4-byte-block cache; the 32-bit PA from the TLB is compared against the cache tag to produce Hit/Miss and data.]

IF cache hit AND (cache tag = PA) THEN deliver data to CPU
ELSE IF [cache miss OR (cache tag != PA)] AND TLB hit THEN access memory with the PA from the TLB
ELSE do standard VA translation


Problems With Overlapped TLB Access

Overlapped access only works as long as the address bits used to index into the cache do not change as the result of VA translation. This usually limits things to small caches, large page sizes, or high n-way set-associative caches if you want a large cache.

Example: suppose everything is the same except that the cache is increased to 8 KB instead of 4 KB. The cache now needs an 11-bit index + 2-bit offset = 13 bits, but the 12-bit page offset supplies only 12 of them, so the top index bit comes from the virtual page number. This bit is changed by VA translation, but it is needed for the cache lookup.

Solutions: go to 8 KB page sizes; go to a 2-way set-associative cache (1K x 4 bytes x 2 ways); or have SW guarantee VA[13] = PA[13].
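The constraint can be stated as a one-line check: overlapped access works when all the bits that index the cache lie within the page offset, i.e. cache size / associativity <= page size. A small sketch with the numbers from this example:

    #include <stdio.h>
    #include <stdbool.h>

    bool overlap_ok(unsigned cache_bytes, unsigned ways, unsigned page_bytes)
    {
        return cache_bytes / ways <= page_bytes;
    }

    int main(void)
    {
        printf("%d\n", overlap_ok(4096, 1, 4096)); /* 1: 4 KB direct mapped OK  */
        printf("%d\n", overlap_ok(8192, 1, 4096)); /* 0: needs a translated bit */
        printf("%d\n", overlap_ok(8192, 2, 4096)); /* 1: 2-way fixes it         */
        return 0;
    }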


Protection and Examples

The goal: one process should not interfere with another.

Process model:
  • privileged kernel
  • independent user processes

Privileges vs. policy:
  • the architecture provides primitives
  • the OS implements policy
  • problems arise when h/w implements policy

5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main Memory 5.7 Virtual Memory 5.8 Protection and Examples of Virtual Memory

Protection and Examples

Protection Primitives: Issues

user vs. kernel:
  • at least one privileged mode
  • a special case of rings
  • usually implemented as mode bits

How do we switch to kernel mode?
  • protected "gates": change mode and continue at a pre-determined address

Hardware compares mode bits to access rights:
  • certain resources can be accessed only in kernel mode


Protection and Examples

Protection Primitives: Issues

base and bounds:
  • privileged registers
  • base <= address <= bounds

segmentation:
  • multiple base and bounds registers
  • protection bits for each segment

page-level protection:
  • protection bits in the page table entry
  • cache them in the TLB for speed


Protection and Examples

Protection on the Pentium

Pentium contains:

  • Four segments: a program can run in any of them.
  • Four segments: a program may or may not be able to touch data in a particular segment.
  • How would a user and an operating system use these features?

Protection and Examples

Protection on the Pentium

A segment register contains:

  • Base
  • Limit
  • Present
  • Readable
  • Writable

Summary

5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main Memory 5.7 Virtual Memory 5.8 Protection and Examples of Virtual Memory