Chapter Overview
5.1 Introduction
5.2 The ABCs of Caches
5.3 Reducing Cache Misses
5.4 Reducing Cache Miss Penalty
5.5 Reducing Hit Time
5.6 Main Memory
5.7 Virtual Memory
5.8 Protection and Examples of Virtual Memory

The Big Picture:
The Five Classic Components of a Computer
[Diagram: the five classic components – Processor (Control + Datapath), Memory, Input, Output.]
Memory topics in this chapter:
– SRAM Memory Technology
– DRAM Memory Technology
– Memory Organization
Technology Trends
DRAM generations:

Year    Size      Cycle Time
1980    64 Kb     250 ns
1983    256 Kb    220 ns
1986    1 Mb      190 ns
1989    4 Mb      165 ns
1992    16 Mb     145 ns
1995    64 Mb     120 ns

          Capacity         Speed (latency)
Logic:    2x in 3 years    2x in 3 years
DRAM:     4x in 3 years    2x in 10 years
Disk:     4x in 3 years    2x in 10 years

[Figure: CPU vs. DRAM performance, 1980-2000 – processor performance improves far faster than DRAM, so the processor-memory gap widens every year.]
Who Cares About the Memory Hierarchy?
[Diagram: processor (P), cache (C), and memory (Mem) connected by a bus; the bus is usually narrower than the cache block size (e.g. 8 bytes vs. 32). Hit time Thit and miss time Tmiss are measured across this path.]
[Timing diagram: on a miss the bus carries the address, then sits unused while the memory accesses the data, then carries the data back.]
Three memory organizations address this: simple, wide (with a mux between cache and processor), and interleaved (several narrow banks sharing one bus).
Simple: each memory transaction is handled separately on the bus (address A1, data D1, address A2, data D2, ...). Example: 32-byte cache lines, 8-byte-wide bus and memory; bus at 100 MHz (10 ns), memory at 80 ns. Total: 40 cycles = 4 × (1 + 8 + 1).
Wide: the bus and memory are 32 bytes wide, so a 32-byte cache line is fetched in one transaction. Same example: bus at 100 MHz, memory 80 ns. Total: 10 cycles = 1 × (1 + 8 + 1). Works great, but wide buses and memories are expensive!
Interleaved: memory is 8 bytes wide but there are four banks, so a 32-byte cache line is fetched in four overlapped transactions (one address, then data D1-D4 back to back). Same example: bus at 100 MHz, memory 80 ns. Total: 13 cycles = 1 + 8 + 4. A nice tradeoff.
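A quick sketch in C (my own; the cycle counts mirror the example above) that reproduces the three miss-penalty calculations:

    #include <stdio.h>

    int main(void) {
        int xfers = 32 / 8;               /* 8-byte bus transfers per 32-byte line */
        int addr = 1, mem = 8, data = 1;  /* cycles: send address, access, send data */

        int simple      = xfers * (addr + mem + data);  /* 4 * (1+8+1) = 40 */
        int wide        = addr + mem + data;            /* 1 * (1+8+1) = 10 */
        int interleaved = addr + mem + xfers * data;    /* 1 + 8 + 4   = 13 */

        printf("simple=%d wide=%d interleaved=%d cycles\n",
               simple, wide, interleaved);
        return 0;
    }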
Interleaving also helps non-blocking (miss-under-miss) caches and prefetching.
[Diagram: with four banks, sequential accesses 1, 2, 3, ... rotate through the banks; the pathological case is a stride (e.g. 4 or 8) that sends every access to the same bank.]
The pathological case in software:

    int x[256][512];
    for (j = 0; j < 512; j = j + 1)
        for (i = 0; i < 256; i = i + 1)
            x[i][j] = 2 * x[i][j];

Successive accesses are one row (512 words) apart; since 512 is a multiple of any power-of-two number of banks, all accesses go to the same bank. Fixes:
– software: loop interchange (see the sketch after this list)
– software: adjust the array size to a prime # (“array padding”)
– hardware: prime number of banks (e.g. 17)
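A sketch of the loop-interchange fix: the same computation, but the unit-stride inner loop walks along a row, so successive accesses stay within one cache line and rotate across banks instead of hammering a single bank:

    int x[256][512];
    int i, j;
    for (i = 0; i < 256; i = i + 1)        /* rows in the outer loop */
        for (j = 0; j < 512; j = j + 1)    /* columns inner: stride-1 accesses */
            x[i][j] = 2 * x[i][j];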
Improved DRAM interfaces:
– Extended Data Out (EDO): 30% faster in page mode
– Synchronous DRAM: 2-4 banks on chip, a clock signal to the DRAM, transfers synchronous to the system clock (66-150 MHz)
– RAMBUS: a startup company; reinvents the DRAM interface
For each: what will they cost, and will they survive?
RAMBUS details:
– 18 bits of data, 8 bits of control at 800 MHz
– collected into packets of 8 at 100 MHz (10 ns)
– that's 16 bytes with ECC, plus controls, for multiple banks
– a 128 Mbit (16 MByte) part has 32 banks
– TRA is 40 ns (but effectively 60 ns due to the interface); one row cycle time is 80 ns, but a new bank can start every 10 ns
– TCA is 20 ns (effectively 40 ns); a new column access can start every 10 ns
Today’s Situation: Microprocessor
– Time of a full cache miss, measured in instructions executed:
  1st Alpha (7000):   340 ns / 5.0 ns =  68 clks × 2 issue, or 136 instructions
  2nd Alpha (8400):   266 ns / 3.3 ns =  80 clks × 4 issue, or 320 instructions
  3rd Alpha (t.b.d.): 180 ns / 1.7 ns = 108 clks × 6 issue, or 648 instructions
– 1/2X latency × 3X clock rate × 3X instr/clock ⇒ ~5X more instructions lost per miss
Levels of the Memory Hierarchy
Level         Capacity     Access Time            Cost                      Staging Xfer Unit (managed by)
Registers     100s bytes   1s ns                  --                        1-8 bytes (prog./compiler)
Cache         K bytes      4 ns                   1-0.1 cents/bit           blocks, 8-128 bytes (cache cntl)
Main Memory   M bytes      100-300 ns             .0001-.00001 cents/bit    pages, 512-4K bytes (OS)
Disk          G bytes      10 ms (10,000,000 ns)  10^-5 - 10^-6 cents/bit   files, Mbytes (user/operator)
Tape          infinite     sec-min                10^-8 cents/bit           --

Upper levels are faster; lower levels are larger.
In this section we will:
Learn lots of definitions about caches – you can’t talk about something until you understand it (this is true in computer science at least!)
Answer some fundamental questions about caches:
– Where can a block be placed in the upper level? (Block placement)
– How is a block found if it is in the upper level? (Block identification)
– Which block should be replaced on a miss? (Block replacement)
– What happens on a write? (Write strategy)
5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main Memory 5.7 Virtual Memory 5.8 Protection and Examples of Virtual Memory
The Principle of Locality
– Programs access a relatively small portion of the address space at any instant of time.
– Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
– Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
Memory Hierarchy: Terminology
– Hit Rate: the fraction of memory accesses found in the upper level
– Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
– Miss Rate = 1 - (Hit Rate)
– Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
[Diagram: block Blk X in upper-level memory, block Blk Y in lower-level memory, with paths to and from the processor.]
Cache Measures
– The hit rate is so high that we usually talk about the miss rate instead
– Miss rate fallacy: miss rate is as misleading a predictor of average memory access time as MIPS is of CPU performance
Average memory access time = Hit time + Miss rate × Miss penalty (ns or clocks)
Miss penalty: time to replace a block from the lower level, including the time to replace it in the CPU
– access time: time to reach the lower level = f(latency to lower level)
– transfer time: time to transfer the block = f(bandwidth between upper & lower levels)
Simplest Cache: Direct Mapped – a 4-byte direct-mapped cache
[Diagram: memory addresses 0-F map to cache indexes 0-3.]
Cache index 0 can be occupied by data from:
– Memory location 0, 4, 8, ... etc.
– In general: any memory location whose 2 LSBs of the address are 0s
– Address<1:0> => cache index
1 KB Direct Mapped Cache, 32B blocks
For a 2^N byte cache:
– The uppermost (32 - N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block Size = 2^M)
[Diagram: the 32-bit address splits into Cache Tag <31:10> (ex: 0x50), Cache Index <9:5> (ex: 0x01), and Byte Select <4:0> (ex: 0x00). Each cache entry holds a Valid Bit, the Cache Tag (stored as part of the cache state), and 32 bytes of Cache Data (Byte 0-31, Byte 32-63, ..., Byte 992-1023).]
Simplest Cache: Direct Mapped – a 16K Byte cache with 32-byte lines
[Diagram: memory addresses 16384, 32768, 49152, ... all map to cache index 0 (indexes run 0-511).]
Address 16384: tag <31:14> = 000000000000000001, index <13:5> = 000000000, offset <4:0> = 00000
Address 16415: tag = 000000000000000001, index = 000000000, offset = 11111
Address 16416: tag = 000000000000000001, index = 000000001, offset = 00000
Simplest Cache: Direct Mapped (continued)
Addresses 16384, 32768, and 49152 share index 0 but have different tags, so they compete for the same cache line:
Address 16384: tag = 000000000000000001, index = 000000000, offset = 00000
Address 32768: tag = 000000000000000010, index = 000000000, offset = 00000
Address 49152: tag = 000000000000000011, index = 000000000, offset = 00000
Set Associative: a 4-byte 2-way set associative cache
[Diagram: memory addresses 0-F map to one of 2 sets; each set holds 2 blocks.]
Set 0 can be occupied by data from:
– Memory location 0, 2, 4, 6, 8, ... etc.
– In general: any memory location whose LSB of the address is 0
– Address<0> => cache index
Set Associative Cache: a 16K Byte 4-way set associative cache
[Diagram: addresses 8192, 16384, 24576, ... map to set 0 (set indexes run 0-127).]
Address 16384, direct mapped (512 lines): tag <31:14> = 000000000000000001, index <13:5> = 000000000, offset <4:0> = 00000
Address 16384, 4-way set associative (128 sets): tag <31:12> = 00000000000000000100, index <11:5> = 0000000, offset <4:0> = 00000
Why the different split? With 4 ways there are only 128 sets, so the index shrinks by two bits and the tag grows by two bits. Addresses 4096, 12288, 20480, ... now also map to set 0.
4-Way Set Associative Cache (continued)
[Diagram: addresses 4096, 8192, 12288, 16384, 20480, 24576, ... in the 4-way set associative cache.]
Address 16416: tag <31:12> = 00000000000000000100, index <11:5> = 0000001, offset <4:0> = 00000
Address 16384: tag = 00000000000000000100, index = 0000000, offset = 00000
Address 16415: tag = 00000000000000000100, index = 0000000, offset = 11111
Address 8192:  tag = 00000000000000000010, index = 0000000, offset = 00000
Address 12288: tag = 00000000000000000011, index = 0000000, offset = 00000
Addresses 16384, 8192, and 12288 have the same index, so they land in the same set – and with 4 ways all three can be resident at once.
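A small helper (my own sketch; the function and names are not from the slides) that reproduces these tag/index/offset splits for the 16 KB cache with 32-byte lines:

    #include <stdio.h>

    void split(unsigned addr, int ways) {
        int sets = (16 * 1024) / (32 * ways);   /* 512 direct mapped, 128 four-way */
        unsigned offset = addr % 32;
        unsigned index  = (addr / 32) % sets;
        unsigned tag    = addr / (32u * sets);
        printf("%6u (%d-way): tag=%u index=%u offset=%u\n",
               addr, ways, tag, index, offset);
    }

    int main(void) {
        split(16384, 1);   /* tag=1, index=0           */
        split(16384, 4);   /* tag=4, index=0           */
        split(16416, 4);   /* tag=4, index=1, offset=0 */
        return 0;
    }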
Q1: Where Can A Block Be Placed In A Cache?
– Fully associative, direct mapped, or 2-way set associative
– Set-associative mapping: set = block number modulo number of sets
[Diagram: placement of a memory block in the cache under each scheme.]
Two-way Set Associative Cache
– N direct mapped caches operate in parallel (N typically 2 to 4)
– The Cache Index selects a “set” from the cache
– The two tags in the set are compared in parallel
– Data is selected based on the tag result
[Diagram: the Cache Index selects one set; both ways' valid bits and Cache Tags are compared against the Adr Tag in parallel; the compare results are ORed into Hit, and Sel1/Sel0 steer a mux that picks the matching way's Cache Data as the Cache Block.]
Disadvantage of Set Associative Cache
– N comparators vs. 1 – Extra MUX delay for the data – Data comes AFTER Hit/Miss
– Possible to assume a hit and continue. Recover later if miss.
Q2: How is a block found if it is in the cache?
[Figure 5.3: the block address divides into Tag and Index, plus a Block Offset within the block.]
– No need to check the index or block offset in the tag
– Increasing associativity shrinks the index and expands the tag
Q3: Which block should be replaced on a cache miss?
– Random
– LRU (Least Recently Used)
Miss rates (Figure 5.4):

              2-way            4-way            8-way
Size          LRU     Random   LRU     Random   LRU     Random
16 KB         5.2%    5.7%     4.7%    5.3%     4.4%    5.0%
64 KB         1.9%    2.0%     1.5%    1.7%     1.4%    1.5%
256 KB        1.15%   1.17%    1.13%   1.13%    1.12%   1.12%
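As an illustration of why LRU is cheap at low associativity (my own sketch, not from the slides): for a 2-way set, a single bit per set suffices.

    /* 2-way set associative LRU: one bit per set names the LRU way */
    int lru[128];                   /* assumed: a cache with 128 sets */

    void touch(int set, int way) {  /* call on every hit or fill of 'way' */
        lru[set] = 1 - way;         /* the other way becomes least recently used */
    }

    int victim(int set) {           /* which way to replace on a miss */
        return lru[set];
    }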
Q4: What happens on a write?
– Write through: the information is written both to the block in the cache and to the block in the lower-level memory.
– Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
  – Needs a dirty bit: is the block clean or dirty?
Pros and cons:
– WT: read misses cannot result in writes; usually combined with a write buffer so the CPU doesn't wait for the lower-level memory
– WB: no repeated writes to the same location
Write buffer for write through:
[Diagram: Processor → Cache, with a Write Buffer between the cache and DRAM.]
– Processor: writes data into the cache and the write buffer
– Memory controller: writes the contents of the buffer to memory
– Typical number of entries: 4
– Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
– Write buffer saturation occurs when store frequency (w.r.t. time) → 1 / DRAM write cycle
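A minimal sketch (my own; assumes the 4-entry buffer mentioned above) of the write-buffer handshake between processor and memory controller:

    typedef struct { unsigned addr, data; } Entry;

    Entry buf[4];                   /* 4-entry write buffer */
    int head = 0, tail = 0, count = 0;

    int cpu_write(unsigned addr, unsigned data) {
        if (count == 4) return 0;          /* buffer saturated: CPU must stall */
        buf[tail] = (Entry){addr, data};
        tail = (tail + 1) % 4; count++;
        return 1;                          /* write retires at cache speed */
    }

    void controller_drain(void) {          /* once per DRAM write cycle */
        if (count > 0) {
            /* dram_write(buf[head].addr, buf[head].data);  (assumed routine) */
            head = (head + 1) % 4; count--;
        }
    }

As long as writes arrive less often than the controller drains them (store frequency << 1 / DRAM write cycle), the buffer never fills and the CPU never stalls.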
Q4 (continued): what happens on a write miss?
Write-miss policy: Write Allocate versus Not Allocate
– Write allocate: read the block into the cache (do we read in the block? yes), then perform the write as a hit
– Not allocate: update only the lower-level memory; the block is not loaded into the cache
[Diagram: the 1 KB direct-mapped cache from before (Cache Tag ex: 0x00, Cache Index ex: 0x00, Byte Select ex: 0x00).]
Cache Performance
Assume:
– Clock Rate = 1000 MHz (1 ns per cycle)
– CPI = 1.0
– 50% arithmetic/logic, 30% load/store, 20% control
– 10% of data memory operations miss, with a 100-cycle miss penalty
CPI = 1.0 (cycle) + 0.30 (data-ops/instr) × 0.10 (miss/data-op) × 100 (cycle/miss) = 1.0 cycle + 3.0 cycles = 4.0 cycles
[Chart: CPI breakdown – Ideal CPI 1.0, Data Miss 1.5, Inst Miss 0.5 (a variant of this example with a 50-cycle penalty).]
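A sketch (my own, using the numbers above) of the CPI calculation:

    #include <stdio.h>

    int main(void) {
        double cpi_ideal = 1.0;
        double ldst_frac = 0.30;   /* 30% of instructions are loads/stores */
        double miss_frac = 0.10;   /* 10% of data operations miss */
        double penalty   = 100.0;  /* cycles per miss */

        double cpi = cpi_ideal + ldst_frac * miss_frac * penalty;
        printf("CPI = %.1f cycles\n", cpi);   /* 1.0 + 3.0 = 4.0 */
        return 0;
    }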
Cache Performance Example (H&P pages 384-5)
Which has a lower miss rate: a 16 KB instruction cache plus a 16 KB data cache, or a 32 KB unified cache? Assume 75% of accesses are instruction references, hits take 1 cycle, the miss penalty is 50 cycles, and a load or store takes 1 extra cycle on the unified cache since there is only one cache port.

Overall miss rate (split) = (75% × 0.64%) + (25% × 6.47%) = 2.10%
Average memory access time (split) = 75% × (1 + 0.64% × 50) + 25% × (1 + 6.47% × 50) = 0.990 + 1.059 = 2.05
Average memory access time (unified) = 75% × (1 + 1.99% × 50) + 25% × (1 + 1 + 1.99% × 50) = 1.496 + 0.749 = 2.24

Miss rates by size:
Size     Instruction Cache   Data Cache   Unified Cache
8 KB     1.10%               10.19%       4.57%
16 KB    0.64%               6.47%        2.87%
32 KB    0.39%               4.82%        1.99%
64 KB    0.15%               3.77%        1.35%
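A quick check of the two average-memory-access-time results (my own sketch; variable names are not from the text):

    #include <stdio.h>

    int main(void) {
        double penalty = 50.0, i_frac = 0.75, d_frac = 0.25;
        double mr_i16 = 0.0064, mr_d16 = 0.0647, mr_u32 = 0.0199;  /* from the table */

        double split = i_frac * (1 + mr_i16 * penalty)
                     + d_frac * (1 + mr_d16 * penalty);
        /* unified: loads/stores pay 1 extra cycle for the single port */
        double unified = i_frac * (1 + mr_u32 * penalty)
                       + d_frac * (1 + 1 + mr_u32 * penalty);

        printf("split=%.2f unified=%.2f cycles\n", split, unified);  /* 2.05, 2.24 */
        return 0;
    }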
Improving Cache Performance The next few sections look at ways to improve cache and memory access times.
CPU time = IC × (CPI_Execution + Memory Accesses per Instruction × Miss Rate × Miss Penalty) × Clock Cycle Time

Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty
Does this equation make sense? Each term gets its own section: Miss Rate (Section 5.3), Miss Penalty (Section 5.4), Hit Time (Section 5.5).
5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main Memory 5.7 Virtual Memory 5.8 Protection and Examples of Virtual Memory
Classifying Misses: the 3 Cs
– Compulsory—The very first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses in even an infinite cache)
– Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative cache of size X)
– Conflict—If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X)
3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate (0.02-0.14) vs. cache size (1-128 KB) for 1-way through 8-way associativity, decomposed into conflict, capacity, and compulsory components.]
Compulsory misses are vanishingly small.
2:1 Cache Rule
[Figure: the same miss-rate decomposition as above.]
miss rate of a 1-way associative cache of size X ≈ miss rate of a 2-way associative cache of size X/2
3Cs Relative Miss Rate
[Figure: the same data normalized to 100%, cache size 1-128 KB.]
Flaw: plotted for a fixed block size. Good: insight => invention.
Reducing misses via larger block size
[Figure: miss rate (0%-25%) vs. block size (16-256 bytes) for cache sizes 1K to 256K.]
This uses the principle of locality: the larger the block, the greater the chance parts of it will be used again. But in a small cache, very large blocks mean fewer blocks and more conflicts, so the curve eventually turns back up.
Reducing misses via higher associativity – the 2:1 cache rule:
Miss rate of a direct-mapped cache of size N = miss rate of a 2-way cache of size N/2
Beware: execution time is the only final measure we believe!
– Will clock cycle time increase as a result of having a more complicated cache?
– Hill [1988] suggested the hit time for 2-way vs. 1-way is: external cache +10%, internal +2%
Example: Avg. Memory Access Time vs. Miss Rate
The time to access memory has several components: Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty. The miss penalty is 50 cycles, and higher associativity stretches the clock:

Associativity   Relative Clock Cycle Time
1-way           1.00
2-way           1.10
4-way           1.12
8-way           1.14
Victim caches: how do we combine the fast hit time of direct mapped and still avoid conflict misses?
– Add a small, fully associative buffer holding blocks recently discarded from a direct mapped data cache
– This gives associativity only to the cache lines that really need it.
Time to handle a miss is becoming more and more the controlling factor. This is because of the great improvement in speed of processors as compared to the speed of memory.
5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main Memory 5.7 Virtual Memory 5.8 Protection and Examples of Virtual Memory
Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty
Prioritization of Read Misses over Writes
– With write through, the write buffer may hold the latest value of a location needed by a main-memory read on a cache miss
– Simply waiting for the write buffer to empty can increase the read miss penalty (by 50% on the old MIPS 1000)
– Instead, check the write buffer contents on a read miss; if there are no conflicts, let the memory access continue
With write back:
– Read miss replacing a dirty block
– Normal: write the dirty block to memory, and then do the read
– Instead copy the dirty block to a write buffer, then do the read, and then do the write
– The CPU stalls less since it restarts as soon as the read is done
Sub-Block Placement for Reduced Miss Penalty
– Don't load the full block on a miss: divide each block into sub-blocks, each with its own valid bit, and fetch only the needed sub-block
[Diagram: one tag with per-sub-block valid bits.]
Early Restart and Critical Word First
– Early restart—As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
– Critical Word First—Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
– Generally useful only with large blocks; and since spatial locality means the CPU often wants the next sequential word anyway, it is not clear how much early restart benefits
Second Level Caches
Average Memory Access Time = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
So: Average Memory Access Time = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
Definitions:
– Local miss rate—misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
– Global miss rate—misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 × Miss Rate_L2)
– The global miss rate is what matters
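A short sketch of the two-level formula (the parameter values here are assumed for illustration, not from the slides):

    #include <stdio.h>

    int main(void) {
        double hit_l1 = 1, hit_l2 = 10, penalty_l2 = 100;  /* cycles (assumed) */
        double mr_l1 = 0.05, mr_l2_local = 0.40;           /* miss rates (assumed) */

        double penalty_l1 = hit_l2 + mr_l2_local * penalty_l2;
        double amat = hit_l1 + mr_l1 * penalty_l1;
        double mr_global = mr_l1 * mr_l2_local;            /* global L2 miss rate */

        printf("AMAT = %.2f cycles, global L2 miss rate = %.3f\n", amat, mr_global);
        return 0;
    }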
Comparing Local and Global Miss Rates
[Figure: miss rate vs. L2 cache size, drawn on a linear scale and on a log scale.]
– Increasing the 2nd-level cache size drives the global miss rate toward the single-level cache rate, provided L2 >> L1
– Fewer misses mean a real miss-penalty reduction
This section is about how to reduce the time to access data that IS in the cache: what techniques are useful for quickly and efficiently finding out whether data is in the cache and, if it is, for getting that data out of the cache?
5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main Memory 5.7 Virtual Memory 5.8 Protection and Examples of Virtual Memory
Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty
Small and Simple Caches
– Why small split first-level caches + a 96KB second-level cache? A small data cache supports a fast clock rate
– Keeping the cache small and simple keeps the hit time short; fast access in the cache is a significant result
Avoid Address Translation During Indexing of the Cache
Virtual cache vs. physical cache:
– Every time the process is switched, the virtual cache logically must be flushed; otherwise we get false hits
– Dealing with aliases (sometimes called synonyms): two different virtual addresses map to the same physical address
– I/O must interact with the cache, so it needs virtual addresses
Solutions:
– HW guarantees that every cache block has a unique physical address
– SW guarantee: the lower n bits must have the same address; as long as they cover the index field (and the cache is direct mapped), blocks must be unique; called page coloring
– Add a process-identifier tag that identifies the process as well as the address within the process: you can't get a hit if the process is wrong
How to do parallel operations with Virtually Addressed Caches
[Diagram: three organizations.
(a) Conventional: CPU → TB → cache → memory; translation (VA→PA) happens before the cache access.
(b) Virtually addressed cache: CPU → cache (VA) → TB → memory; translate only on a miss; synonym problem.
(c) Overlapped: the cache is accessed with the VA while the TB translates in parallel, and the cache holds PA tags; requires the cache index to remain invariant across translation; the L2 cache is physically addressed.]
Fast Cache Hits by Avoiding Translation: Process ID impact
[Figure: miss rate (up to 20%) vs. cache size (2 KB to 1024 KB). Black: uniprocess. Light gray: multiprocess, flushing the cache on a switch. Dark gray: multiprocess using a process-ID tag.]
Pipelining Writes for Fast Write Hits
– Pipeline the tag check and the cache update as separate stages: the current write's tag check overlaps the previous write's cache update
    Store r2, (r1)    Check r1    M[r1] <- r2 &
    Store r4, (r3)                Check r3 ...
– A write waiting for its cache update must be visible to later accesses: either the write completes first, or the read is served from the buffer
Summary (MR = miss rate, MP = miss penalty, HT = hit time; + helps, – hurts):

Technique                           MR   MP   HT   Complexity
Larger Block Size                   +    –         0
Higher Associativity                +         –    1
Victim Caches                       +              2
Pseudo-Associative Caches           +              2
HW Prefetching of Instr/Data        +              2
Compiler Controlled Prefetching     +              3
Compiler Reduce Misses              +              0
Priority to Read Misses                  +         1
Subblock Placement                       +    +    1
Early Restart & Critical Word 1st        +         2
Non-Blocking Caches                      +         3
Second Level Caches                      +         2
Small & Simple Caches               –         +    0
Avoiding Address Translation                  +    2
Pipelining Writes                             +    1
5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main Memory 5.7 Virtual Memory 5.8 Protection and Examples of Virtual Memory
Performance of main memory:
– Latency (determines the cache miss penalty):
  – Access time: time between when a read is requested and when the desired word arrives
  – Cycle time: minimum time between requests to memory
– Bandwidth (determines I/O throughput & the large-block miss penalty of L2)
DRAM:
– Dynamic: needs to be refreshed periodically (8 ms; ~1% of the time)
– Addresses divided into 2 halves (memory as a 2D matrix): a row address, then a column address
SRAM:
– No refresh (6 transistors/bit vs. 1 transistor/bit; area is 10X)
– Address not divided: the full address is presented at once
– Cost/cycle time: SRAM/DRAM ≈ 8-16
Memory Technology & Organization
Core Memory
DRAM logical organization (4 Mbit)
[Diagram: the address enters through an address buffer; a row decoder selects a row of the square cell array; sense amps and column selects steer Data In and Data Out.]
DRAM logical organization (4 Mbit), block detail:
[Diagram: the array is split into blocks, each with its own 9:512 block row decoder; the row address and column address select a block row, and 8 I/O lines drive the D and Q pins.]
DRAM Performance
A DRAM can:
– perform a row access only every 110 ns (tRC)
– perform a column access (tCAC) in 15 ns, but the time between column accesses is at least 35 ns (tPC)
  – in practice, external address delays and turning the bus around make it 40 to 50 ns
These times do not include the time to drive the addresses off the microprocessor nor the memory controller overhead!
Main Memory Performance
Simple:
– CPU, Cache, Bus, Memory all the same width (32 or 64 bits)
Wide:
– CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits; UltraSPARC: 512)
Interleaved:
– CPU, Cache, Bus 1 word; Memory N modules (e.g. 4 modules); example is word interleaved
[Diagram: (a) one-word-wide memory organization; (b) wide memory organization with a multiplexor between cache and CPU; (c) interleaved memory organization with memory banks 0-3 on one bus.]
Timing model:
– 1 cycle to send the address
– 6 cycles access time, 1 cycle to send a data word
– Cache block is 4 words
Simple memory: miss penalty = 4 × (1 + 6 + 1) = 32 cycles
Wide memory (4 words): miss penalty = 1 + 6 + 1 = 8 cycles
Interleaved (4 banks): miss penalty = 1 + 6 + 4 × 1 = 11 cycles
Independent Memory Banks
Memory banks for independent accesses (rather than just faster sequential accesses), as needed by:
– Multiprocessors
– I/O
– A CPU with hit-under-n-misses and a non-blocking cache
[Diagram: the bank address split into a superbank number and a bank number.]
5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main Memory 5.7 Virtual Memory 5.8 Protection and Examples of Virtual Memory
WHY VIRTUAL MEMORY?
Virtual memory permits a program's memory to be physically noncontiguous, so it can be allocated from wherever available. This avoids fragmentation and compaction.
Until now we required the entire logical address space of the process to be in memory before the process could run. We will now look at alternatives to this.
VIRTUES
– Only part of the program needs to be resident; when a missing piece is referenced - even within a finite time - we can bring it in
– Logical memory can be much larger than physical memory (hence the term 'virtual memory'). Virtual memory allows very large logical address spaces.
Definitions
Virtual memory: the conceptual separation of user logical memory from physical memory. Thus we can have a large virtual memory on a smaller physical memory.
[Diagram: a memory map sending individual pages of virtual memory to physical memory or to disk.]
Paging
Paging permits a program's memory to be physically noncontiguous so it can be allocated from wherever available. This avoids fragmentation and compaction.
HARDWARE: an address is translated by: page number (index into the page table) → base address (from the table) + offset.
– Frames = physical blocks
– Pages = logical blocks
– The size of frames/pages is defined by the hardware (a power of 2)
Paging Example: 32-byte memory with 4-byte pages

Logical memory:  0:a 1:b 2:c 3:d 4:e 5:f 6:g 7:h 8:i 9:j 10:k 11:l 12:m 13:n 14:o 15:p
Page table:      page 0 -> frame 5, page 1 -> frame 6, page 2 -> frame 1, page 3 -> frame 2
Physical memory: bytes 4-7: i j k l;  bytes 8-11: m n o p;  bytes 20-23: a b c d;  bytes 24-27: e f g h
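A sketch of the translation this example performs (page size 4; the page table is the one above):

    #include <stdio.h>

    int main(void) {
        char logical[17] = "abcdefghijklmnop";
        int page_table[4] = {5, 6, 1, 2};     /* page -> frame */
        int page_size = 4;

        for (int la = 0; la < 16; la++) {     /* la = logical address */
            int pa = page_table[la / page_size] * page_size + la % page_size;
            printf("logical %2d ('%c') -> physical %2d\n", la, logical[la], pa);
        }
        return 0;
    }

For example, logical address 0 ('a') lands at physical address 20, matching the layout above.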
IMPLEMENTATION OF THE PAGE TABLE
– A 32-bit machine can address 4 gigabytes, which is 4 million pages (at 1024 bytes/page). WHO says how big a page is, anyway?
– Could use dedicated registers (OK with small tables)
– Could use a register pointing to a table in memory (slow access)
– Cache the translations in associative memory (TLB = Translation Lookaside Buffer): the simultaneous search is fast and uses only a few registers
[Diagram: a TLB hit supplies the frame directly; a TLB miss falls through to the page table.]
IMPLEMENTATION OF THE PAGE TABLE (continued)
Issues include:
– each TLB entry is a key and value; the hit rate is 90-98% with 100 registers; add an entry if not found
Effective access time = %fast × time_fast + %slow × time_slow
Relevant times:
– 20 nanoseconds to search the associative memory (the TLB)
– 200 nanoseconds to access memory and bring the entry into the TLB for next time
Time of access:
– hit = 1 search + 1 memory reference = 20 + 200 = 220 ns
– miss = 1 search + 1 memory reference (of the page table) + 1 memory reference = 20 + 200 + 200 = 420 ns
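Plugging a hit rate from the 90-98% range into the formula (98% is an assumed value, chosen for concreteness): effective access time = 0.98 × 220 ns + 0.02 × 420 ns = 215.6 + 8.4 = 224 ns.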
SHARED PAGES
– Data occupying one physical page, but pointed to by multiple logical pages
– Useful for common code, which must be write-protected (no writable data mixed with code)
– Extremely useful for read/write communication between processes
INVERTED PAGE TABLE
One entry for each real page of memory. An entry consists of the virtual address of the page stored in that real memory location, with information about the process that owns the page. Essential when you need to do work on a physical page and must find out which virtual page (and process) it belongs to.
Use a hash table to limit the search to one - or at most a few - entries.
MULTILEVEL PAGE TABLE
A means of using page tables for large address spaces: the page table itself is paged, with an outer page table pointing at pages of page-table entries.
SEGMENTATION HARDWARE -- must map a dyad (segment / offset) into a one-dimensional address.
[Diagram: the CPU issues a logical address (S, D); S indexes the segment table to fetch (limit, base); if D < limit, the physical address base + D goes to memory; otherwise, fault.]
Basic Issues in VM System Design
– The size of the information blocks transferred between secondary storage and main storage (M)
– When a new block is brought into M and M is full, some block must be released to make room → replacement policy
– A missing item is fetched only on a fault → demand load policy
– Virtual and physical address spaces are partitioned into equal-size blocks: pages (virtual) and frames (physical)
[Diagram: the hierarchy registers → cache → memory → disk.]
Questions about Memory
– Q1: Where can a block be placed in the upper level? Fully associative, set associative, direct mapped
– Q2: How is a block found if it is in the upper level? Tag per block
– Q3: Which block should be replaced on a miss? Random, LRU
– Q4: What happens on a write? Write back or write through (with a write buffer)
[Diagram: CPU → translation → cache (hit: data; miss: main memory); the VA must become a PA before the cache lookup.]
It takes an extra memory access to translate a VA to a PA. This makes cache access very expensive, and address translation is the "innermost loop" that you want to go as fast as possible.
Techniques for Fast Address Translation
A Brief Detour: why access the cache with the PA at all? VA caches have a problem!
– Synonym/alias problem: two different virtual addresses map to the same physical address => two different cache entries holding data for the same physical address!
– On an update we must update all cache entries with the same physical address, or memory becomes inconsistent
– Determining this requires significant hardware: essentially an associative lookup on the physical address tags to see if you have multiple hits; or
– A software-enforced alias boundary: VA & PA must agree in their least significant bits, up to the cache size
Techniques for Fast Address Translation
A way to speed up translation is to use a special cache of recently used page table entries. This has many names, but the most frequently used is Translation Lookaside Buffer or TLB.
[Table: a TLB entry holds a Virtual Address, a Physical Address, and Dirty, Ref, Valid, and Access bits.]
A TLB is really just a cache on the page table mappings; TLB access time is comparable to cache access time (much less than main memory access time).
Techniques for Fast Address Translation
Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped. TLBs are usually small - typically not more than 128-256 entries even on high-end machines - which permits a fully associative lookup on those machines.
Translation with a TLB:
[Diagram: the CPU presents a VA to the TLB lookup (time t); on a hit, the PA goes to the cache (1/2 t) and, on a cache miss, to main memory; on a TLB miss, the full translation (roughly 20 t) runs before the access proceeds.]
Techniques for Fast Address Translation
Machines with TLBs go one step further to reduce cycles per cache access: they overlap the cache access with the TLB access. The high-order bits of the VA are used to look in the TLB while the low-order bits are used as the index into the cache.
Techniques for Fast Address Translation
Overlapped Cache & TLB Access
[Diagram: the 20-bit virtual page number drives an associative TLB lookup while the 12-bit displacement (10-bit cache index + 2-bit byte offset) indexes a 1K-entry cache of 4-byte blocks; the 32-bit PA from the TLB is compared with the cache tag.]
IF cache hit AND (cache tag = PA) THEN deliver data to CPU
ELSE IF [cache miss OR (cache tag != PA)] AND TLB hit THEN access memory with the PA from the TLB
ELSE do standard VA translation
Techniques for Fast Address Translation
Problems With Overlapped TLB Access
Overlapped access only works as long as the address bits used to index into the cache do not change as the result of VA translation. This usually limits things to small caches, large page sizes, or high n-way set associative caches if you want a large cache.
Example: suppose everything is the same except that the cache is increased to 8K bytes instead of 4K. The cache index now needs 11 bits (plus the 2-bit byte offset), but the page displacement is only 12 bits, so the index extends one bit into the virtual page number. This bit is changed by VA translation, but is needed for the cache lookup.
Solutions: go to 8K-byte page sizes; go to a 2-way set associative cache (1K × 4 bytes × 2 ways, keeping a 10-bit index); or SW guarantees VA[13] = PA[13].
Techniques for Fast Address Translation
The Goal: one process should not interfere with another
– Process model
– Privileges vs. policy: the hardware provides the privilege mechanism; the operating system sets the policy
5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main Memory 5.7 Virtual Memory 5.8 Protection and Examples of Virtual Memory
Protection Primitives
Issues:
– user vs. kernel mode: how do we switch to kernel mode?
– h/w to compare mode bits to access rights
Protection Primitives (continued)
– base and bounds
– segmentation
– page-level protection
Protection on the Pentium
The Pentium contains descriptor tables; each descriptor gives the base, limit, and access rights of a particular segment.
A segment register contains a selector: an index into a descriptor table, a table indicator (global vs. local), and a requested privilege level.
5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main Memory 5.7 Virtual Memory 5.8 Protection and Examples of Virtual Memory