1
EE 457 Unit 7d
Virtual Memory
2
Virtual Memory Concept
- A mechanism for hiding the details of how much physical
memory exists and how it’s being shared
- Allows the OS to
– Efficiently share the physical memory between several running programs/processes and provide protection against each other
– Remove the need for the programmer to know how much memory is physically present and/or give the illusion of more or less physical memory than is present
- Use MM as a cache for multiple programs and their data as they
run using secondary storage (hard-drive) as the home location
3
Memory Hierarchy & Caching
- Lower levels act as a cache for upper levels
[Memory hierarchy, fastest to slowest: Registers; L1 Cache ~1 ns; L2 Cache ~10 ns; Main Memory ~100 ns; Disk / Secondary Storage ~1-10 ms]
L1/L2 act as a "cache" for main memory. Virtual memory provides each process its own address space in secondary storage and uses main memory as a cache.
4
Virtual Memory Motivation
- Virtual memory is largely discussed in operating systems
courses
– We will focus on HW support for VM
- Magnetic hard drive consists of
– Double-sided surfaces/platters (with R/W head)
– Each platter is divided into concentric tracks of small sectors that each store several thousand bits
[Diagram: platter surfaces with Read/Write Heads 0-7, concentric tracks (Track 0, Track 1, …), and sectors (Sector 0, 1, 2, …)]
- Seek Time: Time needed to position the read-head above the proper track (3-12 ms)
- Rotational delay: Time needed to bring the right sector under the read-head (5-6 ms)
  – Depends on rotation speed (e.g. 5400 RPM)
- Transfer Time: 0.1 ms
- Disk Controller Overhead: 2.0 ms
- Total: ~20 ms
5
Address Spaces
- The physical address space corresponds to the actual system address range of the processor (based on the width of the address bus) and how much main memory is physically present
- Each process/program runs in its own
private "virtual" address space
– Virtual address space can be larger (or smaller) than physical memory
– Virtual address spaces are protected from each other
[Figure: a 32-bit physical address space with only 1 GB of memory (0x00000000-0x3fffffff used for memory and I/O, the rest up to 0xffffffff unused) vs. 32-bit fictitious virtual address spaces (> 1 GB of memory) for Program/Process 1,2,3,…, each containing Code, Data, Heap (above 0x10000000), Stack (below 0xc0000000), and mapped I/O regions]
6
Processes
- Process
– (def 1.) Address Space + Threads
- (Virtual) Address Space = Protected view of memory
- 1 or more threads
– (def 2.) : Running instance of a program that has limited rights
- Memory is protected: Address translation (VM) ensures no
access to any other processes' memory
- I/O is protected: Processes execute in user mode (not kernel mode), which generally means direct I/O access is disallowed; instead, processes must make system calls into the kernel
- OS Kernel is not considered a "process"
– Has access to all resources and much of its code is invoked under the execution of a user process thread
[Figure: a process's 32-bit virtual address space (0x00000000-0xffffffff) with Code, Data, Heap (above 0x10000000), Stack (below 0xc0000000), and Mapped I/O regions for Program/Process 1,2,3,…; one or more threads execute within each address space]
7
Virtual Address Spaces (VAS)
- Virtual address spaces
(VASs) are broken into blocks called "pages"
- Depending on the
program, much of the virtual address space will be unused
- Pages can be allocated
"on demand" (i.e. when the stack grows, etc.)
- All allocated pages can be
stored in secondary storage (hard drive)
[Figure: used/unused blocks in each virtual address space - pages such as 0-Code, 1-Code, 2-Data, 3-Stack, 4-Heap are allocated while much of the space is unallocated; every allocated page has a home location in secondary storage]
8
Physical Address Space (PAS)
- Physical memory is broken into
page-size blocks called "frames"
- Multiple programs can be running
and their pages can share the physical memory
- Physical memory acts as a cache
for pages with secondary storage acting as the backing store (next lower level in the hierarchy)
- A page can be:
– Unallocated (not needed yet…stack/heap)
– Allocated and residing in secondary storage (Uncached)
– Allocated and residing in main memory (Cached)
[Figure: 1 GB of physical memory (0x00000000-0x3fffffff) plus I/O and unused area within the 32-bit physical address space (up to 0xffffffff); individual frames hold currently cached pages (e.g. a process's Code pages 0 and 2 and Data page 0), while the remaining allocated pages of the fictitious virtual address spaces reside in secondary storage]
9
Paging
- Virtual address space is divided into equal
size "pages" (often around 4KB)
- Physical memory is broken into page
frames (which can hold any page of virtual memory and then be swapped for another page)
- Virtual address spaces can be contiguous
while physical layout is not
A physical frame of memory can hold data from any virtual page. Since all pages are the same size, any page can go in any frame (and be swapped as desired).
[Figure: Proc. 1's VAS (Pg. 0-3, rest unused) and Proc. 2's VAS (Pg. 0-2, rest unused) are contiguous in virtual memory, but their pages occupy scattered frames in the physical address space (0x00000000-0x3fffffff of memory, plus I/O and unused area)]
10
Virtual vs. Physical Addresses
- Key: Programs are written using virtual
addresses
- HW & the OS will translate the virtual
addresses used by the program to the physical address where that page resides
- If an attempt is made to access a page
that is not in physical memory, HW generates a "page fault exception" and the OS is invoked to bring in the page to physical memory (possibly evicting another page)
- Notice: Virtual addresses are not unique
– Each program/process has VA: 0x00000000
[Figure: the processor core issues virtual addresses; the Translation Unit / MMU (Mem. Mgmt. Unit) translates them to the physical addresses used to access memory - e.g. VA 0x040000 translates to PA 0x21b000, while another process's VA 0x100080 maps to a different frame (e.g. PA 0x11f000)]
11
Summary
- Programs take an abstract (virtual) view of memory and use virtual addresses; code and data are broken into large chunks called pages
- HW and OS work together to bring pages into
main memory acting as a cache and allowing sharing
- HW and OS work together to perform
translation between:
– Virtual address: Address used by the process (programmer)
– Physical address: Physical memory location of the desired data
- Translation allows protection against other
programs
[Figure repeated from the previous slide: the Translation Unit / MMU translates each process's virtual addresses (e.g. VA 0x040000) to physical addresses (e.g. PA 0x21b000) in main memory, with the rest of each virtual address space residing in secondary storage]
12
VM Design Implications
- SLOW secondary storage access on page faults (10 ms)
– Implies page size should be fairly large (i.e. once we've taken the time to find data on disk, make it worthwhile by accessing a reasonably large amount of data)
– Implies the placement of pages in main memory should be fully associative to reduce conflicts and maximize page hit rates
– Implies a "page fault" is going to take so much time to even access the data that we can handle it in software (via an exception) rather than using HW like typical cache misses
– Implies eviction algorithms like LRU can be used since reducing page miss rates will pay off greatly
– Implies write-back (write-through would be too expensive)
13
ADDRESS TRANSLATION
Page Tables
14
Page Size and Address Translation
- Since pages are usually retrieved from disk, we size them to be fairly large
(several KB) to amortize the large access time
- Virtual page number to physical page frame translation performed by HW
unit = MMU (Mem. Management Unit)
- Page table is an in-memory data structure that the HW MMU will use to look
up translations from VPN to PPFN
[Figure: a 32-bit virtual address = Virtual Page Number (bits 31-12) + offset within page (bits 11-0); the offset is copied unchanged while the VPN is translated by the MMU/page table into a Physical Page Frame Number that forms the upper bits of the physical address. Example: VPN 0x00040 is looked up and found to live in PPFN 0x0021b]
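To make the field split concrete, here is a minimal C sketch of the step above for 4 KB pages; the page-table lookup itself is stubbed out with the slide's example mapping (VPN 0x00040 -> PPFN 0x0021b), and the example VA is made up.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS 12u                        /* 4 KB pages -> 12-bit offset     */
#define PAGE_MASK ((1u << PAGE_BITS) - 1u)

int main(void) {
    uint32_t va  = 0x00040123;               /* example virtual address         */
    uint32_t vpn = va >> PAGE_BITS;          /* upper 20 bits = VPN = 0x00040   */
    uint32_t off = va & PAGE_MASK;           /* lower 12 bits are copied as-is  */

    uint32_t ppfn = 0x0021b;                 /* stand-in for the page-table     */
                                             /* lookup: VPN 0x00040 -> 0x0021b  */
    uint32_t pa = (ppfn << PAGE_BITS) | off; /* concatenate PPFN with offset    */

    printf("VA 0x%08x -> VPN 0x%05x + offset 0x%03x -> PA 0x%08x\n",
           va, vpn, off, pa);
    return 0;
}
```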
15
Address Translation Issues
- We want to take advantage of all the physical memory so page placement
should be fully associative
– For 1GB of physical memory, a 4KB page can be anywhere in the 256K = 2^18 page frames
- We could potentially track the contents of physical memory using similar
techniques to cache
– TAG = VPN that is currently stored in the frame
– TAG = 20 + 1 = 21 bits, 2^18 comparators…TOO MUCH
- Instead, most systems implement full associativity using a look-up table =
PAGE TABLE
[Figure: a cache-like scheme would tag each of the 2^18 page frames (Frame 0 … Frame 2^18-1) with the VPN it currently holds (plus V and M bits) and compare the incoming VPN against every tag in parallel]
16
Analogy for Page Tables
- Suppose we want to build a caller-ID mechanism for your
contacts on your cell phone
– Let us assume 1000 contacts represented by a 3-digit integer (0-999) in the cell phone (this ID can be used to look up their names)
– We want to use a simple array (or Look-Up Table (LUT)) to translate phone numbers to contact IDs. How shall we organize/index our LUT?
[Figure: three ways to organize the lookup]
- LUT indexed w/ contact ID (000-999): O(n) - Doesn't work. We are given a phone # and need to translate it to an ID, so we may have to scan all 1000 entries.
- Sorted LUT indexed w/ used phone #'s: O(log n) - Could work. Since it's in sorted order we can use a binary search (log2(1000) accesses).
- LUT indexed w/ all possible phone #'s: O(1) - Could work. Easy to index and find (1 access) but LARGE.
17
Page Tables
- VA is broken into:
– VPN (upper bits) + Page offset: Based on page size (i.e. 0 to 4K-1 for 4KB page)
- MMU uses VPN & PTBR to access the page table in memory and lookup physical frame
(i.e. like an array access where VPN is the index: PTBR[VPN] )
– Each entry is referred to as a Page Table Entry (PTE) and holds the physical frame number & bookkeeping info
- Physical frame is combined with offset to form physical address
- For 20-bit VPN, how big is the page table? (See below)
[Figure: the 20-bit VPN of the virtual address indexes the page table in main memory, starting at the PTBR (Page Table Base Reg.); the selected PTE supplies the Page Frame Number plus other flags, which is concatenated with the offset w/in page to form the physical address]
Page Table Size = 2^20 entries * ~18 bits ≈ 2^20 * 4 bytes = 4 MB
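A hedged C sketch of this single-level lookup: the page table is modeled as a flat array of PTEs that the PTBR would point to, and the entry layout, field widths, and names are illustrative rather than any particular machine's format.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_BITS 12u
#define NUM_VPNS  (1u << 20)            /* 20-bit VPN -> 2^20 entries          */

typedef struct {
    uint32_t frame : 18;                /* physical frame number (1 GB / 4 KB) */
    uint32_t valid : 1;                 /* page present in main memory?        */
    uint32_t dirty : 1;
} pte_t;

/* The table the PTBR would point to: ~4 bytes per entry = 4 MB total. */
static pte_t page_table[NUM_VPNS];

/* Returns true and fills *pa on success; false means a page fault. */
bool translate(uint32_t va, uint32_t *pa) {
    uint32_t vpn = va >> PAGE_BITS;
    uint32_t off = va & ((1u << PAGE_BITS) - 1u);
    pte_t pte = page_table[vpn];        /* the PTBR[VPN] "array" access        */
    if (!pte.valid)
        return false;                   /* OS page-fault handler would run     */
    *pa = ((uint32_t)pte.frame << PAGE_BITS) | off;
    return true;
}

int main(void) {
    page_table[0x00040] = (pte_t){ .frame = 0x0021b, .valid = 1 };
    uint32_t pa;
    if (translate(0x00040123, &pa))
        printf("PA = 0x%08x\n", pa);    /* prints PA = 0x0021b123              */
    return 0;
}
```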
18
Page Table Example
- Suppose a system with 8-bit VAs, 10-bit PAs, and 32-byte pages.
VA = 3-bit VPN (bits 7-5) + 5-bit offset (bits 4-0); PA = 5-bit PFN (bits 9-5) + 5-bit offset (bits 4-0)

Page Table for P1 (OS owns), indexed by VPN (V = valid, Entry = PFN):
VPN 0: V=1, 0x1a
VPN 1: V=1, 0x02
VPN 2: V=1, 0x18
VPN 3: V=1, 0x00
VPN 4: V=0, 0x10
VPN 5: V=1, 0x1F
VPN 6: V=0, 0x15
VPN 7: V=0, 0x0A

[Figure: physical memory showing, e.g., VP 3 cached in frame 0x00 (PA 0x000-0x01F), VP 1 in frame 0x02 (PA 0x040-0x05F), and VP 5 in frame 0x1F (PA 0x3E0-0x3FF)]
19
Page Table Exercise
- Suppose a system with 8-bit VAs, 10-bit
PAs, and 32-byte pages.
- Fill in the table below showing the
corresponding physical or virtual address based on the other. If no translation can be made, indicate "INVALID"
Page Table, indexed by VPN (V = valid, Entry = PFN):
VPN 0: V=1, 0x0E
VPN 1: V=1, 0x1E
VPN 2: V=1, 0x16
VPN 3: V=1, 0x06
VPN 4: V=0, 0x0B
VPN 5: V=1, 0x1F
VPN 6: V=0, 0x15
VPN 7: V=0, 0x0A

VA                  PA
0x2D = 0010 1101    0x3CD
0x7A = 0111 1010    0x0DA
0xEF = 1110 1111    INVALID
0xA8 = 1010 1000    0x3E8
VA = 3-bit VPN (bits 7-5) + 5-bit offset (bits 4-0); PA = 5-bit PFN (bits 9-5) + 5-bit offset (bits 4-0)
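A small C sketch that reproduces the exercise's answers by applying the 3-bit VPN / 5-bit offset split to the table above; the valid bit assumed for VPN 0 does not affect the four translations checked.

```c
#include <stdint.h>
#include <stdio.h>

/* Exercise page table, indexed by VPN. */
typedef struct { int valid; uint16_t pfn; } pte_t;

static const pte_t pt[8] = {
    {1, 0x0E}, {1, 0x1E}, {1, 0x16}, {1, 0x06},
    {0, 0x0B}, {1, 0x1F}, {0, 0x15}, {0, 0x0A},
};

int main(void) {
    const uint8_t vas[] = {0x2D, 0x7A, 0xEF, 0xA8};
    for (unsigned i = 0; i < sizeof vas; i++) {
        uint8_t va  = vas[i];
        uint8_t vpn = va >> 5;                     /* top 3 bits of the 8-bit VA  */
        uint8_t off = va & 0x1F;                   /* 5-bit offset (32-byte page) */
        if (!pt[vpn].valid)
            printf("VA 0x%02X -> INVALID (page fault)\n", va);
        else
            printf("VA 0x%02X -> PA 0x%03X\n", va,
                   (unsigned)((pt[vpn].pfn << 5) | off));
    }
    return 0;
}
```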
20
Analogy for Page Tables
- Can we use the table indexed using all possible phone numbers (because it only
requires 1 access) but somehow reduce the size especially since much of it is unused?
- Do you have friends from every area code? Likely contacts are clustered in only a few
area codes.
- Use a 2-level organization
– 1st level LUT is indexed on area code and contains pointers to 2nd level tables
– 2nd level LUT's are indexed on the local phone number and contain contact ID entries
[Figure: the 1st level table, indexed by area code (000-999), holds mostly-null pointers to 2nd level tables (e.g. a 213 Table and a 323 Table), each indexed by the 7-digit local number]
If only 2 area codes are used, then only 1000 + 2*(10^7) entries are needed rather than 10^10 entries
21
Analogy for Page Tables
- Could extend to 3 levels if desired
– 1st Level = Area code and pointers to 2nd level tables
– 2nd Level = First 3 digits of the local phone number and pointers to 3rd level tables
– 3rd Level = Contact ID's
[Figure: 1st level indexed by area code; 2nd level tables (e.g. 213 Table, 323 Table) indexed by the first 3 digits of the local number; 3rd level tables (e.g. 213-745 Table, 323-823 Table) indexed by the last 4 digits and holding the contact IDs (e.g. for 213-745-9823 and 323-823-7104)]
22
Analogy for Page Tables
- If we add a friend from area code 408 we would have to add a second and
third level table for just this entry
[Figure: the same multi-level tables as on the previous slide; adding the 408 contact would require a new 2nd level table and a new 3rd level table just for that one entry]
23
Multi-level Page Tables
- VPN is broken into fields to index each level of the page table
- Think of a multi-level page table as a tree
– Internal nodes contain pointers to other page tables
– Leaves hold the actual translations
[Figure: the 20-bit VPN is split into Idx1 (7 bits), Idx2 (7 bits), and Idx3 (6 bits) plus a 12-bit offset; the PDBR/CR3 register points to the Level 1 page directory (e.g. at 0xd0000), whose entry [Idx1 = 0x40] points to a Level 2 table, whose entry [Idx2 = 0x3f] points to a Level 3 table, whose entry [Idx3 = 0x35] holds the physical frame address. Translations live in the last level]
- Unused entries in one level mean no table at the next (saving space)
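A hedged C sketch of the three-level walk drawn above, using the 7-, 7-, and 6-bit indices from the slide; the in-memory node structures and names are illustrative, not a real architecture's format.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_BITS 12u

typedef struct { uint32_t frame; bool valid; } pte_t;

typedef struct { pte_t entries[1u << 6]; } l3_t;   /* 6-bit Idx3 -> PTE (leaf)    */
typedef struct { l3_t *next[1u << 7];    } l2_t;   /* 7-bit Idx2 -> level-3 table */
typedef struct { l2_t *next[1u << 7];    } l1_t;   /* 7-bit Idx1 (page directory) */

/* Walk the 3-level table from the PDBR; false = no translation (page fault). */
bool walk(const l1_t *pdbr, uint32_t va, uint32_t *pa) {
    uint32_t off  = va & ((1u << PAGE_BITS) - 1u);
    uint32_t vpn  = va >> PAGE_BITS;
    uint32_t idx3 = vpn & 0x3F;                    /* low  6 bits of the VPN      */
    uint32_t idx2 = (vpn >> 6) & 0x7F;             /* next 7 bits                 */
    uint32_t idx1 = (vpn >> 13) & 0x7F;            /* top  7 bits                 */

    const l2_t *l2 = pdbr->next[idx1];             /* unused entry -> no level-2  */
    if (l2 == NULL) return false;
    const l3_t *l3 = l2->next[idx2];               /* unused entry -> no level-3  */
    if (l3 == NULL) return false;
    pte_t pte = l3->entries[idx3];                 /* leaf holds the translation  */
    if (!pte.valid) return false;

    *pa = (pte.frame << PAGE_BITS) | off;
    return true;
}
```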
24
Page Faults
[Figure: a two-level page table with a 10-bit Level Index 1 (bits 31-22), a 10-bit Level Index 2 (bits 21-12), and a 12-bit offset w/in page; Level 1 entries hold pointers to the start of 2nd level tables, whose entries hold PPFN's. Some entries refer to uncached pages that are not currently in any physical frame]
When HW encounters a PTE whose page is not in physical memory, it will generate a page fault exception and the OS will take over and retrieve the page before resuming the program.
25
SPARC VM Implementation
[Figure: SPARC virtual address = Index 1 (8 bits) + Index 2 (6 bits) + Index 3 (6 bits) + offset w/in a 4K page. The MMU holds a 4096-entry Context Table (one entry per context/process - essentially a PTBR for each process); it points to the First Level table (2^8 * 4 bytes), which points to a Second Level table (2^6 * 4 bytes), which points to a Third Level table (2^6 * 4 bytes), whose entry gives the PPFN of the 4K page holding the desired word]
How many accesses to memory does it take to get the desired word that corresponds to the given virtual address? Would that change for a 1- or 2- level table?
26
Page Table Entries
- Valid bit (1 = desired page in memory / 0 = page not present / page fault)
- Referenced = To implement pseudo-LRU replacement
- Protection: Read/Write/eXecute
[PTE fields: Page Frame Number | Valid/Present | Modified/Dirty | Referenced | Protection | Cacheable]
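One possible way to pack these fields into a 32-bit PTE, shown as a C bit-field struct; the exact widths and ordering here are illustrative, not a specific architecture's layout.

```c
#include <stdint.h>

/* Illustrative 32-bit PTE layout. */
typedef struct {
    uint32_t frame      : 20;  /* physical page frame number                   */
    uint32_t valid      : 1;   /* 1 = page in memory, 0 = not present (fault)  */
    uint32_t dirty      : 1;   /* modified: must be written back when evicted  */
    uint32_t referenced : 1;   /* set by HW on use; enables pseudo-LRU in SW   */
    uint32_t read       : 1;   /* protection bits: Read/Write/eXecute          */
    uint32_t write      : 1;
    uint32_t execute    : 1;
    uint32_t cacheable  : 1;   /* may the page's data be cached?               */
    uint32_t unused     : 5;
} pte_t;
```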
27
Page Fault Steps
- HW will…
– Record the offending address, the EPC, and cause (page fault)
- SW will…
– Pick an empty frame or select a page to evict
– Writeback the evicted page if it has been modified
- May block process while waiting and yield processor
– Bring in the desired page and update the page table
- May block process while waiting and yield processor
– Restart the offending instruction
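A pseudocode-style C sketch of the SW steps above; every helper function here is a hypothetical placeholder for real OS/disk-driver code, not an actual kernel API.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS 12u

/* Hypothetical helpers standing in for real OS / disk-driver code. */
int  find_free_frame(void);                 /* < 0 if no frame is free            */
int  choose_victim_frame(void);             /* e.g. a pseudo-LRU choice           */
bool frame_is_dirty(int frame);
void write_back_to_disk(int frame);         /* may block; process yields the CPU  */
void invalidate_mapping_for(int frame);     /* clear the old PTE (and TLB entry)  */
void read_page_from_disk(uint32_t vpn, int frame);
void set_pte(uint32_t vpn, int frame, int valid);

void page_fault_handler(uint32_t bad_va) {  /* offending address recorded by HW   */
    uint32_t vpn   = bad_va >> PAGE_BITS;
    int      frame = find_free_frame();

    if (frame < 0) {                        /* no empty frame: evict a page       */
        frame = choose_victim_frame();
        if (frame_is_dirty(frame))
            write_back_to_disk(frame);      /* writeback only if modified         */
        invalidate_mapping_for(frame);
    }

    read_page_from_disk(vpn, frame);        /* may block; process yields the CPU  */
    set_pte(vpn, frame, 1);                 /* update the page table              */
    /* return from the exception; HW restarts the offending instruction */
}
```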
28
Page Replacement Policies
- Possible algorithms: LRU, FIFO, Random
- Since page misses are so costly (slow) we can afford to spend some time keeping statistics to implement LRU
- Implementing exact LRU would require updating statistics every access (using
some kind of timestamp). This is too much to do in HW and we don’t want to use SW when we have hits
- HW will implement simple mechanism that allows SW to implement a
pseudo-LRU algorithm
– HW will set the "Referenced" bit when a page is used
– At certain intervals, SW will use these reference bits to keep statistics on which pages have been used in that interval and then clear the reference bits
– On replacement, these statistics can be used to find the pseudo-LRU page
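A sketch of one common software scheme (aging) built on the hardware-set Referenced bits; the number of frames and the 8-bit age counters are illustrative choices, not the course's prescribed algorithm.

```c
#include <stdint.h>

#define NUM_FRAMES 1024u   /* illustrative number of resident frames */

/* ref[i] models the HW-set Referenced bit for frame i; SW keeps an
 * 8-bit "age" per frame whose high bits reflect recent use.          */
static uint8_t ref[NUM_FRAMES];
static uint8_t age[NUM_FRAMES];

/* Run periodically (e.g. on a timer tick): shift in each reference
 * bit, then clear it for the next interval.                          */
void sample_reference_bits(void) {
    for (unsigned i = 0; i < NUM_FRAMES; i++) {
        age[i] = (uint8_t)((age[i] >> 1) | (uint8_t)(ref[i] << 7));
        ref[i] = 0;
    }
}

/* On replacement: the frame with the smallest age was used least
 * recently in recent intervals (pseudo-LRU).                         */
unsigned choose_victim_frame(void) {
    unsigned victim = 0;
    for (unsigned i = 1; i < NUM_FRAMES; i++)
        if (age[i] < age[victim])
            victim = i;
    return victim;
}
```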
29
Cache & VM Comparison
                               Cache                                          Virtual Memory
Block Size                     16-64 B                                        4 KB - 64 MB
Mapping Schemes                Direct or Set Associative                      Fully Associative
Miss handling and replacement  HW                                             SW
Replacement Policy             Full LRU if low associativity / Random also    Pseudo-LRU can be implemented
30
Performance Issues
- Let cache hits = 10ns, memory accesses=100ns
- Assume a program makes an access to data located in cache…
– Without VM, only requires 10ns cache access time
– With VM, the address must first be translated via the page table (recall the page table is in memory)
- If single-level, one access to the page table (MM) = 100ns
- If two levels, two accesses to the page tables = 200ns
- If three levels, three accesses to the page tables = 300ns
- Finally, the physical address can access the cache = 10 ns (if hit)
- Total time equals 100*L + 10 ns (where L = # of levels of page table)
- Translation is extremely costly as currently implemented!!!
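The formula above as a tiny C calculation (assuming the slide's 10 ns cache hit and 100 ns memory access):

```c
#include <stdio.h>

/* Cache hit = 10 ns; each page-table access is one memory access = 100 ns. */
static int access_ns(int levels) { return 100 * levels + 10; }

int main(void) {
    for (int l = 1; l <= 3; l++)
        printf("%d-level page table: %d ns per access (cache hit)\n",
               l, access_ns(l));
    return 0;
}
```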
31
Translation Unit / MMU
Translation Lookaside Buffer (TLB)
- Solution: Let’s create a cache for translations = Translation
Lookaside Buffer (TLB)
- Needs to be small (64-128 entries) so it can be fast, with a high degree of associativity (at least 4-way and many times fully associative) to avoid conflicts
– On a hit, the PPFN is produced and concatenated with the offset
– On a miss, a page table walk is needed
[Figure: the CPU issues a VA; the VPN is looked up in the TLB (~10 ns). On a TLB hit the PPFN forms the PA, which accesses the I- or D-cache (~10 ns); on a TLB miss the page table in memory is walked, and on a cache miss main memory is accessed]
32
TLB + Data Cache
[Figure: the virtual address (Virtual Page Number + offset w/in page, split at bit 12) is looked up in the TLB (fully associative, direct, or set-associative; each entry stores a VPN tag plus V and M bits); the resulting Page Frame # concatenated with the offset forms the physical address, whose tag/index/byte-offset fields then access the data cache (also fully associative, direct, or set-associative)]
If the data cache is tagged with physical addresses, then we must translate the VA before we can access the data cache.
The TLB only has a few entries, so we need to store tags.
33
Differences of TLB & Data Cache
- Data cache
– 1 tag (to identify the block) corresponds to MANY bytes (16-64 bytes)
- TLB
– 1 tag (VPN) corresponds to 1 translation (PTE/PPFN) which is good for 4KB
- Main Point: TLBs are smaller than normal data caches and faster
to access
TLB
(1 tag = 1 translation. No Offset needed)
Instruc./Data Cache
(Offset needed since one tag corresponds to many values)
34
TLB Block Size
- A block in cache may be
– 1 word
– 2 words
– 4 words
- Consider a direct-mapped cache mapping: can the word field be 0 bits?
- But an entry in the TLB is 1 value
– Log2(1) = 0…TLB mappings have no word field
[Cache address fields: Tag (18 bits) | Block (10 bits) | Word (2 bits) | Byte Offset (2 bits)]
35
A 4-Way Set Associative TLB
- 64 entry 4-way SA TLB (set field indexes each “way”)
– On hit, page frame # supplied quickly w/o page table access
[Figure: the 20-bit VPN splits into a 16-bit Tag and a 4-bit Set field; the Set field indexes one of the 16 sets, and the Tag is compared against the stored tags in all 4 ways (Way 0-3); on a hit the matching way's Page Frame # is concatenated with the 12-bit offset w/in page to form the physical address]
What is the page size? 4 KB. Tag size? 16 bits. Comparator width? 17 bits (16-bit tag + V)
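A C sketch of the lookup for this 64-entry, 4-way configuration (16 sets, 16-bit tags, 4 KB pages); the data structure and names are illustrative.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS 12u                       /* 4 KB pages                      */
#define NUM_SETS  16u                       /* 64 entries / 4 ways             */
#define NUM_WAYS  4u

typedef struct { uint32_t tag; uint32_t pfn; bool valid; } tlb_entry_t;

static tlb_entry_t tlb[NUM_SETS][NUM_WAYS];

/* Index one set with the low 4 VPN bits, then compare the 16-bit tag in
 * all 4 ways (done in parallel in HW); a hit supplies the page frame #. */
bool tlb_lookup(uint32_t va, uint32_t *pa) {
    uint32_t off = va & ((1u << PAGE_BITS) - 1u);
    uint32_t vpn = va >> PAGE_BITS;
    uint32_t set = vpn & (NUM_SETS - 1u);   /* 4-bit set field                 */
    uint32_t tag = vpn >> 4;                /* remaining 16 VPN bits = tag     */

    for (unsigned way = 0; way < NUM_WAYS; way++)
        if (tlb[set][way].valid && tlb[set][way].tag == tag) {
            *pa = (tlb[set][way].pfn << PAGE_BITS) | off;
            return true;                    /* TLB hit                         */
        }
    return false;                           /* TLB miss -> page table walk     */
}
```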
36
Page Fault Steps
- On page fault, handler will access
disk to evict old page (if dirty) and to bring in desired page
– Likely context switch on each access since disk is slow
- Make sure PT & TLB are updated
appropriately
[Figure: page fault flow - (1) the CPU issues a VA and misses in the TLB, (2) the page table walk finds the PTE invalid / not present, so the OS exception (page fault) handler runs, (3) it evicts (writes back) a page if no frame is free (updating the PT & TLB), (4) the disk driver (interrupt-driven) brings in the needed page and the PT is updated, (5) the faulting instruction is restarted, (6) the TLB misses again, the PT walk now succeeds, and the TLB is updated]
37
Virtual Memory System Examples
Microprocessor             AMD Opteron                P4                PPC 7447a
Virtual Address            48-bit                     32- or 48-bit     52-bit
Physical Address           40-bit                     36-bit            32- or 36-bit
TLB Entries (I/D/L2 TLB)   L1: 40/40  L2: 512/512     L1: 128/128       L1: 128/128
TLB Mapping                L1: Fully  L2: 4-way SA    Fully (? 4-way)   2-way set associative
Min. Page Size             4 KB                       4 KB              4 KB

Notes: Large VA's include ASID (process ID's) and other segment information
Sources: H&P, "CO&D", 3rd ed., Freescale.com
38
TLB Issues
- Because of the high degree of associativity and the (usually) limited working set of pages, we can get VERY HIGH hit rates for the TLB
– Variable page size settable by OS to allow for different working set sizes
– Example: 64 TLB entries and 4 KB pages cover 256 KB
- Often times, separate TLB’s for instruction and data address
translation
39
TLB Miss Process
- On a TLB miss, there is some division of work between the hardware (MMU) and
OS
- Option 1
– MMU can perform the TLB search followed by a page table walk if needed
– If a page fault occurs, OS takes over to bring in the page
- Option 2
– MMU performs the TLB search
– If the TLB misses, OS can perform the page table walk and bring in the page if necessary
- When we want to remove a page from MM
– First flush out blocks belonging to that page from the cache (writing back if necessary)
– Invalidate the tags of those blocks
– Invalidate the TLB entry (if any) corresponding to that page
- If D=1, set dirty bit in page table
– If the page is dirty, copy the page back to the disk
– Simple way to remember this…
- If parents (page) leave a party then the children (cache blocks & TLB entries) must leave too
40
Other Issues
- Property of Inclusion
– Cache contents are a (subset / superset) of main memory contents
– Main memory contents are a (subset / superset) of the page/swap file on disk
– TLB contents are a (subset / superset) of page table contents
41
Cache, VM, and Main Memory
TLB    VM     Cache   Possible Y/N & Description
Hit    Hit    Hit     Possible: best-case scenario
Hit    Hit    Miss    Possible: TLB hits (hit in VM is implied), then cache miss
Miss   Hit    Hit     Possible: TLB misses, then hits in page table, then cache hit
Miss   Hit    Miss    Possible: TLB misses, then hits in page table, then cache miss
Miss   Miss   Miss    Possible: TLB misses, then page fault, then miss in cache
Hit    Miss   Miss    Impossible: cannot hit in TLB if page not present
Hit    Miss   Hit     Impossible: cannot hit in TLB if page not present
Miss   Miss   Hit     Impossible: data cannot be in cache if page not present
Taken from H & P, “Computer Organization” 3rd, Ed.
42
Cache Addressing with VM
- Cache review
– Set or block field indexes a LUT holding tags
– 2 steps to determine hit:
- Index (lookup) to find tags (using block or set bits)
- Compare tags to determine hit
- Sequential connection between indexing and tag comparison
- Rather than waiting for address translation and then
performing this two step hit process, can we overlap the translation and portions of the hit sequence?
– Yes if we choose page size, block size, and set/direct mapping carefully
43
Cache Index/Tag Options
- Physically indexed, physically tagged (PIPT)
– Wait for full address translation
– Then use the physical address for both indexing and tag comparison
- Virtually indexed, physically tagged (VIPT)
– Use a portion of the virtual address for indexing, then wait for address translation and use the physical address for tag comparisons
– Easiest when the index portion of the virtual address lies w/in the offset (page size) address bits; otherwise aliasing may occur
- Virtually indexed, virtually tagged (VIVT)
– Use the virtual address for both indexing and tagging…no TLB access unless the cache misses
– Requires invalidation of cache lines on a context switch or use of a process ID as part of the tags
[Figure: PIPT takes both the Set/Blk index and the Tag from the physical address; VIPT takes the Set/Blk index from the virtual address (ideally from its page-offset bits) and the Tag from the translated physical address; VIVT takes both from the virtual address]
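The VIPT condition can be stated as a simple size check: every index and block-offset bit falls inside the page offset exactly when (cache size / associativity) is no larger than the page size. A small C sketch of that check; the example cache configurations are made up.

```c
#include <stdio.h>
#include <stdbool.h>

/* Alias-free VIPT: all index/offset bits lie within the page offset,
 * i.e. (cache size / number of ways) <= page size.                   */
bool vipt_alias_free(unsigned cache_bytes, unsigned ways, unsigned page_bytes) {
    return cache_bytes / ways <= page_bytes;
}

int main(void) {
    /* 32 KB, 8-way cache with 4 KB pages: 32K/8 = 4K <= 4K -> OK      */
    printf("32KB 8-way: %s\n",
           vipt_alias_free(32 * 1024, 8, 4096) ? "alias-free" : "may alias");
    /* 64 KB, 4-way cache with 4 KB pages: 16K > 4K -> index bits
     * extend above the page offset, so aliasing is possible           */
    printf("64KB 4-way: %s\n",
           vipt_alias_free(64 * 1024, 4, 4096) ? "alias-free" : "may alias");
    return 0;
}
```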
44
Multiple Processes
- Recall each process has its own virtual address space, page
table, and translations
- How does the TLB handle a context switch?
– Can choose to only hold translations for the current process and thus invalidate all entries on a context switch
– Can hold translations for multiple processes concurrently by concatenating a process or address space ID (PID or ASID) to the VPN tag
[Figure: the TLB tag is extended with an ASID (a unique ID for each process) alongside the VPN, so entries from multiple processes can coexist; a hit requires both the ASID and the VPN tag to match]
45
Shared Memory
- In current system, all memory is
private to each process
- To share memory between two
processes, the OS can allocate an entry in each process’ page table to point to the same physical page
- Can use different protection
bits for each page table entry (e.g. P1 can be R/W while P2 can be read only)
[Figure: pages of P1's and P2's virtual address spaces map to physical memory, with one physical page mapped into both address spaces]
46
8-Stage Pipeline
- In 5-stage pipeline, we place the I-Cache and
D-Cache in one stage each
- When we add VM translation (i.e. TLB), cache
tag check, and data access we need more stages to balance the delay
[Figure: the single memory stage (D-Cache) of the 5-stage pipeline expands into three stages - D-TLB access, cache data & cache tag lookup, and tag check - when TLB and cache access are balanced]
47
8-Stage Pipeline
- MIPS R4000 uses pipelined instruction and data caches
- The instruction is available at the end of the IS stage but the
tag check is done in RF
– This is harmless since it’s fine to start the decode and read registers even if it’s an invalid tag so long as we know by the end of the clock
- For the data cache we must perform the tag check before we
can start writing into the register file
[Figure: MIPS R4000 8-stage pipeline - IF, IS, RF, EX, DF, DS, TC, WB; the instruction TLB and cache data/tag access occur in IF/IS with the tag check in RF, and the data TLB and cache data/tag access occur in DF/DS with the tag check in TC]
48
Load Dependency Issue Latency
- 8-stage pipeline has a two-cycle load delay
– Possible since the data is available at the end of DS and can be forwarded
– If the tag check indicates a miss, the pipeline can be "backed up" (re-execute the instruction when the data is available)
[Figure: pipeline diagram for LW $5,0($2) followed by ADD $1,$1,$5, showing the two-cycle load delay before the loaded value can be forwarded to the ADD]
49
A Complete VM / Cache Example
- Use the following specification for the following questions
– 64-bit data, 32-bit virtual/physical address
– Page Size: 128KB
– TLB Size: 256 entry 4-way set associative
– Page Table Org.: 3 levels
- A 64 entry A-Table (page directory) followed by several 32 entry B-Tables
(2nd level tables) followed by some number of C-Tables (3rd level)
– Cache Organization
- Cache Size: 512KB
- 8-way set associative
- Block size: 2 words [Word = 64-bits = 8 bytes]
50
Address Bus and Interleaving
- Use the following specification for the following questions
– 64-bit data, 32-bit virtual/physical address
– Cache Organization: Block size: 2 words [1 Word = 64-bits = 8 bytes]
- How many banks would you suggest for interleaving purposes?
2 Banks so we can quickly get two words to the data cache when a block is transferred
[Figure: the processor's 64-bit data bus is 8 bytes wide and uses 8 byte enables (/BE7…/BE0); address bits A31-A3 select an 8-byte word, with A3 choosing between Bank 0 (A3=0) and Bank 1 (A3=1) so the two words of a block come from different banks]
51
TLB Mapping
- Use the following specification for the following questions
– 64-bit data, 32-bit virtual/physical address
– Page Size: 128KB
– TLB Size: 256 entry 4-way set associative
Page Size = 128 KB => Offset field = 17 bits
# of TLB Sets = 256 entries / 4 entries per set = 64 sets => 6 set bits
Tag = 32 - 17 - 6 = 9 bits
[Address fields: Tag (bits 31-23) | Set (bits 22-17) | Offset w/in page (bits 16-0)]
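The same field widths, computed in a small C sketch (the parameter names are illustrative):

```c
#include <stdio.h>

/* log2 of a power of two */
static unsigned lg2(unsigned long long x) {
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void) {
    unsigned va_bits     = 32;
    unsigned page_bytes  = 128 * 1024;                        /* 128 KB pages   */
    unsigned tlb_entries = 256, ways = 4;

    unsigned offset_bits = lg2(page_bytes);                   /* 17             */
    unsigned set_bits    = lg2(tlb_entries / ways);           /*  6 (64 sets)   */
    unsigned tag_bits    = va_bits - offset_bits - set_bits;  /*  9             */

    printf("offset = %u, set = %u, tag = %u bits\n",
           offset_bits, set_bits, tag_bits);
    return 0;
}
```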
52
Page Table Mapping
- Use the following specification for the following questions
– Page Size: 128KB
– Page Table Org.: 3 levels
- A 64-entry A-Table (page directory), followed by several 32-entry B-Tables (2nd level tables), followed by some number of C-Tables (3rd level)
Offset = 17 bits, so VPN = 32 - 17 = 15 bits
Level 1 Page Table = 64 (2^6) entries => 6 index bits
Level 2 Page Table = 32 (2^5) entries => 5 index bits
Level 3 Page Table: 15 - 6 - 5 = 4 index bits => 16 (2^4) entries
[Virtual address fields: Level 1 index (6 bits) | Level 2 index (5 bits) | Level 3 index (4 bits) | Offset w/in page (17 bits)]
53
Data Cache Design
- Use the following specification for the following questions
– Cache Organization
- Cache Size: 512KB
- 8-way set associative
- Block size: 2 words [Word = 64-bits = 8 bytes]
64-bit data = 8 (2^3) bytes per word => 3 byte-offset bits
Block size = 2 (2^1) words => 1 word bit
# of cache blocks = 512 KB / 16 bytes per block = 2^19 / 2^4 = 2^15
# of sets = 2^15 / 2^3 ways per set = 2^12 sets => 12 set bits
# of tag bits = 32 - 12 - 1 - 3 = 16 bits
[Physical address fields: Tag (16 bits) | Set (12 bits) | Word (1 bit) | Byte (3 bits)]
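And the cache field widths, computed the same way (again, the names are illustrative):

```c
#include <stdio.h>

static unsigned lg2(unsigned long long x) {
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void) {
    unsigned addr_bits   = 32;
    unsigned cache_bytes = 512 * 1024;                      /* 512 KB           */
    unsigned ways        = 8;
    unsigned word_bytes  = 8;                               /* 64-bit words     */
    unsigned block_words = 2;

    unsigned block_bytes = block_words * word_bytes;        /* 16 B per block   */
    unsigned sets        = cache_bytes / block_bytes / ways;
    unsigned byte_bits   = lg2(word_bytes);                 /* 3                */
    unsigned word_bits   = lg2(block_words);                /* 1                */
    unsigned set_bits    = lg2(sets);                       /* 12 (4K sets)     */
    unsigned tag_bits    = addr_bits - set_bits - word_bits - byte_bits; /* 16  */

    printf("sets = %u, byte = %u, word = %u, set = %u, tag = %u bits\n",
           sets, byte_bits, word_bits, set_bits, tag_bits);
    return 0;
}
```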
54
Data Cache Implementation
- How many comparators and of what size are needed to
determine cache hit or miss?
- What is the size of the TAG RAM’s?
8 comparators of 17 bits
Tag RAM size = 4K x 17
[Figure: the 12-bit set field (A15-A4) indexes each of the 8 Tag RAMs (4K x 17: the 16-bit tag A31-A16 plus a V bit); each Tag RAM output DO[16:0] feeds a 17-bit comparator against the incoming tag, and the 8 comparator outputs determine hit/miss and select the way]