Virtualizing Memory: Faster with TLBs

9/23/16

UNIVERSITY of WISCONSIN-MADISON Computer Sciences Department
CS 537 Introduction to Operating Systems
Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau

Questions answered in this lecture:

  • Review paging
  • How can page translations be made faster?
  • What is the basic idea of a TLB (Translation Lookaside Buffer)?
  • What types of workloads perform well with TLBs?
  • How do TLBs interact with context switches?

Announcements

  • P1: Due tomorrow at 6pm
  • Create a README file in your p1 directory: briefly describe what you did (especially if you ran into problems and did not implement something). Most important, at the top, state the authorship of the project.

  • Late handin directory for unusual circumstances + communicate
  • Project 2: Available by Monday; will announce
  • Due three weeks from tomorrow
  • Can work with project partner on PART 2 in your discussion section (unofficial)
  • Two parts:
  • Linux: Shell -- fork() and exec(), job control
  • Xv6: Scheduler – simplistic MLFQ with graph
  • Two discussion videos again; watch early and often!
  • Communicate with your project partner!
  • Form on course web page if you would like project partner assigned
  • Exam 1: No conflicts, no alternate exam time
  • Reading for today: Chapter 19

Review: Paging

Assume 4 KB pages.

[Figure: virtual address spaces of P1 and P2 mapped onto physical memory from 0x0000 to 0x7000. P1's pagetable maps VPNs 0, 1, 2 to frames 1, 5, 4; P2's pagetable maps VPNs 0, 1, 2 to frames 6, 2, 3. An example trace shows each virtual load (e.g., load 0x1444) first requiring a load of the page table entry, then the load of the translated physical address (load 0x5444).]

What do we need to know?

  • Location of page table in memory (ptbr)
  • Size of each page table entry (assume 8 bytes)

Review: Paging PROS and CONS

Advantages

  • No external fragmentation
  • don’t need to find contiguous RAM
  • All free pages are equivalent
  • Easy to manage, allocate, and free pages

Disadvantages

  • Page tables are too big
  • Must have one entry for every page of address space
  • Accessing page tables is too slow [today’s focus]
  • Doubles number of memory references per instruction

Translation Steps

H/W: for each memory reference:

  • 1. extract VPN (virt page num) from VA (virt addr) (cheap)
  • 2. calculate addr of PTE (page table entry) (cheap)
  • 3. read PTE from memory (expensive)
  • 4. extract PFN (page frame num) (cheap)
  • 5. build PA (phys addr) (cheap)
  • 6. read contents of PA from memory into register (expensive)

Which steps are expensive? 3 and 6, the two memory reads.

Which expensive step will we avoid in today's lecture? 3) Don't always have to read PTE from memory!

Example: Array Iterator

int sum = 0;
for (i = 0; i < N; i++) {
    sum += a[i];
}

Assume 'a' starts at 0x3000; ignore instruction fetches; 4 KB pages.

What virtual addresses? load 0x3000, load 0x3004, load 0x3008, load 0x300C, ...

Assume these physical addresses: load 0x100C, load 0x7000, load 0x100C, load 0x7004, load 0x100C, load 0x7008, load 0x100C, load 0x700C

Observation: Repeatedly access same PTE because program repeatedly accesses same virtual page

Aside: What can you infer?

  • ptbr: 0x1000; PTE 4 bytes each
  • VPN 3 -> PPN 7

Strategy: Cache Page Translations

TLB: Translation Lookaside Buffer (yes, a poor name!)

[Figure: the CPU holds a small translation cache (the TLB) with some popular entries; the full page table (PT) lives in RAM, across the memory interconnect.]

TLB Organization

TLB entry: Tag (virtual page number) | Physical page number (page table entry)

[Figure: various ways to organize a 16-entry TLB (artificially small): direct mapped (16 sets of 1 entry), two-way set associative (8 sets of 2), four-way set associative (4 sets of 4), fully associative (1 set of 16).]

Lookup:

  • Calculate set index (tag % num_sets)
  • Search for tag within resulting set


TLB Example

Insert the translation VPN 30 (decimal) -> PPN 0xa6 into each organization:

  • Direct mapped: set 30 % 16 = 14
  • Two-way set associative: set 30 % 8 = 6
  • Four-way set associative: set 30 % 4 = 2
  • Fully associative: any free slot

[Figure: the four 16-entry TLB organizations with entries such as 30 -> 0xa6, 46 -> 0xbe, 10 -> 0x21, and 6 -> 0xf1 placed in their sets.]

TLB: Replace Entry

Now insert VPN 14 (decimal) -> PPN 0x38:

  • Direct mapped: set 14 % 16 = 14, the slot already holding VPN 30 (30 % 16 = 14), so that entry must be evicted
  • Two-way set associative: set 14 % 8 = 6, the same set as VPN 30, but a second way is free
  • Four-way set associative: set 14 % 4 = 2, with more ways still free
  • Fully associative: any free slot


TLB Associativity Trade-offs

Higher associativity:

  + Better utilization, fewer collisions
  – Slower
  – More hardware

Lower associativity:

  + Fast
  + Simple, less hardware
  – Greater chance of collisions

TLBs usually fully associative

Array Iterator (w/ TLB)

int sum = 0;
for (i = 0; i < 2048; i++) {
    sum += a[i];
}

Assume the following virtual address stream: load 0x1000, load 0x1004, load 0x1008, load 0x100C, ...

What will TLB behavior look like?


TLB Accesses: SEQUENTIAL Example

[Figure: physical memory from 0 KB to 28 KB holding P1's and P2's pages and page tables; P1's pagetable maps VPNs 0, 1, 2 to frames 1, 5, 4.]

Virtual stream: load 0x1000, load 0x1004, load 0x1008, load 0x100c, ..., load 0x2000, load 0x2004, ...

Resulting physical accesses (PTBR points to P1's pagetable; 4-byte PTEs):

  • load 0x1000: Miss! -> load PTE at 0x0004 (VPN 1 -> PPN 5) -> load 0x5000
  • load 0x1004: TLB hit -> load 0x5004
  • load 0x1008: TLB hit -> load 0x5008
  • load 0x100c: TLB hit -> load 0x500C
  • ...
  • load 0x2000: Miss! -> load PTE at 0x0008 (VPN 2 -> PPN 4) -> load 0x4000
  • load 0x2004: TLB hit -> load 0x4004

CPU's TLB afterward:

  Valid | VPN | PPN
  1     | 1   | 5
  1     | 2   | 4

Performance of TLB?

int sum = 0;
for (i = 0; i < 2048; i++) {
    sum += a[i];
}

Calculate miss rate of TLB for data (ignore code and sum):

  • miss rate = # TLB misses / # TLB lookups
  • # TLB lookups? = number of accesses to array a[] = 2048
  • # TLB misses? = number of unique pages accessed = 2048 / (elements of a[] per 4 KB page) = 2K / (4 KB / sizeof(int)) = 2K / 1K = 2
  • Miss rate? 2/2048 ≈ 0.1%
  • Hit rate? (1 – miss rate) ≈ 99.9%
  • Would hit rate get better or worse with smaller pages? Worse
  • Would hit rate get better or worse with larger values of i? Stays the same! Miss on the first access to each page; always miss 1 in 1024


TLB PERFORMANCE

How can the system improve TLB performance (hit rate) given a fixed number of TLB entries? Increase the page size:

  • Fewer unique page translations needed to access the same amount of memory

TLB Reach: Number of TLB entries * Page Size

Break

  • What did you do this summer?
  • What was the best summer job you’ve ever had?

TLB PERFORMANCE with Workloads

Sequential array accesses almost always hit in TLB

  • Very fast!

What access pattern will be slow?

  • Highly random, with no repeat accesses

Workload Access Patterns

Workload A:

int sum = 0;
for (i = 0; i < 2048; i++) {
    sum += a[i];
}

Workload B:

int sum = 0;
srand(1234);
for (i = 0; i < 1000; i++) {
    sum += a[rand() % N];
}
srand(1234);  // same seed: repeat the same random sequence
for (i = 0; i < 1000; i++) {
    sum += a[rand() % N];
}

[Figure: address vs. time plots. Workload A: sequential accesses (spatial locality). Workload B: repeated random accesses (temporal locality).]


Workload Locality

Spatial locality: future accesses will be to nearby addresses
Temporal locality: future accesses will repeat accesses to the same data made recently

What TLB characteristics are best for each type?

Spatial:

  • Access same page next; need same vpn->ppn translation
  • Same TLB entry re-used (just 1 TLB entry could be fine!)

Temporal:

  • Access same address in the near future
  • Same TLB entry re-used in the near future
  • How near in the future? How many TLB entries are there?

TLB Replacement policies

LRU: evict Least-Recently Used TLB slot when needed

(More on LRU later in policies next week)

Random: evict a randomly chosen entry

Which is better?


LRU Troubles

[Figure: a TLB with 4 entries, all initially invalid (Valid/Virt/Phys unknown), and a stream of virtual addresses cycling through 5 pages.]

Workload repeatedly accesses the same offset across 5 pages (strided access), but there are only 4 TLB entries.

What will the TLB contents be over time? How will the TLB perform?

TLB Replacement policies

LRU: evict Least-Recently Used TLB slot when needed

(More on LRU later in policies next week)

Random: evict a randomly chosen entry

Sometimes random is better than a “smart” policy!


Context Switches

What happens if a process uses cached TLB entries from another process? Solutions?

  • 1. Flush TLB on each context switch
  • Costly; lose all recently cached translations, more misses
  • 2. Track which entries are for which process
  • Address Space Identifier
  • Tag each TLB entry with an 8-bit ASID
  • How many ASIDs do we get? 2^8 = 256

TLB Example with ASIDs

[Figure: physical memory from 0 KB to 28 KB holding P1's and P2's pages plus their page tables. P1's pagetable (ASID 11) maps VPNs 0, 1, 2 to frames 1, 5, 4; P2's pagetable (ASID 12) maps VPNs 0, 1, 2 to frames 6, 2, 3. Both processes issue load 0x1444: under ASID 11 it translates to load 0x5444, under ASID 12 to load 0x2444.]

TLB (entries tagged with an ASID), e.g.:

  Valid | Virt | Phys | ASID
  1     | 1    | 5    | 11
  1     | 1    | 2    | 12


TLB Performance

With ASIDs, do context switches hurt TLB performance? (increase miss rate)

  • Even with ASID, other processes “pollute” TLB
  • Discard process A’s TLB entries for process B’s entries

Context switches are expensive for memory performance!

Architectures can have multiple TLBs:

  • 1 TLB for data, 1 TLB for instructions
  • 1 TLB for regular pages, 1 TLB for “super pages”

HW and OS Roles

Who handles a TLB miss? H/W or OS?

OS: CPU traps into OS upon TLB miss

  • “Software-managed TLB”
  • OS interprets pagetables as it chooses with special instructions
  • Modifying TLB entries is privileged
  • otherwise what could process do?

H/W: CPU must know where pagetables are

  • CR3 register on x86
  • Pagetable structure fixed and agreed upon between HW and OS
  • HW “walks” the pagetable and fills TLB

Need same protection bits in TLB as pagetable

  • rwx

Summary

  • Pages are great, but accessing page tables for every memory access is slow
  • Cache recent page translations → TLB
  • Hardware performs TLB lookup on every memory access
  • TLB performance depends strongly on workload
  • Sequential workloads perform well
  • Workloads with temporal locality can perform well
  • Increase TLB reach by increasing page size
  • In different systems, hardware or OS handles TLB misses
  • TLBs increase cost of context switches
  • Flush TLB on every context switch
  • Add ASID to every TLB entry