SLIDE 1

Translation Caching: Skip, Don’t Walk (The Page Table)

Thomas W. Barr, Alan L. Cox, Scott Rixner Rice Computer Architecture Group, Rice University International Symposium on Computer Architecture, June 2010

SLIDE 2

rice computer architecture group - 2

Virtual Memory: An Increasing Challenge

  • Virtual memory
  • Performance overhead of 5-14% for “typical” applications [Bhargava08]
  • Up to 89% under virtualization! [Bhargava08]
  • Overhead comes primarily from referencing the in-memory page table
  • MMU Cache
  • Dedicated cache to speed access to parts of the page table
SLIDE 3

Overview

  • Background
  • Why is address translation slow?
  • How MMU Caching can help
  • Design and comparison of MMU Caches
  • Systematic exploration of design space
  • Previous designs
  • New, superior point in space
  • Novel replacement scheme
  • Revisiting previous work
  • Comparison to Inverted Page Table
SLIDE 4

Why is Address Translation Slow?

  • Four-level page table: one memory reference per level on a TLB miss
  • Example: translating virtual address 0x5c8315cc1016
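On x86-64, the hardware splits a 48-bit virtual address into four 9-bit table indices and a 12-bit page offset; a TLB miss walks one page-table level per index. A minimal sketch of the split:

```python
def split_va(va):
    """Split a 48-bit x86-64 virtual address into the four 9-bit
    page-table indices (L4..L1) and the 12-bit page offset."""
    l4 = (va >> 39) & 0x1FF   # L4 (PML4) index
    l3 = (va >> 30) & 0x1FF   # L3 (page-directory-pointer) index
    l2 = (va >> 21) & 0x1FF   # L2 (page-directory) index
    l1 = (va >> 12) & 0x1FF   # L1 (page-table) index
    offset = va & 0xFFF       # byte offset within the 4 KB page
    return l4, l3, l2, l1, offset

# The example address from the slide:
print([hex(f) for f in split_va(0x5C8315CC1016)])
# → ['0xb9', '0xc', '0xae', '0xc1', '0x16']
```

The five fields match the slide's index tuple {0b9, 00c, 0ae, 0c1, 016}.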

SLIDE 5

MMU Caching

  • Upper levels of the page table correspond to large regions of virtual memory
  • Should be easily cached
  • MMU does not have access to the L1 cache
  • MMU Cache: caches upper-level entries (L4, L3, L2)
SLIDE 6

MMU Caching

  • In production
  • AMD and Intel
  • Design space
  • Tagging: page table vs. translation
  • Organization: split vs. unified
  • Previous designs not optimal
  • Unified translation cache (with modified replacement scheme) outperforms existing designs

[Figure: design-space taxonomy - tagging crossed with organization yields the four designs UPTC, UTC, SPTC, STC]
SLIDE 7

Page table caches

  • Simple design
  • Data cache
  • Entries tagged by physical address of the page table entry
  • Page walk unchanged
  • Replace memory accesses with MMU cache accesses
  • Three accesses/walk

Example contents while walking {0b9, 00c, 0ae, 0c1, 016}:

  PTE address   pointer
  0x23410       0xabcde
  0x55320       0x23144
  0x23144       0x55320
  ...           ...
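A rough Python model of a page-table cache (data structures and names here are illustrative, not the hardware design): entries are tagged by the physical address of the page-table entry, the walk still visits all four levels in order, and hits in the MMU cache replace memory accesses for the upper three levels:

```python
def walk(va, memory, mmu_cache, root):
    """One radix page walk. `memory` maps (table, index) -> next table,
    a simplification of real physical addressing. Upper-level loads
    (L4, L3, L2) try the MMU cache first; the leaf PTE always goes to
    memory, so even a fully warm cache costs one memory access."""
    mem_accesses = 0
    table = root
    for shift in (39, 30, 21, 12):
        idx = (va >> shift) & 0x1FF
        key = (table, idx)
        if shift != 12 and key in mmu_cache:
            table = mmu_cache[key]      # MMU-cache hit: no memory access
        else:
            table = memory[key]         # go to the in-memory page table
            mem_accesses += 1
            if shift != 12:
                mmu_cache[key] = table  # cache the upper-level entry
    return table, mem_accesses

# Tiny example: a single mapped page, walked cold and then warm.
va = 0x5C8315CC1016
memory = {('root', 0x0B9): 'L3', ('L3', 0x00C): 'L2',
          ('L2', 0x0AE): 'L1', ('L1', 0x0C1): 'frame42'}
cache = {}
print(walk(va, memory, cache, 'root'))   # cold: ('frame42', 4)
print(walk(va, memory, cache, 'root'))   # warm: ('frame42', 1)
```

The warm walk still does three lookups in the MMU cache (one per upper level), matching the "three accesses/walk" point above.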

SLIDE 8

Translation caches

  • Alternate tag
  • Tag by virtual address fragment
  • Smaller: 27 bits vs. 49 bits
  • Skip parts of page walk
  • Skip to bottom of tree

Example contents while walking {0b9, 00c, 0ae, 0c1, 016}:

  tag (L4, L3, L2 indices)   pointer
  (0b9, 00c, 0ae)            0xabcde
  (0b9, 00c, xxx)            0x23410
  (0b9, xxx, xxx)            0x55320
  ...                        ...
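A translation cache tags entries by the virtual-index prefix instead: on a TLB miss, the walker probes for the longest cached prefix and resumes the walk from there, in the best case skipping straight to the bottom of the tree. A sketch (names illustrative):

```python
def translation_lookup(tcache, l4, l3, l2):
    """Return (levels_skipped, next_table) for the longest cached
    prefix of the (L4, L3, L2) indices, or (0, None) on a full miss."""
    for tag in ((l4, l3, l2), (l4, l3), (l4,)):
        if tag in tcache:
            return len(tag), tcache[tag]
    return 0, None

# Contents from the slide's example:
tcache = {(0x0B9, 0x00C, 0x0AE): 0xABCDE,  # full prefix -> L1 table
          (0x0B9, 0x00C): 0x23410,         # two levels  -> L2 table
          (0x0B9,): 0x55320}               # one level   -> L3 table

# Full match: skip three levels and go straight to the L1 table,
# leaving only the leaf PTE load for memory.
translation_lookup(tcache, 0x0B9, 0x00C, 0x0AE)
# Partial match on a different L2 index: resume the walk at the L2 table.
translation_lookup(tcache, 0x0B9, 0x00C, 0x111)
```

Longest-prefix matching is what lets the translation cache "skip, don't walk": a page-table cache must still touch every level in order, while a full translation-cache hit reduces the walk to the single leaf access.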

SLIDE 9

Cache tagging comparison

[Figure: cache tagging comparison, SPEC CPU2006 floating-point suite]

SLIDE 10

Split vs. Unified Caches

  • Hash joins
  • Reads to a many-gigabyte table are nearly completely random
  • Vital to overall DBMS performance [Ailamaki99]
  • Simulate with synthetic trace generator
  • MMU cache performance
  • 16 gigabyte hash table
  • 1 L4 entry
  • 16 L3 entries
  • 8,192 L2 entries
  • Low L2 hit rate leads to “level conflict” in unified caches
  • Solve by splitting caches or using a smarter replacement scheme
SLIDE 11

Split vs. Unified Caches

SLIDE 12

Level Conflict in Unified Caches

  • LRU replacement
  • Important for high-locality applications
  • Avoid replacing upper-level entries
  • After every L3 access, there must be one L2 access
  • Each L3 entry pollutes the cache with at least one unique L2 entry

SLIDE 13

Split vs. Unified Caches

  • Split caches have one cache per level
  • Protects entries from upper levels
  • Intel's Paging Structure Cache

[Figure: split-cache organization with separate per-level caches for L4, L3, and L2 entries]

SLIDE 14

Split vs. Unified Caches

  • Problem: Size allocation
  • Each level large?
  • Die area
  • Each level small?
  • Hurts performance for all applications
  • Unequal distribution?
  • Hurts performance for particular applications

SLIDE 15

Variable insertion point LRU replacement

  • Modified LRU
  • Preserve entries with low reuse for less time
  • Insert them below the MRU slot
  • VI-LRU
  • Novel scheme
  • Vary insertion point based on content of cache
  • If L3 entries have high reuse, give L2 entries less time
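The idea can be sketched in a few lines. Note the hedge: the real VI-LRU varies the insertion point dynamically based on the observed reuse of entries already in the cache, whereas the fixed two-point policy below (upper levels at MRU, L2 at the LRU end) is a deliberate simplification for illustration:

```python
class VILRUCache:
    """Sketch of variable-insertion-point LRU. Hits promote an entry
    to the MRU position as usual, but entries expected to have low
    reuse (here, L2 entries) are inserted at the LRU end instead of
    at MRU, so a stream of single-use L2 entries cannot push hot
    L3/L4 entries out of a unified cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.stack = []                     # index 0 = MRU, index -1 = LRU

    def access(self, tag, level):
        """`level` is the page-table level of the entry (4, 3, or 2)."""
        if tag in self.stack:
            self.stack.remove(tag)
            self.stack.insert(0, tag)       # promote to MRU on a hit
            return True
        if len(self.stack) >= self.capacity:
            self.stack.pop()                # evict from the LRU end
        # Simplified policy: L4/L3 enter at MRU, L2 enters near LRU.
        pos = 0 if level >= 3 else len(self.stack)
        self.stack.insert(pos, tag)
        return False

# Hot L3 entries survive a hash-join-like stream of single-use L2 entries:
c = VILRUCache(4)
c.access('L3-a', 3)
c.access('L3-b', 3)
for i in range(100):
    c.access(f'L2-{i}', 2)                 # 100 distinct, never-reused tags
assert 'L3-a' in c.stack and 'L3-b' in c.stack
```

Under plain LRU the same stream would evict both L3 entries; inserting low-reuse entries below the MRU slot is what resolves the level conflict without physically splitting the cache.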

SLIDE 16

Variable insertion point LRU replacement

SLIDE 17

Page Table Formats

  • In the past, radix table implementations required four memory references per TLB miss

  • Many proposed data structure solutions to replace format
  • Reduces memory references/miss
  • This situation has changed
  • MMU cache is a hardware solution
  • Also reduces memory references
  • Revisit previous work
  • Competing formats are not as attractive now
SLIDE 18

Inverted page table

  • Inverted (hashed) page table
  • Flat table, regardless of key (virtual address) size
  • Best-case lookup is one memory access
  • Average increases as hash collisions occur
  • 1.2 accesses / lookup for half full table [Knuth98]
  • Radix vs. inverted page table
  • IPT poorly exploits spatial locality in processor data cache
  • Increases DRAM accesses/walk by 400% for SPEC in simulation
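For reference, a toy inverted page table with linear probing (the layout and the hash function are illustrative, not a real implementation): the table is flat regardless of virtual-address size, the best-case lookup is one probe, and collisions add probes, approaching the ~1.2 average at 50% occupancy cited from [Knuth98]:

```python
def ipt_insert(table, vpn, frame):
    """Insert a vpn -> frame mapping with linear probing."""
    n = len(table)
    slot = vpn % n                     # illustrative hash function
    while table[slot] is not None:
        slot = (slot + 1) % n          # collision: probe the next slot
    table[slot] = (vpn, frame)

def ipt_lookup(table, vpn):
    """Return (frame, probes). One probe in the best case; hash
    collisions raise the average as the table fills."""
    n = len(table)
    slot = vpn % n
    probes = 1
    while table[slot] is not None:
        tag, frame = table[slot]
        if tag == vpn:
            return frame, probes
        slot = (slot + 1) % n
        probes += 1
    return None, probes                # hit an empty slot: not mapped

table = [None] * 8
ipt_insert(table, vpn=3, frame=100)
ipt_insert(table, vpn=11, frame=200)   # 11 % 8 == 3: collides, lands in slot 4
print(ipt_lookup(table, 3))            # → (100, 1)
print(ipt_lookup(table, 11))           # → (200, 2)
```

Note what this structure cannot do: probe sequences scatter across the table, so consecutive virtual pages do not share cache lines, which is why the IPT exploits spatial locality in the processor data cache so poorly compared to a cached radix table.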

SLIDE 19

Inverted page table

SLIDE 20

Inverted page table

  • IPT compared to cached radix table
  • Number of memory accesses similar (≈1.2)
  • Number of DRAM accesses increased 4x
  • SPARC TSB, Clustered Page Table, etc.
  • Similar results
  • Caching makes performance proportional to size
  • Translations / L2 cache
  • Consecutive translations / cache line
  • New hardware changes old “truths”
  • Replace complex data structures with simple hardware
SLIDE 21

Conclusions

  • Address translation will continue to be a problem
  • Up to 89% performance overhead
  • First design space taxonomy and evaluation of MMU caches
  • Two-dimension space
  • Translation/Page Table Cache
  • Split/Unified Cache
  • 4.0 → 1.13 L2 accesses/TLB miss for current design
  • Existing designs are not ideal
  • Tagging
  • Translation caches can skip levels, use smaller tags
  • Partitioning
  • Novel VI-LRU allows partitioning to adapt to workload