SLIDE 1

Translation Caching: Skip, Don’t Walk (The Page Table)

Thomas W. Barr, Alan L. Cox, Scott Rixner Rice Computer Architecture Group, Rice University International Symposium on Computer Architecture, June 2010

SLIDE 2

rice computer architecture group - 2

Virtual Memory: An Increasing Challenge

  • Virtual memory
  • Performance overhead of 5-14% for “typical” applications [Bhargava08]
  • Up to 89% under virtualization! [Bhargava08]
  • Overhead comes primarily from referencing the in-memory page table
  • MMU Cache
  • Dedicated cache to speed access to parts of the page table
SLIDE 3

Overview

  • Background
  • Why is address translation slow?
  • How MMU Caching can help
  • Design and comparison of MMU Caches
  • Systematic exploration of design space
  • Previous designs
  • New, superior point in space
  • Novel replacement scheme
  • Revisiting previous work
  • Comparison to Inverted Page Table
SLIDE 4

Why is Address Translation Slow?

  • Four-level page table: one memory reference per level on a TLB miss
  • Example: translating virtual address 0x5c8315cc1016
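On x86-64, the hardware splits a 48-bit virtual address into four 9-bit table indices and a 12-bit page offset; a TLB miss walks one page-table level per index. A minimal sketch of the split:

```python
def split_va(va):
    """Split a 48-bit x86-64 virtual address into the four 9-bit
    page-table indices (L4..L1) and the 12-bit page offset."""
    l4 = (va >> 39) & 0x1FF   # L4 (PML4) index
    l3 = (va >> 30) & 0x1FF   # L3 (page-directory-pointer) index
    l2 = (va >> 21) & 0x1FF   # L2 (page-directory) index
    l1 = (va >> 12) & 0x1FF   # L1 (page-table) index
    offset = va & 0xFFF       # byte offset within the 4 KB page
    return l4, l3, l2, l1, offset

# The example address from the slide:
print([hex(f) for f in split_va(0x5C8315CC1016)])
# → ['0xb9', '0xc', '0xae', '0xc1', '0x16']
```

The five fields match the slide's index tuple {0b9, 00c, 0ae, 0c1, 016}.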

SLIDE 5

MMU Caching

  • Upper levels of the page table correspond to large regions of virtual memory
  • Should be easily cached
  • MMU does not have access to the L1 cache
  • MMU Cache: caches upper-level entries (L4, L3, L2)
SLIDE 6

MMU Caching

  • In production
  • AMD and Intel
  • Design space
  • Tagging: page table vs. translation
  • Organization: split vs. unified
  • Previous designs not optimal
  • Unified translation cache (with modified replacement scheme) outperforms existing designs

[Figure: design-space taxonomy - tagging crossed with organization yields the four designs UPTC, UTC, SPTC, STC]
SLIDE 7

Page table caches

  • Simple design
  • Data cache
  • Entries tagged by physical address of the page table entry
  • Page walk unchanged
  • Replace memory accesses with MMU cache accesses
  • Three accesses/walk

Example contents while walking {0b9, 00c, 0ae, 0c1, 016}:

  PTE address   pointer
  0x23410       0xabcde
  0x55320       0x23144
  0x23144       0x55320
  ...           ...
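A rough Python model of a page-table cache (data structures and names here are illustrative, not the hardware design): entries are tagged by the physical address of the page-table entry, the walk still visits all four levels in order, and hits in the MMU cache replace memory accesses for the upper three levels:

```python
def walk(va, memory, mmu_cache, root):
    """One radix page walk. `memory` maps (table, index) -> next table,
    a simplification of real physical addressing. Upper-level loads
    (L4, L3, L2) try the MMU cache first; the leaf PTE always goes to
    memory, so even a fully warm cache costs one memory access."""
    mem_accesses = 0
    table = root
    for shift in (39, 30, 21, 12):
        idx = (va >> shift) & 0x1FF
        key = (table, idx)
        if shift != 12 and key in mmu_cache:
            table = mmu_cache[key]      # MMU-cache hit: no memory access
        else:
            table = memory[key]         # go to the in-memory page table
            mem_accesses += 1
            if shift != 12:
                mmu_cache[key] = table  # cache the upper-level entry
    return table, mem_accesses

# Tiny example: a single mapped page, walked cold and then warm.
va = 0x5C8315CC1016
memory = {('root', 0x0B9): 'L3', ('L3', 0x00C): 'L2',
          ('L2', 0x0AE): 'L1', ('L1', 0x0C1): 'frame42'}
cache = {}
print(walk(va, memory, cache, 'root'))   # cold: ('frame42', 4)
print(walk(va, memory, cache, 'root'))   # warm: ('frame42', 1)
```

The warm walk still does three lookups in the MMU cache (one per upper level), matching the "three accesses/walk" point above.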

SLIDE 8

Translation caches

  • Alternate tag
  • Tag by virtual address fragment
  • Smaller: 27 bits vs. 49 bits
  • Skip parts of page walk
  • Skip to bottom of tree

Example contents while walking {0b9, 00c, 0ae, 0c1, 016}:

  tag (L4, L3, L2 indices)   pointer
  (0b9, 00c, 0ae)            0xabcde
  (0b9, 00c, xxx)            0x23410
  (0b9, xxx, xxx)            0x55320
  ...                        ...
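A translation cache tags entries by the virtual-index prefix instead: on a TLB miss, the walker probes for the longest cached prefix and resumes the walk from there, in the best case skipping straight to the bottom of the tree. A sketch (names illustrative):

```python
def translation_lookup(tcache, l4, l3, l2):
    """Return (levels_skipped, next_table) for the longest cached
    prefix of the (L4, L3, L2) indices, or (0, None) on a full miss."""
    for tag in ((l4, l3, l2), (l4, l3), (l4,)):
        if tag in tcache:
            return len(tag), tcache[tag]
    return 0, None

# Contents from the slide's example:
tcache = {(0x0B9, 0x00C, 0x0AE): 0xABCDE,  # full prefix -> L1 table
          (0x0B9, 0x00C): 0x23410,         # two levels  -> L2 table
          (0x0B9,): 0x55320}               # one level   -> L3 table

# Full match: skip three levels and go straight to the L1 table,
# leaving only the leaf PTE load for memory.
translation_lookup(tcache, 0x0B9, 0x00C, 0x0AE)
# Partial match on a different L2 index: resume the walk at the L2 table.
translation_lookup(tcache, 0x0B9, 0x00C, 0x111)
```

Longest-prefix matching is what lets the translation cache "skip, don't walk": a page-table cache must still touch every level in order, while a full translation-cache hit reduces the walk to the single leaf access.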

SLIDE 9

Cache tagging comparison

[Figure: cache tagging comparison, SPEC CPU2006 floating-point suite]

SLIDE 10

Split vs. Unified Caches

  • Hash joins
  • Reads to a many-gigabyte table are nearly completely random
  • Vital to overall DBMS performance [Ailamaki99]
  • Simulate with synthetic trace generator
  • MMU cache performance
  • 16 gigabyte hash table
  • 1 L4 entry
  • 16 L3 entries
  • 8,192 L2 entries
  • Low L2 hit rate leads to “level conflict” in unified caches
  • Solve by splitting caches or using a smarter replacement scheme
SLIDE 11

Split vs. Unified Caches

SLIDE 12

Level Conflict in Unified Caches

  • LRU replacement
  • Important for high-locality applications
  • Avoid replacing upper-level entries
  • After every L3 access, there must be one L2 access
  • Each L3 entry pollutes the cache with at least one unique L2 entry

SLIDE 13

Split vs. Unified Caches

  • Split caches have one cache per level
  • Protects entries from upper levels
  • Intel's Paging Structure Cache

[Figure: split-cache organization with separate per-level caches for L4, L3, and L2 entries]

SLIDE 14

Split vs. Unified Caches

  • Problem: Size allocation
  • Each level large?
  • Die area
  • Each level small?
  • Hurts performance for all applications
  • Unequal distribution?
  • Hurts performance for particular applications

SLIDE 15

Variable insertion point LRU replacement

  • Modified LRU
  • Preserve entries with low reuse for less time
  • Insert them below the MRU slot
  • VI-LRU
  • Novel scheme
  • Vary insertion point based on content of cache
  • If L3 entries have high reuse, give L2 entries less time
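The idea can be sketched in a few lines. Note the hedge: the real VI-LRU varies the insertion point dynamically based on the observed reuse of entries already in the cache, whereas the fixed two-point policy below (upper levels at MRU, L2 at the LRU end) is a deliberate simplification for illustration:

```python
class VILRUCache:
    """Sketch of variable-insertion-point LRU. Hits promote an entry
    to the MRU position as usual, but entries expected to have low
    reuse (here, L2 entries) are inserted at the LRU end instead of
    at MRU, so a stream of single-use L2 entries cannot push hot
    L3/L4 entries out of a unified cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.stack = []                     # index 0 = MRU, index -1 = LRU

    def access(self, tag, level):
        """`level` is the page-table level of the entry (4, 3, or 2)."""
        if tag in self.stack:
            self.stack.remove(tag)
            self.stack.insert(0, tag)       # promote to MRU on a hit
            return True
        if len(self.stack) >= self.capacity:
            self.stack.pop()                # evict from the LRU end
        # Simplified policy: L4/L3 enter at MRU, L2 enters near LRU.
        pos = 0 if level >= 3 else len(self.stack)
        self.stack.insert(pos, tag)
        return False

# Hot L3 entries survive a hash-join-like stream of single-use L2 entries:
c = VILRUCache(4)
c.access('L3-a', 3)
c.access('L3-b', 3)
for i in range(100):
    c.access(f'L2-{i}', 2)                 # 100 distinct, never-reused tags
assert 'L3-a' in c.stack and 'L3-b' in c.stack
```

Under plain LRU the same stream would evict both L3 entries; inserting low-reuse entries below the MRU slot is what resolves the level conflict without physically splitting the cache.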

SLIDE 16

Variable insertion point LRU replacement

SLIDE 17

Page Table Formats

  • In the past, radix table implementations required four memory references per TLB miss

  • Many proposed data structure solutions to replace format
  • Reduces memory references/miss
  • This situation has changed
  • MMU cache is a hardware solution
  • Also reduces memory references
  • Revisit previous work
  • Competing formats are not as attractive now
SLIDE 18

Inverted page table

  • Inverted (hashed) page table
  • Flat table, regardless of key (virtual address) size
  • Best-case lookup is one memory access
  • Average increases as hash collisions occur
  • 1.2 accesses / lookup for half full table [Knuth98]
  • Radix vs. inverted page table
  • IPT poorly exploits spatial locality in processor data cache
  • Increases DRAM accesses/walk by 400% for SPEC in simulation
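For reference, a toy inverted page table with linear probing (the layout and the hash function are illustrative, not a real implementation): the table is flat regardless of virtual-address size, the best-case lookup is one probe, and collisions add probes, approaching the ~1.2 average at 50% occupancy cited from [Knuth98]:

```python
def ipt_insert(table, vpn, frame):
    """Insert a vpn -> frame mapping with linear probing."""
    n = len(table)
    slot = vpn % n                     # illustrative hash function
    while table[slot] is not None:
        slot = (slot + 1) % n          # collision: probe the next slot
    table[slot] = (vpn, frame)

def ipt_lookup(table, vpn):
    """Return (frame, probes). One probe in the best case; hash
    collisions raise the average as the table fills."""
    n = len(table)
    slot = vpn % n
    probes = 1
    while table[slot] is not None:
        tag, frame = table[slot]
        if tag == vpn:
            return frame, probes
        slot = (slot + 1) % n
        probes += 1
    return None, probes                # hit an empty slot: not mapped

table = [None] * 8
ipt_insert(table, vpn=3, frame=100)
ipt_insert(table, vpn=11, frame=200)   # 11 % 8 == 3: collides, lands in slot 4
print(ipt_lookup(table, 3))            # → (100, 1)
print(ipt_lookup(table, 11))           # → (200, 2)
```

Note what this structure cannot do: probe sequences scatter across the table, so consecutive virtual pages do not share cache lines, which is why the IPT exploits spatial locality in the processor data cache so poorly compared to a cached radix table.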

SLIDE 19

Inverted page table

SLIDE 20

Inverted page table

  • IPT compared to cached radix table
  • Number of memory accesses similar (≈1.2)
  • Number of DRAM accesses increased 4x
  • SPARC TSB, Clustered Page Table, etc.
  • Similar results
  • Caching makes performance proportional to size
  • Translations / L2 cache
  • Consecutive translations / cache line
  • New hardware changes old “truths”
  • Replace complex data structures with simple hardware
SLIDE 21

Conclusions

  • Address translation will continue to be a problem
  • Up to 89% performance overhead
  • First design space taxonomy and evaluation of MMU caches
  • Two-dimension space
  • Translation/Page Table Cache
  • Split/Unified Cache
  • 4.0 → 1.13 L2 accesses/TLB miss for current design
  • Existing designs are not ideal
  • Tagging
  • Translation caches can skip levels, use smaller tags
  • Partitioning
  • Novel VI-LRU allows partitioning to adapt to workload