Efficient Synonym Filtering and Scalable Delayed Translation for Hybrid Virtual Caching

Chang Hyun Park, Taekyung Heo, and Jaehyuk Huh

Physical Caching
• Latency constraint limits TLB scalability
  • TLB size restricted
  • Limited coverage of TLB entry
• Missed opportunities [1]
  • Memory access misses in the TLB, hits in the cache
  • TLB miss delays cache hit opportunity
[Diagram labels: Virtual Address, Core, TLB, L1 $, Last-Level $, Physical Address]
[1] Zhang et al. ICS 2010

Virtual Caching
• Delay translation: virtual caching
  • Access cache, then translate on miss
  • Cache hits do not need translation
• Problem: synonyms
  • Synonyms are rare [2]
  • Optimize for the common case
• TLB accesses reduced significantly
  • Loosens the TLB access latency restriction
  • Possibility of sophisticated translation
  • Reduces power consumption
[Diagram labels: Virtual Address, Core, TLB, L1 $, Last-Level $, Synonyms, Physical Address]
[2] Basu et al. ISCA 2012

Hybrid Virtual Caching
[Diagram: three cache hierarchies side by side, labeled Physical Caching, Hybrid Virtual Caching, and Virtual Caching. Labels: Virtual Address, Core, TLB, Synonym TLB, L1 $, Last-Level $, Synonyms, Delayed TLB, Scalable Delayed Translation, Physical Address]

Contributions
• Propose hybrid virtual-physical caching
  • Cache populated by both virtual and physical blocks
  • Virtual cache for the common case, physical for synonyms
  • Synonyms are not confined to a fixed address range; they use the entire cache
• Propose scalable yet flexible delayed translation
  • Improve TLB entry scalability by employing segments [2][3]
  • Provide many segments for flexibility of memory management
  • Propose an efficient search mechanism to look up a segment
[2] Basu et al. ISCA 2013
[3] Karakostas, Gandhi et al. ISCA 2015

Hybrid Virtual Caching
• Virtual and physical cache
  • Each page consistently determined as physical or virtual
  • Each cache tag holds either a virtual or a physical tag
• Challenge: choose the address before cache access
• Synonym Filter: Bloom filter that detects synonyms
  • Hardware structure managed by the OS
  • Synonyms always detected (a Bloom filter has no false negatives), translated to physical address
[Diagram labels: Core, Synonyms, Non-Synonyms, L1 $, Last-Level $, Delayed TLB]
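The synonym filter's role can be sketched in a few lines. This is a minimal illustration, not the authors' hardware design: the hash functions, the 4 KB page size, the `SynonymFilter` and `choose_cache_address` names, and the handling of filter false positives (falling back to the virtual address when the synonym TLB holds no mapping) are all assumptions made for the example.

```python
import hashlib

PAGE_SHIFT = 12  # assumption: 4 KB pages (not stated on the slide)


class SynonymFilter:
    """Bloom filter over virtual page numbers; the OS sets bits for synonym pages."""

    def __init__(self, num_bits=1024, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, vpn):
        # Hypothetical hash choice; hardware would use cheap bit mixing instead.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{vpn}".encode()).digest()
            yield int.from_bytes(digest[:4], "little") % self.num_bits

    def mark_synonym(self, vaddr):
        """Called by the (modeled) OS when a page becomes a synonym page."""
        for pos in self._positions(vaddr >> PAGE_SHIFT):
            self.bits[pos] = True

    def may_be_synonym(self, vaddr):
        """True for every marked page; may also be True for unmarked pages."""
        return all(self.bits[pos] for pos in self._positions(vaddr >> PAGE_SHIFT))


def choose_cache_address(vaddr, synonym_filter, synonym_tlb):
    """Pick the address used to index and tag the cache, before the cache access.

    synonym_tlb is modeled as a dict {virtual page number: physical page number}
    that only holds synonym pages.
    """
    if synonym_filter.may_be_synonym(vaddr):
        vpn = vaddr >> PAGE_SHIFT
        ppn = synonym_tlb.get(vpn)
        if ppn is not None:
            offset = vaddr & ((1 << PAGE_SHIFT) - 1)
            return ("physical", (ppn << PAGE_SHIFT) | offset)
        # Assumed behavior: a filter false positive is treated as a non-synonym.
    return ("virtual", vaddr)
```

Because a Bloom filter never reports a marked page as absent, every synonym page takes the physical path, which is what keeps each page consistently physical or virtual in the cache.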

Hybrid Virtual Caching Efficiency
• Pin-based simulation
• Baseline TLB
  • L1 TLB: 64 entries
  • L2 TLB: 1024 entries
• Hybrid Virtual Caching
  • 2 x 1 Kb synonym filters
  • Synonym TLB: 64 entries
  • Delayed TLB: 1024 entries
• Workloads
  • Apache, Ferret, Firefox, Postgres, SpecJBB
[Diagram labels: Virtual Address, Core, L1 $, Last-Level $, Delayed TLB, Physical Address]

Hybrid Virtual Caching Efficiency
• Synonym Filter: majority of accesses go to the virtual cache
  • 83.7~99.9% of TLB accesses bypassed
• Delayed Translation: cache hits remove TLB accesses and reduce TLB misses
  • Up to 99.9% TLB access reduction
  • Up to 69.7% TLB miss reduction
[Diagram labels: Virtual Address, Core, Synonym Filter, L1 $, Last-Level $, Delayed TLB, Physical Address]

Limitation of Delayed TLB
• TLB entries limited in scalability
  • Each entry maps a fixed granularity
  • Increasing TLB size does not reduce misses as expected
• TLB size is restricted; improve the coverage of each TLB entry
[Figure: normalized TLB MPKI (%) for delayed TLBs of 1K, 2K, 4K, 8K, 16K, 32K, and 64K entries; workloads: tigr, mcf, milc, GUPS]
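A quick calculation shows why fixed-granularity entries scale poorly: the memory a TLB can cover is just the entry count times the page size. The sketch below assumes 4 KB pages, which the slide does not state; under that assumption 1K entries cover 4 MiB and 64K entries cover 256 MiB, so coverage grows only linearly with hardware cost.

```python
PAGE_SIZE = 4 * 1024  # assumption: 4 KB pages (not stated on the slide)


def tlb_reach_bytes(entries, page_size=PAGE_SIZE):
    """Memory covered when every TLB entry maps exactly one fixed-size page."""
    return entries * page_size


# The TLB sizes swept in the figure, from 1K to 64K entries.
for entries in (1024, 2048, 4096, 8192, 16384, 32768, 65536):
    reach_mib = tlb_reach_bytes(entries) // 2**20
    print(f"{entries:>6} entries -> {reach_mib} MiB")
```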

Segments: Scalable Translation
• Direct Segment [2] improves TLB entry coverage
  • Represented by three values (base, limit, offset)
  • Translates contiguous memory of any size
• OS benefits from more available segments
  • Memory sharing among processes fragments memory
  • OS can offer multiple smaller segments
• Number of segments [3] limited by latency
  • Segment lookup between Core and L1 cache
  • Fully-associative lookup of all segments required
[Diagram: Base and Limit delimit a region of the Virtual Address Space; Offset maps it into the Physical Address Space]
[2] Basu et al. ISCA 2013
[3] Karakostas, Gandhi et al. ISCA 2015
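Segment translation with the three values (base, limit, offset) comes down to a range check and an addition. The sketch below is illustrative only: the `Segment` class name is hypothetical, and treating `limit` as an exclusive upper bound and `offset` as physical minus virtual address are assumptions of the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Segment:
    base: int    # first virtual address covered by the segment
    limit: int   # one past the last covered virtual address (assumed exclusive)
    offset: int  # physical address minus virtual address

    def translate(self, vaddr: int) -> Optional[int]:
        """Return the physical address, or None if vaddr is outside the segment."""
        if self.base <= vaddr < self.limit:
            return vaddr + self.offset
        return None


# One entry covers a contiguous 1 GiB region, regardless of page size.
seg = Segment(base=0x1000_0000, limit=0x5000_0000, offset=0x3000_0000)
print(hex(seg.translate(0x1000_0010)))
```

A single such entry covers a region of any size, which is the coverage advantage over a fixed-granularity TLB entry.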

Scalable Delayed Translation
• Exploit reduced frequency of delayed translation
  • Prior work limited to 10s of segments
  • Provide 1000s of segments for OS flexibility
• Efficient searching of owner segment required
  • OS-managed tree that locates a segment in a HW table
  • HW walker that traverses the tree to acquire the location
  • Use the location (index) to access the segment in the HW table
[Diagram labels: Delay Translation, 32 Segments, 1000s Segments]

Scalable Delayed Translation
• Segment Table: register values for many segments
  • Columns: Index, Base, Limit, Offset, etc.
• Infeasible to search all Segment Table entries
[Diagram labels: LLC Miss (Non-synonym), Segment Table, Memory Access]

Scalable Delayed Translation
• Index Tree: B-tree that holds the following mapping
  • Key: virtual address
  • Value: index to Segment Table
[Diagram labels: LLC Miss (Non-synonym), Index Tree, Segment index, Segment Table (Index, Base, Limit, Offset, etc.), Memory Access]
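The mapping from a virtual address to a Segment Table index can be illustrated with an ordered search. The sketch below swaps the B-tree for a sorted array searched with `bisect`, purely to keep the example short; it shows the key-to-index mapping, not the tree layout. The `SegmentIndex` name, the tuple layout of table entries, and the assumption that segments do not overlap are all hypothetical.

```python
import bisect


class SegmentIndex:
    """Ordered index from virtual address to Segment Table index.

    Stand-in for the index tree: a sorted array of segment bases,
    assuming segments do not overlap.
    """

    def __init__(self, segment_table):
        # segment_table: list of (base, limit, offset) tuples, limit exclusive.
        self.segment_table = segment_table
        order = sorted(range(len(segment_table)), key=lambda i: segment_table[i][0])
        self._bases = [segment_table[i][0] for i in order]
        self._indices = order

    def lookup(self, vaddr):
        """Return the Segment Table index whose segment owns vaddr, or None."""
        pos = bisect.bisect_right(self._bases, vaddr) - 1
        if pos < 0:
            return None  # vaddr lies below every segment base
        index = self._indices[pos]
        base, limit, _offset = self.segment_table[index]
        return index if base <= vaddr < limit else None


table = [(0x8000, 0x9000, 0x200000), (0x1000, 0x5000, 0x100000)]
index = SegmentIndex(table)
print(index.lookup(0x1234), index.lookup(0x8010), index.lookup(0x6000))
```

The search cost grows logarithmically with the number of segments, in contrast to comparing an address against every Segment Table entry.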

Scalable Delayed Translation
• Index Cache: caches index tree nodes on-chip
• Hardware Walker: searches through the index tree to produce a segment table index
[Diagram labels: LLC Miss (Non-synonym), HW Walker (traverse tree), Index Cache, Index Tree, Segment index, Segment Table (Index, Base, Limit, Offset, etc.), Memory Access]

Address Translation Procedure
• Segment Cache: caches many segment translations
  • Reduces latency and power consumption
[Diagram: an LLC miss (non-synonym) checks the Segment Cache; a hit proceeds to the memory access; a miss goes to the HW Walker, which traverses the Index Tree through the Index Cache to obtain a segment index into the Segment Table (Index, Base, Limit, Offset, etc.)]
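The procedure can be sketched as a two-level lookup: try the Segment Cache first, and only on a miss run the walker and read the Segment Table. This is a software illustration under stated assumptions, not the hardware design: the `DelayedTranslator` name, the LRU replacement policy, the callable standing in for the hardware walker, and raising an error when no segment owns the address are all hypothetical choices.

```python
from collections import OrderedDict


class DelayedTranslator:
    """Translate non-synonym LLC misses using a small segment cache."""

    def __init__(self, segment_table, walk_index_tree, cache_entries=128):
        self.segment_table = segment_table      # list of (base, limit, offset)
        self.walk_index_tree = walk_index_tree  # callable: vaddr -> table index or None
        self.cache_entries = cache_entries
        self.segment_cache = OrderedDict()      # table index -> segment, in LRU order
        self.walks = 0                          # number of index tree walks performed

    def _cache_lookup(self, vaddr):
        hit = None
        for index, (base, limit, _offset) in self.segment_cache.items():
            if base <= vaddr < limit:
                hit = index
                break
        if hit is None:
            return None
        self.segment_cache.move_to_end(hit)     # refresh LRU position (assumed policy)
        return vaddr + self.segment_cache[hit][2]

    def translate(self, vaddr):
        paddr = self._cache_lookup(vaddr)
        if paddr is not None:
            return paddr                        # Segment Cache hit: no tree walk
        self.walks += 1
        index = self.walk_index_tree(vaddr)     # miss: walker yields a table index
        if index is None:
            # The no-segment case is outside this sketch, so it simply raises.
            raise LookupError(f"no segment owns {vaddr:#x}")
        segment = self.segment_table[index]
        self.segment_cache[index] = segment
        if len(self.segment_cache) > self.cache_entries:
            self.segment_cache.popitem(last=False)  # evict least recently used
        return vaddr + segment[2]
```

Repeated accesses to the same segment hit in the cache and skip the walk entirely, which is where the latency and power savings come from.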

Evaluation
• Full-system OoO simulation on Marssx86 + DRAMSim2
  • Hosts Linux with 4GB RAM (DDR3)
  • Three-level cache hierarchy (based on Intel CPUs)
• Baseline TLB configurations (based on Intel Haswell)
  • L1 TLB: 1 cycle, 64 entries, 4-way
  • L2 TLB: 7 cycles, 1024 entries, 8-way
• Delayed TLB configurations range from 1K to 16K entries
• Many segment translation configurations
  • Segment Table: 2K entries
  • Index Cache: 32KB
  • Segment Cache: 128 entries
• Benchmarks: SPEC CPU, NPB, BioBench, GUPS

Results
• Cache hits reduce TLB accesses and misses, improving performance
[Figure: IPC normalized to the baseline TLB (%) for Delayed TLB with 1K, 4K, and 16K entries and for Many Segment Translation; workloads: bzip2, DC, gamess, perlbench, cactusADM, astar, LU, gromacs]

Results
• Delayed TLB offers some scalability for some workloads and is not scalable for others
• Increased translation scalability significantly reduces TLB misses
• Scalable Delayed Translation improves performance by 10.7% on average
• Power consumption is reduced by 60% on average
[Figure: IPC normalized to the baseline TLB (%) for Delayed TLB with 1K, 4K, and 16K entries and for Many Segment Translation, per benchmark; annotated values: 143, 179]

Conclusion
• Hybrid Virtual Cache allows delaying address translation
  • Majority of memory accesses use virtual caching; synonyms use physical caching
  • Synonym Filter consistently and quickly identifies accesses to synonym pages
  • Reduces up to 99.9% of TLB accesses, 69.7% of TLB misses
• Scalable delayed translation
  • Exploits reduced translations
  • Provides many segments and efficient segment searching
  • Average 10.7% performance improvement, 60% power saving
