Efficient Synonym Filtering and Scalable Delayed Translation for Hybrid Virtual Caching



SLIDE 1

Efficient Synonym Filtering and Scalable Delayed Translation for Hybrid Virtual Caching

Chang Hyun Park, Taekyung Heo, and Jaehyuk Huh

SLIDE 2

Physical Caching

  • Latency constraint limits TLB scalability
  • TLB size restricted
  • Limited coverage of TLB entry
  • Missed Opportunities[1]
  • Memory access misses TLB, hits in cache
  • TLB miss delays cache hit opportunity

[Diagram: Core, TLB, L1 $, Last-Level $; address labels: Virtual Address, Physical Address]

[1] Zhang et al. ICS 2010

SLIDE 3

Virtual Caching

  • Delay translation: Virtual Caching
  • Access cache, then translate on miss
  • Cache hits do not need translation
  • Problem: Synonyms
  • Synonyms are rare[2]
  • Optimize for the common case
  • TLB accesses reduced significantly
  • Loosen TLB access latency restriction
  • Possibility of sophisticated translation
  • Reduces power consumption

[Diagram: Core, L1 $, Last-Level $, TLB, with synonyms marked in the caches; address labels: Virtual Address, Physical Address]

[2] Basu et al. ISCA 2012
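The "access cache, then translate on miss" idea can be illustrated with a minimal sketch. The class name, the dictionary-based page table, and the 4 KiB page size are assumptions made for illustration; they are not taken from the slides, and the sketch ignores synonyms entirely.

```python
class VirtualCache:
    """Toy virtually tagged cache: translation happens only on a cache miss."""

    def __init__(self, page_table, page_size=4096):
        self.page_table = page_table  # hypothetical: virtual page -> physical page
        self.page_size = page_size    # assumed 4 KiB pages
        self.lines = {}               # virtual address -> cached data
        self.translations = 0         # number of translations performed

    def translate(self, vaddr):
        self.translations += 1
        vpn, offset = divmod(vaddr, self.page_size)
        return self.page_table[vpn] * self.page_size + offset

    def load(self, vaddr, memory):
        if vaddr in self.lines:        # cache hit: no translation needed
            return self.lines[vaddr]
        paddr = self.translate(vaddr)  # cache miss: translate, then fetch
        data = memory[paddr]
        self.lines[vaddr] = data
        return data


page_table = {0: 7}
memory = {7 * 4096 + 16: "x"}
cache = VirtualCache(page_table)
first = cache.load(16, memory)   # miss, one translation
second = cache.load(16, memory)  # hit, no further translation
```

After the two loads, `cache.translations` is 1, which is the effect the slide describes: hits bypass translation.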

SLIDE 4

Hybrid Virtual Caching

[Diagram panels: Virtual Caching (with synonyms marked), Physical Caching, and Hybrid Virtual Caching. Component labels: Core, L1 $, Last-Level $, TLB, Delayed TLB, Scalable Delayed Translation; address labels: Virtual Address, Physical Address]

SLIDE 5

Contributions

  • Propose hybrid virtual physical caching
  • Cache populated by both virtual and physical blocks
  • Virtual cache for common case, physical for synonyms
  • Synonyms not confined to fixed address range, use entire cache
  • Propose scalable yet flexible delayed translation
  • Improve TLB entry scalability by employing segments [2][3]
  • Provide many segments for flexibility of memory management
  • Propose efficient search mechanism to look up segments

[2] Basu et al. ISCA 2013 [3] Karakostas, Gandhi et al. ISCA 2015

SLIDE 6

Hybrid Virtual Caching

  • Virtual and physical cache
  • Each page consistently determined as physical or virtual
  • Cache tags hold either a virtual or a physical tag
  • Challenge: Choose address before cache access
  • Synonym Filter: Bloom Filter that detects synonyms
  • HW managed by OS
  • Synonyms always detected, translated to physical address

[Diagram: Core, L1 $, Last-Level $, Delayed TLB; paths labeled Non-Synonyms and Synonyms]
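A Bloom filter fits the "synonyms always detected" requirement because it has no false negatives: a page that was inserted is always reported, while a page that was never inserted may occasionally be reported by mistake. The sketch below illustrates that property only. The hash function, the per-page granularity, the default sizes, and the names `SynonymFilter` and `choose_cache_address` are assumptions for illustration, not the hardware design from the talk.

```python
import hashlib


class SynonymFilter:
    """Toy Bloom filter over virtual page numbers (illustrative only)."""

    def __init__(self, bits=1024, hashes=2):
        self.bits = bits
        self.hashes = hashes
        self.array = [False] * bits

    def _positions(self, vpn):
        # Assumed hashing scheme: one SHA-256 digest per hash index.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{vpn}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def mark_synonym(self, vpn):
        # In the slides the OS manages the filter; here it is a plain method.
        for pos in self._positions(vpn):
            self.array[pos] = True

    def may_be_synonym(self, vpn):
        # True for every marked page; may also be True for an unmarked page.
        return all(self.array[pos] for pos in self._positions(vpn))


def choose_cache_address(vaddr, synonym_filter, translate, page_size=4096):
    """Pick the address used to access the cache, before the access happens."""
    vpn = vaddr // page_size
    if synonym_filter.may_be_synonym(vpn):
        return ("physical", translate(vaddr))  # synonym candidate: translate first
    return ("virtual", vaddr)                  # common case: no translation


def translate(vaddr):
    return vaddr + 0x100000  # hypothetical fixed mapping for the demo


filt = SynonymFilter()
filt.mark_synonym(5)
synonym_access = choose_cache_address(5 * 4096 + 8, filt, translate)
empty_filter_access = choose_cache_address(8, SynonymFilter(), translate)
```

A false positive only costs an unnecessary translation; correctness is preserved because a real synonym is never sent down the virtual path.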

SLIDE 7

Hybrid Virtual Caching Efficiency

  • Pin-based simulation
  • Baseline TLB
  • L1 TLB: 64 entries
  • L2 TLB: 1024 entries
  • Hybrid Virtual Caching
  • 2x1Kb Synonym filters
  • Synonym TLB: 64 entries
  • Delayed TLB: 1024 entries
  • Workloads
  • Apache, Ferret, Firefox, Postgres, SpecJBB

[Diagram: Hybrid Virtual Caching with Core, L1 $, Last-Level $, Delayed TLB; address labels: Virtual Address, Physical Address]

SLIDE 8

Hybrid Virtual Caching Efficiency

[Diagram: Hybrid Virtual Caching with Core, L1 $, Last-Level $, Delayed TLB; address labels: Virtual Address, Physical Address]

Synonym Filter

  • 83.7~99.9% TLB accesses bypassed

Delayed Translation

  • Up to 99.9% TLB access reduction
  • Up to 69.7% TLB miss reduction

SLIDE 9

Hybrid Virtual Caching Efficiency

Majority of accesses go to the virtual cache

Cache hits remove TLB accesses and reduce TLB misses

SLIDE 10

Limitation of Delayed TLB

  • TLB entries limited in scalability
  • Each entry maps fixed granularity
  • Increasing TLB size does not reduce misses as expected

[Figure: normalized TLB MPKI (%) for tigr, mcf, milc, and GUPS with TLB sizes of 1K, 2K, 4K, 8K, 16K, 32K, and 64K entries]

SLIDE 11

Limitation of Delayed TLB

TLB size is restricted; improve the coverage of each TLB entry
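The coverage argument is simple arithmetic: with fixed-granularity entries, the memory a TLB can map is the entry count times the page size. The sketch below works this out for a few of the TLB sizes shown in the figure, assuming 4 KiB pages (the page size is an assumption; the slides do not state it).

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_coverage_bytes(entries, page_size):
    """Memory mapped by a TLB whose entries each cover one fixed-size page."""
    return entries * page_size


for entries in (1 * KIB, 16 * KIB, 64 * KIB):
    coverage_mib = tlb_coverage_bytes(entries, 4 * KIB) // MIB
    print(f"{entries:>6} entries x 4 KiB pages = {coverage_mib} MiB")
```

Under this assumption even a 64K-entry TLB covers only 256 MiB, which is why the slides turn to entries that each cover more memory.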

SLIDE 12

Segments: Scalable Translation

  • Direct Segment[2] improves TLB entry coverage
  • Represented by three values (base, limit, offset)
  • Translates contiguous memory of any size

[2] Basu et al. ISCA 2013 [3] Karakostas, Gandhi et al. ISCA 2015

[Diagram: Virtual Address Space with Base and Limit, mapped to Physical Address Space by Offset]
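A segment described by (base, limit, offset) can be sketched in a few lines. The sketch assumes the limit is exclusive and that the offset is the difference between the physical and virtual start of the region; the `Segment` type and the example addresses are made up for illustration.

```python
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    base: int    # first virtual address covered
    limit: int   # one past the last virtual address covered (assumed exclusive)
    offset: int  # added to a virtual address to form the physical address

    def translate(self, vaddr: int) -> Optional[int]:
        """Return the physical address, or None if vaddr is outside the segment."""
        if self.base <= vaddr < self.limit:
            return vaddr + self.offset
        return None


# One segment covering 1 GiB of contiguous virtual memory (example values).
seg = Segment(base=0x10000000, limit=0x50000000, offset=0x20000000)
inside = seg.translate(0x10000040)
outside = seg.translate(0x50000000)
```

A single entry of this form translates a contiguous region of any size, whereas a page-based TLB entry always covers exactly one page.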

SLIDE 13

Segments: Scalable Translation

  • OS benefits from more available segments
  • Memory sharing among processes fragments memory
  • OS can offer multiple smaller segments
  • Number of segments[3] limited by latency
  • Segment lookup between Core and L1 cache
  • Fully-associative lookup of all segments required

[2] Basu et al. ISCA 2013 [3] Karakostas, Gandhi et al. ISCA 2015

SLIDE 14

Scalable Delayed Translation

  • Exploit reduced frequency of delayed translation
  • Prior work limited to 10s of segments
  • Provide 1000s of segments for OS Flexibility
  • Efficient searching of owner segment required
  • OS managed tree that locates segment in a HW table
  • HW walker that traverses tree to acquire location
  • Use location (index) to access segment in HW table

[Diagram labels: 32 Segments, 1000s Segments, Delay Translation]

SLIDE 15

Scalable Delayed Translation

[Diagram: LLC Miss (Non-synonym), Segment Table, Memory Access. Segment Table columns: Index, Base, Limit, Offset, etc.; rows 1, 2, 3, 4, …]

Segment Table: register values for many segments

Infeasible to search all Segment Table entries

SLIDE 16

Scalable Delayed Translation

[Diagram: LLC Miss (Non-synonym), Index Tree, Segment index, Segment Table, Memory Access]

Index Tree: B-tree that holds the following mapping

  • key: virtual address
  • value: index to Segment Table

SLIDE 17

Scalable Delayed Translation

[Diagram: LLC Miss (Non-synonym), HW Walker, Index Cache, Index Tree (Traverse tree), Segment index, Segment Table, Memory Access]

Index Cache: caches index tree nodes on-chip

Hardware Walker: searches through the index tree to produce a segment table index
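The two-step lookup (index tree yields a segment table index, and the index selects the base/limit/offset row) can be sketched as follows. A sorted key list searched with `bisect` stands in for the B-tree here; that substitution, the table contents, and the function names are all assumptions for illustration, and the index cache and hardware walker are not modeled.

```python
from bisect import bisect_right

# Hypothetical segment table rows: (base, limit, offset), limit exclusive.
SEGMENT_TABLE = [
    (0x1000, 0x5000, 0x100000),
    (0x8000, 0x9000, 0x200000),
    (0x20000, 0x30000, -0x10000),
]

# Stand-in for the index tree: sorted segment bases (keys) and the
# segment table index each key maps to (values).
INDEX_KEYS = [0x1000, 0x8000, 0x20000]
INDEX_VALUES = [0, 1, 2]


def find_segment_index(vaddr):
    """Locate the owner segment without scanning the whole segment table."""
    pos = bisect_right(INDEX_KEYS, vaddr) - 1  # last key <= vaddr
    if pos < 0:
        return None
    index = INDEX_VALUES[pos]
    base, limit, _ = SEGMENT_TABLE[index]
    return index if base <= vaddr < limit else None


def translate(vaddr):
    index = find_segment_index(vaddr)
    if index is None:
        return None  # no segment owns this address
    return vaddr + SEGMENT_TABLE[index][2]
```

The search cost grows with the logarithm of the number of segments rather than linearly, which is the property that makes thousands of segments practical.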

SLIDE 18

Address Translation Procedure

[Diagram: LLC Miss (Non-synonym), Segment Cache (Hit / Miss), HW Walker, Index Cache, Index Tree (Traverse tree), Segment index, Segment Table, Memory Access]

Segment Cache: caches many segment translations

SLIDE 19

Address Translation Procedure

Segment Cache: caches many segment translations

Reduces latency and power consumption
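The procedure can be read as a fast path and a slow path: a segment cache hit translates immediately, and a miss falls back to the full walk and then fills the cache. The sketch below assumes LRU replacement and a linear-scan `slow_walk` as a placeholder for the tree walk; neither is stated in the slides. The 128-entry capacity matches the evaluation configuration.

```python
from collections import OrderedDict

# Hypothetical segments: (base, limit, offset), limit exclusive.
SEGMENTS = [(0x1000, 0x5000, 0x100000), (0x8000, 0x9000, 0x200000)]


def slow_walk(vaddr):
    """Placeholder for the index tree walk plus segment table access."""
    for base, limit, offset in SEGMENTS:
        if base <= vaddr < limit:
            return (base, limit, offset)
    return None


class SegmentCache:
    def __init__(self, capacity, walk):
        self.capacity = capacity
        self.walk = walk
        self.entries = OrderedDict()  # (base, limit) -> offset, oldest first
        self.hits = 0
        self.walks = 0

    def translate(self, vaddr):
        hit = next((k for k in self.entries if k[0] <= vaddr < k[1]), None)
        if hit is not None:                    # fast path
            self.entries.move_to_end(hit)      # assumed LRU policy
            self.hits += 1
            return vaddr + self.entries[hit]
        self.walks += 1                        # slow path
        segment = self.walk(vaddr)
        if segment is None:
            return None
        base, limit, offset = segment
        self.entries[(base, limit)] = offset
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used
        return vaddr + offset


cache = SegmentCache(capacity=128, walk=slow_walk)
results = [cache.translate(a) for a in (0x1004, 0x1008, 0x8000, 0x7000)]
```

Every hit avoids a walk, which is where the latency and power savings named on the slide would come from.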

SLIDE 20

Evaluation

  • Full system OoO simulation on Marssx86 + DRAMSim2
  • Hosts Linux with 4GB RAM (DDR3)
  • Three level cache hierarchy (based on Intel CPUs)
  • Baseline TLB configurations (based on Intel Haswell)
  • L1 TLB: 1 cycle, 64 entry, 4-way
  • L2 TLB: 7 cycle, 1024 entry, 8-way
  • Delayed TLB configurations range 1K - 16K entry
  • Many segment translation configurations
  • Segment Table: 2K entries
  • Index Cache: 32KB
  • Segment Cache: 128 entry
  • Benchmarks: SPECCPU, NPB, biobench, gups


SLIDE 21

Results

[Figure: IPC normalized to baseline TLB (%) for bzip2, DC, gamess, perlbench, cactusADM, astar, LU, and gromacs; series: Delayed TLB with 1K, 4K, and 16K entries, and Many Segment Translation]

SLIDE 22

Results

Cache hits reduce TLB accesses and misses, improving performance

SLIDE 23

Results

[Figure: IPC normalized to baseline TLB (%) for milc, CG, gcc, sjeng, xalancbmk, hmmer, soplex, mcf, omnetpp, sphinx3, gups, tigr, and Geomean; series: Delayed TLB with 1K, 4K, and 16K entries, and Many Segment Translation; value labels 143 and 179 exceed the plotted axis range]

  • Delayed TLB is not scalable for these workloads
  • Delayed TLB offers some scalability

SLIDE 24

Results

  • Scalable Delayed Translation improves performance by 10.7% on average
  • Power consumption is reduced by 60% on average
  • Increased translation scalability significantly reduces TLB misses

SLIDE 25

Conclusion

  • Hybrid Virtual Cache allows delaying address translation
  • Majority of memory accesses use virtual caching, synonyms use physical caching
  • Synonym Filter consistently and quickly identifies accesses to synonym pages
  • Reduces TLB accesses by up to 99.9% and TLB misses by up to 69.7%
  • Scalable delayed translation
  • Exploits reduced translations
  • Provides many segments and efficient segment searching
  • Average 10.7% performance improvement, 60% power saving