Adapting Cache Partitioning Algorithms to Pseudo-LRU Replacement Policies



SLIDE 1

IPDPS, April 2010, Kamil Kedzierski, kkedzier@ac.upc.edu

Adapting Cache Partitioning Algorithms to Pseudo-LRU Replacement Policies

Kamil Kedzierski (1,3), Miquel Moreto (1,3), Francisco J. Cazorla (2,3), Mateo Valero (1,3)

(1) Technical University of Catalonia
(2) Spanish National Research Council
(3) Barcelona Supercomputing Center

SLIDE 2

Chip Multiprocessors (CMPs)

CMPs are a good representative of the transition from ILP to TLP

Current CMPs share the Last Level Cache (LLC)

  • Pros: Better utilization than a private LLC, which translates into improved performance
  • Cons: The LLC has been identified as a source of contention between threads

Cache competition may lead to performance degradation

Cache Partitioning Algorithms (CPAs) control the interaction between threads

  • CPAs can deliver a flexible and easy-to-manage infrastructure to control threads' behavior in shared caches
  • CPAs have become the central element of current QoS frameworks for CMPs

SLIDE 3

Cache Partitioning Algorithms

We focus on dynamic CPAs

  • Execution is divided into time intervals
  • At each interval boundary, we select a new cache partition based on the behavior in the previous interval(s)

The cache is partitioned at way granularity

  • Each thread is assigned a number of ways between 1 and A – N, where A is the associativity and N is the number of cores

Main components of CPAs

  • Profiling logic
  • Partitioning logic
  • Enforcement logic

SLIDE 4

Motivation

Limiting factors to implement CPAs in real processors

Size of the profiling logic (Auxiliary Tag Directory)

  • Its size can be similar to the size of the L1 cache
  • It has received significant attention
      • Sampled profiling logic
      • No profiling (check all cases and select the best-performing one)
  • We conclude that this problem has been solved

Replacement scheme

  • So far, solutions focus on the LRU replacement scheme
  • LRU has a high implementation cost
  • High-associativity caches use pseudo-LRU schemes
  • It has not been shown how current CPAs work with pseudo-LRU
  • Problem not solved

SLIDE 5

Outline

  • Replacement schemes
  • Problem definition for pseudo-LRU schemes
  • Profiling for pseudo-LRU
  • Results
  • Conclusions

SLIDE 6

Outline

  • Replacement schemes
      • LRU: Least Recently Used
      • NRU: Not Recently Used (UltraSPARC) (pLRU)
      • BT: Binary Tree (IBM) (pLRU)
  • Problem definition for pseudo-LRU schemes
  • Profiling for pseudo-LRU
  • Results
  • Conclusions

SLIDE 7

Least Recently Used (LRU)

Hit

  • The hit line is promoted to the MRU position
  • Each line that is between the MRU line and the hit line increments its LRU bits
  • In the worst case, the positions of all the lines are updated

Miss

  • Search for value 3 in the corresponding replacement bits (the LRU position in the 4-way example)
  • Promote the line to the MRU position and set its bits to 0
  • Increase all the other bits

[Figure: a 4-way set holding lines A, B, C and D with their LRU stack positions (MRU to LRU), shown before and after a hit on B and a miss on E.]
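The hit and miss updates listed for LRU can be sketched for a single set as follows. This is a minimal illustration, not the authors' hardware design: the function name `lru_access` and the list-based layout (`tags[w]` is the line stored in way `w`, `pos[w]` is its stack position, with 0 meaning MRU) are assumptions made for the example.

```python
def lru_access(tags, pos, tag):
    """Access `tag` in one cache set and update the LRU state in place.

    tags[w] -- line stored in way w (hypothetical representation)
    pos[w]  -- stack position of way w: 0 = MRU, len(tags) - 1 = LRU
    Returns True on a hit, False on a miss.
    """
    assoc = len(tags)
    hit = tag in tags
    if hit:
        way = tags.index(tag)
    else:
        # Miss: the victim is the way whose position equals A - 1.
        way = pos.index(assoc - 1)
        tags[way] = tag
    old = pos[way]
    # Every line that was closer to MRU than the accessed line moves back
    # by one. On a miss, old == A - 1, so all other lines are incremented.
    for w in range(assoc):
        if pos[w] < old:
            pos[w] += 1
    pos[way] = 0  # the accessed line becomes MRU
    return hit


tags = ["A", "B", "C", "D"]
pos = [0, 1, 2, 3]           # A is MRU, D is LRU
lru_access(tags, pos, "B")   # hit on B
lru_access(tags, pos, "E")   # miss: E replaces the LRU line D
```

Note that a hit may touch up to A position fields, which is the update cost the slide refers to.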

SLIDE 8

Not Recently Used (NRU)

Hit

  • Set the corresponding used bit to 1
  • If this causes all used bits to be 1, reset all the other bits

Miss

  • Start looking for a victim at the position pointed to by the replacement pointer
  • Search for a used bit equal to 0
  • Set the corresponding used bit to 1
  • If this causes all used bits to be 1, reset all the other bits
  • Rotate the replacement pointer forward one way

[Figure: used bits of a 4-way set holding lines A, B, C and D, together with the replacement pointer, shown before and after a hit on B and a miss on E.]
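The NRU hit and miss steps can be sketched in the same style. The names `nru_touch` and `nru_access` are hypothetical, and the replacement pointer is passed in and returned so that a caller can share it between sets; the exact hardware organization is not specified here, so treat this as an illustration under those assumptions.

```python
def nru_touch(used, way):
    """Set the used bit of `way`; if all bits are now 1, reset the others."""
    used[way] = 1
    if all(used):
        for w in range(len(used)):
            if w != way:
                used[w] = 0


def nru_access(tags, used, ptr, tag):
    """Access `tag` in one set. Returns (hit, new_replacement_pointer)."""
    assoc = len(tags)
    if tag in tags:
        nru_touch(used, tags.index(tag))
        return True, ptr
    # Miss: scan from the replacement pointer for a used bit equal to 0.
    for i in range(assoc):
        way = (ptr + i) % assoc
        if used[way] == 0:
            break
    else:
        way = ptr  # fallback (assumption): only reached if every bit is 1
    tags[way] = tag
    nru_touch(used, way)
    # Rotate the replacement pointer forward one way.
    return False, (ptr + 1) % assoc


tags = ["A", "B", "C", "D"]
used = [1, 0, 0, 0]
ptr = 0
hit, ptr = nru_access(tags, used, ptr, "B")  # hit: B's used bit is set
hit, ptr = nru_access(tags, used, ptr, "E")  # miss: first way with used bit 0 is evicted
```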

SLIDE 9

Binary Tree (BT)

Hit

  • Update the corresponding bits so that they point to the MRU position

Miss

  • Update the corresponding bits so that they point to the MRU position

[Figure: binary-tree bits of a 4-way set holding lines A, B, C and D, with the p-LRU and MRU positions marked, shown before and after a hit on B and a miss on E.]
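A binary-tree scheme of this kind can be sketched as below. Several details are assumptions for illustration only: the A - 1 bits are stored in heap order (`bits[0]` is the root), a bit value of 0 means "the MRU side is the left half", the victim is found by walking in the direction opposite to each bit, and the associativity is a power of two. The slides only state that the bits are updated to point to the MRU position.

```python
def bt_touch(bits, way, assoc):
    """Make every tree node on the path to `way` point at it."""
    node, lo, hi = 0, 0, assoc
    while hi - lo > 1:
        mid = (lo + hi) // 2
        go_right = way >= mid
        bits[node] = 1 if go_right else 0
        node = 2 * node + (2 if go_right else 1)
        lo, hi = (mid, hi) if go_right else (lo, mid)


def bt_victim(bits, assoc):
    """Walk away from the MRU side at every node to reach the pseudo-LRU way."""
    node, lo, hi = 0, 0, assoc
    while hi - lo > 1:
        mid = (lo + hi) // 2
        go_right = bits[node] == 0  # bit points left, so the victim is on the right
        node = 2 * node + (2 if go_right else 1)
        lo, hi = (mid, hi) if go_right else (lo, mid)
    return lo


def bt_access(tags, bits, tag):
    """Access `tag` in one set. Returns True on a hit, False on a miss."""
    assoc = len(tags)
    hit = tag in tags
    way = tags.index(tag) if hit else bt_victim(bits, assoc)
    tags[way] = tag
    bt_touch(bits, way, assoc)  # same bit update on hit and on miss
    return hit


tags = ["A", "B", "C", "D"]
bits = [0, 0, 0]             # three bits for a 4-way set, all pointing at way 0
bt_access(tags, bits, "B")   # hit
bt_access(tags, bits, "E")   # miss: the pseudo-LRU way is replaced
```

Only the log2(A) bits on one root-to-leaf path change per access, which is why the update is cheaper than in LRU.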

SLIDE 10

Summary

Replacement bits as a function of the associativity A:

Scheme   Replacement bits
LRU      A · log2(A)
NRU      A
BT       A - 1

[Figure: replacement bits (log scale, 1 to 1000) versus associativity (2 to 64) for LRU, NRU and BT.]

  • LRU requires more replacement bits
  • LRU requires more information to update
  • Current processors available on the market use pseudo-LRU replacement policies
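To make the comparison concrete, the three formulas can be evaluated for a given associativity. The helper name `replacement_bits` is hypothetical, and the sketch assumes the associativity is a power of two.

```python
def replacement_bits(assoc):
    """Replacement bits for each scheme at associativity `assoc`."""
    log2_assoc = assoc.bit_length() - 1  # exact log2 for powers of two
    return {
        "LRU": assoc * log2_assoc,  # A * log2(A)
        "NRU": assoc,               # A
        "BT": assoc - 1,            # A - 1
    }


for a in (2, 4, 8, 16, 32, 64):
    print(a, replacement_bits(a))
```

For the 16-way cache used later in the evaluation, this gives 64 bits for LRU against 16 for NRU and 15 for BT.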

SLIDE 11

Outline

  • Replacement schemes
  • Problem definition for pseudo-LRU schemes
      • Cache Partitioning Algorithms
      • Profiling Logic
  • Profiling for pseudo-LRU
  • Results
  • Conclusions

SLIDE 12

Cache Partitioning Algorithms

[Figure: Core 0 and Core 1, each with an I$ and a D$, sharing an L2 cache, together with Profiling Logic 0, Profiling Logic 1, the Partitioning Logic and the Enforcement Logic.]

Profiling Logic

  • Observes each thread's behavior in the L2 cache

Partitioning Logic

  • Decides how to partition the cache
  • We use way partitioning

Enforcement Logic

  • Puts the partitions into practice

SLIDE 13

Profiling Logic for LRU

[Figure: Core 0 and Core 1, each with an I$ and a D$, sharing an L2 cache; an ATD and an SDH per core, the Partitioning Logic and the Enforcement Logic.]

Auxiliary Tag Directory (ATD)

  • Separate copy of the tag directory with the same associativity
  • Simulates single-threaded behavior
  • On every cache access, reports the LRU stack position to the SDH

Stack Distance Histogram (SDH)

  • Gathers stack positions
  • Allows us to derive the miss curve of a thread as a function of the number of ways assigned to it

SLIDE 14

Profiling Background for LRU

Building the SDH from the ATD content (1 set)

[Figure: contents of one 4-way ATD set under LRU across the accesses C, D, D; each access adds +1 to one of the SDH registers r0 to r4.]

Building the miss curve

Ways   Misses
4      r4
3      r3 + r4
2      r2 + r3 + r4
1      r1 + r2 + r3 + r4
0      r0 + r1 + r2 + r3 + r4

[Figure: miss curve (misses versus ways, 1 to 4) derived from the SDH registers.]
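The step from the SDH to the miss curve is a suffix sum over the registers, which can be sketched as follows. The function name `miss_curve` is hypothetical, and the sketch assumes the common convention that registers r0 to r(A-1) count hits at each stack position (r0 = MRU) while the last register rA counts ATD misses.

```python
def miss_curve(sdh):
    """Derive misses as a function of assigned ways from an SDH.

    sdh[i] for i < A -- accesses that hit at stack position i (0 = MRU)
    sdh[A]           -- accesses that missed in the ATD (assumed convention)
    Returns a list m where m[w] is the number of misses with w ways.
    """
    assoc = len(sdh) - 1
    # With w ways, every access at stack position w or deeper is a miss.
    return [sum(sdh[w:]) for w in range(assoc + 1)]


sdh = [5, 3, 2, 1, 4]    # illustrative values for r0..r4, not measured data
curve = miss_curve(sdh)  # curve[4] == r4, curve[3] == r3 + r4, and so on
```

Because the curve is a sum of non-negative counters, it never increases as more ways are assigned.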

SLIDE 15

Profiling in pseudo-LRU?

[Figure: the same access to B in a 4-way set holding lines A, B, C and D, shown for LRU, NRU and BT.]

... but what is the stack position?

Scheme   Stack position of B
LRU      1
NRU      don't know
BT       don't know

SLIDE 16

Outline

  • Replacement schemes
  • Problem definition for pseudo-LRU schemes
  • Profiling for pseudo-LRU
      • NRU scheme
      • BT scheme
      • Limitations
  • Results
  • Conclusions

SLIDE 17

Profiling in NRU

[Figure: used bits in a 4-way ATD using NRU for three consecutive accesses, shown for the access sequences C, D, D and A, B, C. The arrows point to the line of the last access, with the estimated stack distance next to it.]

Count the number of used bits equal to 1 (U)

  • If the current used bit = 1, the stack distance is between 1 and U
  • If the current used bit = 0, the stack distance is between U+1 and A
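The bounding rule for NRU can be written down directly. In this sketch the function name `nru_distance_range` is hypothetical, and the used bits are assumed to be read before the access updates them; the source does not state which snapshot is used.

```python
def nru_distance_range(used, way):
    """Bound the stack distance of a hit on `way` from the NRU used bits.

    used -- used bits of the ATD set, read before the access (assumption)
    way  -- way that holds the accessed line
    Returns (low, high), an inclusive range for the stack distance.
    """
    assoc = len(used)
    u = sum(used)  # U: number of used bits equal to 1
    if used[way] == 1:
        return (1, u)         # recently used: among the U most recent lines
    return (u + 1, assoc)     # not recently used: somewhere beyond them


print(nru_distance_range([1, 1, 0, 0], 0))
print(nru_distance_range([1, 1, 0, 0], 2))
```

The range is all that NRU state can provide, which is why a single estimate has to be picked from it before updating the SDH.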

SLIDE 18

Profiling in BT

[Figure: estimated SDH profiling for BT.]

[Figure: decoder for ID bits extraction from the way number.]

SLIDE 19

Limitations

Two stacks with the same BT bits affect profiling accuracy

Over- vs. under-estimation of the position in the pseudo-LRU stack

We evaluate three scaling factors (in the example, four used bits are equal to "1"):

  • 1.0 x used bits equal to "1": assume stack distance 4
  • 0.75 x used bits equal to "1": assume stack distance 3
  • 0.5 x used bits equal to "1": assume stack distance 2

[Figure: example sets for NRU and BT, holding lines A to D and E to H with their replacement bits.]
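The three scaling factors can be reproduced with a one-line estimate. The helper name `scaled_stack_distance` is hypothetical, and rounding up with a floor of 1 is an assumption: the slide's example (four used bits) gives whole numbers for all three factors, so it does not show how fractional results are handled.

```python
import math


def scaled_stack_distance(used_ones, scale):
    """Estimate a stack distance as scale x (number of used bits equal to 1).

    Rounding up and clamping to at least 1 are assumptions for this sketch.
    """
    return max(1, math.ceil(scale * used_ones))


# Example from the slide: four used bits equal to "1".
estimates = [scaled_stack_distance(4, s) for s in (1.0, 0.75, 0.5)]
```

A factor of 1.0 always picks the upper bound of the range and therefore tends to over-estimate the distance, while smaller factors pull the estimate toward the MRU end.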

SLIDE 20

Outline

  • Replacement schemes
  • Problem definition for pseudo-LRU schemes
  • Profiling for pseudo-LRU
  • Results
  • Conclusions

SLIDE 21

Without Cache Partitioning

[Figure: performance of LRU, NRU and BT for 1-, 2-, 4- and 8-core CMPs using a 16-way 2MB L2 cache with 128-byte lines.]

Is it worth developing a complex, area-expensive, power-hungry LRU replacement for high-associativity caches to win 2% - 5% in performance?

SLIDE 22

Cache partitioning

  • Analysis done for a 16-way 2MB L2 cache with 128-byte lines
  • Counters (any K ways out of A) vs. masks (specific K ways out of A)
  • Negligible difference for a sampling interval of 1 million cycles

SLIDE 23

Cache partitioning

  • Analysis done for a 16-way 2MB L2 cache with 128-byte lines
  • We select the 0.75 factor as the winner
  • Random-like NRU replacement evicts data that is not the least recently used
      • One replacement pointer for all the sets
      • The effect becomes significant as the number of cores increases

SLIDE 24

Cache partitioning

  • Analysis done for a 16-way 2MB L2 cache with 128-byte lines
  • Alternating nodes do not evict the least recently used line
      • Misses are not evenly distributed within a partition
      • The effect becomes significant as the number of cores increases

[Figure: binary tree with nodes BT0, BT1 and BT2 over lines A, B, C and D, annotated with 50%, 25% and 25%.]

SLIDE 25

Outline

  • Replacement schemes
  • Problem definition for pseudo-LRU schemes
  • Profiling for pseudo-LRU
  • Results
  • Conclusions

SLIDE 26

Conclusions

We propose a complete partitioning design that targets two pseudo-LRU replacement policies

  • Not Recently Used, implemented in the L2 cache of the commercially available UltraSPARC T1/T2 processors
  • Binary Tree, proposed by IBM

We identify profiling logic as the main source of the so-far lack of CPA implementations

The results show a negligible performance degradation with respect to the LRU-based CPA

  • For NRU, our design loses as much as 0.3%, 3.6% and 7.3% throughput for 2-, 4- and 8-core CMP architectures, respectively
  • For BT, the proposal degrades throughput by 1.4%, 3.4% and 9.7%, respectively

SLIDE 27

Thank you Q & A
