Comparison of Cache Replacement Policies using GEM5 Simulator - PowerPoint Presentation

SLIDE 1

Comparison of Cache Replacement Policies using GEM5 Simulator

Teammates

  • Bhagyashree
  • Nivin
  • Sri Divya
SLIDE 2

Performance Bottleneck

  • The performance gap between CPU and DRAM has increased drastically.
  • This leads to a producer-consumer problem.
  • How do we fix this?
SLIDE 3

Ideal Cache

What are we looking for?

  • A cache that is big and fast.
  • Provides good temporal & spatial locality.
  • Cheap to buy.
  • Easy to construct.
SLIDE 4

Big and Fast Memory

  • We introduced the memory hierarchy to get memory that is both big and fast.
  • Now a process can have memory as large as an HDD and nearly as fast as registers.
  • For simplicity, let's look at the memory hierarchy as a sequence <1, 2, 3, 4>.
  • A lower number denotes a faster cache level.

SLIDE 5

More about memory

Let's model the problem as producer and consumer: the processor is the consumer and memory is the producer.

  • Consider the sequence <1, 2, 3, 4> in increasing order of memory level.
  • Seq1: <1, 2, 3, 4> makes perfect sense.
  • Seq2: <4, 3, 2, 1> is equivalent to just <4>, right?
  • Seq3: <1, 3, 2, 4> - does this improve performance?
SLIDE 6

Research Idea 1: Increase the cache levels

  • The previous discussion suggested why the number of cache levels is limited to 3 or 4.
  • Is cost the only factor stopping us from adding more?

Level 1 | Level 2 | Level 3 | ... | Level n

SLIDE 7

Cache Replacement Policy

  • When the cache is full, the cache replacement policy (CRP) decides which block is best to discard.
  • More about locality:
  • Temporal - "a resource that is referenced at one point in time will be referenced again sometime in the near future." - e.g. a web cache.
  • Spatial - "the likelihood of referencing a resource is higher if a resource near it was just referenced." - e.g. matrix multiplication.
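Spatial locality in matrix code can be made concrete with a toy model (my own illustration, not from the slides; `LINE` and `N` are arbitrary values): count cache-line fetches while touching every element of a row-major N x N matrix once, with a "cache" that holds a single line.

```python
# Toy model of spatial locality: a one-line cache over a row-major matrix.
LINE = 8   # elements per cache line (illustrative)
N = 32     # matrix dimension (illustrative)

def line_fetches(order):
    """Count how often the traversal moves to a different cache line."""
    fetched, last = 0, None
    for i, j in order:
        line = (i * N + j) // LINE   # line holding element (i, j)
        if line != last:             # leaving the current line costs a fetch
            fetched += 1
            last = line
    return fetched

row_major = [(i, j) for i in range(N) for j in range(N)]
col_major = [(i, j) for j in range(N) for i in range(N)]

print(line_fetches(row_major))  # 128 = N*N/LINE: good spatial locality
print(line_fetches(col_major))  # 1024 = N*N: every access lands on a new line
```

The row-major walk touches each 8-element line 8 times in a row, so it fetches only N*N/LINE lines; the column-major walk jumps a whole row between accesses, so every access needs a new line.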

SLIDE 8

Research Idea 2: Cache replacement favors a specific locality

  • Is it valid to say LRU is more of a temporal-locality solver?
  • Simulation to the rescue.
  • What else contributes to spatial locality? Think of a larger block size.

Temporal | Spatial

SLIDE 9

Research Idea 3: Mixing cache replacement algorithms at different levels

  • Following from the previous idea, we want to mix cache replacement algorithms at different levels and study the performance.
  • Going back to the sequence <1, 2, 3, 4>, one can argue that a higher sequence number influences the rest.
  • E.g., take matrix multiplication: let level 1 favor spatial locality and level 2 favor temporal locality.

Level 1 - LRU | Level 2 - Random | Level 3 | ... | Level n
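A minimal sketch of the mixing idea (my own illustration, not the project's simulator): a toy fully-associative cache with a pluggable replacement policy, composed into a two-level hierarchy with LRU at level 1 and Random at level 2, as in the diagram above.

```python
import random

class ToyCache:
    """Tiny fully-associative cache with a pluggable replacement policy."""
    def __init__(self, capacity, policy):
        self.capacity, self.policy = capacity, policy
        self.blocks = []                       # most recently used at the end

    def access(self, block):
        if block in self.blocks:               # hit: refresh recency order
            self.blocks.remove(block)
            self.blocks.append(block)
            return True
        if len(self.blocks) >= self.capacity:  # miss on a full cache: evict
            self.blocks.remove(self.policy(self.blocks))
        self.blocks.append(block)
        return False

lru = lambda blocks: blocks[0]                 # least recently used is in front
rnd = lambda blocks: random.choice(blocks)

# Level 1 runs LRU and level 2 runs Random, as in the slide's diagram.
l1, l2 = ToyCache(4, lru), ToyCache(16, rnd)
hits = [0, 0]
for addr in [i % 8 for i in range(100)]:       # cyclic stream of 8 blocks
    if l1.access(addr):
        hits[0] += 1
    elif l2.access(addr):
        hits[1] += 1
```

On this cyclic stream the 4-entry LRU level never hits (the classic LRU pathology on loops larger than the cache), while level 2 holds all 8 blocks and catches everything after the first pass; since level 2 never evicts here, the Random policy makes the result deterministic.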

SLIDE 10

Cache Associativity

  • Fully associative - the best miss rate (one set holding all 2^k blocks).
  • Set associative - intermediate.
  • Direct-mapped - set size of 1.
  • Larger sets and higher associativity lead to fewer cache conflicts and lower miss rates, but they also increase the hardware cost.
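How associativity shows up in the address bits can be sketched as follows (the 64-byte block and 128-set geometry are illustrative assumptions, not from the slides):

```python
# Sketch: decompose an address into tag / set index / block offset.
def split_address(addr, block_size=64, num_sets=128):
    offset_bits = block_size.bit_length() - 1       # log2(64)  = 6
    index_bits = num_sets.bit_length() - 1          # log2(128) = 7
    offset = addr & (block_size - 1)                # byte within the block
    index = (addr >> offset_bits) & (num_sets - 1)  # which set to search
    tag = addr >> (offset_bits + index_bits)        # compared to stored tags
    return tag, index, offset
```

A direct-mapped cache has one way per set; a fully associative cache collapses to a single set, so the index field disappears and only tag and offset remain.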

SLIDE 11

Research Idea 4: Cache associativity

Intuition: use a higher set value for the lower cache levels; let's forget the cost for now. But does reversing it give better performance? Coming back to sequences: cache sequence <1, 2, 3, 4>, set sequence <N, N-1, N-2, ...>, and then the reverse of it.

SLIDE 12

Research Idea 5: Let's combine (Flow Diagram)

  • Let's represent all the above research ideas as a tree.
  • Compare each against OPT and study where each algorithm stands.

Measurement axes (from the flow diagram): cache performance, application, hardware, locality, cost, complexity, benchmark, set associativity, levels mix.

SLIDE 13

Common Cache Replacement Policies - LRU:

SLIDE 14

Common Cache Replacement Policies

  • LRU:
  • Expensive in terms of speed and hardware.
  • Needs to remember the order in which all N lines were accessed.
  • N! possible orderings - O(log N!) LRU bits.
  • 2 ways → AB, BA = 2 = 2!
  • 3 ways → ABC, ACB, BAC, BCA, CAB, CBA = 6 = 3!
  • Pseudo-LRU: O(N) bits.
  • Approximates the LRU policy with a binary tree.
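True LRU for one set can be sketched with an ordered map (a toy illustration of mine, not gem5's implementation), together with the slide's bit-count formula:

```python
from collections import OrderedDict
from math import ceil, factorial, log2

class LRUSet:
    """Toy N-way set under true LRU."""
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()             # order = recency, oldest first

    def access(self, tag):
        if tag in self.lines:
            self.lines.move_to_end(tag)        # hit: mark most recently used
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)     # evict the least recently used
        self.lines[tag] = True                 # miss: insert as most recent
        return False

# The slide's bit count: N lines have N! recency orderings, so true LRU
# needs ceil(log2(N!)) state bits per set.
lru_bits = lambda n: ceil(log2(factorial(n)))
# lru_bits(2) == 1, lru_bits(3) == 3, lru_bits(4) == 5
```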

SLIDE 15

Common Cache Replacement Policies - Pseudo-LRU:
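The binary-tree approximation can be sketched for a 4-way set (my own illustration; the slide itself showed a diagram). Three bits replace the 5 bits true LRU would need; the bit convention here (0 = victim in the left half, 1 = right half) is an assumption of this sketch.

```python
class TreePLRU:
    """Tree pseudo-LRU for one 4-way set: 3 bits instead of true LRU's 5."""
    def __init__(self):
        self.bits = [0, 0, 0]   # [root, left pair, right pair]

    def touch(self, way):
        """On an access, point every bit on the path away from `way`."""
        self.bits[0] = 1 if way < 2 else 0       # protect the used half
        if way < 2:
            self.bits[1] = 1 if way == 0 else 0  # protect the used way
        else:
            self.bits[2] = 1 if way == 2 else 0

    def victim(self):
        """Follow the bits to the approximately least recently used way."""
        if self.bits[0] == 0:
            return 0 if self.bits[1] == 0 else 1
        return 2 if self.bits[2] == 0 else 3
```

Touching ways 0, 1, 2, 3 in order leaves way 0 as the victim, matching true LRU; on longer, irregular access patterns the approximation can diverge from true LRU.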

SLIDE 16

Common Cache Replacement Policies - MRU:

  • In contrast to LRU, discards the most recently used items first.
  • MRU algorithms are most useful in situations where the older an item is, the more likely it is to be accessed, e.g. repeated sequential scans over data larger than the cache.

SLIDE 17

Common Cache Replacement Policies

Random Replacement

  • Simpler, but at the cost of performance.

Round Robin (or FIFO) Replacement

  • Replaces the oldest block in cache memory, using a circular counter.
  • Each cache memory set is accompanied by a circular counter which points to the next cache block to be replaced; the counter is updated on every cache miss.
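The circular counter described above is tiny to model (a sketch, not gem5's code):

```python
class RoundRobinSet:
    """Round-robin (FIFO) victim choice via the per-set circular counter."""
    def __init__(self, num_ways):
        self.num_ways = num_ways
        self.counter = 0                       # next block to replace

    def next_victim(self):
        """Called on a cache miss: return a victim and advance the counter."""
        victim = self.counter
        self.counter = (self.counter + 1) % self.num_ways
        return victim
```

A 4-way set yields victims 0, 1, 2, 3, 0, 1, ... on successive misses; no per-access state is updated on hits, which is what makes the policy cheap.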

SLIDE 18

Common Cache Replacement Policies: Adaptive Replacement Cache (ARC)

  • The cache is split into two LRU lists, T1 and T2, with |T1| + |T2| = C (the cache size).
  • B1 and B2 are ghost caches (metadata only, not in memory) tracking blocks recently evicted from T1 and T2.

SLIDE 19

Common Cache Replacement Policies

  • Clock with Adaptive Replacement (CAR):
  • Combines the advantages of ARC and CLOCK.
  • It uses 4 doubly-linked lists: two clocks (T1 & T2) and 2 simple LRU lists (B1 & B2).
  • The T1 clock stores pages based on "recency" and T2 stores pages based on "frequency".
  • B1 & B2 contain pages that have recently been evicted from T1 & T2, respectively.

SLIDE 20

Simulation and Benchmark

  • Why do we need a system simulator?
  • CPU behavior depends on the memory system, and the behavior of the memory system depends on the CPUs.
  • We chose gem5 over other simulators, as it is much easier to perform different measurements on cache replacement policies.
  • SPEC CPU benchmarks will be used for performance evaluation.

SLIDE 21

gem5 simulator

  • gem5 = M5 + GEMS
  • A modular, discrete-event-driven computer system simulator platform.
  • Rich availability of modules in the framework.

SLIDE 22

  • gem5 has an open-source license, a good object-oriented infrastructure, and a very active mailing list.
  • System modes - 1. System-call Emulation, 2. Full System
  • CPU models - 1. Atomic Simple, 2. Timing Simple, 3. In-order, 4. Out-of-order
  • Memory system - 1. Classic model (M5), 2. Ruby model (GEMS)
  • Supported ISAs - ALPHA, ARM, x86, PowerPC, SPARC, MIPS
SLIDE 23

  • Flexibility
  • Availability
  • Collaboration
  • Example - an L2 cache configured in gem5's Python scripts:

    class L2Cache(Cache):
        size = '256kB'
        assoc = 8
        hit_latency = 20
        response_latency = 20

  • Sample statistics output:

    system.cpu.apic_clk_domain.clock     16000   # Clock period in ticks
    system.cpu.numCycles                345518   # number of cpu cycles simulated

SLIDE 24

References

  • http://www.ece.uah.edu/~milenka/docs/milenkovic_acmse04r.pdf
  • http://www.cs.ucf.edu/~neslisah/final.pdf
  • http://www.cse.scu.edu/~mwang2/projects/Cache_replacement_10s.pdf
  • https://ece752.ece.wisc.edu/lect11-cache-replacement.pdf
  • https://en.wikipedia.org/wiki/Cache_replacement_policies
  • http://people.csail.mit.edu/emer/papers/2010.06.isca.rrip.pdf
  • http://www.cs.utexas.edu/users/mckinley/papers/evict-me-pact-2002.pdf
  • http://snir.cs.illinois.edu/PDF/Temporal%20and%20Spatial%20Locality.pdf
  • https://math.mit.edu/~stevenj/18.335/ideal-cache.pdf
  • https://pdfs.semanticscholar.org/6ebe/c8701893a6770eb0e19a0d4a732852c86256.pdf
  • https://www.youtube.com/watch?v=fD3hhNnfL6k
  • http://pages.cs.wisc.edu/~david/courses/cs752/Spring2015/gem5-tutorial/index.html
  • http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/
  • https://github.com/dependablecomputinglab/csi3102-gem5-new-cache-policy