Analytical Cache Models with Applications to Cache Partitioning




slide-1
SLIDE 1

Analytical Cache Models with Applications to Cache Partitioning

  • G. Edward Suh, Srinivas Devadas, and Larry Rudolph
  • LCS, MIT

slide-2
SLIDE 2

Motivation

  • Memory system performance is critical
  • Everyone thinks about their own application
  • But modern computer systems execute multiple applications concurrently/simultaneously
      • Context switches cause cold misses
      • Simultaneous applications compete for cache space
  • Caches should be managed more carefully, considering multiple processes
      • Explicit management of cache space => partitioning
      • Cache-aware job schedulers

slide-3
SLIDE 3

Related Work

  • Analytical Cache Models
      • Thiébaut and Stone (1987)
      • Agarwal, Horowitz and Hennessy (1989)
      • Both only focus on long time quanta
      • Inputs are hard to obtain on-line
  • Cache Partitioning
      • Stone, Turek and Wolf (1992)
      • Optimal cache partitioning for very short time quanta
  • Our Model & Partitioning
      • Work for any time quantum
      • Inputs are easier to obtain (possible to estimate on-line)

slide-4
SLIDE 4

Our Multi-tasking Cache Model

  • Input
      • $C$: cache size
      • Schedule: job sequences with time quantum ($T_A$)
      • $M_A(x)$: the miss rate as a function of cache size for Process A
  • Output
      • Overall miss-rate (OMR) for multi-tasking

[Figure: block diagram. The miss-rate curves $M_A(x)$ (miss-rate versus cache size), the cache size $C$, and the schedule feed the cache model, which outputs the overall miss rate.]

slide-5
SLIDE 5

Assumptions

  • The miss-rate of a process is a function of cache size alone, not time
      • One MR(size) per application
          • Curve is averaged over application lifetime
      • In cases of high variance
          • Split the application into phases
          • One MR(size) per phase
      • Generated off-line (or on-line with HW support)
  • No shared memory space among processes

slide-6
SLIDE 6

Assumptions: Cont.

  • Fully-associative caches
      • Extended to set-associative caches (memo 433)
      • The fully-associative model works for set-associative cache partitioning
  • LRU replacement policy
  • Time in terms of the number of memory references
      • The number of memory references can be easily converted to real time in a steady state

slide-7
SLIDE 7

Independent Footprint $x_A^\Phi(t)$

  • Independent footprint
      • The amount of data for Process A at time t starting from an empty cache, $x_A^\Phi(0) = 0$
      • Assume only one process executes
  • Changes
      • If hit, $x_A^\Phi(t+1) = x_A^\Phi(t)$
      • If miss, $x_A^\Phi(t+1) = \min[\, x_A^\Phi(t) + 1,\ C \,]$
  • If we approximate the real value of $x_A^\Phi(t)$ with its expectation:

$E[x_A^\Phi(t+1)] = \min[\, E[x_A^\Phi(t)] + P_A(t),\ C \,] = \min[\, E[x_A^\Phi(t)] + M_A(E[x_A^\Phi(t)]),\ C \,]$
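The expectation recurrence above can be iterated directly. The following is a minimal sketch, assuming the miss-rate curve is available as a Python callable; the names `independent_footprint`, `miss_rate`, and `steps` are illustrative and do not come from the slides.

```python
from typing import Callable, List


def independent_footprint(
    miss_rate: Callable[[float], float],
    cache_size: float,
    steps: int,
) -> List[float]:
    """Iterate E[x(t+1)] = min(E[x(t)] + M(E[x(t)]), C), starting from x(0) = 0.

    miss_rate(x) plays the role of M_A(x); time is counted in memory
    references, so each loop iteration is one reference.
    """
    x = 0.0
    trace = [x]
    for _ in range(steps):
        # Each reference adds, in expectation, one block with probability M(x).
        x = min(x + miss_rate(x), cache_size)
        trace.append(x)
    return trace
```

For example, a process that misses on every reference fills a 3-block cache after three references: `independent_footprint(lambda x: 1.0, 3, 5)` gives `[0.0, 1.0, 2.0, 3.0, 3.0, 3.0]`.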

slide-8
SLIDE 8

Dependent Footprint $x_A(t)$

  • Dependent footprint
      • The amount of data for Process A when multiple processes concurrently execute
      • Obtained from the given schedule and the independent footprint of all processes
  • Example
      • Four processes: A, B, C, D
      • Round-robin schedule: ABCDABCD…

slide-9
SLIDE 9

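Dependent Footprint $x_A(t)$: Cont.

[Figure: an infinite-size cache when Process A is executed for time t, drawn as time blocks ordered from MRU data to LRU data: $A_0$, $D_{-1}$, $C_{-1}$, $B_{-1}$, $A_{-1}$, $D_{-2}$, $C_{-2}$, $B_{-2}$, $A_{-2}$, $D_{-3}$, $C_{-3}$. Inset: the independent footprint of A versus time, marking $x_A^\Phi(t)$ at time $t$ and the increment $x_A^\Phi(t+T_A) - x_A^\Phi(t)$ between $t$ and $t+T_A$.]

  • Compute block sizes from left: $A_0$, $D_{-1}$, $C_{-1}$, $B_{-1}$, $A_{-1}$, $D_{-2}$, …
      • Use independent footprint
      • Until cache is full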

slide-10
SLIDE 10

Dependent Footprint $x_A(t)$: Cont.

[Figure: the same infinite-size cache when Process A is executed for time t, with blocks ordered from MRU data to LRU data and the cache size ($C$) marked on it.]

  • Case 1: dormant process' block is the LRU
      • $x_A(t) = A_0 + A_{-1} = x_A^\Phi(t+T_A)$

slide-11
SLIDE 11

Dependent Footprint $x_A(t)$: Cont.

[Figure: the same infinite-size cache when Process A is executed for time t, with blocks ordered from MRU data to LRU data and the cache size ($C$) marked on it.]

  • Case 1: dormant process' block is the LRU
      • $x_A(t) = A_0 + A_{-1} = x_A^\Phi(t+T_A)$
  • Case 2: active process' block is the LRU
      • $x_A(t) = C - (D_0 + C_0 + B_0 + D_{-1} + C_{-1} + B_{-1}) = C - x_D^\Phi(T_D) - x_C^\Phi(T_C) - x_B^\Phi(T_B)$
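The block-by-block procedure (compute block sizes from the MRU side using independent footprints until the cache is full) can be sketched as follows. This is one reading of the slides, not the authors' implementation: it assumes a round-robin schedule in index order, assumes each block's size is the growth of its owner's independent footprint over the corresponding stretch of execution, and lets the last (LRU) block be cut off at the cache boundary, which covers both cases above. The function and parameter names are hypothetical.

```python
from typing import Callable, Sequence


def dependent_footprint(
    footprints: Sequence[Callable[[float], float]],
    quanta: Sequence[float],
    active: int,
    t: float,
    cache_size: float,
) -> float:
    """Estimate the data held by the active process, t references into its quantum.

    footprints[i](tau) is the independent footprint of process i after tau
    references; processes are assumed to run round-robin in index order.
    """
    n = len(footprints)
    elapsed = [0.0] * n   # references of each process already turned into blocks
    used = 0.0            # cache space filled so far, starting from the MRU side
    own = 0.0             # the share of that space owned by the active process
    proc, span = active, t
    stalled = 0           # consecutive blocks that added no data
    while used < cache_size and stalled <= n:
        grown = footprints[proc](elapsed[proc] + span) - footprints[proc](elapsed[proc])
        block = min(grown, cache_size - used)  # the LRU block may be cut off
        elapsed[proc] += span
        used += block
        if proc == active:
            own += block
        stalled = 0 if block > 0 else stalled + 1
        # The process that ran just before this one owns the next block.
        proc = (proc - 1) % n
        span = quanta[proc]
    return own
```

The `stalled` counter is a safeguard added for this sketch: if every footprint has stopped growing, the cache never fills and the walk would otherwise not terminate.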

slide-12
SLIDE 12

Computing the Miss Probability: $P_A(t)$

  • Effective cache size
      • $x_A(t)$: the amount of data in the cache for Process A at time t
  • The probability to miss at time t
      • $P_A(t) = M_A(x_A(t))$

[Figure: the cache at time t, split between Process A's data, $x_A(t)$, and other processes' data; reading the miss-rate curve $M_A(x)$ at cache size $x_A(t)$ gives $P_A(t)$.]
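Evaluating the miss-rate curve at the effective cache size requires a value between profiled points. A minimal sketch, assuming the curve was profiled at a few cache sizes and that linear interpolation between them is acceptable (the slides do not say how the curve is stored); `make_miss_rate_curve` is a hypothetical helper.

```python
from bisect import bisect_right
from typing import Callable, Sequence


def make_miss_rate_curve(
    sizes: Sequence[float],
    rates: Sequence[float],
) -> Callable[[float], float]:
    """Build M(x) from profiled (cache size, miss rate) points.

    sizes must be strictly increasing; values outside the profiled range
    are clamped to the nearest end point.
    """
    def miss_rate(x: float) -> float:
        if x <= sizes[0]:
            return rates[0]
        if x >= sizes[-1]:
            return rates[-1]
        k = bisect_right(sizes, x)  # sizes[k-1] <= x < sizes[k]
        x0, x1 = sizes[k - 1], sizes[k]
        r0, r1 = rates[k - 1], rates[k]
        return r0 + (r1 - r0) * (x - x0) / (x1 - x0)

    return miss_rate
```

With `M = make_miss_rate_curve([0, 100, 200], [1.0, 0.5, 0.25])`, the miss probability for an effective cache size of 150 blocks is `M(150)`, which is 0.375.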

slide-13
SLIDE 13

Estimating Miss-Rate

  • Miss-rate of Process A
      • In a steady state, all time quanta of Process A are identical
      • Time starts (t=0) at the beginning of a time quantum

$mr_A = \frac{1}{T_A} \int_0^{T_A} P_A(t)\,dt$

[Figure: the probability to miss, $P_A(t)$, plotted against time over one quantum $T_A$; integrating the curve gives the number of misses.]

  • Overall miss-rate (OMR)
      • Weighted sum of each process' miss-rate
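Since time is measured in memory references, the integral can be approximated by a sum over the references in one quantum. The sketch below makes two assumptions: the quantum is a whole number of references, and the weights in the overall miss-rate are the time quanta normalised by their sum. Both function names are illustrative.

```python
from typing import Callable, Sequence


def process_miss_rate(miss_prob: Callable[[int], float], quantum: int) -> float:
    """Discrete form of mr = (1/T) * integral of P(t) over one time quantum."""
    return sum(miss_prob(t) for t in range(quantum)) / quantum


def overall_miss_rate(miss_rates: Sequence[float], quanta: Sequence[float]) -> float:
    """Weighted sum of per-process miss rates, weighted by time quantum."""
    total = sum(quanta)
    return sum(mr * q for mr, q in zip(miss_rates, quanta)) / total
```

For instance, two processes with miss rates 0.25 and 0.5 and quanta of 100 and 300 references give an overall miss-rate of (25 + 150) / 400 = 0.4375.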

slide-14
SLIDE 14

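Model Summary

[Figure: data flow of the model. For each process, the miss-rate curve ($M_A(x)$, $M_B(x)$) gives the independent footprint (IF) $x_A^\Phi(t)$, $x_B^\Phi(t)$; the IFs and the schedule give the dependent footprint (DF) $x_A(t)$, $x_B(t)$ through a cache snapshot; each DF gives the per-process miss-rate $mr_A$, $mr_B$; the miss-rates and the schedule combine into the OMR.]

  • Independent footprint: $E[x_A^\Phi(t+1)] = E[x_A^\Phi(t)] + M_A(x_A^\Phi(t))$
  • Per-process miss-rate: $mr_A = \frac{1}{T_A} \int_0^{T_A} M_A(x_A(t))\,dt$
  • Overall miss-rate: $OMR = \frac{1}{T_{sum}} \sum_{i=1}^{N} mr_i\, T_i$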

slide-15
SLIDE 15

Model vs. Simulation: 2 Processes

[Figure: miss-rate (vpr+vortex, 32KB) versus time quantum, with one curve for the simulation and one for the model. Time-quantum axis ticks run from 20000 to 100000; miss-rate axis ticks run from 0.03 to 0.044.]

slide-16
SLIDE 16

Model vs. Simulation: 4 Processes

[Figure: miss-rate (vpr+vortex+gcc+bzip2, 32KB) versus time quantum, with one curve for the simulation and one for the model. Time-quantum axis ticks run from 20000 to 100000; miss-rate axis ticks run from 0.04 to 0.07.]

slide-17
SLIDE 17

Cache Partitioning

  • Time-sharing degrades the cache performance significantly for some time quanta
      • Due to dumb allocation by LRU policy
      • Could be improved by explicit cache partitioning
  • Specifying a partition
      • Dedicated Area ($D_A$)
          • Cache blocks that only Process A can use
      • Shared Area ($S$)
          • Cache blocks that any process can use while it is active

slide-18
SLIDE 18

Strategy

  • Off-line profiling of MR(size) curves
      • One for each phase
      • Independent of other processes
      • Can also be obtained on-line with HW support
  • On-line partitioning
      • Partitioning decision based on the model
      • Modify the LRU policy to partition the cache

slide-19
SLIDE 19

Optimal Cache Partition

  • Dedicated areas ($D_A$) specify the initial amount of data for each process
      • $x_A(0) = D_A$
  • Shared ($S$) and dedicated ($D_A$) areas specify the maximum cache space for each process
      • $C_A = D_A + S$
  • The model can estimate the miss-rate for a given partition
      • Use a gradient-based search algorithm
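The slides do not spell out the search algorithm. As a rough illustration of the idea, the sketch below uses steepest-descent hill climbing over the dedicated-area sizes, a discrete stand-in for a gradient-based search. It assumes a caller-supplied `estimate_omr(dedicated, shared)` that returns the model's estimated overall miss-rate for a partition; that function, like `search_partition` itself, is hypothetical.

```python
from typing import Callable, List, Tuple


def search_partition(
    estimate_omr: Callable[[List[int], int], float],
    n_procs: int,
    cache_size: int,
    step: int = 1,
) -> Tuple[List[int], int, float]:
    """Hill-climb over dedicated areas; the shared area is whatever is left."""
    dedicated = [0] * n_procs
    best = estimate_omr(dedicated, cache_size)
    while True:
        shared = cache_size - sum(dedicated)
        neighbours = []
        for i in range(n_procs):
            if shared >= step:          # move blocks from the shared area to D_i
                grown = dedicated.copy()
                grown[i] += step
                neighbours.append(grown)
            if dedicated[i] >= step:    # give blocks from D_i back to the shared area
                shrunk = dedicated.copy()
                shrunk[i] -= step
                neighbours.append(shrunk)
        winner = None
        for cand in neighbours:
            value = estimate_omr(cand, cache_size - sum(cand))
            if value < best:            # keep the best strictly improving move
                best, winner = value, cand
        if winner is None:              # local minimum: no neighbour improves
            return dedicated, cache_size - sum(dedicated), best
        dedicated = winner
```

Like any local search, this can stop at a local minimum when the estimated miss-rate is not convex in the partition sizes; restarting from several initial partitions is one common mitigation.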

slide-20
SLIDE 20

Simulation Results: Fully-Associative Caches

[Figure: miss-rate versus time quantum for a 32-KB fully-associative cache running bzip2+gcc+swim+mesa+vortex+vpr+twolf+iu, with one curve for LRU and one for Partition. Time-quantum axis ticks run from 1 to 1000000 in powers of ten; miss-rate axis ticks run from 0.02 to 0.05.]

  • 25% miss-rate improvement in the best case
  • 7% improvement for short time quanta

slide-21
SLIDE 21

From Full to Partial Associative

  • Use the fully-associative model and curves to determine $D_A$, $S$
  • Modify the LRU replacement policy to partition
      • Count the number of cache blocks for each process ($X_A$)
      • Try to match $X_A$ to the allocated cache space
      • Replacement (Process A active)
          • Replace Process A's LRU block if $X_A \ge D_A + S$
          • Replace Process B's LRU block if $X_B \ge D_B$
          • Replace the standard LRU block if there is no over-allocated process
      • Add a small victim cache (16 entries)
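The three replacement rules can be read as a victim-selection routine for one cache set. The sketch below is an interpretation under stated assumptions, not the authors' hardware design: blocks are modelled as `(owner, last_use)` pairs, the per-process block counts are kept in a dictionary, dormant processes are checked in dictionary order (the slides do not give an order), and a rule is skipped when the over-allocated process has no block in the set. The victim cache is not modelled.

```python
from typing import Dict, List, Optional, Tuple


def choose_victim(
    blocks: List[Tuple[str, int]],
    active: str,
    counts: Dict[str, int],
    dedicated: Dict[str, int],
    shared: int,
) -> Optional[int]:
    """Return the index of the block to evict from one cache set."""
    def lru_index(owner: Optional[str] = None) -> Optional[int]:
        # Index of the least recently used block, optionally for one owner only.
        idxs = [i for i, (o, _) in enumerate(blocks) if owner is None or o == owner]
        return min(idxs, key=lambda i: blocks[i][1]) if idxs else None

    # Rule 1: the active process already holds its maximum, D_A + S.
    if counts[active] >= dedicated[active] + shared:
        victim = lru_index(active)
        if victim is not None:
            return victim
    # Rule 2: a dormant process holds at least its dedicated area.
    for proc in counts:
        if proc != active and counts[proc] >= dedicated[proc]:
            victim = lru_index(proc)
            if victim is not None:
                return victim
    # Rule 3: no over-allocated process, fall back to standard LRU.
    return lru_index()
```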

slide-22
SLIDE 22

Simulation Results: Set-Associative Caches

[Figure: miss-rate versus time quantum for a 32-KB 8-way set-associative cache running bzip2+gcc+swim+mesa+vortex+vpr+twolf+iu, with one curve for LRU and one for Partition. Time-quantum axis ticks run from 1 to 1000000 in powers of ten; miss-rate axis ticks run from 0.02 to 0.05.]

  • 15% miss-rate improvement in the best case
  • 4% improvement for short time quanta

slide-23
SLIDE 23

Summary

  • Analytical cache model
      • Very accurate, yet tractable
      • Works for any cache size and time quanta
      • Applicable to set-associative cache partitioning
  • Applications
      • Dynamic cache partitioning with on-line/off-line approximations of miss-rate curves
      • Various scheduling problems