
Analytical Cache Models with Applications to Cache Partitioning
G. Edward Suh, Srinivas Devadas, and Larry Rudolph
LCS, MIT

Motivation
• Memory system performance is critical
• Everyone thinks about their own application
  • But modern computer systems execute multiple applications concurrently/simultaneously
• Context switches cause cold misses
• Simultaneous applications compete for cache space
• Caches should be managed more carefully, considering multiple processes
  • Explicit management of cache space => partitioning
  • Cache-aware job schedulers

Related Work
• Analytical Cache Models
  • Thiébaut and Stone (1987)
  • Agarwal, Horowitz and Hennessy (1989)
  • Both only focus on long time quanta
  • Inputs are hard to obtain on-line
• Cache Partitioning
  • Stone, Turek and Wolf (1992)
  • Optimal cache partitioning for very short time quanta
• Our Model & Partitioning
  • Work for any time quantum
  • Inputs are easier to obtain (possible to estimate on-line)

Our Multi-tasking Cache Model
• Input
  • $C$: cache size
  • Schedule: job sequences with time quantum ($T_A$)
  • $M_A(x)$: a miss rate as a function of cache size for Process A
• Output
  • Overall miss-rate (OMR) for multi-tasking
[Figure: block diagram. The cache size $C$, the schedule, and one miss-rate curve $M_A(x)$ per process (miss-rate vs. cache size) feed the cache model, which outputs the overall miss rate.]

Assumptions
• The miss-rate of a process is a function of cache size alone, not time
  • One MR(size) per application
  • Curve is averaged over application lifetime
  • In cases of high variance
    • Split the application into phases
    • One MR(size) per phase
  • Generated off-line (or on-line with HW support)
• No shared memory space among processes

Assumptions: Cont.
• Fully-associative caches
  • Extended to set-associative caches (memo 433)
  • The fully-associative model works for set-associative cache partitioning
• LRU replacement policy
• Time in terms of the number of memory references
  • The number of memory references can be easily converted to real time in a steady-state

Independent Footprint $x_A^{\Phi}(t)$
• Independent footprint
  • The amount of data for Process A at time $t$ starting from an empty cache, $x_A^{\Phi}(0) = 0$
  • Assume only one process executes
• Changes
  • If hit, $x_A^{\Phi}(t+1) = x_A^{\Phi}(t)$
  • If miss, $x_A^{\Phi}(t+1) = \min[x_A^{\Phi}(t) + 1,\ C]$
• If we approximate the real value of $x_A^{\Phi}(t)$ with its expectation:
  $E[x_A^{\Phi}(t+1)] = \min[E[x_A^{\Phi}(t)] + P_A(t),\ C] = \min[E[x_A^{\Phi}(t)] + M_A(E[x_A^{\Phi}(t)]),\ C]$
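The expectation recurrence on this slide can be iterated directly. The sketch below is only an illustration: `toy_miss_rate` is a made-up miss-rate curve (not one from the talk), and sizes are counted in cache blocks.

```python
def independent_footprint(miss_rate, cache_size, steps):
    """Iterate E[x(t+1)] = min(E[x(t)] + M(E[x(t)]), C), starting from an empty cache."""
    x = 0.0
    footprint = [x]
    for _ in range(steps):
        # Each reference adds, in expectation, one block with probability M(x).
        x = min(x + miss_rate(x), cache_size)
        footprint.append(x)
    return footprint


def toy_miss_rate(x):
    # Hypothetical curve: every reference misses on an empty cache,
    # and the miss rate halves for every 64 blocks the process holds.
    return 0.5 ** (x / 64.0)


fp = independent_footprint(toy_miss_rate, cache_size=512, steps=10_000)
```

Here `fp[t]` is the expected footprint after `t` memory references; it grows quickly at first and flattens as the miss rate drops.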

Dependent Footprint $x_A(t)$
• Dependent footprint
  • The amount of data for Process A when multiple processes execute concurrently
  • Obtained from the given schedule and the independent footprints of all processes
• Example
  • Four processes: A, B, C, D
  • Round-robin schedule: ABCDABCD…

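Dependent Footprint $x_A(t)$: Cont.
[Figure: an infinite-size cache when Process A is executed for time $t$, drawn as blocks ordered from MRU data to LRU data: $A_0, D_{-1}, C_{-1}, B_{-1}, A_{-1}, D_{-2}, C_{-2}, B_{-2}, A_{-2}, D_{-3}, C_{-3}, \dots$ A second plot shows the independent footprint of A against time, marking $x_A^{\Phi}(t)$ at time $t$ and the increment $x_A^{\Phi}(t+T_A) - x_A^{\Phi}(t)$ between $t$ and $t+T_A$.]
• Compute block sizes from the left: $A_0, D_{-1}, C_{-1}, B_{-1}, A_{-1}, D_{-2}, \dots$
• Use the independent footprint $x_A^{\Phi}(t)$
• Until the cache is full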


Dependent Footprint $x_A(t)$: Cont.
[Figure: an infinite-size cache when Process A is executed for time $t$, drawn as blocks ordered from MRU data to LRU data: $A_0, D_{-1}, C_{-1}, B_{-1}, A_{-1}, D_{-2}, C_{-2}, B_{-2}, A_{-2}, D_{-3}, C_{-3}, \dots$ The cache size $C$ is marked as a cut-off along this sequence.]
• Case 1: a dormant process' block is the LRU
  • $x_A(t) = A_0 + A_{-1} = x_A^{\Phi}(t+T_A)$
• Case 2: the active process' block is the LRU
  • $x_A(t) = C - (D_0 + C_0 + B_0 + D_{-1} + C_{-1} + B_{-1}) = C - x_D^{\Phi}(T_D) - x_C^{\Phi}(T_C) - x_B^{\Phi}(T_B)$
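The block-walking idea from these slides can be sketched as follows for a round-robin schedule. This is a minimal illustration, not the authors' implementation: the function name, the dictionary-based inputs, and the linear `capped` footprint used in the example are all assumptions made here.

```python
def dependent_footprint(footprints, quanta, order, active, t, cache_size):
    """Estimate the active process' data in the cache, t references into its quantum.

    footprints: dict name -> function n -> independent footprint after n references
    quanta:     dict name -> time quantum length, in memory references
    order:      round-robin order, e.g. ["A", "B", "C", "D"]
    """
    n = len(order)
    pos = order.index(active)
    used = 0.0                      # space consumed so far, walking from MRU to LRU
    mine = 0.0                      # part of that space holding the active process' data
    age = {p: 0 for p in order}     # references already accounted for, per process
    first_round = True
    while used < cache_size:
        added = 0.0
        for k in range(n):
            proc = order[(pos - k) % n]          # walk the schedule backwards in time
            span = t if (first_round and k == 0) else quanta[proc]
            fp = footprints[proc]
            block = fp(age[proc] + span) - fp(age[proc])
            age[proc] += span
            block = min(block, cache_size - used)  # the LRU block may be cut off
            used += block
            added += block
            if proc == active:
                mine += block
            if used >= cache_size:
                break
        first_round = False
        if added < 1e-9:            # all footprints saturated: the cache never fills
            break
    return mine


def capped(n):
    # Hypothetical independent footprint: every reference misses until 100 blocks are held.
    return float(min(n, 100))


procs = ["A", "B", "C", "D"]
fps = {p: capped for p in procs}
tq = {p: 10 for p in procs}
case1 = dependent_footprint(fps, tq, procs, "A", 5, cache_size=50)  # LRU block belongs to D
case2 = dependent_footprint(fps, tq, procs, "A", 5, cache_size=40)  # LRU block belongs to A
```

With these toy inputs, `case1` equals the independent footprint of A at `t + T_A` (Case 1), while `case2` equals the cache size minus the footprints of B, C, and D over one quantum each (Case 2).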

Computing the Miss Probability: $P_A(t)$
• Effective cache size
  • $x_A(t)$: the amount of data in a cache for Process A at time $t$
• The probability to miss at time $t$
  • $P_A(t) = M_A(x_A(t))$
[Figure: the cache at time $t$, split into Process A's data ($x_A(t)$) and other processes' data; next to it, the miss-rate curve $M_A(x)$ against cache size, with $P_A(t)$ read off at $x_A(t)$.]

Estimating Miss-Rate
• Miss-rate of Process A
  • In a steady-state, all time quanta of Process A are identical
  • Time starts ($t = 0$) at the beginning of a time quantum
  • $mr_A = \frac{1}{T_A} \int_0^{T_A} P_A(t)\,dt$
• Overall miss-rate (OMR)
  • Weighted sum of each process' miss-rate
[Figure: probability to miss $P_A(t)$ against time over one quantum, from 0 to $T_A$; integrating the curve gives the number of misses.]
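Because time is counted in memory references, the integral can be approximated by a sum over the references in one quantum. The sketch below assumes that discretisation, and it also assumes that the weights of the overall miss-rate are each process' share of the references in one round of the schedule; the helper names and the toy inputs are illustrative.

```python
def process_miss_rate(miss_curve, dep_footprint, quantum):
    """Discrete form of mr_A = (1/T_A) * integral over [0, T_A] of M_A(x_A(t)) dt."""
    misses = sum(miss_curve(dep_footprint(t)) for t in range(quantum))
    return misses / quantum


def overall_miss_rate(miss_rates, quanta):
    """Weight each process' miss-rate by its time quantum (assumed weighting)."""
    total = sum(quanta.values())
    return sum(miss_rates[p] * quanta[p] for p in quanta) / total


# Toy inputs: the footprint grows by one block per reference, M(x) = 1 / (1 + x).
mr_toy = process_miss_rate(lambda x: 1.0 / (1.0 + x), lambda t: float(t), quantum=4)
omr_toy = overall_miss_rate({"A": 0.1, "B": 0.3}, {"A": 100, "B": 300})
```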

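Model Summary
[Figure: data-flow diagram of the model. For each process (A, B, …), the miss-rate curve $M_A(x)$ gives the independent footprint (IF) $x_A^{\Phi}(t)$; together with the schedule and a cache snapshot this gives the dependent footprint (DF) $x_A(t)$, which gives the miss-rate $mr_A$. The per-process miss-rates are summed into the OMR.]
• Independent footprint: $E[x_A^{\Phi}(t+1)] = E[x_A^{\Phi}(t)] + M_A(x_A^{\Phi}(t))$
• Miss-rate: $mr_A = \frac{1}{T_A} \int_0^{T_A} M_A(x_A(t))\,dt$
• Sum: $\frac{1}{T} \sum_{i=1}^{N} mr_i \cdot T_i$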

Model vs. Simulation: 2 Processes
[Figure: miss-rate against time quantum (0 to 100000) for vpr+vortex with a 32KB cache, comparing simulation and model; the miss-rate axis spans 0.03 to 0.044.]

Model vs. Simulation: 4 Processes
[Figure: miss-rate against time quantum (0 to 100000) for vpr+vortex+gcc+bzip2 with a 32KB cache, comparing simulation and model; the miss-rate axis spans 0.04 to 0.07.]

Cache Partitioning
• Time-sharing degrades the cache performance significantly for some time quanta
  • Due to dumb allocation by LRU policy
  • Could be improved by explicit cache partitioning
• Specifying a partition
  • Dedicated Area ($D_A$)
    • Cache blocks that only Process A can use
  • Shared Area ($S$)
    • Cache blocks that any process can use while it is active

Strategy
• Off-line profiling of MR(size) curves
  • One for each phase
  • Independent of other processes
  • Can also be obtained on-line with HW support
• On-line partitioning
  • Partitioning decision based on the model
  • Modify the LRU policy to partition the cache

Optimal Cache Partition
• Dedicated areas ($D_A$) specify the initial amount of data for each process
  • $x_A(0) = D_A$
• Shared ($S$) and dedicated ($D_A$) areas specify the maximum cache space for each process
  • $C_A = D_A + S$
• The model can estimate the miss-rate for a given partition
• Use a gradient-based search algorithm
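The slide does not spell out the search itself. As a rough stand-in, the sketch below uses a simple discrete hill climb over the dedicated areas, which is a substitute technique and not necessarily the authors' algorithm. The callback `estimate_omr(dedicated, shared)` is a hypothetical hook for the model's miss-rate estimate, and `toy_cost` is an invented cost used only to exercise the search.

```python
def search_partition(estimate_omr, processes, cache_size, step=1):
    """Hill-climb over dedicated areas; whatever is left over is the shared area."""
    dedicated = {p: 0 for p in processes}        # start with the whole cache shared
    best = estimate_omr(dedicated, cache_size)
    improved = True
    while improved:
        improved = False
        for p in processes:
            for delta in (step, -step):          # grow or shrink one dedicated area
                trial = dict(dedicated)
                trial[p] += delta
                shared = cache_size - sum(trial.values())
                if trial[p] < 0 or shared < 0:
                    continue                     # partition does not fit in the cache
                cost = estimate_omr(trial, shared)
                if cost < best:
                    dedicated, best, improved = trial, cost, True
    return dedicated, cache_size - sum(dedicated.values()), best


def toy_cost(dedicated, shared):
    # Invented stand-in for the model: minimal at D_A = 3, D_B = 5.
    return (dedicated["A"] - 3) ** 2 + (dedicated["B"] - 5) ** 2


best_d, best_s, best_cost = search_partition(toy_cost, ["A", "B"], cache_size=16)
```

Each accepted move strictly lowers the estimated miss-rate, so the loop terminates; like any local search, it can stop in a local minimum if the estimate is not well behaved.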

Simulation Results: Fully-Associative Caches
[Figure: miss-rate against time quantum (1 to 1000000, logarithmic axis) for a 32-KB fully-associative cache running bzip2+gcc+swim+mesa+vortex+vpr+twolf+iu, comparing LRU and Partition; the miss-rate axis spans 0.02 to 0.05.]
• 25% miss-rate improvement in the best case
• 7% improvement for short time quanta

From Full to Partial Associative
• Use the fully-associative model and curves to determine $D_A$, $S$
• Modify the LRU replacement policy to partition
  • Count the number of cache blocks for each process ($X_A$)
  • Try to match $X_A$ to the allocated cache space
• Replacement (Process A active)
  • Replace Process A's LRU block if $X_A \geq D_A + S$
  • Replace Process B's LRU block if $X_B \geq D_B$
  • Replace the standard LRU block if there is no over-allocated process
• Add a small victim cache (16 entries)
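The replacement rules on this slide can be sketched as a victim-selection function for one set. The data layout (`lru_order`, `owner`) is an assumption, as is the tie-break among several over-allocated dormant processes (here, the one owning the least recently used block); the victim cache is not modelled.

```python
def choose_victim(active, lru_order, owner, dedicated, shared):
    """Pick the block to evict on a miss by the active process.

    lru_order: block ids, least recently used first
    owner:     dict block id -> owning process
    dedicated: dict process -> dedicated area size D_p, in blocks
    shared:    shared area size S, in blocks
    """
    counts = {}
    for block in lru_order:
        counts[owner[block]] = counts.get(owner[block], 0) + 1

    # The active process may hold its dedicated area plus the shared area.
    if counts.get(active, 0) >= dedicated[active] + shared:
        return next(b for b in lru_order if owner[b] == active)

    # A dormant process is only entitled to its dedicated area.
    for block in lru_order:
        proc = owner[block]
        if proc != active and counts[proc] >= dedicated[proc]:
            return block

    # No over-allocated process: fall back to plain LRU.
    return lru_order[0]


owner = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
order1 = ["b1", "a1", "b2", "a2", "a3"]
order2 = ["a1", "b1", "b2", "a2", "a3"]
v_active = choose_victim("A", order1, owner, {"A": 1, "B": 1}, shared=2)
v_dormant = choose_victim("A", order2, owner, {"A": 1, "B": 1}, shared=3)
v_plain = choose_victim("A", order2, owner, {"A": 1, "B": 3}, shared=3)
```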

Simulation Results: Set-Associative Caches
[Figure: miss-rate against time quantum (1 to 1000000, logarithmic axis) for a 32-KB 8-way set-associative cache running bzip2+gcc+swim+mesa+vortex+vpr+twolf+iu, comparing LRU and Partition; the miss-rate axis spans 0.02 to 0.05.]
• 15% miss-rate improvement in the best case
• 4% improvement for short time quanta

Summary
• Analytical cache model
  • Very accurate, yet tractable
  • Works for any cache size and time quanta
  • Applicable to set-associative cache partitioning
• Applications
  • Dynamic cache partitioning with on-line/off-line approximations of miss-rate curves
  • Various scheduling problems
