A New Cache Monitoring Scheme for Memory-Aware Scheduling and Partitioning



SLIDE 1

HPCA-8: 1 February 5, 2002

A New Cache Monitoring Scheme for Memory-Aware Scheduling and Partitioning

G. Edward Suh, Srinivas Devadas, Larry Rudolph
Massachusetts Institute of Technology

SLIDE 2

Problem

  • Memory system performance is critical
  • Everyone thinks about their own application
      – Tuning replacement policies
      – Software/hardware prefetching
  • But modern computer systems execute multiple applications concurrently/simultaneously
      – Time-shared systems
          • Context switches cause cold misses
      – Multiprocessor systems sharing a memory hierarchy (SMP, SMT, CMP)
          • Simultaneous applications compete for cache space

SLIDE 3

Solutions: Cache Partitioning & Memory-Aware Scheduling

  • Cache Partitioning
      – Explicitly manage cache space allocation amongst concurrent/simultaneous processes
          • Each process gets a different benefit from more cache space
          • Similar to main memory partitioning (e.g., Stone 1992) in the old days
  • Memory-Aware Scheduling
      – Choose a set of simultaneous processes to minimize memory/cache contention
      – Scheduling for SMT systems (Snavely 2000)
          • Threads interact in various ways (RUU, functional units, caches, etc.)
          • Based on executing various schedules and profiling them
      – Admission control for gang scheduling (Batat 2000)
          • Based on the footprint of a job (total memory usage)

SLIDE 4

BUT…

  • Testing many possible schedules is not viable
      – The number of possible schedules increases exponentially as the number of processes increases
      – Need to decide on a good schedule from individual process characteristics, so that complexity increases linearly
  • Footprint-based scheduling does not provide enough information
      – The footprint of a process is often larger than the cache
      – Processes may not need the entire working set in the cache
  • Can we find a good schedule for cache performance?
      – What information do we need for each process?

SLIDE 5

Information a Scheduler/Partitioner Needs

  • Characterizing a process
      – For scheduling and partitioning, we need to know the effect of varying cache size
          • Multiple performance numbers for different cache sizes
          • Ignore effects other than cache size
  • Miss-rate curves: m(c)
      – Cache miss-rates as a function of cache size (in cache blocks)
          • Assume a process is isolated
          • Assume the cache is FULLY-ASSOCIATIVE
      – Provides essential information for scheduling and partitioning

[Figure: miss-rate (0 to 1) vs. cache space (%)]
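As an illustration of what a miss-rate curve m(c) is, the sketch below computes one off-line from a reference trace for an isolated process on a fully-associative LRU cache. It relies on the standard LRU stack-distance property (a reference hits in a cache of c blocks exactly when its stack distance is at most c). The function name `miss_rate_curve` and the toy trace are assumptions made for this example, not part of the talk.

```python
from collections import Counter


def miss_rate_curve(trace, max_blocks):
    """Return [m(0), m(1), ..., m(max_blocks)] for a fully-associative LRU cache.

    `trace` is a non-empty sequence of block identifiers (hypothetical input).
    """
    stack = []                  # LRU stack, most recently used block first
    hits_at_depth = Counter()   # hits_at_depth[d]: references with stack distance d
    for block in trace:
        if block in stack:
            depth = stack.index(block) + 1   # 1-based stack distance
            hits_at_depth[depth] += 1
            stack.remove(block)
        stack.insert(0, block)               # referenced block becomes MRU

    total = len(trace)
    curve, hits = [], 0
    for c in range(max_blocks + 1):
        hits += hits_at_depth[c]             # a cache of c blocks hits at depths <= c
        curve.append((total - hits) / total)
    return curve


# Toy trace over three blocks.
curve = miss_rate_curve(["a", "b", "a", "c", "b", "a"], max_blocks=4)
print(curve)
```

One pass over the trace yields the miss-rate for every cache size at once, which is why a single curve per process is enough input for the scheduler or partitioner.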

SLIDE 6

Using Miss-Rate Curves for Partitioning

  • What do miss-rate curves tell us about cache allocation?

[Figure: miss-rate vs. cache space (%) for Process A and Process B, with the allocations c_A and c_B marked on each curve and on the shared cache]

  • Cache misses: m_A(c_A)·ref_A + m_B(c_B)·ref_B
      – m_X is the miss-rate curve of process X, c_X its allocated cache space, and ref_X its number of references
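The miss expression on this slide can be evaluated directly for every split of the cache between two processes. The sketch below does so by brute force; the two miss-rate curves and the reference counts are made-up numbers chosen only for illustration, and `total_misses` and `best_split` are hypothetical helper names.

```python
def total_misses(m_a, m_b, ref_a, ref_b, c_a, c_b):
    """m_A(c_A)*ref_A + m_B(c_B)*ref_B, with curves given as lists indexed by blocks."""
    return m_a[c_a] * ref_a + m_b[c_b] * ref_b


def best_split(m_a, m_b, ref_a, ref_b, cache_blocks):
    """Try every c_A in 0..cache_blocks (c_B gets the rest) and return the best c_A."""
    return min(
        range(cache_blocks + 1),
        key=lambda c_a: total_misses(m_a, m_b, ref_a, ref_b, c_a, cache_blocks - c_a),
    )


# Hypothetical miss-rate curves for a 4-block cache (index = blocks allocated).
m_a = [1.0, 0.5, 0.4, 0.35, 0.3]
m_b = [1.0, 0.8, 0.5, 0.2, 0.1]

c_a = best_split(m_a, m_b, ref_a=1000, ref_b=1000, cache_blocks=4)
print(c_a, 4 - c_a)
```

Exhaustive search is only practical for two processes and small caches; it is shown here to make the objective concrete.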

SLIDE 7

Finding the best allocation

  • Use marginal gain: g(c) = m(c)·ref - m(c+1)·ref
      – The reduction in the number of misses from increasing the cache space by one block
  • Allocate cache blocks to each process in a greedy manner
      – Guaranteed to result in the optimal partition if the m(c) are convex

Marginal Gain (Hits) vs. Cache Space (Blocks)

Block        1      2      3      4
Process A    987    409    282    250
Process B    2111   1568   746    104

Cache Allocation

  • Initially no cache block is allocated
  • Compare marginal gains: 987 < 2111, so allocate a block to Process B
  • Compare marginal gains: 987 < 1568, so allocate a block to Process B
  • Compare marginal gains: 987 > 746, so allocate a block to Process A
  • Compare marginal gains: 409 < 746, so allocate a block to Process B
  • Resulting allocation order: B, B, A, B
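The greedy walk-through on this slide can be sketched as follows, using the slide's marginal-gain values for Process A and Process B. The function name `greedy_partition` and the dictionary layout are assumptions for this example.

```python
def greedy_partition(marginal_gains, total_blocks):
    """Hand out blocks one at a time to the process with the largest next gain.

    marginal_gains[name][k] is the gain of giving `name` its (k+1)-th block.
    Returns the final allocation and the order in which blocks were granted.
    """
    alloc = {name: 0 for name in marginal_gains}
    order = []
    for _ in range(total_blocks):
        # Next-block gain for every process that still has a measured gain left.
        candidates = {
            name: gains[alloc[name]]
            for name, gains in marginal_gains.items()
            if alloc[name] < len(gains)
        }
        if not candidates:
            break
        winner = max(candidates, key=candidates.get)
        alloc[winner] += 1
        order.append(winner)
    return alloc, order


gains = {"A": [987, 409, 282, 250], "B": [2111, 1568, 746, 104]}
alloc, order = greedy_partition(gains, total_blocks=4)
print(order, alloc)
```

With these values the sketch grants blocks in the order B, B, A, B, leaving Process A with one block and Process B with three.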

SLIDE 8

Partitioning Results

[Figure: IPC vs. L2 size (0.25 to 4 MB), comparing LRU and Partition]

  • Partition the L2 cache amongst two simultaneous processes (SPEC2000 benchmarks: art and mcf)

SLIDE 9

Intuition for Memory-Aware Scheduling

  • How to schedule 4 processes on a 2-processor system using individual miss-rate curves?

[Figure: miss-rate vs. cache space (%) for Process A, Process B, Process C, and Process D]

  • Curves tend to have a knee: the amount of cache space where the marginal gain diminishes a lot
  • Group processes based on the knees
  • Working set size is larger than the cache for all processes
  • All processes result in a similar miss-rate if they have the entire cache
  • Schedule A and C, and B and D together

SLIDE 10

Determining the Knee of the Curve

  • Use the partitioning technique

[Figure: marginal gain (hits) vs. cache space (blocks) for Process A (987, 409, 282, 250) and Process B (2111, 1568, 746, 104), with the resulting cache allocation]

  • However, now we may need multiple time slices to schedule processes (2 time slices in our example)
  • Available cache resource should be doubled
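One way to read this slide is sketched below: run the greedy partitioning over all processes with a block budget equal to the cache size times the number of time slices, and treat each process's share as an estimate of its knee. The final grouping step (pick the pairing whose larger per-slice total is smallest) is an assumption added for this example and is not stated on the slide; the marginal gains, the 8-block cache, and the names `greedy_shares` and `group_two_slices` are likewise hypothetical.

```python
from itertools import combinations


def greedy_shares(marginal_gains, total_blocks):
    """Greedy partitioning: each block goes to the process with the largest next gain."""
    shares = {name: 0 for name in marginal_gains}
    for _ in range(total_blocks):
        candidates = {
            name: gains[shares[name]]
            for name, gains in marginal_gains.items()
            if shares[name] < len(gains)
        }
        if not candidates:
            break
        shares[max(candidates, key=candidates.get)] += 1
    return shares


def group_two_slices(shares, per_slice):
    """Assumed grouping rule for two time slices: minimize the larger slice total."""
    names = sorted(shares)
    best = None
    for group in combinations(names, per_slice):
        rest = tuple(n for n in names if n not in group)
        load = max(sum(shares[n] for n in group), sum(shares[n] for n in rest))
        if best is None or load < best[0]:
            best = (load, group, rest)
    return best[1], best[2]


# Hypothetical marginal gains: A and B flatten after 6 blocks, C and D after 2.
gains = {
    "A": [100] * 6 + [1] * 10,
    "B": [90] * 6 + [1] * 10,
    "C": [80] * 2 + [1] * 14,
    "D": [70] * 2 + [1] * 14,
}
cache_blocks, time_slices = 8, 2
shares = greedy_shares(gains, cache_blocks * time_slices)  # doubled cache resource
schedule = group_two_slices(shares, per_slice=2)
print(shares, schedule)
```

With these made-up gains, pairing a large-knee process with a small-knee one keeps each time slice within the 8-block cache.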

SLIDE 11

Scheduling Results

  • Schedule 6 SPEC CPU benchmarks for 2 Processors

[Figure: normalized miss-rate vs. memory size (8 to 256 MB) for the worst schedule, the best schedule, and the scheduling algorithm]

SLIDE 12

Analytical Model (ICS'01)

[Figure: miss-rate vs. time quantum (number of cache accesses, 1 to 1,000,000), comparing LRU and Partition; 32-KB 8-way set-associative cache (bzip2+gcc+swim+mesa+vortex+vpr+twolf+iu)]

  • Miss-rate curves (or marginal gains) alone may not be enough for optimizing time-shared systems
      – Partitioning amongst concurrent processes
      – Scheduling considering the effects of context switches
  • Use an analytical model to predict cache-sharing effects

SLIDE 13

BUT…

  • Processes to execute are only known at run-time
      – Users decide what applications to run
      – Scheduling/partitioning decisions should be made at run-time
  • The behavior of a process changes over time
      – Applications have different phases
      – Miss-rate curves (and marginal gains) may change over an execution
  • Cache configurations differ across systems
      – Miss-rate curves (and marginal gains) differ across systems
  • Need an on-line estimation of miss-rate curves (and marginal gains)

SLIDE 14

On-Line Estimation of Marginal Gains: Fully-Associative Caches

  • Marginal gains can be directly counted based on the temporal ordering of cache blocks (LRU information)
      – Use one counter per cache block (or per group of cache blocks) and one for counting all accesses
      – Hit on the ith MRU block: increment the ith counter
  • Example: a FA cache with 4 blocks
      – Initial state: marginal-gain counters = 912, 722, 350, 124 (1st to 4th MRU); access counter = 2432
      – Hit on the 3rd MRU cache block: increment the 3rd counter (350 to 351); access counter 2432 to 2433
      – Hit on the MRU cache block: increment the 1st counter (912 to 913); access counter 2433 to 2434
      – Resulting marginal-gain counters: 913, 722, 351, 124

[Figure: marginal gain vs. cache space (blocks), plotted from the marginal-gain counters]
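The counting scheme on this slide can be modeled in software as below: a minimal sketch of a fully-associative LRU cache with one counter per LRU position plus an access counter. The class name `MarginalGainMonitor` and the toy access sequence are assumptions for this example; a hardware monitor would update the counters alongside the normal LRU bookkeeping.

```python
class MarginalGainMonitor:
    """Fully-associative LRU cache model with per-position hit counters."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.lru = []                    # cached blocks, MRU first
        self.gain = [0] * num_blocks     # gain[i]: hits on the (i+1)-th MRU block
        self.accesses = 0                # counts all accesses

    def access(self, block):
        self.accesses += 1
        if block in self.lru:
            pos = self.lru.index(block)  # 0 = MRU position
            self.gain[pos] += 1
            self.lru.pop(pos)
        elif len(self.lru) == self.num_blocks:
            self.lru.pop()               # miss in a full cache: evict the LRU block
        self.lru.insert(0, block)        # accessed block becomes MRU

    def miss_rate(self, c):
        """Estimated miss-rate of a c-block cache (0 <= c <= num_blocks)."""
        return 1 - sum(self.gain[:c]) / self.accesses


mon = MarginalGainMonitor(num_blocks=4)
for block in ["a", "b", "c", "d"]:       # four cold misses fill the cache
    mon.access(block)
mon.access("b")                          # "b" is the 3rd MRU block: 3rd counter
mon.access("b")                          # "b" is now the MRU block: 1st counter
print(mon.gain, mon.accesses)
```

Summing the first c counters gives the hits a c-block cache would have seen, so the same counters yield the miss-rate for every smaller cache size.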

SLIDE 15

BUT…

  • Most caches are SET-ASSOCIATIVE
      – Except main memory
      – Usually up to 8-way associative
  • Set-associative caches only maintain temporal ordering within a set
      – No global temporal ordering
  • Cannot use block-by-block temporal ordering to obtain marginal gains for fully-associative caches

SLIDE 16

Way-Counters

  • Way-Counters
      – Use the existing LRU information within a set
      – One counter per way (D-way caches: D counters)
      – Hit on the ith MRU: increment the ith counter
  • Each way-counter represents the gain of having S more blocks (S is the number of sets)
  • Example: a 4-way associative cache with S sets
      – Initial state: way counters = 4384, 376, 121, 31; access counter = 5012
      – Hit on the MRU cache block: increment the 1st counter (4384 to 4385); access counter 5012 to 5013
      – Hit on the 2nd MRU cache block: increment the 2nd counter (376 to 377); access counter 5013 to 5014

[Figure: miss-rate vs. cache size (256 to 1024 blocks), comparing Way-Counter and Fully-Associative; data labels 1, 0.123, 0.0477, 0.0234, 0.0171]
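A software model of way-counters might look like the sketch below. It assumes the set index is the block address modulo the number of sets and that each set keeps true LRU order; the names `WayCounterCache` and `miss_rate_from_way_counters` are hypothetical. The helper turns D way-counters into an estimated miss-rate for 0, S, 2S, ..., D·S blocks.

```python
class WayCounterCache:
    """Set-associative LRU cache model with one hit counter per LRU position (way)."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [[] for _ in range(num_sets)]  # each set: blocks, MRU first
        self.way_counter = [0] * ways
        self.accesses = 0

    def access(self, block_addr):
        self.accesses += 1
        lru = self.sets[block_addr % self.num_sets]  # assumed index function
        if block_addr in lru:
            pos = lru.index(block_addr)              # 0 = MRU within the set
            self.way_counter[pos] += 1
            lru.pop(pos)
        elif len(lru) == self.ways:
            lru.pop()                                # evict the set's LRU block
        lru.insert(0, block_addr)


def miss_rate_from_way_counters(way_counter, accesses):
    """Estimated miss-rate with k ways (k * S blocks), for k = 0..D."""
    curve, hits = [1.0], 0
    for count in way_counter:
        hits += count
        curve.append(1 - hits / accesses)
    return curve


# Tiny simulation: 2 sets, 2 ways; blocks 0 and 2 map to set 0.
cache = WayCounterCache(num_sets=2, ways=2)
for addr in [0, 2, 0, 0]:
    cache.access(addr)
print(cache.way_counter, cache.accesses)

# Curve implied by the initial counter values shown in the slide's example.
slide_curve = miss_rate_from_way_counters([4384, 376, 121, 31], 5012)
print(slide_curve)
```

Because each counter covers a whole way, the estimated curve only has D points, which is the coarseness the Way+Set scheme later refines.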

SLIDE 17

Way+Set Counters

  • Use more counters for more detailed information
      – Maintain the LRU information of sets
      – Hit on the ith MRU way and jth MRU set: increment counter(i,j)

[Figure: a 2-way associative cache whose sets are arranged into groups (Group 0, Group 1, ..., Group S'), with the temporal ordering of set groups, a grid of counters, and an access counter. Example: a hit on the MRU way in the 2nd MRU group increments counter (0,1) from 1073 to 1074 and the access counter from 5248 to 5249.]

[Figure: miss-rate vs. cache size (256 to 1024 blocks), comparing Way-Counter (2-way), Way+Set (8 Groups), Way+Set (16 Groups), and Fully-Associative]
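The counting rule for way+set counters can be sketched as below. This models only the bookkeeping named on the slide (LRU order within a set, LRU order among set groups, and counter(i,j) with 0-based indices as in the slide's "(0,1)" example). The grouping of consecutive sets, the modulo index function, and the class name `WaySetCounterCache` are assumptions; how the counters are turned into a fully-associative miss-rate estimate is not covered by the slide and is not attempted here.

```python
class WaySetCounterCache:
    """Set-associative LRU model that also tracks the LRU order of set groups."""

    def __init__(self, num_sets, ways, num_groups):
        assert num_sets % num_groups == 0
        self.num_sets = num_sets
        self.ways = ways
        self.sets_per_group = num_sets // num_groups   # assumed: consecutive sets
        self.sets = [[] for _ in range(num_sets)]      # each set: blocks, MRU first
        self.group_order = list(range(num_groups))     # group ids, MRU first
        # counter[i][j]: hits on the i-th MRU way in the j-th MRU group (0-based)
        self.counter = [[0] * num_groups for _ in range(ways)]
        self.accesses = 0

    def access(self, block_addr):
        self.accesses += 1
        set_idx = block_addr % self.num_sets           # assumed index function
        group = set_idx // self.sets_per_group
        j = self.group_order.index(group)              # group's position before update
        lru = self.sets[set_idx]
        if block_addr in lru:
            i = lru.index(block_addr)                  # way's position before update
            self.counter[i][j] += 1
            lru.pop(i)
        elif len(lru) == self.ways:
            lru.pop()                                  # evict the set's LRU block
        lru.insert(0, block_addr)
        self.group_order.pop(j)                        # accessed group becomes MRU
        self.group_order.insert(0, group)


# 4 sets in 2 groups: sets 0-1 form group 0, sets 2-3 form group 1.
cache = WaySetCounterCache(num_sets=4, ways=2, num_groups=2)
cache.access(0)   # miss in group 0
cache.access(2)   # miss in group 1, which becomes the MRU group
cache.access(0)   # hit: MRU way, 2nd MRU group -> counter (0,1)
cache.access(0)   # hit: MRU way, MRU group -> counter (0,0)
print(cache.counter, cache.accesses)
```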

SLIDE 18

Summary

  • Caches should be managed more carefully considering the effect of space/time-sharing
      – Cache Partitioning
      – Memory-Aware Scheduling
  • Miss-rate curves provide very relevant information for scheduling and partitioning
      – Enables us to predict the effect of varying the cache space
      – Useful for any tradeoff between performance and space (power)
  • On-line counters can estimate miss-rate curves at run-time
      – Use the temporal ordering of blocks to predict miss-rates for smaller caches
      – Works for both fully-associative and set-associative caches