A New Cache Monitoring Scheme for Memory-Aware Scheduling and - PowerPoint PPT Presentation

A New Cache Monitoring Scheme for Memory-Aware Scheduling and Partitioning

G. Edward Suh, Srinivas Devadas, Larry Rudolph
Massachusetts Institute of Technology
HPCA-8, February 5, 2002

Problem

• Memory system performance is critical
• Everyone thinks about their own application
  – Tuning replacement policies
  – Software/hardware prefetching
• But modern computer systems execute multiple applications concurrently/simultaneously
  – Time-shared systems
    • Context switches cause cold misses
  – Multiprocessor systems sharing a memory hierarchy (SMP, SMT, CMP)
    • Simultaneous applications compete for cache space

Solutions: Cache Partitioning & Memory-Aware Scheduling

• Cache Partitioning
  – Explicitly manage cache space allocation amongst concurrent/simultaneous processes
    • Each process gets a different benefit from more cache space
    • Similar to main memory partitioning (e.g., Stone 1992) in the old days
• Memory-Aware Scheduling
  – Choose a set of simultaneous processes to minimize memory/cache contention
  – Scheduling for SMT systems (Snavely 2000)
    • Threads interact in various ways (RUU, functional units, caches, etc.)
    • Based on executing various schedules and profiling them
  – Admission control for gang scheduling (Batat 2000)
    • Based on the footprint of a job (total memory usage)

BUT…

• Testing many possible schedules → not viable
  – The number of possible schedules increases exponentially as the number of processes increases
  – Need to decide on a good schedule from individual process characteristics → complexity increases linearly
• Footprint-based scheduling → not enough information
  – The footprint of a process is often larger than the cache
  – Processes may not need the entire working set in the cache
• Can we find a good schedule for cache performance?
  – What information do we need for each process?

Information a Scheduler/Partitioner Needs

• Characterizing a process
  – For scheduling and partitioning, need to know the effect of varying cache size
    • Multiple performance numbers for different cache sizes
    • Ignore effects other than cache size
• Miss-rate curves: m(c)
  – Cache miss-rates as a function of cache size (cache blocks)
    • Assume a process is isolated
    • Assume the cache is FULLY-ASSOCIATIVE
  – Provides essential information for scheduling and partitioning

[Figure: miss-rate curve m(c); x-axis: Cache Space (%), 0 to 100; y-axis: Miss-rate, 0 to 1]

Using Miss-Rate Curves for Partitioning

• What do miss-rate curves tell about cache allocation?
  – Cache misses: m_A(c_A)·ref_A + m_B(c_B)·ref_B

[Figure: miss-rate curves for Process A and Process B (x-axis: Cache Space (%); y-axis: Miss-rate), with the cache allocation split into c_A for A and c_B for B]
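The miss-count expression above can be evaluated for every possible split of a small cache. The sketch below is only illustrative: the two miss-rate curves, the reference counts, and the 4-block cache are hypothetical values chosen for this example, not data from the slides.

```python
def total_misses(m_a, m_b, refs_a, refs_b, c_a, total_blocks):
    """m_A(c_A)*ref_A + m_B(c_B)*ref_B when A gets c_a blocks and B gets the rest."""
    c_b = total_blocks - c_a
    return m_a[c_a] * refs_a + m_b[c_b] * refs_b

# Hypothetical miss-rate curves: M[c] is the miss-rate with c cache blocks.
M_A = [1.0, 0.6, 0.5, 0.45, 0.4]
M_B = [1.0, 0.9, 0.5, 0.2, 0.1]
REFS_A, REFS_B = 1000, 1000   # hypothetical reference counts
TOTAL_BLOCKS = 4

# Evaluate every split and keep the one with the fewest total misses.
misses_by_split = {
    c_a: total_misses(M_A, M_B, REFS_A, REFS_B, c_a, TOTAL_BLOCKS)
    for c_a in range(TOTAL_BLOCKS + 1)
}
best_c_a = min(misses_by_split, key=misses_by_split.get)
print("blocks for A:", best_c_a, "blocks for B:", TOTAL_BLOCKS - best_c_a)
```

Exhaustive search like this is only practical for tiny caches and two processes; it is shown here to make the objective concrete.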

Finding the best allocation

• Use marginal gain: g(c) = m(c)·ref - m(c+1)·ref
  – The reduction in the number of misses from increasing the cache space by one block
• Allocate cache blocks to each process in a greedy manner
  – Guaranteed to result in the optimal partition if the m(c) are convex

[Figure: marginal gain (hits) vs. cache space (blocks, 0 to 4) for Process A and Process B. Starting with no cache block allocated, each step compares the next marginal gain of each process and allocates a block to the process with the larger one: 987 < 2111 → Process B; 987 < 1568 → Process B; 987 > 746 → Process A; 409 < 746 → Process B. Marginal-gain values shown: 2111, 1568, 987, 746, 409, 282, 250, 104.]
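A minimal sketch of the greedy allocation, assuming each process supplies a list of marginal gains. The function names are hypothetical. The gain values are taken from the slide's figure, but assigning them to Process A and Process B is an assumption reconstructed from the comparisons shown there.

```python
def marginal_gains(miss_rate, refs):
    """g(c) = m(c)*ref - m(c+1)*ref for c = 0 .. len(miss_rate) - 2."""
    return [miss_rate[c] * refs - miss_rate[c + 1] * refs
            for c in range(len(miss_rate) - 1)]

def greedy_partition(gains, total_blocks):
    """Give each block to the process whose next marginal gain is largest.

    gains[p][k] is the gain of giving process p its (k+1)-th block.
    """
    alloc = {p: 0 for p in gains}
    order = []
    for _ in range(total_blocks):
        # Only processes with a known next marginal gain can compete.
        candidates = [p for p in gains if alloc[p] < len(gains[p])]
        if not candidates:
            break
        winner = max(candidates, key=lambda p: gains[p][alloc[p]])
        alloc[winner] += 1
        order.append(winner)
    return alloc, order

# Assumed assignment of the figure's values to the two processes.
gains = {"A": [987, 409], "B": [2111, 1568, 746]}
alloc, order = greedy_partition(gains, total_blocks=4)
print(alloc, order)
```

Under that assumed assignment, the allocation order reproduces the four comparisons in the figure. The optimality guarantee relies on convex miss-rate curves, that is, on each gain list being non-increasing.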

Partitioning Results

• Partition the L2 cache amongst two simultaneous processes (SPEC2000 benchmarks: art and mcf)

[Figure: IPC (0 to 2.5) vs. L2 size (0.25, 0.5, 1, 2, 4 MB), comparing LRU and Partition]

Intuition for Memory-Aware Scheduling

• How to schedule 4 processes on a 2-processor system using individual miss-rate curves?
  – Working set size is larger than the cache for all processes
  – All processes result in a similar miss-rate if they have the entire cache
• Curves tend to have a knee
  – The amount of cache space where the marginal gain diminishes a lot
• Group processes based on the knees
  – Schedule A and C together, and B and D together

[Figure: miss-rate curves for Processes A, B, C, and D; x-axis: Cache Space (%), 0 to 100; y-axis: Miss-rate, 0 to 1]

Determining the Knee of the Curve

• Use the partitioning technique
• However, now we may need multiple time slices to schedule processes (2 time slices in our example)
• Available cache resource should be doubled

[Figure: marginal gain (hits) vs. cache space (blocks, 0 to 4) for Process A and Process B, with the resulting cache allocation marked; values shown: 2111, 1568, 987, 746, 409, 282, 250, 104]
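One way to read these bullets: run the greedy partitioning over all processes with the cache budget multiplied by the number of time slices, and treat each process's allocation as its knee. The sketch below does that, then picks the two-slice grouping with the smallest per-slice demand. The grouping rule, the function names, and the marginal-gain values are all assumptions made for illustration. The slides do not spell out how the groups are chosen.

```python
from itertools import combinations

def knees(gains, cache_blocks, time_slices):
    """Greedy partition over cache_blocks * time_slices blocks; the allocation is the knee."""
    alloc = {p: 0 for p in gains}
    for _ in range(cache_blocks * time_slices):
        candidates = [p for p in gains if alloc[p] < len(gains[p])]
        if not candidates:
            break
        winner = max(candidates, key=lambda p: gains[p][alloc[p]])
        alloc[winner] += 1
    return alloc

def pair_for_two_slices(knee):
    """Assumed rule: split processes into two equal groups, minimizing the larger knee sum."""
    procs = sorted(knee)
    best = None
    for group in combinations(procs, len(procs) // 2):
        if procs[0] not in group:
            continue  # skip mirror images of groupings already seen
        rest = tuple(p for p in procs if p not in group)
        worst = max(sum(knee[p] for p in group), sum(knee[p] for p in rest))
        if best is None or worst < best[0]:
            best = (worst, group, rest)
    return best[1], best[2]

# Hypothetical marginal gains for four processes sharing a 4-block cache.
gains = {
    "A": [100, 90, 80, 1],
    "B": [100, 85, 1],
    "C": [95, 2, 1],
    "D": [99, 98, 1, 1],
}
knee = knees(gains, cache_blocks=4, time_slices=2)
schedule = pair_for_two_slices(knee)
print(knee, schedule)
```

With these made-up gains, a process with a large knee is paired with one with a small knee, so each time slice asks for about one cache's worth of blocks.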

Scheduling Results

• Schedule 6 SPEC CPU benchmarks for 2 processors

[Figure: normalized miss-rate (0 to 2) vs. memory size (8, 16, 32, 64, 128, 256 MB) for the Worst schedule, the Best schedule, and the Scheduling Algorithm]

Analytical Model (ICS'01)

• Miss-rate curves (or marginal gains) alone may not be enough for optimizing time-shared systems
  – Partitioning amongst concurrent processes
  – Scheduling considering the effects of context switches
• Use an analytical model to predict cache-sharing effects

[Figure: miss-rate (0.02 to 0.05) vs. time quantum (number of cache accesses, 1 to 1,000,000, log scale), comparing LRU and Partition for a 32-KB 8-way set-associative cache running bzip2+gcc+swim+mesa+vortex+vpr+twolf+iu]

BUT…

• Processes to execute are only known at run-time
  – Users decide what applications to run
  – Scheduling/partitioning decisions should be made at run-time
• The behavior of a process changes over time
  – Applications have different phases
  – Miss-rate curves (and marginal gains) may change over an execution
• Cache configurations differ between systems
  – Miss-rate curves (and marginal gains) differ between systems
• Need an on-line estimation of miss-rate curves (and marginal gains)

On-Line Estimation of Marginal Gains: Fully-Associative Caches

• Marginal gains can be directly counted based on the temporal ordering of cache blocks (LRU information)
  – Use one counter per cache block (or per group of cache blocks) and one for counting all accesses
  – Hit on the i-th MRU block → increment the i-th counter
• Example: an FA cache with 4 blocks
  – Hit on the MRU cache block → increment the 1st counter (912 → 913); the access counter goes from 2432 to 2433
  – Hit on the 3rd MRU cache block → increment the 3rd counter (350 → 351); the access counter goes from 2433 to 2434

[Figure: marginal gain vs. cache space (blocks, 0 to 3) with values 913, 722, 351, 124, shown alongside the cache blocks in LRU order and their marginal-gain counters]
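The counting scheme can be sketched in software as an LRU stack with one counter per stack position. The class and method names are hypothetical, and a Python list stands in for the hardware LRU ordering. The miss-rate estimate follows from the counters: a hit at position i would also be a hit in any cache with more than i blocks.

```python
class LRUStackMonitor:
    """Fully-associative LRU cache with one marginal-gain counter per LRU position."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.stack = []                      # index 0 is the MRU block
        self.gain_counters = [0] * num_blocks
        self.accesses = 0

    def access(self, addr):
        self.accesses += 1
        if addr in self.stack:
            pos = self.stack.index(addr)     # hit on the (pos+1)-th MRU block
            self.gain_counters[pos] += 1
            self.stack.pop(pos)
        elif len(self.stack) == self.num_blocks:
            self.stack.pop()                 # miss in a full cache: evict the LRU block
        self.stack.insert(0, addr)           # the accessed block becomes MRU

    def miss_rate_curve(self):
        """Estimated m(c) for c = 0 .. num_blocks."""
        if self.accesses == 0:
            return [1.0] * (self.num_blocks + 1)
        curve, hits = [], 0
        for c in range(self.num_blocks + 1):
            curve.append(1 - hits / self.accesses)
            if c < self.num_blocks:
                hits += self.gain_counters[c]
        return curve

monitor = LRUStackMonitor(num_blocks=4)
for addr in ["a", "b", "c", "a", "a", "b"]:
    monitor.access(addr)
print(monitor.gain_counters, monitor.miss_rate_curve())
```

In this short trace, the two hits at the third MRU position mean a 3-block cache would have captured them, while a 2-block cache would not.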

BUT…

• Most caches are SET-ASSOCIATIVE
  – Except main memory
  – Usually up to 8-way associative
• Set-associative caches only maintain temporal ordering within a set
  – No global temporal ordering
• Cannot use block-by-block temporal ordering to obtain marginal gains for fully-associative caches

Way-Counters

• Way-Counters
  – Use the existing LRU information within a set
  – One counter per way (D-way caches → D counters)
  – Hit on the i-th MRU → increment the i-th counter
• Each way-counter represents the gain of having S more blocks (S is the number of sets)
• Example: a 4-way associative cache with S sets
  – Hit on the MRU cache block → increment the 1st counter (4384 → 4385)
  – Hit on the 2nd MRU cache block → increment the 2nd counter (376 → 377)
  – The remaining way-counters (121 and 31) are unchanged; the access counter goes from 5012 to 5014

[Figure: miss-rate (0 to 1) vs. cache size (blocks, 0 to 1024), comparing the Way-Counter estimate with the Fully-Associative curve; data labels 0.123, 0.0477, 0.0234, 0.0171]
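A minimal software sketch of way-counters, assuming the set index is the block address modulo the number of sets and that each set keeps its blocks in MRU-first order. Both are modelling choices for this example, and the class name is hypothetical.

```python
class WayCounterMonitor:
    """Set-associative LRU cache with one counter per way (LRU position within a set)."""

    def __init__(self, num_sets, num_ways):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.sets = [[] for _ in range(num_sets)]   # each set: MRU block first
        self.way_counters = [0] * num_ways
        self.accesses = 0

    def access(self, block_addr):
        self.accesses += 1
        lru = self.sets[block_addr % self.num_sets]  # assumed index function
        if block_addr in lru:
            pos = lru.index(block_addr)              # hit on the (pos+1)-th MRU way
            self.way_counters[pos] += 1
            lru.pop(pos)
        elif len(lru) == self.num_ways:
            lru.pop()                                # evict the set's LRU block
        lru.insert(0, block_addr)

    def estimated_miss_rates(self):
        """Estimated miss-rate for cache sizes 0, S, 2S, ..., D*S blocks."""
        if self.accesses == 0:
            return [1.0] * (self.num_ways + 1)
        return [1 - sum(self.way_counters[:k]) / self.accesses
                for k in range(self.num_ways + 1)]

monitor = WayCounterMonitor(num_sets=2, num_ways=2)
for block in [0, 2, 1, 0, 2, 1, 4, 0]:
    monitor.access(block)
print(monitor.way_counters, monitor.estimated_miss_rates())
```

Because each counter aggregates hits across all sets, the estimate is only available at multiples of S blocks. That is the coarse granularity the Way-Counter curve shows against the Fully-Associative one.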

Way+Set Counters

• Use more counters for more detailed information
  – Maintain the LRU information of sets
  – Hit on the i-th MRU way and j-th MRU set → increment counter(i,j)
• Example: a 2-way associative cache whose sets are divided into groups, with a temporal ordering of the set groups
  – Hit on the MRU way in the 2nd MRU group → increment counter(0,1)

[Figure: miss-rate (0 to 1) vs. cache size (blocks, 0 to 1024), comparing Way-Counter (2-way), Way+Set (8 Groups), Way+Set (16 Groups), and Fully-Associative]
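The way+set idea can be sketched by adding an LRU ordering over groups of sets. In this hypothetical model, consecutive sets form a group, the set index is the block address modulo the number of sets, and the group ordering is updated on every access, hit or miss. All three are assumptions. The slide only says that LRU information of sets is maintained.

```python
class WaySetMonitor:
    """Set-associative LRU cache with counters indexed by (way position, group position)."""

    def __init__(self, num_sets, num_ways, num_groups):
        assert num_sets % num_groups == 0
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.sets_per_group = num_sets // num_groups
        self.sets = [[] for _ in range(num_sets)]     # each set: MRU block first
        self.group_order = list(range(num_groups))    # MRU group first
        self.counters = [[0] * num_groups for _ in range(num_ways)]
        self.accesses = 0

    def access(self, block_addr):
        self.accesses += 1
        set_index = block_addr % self.num_sets        # assumed index function
        group = set_index // self.sets_per_group
        j = self.group_order.index(group)             # group's position before this access
        lru = self.sets[set_index]
        if block_addr in lru:
            i = lru.index(block_addr)                 # way's position before this access
            self.counters[i][j] += 1
            lru.pop(i)
        elif len(lru) == self.num_ways:
            lru.pop()                                 # evict the set's LRU block
        lru.insert(0, block_addr)
        # Assumed policy: the accessed group becomes the MRU group on hits and misses.
        self.group_order.pop(j)
        self.group_order.insert(0, group)

monitor = WaySetMonitor(num_sets=2, num_ways=2, num_groups=2)
for block in [0, 1, 0, 0, 3, 1]:
    monitor.access(block)
print(monitor.counters)
```

The third access in this trace hits the MRU way of a set in the second-MRU group, so it increments counter(0,1), the same case as the slide's example.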
