Online Cache Modeling for Commodity Multicore Processors

Online Cache Modeling for Commodity Multicore Processors Richard West, Puneet Zaroo, Carl A. Waldspurger and Xiao Zhang Contact: richwest@cs.bu.edu Computer Science

The “Big Picture” [Figure: system diagram with labels: Application threads, VM, VCPU, PCPU (Cores/HTs), Shared LLC, Socket, Interconnect]

Proliferation of CMPs • Chip Multiprocessors (CMPs) have multiple cores on the same chip • CMP cores usually share a last-level cache (LLC) and compete for memory bus bandwidth • Competition for microarchitectural resources by co-running workloads can lead to highly variable performance – Potential for poor performance isolation

The Software Challenge • CMPs manage shared h/w resources (e.g., cache space, memory bandwidth) in opaque manner to s/w • Software systems cannot easily optimize for efficient resource utilization or QoS without improved visibility and control over h/w resources – e.g., Cache conflict misses can incur several hundred clock cycle penalties for off-chip memory stalls

Hardware Solutions • Provide performance isolation using cache partitioning – Optimal partition size? – Utility of cache space to a workload? • Hardware-assisted miss-ratio (and miss-rate) curves (MRCs) – not applicable to commodity multicore processors

Improved Cache Management • Expose state of shared caches (and other microarchitectural resources) to OS / hypervisor – Fairer / more efficient co-scheduling – Reduced resource contention – How do we do this on commodity CMPs?

Current Software Solutions • Page coloring – Can reduce cache conflicts – Recoloring pages can be expensive for varying working set sizes and workloads • S/W-generated MRCs – Existing solutions require special h/w support • e.g., RapidMRC uses SDAR on POWER5 – Potentially high overhead • e.g., RapidMRC takes > 80ms on POWER5

Our Approach • Online cache modeling for commodity CMPs • Leverage commonly-available hardware performance counters – Construct cache occupancy estimators for individual workloads competing for cache – Construct cache performance curves (MRCs) using occupancy predictions – Low-cost and online

Basic Occupancy Model • Leverage two performance events: – local misses by thread τ_l: m_l – misses by every other thread τ_o sharing the cache: m_o – Misses drive cache line fills • Assume C cache lines accessed uniformly at random • E' = E + (1 − E/C)·m_l − (E/C)·m_o • E' = updated occupancy of τ_l, E = old value
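A minimal Python sketch of the basic update, assuming the per-interval miss counts have already been read from the performance counters. The function name and the clamping of the result to [0, C] are illustrative assumptions, not taken from the slides.

```python
def update_occupancy_basic(E, C, m_l, m_o):
    """Return E', the updated occupancy estimate (in cache lines) of thread tau_l.

    E   : previous occupancy estimate of tau_l, in lines
    C   : total number of lines in the shared cache
    m_l : misses by tau_l during the sampling interval
    m_o : misses by all other threads sharing the cache
    """
    # A local miss lands on a line not owned by tau_l with probability (1 - E/C);
    # a miss by another thread lands on a line of tau_l with probability E/C.
    E_new = E + (1.0 - E / C) * m_l - (E / C) * m_o
    # Assumption: clamp so that large miss bursts cannot push the estimate
    # outside the physical cache size.
    return min(max(E_new, 0.0), float(C))


if __name__ == "__main__":
    C = 4 * 1024 * 1024 // 64  # 4 MB cache with 64-byte lines
    print(update_occupancy_basic(0.0, C, m_l=100, m_o=100))
```

Starting from an empty estimate, every local miss adds a line; at half occupancy, equal local and remote miss counts leave the estimate unchanged.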

Extended Occupancy Model • Basic approach assumes uniform cache-line access • Set associativity and LRU line replacement break this assumption • Add support for likelihood of line reuse – Use cache hit information

Extended Occupancy Model • Uses four performance events: – As for basic model, plus local hits (h_l) and hits by all other threads (h_o) • Now: E' = E·(1 − m_o·p_l) + (C − E)·m_l·p_o (Equation 1) • p_l is the probability that a miss falls on a line belonging to τ_l • p_o is the probability that a miss falls on a line belonging to τ_o

Reuse Frequency • Approximate LRU with LFU: – Model cache-line reuse by τ_l and τ_o, respectively, as: r_l = (h_l + m_l) / E and r_o = (h_o + m_o) / (C − E)

Approximating LRU Effects • Model evictions due to misses as inversely proportional to reuse frequencies: p_o / p_l = r_l / r_o • Given that a miss must fall on some line: p_l·E + p_o·(C − E) = 1 • Can calculate p_l and p_o and substitute into Equation 1
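Solving the two constraints above for the probabilities gives p_l = r_o / (r_o·E + r_l·(C − E)) and p_o = r_l / (r_o·E + r_l·(C − E)). The Python sketch below substitutes these into Equation 1. The function names, the fallback to uniform probabilities (p_l = p_o = 1/C) when E is 0 or C or when no accesses were observed, and the final clamping are assumptions made for illustration; the slides do not say how these edge cases are handled.

```python
def eviction_probabilities(E, C, h_l, m_l, h_o, m_o):
    """Return (p_l, p_o), solved from p_o/p_l = r_l/r_o and p_l*E + p_o*(C-E) = 1."""
    uniform = (1.0 / C, 1.0 / C)
    if E <= 0 or E >= C:
        # Assumption: reuse frequencies are undefined here, so fall back to
        # the uniform-access case of the basic model.
        return uniform
    r_l = (h_l + m_l) / E        # reuse frequency of tau_l's lines
    r_o = (h_o + m_o) / (C - E)  # reuse frequency of the other threads' lines
    denom = r_o * E + r_l * (C - E)
    if denom == 0:
        return uniform           # no accesses observed in this interval
    return r_o / denom, r_l / denom


def update_occupancy_extended(E, C, h_l, m_l, h_o, m_o):
    """Apply Equation 1: E' = E*(1 - m_o*p_l) + (C - E)*m_l*p_o."""
    p_l, p_o = eviction_probabilities(E, C, h_l, m_l, h_o, m_o)
    E_new = E * (1.0 - m_o * p_l) + (C - E) * m_l * p_o
    # Assumption: clamp to the physical cache size.
    return min(max(E_new, 0.0), float(C))


if __name__ == "__main__":
    # tau_l reuses its lines far more often than its co-runners do, so its
    # lines are less likely to be chosen as victims (p_l < p_o).
    print(eviction_probabilities(50.0, 100, h_l=90, m_l=10, h_o=10, m_o=10))
```

With equal reuse frequencies the probabilities reduce to 1/C and Equation 1 collapses to the basic model's update.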

Occupancy Experiments • Used Intel’s CMPSched$im – Binary execution of SPEC workloads – Modeled 2- and 4-core CMPs • 32KB 4-way per-core L1 • 4MB 16-way shared L2 • 64 byte cache line size – Sample perf counters every 1ms – Average occupancies over 100 ms intervals

Occupancy Results • Quad-core, 4 co-runners (3 shown) [Figure: occupancy plots for mcf, art00, wupwise00]

Occupancy Results • Quad-core, 10 co-runners (3 shown) [Figure: occupancy plots for mcf, art00, wupwise00] • Model tolerant of over-committed situations

Cache Performance Curves • Modeled performance (MPKI, MPKR, MPKC, CPKI,…) as a function of cache occupancy • Implemented CAFÉ scheduling framework in VMware ESX Server – 4-core 2.0 GHz Intel Xeon E5535 w/ 4GB RAM and 4MB L2 cache per 2 cores – Update workload occupancies every 2ms using basic model (2 perf ctrs) • 320 cycles overhead for occupancy update function

Online Generation of Utility Curves • Curve Types – Miss-ratio curve, y-axis being Misses-Per-Kilo-Instructions – Miss-rate curve, y-axis being Misses-Per-Kilo-Cycles – CPKI curve, y-axis being Cycles-Per-Kilo-Instructions • Implementation issues – Monotonicity enforcement – Lack of updates across entire cache – Duty-cycle modulation enforcement – MPKC curves sensitive to memory bandwidth contention [Figure: mcf running under different amounts of memory read bandwidth]
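The y-axis quantities of the three curve types can be computed from counter deltas over a sampling interval. This is a small Python sketch; the function name and the dictionary layout are illustrative assumptions, and the inputs are assumed to be non-zero deltas read from the hardware counters.

```python
def curve_metrics(misses, instructions, cycles):
    """Compute per-interval curve values from performance-counter deltas.

    misses       : last-level cache misses in the interval
    instructions : instructions retired in the interval (assumed > 0)
    cycles       : unhalted cycles in the interval (assumed > 0)
    """
    return {
        "MPKI": 1000.0 * misses / instructions,  # misses per kilo-instruction
        "MPKC": 1000.0 * misses / cycles,        # misses per kilo-cycle
        "CPKI": 1000.0 * cycles / instructions,  # cycles per kilo-instruction
    }


if __name__ == "__main__":
    print(curve_metrics(misses=50, instructions=10_000, cycles=20_000))
```

Each value would then be attributed to the occupancy estimate in effect during the same interval to form one point on the corresponding curve.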

MRC Results • Quantized into 8 occupancy buckets • Configurable interval for curve generation frequency (here, several seconds) • Expect monotonicity – Higher cache occupancy, fewer misses per instruction – Except on phase changes • Monotonic enforcement algorithm updates MRC readings in order of bucket reference (highest to lowest)
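The slides do not give the details of the monotonic enforcement algorithm, so the Python sketch below is only one plausible reading: occupancy is quantized into 8 buckets, and the curve is walked from the highest-occupancy bucket to the lowest, raising any reading that falls below the value already seen at a higher occupancy. The function names, the use of None for buckets with no samples, and the running-maximum rule are all assumptions.

```python
def bucket_index(E, C, n_buckets=8):
    """Map an occupancy estimate E (lines) to one of n_buckets equal-width buckets."""
    idx = int(E * n_buckets / C)
    return min(max(idx, 0), n_buckets - 1)  # E == C falls in the top bucket


def enforce_monotonic(mpki_by_bucket):
    """Return a copy in which MPKI never increases with occupancy.

    mpki_by_bucket[i] is the reading for bucket i (index 0 = lowest occupancy),
    or None if that bucket has not been observed yet.
    """
    result = list(mpki_by_bucket)
    floor = None  # largest reading seen so far at any higher occupancy
    for i in range(len(result) - 1, -1, -1):  # highest bucket to lowest
        value = result[i]
        if value is None:
            continue  # no samples for this occupancy range; leave the gap
        if floor is not None and value < floor:
            result[i] = floor  # less cache should not mean fewer misses
        else:
            floor = value
    return result


if __name__ == "__main__":
    raw = [10.0, 4.0, None, 6.0, 5.0, 5.0, 3.0, 2.0]
    print(enforce_monotonic(raw))
```

Unobserved buckets are left empty rather than interpolated, which keeps the gap caused by incomplete occupancy coverage visible.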

Online MRC: Accuracy • 6 apps on 2 cores sharing L2, each in a single-CPU VM • Using page-coloring measurement as comparison baseline

Online MRC: Case Study • Running mcf with different co-runners [Figure panels: before monotonic enforcement; after monotonic enforcement]

Application of Utility Curves • Guidance to improve fairness – CPU time compensation based on estimated performance degradation due to CMP resource contention • Guidance to improve performance – Smart scheduling placement based on predicted cache space allocation among co-runners

Future Work • Application of occupancy prediction to hardware-aided cache partitioning / enforcement • Investigate techniques to improve coverage of cache space (0-100%) for utility curve generation – Co-runner interference control – MRCs at different time granularities • Online phase change detection
