SLIDE 1

VANTAGE: SCALABLE AND EFFICIENT FINE-GRAIN CACHE PARTITIONING

Daniel Sanchez and Christos Kozyrakis
Stanford University
ISCA-38, June 6th 2011

SLIDE 2

Executive Summary

• Problem: Interference in shared caches
  • Lack of isolation → no QoS
  • Poor cache utilization → degraded performance
• Cache partitioning addresses interference, but current partitioning techniques (e.g. way-partitioning) have serious drawbacks
  • Support few coarse-grain partitions → do not scale to many-cores
  • Hurt associativity → degraded performance
• Vantage solves the deficiencies of previous partitioning techniques
  • Supports hundreds of fine-grain partitions
  • Maintains high associativity
  • Strict isolation among partitions
  • Enables cache partitioning in many-cores

SLIDE 3

Outline

• Introduction
• Vantage Cache Partitioning
• Evaluation

SLIDE 4

Motivation

• Fully shared last-level caches are the norm in multi-cores
  ✓ Better cache utilization, faster communication, cheaper coherence
  ✗ Interference → performance degradation, no QoS
• An increasingly important problem due to more cores per chip and to virtualization and consolidation (datacenter/cloud)
  • Major performance and energy losses due to cache contention (~2x)
  • Consolidation opportunities lost to maintain SLAs

[Diagram: eight cores with private L2s sharing an LLC, running VMs 1-6]

SLIDE 5

Cache Partitioning

• Cache partitioning: divide cache space among competing workloads (threads, processes, VMs)
  ✓ Eliminates interference, enabling QoS guarantees
  ✓ Partition sizes can be adjusted to maximize performance or fairness, satisfy SLAs...
  ✗ Previously proposed partitioning schemes have major drawbacks

[Diagram: the same shared LLC, now divided into per-VM partitions]

SLIDE 6

Cache Partitioning = Policy + Scheme

• Cache partitioning consists of a policy (decide partition sizes to achieve a goal, e.g. fairness) and a scheme (enforce those sizes)
• The focus here is on the scheme
• For the policy to be effective, the scheme should be:
  1. Scalable: can create hundreds of partitions
  2. Fine-grain: partition sizes specified in cache lines
  3. Strict isolation: partition performance does not depend on other partitions
  4. Dynamic: can create, remove, and resize partitions efficiently
  5. Maintains associativity
  6. Independent of replacement policy
  7. Simple to implement
  (5-7 maintain high cache performance)

SLIDE 7

Existing Schemes with Strict Guarantees

• Based on restricting line placement
• Way-partitioning: restrict insertions to specific ways (see the sketch below)

[Plot: IPC improvement over a shared 16-way cache (%) for mix1 and mix2 under way-partitioning]
[Diagram: an 8-way set with ways 0-7 split between two partitions]

✓ Strict isolation   ✓ Dynamic   ✓ Independent of replacement policy   ✓ Simple
✗ Few coarse-grain partitions   ✗ Hurts associativity
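To make the mechanism concrete, here is a minimal Python sketch of victim selection under way-partitioning; the way assignment and LRU ages are illustrative assumptions, not details from the talk:

```python
# Minimal sketch of way-partitioning: each partition owns a fixed subset
# of a set's ways, so victims are chosen only among those ways. Isolation
# is strict, but a partition with k ways has only k-way associativity.

def pick_victim(owned_ways, lru_age):
    """Return the way to replace: the oldest line among the ways the
    inserting partition owns (higher age = closer to LRU)."""
    return max(owned_ways, key=lambda way: lru_age[way])

# Example: an 8-way set where the inserting partition owns ways 0-2.
ages = [5, 1, 7, 2, 9, 0, 3, 4]          # per-way LRU ages (illustrative)
assert pick_victim([0, 1, 2], ages) == 2  # way 4 is older, but off-limits
```

With 8 ways this caps the design at 8 coarse partitions, and each partition only sees its own ways when replacing, which is exactly the scalability and associativity problem the slide flags.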

SLIDE 8

Existing Schemes with Soft Guarantees

• Based on tweaking the replacement policy
• PIPP [ISCA 2009]: lines are inserted and promoted in the LRU chain depending on the partition they belong to (see the sketch below)

[Plot: IPC improvement over a shared 16-way cache (%) for mix1 and mix2 under way-partitioning and PIPP]
[Diagram: an 8-way set managed as a single LRU chain with per-partition insertion points]

✓ Dynamic   ✓ Maintains associativity   ✓ Simple
✗ Few coarse-grain partitions   ✗ Weak isolation   ✗ Sacrifices replacement policy
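A rough Python sketch of the PIPP idea for a single set, assuming one insertion position per partition and single-step promotion on hits (real PIPP also promotes probabilistically; the ISCA 2009 paper gives the actual algorithm):

```python
# Rough sketch of PIPP on one set: the LRU chain is shared, but each
# partition inserts at its own position, so partitions inserting near the
# LRU end naturally keep less space. Positions and ways are illustrative.

class PIPPSet:
    def __init__(self, ways, insert_pos):
        self.ways = ways
        self.insert_pos = insert_pos  # partition ID -> insertion position
        self.chain = []               # index 0 = MRU end, last = LRU end

    def access(self, tag, part):
        if tag in self.chain:
            i = self.chain.index(tag)
            if i > 0:                 # hit: promote by a single position
                self.chain[i - 1], self.chain[i] = self.chain[i], self.chain[i - 1]
        else:
            if len(self.chain) == self.ways:
                self.chain.pop()      # miss: evict from the LRU end
            pos = min(self.insert_pos[part], len(self.chain))
            self.chain.insert(pos, tag)

# Partition 0 (large allocation) inserts near MRU; partition 1 near LRU,
# so partition 1's lines are evicted sooner unless they are reused.
s = PIPPSet(ways=8, insert_pos={0: 0, 1: 6})
for t in ["a0", "a1", "a2"]:
    s.access(t, part=0)
s.access("b0", part=1)
print(s.chain)  # ['a2', 'a1', 'a0', 'b0']
```

Because all partitions still share one chain, a burst of insertions from one partition can push another's lines toward the LRU end, which is the weak isolation the slide notes.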

SLIDE 9

Comparison of Schemes

                   Scalable &  Strict     Dynamic  Maintains  Indep. of     Simple  Partitions
                   fine-grain  isolation           assoc.     repl. policy          whole cache
Way partitioning   ✗           ✓          ✓        ✗          ✓             ✓       ✓
PIPP               ✗           ✗          ✓        ✓          ✗             ✓       ✓
Reconfig. caches   ✗           ✓          ✗        ✓          ✓             ✗       ✓
Page coloring      ✗           ✓          ✗        ✓          ✓             ✓       ✓
Vantage            ✓           ✓          ✓        ✓          ✓             ✓       most

SLIDE 10

Outline

• Introduction
• Vantage Cache Partitioning
• Evaluation

SLIDE 11

Vantage Design Overview

1. Use a highly-associative cache (e.g. a zcache)
2. Logically divide the cache into managed and unmanaged regions
3. Logically partition the managed region
• Leverage the unmanaged region to allow many partitions with minimal interference

SLIDE 12

Analytical Guarantees

• Vantage can be completely characterized using analytical models
  ✓ We can prove that strict guarantees are kept on partition sizes and interference, independently of the workload
  ✗ The paper has too much math to describe it here
• We now focus on the intuition behind the math

Among the equations on the slide:
  • Associativity distribution with R replacement candidates: if $E_1, \dots, E_R \sim U[0,1]$ i.i.d. and $A = \max\{E_1, \dots, E_R\}$, then $F_A(x) = P(A \le x) = x^R$ for $x \in [0,1]$
  • Aperture of partition $i$ with churn $C_i$ and size $S_i$: $A_i = \frac{1}{R} \cdot \frac{S_m}{S_i} \cdot \frac{C_i}{\sum_{k=1}^{P} C_k}$, where $S_m$ is the managed region size
  • Worst-case total growth over target sizes is bounded by $\frac{1}{A_{\max} R}$

SLIDE 13

ZCache [MICRO 2010]

• A highly-associative cache with a low number of ways
• Hits take a single lookup
• On a miss, the replacement process provides many replacement candidates (sketched below)
• Provides cheap high associativity (e.g. associativity equivalent to 64 ways with a 4-way cache)
• Achieves analytical guarantees on associativity

[Diagram: a 3-way zcache; the line address is hashed by H0, H1, H2 to produce a different index into each way]
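A simplified Python sketch of how the replacement walk collects many candidates from only 3 ways; the hash function and array layout are stand-ins, and the relocation step that actually frees the victim's slot is omitted:

```python
# Simplified zcache candidate walk (hashes and layout are illustrative).
# Each way has its own hash, so a line has one possible position per way.
# On a miss, the lines in the incoming address's positions are candidates;
# each of those lines could be relocated to ITS other positions, whose
# occupants become candidates too, and so on, until R candidates are found.
import hashlib

WAYS, SETS = 3, 1024
array = [[None] * SETS for _ in range(WAYS)]  # array[way][index] = line addr

def h(way, addr):
    digest = hashlib.blake2b(f"{way}:{addr}".encode(), digest_size=4).digest()
    return int.from_bytes(digest, "little") % SETS

def candidates(addr, R):
    frontier, found = [addr], []
    while frontier and len(found) < R:
        a = frontier.pop(0)
        for way in range(WAYS):
            line = array[way][h(way, a)]
            if line is not None and line not in found:
                found.append(line)     # a valid replacement candidate
                frontier.append(line)  # its other positions extend the walk
                if len(found) == R:
                    break
    return found
```

Hits are unaffected: a block always sits in one of its own 3 positions, so a lookup probes just 3 locations; only misses pay for the walk, which is what makes a large R (e.g. 52) cheap.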

SLIDE 14

Analytical Associativity Guarantees

• Eviction priority: the rank of a line given by the replacement policy (e.g. LRU), normalized to [0,1]
  • Higher is better to evict (e.g. the LRU line has priority 1.0, the MRU line 0.0)
• Associativity distribution: the probability distribution of the eviction priorities of evicted lines
• In a zcache, the associativity distribution depends only on the number of replacement candidates (R)
  • Independent of ways, workload, and replacement policy
  • With R=8, 17% of evictions happen to the 80% least evictable lines
  • With R=64, only 10^-6 of evictions happen to the 80% least evictable lines
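These two numbers follow directly from the evicted line's priority being the maximum of R (approximately) uniform draws, so the chance that an eviction lands in the bottom 80% is 0.8^R; a quick check:

```python
# The evicted line's priority is the max of R ~uniform draws, so the
# chance an eviction hits the 80% least evictable lines is 0.8**R.
for R in (8, 64):
    print(f"R={R}: {0.8 ** R:.2g}")
# R=8:  0.17    (17% of evictions)
# R=64: 6.3e-07 (~10^-6 of evictions)
```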

SLIDE 15

Managed-Unmanaged Region Division

• Logical division (tag each block as managed/unmanaged)
• The unmanaged region is large enough to absorb most evictions
• The unmanaged region is still used: it acts as a victim cache (demotion → eviction)
• This gives a single partition (the managed region) with a guaranteed size

[Diagram: insertions go into the managed region; demotions move lines from the managed to the unmanaged region; evictions happen from the unmanaged region]

SLIDE 16

Multiple Partitions in Managed Region

• P partitions + the unmanaged region
• Each line is tagged with its partition ID (0 to P-1)
• On each miss (see the sketch below):
  • Insert the new line into the corresponding partition
  • Demote one of the candidates to the unmanaged region
  • Evict from the unmanaged region

[Diagram: insertions go into partitions 0-3; demotions move lines into the unmanaged region; evictions happen from the unmanaged region]
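A small Python sketch of this per-miss flow over a flat candidate list; the tuple layout and priorities are illustrative, and the choice of which managed line to demote is refined by the aperture mechanism on the next slides:

```python
# Sketch of the per-miss flow: evict only from the unmanaged region and
# demote one managed candidate into it. Data layout is illustrative.
UNMANAGED = -1  # partition tag for unmanaged lines

def handle_miss(cands):
    """cands: list of (addr, partition, evict_priority) candidates.
    Returns (line_to_evict, line_to_demote). The unmanaged region is
    sized so that unmanaged candidates are (almost) always seen."""
    unmanaged = [c for c in cands if c[1] == UNMANAGED]
    managed   = [c for c in cands if c[1] != UNMANAGED]
    victim  = max(unmanaged, key=lambda c: c[2])  # eviction leaves the cache
    demoted = max(managed,   key=lambda c: c[2])  # demotion just retags the line
    return victim, demoted

cands = [("A", 0, 0.3), ("B", UNMANAGED, 0.9), ("C", 2, 0.7), ("D", UNMANAGED, 0.4)]
victim, demoted = handle_miss(cands)
assert victim[0] == "B" and demoted[0] == "C"  # then insert the new line
```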

SLIDE 17

Churn-Based Management

• Problem: always demoting from the inserting partition does not scale
  • We could demote from partition 0, but there are only 3 candidates from it
  • With many partitions, we might not even see a candidate from the inserting partition!
• Instead, demote so that each partition's demotion rate matches its insertion rate (churn)

Example:
  1. Access A (partition 2) → HIT
  2. Access B (partition 0) → MISS: get the replacement candidates (16 total: 3 from P0, 4 from P1, 1 from P2, 5 from P3, 3 unmanaged), evict from the unmanaged region, and insert the new line (in partition 0)

SLIDE 18

Churn-Based Management

• Aperture: the portion of candidates to demote from each partition (see the sketch below)

Example apertures: P0: 23%, P1: 15%, P2: 12%, P3: 11%

1) Miss in partition 0: the unmanaged candidate with the highest eviction priority is evicted, and a candidate in the top 11% of P3 is demoted
2) Miss in partition 1: a line is evicted, but nothing is demoted (all candidates are above their partitions' apertures!)
3) Miss in partition 3: a line is evicted, and candidates in the top 23% of P0 and the top 15% of P1 are demoted
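The same rule as Python, with the demotion condition made explicit; the aperture values are the slide's, everything else is an illustrative assumption:

```python
# Churn-based demotion: on any miss, demote every candidate whose eviction
# priority lies within its own partition's aperture (its most evictable
# fraction), independently of which partition missed.
UNMANAGED = -1
apertures = {0: 0.23, 1: 0.15, 2: 0.12, 3: 0.11}  # values from the slide

def demotions(cands):
    """cands: (addr, partition, evict_priority) with priority in [0,1];
    a candidate in the top A_i of partition i gets demoted."""
    return [addr for addr, part, prio in cands
            if part != UNMANAGED and prio >= 1.0 - apertures[part]]

# A P3 candidate with priority 0.95 is in P3's top 11% (>= 0.89) -> demoted;
# the P0 and P1 candidates fall below their apertures -> kept.
print(demotions([("A", 3, 0.95), ("B", 0, 0.50), ("C", 1, 0.70)]))  # ['A']
```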

SLIDE 19

Managing Apertures

• Set each aperture so that partition churn = demotion rate
  • Instantaneous partition sizes vary a bit, but sizes are maintained
  • The unmanaged region prevents interference
• Each partition requires an aperture proportional to its churn/size ratio
  • Higher churn → more frequent insertions (and demotions!)
  • Larger size → we see lines from that partition more often
• A partition's aperture determines its associativity
  • Higher aperture → less selective → lower associativity

SLIDE 20

Stability

• In partitions with a high churn/size ratio, controlling the aperture is sometimes not enough to keep the size
  • e.g. a 1-line partition that misses all the time
• To keep high associativity, set a maximum aperture Amax (e.g. 40%)
• If a partition needs Ai > Amax, we just let it grow
• Key result: regardless of the number of partitions that need to grow beyond their target, the worst-case total growth over their target sizes is bounded and small!
  • Bounded by $\frac{1}{A_{\max} R}$: with R=52 and Amax=0.4, $1/(0.4 \times 52) \approx 4.8\%$, i.e. ~5% of the cache
• Simply size the unmanaged region with that much extra slack
• Stability and scalability are guaranteed

SLIDE 21

A Simple Vantage Controller

• Directly implementing these techniques is impractical
  • Must constantly compute apertures and estimate churns
  • Need to know the eviction priorities of every block
• Solution: use negative feedback loops to derive the apertures and the lines below each aperture
  • Practical implementation
  • Maintains the analytical guarantees

SLIDE 22

Feedback-Based Aperture Control

• Adjust each aperture by letting the partition size (Si) grow over its target (Ti), as sketched below
• Needs a small amount of extra space in the unmanaged region
  • e.g. 0.5% of the cache with R=52, Amax=0.4, slack=10%

[Plot: aperture Ai vs. partition size Si, rising from 0 at Ti to Amax at (1+slack)·Ti]
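A sketch of what such a feedback rule can look like, matching the plot: the aperture ramps from 0 at the target size up to Amax once the partition has used its slack. The linear ramp is my reading of the plot; the paper specifies the actual controller:

```python
# Feedback-based aperture: no churn estimation needed. A partition that
# grows past its target demotes more aggressively, shrinking it back.
def aperture(S_i, T_i, A_max=0.4, slack=0.1):
    """Ramp from 0 at S_i = T_i to A_max at S_i = (1 + slack) * T_i."""
    overshoot = (S_i - T_i) / (slack * T_i)
    return A_max * min(1.0, max(0.0, overshoot))

# A partition 5% over its 1000-line target (half its slack) gets Amax/2:
assert abs(aperture(S_i=1050, T_i=1000) - 0.2) < 1e-12
```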

SLIDE 23

Implementation Costs

• See paper for detailed implementation
• Cache controller: 256 bits of state per partition (partition 0 ... partition P-1)
• Tag array: each tag adds a partition ID field (6b) and a coarse timestamp (8b) next to the line address and coherence/valid bits (sketched below)
• Vantage replacement logic: simple, ~10 adders and comparators; the logic is not on the critical path
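As a rough picture of the per-line overhead (bit widths from the slide, field names mine):

```python
# Per-line tag extension from the slide: 6b partition ID + 8b timestamp.
from dataclasses import dataclass

@dataclass
class VantageTagExtras:
    partition: int  # 6 bits -> up to 64 partition IDs (incl. unmanaged)
    timestamp: int  # 8-bit coarse-grain LRU timestamp (wraps mod 256)

    def pack(self) -> int:
        """The 14 extra tag bits, packed."""
        assert 0 <= self.partition < 64 and 0 <= self.timestamp < 256
        return (self.partition << 8) | self.timestamp
```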

SLIDE 24

Vantage Summary

• Use a cache with associativity guarantees
• Maintain an unmanaged region
• Match insertion and demotion rates in each partition
  • Partitions help each other evict lines → maintained associativity
  • The unmanaged region guarantees isolation and stability
• Use negative feedback to simplify the implementation

SLIDE 25

Outline

• Introduction
• Vantage Cache Partitioning
• Evaluation

SLIDE 26

Methodology

• Simulations of small (4-core) and large (32-core) systems
  • Private L1s, shared non-inclusive L2, 1 partition/core
• Partitioning policy: utility-based cache partitioning [ISCA'06]
  • Assign more space to the threads that can use it better
• Partitioning schemes: way-partitioning, PIPP, Vantage
• Workloads: 350 multiprogrammed mixes from SPEC CPU2006 (full suite)

SLIDE 27

Small-Scale: 4 cores, 4 partitions

• Each line shows the throughput improvement versus an unpartitioned 16-way set-associative cache
• Way-partitioning and PIPP degrade throughput for 45% of the workloads

SLIDE 28

Small-Scale: 4 cores, 4 partitions

• Vantage works best on zcaches
• We use Vantage on a 4-way zcache with R=52 replacement candidates

SLIDE 29

Small-Scale: 4 cores, 4 partitions

• Vantage improves throughput for most workloads
• 6.2% throughput improvement (gmean); 26% for the 50 most memory-intensive workloads

SLIDE 30

Large-Scale: 32 cores, 32 partitions

• Way-partitioning and PIPP use a 64-way set-associative cache
• Both degrade throughput for most workloads

SLIDE 31

Large-Scale: 32 cores, 32 partitions

• Vantage uses the same Z4/52 zcache as the 4-core system
• Vantage improves throughput for most workloads → scalable

SLIDE 32

A Closer Look: Sizes & Associativity

[Plots: partition sizes and associativity over time, Vantage vs. way-partitioning]

• Vantage maintains strict partition sizes
• Vantage maintains high associativity even in the worst case

SLIDE 33

Additional Results (see paper)

• Vantage maintains strict control of partition sizes
• Vantage maintains high associativity
• Unmanaged region size vs. isolation tradeoff
  • ~5% unmanaged region → moderate isolation
  • ~20% unmanaged region → strict isolation
• Validation of the analytical models
• Vantage on set-associative caches
  • Loses the analytical guarantees, but still outperforms the other schemes
• Vantage with other replacement policies (RRIP)

SLIDE 34

Conclusions

• Vantage enables cache partitioning for many-cores
  • Tens to hundreds of fine-grain partitions
  • High associativity per partition
  • Strict isolation among partitions
• Derived from analytical models, with bounds independent of the number of partitions and cache ways
• Simple to implement

SLIDE 35

THANK YOU FOR YOUR ATTENTION

QUESTIONS?

SLIDE 36

Backup: Associativity Guarantees

• Why does a zcache produce uniformly random replacement candidates, independently of the access pattern?
• ZCache hashing and the replacement scheme eliminate spatial locality
• Evictions have negligible temporal locality w.r.t. the cache
  • Evictions of the same block are widely separated in time
  • NOTE: invalidations (e.g. due to coherence) are not evictions
• No locality → uniformly random

SLIDE 37

Backup: Setpoint-Based Demotions

• Derive the portion of lines below the aperture without tracking eviction priorities (see the sketch below)
• Coarse-grain timestamp LRU replacement
  • Tag each block with an 8-bit per-partition LRU timestamp
  • Increment the partition's timestamp every Si/16 accesses
• Demote every candidate below the setpoint timestamp
• Adjust the setpoint using negative feedback

[Diagram: distribution of a partition's lines across timestamps 0-255; candidates below the setpoint timestamp are demoted, those between the setpoint and the current timestamp are kept]
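A Python sketch of this controller; the feedback step size and the initial threshold are assumptions, and the real logic is specified in the paper:

```python
# Setpoint-based demotions: approximate "is this line within the aperture?"
# with coarse per-partition timestamps and a feedback-adjusted age threshold.
class PartitionState:
    def __init__(self, target_size):
        self.cur_ts = 0          # 8-bit coarse timestamp
        self.age_threshold = 8   # demote lines at least this old (feedback-tuned)
        self.accesses = 0
        self.size = 0            # current partition size, in lines
        self.target = target_size

    def on_access(self):
        """Advance the coarse clock every size/16 accesses (per the slide)."""
        self.accesses += 1
        if self.accesses >= max(1, self.size // 16):
            self.cur_ts = (self.cur_ts + 1) % 256
            self.accesses = 0

    def should_demote(self, line_ts):
        """A candidate older than the setpoint is treated as below the aperture."""
        age = (self.cur_ts - line_ts) % 256  # wraparound-safe age
        return age >= self.age_threshold

    def feedback(self):
        """Negative feedback on size: too big -> demote more (lower the
        threshold); too small -> demote less (raise it)."""
        if self.size > self.target:
            self.age_threshold = max(1, self.age_threshold - 1)
        else:
            self.age_threshold = min(255, self.age_threshold + 1)
```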

SLIDE 38

A Closer Look: Partition Sizes

Way-partitioning: ✗ coarse-grain partitions, ✓ strict sizes, ✗ slow convergence
PIPP: ✗ coarse-grain partitions, ✗ approximate sizes, ✗ no convergence
Vantage: ✓ fine-grain partitions, ✓ strict sizes, ✓ fast convergence

SLIDE 39

Unmanaged Size vs Isolation Trade-off

• A larger unmanaged region reduces UCP performance slightly, but gives excellent isolation
• Simulations match the analytical models
• See paper for additional results (Vantage on set-associative caches, other replacement policies, etc.)