KPart: A Hybrid Cache Sharing-Partitioning Technique for Commodity Multicores
Nosayba El-Sayed, Anurag Mukkara, Po-An Tsai, Harshad Kasture, Xiaosong Ma, Daniel Sanchez
Cache partitioning in commodity multicores
• Partitioning the last-level cache among co-running apps can reduce interference → improved system performance
✔ Recent processors offer hardware cache-partitioning support!
✖ Two key challenges limit its usability → they can hurt system performance!
[Diagram: App 1–App 4 on a multicore sharing the last-level cache and DRAM]
KPart tackles these limitations, unlocking significant performance on real hardware (avg gain: 24%, max: 79%), and is publicly available.
• Real-system example (benchmarks: SPEC-CPU2006, PBBS)
• Baseline: NoPart (all apps share all ways)
• Conventional policy: per-app, utility-based cache partitioning (UCP), driven by each application's cache profile (miss curve)
[Diagram: 12MB last-level cache with 12 ways (Way0–Way11); the smallest partition size is a single 1MB way]
Conventional policies yield small partitions with few ways: low associativity → more misses.
In this example, throughput degrades by 3.8%.
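The per-app UCP policy the slide mentions can be sketched as a greedy marginal-utility loop over miss curves. This is a simplified illustration (real UCP uses the Lookahead algorithm to handle non-convex curves); it also shows the low-associativity problem: with 8 apps on a 12-way cache, apps with flat miss curves are left with a single way each.

```python
import numpy as np

def ucp_allocate(miss_curves, n_ways):
    """Greedy utility-based (UCP-style) way allocation.

    miss_curves: one array per app, miss_curves[i][w] = misses of app i
    when given w ways (indices 0..n_ways). Every app starts with one way
    (the hardware minimum); each remaining way goes to the app whose
    misses drop the most from one extra way.
    """
    alloc = [1] * len(miss_curves)
    for _ in range(n_ways - len(miss_curves)):
        # marginal utility of one more way for each app
        gains = [mc[a] - mc[a + 1] if a + 1 < len(mc) else 0.0
                 for mc, a in zip(miss_curves, alloc)]
        alloc[gains.index(max(gains))] += 1
    return alloc
```

With seven cache-insensitive apps and one cache-hungry app, the hungry app collects all spare ways while everyone else is squeezed into one way: exactly the small, low-associativity partitions the slide criticizes.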
• Page coloring
  - No hardware support required
  - Not compatible with superpages; costly repartitioning due to recoloring; heavy OS modifications
• Hybrid technique: Set and WAy Partitioning (SWAP) [HPCA'17]
  - Combines page coloring and way-partitioning → fine-grained partitions
  - Inherits page coloring's limitations
• Hardware way-partitioning: restricts insertions into subsets of ways
  - Available in commodity hardware
  - Only a small number of coarse-grained partitions!
• High-performance, fine-grained hardware partitioners (e.g., Vantage [ISCA'11], Futility Scaling [MICRO'14])
  - Support hundreds of partitions
  - Not available in existing hardware
Cache-Aware App Grouping
[Diagram: apps grouped into group 1, group 2, group 3, each sharing one cache partition]
Avoids a significant reduction in cache associativity → throughput improves by 17%
Grouping must be done carefully!
[Diagram: KPart overview — collected application profiles (miss curves: cache misses vs. cache capacity) → group applications into cache-sharing clusters → assign cache partitions to clusters, producing a per-cluster cache partition plan (Cluster#1, Cluster#2, Cluster#3)]
• How many additional cache misses are expected when two apps share cache capacity vs. when it's partitioned?
• Use cache miss curves to estimate the distance between two apps:
[Figure: cache misses vs. cache capacity for app1 and app2, comparing the combined miss curve of a shared LLC [Mukkara et al., ASPLOS'16] against the partitioned miss curve (capacity divided using UCP); the shaded area between the two curves is the distance]
Area → expected performance degradation when apps share cache capacity (due to additional misses)
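The distance above can be sketched as the area between a shared-cache miss curve and the best partitioned miss curve. A minimal sketch, assuming simple numpy miss curves; the `shared_curve` model below is a crude stand-in for the combined-curve construction of Mukkara et al., not KPart's exact formulation:

```python
import numpy as np

def partitioned_curve(m1, m2, n_ways):
    # For each total capacity S, the best UCP-style split of S ways
    # between the two apps (exhaustive search, fine for small way counts).
    return np.array([min(m1[s] + m2[S - s] for s in range(S + 1))
                     for S in range(n_ways + 1)])

def shared_curve(m1, m2, n_ways):
    # Crude sharing model: each app's effective capacity is proportional
    # to its miss rate at full cache size (an assumption for illustration).
    frac1 = m1[-1] / (m1[-1] + m2[-1] + 1e-9)
    curve = []
    for S in range(n_ways + 1):
        s1 = int(round(frac1 * S))
        curve.append(m1[s1] + m2[S - s1])
    return np.array(curve)

def distance(m1, m2, n_ways):
    # Area between the shared and partitioned curves: the expected extra
    # misses incurred when the two apps share instead of being partitioned.
    return float(np.sum(shared_curve(m1, m2, n_ways)
                        - partitioned_curve(m1, m2, n_ways)))
```

Since the partitioned curve takes the best split at every capacity, the distance is always non-negative; similar apps (whose curves split cleanly) get a small distance and become good sharing candidates.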
• Hierarchical clustering:
  - Start with the applications as individual clusters
  - At each step, merge the closest pair of clusters, until only one cluster is left
[Figure: dendrogram over the application miss curves; cutting it at different levels yields K=2, K=3, ... clusters]
How do we find the best K without running the mix?
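The agglomerative loop above can be sketched as follows. This is a minimal illustration: the `dist` callback is the sharing-vs-partitioning distance, and representing a merged cluster by the element-wise sum of its members' miss curves is a simplification of KPart's cluster representation.

```python
import numpy as np

def hierarchical_cluster(curves, dist, n_ways):
    """Agglomerative clustering over application miss curves.

    curves: list of numpy miss curves, one per app.
    dist(a, b, n_ways): distance between two representative curves.
    Returns the list of merge steps (the dendrogram): each entry is
    (members_of_cluster_i, members_of_cluster_j, distance).
    """
    clusters = [[i] for i in range(len(curves))]
    reps = [c.copy() for c in curves]   # one representative curve per cluster
    merges = []
    while len(clusters) > 1:
        # find the closest pair of clusters (O(n^2) scan per step)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(reps[i], reps[j], n_ways)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]
        reps[i] = reps[i] + reps[j]     # element-wise sum as merged curve
        del clusters[j]; del reps[j]
    return merges
```

Cutting the resulting dendrogram at different heights yields the candidate clusterings for K=2, K=3, and so on.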
Choosing Kauto:
1. Estimate throughput under all possible Ks
2. Account for bandwidth contention
3. Estimate speedup curves
4. Return the Kauto that produces the best result
[Diagram: application profiles → performance estimator → per-cluster cache partition plan (Cluster#1, Cluster#2, ..., Cluster#Kauto)]
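The K-selection step can be sketched as trying every dendrogram cut, allocating ways to each cut's clusters, and keeping the best. A minimal sketch, with two loudly labeled simplifications: total misses stand in as the throughput proxy (KPart instead estimates per-app speedups and folds in memory bandwidth contention), and the `cuts` input format is hypothetical.

```python
import numpy as np

def pick_k(cuts, n_ways):
    """cuts: {K: list of K per-cluster miss curves from the dendrogram cut}.

    Returns the K whose UCP-style way allocation yields the fewest total
    misses (illustrative proxy for estimated throughput).
    """
    def greedy_alloc(curves):
        # one way minimum per cluster, then greedy marginal-utility ways
        alloc = [1] * len(curves)
        for _ in range(n_ways - len(curves)):
            gains = [c[a] - c[a + 1] if a + 1 < len(c) else 0.0
                     for c, a in zip(curves, alloc)]
            alloc[gains.index(max(gains))] += 1
        return alloc

    best_k, best_misses = None, None
    for K, curves in sorted(cuts.items()):
        alloc = greedy_alloc(curves)
        total = sum(c[a] for c, a in zip(curves, alloc))
        if best_misses is None or total < best_misses:
            best_k, best_misses = K, total
    return best_k
```

Note that this can legitimately pick K smaller than the number of apps: when a cache-insensitive app shares a partition with a cache-hungry one, the merged cluster frees ways without costing misses, which is exactly why hybrid sharing-partitioning beats per-app partitioning.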
• Prior work mostly simulated hardware monitors that don't exist in real systems, or used expensive software-based memory address sampling
[Figure: miss curves (cache misses vs. cache capacity)]
• DynaWay exploits hardware partitioning support to adjust partition sizes periodically and measure performance (misses, IPC, bandwidth)
• We applied optimizations to reduce measurement points and interval length (see paper) → less than 1% profiling overhead (8-app workloads)
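One of the optimizations mentioned above is measuring misses at only a few partition sizes and filling in the rest of the miss curve. A minimal sketch of that reconstruction step; the sample points and use of linear interpolation are illustrative assumptions, not DynaWay's exact scheme:

```python
import numpy as np

def build_miss_curve(samples, n_ways):
    """Reconstruct a full miss curve from sparse DynaWay-style measurements.

    samples: {ways_allocated: measured_misses}, gathered by resizing an
    app's partition and reading hardware miss counters at each size.
    Returns misses for every allocation 1..n_ways, with unmeasured sizes
    filled in by linear interpolation.
    """
    xs = sorted(samples)
    ys = [samples[x] for x in xs]
    return np.interp(np.arange(1, n_ways + 1), xs, ys)
```

Measuring at, say, three sizes instead of twelve is what keeps the profiling overhead under 1% while the workload keeps running.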
[Diagram: KPart + DynaWay loop — KPart invokes DynaWay, which generates online profiles and updates them periodically; KPart then produces the per-cluster partition plan (Cluster#1 ... Cluster#Kauto)]
• Platform: 8-core Intel Xeon D-1540 (Broadwell) processor with a 12MB LLC
• Benchmarks: SPEC-CPU2006, PBBS
• Mixes: 30 different mixes of 8 apps (randomly selected), each app running at least 10B instructions
• Experiments:
  - KPart on a real system with offline profiling
  - KPart on a real system with online profiling (using DynaWay)
  - KPart in simulation, compared against high-performance techniques
  - KPart with a mix of batch and latency-critical applications
• Evaluation results on a real system with offline profiling
[Figures: average throughput gain over NoPart (%), and per-mix performance gain over NoPart (%) across application mixes, comparing Koracle, Kauto, K2, K4, K6, and NoClust]
• KPart improves system performance by 24% on average, and by up to 79%!
• NoClust hurts ~30% of mixes → important to use Kauto instead of a fixed K
• Evaluation results on a real system with offline profiling: case studies of individual mixes
[Figures: Mix 1 and Mix 2 case studies]
[Figure: performance vs. reconfiguration interval (cycles) for KPart+DynaWay, Kauto with offline profiles, and Koracle with offline profiles]
• KPart+DynaWay can even outperform static KPart with offline profiling, since it adapts to application phase changes!
• In simulation: we compared KPart to a high-performance, fine-grained hardware partitioner, Vantage [ISCA'11]
[Figure: throughput of NoClust, Koracle, and Vantage]
• KPart achieves most of the gains obtained by fine-grained partitioning!
• KPart focuses on batch apps, but data centers colocate latency-critical (LC) and batch apps
• Prior work uses cache partitioning to provide QoS guarantees for LC apps, but does not improve batch app throughput
• Combining KPart with a QoS-oriented technique (e.g., Heracles [ISCA'15]) can improve both batch throughput and LC latency:
  - KPart improves batch throughput, which reduces memory traffic
  - LC apps benefit from the extra bandwidth and cache
[Diagram: 8-core system (Core0–Core7) with one latency-critical app and batch1–batch4 sharing the 12MB LLC; a QoS-oriented policy manages the LC partition while KPart+DynaWay manages the batch partitions]
• Evaluation: on the same 8-core system running both LC and batch apps, up to 28% improvement in batch throughput and up to 7% improvement in LC tail latency
✓ KPart unlocks the potential of hardware way-partitioning using a hybrid sharing-partitioning approach
✓ KPart improves throughput significantly (avg: 24%) & bridges the gap between current and future partitioning techniques
✓ DynaWay exploits existing way-partitioning support to perform lightweight & accurate cache-profiling
✓ KPart+DynaWay can be combined with QoS-oriented policies to colocate latency-critical and batch apps effectively
KPart is open-sourced and publicly available at http://kpart.csail.mit.edu