SLIDE 1

UBIK: EFFICIENT CACHE SHARING WITH STRICT QOS

FOR LATENCY CRITICAL WORKLOADS

HARSHAD KASTURE, DANIEL SANCHEZ

ASPLOS 2014

SLIDE 2

Motivation

- Low server utilization in datacenters is a major source of inefficiency
  - L. Barroso and U. Hölzle, "The Case for Energy-Proportional Computing"
SLIDE 3

Common Industry Practice

[Figure: six cores sharing a last-level cache, dedicated to a latency-critical application]

- Dedicated machines for latency-critical applications guarantee QoS

SLIDE 4

Common Industry Practice

[Figure: six cores sharing a last-level cache; only some cores are busy with the latency-critical application]

- Dedicated machines for latency-critical applications guarantee QoS
- Underutilization of machine resources

SLIDE 5

Colocation to Improve Utilization

[Figure: six cores sharing a last-level cache; latency-critical applications colocated with batch applications]

- Can utilize spare resources by colocating batch apps

SLIDE 6

Sharing Causes Interference!

[Figure: six cores sharing a last-level cache; colocated latency-critical and batch applications contend for it]

- Can utilize spare resources by colocating batch apps
- Contention in shared resources degrades QoS

SLIDE 7

Outline

- Introduction
- Analysis of latency-critical apps
- Inertia-oblivious cache management schemes
- Ubik: Inertia-aware cache management
- Evaluation

SLIDE 8

Understanding Latency-Critical Applications

- A large number of backend servers participate in handling every user request
- Total service time is determined by the tail latency behavior of the backends

[Figure: datacenter with front-end servers fanning each client request out to many back-end servers]
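This fan-out effect can be illustrated with a quick probability sketch (the 1% per-server tail figure below is invented for illustration, not from the talk): a request that waits on N backends is slow whenever any one of them is slow.

```python
# Illustrative sketch: with high fan-out, per-server tail latency dominates.
# q: probability that a single backend responds slower than its tail percentile.
# A request that fans out to n backends is slow if ANY backend is slow.

def p_request_slow(q: float, n: int) -> float:
    """Probability that at least one of n backends hits its tail latency."""
    return 1.0 - (1.0 - q) ** n

# A 1% per-server tail becomes the common case at datacenter fan-outs.
for n in (1, 10, 100):
    print(f"fan-out {n:3d}: P(slow) = {p_request_slow(0.01, n):.3f}")
```

At a fan-out of 100, roughly 63% of requests see at least one tail-latency backend, which is why the talk focuses on the tail rather than the mean.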

SLIDE 9

Understanding Latency-Critical Applications

- Service latency is highly sensitive to changes in load

SLIDE 10

Understanding Latency-Critical Applications

- Short bursts of activity interspersed with idle periods
- Need guaranteed high performance during active periods

[Figure: request timeline alternating between active and idle periods]

SLIDE 11

Inertia and Transient Behavior

[Figure: six cores sharing a last-level cache; IPC-over-time plot]

SLIDE 12

Inertia and Transient Behavior

- Transient lengths can dominate tail latency!
- Any dynamic reconfiguration scheme has to be inertia-aware
- Many hardware resources exhibit inertia
  - Branch predictors, prefetchers, memory bandwidth…
- LLCs are one of the biggest sources of inertia

[Figure: six cores sharing a last-level cache; IPC over time, with the beginning and end of the transient marked]

SLIDE 13

Outline

- Introduction
- Analysis of latency-critical apps
- Inertia-oblivious cache management schemes
- Ubik: Inertia-aware cache management
- Evaluation

SLIDE 14

Inertia-Oblivious Cache Management

[Figure: four cores sharing the last-level cache; two latency-critical apps (LC1, LC2) and two batch apps (Batch1, Batch2); each LC app alternates between active and idle periods over time]

SLIDE 15

Unmanaged LLC (LRU Replacement)

[Figure: LLC space over time under LRU as the LC apps alternate between active and idle periods]

✖ Unconstrained interference results in poor tail-latency behavior

SLIDE 16

Utility-Based Cache Partitioning (UCP)

[Figure: LLC space over time under UCP; partitions are resized at periodic reconfigurations]

✔ High batch throughput
✖ Poor tail latency (low allocation)
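UCP sizes partitions from each app's miss curve, measured by utility monitors. A minimal greedy sketch of the idea (the actual UCP algorithm uses a "lookahead" variant, and the miss curves below are invented for illustration):

```python
# Greedy utility-based partitioning: repeatedly hand the next cache way to
# whichever app would save the most misses with it (marginal utility).

def ucp_greedy(miss_curves, total_ways):
    """miss_curves[a][w] = misses of app a with w ways (w = 0..total_ways)."""
    alloc = [0] * len(miss_curves)
    for _ in range(total_ways):
        gains = [mc[alloc[a]] - mc[alloc[a] + 1]  # misses saved by one more way
                 for a, mc in enumerate(miss_curves)]
        winner = max(range(len(gains)), key=gains.__getitem__)
        alloc[winner] += 1
    return alloc

curves = [
    [100, 60, 40, 30, 25, 22, 20, 19, 18],  # cache-friendly app
    [100, 95, 92, 90, 89, 88, 88, 88, 88],  # streaming-like app
]
print(ucp_greedy(curves, 8))  # most ways go to the cache-friendly app
```

Because this sizing is re-run periodically and ignores how long an app takes to refill a grown partition, it is inertia-oblivious, which is what hurts tail latency here.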

SLIDE 17

OnOff: Efficient but Unsafe

[Figure: LLC space over time; LC partitions are handed to batch apps during idle periods and reclaimed at each reconfiguration]

✔ High batch throughput

SLIDE 18

Cross-Request LLC Inertia

[Figure: LLC access breakdown (%) for Shore-MT with a 2 MB LLC, split into misses, hits on data from the same request, and cross-request hits]

- Other applications are qualitatively similar (see paper for details)

SLIDE 19

StaticLC: Safe but Inefficient

[Figure: LLC space over time; each LC app keeps a fixed partition through active and idle periods, while batch apps share the remainder]

✔ Low tail latency (preserves LLC state)
✖ Low batch throughput (poor space utilization)

SLIDE 20

Outline

- Introduction
- Analysis of latency-critical apps
- Inertia-oblivious cache management schemes
- Ubik: Inertia-aware cache management
- Evaluation

SLIDE 21

Ubik: Performance Guarantee

- Performance, as well as overall progress, under Ubik is identical to static partitioning after the deadline

[Figure: instructions vs. time from the start of a request, comparing progress with a constant-size partition against progress under Ubik up to the deadline]

SLIDE 22

Ubik: Overview

[Figure: activity and partition size over time; target and actual size shown against the nominal static size and the idle size]

SLIDE 23

Ubik: Overview

[Figure: activity and partition size over time; target and actual size shown against the nominal static size, idle size, and boosted size]

SLIDE 24

Ubik: Overview

[Figure: activity and partition size over time; target and actual size shown against the nominal static size, idle size, and boosted size]

SLIDE 25

Ubik: Overview

- Constraint: cycles lost during the transient (while the partition grows back) should be compensated for by the cycles gained during the boosted phase, before the deadline

[Figure: activity and partition size over time, with nominal static, idle, and boosted sizes]

SLIDE 26

Analyzing Transients

[Figure: partition size growing from s1 to s2 over a transient of length T_transient; instructions vs. time comparing progress at constant size s2 against progress under Ubik, with the lost performance marked between the transient's beginning and end]

- Need accurate predictions for:
  - The length of the transient from s1 to s2
  - Cycles lost during the transient from s1 to s2

SLIDE 27

Hardware Support

- Utility monitors to measure per-application miss curves
- Fine-grained cache partitioning
- Memory-level parallelism (MLP) profiler

[Figure: miss probability vs. partition size, marking p_s1 at size s1 and p_s2 at size s2]

SLIDE 28

Bounds on Transient Behavior

[Figure: partition size growing from s1 to s2 over a transient of length T_transient; instructions vs. time showing the performance lost relative to running at constant size s2]

With $c$ = cycles per access, $M$ = miss penalty (cycles), and $p_s$ = miss probability at partition size $s$:

$$T_{\text{transient}} = \sum_{s=s_1}^{s_2-1} \left(\frac{c}{p_s} + M\right) \le (s_2 - s_1)\left(\frac{c}{p_{s_2}} + M\right)$$

$$L = \sum_{s=s_1}^{s_2-1} M\left(1 - \frac{p_{s_2}}{p_s}\right) \le (s_2 - s_1)\,M\left(1 - \frac{p_{s_2}}{p_{s_1}}\right)$$

where $L$ is the performance (in cycles) lost during the transient.
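These bounds are straightforward to evaluate from a miss curve; a sketch under stated assumptions (c and M are example values, and the geometric miss curve is invented, since a real implementation would read p_s from the utility monitors):

```python
# Evaluate the transient-length and lost-performance bounds for a partition
# growing from s1 to s2 ways. p[s] = miss probability at size s,
# c = cycles per access, M = miss penalty in cycles.

def transient_bounds(p, s1, s2, c, M):
    sizes = range(s1, s2)
    T = sum(c / p[s] + M for s in sizes)            # expected transient length
    T_bound = (s2 - s1) * (c / p[s2] + M)           # upper bound on T
    L = sum(M * (1 - p[s2] / p[s]) for s in sizes)  # cycles lost vs. size s2
    L_bound = (s2 - s1) * M * (1 - p[s2] / p[s1])   # upper bound on L
    return T, T_bound, L, L_bound

p = {s: 0.2 * 0.8 ** s for s in range(9)}  # invented, decreasing miss curve
T, T_bound, L, L_bound = transient_bounds(p, s1=2, s2=6, c=1, M=200)
assert T <= T_bound and L <= L_bound  # bounds hold for any decreasing curve
```

The bounds follow because the miss probability only decreases as the partition grows, so every term in each sum is bounded by its value at the extreme sizes s2 and s1.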

SLIDE 29

Ubik: Partition Sizing

- Use transient analysis to identify feasible (idle size, boosted size) pairs

[Figure: candidate partition-size trajectory 1 plotted against the deadline]

SLIDE 30

Ubik: Partition Sizing

- Use transient analysis to identify feasible (idle size, boosted size) pairs

[Figure: candidate partition-size trajectories 1 and 2 plotted against the deadline]

SLIDE 31

Ubik: Partition Sizing

- Use transient analysis to identify feasible (idle size, boosted size) pairs

[Figure: candidate partition-size trajectories 1, 2, and 3 plotted against the deadline]

SLIDE 32

Ubik: Partition Sizing

- Use transient analysis to identify feasible (idle size, boosted size) pairs

[Figure: candidate partition-size trajectories 1 through 4 plotted against the deadline; trajectory 4 is marked INFEASIBLE]

SLIDE 33

Ubik: Partition Sizing

- Use transient analysis to identify feasible (idle size, boosted size) pairs
- Choose the pair that yields the maximum batch throughput
- See paper for details

[Figure: feasible trajectories 1, 2, and 3 plotted against the deadline]
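The feasibility search can be sketched as follows. This is a simplified model, not the paper's exact algorithm: the miss curve, miss penalty, and deadline slack are invented, and batch throughput is approximated by the number of ways freed while the latency-critical app idles.

```python
# Sketch of Ubik partition sizing: enumerate (idle size, boosted size)
# pairs, keep those whose transient penalty fits in the deadline slack,
# and pick the feasible pair that frees the most cache for batch apps.

def lost_cycles(p, s_idle, s_boost, M):
    # Bound on cycles lost while growing back from s_idle to s_boost
    # (the (s2 - s1) * M * (1 - p_s2 / p_s1) transient bound).
    return (s_boost - s_idle) * M * (1 - p[s_boost] / p[s_idle])

def size_partitions(p, s_nominal, total_ways, M, slack_cycles):
    best = None
    for s_idle in range(1, s_nominal + 1):
        for s_boost in range(s_nominal, total_ways + 1):
            if s_boost <= s_idle:
                continue
            if lost_cycles(p, s_idle, s_boost, M) > slack_cycles:
                continue  # transient cannot be repaid before the deadline
            batch_ways = total_ways - s_idle  # freed while the LC app idles
            if best is None or batch_ways > best[0]:
                best = (batch_ways, s_idle, s_boost)
    return best  # (batch ways, idle size, boosted size)

p = {s: 0.2 * 0.8 ** s for s in range(13)}  # invented miss curve
print(size_partitions(p, s_nominal=6, total_ways=12, M=200, slack_cycles=500))
```

Ubik's actual feasibility test also accounts for the cycles gained while running at the boosted size; this sketch collapses that into a single slack budget for simplicity.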

SLIDE 34

Outline

- Introduction
- Analysis of latency-critical apps
- Inertia-oblivious cache management schemes
- Ubik: Inertia-aware cache management
- Evaluation

SLIDE 35

Workloads

- Five diverse latency-critical apps:
  - xapian (search engine)
  - masstree (in-memory key-value store)
  - moses (statistical machine translation)
  - shore-mt (multi-threaded DBMS)
  - specjbb (Java middleware)
- Batch applications: random mixes of SPEC CPU2006 benchmarks

SLIDE 36

Target System

- 6 OOO cores
- Private L1I, L1D, and L2 caches
- 12 MB shared LLC
- 400 6-app mixes: 3 latency-critical + 3 batch apps
- Apps pinned to cores

[Figure: six cores, each paired with a bank of the shared L3 (banks 0-5); latency-critical apps LC1-LC3 and batch apps Batch1-Batch3 pinned to cores]

SLIDE 37

Metrics

- Baseline system has private LLCs
- We report:
  - Normalized tail latency
  - Throughput improvement for batch applications

[Figure: baseline system with six cores, each with a private L3 slice; LC1-LC3 and Batch1-Batch3 pinned to cores]

SLIDE 38

Results: Unmanaged LLC (LRU)

[Figure: results for the unmanaged LLC; higher is better]

SLIDE 39

Results: UCP

[Figure: results for UCP; higher is better]

SLIDE 40

Results: OnOff

[Figure: results for OnOff; higher is better]

SLIDE 41

Results: StaticLC

[Figure: results for StaticLC; higher is better]

SLIDE 42

Results: Ubik

[Figure: results for Ubik; higher is better]

SLIDE 43

Results: Summary

[Figure: summary comparison of LRU, UCP, OnOff, StaticLC, Ubik, and the private-LLC baseline; higher is better]

SLIDE 44

Conclusions

- To guarantee tail latency, dynamic resource management schemes must be inertia-aware
- Ubik: inertia-aware cache capacity management
  - Preserves the tail latency of latency-critical apps
  - Achieves high cache space utilization for batch apps
  - Requires minimal additional hardware

SLIDE 45

THANKS FOR YOUR ATTENTION! QUESTIONS?