Ubik: Efficient Cache Sharing with Strict QoS for Latency-Critical Workloads
Harshad Kasture, Daniel Sanchez
ASPLOS 2014
Motivation
2
Low server utilization in datacenters is a major source of inefficiency
- L. Barroso and U. Hölzle, The Case for Energy-Proportional Computing
Common Industry Practice
3
[Figure: six-core chip with a shared last-level cache; the latency-critical application runs alone on it]
Dedicating machines to latency-critical applications guarantees QoS
Common Industry Practice
4
Dedicating machines to latency-critical applications guarantees QoS
But it leaves machine resources underutilized
[Figure: the latency-critical application occupies one core; the other cores and the shared last-level cache sit mostly unused]
Colocation to Improve Utilization
5
Spare resources can be utilized by colocating batch apps
[Figure: latency-critical and batch applications share the six cores and the last-level cache]
Sharing Causes Interference!
6
Colocating batch apps utilizes spare resources
But contention for shared resources degrades QoS
Outline
7
Introduction
Analysis of latency-critical apps
Inertia-oblivious cache management schemes
Ubik: Inertia-aware cache management
Evaluation
Understanding Latency-Critical Applications
8
A large number of backend servers participate in handling every user request
Total service time is determined by the tail latency behavior of the backends (a rough illustration follows below)
[Figure: clients send requests to front-end servers, which fan out to many back-end servers inside the datacenter]
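A rough, hypothetical illustration (my numbers, not the talk's) of why the backend tail dominates: if every request fans out to N back-end servers with independent latencies, the chance that at least one server exceeds its own 99th-percentile latency grows quickly with N.

    # Hypothetical back-of-the-envelope sketch: probability that a fan-out
    # request is slowed by at least one back-end, assuming independent servers.
    def prob_hits_slow_backend(num_backends: int, per_server_quantile: float = 0.99) -> float:
        """Chance that at least one back-end exceeds its per-server quantile latency."""
        return 1.0 - per_server_quantile ** num_backends

    for n in (1, 10, 100):
        print(f"{n:4d} back-ends -> {prob_hits_slow_backend(n):.1%} of requests see a slow server")
    # 1 -> 1.0%, 10 -> ~9.6%, 100 -> ~63.4%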
Understanding Latency-Critical Applications
9
Service latency highly sensitive to changes in load
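One way to see this sensitivity is a simple queueing sketch (an assumed M/M/1 model, not the talk's data): mean latency grows as 1/(1 − utilization), so latency rises sharply as load approaches a server's capacity.

    # Hypothetical M/M/1 illustration: latency blows up as utilization -> 1.
    def mm1_mean_latency(service_time: float, utilization: float) -> float:
        """Mean sojourn time of an M/M/1 queue: service_time / (1 - rho)."""
        assert 0.0 <= utilization < 1.0
        return service_time / (1.0 - utilization)

    for rho in (0.2, 0.5, 0.8, 0.95):
        print(f"utilization {rho:.2f} -> {mm1_mean_latency(1.0, rho):.2f}x the service time")
    # 0.20 -> 1.25x, 0.50 -> 2.00x, 0.80 -> 5.00x, 0.95 -> 20.00x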
Understanding Latency-Critical Applications
10
Short bursts of activity interspersed with idle periods
Need guaranteed high performance during active periods
[Figure: timeline alternating between active and idle periods]
Inertia and Transient Behavior
11
[Figure: six cores sharing the last-level cache; IPC over time]
Inertia and Transient Behavior
12
Transient lengths can dominate tail latency!
Any dynamic reconfiguration scheme has to be inertia-aware
Many hardware resources exhibit inertia: branch predictors, prefetchers, memory bandwidth…
LLCs are one of the biggest sources of inertia (a rough estimate follows the figure)
[Figure: IPC over time, annotated with the beginning and end of the transient]
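A hedged back-of-the-envelope estimate (assumed line size, miss latency, and MLP; not the talk's numbers) of why LLC inertia is so large: re-fetching a few megabytes of cache state one line per miss takes hundreds of microseconds to milliseconds, comparable to or larger than typical tail latency targets.

    # Hypothetical estimate of LLC warm-up (transient) time: refilling capacity
    # through demand misses, one cache line per miss, with some miss overlap.
    def refill_time_us(capacity_bytes: float, line_bytes: int = 64,
                       miss_latency_ns: float = 100.0, avg_mlp: float = 4.0) -> float:
        """Rough time to re-fetch capacity_bytes of cache state, in microseconds."""
        lines = capacity_bytes / line_bytes
        return lines * miss_latency_ns / avg_mlp / 1e3

    for mb in (1, 4, 8):
        print(f"{mb} MB -> ~{refill_time_us(mb * 1024 * 1024):.0f} us to refill")
    # 1 MB -> ~410 us, 4 MB -> ~1640 us, 8 MB -> ~3280 us with these assumptions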
Outline
13
Introduction
Analysis of latency-critical apps
Inertia-oblivious cache management schemes
Ubik: Inertia-aware cache management
Evaluation
Inertia-Oblivious Cache Management
14
[Figure: two latency-critical apps (LC1, LC2) and two batch apps (Batch1, Batch2) share the LLC; the latency-critical apps alternate between active and idle periods]
Unmanaged LLC (LRU Replacement)
15
[Figure: LLC space over time under LRU; all four apps' allocations shift continuously]
✖ Unconstrained interference results in poor tail-latency behavior
Utility Based Cache Partitioning (UCP)
16
[Figure: UCP periodically reconfigures the LLC partitions of LC1, LC2, Batch1, and Batch2]
✔ High batch throughput
✖ Poor tail latency (low allocation)
OnOff: Efficient but Unsafe
17
[Figure: OnOff gives the latency-critical apps' LLC space to the batch apps during idle periods and reassigns it at the next reconfiguration when they become active]
✔ High batch throughput
Cross-Request LLC Inertia
18
Other applications qualitatively similar (see paper for details)
[Chart: LLC access breakdown (%) for Shore-MT with a 2 MB LLC, split into misses, hits within the same request, and cross-request hits]
StaticLC: Safe but Inefficient
19
[Figure: StaticLC keeps the latency-critical apps' LLC partitions fixed; only the batch apps' space is reconfigured]
✔ Low tail latency (preserves LLC state)
✖ Low batch throughput (poor space utilization)
Outline
20
Introduction
Analysis of latency-critical apps
Inertia-oblivious cache management schemes
Ubik: Inertia-aware cache management
Evaluation
Ubik: Performance Guarantee
21
Performance and overall progress under Ubik after the deadline are identical to static partitioning
[Figure: instructions completed over time after a request begins; Ubik's progress curve matches the constant-size curve by the deadline]
Ubik: Overview
22
[Figure: activity and partition size over time; during idle periods the actual size falls from the nominal static size toward the idle size (target vs. actual size)]
Ubik: Overview
23
[Figure: the same plot, adding a boosted size above the nominal static size]
Ubik: Overview
24
[Figure: same plot as the previous slide (idle, nominal static, and boosted sizes; target vs. actual size)]
Ubik: Overview
25
Constraint: the cycles lost while running at the idle size (and during the transient) must be compensated for by the cycles gained while running at the boosted size, before the deadline (a small worked example follows the figure)
[Figure: activity and partition size over time, showing the idle, nominal static, and boosted sizes (target vs. actual size)]
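A tiny worked example of this constraint, with made-up numbers: if shrinking to the idle size and warming the cache back up costs a request 60,000 cycles of progress relative to the static partition, and the boosted size yields 0.05 extra cycles of progress per cycle, the app must run boosted for at least 1.2 million cycles before the deadline.

    # Hypothetical numbers illustrating the compensation constraint.
    cycles_lost = 60_000       # progress lost at the idle size and during the transient
    gain_per_cycle = 0.05      # extra progress per cycle at the boosted size
    min_boosted_cycles = cycles_lost / gain_per_cycle
    print(min_boosted_cycles)  # 1.2 million boosted cycles needed before the deadline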
Analyzing Transients
26
[Figure: target vs. actual partition size over time; growing from s1 to s2 takes T_transient]
Need accurate predictions for:
The length of the transient from s1 to s2
Cycles lost during the transient from s1 to s2
[Figure: instructions over time; the gap between progress at constant size s2 and progress under Ubik, between transient begin and end, is the lost performance]
Hardware Support
27
Utility monitors to measure per-application miss curves (a rough model follows below)
Fine-grained cache partitioning
Memory-level parallelism (MLP) profiler
[Figure: miss probability vs. partition size; sizes s1 and s2 map to miss probabilities p_s1 and p_s2]
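A rough software model (assumed structure, not Ubik's actual hardware) of what a utility monitor provides: by recording the LRU stack distance of sampled accesses, it yields a miss curve, i.e., the miss probability the application would see at each candidate allocation.

    # Hypothetical UMON-style utility monitor: track LRU stack distances and
    # derive a miss curve (miss probability vs. number of ways).
    from collections import OrderedDict

    class UtilityMonitor:
        def __init__(self, max_ways: int):
            self.max_ways = max_ways
            self.stack = OrderedDict()              # address -> None, MRU last
            self.hits_at_distance = [0] * max_ways
            self.accesses = 0

        def access(self, addr: int) -> None:
            self.accesses += 1
            if addr in self.stack:
                # Stack distance = number of distinct lines touched since addr.
                distance = list(reversed(self.stack)).index(addr)
                if distance < self.max_ways:
                    self.hits_at_distance[distance] += 1
                self.stack.move_to_end(addr)
            else:
                self.stack[addr] = None
                if len(self.stack) > self.max_ways:
                    self.stack.popitem(last=False)  # evict the LRU entry

        def miss_curve(self):
            """curve[i] = estimated miss probability when given i+1 ways."""
            hits, curve = 0, []
            for i in range(self.max_ways):
                hits += self.hits_at_distance[i]
                curve.append(1.0 - hits / max(self.accesses, 1))
            return curve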
Bounds on Transient Behavior
28
[Figure: partition size growing from s1 to s2 over T_transient; instructions over time showing the lost performance L relative to running at constant size s2]
T_transient = Σ_{s=s1..s2−1} (c/p_s + M) ≤ (s2 − s1) · (c/p_s2 + M)
L = M · Σ_{s=s1..s2−1} (1 − p_s2/p_s) ≤ M · (s2 − s1) · (1 − p_s2/p_s1)
(p_s: miss probability at partition size s; M: miss penalty)
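A small sketch that evaluates these bounds, assuming a miss curve p (e.g., from the utility monitors), a miss penalty M, and a per-access cycle cost c as inputs; all example values are made up.

    # Hypothetical evaluation of the transient bounds above.
    def transient_bounds(p, s1: int, s2: int, M: float, c: float):
        """p[s] is the miss probability at partition size s, for s1 <= s <= s2."""
        t_transient = sum(c / p[s] + M for s in range(s1, s2))
        t_upper = (s2 - s1) * (c / p[s2] + M)
        lost = M * sum(1.0 - p[s2] / p[s] for s in range(s1, s2))
        lost_upper = M * (s2 - s1) * (1.0 - p[s2] / p[s1])
        return t_transient, t_upper, lost, lost_upper

    # Example with a made-up, decreasing miss curve:
    p = {s: 0.2 / (1 + s) for s in range(9)}
    print(transient_bounds(p, s1=2, s2=8, M=200.0, c=2.0))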
Ubik: Partition Sizing
29
Use transient analysis to identify feasible (idle size, boosted size) pairs
[Figure: candidate pair 1 on the size-vs-time plot, with the deadline marked]
Ubik: Partition Sizing
30
Use transient analysis to identify feasible (idle size, boosted size) pairs
[Figure: candidate pair 2 added]
Ubik: Partition Sizing
31
Use transient analysis to identify feasible (idle size, boosted size) pairs
[Figure: candidate pair 3 added]
Ubik: Partition Sizing
32
Use transient analysis to identify feasible (idle size, boosted size) pairs
[Figure: candidate pair 4 added; this pair is infeasible]
Ubik: Partition Sizing
33
Use transient analysis to identify feasible (idle size, boosted size) pairs
Choose the pair that yields the maximum batch throughput
[Figure: feasible candidate pairs 1, 2, and 3 on the size-vs-time plot, with the deadline marked]
See paper for details
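A schematic sketch of this selection step, with invented helper names (cycles_lost, cycles_gained, and batch_throughput stand in for the transient analysis and the batch apps' miss curves): keep the (idle size, boosted size) pairs whose lost cycles can be repaid before the deadline, and among those pick the pair that maximizes batch throughput.

    # Hypothetical selection of an (idle_size, boosted_size) pair for one
    # latency-critical app; the helper callables are assumed, not Ubik's API.
    def choose_pair(candidate_pairs, deadline_cycles,
                    cycles_lost, cycles_gained, batch_throughput):
        best, best_tput = None, float("-inf")
        for idle_size, boosted_size in candidate_pairs:
            # Feasible only if the boost repays the idle/transient slowdown in time.
            lost = cycles_lost(idle_size, boosted_size)
            gained = cycles_gained(boosted_size, deadline_cycles)
            if gained < lost:
                continue
            tput = batch_throughput(idle_size, boosted_size)
            if tput > best_tput:
                best, best_tput = (idle_size, boosted_size), tput
        return best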
Outline
34
Introduction
Analysis of latency-critical apps
Inertia-oblivious cache management schemes
Ubik: Inertia-aware cache management
Evaluation
Workloads
35
Five diverse latency-critical apps
xapian (search engine)
masstree (in-memory key-value store)
moses (statistical machine translation)
shore-mt (multi-threaded DBMS)
specjbb (Java middleware)
Batch applications: random mixes of SPEC CPU2006 benchmarks
Target System
36
6 out-of-order (OOO) cores
Private L1I, L1D, and L2 caches
12 MB shared LLC
400 six-app mixes: 3 latency-critical + 3 batch apps
Apps pinned to cores
[Figure: six cores, each with an L3 bank; LC1–LC3 and Batch1–Batch3 are pinned one per core]
Metrics
37
Baseline system has private LLCs
We report:
Normalized tail latency
Throughput improvement for batch applications
[Figure: baseline configuration with a private L3 slice per core; LC1–LC3 and Batch1–Batch3 pinned one per core]
Results: Unmanaged LLC (LRU)
38
[Results chart; higher is better]
Results: UCP
39
[Results chart; higher is better]
Results: OnOff
40
[Results chart; higher is better]
Results: StaticLC
41
[Results chart; higher is better]
Results: Ubik
42
[Results chart; higher is better]
Results: Summary
43
[Summary chart comparing Private LLC, LRU, UCP, OnOff, StaticLC, and Ubik; higher is better]
Conclusions
44
To guarantee tail latency, dynamic resource management schemes must be inertia-aware
Ubik: Inertia-aware cache capacity management
Preserves the tail latency of latency-critical apps
Achieves high cache space utilization for batch apps
Requires minimal additional hardware