No Tradeoff
Low Latency + High Efficiency
Christos Kozyrakis
http://mast.stanford.edu
Latency-critical Applications
A growing class of online workloads
Search, social networking, software-as-a-service (SaaS), …
Occupy 1,000s of servers in datacenters
[Figure: multi-tier search architecture — web servers at the front end fan queries (Q) out through root and parent servers to leaf servers at the back end; responses (R) are aggregated on the way back]
Metrics: queries per second (QPS) at a 99th-percentile latency threshold
Massive distributed datasets, multiple tiers, high fan-out
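To make the metric concrete, here is a minimal sketch (synthetic latency trace, nearest-rank percentile; none of this code is from the talk) of computing QPS at a 99th-percentile latency threshold:

```python
import math
import random

def percentile(samples, p):
    """p-th percentile (0 < p <= 100) via the nearest-rank method."""
    s = sorted(samples)
    rank = math.ceil(p / 100.0 * len(s))   # 1-based rank
    return s[rank - 1]

# Synthetic trace: per-request latencies (usec) over a 10-second window.
random.seed(0)
latencies_us = [random.expovariate(1 / 200.0) for _ in range(100_000)]

qps = len(latencies_us) / 10.0
p99 = percentile(latencies_us, 99)
SLO_US = 1000.0                            # hypothetical 1 msec threshold
print(f"QPS = {qps:,.0f}, p99 = {p99:.0f} us,",
      "SLO met" if p99 <= SLO_US else "SLO violated")
```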
High throughput + low latency (100 μsec – 10 msec)
Focus on tail latency (e.g., 99th-percentile SLO)
Distributed and (often) stateful
Multi-tier with high fan-out
Diurnal load patterns, but load is never 0
Apps as collections of micro-services
Loosely-coupled services, each with a bounded context
Even lower tail latency requirements
From 100s of μsec down to just a few μsec
[Adrian Cockcroft’14]
[Figure: diurnal search load over time, alongside power (DVFS off) and tail latency (DVFS on)]
[Lo et al’14]
Impact of interference on websearch's latency (tail latency relative to the target, 100% = at SLO; columns are websearch load):

Load          10%    20%    30%    40%    50%    60%    70%    80%    90%
LLC          >300%  >300%  >300%  >300%  >300%  >300%  >300%   264%   123%
DRAM         >300%  >300%  >300%  >300%  >300%  >300%  >300%   270%   122%
HyperThread   110%   107%   114%   115%   105%   117%   120%   136%  >300%
CPU power     124%   107%   116%   109%   115%   105%   101%   100%   100%
Network        36%    36%    37%    37%    39%    42%    48%    55%    64%

(Color scale: near 100% is OK; toward 300% and beyond is bad)
[Lo et al’15]
[Figure: achieved QoS when co-locating two low-latency workloads — memcached QPS (% peak) vs. 1 ms RPC-service QPS (% peak); regions where both meet QoS, only the 1 ms RPC fails, only memcached fails, or both fail; QoS requirement: 95th-percentile latency < 5× the low-load 95th percentile]
Bad QoS even with minor interference!
[Leverich’14]
[Figure: p95 read latency (µs) vs. IOPS (thousands) for local Flash, iSCSI (1 core), and libaio+libevent (1 core)]
Remote Flash access: 75% throughput drop and 2× the latency
Flash performance varies a lot with read/write ratio
[Figure: p95 read latency (µs) vs. total IOPS (thousands) for read ratios from 100% down to 50%]
The curve you get with sharing vs. the curve you paid for
[Figure: 64-byte TCP echo — latency (µs) and requests per second (millions) for Linux vs. the hardware limit: a 4.8× gap and an 8.8× gap]
[Belay et al’15]
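Purely as an illustration of the benchmark itself (not of how the numbers above were gathered), a Python version of a 64-byte TCP echo round-trip measurement might look like this; the interpreter adds overhead on top of the kernel costs the slide is about, so absolute numbers will be far worse. The host/port are arbitrary:

```python
import socket
import threading
import time

HOST, PORT, MSG_SIZE, N = "127.0.0.1", 9000, 64, 10_000

def echo_server():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((HOST, PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    with conn:
        while True:
            data = conn.recv(MSG_SIZE)
            if not data:
                break
            conn.sendall(data)        # echo back

threading.Thread(target=echo_server, daemon=True).start()
time.sleep(0.1)                       # give the server time to bind

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect((HOST, PORT))
cli.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
msg = b"x" * MSG_SIZE

lat_us = []
for _ in range(N):
    t0 = time.perf_counter()
    cli.sendall(msg)
    seen = 0
    while seen < MSG_SIZE:            # the echo may arrive in pieces
        seen += len(cli.recv(MSG_SIZE - seen))
    lat_us.append((time.perf_counter() - t0) * 1e6)

lat_us.sort()
print(f"median {lat_us[N // 2]:.1f} us, p99 {lat_us[int(N * 0.99) - 1]:.1f} us")
```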
Understand the problem
Understand the opportunity
Latency-critical apps as queuing systems
Tail latency affected by:
Service time, queuing delay, scheduling time, load imbalance
[Figure: tail latency (µs, up to ~4000) vs. load (0.0–1.0) for memcached with N=1 and N=4 workers, compared against M/M/1 and M/M/4 queuing models]
[Delimitrou’15]
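The shape of these curves follows from basic queuing theory (a standard result, not from the slides): in an M/M/1 queue the response time is exponentially distributed with rate μ − λ, so every percentile has a closed form that blows up as load approaches 1:

```latex
T_p \;=\; \frac{\ln\frac{1}{1-p}}{\mu - \lambda}
    \;=\; \frac{\ln\frac{1}{1-p}}{\mu\,(1-\rho)},
\qquad \rho = \lambda/\mu .
% e.g. the 99th percentile is ln(100) ~ 4.6x the mean response time,
% and both grow without bound as \rho -> 1.
```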
[Figure: 95th-percentile latency (µs) vs. memcached QPS (% of peak) — latency stays low until load nears the provisioned QPS]
[Leverich’15]
[Figure: tail latency vs. % of maximum cluster load — tail latency sits far below the limit at low load]
Significant latency slack!
Iso-latency: maintain a constant tail latency at all loads
Enables power management, HW sharing, adaptive batching, …
Key point: use tail latency as the control signal
[Lo et al’14]
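In that spirit, a minimal sketch of an iso-latency control loop: read_p99 and set_power_limit are hypothetical hooks, and the target and thresholds are made up; the actual policy in [Lo et al'14] is workload-specific.

```python
import time

TARGET_US = 500.0            # end-to-end tail latency goal (assumed)
STEP_W = 2.0                 # watts to shed per control period
MIN_W, MAX_W = 40.0, 100.0   # server power limits (assumed)

def iso_latency_loop(read_p99, set_power_limit, period_sec=1.0):
    """Keep measured tail latency just under target, saving the slack as power."""
    limit = MAX_W
    while True:
        p99 = read_p99()                        # hypothetical latency probe
        if p99 > TARGET_US:
            limit = MAX_W                       # SLO at risk: release slack fast
        elif p99 < 0.85 * TARGET_US:
            limit = max(MIN_W, limit - STEP_W)  # ample slack: save power slowly
        set_power_limit(limit)                  # hypothetical power knob
        time.sleep(period_sec)
```

The asymmetry (raise power immediately, lower it gradually) is the natural choice when SLO violations cost much more than wasted watts.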
Understand the problem
Understand the opportunity
End-to-end & latency-aware management
Measures latency slack end-to-end
Sets a uniform latency goal across all servers
RAPL as the knob for power
Power is set by a workload-specific policy
[Figure: leaf (L) and parent (P) servers under the cluster-wide, latency-driven power control scheme]
[Lo et al’14]
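On Linux, the RAPL knob mentioned above is exposed through the powercap sysfs tree; a hedged sketch (the path varies across machines, and root privileges plus the intel_rapl driver are assumed):

```python
# Package-0 RAPL domain under the Linux powercap interface (path may differ).
RAPL = "/sys/class/powercap/intel-rapl:0"

def set_package_power_limit(watts):
    """Set the long-term package power limit (constraint_0), in microwatts."""
    with open(f"{RAPL}/constraint_0_power_limit_uw", "w") as f:
        f.write(str(int(watts * 1_000_000)))

def read_package_energy_uj():
    """Cumulative package energy counter, in microjoules."""
    with open(f"{RAPL}/energy_uj") as f:
        return int(f.read())
```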
Achieves dynamic energy proportionality
[Lo et al’14]
Goal: meet the SLO while using idle resources for batch workloads
End-to-end latency measurements control isolation mechanisms
[Figure: controller architecture — latency readings from the LC workload feed a top-level controller that decides "Can BE grow?"; internal feedback loops manage each isolation mechanism: CPU + memory (cores, DRAM BW), CPU power (DVFS), network (HTB net BW), and the LLC]
[Lo et al’15]
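As an example of one such isolation knob, LLC partitioning with Intel CAT is available through the Linux resctrl filesystem; an illustrative sketch (assumes a CAT-capable CPU, resctrl mounted at /sys/fs/resctrl, and a hypothetical way mask; this is not the system's actual code):

```python
import os

RESCTRL = "/sys/fs/resctrl"   # assumes: mount -t resctrl resctrl /sys/fs/resctrl

def confine_best_effort(pids, cache_mask="0x3"):
    """Pin best-effort tasks to a resctrl group limited to a few LLC ways."""
    group = os.path.join(RESCTRL, "best_effort")
    os.makedirs(group, exist_ok=True)
    # Restrict the group to the LLC ways in cache_mask on cache domain 0.
    with open(os.path.join(group, "schemata"), "w") as f:
        f.write(f"L3:0={cache_mask}\n")
    # The tasks file accepts one task id per write().
    for pid in pids:
        with open(os.path.join(group, "tasks"), "w") as f:
            f.write(str(pid))
```

Growing or shrinking the best-effort share is then just rewriting the schemata file from the feedback loop.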
[Figure: the same controller in action — when the LC workload has latency slack, best-effort (BE) tasks are granted more cores, LLC ways, core frequency, and network bandwidth, up to the LC workload's reserved maximum]
[Lo et al’15]
LC apps + best-effort tasks on the same servers
HW & SW isolation mechanisms (cores, caches, network, power)
Iso-latency-based control for resource allocation
[Lo et al’15]
>90% HW utilization
No tail latency problems
Understand the problem
Understand the opportunity
End-to-end & latency-aware management
Optimize for modern multi-core hardware
Specialize the SW
[Figure: baseline — App 1 and App 2 run in userspace; the OS kernel handles TCP/IP and RX/TX, multiplexed across all cores (C)]
[Figure: step 1 — the OS kernel becomes a control plane; each app gets dedicated cores with per-core RX/TX queues and its own TCP/IP processing]
[Figure: step 2 — protection via Dune: the OS kernel (control plane) runs in host ring 0, per-app dataplanes run in guest ring 0, and application code stays in ring 3]
[Figure: the IX dataplane — IX runs TCP/IP in guest ring 0 on each app's cores; apps such as memcached, and an HTTPd with a custom transport, link against libIX in ring 3; the OS kernel remains the control plane in host ring 0]
[Belay et al’14]
[Figure: IX run-to-completion pipeline, steps 1–6 — packets are pulled from the RX FIFO, processed by TCP/IP, and delivered as event conditions to the event-driven app via libIX (ring 3); the app replies with batched syscalls that are processed by TCP/IP and timers in guest ring 0, then transmitted]
Improves data cache locality
Removes scheduling unpredictability
[Figure: the same pipeline with an adaptive batch calculation stage]
Adaptive batch calculation, enabled by iso-latency
Improves instruction cache locality and prefetching
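The slides don't spell out the batching algorithm; one illustrative way to realize "adaptive batch calculation enabled by iso-latency" is a bounded batch size that grows while the tail has slack and shrinks the moment it doesn't (all thresholds here are made up):

```python
def next_batch_size(p99_us, slo_us, batch, bmin=1, bmax=64):
    """Pick the next batch size from measured tail latency vs. the SLO."""
    if p99_us > slo_us:
        return max(bmin, batch // 2)   # tail at risk: favor latency
    if p99_us < 0.8 * slo_us:
        return min(bmax, batch + 1)    # ample slack: favor throughput
    return batch                       # near the target: hold steady
```

Larger batches amortize instruction-cache misses across requests, which is exactly the locality benefit cited above.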
[Figure: throughput–latency curves for Linux (avg), Linux (99th pct), IX (avg), IX (99th pct) — IX delivers 5× more RPS at 2× lower tail latency; IX's 99th percentile sits below Linux's average]
With full TCP, commodity HW, protection, and 1M connections
[Belay et al’14]
[Figure: memcached (MC) load, MC latency, and power over 200 seconds (energy proportionality); MC load, MC latency, and best-effort (BE) performance over 200 seconds (consolidation)]
[Figure: ReFlex reuses the IX structure — a Linux kernel control plane in host ring 0, per-core dataplanes with RX/TX paths in guest ring 0 via Dune; apps on dedicated cores link libIX and talk to NVMe submission/completion queues (SQ/CQ)]
[Figure: ReFlex server, request path (steps 1–4) — RX, TCP/IP, event conditions, batched syscalls into the NVMe scheduler and submission queue (SQ)]
[Klimovic et al’17]
[Figure: ReFlex server, response path (steps 5–8) — NVMe completion queue (CQ), scheduler, TCP/IP, TX]
[Klimovic et al’17]
[Figure: p95 read latency (µs) vs. IOPS (thousands) for Local-1T, ReFlex-1T, and Libaio-1T]
ReFlex: 850K IOPS/core vs. Linux (libaio): 75K IOPS/core
Unloaded latency: local Flash 78 µs, ReFlex 99 µs, libaio 121 µs
[Figure: with a second thread (Local-2T, ReFlex-2T, Libaio-2T)]
ReFlex saturates the Flash device
[Figure: per-tenant IOPS (thousands) and p95 read latency (µs) for Tenants A–D, with the I/O scheduler disabled vs. enabled; annotations mark Tenant A's latency SLO and the IOPS SLOs of Tenants A and B]
Tenants A & B: latency-critical; Tenants C & D: best-effort
Without the scheduler, the latency and bandwidth QoS of A and B are violated
The scheduler rate-limits the best-effort tenants to enforce the SLOs
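The deck doesn't give the scheduler's algorithm; a generic token-bucket sketch of "rate-limit the best-effort tenants" (all rates hypothetical, not ReFlex's actual implementation) could look like:

```python
import time

class TokenBucket:
    """Admit requests at a sustained rate with a bounded burst."""
    def __init__(self, rate_iops, burst):
        self.rate = rate_iops            # tokens replenished per second
        self.burst = burst               # bucket capacity
        self.tokens = burst
        self.last = time.monotonic()

    def admit(self, cost=1.0):
        """Admit an I/O costing `cost` tokens; writes could be charged more,
        since the read/write ratio dominates Flash behavior (see above)."""
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                     # queue or drop the best-effort I/O

# Best-effort tenants (C, D) get buckets; latency-critical tenants bypass them.
be_bucket = TokenBucket(rate_iops=20_000, burst=64)
```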
Programming models for low tail latency
Especially for multi-tier applications
Variability within and across applications
Short vs long requests
Cluster-level issues
Provisioning, scheduling, load-balancing, throttling, …
μsecond-scale tail latency
Can we still get efficiency at ultra low latency?
Hardware for latency-critical applications
Big vs little cores, isolation mechanisms, accelerators
Conventional wisdom is wrong
Low latency is compatible with high efficiency
Efficiency: energy proportionality, resource usage, throughput
Encouraging initial results
Room for improvements throughout the stack