SLIDE 1

No Tradeoff

Low Latency + High Efficiency

Christos Kozyrakis

http://mast.stanford.edu

SLIDE 2

Latency-critical Applications

A growing class of online workloads

Search, social networking, software-as-service (SaaS), …

Occupy 1,000s of servers in datacenters

SLIDE 3

Example: Web-search

[Figure: multi-tier web-search topology — web servers (front end) send queries (Q) to a root, which fans them out through parent nodes to many leaf servers (back end); responses (R) flow back up.]

Metric: queries per second (QPS) sustained at a 99th-percentile latency threshold; fan-out amplifies the tail, as sketched below.

Massive distributed datasets, multiple tiers, high fan-out
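To see why high fan-out makes the 99th percentile so unforgiving, here is an illustrative back-of-the-envelope sketch (not from the slides): if the root must wait for every leaf, per-leaf tail guarantees multiply.

# Illustrative: tail amplification under fan-out. If each leaf meets the
# latency budget independently with probability p, a root that waits for
# all N leaves meets it with probability p**N.

def fanout_success(p_leaf: float, n_leaves: int) -> float:
    """Probability that every one of n_leaves responds within the budget."""
    return p_leaf ** n_leaves

print(fanout_success(0.99, 100))    # ~0.366: a per-leaf p99 guarantee misses the budget on ~63% of queries
print(fanout_success(0.9999, 100))  # ~0.990: leaves need roughly a p99.99 guarantee for an end-to-end p99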

SLIDE 4

Characteristics

High throughput + low latency (100 μsec – 10 msec)
Focus on tail latency (e.g., 99th-percentile SLO; see the sketch below)
Distributed and (often) stateful
Multi-tier with high fan-out
Diurnal load patterns, but load is never 0
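As a quick, illustrative reminder of why the SLO is stated on the 99th percentile rather than the mean (hypothetical numbers, not from the slides): the mean can look healthy while the tail is several times worse.

import random

random.seed(0)
# Hypothetical trace: exponentially distributed service times with a 200 us mean.
latencies_us = [random.expovariate(1 / 200) for _ in range(10_000)]

def percentile(samples, p):
    """Nearest-rank percentile, e.g. p=0.99 for the 99th percentile."""
    ordered = sorted(samples)
    rank = min(len(ordered) - 1, max(0, int(p * len(ordered)) - 1))
    return ordered[rank]

mean_us = sum(latencies_us) / len(latencies_us)
print(f"mean ~{mean_us:.0f} us, p99 ~{percentile(latencies_us, 0.99):.0f} us")
# Roughly: mean ~200 us but p99 ~900 us -- the SLO is judged on the tail, not the mean.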

SLIDE 5

Trends

Apps as collection of micro-services

Loosely-coupled services each with bounded context

Even lower tail latency requirements

From 100s of μsec down to just a few μsec

[Adrian Cockcroft’14]

SLIDE 6

Conventional Wisdom for LC Apps

SLIDE 7

#1 LC apps cannot be power managed

SLIDE 8

LC Apps & Power Management

[Figure: search load over time, server power with DVFS turned off, and tail latency with DVFS turned on.]

[Lo et al’14]

SLIDE 9

#2 LC apps must use dedicated servers

SLIDE 10

LC Apps & Server Sharing

Impact of interference on web search's latency at each load level (values are tail latency relative to the target; roughly 100% or below is OK, approaching 300% is bad):

Load          10%    20%    30%    40%    50%    60%    70%    80%    90%
LLC           >300%  >300%  >300%  >300%  >300%  >300%  >300%  264%   123%
DRAM          >300%  >300%  >300%  >300%  >300%  >300%  >300%  270%   122%
HyperThread   110%   107%   114%   115%   105%   117%   120%   136%   >300%
CPU power     124%   107%   116%   109%   115%   105%   101%   100%   100%
Network       36%    36%    37%    37%    39%    42%    48%    55%    64%

[Lo et al’15]

SLIDE 11

LC Apps & Core Sharing

[Figure: achieved QoS when two low-latency workloads share cores — memcached QPS (% of peak) vs. 1 ms RPC-service QPS (% of peak), showing the regions where both meet QoS, only one fails, or both fail. QoS requirement: 95th-percentile latency < 5× the low-load 95th percentile.]

Bad QoS even with minor interference!

[Leverich’14]

SLIDE 12

#3 LC apps must use local Flash

SLIDE 13

Local vs. Remote NVMe Access

[Figure: p95 read latency (μs) vs. IOPS (thousands) for local Flash, iSCSI (1 core), and libaio+libevent (1 core).]

Remote access over conventional stacks: ~75% throughput drop and ~2x latency vs. local Flash.

SLIDE 14

Sharing NVMe Flash

Flash performance varies a lot with the read/write ratio.

[Figure: p95 read latency (μs) vs. total IOPS (thousands) for read fractions of 100%, 99%, 95%, 90%, 75%, and 50% — the 100%-read curve is "the curve you paid for"; with sharing and mixed traffic you get a much worse curve.]

SLIDE 15

#4 LC apps must bypass the kernel for I/O, cannot use TCP, cannot use Ethernet, and must use specialized HW

SLIDE 16

LC Apps & Linux/TCP/Ethernet

[Figure: Linux vs. the hardware limit on a 64-byte TCP echo — latency in microseconds and throughput in millions of requests per second; Linux trails the HW limit by 4.8x and 8.8x gaps.]

[Belay et al’15]

SLIDE 17

Conventional Wisdom for LC Apps

SLIDE 18

Low Latency + High Efficiency

Understand the problem
Understand the opportunity

SLIDE 19

Understanding the Latency Problem

Latency-critical apps behave as queuing systems. Tail latency is affected by:

Service time, queuing delay, scheduling time, load imbalance

[Figure: latency (μsec) vs. load (0.0–1.0) for memcached with N=1 and N=4, compared against M/M/1 and M/M/4 queuing models; see the M/M/1 sketch below.]

[Delimitrou’15]
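For the M/M/1 curve, the response-time distribution is exponential with rate μ − λ, so the p-th percentile latency is −ln(1 − p)/(μ − λ). The short sketch below (illustrative parameters, not the measured memcached numbers) shows how the 99th percentile explodes as load approaches saturation, which is the shape in the figure.

import math

def mm1_latency_percentile(mean_service_us: float, load: float, p: float = 0.99) -> float:
    """p-th percentile response time of an M/M/1 queue, in microseconds."""
    assert 0 <= load < 1, "an M/M/1 queue is unstable at load >= 1"
    mu = 1.0 / mean_service_us      # service rate (requests per microsecond)
    lam = load * mu                 # arrival rate for the given utilization
    # Response time ~ Exp(mu - lam), so the p-th percentile is -ln(1 - p) / (mu - lam).
    return -math.log(1.0 - p) / (mu - lam)

for load in (0.2, 0.5, 0.8, 0.9, 0.95):
    print(f"load {load:.2f}: p99 ~ {mm1_latency_percentile(100, load):,.0f} us")
# load 0.20 -> ~576 us ... load 0.95 -> ~9,210 us: the tail blows up near saturation.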

SLIDE 20

Understanding the Latency Problem

[Figure: 95th-percentile latency (μsec) vs. memcached load (% of peak QPS), with the provisioned QPS point marked.]

[Leverich’15]

SLIDE 21

Understanding the Opportunity

[Figure: tail latency vs. % of maximum cluster load — significant latency slack at low and moderate load.]

Iso-latency: maintain a constant tail latency at all loads

Enables power management, HW sharing, adaptive batching, …

Key point: use tail latency as the control signal

[Lo et al’14]

SLIDE 22

Low Latency + High Efficiency

Understand the problem
Understand the opportunity
End-to-end & latency-aware management

SLIDE 23

Pegasus: Iso-latency + Fine-grain DVFS

Measures latency slack end-to-end
Sets a uniform latency goal across all servers

RAPL as the power knob
Power is set by a workload-specific policy

[Diagram: leaf servers (L) report latency readings to a central policy controller (P), which sets per-server power limits; a feedback-loop sketch follows below.]

[Lo et al’14]
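Below is a minimal, illustrative sketch of the iso-latency feedback idea in the spirit of Pegasus, not the actual implementation: compare the measured end-to-end tail latency against the target and raise or lower a per-server power cap. The helpers read_p99_latency_us() and set_rapl_power_cap_w(), the thresholds, and the wattage range are assumptions for illustration.

import time

# Hypothetical sketch of an iso-latency power controller (Pegasus-style).
# read_p99_latency_us() and set_rapl_power_cap_w() are placeholder callables
# supplied by the caller, not real APIs.

SLO_US = 500.0            # end-to-end tail latency target (assumed)
MIN_W, MAX_W = 40, 130    # allowed package power range in watts (assumed)
STEP_W = 5                # adjustment granularity

def pegasus_like_loop(read_p99_latency_us, set_rapl_power_cap_w, period_s=1.0):
    cap_w = MAX_W
    while True:
        p99 = read_p99_latency_us()
        if p99 > SLO_US:
            cap_w = MAX_W                        # SLO violated: lift the cap immediately
        elif p99 > 0.85 * SLO_US:
            cap_w = min(MAX_W, cap_w + STEP_W)   # little slack: give back some power
        else:
            cap_w = max(MIN_W, cap_w - STEP_W)   # plenty of slack: trim power, save energy
        set_rapl_power_cap_w(cap_w)              # e.g. via the RAPL power-limit interface
        time.sleep(period_s)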

SLIDE 24

Pegasus: Iso-latency + Fine-grain DVFS


Achieves dynamic energy proportionality

[Lo et al’14]

SLIDE 25

Heracles: Iso-latency + QoS Isolation

Goal: meet the SLO while using idle resources for batch workloads
End-to-end latency measurements control the isolation mechanisms

[Diagram: Heracles controller — latency readings from the LC workload drive a top-level decision ("Can BE grow?"), with internal feedback loops per resource: cores + memory (CPU, DRAM BW), CPU power (DVFS), network (HTB net. BW), and LLC. An illustrative sketch of the top-level loop follows below.]

[Lo et al’15]
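The sketch below is an illustrative rendering of that controller structure, not the Heracles implementation: latency readings gate a top-level "can best-effort grow?" decision, and per-resource loops act on cores + LLC, CPU power (DVFS), and network bandwidth. The slack thresholds and the `be` sub-controller object are assumptions.

# Illustrative Heracles-style top-level step (assumed thresholds and method
# names; `be` bundles hypothetical per-resource sub-controllers).

def heracles_like_step(slo_us, p99_us, lc_load_fraction, be):
    slack = (slo_us - p99_us) / slo_us
    if slack < 0:
        be.disable()                  # SLO violated: stop all best-effort tasks
    elif lc_load_fraction > 0.85:
        be.disallow_growth()          # LC app near peak load: freeze BE allocation
    elif slack < 0.10:
        be.shrink()                   # slack is thin: reclaim resources from BE
    else:
        be.grow_cores_and_llc()       # cores + LLC ways (e.g. cache partitioning)
        be.grow_cpu_power()           # DVFS / power budget
        be.grow_network_bw()          # HTB network bandwidth share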

SLIDE 26

Heracles: Iso-latency + QoS Isolation

[Diagram (build of the previous slide): with latency slack, the controller lets best-effort (BE) tasks grow alongside the LC workload, assigning them cores, LLC ways, core frequency, and network bandwidth.]

[Lo et al’15]

SLIDE 27

Heracles: Iso-latency + QoS Isolation

[Diagram (build of the previous slide): the same controller, highlighting the latency readings that gate whether BE tasks keep their cores, LLC ways, core frequency, and network bandwidth.]

[Lo et al’15]

SLIDE 28

Iso-latency + QoS Isolation

LC apps + best-effort tasks on the same servers

HW & SW isolation mechanisms (cores, caches, network, power)
Iso-latency-based control for resource allocation

[Lo et al’15]

>90% HW utilization
No tail latency problems

SLIDE 29

Low Latency + High Efficiency

Understand the problem
Understand the opportunity
End-to-end & latency-aware management
Optimize for modern multi-core hardware
Specialize the SW

SLIDE 30

System SW for Low Latency + High BW

[Diagram: baseline architecture — App 1 and App 2 in userspace; the OS kernel in kernelspace runs the TCP/IP stack and RX/TX paths across the cores (C).]

SLIDE 31

System SW for Low Latency + High BW

[Diagram: the OS kernel becomes a control plane; App 1 and App 2 run on dedicated cores (C), each with its own RX/TX queues and TCP/IP processing.]

SLIDE 32

System SW for Low Latency + High BW

[Diagram: with Dune, the OS kernel (control plane) runs in host ring 0, the per-app dataplanes with their RX/TX queues run in guest ring 0, and the applications run in ring 3.]

SLIDE 33

System SW for Low Latency + High BW

[Diagram: IX — the OS kernel acts as the control plane (host ring 0); the IX dataplane with its TCP/IP stack runs in guest ring 0 under Dune; applications such as Memcached (over libIX) and HTTPd (over a custom transport and libIX) run in ring 3 with per-core RX/TX queues.]

[Belay et al’14]

SLIDE 34

Run-to-Completion

[Diagram: IX run-to-completion pipeline (steps 1–6) — packets flow from RX through the dataplane TCP/IP stack into the RX FIFO, event conditions are handed to the event-driven app via libIX in ring 3, the app issues batched syscalls back to the dataplane, which runs TCP/IP and timer processing and transmits (TX).]

Improves data cache locality
Removes scheduling unpredictability

SLIDE 35

Adaptive Batching

[Diagram: the same run-to-completion pipeline, with an adaptive batch calculation stage deciding how many packets to pull from the RX FIFO per iteration.]

Enabled by iso-latency
Improves instruction cache locality and prefetching

A sketch of the loop follows below.
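As an illustration of run-to-completion with adaptive batching, here is a minimal Python-flavoured sketch of the loop structure. IX itself is a C dataplane; the ring objects, the doubling policy, and the batch bound below are assumptions, not its actual code.

MAX_BATCH = 64  # upper bound on packets processed per iteration (assumed)

def dataplane_loop(rx_ring, tx_ring, app_handler):
    batch = 1
    while True:
        pkts = rx_ring.poll(max_pkts=batch)          # 1) pull up to `batch` packets from RX
        responses = [app_handler(p) for p in pkts]   # 2) run each request to completion
        tx_ring.push(responses)                      # 3) emit all responses in one burst
        # Adaptive batching (illustrative policy): grow the bound only while
        # the RX ring keeps filling it, shrink it as soon as load drops.
        if len(pkts) == batch:
            batch = min(MAX_BATCH, batch * 2)
        else:
            batch = max(1, len(pkts))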

SLIDE 36

IX: Low Latency + High Bandwidth

[Figure: latency (µs) vs. throughput (RPS x 10^6) with an SLA line, for Linux (avg and 99th pct) and IX (avg and 99th pct).]

5x more RPS, 2x less tail latency; IX's tail latency is lower than Linux's average

+ TCP + commodity HW + protection + 1M connections

[Belay et al’14]

SLIDE 37

Low Latency + High Bandwidth + High Efficiency

[Figure: two time series over ~200 seconds — left: memcached load, memcached latency, and power; right: memcached load, memcached latency, and best-effort (BE) performance.]

SLIDE 38

ReFlex: IX + QoS Scheduling

[Diagram: ReFlex architecture — the Linux kernel acts as the control plane in host ring 0; IX/ReFlex dataplanes run in guest ring 0 under Dune with per-core RX/TX queues and NVMe submission/completion queues (SQ/CQ); apps link against libIX in ring 3.]

SLIDE 39

ReFlex: IX + QoS Scheduling

[Diagram: ReFlex server, inbound half of the request path (steps 1–4) — RX and TCP/IP processing, event conditions delivered to the app via libIX (ring 3), batched syscalls into the dataplane, and submission to the NVMe scheduler and submission queue (SQ).]

[Klimovic et al’17]

SLIDE 40

ReFlex: IX + QoS Scheduling

[Diagram (continued, steps 5–8): the NVMe completion queue (CQ) delivers the result, which flows back through TCP/IP and TX to the client.]

[Klimovic et al’17]

SLIDE 41

ReFlex: Local ≈ Remote Latency

[Figure: p95 read latency (μs) vs. IOPS (thousands) for Local-1T, ReFlex-1T, and Libaio-1T.]

ReFlex: 850K IOPS/core; Linux: 75K IOPS/core

SLIDE 42

ReFlex: Local ≈ Remote Latency

[Figure: same p95 read latency vs. IOPS curves as the previous slide.]

Unloaded latency: Local Flash 78 µs, ReFlex 99 µs, Libaio 121 µs

SLIDE 43

ReFlex: Local ≈ Remote Latency

[Figure: p95 read latency (μs) vs. IOPS (thousands) for Local-1T/2T, ReFlex-1T/2T, and Libaio-1T/2T.]

ReFlex saturates the Flash device

SLIDE 44

ReFlex: Private ≈ Shared QoS

[Figure: per-tenant IOPS (thousands) and p95 read latency (μs) for Tenants A–D with the I/O scheduler disabled vs. enabled; the latency SLO and the IOPS SLOs for Tenants A and B are marked.]

Tenants A & B: latency-critical; Tenants C & D: best-effort
Without the scheduler, the latency and bandwidth QoS of A and B are violated
The scheduler rate-limits best-effort tenants to enforce the SLOs (see the token-bucket sketch below)
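To make "rate-limit best-effort tenants" concrete, here is a small, illustrative token-bucket admission check. It is a sketch of the general idea, not ReFlex's actual scheduler; the class name, the request fields, and the read/write cost weights are assumptions.

import time

# Illustrative token-bucket rate limiter for best-effort tenants. Latency-
# critical tenants are admitted directly; best-effort tenants only when their
# bucket holds enough tokens for the request's cost.

class TokenBucket:
    def __init__(self, rate_tokens_per_s: float, burst: float):
        self.rate, self.burst = rate_tokens_per_s, burst
        self.tokens, self.last = burst, time.monotonic()

    def try_consume(self, cost: float) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

def admit(request, bucket: TokenBucket) -> bool:
    if request.latency_critical:          # hypothetical request field
        return True                       # LC tenants always go through
    cost = 1.0 if request.is_read else 10.0  # writes cost more on Flash (assumed weights)
    return bucket.try_consume(cost)       # BE request waits when tokens run out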

SLIDE 45

ReFlex: Private ≈ Shared QoS (repeats the charts and takeaways of Slide 44)

SLIDE 46

ReFlex: Private ≈ Shared QoS (repeats the charts and takeaways of Slide 44)

SLIDE 47

Looking Forward

Programming models for low tail latency

Especially for multi-tier applications

Variability within and across applications

Short vs long requests

Cluster-level issues

Provisioning, scheduling, load-balancing, throttling, …

μsecond-scale tail latency

Can we still get efficiency at ultra low latency?

Hardware for latency-critical applications

Big vs little cores, isolation mechanisms, accelerators

SLIDE 48

Conclusions

Conventional wisdom is wrong

Low latency is compatible with high efficiency
Efficiency: energy proportionality, resource usage, throughput

Encouraging initial results

Room for improvements throughout the stack
