No Tradeoff
Low Latency + High Efficiency
Christos Kozyrakis
http://mast.stanford.edu
Latency-critical Applications
A growing class of online workloads
Search, social networking, software-as-a-service (SaaS), …
Occupy 1,000s of servers in datacenters
[Figure: multi-tier search architecture — web servers at the front end fan queries (Q) out through root and parent servers to leaf servers at the back end; responses (R) are aggregated on the way back]
Metrics: queries per second (QPS) at a 99th-percentile latency threshold
Massive distributed datasets, multiple tiers, high fan-out
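To make the metric concrete, here is a minimal sketch (synthetic latency trace, nearest-rank percentile; none of this code is from the talk) of computing QPS at a 99th-percentile latency threshold:

```python
import math
import random

def percentile(samples, p):
    """p-th percentile (0 < p <= 100) via the nearest-rank method."""
    s = sorted(samples)
    rank = math.ceil(p / 100.0 * len(s))   # 1-based rank
    return s[rank - 1]

# Synthetic trace: per-request latencies (usec) over a 10-second window.
random.seed(0)
latencies_us = [random.expovariate(1 / 200.0) for _ in range(100_000)]

qps = len(latencies_us) / 10.0
p99 = percentile(latencies_us, 99)
SLO_US = 1000.0                            # hypothetical 1 msec threshold
print(f"QPS = {qps:,.0f}, p99 = {p99:.0f} us,",
      "SLO met" if p99 <= SLO_US else "SLO violated")
```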
High throughput + low latency (100 μsec – 10 msec)
Focus on tail latency (e.g., 99th-percentile SLO)
Distributed and (often) stateful
Multi-tier with high fan-out
Diurnal load patterns, but load is never 0
Apps as collections of micro-services
Loosely-coupled services, each with a bounded context
Even lower tail latency requirements
From 100s of μsec down to just a few μsec
[Adrian Cockcroft’14]
[Figure: diurnal search load over time, alongside power (DVFS off) and tail latency (DVFS on)]
[Lo et al’14]
Impact of interference on websearch's latency (tail latency relative to the target, 100% = at SLO; columns are websearch load):

Load          10%    20%    30%    40%    50%    60%    70%    80%    90%
LLC          >300%  >300%  >300%  >300%  >300%  >300%  >300%   264%   123%
DRAM         >300%  >300%  >300%  >300%  >300%  >300%  >300%   270%   122%
HyperThread   110%   107%   114%   115%   105%   117%   120%   136%  >300%
CPU power     124%   107%   116%   109%   115%   105%   101%   100%   100%
Network        36%    36%    37%    37%    39%    42%    48%    55%    64%

(Color scale: near 100% is OK; toward 300% and beyond is bad)
[Lo et al’15]
[Figure: achieved QoS when co-locating two low-latency workloads — memcached QPS (% peak) vs. 1 ms RPC-service QPS (% peak); regions where both meet QoS, only the 1 ms RPC fails, only memcached fails, or both fail; QoS requirement: 95th-percentile latency < 5× the low-load 95th percentile]
Bad QoS even with minor interference!
[Leverich’14]
[Figure: p95 read latency (µs) vs. IOPS (thousands) for local Flash, iSCSI (1 core), and libaio+libevent (1 core)]
Remote Flash access: 75% throughput drop and 2× the latency
Flash performance varies a lot with read/write ratio
[Figure: p95 read latency (µs) vs. total IOPS (thousands) for read ratios from 100% down to 50%]
The curve you get with sharing vs. the curve you paid for
[Figure: 64-byte TCP echo — latency (µs) and requests per second (millions) for Linux vs. the hardware limit: a 4.8× gap and an 8.8× gap]
[Belay et al’15]
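Purely as an illustration of the benchmark itself (not of how the numbers above were gathered), a Python version of a 64-byte TCP echo round-trip measurement might look like this; the interpreter adds overhead on top of the kernel costs the slide is about, so absolute numbers will be far worse. The host/port are arbitrary:

```python
import socket
import threading
import time

HOST, PORT, MSG_SIZE, N = "127.0.0.1", 9000, 64, 10_000

def echo_server():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((HOST, PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    with conn:
        while True:
            data = conn.recv(MSG_SIZE)
            if not data:
                break
            conn.sendall(data)        # echo back

threading.Thread(target=echo_server, daemon=True).start()
time.sleep(0.1)                       # give the server time to bind

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect((HOST, PORT))
cli.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
msg = b"x" * MSG_SIZE

lat_us = []
for _ in range(N):
    t0 = time.perf_counter()
    cli.sendall(msg)
    seen = 0
    while seen < MSG_SIZE:            # the echo may arrive in pieces
        seen += len(cli.recv(MSG_SIZE - seen))
    lat_us.append((time.perf_counter() - t0) * 1e6)

lat_us.sort()
print(f"median {lat_us[N // 2]:.1f} us, p99 {lat_us[int(N * 0.99) - 1]:.1f} us")
```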
Understand the problem
Understand the opportunity
Latency-critical apps as queuing systems
Tail latency affected by:
Service time, queuing delay, scheduling time, load imbalance
[Figure: tail latency (µs, up to ~4000) vs. load (0.0–1.0) for memcached with N=1 and N=4 workers, compared against M/M/1 and M/M/4 queuing models]
[Delimitrou’15]
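The shape of these curves follows from basic queuing theory (a standard result, not from the slides): in an M/M/1 queue the response time is exponentially distributed with rate μ − λ, so every percentile has a closed form that blows up as load approaches 1:

```latex
T_p \;=\; \frac{\ln\frac{1}{1-p}}{\mu - \lambda}
    \;=\; \frac{\ln\frac{1}{1-p}}{\mu\,(1-\rho)},
\qquad \rho = \lambda/\mu .
% e.g. the 99th percentile is ln(100) ~ 4.6x the mean response time,
% and both grow without bound as \rho -> 1.
```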
[Figure: 95th-percentile latency (µs) vs. memcached QPS (% of peak) — latency stays low until load nears the provisioned QPS]
[Leverich’15]
[Figure: tail latency vs. % of maximum cluster load — tail latency sits far below the limit at low load]
Significant latency slack!
Iso-latency: maintain a constant tail latency at all loads
Enables power management, HW sharing, adaptive batching, …
Key point: use tail latency as the control signal
[Lo et al’14]
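In that spirit, a minimal sketch of an iso-latency control loop: read_p99 and set_power_limit are hypothetical hooks, and the target and thresholds are made up; the actual policy in [Lo et al'14] is workload-specific.

```python
import time

TARGET_US = 500.0            # end-to-end tail latency goal (assumed)
STEP_W = 2.0                 # watts to shed per control period
MIN_W, MAX_W = 40.0, 100.0   # server power limits (assumed)

def iso_latency_loop(read_p99, set_power_limit, period_sec=1.0):
    """Keep measured tail latency just under target, saving the slack as power."""
    limit = MAX_W
    while True:
        p99 = read_p99()                        # hypothetical latency probe
        if p99 > TARGET_US:
            limit = MAX_W                       # SLO at risk: release slack fast
        elif p99 < 0.85 * TARGET_US:
            limit = max(MIN_W, limit - STEP_W)  # ample slack: save power slowly
        set_power_limit(limit)                  # hypothetical power knob
        time.sleep(period_sec)
```

The asymmetry (raise power immediately, lower it gradually) is the natural choice when SLO violations cost much more than wasted watts.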
Understand the problem
Understand the opportunity
End-to-end & latency-aware management
Measures latency slack end-to-end
Sets a uniform latency goal across all servers
RAPL as the knob for power
Power is set by a workload-specific policy
[Figure: leaf (L) and parent (P) servers under the cluster-wide, latency-driven power control scheme]
[Lo et al’14]
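On Linux, the RAPL knob mentioned above is exposed through the powercap sysfs tree; a hedged sketch (the path varies across machines, and root privileges plus the intel_rapl driver are assumed):

```python
# Package-0 RAPL domain under the Linux powercap interface (path may differ).
RAPL = "/sys/class/powercap/intel-rapl:0"

def set_package_power_limit(watts):
    """Set the long-term package power limit (constraint_0), in microwatts."""
    with open(f"{RAPL}/constraint_0_power_limit_uw", "w") as f:
        f.write(str(int(watts * 1_000_000)))

def read_package_energy_uj():
    """Cumulative package energy counter, in microjoules."""
    with open(f"{RAPL}/energy_uj") as f:
        return int(f.read())
```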
Achieves dynamic energy proportionality
[Lo et al’14]
Goal: meet the SLO while using idle resources for batch workloads
End-to-end latency measurements control isolation mechanisms
[Figure: controller architecture — latency readings from the LC workload feed a top-level controller that decides "Can BE grow?"; internal feedback loops manage each isolation mechanism: CPU + memory (cores, DRAM BW), CPU power (DVFS), network (HTB net BW), and the LLC]
[Lo et al’15]
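As an example of one such isolation knob, LLC partitioning with Intel CAT is available through the Linux resctrl filesystem; an illustrative sketch (assumes a CAT-capable CPU, resctrl mounted at /sys/fs/resctrl, and a hypothetical way mask; this is not the system's actual code):

```python
import os

RESCTRL = "/sys/fs/resctrl"   # assumes: mount -t resctrl resctrl /sys/fs/resctrl

def confine_best_effort(pids, cache_mask="0x3"):
    """Pin best-effort tasks to a resctrl group limited to a few LLC ways."""
    group = os.path.join(RESCTRL, "best_effort")
    os.makedirs(group, exist_ok=True)
    # Restrict the group to the LLC ways in cache_mask on cache domain 0.
    with open(os.path.join(group, "schemata"), "w") as f:
        f.write(f"L3:0={cache_mask}\n")
    # The tasks file accepts one task id per write().
    for pid in pids:
        with open(os.path.join(group, "tasks"), "w") as f:
            f.write(str(pid))
```

Growing or shrinking the best-effort share is then just rewriting the schemata file from the feedback loop.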
[Figure: the same controller in action — when the LC workload has latency slack, best-effort (BE) tasks are granted more cores, LLC ways, core frequency, and network bandwidth, up to the LC workload's reserved maximum]
[Lo et al’15]
LC apps + best-effort tasks on the same servers
HW & SW isolation mechanisms (cores, caches, network, power)
Iso-latency-based control for resource allocation
[Lo et al’15]
>90% HW utilization
No tail latency problems
Understand the problem
Understand the opportunity
End-to-end & latency-aware management
Optimize for modern multi-core hardware
Specialize the SW
[Figure: baseline — App 1 and App 2 run in userspace; the OS kernel handles TCP/IP and RX/TX, multiplexed across all cores (C)]
[Figure: step 1 — the OS kernel becomes a control plane; each app gets dedicated cores with per-core RX/TX queues and its own TCP/IP processing]
[Figure: step 2 — protection via Dune: the OS kernel (control plane) runs in host ring 0, per-app dataplanes run in guest ring 0, and application code stays in ring 3]
[Figure: the IX dataplane — IX runs TCP/IP in guest ring 0 on each app's cores; apps such as memcached, and an HTTPd with a custom transport, link against libIX in ring 3; the OS kernel remains the control plane in host ring 0]
[Belay et al’14]
[Figure: IX run-to-completion pipeline, steps 1–6 — packets are pulled from the RX FIFO, processed by TCP/IP, and delivered as event conditions to the event-driven app via libIX (ring 3); the app replies with batched syscalls that are processed by TCP/IP and timers in guest ring 0, then transmitted]
Improves data cache locality
Removes scheduling unpredictability
[Figure: the same pipeline with an adaptive batch calculation stage]
Adaptive batch calculation, enabled by iso-latency
Improves instruction cache locality and prefetching
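The slides don't spell out the batching algorithm; one illustrative way to realize "adaptive batch calculation enabled by iso-latency" is a bounded batch size that grows while the tail has slack and shrinks the moment it doesn't (all thresholds here are made up):

```python
def next_batch_size(p99_us, slo_us, batch, bmin=1, bmax=64):
    """Pick the next batch size from measured tail latency vs. the SLO."""
    if p99_us > slo_us:
        return max(bmin, batch // 2)   # tail at risk: favor latency
    if p99_us < 0.8 * slo_us:
        return min(bmax, batch + 1)    # ample slack: favor throughput
    return batch                       # near the target: hold steady
```

Larger batches amortize instruction-cache misses across requests, which is exactly the locality benefit cited above.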
[Figure: throughput–latency curves for Linux (avg), Linux (99th pct), IX (avg), IX (99th pct) — IX delivers 5× more RPS at 2× lower tail latency; IX's 99th percentile sits below Linux's average]
With full TCP, commodity HW, protection, and 1M connections
[Belay et al’14]
[Figure: memcached (MC) load, MC latency, and power over 200 seconds (energy proportionality); MC load, MC latency, and best-effort (BE) performance over 200 seconds (consolidation)]
[Figure: ReFlex reuses the IX structure — a Linux kernel control plane in host ring 0, per-core dataplanes with RX/TX paths in guest ring 0 via Dune; apps on dedicated cores link libIX and talk to NVMe submission/completion queues (SQ/CQ)]
[Figure: ReFlex server, request path (steps 1–4) — RX, TCP/IP, event conditions, batched syscalls into the NVMe scheduler and submission queue (SQ)]
[Klimovic et al’17]
[Figure: ReFlex server, response path (steps 5–8) — NVMe completion queue (CQ), scheduler, TCP/IP, TX]
[Klimovic et al’17]
[Figure: p95 read latency (µs) vs. IOPS (thousands) for Local-1T, ReFlex-1T, and Libaio-1T]
ReFlex: 850K IOPS/core vs. Linux (libaio): 75K IOPS/core
Unloaded latency: local Flash 78 µs, ReFlex 99 µs, libaio 121 µs
[Figure: with a second thread (Local-2T, ReFlex-2T, Libaio-2T)]
ReFlex saturates the Flash device
[Figure: per-tenant IOPS (thousands) and p95 read latency (µs) for Tenants A–D, with the I/O scheduler disabled vs. enabled; annotations mark Tenant A's latency SLO and the IOPS SLOs of Tenants A and B]
Tenants A & B: latency-critical; Tenants C & D: best-effort
Without the scheduler, the latency and bandwidth QoS of A and B are violated
The scheduler rate-limits the best-effort tenants to enforce the SLOs
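The deck doesn't give the scheduler's algorithm; a generic token-bucket sketch of "rate-limit the best-effort tenants" (all rates hypothetical, not ReFlex's actual implementation) could look like:

```python
import time

class TokenBucket:
    """Admit requests at a sustained rate with a bounded burst."""
    def __init__(self, rate_iops, burst):
        self.rate = rate_iops            # tokens replenished per second
        self.burst = burst               # bucket capacity
        self.tokens = burst
        self.last = time.monotonic()

    def admit(self, cost=1.0):
        """Admit an I/O costing `cost` tokens; writes could be charged more,
        since the read/write ratio dominates Flash behavior (see above)."""
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                     # queue or drop the best-effort I/O

# Best-effort tenants (C, D) get buckets; latency-critical tenants bypass them.
be_bucket = TokenBucket(rate_iops=20_000, burst=64)
```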
Programming models for low tail latency
Especially for multi-tier applications
Variability within and across applications
Short vs long requests
Cluster-level issues
Provisioning, scheduling, load-balancing, throttling, …
μsecond-scale tail latency
Can we still get efficiency at ultra low latency?
Hardware for latency-critical applications
Big vs little cores, isolation mechanisms, accelerators
Conventional wisdom is wrong
Low latency is compatible with high efficiency
Efficiency: energy proportionality, resource usage, throughput
Encouraging initial results
Room for improvements throughout the stack