[PPT] - FIRM: An Intelligent Fine-grained Resource Management Framework for SLO-oriented Microservices

SLIDE 1

FIRM: An Intelligent Fine-grained Resource Management Framework for SLO-oriented Microservices

Haoran Qiu*, Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
DEPEND Research Group, University of Illinois at Urbana-Champaign

* Presenter

SLIDE 2

From Monolithic to Microservices

Microservice architecture growing in popularity
A set of loosely-coupled, self-contained “micro” services
Scalability, fault isolation, flexibility, etc.
Scale and complexity are increasing
Increasing in scale, e.g. 700+ microservices (Netflix in ’17), 1000+ (Uber in ’19)
Performance guarded by service level objectives (SLOs)
Violations lead to financial loss, e.g. a 100 ms latency increase converted to a $0.7 billion loss in Amazon sales (Q4 ’18)

[Figure: monolithic binary (Main() with Functions A-D) vs. microservices (client, gateway, Services A-D); inset labeled Uber in 2019]

SLIDE 3

Performance Predictability in Microservices is Hard

Challenge #1: Difficulty in isolating root causes of SLO violations
Complex inter-microservice dependencies -> cascading SLO violations
Challenge #2: Inability to capture shared-resource contention at a lower level
Interference over shared resources (e.g. LLC, memory bandwidth, network devices)
Challenge #3: Difficulty in taking the right action to mitigate SLO violations
High fidelity performance models/scheduling heuristics -> significant human-effort and training
Frequent service updates/migrations -> recurring effort for model reconstruction and re-training

[Figure: containers sharing host resources: multicore processor, shared L3 cache, main memory, I/O and network devices, under the container engine and host OS kernel]
[Figure: 99%ile latency (ms), CPU utilization (%), and per-core DRAM access over time (s), with FIRM vs. without FIRM; annotation: undesired SLO violations]

SLIDE 4

FIRM As The Cure

Two-level machine learning based SLO violation mitigation framework
Challenge #1 – Detection and localization of SLO violations to individual microservices
Challenge #2 & #3 – Estimation of resources in contention and dynamic resource reprovisioning
Benefits: Improved interpretability and less training time
Designed, developed, and deployed in a 15-node Kubernetes cluster
Outperforms Kubernetes autoscaling by up to 16x in reducing SLO violations

[Figure: FIRM attached to a Kubernetes cluster. Master node: scheduler, API server, controller manager, etcd. Worker nodes: kubelet, proxy, Docker, pods of containers serving clients. FIRM collects measurements and issues actions through its two-level ML model: support vector machine (SVM)-based state inference and reinforcement learning (RL)-based SLO violation mitigation]

SLIDE 5

Insight 1: Dynamic Behavior of Critical Paths

[Figure: CDFs of latency (ms), Max-CP vs. Min-CP, for four benchmarks: Social Network, Media Service, Train Ticket Booking, Hotel Reservation; 99th-percentile annotations: 1.3x, 1.5x, 2x, 1.6x]

Critical path defines the longest path in execution
Detection of critical paths helps reveal the bottleneck of performance
Critical path is not static, but dynamically changing based on the performance of individual service instances
Different types of underlying shared-resource contention
Different degrees of sensitivity to the same type of interference
It’s important to capture the changes at runtime and make runtime decisions
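The idea that the critical path is the longest path in a request's execution, and that it shifts as individual instances slow down, can be sketched on a toy span graph. This is only an illustration under simplifying assumptions, not FIRM's extraction algorithm: the service names, the latencies, and the helper `longest_path` are hypothetical, each span is reduced to one latency weight, and edges point from caller to callee.

```python
from functools import lru_cache


def longest_path(latency_ms, children, root):
    """Return (total_latency, path) for the heaviest root-to-leaf path in a DAG."""

    @lru_cache(maxsize=None)
    def best(node):
        # Best continuation among the callees; a leaf span has no continuation.
        tails = [best(child) for child in children.get(node, ())]
        tail_cost, tail_path = max(tails, default=(0, ()))
        return latency_ms[node] + tail_cost, (node,) + tail_path

    total, path = best(root)
    return total, list(path)


# Hypothetical call graph: the gateway fans out to two branches.
children = {
    "gateway": ["compose", "media"],
    "compose": ["text"],
    "media": ["storage"],
}

# Two observations of the same request type; only "storage" differs.
run_1 = {"gateway": 5, "compose": 40, "text": 30, "media": 20, "storage": 25}
run_2 = {"gateway": 5, "compose": 40, "text": 30, "media": 20, "storage": 90}

cp_1 = longest_path(run_1, children, "gateway")
cp_2 = longest_path(run_2, children, "gateway")
```

In this toy data the critical path runs through the compose branch in the first observation and through the media branch in the second, which is the kind of runtime change the slide argues has to be captured.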

SLIDE 6

Insight 2: Significance of Latency Variability

Microservices with larger latency are not necessarily the root causes of SLO violations
Processing time with higher variance makes it harder to obtain low tail latency
Variability represents opportunities for reducing latency

[Figure: Social Network, composing-post request. CDFs of total latency (ms) and individual latency (ms); legend: Before, Text, Compose; annotations: high variance, high median, better]

SLIDE 7

State Inference (1)

Real-time observability on request execution provided by end-to-end distributed tracing
Auto-labeled training data driven by the performance anomaly injection

[Figure: service dependence (gateway, Services A-D) and the corresponding span graph (execution); components: performance anomaly injector, microservices deployment and service dependency graph (Nginx, PHP-FPM, load balancer, tracing module, microservice instance, replica set), tracing coordinator]

SLIDE 8

State Inference (3)

Real-time observability on request execution provided by end-to-end distributed tracing
Auto-labeled training data driven by the performance anomaly injection
SLO violation detection and narrowing down via critical path analysis
SVM-based critical component localization
Given individual latency vector Ti and end-to-end latency vector TCP
Relative importance defined as the Pearson correlation coefficient between Ti and TCP
Congestion intensity defined as the 99th percentile value divided by the median value of Ti
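The two per-instance features defined above can be computed as in the following sketch. It assumes latency samples collected per request, a nearest-rank percentile (the slides do not say which percentile method is used), and hypothetical helper names; the resulting feature pair is what would be handed to the SVM, which is not shown here.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]; an assumed convention."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def features(t_i, t_cp):
    """Return (relative importance, congestion intensity) for one instance."""
    relative_importance = pearson(t_i, t_cp)
    congestion_intensity = percentile(t_i, 99) / statistics.median(t_i)
    return relative_importance, congestion_intensity


# Hypothetical samples (ms): a bursty instance and a slow but steady one.
bursty = [10, 12, 11, 50, 10, 13, 12, 11, 10, 11]
steady = [80, 81, 79, 80, 82, 80, 81, 79, 80, 81]
# Toy end-to-end latency that simply tracks the bursty instance.
t_cp = [t + 100 for t in bursty]

bursty_importance, bursty_congestion = features(bursty, t_cp)
steady_importance, steady_congestion = features(steady, t_cp)
```

With these made-up numbers the bursty instance has the lower median latency but by far the higher congestion intensity, which matches the earlier point that the slowest microservice is not necessarily the root cause.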

[Figure: extractor with critical path extraction (longestPath()) and critical instance extraction (criticalComponent()); labels: execution history graph, telemetry data, critical paths, candidates; surrounding components: tracing coordinator, performance anomaly injector, microservices deployment and service dependency graph (Nginx, PHP-FPM, load balancer, tracing module, microservice instance, replica set)]

SLIDE 9

Insight 3: No Common Mitigation Policy for All

SLO violation mitigation policies vary with applications, user loads, and the types of resource in contention
Designing an optimal resource provisioning strategy is intractable, just like scheduling problems
Modeling complexity: Tetris [SIGCOMM ’14], Jockey [EuroSys ’12]
Placement constraints: TetriSched [EuroSys ’16], device placement [NIPS ’17]
Data locality: Delay scheduling [EuroSys ’10], SWAG [SoCC ’15]
…

[Figure: end-to-end latency (us, log scale) vs. load (250-2250 requests/s) for Social Network and Train Ticket Booking; legend: Scale Up, Scale Out, CPU, Memory]

SLIDE 10

Why not human-driven performance engineering?

No “one-size-fits-all” solution for the online decision problem
Best algorithm depends on specific workload and system
Human-driven performance engineering
Assume a simple system model
Produce some clever heuristics
Painstakingly test & tune the heuristics in practice
Redo the above steps
Is there a way to work around human-generated heuristics? Yes


RL-based SLO violation mitigation
Assume a random scheduling policy
Perceive states and receive rewards
Optimize the policy based on the rewards
Loop continues until convergence
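The loop listed above (start from a random policy, observe states and rewards, improve the policy, repeat) can be illustrated with a deliberately small example. Note that this sketch swaps in tabular Q-learning on a made-up environment rather than the actor-critic agent shown elsewhere in the deck; the allocation levels, the reward values, and every name in the code are assumptions chosen for illustration.

```python
import random

NEED = 3                # hypothetical level at which the SLO is just met
LEVELS = range(5)       # toy state space: allocation levels 0..4
ACTIONS = (-1, 0, +1)   # scale down, hold, scale up


def step(level, action):
    """Toy environment: apply the action and return (next_level, reward)."""
    nxt = min(max(level + action, 0), 4)
    # Heavy penalty for an SLO violation, mild penalty for over-provisioning.
    reward = -10.0 if nxt < NEED else -float(nxt - NEED)
    return nxt, reward


def train(episodes=2000, steps=10, lr=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in LEVELS for a in ACTIONS}
    for episode in range(episodes):
        # Begin with a fully random policy, then rely more on learned values.
        epsilon = max(0.1, 1.0 - episode / episodes)
        state = rng.choice(LEVELS)
        for _ in range(steps):
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            nxt, reward = step(state, action)
            # Q-learning update toward reward plus discounted best next value.
            target = reward + gamma * max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += lr * (target - q[(state, action)])
            state = nxt
    # Greedy policy: the best-known action for every allocation level.
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in LEVELS}


policy = train()
```

No heuristic about when to scale is written down anywhere in the code; the greedy policy that comes out of training scales up below the required level, holds at it, and scales back down above it purely from the rewards.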

SLIDE 11

SLO Violation Mitigation (1)

Observability improved through online distributed tracing
Auto-labeled training data and RL online learning driven by the performance anomaly injector
SLO violation detection and localization via critical path analysis
SVM-based critical component extraction
SLO violation mitigation based on reinforcement learning
Identifies low-level resource contention
Estimates the right amount to reprovision

[Figure: RL-based resource estimator (inputs: performance counters; outputs: re-allocation actions) and deployment module; controlled resources: CPU, LLC, memory, I/O, network, replicas. Also shown: extractor (critical path extraction via longestPath(), critical instance extraction via criticalComponent()), execution history graph, telemetry data, tracing coordinator, performance anomaly injector, microservices deployment and service dependency graph]

SLIDE 12

SLO Violation Mitigation (2)

[Figure: RL agent (actor, critic) and the microservices managed by FIRM. States (s_t): performance and resource measurements, including SLO violation and arrival rate; actions (a_t): CPU utilization, memory bandwidth, LLC bandwidth, LLC capacity, disk I/O bandwidth, network bandwidth; rewards (r_t): SLO, utilization]

An RL-based resource estimation agent that learns to make provisioning decisions directly from experience
Optimizes objectives end-to-end:
Minimize SLO violation
Maximize resource utilization efficiency

r_t = \alpha \cdot SM_t \cdot |\mathcal{R}| + (1 - \alpha) \cdot \sum_{i}^{|\mathcal{R}|} RU_t^i / RL_t^i

SM_t = Latency_{SLO} / Latency_t (SLO maintenance)
RU_t^i: resource utilization of i at time t
RL_t^i: resource limit of i at time t
First term: mitigate SLO violation fast; second term: avoid over-provisioning
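Reading the reward on this slide as r_t = alpha * SM_t * |R| + (1 - alpha) * sum over resources i of RU_i / RL_i, with SM_t the SLO latency divided by the current latency, the computation can be sketched as follows. The function name, the resource names, the sample numbers, and alpha = 0.5 are all assumptions; the slide does not give a value for alpha or say whether SM_t is capped.

```python
def reward(latency_slo_ms, latency_ms, utilization, limits, alpha=0.5):
    """Toy version of the reward: SLO maintenance plus utilization efficiency."""
    # SM_t: ratio of the SLO latency to the observed latency.
    slo_maintenance = latency_slo_ms / latency_ms
    # Sum over the resource set R of utilization divided by limit.
    efficiency = sum(utilization[r] / limits[r] for r in limits)
    return alpha * slo_maintenance * len(limits) + (1 - alpha) * efficiency


# Hypothetical measurements for two controlled resources.
limits = {"cpu_cores": 4.0, "mem_bw_gbps": 4.0}
utilization = {"cpu_cores": 2.0, "mem_bw_gbps": 3.0}

# Same utilization, once while violating a 200 ms SLO and once while meeting it.
r_violating = reward(200.0, 400.0, utilization, limits)
r_meeting = reward(200.0, 200.0, utilization, limits)
```

The first term grows as latency drops toward the SLO, and the second grows as allocated resources are actually used, so an agent maximizing the sum is pushed both to clear violations and to avoid idle over-allocation.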

SLIDE 13

Multilevel ML Training

[Figure: training pipeline. Workload generators and the anomaly injector drive the K8S cluster; labeling produces features (X) and labels (y) for FIRM’s SVM model; generated experience data feeds reinforcement learning training of FIRM’s RL agent]

SLIDE 14

Multilevel ML Training

[Figure: the same training pipeline as on the previous slide (K8S cluster, workload generators, anomaly injector, FIRM’s SVM model, FIRM’s RL agent), with annotations 9x and 2x]

SLIDE 15

Evaluation

Implemented and deployed FIRM on a Kubernetes cluster of 15 physical nodes
Running microservices benchmarks from DeathStarBench [1] and Train Ticket [2] driven by open-loop workload generators
Training and experiments driven by performance anomaly injection
Comparison targets include Kubernetes autoscaling [3] and an additive increase multiplicative decrease (AIMD)-based method [4]
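For orientation, the two comparison targets can be sketched in their generic textbook forms. Neither function is taken from the evaluation setup: the first follows the replica formula documented for the Kubernetes Horizontal Pod Autoscaler (including its default tolerance of 0.1), and the second is the classic AIMD rule rather than the modified variant of [4]. Function names, parameters, and sample values are assumptions.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         tolerance=0.1):
    """Replica count per the documented HPA rule: ceil(current * ratio)."""
    ratio = current_metric / target_metric
    # The autoscaler skips scaling when the ratio is close enough to 1.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def aimd_update(share, capacity_event, increase=1.0, decrease=0.5):
    """Classic AIMD: grow additively, shrink multiplicatively on an event."""
    return share * decrease if capacity_event else share + increase


# Hypothetical readings: CPU utilization of 90% against a 60% target.
scaled = hpa_desired_replicas(4, 90, 60)
unchanged = hpa_desired_replicas(4, 63, 60)

# Hypothetical share of 4 units before and after a capacity event.
grown = aimd_update(4.0, capacity_event=False)
shrunk = aimd_update(4.0, capacity_event=True)
```

Both rules react to a single aggregate signal with fixed arithmetic, which is the contrast with a learned, per-resource policy that the comparison is meant to expose.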

[1] Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, et al. An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 3–18, 2019.
[2] Train Ticket - A train-ticket booking system based on microservice architecture. https://github.com/FudanSELab/train-ticket.
[3] Autoscaling in Kubernetes. https://kubernetes.io/blog/2016/07/autoscaling-in-kubernetes/
[4] Sonja Stüdli, M. Corless, Richard H. Middleton, and Robert Shorten. On the modified AIMD algorithm for distributed resource management with saturation of each user’s share. In Proceedings of 2015 54th IEEE Conference on Decision and Control (CDC), pages 1631–1636. IEEE, 2015.

SLIDE 16

Results

[Figure: CDFs of (a) end-to-end latency (ms), (b) requested CPU limit (%), and (c) number of dropped requests for FIRM (Transferred Single-RL), FIRM (Multi-RL), AIMD, and K8S auto-scaling; annotations: 6x and 11x at the 99th percentile, 9x, 2x]

Reduces the SLO violation mitigation time by up to 9× compared with AIMD
Reduces the average tail latencies by up to 6-11×
Reduces the overall average requested CPU limit by 29-62%
Reduces the number of dropped/timed out requests by up to 8×

SLIDE 17

Conclusion

FIRM uses SVM-based critical component extraction to localize, at runtime, the microservice instances that are root causes of SLO violations
FIRM uses RL to generate workload-specific mitigation policies, optimized to estimate resources in contention and provide reprovisioning actions
FIRM leverages the two-level ML model structure to improve interpretability and save training time
FIRM outperforms Kubernetes auto-scalers and AIMD-based methods

SLIDE 18

Thank you!

Check out the full paper for more details! (haoranq4@Illinois.edu)
