FIRM: An Intelligent Fine-grained Resource Management Framework for SLO-oriented Microservices


SLIDE 1

FIRM: An Intelligent Fine-grained Resource Management Framework for SLO-oriented Microservices

Haoran Qiu*, Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
DEPEND Research Group, University of Illinois at Urbana-Champaign

* Presenter

SLIDE 2

From Monolithic to Microservices

  • Microservice architecture is growing in popularity
      • A set of loosely coupled, self-contained “micro” services
      • Benefits: scalability, fault isolation, flexibility, etc.
  • Scale and complexity are increasing
      • e.g., 700+ microservices (Netflix in ’17), 1000+ (Uber in ’19)
  • Performance is guarded by service level objectives (SLOs)
      • Violations lead to financial loss (e.g., a 100 ms latency increase translated to a $0.7 billion loss in Amazon sales, Q4 ’18)

[Figure: a monolithic binary (Main() calling Functions A-D) compared with a microservices deployment (client, gateway, Services A-D); image labeled “Uber in 2019”]

SLIDE 3

Performance Predictability in Microservices is Hard

  • Challenge #1: Difficulty in isolating root causes of SLO violations
      • Complex inter-microservice dependencies lead to cascading SLO violations
  • Challenge #2: Inability to capture shared-resource contention at a lower level
      • Interference over shared resources (e.g., LLC, memory bandwidth, network devices)
  • Challenge #3: Difficulty in taking the right action to mitigate SLO violations
      • High-fidelity performance models/scheduling heuristics -> significant human effort and training
      • Frequent service updates/migrations -> recurring effort for model reconstruction and retraining

[Figure: containers (Container 1, Container 2) sharing host resources through the container engine and host OS kernel: multicore processor, shared L3 cache, main memory, I/O system (I/O devices, network devices)]

[Figure: 99th-percentile latency (ms), CPU utilization (%), and per-core DRAM access over time (s), with FIRM and without FIRM; annotation: “Undesired SLO Violations”]

SLIDE 4

FIRM As The Cure

  • Two-level machine-learning-based SLO violation mitigation framework
      • Challenge #1: detection and localization of SLO violations to individual microservices
      • Challenges #2 and #3: estimation of resources in contention and dynamic resource reprovisioning
  • Benefits: improved interpretability and less training time
  • Designed, developed, and deployed in a 15-node Kubernetes cluster
  • Outperforms Kubernetes autoscaling by up to 16x in reducing SLO violations

[Figure: FIRM attached to a Kubernetes cluster serving clients. Master node: scheduler, API server, controller manager, etcd. Worker nodes: kubelet, proxy, Docker, pods with containers. FIRM receives measurements and issues actions through its two-level ML model: support vector machine (SVM)-based state inference and reinforcement learning (RL)-based SLO violation mitigation.]

SLIDE 5

Insight 1: Dynamic Behavior of Critical Paths

  • The critical path is the longest path in a request's execution
  • Detecting critical paths helps reveal the performance bottleneck
  • The critical path is not static; it changes dynamically with the performance of individual service instances
      • Different types of underlying shared-resource contention
      • Different degrees of sensitivity to the same type of interference
  • It is important to capture these changes at runtime and to make decisions at runtime

[Figure: four panels (Social Network, Media Service, Train Ticket Booking, Hotel Reservation), each showing the CDF of latency (ms) for the maximum and minimum critical paths (Max-CP, Min-CP); 99th-percentile annotations: 1.3x, 1.5x, 2x, 1.6x]
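
The idea of a critical path as the longest path in a request's execution, and of that path shifting when one instance slows down, can be sketched in a few lines of Python. This is a minimal illustration rather than FIRM's implementation: the call graph, the service names, the latency values, and the simplification that latencies simply add up along a chain of calls are all assumptions made for the example.

```python
from functools import lru_cache

def critical_path(latency_ms, children, root):
    """Return (total latency, path) of the longest latency-weighted path from root.

    latency_ms: per-service processing time (hypothetical values)
    children:   call graph, service -> list of downstream services (must be acyclic)
    """
    @lru_cache(maxsize=None)
    def best(node):
        # Longest path below this node; a leaf contributes nothing further.
        tails = [best(child) for child in children.get(node, [])]
        tail_cost, tail_path = max(tails, default=(0.0, ()))
        return latency_ms[node] + tail_cost, (node,) + tail_path

    total, path = best(root)
    return total, list(path)

# Hypothetical request: the gateway fans out to A -> C and B -> D.
latency_ms = {"gateway": 5, "A": 20, "B": 35, "C": 10, "D": 15}
children = {"gateway": ["A", "B"], "A": ["C"], "B": ["D"]}

normal_total, normal_path = critical_path(latency_ms, children, "gateway")

# If service A slows down (e.g., under interference), the critical path moves.
slow_total, slow_path = critical_path(dict(latency_ms, A=60), children, "gateway")

print(normal_path, normal_total)
print(slow_path, slow_total)
```

With the first set of latencies the path through B and D dominates; once A degrades, the path through A and C becomes critical, which is why the extraction has to be repeated at runtime.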

SLIDE 6

Insight 2: Significance of Latency Variability

  • Microservices with larger latency are not necessarily the root causes of SLO violations
  • Processing time with higher variance makes it harder to obtain low tail latency
  • Variability represents opportunities for reducing latency

[Figure: Social Network, composing-post request. Left: CDF of total latency (ms), curves labeled Before, Text, Compose. Right: CDF of individual latency (ms), curves labeled Text, Compose. Annotations: “High Variance”, “High Median”, “Better”.]

SLIDE 7

State Inference (1)

  • Real-time observability on request execution provided by end-to-end distributed tracing
  • Auto-labeled training data driven by performance anomaly injection

[Figure: service dependency graph (gateway, Services A-D) alongside the span (execution) graph of a request. Tracing pipeline: performance anomaly injector; microservices deployment and service dependency graph (Nginx, PHP-FPM, load balancer, tracing module, microservice instances, replica sets); tracing coordinator.]

SLIDE 8

State Inference (3)

  • Real-time observability on request execution provided by end-to-end distributed tracing
  • Auto-labeled training data driven by performance anomaly injection
  • SLO violation detection and narrowing down via critical path analysis
  • SVM-based critical component localization
      • Given the individual latency vector T_i and the end-to-end latency vector T_CP
      • Relative importance: the Pearson correlation coefficient between T_i and T_CP
      • Congestion intensity: the 99th-percentile value of T_i divided by the median of T_i
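
The two features defined above can be computed directly from latency samples. The sketch below is an illustration under stated assumptions, not FIRM's code: the function names, the nearest-rank percentile definition, and the sample latencies are hypothetical, and the SVM that consumes the features is not shown.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / (sx * sy)

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100] (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def instance_features(t_i, t_cp):
    """Return (relative importance, congestion intensity) for one microservice instance."""
    relative_importance = pearson(t_i, t_cp)
    congestion_intensity = percentile(t_i, 99) / statistics.median(t_i)
    return relative_importance, congestion_intensity

# Hypothetical samples (ms): the instance's own latency and the end-to-end latency.
t_i = [10, 12, 11, 40, 10]
t_cp = [50, 52, 51, 80, 50]
importance, intensity = instance_features(t_i, t_cp)
print(importance, intensity)
```

An instance whose latency tracks the end-to-end latency closely and whose tail is far above its median scores high on both features, which makes it a natural candidate for the classifier to flag.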

[Figure: tracing pipeline (performance anomaly injector; microservices deployment and service dependency graph with Nginx, PHP-FPM, load balancer, tracing module, microservice instances, replica sets; tracing coordinator) feeding the extractor. The execution history graph and telemetry data go through critical path extraction, longestPath(), to produce critical paths, and through critical instance extraction, criticalComponent(), to produce candidates.]

SLIDE 9

Insight 3: No Common Mitigation Policy for All

  • SLO violation mitigation policies vary with applications, user loads, and the types of resources in contention
  • Designing an optimal resource provisioning strategy is intractable, just like scheduling problems
      • Modeling complexity: Tetris [SIGCOMM ’14], Jockey [EuroSys ’12]
      • Placement constraints: TetriSched [EuroSys ’16], device placement [NIPS ’17]
      • Data locality: Delay scheduling [EuroSys ’10], SWAG [SoCC ’15]

[Figure: end-to-end latency (us, log scale) versus load (requests/s, 250 to 2250) for Social Network and Train Ticket Booking; series: scale up and scale out, under CPU and memory contention]

SLIDE 10

Why not human-driven performance engineering?

  • No “one-size-fits-all” solution for the online decision problem
      • The best algorithm depends on the specific workload and system
  • Human-driven performance engineering
      • Assume a simple system model
      • Produce some clever heuristics
      • Painstakingly test and tune the heuristics in practice
      • Redo the above steps
  • Is there a way to work around human-generated heuristics? Yes:
  • RL-based SLO violation mitigation
      • Assume a random scheduling policy
      • Perceive states and receive rewards
      • Optimize the policy based on the rewards
      • Loop continues until convergence
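
The RL loop in the last group of bullets can be illustrated with a deliberately tiny stand-in: an epsilon-greedy agent that starts with no knowledge of its actions and refines value estimates from observed rewards. This is not FIRM's agent. States are omitted, the two action names and their noise-free rewards are hypothetical, and a fixed number of steps replaces a real convergence check.

```python
import random

ACTIONS = ["scale_up", "scale_out"]

def environment(action):
    """Hypothetical, noise-free reward for each action (higher is better)."""
    return {"scale_up": 0.4, "scale_out": 0.7}[action]

def train(steps=200, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    value = {a: 0.0 for a in ACTIONS}  # running estimate of each action's reward
    count = {a: 0 for a in ACTIONS}
    for step in range(steps):
        if step < len(ACTIONS):
            action = ACTIONS[step]                 # try every action once
        elif rng.random() < epsilon:
            action = rng.choice(ACTIONS)           # explore
        else:
            action = max(ACTIONS, key=value.get)   # exploit the current estimate
        reward = environment(action)               # receive the reward
        count[action] += 1
        # Incremental mean: move the estimate toward the observed reward.
        value[action] += (reward - value[action]) / count[action]
    return value

values = train()
best_action = max(values, key=values.get)
print(best_action, values)
```

No heuristic about which action suits the workload is written down by hand; the preference for one action emerges from the rewards alone, which is the point of the bullets above.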
SLIDE 11

SLO Violation Mitigation (1)

  • Observability improved through online distributed tracing
  • Auto-labeled training data and RL online learning driven by the performance anomaly injector
  • SLO violation detection and localization via critical path analysis
      • SVM-based critical component extraction
  • SLO violation mitigation based on reinforcement learning
      • Identifies low-level resource contention
      • Estimates the right amount to reprovision

[Figure: full FIRM pipeline. The tracing coordinator and extractor (critical path extraction via longestPath(), critical instance extraction via criticalComponent()) pass candidates to the RL-based resource estimator, which also reads performance counters. Its re-allocation actions go to the deployment module, which controls CPU, LLC, memory, I/O, network, and replicas.]

SLIDE 12

SLO Violation Mitigation (2)

  • An RL-based resource estimation agent that learns to make provisioning decisions directly from experience
  • Optimizes objectives end-to-end:
      • Minimize SLO violations
      • Maximize resource utilization efficiency

$$ r_t = \alpha \cdot SM_t \cdot |\mathcal{R}| + (1 - \alpha) \cdot \sum_{i}^{|\mathcal{R}|} \frac{RU_t^{i}}{RL_t^{i}}, \qquad SM_t = \frac{Latency_{SLO}}{Latency_t} $$

  • SM_t: SLO maintenance (first term: mitigate SLO violations fast)
  • RU_t^i: resource utilization of i at time t; RL_t^i: resource limit of i at time t (second term: avoid over-provisioning)

[Figure: RL agent (actor, critic, value V_t) and the microservices managed by FIRM. States (s_t): performance and resource measurements, including SLO violation, arrival rate, and utilization. Actions (a_t) over CPU utilization, memory bandwidth, LLC bandwidth, LLC capacity, disk I/O bandwidth, and network bandwidth. Rewards (r_t).]
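
A small numeric sketch may help in reading the reward. It assumes the reward has the form r_t = α · SM_t · |R| + (1 − α) · Σ_i RU_i / RL_i with SM_t = SLO latency / current latency, as the slide's labels suggest; the function name, the resource names, and all numbers below are hypothetical.

```python
def reward(alpha, slo_latency_ms, current_latency_ms, utilization, limit):
    """Sketch of r_t = alpha * SM_t * |R| + (1 - alpha) * sum_i RU_i / RL_i.

    utilization and limit map each controlled resource to its current
    usage and its configured limit (same units per resource).
    """
    resources = list(utilization)
    # SLO maintenance: below 1 when the current latency exceeds the SLO latency.
    sm_t = slo_latency_ms / current_latency_ms
    # Utilization efficiency: each term approaches 1 as usage approaches the limit.
    efficiency = sum(utilization[r] / limit[r] for r in resources)
    return alpha * sm_t * len(resources) + (1 - alpha) * efficiency

# Hypothetical measurement: latency is twice the SLO, resources are half used.
r_t = reward(
    alpha=0.5,
    slo_latency_ms=200,
    current_latency_ms=400,
    utilization={"cpu": 2.0, "memory_bw": 5.0},
    limit={"cpu": 4.0, "memory_bw": 10.0},
)
print(r_t)
```

Raising the limits without need lowers the second term, while letting latency exceed the SLO lowers the first, so under this form the agent is pushed toward provisioning just enough.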

SLIDE 13

Multilevel ML Training

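[Figure: training pipeline on the K8s cluster. Workload generators and the anomaly injector drive the cluster; anomaly injection provides the labeling of features (X) and labels (y) for FIRM's SVM model; generated experience data feeds reinforcement learning training of FIRM's RL agent.]
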
SLIDE 14

Multilevel ML Training

[Figure: the same training pipeline as on the previous slide, with annotations “9x” and “2x”]

SLIDE 15

Evaluation

  • Implemented and deployed FIRM on a Kubernetes cluster of 15 physical nodes
  • Running microservice benchmarks from DeathStarBench [1] and Train Ticket [2], driven by open-loop workload generators
  • Training and experiments driven by performance anomaly injection
  • Comparison targets include Kubernetes autoscaling [3] and an additive-increase multiplicative-decrease (AIMD)-based method [4]
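
For readers unfamiliar with the AIMD baseline, the general pattern can be sketched as follows. This is the generic additive-increase multiplicative-decrease rule, not the modified algorithm of [4] and not the exact baseline used in the evaluation. The direction chosen here (add to the limit while the SLO is violated, shrink it multiplicatively otherwise), the parameter values, and the bounds are all assumptions for illustration.

```python
def aimd_step(limit, slo_violated, add=0.5, mult=0.8, floor=0.5, cap=16.0):
    """One AIMD update of a resource limit (e.g., CPU cores); all constants are hypothetical."""
    if slo_violated:
        limit = limit + add   # additive increase while the SLO is violated
    else:
        limit = limit * mult  # multiplicative decrease to reclaim slack
    return min(cap, max(floor, limit))  # keep the limit within fixed bounds

# Hypothetical trace: two intervals with violations, then one without.
limit = 2.0
history = []
for violated in [True, True, False]:
    limit = aimd_step(limit, violated)
    history.append(limit)
print(history)
```

The rule reacts to a single violated/not-violated signal and has no notion of which resource is in contention, which is the kind of limitation a learned policy is meant to address.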

  • [1] Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, et al. An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 3–18, 2019.
  • [2] Train Ticket - A train-ticket booking system based on microservice architecture. https://github.com/FudanSELab/train-ticket
  • [3] Autoscaling in Kubernetes. https://kubernetes.io/blog/2016/07/autoscaling-in-kubernetes/
  • [4] Sonja Stüdli, M. Corless, Richard H. Middleton, and Robert Shorten. On the modified AIMD algorithm for distributed resource management with saturation of each user’s share. In Proceedings of 2015 54th IEEE Conference on Decision and Control (CDC), pages 1631–1636. IEEE, 2015.

SLIDE 16

Results

  • Reduces the SLO violation mitigation time by up to 9× compared with AIMD
  • Reduces the average tail latencies by up to 6-11×
  • Reduces the overall average requested CPU limit by 29-62%
  • Reduces the number of dropped/timed-out requests by up to 8×

[Figure: CDFs of (a) end-to-end latency (ms), (b) requested CPU limit (%), and (c) number of dropped requests for FIRM (Transferred Single-RL), FIRM (Multi-RL), AIMD, and K8s auto-scaling; annotations: 6× and 11× at the 99th percentile, 9×, 2×]

SLIDE 17

Conclusion

  • FIRM uses SVM-based critical component extraction to localize, at runtime, the microservice instances that are root causes of SLO violations
  • FIRM uses RL to generate workload-specific mitigation policies, optimized to estimate resources in contention and provide reprovisioning actions
  • FIRM leverages the two-level ML model structure to improve interpretability and save training time
  • FIRM outperforms Kubernetes auto-scalers and AIMD-based methods

SLIDE 18

Thank you!

Check out the full paper for more details! (haoranq4@Illinois.edu)
