SLIDE 1

Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices

Yu Gan, Yanqi Zhang, Kelvin Hu, Dailun Cheng, Yuan He, Meghna Pancholi, and Christina Delimitrou
Cornell University

ASPLOS – April 15th, 2019

SLIDE 5

Executive Summary

From monoliths to microservices:
• Monoliths → all functionality in a single service
• Microservices → many single-concerned, loosely-coupled services

Microservices implications:
• Modularity, specialization, faster development
• Performance unpredictability (μs-level QoS), cascading QoS violations
• A-posteriori debugging

Seer: proactive performance debugging for interactive microservices
• Leverages DL to anticipate & diagnose the root cause of QoS violations
• >90% accuracy on large-scale end-to-end microservice deployments
• Avoids unpredictable performance
• Offers insight to improve microservice design and deployment

SLIDE 9

Motivation

Advantages of microservices:
• Modular → easier to understand
• Speed of development & deployment
• On-demand provisioning, elasticity
• Language/framework heterogeneity

[Figure: the same application (webserver, databases, ads, posts, photos, recommender) built as a monolith vs. as microservices]

SLIDE 10

Performance Debugging Challenges

• Microservices complicate cluster management & performance debugging
• Dependencies cause cascading QoS violations
• Difficult to isolate the root cause of performance unpredictability

[Figure: large-scale microservice dependency graphs at Netflix, Twitter, and Amazon]

SLIDE 19

Performance Debugging Challenges

• Dependencies cause cascading QoS violations
• Empirical performance debugging → too slow; bottlenecks propagate
• Long recovery times once performance degrades

[Figure: microservice dependency graphs for Netflix, Amazon, and the Social Network application; the animation shows QoS met turning into QoS violated as the bottleneck propagates]

Demo: http://www.csl.cornell.edu/~delimitrou/2019.asplos.seer.demo_motivation.mp4

SLIDE 20

Seer: Proactive Performance Debugging

• Use ML to identify the culprit (root cause) of an upcoming QoS violation
  • Leverage the massive amount of distributed traces collected over time
  • Use targeted per-server hardware probes to determine the cause of the QoS violation
• Inform the cluster manager to take proactive action & prevent the QoS violation
• Need to predict 100s of msec to a few seconds into the future (a rough sketch of this loop follows below)

[Figure: Seer consumes traces from TraceDB and informs the cluster manager]
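To make this flow concrete, here is a minimal sketch of the proactive loop, assuming hypothetical `trace_db`, `model`, and `cluster_manager` objects; this is not the authors' implementation, only an illustration of the control flow the slide describes.

```python
# Illustrative sketch of a proactive debugging loop (hypothetical names and APIs):
# read the latest traces, predict which microservice is likely to initiate a
# QoS violation, probe that node, and let the cluster manager act in time.
import time

VIOLATION_THRESHOLD = 0.9   # act only on high-confidence predictions

def proactive_loop(trace_db, model, cluster_manager):
    while True:
        window = trace_db.latest_window()          # per-service latencies & queue lengths
        probs = model.predict(window)              # {service: P(initiates QoS violation)}
        for service, p in probs.items():
            if p > VIOLATION_THRESHOLD:
                cause = cluster_manager.probe_node(service)   # targeted hardware probes
                cluster_manager.mitigate(service, cause)      # act before the violation hits
        time.sleep(0.1)   # predictions look 100s of ms to a few seconds ahead
```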

SLIDE 21

Instrumentation & Tracing

• Two-level tracing
• Distributed RPC-level tracing (an illustrative span record is sketched below)
  • Similar to Dapper, Zipkin
  • Per-microservice latencies
  • Inter- and intra-microservice queue lengths
  • Tracing overhead: <0.1% in QPS, <0.2% in 99th-percentile latency
• Per-node hardware monitoring
  • Targeted at nodes with problematic microservices
  • Perf counters & contentious microbenchmarks

[Figure: request path from the Client through the Front-end (TCP RX, epoll, Nginx processing, TCP TX), a load balancer, the Logic tiers, and the Back-end databases]
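To make the RPC-level layer concrete, below is a minimal sketch of the kind of span record a Dapper/Zipkin-style tracer collects per microservice; the field and service names are illustrative assumptions, not Seer's actual schema.

```python
# One span per RPC hop; joined on trace_id, spans reconstruct the end-to-end
# request path and give per-microservice latencies and queue lengths.
from dataclasses import dataclass

@dataclass
class RpcSpan:
    trace_id: str          # shared by every span of one end-to-end request
    span_id: str
    parent_span_id: str    # links the span into the RPC dependency graph
    service: str           # e.g. "nginx-frontend", "post-storage" (illustrative names)
    start_us: int          # microsecond timestamps
    end_us: int
    queue_len_net: int     # inter-microservice (network) queue length
    queue_len_app: int     # intra-microservice (application) queue length

    @property
    def latency_us(self) -> int:
        return self.end_us - self.start_us
```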

SLIDE 26

DL for Cloud Performance Debugging

Why?
• Architecture-agnostic
• Adjusts to changes over time
• High accuracy, good scalability & fast inference (within the window of opportunity)

Output signal: probability that a microservice will initiate a QoS violation in the near future

SLIDE 33

DL for Cloud Performance Debugging

Input signal:
• Container utilization
• Latency
• Queue length

Output signal: probability that a microservice will initiate a QoS violation in the near future

[Figure: the network performs dimensionality reduction over the input signal, then near-future prediction]

SLIDE 34

DL for Cloud Performance Debugging

DNN configuration:
• CNN: fast, but cannot effectively predict the future
• LSTM: higher accuracy, but affected by noisy, non-critical microservices
• Hybrid network: highest accuracy, without significantly higher overhead (sketched below)

[Figure: input signal (per-microservice queue lengths) → network → output signal (probability that a microservice will initiate a QoS violation in the near future)]
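A hedged sketch of the hybrid idea above, assuming a PyTorch implementation: convolutions compress the per-timestep signal across microservices (the dimensionality-reduction stage), an LSTM models how it evolves (the near-future-prediction stage), and a sigmoid head emits one violation probability per microservice. Layer sizes, shapes, and names are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class HybridNet(nn.Module):
    def __init__(self, n_services: int, n_features: int, hidden: int = 128):
        super().__init__()
        # Dimensionality reduction: 1D convolution across the service axis,
        # applied independently at every timestep.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_features, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(16),   # n_services -> 16 summary positions
        )
        # Near-future prediction over the compressed per-timestep vectors.
        self.lstm = nn.LSTM(input_size=32 * 16, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_services)

    def forward(self, x):               # x: (batch, time, n_services, n_features)
        b, t, s, f = x.shape
        z = self.cnn(x.reshape(b * t, s, f).transpose(1, 2))   # (b*t, 32, 16)
        z = z.reshape(b, t, -1)
        out, _ = self.lstm(z)
        return torch.sigmoid(self.head(out[:, -1]))            # per-service probability

# Example: 40 microservices, 3 signals each (utilization, latency, queue length),
# a window of 20 trace samples.
probs = HybridNet(40, 3)(torch.randn(8, 20, 40, 3))   # shape: (8, 40)
```

The point of the ordering is that the LSTM only sees a compact vector per timestep, which is what keeps inference fast enough to fit in the window of opportunity.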

SLIDE 35

Methodology

• Training once: slow (hours to days)
  • Across load levels, load distributions, request types
  • Annotated queue traces → inject microbenchmarks to force controlled QoS violations (an illustrative windowing sketch follows below)
  • Weight/bias inference with SGD
  • Incremental retraining & dynamic expanding/shrinking of the network in the background
• Inference: continuously streaming traces
• 20-server dedicated heterogeneous cluster
  • Different server configurations
  • 10s of cores, >100 GB RAM per server
• 4 end-to-end applications, ~30-40 unique microservices each
  • Social Network, Media Service, E-commerce Site, Banking System
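As one way to picture the training data (an assumption about the pipeline, not the authors' code), each example below is a sliding window over the annotated queue traces, labeled with which microservices violate QoS within the prediction horizon:

```python
import numpy as np

def make_examples(trace, labels, window=20, horizon=10):
    """trace:  (T, n_services, n_features) array of sampled metrics.
    labels: (T, n_services) 0/1 array, 1 where a violation was injected/observed."""
    xs, ys = [], []
    for t in range(window, len(trace) - horizon):
        xs.append(trace[t - window:t])                 # input: the last `window` samples
        ys.append(labels[t:t + horizon].max(axis=0))   # target: violation within the horizon?
    return np.stack(xs), np.stack(ys)
```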

SLIDE 36

End-to-end Microservices

• Social Network

SLIDE 40

Validation

• 91% accuracy in signaling upcoming QoS violations
• 88% accuracy in attributing the QoS violation to the correct microservice
• 50 GB input training dataset
  • Accuracy levels off thereafter
• 50 ms tracing sampling interval
  • No benefit from finer-grain tracing

SLIDE 43

Sensitivity Analysis

• Tracing interval > 500 ms → low accuracy
• Tracing interval < 100 ms → no further improvement
• Large increase in accuracy until ~50 GB training set
  • Levels off afterwards
  • Large increase in training time after 50 GB

SLIDE 46

Avoiding QoS Violations

• Identify the cause of the QoS violation
  • Private cluster: performance counters & utilization monitors
  • Public cluster: contentious microbenchmarks
• Adjust resource allocation (see the dispatch sketch below)
  • RAPL (fine-grain DVFS) & scale-up for CPU contention
  • Cache partitioning (CAT) for cache contention
  • Memory capacity partitioning for memory contention
  • Network bandwidth partitioning (HTB) for network contention
  • Storage bandwidth partitioning for I/O contention
• Application-level bugs
  • A human needs to intervene
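The dispatch sketch below illustrates how a diagnosed cause could map onto one of the actuation mechanisms listed above; the manager methods are hypothetical placeholders, not Seer's or Linux's actual APIs.

```python
# Map a diagnosed contention source to a mitigation action (illustrative only).
MITIGATIONS = {
    "cpu":     lambda mgr, svc: mgr.raise_cpu_allocation(svc),    # RAPL (fine-grain DVFS) / scale-up
    "cache":   lambda mgr, svc: mgr.repartition_llc(svc),         # cache partitioning (CAT)
    "memory":  lambda mgr, svc: mgr.repartition_memory(svc),      # memory capacity partitioning
    "network": lambda mgr, svc: mgr.reshape_net_bandwidth(svc),   # network bandwidth partitioning (HTB)
    "io":      lambda mgr, svc: mgr.repartition_storage_bw(svc),  # storage bandwidth partitioning
}

def mitigate(manager, service, cause):
    action = MITIGATIONS.get(cause)
    if action is None:
        manager.alert_operator(service, cause)   # e.g. an application-level bug: a human intervenes
    else:
        action(manager, service)
```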

SLIDES 47-55

[Animated example: a request flows through the Front end, Logic tiers, and Back end; a growing queue and rising CPU load take the service from QoS met to QoS violated]

SLIDE 56

Demo

Demo: http://www.csl.cornell.edu/~delimitrou/2019.asplos.seer.demo.mp4

SLIDE 57

Using ML to Design Better Cloud Systems

• Large-scale Social Network deployment (~600 users, ~2-month deployment)
• Offload Seer onto Google TPU v2
• 24x-118x improvement in training and inference
• Several bugs found (blocking RPCs, livelocks, shared data structures, cyclic dependencies, insufficient resources, etc.)
• Fewer QoS violations over time

SLIDE 58

Conclusions

• Microservices are becoming increasingly popular
• Traditional performance debugging techniques do not scale and introduce long recovery times
• Seer leverages DL to anticipate QoS violations & find their root causes
  • >90% detection accuracy; avoids 86% of QoS violations
  • Provides insight on how to better design and deploy complex microservices
  • A practical solution for systems whose scale makes previous empirical solutions impractical

SLIDE 60

Questions?

Thank you!
