Detecting Latency Degradation Patterns in Service-based Systems




SLIDE 1

Detecting Latency Degradation Patterns in Service-based Systems

Vittorio Cortellessa Luca Traini

University of L’Aquila, Italy

11th ACM/SPEC International Conference on Performance Engineering

SLIDE 2

ICPE2020

Challenges in Modern Distributed Systems

Move fast (Rubin and Rinard, 2016) vs. performance assurance
Several performance issues emerge only under real live user traffic (Veeraraghavan et al., 2016).

Julia Rubin and Martin Rinard. 2016. The challenges of staying together while moving fast: an exploratory study. In Proceedings of the 38th International Conference on Software Engineering (ICSE ’16). Association for Computing Machinery, New York, NY, USA, 982–993. DOI:https://doi.org/10.1145/2884781.2884871
Kaushik Veeraraghavan, Justin Meza, David Chou, Wonho Kim, Sonia Margulis, Scott Michelson, Rajesh Nishtala, Daniel Obenshain, Dmitri Perelman, and Yee Jiun Song. 2016. Kraken: leveraging live traffic tests to identify and resolve resource utilization bottlenecks in large scale web services. In Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation (OSDI’16). USENIX Association, USA, 635–650.

SLIDE 3

Performance debugging in production

Fundamental activity during software evolution
Challenge: a request triggers several Remote Procedure Calls (RPCs)
Availability of workflow-centric solutions (e.g. Zipkin [1], Jaeger [2])

[1] https://zipkin.io/
[2] https://www.jaegertracing.io/

SLIDE 4

Triaging requests

SLIDE 5

Triaging requests

Time-consuming computation in RPC1
Slow DB query in both RPC2 and RPC3

SLIDE 6

Latency Degradation Patterns

getProfile execution time is > 30ms
getRecommended execution time is > 20ms AND getCart execution time is > 15ms

SLIDE 7

Latency Degradation Patterns

getProfile execution time is > 30ms
getRecommended execution time is > 20ms AND getCart execution time is > 15ms

Figure labels: each line above is a pattern; each inequality within it is a condition.

SLIDE 8

Formal Notation

A request trace is denoted as: $r = (e_0, e_1, \dots, e_m, L)$
where $e_i$ denotes the execution time of RPC $i$ and $L$ the latency of the entire request.

A condition is denoted as: $c = \langle j, e_{min}, e_{max} \rangle$
where $j$ refers to RPC $j$.

A request trace $r$ satisfies $c$, denoted as $r \lhd c$, if $r = (\dots, e_j, \dots)$ and $e_{min} \le e_j < e_{max}$.

SLIDE 9

Formal Notation

A pattern is denoted as: $P = \{c_0, c_1, \dots, c_k\}$
where $c_i$ is a condition and $k > 0$.

A request trace $r$ satisfies a pattern $P$, denoted as $r \lhd P$, if $r \lhd c_i$ for every $c_i \in P$.
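The notation can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names `Condition`, `satisfies_condition` and `satisfies_pattern` are hypothetical, a request trace is assumed to be a tuple whose last element is the latency L, and a pattern is assumed to hold only when all of its conditions hold (the AND semantics of the getRecommended/getCart example).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Condition:
    """A condition c = <j, e_min, e_max> on the execution time of RPC j."""
    j: int
    e_min: float
    e_max: float


def satisfies_condition(r: Sequence[float], c: Condition) -> bool:
    """True if e_min <= e_j < e_max, where r = (e_0, ..., e_m, L)."""
    return c.e_min <= r[c.j] < c.e_max


def satisfies_pattern(r: Sequence[float], pattern: Sequence[Condition]) -> bool:
    """True if r satisfies every condition of the pattern (assumed AND semantics)."""
    return all(satisfies_condition(r, c) for c in pattern)


# Hypothetical traces (e_0, e_1, e_2, L), all values in milliseconds.
slow_trace = (12.0, 25.0, 18.0, 70.0)
fast_trace = (12.0, 25.0, 9.0, 55.0)

# "RPC 1 takes at least 20 ms AND RPC 2 takes at least 15 ms"
pattern = [Condition(1, 20.0, float("inf")), Condition(2, 15.0, float("inf"))]

slow_matches = satisfies_pattern(slow_trace, pattern)  # both conditions hold
fast_matches = satisfies_pattern(fast_trace, pattern)  # RPC 2 is below 15 ms
```

Note that the upper bound is exclusive, so an open-ended condition such as "execution time is > 30ms" can be modelled with an infinite `e_max`.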

SLIDE 10

Formal notation

The latency interval considered as degraded is denoted as $I$.

SLIDE 11

Precision and recall

SLIDE 12

Precision and recall

SLIDE 13

F-score
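As a rough sketch of how these quality measures could be computed for a pattern, assume the standard definitions: precision is the fraction of requests matching the pattern whose latency falls in the degraded interval, recall is the fraction of degraded requests that match the pattern, and the F-score is their harmonic mean. The function name, the half-open interval and the sample data are assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple

Trace = Sequence[float]  # (e_0, ..., e_m, L); the last element is the latency


def pattern_scores(
    traces: Sequence[Trace],
    matches: Callable[[Trace], bool],
    interval: Tuple[float, float],
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) of a pattern w.r.t. a latency interval."""
    lo, hi = interval
    tp = fp = fn = 0
    for r in traces:
        degraded = lo <= r[-1] < hi  # assumed half-open interval
        matched = matches(r)
        if matched and degraded:
            tp += 1
        elif matched:
            fp += 1
        elif degraded:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical traces (e_0, e_1, L) and the pattern "e_1 >= 235".
traces = [(300, 220, 490), (330, 250, 530), (320, 235, 495),
          (350, 230, 510), (340, 240, 515)]
precision, recall, f_score = pattern_scores(
    traces, lambda r: r[1] >= 235, (500, 600))
```

With this data, two of the three matching requests are degraded and two of the three degraded requests match, so all three measures come out at 2/3.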

SLIDE 14

Sub-interval analysis

Figure: a latency sub-interval delimited by $s_{min}$ and $s_{max}$

SLIDE 15

Splitting the interval

We split the latency range with a set of potential split points.
Split points are derived using Mean Shift (Comaniciu and Meer, 2002).

Dorin Comaniciu and Peter Meer. 2002. Mean Shift: A Robust Approach Toward Feature Space Analysis. IEEE Trans. Pattern Anal. Mach. Intell. 24, 5 (May 2002), 603–619. DOI:https://doi.org/10.1109/34.1000236
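A simplified way to picture this step: run a one-dimensional mean shift over the observed latencies to find their modes, then place candidate split points between neighbouring modes. The sketch below uses a flat kernel and a hypothetical `bandwidth`; the rule that turns modes into split points (midpoints between adjacent modes) is an assumption made for illustration, not something stated on the slide.

```python
from typing import List, Sequence


def mean_shift_modes(values: Sequence[float], bandwidth: float,
                     max_iter: int = 100, tol: float = 1e-6) -> List[float]:
    """Find the modes of 1-D data with a flat-kernel mean shift."""
    modes: List[float] = []
    for start in values:
        x = float(start)
        for _ in range(max_iter):
            # Shift x to the mean of all points within the bandwidth.
            window = [v for v in values if abs(v - x) <= bandwidth]
            shifted = sum(window) / len(window)
            if abs(shifted - x) < tol:
                break
            x = shifted
        # Keep the mode only if no close mode was already found.
        if not any(abs(x - m) <= bandwidth / 2 for m in modes):
            modes.append(x)
    return sorted(modes)


def split_points(modes: Sequence[float]) -> List[float]:
    """Assumed rule: one split point midway between adjacent modes."""
    return [(a + b) / 2 for a, b in zip(modes, modes[1:])]


# Hypothetical request latencies (ms) forming three groups.
latencies = [100, 102, 98, 101, 150, 152, 149, 151, 220, 222]
modes = mean_shift_modes(latencies, bandwidth=10)
points = split_points(modes)
```

Here the three groups yield three modes and therefore two candidate split points; a larger bandwidth would merge groups and produce fewer candidates.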

SLIDE 16

Optimal split

Find subset of split points such that

SLIDE 17

Optimization Problem

Main problem
Sub-problem

SLIDE 18

Main Problem: Dynamic Programming approach

(Krushevskaja and Sandler, 2013)

Darja Krushevskaja and Mark Sandler. 2013. Understanding latency variations of black box services. In Proceedings of the 22nd international conference on World Wide Web (WWW ’13). Association for Computing Machinery, New York, NY, USA, 703–714. DOI:https://doi.org/10.1145/2488388.2488450
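To give an idea of how dynamic programming can select a subset of split points, the sketch below assumes the objective is a sum of per-interval scores: each candidate sub-interval gets a score (for example, the quality of the best pattern found for it by the sub-problem), and the goal is the sequence of consecutive sub-intervals with the highest total. The function `best_partition`, the additive objective and the toy scores are assumptions, not the formulation of Krushevskaja and Sandler.

```python
from typing import Callable, List, Sequence, Tuple


def best_partition(
    points: Sequence[float],
    score: Callable[[float, float], float],
) -> Tuple[float, List[Tuple[float, float]]]:
    """Pick consecutive sub-intervals over sorted split points with maximal total score."""
    n = len(points)
    best = [0.0] * n  # best[i]: best total score for the range points[0]..points[i]
    prev = [0] * n    # prev[i]: start index of the last interval in that solution
    for i in range(1, n):
        best[i] = float("-inf")
        for j in range(i):
            candidate = best[j] + score(points[j], points[i])
            if candidate > best[i]:
                best[i], prev[i] = candidate, j
    # Walk back through prev to recover the chosen intervals.
    intervals = []
    i = n - 1
    while i > 0:
        intervals.append((points[prev[i]], points[i]))
        i = prev[i]
    return best[-1], intervals[::-1]


# Toy scores for every candidate sub-interval over four split points.
toy_scores = {(0, 1): 0.2, (1, 2): 0.3, (2, 3): 0.9,
              (0, 2): 0.8, (1, 3): 0.7, (0, 3): 1.0}
total, chosen = best_partition([0, 1, 2, 3], lambda a, b: toy_scores[(a, b)])
```

In this toy case the best solution drops split point 1 and keeps split point 2, giving the intervals (0, 2) and (2, 3). The table has O(n) entries and each entry looks at O(n) predecessors, so the cost is dominated by the O(n²) calls to `score`.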

SLIDE 19

Sub-problem: Genetic Algorithm

Representation: a pattern is a sequence of conditions $c_0, c_1, \dots, c_k$; each condition is a triple $(j, e_{min}, e_{max})$
Fitness
Mutation: randomly add/remove/change a condition
Crossover: merge $P_1$ and $P_2$ in $P_3 = P_1 \cup P_2$, then randomly split $P_3$ in $P_1'$ and $P_2'$
Evolution strategy: $(\mu + \lambda)$ evolution strategy [1]

[1] Hans-Georg Beyer and Hans-Paul Schwefel. 2002. Evolution strategies – A comprehensive introduction. Natural Computing: an international journal 1, 1 (May 2002), 3–52. DOI:https://doi.org/10.1023/A:1015059928466
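The mutation and crossover operators can be sketched as below. This covers only the two variation operators, not the fitness function or the evolution strategy loop. All names (`random_condition`, `mutate`, `crossover`), the way new conditions are sampled, and the choice to keep patterns non-empty are assumptions made for illustration.

```python
import random
from typing import List, Tuple

Condition = Tuple[int, float, float]  # (j, e_min, e_max)
Pattern = List[Condition]


def random_condition(n_rpcs: int, max_time: float, rng: random.Random) -> Condition:
    """Draw a hypothetical condition with e_min <= e_max on a random RPC."""
    lo, hi = sorted(rng.uniform(0.0, max_time) for _ in range(2))
    return (rng.randrange(n_rpcs), lo, hi)


def mutate(pattern: Pattern, n_rpcs: int, max_time: float,
           rng: random.Random) -> Pattern:
    """Randomly add, remove or change one condition."""
    child = list(pattern)
    action = rng.choice(["add", "remove", "change"])
    if action == "add" or not child:
        child.append(random_condition(n_rpcs, max_time, rng))
    elif action == "remove" and len(child) > 1:
        child.pop(rng.randrange(len(child)))
    else:
        # "change", or "remove" on a single-condition pattern (kept non-empty).
        child[rng.randrange(len(child))] = random_condition(n_rpcs, max_time, rng)
    return child


def crossover(p1: Pattern, p2: Pattern,
              rng: random.Random) -> Tuple[Pattern, Pattern]:
    """Merge p1 and p2 into their union, then randomly split it into two children."""
    merged = list(dict.fromkeys(p1 + p2))  # union without duplicates
    rng.shuffle(merged)
    # Cut so that both children are non-empty whenever possible.
    cut = rng.randint(1, len(merged) - 1) if len(merged) > 1 else len(merged)
    return merged[:cut], merged[cut:]


rng = random.Random(0)
parent_1 = [(0, 30.0, 100.0)]
parent_2 = [(1, 20.0, 100.0), (2, 15.0, 100.0)]
child_1, child_2 = crossover(parent_1, parent_2, rng)
mutated = mutate(parent_1, n_rpcs=3, max_time=100.0, rng=rng)
```

Because crossover only redistributes existing conditions, new thresholds enter the population through mutation alone.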

SLIDE 20

Fitness evaluation

RPC1  RPC2  …  RPCn  L
300   220   …  120   490
330   250   …  125   530
…     …     …  …     …
320   235   …  140   495
350   230   …  130   500

Performance-critical operation: checking a set of inequalities

SLIDE 21

Optimizing fitness evaluation

Intuition:

Same checks are repeated several times during the evolution process

Our Solution:

Meaningful checks are computed and stored upfront and then reused during the evolution process

SLIDE 22

Precomputation

RPC1  RPC2  RPC3  L
300   220   120   490
330   250   125   530
320   235   140   495
350   230   130   510
340   240   125   515

$(s_{min}, s_{max}) = (500, 600)$

Example check: RPC2 execution time ≥ 235

RPC2 positives: 220 235 240 → False True True → 011
RPC2 negatives: 250 230 → True False → 10

KEYS          VALUES
<RPC1, 223>   …
<RPC2, 235>   <011, 10>
<RPC2, 300>   …
…             …
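The precomputation step can be illustrated with a few lines of Python that rebuild the `<RPC2, 235>` entry of the example. The function names are hypothetical, and it is assumed here that "positives" are the requests whose latency falls in the interval under analysis and "negatives" are all the others.

```python
from typing import Dict, Sequence, Tuple

# RPC2 execution times taken from the example above.
rpc2_positives = [220, 235, 240]
rpc2_negatives = [250, 230]


def bitstring(times: Sequence[float], threshold: float) -> str:
    """One bit per request: '1' if its execution time is >= threshold."""
    return "".join("1" if t >= threshold else "0" for t in times)


def precompute(
    rpc: str,
    positives: Sequence[float],
    negatives: Sequence[float],
    thresholds: Sequence[float],
) -> Dict[Tuple[str, float], Tuple[str, str]]:
    """Evaluate each check '<rpc> >= threshold' once and store both bitstrings."""
    return {
        (rpc, th): (bitstring(positives, th), bitstring(negatives, th))
        for th in thresholds
    }


table = precompute("RPC2", rpc2_positives, rpc2_negatives, thresholds=[235, 300])
entry = table[("RPC2", 235)]
```

The entry for `("RPC2", 235)` is the pair `("011", "10")` shown in the lookup table; once stored, the check never has to be recomputed during the search.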

SLIDE 23

Fast fitness evaluation using bitwise operators

KEYS                           VALUES
$\langle j, e_{min} \rangle$   $\langle B_{min}^{pos}, B_{min}^{neg} \rangle$
$\langle j, e_{max} \rangle$   $\langle B_{max}^{pos}, B_{max}^{neg} \rangle$
…                              …

$P = \{c_0, c_1, \dots, c_k\}$
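A possible reading of this lookup scheme: if each key `(j, threshold)` stores the bits of the check "execution time of RPC j ≥ threshold", then a condition `e_min <= e_j < e_max` is the bits stored for `e_min` AND NOT the bits stored for `e_max`, and a whole pattern is the AND over its conditions. The sketch below packs the bits into Python integers; the function names and the sample traces are hypothetical, and this is an illustration rather than the authors' code.

```python
from typing import Dict, Mapping, Sequence, Tuple

Key = Tuple[int, float]                # (j, threshold)
Bits = Tuple[int, int]                 # (bits over positives, bits over negatives)
Condition = Tuple[int, float, float]   # (j, e_min, e_max)


def to_bits(flags: Sequence[bool]) -> int:
    """Pack booleans into an int, first flag in the most significant bit."""
    value = 0
    for flag in flags:
        value = (value << 1) | int(flag)
    return value


def build_table(
    positives: Sequence[Sequence[float]],
    negatives: Sequence[Sequence[float]],
    thresholds: Mapping[int, Sequence[float]],
) -> Dict[Key, Bits]:
    """Precompute 'e_j >= threshold' for every RPC j and candidate threshold."""
    table: Dict[Key, Bits] = {}
    for j, candidates in thresholds.items():
        for th in candidates:
            table[(j, th)] = (to_bits([r[j] >= th for r in positives]),
                              to_bits([r[j] >= th for r in negatives]))
    return table


def count_matches(pattern: Sequence[Condition], table: Mapping[Key, Bits],
                  n_pos: int, n_neg: int) -> Tuple[int, int]:
    """Return (true positives, false positives) using only bitwise operations."""
    pos_bits = (1 << n_pos) - 1  # start with every request selected
    neg_bits = (1 << n_neg) - 1
    for j, e_min, e_max in pattern:
        min_pos, min_neg = table[(j, e_min)]
        max_pos, max_neg = table[(j, e_max)]
        # e_min <= e_j < e_max  ==  (e_j >= e_min) AND NOT (e_j >= e_max)
        pos_bits &= min_pos & ~max_pos
        neg_bits &= min_neg & ~max_neg
    return bin(pos_bits).count("1"), bin(neg_bits).count("1")


# Hypothetical traces (RPC1, RPC2); the RPC2 values follow the earlier example.
positives = [(300, 220), (320, 235), (340, 240)]
negatives = [(330, 250), (350, 230)]
table = build_table(positives, negatives, {0: [310, 400], 1: [235, 245]})
tp, fp = count_matches([(0, 310, 400), (1, 235, 245)], table, 3, 2)
```

From `tp`, `fp` and the number of positives, precision and recall follow directly (`tp / (tp + fp)` and `tp / n_pos`), so evaluating a pattern touches no raw trace data at all.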

SLIDE 24

Search Space Reduction

Problem:

Tremendous precomputing effort

Solution:

Consider only meaningful RPC execution times

  • Mean Shift algorithm (Comaniciu and Meer, 2002)

Dorin Comaniciu and Peter Meer. 2002. Mean Shift: A Robust Approach Toward Feature Space Analysis. IEEE Trans. Pattern Anal. Mach. Intell. 24, 5 (May 2002), 603–619. DOI:https://doi.org/10.1109/34.1000236

SLIDE 25

Research Questions

RQ1 Is our approach effective for clustering requests associated with the same latency degradation pattern, as compared to machine learning algorithms?
RQ2 Is our approach effective with respect to state-of-the-art approaches for latency profile analysis?
RQ3 How robust is our approach to "noise"?
RQ4 What is the efficiency of our approach as compared to other ones?

SLIDE 26

Experimental setup

Case study
E-commerce application [1], composed of 9 microservices (Spring Cloud [2]).
Zipkin [3] is used as the distributed tracing solution. Spans are stored on Elasticsearch [4].
The request under analysis involves 13 RPCs (8 unique) across 5 microservices.

Data generation
60 load testing sessions of 5 minutes each, with 2 randomly injected artificial degradations: 30 using normal artificial degradations and 30 using noised artificial degradations.
Each load test session generates ~1000 requests.

Artificial degradation pattern
Normal: inject 50ms in a subset of RPCs for 10% of traffic
Noised: inject 50ms with some noise in a subset of RPCs for 10% of traffic + inject 50ms to a portion of async RPCs

[1] https://github.com/SEALABQualityGroup/E-Shopper
[2] https://spring.io/projects/spring-cloud
[3] https://zipkin.io/
[4] https://www.elastic.co/elasticsearch/
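The artificial degradations can be mimicked offline on synthetic traces, which may help when experimenting with the approach without a running deployment. The sketch below is a simulation only, not the injection mechanism used in the experiments: the function name, the Gaussian noise model (`noise_sd`) and the assumption that the injected delay adds directly to the request latency are all hypothetical.

```python
import random
from typing import List, Sequence


def inject_degradation(
    traces: Sequence[Sequence[float]],
    rpcs: Sequence[int],
    delay_ms: float = 50.0,
    fraction: float = 0.10,
    noise_sd: float = 0.0,
    seed: int = 0,
) -> List[List[float]]:
    """Add delay_ms (plus optional Gaussian noise) to the given RPCs
    for a random fraction of the traces (e_0, ..., e_m, L)."""
    rng = random.Random(seed)
    degraded = []
    for trace in traces:
        row = list(trace)
        if rng.random() < fraction:
            for j in rpcs:
                if noise_sd > 0:
                    extra = max(0.0, rng.gauss(delay_ms, noise_sd))
                else:
                    extra = delay_ms
                row[j] += extra
                row[-1] += extra  # assumption: the delay adds to the total latency
        degraded.append(row)
    return degraded


# 1000 identical baseline requests: (e_0, e_1, L) in milliseconds.
baseline = [[10.0, 20.0, 30.0] for _ in range(1000)]
normal = inject_degradation(baseline, rpcs=[0])
noised = inject_degradation(baseline, rpcs=[0], noise_sd=5.0)
affected = [r for r in normal if r[0] != 10.0]
```

With `fraction=0.10`, roughly one request in ten ends up in `affected`, each with 50 ms added to RPC 0 and to the overall latency.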

SLIDE 27

Baselines

  • K-means (MacQueen, 1967)
  • Hierarchical (Rokach and Maimon, 2005)
  • Mean shift (Comaniciu and Meer, 2002)
  • Branch and bound (Krushevskaja and Sandler, 2013)

J. MacQueen. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics. University of California Press, Berkeley, Calif., 281–297.
Lior Rokach and Oded Maimon. 2005. Clustering Methods. Springer US, Boston, MA, 321–352. https://doi.org/10.1007/0-387-25465-X_15
Dorin Comaniciu and Peter Meer. 2002. Mean Shift: A Robust Approach Toward Feature Space Analysis. IEEE Trans. Pattern Anal. Mach. Intell. 24, 5 (May 2002), 603–619. DOI:https://doi.org/10.1109/34.1000236
Darja Krushevskaja and Mark Sandler. 2013. Understanding latency variations of black box services. In Proceedings of the 22nd international conference on World Wide Web (WWW ’13). Association for Computing Machinery, New York, NY, USA, 703–714. DOI:https://doi.org/10.1145/2488388.2488450

SLIDE 28

Results: effectiveness

Common clustering approaches are outperformed by our approach in noised experiments. -RQ1-
Our approach outperforms the branch and bound approach in both normal and noised experiments (p < 0.05). -RQ2-
Both domain-specific approaches show resiliency to noise. -RQ3-

SLIDE 29

Results: efficiency

Machine learning methods are faster than optimization-based approaches.
Noise reduces the efficiency of both combinatorial methods.
Our approach is faster than the state-of-the-art approach in both normal and noised experiments. -RQ4-

SLIDE 30

Conclusion and future works

Results show that our approach outperforms existing approaches, especially when RPC execution time is not very regular.
We plan to:
  • investigate in depth the application of our approach to other, ever more chaotic distributed systems
  • thoroughly study and improve the scalability of the approach
  • generalize the approach to trace attributes other than RPC execution time