Generic and Robust Localization of Multi-Dimensional Root Causes




slide-1
SLIDE 1

Generic and Robust Localization of Multi-Dimensional Root Causes

Zeyan Li, Chengyang Luo, Yiwei Zhao, Yongqian Sun, Kaixin Sui, Xiping Wang, Dapeng Liu, Xing Jin, Qi Wang , Dan Pei

ISSRE 2019

slide-2
SLIDE 2

Outline

Background Methodology Experiment Summary


slide-3
SLIDE 3

Outline

Background Methodology Experiment Summary


slide-4
SLIDE 4

Background

  • KPI: key performance indicator

[Figure: #Orders plotted over time]

An anomaly happens, and we need to find the root cause.

slide-5
SLIDE 5

Motivation

Raw log for an order:

Timestamp        | Province | ISP          | Device | ...
2019.10.15 13:04 | Beijing  | China Mobile | PC     | ...

[Figure: total #Orders broken down by Province (Beijing, Shanghai, Guangdong), ISP (China Mobile, China Unicom), and Device (PC, Cellphone), and by combinations such as Beijing & China Mobile, Shanghai & China Mobile, and Beijing & China Unicom]

slide-6
SLIDE 6

Multi-dimensional Data

Dimensions: Province, ISP, Device

Cuboid Province: Beijing, Shanghai, Guangdong

  • Cuboid: a way to slice the multi-dimensional data
  • Attribute combination: an element of a cuboid

slide-7
SLIDE 7

Multi-dimensional Data

Dimensions: Province, ISP, Device

Cuboid ISP: China Mobile, China Unicom, China Telecom

  • Cuboid: a way to slice the multi-dimensional data
  • Attribute combination: an element of a cuboid

slide-8
SLIDE 8

Multi-dimensional Data

Dimensions: Province, ISP, Device

Cuboid Province & ISP: Beijing & China Mobile, Beijing & China Unicom, Shanghai & China Mobile, ...

  • Cuboid: a way to slice the multi-dimensional data
  • Attribute combination: an element of a cuboid

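The cuboid and attribute-combination definitions can be sketched in a few lines of Python. This is only an illustration: the attribute values are the ones named on the slides, and the helper names `cuboids` and `attribute_combinations` are hypothetical, not taken from the Squeeze implementation.

```python
from itertools import combinations, product

# Attribute values as named on the slides (illustrative, not a real dataset).
dimensions = {
    "Province": ["Beijing", "Shanghai", "Guangdong"],
    "ISP": ["China Mobile", "China Unicom"],
    "Device": ["PC", "Cellphone"],
}

def cuboids(dims):
    """Yield every non-empty subset of dimension names, coarse to fine."""
    names = list(dims)
    for size in range(1, len(names) + 1):
        yield from combinations(names, size)

def attribute_combinations(dims, cuboid):
    """List the attribute combinations (elements) of one cuboid."""
    return list(product(*(dims[name] for name in cuboid)))

all_cuboids = list(cuboids(dimensions))
province_isp = attribute_combinations(dimensions, ("Province", "ISP"))
```

With three dimensions there are 7 cuboids, and the cuboid Province & ISP here holds 3 x 2 = 6 attribute combinations, including ("Beijing", "China Mobile").
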
slide-9
SLIDE 9

Problem Statement

The KPI of the whole cube is abnormal, but where is the root cause?

A root cause is a set of attribute combinations.

[Figure: data cube over Province, ISP, and Device, labeled "Potential Root Causes"]

slide-10
SLIDE 10

Challenge: Huge Search Space

Root Cause: a set of attribute combinations

How many potential root causes are there for simple 2-d data?

$2^{2+7+14} - 1$
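To get a feel for the size of this search space, here is a minimal sketch that counts candidate root causes. It assumes the slide's 2-d example has one dimension with 2 values and one with 7 values (giving cuboids of 2, 7, and 14 attribute combinations), and that any non-empty set of attribute combinations is a candidate; the function names are hypothetical.

```python
from math import prod

def count_attribute_combinations(sizes):
    """Attribute combinations over all cuboids: each dimension is either
    fixed to one of its values or left unspecified; exclude the case
    where every dimension is unspecified."""
    return prod(n + 1 for n in sizes) - 1

def count_candidate_root_causes(sizes):
    """Non-empty sets of attribute combinations."""
    return 2 ** count_attribute_combinations(sizes) - 1

n_combinations = count_attribute_combinations([2, 7])  # 2 + 7 + 14
n_candidates = count_candidate_root_causes([2, 7])
```

Under these assumptions there are 23 attribute combinations and 8,388,607 candidate root causes, and the count doubles with every additional attribute combination.
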

slide-11
SLIDE 11

Previous Approaches

Algorithm                                  | Root Cause Assumption
Adtributor (NSDI, 2014)                    | single attribute
Recursive Adtributor (Master Thesis, 2018) | none
iDice (ICSE, 2016)                         | one or two attribute combinations
Apriori (TON, 2017)                        | none
HotSpot (IEEE Access, 2018)                | all attribute combinations of the root cause in one cuboid
Squeeze (ISSRE, 2019)                      | those which cause the same changes are in one cuboid

[Figure panels: Adtributor, iDice]

slide-12
SLIDE 12

Previous Approaches

Algorithm                                  | Measure
Adtributor (NSDI, 2014)                    | fundamental & derived (quotient)
Recursive Adtributor (Master Thesis, 2018) | fundamental & derived (quotient)
iDice (ICSE, 2016)                         | fundamental only
Apriori (TON, 2017)                        | fundamental & derived
HotSpot (IEEE Access, 2018)                | fundamental only
Squeeze (ISSRE, 2019)                      | fundamental & derived (quotient, product)

[Figure: #Orders (fundamental, additive) and Success Rate in % (derived, not additive), each shown for China Mobile, China Unicom, and Total]

iDice and HotSpot rely on additivity and thus cannot handle derived measures.
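A small worked example shows why a derived measure such as success rate is not additive. The counts below are made up purely for illustration.

```python
# Hypothetical per-ISP counts, for illustration only.
orders = {"China Mobile": 100, "China Unicom": 300}
successes = {"China Mobile": 90, "China Unicom": 150}

# Fundamental measure: the total is simply the sum of the parts.
total_orders = sum(orders.values())

# Derived measure (quotient): per-ISP success rates.
rates = {isp: successes[isp] / orders[isp] for isp in orders}

# The overall rate must be recomputed from the fundamental measures;
# it is not the sum of the per-ISP rates.
total_rate = sum(successes.values()) / total_orders
sum_of_rates = sum(rates.values())
```

Here the per-ISP rates are 0.9 and 0.5, which sum to 1.4, while the overall success rate is 240 / 400 = 0.6.
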

slide-13
SLIDE 13

Previous Approaches

Algorithm                                  | Change Magnitude
Adtributor (NSDI, 2014)                    | significant
Recursive Adtributor (Master Thesis, 2018) | significant
iDice (ICSE, 2016)                         | significant
Apriori (TON, 2017)                        | any
HotSpot (IEEE Access, 2018)                | significant
Squeeze (ISSRE, 2019)                      | any

[Figure: Beijing, Shanghai, Guangdong, with changes labeled significant and insignificant]

slide-14
SLIDE 14

Previous Approaches

Algorithm                                  | Parameter Fine Tuning
Adtributor (NSDI, 2014)                    | no
Recursive Adtributor (Master Thesis, 2018) | yes
iDice (ICSE, 2016)                         | no
Apriori (TON, 2017)                        | yes
HotSpot (IEEE Access, 2018)                | no
Squeeze (ISSRE, 2019)                      | no

Some approaches perform badly without parameter fine tuning

slide-15
SLIDE 15

Previous Approaches

Algorithm                                  | Time Cost
Adtributor (NSDI, 2014)                    | very short
Recursive Adtributor (Master Thesis, 2018) | short
iDice (ICSE, 2016)                         | very short
Apriori (TON, 2017)                        | always too long
HotSpot (IEEE Access, 2018)                | sometimes long
Squeeze (ISSRE, 2019)                      | short

Some approaches cost too much time

slide-16
SLIDE 16

Previous Approaches

Algorithm | Root Cause Assumption | Measure | Change Magnitude | Parameter Fine Tuning | Time Cost
Adtributor (NSDI, 2014) | single attribute | fundamental & derived (quotient) | significant | no | very short
Recursive Adtributor (Master Thesis, 2018) | none | fundamental & derived (quotient) | significant | yes | short
iDice (ICSE, 2016) | one or two attribute combinations | fundamental only | significant | no | very short
Apriori (TON, 2017) | none | fundamental & derived | any | yes | always too long
HotSpot (IEEE Access, 2018) | all attribute combinations of the root cause in one cuboid | fundamental only | significant | no | sometimes long
Squeeze (ISSRE, 2019) | those which cause the same changes are in one cuboid | fundamental & derived (quotient, product) | any | no | short

slide-17
SLIDE 17

Design Goals

Design goal           | Squeeze
Root Cause Assumption | has no impractical assumptions
Measure               | handles both fundamental and derived measures
Change Magnitude      | handles anomalies with any change magnitude
Parameter Fine Tuning | does not need parameter fine tuning
Time Cost             | is consistently fast in all cases

slide-18
SLIDE 18

Outline

Background Methodology Experiment Summary


slide-19
SLIDE 19

Core Idea: Generalized Ripple Effect (GRE)

[Figure: Beijing, Shanghai, Guangdong; the root cause is Beijing, which causes ripples in Beijing & China Mobile and Beijing & China Unicom; values shown: 10, 20, 5, 10]

Building on an idea from HotSpot [IEEE Access 2018], we propose the generalized ripple effect.

slide-20
SLIDE 20

Core Idea: GRE & Deviation Score

[Figure: Beijing, Shanghai, Guangdong; Beijing & China Mobile, Beijing & China Unicom]

real value: v
forecast value: f

$\text{deviation score} = 2\,\frac{f - v}{f + v}$

$f = 30,\ v = 15,\ ds = \frac{2}{3}$
$f = 20,\ v = 10,\ ds = \frac{2}{3}$
$f = 10,\ v = 5,\ ds = \frac{2}{3}$

These should be in the same bin.

[Figure: deviation score PDF]

slide-21
SLIDE 21

Core Idea: GRE in Real World Cases

The number of successful orders drops after an update.

By manual analysis, the root cause is ServiceType=020020.

The deviation scores of the affected attribute combinations fall in the same bin, which supports GRE.

slide-22
SLIDE 22

Core Idea: GRE in Real World Cases

Case 2

The number of successful orders drops.

4 root-cause attribute combinations

The data show that the deviation scores of attribute combinations affected by the same root cause fall in the same bin.

slide-23
SLIDE 23

Generalized Ripple Effect

Does GRE hold for both fundamental and derived measures?

  • Yes. Please see the details in the paper.
slide-24
SLIDE 24

Core Idea: Generalized Potential Score

Evaluates how likely a set of attribute combinations is to be the root cause

slide-25
SLIDE 25

Core Idea: Generalized Potential Score

→ forecast value and real value should be close
→ $f(S_2) - v(S_2) \approx 0$
→ KPI value should be as expected by GRE
→ $\frac{v(\text{Beijing})}{f(\text{Beijing})} = 0.5$, half fails
→ $a(\text{Beijing}, \text{China Mobile}) = f(\text{Beijing}, \text{China Mobile}) \times 0.5 = 5$
→ $a(\text{Beijing}, \text{China Unicom}) = f(\text{Beijing}, \text{China Unicom}) \times 0.5 = 10$

normalization
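The expected values on this slide follow from scaling each forecast by the ratio of real to forecast value of the root cause. A minimal sketch, assuming forecasts of 10 and 20 for the two Beijing leaves (inferred from the slide's results 5 and 10 at a ratio of 0.5) and real values of 5 and 10; the function name `expected_values` is hypothetical.

```python
def expected_values(forecast, real):
    """Expected value a(e) = f(e) * v(S) / f(S) for every leaf e under
    a candidate root cause S, where f(S) and v(S) are sums over the leaves."""
    total_forecast = sum(forecast.values())
    total_real = sum(real[leaf] for leaf in forecast)
    ratio = total_real / total_forecast
    return {leaf: f * ratio for leaf, f in forecast.items()}

# Assumed inputs, chosen to match the numbers on the slide.
forecast = {("Beijing", "China Mobile"): 10, ("Beijing", "China Unicom"): 20}
real = {("Beijing", "China Mobile"): 5, ("Beijing", "China Unicom"): 10}

expected = expected_values(forecast, real)
```

Here v(Beijing) / f(Beijing) = 15 / 30 = 0.5, so the expected values are 5 and 10. If the real values of the leaves match these expected values, the candidate is consistent with GRE.
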

slide-26
SLIDE 26

Overall Architecture


Squeeze

slide-27
SLIDE 27

Squeeze

  • Bottom to Top: clustering of leaf attribute combinations
  • Top to Bottom: search in each cluster
  • Output: root causes

slide-28
SLIDE 28

Clustering


slide-29
SLIDE 29

Clustering

Find attribute combinations affected by the same root cause
=> Find attribute combinations that have similar deviation scores

  • local maxima: centroids
  • local minima: boundaries
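One way to picture this clustering step is a one-dimensional histogram over deviation scores, where each rise in density after a valley starts a new cluster. The sketch below is a simplified stand-in, not the clustering used in Squeeze: the bin count, the score range [-2, 2], and the function name are assumptions.

```python
def cluster_deviation_scores(scores, bins=20, lo=-2.0, hi=2.0):
    """Label each score with a cluster index using a 1-D histogram."""
    width = (hi - lo) / bins
    bin_of = [min(bins - 1, max(0, int((s - lo) / width))) for s in scores]
    counts = [0] * bins
    for b in bin_of:
        counts[b] += 1

    cluster_of_bin = [None] * bins
    cluster, prev, after_valley = -1, 0, True
    for b, count in enumerate(counts):
        if count > prev and after_valley:
            cluster += 1           # density rises again: a new mode begins
            after_valley = False
        elif count < prev:
            after_valley = True    # density falls: the next rise is a new cluster
        if count > 0:
            cluster_of_bin[b] = cluster
        prev = count
    return [cluster_of_bin[b] for b in bin_of]

# Three leaves close to their forecast, three leaves with deviation score 2/3.
labels = cluster_deviation_scores([0.01, -0.03, 0.05, 2 / 3, 2 / 3, 2 / 3])
```

The normal leaves end up in cluster 0 and the three leaves with score 2/3 in cluster 1. A real implementation would need smoothing to cope with noisy histograms.
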

slide-30
SLIDE 30

Localize in Each Cluster


slide-31
SLIDE 31

Localize in Cluster

[Figure: a cluster on the grid of Province (Beijing, Shanghai) and ISP (CM, CU); cuboids Province, ISP, and Province & ISP; attribute combinations annotated with ratios 2/2, 0/2, 0/2, ...]

Cuboid Province:

  • Sorted list: Beijing, Shanghai, ......
  • Take the top-K items in this list with the highest GPS
  • Beijing, GPS = 1 => root cause
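The search inside one cluster can be sketched as follows: for each cuboid, rank its attribute combinations by the share of their leaves that lie in the cluster (the 2/2 and 0/2 ratios above), then score every top-K prefix of that list. The scoring function here is a Jaccard overlap used as a stand-in, since the slides do not give the GPS formula; all names are hypothetical.

```python
from itertools import combinations

def jaccard(covered, cluster):
    """Stand-in score (NOT the paper's GPS): overlap of covered leaves and cluster."""
    union = covered | cluster
    return len(covered & cluster) / len(union) if union else 0.0

def localize_in_cluster(leaves, cluster, dims, score=jaccard):
    """Return (best score, (cuboid, attribute combinations)) for one cluster."""
    best_score, best = float("-inf"), None
    for size in range(1, len(dims) + 1):             # coarse cuboids first
        for cuboid in combinations(range(len(dims)), size):
            groups = {}
            for leaf in leaves:
                key = tuple(leaf[i] for i in cuboid)
                groups.setdefault(key, set()).add(leaf)
            # Rank by the share of each combination's leaves inside the cluster.
            ranked = sorted(
                groups,
                key=lambda k: len(groups[k] & cluster) / len(groups[k]),
                reverse=True,
            )
            covered = set()
            for k, key in enumerate(ranked, start=1):
                covered |= groups[key]
                s = score(covered, cluster)
                if s > best_score:                   # ties keep the coarser cuboid
                    best_score = s
                    best = (tuple(dims[i] for i in cuboid), ranked[:k])
    return best_score, best

leaves = [(p, i) for p in ("Beijing", "Shanghai") for i in ("CM", "CU")]
cluster = {("Beijing", "CM"), ("Beijing", "CU")}
result = localize_in_cluster(leaves, cluster, ("Province", "ISP"))
```

For this toy cluster the search returns Beijing in cuboid Province with a score of 1, mirroring the slide's "Beijing, GPS = 1" outcome. Because only a strictly better score replaces the current best, the coarser cuboid wins ties against Province & ISP.
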

slide-32
SLIDE 32

Outline

Background Methodology Experiment Summary


slide-33
SLIDE 33

Experiment Setup

We use

  • real KPI datasets from 2 companies;
  • synthetic anomalies => 7 semi-synthetic datasets
  • Moving average as the forecasting algorithm.
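For reference, a moving-average forecast can be sketched as below. The window size of 3 and the sample series are assumptions for illustration; the slides do not state the settings used in the experiments.

```python
def moving_average_forecast(series, window=3):
    """Forecast each point as the mean of up to `window` preceding points.
    The first point has no history, so its forecast is None."""
    if window <= 0:
        raise ValueError("window must be positive")
    forecasts = []
    for t in range(len(series)):
        history = series[max(0, t - window):t]
        forecasts.append(sum(history) / len(history) if history else None)
    return forecasts

# Hypothetical #Orders series whose last point drops sharply.
series = [10, 12, 14, 13, 6]
forecasts = moving_average_forecast(series)
```

The last forecast is 13.0 while the real value is 6; this gap between f and v is what the deviation score measures.
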


slide-34
SLIDE 34

Effectiveness

Squeeze achieves relatively good F1-score on both fundamental & derived measures.

[Figure panels: two of the fundamental measure datasets; the derived measure dataset]
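As a reminder of the metric, an F1-score over sets of attribute combinations can be computed as sketched below. It assumes a predicted attribute combination counts as correct only on an exact match; how the paper aggregates scores across cases is not stated on the slides.

```python
def f1_score(predicted, actual):
    """F1 between predicted and actual sets of root-cause attribute combinations."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)

# Hypothetical case: one correct and one spurious attribute combination.
score = f1_score(
    predicted=[("Beijing",), ("Shanghai", "China Mobile")],
    actual=[("Beijing",)],
)
```

Here precision is 0.5 and recall is 1, giving an F1-score of 2/3.
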

slide-35
SLIDE 35

Efficiency

Squeeze is consistently fast: it takes only ten to twenty seconds in all cases.

slide-36
SLIDE 36

Various Anomaly Change Magnitude

Squeeze performs well regardless of the anomaly change magnitude.

0.4% and 12% are the 25th and 75th percentiles of the change magnitudes.

slide-37
SLIDE 37

Various Forecasting Residual

Squeeze performs well under various residuals, and always outperforms others.


Two representative settings by Moving Average

slide-38
SLIDE 38

Outline

Background Methodology Experiment Summary


slide-39
SLIDE 39

Summary

  • Bottom-up & Top-down => Squeeze
  • Contributions:
      ○ Generalized ripple effect
      ○ Squeeze algorithm
      ○ Experimental studies on real-world and semi-synthetic data show that Squeeze is both effective and efficient
  • Future Work:
      ○ focus on numerical attributes
      ○ show GRE for more types of derived measures

slide-40
SLIDE 40

References

  • Ahmed, F., Erman, J., Ge, Z., Liu, A., Wang, J., Yan, H. (2017). Detecting and Localizing End-to-End Performance Degradation for Cellular Data Services Based on TCP Loss Ratio and Round Trip Time. IEEE/ACM Transactions on Networking (TON), 25.
  • Bhagwan, R., Kumar, R., Ramjee, R., Varghese, G., et al. (2014). Adtributor: Revenue Debugging in Advertising Systems. NSDI.
  • Lin, Q., Lou, J., Zhang, H., Zhang, D. (2016). iDice: Problem Identification for Emerging Issues. 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE). https://dx.doi.org/10.1145/2884781.2884795
  • Rudenius, L., Persson, M. (2018). Anomaly Detection and Fault Localization. Master's thesis, Göteborg: Chalmers University of Technology.
  • Sun, Y., Zhao, Y., Su, Y., Liu, D., Nie, X., Meng, Y., Cheng, S., Pei, D., Zhang, S., Qu, X., Guo, X. (2018). HotSpot: Anomaly Localization for Additive KPIs With Multi-Dimensional Attributes. IEEE Access, 6, 10909-10923. https://dx.doi.org/10.1109/ACCESS.2018.2804764

slide-41
SLIDE 41

Thank you. Q&A