Robust and Unsupervised KPI Anomaly Detection Based on Conditional Variational Autoencoder

SLIDE 1

Robust and Unsupervised KPI Anomaly Detection Based on Conditional Variational Autoencoder

Zeyan Li, Wenxiao Chen, Dan Pei

Department of Computer Science and Technology Tsinghua University

November 18, 2018

SLIDE 2

Table of Contents

1 Background
   Problem Formulation
   Previous Work
   Donut and Its Drawback

2 Architecture
   Training
   Detection

3 Experiments
   Evaluation Metric
   Datasets
   Performance

4 Analysis
   Conditional KDE explanation
   Dropout for avoiding overfitting on time information

5 Conclusion

SLIDE 3

Problem Formulation (1/4)

KPI: key performance indicator, e.g., page views, search response time, number of transactions per minute.

Figure: KPI examples.

To ensure undisrupted web-based services, operators need to closely monitor various KPIs, detect anomalies in them, and trigger timely troubleshooting or mitigation. In our work, we focus on business-related KPIs. These KPIs consist of two parts:

SLIDE 4

Problem Formulation (2/4)

1 Seasonal patterns. Business-related KPIs have them because of the influence of user behavior and schedules.

SLIDE 5

Problem Formulation (3/4)

2 Noise. We assume that the noise follows an independent, zero-mean Gaussian distribution.
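The two-part view of a KPI (seasonal pattern plus Gaussian noise) can be illustrated with a small synthetic series. This is only a sketch under assumed values: the daily sine pattern, the one-minute sampling interval, and the noise level `sigma` are hypothetical choices, not taken from the slides.

```python
import numpy as np

def synthetic_kpi(n_days=7, points_per_day=1440, sigma=0.1, seed=0):
    """Seasonal pattern plus independent zero-mean Gaussian noise (illustrative only)."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_days * points_per_day)
    seasonal = np.sin(2 * np.pi * t / points_per_day)   # assumed daily pattern
    noise = rng.normal(0.0, sigma, size=t.shape)         # i.i.d. N(0, sigma^2)
    return seasonal + noise, seasonal

kpi, seasonal = synthetic_kpi()
residual = kpi - seasonal   # what remains after removing the seasonal part
```

The residual should have a mean near zero and a standard deviation near `sigma`, which is the property the noise assumption relies on.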

SLIDE 6

Problem Formulation (4/4)

Anomalies: points that do not follow normal patterns.

Missing points: sometimes the KPI values are not collected; these data points are called missing points. Missing points are also a kind of anomaly, but they are easy to distinguish from normal points.

Abnormal points: missing points and anomalies.

KPI anomaly detection formulation: for any time t, given historical KPI observations $v_{t-W+1:t}$ of length W, determine whether an anomaly happens at time t (denoted by $\gamma_t = 1$).
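As a rough illustration of the formulation, the windows of the last W observations can be cut from a series as follows. The helper name `sliding_windows` and the toy series are assumptions made for this sketch.

```python
import numpy as np

def sliding_windows(values, W):
    """Return an array of shape (len(values) - W + 1, W); row i holds values[i : i + W]."""
    values = np.asarray(values, dtype=float)
    if W < 1 or W > len(values):
        raise ValueError("W must be between 1 and len(values)")
    return np.lib.stride_tricks.sliding_window_view(values, W)

v = np.arange(6.0)                  # toy KPI: 0, 1, 2, 3, 4, 5
windows = sliding_windows(v, W=3)
# The last row is the window a detector would use to decide on the final time step.
```

Each row ends at the time step whose anomaly label is to be decided, so a detector emits one decision per row.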

SLIDE 7

Previous Work (1/1)

Table: Comparison among anomaly detection methodologies

Suffers from          1     2     3     4     5     Bagel
Selecting algorithm   Yes   No    Some  No    No    No
Tuning parameters     Yes   No    Some  Some  Some  No
Relying on labels     No    Yes   No    No    No    No
Poor capacity         Yes   No    Some  No    No    No
Hard to train         No    No    Some  Some  Some  No
Time consuming        Some  Yes   Some  No    No    No

1: traditional statistical method, e.g., time series decomposition [1]
2: supervised ensemble method, e.g., Opprentice [2]
3: traditional unsupervised method, e.g., one-class SVM [3]
4: sequential deep generative model, e.g., VRNN [4]
5: non-sequential deep generative model, e.g., VAE [5], Donut [6]

SLIDE 8

Donut

Donut (Xu et al., WWW 2018) is a state-of-the-art unsupervised anomaly detection algorithm for KPIs. It is based on the variational autoencoder (VAE). Its authors also proposed a theoretical interpretation of Donut.

Figure: Overall architecture of Donut. Data preparation: fill missing with zero, standardization, sliding window. Training: modified ELBO, missing data injection. Detection: MCMC imputation.

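Figure: KDE interpretation for Donut, showing $q_\phi(z|x)$, the kernels $p_\theta(x|z^{(1)}), \ldots, p_\theta(x|z^{(L)})$ with their log values $\log p_\theta(x|z^{(1)}), \ldots, \log p_\theta(x|z^{(L)})$, and the reconstruction term $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$.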

SLIDE 9

Drawbacks of Donut (1/4)

Donut uses sliding windows, so the time information of a window is completely ignored. This may cause problems. For example, a pattern that occurs frequently may not be normal once time is taken into account.

Figure: The KPI value should be around 1 every night, so the red part is abnormal.

SLIDE 10

Drawbacks of Donut (2/4)

We then found more problems in real data.

Figure: Anomaly scores of G given by Donut. The blue lines are KPI values. The green lines are the anomaly scores for each point. Donut gives overly high anomaly scores for the normal fragment surrounded by missing points.

The small normal pieces surrounded by missing fragments are hard for Donut to reconstruct, because too many points are missing and Donut does not have enough information to reconstruct the normal pattern.

SLIDE 11

Drawbacks of Donut (3/4)

Figure: Donut gives overly high anomaly scores at many normal valleys, which are mostly smooth but have many periodic spikes.

Since H is very smooth at most points, the standard deviation of x will be quite small (nearly zero). On such a mostly smooth KPI, even a small bias can have a large impact on the likelihood.
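The effect of a small standard deviation can be checked numerically. The sketch below assumes a Gaussian likelihood and a hypothetical reconstruction bias of 0.05; it compares the log-likelihood at the mean with the log-likelihood at the biased value for several values of sigma.

```python
import math

def gaussian_log_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def likelihood_drop(bias, sigma):
    """How much the log-likelihood falls when x is off the mean by `bias`."""
    return gaussian_log_pdf(0.0, 0.0, sigma) - gaussian_log_pdf(bias, 0.0, sigma)

bias = 0.05  # hypothetical reconstruction error
for sigma in (1.0, 0.1, 0.01):
    print(f"sigma={sigma}: log-likelihood drop = {likelihood_drop(bias, sigma):.5f}")
```

The drop equals bias² / (2σ²), so shrinking σ by a factor of 10 multiplies the penalty by 100. This is why a nearly constant KPI can turn a harmless bias into a large anomaly score.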

SLIDE 12

Drawbacks of Donut (4/4)

Summary:

1 The correct normal pattern cannot be determined from a KPI window alone.

2 The model may be confused by abnormal points or noise.

3 The biases introduced by noise in the KPI can be amplified in the final anomaly detector, the likelihood.

SLIDE 13

A more robust algorithm is needed

Figure: Donut

Figure: Bagel, more healthy

SLIDE 14

Core Idea

1 Use additional time information to help reconstruct normal patterns.

2 Encode time information appropriately.

Figure: Time encoding pipeline (date and time, decompose, one-hot encode). Example: 2018/7/3 16:25:13 (Tuesday) is decomposed into 25 (minute), 16 (hour), and 2 (day of week), and each component is one-hot encoded.

3 Make sure that both window shape and time information work well ⇒ use a dropout layer to avoid overfitting.

SLIDE 15

Effect of the improvements

Figure: Side-by-side comparison of Donut and Bagel (panels: Donut, Bagel, Donut, Bagel).

SLIDE 17

Overall architecture

Figure: Overall architecture. Preprocess: impute, standardize, sliding window. Training: M-ELBO, missing injection. Testing: sliding windows, MCMC, anomaly score.

SLIDE 18

Training (1/4)

Preprocessing:

1 Imputing missing points.
2 Standardization for points in each KPI.
3 Sliding window with window length W.
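The first two preprocessing steps might look like the sketch below. The slides do not say how missing points are imputed, so linear interpolation is used here purely as an assumption; the function names are also hypothetical.

```python
import numpy as np

def impute_linear(values):
    """Fill NaN entries by linear interpolation (an assumed strategy, not from the slides)."""
    values = np.asarray(values, dtype=float)
    missing = np.isnan(values)
    idx = np.arange(len(values))
    filled = values.copy()
    filled[missing] = np.interp(idx[missing], idx[~missing], values[~missing])
    return filled, missing

def standardize(values):
    """Zero-mean, unit-variance scaling of one KPI (assumes the KPI is not constant)."""
    return (values - values.mean()) / values.std()

raw = np.array([1.0, np.nan, 3.0, 4.0, np.nan, 6.0])
filled, missing = impute_linear(raw)
scaled = standardize(filled)
```

Keeping the `missing` mask is useful later, because imputed points should not be treated as normal observations during training.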

Network structure: conditional variational autoencoder [7], as shown in Fig. 10.

SLIDE 19

Training (2/4)

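Figure: The overall neural network architecture. Generative network: z (dimension K) passes through hidden layers fθ(z) to produce µx and σx (dimension W, with σx obtained through SoftPlus+Δ), which parameterize x (dimension W). Variational network: x (dimension W) passes through hidden layers fφ(x) to produce µz and σz (dimension K, with σz obtained through SoftPlus+Δ), which parameterize z (dimension K). The additional input y has dimension Y. The double lines highlight the major difference from Donut [6] in network architecture.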
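To make the data flow of a conditional VAE concrete, here is a minimal untrained forward pass in NumPy. It is a sketch under several assumptions that the slides do not confirm: y is concatenated to the inputs of both networks, each network has a single ReLU hidden layer, and a small constant `eps` plays the role of the offset added after SoftPlus. The class name `TinyCVAE` is hypothetical.

```python
import numpy as np

def softplus(a):
    return np.logaddexp(0.0, a)  # numerically stable log(1 + exp(a))

def dense(rng, n_in, n_out):
    """Random weights for one fully connected layer (untrained, illustration only)."""
    return rng.normal(0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

class TinyCVAE:
    def __init__(self, W, Y, K, hidden=16, eps=1e-4, seed=0):
        rng = np.random.default_rng(seed)
        self.eps = eps
        self.enc_h = dense(rng, W + Y, hidden)     # variational net sees x and y
        self.enc_mu = dense(rng, hidden, K)
        self.enc_sigma = dense(rng, hidden, K)
        self.dec_h = dense(rng, K + Y, hidden)     # generative net sees z and y
        self.dec_mu = dense(rng, hidden, W)
        self.dec_sigma = dense(rng, hidden, W)

    @staticmethod
    def _layer(params, inp):
        w, b = params
        return inp @ w + b

    def encode(self, x, y):
        h = np.maximum(self._layer(self.enc_h, np.concatenate([x, y], axis=-1)), 0.0)
        return self._layer(self.enc_mu, h), softplus(self._layer(self.enc_sigma, h)) + self.eps

    def decode(self, z, y):
        h = np.maximum(self._layer(self.dec_h, np.concatenate([z, y], axis=-1)), 0.0)
        return self._layer(self.dec_mu, h), softplus(self._layer(self.dec_sigma, h)) + self.eps

model = TinyCVAE(W=8, Y=5, K=3)
x, y = np.zeros((2, 8)), np.zeros((2, 5))
mu_z, sigma_z = model.encode(x, y)
mu_x, sigma_x = model.decode(mu_z, y)
```

The SoftPlus plus offset keeps both standard deviations strictly positive, which the Gaussian log-likelihood requires.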

SLIDE 20

Training (3/4)

Encoding time information (y in Fig. 10):

1 Get the date and time of each window x.
2 Decompose it into useful components.
3 One-hot encode and concatenate.

Figure: Time encoding pipeline (date and time, decompose, one-hot encode). Example: 2018/7/3 16:25:13 (Tuesday) is decomposed into 25 (minute), 16 (hour), and 2 (day of week), and each component is one-hot encoded.
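The three encoding steps can be sketched as below. The component sizes (60 minutes, 24 hours, 7 days) follow from the components named on the slide; the function names are hypothetical. Note one assumed convention: Python's `weekday()` numbers Monday as 0, so Tuesday maps to index 1, whereas the slide labels Tuesday as 2.

```python
from datetime import datetime
import numpy as np

def one_hot(index, size):
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

def encode_time(ts):
    """Concatenate one-hot codes for minute (60), hour (24) and day of week (7)."""
    return np.concatenate([
        one_hot(ts.minute, 60),
        one_hot(ts.hour, 24),
        one_hot(ts.weekday(), 7),   # Python convention: Monday = 0
    ])

y = encode_time(datetime(2018, 7, 3, 16, 25, 13))
```

The result is a 91-dimensional vector with exactly three non-zero entries, one per component.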

SLIDE 21

Training (4/4)

Training objective (M-ELBO [6]):

$$\tilde{\mathcal{L}}(x, y) = \mathbb{E}_{q_\phi(z|x,y)}\left[\sum_{i=1}^{W} \alpha_i \cdot \log p_\theta(x_i|z, y) + \beta \cdot \log p_\theta(z|y) - \log q_\phi(z|x, y)\right] \quad (1)$$

α: a binary vector that denotes the corresponding anomaly labels of a window x.

β: the proportion of normal points in a window x.
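A one-sample Monte Carlo estimate of Eq. (1) for a single window could be computed as in the sketch below. Several points are assumptions: all distributions are diagonal Gaussians, α_i = 1 marks a normal point and α_i = 0 an abnormal one (consistent with β being the proportion of normal points), and a standard normal stands in for p(z|y) because the slides do not give its form.

```python
import numpy as np

def log_normal(v, mu, sigma):
    """Element-wise log density of N(mu, sigma^2)."""
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (v - mu) ** 2 / (2 * sigma ** 2)

def m_elbo_sample(x, alpha, mu_x, sigma_x, z, mu_z, sigma_z):
    """One-sample estimate of the M-ELBO for one window (simplified prior, see lead-in)."""
    beta = alpha.mean()                                    # proportion of normal points
    recon = np.sum(alpha * log_normal(x, mu_x, sigma_x))   # abnormal points are masked out
    prior = beta * np.sum(log_normal(z, 0.0, 1.0))         # assumed N(0, I) for p(z|y)
    posterior = np.sum(log_normal(z, mu_z, sigma_z))
    return recon + prior - posterior

x = np.array([0.0, 0.0, 5.0, 0.0])
alpha = np.array([1.0, 1.0, 0.0, 1.0])   # third point is labeled abnormal
value = m_elbo_sample(x, alpha, np.zeros(4), np.ones(4), np.zeros(2), np.zeros(2), np.ones(2))
```

Because the third point is masked by α, changing its value leaves the objective unchanged, which is how the M-ELBO keeps abnormal points from shaping the learned normal pattern.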

SLIDE 22

Detection (1/1)

We use the negative reconstruction probability as the anomaly detector:

$$-\mathbb{E}_{q_\phi(z|x,y)}\left[\log p_\theta(x|z, y)\right]$$

[6] gives a KDE (kernel density estimation) interpretation of it and explains why it is suitable for the anomaly detection problem.
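The expectation in the detector is typically approximated by sampling. The sketch below estimates it with a plain Monte Carlo average; `encode` and `decode` are placeholders for a trained model, and the toy versions used here are hypothetical stand-ins chosen so the example runs on its own. The MCMC imputation step shown in the architecture figure is not included.

```python
import numpy as np

def log_normal(v, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (v - mu) ** 2 / (2 * sigma ** 2)

def anomaly_score(x, y, encode, decode, n_samples=64, seed=0):
    """Monte Carlo estimate of -E_{q(z|x,y)}[log p(x|z,y)], one score per point in x."""
    rng = np.random.default_rng(seed)
    mu_z, sigma_z = encode(x, y)
    total = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        z = mu_z + sigma_z * rng.standard_normal(mu_z.shape)   # z ~ q(z|x,y)
        mu_x, sigma_x = decode(z, y)
        total += log_normal(x, mu_x, sigma_x)
    return -total / n_samples

# Hypothetical stand-ins for a trained model: the decoder always predicts a flat
# window at the level stored in y and ignores z.
def toy_encode(x, y):
    return np.zeros(2), np.ones(2)

def toy_decode(z, y):
    return np.full(4, y[0]), np.full(4, 0.5)

x = np.array([1.0, 1.1, 0.9, 3.0])   # last point deviates from the expected level 1.0
scores = anomaly_score(x, np.array([1.0]), toy_encode, toy_decode)
```

A higher score means the point is less likely under the reconstructed normal pattern; here the last point receives the highest score.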

SLIDE 24

Evaluation Metric

Figure: Illustration of the adjusted alert strategy. Rows: truth, score (0.6 0.4 0.3 0.7 0.6 0.5 0.2 0.3 0.4 0.6), point-wise alert, and adjusted alert, with the maximum allowed delay marked.

We use F1-score based on the adjusted alerts as the evaluation metric.
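One common reading of the adjusted alert strategy is sketched below: for each contiguous anomaly segment, an alert within the maximum allowed delay from the segment start marks the whole segment as detected, and otherwise the whole segment counts as missed. This interpretation, the function names, and the toy labels are assumptions for illustration, not the slide's own example.

```python
import numpy as np

def adjust_alerts(truth, alerts, max_delay):
    """Segment-wise adjustment of point-wise alerts (see the assumptions in the lead-in)."""
    truth = np.asarray(truth, dtype=bool)
    adjusted = np.asarray(alerts, dtype=bool).copy()
    n = len(truth)
    i = 0
    while i < n:
        if truth[i]:
            j = i
            while j < n and truth[j]:
                j += 1                                   # segment is truth[i:j]
            hit = adjusted[i:min(j, i + max_delay + 1)].any()
            adjusted[i:j] = hit                          # whole segment detected or missed
            i = j
        else:
            i += 1                                       # alerts outside segments are unchanged
    return adjusted

def f1_score(truth, alerts):
    truth, alerts = np.asarray(truth, dtype=bool), np.asarray(alerts, dtype=bool)
    tp = np.sum(truth & alerts)
    fp = np.sum(~truth & alerts)
    fn = np.sum(truth & ~alerts)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

truth  = [0, 0, 1, 1, 1, 0, 0, 1, 1, 1]
alerts = [1, 0, 0, 1, 0, 0, 0, 0, 0, 1]   # hypothetical point-wise alerts
adjusted = adjust_alerts(truth, alerts, max_delay=1)
f1 = f1_score(truth, adjusted)
```

In this toy case the first segment is detected within the allowed delay, while the second is alerted too late and is counted as missed in full.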

SLIDE 25

Datasets (1/2)

We obtain several well-maintained KPIs from several large Internet companies. All anomaly labels are manually confirmed by operators.

A, B, and C are similar to the KPIs in [6], so they demonstrate Bagel’s performance on KPIs that Donut claims to handle well. Bagel should perform similarly to Donut on them.

SLIDE 26

Datasets (2/2)

G has many missing points and several long missing fragments (like the one shown in item 2), so many normal fragments are just small pieces surrounded by missing points.

H is quite smooth but has many periodic spikes every day.

Bagel should significantly outperform Donut on both.

SLIDE 27

Overall Performance on A, B, C (1/2)

We compare Bagel’s performance with that of Donut and Opprentice.

Donut: a state-of-the-art unsupervised KPI anomaly detection algorithm based on VAE [6].

Opprentice: a state-of-the-art supervised ensemble KPI anomaly detection algorithm [2].

SLIDE 28

Overall Performance on A, B, C (2/2)

On datasets A, B, and C, Bagel’s performance is similar to Donut’s, which means Bagel can also handle the KPIs that Donut handles.

SLIDE 29

Overall Performance on G, H

Bagel significantly outperforms Donut and also outperforms Opprentice.

SLIDE 31

Conditional KDE explanation (1/2)

Two questions:

1 We use the negative reconstruction probability ($-\mathbb{E}_{q_\phi(z|x,y)}[\log p_\theta(x|z, y)]$) as the anomaly detector, but why can it be an effective anomaly detector?

The answer is almost the same as that of [6].

1) M-ELBO and the dimension reduction in CVAE make the model able to reconstruct normal patterns from a potentially abnormal window.

2) The reconstruction probability can be considered a KDE (kernel density estimation): $\log p_\theta(x|z, y)$ is the kernel, and $q_\phi(z|x, y)$ is the weight of the kernel.

$$\mathbb{E}_{q_\phi(z|x,y)}\left[\log p_\theta(x|z, y)\right] = \sum_{z^{(i)}} q_\phi(z^{(i)}|x, y) \log p_\theta(x|z^{(i)}, y)$$

SLIDE 32

Conditional KDE explanation (2/2)

2 Why does time information help?

1) Given a KPI window, its corresponding normal pattern is multimodal.

2) Time information also helps when x is confusing. For example, in G there is a normal fragment surrounded by missing points. Because this normal fragment is much shorter than the windows used to train the model, Donut cannot determine its normal pattern and therefore gives wrong anomaly scores.

SLIDE 33

Dropout for avoiding overfitting on time information (1/3)

Modeling the relationship between latent variables (z) and encoded timestamps (y) is easier than modeling that between latent variables (z) and sliding windows (x), because the KPIs are mostly seasonal and the local variation is less influential than the periodicity. Therefore, the CVAE model can easily overfit to time information.

Time gradient effect: “z samples drawn from approximated z posterior qφ(z|x, y) with more different y should be far away from each other”.

According to the analysis in [6], it is important to find a good z posterior.
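Dropout itself is simple to state in code. The sketch below shows standard inverted dropout; applying it directly to the encoded timestamps y and using a rate of 0.1 are assumptions for illustration, since the slides do not specify where the dropout layer sits or its rate.

```python
import numpy as np

def dropout(y, rate, rng, training=True):
    """Inverted dropout: zero each entry with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return y
    keep = rng.random(y.shape) >= rate
    return y * keep / (1.0 - rate)

rng = np.random.default_rng(0)
y = np.ones((1000, 91))                          # stand-in for encoded timestamps
y_train = dropout(y, rate=0.1, rng=rng)          # randomly thinned during training
y_test = dropout(y, rate=0.1, rng=rng, training=False)   # untouched at test time
```

Randomly hiding parts of the time encoding during training forces the model to also rely on the window shape x, which is the stated goal of avoiding overfitting to time information.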

SLIDE 34

Dropout for avoiding overfitting on time information (2/3)

Time Only Bagel without Dropout Bagel Best F1-score 0.686 0.605 0.074 Latent Space

SLIDE 35

Dropout for avoiding overfitting on time information (3/3)

Since the latent spaces of Bagel have a significant time gradient, similar to that in Donut [6], Bagel has an ability similar to Donut’s to reconstruct normal patterns from x, but Bagel also incorporates time information successfully without overfitting.

SLIDE 37

Conclusion

For the first time in the literature, we identify the importance of time information for non-sequential deep generative models, such as Donut, in the KPI anomaly detection problem.

To the best of our knowledge, Bagel is the first to apply the conditional variational autoencoder (CVAE) to KPI anomaly detection and to use the dropout technique to successfully avoid overfitting.

Our experiments using real data from Internet companies show that, compared to Donut, Bagel improves the best F1-score of anomaly detection by 0.08 to 0.43 for KPIs G and H, greatly improving on Donut’s robustness against anomalies related to time information.

SLIDE 38
[1] Y. Chen, R. Mahajan, B. Sridharan, and Z.-L. Zhang, “A provider-side view of web search response time,” in ACM SIGCOMM Computer Communication Review, vol. 43, no. 4. ACM, 2013, pp. 243–254.

[2] D. Liu, Y. Zhao, H. Xu, Y. Sun, D. Pei, J. Luo, X. Jing, and M. Feng, “Opprentice: Towards practical and automatic anomaly detection through machine learning,” in Proceedings of the 2015 Internet Measurement Conference. ACM, 2015, pp. 211–224.

[3] M. Amer, M. Goldstein, and S. Abdennadher, “Enhancing one-class support vector machines for unsupervised anomaly detection,” in Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description. ACM, 2013, pp. 8–15.

[4] M. Sölch, J. Bayer, M. Ludersdorfer, and P. van der Smagt, “Variational inference for on-line anomaly detection in high-dimensional time series,” stat, vol. 1050, p. 23, 2016.

[5] J. An and S. Cho, “Variational autoencoder based anomaly detection using reconstruction probability,” Special Lecture on IE, vol. 2, pp. 1–18, 2015.

[6] H. Xu, W. Chen, N. Zhao, Z. Li, J. Bu, Z. Li, Y. Liu, Y. Zhao, D. Pei, Y. Feng et al., “Unsupervised anomaly detection via variational auto-encoder for seasonal KPIs in web applications,” in Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2018, pp. 187–196.

[7] K. Sohn, H. Lee, and X. Yan, “Learning structured output representation using deep conditional generative models,” in Advances in Neural Information Processing Systems, 2015, pp. 3483–3491.

SLIDE 41

Thank You
