Error Log Processing for Accurate Failure Prediction


SLIDE 1

Error Log Processing for Accurate Failure Prediction

Felix Salfner Steffen Tschirpke

ICSI Berkeley Humboldt-Universität zu Berlin

SLIDE 2

Introduction

■ Context of work: error-based online failure prediction
■ Data used:

  • Commercial telecommunication system
  • 200 components, 2000 classes
  • Error- and failure logs

→ In this talk we present the data preprocessing concepts we applied to obtain accurate failure prediction results

[Figure: error events (A, B, C) observed in a data window up to the present time; the prediction concerns whether a failure is upcoming.]

SLIDE 3

Contents

■ Key facts on the data
■ Overview of the online failure prediction and data preprocessing process
■ Detailed description of major preprocessing concepts

  • Assigning IDs to Error Messages
  • Failure Sequence Clustering
  • Noise Filtering

■ Experiments and Results

SLIDE 4

Key Facts on the Data

■ Experimental setup:

  • 200 days of data from a 273-day period
  • 26,991,314 error log records
  • 1,560 failures of two types

■ Failure definition:

  • Performance failures
  • A failure occurs if, within a 5 min interval, more than 0.01% of calls experience a response time > 250 ms

[Figure: the telecommunication system writes error logs; a Call Tracker monitors call response times and writes the failure log. In each 5 min interval, if more than 0.01% of calls exceed the 250 ms response-time limit, the interval is marked as a failure.]
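The failure definition above can be sketched as a predicate over one 5-minute window of call response times. This is a minimal illustration; the function and variable names are assumptions, not the system's actual Call Tracker interface.

```python
# Sketch of the slide's failure definition: a 5-minute window counts as a
# performance failure if more than 0.01% of its calls see a response time
# above 250 ms. Names and defaults are illustrative.

def is_performance_failure(response_times_ms,
                           threshold_ms=250.0,
                           max_slow_fraction=0.0001):
    """Return True if the window violates the response-time objective."""
    if not response_times_ms:
        return False
    slow = sum(1 for t in response_times_ms if t > threshold_ms)
    return slow / len(response_times_ms) > max_slow_fraction

# Example: 20,000 calls, 3 of them slow -> 0.015% > 0.01% -> failure
window = [100.0] * 19997 + [300.0, 400.0, 500.0]
print(is_performance_failure(window))  # True
```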

SLIDE 5

Online Failure Prediction

■ Approach: pattern recognition using Hidden Semi-Markov Models (HSMMs)
■ Objectives for data preprocessing:

  • Create a data set for training HSMMs that exposes key properties of the system
  • Identify how to process incoming data at runtime

■ Tasks:

  • Machine-processable data → error-ID assignment
  • Separate sequences for inherent failure mechanisms → clustering
  • Distinguishable, noise-free sequences → noise filtering

[Figure: failure and non-failure sequences of error events (A, B, C, F) extracted from the log. Failure sequences cover a data window of length td that ends a lead time tl before the failure; one HSMM is trained for failure sequences and one for non-failure sequences.]

SLIDE 6

Training Data Preprocessing

[Figure: preprocessing pipeline. Error log → timestamp extraction → error-ID assignment → tupling → sequence extraction (driven by the failure log). The extracted failure sequences are clustered into sequences for failure mechanisms 1…u; each cluster passes its own noise filtering before training Models 1…u. Non-failure sequences are used to train Model 0.]

SLIDE 7

Error ID Assignment

■ Problem: error logs contain no message IDs

  • Example message of a log record:

    process 1534: end of buffer reached

→ Task: assign an ID to each message that characterizes what has happened

■ Approach: two steps:

  • Remove numbers:

    process xx: end of buffer reached

  • Assign an ID based on Levenshtein's edit distance with a constant threshold

Data               No. of messages
Original           1,695,160
Without numbers    12,533
Levenshtein        1,435
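The two-step ID assignment can be sketched as follows. This is a simplified illustration under assumptions: the slide specifies only "remove numbers" and "Levenshtein distance with a constant threshold", so the replacement token, the threshold value, and the representative-based matching scheme are all hypothetical.

```python
import re

def strip_numbers(message):
    """Replace digit runs so messages differing only in numbers collapse."""
    return re.sub(r"\d+", "xx", message)

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def assign_ids(messages, threshold=10):
    """Give two messages the same ID if, after number removal, one is
    within `threshold` edits of an already-seen representative."""
    representatives = []  # (id, cleaned message)
    ids = []
    for msg in messages:
        cleaned = strip_numbers(msg)
        for rep_id, rep in representatives:
            if levenshtein(cleaned, rep) <= threshold:
                ids.append(rep_id)
                break
        else:
            new_id = len(representatives)
            representatives.append((new_id, cleaned))
            ids.append(new_id)
    return ids

msgs = ["process 1534: end of buffer reached",
        "process 77: end of buffer reached",
        "disk quota exceeded on /var"]
print(assign_ids(msgs))  # [0, 0, 1]
```

The first two messages collapse onto one ID once their process numbers are removed, while the unrelated third message gets a new ID, mirroring the reduction shown in the table above.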

SLIDE 8

Failure Sequence Clustering

[Figure: the preprocessing pipeline from the Training Data Preprocessing slide, with the clustering stage highlighted.]

SLIDE 9

Failure Sequence Clustering (2)

■ Goal:

  • Divide the set of training failure sequences into subsets
  • Group according to sequence similarity

■ Approach:

  • Train a small HSMM for each sequence
  • Apply each HSMM to all sequences
  • Sequence log-likelihoods express similarities
  • Make the resulting matrix symmetric
  • Apply a standard clustering algorithm

[Figure: example. Each failure sequence trains a small model (F1: A B A → M1, F2: A B A C → M2, F3: B A B → M3); applying every model to every sequence yields a matrix of log-likelihood values:

              M1      M2      M3
F1 (A B A)     2.1     2.6     7.8
F2 (A B A C)   4.2     1.3     6.9
F3 (B A B)     9.7    10.2     1.2
]
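The symmetrization and clustering steps can be sketched as below, treating the matrix values as pairwise distances (small value = similar). The averaging rule for symmetrization and the choice of single linkage are assumptions; the slide only says "make matrix symmetric" and "apply standard clustering algorithm".

```python
def symmetrize(m):
    """Average each entry with its transpose to obtain a symmetric matrix."""
    n = len(m)
    return [[(m[i][j] + m[j][i]) / 2.0 for j in range(n)] for i in range(n)]

def single_linkage(dist, num_clusters):
    """Plain agglomerative clustering: repeatedly merge the two closest
    clusters (closest pair of members) until `num_clusters` remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)
    return clusters

# Matrix from the example: rows = sequences F1..F3, columns = models M1..M3
m = [[2.1, 2.6, 7.8],
     [4.2, 1.3, 6.9],
     [9.7, 10.2, 1.2]]
print(single_linkage(symmetrize(m), 2))  # [[0, 1], [2]]: F1 and F2 group
```

With these example values, the two similar sequences A B A and A B A C end up in one cluster while B A B stays apart, which is the intended grouping by failure mechanism.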

SLIDE 10

Failure Sequence Clustering (3)

SLIDE 11

Noise Filtering

[Figure: the preprocessing pipeline from the Training Data Preprocessing slide, with the noise filtering stage highlighted.]

SLIDE 12

Noise Filtering (2)

■ Problem: clustered failure sequences contain many unrelated errors

→ Main reason: parallelism in the system

■ Assumption: indicative events occur more frequently prior to a failure than within other sequences

→ Apply a statistical test to quantify what "more frequently" means

[Figure: after clustering, the failure sequences of each failure mechanism (e.g., F1, F2, F4 vs. F3, F5) form a filtering group. Within each group, the sequences are aligned at the time of failure, and events that are not indicative of the failure mechanism (marked ✘) are removed.]

SLIDE 13

Noise Filtering (3)

■ Testing variable derived from a goodness-of-fit test, where n_i denotes the number of occurrences of error i in the time window, n denotes the total number of errors in the time window, and p_i denotes the prior probability of occurrence of error i

■ Keep events in the sequence if the testing variable exceeds a threshold

■ Three ways to estimate the priors p_i from the training data set:

  • Entire dataset
  • Training sequences
  • Failure training sequences

■ Results: [figure comparing filtering variants G1, G2, G3, G4]
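The filtering idea can be sketched as follows: keep an error type only if its observed count in the pre-failure window significantly exceeds what its prior probability predicts. The chi-square-style statistic and the critical value used here are illustrative assumptions; the slide specifies only "a testing variable derived from a goodness-of-fit test".

```python
from collections import Counter

def filter_noise(failure_events, priors, threshold=3.84):
    """Return the set of error IDs to keep.

    failure_events: list of error IDs observed in the pre-failure window
    priors: dict mapping error ID -> prior probability of occurrence
    """
    counts = Counter(failure_events)
    n = len(failure_events)
    keep = set()
    for error_id, n_i in counts.items():
        expected = n * priors.get(error_id, 1.0 / n)
        stat = (n_i - expected) ** 2 / expected
        # keep only over-represented errors above the critical value
        if n_i > expected and stat > threshold:
            keep.add(error_id)
    return keep

events = ["A"] * 30 + ["B"] * 5 + ["C"] * 5
priors = {"A": 0.1, "B": 0.1, "C": 0.4}
print(filter_noise(events, priors))  # {'A'}
```

In this toy window, error A occurs far more often than its prior predicts and is kept; B and C occur at or below their expected rates and are filtered out as noise.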

SLIDE 14

Experiments and Results

■ Objective: predict upcoming failures as accurately as possible
■ Metric used: F-measure

  • Precision: number of correct alarms relative to the total number of alarms
  • Recall: number of correct alarms relative to the total number of failures
  • F-measure: harmonic mean of precision and recall

■ Failure prediction is achieved by comparing the sequence likelihoods of an incoming sequence computed from the failure and non-failure models
■ Classification involves a customizable decision threshold → maximum F-measure is reported

Data                Max. F-measure   Relative quality
Optimal results     0.66             100%
Without grouping    0.5097           77%
Without filtering   0.3601           55%

[Figure: online classification. An incoming sequence of error events (A, B, C) within a window td is scored by the HSMM for failure sequences and the HSMM for non-failure sequences; comparing the two sequence likelihoods yields the failure prediction.]

SLIDE 15

Conclusions

■ We have presented the data preprocessing techniques that we have applied for online failure prediction in a commercial telecommunication system

■ The presented techniques include:

  • Assignment of IDs to error messages using Levenshtein's edit distance
  • Failure sequence clustering
  • Noise filtering based on a statistical test

■ Using error and failure logs of the commercial telecommunication system, we showed that elaborate data preprocessing is an essential step to achieve accurate failure predictions

SLIDE 16

Backup

SLIDE 17

Tupling

■ Goal: remove multiple reporting of the same issue
■ Approach:

  • Combine messages of the same type if they occur closer in time to each other than a threshold ε

■ Problem:

  • Determine the threshold value ε
  • Solution suggested by Tsao and Siewiorek: observe the number of tuples for various values of ε and apply the "elbow rule"

[Figure: number of tuples plotted over ε, showing the elbow]
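Tupling can be sketched as below, assuming each log record is a (timestamp, error_id) pair. The chaining semantics (a record extends a tuple if it is within ε of the tuple's most recent record) is one common reading of "closer in time than a threshold"; the data representation is an assumption.

```python
def tuple_log(records, epsilon):
    """Collapse bursts of same-type records into single tuples.

    records: list of (timestamp, error_id), sorted by timestamp.
    Returns one representative record (the first) per tuple.
    """
    last_seen = {}   # error_id -> timestamp of last record in its tuple
    tuples = []
    for ts, eid in records:
        if eid in last_seen and ts - last_seen[eid] <= epsilon:
            last_seen[eid] = ts        # extend the current tuple
        else:
            tuples.append((ts, eid))   # start a new tuple
            last_seen[eid] = ts
    return tuples

log = [(0.0, "A"), (0.5, "A"), (0.9, "A"), (5.0, "A"), (5.2, "B")]
print(tuple_log(log, epsilon=1.0))  # [(0.0, 'A'), (5.0, 'A'), (5.2, 'B')]
```

Sweeping epsilon over a range and counting `len(tuple_log(log, epsilon))` produces the curve to which the elbow rule is applied.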

SLIDE 18

HSMM Model Structure for Failure Sequence Clustering

[Figure: HSMM with states s1 to s5 and a failure state F]

SLIDE 19

Cluster Distance Metrics

[Figure: comparison of cluster distance metrics: single linkage, complete linkage, and average linkage]

SLIDE 20

Online Failure Prediction

[Figure: online prediction pipeline. Incoming error messages → tupling → error-ID assignment → sequence extraction; the resulting error message sequence passes per-model filtering (Filtering 1…u) and is scored by Model 0 (non-failure) and Models 1…u (failure mechanisms) to obtain sequence likelihoods 0…u, which feed classification and failure prediction.]

SLIDE 21

Comparison of Techniques

[Figure: bar chart (scale 0.0 to 0.7) comparing the techniques periodic, DFT, Eventset, SVD-SVM, and HSMM on precision, recall, false positive rate, and F-measure]

SLIDE 22

Hidden Semi-Markov Model

[Figure: HSMM with states 1, 2, 3, …, N-1 and an absorbing state F. Each state i emits symbols A, B, C with probabilities bi(X); transitions between states i and j follow time-dependent distributions gij(t).]

■ Discrete-time Markov chain (DTMC)

  • States (1, …, N-1, F)
  • Transition probabilities

■ Hidden Markov Model (HMM)

  • Each state can generate (error) symbols (A, B, C, F)
  • Discrete probability distribution of symbols per state, bi(X)

■ Hidden Semi-Markov Model (HSMM)

  • Time-dependent transition probabilities gij(t)
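To illustrate how a trained model scores a symbol sequence, here is the forward algorithm for a plain discrete HMM. This is a deliberately simplified sketch: the paper's HSMM additionally uses time-dependent transition distributions gij(t), which this constant-transition version omits; all parameter values are made up.

```python
def forward_likelihood(pi, A, B, observations):
    """Sequence likelihood via the forward algorithm.

    pi: initial state probabilities
    A[i][j]: transition probability from state i to state j
    B[i][symbol]: emission probability of `symbol` in state i
    """
    n = len(pi)
    alpha = [pi[i] * B[i][observations[0]] for i in range(n)]
    for obs in observations[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs]
                 for j in range(n)]
    return sum(alpha)

# Toy two-state model emitting error symbols A and B
pi = [1.0, 0.0]
A = [[0.6, 0.4], [0.0, 1.0]]
B = [{"A": 0.9, "B": 0.1}, {"A": 0.2, "B": 0.8}]
print(forward_likelihood(pi, A, B, ["A", "B"]))  # ≈ 0.342
```

In the prediction setup above, such a likelihood would be computed once under the failure model and once under the non-failure model, and the comparison drives the classification.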

SLIDE 23

Proactive Fault Management

[Figure: proactive fault management loop. Measurements from the running system feed a prediction model for online failure prediction; predictions trigger failure avoidance or preparation for failure.]