Error Log Processing for Accurate Failure Prediction
Felix Salfner Steffen Tschirpke
ICSI Berkeley Humboldt-Universität zu Berlin
Slide 2/15
■ Context of work: error-based online failure prediction
■ Data used: error and failure logs of a commercial telecommunication system
→ In this talk we present the data preprocessing concepts we applied
  to obtain accurate failure prediction results

(Figure: error events C, A, B, C on a timeline within a data window up to the present time; the prediction task asks whether a failure will follow.)
Slide 3/15
■ Key facts on the data
■ Overview of online failure prediction and the data preprocessing process
■ Detailed description of major preprocessing concepts
■ Experiments and Results
Slide 4/15
■ Experimental setup:
  ■ 200 days of data from a 273-day period
  ■ 26,991,314 error log records
  ■ 1,560 failures of two types
■ Failure definition:
  more than 0.01% of calls experience a response time > 250 ms
  within a 5-minute window

(Figure: the telecommunication system emits error logs; a call tracker monitors call response times and writes a failure log. If at most 0.01% of calls in a 5-minute interval exceed 250 ms the system is considered healthy (✓); if more than 0.01% do, a failure is recorded.)
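The failure definition above can be expressed directly as a predicate over the response times observed in one 5-minute window (a minimal sketch; the function name and input representation are illustrative):

```python
def is_failure(response_times_ms, threshold_ms=250.0, max_fraction=0.0001):
    """Failure definition from the talk: within a 5-minute window, more than
    0.01% of calls experience a response time above 250 ms."""
    slow = sum(1 for rt in response_times_ms if rt > threshold_ms)
    return slow / len(response_times_ms) > max_fraction

calls = [100.0] * 99_990 + [300.0] * 10   # 10 slow calls out of 100,000
print(is_failure(calls))                   # 10/100000 = 0.01% exactly -> False
```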
Slide 5/15
■ Approach: pattern recognition using Hidden Semi-Markov Models (HSMMs)
■ Objectives for data preprocessing:
■ Tasks:

(Figure: training data consists of failure sequences, e.g. B A F and A A B C B B F, and non-failure sequences, e.g. C C B A; each sequence spans a data window of length td with lead time tl before the failure, and the sequences are used to train one HSMM for failure sequences and one HSMM for non-failure sequences.)
Slide 6/15
(Figure: offline preprocessing pipeline. Error log → timestamp extraction → error-ID assignment → tupling → sequence extraction, using the failure log, into failure and non-failure sequences; non-failure sequences train Model 0, while failure sequences are clustered into groups for failure mechanisms 1…u, each passing through noise filtering 1…u to train Models 1…u.)
Slide 7/15
■ Problem: error logs contain no message IDs, e.g.:

  process 1534: end of buffer reached

→ Task: assign an ID to each message to characterize what has happened
■ Approach: two steps:
  1. Remove message-specific numbers:

     process xx: end of buffer reached

  2. Group similar messages by Levenshtein distance with a constant threshold

  Data             No. of messages   Reduction
  Original         1,695,160
  Without numbers  12,533
  Levenshtein      1,435
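The two-step ID assignment might be sketched as follows, assuming a greedy grouping strategy and an illustrative distance threshold (the talk only states that a constant threshold is used):

```python
import re

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def assign_error_ids(messages, threshold=3):
    """Step 1: mask message-specific numbers; step 2: greedily assign each
    message the ID of the first group representative within `threshold`
    Levenshtein distance (greedy grouping is an illustrative assumption)."""
    representatives = []          # one exemplar message text per error ID
    ids = []
    for msg in messages:
        masked = re.sub(r"\d+", "xx", msg)               # step 1
        for eid, rep in enumerate(representatives):
            if levenshtein(masked, rep) <= threshold:    # step 2
                ids.append(eid)
                break
        else:
            representatives.append(masked)
            ids.append(len(representatives) - 1)
    return ids

logs = ["process 1534: end of buffer reached",
        "process 72: end of buffer reached",
        "link 3 down"]
print(assign_error_ids(logs))   # the two buffer messages share one ID
```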
Slide 8/15
(Preprocessing pipeline overview, repeated.)
Slide 9/15
■ Goal:
■ Approach:

(Figure: pairwise similarities between failure sequences, e.g. F1 = A B A, F2 = A B A C, F3 = B A B, each ending at a failure (▾), are collected in similarity matrices M1, M2, M3; the sequences are then grouped by these similarities.)
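A backup slide mentions single, complete, and average linkage; the grouping step could therefore be sketched with single-linkage clustering over pairwise sequence distances (edit distance as the dissimilarity measure and the threshold value are illustrative assumptions, not the metric from the talk):

```python
from itertools import combinations

def seq_distance(a, b):
    # Edit distance between event sequences (illustrative dissimilarity;
    # the actual measure in the original work may differ).
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def single_linkage_groups(seqs, threshold):
    """Single-linkage clustering via union-find: sequences closer than
    `threshold` (directly or transitively) land in the same group."""
    parent = list(range(len(seqs)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(seqs)), 2):
        if seq_distance(seqs[i], seqs[j]) <= threshold:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(seqs[i])
    return list(groups.values())

failure_seqs = [list("ABA"), list("ABAC"), list("BAB"), list("ABAA")]
print(single_linkage_groups(failure_seqs, 1))
```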
Slide 10/15
Slide 11/15
(Preprocessing pipeline overview, repeated.)
Slide 12/15
■ Problem: clustered failure sequences contain many unrelated errors
  → Main reason: parallelism in the system
■ Assumption: indicative events occur more frequently prior to a failure
  than within other sequences
  → Apply a statistical test to quantify what “more frequently” means

(Figure: clustering assigns failure sequences F1 = A B A, F2 = A B A C, F4 = A B A A to filtering group 1 and F3 = B A B, F5 = C B A to filtering group n; within each group, errors judged unrelated to the failure (✘) are removed, and the remaining sequences, aligned at the time of failure (▾), become the training sequences for failure mechanisms 1…n.)
Slide 13/15
■ Testing variable derived from a goodness-of-fit test
■ Keep events in the sequence if the testing variable exceeds a critical value,
  where n_i denotes the number of occurrences of error i, n the total number
  of errors in the time window, and p_i the prior probability of occurrence
  of error i
■ Three ways to estimate the priors p_i from the training data set:
  the entire dataset, all training sequences, or only the failure training sequences
■ Results: (plot comparing filtering groups G1, G2, G3, G4 omitted)
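The slides do not give the testing variable explicitly; a plausible sketch uses the standard chi-square goodness-of-fit term (n_i − n·p_i)² / (n·p_i) and keeps an error only if its term exceeds a threshold (the formula, names, and numbers below are illustrative assumptions):

```python
from collections import Counter

def filter_noise(sequence, priors, threshold):
    """Keep only events whose occurrence count in the sequence deviates
    strongly from what the prior predicts, using a chi-square-style
    goodness-of-fit term as the (assumed) testing variable."""
    n = len(sequence)                 # total number of errors in the window
    counts = Counter(sequence)        # n_i per error ID
    keep = set()
    for eid, n_i in counts.items():
        expected = n * priors[eid]    # n * p_i
        stat = (n_i - expected) ** 2 / expected
        if stat > threshold:
            keep.add(eid)
    return [e for e in sequence if e in keep]

# Error 'F' is rare overall but frequent before failures -> it is kept,
# while the ubiquitous background error 'A' is filtered out.
priors = {"A": 0.9, "B": 0.08, "F": 0.02}
seq = ["A", "F", "A", "B", "F", "A", "F", "A", "A", "F"]
print(filter_noise(seq, priors, threshold=5.0))   # -> ['F', 'F', 'F', 'F']
```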
Slide 14/15
■ Objective: predict upcoming failures as accurately as possible
■ Metric used: F-measure (harmonic mean of precision and recall)
■ Failure prediction is achieved by comparing the sequence likelihood of an
  incoming sequence computed from the failure and non-failure models
■ Classification involves a customizable decision threshold
  → report the maximum F-measure

  Data               F-measure   Relative quality
  Optimal results    0.66        100%
  Without grouping   0.5097      77%
  Without filtering  0.3601      55%

(Figure: an incoming sequence A B C observed up to time t is scored by the HSMM for failure sequences and the HSMM for non-failure sequences; comparing the two sequence likelihoods yields the classification, i.e., a failure prediction with lead time td.)
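The threshold-based classification and the maximum-F-measure sweep can be sketched as follows (the simple log-likelihood-ratio rule, scores, and labels are illustrative assumptions, not values from the experiments):

```python
def classify(log_lik_failure, log_lik_nonfailure, threshold=0.0):
    """Predict an upcoming failure when the failure model explains the
    incoming sequence sufficiently better than the non-failure model;
    the decision threshold on the log-likelihood ratio is customizable."""
    return (log_lik_failure - log_lik_nonfailure) > threshold

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Sweep the decision threshold and keep the maximum F-measure, as on the
# slide (scores and labels below are made-up illustration data).
scores = [2.1, -0.5, 1.3, -2.0, 0.4]          # log-likelihood ratios
labels = [True, False, True, False, False]    # True = a failure followed
best = 0.0
for thr in sorted(scores) + [max(scores) + 1]:
    preds = [s > thr for s in scores]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    if tp:
        best = max(best, f_measure(tp, fp, fn))
print(round(best, 3))
```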
Slide 15/15
■ We have presented the data preprocessing techniques that we have
applied for online failure prediction in a commercial telecommunication system
■ The presented techniques include:
■ Using error and failure logs of the commercial telecommunication
system, we showed that elaborate data preprocessing is an essential step to achieve accurate failure predictions
Slide 16/15
■ Goal: remove multiple reporting of the same issue
■ Approach: combine messages of the same type if they occur closer in time
  to each other than a tupling window ε
■ Problem: how to choose ε
  → evaluate tupling for various values of ε and apply the “elbow rule”
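Tupling can be sketched as a single pass over time-ordered events (the exact merge rule and the ε value are illustrative; the talk chooses ε via the elbow rule):

```python
def tuple_events(events, epsilon):
    """Collapse repeated reports: an event with the same error ID as a
    previous report within `epsilon` seconds is merged into that tuple
    (illustrative sketch of the tupling step)."""
    last_seen = {}   # error ID -> timestamp of the tuple's latest event
    tuples = []
    for ts, eid in sorted(events):
        if eid in last_seen and ts - last_seen[eid] <= epsilon:
            last_seen[eid] = ts          # same issue, extend current tuple
        else:
            tuples.append((ts, eid))     # new tuple starts here
            last_seen[eid] = ts
    return tuples

events = [(0.0, "A"), (0.5, "A"), (0.9, "A"), (10.0, "A"), (10.2, "B")]
print(tuple_events(events, epsilon=1.0))
# -> [(0.0, 'A'), (10.0, 'A'), (10.2, 'B')]
```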
(Figure: sequences s1, s2, s3, s4, s5 positioned relative to a failure F.)
■ Single linkage
■ Complete linkage
■ Average linkage
(Figure: online failure prediction pipeline. Incoming error messages → error-ID assignment → tupling → sequence extraction → error message sequence; the sequence is scored by Model 0 and, after per-group filtering 1…u, by Models 1…u, yielding sequence likelihoods 0…u; classification of these likelihoods produces the failure prediction.)
Slide 21/15
(Figure: bar chart comparing precision, recall, false positive rate, and F-measure of the periodic, DFT, Eventset, SVD-SVM, and HSMM approaches; y-axis from 0.0 to 0.7.)
Slide 22/15
(Figure: HSMM with states 1, 2, 3, …, N-1 and a failure state F; each state i has an emission distribution b_i(A), b_i(B), b_i(C) over error symbols, and each transition i→j has a duration distribution g_ij(t), e.g. g12(t), g13(t), g23(t).)
■ Discrete time Markov chain (DTMC)
■ Hidden Markov Model (HMM)
■ Hidden Semi-Markov Model (HSMM)
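For contrast with the HSMM, the sequence likelihood of a plain discrete HMM can be computed with the forward algorithm; the HSMM of the talk additionally attaches duration distributions g_ij(t) to transitions, which this sketch omits (all model numbers are illustrative):

```python
def sequence_likelihood(pi, A, B, observations):
    """Forward algorithm for a discrete HMM.
    pi: initial state probabilities; A[i][j]: transition probabilities;
    B[i][o]: emission probabilities; returns P(observations | model)."""
    alpha = [pi[i] * B[i][observations[0]] for i in range(len(pi))]
    for o in observations[1:]:
        alpha = [B[j][o] * sum(alpha[i] * A[i][j] for i in range(len(pi)))
                 for j in range(len(pi))]
    return sum(alpha)

# Toy two-state model over symbols 'A' and 'B' (illustrative numbers).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [{"A": 0.9, "B": 0.1}, {"A": 0.2, "B": 0.8}]
print(sequence_likelihood(pi, A, B, ["A", "B", "A"]))
```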
Slide 23/15
■ Measurements
■ Failure avoidance
■ Preparation for failure