Error Log Processing for Accurate Failure Prediction


SLIDE 1

Error Log Processing for Accurate Failure Prediction

Felix Salfner Steffen Tschirpke

ICSI Berkeley Humboldt-Universität zu Berlin

SLIDE 2

Introduction

■ Context of work: error-based online failure prediction
■ Data used:

  • Commercial telecommunication system
  • 200 components, 2000 classes
  • Error- and failure logs

→ In this talk we present the data preprocessing concepts we applied to obtain accurate failure prediction results

[Figure: error events (A, B, C) observed in a data window up to the present time; the prediction concerns whether a failure is upcoming.]

SLIDE 3

Contents

■ Key facts on the data
■ Overview of the online failure prediction and data preprocessing process
■ Detailed description of major preprocessing concepts

  • Assigning IDs to Error Messages
  • Failure Sequence Clustering
  • Noise Filtering

■ Experiments and Results

SLIDE 4

Key Facts on the Data

■ Experimental setup:

  • 200 days of data from a 273-day period
  • 26,991,314 error log records
  • 1,560 failures of two types

■ Failure definition:

  • Performance failures
  • A failure occurs if, within a 5 min interval, more than 0.01% of calls experience a response time > 250 ms

[Figure: the telecommunication system writes error logs; a Call Tracker monitors call response times and writes the failure log. In each 5 min interval, if more than 0.01% of calls exceed the 250 ms response-time limit, the interval is marked as a failure.]
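The failure definition above can be sketched as a predicate over one 5-minute window of call response times. This is a minimal illustration; the function and variable names are assumptions, not the system's actual Call Tracker interface.

```python
# Sketch of the slide's failure definition: a 5-minute window counts as a
# performance failure if more than 0.01% of its calls see a response time
# above 250 ms. Names and defaults are illustrative.

def is_performance_failure(response_times_ms,
                           threshold_ms=250.0,
                           max_slow_fraction=0.0001):
    """Return True if the window violates the response-time objective."""
    if not response_times_ms:
        return False
    slow = sum(1 for t in response_times_ms if t > threshold_ms)
    return slow / len(response_times_ms) > max_slow_fraction

# Example: 20,000 calls, 3 of them slow -> 0.015% > 0.01% -> failure
window = [100.0] * 19997 + [300.0, 400.0, 500.0]
print(is_performance_failure(window))  # True
```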

SLIDE 5

Online Failure Prediction

■ Approach: pattern recognition using Hidden Semi-Markov Models (HSMMs)
■ Objectives for data preprocessing:

  • Create a data set for training HSMMs that exposes key properties of the system
  • Identify how to process incoming data at runtime

■ Tasks:

  • Machine-processable data → error-ID assignment
  • Separate sequences for inherent failure mechanisms → clustering
  • Distinguishable, noise-free sequences → noise filtering

[Figure: failure and non-failure sequences of error events (A, B, C, F) extracted from the log. Failure sequences cover a data window of length td that ends a lead time tl before the failure; one HSMM is trained for failure sequences and one for non-failure sequences.]

SLIDE 6

Training Data Preprocessing

[Figure: preprocessing pipeline. Error log → timestamp extraction → error-ID assignment → tupling → sequence extraction (driven by the failure log). The extracted failure sequences are clustered into sequences for failure mechanisms 1…u; each cluster passes its own noise filtering before training Models 1…u. Non-failure sequences are used to train Model 0.]

SLIDE 7

Error ID Assignment

■ Problem: error logs contain no message IDs

  • Example message of a log record:

    process 1534: end of buffer reached

→ Task: assign an ID to each message that characterizes what has happened

■ Approach: two steps:

  • Remove numbers:

    process xx: end of buffer reached

  • Assign an ID based on Levenshtein's edit distance with a constant threshold

Data               No. of messages
Original           1,695,160
Without numbers    12,533
Levenshtein        1,435
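The two-step ID assignment can be sketched as follows. This is a simplified illustration under assumptions: the slide specifies only "remove numbers" and "Levenshtein distance with a constant threshold", so the replacement token, the threshold value, and the representative-based matching scheme are all hypothetical.

```python
import re

def strip_numbers(message):
    """Replace digit runs so messages differing only in numbers collapse."""
    return re.sub(r"\d+", "xx", message)

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def assign_ids(messages, threshold=10):
    """Give two messages the same ID if, after number removal, one is
    within `threshold` edits of an already-seen representative."""
    representatives = []  # (id, cleaned message)
    ids = []
    for msg in messages:
        cleaned = strip_numbers(msg)
        for rep_id, rep in representatives:
            if levenshtein(cleaned, rep) <= threshold:
                ids.append(rep_id)
                break
        else:
            new_id = len(representatives)
            representatives.append((new_id, cleaned))
            ids.append(new_id)
    return ids

msgs = ["process 1534: end of buffer reached",
        "process 77: end of buffer reached",
        "disk quota exceeded on /var"]
print(assign_ids(msgs))  # [0, 0, 1]
```

The first two messages collapse onto one ID once their process numbers are removed, while the unrelated third message gets a new ID, mirroring the reduction shown in the table above.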

SLIDE 8

Failure Sequence Clustering

[Figure: the preprocessing pipeline from the Training Data Preprocessing slide, with the clustering stage highlighted.]

SLIDE 9

Failure Sequence Clustering (2)

■ Goal:

  • Divide the set of training failure sequences into subsets
  • Group according to sequence similarity

■ Approach:

  • Train a small HSMM for each sequence
  • Apply each HSMM to all sequences
  • Sequence log-likelihoods express similarities
  • Make the resulting matrix symmetric
  • Apply a standard clustering algorithm

[Figure: example. Each failure sequence trains a small model (F1: A B A → M1, F2: A B A C → M2, F3: B A B → M3); applying every model to every sequence yields a matrix of log-likelihood values:

              M1      M2      M3
F1 (A B A)     2.1     2.6     7.8
F2 (A B A C)   4.2     1.3     6.9
F3 (B A B)     9.7    10.2     1.2
]
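The symmetrization and clustering steps can be sketched as below, treating the matrix values as pairwise distances (small value = similar). The averaging rule for symmetrization and the choice of single linkage are assumptions; the slide only says "make matrix symmetric" and "apply standard clustering algorithm".

```python
def symmetrize(m):
    """Average each entry with its transpose to obtain a symmetric matrix."""
    n = len(m)
    return [[(m[i][j] + m[j][i]) / 2.0 for j in range(n)] for i in range(n)]

def single_linkage(dist, num_clusters):
    """Plain agglomerative clustering: repeatedly merge the two closest
    clusters (closest pair of members) until `num_clusters` remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)
    return clusters

# Matrix from the example: rows = sequences F1..F3, columns = models M1..M3
m = [[2.1, 2.6, 7.8],
     [4.2, 1.3, 6.9],
     [9.7, 10.2, 1.2]]
print(single_linkage(symmetrize(m), 2))  # [[0, 1], [2]]: F1 and F2 group
```

With these example values, the two similar sequences A B A and A B A C end up in one cluster while B A B stays apart, which is the intended grouping by failure mechanism.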

SLIDE 10

Failure Sequence Clustering (3)

SLIDE 11

Noise Filtering

[Figure: the preprocessing pipeline from the Training Data Preprocessing slide, with the noise filtering stage highlighted.]

SLIDE 12

Noise Filtering (2)

■ Problem: clustered failure sequences contain many unrelated errors

→ Main reason: parallelism in the system

■ Assumption: indicative events occur more frequently prior to a failure than within other sequences

→ Apply a statistical test to quantify what "more frequently" means

[Figure: after clustering, the failure sequences of each failure mechanism (e.g., F1, F2, F4 vs. F3, F5) form a filtering group. Within each group, the sequences are aligned at the time of failure, and events that are not indicative of the failure mechanism (marked ✘) are removed.]

SLIDE 13

Noise Filtering (3)

■ Testing variable derived from a goodness-of-fit test, where n_i denotes the number of occurrences of error i in the time window, n denotes the total number of errors in the time window, and p_i denotes the prior probability of occurrence of error i

■ Keep events in the sequence if the testing variable exceeds a threshold

■ Three ways to estimate the priors p_i from the training data set:

  • Entire dataset
  • Training sequences
  • Failure training sequences

■ Results: [figure comparing filtering variants G1, G2, G3, G4]
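The filtering idea can be sketched as follows: keep an error type only if its observed count in the pre-failure window significantly exceeds what its prior probability predicts. The chi-square-style statistic and the critical value used here are illustrative assumptions; the slide specifies only "a testing variable derived from a goodness-of-fit test".

```python
from collections import Counter

def filter_noise(failure_events, priors, threshold=3.84):
    """Return the set of error IDs to keep.

    failure_events: list of error IDs observed in the pre-failure window
    priors: dict mapping error ID -> prior probability of occurrence
    """
    counts = Counter(failure_events)
    n = len(failure_events)
    keep = set()
    for error_id, n_i in counts.items():
        expected = n * priors.get(error_id, 1.0 / n)
        stat = (n_i - expected) ** 2 / expected
        # keep only over-represented errors above the critical value
        if n_i > expected and stat > threshold:
            keep.add(error_id)
    return keep

events = ["A"] * 30 + ["B"] * 5 + ["C"] * 5
priors = {"A": 0.1, "B": 0.1, "C": 0.4}
print(filter_noise(events, priors))  # {'A'}
```

In this toy window, error A occurs far more often than its prior predicts and is kept; B and C occur at or below their expected rates and are filtered out as noise.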

SLIDE 14

Experiments and Results

■ Objective: predict upcoming failures as accurately as possible
■ Metric used: F-measure

  • Precision: number of correct alarms relative to the total number of alarms
  • Recall: number of correct alarms relative to the total number of failures
  • F-measure: harmonic mean of precision and recall

■ Failure prediction is achieved by comparing the sequence likelihoods of an incoming sequence computed from the failure and non-failure models
■ Classification involves a customizable decision threshold → maximum F-measure is reported

Data                Max. F-measure   Relative quality
Optimal results     0.66             100%
Without grouping    0.5097           77%
Without filtering   0.3601           55%

[Figure: online classification. An incoming sequence of error events (A, B, C) within a window td is scored by the HSMM for failure sequences and the HSMM for non-failure sequences; comparing the two sequence likelihoods yields the failure prediction.]

SLIDE 15

Conclusions

■ We have presented the data preprocessing techniques that we have applied for online failure prediction in a commercial telecommunication system

■ The presented techniques include:

  • Assignment of IDs to error messages using Levenshtein's edit distance
  • Failure sequence clustering
  • Noise filtering based on a statistical test

■ Using error and failure logs of the commercial telecommunication system, we showed that elaborate data preprocessing is an essential step to achieve accurate failure predictions

SLIDE 16

Backup

SLIDE 17

Tupling

■ Goal: remove multiple reporting of the same issue
■ Approach:

  • Combine messages of the same type if they occur closer in time to each other than a threshold ε

■ Problem:

  • Determine the threshold value ε
  • Solution suggested by Tsao and Siewiorek: observe the number of tuples for various values of ε and apply the "elbow rule"

[Figure: number of tuples plotted over ε, showing the elbow]
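Tupling can be sketched as below, assuming each log record is a (timestamp, error_id) pair. The chaining semantics (a record extends a tuple if it is within ε of the tuple's most recent record) is one common reading of "closer in time than a threshold"; the data representation is an assumption.

```python
def tuple_log(records, epsilon):
    """Collapse bursts of same-type records into single tuples.

    records: list of (timestamp, error_id), sorted by timestamp.
    Returns one representative record (the first) per tuple.
    """
    last_seen = {}   # error_id -> timestamp of last record in its tuple
    tuples = []
    for ts, eid in records:
        if eid in last_seen and ts - last_seen[eid] <= epsilon:
            last_seen[eid] = ts        # extend the current tuple
        else:
            tuples.append((ts, eid))   # start a new tuple
            last_seen[eid] = ts
    return tuples

log = [(0.0, "A"), (0.5, "A"), (0.9, "A"), (5.0, "A"), (5.2, "B")]
print(tuple_log(log, epsilon=1.0))  # [(0.0, 'A'), (5.0, 'A'), (5.2, 'B')]
```

Sweeping epsilon over a range and counting `len(tuple_log(log, epsilon))` produces the curve to which the elbow rule is applied.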

SLIDE 18

HSMM Model Structure for Failure Sequence Clustering

[Figure: HSMM with states s1 to s5 and a failure state F]

SLIDE 19

Cluster Distance Metrics

[Figure: comparison of cluster distance metrics: single linkage, complete linkage, and average linkage]

SLIDE 20

Online Failure Prediction

[Figure: online prediction pipeline. Incoming error messages → tupling → error-ID assignment → sequence extraction; the resulting error message sequence passes per-model filtering (Filtering 1…u) and is scored by Model 0 (non-failure) and Models 1…u (failure mechanisms) to obtain sequence likelihoods 0…u, which feed classification and failure prediction.]

SLIDE 21

Comparison of Techniques

[Figure: bar chart (scale 0.0 to 0.7) comparing the techniques periodic, DFT, Eventset, SVD-SVM, and HSMM on precision, recall, false positive rate, and F-measure]

SLIDE 22

Hidden Semi-Markov Model

[Figure: HSMM with states 1, 2, 3, …, N-1 and an absorbing state F. Each state i emits symbols A, B, C with probabilities bi(X); transitions between states i and j follow time-dependent distributions gij(t).]

■ Discrete-time Markov chain (DTMC)

  • States (1, …, N-1, F)
  • Transition probabilities

■ Hidden Markov Model (HMM)

  • Each state can generate (error) symbols (A, B, C, F)
  • Discrete probability distribution of symbols per state, bi(X)

■ Hidden Semi-Markov Model (HSMM)

  • Time-dependent transition probabilities gij(t)
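To illustrate how a trained model scores a symbol sequence, here is the forward algorithm for a plain discrete HMM. This is a deliberately simplified sketch: the paper's HSMM additionally uses time-dependent transition distributions gij(t), which this constant-transition version omits; all parameter values are made up.

```python
def forward_likelihood(pi, A, B, observations):
    """Sequence likelihood via the forward algorithm.

    pi: initial state probabilities
    A[i][j]: transition probability from state i to state j
    B[i][symbol]: emission probability of `symbol` in state i
    """
    n = len(pi)
    alpha = [pi[i] * B[i][observations[0]] for i in range(n)]
    for obs in observations[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs]
                 for j in range(n)]
    return sum(alpha)

# Toy two-state model emitting error symbols A and B
pi = [1.0, 0.0]
A = [[0.6, 0.4], [0.0, 1.0]]
B = [{"A": 0.9, "B": 0.1}, {"A": 0.2, "B": 0.8}]
print(forward_likelihood(pi, A, B, ["A", "B"]))  # ≈ 0.342
```

In the prediction setup above, such a likelihood would be computed once under the failure model and once under the non-failure model, and the comparison drives the classification.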

SLIDE 23

Proactive Fault Management

[Figure: proactive fault management loop. Measurements from the running system feed a prediction model for online failure prediction; predictions trigger failure avoidance or preparation for failure.]