Detecting Large-Scale System Problems by Mining Console Logs
Authors: Wei Xu, Ling Huang, Armando Fox, David Patterson, Michael Jordan
Conference: ICML 2010, SOSP 2009, ICDM 2009
Reporter: Zhe Fu
2
Outline
SOSP 09: Offline Analysis
ICDM 09: Online Detection
ICML 10: Invited Applications Paper
3
services
unstructured to analyze automatically
relevant sets of events for problem detection
the problem
4
processing
systems.
5
Insight 1: Source code is the “schema” of logs.
small set of log printing statements in source code.
6
Insight 2: Common log structures lead to useful features.
program
could have in program
7
identifiers, message types, state variables
Insight 2: Common log structures lead to useful features.
8
Insight 3: Message sequences are important in problem detection.
because they are likely to come from logically related execution steps in the program.
For example, if a write operation to a file fails silently (perhaps because the developers do not handle the error correctly), no single error message is likely to indicate the failure.
9
Insight 4: Logs contain strong patterns with lots of noise.
messages or frequent message sequences—are very obvious
frequent pattern mining and Principal Component Analysis (PCA)
grouping methods
10
11
Step 1: Log parsing
structure
Step 2: Feature creation
vector features
Step 3: Machine learning
method
Step 4: Visualization
12
that returns a printable representation of itself for interpolation into a string
subclass rather than the base class of the logger object
13
identifiers, message types, state variables
regular expression:
starting: xact (.*) is (.*)
Parsing Approach - Source Code
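A minimal sketch of how a template regular expression like the one on this slide can pull the variable parts out of a log line. The helper name `parse` and the sample log lines are illustrative assumptions, not taken from the authors' parser.

```python
import re

# Template from the slide: the two (.*) groups capture the variable parts.
template = re.compile(r"starting: xact (.*) is (.*)")

def parse(line):
    """Return the captured variable values, or None when the template does not match."""
    match = template.fullmatch(line)
    return match.groups() if match else None

# Illustrative log lines (assumed for this example).
parsed = parse("starting: xact 325 is COMMITTING")
unparsed = parse("connection reset by peer")
```

A line that matches no template yields `None`, which corresponds to a parse failure.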
14
Parsing Approach - Source Code
15
Parsing Approach - Logs
be constructed
16
State ratio vector
the time window
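A rough sketch of building state ratio vectors, assuming each dimension counts how many messages report a given state value within one time window. The function name, the `(timestamp, state)` event format, and the sample data are hypothetical.

```python
from collections import Counter

def state_ratio_vectors(events, states, window):
    """events: (timestamp_seconds, state) pairs.
    Returns {window_index: [message count per state]}."""
    counts = {}
    for ts, state in events:
        # Assign each event to a fixed-size time window.
        counts.setdefault(int(ts // window), Counter())[state] += 1
    return {w: [c[s] for s in states] for w, c in sorted(counts.items())}

# Hypothetical events and a 60-second window.
states = ["ACTIVE", "COMMITTING", "ABORTING"]
events = [(0, "ACTIVE"), (5, "ACTIVE"), (12, "COMMITTING"),
          (61, "ABORTING"), (70, "ACTIVE")]
vectors = state_ratio_vectors(events, states, window=60)
```

Each vector captures the mix of states in its window, so a window with an unusual proportion of one state stands out from the others.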
17
identifiers, message types, state variables
Choose state variables that were reported at least 0.2N times.
Choose a window size that allows the variable to appear at least 10D times in 80% of all the time windows.
Message count vector
identifier values
the message group
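A small sketch of message count vectors, assuming messages are grouped by identifier value and each dimension counts one message type within the group. The identifiers, message type names, and function name are made up for illustration.

```python
from collections import defaultdict

def message_count_vectors(messages, message_types):
    """messages: (identifier, message_type) pairs.
    Returns {identifier: [count per message type]}."""
    index = {t: i for i, t in enumerate(message_types)}
    vectors = defaultdict(lambda: [0] * len(message_types))
    for identifier, msg_type in messages:
        vectors[identifier][index[msg_type]] += 1
    return dict(vectors)

# Hypothetical message stream keyed by a block identifier.
types = ["allocate", "write", "delete"]
messages = [("blk_1", "allocate"), ("blk_1", "write"), ("blk_1", "write"),
            ("blk_2", "allocate"), ("blk_2", "delete")]
groups = message_count_vectors(messages, types)
```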
18
identifiers, message types, state variables
Message count vector
19
State ratio vector
Message count vector
20
Also implemented as a Hadoop map-reduce job
Principal Component Analysis (PCA)-based anomaly detection
21
Principal Component Analysis (PCA)-based anomaly detection
22
Intuition behind PCA anomaly detection
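The intuition can be sketched in a few lines: normal vectors lie close to a low-dimensional subspace, and a vector with a large component outside that subspace is suspicious. This pure-Python sketch keeps only the first principal component (found by power iteration) and uses the squared residual length as the score. The data, the single-component choice, and the fixed threshold are all assumptions for illustration, not the authors' settings.

```python
import math

def center(rows):
    """Subtract the per-column mean; return (centered rows, means)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows], means

def top_component(centered, iters=200):
    """Leading eigenvector of the covariance matrix via power iteration."""
    n, d = len(centered), len(centered[0])
    cov = [[sum(r[a] * r[b] for r in centered) / n for b in range(d)]
           for a in range(d)]
    # Non-uniform start vector; power iteration can still stall if the start
    # happens to be orthogonal to the leading eigenvector.
    v = [float(j + 1) for j in range(d)]
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def spe(vec, means, v):
    """Squared length of the part of vec lying outside the normal subspace."""
    y = [vec[j] - means[j] for j in range(len(vec))]
    proj = sum(a * b for a, b in zip(y, v))
    resid = [a - proj * b for a, b in zip(y, v)]
    return sum(x * x for x in resid)

# Hypothetical count vectors: the two message types normally occur together.
normal = [[1, 1], [2, 2], [3, 3], [4, 4], [2, 2]]
centered, means = center(normal)
v = top_component(centered)
threshold = 1.0                      # assumed cut-off, chosen by hand
spe_normal = spe([3, 3], means, v)   # follows the usual pattern
spe_odd = spe([4, 0], means, v)      # second message type is missing
is_anomaly = spe_odd > threshold
```

The vector `[4, 0]` breaks the correlation between the two counts, so most of it falls outside the normal subspace and its score is large.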
23
Principal Component Analysis (PCA)-based anomaly detection
24
Principal Component Analysis (PCA)-based anomaly detection
25
Principal Component Analysis (PCA)-based anomaly detection
26
Principal Component Analysis (PCA)-based anomaly detection
27
Improving PCA detection results
where df_j is the total number of message groups that contain the j-th message type
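The weighting formula itself did not survive on this slide. As a hedged sketch, assuming the standard TF-IDF form (each count multiplied by log(n / df) for its message type, with n the number of message groups), the re-weighting could look like this. The function name and sample vectors are hypothetical.

```python
import math

def tfidf_weight(vectors):
    """Scale each count by log(n / df), where df is the number of
    message groups containing that message type (assumed TF-IDF form)."""
    n = len(vectors)
    d = len(vectors[0])
    df = [sum(1 for v in vectors if v[j] > 0) for j in range(d)]
    return [[v[j] * math.log(n / df[j]) if df[j] else 0.0 for j in range(d)]
            for v in vectors]

# Hypothetical message count vectors for three message groups.
counts = [[1, 2, 0], [1, 0, 1], [1, 1, 0]]
weighted = tfidf_weight(counts)
```

A message type that appears in every group gets weight zero, so it no longer dominates the detection result.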
28
29
30
Dataset:
31
32
Parsing accuracy:
Parsing fails when no message template can be found that matches the message and extracts its variables.
Scalability:
33
With 50 nodes, takes less than 3 minutes; less than 10 minutes with 10 nodes
34
performing random operations
available to Darkstar to 50% during time 1400 to 1800 sec
35
about 1:2
36
37
38
which is reported in 11,197,954 messages (about 50% of all messages) in 29 message types.
39
no single error message indicating the problem
#:Got Exception while serving # to #:#
a few blocks are replicated 10 times instead of 3 times for the majority of blocks.
40
Two Stage Online Detection Systems
has a predictable duration.
timestamps of events in the session.
decision of whether the event is normal or abnormal
41
(coarsely)
the time gap size is a configurable parameter
statistics
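One way to read the configurable time gap is as a session boundary: events for the same identifier stay in one session until the gap between consecutive timestamps exceeds the limit. The sketch below assumes that interpretation. The function name, the gap value, and the timestamps are made up.

```python
def split_sessions(timestamps, max_gap):
    """Split one identifier's event timestamps into sessions whenever the
    gap between consecutive events exceeds max_gap (assumed rule)."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= max_gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

# Hypothetical timestamps in seconds, with an assumed gap of 10 seconds.
sessions = split_sessions([0, 1, 2, 30, 31, 90], max_gap=10)
durations = [s[-1] - s[0] for s in sessions]
```

Session durations computed this way can then be compared against what is typical for that kind of session.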
42
power-law distribution
43
44
45
46
console logs
logs to parse logs accurately
states and individual operation sequences.
results.
normal events while using PCA detection to detect the anomalies in an online setting.
47
48