DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning
Min Du, Feifei Li, Guineng Zheng, Vivek Srikumar
University of Utah
Background
System Event Log
Available on practically every computer system!
System Event Log -> Automatically detected anomaly
LOG PARSING
System Event Log -> Structured Data (message type / log key)
Source statement: printf("Started service %s on port %d", x, y);
Log messages:
Started service A on port 80
Executor updated: app-1 is now LOADING
……
Log keys:
Started service * on port * (log key ID: 1)
Executor updated: * is now LOADING (log key ID: 2)
……
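The mapping from a log message to its log key and parameters can be sketched in a few lines of Python. This is only an illustration that assumes the printf-style templates are known in advance; it is not how a streaming parser such as SPELL works. The helper names and the format string for key 2 are assumptions made for this example.

```python
import re

def template_to_regex(fmt: str) -> re.Pattern:
    """Turn a printf-style format string into a regex with one group per specifier."""
    pattern = ""
    for part in re.split(r"(%[sd])", fmt):
        if part == "%s":
            pattern += r"(\S+)"        # a whitespace-free string parameter
        elif part == "%d":
            pattern += r"(-?\d+)"      # an integer parameter
        else:
            pattern += re.escape(part)  # constant text of the message
    return re.compile("^" + pattern + "$")

def template_to_key(fmt: str) -> str:
    """Replace every specifier with '*' to obtain the log key text."""
    return re.sub(r"%[sd]", "*", fmt)

# Hypothetical template table; the second format string is assumed.
TEMPLATES = {
    1: "Started service %s on port %d",
    2: "Executor updated: %s is now LOADING",
}
PARSERS = {key_id: template_to_regex(fmt) for key_id, fmt in TEMPLATES.items()}

def parse(message: str):
    """Return (log key ID, log key text, parameters), or None if nothing matches."""
    for key_id, regex in PARSERS.items():
        match = regex.match(message)
        if match:
            return key_id, template_to_key(TEMPLATES[key_id]), list(match.groups())
    return None
```

For example, `parse("Started service A on port 80")` yields log key ID 1 with parameters `["A", "80"]`.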
LOG ANALYSIS
System Event Log -> LOG PARSING -> Structured Data (message type / log key) -> LOG ANALYSIS -> Anomaly Detection
Message count vector: Xu’SOSP09, Lou’ATC10, etc.
Problem: Offline batched processing
Build workflow model: Lou’KDD10, Beschastnikh’ICSE14, Yu’ASPLOS16, etc.
Problem: Only for simple execution path anomalies
Common problem: Only log keys (message types) are considered.
SPELL
A streaming log parser published in ICDM’16
log message | log key | parameters
Deletion of file1 complete. | Deletion of * complete. | [file1]
Deletion of file2 complete. | Deletion of * complete. | [file2]

log message | log key | parameter value vector
t1 Deletion of file1 complete | k1 | [t1 - t0, file1]
t2 Took 0.61 seconds to deallocate network … | k2 | [t2 - t1, 0.61]
t3 VM Stopped (Lifecycle Event) | k3 | [t3 - t2]
… | … | …

Used for: Anomaly Detection, Diagnosis
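The parameter value vector in the table above puts the time elapsed since the previous entry first, followed by the parsed parameters. A minimal sketch, assuming entries arrive already parsed as (timestamp, log key ID, parameters); the timestamps below are made up for illustration:

```python
from typing import List, Sequence, Tuple

# (timestamp in seconds, log key ID, parsed parameters); this layout is an assumption.
LogEntry = Tuple[float, int, List[str]]

def parameter_value_vectors(entries: Sequence[LogEntry], t0: float):
    """Build [t_i - t_(i-1), param1, param2, ...] for every entry, in order."""
    previous = t0
    vectors = []
    for timestamp, key_id, params in entries:
        vectors.append((key_id, [timestamp - previous, *params]))
        previous = timestamp
    return vectors

# Hypothetical timestamps for the three messages in the table.
entries = [
    (10.0, 1, ["file1"]),   # Deletion of file1 complete
    (10.5, 2, ["0.61"]),    # Took 0.61 seconds to deallocate network ...
    (12.0, 3, []),          # VM Stopped (Lifecycle Event)
]
vectors = parameter_value_vectors(entries, t0=9.0)
```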
MODELS
Example log key sequence: 25 18 54 57 18 56 … 25 18 54 57 56 18 …
➢ a rigorous set of logic and control flows
➢ a (more structured) natural language
Natural language modeling as a multi-class classifier: history sequence => next key to appear
A log key is detected to be abnormal if it does not follow the prediction.
Use long short-term memory (LSTM) architecture
Training: log key sequence, history window size h=3
25 18 54 57 18 56 … 25 18 54 57 56 18 …
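With h=3, each training sample pairs a window of three consecutive log keys with the key that follows it. A small sketch of that sliding-window step (the function name is an assumption; feeding the pairs to an LSTM is left out):

```python
def training_pairs(keys, h=3):
    """Slide a window of length h over the sequence: (history, next key)."""
    return [(tuple(keys[i:i + h]), keys[i + h]) for i in range(len(keys) - h)]

pairs = training_pairs([25, 18, 54, 57, 18, 56], h=3)
for history, next_key in pairs:
    print(history, "->", next_key)
```

For the first example sequence this gives (25, 18, 54) -> 57, (18, 54, 57) -> 18, and (54, 57, 18) -> 56.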
Detection: In the detection stage, DeepLog checks whether the actual next log key is among its top g most probable predictions.
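The top-g check itself is simple once the model has produced a probability for each candidate next key. A minimal sketch, assuming the prediction is available as a dictionary from log key to probability; the probabilities below are invented for illustration:

```python
def is_anomalous(probabilities: dict, actual_key: int, g: int) -> bool:
    """Flag the actual next key if it is not among the g most probable keys."""
    top_g = sorted(probabilities, key=probabilities.get, reverse=True)[:g]
    return actual_key not in top_g

# Hypothetical model output for the history (25, 18, 54).
predicted = {57: 0.60, 56: 0.30, 18: 0.08, 25: 0.02}
print(is_anomalous(predicted, actual_key=56, g=2))  # 56 is in the top 2
print(is_anomalous(predicted, actual_key=18, g=2))  # 18 is not
```

A larger g tolerates more variation in the execution path, at the cost of missing anomalies that resemble less likely normal behavior.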
Input: log key sequence 25 18 54 57 18 56 … 25 18 54 57 56 18 … Output:
Method 1: Using Log Key Anomaly Detection model
An example of concurrency detection:
Method 2: A density-based clustering approach
Co-occurrence matrix of log keys (k_i, k_j) within distance d
f_d(k_i, k_j): the frequency of (k_i, k_j) appearing together within distance d
f(k_i): the frequency of k_i in the input sequence
p_d(i, j): the probability of (k_i, k_j) appearing together within distance d
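The three quantities can be computed in a single pass over the log key sequence. The sketch below makes two assumptions that the slide does not state: "within distance d" is read as "k_j appears in the d positions after k_i", and the probability is normalized as p_d(i, j) = f_d(k_i, k_j) / (d * f(k_i)).

```python
from collections import Counter

def cooccurrence(keys, d):
    """Return f (key counts), f_d (pair counts within distance d) and p_d."""
    f = Counter(keys)
    f_d = Counter()
    for t, k_i in enumerate(keys):
        # Look at the d keys that follow position t (assumed reading of "distance").
        for k_j in keys[t + 1 : t + 1 + d]:
            f_d[(k_i, k_j)] += 1
    # Assumed normalization: divide by d times the frequency of the first key.
    p_d = {pair: count / (d * f[pair[0]]) for pair, count in f_d.items()}
    return f, f_d, p_d

f, f_d, p_d = cooccurrence([25, 18, 54, 57, 18, 56], d=2)
```

Pairs with a high p_d are candidates for keys that belong to the same task; the clustering step that groups them is not shown here.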
Example: Log messages of a particular log key:
t2: Took 0.61 seconds to deallocate network …
t'2: Took 1.1 seconds to deallocate network …
….
Parameter value vectors over time: [t2 - t1, 0.61], [t'2 - t'1, 1.1], ….
Multivariate time series data anomaly detection problem!
✓ Leverage an LSTM-based approach;
✓ A parameter value vector is given as input at each time step;
✓ An anomaly is detected if the mean square error (MSE) between the prediction and the actual data is too large.
[Figure: value over time, showing history, prediction, and actual; check: MSE > Threshold ?]
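The MSE check can be sketched as follows. How the threshold is chosen is not shown on these slides; the example assumes it is derived from the errors on normal validation data as mean plus z standard deviations, which is only one plausible way to obtain a confidence-interval style threshold. All function names are hypothetical, and the vectors are assumed to be numeric.

```python
from statistics import mean, pstdev

def mse(predicted, actual):
    """Mean square error between a predicted and an actual parameter value vector."""
    return mean([(p - a) ** 2 for p, a in zip(predicted, actual)])

def fit_threshold(validation_errors, z=2.576):
    """Assumed rule: mean + z * std of validation errors (z = 2.576 is roughly a 99% interval)."""
    return mean(validation_errors) + z * pstdev(validation_errors)

def is_parameter_anomaly(predicted, actual, threshold):
    """Flag the entry when the prediction error exceeds the threshold."""
    return mse(predicted, actual) > threshold

# Made-up numbers: [time difference, duration parameter].
threshold = fit_threshold([0.0, 2.0], z=1.0)
print(is_parameter_anomaly([1.0, 0.5], [1.0, 1.5], threshold))
print(is_parameter_anomaly([1.0, 0.5], [1.0, 3.5], threshold))
```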
Log sequence: history, current
model: history -> prediction
Anomaly? (compare current with prediction)
Yes: False positive?
Yes: update model using this case: "history -> current"
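The feedback loop above can be illustrated with a toy model. The sketch swaps the LSTM for a simple counting model so that it stays self-contained; `CountModel`, `process`, and the `is_false_positive` callback (standing in for user feedback) are all hypothetical names.

```python
from collections import Counter, defaultdict

class CountModel:
    """Stand-in for the LSTM: counts which key follows each history window."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, history, current):
        """Learn one case: history -> current."""
        self.counts[tuple(history)][current] += 1

    def top(self, history, g):
        """Return up to g keys seen most often after this history."""
        seen = self.counts.get(tuple(history), Counter())
        return [key for key, _ in seen.most_common(g)]

def process(model, history, current, g, is_false_positive):
    """Predict, compare with the current key, and update on a reported false positive."""
    if current in model.top(history, g):
        return "normal"
    if is_false_positive(history, current):
        model.update(history, current)
        return "false positive, model updated"
    return "anomaly"

model = CountModel()
model.update([25, 18, 54], 57)
first = process(model, [25, 18, 54], 56, g=2, is_false_positive=lambda h, c: True)
second = process(model, [25, 18, 54], 56, g=2, is_false_positive=lambda h, c: False)
```

After the first call reports a false positive, the same transition is accepted as normal on the second call.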
Evaluation results on HDFS log data [1].
(over a million log entries with labeled anomalies)
[1] PCA (SOSP’09), IM (UsenixATC’10), N-gram (baseline language model)
Up is good
Evaluation results on OpenStack cloud log with different confidence intervals (CIs)
Dataset: generated on CloudLab; VM creation/deletion operations; injected performance anomalies.
MSE: mean square error
[Figure: MSE plot with thresholds; labels: ANOMALY, False Positive]
Evaluation on Blue Gene/L log, with and without online model update. Up is good
HPC log with labeled anomalies; available at https://www.usenix.org/cfdr-data
(Mini Challenge 2 – Computer Networking Operations) The dataset contains firewall logs, IDS logs, etc.
Detection results.
Could be fixed with prior knowledge
How does it help to diagnose anomalies?
Constructed workflow of VM Creation.
(previously generated OpenStack cloud log)
Parameter value anomaly
Time difference (performance) anomaly
Injected anomaly:
During VM creation, network speed from controller to compute node is throttled.
Identified anomaly:
Instance took too long to build because of the transition from 52 -> 53
DeepLog
➢ A real-time system log anomaly detection framework.
➢ LSTM is used to model system execution paths and log parameter values.
➢ Workflow models are built to help anomaly diagnosis.
➢ It supports online model update.
Min Du mind@cs.utah.edu
Feifei Li lifeifei@cs.utah.edu