Improvement of Log Pattern Extracting Algorithm Using Text Similarity
ZHAO Yining
Computer Network Information Center, Chinese Academy of Sciences
in HPBDC18, 2018/05/21
Content
- CNGrid & LARGE
- Why Log Patterns & Extracting Algorithm
- Algorithm of Identical Word Rate
- Text Similarity Based Approach
  - Improved Extracting Formation & LCS
  - Experiment Result
- Modified Log Comparing Model
- Summary & Future Work
CNGrid & LARGE
- China National HPC Environment
  - 2 operating centers (Beijing / Hefei)
  - 19 sites (200 PF + 162 PB)
  - Portal with micro-service architecture
  - Application-oriented global scheduling & predicting
  - Resource evaluation standard & comprehensive evaluation index
CNGrid & LARGE
- Log Analyzing fRamework in Grid Environment
Log Patterns & Extracting Algorithm
- We want to be alerted for logs in certain patterns, but…
  - too many logs for a human to read
  - need to summarize patterns before defining alert rules
- Set of log patterns in our context:
  - patterns are different from each other
  - patterns cover all logs in the original set
  - patterns are significantly fewer than the original logs
- The process of using log patterns
  - filter and remove frequent normal logs
  - use log pattern extraction algorithms to get the set of patterns
  - manually check the set and pick out abnormal patterns
  - define rules to generate alerts for these patterns
Algorithm of Identical Word Rate
- Algorithm of identical word rate (IWR): a straightforward way
  - identical words
    - 2 words that are identical
    - and in the same position in the 2 original logs
  - identical word rate
    - IWR = (number of identical words) / (total words)
    - predefined threshold t
    - if IWR is greater than t, the two logs are in one pattern
- Process of the IWR algorithm
  - set threshold t and an initially empty pattern set P
  - for each new incoming log, compute IWR with each pattern in P
  - if a pattern matches, skip to the next log; if none matches, add the log to P
- Significant limitation
  - Logs with different lengths have an IWR of ZERO!
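
The IWR procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: whitespace tokenization, the function names, and the choice to store the first log seen as the representative of its pattern are all assumptions.

```python
from typing import List


def identical_word_rate(log_a: str, log_b: str) -> float:
    """Share of positions that hold the same word in both logs."""
    words_a, words_b = log_a.split(), log_b.split()  # assumed tokenization
    # As the slide notes, logs of different lengths get an IWR of zero.
    if len(words_a) != len(words_b) or not words_a:
        return 0.0
    identical = sum(1 for a, b in zip(words_a, words_b) if a == b)
    return identical / len(words_a)


def extract_patterns_iwr(logs: List[str], t: float) -> List[str]:
    """Add a log to P only when no stored pattern has IWR greater than t."""
    patterns: List[str] = []  # P starts empty
    for log in logs:
        if not any(identical_word_rate(log, p) > t for p in patterns):
            patterns.append(log)  # assumption: the log itself stands for the pattern
    return patterns
```

With t = 0.5, "session opened for user root" and "session opened for user alice" fall into one pattern (IWR 0.8), while "session closed for user alice by admin" starts a new one only because its length differs, which is the limitation noted above.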
Text Similarity Based Approach (1)
- Using text similarity to resolve the problem
  - S = P × O
  - S: similarity, P: proportion of common words, O: order factor
- For two logs l1 and l2, with L1 and L2 their respective word sets
  - define P: P(l1, l2) = (|L1 ∩ L2| × 2) / (|L1| + |L2|)
  - define O: O(l1, l2) = SeqSim(l1, l2) / |L1 ∩ L2|
  - hence S: S(l1, l2) = (SeqSim(l1, l2) × 2) / (|L1| + |L2|)
- By this, logs of different lengths can be compared
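
As a small worked check of these definitions, the product P × O simplifies because the intersection terms cancel. The two sample logs are made up for illustration, and SeqSim is assumed here to count the common words that keep their relative order in both logs.

```latex
\begin{align*}
S(l_1, l_2) &= P(l_1, l_2) \times O(l_1, l_2)
  = \frac{2\,|L_1 \cap L_2|}{|L_1| + |L_2|}
    \times \frac{\mathrm{SeqSim}(l_1, l_2)}{|L_1 \cap L_2|}
  = \frac{2\,\mathrm{SeqSim}(l_1, l_2)}{|L_1| + |L_2|} \\[4pt]
% Hypothetical logs:
%   l1 = "session opened for user root"            (5 words)
%   l2 = "session closed for user alice by admin"  (7 words)
% Common words: session, for, user (3 words, same order in both logs),
% so the intersection has size 3 and SeqSim = 3 (assumed definition).
P &= \frac{2 \times 3}{5 + 7} = 0.5, \qquad
O = \frac{3}{3} = 1, \qquad
S = P \times O = 0.5
\end{align*}
```

Under identical word rate the same pair would score zero, because the two logs differ in length.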
Text Similarity Based Approach (2)
- Using the Longest Common Subsequence (LCS) to define SeqSim(l1, l2)
  - S(l1, l2) = (|LCS(l1, l2)| × 2) / (|L1| + |L2|)
  - same pattern if S(l1, l2) ≥ t, where t is the predefined threshold
- The process of the improved log pattern extracting algorithm
  - set the threshold value t; set the initial log pattern set P to the empty set
  - for each new log l from the input log set L, compute Si(l, pi) between l and every pi ∈ P using an LCS algorithm
  - if there is no Si(l, pi) ≥ t, add l to P
  - after all logs in L have been checked, return P
- Increased time cost for a single comparison
  - but reduced total number of comparisons
  - can be offset by choosing a better LCS algorithm
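
The improved procedure can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the presented implementation: logs are split on whitespace, |L1| and |L2| are taken as the word counts of the two logs, and the LCS is computed with the textbook dynamic-programming method instead of any of the faster candidate algorithms.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)  # DP row for the previous word of a
    for word_a in a:
        curr = [0]
        for j, word_b in enumerate(b, start=1):
            if word_a == word_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(log_a: str, log_b: str) -> float:
    """S(l1, l2) = 2 * |LCS(l1, l2)| / (|L1| + |L2|)."""
    words_a, words_b = log_a.split(), log_b.split()  # assumed tokenization
    total = len(words_a) + len(words_b)  # assumption: sizes are word counts
    if total == 0:
        return 0.0
    return 2 * lcs_length(words_a, words_b) / total


def extract_patterns(logs: List[str], t: float) -> List[str]:
    """Return P: every log that matched no earlier pattern with S >= t."""
    patterns: List[str] = []
    for log in logs:
        if not any(similarity(log, p) >= t for p in patterns):
            patterns.append(log)  # assumption: the log itself stands for the pattern
    return patterns
```

The quadratic dynamic program makes each comparison slower than IWR, which is the per-comparison cost the slide refers to; a faster LCS routine could be swapped in without changing `similarity`.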
Text Similarity Based Approach (3)
- Experiment results
  - numbers of extracted patterns
  - time costs of candidate algorithms (in milliseconds)
Modified Pattern Comparing Model (1)
- The original model has a poor time cost for searching patterns
  - it has to visit patterns one by one until the matching one is met
- Use a hashmap to accelerate the matching
  - divide the pattern set into subsets by initial words
  - skip the majority of patterns, which sit in irrelevant subsets
- Matching process:
  1. get the initial word of the log
  2. hash the word
  3. find the desired subset in the hashmap
  4. compare with patterns in the subset
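
The four matching steps can be sketched with a Python dictionary, which performs the hashing and subset lookup (steps 2 and 3) internally. The names `build_index`, `match`, and the `same_pattern` callback are hypothetical; `same_pattern` stands in for whatever comparison is used, for example the similarity threshold test.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional


def build_index(patterns: List[str]) -> Dict[str, List[str]]:
    """Divide the pattern set into subsets keyed by each pattern's initial word."""
    index: Dict[str, List[str]] = defaultdict(list)
    for pattern in patterns:
        words = pattern.split()
        if words:
            index[words[0]].append(pattern)
    return index


def match(
    log: str,
    index: Dict[str, List[str]],
    same_pattern: Callable[[str, str], bool],  # hypothetical comparison callback
) -> Optional[str]:
    """Return the first pattern in the relevant subset that matches the log."""
    words = log.split()
    if not words:
        return None
    # Steps 1-3: take the initial word and look up its subset by hash.
    candidates = index.get(words[0], [])
    # Step 4: compare only with patterns in that subset.
    for pattern in candidates:
        if same_pattern(log, pattern):
            return pattern
    return None
```

Patterns whose initial word differs from the log's are never compared, which is where the saving over a full scan comes from.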
Modified Pattern Comparing Model (2)
- This approach cannot deal with patterns with unfixed initials
  - build an unfixed pattern set
- In the real system, we split the pattern set into 4 parts:
  - fixed alert pattern set
  - unfixed alert pattern set
  - fixed normal pattern set
  - unfixed normal pattern set
- When a new log comes, it is compared against the 4 sets in turn to decide the processing method
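
One way to picture the four-set comparison is sketched below. Several details are assumptions, not taken from the slides: the sets are checked in the order listed, fixed sets are dictionaries keyed by initial word while unfixed sets are plain lists scanned in full, and the returned labels "alert", "normal", and "unknown" are hypothetical.

```python
from typing import Callable, Dict, List

Matcher = Callable[[str, str], bool]  # hypothetical comparison callback


def in_fixed(log: str, index: Dict[str, List[str]], same_pattern: Matcher) -> bool:
    """Check only the subset of fixed patterns sharing the log's initial word."""
    words = log.split()
    candidates = index.get(words[0], []) if words else []
    return any(same_pattern(log, p) for p in candidates)


def in_unfixed(log: str, patterns: List[str], same_pattern: Matcher) -> bool:
    """Unfixed patterns have no reliable initial word, so scan them all."""
    return any(same_pattern(log, p) for p in patterns)


def classify(
    log: str,
    fixed_alert: Dict[str, List[str]],
    unfixed_alert: List[str],
    fixed_normal: Dict[str, List[str]],
    unfixed_normal: List[str],
    same_pattern: Matcher,
) -> str:
    """Compare the log against the four sets in turn (assumed order)."""
    if in_fixed(log, fixed_alert, same_pattern):
        return "alert"
    if in_unfixed(log, unfixed_alert, same_pattern):
        return "alert"
    if in_fixed(log, fixed_normal, same_pattern):
        return "normal"
    if in_unfixed(log, unfixed_normal, same_pattern):
        return "normal"
    return "unknown"  # assumption: unmatched logs are left for later handling
```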
Modified Pattern Comparing Model (3)
- Real time cost comparison between original & modified models
[Figure: four panels (cron, maillog, secure, messages), each comparing the time cost of the original model and the modified model, in milliseconds]