Improvement of Log Pattern Extracting Algorithm Using Text Similarity

Improvement of Log Pattern Extracting Algorithm Using Text Similarity. ZHAO Yining, Computer Network Information Center, Chinese Academy of Sciences. Presented in HPBDC18, 2018/05/21.


  1. Improvement of Log Pattern Extracting Algorithm Using Text Similarity
     ZHAO Yining
     Computer Network Information Center, Chinese Academy of Sciences
     in HPBDC18, 2018/05/21

  2. Content
     - CNGrid & LARGE
     - Why Log Patterns & Extracting Algorithm
     - Algorithm of Identical Word Rate
     - Text Similarity Based Approach
       - Improved Extracting Formation & LCS
       - Experiment Result
     - Modified Log Comparing Model
     - Summary & Future Work

  3. CNGrid & LARGE
     - China National HPC Environment
       - 2 Operating Centers (Beijing / Hefei)
       - 19 Sites (200PF + 162PB)
       - Portal with Micro-Service Architecture
       - Application-oriented Global Scheduling & Predicting
       - Resource Evaluation Standard & Comprehensive Evaluation Index

  4. CNGrid & LARGE
     - Log Analyzing fRamework in Grid Environment (LARGE)

  5. Log Patterns & Extracting Algorithm
     - We want to be alerted to logs matching certain patterns, but:
       - there are too many logs for a human to read
       - patterns need to be summarized before alert rules can be defined
     - The set of log patterns in our context:
       - patterns are distinct from each other
       - the set covers all logs in the original set
       - the set is significantly smaller than the original
     - The process of using log patterns:
       - filter out and remove frequent normal logs
       - use a log pattern extraction algorithm to get the set of patterns
       - manually check the set and pick out abnormal patterns
       - define rules to generate alerts for these patterns

  6. Algorithm of Identical Word Rate
     - Algorithm of identical word rate (IWR): a straightforward approach
       - identical words
         - two words that are identical
         - and in the same position in the two original logs
       - identical word rate
         - IWR = (number of identical words) / (total words)
         - predefined threshold t
         - if IWR is greater than t, the two logs belong to one pattern
     - Process of the IWR algorithm
       - set threshold t and an initially empty pattern set P
       - for each new incoming log, compute IWR against each pattern in P
       - if a pattern matches, skip to the next log; if none matches, add the log to P
     - Significant limitation
       - logs of different lengths have an IWR of ZERO!
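The IWR procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the author's implementation: the function names, whitespace tokenization, and the choice to represent each pattern by the first log that created it are all assumptions. Only the rule "different lengths give IWR of zero" and the strict "greater than t" comparison come from the slide.

```python
def identical_word_rate(log_a: str, log_b: str) -> float:
    """Share of positions that hold the same word in both logs.

    Assumption: words are separated by whitespace. As stated on the
    slide, logs of different lengths score zero.
    """
    words_a, words_b = log_a.split(), log_b.split()
    if len(words_a) != len(words_b) or not words_a:
        return 0.0
    same = sum(1 for a, b in zip(words_a, words_b) if a == b)
    return same / len(words_a)


def extract_patterns_iwr(logs, t):
    """Keep a log as a new pattern only if no stored pattern has IWR > t."""
    patterns = []  # assumption: a pattern is the first log that did not match
    for log in logs:
        if not any(identical_word_rate(log, p) > t for p in patterns):
            patterns.append(log)
    return patterns


logs = [
    "session opened for user root",
    "session opened for user alice",
    "session closed for user root by admin",
]
patterns = extract_patterns_iwr(logs, t=0.7)
# The second log matches the first (4 of 5 words agree, IWR 0.8).
# The third has 7 words instead of 5, so its IWR is 0 and it becomes
# a new pattern even though it is clearly related.
```

The third log shows the limitation named on the slide: a small change in length is enough to force a new pattern.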

  7. Text Similarity Based Approach (1)
     - Using text similarity to resolve the problem
       - $S = P \times O$
       - $S$: similarity, $P$: proportion of common words, $O$: order factor
     - For two logs $l_1$ and $l_2$, let $L_1$ and $L_2$ be their respective word sets
       - define $P$: $P(l_1, l_2) = \frac{2\,|L_1 \cap L_2|}{|L_1| + |L_2|}$
       - define $O$: $O(l_1, l_2) = \frac{\mathrm{SeqSim}(l_1, l_2)}{|L_1 \cap L_2|}$
       - hence $S$: $S(l_1, l_2) = \frac{2\,\mathrm{SeqSim}(l_1, l_2)}{|L_1| + |L_2|}$
     - With this definition, logs of different lengths can be compared
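The step from the two definitions to the final formula is a cancellation of the common-word count, which can be written out as follows. The derivation assumes $L_1 \cap L_2$ is non-empty, since $O$ is otherwise undefined. The numbers in the last line are purely illustrative and do not come from the slides.

```latex
\begin{align*}
S(l_1, l_2) &= P(l_1, l_2) \cdot O(l_1, l_2) \\
            &= \frac{2\,|L_1 \cap L_2|}{|L_1| + |L_2|}
               \cdot \frac{\mathrm{SeqSim}(l_1, l_2)}{|L_1 \cap L_2|} \\
            &= \frac{2\,\mathrm{SeqSim}(l_1, l_2)}{|L_1| + |L_2|} \\[4pt]
\text{e.g. } |L_1| = 5,\ |L_2| = 7,\ \mathrm{SeqSim} = 5
            &\;\Rightarrow\; S = \frac{2 \cdot 5}{5 + 7} = \frac{10}{12} \approx 0.83
\end{align*}
```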

  8. Text Similarity Based Approach (2)
     - Using the Longest Common Subsequence (LCS) to define $\mathrm{SeqSim}(l_1, l_2)$
       - $S(l_1, l_2) = \frac{2\,|\mathrm{LCS}(l_1, l_2)|}{|L_1| + |L_2|}$
       - same pattern if $S(l_1, l_2) \ge t$, where $t$ is the predefined threshold
     - The process of the improved log pattern extracting algorithm
       - set the threshold value $t$; set the initial log pattern set $P$ to the empty set
       - for each new log $l$ from the input log set $L$, compute $S_i(l, p_i)$ between $l$ and every $p_i \in P$ using an LCS algorithm
       - if no $S_i(l, p_i) \ge t$, add $l$ to $P$
       - after all logs in $L$ have been checked, return $P$
     - Increases the time cost of a single comparison
       - but reduces the total number of comparisons
       - can be offset by choosing a better LCS algorithm
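A minimal Python sketch of the improved procedure follows. It uses the textbook dynamic-programming LCS, which is only one possible choice; the slides do not say which LCS algorithm was used. Function names and whitespace tokenization are assumptions, and $|L_1|$, $|L_2|$ are taken here as the word counts of the two logs, which is an interpretation of the slide's "word sets".

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists.

    Classic O(len(a) * len(b)) dynamic programme, keeping one row at a time.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(log_a: str, log_b: str) -> float:
    """S = 2 * |LCS| / (|L1| + |L2|), with word counts as the lengths."""
    words_a, words_b = log_a.split(), log_b.split()
    if not words_a and not words_b:
        return 1.0  # assumption: two empty logs count as identical
    return 2 * lcs_length(words_a, words_b) / (len(words_a) + len(words_b))


def extract_patterns(logs, t):
    """Add a log to the pattern set only if no stored pattern has S >= t."""
    patterns = []
    for log in logs:
        if not any(similarity(log, p) >= t for p in patterns):
            patterns.append(log)
    return patterns


logs = [
    "session closed for user root",
    "session closed for user root by admin",
    "connection refused from host",
]
patterns = extract_patterns(logs, t=0.7)
# The second log shares a 5-word subsequence with the first, so
# S = 10 / 12 and it is absorbed; the third shares no words and is kept.
```

Unlike the IWR sketch, the 5-word and 7-word logs now fall into one pattern, which in turn means fewer stored patterns to compare against later.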

  9. Text Similarity Based Approach (3)
     - Experiment result
       - numbers of extracted patterns

  10. Text Similarity Based Approach (3)
      - Experiment result
        - time costs of candidate algorithms (in milliseconds)

  11. Modified Pattern Comparing Model (1)
      - The original model has a poor time cost when searching patterns
        - it has to visit patterns one by one until the matching one is found
      - Use a hashmap to accelerate matching
        - divide the pattern set into subsets by initial word
        - skip the majority of patterns, which sit in irrelevant subsets
      - Matching process:
        1. get the initial word of the log
        2. hash the word
        3. find the desired subset in the hashmap
        4. compare with the patterns in that subset
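The four matching steps map naturally onto a dictionary keyed by initial word. The sketch below is an assumption-laden illustration: `PatternIndex`, `first_word`, and the toy matcher `same_first_three_words` are hypothetical names, and the matcher is only a stand-in for whatever similarity test the real system applies. Python's `dict` performs the hashing of the key internally, which corresponds to step 2.

```python
from collections import defaultdict


def first_word(text: str) -> str:
    """Initial word of a log or pattern ('' for an empty line)."""
    words = text.split()
    return words[0] if words else ""


class PatternIndex:
    """Pattern set split into subsets keyed by the pattern's initial word."""

    def __init__(self, matches):
        self._matches = matches            # callable (log, pattern) -> bool
        self._buckets = defaultdict(list)  # initial word -> list of patterns

    def add(self, pattern: str) -> None:
        self._buckets[first_word(pattern)].append(pattern)

    def find(self, log: str):
        # Only the subset sharing the log's initial word is scanned;
        # every other subset is skipped without any comparison.
        for pattern in self._buckets.get(first_word(log), []):
            if self._matches(log, pattern):
                return pattern
        return None


def same_first_three_words(log: str, pattern: str) -> bool:
    """Stand-in matcher for the demo only."""
    return log.split()[:3] == pattern.split()[:3]


index = PatternIndex(same_first_three_words)
index.add("sshd accepted password for root")
index.add("cron job started")
hit = index.find("sshd accepted password for alice")
miss = index.find("kernel panic detected")
```

Here `hit` is the stored `sshd` pattern and `miss` is `None`, because no subset exists for the initial word `kernel`.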

  12. Modified Pattern Comparing Model (2)
      - This approach cannot deal with patterns with unfixed initials
        - build a separate unfixed pattern set
      - In the real system, we split the pattern set into 4 parts:
        - fixed alert pattern set
        - unfixed alert pattern set
        - fixed normal pattern set
        - unfixed normal pattern set
      - When a new log arrives, it is compared against the 4 sets in turn to decide how to process it
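One way the four-set lookup might be arranged is sketched below. Several things here are assumptions rather than facts from the slides: the sets are visited in the order they are listed, fixed sets are dictionaries keyed by initial word while unfixed sets are plain lists scanned in full, patterns use `*` as a wildcard word, and the returned labels `"alert"`, `"normal"`, and `"unknown"` stand in for the unspecified processing methods.

```python
def wildcard_match(log: str, pattern: str) -> bool:
    """Hypothetical matcher: same word count, '*' matches any single word."""
    log_words, pattern_words = log.split(), pattern.split()
    return len(log_words) == len(pattern_words) and all(
        p == "*" or p == w for w, p in zip(log_words, pattern_words)
    )


def classify(log, fixed_alert, unfixed_alert, fixed_normal, unfixed_normal,
             matches=wildcard_match):
    """Compare a log against the four pattern sets in turn."""
    words = log.split()
    first = words[0] if words else ""
    stages = [
        ("alert", fixed_alert.get(first, [])),    # hashed by initial word
        ("alert", unfixed_alert),                 # scanned in full
        ("normal", fixed_normal.get(first, [])),  # hashed by initial word
        ("normal", unfixed_normal),               # scanned in full
    ]
    for label, candidates in stages:
        if any(matches(log, p) for p in candidates):
            return label
    return "unknown"


fixed_alert = {"kernel": ["kernel panic *"]}
unfixed_alert = ["* authentication failure"]
fixed_normal = {"cron": ["cron job *"]}
unfixed_normal = ["* heartbeat ok"]
sets = (fixed_alert, unfixed_alert, fixed_normal, unfixed_normal)

results = [
    classify("kernel panic now", *sets),
    classify("pam_unix authentication failure", *sets),
    classify("cron job started", *sets),
    classify("node7 heartbeat ok", *sets),
    classify("something else entirely", *sets),
]
```

With these toy sets, `results` is `["alert", "alert", "normal", "normal", "unknown"]`. Only the unfixed sets need a full scan, so keeping them small preserves most of the benefit of the hashed lookup.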

  13. Modified Pattern Comparing Model (3)
      - Real time cost comparison between original & modified models
        [Figure: four bar charts, one per log type (cron, maillog, secure, messages), showing time cost in milliseconds for the original model vs. the modified model]

  14. Summary & Future Work
      - Log patterns: used to build log recognition
      - The IWR algorithm cannot match logs of different lengths
      - Use the ideas of text similarity and LCS to improve the algorithm
      - Modify the log comparing model to accelerate the process
      - Future work: log pattern based analyses in CNGrid
        - log pattern associations
        - log flow feature modeling
