Improvement of Log Pattern Extracting Algorithm Using Text - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Improvement of Log Pattern Extracting Algorithm Using Text Similarity

ZHAO Yining

Computer Network Information Center, Chinese Academy of Sciences in HPBDC18, 2018/05/21

slide-2
SLIDE 2

Content

v CNGrid & LARGE
v Why Log Patterns & Extracting Algorithm
v Algorithm of Identical Word Rate
v Text Similarity Based Approach

Ø Improved Extracting Formation & LCS
Ø Experiment Result

v Modified Log Comparing Model
v Summary & Future Work

slide-3
SLIDE 3

CNGrid & LARGE

v China National HPC Environment

2 Operating Centers ( Beijing / Hefei )
19 Sites ( 200PF + 162PB )
Portal with Micro-Service Architecture
Application oriented Global Scheduling & Predicting
Resource Evaluation Standard & Comprehensive Evaluation Index

slide-4
SLIDE 4

CNGrid & LARGE

v Log Analyzing fRamework in Grid Environment

slide-5
SLIDE 5

Log Patterns & Extracting Algorithm

v We want to be alerted for logs in certain patterns, but…

Ø too many logs for a human to read
Ø need to summarize patterns before defining alert rules

v Set of log patterns in our context:

Ø patterns are different from each other
Ø covering all logs in the original set
Ø significantly fewer than the original logs

v The process of using log patterns

Ø filter and remove frequent normal logs
Ø use log pattern extraction algorithms to get the set of patterns
Ø manually check the set and pick out abnormal patterns
Ø define rules to generate alerts for these patterns

slide-6
SLIDE 6

Algorithm of Identical Word Rate

v Algorithm of identical word rate – a straightforward way

Ø identical words

  • 2 words that are identical
  • and in the same position in 2 original logs

Ø identical word rate

  • (number of identical words) / (total words)
  • predefined threshold t
  • If IWR is greater than t, the two logs are in one pattern

v Process of algorithm of IWR

Ø set threshold t and initial empty pattern set P
Ø for each new incoming log, compute IWR with each pattern in P
Ø if a pattern matched, skip to the next log; if none matched, add the log to P

v Significant Limitation

Ø Logs of different lengths have an IWR of ZERO!
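The IWR procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes whitespace tokenization, uses the first log of each pattern as that pattern's representative, and the function names are hypothetical. Logs of different lengths are given an IWR of zero, matching the limitation stated above.

```python
def identical_word_rate(log_a: str, log_b: str) -> float:
    """Share of word positions that hold the same word in both logs."""
    words_a, words_b = log_a.split(), log_b.split()
    # Assumption: logs of different lengths score zero, as the slide states.
    if len(words_a) != len(words_b) or not words_a:
        return 0.0
    identical = sum(1 for a, b in zip(words_a, words_b) if a == b)
    return identical / len(words_a)


def extract_patterns_iwr(logs, t):
    """Return the pattern set P built with threshold t."""
    patterns = []  # initial empty pattern set P
    for log in logs:
        # A log joins an existing pattern if its IWR with it is greater than t.
        if not any(identical_word_rate(log, p) > t for p in patterns):
            patterns.append(log)  # none matched: the log becomes a new pattern
    return patterns
```

With t = 0.5, "session opened for user root" and "session opened for user alice" fall into one pattern, while a longer variant such as "session closed for user root by admin" always starts a new pattern, whatever its content.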

slide-7
SLIDE 7

Text Similarity Based Approach (1)

v Using Text Similarity to resolve the problem

Ø S = P × O
Ø S: similarity, P: proportion of common words, O: order factor

v For two logs l1 and l2, let L1 and L2 be their respective word sets

Ø define P: P(l1, l2) = ( |L1 ∩ L2| × 2) / ( |L1| + |L2| )
Ø define O: O(l1, l2) = SeqSim(l1, l2) / |L1 ∩ L2|
Ø hence S: S(l1, l2) = (SeqSim(l1, l2) × 2) / (|L1| + |L2|)

v By this, logs of different lengths can be compared
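Writing out the product makes the last step explicit: the common-word count appears once in the numerator of P and once in the denominator of O, so it cancels (assuming the two logs share at least one word, so that the count is nonzero).

```latex
\begin{align*}
S(l_1, l_2) &= P(l_1, l_2) \times O(l_1, l_2) \\
            &= \frac{2\,|L_1 \cap L_2|}{|L_1| + |L_2|}
               \times \frac{\mathrm{SeqSim}(l_1, l_2)}{|L_1 \cap L_2|} \\
            &= \frac{2\,\mathrm{SeqSim}(l_1, l_2)}{|L_1| + |L_2|}
\end{align*}
```

As a purely illustrative case with made-up numbers: for |L1| = 5, |L2| = 7 and SeqSim(l1, l2) = 5, the similarity is S = 10 / 12 ≈ 0.83, whereas the IWR of two such logs would be zero.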

slide-8
SLIDE 8

Text Similarity Based Approach (2)

v Using Longest Common Subsequence to define SeqSim(l1,l2)

Ø S(l1, l2) = ( |LCS(l1, l2)| × 2) / ( |L1| + |L2| )
Ø Same pattern if S(l1, l2) ≥ t, where t is the predefined threshold

v The process of improved log pattern extracting algorithm

Ø set the threshold value t and set the initial log pattern set P to be an empty set
Ø for each new log l from the input log set L, compute Si(l, pi) between l and every pi ∈ P using an LCS algorithm
Ø if there is no Si(l, pi) ≥ t, add l to P
Ø after all logs in L have been checked, return P

v Increases the time cost of a single comparison

Ø but reduces the total number of comparisons
Ø the extra cost can be offset by choosing a better LCS algorithm
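A minimal Python sketch of the process above follows. It is an illustration under assumptions, not the presented implementation: logs are split on whitespace, |L1| and |L2| are taken as the word counts of the two logs, and the LCS length is computed with the textbook dynamic-programming method rather than whichever faster LCS algorithm the authors chose. All function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)  # DP row for the previous word of a
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(log_a, log_b):
    """S = 2 * |LCS| / (|L1| + |L2|), with lengths counted in words."""
    words_a, words_b = log_a.split(), log_b.split()
    total = len(words_a) + len(words_b)
    if total == 0:
        return 1.0  # assumption: two empty logs are treated as identical
    return 2 * lcs_length(words_a, words_b) / total


def extract_patterns(logs, t):
    """Return the pattern set P for the input log set and threshold t."""
    patterns = []
    for log in logs:
        # Add the log to P only if no existing pattern reaches the threshold.
        if not any(similarity(log, p) >= t for p in patterns):
            patterns.append(log)
    return patterns
```

Unlike the IWR version, "session opened for user root" and "session opened for user root by admin" now score 10 / 12 and share a pattern at t = 0.8. The dynamic-programming LCS costs time proportional to the product of the two word counts, which is the per-comparison overhead the slide refers to.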

slide-9
SLIDE 9

Text Similarity Based Approach (3)

v Experiment result

Ø numbers of extracted patterns

slide-10
SLIDE 10

Text Similarity Based Approach (3)

v Experiment result

Ø time costs of candidate algorithms (in milliseconds)

slide-11
SLIDE 11

Modified Pattern Comparing Model (1)

v The original model has a high time cost when searching patterns

Ø has to visit patterns one by one until a matching one is found

v Use a hashmap to accelerate the matching

Ø divide the pattern set into subsets by initial words
Ø skip the majority of patterns in irrelevant subsets

v Matching process:

1. get the initial word of the log
2. hash the word
3. find the desired subset in the hashmap
4. compare with patterns in the subset
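The four matching steps can be sketched with a Python dictionary, which hashes its keys internally and so covers steps 2 and 3 in a single lookup. This is a sketch under assumptions: the class name `PatternIndex`, the `comparisons` counter and the compact LCS-based similarity helper are illustrative and not taken from the slides.

```python
from collections import defaultdict


def lcs_similarity(log_a, log_b):
    """S = 2 * |LCS| / (|L1| + |L2|) over whitespace-separated words."""
    a, b = log_a.split(), log_b.split()
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    total = len(a) + len(b)
    return 2 * prev[-1] / total if total else 1.0


class PatternIndex:
    """Pattern set divided into subsets keyed by the initial word."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.subsets = defaultdict(list)  # initial word -> list of patterns
        self.comparisons = 0              # counts similarity computations

    @staticmethod
    def _initial_word(log):
        words = log.split()
        return words[0] if words else ""

    def add(self, pattern):
        self.subsets[self._initial_word(pattern)].append(pattern)

    def match(self, log):
        # Steps 1-3: take the initial word and look up its subset.
        # .get avoids creating empty subsets for unseen initial words.
        candidates = self.subsets.get(self._initial_word(log), [])
        # Step 4: compare only with the patterns in that subset.
        for pattern in candidates:
            self.comparisons += 1
            if lcs_similarity(log, pattern) >= self.threshold:
                return pattern
        return None
```

A log whose initial word has no subset is rejected without a single similarity computation, which is where the saving over a linear scan of all patterns comes from.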

slide-12
SLIDE 12

Modified Pattern Comparing Model (2)

v This approach cannot deal with patterns with unfixed initials

Ø build an unfixed pattern set

v In the real system, we split the pattern set into 4 parts:

Ø fixed alert pattern set
Ø unfixed alert pattern set
Ø fixed normal pattern set
Ø unfixed normal pattern set

v When a new log comes, it is compared against the 4 sets in turn to decide how it is processed
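One way to picture the four-set lookup is the sketch below. Several things here are assumptions rather than details from the slides: the sets are visited in the order listed above, fixed sets are dictionaries keyed by initial word while unfixed sets are plain lists scanned in full, and the outcome labels "alert", "normal" and "unknown" are hypothetical. The `matches` predicate stands in for the similarity test S ≥ t, and `tail_equal` is only a toy replacement for it.

```python
def find_in_fixed(log, fixed_set, matches):
    """Look only at the subset that shares the log's initial word."""
    words = log.split()
    key = words[0] if words else ""
    for pattern in fixed_set.get(key, []):
        if matches(log, pattern):
            return pattern
    return None


def find_in_unfixed(log, unfixed_set, matches):
    """Patterns with unfixed initials cannot be keyed, so scan them all."""
    for pattern in unfixed_set:
        if matches(log, pattern):
            return pattern
    return None


def classify(log, fixed_alert, unfixed_alert, fixed_normal, unfixed_normal, matches):
    """Compare the log against the four sets in turn."""
    if find_in_fixed(log, fixed_alert, matches) is not None:
        return "alert"
    if find_in_unfixed(log, unfixed_alert, matches) is not None:
        return "alert"
    if find_in_fixed(log, fixed_normal, matches) is not None:
        return "normal"
    if find_in_unfixed(log, unfixed_normal, matches) is not None:
        return "normal"
    return "unknown"  # matches no known pattern


def tail_equal(log, pattern):
    """Toy stand-in for S >= t: every word after the first must agree."""
    return log.split()[1:] == pattern.split()[1:]
```

For example, with "alice exceeded disk quota" in the unfixed alert set, the log "bob exceeded disk quota" is classified as an alert even though its initial word has no entry in either fixed set.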

slide-13
SLIDE 13

Modified Pattern Comparing Model (3)

v Real time cost comparison between original & modified models

[Figure: four charts, one each for the cron, maillog, secure and messages logs, comparing the time cost (in milliseconds) of the original model and the modified model]

slide-14
SLIDE 14

Summary & Future Work

v Log patterns: used to build log recognition
v Algorithm of IWR cannot match logs of different lengths
v Using the idea of text similarity and LCS to improve the algorithm
v Modify log comparing model to accelerate the process
v Future work: log pattern based analyses in CNGrid

Ø log pattern associations
Ø log flow feature modeling