A Semantic-aware Representation Framework for Online Log Analysis

SLIDE 1

A Semantic-aware Representation Framework for Online Log Analysis

Weibin Meng, Ying Liu, Yuheng Huang, Shenglin Zhang, Federico Zaiter, Bingjin Chen, Dan Pei

Weibin Meng, 2020/8/28

SLIDE 2

Outline

1. Background
2. Design
3. Evaluation
4. Summary

SLIDE 3

Background

SLIDE 4

Internet Services

  • Growing rapidly
  • Stability is important
  • Various types of services

SLIDE 5

Logs

■ Monitoring data: logs, traffic, PV
■ Logs are among the most valuable data for service management
  • General: every service generates logs
  • Diverse: logs record a vast range of runtime information (7*24)

SLIDE 6

Logs

■ Logs are unstructured text
  • designed by developers
  • printed by logging statements (e.g., printf())

  • L1. Interface ae3, changed state to down
  • L2. Interface ae3, changed state to up
  • L3. Interface ae1, changed status to down
  • L4. Interface ae1, changed status to up
  • L5. Vlan-interface vlan22, changed state to down
  • L6. Vlan-interface vlan22, changed state to up

Logs are similar to natural language

SLIDE 7

Manual inspection of logs

■ Manual inspection of logs is infeasible
  • A large-scale service is often implemented and maintained by hundreds of developers and operators.
  • The volume of logs is growing rapidly.
  • The traditional way is labor-intensive and time-consuming.

Hence: automatic log analysis

SLIDE 8

Automatic log analysis

■ Automatic log analysis approaches, which are employed for service management, have been widely studied:
  • Monitoring [INFOCOM’19]
  • Problem identification [FSE’18]
  • Failure prediction [SIGMETRICS’18]
  • Anomaly detection [CCS’17]

SLIDE 9

Log representation

■ Most automatic log analysis approaches require structured input
■ Logs are unstructured text
■ Log representation serves as the first step of automatic log analysis
  • Template index
  • Template count vector
■ These representations lose semantic information, which motivates a semantic-aware log representation approach
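
A minimal sketch of the two template-based representations (template index and template count vector), applied to the six example logs from the Logs slide. The `to_template` helper is a hypothetical stand-in for a real template extractor: it only masks tokens that contain a digit, and it is not the method of any tool cited in this deck.

```python
import re
from collections import Counter

# Example logs copied from the "Logs" slide.
LOGS = [
    "Interface ae3, changed state to down",
    "Interface ae3, changed state to up",
    "Interface ae1, changed status to down",
    "Interface ae1, changed status to up",
    "Vlan-interface vlan22, changed state to down",
    "Vlan-interface vlan22, changed state to up",
]

def to_template(log):
    """Hypothetical extractor: replace every token containing a digit with '*'."""
    return " ".join("*" if re.search(r"\d", tok) else tok for tok in log.split())

def template_indices(logs):
    """Assign each distinct template an integer index in order of first appearance."""
    index = {}
    ids = []
    for log in logs:
        template = to_template(log)
        ids.append(index.setdefault(template, len(index)))
    return index, ids

def count_vector(ids, n_templates):
    """Count how often each template index occurs in a window of logs."""
    counts = Counter(ids)
    return [counts[i] for i in range(n_templates)]

index, ids = template_indices(LOGS)
vector = count_vector(ids, len(index))
print(ids)
print(vector)
```

Each log here maps to its own integer, so the index alone carries no hint that `state` and `status` are near-synonyms or that `down` and `up` are opposites. That is the semantic information these representations lose.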

SLIDE 10

Challenges

1. Domain-specific semantic information
  • Logs contain lots of domain-specific words
2. Out-of-vocabulary (OOV) words
  • The vocabulary grows continuously because the service can be upgraded to add new features and fix bugs

SLIDE 11

Idea

■ Logs are designed by developers and “printf”-ed by services
■ Original goal of logs: “logs are for users to read”
■ The intuition and methods of NLP can be applied to log representation

Log2Vec

SLIDE 12

Design

SLIDE 13

Overview of Log2Vec

[Figure: overview diagram with an offline stage and an online stage. Labeled elements: historical logs, Syns & Ants, triples, word embedding, vocabulary, word vectors, OOV word processor, real-time logs, log vectors.]

  • 1. Log-specific word embedding
  • 2. Out-of-vocabulary word processor
  • 3. Log vector generation

Open source toolkit: https://github.com/WeibinMeng/Log2Vec

SLIDE 14

Log-specific semantics

■ When embedding the words of logs, we should consider several kinds of information:
  • Antonyms
  • Synonyms
  • Relation triples
  • Others (future work)
■ Traditional word embedding methods (e.g., word2vec) assume that words with similar contexts tend to have similar meanings
  • This assumption fails to capture log-specific meaning
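
A small illustration of why the similar-context assumption breaks down on logs, using the example logs from the Logs slide. It does not train word2vec; it only collects the set of neighboring words within a window of 2 (an assumed window size) for each word, as a rough proxy for the contexts a CBOW-style model would see.

```python
from collections import defaultdict

# Example logs copied from the "Logs" slide.
LOGS = [
    "Interface ae3, changed state to down",
    "Interface ae3, changed state to up",
    "Interface ae1, changed status to down",
    "Interface ae1, changed status to up",
    "Vlan-interface vlan22, changed state to down",
    "Vlan-interface vlan22, changed state to up",
]

def context_sets(logs, window=2):
    """Map each word to the set of words seen within `window` positions of it."""
    contexts = defaultdict(set)
    for log in logs:
        tokens = log.lower().replace(",", "").split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            contexts[word].update(tokens[lo:i] + tokens[i + 1:hi])
    return contexts

ctx = context_sets(LOGS)
print(sorted(ctx["down"]))
print(sorted(ctx["up"]))
print(ctx["down"] == ctx["up"])
```

In these logs `down` and `up` occur in exactly the same contexts, so a purely context-driven embedding has no signal to separate them even though they are antonyms. This is the gap that the antonym, synonym, and triple information is meant to close.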

SLIDE 15

Prepare log-specific information

[Figure: Historical logs, Syns & Ants, Triples]

■ Automatically extract
  • Antonyms & synonyms: search WordNet [1], a lexical database for English
  • Triples: dependency tree [2]
■ Manually modify

Relations   Word pairs                     Adding methods
Synonyms    Interface, port                Operators
Antonyms    DOWN, UP                       WordNet
Antonyms    powerDown, powerUp             Operators
Relations   (interface, changed, state)    Dependency tree

[1] Fellbaum C. WordNet. The Encyclopedia of Applied Linguistics, 2012.
[2] Culotta A, Sorensen J. Dependency tree kernels for relation extraction. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 423–429, 2004.

SLIDE 16

Log-specific word embedding

■ Log-specific word embedding combines two existing methods:
  • Lexical Information Word Embedding (LWE) [1] -> antonyms & synonyms
  • Semantic Word Embedding (SWE) [2] -> relation triples

[Figure: LWE and SWE share the embedding with CBOW (a model of word2vec)]

[1] Luchen Tan, Haotian Zhang, Charles Clarke, and Mark Smucker. Lexical comparison between Wikipedia and Twitter corpora by using word embeddings. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 657–661, 2015.
[2] Quan Liu, Hui Jiang, Si Wei, Zhen-Hua Ling, and Yu Hu. Learning semantic word embeddings based on ordinal knowledge constraints. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1501–1511, 2015.

SLIDE 17

OOV processor

■ We adopt MIMICK [3] to handle OOV words at runtime.
  • MIMICK learns a function from spelling to distributional embeddings.

[3] Yuval Pinter, Robert Guthrie, and Jacob Eisenstein. Mimicking word embeddings using subword RNNs. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 102–112, 2017.

SLIDE 18

Log vector generation (Online stage)

  1. Determine whether each word in the log is in the vocabulary
  2. Convert in-vocabulary words to their word vectors
  3. Assign a new embedding vector to each OOV word
  4. Calculate the log vector by averaging its word vectors
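
The four online steps can be sketched as follows. This is an illustration under stated assumptions, not the Log2Vec toolkit's code: the tokenizer, the toy two-dimensional vocabulary, and the `zero_oov` callback are all hypothetical. In particular, `zero_oov` is only a placeholder for the OOV processor; it does not implement MIMICK.

```python
from typing import Callable, Dict, List

Vector = List[float]

def log_to_vector(log: str,
                  vocab: Dict[str, Vector],
                  embed_oov: Callable[[str], Vector]) -> Vector:
    """Average the word vectors of a log, asking `embed_oov` for unknown words."""
    tokens = log.lower().replace(",", " ").split()  # assumed tokenizer
    if not tokens:
        raise ValueError("cannot embed an empty log")
    vectors = []
    for token in tokens:
        if token in vocab:                     # steps 1 and 2: vocabulary lookup
            vectors.append(vocab[token])
        else:                                  # step 3: new vector for an OOV word
            vectors.append(embed_oov(token))
    dim = len(vectors[0])
    # Step 4: element-wise mean of the word vectors.
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# Toy vocabulary with made-up 2-dimensional vectors.
TOY_VOCAB = {"interface": [1.0, 0.0], "down": [0.0, 1.0]}

def zero_oov(word: str) -> Vector:
    """Placeholder OOV handler (NOT MIMICK): returns a zero vector."""
    return [0.0, 0.0]

vec = log_to_vector("Interface ae3, down", TOY_VOCAB, zero_oov)
print(vec)
```

Passing the OOV handler as a callback keeps the averaging logic independent of how OOV embeddings are produced, so a trained spelling-to-embedding model could be swapped in for the placeholder.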

SLIDE 19

Evaluation

SLIDE 20

Experimental setting

■ Datasets:

Datasets    Description                       # of logs
HPC         High performance cluster          433,489
HDFS        Hadoop distributed file system    11,175,629
ZooKeeper   ZooKeeper service                 74,380
Hadoop      Hadoop MapReduce job              394,308

■ Experimental setup:
  • Linux server with Intel Xeon 2.40 GHz CPU

SLIDE 21

Measurement of OOV

■ To highlight the challenge of processing OOV words:
  • Generate training sets containing 10% to 90% of the original logs, and use the remaining logs as the testing set

[Figures: Measurements of OOV words; Measurement of logs with OOV words]

■ OOV words make up a large percentage when training on a smaller sample
■ In Spark and Windows, more than 90% of the logs contain OOV words at every training percentage

It’s important to handle OOV words
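
A minimal sketch of how the two OOV measurements could be computed for one training percentage. The slide does not say how the split is drawn, so this sketch assumes the first part of the log sequence is used for training; the tokenizer and the function names are also assumptions. The six example logs from the Logs slide serve as toy data.

```python
# Example logs copied from the "Logs" slide, used here as toy data.
LOGS = [
    "Interface ae3, changed state to down",
    "Interface ae3, changed state to up",
    "Interface ae1, changed status to down",
    "Interface ae1, changed status to up",
    "Vlan-interface vlan22, changed state to down",
    "Vlan-interface vlan22, changed state to up",
]

def tokenize(log):
    """Assumed tokenizer: lowercase, drop commas, split on whitespace."""
    return log.lower().replace(",", " ").split()

def oov_stats(logs, train_fraction):
    """Return (share of distinct test words that are OOV, share of test logs with an OOV word)."""
    split = int(len(logs) * train_fraction)
    train, test = logs[:split], logs[split:]   # assumed: prefix of the sequence is the training set
    vocab = {tok for log in train for tok in tokenize(log)}
    test_words = {tok for log in test for tok in tokenize(log)}
    oov_words = test_words - vocab
    logs_with_oov = sum(
        1 for log in test if any(tok not in vocab for tok in tokenize(log))
    )
    word_ratio = len(oov_words) / len(test_words) if test_words else 0.0
    log_ratio = logs_with_oov / len(test) if test else 0.0
    return word_ratio, log_ratio

word_ratio, log_ratio = oov_stats(LOGS, 0.5)
print(word_ratio, log_ratio)
```

Repeating the call for fractions from 0.1 to 0.9 would give one point per training percentage for each of the two measurements.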

SLIDE 22

Evaluation of OOV processor

■ Randomly select a word in each log
■ Change one of its letters to turn the word into an OOV word
■ Test the similarity between the changed log and the original log

Average similarity when Log2Vec processes logs with OOV words:

Dataset      Spark   HDFS    Windows   Hadoop
Similarity   0.964   0.984   0.993     0.996

[Figure: Distribution of logs’ similarity]
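
A sketch of the perturbation test described on this slide. The slide does not name the similarity measure, so cosine similarity is an assumption here, and the helper names are hypothetical. Producing the two log vectors is left out: in the real evaluation they would come from Log2Vec, once for the original log and once for the changed log.

```python
import math
import random
import string

def mutate_one_letter(word, rng):
    """Replace one letter of `word` with a different lowercase letter."""
    letter_positions = [i for i, ch in enumerate(word) if ch.isalpha()]
    positions = letter_positions or list(range(len(word)))
    pos = rng.choice(positions)
    candidates = [c for c in string.ascii_lowercase if c != word[pos].lower()]
    return word[:pos] + rng.choice(candidates) + word[pos + 1:]

def perturb_log(log, rng):
    """Pick one random word of the log and change one of its letters."""
    tokens = log.split()
    i = rng.randrange(len(tokens))
    tokens[i] = mutate_one_letter(tokens[i], rng)
    return " ".join(tokens)

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

rng = random.Random(0)  # fixed seed so the perturbation is repeatable
original = "Interface ae3, changed state to down"
changed = perturb_log(original, rng)
print(changed)
```

A similarity close to 1 between the two log vectors would indicate that the embedding assigned to the misspelled word stays close to that of the original word.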

SLIDE 23

Log-based service management task

■ Online log classification
  • Baselines: LogSig, FT-tree, Spell, template2Vec
  • Data split: 50% training set and 50% testing set

[Figure: Comparison of log classification when using 50% of the logs for training]

■ The average F-score of Log2Vec is 0.944
■ The average F-score of the baselines is 0.745
■ Log2Vec is stable

SLIDE 24

Summary

SLIDE 25

Summary

■ Log2Vec: a semantic-aware representation framework for online log analysis
■ OOV processor: a mechanism for generating OOV word embeddings when new types of logs appear
■ Open-source toolkit: we have open-sourced Log2Vec
■ Experiments: in online log classification, Log2Vec reaches an average F-score of 0.944, versus 0.745 for the baselines

SLIDE 26

Thanks

mwb16@mails.tsinghua.edu.cn

Open source toolkit: https://github.com/WeibinMeng/Log2Vec