Learn more from your logfiles using machine learning



SLIDE 1

Learn more from your logfiles using machine learning

[DEV1156] Adam.Spiers@suse.com Dirk.Mueller@suse.com

CC BY-NC 2.0 Thomas Hawk

SLIDE 2

We are SUSE OpenStack Cloud software engineers

SLIDE 3

We love green CI

SLIDE 4

We care about upstream OpenStack CI too

SLIDE 5

OpenStack Health

SLIDE 6

?

SLIDE 7

Did you find it?

SLIDE 8

Manual Process

SLIDE 9

Idea: Reducing scrolling by pattern matching

warning /(?i)warning/
error /Traceback \(most recent call last\)/
error /(?i)error/
error /(?i)\bfail(ure|ed)?\b/
error /(?i)fatal/
error /$h1!!/
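Each rule on this slide pairs a severity with a regular expression. A minimal Python sketch of how such rules could be applied to a log, assuming the first matching rule wins. The `RULES` table uses a subset of the slide's patterns, and `classify_line` and `highlight` are hypothetical helper names, not part of any tool shown in the talk:

```python
import re

# Severity/regex pairs copied from the slide (a subset).
RULES = [
    ("warning", re.compile(r"(?i)warning")),
    ("error", re.compile(r"Traceback \(most recent call last\)")),
    ("error", re.compile(r"(?i)error")),
    ("error", re.compile(r"(?i)\bfail(ure|ed)?\b")),
    ("error", re.compile(r"(?i)fatal")),
]


def classify_line(line):
    """Return the severity of the first matching rule, or None."""
    for severity, pattern in RULES:
        if pattern.search(line):
            return severity
    return None


def highlight(lines):
    """Keep only (line number, severity, text) for lines that match a rule."""
    hits = []
    for number, line in enumerate(lines, start=1):
        severity = classify_line(line)
        if severity is not None:
            hits.append((number, severity, line))
    return hits


log = [
    "Installing package foo",
    "WARNING: disk almost full",
    "Tests passed",
    "Build failed with exit code 1",
]
for number, severity, text in highlight(log):
    print(f"{number}: [{severity}] {text}")
```

Only the flagged lines are printed, which is the "less scrolling" effect the slide is after.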

SLIDE 10
SLIDE 11
SLIDE 12
SLIDE 13

Dealing with false positives

# Successful tempest run
ok /^ - (Expected Fail|Failed): 0$/
ok /Warning: Turning on '--gpg-auto-import-keys'/
ok /Warning: Permanently added .* to the list of known hosts/
ok /WARNING: Device for PV .* not found or rejected by a filter/
ok /WARNING: \w+ signature detected on .* offset \d+. Wipe it?/
ok /grep -v failed\b/

# rpms containing "Error"
ok /perl-Error[ -]|libsamba-errors|mariadb-errormessages/

# https://bugzilla.suse.com/show_bug.cgi?id=1030822
warning /Cleaning up (vip-admin-\S+) on \S+, removing fail-count-\1/

# https://bugzilla.suse.com/show_bug.cgi?id=971832
ok /Failed to try-restart vsftpd@.service: Unit name vsftpd@.service is not va
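One way to read this configuration is that the allow-list rules are consulted before the generic warning and error rules, so a known-harmless line is never flagged. The talk does not show how the rules are evaluated, so that ordering is an assumption, and `OK_RULES`, `FLAG_RULES` and `classify` are hypothetical names. A sketch:

```python
import re

# Allow-list patterns taken from the slide; they suppress known false positives.
OK_RULES = [
    re.compile(r"^ - (Expected Fail|Failed): 0$"),
    re.compile(r"Warning: Permanently added .* to the list of known hosts"),
    re.compile(r"perl-Error[ -]|libsamba-errors|mariadb-errormessages"),
]

# Generic rules that would otherwise flag the lines above.
FLAG_RULES = [
    ("warning", re.compile(r"(?i)warning")),
    ("error", re.compile(r"(?i)error")),
    ("error", re.compile(r"(?i)\bfail(ure|ed)?\b")),
]


def classify(line):
    """Return "ok" for allow-listed lines, else the first matching severity."""
    if any(rule.search(line) for rule in OK_RULES):
        return "ok"
    for severity, rule in FLAG_RULES:
        if rule.search(line):
            return severity
    return None


for sample in [" - Failed: 0", " - Failed: 3", "Installing: perl-Error-0.17029"]:
    print(repr(sample), "->", classify(sample))
```

Every new harmless message needs another hand-written pattern, which is the maintenance burden that motivates the machine learning approach in the rest of the talk.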
SLIDE 14

Vision: Machine Learning

SLIDE 15

Log-Classify

SLIDE 16

Today's plan

Intro to Machine Learning Log-Classify Implementation Demo

SLIDE 17

AI vs ML vs DL

SLIDE 18

Why Machine Learning?

[Figure: word cloud of machine learning terms]

SLIDE 19

CI Logfiles: ML Challenges

  • Each instance of a CI logfile executes the same steps: Install, Build, Test. The result is recorded (success, failures).
  • The individual logfiles are quickly evolving: every check-in changes them 😑
  • Each run has a lot of completely unique noise 😓: timestamps, UUIDs, passwords, and ordering due to parallel execution
SLIDE 20

Learning model Variations

Instance-based

  • Directly stores instances of training data
  • Derives hypotheses directly from training instances
  • Model can quickly react to new training input
  • Model can be incrementally updated, discarding old training input
  • k-Nearest-Neighbor

Generalizing

  • Abstracts a model from training data
  • Requires a much longer training phase
  • Cannot "untrain" previously learned data
  • Artificial Neural Networks (DL)

SLIDE 21

Overfitting / Underfitting

SLIDE 22

Machine Learning Variations

Supervised

  • Classification: Naive Bayes, Nearest Neighbor, Support Vector Machines (SVM), Neural Networks, ...
  • Regression: Decision Trees, Linear Regression, Neural Networks, ...

Unsupervised

  • Clustering: K-Means, Hidden Markov Model, Neural Networks, ...

SLIDE 23

Supervised Learning: Classification

Banana

SLIDE 24

Using machine learning for CI log files

SLIDE 25

Machine Learning Workflow

  • Build: an individual CI log file
  • Baseline: collection of log files from good CI runs
  • Target: the logfile of the failed CI run to be analyzed

SLIDE 26

log-classify: Analogy using pictures

SLIDE 27

Generic Training Workflow

SLIDE 28

Generic Testing Workflow

SLIDE 29

Generic Testing Workflow

SLIDE 30

Log Input transformation example

Splitting by lines:
Mar 11 02:43:28 localhost sudo[5195]: pam_unix(sudo:session): session opened for user root by (uid=5)

Tokenization:
DATE localhost sudo pam_unix sudo session session opened for user root uid

Hashing:
hash(DATE) hash(localhost) hash(sudo) hash(pam_unix) hash(sudo) hash(session) hash(opened) ...

Transformation:
[0, ...., 0, 1, 0, ..., 0, 1, 0, ...]
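The stages on this slide can be sketched with the Python standard library alone. This is an illustration of the hashing trick, not log-classify's actual code: the tokenizer regexes, the `crc32` hash and the tiny vector width `N_FEATURES` are all assumptions chosen for readability.

```python
import re
import zlib

N_FEATURES = 64  # tiny on purpose; real feature vectors would be much wider


def tokenize(line):
    """Replace the syslog timestamp with DATE and split into word tokens."""
    line = re.sub(r"^[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d", "DATE", line)
    # Keep alphabetic words (underscores allowed); drop digits and punctuation.
    return re.findall(r"[A-Za-z_]+", line)


def hash_token(token):
    """Map a token to a stable bucket index (crc32 is only illustrative)."""
    return zlib.crc32(token.encode("utf-8")) % N_FEATURES


def vectorize(line):
    """Turn one log line into a fixed-size 0/1 feature vector."""
    vector = [0] * N_FEATURES
    for token in tokenize(line):
        vector[hash_token(token)] = 1
    return vector


line = ("Mar 11 02:43:28 localhost sudo[5195]: pam_unix(sudo:session): "
        "session opened for user root by (uid=5)")
print(tokenize(line))
print(vectorize(line))
```

Because the timestamp and the PID never reach the vector, the same message logged at a different time maps to the same vector, which is what makes lines comparable across runs. This simple tokenizer keeps the word "by", which the token list on the slide omits.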

SLIDE 31

Input transformation: Replace irrelevant pieces with fixed strings

Token   Raw text
DATE    months/days/date
RNGU    UUIDs
RNGI    IPv4 or IPv6 addresses
RNGN    words that are exactly 32, 64 or 128 chars
RNGD    numbers of at least 3 digits
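A rough sketch of this kind of replacement using regular expressions. The patterns below are assumptions written for illustration: they cover UUIDs, IPv4 addresses, fixed-length words and long numbers, and leave out the DATE and IPv6 cases. The regexes log-classify really uses are not shown on the slide.

```python
import re

# Order matters: more specific patterns run first so that, for example,
# a UUID is not partly rewritten by the plain-number rule.
REPLACEMENTS = [
    (re.compile(r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b"), "RNGU"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "RNGI"),  # IPv4 only
    (re.compile(r"\b(?:\w{128}|\w{64}|\w{32})\b"), "RNGN"),
    (re.compile(r"\b\d{3,}\b"), "RNGD"),
]


def normalize(line):
    """Replace run-specific noise in a log line with fixed placeholder tokens."""
    for pattern, token in REPLACEMENTS:
        line = pattern.sub(token, line)
    return line


print(normalize(
    "instance d7046aa3-e885-4ed6-80e7-d7a7eff9f883 on 192.168.0.12 port 8080"
))
```

After normalization, two runs that differ only in IDs, addresses or counters produce identical lines, so the noise no longer looks like novelty.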

SLIDE 32

Example matrix of a CI logfile

SLIDE 33

k-Nearest Neighbors (k=1)

SLIDE 34

Example distance calculation in kNeighbors queries

  • VARIABLE IS NOT DEFINED is not part of the baseline
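To make the k=1 query concrete, here is a small sketch that scores each target line by its distance to the closest baseline line. It assumes binary bag-of-words vectors (represented as token sets) and cosine distance. The slide does not state which features or metric are used, and `tokens`, `cosine_distance` and `nearest_distance` are hypothetical names.

```python
import math


def tokens(line):
    """Lower-cased set of whitespace-separated tokens (binary bag of words)."""
    return set(line.lower().split())


def cosine_distance(a, b):
    """Cosine distance between two binary vectors represented as token sets."""
    if not a or not b:
        return 1.0
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))


def nearest_distance(target_line, baseline_lines):
    """k=1 query: distance to the closest baseline line (linear search)."""
    target = tokens(target_line)
    return min(cosine_distance(target, tokens(line)) for line in baseline_lines)


baseline = [
    "task install packages ok",
    "task run tests ok",
]
for line in ["task run tests ok", "fatal variable is not defined"]:
    print(f"{nearest_distance(line, baseline):.3f} | {line}")
```

A line that also occurs in the baseline gets distance 0, while a line sharing no tokens with any baseline line gets distance 1. Lines with a high distance are the ones worth reporting.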
SLIDE 35

Limitations

  • Nearest Neighbor performs linear search in model
  • Complexity grows linearly with sample size
  • Unfiltered noise may distract from important information
  • Logs containing too many features

SLIDE 36

Unique vectors over training set instances

SLIDE 37

Lookup time per sample size

SLIDE 38

Introducing log-classify

SLIDE 39

Log-classify

  • Python 3
  • scikit-learn: http://scikit-learn.org/
  • not yet: https://www.tensorflow.org/, https://github.com/facebookresearch/pysparnn
  • Multiple Text Extraction Models
  • Assumes text, line based log-like input

SLIDE 40

scikit-learn

SLIDE 41

log-classify: Installation

  • openSUSE Leap/Tumbleweed/SLE 15 SUSE Package Hub:

$ zypper install python3-logreduce

  • Others install from PyPI:

$ pip3 install --user logreduce

NOTE:

  • log-classify is the new name
  • Rename from logreduce hasn't been completed yet

SLIDE 42

log-classify: Commands

# logreduce
...
diff        Compare directories/files
dir         Train and run against local files/dirs
dir-train   Build a model for local files/dirs
dir-run     Run a model against local files/dirs
...
job         Train and run against CI logs
...
journal     Train and run against local journald

SLIDE 43

log-classify: Assumptions

  • Baseline is built from SUCCESS input
  • Model is run against FAILURE target

SLIDE 44

log-classify: DevStack

# logreduce diff logs/good.txt logs/bad.txt
...
0.527 | bad.txt:34245: 2018-10-09 05:56:51.021261 | controller |\
 Details: {u'created': u'2018-10-09T05:11:20Z', u'code': 500,\
 u'message': u'Exceeded maximum number of retries. Exhausted \
 all hosts available for retrying build failures for instance d7046aa3-e885-4ed6-80e7-d7a7eff9f883.'}
...
97.98% reduction (from 35244 lines to 712)
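Judging by the numbers in the output, the reduction figure is the share of input lines that do not appear in the report. A short check of that reading (`reduction_percent` is a hypothetical helper, not part of the tool):

```python
def reduction_percent(total_lines, reported_lines):
    """Share of lines filtered out of the report, in percent."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    return 100.0 * (total_lines - reported_lines) / total_lines


# Line counts taken from the DevStack example.
print(f"{reduction_percent(35244, 712):.2f}% reduction")  # prints "97.98% reduction"
```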

SLIDE 45

log-classify: DevStack Model

Truncated singular value decomposition (SVD)

SLIDE 46

log-classify: Collecting baselines

$ logreduce dir-train model.clf baseline/*
Training on 8 logs took 12.090s at 1.426MB/s (20.831kl/s)
$ logreduce dir-run model.clf error.txt
Testing took 6.375s at 0.454MB/s (6.569kl/s)
99.72% reduction (from 41879 lines to 118)

SLIDE 47

log-classify: Influence of Baseline Size

SLIDE 48

log-classify: Journald

  • Extract novelty in today's logs over yesterday:

# logreduce journal --range day

  • Build a model using previous month's logs and look for novelties:

# logreduce journal-train --range month journal.clf

SLIDE 49

log-classify: journald (II)

# logreduce journal-run --range day journal.clf
...
99.76% reduction (from 19677 lines to 48)
0.730 | cron - postdrop: warning: uid=16311: File too large
...
0.000 | smartd Device: /dev/sdb, 1 Offline uncorrectable sectors

# killall -SEGV automount
# logreduce journal-run --range day journal.clf
99.75% reduction (from 19679 lines to 50)
...
0.317 | systemd - DAEMON - autofs.service: Main process exited, code=dumped, s
0.314 | systemd - DAEMON - autofs.service: Failed with result 'core-dump'.

SLIDE 50

log-classify: OpenStack log files

# logreduce dir-train nova.clf /var/log/nova/nova-compute.log-*xz
# logreduce dir-run nova.clf /var/log/nova/nova-compute.log
...
0.684 | INFO .. No calling threads waiting for msg_id : d3afd41a53bb4d14a5e42d
0.619 | INFO .. Recovered from being unable to report status
...
93.15% reduction (from 6741 lines to 462)

SLIDE 51

Supportconfig

# logreduce diff report-good/ report-bad/ --html report.html
Training took 51.364s at 1.543MB/s
Testing took 37.432s at 0.446MB/s
...
88.41% reduction (from 261091 lines to 30251)

SLIDE 52
SLIDE 53
SLIDE 54
SLIDE 55

log-classify: Other use cases

  • Open Build Service build result files
  • Your ideas?

SLIDE 56

OpenStack Zuul

SLIDE 57

Zuul Architecture

SLIDE 58

Zuul Architecture 2

SLIDE 59

Zuul Architecture

SLIDE 60

Post-Run Analysis

SLIDE 61

Post-Run Playbook

- job:
    name: base
    post-run:
      - classify-log
      - upload-log

- tasks:
    - name: Fetch or build the model
      command: log-classify job-train ...
    - name: Generate report
      command: log-classify job-run ...
    - name: Return report url
      zuul_return: report.html

SLIDE 62

Standalone Service

SLIDE 63
SLIDE 64

Conclusions

SLIDE 65

Software Factory

Log-Classify is hosted on softwarefactory-project.io

SLIDE 66

How to contribute

  • Apache-2.0 Licensed
  • #log-classify on Freenode

SLIDE 67

Future plans

  • Handle streaming logs
  • Adaptive model
  • Incremental model training
  • Curate public domain datasets
  • Fingerprint and detect archived anomalies
  • More services: Jenkins build, Travis CI, ...
  • More reporters: Logstash filter, ...
  • Clustering (DBSCAN, k-means) for outlier detection

SLIDE 68

Q&A

Please fill out the survey!

  • https://pypi.org/project/logreduce/
  • https://zuul-ci.org

Icons used in these diagrams are licensed under Creative Commons Attribution 3.0: https://fontawesome.com/license