Interactive HMM construction based on interesting sequences

SLIDE 1

Interactive HMM construction based on interesting sequences

Szymon Jaroszewicz

National Institute of Telecommunications Warsaw, Poland

LeGo 2008

Szymon Jaroszewicz Interactive HMM construction based on interesting sequences

SLIDE 2

Overview

  • Building models interactively based on interesting patterns
  • Hidden Markov Models
  • Interesting patterns w.r.t. Hidden Markov Models
  • Experimental evaluation: web server log
  • Conclusions and Future research

SLIDE 3

Typical approach: Automatic model construction

Or:

SLIDE 4

Here: Interactive model construction

SLIDE 5

Here: Interactive model construction

+ Understandable models
+ Learn while building models
– Have to do ‘manual’ work :(

SLIDE 6

Previous related work

Scalable pattern mining with Bayesian networks as background knowledge

  • S. Jaroszewicz, T. Scheffer, D. Simovici
  • KDD’04, KDD’05, DMKD (to appear)
  • Bayesian networks used as background model
  • Exact and approximate algorithms given
  • Models much closer to real relationships than automatically built models

SLIDE 7

Hidden Markov Models (HMMs)

SLIDE 8

Hidden Markov Models (HMMs)

SLIDE 9

Hidden Markov Models (HMMs)

User gives the structure of the HMM:

  • internal states
  • which transitions are possible (not probabilities)
  • which emission symbols are possible for each state (not probabilities)
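A minimal sketch of what such a structure specification could look like in code. The state names, symbols and helper functions below are hypothetical and chosen only for illustration. The structure lists what is allowed; the probabilities are initialised at random over the allowed entries only. Entries that start at zero stay at zero under Baum-Welch re-estimation, so training respects the given structure.

```python
import random

# Hypothetical toy structure: state and symbol names are illustrative only.
allowed_transitions = {
    "soph": ["soph", "quit"],
    "all": ["all", "quit"],
    "quit": [],
}
allowed_emissions = {
    "soph": ["sophos/"],
    "all": ["journal/", "index.html"],
    "quit": ["END"],
}

def random_row(allowed, rng):
    """Random probabilities over the allowed items; anything absent is implicitly 0."""
    if not allowed:
        return {}
    weights = [rng.random() + 1e-9 for _ in allowed]
    total = sum(weights)
    return {item: w / total for item, w in zip(allowed, weights)}

def init_parameters(transitions, emissions, seed=0):
    """Turn a structure (what is possible) into initial parameters (how probable)."""
    rng = random.Random(seed)
    P = {state: random_row(nxt, rng) for state, nxt in transitions.items()}
    E = {state: random_row(syms, rng) for state, syms in emissions.items()}
    return P, E

P, E = init_parameters(allowed_transitions, allowed_emissions)
```

Here `P` and `E` would be the starting point for parameter training; only the structure comes from the user.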

SLIDE 10

Interestingness of sequences w.r.t. an HMM

$$\mathrm{Inter}(seq) = \left|\,\mathrm{Prob}_{HMM}\{seq\} - \mathrm{Prob}_{Data}\{seq\}\,\right|$$

SLIDE 11

Algorithm for finding all ε-interesting sequences

1. Train HMM parameters based on Data (Baum-Welch)
2. Find all seq such that ProbData{seq} > ε
3. Find all seq such that ProbHMM{seq} > ε
4. Compute ProbData for seq frequent in HMM but not in Data
5. Compute ProbHMM for seq frequent in Data but not in HMM
6. Compute Inter(seq) for all sequences
7. Output ε-interesting sequences
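Steps 2 to 7 can be sketched as follows, assuming Inter(seq) is the absolute difference of the two probabilities and that "ε-interesting" means Inter(seq) > ε. The function name, its arguments and the toy numbers are hypothetical; training (step 1) and the two mining passes are taken as given inputs. Mining only sequences with probability above ε is enough, because two non-negative probabilities can only differ by more than ε if at least one of them exceeds ε.

```python
def interesting_sequences(freq_data, freq_hmm, prob_data, prob_hmm, eps):
    """freq_data / freq_hmm: dicts {sequence tuple: probability} from steps 2 and 3.
    prob_data / prob_hmm: callables giving the probability of any sequence,
    used only for sequences that are frequent on one side but not the other."""
    result = {}
    for seq in set(freq_data) | set(freq_hmm):
        p_data = freq_data[seq] if seq in freq_data else prob_data(seq)  # step 4
        p_hmm = freq_hmm[seq] if seq in freq_hmm else prob_hmm(seq)      # step 5
        inter = abs(p_hmm - p_data)                                      # step 6
        if inter > eps:                                                  # step 7
            result[seq] = inter
    return result

# Toy, made-up inputs: one sequence frequent only in the data,
# one frequent only in the model.
freq_data = {("a", "a"): 0.1148}
freq_hmm = {("b",): 0.30}
found = interesting_sequences(
    freq_data, freq_hmm,
    prob_data=lambda seq: 0.29,    # stand-in for counting seq in the data
    prob_hmm=lambda seq: 0.0117,   # stand-in for HMM inference
    eps=0.05,
)
```

In this toy run only `("a", "a")` is reported: the model badly underestimates it, while `("b",)` is modeled to within 0.01.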

SLIDE 12

Inference in Hidden Markov Models

α(seq, s_i): probability that sequence seq (starting at t = 0) is emitted and the HMM ends in state s_i

Efficient recursive updating:

$$\alpha(seq + o_{n+1}, s_i) = \sum_j \alpha(seq, s_j)\, P_{ji}\, E_{i\,o_{n+1}}$$

$$\mathrm{Prob}_{HMM}\{seq\} = \sum_i \alpha(seq, s_i)$$
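The recursion can be written in a few lines of plain Python. This is a sketch under assumptions: `P[j][i]` is taken to be the transition probability from state j to state i, `E[i]` a dict of emission probabilities for state i, and `pi` an initial state distribution (the slides do not show how the first symbol is handled, so the usual convention α(o_1, s_i) = pi_i · E_i(o_1) is assumed). The two-state model is made up for the example.

```python
def extend_alpha(alpha, symbol, P, E):
    """alpha(seq + o, s_i) = sum_j alpha(seq, s_j) * P[j][i] * E[i][o]."""
    n = len(alpha)
    return [
        sum(alpha[j] * P[j][i] for j in range(n)) * E[i].get(symbol, 0.0)
        for i in range(n)
    ]

def prob_hmm(seq, pi, P, E):
    """Probability that the HMM emits seq as a prefix starting at t = 0."""
    if not seq:
        return 1.0
    # Assumed base case: start in state i with probability pi[i], emit seq[0].
    alpha = [pi[i] * E[i].get(seq[0], 0.0) for i in range(len(pi))]
    for symbol in seq[1:]:
        alpha = extend_alpha(alpha, symbol, P, E)
    return sum(alpha)

# Hypothetical two-state model: state 0 emits "a", state 1 emits "b" or "c".
pi = [1.0, 0.0]
P = [[0.5, 0.5],
     [0.0, 1.0]]
E = [{"a": 1.0}, {"b": 0.5, "c": 0.5}]

p_aa = prob_hmm(("a", "a"), pi, P, E)
p_ab = prob_hmm(("a", "b"), pi, P, E)
```

Each extension costs one pass over the transition matrix, which is why the alpha vector of a sequence can be reused for all of its one-symbol extensions.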

SLIDE 13

Finding frequent sequences in Hidden Markov Models

  • Monotonicity property holds: ProbHMM{seq + o} ≤ ProbHMM{seq}
  • Standard depth-first frequent pattern mining works
  • alpha probabilities used instead of support counting
  • Very efficient: probability updating is fast
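A sketch of such a depth-first search, with the alpha vector playing the role of the support count. The function name, the `max_len` safeguard, the initial distribution `pi` and the toy model are assumptions made for this example. A branch is cut as soon as its probability drops to ε or below, which is safe because no extension can have a higher probability.

```python
def frequent_hmm_sequences(pi, P, E, symbols, eps, max_len=10):
    """Return {sequence: ProbHMM} for all prefixes with probability > eps."""
    n = len(pi)
    found = {}

    def extend(alpha, symbol):
        # One forward step: sum over predecessor states, then emit `symbol`.
        return [
            sum(alpha[j] * P[j][i] for j in range(n)) * E[i].get(symbol, 0.0)
            for i in range(n)
        ]

    def dfs(seq, alpha):
        for o in symbols:
            if alpha is None:  # first symbol: assumed start from pi
                new_alpha = [pi[i] * E[i].get(o, 0.0) for i in range(n)]
            else:
                new_alpha = extend(alpha, o)
            p = sum(new_alpha)
            if p > eps:  # monotonicity: otherwise prune the whole subtree
                found[seq + (o,)] = p
                if len(seq) + 1 < max_len:  # safeguard, not part of the slides
                    dfs(seq + (o,), new_alpha)

    dfs((), None)
    return found

# Hypothetical two-state model: state 0 emits "a", state 1 emits "b" or "c".
pi = [1.0, 0.0]
P = [[0.5, 0.5],
     [0.0, 1.0]]
E = [{"a": 1.0}, {"b": 0.5, "c": 0.5}]

frequent = frequent_hmm_sequences(pi, P, E, ["a", "b", "c"], eps=0.2)
```

The alpha vector is passed down the recursion, so each candidate costs a single update rather than a fresh pass over the whole sequence.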

SLIDE 14

Weblog of the National Institute of Telecommunications

Web log format:

195.205.118.10 [01/Jan/2007:00:04:33 +0100] "GET /journal/paper 1.pdf" 200 8833 "http://www.google.pl/"
65.55.208.68 [01/Jan/2007:00:04:45] "GET /robots.txt" 200 51 "-" "msnbot/1.0"

Preprocessing:

  • keep only top level directory
  • sessionizing

Result: sessions:

journal/, journal/, END
robots.txt, index.html, journal/, ..., END
exchweb/, exchange/, exchange/, ..., END
...
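The preprocessing could look roughly like the sketch below. The slides do not say how sessions were delimited, so the rule used here (same IP address, at most 30 minutes between requests) is an assumption, as are the regular expression, the function names and the handling of a bare "/" request.

```python
import re
from datetime import datetime, timedelta

# IP, timestamp (an optional time zone is skipped), requested path.
LINE_RE = re.compile(r'^(\S+) \[([^\] ]+)[^\]]*\] "GET (/[^"]*)"')

def top_level(path):
    """'/journal/paper 1.pdf' -> 'journal/', '/robots.txt' -> 'robots.txt'."""
    parts = path.lstrip("/").split("/", 1)
    if parts[0] == "":
        return "/"  # assumed label for the site root
    return parts[0] + "/" if len(parts) == 2 else parts[0]

def sessionize(lines, gap=timedelta(minutes=30)):
    """Group requests per IP; a pause longer than `gap` starts a new session."""
    last_seen = {}  # ip -> time of the previous request
    current = {}    # ip -> the session list currently being filled
    sessions = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that are not GET requests in this format
        ip, stamp, path = match.groups()
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S")
        if ip not in current or when - last_seen[ip] > gap:
            current[ip] = []
            sessions.append(current[ip])
        current[ip].append(top_level(path))
        last_seen[ip] = when
    return [session + ["END"] for session in sessions]

log_lines = [
    '195.205.118.10 [01/Jan/2007:00:04:33 +0100] "GET /journal/paper 1.pdf" 200 8833 "http://www.google.pl/"',
    '65.55.208.68 [01/Jan/2007:00:04:45] "GET /robots.txt" 200 51 "-" "msnbot/1.0"',
]
sessions = sessionize(log_lines)
```

Each resulting session is a list of top-level names terminated by `END`, which is the symbol sequence the HMM is trained on.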

SLIDE 15

Initial HMM

SLIDE 16

The Sophos antivirus

Top sequences:

  • sophos/, sophos/: ProbHMM = 1.17%, ProbData = 11.48%
  • sophos/, sophos/, sophos/, sophos/: ProbHMM = 0.013%, ProbData = 9.29%

Update of the Sophos antivirus

Always accessed 2, 4 or more times
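As a quick check, and assuming Inter(seq) is the absolute difference between the model and data probabilities, the interestingness of the two sequences works out to:

```latex
\begin{align*}
\mathrm{Inter}(\texttt{sophos/, sophos/}) &= |1.17\% - 11.48\%| = 10.31\% \\
\mathrm{Inter}(\texttt{sophos/, sophos/, sophos/, sophos/}) &= |0.013\% - 9.29\%| = 9.277\%
\end{align*}
```

In both cases the initial model strongly underestimates how often the sequence occurs in the data.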

SLIDE 17

The Sophos antivirus: update to the model

The new model is:

  • Each soph state only emits the sophos/ symbol
  • sophos/ symbol removed from the _all_ state

SLIDE 18

Journal PDF files + icon

Sequence: journals/, journals/, favicon.ico

ProbHMM ≈ 0, ProbData ≈ 2%

  • favicon.ico: small icon next to web address
  • Default location: main directory
  • At the Institute: img/ directory
  • HTML header contains the other location; PDF can’t
  • Browser tries the default location and fails
  • Fixed: icon appears now

SLIDE 19

Journal PDF files + icon

Added the following segment to the model:

The same PDF file often accessed twice; unable to explain:

  • accelerators?
  • browser errors?
  • server errors?

SLIDE 20

Other patterns

  • Exchange mail web reader
  • robots: Google / MSN / Yahoo
  • RSS readers
  • ...

SLIDE 21

Final model

Quickly built a model of high level user behavior

Accuracy: probability of all sequences modeled with error < 0.01. Every sequence is either:

  • uninteresting (modeled well)
  • infrequent

Understandability: the model is easily understandable

Learnt a lot about the data while modeling

SLIDE 22

Final model

[Figure: graph of the final HMM. Probabilities of the entry states, as read from the figure: sophos2a 0.116693, czasopisma_1 0.174907, confer_1 0.0579479, proxy_wpad_1 0.0720362, robot_enter 0.0917065, main 0.0667198, coop 0.0228602, mail 0.0217969, structure 0.0680489, ogloszenia 0.0619351, RSS_1 0.0550239, _all_ 0.190324. Other states: sophos2b, sophos4a, sophos4b, sophos_more, czasopisma_2, czasopisma_3, czasopisma_4, favico, confer_2, proxy_wpad_2, robot_all_, main_img, main_css, main_js, RSS_2, _all_sink, _all_image, quit. Edges are labelled with transition probabilities.]

SLIDE 23

Comparison with automatically learned models

20 hidden states + Baum-Welch algorithm

  • only transitions with prob. > 0.01
  • all transitions with prob. > 0.001

SLIDE 24

Only transitions with prob. > 0.01

[Figure: graph of the automatically learned HMM (20 hidden states, Baum-Welch), showing only transitions with prob. > 0.01. States are labelled with their emission probabilities, e.g. "sophos:0.96 __END__:0.04", "exchweb:0.97 exchange:0.03", "wpad.dat:0.94 index.html:0.03 proxy.pac:0.02"; edges are labelled with transition probabilities.]

SLIDE 25

All transitions with prob. > 0.001

[Figure: graph of the same automatically learned HMM, showing all transitions with prob. > 0.001. States are labelled with their emission probabilities, e.g. "sophos:0.96 __END__:0.04", "exchweb:0.97 exchange:0.03", "wpad.dat:0.94 index.html:0.03 proxy.pac:0.02"; edges are labelled with transition probabilities.]

SLIDE 26

Conclusions and Future work

Conclusions:

  • Interactive model construction based on interesting patterns = Understandability + Accuracy + Learning about the data

Future work:

  • Patterns starting at arbitrary time
  • More general models: Dynamic Bayesian Networks, models of biological systems
  • Automatic model updating (?)
