
Sequence Data Mining: Techniques and Applications, Sunita Sarawagi



  1. Sequence Data Mining: Techniques and Applications
     Sunita Sarawagi, IIT Bombay
     http://www.it.iitb.ac.in/~sunita

     What is a sequence?
     • Ordered set of elements: s = a_1, a_2, ..., a_n
     • Each a_i could be
       – Categorical: domain is a finite set of symbols Σ, |Σ| = m
       – Numerical
       – Multiple attributes
     • The length n of a sequence is not fixed
     • Order is determined by time or position and could be regular or irregular

  2. Motivation
     • Several real-life mining applications on sequence data
     • Classical applications
       – Speech, language, and handwriting are all complex sequences
     • Newer applications
       – Bio-informatics: DNA and proteins
       – Telecommunication: network alarms, network packet data
       – Retail data mining: customer behavior

     Outline
     • Three case studies
       – Intrusion detection
       – Information extraction
       – Bio-informatics: protein classification
     • Sequence mining operators
     • Approaches to sequence mining
     • Conclusions and future work

  3. Case study: intrusion detection
     • Intrusions could be detected at
       – Host-level: attacks on privileged programs like lpr, sendmail
       – Network-level: denial-of-service attacks, port-scans, etc.
     • Method
       – Signature-based: match signature of previous attacks.
         – Cannot detect new intrusions
       – Anomaly-based: model normal usage and detect deviation.
     • Automatic vs. manual:
       – Manual
         – Might miss patterns, may not evolve as normal usage patterns slowly drift.
       – Automated:
         – Use historical audit trails and a learning algorithm
         – May not provide full coverage

     Host-level attacks on privileged programs
     • Attacks exploit a loophole in the program to do illegal actions
       – Example: exploit buffer overflows to run user code
     • What to monitor of an executing privileged program to detect attacks?
     • Sequence of system calls
       – Σ = set of all possible system calls, |Σ| ≈ 100
     • Mining problem: given traces of previous normal execution, monitor a new execution and flag attack or normal
     • Challenge: is it possible to do this given widely varying normal conditions?
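The mining problem stated above (learn from traces of normal execution, then flag a new execution as attack or normal) can be sketched with a sliding-window profile. This is only one possible anomaly-based approach, not necessarily the method the slides go on to present. The window length `k`, the `threshold`, the function names, and the toy system-call traces are all assumptions made for illustration.

```python
from typing import Iterable, List, Set, Tuple

Window = Tuple[str, ...]


def windows(trace: List[str], k: int) -> Iterable[Window]:
    """Yield every run of k consecutive system calls in a trace."""
    for i in range(len(trace) - k + 1):
        yield tuple(trace[i:i + k])


def train_normal_profile(traces: List[List[str]], k: int) -> Set[Window]:
    """Collect all length-k windows seen in traces of normal execution."""
    profile: Set[Window] = set()
    for trace in traces:
        profile.update(windows(trace, k))
    return profile


def mismatch_rate(trace: List[str], profile: Set[Window], k: int) -> float:
    """Fraction of windows in the new trace that never occurred in normal traces."""
    seen = list(windows(trace, k))
    if not seen:
        return 0.0
    unseen = sum(1 for w in seen if w not in profile)
    return unseen / len(seen)


def flag(trace: List[str], profile: Set[Window], k: int, threshold: float = 0.2) -> str:
    """Label a trace 'attack' when too many of its windows are unfamiliar."""
    return "attack" if mismatch_rate(trace, profile, k) > threshold else "normal"


# Hypothetical traces; the call names are illustrative only.
normal_traces = [
    ["open", "read", "mmap", "read", "close"],
    ["open", "read", "read", "close"],
]
profile = train_normal_profile(normal_traces, k=3)

print(flag(["open", "read", "mmap", "read", "close"], profile, k=3))
print(flag(["open", "mmap", "execve", "write", "close"], profile, k=3))
```

The challenge raised on the slide shows up directly in this sketch: if normal behavior varies widely, the profile must be trained on many traces, and the threshold trades missed attacks against false alarms.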

  4. Bio-informatics
     • Many recent advances in sequence analysis are due to bio-informatics
     • Two main kinds of sequences:
       – Genes: sequence of 4 possible nucleotides, |Σ| = 4
       – Proteins: sequence of 20 possible amino acids, |Σ| = 20
       – Length of a sequence varies between 100s and 10,000
     • Sequence analysis in bio-informatics is rich and varied; we will concentrate on one problem
       – Protein family classification

     Protein family classification
     • Protein families are characterized by the common occurrence of a few scattered amino acids in a background of other unrelated symbols
     • Example: three aligned sequences of a family

  5. Information extraction
     • Sequence: text string with elements as words
       – Example (bibliographic record): P.P.Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media J.Amer. Chem. Soc. 115, 12231-12237.
     • Mining problem: given a set of tags (labels), e.g. address fields, classify parts of the sequence to different labels

     Outline
     • Three case studies
     • Sequence mining operators
       – Whole sequence classification
       – Partial sequence classification (tagging)
       – Predicting next symbol of a sequence
       – Clustering sequences
       – Finding repeated patterns in a sequence
     • Approaches to sequence mining
     • Conclusion and future work
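The tagging problem stated under Information extraction (classify parts of a word sequence to different labels) can be pictured by its input and output alone. The sketch below assumes some tagger has already assigned one label per word and only shows how the labeled words collapse into fields. The label names (`Author`, `Year`, `Title`), the function name `group_fields`, and the short token list are hypothetical, chosen for illustration.

```python
from itertools import groupby
from typing import List, Tuple


def group_fields(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge consecutive words that share a label into one (label, text) field."""
    if len(tokens) != len(labels):
        raise ValueError("need exactly one label per token")
    pairs = zip(labels, tokens)
    return [
        (label, " ".join(tok for _, tok in run))
        for label, run in groupby(pairs, key=lambda pair: pair[0])
    ]


# Hypothetical tagger output for a fragment of a bibliographic record.
tokens = ["D.S.", "Clark", "(1993)", "Protein", "and", "Solvent", "Engineering"]
labels = ["Author", "Author", "Year", "Title", "Title", "Title", "Title"]

for label, text in group_fields(tokens, labels):
    print(f"{label}: {text}")
```

The hard part, predicting `labels` from `tokens`, is the mining problem itself and is left out of this sketch.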

  6. Classification of whole sequences
     Given:
       – a set of classes C, and
       – a number of example instances in each class c,
     train a model so that for an unseen sequence we can say to which class it belongs
     Example:
       – Given a set of protein families, find the family of a new protein
       – Given a sequence of packets, predict whether the session is an intrusion or not
       – Given several utterances of a set of words, classify a new utterance to the right word

     Existing methods of classification
     • Generative classifiers
     • Discriminative classifiers
     • Distance-based classifiers (nearest neighbor)
     • Kernel-based classifiers
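Of the methods listed above, the distance-based (nearest neighbor) classifier is the simplest to sketch. The version below assumes edit distance as the sequence distance, which is one common choice rather than something the slides specify. The toy sequences and the labels `family_A` and `family_B` are made up for illustration.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, free if symbols match
            ))
        prev = cur
    return prev[-1]


def nearest_neighbor(query: Sequence, examples: List[Tuple[Sequence, str]]) -> str:
    """Return the class label of the training sequence closest to the query."""
    best_seq, best_label = min(examples, key=lambda ex: edit_distance(query, ex[0]))
    return best_label


# Hypothetical labeled training sequences.
examples = [
    ("ACCGT", "family_A"),
    ("ACGGT", "family_A"),
    ("TTGAC", "family_B"),
    ("TTGCC", "family_B"),
]

print(nearest_neighbor("ACCGA", examples))
print(nearest_neighbor("TTGAA", examples))
```

Because sequences have no fixed length, a distance that tolerates insertions and deletions is a natural fit here; the cost is one quadratic-time comparison per training sequence.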
