CS6220: DATA MINING TECHNIQUES
Sequence Data
Instructor: Yizhou Sun
yzsun@ccs.neu.edu
November 14, 2013
Reminder: Homework 1. Three students need to talk to Moon during the break: Rakesh Viswanathan, Xin Huang, and Laxmi Rambhatla.
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
A sequence database
A sequence: <(ef)(ab)(df)cb>. An element may contain a set of items; items within an element are unordered and we list them alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>. Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.
Formally, a sequence <a_1 a_2 … a_n> is a subsequence of <b_1 b_2 … b_m> if there exist integers 1 ≤ j_1 < j_2 < … < j_n ≤ m such that a_1 ⊆ b_{j_1}, …, a_n ⊆ b_{j_n}.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
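To make the containment rule concrete, here is a small illustrative Python sketch (not from the slides); sequences are represented as lists of item sets, and the function name is my own.

```python
# Illustrative sketch (not from the slides): subsequence test for sequences of itemsets.
# A sequence is a list of elements; each element is a set of items.

def is_subsequence(sub, seq):
    """Return True if `sub` is a subsequence of `seq`: each element of `sub`
    must be a subset of some element of `seq`, with the matched elements
    appearing in the same order."""
    j = 0  # current position in seq
    for element in sub:
        # greedily find the next element of seq that contains this element
        while j < len(seq) and not element <= seq[j]:
            j += 1
        if j == len(seq):
            return False
        j += 1
    return True

# <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
sub = [{'a'}, {'b', 'c'}, {'d'}, {'c'}]
seq = [{'a'}, {'a', 'b', 'c'}, {'a', 'c'}, {'d'}, {'c', 'f'}]
print(is_subsequence(sub, seq))  # True
```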
A huge number of possible sequential patterns are hidden in databases.
Representative sequential pattern mining algorithms:
- GSP, a candidate generate-and-test approach (Srikant & Agrawal @ EDBT'96)
- SPADE, a vertical-format approach (Zaki @ Machine Learning'01)
- PrefixSpan, a pattern-growth approach (Pei, et al. @ ICDE'01)
- Constraint-based sequential pattern mining (Garofalakis, Rastogi, Shim @ VLDB'99; Pei, Han, Wang @ CIKM'02)
SID   Sequence
10    <(bd)cb(ac)>
20    <(bf)(ce)b(fg)>
30    <(ah)(bf)abf>
40    <(be)(ce)d>
50    <a(bd)bcb(ade)>
Given support threshold min_sup =2
GSP (Generalized Sequential Patterns) outline:
- Initially, every item in the database is a candidate of length 1.
- For each level (i.e., sequences of length k): scan the database to collect the support count for each candidate sequence, then generate candidate length-(k+1) sequences from the frequent length-k sequences using Apriori.
- Repeat until no frequent sequence or no candidate can be found.
Initial candidates: all singleton sequences <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>. Scan the database once and count the support of each candidate:
min_sup =2
Cand   Sup
<a>    3
<b>    5
<c>    4
<d>    3
<e>    3
<f>    2
<g>    1
<h>    1
Length-2 candidates generated from the 6 frequent length-1 patterns <a>, <b>, <c>, <d>, <e>, <f>:
- Sequence-form candidates (36): <aa>, <ab>, <ac>, <ad>, <ae>, <af>, <ba>, <bb>, …, <fe>, <ff>
- Itemset-form candidates (15): <(ab)>, <(ac)>, <(ad)>, <(ae)>, <(af)>, <(bc)>, <(bd)>, <(be)>, <(bf)>, <(cd)>, <(ce)>, <(cf)>, <(de)>, <(df)>, <(ef)>
Without the Apriori property, all 8 items would be combined: 8×8 + 8×7/2 = 92 length-2 candidates. With Apriori pruning, only the 6 frequent items are used, giving 6×6 + 6×5/2 = 51 candidates.
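A minimal sketch of the length-2 candidate generation (my own illustration, not GSP's full pseudocode), reproducing the 51 vs. 92 counts above:

```python
from itertools import combinations, product

def length2_candidates(items):
    """GSP-style length-2 candidates from frequent length-1 items:
    ordered pairs <xy> (x then y, possibly x == y) plus unordered
    itemset pairs <(xy)> with x != y."""
    sequence_form = [f"<{x}{y}>" for x, y in product(items, repeat=2)]
    itemset_form = [f"<({x}{y})>" for x, y in combinations(items, 2)]
    return sequence_form + itemset_form

frequent = list("abcdef")       # the 6 frequent length-1 patterns
all_items = list("abcdefgh")    # all 8 items

print(len(length2_candidates(frequent)))    # 51 = 6*6 + 6*5/2
print(len(length2_candidates(all_items)))   # 92 = 8*8 + 8*7/2
```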
GSP mining process (min_sup = 2):
- 1st scan: 8 candidates (<a>, <b>, …, <h>), 6 length-1 sequential patterns
- 2nd scan: 51 candidates (<aa>, <ab>, …, <ff>, <(ab)>, …, <(ef)>), 19 length-2 sequential patterns
- 3rd scan: 46 candidates (<aab>, <aba>, <abb>, <baa>, <bab>, …), 20 length-3 sequential patterns
- 4th scan: 8 candidates (<abba>, <(bd)bc>, …), 7 length-4 sequential patterns
- 5th scan: 1 candidate (<(bd)cba>), 1 length-5 sequential pattern
SPADE (Sequential PAttern Discovery using Equivalence classes), developed by Zaki (2001), is a vertical-format sequential pattern mining method: the sequence database is mapped into a table of <SID, EID> (sequence ID, event ID) pairs, and patterns are grown by Apriori-style candidate generation on this vertical format.
Growing patterns: join two ID-list tables. The ID-list of a longer pattern is computed by joining the ID-lists of two shorter patterns that share a prefix, keeping only occurrences that appear in the required temporal order within the same sequence.
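A rough sketch of the join idea under an assumed representation (not Zaki's exact data structures): each pattern's ID-list is a list of (SID, EID) occurrences, and joining the ID-lists of <a> and <b> yields the ID-list of <ab>.

```python
# Sketch (assumed representation): ID-lists record (SID, EID) occurrences
# of a pattern in vertical format.

def temporal_join(idlist_a, idlist_b):
    """ID-list of the pattern 'a followed by b': keep (sid, eid_b) pairs
    where some occurrence of a in the same sequence has a smaller EID."""
    joined = []
    for sid, eid_b in idlist_b:
        if any(s == sid and eid_a < eid_b for s, eid_a in idlist_a):
            joined.append((sid, eid_b))
    return joined

# (SID, EID) occurrences of single items in a toy database
idlist = {
    "a": [(1, 1), (1, 2), (2, 3)],
    "b": [(1, 2), (2, 1), (2, 4)],
}
ab = temporal_join(idlist["a"], idlist["b"])
print(ab)                              # [(1, 2), (2, 4)] -> ID-list of <ab>
print(len({sid for sid, _ in ab}))     # support of <ab> = number of distinct SIDs
```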
Bottlenecks of GSP and SPADE: a huge set of candidates may be generated. For example, 1,000 frequent length-1 sequences generate
1000 × 1000 + (1000 × 999)/2 = 1,499,500 length-2 candidates!
Mining a long pattern also requires an exponential number of shorter candidates: a length-100 sequential pattern needs
Σ_{i=1}^{100} C(100, i) = 2^100 − 1 ≈ 10^30 candidate sequences!
Prefix and suffix (prefix-based projection) of s = <a(abc)(ac)d(cf)>:

Prefix     Suffix (projection)
<a>        <(abc)(ac)d(cf)>
<aa>       <(_bc)(ac)d(cf)>
<a(ab)>    <(_c)(ac)d(cf)>
Assume a pre-specified order on items, e.g., alphabetical order
The complete set of sequential patterns can be partitioned into 6 subsets: those having prefix <a>, those having prefix <b>, …, those having prefix <f>.
Only projections w.r.t. <a> need to be considered: scanning the <a>-projected database once gives the length-2 sequential patterns having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>.
PrefixSpan mining process on the sequence database SDB = {<a(abc)(ac)d(cf)>, <(ad)c(bc)(ae)>, <(ef)(ab)(df)cb>, <eg(af)cbc>}, min_sup = 2:
- Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
- <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
- Length-2 sequential patterns having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
- Recurse into the <aa>-projected database, …, the <af>-projected database, and handle the patterns having prefix <b>, <c>, …, <f> in the same way with their own projected databases.
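A small sketch of prefix projection for single-item prefixes (my own simplification; it keeps the leftover items of a matched element but does not track the "(_x)" marker needed for full itemset extension):

```python
# Sketch of single-item prefix projection in PrefixSpan (simplified illustration).
# A sequence is a list of elements; each element is a frozenset of items.

def project(seq, item):
    """Return the suffix of `seq` w.r.t. prefix <item>, or None if absent.
    If the first element containing `item` has remaining items (alphabetically
    after `item`), they are kept as a partial element."""
    for i, element in enumerate(seq):
        if item in element:
            rest = frozenset(x for x in element if x > item)
            return ([rest] if rest else []) + seq[i + 1:]
    return None

db = [
    [frozenset('a'), frozenset('abc'), frozenset('ac'), frozenset('d'), frozenset('cf')],
    [frozenset('ad'), frozenset('c'), frozenset('bc'), frozenset('ae')],
    [frozenset('ef'), frozenset('ab'), frozenset('df'), frozenset('c'), frozenset('b')],
    [frozenset('e'), frozenset('g'), frozenset('af'), frozenset('c'), frozenset('b'), frozenset('c')],
]

# the <a>-projected database
for suffix in (project(s, 'a') for s in db):
    if suffix is not None:
        print(suffix)
```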
Implementation consideration: when the database can be held in main memory, use pointers to form projections (pseudo-projection). Instead of copying each suffix, register a pointer to the sequence and an offset, e.g., for s = <a(abc)(ac)d(cf)>:
s|<a>  = (pointer to s, offset 2) = <(abc)(ac)d(cf)>
s|<ab> = (pointer to s, offset 4) = <(_c)(ac)d(cf)>
Pseudo-projection is efficient in time and space when the database fits in main memory, but not when the database must be accessed on disk, since random access is costly.
A Markov chain is a random process with the Markov property: given the present state, future states are independent of the past states. At each step the system may change from the current state to another state, or remain in the same state, according to a certain probability distribution; the probabilities associated with these state changes are called transition probabilities.
Transition probability: Pr(x_i = x | x_{i-1} = g), the probability of moving to state x given current state g. For every state g, Σ_x Pr(x_i = x | x_{i-1} = g) = 1.
Each event of a sequence here is considered to contain only one state.
Key problem: compute the probability of a sequence x = x_1 x_2 … x_L given our model. By the chain rule we can write this probability as

Pr(x) = Pr(x_L, x_{L-1}, …, x_1)
      = Pr(x_L | x_{L-1}, …, x_1) Pr(x_{L-1} | x_{L-2}, …, x_1) … Pr(x_1)

Since each x_i depends only on the value of x_{i-1} (the Markov property), this simplifies to

Pr(x) = Pr(x_L | x_{L-1}) Pr(x_{L-1} | x_{L-2}) … Pr(x_2 | x_1) Pr(x_1)
      = Pr(x_1) ∏_{i=2}^{L} Pr(x_i | x_{i-1})
Pr(cggt)=Pr(c)Pr(g|c)Pr(g|g)Pr(t|g)
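A small sketch of this computation; the initial and transition probabilities below are made-up illustrative values, not parameters from the slides.

```python
# Sketch: probability of a DNA sequence under a first-order Markov chain.
# Initial and transition probabilities are made-up illustrative values.

initial = {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25}
transition = {          # transition[prev][cur] = Pr(cur | prev)
    'a': {'a': 0.2, 'c': 0.3, 'g': 0.3, 't': 0.2},
    'c': {'a': 0.3, 'c': 0.2, 'g': 0.2, 't': 0.3},
    'g': {'a': 0.3, 'c': 0.2, 'g': 0.2, 't': 0.3},
    't': {'a': 0.2, 'c': 0.3, 'g': 0.3, 't': 0.2},
}

def markov_prob(x):
    """Pr(x) = Pr(x_1) * prod_i Pr(x_i | x_{i-1})."""
    p = initial[x[0]]
    for prev, cur in zip(x, x[1:]):
        p *= transition[prev][cur]
    return p

# Pr(cggt) = Pr(c) Pr(g|c) Pr(g|g) Pr(t|g)
print(markov_prob("cggt"))  # 0.25 * 0.2 * 0.2 * 0.3 = 0.003
```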
In the genome overall, the dinucleotide CG (written CpG) occurs less often than expected given the marginal probabilities of C and G. Certain regions, e.g., around gene promoters, contain a higher frequency of CpG dinucleotides than elsewhere; such regions are called CpG islands.
Given training sequences from CpG-island regions and from non-island regions, we can construct two Markov chain models: a CpG-island model and a null (background) model.
To classify a new sequence x into one of the two classes, we just need to compare Pr(x | CpG) and Pr(x | null).
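Equivalently (a standard formulation added here as a supplement; it is not spelled out in the surviving slide text), score x with the log-odds ratio of the two chains and call it a CpG island when the score is positive:

S(x) = log [ Pr(x | CpG) / Pr(x | null) ] = Σ_{i=2}^{L} log [ a⁺(x_{i-1}, x_i) / a⁻(x_{i-1}, x_i) ]

where a⁺ and a⁻ denote the transition probabilities of the CpG-island and null models, respectively.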
In a first-order Markov chain, each state depends only on the previous state. An n-th order Markov model lets each state depend on the previous n states:

Pr(x_i | x_{i-1}, …, x_1) = Pr(x_i | x_{i-1}, x_{i-2}, …, x_{i-n})
An n-th order Markov chain over some alphabet A is equivalent to a first-order Markov chain over the alphabet A^n of n-tuples. Example: a 2nd-order Markov model over {A, C, G, T} is equivalent to a 1st-order Markov model over the alphabet of pairs AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT.
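A tiny sketch of this equivalence (my own illustration): rewriting a string as overlapping n-tuples turns an n-th order chain over the original alphabet into a 1st-order chain over the tuple alphabet.

```python
def to_ntuples(x, n):
    """Overlapping n-tuples of x; a 1st-order chain over these tuples
    encodes an n-th order chain over the original alphabet."""
    return [x[i:i + n] for i in range(len(x) - n + 1)]

print(to_ntuples("GCTACA", 2))  # ['GC', 'CT', 'TA', 'AC', 'CA']
```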
Pr(gctaca)=Pr(gctac)Pr(a|gctac)
Given the observed sequence AGGCT, which state emits each item?
A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unknown (hidden) parameters. The states themselves are not observed directly; the sequence of tokens generated by an HMM gives some information about the sequence of states, and the model parameters extracted from the observable data can be used to perform further analysis. An HMM can be considered the simplest dynamic Bayesian network.
The learning task: given a set of sequences, learn an HMM that assigns them high probability (goal is to find a model that generalizes well to sequences we haven't seen before). Typical questions for a trained model: how likely is a given sequence, and which state generated which subsequence (the most probable state path)?
HMM parameters:
- Transition probabilities: a_{kl} = Pr(π_i = l | π_{i-1} = k)
- Emission probabilities: e_k(b) = Pr(x_i = b | π_i = k)
The joint probability of an observed sequence x = x_1 … x_L and a state path π = π_1 … π_L is

Pr(x, π) = a_{0π_1} ∏_{i=1}^{L} e_{π_i}(x_i) a_{π_i π_{i+1}},  where π_{L+1} = N is the ending state.

Example:
Pr(AAC, π) = a_{01} e_1(A) a_{11} e_1(A) a_{13} e_3(C) a_{35}
           = 0.5 × 0.4 × 0.2 × 0.4 × 0.8 × 0.3 × 0.6 ≈ 0.0023
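A sketch of the joint-probability computation. The state names and parameter values below are hypothetical placeholders (chosen to be consistent with the worked AAC example, but the full model from the original slide is not reproduced here):

```python
# Sketch: Pr(x, path) for an HMM with explicit start (0) and end ('N') states.
# All parameter values are hypothetical placeholders.

transition = {            # transition[k][l] = Pr(next state l | current state k)
    0: {1: 0.5, 2: 0.5},
    1: {1: 0.2, 3: 0.8},
    2: {2: 0.4, 3: 0.6},
    3: {3: 0.4, 'N': 0.6},
}
emission = {              # emission[k][b] = Pr(symbol b | state k)
    1: {'A': 0.4, 'C': 0.1, 'G': 0.2, 'T': 0.3},
    2: {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},
    3: {'A': 0.2, 'C': 0.3, 'G': 0.3, 'T': 0.2},
}

def joint_prob(x, path):
    """Pr(x, path) = a[0][path_1] * prod_i e[path_i](x_i) * a[path_i][path_{i+1}],
    with the last transition going to the end state 'N'."""
    p = transition[0][path[0]]
    for i, (symbol, state) in enumerate(zip(x, path)):
        next_state = path[i + 1] if i + 1 < len(path) else 'N'
        p *= emission[state][symbol] * transition[state][next_state]
    return p

# Pr(AAC, path 1,1,3) = a01 e1(A) a11 e1(A) a13 e3(C) a3N
print(joint_prob("AAC", [1, 1, 3]))  # 0.002304
```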
How likely is a given sequence? Its probability is the sum over all state paths that could generate the sequence: Pr(x) = Σ_π Pr(x, π). Enumerating all paths is exponential, so we compute this with dynamic programming (the Forward algorithm).
The Forward algorithm. Define f_k(i) = Pr(x_1 … x_i, π_i = k), the probability of observing the first i symbols and being in state k at position i.

Recursion:   f_l(i) = e_l(x_i) Σ_k f_k(i−1) a_{kl}
Termination: Pr(x) = Pr(x_1 … x_L) = Σ_k f_k(L) a_{kN}

N: ending state; denoted as 0 in the textbook.
Given the sequence x=TAGA
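A sketch of the Forward recursion, reusing the same hypothetical model as the joint-probability sketch so that it is self-contained; the sequence TAGA is the example from the slide, but the numeric result depends on these assumed parameters.

```python
# Sketch of the Forward algorithm: Pr(x) = sum over all paths of Pr(x, path).
# Model parameters are hypothetical placeholders (same as the sketch above).

transition = {
    0: {1: 0.5, 2: 0.5},
    1: {1: 0.2, 3: 0.8},
    2: {2: 0.4, 3: 0.6},
    3: {3: 0.4, 'N': 0.6},
}
emission = {
    1: {'A': 0.4, 'C': 0.1, 'G': 0.2, 'T': 0.3},
    2: {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},
    3: {'A': 0.2, 'C': 0.3, 'G': 0.3, 'T': 0.2},
}
states = [1, 2, 3]

def forward(x):
    """f[k][i] = Pr(x_1..x_{i+1}, state k at that position); returns Pr(x)."""
    # initialization: f_k(1) = a_{0k} e_k(x_1)
    f = {k: [transition[0].get(k, 0.0) * emission[k][x[0]]] for k in states}
    # recursion: f_l(i) = e_l(x_i) * sum_k f_k(i-1) a_{kl}
    for i in range(1, len(x)):
        for l in states:
            total = sum(f[k][i - 1] * transition[k].get(l, 0.0) for k in states)
            f[l].append(emission[l][x[i]] * total)
    # termination: Pr(x) = sum_k f_k(L) a_{kN}
    return sum(f[k][-1] * transition[k].get('N', 0.0) for k in states)

print(forward("TAGA"))
```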
Learning is simple if we know the correct state path for each sequence in our training set: estimate the transition and emission probabilities by counting the number of times each parameter is used across the training set (maximum likelihood estimation).
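A minimal sketch of this counting step on made-up toy data (the sequences and state labels below are hypothetical):

```python
from collections import defaultdict

# Sketch: maximum-likelihood estimation of HMM parameters when the state
# path of every training sequence is known (made-up toy data).

training = [                      # (observed symbols, known state path)
    ("AAC", [1, 1, 3]),
    ("AGC", [1, 2, 3]),
]

trans_count = defaultdict(lambda: defaultdict(int))
emit_count = defaultdict(lambda: defaultdict(int))

for x, path in training:
    for i, (symbol, state) in enumerate(zip(x, path)):
        emit_count[state][symbol] += 1
        if i + 1 < len(path):
            trans_count[state][path[i + 1]] += 1

def normalize(counts):
    """Turn raw counts into probabilities (pseudocounts are added in practice)."""
    total = sum(counts.values())
    return {key: c / total for key, c in counts.items()}

transition = {k: normalize(v) for k, v in trans_count.items()}
emission = {k: normalize(v) for k, v in emit_count.items()}
print(transition)
print(emission)
```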
If we do not know the correct path for each sequence in our training set, we consider all possible paths for the sequence and estimate the parameters by counting the expected number of times each parameter is used across the training set (the Baum-Welch / EM approach).
Given an HMM with S states and a sequence of length L, the time complexity of the Forward, Backward and Viterbi algorithms is O(S²L), assuming the states are densely interconnected.