Stream Sequential Pattern Mining with Precise Error Bounds
Luiz F. Mendes1,2 Bolin Ding1 Jiawei Han1
1 University of Illinois at Urbana-Champaign 2 Google Inc.
lmendes@google.com {bding3,hanj}@illinois.edu
Outline
Introduction, Problem Definition, The SS-BE Method, The SS-MB Method, Experimental Results, Discussion, Conclusions

Introduction
Sequential pattern mining is an important data mining problem with broad applications.
In recent years, we have seen a new kind of data: rapid, continuous, and unbounded data streams.
Additional constraints for mining data streams:
Memory usage is limited (cannot store everything).
Can only look at each stream component once.
Two effective methods for mining sequential patterns over data streams:
SS-BE (Stream Sequence miner using Bounded Error)
Guarantees there are no false negatives.
Ensures the true support count of any false positive is at least (σ − ε)N.
SS-MB (Stream Sequence miner using Memory Bounds)
Maximum memory usage after processing any batch is bounded by a user-supplied limit.
Problem Definition
Let I = {i1, i2, …, ij} be a set of j items.
A sequence is an ordered list of items from I, denoted <a1, a2, …, an>.
A sequence <a1, a2, …, ap> is a subsequence of <b1, b2, …, bq> if there exist integers 1 <= k1 < k2 < … < kp <= q such that a1 = bk1, a2 = bk2, …, ap = bkp.
A data stream of sequences is an arbitrarily large list of sequences.
A sequence s contains another sequence s' if s' is a subsequence of s.
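The containment test above is easy to sketch in Python; `is_subsequence` is a hypothetical helper name, not from the paper:

```python
def is_subsequence(sub, seq):
    """True if sub is a subsequence of seq, i.e. seq contains sub.

    Items of sub must appear in seq in the same order, though not
    necessarily contiguously."""
    it = iter(seq)
    # 'item in it' advances the iterator, so relative order is enforced.
    return all(item in it for item in sub)

print(is_subsequence(('a', 'c'), ('a', 'b', 'c')))   # True
print(is_subsequence(('c', 'a'), ('a', 'b', 'c')))   # False: order matters
```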
The count of a sequence s, denoted by count(s), is the number of sequences seen so far that contain s.
The support of a sequence s, also called supp(s), is count(s) / N, where N is the number of sequences seen so far.
If supp(s) >= σ, where σ is a user-supplied minimum support threshold, then s is a (frequent) sequential pattern.
The goal is to find all the frequent sequential patterns in the stream seen so far.
Example:
Given data stream D: S1 = <a,b,c>, S2 = <a,c>, and S3 = <b,c>, with σ = 0.5. The set of sequential patterns and their counts is:
<a>: 2, <b>: 2, <c>: 3, <a,c>: 2, <b,c>: 2
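The example can be checked with a brute-force miner (a minimal sketch; `exact_patterns` is a hypothetical helper, exponential in sequence length and meant for toy data only):

```python
from collections import Counter
from itertools import combinations

def exact_patterns(stream, sigma):
    """Count every distinct subsequence of every sequence, then keep
    those whose support count(s)/N is at least sigma."""
    counts = Counter()
    for seq in stream:
        # all non-empty subsequences of seq, as a set to count each once
        subs = {tuple(seq[i] for i in idx)
                for r in range(1, len(seq) + 1)
                for idx in combinations(range(len(seq)), r)}
        counts.update(subs)
    n = len(stream)
    return {p: c for p, c in counts.items() if c / n >= sigma}

D = [('a', 'b', 'c'), ('a', 'c'), ('b', 'c')]
print(exact_patterns(D, 0.5))
# the five patterns listed above: <a>:2, <b>:2, <c>:3, <a,c>:2, <b,c>:2
```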
The SS-BE Method
Input:
A data stream D = S1, S2, S3, …
Minimum support threshold σ
Significance threshold ε, 0 <= ε < σ
Batch support threshold α, 0 <= α <= ε
Batch length L
Pruning period δ
[Figure: lexicographic tree T0 storing candidate sequences; each node keeps a count plus bookkeeping fields (TID#, batchCount), e.g. a:2, b:1, c:3]
Algorithm Overview:
Break the stream into batches of length L.
For each arriving batch Bk, apply PrefixSpan with minimum support α to find the sequences that are frequent within the batch.
Insert each frequent sequence si (say it has count ci in Bk) into the tree T0, adding ci to the count of the node at the end of si's path.
If a path corresponding to this sequence does not yet exist, create the needed nodes, initializing their counts to ci.
When the number of batches seen is a multiple of the pruning period δ, prune every node satisfying
count + (⌈αL⌉ − 1) · B′ <= ε · B · L
where B is the number of batches seen so far and B′ is the number of batches in which the node's count may have been missed.
When we find that a node can be pruned, the entire subtree rooted at that node is pruned as well.
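The batch loop and pruning rule can be sketched as follows. This is a flat-dictionary sketch, not the paper's implementation: the paper stores patterns in a lexicographic tree with subtree pruning, and mines each batch with PrefixSpan, for which a brute-force `batch_frequent` stands in here. The parameter values in the usage come from the worked example below (ε = 0.5 is inferred from its output threshold):

```python
import math
from collections import Counter
from itertools import combinations

def batch_frequent(batch, minsup):
    """Stand-in for PrefixSpan on one small batch: count the distinct
    subsequences of each sequence, keep those with batch support >= minsup."""
    counts = Counter()
    for seq in batch:
        subs = {tuple(seq[i] for i in idx)
                for r in range(1, len(seq) + 1)
                for idx in combinations(range(len(seq)), r)}
        counts.update(subs)
    need = math.ceil(minsup * len(batch))
    return {p: c for p, c in counts.items() if c >= need}

class SSBE:
    """Flat sketch of SS-BE: pattern -> [count, batchCount]."""

    def __init__(self, sigma, eps, alpha, L, delta):
        self.sigma, self.eps, self.alpha = sigma, eps, alpha
        self.L, self.delta = L, delta
        self.B = 0              # batches processed so far
        self.N = 0              # sequences seen so far
        self.table = {}

    def process_batch(self, batch):
        self.B += 1
        self.N += len(batch)
        for pat, c in batch_frequent(batch, self.alpha).items():
            entry = self.table.setdefault(pat, [0, 0])
            entry[0] += c       # accumulated count
            entry[1] += 1       # batches in which pat was counted
        if self.B % self.delta == 0:
            self._prune()

    def _prune(self):
        miss = math.ceil(self.alpha * self.L) - 1   # max undercount/batch
        bound = self.eps * self.B * self.L
        for pat in list(self.table):
            count, bcount = self.table[pat]
            # B' = B - batchCount: batches where pat may have been missed
            if count + miss * (self.B - bcount) <= bound:
                del self.table[pat]

    def output(self):
        """All patterns with count at least (sigma - eps) * N."""
        thresh = (self.sigma - self.eps) * self.N
        return {p: c for p, (c, _) in self.table.items() if c >= thresh}

B1 = [('a','b','c'), ('a','c'), ('a','b'), ('b','c')]
B2 = [('a','b','c','d'), ('c','a','b'), ('d','a','b'), ('a','e','b')]
miner = SSBE(sigma=0.75, eps=0.5, alpha=0.4, L=4, delta=2)
miner.process_batch(B1)
miner.process_batch(B2)
print(miner.output())   # <a>:7, <b>:7, <c>:5, <a,b>:6
```

On the example stream this reproduces the output slide: <a,c>, <b,c>, and <d> are pruned after the second batch, and the four surviving patterns are returned.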
Finally, suppose the user requests the set of sequential patterns after N sequences have arrived.
Simply traverse the tree, outputting all sequences whose count is at least (σ − ε)N.
There are no false negatives. The false positives are guaranteed to have real support at least σ − ε.
Suppose L = 4, σ = 0.75, ε = 0.5, α = 0.4, and δ = 2.
Data stream D:
<a,b,c> <a,c> <a,b> <b,c> <a,b,c,d> <c,a,b> <d,a,b> <a,e,b>
(The first four sequences form batch B1; the last four form batch B2.)
Apply PrefixSpan to B1 with minimum support α = 0.4. The frequent sequences found and their counts are:
<a>:3, <b>:3, <c>:3, <a,b>:2, <a,c>:2, and <b,c>:2
The algorithm then moves on to B2. The frequent sequences found in B2 are:
<a>:4, <b>:4, <c>:2, <d>:2, and <a,b>:4
Because the pruning period is δ = 2, we must now prune the tree.
For each node, B is the number of batches seen so far (here B = 2) and B′ is the number of batches in which the node's count may have been missed.
We prune all nodes satisfying:
count + B′ · (⌈αL⌉ − 1) <= ε · B · L
The nodes for <a,c>, <b,c>, and <d> are pruned.
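Plugging the example's numbers into this test (a sketch; ε = 0.5 is inferred from the example's output threshold, and the (count, B′) pairs follow from the two batch mining results):

```python
import math

alpha, L, eps, B = 0.4, 4, 0.5, 2        # example parameters
miss = math.ceil(alpha * L) - 1          # max undercount per missed batch = 1
bound = eps * B * L                      # = 4.0

# (count, B') for each node in the tree after batch B2:
nodes = {('a',): (7, 0), ('b',): (7, 0), ('c',): (5, 0), ('a', 'b'): (6, 0),
         ('a', 'c'): (2, 1), ('b', 'c'): (2, 1), ('d',): (2, 1)}
pruned = sorted(p for p, (c, bp) in nodes.items() if c + bp * miss <= bound)
print(pruned)   # [('a', 'c'), ('b', 'c'), ('d',)]
```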
When the user requests the set of sequential patterns, every sequence in the tree with count at least (σ − ε)N is output.
The output sequences and counts are:
<a>: 7, <b>: 7, <c>: 5, <a,b>: 6
There are no false negatives and only one false positive, <c>, whose true support (5/8) is still at least σ − ε.
The SS-MB Method
Input:
A data stream D = S1, S2, S3, …
Minimum support threshold σ
Significance threshold ε, 0 <= ε < σ
Batch length L
Maximum number of nodes in the tree, m
Data structures:
Use a tree T0 to store subsequences seen in the stream.
Use a variable min, initially set to 0.
Algorithm Overview:
Break the stream into batches of length L.
For each arriving batch Bk, apply PrefixSpan with minimum support ε to find the sequences that are frequent within the batch.
Insert each frequent sequence si (say it has count ci in Bk) into T0, adding ci to the count of the node at the end of si's path.
If a path corresponding to this sequence does not yet exist, create the needed nodes, initializing each new node's count to ci + min and recording min as its over-estimation.
After processing each batch, we check whether the number of nodes in the tree exceeds m.
While this is true, we remove from the tree the node of smallest count and set min to that node's count.
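The insertion and eviction steps can be sketched as follows. As with the SS-BE sketch, this is a flat-table approximation under stated assumptions: m bounds table entries rather than tree nodes, a brute-force `batch_frequent` stands in for PrefixSpan, and eviction ties are broken arbitrarily (so a different count-2 node than the slide's <b,c> may be evicted):

```python
import math
from collections import Counter
from itertools import combinations

def batch_frequent(batch, minsup):
    """Stand-in for PrefixSpan on one small batch (exhaustive, toy-sized)."""
    counts = Counter()
    for seq in batch:
        subs = {tuple(seq[i] for i in idx)
                for r in range(1, len(seq) + 1)
                for idx in combinations(range(len(seq)), r)}
        counts.update(subs)
    need = math.ceil(minsup * len(batch))
    return {p: c for p, c in counts.items() if c >= need}

class SSMB:
    """Flat sketch of SS-MB: pattern -> [count, over_estimation]."""

    def __init__(self, sigma, eps, L, m):
        self.sigma, self.eps, self.L, self.m = sigma, eps, L, m
        self.N = 0
        self.min = 0               # count of the most recently evicted node
        self.table = {}

    def process_batch(self, batch):
        self.N += len(batch)
        for pat, c in batch_frequent(batch, self.eps).items():
            if pat in self.table:
                self.table[pat][0] += c
            else:
                # a new entry may have been evicted before, so its count
                # is over-estimated by at most min
                self.table[pat] = [c + self.min, self.min]
        while len(self.table) > self.m:        # enforce the memory bound
            victim = min(self.table, key=lambda p: self.table[p][0])
            self.min = self.table[victim][0]
            del self.table[victim]

    def output(self):
        """Candidate patterns: count exceeding (sigma - eps) * N."""
        t = (self.sigma - self.eps) * self.N
        return {p: c for p, (c, _) in self.table.items() if c > t}

    def guaranteed(self):
        """Certainly frequent: count - over_estimation >= sigma * N."""
        t = self.sigma * self.N
        return {p for p, (c, o) in self.table.items() if c - o >= t}

B1 = [('a','b','c'), ('a','c'), ('a','b'), ('b','c')]
B2 = [('a','b','c','d'), ('c','a','b'), ('d','a','b'), ('a','e','b')]
miner = SSMB(sigma=0.75, eps=0.5, L=4, m=6)   # m = 6 table entries here
miner.process_batch(B1)
miner.process_batch(B2)
print(miner.output())      # <a>:7, <b>:7, <c>:5, <a,b>:6
print(miner.guaranteed())  # <a>, <b>, <a,b> are certainly frequent
```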
Finally, suppose the user requests the set of sequential patterns after N sequences have arrived.
Simply traverse the tree, outputting all sequences whose count exceeds (σ − ε)N.
Nodes having (count − over-estimation) >= σN are guaranteed to be true sequential patterns.
If min <= (σ − ε)N, then the algorithm guarantees there are no false negatives.
Suppose L = 4, σ = 0.75, ε = 0.5, and m = 7.
Data stream D:
<a,b,c> <a,c> <a,b> <b,c> <a,b,c,d> <c,a,b> <d,a,b> <a,e,b>
(The first four sequences form batch B1; the last four form batch B2.)
Apply PrefixSpan to B1 with minimum support ε = 0.5. The frequent sequences found and their counts are:
<a>:3, <b>:3, <c>:3, <a,b>:2, <a,c>:2, and <b,c>:2
The algorithm then moves on to B2. The frequent sequences found in B2 are:
<a>:4, <b>:4, <c>:2, <d>:2, and <a,b>:4
Because there are now 8 nodes in the tree and m = 7, the node of smallest count, sequence <b,c>, is removed, and min is set to this sequence's count, 2.
When the user requests the set of sequential patterns, the tree is traversed and the qualifying sequences are output.
The output sequences and counts are:
<a>: 7, <b>: 7, <c>: 5, <a,b>: 6
Because min = 2 <= (σ − ε)N = 2, the algorithm guarantees that there are no false negatives.
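The no-false-negative condition is a one-line arithmetic check (ε = 0.5 is inferred from (σ − ε)N = 2 with σ = 0.75 and N = 8):

```python
# No-false-negative check for the SS-MB worked example
sigma, eps, N = 0.75, 0.5, 8
min_count = 2                       # count of the evicted node
threshold = (sigma - eps) * N       # = 2.0
print(min_count <= threshold)       # True: no false negatives are possible
```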
Experimental Results
Varying the number of sequences:
In every run, the condition for no false negatives held, so all true sequential patterns were output (on average, the ratio of m to the number of true sequential patterns was 1.115).
Varying the average sequence length:
In every run, the condition for no false negatives held, so all true sequential patterns were output (on average, the ratio of m to the number of true sequential patterns was 1.054).
Each sequence of the stream is inserted into a tree like T0, which is also pruned periodically.
Discussion
The main advantage of SS-BE is that it always guarantees no false negatives while bounding the error on the false positives.
However, no precise relationship between the parameter settings and the memory usage is known in advance, so the user may pick a value for ε that is too large or too small for the memory available.
By exploiting all of the available memory, SS-MB makes the best use of limited resources.
Conclusions
SS-BE always ensures there are no false negatives, while also bounding the support error of the false positives.
SS-MB is only guaranteed to have no false negatives if, at output time, min <= (σ − ε)N.
Our proposed methods are effective solutions to the problem of mining sequential patterns over data streams.
The running time of each algorithm scales linearly as the number of sequences increases.
The maximum memory usage is restricted in both cases through the pruning strategies adopted.
Our experiments show that both methods produce a very small number of false positives.