Mining Frequent Itemsets in a Stream
Toon Calders, TU/e (joint work with Bart Goethals and Nele Dexters, UAntwerpen)
Outline
Motivation
Max-Frequency
Algorithm
  for one itemset
  mining all frequent itemsets
Experiments
Conclusion
Motivation
Model: at every timestamp an itemset arrives.
Goal: find sets of items that frequently occur together.
Take history into account, yet recognize sudden bursts quickly.
Most definitions of frequency rely on parameters: sliding window length, decay factor, …
The correct parameter setting is hard to find and can be different for different items.
Max-Frequency
$$\mathrm{mfreq}(I, S) := \max_{k = 1, \ldots, |S|} \mathrm{freq}(I, \mathrm{last}(k, S))$$
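To make the definition concrete, here is a minimal brute-force sketch that evaluates it directly. The function names `freq` and `mfreq` mirror the formula; representing the stream as a Python list of sets and returning exact `Fraction` values are assumptions made for illustration, and the stream is assumed to be non-empty.

```python
from fractions import Fraction


def freq(itemset, window):
    """Fraction of transactions in `window` that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for transaction in window if itemset <= frozenset(transaction))
    return Fraction(hits, len(window))


def mfreq(itemset, stream):
    """Max-frequency: the best frequency over all windows last(k, stream), k = 1..|stream|.

    Assumes a non-empty stream (illustrative sketch, not the summary-based algorithm).
    """
    return max(freq(itemset, stream[-k:]) for k in range(1, len(stream) + 1))


# Hypothetical example stream: one itemset per timestamp.
stream = [{"a"}, {"a", "b"}, {"b"}, {"a"}, {"b"}, {"b"}]
print(mfreq({"a"}, stream))  # 1/2, reached by the window covering the whole stream
print(mfreq({"b"}, stream))  # 1, reached by the window of length 1
```

This direct evaluation inspects every window on every query, which is exactly the cost a compact summary is meant to avoid.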
Algorithm
But not every point needs to be checked: only some special points, the borders.
a a a a b b b a b b a b a b a b a b b b b| a a b a b b
[Figure: table of timestamps and # targets for this stream]
Target set: a. Is the marked position a border?
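Let a1 be the number of targets in a block of length l1 that ends just before position p, and a2 the number of targets in a block of length l2 that starts at p.
If a1/l1 ≥ a2/l2, position p is never a border again!
This is a very powerful pruning criterion, and it is true in general.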
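The criterion can be checked by brute force on a small stream. The sketch below assumes a stream given as a list of symbols with 0-based positions, and tests whether any block ending just before `p` has a frequency at least as high as some block starting at `p`. The name `can_prune` and the example stream are hypothetical; the comparison a1/l1 ≥ a2/l2 is done by cross-multiplication to avoid division.

```python
def count(stream, lo, hi, target="a"):
    """Number of targets in stream[lo:hi]."""
    return sum(1 for x in stream[lo:hi] if x == target)


def can_prune(stream, p, target="a"):
    """True if some block ending just before p has frequency >= some block starting at p."""
    n = len(stream)
    for j in range(p):                      # before-block: stream[j:p]
        a1, l1 = count(stream, j, p, target), p - j
        for k in range(p + 1, n + 1):       # after-block: stream[p:k]
            a2, l2 = count(stream, p, k, target), k - p
            if a1 * l2 >= a2 * l1:          # a1/l1 >= a2/l2
                return True
    return False


stream = list("abbaab")
print([p for p in range(len(stream)) if not can_prune(stream, p)])  # [0, 3]
```

Once such a pair of blocks exists it stays in the stream as new itemsets arrive, which is why a pruned position never has to be reconsidered.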
The summary only keeps counts for the borders.
Frequencies in the summary are always increasing.
Thus: the max-frequency is in the last cell.
Block with the largest frequency before a border: starts at the previous border.
When a new itemset arrives, the summary is updated:
  borders need to be checked again
  no new « before » blocks
  maximal block before: always the previous border
The new position is a border if and
Only keep entries for borders.
Get max-frequency: access the last cell only.
Update summary:
  if target: add new entry
  if non-target: check borders
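The steps above can be sketched as a small class. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the class name `MaxFreqSummary`, the entry layout (start position, number of targets seen before it), and the choice to add an entry only for a target that follows a non-target (the start of a run of targets dominates the later positions in that run) are all assumptions. A border is dropped when the block between it and the previous kept border satisfies a1/l1 ≥ a2/l2.

```python
from fractions import Fraction


class MaxFreqSummary:
    """Border summary for one itemset (illustrative sketch)."""

    def __init__(self):
        self.n = 0                # number of transactions seen so far
        self.total = 0            # number of targets seen so far
        self.prev_target = False  # was the previous transaction a target?
        self.entries = []         # (start position, targets seen before start)

    def update(self, is_target):
        self.n += 1
        if is_target:
            # Assumption: only the first target of a run gets its own entry.
            if not self.prev_target:
                self.entries.append((self.n, self.total))
            self.total += 1
        else:
            # Check the borders again, oldest first, against the previous kept border.
            kept = []
            for start, before in self.entries:
                if kept:
                    p_start, p_before = kept[-1]
                    a1, l1 = before - p_before, start - p_start      # block before
                    a2, l2 = self.total - before, self.n - start + 1  # block after
                    if a1 * l2 >= a2 * l1:  # a1/l1 >= a2/l2: never a border again
                        continue
                kept.append((start, before))
            self.entries = kept
        self.prev_target = is_target

    def mfreq(self):
        """Max-frequency, read from the last cell only."""
        if not self.entries:
            return Fraction(0)
        start, before = self.entries[-1]
        return Fraction(self.total - before, self.n - start + 1)


s = MaxFreqSummary()
for symbol in "ababb":       # hypothetical stream, target set: a
    s.update(symbol == "a")
print(s.mfreq())             # 2/5
```

Storing the running total and the count before each border means a target arrival touches at most one entry; only a non-target arrival walks the summary.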
We are only interested in itemsets that are frequent.
We can throw away any border with a
We only need to maintain the summaries for the frequent itemsets.
This can still be a lot, though: every subset of the most recent transaction …
A minimal window length reduces this problem.
Future work: reduce this number; rely, e.g., on approximate counts.
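The size of the problem is easy to see in a short sketch: the window of length 1 contains only the most recent transaction, so every non-empty subset of that transaction has max-frequency 1 and needs a summary. The helper name and the example transaction below are hypothetical.

```python
from itertools import combinations


def subsets_of_last_transaction(transaction):
    """All non-empty subsets of the most recent transaction."""
    items = sorted(transaction)
    return [
        frozenset(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


last = {"a", "b", "c", "d"}
subsets = subsets_of_last_transaction(last)
print(len(subsets))  # 15, i.e. 2**4 - 1
```

The count grows as 2^n - 1 in the transaction length n, which is why a minimal window length or approximate counts become attractive.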
Experiments
Size of the summaries
  number of borders for random data: average and maximal number of borders
Theoretical worst case
Conclusion
New frequency measure
Summary for one itemset: small, easy to maintain
Mining all frequent itemsets