SLIDE 1 August 21, 2002 VLDB 2002
Gurmeet Singh Manku
Frequency Counts
over Data Streams
Stanford University, USA
SLIDE 2
The Problem ...
Identify all elements whose current frequency exceeds support threshold s = 0.1%.
Stream
SLIDE 3 A Related Problem ...
Stream
Identify all subsets of items whose current frequency exceeds s = 0.1%.
Frequent Itemsets / Association Rules
SLIDE 4
Applications
Flow Identification at IP Router [EV01]
Iceberg Queries [FSGM+98]
Iceberg Datacubes [BR99, HPDW01]
Association Rules & Frequent Itemsets [AS94, SON95, Toi96, Hid99, HPY00, …]
SLIDE 5 Presentation Outline ...
1. Lossy Counting
2. Sticky Sampling
3. Algorithm for Frequent Itemsets
SLIDE 6
Algorithm 1: Lossy Counting
Step 1: Divide the stream into ‘windows’
Is window size a function of support s? Will fix later…
Window 1 Window 2 Window 3
SLIDE 7
Lossy Counting in Action ...
(Figure: the first window of the stream is added to an initially empty table of frequency counts; at the window boundary, all counters are decremented by 1.)
SLIDE 8
Lossy Counting continued ...
(Figure: the next window is added to the current frequency counts; at the window boundary, all counters are decremented by 1.)
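The window scheme on the last two slides can be sketched in a few lines of Python. This is an illustrative sketch only; the function and variable names are mine, not the paper's.

```python
from collections import defaultdict

def lossy_count(stream, epsilon):
    """Basic Lossy Counting sketch: windows of 1/epsilon elements,
    decrement all counters at each window boundary, drop zeroed counters."""
    window_size = int(1 / epsilon)        # window width, per the next slide
    counts = defaultdict(int)
    for i, item in enumerate(stream, start=1):
        counts[item] += 1
        if i % window_size == 0:          # window boundary reached
            for key in list(counts):
                counts[key] -= 1          # decrement every counter by 1
                if counts[key] == 0:
                    del counts[key]       # free counters that hit zero
    return dict(counts)
```

For the stream a a a b with ε = 0.5 (window size 2), the surviving counter for a is 1, undercounting its true frequency of 3 by 2 ≤ εN.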
SLIDE 9 Error Analysis
How much do we undercount?
If current size of stream = N and window-size = 1/ε, then # windows = εN
Frequency error ≤ # windows = εN
Rule of thumb: set ε = 10% of support s
Example: given support threshold s = 1%, set error threshold ε = 0.1%
SLIDE 10
How many counters do we need? Worst case: (1/ε) log(εN) counters [see paper for proof]
Output: elements with counter values exceeding sN − εN
Approximation guarantees:
- Frequencies underestimated by at most εN
- No false negatives
- False positives have true frequency at least sN − εN
SLIDE 11
Enhancements ...
Frequency Errors:
- For counter (X, c), true frequency is in [c, c + εN]
- Trick: remember window-ids. For counter (X, c, w), true frequency is in [c, c + w − 1]
Batch Processing:
- Decrements after k windows
- If (w = 1), no error!
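The window-id trick can be sketched as follows. Names and structure are mine: each counter stores the window w in which it was created, so at most w − 1 occurrences were missed beforehand, and a counter can be dropped at the boundary of window b once c + (w − 1) ≤ b rather than being blindly decremented.

```python
import math

def lossy_count_windowed(stream, epsilon):
    """Lossy Counting with remembered window-ids (illustrative sketch).
    Each counter is [count, w]; true frequency lies in [count, count + w - 1]."""
    window_size = math.ceil(1 / epsilon)
    counters = {}                        # item -> [count, creation window]
    window = 1
    for i, item in enumerate(stream, start=1):
        if item in counters:
            counters[item][0] += 1       # exact counting after creation
        else:
            counters[item] = [1, window]
        if i % window_size == 0:         # window boundary
            for key in list(counters):
                count, w = counters[key]
                if count + (w - 1) <= window:
                    del counters[key]    # cannot be frequent; drop it
            window += 1
    # report each survivor with its frequency interval [c, c + w - 1]
    return {k: (c, c + w - 1) for k, (c, w) in counters.items()}
```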
SLIDE 12 Algorithm 2: Sticky Sampling
Stream
Create counters by sampling; maintain exact counts thereafter.
At what rate should we sample?
(Figure: example stream 34 15 30 28 31 41 23 35 19.)
SLIDE 13 Sticky Sampling contd...
For a finite stream of length N:
Sampling rate = (2/(Nε)) log(1/(sδ))
Same rule of thumb: set ε = 10% of support s
Example: given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01%
Output: elements with counter values exceeding sN − εN
Approximation guarantees (probabilistic; δ = probability of failure):
- Frequencies underestimated by at most εN
- No false negatives
- False positives have true frequency at least sN − εN
Same error guarantees as Lossy Counting, but probabilistic.
SLIDE 14 Sampling rate?
Finite stream of length N: sampling rate = (2/(Nε)) log(1/(sδ))
Infinite stream with unknown N: gradually adjust the sampling rate (see paper for details)
In either case, expected number of counters = (2/ε) log(1/(sδ)), independent of N!
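For a finite stream of known length N, Sticky Sampling can be sketched as below. This is a simplified sketch: natural log is assumed, the rate stays fixed rather than being adjusted gradually, and all names are mine.

```python
import math
import random

def sticky_sampling(stream, s, epsilon, delta, seed=0):
    """Sticky Sampling sketch for a finite stream of known length N.
    An element without a counter gets one with probability `rate`;
    once a counter exists, the element is counted exactly."""
    rng = random.Random(seed)
    N = len(stream)
    rate = min(1.0, (2 / (N * epsilon)) * math.log(1 / (s * delta)))
    counts = {}
    for item in stream:
        if item in counts:
            counts[item] += 1            # exact counting once sampled
        elif rng.random() < rate:
            counts[item] = 1             # counter created by sampling
    # output elements whose counter exceeds (s - epsilon) * N
    return {x: c for x, c in counts.items() if c >= (s - epsilon) * N}
```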
SLIDE 15 Number of counters
(Plot: number of counters vs log10 of stream length N, with support s = 1% and error ε = 0.1%. Sticky Sampling expected: (2/ε) log(1/(sδ)). Lossy Counting worst case: (1/ε) log(εN).)
SLIDE 16
From elements to sets of elements …
SLIDE 17 Frequent Itemsets Problem ...
Stream
Identify all subsets of items whose current frequency exceeds s = 0.1%.
Frequent Itemsets ⇒ Association Rules
SLIDE 18
Three Modules
BUFFER TRIE SUBSET-GEN
SLIDE 19
Module 1: TRIE
Compact representation of frequent itemsets in lexicographic order.
(Figure: a trie over item-ids such as 50 40 30 31 29 32 45 42; sets with frequency counts.)
SLIDE 20 Module 2: BUFFER
Compact representation as a sequence of ints: transactions sorted by item-id, with a bitmap for transaction boundaries.
Window 1 Window 2 Window 3 Window 4 Window 5 Window 6
In Main Memory
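One plausible encoding of this layout is sketched below. This is my own minimal stand-in, not the paper's actual byte format: item-ids of all transactions concatenated into one flat list of ints, each transaction sorted by item-id, plus a bitmap marking where each transaction starts.

```python
def pack_buffer(transactions):
    """Sketch of a BUFFER-style layout: flat int sequence + boundary bitmap."""
    flat, starts = [], []
    for txn in transactions:
        starts.append(len(flat))       # position where this transaction begins
        flat.extend(sorted(txn))       # transactions sorted by item-id
    start_set = set(starts)
    bitmap = [1 if i in start_set else 0 for i in range(len(flat))]
    return flat, bitmap
```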
SLIDE 21 Module 3: SUBSET-GEN
Generates the subsets of the transactions in BUFFER, with frequency counts, in lexicographic order.
(Figure: example frequency counts 3 3 3 4 2 2 1 2 1 3 1 1.)
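Subset generation in lexicographic order can be sketched recursively. Illustrative only; the real SUBSET-GEN module is a heavily optimized routine (see paper), and it caps subset size via the pruning rules rather than enumerating everything.

```python
def gen_subsets(transaction):
    """Emit all non-empty item subsets of a transaction in lexicographic
    order (each subset is a tuple of sorted item-ids)."""
    items = sorted(transaction)
    out = []
    def rec(prefix, start):
        for i in range(start, len(items)):
            cur = prefix + (items[i],)
            out.append(cur)            # visit prefix before its extensions
            rec(cur, i + 1)
    rec((), 0)
    return out
```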
SLIDE 22
Overall Algorithm ...
BUFFER → SUBSET-GEN → TRIE → new TRIE
Problem: the number of subsets is exponential!
SLIDE 23 SUBSET-GEN Pruning Rules
A-priori Pruning Rule: if set S is infrequent, every superset of S is infrequent. (See paper for details.)
Lossy Counting Pruning Rule: at each 'window boundary', decrement TRIE counters by 1. Actually, 'Batch Deletion': at each 'main memory buffer' boundary, decrement all TRIE counters by b.
SLIDE 24
Bottlenecks ...
BUFFER → SUBSET-GEN → TRIE → new TRIE
TRIE consumes main memory; SUBSET-GEN consumes CPU time.
SLIDE 25 Design Decisions for Performance
TRIE (main memory bottleneck):
- Compact linear array of (element, counter, level) triples in preorder traversal; no pointers!
- Tries are on disk, accessed via mmap() and madvise(); all of main memory devoted to BUFFER
- Pair of tries (current and new)
SUBSET-GEN (CPU bottleneck): very fast implementation; see paper for details.
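The pointer-free trie layout can be illustrated with a small flattening routine. The nested-dict input format here is my own stand-in for a trie; the point is that a preorder walk emitting (element, counter, level) triples recovers the tree structure without any pointers.

```python
def trie_to_array(trie):
    """Flatten a nested trie {item: (count, children)} into a pointer-free
    list of (element, counter, level) triples in preorder."""
    out = []
    def walk(node, level):
        for item in sorted(node):          # children in item-id order
            count, children = node[item]
            out.append((item, count, level))
            walk(children, level + 1)      # descendants follow their parent
    walk(trie, 0)
    return out
```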
SLIDE 26
Experiments ...
IBM synthetic dataset T10.I4.1000K: N = 1 million, avg transaction size = 10, input size = 49 MB
IBM synthetic dataset T15.I6.1000K: N = 1 million, avg transaction size = 15, input size = 69 MB
Frequent word pairs in 100K web documents: N = 100K, avg transaction size = 134, input size = 54 MB
Frequent word pairs in 806K Reuters newsreports: N = 806K, avg transaction size = 61, input size = 210 MB
SLIDE 27 What do we study?
For each dataset: support threshold s, length of stream N, BUFFER size B, time taken t
Set ε = 10% of support s
Three independent variables: fix one and vary two; measure the time taken.
SLIDE 28 Varying support s and BUFFER B
Fixed: stream length N. Varying: BUFFER size B and support threshold s.
(Plots: time in seconds vs BUFFER size in MB. IBM 1M transactions: s = 0.001, 0.002, 0.004, 0.008. Reuters 806K docs: s = 0.004, 0.008, 0.012, 0.016, 0.020.)
SLIDE 29 Varying length N and support s
Fixed: BUFFER size B. Varying: stream length N and support threshold s.
(Plots: time in seconds vs length of stream in thousands, for IBM 1M transactions and Reuters 806K docs; s = 0.001, 0.002, 0.004 in both.)
SLIDE 30 Varying BUFFER B and support s
Fixed: stream length N. Varying: BUFFER size B and support threshold s.
(Plots: time in seconds vs support threshold s, for IBM 1M transactions and Reuters 806K docs; B = 4, 16, 28, 40 MB in both.)
SLIDE 31 Comparison with fast A-priori

Dataset: IBM T10.I4.1000K with 1M transactions, average size 10.

Support | APriori        | Our Algorithm (4 MB buffer) | Our Algorithm (44 MB buffer)
        | Memory  Time   | Memory  Time                | Memory  Time
0.001   | 82 MB   99 s   | 12 MB   111 s               | 45 MB   27 s
0.002   | 53 MB   25 s   | 10 MB   94 s                | 45 MB   15 s
0.004   | 48 MB   14 s   | 7 MB    65 s                | 45 MB   8 s
0.006   | 48 MB   13 s   | 6 MB    46 s                | 45 MB   6 s
0.008   | 48 MB   13 s   | 5 MB    34 s                | 45 MB   4 s
0.010   | 48 MB   14 s   | 5 MB    26 s                | 45 MB   4 s

A-priori implementation by Christian Borgelt:
http://fuzzy.cs.uni-magdeburg.de/~borgelt/software.html
SLIDE 32
Comparison with Iceberg Queries
Query: Identify all word pairs in 100K web documents which co-occur in at least 0.5% of the documents.
[FSGM+98] multiple-pass algorithm: 7000 seconds with 30 MB memory
Our single-pass algorithm: 4500 seconds with 26 MB memory
Our algorithm would be much faster if allowed multiple passes!
SLIDE 33
Lessons Learnt ...
- Optimizing for # passes is wrong!
- Small support s ⇒ too many frequent itemsets! Time to redefine the problem itself?
- Interesting combination of Theory and Systems.
SLIDE 34
Work in Progress ...
- Frequency Counts over Sliding Windows
- Multiple-pass Algorithm for Frequent Itemsets
- Iceberg Datacubes
SLIDE 35
Summary
- Lossy Counting: a practical algorithm for online frequency counting.
- First-ever single-pass algorithm for Association Rules with user-specified error guarantees.
- Basic algorithm applicable to several problems.