Frequency Counts over Data Streams - PowerPoint PPT Presentation



SLIDE 1

Frequency Counts over Data Streams

Gurmeet Singh Manku
Stanford University, USA

August 21, 2002, VLDB 2002

SLIDE 2

The Problem ...

Stream: Identify all elements whose current frequency exceeds support threshold s = 0.1%.

SLIDE 3

A Related Problem ...

Stream: Identify all subsets of items whose current frequency exceeds s = 0.1%.

Frequent Itemsets / Association Rules

SLIDE 4

Applications

  • Flow Identification at IP Router [EV01]
  • Iceberg Queries [FSGM+98]
  • Iceberg Datacubes [BR99, HPDW01]
  • Association Rules & Frequent Itemsets [AS94, SON95, Toi96, Hid99, HPY00, ...]

SLIDE 5

Presentation Outline ...

  • 1. Lossy Counting
  • 2. Sticky Sampling
  • 3. Algorithm for Frequent Itemsets
SLIDE 6

Algorithm 1: Lossy Counting

Step 1: Divide the stream into ‘windows’: Window 1, Window 2, Window 3, ...

Is window size a function of support s? Will fix later...

SLIDE 7

Lossy Counting in Action ...

First window: starting from empty frequency counts, add the counts of the first window's elements. At the window boundary, decrement all counters by 1.
SLIDE 8

Lossy Counting continued ...

Next window: add its element counts to the existing frequency counts. At the window boundary, decrement all counters by 1, dropping any counter that reaches zero.

SLIDE 9

Error Analysis

If current size of stream = N and window size = 1/e, then # windows = eN.

How much do we undercount? Each counter is decremented at most once per window boundary, so frequency error ≤ # windows = eN.

Rule of thumb: set e = 10% of support s.
Example: given support threshold s = 1%, set error threshold e = 0.1%.

SLIDE 10

How many counters do we need? Worst case: 1/e log(eN) counters. [See paper for proof.]

Output: elements with counter values exceeding sN - eN.

Approximation guarantees:
  • Frequencies underestimated by at most eN
  • No false negatives
  • False positives have true frequency at least sN - eN
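The last few slides can be condensed into a minimal Python sketch of Lossy Counting (window size 1/e, decrement at each boundary, output threshold sN - eN). The function name and the dict representation are illustrative choices, not the paper's implementation:

```python
from math import ceil

def lossy_counting(stream, s, e):
    """One-pass approximate frequency counting.

    s: support threshold; e: error threshold (rule of thumb: e = 0.1 * s).
    Window size is 1/e; at each window boundary, counters that would hit
    zero are dropped and the rest are decremented by 1.
    """
    window = ceil(1 / e)
    counts = {}                  # element -> counter
    n = 0                        # stream length seen so far
    for x in stream:
        n += 1
        counts[x] = counts.get(x, 0) + 1
        if n % window == 0:      # window boundary
            for key in [k for k, c in counts.items() if c <= 1]:
                del counts[key]  # counter would reach 0: drop it
            for key in counts:
                counts[key] -= 1
    # Output: elements with counter values exceeding sN - eN
    return {x for x, c in counts.items() if c > (s - e) * n}
```

Because an element loses at most one count per window and there are eN windows, every reported count undershoots the true frequency by at most eN, matching the guarantees above.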

SLIDE 11

Enhancements ...

Frequency Errors: for counter (X, c), true frequency is in [c, c+eN]. Trick: remember window-ids. For counter (X, c, w), true frequency is in [c, c+w-1]. If w = 1, no error!

Batch Processing: perform decrements only after every k windows.

SLIDE 12

Algorithm 2: Sticky Sampling

Stream: 34 15 30 28 31 41 23 35 19 ...

Create counters by sampling. Maintain exact counts thereafter.

At what rate should we sample?

SLIDE 13

Sticky Sampling contd...

For a finite stream of length N: sampling rate = 2/(Ne) log 1/(sδ), where δ = probability of failure.

Same rule of thumb: set e = 10% of support s.
Example: given support threshold s = 1%, set error threshold e = 0.1% and failure probability δ = 0.01%.

Output: elements with counter values exceeding sN - eN. Same error guarantees as Lossy Counting, but probabilistic.

Approximation guarantees (probabilistic):
  • Frequencies underestimated by at most eN
  • No false negatives
  • False positives have true frequency at least sN - eN
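The finite-stream case can be sketched in Python. The sampling probability 2/(Ne) log 1/(sδ) is taken from the slide; the function name and the simplification of sampling only at an element's first occurrence (rather than the paper's gradual rate adjustment for unknown N) are my own:

```python
import math
import random

def sticky_sampling(stream, n, s, e, delta, seed=0):
    """Sketch of Sticky Sampling for a finite stream of known length n."""
    rng = random.Random(seed)
    # Probability of creating a counter for a not-yet-tracked element
    p = min(1.0, 2.0 / (n * e) * math.log(1.0 / (s * delta)))
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1       # maintain exact counts thereafter
        elif rng.random() < p:
            counts[x] = 1        # create a counter by sampling
    # Output: elements with counter values exceeding sN - eN
    return {x for x, c in counts.items() if c > (s - e) * n}
```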

SLIDE 14

Sampling rate?

Finite stream of length N: sampling rate = 2/(Ne) log 1/(sδ).

Infinite stream with unknown N: gradually adjust the sampling rate (see paper for details).

In either case, expected number of counters = 2/e log 1/(sδ), independent of N!

SLIDE 15

Number of counters

[Plot: number of counters vs. log10 of stream length N, for support s = 1% and error e = 0.1%.]

Sticky Sampling expected: 2/e log 1/(sδ). Lossy Counting worst case: 1/e log(eN).
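The two space bounds being plotted can be checked numerically; the parameter values come from the preceding slides, while the crossover calculation in the comment is my own arithmetic:

```python
import math

s, e, delta = 0.01, 0.001, 0.0001   # support 1%, error 0.1%, failure 0.01%

def sticky_expected():
    # Sticky Sampling: expected counters, independent of stream length N
    return 2 / e * math.log(1 / (s * delta))

def lossy_worst_case(n):
    # Lossy Counting: worst-case counters, growing only as log(eN)
    return 1 / e * math.log(e * n)

# Sticky Sampling's bound is flat at about 27631 counters, while Lossy
# Counting's worst case stays below it until the curves cross at
# eN = 1/(s*delta)**2, i.e. N around 1e15 for these parameters.
```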

SLIDE 16

From elements to sets of elements ...

SLIDE 17

Frequent Itemsets Problem ...

Stream: Identify all subsets of items whose current frequency exceeds s = 0.1%.

Frequent Itemsets => Association Rules

SLIDE 18

Three Modules

BUFFER, TRIE, SUBSET-GEN

SLIDE 19

Module 1: TRIE

Compact representation of frequent itemsets in lexicographic order: sets with frequency counts. [Figure: example trie over items 50, 40, 30, 31, 29, 32, 45, 42.]

SLIDE 20

Module 2: BUFFER

In main memory, holds a batch of windows (Window 1 through Window 6): compact representation as a sequence of ints, transactions sorted by item-id, with a bitmap for transaction boundaries.
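A toy sketch of this encoding, with Python lists standing in for the packed int array and bitmap (my own simplification, not the paper's layout):

```python
def pack_buffer(transactions):
    """Pack transactions into one flat int sequence plus a boundary bitmap.

    Items within each transaction are sorted by item-id; bit 1 marks the
    first item of each transaction, so no extra ints are spent on lengths.
    """
    flat, boundaries = [], []
    for t in transactions:
        for i, item in enumerate(sorted(t)):
            flat.append(item)
            boundaries.append(1 if i == 0 else 0)
    return flat, boundaries
```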

SLIDE 21

Module 3: SUBSET-GEN

Generates subsets of the transactions in BUFFER, with their frequency counts, in lexicographic order. [Figure: example frequency counts 3 3 3 4 2 2 1 2 1 3 1 1.]

SLIDE 22

Overall Algorithm ...

BUFFER → SUBSET-GEN (frequency counts) → TRIE → new TRIE

Problem: the number of subsets is exponential!

SLIDE 23

SUBSET-GEN Pruning Rules

A-priori Pruning Rule: if set S is infrequent, every superset of S is infrequent. See paper for details ...

Lossy Counting Pruning Rule: at each ‘window boundary’, decrement TRIE counters by 1. Actually, ‘batch deletion’: at each ‘main memory buffer’ boundary, decrement all TRIE counters by b.
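A hedged sketch of the a-priori rule in action: generating a transaction's candidate subsets in lexicographic order while skipping any set with an infrequent subset. A plain Python set of tuples stands in for the TRIE, and `gen_subsets`/`max_size` are illustrative names, not the paper's API:

```python
def gen_subsets(transaction, frequent, max_size=3):
    """Yield subsets (as sorted tuples) whose every (len-1)-subset is frequent.

    frequent: set of itemsets (tuples) currently believed frequent.
    """
    items = sorted(transaction)
    out = []
    def extend(prefix, start):
        for i in range(start, len(items)):
            cand = prefix + (items[i],)
            # a-priori: prune cand if any of its (len-1)-subsets is infrequent
            if len(cand) > 1 and any(
                cand[:j] + cand[j + 1:] not in frequent for j in range(len(cand))
            ):
                continue
            out.append(cand)
            if len(cand) < max_size:
                extend(cand, i + 1)   # only extensions of surviving sets
    extend((), 0)
    return out
```

Because pruned sets are never extended, entire exponential subtrees of the subset lattice are skipped, which is what keeps SUBSET-GEN's output manageable.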

SLIDE 24

Bottlenecks ...

BUFFER → SUBSET-GEN → TRIE → new TRIE

SUBSET-GEN consumes CPU time; the TRIEs consume main memory.

SLIDE 25

Design Decisions for Performance

TRIE (main memory bottleneck):
  • Compact linear array of (element, counter, level) triples in preorder traversal; no pointers!
  • Tries are on disk; all of main memory is devoted to BUFFER.
  • Pair of tries, old and new (in chunks), accessed via mmap() and madvise().

SUBSET-GEN (CPU bottleneck): very fast implementation; see paper for details.
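The pointer-free layout can be illustrated with a toy encoder. The (element, counter, level) preorder triples are from the slide; the nested-dict input format and the `flatten` name are my own:

```python
def flatten(trie, level=0):
    """Encode a nested {item: (count, children)} trie as a flat preorder
    array of (element, counter, level) triples. No pointers are needed:
    a node's subtree is exactly the run of following entries whose level
    is greater than the node's own."""
    out = []
    for item in sorted(trie):
        count, children = trie[item]
        out.append((item, count, level))
        out.extend(flatten(children, level + 1))
    return out
```

A flat array like this can be scanned sequentially, written in chunks, and memory-mapped from disk, which is what makes the mmap()/madvise() scheme above workable.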

SLIDE 26

Experiments ...

  • IBM synthetic dataset T10.I4.1000K: N = 1 million, avg transaction size = 10, input size = 49MB
  • IBM synthetic dataset T15.I6.1000K: N = 1 million, avg transaction size = 15, input size = 69MB
  • Frequent word pairs in 100K web documents: N = 100K, avg transaction size = 134, input size = 54MB
  • Frequent word pairs in 806K Reuters news reports: N = 806K, avg transaction size = 61, input size = 210MB

SLIDE 27

What do we study?

For each dataset: support threshold s, length of stream N, BUFFER size B, time taken t. Set e = 10% of support s.

Three independent variables: fix one and vary two; measure time taken.

SLIDE 28

Varying support s and BUFFER B

Fixed: stream length N. Varying: BUFFER size B and support threshold s.

[Plots: time in seconds vs. BUFFER size in MB, for IBM 1M transactions (s = 0.001, 0.002, 0.004, 0.008) and Reuters 806K docs (s = 0.004, 0.008, 0.012, 0.016, 0.020).]

SLIDE 29

Varying length N and support s

Fixed: BUFFER size B. Varying: stream length N and support threshold s.

[Plots: time in seconds vs. length of stream in thousands, for IBM 1M transactions and Reuters 806K docs (s = 0.001, 0.002, 0.004 each).]

SLIDE 30

Varying BUFFER B and support s

Fixed: stream length N. Varying: BUFFER size B and support threshold s.

[Plots: time in seconds vs. support threshold s, for IBM 1M transactions and Reuters 806K docs (B = 4, 16, 28, 40 MB each).]

SLIDE 31

Comparison with fast A-priori

Support | Our Algorithm (44MB buffer) | Our Algorithm (4MB buffer) | APriori
        | Memory    Time              | Memory    Time             | Memory    Time
0.001   | 45 MB     27 s              | 12 MB     111 s            | 82 MB     99 s
0.002   | 45 MB     15 s              | 10 MB     94 s             | 53 MB     25 s
0.004   | 45 MB     8 s               | 7 MB      65 s             | 48 MB     14 s
0.006   | 45 MB     6 s               | 6 MB      46 s             | 48 MB     13 s
0.008   | 45 MB     4 s               | 5 MB      34 s             | 48 MB     13 s
0.010   | 45 MB     4 s               | 5 MB      26 s             | 48 MB     14 s

Dataset: IBM T10.I4.1000K with 1M transactions, average size 10. A-priori by Christian Borgelt:

http://fuzzy.cs.uni-magdeburg.de/~borgelt/software.html

SLIDE 32

Comparison with Iceberg Queries

Query: Identify all word pairs in 100K web documents which co-occur in at least 0.5% of the documents.

[FSGM+98] multiple-pass algorithm: 7000 seconds with 30 MB memory.
Our single-pass algorithm: 4500 seconds with 26 MB memory.

Our algorithm would be much faster if allowed multiple passes!

SLIDE 33

Lessons Learnt ...

  • Optimizing for # of passes is wrong!
  • Small support s ⇒ too many frequent itemsets! Time to redefine the problem itself?
  • Interesting combination of Theory and Systems.

SLIDE 34

Work in Progress ...

  • Frequency counts over sliding windows
  • Multiple-pass algorithm for frequent itemsets
  • Iceberg datacubes

SLIDE 35

Summary

  • Lossy Counting: a practical algorithm for online frequency counting.
  • First single-pass algorithm for association rules with user-specified error guarantees.
  • Basic algorithm applicable to several problems.