SLIDE 1 August 21, 2002 VLDB 2002
Gurmeet Singh Manku
Frequency Counts
over Data Streams
Stanford University, USA
SLIDE 2
The Problem ...
Identify all elements whose current frequency exceeds support threshold s = 0.1%.
Stream
SLIDE 3 A Related Problem ...
Stream
Identify all subsets of items whose current frequency exceeds s = 0.1%.
Frequent Itemsets / Association Rules
SLIDE 4
Applications
Flow Identification at IP Router [EV01]
Iceberg Queries [FSGM+98]
Iceberg Datacubes [BR99, HPDW01]
Association Rules & Frequent Itemsets [AS94, SON95, Toi96, Hid99, HPY00, …]
SLIDE 5 Presentation Outline ...
1. Lossy Counting
2. Sticky Sampling
3. Algorithm for Frequent Itemsets
SLIDE 6
Algorithm 1: Lossy Counting
Step 1: Divide the stream into ‘windows’
Is window size a function of support s? Will fix later…
Window 1 Window 2 Window 3
SLIDE 7
Lossy Counting in Action ...
(Figure: the first window of the stream is added to an initially empty table of frequency counts; at the window boundary, all counters are decremented by 1.)
SLIDE 8
Lossy Counting continued ...
(Figure: the next window is added to the current frequency counts; at the window boundary, all counters are decremented by 1.)
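The window scheme on the last two slides can be sketched in a few lines of Python. This is an illustrative sketch only; the function and variable names are mine, not the paper's.

```python
from collections import defaultdict

def lossy_count(stream, epsilon):
    """Basic Lossy Counting sketch: windows of 1/epsilon elements,
    decrement all counters at each window boundary, drop zeroed counters."""
    window_size = int(1 / epsilon)        # window width, per the next slide
    counts = defaultdict(int)
    for i, item in enumerate(stream, start=1):
        counts[item] += 1
        if i % window_size == 0:          # window boundary reached
            for key in list(counts):
                counts[key] -= 1          # decrement every counter by 1
                if counts[key] == 0:
                    del counts[key]       # free counters that hit zero
    return dict(counts)
```

For the stream a a a b with ε = 0.5 (window size 2), the surviving counter for a is 1, undercounting its true frequency of 3 by 2 ≤ εN.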
SLIDE 9 Error Analysis
How much do we undercount?
If current size of stream = N and window-size = 1/ε, then # windows = εN
Frequency error ≤ # windows = εN
Rule of thumb: set ε = 10% of support s
Example: given support threshold s = 1%, set error threshold ε = 0.1%
SLIDE 10
How many counters do we need? Worst case: (1/ε) log(εN) counters [see paper for proof]
Output: elements with counter values exceeding sN − εN
Approximation guarantees:
- Frequencies underestimated by at most εN
- No false negatives
- False positives have true frequency at least sN − εN
SLIDE 11
Enhancements ...
Frequency Errors:
- For counter (X, c), true frequency is in [c, c + εN]
- Trick: remember window-ids. For counter (X, c, w), true frequency is in [c, c + w − 1]
Batch Processing:
- Decrements after k windows
- If (w = 1), no error!
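The window-id trick can be sketched as follows. Names and structure are mine: each counter stores the window w in which it was created, so at most w − 1 occurrences were missed beforehand, and a counter can be dropped at the boundary of window b once c + (w − 1) ≤ b rather than being blindly decremented.

```python
import math

def lossy_count_windowed(stream, epsilon):
    """Lossy Counting with remembered window-ids (illustrative sketch).
    Each counter is [count, w]; true frequency lies in [count, count + w - 1]."""
    window_size = math.ceil(1 / epsilon)
    counters = {}                        # item -> [count, creation window]
    window = 1
    for i, item in enumerate(stream, start=1):
        if item in counters:
            counters[item][0] += 1       # exact counting after creation
        else:
            counters[item] = [1, window]
        if i % window_size == 0:         # window boundary
            for key in list(counters):
                count, w = counters[key]
                if count + (w - 1) <= window:
                    del counters[key]    # cannot be frequent; drop it
            window += 1
    # report each survivor with its frequency interval [c, c + w - 1]
    return {k: (c, c + w - 1) for k, (c, w) in counters.items()}
```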
SLIDE 12 Algorithm 2: Sticky Sampling
Stream
Create counters by sampling; maintain exact counts thereafter.
At what rate should we sample?
(Figure: example stream 34 15 30 28 31 41 23 35 19.)
SLIDE 13 Sticky Sampling contd...
For a finite stream of length N:
Sampling rate = (2/(Nε)) log(1/(sδ))
Same rule of thumb: set ε = 10% of support s
Example: given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01%
Output: elements with counter values exceeding sN − εN
Approximation guarantees (probabilistic; δ = probability of failure):
- Frequencies underestimated by at most εN
- No false negatives
- False positives have true frequency at least sN − εN
Same error guarantees as Lossy Counting, but probabilistic.
SLIDE 14 Sampling rate?
Finite stream of length N: sampling rate = (2/(Nε)) log(1/(sδ))
Infinite stream with unknown N: gradually adjust the sampling rate (see paper for details)
In either case, expected number of counters = (2/ε) log(1/(sδ)), independent of N!
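For a finite stream of known length N, Sticky Sampling can be sketched as below. This is a simplified sketch: natural log is assumed, the rate stays fixed rather than being adjusted gradually, and all names are mine.

```python
import math
import random

def sticky_sampling(stream, s, epsilon, delta, seed=0):
    """Sticky Sampling sketch for a finite stream of known length N.
    An element without a counter gets one with probability `rate`;
    once a counter exists, the element is counted exactly."""
    rng = random.Random(seed)
    N = len(stream)
    rate = min(1.0, (2 / (N * epsilon)) * math.log(1 / (s * delta)))
    counts = {}
    for item in stream:
        if item in counts:
            counts[item] += 1            # exact counting once sampled
        elif rng.random() < rate:
            counts[item] = 1             # counter created by sampling
    # output elements whose counter exceeds (s - epsilon) * N
    return {x: c for x, c in counts.items() if c >= (s - epsilon) * N}
```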
SLIDE 15 Number of counters
(Plot: number of counters vs log10 of stream length N, with support s = 1% and error ε = 0.1%. Sticky Sampling expected: (2/ε) log(1/(sδ)). Lossy Counting worst case: (1/ε) log(εN).)
SLIDE 16
From elements to sets of elements …
SLIDE 17 Frequent Itemsets Problem ...
Stream
Identify all subsets of items whose current frequency exceeds s = 0.1%.
Frequent Itemsets ⇒ Association Rules
SLIDE 18
Three Modules
BUFFER TRIE SUBSET-GEN
SLIDE 19
Module 1: TRIE
Compact representation of frequent itemsets in lexicographic order.
(Figure: a trie over item-ids such as 50 40 30 31 29 32 45 42; sets with frequency counts.)
SLIDE 20 Module 2: BUFFER
Compact representation as a sequence of ints: transactions sorted by item-id, with a bitmap for transaction boundaries.
Window 1 Window 2 Window 3 Window 4 Window 5 Window 6
In Main Memory
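One plausible encoding of this layout is sketched below. This is my own minimal stand-in, not the paper's actual byte format: item-ids of all transactions concatenated into one flat list of ints, each transaction sorted by item-id, plus a bitmap marking where each transaction starts.

```python
def pack_buffer(transactions):
    """Sketch of a BUFFER-style layout: flat int sequence + boundary bitmap."""
    flat, starts = [], []
    for txn in transactions:
        starts.append(len(flat))       # position where this transaction begins
        flat.extend(sorted(txn))       # transactions sorted by item-id
    start_set = set(starts)
    bitmap = [1 if i in start_set else 0 for i in range(len(flat))]
    return flat, bitmap
```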
SLIDE 21 Module 3: SUBSET-GEN
Generates the subsets of the transactions in BUFFER, with frequency counts, in lexicographic order.
(Figure: example frequency counts 3 3 3 4 2 2 1 2 1 3 1 1.)
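Subset generation in lexicographic order can be sketched recursively. Illustrative only; the real SUBSET-GEN module is a heavily optimized routine (see paper), and it caps subset size via the pruning rules rather than enumerating everything.

```python
def gen_subsets(transaction):
    """Emit all non-empty item subsets of a transaction in lexicographic
    order (each subset is a tuple of sorted item-ids)."""
    items = sorted(transaction)
    out = []
    def rec(prefix, start):
        for i in range(start, len(items)):
            cur = prefix + (items[i],)
            out.append(cur)            # visit prefix before its extensions
            rec(cur, i + 1)
    rec((), 0)
    return out
```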
SLIDE 22
Overall Algorithm ...
BUFFER → SUBSET-GEN → TRIE → new TRIE
Problem: the number of subsets is exponential!
SLIDE 23 SUBSET-GEN Pruning Rules
A-priori Pruning Rule: if set S is infrequent, every superset of S is infrequent. (See paper for details.)
Lossy Counting Pruning Rule: at each 'window boundary', decrement TRIE counters by 1. Actually, 'Batch Deletion': at each 'main memory buffer' boundary, decrement all TRIE counters by b.
SLIDE 24
Bottlenecks ...
BUFFER → SUBSET-GEN → TRIE → new TRIE
TRIE consumes main memory; SUBSET-GEN consumes CPU time.
SLIDE 25 Design Decisions for Performance
TRIE (main memory bottleneck):
- Compact linear array of (element, counter, level) triples in preorder traversal; no pointers!
- Tries are on disk, accessed via mmap() and madvise(); all of main memory devoted to BUFFER
- Pair of tries (current and new)
SUBSET-GEN (CPU bottleneck): very fast implementation; see paper for details.
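The pointer-free trie layout can be illustrated with a small flattening routine. The nested-dict input format here is my own stand-in for a trie; the point is that a preorder walk emitting (element, counter, level) triples recovers the tree structure without any pointers.

```python
def trie_to_array(trie):
    """Flatten a nested trie {item: (count, children)} into a pointer-free
    list of (element, counter, level) triples in preorder."""
    out = []
    def walk(node, level):
        for item in sorted(node):          # children in item-id order
            count, children = node[item]
            out.append((item, count, level))
            walk(children, level + 1)      # descendants follow their parent
    walk(trie, 0)
    return out
```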
SLIDE 26
Experiments ...
IBM synthetic dataset T10.I4.1000K: N = 1 million, avg transaction size = 10, input size = 49 MB
IBM synthetic dataset T15.I6.1000K: N = 1 million, avg transaction size = 15, input size = 69 MB
Frequent word pairs in 100K web documents: N = 100K, avg transaction size = 134, input size = 54 MB
Frequent word pairs in 806K Reuters newsreports: N = 806K, avg transaction size = 61, input size = 210 MB
SLIDE 27 What do we study?
For each dataset: support threshold s, length of stream N, BUFFER size B, time taken t
Set ε = 10% of support s
Three independent variables: fix one and vary two; measure the time taken.
SLIDE 28 Varying support s and BUFFER B
Fixed: stream length N. Varying: BUFFER size B and support threshold s.
(Plots: time in seconds vs BUFFER size in MB. IBM 1M transactions: s = 0.001, 0.002, 0.004, 0.008. Reuters 806K docs: s = 0.004, 0.008, 0.012, 0.016, 0.020.)
SLIDE 29 Varying length N and support s
Fixed: BUFFER size B. Varying: stream length N and support threshold s.
(Plots: time in seconds vs length of stream in thousands, for IBM 1M transactions and Reuters 806K docs; s = 0.001, 0.002, 0.004 in both.)
SLIDE 30 Varying BUFFER B and support s
Fixed: stream length N. Varying: BUFFER size B and support threshold s.
(Plots: time in seconds vs support threshold s, for IBM 1M transactions and Reuters 806K docs; B = 4, 16, 28, 40 MB in both.)
SLIDE 31 Comparison with fast A-priori

Dataset: IBM T10.I4.1000K with 1M transactions, average size 10.

Support | APriori        | Our Algorithm (4 MB buffer) | Our Algorithm (44 MB buffer)
        | Memory  Time   | Memory  Time                | Memory  Time
0.001   | 82 MB   99 s   | 12 MB   111 s               | 45 MB   27 s
0.002   | 53 MB   25 s   | 10 MB   94 s                | 45 MB   15 s
0.004   | 48 MB   14 s   | 7 MB    65 s                | 45 MB   8 s
0.006   | 48 MB   13 s   | 6 MB    46 s                | 45 MB   6 s
0.008   | 48 MB   13 s   | 5 MB    34 s                | 45 MB   4 s
0.010   | 48 MB   14 s   | 5 MB    26 s                | 45 MB   4 s

A-priori implementation by Christian Borgelt:
http://fuzzy.cs.uni-magdeburg.de/~borgelt/software.html
SLIDE 32
Comparison with Iceberg Queries
Query: Identify all word pairs in 100K web documents which co-occur in at least 0.5% of the documents.
[FSGM+98] multiple-pass algorithm: 7000 seconds with 30 MB memory
Our single-pass algorithm: 4500 seconds with 26 MB memory
Our algorithm would be much faster if allowed multiple passes!
SLIDE 33
Lessons Learnt ...
- Optimizing for # passes is wrong!
- Small support s ⇒ too many frequent itemsets! Time to redefine the problem itself?
- Interesting combination of Theory and Systems.
SLIDE 34
Work in Progress ...
- Frequency Counts over Sliding Windows
- Multiple-pass Algorithm for Frequent Itemsets
- Iceberg Datacubes
SLIDE 35
Summary
- Lossy Counting: a practical algorithm for online frequency counting.
- First-ever single-pass algorithm for Association Rules with user-specified error guarantees.
- Basic algorithm applicable to several problems.