Summarizing and Mining Skewed Data Streams
Graham Cormode
cormode@bell-labs.com
Flip Korn, S. Muthukrishnan, Divesh Srivastava

Data Streams
Many large sources of data are generated as streams of updates:
– IP network traffic data
– Text: email/IM/SMS/weblogs
– Scientific/monitoring data
We must analyze this data, which is high speed (tens of thousands to millions of updates per second) and massive (gigabytes to terabytes per day).
Analysis of data streams consists of two parts: summarizing and mining.
Summarizing:
– Fast memory is much smaller than the data size, so we need a (guaranteed) concise synopsis
– Data is distributed, so we need to be able to combine synopses
Mining:
– Extract information about the stream from the synopsis
– Examples: heavy hitters/frequent items, quantiles, changes/differences, clustering/trending, etc.
Data is rarely uniform in practice; it is typically skewed: a few items are frequent, followed by a long tail of infrequent items.
[Figure: frequency vs. items sorted by frequency; plotting log frequency against log rank gives a near-straight line]
Such skew is prevalent in network data, word frequency, paper citations, city sizes, etc. One concept, many names: Zipf distribution, Pareto distribution, power laws, multifractals, etc.
Incorporating skewness into analysis:
– Count-Min sketch and Zipf distribution
– Biased Quantiles
Items are drawn from a universe of size U. Draw N items; the frequency of the i'th most frequent item is f_i ≈ N·i^(-z). The proportionality constant depends on U and z, not on N. z indicates the skewness:
– z = 0: uniform distribution
– z < 0.5: light skew/no skew
– 0.5 ≤ z < 1: moderate skew
– 1 ≤ z: (highly) skewed
Most real data falls in the moderate-to-highly-skewed range:
Data Source                     Zipf skewness, z
Web page popularity             0.7 – 0.8
FTP transmission size           0.9 – 1.1
Word use in English text        1.1 – 1.3
Depth of website exploration    1.4 – 1.6
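To make the Zipf model concrete, here is a minimal Python sketch (ours, not from the talk; all parameter values are illustrative) that draws a stream from Zipf(z) and inspects the frequency-vs-rank behaviour:

```python
import random
from collections import Counter

def zipf_sample(N, U, z):
    """Draw N items from the universe {1..U} with Pr[item i] proportional to i**(-z)."""
    weights = [i ** (-z) for i in range(1, U + 1)]
    return random.choices(range(1, U + 1), weights=weights, k=N)

stream = zipf_sample(N=100_000, U=10_000, z=1.1)
freqs = Counter(stream)
# The i'th most frequent item should appear roughly N * i**(-z) times (up to a constant).
for rank, (item, f) in enumerate(freqs.most_common(5), start=1):
    print(rank, item, f)
```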
A simple synopsis used to approximately answer point queries: how many times did item i occur in the stream, i.e., its frequency f_i?
Point queries are the basis of many mining tasks: histograms, anomaly detection, quantiles, heavy hitters.
Asymptotic improvement over prior methods: for error bound ε, space is o(1/ε) when z > 1; previously, the cost was O(1/ε²) for F2 and O(1/ε) for point queries.
Use the Count-Min sketch structure, introduced in [CM04], to answer point queries with error < εN with probability at least 1-δ. We give a tighter analysis here for skewed data, plus a new analysis for F2. Ingredients:
– Universal hash functions h_1 .. h_{log 1/δ}: {items} → {1..w}
– An array of counters CM[1..w, 1..log 1/δ]
[Figure: an update (i, count) is hashed by each of h_1 .. h_{log 1/δ} to one counter per row of the w × log 1/δ array, and each mapped counter is incremented]
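As a reference, here is a minimal Python sketch of the structure (the hash family and the sizing w = ⌈e/ε⌉, d = ⌈ln 1/δ⌉ are the standard CM choices, not the tighter skew-aware ones analyzed below):

```python
import hashlib
import math

class CountMinSketch:
    """Array of counters CM[1..d][1..w]; d plays the role of log 1/δ."""

    def __init__(self, eps, delta):
        self.w = math.ceil(math.e / eps)         # width of each row
        self.d = math.ceil(math.log(1 / delta))  # number of rows / hash functions
        self.counts = [[0] * self.w for _ in range(self.d)]

    def _h(self, j, item):
        # One hash function per row; any pairwise-independent family also works.
        digest = hashlib.sha1(f"{j}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.w

    def update(self, item, count=1):
        # Each update increments exactly one counter per row.
        for j in range(self.d):
            self.counts[j][self._h(j, item)] += count

    def point_query(self, item):
        # Overestimate of f_i; the error is < εN with probability >= 1 - δ.
        return min(self.counts[j][self._h(j, item)] for j in range(self.d))
```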
Split the error in a point query into:
– Collisions with the w/3 largest items
– Collisions with the remaining items
With constant probability (2/3), no large items collide with the queried point. Applying Zipf tail bounds and setting w = 3ε^(-1/z) bounds the expected error from the remaining items by εN/3. Markov inequality: Pr[error > εN] < 1/3 for each row. Taking the min of the log 1/δ row estimates: Pr[error > εN] < 3^(-log 1/δ) < δ.
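A quick back-of-the-envelope check (our numbers, purely illustrative) of what the skew-aware width w = 3ε^(-1/z) buys over the uniform O(1/ε) width:

```python
eps = 0.001
for z in (1.2, 1.6, 2.0):
    w = 3 * eps ** (-1 / z)
    print(f"z = {z}: w = {w:.0f} counters per row, vs {1 / eps:.0f} for the uniform bound")
# At z = 1.6, w = 3 * 0.001**(-1/1.6) is about 225 counters instead of 1000, and
# the gap widens as eps shrinks: the skew-aware width is o(1/eps) whenever z > 1.
```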
We can find f_i with (1±ε) relative error for i ≤ k (i.e., for the top-k most frequent items). Applying a similar analysis and tail bounds gives w = O(k/ε) for any z > 1. This improves the O(k/ε²) bound due to [CCFC02]. We only require that z > 1; we do not need the value of z.
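The slide does not spell out how the top-k items are tracked; one standard companion technique (an assumption here, not the talk's prescribed method) keeps a small heap of candidates next to the sketch, reusing the CountMinSketch class sketched above:

```python
import heapq

def track_top_k(stream, sketch, k):
    """Track the k items with the largest estimated counts while updating the sketch."""
    heap = []        # (estimated count, item); the smallest estimate sits on top
    members = set()  # items currently in the heap
    for item in stream:
        sketch.update(item)
        est = sketch.point_query(item)
        if item in members:
            # Refresh this item's stale estimate, then restore the heap order.
            heap = [(est if v == item else e, v) for e, v in heap]
            heapq.heapify(heap)
        elif len(heap) < k:
            heapq.heappush(heap, (est, item))
            members.add(item)
        elif est > heap[0][0]:
            _, evicted = heapq.heapreplace(heap, (est, item))
            members.discard(evicted)
            members.add(item)
    return sorted(heap, reverse=True)
```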
Second Frequency Moment, F2 = ∑_i f_i²
Two techniques to make an estimate from the CM sketch:
– CM+: min_j ∑_{k=1}^{w} CM[j,k]², the min over rows of the F2 of each row
– CM-: median_j ∑_{k=1}^{w/2} (CM[j,2k] - CM[j,2k-1])², the median over rows of the F2 of the differences of adjacent entries
We compare bounds for both methods.
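Both estimators read the same array of counters. A sketch in Python, taking the counts array of the CountMinSketch above (with w even, so adjacent entries pair up for CM-):

```python
from statistics import median

def f2_cm_plus(CM):
    """CM+: min over rows of the sum of squared counters in the row."""
    return min(sum(c * c for c in row) for row in CM)

def f2_cm_minus(CM):
    """CM-: median over rows of the summed squared differences of adjacent counters."""
    return median(
        sum((row[2 * k] - row[2 * k + 1]) ** 2 for k in range(len(row) // 2))
        for row in CM
    )

# Usage, e.g.: f2_cm_plus(sketch.counts) and f2_cm_minus(sketch.counts)
```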
CM+ analysis: with constant probability, the largest w^(1/2) items all fall in different buckets. For z > 1, Zipf tail bounds give a bound on the expected error; simplifying, we set the expected error equal to ½εF2. This gives w = O(ε^(-2/(1+z))). Applying the Markov inequality shows the error is at most εF2 with constant probability. Taking the minimum of the log 1/δ repetitions reduces the failure probability to δ. Total space cost = O(ε^(-2/(1+z)) log 1/δ), provided z > 1.
CM- analysis: for z > ½, again with constant probability the largest w^(1/2) items all fall in different buckets. We show that:
– The expectation of each CM- estimate is F2
– The variance is at most 8F2²·w^(-(1+2z)/2)
Setting Var = ε²F2² and applying the Chebyshev bound gives constant probability of error < εF2. Taking the median of the log 1/δ repetitions amplifies this to failure probability δ. Total space cost = O(ε^(-4/(1+2z)) log 1/δ), if z > ½.
Method    Skewness       Space Cost
CM+       1 < z          (1/ε)^(2/(1+z))
CM-       ½ < z ≤ 1      (1/ε)^(4/(1+2z))
CM-       z ≤ ½          (1/ε)²
[Figure: the power of 1/ε in the space cost plotted against Zipf skewness z]
[Figure: Max Error on Point Queries from Zipf(1.6): max error vs. size (KB) for CM and CCFC, with reference curve x^(-1.6)]
[Figure: Maximum Error on Zipf data with 27KB space: observed error vs. Zipf parameter (0.50 to 2.00) for CM and CCFC]
On synthetic data, CM significantly outperforms the worst-case error of the comparable method [CCFC02]. The error decays as space increases, as predicted.
We also ran on real data: the text of Shakespeare (5MB, z≈1.2) and IP traffic data (20MB, z≈1.3).
[Figure: F2 Estimation on Shakespeare: observed error vs. space (KB) for CM+ and CM-]
[Figure: F2 Estimation on IP Request Data: observed error vs. size (KB) for CM+ and CM-]
We easily process 2-3 million new items per second on a standard desktop PC. Queries are also fast:
– point queries ≈ 1µs
– F2 queries ≈ 100µs
Alternative methods are at least 40-50% slower.
Incorporating skewness into analysis:
– Count-Min sketch and Zipf distribution
– Biased Quantiles
Quantiles summarize a data distribution concisely. Given N items, the φ-quantile is the item with rank φN in the sorted order. The median is the ½-quantile; the minimum is the 0-quantile.
Equidepth histograms put bucket boundaries on regular quantile values, e.g. 0.1, 0.2, …, 0.9. Quantiles are a robust and rich summary: the median is less affected by outliers than the mean.
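For reference, the exact (offline) computation that the streaming summaries approximate; the rank convention (rank = ⌈φN⌉, clamped to at least 1) is our choice:

```python
import math

def phi_quantile(items, phi):
    """Exact φ-quantile: the item with rank ceil(phi * N) in sorted order."""
    ordered = sorted(items)
    rank = max(1, math.ceil(phi * len(ordered)))
    return ordered[rank - 1]

data = [3, 1, 4, 1, 5, 9, 2, 6]
print(phi_quantile(data, 0.5))  # median: rank 4 of 8, here 3
print(phi_quantile(data, 0.0))  # minimum, the 0-quantile
```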
A data stream consists of N items in arbitrary order. This models many data sources, e.g. network traffic, where each packet is one item. Computing quantiles exactly requires linear space in one pass, and Ω(N^(1/p)) space in p passes. ε-approximate computation is possible in sub-linear space:
– φ-quantile: an item with rank between (φ-ε)N and (φ+ε)N
– [GK01]: insertions only, space O(1/ε · log(εN))
– [CM04]: insertions and deletions, space O(1/ε · log 1/δ)
IP network traffic is very skewed
– Long tails are of great interest
– E.g., the 0.9-, 0.95-, and 0.99-quantiles of TCP round trip times
Issue: uniform error guarantees
– ε = 0.05: okay for the median, but not for the 0.99-quantile
– ε = 0.001: okay for both, but needs too much space
Goal: support relative error guarantees in small space
– Low-biased quantiles: φ-quantiles in ranks φ(1±ε)N
– High-biased quantiles: (1-φ)-quantiles in ranks (1-(1±ε)φ)N
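The difference between the uniform and the relative (biased) guarantee is easiest to see as allowed rank intervals; a small illustration with our own parameter choices:

```python
N = 1_000_000
eps = 0.05

# Uniform guarantee: the 0.99-quantile may be off by eps*N ranks either way.
uniform = ((0.99 - eps) * N, (0.99 + eps) * N)

# High-biased guarantee: view it as the (1 - phi)-quantile with phi = 0.01;
# the returned item's rank must lie in (1 - (1 +/- eps) * phi) * N.
phi = 0.01
biased = ((1 - (1 + eps) * phi) * N, (1 - (1 - eps) * phi) * N)

print(uniform)  # (940000.0, 1040000.0): a window of 100,000 ranks
print(biased)   # (989500.0, 990500.0): a window of only 1,000 ranks
```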
A sampling approach was given by Gupta and Zane [GZ03] in the context of a different problem:
– Keep O(1/ε) samplers at different sample rates, each keeping a sample of O(1/ε²) items
– Total space: O(1/ε³); a probabilistic algorithm
Uses too much space in practice. Is it possible to do better? Without randomization?
An example shows the intuition behind our approach. Low-biased quantiles: allow error εφN in the rank of the φ-quantile.
– Set ε = 10%. Suppose we know the approximate median of n items is M, so the absolute error allowed is εn/2
– Then there are n inserts, all above M
– M is now the first quartile, so we need error εN/4
How can error bounds be maintained?
– The total number of items is now N = 2n, so the required absolute error bound for M is εN/4 = εn/2, the same as before
The error bound never shrinks too fast, so we can hope to guarantee relative errors. The challenge is to guarantee accuracy in small space.
Any solution to the biased quantiles problem must use space at least Ω(1/ε · log(εN)). This is shown by a counting argument: there are Ω(1/ε · log(εN)) possible different answers depending on the input, and the summary must distinguish them.
For uniform quantiles, the corresponding lower bound is Ω(1/ε), so the biased quantiles problem is strictly harder in terms of the space needed.
A deterministic algorithm that guarantees relative error for low-biased or high-biased quantiles. Three main routines:
– Insert(v): inserts a new item, v
– Compress: periodically prunes the data structure
– Output(φ): outputs an item with rank (1±ε)φN
Similar in structure to the Greenwald-Khanna algorithm [GK01] for uniform quantiles (ranks φN ± εN), but it needs a new implementation and analysis.
Store tuples t_i = (v_i, g_i, ∆_i), sorted by v_i:
– v_i is an item from the stream
– g_i = rmin(v_i) - rmin(v_{i-1})
– ∆_i = rmax(v_i) - rmin(v_i)
Define r_i = ∑_{j=1}^{i-1} g_j.
We will guarantee that the true rank of v_i is between r_i + g_i and r_i + g_i + ∆_i.
[Figure: tuples (v_1, g_1, ∆_1) .. (v_4, g_4, ∆_4) laid out in sorted order, with the g_i partitioning the rank space]
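A small sketch of the tuple representation and the rank bounds it implies (types and names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Tup:
    v: float    # item from the stream
    g: int      # rmin(v_i) - rmin(v_{i-1})
    delta: int  # rmax(v_i) - rmin(v_i)

def rank_bounds(tuples, i):
    """The true rank of tuples[i].v lies in [r_i + g_i, r_i + g_i + delta_i]."""
    r_i = sum(t.g for t in tuples[:i])  # r_i = sum of g_j for j < i
    lo = r_i + tuples[i].g
    return lo, lo + tuples[i].delta
```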
In order to guarantee accurate answers, we maintain at all times, for all i, the Biased Quantiles (BQ) invariant:
g_i + ∆_i ≤ max{2εr_i, 1}
i.e., the "uncertainty" in the rank of v_i is at most 2ε times a lower bound on its rank. Intuitively, if the uncertainty in rank is proportional to ε times a lower bound on the rank, this should give the required accuracy.
Output(φ): compute the prefix ranks r_i, and find the smallest i whose maximum possible rank r_i + g_i + ∆_i exceeds the upper bound on the allowed rank, (1+ε)φn; output the previous item, v_{i-1}.
Claim: Output(φ) correctly outputs an ε-approximate φ-biased quantile.
i is the smallest index such that r_i + g_i + ∆_i > φn + εφn. (*)
So r_{i-1} + g_{i-1} + ∆_{i-1} ≤ (1+ε)φn. [+]
Using the invariant on (*), (1+2ε)r_i > (1+ε)φn, and (rearranging) r_i > (1-ε)φn. [-]
Since r_i = r_{i-1} + g_{i-1}, we combine [-] and [+]:
(1-ε)φn < r_{i-1} + g_{i-1} ≤ (true rank of v_{i-1}) ≤ r_{i-1} + g_{i-1} + ∆_{i-1} ≤ (1+ε)φn
We must show update operations maintain bounds
To insert a new item, we find the smallest i such that v < v_i:
– Set g = 1 (the rank of v is at least 1 more than that of v_{i-1})
– Set ∆ = max{2εr_i, 1} - 1 (so the uncertainty g + ∆ in the rank of v is at most max{2εr_i, 1})
– Insert (v, g, ∆) before t_i in the data structure
Easy to see that Insert maintains the BQ invariant
Insert(v) causes the data structure to grow by one tuple per update. Periodically, we Compress the data structure by pruning unneeded tuples.
Merge tuples t_i = (v_i, g_i, ∆_i) and t_{i+1} = (v_{i+1}, g_{i+1}, ∆_{i+1}) into (v_{i+1}, g_i + g_{i+1}, ∆_{i+1})
⇒ the semantics of g and ∆ are preserved.
Only merge if g_i + g_{i+1} + ∆_{i+1} ≤ max{2εr_i, 1}
⇒ the Biased Quantiles invariant is preserved.
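Putting the three routines together: a compact, unoptimized Python sketch of the low-biased quantiles summary. The BQ invariant, the Insert/Compress/Output logic, and the merge condition follow the slides; the compression schedule, the exact-rank handling of a new maximum, and the rounding are our own choices:

```python
class BiasedQuantiles:
    """Low-biased quantiles: answers the φ-quantile with rank error about εφn.

    Maintains tuples [v, g, delta] sorted by v, with the BQ invariant
    g_i + delta_i <= max(2*eps*r_i, 1), where r_i = sum of g_j for j < i.
    """

    def __init__(self, eps):
        self.eps = eps
        self.ts = []  # list of [v, g, delta]
        self.n = 0

    def _thresh(self, r):
        return max(int(2 * self.eps * r), 1)

    def insert(self, v):
        r = 0
        for i, (vi, g, d) in enumerate(self.ts):
            if v < vi:
                # g = 1; uncertainty is at most max(2*eps*r, 1) - 1
                self.ts.insert(i, [v, 1, self._thresh(r) - 1])
                break
            r += g
        else:
            self.ts.append([v, 1, 0])  # new maximum: rank known exactly (our choice)
        self.n += 1
        if self.n % max(int(1 / (2 * self.eps)), 1) == 0:
            self.compress()  # periodic pruning; the schedule is a tunable choice

    def compress(self):
        # Merge t_i into t_{i+1} when g_i + g_{i+1} + delta_{i+1} <= max(2*eps*r_i, 1).
        r = [0]
        for _, g, _ in self.ts:
            r.append(r[-1] + g)  # r[i] = sum of g_j for j < i
        for i in range(len(self.ts) - 2, 0, -1):  # keep the minimum tuple intact
            _, gi, _ = self.ts[i]
            vj, gj, dj = self.ts[i + 1]
            if gi + gj + dj <= self._thresh(r[i]):
                self.ts[i + 1] = [vj, gi + gj, dj]  # semantics of g, delta preserved
                del self.ts[i]

    def output(self, phi):
        # Smallest i with r_i + g_i + delta_i > (1 + eps) * phi * n; return v_{i-1}.
        target = (1 + self.eps) * phi * self.n
        r = 0
        for i, (vi, g, d) in enumerate(self.ts):
            if r + g + d > target:
                return self.ts[max(i - 1, 0)][0]
            r += g
        return self.ts[-1][0]

# Usage: far fewer tuples than items, with small relative rank error at the low end.
import random
bq = BiasedQuantiles(eps=0.1)
for _ in range(10_000):
    bq.insert(random.random())
print(len(bq.ts), bq.output(0.01), bq.output(0.5))
```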
Alternate version: sometimes we only care about, e.g., φ = ½, ¼, …, ½^k. We can reduce the space requirement by weakening the Biased Quantiles invariant.
Our implementations are based on the algorithm using this weakened invariant.
The k-biased quantiles algorithm was implemented in the Gigascope data stream system. It ran on a mixture of real (155Mb/s live traffic streams) and synthetic (1Gb/s generated traffic) data. We experimented to study:
– Space cost
– Observed accuracy for queries
– Update time cost
Space: k-biased quantiles vs. GK run at ε' = ε·φ_k
⇒ space usage scales roughly as (k/ε)·log^c(εN) on real data, but grows more quickly in the worst case.
Accuracy: compared against GK1 (ε' = ε) and GK2 (ε' = ε·φ_k); we get a good tradeoff between space and error on real data.
The overhead per packet was about 5-10µs, with few packet drops (< 1%) at Gigabit Ethernet speed. The choice of data structure to implement the list of tuples was an important factor:
– Running Compress periodically is a blocking operation; instead, do a partial compression per update
– A "cursor" plus sorted list (5µs/packet) does better than a balanced tree structure (22µs/packet)
Further generalization (targeted quantiles): before the data stream, we are given a set T of (φ,ε) pairs, and we must be able to answer φ-quantile queries over the stream with error ±εn for each pair in T. From T, we generate a new invariant f(r,n) to maintain.
In the paper, we show that maintaining g_i + ∆_i ≤ f(r_i, n) guarantees the targeted quantiles with the required accuracy.
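The transcript does not capture the formula for f. As an illustration only, here is one natural reconstruction (our assumption, to be checked against the paper): each target (φ_j, ε_j) permits some uncertainty at rank r, growing as we move away from rank φ_j·n, and f takes the tightest constraint over all targets:

```python
def make_invariant(targets):
    """targets: iterable of (phi, eps) pairs with 0 < phi < 1.

    Returns f(r, n): the allowed uncertainty g_i + delta_i at rank r.
    NOTE: this form is our reconstruction for illustration; see the paper for the exact f.
    """
    def f(r, n):
        allowed = []
        for phi, eps in targets:
            if r >= phi * n:
                allowed.append(2 * eps * r / phi)              # above the target rank
            else:
                allowed.append(2 * eps * (n - r) / (1 - phi))  # below the target rank
        return max(min(allowed), 1)
    return f

f = make_invariant([(0.5, 0.01), (0.99, 0.001)])
print(f(500_000, 1_000_000), f(990_000, 1_000_000))
```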
For uniform quantile guarantees, we can handle item deletions in a probabilistic setting (with the CM sketch). But we provably need linear space for biased quantiles (against a strong "adversary"), even probabilistically. Sliding windows also require large space.
Skew is prevalent in many realistic situations. Modelling the skew of realistic data sources can considerably improve results for summarizing and mining tasks.
The Zipf distribution gives a uniform way to study skewed data. Many other tasks can benefit from incorporating skew, either into the problem or into the analysis.
Future directions: applying skewed data mining to other structured domains, e.g. hierarchical domains, graph data, etc. Work in progress: a new algorithm for Biased Quantiles with provable space bounds, extensions to multi-dimensional data, etc.