Comparing Data Streams Using Hamming Norms Graham Cormode, Mayur - PowerPoint PPT Presentation

Comparing Data Streams Using Hamming Norms
Graham Cormode, Mayur Datar, Piotr Indyk, S. Muthukrishnan
graham@cormode.org

Data Streams
Data streams occur everywhere:
• Network streams: IP packet flow records, phone call records
• Environmental observations: weather readings, other sensor values
• Other streams of values: web clickstreams, stock values, …

Streams from IP Networks
Many network flows pass between (source, dest) pairs, and we want a snapshot of the flows at time t. This defines a (massive) vector, and we ask:
• Summarise the current state
• How does the state at time t compare with the state at time t'?
• Which past situation does this most resemble, etc.?

Processing Constraints
• Network devices have small memory and limited processing power
• We want solutions with fast per-item processing and minimal memory requirements
• Backtracking on the input is impossible without explicitly storing it
Informally, this is the "data stream" model of computation.

How to measure streams?
The state at any time defines a massive vector. Below, a condition in parentheses counts 1 when it holds and 0 otherwise.
• Hamming norm: $\sum_i (x_i \neq 0)$, the number of non-zero entries of the vector
• Union size: $\sum_i (x_i + y_i \neq 0)$
• Hamming difference: $\sum_i ((x_i - y_i) \neq 0) = \sum_i (x_i \neq y_i)$, the number of places where the vectors differ, a fundamental concept
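The three measures can be checked exactly on small vectors. The sketch below is only a reference implementation that stores the vectors in full, which the streaming setting rules out; the function names are illustrative and not taken from the slides. Note that union size, as defined through $x_i + y_i \neq 0$, matches the union of non-zero positions only when entries do not cancel (for example, when all are non-negative).

```python
def hamming_norm(x):
    """Number of non-zero entries of x."""
    return sum(1 for v in x if v != 0)


def union_size(x, y):
    """Number of positions where x_i + y_i is non-zero."""
    return sum(1 for a, b in zip(x, y) if a + b != 0)


def hamming_difference(x, y):
    """Number of positions where x and y differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


x = [3, 0, 0, 2, 0, 1]
y = [3, 0, 5, 0, 0, 4]
print(hamming_norm(x))           # 3
print(union_size(x, y))          # 4
print(hamming_difference(x, y))  # 3
```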

Hamming Norm for Counting Distinct Values
Application 1: maintaining the number of distinct values in a relation with inserts and deletes.
• The number of distinct values is important for query optimization, approximate query answering, join size estimation, etc.
• In the fully dynamic case, with inserts and deletes, sampling has been shown to be inaccurate
• The Hamming norm of the stream of updates gives the number of distinct values

Application to Networks
Application 2: many questions can be asked about network streams:
• How many packet flows are there between distinct (source, destination) pairs?
• How many flows are losing packets (where the packets into one side of the network do not equal the packets out)?
• Denial of service attacks are signalled by large numbers of requests from spoofed IPs, and so by many distinct sources
All of these can be solved by computing Hamming norms.

Our approach
An exact answer is not possible in small space, so we find an approximate answer with probabilistic guarantees. We will use statistical distributions with provable properties.
Assume a general form of a data stream:
• Pairs (i, j) arrive (meaning "add j to location i")
• Each accumulated value $x_i$ is bounded: $|x_i| < U$ for some $U$
We will create a small summarizing "sketch" of the stream that allows the Hamming norm, difference and union to be approximated.

Hamming Norm of a Stream
Vectors are assumed to be massive, too large to store explicitly. Entries are updated dynamically:
(5, +3), (2, -1), (3, +2), (7, +9), (5, -2), (6, -1), (6, -3), (2, +1), (4, +2), (3, -2), (7, -5), (5, +2), (6, -2), (4, -3), (5, -1)
Resulting vector:
Index:  1   2   3   4   5   6   7   8
Value:  0   0   0  -1   2  -6   4   0
The Hamming norm of the stream is 4 (4 non-zero entries).
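As a sanity check, the update list above can be replayed exactly. This is a minimal reference sketch that keeps the whole vector in memory, which is precisely what the data stream model forbids for massive vectors; `apply_updates` is an illustrative name.

```python
from collections import defaultdict

# The (index, change) pairs from the slide, in arrival order.
updates = [(5, 3), (2, -1), (3, 2), (7, 9), (5, -2), (6, -1), (6, -3), (2, 1),
           (4, 2), (3, -2), (7, -5), (5, 2), (6, -2), (4, -3), (5, -1)]


def apply_updates(updates):
    """Accumulate 'add j to location i' updates into an explicit vector."""
    state = defaultdict(int)
    for i, j in updates:
        state[i] += j
    return state


state = apply_updates(updates)
vector = [state[i] for i in range(1, 9)]   # locations 1..8
norm = sum(1 for v in vector if v != 0)    # exact Hamming norm
print(vector)
print(norm)                                # 4
```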

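Zeroing in on the Hamming Norm
We can approximate the Hamming norm by finding the $L_p$ norm raised to the power $p$, for small enough $p$.
• The Hamming norm of a vector $a$ is $|a|_H = \sum_i |a_i|^0$, where $0^0$ is defined to be 0
• The $L_p$ norm of a vector is $(\sum_i |a_i|^p)^{1/p}$
For integer entries with $|a_i| < U$, every non-zero entry satisfies $1 \le |a_i|^p \le U^p$, so
$|a|_H = \sum_i |a_i|^0 \le \sum_i |a_i|^p \le \sum_i U^p |a_i|^0 = U^p |a|_H$
Setting $U^p = 1 + \varepsilon$ means
$|a|_H \le \sum_i |a_i|^p \le (1 + \varepsilon) |a|_H$
This fixes $p = \log(1 + \varepsilon) / \log U \approx \varepsilon / \log U$, allowing us to approximate the Hamming norm.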
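The sandwich bound can be verified numerically on a small integer vector. The snippet below is an illustration under assumed values ($\varepsilon = 0.1$, $U = 1000$, and a made-up vector); it takes $p = \log(1 + \varepsilon) / \log U$ exactly so that $U^p = 1 + \varepsilon$.

```python
import math


def choose_p(eps, U):
    """Exponent p with U**p == 1 + eps."""
    return math.log(1.0 + eps) / math.log(U)


def lp_power_sum(a, p):
    """Sum of |a_i|**p over the non-zero entries (so 0**0 counts as 0)."""
    return sum(abs(v) ** p for v in a if v != 0)


eps, U = 0.1, 1000                    # assumed accuracy target and value bound
p = choose_p(eps, U)
a = [0, 7, -999, 0, 1, 250, 0, -3]    # integer entries with |a_i| < U
H = sum(1 for v in a if v != 0)       # exact Hamming norm
S = lp_power_sum(a, p)
print(H, S, (1 + eps) * H)
assert H <= S <= (1 + eps) * H
```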

Finding the $L_p$ norm
Relies on results from Indyk '00 on stable distributions, which can be used to approximate the $L_p$ norm.
Fact: if the $X_i \sim \mathrm{Stable}(p, 0)$ are independent, then $\sum_i a_i X_i \sim (\sum_i |a_i|^p)^{1/p} \, \mathrm{Stable}(p, 0)$.
• Create a vector $x$ where each entry is drawn from Stable(p, 0)
• Compute $\hat{a}_H = \sum_i a_i x_i$; by the fact above, this quantity is distributed as the $L_p$ norm of $a$ times a Stable(p, 0) variable
• This can be computed on the stream: with each update (i, j), set $\hat{a}_H \leftarrow \hat{a}_H + j \, x_i$

Guaranteed Accuracy
One estimate is not accurate (its variance is high), so repeat several times independently: keep $k$ copies based on independent drawings of the vector $x$.
• Store the values of $\hat{a}_H$ in a short $L_0$ sketch, sk[1…k]
• Find $\mathrm{median}_i(|sk[i]|)$, and scale by $m = \mathrm{median}(|\mathrm{Stable}(p, 0)|)$
• Fix $k = O(\frac{1}{\varepsilon^2} \log \frac{1}{\delta})$
Then $(1 - \varepsilon) |a|_H \le \mathrm{median}(sk) / m \le (1 + \varepsilon)^2 |a|_H$ with probability $1 - \delta$.

Implementation Details
• Don't store $x$ explicitly: it would take too much space
• Instead, compute each $x_i$ as a pseudo-random function of $i$: use a pseudo-random number generator initialized by $i$, together with known methods for generating values from stable distributions out of uniform ones
• The value median(|Stable(p, 0)|) is also needed in advance; it can be found empirically or numerically
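A minimal sketch of these two ingredients follows. It assumes the Chambers-Mallows-Stuck transform as the "known method" for turning uniforms into symmetric stable values (the slides do not name one), evaluates it on a log scale with a clamp of my own so that very heavy-tailed draws stay finite, and takes the scale factor to be the empirical median of $|X|^p$, matching the final step of the pseudocode that raises sketch entries to the power $p$. The names `x_row` and `scale_factor` are illustrative.

```python
import math
import random
import statistics


def stable(r1, r2, p):
    """Symmetric p-stable variate from uniforms r1, r2 in [0, 1)."""
    theta = math.pi * (r1 - 0.5)               # uniform angle in [-pi/2, pi/2)
    if theta == 0.0:
        return 0.0
    w = max(-math.log(1.0 - r2), 1e-300)       # exponential variate, mean 1
    log_mag = (math.log(abs(math.sin(p * theta)))
               - math.log(math.cos(theta)) / p
               + (1.0 - p) / p
               * (math.log(math.cos((1.0 - p) * theta)) - math.log(w)))
    # Clamp (an added guard, not from the slides) so exp() cannot overflow.
    return math.copysign(math.exp(min(log_mag, 600.0)), theta)


def x_row(i, k, p):
    """Regenerate the k values x_i[1..k] for location i on demand."""
    rng = random.Random(i)                     # generator initialized by i
    return [stable(rng.random(), rng.random(), p) for _ in range(k)]


def scale_factor(p, samples=20001, seed="scale"):
    """Empirical median of |X|**p for X drawn from Stable(p, 0)."""
    rng = random.Random(seed)
    return statistics.median(
        abs(stable(rng.random(), rng.random(), p)) ** p for _ in range(samples))


# The same location always yields the same values, so x need not be stored.
assert x_row(7, 4, 0.02) == x_row(7, 4, 0.02)
print(scale_factor(0.02))
```

Seeding by the location alone means that any two sketches built this way share the same vector $x$, which is what later allows them to be added and subtracted.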

Properties
• Space usage is small: the $L_0$ sketch consists of $O(\frac{1}{\varepsilon^2} \log \frac{1}{\delta})$ counters
• Time per item is the cost of updating each counter, $O(\frac{1}{\varepsilon^2} \log \frac{1}{\delta})$
• Difference and union of streams are easy to compute, by linearity of the dot product:
  sk(a + b) = sk(a) + sk(b)
  sk(a - b) = sk(a) - sk(b)
So we can approximate $|a - b|_H$ and $|a + b|_H$ with the same accuracy.
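The linearity claim can be written out entry by entry. In the derivation below, $x_i^{(s)}$ is my own notation for the value of $x_i$ in the $s$-th independent drawing; the identity relies on both sketches using the same drawings, which holds when each $x_i$ is generated from $i$ alone.

```latex
\begin{align*}
\mathrm{sk}(a)[s] &= \sum_i a_i \, x_i^{(s)}, \qquad s = 1, \dots, k \\
\mathrm{sk}(a)[s] - \mathrm{sk}(b)[s]
  &= \sum_i a_i \, x_i^{(s)} - \sum_i b_i \, x_i^{(s)}
   = \sum_i (a_i - b_i) \, x_i^{(s)}
   = \mathrm{sk}(a - b)[s]
\end{align*}
```

The same steps with a plus sign give sk(a + b) = sk(a) + sk(b).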

Complete Algorithm
    initialize sk[1…k] = 0.0
    for all tuples (i, j) do
        initialize random with i
        for s = 1 to k do
            r1 = random(); r2 = random()
            sk[s] = sk[s] + j * stable(r1, r2, p)
    for s = 1 to k do
        sk[s] = absolute(sk[s])^p
    return median(sk) * scalefactor(p)
Simple to implement, and can run quickly in small space.
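The pseudocode might be turned into runnable Python roughly as follows. This is a sketch under several assumptions of mine, not the authors' implementation: `stable` uses the Chambers-Mallows-Stuck transform with an added overflow clamp, `scalefactor(p)` is taken to be one over the empirical median of $|X|^p$, the class and method names are hypothetical, and the demo uses $p = 0.2$ (larger than the value in the experiments) to keep the heavy-tailed values comfortably within double-precision range when updates cancel.

```python
import math
import random
import statistics


def stable(r1, r2, p):
    """Symmetric p-stable variate from uniforms r1, r2 in [0, 1)."""
    theta = math.pi * (r1 - 0.5)               # uniform angle in [-pi/2, pi/2)
    if theta == 0.0:
        return 0.0
    w = max(-math.log(1.0 - r2), 1e-300)       # exponential variate, mean 1
    log_mag = (math.log(abs(math.sin(p * theta)))
               - math.log(math.cos(theta)) / p
               + (1.0 - p) / p
               * (math.log(math.cos((1.0 - p) * theta)) - math.log(w)))
    return math.copysign(math.exp(min(log_mag, 600.0)), theta)  # clamp added


def scale_factor(p, samples=20001, seed="scale"):
    """Empirical median of |X|**p for X drawn from Stable(p, 0)."""
    rng = random.Random(seed)
    return statistics.median(
        abs(stable(rng.random(), rng.random(), p)) ** p for _ in range(samples))


class L0Sketch:
    def __init__(self, k, p):
        self.k, self.p = k, p
        self.sk = [0.0] * k                    # initialize sk[1..k] = 0.0

    def update(self, i, j):
        """Process the tuple (i, j): add j to location i."""
        rng = random.Random(i)                 # initialize random with i
        for s in range(self.k):
            r1, r2 = rng.random(), rng.random()
            self.sk[s] += j * stable(r1, r2, self.p)

    def minus(self, other):
        """Sketch of the difference of two streams, by linearity."""
        out = L0Sketch(self.k, self.p)
        out.sk = [u - v for u, v in zip(self.sk, other.sk)]
        return out

    def estimate(self, scale):
        """median(|sk[s]|**p) divided by the scale factor."""
        return statistics.median(abs(v) ** self.p for v in self.sk) / scale


P, K = 0.2, 256                                # illustrative parameters
scale = scale_factor(P)
a, b = L0Sketch(K, P), L0Sketch(K, P)
for i in range(1, 81):
    a.update(i, +1)
for i in range(41, 81):
    a.update(i, -1)                            # deletions: only 1..40 remain
for i in range(1, 41):
    b.update(i, +1)
for i in range(101, 111):
    b.update(i, +2)                            # b differs from a in 10 places
print(a.estimate(scale))                       # true Hamming norm is 40
print(a.minus(b).estimate(scale))              # true Hamming difference is 10
```

With only 256 counters the printed estimates are noisy, and the larger $p$ adds some upward bias for entries of magnitude above 1; both effects shrink as $k$ grows and $p$ is reduced.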

Experimental Evaluation
Data Sets
• Generated synthetic data from Zipf distributions with a range of parameters
• Took real Netflow data from one of AT&T's networks
• Each data stream was around 20Mb; working space was a few Kb
Parameters
• We fixed p = 0.02 (as small as possible); this sets the scale factor, median(|Stable(0.02, 0)|) = 1.425

Existing Techniques
Compared against the "probabilistic counting" algorithm of Flajolet and Martin (FM85):
+ Uses a similar amount of space
+ Operates in the data stream model
+ Fast per-item processing
- Can't cope with all situations (e.g. negative values)
- Can't find the difference between two streams

Hamming Norm Tests
• Performance of our algorithm is better than FM85
• Improves with more workspace
• Slightly slower in practice

• Shows that FM85 can't cope when values are allowed to be negative, but $L_0$ sketches retain their accuracy.

• Good performance (~7% error), small memory cost
• Performance of finding the union of streams (not shown) is also good

Conclusions
• We give a new technique for data stream analysis
• It can approximate the Hamming norm, the number of distinct items, and the Hamming difference with only a few kb of space
• It is suitable for indexing streams
• The "$L_0$ sketch" can be used as a surrogate for the stream in other computations: clustering, searching, querying, all based only on the sketches
