Crash Course on Data Stream Algorithms
Part I: Basic Definitions and Numerical Streams
Andrew McGregor
University of Massachusetts Amherst
Goals of the Crash Course
◮ Goal: Give a flavor for the theoretical results and techniques from
◮ Disclaimer: Talks will be theoretical/mathematical but shouldn’t
◮ Request:
  ◮ If you get bored, ask questions...
  ◮ If you get lost, ask questions...
  ◮ If you’d like to ask questions, ask questions...
◮ Stream: m elements from universe of size n, e.g.,
◮ Goal: Compute a function of stream, e.g., median, number of
◮ Catch:
◮ Origins in 70s but has become popular in last ten years because of
◮ Practical Appeal:
  ◮ Faster networks, cheaper data storage, ubiquitous data-logging
  ◮ Applications to network monitoring, query planning, I/O efficiency
◮ Theoretical Appeal:
  ◮ Easy to state problems but hard to solve.
  ◮ Links to communication complexity, compressed sensing,
◮ Sampling is a general technique for tackling massive amounts of data
◮ Example: To compute the median packet size of some IP packets,
◮ Challenge: But how do you take a sample from a stream of unknown
◮ Problem: Find uniform sample s from a stream of unknown length
◮ Algorithm:
  ◮ Initially s = x_1
  ◮ On seeing the t-th element, s ← x_t with probability 1/t
◮ Analysis:
  ◮ What’s the probability that s = x_i at some time t ≥ i?
  ◮ To get k samples we use O(k log n) bits of space.
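The sampling rule above can be sketched in a few lines of Python. The standard telescoping argument answers the analysis question: s = x_i at time t exactly when x_i was chosen at step i and never replaced afterwards, which happens with probability (1/i) · (i/(i+1)) · ... · ((t−1)/t) = 1/t. The function name `reservoir_sample` and the `rng` parameter are illustrative choices, not taken from the slides.

```python
import random


def reservoir_sample(stream, rng=random):
    """Return one uniformly random element from an iterable of unknown length.

    Returns None for an empty stream.
    """
    s = None
    for t, x in enumerate(stream, start=1):
        # Replace the current sample with probability 1/t.
        # For t = 1 the test always succeeds, so s starts as x_1.
        if rng.random() < 1.0 / t:
            s = x
    return s
```

Running k independent copies of this routine gives k samples (with replacement), each storing one element of [n], which matches the O(k log n) bits quoted above.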
◮ Problem: Maintain a uniform sample from the last w items
◮ Algorithm:
◮ Analysis:
  ◮ The probability that j-th oldest element is in S is 1/j so the
  ◮ Hence, algorithm only uses O(log w log n) bits of memory.
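The algorithm steps for this slide did not survive extraction. One standard approach that is consistent with the analysis above is priority sampling: give each arriving item a random priority, report the window item with the smallest priority, and store only items whose priority is smaller than that of every later item. The sketch below assumes that approach; the class name `WindowSampler` and its methods are hypothetical.

```python
import random
from collections import deque


class WindowSampler:
    """Uniform sample from the last w items via random priorities (assumed method)."""

    def __init__(self, w, rng=random):
        self.w = w
        self.rng = rng
        self.t = 0          # number of items seen so far
        self.S = deque()    # (index, priority, item); priorities increase left to right

    def add(self, x):
        self.t += 1
        v = self.rng.random()
        # A stored item with priority >= v can never be the window minimum
        # again, because the newer item outlives it. Discard such items.
        while self.S and self.S[-1][1] >= v:
            self.S.pop()
        self.S.append((self.t, v, x))
        # Expire items that have slid out of the window of the last w items.
        while self.S[0][0] <= self.t - self.w:
            self.S.popleft()

    def sample(self):
        # The leftmost stored item has the smallest priority in the window.
        # Raises IndexError if called before any item has been added.
        return self.S[0][2]
```

Since the priorities are independent and identically distributed, each of the window's items is equally likely to hold the minimum, so `sample()` is uniform over the window. The set S stays small in expectation, in line with the harmonic-sum argument the analysis bullet begins.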
◮ Universe sampling: For a random i ∈_R [n], compute
◮ Minwise hashing: Sample i ∈_R {i : there exists j such that x_j = i}
◮ AMS sampling: Sample x_j for j ∈_R [m] and compute
∑_i g(f_i) because
◮ Sketching is another general technique for processing streams
◮ Basic idea: Apply a linear projection “on the fly” that takes
◮ Input: Stream from two sources x_1, x_2, ..., x_m ∈ ([n] ∪ [n])^m
◮ Goal: Estimate difference between distribution of red values and blue
◮ Defn: A p-stable distribution µ has the following property:
◮ Algorithm:
  ◮ Generate random matrix A ∈ R^{k×n} where A_ij ∼ Cauchy, k = O(ε^{−2}).
  ◮ Compute sketches Af and Ag incrementally
  ◮ Return median(|t_1|, ..., |t_k|) where t = Af − Ag
◮ Analysis:
  ◮ By the 1-stability property for Z_i ∼ Cauchy
      |t_i| = |∑_j A_ij (f_j − g_j)| ∼ |Z_i| · ∑_j |f_j − g_j|
  ◮ For k = O(ε^{−2}), since median(|Z_i|) = 1, with high probability,
      (1 − ε) ∑_j |f_j − g_j| ≤ median(|t_1|, ..., |t_k|) ≤ (1 + ε) ∑_j |f_j − g_j|
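A minimal Python sketch of the Cauchy-projection estimator described above, assuming standard Cauchy entries generated by the inverse-CDF formula tan(π(U − 1/2)) and an explicitly stored matrix (a practical implementation would regenerate entries from a seed instead). The names `cauchy_matrix`, `L1Sketch`, and `estimate_l1` are illustrative, and coordinates are 0-indexed here.

```python
import math
import random
from statistics import median


def cauchy_matrix(k, n, seed=0):
    """k x n matrix of independent standard Cauchy entries."""
    rng = random.Random(seed)
    return [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
            for _ in range(k)]


class L1Sketch:
    """Maintains A f incrementally for a frequency vector f."""

    def __init__(self, A):
        self.A = A                    # must be shared by both sketches
        self.t = [0.0] * len(A)       # current value of A f

    def update(self, i, delta=1.0):
        # Seeing element i adds delta times column i of A to the sketch.
        for row, a_row in enumerate(self.A):
            self.t[row] += delta * a_row[i]


def estimate_l1(sketch_f, sketch_g):
    """Median of |(Af - Ag)_i|, an estimate of sum_j |f_j - g_j|."""
    return median(abs(a - b) for a, b in zip(sketch_f.t, sketch_g.t))
```

Both sketches must be built from the same matrix A; linearity is what lets Af − Ag stand in for A(f − g).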
◮ Heavy Hitters: Find all i such that f_i ≥ φm
◮ Range Sums: Estimate ∑_{i≤k≤j} f_k when i, j aren’t known in advance
◮ Find k-Quantiles: Find values q_0, ..., q_k such that
◮ Algorithm: Count-Min Sketch
  ◮ Maintain an array of counters c_{i,j} for i ∈ [d] and j ∈ [w]
  ◮ Construct d random hash functions h_1, h_2, ..., h_d : [n] → [w]
  ◮ Update counters: On seeing value v, increment c_{i,h_i(v)} for i ∈ [d]
  ◮ To get an estimate of f_k, return min_i c_{i,h_i(k)}
◮ Analysis: For d = O(log 1/δ) and w = O(1/ε^2)
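The Count-Min steps above translate fairly directly into code. The sketch below is one possible rendering, assuming integer stream values and hash functions of the form ((a·v + b) mod p) mod w with p a large prime; that hash family and the class name `CountMin` are assumptions, not details from the slides.

```python
import random


class CountMin:
    """d x w array of counters with one hash function per row."""

    def __init__(self, d, w, seed=0):
        rng = random.Random(seed)
        self.d, self.w = d, w
        self.p = (1 << 61) - 1   # a Mersenne prime, larger than any value hashed
        # One (a, b) pair per row defines h_i(v) = ((a*v + b) mod p) mod w.
        self.hashes = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(d)]
        self.c = [[0] * w for _ in range(d)]

    def _h(self, i, v):
        a, b = self.hashes[i]
        return ((a * v + b) % self.p) % self.w

    def update(self, v, count=1):
        # On seeing value v, increment c[i][h_i(v)] in every row i.
        for i in range(self.d):
            self.c[i][self._h(i, v)] += count

    def estimate(self, k):
        # Every row overestimates f_k (collisions only add), so take the minimum.
        return min(self.c[i][self._h(i, k)] for i in range(self.d))
```

With non-negative updates the estimate never falls below the true frequency; the parameters d and w control how far above it can land.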
◮ Input: Stream x_1, x_2, ..., x_m ∈ [n]^m
◮ Goal: Estimate the number of distinct values in the stream up to a
◮ Algorithm:
◮ Analysis:
t/(r(1+ε))] and X = ∑_i X_i
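The algorithm and analysis text for distinct-element estimation did not survive extraction. The surviving fragment (a threshold t/(r(1+ε)) and a sum of indicators X_i) is consistent with the common "t-th smallest hash value" estimator, so the sketch below assumes that method: hash every element into [0, 1), keep the t smallest distinct hash values, and return t divided by the t-th smallest. The use of SHA-256 as a stand-in for a random hash function and the name `estimate_distinct` are assumptions.

```python
import hashlib
import heapq


def _hash01(x):
    """Map x to a pseudo-random point in [0, 1) (stand-in for a random hash)."""
    digest = hashlib.sha256(repr(x).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_distinct(stream, t):
    """Estimate the number of distinct values from the t smallest hash values."""
    heap = []        # negated hash values: a max-heap over the t smallest seen
    kept = set()     # the same hash values, for duplicate detection
    for x in stream:
        h = _hash01(x)
        if h in kept:
            continue                      # repeat of an element already kept
        if len(heap) < t:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # h beats the current t-th smallest value: swap it in.
            evicted = -heapq.heapreplace(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < t:
        return len(heap)                  # fewer than t distinct values seen
    return t / (-heap[0])                 # t divided by the t-th smallest hash
```

Larger t gives a tighter estimate at the cost of storing more hash values; the relative error shrinks roughly like 1/√t.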
◮ Input: (x_1, y_1), (x_2, y_2), ..., (x_m, y_m)
◮ Goal: Estimate strength of correlation between x and y via the
◮ Result: (1 + ε) approx in ˜

◮ Input: Stream defines a matrix A ∈ R^{n×d} and b ∈ R^{n×1}
◮ Goal: Find x such that ‖Ax − b‖_2 is minimized.
◮ Result: (1 + ε) estimation in ˜
◮ Input: x_1, x_2, ..., x_m ∈ [n]^m
◮ Goal: Determine B bucket histogram H : [m] → R minimizing
◮ Result: (1 + ε) estimation in ˜

◮ Input: x_1, x_2, ..., x_m ∈ [n]^m
◮ Goal: Estimate number of transpositions |{i < j : x_i > x_j}|
◮ Goal: Estimate length of longest increasing subsequence
◮ Results: (1 + ε) approx in ˜
◮ Blog: http://polylogblog.wordpress.com
◮ Lectures: Piotr Indyk, MIT
◮ Books: