

slide-1
SLIDE 1

Crash Course on Data Stream Algorithms

Part I: Basic Definitions and Numerical Streams Andrew McGregor

University of Massachusetts Amherst

1/24

slide-8
SLIDE 8

Goals of the Crash Course

◮ Goal: Give a flavor for the theoretical results and techniques from the hundreds of papers on the design and analysis of stream algorithms. “When we abstract away the application-specific details, what are the basic algorithmic ideas and challenges in stream processing? What is and isn’t possible?”

◮ Disclaimer: Talks will be theoretical/mathematical but shouldn’t require much in the way of prerequisites.

◮ Request:
  ◮ If you get bored, ask questions. . .
  ◮ If you get lost, ask questions. . .
  ◮ If you’d like to ask questions, ask questions. . .

2/24

slide-9
SLIDE 9

Outline

Basic Definitions
Sampling
Sketching
Counting Distinct Items
Summary of Some Other Results

3/24


slide-16
SLIDE 16

Data Stream Model

◮ Stream: m elements from a universe of size n, e.g.,

x1, x2, . . . , xm = 3, 5, 3, 7, 5, 4, . . .

◮ Goal: Compute a function of the stream, e.g., median, number of distinct elements, longest increasing subsequence.

◮ Catch:
  1. Limited working memory, sublinear in n and m
  2. Access data sequentially
  3. Process each element quickly

◮ The model has origins in the 70s but has become popular in the last ten years because of a growing body of theory and its wide applicability.

5/24

slide-18
SLIDE 18

Why’s it become popular?

◮ Practical Appeal:
  ◮ Faster networks, cheaper data storage, and ubiquitous data-logging result in massive amounts of data to be processed.
  ◮ Applications to network monitoring, query planning, I/O efficiency for massive data, sensor network aggregation. . .

◮ Theoretical Appeal:
  ◮ Easy-to-state problems that are hard to solve.
  ◮ Links to communication complexity, compressed sensing, embeddings, pseudo-random generators, approximation. . .

6/24

slide-19
SLIDE 19

Outline

Basic Definitions
Sampling
Sketching
Counting Distinct Items
Summary of Some Other Results

7/24

slide-22
SLIDE 22

Sampling and Statistics

◮ Sampling is a general technique for tackling massive amounts of data

◮ Example: To compute the median packet size of some IP packets, we could just sample some and use the median of the sample as an estimate for the true median. Statistical arguments relate the size of the sample to the accuracy of the estimate.

◮ Challenge: But how do you take a sample from a stream of unknown length or from a “sliding window”?

8/24

slide-27
SLIDE 27

Reservoir Sampling

◮ Problem: Find a uniform sample s from a stream of unknown length

◮ Algorithm:
  ◮ Initially s = x1
  ◮ On seeing the t-th element, s ← xt with probability 1/t

◮ Analysis:
  ◮ What’s the probability that s = xi at some time t ≥ i? The product telescopes:

P[s = xi] = (1/i) × (1 − 1/(i+1)) × · · · × (1 − 1/t) = 1/t

  ◮ To get k samples we use O(k log n) bits of space.

9/24
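The update rule above is short enough to sketch directly in Python (a minimal sketch of the single-sample version; the function name is mine):

```python
import random

def reservoir_sample(stream):
    """Return a uniform sample from a stream of unknown length.

    On seeing the t-th element, the stored sample is replaced by that
    element with probability 1/t; the telescoping product on the slide
    shows each element ends up as the sample with probability 1/t overall.
    """
    sample = None
    for t, x in enumerate(stream, start=1):
        if random.random() < 1.0 / t:  # always true at t = 1, so s = x1 initially
            sample = x
    return sample
```

Repeating the procedure k times with independent randomness gives the k-sample variant mentioned on the slide.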

slide-34
SLIDE 34

Priority Sampling for Sliding Windows

◮ Problem: Maintain a uniform sample from the last w items

◮ Algorithm:
  1. For each xi we pick a random value vi ∈ (0, 1)
  2. In a window xj−w+1, . . . , xj return the value xi with smallest vi
  3. To do this, maintain the set S of all elements in the sliding window whose v value is minimal among subsequent values

◮ Analysis:
  ◮ The probability that the j-th most recent element is in S is 1/j, so the expected number of items in S is 1/w + 1/(w − 1) + · · · + 1/1 = O(log w)
  ◮ Hence, the algorithm only uses O(log w log n) bits of memory.

10/24
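Steps 1–3 above amount to maintaining a monotonic deque of (index, element, priority) triples with priorities increasing from front to back; the front is always the window sample. A minimal sketch (the function name is mine):

```python
import random
from collections import deque

def sliding_window_sampler(stream, w):
    """Yield, after each element, a uniform sample from the last w items."""
    S = deque()  # (index, element, priority), priorities increasing front-to-back
    for t, x in enumerate(stream):
        v = random.random()
        # Any stored element with a larger priority can never be the
        # window minimum again: its v is not minimal among later values.
        while S and S[-1][2] >= v:
            S.pop()
        S.append((t, x, v))
        # Expire elements that have fallen out of the window.
        while S[0][0] <= t - w:
            S.popleft()
        yield S[0][1]  # element with the minimal priority in the window
```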

slide-38
SLIDE 38

Other Types of Sampling

◮ Universe sampling: For a random i ∈R [n], compute fi = |{j : xj = i}|

◮ Minwise hashing: Sample i ∈R {i : there exists j such that xj = i}

◮ AMS sampling: Sample xj for j ∈R [m] and compute r = |{j′ ≥ j : xj′ = xj}|. Handy when estimating quantities like ∑i g(fi) because

E[m(g(r) − g(r − 1))] = ∑i g(fi)

11/24
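The AMS identity can be checked in code. As an illustrative instantiation (my choice, not from the slide), take g(x) = x², so ∑i g(fi) is the second frequency moment F2 and each sample contributes m(g(r) − g(r − 1)) = m(2r − 1):

```python
import random

def ams_f2_estimate(stream, trials=2000, seed=1):
    """Average of i.i.d. AMS estimates of F2 = sum_i f_i^2."""
    rng = random.Random(seed)
    m = len(stream)
    total = 0
    for _ in range(trials):
        j = rng.randrange(m)             # sample a uniform position j
        r = stream[j:].count(stream[j])  # occurrences of x_j at or after j
        total += m * (2 * r - 1)         # m * (g(r) - g(r-1)) for g(x) = x^2
    return total / trials
```

Each single estimate is unbiased; averaging over trials reduces the variance.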

slide-39
SLIDE 39

Outline

Basic Definitions Sampling Sketching Counting Distinct Items Summary of Some Other Results

12/24

slide-41
SLIDE 41

Sketching

◮ Sketching is another general technique for processing streams

◮ Basic idea: Apply a linear projection “on the fly” that takes high-dimensional data to a lower-dimensional space. Post-process the lower-dimensional image to estimate the quantities of interest.

13/24

slide-43
SLIDE 43

Estimating the difference between two streams

◮ Input: Stream x1, x2, . . . , xm of values from [n], each tagged as coming from one of two sources (“red” or “blue”)

◮ Goal: Estimate the difference between the distribution of red values and blue values, e.g.,

∑i∈[n] |fi − gi|  where fi = |{k : xk = i is red}| and gi = |{k : xk = i is blue}|

14/24

slide-49
SLIDE 49

p-Stable Distributions and Algorithm

◮ Defn: A p-stable distribution µ has the following property: for X, Y, Z ∼ µ and a, b ∈ R,

aX + bY ∼ (|a|^p + |b|^p)^{1/p} Z

e.g., the Gaussian distribution is 2-stable and the Cauchy distribution is 1-stable

◮ Algorithm:
  ◮ Generate a random matrix A ∈ R^{k×n} where Aij ∼ Cauchy, k = O(ε⁻²).
  ◮ Compute sketches Af and Ag incrementally
  ◮ Return median(|t1|, . . . , |tk|) where t = Af − Ag

◮ Analysis:
  ◮ By the 1-stability property, for Zi ∼ Cauchy,

|ti| = |∑j Ai,j(fj − gj)| ∼ |Zi| ∑j |fj − gj|

  ◮ For k = O(ε⁻²), since median(|Zi|) = 1, with high probability,

(1 − ε) ∑j |fj − gj| ≤ median(|t1|, . . . , |tk|) ≤ (1 + ε) ∑j |fj − gj|

15/24
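The 1-stable (Cauchy) sketch can be sketched as follows. A standard Cauchy variate is sampled by inverse CDF as tan(π(U − 1/2)); the helper names are my own:

```python
import math
import random
import statistics

def cauchy_matrix(k, n, seed=0):
    """k x n matrix of i.i.d. standard Cauchy entries (1-stable)."""
    rng = random.Random(seed)
    return [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
            for _ in range(k)]

def sketch_update(sketch, A, value, delta=1):
    """Incrementally add delta to coordinate `value` of the frequency vector."""
    for i in range(len(A)):
        sketch[i] += delta * A[i][value]

def l1_difference(sketch_f, sketch_g):
    """median(|Af - Ag|) estimates sum_j |f_j - g_j| since median(|Cauchy|) = 1."""
    return statistics.median(abs(a - b) for a, b in zip(sketch_f, sketch_g))
```

In streaming use, `sketch_update` is called once per arriving element, on the red sketch or the blue sketch according to the element's tag, so only the two length-k sketches need to be stored.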

slide-58
SLIDE 58

A Useful Multi-Purpose Sketch: Count-Min Sketch

◮ Heavy Hitters: Find all i such that fi ≥ φm

◮ Range Sums: Estimate ∑i≤k≤j fk when i, j aren’t known in advance

◮ Find k-Quantiles: Find values q0, . . . , qk such that q0 = 0, qk = n, and

∑i≤q_{j−1} fi < jm/k ≤ ∑i≤q_j fi

◮ Algorithm: Count-Min Sketch
  ◮ Maintain an array of counters ci,j for i ∈ [d] and j ∈ [w]
  ◮ Construct d random hash functions h1, h2, . . . , hd : [n] → [w]
  ◮ Update counters: On seeing value v, increment ci,hi(v) for each i ∈ [d]
  ◮ To get an estimate of fk, return f̃k = mini ci,hi(k)

◮ Analysis: For d = O(log 1/δ) and w = O(1/ε),

P[ fk ≤ f̃k ≤ fk + εm ] ≥ 1 − δ

16/24

slide-59
SLIDE 59

Outline

Basic Definitions
Sampling
Sketching
Counting Distinct Items
Summary of Some Other Results

17/24

slide-60
SLIDE 60

Counting Distinct Elements

◮ Input: Stream x1, x2, . . . , xm ∈ [n]^m

◮ Goal: Estimate the number of distinct values in the stream up to a multiplicative factor (1 + ε) with high probability.

18/24

slide-65
SLIDE 65

Algorithm

◮ Algorithm:
  1. Apply a random hash function h : [n] → [0, 1] to each element
  2. Compute φ, the t-th smallest hash value seen, where t = 21/ε²
  3. Return r̃ = t/φ as an estimate for r, the number of distinct items.

◮ Analysis:
  1. Algorithm uses O(ε⁻² log n) bits of space.
  2. We’ll show the estimate has good accuracy with reasonable probability:

P[|r̃ − r| ≤ εr] ≥ 9/10

19/24
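The three steps above can be sketched as follows, keeping only the t smallest distinct hash values. SHA-256 stands in for the idealized random hash h, and the names are mine:

```python
import hashlib

def _h(x):
    """Deterministic stand-in for a random hash h : [n] -> [0, 1)."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64

def distinct_estimate(stream, eps):
    t = int(21 / eps ** 2)
    smallest = set()  # the t smallest distinct hash values seen so far
    for x in stream:
        smallest.add(_h(x))       # duplicates hash identically, so no effect
        if len(smallest) > t:
            smallest.discard(max(smallest))
    if len(smallest) < t:
        return len(smallest)      # fewer than t distinct items: count is exact
    return t / max(smallest)      # phi = t-th smallest hash value
```

Only t = O(ε⁻²) hash values are stored, matching the O(ε⁻² log n) space bound.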

slide-70
SLIDE 70

Accuracy Analysis

1. Suppose the distinct items are a1, . . . , ar

2. Over-estimation:

P[r̃ ≥ (1 + ε)r] = P[t/φ ≥ (1 + ε)r] = P[φ ≤ t/(r(1 + ε))]

3. Let Xi = 1[h(ai) ≤ t/(r(1 + ε))] and X = ∑i Xi. Then

P[φ ≤ t/(r(1 + ε))] = P[X > t] = P[X > (1 + ε)E[X]]

4. By a Chebyshev analysis,

P[X > (1 + ε)E[X]] ≤ 1/(ε² E[X]) ≤ 1/20

5. Under-estimation: A similar analysis shows P[r̃ ≤ (1 − ε)r] ≤ 1/20

20/24

slide-71
SLIDE 71

Outline

Basic Definitions
Sampling
Sketching
Counting Distinct Items
Summary of Some Other Results

21/24

slide-73
SLIDE 73

Some Other Results

Correlations:

◮ Input: (x1, y1), (x2, y2), . . . , (xm, ym)
◮ Goal: Estimate the strength of correlation between x and y via the distance between the joint distribution and the product of the marginals.
◮ Result: (1 + ε) approx in Õ(ε^{−O(1)}) space.

Linear Regression:

◮ Input: Stream defines a matrix A ∈ R^{n×d} and b ∈ R^{n×1}
◮ Goal: Find x such that ‖Ax − b‖₂ is minimized.
◮ Result: (1 + ε) estimation in Õ(d²ε⁻¹) space.

22/24

slide-75
SLIDE 75

Some More Other Results

Histograms:

◮ Input: x1, x2, . . . , xm ∈ [n]^m
◮ Goal: Determine the B-bucket histogram H : [m] → R minimizing ∑i∈[m] (xi − H(i))²
◮ Result: (1 + ε) estimation in Õ(B²ε⁻¹) space

Transpositions and Increasing Subsequences:

◮ Input: x1, x2, . . . , xm ∈ [n]^m
◮ Goal: Estimate the number of transpositions |{i < j : xi > xj}|
◮ Goal: Estimate the length of the longest increasing subsequence
◮ Results: (1 + ε) approx in Õ(ε⁻¹) and Õ(ε⁻¹√n) space respectively

23/24

slide-76
SLIDE 76

Thanks!

◮ Blog: http://polylogblog.wordpress.com
◮ Lectures: Piotr Indyk, MIT
  http://stellar.mit.edu/S/course/6/fa07/6.895/
◮ Books:
  “Data Streams: Algorithms and Applications”, S. Muthukrishnan (2005)
  “Algorithms and Complexity of Stream Processing”, A. McGregor, S. Muthukrishnan (forthcoming)

24/24