Data Streams Tutorial

Andrew McGregor

University of Massachusetts, Amherst

Data Stream Model

[Morris ’78] [Munro, Paterson ’78] [Flajolet, Martin ’85] [Alon, Matias, Szegedy ’96] [Henzinger, Raghavan, Rajagopalan ’98]

  • Stream: m elements from some universe of size n, e.g., 3,5,3,7,5,4,8,5,3,7,5,4,8,6,3,2,6,4,7, ...
  • Goal: Estimate properties of the stream, e.g., the median, the number of distinct elements, the longest increasing subsequence.
  • The Catch:
  • i) Limited working memory, e.g., polylog(n,m)
  • ii) Access data sequentially
  • iii) Process each element quickly
  • Origins in the ’70s, but the model has become popular in the last ten years...
Why’s it become popular?

  • Practical Appeal: Faster networks, cheaper data storage, and ubiquitous data logging result in massive amounts of data to be processed. Applications to: network monitoring, query planning, I/O efficiency for massive data, sensor network aggregation, ...
  • Theoretical Appeal: Problems are easy to state but hard to solve. Links to: communication complexity, compressed sensing, embeddings, pseudo-random generators, approximation, ...

  • I. What's Done?
  • II. Some Tools
  • III. What's Next?

I. What's Done?

Basic Problems, Problem Variants, Model Variants

Families of Problems

  • Numbers: Distinct elements, heavy hitters, range queries, quantiles, frequency moments, matrix problems, ...
  • Graphs: Connectivity, MST, bipartiteness, matchings, distances, graph cuts, independent sets, number of triangles, ...
  • Points: Clustering, diameter, convex hulls, minimum enclosing balls, MST, facility location, earth mover distance, ...
  • Sequences: Longest increasing subsequence, pattern matching, periodicity, time-series histograms, DYCK languages, ...

Problem Variants

  • Sliding Window: Suppose you only want to solve the problem using the last w elements and have O(polylog w) space. Alternatively, elements could have time stamps and you want to solve the problem for elements with stamps in the last hour.
  • Uncertain Data: The ith element of the stream is a distribution μi that defines a random variable Xi. Consider the random variable g(X1, ..., Xm). Problems: What's the expected value of the maximum? What's the probability the graph is connected?

Model Variants

  • Multiple Passes: Suppose you can take p > 1 passes over the stream. Can we trade off passes with space?
  • Example: Most common item in Θ̃(n/p) space.
  • Example: Can find the elements of a length-k increasing subsequence in Θ̃(k^{1+1/(2p−1)}) space.
  • Random Order: We normally assume that the stream is ordered adversarially. What if it's ordered randomly?
  • Example: Can find the median of a random-order stream in O(n^{1/2}) space. If the order is adversarial, it takes Ω(n) space.
  • Example: Estimating Fk takes roughly the same space in the random-order and adversarial settings.

  • I. What's Done?
  • II. Some Tools
  • III. What's Next?

II. Some Tools

Sketching, Sampling, Lower Bounds

First Idea: Sketches

Sketch: Z f = t, where f = (f1, f2, ..., fn) is the frequency vector, Z is a k×n matrix, and t = (t1, t2, ..., tk).

  • Algorithm uses a (random) projection matrix Z such that the relevant properties of f can be estimated from the sketch Zf.
  • Easy to Update: On seeing “i”, add the ith column of Z to the sketch.
  • Store Matrix Implicitly: Need to be able to efficiently generate any entry of Z from a “small” random seed.
  • Gives an Õ(k)-space algorithm with seed & precision assumptions.

Algorithm for Estimating F2

Consider a row z of the projection matrix. Let the entries of z be uniform in {-1,1}, chosen with 4-wise independence, and let t = z·f. The square of each sketch entry is concentrated around F2:

Expectation: E(t^2) = ∑_{i,j} E(z_i z_j) f_i f_j = F2

Variance: Var(t^2) ≤ ∑_{i,j,k,l} E(z_i z_j z_k z_l) f_i f_j f_k f_l < 6 F2^2

By Chebyshev, setting k = O(ε^{-2} log δ^{-1}) ensures that, with probability 1-δ, the average of the squared entries is (1±ε)F2.
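For concreteness, here is a minimal Python sketch of this estimator (the AMS "tug-of-war" sketch). The class names, the parameter k, and the hash construction (a random degree-3 polynomial, with the sign taken from the parity of the hash value) are illustrative choices rather than the exact construction from the slides.

```python
import random

# Minimal sketch of the AMS F2 estimator described above. Signs z(i) come from a
# random degree-3 polynomial over a prime field; taking the parity of the hash
# value is a simplification that makes the +/-1 values only approximately unbiased.
PRIME = 2_147_483_647

class FourWiseSign:
    def __init__(self, rng):
        self.coeffs = [rng.randrange(PRIME) for _ in range(4)]

    def __call__(self, i):
        h = 0
        for c in self.coeffs:          # Horner evaluation of the polynomial at i
            h = (h * i + c) % PRIME
        return 1 if h % 2 == 0 else -1

class AMSSketchF2:
    """Maintain k counters t_r = sum_i z_r(i) * f_i; estimate F2 as the average of t_r^2."""
    def __init__(self, k, seed=0):
        rng = random.Random(seed)
        self.signs = [FourWiseSign(rng) for _ in range(k)]
        self.t = [0] * k

    def update(self, i, delta=1):      # linear sketch: also handles deletions (delta=-1)
        for r, z in enumerate(self.signs):
            self.t[r] += delta * z(i)

    def estimate(self):
        return sum(x * x for x in self.t) / len(self.t)

stream = [3, 5, 3, 7, 5, 4, 8, 5, 3, 7, 5, 4, 8, 6, 3, 2, 6, 4, 7]
sketch = AMSSketchF2(k=400, seed=1)
for item in stream:
    sketch.update(item)
print(sketch.estimate())   # typically close to the exact F2 = 59 for this stream
```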

Second Idea: Sampling

  • Let's sample from S = [a1, a2, a3, ..., am] where each ai ∈ [n]
  • Distribution Sampling: Return i with probability fi/m
  • Universe Sampling: Return (i, fi) where i ∈R [n]
  • AMS Sampling: Return (i, r) with i chosen with probability fi/m and r ∈R [fi]
  • Implementation: Sample aj for j ∈R [m], let i = aj, and compute r = |{j′ ≥ j : aj′ = aj}|
  • Useful for estimating ∑i g(fi) because E[m(g(r) − g(r−1))] = ∑i g(fi) when g(0) = 0 (see the sketch below)
  • Lp Sampling: Return i with probability fi^p/Fp
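A minimal Python sketch of AMS sampling, assuming a one-pass stream and a user-supplied g with g(0) = 0. The reservoir-sampling step picks the position j uniformly; the function name and the choice g(x) = x^3 are illustrative.

```python
import random

def ams_estimate(stream, g, rng=random):
    """One AMS-sampling estimator for sum_i g(f_i): pick a uniform position j by
    reservoir sampling, let r count occurrences of a_j from j onwards, and
    return m*(g(r) - g(r-1))."""
    m, sample, r = 0, None, 0
    for a in stream:
        m += 1
        if rng.randrange(m) == 0:      # keep position m with probability 1/m
            sample, r = a, 1
        elif a == sample:
            r += 1                     # another occurrence of a_j after position j
    return m * (g(r) - g(r - 1))

# Averaging independent estimators reduces the variance.
stream = [3, 5, 3, 7, 5, 4, 8, 5, 3, 7, 5, 4, 8, 6, 3, 2, 6, 4, 7]
g = lambda x: x ** 3                    # estimate F3 = sum_i f_i^3
estimates = [ams_estimate(stream, g) for _ in range(5000)]
print(sum(estimates) / len(estimates))  # unbiased; the exact F3 here is 199
```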
L0 Sampling

Suppose we know F0 (the number of distinct elements). Pick a hash function h: [n] → [F0].

Algorithm: Maintain values c and id, initially 0. For each j in the stream: if h(j) = 1, set c ← c+1 and id ← id+j. Return id/c if all elements hashing to 1 were the same.

Claim: This happens with constant probability.
Claim: We need to check that all elements hashing to 1 were the same.

In general, run O(log n) copies guessing F0 = 2^i; at least one instantiation works with constant probability. The algorithm is a sketch and works with deletions!
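Below is a minimal, single-guess Python sketch of this L0 sampler, assuming F0 is known. For illustration the hash h is memoized in a dictionary and the final check is only a plausibility test; the real construction derives h from a small seed and uses a fingerprint to verify that all items hashing to 1 were identical.

```python
import random

def l0_sample_once(updates, f0, rng=random):
    """One instantiation of the L0 sampler: updates is a stream of (j, +1/-1).
    Returns a (near-)uniform element of the support, or None if this copy fails."""
    hash_bit = {}                      # illustration only: h should come from a small seed
    c, idsum = 0, 0
    for j, delta in updates:
        if j not in hash_bit:
            hash_bit[j] = (rng.randrange(f0) == 0)   # h(j) = 1 with probability 1/F0
        if hash_bit[j]:
            c += delta                 # net count of updates hashing to 1
            idsum += delta * j         # net sum of their identities
    if c > 0 and idsum % c == 0 and hash_bit.get(idsum // c, False):
        return idsum // c              # plausible single survivor; a real sketch
    return None                        # verifies this with a fingerprint

updates = [(j, +1) for j in [3, 5, 3, 7, 5, 4, 8, 5]] + [(7, -1)]   # item 7 is deleted
print(l0_sample_once(updates, f0=4))   # one of {3, 4, 5, 8}, or None if the guess fails
```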

Third Idea: Lower Bounds

  • Many space lower bounds in the data stream model use reductions from communication complexity.
  • Example: Alice and Bob have x, y ∈ {0,1}^n and Bob wants to check DISJOINTNESS, i.e., is there an i with xi = yi = 1?
  • Thm: Any 1/3-error protocol for DISJOINTNESS requires Ω(n) bits of communication.
  • Corollary: Any 1/3-error stream algorithm that checks if a graph is triangle-free needs Ω(n^2) bits of memory.

Lower Bound for Triangle Detection

Alice and Bob have X, Y ∈ {0,1}^{n×n}. For Bob to check whether Xij = Yij = 1 for some i, j requires Ω(n^2) communication. Let A be an s-space algorithm that checks for triangles. Consider a 3-layer graph on (U, V, W) with |U| = |V| = |W| = n. Alice runs A on E1 = {ui wi : 1 ≤ i ≤ n} and E2 = {ui vj : Xij = 1}, then sends the memory state to Bob, who runs A on E3 = {vj wi : Yij = 1}. The output of A resolves the matrix question, so s = Ω(n^2).
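The reduction is easy to simulate in code. In the toy Python sketch below, has_triangle stands in for the streaming algorithm A (it is exact rather than small-space); the construction of the 3-layer edge stream follows the argument above.

```python
def has_triangle(edges):
    """Stand-in for the streaming algorithm A (exact, not small-space)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return any(adj[u] & adj[v] for u, v in edges)   # common neighbour => triangle

def matrices_intersect(X, Y):
    """Decide whether X_ij = Y_ij = 1 for some i, j via the 3-layer graph."""
    n = len(X)
    E1 = [(("u", i), ("w", i)) for i in range(n)]                        # fixed matching
    E2 = [(("u", i), ("v", j)) for i in range(n) for j in range(n) if X[i][j]]
    E3 = [(("v", j), ("w", i)) for i in range(n) for j in range(n) if Y[i][j]]
    # Alice streams E1 then E2, hands A's memory to Bob, who streams E3.
    return has_triangle(E1 + E2 + E3)

X = [[0, 1], [0, 0]]
Y = [[0, 1], [1, 0]]
print(matrices_intersect(X, Y))   # True: X_01 = Y_01 = 1 gives the triangle u0-v1-w0
```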

Useful Communication Results

  • Indexing: Alice has x ∈ {0,1}^n, Bob has i ∈ [n]. Bob wants to learn xi. One-way communication requires Ω(n) bits even if Bob also knows the first i−1 bits of x.
  • Gap-Hamming: Alice and Bob have x, y ∈ {0,1}^n. Distinguish Δ(x,y) < n/2 − √n from Δ(x,y) > n/2 + √n. Requires Ω(n) communication.
  • Multi-Party Disjointness: t players have x1, x2, ..., xt ∈ {0,1}^n. Need to distinguish “x1i = x2i = ... = xti = 1 for some i” from “all vectors pairwise orthogonal”. Requires Ω(n/t) communication.
Bonus! The Fourth Idea

  • Algorithmic tools will only get you so far; sometimes you need to come up with neat ad hoc solutions.
  • Graph Distances: Given a stream of edges, approximate the shortest-path distance between any two nodes.
  • k-Center: Given a stream of points, find a set of k centers that minimizes the maximum distance from a point to its nearest center.

Approximate Distances

Edges define the shortest-path graph metric dG. An α-spanner of G = (V, E) is a subgraph H = (V, E′) such that ∀u,v: dG(u,v) ≤ dH(u,v) ≤ α·dG(u,v).

Algorithm: Let E′ be initially empty. On seeing edge (u,v), add it to E′ if dH(u,v) > 2t−1.

Analysis: Each distance increases by at most a factor of 2t−1, and |E′| = O(n^{1+1/t}) because all cycles in H have length > 2t.
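A minimal Python sketch of this construction, assuming vertices 0..n−1 and an undirected edge stream. The bounded BFS used to test dH(u,v) > 2t−1 is one straightforward way to implement the distance check; it is illustrative rather than optimized.

```python
from collections import deque

def spanner_dist_at_most(adj, src, dst, limit):
    """Return True if dist(src, dst) <= limit in the current spanner H."""
    if src == dst:
        return True
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if dist[u] == limit:           # do not expand beyond the distance cap
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                if v == dst:
                    return True
                queue.append(v)
    return False

def streaming_spanner(n, edge_stream, t):
    """Keep edge (u,v) only if u and v are currently at distance > 2t-1 in H."""
    adj = {v: set() for v in range(n)}
    kept = []
    for u, v in edge_stream:
        if not spanner_dist_at_most(adj, u, v, 2 * t - 1):
            adj[u].add(v)
            adj[v].add(u)
            kept.append((u, v))
    return kept        # H = (V, kept) is a (2t-1)-spanner with O(n^{1+1/t}) edges

edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 0), (1, 3)]
print(streaming_spanner(4, edges, t=2))   # t=2 gives a 3-spanner; keeps 3 of 6 edges
```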

k-Center Clustering

2-approximation in O(k) space if you already know OPT; (2+ε)-approximation in O(k ε^{-1} log Δ) space if 1 ≤ OPT ≤ Δ.

Better Algorithm, using O(k ε^{-1} log ε^{-1}) space: Instantiate the basic algorithm with guesses 1, (1+ε), (1+ε)^2, ..., 2ε^{-1}. If guess r stops working at the (j+1)th point: let q1, ..., qk be the centers chosen so far. Then p1, ..., pj are all at most 2r from some qi, and the OPT for {q1, ..., qk, pj+1, ..., pn} is at most OPT + 2r. Hence, an instantiation with guess 2r/ε can use {q1, ..., qk, pj+1, ..., pn} rather than {p1, ..., pn}.
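For concreteness, here is a minimal Python sketch of the basic guess-r subroutine referenced above, plus a simple geometric search over guesses (roughly the (2+ε)-approximation when 1 ≤ OPT ≤ Δ). The names, 2-D points, and Euclidean distance are illustrative assumptions; the space-efficient version runs the guesses in parallel over a single pass rather than re-reading the points as this toy does.

```python
import math

def k_center_with_guess(points, k, r):
    """Basic algorithm for a guessed optimal radius r: open a new center whenever a
    point is farther than 2r from all current centers; fail if more than k open."""
    centers = []
    for p in points:
        if all(math.dist(p, c) > 2 * r for c in centers):
            centers.append(p)
            if len(centers) > k:
                return None            # guess r was too small
    return centers                     # every point is within 2r of some center

def k_center(points, k, r0=1.0, eps=0.5, max_guesses=64):
    r = r0
    for _ in range(max_guesses):       # guesses r0, r0(1+eps), r0(1+eps)^2, ...
        centers = k_center_with_guess(points, k, r)
        if centers is not None:
            return r, centers
        r *= 1 + eps
    return None

pts = [(0, 0), (1, 0), (10, 0), (10, 1), (20, 5)]
print(k_center(pts, k=3))   # -> (1.0, [(0, 0), (10, 0), (20, 5)])
```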

  • I. What's Done?
  • II. Some Tools
  • III. What's Next?

III. What's Next?

Open Problems, Annotations, Space-Efficient Sampling

Open Problems

  • Lists of Open Problems:
  • Original Kanpur List (2007): www.cse.iitk.ac.in/users/sganguly/data-stream-probs.pdf
  • Bertinoro & Second Kanpur List (2011): www.cs.umass.edu/~mcgregor/papers/11-openproblems.pdf
  • Longest Increasing Subsequence: Given a stream of n values, approximate the length of the LIS.
  • Earth Mover Distance: Given a stream of n red points and n blue points in [n]^2, approximate the min-cost matching.

Annotated Streams

  • Helper: Has unbounded memory but can't tell the future. Sends h bits of advice, including the final answer.
  • Verifier: Has v bits of memory to process the stream and the advice. Needs to verify that the provided answer is correct.
  • Example: There exists a helper/verifier protocol for the median where h = v = O(√m).
  • Open Problem: Testing if a graph is triangle-free.

Space-Efficient Sampling

  • Stochastically Generated Stream: E.g., the stream is generated by taking independent samples from an unknown distribution μ.
  • Sample Complexity: How many samples are needed to estimate f(μ)?
  • Space Complexity: How much memory is needed to process the samples?
  • By increasing the sample complexity, we can perhaps decrease the space complexity. This is the case for frequency moments.

Resources

  • Blog: http://polylogblog.wordpress.com
  • Lectures: Piotr Indyk (MIT), http://stellar.mit.edu/S/course/6/fa07/6.895/
  • Books:
  • “Data Streams: Algorithms and Applications”, S. Muthukrishnan (2005)
  • “Algorithms and Complexity of Stream Processing”, A. McGregor, S. Muthukrishnan (forthcoming)
Median with Annotations

Define the “cumulative frequency” vector gi = f1 + f2 + ... + fi. It is easy to see that i is the median iff gi−1 < m/2 and gi ≥ m/2.

Partition g into v = m^{1/2} segments of length h = m^{1/2}.

Verifier: a) Construct a fingerprint of each segment; b) compute the last entry in each segment; c) identify the “interesting” segment where g crosses m/2.

Helper: Presents the interesting segment in its entirety.

Example g: 1 1 3 5 6 7 8 8 8 8 10 10 11 12 12 15 18 18 19 20 22 22 23 25 25
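The Python sketch below illustrates the fingerprint-and-reveal idea. It is a toy, not the exact protocol: for readability the verifier materializes g before fingerprinting, whereas the real verifier maintains only one fingerprint and one last entry per segment (O(√m) space) as the stream arrives; the prime modulus and function names are illustrative.

```python
import random

P = 2_147_483_647                          # prime modulus for polynomial fingerprints

def fingerprint(values, x):
    fp = 0
    for v in reversed(values):             # Horner: sum_j values[j] * x^j  (mod P)
        fp = (fp * x + v) % P
    return fp

def verify_median(stream, n, claimed_segment, seg_index, h):
    """Check the helper's revealed segment against a fingerprint of g and, if it
    matches, read the median off the verified segment."""
    m = len(stream)
    count = [0] * (n + 1)
    for a in stream:
        count[a] += 1
    g = []
    for i in range(1, n + 1):              # toy step: materialize g; the real verifier
        g.append((g[-1] if g else 0) + count[i])   # fingerprints it incrementally
    x = random.randrange(1, P)
    if fingerprint(claimed_segment, x) != fingerprint(g[seg_index*h:(seg_index+1)*h], x):
        return "REJECT"
    prev = g[seg_index*h - 1] if seg_index > 0 else 0   # last entry of previous segment
    for offset, gi in enumerate(claimed_segment):
        if prev < m / 2 <= gi:
            return seg_index * h + offset + 1           # this index i is the median
        prev = gi
    return "REJECT"

stream = [3, 5, 3, 7, 5, 4, 8, 5, 3, 7, 5, 4, 8, 6, 3, 2, 6, 4, 7]
print(verify_median(stream, n=8, claimed_segment=[8, 12, 14], seg_index=1, h=3))   # -> 5
```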
