Data Streams Tutorial
Andrew McGregor
University of Massachusetts, Amherst
Data Stream Model
[Morris ’78] [Munro, Paterson ’78] [Flajolet, Martin ’85] [Alon, Matias, Szegedy ’96] [Henzinger, Raghavan, Rajagopalan ’98]
A stream of elements arrives one by one, e.g., 3,5,3,7,5,4,8,5,3,7,5,4,8,6,3,2,6,4,7, ...
Goal: compute statistics of the stream in small space, e.g., the number of distinct elements or the longest increasing sequence.
Faster networks, cheaper data storage, and ubiquitous data-logging result in massive amounts of data to be processed. Applications: network monitoring, query planning, I/O efficiency for massive data, sensor-network aggregation, ...
The problems are easy to state but hard to solve. Links to: communication complexity, compressed sensing, embeddings, pseudo-random generators, approximation, ...
frequency moments, matrix problems, ...
graph cuts, independent sets, number of triangles, ...
balls, MST, facility location, earth mover distance, ...
periodicity, time-series histograms, Dyck languages, ...
Sliding Window: only consider the most recent w elements and have O(polylog w) space, e.g., solve the problem for elements with time-stamps in the last hour.
Random data: treat each stream element as a random variable Xi and consider the random variable g(X1, ..., Xm), e.g., what’s the probability the graph is connected?
Multiple Passes: suppose we may take p passes over the stream. Can we trade off passes with space?
The exact median can be found in Θ̃(n/p) space.
An increasing subsequence of length k can be found in Θ̃(k^(1 + 1/(2p−1))) space.
If the stream is in random order, the median can be found in a single pass in O(√n) space; if the order is adversarial, it takes Ω(n) space.
Similar gaps between the random and adversarial settings arise for other problems.
Sketches: maintain a random linear projection of the frequency vector f:
Z · (f1, f2, ..., fn) = (t1, t2, ..., tk)
Relevant properties of f can be estimated from the sketch Zf.
Need to be able to generate any entry of Z from a “small” random seed.
Example, estimating F2 = ∑i fi²: consider a row z of the projection matrix Z. Let the entries of z be uniform in {−1, +1}, chosen with 4-wise independence, and let t = z·f.
Expectation: E(t²) = ∑i,j E(zizj)fifj = F2
Variance: Var(t²) ≤ ∑i,j,k,l E(zizjzkzl)fifjfkfl < 6F2²
So the square of each entry of the sketch is concentrated around F2. By Chebyshev, setting k = O(ε⁻² log δ⁻¹) ensures a (1±ε)-approximation of F2 with probability at least 1−δ.
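As an illustration, here is a minimal Python sketch of this F2 estimator. For simplicity it draws fully random ±1 signs rather than generating each row 4-wise independently from a small seed, so it is faithful to the estimator but not to the space bound.

```python
import random

def ams_f2_estimate(stream, n, k=200, seed=0):
    """Estimate F2 = sum_i f_i^2 using k rows of random +/-1 signs.

    Each row r keeps t_r = z_r . f, updated incrementally as elements
    arrive; E[t_r^2] = F2, so the average of the t_r^2 is the estimate.
    """
    rng = random.Random(seed)
    signs = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]
    t = [0] * k
    for j in stream:               # one pass over the stream
        for r in range(k):
            t[r] += signs[r][j]
    return sum(x * x for x in t) / k

stream = [3, 5, 3, 7, 5, 4, 8, 5, 3, 7, 5, 4, 8, 6, 3, 2, 6, 4, 7]
est = ams_f2_estimate(stream, n=9, k=400)   # exact F2 of this stream is 59
```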
Sampling a uniform element from the set of distinct elements:
Suppose we know F0 (the number of distinct elements). Pick a hash function h: [n] → [F0].
Algorithm: maintain values c and id, initially 0. For each j in the stream: if h(j) = 1, then c ← c+1 and id ← id+j. Return id/c if all elements hashing to 1 were the same.
Claim: this happens with constant probability.
Claim: we also need to check that the elements hashing to 1 were the same (e.g., via a fingerprint).
Since F0 is unknown, run O(log n) copies guessing F0 = 2^i. At least one instantiation works with constant probability.
The algorithm is a sketch and works with deletions!
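A toy Python version of this scheme (the function name and the equality check are illustrative choices): instead of a separate fingerprint, it keeps sums of j and j² per guess level, which satisfy c·(sum of squares) = (sum)² exactly when all counted updates came from a single element.

```python
import random

def distinct_sample(stream, n, seed=0):
    """For each guess F0 ~ 2^i, hash [n] so that each element lands in a
    distinguished bucket with probability 1/2^i, and track the elements
    landing there via (count, sum, sum of squares)."""
    rng = random.Random(seed)
    levels = max(1, n.bit_length())
    hashed = [{j: rng.randrange(2 ** i) == 0 for j in range(n)}
              for i in range(levels)]
    state = [[0, 0, 0] for _ in range(levels)]   # c, s, s2 per level
    for j in stream:
        for i in range(levels):
            if hashed[i][j]:
                state[i][0] += 1
                state[i][1] += j
                state[i][2] += j * j
    # Return the sample from the first level that isolated one element.
    for c, s, s2 in state:
        if c > 0 and c * s2 == s * s:   # all updates were the same j
            return s // c
    return None

sample = distinct_sample([3, 5, 3, 7, 5, 4, 8], n=9)
```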
Lower bounds follow from reductions from communication complexity.
Alice has x ∈ {0,1}^n and Bob has y ∈ {0,1}^n.
For Bob to check DISJOINTNESS, i.e., whether there is an i with xi = yi = 1, requires Ω(n) bits of communication, even for randomized protocols.
E.g., testing whether a graph is triangle-free needs Ω(n²) bits of memory.
Alice and Bob have X, Y ∈ {0,1}^(n×n). For Bob to check whether Xij = Yij = 1 for some i, j needs Ω(n²) communication.
Let A be an s-space algorithm that checks for triangles.
Consider a 3-layer graph on (U, V, W) with |U| = |V| = |W| = n.
Alice runs A on E1 = {uiwi : 1 ≤ i ≤ n} and E2 = {uivj : Xij = 1}.
She sends A’s memory to Bob, who continues running A on E3 = {vjwi : Yij = 1}.
The output of A resolves the matrix question, so s = Ω(n²).
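The reduction itself can be sanity-checked in code: build the three edge sets from X and Y and confirm the layered graph has a triangle exactly when the matrices share a 1. This small Python check illustrates the construction; it is of course not part of the lower-bound argument.

```python
def reduction_edges(X, Y):
    """3-layer graph on u_i, v_j, w_i: E1 pairs u_i with w_i,
    E2 encodes the 1-entries of X, E3 encodes the 1-entries of Y."""
    n = len(X)
    E1 = [(('u', i), ('w', i)) for i in range(n)]
    E2 = [(('u', i), ('v', j)) for i in range(n) for j in range(n) if X[i][j]]
    E3 = [(('v', j), ('w', i)) for i in range(n) for j in range(n) if Y[i][j]]
    return E1 + E2 + E3

def has_triangle(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    # A triangle exists iff the endpoints of some edge share a neighbour.
    return any(adj[a] & adj[b] for a, b in edges)

# X and Y share a 1 at (0,0), so the triangle u_0, v_0, w_0 appears:
found = has_triangle(reduction_edges([[1, 0], [0, 0]], [[1, 0], [0, 0]]))
```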
Stronger bounds follow from harder communication problems:
Augmented INDEX: Bob must output a bit xi of Alice’s string when he also knows the first i−1 bits of x.
GAP-HAMMING: distinguish Δ(x, y) < n/2 − √n from Δ(x, y) > n/2 + √n.
Multi-party disjointness: distinguish x1i = x2i = ... = xti = 1 for some i from all vectors being orthogonal.
Graph problems often need neat ad hoc solutions:
Spanners: approximate the shortest path distance between any two nodes.
k-center: pick k centers to minimize the max distance from a point to its nearest center.
Edges define a shortest-path graph metric dG. An α-spanner of G = (V, E) is a subgraph H = (V, E′) such that ∀u,v: dG(u,v) ≤ dH(u,v) ≤ α·dG(u,v).
Algorithm: let E′ be initially empty; add each streamed edge (u,v) to E′ if dH(u,v) > 2t−1.
Analysis: each distance increases by at most a factor of 2t−1, and |E′| = O(n^(1+1/t)) because all cycles in H have length > 2t.
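A minimal Python rendering of the greedy rule (it recomputes distances with BFS truncated at 2t−1 rather than using streaming-friendly data structures):

```python
from collections import deque

def bounded_dist(adj, s, g, limit):
    """Hop distance from s to g in adj, or limit + 1 if it exceeds limit."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        if u == g:
            return dist[u]
        if dist[u] < limit:
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
    return limit + 1

def greedy_spanner(n, edge_stream, t):
    """Keep edge (u, v) iff u and v are currently more than 2t-1 apart.
    The kept subgraph is a (2t-1)-spanner; it has girth > 2t, hence
    O(n^{1+1/t}) edges."""
    adj = {v: set() for v in range(n)}
    kept = []
    for u, v in edge_stream:
        if bounded_dist(adj, u, v, 2 * t - 1) > 2 * t - 1:
            adj[u].add(v)
            adj[v].add(u)
            kept.append((u, v))
    return kept
```

For t = 2 on a triangle, the third edge is rejected because its endpoints are already at distance 2 ≤ 2t−1.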
A 2-approximation in O(k) space if you already know OPT.
A (2+ε)-approximation in O(k ε⁻¹ log Δ) space if 1 ≤ OPT ≤ Δ.
Better algorithm, O(k ε⁻¹ log ε⁻¹) space: instantiate the basic algorithm with guesses 1, (1+ε), (1+ε)², ..., 2ε⁻¹.
If guess r stops working at the (j+1)-th point: let q1, ..., qk be the centers chosen so far. Then p1, ..., pj are all at most 2r from some qi, so OPT for {q1, ..., qk, pj+1, ..., pn} is at most OPT + 2r.
Hence, an instantiation with guess 2r/ε can use {q1, ..., qk, pj+1, ..., pn} rather than {p1, ..., pn}.
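The "basic algorithm" with a known guess r for OPT can be sketched as follows (the function name is illustrative):

```python
def k_center_with_guess(point_stream, k, r, dist):
    """Keep a point as a new center when it is more than 2r from every
    current center. If more than k centers would be needed, the guess r
    was too small (return None). Otherwise every point ends up within 2r
    of some center, giving a 2-approximation when r = OPT."""
    centers = []
    for p in point_stream:
        if all(dist(p, c) > 2 * r for c in centers):
            if len(centers) == k:
                return None   # guess failed; a larger guess takes over
            centers.append(p)
    return centers

centers = k_center_with_guess([0, 1, 10, 11], k=2, r=1,
                              dist=lambda a, b: abs(a - b))
```

Running all the guesses in parallel, and restarting failed guesses from the surviving centers as described above, yields the O(k ε⁻¹ log ε⁻¹)-space algorithm.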
Longest Increasing Subsequence: Given a stream of n values, approximate the length of the LIS.
Earth Mover Distance: Given a stream of n red points and n blue points in [n]2, approximate min-cost matching.
[Figure: a stream interleaved with helper-provided ADVICE.]
Annotation model: an untrusted helper annotates the stream with advice, and the verifier checks the claimed answer.
E.g., the median can be verified with annotation length h and verifier space v, where h = v = O(√m).
Open Problem: Testing if a graph is triangle-free.
The Source: the stream is generated by taking independent samples from an unknown distribution μ.
Sometimes randomness in the input does not reduce the space complexity. This is the case for frequency moments.
Define the “cumulative frequency” vector gi = f1 + f2 + ... + fi, e.g.,
g = 1 1 3 5 6 7 8 8 8 8 10 10 11 12 12 15 18 18 19 20 22 22 23 25 25
Easy to see that i is the median iff g(i−1) < m/2 and gi ≥ m/2.
Partition g into v = m^(1/2) segments of length h = m^(1/2).
Verifier: a) construct a fingerprint of each segment.
Helper: presents the entirety of the interesting segment (where g crosses m/2), which the verifier checks against its fingerprint.
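The characterization of the median via g is easy to check in code. The following is a plain computation on the slide's example, not the fingerprint protocol itself:

```python
def median_from_cumulative(f):
    """Given the frequency vector f over universe [1..n], build the
    cumulative vector g and return the i with g_{i-1} < m/2 <= g_i."""
    g, total = [], 0
    for count in f:
        total += count
        g.append(total)
    m = total
    for i in range(1, len(g) + 1):
        prev = g[i - 2] if i > 1 else 0
        if prev < m / 2 <= g[i - 1]:
            return i
    return None

# Frequencies whose cumulative vector is the slide's example;
# g crosses m/2 = 12.5 between positions 15 and 16.
f = [1, 0, 2, 2, 1, 1, 1, 0, 0, 0, 2, 0, 1,
     1, 0, 3, 3, 0, 1, 1, 2, 0, 1, 2, 0]
median = median_from_cumulative(f)
```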