[PPT] - Data Streams Tutorial Andrew McGregor University of Massachusetts, PowerPoint Presentation

SLIDE 1

Data Streams Tutorial

Andrew McGregor

University of Massachusetts, Amherst

SLIDE 2

Data Stream Model

[Morris ’78] [Munro, Paterson ’78] [Flajolet, Martin ’85] [Alon, Matias, Szegedy ’96] [Henzinger, Raghavan, Rajagopalan ’98]

SLIDE 3

Data Stream Model

[Morris ’78] [Munro, Paterson ’78] [Flajolet, Martin ’85] [Alon, Matias, Szegedy ’96] [Henzinger, Raghavan, Rajagopalan ’98]

Stream: m elements from some universe of size n

e.g., 3,5,3,7,5,4,8,5,3,7,5,4,8,6,3,2,6,4,7, ...

SLIDE 4

Data Stream Model

[Morris ’78] [Munro, Paterson ’78] [Flajolet, Martin ’85] [Alon, Matias, Szegedy ’96] [Henzinger, Raghavan, Rajagopalan ’98]

Stream: m elements from some universe of size n

e.g., 3,5,3,7,5,4,8,5,3,7,5,4,8,6,3,2,6,4,7, ...

Goal: Estimate properties of the stream, e.g., median,

number of distinct elements, longest increasing sequence.

SLIDE 5

Data Stream Model

[Morris ’78] [Munro, Paterson ’78] [Flajolet, Martin ’85] [Alon, Matias, Szegedy ’96] [Henzinger, Raghavan, Rajagopalan ’98]

Stream: m elements from some universe of size n

e.g., 3,5,3,7,5,4,8,5,3,7,5,4,8,6,3,2,6,4,7, ...

Goal: Estimate properties of the stream, e.g., median,

number of distinct elements, longest increasing sequence.

The Catch:
i) Limited working memory, e.g., polylog(n,m)
ii) Access data sequentially
iii) Process each element quickly

SLIDE 6

Data Stream Model

[Morris ’78] [Munro, Paterson ’78] [Flajolet, Martin ’85] [Alon, Matias, Szegedy ’96] [Henzinger, Raghavan, Rajagopalan ’98]

Stream: m elements from some universe of size n

e.g., 3,5,3,7,5,4,8,5,3,7,5,4,8,6,3,2,6,4,7, ...

Goal: Estimate properties of the stream, e.g., median,

number of distinct elements, longest increasing sequence.

The Catch:
i) Limited working memory, e.g., polylog(n,m)
ii) Access data sequentially
iii) Process each element quickly
Origins in ’70s but has become popular in last ten years...

SLIDE 7

Why’s it become popular?

SLIDE 8

Why’s it become popular?

Practical Appeal:

Faster networks, cheaper data storage, ubiquitous data- logging results in massive amount of data to be processed. Applications to: Network monitoring, query planning, I/O efficiency for massive data, sensor networks aggregation...

SLIDE 9

Why’s it become popular?

Practical Appeal:

Faster networks, cheaper data storage, ubiquitous data- logging results in massive amount of data to be processed. Applications to: Network monitoring, query planning, I/O efficiency for massive data, sensor networks aggregation...

Theoretical Appeal:

Easy to state problems but hard to solve. Links to: Communication complexity, compressed sensing, embeddings, pseudo-random generators, approximation...

SLIDE 10

I. What’s Done?
II. Some Tools
III. What’s Next?
I. What’s Done?
II. Some Tools
III. What’s Next?

SLIDE 11

I. What’s Done?
I. What’s Done?

SLIDE 12

I. What’s Done?
I. What’s Done?

Basic Problems Problem Variants Model Variants

SLIDE 13

Families of Problems

SLIDE 14

Families of Problems

Numbers:
Distinct elements, heavy hitters, range queries, quantiles,

frequency moments, matrix problems, ...

SLIDE 15

Families of Problems

Numbers:
Distinct elements, heavy hitters, range queries, quantiles,

frequency moments, matrix problems, ...

SLIDE 16

Families of Problems

Numbers:
Distinct elements, heavy hitters, range queries, quantiles,

frequency moments, matrix problems, ...

Graphs:
Connectivity, MST, bipartiteness, matchings, distances,

graph cuts, independent sets, number of triangles, ...

SLIDE 17

Families of Problems

Numbers:
Distinct elements, heavy hitters, range queries, quantiles,

frequency moments, matrix problems, ...

Graphs:
Connectivity, MST, bipartiteness, matchings, distances,

graph cuts, independent sets, number of triangles, ...

SLIDE 18

Families of Problems

Numbers:
Distinct elements, heavy hitters, range queries, quantiles,

frequency moments, matrix problems, ...

Graphs:
Connectivity, MST, bipartiteness, matchings, distances,

graph cuts, independent sets, number of triangles, ...

Points:
Clustering, diameter, convex hulls, minimum enclosing

balls, MST, facility location, earth mover distance, ...

SLIDE 19

Families of Problems

Numbers:
Distinct elements, heavy hitters, range queries, quantiles,

frequency moments, matrix problems, ...

Graphs:
Connectivity, MST, bipartiteness, matchings, distances,

graph cuts, independent sets, number of triangles, ...

Points:
Clustering, diameter, convex hulls, minimum enclosing

balls, MST, facility location, earth mover distance, ...

SLIDE 20

Families of Problems

Numbers:
Distinct elements, heavy hitters, range queries, quantiles,

frequency moments, matrix problems, ...

Graphs:
Connectivity, MST, bipartiteness, matchings, distances,

graph cuts, independent sets, number of triangles, ...

Points:
Clustering, diameter, convex hulls, minimum enclosing

balls, MST, facility location, earth mover distance, ...

Sequences:
Longest increasing subsequence, pattern matching,

periodicity, time-series histograms, DYCK languages, ...

SLIDE 21

Problem Variants

SLIDE 22

Problem Variants

Sliding

Window:

Suppose you only want to solve the problem using the last

w elements and have O(polylog w) space.

Elements could have time stamps and you want to solve the

problem for elements with stamps in the last hour.

SLIDE 23

Problem Variants

Sliding

Window:

Suppose you only want to solve the problem using the last

w elements and have O(polylog w) space.

Elements could have time stamps and you want to solve the

problem for elements with stamps in the last hour.

Uncertain Data:
ith element of the stream is a distribution μi that defines

random variable Xi. Consider random variable g(X1, ... , Xm).

Problems: What’s the expected value of the max value?

What’s the probability the graph is connected?

SLIDE 24

Model Variants

SLIDE 25

Model Variants

Multiple Passes: Suppose you can take p>1 passes over

the stream. Can we trade-off passes with space?

SLIDE 26

Model Variants

Multiple Passes: Suppose you can take p>1 passes over

the stream. Can we trade-off passes with space?

Example: Most common item in space.

˜ Θ(n/p)

SLIDE 27

Model Variants

Multiple Passes: Suppose you can take p>1 passes over

the stream. Can we trade-off passes with space?

Example: Most common item in space.
Example: Can find elements of length k increasing

subsequence in space. ˜ Θ(k1+

1 2p−1 )

˜ Θ(n/p)

SLIDE 28

Model Variants

Multiple Passes: Suppose you can take p>1 passes over

the stream. Can we trade-off passes with space?

Example: Most common item in space.
Example: Can find elements of length k increasing

subsequence in space.

Random Order: We normally assume that stream is
rdered adversarially. What if it’s ordered randomly?

˜ Θ(k1+

1 2p−1 )

˜ Θ(n/p)

SLIDE 29

Model Variants

Multiple Passes: Suppose you can take p>1 passes over

the stream. Can we trade-off passes with space?

Example: Most common item in space.
Example: Can find elements of length k increasing

subsequence in space.

Random Order: We normally assume that stream is
rdered adversarially. What if it’s ordered randomly?
Example: Can find median of a random-order stream

in O(n1/2) space. If adversarial, it takes Ω(n) space. ˜ Θ(k1+

1 2p−1 )

˜ Θ(n/p)

SLIDE 30

Model Variants

Multiple Passes: Suppose you can take p>1 passes over

the stream. Can we trade-off passes with space?

Example: Most common item in space.
Example: Can find elements of length k increasing

subsequence in space.

Random Order: We normally assume that stream is
rdered adversarially. What if it’s ordered randomly?
Example: Can find median of a random-order stream

in O(n1/2) space. If adversarial, it takes Ω(n) space.

Example: Estimating Fk takes roughly the same space

in random and adversarial settings. ˜ Θ(k1+

1 2p−1 )

˜ Θ(n/p)

SLIDE 31

I. What’s Done?
II. Some Tools
III. What’s Next?
I. What’s Done?
II. Some Tools
III. What’s Next?

SLIDE 32

II. Some Tools
II. Some Tools

SLIDE 33

II. Some Tools
II. Some Tools

Sketching Sampling Lower Bounds

SLIDE 34

First Idea: Sketches

SLIDE 35

First Idea: Sketches

         f1 f2 . . . fn         

SLIDE 36

First Idea: Sketches

Algorithm uses a (random) projection matrix Z such that the

relevant properties of f can be estimated from the sketch Zf.          f1 f2 . . . fn         

SLIDE 37

    Z    

First Idea: Sketches

Algorithm uses a (random) projection matrix Z such that the

relevant properties of f can be estimated from the sketch Zf.          f1 f2 . . . fn          =     t1 t2 tk    

SLIDE 38

    Z    

First Idea: Sketches

Algorithm uses a (random) projection matrix Z such that the

relevant properties of f can be estimated from the sketch Zf.

Easy to Update: On seeing “i”, add ith column of Z to sketch

         f1 f2 . . . fn          =     t1 t2 tk    

SLIDE 39

    Z    

First Idea: Sketches

Algorithm uses a (random) projection matrix Z such that the

relevant properties of f can be estimated from the sketch Zf.

Easy to Update: On seeing “i”, add ith column of Z to sketch
Store Matrix Implicitly: Need to be able to efficiently generate

any entry of Z from a “small” random seed.          f1 f2 . . . fn          =     t1 t2 tk    

SLIDE 40

    Z    

First Idea: Sketches

Algorithm uses a (random) projection matrix Z such that the

relevant properties of f can be estimated from the sketch Zf.

Easy to Update: On seeing “i”, add ith column of Z to sketch
Store Matrix Implicitly: Need to be able to efficiently generate

any entry of Z from a “small” random seed.

Gives Õ(k) space algorithm with seed & precision assumptions.

         f1 f2 . . . fn          =     t1 t2 tk    

SLIDE 41