B669 Sublinear Algorithms for Big Data, Qin Zhang



SLIDE 1

B669 Sublinear Algorithms for Big Data

Qin Zhang

SLIDE 2

Part 1: Sublinear in Space

SLIDE 3

The model and challenge

The data stream model (Alon, Matias, and Szegedy 1996): items a1, a2, . . . , an stream past a CPU that has only a small RAM.

Why hard? Cannot store everything. Applications: Internet router, stock data, ad auction, flight logs on tape, etc.

(Next 4 slides courtesy of Jeff Phillips.)


SLIDE 4

Internet Router

  • Data per day: at least 1 Terabyte
  • A packet takes 8 nanoseconds to pass through the router
  • A few million packets per second

What statistics can we keep on the data? For example, we want to detect anomalies for security.

(Diagram: packets stream through a network router with limited space.)

SLIDE 5

Cell phones connect through switches:

  • Each message: 1000 Bytes
  • 500 million calls / day
  • 1 Terabyte per month

Search for characteristics for dropped calls?

Telephone Switch

(Diagram: txt and msg traffic flows through a switch with limited space.)

SLIDE 6

Serving ads on the web: Google, Yahoo!, Microsoft.

  • Yahoo.com viewed 77 trillion times
  • 2 million / hour
  • Each page serves ads; which ones?

How to update ad delivery model?

Ad Auction

(Diagram: page views, ad clicks, and keyword searches reach a server that updates the ad delivery model in limited space.)

SLIDE 7

All airplane logs over Washington, DC

  • About 500 - 1000 flights per day
  • 50 years: about 9 million flights in total
  • Each flight has trajectory, passenger count, control dialogue

Stored on tape. Can only make 1 (or O(1)) pass! What statistics can be gathered?

Flight Logs on Tape

(Diagram: the tape streams past a CPU that maintains statistics.)


SLIDE 10

(Last lecture) Maintaining a sample over sliding windows

Task: Maintain a uniform sample of the last w items.

Algorithm:
  – For each item xi, pick an independent random value vi ∈ (0, 1).
  – In the window ⟨x_{j−w+1}, . . . , x_j⟩, return the item xi with the smallest vi.
  – To do this, maintain the set of items xi in the window whose vi is minimal among all subsequent values.

Space (expected): 1/w + 1/(w − 1) + . . . + 1/1 ≈ ln w = O(log w).
Correctness: each of the w items in the window is equally likely to carry the smallest vi.

SLIDE 11

§1.0 An overview of problems


SLIDE 14

Statistics

Denote the stream by A = a1, a2, . . . , am, where m is the length of the stream, unknown at the beginning. Let [n] be the item universe, and let fi be the frequency of item i in the stream. On seeing ai = (i, ∆), update fi ← fi + ∆ (special case: ∆ ∈ {−1, +1}, corresponding to insertions/deletions).

Entropy: the empirical entropy of the data set is H(A) = Σ_{i∈[n]} (fi/m) · log(m/fi).

App: very useful in “change” (e.g., anomalous event) detection.

Frequency moments: Fp = Σ_i fi^p.

  • F0: number of distinct items.
  • F1: total number of items.
  • F2: size of self-join.

General Fp (p > 1) is a good measure of the skewness of the data.
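These definitions are easy to state as exact linear-space code; the streaming challenge is achieving them in sublinear space. A reference sketch (names are mine), using the base-2 log for the entropy:

```python
import math
from collections import Counter

def stream_stats(updates):
    """Exact (linear-space) reference for the quantities defined above.

    `updates` is a sequence of (i, delta) pairs; returns the empirical
    entropy H(A) and the frequency moments F0, F1, F2.
    """
    f = Counter()
    for i, delta in updates:
        f[i] += delta
    f = +f                      # drop items whose net frequency is zero
    m = sum(f.values())         # F1: total number of items
    H = sum(fi / m * math.log2(m / fi) for fi in f.values())
    F0 = len(f)                 # number of distinct items
    F2 = sum(fi * fi for fi in f.values())  # size of self-join
    return H, F0, m, F2
```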


SLIDE 16

Statistics (cont.)

Heavy hitters: the set of items whose frequency is at least a given threshold.

App: popular IP destinations, . . .

(Figure: frequency histogram of items 1-8; items with frequency ≥ 0.01m are included, where m = |A|.)
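The slides only define the problem; as an illustration, here is the classic Misra-Gries summary (not from the slides), a deterministic small-space sketch: with k − 1 counters, every item whose true frequency exceeds m/k survives, and each estimate undercounts by less than m/k.

```python
def misra_gries(stream, k):
    """Misra-Gries heavy-hitter summary with at most k-1 counters."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # Decrement all counters; drop the ones that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```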

Quantile:

The φ-quantile of A is some x such that at most φm items of A are smaller than x and at most (1 − φ)m items of A are greater than x. All-quantiles: a data structure from which the φ-quantile for any 0 ≤ φ ≤ 1 can be extracted.

App: distribution of packet sizes, . . .
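A brute-force check of the definition, plus an exact (linear-space) all-quantiles structure, a sorted copy indexed by rank; streaming algorithms instead maintain a small summary with an additive εm rank error. The helper names are mine.

```python
def is_phi_quantile(A, x, phi):
    """Check the definition above: x is a phi-quantile of A iff at most
    phi*m items of A are smaller than x and at most (1-phi)*m are greater."""
    m = len(A)
    return (sum(1 for a in A if a < x) <= phi * m
            and sum(1 for a in A if a > x) <= (1 - phi) * m)

def all_quantiles(A):
    """Exact all-quantiles structure: a sorted copy of A; the phi-quantile
    is read off at rank floor(phi * m), clamped to the last index."""
    S = sorted(A)
    return lambda phi: S[min(int(phi * len(S)), len(S) - 1)]
```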

SLIDE 17

Statistics (cont.)

Lp sampling: Let x ∈ R^n be a non-zero vector. For p > 0, the Lp distribution corresponding to x is the distribution on [n] that takes i with probability |xi|^p / ‖x‖p^p, where ‖x‖p = (Σ_{i∈[n]} |xi|^p)^{1/p}. In particular, for p = 0, L0 sampling selects an element uniformly at random from the non-zero coordinates of x.

App: an extremely useful tool for constructing graph sketches, finding duplicates, etc.
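With full access to x the Lp distribution is trivial to sample from; the streaming problem is doing the same from a small sketch of x. An exact reference sketch (the name `lp_sample` is mine):

```python
import random

def lp_sample(x, p):
    """Draw an index from the Lp distribution of the vector x, exactly.

    For p > 0, index i is returned with probability |x_i|^p / ||x||_p^p.
    For p == 0, a uniformly random non-zero coordinate is returned.
    """
    if p == 0:
        support = [i for i, xi in enumerate(x) if xi != 0]
        return random.choice(support)
    weights = [abs(xi) ** p for xi in x]
    return random.choices(range(len(x)), weights=weights)[0]
```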


SLIDE 20

Graphs

Denote the stream by A = a1, a2, . . . , am, where ai = ((ui, vi), insert/delete), where (ui, vi) is an edge.

Connectivity: Test whether the graph is connected.
Matching: Estimate the size of the maximum matching of the graph.
Diameter: Compute the diameter of the graph (the maximum distance between two nodes).
Triangle counting: Compute the number of triangles in the graph.

App: useful for finding communities in a social network (e.g., the fraction of v’s neighbors that are themselves neighbors).
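For contrast with the dynamic (insert/delete) setting, the insert-only case of connectivity has a simple one-pass solution; a sketch using union-find, which stores O(n) words, sublinear in the stream length m (handling deletions is what requires the L0-sampling-based graph sketches):

```python
def connected(n, edges):
    """One-pass connectivity test for an insert-only edge stream
    over vertices 0..n-1, via union-find with path halving."""
    parent = list(range(n))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    components = n
    for u, v in edges:                     # process each edge once
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1
    return components == 1
```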


SLIDE 22

Graphs (cont.)

Spanner: Given a graph G = (V, E), we say that a subgraph H = (V, E′) is an α-spanner for G if for all u, v ∈ V,
  dG(u, v) ≤ dH(u, v) ≤ α · dG(u, v).
A spanner (approximately) maintains pairwise distances.

Graph sparsification: Given a graph G = (V, E), denote the minimum cut of G by λ(G), and by λA(G) the capacity of the cut (A, V \ A). We say that a weighted subgraph H = (V, E′, w) is an ε-sparsification for G if for all A ⊂ V,
  (1 − ε) · λA(G) ≤ λA(H) ≤ (1 + ε) · λA(G).

App: synopses for massive graphs. A graph synopsis is a subgraph of much smaller size that keeps properties of the original graph.

SLIDE 25

Geometry

Denote the stream by A = a1, a2, . . . , am, where ai = (location, ins/del).

Clustering (k-center): Cluster a set of points X = (x1, x2, . . . , xm) into clusters c1, c2, . . . , ck with representatives r1 ∈ c1, r2 ∈ c2, . . . , rk ∈ ck so as to minimize
  max_i min_j d(xi, rj).

App: (see wiki page)
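Not from the slides, but the standard offline baseline for the k-center objective above is Gonzalez's farthest-first traversal, a classic 2-approximation; streaming k-center algorithms maintain this kind of solution under arrivals.

```python
def greedy_k_center(points, k, d):
    """Farthest-first traversal (Gonzalez): 2-approximation of
    the k-center objective max_i min_j d(x_i, r_j)."""
    reps = [points[0]]                       # arbitrary first center
    dist = [d(p, reps[0]) for p in points]   # distance to nearest center
    while len(reps) < k:
        far = max(range(len(points)), key=lambda i: dist[i])
        reps.append(points[far])             # open the farthest point
        dist = [min(dist[i], d(points[i], points[far]))
                for i in range(len(points))]
    return reps, max(dist)                   # centers and objective value
```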

Earth-mover distance: Given two multisets A, B in the grid [∆]^2 of the same size, the earth-mover distance is defined as the minimum cost of a perfect matching between points in A and B:
  EMD(A, B) = min_{π: A → B a bijection} Σ_{a∈A} ‖a − π(a)‖.

App: a good measure of the similarity of two images.
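The EMD definition can be evaluated brute-force by trying every bijection, exponential time, so this is only a reference for tiny point sets (the streaming problem is approximating it in small space):

```python
from itertools import permutations

def emd(A, B):
    """Brute-force earth-mover distance: minimum Euclidean matching
    cost over all bijections pi: A -> B, per the definition above."""
    assert len(A) == len(B)
    def cost(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    return min(sum(cost(a, b) for a, b in zip(A, perm))
               for perm in permutations(B))
```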


SLIDE 28

Strings

Denote the stream by A = a1, a2, . . . , am, where ai = (i, ins/del).

Distance to sortedness:
LIS(A) = length of the longest increasing subsequence of the sequence A.
DistSort(A) = minimum number of elements that must be deleted from A to obtain a sorted sequence = |A| − LIS(A).

App: a good measurement of network latency.
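Offline, DistSort reduces to the classic O(n log n) patience-sorting computation of the LIS (a reference sketch; the streaming versions approximate this in small space):

```python
from bisect import bisect_left

def dist_sort(A):
    """DistSort(A) = |A| - LIS(A), via patience sorting."""
    tails = []   # tails[k] = smallest tail of an increasing subsequence of length k+1
    for x in A:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)   # x extends the longest subsequence
        else:
            tails[k] = x      # x gives a smaller tail for length k+1
    return len(A) - len(tails)  # deletions needed to sort A
```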

Edit distance: Given two strings A and B, the minimum number of insertions/deletions/substitutions needed to convert A into B.

App: a standard measure of the similarity of two strings/documents.
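The definition above is computed offline by the classic O(|A|·|B|) dynamic program (a reference sketch, kept to two rolling rows):

```python
def edit_distance(A, B):
    """Minimum insertions/deletions/substitutions turning A into B."""
    prev = list(range(len(B) + 1))          # distance from A[:0] to B[:j]
    for i, a in enumerate(A, 1):
        cur = [i]                           # distance from A[:i] to B[:0]
        for j, b in enumerate(B, 1):
            cur.append(min(prev[j] + 1,             # delete a
                           cur[j - 1] + 1,          # insert b
                           prev[j - 1] + (a != b))) # substitute / match
        prev = cur
    return prev[len(B)]
```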


SLIDE 31

Numerical linear algebra

Denote the stream by A = a1, a2, . . . , an, where ak = (i, j, ∆) denotes the update M[i, j] ← M[i, j] + ∆, where M[i, j] is the cell in the i-th row, j-th column of matrix M.

Regression: Given an n × d matrix M and an n × 1 vector b, one seeks x∗ = argmin_x ‖Mx − b‖p, for some p ∈ [1, ∞).

Low-rank approximation: Given an n × m matrix M, find an orthonormal n × k matrix L, an orthonormal m × k matrix W, and a diagonal k × k matrix D (k < min{n, m}) such that ‖M − L·D·W^T‖F is minimized, where ‖·‖F is the Frobenius norm.

App: fundamental problems in many areas, including machine learning, recommendation systems, natural language processing, etc.
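For p = 2 the regression problem above has a closed form via the normal equations M^T·M·x = M^T·b. A pure-Python reference sketch (assumes M has full column rank; the function name is mine):

```python
def least_squares(M, b):
    """x* = argmin_x ||Mx - b||_2 via the normal equations,
    solved by Gaussian elimination with partial pivoting."""
    n, d = len(M), len(M[0])
    # Build the d x d system A x = c with A = M^T M, c = M^T b.
    A = [[sum(M[k][i] * M[k][j] for k in range(n)) for j in range(d)]
         for i in range(d)]
    c = [sum(M[k][i] * b[k] for k in range(n)) for i in range(d)]
    # Forward elimination.
    for i in range(d):
        p = max(range(i, d), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        c[i], c[p] = c[p], c[i]
        for r in range(i + 1, d):
            f = A[r][i] / A[i][i]
            for j in range(i, d):
                A[r][j] -= f * A[i][j]
            c[r] -= f * c[i]
    # Back substitution.
    x = [0.0] * d
    for i in reversed(range(d)):
        x[i] = (c[i] - sum(A[i][j] * x[j] for j in range(i + 1, d))) / A[i][i]
    return x
```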

SLIDE 32

Sliding windows

Sometimes we are only interested in recent items in the stream.

Time-based sliding window: the w most recent time steps.
Sequence-based sliding window: the w most recent items.

(Diagram: a CPU with RAM sees only the current window of the stream.)

SLIDE 33

Lower bounds

What is impossible? That is, what is the limit on the space needed to solve a problem?

Usually shown by reductions from communication complexity. (Left for a future lecture.)

SLIDE 34

Thank you!