B669 Sublinear Algorithms for Big Data: An overview (Qin Zhang)

SLIDE 1

B669 Sublinear Algorithms for Big Data

Qin Zhang

SLIDE 2

An overview of problems

SLIDE 3

Statistics

Denote the stream by $A = a_1, a_2, \ldots, a_m$, where $m$ is the length of the stream, which is unknown at the beginning. Let $[n]$ be the item universe, and let $f_i$ be the frequency of item $i$ in the stream. On seeing $a_t = (i, \Delta)$, update $f_i \leftarrow f_i + \Delta$ (special case: $\Delta \in \{1, -1\}$, corresponding to insertions/deletions).
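As a point of reference, the update rule can be sketched by storing the frequency vector explicitly. This is a minimal linear-space baseline (the kind of storage sublinear algorithms try to avoid); the function name `apply_stream` and the sample stream are illustrative assumptions, not from the slides.

```python
from collections import defaultdict

def apply_stream(updates):
    """Maintain the frequency vector f under updates (i, delta)."""
    f = defaultdict(int)
    for item, delta in updates:
        f[item] += delta          # f_i <- f_i + delta
        if f[item] == 0:
            del f[item]           # keep only non-zero coordinates
    return dict(f)

# Hypothetical stream of insertions (+1) and deletions (-1).
stream = [(3, 1), (5, 1), (3, 1), (5, -1), (7, 1)]
freq = apply_stream(stream)
```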

SLIDE 4

Entropy: empirical entropy of the data set: $H(A) = \sum_{i \in [n]} \frac{f_i}{m} \log \frac{m}{f_i}$.

App: Very useful in “change” (e.g., anomalous events) detection.
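The entropy formula can be evaluated exactly when the whole stream fits in memory. The sketch below assumes an insert-only stream and a base-2 logarithm (the slides do not fix the base); `empirical_entropy` is an illustrative name.

```python
import math
from collections import Counter

def empirical_entropy(items):
    """Exact H(A) = sum_i (f_i/m) * log2(m/f_i) for an insert-only stream."""
    counts = Counter(items)
    m = len(items)
    return sum((f / m) * math.log2(m / f) for f in counts.values())
```

A stream with two equally frequent items has entropy 1 bit, while a stream consisting of a single repeated item has entropy 0.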

SLIDE 5

Frequency moments: $F_p = \sum_i f_i^p$

  • F0: number of distinct items.
  • F1: total number of items.
  • F2: size of self-join.

General $F_p$ ($p > 1$): a good measure of the skewness of the data.
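For intuition, the moments can be computed exactly from explicit counts. This is a small sketch assuming an insert-only stream held in memory; `frequency_moment` is an illustrative name.

```python
from collections import Counter

def frequency_moment(items, p):
    """Exact F_p = sum_i f_i^p; F_0 counts the distinct items."""
    counts = Counter(items)
    if p == 0:
        return len(counts)
    return sum(f ** p for f in counts.values())

items = [1, 1, 2, 3, 3, 3]   # frequencies: f_1 = 2, f_2 = 1, f_3 = 3
```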

SLIDE 6

Statistics (cont.)

Heavy-hitter: a set of items whose frequency ≥ a threshold.

App: popular IP destinations, . . .

[Figure: items 1 to 8 of a stream with |A| = m; those with frequency at least the threshold 0.01m are included.]
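One standard small-space way to find heavy hitters in an insert-only stream is the Misra-Gries summary. The slides do not name a specific algorithm, so this is a swapped-in illustration: with at most k − 1 counters, every item whose frequency exceeds m/k is guaranteed to remain in the summary (some lighter items may remain too).

```python
def misra_gries(stream, k):
    """Misra-Gries summary with at most k - 1 counters (insertions only)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

The stored counts underestimate the true frequencies, so a second pass (or an accepted error) is needed to confirm which candidates really exceed the threshold.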

SLIDE 7

Quantile: The φ-quantile of A is some x such that at most φm items of A are smaller than x and at most (1 − φ)m items of A are greater than x.

All-quantile: a data structure from which the φ-quantile for any 0 ≤ φ ≤ 1 can be extracted.

App: distribution of package sizes . . .
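The definition can be checked against an exact, sort-based computation. This sketch assumes the whole data set is available and picks the element at rank ⌊φm⌋ in sorted order, which satisfies both "at most" conditions; `phi_quantile` is an illustrative name.

```python
def phi_quantile(items, phi):
    """Return an exact phi-quantile by sorting (linear space baseline)."""
    s = sorted(items)
    m = len(s)
    idx = min(int(phi * m), m - 1)   # at most phi*m items lie before s[idx]
    return s[idx]
```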

SLIDE 8

Statistics (cont.)

$L_p$ sampling: Let $x \in \mathbb{R}^n$ be a non-zero vector. For $p > 0$ we call the $L_p$ distribution corresponding to $x$ the distribution on $[n]$ that takes $i$ with probability $\frac{|x_i|^p}{\|x\|_p^p}$, with $\|x\|_p = \left(\sum_{i \in [n]} |x_i|^p\right)^{1/p}$. In particular, for $p = 0$, $L_0$ sampling selects an element uniformly at random from the non-zero coordinates of $x$.

App: an extremely useful tool for constructing graph sketches, finding duplicates, etc.
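To make the target distribution concrete, the sketch below computes it from an explicitly stored vector and draws one sample. It is not a streaming sampler; the names `lp_distribution` and `lp_sample` are illustrative, and indices are zero-based here rather than in $[n]$.

```python
import random

def lp_distribution(x, p):
    """Exact L_p distribution of a non-zero vector x (indices are 0-based)."""
    if p == 0:
        support = [i for i, v in enumerate(x) if v != 0]
        return {i: 1 / len(support) for i in support}
    weights = [abs(v) ** p for v in x]
    total = sum(weights)             # equals ||x||_p^p
    return {i: w / total for i, w in enumerate(weights) if w > 0}

def lp_sample(x, p, rng=random):
    """Draw one index according to the L_p distribution of x."""
    dist = lp_distribution(x, p)
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices])[0]
```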

SLIDE 9

Graphs

Denote the stream by $A = a_1, a_2, \ldots, a_m$, where $a_i = ((u_i, v_i), \text{insert/delete})$ and $(u_i, v_i)$ is an edge.

SLIDE 10

Connectivity: Test if a graph is connected.

Matching: Estimate the size of the maximum matching of a graph.

Diameter: Compute the diameter of a graph (that is, the maximum distance between two nodes).
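For the connectivity problem, a minimal sketch under the assumption that the stream contains insertions only: a union-find structure uses space proportional to the number of vertices rather than the number of edges. Deletions are not handled here, and the function name `is_connected` is illustrative.

```python
def is_connected(n, edges):
    """One pass over an insert-only edge stream on vertices 0..n-1."""
    parent = list(range(n))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving
            u = parent[u]
        return u

    components = n
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv                 # merge the two components
            components -= 1
    return components == 1
```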

SLIDE 11

Triangle counting: Compute the number of triangles in a graph.

App: Useful for finding communities in a social network (for example, via the fraction of pairs of v's neighbors that are themselves neighbors).
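An exact count, for comparison with any streaming estimate, can be sketched as follows. It assumes a simple undirected graph (no self-loops) stored fully in memory; `count_triangles` is an illustrative name.

```python
def count_triangles(edges):
    """Exact triangle count: sum common neighbours over edges, divide by 3."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = 0
    for u, v in {tuple(sorted(e)) for e in edges}:   # each edge once
        total += len(adj[u] & adj[v])
    return total // 3   # every triangle is seen once from each of its 3 edges
```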

SLIDE 12

Graphs (cont.)

Spanner: Given a graph $G = (V, E)$, we say that a subgraph $H = (V, E')$ is an $\alpha$-spanner for $G$ if $\forall u, v \in V$: $d_G(u, v) \le d_H(u, v) \le \alpha \cdot d_G(u, v)$.

That is, the subgraph (approximately) maintains pairwise distances.
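The spanner condition can be verified directly on small unweighted graphs. This checker is a sketch that assumes `h_edges` is a subset of `g_edges` (so the lower bound $d_G \le d_H$ holds automatically) and that vertices are numbered 0..n-1; the function names are illustrative.

```python
from collections import deque

def bfs_dist(n, edges, src):
    """Hop distances from src in an unweighted undirected graph."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = [float("inf")] * n
    dist[src] = 0
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if dist[w] == float("inf"):
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def is_spanner(n, g_edges, h_edges, alpha):
    """Check d_H(u, v) <= alpha * d_G(u, v) for all pairs u, v."""
    for s in range(n):
        dg, dh = bfs_dist(n, g_edges, s), bfs_dist(n, h_edges, s)
        if any(dh[t] > alpha * dg[t] for t in range(n)):
            return False
    return True
```

For example, dropping one edge from a 4-cycle stretches the distance between its endpoints from 1 to 3, so the remaining path is a 3-spanner but not a 2-spanner.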

SLIDE 13

Graph sparsification: Given a graph $G = (V, E)$, denote the minimum cut of $G$ by $\lambda(G)$, and by $\lambda_A(G)$ the capacity of the cut $(A, V \setminus A)$. We say that a weighted subgraph $H = (V, E', w)$ is an $\epsilon$-sparsification for $G$ if $\forall A \subset V$, $(1 - \epsilon)\lambda_A(G) \le \lambda_A(H) \le (1 + \epsilon)\lambda_A(G)$.

App: Synopses for massive graphs. A graph synopsis is a subgraph of much smaller size that keeps properties of the original graph.

SLIDE 14

Geometry

Denote the stream by $A = a_1, a_2, \ldots, a_m$, where $a_i = (\text{location}, \text{ins/del})$.

SLIDE 15

Earth-mover distance: Given two multisets $A$, $B$ of the same size in the grid $[\Delta]^2$, the earth-mover distance is defined as the minimum cost of a perfect matching between points in $A$ and $B$:

$\mathrm{EMD}(A, B) = \min_{\pi : A \to B \text{ a bijection}} \sum_{a \in A} \|a - \pi(a)\|$.

App: a good measurement of the similarity of two images
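For very small inputs the definition can be evaluated by brute force over all bijections. The sketch assumes the Euclidean norm (the slides leave the norm unspecified) and runs in factorial time, so it is only a sanity check; `emd_bruteforce` is an illustrative name.

```python
import math
from itertools import permutations

def emd_bruteforce(A, B):
    """Minimum total distance over all bijections between A and B."""
    assert len(A) == len(B)
    return min(
        sum(math.dist(a, b) for a, b in zip(A, perm))
        for perm in permutations(B)
    )
```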

SLIDE 16

Clustering: (k-Center) Cluster a set of points $X = (x_1, x_2, \ldots, x_m)$ into clusters $c_1, c_2, \ldots, c_k$ with representatives $r_1 \in c_1, r_2 \in c_2, \ldots, r_k \in c_k$ to minimize $\max_i \min_j d(x_i, r_j)$.

App: (see wiki page)
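The objective can be evaluated directly, and a classical offline heuristic, farthest-first traversal (Gonzalez's greedy, a known 2-approximation), gives a feel for the problem. The slides do not prescribe an algorithm, so this is a swapped-in illustration that assumes Euclidean distance and all points in memory.

```python
import math

def kcenter_cost(points, reps):
    """max over points of the distance to the nearest representative."""
    return max(min(math.dist(x, r) for r in reps) for x in points)

def farthest_first(points, k):
    """Greedy: repeatedly add the point farthest from the current reps."""
    reps = [points[0]]
    while len(reps) < k:
        reps.append(
            max(points, key=lambda x: min(math.dist(x, r) for r in reps))
        )
    return reps
```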

SLIDE 17

Strings

Denote the stream by $A = a_1, a_2, \ldots, a_m$, where $a_i = (i, \text{ins/del})$.

SLIDE 18

Distance to sortedness:

LIS(A) = length of the longest increasing subsequence of sequence A.

DistSort(A) = minimum number of elements that must be deleted from A to get a sorted sequence = |A| − LIS(A).

App: a good measurement of network latency.
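Both quantities can be computed offline in O(m log m) time with the standard patience-sorting technique. The sketch assumes "increasing" means strictly increasing; the function names are illustrative.

```python
from bisect import bisect_left

def lis_length(seq):
    """Length of the longest strictly increasing subsequence."""
    tails = []   # tails[k] = smallest tail of an increasing subsequence of length k+1
    for x in seq:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)

def dist_sort(seq):
    """DistSort(A) = |A| - LIS(A)."""
    return len(seq) - lis_length(seq)
```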

SLIDE 19

Edit distance: Given two strings A and B, the minimum number of insertions/deletions/substitutions needed to convert A to B.

App: a standard measurement of the similarity of two strings/documents
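The textbook dynamic program computes this distance exactly in O(|A|·|B|) time. The version below is a sketch that keeps only two rows of the table; `edit_distance` is an illustrative name.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))        # distance from "" to b[:j]
    for i, ca in enumerate(a, 1):
        cur = [i]                         # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,              # delete ca
                cur[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb)  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]
```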

SLIDE 20

Numerical linear algebra

Denote the stream by $A = a_1, a_2, \ldots, a_n$, where $a_k = (i, j, \Delta)$ denotes the update $M[i, j] \leftarrow M[i, j] + \Delta$, and $M[i, j]$ is the cell in the $i$-th row and $j$-th column of matrix $M$.

SLIDE 21

Regression: Given an $n \times d$ matrix $M$ and an $n \times 1$ vector $b$, one seeks $x^* = \arg\min_x \|Mx - b\|_p$, for a $p \in [1, \infty)$.
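As a worked special case, for $p = 2$ and under the added assumption that $M$ has full column rank, the minimizer has the familiar closed form given by the normal equations:

```latex
x^* = \arg\min_x \|Mx - b\|_2 = (M^T M)^{-1} M^T b
```

For other values of $p$ there is in general no closed form, and the problem is solved by optimization.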

SLIDE 22

Low-rank approximation: Given an $n \times m$ matrix $M$, find an $n \times k$ matrix $L$ and an $m \times k$ matrix $W$, both with orthonormal columns, and a diagonal $k \times k$ ($k < \min\{n, m\}$) matrix $D$ with $\|M - L D W^T\|_F$ minimized, where $\|\cdot\|_F$ is the Frobenius norm.

App: Fundamental problem in many areas, including machine learning, recommendation systems, natural language processing, etc.
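When the full matrix is available, the exact optimum is given by the truncated singular value decomposition (the Eckart-Young theorem). With singular values $\sigma_1 \ge \sigma_2 \ge \cdots$ and singular vectors $u_i$, $v_i$, which are notation introduced here for illustration:

```latex
M = U \Sigma V^T, \qquad
M_k = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T, \qquad
\|M - M_k\|_F = \Big( \sum_{i > k} \sigma_i^2 \Big)^{1/2}
```

So one optimal choice is $L = U_k$, $D = \Sigma_k$, $W = V_k$; a streaming algorithm would aim to approximate this error with far less space.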

SLIDE 23

Sliding windows

Sometimes we are only interested in recent items in the stream.

Time-based sliding window: the w most recent time steps.

Or, sequence-based sliding window: the w most recent items.

[Figure: in both cases a CPU with RAM processes the stream, keeping only the current window.]
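The two window types can be sketched with a double-ended queue. This baseline stores the entire window, so it uses space proportional to the window contents; it assumes timestamps arrive in non-decreasing order, and the class names are illustrative.

```python
from collections import deque

class SequenceWindow:
    """Keeps the w most recent items."""
    def __init__(self, w):
        self.items = deque(maxlen=w)   # old items fall off automatically

    def add(self, x):
        self.items.append(x)

class TimeWindow:
    """Keeps items whose timestamp lies within the last w time steps."""
    def __init__(self, w):
        self.w = w
        self.items = deque()           # (timestamp, item) pairs

    def add(self, t, x):
        self.items.append((t, x))
        while self.items and self.items[0][0] <= t - self.w:
            self.items.popleft()       # expire items older than t - w
```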

SLIDE 24

Lower bounds

What is impossible? Or, what is the limit on the space needed to solve a problem?

Usually shown by reductions from communication complexity (not covered in this course).