B669 Sublinear Algorithms for Big Data
Qin Zhang
An overview of problems
Statistics
Denote the stream by $A = a_1, a_2, \ldots, a_m$, where $m$ is the length of the stream, which is unknown at the beginning. Let $n$ be the size of the item universe, so items come from $[n]$. Let $f_i$ be the frequency of item $i$ in the stream. On seeing $a_k = (i, \Delta)$, update $f_i \leftarrow f_i + \Delta$ (special case: $\Delta \in \{1, -1\}$, corresponding to insertion/deletion).
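The update rule above can be sketched in a few lines of Python. This is only an exact reference baseline that stores every non-zero frequency (so it uses linear, not sublinear, space); the function name `apply_updates` and the example stream are made up for illustration.

```python
from collections import defaultdict

def apply_updates(stream):
    """Maintain the frequency vector f under updates (item, delta)."""
    f = defaultdict(int)
    for item, delta in stream:
        f[item] += delta
        if f[item] == 0:
            del f[item]  # keep only items with non-zero frequency
    return dict(f)

# Insertions are (i, +1), deletions are (i, -1).
stream = [(3, 1), (5, 1), (3, 1), (5, -1), (7, 1)]
freq = apply_updates(stream)
```

Here item 5 is inserted and then deleted, so it no longer appears in `freq`.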
Entropy: the empirical entropy of the data set, $H(A) = \sum_{i \in [n]} \frac{f_i}{m} \log \frac{m}{f_i}$.
App: very useful in “change” (e.g., anomalous event) detection.
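As a sanity check on the entropy formula, here is a minimal exact computation for an insertion-only stream. It assumes base-2 logarithms (the slide does not fix the base) and keeps all counts in memory; `empirical_entropy` is an illustrative name.

```python
import math
from collections import Counter

def empirical_entropy(items):
    """H(A) = sum_i (f_i / m) * log2(m / f_i) over items with f_i > 0."""
    counts = Counter(items)
    m = len(items)
    return sum((f / m) * math.log2(m / f) for f in counts.values())

h_uniform = empirical_entropy([1, 2, 3, 4])   # four equally frequent items
h_constant = empirical_entropy([7, 7, 7, 7])  # a single repeated item
```

A uniform stream over four items gives 2 bits, while a constant stream gives 0, which is why a sudden shift in entropy can signal a change in the traffic mix.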
Frequency moments: $F_p = \sum_i f_i^p$.
- $F_0$: number of distinct items.
- $F_1$: total number of items.
- $F_2$: size of the self-join.
General $F_p$ ($p > 1$): a good measure of the skewness of the data.
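The three special cases can be checked with a small exact computation. This sketch assumes an insertion-only stream held in memory and uses the convention $0^0 = 0$ for $F_0$; the name `frequency_moment` is hypothetical.

```python
from collections import Counter

def frequency_moment(items, p):
    """Exact F_p = sum_i f_i^p over the items that occur in the stream."""
    counts = Counter(items)
    if p == 0:
        return len(counts)  # number of distinct items
    return sum(f ** p for f in counts.values())

items = ["a", "b", "a", "c", "a", "b"]  # frequencies: a=3, b=2, c=1
f0 = frequency_moment(items, 0)
f1 = frequency_moment(items, 1)
f2 = frequency_moment(items, 2)
```

With frequencies 3, 2, 1 this gives $F_0 = 3$, $F_1 = 6$ and $F_2 = 9 + 4 + 1 = 14$.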
Statistics (cont.)
Heavy hitters: the set of items whose frequency is at least a given threshold.
App: popular IP destinations, . . .
[Figure: frequencies of items 1 to 8 in a stream with |A| = m; items reaching the threshold 0.01m are marked as included.]
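One well-known small-space approach to heavy hitters on insertion-only streams is the Misra-Gries summary. The slide does not name an algorithm, so the sketch below is a swapped-in illustration rather than the method of the course; the parameter `k` and the example stream are assumptions.

```python
def misra_gries(stream, k):
    """Return candidate heavy hitters using at most k - 1 counters.

    Every item with frequency > m / k is guaranteed to remain as a key.
    Stored counts are underestimates, so exact frequencies need a second pass.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["x"] * 6 + ["y"] * 3 + ["a", "b", "c"]  # m = 12
candidates = misra_gries(stream, k=3)             # threshold m / k = 4
```

Only `"x"` occurs more than 4 times, and it is the only key left in `candidates`; its stored count (3) underestimates the true frequency (6).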
Quantile: the φ-quantile of A is some x such that at most φm items of A are smaller than x and at most (1 − φ)m items of A are greater than x.
All-quantile: a data structure from which the φ-quantile for any 0 ≤ φ ≤ 1 can be extracted.
App: distribution of package sizes . . .
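The definition can be made concrete with an exact, sort-based computation. This is a baseline that keeps the whole data set in memory, not a streaming summary; the index choice `floor(phi * m)` is one valid option among several, and `quantile` is an illustrative name.

```python
def quantile(items, phi):
    """Return an exact phi-quantile of items (0 <= phi <= 1).

    With k = min(m - 1, floor(phi * m)), at most phi * m items are smaller
    than s[k] and at most (1 - phi) * m items are greater than s[k].
    """
    s = sorted(items)
    k = min(len(s) - 1, int(phi * len(s)))
    return s[k]

data = [15, 3, 9, 27, 1, 12, 21, 6]
median = quantile(data, 0.5)
smallest = quantile(data, 0.0)
largest = quantile(data, 1.0)
```

For the median of the 8 values above, 4 items are smaller than 12 and 3 are greater, so both bounds in the definition hold.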
Statistics (cont.)
$L_p$ sampling: Let $x \in \mathbb{R}^n$ be a non-zero vector. For $p > 0$, the $L_p$ distribution corresponding to $x$ is the distribution on $[n]$ that takes $i$ with probability $\frac{|x_i|^p}{\|x\|_p^p}$, where $\|x\|_p = \left(\sum_{i \in [n]} |x_i|^p\right)^{1/p}$. For $p = 0$, $L_0$ sampling selects an element uniformly at random from the non-zero coordinates of $x$.
App: an extremely useful tool for constructing graph sketches, finding duplicates, etc.
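To make the target distribution concrete, the sketch below computes it explicitly from a fully stored vector and draws one index from it. A streaming $L_p$ sampler would have to approximate this in small space; here `lp_distribution` and `lp_sample` are hypothetical helper names, and indices are 0-based rather than in $[n]$.

```python
import random

def lp_distribution(x, p):
    """Probability of each index under the L_p distribution of x."""
    if p == 0:
        weights = [1.0 if xi != 0 else 0.0 for xi in x]  # uniform on the support
    else:
        weights = [abs(xi) ** p for xi in x]
    total = sum(weights)  # equals ||x||_p^p for p > 0
    return [w / total for w in weights]

def lp_sample(x, p, rng=random):
    """Draw one index according to the L_p distribution of x."""
    return rng.choices(range(len(x)), weights=lp_distribution(x, p), k=1)[0]

x = [3, 0, -4]
d1 = lp_distribution(x, 1)  # proportional to |x_i|
d2 = lp_distribution(x, 2)  # proportional to x_i^2
d0 = lp_distribution(x, 0)  # uniform over non-zero coordinates
```

For `x = [3, 0, -4]` the probabilities are (3/7, 0, 4/7) for $p = 1$, (9/25, 0, 16/25) for $p = 2$, and (1/2, 0, 1/2) for $p = 0$.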
Graphs
Denote the stream by $A = a_1, a_2, \ldots, a_m$, where $a_i = ((u_i, v_i), \text{insert/delete})$ and $(u_i, v_i)$ is an edge.
Connectivity: test whether a graph is connected.
Matching: estimate the size of the maximum matching of a graph.
Diameter: compute the diameter of a graph (that is, the maximum distance between two nodes).
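For the connectivity problem, a simple one-pass baseline on insertion-only edge streams is a union-find structure, which stores O(n) words regardless of the number of edges. This sketch does not handle deletions, and it assumes vertices are labeled 0 to n − 1; the function name is illustrative.

```python
def is_connected(n, edges):
    """One pass over an insertion-only edge stream with union-find."""
    parent = list(range(n))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    components = n
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv  # merge the two components
            components -= 1
    return components == 1

two_parts = is_connected(4, [(0, 1), (2, 3)])
one_part = is_connected(4, [(0, 1), (2, 3), (1, 2)])
```

The first stream leaves two components, and adding the edge (1, 2) joins them.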
Triangle counting: compute the number of triangles in a graph.
App: useful for finding communities in a social network, e.g., via the fraction of pairs of v’s neighbors that are themselves neighbors.
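Both quantities mentioned above can be computed exactly when the graph fits in memory. The sketch below is such a baseline, assuming a simple undirected graph without self-loops; `count_triangles` and `neighbor_fraction` are made-up names.

```python
from itertools import combinations

def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def count_triangles(edges):
    """Exact number of triangles in a simple undirected graph."""
    adj = build_adjacency(edges)
    unique_edges = {frozenset(e) for e in edges}
    # Each triangle is found once from each of its three edges.
    return sum(len(adj[u] & adj[v]) for u, v in map(tuple, unique_edges)) // 3

def neighbor_fraction(edges, v):
    """Fraction of pairs of v's neighbors that are themselves adjacent."""
    adj = build_adjacency(edges)
    pairs = list(combinations(adj[v], 2))
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if b in adj[a]) / len(pairs)

edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
triangles = count_triangles(edges)
frac_at_3 = neighbor_fraction(edges, 3)
frac_at_1 = neighbor_fraction(edges, 1)
```

The example graph has the single triangle {1, 2, 3}; vertex 3 has neighbors 1, 2, 4, of which only the pair (1, 2) is adjacent, giving a fraction of 1/3.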
Graphs (cont.)
Spanner: Given a graph $G = (V, E)$, we say that a subgraph $H = (V, E')$ is an $\alpha$-spanner for $G$ if
$\forall u, v \in V:\ d_G(u, v) \le d_H(u, v) \le \alpha \cdot d_G(u, v)$.
That is, the subgraph (approximately) maintains pairwise distances.
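The spanner condition can be verified directly on small unweighted graphs by comparing breadth-first-search distances. This checker assumes H is already a subgraph of G (so the lower bound holds automatically) and that vertices are labeled 0 to n − 1; it is a verification aid, not a construction algorithm.

```python
from collections import deque

def bfs_distances(n, edges, source):
    """Hop distances from source in an unweighted undirected graph."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = [float("inf")] * n
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if dist[w] == float("inf"):
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def is_spanner(n, g_edges, h_edges, alpha):
    """Check d_H(u, v) <= alpha * d_G(u, v) for all pairs u, v."""
    for s in range(n):
        dg = bfs_distances(n, g_edges, s)
        dh = bfs_distances(n, h_edges, s)
        if any(dh[t] > alpha * dg[t] for t in range(n)):
            return False
    return True

cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]  # G: a 4-cycle
path = [(0, 1), (1, 2), (2, 3)]           # H: G minus the edge (3, 0)
ok_3 = is_spanner(4, cycle, path, 3)
ok_2 = is_spanner(4, cycle, path, 2)
```

Dropping the edge (3, 0) stretches that pair from distance 1 to distance 3, so the path is a 3-spanner of the cycle but not a 2-spanner.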
Graph sparsification: Given a graph $G = (V, E)$, denote the minimum cut of $G$ by $\lambda(G)$, and by $\lambda_A(G)$ the capacity of the cut $(A, V \setminus A)$. We say that a weighted subgraph $H = (V, E', w)$ is an $\epsilon$-sparsification for $G$ if
$\forall A \subset V:\ (1 - \epsilon)\lambda_A(G) \le \lambda_A(H) \le (1 + \epsilon)\lambda_A(G)$.
App: synopses for massive graphs. A graph synopsis is a subgraph of much smaller size that keeps properties of the original graph.
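On toy graphs the sparsification condition can be checked by brute force over all cuts. The sketch below takes exponential time in |V| and is meant only to illustrate the definition; the edge-weight dictionaries and the names `cut_capacity` and `is_cut_sparsifier` are assumptions for this example.

```python
from itertools import combinations

def cut_capacity(weighted_edges, side):
    """Total weight of edges with exactly one endpoint in side."""
    return sum(w for (u, v), w in weighted_edges.items()
               if (u in side) != (v in side))

def is_cut_sparsifier(vertices, g_edges, h_edges, eps):
    """Check (1 - eps) * cut_G <= cut_H <= (1 + eps) * cut_G for every cut."""
    vertices = list(vertices)
    for size in range(1, len(vertices)):
        for subset in combinations(vertices, size):
            side = set(subset)
            cg = cut_capacity(g_edges, side)
            ch = cut_capacity(h_edges, side)
            if not (1 - eps) * cg <= ch <= (1 + eps) * cg:
                return False
    return True

# G: unit-weight triangle. H: two of its edges, reweighted to 1.5 each.
g = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0}
h = {(0, 1): 1.5, (1, 2): 1.5}
loose = is_cut_sparsifier([0, 1, 2], g, h, eps=0.5)
tight = is_cut_sparsifier([0, 1, 2], g, h, eps=0.2)
```

Every cut of the triangle has capacity 2, while the cuts of H have capacity 1.5 or 3, so H satisfies the condition for ε = 0.5 but not for ε = 0.2.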
Geometry
Denote the stream by $A = a_1, a_2, \ldots, a_m$, where $a_i = (\text{location}, \text{ins/del})$.
Earth-mover distance: Given two multisets $A, B$ of the same size in the grid $[\Delta]^2$, the earth-mover distance is defined as the minimum cost of a perfect matching between points in $A$ and $B$:
$\mathrm{EMD}(A, B) = \min_{\pi : A \to B \text{ a bijection}} \sum_{a \in A} \|a - \pi(a)\|$.
App: a good measure of the similarity of two images.
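For very small point sets the definition can be evaluated directly by trying every bijection. The sketch below does this in factorial time and assumes the Euclidean norm, which the slide does not specify; it requires Python 3.8+ for `math.dist`.

```python
import math
from itertools import permutations

def emd(a, b):
    """Brute-force earth-mover distance between equal-size point lists."""
    if len(a) != len(b):
        raise ValueError("EMD is defined here only for sets of equal size")
    return min(
        sum(math.dist(p, q) for p, q in zip(a, perm))
        for perm in permutations(b)  # each permutation is one bijection
    )

A = [(0, 0), (3, 0)]
B = [(0, 4), (3, 0)]
cost = emd(A, B)
```

Matching (0, 0) with (0, 4) and (3, 0) with itself costs 4 + 0 = 4, which beats the other bijection (3 + 5 = 8).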
Clustering: (k-Center) Cluster a set of points $X = (x_1, x_2, \ldots, x_m)$ into clusters $c_1, c_2, \ldots, c_k$ with representatives $r_1 \in c_1, r_2 \in c_2, \ldots, r_k \in c_k$ to minimize
$\max_i \min_j d(x_i, r_j)$.
App: (see wiki page)
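The k-center objective is easy to evaluate for a given set of representatives. The sketch below also includes the classical farthest-first greedy heuristic (a known 2-approximation in metric spaces) as one way to pick representatives offline; this heuristic is not taken from the slide, and Euclidean distance is an assumption.

```python
import math

def k_center_cost(points, centers):
    """max over points of the distance to the nearest center."""
    return max(min(math.dist(x, r) for r in centers) for x in points)

def farthest_first(points, k):
    """Greedy: repeatedly add the point farthest from the current centers."""
    centers = [points[0]]
    while len(centers) < k:
        nxt = max(points, key=lambda x: min(math.dist(x, r) for r in centers))
        centers.append(nxt)
    return centers

points = [(0, 0), (1, 0), (10, 0), (11, 0)]
centers = farthest_first(points, 2)
cost = k_center_cost(points, centers)
```

Starting from (0, 0), the farthest point is (11, 0); with these two representatives every point is within distance 1 of its nearest center.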
Strings
Denote the stream by $A = a_1, a_2, \ldots, a_m$, where $a_i = (i, \text{ins/del})$.
Distance to sortedness:
LIS(A) = length of the longest increasing subsequence of the sequence A.
DistSort(A) = minimum number of elements that must be deleted from A to get a sorted sequence = |A| − LIS(A).
App: a good measure of network latency.
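The identity DistSort(A) = |A| − LIS(A) can be checked with the standard O(m log m) patience-sorting computation of the LIS. This exact version may use space linear in the LIS length, and it assumes "increasing" means strictly increasing.

```python
from bisect import bisect_left

def lis_length(seq):
    """Length of the longest strictly increasing subsequence."""
    tails = []  # tails[k] = smallest tail of an increasing subsequence of length k + 1
    for x in seq:
        pos = bisect_left(tails, x)
        if pos == len(tails):
            tails.append(x)   # x extends the longest subsequence found so far
        else:
            tails[pos] = x    # x gives a smaller tail for length pos + 1
    return len(tails)

def dist_sort(seq):
    """Minimum number of deletions that leave a sorted sequence."""
    return len(seq) - lis_length(seq)

seq = [1, 5, 2, 3, 9, 4, 6]
lis = lis_length(seq)
distance = dist_sort(seq)
```

Here the longest increasing subsequence is 1, 2, 3, 4, 6, so deleting 5 and 9 sorts the sequence.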
Edit distance: Given two strings A and B, the minimum number of insertions/deletions/substitutions needed to convert A to B.
App: a standard measure of the similarity of two strings/documents.
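For reference, the textbook dynamic program computes edit distance exactly in O(|A| · |B|) time. The sketch below keeps only two rows of the table; it is the offline baseline, not a sublinear algorithm.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

d = edit_distance("kitten", "sitting")
```

The classic pair "kitten" and "sitting" is at distance 3 (two substitutions and one insertion).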
Numerical linear algebra
Denote the stream by $A = a_1, a_2, \ldots, a_n$, where $a_k = (i, j, \Delta)$ denotes the update $M[i, j] \leftarrow M[i, j] + \Delta$, and $M[i, j]$ is the cell in the $i$-th row and $j$-th column of matrix $M$.
Regression: Given an $n \times d$ matrix $M$ and an $n \times 1$ vector $b$, one seeks $x^* = \arg\min_x \|Mx - b\|_p$, for a $p \in [1, \infty)$.
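As a worked special case, take $p = 2$ (ordinary least squares) and assume $M$ has full column rank; neither assumption comes from the slide. Setting the gradient of $\|Mx - b\|_2^2$ to zero gives the normal equations and a closed form:

```latex
x^* = \arg\min_x \|Mx - b\|_2^2
\;\Longrightarrow\;
M^{\top} M \, x^* = M^{\top} b
\;\Longrightarrow\;
x^* = (M^{\top} M)^{-1} M^{\top} b .
```

For other values of $p$ (for example $p = 1$) there is in general no such closed form.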
Low-rank approximation: Given an $n \times m$ matrix $M$, find matrices $L$ ($n \times k$) and $W$ ($m \times k$) with orthonormal columns, and a diagonal $k \times k$ matrix $D$ ($k < \min\{n, m\}$), with $\|M - L D W^T\|_F$ minimized, where $\|\cdot\|_F$ is the Frobenius norm.
App: a fundamental problem in many areas, including machine learning, recommendation systems, natural language processing, etc.
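When $M$ is fully available, the optimum is given by the truncated singular value decomposition. This is a standard fact (the Eckart-Young theorem) that the slide does not state; $U_k$, $\Sigma_k$, $V_k$ denote the top-$k$ singular vectors and values and are notation introduced here.

```latex
M = U \Sigma V^{\top}, \quad \sigma_1 \ge \sigma_2 \ge \cdots \ge 0,
\qquad
L = U_k,\; D = \Sigma_k,\; W = V_k,
\qquad
\|M - L D W^{\top}\|_F = \Bigl(\sum_{i > k} \sigma_i^2\Bigr)^{1/2}.
```

The residual is therefore determined by the singular values that are discarded.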
Sliding windows
Sometimes we are only interested in recent items in the stream.
Time-based sliding window: the w most recent time steps.
Or, sequence-based sliding window: the w most recent items.
[Figure: CPU/RAM diagram of each window type.]
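A sequence-based window is straightforward to maintain exactly if storing w items is acceptable. The sketch below keeps a running sum over the w most recent items with a deque; it is an O(w)-memory baseline, and the class name `SequenceWindowSum` is made up for this example.

```python
from collections import deque

class SequenceWindowSum:
    """Running sum over the w most recent items of a stream."""

    def __init__(self, w):
        self.w = w
        self.window = deque()
        self.total = 0

    def add(self, value):
        self.window.append(value)
        self.total += value
        if len(self.window) > self.w:
            self.total -= self.window.popleft()  # the oldest item expires
        return self.total

sw = SequenceWindowSum(3)
sums = [sw.add(v) for v in [4, 1, 7, 2, 5]]
```

After the fourth item arrives, the first item (4) leaves the window, so the sum drops from 12 to 10. A time-based window would instead expire items by timestamp.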