Sublinear Algorithms for Big Data
Qin Zhang
Part 3: Sublinear in Time
Sublinear in time
Given a social network graph, if we have no time to ask everyone, can we still compute something non-trivial? For example, the average number of friends per person?
Average degree of a graph
Problem definition: Given a simple graph G = (V, E) (no parallel edges or self-loops), its average degree is

d̄ = (1/|V|) Σ_{v∈V} d(v).

Representation of G: degrees + adjacency lists. Our algorithms only make the following operations (queries):
- Degree queries: on v, return d(v).
- Neighbor queries: on (v, j), return the j-th neighbor of v.
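The two query types can be wrapped in a small oracle (a minimal Python sketch; the class name and the adjacency-list representation are illustrative, not from the slides):

```python
class GraphOracle:
    """Degree + adjacency-list access to a simple graph.

    adj maps each vertex to the list of its neighbors; a sublinear
    algorithm only ever touches the graph through the two queries below.
    """
    def __init__(self, adj):
        self.adj = adj

    def degree(self, v):
        # Degree query: return d(v).
        return len(self.adj[v])

    def neighbor(self, v, j):
        # Neighbor query: return the j-th neighbor of v.
        return self.adj[v][j]

# A 4-cycle: every vertex has degree 2.
G = GraphOracle({0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]})
print(G.degree(0), G.neighbor(0, 1))  # 2 3
```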
Naive approach fails
Naive sampling: pick a set S of s random nodes and output (1/s) Σ_{i∈S} d(v_i). How large must s be to get an O(1) multiplicative approximation? Ω(n)!
In general, given n numbers, estimating their average requires Ω(n) queries.
But maybe degree sequences are special, and we can exploit that?
- (n − 1, 0, . . . , 0) is NOT a possible degree sequence: a vertex of degree n − 1 is adjacent to every other vertex, so no other vertex can have degree 0.
- (n − 1, 1, . . . , 1) is possible (the star graph).
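The star graph also shows why naive sampling fails: it almost never hits the center vertex, whose degree carries half of the total. A small Python sketch (sample size and seed are arbitrary):

```python
import random

n = 10**6
degrees = [n - 1] + [1] * (n - 1)    # star graph: center plus n-1 leaves
true_avg = sum(degrees) / n           # = 2(n-1)/n, roughly 2

random.seed(0)
s = 1000
sample = [degrees[random.randrange(n)] for _ in range(s)]
est = sum(sample) / s
# With high probability no sample hits the center, so the estimate is
# about 1: only half of the true average degree.
print(true_avg, est)
```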
Some lower bounds for approximation
An extreme case: a graph with 0 edges vs. a graph with 1 edge.
Distinguishing the two (i.e., getting any multiplicative approximation) requires Ω(n) queries.
Another example:
- an n-cycle, vs.
- an (n − c√n)-cycle plus a c√n-clique.
Finding a clique node requires Ω(√n) queries.
We will assume the graph has Ω(n) edges from now on.
(2 + ε)-approximation
The algorithm
Algorithm
1. Take subsets S_1, S_2, . . . , S_{8/ε} independently at random from V, each of size Θ(√n/ε^{O(1)}).
2. Output the smallest number in {d̄_{S_1}, d̄_{S_2}, . . . , d̄_{S_{8/ε}}}, where d̄_{S_i} is the average degree of the nodes in S_i.

Theorem. This algorithm runs in time O(√n/ε^{O(1)}), and with probability 2/3 outputs a (2 + ε)-approximation.

Analysis on board.
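A sketch of this algorithm in Python (the constants inside Θ(·) are dropped, and `degree` stands for the degree-query oracle; both choices are illustrative):

```python
import math
import random

def avg_degree_2eps(degree, n, eps, rng=None):
    """Smallest-of-averages estimator: take 8/eps independent sample
    sets of about sqrt(n)/eps vertices each and return the minimum
    sample average degree."""
    rng = rng or random.Random(0)
    k = math.ceil(8 / eps)               # number of sample sets
    s = math.ceil(math.sqrt(n) / eps)    # size of each set (constants omitted)
    best = float("inf")
    for _ in range(k):
        S = [rng.randrange(n) for _ in range(s)]   # uniform random vertices
        best = min(best, sum(degree(v) for v in S) / s)
    return best

# On a 2-regular graph every sample set averages exactly 2.
print(avg_degree_2eps(lambda v: 2, 10**4, 0.5))  # 2.0
```

Taking the minimum over many sets is what guards against sample sets that happen to hit unusually high-degree vertices.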
(1 + ε)-approximation
The idea
Idea: group nodes of similar degrees and estimate the average within each group.
Buckets: set β = ε/c (c a constant) and t = O(log n/ε) (# buckets).
For i ∈ {0, . . . , t − 1}, set B_i = {v | (1 + β)^{i−1} < d(v) ≤ (1 + β)^i}.
Writing d(X) = Σ_{x∈X} d(x), the total degree of the nodes in B_i satisfies
d(B_i) ∈ ((1 + β)^{i−1} |B_i|, (1 + β)^i |B_i|],
and the total degree of the nodes in V satisfies
d(V) ∈ (Σ_i (1 + β)^{i−1} |B_i|, Σ_i (1 + β)^i |B_i|].
The first try
Algorithm
1. Take a sample S of size s = 10000√n · t/ε.
2. Let S_i := S ∩ B_i (the samples that fall into the i-th bucket).
3. Set ρ_i = |S_i|/s (an estimate of the fraction of nodes in B_i).
4. Output Σ_i ρ_i (1 + β)^{i−1}.

Note: ∀i, E[ρ_i] = E[|S_i|/s] = |B_i|/n.

Does this work? What if, for some level i, |S_i| is small (that is, |B_i| is small)? For those i's, ρ_i will not be very accurate...
The second try
Algorithm
1. Take a sample S of size s = 10000√n · t/ε. Set η = 10000/c.
2. Let S_i := S ∩ B_i (the samples that fall into the i-th bucket).
3. For each i, set ρ_i = 0 if |S_i| ≤ η, and ρ_i = |S_i|/s otherwise.
4. Output Σ_i ρ_i (1 + β)^{i−1}.

Idea: set ρ_i to 0 for small buckets. Note that we no longer have ∀i, E[ρ_i] = E[|S_i|/s] = |B_i|/n, but we can still show good bounds (on board).

Theorem. This algorithm runs in time O(√n/ε^{O(1)}), and with probability 2/3 outputs a (2 + ε)-approximation.
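A sketch of the bucketing estimator (Python; the threshold `eta` and the sample-size constant are placeholders, not the slide's values):

```python
import math
import random

def avg_degree_buckets(degree, n, eps, c=2.0, eta=4, rng=None):
    """Bucket sampled vertices by degree, zero out small buckets, and
    sum rho_i * (1 + beta)^(i-1) over the remaining ones."""
    rng = rng or random.Random(0)
    beta = eps / c
    t = math.ceil(math.log(n) / eps) + 1     # number of buckets
    s = math.ceil(math.sqrt(n) * t / eps)    # sample size (constant omitted)
    counts = [0] * t                         # counts[i] = |S_i|
    for _ in range(s):
        d = degree(rng.randrange(n))
        if d == 0:
            continue                         # isolated vertices contribute nothing
        # bucket index i with (1+beta)^(i-1) < d <= (1+beta)^i
        i = max(0, math.ceil(math.log(d, 1 + beta)))
        if i < t:
            counts[i] += 1
    return sum((cnt / s) * (1 + beta) ** (i - 1)
               for i, cnt in enumerate(counts) if cnt > eta)

# 2-regular graph: every sample lands in one bucket, so the estimate is
# that bucket's lower endpoint (1+beta)^(i-1), within a (1+beta) factor of 2.
est = avg_degree_buckets(lambda v: 2, 10**4, 0.5)
print(est)
```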
An improved algorithm (if neighbor queries are allowed)
Algorithm
1. Take a sample S of size s = 1000√n · t/ε. Set η = 10000/c.
2. For all i:
 (a) If |S_i| ≥ η, set ρ_i = |S_i|/s; otherwise set ρ_i = 0.
 (b) For all v ∈ S_i, pick a random neighbor u of v, and set χ(v) = 1 if u is in a small bucket B_j.
 (c) Set α_i = |{v ∈ S_i | χ(v) = 1}| / |S_i|.
3. Output Σ_i ρ_i (1 + α_i)(1 + β)^{i−1}.

Idea: estimate the degree contributed by large-small edges (see the analysis on board) more precisely.

Theorem. This algorithm runs in time O(√n/ε^{O(1)}), and with probability 2/3 outputs a (1 + ε)-approximation.
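A sketch of the refined estimator with neighbor queries (Python; the oracle interface, the threshold `eta`, and the constants are illustrative):

```python
import math
import random

def avg_degree_refined(degree, neighbor, n, eps, c=2.0, eta=4, rng=None):
    """As the bucketing estimator, but each sampled vertex also probes
    one random neighbor; alpha_i is the fraction whose neighbor lies in
    a small bucket, inflating the bucket's contribution by (1+alpha_i)."""
    rng = rng or random.Random(0)
    beta = eps / c
    t = math.ceil(math.log(n) / eps) + 1
    s = math.ceil(math.sqrt(n) * t / eps)

    def bucket(d):
        return max(0, math.ceil(math.log(d, 1 + beta)))

    samples = [[] for _ in range(t)]        # S_i per bucket
    for _ in range(s):
        v = rng.randrange(n)
        d = degree(v)
        if d == 0:
            continue                        # isolated vertices contribute nothing
        i = bucket(d)
        if i < t:
            samples[i].append(v)
    small = [len(S) < eta for S in samples] # buckets whose rho_i is zeroed
    est = 0.0
    for i, S in enumerate(samples):
        if small[i]:
            continue
        hits = 0
        for v in S:
            # chi(v) = 1 iff a random neighbor of v is in a small bucket
            u = neighbor(v, rng.randrange(degree(v)))
            j = bucket(degree(u))
            if j < t and small[j]:
                hits += 1
        est += (len(S) / s) * (1 + hits / len(S)) * (1 + beta) ** (i - 1)
    return est

# On an n-cycle all vertices share one (large) bucket, so alpha_i = 0.
n = 1000
est = avg_degree_refined(lambda v: 2,
                         lambda v, j: (v + 1) % n if j else (v - 1) % n,
                         n, 0.5)
print(est)
```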
Minimum Spanning Tree
The connection between # CC and MST

Assume a connected graph G = (V, E) has max degree D, and that all edge weights are in {1, 2, . . . , W}. Let G^(i) = (V, E^(i)) denote the subgraph containing the edges of weight at most i, and let c_i be the # connected components in G^(i). We have

MST(G) = n − W + Σ_{i=1}^{W−1} c_i,

since any MST contains exactly c_i − 1 edges of weight greater than i, and summing these counts over i = 0, . . . , W − 1 (with c_0 = n) charges each MST edge once per unit of its weight.

We thus only need to approximate c_i for each i = 1, . . . , W − 1.
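The identity is easy to check by brute force on a small weighted graph (a Python sketch; the example graph is made up, and Kruskal's algorithm serves only as a reference for the left-hand side):

```python
def components(n, edges):
    """# connected components, via union-find with path halving."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for u, v, _ in edges:
        parent[find(u)] = find(v)
    return len({find(x) for x in range(n)})

def mst_weight(n, edges):
    """MST weight via Kruskal's algorithm."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    total = 0
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        if find(u) != find(v):
            parent[find(u)] = find(v)
            total += w
    return total

# A 5-cycle with a chord; weights in {1, ..., W} with W = 3.
n, W = 5, 3
edges = [(0, 1, 1), (1, 2, 2), (2, 3, 1), (3, 4, 3), (4, 0, 2), (0, 2, 3)]
lhs = mst_weight(n, edges)
# c_i = # components of the subgraph with edges of weight <= i
rhs = n - W + sum(components(n, [e for e in edges if e[2] <= i])
                  for i in range(1, W))
print(lhs, rhs)  # 6 6 -- the two sides agree
```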
A sublinear algorithm for # CC
Algorithm
1. Sample a random set of r = c_0/ε² vertices u_1, . . . , u_r (c_0 a constant).
2. For each sampled vertex u_i, grow a BFS tree T_{u_i} rooted at u_i as follows:
 (a) Choose X according to Pr[X ≥ k] = 1/k.
 (b) Run BFS starting at u_i until either
  i. the whole connected component containing u_i has been fully explored, or
  ii. X vertices have been explored.
 (c) If the BFS stopped in the first case, set α_i = 1; otherwise set α_i = 0.
3. Output (n/r) Σ_{i=1}^r α_i.

Theorem. This algorithm runs in time O(D log n/(ε²ρ)) (D is the maximum degree of the nodes in G), and with probability 1 − ρ outputs an answer with additive error εn.
An improved algorithm for # CC
Algorithm
1. Sample a random set of r = c_0/ε² vertices u_1, . . . , u_r.
2. For each sampled vertex u_i, grow a BFS tree T_{u_i} rooted at u_i as follows. Set α_i = 0 and f = 0.
 ∗ Flip a coin. Set f = f + 1. If (head) ∧ (|T_{u_i}| < W = 4/ε) ∧ (no visited vertex has degree > d∗ = O(d̄/ε)), then let B = |T_{u_i}| and continue to grow T_{u_i} by B steps.
  i. If during any of the B steps the component of G containing u_i has been fully explored, then set α_i = 2 if B′ = 0, and α_i = d_{u_i} · 2^f/B′ otherwise, where B′ ∈ [B, 2B] is the # edges visited in the BFS so far.
  ii. Else, repeat step ∗.
3. Output (n/(2r)) Σ_{i=1}^r α_i.

Theorem. This algorithm runs in time O(d̄ log(d̄/ε)/(ε²ρ)), and with probability 1 − ρ outputs an answer with additive error εn.
Back to MST
Set ε = φ/(2W) and ρ = 1/(4W) when approximating all the c_i. The total running time is O(D · W³ · log n/φ²) (to approximate MST(G) up to a factor of 1 + φ). This can be improved to Õ(DW/φ²).