Sublinear Algorithms for Big Data
Qin Zhang
Part 2: Sublinear in Communication
Sublinear in communication
The model
x1 = 010011, x2 = 111011, x3 = 111111, ..., xk = 100011
They want to jointly compute f(x1, x2, ..., xk).
Goal: minimize total bits of communication.
Applications etc.
A natural approach
Sites S1, S2, S3, ..., Sk each hold an input; C = Coordinator.
The natural approach: each Si computes a sketch sk(Si) of its input and sends it to C, and then C computes f(x1, ..., xk) based on sk(S1), ..., sk(Sk).
The slides from the next page on are borrowed from Andrew McGregor.
- I. Connectivity
- II. k-Connectivity
- III. Min-Cut

Theorem: Testing Connectivity
a) Dynamic Graph Stream: O(n polylog n) space.
b) Simultaneous Messages: O(polylog n) length.
Ingredient 1: Basic Algorithm
Algorithm (Spanning Forest):
- 1. For each node: pick incident edge
- 2. For each connected comp: pick incident edge
- 3. Repeat until no edges between connected comp.
Lemma: After O(log n) rounds selected edges include spanning forest.
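The spanning-forest rounds can be sketched in a few lines of Python. This is only an illustration under assumptions not in the slides: nodes are labelled 0..n-1, the graph is an explicit edge list, and the names `spanning_forest` and `find` are hypothetical.

```python
def spanning_forest(n, edges):
    """Boruvka-style rounds: every current component picks one incident
    edge that leaves it; stop when no such edge exists."""
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    forest, rounds = [], 0
    while True:
        picked = {}  # component root -> one edge leaving that component
        for u, v in edges:
            ru, rv = find(u), find(v)
            if ru != rv:
                picked.setdefault(ru, (u, v))
                picked.setdefault(rv, (u, v))
        if not picked:
            return forest, rounds
        rounds += 1
        for u, v in picked.values():
            ru, rv = find(u), find(v)
            if ru != rv:  # skip edges that would now close a cycle
                parent[ru] = rv
                forest.append((u, v))


# Triangle {0,1,2}, edge {3,4}, isolated node 5: three components.
forest, rounds = spanning_forest(6, [(0, 1), (1, 2), (2, 0), (3, 4)])
```

Every round at least halves the number of components that still have an outgoing edge, which is where the O(log n) bound in the lemma comes from.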
Ingredient 2: Sketching Neighborhoods
For node i, let ai be vector indexed by node pairs. Non-zero entries: ai[i,j]=1 if j>i and ai[i,j]=-1 if j<i.
Example (graph on nodes 1, 2, 3, 4, 5), coordinates ordered
{1,2} {1,3} {1,4} {1,5} {2,3} {2,4} {2,5} {3,4} {3,5} {4,5}:
a1 has non-zero entries 1, 1; a2 has non-zero entries −1 (at {1,2}) and 1; in a1 + a2 the {1,2} entries cancel, leaving non-zero entries 1, 1.
Lemma: For any subset of nodes S⊂V,
$\mathrm{support}\left(\sum_{i \in S} a_i\right) = E(S, V \setminus S)$
Lemma: ∃ random $M: \mathbb{R}^N \to \mathbb{R}^k$ with k=O(polylog N) such that for any $a \in \mathbb{R}^N$, with high probability
$Ma \longrightarrow e \in \mathrm{support}(a)$
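The cancellation behind the first lemma can be checked directly on a small graph. The sketch below stores each ai sparsely as a dictionary. The edge set is an assumption chosen to match the pattern of the example, since the figure itself did not survive extraction, and the function names are hypothetical.

```python
def node_vector(i, edges):
    """a_i as a sparse dict {(u, v): +1 or -1}, indexed by pairs with u < v."""
    a = {}
    for u, v in edges:
        u, v = min(u, v), max(u, v)
        if i == u:
            a[(u, v)] = 1    # other endpoint j = v > i
        elif i == v:
            a[(u, v)] = -1   # other endpoint j = u < i
    return a


def add_vectors(vectors):
    """Coordinate-wise sum; zero entries are dropped, so the keys are the support."""
    total = {}
    for a in vectors:
        for pair, val in a.items():
            total[pair] = total.get(pair, 0) + val
    return {pair: x for pair, x in total.items() if x != 0}


def cut_edges(S, edges):
    """E(S, V \\ S): edges with exactly one endpoint in S."""
    return {(min(u, v), max(u, v)) for u, v in edges if (u in S) != (v in S)}


# Assumed example graph on nodes 1..5.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
a1, a2 = node_vector(1, edges), node_vector(2, edges)
total = add_vectors([a1, a2])  # the {1,2} entries cancel
```

An edge inside S contributes +1 from its smaller endpoint and −1 from its larger one, so it vanishes from the sum; only edges crossing the cut survive.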
Recipe: Sketch & Compute on Sketches
Sketch: Each player j sends $Ma_j$.
Central Player Runs Algorithm in Sketch Space:
- Use $Ma_j$ to get incident edge on each node j
- For i=2 to log n: To get incident edge on component S⊂V use:
$\sum_{j \in S} Ma_j = M\left(\sum_{j \in S} a_j\right) \longrightarrow e \in \mathrm{support}\left(\sum_{j \in S} a_j\right) = E(S, V \setminus S)$
Detail: Actually each player sends log n independent sketches $M_1 a_j, M_2 a_j, \ldots$ and the central player uses $M_i a_j$ when emulating the ith iteration of the algorithm.
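To make the linearity step concrete, here is a deliberately simplified stand-in for M. It is not the sketch used in the slides: it recovers an edge only when exactly one edge leaves the component (1-sparse recovery), whereas a full sampler would also subsample coordinates at several rates. The modulus `P`, the shared value `r` (drawn at random in practice), the pair encoding and all function names are assumptions made for illustration.

```python
P = (1 << 61) - 1  # a Mersenne prime used as the fingerprint modulus


def pair_index(u, v, n):
    """Map an unordered node pair (nodes 1..n) to a positive integer."""
    u, v = min(u, v), max(u, v)
    return u * (n + 1) + v


def sketch_node(j, n, edges, r):
    """Linear sketch of a_j: (sum, index-weighted sum, fingerprint)."""
    s0 = s1 = s2 = 0
    for u, v in edges:
        if j in (u, v):
            other = v if j == u else u
            sign = 1 if other > j else -1
            idx = pair_index(u, v, n)
            s0 += sign
            s1 += sign * idx
            s2 = (s2 + sign * pow(r, idx, P)) % P
    return (s0, s1, s2)


def add_sketches(sketches):
    """Sum of sketches equals the sketch of the summed vectors."""
    sketches = list(sketches)
    return (sum(s[0] for s in sketches),
            sum(s[1] for s in sketches),
            sum(s[2] for s in sketches) % P)


def recover_single_edge(sk, n, r):
    """Return the single supported pair if the vector is 1-sparse, else None."""
    s0, s1, s2 = sk
    if s0 not in (1, -1):
        return None
    idx = s1 * s0  # equals s1 / s0 because s0 is +1 or -1
    if idx <= 0 or s2 != (s0 * pow(r, idx, P)) % P:
        return None  # fingerprint mismatch: more than one entry
    return divmod(idx, n + 1)


edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]  # assumed example graph
n, r = 5, 123456789
component = [1, 2, 3]
edge = recover_single_edge(
    add_sketches(sketch_node(j, n, edges, r) for j in component), n, r)
```

The central player never sees the vectors themselves: it only adds the three numbers sent by each player in the component and decodes the result.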
- II. k-Connectivity

Theorem: Checking every cut has size ≥ k
a) Dynamic Graph Stream: O(n k polylog n) space.
b) Simultaneous Messages: O(k polylog n) length.
Ingredient 1: Basic Algorithm
Algorithm (k-Connectivity):
- 1. Let F1 be spanning forest of G(V,E)
- 2. For i=2 to k:
  2.1. Let Fi be spanning forest of G(V, E−F1−...−Fi−1)
Lemma: G(V, F1+...+Fk) is k-connected iff G(V,E) is.
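The forest-peeling algorithm can be tried out on a tiny graph as follows. This is a sketch under assumptions: the graph is simple (no parallel edges) and given as an explicit edge list, the spanning forests are found greedily with union-find rather than from sketches, and `min_cut` is a brute-force helper that is only usable for a handful of nodes. All names are hypothetical.

```python
from itertools import combinations


def spanning_forest(nodes, edges):
    """Greedy spanning forest via union-find."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    forest = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((u, v))
    return forest


def k_connectivity_certificate(nodes, edges, k):
    """Union of F_1, ..., F_k, where F_i spans G minus F_1, ..., F_{i-1}."""
    remaining, certificate = list(edges), []
    for _ in range(k):
        forest = spanning_forest(nodes, remaining)
        certificate += forest
        chosen = set(forest)
        remaining = [e for e in remaining if e not in chosen]
    return certificate


def min_cut(nodes, edges):
    """Brute-force minimum cut over all node subsets (tiny graphs only)."""
    nodes = list(nodes)
    best = len(edges)
    for size in range(1, len(nodes)):
        for S in combinations(nodes, size):
            S = set(S)
            best = min(best, sum((u in S) != (v in S) for u, v in edges))
    return best


nodes = list(range(5))
edges = list(combinations(nodes, 2))          # complete graph K5, min cut 4
H = k_connectivity_certificate(nodes, edges, 2)
```

With k = 2 the certificate keeps at most 2(n − 1) edges yet still has no cut smaller than 2, which is the content of the lemma for this instance.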
Ingredient 2: Connectivity Sketches
Sketch: Simultaneously construct k independent connectivity sketches {M1G, M2G, ..., MkG}.
Run Algorithm in Sketch Space:
- Use M1G to find a spanning forest F1 of G
- Use M2G − M2F1 = M2(G − F1) to find F2
- Use M3G − M3F1 − M3F2 = M3(G − F1 − F2) to find F3
- etc.
- III. Min-Cut

Theorem: (1+ε)-approximating minimum cut
a) Dynamic Graph Stream: $O(\epsilon^{-2}\, n \operatorname{polylog} n)$ space.
b) Simultaneous Messages: $O(\epsilon^{-2} \operatorname{polylog} n)$ length.
Ingredient 1: Subsampling
Lemma (Karger): Define subgraph $G_i$ by sampling edges w/p $2^{-i}$ ($G = G_0, G_1, G_2, G_3, \ldots$). Then
$\text{Min-Cut}(G) = (1 \pm \epsilon) \cdot 2^i \cdot \text{Min-Cut}(G_i)$ if $i < -\log p^*$
where
$p^* = 6\epsilon^{-2} \log n / \text{Min-Cut}(G)$
Suffices to find Min-Cut($G_i$) for some $i < -\log p^*$.
Ingredient 2: k-Connectivity
k-Connectivity: Given $G_i$ returns subgraph $H_i$ with
$\text{Min-Cut}(G_i) = \text{Min-Cut}(H_i)$ if $\text{Min-Cut}(G_i) < k$
Lemma: For $k = 12\epsilon^{-2} \log n$, with high probability
$\text{Min-Cut}(G_i) < k$ for $i = -\log p^*$
since expectation of Min-Cut($G_i$) is $< 6\epsilon^{-2} \log n$.
Putting it together: Construct $H_i$ for all i. Return $2^i\, \text{Min-Cut}(H_i)$ for smallest i with $\text{Min-Cut}(H_i) < k$.
Algorithm for Min-Cut
- 1. For i = {1, . . . , 2 log n}, let $h_i : E \to \{0, 1\}$ be a uniform hash function.
- 2. For i = {1, . . . , 2 log n},
  (a) Let $G_i$ be the subgraph of G containing edges e such that $\prod_{j \le i} h_j(e) = 1$.
  (b) Let $H_i \leftarrow$ k-Connected($G_i$) for $k = O(\epsilon^{-2} \log n)$.
- 3. Return $2^j \cdot \text{Min-Cut}(H_j)$, where $j = \min\{i : \text{Min-Cut}(H_i) < k\}$
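The control flow of the steps above can be sketched as follows; the code is meant to show the structure, not to give accurate estimates on toy inputs. Assumptions: the uniform hash functions are imitated with SHA-256 and a seed, the k-Connected step is skipped (the code takes the minimum cut of the subsampled graph directly, which has the same value whenever that cut is below k), and `min_cut` is a brute-force helper for tiny graphs. All names are hypothetical.

```python
import hashlib
from itertools import combinations
from math import ceil, log2


def h(i, e, seed=0):
    """Hypothetical stand-in for the uniform hash h_i(e) in {0, 1}."""
    key = f"{seed}:{i}:{min(e)}:{max(e)}".encode()
    return hashlib.sha256(key).digest()[0] & 1


def subsample_levels(edges, num_levels, seed=0):
    """levels[i] keeps e iff h_1(e) = ... = h_i(e) = 1, so levels[0] is G."""
    levels = [list(edges)]
    for i in range(1, num_levels + 1):
        levels.append([e for e in levels[-1] if h(i, e, seed) == 1])
    return levels


def min_cut(nodes, edges):
    """Brute-force minimum cut over all node subsets (tiny graphs only)."""
    nodes = list(nodes)
    best = len(edges)
    for size in range(1, len(nodes)):
        for S in combinations(nodes, size):
            S = set(S)
            best = min(best, sum((u in S) != (v in S) for u, v in edges))
    return best


def estimate_min_cut(nodes, edges, k, seed=0):
    """Return 2^j * Min-Cut(G_j) for the smallest level j whose cut is below k."""
    num_levels = 2 * ceil(log2(len(nodes)))
    levels = subsample_levels(edges, num_levels, seed)
    for i in range(1, num_levels + 1):
        c = min_cut(nodes, levels[i])
        if c < k:
            return (2 ** i) * c
    return None  # no level fell below k


nodes = list(range(6))
edges = list(combinations(nodes, 2))
levels = subsample_levels(edges, 6, seed=1)   # nested: G_0 ⊇ G_1 ⊇ ...
```

Because level i reuses the hash bits of levels 1..i−1, the subgraphs are nested and each edge survives to level i with probability 2^(−i), matching the sampling rate in Karger's lemma.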
Example: Checking Bipartiteness
Idea: Given G, define G’ where a node v becomes v1 and v2 and edge (u,v) becomes (u1,v2) and (u2,v1).
Lemma: G is bipartite iff number of connected components doubles, i.e. G’ has exactly twice as many connected components as G.
Can sketch G’ implicitly.
Thm: Õ(n)-dimensional sketch for bipartiteness.
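The reduction to connectivity is easy to check on explicit graphs. The sketch below builds G’ explicitly and counts components with union-find; in the sketching setting the components would instead be counted from connectivity sketches. The function names are hypothetical.

```python
def count_components(nodes, edges):
    """Number of connected components, via union-find."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    count = len(parent)
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            count -= 1
    return count


def double_cover(nodes, edges):
    """G': node v becomes (v, 1), (v, 2); edge (u, v) becomes two crossing edges."""
    new_nodes = [(v, 1) for v in nodes] + [(v, 2) for v in nodes]
    new_edges = []
    for u, v in edges:
        new_edges.append(((u, 1), (v, 2)))
        new_edges.append(((u, 2), (v, 1)))
    return new_nodes, new_edges


def is_bipartite(nodes, edges):
    """G is bipartite iff G' has twice as many components as G."""
    cover_nodes, cover_edges = double_cover(nodes, edges)
    return count_components(cover_nodes, cover_edges) == 2 * count_components(nodes, edges)


square = [(0, 1), (1, 2), (2, 3), (3, 0)]     # even cycle
triangle = [(0, 1), (1, 2), (2, 0)]           # odd cycle
results = (is_bipartite(range(4), square), is_bipartite(range(3), triangle))
```

A bipartite component splits into two copies in G’, while a component containing an odd cycle stays in one piece, so only bipartite graphs double their component count.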
Example: Minimum Spanning Tree
Idea: Let $n_i$ be number of connected components if we ignore edges with weight $\ge (1+\epsilon)^i$, then:
$w(\mathrm{MST}) \le \sum_i \epsilon (1+\epsilon)^i n_i \le (1+\epsilon)\, w(\mathrm{MST})$
Thm: Can (1+ε) approximate MST in one-pass dynamic semi-streaming model.
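The idea of reading the MST weight off component counts can be checked in its exact, ε-free form. This is a related identity rather than the slide's formula: for a connected graph with integer weights in {1, ..., W}, w(MST) = n − W + Σ_{i=1}^{W−1} c_i, where c_i is the number of components when only edges of weight ≤ i are kept. The sketch below is an illustration with hypothetical names.

```python
def count_components(n, edges):
    """Number of connected components on nodes 0..n-1, via union-find."""
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    count = n
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            count -= 1
    return count


def mst_weight_kruskal(n, weighted_edges):
    """Reference answer: Kruskal's algorithm on (u, v, w) triples."""
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    total = 0
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            total += w
    return total


def mst_weight_from_components(n, weighted_edges, W):
    """n - W + sum over i = 1..W-1 of the component count at threshold i."""
    total = n - W
    for i in range(1, W):
        light = [(u, v) for u, v, w in weighted_edges if w <= i]
        total += count_components(n, light)
    return total


graph = [(0, 1, 1), (1, 2, 2), (2, 3, 3), (0, 3, 3), (0, 2, 1)]
by_kruskal = mst_weight_kruskal(4, graph)
by_counts = mst_weight_from_components(4, graph, 3)
```

Since each component count is obtainable from a connectivity sketch of a thresholded graph, replacing the integer thresholds by powers of (1+ε) leads to the approximate bound stated on the slide.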
Algorithm for Sparsification
- 1. For i = {1, . . . , 2 log n}, let $h_i : E \to \{0, 1\}$ be a uniform hash function.
- 2. For i = {1, . . . , 2 log n},
  (a) Let $G_i$ be the subgraph of G containing edges e such that $\prod_{j \le i} h_j(e) = 1$.
  (b) Let $H_i \leftarrow$ k-Connected($G_i$) for $k = O(\epsilon^{-2} \log^2 n)$.
- 3. For each edge e = (u, v), find $j = \min\{i : \lambda_e(H_i) < k\}$. If $e \in H_j$, add e to the sparsifier with weight $2^j$.
$\lambda_e(G)$: size of the minimum cut separating u and v, for each edge e = (u, v) in G.

Azuma’s inequality
A sequence of random variables $X_1, X_2, \ldots$ is called a martingale if for all i ≥ 1, $E[X_{i+1} \mid X_i] = X_i$. If $|X_{i+1} - X_i| \le c_i$ almost surely for all i, then
$\Pr[|X_n - X_1| \ge t] < 2\exp\left(-\frac{t^2}{2\sum_{i=1}^{n-1} c_i^2}\right)$
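As a worked special case of Azuma's inequality (an illustration, with the failure probability δ introduced here as an assumed parameter), take all increments bounded by $c_i = 1$:

```latex
% Special case c_i = 1 for all i, so \sum_{i=1}^{n-1} c_i^2 = n - 1:
\Pr\bigl[\,|X_n - X_1| \ge t\,\bigr] < 2\exp\!\left(-\frac{t^2}{2(n-1)}\right).
% Choosing t = \sqrt{2(n-1)\ln(2/\delta)} makes the exponent equal to -\ln(2/\delta):
\Pr\Bigl[\,|X_n - X_1| \ge \sqrt{2(n-1)\ln(2/\delta)}\,\Bigr]
  < 2 \cdot \frac{\delta}{2} = \delta .
```

So after n − 1 steps of size at most 1, the martingale stays within about $\sqrt{n \log(1/\delta)}$ of its starting value except with probability δ.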