Sublinear Algorithms for Big Data (Qin Zhang). Part 2: Sublinear in Communication



slide-1
SLIDE 1


Qin Zhang

Sublinear Algorithms for Big Data

slide-2
SLIDE 2


Part 2: Sublinear in Communication

slide-3
SLIDE 3

Sublinear in communication

Applications etc.

The model

x1 = 010011, x2 = 111011, x3 = 111111, ..., xk = 100011

They want to jointly compute f(x1, x2, ..., xk).
Goal: minimize total bits of communication.

slide-4
SLIDE 4

A natural approach

The model

x1 = 010011, x2 = 111011, x3 = 111111, ..., xk = 100011

They want to jointly compute f(x1, x2, ..., xk).
Goal: minimize total bits of communication.

Sites: S1, S2, S3, ..., Sk. C = Coordinator.

The natural approach

Each Si computes a sketch sk(Si) of its input and sends it to C; C then computes f(x1, ..., xk) based on sk(S1), ..., sk(Sk).
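A toy instance of this pattern may help fix ideas. The sketch below assumes f is the total number of 1-bits across all inputs; that choice of f is an illustration and does not come from the slides. Each site then only needs to send its local count.

```python
# Toy "sketch and send" protocol (illustrative only).
# Assumption: f counts the 1-bits over all inputs; f is not named on the slide.

def sketch(x):
    """Site-side sketch: the number of 1-bits in the local bit string."""
    return x.count("1")

def coordinator(sketches):
    """The coordinator combines the sketches without seeing the raw inputs."""
    return sum(sketches)

inputs = ["010011", "111011", "111111", "100011"]  # x1, x2, x3, xk from the slide
sketches = [sketch(x) for x in inputs]
total_ones = coordinator(sketches)

# Each sketch fits in ceil(log2(n + 1)) bits instead of the n bits of raw input.
n = len(inputs[0])
bits_per_sketch = n.bit_length()
```

For functions such as connectivity the sketch is far less obvious, which is what the following slides address.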

The slides from the next page onward are borrowed from Andrew McGregor.

slide-5
SLIDE 5
  • I. Connectivity
  • II. k-Connectivity
  • III. Min-Cut
slide-6
SLIDE 6

Theorem: Testing Connectivity
a) Dynamic Graph Stream: O(n polylog n) space.
b) Simultaneous Messages: O(polylog n) length.

  • I. Connectivity
  • II. k-Connectivity
  • III. Min-Cut
slide-7
SLIDE 7

Ingredient 1: Basic Algorithm

slide-17
SLIDE 17

Ingredient 1: Basic Algorithm

Algorithm (Spanning Forest):

  • 1. For each node: pick incident edge
  • 2. For each connected comp: pick incident edge
  • 3. Repeat until no edges between connected comp.

Lemma: After O(log n) rounds selected edges include spanning forest.
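The rounds above can be sketched as a Boruvka-style procedure. This is a minimal non-streaming version under stated assumptions: nodes are labelled 0..n-1, the full edge list is available, and "pick incident edge" is implemented as "take the first crossing edge seen". The names `spanning_forest`, `picked` and `rounds` are illustrative.

```python
def spanning_forest(n, edges):
    """Boruvka-style rounds: every component picks one edge that leaves it.

    n: number of nodes, labelled 0..n-1; edges: list of (u, v) pairs.
    Returns (forest_edges, rounds_used).
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    forest, rounds = [], 0
    while True:
        # Steps 1-2: each current component picks one incident crossing edge.
        picked = {}
        for u, v in edges:
            ru, rv = find(u), find(v)
            if ru != rv:
                picked.setdefault(ru, (u, v))
                picked.setdefault(rv, (u, v))
        if not picked:  # Step 3: stop when no edge joins two components.
            return forest, rounds
        rounds += 1
        for u, v in picked.values():
            ru, rv = find(u), find(v)
            if ru != rv:  # skip edges made redundant earlier in this round
                parent[ru] = rv
                forest.append((u, v))


triangle_plus_edge = [(0, 1), (1, 2), (2, 0), (3, 4)]
forest, rounds = spanning_forest(6, triangle_plus_edge)
```

Every component that still has a crossing edge merges with at least one other component per round, so the number of such components at least halves each round; this is the reason behind the O(log n) bound in the lemma.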

slide-18
SLIDE 18

Ingredient 2: Sketching Neighborhoods

slide-23
SLIDE 23

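Ingredient 2: Sketching Neighborhoods

For node i, let $a_i$ be a vector indexed by node pairs. Non-zero entries, one per edge {i,j} incident to i: $a_i[i,j] = 1$ if $j > i$ and $a_i[i,j] = -1$ if $j < i$.

[Figure: example graph on nodes 1, 2, 3, 4, 5, with $a_1$, $a_2$ and $a_1 + a_2$ written as vectors over the coordinates {1,2} {1,3} {1,4} {1,5} {2,3} {2,4} {2,5} {3,4} {3,5} {4,5}; the shared coordinate {1,2} cancels in $a_1 + a_2$.]

Lemma: For any subset of nodes $S \subset V$,

  $\mathrm{support}\left(\sum_{i \in S} a_i\right) = E(S, V \setminus S)$

Lemma: ∃ random $M: \mathbb{R}^N \to \mathbb{R}^k$ with $k = O(\mathrm{polylog}\, N)$ such that for any $a \in \mathbb{R}^N$, with high probability

  $Ma \longrightarrow e \in \mathrm{support}(a)$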
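The first lemma can be checked by brute force on a small graph. The edge set below is hypothetical, since the graph drawn on the slide is not recoverable from the transcript; the helper names are illustrative.

```python
from itertools import combinations


def incidence_vector(i, edges):
    """a_i as a dict over node pairs (u, v) with u < v; zero entries omitted."""
    a = {}
    for u, v in edges:
        u, v = min(u, v), max(u, v)
        if i == u:
            a[(u, v)] = 1    # the other endpoint j = v satisfies j > i
        elif i == v:
            a[(u, v)] = -1   # the other endpoint j = u satisfies j < i
    return a


def support_of_sum(S, edges):
    """Support of the sum of a_i over i in S."""
    total = {}
    for i in S:
        for pair, value in incidence_vector(i, edges).items():
            total[pair] = total.get(pair, 0) + value
    return {pair for pair, value in total.items() if value != 0}


def cut_edges(S, edges):
    """E(S, V \\ S): edges with exactly one endpoint in S."""
    S = set(S)
    return {(min(u, v), max(u, v)) for u, v in edges if (u in S) != (v in S)}


# Hypothetical 5-node graph (not the one from the slide's figure).
n = 5
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (4, 5)]
lemma_holds = all(
    support_of_sum(S, edges) == cut_edges(S, edges)
    for r in range(1, n)
    for S in combinations(range(1, n + 1), r)
)
```

An edge with both endpoints in S contributes +1 from its smaller endpoint and -1 from its larger one, so it cancels; an edge with one endpoint in S contributes a single non-zero entry.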

slide-24
SLIDE 24

Recipe: Sketch & Compute on Sketches

slide-31
SLIDE 31

Recipe: Sketch & Compute on Sketches

Sketch: Each player sends $M a_j$.

Central Player Runs Algorithm in Sketch Space:

  • Use $M a_j$ to get incident edge on each node j.
  • For i = 2 to log n: to get incident edge on component $S \subset V$ use:

  $\sum_{j \in S} M a_j = M\left(\sum_{j \in S} a_j\right) \longrightarrow e \in \mathrm{support}\left(\sum_{j \in S} a_j\right) = E(S, V \setminus S)$

Detail: Actually each player sends log n independent sketches $M_1 a_j, M_2 a_j, \ldots$ and the central player uses $M_i a_j$ when emulating the i-th iteration of the algorithm.
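One common way to realize such an M is an l0-sampling sketch built from subsampling levels plus 1-sparse recovery. The class below is a simplified sketch of that idea, not the construction used in the slides: `L0Sketch`, its fields, and the encoding of node pairs as integer indices are all assumptions, and the per-coordinate hash is stored explicitly for readability rather than kept in small space.

```python
import random

P = (1 << 61) - 1  # Mersenne prime used as the fingerprint modulus


class L0Sketch:
    """Toy linear sketch of an integer vector a of length N.

    Level l keeps the coordinates whose hash is divisible by 2^l and stores
    three linear measurements of them: s0 = sum a[e], s1 = sum e*a[e],
    fp = sum a[e] * z^e mod P. A level holding exactly one non-zero
    coordinate reveals its index as s1 / s0.
    """

    def __init__(self, N, seed):
        rng = random.Random(seed)  # same seed => same random map M
        self.N = N
        self.z = rng.randrange(2, P)
        # Stored explicitly for clarity; a real sketch would use a compact hash.
        self.hash = [rng.getrandbits(32) for _ in range(N)]
        self.levels = N.bit_length() + 1
        self.s0 = [0] * self.levels
        self.s1 = [0] * self.levels
        self.fp = [0] * self.levels

    def update(self, e, delta):
        """Apply a[e] += delta (delta may be negative)."""
        for l in range(self.levels):
            if self.hash[e] % (1 << l) == 0:
                self.s0[l] += delta
                self.s1[l] += e * delta
                self.fp[l] = (self.fp[l] + delta * pow(self.z, e, P)) % P

    def __add__(self, other):
        """Return the sketch of a + b; both sketches must share a seed."""
        out = L0Sketch(self.N, 0)
        out.z, out.hash = self.z, self.hash
        out.s0 = [x + y for x, y in zip(self.s0, other.s0)]
        out.s1 = [x + y for x, y in zip(self.s1, other.s1)]
        out.fp = [(x + y) % P for x, y in zip(self.fp, other.fp)]
        return out

    def sample(self):
        """Return an index in support(a), or None if no level is 1-sparse."""
        for l in range(self.levels):
            s0, s1 = self.s0[l], self.s1[l]
            if s0 != 0 and s1 % s0 == 0:
                e = s1 // s0
                if 0 <= e < self.N and self.fp[l] == (s0 * pow(self.z, e, P)) % P:
                    return e
        return None


# Hypothetical encoding: index 0 = pair {1,2}, index 1 = pair {1,3}.
seed = 2024
node1 = L0Sketch(10, seed)
node1.update(0, +1)
node1.update(1, +1)
node2 = L0Sketch(10, seed)
node2.update(0, -1)
merged = node1 + node2       # sketch of a_1 + a_2; the {1,2} coordinate cancels
edge_index = merged.sample()
```

Adding the two sketches gives the same measurements as sketching $a_1 + a_2$ directly, which is the linearity that lets the central player query merged components. The fingerprint test can accept a non-1-sparse level only with very small probability over the choice of z.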

slide-32
SLIDE 32
  • I. Connectivity
  • II. k-Connectivity
  • III. Min-Cut
slide-33
SLIDE 33

Theorem: Checking every cut has size ≥ k
a) Dynamic Graph Stream: O(n k polylog n) space.
b) Simultaneous Messages: O(k polylog n) length.

  • I. Connectivity
  • II. k-Connectivity
  • III. Min-Cut
slide-34
SLIDE 34

Ingredient 1: Basic Algorithm

slide-38
SLIDE 38

Ingredient 1: Basic Algorithm

Algorithm (k-Connectivity):

  • 1. Let $F_1$ be spanning forest of G(V, E)
  • 2. For i = 2 to k:
      2.1. Let $F_i$ be spanning forest of $G(V, E - F_1 - \ldots - F_{i-1})$

Lemma: $G(V, F_1 + \ldots + F_k)$ is k-connected iff G(V, E) is.
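A direct, non-sketched version of this peeling procedure can be written as follows. It assumes a simple graph given as a list of distinct (u, v) pairs on nodes 0..n-1; the function names are illustrative, and the brute-force `min_cut` is only there to check the lemma on tiny inputs.

```python
from itertools import combinations


def spanning_forest(n, edges):
    """Greedy spanning forest via union-find; nodes are 0..n-1."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((u, v))
    return forest


def k_connectivity_certificate(n, edges, k):
    """Union of F_1..F_k, with F_i a spanning forest of E - F_1 - ... - F_{i-1}."""
    remaining = list(edges)
    certificate = []
    for _ in range(k):
        forest = spanning_forest(n, remaining)
        certificate.extend(forest)
        chosen = set(forest)
        remaining = [e for e in remaining if e not in chosen]
    return certificate


def min_cut(n, edges):
    """Brute-force minimum cut size; only suitable for very small n."""
    best = len(edges)
    for r in range(1, n // 2 + 1):
        for side in combinations(range(n), r):
            s = set(side)
            best = min(best, sum((u in s) != (v in s) for u, v in edges))
    return best


complete6 = [(u, v) for u in range(6) for v in range(u + 1, 6)]
certificate3 = k_connectivity_certificate(6, complete6, 3)
```

The certificate has at most k(n-1) edges, and its minimum cut agrees with that of the original graph whenever the latter is below k.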

slide-39
SLIDE 39

Ingredient 2: Connectivity Sketches

slide-44
SLIDE 44

Ingredient 2: Connectivity Sketches

Sketch: Simultaneously construct k independent connectivity sketches $\{M_1 G, M_2 G, \ldots, M_k G\}$.

Run Algorithm in Sketch Space:

  • Use $M_1 G$ to find a spanning forest $F_1$ of G
  • Use $M_2 G - M_2 F_1 = M_2(G - F_1)$ to find $F_2$
  • Use $M_3 G - M_3 F_1 - M_3 F_2 = M_3(G - F_1 - F_2)$ to find $F_3$
  • etc.
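The subtraction steps use only the fact that each $M_i$ is a linear map. Written for a general round i, and treating G and each $F_j$ as the vectors being sketched, the pattern is:

```latex
M_i G \;-\; \sum_{j=1}^{i-1} M_i F_j \;=\; M_i\Bigl(G - \sum_{j=1}^{i-1} F_j\Bigr),
\qquad i = 2, \ldots, k .
```

A plausible reading of why k independent sketches are needed: $F_1, \ldots, F_{i-1}$ are computed from $M_1, \ldots, M_{i-1}$ only, so the graph $G - F_1 - \ldots - F_{i-1}$ does not depend on the randomness of $M_i$, and the with-high-probability guarantee of $M_i$ for a fixed input still applies.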

slide-45
SLIDE 45
  • I. Connectivity
  • II. k-Connectivity
  • III. Min-Cut
slide-46
SLIDE 46
  • I. Connectivity
  • II. k-Connectivity
  • III. Min-Cut

Theorem: $(1+\epsilon)$-approximating minimum cut
a) Dynamic Graph Stream: $O(\epsilon^{-2} n\, \mathrm{polylog}\, n)$ space.
b) Simultaneous Messages: $O(\epsilon^{-2}\, \mathrm{polylog}\, n)$ length.

slide-47
SLIDE 47

Ingredient 1: Subsampling

slide-54
SLIDE 54

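Ingredient 1: Subsampling

Lemma (Karger): Define subgraph $G_i$ by sampling edges w/p $2^{-i}$. Then

  $\text{Min-Cut}(G) = (1 \pm \epsilon) \cdot 2^i \cdot \text{Min-Cut}(G_i)$ if $i < -\log p^*$

where

  $p^* = 6 \epsilon^{-2} \log n / \text{Min-Cut}(G)$

[Figure: the subsampled graphs $G = G_0, G_1, G_2, G_3$]

Suffices to find $\text{Min-Cut}(G_i)$ for some $i < -\log p^*$.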

slide-55
SLIDE 55

Ingredient 2: k-Connectivity

slide-59
SLIDE 59

Ingredient 2: k-Connectivity

k-Connectivity: Given $G_i$ returns subgraph $H_i$ with

  $\text{Min-Cut}(G_i) = \text{Min-Cut}(H_i)$ if $\text{Min-Cut}(G_i) < k$

Lemma: For $k = 12 \epsilon^{-2} \log n$, with high probability

  $\text{Min-Cut}(G_i) < k$ for $i = -\log p^*$

since expectation of $\text{Min-Cut}(G_i)$ is $< 6 \epsilon^{-2} \log n$.

Putting it together: Construct $H_i$ for all i. Return $2^i \cdot \text{Min-Cut}(H_i)$ for smallest i with $\text{Min-Cut}(H_i) < k$.

slide-60
SLIDE 60

Algorithm for Min-Cut

  • 1. For i = {1, . . . , 2 log n}, let $h_i : E \to \{0, 1\}$ be a uniform hash function.
  • 2. For i = {1, . . . , 2 log n},
      (a) Let $G_i$ be the subgraph of G containing edges e such that $\prod_{j \le i} h_j(e) = 1$.
      (b) Let $H_i \leftarrow$ k-Connected($G_i$) for $k = O(\epsilon^{-2} \log n)$.
  • 3. Return $2^j \cdot \text{Min-Cut}(H_j)$, where $j = \min\{i : \text{Min-Cut}(H_i) < k\}$
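The three steps can be sketched offline as below. Several things here are assumptions made for illustration: the sketch-based k-Connected routine is replaced by an exact union of k spanning forests, the hash functions are simulated with a seeded generator, level 0 (the unsampled graph) is also tried as in the "Putting it together" slide, k is passed in by the caller rather than derived from ε, and `min_cut` is brute force for tiny graphs.

```python
import math
import random
from itertools import combinations


def spanning_forest(n, edges):
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((u, v))
    return forest


def k_connected(n, edges, k):
    """Exact stand-in for the sketch-based k-Connected routine."""
    remaining, kept = list(edges), []
    for _ in range(k):
        forest = set(spanning_forest(n, remaining))
        kept.extend(forest)
        remaining = [e for e in remaining if e not in forest]
    return kept


def min_cut(n, edges):
    """Brute-force minimum cut; only for tiny graphs."""
    best = len(edges)
    for r in range(1, n // 2 + 1):
        for side in combinations(range(n), r):
            s = set(side)
            best = min(best, sum((u in s) != (v in s) for u, v in edges))
    return best


def approx_min_cut(n, edges, k, seed=0):
    rng = random.Random(seed)
    levels = 2 * math.ceil(math.log2(n))
    sampled = list(edges)  # level 0 is G itself (an assumption, see lead-in)
    for i in range(levels + 1):
        if i > 0:
            # One fresh hash bit h_i(e) per surviving edge: e stays in G_i only
            # if h_1(e) = ... = h_i(e) = 1, so G_i is a subset of G_{i-1}.
            sampled = [e for e in sampled if rng.randrange(2) == 1]
        c = min_cut(n, k_connected(n, sampled, k))
        if c < k:
            return (2 ** i) * c
    return None  # no level had a small enough minimum cut


cycle6 = [(i, (i + 1) % 6) for i in range(6)]
estimate = approx_min_cut(6, cycle6, k=5)
```

On the 6-cycle the minimum cut (2) is already below k, so the estimate comes from level 0 and is exact; on denser graphs the answer depends on the sampled levels and is only an approximation.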
slide-61
SLIDE 61

Example: Checking Bipartiteness

slide-66
SLIDE 66

Example: Checking Bipartiteness

Idea: Given G, define G' where a node v becomes $v_1$ and $v_2$ and edge (u,v) becomes $(u_1, v_2)$ and $(u_2, v_1)$.

Lemma: G is bipartite iff number of connected components doubles.

Can sketch G' implicitly.

Thm: Õ(n)-dimensional sketch for bipartiteness.
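The reduction can be tried directly on small graphs. This is an offline sketch that builds G' explicitly and counts components with union-find, rather than sketching G' implicitly as the slide suggests; the function names are illustrative.

```python
def count_components(nodes, edges):
    """Number of connected components, via union-find over arbitrary labels."""
    parent = {v: v for v in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in nodes})


def is_bipartite(nodes, edges):
    """G is bipartite iff the doubled graph G' has twice as many components."""
    nodes = list(nodes)
    doubled_nodes = [(v, 1) for v in nodes] + [(v, 2) for v in nodes]
    doubled_edges = []
    for u, v in edges:
        doubled_edges.append(((u, 1), (v, 2)))  # (u1, v2)
        doubled_edges.append(((u, 2), (v, 1)))  # (u2, v1)
    doubled = count_components(doubled_nodes, doubled_edges)
    return doubled == 2 * count_components(nodes, edges)


even_cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
odd_cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
```

An even cycle splits into two disjoint copies in G', while an odd cycle becomes a single cycle of twice the length, which is why the component count separates the two cases.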

slide-67
SLIDE 67

Example: Minimum Spanning Tree

slide-69
SLIDE 69

Example: Minimum Spanning Tree

Idea: Let $n_i$ be number of connected components if we ignore edges with weight $\ge (1+\epsilon)^i$, then:

  $w(\mathrm{MST}) \le \sum_i \epsilon (1+\epsilon)^i n_i \le (1+\epsilon)\, w(\mathrm{MST})$

Thm: Can $(1+\epsilon)$-approximate MST in one-pass dynamic semi-streaming model.
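The link between MST weight and component counts is easiest to see in an exact integer-weight variant, which is swapped in here for illustration and is not the $(1+\epsilon)$-rounded bound on the slide. For a connected graph with integer weights in {1, ..., W}, the MST weight equals $n - W + \sum_{i=1}^{W-1} c_i$, where $c_i$ is the number of components when only edges of weight at most i are kept. The names below are illustrative.

```python
def make_find(parent):
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    return find


def count_components(n, edges):
    parent = list(range(n))
    find = make_find(parent)
    for u, v in edges:
        parent[find(u)] = find(v)
    return sum(1 for x in range(n) if find(x) == x)


def mst_weight_from_counts(n, weighted_edges, W):
    """n - W + sum of c_i for i = 1..W-1 (connected graph, weights in 1..W)."""
    total = n - W
    for i in range(1, W):
        light = [(u, v) for u, v, w in weighted_edges if w <= i]
        total += count_components(n, light)
    return total


def kruskal_weight(n, weighted_edges):
    """Reference MST weight computed with Kruskal's algorithm."""
    parent = list(range(n))
    find = make_find(parent)
    total = 0
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            total += w
    return total


# Small hypothetical connected graph: (u, v, weight) with weights in {1, 2, 3}.
graph = [(0, 1, 1), (0, 2, 1), (1, 2, 2), (2, 3, 3), (0, 3, 3)]
```

Rounding weights to powers of $(1+\epsilon)$ reduces the number of thresholds to consider, at the cost of the $(1+\epsilon)$ factor, and each component count can be obtained from a connectivity sketch.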

slide-70
SLIDE 70

Algorithm for Sparsification

  • 1. For i = {1, . . . , 2 log n}, let $h_i : E \to \{0, 1\}$ be a uniform hash function.
  • 2. For i = {1, . . . , 2 log n},
      (a) Let $G_i$ be the subgraph of G containing edges e such that $\prod_{j \le i} h_j(e) = 1$.
      (b) Let $H_i \leftarrow$ k-Connected($G_i$) for $k = O(\epsilon^{-2} \log^2 n)$.
  • 3. For each edge e = (u, v), find $j = \min\{i : \lambda_e(H_i) < k\}$. If $e \in H_j$, add e to the sparsifier with weight $2^j$.

$\lambda_e(G)$: size of the minimum cut in G separating u and v, for an edge e = (u, v).

Azuma's inequality

A sequence of random variables $X_1, X_2, \ldots$ is called a martingale if for all $i \ge 1$, $E[X_{i+1} \mid X_i] = X_i$. If $|X_{i+1} - X_i| \le c_i$ almost surely for all i, then

  $\Pr[\,|X_n - X_1| \ge t\,] < 2 \exp\left(-\dfrac{t^2}{2 \sum_{i=1}^{n-1} c_i^2}\right)$