Sublinear Algorithms for Big Data, Qin Zhang. Part 3: Sublinear in Time (PowerPoint presentation).



SLIDE 1

Sublinear Algorithms for Big Data

Qin Zhang

SLIDE 2

Part 3: Sublinear in Time

SLIDE 3

Sublinear in time

Given a social network graph, if we have no time to ask everyone, can we still compute something non-trivial? For example, the average number of friends per individual?

SLIDE 4

Average degree of a graph

Problem definition: Given a simple graph G = (V, E) (no parallel edges, no self-loops), its average degree is

  d̄ = ( Σ_{v∈V} d(v) ) / |V|.

SLIDE 5

Average degree of a graph (cont.)

Representation of G: degree + adjacency list. Our algorithms make only the following operations (queries):

  • Degree queries: on v, return d(v).
  • Neighbor queries: on (v, j), return the j-th neighbor of v.
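These two query types can be grounded in a tiny adjacency-list structure; the following is a minimal sketch (the class and method names are my own, not from the slides):

```python
# A minimal adjacency-list graph supporting exactly the two query types
# the slides allow: degree queries and neighbor queries.
class Graph:
    def __init__(self, n):
        self.adj = [[] for _ in range(n)]

    def add_edge(self, u, v):
        self.adj[u].append(v)
        self.adj[v].append(u)

    def degree(self, v):          # degree query: return d(v)
        return len(self.adj[v])

    def neighbor(self, v, j):     # neighbor query: return the j-th neighbor of v
        return self.adj[v][j]

# Example: the path 0-1-2
g = Graph(3)
g.add_edge(0, 1)
g.add_edge(1, 2)
print(g.degree(1))       # 2
print(g.neighbor(1, 0))  # 0
```

A sublinear algorithm is then one that calls `degree` and `neighbor` o(n + m) times in total.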
SLIDE 6

Naive approach fails

Naive sampling: pick a set S consisting of s random nodes, and output

  ( Σ_{i∈S} d(v_i) ) / s.

How large must s be in order to get an O(1) multiplicative approximation? Ω(n)!

SLIDE 7

Naive approach fails (cont.)

In general, given n numbers, estimating their average to within a constant factor requires Ω(n) queries.

SLIDE 8

Naive approach fails (cont.)

But maybe the degree sequences of graphs are special, and we can make use of that? For example:

  • (n − 1, 0, . . . , 0) is NOT a possible degree sequence.
  • (n − 1, 1, . . . , 1) is possible (a star).
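To see the naive estimator struggle on that second sequence, here is a small simulation on the star degree sequence (n − 1, 1, . . . , 1), whose true average degree is just under 2; the constants and seed are arbitrary:

```python
import random

random.seed(0)

# Degree sequence (n-1, 1, ..., 1): a star. True average degree is
# 2(n-1)/n, just under 2.
n = 100_000
degrees = [n - 1] + [1] * (n - 1)
true_avg = sum(degrees) / n

# Naive estimator: average the degrees of s uniformly random nodes.
# With s << n the sample almost surely misses the hub, so the estimate
# is 1 -- already a factor-2 error even on this simple sequence.
s = 100
sample = random.sample(range(n), s)
estimate = sum(degrees[v] for v in sample) / s
print(true_avg, estimate)
```

Unless the hub happens to be sampled (probability about s/n = 0.1% here), the output is exactly 1.0.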
SLIDE 9

Some lower bounds for approximation

An extreme case: a graph with 0 edges vs. a graph with 1 edge. Distinguishing the two (i.e., getting any multiplicative approximation) requires Ω(n) queries.

SLIDE 10

Some lower bounds for approximation (cont.)

Another example:

  • an n-cycle, vs.
  • an (n − c√n)-cycle plus a c√n-clique.

Finding a clique node requires Ω(√n) queries.

SLIDE 11

Some lower bounds for approximation (cont.)

From now on, we will assume the graph has Ω(n) edges.

SLIDE 12

(2 + ε)-approximation

SLIDE 13

The algorithm

Algorithm:

  1. Take subsets S_1, S_2, . . . , S_{8/ε} independently at random from V, each of size Θ(√n / ε^{O(1)}).
  2. Output the smallest number in {d̄_{S_1}, d̄_{S_2}, . . . , d̄_{S_{8/ε}}}, where d̄_{S_i} is the average degree of the nodes in S_i.

Theorem. This algorithm runs in time O(√n / ε^{O(1)}), and with probability 2/3 outputs a (2 + ε)-approximation.

Analysis on board.
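The two steps above can be sketched in Python as follows; the constants are illustrative (the slide leaves the exact subset-size constant inside the Θ(·)), and `degrees[v]` stands in for a degree query:

```python
import math
import random

# Sketch of the (2+eps)-approximation: take ceil(8/eps) random subsets of
# size ~sqrt(n)/eps each, and output the smallest subset-average degree.
def avg_degree_2eps(degrees, eps, rng):
    n = len(degrees)
    k = max(1, math.ceil(8 / eps))                  # number of subsets
    size = max(1, math.ceil(math.sqrt(n) / eps))    # subset size (constant omitted)
    estimates = []
    for _ in range(k):
        s = [rng.randrange(n) for _ in range(size)] # sample with replacement
        estimates.append(sum(degrees[v] for v in s) / len(s))
    return min(estimates)

rng = random.Random(1)
# Sanity check on a complete graph on 200 nodes: every degree is 199, so
# every subset average, and hence the output, is exactly 199.
est = avg_degree_2eps([199] * 200, 0.5, rng)
print(est)  # 199.0
```

Taking the minimum over many subsets guards against the rare subsets that happen to over-sample high-degree nodes.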

SLIDE 14

(1 + ε)-approximation

SLIDE 15

The idea

Idea: group nodes of similar degrees, and estimate the average within each group.

Buckets: set β = ε/c (c is a constant) and t = O(log n / ε) (the number of buckets). For i ∈ {0, . . . , t − 1}, set

  B_i = {v | (1 + β)^{i−1} < d(v) ≤ (1 + β)^i}.

Writing d(X) = Σ_{x∈X} d(x), the total degree of the nodes in B_i satisfies

  d(B_i) ∈ ( (1 + β)^{i−1} |B_i|, (1 + β)^i |B_i| ],

and the total degree of all nodes in V satisfies

  d(V) ∈ ( Σ_i (1 + β)^{i−1} |B_i|, Σ_i (1 + β)^i |B_i| ].

SLIDE 16

The first try

Algorithm:

  1. Take a sample S of size s = 10000 · √n/ε · t.
  2. Let S_i := S ∩ B_i (the samples that fall into the i-th bucket).
  3. Estimate the fraction of nodes in B_i using S_i, that is, ρ_i = |S_i| / s.
  4. Output Σ_i ρ_i (1 + β)^{i−1}.

Note: for all i, E[ρ_i] = E[|S_i| / s] = |B_i| / n.

Does this work? What if, for some level i, |S_i| is small (that is, |B_i| is small)? For those i's, ρ_i will not be very accurate...

SLIDE 17

The second try

Algorithm:

  1. Take a sample S of size s = 10000 · √n/ε · t. Set η = 10000/c.
  2. Let S_i := S ∩ B_i (the samples that fall into the i-th bucket).
  3. For each i, set ρ_i = 0 if |S_i| ≤ η; otherwise set ρ_i = |S_i| / s.
  4. Output Σ_i ρ_i (1 + β)^{i−1}.

Idea: set ρ_i to 0 for small buckets. Note that we no longer have E[ρ_i] = E[|S_i| / s] = |B_i| / n for all i, but we can still show good bounds (on board).

Theorem. This algorithm runs in time O(√n / ε^{O(1)}), and with probability 2/3 outputs a (2 + ε)-approximation.
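A sketch of this bucketed estimator, with made-up constants (c, η, and the sample-size constant here are illustrative, not the slide's): on a 3-regular degree sequence the output is deterministic and lands within the (1 + β) bucket discretization of the true average 3.

```python
import math
import random

# Sketch of the "second try": sample s nodes, bucket them by degree into
# B_i = { v : (1+beta)^(i-1) < d(v) <= (1+beta)^i }, zero out buckets with
# few samples, and output sum_i rho_i * (1+beta)^(i-1).
def bucketed_avg_degree(degrees, eps, rng, c=2.0):
    n = len(degrees)
    beta = eps / c
    t = max(1, math.ceil(math.log(n) / eps))       # number of buckets
    s = min(n, math.ceil(math.sqrt(n) * t / eps))  # sample size (constant omitted)
    eta = 4 / c                                    # small-bucket cutoff (illustrative)

    counts = [0] * (t + 1)
    for _ in range(s):
        d = degrees[rng.randrange(n)]
        if d > 0:
            # index i with (1+beta)^(i-1) < d <= (1+beta)^i
            i = math.ceil(math.log(d, 1 + beta))
            counts[min(max(i, 0), t)] += 1

    total = 0.0
    for i, cnt in enumerate(counts):
        rho = cnt / s if cnt > eta else 0.0        # drop sparse buckets
        total += rho * (1 + beta) ** (i - 1)
    return total

rng = random.Random(7)
est = bucketed_avg_degree([3] * 1000, 0.2, rng)
print(est)
```

Here beta = 0.1, every sampled node falls in the bucket with (1.1)^11 < 3 ≤ (1.1)^12, so the output is exactly 1.1^11 ≈ 2.85: a one-sided underestimate of 3 by at most the factor (1 + β), as the bucket bounds on d(B_i) predict.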
SLIDE 18

An improved algorithm (if neighbor queries are allowed)

Algorithm:

  1. Take a sample S of size s = 1000 · √n/ε · t. Set η = 10000/c.
  2. For all i:
     (a) If |S_i| ≥ η, set ρ_i = |S_i| / s; otherwise set ρ_i = 0.
     (b) For all v ∈ S_i, pick a random neighbor u of v, and set χ(v) = 1 if u is in a small bucket B_j.
     (c) Set α_i = |{v ∈ S_i | χ(v) = 1}| / |S_i|.
  3. Output Σ_i ρ_i (1 + α_i)(1 + β)^{i−1}.

Idea: estimate the degree contributed by edges between large and small buckets (see the analysis on board) more precisely.

Theorem. This algorithm runs in time O(√n / ε^{O(1)}), and with probability 2/3 outputs a (1 + ε)-approximation.
SLIDE 19

Minimum Spanning Tree

SLIDE 20

The connection between #CC and MST

Assume a connected graph G = (V, E) has maximum degree D, and that all edge weights are in {1, 2, . . . , W}. Let G^i = (V, E^i) denote the subgraph containing the edges of weight at most i, and let c_i be the number of connected components of G^i. Then

  MST(G) = n − W + Σ_{i=1}^{W−1} c_i.
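The identity can be checked directly on a toy weighted graph (the graph and weights below are made up for illustration); `num_components` and `mst_weight` are small union-find helpers:

```python
# Check MST(G) = n - W + sum_{i=1}^{W-1} c_i, where c_i is the number of
# connected components of the subgraph with edges of weight <= i.
def num_components(n, edges):
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    comps = n
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            comps -= 1
    return comps

def mst_weight(n, wedges):  # Kruskal, for comparison
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    total = 0
    for w, u, v in sorted(wedges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            total += w
    return total

# 5-cycle 0-1-2-3-4-0 with weights in {1, 2, 3} (W = 3).
n, W = 5, 3
wedges = [(1, 0, 1), (2, 1, 2), (1, 2, 3), (3, 3, 4), (2, 0, 4)]
direct = mst_weight(n, wedges)
via_cc = n - W + sum(
    num_components(n, [(u, v) for w, u, v in wedges if w <= i])
    for i in range(1, W)
)
print(direct, via_cc)  # 6 6
```

Both sides evaluate to 6: Kruskal picks the two weight-1 edges and two weight-2 edges, while the component counts are c_1 = 3 and c_2 = 1, giving 5 − 3 + 3 + 1 = 6.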

SLIDE 21

The connection between #CC and MST (cont.)

We thus only need to approximate c_i for each i = 1, . . . , W − 1.

SLIDE 22

A sublinear algorithm for #CC

  1. Sample a random set of r = c_0/ε² vertices u_1, . . . , u_r.
  2. For each sampled vertex u_i, grow a BFS tree T_{u_i} rooted at u_i as follows:
     (a) Choose X according to Pr[X ≥ k] = 1/k.
     (b) Run BFS starting at u_i until either
        i. the whole connected component containing u_i has been fully explored, or
        ii. X vertices have been explored.
     (c) If the BFS stopped in the first case, set α_i = 1; otherwise set α_i = 0.
  3. Output (n/r) · Σ_{i=1}^r α_i.

Theorem. This algorithm runs in time O(D log n / (ε²ρ)) (D is the maximum degree of nodes in G), and with probability 1 − ρ outputs an answer with additive error εn.
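The steps above can be sketched in Python as follows; c_0 and the test graph are illustrative. The key fact is that a sampled vertex in a component of size s_c is fully explored with probability Pr[X ≥ s_c] = 1/s_c, so each component contributes 1 to E[(n/r)·Σα_i] regardless of its size.

```python
import math
import random
from collections import deque

# Sketch of the sublinear #CC estimator: sample r vertices; for each, draw
# X with Pr[X >= k] = 1/k, and BFS until the component is exhausted or X
# vertices have been explored. c0 = 4 is an illustrative constant.
def estimate_num_components(adj, eps, rng, c0=4):
    n = len(adj)
    r = max(1, int(c0 / eps ** 2))
    total = 0
    for _ in range(r):
        u = rng.randrange(n)
        # X = floor(1/U) for U uniform in (0,1) satisfies Pr[X >= k] = 1/k
        x = math.floor(1 / max(rng.random(), 1e-12))
        visited = {u}
        queue = deque([u])
        popped = 0
        truncated = False
        while queue:
            if popped >= x:        # X vertices explored, component not finished
                truncated = True
                break
            v = queue.popleft()
            popped += 1
            for w in adj[v]:
                if w not in visited:
                    visited.add(w)
                    queue.append(w)
        if not truncated:          # alpha_i = 1: component fully explored
            total += 1
    return n / r * total

# Two components: the paths 0-1-2-3 and 4-5-6-7 (true #CC = 2).
adj = [[1], [0, 2], [1, 3], [2], [5], [4, 6], [5, 7], [6]]
est = estimate_num_components(adj, eps=0.1, rng=random.Random(3))
print(est)
```

With eps = 0.1 the sketch takes r = 400 samples, and the output concentrates near the true count 2 (each sample succeeds with probability 1/4, since both components have size 4).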

SLIDE 23

An improved algorithm for #CC

  1. Sample a random set of r = c_0/ε² vertices u_1, . . . , u_r.
  2. For each sampled vertex u_i, grow a BFS tree T_{u_i} rooted at u_i as follows. Set α_i = 0 and f = 0.
     (∗) Flip a coin and set f = f + 1. If (heads) ∧ (|T_{u_i}| < W = 4/ε) ∧ (no visited vertex has degree > d* = O(d̄/ε)), then let B = |T_{u_i}| and continue to grow T_{u_i} by B steps.
        i. If during any of the B steps the component of G containing u_i has been fully explored, set α_i = 2 if B′ = 0, and α_i = d_{u_i} · 2^f / B′ otherwise, where B′ ∈ [B, 2B] is the number of edges visited in the BFS so far.
        ii. Else, repeat step (∗).
  3. Output (n/(2r)) · Σ_{i=1}^r α_i.

Theorem. This algorithm runs in time O(d̄ log(d̄/ε) / (ε²ρ)), and with probability 1 − ρ outputs an answer with additive error εn.

SLIDE 24

Back to MST

Set ε = φ/(2W) and ρ = 1/(4W) when approximating all the c_i. The total running time is then O(D · W³ · log n / φ²) (to approximate the MST weight up to a factor of 1 + φ).

SLIDE 25

Back to MST (cont.)

The bound can be improved to Õ(D · W / φ²).

SLIDE 26

Some slides are based on Ronitt Rubinfeld’s course http://stellar.mit.edu/S/course/6/sp13/6.893