Sublinear Algorithms for Big Data


  1. Sublinear Algorithms for Big Data. Qin Zhang.

  2. Part 3: Sublinear in Time

  3. Sublinear in time. Given a social network graph, if we have no time to ask everyone, can we still compute something non-trivial? For example, the average number of friends per individual?

  4. Average degree of a graph. Problem definition: Given a simple graph G = (V, E) (no parallel edges, no self-loops), its average degree is d̄ = (Σ_{v∈V} d(v)) / |V|.

  5. Average degree of a graph. Problem definition: Given a simple graph G = (V, E) (no parallel edges, no self-loops), its average degree is d̄ = (Σ_{v∈V} d(v)) / |V|.
     Representation of G: degrees + adjacency lists. Our algorithms only use the following operations (queries):
     • Degree queries: on v, return d(v).
     • Neighbor queries: on (v, j), return the j-th neighbor of v.
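To make the later algorithm sketches concrete, here is one hypothetical way to realize this query model in Python (the representation and function names are ours, not from the slides): the graph is stored as a list of adjacency lists, so both query types become constant-time list operations.

    # Hypothetical representation: adj[v] is the list of neighbors of vertex v.
    adj = [
        [1, 2],   # vertex 0 is adjacent to 1 and 2
        [0],      # vertex 1
        [0],      # vertex 2
    ]

    def degree_query(adj, v):
        """Degree query: return d(v)."""
        return len(adj[v])

    def neighbor_query(adj, v, j):
        """Neighbor query: return the j-th neighbor of v (0-indexed here)."""
        return adj[v][j]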

  6. Naive approach fails. Naive sampling: pick a set S of s random nodes and output Σ_{v_i∈S} d(v_i) / s. How large must s be in order to get an O(1) multiplicative approximation? Ω(n)!

  7. Naive approach fails. Naive sampling: pick a set S of s random nodes and output Σ_{v_i∈S} d(v_i) / s. How large must s be in order to get an O(1) multiplicative approximation? Ω(n)! In general, if we are given n numbers and want to estimate their average, Ω(n) queries are needed.

  8. Naive approach fails. Naive sampling: pick a set S of s random nodes and output Σ_{v_i∈S} d(v_i) / s. How large must s be in order to get an O(1) multiplicative approximation? Ω(n)! In general, if we are given n numbers and want to estimate their average, Ω(n) queries are needed. But maybe degree sequences are special, and we can make use of that?
     • (n − 1, 0, ..., 0) is NOT a possible degree sequence (a vertex of degree n − 1 is adjacent to every other vertex, so no other vertex can have degree 0).
     • (n − 1, 1, ..., 1) is possible (a star).

  9. Some lower bounds for approximation. An extreme case: a graph with 0 edges vs. a graph with 1 edge. It requires Ω(n) queries to distinguish the two (i.e., to get any multiplicative approximation).

  10. Some lower bounds for approximation. An extreme case: a graph with 0 edges vs. a graph with 1 edge. It requires Ω(n) queries to distinguish the two (i.e., to get any multiplicative approximation). Another example:
      • an n-cycle, vs.
      • an (n − c√n)-cycle plus a c√n-clique.
      It requires Ω(√n) queries to find a clique node.

  11. Some lower bounds for approximation. An extreme case: a graph with 0 edges vs. a graph with 1 edge. It requires Ω(n) queries to distinguish the two (i.e., to get any multiplicative approximation). Another example:
      • an n-cycle, vs.
      • an (n − c√n)-cycle plus a c√n-clique.
      It requires Ω(√n) queries to find a clique node.
      We will assume from now on that the graph has Ω(n) edges.

  12. (2 + ε)-approximation

  13. The algorithm.
      Algorithm:
      1. Take subsets S_1, S_2, ..., S_{8/ε} independently at random from V, each of size Θ(√n / ε^{O(1)}).
      2. Output the smallest number in {d̄_{S_1}, d̄_{S_2}, ..., d̄_{S_{8/ε}}}, where d̄_{S_i} is the average degree of the nodes in S_i.
      Analysis on board.
      Theorem: This algorithm runs in time O(√n / ε^{O(1)}) and, with probability 2/3, outputs a (2 + ε)-approximation.
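A minimal Python sketch of this algorithm, under the adjacency-list assumption above. The slide leaves the exact ε exponent in the sample size unspecified (ε^{O(1)}), so the power used below is only a placeholder.

    import math, random

    def avg_degree_2plus_eps(adj, eps):
        """Estimate the average degree within a (2 + eps) factor (sketch)."""
        n = len(adj)
        k = math.ceil(8 / eps)                    # number of independent sample sets
        s = math.ceil(math.sqrt(n) / eps ** 2)    # Theta(sqrt(n) / eps^O(1)); the power 2 is a placeholder
        estimates = []
        for _ in range(k):
            sample = [random.randrange(n) for _ in range(s)]   # sample with replacement
            estimates.append(sum(len(adj[v]) for v in sample) / s)
        # Taking the minimum guards against samples inflated by a few high-degree nodes;
        # the analysis shows no single sample is likely to drop below d_bar / (2 + eps).
        return min(estimates)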

  14. (1 + ε)-approximation

  15. The idea. Idea: group nodes of similar degrees and estimate the average within each group.
      Buckets: set β = ε/c (c is a constant) and t = O(log n / ε) (the number of buckets).
      For i ∈ {0, ..., t − 1}, set B_i = {v | (1 + β)^{i−1} < d(v) ≤ (1 + β)^i}.
      Writing d(X) = Σ_{x∈X} d(x), the total degree of the nodes in B_i satisfies d(B_i) ∈ ((1 + β)^{i−1}·|B_i|, (1 + β)^i·|B_i|],
      and hence the total degree of the nodes in V satisfies d(V) ∈ (Σ_i (1 + β)^{i−1}·|B_i|, Σ_i (1 + β)^i·|B_i|].
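As a quick illustration (Python, with a helper name of our own choosing), the bucket index of a node is just the logarithm of its degree to base (1 + β); the later sketches reuse this helper.

    import math

    def bucket_index(degree, beta):
        """Return i with (1 + beta)**(i - 1) < degree <= (1 + beta)**i, or None for degree 0."""
        if degree <= 0:
            return None            # isolated nodes fall outside every bucket
        return math.ceil(math.log(degree, 1 + beta))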

  16. The first try.
      Algorithm:
      1. Take a sample S of size s = 10000·√n·t / ε.
      2. Let S_i := S ∩ B_i (the samples that fall into the i-th bucket).
      3. Estimate the fraction of nodes in B_i by ρ_i = |S_i| / s. Note: for all i, E[ρ_i] = E[|S_i| / s] = |B_i| / n.
      4. Output Σ_i ρ_i·(1 + β)^{i−1}.
      Does this work? What if, for some level i, |S_i| is small (that is, |B_i| is small)? For those i's, ρ_i will not be very accurate...
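A sketch of this first attempt, reusing the adjacency-list assumption and the bucket_index helper from above.

    import math, random
    from collections import defaultdict

    def avg_degree_first_try(adj, eps, c=2):
        """Bucketed estimator with no correction for sparsely sampled buckets (sketch)."""
        n = len(adj)
        beta = eps / c
        t = math.ceil(math.log(n, 1 + beta)) + 1        # O(log n / eps) buckets
        s = math.ceil(10000 * math.sqrt(n) * t / eps)   # sample size from the slide
        counts = defaultdict(int)                       # counts[i] = |S_i|
        for _ in range(s):
            v = random.randrange(n)
            i = bucket_index(len(adj[v]), beta)         # degree query
            if i is not None:
                counts[i] += 1
        # rho_i = |S_i| / s estimates |B_i| / n; every node in B_i has degree > (1 + beta)**(i - 1)
        return sum((cnt / s) * (1 + beta) ** (i - 1) for i, cnt in counts.items())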

  17. The second try. Idea: set ρ_i to 0 for small buckets.
      Algorithm:
      1. Take a sample S of size s = 10000·c·√n·t / ε. Set the threshold η = 10000.
      2. Let S_i := S ∩ B_i (the samples that fall into the i-th bucket).
      3. For each i, set ρ_i = 0 if |S_i| ≤ η, and ρ_i = |S_i| / s otherwise.
      4. Output Σ_i ρ_i·(1 + β)^{i−1}.
      Note that we no longer have E[ρ_i] = E[|S_i| / s] = |B_i| / n for all i, but we can still show good bounds (on board).
      Theorem: This algorithm runs in time O(√n / ε^{O(1)}) and, with probability 2/3, outputs a (2 + ε)-approximation.
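The change in code is a single threshold test on the bucket counts. Same assumptions, imports, and helpers as in the previous sketch; the default value of eta is only our reading of the (garbled) slide constant.

    def avg_degree_second_try(adj, eps, c=2, eta=10000):
        """As the first try, but buckets hit by at most eta samples contribute zero (sketch)."""
        n = len(adj)
        beta = eps / c
        t = math.ceil(math.log(n, 1 + beta)) + 1
        s = math.ceil(10000 * c * math.sqrt(n) * t / eps)
        counts = defaultdict(int)
        for _ in range(s):
            v = random.randrange(n)
            i = bucket_index(len(adj[v]), beta)
            if i is not None:
                counts[i] += 1
        # Buckets with |S_i| <= eta are zeroed; the analysis bounds the degree mass this discards.
        return sum((cnt / s) * (1 + beta) ** (i - 1)
                   for i, cnt in counts.items() if cnt > eta)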

  18. An improved algorithm (if neighbor queries are allowed). Idea: estimate the degree mass contributed by large-small edges (see the analysis on board) more precisely.
      Algorithm:
      1. Take a sample S of size s = 1000·c·√n·t / ε. Set the threshold η = 10000.
      2. For all i,
         (a) If |S_i| ≥ η, set ρ_i = |S_i| / s; otherwise set ρ_i = 0.
         (b) For all v ∈ S_i, pick a random neighbor u of v; set χ(v) = 1 if u is in a small bucket B_j, and χ(v) = 0 otherwise.
         (c) Set α_i = |{v ∈ S_i | χ(v) = 1}| / |S_i|.
      3. Output Σ_i ρ_i·(1 + α_i)·(1 + β)^{i−1}.
      Theorem: This algorithm runs in time O(√n / ε^{O(1)}) and, with probability 2/3, outputs a (1 + ε)-approximation.
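A sketch of this refinement under the same assumptions (imports and helpers as above). A neighbor query on (v, j) is adj[v][j], and, as one possible reading of the slide, a bucket is treated as "small" exactly when its sample count falls below the threshold.

    def avg_degree_with_neighbor_queries(adj, eps, c=2, eta=10000):
        """(1 + eps)-style estimator: credit large buckets for edges into small buckets (sketch)."""
        n = len(adj)
        beta = eps / c
        t = math.ceil(math.log(n, 1 + beta)) + 1
        s = math.ceil(1000 * c * math.sqrt(n) * t / eps)
        buckets = defaultdict(list)                      # i -> sampled nodes S_i
        for _ in range(s):
            v = random.randrange(n)
            i = bucket_index(len(adj[v]), beta)
            if i is not None:
                buckets[i].append(v)
        large = {i for i, S_i in buckets.items() if len(S_i) >= eta}
        estimate = 0.0
        for i in large:
            S_i = buckets[i]
            rho_i = len(S_i) / s
            hits = 0
            for v in S_i:
                u = random.choice(adj[v])                # neighbor query at a random index
                if bucket_index(len(adj[u]), beta) not in large:
                    hits += 1                            # chi(v) = 1: the neighbor is in a small bucket
            alpha_i = hits / len(S_i)
            estimate += rho_i * (1 + alpha_i) * (1 + beta) ** (i - 1)
        return estimate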

  19. Minimum Spanning Tree

  20. The connection between #CC and MST. Assume the connected graph G = (V, E) has maximum degree D, and that all edge weights lie in {1, 2, ..., W}. Let G_i = (V, E_i) denote the subgraph containing the edges of weight at most i, and let c_i be the number of connected components of G_i. Then MST(G) = n − W + Σ_{i=1}^{W−1} c_i.

  21. The connection between #CC and MST. Assume the connected graph G = (V, E) has maximum degree D, and that all edge weights lie in {1, 2, ..., W}. Let G_i = (V, E_i) denote the subgraph containing the edges of weight at most i, and let c_i be the number of connected components of G_i. Then MST(G) = n − W + Σ_{i=1}^{W−1} c_i. We thus only need to approximate c_i for each i = 1, ..., W − 1.
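A short derivation of this identity (standard reasoning, filled in here): an MST of the connected graph G contains exactly c_i − 1 edges of weight greater than i, because its edges of weight at most i form a spanning forest with the same components as G_i. In LaTeX, with c_0 = n:

    \mathrm{MST}(G)
      = \sum_{e \in \mathrm{MST}} w(e)
      = \sum_{i=0}^{W-1} \#\{\, e \in \mathrm{MST} : w(e) > i \,\}
      = \sum_{i=0}^{W-1} (c_i - 1)
      = n - W + \sum_{i=1}^{W-1} c_i .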

  22. A sublinear algorithm for #CC.
      1. Sample a random set of r = c_0/ε² vertices u_1, ..., u_r.
      2. For each sampled vertex u_i, grow a BFS tree T_{u_i} rooted at u_i as follows.
         (a) Choose X according to Pr[X ≥ k] = 1/k.
         (b) Run BFS starting at u_i until either
             i. the whole connected component containing u_i has been fully explored, or
             ii. X vertices have been explored.
         (c) If the BFS stopped in the first case, set α_i = 1; otherwise set α_i = 0.
      3. Output (n/r)·Σ_{i=1}^{r} α_i.
      Theorem: This algorithm runs in time O(D·log n / (ε²·ρ)), where D is the maximum degree of the nodes in G, and with probability 1 − ρ outputs an answer with additive error εn.
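A Python sketch under the same adjacency-list assumption; a random variable with Pr[X ≥ k] = 1/k can be drawn as ⌊1/U⌋ for U uniform on (0, 1]. The constant c_0 is left as a parameter.

    import math, random
    from collections import deque

    def estimate_num_cc(adj, eps, c0=10):
        """Estimate the number of connected components within additive eps*n (sketch)."""
        n = len(adj)
        r = math.ceil(c0 / eps ** 2)
        total = 0
        for _ in range(r):
            u = random.randrange(n)
            X = math.floor(1 / (1 - random.random()))    # Pr[X >= k] = 1/k
            seen, queue = {u}, deque([u])
            fully_explored = True
            while queue:
                v = queue.popleft()
                for w in adj[v]:
                    if w not in seen:
                        seen.add(w)
                        queue.append(w)
                if len(seen) > X:                        # more than X vertices explored: give up
                    fully_explored = False
                    break
            total += 1 if fully_explored else 0          # alpha_i
        return n * total / r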

  23. An improved algorithm for #CC.
      1. Sample a random set of r = c_0/ε² vertices u_1, ..., u_r.
      2. For each sampled vertex u_i, grow a BFS tree T_{u_i} rooted at u_i as follows. Set α_i = 0 and f = 0.
         (*) Flip a coin and set f = f + 1. If (heads) ∧ (|T_{u_i}| < W = 4/ε) ∧ (no visited vertex has degree > d* = O(d̄/ε)), then let B = |T_{u_i}| and continue to grow T_{u_i} by B steps.
             i. If during any of these B steps the component of G containing u_i becomes fully explored, set α_i = 2 if B′ = 0 and α_i = d_{u_i}·2^f / B′ otherwise, where B′ ∈ [B, 2B] is the number of edges visited in the BFS so far.
             ii. Else, repeat step (*).
      3. Output (n/(2r))·Σ_{i=1}^{r} α_i.
      Theorem: This algorithm runs in time O(d̄·log(d̄/ε) / (ε²·ρ)), and with probability 1 − ρ outputs an answer with additive error εn.

  24. Back to MST. Set ε = φ/(2W) and ρ = 1/(4W) when approximating all the c_i. The total running time will be O(D·W³·log n / φ²) (to approximate MST(G) up to a factor of 1 + φ).

  25. Back to MST. Set ε = φ/(2W) and ρ = 1/(4W) when approximating all the c_i. The total running time will be O(D·W³·log n / φ²) (to approximate MST(G) up to a factor of 1 + φ). This can be improved to Õ(D·W / φ²).
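A quick check of why these parameters suffice, using MST(G) ≥ n − 1 ≥ n/2 for the connected graph G, and a union bound over the W − 1 failure probabilities ρ = 1/(4W):

    \bigl|\widehat{\mathrm{MST}} - \mathrm{MST}(G)\bigr|
      \le \sum_{i=1}^{W-1} |\hat{c}_i - c_i|
      \le (W-1)\,\epsilon n
      = (W-1)\,\frac{\phi n}{2W}
      < \frac{\phi n}{2}
      \le \phi \cdot \mathrm{MST}(G).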

  26. Some slides are based on Ronitt Rubinfeld's course: http://stellar.mit.edu/S/course/6/sp13/6.893

