Sublinear Algorithms for Big Data
Qin Zhang
Part 3: Sublinear in Time
Sublinear in time
Given a social network graph, if we have no time to ask everyone, can we still compute something non-trivial? For example, the average number of friends per person?
Average degree of a graph
Problem definition: Given a simple graph G = (V, E) (no parallel edges or self-loops), its average degree is

d̄ = (1/|V|) Σ_{v∈V} d(v).

Representation of G: degrees + adjacency lists. Our algorithms only make the following operations (queries):
- Degree queries: on v, return d(v).
- Neighbor queries: on (v, j), return the j-th neighbor of v.
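The two query types can be wrapped in a small oracle (a minimal Python sketch; the class name and the adjacency-list representation are illustrative, not from the slides):

```python
class GraphOracle:
    """Degree + adjacency-list access to a simple graph.

    adj maps each vertex to the list of its neighbors; a sublinear
    algorithm only ever touches the graph through the two queries below.
    """
    def __init__(self, adj):
        self.adj = adj

    def degree(self, v):
        # Degree query: return d(v).
        return len(self.adj[v])

    def neighbor(self, v, j):
        # Neighbor query: return the j-th neighbor of v.
        return self.adj[v][j]

# A 4-cycle: every vertex has degree 2.
G = GraphOracle({0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]})
print(G.degree(0), G.neighbor(0, 1))  # 2 3
```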
Naive approach fails
Naive sampling: pick a set S of s random nodes and output (1/s) Σ_{i∈S} d(v_i). How large must s be to get an O(1) multiplicative approximation? Ω(n)!
In general, given n numbers, estimating their average requires Ω(n) queries.
But maybe degree sequences are special, and we can exploit that?
- (n − 1, 0, . . . , 0) is NOT a possible degree sequence: a vertex of degree n − 1 is adjacent to every other vertex, so no other vertex can have degree 0.
- (n − 1, 1, . . . , 1) is possible (the star graph).
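The star graph also shows why naive sampling fails: it almost never hits the center vertex, whose degree carries half of the total. A small Python sketch (sample size and seed are arbitrary):

```python
import random

n = 10**6
degrees = [n - 1] + [1] * (n - 1)    # star graph: center plus n-1 leaves
true_avg = sum(degrees) / n           # = 2(n-1)/n, roughly 2

random.seed(0)
s = 1000
sample = [degrees[random.randrange(n)] for _ in range(s)]
est = sum(sample) / s
# With high probability no sample hits the center, so the estimate is
# about 1: only half of the true average degree.
print(true_avg, est)
```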
Some lower bounds for approximation
An extreme case: a graph with 0 edges vs. a graph with 1 edge.
Distinguishing the two (i.e., getting any multiplicative approximation) requires Ω(n) queries.
Another example:
- an n-cycle, vs.
- an (n − c√n)-cycle plus a c√n-clique.
Finding a clique node requires Ω(√n) queries.
We will assume the graph has Ω(n) edges from now on.
(2 + ε)-approximation
The algorithm
Algorithm
1. Take subsets S_1, S_2, . . . , S_{8/ε} independently at random from V, each of size Θ(√n/ε^{O(1)}).
2. Output the smallest number in {d̄_{S_1}, d̄_{S_2}, . . . , d̄_{S_{8/ε}}}, where d̄_{S_i} is the average degree of the nodes in S_i.

Theorem. This algorithm runs in time O(√n/ε^{O(1)}), and with probability 2/3 outputs a (2 + ε)-approximation.

Analysis on board.
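A sketch of this algorithm in Python (the constants inside Θ(·) are dropped, and `degree` stands for the degree-query oracle; both choices are illustrative):

```python
import math
import random

def avg_degree_2eps(degree, n, eps, rng=None):
    """Smallest-of-averages estimator: take 8/eps independent sample
    sets of about sqrt(n)/eps vertices each and return the minimum
    sample average degree."""
    rng = rng or random.Random(0)
    k = math.ceil(8 / eps)               # number of sample sets
    s = math.ceil(math.sqrt(n) / eps)    # size of each set (constants omitted)
    best = float("inf")
    for _ in range(k):
        S = [rng.randrange(n) for _ in range(s)]   # uniform random vertices
        best = min(best, sum(degree(v) for v in S) / s)
    return best

# On a 2-regular graph every sample set averages exactly 2.
print(avg_degree_2eps(lambda v: 2, 10**4, 0.5))  # 2.0
```

Taking the minimum over many sets is what guards against sample sets that happen to hit unusually high-degree vertices.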
(1 + ε)-approximation
The idea
Idea: group nodes of similar degrees and estimate the average within each group.
Buckets: set β = ε/c (c a constant) and t = O(log n/ε) (# buckets).
For i ∈ {0, . . . , t − 1}, set B_i = {v | (1 + β)^{i−1} < d(v) ≤ (1 + β)^i}.
Writing d(X) = Σ_{x∈X} d(x), the total degree of the nodes in B_i satisfies
d(B_i) ∈ ((1 + β)^{i−1} |B_i|, (1 + β)^i |B_i|],
and the total degree of the nodes in V satisfies
d(V) ∈ (Σ_i (1 + β)^{i−1} |B_i|, Σ_i (1 + β)^i |B_i|].
The first try
Algorithm
1. Take a sample S of size s = 10000√n · t/ε.
2. Let S_i := S ∩ B_i (the samples that fall into the i-th bucket).
3. Set ρ_i = |S_i|/s (an estimate of the fraction of nodes in B_i).
4. Output Σ_i ρ_i (1 + β)^{i−1}.

Note: ∀i, E[ρ_i] = E[|S_i|/s] = |B_i|/n.

Does this work? What if, for some level i, |S_i| is small (that is, |B_i| is small)? For those i's, ρ_i will not be very accurate...
The second try
Algorithm
1. Take a sample S of size s = 10000√n · t/ε. Set η = 10000/c.
2. Let S_i := S ∩ B_i (the samples that fall into the i-th bucket).
3. For each i, set ρ_i = 0 if |S_i| ≤ η, and ρ_i = |S_i|/s otherwise.
4. Output Σ_i ρ_i (1 + β)^{i−1}.

Idea: set ρ_i to 0 for small buckets. Note that we no longer have ∀i, E[ρ_i] = E[|S_i|/s] = |B_i|/n, but we can still show good bounds (on board).

Theorem. This algorithm runs in time O(√n/ε^{O(1)}), and with probability 2/3 outputs a (2 + ε)-approximation.
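A sketch of the bucketing estimator (Python; the threshold `eta` and the sample-size constant are placeholders, not the slide's values):

```python
import math
import random

def avg_degree_buckets(degree, n, eps, c=2.0, eta=4, rng=None):
    """Bucket sampled vertices by degree, zero out small buckets, and
    sum rho_i * (1 + beta)^(i-1) over the remaining ones."""
    rng = rng or random.Random(0)
    beta = eps / c
    t = math.ceil(math.log(n) / eps) + 1     # number of buckets
    s = math.ceil(math.sqrt(n) * t / eps)    # sample size (constant omitted)
    counts = [0] * t                         # counts[i] = |S_i|
    for _ in range(s):
        d = degree(rng.randrange(n))
        if d == 0:
            continue                         # isolated vertices contribute nothing
        # bucket index i with (1+beta)^(i-1) < d <= (1+beta)^i
        i = max(0, math.ceil(math.log(d, 1 + beta)))
        if i < t:
            counts[i] += 1
    return sum((cnt / s) * (1 + beta) ** (i - 1)
               for i, cnt in enumerate(counts) if cnt > eta)

# 2-regular graph: every sample lands in one bucket, so the estimate is
# that bucket's lower endpoint (1+beta)^(i-1), within a (1+beta) factor of 2.
est = avg_degree_buckets(lambda v: 2, 10**4, 0.5)
print(est)
```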
An improved algorithm (if neighbor queries are allowed)
Algorithm
1. Take a sample S of size s = 1000√n · t/ε. Set η = 10000/c.
2. For all i:
 (a) If |S_i| ≥ η, set ρ_i = |S_i|/s; otherwise set ρ_i = 0.
 (b) For all v ∈ S_i, pick a random neighbor u of v, and set χ(v) = 1 if u is in a small bucket B_j.
 (c) Set α_i = |{v ∈ S_i | χ(v) = 1}| / |S_i|.
3. Output Σ_i ρ_i (1 + α_i)(1 + β)^{i−1}.

Idea: estimate the degree contributed by large-small edges (see the analysis on board) more precisely.

Theorem. This algorithm runs in time O(√n/ε^{O(1)}), and with probability 2/3 outputs a (1 + ε)-approximation.
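A sketch of the refined estimator with neighbor queries (Python; the oracle interface, the threshold `eta`, and the constants are illustrative):

```python
import math
import random

def avg_degree_refined(degree, neighbor, n, eps, c=2.0, eta=4, rng=None):
    """As the bucketing estimator, but each sampled vertex also probes
    one random neighbor; alpha_i is the fraction whose neighbor lies in
    a small bucket, inflating the bucket's contribution by (1+alpha_i)."""
    rng = rng or random.Random(0)
    beta = eps / c
    t = math.ceil(math.log(n) / eps) + 1
    s = math.ceil(math.sqrt(n) * t / eps)

    def bucket(d):
        return max(0, math.ceil(math.log(d, 1 + beta)))

    samples = [[] for _ in range(t)]        # S_i per bucket
    for _ in range(s):
        v = rng.randrange(n)
        d = degree(v)
        if d == 0:
            continue                        # isolated vertices contribute nothing
        i = bucket(d)
        if i < t:
            samples[i].append(v)
    small = [len(S) < eta for S in samples] # buckets whose rho_i is zeroed
    est = 0.0
    for i, S in enumerate(samples):
        if small[i]:
            continue
        hits = 0
        for v in S:
            # chi(v) = 1 iff a random neighbor of v is in a small bucket
            u = neighbor(v, rng.randrange(degree(v)))
            j = bucket(degree(u))
            if j < t and small[j]:
                hits += 1
        est += (len(S) / s) * (1 + hits / len(S)) * (1 + beta) ** (i - 1)
    return est

# On an n-cycle all vertices share one (large) bucket, so alpha_i = 0.
n = 1000
est = avg_degree_refined(lambda v: 2,
                         lambda v, j: (v + 1) % n if j else (v - 1) % n,
                         n, 0.5)
print(est)
```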
Minimum Spanning Tree
The connection between # CC and MST

Assume a connected graph G = (V, E) has max degree D, and that all edge weights are in {1, 2, . . . , W}. Let G^(i) = (V, E^(i)) denote the subgraph containing the edges of weight at most i, and let c_i be the # connected components in G^(i). We have

MST(G) = n − W + Σ_{i=1}^{W−1} c_i,

since any MST contains exactly c_i − 1 edges of weight greater than i, and summing these counts over i = 0, . . . , W − 1 (with c_0 = n) charges each MST edge once per unit of its weight.

We thus only need to approximate c_i for each i = 1, . . . , W − 1.
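The identity is easy to check by brute force on a small weighted graph (a Python sketch; the example graph is made up, and Kruskal's algorithm serves only as a reference for the left-hand side):

```python
def components(n, edges):
    """# connected components, via union-find with path halving."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for u, v, _ in edges:
        parent[find(u)] = find(v)
    return len({find(x) for x in range(n)})

def mst_weight(n, edges):
    """MST weight via Kruskal's algorithm."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    total = 0
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        if find(u) != find(v):
            parent[find(u)] = find(v)
            total += w
    return total

# A 5-cycle with a chord; weights in {1, ..., W} with W = 3.
n, W = 5, 3
edges = [(0, 1, 1), (1, 2, 2), (2, 3, 1), (3, 4, 3), (4, 0, 2), (0, 2, 3)]
lhs = mst_weight(n, edges)
# c_i = # components of the subgraph with edges of weight <= i
rhs = n - W + sum(components(n, [e for e in edges if e[2] <= i])
                  for i in range(1, W))
print(lhs, rhs)  # 6 6 -- the two sides agree
```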
A sublinear algorithm for # CC
Algorithm
1. Sample a random set of r = c_0/ε² vertices u_1, . . . , u_r (c_0 a constant).
2. For each sampled vertex u_i, grow a BFS tree T_{u_i} rooted at u_i as follows:
 (a) Choose X according to Pr[X ≥ k] = 1/k.
 (b) Run BFS starting at u_i until either
  i. the whole connected component containing u_i has been fully explored, or
  ii. X vertices have been explored.
 (c) If the BFS stopped in the first case, set α_i = 1; otherwise set α_i = 0.
3. Output (n/r) Σ_{i=1}^r α_i.

Theorem. This algorithm runs in time O(D log n/(ε²ρ)) (D is the maximum degree of the nodes in G), and with probability 1 − ρ outputs an answer with additive error εn.
An improved algorithm for # CC
Algorithm
1. Sample a random set of r = c_0/ε² vertices u_1, . . . , u_r.
2. For each sampled vertex u_i, grow a BFS tree T_{u_i} rooted at u_i as follows. Set α_i = 0 and f = 0.
 ∗ Flip a coin. Set f = f + 1. If (head) ∧ (|T_{u_i}| < W = 4/ε) ∧ (no visited vertex has degree > d∗ = O(d̄/ε)), then let B = |T_{u_i}| and continue to grow T_{u_i} by B steps.
  i. If during any of the B steps the component of G containing u_i has been fully explored, then set α_i = 2 if B′ = 0, and α_i = d_{u_i} · 2^f/B′ otherwise, where B′ ∈ [B, 2B] is the # edges visited in the BFS so far.
  ii. Else, repeat step ∗.
3. Output (n/(2r)) Σ_{i=1}^r α_i.

Theorem. This algorithm runs in time O(d̄ log(d̄/ε)/(ε²ρ)), and with probability 1 − ρ outputs an answer with additive error εn.
Back to MST
Set ε = φ/(2W) and ρ = 1/(4W) when approximating all the c_i. The total running time is O(D · W³ · log n/φ²) (to approximate MST(G) up to a factor of 1 + φ). This can be improved to Õ(DW/φ²).