Towards Unbiased BFS Sampling

Maciej Kurant, EECS Dept, University of California, Irvine, maciej.kurant@gmail.com
Athina Markopoulou, EECS Dept, University of California, Irvine, athina@uci.edu
Patrick Thiran, School of Computer & Comm. Sciences, EPFL, Lausanne, Switzerland, patrick.thiran@epfl.ch

arXiv:1102.4599v1 [cs.SI] 22 Feb 2011

Abstract: Breadth First Search (BFS) is a widely used approach for sampling large unknown Internet topologies. Its main advantage over random walks and other exploration techniques is that a BFS sample is a plausible graph on its own, and therefore we can study its topological characteristics. However, it has been empirically observed that incomplete BFS is biased toward high-degree nodes, which may strongly affect the measurements.

In this paper, we first analytically quantify the degree bias of BFS sampling. In particular, we calculate the node degree distribution expected to be observed by BFS as a function of the fraction $f$ of covered nodes, in a random graph $RG(p_k)$ with an arbitrary degree distribution $p_k$. We also show that, for $RG(p_k)$, all commonly used graph traversal techniques (BFS, DFS, Forest Fire, Snowball Sampling, RDS) suffer from exactly the same bias.

Next, based on our theoretical analysis, we propose a practical BFS-bias correction procedure. It takes as input a collected BFS sample together with its fraction $f$. Even though $RG(p_k)$ does not capture many graph properties common in real-life graphs (such as assortativity), our $RG(p_k)$-based correction technique performs well on a broad range of Internet topologies and on two large BFS samples of Facebook and Orkut networks.

Finally, we consider and evaluate a family of alternative correction procedures, and demonstrate that, although they are unbiased for an arbitrary topology, their large variance makes them far less effective than the $RG(p_k)$-based technique.

Index Terms: BFS, Breadth First Search, graph sampling, estimation, bias correction, Internet topologies, Online Social Networks.

[Figure 1. Vertical axis: average node degree (expected observed $\langle q_k \rangle$), with levels marked at $\langle k^2 \rangle / \langle k \rangle$ and $\langle k \rangle$. Horizontal axis: fraction of sampled nodes $f$, from 0 to 1. Curves: Random Walk (RW); graph traversal techniques (BFS, DFS, Forest Fire, Snowball / RDS); Metropolis-Hastings Random Walk (MHRW).]

Fig. 1. Overview of analytical results. We calculate the node degree distribution $q_k$ expected to be observed by BFS in a random graph $RG(p_k)$ with a given degree distribution $p_k$, as a function of the fraction of sampled nodes $f$. (In this plot, we show only its average $\langle q_k \rangle$.) We show RW and MHRW as a reference. $\langle k \rangle = \langle p_k \rangle$ is the real average node degree, and $\langle k^2 \rangle$ is the real average squared node degree. Observations: (1) For a small sample size, BFS has the same bias as RW; with increasing $f$, the bias decreases; a complete BFS ($f = 1$) is unbiased, as is MHRW (or uniform sampling). (2) All common graph traversal techniques (that do not revisit the same node) lead to the same bias. (3) The shape of the BFS curve depends on the real node degree distribution $p_k$, but it is always monotonically decreasing; we calculate it precisely in this paper. (4) We also calculate the original distribution $p_k$ based on the sampled $q_k$ and $f$ (not shown here).

I. INTRODUCTION

A large body of work in the networking community focuses on Internet topology measurements at various levels, including the IP or AS connectivity, the Web (WWW), peer-to-peer (P2P) and online social networks (OSN). The size of these networks and other restrictions make measuring the entire graph impossible. For example, learning only the topology of the Facebook social graph would require downloading more than 250 TB of HTML data [2,3], which is most likely impractical. Instead, researchers typically collect and study a small but representative sample of the underlying graph.

In this paper, we are particularly interested in sampling networks that naturally allow us to explore the neighbors of a given node (which is the case in WWW, P2P and OSN). A number of graph exploration techniques use this basic operation for sampling. They can be roughly classified into two categories: (i) random walks, and (ii) graph traversals.

In the first category, random walks, nodes can be revisited. This category includes the classic Random Walk (RW) [4] and its variations [5,6], as well as the Metropolis-Hastings Random Walk (MHRW). They are used for sampling of nodes on the Web [7], P2P networks [8]–[10], OSNs [2,11] and large graphs in general [12]. Random walks are well studied [4] and result in samples that have either no bias (MHRW) or a known bias (RW) that can be corrected for [13]–[16]. In contrast to BFS, random walks collect a representative sample of nodes rather than of topology, and are therefore not the focus of the paper. However, we use them as a baseline for comparison.

In the second category, graph traversals, each node is visited exactly once (if we let the process run until completion and if the graph is connected). These methods vary in the order in which they visit the nodes; examples include BFS, Depth-First Search (DFS), Forest Fire (FF), Snowball Sampling (SBS) and Respondent-Driven Sampling (RDS)¹. Graph traversals, especially BFS, are very popular and widely used for sampling Internet topologies, e.g., in WWW [17] or OSNs [18]–[20]. [19] alone has about 380 citations as of December 2010, many of which use its Orkut BFS sample. The main reason for this high popularity is that a BFS sample is a plausible graph on its own. Consequently, we can study its topological characteristics (e.g., shortest path lengths,

¹ RDS is essentially SBS equipped with some bias correction procedure (omitted in Fig. 1).

This paper is a revised and extended version of [1].
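The degree bias summarized in Fig. 1 can be illustrated with a small simulation. The sketch below is not the authors' code or their analytical calculation; it is a minimal experiment under stated assumptions: a hypothetical two-valued degree sequence (1800 nodes of degree 2 and 200 of degree 20), a configuration-style random graph in which self-loops and duplicate edges are simply dropped, and a BFS that restarts from a random unvisited node when a component is exhausted. The helper names `configuration_graph`, `bfs_sample` and `mean_degree` are illustrative only. The value $\langle k^2 \rangle / \langle k \rangle$ is computed as the well-known average degree seen by a random walk in its stationary regime.

```python
import random
from collections import deque


def configuration_graph(degrees, rng):
    """Pair up edge stubs at random; drop self-loops and duplicate edges.

    Assumption of this sketch: realized degrees may end up slightly
    below the requested ones because of the dropped edges.
    """
    stubs = [v for v, d in enumerate(degrees) for _ in range(d)]
    rng.shuffle(stubs)
    adj = [set() for _ in degrees]
    for a, b in zip(stubs[::2], stubs[1::2]):
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def bfs_sample(adj, fraction, rng):
    """Return nodes in BFS visiting order until `fraction` of nodes is covered.

    Assumption of this sketch: when the queue empties before the target
    is reached, BFS restarts from a random not-yet-visited node.
    """
    n = len(adj)
    target = max(1, round(fraction * n))
    order = list(range(n))
    rng.shuffle(order)
    seeds = iter(order)
    visited, sample, queue = set(), [], deque()
    while len(sample) < target:
        if not queue:
            start = next(v for v in seeds if v not in visited)
            visited.add(start)
            queue.append(start)
        v = queue.popleft()
        sample.append(v)
        for u in adj[v]:
            if u not in visited:
                visited.add(u)
                queue.append(u)
    return sample


def mean_degree(adj, nodes):
    """Average degree of the given nodes, measured in the full graph."""
    return sum(len(adj[v]) for v in nodes) / len(nodes)


rng = random.Random(0)
degrees = [2] * 1800 + [20] * 200          # hypothetical degree sequence
adj = configuration_graph(degrees, rng)

k = [len(neighbors) for neighbors in adj]
true_mean = mean_degree(adj, range(len(adj)))      # <k>
rw_mean = sum(d * d for d in k) / sum(k)           # <k^2> / <k>

# Average degree observed by BFS for several sampled fractions f.
observed = {f: mean_degree(adj, bfs_sample(adj, f, rng)) for f in (0.05, 0.5, 1.0)}

print(f"<k> = {true_mean:.2f}, <k^2>/<k> = {rw_mean:.2f}")
for f, value in observed.items():
    print(f"f = {f:.2f}: observed average degree = {value:.2f}")
```

In this toy setting the observed average degree should be well above $\langle k \rangle$ for small $f$ and fall back to $\langle k \rangle$ at $f = 1$, in line with observations (1) and (3) of Fig. 1. Degrees are measured in the full graph rather than in the sampled subgraph, which corresponds to a crawler that learns each visited node's complete neighbor list.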
