Sampling Algorithms for Data Collection in Online Networks

Carter T. Butts (1,2), Minas Gjoka (3), Maciej Kurant (4), Athina Markopoulou (3)

(1) Department of Sociology, (2) Institute for Mathematical Behavioral Sciences, (3) Department of Electrical Engineering and Computer Science, University of California, Irvine; (4) EPFL, Lausanne

Prepared for the August 25, 2009 UCI MURI AHM. This work was supported by DOD ONR award N00014-8-1-1015.

The Network Sampling Problem
- Online networks of increasing interest - and obvious importance for our project
  - Dramatically enhanced data availability versus offline sources
  - Increasingly relevant to studies of behavior in developed-world context
- The problem: online networks are harder to study than they appear
  - Many are true "population" networks w/out strong subgroup boundaries and w/10^6-10^8 nodes
  - Generally, no sampling frame; populations "hidden" from a survey point of view
- Important frontier: principled sampling methods for online networks
  - Today, a quick look at some of our recent work in this area

Extant Methods
- Primary family of methods: link-trace sampling
  - Exploits network structure for sampling purposes
  - Basic idea: find nodes by following links from an initial seed set
  - Many, many variants
  - Some offline
- Some examples
  - Breadth-first search (BFS)
    - Visit all nodes at distances 1, 2, ... from seed
  - Random Walk sampling (RW)
    - Choose random neighbor of a node to visit
    - Repeat above step for a "long" time
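The two example procedures above can be sketched in a few lines. This is a minimal illustration, assuming the network is available as an in-memory adjacency dictionary; the names `adj`, `bfs_sample`, and `random_walk_sample` are hypothetical, and a real crawler would replace the dictionary lookup with a request to the online service.

```python
import random
from collections import deque

def bfs_sample(adj, seed, max_nodes):
    """Collect nodes in order of distance from the seed, up to max_nodes."""
    visited = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue and len(visited) < max_nodes:
        v = queue.popleft()
        for u in adj[v]:
            if u not in seen:
                seen.add(u)
                visited.append(u)
                queue.append(u)
                if len(visited) >= max_nodes:
                    break
    return visited

def random_walk_sample(adj, seed, steps, rng=None):
    """Simple random walk: move to a uniformly chosen neighbor at every step."""
    rng = rng or random.Random()
    walk = [seed]
    for _ in range(steps):
        walk.append(rng.choice(adj[walk[-1]]))
    return walk

# Toy undirected graph (an assumption for illustration only).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
bfs_nodes = bfs_sample(adj, seed=0, max_nodes=3)
rw_nodes = random_walk_sample(adj, seed=0, steps=10, rng=random.Random(1))
```

Note that the BFS sample is fully determined by the seed and the neighbor ordering, while the random walk revisits nodes and returns a sequence with repeats.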

Challenges to Effective Use
- Lack of a known equilibrium distribution
  - BFS, most ad hoc methods badly biased (unless whole network is captured)
  - RW biased, but converges to an equilibrium proportional to d(v) in the undirected, connected case
    - Can observe d(v), and thus adjust post-hoc (weighting each draw by 1/d(v))
  - Directed case harder - can derive in theory, but not easily measure
- Unverified convergence
  - For methods with an equilibrium, need to verify convergence
  - These are really just MCMC methods; same issues apply
  - Methods do exist, but were not previously applied to this problem
  - One area of progress: application of MCMC diagnostics to network sampling procedures
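The post-hoc adjustment for RW can be illustrated with a small sketch. In the undirected, connected case a simple random walk visits nodes with probability proportional to degree, so weighting each visited node by 1/d(v) undoes the bias when estimating a population mean. The function name `reweighted_mean` and the toy numbers are assumptions for illustration; the sample below is constructed by hand to be exactly degree-proportional rather than drawn from a real walk.

```python
def reweighted_mean(values, degrees):
    """Estimate a uniform-population mean from degree-biased draws.

    Each draw is weighted by 1/degree, which cancels the degree-proportional
    visiting frequency of a simple random walk at equilibrium.
    """
    weights = [1.0 / d for d in degrees]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Three nodes with degrees 1, 2, 3 and attribute values 10, 20, 30.
# A degree-proportional sample contains them 1, 2, and 3 times respectively.
sample_values = [10, 20, 20, 30, 30, 30]
sample_degrees = [1, 2, 2, 3, 3, 3]

naive = sum(sample_values) / len(sample_values)            # biased toward high degree
adjusted = reweighted_mean(sample_values, sample_degrees)  # close to the true mean of 20
```

The naive average is pulled toward the high-degree node, while the reweighted estimate recovers the unweighted mean over the three nodes.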

Avoiding Bias with MCMC Theory
- Why not derive a link-trace design that has a uniform (or other target) equilibrium distribution?
- Metropolis-Hastings Random Walk Sampling (MHRW)
  - Like simple RW, but rejects moves proportionally to ratio of old/new degrees
  - Equilibrium is uniform on sampled component (for version shown)
  - If converged, sample does not require reweighting for standard applications
- MHRW algorithm:

    initialize: G, v^(0) ∈ V
    Let CONVERGED := FALSE
    Let i := 0
    while !CONVERGED do
      Let i := i+1
      Draw v^(i) from Unif(N(v^(i-1)))
      if Unif(0,1) > d(v^(i-1))/d(v^(i)) then
        Let v^(i) := v^(i-1)
      endif
      if v^(0),...,v^(i) has converged then
        Let CONVERGED := TRUE
      endif
    endwhile
    return v^(0),...,v^(i)
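A runnable sketch of the MHRW loop is given here. It departs from the slide in one respect, which is an assumption made to keep the example short: the walk runs for a fixed number of steps instead of stopping on a convergence test. The adjacency dictionary `adj` and the function name `mhrw_sample` are hypothetical.

```python
import random
from collections import Counter

def mhrw_sample(adj, seed, steps, rng=None):
    """Metropolis-Hastings random walk with a uniform target distribution.

    A neighbor is proposed uniformly at random and accepted with probability
    min(1, d(current) / d(proposal)); on rejection the walk stays in place.
    """
    rng = rng or random.Random()
    walk = [seed]
    for _ in range(steps):
        current = walk[-1]
        proposal = rng.choice(adj[current])
        if rng.random() <= len(adj[current]) / len(adj[proposal]):
            walk.append(proposal)   # move accepted
        else:
            walk.append(current)    # move rejected: repeat the current node
    return walk

# Toy undirected, connected graph with unequal degrees (illustration only).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walk = mhrw_sample(adj, seed=0, steps=50000, rng=random.Random(0))
counts = Counter(walk)
freq = {v: counts[v] / len(walk) for v in adj}
```

On this toy graph the visit frequencies in `freq` should each be close to 1/4 despite the degrees ranging from 1 to 3, whereas a simple random walk would visit node 2 three times as often as node 3.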

Application: Probability Sampling of Facebook Users
- Large online service (>2x10^8 users at time of study)
  - Can no longer sample directly (Could before, but few knew this!)
- Comparative study of sampling methods, using convergence diagnostics (M. Gjoka et al., 2009)
  - Goal: probability sample of non-isolate, publicly viewable users
  - Methods: BFS, RW, MHRW, Uniform (reference sample)
  - 28 seeds from uniform sample used to launch independent parallel traces
  - Each trace continued for exactly 81K steps (except Uniform, fixed at 982K)
  - Within-chain (Geweke's z_G) and between-chain (Gelman-Rubin's R̂) metrics used to extract final samples for RW, MHRW
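The two diagnostics named above can be sketched as follows. Both functions are simplified textbook versions and are not taken from the study: the Geweke statistic here uses plain sample variances where the standard form uses spectral density estimates, and the 10%/50% window sizes are the conventional defaults, assumed rather than sourced. The monitored quantity would be a scalar recorded along each trace (for example, node degree); the synthetic Gaussian chains below only stand in for such traces.

```python
import math
import random
import statistics

def geweke_z(trace, first=0.1, last=0.5):
    """Within-chain check: compare the mean of an early window to a late window."""
    n = len(trace)
    a = trace[: int(first * n)]
    b = trace[n - int(last * n):]
    var_a = statistics.variance(a) / len(a)
    var_b = statistics.variance(b) / len(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(var_a + var_b)

def gelman_rubin(chains):
    """Between-chain check: potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.mean(c) for c in chains]
    within = statistics.mean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)

# Synthetic stand-in for four parallel traces of some scalar node property.
rng = random.Random(0)
chains = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
r_hat = gelman_rubin(chains)
z = geweke_z(chains[0])
```

Values of `r_hat` near 1 and small absolute values of `z` are consistent with convergence; chains that disagree with each other push `r_hat` well above 1, and a drifting trace produces a large absolute `z`.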

Convergence for the MHRW Algorithm
- Overall: acceptable convergence between 500 and 3000 iterations (depending on measure) (M. Gjoka et al., 2009)

Comparative Estimation of Local Properties (M. Gjoka et al., 2009)

Comparative Estimation of the Degree Distribution (M. Gjoka et al., 2009)

Expansion: Multigraph Sampling
- Often, no one network on a given population supports sampling
  - May be fragmented, or clustered/heterogeneous (slowing convergence)
- Solution: multigraph sampling
  - Walk on multiple graphs, or unions of graphs
  - Much better properties, esp if uncorrelated
  - Individual networks need not be well-connected to be useful
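A small sketch may make the union idea concrete. It assumes a simple random walk on the union multigraph, where a node's neighbor list is the concatenation of its neighbor lists across the individual graphs; this is one possible reading of "walk on unions of graphs", not necessarily the authors' design. The relation names `friendship` and `group` and both function names are hypothetical. As with any simple random walk, the equilibrium is proportional to total degree, so estimates would still need degree reweighting.

```python
import random

def union_neighbors(graphs, v):
    """Neighbors of v in the union multigraph (parallel edges appear as repeats)."""
    return [u for adj in graphs for u in adj.get(v, [])]

def multigraph_walk(graphs, seed, steps, rng=None):
    """Simple random walk that may follow an edge from any of the given graphs."""
    rng = rng or random.Random()
    walk = [seed]
    for _ in range(steps):
        walk.append(rng.choice(union_neighbors(graphs, walk[-1])))
    return walk

# Two relations on the same four nodes; each is fragmented on its own.
friendship = {0: [1], 1: [0], 2: [3], 3: [2]}
group = {1: [2], 2: [1]}

single = multigraph_walk([friendship], seed=0, steps=100, rng=random.Random(0))
combined = multigraph_walk([friendship, group], seed=0, steps=1000, rng=random.Random(0))
```

The walk on `friendship` alone is trapped in the component containing nodes 0 and 1, while the walk on the union can reach all four nodes, even though neither relation is connected by itself.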
