Sampling Algorithms for Data Collection in Online Networks


SLIDE 1

Sampling Algorithms for Data Collection in Online Networks

Carter T. Butts (1,2), Minas Gjoka (3), Maciej Kurant (4), Athina Markopoulou (3)

1 Department of Sociology, University of California, Irvine
2 Institute for Mathematical Behavioral Sciences, University of California, Irvine
3 Department of Electrical Engineering and Computer Science, University of California, Irvine
4 EPFL, Lausanne

Prepared for the August 25, 2009 UCI MURI AHM. This work was supported by DOD ONR award N00014-8-1-1015.

SLIDE 2

The Network Sampling Problem

• Online networks are of increasing interest, and of obvious importance for our project
  • Dramatically enhanced data availability versus offline sources
  • Increasingly relevant to studies of behavior in the developed-world context
• The problem: online networks are harder to study than they appear
  • Many are true "population" networks without strong subgroup boundaries and with $10^6$ to $10^8$ nodes
  • Generally, no sampling frame; populations are "hidden" from a survey point of view
• Important frontier: principled sampling methods for online networks
  • Today, a quick look at some of our recent work in this area

SLIDE 3

Extant Methods

• Primary family of methods: link-trace sampling
  • Exploits network structure for sampling purposes
  • Basic idea: find nodes by following links from an initial seed set
  • Many, many variants
  • Some offline
• Some examples
  • Breadth-first search (BFS): visit all nodes at distances 1, 2, ... from the seed
  • Random Walk sampling (RW): choose a random neighbor of a node to visit; repeat this step for a "long" time
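The two example designs can be sketched in a few lines. This is a minimal illustration, not the code used in the study: it assumes the network is available as a dictionary of neighbor lists, and the names `bfs_sample`, `rw_sample`, and the four-node `toy` graph are hypothetical.

```python
import random
from collections import deque

def bfs_sample(adj, seed, n_nodes):
    """Collect nodes in order of distance from the seed, up to n_nodes."""
    seen = {seed}
    order = [seed]
    queue = deque([seed])
    while queue and len(order) < n_nodes:
        v = queue.popleft()
        for u in adj[v]:
            if u not in seen:
                seen.add(u)
                order.append(u)
                queue.append(u)
                if len(order) == n_nodes:
                    break
    return order

def rw_sample(adj, seed, n_steps, rng=None):
    """Simple random walk: move to a uniformly chosen neighbor at each step."""
    rng = rng or random.Random()
    v = seed
    trace = [v]
    for _ in range(n_steps):
        v = rng.choice(adj[v])
        trace.append(v)
    return trace

# Hypothetical undirected toy graph: a triangle 0-1-2 with a pendant node 3.
toy = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
```

BFS returns each node at most once, while the random walk trace may revisit nodes; those repeats are part of the sample.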

SLIDE 4

Challenges to Effective Use

• Lack of a known equilibrium distribution
  • BFS and most ad hoc methods are badly biased (unless the whole network is captured)
  • RW is biased, but in the undirected, connected case it converges to a known equilibrium in which a node is sampled with probability proportional to its degree $d(v)$
  • Can observe $d(v)$, and thus adjust post hoc (weight each draw by $1/d(v)$)
  • Directed case is harder: can derive in theory, but not easily measure
• Unverified convergence
  • For methods with an equilibrium, need to verify convergence
  • These are really just MCMC methods; the same issues apply
  • Diagnostic methods do exist, but were not previously applied to this problem
  • One area of progress: application of MCMC diagnostics to network sampling procedures
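The post hoc adjustment for RW can be sketched as follows. It relies on the standard result that a random walk on an undirected, connected graph visits nodes with probability proportional to degree, so weighting each draw by the inverse of its degree recovers a node-level mean. The function name `reweighted_mean` and the example numbers are illustrative assumptions, not taken from the slides.

```python
def reweighted_mean(values, degrees):
    """Estimate the per-node mean of `values` from degree-proportional draws.

    Each draw is weighted by 1/degree, which cancels the oversampling of
    high-degree nodes by the random walk.
    """
    weights = [1.0 / d for d in degrees]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Hypothetical sample in which each node of a four-node graph with degrees
# 2, 2, 3, 1 appears exactly in proportion to its degree.
sampled_degrees = [2, 2, 2, 2, 3, 3, 3, 1]

naive = sum(sampled_degrees) / len(sampled_degrees)            # biased upward
adjusted = reweighted_mean(sampled_degrees, sampled_degrees)   # near true mean 2.0
```

Here the quantity being estimated is the mean degree itself: the unweighted sample mean is 2.25, while the reweighted estimate matches the true per-node mean of 2.0.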

SLIDE 5

Avoiding Bias with MCMC Theory

• Why not derive a link-trace design that has a uniform (or other target) equilibrium distribution?
• Metropolis-Hastings Random Walk Sampling
  • Like simple RW, but rejects moves proportionally to the ratio of old/new degrees
  • Equilibrium is uniform on the sampled component (for the version shown)
  • If converged, the sample does not require reweighting for standard applications
• MHRW algorithm:

    initialize: v^{(0)} ∈ V, G
    Let CONVERGED := FALSE
    Let i := 0
    while !CONVERGED do
        Let i := i+1
        Draw v^{(i)} from Unif(N(v^{(i-1)}))
        if Unif(0,1) > d(v^{(i-1)}) / d(v^{(i)}) then
            Let v^{(i)} = v^{(i-1)}
        endif
        if v^{(0)},...,v^{(i)} has converged then
            Let CONVERGED := TRUE
        endif
    endwhile
    return v^{(0)},...,v^{(i)}
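A runnable sketch of the MHRW step, under two stated assumptions: the graph is an in-memory dictionary of neighbor lists, and the convergence test in the pseudocode is replaced by a fixed number of steps (in practice a diagnostic would decide when to stop). The names `mhrw_sample` and `toy` are hypothetical.

```python
import random

def mhrw_sample(adj, seed, n_steps, rng=None):
    """Metropolis-Hastings random walk with a uniform target over nodes."""
    rng = rng or random.Random()
    v = seed
    trace = [v]
    for _ in range(n_steps):
        proposal = rng.choice(adj[v])
        # Accept with probability min(1, d(v) / d(proposal)); otherwise the
        # walk stays where it is and the current node is recorded again.
        if rng.random() <= len(adj[v]) / len(adj[proposal]):
            v = proposal
        trace.append(v)
    return trace

# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3.
toy = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

trace = mhrw_sample(toy, seed=0, n_steps=200_000, rng=random.Random(7))
freq = {v: trace.count(v) / len(trace) for v in toy}
```

On this toy graph the visit frequencies in `freq` should each be close to 1/4, whereas a simple random walk would visit node 2 three times as often as node 3.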

SLIDE 6

Application: Probability Sampling of Facebook Users

• Large online service ($>2 \times 10^8$ users at time of study)
• Can no longer sample directly
  • (Could before, but few knew this!)
• Comparative study of sampling methods, using convergence diagnostics (M. Gjoka et al., 2009)
  • Goal: probability sample of non-isolate, publicly viewable users
  • Methods: BFS, RW, MHRW, Uniform (reference sample)
  • 28 seeds from the uniform sample used to launch independent parallel traces
  • Each trace continued for exactly 81K steps (except Uniform, fixed at 982K)
  • Within-chain (Geweke's $z_G$) and between-chain (Gelman and Rubin's $\hat{R}$) metrics used to extract final samples for RW and MHRW

SLIDE 7

Convergence for the MHRW Algorithm

(M. Gjoka et al., 2009)

Overall: acceptable convergence between 500 and 3000 iterations (depending on measure)

SLIDE 8

Comparative Estimation of Local Properties

(M. Gjoka et al., 2009)

SLIDE 9

Comparative Estimation of the Degree Distribution

(M. Gjoka et al., 2009)

SLIDE 10

Expansion: Multigraph Sampling

• Often, no one network on a given population supports sampling
  • May be fragmented, or clustered/heterogeneous (slowing convergence)
• Solution: multigraph sampling
  • Walk on multiple graphs, or unions of graphs
  • Much better properties, especially if uncorrelated
  • Individual networks need not be well-connected to be useful

SLIDE 11

Multigraph Sampling Algorithm

• Want to sample from $G = \{G_1, \ldots, G_Q\}$ on vertex set $V$
• Trivial solution: just walk on the union graph
  • Fine, but requires taking unions of neighborhoods; becomes very expensive for graphs with large cliques (e.g., group co-membership)
• Alternate approach: mix across graphs
  • Add loops to each graph
  • Choose $G_j$ from $v_i$ with probability $q_{ij} = d(v_i, G_j) / \sum_k d(v_i, G_k)$
  • Choose uniformly from $N_{G_j}(v_i)$

Sampling Algorithm

    initialize: v^{(0)} ∈ V, G, q
    Let CONVERGED := FALSE
    Let i := 0
    while !CONVERGED do
        Let i := i+1
        Draw G' from G with pmf Multinom(q_{v^{(i-1)},1},...,q_{v^{(i-1)},Q})
        Draw v^{(i)} from Unif(N_{G'}(v^{(i-1)}))
        if v^{(0)},...,v^{(i)} has converged then
            Let CONVERGED := TRUE
        endif
    endwhile
    return v^{(0)},...,v^{(i)}
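The mixing step can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: each graph is a dictionary of neighbor lists over the same vertex set, the convergence test is replaced by a fixed step count, the loop-adding step is omitted, and every vertex is assumed to have at least one neighbor in some graph. The names `multigraph_walk`, `g1`, and `g2` are hypothetical.

```python
import random

def multigraph_walk(graphs, seed, n_steps, rng=None):
    """Random walk that picks a graph by degree share, then a uniform neighbor."""
    rng = rng or random.Random()
    v = seed
    trace = [v]
    for _ in range(n_steps):
        # q_ij = d(v_i, G_j) / sum_k d(v_i, G_k); choices() normalizes weights.
        degrees = [len(g[v]) for g in graphs]
        g = rng.choices(graphs, weights=degrees, k=1)[0]
        v = rng.choice(g[v])
        trace.append(v)
    return trace

# Two hypothetical fragmented graphs on vertices 0..3 whose union is connected.
g1 = {0: [1], 1: [0], 2: [3], 3: [2]}      # edges 0-1 and 2-3
g2 = {0: [2], 1: [2], 2: [0, 1], 3: []}    # edges 0-2 and 1-2

trace = multigraph_walk([g1, g2], seed=0, n_steps=200_000, rng=random.Random(3))
freq = {v: trace.count(v) / len(trace) for v in g1}
```

Neither `g1` nor `g2` is connected on its own, yet the walk reaches all four vertices. The neighborhoods are never merged, which is the point of mixing across graphs instead of walking on an explicit union.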

SLIDE 12

Simulated Example

• Vertex sampling probability converges to $\Pr(v^{(i)} = v_j) \propto \sum_k d(v_j, G_k)$
• Ex: 5 order-50 Bernoulli graphs, $d = 1.5$
  • Each very fragmented, but the union is well-connected
• Sampling from a single vertex; sample of 1e5 thinned from a Markov chain of length 5e6
  • $\chi^2$ test of sampled vs. theoretical distribution: $p = 0.52$
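One way to see why the degree-sum limit is plausible, offered as a sketch rather than the authors' own derivation: write $A_k(i,j)$ for the number of edges between $v_i$ and $v_j$ in $G_k$ and $D_i = \sum_k d(v_i, G_k)$ (both symbols are introduced here and do not appear in the slides). Assuming undirected graphs and summing only over graphs in which $v_i$ has nonzero degree, the one-step transition probability depends only on the pooled edge counts, so the chain behaves like a simple random walk on the union multigraph and the degree-proportional distribution satisfies detailed balance.

```latex
\begin{align*}
P_{ij} &= \sum_k q_{ik}\,\frac{A_k(i,j)}{d(v_i,G_k)}
        = \sum_k \frac{d(v_i,G_k)}{D_i}\cdot\frac{A_k(i,j)}{d(v_i,G_k)}
        = \frac{1}{D_i}\sum_k A_k(i,j), \\
D_i\,P_{ij} &= \sum_k A_k(i,j) = \sum_k A_k(j,i) = D_j\,P_{ji}
\quad\Longrightarrow\quad
\pi_i = \frac{D_i}{\sum_l D_l} \propto \sum_k d(v_i,G_k).
\end{align*}
```

Uniqueness of this equilibrium additionally requires the union graph to be connected, which matches the "union is well-connected" condition in the example.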

SLIDE 13

Final Remarks

• Many ways in which MCMC theory can help with large-network sampling
  • Especially useful in the online case, since long chains are easy to obtain and sampling is cheap
  • Easy to apply "off the shelf" simulation/diagnostic strategies to network sampling problems
  • Minor extensions (e.g., multigraph sampling) can help patch weaknesses with current approaches
• Offline generalizations
  • Link-trace sampling is now vital in many other contexts (e.g., RDS for disease, drug use estimation)
  • Ideally, should leverage these ideas for the offline world