Estimating and Sampling Graphs with Multidimensional Random Walks



slide-1
SLIDE 1

Estimating and Sampling Graphs with Multidimensional Random Walks

Group 2: Mingyan Zhao, Chengle Zhang, Biao Yin, Yuchen Liu

slide-2
SLIDE 2

Motivation

Complex Network

  • Social Network
  • Biological Network

Reference: www.forbe.com imdevsoftware.wordpress.com

slide-3
SLIDE 3

Existing Approaches

  • Random vertex sampling
  • Random edge sampling
  • Random walks
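The three approaches can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation: the adjacency-list representation, the function names, and the toy graph are all assumptions made for the example.

```python
import random

def random_vertex_sample(adj, k, rng):
    """Draw k vertices uniformly at random (with replacement)."""
    vertices = list(adj)
    return [rng.choice(vertices) for _ in range(k)]

def random_edge_sample(adj, k, rng):
    """Draw k directed edges (u, v) uniformly at random (with replacement)."""
    edges = [(u, v) for u in adj for v in adj[u]]
    return [rng.choice(edges) for _ in range(k)]

def random_walk(adj, start, steps, rng):
    """Walk `steps` edges from `start`, moving to a uniformly chosen neighbor."""
    sampled, u = [], start
    for _ in range(steps):
        v = rng.choice(adj[u])
        sampled.append((u, v))
        u = v
    return sampled

# Hypothetical toy graph: undirected, stored as symmetric adjacency lists.
toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
rng = random.Random(0)
vertex_sample = random_vertex_sample(toy_graph, 5, rng)
edge_sample = random_edge_sample(toy_graph, 5, rng)
walk_sample = random_walk(toy_graph, 0, 5, rng)
```

Vertex and edge sampling need access to the full vertex or edge list, while the random walk only needs the neighbors of the current vertex.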

slide-4
SLIDE 4

Frontier Sampling

  • A new m-dimensional random walk that uses m dependent random walkers.

slide-5
SLIDE 5

Contribution

  • Mitigates the large estimation errors caused by disconnected or loosely connected components.
  • Shows that the tail of the degree distribution is better estimated using random edge sampling than random vertex sampling.
  • Presents asymptotically unbiased estimators.

slide-6
SLIDE 6

Definitions

Notation                  Meaning
$G_d = (V, E_d)$          A labeled directed graph representing the (original) network graph, where $V$ is a set of vertices and $E_d$ is a set of edges
$(u, v)$                  A connection from $u$ to $v$ (a.k.a. an edge)
$L_v$ and $L_e$           Finite sets of vertex and edge labels
$L_e(u, v) = \emptyset$   Edge $(u, v)$ is unlabeled
$L_v(v) = \emptyset$      Vertex $v$ is unlabeled

slide-7
SLIDE 7

Vertex vs. Edge Sampling

slide-8
SLIDE 8

Section 4

  1. Mathematical theory and derivations for random walk sampling
  2. Strong Law of Large Numbers
  3. Four estimators, which will be applied in Section 5
  4. Deficiency of RW
  5. Multiple independent random walkers
slide-9
SLIDE 9

Strong Law of Large Numbers

Original statistical theorem:
Weak law:

  • B: number of RW steps (total RW sampled edges)
  • B*(B): number of sampled edges in E*
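For reference, the strong law for a stationary random walk on a connected undirected graph is usually stated as below. This is a standard form written to match the slide's notation, not a copy of the slide's formula: the function $f$, the sampled edges $(u_i, v_i)$, and the edge set $E$ (each undirected edge counted once per direction) are assumed symbols.

```latex
\lim_{B \to \infty} \frac{1}{B} \sum_{i=1}^{B} f(u_i, v_i)
  = \frac{1}{|E|} \sum_{(u, v) \in E} f(u, v)
  \quad \text{almost surely.}

% Special case: f is the indicator of the edge subset E^*.
\lim_{B \to \infty} \frac{B^*(B)}{B} = \frac{|E^*|}{|E|}
  \quad \text{almost surely.}
```

The second line is the first with $f(u, v) = \mathbf{1}\{(u, v) \in E^*\}$, which is the form the estimators in this section rely on.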

slide-10
SLIDE 10

Estimator 1: Edge Label Density

First, label the edges of interest; the quantity to estimate is the probability that an edge carries the label. The estimator is based on the SLLN.
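A minimal sketch of an edge label density estimate, assuming the estimator is the fraction of sampled edges that carry the label of interest. The function name, the label dictionary, and the hand-written sample are hypothetical.

```python
def edge_label_density(sampled_edges, edge_labels, label):
    """Fraction of sampled edges whose label set contains `label`."""
    hits = sum(1 for e in sampled_edges if label in edge_labels.get(e, set()))
    return hits / len(sampled_edges)

# Hypothetical labels: only the edge between vertices 0 and 1 is labeled.
edge_labels = {(0, 1): {"friend"}, (1, 0): {"friend"}}
# Hypothetical edge sequence, as a random walk might produce it.
sampled_edges = [(0, 1), (1, 2), (2, 0), (0, 1)]
density = edge_label_density(sampled_edges, edge_labels, "friend")
```

Two of the four sampled edges carry the label, so the estimate here is 0.5.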

slide-11
SLIDE 11

Estimator 2: Assortative Mixing Coefficient (AMC)

Considering a directed G, an asymptotically unbiased estimator of the AMC:

Covariance: which two quantities are highly correlated?
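The idea can be illustrated with a plug-in estimate: the sample correlation between the out-degree of the source and the in-degree of the target over the sampled edges. This is a sketch under the assumption that the sampled edges are drawn uniformly from the directed edges; it is not necessarily the exact formula on the slide, and all names are hypothetical.

```python
def assortativity_estimate(sampled_edges, out_deg, in_deg):
    """Pearson correlation of (out-degree of u, in-degree of v) over sampled edges."""
    n = len(sampled_edges)
    xs = [out_deg[u] for u, _ in sampled_edges]
    ys = [in_deg[v] for _, v in sampled_edges]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = (sum((x - mean_x) ** 2 for x in xs) / n) ** 0.5
    std_y = (sum((y - mean_y) ** 2 for y in ys) / n) ** 0.5
    return cov / (std_x * std_y)

# Hypothetical star graph: center "c" linked to three leaves in both directions.
star_edges = [("c", "a"), ("c", "b"), ("c", "d"), ("a", "c"), ("b", "c"), ("d", "c")]
degree = {"c": 3, "a": 1, "b": 1, "d": 1}
amc = assortativity_estimate(star_edges, degree, degree)
```

In a star, high-degree vertices connect only to low-degree vertices, so the coefficient comes out strongly negative.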

slide-12
SLIDE 12

Estimator 3: Vertex Label Density

Construct an asymptotically unbiased estimator:
Since:
So, this estimator converges to:
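A stationary random walk visits vertices in proportion to their degree, so a common way to remove that bias is to weight each visit by 1/deg(v). The sketch below assumes that form of estimator; the names and the hand-written sample are hypothetical.

```python
def vertex_label_density(visited, degree, vertex_labels, label):
    """Degree-reweighted fraction of visited vertices that carry `label`."""
    weighted_hits = 0.0
    total_weight = 0.0
    for v in visited:
        w = 1.0 / degree[v]  # undo the degree-proportional visiting bias
        total_weight += w
        if label in vertex_labels.get(v, set()):
            weighted_hits += w
    return weighted_hits / total_weight

# Hypothetical graph degrees and labels: one of four vertices is labeled.
degree = {0: 2, 1: 2, 2: 3, 3: 1}
vertex_labels = {3: {"x"}}
# Hypothetical visit sequence with counts proportional to degree.
visited = [2, 2, 2, 0, 0, 1, 1, 3]
theta = vertex_label_density(visited, degree, vertex_labels, "x")
```

Because the visit counts were chosen in proportion to degree, the reweighted estimate recovers the true density of 1/4 up to floating-point rounding, whereas the unweighted fraction would be 1/8.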

slide-13
SLIDE 13

Estimator 4: Global Clustering Coefficient

The unbiased estimator by SLLN:

Since:

slide-14
SLIDE 14

Deficiency of RW from one point

  1. “Trapped” inside a subgraph (MSE)
  2. Starts from a non-stationary (non-steady) state (MSE, bias)

Burn-in period: discard the non-stationary samples

  1. Only reduces the error caused by the non-stationary start
  2. Discarding samples is not ideal when the sample is small

This motivates multiple independent random walkers.

slide-15
SLIDE 15

Single RW and Multiple RW

The accuracy of a single RW depends on the sample size used by the estimators above. MultipleRW splits the sample budget across its walkers, so each walker takes fewer steps.

Log-log plot of the CNMSE: the error is still very high.

slide-16
SLIDE 16

Why not m independent RWs?

A B

With m independent RWs, it is hard to sample m independent starting vertices with probability proportional to their degrees.
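A minimal sketch of m independent walkers sharing one sampling budget. It starts the walkers at uniformly sampled vertices, which is the easy option and not the degree-proportional start described above; the function name, graph, and parameters are assumptions for illustration.

```python
import random

def multiple_rw(adj, m, budget, rng):
    """Run m independent random walks, each taking budget // m steps."""
    vertices = list(adj)
    steps_each = budget // m  # the budget is split evenly across walkers
    sampled = []
    for _ in range(m):
        u = rng.choice(vertices)  # uniform start, NOT degree-proportional
        for _ in range(steps_each):
            v = rng.choice(adj[u])
            sampled.append((u, v))
            u = v
    return sampled

# Hypothetical toy graph stored as symmetric adjacency lists.
toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
mrw_edges = multiple_rw(toy_graph, m=3, budget=12, rng=random.Random(0))
```

Each walker gets only 4 of the 12 steps here, so a walker that starts in a poorly connected region has little time to leave it.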

slide-17
SLIDE 17

Section 5 Motivation

  • We want an m-dimensional random walk that, in steady state, samples edges uniformly at random but, unlike MultipleRW, can benefit from starting its walkers at uniformly sampled vertices.

slide-18
SLIDE 18

Frontier Sampling

$$p = \frac{\deg(u)}{\sum_{v \in L} \deg(v)} \cdot \frac{1}{\deg(u)} = \frac{1}{\sum_{v \in L} \deg(v)}$$
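The selection rule behind this probability can be sketched as follows: pick a walker with probability proportional to the degree of its current vertex, then follow one of that vertex's edges uniformly at random. This is an illustrative sketch, not the authors' implementation; the function name, the `frontier` list holding the walker positions, and the toy graph are assumptions, and every vertex is assumed to have at least one neighbor.

```python
import random

def frontier_sampling(adj, m, budget, rng):
    """Sample `budget` edges with m dependent walkers (Frontier Sampling sketch)."""
    vertices = list(adj)
    # Start the m walkers at uniformly sampled vertices.
    frontier = [rng.choice(vertices) for _ in range(m)]
    sampled = []
    for _ in range(budget):
        # Choose a walker with probability proportional to its vertex degree.
        weights = [len(adj[u]) for u in frontier]
        i = rng.choices(range(m), weights=weights, k=1)[0]
        u = frontier[i]
        # Follow one of u's edges uniformly at random.
        v = rng.choice(adj[u])
        sampled.append((u, v))
        frontier[i] = v  # only the chosen walker moves
    return sampled

# Hypothetical toy graph stored as symmetric adjacency lists.
toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
fs_edges = frontier_sampling(toy_graph, m=2, budget=20, rng=random.Random(1))
```

The two factors multiply to the same value for every edge leaving the frontier, so each of those edges is equally likely to be sampled at a given step.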

slide-19
SLIDE 19

Frontier Sampling: An m-dimensional Random Walk

  • The frontier sampling process is equivalent to the sampling process of a single random walker over $G^m$ (Lemma 5.1).

  • P(selecting a vertex and its outgoing edge in FS) = P(randomly sampling an edge from $e(L_n)$ in a single random walker over $G^m$).

slide-20
SLIDE 20

FS Steady State vs. Uniform Distribution

  • Let $K_{fs}(m)$ be a random variable that denotes the number of random walkers in $V_A$ in steady state.
  • Let $K_{un}(m)$ be a random variable that denotes the number of sampled vertices, out of m uniformly sampled vertices from $V$, that belong to $V_A$.
  • Proving this to be true indicates that the FS algorithm starting with m random walkers at m uniformly sampled vertices approaches the steady state distribution. This means FS benefits from starting its walkers at uniformly sampled vertices by reducing the transient of the RW.

slide-21
SLIDE 21

FS Steady State vs. Uniform Distribution

  • By definition
  • Theorem 5.2
  • Lemma 5.3
  • Theorem 5.4

slide-22
SLIDE 22

MultipleRW Steady State vs. Uniform Distribution

  • Let $K_{mw}(m)$ be a random variable that denotes the steady state number of MultipleRW random walkers in $V_A$.

Note: $d_A$ is the average degree of vertices in $V_A$.

Conclusion: If we initialize m random walkers with uniformly sampled vertices, FS starts closer to steady state than MultipleRW.

slide-23
SLIDE 23

Distributed Frontier Sampling

  • Frontier Sampling can also be parallelized.
  • A MultipleRW sampling process where the cost of sampling a vertex v is an exponentially distributed random variable with parameter deg(v) is equivalent to an FS process. (Theorem 5)
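The equivalence can be illustrated with a small event-driven simulation: each walker waits an exponentially distributed time with rate deg(v) at its current vertex v, and edges are recorded in the order the walkers jump. The walker whose clock fires first is walker i with probability proportional to the degree of its vertex, which matches the FS selection rule. The sketch below simulates this on one machine with a priority queue; the function name and the toy graph are assumptions, not part of the presentation.

```python
import heapq
import random

def distributed_fs(adj, m, budget, rng):
    """Simulate m walkers with exponential holding times of rate deg(v)."""
    vertices = list(adj)
    events = []  # entries are (time of next jump, walker id, current vertex)
    for walker in range(m):
        u = rng.choice(vertices)
        heapq.heappush(events, (rng.expovariate(len(adj[u])), walker, u))
    sampled = []
    while len(sampled) < budget:
        # The walker with the earliest clock jumps next.
        t, walker, u = heapq.heappop(events)
        v = rng.choice(adj[u])
        sampled.append((u, v))
        # Restart this walker's clock at its new vertex.
        heapq.heappush(events, (t + rng.expovariate(len(adj[v])), walker, v))
    return sampled

# Hypothetical toy graph stored as symmetric adjacency lists.
toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
dfs_edges = distributed_fs(toy_graph, m=3, budget=15, rng=random.Random(2))
```

In a real deployment each walker would run on its own machine and only the timestamps would need to be merged, so no walker has to know where the others are.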

slide-24
SLIDE 24

Experiment and Result

  • Data:
      • “Flickr”, “Livejournal”, and “YouTube”
      • Barabási-Albert [5] graph
  • Goal:
      • Compare FS to MultipleRW and SingleRW
      • Compare FS to random vertex and edge sampling
  • Result: FS is consistently more accurate

slide-25
SLIDE 25

Assortative Mixing Coefficient

slide-26
SLIDE 26

In-degree Distribution Estimates

slide-27
SLIDE 27

Out-degree Distribution Estimates

slide-28
SLIDE 28

In-degree Distribution: Loosely Connected Components

  • Barabási-Albert Graph

slide-29
SLIDE 29

FS vs. Stationary MultipleRW and SingleRW

  • MultipleRW and SingleRW start in steady state

slide-30
SLIDE 30

FS vs. Random Independent Sampling

slide-31
SLIDE 31

Density of Special Interest Group

slide-32
SLIDE 32

Global Clustering Coefficient Estimates

  • Global clustering coefficient: a measure of the degree to which nodes in a graph tend to cluster together.

slide-33
SLIDE 33

Conclusion

  • In almost all of the tests, FS gives more accurate estimates than the other methods.

Future Work

  • Estimating characteristics of dynamic networks
  • Design of new MCMC-based approximation algorithms
slide-34
SLIDE 34

Thank you!