Sampling Large Graphs: Algorithms and Applications (Don Towsley)


slide-1
SLIDE 1

Sampling Large Graphs: Algorithms and Applications

Don Towsley College of Information & Computer Science Umass - Amherst

Collaborators: P.H. Wang, J.C.S. Lui, J.Z. Zhou, X. Guan

slide-2
SLIDE 2

Measuring, analyzing large networks

  • large networks can be represented by graphs
  • Facebook: 1+ billion
  • WWW: 50 billion
  • Twitter: 300 million
  • eBay: 233 million

Curse of data dimensionality!

slide-7
SLIDE 7

Challenges in measurement: Information distortion

“World Map” in 1459

  • incomplete (Columbus et al. 1492) (Australia 17th century)
  • wrong proportions (Africa & Asia)

www.flickr.com/

slide-8
SLIDE 8

Why do we want to understand these networks?

[Figure: high school friendship network]

Want to understand or find out

  • how did these networks evolve?
  • who are the influential users?
  • how does influence propagate?
  • communities in these networks?
  • … etc.

slide-11
SLIDE 11

Goals and challenges

Goals

  • generate statistically valid characterization of network structure
  • node pairs in this work

Challenges

  • large networks
  • correcting for biases

slide-12
SLIDE 12

How to measure: sampling algorithms

Random sampling (uniform & independent)

  • Node sampling
  • Edge sampling

Crawling

  • Breadth First sampling (BFS)
  • Random walk sampling (RW)

slide-17
SLIDE 17

Breadth first search sampling

  • Orkut data set (Mislove 2007), 3M vertices, 200M edges
  • BFS sampling highly biased
  • difficult to remove bias

[Figure: CCDF, true distribution vs. BFS with depth = 3]

slide-18
SLIDE 18

Random walk sampling

Bias removal?

  • Markov model
  • at steady state visits edges uniformly at random (edge sampling)

Model:

  • $\theta_i$ - P[node degree = i]
  • $\pi_i$ - P[visited degree = i]
  • $\pi_i = \theta_i \times i / \text{avg degree}$, or $\theta_i = \text{Norm} \times \pi_i / i$

[Figure: CCDF under RW sampling, showing $\pi_i$ and $\theta_i$ against degree $i$]
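
The correction $\theta_i = \text{Norm} \times \pi_i / i$ can be sketched in a few lines. This is a minimal illustration, not the code behind the slides: the toy graph, the function names and the choice of Norm (picked so the corrected values sum to 1) are all assumptions.

```python
import random
from collections import Counter

def random_walk_degrees(adj, start, steps, rng):
    """Walk `steps` hops from `start`; return the degree of each visited node."""
    node, degrees = start, []
    for _ in range(steps):
        node = rng.choice(adj[node])  # move to a uniformly random neighbor
        degrees.append(len(adj[node]))
    return degrees

def corrected_degree_distribution(visited_degrees):
    """Apply theta_i = Norm * pi_i / i, with Norm making the result sum to 1."""
    pi = Counter(visited_degrees)  # unnormalized visit counts per degree
    weights = {i: count / i for i, count in pi.items()}
    norm = 1.0 / sum(weights.values())
    return {i: norm * w for i, w in weights.items()}

# Hypothetical toy graph: triangle 0-1-2 with a tail 2-3-4.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
rng = random.Random(0)
theta = corrected_degree_distribution(random_walk_degrees(adj, 0, 50_000, rng))
```

On this toy graph the true degree distribution is 0.2, 0.6, 0.2 for degrees 1, 2, 3, and the corrected estimate should land close to it, whereas the raw visit frequencies over-represent the degree-3 node.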

slide-20
SLIDE 20

Node sampling vs. RW: Orkut

[Figure: two panels, random walk and node sampling; log(CCDF) against log(degree)]

  • RW: estimates tail well
  • node sampling: estimates small degrees well

slide-21
SLIDE 21

Focus of talk

Measure node pair statistics: important for many applications!


slide-22
SLIDE 22

Classification of node pairs

Classify node pair [u, v] using shortest path

  • 1-hop node pair class if distance(u,v) = 1
  • 2-hop node pair class if distance(u,v) = 2
slide-23
SLIDE 23

Homophily

Homophily: tendency of users to connect to others with common interests.

  • P. Singla and M. Richardson. Yes, there is a correlation: from social networks to personal behavior on the web. In WWW 2008 (MSN)
  • Can infer characteristics and make recommendations

Compare homophily(u,v) between different node pair classes

slide-24
SLIDE 24

Pair similarity: Proximity

Proximity(u,v): number of common neighbors of u and v; closeness of u and v

  • knowing proximity distribution of node pairs important for
      • friendship prediction
      • interest recommendation
      • …

slide-25
SLIDE 25

Pair similarity: distance

  • Distance(u,v): length of shortest path between u and v in graph
  • measure distance distribution of all node pairs to calculate
      • average distance
          • Twitter: 4.1
          • MSN: 6.6
      • effective diameter (the 90th percentile of all distances)
      • small world

slide-26
SLIDE 26

Problem formulation

  • undirected graph $G = (V, E)$
  • measure node pair characteristics in following sets:
      • all pairs - $S = \{[u, v] : u, v \in V, u \neq v\}$
      • one-hop pairs - pairs of connected nodes, $S^{(1)} = \{[u, v] : (u, v) \in E\}$
      • two-hop pairs - pairs of nodes with at least one common neighbor, $S^{(2)} = \{[u, v] : u, v \in V, u \neq v; \exists x \in V \text{ s.t. } (x, u), (x, v) \in E\}$

slide-27
SLIDE 27

Problem formulation

  • $F(u, v)$ - similarity of node pair under study, e.g., # of common neighbors of $u$, $v$
  • $\{a_1, \ldots, a_K\}$ - range of $F(u, v)$
  • distribution of $F(u, v)$
      • $S$: $(\omega_1, \ldots, \omega_K)$
      • $S^{(1)}$: $(\omega_1^{(1)}, \ldots, \omega_K^{(1)})$
      • $S^{(2)}$: $(\omega_1^{(2)}, \ldots, \omega_K^{(2)})$
  • $\omega_k, \omega_k^{(1)}, \omega_k^{(2)}$ - fractions of node pairs in $S$, $S^{(1)}$, $S^{(2)}$ with property $F(u, v) = a_k$
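
For a graph small enough to enumerate, the three pair sets and their distributions can be computed exactly, which gives a ground truth to compare sampling estimates against. The sketch below is illustrative only; the toy graph and all function names are assumptions, and $F$ is taken to be the number of common neighbors.

```python
from collections import Counter
from itertools import combinations

def pair_sets(adj):
    """Return (S, S1, S2): all pairs, one-hop pairs, two-hop pairs."""
    all_pairs = list(combinations(sorted(adj), 2))
    one_hop = [(u, v) for u, v in all_pairs if v in adj[u]]
    two_hop = [(u, v) for u, v in all_pairs if adj[u] & adj[v]]
    return all_pairs, one_hop, two_hop

def distribution(pairs, F):
    """Fraction of pairs taking each value of F, i.e. the omega vector."""
    counts = Counter(F(u, v) for u, v in pairs)
    return {a: c / len(pairs) for a, c in counts.items()}

# Hypothetical toy graph: triangle 0-1-2 plus the edge 2-3 (adjacency as sets).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
common = lambda u, v: len(adj[u] & adj[v])

S, S1, S2 = pair_sets(adj)
omega = distribution(S, common)
omega1 = distribution(S1, common)
omega2 = distribution(S2, common)
```

Here the pair [2, 3] is one-hop but not two-hop, because nodes 2 and 3 are adjacent yet share no neighbor.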

slide-28
SLIDE 28

Challenges

  • OSNs large: Facebook, Google+, Twitter, LinkedIn, …, $|V| > 500$ million users
  • huge number of node pairs, $|V|^2 > 10^{16}$
  • topology not available ⇒ sampling required
  • UVS (Uniform Vertex Sampling):
      • unbiased for $S$
      • sampling bias for $S^{(1)}$, $S^{(2)}$
      • sometimes UVS not allowed
  • crawling - RW: sampling bias

need to construct unbiased estimates

slide-29
SLIDE 29

Node pair sampling based on UVS

Basic sampling techniques

  • UVS: sample nodes from $V$ uniformly
  • weighted vertex sampling (WVS): sample nodes from $V$ with desired probability distribution $(\pi_x : x \in V)$
      • independent WVS (IWVS) (if we have topology)
      • Metropolis-Hastings WVS (MHWVS) (if not): at each step, MHWVS selects a node $v$ using UVS and then accepts the sample with probability $\min(\pi_v/\pi_u, 1)$, where $u$ is previous sample; otherwise tries again
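
A rough sketch of MHWVS follows. It uses the textbook Metropolis-Hastings convention, in which a rejected proposal leaves the chain at the previous node and that node is recorded again; this is an assumption about how "tries again" is meant. The weights, node list and function name are hypothetical.

```python
import random
from collections import Counter

def mhwvs(nodes, weight, n_samples, rng):
    """Sample nodes with probability proportional to weight[node], using only UVS proposals."""
    current = rng.choice(nodes)
    samples = []
    for _ in range(n_samples):
        candidate = rng.choice(nodes)  # UVS proposal
        accept = min(weight[candidate] / weight[current], 1.0)
        if rng.random() < accept:
            current = candidate
        samples.append(current)  # on rejection the previous node is repeated
    return samples

# Hypothetical target: node x drawn with probability proportional to weight[x].
nodes = [0, 1, 2, 3]
weight = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
samples = mhwvs(nodes, weight, 100_000, random.Random(1))
freq = {v: c / len(samples) for v, c in Counter(samples).items()}
```

Only ratios of weights are needed, so the target distribution does not have to be normalized, which is why this variant works without knowing the topology.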

slide-30
SLIDE 30

Node pair sampling based on UVS

All pairs $S$

  • Sampling method: select two different nodes $u$ and $v$ uniformly at random
  • Estimator: given sampled pairs $[u_i, v_i]$, $i = 1, \ldots, n$,

    $$\hat{\omega}_k = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(F(u_i, v_i) = a_k), \quad k = 1, \ldots, K$$

  • Accuracy (unbiased): $E[\hat{\omega}_k] = \omega_k$, $k = 1, \ldots, K$
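
The all-pairs estimator is a plain sample average, as the sketch below illustrates. The toy graph, the choice of common-neighbor count as the similarity, and the function name are assumptions made for the example.

```python
import random
from collections import Counter

def uvs_pair_estimate(adj, F, n, rng):
    """Estimate the distribution of F over all pairs from n uniformly sampled pairs."""
    nodes = list(adj)
    counts = Counter()
    for _ in range(n):
        u, v = rng.sample(nodes, 2)  # two different nodes, uniformly at random
        counts[F(u, v)] += 1
    return {a: c / n for a, c in counts.items()}

# Hypothetical toy graph: triangle 0-1-2 plus the edge 2-3 (adjacency as sets).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
common = lambda u, v: len(adj[u] & adj[v])
omega_hat = uvs_pair_estimate(adj, common, 20_000, random.Random(2))
```

On this graph five of the six pairs share exactly one neighbor, so the estimate for value 1 should be near 5/6.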

slide-31
SLIDE 31

Node pair sampling based on UVS

One hop pairs $S^{(1)}$

Sampling node pair $[u, v]$

1) sample node $u$ according to probability distribution $(\pi_u^{(1)} : u \in V)$, where $\pi_u^{(1)} = d_u / (2|E|)$ and $d_u$ is the degree of node $u$
2) select neighbor $v$ of $u$ at random

Each $[u, v]$ sampled uniformly from $S^{(1)}$
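
The two-step procedure can be sketched as follows for the case where the topology is known, so the degree-weighted draw is done directly (the IWVS case). The toy graph and function name are assumptions; without topology, the first step would need something like MHWVS instead.

```python
import random
from collections import Counter

def sample_one_hop_pairs(adj, n, rng):
    """Draw n one-hop pairs: degree-weighted node, then a uniform neighbor."""
    nodes = list(adj)
    degrees = [len(adj[u]) for u in nodes]  # proportional to d_u / (2|E|)
    pairs = []
    for u in rng.choices(nodes, weights=degrees, k=n):  # step 1
        v = rng.choice(sorted(adj[u]))                  # step 2
        pairs.append((min(u, v), max(u, v)))            # store as unordered pair
    return pairs

# Hypothetical toy graph: triangle 0-1-2 plus the edge 2-3 (adjacency as sets).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
pairs = sample_one_hop_pairs(adj, 40_000, random.Random(3))
freq = {p: c / len(pairs) for p, c in Counter(pairs).items()}
```

Each ordered pair $(u, v)$ is drawn with probability $d_u/(2|E|) \times 1/d_u = 1/(2|E|)$, so every edge is equally likely; with four edges each should appear about a quarter of the time.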

slide-33
SLIDE 33

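Node pair sampling based on UVS

One hop pairs $S^{(1)}$

  • Estimator: given sampled pairs $[u_i, v_i]$, $i = 1, \ldots, n$,

    $$\hat{\omega}_k^{(1)} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(F(u_i, v_i) = a_k), \quad k = 1, \ldots, K$$

  • Accuracy (unbiased): $E[\hat{\omega}_k^{(1)}] = \omega_k^{(1)}$, $k = 1, \ldots, K$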

slide-34
SLIDE 34

Node pair sampling based on UVS

Two hop pairs $S^{(2)}$

  • sampling node pair $[u, v]$
      1) sample node $x$
      2) select two neighbors $u$ and $v$ of node $x$ at random
  • produces asymptotically unbiased estimate of $\omega_k^{(2)}$, $k = 1, \ldots, K$
  • tight convergence rate

slide-36
SLIDE 36

Node pair sampling based on RW

Why?

  • UVS not available, too costly
      • API not provided
      • user IDs sparsely distributed
  • only crawling techniques can be used
  • random walk: walker moves to random neighbor, samples its information
  • we saw for connected non-bipartite graph: $\pi_v = d_v / (2|E|)$, $v \in V$

slide-37
SLIDE 37

Node pair sampling based on RW

All pairs $S$

  • sample node pair $[u_i, v_i]$ by two independent RWs, where $u_i, v_i$ are nodes sampled by two RWs at step $i$
  • node pair $[u, v]$ sampled according to stationary distribution $\pi_{[u,v]} = d_u d_v / (4|E|^2)$, $u, v \in V$

slide-38
SLIDE 38

Node pair sampling based on RW

All pairs $S$

  • Estimator: given sampled node pairs $[u_i, v_i]$, $i = 1, \ldots, n$,

    $$\omega_k^{*} = \frac{1}{J} \sum_{i=1}^{n} \frac{\mathbf{1}(F(u_i, v_i) = a_k)\,\mathbf{1}(u_i \neq v_i)}{d_{u_i} d_{v_i}}, \quad k = 1, \ldots, K$$

  • $J$ - normalization constant
  • Accuracy: $\omega_k^{*}$ - asymptotically unbiased estimate of $\omega_k$, $k = 1, \ldots, K$
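
A sketch of the two-walker estimator follows. Each sampled pair is weighted by $1/(d_u d_v)$ to undo the degree bias of the two walks. The slide only says a normalization constant is used; taking it to be the sum of the weights of the retained samples is an assumption, as are the toy graph and function name.

```python
import random
from collections import defaultdict

def rw_all_pairs_estimate(adj, F, steps, rng):
    """Estimate the all-pairs distribution of F from two independent random walks."""
    nodes = list(adj)
    u, v = rng.choice(nodes), rng.choice(nodes)
    weighted = defaultdict(float)
    for _ in range(steps):
        u = rng.choice(adj[u])  # walker 1 moves to a random neighbor
        v = rng.choice(adj[v])  # walker 2 moves independently
        if u != v:              # indicator 1(u_i != v_i)
            weighted[F(u, v)] += 1.0 / (len(adj[u]) * len(adj[v]))
    total = sum(weighted.values())  # assumed form of the normalization constant
    return {a: w / total for a, w in weighted.items()}

# Hypothetical toy graph: triangle 0-1-2 plus the edge 2-3 (connected, non-bipartite).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
common = lambda u, v: len(set(adj[u]) & set(adj[v]))
omega_star = rw_all_pairs_estimate(adj, common, 100_000, random.Random(4))
```

The true all-pairs distribution on this graph puts 5/6 on one common neighbor and 1/6 on none, and the reweighted estimate should approach those values as the walks get longer.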

slide-39
SLIDE 39

Node pair sampling based on RW

One hop pairs $S^{(1)}$

Sampling method:

  • random node pair $[u_i, v_i]$ sampled by RW
  • $u_i, v_i$ - nodes sampled at steps $i$ and $i + 1$
  • produces asymptotically unbiased estimate of $\omega_k^{(1)}$, $k = 1, \ldots, K$

[Figure: RW path through steps i-1, i and i+1, with $u_i$ and $v_i$ on consecutive steps]
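
Since a random walk at steady state traverses edges uniformly, consecutive nodes of one walk can be used directly as one-hop pairs with no reweighting. The sketch below illustrates that idea; the toy graph and function name are assumptions.

```python
import random
from collections import Counter

def rw_one_hop_estimate(adj, F, steps, rng):
    """Estimate the one-hop distribution of F from consecutive nodes of one walk."""
    u = rng.choice(list(adj))
    counts = Counter()
    for _ in range(steps):
        v = rng.choice(adj[u])  # node at the next step
        counts[F(u, v)] += 1    # [u, v] is always an edge of the graph
        u = v
    return {a: c / steps for a, c in counts.items()}

# Hypothetical toy graph: triangle 0-1-2 plus the edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
common = lambda u, v: len(set(adj[u]) & set(adj[v]))
omega1_hat = rw_one_hop_estimate(adj, common, 100_000, random.Random(5))
```

Three of the four edges on this graph join nodes with one common neighbor, so the estimate for value 1 should be near 0.75.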

slide-40
SLIDE 40

Node pair sampling based on RW

Two hop pairs $S^{(2)}$

Neighborhood RW (NRW)

  • current edge $(u, v)$
  • next edge: select randomly from edges connected to $u$ or $v$, except edge $(u, v)$
  • RW on graph with edges as nodes
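
One way to sketch NRW is as a walk over edges. The slide does not spell out how a node pair is read off the walk; the version below assumes the pair is formed by the two endpoints that consecutive edges do not share, which always have the shared endpoint as a common neighbor. The toy graph and function names are hypothetical.

```python
import random
from collections import Counter

def incident_edges(adj, edge):
    """Edges sharing an endpoint with `edge`, as (shared, other), excluding `edge`."""
    u, v = edge
    nbrs = [(u, w) for w in adj[u] if w != v]
    nbrs += [(v, w) for w in adj[v] if w != u]
    return nbrs

def nrw_two_hop_pairs(adj, start_edge, steps, rng):
    """Run NRW for `steps` moves and return the two-hop pair seen at each move."""
    edge = start_edge
    pairs = []
    for _ in range(steps):
        shared, other = rng.choice(incident_edges(adj, edge))
        far = edge[0] if edge[1] == shared else edge[1]  # unshared end of current edge
        pairs.append(tuple(sorted((far, other))))
        edge = (shared, other)
    return pairs

# Hypothetical toy graph: triangle 0-1-2 plus the edge 2-3 (adjacency as sets).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
pairs = nrw_two_hop_pairs(adj, (0, 1), 100_000, random.Random(6))
freq = {p: c / len(pairs) for p, c in Counter(pairs).items()}
```

On this graph every two-hop pair has exactly one common neighbor, so the five two-hop pairs should each appear about a fifth of the time. The sketch assumes every edge has at least one neighboring edge.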

slide-41
SLIDE 41

Node pair sampling based on RW

Two hop pairs $S^{(2)}$

Probability NRW samples node pair $[u, v]$ in $S^{(2)}$ converges to

$$\pi_{[u,v]}^{(2)} = m(u, v) / M$$

$m(u, v)$ - number neighbors common to $u$, $v$

slide-42
SLIDE 42

Node pair sampling based on RW

Two hop pairs $S^{(2)}$

  • Estimator: given sampled node pairs $[u_i, v_i]$, $i = 1, \ldots, n$,

    $$\omega_k^{(2*)} = \frac{1}{H} \sum_{i=1}^{n} \frac{\mathbf{1}(F(u_i, v_i) = a_k)}{m(u_i, v_i)}, \quad k = 1, \ldots, K$$

  • $H$ - normalization constant
  • Accuracy: asymptotically unbiased estimate of $\omega_k^{(2)}$, $1 \le k \le K$
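
The reweighting step can be illustrated on a hand-built sample. Pairs that NRW would visit in proportion to their common-neighbor count are each weighted by $1/m(u, v)$; taking the normalization constant to be the sum of those weights is an assumption, as are the toy graph and function name.

```python
from collections import defaultdict

def two_hop_estimate(pairs, F, m):
    """Reweight sampled two-hop pairs by 1/m(u, v) and normalize."""
    weighted = defaultdict(float)
    for u, v in pairs:
        weighted[F(u, v)] += 1.0 / m(u, v)
    total = sum(weighted.values())  # assumed form of the normalization constant
    return {a: w / total for a, w in weighted.items()}

# Hypothetical toy graph: square 0-1-2-3 with a tail 3-4 (adjacency as sets).
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2, 4}, 4: {3}}
m = lambda u, v: len(adj[u] & adj[v])

# Idealized sample: each two-hop pair appears in proportion to m(u, v).
pairs = [(0, 2), (0, 2), (1, 3), (1, 3), (0, 4), (2, 4)]
omega2_star = two_hop_estimate(pairs, m, m)  # here F is the common-neighbor count
```

Two of the four two-hop pairs have two common neighbors and two have one, so the corrected distribution is 0.5 and 0.5, whereas the raw sample frequencies would be 2/3 and 1/3.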

slide-43
SLIDE 43

Simulations: Distance distribution estimation

  • $B$ - number of sampled node pairs
  • $|S|$ - total number of node pairs
  • error metric - $\mathrm{NMSE}(\hat{\omega}_k) = \sqrt{E[(\hat{\omega}_k - \omega_k)^2]} / \omega_k$, $k = 1, \ldots, K$

Gnutella - $|V| \approx 6300$

  • $B > 0.005|S|$, NMSE < 1
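
An empirical version of the error metric can be computed from repeated runs, as sketched below. This assumes the normalized form with a square root, $\sqrt{E[(\hat{\omega}_k - \omega_k)^2]} / \omega_k$, and replaces the expectation with an average over runs; the function name and numbers are hypothetical.

```python
import math

def nmse(estimates, truth):
    """Root of the mean squared error over repeated estimates, divided by the true value."""
    mse = sum((e - truth) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse) / truth

# Two hypothetical runs estimating a true fraction of 0.5.
value = nmse([0.4, 0.6], 0.5)
```

A value below 1 means the typical error is smaller than the quantity being estimated.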
slide-44
SLIDE 44

Simulations: Mutual neighbor count distribution in S(1)

$B = 0.01|S^{(1)}|$ node pairs sampled from $S^{(1)}$ of soc-Epinions (76,000 nodes)

slide-45
SLIDE 45

Simulations: Mutual neighbor count distribution in S(2)

$B = 0.01|S^{(2)}|$ node pairs sampled from $S^{(2)}$ of soc-Epinions

slide-46
SLIDE 46

Simulations: Interest distribution scheme

Distribute interests over graph:

  • $10^5$ distinct interests: number nodes per interest ~ truncated Pareto distribution over $\{1, \ldots, 10^3\}$
  • to distribute interest possessed by $k$ different nodes
      1. select random node $v$ that reaches at least $k-1$ different nodes
      2. distribute interest to node $v$ and closest $k-1$ nodes connected to $v$
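
The second step, picking node $v$ and the $k-1$ nodes closest to it, can be sketched with a breadth-first search. Breaking ties among equally distant nodes by BFS visiting order is an assumption, as are the function name and the toy path graph.

```python
from collections import deque

def closest_nodes(adj, v, k):
    """Return v plus the k-1 nodes closest to v, or None if v reaches fewer."""
    seen, order, queue = {v}, [v], deque([v])
    while queue and len(order) < k:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                order.append(nbr)  # BFS adds nodes in nondecreasing distance
                queue.append(nbr)
                if len(order) == k:
                    break
    return order if len(order) == k else None

# Hypothetical toy graph: the path 0-1-2-3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
holders = closest_nodes(adj, 0, 3)
```

A `None` result signals that the chosen node cannot reach enough other nodes, in which case step 1 would pick another node.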

slide-47
SLIDE 47

Simulations: Common interest count distribution for generated content

  • CCDF of common interest count distribution for S, S(1), and S(2)
  • # common interests smallest for S, largest for S(1)
  • consequence of construction

[Figure: P2P-Gnutella]

slide-48
SLIDE 48

Simulations: Common interest count distribution in S, S(1) (Gnutella)

  • RW better than UVS for all pairs
  • little difference for neighbors

$B = 0.01|S|$ node pairs sampled from $S$, $S^{(1)}$

[Figure: two panels, S and S(1)]

slide-49
SLIDE 49

Simulations: Common interest count distribution in S(2) (Gnutella)

  • IWVS better for small numbers of interests
  • requires knowledge of topology

$B = 0.01|S^{(2)}|$ node pairs sampled from $S^{(2)}$

slide-50
SLIDE 50

Conclusions

  • use sampling to estimate pair characteristics in sets S, S(1), and S(2)
  • sampling methods based on independent vertex sampling and random walk
  • produce asymptotically unbiased estimates
  • good illustration of power of random walk
  • validated approaches on wide range of graphs

slide-51
SLIDE 51

Conclusions

  • Markov Chain Mixing Times
  • other more “powerful” & “elegant” sampling methods: Frontier Sampling (Ribeiro)
  • Efficiently Estimating Motif Statistics of Large Networks in the Dark. TKDD 2014
  • Design of Efficient Sampling Methods on Hybrid Social-Affiliation Networks. IEEE ICDE’15
  • measuring, maximizing group closeness centrality over disk-resident graphs. WWW’14

slide-52
SLIDE 52

Thanks!

Slides (will be) at http://www-net.cs.umass.edu/networks/towsley/UFRJ-sampling.pdf