Sampling Large Graphs: Algorithms and Applications - Don Towsley

  1. Sampling Large Graphs: Algorithms and Applications. Don Towsley, College of Information & Computer Science, UMass Amherst. Collaborators: P.H. Wang, J.C.S. Lui, J.Z. Zhou, X. Guan.

  2. Measuring, analyzing large networks
     - large networks can be represented by graphs
       - Facebook: 1+ billion
       - WWW: 50 billion
       - Twitter: 300 million
       - eBay: 233 million
     - Curse of data dimensionality!

  7. Challenges in measurement: information distortion
     "World Map" in 1459 (image: www.flickr.com/)
     - incomplete (Columbus et al., 1492; Australia, 17th century)
     - wrong proportions (Africa & Asia)

  8. Why do we want to understand these networks?
     Want to understand or find out:
     - how did these networks evolve?
     - who are the influential users?
     - how does influence propagate?
     - communities in these networks?
     - etc.
     (Figure: high school friendship network)

  11. Goals and challenges
      Goals:
      - generate statistically valid characterization of network structure
      - node pairs in this work
      Challenges:
      - large networks
      - correcting for biases

  12. How to measure: sampling algorithms
      - Random sampling (uniform & independent): node sampling, edge sampling
      - Crawling: breadth first sampling (BFS), random walk sampling (RW)

  17. Breadth first search sampling
      - Orkut data set (Mislove 2007): 3M vertices, 200M edges
      - BFS sampling highly biased
      - difficult to remove bias
      (Figure: CCDF, true distribution vs. BFS with depth = 3)

  18. Random walk sampling
      Bias removal?
      - Markov model: at steady state visits edges uniformly at random (edge sampling)
      - Model:
        - $\theta_i$ = P[node degree = i]
        - $\pi_i$ = P[visited degree = i]
        - $\pi_i = \theta_i \cdot i / \text{avg degree}$, or $\theta_i = \text{Norm} \cdot \pi_i / i$
      (Figure: CCDF, $\theta_i$ vs. RW sampling $\pi_i$)
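
The re-weighting step on this slide (divide the visit frequency of each degree by that degree, then normalise) can be sketched in Python as follows. The function name `reweight_degree_samples` and the toy input are illustrative assumptions, not part of the talk; the input is simply the list of degrees of the nodes a random walk visited.

```python
from collections import Counter


def reweight_degree_samples(visited_degrees):
    """Estimate P[node degree = i] from the degrees seen by a random walk.

    A stationary random walk visits a node with probability proportional to
    its degree, so each visit to a degree-i node is weighted by 1 / i and the
    weights are normalised to sum to one (the "Norm" factor).
    """
    counts = Counter(visited_degrees)               # visit counts per degree
    weights = {i: c / i for i, c in counts.items()}
    norm = sum(weights.values())
    return {i: w / norm for i, w in sorted(weights.items())}


# Toy input: in a star with 3 leaves, 3/4 of the nodes have degree 1 and 1/4
# have degree 3, while a long walk spends half of its steps on the hub.
estimate = reweight_degree_samples([3, 1, 3, 1, 3, 1])
print(estimate)
```

On this input the raw visit frequencies (one half each) are corrected back to the true degree distribution of the star.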

  20. Node sampling vs. RW: Orkut
      - RW: estimates tail well
      - node sampling: estimates small degrees well
      (Figure: two panels, random walk and node sampling; log(CCDF) vs. log(degree))

  21. Focus of talk: measure node pair statistics, important for many applications!

  22. Classification of node pairs
      Classify node pair $[u, v]$ using shortest path:
      - 1-hop node pair class if distance(u, v) = 1
      - 2-hop node pair class if distance(u, v) = 2
      - …

  23. Homophily
      Homophily: tendency of users to connect to others with common interests.
      P. Singla and M. Richardson. Yes, there is a correlation: from social networks to personal behavior on the web. In WWW 2008 (MSN).
      - Can infer characteristics and make recommendations
      - Compare homophily(u, v) between different node pair classes

  24. Pair similarity: proximity
      Proximity(u, v): number of common neighbors of u and v; closeness of u and v
      - knowing proximity distribution of node pairs important for
        - friendship prediction
        - interest recommendation
        - …

  25. Pair similarity: distance
      - Distance(u, v): length of shortest path between u and v in graph
      - measure distance distribution of all node pairs to calculate
        - average distance (Twitter: 4.1, MSN: 6.6)
        - effective diameter (the 90th percentile of all distances)
      - small world
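
As a small illustration of the two statistics named on this slide, the sketch below computes the average distance and the 90th-percentile distance exactly on a toy graph with breadth first search. It assumes a connected undirected graph stored as a dict of neighbor sets and uses a nearest-rank percentile; the function names, the toy graph, and the percentile convention are assumptions made for the example. Exact computation like this is only feasible on small graphs, which is why the talk turns to sampling.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Shortest-path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def distance_statistics(adj):
    """Average distance and 90th-percentile distance over all node pairs.

    Assumes the graph is connected, so every pair has a finite distance.
    """
    all_dist = {s: bfs_distances(adj, s) for s in adj}
    distances = sorted(all_dist[u][v] for u, v in combinations(sorted(adj), 2))
    average = sum(distances) / len(distances)
    # Nearest-rank percentile: smallest d with at least 90% of pairs <= d.
    rank = (9 * len(distances) + 9) // 10      # integer ceil(0.9 * n)
    return average, distances[rank - 1]


# Hypothetical toy graph: a path 0 - 1 - 2 - 3 - 4.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(distance_statistics(path))
```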

  26. Problem formulation
      - undirected graph $G = (V, E)$
      - measure node pair characteristics in following sets:
        - all pairs: $S = \{[u, v] : u, v \in V, u \neq v\}$
        - one-hop pairs (pairs of connected nodes): $S^{(1)} = \{[u, v] : (u, v) \in E\}$
        - two-hop pairs (pairs of nodes with at least one common neighbor): $S^{(2)} = \{[u, v] : u, v \in V, u \neq v; \exists x \in V \text{ s.t. } (x, u), (x, v) \in E\}$

  27. Problem formulation
      - $F(u, v)$: similarity of node pair under study, e.g., # of common neighbors of $u, v$
      - $\{a_1, \ldots, a_K\}$: range of $F(u, v)$
      - distribution of $F(u, v)$:
        - $S$: $(\omega_1, \ldots, \omega_K)$
        - $S^{(1)}$: $(\omega_1^{(1)}, \ldots, \omega_K^{(1)})$
        - $S^{(2)}$: $(\omega_1^{(2)}, \ldots, \omega_K^{(2)})$
      - $\omega_k, \omega_k^{(1)}, \omega_k^{(2)}$: fractions of node pairs in $S, S^{(1)}, S^{(2)}$ with property $F(u, v) = a_k$
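
To make the three target distributions concrete, the following sketch enumerates all pairs, one-hop pairs, and two-hop pairs on a small graph and computes the exact fraction of pairs with each number of common neighbors. The function name, the dict-of-sets graph representation, and the toy graph are assumptions for illustration; these exact values are what the sampling estimators later in the talk try to approximate on graphs too large to enumerate.

```python
from collections import Counter
from itertools import combinations


def pair_distributions(adj):
    """Exact distributions of the number of common neighbors over three sets.

    Returns (all pairs, one-hop pairs, two-hop pairs), each as a dict that
    maps a similarity value to the fraction of pairs with that value.
    """
    all_pairs = list(combinations(sorted(adj), 2))
    one_hop = [(u, v) for u, v in all_pairs if v in adj[u]]       # connected
    two_hop = [(u, v) for u, v in all_pairs if adj[u] & adj[v]]   # share a neighbor

    def distribution(pairs):
        counts = Counter(len(adj[u] & adj[v]) for u, v in pairs)
        return {a: c / len(pairs) for a, c in sorted(counts.items())}

    return distribution(all_pairs), distribution(one_hop), distribution(two_hop)


# Hypothetical toy graph: triangle 0-1-2 with a pendant node 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
omega, omega1, omega2 = pair_distributions(adj)
print(omega, omega1, omega2)
```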

  28. Challenges
      - OSNs large
        - Facebook, Google+, Twitter, LinkedIn, …: $|V| > 500$ million users
        - huge number of node pairs, $|V|^2 > 10^{16}$
      - topology not available ⇒ sampling required
        - UVS (Uniform Vertex Sampling):
          - unbiased for $S$
          - sampling bias for $S^{(1)}$, $S^{(2)}$
          - sometimes UVS not allowed
        - crawling (RW): sampling bias
      - need to construct unbiased estimates

  29. Node pair sampling based on UVS
      Basic sampling techniques:
      - UVS: sample nodes from $V$ uniformly
      - weighted vertex sampling (WVS): sample nodes from $V$ with desired probability distribution $(\pi_x : x \in V)$
        - independent WVS (IWVS) (if we have topology)
        - Metropolis-Hastings WVS (MHWVS) (if not): at each step, MHWVS selects a node $v$ using UVS and then accepts the sample with probability $\min(\pi_v / \pi_u, 1)$, where $u$ is previous sample; otherwise tries again
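
A minimal Python sketch of the Metropolis-Hastings step described on this slide is given below. It follows the textbook rule in which a rejected proposal causes the previous sample to be emitted again; how the slide's "tries again" handles rejections is not spelled out, so that detail is an assumption, as are the function name and the toy degree table. Only ratios of weights are needed, so the target distribution does not have to be normalised.

```python
import random


def mh_weighted_vertex_sampling(nodes, weight, n_samples, rng):
    """Draw n_samples nodes with probability proportional to weight(node).

    Proposals come from uniform vertex sampling (UVS). A proposal is accepted
    with probability min(weight(proposal) / weight(previous), 1). Assumption:
    on rejection the previous sample is emitted again, as in textbook
    Metropolis-Hastings. All weights must be positive.
    """
    current = rng.choice(nodes)
    samples = []
    for _ in range(n_samples):
        proposal = rng.choice(nodes)                          # UVS proposal
        if rng.random() < min(weight(proposal) / weight(current), 1.0):
            current = proposal                                # accept
        samples.append(current)                               # else keep previous
    return samples


# Hypothetical target: sample nodes in proportion to their degree.
degree = {"a": 1, "b": 2, "c": 3, "d": 4}
rng = random.Random(0)
samples = mh_weighted_vertex_sampling(list(degree), degree.get, 100_000, rng)
frequency = {v: samples.count(v) / len(samples) for v in degree}
print(frequency)   # should be close to 0.1, 0.2, 0.3, 0.4
```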

  30. Node pair sampling based on UVS
      All pairs $S$
      - Sampling method: select two different nodes $u$ and $v$ uniformly at random
      - Estimator: given sampled pairs $[u_i, v_i]$, $i = 1, \ldots, n$,
        $\hat{\omega}_k = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(F(u_i, v_i) = a_k)$, $k = 1, \ldots, K$
      - Accuracy (unbiased): $E[\hat{\omega}_k] = \omega_k$, $k = 1, \ldots, K$
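
The all-pairs estimator on this slide is an empirical frequency over uniformly sampled pairs, which can be sketched as follows. The similarity used here is the number of common neighbors (the example given earlier in the talk); the function name, seed, sample size, and toy graph are assumptions for illustration.

```python
import random
from collections import Counter


def estimate_all_pairs(adj, n_samples, rng):
    """Estimate the distribution of the number of common neighbors over all pairs."""
    nodes = sorted(adj)
    counts = Counter()
    for _ in range(n_samples):
        u, v = rng.sample(nodes, 2)          # two different nodes, uniformly
        counts[len(adj[u] & adj[v])] += 1    # one indicator per sampled pair
    return {a: c / n_samples for a, c in sorted(counts.items())}


# Hypothetical toy graph: triangle 0-1-2 with a pendant node 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
estimate = estimate_all_pairs(adj, 50_000, random.Random(1))
print(estimate)   # exact values on this graph are 1/6 for 0 and 5/6 for 1
```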

  31. Node pair sampling based on UVS
      One-hop pairs $S^{(1)}$
      Sampling node pair $[u, v]$:
      1) sample node $u$ according to probability distribution $(\pi_u^{(1)} : u \in V)$, where $\pi_u^{(1)} = d_u / (2|E|)$ and $d_u$ is the degree of node $u$
      2) select neighbor $v$ at random
      Each $[u, v]$ sampled uniformly from $S^{(1)}$

  33. Node pair sampling based on UVS
      One-hop pairs $S^{(1)}$
      - Estimator: given sampled pairs $[u_i, v_i]$, $i = 1, \ldots, n$,
        $\hat{\omega}_k^{(1)} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(F(u_i, v_i) = a_k)$, $k = 1, \ldots, K$
      - Accuracy (unbiased): $E[\hat{\omega}_k^{(1)}] = \omega_k^{(1)}$, $k = 1, \ldots, K$
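
The one-hop procedure (pick a node with probability proportional to its degree, then a uniform neighbor) draws every connected pair with the same probability, because the degree in the first step cancels the one-over-degree in the second. The sketch below assumes the full adjacency is available so the degree-weighted draw can be done directly (the independent WVS case); the function names, seed, and toy graph are assumptions for illustration.

```python
import random
from collections import Counter


def sample_one_hop_pair(adj, rng):
    """Sample a connected pair uniformly from the edges of the graph."""
    nodes = sorted(adj)
    degrees = [len(adj[u]) for u in nodes]
    u = rng.choices(nodes, weights=degrees, k=1)[0]   # P(u) = degree / (2 * #edges)
    v = rng.choice(sorted(adj[u]))                    # uniform neighbor of u
    return u, v


def estimate_one_hop(adj, n_samples, rng):
    """Empirical distribution of the number of common neighbors over edges."""
    counts = Counter()
    for _ in range(n_samples):
        u, v = sample_one_hop_pair(adj, rng)
        counts[len(adj[u] & adj[v])] += 1
    return {a: c / n_samples for a, c in sorted(counts.items())}


# Hypothetical toy graph: triangle 0-1-2 with a pendant node 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
estimate = estimate_one_hop(adj, 50_000, random.Random(2))
print(estimate)   # exact values on this graph are 0.25 for 0 and 0.75 for 1
```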

  34. Node pair sampling based on UVS
      Two-hop pairs $S^{(2)}$
      - sampling node pair $[u, v]$:
        1) sample node $x$
        2) select two neighbors $u$ and $v$ of node $x$ at random
      - produces asymptotically unbiased estimate of $\omega_k^{(2)}$, $k = 1, \ldots, K$
      - tight convergence rate
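
The slide does not say how the middle node is drawn or how the resulting bias is corrected, so the following is only a sketch under a stated assumption: the middle node is chosen with probability proportional to degree times (degree minus one), and then two distinct neighbors are chosen uniformly. Under that assumption the code computes the exact probability of drawing each pair and shows that it is proportional to the pair's number of common neighbors, which suggests re-weighting each sampled pair by the inverse of that number. The function name and toy graph are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def two_hop_pair_probabilities(adj):
    """Exact probability of drawing each pair under the assumed scheme.

    Assumption (not stated in the source): the middle node x is chosen with
    probability proportional to d_x * (d_x - 1), after which an unordered pair
    of distinct neighbors of x is chosen uniformly.
    """
    total = sum(len(nbrs) * (len(nbrs) - 1) for nbrs in adj.values())
    prob = {}
    for x, nbrs in adj.items():
        d = len(nbrs)
        if d < 2:
            continue                                  # x cannot yield a pair
        p_x = Fraction(d * (d - 1), total)
        p_pair_given_x = Fraction(2, d * (d - 1))     # 1 / C(d, 2)
        for pair in combinations(sorted(nbrs), 2):
            prob[pair] = prob.get(pair, 0) + p_x * p_pair_given_x
    return prob


# Hypothetical toy graph: a 4-cycle 0-1-2-3 plus the chord 0-2.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
prob = two_hop_pair_probabilities(adj)
common = {pair: len(adj[pair[0]] & adj[pair[1]]) for pair in prob}
# Dividing by the number of common neighbors equalises the pairs, so a weight
# of 1 / common[pair] would correct the sampling bias under this assumption.
weights = {pair: prob[pair] / common[pair] for pair in prob}
print(prob, weights)
```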

  36. Node pair sampling based on RW
      Why?
      - UVS not available, too costly
        - API not provided
        - user IDs sparsely distributed
      - only crawling techniques can be used
      - random walk: walker moves to random neighbor, samples its information
      - we saw for connected non-bipartite graph: $\pi_v = d_v / (2|E|)$, $v \in V$

  37. Node pair sampling based on RW
      All pairs $S$
      - sample node pair $[u_i, v_i]$ by two independent RWs, where $u_i, v_i$ are nodes sampled by two RWs at step $i$
      - node pair $[u, v]$ sampled according to stationary distribution $\pi_{[u,v]} = d_u d_v / (4|E|^2)$, $u, v \in V$
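
As a check on the stationary distribution of the pair of walks, the product form follows from the independence of the two walks, and it sums to one over ordered pairs of nodes (including pairs where both walks sit on the same node, which two independent walks can produce). Here $d_u$ denotes the degree of node $u$ and $|E|$ the number of edges, and the derivation uses $\sum_{u \in V} d_u = 2|E|$:

```latex
\pi_{[u,v]} = \pi_u \, \pi_v
            = \frac{d_u}{2|E|} \cdot \frac{d_v}{2|E|}
            = \frac{d_u d_v}{4|E|^2},
\qquad
\sum_{u \in V} \sum_{v \in V} \pi_{[u,v]}
  = \frac{\left(\sum_{u \in V} d_u\right)^2}{4|E|^2}
  = \frac{(2|E|)^2}{4|E|^2}
  = 1 .
```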
