
Estimating Clustering Coefficients and Size of Social Networks via Random Walk. Stephen J. Hardiman* (Capital Fund Management, France) and Liran Katzir (Advanced Technology Labs, Microsoft Research, Israel). *Research was conducted while the author was unaffiliated.


  1. Estimating Clustering Coefficients and Size of Social Networks via Random Walk. Stephen J. Hardiman* (Capital Fund Management, France); Liran Katzir (Advanced Technology Labs, Microsoft Research, Israel). *Research was conducted while the author was unaffiliated.

  2. Motivation: Social Networks. Qzone, Habbo, Netlog, Sonico.com, Bebo, Google+, Renren, Twitter, Flixster, Facebook, MyLife, Classmates.com, Tagged, Friendster, hi5, Sina Weibo, Orkut, Plaxo, LinkedIn, Vkontakte.

  3. Motivation: External access. [Figure: the online social network, drawn as a graph on nodes v1 to v9, accessed from outside for social analytics. Labels: Privacy, Disk Space, Communication.]

  4. Task: Estimate parameters. Parameters: Global Clustering Coefficient; Network Average CC; Number of Registered Users. Uses: predicting social products' development/advertisement; business potential, market size.

  5. Global Clustering Coefficient.
     $$\text{Global CC} = \frac{3 \times \text{number of triangles}}{\text{number of connected triplets}}$$
     [Figure: graph on nodes v1 to v9 with one triangle and one connected triplet highlighted.]
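
The definition above can be checked by brute force on a small graph. The sketch below is only an illustration: the seven-node edge list and the helper names `build_adj` and `global_cc` are assumptions made for this example, not the graph or code from the slides.

```python
from itertools import combinations

def build_adj(edges):
    """Build an undirected adjacency map {node: set(neighbors)}."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def global_cc(adj):
    """Return 3 * (#triangles) / (#connected triplets)."""
    # Each triangle is counted once, as a sorted node triple a < b < c.
    triangles = sum(
        1
        for a, b, c in combinations(sorted(adj), 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    )
    # A connected triplet is a center node plus an unordered pair of its neighbors.
    triplets = sum(len(n) * (len(n) - 1) // 2 for n in adj.values())
    return 3 * triangles / triplets if triplets else 0.0

# Hypothetical example graph (not the one drawn on the slide).
edges = [(1, 2), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 7)]
adj = build_adj(edges)
cc = global_cc(adj)  # one triangle, eleven connected triplets
```

Brute-force enumeration is cubic in the number of nodes, which is acceptable only for toy graphs like this one.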

  6. Global Clustering Coefficient Exact: [Alon et al, 1997] Estimation – input is read at least once: • Random Access: [Avron, 2010] • Streaming Model: [Buriol et al, 2006] Estimation – sampling: • Random Access: [Schank et al, 2005] • External Access: This work.

  7. Local Clustering Coefficient.
     $$C_i = \frac{\#\,\text{connections between } v_i\text{'s neighbors}}{d_i (d_i - 1)/2}$$
     where $d_i$ is the degree of node $i$. Example from the figure (nodes v1 to v9): $d_1 = 1$, $d_2 = 3$, $d_9 = 2$, and $C_2 = 1/3$.
     Network Average CC = average local CC.
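
A minimal sketch of the local and network average coefficients follows. The graph is a hypothetical one, chosen so that node 2 has degree 3 and $C_2 = 1/3$ as in the slide's example; treating nodes of degree below 2 as having $C_i = 0$ is a convention assumed here, since the slide does not say how they are handled.

```python
from itertools import combinations

def local_cc(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    nbrs = adj[v]
    d = len(nbrs)
    if d < 2:
        return 0.0  # assumed convention for degree 0 or 1
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return links / (d * (d - 1) / 2)

def average_cc(adj):
    """Network average CC: the mean of the local coefficients."""
    return sum(local_cc(adj, v) for v in adj) / len(adj)

# Hypothetical graph as an adjacency map (not the graph from the slide).
adj = {
    1: {2},
    2: {1, 3, 4},
    3: {2, 4},
    4: {2, 3, 5, 6},
    5: {4, 7},
    6: {4},
    7: {5},
}
c2 = local_cc(adj, 2)
c_avg = average_cc(adj)
```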

  8. Network Average CC Exact: Naïve. Estimation – input is read at least once: • Streaming Model: [Becchetti et al, 2010] Estimation – sampling: • Random Access: [Schank et al, 2005] • External Access: [Ribeiro et al 2010], [Gjoka et al, 2010], This work – Improved accuracy.

  9. Number of Registered Users Exact: trivial Estimation – sampling: • External Access: [Hardiman et al 2009], [Katzir et al, 2011], This work – Improved accuracy.

  10. Random Walk. Sampled nodes: v1, v2, v3, v4, v5. Stationary distribution: $\pi_i = d_i / D$, where $D = \sum_j d_j$. [Figure: graph on nodes v1 to v9 with each node labeled by its stationary probability $d_i/22$.]
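
The claim that a simple random walk visits node $v_i$ with long-run frequency $d_i / D$ can be checked empirically. The sketch below assumes a small connected, non-bipartite graph invented for the example; the function name `random_walk`, the seed, and the step count are all arbitrary choices.

```python
import random
from collections import Counter

def random_walk(adj, start, steps, rng):
    """Simple random walk: move to a uniformly random neighbor at each step."""
    nbr_lists = {v: sorted(n) for v, n in adj.items()}
    node = start
    walk = [node]
    for _ in range(steps):
        node = rng.choice(nbr_lists[node])
        walk.append(node)
    return walk

# Hypothetical graph (connected, contains a triangle so the walk is aperiodic).
adj = {
    1: {2},
    2: {1, 3, 4},
    3: {2, 4},
    4: {2, 3, 5, 6},
    5: {4, 7},
    6: {4},
    7: {5},
}
D = sum(len(n) for n in adj.values())
stationary = {v: len(n) / D for v, n in adj.items()}

walk = random_walk(adj, start=1, steps=300_000, rng=random.Random(0))
counts = Counter(walk)
empirical = {v: counts[v] / len(walk) for v in adj}
```

Comparing `empirical` with `stationary` node by node should show close agreement for a walk of this length.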

  11. Random Walk - Summary. [Figure: graph on nodes v1 to v9 with legend: Sampled Nodes, Visible Nodes, Invisible Nodes, Visible Edges, Invisible Edges.]

  12. Global CC Algorithm. The estimated global clustering coefficient is
      $$\hat{c}_g = \frac{\Phi_g}{\Psi_g}$$
      1. $\Psi_g$ – the average over sampled nodes of (degree $- 1$).
      2. $\Phi_g$ – the average over sampled nodes of $\phi_k d_k$, where $\phi_k = 1$ if there is an edge $v_{k-1} - v_{k+1}$ and $0$ otherwise. Equivalently, $\phi_k = 1$ iff $v_{k-1}, v_k, v_{k+1}$ is a triangle.

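  13. Global CC Example. For the walk v1, v2, v3, v4, v5 in the figure, $\phi_2 = 0$, $\phi_3 = 1$, $\phi_4 = 0$, so
      $$\Phi_g = \tfrac{1}{3}(0 + 2 + 0) = \tfrac{2}{3}, \qquad \Psi_g = \tfrac{1}{5}(0 + 2 + 1 + 3 + 1) = \tfrac{7}{5}, \qquad \hat{c}_g = \tfrac{2}{3} \cdot \tfrac{5}{7} = \tfrac{10}{21} \approx 0.476 .$$
      The true value is $c_g = 9/23 \approx 0.39$.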
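
The example can be reproduced in a few lines. This is a sketch, not the authors' code: the slide's full graph is not recoverable from the transcript, so the adjacency map below is a hypothetical graph chosen only so that the walk 1, 2, 3, 4, 5 has degrees 1, 3, 2, 4, 2 and $\phi_2, \phi_3, \phi_4 = 0, 1, 0$, matching the numbers on the slide.

```python
def estimate_global_cc(walk, adj):
    """Random-walk estimate of the global CC: Phi_g / Psi_g."""
    r = len(walk)
    if r < 3:
        raise ValueError("need at least three consecutive samples")
    deg = [len(adj[v]) for v in walk]
    # Psi_g: average of (degree - 1) over all r sampled nodes.
    psi = sum(d - 1 for d in deg) / r
    # Phi_g: average of phi_k * d_k over the r - 2 interior steps, where
    # phi_k = 1 iff the previous and next nodes of the walk are adjacent.
    phi = sum(
        deg[k] * (walk[k + 1] in adj[walk[k - 1]])
        for k in range(1, r - 1)
    ) / (r - 2)
    return phi / psi

# Hypothetical graph consistent with the degrees and phi values on the slide.
adj = {
    1: {2},
    2: {1, 3, 4},
    3: {2, 4},
    4: {2, 3, 5, 6},
    5: {4, 7},
    6: {4},
    7: {5},
}
walk = [1, 2, 3, 4, 5]
c_hat = estimate_global_cc(walk, adj)  # (2/3) / (7/5) = 10/21
```

Note that the estimator only needs the degree of each visited node and one adjacency test per step, which is what external access to the network provides.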

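  14. Expectation of $\phi_k$. By total expectation over the stationary distribution,
      $$E[\phi_k d_k] = \sum_{i=1}^{n} \frac{d_i}{D}\, E[\phi_k d_k \mid x_k = v_i] = \sum_{i=1}^{n} \frac{d_i}{D} \cdot \frac{2 l_i}{d_i \cdot d_i} \cdot d_i = \frac{1}{D} \sum_{i=1}^{n} 2 l_i .$$
      Of the $d_i \cdot d_i$ combinations of $(x_{k-1}, x_{k+1})$, $2 l_i$ yield $\phi_k = 1$.
      Notation: $d_i$ – the degree of node $v_i$; $D = \sum_{i=1}^{n} d_i$; $l_i$ – the number of triangles containing $v_i$; $n$ – the number of nodes.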

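  15. Global CC Proof.
      $$E[\Phi_g] = E[\phi_k d_k] = \frac{1}{D} \sum_{i=1}^{n} 2 l_i, \qquad E[\Psi_g] = \frac{1}{D} \sum_{i=1}^{n} d_i (d_i - 1).$$
      By concentration bounds on $\Phi_g$ and $\Psi_g$,
      $$\hat{c}_g = \frac{\Phi_g}{\Psi_g} \cong \frac{E[\Phi_g]}{E[\Psi_g]} = \frac{\sum_{i=1}^{n} 2 l_i}{\sum_{i=1}^{n} d_i (d_i - 1)} = c_g .$$
      Notation: $d_i$ – the degree of node $v_i$; $D = \sum_{i=1}^{n} d_i$; $l_i$ – the number of triangles containing $v_i$; $n$ – the number of nodes.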
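
The two expectations can be verified exactly on a small graph by enumerating, for every node, all ordered (previous, next) neighbor pairs. The sketch below uses exact rational arithmetic; the graph and the function names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations, product

def triangles_through(adj, v):
    """l_v: number of triangles that contain node v."""
    return sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])

def expected_ratio(adj):
    """Return E[Phi_g] / E[Psi_g] under the stationary distribution d_i / D."""
    D = sum(len(n) for n in adj.values())
    e_phi = Fraction(0)
    e_psi = Fraction(0)
    for v, nbrs in adj.items():
        d = len(nbrs)
        # Of the d * d ordered (previous, next) pairs, count the adjacent ones.
        hits = sum(1 for a, b in product(nbrs, repeat=2) if b in adj[a])
        assert hits == 2 * triangles_through(adj, v)
        e_phi += Fraction(d, D) * Fraction(hits, d * d) * d
        e_psi += Fraction(d, D) * (d - 1)
    return e_phi / e_psi

# Hypothetical graph: one triangle {2, 3, 4} and eleven connected triplets.
adj = {
    1: {2},
    2: {1, 3, 4},
    3: {2, 4},
    4: {2, 3, 5, 6},
    5: {4, 7},
    6: {4},
    7: {5},
}
ratio = expected_ratio(adj)
```

For this graph the ratio equals 3 × 1 / 11, the global CC computed directly from its definition.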

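  16. Guarantees. For any $\varepsilon \le \tfrac{1}{8}$ and $\delta \le 1$, we have
      $$\Pr\left[(1 - \varepsilon)\, c_g \le \hat{c}_g \le (1 + \varepsilon)\, c_g\right] \ge 1 - \delta$$
      when the number of samples, $r$, satisfies $r \ge r_g = O(\text{mixing time}(\varepsilon))$.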

  17. Network Average CC Algorithm. The estimated network average CC is
      $$\hat{c}_l = \frac{\Phi_l}{\Psi_l}$$
      1. $\Psi_l$ – the average over sampled nodes of $1/\text{degree}$.
      2. $\Phi_l$ – the average over sampled nodes of $\phi_k \cdot \frac{1}{d_k - 1}$, where $\phi_k = 1$ if there is an edge $v_{k-1} - v_{k+1}$ and $0$ otherwise.
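
A sketch of this estimator, under the same assumptions as the global CC example: the graph and the five-step walk are hypothetical, and the normalization ($r - 2$ interior steps for $\Phi_l$, $r$ samples for $\Psi_l$) is assumed to mirror the global CC example. Steps with $\phi_k = 0$ contribute nothing, so a degree-1 node never causes a division by zero.

```python
def estimate_average_cc(walk, adj):
    """Random-walk estimate of the network average CC: Phi_l / Psi_l."""
    r = len(walk)
    if r < 3:
        raise ValueError("need at least three consecutive samples")
    deg = [len(adj[v]) for v in walk]
    # Psi_l: average of 1/degree over all r sampled nodes.
    psi = sum(1 / d for d in deg) / r
    # Phi_l: average of phi_k / (d_k - 1) over the r - 2 interior steps.
    phi = 0.0
    for k in range(1, r - 1):
        if walk[k + 1] in adj[walk[k - 1]]:  # phi_k = 1, which implies d_k >= 2
            phi += 1 / (deg[k] - 1)
    phi /= r - 2
    return phi / psi

# Hypothetical graph and walk (same toy data as in the global CC sketch).
adj = {
    1: {2},
    2: {1, 3, 4},
    3: {2, 4},
    4: {2, 3, 5, 6},
    5: {4, 7},
    6: {4},
    7: {5},
}
walk = [1, 2, 3, 4, 5]
c_l_hat = estimate_average_cc(walk, adj)  # (1/3) / (31/60) = 20/31
```

With only five samples the value is far from the true average of this toy graph; the sketch shows the arithmetic, not the accuracy.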

  18. Evaluations.

      Network        n (size)     D/n      c_l      c_g
      DBLP             977,987    8.457    0.7231   0.1868
      Orkut          3,072,448   76.28     0.1704   0.0413
      Flickr         2,173,370   20.92     0.3616   0.1076
      Live Journal   4,843,953   17.69     0.3508   0.1179

      DBLP facts: the paper with the most co-authors has 119 listed authors; the most prolific author is Vincent Poor, with 798 entries.

  19. Global CC. [Plot, DBLP network: relative estimation value versus percentage of mined nodes (0 to 2); series: Gjoka et al*, Ribeiro et al*, This work.] Relative improvement ranges between 300% and 500% depending on the network.

  20. Network Average CC. [Plot, Orkut network: relative estimation value versus percentage of mined nodes (0 to 2); series: Ribeiro et al, Gjoka et al, Random walk.] Relative improvement ranges between 50% and 400% depending on the network.

  21. Conclusions.
      1. New external access estimator for the Global Clustering Coefficient.
      2. Improved estimator for the Network Average Clustering Coefficient.
      3. Improved estimator for the number of registered users.

  22. Estimating Sizes of Social Networks via Biased Sampling. Oren Somekh, Liran Katzir, Edo Liberty (all Yahoo! Labs, Haifa, Israel).

  23. The Birthday “Paradox”. The expected number of collisions in a list of $r$ i.i.d. uniform samples from a set of $n$ elements is
      $$\frac{r(r-1)}{2n}.$$
      A collision is a pair of identical samples. Example: for samples $X = (d, b, b, a, b, e)$ there are 3 collisions in total: $(x_2, x_3)$, $(x_2, x_5)$, and $(x_3, x_5)$.

  24. Cardinality estimation, uniform sampling. When $C$ collisions are observed,
      $$n \cong \frac{r(r-1)}{2C}.$$
      Needs $r = O(\sqrt{n})$ samples to converge. Used by [Ye et al, 2010] to estimate the size.
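
The collision count and the uniform-sampling estimate can be sketched as follows, using the sample list from the birthday slide. The function names are illustrative, and raising an error when no collision is observed is a design choice made for this example.

```python
from collections import Counter

def count_collisions(samples):
    """Number of unordered pairs of identical samples."""
    return sum(c * (c - 1) // 2 for c in Counter(samples).values())

def estimate_size_uniform(samples):
    """Birthday-paradox estimate n ~ r(r-1) / (2C) for uniform samples."""
    r = len(samples)
    collisions = count_collisions(samples)
    if collisions == 0:
        raise ValueError("no collisions observed; more samples are needed")
    return r * (r - 1) / (2 * collisions)

samples = ["d", "b", "b", "a", "b", "e"]
collisions = count_collisions(samples)   # the three pairs among the three b's
n_hat = estimate_size_uniform(samples)   # 6 * 5 / (2 * 3)
```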

  25. Stationary distribution sampling. Sampled nodes: v5, v2, v5, v4, v2. Stationary distribution: $\pi_i = d_i / D$, where $D = \sum_j d_j$. [Figure: graph on nodes v1 to v9 with each node labeled by its stationary probability $d_i/22$.]

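  26. Cardinality estimation, stationary sampling. When $C$ collisions are observed,
      $$n \cong \frac{\left(\sum_{k=1}^{r} d_{x_k}\right)\left(\sum_{k=1}^{r} \frac{1}{d_{x_k}}\right)}{2C}.$$
      Needs $r = O(\sqrt[4]{n}\,\log n)$ samples to converge when $d_i \sim \mathrm{Zipf}(n, 2)$.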

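  27. Example. For the samples v5, v2, v5, v4, v2 (two collisions, $C = 2$):
      $$\sum_k d_{x_k} = 2 + 3 + 2 + 4 + 3 = 14, \qquad \sum_k \frac{1}{d_{x_k}} = \frac{1}{2} + \frac{1}{3} + \frac{1}{2} + \frac{1}{4} + \frac{1}{3} = \frac{23}{12}, \qquad \hat{n} = \frac{14 \cdot \frac{23}{12}}{2 \cdot 2} \approx 6.7 .$$
      [Figure: graph on nodes v1 to v9 with the sampled nodes marked.]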
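
The same arithmetic in code, as a sketch with exact fractions. The sample list and degrees are taken from the example; the function name and the `degree` lookup table are assumptions for illustration.

```python
from collections import Counter
from fractions import Fraction

def estimate_size_stationary(samples, degree):
    """Estimate n from degree-biased samples: (sum d)(sum 1/d) / (2C)."""
    collisions = sum(c * (c - 1) // 2 for c in Counter(samples).values())
    if collisions == 0:
        raise ValueError("no collisions observed; more samples are needed")
    sum_d = sum(degree[x] for x in samples)
    sum_inv_d = sum(Fraction(1, degree[x]) for x in samples)
    return sum_d * sum_inv_d / (2 * collisions)

# Degrees of the sampled nodes as given in the example.
degree = {"v5": 2, "v2": 3, "v4": 4}
samples = ["v5", "v2", "v5", "v4", "v2"]
n_hat = estimate_size_stationary(samples, degree)  # 14 * (23/12) / 4 = 161/24
```

The exact value 161/24 is about 6.71, in line with the slide's 6.7.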

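  28. Global CC Proof. For a single sample $x$ from the stationary distribution, and for a single pair of independent such samples,
      $$E[d_x] = \sum_{i=1}^{n} d_i \cdot \frac{d_i}{D}, \qquad E\left[\frac{1}{d_x}\right] = \sum_{i=1}^{n} \frac{1}{d_i} \cdot \frac{d_i}{D} = \frac{n}{D}, \qquad E[\text{collision}] = \sum_{i=1}^{n} \frac{d_i}{D} \cdot \frac{d_i}{D}.$$
      With $r$ samples there are $r(r-1)/2$ pairs, so $E[C] = \frac{r(r-1)}{2} \sum_{i=1}^{n} \frac{d_i^2}{D^2}$, and by concentration bounds
      $$\hat{n} = \frac{\left(\sum_k d_{x_k}\right)\left(\sum_k \frac{1}{d_{x_k}}\right)}{2C} \cong \frac{r\,E[d_x] \cdot r\,E[1/d_x]}{2\,E[C]} = \frac{r}{r-1}\, n \approx n .$$
      Notation: $d_i$ – the degree of node $v_i$; $D = \sum_{i=1}^{n} d_i$; $n$ – the number of nodes.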

  29. Improvements 1. Using all samples (Hardiman et al 2009). 2. Using Conditional Monte Carlo (This work).

  30. All Samples. Restrict computation to indexes at least $m$ steps apart,
      $$I = \{(k, l) : |k - l| \ge m\}.$$
      A collision is only considered within $I$:
      $$\Phi = \sum_{(k,l) \in I} \mathbf{1}\{x_k = x_l\}.$$
      The ratio of degrees is similarly defined:
      $$\Psi = \sum_{(k,l) \in I} \frac{d_{x_k}}{d_{x_l}}.$$
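
A sketch of the two restricted sums follows. It assumes ordered index pairs and $m \ge 1$, and it assumes the size estimate is then formed as $\Psi / \Phi$; the slide defines only $\Phi$ and $\Psi$, so the ratio is an inference from the expectations on the proof slide rather than something stated here. The data reuse the five samples from the earlier example.

```python
from fractions import Fraction

def restricted_sums(samples, degree, m):
    """Return (Phi, Psi) over ordered index pairs (k, l) with |k - l| >= m."""
    if m < 1:
        raise ValueError("m must be at least 1 so that k != l")
    r = len(samples)
    phi = 0
    psi = Fraction(0)
    for k in range(r):
        for l in range(r):
            if abs(k - l) >= m:
                phi += samples[k] == samples[l]   # collision indicator
                psi += Fraction(degree[samples[k]], degree[samples[l]])
    return phi, psi

degree = {"v5": 2, "v2": 3, "v4": 4}
samples = ["v5", "v2", "v5", "v4", "v2"]
phi, psi = restricted_sums(samples, degree, m=1)
n_hat = psi / phi  # assumed form of the estimate, see the lead-in
```

The double loop is quadratic in the number of samples; it is written for clarity rather than speed.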

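  31. Conditional Monte Carlo. A collision between $x_k$ and $x_l$ is replaced by the conditional probability of a collision in steps $k+1$ and $l+1$, respectively:
      $$E\left[\mathbf{1}\{x_{k+1} = x_{l+1}\} \mid x_k, x_l\right] = \frac{\#\,\text{common neighbors of } x_k \text{ and } x_l}{d_{x_k}\, d_{x_l}}.$$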

  32. Conditional Monte Carlo. The pair $(v_4, v_7)$ is not a collision, but it contributes $\tfrac{1}{12}$ to the collision counter. [Figure: graph on nodes v1 to v9.]
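
The contribution of a pair can be computed directly from the two neighbor sets. In the sketch below the adjacency data are hypothetical: they are chosen only so that node 4 has degree 4, node 7 has degree 3, and the two share exactly one neighbor, which reproduces the 1/12 from the slide.

```python
from fractions import Fraction

def conditional_collision(adj, u, v):
    """P(next step from u equals next step from v) for independent uniform moves."""
    common = len(adj[u] & adj[v])
    return Fraction(common, len(adj[u]) * len(adj[v]))

# Hypothetical neighbor sets (only nodes 4 and 7 are needed for the example).
adj = {
    4: {1, 2, 6, 8},
    7: {5, 8, 9},
}
contribution = conditional_collision(adj, 4, 7)  # one common neighbor, 4 * 3 pairs
```

Replacing the 0/1 collision indicator with this conditional expectation keeps the same mean while letting non-colliding pairs contribute information.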

  33. Size Estimation. [Plot, DBLP network: relative estimation value versus percentage of mined nodes (0.5 to 2.5); series: Prior art, This work.]

  34. Thanks
