

slide-1
SLIDE 1

Estimating Clustering Coefficients and Size of Social Networks via Random Walk

Stephen J. Hardiman*

Capital Fund Management France

Liran Katzir

Advanced Technology Labs Microsoft Research, Israel

*Research was conducted while the author was unaffiliated

slide-2
SLIDE 2

Motivation: Social Networks

Facebook

Twitter

Qzone

Google+

Sina Weibo

Habbo

Renren LinkedIn Vkontakte

Bebo

Tagged Orkut

Netlog

Friendster hi5

Flixster

MyLife

Classmates.com Sonico.com Plaxo

slide-3
SLIDE 3

Motivation: External access

[Figure: the online social network (nodes v1 to v9) accessed from outside by Social Analytics. Constraints on external access: disk space, communication, privacy.]

slide-4
SLIDE 4

Task: Estimate parameters

Uses:

  • Business development / advertisement / market size.
  • Predicting social products' potential.

Parameters:

  • Global Clustering Coefficient
  • Network Average CC
  • Number of Registered Users

slide-5
SLIDE 5

$$\text{Global CC} = \frac{3 \times \text{number of triangles}}{\text{number of connected triplets}}$$

Global Clustering Coefficient

[Figure: example graph on nodes v1 to v9, with a triangle and a connected triplet highlighted.]
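The definition above can be checked on a small graph. The following is a minimal sketch, assuming the graph is given as a dictionary mapping each node to the set of its neighbors; the function name `global_cc` and the toy graph are illustrative and are not the graph from the slide.

```python
from itertools import combinations

def global_cc(adj):
    """Global CC = 3 * triangles / connected triplets (undirected, no self-loops)."""
    # Every unordered pair of neighbors of a node is one connected triplet
    # centered at that node.
    triplets = sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())
    # A triplet is closed when its two end nodes are adjacent. Each triangle is
    # counted once per center, so this sum already equals 3 * triangles.
    closed = sum(
        1
        for nbrs in adj.values()
        for u, w in combinations(nbrs, 2)
        if w in adj[u]
    )
    return closed / triplets if triplets else 0.0

# Hypothetical toy graph: triangle a-b-c plus a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(global_cc(toy))  # 3 * 1 triangle / 5 triplets = 0.6
```

This exact computation reads the whole graph, which is what the sampling estimators in the rest of the talk avoid.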

slide-6
SLIDE 6

Global Clustering Coefficient

Exact: [Alon et al, 1997]

Estimation – input is read at least once:

  • Random Access: [Avron, 2010]
  • Streaming Model: [Buriol et al, 2006]

Estimation – sampling:

  • Random Access: [Schank et al, 2005]
  • External Access: This work.
slide-7
SLIDE 7

$$C_i = \frac{\#\text{connections between } v_i\text{'s neighbors}}{d_i (d_i - 1)/2}$$

Local Clustering Coefficient

[Figure: example graph on nodes v1 to v9.]

$d_i$ – degree of node $i$. In the example: $d_1 = 1$, $d_9 = 2$, $d_2 = 3$, $C_2 = 1/3$.

Network Average CC = average local CC
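The two definitions on this slide can be sketched as follows. This is a minimal sketch, assuming an adjacency-set dictionary as input and assuming that nodes of degree below 2 contribute a local coefficient of 0 (the slide does not state a convention). The toy graph is hypothetical, not the slide's graph.

```python
from itertools import combinations

def local_cc(adj, v):
    """C_v = (#connections between v's neighbors) / (d_v * (d_v - 1) / 2)."""
    d = len(adj[v])
    if d < 2:
        return 0.0  # assumed convention: no neighbor pairs, so coefficient 0
    links = sum(1 for u, w in combinations(adj[v], 2) if w in adj[u])
    return links / (d * (d - 1) / 2)

def network_average_cc(adj):
    """Average of the local clustering coefficients over all nodes."""
    return sum(local_cc(adj, v) for v in adj) / len(adj)

# Hypothetical toy graph: triangle a-b-c plus a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(local_cc(toy, "c"))       # degree 3, one link among its neighbors: 1/3
print(network_average_cc(toy))  # (1 + 1 + 1/3 + 0) / 4 = 7/12
```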

slide-8
SLIDE 8

Network Average CC

Exact: Naïve.

Estimation – input is read at least once:

  • Streaming Model: [Becchetti et al, 2010]

Estimation – sampling:

  • Random Access: [Schank et al, 2005]
  • External Access: [Ribeiro et al, 2010], [Gjoka et al, 2010], This work – Improved accuracy.

slide-9
SLIDE 9

Number of Registered Users

Exact: trivial

Estimation – sampling:

  • External Access: [Hardiman et al, 2009], [Katzir et al, 2011], This work – Improved accuracy.

slide-10
SLIDE 10

Random Walk

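[Figure: random walk on the example graph (nodes v1 to v9). Sampled nodes: v1, v2, v3, v4, v5.]

Stationary distribution: $\pi_i = \frac{d_i}{\sum_j d_j}$

Stationary probabilities shown on the nodes: 1/22, 3/22, 2/22, 2/22, 3/22, 2/22, 3/22, 4/22, 2/22.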
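The degree-proportional stationary distribution can be observed empirically. The following is a minimal sketch, assuming a simple random walk that moves to a uniformly chosen neighbor at each step; the function names, the seed, and the toy graph are illustrative choices, not taken from the slides.

```python
import random
from collections import Counter

def stationary_distribution(adj):
    """pi_v = d_v / D, where D is the sum of all degrees."""
    total_degree = sum(len(nbrs) for nbrs in adj.values())
    return {v: len(nbrs) / total_degree for v, nbrs in adj.items()}

def random_walk(adj, start, steps, rng):
    """Simple random walk: from each node, move to a uniformly random neighbor."""
    walk = [start]
    for _ in range(steps):
        walk.append(rng.choice(sorted(adj[walk[-1]])))
    return walk

# Hypothetical toy graph: triangle a-b-c plus a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
pi = stationary_distribution(toy)
walk = random_walk(toy, "a", 100_000, random.Random(0))
visits = Counter(walk)
for v in sorted(toy):
    # Empirical visit frequency next to the theoretical d_v / D.
    print(v, round(visits[v] / len(walk), 3), pi[v])
```

Only the neighbor lists of visited nodes are needed, which matches the external-access setting of the talk.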

slide-11
SLIDE 11

Random Walk - Summary

[Figure: example graph on nodes v1 to v9 after the random walk. Legend: visible nodes, invisible nodes, sampled nodes, visible edges, invisible edges.]

slide-12
SLIDE 12

Global CC Algorithm

  • 1. $\Psi_g$ – the sampled nodes' average of (degree − 1).

$\phi_k = 1$ if there is an edge $v_{k-1} - v_{k+1}$, 0 otherwise.

  • 2. $\Phi_g$ – the sampled nodes' average of $\phi_k d_k$.

The estimated global clustering coefficient: $\hat{c}_g = \frac{\Phi_g}{\Psi_g}$

$\phi_k = 1$ iff $v_{k-1}, v_k, v_{k+1}$ is a triangle.

slide-13
SLIDE 13

Global CC Example

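[Figure: example graph showing nodes v1 to v7 and the sampled walk.]

$\phi_2 = 0$, $\phi_3 = 1$, $\phi_4 = 0$

$$\Phi_g = \frac{1}{3}(0 + 2 + 0) = \frac{2}{3}, \qquad \Psi_g = \frac{1}{5}(0 + 2 + 1 + 3 + 1) = \frac{7}{5}$$

Estimate: $\hat{c}_g = \frac{2}{3} \cdot \frac{5}{7} = \frac{10}{21} \approx 0.476$

True value: $c_g = \frac{9}{23} \approx 0.39$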
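The arithmetic of this example can be reproduced with exact fractions. In this minimal sketch the degrees of the five sampled nodes are read off the terms of the $\Psi_g$ sum on the slide (each term is degree − 1), and the indicator values are the ones given on the slide; the variable names are illustrative.

```python
from fractions import Fraction

# Degrees of the sampled nodes v1..v5, recovered from the Psi_g terms 0, 2, 1, 3, 1.
degrees = [1, 3, 2, 4, 2]
# Indicator values for the interior steps k = 2, 3, 4, as given on the slide.
phi = {2: 0, 3: 1, 4: 0}

# Phi_g: average of phi_k * d_k over the three interior steps.
phi_g = Fraction(sum(phi[k] * degrees[k - 1] for k in phi), len(phi))
# Psi_g: average of (degree - 1) over all five samples.
psi_g = Fraction(sum(d - 1 for d in degrees), len(degrees))
estimate = phi_g / psi_g
print(phi_g, psi_g, estimate)  # 2/3 7/5 10/21
```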

slide-14
SLIDE 14

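Expectation of $\phi_k$

$$E[\phi_k d_k] = \sum_{i=1}^{n} \frac{d_i}{D}\, E[\phi_k d_k \mid x_k = v_i] = \sum_{i=1}^{n} \frac{d_i}{D} \cdot \frac{2 l_i}{d_i d_i} \cdot d_i = \sum_{i=1}^{n} \frac{2 l_i}{D}$$

The first equality is the law of total expectation. For the second, given $x_k = v_i$ there are $d_i d_i$ combinations of previous and next node, of which $2 l_i$ yield $\phi_k = 1$.

$l_i$ – the number of triangles containing $v_i$. $d_i$ – the degree of node $v_i$. $n$ – the number of nodes. $D = \sum_{i=1}^{n} d_i$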
slide-15
SLIDE 15

Global CC Proof

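$D = \sum_{i=1}^{n} d_i$

$l_i$ – the number of triangles containing $v_i$. $d_i$ – the degree of node $v_i$. $n$ – the number of nodes.

$$E[\Phi_g] = E[\phi_k d_k] = \frac{2}{D} \sum_{i=1}^{n} l_i$$

$$E[\Psi_g] = \frac{1}{D} \sum_{i=1}^{n} d_i (d_i - 1)$$

$$\hat{c}_g = \frac{\Phi_g}{\Psi_g} \cong \frac{E[\Phi_g]}{E[\Psi_g]} = \frac{2 \sum_{i=1}^{n} l_i}{\sum_{i=1}^{n} d_i (d_i - 1)} = c_g$$

The approximation step uses concentration bounds for both $\Phi_g$ and $\Psi_g$.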

slide-16
SLIDE 16

Guarantees

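For any $\epsilon \le \frac{1}{8}$ and $\delta \le 1$, we have

$$\mathrm{Prob}\left[(1 - \epsilon)\, c_g \le \hat{c}_g \le (1 + \epsilon)\, c_g\right] \ge 1 - \delta$$

when the number of samples, $r$, satisfies $r \ge r_g = O(\text{mixing time}(\epsilon))$.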

slide-17
SLIDE 17

Network Average CC Algorithm

  • 1. $\Psi_l$ – the sampled nodes' average of 1/degree.

$\phi_k = 1$ if there is an edge $v_{k-1} - v_{k+1}$, 0 otherwise.

  • 2. $\Phi_l$ – the sampled nodes' average of $\phi_k \frac{1}{d_k - 1}$.

The estimated network average CC: $\hat{c}_l = \frac{\Phi_l}{\Psi_l}$
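The network average estimator follows the same pattern as the global one, with different weights. This is a minimal sketch under two assumptions not stated on the slide: the second average runs over the r − 2 interior steps, and steps at degree-1 nodes contribute 0 (there the walk must backtrack, so the indicator is 0 anyway). The function name and toy graph are hypothetical.

```python
def estimate_network_average_cc(adj, walk):
    """Ratio Phi_l / Psi_l computed from a random-walk node sequence."""
    r = len(walk)
    # Psi_l: average of 1/degree over all sampled nodes.
    psi = sum(1 / len(adj[x]) for x in walk) / r
    total = 0.0
    for k in range(1, r - 1):
        d = len(adj[walk[k]])
        # phi_k = 1 when the previous and next nodes are adjacent.
        # Degree-1 nodes are skipped to avoid dividing by zero.
        if d > 1 and walk[k + 1] in adj[walk[k - 1]]:
            total += 1 / (d - 1)
    # Phi_l: average of phi_k / (d_k - 1) over the interior steps.
    phi = total / (r - 2)
    return phi / psi

# Hypothetical toy graph: triangle a-b-c plus a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(estimate_network_average_cc(toy, ["a", "b", "c", "d", "c"]))  # (1/3) / (8/15) = 0.625
```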

slide-18
SLIDE 18

Evaluations

Network        n (size)    D/n     cl       cg
DBLP           977,987     8.457   0.7231   0.1868
Orkut          3,072,448   76.28   0.1704   0.0413
Flickr         2,173,370   20.92   0.3616   0.1076
Live Journal   4,843,953   17.69   0.3508   0.1179

DBLP facts:

  • Paper with most co-authors: 119 listed authors.
  • Most prolific author: Vincent Poor, with 798 entries.

slide-19
SLIDE 19

Global CC

Relative improvement ranges between 300% and 500%, depending on the network.

[Figure: DBLP network. Relative estimation value vs. percentage of mined nodes, comparing Gjoka et al*, Ribeiro et al*, and this work.]

slide-20
SLIDE 20

Network Average CC

Relative improvement ranges between 50% and 400%, depending on the network.

[Figure: Orkut network. Relative estimation value vs. percentage of mined nodes, comparing Ribeiro et al, Gjoka et al, and the random walk.]

slide-21
SLIDE 21

Conclusions

  • 1. New external access estimator for the Global Clustering Coefficient.
  • 2. Improved estimator for the Network Average Clustering Coefficient.
  • 3. Improved estimator for the number of registered users.

slide-22
SLIDE 22

Estimating Sizes of Social Networks via Biased Sampling

Liran Katzir

Yahoo! Labs, Haifa, Israel

Edo Liberty

Yahoo! Labs, Haifa, Israel

Oren Somekh

Yahoo! Labs, Haifa, Israel

slide-23
SLIDE 23

The Birthday "Paradox"

The expected number of collisions in a list of $r$ i.i.d. samples from a set of $n$ elements is $\frac{r(r-1)}{2n}$.

A collision is a pair of identical samples.

Example: Samples: X = (d, b, b, a, b, e). Total 3 collisions: (x2, x3), (x2, x5), and (x3, x5).
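The collision count in this example can be reproduced directly. The following is a minimal sketch; the function names are illustrative, and the expectation formula assumes the samples are drawn uniformly from the n elements.

```python
from collections import Counter

def count_collisions(samples):
    """Number of unordered pairs of identical samples."""
    # A value that appears c times contributes c * (c - 1) / 2 pairs.
    return sum(c * (c - 1) // 2 for c in Counter(samples).values())

def expected_collisions(r, n):
    """Expected collisions among r i.i.d. uniform samples from n elements."""
    return r * (r - 1) / (2 * n)

X = ["d", "b", "b", "a", "b", "e"]
print(count_collisions(X))          # 3: the value b appears three times
print(expected_collisions(23, 365))
```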

slide-24
SLIDE 24

Cardinality estimation uniform

Needs $r = O(\sqrt{n})$ samples to converge.

Used by [Ye et al, 2010] to estimate the size.

When $C$ collisions are observed: $n \cong \frac{r(r-1)}{2C}$
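The uniform-sampling estimator inverts the expected collision count. This is a minimal sketch, assuming uniform i.i.d. samples; the function name, the seed, and the simulated population size are illustrative. Returning `None` when no collision is observed is a choice made here, since the ratio is undefined in that case.

```python
import random
from collections import Counter

def estimate_size_uniform(samples):
    """Estimate n as r * (r - 1) / (2 * C) from uniform samples."""
    r = len(samples)
    collisions = sum(c * (c - 1) // 2 for c in Counter(samples).values())
    if collisions == 0:
        return None  # no collisions observed: the estimate is undefined
    return r * (r - 1) / (2 * collisions)

print(estimate_size_uniform(list("dbbabe")))  # 6 * 5 / (2 * 3) = 5.0

# Simulated check with a hypothetical population of 10,000 elements.
rng = random.Random(1)
samples = [rng.randrange(10_000) for _ in range(1_000)]
print(estimate_size_uniform(samples))
```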

slide-25
SLIDE 25

Stationary distribution sampling

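[Figure: sampling from the stationary distribution on the example graph (nodes v1 to v9). Sampled nodes: v5, v2, v5, v4, v2.]

Stationary distribution: $\pi_i = \frac{d_i}{\sum_j d_j}$

Stationary probabilities shown on the nodes: 1/22, 3/22, 2/22, 2/22, 3/22, 2/22, 3/22, 4/22, 2/22.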

slide-26
SLIDE 26

Cardinality estimation stationary

Needs $r = O(\sqrt[4]{n} \log n)$ samples to converge when $d_i \sim \mathrm{Zipf}(n, 2)$.

When $C$ collisions are observed:

$$n \cong \frac{\left(\sum d_x\right)\left(\sum \frac{1}{d_x}\right)}{2C}$$

slide-27
SLIDE 27

Example:

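[Figure: example graph on nodes v1 to v9. Sampled nodes: v5, v2, v5, v4, v2.]

$$\sum d_x = 2 + 3 + 2 + 4 + 3 = 14, \qquad \sum \frac{1}{d_x} = \frac{1}{2} + \frac{1}{3} + \frac{1}{2} + \frac{1}{4} + \frac{1}{3} = \frac{23}{12}$$

$$\hat{n} = \frac{14 \cdot \frac{23}{12}}{2 \cdot 2} \approx 6.7$$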
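The numbers in this example can be reproduced with exact fractions. This is a minimal sketch: the sample list and the degrees of v5, v2, and v4 are the ones used in the sums on the slide, while the function name and the handling of the zero-collision case are choices made here.

```python
from collections import Counter
from fractions import Fraction

def estimate_size_stationary(samples, degree):
    """Estimate n as (sum of d_x) * (sum of 1/d_x) / (2 * C)."""
    sum_d = sum(degree[x] for x in samples)
    sum_inv_d = sum(Fraction(1, degree[x]) for x in samples)
    collisions = sum(c * (c - 1) // 2 for c in Counter(samples).values())
    if collisions == 0:
        return None  # no collisions observed: the estimate is undefined
    return sum_d * sum_inv_d / (2 * collisions)

degree = {"v5": 2, "v2": 3, "v4": 4}
samples = ["v5", "v2", "v5", "v4", "v2"]  # two collisions: (v5, v5) and (v2, v2)
n_hat = estimate_size_stationary(samples, degree)
print(n_hat, float(n_hat))  # 161/24, about 6.71
```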

slide-28
SLIDE 28

Global CC Proof

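$D = \sum_{i=1}^{n} d_i$

$d_i$ – the degree of node $v_i$. $n$ – the number of nodes.

$$E[d_x] = \sum_{i=1}^{n} \frac{d_i}{D}\, d_i$$

$$E\left[\frac{1}{d_x}\right] = \sum_{i=1}^{n} \frac{d_i}{D} \cdot \frac{1}{d_i} = \frac{n}{D}$$

$$E[C] = \sum_{i=1}^{n} \frac{d_i}{D} \cdot \frac{d_i}{D}$$

$$\hat{n} = \frac{\left(\sum d_x\right)\left(\sum \frac{1}{d_x}\right)}{2C} \cong \frac{E[d_x]\, E\left[\frac{1}{d_x}\right]}{2 E[C]} \cong \frac{\sum_{i=1}^{n} \frac{d_i}{D} d_i \cdot \frac{n}{D}}{\sum_{i=1}^{n} \frac{d_i}{D} \cdot \frac{d_i}{D}} = n$$

The approximation uses concentration bounds for the numerator and for the collision count. Factors that depend only on the number of samples $r$ are omitted.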
slide-29
SLIDE 29

Improvements

  • 1. Using all samples (Hardiman et al 2009).
  • 2. Using Conditional Monte Carlo (This work).
slide-30
SLIDE 30

All Samples

Restrict computation to indexes at least $m$ steps apart: $I = \{(k, l) : |k - l| \ge m\}$

A collision is only considered within $I$: $\Phi = |\{(k, l) \in I : x_k = x_l\}|$

The ratio of degrees is similarly defined: $\Psi = \sum_{(k, l) \in I} \frac{d_{x_k}}{d_{x_l}}$
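The two statistics restricted to the index set can be sketched as follows. This is a minimal sketch that assumes ordered index pairs with an absolute gap of at least m, which is one reading of the slide; the function name is hypothetical, and the degrees are those of the earlier five-sample example. Taking the ratio of the degree statistic to the collision statistic as the size estimate is an assumption made here by analogy with the stationary estimator.

```python
from fractions import Fraction

def all_samples_statistics(samples, degree, m):
    """Return (Phi, Psi) over ordered index pairs (k, l) with |k - l| >= m."""
    r = len(samples)
    phi = 0            # number of collisions x_k == x_l inside the index set
    psi = Fraction(0)  # sum of degree ratios d_{x_k} / d_{x_l} inside the index set
    for k in range(r):
        for l in range(r):
            if abs(k - l) >= m:
                phi += samples[k] == samples[l]
                psi += Fraction(degree[samples[k]], degree[samples[l]])
    return phi, psi

degree = {"v5": 2, "v2": 3, "v4": 4}
samples = ["v5", "v2", "v5", "v4", "v2"]
phi, psi = all_samples_statistics(samples, degree, m=2)
print(phi, psi)   # 4 155/12
print(psi / phi)  # assumed size estimate Psi / Phi
```

The double loop is quadratic in the number of samples, which is acceptable for a sketch but not for long walks.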

slide-31
SLIDE 31

Conditional Monte Carlo

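A collision between $x_k$ and $x_l$ is replaced by the conditional probability of a collision at steps $k+1$ and $l+1$, respectively:

$$E\left[\mathbf{1}_{x_{k+1} = x_{l+1}} \mid x_k, x_l\right] = \frac{\text{Common Neighbors}}{d_{x_k}\, d_{x_l}}$$

Here Common Neighbors is the number of nodes adjacent to both $x_k$ and $x_l$.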

slide-32
SLIDE 32

Conditional Monte Carlo

  • The pair $v_4, v_7$ is not a collision, but it contributes $\frac{1}{12}$ to the collision counter.

[Figure: example graph on nodes v1 to v9.]
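The conditional contribution of a pair of samples can be sketched as follows. This is a minimal sketch, assuming neighbor sets are available for both sampled nodes; the function name and the toy graph are hypothetical. The toy graph is not the slide's graph, but it is built so that two nodes of degree 4 and 3 share exactly one neighbor, which gives the same 1/12 contribution.

```python
from fractions import Fraction

def conditional_collision(adj, u, v):
    """Probability that independent next steps from u and v land on the same node."""
    common = len(adj[u] & adj[v])  # nodes adjacent to both u and v
    return Fraction(common, len(adj[u]) * len(adj[v]))

# Hypothetical graph: p has degree 4, q has degree 3, and s is their only common neighbor.
toy = {
    "p": {"s", "a", "b", "c"},
    "q": {"s", "e", "f"},
    "s": {"p", "q"},
    "a": {"p"}, "b": {"p"}, "c": {"p"},
    "e": {"q"}, "f": {"q"},
}
print(conditional_collision(toy, "p", "q"))  # 1/12
print(conditional_collision(toy, "p", "p"))  # 1/4: same node, next steps agree with probability 1/d
```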

slide-33
SLIDE 33

Size Estimation

[Figure: DBLP network. Relative estimation value vs. percentage of mined nodes, comparing prior art and this work.]

slide-34
SLIDE 34

Thanks